RSS subscription: text in <p> tags is not displayed on its own line #528

Open
ifwlzs opened this issue Apr 24, 2024 · 2 comments

ifwlzs commented Apr 24, 2024

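Environment

  • nonebot-bison version: 0.9.2
  • nonebot version: 2.2.1
  • Installation method: 1 (one of the following, or another method)
    1. Installed via nb-cli
    2. Installed with a modern package manager such as poetry/pdm
    3. Installed via pip install
    4. Cloned or downloaded the project and used it directly
  • Operating system: windows 2009 (19045.4710)

Problem

In RSS subscriptions, text inside <p> tags is not placed on its own line.

Logs

Please paste your logs here
  • [ √ ] I searched the issues and did not find a problem similar to mine
  • [ √ ] I confirm that sensitive information has been removed from the logs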
@suyiiyii

The cause is at line 68 here: when the bs (BeautifulSoup) library extracts the text from the HTML, formatting information such as <p> tags is lost.

async def parse(self, raw_post: RawPost) -> Post:
    title = raw_post.get("title", "")
    soup = bs(raw_post.description, "html.parser")
    desc = soup.text.strip()
    title, desc = self._text_process(title, desc)

In [23]: doc = """
    ...: terterthv<p>cxiobjhoijeraoi</p>jgiojoidfgjk<p>ldfjgioj</p>bvcxninclin
    ...: """

In [24]: soup = bs(doc,"html.parser")

In [25]: soup.get_text()
Out[25]: '\nterterthvcxiobjhoijeraoijgiojoidfgjkldfjgiojbvcxninclin\n'
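How bs decides where to break lines when extracting text

It seems to depend only on the line breaks present in the HTML source: get_text() concatenates the text nodes, so a newline shows up in the output only where the source itself contains one, and tags such as <p> add none.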

From https://www.crummy.com/software/BeautifulSoup/bs4/doc/

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...
In [14]: html_doc = """<html><head><title>The Dormouse's story</title></head><body><p class="title"><b>The Dormouse's story</b></p><
    ...: p class="story">Once upon a time there were three little sisters; and their names were<a href="http://example.com/elsie" cl
    ...: ass="sister" id="link1">Elsie</a>,<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http:/
    ...: /example.com/tillie" class="sister" id="link3">Tillie</a>;and they lived at the bottom of a well.</p><p class="story">...</
    ...: p>"""

In [15]: soup = BeautifulSoup(html_doc, 'html.parser');print(soup.get_text())
The Dormouse's storyThe Dormouse's storyOnce upon a time there were three little sisters; and their names wereElsie,Lacie andTillie;and they lived at the bottom of a well....
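I can think of two solutions

Manually preprocess the HTML

After fetching the description, preprocess it by hand first: for example, replace <p> with <br>, then replace <br> with \n.
Then pass the preprocessed HTML to bs to obtain text that keeps its formatting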
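
A minimal sketch of the manual preprocessing idea, assuming bs4 is installed. The helper name html_to_text and the regular expressions are illustrative and are not taken from nonebot-bison; the blank-line cleanup at the end is an extra step added here so that adjacent tags do not leave empty lines.

```python
import re

from bs4 import BeautifulSoup


def html_to_text(html: str) -> str:
    """Hypothetical helper: keep paragraph breaks when stripping HTML."""
    # Step 1: turn opening and closing <p> tags (with or without
    # attributes) into <br>, as suggested above.
    html = re.sub(r"</?p\b[^>]*>", "<br>", html, flags=re.IGNORECASE)
    # Step 2: turn every <br>, <br/> or <br /> into a newline character.
    html = re.sub(r"<br\s*/?>", "\n", html, flags=re.IGNORECASE)
    # Step 3: let bs strip the remaining tags; the newlines inserted
    # above are plain text now, so get_text() keeps them.
    text = BeautifulSoup(html, "html.parser").get_text()
    # Extra cleanup (an assumption, not part of the original proposal):
    # trim each line and drop the empty ones.
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


doc = "terterthv<p>cxiobjhoijeraoi</p>jgiojoidfgjk<p>ldfjgioj</p>bvcxninclin"
print(html_to_text(doc))
```

With the sample string from the earlier session, each of the five text fragments ends up on its own line. Regex-based tag replacement is fragile on unusual markup, so this is only meant to show the shape of the idea.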

html2text

This library converts HTML to Markdown

In [27]: html_doc = """<html><head><title>The Dormouse's story</title></head><body><p class="title"><b>The Dormouse's story</b></p><
    ...: p class="story">Once upon a time there were three little sisters; and their names were<a href="http://example.com/elsie" cl
    ...: ass="sister" id="link1">Elsie</a>,<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http:/
    ...: /example.com/tillie" class="sister" id="link3">Tillie</a>;and they lived at the bottom of a well.</p><p class="story">...</
    ...: p>"""

In [28]: h = html2text.HTML2Text()

In [29]: h.ignore_links = True

In [30]: print(h.handle(html_doc))
**The Dormouse's story**

Once upon a time there were three little sisters; and their names
wereElsie,Lacie andTillie;and they lived at the bottom of a well.

...

In [31]: html_doc = """<html><body><p>cxiobjhoijeraoi</p>jgiojoidfgjk<p>ldfjgioj</p>bvcxninclin</body></html>"""

In [32]: h = html2text.HTML2Text()

In [33]: h.ignore_links = True

In [34]: print(h.handle(html_doc))
cxiobjhoijeraoi

jgiojoidfgjk

ldfjgioj

bvcxninclin

After this processing we get fairly clean plain text

@AzideCupric

@felinae98
Collaborator

I recall that weibo, or somewhere else, also has similar (hand-rolled) code for handling HTML text. Should we unify the handling?
