RSS subscription: text in <p> tags is not displayed on separate lines #528
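The cause is at line 68 here: the paragraph breaks are lost when the bs (BeautifulSoup) library is used to extract the text from the HTML.
nonebot-bison/nonebot_bison/platform/rss.py Lines 65 to 69 in 1c753f7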
In [23]: doc = """
...: terterthv<p>cxiobjhoijeraoi</p>jgiojoidfgjk<p>ldfjgioj</p>bvcxninclin
...: """
In [24]: soup = bs(doc,"html.parser")
In [25]: soup.get_text()
Out[25]: '\nterterthvcxiobjhoijeraoijgiojoidfgjkldfjgiojbvcxninclin\n'

bs seems to decide where to break lines in the extracted text based on the line breaks in the HTML source itself.

From https://www.crummy.com/software/BeautifulSoup/bs4/doc/

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...

In [14]: html_doc = """<html><head><title>The Dormouse's story</title></head><body><p class="title"><b>The Dormouse's story</b></p><
...: p class="story">Once upon a time there were three little sisters; and their names were<a href="http://example.com/elsie" cl
...: ass="sister" id="link1">Elsie</a>,<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http:/
...: /example.com/tillie" class="sister" id="link3">Tillie</a>;and they lived at the bottom of a well.</p><p class="story">...</
...: p>"""
In [15]: soup = BeautifulSoup(html_doc, 'html.parser');print(soup.get_text())
The Dormouse's storyThe Dormouse's storyOnce upon a time there were three little sisters; and their names wereElsie,Lacie andTillie;and they lived at the bottom of a well....

Manually preprocessing the HTML
After fetching the description, preprocess it manually first, for example by converting [...]

html2text
This library can convert HTML to Markdown.

In [27]: html_doc = """<html><head><title>The Dormouse's story</title></head><body><p class="title"><b>The Dormouse's story</b></p><
...: p class="story">Once upon a time there were three little sisters; and their names were<a href="http://example.com/elsie" cl
...: ass="sister" id="link1">Elsie</a>,<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http:/
...: /example.com/tillie" class="sister" id="link3">Tillie</a>;and they lived at the bottom of a well.</p><p class="story">...</
...: p>"""
In [28]: h = html2text.HTML2Text()
In [29]: h.ignore_links = True
In [30]: print(h.handle(html_doc))
**The Dormouse's story**
Once upon a time there were three little sisters; and their names
wereElsie,Lacie andTillie;and they lived at the bottom of a well.
...
In [31]: html_doc = """<html><body><p>cxiobjhoijeraoi</p>jgiojoidfgjk<p>ldfjgioj</p>bvcxninclin</body></html>"""
In [32]: h = html2text.HTML2Text()
In [33]: h.ignore_links = True
In [34]: print(h.handle(html_doc))
cxiobjhoijeraoi
jgiojoidfgjk
ldfjgioj
bvcxninclin

After this processing, the result is reasonably clean plain text.
I recall that weibo, or somewhere else, also has similar (hand-rolled) code for handling HTML text. Should we unify the handling?
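
As a rough illustration of what a shared helper for this might look like, here is a minimal sketch that uses only the Python standard library. It is not the project's code: the names `BLOCK_TAGS`, `BlockAwareTextExtractor`, and `html_to_plain_text` are hypothetical, and the set of tags treated as block-level is an assumption that would need tuning. The idea is simply to start a new line at every block-level tag boundary instead of relying on line breaks in the HTML source.

```python
from html.parser import HTMLParser

# Tags treated as block-level in this sketch (an assumption, not a complete list).
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}


class BlockAwareTextExtractor(HTMLParser):
    """Collect text, starting a new line at every block-level tag boundary."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self):
        # Strip each line and drop the empty ones left over from the markup.
        lines = (line.strip() for line in "".join(self._chunks).splitlines())
        return "\n".join(line for line in lines if line)


def html_to_plain_text(html):
    """Hypothetical shared helper: HTML in, one line of text per block out."""
    parser = BlockAwareTextExtractor()
    parser.feed(html)
    parser.close()  # flush any trailing text that is not followed by a tag
    return parser.text()


doc = "terterthv<p>cxiobjhoijeraoi</p>jgiojoidfgjk<p>ldfjgioj</p>bvcxninclin"
print(html_to_plain_text(doc))
```

For the sample document from the first comment, this prints the five text fragments on five separate lines. Inline tags such as `<a>` and `<b>` do not start a new line, so link text stays inside its sentence. Unlike html2text, this sketch produces no Markdown formatting and does not skip `<script>` or `<style>` contents.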
Environment
Problem
In RSS subscriptions, text inside <p> tags is not displayed on its own line.
Logs