For some documents, many words get lumped together by get_text() #2755
Comments
There is an algorithm in MuPDF that generates spaces between characters where it seems appropriate, based on criteria like font, font size, character width, etc. In any case it is a MuPDF issue, and I have submitted a bug report at its issue tracker here.
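As a purely illustrative sketch (this is not MuPDF's actual code — the function, data shape, and threshold factor are my own assumptions), the general idea is to emit a synthetic space whenever the horizontal gap between two consecutive character boxes is large relative to the font size:

```python
# Illustrative only -- NOT MuPDF's algorithm. Mimics the general idea:
# insert a synthetic space when the horizontal gap between consecutive
# character boxes exceeds a fraction of the font size.

def join_chars(chars, fontsize, gap_factor=0.15):
    """chars: list of (glyph, x0, x1) tuples for one line, left to right."""
    out = [chars[0][0]]
    for (g0, _, x1_prev), (g1, x0_next, _) in zip(chars, chars[1:]):
        if x0_next - x1_prev > gap_factor * fontsize:
            out.append(" ")  # gap is wide relative to font size: word break
        out.append(g1)
    return "".join(out)

# Two words whose separation exists only as glyph positioning (no space glyph):
line = [("s", 0, 5), ("o", 5, 10), ("m", 10, 17), ("e", 17, 22),
        ("t", 25, 29), ("e", 29, 34), ("x", 34, 39), ("t", 39, 43)]
print(join_chars(line, fontsize=12))  # -> "some text"
```

The real heuristic also weighs font and character metrics, but the gap-versus-font-size comparison is the core of it.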
Thanks, @JorjMcKie! Does such a setting exist?
Thanks for the suggestion. But no, there is no such parameter yet. How about suggesting this to the MuPDF developers directly in their public Discord channel? As with the PyMuPDF channel, there are always nice people around, open to discussing anything about MuPDF. Maybe there are also ideas there that may help you. Are you aware that you can develop a workaround yourself while waiting for a better solution? Just extract by character.
For consideration, here is a test script that fiddles together an alternative plain-text output based on a per-character extraction:

```python
import fitz

doc = fitz.open("Gravity.pdf")
page = doc[11]
space_count = 0
space_w = 0
for b in page.get_text("rawdict", flags=fitz.TEXTFLAGS_TEXT, sort=True)["blocks"]:
    for l in b["lines"]:
        text = ""
        chars = []
        for s in l["spans"]:
            chars.extend(s["chars"])
        char_count = len(chars)
        if char_count == 1:
            print(chars[0]["c"])
            continue
        for i in range(1, char_count):
            c0 = chars[i - 1]
            r0 = fitz.Rect(c0["bbox"])
            c1 = chars[i]
            r1 = fitz.Rect(c1["bbox"])
            text += c0["c"]
            if r1.x0 - r0.x1 >= 1.5:  # fixed gap threshold (points)
                text += " "
            if c1["c"] == " ":
                space_count += 1
                space_w += r1.width
        print(text + c1["c"])
    print()
print(f"space count {space_count}, avg width = {space_w/space_count}")
```

The output is as the reader of the page would expect:
Thanks @JorjMcKie! I'm expanding the open-source project BrainAnnex.org to also provide a full-text indexing/search feature for documents (incl. PDF) managed by a "Knowledge and multimedia content management system" that employs the power of graph databases; I elaborate in this article. I happen to have a substantial number of PDF books, scientific papers and other documents at my disposal, and I'm using some of them to test the system. The extraction of individual words is the key element of this process. I'm definitely impressed by your efforts. I will experiment with the "rawdict" approach that you proposed - thanks! - and will report on the results. I understand that word detection in PDFs is something of an art form!
@BrainAnnex - thank you very much for your feedback!
No, this is as simple as a lower threshold value for inter-character distances. In the case of your example, word separation will work congruently with the reader's perception if a gap larger than 25% of the current character's width is taken as the threshold for the distance to the next character. This means that the following code snippet will produce a satisfactory output:

```python
import fitz

doc = fitz.open("Gravity.pdf")
page = doc[11]
for b in page.get_text("rawdict", flags=fitz.TEXTFLAGS_TEXT, sort=True)["blocks"]:
    for l in b["lines"]:
        text = ""
        chars = []
        for s in l["spans"]:
            chars.extend(s["chars"])
        char_count = len(chars)
        if char_count == 1:
            print(chars[0]["c"])
            continue
        for i in range(1, char_count):
            c0 = chars[i - 1]
            r0 = fitz.Rect(c0["bbox"])
            c1 = chars[i]
            r1 = fitz.Rect(c1["bbox"])
            text += c0["c"]
            if (c0["c"] != " " and c1["c"] != " "
                    and r1.x0 - r0.x1 >= r0.width * 0.25):
                # gap to the next char is large enough, and neither is a space
                text += " "
        print(text + c1["c"])
```
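The decision rule at the heart of that snippet can be isolated into a small helper for experimentation (a sketch of mine — the function name and argument shapes are not part of PyMuPDF):

```python
def needs_space(c0, c1, r0_x1, r1_x0, r0_width, factor=0.25):
    """Return True if a synthetic space should separate chars c0 and c1.

    c0, c1:   the two characters (length-1 strings)
    r0_x1:    right edge of c0's bbox
    r1_x0:    left edge of c1's bbox
    r0_width: width of c0's bbox
    factor:   gap threshold as a fraction of c0's width (25% as suggested above)
    """
    return c0 != " " and c1 != " " and (r1_x0 - r0_x1) >= r0_width * factor

print(needs_space("h", "w", 10.0, 13.0, 8.0))  # gap 3.0 >= 2.0 -> True
print(needs_space("h", "w", 10.0, 11.0, 8.0))  # gap 1.0 <  2.0 -> False
```

Keeping the rule in one place makes it easy to tune the factor per document if 25% turns out to be wrong for some fonts.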
There's an experimental alternative available in the recently released "rebased" implementation of PyMuPDF 1.23.6, which makes direct use of MuPDF's Extract facility. Here's some example code that uses both approaches.

Depending on how well this works in other cases, you might still be better off using @JorjMcKie's "rawdict" approach.
Describe the bug
I'm trying to extract text from PDF documents, to isolate individual words and create an indexing system.
For most PDF files, pymupdf (version 1.23.5) does a fine job... but for some files (such as the one enclosed, "Gravity.pdf"), a lot of words emerge glued together.
To Reproduce
The file in question (but NOT the only one!) is : Gravity.pdf
Output
It contains several words fused together, such as in this portion
and,\nsureenough,bothfellwiththesameaccelerationandreachedthe\nMoon’s surface together.2
Screenshots
Full text of the parse:
This is how it looks in the PDF:
Your configuration
print(sys.version, "\n", sys.platform, "\n", fitz.__doc__)
on Colab gives:
On my local computer, it gives:
On Colab, I issue `!pip install pymupdf`, and it says:

On my local computer, I let PyCharm deal with it. (I think it does a `pip install`.)

Additional context
Words fused together occur A LOT when parsing the attached PDF.
I suspect you'll say that this file is malformed. Maybe it is... but another software library, `pypdf`, parses it just fine.

I have noticed that lost spaces are far more prevalent in extractions by `pypdf`, compared to PyMuPDF - BUT for some files (such as the one I'm reporting here) it's the opposite, and `pypdf` does far better.

Empirically, I've noticed an intriguing complementarity between `pypdf` and `PyMuPDF`: for files where one messes up badly, the other one does well - and vice versa. Maybe a different threshold for detecting blank spaces in sentences? Maybe some insight to gain from this?

Thanks!
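Purely as an illustration of exploiting that complementarity (the function and its cutoff are my own assumptions, not an existing API in either library), one could extract the same page with both libraries and keep whichever output has the more plausible average word length — abnormally long "words" usually mean lost spaces:

```python
# Illustrative heuristic only: pick between two extractions of the same text
# by comparing average word length, since fused words inflate that average.

def better_extraction(text_a, text_b):
    def avg_word_len(text):
        words = text.split()
        return sum(map(len, words)) / len(words) if words else float("inf")
    # Prefer the extraction with the shorter (more plausible) average word.
    return text_a if avg_word_len(text_a) <= avg_word_len(text_b) else text_b

fused  = "and,sureenough,bothfellwiththesameacceleration"
spaced = "and, sure enough, both fell with the same acceleration"
print(better_extraction(fused, spaced) == spaced)  # -> True
```

This is crude (it would be fooled by hyphen-heavy or CJK text), but it captures the observation above: when one extractor drops spaces badly, a simple statistic can flag it.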