Streams are cut off at Length which extracts incomplete files #809

Closed
qbq-ber-robin-klimonow opened this issue Mar 26, 2024 · 1 comment · Fixed by #838

Comments

qbq-ber-robin-klimonow commented Mar 26, 2024

Hi,

I stumbled upon an issue while using PdfPig for extracting attachments from a PDF file. I have attached a sample PDF to reproduce the error:

0_ZUGFeRD.pdf

After debugging through the code, I found that the /Length attribute of the stream is set to an incorrect value of 11417 bytes: there is no endstream at the expected position. Instead, more bytes follow, and only then comes endstream.

PdfPig currently seems to cut off all the additional bytes, which seems reasonable at first. However, other PDF libraries we tested handle this PDF file just fine, and so does Adobe Acrobat Reader itself.

Therefore, this seems to be an issue with PdfPig in my opinion.
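To illustrate the check described above, here is a minimal sketch that tests whether the endstream keyword really sits where the declared /Length says it should. The class and method names (`StreamLengthCheck`, `DeclaredLengthIsPlausible`) are hypothetical and not part of PdfPig; the sketch also assumes that only CR, LF, space, and tab may separate the data from the keyword.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only; not PdfPig's actual API.
public static class StreamLengthCheck
{
    private static readonly byte[] EndStream = Encoding.ASCII.GetBytes("endstream");

    // Returns true when "endstream" begins at dataStart + declaredLength,
    // optionally preceded by whitespace (assumed here: CR, LF, space, tab).
    public static bool DeclaredLengthIsPlausible(ReadOnlySpan<byte> file, int dataStart, int declaredLength)
    {
        // Reject offsets that fall outside the file.
        if (dataStart < 0 || declaredLength < 0 || dataStart > file.Length - declaredLength)
        {
            return false;
        }

        int position = dataStart + declaredLength;

        // Skip the end-of-line marker that usually precedes the keyword.
        while (position < file.Length && IsWhitespace(file[position]))
        {
            position++;
        }

        ReadOnlySpan<byte> keyword = EndStream;
        return file.Slice(position).StartsWith(keyword);
    }

    private static bool IsWhitespace(byte b)
    {
        return b == (byte)'\r' || b == (byte)'\n' || b == (byte)' ' || b == (byte)'\t';
    }
}
```

For the attached file, a check like this would presumably return false for the declared length of 11417, because other bytes come before endstream. Note that it can still accept a /Length that is too short when the missing bytes are all whitespace.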

The fix for me was to remove the if check in PdfTokenScanner.cs, line 437:

                if (length.HasValue && memoryStream.Length >= length)
                {
                    // Use the declared length to copy just the data we want.
                    byte[] data = new byte[read];
                    memoryStream.Read(data, 0, (int)read);

                    stream = new StreamToken(streamDictionaryToken, data);
                }
                else
                {

and use only the else block, which looks for endobj or endstream and cuts the data off there. That works with the PDF I attached.

One possible cause is the special characters in the attachment, which might have led to this Length attribute value.

Would this be a valid fix in your opinion?
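As a rough sketch of the fallback idea, the snippet below ignores the declared length and reads the stream data up to the first endstream keyword. It is not the code from PdfTokenScanner.cs: the names `StreamFallback` and `ReadUntilEndStream` are hypothetical, it searches only for endstream (not endobj), and it assumes a single end-of-line marker precedes the keyword.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only; not PdfPig's actual implementation.
public static class StreamFallback
{
    private static readonly byte[] EndStream = Encoding.ASCII.GetBytes("endstream");

    // Returns the bytes between dataStart and the first "endstream" keyword after it.
    public static byte[] ReadUntilEndStream(ReadOnlySpan<byte> file, int dataStart)
    {
        ReadOnlySpan<byte> keyword = EndStream;
        ReadOnlySpan<byte> rest = file.Slice(dataStart);

        int end = rest.IndexOf(keyword);
        if (end < 0)
        {
            throw new InvalidOperationException("No endstream keyword found.");
        }

        // Drop one end-of-line marker (CRLF, LF, or CR) before the keyword,
        // since it separates the data from the keyword and is not stream data.
        if (end >= 2 && rest[end - 2] == (byte)'\r' && rest[end - 1] == (byte)'\n')
        {
            end -= 2;
        }
        else if (end >= 1 && (rest[end - 1] == (byte)'\n' || rest[end - 1] == (byte)'\r'))
        {
            end -= 1;
        }

        return rest.Slice(0, end).ToArray();
    }
}
```

Taking the first occurrence is a heuristic: binary stream data could itself contain the bytes "endstream", so a more careful reader would likely trust a declared length whenever the keyword does follow it and fall back to scanning only otherwise.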

@sbruyere
Contributor

Hi all,

I have the same issue and came to the same conclusion. Is there any hope this can be fixed?
Thanks!

sbruyere added a commit to sbruyere/PdfPig that referenced this issue May 21, 2024
…gth cutting off Streams

- Fix for the invalid stream Length issue that caused stream data to be cut off: fix UglyToad#809

- Improve Stream Token read performance by:
  - simplifying TryReadStream() and avoiding the use of MemoryStream, taking advantage of the already existing memory span of "inputBytes"
  - removing the unnecessary List<>
BobLd pushed a commit that referenced this issue May 31, 2024
…gth cutting off Streams (#838)

* Improve TryReadStream with simplification & fix of Stream Invalid Length cutting off Streams

- Fix for the invalid stream Length issue that caused stream data to be cut off: fix #809

- Improve Stream Token read performance by:
  - simplifying TryReadStream() and avoiding the use of MemoryStream, taking advantage of the already existing memory span of "inputBytes"
  - removing the unnecessary List<>

* Add Stream with Invalid Length unit test

* Use Memory<> instead of a direct Span to avoid the byte array allocation from .ToArray().
Suggestion from (https://github.com/UglyToad/PdfPig/pull/838/files/4153e4a1b421aee6158799175ced081c9f533a13#r1619509165)
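
The allocation point in the last commit note can be illustrated with a small sketch. It shows only the general .NET behavior, not the code from the pull request, and the `SliceDemo` class is hypothetical: slicing a `ReadOnlyMemory<byte>` yields a view over the same buffer, whereas `Span<T>.ToArray()` allocates and copies.

```csharp
using System;

// Hypothetical demo class; it illustrates general .NET behavior only.
public static class SliceDemo
{
    public static bool ViewSharesBufferButCopyDoesNot()
    {
        byte[] buffer = { 1, 2, 3, 4, 5 };

        // A slice of ReadOnlyMemory<byte> is a view: no new array is allocated.
        ReadOnlyMemory<byte> view = new ReadOnlyMemory<byte>(buffer).Slice(1, 3);

        // ToArray() allocates a new array and copies the three bytes into it.
        byte[] copy = buffer.AsSpan(1, 3).ToArray();

        // Mutating the original buffer is visible through the view but not the copy.
        buffer[2] = 42;

        return view.Span[1] == 42 && copy[1] == 3;
    }
}
```

A `Span<T>` is a ref struct and cannot be stored in a class field, which is one plausible reason to keep a `Memory<>` slice when the data must outlive the method that reads it.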