Streams are cut off at Length which extracts incomplete files #809

qbq-ber-robin-klimonow · 2024-03-26T08:28:55Z

Hi,

I stumbled upon an issue while using PdfPig for extracting attachments from a PDF file. I have attached a sample PDF to reproduce the error:

0_ZUGFeRD.pdf

After debugging through the code, I found that the /Length entry of the stream dictionary is set to a wrong value of 11417 bytes: there is no endstream at the expected position. Instead, more bytes follow, and only then comes endstream.

PdfPig currently seems to cut off all bytes beyond the declared length, which seems reasonable at first. However, other PDF libraries we tested handle this PDF file just fine, and so does Adobe Acrobat Reader itself.

Therefore, this seems to be an issue with PdfPig in my opinion.

The fix for me was to remove the following if check in PdfTokenScanner.cs, line 437:

                if (length.HasValue && memoryStream.Length >= length)
                {
                    // Use the declared length to copy just the data we want.
                    byte[] data = new byte[read];
                    memoryStream.Read(data, 0, (int)read);

                    stream = new StreamToken(streamDictionaryToken, data);
                }
                else
                {

Instead, I always use the else block, which searches for endobj or endstream and cuts the stream off there. This works with the PDF I attached.
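To illustrate the idea, here is a minimal sketch that trusts the declared /Length only when endstream actually follows it, and otherwise falls back to scanning for the keyword. This is not PdfPig's code: `StreamLengthSketch`, `ReadStreamData` and their parameters are hypothetical names, and the end-of-line trimming is a heuristic assumption.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only; not part of PdfPig.
public static class StreamLengthSketch
{
    private static readonly byte[] EndStream = Encoding.ASCII.GetBytes("endstream");

    // Returns the stream data that begins at dataStart.
    // The declared /Length is used only if "endstream" follows it
    // (after optional CR/LF); otherwise we scan forward for the keyword.
    public static ReadOnlyMemory<byte> ReadStreamData(
        ReadOnlyMemory<byte> input, int dataStart, int? declaredLength)
    {
        ReadOnlySpan<byte> span = input.Span;
        ReadOnlySpan<byte> marker = EndStream;

        if (declaredLength is int length
            && length >= 0
            && dataStart + (long)length <= span.Length)
        {
            int pos = dataStart + length;

            // Skip the end-of-line bytes that may precede the keyword.
            while (pos < span.Length && (span[pos] == '\r' || span[pos] == '\n'))
            {
                pos++;
            }

            if (span.Slice(pos).StartsWith(marker))
            {
                return input.Slice(dataStart, length);
            }
        }

        // Fallback: the declared length is missing or wrong.
        int offset = span.Slice(dataStart).IndexOf(marker);
        if (offset < 0)
        {
            // No keyword found; return everything that is left.
            return input.Slice(dataStart);
        }

        int end = dataStart + offset;

        // Heuristic: drop one trailing EOL (LF or CRLF) before the keyword.
        if (end > dataStart && span[end - 1] == '\n') end--;
        if (end > dataStart && span[end - 1] == '\r') end--;

        return input.Slice(dataStart, end - dataStart);
    }
}
```

Compared with dropping the length check entirely, this keeps the fast path for well-formed files. The fallback can still be fooled if the binary data itself happens to contain the bytes "endstream", so it is a best-effort recovery rather than a guarantee.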

Another possible cause is the special characters in the attachment, which may have led to this incorrect /Length value.

Would this be a valid fix in your opinion?

sbruyere · 2024-05-20T19:36:29Z

Hi all,

I have the same issue and came to the same conclusion. Is there any hope this can be fixed?
Thanks!

…gth cutting off Streams
- Fix the invalid stream Length issue that caused stream data to be cut off: fix UglyToad#809
- Improve Stream Token read performance by:
  - simplifying TryReadStream(), avoiding use of MemoryStream, with the benefit of the already existing Memory Span of "inputBytes"
  - removing the unnecessary List<>

…gth cutting off Streams (#838)
* Improve TryReadStream with simplification & fix of Stream Invalid Length cutting off Streams
  - Fix the invalid stream Length issue that caused stream data to be cut off: fix #809
  - Improve Stream Token read performance by:
    - simplifying TryReadStream(), avoiding use of MemoryStream, with the benefit of the already existing Memory Span of "inputBytes"
    - removing the unnecessary List<>
* Add Stream with Invalid Length unit test
* Use Memory<> instead of a direct Span to avoid the byte array allocation of .ToArray. Suggestion from (https://github.com/UglyToad/PdfPig/pull/838/files/4153e4a1b421aee6158799175ced081c9f533a13#r1619509165)
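The last point of that commit message can be illustrated with a small sketch: slicing a `ReadOnlyMemory<byte>` yields a view over the existing buffer, whereas `Span.ToArray()` allocates and copies. `SliceSketch`, `CopySlice` and `ViewSlice` are hypothetical names used only for this example; they are not taken from the pull request.

```csharp
using System;

// Hypothetical helper for illustration only; not part of PdfPig.
public static class SliceSketch
{
    // Allocates a new byte[] and copies the requested range into it.
    public static byte[] CopySlice(ReadOnlyMemory<byte> input, int start, int length)
    {
        return input.Span.Slice(start, length).ToArray();
    }

    // Returns a view over the same underlying buffer; no bytes are copied.
    public static ReadOnlyMemory<byte> ViewSlice(ReadOnlyMemory<byte> input, int start, int length)
    {
        return input.Slice(start, length);
    }
}
```

The view is cheaper, but it stays tied to the original buffer: later changes to that buffer are visible through the view, which is only safe as long as the input bytes are not modified afterwards.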

sbruyere mentioned this issue May 21, 2024

Improve TryReadStream with simplification & fix of Stream Invalid Length cutting off Streams #837

Closed

sbruyere mentioned this issue May 21, 2024

Improve TryReadStream with simplification & fix of Stream Invalid Length cutting off Streams #838

Merged

BobLd closed this as completed in #838 May 31, 2024
