Page.apply_redactions() removes more text than expected in the pdf document. #3433

Open
dameyerdave opened this issue May 2, 2024 · 13 comments

@dameyerdave

Description of the bug

As soon as I apply the redactions, all the text and graphics are lost from the PDF.

Source:

Receipt pdf

Annotated:

Receipt_annot

After apply_redactions():

Screenshot 2024-05-02 at 14 40 02

How to reproduce the bug

This is the code I wrote that leads to this:

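import fitz  # PyMuPDF

# some_text_array: list of strings to redact (its contents are not shown in this report)
doc = fitz.open("./Receipt.pdf")
for page in doc:
    for text in some_text_array:
        # quads=True makes search_for return Quad objects instead of Rects
        for area in page.search_for(text, quads=True):
            redaction = page.add_redact_annot(
                area,
                fill=(0, 0, 0),
            )
            redaction.update()

    # here it happens (positional arguments: images=0, graphics=0, text=0)
    page.apply_redactions(0, 0, 0)

doc.save("./redacted.pdf")
doc.close()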

PyMuPDF version

1.24.2

Operating system

MacOS

Python version

3.10

@dameyerdave dameyerdave changed the title from "Page.apply_redactions() removes all the text in the pdf document." to "Page.apply_redactions() removes more text than expected in the pdf document." on May 2, 2024
@JorjMcKie
Collaborator

Please provide all mandatory information - in this case, the reproducing file is missing.

@dameyerdave
Author

I'm sorry for that. These are the files:

@JorjMcKie
Collaborator

JorjMcKie commented May 3, 2024

Thanks for the examples.
Sorry, I cannot find a problem. I made a redaction to remove "David Meyer" and it simply worked!

import fitz

doc = fitz.open("original.pdf")
page = doc[0]
for r in page.search_for("david meyer"):
    page.add_redact_annot(r)
# interactive output: 'Redact' annotation on page 0 of original.pdf

page.apply_redactions(0, 0, 0)  # returned True
doc.ez_save("x-1.24.2.pdf")

In the meantime, I also redacted other parts of the page (the text "October 19, 2023"), and those redactions also worked without problems.

@JorjMcKie JorjMcKie added the labels "duplicate", "fix developed", "release schedule to be determined" and "Waiting for information", and removed the labels "duplicate", "fix developed" and "release schedule to be determined" on May 3, 2024
@aleem75321

aleem75321 commented May 5, 2024

Hi @JorjMcKie, I have run into the same issue when applying redactions: they remove images that should not be removed, or they change text.
test.pdf
test2.pdf

I have attached both PDFs to reproduce the issue.

test_Original_image
test_after_redacttion

test2_Original_image
test2_after_redacttion
test2_Original_text_issue
test2_after_redact_text_issue

Code:

import fitz
from pathlib import Path


file_path = Path(r"test_pages/test.pdf")

doc = fitz.open(file_path)
page = doc[0]


blocks = page.get_text("rawdict", flags=fitz.TEXTFLAGS_TEXT, sort=True)["blocks"]
# Set the color for the output PDF
Red = fitz.pdfcolor["red"]

for b in blocks:
    for l in b["lines"]:
        for s in l["spans"]:
            for c in s["chars"]:

                if s["size"] > 15 and s["color"] == 2236191:
                    if c["c"] == "ं":
                        try:
                            font = fitz.Font(fontname=s["font"], fontfile=f"{s['font']}.ttf")  # this must be known somehow - or simply try some other font
                        except Exception as e:
                            print(str(e))
                            continue  # no usable font, so leave this character untouched
                        redact_box = fitz.Rect(c["bbox"])
                        origin_text = fitz.Point(c["origin"])
                        redact_box.y1 = redact_box.y1 - s["size"]
                        page.add_redact_annot(redact_box)
                        # Apply the redaction right away, keeping images and line art
                        page.apply_redactions(images=fitz.PDF_REDACT_IMAGE_NONE, graphics=fitz.PDF_REDACT_LINE_ART_NONE)
                        # Create a TextWriter to write on the page in the chosen color
                        tw = fitz.TextWriter(page.rect, color=Red)
                        # Re-insert the same text in a different color
                        tw.append((origin_text.x, origin_text.y), text=c["c"], fontsize=s["size"], font=font)
                        tw.write_text(page)

# Save a backup file for future use
out_fpath = "OUT/" + file_path.stem + ".pdf"
doc.save(out_fpath, garbage=3, deflate=True)
doc.close()

PyMuPDF version
1.24.2

Operating system
windows

Python version
3.11.4

@JorjMcKie
Collaborator

@aleem75321 please submit this as a different issue - this is too confusing in this context.
When you do, please save the PDF when you have inserted all redactions - before applying them. I need to confirm where your code has put them - without the need to understand your code.
Then attach this PDF to confirm that bad things happen on applying redactions.

@aleem75321

aleem75321 commented May 6, 2024

I have submitted a separate issue; please see the link below.

Facing Issues after applying redactions they delete some Images or Icons #3439

@dameyerdave
Author

dameyerdave commented May 6, 2024

I reduced the application to the bare minimum and still encounter the same issue. I tried it on a Mac M3, on Ubuntu Linux (Intel), and in a Docker container with platform: linux/amd64, all without success.

import fitz

doc = fitz.open("./original.pdf")
for page in doc:
    for r in page.search_for("David Meyer"):
        page.add_redact_annot(r)

    page.apply_redactions(0, 0)
doc.ez_save("redacted.pdf")

With the following files:

original.pdf
redacted.pdf

I don't know what to try now... If you have another good idea, please let me know...

@JorjMcKie
Collaborator

JorjMcKie commented May 6, 2024

@dameyerdave we (a colleague of mine and I) have tried on all 3 platforms now Mac, Linux, Win with fitz.version=('1.24.2', '1.24.1', '20240417000001') and are getting the correct, flawless result.
🤷‍♂️
That is: no black rectangle, and "David Meyer" removed completely.

@JorjMcKie
Collaborator

My only advice is to re-install 1.24.2.
There has been a redaction issue previously. I will try with 1 or 2 previous versions.

@JorjMcKie
Collaborator

No such luck:
At least on Windows, all versions back to 1.23.26 work correctly.
So your best option is probably to re-install the latest version.
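
Before and after re-installing, it can help to confirm which version the active environment really provides. The following is a rough sketch that uses only the standard library; the helper name installed_version is hypothetical, and "PyMuPDF" is the distribution name shown later in this thread.

```python
from importlib import metadata


def installed_version(package="PyMuPDF"):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the version found in the current environment, or None.
print(installed_version())
```

The value reported here can be compared with the first entry of the fitz.version tuple quoted earlier in the thread.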

@luchux

luchux commented May 6, 2024

We are facing exactly the same problem as everyone else reporting the bug in this thread.
The version in our environment is:
Name: PyMuPDF
Version: 1.24.0

I tried removing the images=0 argument from apply_redactions() and also tried every possible value for that parameter.
I also tried dropping the garbage and deflate options when saving.

Exactly the same error as other people:

Original PDF before redaction
Screenshot 2024-05-06 at 6 13 11 PM

After apply_redactions() on the text "Origin":
Screenshot 2024-05-06 at 6 14 09 PM

We would love to know if you are aware of this bug, and if there is a stable version that works properly without this bug. Thanks a lot!

@luchux

luchux commented May 6, 2024

Another example.
I have now tested 3 versions. 1.24.0 and 1.24.2 are failing.

1.23.26 works well: the redaction behaves as expected.

Original before redaction:
Screenshot 2024-05-06 at 6 58 32 PM

After redacting text with 1.24.0 and 1.24.2:
Screenshot 2024-05-06 at 6 58 14 PM

After redacting text with 1.23.26 (working!):
Screenshot 2024-05-06 at 7 56 49 PM

@JorjMcKie
Collaborator

@luchux - "A picture is worth a thousand words."

Certainly true. But a thousand pictures are not worth a million words!
Please add an example file, and no more pictures, if you want us to confirm that yours is another duplicate of #3376.

Please also note that the problem in this post is not yet reproducible, so it is unclear whether it is a bug at all.
