preserveOnlyMultipleLineBreaks does not work on PDF when eol:'dos' #199

thegoatherder · 2020-01-30T15:25:46Z

I'm trying to extract text from a PDF using textract.fromFileWithPath() in a Windows environment, with textract v2.5.0.

The following config is set:

{
  preserveOnlyMultipleLineBreaks: true,
  pdftotextOptions: {
    eol: 'dos',
    layout: 'raw',
    encoding: 'UTF-8',
    splitPages: true
  }
}

I have found that preserveOnlyMultipleLineBreaks: true is not working as expected. With the setting on, the output converts \r\n to \r. But AFAIK a lone \r is not a line ending on either Windows or Unix systems. I'm expecting it instead to convert \r\n\r\n to \r\n and to remove single \r\n completely from the text output.

Seems like a bug?
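
To make the expected behaviour concrete, here is a minimal sketch of the post-processing I have in mind. The helper name collapseLineBreaks and its joiner parameter are hypothetical and are not part of textract; this only illustrates the desired result for DOS line endings, not how the library is actually implemented.

```javascript
// Hypothetical helper (not a textract API): collapses line breaks the way
// this issue expects when the input uses DOS (\r\n) line endings.
function collapseLineBreaks(text, joiner = '') {
  // Treat \r\n, a lone \r, and a lone \n as the same logical line break.
  const normalized = text.replace(/\r\n|\r|\n/g, '\n');
  // A run of two or more breaks becomes one \r\n; a single break is
  // replaced by `joiner` (empty by default, i.e. removed).
  return normalized.replace(/\n+/g, (run) => (run.length > 1 ? '\r\n' : joiner));
}

console.log(JSON.stringify(collapseLineBreaks('a\r\nb\r\n\r\nc')));
```

Passing ' ' as the joiner would keep words on adjacent lines from being fused together, which may be preferable for text extracted from PDFs.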
