Input Modules

The input modules in Parsr import the raw data from the input files. Each module handles a particular type of input file and produces different results. A module may expose a set of configurable parameters; these, along with usage documentation, are described in the per-module documentation pages below. Every module returns a valid Document object containing an array of Words for each parsed Page.
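The Document, Page and Word relationship described above can be sketched as a small set of types. This is only an illustration: the interface names, field names (`content`, `pageNumber`, `words`, `pages`) and the `wordsPerPage` helper are assumptions made for this example, not Parsr's actual class definitions, which carry more information.

```typescript
// Hypothetical, simplified shapes for the output of an input module.
// Field names are assumptions for illustration only.
interface Word {
  content: string; // the text of a single word
}

interface Page {
  pageNumber: number;
  words: Word[]; // every parsed page carries an array of words
}

// Named ParsedDocument here to avoid clashing with the DOM `Document` type.
interface ParsedDocument {
  pages: Page[];
}

// Hypothetical helper: counts the words found on each page.
function wordsPerPage(doc: ParsedDocument): number[] {
  return doc.pages.map((page) => page.words.length);
}

// Hypothetical helper: joins the words of a page back into a line of text.
function pageText(page: Page): string {
  return page.words.map((word) => word.content).join(" ");
}
```

Whatever module produced the document, downstream code in a sketch like this only depends on the shared shape, which is the point of having every module return the same kind of object.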

The Modules

  1. Pdfminer
  2. PDF.js
  3. Tesseract
  4. Google Vision
  5. Amazon Textract
  6. MS Cognitive Services
  7. ABBYY
  8. JSON
  9. MS Word
  10. Email

Supported input formats

Parsr currently supports the following input file formats:

Input formats: .pdf, .docx, .eml, .tiff, .png, .jpeg, .json, .xml

Input modules: Pdfminer, pdf.js, ABBYY, Tesseract, JSON Extractor, Google Vision, Amazon Textract, MS Cognitive Services

This means that a PDF file can be processed with any of four extractors: pdfminer, pdf.js, ABBYY or Tesseract.

Note: not all extractors offer the same functionality or return the same information, so compare them against your use case before choosing one.

Note: when a JSON or XML file is used as input, the extractor configuration is ignored, as there is currently only one extractor for each of these formats.
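The selection rules stated above can be sketched as a small helper. This is a minimal illustration and not Parsr's own code: the function name `resolveExtractor`, the lowercase extractor identifiers, and the choice to return `null` when the configuration is ignored are all assumptions. Only the PDF extractor list and the JSON/XML behaviour come from the text; other formats are deliberately left out.

```typescript
// Extractors the text lists for PDF input (identifiers lowercased here
// as an assumption; the real configuration values may differ).
const PDF_EXTRACTORS: string[] = ["pdfminer", "pdf.js", "abbyy", "tesseract"];

// Formats for which the text says the extractor configuration is ignored.
const FIXED_EXTRACTOR_FORMATS: string[] = [".json", ".xml"];

// Hypothetical helper: returns the extractor to use for a file, or null
// when the format has a single fixed extractor and the request is ignored.
function resolveExtractor(fileName: string, requested: string): string | null {
  const dot = fileName.lastIndexOf(".");
  const ext = dot >= 0 ? fileName.slice(dot).toLowerCase() : "";

  if (FIXED_EXTRACTOR_FORMATS.indexOf(ext) !== -1) {
    return null; // configuration is ignored for .json and .xml
  }

  if (ext === ".pdf") {
    const name = requested.toLowerCase();
    if (PDF_EXTRACTORS.indexOf(name) !== -1) {
      return name;
    }
    throw new Error(
      "'" + requested + "' is not listed for PDF input; expected one of: " +
        PDF_EXTRACTORS.join(", ")
    );
  }

  // Other formats are not covered by this sketch.
  throw new Error("no extractor list is given in this sketch for '" + ext + "'");
}
```

For example, `resolveExtractor("report.pdf", "Tesseract")` yields `"tesseract"`, while `resolveExtractor("data.json", "pdfminer")` yields `null` because the requested extractor has no effect for that format.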