Textual & numeric data extraction with Python using textract, easily shareable with Docker.

simonkeng/pdf_parser

PDF Parser

  1. Install Docker CE

  2. Clone this repo:

git clone https://github.com/simonkeng/pdf_parser.git
  3. cd into the pdf_parser directory.

  4. Build the Docker image from the Dockerfile:

docker build -t pdf_parser .

Usage:

Run a container and execute the Python script, passing in a document:

docker run -i -t pdf_parser bash -c "python pdf_rip.py test_data.pdf"
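
As a rough illustration of the kind of extraction described above, here is a minimal sketch. It is not the contents of pdf_rip.py. The function names extract_text and extract_numbers and the number pattern are assumptions made for this example; the only textract call used is textract.process, which returns the document text as bytes.

```python
import re

# Assumed pattern for this sketch: integers and decimals, optional leading minus.
NUMBER_RE = re.compile(r"-?\d+(?:\.\d+)?")


def extract_text(pdf_path):
    """Return the text of a document as a str (hypothetical helper)."""
    # Imported lazily so extract_numbers works without textract installed.
    import textract

    # textract.process returns bytes; decode to get a str.
    return textract.process(pdf_path).decode("utf-8")


def extract_numbers(text):
    """Return every number found in text as a float, in order of appearance."""
    return [float(match) for match in NUMBER_RE.findall(text)]
```

Calling extract_numbers(extract_text("test_data.pdf")) would then give the numeric values found in the document, assuming textract can read the file.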

To extract from multiple files, place all your PDFs in one folder and copy it into your Docker container:

docker cp pdfs/ 609d09bb400f:/tmp/pdfs/

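Replace 609d09bb400f with your container ID, which docker ps -a lists. Note that docker run starts a new container from the image, so it sees files copied with docker cp only if the folder is also mounted into it or built into the image. Run the batch script on the folder: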

docker run -i -t pdf_parser bash -c "python batch.py /tmp/pdfs/"

To confirm that it ran and to check its output, look up the container ID with docker ps -a and pass it to docker logs:

docker logs <containerID>
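
For readers curious how a batch run over a folder might be structured, the following is a minimal sketch and not the contents of batch.py. The name batch_extract and its extract parameter are hypothetical; extract stands for any function that takes a file path and returns the extracted data, for example a wrapper around textract.process.

```python
from pathlib import Path


def batch_extract(folder, extract):
    """Apply extract to every PDF in folder and return {filename: result}."""
    results = {}
    # sorted() gives a stable processing order across runs.
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        try:
            results[pdf_path.name] = extract(str(pdf_path))
        except Exception as exc:
            # Record the failure so one unreadable file does not stop the batch.
            results[pdf_path.name] = f"ERROR: {exc}"
    return results
```

Passing the extractor in as an argument keeps the folder traversal separate from the extraction itself, so the loop can be tried out with a stand-in function before textract is installed.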