Skip to content

This project facilitates the extraction of text from PDF files using various Python libraries. It is designed to be flexible, allowing the choice among different text extraction libraries and supporting both single PDF file and directory containing multiple PDF files.

License

Notifications You must be signed in to change notification settings

renan-siqueira/python-pdf-tool

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

12 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

PDF Text Extraction Tool

Description

This project facilitates the extraction of text from PDF files using various Python libraries. It is designed to be flexible, allowing the choice among different text extraction libraries and supporting both single PDF file and directory containing multiple PDF files.


Project Structure

- main.py
- extractors/
  - __init__.py
  - pypdf2_extractor.py
  - pdfminer_extractor.py
  - pymupdf_extractor.py
  - pdfplumber_extractor.py
- helpers/
  - __init__.py
  - utils.py
- json/
  - params.json

Configuration

  • Python: This project is developed in Python. Ensure you have the latest version of Python installed.
  • Dependencies: The required libraries are listed in each extraction file within the extractors folder. Install them using pip install <library>.

Use the requirements.txt file to install all libraries at once


Usage

  1. Initial Setup: Edit the json/params.json file to set the input path (input_path), output path (output_path), desired libraries (libraries), and log level (log_level).

Example params.json:

{
    "input_path": "/path/to/pdf/or/directory",
    "output_path": "/path/to/output/directory",
    "libraries": ["pypdf2", "pdfminer"],
    "log_level": "INFO"
}
  1. Execution: Run the main.py script to start the text extraction process.
python main.py

Features

  • Text extraction from PDF files using various libraries.
  • Supports processing either a single file or multiple files in a directory.
  • Automatic output folder generation based on input.
  • Flexible configuration via the json/params.json file.

About

This project facilitates the extraction of text from PDF files using various Python libraries. It is designed to be flexible, allowing the choice among different text extraction libraries and supporting both single PDF file and directory containing multiple PDF files.

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages