ENTRANT: A Large Financial Dataset for Table Understanding

izavits/entrant

Extract and clean tables from financial xlsx files downloaded from EDGAR, and convert them to JSON with bi-tree positional information and metadata.

Related dataset:

Zavitsanos, E., Mavroeidis, D., Spyropoulou, E., Fergadiotis, M., & Paliouras, G. (2024). ENTRANT: A Large Financial Dataset for Table Understanding [Data set]. Zenodo. https://doi.org/10.5281/zenodo.10667088

Install

  • Before starting, we recommend creating and activating a virtual environment with conda or virtualenv.
  • Install the dependencies with pip install -r requirements.txt

Usage

For table extraction from EDGAR:

  • Place the xlsx files in a directory named data in the project's root.
  • Create a directory named output to store the results.
  • Run extract_tables_multiprocess.py.
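The steps above can be sketched in Python as follows. This is a minimal illustration, not part of the repository: the helper names prepare_workspace and run_extraction are hypothetical, and the sketch assumes that extract_tables_multiprocess.py takes no command-line arguments and reads from data and writes to output relative to the project's root.

```python
import subprocess
import sys
from pathlib import Path


def prepare_workspace(root):
    """Check that `data` exists, create `output` if needed, and list the workbooks.

    Hypothetical helper: the directory names come from the README, the
    accepted suffixes (.xls, .xlsx) are an assumption.
    """
    root = Path(root)
    data_dir = root / "data"
    output_dir = root / "output"
    if not data_dir.is_dir():
        raise FileNotFoundError(f"expected input directory: {data_dir}")
    output_dir.mkdir(exist_ok=True)
    workbooks = sorted(
        path for path in data_dir.iterdir()
        if path.suffix.lower() in {".xls", ".xlsx"}
    )
    return workbooks, output_dir


def run_extraction(root="."):
    """Run the extraction script from the project's root.

    Assumes the script needs no arguments; check the script itself
    before relying on this.
    """
    workbooks, output_dir = prepare_workspace(root)
    print(f"{len(workbooks)} workbook(s) found; results go to {output_dir}")
    subprocess.run(
        [sys.executable, "extract_tables_multiprocess.py"],
        cwd=root,
        check=True,
    )
```

Calling run_extraction() from the project's root would then check the layout and start the script; prepare_workspace can also be used on its own to confirm that the input files are where the script expects them.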

For downloading Excel reports:

  • See fetch_reports.py.
  • Respect EDGAR's fair-access policy when downloading.
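One way to stay within fair usage is to throttle requests and identify yourself in the User-Agent header. The sketch below is an illustration only and is not taken from fetch_reports.py: the Throttle class and fetch function are hypothetical names, and the rate and contact string are placeholders. At the time of writing, the SEC's published guidance asks for no more than 10 requests per second and a declared User-Agent; check the current policy before running bulk downloads.

```python
import time
import urllib.request


class Throttle:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock    # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None      # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def fetch(url, throttle, user_agent):
    """Download one URL after waiting for the throttle.

    `user_agent` should identify you, for example a name and contact
    email (placeholder format, see the SEC's guidance for specifics).
    """
    throttle.wait()
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()
```

A single Throttle(max_per_second=5) shared by all calls to fetch keeps the overall rate below the limit with some margin. If downloads run in several processes, each process would need its own, lower limit, since this sketch does not coordinate across processes.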

Data

Tests

Use pytest to run the unit tests.

Contributing

See the contributing file!

License

The project is licensed under the Creative Commons Attribution 4.0 (CC BY 4.0) license.
