pd3f/pd3f-dataset-bmjv


pd3f-dataset-bmjv

Dataset of (mostly German) PDFs used to develop pd3f.

This repository contains the code to scrape and download some public documents (PDFs). The files can be downloaded here: https://data.jfilter.de/nlp/pd3f/bmjv_v1.zip.
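A minimal sketch for fetching and unpacking the archive with the Python standard library. It is not taken from this repository's code. The function names `download` and `extract_pdfs` are hypothetical, and the internal layout of the zip is not documented here, so the sketch simply extracts every member ending in `.pdf`.

```python
import shutil
import urllib.request
import zipfile
from pathlib import Path

# URL as given in this README.
DATASET_URL = "https://data.jfilter.de/nlp/pd3f/bmjv_v1.zip"


def download(url: str, dest: Path) -> Path:
    """Stream the archive to disk instead of holding it in memory."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


def extract_pdfs(archive: Path, out_dir: Path) -> list[Path]:
    """Extract only the .pdf members and return their paths, sorted.

    Assumption: the archive layout is unknown, so members are selected
    purely by file extension.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(archive) as zf:
        for member in zf.namelist():
            if member.lower().endswith(".pdf"):
                extracted.append(Path(zf.extract(member, out_dir)))
    return sorted(extracted)
```

Typical use would be `extract_pdfs(download(DATASET_URL, Path("bmjv_v1.zip")), Path("bmjv_v1"))`; the local paths here are placeholders.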

Origin of the Dataset

  1. Downloaded "Stellungnahmen zu Referententwürfen" (statements on draft bills) from the BMJV, around 02.04.2022
  2. Prepended numbers to the filenames
  3. Ran OCR for German and English with OCRmyPDF
  4. Sorted / grouped the files by language
  5. Redid broken OCR (errors were detected manually while working on the PDFs)
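Steps 2, 3 and 5 above could be sketched as follows. This is an illustration under assumptions, not the code used to build the dataset: the `001_` prefix format and the helper names `numbered_names`, `ocr_command` and `run_ocr` are hypothetical. The sketch calls the OCRmyPDF command line with `-l deu+eng` (German plus English) and uses `--force-ocr` to redo OCR on files whose text layer turned out to be broken. It assumes OCRmyPDF and the Tesseract language data for `deu` and `eng` are installed.

```python
import subprocess
from pathlib import Path


def numbered_names(names: list[str], width: int = 3) -> dict[str, str]:
    """Map each filename to the same name with a zero-padded number prefix.

    Assumption: the exact prefix format used for the dataset is not
    documented; "001_" style prefixes are only an example.
    """
    return {
        name: f"{index:0{width}d}_{name}"
        for index, name in enumerate(sorted(names), start=1)
    }


def ocr_command(src: Path, dst: Path, force: bool = False) -> list[str]:
    """Build an OCRmyPDF call for German plus English."""
    cmd = ["ocrmypdf", "-l", "deu+eng"]
    if force:
        # Redo OCR even if the PDF already has a text layer.
        cmd.append("--force-ocr")
    return cmd + [str(src), str(dst)]


def run_ocr(src: Path, dst: Path, force: bool = False) -> None:
    """Run OCRmyPDF and raise if it exits with a non-zero status."""
    subprocess.run(ocr_command(src, dst, force), check=True)
```

Building the command separately from running it keeps the OCR settings easy to inspect before a long batch run.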

License

GPLv3
