LegalCrawler: A tool for automated scraping of English legal corpora

iliaschalkidis/LegalCrawler

Legal Crawler 🐙

A collection of scripts to crawl English legal corpora 📕 from open public domains.

  • The current version supports the following domains:
| Corpus | Domain | Corpus alias |
|---|---|---|
| 🇪🇺 EU legislation | https://eur-lex.europa.eu/ | eu |
| 🇬🇧 UK legislation | https://legislation.gov.uk/ | uk |
| 🇨🇦 Canadian legislation | http://laws.justice.gc.ca/eng/ | ca |
| 🇯🇵 Japanese legislation | http://www.japaneselawtranslation.go.jp/law/ | jp |
| 🇫🇮 Finnish legislation | https://www.finlex.fi/en | fi |
| 🇺🇸 US case law* | https://case.law/bulk/download/ | us |

* In order to use the script for US case law, you need to first apply for a researcher account at https://case.law.

  • For US public filings, e.g., contracts, please use the library OpenEDGAR (https://github.com/LexPredict/openedgar) by LexPredict.
  • Documents are saved in raw text format; amend the code if you wish to better handle metadata, document structure, etc.

‼️ Disclaimer ‼️

  • If you aim to use the code, please carefully read the individual license agreements with respect to re-use, re-publication, terms of use, etc. 📝
  • The text cleansing from the original PDF/HTML files is minimal. Consider amending the scripts and/or writing your own post-processing data cleansing process that better fits each corpus. 🚧
  • These scripts aim to give researchers a kick start for scraping legal corpora from public domains. They should not be considered a stand-alone, qualified solution. 🚧
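As a starting point for the post-processing suggested above, here is a minimal sketch of a text cleaner. The function `clean_text` is hypothetical and not part of this repository, and the specific rules (Unicode normalization, de-hyphenation, whitespace collapsing) are assumptions about what raw PDF/HTML extractions typically need. Adjust them per corpus.

```python
import re
import unicodedata


def clean_text(raw: str) -> str:
    """Normalize raw extracted text (hypothetical helper, not part of the repo)."""
    # Normalize Unicode compatibility forms (e.g. the ligature "ﬁ" becomes "fi").
    text = unicodedata.normalize("NFKC", raw)
    # Replace form feeds, which PDF converters emit at page boundaries, with newlines.
    text = text.replace("\f", "\n")
    # Re-join words hyphenated across a line break: "legis-\nlation" -> "legislation".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Strip leading and trailing whitespace on every line.
    text = "\n".join(line.strip() for line in text.split("\n"))
    # Collapse three or more consecutive newlines into one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Note that the de-hyphenation rule also merges genuine compounds that happen to break at a line end (e.g. "cross-\nborder"), so it may be too aggressive for some corpora.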

Project Requirements:

Python packages

  • json-lines
  • tqdm
  • beautifulsoup4

Linux packages (command line tools)

The following Linux command line tools are used to process PDF documents:

  • pdftocairo
  • pdftotext
  • mutool
  • gs
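Before starting a long crawl, it may help to confirm that these tools are actually installed. The sketch below uses `shutil.which` from the Python standard library. The helper `missing_tools` is a hypothetical name and not part of this repository.

```python
import shutil

# Tool names as listed in the requirements above.
REQUIRED_TOOLS = ("pdftocairo", "pdftotext", "mutool", "gs")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing command line tools: " + ", ".join(missing))
    else:
        print("All PDF tools found.")
```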

Quick start:

Install Python requirements:

pip install -r requirements.txt

Install Linux packages (Debian/Ubuntu):

sudo apt-get install libcairo2-dev
sudo apt-get install libpango1.0-dev
sudo apt-get install -y xpdf
sudo apt-get install mupdf mupdf-tools

Download Canadian legislation

python download_legal_corpora.py --corpus ca

Download EU legislation

python download_legal_corpora.py --corpus eu

Download all (EU, UK, CA, FI, JP, US)

python download_legal_corpora.py --corpus all
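To download a chosen subset of corpora rather than one or all, the command can be repeated once per alias. The wrapper below is a hypothetical sketch, not part of this repository. It assumes it is run from the repository root and that `--corpus` accepts every alias in the table of supported domains (only `ca`, `eu`, and `all` are shown above). It defaults to a dry run that only prints the commands.

```python
import subprocess
import sys

# Corpus aliases as listed in the table of supported domains.
ALIASES = ("eu", "uk", "ca", "jp", "fi", "us")


def build_command(alias: str) -> list:
    """Build the download command for one corpus alias."""
    if alias not in ALIASES:
        raise ValueError(f"Unknown corpus alias: {alias!r}")
    return [sys.executable, "download_legal_corpora.py", "--corpus", alias]


def download(aliases, dry_run=True):
    """Run (or, with dry_run=True, only print) one download per alias."""
    for alias in aliases:
        cmd = build_command(alias)
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError if the script exits non-zero.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    download(["ca", "eu"])
```

Call `download(["ca", "eu"], dry_run=False)` to actually start the downloads.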

Citation

If you use this repo or any derivative of it in your work, please cite it as follows:

@Misc{chalkidis-legalcrawler,
author =   {Ilias Chalkidis},
title =    {{Legal Crawler}: A collection of scripts to crawl English legal corpora from open public domains.},
howpublished = {\url{https://github.com/iliaschalkidis/LegalCrawler/}},
year = {2020--2022}
}
