Scripts for parallelized extraction of plain text from WARC archives, aiming at a common and reproducible extraction approach.

hplt-project/warc2text-runner

warc2text-runner

Scripts for parallelized extraction of plain text from WARC archives, aiming at a common and reproducible extraction approach.

Install

Run

./run_warc2text.sh ../wide15-sample300/ test_filtered 250 ./

This takes WARCs from ../wide15-sample300/, saves the extracted texts and URLs to test_filtered and the logs to test_filtered_logs, runs the extraction in 250 parallel processes, and filters documents using the filters from this repository (the fourth argument, ./ here).

To run without filters, omit the last argument:

./run_warc2text.sh ../wide15-sample300/ test_filtered 250
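The fan-out over WARC files described above can be pictured with the following minimal sketch. It is not the contents of run_warc2text.sh: the function name extract_parallel, the .out/.log file naming and the idea of passing the extraction command as a parameter are all assumptions made for illustration. Only the argument order (input folder, output folder, number of processes) and the _logs suffix follow the example above.

```shell
# Hypothetical helper, NOT the real run_warc2text.sh.
# Usage: extract_parallel WARC_DIR OUT_DIR N_JOBS EXTRACT_CMD
# EXTRACT_CMD is any program that takes one WARC path and writes text to stdout.
extract_parallel() {
    warc_dir=$1
    out_dir=$2
    n_jobs=$3
    extract_cmd=$4
    log_dir="${out_dir}_logs"   # same naming as test_filtered -> test_filtered_logs
    mkdir -p "$out_dir" "$log_dir"

    # One process per WARC file, at most n_jobs running at the same time.
    find "$warc_dir" -type f -name '*.warc.gz' -print0 |
        xargs -0 -n 1 -P "$n_jobs" sh -c '
            # $1 = command, $2 = output dir, $3 = log dir, $4 = WARC path
            name=$(basename "$4" .warc.gz)
            "$1" "$4" > "$2/$name.out" 2> "$3/$name.log"
        ' _ "$extract_cmd" "$out_dir" "$log_dir"
}
```

Keeping stdout and stderr of every process in separate per-file outputs avoids interleaved writes when many processes run at once.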

Calculate language statistics

cd stats
bash text_stats.sh ../test_filtered ../test_filtered_stats 250

This calculates statistics for the texts in ../test_filtered extracted by warc2text (number of bytes, words as reported by wc, newlines, and documents for each language) and saves them to ../test_filtered_stats in .tsv format, using 250 parallel processes. It also generates basic plots for some of these metrics and saves them to the same folder.
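As a rough illustration of the metrics listed above, the sketch below counts bytes, words, newlines and documents per language with wc and prints them as TSV. It is not text_stats.sh: the function name lang_stats and the layout it reads (one subfolder per language holding one plain-text .txt file per document) are assumptions, and the real warc2text output layout may differ.

```shell
# Hypothetical sketch, NOT the real text_stats.sh.
# Assumed layout: IN_DIR/<lang>/*.txt, one plain-text document per file,
# with at least one .txt file in every language folder.
# Usage: lang_stats IN_DIR [N_JOBS]
lang_stats() {
    in_dir=$1
    n_jobs=${2:-4}
    printf 'lang\tbytes\twords\tnewlines\tdocs\n'
    # One process per language folder, at most n_jobs at a time.
    find "$in_dir" -mindepth 1 -maxdepth 1 -type d -print0 |
        xargs -0 -n 1 -P "$n_jobs" sh -c '
            lang=$(basename "$1")
            set -- "$1"/*.txt      # positional parameters now hold the documents
            docs=$#
            # wc prints newlines, words and bytes, in that order
            set -- $(cat "$@" | wc -lwc)
            printf "%s\t%s\t%s\t%s\t%s\n" "$lang" "$3" "$2" "$1" "$docs"
        ' _ |
        sort   # parallel output order is arbitrary, so sort rows by language
}
```

For example, lang_stats ../some_folder 250 > stats.tsv would write one row per language, where ../some_folder is a placeholder for a folder in the assumed layout.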

Collected statistics and plots

Language statistics were calculated for cc40, wide00015 and wide00017. To generate custom plots comparing different statistics across several languages and datasets, you may want to start with this notebook.

Compiling giashard

git clone git@github.com:paracrawl/giashard.git

CGO_ENABLED=0 go build \
  -o giashard-static \
  -a -ldflags '-extldflags "-static"' \
  github.com/paracrawl/giashard/cmd/giashard

CGO_ENABLED=0 go build \
  -o giamerge-static \
  -a -ldflags '-extldflags "-static"' \
  github.com/paracrawl/giashard/cmd/giamerge

Running giashard:

cd path/to/data

./giashard.sh wide00016 mt

That will create a wide00016-shards/mt folder in theory.
