
A 6 GiB corpus of Uruguayan press from early 2000 to late 2022.

pln-fing-udelar/uy22

🇺🇾 UY22 corpus

This repo contains the notebooks used to scrape the following Uruguayan media sites:

  • El Observador (elobservador.com.uy)
  • El País (elpais.com.uy)
  • La Diaria (ladiaria.com.uy)
  • Montevideo Portal (montevideo.com.uy)

Every article scraped was stored as a .json file with the following structure:

{
    "url":      string,
    "id":       int,
    "date":     string,
    "category": string,
    "title":    string,
    "keywords": []string,
    "cover":    string,
    "body":     string,
}

where

  • url: URL pointing to original article
  • id: numeric ID (the article's own ID if one exists, otherwise a random UID)
  • date: article's timestamp
  • category: article's category
  • title: article's title or header
  • keywords: article's tags
  • cover: URL pointing to article's front image (if any)
  • body: article's body
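As a minimal sketch of how a record with this structure could be loaded and sanity-checked, the snippet below compares each field against the types listed above. The helper names `load_article` and `check_article` are hypothetical and not part of this repo, UTF-8 encoding is assumed, and the check is deliberately strict: how a missing cover is encoded is not documented here, so the `cover` rule may need loosening.

```python
import json
from typing import Any, Dict, List

# Field names and types taken from the structure described above.
EXPECTED_FIELDS = {
    "url": str,
    "id": int,
    "date": str,
    "category": str,
    "title": str,
    "keywords": list,
    "cover": str,
    "body": str,
}


def load_article(path: str) -> Dict[str, Any]:
    """Read one article .json file (UTF-8 is an assumption)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def check_article(article: Dict[str, Any]) -> List[str]:
    """Return a list of problems found; an empty list means the record looks fine."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in article:
            problems.append(f"missing field: {name}")
        elif not isinstance(article[name], expected_type):
            found = type(article[name]).__name__
            problems.append(f"{name}: expected {expected_type.__name__}, got {found}")
    return problems
```

Returning a list of problems instead of raising lets a caller scan a whole directory and report every malformed file in one pass.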

Every site is assigned a directory, and every article is stored inside a subdirectory named after its publishing year.

e.g., uy22-raw/ep22/2019/20190101120000-142502-Los_datos_del_Rey_de.json
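The layout above can be walked with a short sketch like the following. The function names are hypothetical, and reading the file name as three dash-separated parts (what look like a timestamp, an ID, and a truncated title) is an assumption drawn from the single example path, not a documented naming scheme.

```python
from pathlib import Path
from typing import Iterator, Optional, Tuple


def iter_article_paths(site_dir: Path, year: Optional[int] = None) -> Iterator[Path]:
    """Yield the .json files under <site_dir>/<year>/, in sorted order.

    With year=None every year subdirectory of site_dir is visited.
    """
    if year is not None:
        year_dirs = [site_dir / str(year)]
    else:
        year_dirs = sorted(p for p in site_dir.iterdir() if p.is_dir())
    for year_dir in year_dirs:
        if year_dir.is_dir():
            yield from sorted(year_dir.glob("*.json"))


def split_file_name(path: Path) -> Tuple[str, str, str]:
    """Split '<first>-<second>-<rest>.json' at the first two dashes.

    Assumption: the parts are timestamp, ID and truncated title.
    """
    first, second, rest = path.stem.split("-", 2)
    return first, second, rest
```

For instance, `iter_article_paths(Path("uy22-raw/ep22"), year=2019)` would yield only the files stored under the `2019` directory.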

For every corpus, there are two versions available:

  1. raw: body contains the article's raw, unprocessed HTML
  2. clean: body contains only the text, without HTML tags
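To illustrate the difference between the two versions, here is a rough sketch that turns a raw HTML body into plain text using only Python's standard library. It is not the cleaning procedure used to build the clean corpus, which is not described here; the `strip_html` name and the choice of block-level tags are assumptions made for the example.

```python
from html.parser import HTMLParser

# Tags after which a space is inserted so adjacent paragraphs do not fuse.
BLOCK_TAGS = {"p", "br", "div", "li", "h1", "h2", "h3", "h4", "h5", "h6"}


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._chunks.append(" ")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self._chunks.append(" ")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse every run of whitespace into a single space.
        return " ".join("".join(self._chunks).split())


def strip_html(raw_body: str) -> str:
    """Return the visible text of an HTML fragment."""
    parser = _TextExtractor()
    parser.feed(raw_body)
    parser.close()
    return parser.text()
```

A dedicated HTML library would handle malformed markup more robustly; the standard-library parser is used here only to keep the sketch dependency-free.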

The raw and clean versions are about 6 GiB and 4 GiB respectively (10.3 GiB in total) and can be downloaded from here or here.

For every site there is also a unified+splitted version that holds all of its articles in a single .txt file (2.4 GiB in total). Splitted means that every line contains a single sentence, and unified means that articles are separated from each other by a blank line. The splitting was done with pln-fing-udelar/sentence-splitter.

542M dic 28 20:47 ep22-unified-splitted.txt
876M dic 27 23:04 eo22-unified-splitted.txt
854M dic 27 18:58 mp22-unified-splitted.txt
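Given the one-sentence-per-line, blank-line-between-articles format, a file like the ones listed above could be read back into articles with a small generator. This is a sketch only: `iter_articles` is a hypothetical helper, and UTF-8 encoding is assumed in the usage comment.

```python
from typing import Iterable, Iterator, List


def iter_articles(lines: Iterable[str]) -> Iterator[List[str]]:
    """Group lines into articles: one sentence per line, blank line between articles."""
    sentences: List[str] = []
    for line in lines:
        sentence = line.strip()
        if sentence:
            sentences.append(sentence)
        elif sentences:
            # A blank line closes the current article.
            yield sentences
            sentences = []
    if sentences:
        # The last article may not be followed by a blank line.
        yield sentences


# Usage (streams the file, so it never has to fit in memory):
# with open("ep22-unified-splitted.txt", encoding="utf-8") as handle:
#     for article in iter_articles(handle):
#         print(len(article), "sentences")
```

Because the generator only ever holds one article at a time, it works on the full-size files without loading hundreds of megabytes at once.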

The concatenation of these files was used to train a RoBERTa-like LM with the HuggingFace library, and is available at huggingface.co/datasets/pln-udelar/uy22 or at archive.org.
