Skip to content

larsaars/KIP_EinfachErklaert

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

KIP_EinfachErklaert

General

This project was developed as part of the "KI-Projekt" course during the summer term of 2024 at OTH Regensburg by Ben, Felix, Lears and Simon. It is designed to be used for scientific research. The goal of the project is to scrape and match german news articles from sources that provide content in both easy (in german: "leichte" or "einfache Sprache") and standard language. For simplicity, we refer to the articles as easy or hard. Currently supported sources for the articles are:

The code is built modularly. Main modules are:

  • Scrapers: scrape the data from the sources
  • DataHandler: manages the scraped data uniformly and provides an interface for reading, writing, and searching the data
  • Matchers: matches corresponding articles one on one (easy to standard) from the same source. In the future they may also match individual sentences or audio

Modules may be used individually as needed. The current simplified pipeline is:

Pipeline

Data Structure of Scraped Data

The project uses a custom data structure consisting of folders and files (txt, json, csv, html, mp3) to store the scraped data. The The data is stored in the git root directory like:

data/
├── <source>/ (dlf or mdr)
│   ├── matches_<source>.csv
│   ├── <language niveau>/ (easy or hard)
│   │   ├── lookup_<source>_<niveau>.csv
│   │   ├── 2023-06-01-Sample_Article/
│   │   │   ├── Metadata.json
│   │   │   ├── Content.txt
│   │   │   ├── Raw.html
│   │   │   └── Audio.mp3 (if available)

On runtime the data can be read into Pandas DataFrames with the DataHandler read capability.

Developer Guide

Installation

git clone https://github.com/larsaars/KIP_EinfachErklaert.git
cd KIP_EinfachErklaert
pip install -r requirements.txt

Scrapers

The scrapers are designed to be executed on a regular basis (e.g., by weekly cron jobs on a server). The following table shows the most important scrapers with a short explanation and frequency:

File Functionality
scrapers/dlf/scrape_Deutschlandfunk.py Scrapes last week's articles from Deutschlandfunk (hard)
scrapers/dlf/scrape_Nachrichtenleicht.py Scrapes last week's articles from Nachrichtenleicht (easy)
scrapers/dlf/scrape_InstaCaptions.py Scrapes captions of all posts on the "nachrichtenleicht" Instagram profile and analyzes images for titles
scrapers/mdr/current_news_scraper.py Scrapes current easy and hard articles from MDR
scrapers/mdr/historic_news_scraper.py Scrapes historic easy and hard articles from MDR

DataHandler

The DataHandler is not an executable but a module to use when further developing scrapers or matcher and dealing with data storage (read, write search). Examples how to use the DataHandler can be found here.

Matchers

Work in progress.