KIP_EinfachErklaert

General

This project was developed as part of the "KI-Projekt" course during the summer term of 2024 at OTH Regensburg by Ben, Felix, Lears and Simon. It is designed to be used for scientific research. The goal of the project is to scrape and match German news articles from sources that provide content in both easy language (German: "Leichte Sprache" or "Einfache Sprache") and standard language. For simplicity, we refer to the articles as easy or hard. Currently supported sources for the articles are:

Nachrichtenleicht, Nachrichtenleicht on Instagram (easy) and Deutschlandfunk (hard)
MDR: Leichte Sprache (easy) and MDR (hard)

The code is built modularly. Main modules are:

Scrapers: scrape the data from the sources
DataHandler: manages the scraped data uniformly and provides an interface for reading, writing, and searching the data
Matchers: match corresponding articles one-to-one (easy to hard) within the same source. In the future, they may also match individual sentences or audio
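The README does not describe how the matchers decide that two articles correspond, so the following is only an illustrative sketch of one-to-one matching, not the project's method. It assumes that articles are compared by title, uses `difflib.SequenceMatcher` from the Python standard library as the similarity measure, and resolves conflicts greedily. The function name `match_one_to_one` and the `threshold` parameter are hypothetical.

```python
from difflib import SequenceMatcher


def match_one_to_one(easy_titles, hard_titles, threshold=0.5):
    """Greedily pair each easy title with at most one hard title.

    Hypothetical helper: title similarity is an assumed criterion,
    not the matching method used by the project.
    """
    scored = []
    for i, easy in enumerate(easy_titles):
        for j, hard in enumerate(hard_titles):
            score = SequenceMatcher(None, easy.lower(), hard.lower()).ratio()
            if score >= threshold:
                scored.append((score, i, j))

    # Best-scoring candidates first, so strong matches claim their partner early.
    scored.sort(reverse=True)

    used_easy, used_hard, pairs = set(), set(), []
    for score, i, j in scored:
        if i in used_easy or j in used_hard:
            continue  # each article may appear in at most one pair
        used_easy.add(i)
        used_hard.add(j)
        pairs.append((i, j, score))
    return pairs
```

Each returned tuple holds the index of the easy article, the index of the hard article, and their similarity score. Greedy selection is simple but not globally optimal; an assignment algorithm would be the usual alternative if that matters.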

Modules may be used individually as needed. The current simplified pipeline is:

Data Structure of Scraped Data

The project uses a custom data structure consisting of folders and files (txt, json, csv, html, mp3) to store the scraped data. The data is stored in the git root directory as follows:

data/
├── <source>/ (dlf or mdr)
│   ├── matches_<source>.csv
│   ├── <language niveau>/ (easy or hard)
│   │   ├── lookup_<source>_<niveau>.csv
│   │   ├── 2023-06-01-Sample_Article/
│   │   │   ├── Metadata.json
│   │   │   ├── Content.txt
│   │   │   ├── Raw.html
│   │   │   └── Audio.mp3 (if available)

At runtime, the data can be read into Pandas DataFrames using the DataHandler's read functionality.
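To make the folder layout above concrete, here is a minimal sketch that walks one `data/<source>/<niveau>/` directory using only the Python standard library. It does not use the DataHandler, whose API is not shown in this README; the function name `read_articles` and the keys of the returned dictionaries are hypothetical. The only assumption about the files is that `Metadata.json` contains valid JSON.

```python
import json
from pathlib import Path


def read_articles(data_root, source, niveau):
    """Collect one dict per article folder under <data_root>/<source>/<niveau>/.

    Hypothetical helper that mirrors the documented layout; it is not
    part of the project's DataHandler.
    """
    base = Path(data_root) / source / niveau
    articles = []
    # Lookup CSV files live next to the article folders, so keep directories only.
    for folder in sorted(p for p in base.iterdir() if p.is_dir()):
        meta_file = folder / "Metadata.json"
        text_file = folder / "Content.txt"
        if not (meta_file.exists() and text_file.exists()):
            continue  # skip incomplete article folders
        articles.append({
            "folder": folder.name,
            "metadata": json.loads(meta_file.read_text(encoding="utf-8")),
            "text": text_file.read_text(encoding="utf-8"),
            "has_audio": (folder / "Audio.mp3").exists(),
        })
    return articles
```

For example, `read_articles("data", "dlf", "easy")` would return a list of dictionaries, which `pandas.DataFrame(...)` accepts directly if a DataFrame is wanted.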

Developer Guide

Installation

git clone https://github.com/larsaars/KIP_EinfachErklaert.git
cd KIP_EinfachErklaert
pip install -r requirements.txt

Scrapers

The scrapers are designed to be executed on a regular basis (e.g., by weekly cron jobs on a server). The following table lists the most important scrapers with a short explanation of each:

| File | Functionality |
| --- | --- |
| `scrapers/dlf/scrape_Deutschlandfunk.py` | Scrapes last week's articles from Deutschlandfunk (hard) |
| `scrapers/dlf/scrape_Nachrichtenleicht.py` | Scrapes last week's articles from Nachrichtenleicht (easy) |
| `scrapers/dlf/scrape_InstaCaptions.py` | Scrapes captions of all posts on the "nachrichtenleicht" Instagram profile and analyzes images for titles |
| `scrapers/mdr/current_news_scraper.py` | Scrapes current easy and hard articles from MDR |
| `scrapers/mdr/historic_news_scraper.py` | Scrapes historic easy and hard articles from MDR |
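Two of the scrapers above collect "last week's articles", which pairs naturally with a weekly schedule. The sketch below shows one way such a date window could be computed. It assumes "last week" means the seven days before the day of the run; the scrapers may define it differently (for example, as the previous calendar week). The function names `last_week_window` and `published_last_week` are hypothetical.

```python
from datetime import date, timedelta


def last_week_window(today):
    """Return (start, end) for the seven days before `today`, both inclusive.

    Assumption: a rolling seven-day window, not a calendar week.
    """
    end = today - timedelta(days=1)
    start = today - timedelta(days=7)
    return start, end


def published_last_week(published, today):
    """Check whether a publication date falls inside the window."""
    start, end = last_week_window(today)
    return start <= published <= end
```

With a rolling window like this, a job that runs once every seven days covers each day exactly once, so no article is scraped twice or missed as long as no run is skipped.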

DataHandler

The DataHandler is not an executable but a module to use when developing scrapers or matchers further and when dealing with data storage (read, write, search). Examples of how to use the DataHandler can be found here.

Matchers

Work in progress.
