
Building a Scalable RSS Feed Pipeline with Apache Airflow, Kafka, MongoDB, and a Flask API

Python 3.8

About

In today's data-driven world, processing large volumes of data in real time has become essential for many organizations. The Extract, Transform, Load (ETL) process is a common way to manage the flow of data between systems. In this article, we'll walk through how to build a scalable ETL pipeline using Apache Airflow, Kafka, Python, MongoDB, and Flask.

Architecture

(Architecture diagram)

Scenario

In this pipeline, the RSS feeds are scraped with the Python library feedparser, which parses the XML in each feed and extracts the relevant information. The parsed data is then transformed into a standardized JSON format using Python's built-in json library. This format includes fields such as title, summary, link, published_date, and language, which make the data easier to analyze and consume.
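To make this concrete, here is a minimal sketch of the scrape-and-transform step. The feed URL and the transform_entry helper are illustrative only; the actual scraping code lives in the project's DAG.

```python
import json
import feedparser

# Illustrative feed URL; the real pipeline reads its feed list from configuration.
FEED_URL = "https://example.com/rss"

def transform_entry(entry, language="en"):
    """Map a feedparser entry to the standardized JSON format used by the pipeline."""
    return {
        "title": entry.get("title"),
        "summary": entry.get("summary"),
        "link": entry.get("link"),
        "published_date": entry.get("published"),
        "language": language,
    }

feed = feedparser.parse(FEED_URL)
records = [transform_entry(entry) for entry in feed.entries]
print(json.dumps(records, indent=2))
```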

Prerequisites

- Docker and Docker Compose
- Git
- Python 3.8 (to run the Flask API locally)

Setup

Clone the project to your desired location:

$ git clone https://github.com/Stefen-Taime/Scalable-RSS-Feed-Pipeline.git

Execute the following command to create the .env file containing the Airflow UID needed by docker-compose:

$ echo -e "AIRFLOW_UID=$(id -u)" > .env

Build the Docker images:

$ docker-compose build 

Initialize Airflow database:

$ docker-compose up airflow-init

Start Containers:

$ docker-compose up -d

Once everything is up, you can check that all the containers are running:

$ docker ps

Airflow Interface

You can now access the Airflow web interface at http://localhost:8080 with the default credentials defined in docker-compose.yml (username/password: airflow). From there, trigger the DAG and watch all the tasks run.

(Screenshot: DAG run in the Airflow UI)
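The DAG code ships with the repository; purely as an illustration, a minimal daily DAG that scrapes a feed and publishes each entry to Kafka might look like the sketch below. The DAG id, topic name, and broker address are assumptions.

```python
import json
from datetime import datetime

import feedparser
from airflow import DAG
from airflow.operators.python import PythonOperator
from kafka import KafkaProducer  # kafka-python client

FEED_URL = "https://example.com/rss"   # illustrative feed
KAFKA_TOPIC = "rss_feeds"              # assumed topic name
KAFKA_BOOTSTRAP = "localhost:9092"     # assumed broker address

def scrape_and_publish():
    """Parse the feed and push each entry to Kafka as standardized JSON."""
    producer = KafkaProducer(
        bootstrap_servers=KAFKA_BOOTSTRAP,
        value_serializer=lambda value: json.dumps(value).encode("utf-8"),
    )
    feed = feedparser.parse(FEED_URL)
    for entry in feed.entries:
        producer.send(KAFKA_TOPIC, {
            "title": entry.get("title"),
            "summary": entry.get("summary"),
            "link": entry.get("link"),
            "published_date": entry.get("published"),
            "language": "en",
        })
    producer.flush()

with DAG(
    dag_id="rss_feed_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="scrape_and_publish",
        python_callable=scrape_and_publish,
    )
```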

Note: start the mongo-kafka docker-compose stack (see the next section) before triggering the DAG.

Setup Kafka and Mongo

Navigate to the mongo-kafka directory:

$ cd mongo-kafka

$ docker-compose up -d

Execute the following command to create the MongoDB sink connector:

$ curl -X POST \
  -H "Content-Type: application/json" \
  --data @mongo-sink.json \
  http://localhost:8083/connectors
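mongo-sink.json is included in the repository. For reference, the same registration can also be done from Python; the configuration below is only a sketch of what a MongoDB sink connector typically contains, with the topic name assumed and the database, collection, and credentials taken from the commands in this README.

```python
import json
import requests

# Illustrative sink connector config; mirror the values from mongo-sink.json in the repo.
connector = {
    "name": "mongo-sink",
    "config": {
        "connector.class": "com.mongodb.kafka.connect.MongoSinkConnector",
        "topics": "rss_feeds",  # assumed topic name
        "connection.uri": "mongodb://debezium:dbz@mongo:27017/?authSource=admin",
        "database": "demo",
        "collection": "rss_feeds_collection",
        "key.converter": "org.apache.kafka.connect.storage.StringConverter",
        "value.converter": "org.apache.kafka.connect.json.JsonConverter",
        "value.converter.schemas.enable": "false",
    },
}

response = requests.post(
    "http://localhost:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
print(response.status_code, response.text)
```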



Open a mongo shell to verify that the feed documents were written:

$ docker exec -ti mongo mongo -u debezium -p dbz --authenticationDatabase admin localhost:27017/demo

> show collections;
> db.rss_feeds_collection.find();

(Screenshot: documents in rss_feeds_collection)

Run the API

$ python api.py
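api.py is part of the repository; as a rough sketch only, a minimal Flask endpoint serving the stored feeds could look like this. The /feeds route is hypothetical, and the connection details simply mirror the mongo shell command above.

```python
from flask import Flask, jsonify
from pymongo import MongoClient

app = Flask(__name__)

# Connection details mirror the docker exec command above; adjust as needed.
client = MongoClient("mongodb://debezium:dbz@localhost:27017/?authSource=admin")
collection = client["demo"]["rss_feeds_collection"]

@app.route("/feeds")
def get_feeds():
    """Return all stored RSS feed documents as JSON, dropping Mongo's _id field."""
    return jsonify(list(collection.find({}, {"_id": 0})))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```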

(Screenshot: Flask API response)
