
Wikipedia Crawler

This app crawls Marathi Wikipedia and stores the pages as JSON files. The JSONs can be used for RAG and LLM fine-tuning. This is still a work in progress, so please expect some bugs.

Overview

This Wikipedia Crawler exposes APIs to crawl Marathi Wikipedia and store the pages as JSON files, which can then be used for RAG and LLM fine-tuning. The app is built with Flask and uses MongoDB for data storage. It is containerized with Docker and can be deployed via GitHub Actions.
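
Since the crawler targets Marathi Wikipedia, page content presumably comes from the public MediaWiki Action API at mr.wikipedia.org. As a rough illustration of the kind of JSON involved (not the app's own endpoints or crawl parameters, which may differ), the following fetches a plain-text extract of one article:

    # Illustrative only: fetch one Marathi Wikipedia article as JSON via the
    # public MediaWiki Action API. The app's own crawl logic may use different
    # parameters or endpoints.
    curl -s -G "https://mr.wikipedia.org/w/api.php" \
      --data-urlencode "action=query" \
      --data-urlencode "prop=extracts" \
      --data-urlencode "explaintext=1" \
      --data-urlencode "titles=महाराष्ट्र" \
      --data-urlencode "format=json" \
      > maharashtra.json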

Features

  • Integration with NGINX and Gunicorn
  • Simplified structure for easy project initiation
  • Use of best practices and recommended plugins
  • Docker integration for easy deployment
  • MongoDB for data storage and Redis for caching
  • GitHub Actions integration for deployment

Getting Started

To get started with this project, follow these steps:

  1. Clone the repository.

    git clone https://github.com/adhishthite/wikipedia-RAG-app.git
  2. Navigate to the repository directory.

    cd wikipedia-RAG-app
  3. Rename the .env-t file to .env and add/update the required environment variables.

    mv .env-t .env
  4. Build and start the services using docker-compose.

    docker-compose up --build
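
Once the containers are up, you can sanity-check the stack. The Flask port (5000 below) and the route are assumptions here; check docker-compose.yml and the Flask app for the actual values.

    # Assumed defaults: adjust the port/route to match docker-compose.yml.
    docker-compose ps
    curl -i http://localhost:5000/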

[WIP]

License

Feedback

I welcome feedback and suggestions. Please feel free to open an issue or submit a pull request.

