Bharat Info

This project aims to take authentic data describing India's demography, geography, sociography, etc. from different (typically government) sources and load it into a combined data warehouse for visualisation and analytics.

Datasets currently planned for inclusion:

National Family Health Survey (NFHS 5) (MoHFW) - Done 100%
Census 2011 (MHA) [Waiting for 2021 to be released]
National Crime Records Bureau (NCRB 2023) (MHA)
Forest Survey of India (FSI 2023) (MoEFCC)

Unfortunately, the data is not clean at all. There are typos in the tables, which make automatic extraction difficult, and the data is typically published in PDF files, which makes extraction a complete nightmare.
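
To make the typo problem concrete, here is a minimal sketch of how a single numeric cell might be repaired with the standard `re` module. The function name `parse_numeric_cell` and the `LOOKALIKES` table are hypothetical and are not taken from this repository; the substitutions shown are assumptions about typical errors, not a list of what the source PDFs actually contain.

```python
import re
from typing import Optional

# Hypothetical table of look-alike characters that are sometimes typed or
# recognised in place of digits; extend it as new cases turn up.
LOOKALIKES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "S": "5"})


def parse_numeric_cell(cell: str) -> Optional[float]:
    """Return the number held in a table cell, or None if it cannot be recovered."""
    text = cell.strip()
    # Drop asterisks used as footnote markers.
    text = text.replace("*", "")
    # Remove thousands separators and stray whitespace inside the number.
    text = re.sub(r"[,\s]", "", text)
    # Swap look-alike letters for the digits they most likely stand for.
    text = text.translate(LOOKALIKES)
    # Accept the result only if it is now a plain signed decimal number.
    if re.fullmatch(r"-?\d+(\.\d+)?", text):
        return float(text)
    return None


print(parse_numeric_cell(" 1,2O3 "))
print(parse_numeric_cell("45.l*"))
print(parse_numeric_cell("NA"))
```

Returning `None` for anything that still does not look like a number keeps doubtful cells visible for manual review instead of silently guessing a value.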

This project is coded in Python, utilising its many libraries for collection, wrangling, cleaning, processing, loading, etc.

The stack:

Scraping using Selenium
Extracting tables from PDFs using Camelot
OCR using PyTesseract
Cleaning using re (regex), NumPy and pandas
Loading into pandas DataFrames and saving as .csv files
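
The last two steps of the stack can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `preproc.py`: the rows are made up to stand in for an extracted table, `clean_table` is a hypothetical helper, and the CSV is written to an in-memory buffer so the example runs without touching the disk.

```python
import io
import re

import pandas as pd

# Made-up rows standing in for a table pulled out of a PDF; real extraction
# output will differ in shape and in the kinds of errors it contains.
raw_rows = [
    ["State ", "Households  surveyed", "Literacy (%)"],
    ["Kerala", "12,330", "96.2"],
    ["  Bihar", "35 834", "7O.9"],  # letter O typed in place of a zero
    ["Goa", "-", "NA"],             # placeholders for missing values
]


def clean_table(rows):
    """Turn raw extracted rows (header first) into a typed DataFrame."""
    header, *body = rows
    # Normalise headers to snake_case: "Literacy (%)" becomes "literacy".
    columns = [re.sub(r"\W+", "_", h.strip().lower()).strip("_") for h in header]
    df = pd.DataFrame(body, columns=columns)
    # The first column holds names, so only trim surrounding whitespace.
    df[columns[0]] = df[columns[0]].str.strip()
    for col in columns[1:]:
        repaired = (
            df[col]
            .str.replace(r"[,\s]", "", regex=True)  # drop separators and spaces
            .str.replace("O", "0", regex=False)     # assumed O-for-zero typo
        )
        # Anything that still is not a number becomes NaN.
        df[col] = pd.to_numeric(repaired, errors="coerce")
    return df


df = clean_table(raw_rows)

# Save as CSV; pass a file path instead of the buffer to write to disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Using `errors="coerce"` turns unreadable cells into `NaN` rather than raising, which suits data where a few bad cells should not stop the whole table from loading.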

The project is planned to shift to PySpark soon. The current scale of data does not warrant a distributed system, but that is expected to change as more and more datasets are added. The objective of this project is to create a super database in which any table can be joined with any other table (to find, for instance, the correlation of crime with deforestation, or of vehicles owned with cases of diabetes).
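
As a rough illustration of the kind of cross-dataset join described above, the sketch below merges two small tables on a shared state column and computes a correlation with pandas. All figures, state labels, and column names are made up for the example; they are not real NCRB or FSI values and do not reflect the project's actual schema.

```python
import pandas as pd

# Illustrative, made-up figures keyed by placeholder state labels.
crime = pd.DataFrame({
    "state": ["A", "B", "C", "D"],
    "crimes_per_lakh": [210.0, 340.0, 150.0, 400.0],
})
forest = pd.DataFrame({
    "state": ["A", "B", "C", "E"],
    "forest_cover_change_pct": [-0.5, -1.2, 0.3, 0.1],
})

# An inner join keeps only the states present in both tables;
# validate="one_to_one" raises if a state appears twice on either side.
joined = crime.merge(forest, on="state", how="inner", validate="one_to_one")

# Pearson correlation between the two indicators across the joined states.
r = joined["crimes_per_lakh"].corr(joined["forest_cover_change_pct"])

print(joined)
print(round(r, 3))
```

The join only works if every dataset spells its key the same way, so consistent state names across sources matter as much as clean numbers. PySpark DataFrames offer a comparable `join` method, so a pandas join like this one should carry over when the project moves to PySpark.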
