a crawler for Wikipedia (for now only the English pages)

nazaninsbr/Wikipedia-Crawler

Wikipedia Crawler

This crawler starts from the homepage and follows all the links it finds, saving the results in a RethinkDB database. It then counts how many times each word is repeated.
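The two core steps described above (collecting links from a page and counting word repeats) could look roughly like the sketch below. It is not the repository's actual code: `LinkAndTextParser`, `count_words`, and `parse_page` are hypothetical names, only the Python standard library is used, and the network fetch and RethinkDB storage steps are left out on purpose.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTextParser(HTMLParser):
    """Hypothetical helper: collects link targets and visible text from one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text_parts = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links such as /wiki/Python against the page URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is not inside a script or style element.
        if self._skip_depth == 0:
            self.text_parts.append(data)


def count_words(text):
    """Count how often each lowercase word occurs in the text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def parse_page(html, base_url):
    """Return (absolute links, word counts) for a single HTML document."""
    parser = LinkAndTextParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.links, count_words(" ".join(parser.text_parts))
```

A full crawler would call something like `parse_page` on each fetched page, add the returned links to a queue of pages to visit, and store the page text or the counts in the database.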

How to run:

First start the database, then run the code:

rethinkdb
python main.py --db 
python main.py --website

--db counts the word repeats from the data already in the database, while --website first crawls the site, writes the results to the database, and then calculates the word count.
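As an illustration of how two flags like these can be handled, here is a minimal sketch using `argparse`. It is an assumption about the structure, not the actual contents of main.py: `build_parser` and `choose_mode` are hypothetical names, and treating the two flags as mutually exclusive is a guess based on the description above.

```python
import argparse


def build_parser():
    """Build a parser that accepts exactly one of --db or --website."""
    parser = argparse.ArgumentParser(description="Wikipedia crawler (sketch)")
    # Assumption: the two modes are alternatives, so exactly one is required.
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--db", action="store_true",
                      help="count words from data already in the database")
    mode.add_argument("--website", action="store_true",
                      help="crawl, write to the database, then count words")
    return parser


def choose_mode(argv):
    """Map the command-line arguments to a mode name."""
    args = build_parser().parse_args(argv)
    return "crawl_then_count" if args.website else "count_from_db"
```

With this layout, `choose_mode(["--website"])` selects the crawl-first path and `choose_mode(["--db"])` selects counting from stored data; passing neither flag makes `argparse` exit with a usage message.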

Requirements

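You need to have RethinkDB installed. Note that pip installs Python packages only, so the RethinkDB server itself (the rethinkdb command used above) has to be installed separately. The Python dependencies listed in requirements.txt can be installed using: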

pip install -r requirements.txt
