wikivisual

The goal of this project is to take a group of Wikipedia pages in the same category, and reateual representation of their importance calculated as Page Rank. Wikipedia has an enormous amount of information that it can be intimidating and difficult to extract immediately the most important pieces of information about a single topic. This project attempts to give users a very quick way to determine what the most essential articles are for a specific Wikipedia category and navigate to their specific points of interest.

I would like to acknowledge Giuseppe Attardi and his wikiextractor project, Jim Vallandignham and his gates_bubbles project and Professor Mikhail Gronas of the Russian Department at Dartmouth and Tim Tregubov of the CS Department at Dartmouth for all of their advice, and Wikipedia for providing me with the dataset that made this project possible.

visualization

In the home directory there is a index.html page that uses d3js to construct a bubble chart visualization. The size of each bubble is based on the Page Rank and colors for each bubble is assigned by the magnitude of the page rank, which is broken down into 3 groups. By clicking on a bubble, you will be taken to the wikipedia page for that article. The visualization is based off the gates_bubbles project. An in depth tutorial about the visualization which I used can be found here.

visualization usage

D3 needs to be run from a webserver due to how it imports data files.

See more here: https://github.com/mbostock/d3/wiki#using

So, to run this visualization locally, from the Terminal, navigate to the directory you checked it out to

 cd ~/code/path/to/wikivisual

Then start a webserver locally. The simpliest way is to python's built in webserver:

 python -m SimpleHTTPServer 3000

You can also use a static web server like fenix and point one of your servers to the wikivisual directory

data

The data folder contains lists of pageIds of specific categories. These lists were generated by simple SQL queries on the categoryLinks SQL table. The CSV file generated by the script is also saved here. The XML dumps and the SQL tables for this project can be found here.

scripts

pageRank.py is a Python script that extracts all of the pages from our data lists and calculates Page Rank for those pages.

extractPage.py is a Python script that extracts a Wikipedia page given the page ID.

The tool is written in Python and requires no additional library.

For further information about extractPage and wikiextractor, see the project Home Page or the Wiki.

This is a beta version that performs template expansion by preprocesssng the whole dump and extracting template definitions. The current version keeps a cache of parsed templates, achieving a speedup of twice over the previous version.

Script Usage

The pageRank.py script is invoked with a Wikipedia dump file as an argument and the data list as the second argument.

usage: extractPage.py input1 input2

positional arguments:
  input1		XML wiki dump file
  input2		Datalist

The extractPage.py script is invoked with a Wikipedia dump file as an argument.

usage: extractPage.py [--id PAGEID] input

positional arguments:
  input                 XML wiki dump file

css

This folder contains all of the CSS files. We use bootstrap in this project

###js

vis.js contains the script for the d3.js visualization. CustomTooltip.js contains the code for the node tooltips. Libs contains of the javascript libraries we are using which include d3.js, jquery, bootstrap, modernizr, and underscore.

Name		Name	Last commit message	Last commit date
Latest commit History 68 Commits
css		css
data		data
img		img
js		js
json		json
scripts		scripts
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
index.html		index.html
index_bubble.html		index_bubble.html

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

css

css

data

data

img

img

js

js

json

json

scripts

scripts

.gitignore

.gitignore

LICENSE

LICENSE

README.md

README.md

index.html

index.html

index_bubble.html

index_bubble.html

Repository files navigation

wikivisual

visualization

visualization usage

data

scripts

Script Usage

css

About

Releases

Packages

Languages

License

jason-feng/wikivisual

Folders and files

Latest commit

History

Repository files navigation

wikivisual

visualization

visualization usage

data

scripts

Script Usage

css

About

Resources

License

Stars

Watchers

Forks

Languages