Dataset of Influenza Incidence and Wikipedia Pagecounts (and Pageviews).

fluTN/influenza-wikipedia-dataset

Influenza and Wikipedia Dataset

Data Description

This dataset records influenza-like illness (ILI) activity levels in several European countries, from the 2007-2008 influenza season through the 2018-2019 season. It also includes Wikipedia pageviews and pagecounts data extracted for a set of selected pages.

The directories are named as follows:

  • wikipedia_{country}: contain the pageviews/pagecounts data for the selected Wikipedia pages. The data are divided by year and aggregated by week. Each file contains the following columns:
    • week: a string of the form year-week_number;
    • several other columns, each named after the Wikipedia page it monitors;
  • {country}: contain the influenza incidence data for the specified country. The incidence data are divided by influenza season (each season spans two calendar years), so the files are named {year}_{year+1}.csv. Each file contains the following columns:
    • week: a string of the form year-week_number;
    • incidence: the incidence of influenza cases per 100,000 people in that week.
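As a minimal sketch of how the naming conventions above could be handled in code, the helpers below split a week label and build a season file name. The sample label `2015-42` and the ISO 8601 interpretation of week numbers are assumptions, not something this README states; the function names are hypothetical.

```python
from datetime import date


def parse_week(label):
    """Split a 'year-week_number' label such as '2015-42' into two integers."""
    year_text, week_text = label.split("-")
    return int(year_text), int(week_text)


def week_start(label):
    """Return the Monday of the labelled week.

    ASSUMPTION: week numbers follow ISO 8601; the dataset may use another scheme.
    """
    year, week = parse_week(label)
    return date.fromisocalendar(year, week, 1)


def season_filename(start_year):
    """Build the {year}_{year+1}.csv name for the season starting in start_year."""
    return f"{start_year}_{start_year + 1}.csv"


print(parse_week("2015-42"))   # (2015, 42)
print(season_filename(2007))   # 2007_2008.csv
```

If the week numbers turn out not to be ISO weeks, only `week_start` needs to change; the label parsing and file naming do not depend on that assumption.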

Moreover, each wikipedia_{country} directory has a further level of subdirectories. The same subdirectories also exist inside the {country} directories, but the split matters only for the Wikipedia data; for the incidence data it was applied only to make the dataset easier to use:

  • complete: contains the entire dataset, obtained by merging the pageviews and pagecounts data;
  • pageviews: contains only the pageviews data (available only from May 2015);
  • pagecounts: contains only the pagecounts data (pagecounts was the first method used to measure traffic on Wikipedia pages). The data here range from 2007 to 2015;
  • cyclerank/pagerank: contain the complete dataset, restricted to a set of pages selected with the CycleRank or PageRank algorithm;
  • cyclerank_pageviews/pagerank_pageviews: contain only the pageviews data (available only from May 2015), restricted to a set of pages selected with the CycleRank or PageRank algorithm.

The only exception is the USA directory, in which the incidence data are provided in a single file called 2007_2013.csv. Moreover, for the USA only the pagecounts data were extracted.
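Because the incidence files and the Wikipedia files share the week column, the two can be joined on it. The following is a rough sketch using only the Python standard library. It assumes comma-separated files with a header row and purely numeric values, and it runs on tiny made-up samples (the page names `Influenza` and `Fever` and all numbers are invented for illustration) rather than on the real files.

```python
import csv
import io


def read_weekly_csv(handle):
    """Read a weekly CSV into {week: {column: value}}.

    ASSUMPTION: comma-separated, header row present, all values numeric.
    """
    rows = {}
    for row in csv.DictReader(handle):
        week = row.pop("week")
        rows[week] = {name: float(value) for name, value in row.items()}
    return rows


def join_on_week(incidence, pageviews):
    """Keep only the weeks present in both tables and merge their columns."""
    joined = {}
    for week, values in incidence.items():
        if week in pageviews:
            joined[week] = {**values, **pageviews[week]}
    return joined


# Made-up samples standing in for one incidence file and one Wikipedia file.
incidence_csv = io.StringIO("week,incidence\n2015-42,1.5\n2015-43,2.0\n")
pageviews_csv = io.StringIO("week,Influenza,Fever\n2015-43,120,80\n2015-44,150,90\n")

merged = join_on_week(read_weekly_csv(incidence_csv), read_weekly_csv(pageviews_csv))
print(merged)
# {'2015-43': {'incidence': 2.0, 'Influenza': 120.0, 'Fever': 80.0}}
```

To use it on the dataset, replace the `io.StringIO` samples with files opened via `open(path, newline="")`; the exact paths depend on the country and subdirectory chosen.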

Other Directories

The keywords directory contains the lists of selected Wikipedia pages. Each file is named keywords_{country}.csv and contains a simple list of all monitored pages. Additional files named keywords_{method}_{country}.csv list the monitored pages that were chosen with the given {method} (e.g. CycleRank or PageRank).
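The keyword file names can be built from the two patterns above. This is a small sketch: the spellings `italy` and `cyclerank` are hypothetical placeholders for whatever country and method names the repository actually uses, and the reader assumes one page title per line with no header row.

```python
import io
from pathlib import PurePosixPath


def keywords_path(country, method=None):
    """Return the relative path of a keywords file, with or without a method."""
    if method is None:
        name = f"keywords_{country}.csv"
    else:
        name = f"keywords_{method}_{country}.csv"
    return PurePosixPath("keywords") / name


def read_page_list(handle):
    """Read page titles. ASSUMPTION: one title per line, no header row."""
    return [line.strip() for line in handle if line.strip()]


print(keywords_path("italy"))                      # keywords/keywords_italy.csv
print(keywords_path("italy", method="cyclerank"))  # keywords/keywords_cyclerank_italy.csv

# Made-up sample standing in for a keywords file.
print(read_page_list(io.StringIO("Influenza\nFever\n")))  # ['Influenza', 'Fever']
```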

License

The influenza incidence values were extracted from several sources:

Licensing information about these datasets is unclear. While the copyright on these data lies with the institutions that produced them, we believe that we can share the data for research purposes. Please refer to the original websites for further information.

The pageviews dataset has been extracted from Wikimedia's pagecounts-raw dataset, which is released into the public domain.

How to cite

  • De Toni, Giovanni, Consonni, Cristian, and Montresor, Alberto. “Influenza activity levels and Wikipedia pageviews 2007-2018.” doi: 10.5281/zenodo.2248501.

Questions?

For further information, send us an email.