This repo is for my article with Analytics Vidhya. In this project, we organize a set of Wikipedia articles, fetched with the Wikipedia Python library, into groups (or clusters) of similar articles.

inuwamobarak/document-clustering

Document Clustering with Python Using Wikipedia articles

Link to project article: Pending...

Wikipedia-logo-textonly

Problem Statement

After business hours, databases and storage facilities can easily become crowded with papers and files, making manual processing laborious and time-consuming. This makes it necessary to use machine learning techniques to automate the task. This study uses clustering analysis to address the problem, with a sample of Wikipedia articles as the data.

Prerequisites: Knowledge of clustering techniques or prior experience with the elbow method is not required to complete this project, but both are a great advantage.

Approach

We will use the Wikipedia Python library, which provides easy access to Wikipedia pages. We work with 14 articles, using the elbow method (a heuristic) to choose the number of clusters and then the K-Means algorithm to cluster the articles.
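The elbow method and K-Means step described above can be sketched roughly as follows. This is a minimal illustration, not the article's actual code: the six short texts are hypothetical stand-ins for the Wikipedia articles, TF-IDF is an assumed vectorization choice, and `inertia_by_k` and `chosen_k` are names made up for this example.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in texts; the project itself uses 14 Wikipedia articles.
docs = [
    "Banks lend money and charge interest on loans.",
    "Interest rates affect loans, savings and money markets.",
    "Farmers plant crops and harvest grain from the soil.",
    "Crops such as wheat and maize grow in fertile soil.",
    "Schools employ teachers to educate students.",
    "Students learn from teachers in schools and universities.",
]

# Turn the texts into TF-IDF vectors (an assumed vectorization choice).
matrix = TfidfVectorizer(stop_words="english").fit_transform(docs)


def inertia_by_k(matrix, k_values, seed=42):
    """Fit K-Means once per candidate k and record the inertia
    (within-cluster sum of squared distances)."""
    inertias = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(matrix)
        inertias[k] = model.inertia_
    return inertias


# Elbow method: look for the k after which inertia stops dropping sharply.
inertias = inertia_by_k(matrix, range(1, 6))
for k, value in inertias.items():
    print(k, round(value, 3))

# Assumed choice for this toy data; in practice, read it off the elbow plot.
chosen_k = 3
labels = KMeans(n_clusters=chosen_k, n_init=10, random_state=42).fit_predict(matrix)
print(labels)
```

Plotting the inertia values against k (for example with matplotlib) makes the elbow easier to spot than reading the printed numbers.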

Dataset Description

The dataset contains 14 articles on the following topics:

  • Analytics
  • Lawsuit
  • Military
  • Economy
  • Health
  • Education
  • Food
  • Languages
  • Africa
  • Countries
  • Finance
  • Earth
  • Agriculture
  • Plants

All the articles are imported and first converted into numeric vectors so that the clustering algorithm can process them.
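As a rough sketch of the import and vectorization step, the snippet below fetches page text with the `wikipedia` package and converts text to TF-IDF vectors with scikit-learn. Several things here are assumptions rather than the article's confirmed method: the use of TF-IDF, the helper names `fetch_articles` and `vectorize`, and the idea that each topic name resolves directly to a page title (ambiguous titles may raise an error in the `wikipedia` package).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# The 14 topics listed above, assumed here to be usable as page titles.
TOPICS = [
    "Analytics", "Lawsuit", "Military", "Economy", "Health",
    "Education", "Food", "Languages", "Africa", "Countries",
    "Finance", "Earth", "Agriculture", "Plants",
]


def fetch_articles(topics):
    """Download the plain-text content of one Wikipedia page per topic.

    Needs network access and the third-party package (pip install wikipedia).
    """
    import wikipedia  # imported lazily so the rest of the sketch runs offline

    return [wikipedia.page(topic).content for topic in topics]


def vectorize(texts):
    """Convert raw texts into a sparse TF-IDF matrix, one row per text."""
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(texts)
    return vectorizer, matrix


# Offline demonstration with hypothetical stand-in texts.
sample_texts = [
    "Agriculture is the practice of cultivating plants and livestock.",
    "Finance is the study of money, currency and capital assets.",
    "Education is the transmission of knowledge and skills.",
]
vectorizer, matrix = vectorize(sample_texts)
print(matrix.shape)  # (number of texts, vocabulary size)
```

With network access, `vectorize(fetch_articles(TOPICS))` would produce one row per article, ready to pass to a clustering model.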

Feel free to follow me and ask questions:

https://twitter.com/InuwaAbraham

https://www.linkedin.com/in/mobarak-inuwa/

https://www.analyticsvidhya.com/blog/author/inuwamobarak/

https://mobarak.mystrikingly.com/

References/Links:

Image Source: By Wikimedia Foundation — Wikimedia Foundation, Public Domain, https://commons.wikimedia.org/w/index.php?curid=12611181

Sklearn K-Means Clustering: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#

Python Wikipedia Package: https://pypi.org/project/wikipedia/
