
This is a small project for Big Data Computing course, applying Dimensionality Reduction, Sampling and Clustering for topic detection in text documents.


AlessandraMonaco/Topic-Detection-for-Big-Data


TOPIC DETECTION FOR BIG DATA

Introduction

Topic detection aims to identify the topics discussed in a corpus of text documents and to describe each topic through a set of keywords that make it understandable. Several approaches are known for this purpose: probabilistic generative models such as topic modeling, soft clustering techniques such as Gaussian Mixture Models, and hard clustering algorithms such as K-Means. This project implements text clustering using K-Means++ on a large dataset. The goal is to improve the efficiency of the K-Means algorithm without losing effectiveness.

Dataset

The dataset is not available in this repository because of its size. It contains 314,808 reviews, mostly in English. Cluster labels were available for the evaluation, which was performed using the accuracy metric.
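Accuracy is not directly defined for clustering, because cluster ids are arbitrary and must first be matched to the true labels. The sketch below shows one common way to do this: try every one-to-one mapping from cluster ids to labels and keep the best score. This mapping step is an assumption, not necessarily how the project computed accuracy, and the function name `clustering_accuracy` and the toy labels are hypothetical.

```python
from itertools import permutations


def clustering_accuracy(true_labels, cluster_ids):
    """Best accuracy over all one-to-one mappings of cluster ids to labels."""
    labels = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    if len(clusters) > len(labels):
        raise ValueError("more clusters than labels: no one-to-one mapping exists")
    best = 0.0
    # Exhaustive search is fine for a handful of clusters (two in this project).
    for perm in permutations(labels, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[c] == t for t, c in zip(true_labels, cluster_ids))
        best = max(best, hits / len(true_labels))
    return best


# Made-up example: five reviews, two clusters.
true_labels = ["pet", "pet", "baby", "baby", "pet"]
cluster_ids = [1, 1, 0, 0, 0]
accuracy = clustering_accuracy(true_labels, cluster_ids)
print(accuracy)
```

For a larger number of clusters the exhaustive search becomes expensive, and an assignment solver would be a better fit.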

Methodology

We compared the results obtained through different approaches:

  • Standard K-Means on the original dataset, which took about 1 hour of computation
  • Dimensionality reduction through Truncated SVD, followed by standard K-Means
  • Mini-batch K-Means on the original dataset
  • Random sampling (10% of the original dataset), followed by dimensionality reduction and standard K-Means
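The four approaches listed above can be sketched with scikit-learn as follows. This is a minimal illustration, not the project's code: the synthetic corpus, the TF-IDF representation, and every parameter value (`n_components=5`, `batch_size=64`, the random seeds) are assumptions chosen so the example runs in a moment on a toy dataset.

```python
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the real reviews.
rng = np.random.default_rng(0)
pet_words = ["dog", "cat", "leash", "kibble", "collar", "litter", "chew", "puppy"]
baby_words = ["baby", "diaper", "stroller", "crib", "bottle", "pacifier", "infant", "nursery"]
docs = [
    " ".join(rng.choice(pet_words if i % 2 == 0 else baby_words, size=6))
    for i in range(200)
]

X = TfidfVectorizer().fit_transform(docs)  # sparse TF-IDF matrix (assumed features)
k = 2


def make_kmeans():
    # K-Means with k-means++ initialization.
    return KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)


# 1. Standard K-Means on the full matrix.
labels_full = make_kmeans().fit_predict(X)

# 2. Truncated SVD, then standard K-Means on the reduced matrix.
X_svd = TruncatedSVD(n_components=5, random_state=0).fit_transform(X)
labels_svd = make_kmeans().fit_predict(X_svd)

# 3. Mini-batch K-Means on the full matrix.
labels_minibatch = MiniBatchKMeans(
    n_clusters=k, batch_size=64, n_init=3, random_state=0
).fit_predict(X)

# 4. 10% random sample, then Truncated SVD and standard K-Means on the sample.
sample_idx = rng.choice(X.shape[0], size=X.shape[0] // 10, replace=False)
X_sample_svd = TruncatedSVD(n_components=5, random_state=0).fit_transform(X[sample_idx])
labels_sample = make_kmeans().fit_predict(X_sample_svd)
```

Approaches 2 to 4 trade a little fidelity for speed in different ways: SVD shrinks the number of features, mini-batches shrink the work per iteration, and sampling shrinks the number of documents.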

Results

Clusters were analyzed by printing the most relevant keywords contained in the centroids and through WordCloud visualization. All the alternative techniques gave results very similar to standard K-Means in terms of centroid information and accuracy, and proved to be much faster than standard K-Means on the original dataset.
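Reading keywords off a centroid amounts to ranking the vocabulary by the centroid's weight for each term. The sketch below shows the idea in plain Python; the helper `top_keywords`, the vocabulary, and the centroid values are all made up for illustration. With scikit-learn, the inputs would typically come from `KMeans.cluster_centers_` and `TfidfVectorizer.get_feature_names_out()`, and centroids computed in an SVD-reduced space would first need to be mapped back to term space (for example with `TruncatedSVD.inverse_transform`).

```python
def top_keywords(centroids, vocabulary, n_top=3):
    """Return the n_top highest-weighted terms for each centroid."""
    result = []
    for centroid in centroids:
        # Rank term indices by descending centroid weight.
        ranked = sorted(range(len(vocabulary)), key=lambda j: centroid[j], reverse=True)
        result.append([vocabulary[j] for j in ranked[:n_top]])
    return result


# Hypothetical vocabulary and centroid weights (one row per cluster).
vocabulary = ["bottle", "collar", "crib", "diaper", "dog", "leash"]
centroids = [
    [0.02, 0.41, 0.00, 0.01, 0.63, 0.38],
    [0.35, 0.00, 0.44, 0.58, 0.03, 0.01],
]
keywords = top_keywords(centroids, vocabulary)
for cluster_id, words in enumerate(keywords):
    print(cluster_id, words)
```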

We obtained one cluster of reviews for pet products and one cluster of reviews for baby products.

More details are described in the report file.
