
Final project for the Data Science course from the Master's degree in Information Systems and Computer Engineering at Instituto Superior Técnico, Portugal.


diogoViegas/Data-Science-Course-Project


DataScienceCourseProject

This repository contains all the code written for the final project of the Data Science course of the MSc program in Computer Science and Engineering at IST (2019/2020).

The goal of the project was to apply data science techniques to discover information in two distinct problems (datasets). We were expected to explore the datasets and to select and train models suited to the data. Additionally, we were expected to critique the results achieved, hypothesize causes for the limited performance of certain models, and identify opportunities to improve the mining process in a succinct final report.

Project Collaborators:

- André Patrício - https://github.com/Andrempp

- Bernardo Santos - https://github.com/BSantosCoding

- Diogo Viegas

The datasets used are:

- Parkinson's Disease (pd_speech_features.csv). Source data and description: https://archive.ics.uci.edu/ml/datasets/Parkinson%27s+Disease+Classification

- Covertype (covtype.info + covtype.data). Source data and description: https://archive.ics.uci.edu/ml/datasets/Covertype
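As a minimal sketch of how the Covertype data could be read, assuming covtype.data is a headerless comma-separated file of integers with the class label in the last column. The function name `load_covtype` and the sample rows are hypothetical and are not taken from the project's code:

```python
import csv
import io
from collections import Counter


def load_covtype(stream):
    """Parse covtype-style rows into (features, labels) lists.

    Assumption: no header row, integer values only, and the class
    label stored in the last column of each row.
    """
    features, labels = [], []
    for row in csv.reader(stream):
        if not row:  # skip blank lines
            continue
        values = [int(v) for v in row]
        features.append(values[:-1])
        labels.append(values[-1])
    return features, labels


# Tiny in-memory stand-in for the real file (made-up values, only 3 features).
sample = io.StringIO("2596,51,3,5\n2590,56,2,5\n2804,139,9,2\n")
X, y = load_covtype(sample)

# Class balance is usually the first thing worth checking on this kind of data.
class_counts = Counter(y)
print(len(X), "rows; class counts:", dict(class_counts))
```

To run it on the real data, the in-memory sample could be swapped for an open file handle, for example `open("data/covtype.data", newline="")`, where the path is an assumption based on the data folder described in this README.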

The structure of the project is the following:

final_report.pdf - the final report delivered with all the requested analysis.

20190921.EnunciadoProjecto 2019.pdf - project description

course_labs - contains auxiliary notebooks done throughout the semester, used to learn the implementation of certain data science techniques.

data - contains the 2 datasets analyzed throughout this project.

course_project - contains the code implemented for the project, detailed below

course_project

aux_libs - Auxiliary libraries provided by the faculty members and further modified by the students.

clf_tunning - Set of files where we test and plot the best hyperparameters for each classifier used in the solution. The numbers "1" and "2" correspond to "pd_speech_features" and "covtype", respectively. The plots generated by these files are stored in the imgs folder.

imgs - Set of subfolders with images corresponding to the plots drawn by the various files of the project.

pattern_mining - Python files where we explore which preprocessing techniques are adequate to apply.

statistical_analysis - Statistical analysis of both datasets.

results - The final results for all classifiers and clustering techniques used for each dataset.
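The kind of hyperparameter sweep done in clf_tunning can be illustrated with a small, self-contained sketch. It is not the project's code: the choice of k-NN, the leave-one-out evaluation, the helper names and the toy data are all assumptions made for illustration, and plotting is left out.

```python
from collections import Counter


def knn_predict(train, query, k):
    """Return the majority label among the k training points nearest to query.

    train is a list of (feature_vector, label) pairs; distance is squared Euclidean.
    """
    by_distance = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


def loo_accuracy(data, k):
    """Leave-one-out accuracy of k-NN for a given k."""
    hits = 0
    for i, (x, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]  # hold out the i-th point
        hits += knn_predict(rest, x, k) == label
    return hits / len(data)


# Made-up 2-D toy data: two well-separated clusters of four points each.
toy = [
    ((0.0, 0.0), 0), ((0.1, 0.2), 0), ((0.2, 0.1), 0), ((0.3, 0.3), 0),
    ((1.0, 1.0), 1), ((1.1, 0.9), 1), ((0.9, 1.2), 1), ((1.2, 1.1), 1),
]

# Sweep the hyperparameter and keep the score for each candidate value.
scores = {k: loo_accuracy(toy, k) for k in (1, 3, 5, 7)}
best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)
```

With only four points per class, k = 7 forces every held-out point to be outvoted by the other cluster, which is the kind of degradation a tuning plot makes visible.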
