Parkinson's Disease Prediction Using Azure ML

Summary

In this project, we use the Parkinson's Disease dataset, which contains biomedical voice measurements from a number of people, to predict whether a person has the disease. We compare the accuracy of a HyperDrive run with tuned hyperparameters against an AutoML run on Microsoft Azure. The prediction is binary: "0" for healthy and "1" for those with the disease. After comparing the performance of both approaches, we deploy the best performing model, which can then be consumed through the generated REST endpoint.

Diagram

Dataset Overview

The dataset was created by Max Little of the University of Oxford, in collaboration with the National Centre for Voice and Speech, Denver, Colorado, who recorded the speech signals. The original study published the feature extraction methods for general voice disorders.

Parkinson's disease is a brain disorder that affects the nervous system, causing tremors, stiffness, and slowed or impaired movement. It develops when nerve cell damage in the brain causes dopamine levels to drop. Prompt medication can help control the symptoms.

It is a multivariate dataset containing a range of biomedical voice measurements from 31 people, 23 of whom had Parkinson's disease. Each column is a particular voice measure, and the rows correspond to 195 voice recordings from these individuals.

parkinson's stages

Citation:

Little MA, McSharry PE, Roberts SJ, Costello DAE, Moroz IM. 'Exploiting Nonlinear Recurrence and Fractal Scaling Properties for Voice Disorder Detection', BioMedical Engineering OnLine 2007, 6:23 (26 June 2007).

Task:

Since this is a classification task with a binary output, the column "status" is used to determine whether a person is healthy, denoted by "0", or has Parkinson's disease, denoted by "1".

Attributes:

  • Matrix column entries (attributes):
  • name - ASCII subject name and recording number
  • MDVP:Fo(Hz) - Average vocal fundamental frequency
  • MDVP:Fhi(Hz) - Maximum vocal fundamental frequency
  • MDVP:Flo(Hz) - Minimum vocal fundamental frequency
  • MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP - Several measures of variation in fundamental frequency
  • MDVP:Shimmer,MDVP:Shimmer(dB),Shimmer:APQ3,Shimmer:APQ5,MDVP:APQ,Shimmer:DDA - Several measures of variation in amplitude
  • NHR,HNR - Two measures of ratio of noise to tonal components in the voice
  • status - Health status of the subject (one) - Parkinson's, (zero) - healthy
  • RPDE,D2 - Two nonlinear dynamical complexity measures
  • DFA - Signal fractal scaling exponent
  • spread1,spread2,PPE - Three nonlinear measures of fundamental frequency variation

Access to the dataset:

The dataset is available in ASCII CSV format and has been provided in this repository here.

Automated ML Run

First, we used TabularDatasetFactory to create a dataset from the provided link. We then split it into train and test sets and uploaded them to the datastore. Finally, we defined the task using the configuration shown below.
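As a rough sketch of this dataset-creation step (the raw CSV URL, split ratio and random seed below are placeholders rather than values taken from the project), the code might look like:

from azureml.data.dataset_factory import TabularDatasetFactory

# Placeholder: URL of the raw parkinsons.csv provided in this repository
data_url = "<raw CSV URL from this repository>"

# Create a TabularDataset directly from the delimited file
dataset = TabularDatasetFactory.from_delimited_files(path=data_url)

# Split into train and test sets; the 80/20 ratio and seed are assumptions
training_data, test_data = dataset.random_split(percentage=0.8, seed=42)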

  • Below are the settings used for the AutoML run:
from azureml.train.automl import AutoMLConfig

# AutoML settings: 5-fold cross-validation, 20-minute experiment timeout,
# accuracy as the primary metric, and up to 4 concurrent iterations
automl_settings = {
    "n_cross_validations": 5,
    "experiment_timeout_minutes": 20,
    "primary_metric": "accuracy",
    "max_concurrent_iterations": 4,
}

# Binary classification on the "status" label, with early stopping enabled
# and all available cores used per iteration
automl_config = AutoMLConfig(
    task="classification",
    compute_target=compute,
    enable_early_stopping=True,
    max_cores_per_iteration=-1,
    training_data=training_data,
    label_column_name="status",
    **automl_settings
)
  • Below are the definitions of the settings above and the reasons we chose them for our AutoML run:

automlsettings
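As a minimal sketch of how this configuration is submitted as an experiment (the experiment name and the ws workspace object are assumptions taken from the notebook context):

from azureml.core import Experiment

# "ws" is assumed to be the already-loaded Workspace object from the notebook
experiment = Experiment(workspace=ws, name="parkinsons-automl")  # name is a placeholder

# Submit the AutoML configuration defined above and wait for it to finish
remote_run = experiment.submit(automl_config)
remote_run.wait_for_completion(show_output=True)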

Results of AutoML Run

After submission, we found that the VotingEnsemble algorithm produced the best model, with an accuracy of 0.97906, a precision_score_weighted of 0.99740 and a precision_score_micro of 0.99580. Enabling automatic featurisation produced data guardrails, including class balancing detection, missing feature value imputation and high cardinality feature detection, which check the input data to ensure quality before training the model.

  • The image below shows that the run completed successfully in the notebook: 1  Run completed AML

  • The image below shows the best performing model: 2  AML BEST MODEL

  • We then retrieved and saved the best model: 3 retrieving and saving best model

  • The images below show the explanation of the Voting Ensemble algorithm: 4 5
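A hedged sketch of retrieving, saving and registering the best model from the completed AutoML run (the file and model names are placeholders):

import joblib

# Retrieve the best child run and the corresponding fitted model
best_run, fitted_model = remote_run.get_output()

# Save the fitted model locally (filename is a placeholder)
joblib.dump(fitted_model, "best_automl_model.pkl")

# Register the model in the workspace for later deployment (name is a placeholder)
registered_model = remote_run.register_model(
    model_name="parkinsons-automl-model",
    description="Best AutoML model (VotingEnsemble)"
)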

Hyperdrive Run

I started with the training script, train.py, which uses Scikit-learn Logistic Regression. It begins with a clean_data function that removes missing values from the dataset and one-hot encodes the data. I passed the required parameters and imported the data from the specified URL using TabularDatasetFactory. The data was then split into train and test sets, and finally the parameters were passed to the Logistic Regression algorithm.
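The repository's train.py is the source of truth; the sketch below only illustrates its overall shape, with the argument defaults, data URL, dropped columns and split ratio all being assumptions:

import argparse
import os

import joblib
from azureml.core.run import Run
from azureml.data.dataset_factory import TabularDatasetFactory
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def clean_data(data):
    # Drop rows with missing values and separate the label column;
    # the voice features are numeric, so little encoding is shown here (assumption)
    df = data.to_pandas_dataframe().dropna()
    y = df.pop("status")
    x = df.drop(columns=["name"])  # "name" is an identifier, not a feature
    return x, y


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--C", type=float, default=1.0,
                        help="Inverse of regularization strength")
    parser.add_argument("--max_iter", type=int, default=100,
                        help="Maximum number of iterations for the solver")
    args = parser.parse_args()

    run = Run.get_context()
    run.log("Regularization Strength:", float(args.C))
    run.log("Max iterations:", int(args.max_iter))

    ds = TabularDatasetFactory.from_delimited_files(path="<raw CSV URL>")  # placeholder
    x, y = clean_data(ds)
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

    model = LogisticRegression(C=args.C, max_iter=args.max_iter).fit(x_train, y_train)
    run.log("Accuracy", float(model.score(x_test, y_test)))

    os.makedirs("outputs", exist_ok=True)
    joblib.dump(model, "outputs/model.joblib")


if __name__ == "__main__":
    main()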

Hyperparameter tuning, termination policy and estimator in Hyperdrive run

6 Hyperdrive

First, we create the hyperparameters that will be tuned during training: "--C" and "--max_iter". Over these we use RandomParameterSampling, with "uniform" specifying a continuous uniform distribution from which samples are drawn for "--C" and "choice" selecting values from a discrete set for "--max_iter".

The parameter sampler chosen was RandomParameterSampling. Its major edge over other samplers is that it picks random values from the search space with ease and can explore a wider pool of hyperparameter values than the alternatives.

Then, we define our early termination policy using the BanditPolicy class with evaluation_interval=1, slack_factor=0.02, slack_amount=None and delay_evaluation=0. This terminates runs that are not performing up to the mark: starting at the specified evaluation_interval, any run whose primary metric falls outside the allowed slack of the best run is cancelled automatically.

Then, we create the estimator and the HyperDrive configuration. We use train.py to run the Logistic Regression algorithm; since the output we predict is binary, i.e. "0" for healthy and "1" for those with the disease, Logistic Regression is a natural choice.

Next, we define the HyperDrive configuration, with max_concurrent_runs set to 4, i.e. at most four iterations run in parallel, and max_total_runs set to 22, since we only have 195 rows to evaluate.
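Pulling the sampler, termination policy, estimator and HyperDrive configuration together, a sketch might look like the following. The search ranges for "--C" and "--max_iter" are illustrative assumptions, and ScriptRunConfig with a curated sklearn environment stands in for whatever estimator the notebook actually used; the policy settings and run limits match the values described above (ws, compute and experiment are assumed from earlier cells):

from azureml.core import Environment, ScriptRunConfig
from azureml.train.hyperdrive import (BanditPolicy, HyperDriveConfig,
                                      PrimaryMetricGoal, RandomParameterSampling,
                                      choice, uniform)

# Random sampling over the two tuned hyperparameters;
# the ranges shown here are assumptions, not the project's exact search space
param_sampling = RandomParameterSampling({
    "--C": uniform(0.01, 1.0),
    "--max_iter": choice(50, 100, 150, 200),
})

# Early termination policy with the values described above
policy = BanditPolicy(evaluation_interval=1, slack_factor=0.02, delay_evaluation=0)

# Run configuration for train.py (curated environment name is an assumption)
env = Environment.get(workspace=ws, name="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu")
src = ScriptRunConfig(source_directory=".", script="train.py",
                      compute_target=compute, environment=env)

hyperdrive_config = HyperDriveConfig(
    run_config=src,
    hyperparameter_sampling=param_sampling,
    policy=policy,
    primary_metric_name="Accuracy",
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=22,
    max_concurrent_runs=4,
)

hyperdrive_run = experiment.submit(hyperdrive_config)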

Results of the Hyperdrive Run:

  • The screenshot below shows the completed HyperDrive run: 8 run details

Best Model: The best HyperDrive model had an accuracy of 0.9056603773584906, a regularization strength of 0.04411012133409599 and a maximum of 200 iterations, i.e. '--C' = 0.04411012133409599 and '--max_iter' = 200.

  • The screenshot below shows the best model details:

7 Best Model

  • The screenshot below shows the HyperDrive run in the workspace: 9 hyperdrive in workspace

Model Deployment

The best performing model came from the AutoML run, the Voting Ensemble algorithm with an accuracy of 0.97906, so we now deploy it.

  • First, we register the model, create an inference configuration and then deploy the model as a web service.

10

  • Then, we download the conda environment file and define the environment. We also download the scoring file produced by AutoML and set the inference configuration.

11

  • Now, we set the ACI web service configuration and deploy the model as a web service.

12
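A sketch of the deployment steps from the last two bullets (the conda file name, scoring script name, service name and ACI resource sizes are assumptions; registered_model refers to the model registered earlier):

from azureml.core import Environment
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AciWebservice

# Environment built from the conda specification downloaded from the AutoML run
env = Environment.from_conda_specification(name="automl-env", file_path="conda_env.yml")

# Inference configuration using the scoring script produced by AutoML
inference_config = InferenceConfig(entry_script="score.py", environment=env)

# ACI deployment configuration; CPU and memory values are assumptions
aci_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1, auth_enabled=True)

# Deploy the registered model as a web service and wait for a "Healthy" state
service = Model.deploy(workspace=ws,
                       name="parkinsons-prediction-service",  # placeholder name
                       models=[registered_model],
                       inference_config=inference_config,
                       deployment_config=aci_config)
service.wait_for_deployment(show_output=True)
print(service.state, service.scoring_uri, service.swagger_uri)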

  • We get a service state of "Healthy", along with the scoring URI and the Swagger URI.

13

  • Now, we select any three samples from the dataframe and convert the records to a JSON data file.

14

  • We used three random sample points from the dataset to compare the actual values with the values predicted by the endpoint.

15

16
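A sketch of how the endpoint can be queried with JSON-serialised sample rows (test_df is a hypothetical dataframe of held-out rows, and the {"data": [...]} payload shape follows the usual AutoML scoring-script convention, which is an assumption here):

import json

import requests

# Three sample rows without the label column; "test_df" is assumed from the notebook
samples = test_df.drop(columns=["status"]).sample(3)
payload = json.dumps({"data": samples.to_dict(orient="records")})

# Retrieve the authentication keys and call the scoring URI
primary_key, secondary_key = service.get_keys()
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {primary_key}",
}
response = requests.post(service.scoring_uri, data=payload, headers=headers)
print("Predicted:", response.json())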

  • Here, we can see the endpoint in the workspace in a "Healthy" deployment state.

20

  • The screenshot below shows the REST endpoint with authentication keys (both primary and secondary) available. 21

Screen Recording:

Link to the screencast is here.

Future Work:

  • The major area for future improvement is running the model for a much longer time and trying different parameters to obtain even better accuracy.
  • We can use GPUs instead of CPUs to improve performance. CPUs may reduce costs, but in terms of performance and accuracy GPUs outperform them.
  • We can also enable deep learning in the AutoML experiment for better results, since it will consider different patterns and algorithms, improving accuracy.

About

This is the Capstone Project of my Machine Learning Engineer with Microsoft Azure Nanodegree Program by Udacity.
