Parkinson's Disease Prediction Using Azure ML

Summary

In this project, we use the Parkinson's Disease dataset, which contains biomedical voice measurements from a number of people, to predict whether a person has the disease. We compare the accuracy of a HyperDrive run with tuned hyperparameters against an AutoML run on Microsoft Azure. The prediction is binary: "0" for healthy and "1" for those with the disease. After comparing the performance of both approaches, we deploy the best performing model, which can then be consumed through the generated REST endpoint.

Diagram

Dataset Overview

The dataset was created by Max Little of the University of Oxford, in collaboration with the National Centre for Voice and Speech, Denver, Colorado, who recorded the speech signals. The original study published the feature extraction methods for general voice disorders.

Parkinson's disease is a brain disorder that affects the nervous system, causing tremors, stiffness, and slowed or impaired movement. It develops when nerve cell damage in the brain causes dopamine levels to drop. Prompt medication can help control the symptoms.

It is a multivariate dataset containing a range of biomedical voice measurements from 31 people, 23 of whom had Parkinson's disease. Each column is a particular voice measure, and the rows correspond to 195 voice recordings from these individuals.

parkinson's stages

Citation:

Little MA, McSharry PE, Roberts SJ, Costello DAE, Moroz IM. 'Exploiting Nonlinear Recurrence and Fractal Scaling Properties for Voice Disorder Detection', BioMedical Engineering OnLine 2007, 6:23 (26 June 2007).

Task:

Since this is a classification task with a binary output, the column "status" is used to determine whether a person is healthy, denoted by "0", or has Parkinson's disease, denoted by "1".

Attributes:

  • Matrix column entries (attributes):
  • name - ASCII subject name and recording number
  • MDVP:Fo(Hz) - Average vocal fundamental frequency
  • MDVP:Fhi(Hz) - Maximum vocal fundamental frequency
  • MDVP:Flo(Hz) - Minimum vocal fundamental frequency
  • MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP - Several measures of variation in fundamental frequency
  • MDVP:Shimmer,MDVP:Shimmer(dB),Shimmer:APQ3,Shimmer:APQ5,MDVP:APQ,Shimmer:DDA - Several measures of variation in amplitude
  • NHR,HNR - Two measures of ratio of noise to tonal components in the voice
  • status - Health status of the subject (one) - Parkinson's, (zero) - healthy
  • RPDE,D2 - Two nonlinear dynamical complexity measures
  • DFA - Signal fractal scaling exponent
  • spread1,spread2,PPE - Three nonlinear measures of fundamental frequency variation

Access to the dataset:

The dataset is available in ASCII CSV format and has been provided in this repository here.

Automated ML Run

First, we used TabularDatasetFactory to create a dataset from the provided link. We then split it into train and test sets and uploaded them to the datastore. Finally, we defined the task using the configuration shown below.
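As a rough sketch of this dataset-creation step (the raw CSV URL, split ratio and random seed below are placeholders rather than values taken from the project), the code might look like:

from azureml.data.dataset_factory import TabularDatasetFactory

# Placeholder: URL of the raw parkinsons.csv provided in this repository
data_url = "<raw CSV URL from this repository>"

# Create a TabularDataset directly from the delimited file
dataset = TabularDatasetFactory.from_delimited_files(path=data_url)

# Split into train and test sets; the 80/20 ratio and seed are assumptions
training_data, test_data = dataset.random_split(percentage=0.8, seed=42)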

  • Below are the settings used for the AutoML run:
from azureml.train.automl import AutoMLConfig

# AutoML settings: 5-fold cross-validation, 20-minute experiment timeout,
# accuracy as the primary metric, and up to 4 concurrent iterations
automl_settings = {
    "n_cross_validations": 5,
    "experiment_timeout_minutes": 20,
    "primary_metric": "accuracy",
    "max_concurrent_iterations": 4,
}

# Binary classification on the "status" label, with early stopping enabled
# and all available cores used per iteration
automl_config = AutoMLConfig(
    task="classification",
    compute_target=compute,
    enable_early_stopping=True,
    max_cores_per_iteration=-1,
    training_data=training_data,
    label_column_name="status",
    **automl_settings
)
  • Below are the definitions of the settings above and the reasons we chose them for our AutoML run:

automlsettings
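As a minimal sketch of how this configuration is submitted as an experiment (the experiment name and the ws workspace object are assumptions taken from the notebook context):

from azureml.core import Experiment

# "ws" is assumed to be the already-loaded Workspace object from the notebook
experiment = Experiment(workspace=ws, name="parkinsons-automl")  # name is a placeholder

# Submit the AutoML configuration defined above and wait for it to finish
remote_run = experiment.submit(automl_config)
remote_run.wait_for_completion(show_output=True)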

Results of AutoML Run

After submission, we found that the VotingEnsemble algorithm produced the best model, with an accuracy of 0.97906, a precision_score_weighted of 0.99740 and a precision_score_micro of 0.99580. Enabling automatic featurisation produced data guardrails, including class balancing detection, missing feature value imputation and high cardinality feature detection, which check the input data to ensure quality before training the model.

  • The image below shows that the run completed successfully in the notebook: 1  Run completed AML

  • The image below shows the best performing model: 2  AML BEST MODEL

  • We then retrieved and saved the best model: 3 retrieving and saving best model

  • The images below show the explanation of the Voting Ensemble algorithm: 4 5
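A hedged sketch of retrieving, saving and registering the best model from the completed AutoML run (the file and model names are placeholders):

import joblib

# Retrieve the best child run and the corresponding fitted model
best_run, fitted_model = remote_run.get_output()

# Save the fitted model locally (filename is a placeholder)
joblib.dump(fitted_model, "best_automl_model.pkl")

# Register the model in the workspace for later deployment (name is a placeholder)
registered_model = remote_run.register_model(
    model_name="parkinsons-automl-model",
    description="Best AutoML model (VotingEnsemble)"
)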

Hyperdrive Run

I started with the training script, train.py, which uses Scikit-learn Logistic Regression. It begins with a clean_data function that removes missing values from the dataset and one-hot encodes the data. I passed the required parameters and imported the data from the specified URL using TabularDatasetFactory. The data was then split into train and test sets, and finally the parameters were passed to the Logistic Regression algorithm.
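The repository's train.py is the source of truth; the sketch below only illustrates its overall shape, with the argument defaults, data URL, dropped columns and split ratio all being assumptions:

import argparse
import os

import joblib
from azureml.core.run import Run
from azureml.data.dataset_factory import TabularDatasetFactory
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def clean_data(data):
    # Drop rows with missing values and separate the label column;
    # the voice features are numeric, so little encoding is shown here (assumption)
    df = data.to_pandas_dataframe().dropna()
    y = df.pop("status")
    x = df.drop(columns=["name"])  # "name" is an identifier, not a feature
    return x, y


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--C", type=float, default=1.0,
                        help="Inverse of regularization strength")
    parser.add_argument("--max_iter", type=int, default=100,
                        help="Maximum number of iterations for the solver")
    args = parser.parse_args()

    run = Run.get_context()
    run.log("Regularization Strength:", float(args.C))
    run.log("Max iterations:", int(args.max_iter))

    ds = TabularDatasetFactory.from_delimited_files(path="<raw CSV URL>")  # placeholder
    x, y = clean_data(ds)
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

    model = LogisticRegression(C=args.C, max_iter=args.max_iter).fit(x_train, y_train)
    run.log("Accuracy", float(model.score(x_test, y_test)))

    os.makedirs("outputs", exist_ok=True)
    joblib.dump(model, "outputs/model.joblib")


if __name__ == "__main__":
    main()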

Hyperparameter tuning, termination policy and estimator in Hyperdrive run

6 Hyperdrive

First, we create the hyperparameters that will be tuned during training: "--C" and "--max_iter". Over these we use RandomParameterSampling, with "uniform" specifying a continuous uniform distribution from which samples are drawn for "--C" and "choice" selecting values from a discrete set for "--max_iter".

The parameter sampler chosen was RandomParameterSampling. Its major edge over other samplers is that it picks random values from the search space with ease and can explore a wider pool of hyperparameter values than the alternatives.

Then, we define our early termination policy using the BanditPolicy class with evaluation_interval=1, slack_factor=0.02, slack_amount=None and delay_evaluation=0. This terminates runs that are not performing up to the mark: starting at the specified evaluation_interval, any run whose primary metric falls outside the allowed slack of the best run is cancelled automatically.

Then, we create the estimator and the HyperDrive configuration. We use train.py to run the Logistic Regression algorithm; since the output we predict is binary, i.e. "0" for healthy and "1" for those with the disease, Logistic Regression is a natural choice.

Next, we define the HyperDrive configuration, with max_concurrent_runs set to 4, i.e. at most four iterations run in parallel, and max_total_runs set to 22, since we only have 195 rows to evaluate.
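Pulling the sampler, termination policy, estimator and HyperDrive configuration together, a sketch might look like the following. The search ranges for "--C" and "--max_iter" are illustrative assumptions, and ScriptRunConfig with a curated sklearn environment stands in for whatever estimator the notebook actually used; the policy settings and run limits match the values described above (ws, compute and experiment are assumed from earlier cells):

from azureml.core import Environment, ScriptRunConfig
from azureml.train.hyperdrive import (BanditPolicy, HyperDriveConfig,
                                      PrimaryMetricGoal, RandomParameterSampling,
                                      choice, uniform)

# Random sampling over the two tuned hyperparameters;
# the ranges shown here are assumptions, not the project's exact search space
param_sampling = RandomParameterSampling({
    "--C": uniform(0.01, 1.0),
    "--max_iter": choice(50, 100, 150, 200),
})

# Early termination policy with the values described above
policy = BanditPolicy(evaluation_interval=1, slack_factor=0.02, delay_evaluation=0)

# Run configuration for train.py (curated environment name is an assumption)
env = Environment.get(workspace=ws, name="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu")
src = ScriptRunConfig(source_directory=".", script="train.py",
                      compute_target=compute, environment=env)

hyperdrive_config = HyperDriveConfig(
    run_config=src,
    hyperparameter_sampling=param_sampling,
    policy=policy,
    primary_metric_name="Accuracy",
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=22,
    max_concurrent_runs=4,
)

hyperdrive_run = experiment.submit(hyperdrive_config)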

Results of the Hyperdrive Run:

  • The screenshot below shows the completed HyperDrive run: 8 run details

Best Model: The best HyperDrive model had an accuracy of 0.9056603773584906, a regularization strength of 0.04411012133409599 and a maximum of 200 iterations, i.e. '--C' = 0.04411012133409599 and '--max_iter' = 200.

  • The screenshot below shows the best model details:

7 Best Model

  • The screenshot below shows the HyperDrive run in the workspace: 9 hyperdrive in workspace

Model Deployment

The best performing model came from the AutoML run, the Voting Ensemble algorithm with an accuracy of 0.97906, so we now deploy it.

  • First, we register the model, create an inference configuration and then deploy the model as a web service.

10

  • Then, we download the conda environment file and define the environment. We also download the scoring file produced by AutoML and set the inference configuration.

11

  • Now, we set the ACI web service configuration and deploy the model as a web service.

12
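A sketch of the deployment steps from the last two bullets (the conda file name, scoring script name, service name and ACI resource sizes are assumptions; registered_model refers to the model registered earlier):

from azureml.core import Environment
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AciWebservice

# Environment built from the conda specification downloaded from the AutoML run
env = Environment.from_conda_specification(name="automl-env", file_path="conda_env.yml")

# Inference configuration using the scoring script produced by AutoML
inference_config = InferenceConfig(entry_script="score.py", environment=env)

# ACI deployment configuration; CPU and memory values are assumptions
aci_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1, auth_enabled=True)

# Deploy the registered model as a web service and wait for a "Healthy" state
service = Model.deploy(workspace=ws,
                       name="parkinsons-prediction-service",  # placeholder name
                       models=[registered_model],
                       inference_config=inference_config,
                       deployment_config=aci_config)
service.wait_for_deployment(show_output=True)
print(service.state, service.scoring_uri, service.swagger_uri)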

  • We get a service state of "Healthy", along with the scoring URI and the Swagger URI.

13

  • Now, we select any three samples from the dataframe and convert the records to a JSON data file.

14

  • We used three random sample points from the dataset to compare the actual values with the values predicted by the endpoint.

15

16
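A sketch of how the endpoint can be queried with JSON-serialised sample rows (test_df is a hypothetical dataframe of held-out rows, and the {"data": [...]} payload shape follows the usual AutoML scoring-script convention, which is an assumption here):

import json

import requests

# Three sample rows without the label column; "test_df" is assumed from the notebook
samples = test_df.drop(columns=["status"]).sample(3)
payload = json.dumps({"data": samples.to_dict(orient="records")})

# Retrieve the authentication keys and call the scoring URI
primary_key, secondary_key = service.get_keys()
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {primary_key}",
}
response = requests.post(service.scoring_uri, data=payload, headers=headers)
print("Predicted:", response.json())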

  • Here, we can see the endpoint in the workspace in a "Healthy" deployment state.

20

  • The screenshot below shows the REST endpoint with authentication keys (both primary and secondary) available. 21

Screen Recording:

Link to the screencast is here.

Future Work:

  • The major area for future improvement is running the model for a much longer time and trying different parameters to obtain even better accuracy.
  • We can use GPUs instead of CPUs to improve performance. CPUs may reduce costs, but in terms of performance and accuracy GPUs outperform them.
  • We can also enable deep learning in the AutoML experiment for better results, since it will consider different patterns and algorithms, improving accuracy.

About

This is the Capstone Project of my Machine Learning Engineer with Microsoft Azure Nanodegree Program by Udacity.
