Indicators-of-Heart-Disease

In this project I model the dataset with logistic regression.

About Dataset

Key Indicators of Heart Disease

2020 annual CDC survey data of 400k adults related to their health status

What topic does the dataset cover?

According to the CDC, heart disease is one of the leading causes of death for people of most races in the US (African Americans, American Indians and Alaska Natives, and white people). About half of all Americans (47%) have at least 1 of 3 key risk factors for heart disease: high blood pressure, high cholesterol, and smoking. Other key indicators include diabetic status, obesity (high BMI), not getting enough physical activity, and drinking too much alcohol. Detecting and preventing the factors that have the greatest impact on heart disease is very important in healthcare. Computational developments, in turn, allow machine learning methods to be applied to detect "patterns" in the data that can predict a patient's condition.

Where did the dataset come from and what treatments did it undergo?

Originally, the dataset comes from the CDC and is a major part of the Behavioral Risk Factor Surveillance System (BRFSS), which conducts annual telephone surveys to gather data on the health status of U.S. residents. As the CDC describes: "Established in 1984 with 15 states, BRFSS now collects data in all 50 states as well as the District of Columbia and three U.S. territories. BRFSS completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world." The most recent dataset (as of February 15, 2022) includes data from 2020. It consists of 401,958 rows and 279 columns. The vast majority of columns are questions asked of respondents about their health status, such as "Do you have serious difficulty walking or climbing stairs?" or "Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes]". In this dataset, I noticed many different factors (questions) that directly or indirectly influence heart disease, so I decided to select the most relevant variables from it and do some cleaning so that it would be usable for machine learning projects.

What can you do with this dataset?

As described above, the original dataset of nearly 300 variables was reduced to about 20 variables. In addition to classical EDA, this dataset can be used to apply a range of machine learning methods, most notably classifier models (logistic regression, SVM, random forest, etc.). You should treat the variable "HeartDisease" as binary ("Yes": respondent had heart disease; "No": respondent had no heart disease). Note that the classes are not balanced, so fitting a classifier without accounting for the imbalance is not advisable. Adjusting the class weights or undersampling the majority class should yield significantly better results. Based on the dataset, I constructed a logistic regression model and embedded it in an application you might be inspired by: https://share.streamlit.io/kamilpytlak/heart-condition-checker/main/app.py. Can you indicate which variables have a significant effect on the likelihood of heart disease?
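The advice about class imbalance can be sketched in a few lines of plain Python. This is a minimal illustration, not the method used in the notebook or the linked app. The helper names (`encode_target`, `balanced_class_weights`, `undersample`) and the toy labels are hypothetical. The weighting rule shown, n_samples / (n_classes * class_count), is the common "balanced" heuristic, which is also what scikit-learn applies when a classifier is given `class_weight="balanced"`.

```python
import random
from collections import Counter


def encode_target(labels):
    """Map the "Yes"/"No" strings of HeartDisease to 1/0."""
    mapping = {"Yes": 1, "No": 0}
    return [mapping[label] for label in labels]


def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def undersample(rows, y, seed=0):
    """Randomly drop rows until every class is as small as the rarest one."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, y):
        by_class.setdefault(label, []).append(row)
    target = min(len(group) for group in by_class.values())
    out_rows, out_y = [], []
    for label, group in by_class.items():
        for row in rng.sample(group, target):
            out_rows.append(row)
            out_y.append(label)
    return out_rows, out_y


# Toy labels for illustration only (not taken from the dataset): 9 "No", 1 "Yes".
labels = ["No"] * 9 + ["Yes"]
y = encode_target(labels)

# Option 1: keep all rows and weight the rare class more heavily.
weights = balanced_class_weights(y)

# Option 2: shrink the majority class to the size of the minority class.
rows = list(range(len(y)))  # stand-in for the feature rows
rows_bal, y_bal = undersample(rows, y)
```

Weighting keeps every respondent in the training set and only changes how much each error costs, while undersampling discards majority-class rows. Which works better on this data is an empirical question worth checking with a held-out set.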

