
Using Python, I explored and engineered data from Enron email metadata and financials, building and tuning a number of scikit-learn classification models to detect persons of interest.


Machine Learning in Python, scikit-learn classification

This is my project for Udacity's Intro to Machine Learning class, Identify Fraud in the Enron Dataset, part of Udacity's Data Analyst nanodegree, which is in turn part of Western Governors University's Bachelor of Science in Data Management and Data Analytics.

The entire process can be run from 'poi_id.py' and is explained in 'Free-Response Questions.ipynb', but I have included supplemental materials and other resources.

This was my first true ML project, so it's pretty messy. I wouldn't consider this (or any of my school projects thus far) a finished deliverable.

Files

  • /data: Contains the pickled starting dataset, plus several pickled dictionaries of performance metrics created during the algorithm selection and tuning process that is not carried out in the final script.
  • /supplemental_material: Contains (most of) the (messy) notebooks I used along the way to explore and experiment with the data and the ML process itself. While I don't recommend running these notebooks, they are included to show my work and my thought process. They should be viewed in this order: 'initial_wrangle.ipynb', 'handling_eda_etc.ipynb', 'feature_engineering.ipynb', 'selection.ipynb', followed by the gridsearch notebooks.
  • Free-Response Questions.ipynb: A notebook with Udacity's questions regarding the project and my process, with my responses. While Udacity asked for shorter responses, the extent to which I took the project warranted longer responses in order to address each point of each set of questions and their associated rubric items.
  • poi_id.ipynb: A notebook version of the final script from which all cleaning, engineering, tuning, validation, and evaluation are run. It breaks up the output for easier reference.
  • enron61702insiderpay.pdf: PDF of financial data with footnotes, from FindLaw.com.
  • environment.yml: The conda environment I used.
  • Free-Response Questions.html: HTML export of 'Free-Response Questions.ipynb'.
  • my_classifier.pkl: A final (not best) classifier model. It's a scikit-learn pipeline containing a tuned feature-selection algorithm and a tuned classifier.
  • my_dataset.pkl: The dataset (as a dictionary) with the features to be plugged into the above model. It includes 'poi', the target feature.
  • my_features_list.pkl: The list of the features in my_dataset.
  • poi_id.py: The final script from which all cleaning, engineering, tuning, validation, and evaluation are run.
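For context, the dataset in 'my_dataset.pkl' is a dictionary keyed by person name, with 'NaN' strings marking missing values. A minimal sketch (toy data and helper name are my own, in the spirit of Udacity's featureFormat/targetFeatureSplit helpers) of flattening such a dictionary into a label vector and feature matrix:

```python
def features_and_labels(data_dict, features_list):
    """Return (labels, features); features_list[0] must be 'poi'."""
    labels, features = [], []
    for person, record in data_dict.items():
        row = []
        for feat in features_list:
            value = record.get(feat, 'NaN')
            # Treat Udacity's 'NaN' string sentinel as zero.
            row.append(0.0 if value == 'NaN' else float(value))
        labels.append(row[0])       # 'poi' target
        features.append(row[1:])    # remaining features
    return labels, features

# Toy example with two fabricated records:
toy = {
    'PERSON A': {'poi': True, 'salary': 200000, 'bonus': 'NaN'},
    'PERSON B': {'poi': False, 'salary': 100000, 'bonus': 50000},
}
labels, features = features_and_labels(toy, ['poi', 'salary', 'bonus'])
```

The resulting matrix is what gets fed to the pickled pipeline after loading it with the pickle module.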

References

Python imports etc.:

Data:

Other people's approaches:

I read these write-ups to see how others approached the same problem. Though I borrowed no code, nor any ideas that aren't already common, William Koehrsen's article reminded me to validate the data against the total columns, and reading his explanation saved me the trouble of puzzling out why there were errors.
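That kind of validation can be sketched as follows (the component feature names are assumed from the insider-pay PDF; the toy records are fabricated): sum each person's payment components and flag anyone whose sum disagrees with their 'total_payments' value, which is how shifted-row transcription errors in the dataset surface.

```python
# Payment components assumed from the enron61702insiderpay.pdf schema.
PAYMENT_COMPONENTS = ['salary', 'bonus', 'long_term_incentive',
                      'deferred_income', 'deferral_payments', 'other',
                      'expenses', 'director_fees', 'loan_advances']

def mismatched_totals(data_dict):
    """Return names whose payment components don't sum to 'total_payments'."""
    def num(record, key):
        value = record.get(key, 'NaN')
        return 0.0 if value == 'NaN' else float(value)

    return [name for name, rec in data_dict.items()
            if sum(num(rec, k) for k in PAYMENT_COMPONENTS)
            != num(rec, 'total_payments')]

# Toy example: the second record mimics a shifted/garbled row.
toy = {
    'OK PERSON': {'salary': 100, 'bonus': 50, 'total_payments': 150},
    'SHIFTED ROW': {'salary': 100, 'bonus': 50, 'total_payments': 999},
}
bad = mismatched_totals(toy)
```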

General information about the scandal and the data:

Education/reference:
