Natural Language Processing (NLP) and programmatic data extraction in large scale fraud investigations.

AmMoPy/NLP_Enron_Emails

Background:

Enron Corporation was an American energy, commodities, and services company based in Houston, Texas. It was founded by Kenneth Lay in 1985 as a merger between Lay's Houston Natural Gas and InterNorth, both relatively small regional companies. The company went bankrupt on December 3, 2001.

Dataset:

The Enron Corpus is a database of over 500k real emails generated by 150 Enron employees, mostly senior management. It was obtained by the Federal Energy Regulatory Commission during its investigation of Enron's collapse and was later made public.

The dataset does not include attachments, and some messages have been deleted.

Project Motivation:

Given the size of the available data, it can be overwhelming to explore it and identify potentially useful pieces of evidence or clues. This project demonstrates one way of applying Natural Language Processing (NLP) and programmatic data extraction in a large scale fraud investigation, using real data.

Along the way, the project also applies some NLP and other methods that are useful in general, for example:

  • Comparing content of text files through hashing
  • Identifying unique and recurring words
  • Text summarization using deep learning models
  • Creating word clouds :D
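
As a rough illustration of the first two items, here is a minimal Python sketch using only the standard library. It is not taken from this repository: the function names (`content_hash`, `group_duplicates`, `word_counts`), the choice of SHA-256, the whitespace normalization, and the sample messages are all assumptions made for the example.

```python
import hashlib
import re
from collections import Counter, defaultdict


def content_hash(text: str) -> str:
    """Return a SHA-256 digest of the text after collapsing whitespace."""
    # Assumption: texts that differ only in spacing or line breaks count as equal.
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def group_duplicates(docs: dict[str, str]) -> list[list[str]]:
    """Group document names whose normalized content hashes are identical."""
    by_hash = defaultdict(list)
    for name, text in docs.items():
        by_hash[content_hash(text)].append(name)
    return [sorted(names) for names in by_hash.values() if len(names) > 1]


def word_counts(text: str) -> Counter:
    """Count lowercase word tokens made of letters and apostrophes."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


# Hypothetical sample messages, invented for illustration (not from the Enron Corpus).
docs = {
    "a.txt": "Please review the quarterly report.",
    "b.txt": "Please   review the quarterly\nreport.",
    "c.txt": "The report is due Friday. Review the report today.",
}

duplicates = group_duplicates(docs)
counts = word_counts(docs["c.txt"])
unique_words = sorted(w for w, n in counts.items() if n == 1)
recurring_words = sorted(w for w, n in counts.items() if n > 1)

print(duplicates)       # files with identical normalized content
print(unique_words)     # words that appear exactly once in c.txt
print(recurring_words)  # words that appear more than once in c.txt
```

Hashing lets large numbers of files be compared by digest rather than by full text, which matters at the scale of hundreds of thousands of emails. Exact hashing only catches identical content after normalization; near-duplicates such as forwarded or quoted messages would need a different technique.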

Disclaimer:

This project is for demonstration purposes only and is not intended to draw any conclusions; the detailed content of the emails will not be displayed, even though it is publicly available elsewhere.

Resources:

Check my YT Channel
