README

The transcripts are licensed under the Creative Commons BY-NC-SA 2.5 licence.

In this project I apply some text (NLP) analyses to all the transcripts of Security Now (SN) episodes.

SN is a highly informative podcast by Steve Gibson and Leo Laporte with security-related news and explanations of security concepts. The series has been running for almost 13 years at the time of writing. In the early years entire episodes were dedicated to a single topic; later on, security news moved to the foreground.

Steve has someone transcribe all of the audio files, which means we can use NLP tools to analyze all of the text.

I've only listened to the last few years, so that is what I'm most interested in.

What does the data look like?

Example of the top of a file:

GIBSON RESEARCH CORPORATION http://www.GRC.com/

SERIES:     Security Now!
EPISODE:        #20
DATE:       December 29, 2005
TITLE:      A SERIOUS new Windows vulnerability - and Listener Q&A #2
SPEAKERS:   Steve Gibson & Leo Laporte
SOURCE FILE:    http://media.GRC.com/sn/SN-020.mp3
FILE ARCHIVE:   http://www.GRC.com/securitynow.htm
    
DESCRIPTION:  On December 28th a serious new Windows vulnerability appeared and was immediately exploited by a growing number of malicious web sites to install malware.  Many worse viruses and worms are expected soon.  We start off discussing this, and our show notes provide a quick necessary workaround until Microsoft provides a patch.  Then we spend the next 45 minutes answering and discussing interesting listener questions.

LEO LAPORTE:  This is Security Now! with Steve Gibson, Episode 20, for December 29, 2005.

STEVE GIBSON:  Last episode of this year.

LEO:  The last episode of 2005.  And we've done 20 of them.

STEVE:  Yeah.

As you can see, this text is very structured and can be parsed fairly easily into analysis-ready data.

This project has two parts:

1. Build a scraper that downloads and reads in all of the text
   - iterate through all of the links (don't download a file if you already have it)
   - extract the metadata at the top of the file (date, title, speakers, source file, description)
   - a row per sentence
2. Build cool stuff on top of this file
   - classifier that predicts who speaks?*
   - sentiment analysis per episode, per season
   - bot that talks like Steve and Leo*
   - topic model or word2vec*
   - network analysis of words

Building a scraper

I kept the scraping part relatively simple: I knew the files were in txt format and very structured on the website. I chose to just generate a set of links, download the files, check them for errors, and read the files in.

More details can be found on the SCRAPER_description page.
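The "generate a set of links, skip what you already have" idea can be sketched roughly as follows. This is only an illustration in Python (the project itself appears to be written in R), and it is not the code used here. The base URL `https://example.com/sn` is a placeholder, and the `sn-NNN.txt` file naming is an assumption modelled on the zero-padded `SN-020.mp3` name in the sample header. `episode_links` and `links_to_download` are hypothetical helper names.

```python
from pathlib import Path


def episode_links(base_url, first, last):
    """Return (episode_number, url) pairs for episodes first..last inclusive."""
    # Assumption: transcripts are named sn-NNN.txt with a zero-padded number.
    return [(n, f"{base_url}/sn-{n:03d}.txt") for n in range(first, last + 1)]


def links_to_download(links, target_dir):
    """Keep only the links whose file is not yet present in target_dir."""
    target = Path(target_dir)
    return [
        (n, url)
        for n, url in links
        if not (target / url.rsplit("/", 1)[-1]).exists()
    ]


# Placeholder base URL, not the real transcript location.
links = episode_links("https://example.com/sn", 19, 21)
```

Checking for an existing file before downloading keeps repeated runs cheap and avoids hitting the website again for episodes that are already on disk.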

Building extraction tools

I extracted all the episode information (title, date, hosts, episode number, description) and all the lines that contained spoken text, identified the speaker, created a line number, and combined all of that into one dataframe.

More information can be found on the extracting_features page.
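To make the extraction step concrete, here is a minimal Python sketch of one way to split a transcript into header metadata and spoken lines. It is not the project's actual code. The two regular expressions, the helper name `parse_transcript`, and the choice to reduce "LEO LAPORTE" to "LEO" are all assumptions based on the sample file shown earlier. Continuation paragraphs without a speaker prefix are simply skipped here.

```python
import re

# Header fields as they appear at the top of the sample transcript.
HEADER_RE = re.compile(
    r"^(SERIES|EPISODE|DATE|TITLE|SPEAKERS|SOURCE FILE|FILE ARCHIVE|DESCRIPTION):\s*(.*)$"
)
# Assumption: a spoken line starts with an upper-case name, a colon and a space.
SPEAKER_RE = re.compile(r"^([A-Z][A-Z .'-]*):\s+(.*)$")


def parse_transcript(text):
    """Return (metadata dict, list of spoken-line rows) for one transcript."""
    meta, rows = {}, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        header = HEADER_RE.match(line)
        if header and not rows:  # header fields only occur before the dialogue
            key = header.group(1).lower().replace(" ", "_")
            meta[key] = header.group(2).strip()
            continue
        spoken = SPEAKER_RE.match(line)
        if spoken:
            rows.append({
                "line_number": len(rows) + 1,
                # Keep the first name only, so "LEO LAPORTE" and "LEO" agree.
                "speaker": spoken.group(1).split()[0],
                "text": spoken.group(2),
            })
    return meta, rows


sample = "\n".join([
    "GIBSON RESEARCH CORPORATION http://www.GRC.com/",
    "",
    "SERIES:     Security Now!",
    "EPISODE:        #20",
    "DATE:       December 29, 2005",
    "",
    "LEO LAPORTE:  This is Security Now! with Steve Gibson, Episode 20, for December 29, 2005.",
    "",
    "STEVE GIBSON:  Last episode of this year.",
    "",
    "LEO:  The last episode of 2005.  And we've done 20 of them.",
    "",
    "STEVE:  Yeah.",
])
meta, rows = parse_transcript(sample)
```

The first line of the file (the GRC banner) matches neither pattern, so it is ignored.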

Final product:

A dataframe with a row for every episode, regular columns for the episode information, and one list-column containing a nested dataframe with the line number, speaker, and spoken text.
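The shape of that result can be pictured as one record per episode with a nested table of spoken lines. The Python sketch below only illustrates the layout; it is not the project's own dataframe. The field names (`episode`, `title`, `lines`) and the helper `unnest` are hypothetical.

```python
def unnest(episodes):
    """Flatten episode records into one row per spoken line."""
    return [
        {"episode": ep["episode"], "title": ep["title"], **line}
        for ep in episodes
        for line in ep["lines"]
    ]


# One record per episode; "lines" plays the role of the list-column.
episodes = [
    {
        "episode": 20,
        "title": "A SERIOUS new Windows vulnerability - and Listener Q&A #2",
        "lines": [
            {"line_number": 1, "speaker": "STEVE", "text": "Last episode of this year."},
            {"line_number": 2, "speaker": "LEO", "text": "The last episode of 2005."},
        ],
    },
]
flat = unnest(episodes)
```

Keeping the spoken lines nested makes per-episode work easy, while the flattened form suits per-line analyses such as comparing speakers.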

Note that the deleting of files and the reading into a dataframe are actually done in the extracting_features file.


RMHogervorst/NLP_SN
