Complete Web Scraping of TED.com for Metadata, Transcript, Audio, Video, Images using Parallel Programming

The-Gupta/TED-Scraper

TED-Scraper

Web Scraping of TED.com for complete Metadata, Transcript, Audio, Video, Images using Parallel Programming.

Environment: Google Colab with Google Drive, no Hardware Accelerator. Python 3.6.9.

Scraped Data

Context

I was looking for an interesting dataset for a personal Data Science project, and I'm a fan of TED. I searched for a TED dataset and found Rounka's, but it is incomplete and outdated. So I scraped the site myself and made it fast using Parallel Programming. The code now downloads the Metadata and Transcript of all 4609 Talks on the website* in 300 seconds. This is the most comprehensive TED Talk dataset, and it includes media files (images, audio, and video) too!

*Scraped on 24-JUN-20. Running the code against the whole of TED.com produces the latest dataset in about 5 minutes.
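To illustrate how parallel scraping of this kind can work, here is a minimal sketch using a thread pool from the Python standard library. It is not taken from the repository: `scrape_all`, `fake_fetch`, the worker count, and the placeholder URLs are all hypothetical, and `fake_fetch` stands in for a real HTTP request and HTML parse so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def scrape_all(talk_urls, fetch_talk, max_workers=32):
    """Fetch every talk concurrently; return (results, failures) keyed by URL."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit one task per talk page; threads suit I/O-bound downloads.
        future_to_url = {pool.submit(fetch_talk, url): url for url in talk_urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # Record the failure and keep going so one bad page
                # does not abort the whole run.
                failures[url] = exc
    return results, failures


def fake_fetch(url):
    # Hypothetical stand-in for a real HTTP request plus HTML parsing.
    if url.endswith("/broken"):
        raise ValueError("could not parse " + url)
    return {"url": url, "title": url.rsplit("/", 1)[-1]}


# Placeholder URLs, not real talk pages.
urls = [
    "https://www.ted.com/talks/a",
    "https://www.ted.com/talks/b",
    "https://www.ted.com/talks/broken",
]
results, failures = scrape_all(urls, fake_fetch, max_workers=4)
```

Passing the fetch function in as an argument keeps the concurrency logic separate from the page-specific parsing, so failed URLs can be retried later by calling `scrape_all` again with `list(failures)`.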

Downloading media files takes less than 2 hours in total: 2 minutes for photos of Speaker and Talk, 10 minutes for Audio, 1.5 hours for videos.
TED_Talk.xlsx and TED_Talk.csv contain Metadata and Transcript. Folder Names are intuitive. All media files are named by talk__id, except in PHOTO__SPEAKER, where files are named by the speaker__id of the primary Speaker.
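The naming rule above can be sketched as a small helper that maps a row of the metadata to the expected location of a media file. This is an illustration under assumptions: `media_path` is a hypothetical function, and the `VIDEO` folder name, the root directory, and the file extensions are placeholders. Only `PHOTO__SPEAKER`, `talk__id`, and `speaker__id` come from the description above.

```python
from pathlib import Path


def media_path(root, folder, talk_id, speaker_id, extension):
    """Build the expected path of one media file under the dataset root."""
    # Only PHOTO__SPEAKER is keyed by the primary speaker's id;
    # every other folder is keyed by the talk id.
    key = speaker_id if folder == "PHOTO__SPEAKER" else talk_id
    return Path(root) / folder / "{}.{}".format(key, extension)


# Hypothetical ids, folder name, and extensions, for illustration only.
video_file = media_path("TED", "VIDEO", 42, 7, "mp4")
speaker_photo = media_path("TED", "PHOTO__SPEAKER", 42, 7, "jpg")
```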

The code shows a way to scrape at scale.