TED-Scraper

Web Scraping of TED.com for complete Metadata, Transcript, Audio, Video, Images using Parallel Programming.

Environment: Google Colab with Google Drive, no hardware accelerator. Python 3.6.9.

Context

I was looking for an interesting dataset for a personal data science project, and I'm a fan of TED. I searched for a TED dataset and found Rounka's, but it is incomplete and outdated. So I scraped the site myself and made the scraper fast with parallel programming. It now downloads the metadata and transcripts of all 4609 talks on the website* in about 300 seconds. This is the most comprehensive TED Talk dataset, and it includes media files (images, audio, and video) too.

*Scraped on 24-JUN-20. Running the code against all of TED.com produces the latest dataset in about 5 minutes.

Downloading the media files takes less than 2 hours in total: 2 minutes for photos of speakers and talks, 10 minutes for audio, and 1.5 hours for video.
TED_Talk.xlsx and TED_Talk.csv contain the metadata and transcripts. Folder names are intuitive. All media files are named by talk__id, except those in PHOTO__SPEAKER, which are named by the speaker__id of the primary speaker.
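The naming rule above can be written as a small helper. This is only a sketch: the function name media_path, the root directory, the folder name VIDEO, and the file extensions are illustrative assumptions. Only PHOTO__SPEAKER, talk__id, and speaker__id come from this README.

```python
from pathlib import Path


def media_path(root, folder, talk_id, speaker_id, extension):
    """Build the expected path of one media file (hypothetical helper).

    Rule taken from the README: files in PHOTO__SPEAKER are named by the
    primary speaker's speaker__id; files in every other folder are named
    by talk__id. Folder names other than PHOTO__SPEAKER and all file
    extensions are assumptions made for illustration.
    """
    key = speaker_id if folder == "PHOTO__SPEAKER" else talk_id
    return Path(root) / folder / f"{key}.{extension}"
```

For example, media_path("TED", "PHOTO__SPEAKER", 1, 99, "jpg") points at the file named after speaker 99, while the same call with the assumed folder "VIDEO" points at the file named after talk 1.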

The code shows a way to scrape at scale.
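As a rough illustration of the idea, the sketch below fetches many pages concurrently with a thread pool from the Python standard library. It is not the code in Scraper.ipynb: the names fetch_page and scrape_all, the worker count, the timeout, and the User-Agent header are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import Request, urlopen


def fetch_page(url, timeout=30):
    """Download one page and return its HTML as text (hypothetical helper)."""
    # The User-Agent value is an assumption; some sites reject the default one.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape_all(urls, fetch=fetch_page, max_workers=16):
    """Fetch every URL concurrently and return (results, errors) keyed by URL.

    Threads suit this job because each request spends most of its time
    waiting on the network, so many downloads can overlap.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit all downloads up front, then collect them as they finish.
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # Record the failure and keep going with the other talks.
                errors[url] = exc
    return results, errors
```

Assuming TED_Talk_URLs.txt holds one talk URL per line, the list can be read with open("TED_Talk_URLs.txt").read().split() and passed to scrape_all. Failed URLs end up in errors, so they can be retried in a second pass. Keep max_workers modest to avoid overloading the site.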

Files

LICENSE
README.md
Scraper.ipynb
TED_Talk_URLs.txt
