Using the YouTube scraper, you can extract data from keyword search results, scrape detailed video data (such as the like/dislike ratio) and channel information, download captions, and scrape comment sections.

Unlike the official YouTube API, this YouTube scraper lets you get results without quota limits or a login requirement.

Our YouTube API is open source, and you can easily run it locally or on your own system. Contributions are welcome.

Features

  • Scrape videos by specifying multiple search keywords or URLs to get video details, including e.g. the like/dislike ratio.
  • Scrape channel details (username, description, number of subscribers, etc.).
  • [NEW] Scrape and download YouTube subtitles and captions (both auto- and user-generated) in any language from any country.
  • [NEW] Scrape the YouTube comment section (no nested comments at the moment, though).

Tutorial

For a more detailed explanation of how to scrape YouTube, read the step-by-step tutorial on our blog.

And for more ideas on how to use the extracted data, check out our industry pages for concrete ways web scraping results are already being used across projects and businesses of various scales and focuses - in media and marketing, for instance.

Cost of usage

On average, scraping 1,000 items from YouTube via the Apify platform will cost you around 2.5 USD in platform credits from your subscription plan. For more details about the plans we offer, platform credits, and usage, see the platform pricing page.

If you're not sure how many credits you have left on your plan and whether you might need to upgrade, you can always check your limits in the Settings -> Usage and Billing tab in your Console.
The easiest way to find out how many credits your actor run will need is to perform a test run.

Proxy usage

This actor, like most social media scrapers, requires proxy servers to run properly. You can use either your own proxy servers or Apify Proxy. We recommend using datacenter proxies to achieve the best scraping performance with this actor.
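
For instance, a minimal sketch of the two common proxyConfiguration variants (the proxy group name and the custom proxy URL below are placeholders; the groups available to you depend on your subscription):

{
    "proxyConfiguration": {
        "useApifyProxy": true,
        "apifyProxyGroups": ["RESIDENTIAL"] // Apify Proxy with a specific group
    }
}
{
    "proxyConfiguration": {
        "proxyUrls": ["http://user:password@your-proxy.example.com:8000"] // your own proxy servers
    }
}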

Input parameters

If you run this actor on the Apify platform, a user-friendly UI will help you configure all the necessary and optional parameters of this scraper before running it. Our YouTube actor recognizes the following input fields:

  • searchKeywords - Your YouTube search query, e.g. Nimbus 2000 reviews; this can be used instead of a URL.

    • startUrls - A more accurate alternative to searchKeywords. By inserting specific YouTube URLs, you can provide search, channel, or video URLs.
  • maxResults - Sets how many videos should be scraped from each search or channel. Defaults to 50; you can leave it empty for an unlimited search.

  • maxComments - Limits the number of comments that you want to scrape. 0 or empty means no comments will be scraped.

  • downloadSubtitles - Scrape both user-generated and auto-generated captions and convert them to the SRT format. Boolean value, defaults to false (see the example input after this list).

    • subtitlesLanguage - Download only subtitles of the selected language (possible values "en", "de", "es"...)
    • preferAutoGeneratedSubtitles - Prefer the auto-generated speech-to-text subtitles to the user-made ones.
    • saveSubsToKVS - Saves the scraped subtitles to the Apify Key-Value Store.
  • proxyConfiguration (required) - Configures proxy settings

  • verboseLog (required) - Turns on verbose logging for more accurate monitoring and more detail about the runs.
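
As referenced in the downloadSubtitles item above, here is a minimal sketch of an input that downloads English subtitles and stores them in the Key-Value Store (all field names are taken from the list above):

{
    "searchKeywords": "Nimbus 2000 reviews",
    "maxResults": 10,
    "downloadSubtitles": true,
    "subtitlesLanguage": "en",
    "preferAutoGeneratedSubtitles": false,
    "saveSubsToKVS": true,
    "proxyConfiguration": {
        "useApifyProxy": true
    },
    "verboseLog": false
}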

See more technical details of the input parameters in the Input Schema tab of this actor.

Example

{
    "searchKeywords": "Terminator dark fate",
    "maxResults": 30,
    "startUrls": [{
        "url": "https://www.youtube.com/channel/UC8w/videos" // channel videos
    }, {
        "url": "https://www.youtube.com/results?search_query=finances" // search queries
    }, {
        "url": "https://www.youtube.com/watch?v=kJQP7kiw5Fk" // videos
    }],
    "proxyConfiguration": {
        "useApifyProxy": true
    },
    "verboseLog": false
}
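
If you prefer to start runs programmatically, here is a minimal sketch using the apify-client NPM package. The token and the actor ID below are placeholders (use the ID shown on the actor's page), and the client API assumed here is apify-client v2:

const { ApifyClient } = require('apify-client');

const client = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' });

(async () => {
    // Start the actor with the same input shape as the example above and wait for it to finish.
    const run = await client.actor('ACTOR_ID_OR_NAME').call({
        searchKeywords: 'Terminator dark fate',
        maxResults: 30,
        proxyConfiguration: { useApifyProxy: true },
        verboseLog: false,
    });

    // Read the scraped videos from the run's default dataset.
    const { items } = await client.dataset(run.defaultDatasetId).listItems();
    console.log(`Scraped ${items.length} videos`);
})();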

YouTube Scraper output

After the actor finishes its run, it stores the scraped results in the Dataset. Each YouTube video becomes a separate record in the dataset (see the JSON example below). Using the Apify platform, you can choose to present and download the contents of the dataset in different data formats (JSON, RSS, XML, HTML table...).

Example

{
  "title": "Terminator: Dark Fate - Official Trailer (2019) - Paramount Pictures",
  "id": "oxy8udgWRmo",
  "url": "https://www.youtube.com/watch?v=oxy8udgWRmo",
  "viewCount": 15432,
  "date": "2019-08-29T00:00:00+00:00",
  "likes": 121000,
  "dislikes": 23000,
  "channelName": "Paramount Pictures",
  "channelUrl": "https://www.youtube.com/channel/UCF9imwPMSGz4Vq1NiTWCC7g",
  "numberOfSubscribers": 1660000,
  "details": "Welcome to the day after <a class=\"yt-simple-endpoint style-sco..."
}

See the Apify API reference to learn in more detail about getting results from this YouTube Scraper.
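
You can also download the dataset items directly over HTTP. A minimal sketch, assuming Node.js 18+ with the built-in fetch; the dataset ID and token below are placeholders, and the format parameter accepts values such as json, csv, xml, or rss:

const datasetId = 'YOUR_DATASET_ID'; // e.g. the default dataset ID of a finished run
const token = 'YOUR_APIFY_TOKEN';

(async () => {
    // Fetch the dataset items in JSON format via the Apify API.
    const response = await fetch(
        `https://api.apify.com/v2/datasets/${datasetId}/items?format=json&clean=true&token=${token}`
    );
    const videos = await response.json();
    console.log(videos[0].title, videos[0].viewCount);
})();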

How you can use the data extracted from YouTube:

  • Compile reviews of products and services - make purchasing and investment decisions backed by data.

  • Monitor YouTube for brand awareness - keep track of brand mentions, audience reach and web reputation.

  • Estimate the impact of YouTube campaigns - estimate ROI for advertisement or referrals from YouTube channels and scale marketing campaigns accordingly.

  • Apply scraped data in journalism - track down and tackle fake news, bot activity, as well as illegal, misleading, or harmful content. Dissect big news topics and analyze sentiment on the web.

  • Collect data for any kind of research - identify and follow emerging trends or topics and even predict the new ones: globally or by country and language.

Changelog

You can see all newest changes to this YouTube scraper listed in this CHANGELOG.md file.

Notes for developers on customizing the actor

Here is the typical resource usage of YouTube Scraper on the Apify platform:

Resource   Average    Max
Memory     480.3 MB   1.1 GB
CPU        53%        140%

This actor uses XPaths to find DOM elements; they are all stored in one file for easy updates. All XPath variables and functions end in 'Xp'.
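
An illustrative sketch of that convention (the selectors and names below are hypothetical, not the actor's actual ones):

// Hypothetical XPath constants; the real ones live in the actor's dedicated XPath file.
const videoTitleXp = '//*[@id="container"]/h1/yt-formatted-string';
const subscriberCountXp = '//*[@id="owner-sub-count"]';

// Helper functions working with XPaths follow the same 'Xp' suffix.
const getTextByXp = async (page, xp) => {
    const [handle] = await page.$x(xp);
    return handle ? handle.evaluate((el) => el.textContent.trim()) : null;
};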

Extend output function

The extend output function allows you to omit the output, add extra properties to the output by using the page variable, or change the shape of your output altogether:

async ({ item }) => {
    // remove information from the item
    item.details = undefined;
    // or delete item.details;
    return item;
}

async ({ item, page }) => {
    // add more info, in this case, the shortLink for the video
    const shortLink = await page.evaluate(() => {
        const link = document.querySelector('link[rel="shortlinkUrl"]');
        if (link) {
            return link.href;
        }
    });

    return {
        ...item,
        shortLink,
    }
}

async ({ item }) => {
    // omit item, just return null
    return null;
}
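
On the Apify platform, this function is provided as a string in the actor input. The field name extendOutputFunction below is an assumption based on this section's title, so check the Input Schema tab for the exact name:

{
    "searchKeywords": "Terminator dark fate",
    "proxyConfiguration": { "useApifyProxy": true },
    "extendOutputFunction": "async ({ item }) => { item.details = undefined; return item; }"
}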

Extend scraper function

The extend scraper function allows you to add functionality to the existing baseline behavior. For example, you may enqueue related videos, but not recursively:

async ({ page, request, requestQueue, customData, Apify }) => {
    if (request.userData.label === 'DETAIL' && !request.userData.isRelated) {
        await page.waitForSelector('ytd-watch-next-secondary-results-renderer');

        const related = await page.evaluate(() => {
            return [...document.querySelectorAll('ytd-watch-next-secondary-results-renderer a[href*="watch?v="]')].map(a => a.href);
        });

        for (const url of related) {
            await requestQueue.addRequest({
                url,
                userData: {
                    label: 'DETAIL',
                    isRelated: true,
                },
            });
        }
    }
}

NB: If this function throws an exception, the actor will retry the URL it was visiting.

Acknowledgments and personal data

This scraper handles cookie and privacy consent dialogs on your behalf. Therefore, you should be aware that the results from your YouTube scraping might contain personal data.

Personal data is protected by GDPR (EU Regulation 2016/679), and by other regulations around the world. You should not scrape personal data unless you have a legitimate reason to do so.

If you're unsure whether your reason is legitimate, consult your lawyers. You can also read our blog post on the legality of web scraping.

Other video and social media scrapers

We have other video-related scrapers in stock for you; to see more of those, check out the Video Category in Apify Store or the compilation of Social Media Scrapers.

Your feedback

We’re always working on improving the performance of our actors. So if you’ve got any technical feedback about how our YouTube API works, or simply found a bug, please create an issue on the actor's GitHub page and we’ll get to it.