Bug in listing all available subtitle tracks #288

Open
Angel756984 opened this issue May 18, 2024 · 0 comments

Hi, I've discovered a bug in listing all available subtitle tracks for videos that have more than one manually created transcript with the same language code. I'm using the latest Python on the latest Windows with PyCharm, but the environment doesn't matter: the result is exactly the same on online services like repl.it.

As an example, you can test with the video "Trump campaign sets sights on another deep-blue state", and more generally with any video published on the Fox News channel. The same applies to quite a few channels of broadcast networks, since their videos all have the following caption tracks:

image

You can run the code below:

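from youtube_transcript_api import YouTubeTranscriptApi

# List every transcript the library reports for the example video
subs = YouTubeTranscriptApi.list_transcripts('STjvfE4HVXY')
for sub in subs:
    # Print the language code, whether the track is auto-generated, and its display name
    print(f'code:<{sub.language_code}> auto:<{sub.is_generated}> lang:<{sub.language}>')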

to obtain the following result:

image

This is the PyCharm debug view:

image

As you can see, track CC1 is missing from the list of tracks available for the video. I believe this is because, among the manually created tracks, both CC1 and DTVCC1 have the same language code 'en', and the language code is the only key used for the dictionary (the only separation is between auto-generated and manually created tracks). Track CC1 is listed before track DTVCC1, as shown below in the JSON extracted from the HTML video page, so the code first saves CC1 under the key 'en' among the manually created tracks. When it later finds DTVCC1, also manually created and also with code 'en', it overwrites CC1, which then no longer appears.

  "captions": {
    "playerCaptionsTracklistRenderer": {
      "captionTracks": [
        {
          "baseUrl": "...",
          "name": {
            "simpleText": "Inglese (generati automaticamente)"
          },
          "vssId": "a.en",
          "languageCode": "en",
          "kind": "asr",
          "isTranslatable": true,
          "trackName": ""
        },
        {
          "baseUrl": "...",
          "name": {
            "simpleText": "Inglese - CC1"
          },
          "vssId": ".en.uYU-mmqFLq8",
          "languageCode": "en",
          "isTranslatable": true,
          "trackName": "CC1"
        },
        {
          "baseUrl": "...",
          "name": {
            "simpleText": "Inglese - DTVCC1"
          },
          "vssId": ".en.JkeT_87f4cc",
          "languageCode": "en",
          "isTranslatable": true,
          "trackName": "DTVCC1"
        },
        {
          "baseUrl": "...",
          "name": {
            "simpleText": "Inglese (Stati Uniti)"
          },
          "vssId": ".en-US",
          "languageCode": "en-US",
          "isTranslatable": true,
          "trackName": ""
        }
      ],
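
The overwrite I suspect can be reproduced in isolation with plain dictionaries. This is only a minimal sketch of the behavior described above, not the library's actual code: the `caption_tracks` list is a trimmed copy of the JSON above, and the names `manually_created` and `generated` are my own assumptions.

```python
# Trimmed copy of the captionTracks entries from the JSON above
# (baseUrl, name, vssId and isTranslatable omitted for brevity).
caption_tracks = [
    {"languageCode": "en", "kind": "asr", "trackName": ""},
    {"languageCode": "en", "trackName": "CC1"},
    {"languageCode": "en", "trackName": "DTVCC1"},
    {"languageCode": "en-US", "trackName": ""},
]

# Hypothetical stand-ins for the two dictionaries: one for
# auto-generated tracks, one for manually created tracks.
manually_created = {}
generated = {}

for track in caption_tracks:
    target = generated if track.get("kind") == "asr" else manually_created
    # Keyed by language code only, so a later track with the same
    # code silently replaces the earlier one.
    target[track["languageCode"]] = track["trackName"]

print(manually_created)  # {'en': 'DTVCC1', 'en-US': ''}
```

Only DTVCC1 survives under 'en', which matches what the library prints for this video.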

I believe this bug can be solved by adding 'trackName' to the key of the tracks dictionary, so that keys such as 'en-CC1' and 'en-DTVCC1' would no longer cause the loss of manually created tracks that share the same language code.

Thank you, and please let me know.
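
A rough sketch of that idea, again with plain dictionaries rather than the library's real classes. The helper `track_key` and the exact key format are hypothetical; the maintainers may well prefer a different scheme.

```python
# Trimmed copy of the captionTracks entries from the JSON above.
caption_tracks = [
    {"languageCode": "en", "kind": "asr", "trackName": ""},
    {"languageCode": "en", "trackName": "CC1"},
    {"languageCode": "en", "trackName": "DTVCC1"},
    {"languageCode": "en-US", "trackName": ""},
]


def track_key(track):
    """Hypothetical key: the language code, plus the trackName when one is set."""
    name = track.get("trackName", "")
    return f"{track['languageCode']}-{name}" if name else track["languageCode"]


manually_created = {}
for track in caption_tracks:
    if track.get("kind") == "asr":
        continue  # auto-generated tracks would go to their own dictionary
    manually_created[track_key(track)] = track

print(list(manually_created))  # ['en-CC1', 'en-DTVCC1', 'en-US']
```

With this key both CC1 and DTVCC1 are kept. One caveat: a string key like 'en-CC1' looks similar to a regional code such as 'en-US', so a tuple key like `(languageCode, trackName)` might be a safer variant of the same idea.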
