Missing transcripts #150

xenova · 2022-04-29T21:18:37Z

When fetching transcripts for https://www.youtube.com/watch?v=gdsUKphmB3Y, I only get a subset of the available transcripts.

Using library:

>>> from youtube_transcript_api import YouTubeTranscriptApi
>>> video_id = 'gdsUKphmB3Y'
>>> transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)
>>> for t in transcript_list:
...     print(t)
... 
en ("English - DTVCC1")[TRANSLATABLE]
rm ("Romansh - DTVCC3")
en ("English (auto-generated)")[TRANSLATABLE]

On YouTube:

jdepoix · 2022-05-05T08:32:52Z

Hi @xenova, thanks for reporting this.
The problem here seems to be that both English - CC1 and English - DTVCC1 use the language code en (same for the Romansh). The TranscriptList object holds the transcripts in a dict where the language code is the key. Therefore, there can only be one transcript per language code in that dict. I wasn't aware that multiple transcripts using the same language code is a thing 😱
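
To illustrate the collision described above, here is a minimal sketch, assuming transcripts are stored in a plain dict keyed by language code. The sample pairs are illustrative only (the names are taken from this thread, not from real API output), and whether the library keeps the first or the last duplicate is not confirmed here.

```python
# Illustrative (language_code, name) pairs; not real API output.
available = [
    ("en", "English - CC1"),
    ("en", "English - DTVCC1"),
    ("rm", "Romansh - DTVCC3"),
]

by_language = {}
for code, name in available:
    # Assigning to an existing key silently replaces the earlier entry.
    by_language[code] = name

print(by_language)
# {'en': 'English - DTVCC1', 'rm': 'Romansh - DTVCC3'}
```

Three transcripts go in, but only two survive, because both English entries share the key 'en'.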

I am afraid this can't be fixed without introducing breaking changes, as we apparently can no longer consider the language code a reliable identifier. Have you encountered multiple instances of this happening? How big of a problem is this? 🤔

xenova · 2022-05-05T08:35:40Z

Oh wow that's quite surprising!

I have downloaded more than 1 million transcripts for an ML project I'm working on (https://www.github.com/xenova/sponsorblock-ml) and only ran into this problem once, so it is most likely not that big of an issue.

jdepoix · 2022-05-05T09:02:11Z

Thanks for reporting back! It's good to know that this isn't too much of an issue. I have never encountered it myself, although I have scraped quite a few transcripts.

I might just leave this as is. To fix this, we would have to return a list from all calls which currently retrieve a single transcript, to account for the unlikely event that multiple transcripts are returned. This would require quite a bit of rewriting and would most likely break a lot of code depending on this module. The other option is to retrieve transcripts for a given language using its vssId. However, this seems far more impractical, as it would require the user of this module to first find out the vssId of the language they are looking for.

I guess the only practical option is adding the vssId as an optional param to fetch, or a separate fetchByVssId method, which would at least provide a way to work around this in case you are encountering this issue. This still requires a bit of rewriting, as the TranscriptList can no longer use dicts internally. The fetch method could then throw an exception when it is asked to retrieve a transcript for a language code it has multiple transcripts for, to let the user know that vssId must be used here.
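
The proposal above could be sketched roughly as follows. This is not the library's code: `TranscriptStub`, `TranscriptListSketch`, `AmbiguousLanguageCode`, and the vssId strings are all hypothetical names and values chosen for illustration, and the only assumption is that each transcript carries a language code and a vssId.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TranscriptStub:
    """Hypothetical stand-in for a transcript entry."""
    language_code: str
    vss_id: str
    name: str


class AmbiguousLanguageCode(Exception):
    """Hypothetical error: a language code matches several transcripts."""


class TranscriptListSketch:
    def __init__(self, transcripts: List[TranscriptStub]):
        # A list instead of a dict, so duplicate language codes are kept.
        self._transcripts = list(transcripts)

    def fetch(self, language_code: str, vss_id: Optional[str] = None) -> TranscriptStub:
        matches = [t for t in self._transcripts if t.language_code == language_code]
        if vss_id is not None:
            matches = [t for t in matches if t.vss_id == vss_id]
        if not matches:
            raise KeyError(language_code)
        if len(matches) > 1:
            ids = ", ".join(t.vss_id for t in matches)
            raise AmbiguousLanguageCode(
                f"{language_code!r} matches several transcripts; pass vss_id (one of: {ids})"
            )
        return matches[0]


# The vssId values below are made up for the example.
transcripts = TranscriptListSketch([
    TranscriptStub("en", "id-en-1", "English - CC1"),
    TranscriptStub("en", "id-en-2", "English - DTVCC1"),
    TranscriptStub("rm", "id-rm-1", "Romansh - DTVCC3"),
])
```

With this shape, `transcripts.fetch("rm")` keeps working as before, `transcripts.fetch("en")` raises `AmbiguousLanguageCode`, and `transcripts.fetch("en", vss_id="id-en-2")` resolves the ambiguity, so existing callers only break in the rare duplicate case.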

Any thoughts on this?

xenova · 2022-05-05T12:49:26Z

Right, this is definitely a simple problem with an anything-but-simple solution.

As you mentioned, the most important thing is not to break code which depends on this module, so your second option seems quite practical.

I have seen implementations (in Django I believe) of a "MultiDict" (or something like that) which acts exactly like a dictionary (allowing for indexing), but allows for duplicate keys. This is normally implemented by mapping keys to a list, and when indexing, you just return the first element.

Another way to implement this is with an auxiliary dictionary that maps each key to the index of its first appearance (so that you can still index normally), while still allowing iteration over the container if you need a specific item.
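
A hand-rolled sketch of the list-backed idea described above might look like this. `FirstValueMultiDict` and its method names are hypothetical; this is not the Django class, and its behaviour is only what the paragraph describes (indexing returns the first stored value).

```python
class FirstValueMultiDict:
    """Maps each key to a list of values; indexing returns the first one."""

    def __init__(self):
        self._values = {}

    def add(self, key, value):
        # Append rather than overwrite, so duplicates are preserved.
        self._values.setdefault(key, []).append(value)

    def __getitem__(self, key):
        # Raises KeyError for unknown keys, like a normal dict.
        return self._values[key][0]

    def getlist(self, key):
        # Returns a copy of every value stored under the key.
        return list(self._values.get(key, []))


languages = FirstValueMultiDict()
languages.add("en", "English - CC1")
languages.add("en", "English - DTVCC1")
languages.add("rm", "Romansh - DTVCC3")
```

Here `languages["en"]` still returns a single value, while `languages.getlist("en")` exposes both English entries for callers that need them.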

For example, you could have a multidict:
d = { 'a': 1, 'b': 2, 'a': 3 }
such that d['a'] returns 1 and d['b'] returns 2. As mentioned above, this would be implemented by storing a list of values x = [1, 2, 3] and a dictionary y = {'a': 0, 'b': 1} that maps each key to the zero-based index of its first appearance, such that d['a'] = x[y['a']] = x[0] = 1.
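
A minimal sketch of this second approach, using zero-based positions into the list. `IndexedMultiDict` and its method names are hypothetical and only illustrate the idea of a value list plus a first-appearance index.

```python
class IndexedMultiDict:
    """Keeps every (key, value) pair; indexing returns the first value per key."""

    def __init__(self, pairs=()):
        self._items = []   # all (key, value) pairs in insertion order
        self._first = {}   # key -> zero-based position of its first appearance
        for key, value in pairs:
            self.add(key, value)

    def add(self, key, value):
        # Only record the position the first time a key is seen.
        self._first.setdefault(key, len(self._items))
        self._items.append((key, value))

    def __getitem__(self, key):
        return self._items[self._first[key]][1]

    def items(self):
        # Iterate over every pair, including duplicates.
        return iter(self._items)


d = IndexedMultiDict([("a", 1), ("b", 2), ("a", 3)])
print(d["a"], d["b"])   # 1 2
print(list(d.items()))  # [('a', 1), ('b', 2), ('a', 3)]
```

Normal indexing behaves like a dict, and the duplicate ('a', 3) remains reachable by iterating over `items()`.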

jdepoix added the bug label May 5, 2022

This was referenced Jul 20, 2023

Fix Via's default YouTube transcript selection hypothesis/via#1110

Closed

Save transcripts in the DB hypothesis/via#927

Closed

GorujoCY mentioned this issue Feb 4, 2024

Download YouTube-specific subtitle format called SRV3 #245

Open
