Missing transcripts #150

Open
xenova opened this issue Apr 29, 2022 · 4 comments
Labels
bug Something isn't working

Comments

@xenova
Contributor

xenova commented Apr 29, 2022

When fetching transcripts for https://www.youtube.com/watch?v=gdsUKphmB3Y, I only get a subset of the available transcripts.

Using library:

>>> from youtube_transcript_api import YouTubeTranscriptApi
>>> video_id = 'gdsUKphmB3Y'
>>> transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)
>>> for t in transcript_list:
...     print(t)
... 
en ("English - DTVCC1")[TRANSLATABLE]
rm ("Romansh - DTVCC3")
en ("English (auto-generated)")[TRANSLATABLE]

On YouTube:
[image: screenshot of the transcripts listed on YouTube]

@jdepoix
Owner

jdepoix commented May 5, 2022

Hi @xenova, thanks for reporting this.
The problem here seems to be that both English - CC1 and English - DTVCC1 use the language code en (the same goes for Romansh). The TranscriptList object holds the transcripts in a dict where the language code is the key. Therefore, there can only be one transcript per language code in that dict. I wasn't aware that multiple transcripts using the same language code is a thing 😱
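
A minimal sketch of that collision, assuming a simplified stand-in for the real transcript objects (the `(language_code, name)` tuples and the `by_language` dict are illustrative, not the library's internals):

```python
# Simplified stand-in for transcript metadata: (language_code, name).
available = [
    ("en", "English - CC1"),
    ("en", "English - DTVCC1"),
    ("rm", "Romansh - DTVCC3"),
]

# Keying by language code: a later entry silently overwrites an earlier one.
by_language = {}
for language_code, name in available:
    by_language[language_code] = name

print(by_language)
# {'en': 'English - DTVCC1', 'rm': 'Romansh - DTVCC3'}
```

Three transcripts go in, but only two survive, which matches the missing "English - CC1" entry in the report.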

I am afraid this can't be fixed without introducing breaking changes, as we apparently can no longer consider the language code a reliable identifier. Have you encountered multiple instances of this happening? How big of a problem is this? 🤔

@xenova
Contributor Author

xenova commented May 5, 2022

Oh wow that's quite surprising!

I have downloaded more than 1 million transcripts for an ML project I'm working on (https://www.github.com/xenova/sponsorblock-ml) and only ran into this problem once, so it is most likely not that big of an issue.

@jdepoix
Owner

jdepoix commented May 5, 2022

Thanks for reporting back! It's good to know that this isn't too much of an issue. I have never encountered it myself, although I have scraped quite a few transcripts.

I might just leave this as is. To fix this, we would have to return a list from all calls which currently retrieve a single transcript, to account for the unlikely event that multiple transcripts share a language code. This would require quite a bit of rewriting and would most likely break a lot of code depending on this module. The other option is to retrieve transcripts for a given language using its vssId. However, this seems far more impractical, as it would require the user (of this module) to first find out the vssId of the language they are looking for.

I guess the only practical option is adding the vssId as an optional param to fetch, or a separate fetchByVssId method, which would at least provide a way to work around this in case you are encountering this issue. This still requires a bit of rewriting, as the TranscriptList can no longer use dicts internally. The fetch method could then throw an exception when it is asked to retrieve a transcript for a language code it has multiple transcripts for, to let the user know that vssId must be used here.
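
A rough sketch of what that lookup could look like. Everything here is hypothetical: `SketchTranscript`, `SketchTranscriptList`, `AmbiguousLanguageCode`, the `find` method, and the `vss_id` placeholder values are made up for illustration and are not the module's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchTranscript:
    language_code: str
    name: str
    vss_id: str  # placeholder values below, not real vssIds


class AmbiguousLanguageCode(Exception):
    """Raised when a language code matches more than one transcript."""


class SketchTranscriptList:
    def __init__(self, transcripts):
        # A list (not a dict) keeps every transcript, duplicates included.
        self._transcripts = list(transcripts)

    def find(self, language_code, vss_id=None):
        matches = [t for t in self._transcripts
                   if t.language_code == language_code]
        if vss_id is not None:
            matches = [t for t in matches if t.vss_id == vss_id]
        if not matches:
            raise KeyError(language_code)
        if len(matches) > 1:
            # Tell the caller that vss_id is needed to disambiguate.
            raise AmbiguousLanguageCode(
                f"{len(matches)} transcripts use {language_code!r}; "
                "pass vss_id to pick one"
            )
        return matches[0]


transcripts = SketchTranscriptList([
    SketchTranscript("en", "English - CC1", "placeholder-1"),
    SketchTranscript("en", "English - DTVCC1", "placeholder-2"),
    SketchTranscript("rm", "Romansh - DTVCC3", "placeholder-3"),
])
```

With this shape, `transcripts.find("rm")` keeps working as before, `transcripts.find("en")` raises `AmbiguousLanguageCode`, and `transcripts.find("en", vss_id="placeholder-2")` returns the DTVCC1 transcript, so existing callers only break in the rare ambiguous case.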

Any thoughts on this?

@jdepoix added the bug label on May 5, 2022
@xenova
Contributor Author

xenova commented May 5, 2022

Right, this is definitely a simple problem with an anything-but-simple solution.

As you mentioned, the most important thing is not to break code that depends on this module, so your second option seems quite practical.

I have seen implementations (in Django, I believe) of a "MultiDict" (or something like that) which acts exactly like a dictionary (allowing for indexing), but allows for duplicate keys. This is normally implemented by mapping keys to a list, and when indexing, you just return the first element.
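
A minimal sketch of that idea, not tied to any particular library's implementation (the class name `FirstValueMultiDict` and its `add`/`getlist` methods are made up here):

```python
class FirstValueMultiDict:
    """Maps each key to a list of values; indexing returns the first one."""

    def __init__(self):
        self._data = {}

    def add(self, key, value):
        # Append instead of overwrite, so duplicate keys keep all values.
        self._data.setdefault(key, []).append(value)

    def __getitem__(self, key):
        return self._data[key][0]

    def getlist(self, key):
        # Copy, so callers cannot mutate the internal list.
        return list(self._data.get(key, []))


md = FirstValueMultiDict()
md.add("en", "English - CC1")
md.add("en", "English - DTVCC1")

print(md["en"])          # English - CC1
print(md.getlist("en"))  # ['English - CC1', 'English - DTVCC1']
```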

Another way to implement it is with an auxiliary dictionary that maps each key to the index of its first appearance (so that you can still index normally), while still allowing you to iterate over the container if you need a specific item.

For example, you could have a multidict (conceptually; a plain Python dict literal would keep only the last 'a'):
d = { 'a': 1, 'b': 2, 'a': 3 }
such that d['a'] returns 1 and d['b'] returns 2. As mentioned above, this would be implemented by storing a list of values x = [1, 2, 3] and a dictionary of first indices y = {'a': 0, 'b': 1}, such that d['a'] = x[y['a']] = x[0] = 1.
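
A small sketch of this index-based variant, using 0-based positions for the first appearance of each key (the class `IndexedMultiDict` and its method names are illustrative only):

```python
class IndexedMultiDict:
    """Stores all (key, value) pairs; indexing returns the first value per key."""

    def __init__(self, items=()):
        self._keys = []         # keys in insertion order, duplicates kept
        self._values = []       # values in insertion order
        self._first_index = {}  # key -> index of its first appearance
        for key, value in items:
            self.add(key, value)

    def add(self, key, value):
        # Only the first appearance of a key is recorded in the index.
        self._first_index.setdefault(key, len(self._values))
        self._keys.append(key)
        self._values.append(value)

    def __getitem__(self, key):
        return self._values[self._first_index[key]]

    def items(self):
        # Iterate over every pair, including duplicate keys.
        return list(zip(self._keys, self._values))


d = IndexedMultiDict([("a", 1), ("b", 2), ("a", 3)])
print(d["a"], d["b"])  # 1 2
print(d.items())       # [('a', 1), ('b', 2), ('a', 3)]
```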
