[ie/youtube] Extract comments with or without new format #9775
Conversation
cc @shoxie007
Thanks for the effort. I had wanted to do this but was pressed for time. But what you've posted is quite elegant and effective, so it saves me the trouble. I'd like to propose some changes:
Np, thanks for the review. Original patch author @minamotorin also made a comment about the confusingly named field; I replaced the dict access with (I think) correctly typed calls.
I read and re-read @minamotorin's comment and found it intriguing:
I put this definition to the test. I loaded a video with a logged-in Youtube account, as this is the only way the field `likeCountLiked` can mean anything. I hit the like button for one or more comments, re-loaded the comments, then studied the JSON response. Here is what the fields in the JSON mean:
I tested your extractor as it currently stands and it reflected these values. So I'll repeat what I wrote in my first comment: please obtain the `like_count` from the key `likeCountA11y` instead of `likeCountNotliked`. If `likeCountNotliked` is used, an erroneous value is returned when the user passes a cookie to yt-dlp (i.e. a logged-in session) and yt-dlp obtains the comments data while logged in: the `like_count` can be one less than it should be. I tested and verified this myself with the extractor as it currently is. My suggestion: Also, I suggest expressing this:
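The precedence described above can be sketched in plain Python. This is a hypothetical illustration, not yt-dlp's actual code: `parse_a11y_count` stands in for yt-dlp's `parse_count`, and the toolbar dict shape is assumed from the discussion.

```python
import re

def parse_a11y_count(text):
    """Stand-in for yt-dlp's parse_count: pull the first integer out of
    a label like '42 likes' (returns None if no number is present)."""
    if not text:
        return None
    m = re.search(r'\d[\d,]*', str(text))
    return int(m.group(0).replace(',', '')) if m else None

def extract_like_count(toolbar):
    # likeCountA11y reflects the true total regardless of login state;
    # fall back to likeCountNotliked only when it is missing.
    count = parse_a11y_count(toolbar.get('likeCountA11y'))
    if count is None:
        count = parse_a11y_count(toolbar.get('likeCountNotliked'))
    return count

print(extract_like_count({'likeCountA11y': '42 likes', 'likeCountNotliked': '41'}))  # 42
```

With a logged-in session that has liked the comment, `likeCountNotliked` would hold 41 here while the accessibility label still reports the correct 42.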
Aha! Thanks for sorting that out. I fixed both issues.
I think the current implementation does not work with the old format. yt-dlp/yt_dlp/extractor/youtube.py Lines 3565 to 3572 in 3ef6517
However, I can't get an old-format response now, so I can't test it.
Now that you bring it up, I did get "Incomplete data received" a few times when downloading comments for some videos, even with the commentViewModel JSON response. I wonder if it had anything to do with this bit of code. I'll study and test it further myself and see what the issue is...
This fix may work:

```diff
- 'commentsHeaderRenderer' if is_first_continuation else ('commentThreadRenderer', 'commentViewModel'))]]
+ 'commentsHeaderRenderer' if is_first_continuation else ('commentThreadRenderer', 'commentViewModel', 'commentRenderer'))]]
```
I have been testing this. I wasn't able to access the toolbar payload from the comment entity directly. I've been playing around; something like this should do it:

```diff
diff --git a/yt_dlp/extractor/youtube.py b/yt_dlp/extractor/youtube.py
index b4f33e7f7..6b87ab55d 100644
--- a/yt_dlp/extractor/youtube.py
+++ b/yt_dlp/extractor/youtube.py
@@ -3306,7 +3306,7 @@ def _extract_heatmap(self, data):
             'value': ('intensityScoreNormalized', {float_or_none}),
         })) or None

-    def _extract_comment(self, view_model, entity, parent=None):
+    def _extract_comment(self, view_model, entity, entity_payloads, parent=None):
         entity_payload = traverse_obj(entity, ('payload', 'commentEntityPayload', {dict}))
         comment_id = entity_payload.get('properties').get('commentId')
@@ -3344,10 +3344,12 @@ def _extract_comment(self, view_model, entity, parent=None):
         if author_is_uploader is not None:
             info['author_is_uploader'] = author_is_uploader

-        comment_abr = traverse_obj(
-            entity, ('payload', 'engagementToolbarStateEntityPayload', 'heartState'), expected_type=str)
-        if comment_abr is not None:
-            info['is_favorited'] = comment_abr == 'TOOLBAR_HEART_STATE_HEARTED'
+        toolbar_state_key = entity_payload.get('properties', {}).get('toolbarStateKey')
+        if toolbar_state_key:
+            tool_bar_entity = next((d for d in entity_payloads if d.get('entityKey') == toolbar_state_key), None)
+            if tool_bar_entity:
+                heart_state = traverse_obj(tool_bar_entity, ('payload', 'engagementToolbarStateEntityPayload', 'heartState'))
+                info['is_favorited'] = heart_state == 'TOOLBAR_HEART_STATE_HEARTED'

         info['author_is_verified'] = traverse_obj(entity_payload, ('author', 'isVerified')) == 'true'
@@ -3470,7 +3472,7 @@ def extract_thread(contents, entity_payloads):
                     entity = entity
                     break

-                comment = self._extract_comment(view_model, entity, parent)
+                comment = self._extract_comment(view_model, entity, entity_payloads, parent)

                 if comment.get('is_pinned'):
                     tracker['pinned_comment_ids'].add(comment_id)
```

But unfortunately this requires looping through the array for each comment to find the matching entity.
In such trying situations, look to our Lord and Savior: traverse_obj. How I would re-write everything:

```py
# In _comment_entries >> extract_thread:
# --------------------------------------
# .... Existing code remains as is
comment_key = view_model.get("commentKey")
toolbar_state_key = view_model.get("toolbarStateKey")

# This usage of traverse_obj returns a list of the relevant entities
# - NOTE: This: v["entityKey"] in [comment_key, toolbar_state_key]
#   is shorthand for: v["entityKey"] == comment_key or v["entityKey"] == toolbar_state_key
entities = traverse_obj(entity_payloads, lambda _, v: v["entityKey"] in [comment_key, toolbar_state_key])

# Call _extract_comment using "entities" instead of the former "entity"
comment = self._extract_comment(view_model, entities, parent)

# Then in _extract_comment
# ------------------------
def _extract_comment(self, view_model, entities, parent=None):  # change "entity" to "entities"
    comment_entity_payload = traverse_obj(entities, (..., 'payload', 'commentEntityPayload', {dict}), get_all=False)
    toolbar_entity_payload = traverse_obj(entities, (..., 'payload', 'engagementToolbarStateEntityPayload', {dict}), get_all=False)
    # ....
    # NOTE: "entity_payload" should be changed to "comment_entity_payload" in existing code
    # ....
    if toolbar_entity_payload.get('heartState') == 'TOOLBAR_HEART_STATE_HEARTED':
        info['is_favorited'] = True
    # ....
```
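For readers without yt-dlp at hand, the `traverse_obj` call with a lambda acts as a filter over the payload array; it is roughly equivalent to this list comprehension (keys and values here are invented for illustration):

```python
comment_key = 'comment-key-1'
toolbar_state_key = 'toolbar-key-1'

entity_payloads = [
    {'entityKey': 'comment-key-1',
     'payload': {'commentEntityPayload': {'properties': {'commentId': 'abc'}}}},
    {'entityKey': 'toolbar-key-1',
     'payload': {'engagementToolbarStateEntityPayload': {'heartState': 'TOOLBAR_HEART_STATE_UNHEARTED'}}},
    {'entityKey': 'unrelated', 'payload': {}},
]

# Rough equivalent of:
#   traverse_obj(entity_payloads, lambda _, v: v['entityKey'] in [comment_key, toolbar_state_key])
entities = [v for v in entity_payloads
            if v.get('entityKey') in (comment_key, toolbar_state_key)]
print(len(entities))  # 2
```

(`traverse_obj` additionally swallows `KeyError`s inside the lambda, which the `.get()` call approximates here.)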
Are you still with us @jakeogh? Would you kindly integrate the changes proposed above?
I've tested the code on numerous videos. It's working. Let's get this pull request merged ASAP. People have been asking and wondering about the broken comments extraction.
Thanks for the ping. I have been short on time this week, but I'll be able to dig back into this tonight. I have tested this on ~1k videos (some very old) and haven't been able to trigger an issue as far as I can tell, but I haven't verified this; I need to go back over my results. Is there a way to find an ID that triggers the situation where the old `commentRenderer` format is returned?
Youtube isn't consistent about which videos it uses the commentRenderer model for; it uses it at random. My own experience is that about 10-20% of videos will use the commentRenderer model in the comments-response JSON when downloading an entire channel, for example. I think people's experience may vary depending on geo-location; I'm not sure. Without @minamotorin's fix, when I obtain the comments for an entire channel with hundreds of videos, I routinely get "Incomplete data received" errors. An example of the terminal output for such a video:
I think you may be able to replicate the issue. Run this command (using a build which doesn't have the fix suggested above) on Unix to download comments for the first 100 videos of the Whatifalthist channel. The command formats the output so that it's more readable, and prints the terminal output to a log file for later examination. It might take a while to download everything, which is why I included a log file that can be studied later. As for the second fix which I suggested in response to the issue @bbilly1 raised, it should be obvious why it's necessary: the heartState key (which reveals whether or not a comment has been hearted) is in another entity (the one containing engagementToolbarStateEntityPayload), not in the entity containing commentEntityPayload. So two entities need to be extracted and passed to _extract_comment.
@shoxie007 I really appreciate the detailed reproducer and explanation. I've added the fix from @minamotorin. Before that, however, I ran the suggested tests and was unable to reproduce the issue. My IP geolocates to AZ, USA:
A manual check of the 100 logs found nothing unusual. Checking my previous tests, I found no anomalies, so YT does not appear to be serving the old commentRenderer format to me.
@jakeogh I'll venture a guess as to why you have such a seamless experience: you're close to Youtube's main servers and have a fast, well-networked connection, so Youtube delivers the response JSONs perfectly the first time and you don't get any incomplete responses. Or it could be that Youtube is sending commentRenderer responses to certain geo-locations only, whereas in the US it only uses commentViewModel. If you're determined to replicate the issue, maybe try a proxy server in another geo-location, if you have access to one. In any case, there is no harm in applying minamotorin's fix. It makes the data-integrity check of the response JSON more thorough. Honestly, I don't know WHY it works, just that it does, when the check is: if not traverse_obj(response, *variadic(check_get_keys)):
Okay, the simple reason is as follows.
In such cases, check_get_keys = ...... else ('commentThreadRenderer', 'commentViewModel') ..... treats the response as incomplete because the old-format response does not contain commentViewModel under commentThreadRenderer. To be more specific, here is which keys are used.
The reply threads of an old-format response use commentRenderer instead. If I understand correctly, the code if not traverse_obj(response, *variadic(check_get_keys)): tests whether at least one of the given key paths is present in the response; when none is found, the response is treated as incomplete.
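As a rough illustration of that completeness check (this is not `traverse_obj`'s real path semantics; the stand-in below simply searches the whole tree for a key, and the response shapes are invented):

```python
def contains_key(obj, key):
    """Recursively check whether `key` appears anywhere in a nested
    dict/list structure (a crude stand-in for the traverse_obj lookup)."""
    if isinstance(obj, dict):
        return key in obj or any(contains_key(v, key) for v in obj.values())
    if isinstance(obj, list):
        return any(contains_key(v, key) for v in obj)
    return False

def is_complete(response, check_get_keys):
    # The response counts as complete if at least one expected key is
    # found; otherwise yt-dlp reports "Incomplete data received".
    return any(contains_key(response, key) for key in check_get_keys)

new_format = {'frameworkUpdates': {'commentViewModel': {}}}
old_format = {'contents': [{'commentThreadRenderer': {'commentRenderer': {}}}]}

# Without 'commentRenderer' in the tuple, old-format responses fail the check:
print(is_complete(old_format, ('commentViewModel',)))                    # False
print(is_complete(old_format, ('commentViewModel', 'commentRenderer')))  # True
```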
Whew! @bashonly I sincerely appreciate the detailed and easy-to-follow review. All of the requested changes have been made; the only outstanding issues are:
@jakeogh Thanks, will look into this. I'm also wondering whether @bashonly truly intends that behaviour for all those keys.
No. Sounds like a bug in old code.
Similar to above.
pukkandan's suggestion above should fix everything |
I had downloaded comments in January, before Youtube made changes and the extractor broke. I had used yt-dlp v2024.01.05.232702. This is what a yt-dlp-generated comment dict looked like:

```json
{
    "id": "Ugyv3aJcyIlwjWWWvPp4AaABAg",
    "text": "Super video!!!!!!! Me gust\u00f3!",
    "like_count": null,
    "author_id": "UCYbXPcWIN9fxGKxCELWjapQ",
    "author": "@fcp1955",
    "author_thumbnail": "https://yt3.ggpht.com/ytc/AIf8zZQBkz1GQX7uX41M8pmfiimGylV2P8rw6ctq15tMGQ=s176-c-k-c0x00ffffff-no-rj",
    "parent": "root",
    "_time_text": "1 month ago",
    "timestamp": 1703376000,
    "author_url": "https://www.youtube.com/channel/UCYbXPcWIN9fxGKxCELWjapQ",
    "author_is_uploader": false,
    "is_favorited": false
}
```
so your comment about like_count?
Yes, I meant that the old extractor used to display like_count at all times in the dict, whereas I thought you meant that the new extractor should omit keys which resolve to 0 or false.
yt_dlp/extractor/youtube.py
Outdated
```diff
-comment = self._extract_comment(view_model, entities, parent)
+comment = self._extract_comment(entities, parent)
 if comment:
     comment['is_pinned'] = traverse_obj(view_model, ('pinnedText', {str})) is not None
```
About this... does this mean that there will be an is_pinned key for each and every comment? I don't think this is necessary, since only one comment will ever be pinned. Why have this key in every single comment dict? It will bloat the final comments data.
Perhaps the same should also apply to the keys:
author_is_uploader
is_favorited
In general, the number of comments for which these values are true will be comparatively small. Therefore, why not just omit them from the comment dict if they're false?
Having said this, I'm aware that in previous versions of the extractor, these keys (except is_pinned) were included in the final dict even if they were false, so people are used to that format.
You are correct. The alternative, to omit a non-truthy is_pinned field, would be to do this:

```py
comment = self._extract_comment(entities, parent)
if comment and traverse_obj(view_model, ('pinnedText', {str})) is not None:
    comment['is_pinned'] = True
```

Though the lack of a field or a None value typically should be reserved for when we don't have that data, rather than when we know it's False.
We should decide what fields to omit if-and-only-if the value is false, on the basis that the value is very rarely true. @pukkandan what's your opinion? I gave my rationale earlier.
> the lack of a field or a None value typically should be reserved for when we don't have that data, rather than when we know it's False

this

> We should decide what fields to omit if-and-only-if the value is false, on the basis that the value is very rarely true.

It cannot be done for any field, since we want to be able to distinguish not just between True and False, but also a "data not extracted" state.

> It will bloat the final comments data.

Having "too much" data available has never been a concern we want to address. The infodict can be processed by third-party scripts if that is an issue.
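The three-way distinction described here (True, False, and data-not-extracted) can be illustrated with plain dict access; the comment dicts below are invented examples:

```python
comments = [
    {'id': 'a', 'is_favorited': True},    # known: hearted
    {'id': 'b', 'is_favorited': False},   # known: not hearted
    {'id': 'c'},                          # key absent: data not extracted
]

labels = {}
for c in comments:
    state = c.get('is_favorited')  # None when the key is missing
    if state is None:
        labels[c['id']] = 'unknown (not extracted)'
    else:
        labels[c['id']] = 'hearted' if state else 'not hearted'

print(labels['c'])  # unknown (not extracted)
```

Omitting the key when the value is merely False would collapse the 'not hearted' and 'unknown' cases into one, which is the loss of information being argued against.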
Are there any outstanding issues left to address? The recent changes work for me.
> Having "too much" data available has never been a concern we want to address. The infodict can be processed by third party scripts if that is an issue.

Agreed. While it is potentially "bloat", it's not really that much, especially since we're talking about videos, which are so much larger in size. A data-not-extracted state is helpful info to have for further/future processing (health checks or something like that), if that ever becomes a thing or someone makes a tool for it.
Looks fine to me from a quick look. Thanks all involved for helping fix this :)
IMPORTANT: PRs without the template will be CLOSED
Description of your pull request and other information
Youtube's comment format changed; this pull attempts to handle the new case, where the key frameworkUpdates is present, as well as the case without it. The original patch is from @minamotorin, #9358 (comment). Fixes #9358.
Template
Before submitting a pull request make sure you have:
In order to be accepted and merged into yt-dlp each piece of code must be in public domain or released under Unlicense. Check all of the following options that apply:
What is the purpose of your pull request?