[ie/youtube] Extract comments with or without new format #9775
Conversation
cc @shoxie007
Thanks for the effort. I had wanted to do this but was pressed for time. But what you've posted is quite elegant and effective, so it saves me the trouble. I'd like to propose some changes:
Np, thanks for the review. Original patch author @minamotorin also made a comment about the confusingly named field; I replaced the dict access with (I think) correctly typed calls.
I read and re-read @minamotorin's comment and found it intriguing:
I put this definition to the test. I loaded a video with a logged-in Youtube account, as this is the only way the field `likeCountLiked` can mean anything. I hit the like button for one or more comments, re-loaded the comments, then studied the JSON response. Here is what the fields in the JSON mean:
I tested your extractor as it currently stands and it reflected these values. So I'll repeat what I wrote in my first comment: please obtain the `like_count` from the key `likeCountA11y` instead of `likeCountNotliked`. If `likeCountNotliked` is used, an erroneous value is returned when the user passes a cookie to yt-dlp (i.e. a logged-in session) and yt-dlp obtains the comments data while logged in: the `like_count` can be one less than it should be. I tested and verified this myself with the extractor as it currently is. My suggestion: Also, I suggest expressing this:
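The precedence described above can be sketched in plain Python. This is a hypothetical illustration, not yt-dlp's actual code: `parse_a11y_count` stands in for yt-dlp's `parse_count`, and the toolbar dict shape is assumed from the discussion.

```python
import re

def parse_a11y_count(text):
    """Stand-in for yt-dlp's parse_count: pull the first integer out of
    a label like '42 likes' (returns None if no number is present)."""
    if not text:
        return None
    m = re.search(r'\d[\d,]*', str(text))
    return int(m.group(0).replace(',', '')) if m else None

def extract_like_count(toolbar):
    # likeCountA11y reflects the true total regardless of login state;
    # fall back to likeCountNotliked only when it is missing.
    count = parse_a11y_count(toolbar.get('likeCountA11y'))
    if count is None:
        count = parse_a11y_count(toolbar.get('likeCountNotliked'))
    return count

print(extract_like_count({'likeCountA11y': '42 likes', 'likeCountNotliked': '41'}))  # 42
```

With a logged-in session that has liked the comment, `likeCountNotliked` would hold 41 here while the accessibility label still reports the correct 42.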
Aha! Thanks for sorting that out. I fixed both issues.
I think the current implementation does not work with the old format. yt-dlp/yt_dlp/extractor/youtube.py Lines 3565 to 3572 in 3ef6517
However, I can't get an old-format response now, so I can't test it.
Now that you bring it up, I did get "Incomplete data received" a few times when downloading comments for some videos, even with the commentViewModel JSON response. I wonder if it had anything to do with this bit of code. I'll study and test it further myself and see what the issue is...
This fix may work:

```diff
- 'commentsHeaderRenderer' if is_first_continuation else ('commentThreadRenderer', 'commentViewModel'))]]
+ 'commentsHeaderRenderer' if is_first_continuation else ('commentThreadRenderer', 'commentViewModel', 'commentRenderer'))]]
```
I have been testing this. I wasn't able to access the toolbar payload from the comment entity directly. I've been playing around; something like this should do it:

```diff
diff --git a/yt_dlp/extractor/youtube.py b/yt_dlp/extractor/youtube.py
index b4f33e7f7..6b87ab55d 100644
--- a/yt_dlp/extractor/youtube.py
+++ b/yt_dlp/extractor/youtube.py
@@ -3306,7 +3306,7 @@ def _extract_heatmap(self, data):
             'value': ('intensityScoreNormalized', {float_or_none}),
         })) or None

-    def _extract_comment(self, view_model, entity, parent=None):
+    def _extract_comment(self, view_model, entity, entity_payloads, parent=None):
         entity_payload = traverse_obj(entity, ('payload', 'commentEntityPayload', {dict}))
         comment_id = entity_payload.get('properties').get('commentId')
@@ -3344,10 +3344,12 @@ def _extract_comment(self, view_model, entity, parent=None):
         if author_is_uploader is not None:
             info['author_is_uploader'] = author_is_uploader

-        comment_abr = traverse_obj(
-            entity, ('payload', 'engagementToolbarStateEntityPayload', 'heartState'), expected_type=str)
-        if comment_abr is not None:
-            info['is_favorited'] = comment_abr == 'TOOLBAR_HEART_STATE_HEARTED'
+        toolbar_state_key = entity_payload.get('properties', {}).get('toolbarStateKey')
+        if toolbar_state_key:
+            tool_bar_entity = next((d for d in entity_payloads if d.get('entityKey') == toolbar_state_key), None)
+            if tool_bar_entity:
+                heart_state = traverse_obj(tool_bar_entity, ('payload', 'engagementToolbarStateEntityPayload', 'heartState'))
+                info['is_favorited'] = heart_state == 'TOOLBAR_HEART_STATE_HEARTED'

         info['author_is_verified'] = traverse_obj(entity_payload, ('author', 'isVerified')) == 'true'
@@ -3470,7 +3472,7 @@ def extract_thread(contents, entity_payloads):
                     entity = entity
                     break

-                comment = self._extract_comment(view_model, entity, parent)
+                comment = self._extract_comment(view_model, entity, entity_payloads, parent)

                 if comment.get('is_pinned'):
                     tracker['pinned_comment_ids'].add(comment_id)
```

But unfortunately this requires looping through the array for each comment to find the matching entity.
In such trying situations, look to our Lord and Savior: traverse_obj. How I would re-write everything:

```py
# In _comment_entries >> extract_thread:
# --------------------------------------
# .... Existing code remains as is
comment_key = view_model.get("commentKey")
toolbar_state_key = view_model.get("toolbarStateKey")

# This usage of traverse_obj returns a list of the relevant entities
# - NOTE: This: v["entityKey"] in [comment_key, toolbar_state_key]
#   is shorthand for: v["entityKey"] == comment_key or v["entityKey"] == toolbar_state_key
entities = traverse_obj(entity_payloads, lambda _, v: v["entityKey"] in [comment_key, toolbar_state_key])

# Call _extract_comment using "entities" instead of the former "entity"
comment = self._extract_comment(view_model, entities, parent)

# Then in _extract_comment
# ------------------------
def _extract_comment(self, view_model, entities, parent=None):  # change "entity" to "entities"
    comment_entity_payload = traverse_obj(entities, (..., 'payload', 'commentEntityPayload', {dict}), get_all=False)
    toolbar_entity_payload = traverse_obj(entities, (..., 'payload', 'engagementToolbarStateEntityPayload', {dict}), get_all=False)
    # ....
    # NOTE: "entity_payload" should be changed to "comment_entity_payload" in existing code
    # ....
    if toolbar_entity_payload.get('heartState') == 'TOOLBAR_HEART_STATE_HEARTED':
        info['is_favorited'] = True
    # ....
```
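For readers without yt-dlp at hand, the `traverse_obj` call with a lambda acts as a filter over the payload array; it is roughly equivalent to this list comprehension (keys and values here are invented for illustration):

```python
comment_key = 'comment-key-1'
toolbar_state_key = 'toolbar-key-1'

entity_payloads = [
    {'entityKey': 'comment-key-1',
     'payload': {'commentEntityPayload': {'properties': {'commentId': 'abc'}}}},
    {'entityKey': 'toolbar-key-1',
     'payload': {'engagementToolbarStateEntityPayload': {'heartState': 'TOOLBAR_HEART_STATE_UNHEARTED'}}},
    {'entityKey': 'unrelated', 'payload': {}},
]

# Rough equivalent of:
#   traverse_obj(entity_payloads, lambda _, v: v['entityKey'] in [comment_key, toolbar_state_key])
entities = [v for v in entity_payloads
            if v.get('entityKey') in (comment_key, toolbar_state_key)]
print(len(entities))  # 2
```

(`traverse_obj` additionally swallows `KeyError`s inside the lambda, which the `.get()` call approximates here.)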
Are you still with us @jakeogh? Would you kindly integrate the changes proposed above?
I've tested the code on numerous videos. It's working. Let's get this pull request merged ASAP. People have been asking and wondering about the broken comments extraction.
Thanks for the ping. I have been short on time this week, but I'll be able to dig back into this tonight. I have tested this on ~1k videos (some very old) and haven't been able to trigger an issue as far as I can tell, but I haven't verified this; I need to go back over my results. Is there a way to find an ID that triggers the situation where the old `commentRenderer` format is returned?
Youtube isn't consistent about which videos it uses the commentRenderer model for; it uses it at random. My own experience is that about 10-20% of videos will use the commentRenderer model in the comments-response JSON when downloading an entire channel, for example. I think people's experience may vary depending on geo-location; I'm not sure. Without @minamotorin's fix, when I obtain the comments for an entire channel with hundreds of videos, I routinely get "Incomplete data received" errors. An example of the terminal output for such a video:
I think you may be able to replicate the issue. Run this command (using a build which doesn't have the fix suggested above) on Unix to download comments for the first 100 videos of the Whatifalthist channel. The command formats the output so that it's more readable, and prints the terminal output to a log file for later examination. It might take a while to download everything, which is why I included a log file that can be studied later. As for the second fix which I suggested in response to the issue @bbilly1 raised, it should be obvious why it's necessary: the heartState key (which reveals whether or not a comment has been hearted) is in another entity (the one containing engagementToolbarStateEntityPayload), not in the entity containing commentEntityPayload. So two entities need to be extracted and passed to _extract_comment.
@shoxie007 I really appreciate the detailed reproducer and explanation. I've added the fix from @minamotorin. Before that, however, I ran the suggested tests and was unable to reproduce the issue. My IP geolocates to AZ, USA:
A manual check of the 100 logs found nothing unusual. Checking my previous tests, I found no anomalies, so YT does not appear to be serving the old commentRenderer format to me.
@jakeogh I'll venture a guess as to why you have such a seamless experience: you're close to Youtube's main servers and have a fast, well-networked connection, so Youtube delivers the response JSONs perfectly the first time and you don't get any incomplete responses. Or it could be that Youtube is sending commentRenderer responses to certain geo-locations only, whereas in the US it only uses commentViewModel. If you're determined to replicate the issue, maybe try a proxy server in another geo-location, if you have access to one. In any case, there is no harm in applying minamotorin's fix. It makes the data-integrity check of the response JSON more thorough. Honestly, I don't know WHY it works, just that it does, when the check is: if not traverse_obj(response, *variadic(check_get_keys)):
Okay, the simple reason is as follows.
In such cases, check_get_keys = ...... else ('commentThreadRenderer', 'commentViewModel') ..... treats the response as incomplete because the old-format response does not contain commentViewModel under commentThreadRenderer. To be more specific, here is which keys are used.
The reply threads of an old-format response use commentRenderer instead. If I understand correctly, the code if not traverse_obj(response, *variadic(check_get_keys)): tests whether at least one of the given key paths is present in the response; when none is found, the response is treated as incomplete.
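As a rough illustration of that completeness check (this is not `traverse_obj`'s real path semantics; the stand-in below simply searches the whole tree for a key, and the response shapes are invented):

```python
def contains_key(obj, key):
    """Recursively check whether `key` appears anywhere in a nested
    dict/list structure (a crude stand-in for the traverse_obj lookup)."""
    if isinstance(obj, dict):
        return key in obj or any(contains_key(v, key) for v in obj.values())
    if isinstance(obj, list):
        return any(contains_key(v, key) for v in obj)
    return False

def is_complete(response, check_get_keys):
    # The response counts as complete if at least one expected key is
    # found; otherwise yt-dlp reports "Incomplete data received".
    return any(contains_key(response, key) for key in check_get_keys)

new_format = {'frameworkUpdates': {'commentViewModel': {}}}
old_format = {'contents': [{'commentThreadRenderer': {'commentRenderer': {}}}]}

# Without 'commentRenderer' in the tuple, old-format responses fail the check:
print(is_complete(old_format, ('commentViewModel',)))                    # False
print(is_complete(old_format, ('commentViewModel', 'commentRenderer')))  # True
```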
Whew! @bashonly I sincerely appreciate the detailed and easy-to-follow review. All of the requested changes have been made; the only outstanding issues are:
@jakeogh Thanks, will look into this. I'm also wondering whether @bashonly truly intends that behaviour for all those keys.
No. Sounds like a bug in old code.
Similar to above.
pukkandan's suggestion above should fix everything |
I had downloaded comments in January, before Youtube made changes and the extractor broke. I had used yt-dlp v2024.01.05.232702. This is what a yt-dlp-generated comment dict looked like:

```json
{
    "id": "Ugyv3aJcyIlwjWWWvPp4AaABAg",
    "text": "Super video!!!!!!! Me gust\u00f3!",
    "like_count": null,
    "author_id": "UCYbXPcWIN9fxGKxCELWjapQ",
    "author": "@fcp1955",
    "author_thumbnail": "https://yt3.ggpht.com/ytc/AIf8zZQBkz1GQX7uX41M8pmfiimGylV2P8rw6ctq15tMGQ=s176-c-k-c0x00ffffff-no-rj",
    "parent": "root",
    "_time_text": "1 month ago",
    "timestamp": 1703376000,
    "author_url": "https://www.youtube.com/channel/UCYbXPcWIN9fxGKxCELWjapQ",
    "author_is_uploader": false,
    "is_favorited": false
}
```
so your comment about like_count?
Yes, I meant that the old extractor used to display like_count at all times in the dict, whereas I thought you meant that the new extractor should omit keys which resolve to 0 or false.
yt_dlp/extractor/youtube.py
Outdated
```diff
-comment = self._extract_comment(view_model, entities, parent)
+comment = self._extract_comment(entities, parent)
 if comment:
     comment['is_pinned'] = traverse_obj(view_model, ('pinnedText', {str})) is not None
```
About this... does this mean that there will be an is_pinned key for each and every comment? I don't think this is necessary, since only one comment will ever be pinned. Why have this key in every single comment dict? It will bloat the final comments data.
Perhaps the same should also apply to the keys:
author_is_uploader
is_favorited
In general, the number of comments for which these values are true will be comparatively small. Therefore, why not just omit them from the comment dict if they're false?
Having said this, I'm aware that in previous versions of the extractor, these keys (except is_pinned) were included in the final dict even if they were false, so people are used to that format.
You are correct. The alternative, to omit a non-truthy is_pinned field, would be to do this:

```py
comment = self._extract_comment(entities, parent)
if comment and traverse_obj(view_model, ('pinnedText', {str})) is not None:
    comment['is_pinned'] = True
```

Though the lack of a field or a None value typically should be reserved for when we don't have that data, rather than when we know it's False.
We should decide what fields to omit if-and-only-if the value is false, on the basis that the value is very rarely true. @pukkandan what's your opinion? I gave my rationale earlier.
> the lack of a field or a None value typically should be reserved for when we don't have that data, rather than when we know it's False

this

> We should decide what fields to omit if-and-only-if the value is false, on the basis that the value is very rarely true.

It cannot be done for any field, since we want to be able to distinguish not just between True and False, but also a "data not extracted" state.

> It will bloat the final comments data.

Having "too much" data available has never been a concern we want to address. The infodict can be processed by third-party scripts if that is an issue.
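The three-way distinction described here (True, False, and data-not-extracted) can be illustrated with plain dict access; the comment dicts below are invented examples:

```python
comments = [
    {'id': 'a', 'is_favorited': True},    # known: hearted
    {'id': 'b', 'is_favorited': False},   # known: not hearted
    {'id': 'c'},                          # key absent: data not extracted
]

labels = {}
for c in comments:
    state = c.get('is_favorited')  # None when the key is missing
    if state is None:
        labels[c['id']] = 'unknown (not extracted)'
    else:
        labels[c['id']] = 'hearted' if state else 'not hearted'

print(labels['c'])  # unknown (not extracted)
```

Omitting the key when the value is merely False would collapse the 'not hearted' and 'unknown' cases into one, which is the loss of information being argued against.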
Are there any outstanding issues left to address? The recent changes work for me.
> Having "too much" data available has never been a concern we want to address. The infodict can be processed by third party scripts if that is an issue.

Agreed. While it is potentially "bloat", it's not really that much, especially since we're talking about videos, which are so much larger in size. A data-not-extracted state is helpful info to have for further/future processing (health checks or something like that), if that ever becomes a thing or someone makes a tool for it.
Looks fine to me from a quick look. Thanks all involved for helping fix this :)
IMPORTANT: PRs without the template will be CLOSED
Description of your pull request and other information
Youtube's comment format changed; this pull attempts to handle the new case, where the key frameworkUpdates is present, as well as the case without it. The original patch is from @minamotorin, #9358 (comment). Fixes #9358.
Template
Before submitting a pull request make sure you have:
In order to be accepted and merged into yt-dlp each piece of code must be in public domain or released under Unlicense. Check all of the following options that apply:
What is the purpose of your pull request?