Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[post-processing] Refactor MoveFilesPP to respect non-video files #9774

Open
wants to merge 14 commits into
base: master
Choose a base branch
from

Conversation

kieraneglin
Copy link

@kieraneglin kieraneglin commented Apr 23, 2024

Description of your pull request and other information

Related to #9445 along with this comment #9445 (comment)

Issue:

The MoveFilesAfterDownloadPP pre-processor moves all files as expected, but it only updates the filepath for the main media file in the info dict. This means that file moves related to thumbnails and subtitles do take place, but the filepath for these entries in the info map isn't updated.

Example:

yt-dlp https://www.youtube.com/watch?v=UfmkgQRmmeE \
  --write-thumbnail \
  --output "%(title)s.%(ext)s" \
  --output "thumbnail:%(title)s-thumb.%(ext)s" \
  --print-to-file "after_move:%()j" output.json

After running the above command, you'll see that the thumbnail is correctly appended with -thumb on the filesystem, but the thumbnails section in output.json references the original non-thumb path.

Fix:

My approach roughly looks like this:

  • Remove all logic to do with files_to_move - that is no longer relevant
  • MoveFilesAfterDownloadPP has been updated to determine the final filename using prepare_filename
    • NOTE: this preserves all file extensions. So --write-all-thumbnail's indexed filenames work as expected
  • Once the new filename is determined, the file is moved and the new filepath is written to the appropriate entry in the info dict
  • Also deletes the filepath entries from the info dict if a file was downloaded as an embed but is deleted after processing

Important to note that this PR only concerns itself with files that are moved as a standalone step (requested_subtitles, thumbnails, main media file). Files that are written straight to their final location (description, etc.) are not processed by MoveFilesAfterDownloadPP. This is the same behaviour as before but I wanted to call it out

I'll be the first to say that I'm NOT a Python guy, so please let me know if there are better ways to accomplish things.

Template

Before submitting a pull request make sure you have:

In order to be accepted and merged into yt-dlp each piece of code must be in public domain or released under Unlicense. Check all of the following options that apply:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

@pukkandan
Copy link
Member

When I created this PP originally, the filepath keys in the respective dicts did not yet exist. Now that we have that, can't we use those as path to move from instead and completely get rid of __files_to_move?

I believe the path to move to can be obtained by running prepare_filename in the PP. Or would that not work? I believe the only path that is currently not saved inside the infodict is description/annotation (deprecated). We can save it in a new field like description_filename similar to infojson_filename. Let me know what you think.

@pukkandan pukkandan added enhancement New feature or request core-triage triage requested from a core dev labels Apr 23, 2024
@kieraneglin
Copy link
Author

kieraneglin commented Apr 23, 2024

That's a really good shout! I had tried something related to prepare_filename but I gave up on that approach when I found that the returned string didn't respect the file's extension and I figured that I was using it for something other than what it was designed for.

Looking deeper, I see that gets handled elsewhere thanks to replace_extension. Knowing that, I'll look into that approach again! I'm not sure if it'd work for --write-all-thumbnails but I'll handle that explicitly if need be

self.assertTrue(os.path.exists(audiofile), '%s doesn\'t exist' % audiofile)
os.unlink(filename)
os.unlink(audiofile)
run_pp({'keepvideo': True, 'outtmpl': filename}, SimplePP)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling this out specifically - these changes necessitated an outtmpl (or setting title and id for the default outtmpl to use). I don't think this is an issue, but I wanted to confirm!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To expand, the reason is that the old implementation didn't concern itself with outtmpls at all - it just got a list of source/destination filenames and it moved files based on that.

The new implementation uses prepare_filename which allows it to keep source/destination logic contained entirely within the MoveFilesPP module, but the side effect is that it needs to be able to generate a valid outtmpl somehow

@kieraneglin
Copy link
Author

kieraneglin commented Apr 24, 2024

@pukkandan Thanks for the tip - that ended up working out! I've updated the description and this is ready for re-review with a few notes:

  • This doesn't add the description_filepath (or annotation_filepath) changes you mentioned - I plan to make these changes as part of a followup PR since I didn't want to have this PR doing too many discrete things
  • This still doesn't fix the issue where the filepath for embedded thumbnails is written to the infojson - this will be another PR (or possibly included in the description_filepath one I mentioned above)
  • If it helps, here's a command I was using for testing:
rm -f output.json && \
  ./yt-dlp.sh https://www.youtube.com/watch\?v\=Bx1AiQdMQro \
  --write-thumbnail \
  --convert-thumbnail jpg \
  --write-description \
  --write-subs \
  --write-auto-subs \
  --sub-langs en,de \
  --convert-subs srt \
  --output "%(title)s.%(ext)s" \
  --output "thumbnail:%(title)s-thumb.%(ext)s" \
  --output "subtitle:%(title)s-sub.%(ext)s" \
  --output "description:%(title)s-desc.%(ext)s" \
  --print-to-file "after_move:%()j" output.json

and I'd add in these as-needed:

  --write-all-thumbnails
  --paths "temp: <TODO: tmp directory here>" 
  --no-overwrites

Comment on lines +2102 to +2104
ext = ext if ext.startswith('.') else '.' + ext

return '{}{}'.format(
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For viz. I've updated this method to allow you to pass an ext with or without a leading ..

This doesn't impact existing usage and makes it easier to pipe together existing commands like:

_, ext = os.path.splitext('foo/bar/file.mp4')
print(ext) # => .mp4
replace_extension(filename, ext)

@kieraneglin
Copy link
Author

@pukkandan (not trying to rush this review, just letting you know where I'm at!)

I've been playing around with the followup PR(s) I want to make and I know the general approach I want to take for ensuring filepath-related data is accurate for all file types. I've also determined that those changes are somewhat dependent on this change so I'll wait until you're happy with this + it's merged before I put those up 🤙

@pukkandan
Copy link
Member

pukkandan commented Apr 26, 2024

Sorry, but proper review will take some time. From a quick look, here's some things:

  • The fix for embedthumbnails should be added to this PR imo since I believe merging this without fixing it causes a regression in --embed-thumbnails -P temp:....
  • pre/post_process are public functions and so we cannot change it's signature. Just pass through files_to_move as-is, adding a deprecation warning if files_to_move is not None:
  • The .suffixes trick is unreliable since filename (excluding what we add) can already have . in it. I'm not sure how to properly deal with this though
  • Thanks for pointing out he test changes (requiring outtmpl). It does need a closer look in regards to API compat, but let it be for now.
  • Pls revert unrelated changes, like in _write_description, for example.

Otherwise the impl looks more or less what I had in mind 👍

@kieraneglin
Copy link
Author

No rush at all! Thank you for the feedback - I'll review and make those changes 👍

The .suffixes trick is unreliable [...]

Woof. I'm not sure how to solve that one in all cases, but I'll think about it. If I can't come up with anything maybe I'll add some one-off logic to handle the numeric indexed extensions from --write-all-thumbnails

@pukkandan
Copy link
Member

pukkandan commented Apr 26, 2024

If I can't come up with anything maybe I'll add some one-off logic to handle the numeric indexed extensions from --write-all-thumbnails

I also think we'll have to resort to that. Note that a similar issue exists for subtitles


Won't the current code break -P infojson: and -P description:?, or am I misreading? mb

@kieraneglin
Copy link
Author

kieraneglin commented Apr 26, 2024

Won't the current code break -P infojson: and -P description:?, or am I misreading?

I don't think so but I'll check again! The new code shouldn't touch anything that's written directly to its final filepath like the .description file (see here).

More generally, this code should only concern itself with things that were previously stored under __files_to_move. My changes shouldn't impact any files besides those few filetypes

Copy link
Author

@kieraneglin kieraneglin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pukkandan (again, no rush! I get it). I've made the requested changes, including the issue with filepaths not being deleted even if the file was being embedded. I've made that change for thumbnails, subs, and infojson but please let me know if there is a filetype I'm missing!

Thanks for the guidance 🙌

self.assertTrue(os.path.exists(audiofile), '%s doesn\'t exist' % audiofile)
os.unlink(filename)
os.unlink(audiofile)
run_pp({'keepvideo': True, 'outtmpl': filename}, SimplePP)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To expand, the reason is that the old implementation didn't concern itself with outtmpls at all - it just got a list of source/destination filenames and it moved files based on that.

The new implementation uses prepare_filename which allows it to keep source/destination logic contained entirely within the MoveFilesPP module, but the side effect is that it needs to be able to generate a valid outtmpl somehow

for f in files_to_delete:
infodict['__files_to_move'].setdefault(f, '')
else:
if not self.params.get('keepvideo', False):
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could I get eyes on this to confirm that this logic inversion works as expected now that I've deleted the files_to_move related stuff?

Comment on lines +3680 to +3681
if files_to_move is not None:
self.report_warning('[pre_process] "files_to_move" is deprecated and may be removed in a future version')
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For viz: wanted to make sure this messaging was on-brand for deprecating a parameter rather than an entire method

@@ -4339,6 +4330,7 @@ def _write_thumbnails(self, label, info_dict, filename, thumb_filename_base=None
self.to_screen(f'[info] There are no {label} thumbnails to download')
return ret
multiple = write_all and len(thumbnails) > 1
info_dict['__multiple_thumbnails'] = multiple
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is how I'm determining whether thumbnails need to use an indexed file "extension". I could use get_param('write_all_thumbnails') and similar in the MoveFilesPP, but I reckon doing it this way lessens the risk of logic here getting out of sync with logic in MoveFilesPP if this is refactored/modified in the future

@@ -224,4 +224,8 @@ def run(self, info):
thumbnail_filename if converted or not self._already_have_thumbnail else None,
original_thumbnail if converted and not self._already_have_thumbnail else None,
info=info)

if not self._already_have_thumbnail:
info['thumbnails'][idx].pop('filepath', None)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm using the .pop(name, None) pattern a few places to prevent raising an error if a key doesn't exist. While it shouldn't be strictly necessary based on my testing, there is so much I don't know about yt-dlp that I figured this was the safer bet

@kieraneglin
Copy link
Author

@pukkandan Again, no rush but is there anything I can do here to make it easier to review and test? Let me know if I can help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core-triage triage requested from a core dev enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants