This repository has been archived by the owner on Dec 14, 2023. It is now read-only.

"podcast-transcribe-episode" doesn't manage to transcode files with non-video "video" streams, e.g. mjpeg #805

Open
pypt opened this issue Aug 17, 2021 · 0 comments

pypt commented Aug 17, 2021

Podcast transcoding fails for some episodes. The worker log shows why:

$ docker service logs $(docker service ls | grep podcast-transcribe-episode-temporal-worker | awk '{ print $1 }')
<...>
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    | INFO podcast_transcribe_episode.workflow: Fetching, transcoding, storing episode for story 2017569382...
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    | INFO podcast_transcribe_episode.transcode: Found a supported audio stream
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    | INFO podcast_transcribe_episode.transcode: Transcoding '/tmp/fetch_transcode_store_episodec6iy_g28/raw_enclosure' to '/tmp/fetch_transcode_store_episodec6iy_g28/transcoded_episode'...
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    | [mp3 @ 0xaaaaf46417d0] Skipping 1 bytes of junk at 62145.
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    | [mp3 @ 0xaaaaf46417d0] Estimating duration from bitrate, this may be inaccurate
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    | Input #0, mp3, from '/tmp/fetch_transcode_store_episodec6iy_g28/raw_enclosure':
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    |   Metadata:
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    |     title           : EVERYTHING YOU EVER WANTED TO KNOW ABOUT COVID THAT THE GOVERNMENT WON'T TELL YOU
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    |     lyrics-ENG      : <p>INTRODUCTION; WHY OBESITY IS A BIG RISK FACTOR; ZINC AND ACTIVATORS; NUTRACEUTICALS AND BOTANICALS; GARLIC, A SUPERFOOD</p>
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    |     album           : The Michael Savage Show
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    |     genre           : Podcast
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    |     date            : 2021
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    |   Duration: 00:59:06.64, start: 0.000000, bitrate: 192 kb/s
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    |     Stream #0:0: Audio: mp3, 44100 Hz, mono, fltp, 192 kb/s
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    |     Stream #0:1: Video: mjpeg (Progressive), yuvj420p(pc, bt470bg/unknown/unknown), 500x500 [SAR 72:72 DAR 1:1], 90k tbr, 90k tbn, 90k tbc (attached pic)
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    |     Metadata:
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    |       title           : image
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    |       comment         : Other
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    | Stream map '0:v' matches no streams.
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    | To ignore this, add a trailing '?' to the map.
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    | Activity PodcastTranscribeActivities::fetch_transcode_store_episode failed: CalledProcessError(Command '['ffmpeg', '-nostdin', '-hide_banner', '-i', '/tmp/fetch_transcode_store_episodec6iy_g28/raw_enclosure', '-map', '-0:v', '/tmp/fetch_transcode_store_episodec6iy_g28/transcoded_episode']' returned non-zero exit status 1.)
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    | Traceback (most recent call last):
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    |   File "/usr/local/lib/python3.8/dist-packages/temporal/activity_loop.py", line 69, in activity_task_loop_func
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    |     return_value = await fn(*args)
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    |   File "/opt/mediacloud/src/podcast-transcribe-episode/python/podcast_transcribe_episode/workflow.py", line 124, in fetch_transcode_store_episode
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    |     raw_enclosure_transcoded = transcode_file_if_needed(
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    |   File "/opt/mediacloud/src/podcast-transcribe-episode/python/podcast_transcribe_episode/transcode.py", line 88, in transcode_file_if_needed
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    |     subprocess.check_call(ffmpeg_command)
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    |   File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    |     raise CalledProcessError(retcode, cmd)
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    | subprocess.CalledProcessError: Command '['ffmpeg', '-nostdin', '-hide_banner', '-i', '/tmp/fetch_transcode_store_episodec6iy_g28/raw_enclosure', '-map', '-0:v', '/tmp/fetch_transcode_store_episodec6iy_g28/transcoded_episode']' returned non-zero exit status 1.

(Sample episode that fails: https://traffic.megaphone.fm/ADV5935473959.mp3?updated=1628579716)

To make transcriptions work, we remove video streams from incoming episodes if we find any:

# If there's video in the file (e.g. video), remove it
if media_info.has_video_streams:
    # Discard all video streams
    ffmpeg_args.extend(['-map', '-0:v'])

Whether or not the episode has video streams is determined here:

elif stream['codec_type'] == 'video':
    has_video_streams = True
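
For illustration, here is a minimal sketch of how that check could tell real video apart from cover art. It assumes the stream dicts come from `ffprobe -print_format json -show_streams`, where an embedded picture is flagged with `disposition.attached_pic` set to 1. The function names and the sample data are hypothetical and not taken from the repository:

```python
from typing import Any, Dict, List


def is_attached_picture(stream: Dict[str, Any]) -> bool:
    """Return True if a 'video' stream is really embedded cover art.

    Assumes an ffprobe-style stream dict; ffprobe marks cover art with
    disposition.attached_pic == 1.
    """
    disposition = stream.get('disposition') or {}
    return (
        stream.get('codec_type') == 'video'
        and disposition.get('attached_pic') == 1
    )


def has_real_video(streams: List[Dict[str, Any]]) -> bool:
    """Return True only if some video stream is not an attached picture."""
    return any(
        s.get('codec_type') == 'video' and not is_attached_picture(s)
        for s in streams
    )


# Hypothetical data shaped like the failing episode: MP3 audio plus an MJPEG thumbnail
sample_streams = [
    {'index': 0, 'codec_type': 'audio', 'codec_name': 'mp3'},
    {
        'index': 1,
        'codec_type': 'video',
        'codec_name': 'mjpeg',
        'disposition': {'attached_pic': 1},
    },
]

print(has_real_video(sample_streams))
```

With a check like this, the thumbnail in the sample episode would no longer count as a video stream.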

But it turns out that quite a few episodes have a static thumbnail attached as a "video" stream, e.g.:

$ ffmpeg -i ADV5935473959.mp3
<...>
[mp3 @ 0x55f3de42a2c0] Skipping 1 bytes of junk at 62145.
[mp3 @ 0x55f3de42a2c0] Estimating duration from bitrate, this may be inaccurate
Input #0, mp3, from 'ADV5935473959.mp3':
  Metadata:
    title           : EVERYTHING YOU EVER WANTED TO KNOW ABOUT COVID THAT THE GOVERNMENT WON'T TELL YOU
    lyrics-ENG      : <p>INTRODUCTION; WHY OBESITY IS A BIG RISK FACTOR; ZINC AND ACTIVATORS; NUTRACEUTICALS AND BOTANICALS; GARLIC, A SUPERFOOD</p>
    album           : The Michael Savage Show
    genre           : Podcast
    date            : 2021
  Duration: 00:59:06.10, start: 0.000000, bitrate: 192 kb/s
    Stream #0:0: Audio: mp3, 44100 Hz, mono, fltp, 192 kb/s
    Stream #0:1: Video: mjpeg (Progressive), yuvj420p(pc, bt470bg/unknown/unknown), 500x500 [SAR 72:72 DAR 1:1], 90k tbr, 90k tbn, 90k tbc (attached pic)
    Metadata:
      title           : image
      comment         : Other
At least one output file must be specified

(That's Stream #0:1 here.)

FFmpeg advises us to "add a trailing '?' to the map", but that probably won't work with the speech-to-text engine, so let's rework transcode_file_if_needed() to remove all non-audio streams (video, attached JPEGs, text files, etc.). One can attach quite a few things to media files: https://ffmpeg.org/doxygen/trunk/group__lavu__misc.html#ga9a84bba4713dfced21a1a56163be1f48
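
A rough sketch of what "remove all non-audio streams" could look like: instead of subtracting video with `-map -0:v`, select audio explicitly with `-map 0:a`, so that only the audio streams of the first input end up in the output. The helper name and the paths are hypothetical, and codec/format options are left out, as in the logged command:

```python
from typing import List


def build_audio_only_command(input_path: str, output_path: str) -> List[str]:
    """Build an ffmpeg command that keeps only the input's audio streams.

    Once an explicit '-map' is given, ffmpeg includes only the mapped
    streams, so video, attached pictures, subtitles and data streams are
    left out without having to be listed one by one.
    """
    return [
        'ffmpeg', '-nostdin', '-hide_banner',
        '-i', input_path,
        '-map', '0:a',  # all audio streams of input #0, nothing else
        output_path,
    ]


# Hypothetical paths, for illustration only
command = build_audio_only_command('raw_enclosure', 'transcoded_episode')
print(command)
```

The resulting list could be passed to subprocess.check_call() the same way ffmpeg_command is today. If only one audio stream should be kept, `0:a:0` would select the first one.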

@jtotoole, could you:

@pypt pypt added the bug label Aug 17, 2021
@jtotoole jtotoole removed their assignment Dec 1, 2021