[Bug]: GPU_HUNG when both encoder and decoder #288

Open
DaveHu-TVU opened this issue Jul 6, 2023 · 16 comments

@DaveHu-TVU

Which component impacted?

Decode, Encode

Is it regression? Good in old configuration?

Yes, it's good in old version

What happened?

CPU: 12th Gen Intel(R) Core(TM) i7-12700
kernel: Linux tvu-desktop 5.15.0-69-generic #76~20.04.1-Ubuntu SMP Mon Mar 20 15:54:19 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

vpl: 2023Q1(https://github.com/oneapi-src/oneVPL-intel-gpu/releases/tag/intel-onevpl-23.1.5)

Reproduction steps:
console1:
/opt/intel/media/share/vpl/samples/_bin/sample_decode h265 -i v3_1080i5994.h265 -o /dev/null -timeout 10000

console2:/opt/intel/media/share/vpl/samples/_bin/sample_encode h264 -i cnn.yuv -o /dev/null -w 1920 -h 1080 -timeout 10000 -nv12

[ERROR], sts=MFX_ERR_GPU_HANG(-21), SynchronizeFirstTask, SyncOperation fail or timeout at /opt/src/vpl-dispatcher_src/tools/legacy/Sample_encode/src/pipeline_encode.cpp:178

[ERROR], sts=MFX_ERR_GPU_HANG(-21), GetFreeTask, m_TaskPool.SynchronizeFirstTask failed at /opt/src/vpl-dispatcher_src/tools/legacy/Sample_encode/src/pipeline_encode.cpp:2239

[ERROR], sts=MFX_ERR_GPU_HANG(-21), Run, m_pmfxENC->EncodeFrameAsync failed at /opt/src/vpl-dispatcher_src/tools/legacy/Sample_encode/src/pipeline_encode.cpp:2487

[ERROR], sts=MFX_ERR_GPU_HANG(-21), main, pPipeline->Run failed at /opt/src/vpl-dispatcher_src/tools/legacy/Sample_encode/src/Sample_encode.cpp:1970
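The two-console reproduction above can also be driven from one script. The following is a minimal sketch, assuming a POSIX shell. `run_pair` is a hypothetical helper name and is not part of the VPL samples; the commands in the usage comment are the ones from the reproduction steps.

```shell
# run_pair is a hypothetical helper (not part of the VPL samples).
# It starts two command lines concurrently and waits for both.
# Returns 0 only if both commands exit successfully.
run_pair() {
    sh -c "$1" &
    pid1=$!
    sh -c "$2" &
    pid2=$!
    rc=0
    wait "$pid1" || rc=1
    wait "$pid2" || rc=1
    return "$rc"
}

# Usage with the commands from the reproduction steps:
# SAMPLES=/opt/intel/media/share/vpl/samples/_bin
# run_pair "$SAMPLES/sample_decode h265 -i v3_1080i5994.h265 -o /dev/null -timeout 10000" \
#          "$SAMPLES/sample_encode h264 -i cnn.yuv -o /dev/null -w 1920 -h 1080 -timeout 10000 -nv12"
```

A non-zero return only says that at least one of the two tools failed; the console output still has to be checked for the actual `sts=` value.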

What's the usage scenario when you are seeing the problem?

Immersive Media

What impacted?

After testing, we found that:
Decoding H264/H265 streams encoded by Intel Media SDK or VPL while encoding at the same time works.
Decoding H265 streams encoded by our other platform (Amba H2) while encoding at the same time easily triggers a GPU hang.
v3_1080i5994.zip

Debug Information

image

Do you want to contribute a patch to fix the issue?

Yes, I'm glad to submit a patch to fix it

@nyanmisaka

Kernel version 5.15 is too old for 12th Gen.
Install the latest linux-firmware and update kernel to 6.1 and try again.

https://github.com/intel/media-driver#known-issues-and-limitations
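Whether a machine meets the suggested 6.1 kernel minimum can be checked before retesting. This is a minimal sketch, assuming GNU `sort` with the `-V` (version sort) option is available. `version_at_least` is a hypothetical helper name introduced here for illustration.

```shell
# version_at_least is a hypothetical helper: succeeds if version $1 >= version $2.
# Relies on the -V (version sort) option of sort.
version_at_least() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# Compare the running kernel against the 6.1 minimum suggested above.
if version_at_least "$(uname -r)" "6.1"; then
    echo "kernel $(uname -r) is 6.1 or newer"
else
    echo "kernel $(uname -r) is older than 6.1"
fi
```

The check only covers the kernel; the installed linux-firmware version still has to be verified separately.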

@DaveHu-TVU
Author

DaveHu-TVU commented Jul 6, 2023

I updated the kernel to 6.1.0-1015-oem and linux-firmware to 20220329 and got the same result.

image

@DaveHu-TVU
Author

DaveHu-TVU commented Jul 7, 2023

I used another cpu(12th Gen Intel(R) Core(TM) i9-12900H)
ubuntu 22.04
kernel: 6.1.0-1015-oem and the latest linux-firmware

cmd: /opt/intel/media/share/vpl/samples/_bin/sample_decode h265 -i v3_1080i5994.h265 -o /dev/null -timeout 10000

Just decoding the H265 file encoded on the Amba H2 platform (v3_1080i5994.h265), with no encode running at the same time, fails with MFX_ERR_DEVICE_FAILED(-17). Please see the log:
mfxlib_Pid2426_Tid140450659272512.log

Decoding started
Frame number: 2560, fps: 126.187, fread_fps: 0.000, fwrite_fps: 0.0000
[ERROR], sts=MFX_ERR_DEVICE_FAILED(-17), RunDecoding, DecodeFrameAsync returned error status at /opt/src/vpl-dispatcher_src/tools/legacy/sample_decode/src/pipeline_decode.cpp:1980
Frame number: 2561, fps: 126.234, fread_fps: 0.000, fwrite_fps: 0.000
[ERROR], sts=MFX_ERR_DEVICE_FAILED(-17), RunDecoding, Unexpected error!! at /opt/src/vpl-dispatcher_src/tools/legacy/sample_decode/src/pipeline_decode.cpp:2100
...

I can also decode H265 files normally when they were encoded with Intel Media SDK (v2_500k_1080i5994.h265).
Please compare the two files to see what differs for decoding.

v3_1080i5994.zip

v2_500k_1080i5994.zip
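When comparing runs, it can help to reduce a console log like the one above to its distinct status codes. This is a minimal sketch, assuming the `[ERROR], sts=NAME(code)` line format shown in this thread. `summarize_status` is a hypothetical helper name, not a VPL tool.

```shell
# summarize_status is a hypothetical helper: reads a sample_decode/sample_encode
# console log on stdin and prints each distinct status with its occurrence count.
summarize_status() {
    grep -o 'sts=MFX_[A-Z_]*(-*[0-9]*)' \
        | sed 's/^sts=//' \
        | sort \
        | uniq -c \
        | awk '{ print $2 ": " $1 }'
}

# Usage (2>&1 captures both output streams, in case errors go to stderr):
# /opt/intel/media/share/vpl/samples/_bin/sample_decode h265 -i v3_1080i5994.h265 \
#     -o /dev/null -timeout 10000 2>&1 | summarize_status
```

Lines without an `sts=` field, such as the frame counters, are ignored.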

@nyanmisaka

nyanmisaka commented Jul 8, 2023

No issue with ffmpeg qsv decoder (built with onevpl). I think it should be a sample_decode issue.

ffmpeg -hwaccel qsv -hwaccel_output_format qsv -c:v hevc_qsv -i v3_1080i5994.h265 -f null -
ffmpeg version 6.0-Jellyfin Copyright (c) 2000-2023 the FFmpeg developers
  built with gcc 13.1.1 (GCC) 20230429
  configuration: --prefix=/usr/lib/jellyfin-ffmpeg --target-os=linux --extra-version=Jellyfin --disable-doc --disable-ffplay --disable-ptx-compression --disable-shared --disable-libxcb --disable-sdl2 --disable-xlib --enable-gpl --enable-version3 --enable-static --enable-gmp --enable-gnutls --enable-chromaprint --enable-libfontconfig --enable-libass --enable-libbluray --enable-libdrm --enable-libfreetype --enable-libfribidi --enable-libmp3lame --enable-libopus --enable-libopenmpt --enable-libtheora --enable-libvorbis --enable-libdav1d --enable-libwebp --enable-libvpx --enable-libx264 --enable-libx265 --enable-libzvbi --enable-libzimg --enable-libshaderc --enable-libplacebo --enable-vulkan --enable-opencl --enable-vaapi --enable-amf --enable-libvpl --enable-ffnvcodec --enable-cuda --enable-cuda-llvm --enable-cuvid --enable-nvdec --enable-nvenc
  libavutil      58.  2.100 / 58.  2.100
  libavcodec     60.  3.100 / 60.  3.100
  libavformat    60.  3.100 / 60.  3.100
  libavdevice    60.  1.100 / 60.  1.100
  libavfilter     9.  3.100 /  9.  3.100
  libswscale      7.  1.100 /  7.  1.100
  libswresample   4. 10.100 /  4. 10.100
  libpostproc    57.  1.100 / 57.  1.100
[hevc @ 0x557611b8ec80] PPS id out of range: 0
    Last message repeated 1 times
[hevc @ 0x557611b8ec80] Error parsing NAL unit #3.
[hevc @ 0x557611b8da00] Stream #0: not enough frames to estimate rate; consider increasing probesize
Input #0, hevc, from 'v3_1080i5994.h265':
  Duration: N/A, bitrate: N/A
  Stream #0:0: Video: hevc (Main), yuv420p(tv, progressive), 1920x540 [SAR 1:1 DAR 32:9], 59.94 fps, 59.94 tbr, 1200k tbn
libva info: VA-API version 1.19.0
libva info: Trying to open /usr/lib/dri/iHD_drv_video.so
libva info: Found init function __vaDriverInit_1_19
libva info: va_openDriver() returns 0
libva info: VA-API version 1.19.0
libva info: Trying to open /usr/lib/dri/iHD_drv_video.so
libva info: Found init function __vaDriverInit_1_19
libva info: va_openDriver() returns 0
Stream mapping:
  Stream #0:0 -> #0:0 (hevc (hevc_qsv) -> wrapped_avframe (native))
Press [q] to stop, [?] for help
[hevc_qsv @ 0x55761217a900] More data is required to decode header
Output #0, null, to 'pipe:':
  Metadata:
    encoder         : Lavf60.3.100
  Stream #0:0: Video: wrapped_avframe, qsv(tv, top coded first (swapped)), 1920x540 [SAR 1:1 DAR 32:9], q=2-31, 200 kb/s, 59.94 fps, 59.94 tbn
    Metadata:
      encoder         : Lavc60.3.100 wrapped_avframe
frame= 1199 fps=0.0 q=-0.0 Lsize=N/A time=00:00:19.98 bitrate=N/A speed=40.8x    0x
video:562kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown

@nyanmisaka

Using the bitstream filter hevc_metadata of ffmpeg can fix your v3 clip.

ffmpeg -i v3_1080i5994.h265 -bsf:v hevc_metadata -c:v copy -y v3_fixed.h265
/usr/bin/vpl-sample_decode h265 -i v3_fixed.h265 -o /dev/null -timeout 10000

@DaveHu-TVU
Author

Hi @nyanmisaka, thanks for your help. I did some tests and have more information about this issue.
With VPL 2022Q2 the issue does not occur; with VPL 2022Q3 it does.
I then replaced intel-driver with 22.4.4 while still using VPL 2023Q1 and found that this works too.
So I think this is an issue in intel-driver versions after 22.4.4. Can you provide a patch to fix it? Thanks.

@DaveHu-TVU
Author

Hi @nyanmisaka, I saw in your log that libva reports: libva info: Found init function __vaDriverInit_1_19
What version of VPL are you using? I think different libva and media-driver versions may give different results. Thanks

@nyanmisaka

Libva version is not related to this issue. I'm testing the latest tag intel-onevpl-23.3.0.
Also I can't test media-driver 22.4.4 since it's too old to support my Arc discrete GPU.

I'm not from intel and probably can't help you fix this.
Since the regression seems to be caused by media-driver, you can open a ticket over there.

@chenhao5-Intel
Contributor

> Hi @nyanmisaka, thanks for your help. I did some tests and have more information about this issue. With VPL 2022Q2 the issue does not occur; with VPL 2022Q3 it does. I then replaced intel-driver with 22.4.4 while still using VPL 2023Q1 and found that this works too. So I think this is an issue in intel-driver versions after 22.4.4. Can you provide a patch to fix it? Thanks.

Hi Dave.
What's the test scenario of your above effort? Decode v3_1080i5994.h265 + Encode cnn.yuv? Or just decode test on your i9-12900H platform?
The error returned is MFX_ERR_GPU_HANG(-21) or MFX_ERR_DEVICE_FAILED(-17)?

@DaveHu-TVU
Author

DaveHu-TVU commented Jul 14, 2023

Q1:Decode v3_1080i5994.h265 + Encode cnn.yuv
Q2: Both 12900H and 12700
Q3:MFX_ERR_GPU_HANG(-21)

@DaveHu-TVU
Author

DaveHu-TVU commented Jul 14, 2023

Hi @chenhao5-Intel
I have also reproduced the decoding failure MFX_ERR_DEVICE_FAILED(-17) using 2022Q2, but it is harder to reproduce and I haven't found a stable way to trigger it yet. I'm working on it. Let's focus on the 2023Q1 issue first. Thanks

@chenhao5-Intel
Contributor

> Q1: Decode v3_1080i5994.h265 + Encode cnn.yuv
> Q2: Both 12900H and 12700
> Q3: MFX_ERR_GPU_HANG(-21)

You mean you can reproduce the encode hang issue: "[ERROR], sts=MFX_ERR_GPU_HANG(-21), SynchronizeFirstTask, SyncOperation fail or timeout at /opt/src/vpl-dispatcher_src/tools/legacy/Sample_encode/src/pipeline_encode.cpp:178" on both 12900H and 12700?

@chenhao5-Intel
Contributor

chenhao5-Intel commented Jul 18, 2023

Hi @DaveHu-TVU @nyanmisaka
I have successfully reproduced this issue on both i7-12700 and i9-12900H + Ubuntu 22.04 env.

There are two issue scenarios: (On both i7-12700 and i9-12900H)

  1. When only decoding v3_1080i5994.h265, which was encoded by Amba H2, sample_decode reports GPU_HANG:
    Decoding started
    Frame number: 260541, fps: 2394.031, fread_fps: 0.000, fwrite_fps: 0.000
    [ERROR], sts=MFX_ERR_GPU_HANG(-21), RunDecoding, DecodeFrameAsync returned error status at /opt/src/sources/oneVPL-disp/tools/legacy/sample_decode/src/pipeline_decode.cpp:1980
    Frame number: 260542, fps: 2394.034, fread_fps: 0.000, fwrite_fps: 0.000
    [ERROR], sts=MFX_ERR_GPU_HANG(-21), RunDecoding, Unexpected error!! at /opt/src/sources/oneVPL-disp/tools/legacy/sample_decode/src/pipeline_decode.cpp:2100
    [ERROR], sts=MFX_ERR_GPU_HANG(-21), main, Pipeline.RunDecoding failed at /opt/src/sources/oneVPL-disp/tools/legacy/sample_decode/src/sample_decode.cpp:904

Driver log shows no related errors reported and VPL log shows cm_mem_copy.cpp[Line: 3115]CopyVideoToSys: returns MFX_ERR_GPU_HANG. Analysis WIP.

  2. When decoding v3_1080i5994.h265 while encoding cnn.yuv at the same time, both decode and encode report GPU_HANG:
    For decode:
    Decoding started
    Frame number: 1586, fps: 57.032, fread_fps: 0.000, fwrite_fps: 6847.8366
    [ERROR], sts=MFX_ERR_GPU_HANG(-21), RunDecoding, DecodeFrameAsync returned error status at /opt/src/sources/oneVPL-disp/tools/legacy/sample_decode/src/pipeline_decode.cpp:1980
    Frame number: 1587, fps: 57.068, fread_fps: 0.000, fwrite_fps: 6849.847
    [ERROR], sts=MFX_ERR_GPU_HANG(-21), RunDecoding, Unexpected error!! at /opt/src/sources/oneVPL-disp/tools/legacy/sample_decode/src/pipeline_decode.cpp:2100
    [ERROR], sts=MFX_ERR_GPU_HANG(-21), main, Pipeline.RunDecoding failed at /opt/src/sources/oneVPL-disp/tools/legacy/sample_decode/src/sample_decode.cpp:904

For encode:
Processing started
Frame number: 1600
[ERROR], sts=MFX_ERR_GPU_HANG(-21), SynchronizeFirstTask, SyncOperation fail or timeout at /opt/src/sources/oneVPL-disp/tools/legacy/sample_encode/src/pipeline_encode.cpp:178
[ERROR], sts=MFX_ERR_GPU_HANG(-21), GetFreeTask, m_TaskPool.SynchronizeFirstTask failed at /opt/src/sources/oneVPL-disp/tools/legacy/sample_encode/src/pipeline_encode.cpp:2239
[ERROR], sts=MFX_ERR_GPU_HANG(-21), Run, m_pmfxENC->EncodeFrameAsync failed at /opt/src/sources/oneVPL-disp/tools/legacy/sample_encode/src/pipeline_encode.cpp:2487
[ERROR], sts=MFX_ERR_GPU_HANG(-21), main, pPipeline->Run failed at /opt/src/sources/oneVPL-disp/tools/legacy/sample_encode/src/sample_encode.cpp:1970
Frame number: 1680
Encoding fps: 324

Analyzed log and found that LibVA will report: [LIBVA]:CRITICAL - StatusReport:261: Something unexpected happened in HW, return error to application

As for MFX_ERR_DEVICE_FAILED(-17), it may be a duplicate of the GPU_HANG issue.
As a next step, let us focus on the decode-only v3_1080i5994.h265 scenario first, as it may affect the other two issues.

If you have any question, please let me know. Thanks.

BRs,
Hao

@DaveHu-TVU
Author

Hi @chenhao5-Intel
We are using VPL 2023Q1, so the version I compiled is oneVPL GPU Runtime 2023Q1 Release - 23.1.5 (libmfx-gen.1.2.8).
I have reproduced the issue on the 12900H with different video formats (1080p5994, 1080i5994, 720p5994) and put the console logs in the attachments.
I have also extracted the same clip with each encoder and attached the files to this comment.
Files starting with msdk were encoded with Media SDK, and files starting with amba were encoded with Amba H2.
Uploading msdk_1080p5994.zip…

[amba_720p5994.zip](https://github.com/oneapi-src/oneVPL-intel-gpu/files/12093848/amba_720p5994.zip)
msdk_720p5994.zip
amba_1080p5994.zip
msdk_1080i5994.zip
amba_1080i5994.zip

@chenhao5-Intel
Contributor

Hi @DaveHu-TVU and all,

We have root-caused this issue. We have updated the code and will open-source the fix soon.

To check this on your side, please test on the i9-12900H: run "export INTEL_MEDIA_RESET_WATCHDOG=0" first and then run the sample app commands. There should be no issues.

For Linux i7-12700, please refer to this known issue: https://community.intel.com/t5/Media-Intel-oneAPI-Video/GPU-hangs-when-decoding-2-HEVC-UHD-streams-444-10-bits-Y410/td-p/1431771
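For the i9-12900H check described above, the variable can also be set for a single command instead of the whole shell session. This is a minimal sketch. `with_watchdog_off` is a hypothetical wrapper name, and only the `INTEL_MEDIA_RESET_WATCHDOG=0` setting comes from this thread.

```shell
# with_watchdog_off is a hypothetical wrapper: runs the given command with
# INTEL_MEDIA_RESET_WATCHDOG=0 set for that command only, leaving the
# calling shell's environment unchanged.
with_watchdog_off() {
    env INTEL_MEDIA_RESET_WATCHDOG=0 "$@"
}

# Usage with the decode command from this issue:
# with_watchdog_off /opt/intel/media/share/vpl/samples/_bin/sample_decode h265 \
#     -i v3_1080i5994.h265 -o /dev/null -timeout 10000
```

Scoping the variable to one command makes it easy to compare runs with and without the setting from the same terminal.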

@DaveHu-TVU
Author

OK, Thanks for your help, @chenhao5-Intel
