Detection score of the segment #42

zhouzhou-zheng · 2024-05-22T03:30:36Z

When testing, I input a 150s video into the model.

Test Scenario 1: The input video is of a woman dancing, and the query text is "a woman is dancing." The model correctly detects the corresponding segment, which meets expectations.

Test Scenario 2: The input video does not contain any clips of a woman dancing; it is just a video of a woman sitting on a chair. The query text is "a woman is dancing," yet the model still detects a corresponding segment, which does not meet expectations.

Test Scenario 3: The input is a combination of videos from Scenario 1 and Scenario 2. The query text is "a man is playing basketball." There are no men or basketball scenes in the video, but among the top 10 results, there are still segments with high scores.

My question is: for any test video and query text, does the model always detect a highly scored positive segment? What causes this? Is it because, during your training, each video always has at least one segment that corresponds to the query text as a positive example?

wjun0830 · 2024-05-29T11:48:27Z

Great point! This problem isn't addressed in this work.
You may want to try our new model: https://github.com/wjun0830/CGDETR
It may partially address the problem.

wjun0830 · 2024-05-29T11:49:54Z

We also agree with your explanation: the problem arises because the input videos used in training always include segments that correspond to the text queries.
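
Since the model is reported above to return high-scoring segments even when nothing in the video matches the query, one possible workaround is to calibrate a rejection threshold on video-query pairs that are known not to match, then discard predictions below it. The following is only a minimal sketch under stated assumptions: the `(start, end, score)` window layout, the helper names `calibrate_threshold` and `filter_windows`, and all numeric values are hypothetical illustrations, not part of this repository's API or results.

```python
from typing import List, Tuple

# Hypothetical prediction format: (start_sec, end_sec, score).
Window = Tuple[float, float, float]


def calibrate_threshold(negative_top_scores: List[float], margin: float = 0.0) -> float:
    """Pick a rejection threshold from the top scores seen on known-mismatched pairs."""
    if not negative_top_scores:
        raise ValueError("need at least one negative example")
    # Anything scoring no higher than the best negative is treated as "no match".
    return max(negative_top_scores) + margin


def filter_windows(windows: List[Window], threshold: float) -> List[Window]:
    """Keep windows scoring strictly above the threshold, highest score first."""
    kept = [w for w in windows if w[2] > threshold]
    return sorted(kept, key=lambda w: w[2], reverse=True)


# Illustrative numbers only: top scores from queries with no matching segment.
threshold = calibrate_threshold([0.81, 0.77, 0.85])
predictions = [(10.0, 24.0, 0.92), (60.0, 72.0, 0.80), (100.0, 110.0, 0.40)]
print(filter_windows(predictions, threshold))
```

An empty result would then be read as "no segment matches the query". Whether matched and mismatched scores separate well enough for this to work would need to be checked on real data.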
