Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add slide segments and extracted text to harvesting and DB, enable paella slide previews #1163

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

owi92
Copy link
Member

@owi92 owi92 commented Apr 26, 2024

This adds the ocr'd slide texts as well as a list of timestamped frames to the harvesting sync code and stores them in the DB.
In order the show the slide previews, paella-slide-plugins was added and configured to use the timestamped frames.

Needs opencast/opencast#5757 to work. Once that is merged, released and used on our test Opencast, the changes can be tested with fresh uploads. We'll still need some mechanism to apply segmentation and ocr (and speech-to-text as well) to existing videos.

(Can be reviewed commit by commit, though note that the migration from the second commit was extended in the third)

@owi92 owi92 added the changelog:user User facing changes label Apr 26, 2024
@github-actions github-actions bot temporarily deployed to test-deployment-pr1163 April 26, 2024 13:03 Destroyed
@github-actions github-actions bot temporarily deployed to test-deployment-pr1163 April 26, 2024 14:02 Destroyed
@github-actions github-actions bot temporarily deployed to test-deployment-pr1163 April 26, 2024 14:23 Destroyed
@github-actions github-actions bot temporarily deployed to test-deployment-pr1163 April 26, 2024 14:36 Destroyed
@github-actions github-actions bot added the status:conflicts This PR has conflicts that need to be resolved label Apr 29, 2024

This comment was marked as resolved.

Copy link
Member

@LukasKalbertodt LukasKalbertodt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Just code review, no testing yet)
Looks overall pretty good. I have a few small nits, but there is only one bigger one: about how to store the startTime.

32: "event-slide-text",
32: "event-slide-data",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the segments really "slide data"? Maybe just be verbose with event-slide-text-and-segments?

alter table events
-- The default above was just for all existing records. New records should
-- require this to be set.
alter column segments drop default,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this also add a not null constraint here? I feel like we missed that in 14-event-captions too? Like, we don't want the field to be null, right?

return {
id: "frame_" + time,
mimetype: "image/jpeg",
time: time,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
time: time,
time,


create type event_segment as (
uri text,
start_time text
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I would prefer it, if we stored this like duration, namely as number of milliseconds as an int. The string takes more space and I don't think we can use it for anything. The only "consumer" of this number seems to be Paella, which wants seconds anyway. Converting from milliseconds to seconds is obv much easier than this string parsing. So that basically means moving ocTimeStringToSeconds into the harvest code and converting before writing it to the DB. Not sure what we should do in case the parsing fails... probably just ignore all segments?

It would have been even better to already just send the milliseconds in the harvest API, but I didn't catch that when looking at the OC PR. Although... we could still fix it? It's a Tobira internal API and Tobira has not used that yet, so we can fix it without bumping the API version... mhhh

Comment on lines +483 to +484
order: 102,
tabIndex: 17,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those are some high numbers... is that intentional?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh good call. I copied that from https://paellaplayer.upv.es/#/playground and forgot to adjust those numbers.

@github-actions github-actions bot added the status:conflicts This PR has conflicts that need to be resolved label May 16, 2024
Copy link

This pull request has conflicts ☹
Please resolve those so we can review the pull request.
Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
changelog:user User facing changes status:conflicts This PR has conflicts that need to be resolved
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants