Add background segmentation mask #142

Open · wants to merge 3 commits into base: main
Conversation

@eehakkin (Contributor) commented May 8, 2024

Hi!

This adds capabilities, constraints and settings for background segmentation mask. Those are fairly obvious.
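As a sketch of how a web app might opt in through the constraints added here (the constraint name follows this PR; as with other boolean constrainable properties, the capability is assumed to be reported as a sequence of booleans — this is illustrative, not a final API shape):

```javascript
// Sketch: opting in to the background segmentation mask via the
// constraint proposed in this PR. Illustrative only; the final API
// shape is still under discussion.
async function getTrackWithMask() {
  const stream = await navigator.mediaDevices.getUserMedia({ video: true });
  const [track] = stream.getVideoTracks();
  const capabilities = track.getCapabilities();
  // Boolean capabilities are reported as sequences, e.g. [false, true].
  if ((capabilities.backgroundSegmentationMask || []).includes(true)) {
    await track.applyConstraints({
      advanced: [{ backgroundSegmentationMask: true }],
    });
  }
  return track;
}
```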

For the feature to be useful, the actual background segmentation mask must be provided to web apps. There are various ways to do that:

  1. In my PoC, I changed the stream of video frames to be a stream of interleaved background segmentation mask and real video frames and extended video frame metadata with a background segmentation mask flag so that web apps can tell segmentation mask and real video frames apart.
    However, that makes such streams awkward to process and leaves it very unclear how to encode them.
  2. In this PR, the real video frame and background segmentation mask frame are bundled together which simplifies processing of the streams and allows encoders to encode real video frames normally. The background segmentation mask frames for their part are mostly for local consumption only.
  3. Another option would be to utilize an alpha channel. However, there are problems with that approach:
    • Some pixel formats (such as NV12 and NV21) do not have corresponding alpha channel formats. So it would not be possible to add such an alpha channel and then later drop it in order to get the original frame back. Instead, the whole frame would have to be converted to a different format.
    • There are, for instance, no canvas composite operations for operating with alpha masks, whereas they work great with grayscale masks.
    • Pixels which are certainly background would be completely transparent. For a completely transparent pixel, the color is practically irrelevant, and some compression algorithms could group all completely transparent pixels together and thus lose the color information.
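For comparison, a grayscale mask as in option 2 composes per pixel in a straightforward way. Below is a minimal pure-JavaScript sketch of a green-screen composite over raw RGBA buffers; the function name and buffer layout are illustrative and not part of this PR:

```javascript
// Pure per-pixel green-screen composite using a grayscale mask.
// frame and mask are RGBA buffers of equal length; the mask's R channel
// carries the mask value (255 = certainly foreground, 0 = certainly
// background, in-between = uncertainty).
function compositeGreenScreen(frame, mask) {
  const out = new Uint8ClampedArray(frame.length);
  const green = [0, 255, 0];
  for (let i = 0; i < frame.length; i += 4) {
    const m = mask[i] / 255; // grayscale mask: R == G == B
    for (let c = 0; c < 3; c++) {
      // Linear blend: foreground keeps the frame, background goes green.
      out[i + c] = frame[i + c] * m + green[c] * (1 - m);
    }
    out[i + 3] = 255; // opaque output
  }
  return out;
}
```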

/cc @riju



@riju commented May 14, 2024

Thanks @eehakkin
In the explainer, we list the differences between Blur and Mask, provide example code to create a green screen using this feature, and also include a demo of what the BG Segmentation MASK looks like in our Chrome PoC and what you can do with it (replacement, gif, image, green screen, etc.).

In many cases, it might be important to have access to the original camera feed, so BG MASK keeps the original frames intact, performs segmentation, and provides mask frames in addition to the original video frames. Web applications thus receive both the original frames and the mask frames in the same video frame stream.

This PR follows up our presentation of BG Segmentation MASK in the monthly WebRTC WG call [Minutes].

PTAL @jan-ivar @aboba @alvestrand @youennf

@eladalon1983 self-requested a review May 14, 2024 11:27
@eladalon1983 (Member) left a comment:

I think the general thrust of this effort is very useful for Web applications.

index.html Outdated
<p>A background segmentation mask with
white denoting certainly foreground,
black denoting certainly background and
grey denoting uncertainty.</p>
Member:

Is it really only "uncertainty" that's represented? Is it perhaps sometimes partial transparency, and sometimes ambiguity?

Could anything be said here to clarify that shades of grey tend more towards the foreground/background based on being lighter/darker?

Contributor Author:

Done.

index.html Outdated
<h3>VideoFrame interface extensions</h3>
<pre class="idl">
partial interface VideoFrame {
readonly attribute VideoFrame? backgroundSegmentationMask;
Member:

I imagine this isn't going to suffer infinite recursion because the second layer deep will be guaranteed nullable. But it still strikes me as a bit odd to expose a full VideoFrame here, with all its present and future fields, when what we really wish to get is a matrix of integer values of a limited range.

Contributor Author:

Yes, recursion is definitely not wanted.

While I by no means insist on VideoFrame, I think it is beneficial if the background segmentation mask can be passed directly, for instance, to Canvas.drawImage() or such.

Additionally, because the usages of background segmentation masks are manifold (they could be post-processed remotely, locally on the CPU or on the GPU, etc.) and sources and pre-processing could vary (maybe the source is a boolean matrix, an integer matrix or a GPU texture), it would be good IMHO if the API didn't enforce a particular storage or representation. A VideoFrame is good in that respect.

Contributor:

Why is the attribute readonly? If JS wishes to modify the background segmentation mask of a frame, how can you do it? Create a new video frame with a new segmentation mask member? How is that passed to the video frame constructor?

Member:

Note VideoFrame is defined by the Media WG, so I think this needs to be discussed there.

Unless we make backgroundSegmentationMask metadata? Either way, we should involve the Media WG here based on w3c/webcodecs#607 (comment).

If JS wishes to modify the background segmentation mask of a frame, how can you do it? Create a new video frame with a new segmentation mask member?

These are good questions I suspect the Media WG can answer. They made VideoFrame and its metadata immutable and define its interaction model.

Like @eladalon1983 I find it odd to expose a full VideoFrame for a mask.

};

partial dictionary MediaTrackConstraintSet {
ConstrainBoolean backgroundSegmentationMask;
Member:

Would it ever be interesting and feasible to tweak the parameters by which segmentation is done?

@riju May 15, 2024:

At least on Windows, the platform model does not allow tweaking segmentation parameters today. Using tensorflow.js with the BodyPix model for Blur, I see there's at least a segmentationThreshold parameter. Maybe it's the same as foregroundThresholdProbability with the MediaPipeSelfieSegmentation model?

Did you have some other parameters in mind?

[Screenshot: MediaPipe segmentation parameters]

Member:

Did you have some other parameters in mind?

I am not knowledgeable enough on what parameters would be best to include. I was mostly wondering if this is something we foresee extending from a boolean to a set of parameters, and if so, whether there was a viable path for such future extensions given the current API shape.

Contributor Author:

In the Media Capture API, the parameter space is flat and not hierarchical.

As an example, there is a constrainable property called whiteBalanceMode which can be constrained to manual. If one then wants to manually change the white balance, there is a constrainable property called colorTemperature which can be constrained separately in order to do that.

So if we would later like to add a numeric constrainable property called backgroundSegmentationThreshold (which could cause the segmentation mask to be pre-processed into a black-and-white mask according to the threshold, without shades of grey) or a string constrainable property called backgroundSegmentationModel (to select a particular AI model), we could certainly do that.
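As an illustration of that flat parameter space, such an extension would just be another sibling constraint. Note that backgroundSegmentationThreshold below is hypothetical and NOT part of this PR:

```javascript
// Hypothetical future constraints alongside the one from this PR.
// backgroundSegmentationThreshold is NOT proposed here; it merely
// illustrates how the flat constraint space could be extended later.
async function enableBinaryMask(track) {
  await track.applyConstraints({
    advanced: [{
      backgroundSegmentationMask: true,
      backgroundSegmentationThreshold: 0.5, // hypothetical
    }],
  });
}
```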

@eladalon1983 (Member) commented:

By the way, having spoken to some people who work on camera effects in video-conferencing applications, I have some more feedback. (Not sure if this has been discussed in the past.)

Video conferencing applications often have to be very careful about what models they use, two interesting reasons being:

  1. Inclusion. For example, ensuring people of different skin color are treated equitably. Not only is this important for ensuring customer satisfaction - sometimes it's even a regulatory requirement.
  2. Consistency.

I am getting the feeling that, if we want serious Web apps to use this valuable work, it might be necessary to also expose something about the underlying model. I am not sure what the MVP is in that regard; possibly even just some stable identifier that apps can use against an allowlist of models/implementation that they had vetted and found sufficient?

@riju commented May 28, 2024

By the way, having spoken to some people who work on camera effects in video-conferencing applications, I have some more feedback. (Not sure if this has been discussed in the past.)

Video conferencing applications often have to be very careful about what models they use, two interesting reasons being:

  1. Inclusion. For example, ensuring people of different skin color are treated equitably. Not only is this important for ensuring customer satisfaction - sometimes it's even a regulatory requirement.
  2. Consistency.

I am getting the feeling that, if we want serious Web apps to use this valuable work, it might be necessary to also expose something about the underlying model. I am not sure what the MVP is in that regard; possibly even just some stable identifier that apps can use against an allowlist of models/implementation that they had vetted and found sufficient?

Good feedback. The way we plan to implement this API today on Chrome/Edge is by using the platform models which presently ship by default on the underlying OS:
On Windows, it would be the Windows Studio Effects models.
On macOS, it would be Apple's Vision models.
On ChromeOS, it is likely to be a MediaPipe selfie segmenter when it happens.

If you are making a native app today without bringing your own models, you will likely use what the platform provides.
I would say OS teams do take care of inclusion when training the models. I can see the Model Card and training info on a few MediaPipe/TFLite models.

I think when users bring their own models, this is a serious issue to consider. Also, when major platforms are shipping efficient on-device models by default in the OS, does it make sense for every app to bring their own segmentation models? Differentiation vs. efficiency trade-offs.

Consistency:
I hear that many would like to have the same UX across platforms so that their use cases (green screen, BG replacement) look the same. That's why this API would provide the mask data, and developers can implement their use cases on top of it. I understand the mask data itself won't be pixel-perfect across platforms, but could they use MediaStreamTrackProcessor or canvas operations with the mask data to minimize any difference in the underlying models?
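For example, one such normalization could threshold the grayscale mask into a binary mask, reducing model differences to the foreground/background decision. A pure-JavaScript sketch over a raw RGBA mask buffer (names and buffer layout are illustrative only):

```javascript
// Sketch: post-process a grayscale mask into a binary mask so that
// use cases look more uniform across differing platform models.
// maskPixels is an RGBA buffer; the R channel carries the mask value.
function binarizeMask(maskPixels, threshold = 128) {
  const out = new Uint8ClampedArray(maskPixels.length);
  for (let i = 0; i < maskPixels.length; i += 4) {
    const v = maskPixels[i] >= threshold ? 255 : 0;
    out[i] = out[i + 1] = out[i + 2] = v; // grayscale result
    out[i + 3] = 255; // opaque
  }
  return out;
}
```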

@eladalon1983 (Member) commented May 30, 2024

I think we should spin off the discussion about identifying the model (or some of its properties) out of this PR and into an issue.

Just some quick clarifications, though.

  • I imagine everyone makes a serious effort to be inclusive nowadays. But video-conferencing applications might nevertheless face a regulatory requirement to demonstrate that they had done some due diligence before relying on a model provided by a third-party. So the concern I am raising here is not "is the model inclusive" but rather "can an app using the model know that it's inclusive and make that claim to regulators." (I'm not an expert here and I do not intend to cosplay one. Just a topic for you to consider if you want to ensure widespread adoption of this API.)
  • The standards by which inclusion is judged may change over time. It might be necessary to update allowlists and blocklists of models over time. A Web-based video-conferencing app in 2027 might not be able to rely on a model built into an un-updated user agent from 2025.
  • The specific worries about consistency which I am channeling here, are about the consistency of the segmentation model.

@riju commented May 30, 2024

I think we should spin off the discussion about identifying the model (or some of its properties) out of this PR and into an issue.

Just some quick clarifications, though.

  • I imagine everyone makes a serious effort to be inclusive nowadays. But video-conferencing applications might nevertheless face a regulatory requirement to demonstrate that they had done some due diligence before relying on a model provided by a third-party. So the concern I am raising here is not "is the model inclusive" but rather "can an app using the model know that it's inclusive and make that claim to regulators." (I'm not an expert here and I do not intend to cosplay one. Just a topic for you to consider if you want to ensure widespread adoption of this API.)

@aboba: Is it possible to share more information on how Microsoft does due diligence before putting models in the OS?

  • The standards by which inclusion is judged may change over time. It might be necessary to update allowlists and blocklists of models over time. A Web-based video-conferencing app in 2027 might not be able to rely on a model built into an un-updated user agent from 2025.

Very true. I am expecting platform vendors to update models (maybe via drivers or OS updates) as hardware becomes more capable.

  • The specific worries about consistency which I am channeling here, are about the consistency of the segmentation model.

@eladalon1983 (Member) commented May 30, 2024

Very true. I am expecting platform vendors to update models (maybe via drivers or OS updates) as hardware becomes more capable.

Even if a video conferencing app runs on {UA, UA-version, OS, OS-version}, it might still not know definitively which model is used, as that might be subject to experiments, out-of-band updates, etc. Apps might require more information exposed to them about the segmentation model before they can use it.

@alvestrand (Contributor) left a comment:

I see that the approach has changed to an extra video frame argument. I think this is a better approach, but I still have questions.

@jan-ivar (Member) left a comment:

Hi @riju, does this PR resolve an open issue? If not, can you open one?

I see this PR modifies VideoFrame which is defined by the Media WG, so I think those parts need to be discussed there.

I think this could use their expertise.


@jan-ivar (Member) commented Jun 5, 2024

This API was discussed in https://www.w3.org/2024/04/23-webrtc-minutes.html#t08

@eehakkin (Contributor Author) commented Jun 6, 2024

I replaced partial interface VideoFrame with partial dictionary VideoFrameMetadata. That is a more standard way to extend a VideoFrame, I suppose. I also changed the type of the new member from VideoFrame to ImageBitmap. That avoids the recursion.

I should also add an example later, I think.
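For instance, a consumption sketch along the lines of the current proposal. The metadata member name follows this PR; everything else is standard WebCodecs / insertable streams usage, and the sketch is illustrative rather than normative:

```javascript
// Sketch: reading per-frame segmentation masks from a camera track,
// assuming the VideoFrameMetadata member proposed in this PR.
async function consumeMaskedFrames(track, handleFrame) {
  const processor = new MediaStreamTrackProcessor({ track });
  const reader = processor.readable.getReader();
  for (;;) {
    const { value: frame, done } = await reader.read();
    if (done) break;
    // ImageBitmap mask, or undefined if segmentation is not active:
    const mask = frame.metadata().backgroundSegmentationMask;
    handleFrame(frame, mask);
    frame.close();
  }
}
```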

@jan-ivar (Member) commented Jun 7, 2024

Note that this still requires registering the metadata, as was done in w3c/webcodecs#607.
