define key researcher use-cases for story image extraction and storage #708

rahulbot · 2020-05-20T18:01:18Z

To make some technical decisions I think we need to more concretely design the primary use cases we have in mind so far. Here's my stab at a list, and the underlying tech feature it might rely on:

1. review a summary of visual language across a small corpus (i.e. top stories) - maybe use image tree map
2. review a summary of visual language across a large corpus (i.e. a timespan) - some high-level view of clusters, like Leon's mosaic does
3. trace the appearance of an image over time in a topic - search by image similarity
4. search for stories using images similar to one the researcher identifies - search by image similarity

This is the thinking that led me to #658.

hroberts · 2020-05-20T18:14:15Z

these all look great to me. is there some way we can produce each of these on a one-off basis to evaluate before building them into the platform? we have arguably already done #1. alternatively, we could make a bet that this is the set of products we want and build the minimal platform to deliver them. a key difference I see is that the first two only require us to collect and process a small subset of the images, whereas the last two require us to process all images in a topic and also build an indexing system to be able to find them. maybe start with the first two and build from there?

-hal

cindyloo · 2020-05-20T18:38:52Z

note: we also have the potential to analyze by facial detection and identification.

I think we've proved the desire and feasibility for use case #1. Minimally surfacing/storing the image and URL, at least for 1 and 2, would make for a flexible initial implementation.

the ability to search by image similarity would be an incredible capability, as there is little out there that does this, but it is not trivial to implement.
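For use cases 1 and 2, storing just the image URL per story may already be enough to build a simple summary. A minimal sketch, assuming each story is a dict with a hypothetical `top_image_url` field (an illustrative name, not an existing Media Cloud field), that counts how often each image appears so the counts could size the cells of a tree map:

```python
from collections import Counter


def image_usage_counts(stories):
    """Count how many stories use each top image URL.

    `top_image_url` is a hypothetical field name used for illustration.
    Stories without an image are skipped.
    """
    urls = (story.get("top_image_url") for story in stories)
    return Counter(url for url in urls if url)


# Made-up sample data, only to show the shape of the input.
stories = [
    {"stories_id": 1, "top_image_url": "https://example.com/a.jpg"},
    {"stories_id": 2, "top_image_url": "https://example.com/a.jpg"},
    {"stories_id": 3, "top_image_url": "https://example.com/b.jpg"},
    {"stories_id": 4, "top_image_url": None},
]

counts = image_usage_counts(stories)
# most_common() yields (url, count) pairs, largest first.
print(counts.most_common())
```

Exact URL matching will miss the same picture served from different URLs or at different sizes, which is part of why similarity search matters for 3 and 4.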
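To make "search by image similarity" a little more concrete, here is a rough sketch of one well-known approach, an average hash compared by Hamming distance. This is an assumption for illustration only, not something the thread settled on. It takes images already decoded and downscaled to a tiny grayscale grid (a real version would use an image library for that step and something like 8x8 grids), and all function names are hypothetical:

```python
def average_hash(pixels):
    """Return a tuple of bits: 1 where a pixel is brighter than the grid mean.

    `pixels` is a small grayscale grid (list of rows of 0-255 ints), assumed
    to be already downscaled by an image library.
    """
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if p > mean else 0 for p in flat)


def hamming(a, b):
    """Number of positions where two equal-length hashes differ."""
    if len(a) != len(b):
        raise ValueError("hashes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def most_similar(query_hash, index, max_distance=1):
    """Linear scan over {image_url: hash}; returns sorted (distance, url) hits."""
    hits = [(hamming(query_hash, h), url) for url, h in index.items()]
    return sorted(hit for hit in hits if hit[0] <= max_distance)


# Tiny 2x2 grids stand in for downscaled images.
original = [[10, 200], [220, 30]]
brighter = [[40, 230], [250, 60]]   # same picture, uniformly brighter
different = [[200, 10], [30, 220]]  # inverted pattern

index = {
    "https://example.com/brighter.jpg": average_hash(brighter),
    "https://example.com/different.jpg": average_hash(different),
}
matches = most_similar(average_hash(original), index)
print(matches)
```

The linear scan is fine for a handful of top images but would not scale to every image in a topic; that is where a real indexing system would be needed.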

rahulbot · 2020-05-20T20:22:25Z

Glad this list feels like a good start. I think #2 has been fairly validated as useful too (see @cindyloo's repo MediaCloud-Image-Tests).

I think you're right that this argues for extracting and surfacing the URL of the top image as a way to get started with 1 & 2. It would also let us try out some out-of-band approaches to 3 and 4 more quickly (with the top image at least). We kind of discussed this in #593, but also more recently.

To be concrete: I'm proposing we take a first step towards image support by adding a pipeline stage to every story in a topic that extracts and stores the top image URL (via Newspaper3k because we have validated that). This should be returned in topic-story-list results so it can be used easily. I can split this off to a new issue to discuss details if folks generally agree.
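Roughly, such a stage might look like the sketch below. It assumes a story is a dict with a `url` key, and the `add_top_image` function and `top_image_url` field are hypothetical names for illustration, not existing Media Cloud code. The Newspaper3k calls (`Article`, `download()`, `parse()`, `top_image`) are the library's documented basic usage; the extractor is passed in as a parameter so the stage can be exercised without network access:

```python
def newspaper_top_image(url):
    """Fetch a story and return Newspaper3k's pick for its top image URL."""
    # Imported lazily so the rest of the sketch runs without the library.
    from newspaper import Article  # third-party: pip install newspaper3k

    article = Article(url)
    article.download()
    article.parse()
    # top_image is an empty string when nothing is found; normalize to None.
    return article.top_image or None


def add_top_image(story, extract=newspaper_top_image):
    """Hypothetical pipeline stage: return a copy of `story` with `top_image_url`.

    Any extraction failure is recorded as None so that one bad page
    does not stop the rest of the pipeline.
    """
    try:
        image_url = extract(story["url"])
    except Exception:
        image_url = None
    return {**story, "top_image_url": image_url}


# Usage with a stub extractor, so no request is made.
story = {"stories_id": 1, "url": "https://example.com/story"}
enriched = add_top_image(story, extract=lambda url: "https://example.com/lead.jpg")
print(enriched["top_image_url"])
```

Only the URL is stored here; fetching and processing the image itself stays out of the critical path.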

The key point this is pushing me towards is that separating URLs from images can help us implement a first stage faster and give us a non-critical-path playground to more easily try out solutions for some of these features.

rahulbot created this issue from a note in Image Support (Usage) May 20, 2020

rahulbot mentioned this issue May 20, 2020

design top images mosaic, tree map or other display for topic top stories mediacloud/web-tools#1814

Closed
