This repository has been archived by the owner on Dec 14, 2023. It is now read-only.

define key researcher use-cases for story image extraction and storage #708

Open
rahulbot opened this issue May 20, 2020 · 3 comments

Comments

@rahulbot
Contributor

rahulbot commented May 20, 2020

To make some technical decisions, I think we need to define our primary use cases more concretely. Here's my stab at a list, with the underlying tech feature each one might rely on:

  1. review a summary of visual language across a small corpus (i.e. top stories) - maybe use an image tree map
  2. review a summary of visual language across a large corpus (i.e. a timespan) - some high-level view of clusters, like Leon's mosaic does
  3. trace the appearance of an image over time in a topic - search by image similarity
  4. search for stories using images similar to one the researcher identifies - search by image similarity

This is the thinking that led me to #658.

@hroberts
Contributor

hroberts commented May 20, 2020 via email

@cindyloo

cindyloo commented May 20, 2020

Note: we also have the potential to analyze images via facial detection and identification.

I think we've proved the desire and feasibility for use case 1. At a minimum, surfacing and storing the image and its URL for use cases 1 and 2 would make for a flexible initial implementation.

The ability to search by image similarity would be an incredible capability, since there is little out there that does this, but it is not trivial to implement.
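To give a sense of what a first pass at similarity search could look like, here is a minimal sketch of one common technique, an average-hash perceptual fingerprint compared by Hamming distance. This is not something the thread has settled on. It assumes each image has already been downscaled to a small grayscale grid (for example 8x8) by some other tool, and the function names and the `max_distance` threshold are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def average_hash(pixels: Sequence[Sequence[int]]) -> int:
    """Build a bit fingerprint: 1 where a pixel is brighter than the mean.

    `pixels` is assumed to be a small grayscale grid (values 0-255),
    already downscaled elsewhere.
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")


def most_similar(
    query_hash: int,
    candidates: Dict[str, int],
    max_distance: int = 5,  # hypothetical threshold, would need tuning
) -> List[Tuple[int, str]]:
    """Return (distance, image_url) pairs within the threshold, closest first."""
    scored = [(hamming_distance(query_hash, h), url) for url, h in candidates.items()]
    return sorted(pair for pair in scored if pair[0] <= max_distance)
```

A linear scan like this only works for small collections. Tracing an image across a whole topic would need an index over the hashes, which is part of why this is not trivial.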

@rahulbot
Contributor Author

Glad this list feels like a good start. I think use case 2 has been fairly well validated as useful too (see @cindyloo's MediaCloud-Image-Tests repo).

I think you're right that this argues for extracting and surfacing the URL of the top image as a way to get started with use cases 1 and 2. It would also let us try out some out-of-band approaches to 3 and 4 more quickly (with the top image at least). We kind of discussed this in #593, and again more recently.

To be concrete: I'm proposing we take a first step towards image support by adding a pipeline stage that, for every story in a topic, extracts and stores the top image URL (via Newspaper3k, because we have already validated it). The URL should be returned in topic-story-list results so it can be used easily. I can split this off into a new issue to discuss details if folks generally agree.
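As a rough illustration of that stage, here is a minimal sketch rather than the actual pipeline code. It uses Newspaper3k's `Article` class (`download()`, `parse()`, `top_image`). The story dict keys (`url`, `top_image_url`) and the function names are hypothetical, and the extractor is passed in as a parameter so the stage can be exercised without network access.

```python
from typing import Callable, Dict, Iterable, List, Optional


def newspaper_top_image(url: str) -> Optional[str]:
    """Fetch a story and return its top image URL using Newspaper3k."""
    # Imported lazily so the rest of the module works without the package;
    # requires `pip install newspaper3k`.
    from newspaper import Article

    article = Article(url)
    article.download()
    article.parse()
    # top_image is an empty string when nothing was found.
    return article.top_image or None


def add_top_images(
    stories: Iterable[Dict],
    extract: Callable[[str], Optional[str]] = newspaper_top_image,
) -> List[Dict]:
    """Return copies of the stories with a `top_image_url` field added.

    The `url` and `top_image_url` keys are assumptions for illustration.
    Only the image URL is stored, never the image itself.
    """
    results = []
    for story in stories:
        try:
            top_image = extract(story["url"])
        except Exception:
            # A failed fetch or parse should not block the rest of the topic.
            top_image = None
        results.append({**story, "top_image_url": top_image})
    return results
```

Swallowing extraction errors and storing `None` keeps the stage off the critical path: a story with no image URL still flows through to the topic-story-list results.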

The key point this is pushing me towards is that separating image URLs from the images themselves can help us implement a first stage faster, and gives us a non-critical-path playground for trying out solutions to some of these features.
