
FrVD: French Video Description dataset

CC BY-NC-SA 4.0

The following content is also available in French.


License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.


Citation

@techreport{FrVD_TechnicalReport,
  title       = "Projet FAR-VVD: Rapport final des travaux au 18e mois du projet",
  author      = "Francis Charette-Mignault and Edith Galy and Lise Rebout and Mathieu Provencher",
  institution = "Centre de recherche informatique de Montréal (CRIM)",
  address     = "405 Ogilvy Avenue #101, Montréal, QC H3N 1M3",
  year        = 2021,
  month       = jul
}

Project Description

FrVD is a dataset composed of video clip references (videos are not included), video descriptions (VD), and French-language metadata (scenes, characters, actions) intended for deep learning.

It comes with a utility tool for the visualization of video clips, video descriptions, and actions.

Context

In recent years, CRIM has developed expertise in the production and distribution of video descriptions for audiovisual works. We have produced video descriptions of 142 movies and TV shows in French, adding metadata for the identification of scenes, characters, and actors.

As part of a partnership with the Fond d'accessibilité à la radiodiffusion (FAR), this corpus has been enriched with the identification of recognized visual actions, in order to build a complete French dataset usable for tasks such as deep learning for the automated production of VD, or the detection of visual elements (scenes, characters, actions).

During this project, methods have been explored to detect actions:

  • in video clips using the visual model SlowFast;
  • in VD using both manual and automatic annotation strategies.

Dataset Details

To summarize, the produced dataset contains:

  • references to video clips (either a film or a TV show episode);
  • time intervals within the video for each annotated item:
    • annotations of scenes and characters (including actors);
    • manually transcribed video descriptions in French (FR);
    • common linguistic annotations: segmentation of the VD into sentences, tokens, grammatical categories, and positional indices of tokens within sentences, produced with the Stanza library;
    • textual VD annotations: manually and automatically annotated actions (the variant is specified where applicable in the dataset metadata);
    • textual AVA VD annotations: actions from the AVA dataset detected directly in the transcribed video descriptions (using multiple proposed alignment strategies);
    • visual AVA SlowFast annotations: actions annotated with the SlowFast model pretrained on the AVA dataset (in English; 80 actions);
    • visual Kinetics SlowFast annotations: actions annotated with the SlowFast model pretrained on the Kinetics-600 dataset (in English; 600 actions).
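Since the dataset pairs each annotation with a time interval inside a referenced video, a typical consumer task is selecting every annotation overlapping a given moment. The sketch below assumes a simple JSON layout with `video`, `start`, `end`, `type`, and `text` fields; the actual FrVD schema may differ, so treat the field names as placeholders.

```python
import json

# Hypothetical example records; the actual FrVD JSON schema may differ.
sample = """
[
  {"video": "film_001", "start": 12.4, "end": 15.9,
   "type": "vd_transcription", "text": "Des visiteurs marchent dans le parc."},
  {"video": "film_001", "start": 12.4, "end": 15.9,
   "type": "action_manual", "text": "marcher"}
]
"""

def annotations_in_interval(records, t0, t1):
    """Return the annotations whose time interval overlaps [t0, t1]."""
    return [r for r in records if r["start"] < t1 and r["end"] > t0]

records = json.loads(sample)
for r in annotations_in_interval(records, 13.0, 14.0):
    print(r["type"], "->", r["text"])
```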

The following sections describe the elements above in further detail.

Annotations in Video-Descriptions

Textual annotations of the transcribed video descriptions have been produced to identify actions and the entities related to them. Annotations were created manually for 45% of the entire VD corpus, and automatically for the remaining 55% using a bi-LSTM model (developed by CRIM) trained on the manually annotated portion.
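A common way to feed span annotations like these to a sequence tagger such as a bi-LSTM is to encode them as per-token BIO tags. The snippet below is an illustrative sketch of that conversion, not the actual FrVD preprocessing; the span layout and labels are assumptions.

```python
def spans_to_bio(tokens, spans):
    """Convert token-index spans to BIO tags.

    spans: list of (start_token, end_token_exclusive, label) tuples.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # beginning of the span
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # inside of the span
    return tags

tokens = ["Des", "visiteurs", "marchent", "dans", "le", "parc"]
spans = [(1, 2, "sujet"), (2, 3, "action")]
print(spans_to_bio(tokens, spans))
# ['O', 'B-sujet', 'B-action', 'O', 'O', 'O']
```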

Annotations Format

Annotations are identified with the following definitions. Note that the French terminology is kept to preserve correspondence with the actual annotations produced in the dataset.

  • action denotes a verb, distinguished into categories:
    • cas général
      (e.g.: Des visiteurs marchent dans le parc, i.e.: visitors walk in the park);
    • verbe passif
      (e.g.: Une serveuse est projetée sur une étagère de verres, i.e.: A waitress is projected onto a shelf of glasses);
    • verbe support (a light verb; in this case, the substituted full verb is also recorded)
      (e.g.: faire la vaisselle, prendre une bouffée de cigarette, i.e.: do the dishes, take a puff of cigarette).
  • Added Attributes:
    • type négation
    • type verbe de substitution (for verbs with a support attribute)
      (e.g.: faire la vaisselle => nettoyer, prendre une bouffée de cigarette => fumer, i.e.: do the dishes => clean, take a puff of a cigarette => smoke).
  • Entities involved in the action, with the following components distinguished in each case:
    • sujet
    • objet direct
    • objet indirect essentiel
    • and providing the specific type of each one, using one of:
      • humain
      • animal
      • objet
      • concept
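The annotation scheme above can be summarized as a small data model: an action with a category, optional negation and substitution attributes, and a list of typed entities. The dataclasses below are an illustrative sketch of that structure only; the actual FrVD serialization format is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Optional

# Field names mirror the French terminology of the annotation scheme;
# this is an illustrative model, not the dataset's actual file format.
@dataclass
class Entite:
    role: str    # "sujet", "objet direct", or "objet indirect essentiel"
    type: str    # "humain", "animal", "objet", or "concept"
    texte: str

@dataclass
class ActionAnnotation:
    verbe: str
    categorie: str                   # "cas général", "verbe passif", "verbe support"
    negation: bool = False
    verbe_substitution: Optional[str] = None  # e.g. "faire la vaisselle" -> "nettoyer"
    entites: list = field(default_factory=list)

a = ActionAnnotation(
    verbe="faire la vaisselle",
    categorie="verbe support",
    verbe_substitution="nettoyer",
    entites=[Entite(role="sujet", type="humain", texte="une serveuse")],
)
print(a.verbe_substitution)
```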

Actions Annotations from Video

Actions detected from videos employ the SlowFast implementation.

The authors' original code, linked above, allows the use of the most recent pretrained weights. Some additional features (see SlowFast Pull Request #358) have also been added in order to produce the prediction logs needed to retrieve inferred actions from videos.

  • Pretrained model weights for the Kinetics-600 dataset have been employed to produce the corresponding annotations in the dataset;
  • Pretrained model weights for the AVA dataset have also been employed to generate corresponding annotations in the dataset. These annotations also make use of the various improvements relevant to AVA.

The detectron2 model is employed (within SlowFast) to allow recognition of actions per individual using detected bounding boxes rather than predicting actions globally over the full video frames.
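Conceptually, per-person detection means each detected bounding box gets its own multi-label action scores (AVA-style actions are not mutually exclusive). The sketch below illustrates that post-processing idea with plain Python; it is not SlowFast's actual code, and the labels, threshold, and data layout are assumptions.

```python
import math

def per_person_actions(boxes, logits, labels, threshold=0.5):
    """For each detected person box, keep the labels whose sigmoid
    score exceeds a threshold (multi-label, AVA-style)."""
    results = []
    for box, row in zip(boxes, logits):
        scores = [1.0 / (1.0 + math.exp(-z)) for z in row]
        kept = [(labels[i], round(s, 3))
                for i, s in enumerate(scores) if s >= threshold]
        results.append({"box": box,
                        "actions": sorted(kept, key=lambda x: -x[1])})
    return results

labels = ["stand", "walk", "talk to"]
boxes = [(10, 20, 50, 120), (200, 30, 260, 140)]   # one box per person
logits = [[2.0, -1.0, 0.2], [-0.5, 1.5, 1.0]]      # one score row per box
print(per_person_actions(boxes, logits, labels))
```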

Actions Annotations from VD

Annotations from VD have been aligned with the available actions from the AVA dataset (80 labels).

To do so, the following steps were applied:

  • filtering and extraction of actions with human subjects (with or without a direct object) from the VD annotations;
  • normalization of entities (direct objects are generalized into "someone", i.e.: "quelqu'un" in French);
  • this processing yielded a total of ~9,000 unique actions in French
    (e.g.: sortir, apercevoir QQUN, i.e.: leave, notice SOMEONE).
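The filtering and normalization steps above can be sketched as follows. The tuple layout (verbe, type of subject, direct object) and the `QQUN` marker are illustrative assumptions, not the actual FrVD processing code.

```python
def normalize_actions(annotations):
    """Keep actions with a human subject and generalize any direct
    object to the placeholder QQUN ("quelqu'un" / "someone")."""
    normalized = set()
    for verbe, type_sujet, objet_direct in annotations:
        if type_sujet != "humain":
            continue                       # filter: human subjects only
        if objet_direct is None:
            normalized.add(verbe)          # intransitive use kept as-is
        else:
            normalized.add(f"{verbe} QQUN")  # direct object generalized
    return sorted(normalized)

annotations = [
    ("sortir", "humain", None),
    ("apercevoir", "humain", "un policier"),
    ("apercevoir", "humain", "sa mère"),   # collapses with the previous one
    ("aboyer", "animal", None),            # dropped: non-human subject
]
print(normalize_actions(annotations))
# ['apercevoir QQUN', 'sortir']
```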

Alignment Strategies

This step represents the strategies that were employed to obtain a mapping between actions defined in the text domain (obtained from VD annotations) and in the video domain (obtained from video inferences).

In order to align the ~9,000 actions with the AVA labels, several alignment methods were explored. Each of the explored strategies is available in the FrVD dataset. They are provided using the following attributes, named after the corresponding method applied in each case:

  • using lexical resources (resslex, for ressources lexicales in French) with data augmentation (retrieval of synonyms after translating the English AVA actions into French), concatenating the following resources:
  • using manual alignment for the 300 most frequent actions in the corpus (which represent more than 50% of all occurrences):
    • manual: the manually aligned action when available (the value - is used if the action is not one of the 300 principal actions, and _ is used when alignment is impossible);
    • prox: the degree of semantic proximity of the action against the gold annotations (1 meaning an adequate alignment was made, 2 a loose alignment, and 0 that no AVA action could be matched).
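A consumer of these attributes might prefer the manual alignment (with its prox indicator) when one exists and fall back to the lexical-resource alignment otherwise. The sketch below illustrates that lookup logic; the dictionaries are toy stand-ins for the dataset's actual attributes, and the AVA labels shown are assumptions.

```python
# Toy stand-ins for the dataset's manual and resslex attributes.
MANUAL = {                      # action -> (AVA label, prox)
    "marcher": ("walk", 1),
    "apercevoir QQUN": ("watch (a person)", 2),
    "rêver": ("_", 0),          # '_' : alignment judged impossible
}
RESSLEX = {                     # lexical-resource (synonym-based) alignment
    "courir": "run/jog",
}

def align(action):
    """Prefer the manual alignment when available, else resslex."""
    if action in MANUAL:
        label, prox = MANUAL[action]
        return {"method": "manual", "label": label, "prox": prox}
    if action in RESSLEX:
        return {"method": "resslex", "label": RESSLEX[action], "prox": None}
    return {"method": None, "label": "-", "prox": None}  # not aligned

print(align("marcher"))
print(align("courir"))
```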

Reference Gold Samples

Complete annotation and manual alignment have been accomplished for 6 movies and TV show episodes within the corpus. They span very different genres in order to form a diverse and robust set of gold reference annotation samples. The selected videos are:

Only the above cases offer the alignment categorized as gold, combined with the prox indicator described in Alignment Strategies, in the FrVD dataset.

Metadata File Contents

Metadata, including SRT- and JSON-formatted annotations, can be downloaded from the location below:

FrVD: French Video Description
Note: Videos are not included.

Visualization Tool

The tool provided in the crim-ca/FrVD-visualization-tool repository can be used for synchronized visualization of the metadata annotation files of the FrVD dataset alongside the videos of the movies/TV shows.

Contributors and Acknowledgement

The FrVD dataset has been created by CRIM contributors.

The project received financial support from Fond d'accessibilité à la radiodiffusion (FAR).

The video descriptions were originally transcribed in the context of past projects that received financial support from the Office des personnes handicapées du Québec (OPHQ) and from the Programme de soutien à la valorisation et au transfert (PSVT).

References

Fr-VD: Dataset and Visualization Tool

Please refer to the above citation for referencing this dataset and related work.

The documents FAR-VVD - Progress Report (Part 1) and FAR-VVD - Final Report (Part 2) (both in French) offer additional details regarding the methodologies applied to generate the metadata included in the FrVD dataset, as well as the steps taken to validate the produced annotations.

Additional Bibliographical References

See the references page, which contains detailed and complete references to the resources used.