Preliminary design

Goal

The datasets server aims to provide services for the datasets of the Hugging Face Hub through a web API.

Justification

Datasets can be very big. Getting metadata, fetching data, querying or processing them requires a lot of resources (time, bandwidth, computing, storage). For some use cases (notebooks, web pages, etc.), these resources are not available. The datasets server is a third party that bears the cost of these resources and provides a curated list of services on the datasets through a lightweight web API.

Impact

The Hugging Face Hub would ideally become the one-stop shop for ML datasets in the near future. To increase the usage of Hub datasets, it's crucial to provide the services users need to do their work. By providing specialized services, the datasets server will allow the Hub to add value to the dataset pages (view the data, show stats, run queries, etc.).

Reference

Ecosystem around the Hub datasets

The datasets of the Hugging Face Hub can be accessed directly using git or HTTP, or through the datasets library.
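
For illustration, here is a minimal sketch of these access paths. The dataset name and file path are examples only:

```python
# Three ways to access a Hub dataset (names and paths are examples).

# 1. Through the datasets library
from datasets import load_dataset
ds = load_dataset("glue", "cola", split="train")

# 2. Through plain HTTP: raw files are served under /resolve/<revision>/<path>
import requests
url = "https://huggingface.co/datasets/glue/resolve/main/README.md"
readme = requests.get(url, timeout=10).text

# 3. Through git (equivalent shell command, shown here as a comment):
#    git clone https://huggingface.co/datasets/glue
```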

Other related projects:

This project is an evolution of datasets-preview-backend (the previous name of this repository), which provided the list of configs, splits, and first rows of the datasets (using the streaming mode of datasets).

Implementation

Services

The datasets server will provide the following services through a web API:

  • get the metadata of a dataset: tags, configs, splits, features (columns)...
  • get the first N rows of a split
  • get a quality report on how well the dataset can be accessed using datasets (has metadata, can be downloaded, can be streamed, etc.)
  • generate the dataset-info.json (see https://github.com/huggingface/datasets/issues/3507#issue-1091214808)
  • get basic statistics about a split: number of samples, size in bytes
  • get statistics about a column of a split: distribution, mean, median, etc.
  • get a range of rows of a split (random access)
  • post SQL queries (https://github.com/huggingface/data-measurements-tool: frequent words, average+std sentence length, average+std word length, number of samples per tag/label)
  • scan files for vulnerabilities (related to security scan)
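
To make the shape of such an API concrete, here is a sketch of how a client could call a hypothetical "first rows" service. The base URL, endpoint name and parameters are illustrative assumptions, not a final API definition:

```python
# Hypothetical client call; the endpoint name and parameters are assumptions.
import requests

API_URL = "https://datasets-server.example.com"  # placeholder base URL

def get_first_rows(dataset: str, config: str, split: str) -> dict:
    """Fetch the first rows of a split from the (hypothetical) web API."""
    response = requests.get(
        f"{API_URL}/first-rows",
        params={"dataset": dataset, "config": config, "split": split},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()  # e.g. {"features": [...], "rows": [...]}

# Example usage:
# rows = get_first_rows("glue", "cola", "train")
```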

Out of scope

At least for a first version, the following points are out of scope:

Implementation challenges

Multiple aspects must be taken into account for the implementation. Not all are equally important.

  • size (and cost) of the storage: some datasets are very big (several TB, generally for audio or vision)
  • number of files: some datasets have a lot of small files
  • bandwidth: downloading (and regularly re-downloading) big datasets takes a lot of bandwidth. It should not be a problem on the datasets server's side, but it might be on the other side (not all the datasets' data files are hosted on the Hub). The dataset hosting platform might rate-limit requests or have availability issues
  • changes: the datasets are live objects that are versioned with git
  • changes: the hosted data files might change
  • response time: generating a service response can take time, because 1. the dataset must first be downloaded, and 2. querying the data in the local files can also be slow (possibly reduce-like operations)
  • access rights: some datasets are gated, others are private, others must be downloaded manually
  • private hub: on-premise hubs might also want to benefit from the services provided by the datasets server
  • security: downloading a dataset requires executing arbitrary code (the .py script), which might introduce security issues
  • dependencies: the .py script might require extra packages, but there is currently no way to specify these dependencies
  • resources: the processes might take too many resources (memory, CPU, storage, time)
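
As an illustration of the last point, one possible way to bound a job's resources is to run it as a subprocess with an address-space limit and a wall-clock timeout. This is only a sketch: the actual isolation mechanism (containers, cgroups, etc.) is an implementation choice, and the script name and limits are made-up examples:

```python
# Sketch: run a dataset-processing job with bounded memory and time (POSIX only).
import resource
import subprocess

MEMORY_LIMIT_BYTES = 8 * 1024**3  # 8 GiB address-space limit (example value)
TIMEOUT_SECONDS = 30 * 60         # 30 minutes wall-clock limit (example value)

def limit_memory() -> None:
    """Applied in the child process, before the job starts."""
    resource.setrlimit(resource.RLIMIT_AS, (MEMORY_LIMIT_BYTES, MEMORY_LIMIT_BYTES))

subprocess.run(
    ["python", "process_dataset.py", "--dataset", "glue"],  # hypothetical job script
    preexec_fn=limit_memory,
    timeout=TIMEOUT_SECONDS,
    check=True,  # raise if the job fails
)
```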

Implementation decisions

  • the services are provided only for a selection of datasets, all of them being "small" public datasets from the Hugging Face Hub
  • the datasets are stored on the server (be it in their original form or in a transformed form: parquet, arrow, SQL?)
  • the services are provided only for one version of the dataset, ideally the latest revision of the main branch of a dataset repository
  • the datasets are updated regularly to try to give access to the "current" version (possibilities: webhooks on git changes, periodic checks of the ETags, manual trigger; a webhook handler is sketched at the end of this section)
  • the responses of the static services (statistics, first rows, metadata) are cached
  • the dynamic services (SQL query, random access) can be rate-limited, possibly per user through a token
  • indexes are set up to query the dataset contents
  • the processes on a dataset will run as jobs in an isolated Python environment, with all the required dependencies already installed
  • the jobs' resources will be bounded (memory, CPU, storage, time)
  • the general cost of every dataset (storage, jobs, queries) is evaluated
  • use streaming when possible to speed up the dataset refreshes (see https://huggingface.slack.com/archives/C0311GZ7R6K/p1651592155530169?thread_ts=1651590983.338949&cid=C0311GZ7R6K) or to provide a fallback (for example, if the dataset is too big to be stored on disk)
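
As an illustration of the last decision, the first rows of a split can often be fetched with the streaming mode of datasets, without downloading the whole dataset. The dataset name and the number of rows are examples, and not every dataset can be streamed:

```python
# Sketch: fetch the first N rows of a split lazily, without a full download.
from itertools import islice
from datasets import load_dataset

N = 100  # number of rows to extract (example value)

streamed = load_dataset("glue", "cola", split="train", streaming=True)
first_rows = list(islice(streamed, N))  # iterates over the remote data lazily
```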

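Regarding the decision to update the datasets regularly, one of the listed possibilities is a webhook triggered on git changes. Below is a minimal sketch of such a handler using FastAPI; the payload field and the add_refresh_job helper are assumptions for illustration, not the actual Hub webhook format:

```python
# Hypothetical webhook handler: the Hub would POST here when a dataset repository changes.
from fastapi import FastAPI, Request

app = FastAPI()

def add_refresh_job(dataset: str) -> None:
    """Hypothetical helper: enqueue a job that re-downloads and re-processes the dataset."""
    print(f"queued refresh for {dataset}")

@app.post("/webhook")
async def on_dataset_change(request: Request) -> dict:
    payload = await request.json()
    dataset = payload.get("repo_id")  # assumed field name
    if dataset:
        add_refresh_job(dataset)
    return {"status": "ok"}
```
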
Technologies and infra

See https://github.com/huggingface/datasets-server/tree/main/infra