Create an Orcasound data catalogue and facilitate data access #12

Open
valentina-s opened this issue Feb 22, 2023 · 2 comments

@valentina-s
Contributor

valentina-s commented Feb 22, 2023

This project aims to facilitate Orcasound data access. Orcasound data is part of the Registry of Open Data on AWS. Because the data is stored in a streaming structure (many small .ts files), it can be hard for a newcomer to query. The goal of this project is to improve the quality of the Orcasound data by following the FAIR (Findability, Accessibility, Interoperability, and Reuse) principles for scientific digital assets. The aim is to build a data catalogue and a user-friendly package that facilitates access and abstracts away the dependence on the data structure, which may change in the future. Useful features include the ability to quickly identify when data are available and to retrieve audio by node, time range, time frequency, etc. in a desired output format. The orca-hls-utils package has some of this functionality and would benefit from more abstraction, testing, and documentation. Many other projects will benefit from this package.
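
To make the "retrieve audio by node and time range" idea concrete, here is a minimal sketch of the lookup step. It assumes, purely for illustration, that each node's stream is split into folders named by the Unix time at which recording started; the function name `folders_for_range` is hypothetical and is not part of orca-hls-utils.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime, timezone


def folders_for_range(folder_starts, start, end):
    """Return folder start times (Unix seconds) that may hold audio in [start, end).

    folder_starts: sorted Unix timestamps, one per stream folder (assumed layout).
    start, end: timezone-aware datetimes.
    """
    t0, t1 = start.timestamp(), end.timestamp()
    if t1 <= t0:
        return []
    # The folder already recording at t0 is the last one that started at or before t0.
    first = max(bisect_right(folder_starts, t0) - 1, 0)
    # A folder that starts at or after t1 cannot contain requested audio.
    last = bisect_left(folder_starts, t1)
    return folder_starts[first:last]


# Hypothetical folder start times, six hours apart.
base = 1_700_000_000
starts = [base, base + 21_600, base + 43_200]
wanted = folders_for_range(
    starts,
    datetime.fromtimestamp(base + 3_600, tz=timezone.utc),
    datetime.fromtimestamp(base + 25_200, tz=timezone.utc),
)
```

The binary search keeps the lookup cheap even with many folders, which is one way a catalogue could avoid listing the bucket on every request.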

Expected outcomes: A Python package that eases access to free, open Orcasound audio data.

Required Skills:
Object Oriented Python, Project Packaging

Bonus Skills:
ffmpeg, Cloud Computing, experience working with large datasets

Mentors:
Valentina, Scott

Difficulty level: Hard

Project Size: 175 or 350 h

Resources:
OOIPY: a package for accessing data from Ocean Observatories Initiative
Amazon S3 Inventory: a service to create an inventory catalogue for data on Amazon S3, which can be automatically updated and stored in CSV or Parquet format.
fsspec: a Python package for interfacing with different filesystems in the same way
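
As a rough illustration of how an inventory listing could become a catalogue, the sketch below groups CSV inventory rows by node and stream folder. The column order (bucket, key, size), the key shape `<node>/hls/<unix_start>/<file>`, and the name `index_inventory` are all assumptions made for this example; a real inventory's columns are defined by its configuration.

```python
import csv
import io
from collections import defaultdict
from urllib.parse import unquote


def index_inventory(csv_text):
    """Group inventory rows as {node: {folder_start: [(file_name, size_bytes), ...]}}.

    Assumes rows of (bucket, key, size) with no header line, and keys shaped
    like '<node>/hls/<unix_start>/<file>'. Both are illustrative assumptions.
    """
    index = defaultdict(lambda: defaultdict(list))
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 3:
            continue  # skip malformed or empty rows
        _bucket, key, size = row[:3]
        # Keys in CSV inventory reports are URL-encoded, so decode before splitting.
        parts = unquote(key).split("/")
        if len(parts) != 4 or parts[1] != "hls" or not parts[2].isdigit():
            continue  # skip keys that do not match the assumed layout
        node, _, start, name = parts
        index[node][int(start)].append((name, int(size)))
    # Return plain dicts so later lookups cannot create keys by accident.
    return {node: dict(folders) for node, folders in index.items()}


# Hypothetical inventory rows for two nodes.
sample = (
    '"bucket","node_a/hls/1700000000/live000.ts","1024"\n'
    '"bucket","node_a/hls/1700000000/live.m3u8","300"\n'
    '"bucket","node_b/hls/1700021600/live000.ts","2048"\n'
)
catalogue = index_inventory(sample)
```

An index like this answers "which nodes have data, and from when" without touching the audio files themselves.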

Points to consider in your proposal:

How would you optimize for accessing many small files?
Can you parallelize some operations?
Can you isolate the dependence on the cloud provider?
Can access to a catalogue abstract and speed up the data access?
Can some data be cached?
What would be the API?
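
One possible starting point for several of these questions is sketched below: a small read-only store interface to isolate the cloud provider, a thread pool to fetch many small files in parallel, and an in-memory cache. All class and method names (`ObjectStore`, `InMemoryStore`, `CachedFetcher`, `fetch_many`) are hypothetical, and the in-memory backend is only a stand-in so the example runs without network access.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable, List, Protocol


class ObjectStore(Protocol):
    """Minimal read-only interface; a real backend might wrap S3 or a local disk."""

    def read(self, key: str) -> bytes: ...


class InMemoryStore:
    """Stand-in backend that serves bytes from a dict and counts reads."""

    def __init__(self, objects: Dict[str, bytes]) -> None:
        self._objects = objects
        self._lock = threading.Lock()
        self.reads = 0  # number of backend hits, so caching is observable

    def read(self, key: str) -> bytes:
        with self._lock:
            self.reads += 1
        return self._objects[key]


class CachedFetcher:
    """Fetch many small objects in parallel, hitting the backend once per key."""

    def __init__(self, store: ObjectStore, max_workers: int = 8) -> None:
        self._store = store
        self._cache: Dict[str, bytes] = {}
        self._max_workers = max_workers

    def fetch_many(self, keys: Iterable[str]) -> List[bytes]:
        keys = list(keys)
        # De-duplicate while preserving order, then drop keys already cached.
        missing = [k for k in dict.fromkeys(keys) if k not in self._cache]
        with ThreadPoolExecutor(max_workers=self._max_workers) as pool:
            # pool.map returns results in the same order as its inputs.
            for key, data in zip(missing, pool.map(self._store.read, missing)):
                self._cache[key] = data
        return [self._cache[k] for k in keys]


store = InMemoryStore({"live000.ts": b"aaa", "live001.ts": b"bbb"})
fetcher = CachedFetcher(store)
segments = fetcher.fetch_many(["live000.ts", "live001.ts", "live000.ts"])
```

Threads suit this workload because fetching small files is dominated by network latency rather than CPU. Swapping `InMemoryStore` for another class with the same `read` method would leave `CachedFetcher` unchanged, which is the point of isolating the provider.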

Getting Started:
Acquaint yourself with the Orcasound data on AWS: access.md
Look through these notebooks, which experiment with accessing the data. Compare the performance of reading data directly with orca-hls-utils versus reading it through the Parquet catalogues. Can you make some speed improvements?
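
For the performance comparison, a small timing harness along these lines may help keep measurements consistent. It is only a sketch: `compare_loaders` is a hypothetical helper, and the two lambdas are placeholders for whatever code actually reads via orca-hls-utils or via a Parquet catalogue.

```python
import time
from statistics import median


def compare_loaders(loaders, repeats=5):
    """Return the median wall-clock seconds for each named zero-argument loader."""
    results = {}
    for name, load in loaders.items():
        timings = []
        for _ in range(repeats):
            t0 = time.perf_counter()
            load()
            timings.append(time.perf_counter() - t0)
        # The median is less sensitive to one slow run than the mean.
        results[name] = median(timings)
    return results


# Placeholder loaders; replace each body with a real read path.
timings = compare_loaders(
    {
        "direct": lambda: sum(range(10_000)),
        "catalogue": lambda: sum(range(1_000)),
    },
    repeats=3,
)
```

When the loaders hit the network, repeated runs may be served from caches, so it is worth noting whether a number reflects a cold or a warm read.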

@scottveirs
Member

@vaibhavmehrotraml @ttan06 @zprice12

@paulcretu As we consider this issue further and also revise orcanode code this year, it may be worth re-visiting the file naming convention and size/duration for the FLAC data in the archive-orcasound-net S3 bucket.

Are there ways we can align with the BCHN file naming conventions at the same time we re-organize Orcasound data access to optimize ambient-sound-analysis efficiency (e.g. parallelization, cost)?

[Image: Screenshot 2024-02-02 at 11 21 58 AM]

@scottveirs scottveirs self-assigned this Feb 2, 2024
@scottveirs scottveirs added the help wanted Extra attention is needed label Feb 2, 2024
@scottveirs
Member

scottveirs commented Feb 2, 2024

Here are a few related discussions, issues, and places where hls-utils are used:

  1. 2023 discussion of a new audio data naming scheme for Orcasound (including potential alignment with the BCHN formats used by @ben-hendricks)
  2. 2018 issue in orcanode seeking human-readable file names (which guided initial decisions about the FLAC filenames that we've been generating for the last 12 months at Port Townsend as an experiment in lossless streaming and associated costs)
  3. The OrcaHello live inference system accesses the HLS streams via the PrepareDataForPredictionExplorer.py script.
