
HackTheDinos/SUNYdigs


# SUNYdigs

Overview

Paleontologists throughout the 20th century used field notebooks to keep detailed logs of their expeditions. Previous work at the museum has given us these notes as scanned images and as very imperfect text transcriptions. The text has never been analyzed for potentially relevant pieces of information that could lead to a new understanding of past expeditions. "These data are very frequently requested by researchers from around the world, but their imperfect nature make them less useful than they could be."

The SUNYdigs team decided that the best approach to the "Dig Up the Past" challenge was to gamify it by creating a platform where users can transcribe the notes without being overwhelmed.


Inspiration

The project was inspired in part by the hugely successful crowdsourcing project [reCAPTCHA](https://www.cylab.cmu.edu/partners/success-stories/recaptcha.html).
Here's a [link](https://www.youtube.com/watch?v=-Ht4qiDRZE8) to the TED talk on massive-scale online collaboration by Luis von Ahn,
founder of reCAPTCHA and CEO of Duolingo.

Design

As a first step, the system performs text-based image segmentation on all scanned pages.
For every scanned page in the [journal](https://github.com/amnh/HacktheDinos/tree/master/challenges/Dig-Up-The-Past), a folder with the same title is created to hold the images of all the segmented words. A text file with the same title is also created alongside the original image of the scanned page; it holds each word's metadata in the form of a JSON object.

`{"img_page": path/to/parent_image, "year": 1899, "word": [path/to/word_image_file, "!@#$%", number_of_votes], "author": "brown"}`
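The record above could be assembled roughly as follows. This is only a sketch: the helper names, the `word_0000.png` naming scheme, the `.txt` extension for the metadata file, and the sample page path are assumptions made for illustration, not the project's actual code.

```python
import json
from pathlib import Path


def word_paths(page_image, index):
    """Derive the per-page word folder and metadata file from a page image path.

    Assumed layout (hypothetical): words from scans/p001.jpg live in
    scans/p001/, and the metadata file is scans/p001.txt.
    """
    page = Path(page_image)
    word_dir = page.with_suffix("")            # folder with the same title
    word_image = word_dir / f"word_{index:04d}.png"
    metadata_file = page.with_suffix(".txt")   # text file with the same title
    return word_image, metadata_file


def make_word_record(page_image, word_image, transcription, votes, year, author):
    """Build one word's metadata in the shape shown above."""
    return {
        "img_page": Path(page_image).as_posix(),
        "year": year,
        # [path to the word image, current transcription, number of votes]
        "word": [Path(word_image).as_posix(), transcription, votes],
        "author": author,
    }


# Made-up sample values, for illustration only.
page = "scans/brown_1899_p001.jpg"
word_image, metadata_file = word_paths(page, 0)
record = make_word_record(page, word_image, "", 0, 1899, "brown")
line = json.dumps(record)  # one JSON object per word
```

Quoting the paths as strings, as `json.dumps` does here, keeps each record valid JSON so it can be read back with `json.loads`.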

This metadata remains in a queue until a particular transcription has gathered five upvotes. It is then saved to a database and archived.
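A minimal sketch of that voting queue, assuming each record carries a single candidate transcription and its vote count as the last element of `word`. The function names are hypothetical, the sample records are made up, and a plain list stands in for the database.

```python
from collections import deque

VOTES_NEEDED = 5  # threshold described in the text


def upvote(record):
    """Add one vote to the record's transcription."""
    record["word"][2] += 1


def drain_confirmed(queue, archive):
    """Move records that reached the threshold to the archive.

    Returns a new queue holding the records that still need votes.
    """
    pending = deque()
    while queue:
        record = queue.popleft()
        if record["word"][2] >= VOTES_NEEDED:
            archive.append(record)   # stand-in for the database save
        else:
            pending.append(record)
    return pending


# Made-up sample records, for illustration only.
queue = deque([
    {"img_page": "p1.jpg", "year": 1899, "word": ["p1/w0.png", "femur", 4], "author": "brown"},
    {"img_page": "p1.jpg", "year": 1899, "word": ["p1/w1.png", "quarry", 1], "author": "brown"},
])
archive = []

for record in queue:
    upvote(record)          # every queued word receives one more vote
queue = drain_confirmed(queue, archive)
```

After this round the first record has five votes and moves to the archive, while the second stays queued with two.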

Improvements and Observations

* A 'Suggest' feature would help lay transcribers guess difficult paleontology terms.
* Higher resolution images would help in getting words out of images with better accuracy.
* The text-based image segmentation is currently optimized for sparsely populated and well-formatted journal pages.

Issues

GitHub issues have been created for problems needing immediate attention. Feel free to contribute. :)
