ProseAlign

Efficient global word-level sequence alignment approximation via (recursive) Aho-Corasick prepass(es).
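The prepass idea can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in ac.py: it treats each word as one symbol, uses word n-grams that occur exactly once in the reference text as Aho-Corasick patterns, scans the transcript for them in a single pass, and keeps a greedy monotone subset of the hits as anchors. The function names, the n-gram length `n`, and the greedy filter are all hypothetical choices made for the example.

```python
from collections import Counter, deque


def build_automaton(patterns):
    """Build a word-level Aho-Corasick automaton from tuples of words."""
    goto = [{}]   # goto[state][word] -> next state
    fail = [0]    # failure link per state
    out = [[]]    # indices of the patterns that end at each state
    for idx, pattern in enumerate(patterns):
        state = 0
        for word in pattern:
            nxt = goto[state].get(word)
            if nxt is None:
                nxt = len(goto)
                goto[state][word] = nxt
                goto.append({})
                fail.append(0)
                out.append([])
            state = nxt
        out[state].append(idx)
    # Breadth-first pass to fill in failure links (depth-1 states fail to root).
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for word, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and word not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(word, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_matches(words, patterns, automaton):
    """Return (start_index, pattern_index) for every pattern hit in words."""
    goto, fail, out = automaton
    state, hits = 0, []
    for pos, word in enumerate(words):
        while state and word not in goto[state]:
            state = fail[state]
        state = goto[state].get(word, 0)
        for idx in out[state]:
            hits.append((pos - len(patterns[idx]) + 1, idx))
    return hits


def find_anchors(ref_words, hyp_words, n=3):
    """Pair up n-grams that are unique in both texts (assumed anchor rule)."""
    grams = [tuple(ref_words[i:i + n]) for i in range(len(ref_words) - n + 1)]
    counts = Counter(grams)
    ref_pos = {g: i for i, g in enumerate(grams) if counts[g] == 1}
    patterns = list(ref_pos)
    hits = find_matches(hyp_words, patterns, build_automaton(patterns))
    hit_counts = Counter(idx for _, idx in hits)
    pairs = sorted((ref_pos[patterns[idx]], start)
                   for start, idx in hits if hit_counts[idx] == 1)
    # Greedy monotone filter: simple, but not an optimal anchor chain.
    chain, last = [], -1
    for ref_start, hyp_start in pairs:
        if hyp_start > last:
            chain.append((ref_start, hyp_start))
            last = hyp_start
    return chain


ref = "the quick brown fox jumps over the lazy dog".split()
hyp = "the quick brown fox leaps over the lazy dog".split()
print(find_anchors(ref, hyp))  # [(0, 0), (1, 1), (5, 5), (6, 6)]
```

The stretches between consecutive anchors are the mismatch sections. In this sketch they could be searched again with a smaller `n` (the recursive step) or handed to an exact aligner.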

Word-level diffing algorithms proved insufficient for aligning or diffing prose.

This currently aligns a Vietnamese translation of Harry Potter with a Google speech-to-text (STT) transcription of the corresponding audiobook, because for some ungodly reason the voice actor did not record the translation faithfully (minor phrasing differences, insertions, and deletions). The alignment is needed to produce a faithful, punctuated transcription of the audiobook while also correcting the STT transcription's errors.

Global character-level sequence alignment (Needleman-Wunsch) is too slow for texts of ~26k character sequences, because its time and memory costs grow with the product of the two input lengths.

A custom word-level Needleman-Wunsch may have sufficed. It will need to be implemented anyway for the mismatch sections that remain after a few levels of recursive Aho-Corasick, so we will see whether all this has been a waste of time.
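For reference, a word-level Needleman-Wunsch can be sketched in a few lines. This is a generic textbook version that compares whole words instead of characters, not the custom variant planned here. The function name and the unit scores (`match=1`, `mismatch=-1`, `gap=-1`) are assumptions chosen for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two word lists; None marks a gap in either column."""
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((a[i - 1], None))  # word only in a (deletion)
            i -= 1
        else:
            pairs.append((None, b[j - 1]))  # word only in b (insertion)
            j -= 1
    pairs.reverse()
    return pairs


book = "he said that it was late".split()
audio = "he said it was very late".split()
for left, right in needleman_wunsch(book, audio):
    print(left, right)
```

Because the table has (len(a) + 1) × (len(b) + 1) cells, this is only practical on short inputs, which is why it would be applied to the small mismatch sections rather than to the whole text.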

TODO:

rewrite to be one self-contained function
implement recursion
implement custom Needleman-Wunsch
