kuhumcst/clarin-tei

CLARIN TEI reader

This is a new synchronized facsimile and transcription reader for the TEI files on clarin.dk.

It is a fork of the Glossematics source code, with many changes to TEI styling, metadata retrieval and page structure to fit these TEI files, which differ considerably from the ones at https://glossematics.dk.

Data preparation

I downloaded the "everyman" dataset from https://repository.clarin.dk/repository/xmlui/handle/20.500.12115/46 and extracted every zip file.

The extracted TIF files were recursively converted and renamed using the following commands (taken from kuhumcst#20):

find . -name '*.tif' -exec mogrify -format jpg -quality 70 {} +
find . -name '*.jpg' -exec rename 's/(?<!\.tif)\.jpg$/.tif.jpg/' {} +
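
Before deleting the originals, it may be worth checking that every TIF actually has a converted counterpart. The following is a minimal sketch, not part of the original workflow: the function name `find_unconverted` is hypothetical, and it assumes each TIF holds a single image, so that `page.tif` ends up as `page.tif.jpg` in the same directory.

```shell
#!/usr/bin/env bash
# List every TIF below a directory that has no "<name>.tif.jpg" next to it.
# An empty result suggests the convert and rename steps covered every file.
find_unconverted() {
  local root="${1:-.}"
  find "$root" -type f -name '*.tif' | while IFS= read -r tif; do
    # The rename step turns "page.jpg" into "page.tif.jpg", so look for that name.
    [ -f "$tif.jpg" ] || printf '%s\n' "$tif"
  done
}
```

Running `find_unconverted .` from the dataset root prints the paths of any TIF files that still lack a JPEG.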

To remove the remaining TIF files:

find . -name "*.tif" -type f -exec rm -f {} \;

To create thumbnails for search results:

mkdir thumbs
find . -name '*.jpg' -not -path './thumbs/*' -exec convert '{}' -resize 360x640 -set filename:newname "%t.%e" 'thumbs/thumb-%[filename:newname]' \;
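
Since every thumbnail is written to the same flat thumbs directory and named after the source file alone, two images with the same file name in different subdirectories would end up as the same thumbnail. As a hedged sketch (the function name `find_duplicate_names` is hypothetical and not part of the repository), the following lists such clashing names so they can be checked before generating thumbnails:

```shell
#!/usr/bin/env bash
# Print every JPEG file name that occurs more than once below a directory,
# ignoring the thumbs directory itself. Each printed name would map to the
# same thumbs/thumb-<name> output file.
find_duplicate_names() {
  local root="${1:-.}"
  find "$root" -type f -name '*.jpg' ! -path "$root/thumbs/*" \
    -exec basename {} \; | sort | uniq -d
}
```

Running `find_duplicate_names .` from the dataset root prints nothing when all image file names are unique.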

Server setup

The directory /etc/clarin-tei serves as the home directory of the system. The image and TEI files must be located somewhere under /etc/clarin-tei/files, while this Git repository is cloned to /etc/clarin-tei/clarin-tei.

The system requires Docker to run and is initialised as a systemd service:

cp system/clarin-tei.service /etc/systemd/system/clarin-tei.service
systemctl enable clarin-tei
systemctl start clarin-tei
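
To confirm that the service came up, something along these lines may help. It is only a sketch: the function name `check_clarin_tei` is hypothetical, and the unit name is assumed to be `clarin-tei`, as in the commands above.

```shell
#!/usr/bin/env bash
# Report whether the unit is enabled and active, then show recent log lines.
check_clarin_tei() {
  local unit="${1:-clarin-tei}"
  systemctl is-enabled "$unit"
  systemctl is-active "$unit"
  # Last 50 log lines for the unit, useful when the container fails to start.
  journalctl -u "$unit" -n 50 --no-pager
}
```

Calling `check_clarin_tei` without arguments checks the default unit. Reading the journal may require root or membership of the systemd-journal group.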

To be reachable from the public Internet, this system currently requires a separate reverse proxy.

For an nginx setup, such as the one running on alf.hum.ku.dk, the following snippet should be included:

location /clarin {
	include proxy_params;
	proxy_pass http://127.0.0.1:6789/;
}

This will proxy requests to the CLARIN TEI web service running on localhost:6789.
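
After editing the nginx configuration, the setup can be checked roughly as follows. This is a sketch under assumptions: the function name `verify_proxy` is hypothetical, the default public URL is a placeholder that should be replaced with the real host, and the commands are assumed to run as root on the proxy host.

```shell
#!/usr/bin/env bash
# Validate the nginx configuration, reload it, and probe the service both
# directly and through the proxy. The default public URL is a placeholder.
verify_proxy() {
  local public_url="${1:-http://localhost/clarin/}"
  nginx -t && systemctl reload nginx || return 1
  # Direct request to the service; -f makes curl fail on HTTP errors.
  curl -fsS -o /dev/null -w 'direct: %{http_code}\n' http://127.0.0.1:6789/
  # Same request through the reverse proxy.
  curl -fsS -o /dev/null -w 'proxied: %{http_code}\n' "$public_url"
}
```

If the direct request succeeds but the proxied one fails, the problem most likely lies in the nginx configuration, not in the service itself.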
