Self-hosted documents with built-in OCR and search

Librarian

Librarian is an easy-to-use viewer for scanned home documents

Features:

  • support for PDFs, JPGs and PNGs
  • document backups to a mounted volume (or a NAS via NFS!)
  • search engine for scanned text (OCR via Google Cloud Vision)
  • tagging and folders, so you can organize however you want

Demo

Check out a demo at https://librarian-demo.montanadev.com

Installation

Docker

$ docker run -p 8000:8000 \
    -e DATABASE_URL=postgresql://user:password@address/database \
    ghcr.io/montanadev/librarian:main

Kubernetes

apiVersion: v1
kind: ConfigMap
metadata:
  name: librarian-config
  labels:
    app: librarian
data:
  DATABASE_URL: 'postgresql://user:password@address/database'
  
---

apiVersion: apps/v1
kind: Deployment
metadata:
  name: librarian
  labels:
    app: librarian
spec:
  selector:
    matchLabels:
      app: librarian
  replicas: 1
  template:
    metadata:
      labels:
        app: librarian
    spec:
      containers:
      - name: librarian
        image: ghcr.io/montanadev/librarian:main
        imagePullPolicy: Always
        envFrom:
          - configMapRef:
              name: librarian-config
        ports:
        - containerPort: 8000
        resources:
          requests:
            memory: "256Mi"
            cpu: "500m"
          limits:
            memory: "1Gi"
            cpu: "2000m"

Don't skip! Google Cloud Vision API

Librarian's OCR is performed by the Google Cloud Vision (GCV) API, so it can't detect text without credentials. To get a service account key:

  1. Go to https://cloud.google.com/docs/authentication/getting-started
  2. Follow the "Creating a service account > Cloud Console" instructions, creating a new project if necessary
  3. Visit the API library page and search for Cloud Vision API
  4. Enable the Cloud Vision API for the project that contains the service account you just created
  5. Go back to Librarian, click Settings, and paste the service account JSON key into the Cloud Vision API Key box
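
Before pasting the key in step 5, it can help to confirm that the downloaded file really is a service account key and not some other credential. The sketch below is not part of Librarian: `check_service_account_key` is a hypothetical helper, and it only checks for a few fields that Google service account key files normally contain.

```python
import json

# Fields normally present in a Google service account JSON key file.
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_service_account_key(raw: str) -> list[str]:
    """Return a list of problems found in the key text (empty if it looks fine)."""
    try:
        key = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(key, dict):
        return ["expected a JSON object"]

    # Report every required field that is absent or empty.
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if not key.get(name)]

    # A key of another kind (for example a user credential) has a different "type".
    kind = key.get("type")
    if kind and kind != "service_account":
        problems.append(f"unexpected type: {kind!r}")
    return problems
```

An empty list means the text is at least shaped like a service account key. It does not prove that the key is valid or that the Cloud Vision API is enabled.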

At the time of writing, the first 1,000 pages each month are free, and each additional 1,000 pages costs $1.60.
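
For a rough idea of what a month of scanning might cost, the pricing above can be turned into a small calculation. This is only a sketch: it uses the figures quoted in this README, assumes one billed unit per page, and assumes partial blocks of 1,000 pages are rounded up (actual billing may differ). `estimate_monthly_ocr_cost` is a hypothetical name.

```python
import math

FREE_PAGES_PER_MONTH = 1_000  # from this README: first 1k pages each month are free
PRICE_PER_1K_PAGES = 1.60     # from this README: USD per 1k pages after that


def estimate_monthly_ocr_cost(pages: int) -> float:
    """Estimate one month's OCR cost in USD for the given number of pages."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    billable = max(0, pages - FREE_PAGES_PER_MONTH)
    # Assumption: a started block of 1,000 pages is billed in full.
    blocks = math.ceil(billable / 1_000)
    return round(blocks * PRICE_PER_1K_PAGES, 2)
```

Under these assumptions, a household that scans fewer than 1,000 pages a month pays nothing, and 3,000 pages in a month comes to two paid blocks.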

Configuration

The only required environment variable is DATABASE_URL, which should point to a running PostgreSQL instance. The rest are optional.

| Name | Default | Example | Description |
|------|---------|---------|-------------|
| DATABASE_URL | | postgresql://user:password@address/database | Database to store document metadata |
| ALLOWED_HOSTS | * | localhost,my-site.com | Django setting |
| SECRET_KEY | | | Django setting |
| ALLOW_REUPLOAD | false | | Set to true to allow the same document to be uploaded again as a separate document |
| DISABLE_ANNOTATION | false | | Set to true to disable OCR and document search |
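
Since DATABASE_URL is the one variable that must be right, a quick check before starting the container can save a failed boot. The sketch below is not Librarian's own code: `check_database_url` is a hypothetical helper, and treating only `postgres`/`postgresql` schemes as valid is an assumption based on the examples in this README.

```python
from collections.abc import Mapping
from urllib.parse import urlparse


def check_database_url(env: Mapping) -> dict:
    """Validate DATABASE_URL from an environment mapping and return its parts."""
    url = env.get("DATABASE_URL")
    if not url:
        raise ValueError("DATABASE_URL is required")

    parsed = urlparse(url)
    # Assumption: only PostgreSQL URLs are accepted, as in the README examples.
    if parsed.scheme not in ("postgres", "postgresql"):
        raise ValueError(f"unexpected scheme: {parsed.scheme!r}")

    database = parsed.path.lstrip("/")
    if not parsed.hostname or not database:
        raise ValueError("DATABASE_URL needs both a host and a database name")

    return {"user": parsed.username, "host": parsed.hostname, "database": database}
```

Calling `check_database_url(os.environ)` in a wrapper script would check the real environment. This only checks the shape of the URL, not whether the database is reachable.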

Security

It would be a really bad idea to run Librarian in a publicly accessible environment.

Librarian doesn't (currently) require logins or block anonymous access. I also haven't prioritized XSS prevention or file type enforcement.

Development

Prerequisites

Tools used to build Librarian:

  • make
  • npm
  • python (>3.9)
  • poetry
  • libnfs
  • imagemagick
  • postgres
  • openssl

You can install some of these on macOS via Homebrew:

$ brew install node [email protected] poetry libnfs imagemagick postgres openssl

For the backend

# on Macs with an M1/M2 chip, you may encounter grpcio build issues; use the following command to install
$ LDFLAGS="-L/opt/homebrew/opt/openssl@3/lib -L/opt/homebrew/opt/libnfs/lib ${LDFLAGS}" \
  CPPFLAGS="-I/opt/homebrew/opt/libnfs/include -I/opt/homebrew/opt/openssl@3/include ${CPPFLAGS}" \
  GRPC_PYTHON_BUILD_SYSTEM_ZLIB=1 \
  GRPC_PYTHON_BUILD_SYSTEM_OPENSSL=1 \
    poetry install

# the LDFLAGS/CPPFLAGS may need to be adjusted based on where brew installs libnfs and openssl
# ex.
# LDFLAGS="-L/opt/homebrew/opt/openssl@3/lib -L/usr/local/Cellar/libnfs/5.0.2/lib" \
# CPPFLAGS="-I/usr/local/Cellar/libnfs/5.0.2/include -I/opt/homebrew/opt/openssl@3/include ${CPPFLAGS}" \
# GRPC_PYTHON_BUILD_SYSTEM_ZLIB=1 \
# GRPC_PYTHON_BUILD_SYSTEM_OPENSSL=1 \
#   poetry install

# for everyone else
$ poetry install

$ createdb librarian
# run database migrations (if postgres isn't running, start it with `brew services start postgres`)
$ make migrate

# start the server
$ make run

For the frontend

$ cd client
$ npm i
$ npm start

See Makefile for additional commands.

Scripts

Test uploads without dragging and dropping files in the frontend:

$ curl 'http://0.0.0.0:8000/api/documents/home-title.pdf' -H 'Content-Type: application/pdf' --data-binary '@home-title.pdf'
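
The same upload can be scripted from Python, which is handy for sending a batch of files. This is a minimal sketch using only the standard library. It mirrors the curl command above (a POST of the raw file bytes to `/api/documents/<filename>`) and assumes nothing else about the API; `build_upload_request` and `upload` are hypothetical helper names.

```python
import urllib.request
from pathlib import Path
from urllib.parse import quote

BASE_URL = "http://0.0.0.0:8000"  # same address as the curl example


def build_upload_request(path: str, content_type: str = "application/pdf") -> urllib.request.Request:
    """Build the request curl would send: POST raw bytes to /api/documents/<filename>."""
    file = Path(path)
    url = f"{BASE_URL}/api/documents/{quote(file.name)}"
    return urllib.request.Request(
        url,
        data=file.read_bytes(),
        headers={"Content-Type": content_type},
        method="POST",
    )


def upload(path: str) -> int:
    """Send the file to a running Librarian server and return the HTTP status code."""
    with urllib.request.urlopen(build_upload_request(path)) as response:
        return response.status
```

With the development server running, `upload("home-title.pdf")` should behave like the curl command. Building the request separately makes it possible to inspect the URL and headers without a server.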

Roadmap

See roadmap.md