Django-docviewer for Elasticsearch and Haystack 2.0:

This django-docviewer is a fork of https://github.com/oxys-net/django-docviewer, which is itself a fork of https://github.com/NYTimes/document-viewer. In the oxys-net fork, all dependencies on jammit and Ruby were removed and replaced by django-pipeline. document-viewer was only a client-side viewer; django-docviewer also stores document data and generates it using docsplit (https://github.com/documentcloud/docsplit) and celery.

There are two reasons for this fork to exist:

Support for Elasticsearch. This means that django-docviewer has to be configured with the Haystack beta (soon to be the stable release).
Automatic indexing. In the oxys-net fork, it was necessary to run rebuild_index or update_index manually (python manage.py update_index), so I included the celery-haystack library.

Summary of the changes:

Haystack 2.0.0-beta : included and configured in search_indexes.py
celery-haystack : included and configured in search_indexes.py
django-celery : easy starting of the celery server inside the Django environment (python manage.py celery worker)
pyelasticsearch : included in the installation of the demo (instead of Whoosh)
elasticsearch : configured in the settings of the demo
docviewer : fixes for minor bugs that affected inheriting from the main model (document)

Please read original licences in docviewer directory.

Installation of system dependencies:

Install all the packages (the following command has only been tested on Ubuntu 12.04 and 12.10, both 64-bit):

sudo apt-get install rabbitmq-server rubygems graphicsmagick poppler-utils pdftk ghostscript tesseract-ocr yui-compressor git python-pip python-dev build-essential npm openjdk-7-jre -y

You need to install docsplit:

Install:
```
sudo gem install docsplit
```
Try it:
```
docsplit
```

This is part of the oxys-net/django-docviewer configuration:

sudo ln -s /usr/local/bin/docsplit /usr/bin/docsplit
sudo ln -s /usr/bin/yui-compressor /usr/local/bin/yuicompressor
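
Before continuing, it can help to confirm that the command-line tools installed above are actually reachable. The following is a minimal sketch, not part of django-docviewer: the `REQUIRED_TOOLS` list and the `missing_tools` helper are hypothetical, and the executable names are assumptions about what the Ubuntu packages provide. It also assumes a Python 3 interpreter, since `shutil.which` is not available in Python 2.

```python
import shutil

# Executables assumed to be provided by the packages installed above
# (the names are assumptions; adjust them for your distribution).
REQUIRED_TOOLS = [
    "docsplit",   # gem install docsplit
    "gm",         # graphicsmagick
    "pdftotext",  # poppler-utils
    "pdftk",
    "gs",         # ghostscript
    "tesseract",  # tesseract-ocr
]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing executables: " + ", ".join(missing))
    else:
        print("All required executables were found.")
```

If anything is reported as missing, re-run the corresponding install step before moving on.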

Install yuglify (needed for production):
```
npm install yuglify
```

Install Elasticsearch:

cd ~
wget https://github.com/downloads/elasticsearch/elasticsearch/elasticsearch-0.19.11.deb
sudo dpkg -i elasticsearch-0.19.11.deb

Install the django-docviewer:

Install it with pip (inside your virtualenv):

pip install -e git+git://github.com/robertour/django-docviewer.git#egg=django-docviewer

Add the following apps to the INSTALLED_APPS of your Django settings:

'pipeline',        # necessary for compression and docviewer templates
'djcelery',        # necessary for python manage.py celery worker
'celery_haystack', # necessary for automatic rebuild_index
'haystack',        # necessary for manual rebuild_index
'docviewer',

Add the pipeline configuration to your Django settings:

# Pipeline configuration
STATICFILES_STORAGE = 'pipeline.storage.PipelineCachedStorage'
PIPELINE = False
PIPELINE_CSS_COMPRESSOR = 'pipeline.compressors.yui.YUICompressor'
PIPELINE_JS_COMPRESSOR = 'pipeline.compressors.yui.YUICompressor'

Add the celery configuration to your Django settings:

# Celery configuration
BROKER_URL='amqp://guest:guest@localhost:5672//'
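
To make the parts of the broker URL explicit, here is a small sketch that splits it with the standard library. The `describe_broker_url` helper is hypothetical and only for illustration. The way the virtual host is read from the path (everything after the first slash) reflects my understanding of how Celery interprets the URL, so treat it as an assumption.

```python
from urllib.parse import urlsplit


def describe_broker_url(url):
    """Split an AMQP broker URL into its components."""
    parts = urlsplit(url)
    return {
        "transport": parts.scheme,
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
        # Assumption: everything after the first "/" of the path is the
        # virtual host, so a URL ending in "//" selects the vhost "/".
        "vhost": parts.path[1:],
    }


info = describe_broker_url("amqp://guest:guest@localhost:5672//")
for key, value in info.items():
    print(key, "=", value)
```

If your RabbitMQ server uses different credentials, another host, or another virtual host, change the matching part of `BROKER_URL`.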

Add the haystack configuration to your Django settings:

# Haystack configuration

HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'haystack.backends.elasticsearch_backend.ElasticsearchSearchEngine',
        'URL': 'http://127.0.0.1:9200/',
        'INDEX_NAME': 'haystack',
    },
}

Add the docviewer configuration to your Django settings (decide your directories):

# Docviewer Configuration
from os.path import join
# PROJECT_ROOT is the absolute path to your project; adjust it to your setup
DOCVIEWER_DOCUMENT_ROOT = join(PROJECT_ROOT,'docs/')
DOCVIEWER_DOCUMENT_URL = '/docs/'
DOCVIEWER_IMAGE_FORMAT = 'png'
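
The settings above rely on a `PROJECT_ROOT` variable that is not defined in the snippet. The following sketch shows one possible way to derive it and build the three values. The `docviewer_settings` helper is hypothetical, and using the current working directory as the project root is an assumption made only to keep the example self-contained.

```python
import os
from os.path import abspath, join


def docviewer_settings(project_root):
    """Build the three docviewer settings for a given project root."""
    return {
        # The trailing slash is kept, as in the settings above.
        "DOCVIEWER_DOCUMENT_ROOT": join(project_root, "docs/"),
        "DOCVIEWER_DOCUMENT_URL": "/docs/",
        "DOCVIEWER_IMAGE_FORMAT": "png",
    }


# Assumption: the project root is the current working directory. In a real
# settings.py it is often derived from the location of the settings file.
PROJECT_ROOT = abspath(os.getcwd())
settings = docviewer_settings(PROJECT_ROOT)
print(settings["DOCVIEWER_DOCUMENT_ROOT"])
```

In an actual settings module you would assign the three names directly rather than keep them in a dictionary.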

Update your database and launch:
1. Update database:
```
python manage.py syncdb
```
2. Launch your site:
```
python manage.py runserver localhost:8000
```
3. Open the site at http://localhost:8000/admin/
4. Log in with the user created during syncdb or with any other admin account
5. Go to the following address:
```
localhost:8000/admin/sites/site/1/
```
6. Check that the domain name is correct ("localhost:8000" if you are developing), or change it to your real domain name. This is mandatory for docviewer to find the images of your PDFs. You will need to restart the server:
python manage.py runserver localhost:8000

Testing the installation:

Start the server:

python manage.py runserver localhost:8000

In another terminal run the celery service:
```
python manage.py celery worker
```
Add a scanned pdf document (for convenience, there is one in ~/git/django-docviewer/test.pdf) through the admin interface:
```
localhost:8000/admin/document/
```
You will need to wait a few seconds while docsplit splits the document and Elasticsearch indexes it. You can see the status in the admin interface. When the status is 'ready', you can search at the following URL (make sure you search for a term that actually appears inside your PDF):
```
localhost:8000/search/
```
You can also try accessing the document directly:

http://localhost:8000/viewer/1/demo.html

Disabling stop words:

Open the elasticsearch.yml:

sudo nano /etc/elasticsearch/elasticsearch.yml

Add the following to the configuration file (in the Index section):

index:
  analysis:
    analyzer:
      # set standard analyzer with no stop words as the default for both indexing and searching
      default:
        type: standard
        stopwords: _none_
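
To see why this matters for document search, the toy sketch below mimics what stop-word filtering does to a query. It is not the Elasticsearch standard analyzer: the `analyze` helper is hypothetical and `ENGLISH_STOPWORDS` is a small illustrative subset, so the real stop-word list may differ.

```python
import re

# Illustrative subset of common English stop words (an assumption; the
# list actually used by your Elasticsearch version may differ).
ENGLISH_STOPWORDS = {"a", "an", "and", "be", "is", "not", "of", "or", "the", "to"}


def analyze(text, stopwords):
    """Lowercase the text, split it into word tokens and drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in stopwords]


phrase = "To be or not to be"
print(analyze(phrase, ENGLISH_STOPWORDS))  # every token is filtered out
print(analyze(phrase, set()))              # all six tokens are kept
```

With stop words enabled, a phrase made only of common words leaves nothing to index or search for. With an empty stop-word list every word stays searchable.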

Delete the haystack index (warning: this deletes the entire index):
```
curl -XDELETE 'http://localhost:9200/haystack/'
```
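
The address in the curl command is the Haystack 'URL' followed by the 'INDEX_NAME' from the settings shown earlier (localhost and 127.0.0.1 point to the same machine). As a rough sketch, assuming Python 3 and the default connection, the same request could be built from those settings. The `index_url` and `delete_index` helpers are hypothetical, and `delete_index` is defined but deliberately not called here because it destroys the whole index.

```python
from urllib.request import Request, urlopen

# Copied from the Haystack settings shown earlier in this README.
HAYSTACK_CONNECTIONS = {
    "default": {
        "ENGINE": "haystack.backends.elasticsearch_backend.ElasticsearchSearchEngine",
        "URL": "http://127.0.0.1:9200/",
        "INDEX_NAME": "haystack",
    },
}


def index_url(connection):
    """Join the server URL and the index name into the index endpoint."""
    return connection["URL"].rstrip("/") + "/" + connection["INDEX_NAME"] + "/"


def delete_index(connection):
    """Send an HTTP DELETE to the index endpoint (destroys the whole index)."""
    request = Request(index_url(connection), method="DELETE")
    with urlopen(request) as response:
        return response.status


# Only the endpoint is printed; call delete_index() yourself when you are sure.
print(index_url(HAYSTACK_CONNECTIONS["default"]))
```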
Restart the elasticsearch service:
```
sudo service elasticsearch restart
```
