The Apache Tika toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).
Updated May 24, 2024 - Java
File Extension Fix Tool - Find and rename files with wrong extensions.
Entity extraction toolkit for geographic places, dates/times, and patterns, with text extraction from unstructured data and GIS outputters.
Elasticsearch File System Crawler (FS Crawler)
Book Management System for e-bibliomaniacs
A TYPO3 CMS extension that provides Apache Tika functionality
Extract text from a document with Apache Tika
OpenSearch-related code
A cross-platform command line tool for parallelised content extraction and analysis.
Incremental crawling capabilities for Apache Tika. Crawl content from sources such as file systems, HTTP(S) sources (web crawling), IMAP(S) servers, or your own arbitrary data sources. LeechCrawler offers additional Tika parsers that provide these crawling capabilities.
Processing system for the search engine service in Liquid Investigations.
DocClusterizer is a Java desktop application that analyzes and clusters documents by content similarity. It uses the Lucene and Tika libraries to process file formats such as TXT, PDF, DOCX, and PPTX.