A very simple news crawler with a funny name
Updated May 18, 2024 - Python
Get text content from any file
Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
Extract embedded metadata from HTML markup
Translate visual novels and other games in real time
Module for automatic summarization of text documents and HTML pages.
This GitHub repository hosts the notebooks and tools developed as part of this thesis to automate the extraction, processing, and analysis of data from the MICCAI 2023 conference, aiding in the systematic review and providing a structured foundation for further research in this crucial area.
A TYPO3 CMS extension that provides Apache Tika functionality
OCR with Tesseract and OpenCV: Extract text from images effortlessly. Preprocess with OpenCV for accuracy. Display results and save output. Easy integration for document digitization and data entry automation.
A self-hosted search engine for documents.
A reworked version of the https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the live alternative)
Heuristic-based boilerplate removal tool
Golang PDF library for creating and processing PDF files (pure Go)
Dataiku DSS plugin to perform optical character recognition (OCR) using the Tesseract engine.
Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
The objective is to analyze text content from a list of URLs. This involves extracting article titles and text, then performing natural language processing to generate metrics like sentiment, readability, and word usage. Finally, the results are stored for further analysis or visualization.
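The pipeline in the last description (extract a title and text, then compute text metrics) can be sketched with the Python standard library alone. This is a rough illustration under stated assumptions, not the code of any repository listed here: the names `TitleAndTextParser` and `analyze_html` are hypothetical, the input is an HTML string that has already been downloaded, average sentence length stands in as a crude readability proxy, and sentiment scoring is left out because it needs a lexicon or model.

```python
from collections import Counter
from html.parser import HTMLParser
import re


class TitleAndTextParser(HTMLParser):
    """Hypothetical helper: collects <title> text and visible text, skipping script/style."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def analyze_html(html):
    """Return the title plus a few simple word-usage metrics for one HTML page."""
    parser = TitleAndTextParser()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.text_parts)
    # Crude tokenization: runs of letters and apostrophes, lowercased.
    words = re.findall(r"[A-Za-z']+", text.lower())
    # Crude sentence split on terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "title": "".join(parser.title_parts).strip(),
        "word_count": len(words),
        # Average words per sentence, used here as a rough readability proxy.
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
        "top_words": Counter(words).most_common(3),
    }
```

To process a list of URLs, each page would first be fetched (for example with `urllib.request.urlopen`) and decoded, and the returned dictionaries could then be written to CSV or JSON for later analysis.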