Apache Nutch Plugin to capture Raw, Unstripped HTML content that nutch crawls

anupamkumar/raw-html

raw-html -- Apache Nutch plugin

By default, Apache Nutch strips all HTML tags before indexing. What if you want to keep the tags? Enter the raw-html plugin: use this Nutch plugin to store the raw HTML.

To do that, download the built version from the build directory. Instructions are available in the README there.

The plugin can easily be modified to extract specific tags, filter them out, or both. If you want to do that, go to the source directory: download, modify, and build it yourself. It's easy. Instructions are available in the README there.
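To give a rough idea of what "extract" and "filter" could mean once you have the raw HTML as a string, here is a minimal standalone sketch using only the JDK. It is not the plugin's code and does not use any Nutch API; the class name `TagFilterSketch` and the methods `extractTag` and `removeTag` are hypothetical names made up for this illustration. It relies on a simple regular expression, so it assumes the tags of interest are not nested inside tags of the same name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: not part of raw-html or Nutch.
public class TagFilterSketch {

    // Builds a pattern matching <tag ...>...</tag>, case-insensitively and across lines.
    // Assumption: elements of the same name are not nested.
    private static Pattern elementPattern(String tag) {
        String name = Pattern.quote(tag);
        return Pattern.compile(
            "<" + name + "\\b[^>]*>.*?</" + name + "\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    }

    // "Extract": return every element with the given tag name, markup included.
    static List<String> extractTag(String rawHtml, String tag) {
        List<String> found = new ArrayList<>();
        Matcher m = elementPattern(tag).matcher(rawHtml);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    // "Filter": return the raw HTML with every element of the given tag name removed.
    static String removeTag(String rawHtml, String tag) {
        return elementPattern(tag).matcher(rawHtml).replaceAll("");
    }

    public static void main(String[] args) {
        String rawHtml =
            "<html><body><h1>Title</h1><script>var x = 1;</script><p>Hi</p></body></html>";
        System.out.println(extractTag(rawHtml, "h1"));
        System.out.println(removeTag(rawHtml, "script"));
    }
}
```

For anything beyond simple cases, a real HTML parser would be a safer choice than regular expressions; the sketch only shows the shape of the two operations.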
