lefex: A Tool for LExical FEature eXtraction

This project contains Hadoop jobs for extraction of features of words and texts. Currently, the following types of features can be extracted:

CoNLL. Given a set of HTML documents in the CSV format url<TAB>s3-path<TAB>html-document, outputs the dependency-parsed documents in the CoNLL format. See the de.uhh.lt.lefex.CoNLL.HadoopMain class.
ExtractTermFeatureScores. Given a corpus in plain text format, extracts word counts (word<TAB>count), feature counts (feature<TAB>count), and word-feature counts (word<TAB>feature<TAB>count) and saves them to CSV files. This job is used for feature extraction in the JoSimText project: its output can serve as the input for computing a distributional thesaurus. See the de.uhh.lt.lefex.ExtractTermFeatureScores.HadoopMain class.
ExtractLexicalSampleFeatureScores. Given a lexical sample dataset for word sense disambiguation in CSV format, extract features of the target word in context and add them as an extra column. Currently, the system supports extraction of three types of features of a target word: co-occurrences, dependency features, and trigrams. See the de.uhh.lt.lefex.ExtractLexicalSampleFeatures.HadoopMain class.
SentenceSplitter. This job takes a plain text corpus as input and outputs a file with exactly one sentence per line. See the de.uhh.lt.lefex.SentenceSplitter.HadoopMain class.
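
To make the three count files of ExtractTermFeatureScores concrete, here is a minimal single-machine sketch in Python. It is not the Hadoop implementation from this project. The function names `extract_counts` and `to_tsv` are hypothetical, and the definition of a "feature" (a neighbouring token within a small window) is an assumption made only for illustration; the actual job may use other feature types.

```python
from collections import Counter


def extract_counts(sentences, window=1):
    """Count words, features, and word-feature pairs.

    Assumption: a feature of a word is any other token at most
    `window` positions away in the same whitespace-tokenized sentence.
    """
    word_counts = Counter()
    feature_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            word_counts[word] += 1
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j == i:
                    continue  # a word is not its own feature
                feature = tokens[j]
                feature_counts[feature] += 1
                pair_counts[(word, feature)] += 1
    return word_counts, feature_counts, pair_counts


def to_tsv(counts):
    """Render a counter as sorted tab-separated lines, count in the last column."""
    lines = []
    for key, count in sorted(counts.items()):
        fields = key if isinstance(key, tuple) else (key,)
        lines.append("\t".join(fields + (str(count),)))
    return lines


if __name__ == "__main__":
    words, features, pairs = extract_counts(["the cat sat", "the cat ran"])
    print("\n".join(to_tsv(words)))     # word<TAB>count
    print("\n".join(to_tsv(features)))  # feature<TAB>count
    print("\n".join(to_tsv(pairs)))     # word<TAB>feature<TAB>count
```

The three returned counters correspond to the three output files described above: one row per word, one row per feature, and one row per word-feature pair.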

To build the project, you may need to install a JoBimText jar file that contains a custom (non-mavenified) UIMA annotator for dependency collapsing. To do so, use the following script.
