Please commit any changes so we can merge easily
(notes from the meeting with Jesse)
-
Feature detectors:
- N-grams (Dennis), Done
- No-grams (Dennis)
- Word2Vec
- Regex (Panni)
-
Classifiers:
- Vowpal Wabbit
- Semi-supervised learning (Csaba: in progress)
The n-gram feature combines 1-grams (= BoW), 2-grams, ..., n-grams for feature creation. I have now set the minimum document frequency to 1/10000, which improved the BoW result by about 2%. Still, it might be beneficial to include a lot of features once 2- and 3-grams are involved. Does anyone have an idea for a good classifier that can handle a lot of features?
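To make the n-gram feature concrete, here is a minimal sketch in plain Python. It is not the code from the branch: the function names (ngrams, build_vocabulary, vectorize), the whitespace tokenization, and the reading of the minimum document frequency as a fraction of all documents are assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n_max):
    """Return all 1-grams up to n_max-grams of a token list, as tuples."""
    return [
        tuple(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def build_vocabulary(docs, n_max=3, min_df=1 / 10000):
    """Keep every n-gram that occurs in at least a fraction min_df of docs.

    Assumption: min_df is a fraction of the corpus, not an absolute count.
    """
    doc_freq = Counter()
    for doc in docs:
        # set(...) so that each document counts an n-gram at most once
        doc_freq.update(set(ngrams(doc.lower().split(), n_max)))
    threshold = min_df * len(docs)
    return sorted(g for g, count in doc_freq.items() if count >= threshold)


def vectorize(doc, vocabulary, n_max=3):
    """Count how often each vocabulary n-gram occurs in a single document."""
    counts = Counter(ngrams(doc.lower().split(), n_max))
    return [counts[g] for g in vocabulary]
```

With n_max=1 this reduces to plain BoW. Raising n_max grows the vocabulary quickly, which is why the min_df cut-off matters more once 2- and 3-grams are included.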
I opened another branch for no-grams because I had to remove a lot of stopwords. Its predictive power is no better than chance, I think because a no-gram appears on average in only every third review.
Creating 2-grams of the form no+adjective, and then:
- if adj = positive --> the 2-gram is a negative 2-gram
- if adj = negative --> the 2-gram is a positive 2-gram
Then simply count the number of positive vs. negative 2-grams.
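The no+adjective rule can be sketched as follows. This is only an illustration: the two word lists are tiny hypothetical stand-ins for a real sentiment lexicon, and the names no_gram_counts and classify are made up, not taken from the branch.

```python
# Hypothetical mini-lexicons; a real run would load a proper adjective list.
POSITIVE_ADJ = {"good", "great", "funny"}
NEGATIVE_ADJ = {"bad", "boring", "dull"}


def no_gram_counts(text):
    """Return (positive, negative) counts of no+adjective 2-grams in text."""
    # Naive whitespace tokenization: punctuation stays attached to words.
    tokens = text.lower().split()
    positive = negative = 0
    for first, second in zip(tokens, tokens[1:]):
        if first != "no":
            continue
        if second in POSITIVE_ADJ:
            negative += 1  # "no good" flips to a negative 2-gram
        elif second in NEGATIVE_ADJ:
            positive += 1  # "no boring" flips to a positive 2-gram
    return positive, negative


def classify(text):
    """Label a review by whichever kind of no-gram is more frequent."""
    positive, negative = no_gram_counts(text)
    if positive > negative:
        return "positive"
    if negative > positive:
        return "negative"
    return "undecided"  # no no-grams found, or a tie
```

A review without any no+adjective pair comes back as "undecided", which fits the observation that a no-gram shows up in only about every third review.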
Error fixed thanks to Dennis. There's another error, which I (Csaba) will solve tomorrow night. Until then, do not use the SemiSupervised class.