This repository has been archived by the owner on Dec 14, 2023. It is now read-only.

Add sample ML-based topic modeling support #170

Open

wants to merge 107 commits into master from topic_modelling

Conversation

DonggeLiu

tokenize articles
@DonggeLiu DonggeLiu requested a review from pypt June 29, 2017 00:48
@coveralls

Coverage Status

Coverage increased (+0.03%) to 58.601% when pulling e24f3b7 on topic_modelling into 08b0329 on master.

@coveralls

Coverage Status

Changes Unknown when pulling a23aa13 on topic_modelling into ** on master**.

@DonggeLiu
Author

  1. Two LDA models (using different packages; not sure which one is better yet)
  2. A path helper to assist imports
  3. Modified token_pool to make it compatible with the LDA models

@pypt pypt changed the title Create token_pool.py Add sample ML-based topic modeling support Jul 7, 2017
@pypt pypt self-assigned this Jul 7, 2017
@pypt
Contributor

pypt commented Jul 7, 2017

Hey Dongge, thanks for the PR! Good job on encapsulating both LDA libraries into classes! Also, it's great that you've put effort into docstring'ing your methods - as you might have noticed, we don't write docstrings for our methods (but I suppose we should).

Before you continue working on LDA, I'd like to merge this PR into master branch. When working in a team, it is a common practice to integrate and deploy code as often as possible to prevent separate branches from diverging too far from whatever's running in production. Merging and deploying now would allow us to maybe make use of your code already, plus it might give you some extra confidence in your work too!

Before the merge, I'd like you to consider making some additional changes in your branch. Some of these changes are common "good" practices used in software development, some others are just ways we like to do things. I hope that you won't feel overwhelmed by the huge list of changes.

General:

  • Write at least some basic sanity unit tests.

    As detailed in the coding guidelines, it is a good practice to test your code automatically; even more so in your case, as you'll likely be away after the summer, and we'll be left with your code which, if untested, is pretty hard to maintain.

    I suppose the unit tests should cover at least:

    • token pool -- sentence tokenization (with a How do you do? parameter, does one get ['How', 'do', 'you', 'do']?; also, what happens with sentences containing non-breaking prefixes like I enjoy cartoons, e.g. the ones drawn by Dr. Seuss.?), stopword removal (do stopwords get removed as they should?)

    • LDA -- with a given list of stories (or single story tokens), do we get the expected list of suggested topics?

  • Unsure why you need path_helper.py at all. I've deleted it and everything works fine for me.

  • Refactor sample calls of classes (e.g. model = ModelLDA(); print(model.summarize())) into unit tests, or use the if __name__ == "__main__": trick if you want to leave them.

    Python executes code when it imports a module, so if we chose to import your modules from somewhere else, we would run those sample calls without realizing it.

    As I understand, you use those calls to test your code. I think the best thing to do here would be for you to write unit tests for every thing that you would like to test, and then run those unit tests repeatedly while working on your code. This practice is called test-driven development (TDD for short); it basically means that one should treat the unit tests not as a formality to be added to the code after completing it, but as a useful tool to be used while doing the initial development.

    If you would like to leave your sample calls in the code, use the if __name__ == "__main__": trick detailed in the link above.
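
    For illustration, a minimal sketch of that guard at the bottom of a module, reusing the sample call from above:

      if __name__ == "__main__":
          # Runs only when the file is executed directly (python model_lda.py),
          # not when the module is imported from somewhere else.
          model = ModelLDA()
          print(model.summarize())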

  • Reformat code with PyCharm's keyboard shortcut.

  • Declare function parameter and return types.

    Python 3's type hinting is great! Here's the excerpt from the coding guidelines to help you with it.
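
    For example (the signature below is just an illustration, not your actual method):

      from typing import Dict, List

      def summarize(story_tokens: Dict[int, List[str]],
                    topic_num: int = 5) -> Dict[int, List[str]]:
          """Return suggested topic words keyed by stories_id."""
          ...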

  • Decouple token_pool and model_* modules, I think they shouldn't know about each other.

    As LDA models simply spit out a list of topics from a list of terms, I don't think they should be doing any database access themselves. Separating those concerns (fetching + tokenizing stories, and guessing topics from tokenized stories) would make it easier to cover both functions with unit tests too.
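
    Roughly, the split I have in mind (the output_tokens() name is only a sketch):

      # token_pool.py: talks to the database and returns a plain data structure
      story_tokens = TokenPool(db).output_tokens()   # e.g. {stories_id: [tokens, ...]}

      # model_*.py: knows nothing about the database, just consumes tokens
      model = ModelLDA()
      print(model.summarize(story_tokens))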

Token pool:

  • Use NLTK in word tokenization.

    Tokenizing text into tokens is not as easy as removing non-\w symbols via a regexp, as your current implementation does. Consider the following sentence:

      I enjoy cartoons, e.g. the ones drawn by Dr. Seuss.
    

    If you choose to remove symbols (or split the sentence into tokens by \s), e.g. and Dr. Seuss would get split into separate tokens (e, g; Dr, Seuss), which is not what you want.

    What's commonly done in this case is maintaining a list of non-breaking prefixes for every language. However, I'd suggest you don't dig too deep into sentence tokenization and just reuse whatever NLTK has to provide at this point.
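
    A rough sketch of what reusing NLTK would look like (assuming the punkt tokenizer data is installed):

      from nltk import sent_tokenize, word_tokenize

      text = 'I enjoy cartoons, e.g. the ones drawn by Dr. Seuss.'
      for sentence in sent_tokenize(text):
          # Punkt-based sentence splitting knows about common abbreviations,
          # so "Dr. Seuss" does not get cut into a separate sentence.
          print(word_tokenize(sentence))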

  • Why are you serializing sentences to json in token_pool.py instead of just returning a list?

  • I think you forgot to stem the terms. I'd suggest you use NLTK's English language Porter stemmer for that.
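
    Something along these lines (a sketch; where exactly it plugs into the token pool is up to you):

      from nltk.stem import PorterStemmer

      stemmer = PorterStemmer()
      stemmed = [stemmer.stem(token) for token in ['cartoons', 'drawn', 'enjoying']]
      # e.g. 'cartoons' -> 'cartoon', 'enjoying' -> 'enjoy'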

  • Make methods not used elsewhere private (prepend __, e.g. rename fetch_sentences() to __fetch_sentences()) or at least protected (prepend a single _).

  • You probably want to get a list of stories and their sentences, not a quasi-random list of sentences that belong to random stories. Also, it is likely that you want only English language stories and have them in the order in which they appeared in the story. So, please reuse the SQL query (or a variant of it) that I've sent you in an email titled "Sample dump, database tables, tokenization and stopwords":

    SELECT stories.stories_id,
           story_sentences.sentence
    FROM stories
        INNER JOIN story_sentences
            ON stories.stories_id = story_sentences.stories_id
    WHERE stories.language = 'en' -- some stories are not in English
    ORDER BY stories.stories_id,
             story_sentences.sentence_number
  • Prepend mc_root_path() to STOP_WORDS:

    STOP_WORDS = os.path.join(mc_root_path(), "lib/MediaWords/Languages/resources/en_stopwords.txt")
  • Make db a parameter instead of reconnecting with connect_to_db() every time.

    The unit test should connect_to_db() for you and pass the database handler to the TokenPool helper.
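
    In sketch form (the constructor shape is just a suggestion):

      class TokenPool(object):
          def __init__(self, db):
              # Reuse the caller's database handler instead of calling
              # connect_to_db() again on every fetch.
              self._db = db

      # in the unit test:
      # db = connect_to_db()
      # token_pool = TokenPool(db)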

LDA:

  • Add gensim, lda to requirements.txt. Both are third-party libraries and need to be installed somehow. We install dependencies using Pip, which uses requirements.txt.

  • import gensim only once; use gensim.corpora and gensim.models afterwards.

  • Make ModelLDA and ModelLDA2 into concrete subclasses of an abstract class that defines interface, with an abstract method summarize() expecting a list of parameters and returning predictable return values.
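
    One way to spell that out (a sketch using abc; the class name is only an example):

      from abc import ABC, abstractmethod
      from typing import Dict, List

      class TopicModel(ABC):
          """Interface shared by all topic model implementations."""

          @abstractmethod
          def summarize(self, story_tokens: Dict[int, List[str]]) -> Dict[int, List[str]]:
              """Return suggested topic words keyed by stories_id."""
              raise NotImplementedError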

  • Rename ModelLDA and ModelLDA2 to something more meaningful, maybe ModelGenSim and ModelLDA?

  • Move initialization stuff to class constructors.

    On every call to summarize(), you're re-initializing a couple of resources which (I think) would be better off if made into instance variables and initialized in a constructor.

  • Reduce procedural code by adding more parameters to class methods instead of having them just "do the magic thing" and return something:

    • Make summarize() accept a list of stories and their tokens as a parameter instead of having them fetch those stories themselves via TokenPool.
    • Eliminate STORY_NUMBER, method should probably do its job with all the stories passed to it.
    • Make TOPIC_NUMBER into a parameter (maybe with a default value) in order to make it easier to vary the number of topics generated by the code.
    • Make WORD_NUMBER into a parameter (probably with a default value too).
    • Make number of iterations configurable everywhere (via parameter to constructor / summarize()).
    • What is RANDOM_STATE?
  • Passing arguments in an array with "magic" indexes (stemmed_tokens[0], stemmed_tokens[1]) is to be avoided because it's not apparent what is being held as values under those indexes (e.g. is stemmed_tokens[0] a story ID or a list of tokens?). Please reorganize those into an object or find some other solution.
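
    For instance (a sketch; any explicit container would do):

      from collections import namedtuple

      StoryTokens = namedtuple('StoryTokens', ['stories_id', 'tokens'])

      stemmed = StoryTokens(stories_id=42, tokens=['migrant', 'asylum'])
      # stemmed.stories_id and stemmed.tokens are self-explanatory,
      # unlike stemmed_tokens[0] and stemmed_tokens[1].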

  • How do we interpret the return value of summarize()?

    ModelLDA returns:

    {16: [(0, '0.043*"migrants" + 0.017*"Rio" + 0.017*"Grande" + 0.015*"asylum"')], 11: [(0, '0.053*"Utah" + 0.032*"couples" + 0.021*"ban" + 0.016*"Judge"')], 13: [(0, '0.130*"Sebelius" + 0.087*"Kathleen" + 0.087*"marred" + 0.087*"resigning"')], 14: [(0, '0.027*"Republicans" + 0.024*"Obama" + 0.024*"Senate" + 0.022*"Republican"')], 15: [(0, '0.024*"league" + 0.021*"lawsuit" + 0.014*"hockey" + 0.014*"violence"')]}

    ModelLDA2 returns:

    {16: ['Grande', 'asylum', 'immigration', 'smugglers'], 17: ['rice', 'Persian', 'fried', 'sabzi'], 18: ['Lerner', 'Republican', 'conservative', 'contempt'], 19: ['journalists', 'judge', 'Greste', 'Jazeera'], 20: ['TV', 'media', 'Twitter', 'Facebook'], 21: ['Colbert', 'CBS', 'Letterman', 'Moonves'], 11: ['Utah', 'couples', 'ban', 'plaintiffs'], 13: ['Lerner', 'Republican', 'conservative', 'contempt'], 14: ['Obama', 'Republicans', 'Senate', 'Bush'], 15: ['league', 'lawsuit', 'violence', 'complaint']}

    Given that both methods are supposed to achieve the same thing, they should have identical return values.

  • Why does summarize() operate only on a single story per iteration in model_lda.py and on all stories in model_lda2.py?

  • How will it work with millions of stories (the terms of which won't fit into RAM)? Can you make both models into objects that could be trained iteratively, i.e.:

    x = ModelLDA()
    x.add(story1_tokens)
    x.add(story2_tokens)
    x.add(story3_tokens)
    # <...>
    print(x.topics())

Side notes:

The following is admittedly some copy-pasta from another pull request that I have reviewed recently :), but I hope that you will find it useful too:

Admittedly, we're pretty far from "practice what you preach" when it comes to commit messages, but out of general curiosity, skim through the legendary How to Write a Git Commit Message page.

I find that putting more effort into my own commit messages becomes more and more important for me over time, as I constantly have to revisit my own code after months or years; so when committing changes, I try to think of the "future me" (or some other developer) and explain what my code is all about.

One last strategy that I try to employ to do my future-self a favor is that I try to make my commits atomic, i.e. I avoid having "fix stuff" commits with changes over 10kLoC. While it takes more time to do all those separate commits, it:

  • allows me to git revert stuff,
  • makes testing easier - one can then pinpoint the test failure to a specific change in the code; also, you can make a small commit, push it to Travis's continuous integration and continue working while waiting for Travis to tell you whether or not something breaks,
  • makes thinking about one's own code more systematic, i.e. the need to have atomic commits forces me to think about the implementation in terms of discrete steps that I should take to do something, and I find it's harder to make various mistakes and forget things that way.

Very promising work so far, Dongge, looking forward to your fixes!

1. Made every variable and method private where possible
2. Reformatted code with PyCharm shortcut
3. Added tests for TokenPool (works well) and ModelGensim (does not work due to a 'no module named XXX' problem when model_gensim calls its abstract parent)
4. Decoupled token_pool and model_*
5. Used if __name__ == '__main__' to give a simple demonstration of how to use each method

Model_*
1. Renamed model_lda.py and model_lda2.py to model_gensim.py (which uses the Gensim package) and model_lda.py (which uses the LDA package)
2. Added an abstract parent class TopicModel.py
3. Moved some code from summarise() to add_stories() (a. better code structure; b. improved performance)
4. Changed some constants to function arguments (e.g. total_topic_num, iteration_num, etc.)

TokenPool
1. Added mc_root_path() when locating the stopwords file
2. Modified query in token pool:
	1. added "INNER JOIN stories WHERE language='en'" to guarantee all stories are in English
	2. added "LIMIT" and corresponding "SELECT DISTINCT ... ORDER BY..." to guarantee only fetch the required number of stroies (thus improves performance)
	3. added "OFFSET"
3. Restructured token_pool.py so that the stories are traversed only once (thus improving performance)
4. Decoupled DB from token_pool.py
5. Replaced regex tokenization with NLTK's tokenizer
6. Added nltk.stem.WordNetLemmatizer to lemmatize tokens (which gives better results than stemming)
@pypt
Contributor

pypt commented Jul 20, 2017

  1. Please sort the newly added dependencies in requirements.txt as asked in the header of the file.

  2. You mention in requirements.txt that one needs to install NLTK data, but you don't install it anywhere in the install scripts, which is one of the reasons why the Travis build is failing.

    The general idea of doing those Travis builds is to ensure that everything that needs to be installed gets installed by the setup scripts, and that all the unit tests pass with whatever gets installed by Travis. Each Travis build is run in a "clean room" (a virtual machine that gets created just for the occasion), so you have to install everything yourself with a setup script to make your unit tests work.

    So, you probably want to add something like:

    echo "Installing NLTK data..."
    if [ `uname` == 'Darwin' ]; then
        NLTK_DATA_PATH=/usr/local/share/nltk_data
    else
        NLTK_DATA_PATH=/usr/share/nltk_data
    fi
    $COMMAND_PREFIX python$PYTHON3_MAJOR_VERSION -m nltk.downloader all -d "$NLTK_DATA_PATH"

    at the end of install/install_python_dependencies.sh.

    Some notes about what's happening in the lines above:

    • The Installing NLTK data page says that the recommended system location for NLTK data is different on OS X and Linux
    • We can't use nltk.download() because we have no GUI on Travis nor production, so we just install everything.
    • COMMAND_PREFIX is set to either sudo (on Linux) or an empty string (on OS X) because /usr/local/share/nltk_data is user-writable on OS X while /usr/share/nltk_data needs privileged access on Linux.

    You probably want to add /usr/share/nltk_data (Travis runs on Ubuntu) to cache/directories/ in file .travis.yml too; it takes a while to install all the NLTK data so we might as well cache it after installing once. Not sure if this is going to work though, but worth a try.

  3. Separate LDA model unit tests and token pool tests from the database.

    Another reason why the Travis builds are failing is that the database on Travis doesn't contain any data. On your local machine you've imported the test dump that I've sent you, so the tests pass on your system because there's data present in the PostgreSQL instance you're running. PostgreSQL on Travis doesn't contain any data by default (unless you load some data into it before testing), which is why your tests fail there.

    On Perl, we have a MediaWords::Test::DB::test_on_test_database helper that, when used, preloads the database with a schema; then we write some data to PostgreSQL, run a subroutine that is being tested, compare the results, and throw this test database away. Unfortunately, this helper hasn't been ported to Python yet, but there's no need for you to use it. Instead, please test the individual helper methods (sentence tokenization, stopword removal) with some text data hardcoded into the test, e.g.:

    def test_tokenize_sentence_into_words():
        assert tokenize_sentence_into_words('Dongge is in Australia.') == ['Dongge', 'is', 'in', 'Australia']
        # ...
    
    def test_remove_stopwords():
        assert remove_stopwords(['Dongge', 'is', 'in', 'Australia']) == ['Dongge', 'Australia']
        # ...

    This would:

    1. Break the dependency of your tests from the database.
    2. Make tests faster.
    3. Allow you to test out various input values (None, empty strings, true positives, false negatives, ...) more easily.

    Please apply the same approach to your model tests too.
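
    For the models, the same idea could look roughly like this (the method names and arguments below are just a guess at your current interface):

      def test_model_lda_summarize():
          # Hardcoded tokens instead of a database dump
          model = ModelLDA()
          model.add_stories({1: ['migrant', 'asylum', 'border'],
                             2: ['hockey', 'league', 'lawsuit']})
          topics = model.summarize()
          assert set(topics.keys()) == {1, 2}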

  4. Remove path_helper.py.

    No idea why your PyCharm is unable to resolve modules.

    1. Are you using PyCharm?
    2. Do you use Media Cloud's virtualenv in PyCharm (Preferences -> Project: mediacloud -> Project Interpreter; interpreter should be 3.5.x virtualenv at ~/path/to/mc-venv)?
    3. Is the file mc-venv/lib/python3.5/site-packages/mediacloud.pth present on your system, and do the contents of that file point to ../../../../mediacloud/?
  5. Move tests to the same directory as the modules that are being tested, as per our coding guidelines.

  6. Get rid of warnings (there are some in test_model_gensim.py and test_model_lda.py).

  7. (from last PR review) Why are you serializing sentences to json in token_pool.py instead of just returning a list?

    Right now you seem to serialize the database results to JSON and then unserialize it right afterwards. How come?

  8. If _eliminate_symbols() no longer does symbol elimination but tokenizes sentences to words, I think it should have an appropriate name too.

  9. Keep table names in the SQL queries instead of having them in constants.

    One has to read and modify the SQL query more often than the table / column names change, so it's not worth storing them in separate variables (which are used only once anyway) and sacrifice simplicity and readability that way.

  10. (from last PR review) Please use the SQL query from the last PR review.

    I'm not sure if your SQL query does what you want it to do. It reads from story_sentences, then joins stories, then joins story_sentences again? Also, ordering by sentence_number is missing too. If my provided SQL query doesn't work for you, please let me know.

  11. You wrote me some nice and useful explanations of how both of your model approaches work. I think it would be great to have them in docstrings and / or comments.

@coveralls

Coverage Status

Coverage decreased (-0.004%) to 58.513% when pulling 809aad7 on topic_modelling into 97c110d on master.

…ter efficiency and performance

I will combine these two later
This allows more flexibility in Travis (i.e. use larger samples if we can run tests longer in Travis)
2. improve performance based on empirical results
@coveralls

Coverage Status

Coverage increased (+0.009%) to 58.438% when pulling 016d01c on topic_modelling into aeba6f7 on master.

@coveralls

Coverage Status

Coverage increased (+0.009%) to 58.481% when pulling 9ff15ff on topic_modelling into 2aff8d4 on master.
