This repository has been archived by the owner on Dec 14, 2023. It is now read-only.

Add sample ML-based topic modeling support #170

Open

wants to merge 107 commits into master from topic_modelling

Conversation

DonggeLiu

tokenize articles
@DonggeLiu DonggeLiu requested a review from pypt June 29, 2017 00:48
@coveralls

Coverage Status

Coverage increased (+0.03%) to 58.601% when pulling e24f3b7 on topic_modelling into 08b0329 on master.

@coveralls

Coverage Status

Changes Unknown when pulling a23aa13 on topic_modelling into ** on master**.

@DonggeLiu
Author

  1. Two LDA models (using different packages; not sure which one is better yet)
  2. A path helper to assist imports
  3. Modified token_pool to make it compatible with the LDA models

@pypt pypt changed the title Create token_pool.py Add sample ML-based topic modeling support Jul 7, 2017
@pypt pypt self-assigned this Jul 7, 2017
@pypt
Contributor

pypt commented Jul 7, 2017

Hey Dongge, thanks for the PR! Good job on encapsulating both LDA libraries into classes! Also, it's great that you've put effort into docstring'ing your methods - as you might have noticed, we don't write docstrings for our methods (but I suppose we should).

Before you continue working on LDA, I'd like to merge this PR into master branch. When working in a team, it is a common practice to integrate and deploy code as often as possible to prevent separate branches from diverging too far from whatever's running in production. Merging and deploying now would allow us to maybe make use of your code already, plus it might give you some extra confidence in your work too!

Before the merge, I'd like you to consider making some additional changes in your branch. Some of these changes are common "good" practices used in software development, some others are just ways we like to do things. I hope that you won't feel overwhelmed by the huge list of changes.

General:

  • Write at least some basic sanity unit tests.

    As detailed in the coding guidelines, it is a good practice to test your code automatically; even more so in your case, as you'll likely be away after the summer, and we'll be left with your code which, if untested, is pretty hard to maintain.

    I suppose the unit tests should cover at least:

    • token pool -- sentence tokenization (with a How do you do? parameter, does one get ['How', 'do', 'you', 'do']?; also, what happens with sentences containing non-breaking prefixes like I enjoy cartoons, e.g. the ones drawn by Dr. Seuss.?), stopword removal (do stopwords get removed as they should?)

    • LDA -- with a given list of stories (or single story tokens), do we get the expected list of suggested topics?

  • Unsure why you need path_helper.py at all. I've deleted it and everything works fine for me.

  • Refactor sample calls of classes (e.g. model = ModelLDA(); print(model.summarize())) into unit tests, or use the if __name__ == "__main__": trick if you want to leave them.

    Python executes code when it imports a module, so if we chose to import your modules from somewhere else, we would run those sample calls without realizing it.

    As I understand, you use those calls to test your code. I think the best thing to do here would be for you to write unit tests for every thing that you would like to test, and then run those unit tests repeatedly while working on your code. This practice is called test-driven development (TDD for short); it basically means that one should treat the unit tests not as a formality to be added to the code after completing it, but as a useful tool to be used while doing the initial development.

    If you would like to leave your sample calls in the code, use the if __name__ == "__main__": trick detailed in the link above.
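
    For illustration, a minimal sketch of that guard at the bottom of a module, reusing the sample call from above:

      if __name__ == "__main__":
          # Runs only when the file is executed directly (python model_lda.py),
          # not when the module is imported from somewhere else.
          model = ModelLDA()
          print(model.summarize())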

  • Reformat code with PyCharm's keyboard shortcut.

  • Declare function parameter and return types.

    Python 3's type hinting is great! Here's the excerpt from the coding guidelines to help you with it.
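
    For example (the signature below is just an illustration, not your actual method):

      from typing import Dict, List

      def summarize(story_tokens: Dict[int, List[str]],
                    topic_num: int = 5) -> Dict[int, List[str]]:
          """Return suggested topic words keyed by stories_id."""
          ...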

  • Decouple token_pool and model_* modules, I think they shouldn't know about each other.

    As LDA models simply spit out a list of topics from a list of terms, I don't think they should be doing any database access themselves. Separating those concerns (fetching + tokenizing stories, and guessing topics from tokenized stories) would make it easier to cover both functions with unit tests too.
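
    Roughly, the split I have in mind (the output_tokens() name is only a sketch):

      # token_pool.py: talks to the database and returns a plain data structure
      story_tokens = TokenPool(db).output_tokens()   # e.g. {stories_id: [tokens, ...]}

      # model_*.py: knows nothing about the database, just consumes tokens
      model = ModelLDA()
      print(model.summarize(story_tokens))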

Token pool:

  • Use NLTK in word tokenization.

    Tokenizing text into tokens is not as easy as removing non-\w symbols via a regexp, as your current implementation does. Consider the following sentence:

      I enjoy cartoons, e.g. the ones drawn by Dr. Seuss.
    

    If you choose to remove symbols (or split the sentence into tokens by \s), e.g. and Dr. Seuss would get split into separate tokens (e, g; Dr, Seuss), which is not what you want.

    What's commonly done in this case is maintaining a list of non-breaking prefixes for every language. However, I'd suggest you don't dig too deep into sentence tokenization and just reuse whatever NLTK has to provide at this point.
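
    A rough sketch of what reusing NLTK would look like (assuming the punkt tokenizer data is installed):

      from nltk import sent_tokenize, word_tokenize

      text = 'I enjoy cartoons, e.g. the ones drawn by Dr. Seuss.'
      for sentence in sent_tokenize(text):
          # Punkt-based sentence splitting knows about common abbreviations,
          # so "Dr. Seuss" does not get cut into a separate sentence.
          print(word_tokenize(sentence))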

  • Why are you serializing sentences to json in token_pool.py instead of just returning a list?

  • I think you forgot to stem the terms. I'd suggest you use NLTK's English language Porter stemmer for that.
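
    Something along these lines (a sketch; where exactly it plugs into the token pool is up to you):

      from nltk.stem import PorterStemmer

      stemmer = PorterStemmer()
      stemmed = [stemmer.stem(token) for token in ['cartoons', 'drawn', 'enjoying']]
      # e.g. 'cartoons' -> 'cartoon', 'enjoying' -> 'enjoy'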

  • Make methods not used elsewhere private (prepend __, e.g. rename fetch_sentences() to __fetch_sentences()) or at least protected (prepend a single _).

  • You probably want to get a list of stories and their sentences, not a quasi-random list of sentences that belong to random stories. Also, it is likely that you want only English language stories and have them in the order in which they appeared in the story. So, please reuse the SQL query (or a variant of it) that I've sent you in an email titled "Sample dump, database tables, tokenization and stopwords":

    SELECT stories.stories_id,
           story_sentences.sentence
    FROM stories
        INNER JOIN story_sentences
            ON stories.stories_id = story_sentences.stories_id
    WHERE stories.language = 'en' -- some stories are not in English
    ORDER BY stories.stories_id,
             story_sentences.sentence_number
  • Prepend mc_root_path() to STOP_WORDS:

    STOP_WORDS = os.path.join(mc_root_path(), "lib/MediaWords/Languages/resources/en_stopwords.txt")
  • Make db a parameter instead of reconnecting with connect_to_db() every time.

    The unit test should connect_to_db() for you and pass the database handler to the TokenPool helper.
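
    In sketch form (the constructor shape is just a suggestion):

      class TokenPool(object):
          def __init__(self, db):
              # Reuse the caller's database handler instead of calling
              # connect_to_db() again on every fetch.
              self._db = db

      # in the unit test:
      # db = connect_to_db()
      # token_pool = TokenPool(db)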

LDA:

  • Add gensim, lda to requirements.txt. Both are third-party libraries and need to be installed somehow. We install dependencies using Pip, which uses requirements.txt.

  • import gensim only once; use gensim.corpora and gensim.models afterwards.

  • Make ModelLDA and ModelLDA2 into concrete subclasses of an abstract class that defines interface, with an abstract method summarize() expecting a list of parameters and returning predictable return values.
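
    One way to spell that out (a sketch using abc; the class name is only an example):

      from abc import ABC, abstractmethod
      from typing import Dict, List

      class TopicModel(ABC):
          """Interface shared by all topic model implementations."""

          @abstractmethod
          def summarize(self, story_tokens: Dict[int, List[str]]) -> Dict[int, List[str]]:
              """Return suggested topic words keyed by stories_id."""
              raise NotImplementedError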

  • Rename ModelLDA and ModelLDA2 to something more meaningful, maybe ModelGenSim and ModelLDA?

  • Move initialization stuff to class constructors.

    On every call to summarize(), you're re-initializing a couple of resources which (I think) would be better off if made into instance variables and initialized in a constructor.

  • Reduce procedural code by adding more parameters to class methods instead of having them just "do the magic thing" and return something:

    • Make summarize() accept a list of stories and their tokens as a parameter instead of having them fetch those stories themselves via TokenPool.
    • Eliminate STORY_NUMBER, method should probably do its job with all the stories passed to it.
    • Make TOPIC_NUMBER into a parameter (maybe with a default value) in order to make it easier to vary the number of topics generated by the code.
    • Make WORD_NUMBER into a parameter (probably with a default value too).
    • Make number of iterations configurable everywhere (via parameter to constructor / summarize()).
    • What is RANDOM_STATE?
  • Passing arguments in an array with "magic" indexes (stemmed_tokens[0], stemmed_tokens[1]) is to be avoided because it's not apparent what is being held as values under those indexes (e.g. is stemmed_tokens[0] a story ID or a list of tokens?). Please reorganize those into an object or find some other solution.
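
    For instance (a sketch; any explicit container would do):

      from collections import namedtuple

      StoryTokens = namedtuple('StoryTokens', ['stories_id', 'tokens'])

      stemmed = StoryTokens(stories_id=42, tokens=['migrant', 'asylum'])
      # stemmed.stories_id and stemmed.tokens are self-explanatory,
      # unlike stemmed_tokens[0] and stemmed_tokens[1].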

  • How do we interpret the return value of summarize()?

    ModelLDA returns:

    {16: [(0, '0.043*"migrants" + 0.017*"Rio" + 0.017*"Grande" + 0.015*"asylum"')], 11: [(0, '0.053*"Utah" + 0.032*"couples" + 0.021*"ban" + 0.016*"Judge"')], 13: [(0, '0.130*"Sebelius" + 0.087*"Kathleen" + 0.087*"marred" + 0.087*"resigning"')], 14: [(0, '0.027*"Republicans" + 0.024*"Obama" + 0.024*"Senate" + 0.022*"Republican"')], 15: [(0, '0.024*"league" + 0.021*"lawsuit" + 0.014*"hockey" + 0.014*"violence"')]}

    ModelLDA2 returns:

    {16: ['Grande', 'asylum', 'immigration', 'smugglers'], 17: ['rice', 'Persian', 'fried', 'sabzi'], 18: ['Lerner', 'Republican', 'conservative', 'contempt'], 19: ['journalists', 'judge', 'Greste', 'Jazeera'], 20: ['TV', 'media', 'Twitter', 'Facebook'], 21: ['Colbert', 'CBS', 'Letterman', 'Moonves'], 11: ['Utah', 'couples', 'ban', 'plaintiffs'], 13: ['Lerner', 'Republican', 'conservative', 'contempt'], 14: ['Obama', 'Republicans', 'Senate', 'Bush'], 15: ['league', 'lawsuit', 'violence', 'complaint']}

    Given that both methods are supposed to achieve the same thing, they should have identical return values.

  • Why does summarize() operate only on a single story per iteration in model_lda.py and on all stories in model_lda2.py?

  • How will it work with millions of stories (the terms of which won't fit into RAM)? Can you make both models into objects that could be trained iteratively, i.e.:

    x = ModelLDA()
    x.add(story1_tokens)
    x.add(story2_tokens)
    x.add(story3_tokens)
    # <...>
    print(x.topics())

Side notes:

The following is admittedly some copy-pasta from another pull request that I have reviewed recently :), but I hope that you will find it useful too:

Admittedly, we're pretty far from "practice what you preach" when it comes to commit messages, but out of general curiosity, skim through the legendary How to Write a Git Commit Message page.

I find that putting more effort into my own commit messages becomes more and more important for me over time, as I constantly have to revisit my own code after months or years; so when committing changes, I try to think of the "future me" (or some other developer) and explain what my code is all about.

One last strategy that I try to employ to do my future-self a favor is that I try to make my commits atomic, i.e. I avoid having "fix stuff" commits with changes over 10kLoC. While it takes more time to do all those separate commits, it:

  • allows me to git revert stuff,
  • makes testing easier - one can then pinpoint the test failure to a specific change in the code; also, you can make a small commit, push it to Travis's continuous integration and continue working while waiting for Travis to tell you whether or not something breaks,
  • makes thinking about one's own code more systematic, i.e. the need to have atomic commits forces me to think about the implementation in terms of discrete steps that I should take to do something, and I find it's harder to make various mistakes and forget things that way.

Very promising work so far, Dongge, looking forward to your fixes!

1. Made every variable and method private where possible
2. Reformatted code with PyCharm shortcut
3. Added tests for TokenPool (works well) and ModelGensim (does not work due to a 'no module named XXX' problem when model_gensim calls its abstract parent)
4. Decoupled token_pool and model_*
5. Used if __name__ == '__main__' to give a simple demonstration of how to use each method

Model_*
1. Renamed model_lda.py and model_lda2.py to model_gensim.py (which uses the Gensim package) and model_lda.py (which uses the LDA package)
2. Added an abstract parent class TopicModel.py
3. Moved some code from summarise() to add_stories() (a. better code structure; b. improved performance)
4. Changed some constants to function arguments (e.g. total_topic_num, iteration_num, etc.)

TokenPool
1. Added mc_root_path() when locating the stopwords file
2. Modified query in token pool:
	1. added "INNER JOIN stories WHERE language='en'" to guarantee all stories are in English
	2. added "LIMIT" and corresponding "SELECT DISTINCT ... ORDER BY..." to guarantee only fetch the required number of stroies (thus improves performance)
	3. added "OFFSET"
3. Restructured token_pool.py so that the stories are traversed only once (thus improving performance)
4. Decoupled DB from token_pool.py
5. Replaced regex tokenization with NLTK's tokenizer
6. Added nltk.stem.WordNetLemmatizer to lemmatize tokens (which gives better results than stemming)
@pypt
Contributor

pypt commented Jul 20, 2017

  1. Please sort the newly added dependencies in requirements.txt as asked in the header of the file.

  2. You mention in requirements.txt that one needs to install NLTK data, but you don't install it anywhere in the install scripts, which is one of the reasons why the Travis build is failing.

    The general idea of doing those Travis builds is to ensure that everything that needs to be installed gets installed by the setup scripts, and that all the unit tests pass with whatever gets installed by Travis. Each Travis build is run in a "clean room" (a virtual machine that gets created just for the occasion), so you have to install everything yourself with a setup script to make your unit tests work.

    So, you probably want to add something like:

    echo "Installing NLTK data..."
    if [ `uname` == 'Darwin' ]; then
        NLTK_DATA_PATH=/usr/local/share/nltk_data
    else
        NLTK_DATA_PATH=/usr/share/nltk_data
    fi
    $COMMAND_PREFIX python$PYTHON3_MAJOR_VERSION -m nltk.downloader all -d "$NLTK_DATA_PATH"

    at the end of install/install_python_dependencies.sh.

    Some notes about what's happening in the lines above:

    • The Installing NLTK data page says that the recommended system location for NLTK data is different on OS X and Linux
    • We can't use nltk.download() because we have no GUI on Travis nor production, so we just install everything.
    • COMMAND_PREFIX is set to either sudo (on Linux) or an empty string (on OS X) because /usr/local/share/nltk_data is user-writable on OS X while /usr/share/nltk_data needs privileged access on Linux.

    You probably want to add /usr/share/nltk_data (Travis runs on Ubuntu) to cache/directories/ in file .travis.yml too; it takes a while to install all the NLTK data so we might as well cache it after installing once. Not sure if this is going to work though, but worth a try.

  3. Separate LDA model unit tests and token pool tests from the database.

    Another reason why the Travis builds are failing is that the database on Travis doesn't contain any data. On your local machine you've imported the test dump that I've sent you, so the tests pass on your system because there's data present in the PostgreSQL instance you're running. PostgreSQL on Travis doesn't contain any data by default (unless you load some data into it before testing), which is why your tests fail there.

    On Perl, we have a MediaWords::Test::DB::test_on_test_database helper that, when used, preloads the database with a schema; then we write some data to PostgreSQL, run a subroutine that is being tested, compare the results, and throw this test database away. Unfortunately, this helper hasn't been ported to Python yet, but there's no need for you to use it. Instead, please test the individual helper methods (sentence tokenization, stopword removal) with some text data hardcoded into the test, e.g.:

    def test_tokenize_sentence_into_words():
        assert tokenize_sentence_into_words('Dongge is in Australia.') == ['Dongge', 'is', 'in', 'Australia']
        # ...
    
    def test_remove_stopwords():
        assert remove_stopwords(['Dongge', 'is', 'in', 'Australia']) == ['Dongge', 'Australia']
        # ...

    This would:

    1. Break the dependency of your tests from the database.
    2. Make tests faster.
    3. Allow you to test out various input values (None, empty strings, true positives, false negatives, ...) more easily.

    Please apply the same approach to your model tests too.
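
    For the models, the same idea could look roughly like this (the method names and arguments below are just a guess at your current interface):

      def test_model_lda_summarize():
          # Hardcoded tokens instead of a database dump
          model = ModelLDA()
          model.add_stories({1: ['migrant', 'asylum', 'border'],
                             2: ['hockey', 'league', 'lawsuit']})
          topics = model.summarize()
          assert set(topics.keys()) == {1, 2}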

  4. Remove path_helper.py.

    No idea why your PyCharm is unable to resolve modules.

    1. Are you using PyCharm?
    2. Do you use Media Cloud's virtualenv in PyCharm (Preferences -> Project: mediacloud -> Project Interpreter; interpreter should be 3.5.x virtualenv at ~/path/to/mc-venv)?
    3. Is the file mc-venv/lib/python3.5/site-packages/mediacloud.pth present on your system, and do the contents of that file point to ../../../../mediacloud/?
  5. Move tests to the same directory as the modules that are being tested, as per our coding guidelines.

  6. Get rid of warnings (there are some in test_model_gensim.py and test_model_lda.py).

  7. (from last PR review) Why are you serializing sentences to json in token_pool.py instead of just returning a list?

    Right now you seem to serialize the database results to JSON and then unserialize it right afterwards. How come?

  8. If _eliminate_symbols() no longer does symbol elimination but tokenizes sentences to words, I think it should have an appropriate name too.

  9. Keep table names in the SQL queries instead of having them in constants.

    One has to read and modify the SQL query more often than the table / column names change, so it's not worth storing them in separate variables (which are used only once anyway) and sacrifice simplicity and readability that way.

  10. (from last PR review) Please use the SQL query from the last PR review.

    I'm not sure if your SQL query does what you want it to do. It reads from story_sentences, then joins stories, then joins story_sentences again? Also, ordering by sentence_number is missing too. If my provided SQL query doesn't work for you, please let me know.

  11. You wrote me some nice and useful explanations of how both of your model approaches work. I think it would be great to have them in docstrings and / or comments.

@coveralls

Coverage Status

Coverage decreased (-0.004%) to 58.513% when pulling 809aad7 on topic_modelling into 97c110d on master.

…ter efficiency and performance

I will combine these two later
This allows more flexibility in Travis (i.e. use larger samples if we can run tests longer in Travis)
2. improve performance based on empirical results
@coveralls

Coverage Status

Coverage increased (+0.009%) to 58.438% when pulling 016d01c on topic_modelling into aeba6f7 on master.

@coveralls

Coverage Status

Coverage increased (+0.009%) to 58.481% when pulling 9ff15ff on topic_modelling into 2aff8d4 on master.
