kozistr/movie-rate-prediction

Repository files navigation

Movie Rate Prediction

Movie rating prediction with TensorFlow

License: MIT

Environments

  • OS : Ubuntu 16.04+ / Windows 10
  • CPU : any (quad-core or better)
  • GPU : GTX 1060 6GB or better
  • RAM : 16GB or more
  • Library : TF 1.x with CUDA 9.0+ and cuDNN 7.0+

Prerequisites

  • Python
  • MySQL DB
  • tensorflow 1.x
  • numpy
  • gensim, konlpy, soynlp
  • mecab-ko
  • pymysql
  • h5py
  • tqdm
  • (Optional) java 1.7+
  • (Optional) PyKoSpacing
  • (Optional) MultiTSNE (for visualization)
  • (Optional) matplotlib (for visualization)

DataSet

DataSet              Language   Sentences   Words   Size
NAVER Movie Review   Korean     8.86M       391K    about 1GB

Movie Review Data Distribution

dist

Usage

1.1 Installing Dependencies

# Necessary
$ sudo python3 -m pip install -r requirements.txt
# Optional
$ sudo python3 -m pip install -r opt_requirements.txt

1.2 Configuration

# config.py contains the parameters used by the scripts below.
# Adjust them to your environment before running anything.
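The parameter names below are hypothetical, not copied from the repo's actual config.py; they only illustrate the kinds of values you are expected to adjust (DB credentials, paths, hyper-parameters):

```python
# Hypothetical config.py fragment -- the real parameter names in this repo
# may differ. These are the kinds of values the scripts read.

# DB connection (used when loading the dataset from MySQL)
db_host = 'localhost'
db_user = 'root'
db_password = 'your-password'   # change to your MySQL credentials
db_name = 'movie'

# training hyper-parameters
embed_size = 384                # Char2Vec embedding size used in the results below
batch_size = 128
model_path = './ml_model/'      # where checkpoints are saved
```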

2. Parsing the DataSet

$ python3 movie-parse.py

3. Making DataSet DB

$ python3 db.py
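The real db.py loads the comments/*.sql dumps into MySQL via pymysql. As a minimal, self-contained stand-in, the same idea can be shown with the stdlib sqlite3 module; the table and column names here are assumptions, not the repo's actual schema:

```python
import sqlite3

# Stand-in for db.py using sqlite3 instead of MySQL/pymysql.
# Table/column names are illustrative assumptions.
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE movie_review (id INTEGER PRIMARY KEY, rate INTEGER, comment TEXT)')

# Two fake rows standing in for parsed NAVER reviews.
rows = [(1, 10, '정말 재밌어요'), (2, 1, '최악의 영화')]
cur.executemany('INSERT INTO movie_review VALUES (?, ?, ?)', rows)
conn.commit()

cur.execute('SELECT COUNT(*) FROM movie_review')
n_rows = cur.fetchone()[0]
```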

4. Making w2v/d2v embeddings (skip this step if you only want to use Char2Vec)

$ python3 preprocessing.py

usage: preprocessing.py [-h] [--load_from {db,csv}] [--vector {d2v,w2v}]
                        [--is_analyzed IS_ANALYZED]

Pre-Processing NAVER Movie Review Comment

optional arguments:
  -h, --help            show this help message and exit
  --load_from {db,csv}  load DataSet from db or csv
  --vector {d2v,w2v}    d2v or w2v
  --is_analyzed IS_ANALYZED
                        already analyzed data

5. Training a Model

$ python3 main.py --refine_data [True or False]

usage: main.py [-h] [--checkpoint CHECKPOINT] [--refine_data REFINE_DATA]

train/test movie review classification model

optional arguments:
  -h, --help            show this help message and exit
  --checkpoint CHECKPOINT
                        pre-trained model
  --refine_data REFINE_DATA
                        solving data imbalance problem
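The --refine_data flag targets class imbalance (NAVER ratings cluster heavily at 10, as the distribution plot above shows). One common remedy is random oversampling of the minority classes; whether main.py does exactly this is an assumption, so the sketch below is only illustrative:

```python
import random

# Naive random-oversampling sketch for the rating-imbalance problem.
# That --refine_data does exactly this is an assumption.
def oversample(samples):
    # samples: list of (comment, rate) pairs
    by_rate = {}
    for comment, rate in samples:
        by_rate.setdefault(rate, []).append((comment, rate))
    target = max(len(group) for group in by_rate.values())
    rng = random.Random(42)          # fixed seed for reproducibility
    balanced = []
    for group in by_rate.values():
        balanced.extend(group)
        # duplicate random minority samples until every class matches the largest
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced

data = [('good', 10)] * 8 + [('bad', 1)] * 2
balanced = oversample(data)
```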

Repo Tree

│
├── comments          (NAVER Movie Review DataSets)
│    ├── 10000.sql
│    ├── ...
│    └── 200000.sql
├── w2v               (Word2Vec)
│    ├── ko_w2v.model (Word2Vec trained gensim model)
│    └── ...
├── d2v               (Doc2Vec)
│    ├── ko_d2v.model (Doc2Vec trained gensim model)
│    └── ...
├── model             (Movie Review Rate ML Models)
│    ├── textcnn.py
│    └── textrnn.py
├── image             (explanation images)
│    └── *.png
├── ml_model          (tf pre-trained model saved in here)
│    ├── checkpoint
│    ├── ...
│    └── charcnn-best_loss.ckpt
├── config.py         (Configuration)
├── tfutil.py         (handy tfutils)
├── dataloader.py     (Doc/Word2Vec model loader)
├── movie-parser.py   (NAVER Movie Review Parser)
├── db.py             (DataBase processing)
├── preprocessing.py  (Korean normalize/tokenize)
├── visualize.py      (for visualizing w2v)
└── main.py           (for easy use of train/test)

Pre-Trained Models

You can download the pre-trained models from the Google Drive links below!

  • Embedding Models

    • Word2Vec model : here
  • M.L Models

    • TextCNN model : here
    • TextRNN model : here

Models

  • TextCNN

img

Architecture credit: Toxic Comment Classification Kaggle 1st-place solution
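The TextCNN idea: convolve filters of several widths over the embedded sequence, max-pool each filter over time, and concatenate the pooled values into one feature vector for the FC layer. A numpy sketch with tiny illustrative dimensions (the trained model uses e.g. 256 filters and kernel sizes [10,9,7,5,3]):

```python
import numpy as np

# TextCNN core idea in numpy: 1-D convolutions of several widths over the
# embedded token sequence, ReLU, max-over-time pooling, then concatenation.
# Dimensions are tiny and illustrative, not the trained configuration.
rng = np.random.default_rng(0)
seq_len, embed_size, n_filters = 12, 8, 4
x = rng.normal(size=(seq_len, embed_size))          # one embedded sentence

pooled = []
for k in (3, 5, 7):                                 # several kernel widths
    w = rng.normal(size=(k, embed_size, n_filters)) # conv filters of width k
    # valid convolution over time: one activation per window position
    conv = np.stack([
        np.einsum('ke,kef->f', x[t:t + k], w)
        for t in range(seq_len - k + 1)
    ])                                              # (seq_len-k+1, n_filters)
    conv = np.maximum(conv, 0.0)                    # ReLU
    pooled.append(conv.max(axis=0))                 # max over time

feature = np.concatenate(pooled)                    # fed to the FC layer
```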

  • TextRNN

img

Architecture credit: Toxic Comment Classification Kaggle 1st-place solution
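TextRNN runs an RNN over the sequence and pools the per-timestep hidden states with an attention layer (attention size 128 in the trained model). The attention pooling itself can be sketched in numpy; the hidden states below are random stand-ins for real RNN outputs:

```python
import numpy as np

# Attention pooling over RNN hidden states, as used in TextRNN.
# Tiny illustrative sizes; h is a random stand-in for real RNN outputs.
rng = np.random.default_rng(1)
seq_len, hidden, attn = 6, 8, 4
h = rng.normal(size=(seq_len, hidden))      # per-timestep RNN outputs

w = rng.normal(size=(hidden, attn))
u = rng.normal(size=(attn,))
scores = np.tanh(h @ w) @ u                 # one relevance score per timestep
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                        # softmax attention weights
context = alpha @ h                         # weighted sum of hidden states
```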

Results

The dataset is noisy, so the results are not as good as expected :(
Further refining/normalizing of the raw sentences is needed!

  • TextCNN (Char2Vec)

img

Result : train MSE 1.553, val MSE 3.341
Hyper-Parameter : rand, conv kernel size [10,9,7,5,3], conv filters 256, dropout 0.7, fc unit 1024, Adam, embed size 384

  • TextCNN (Word2Vec)

img

Result : train MSE 3.410
Hyper-Parameter : non-static, conv kernel size [2,3,4,5], conv filters 256, dropout 0.7, fc unit 1024, AdaDelta, embed size 300

  • TextRNN (Word2Vec)

img

Result : train MSE 3.646
Hyper-Parameter : non-static, rnn cells 128, attention 128, dropout 0.7, fc unit 1024, AdaDelta, embed size 300

  • TextRNN (Char2Vec)

SOON!

Visualization

Simply run tensorboard --logdir=./ml_model/ and open TensorBoard in your browser.

Word2Vec Embeddings (t-SNE)

img

Perplexity : 80
Learning rate : 10
Iteration : 310
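The t-SNE settings above are what the TensorBoard projector runs. As a lighter-weight, deterministic stand-in you can project the embeddings to 2-D with plain PCA in numpy; note this is a swapped-in technique, not what the projector uses, and t-SNE preserves local neighborhoods better:

```python
import numpy as np

# 2-D projection of word embeddings via PCA -- a quick, deterministic
# stand-in for the t-SNE visualization above, not a replacement for it.
rng = np.random.default_rng(2)
emb = rng.normal(size=(100, 300))           # 100 fake word vectors, dim 300

centered = emb - emb.mean(axis=0)
# top-2 right singular vectors are the principal axes
_, _, vt = np.linalg.svd(centered, full_matrices=False)
coords = centered @ vt[:2].T                # (100, 2) points to scatter-plot
```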

To-Do

  1. deal with word spacing problem

ETC

Any suggestions, PRs, and issues are WELCOME :)

Author

HyeongChan Kim / @kozistr