keras-chinese-resume-parser-and-analyzer

Deep learning project that parses and analyzes Chinese resumes.

The objective of this project is to use Keras and deep learning models, such as CNNs and recurrent neural networks, to automate the task of parsing a Chinese resume.

Overview

Parser Features

Chinese NLP using SnowNLP
Extract Chinese text from PDF and DOCX files using pdfminer.six and python-docx
Rule-based resume parser
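The extraction step can be pictured with a minimal sketch like the one below. It is not the package's own `read_pdf_and_docx` implementation: the function names and the `EXTRACTORS` table are hypothetical, and only `pdfminer.high_level.extract_text` and `docx.Document` are real library calls.

```python
import os


def extract_pdf_text(path):
    # pdfminer.six exposes a high-level helper for whole-document extraction.
    from pdfminer.high_level import extract_text
    return extract_text(path)


def extract_docx_text(path):
    # python-docx is imported as "docx"; each paragraph carries its own text.
    from docx import Document
    return "\n".join(p.text for p in Document(path).paragraphs)


# Hypothetical dispatch table keyed by lower-cased file extension.
EXTRACTORS = {".pdf": extract_pdf_text, ".docx": extract_docx_text}


def extract_text_lines(path):
    """Return the non-empty, stripped text lines of a PDF or DOCX file."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in EXTRACTORS:
        raise ValueError("unsupported file type: " + ext)
    text = EXTRACTORS[ext](path)
    return [line.strip() for line in text.splitlines() if line.strip()]
```

The third-party imports sit inside the functions so that a missing library only fails for the file type that needs it.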

Deep Learning Features

Tkinter-based GUI tool to generate and annotate deep learning training data from PDF and DOCX files
Deep learning multi-class classification using recurrent and CNN networks for
- line type: classify each line of text extracted from a PDF or DOCX file as a header, meta-data, or content
- line label: classify each line of text extracted from a PDF or DOCX file by the section it implies (experience, education, etc.)

The following deep learning models are included to classify each line in a resume file:

cnn.py
- 1-D CNN with Word Embedding
- Multi-Channel CNN with categorical cross-entropy loss function
cnn_lstm.py
- 1-D CNN + LSTM with Word Embedding
lstm.py
- LSTM with categorical cross-entropy loss function
- Bi-directional LSTM/GRU with categorical cross-entropy loss function
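Several of the models above are trained with a categorical cross-entropy loss. As a small worked example (plain Python, not the Keras implementation, with made-up scores), the loss for a single line over the three line types can be computed like this:

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(one_hot_target, probabilities, eps=1e-12):
    """Loss for one sample: -sum(t_i * log(p_i))."""
    return -sum(t * math.log(max(p, eps))
                for t, p in zip(one_hot_target, probabilities))


# Made-up scores for one line over the classes: header, meta, content.
scores = [2.0, 0.5, 0.1]
probs = softmax(scores)
loss_if_header = categorical_cross_entropy([1, 0, 0], probs)
loss_if_content = categorical_cross_entropy([0, 0, 1], probs)
```

Because the highest score belongs to the first class, the loss is small when the true class is header and larger when the true class is content.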

Usage 1: Rule-based Chinese Resume Parser

The sample code below shows how to scan all the resumes (in PDF and DOCX formats) in the demo/data/resume_samples folder and print a summary from the resume parser whenever information could be extracted:

from keras_cn_parser_and_analyzer.library.rule_based_parser import ResumeParser
from keras_cn_parser_and_analyzer.library.utility.io_utils import read_pdf_and_docx


def main():
    data_dir_path = './data/resume_samples' # directory to scan for any pdf and docx files
    collected = read_pdf_and_docx(data_dir_path)
    for file_path, file_content in collected.items():

        print('parsing file: ', file_path)

        parser = ResumeParser()
        parser.parse(file_content)
        print(parser.raw) # print out the raw contents extracted from pdf or docx files

        if parser.unknown is False:
            print(parser.summary())

        print('++++++++++++++++++++++++++++++++++++++++++')

    print('count: ', len(collected))


if __name__ == '__main__':
    main()

Usage 2: Deep Learning Resume Parser

Step 1: training data generation and annotation

A training data generation and annotation tool is provided in the demo folder. It generates deep learning training data from any PDF and DOCX files stored in the demo/data/resume_samples folder. To launch this tool, run the following commands from the root directory of the project:

cd demo
python create_training_data.py

This parses the PDF and DOCX files in the demo/data/resume_samples folder and, for each file, launches a Tkinter-based GUI form in which the user annotates each text line of the file (click the "Type: ..." and "Label: ..." buttons repeatedly to select the correct annotation for each line). Each time a form is closed, the generated and annotated data is saved to a text file in the demo/data/training_data folder. Each line in the text file has the following format:

line_type   line_label  line_content

line_type and line_label have the following mappings to the actual class labels:

line_labels = {0: 'experience', 1: 'knowledge', 2: 'education', 3: 'project', 4: 'others'}
line_types = {0: 'header', 1: 'meta', 2: 'content'}
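To make the file format concrete, here is a minimal sketch that decodes one annotated line back into class names. It assumes the three fields are tab-separated and that line_type and line_label are stored as the integer keys of the mappings above; `parse_annotated_line` is a hypothetical helper, not a function from the package.

```python
LINE_LABELS = {0: 'experience', 1: 'knowledge', 2: 'education', 3: 'project', 4: 'others'}
LINE_TYPES = {0: 'header', 1: 'meta', 2: 'content'}


def parse_annotated_line(line, sep="\t"):
    """Split one annotated line into (line_type, line_label, line_content)."""
    # maxsplit=2 keeps any separators inside the resume text itself intact.
    type_id, label_id, content = line.rstrip("\n").split(sep, 2)
    return LINE_TYPES[int(type_id)], LINE_LABELS[int(label_id)], content
```

For example, under these assumptions a line such as `0<TAB>2<TAB>教育背景` would decode to a header line labeled education.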

Step 2: train the resume parser

After the training data is generated and annotated, one can train the resume parser by running the following command:

cd demo
python dl_based_parser_train.py

Below is the code for dl_based_parser_train.py:

import numpy as np

from keras_cn_parser_and_analyzer.library.dl_based_parser import ResumeParser


def main():
    random_state = 42
    np.random.seed(random_state)

    output_dir_path = './models'
    training_data_dir_path = './data/training_data'

    classifier = ResumeParser()
    batch_size = 64
    epochs = 20
    history = classifier.fit(training_data_dir_path=training_data_dir_path,
                             model_dir_path=output_dir_path,
                             batch_size=batch_size, epochs=epochs,
                             test_size=0.3,
                             random_state=random_state)


if __name__ == '__main__':
    main()

Upon completion of training, the trained models will be saved in the demo/models/line_label and demo/models/line_type folders.

The default line label and line type classifier used in the deep learning ResumeParser is WordVecBidirectionalLstmSoftmax. Other classifiers can be used by assigning them to the parser before training, for example:
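The test_size=0.3 and random_state=42 arguments in the script above suggest that 30% of the annotated lines are held out for validation using a reproducible shuffle. How fit performs the split internally is not shown here, so the following is only a minimal standard-library sketch of the idea; `split_train_test` is a hypothetical helper, not part of the package.

```python
import random


def split_train_test(samples, test_size=0.3, random_state=42):
    """Shuffle a copy of samples with a fixed seed and hold out test_size of them."""
    rng = random.Random(random_state)  # local RNG, leaves global state untouched
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


train, test = split_train_test(range(10))
```

Calling the function again with the same seed returns the same split, which is what makes training runs comparable.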

from keras_cn_parser_and_analyzer.library.dl_based_parser import ResumeParser
from keras_cn_parser_and_analyzer.library.classifiers.cnn_lstm import WordVecCnnLstm

classifier = ResumeParser()
classifier.line_label_classifier = WordVecCnnLstm()
classifier.line_type_classifier = WordVecCnnLstm()
...

(Make sure that the packages listed in requirements.txt are installed in your Python environment.)

Step 3: parse resumes using trained parser

After the trained models are saved in the demo/models folder, one can use the resume parser to parse the resumes in the demo/data/resume_samples folder by running the following commands:

cd demo
python dl_based_parser_predict.py

Below is the code for dl_based_parser_predict.py:

from keras_cn_parser_and_analyzer.library.dl_based_parser import ResumeParser
from keras_cn_parser_and_analyzer.library.utility.io_utils import read_pdf_and_docx


def main():
    data_dir_path = './data/resume_samples' # directory to scan for any pdf and docx files

    def parse_resume(file_path, file_content):
        print('parsing file: ', file_path)

        parser = ResumeParser()
        parser.load_model('./models')
        parser.parse(file_content)
        print(parser.raw)  # print out the raw contents extracted from pdf or docx files

        if parser.unknown is False:
            print(parser.summary())

        print('++++++++++++++++++++++++++++++++++++++++++')

    collected = read_pdf_and_docx(data_dir_path, command_logging=True,
                                  callback=lambda index, file_path, file_content:
                                  parse_resume(file_path, file_content))

    print('count: ', len(collected))


if __name__ == '__main__':
    main()
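The README does not show how the parser turns per-line predictions into its summary. As a purely illustrative sketch, assuming each line has already been assigned a predicted line type and line label, content lines could be grouped by label as follows; `group_content_by_label` is a hypothetical helper and not the package's `summary()` method.

```python
from collections import defaultdict


def group_content_by_label(predicted_lines):
    """Collect 'content' lines under their predicted label, in document order."""
    sections = defaultdict(list)
    for line_type, line_label, text in predicted_lines:
        if line_type == 'content':  # header and meta lines are skipped here
            sections[line_label].append(text)
    return dict(sections)
```

Each element of predicted_lines is assumed to be a (line_type, line_label, text) tuple using the class names from Step 1.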

Configure to run on GPU on Windows

Step 1: Change tensorflow to tensorflow-gpu in requirements.txt and install tensorflow-gpu
Step 2: Download and install the CUDA® Toolkit 9.0 (at the time of writing, CUDA® Toolkit 9.1 is not yet supported by tensorflow, so download CUDA® Toolkit 9.0)
Step 3: Download and unzip cuDNN 7.4 for CUDA® Toolkit 9.0 and add the bin folder of the unzipped directory to the PATH environment variable of your Windows environment
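After Step 3, it can help to confirm that the cuDNN bin folder really is on the search path. The sketch below uses only the standard library; the DLL name cudnn64_7.dll is an assumption based on the usual naming of cuDNN 7.x for Windows, so adjust it if your download differs. `find_on_path` is a hypothetical helper.

```python
import os


def find_on_path(filename, path=None):
    """Return every directory on the search path that contains filename."""
    if path is None:
        path = os.environ.get("PATH", "")
    hits = []
    for directory in path.split(os.pathsep):
        if directory and os.path.isfile(os.path.join(directory, filename)):
            hits.append(directory)
    return hits


if __name__ == "__main__":
    # Assumed DLL name for cuDNN 7.x on Windows.
    print(find_on_path("cudnn64_7.dll") or "cudnn64_7.dll not found on PATH")
```

An empty result means the bin folder has not been added to PATH, or that a new terminal has not been opened since the variable was changed.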

