FREQUENTLY ASKED QUESTIONS - NLP

SIVAGURUNATHAN VELAYUTHAM

[email protected]

INTRODUCTION

Frequently asked questions, popularly called FAQ, are a list of questions and answers that are commonly asked in some context and pertain to a particular topic. FAQs are mostly used wherever the same questions tend to recur. The traditional way to share an FAQ was to write an article and store it offline. Such articles were not necessarily organized as questions and answers, yet the term FAQ came to refer to all such offline documents and postings. With the growth of the Internet, people now share documents and articles online. They prefer to ask questions in online forums, chat with customer support, and read reviews. These channels help users find answers that are relevant to them. More recently, users have access to so much data that they often cannot find an appropriate answer to their question. An FAQ becomes irrelevant when the answer exists in one or more FAQ entries but the user still cannot find what he or she is looking for. Natural language processing (NLP) is a branch of artificial intelligence concerned with the automatic interpretation and generation of human language, such as text and speech. Applying NLP techniques such as stemming, lemmatization, and semantic features to the questions addresses the problem of finding the relevant question for the user.

REQUIREMENTS

Implement an FAQ system that produces improved results using NLP features and techniques. The input is a set of FAQs and their answers. The user enters a natural language question or statement, and the system returns one or more FAQs that match it.

DATASET:

This dataset contains question and answer data from Amazon, matched to products by ASIN (Amazon Standard Identification Number).

Sample Question and Answer:

asin - ID of the product
questionType - type of question, either yes/no or open-ended
answerType - type of answer, either yes/no or '?' (if the polarity of the answer could not be predicted)
AnswerTime - raw answer timestamp
UnixTime - answer time converted to a Unix timestamp
Question - question as text
Answer - answer as text

We chose the Pet Supplies product category, since its text is more natural and distinctive to process than that of the other categories.
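To make the record layout concrete, here is a minimal Python sketch that reads one JSON object per line and keeps the fields listed above. The key spellings, the `QARecord` class, the `parse_records` helper, and the sample line are all assumptions made for illustration; they are not taken from the dataset or the project code and should be adjusted to match the actual file.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class QARecord:
    """One question/answer pair; attribute names are illustrative."""
    asin: str
    question_type: str
    answer_type: Optional[str]
    unix_time: Optional[int]
    question: str
    answer: str


def parse_records(lines: Iterable[str]) -> Iterator[QARecord]:
    """Yield a QARecord for every non-empty line holding a JSON object."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        raw = json.loads(line)
        yield QARecord(
            asin=raw["asin"],
            question_type=raw["questionType"],
            # Assumed optional: treated as missing when the key is absent.
            answer_type=raw.get("answerType"),
            unix_time=raw.get("unixTime"),
            question=raw["question"],
            answer=raw["answer"],
        )


# Made-up sample line, for illustration only.
SAMPLE = (
    '{"asin": "B000000000", "questionType": "yes/no", "answerType": "?", '
    '"unixTime": 1400000000, "question": "Is this safe for puppies?", '
    '"answer": "Check the label for the recommended age."}'
)
records = list(parse_records([SAMPLE, ""]))
```

Each parsed record could then be written to the database as the raw input for the later pipeline stages.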

ARCHITECTURE

IMPLEMENTATION

After obtaining the dataset, the following steps implement the NLP pipeline:

From the input dataset, parse the JSON data and store the raw data in the database.
Extract the raw data from the database, tokenize it with the PTB (Penn Treebank) tokenizer, compute unigram count probabilities, assign a weight to each question, and store the results back in the database.
On the SEARCH_UI page, the user types a question. The question flows through the same pipeline, its unigram probabilities are computed, and the most probable match is found by brute force, i.e. by comparing it against every record in the database.
This method is not efficient, because it scans the entire database and compares the query with every record to find the best one.
Build an advanced NLP pipeline that extracts features such as lemma, stem, part-of-speech tag, dependency parse tree, and synonyms (other meanings from WordNet) for a given sentence, and store the results back in the database for building the model.
Build a Word2Vec model by aggregating the features extracted in the previous step. The WordNet feature is used for training the model.
When the user types a question on the ADVANCE_SEARCH_UI page, the sentence goes through the advanced NLP pipeline, which extracts its features.
Send the extracted features to the model and predict the most similar words.
Retrieve the questions that match the predicted words and display them to the user.
Update the model after completing the user's request. In this way the model keeps being trained on more data, and its accuracy can improve.
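
The brute-force baseline from the first steps can be sketched in a few lines of Python. This is only one possible reading of "unigram probability" matching, not the project's actual code: each stored question is treated as a small unigram model with add-alpha smoothing, and the query is scored against every question in turn. The function names, the regex tokenizer (a stand-in for a real tokenizer), and the sample questions are assumptions for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word split; a simple stand-in for a real tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def log_likelihood(query_tokens, counts, length, vocab_size, alpha=1.0):
    """Log-probability of the query under an add-alpha smoothed unigram model."""
    return sum(
        math.log((counts[tok] + alpha) / (length + alpha * vocab_size))
        for tok in query_tokens
    )


def best_match(query, questions):
    """Score every stored question against the query (a full scan)."""
    query_tokens = tokenize(query)
    tokenized = [tokenize(q) for q in questions]
    # Vocabulary covers the stored questions and the query itself.
    vocab = set(query_tokens).union(*tokenized)
    best_question, best_score = None, -math.inf
    for question, tokens in zip(questions, tokenized):
        score = log_likelihood(query_tokens, Counter(tokens), len(tokens), len(vocab))
        if score > best_score:
            best_question, best_score = question, score
    return best_question, best_score


# Made-up questions, for illustration only.
FAQ = [
    "Is this food safe for puppies?",
    "Does the leash fit large dogs?",
    "How long does the filter last?",
]
match, score = best_match("leash for large dogs", FAQ)
```

Because `best_match` visits every stored question for every query, its cost grows linearly with the size of the database, which is the inefficiency that motivates the advanced pipeline.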

RESULTS

TEST CASE USING ADVANCED NLP TECHNIQUE

FUTURE WORK

With recent advances in deep learning, the architecture could be based on recurrent neural networks (RNNs), convolutional neural networks (CNNs), or a combination of both. Currently, we use gensim to build a Word2Vec model and find the most similar words from it. An alternative is an RNN combined with a bi-LSTM built in TensorFlow, which may give better performance and accuracy on larger datasets. To improve query performance, the data could be pushed to Elasticsearch or Solr so that searching on partial text is faster.
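
The query-performance point rests on the idea of an inverted index, which search engines of this kind build so that a lookup touches only the records containing a given word instead of scanning all of them. The sketch below is a toy, in-memory illustration of that idea in plain Python. It does not use or imitate the Elasticsearch or Solr APIs, and the `InvertedIndex` class, its methods, and the sample questions are hypothetical.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase word split; a simple stand-in for a real analyzer."""
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    """Maps each token to the ids of the questions that contain it."""

    def __init__(self):
        self._postings = defaultdict(set)
        self._docs = []

    def add(self, question):
        """Store a question and index its distinct tokens."""
        doc_id = len(self._docs)
        self._docs.append(question)
        for token in set(tokenize(question)):
            self._postings[token].add(doc_id)
        return doc_id

    def search(self, words, limit=5):
        """Rank questions by how many of the given words they contain."""
        hits = Counter()
        for word in words:
            for doc_id in self._postings.get(word.lower(), ()):
                hits[doc_id] += 1
        # Most matching words first; ties broken by insertion order.
        ranked = sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))
        return [self._docs[doc_id] for doc_id, _ in ranked[:limit]]


# Made-up questions, for illustration only.
index = InvertedIndex()
for q in [
    "Does the leash fit large dogs?",
    "Is the collar adjustable?",
    "How long does the filter last?",
]:
    index.add(q)

results = index.search(["leash", "collar", "dogs"])
```

The same lookup could serve the step that retrieves questions for the words predicted by the model, since only the postings for those words are read.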

TECHNOLOGIES USED

PROGRAMMING LANGUAGES: JAVA, PYTHON
TOOLS USED: STANFORD NLP, DROPWIZARD, HIBERNATE, GOOGLE GUAVA, JACKSON, JWI, FLASK
SERVER: JERSEY
UI COMPONENTS: HTML, CSS, JAVASCRIPT

RESOURCES AND LINKS
