FactParser

FactParser is an overly ambitious project that aims to take a block of text (a story, a paragraph, etc.) and extract the facts from it for use in downstream systems (chatbots?).

The (perhaps mistaken) assumption is that language is used to communicate facts and ask questions. The basic questions that can be answered with language are: Who? What? When? Where? Why? How? How much?

In most languages, a small set of basic verbs covers most ideas (for English, see https://www.englishclub.com/vocabulary/common-verbs-25.htm).

A rough, assumed approach to extracting facts from a paragraph would be the following:

Load a text
Break it into sentences (maybe this would be optional?)
Identify subjects, verbs, and objects (an SVO model). This is surprisingly difficult to do.
Identify which subjects are the same across different sentences. This is challenging because of pronouns.
Once subjects are identified, process the verbs and objects of every sentence that references a given subject to extract its facts.
Insert the results into a data structure, with the subject as the key and the details as children/subkeys.
Query info
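
To make the first three steps concrete, here is a minimal, standard-library-only sketch. It is a toy under loud assumptions, not the project's code: the `KNOWN_VERBS` list and the function names are hypothetical, and the "first known verb is the pivot" rule stands in for a real parser, which is exactly the part described above as surprisingly difficult.

```python
import re

# Hypothetical, tiny verb list for illustration only; a real system would
# rely on a POS tagger or dependency parser instead.
KNOWN_VERBS = {"is", "has", "likes", "owns", "lives", "went", "bought"}


def split_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace (naive: breaks on abbreviations)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def naive_svo(sentence):
    """Return (subject, verb, object) using the first known verb as the pivot, else None."""
    words = sentence.rstrip(".!?").split()
    for i, word in enumerate(words):
        # The verb needs at least one word on each side to form a triple.
        if word.lower() in KNOWN_VERBS and 0 < i < len(words) - 1:
            return " ".join(words[:i]), word.lower(), " ".join(words[i + 1:])
    return None


def extract_triples(text):
    """Run both steps and keep only the sentences that yielded a triple."""
    triples = []
    for sentence in split_sentences(text):
        triple = naive_svo(sentence)
        if triple is not None:
            triples.append(triple)
    return triples


triples = extract_triples("Alice owns a red bicycle. Alice likes tea. Bob lives in Paris.")
for subject, verb, obj in triples:
    print(subject, "|", verb, "|", obj)
```

Anything beyond simple declarative sentences (passives, relative clauses, pronouns) defeats this heuristic, which is why the rest of this README looks to existing libraries.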

I started out by learning a little about how to use NLTK, which is great but perhaps has too many levers to be quickly useful. It's like a box of nails, some wood, and a hammer... maybe some steel poles, screws, and a drill would be better for this task. So I began looking for what others had already done.

I found this Stack Overflow question, which offers several approaches for teasing out the S-V-O components, though they fail on pronouns and indirect references within a paragraph. It did, however, lead me to examine the spaCy library. https://stackoverflow.com/questions/39763091/how-to-extract-subjects-in-a-sentence-and-their-respective-dependent-phrases

Noting the above, I then started to explore the spaCy library and how it's used, and found that others have already created a coreference engine: https://github.com/huggingface/neuralcoref. This lets us know, with a reasonable degree of accuracy, which subjects and objects recur across sentences, so that we can map each topic to its supporting sentences (perfect mapping would be nice, but we have to start somewhere).
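
To illustrate what coreference resolution is meant to achieve, here is a deliberately naive sketch. It is not how neuralcoref works and does not use its API: it is a hypothetical "most recent subject" heuristic that swaps a sentence-initial pronoun for the first word of the last non-pronoun sentence. The pronoun list and function name are assumptions made for the example.

```python
# Hypothetical pronoun list; gender, number, and object pronouns are ignored.
SUBJECT_PRONOUNS = {"he", "she", "it", "they"}


def resolve_pronouns(sentences):
    """Replace a sentence-initial subject pronoun with the most recent subject.

    Toy assumption: the first word of any sentence that does not start with
    a pronoun is treated as that sentence's subject.
    """
    resolved = []
    last_subject = None
    for sentence in sentences:
        words = sentence.split()
        if not words:
            resolved.append(sentence)
            continue
        if words[0].lower() in SUBJECT_PRONOUNS:
            if last_subject is not None:
                words[0] = last_subject
        else:
            last_subject = words[0]
        resolved.append(" ".join(words))
    return resolved


story = ["Alice owns a bicycle.", "She likes tea.", "Bob lives in Paris.", "He went home."]
for line in resolve_pronouns(story):
    print(line)
```

The heuristic breaks as soon as a sentence starts with an article ("The dog...") or the pronoun refers to an object rather than the previous subject, which is the gap a trained coreference model is meant to close.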

So at this point, what is left is to create the associations between the "subjects" and their "objects" in a way that is queryable. That will be a challenge.
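
One possible shape for such a queryable structure is sketched below, assuming facts arrive as (subject, verb, object) triples. The `FactStore` class and its methods are hypothetical names for illustration; the nesting follows the idea of the subject being the key and the details being subkeys.

```python
from collections import defaultdict


class FactStore:
    """Nested mapping: subject -> verb -> list of objects (keys are lower-cased)."""

    def __init__(self):
        self._facts = defaultdict(lambda: defaultdict(list))

    def add(self, subject, verb, obj):
        """Record one fact, skipping exact duplicates."""
        objects = self._facts[subject.lower()][verb.lower()]
        if obj not in objects:
            objects.append(obj)

    def query(self, subject, verb=None):
        """Return the objects for (subject, verb), or every verb -> objects for the subject."""
        verbs = self._facts.get(subject.lower(), {})
        if verb is None:
            return {v: list(objs) for v, objs in verbs.items()}
        return list(verbs.get(verb.lower(), []))


store = FactStore()
store.add("Alice", "owns", "a red bicycle")
store.add("Alice", "likes", "tea")
store.add("Bob", "lives", "in Paris")

print(store.query("Alice", "owns"))
print(store.query("Alice"))
```

Keying on the lower-cased subject string is the simplest choice; it assumes coreferences have already been resolved so that each entity appears under a single name.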

Update (January 22, 2018):

Working today on using neuralcoref to transform paragraphs, updating sentences in place to reflect coreferences.
Found a possible methodology for using spaCy to classify the parts of the sentences produced by the step above, which can then be used to create intents.json for use in another chatbot project (https://github.com/gaolaowai/Chatbot-Examples/tree/master/NN-key-lookup).

Update (April 9, 2018):

Research into this is still ongoing... while using NLP deep learning networks, I've also been playing with integrating some HTM theory (see Numenta and cortical.io) to provide a layer of relational inference between different vocabulary items. The code still isn't terribly publishable yet.
