Key information extraction from text, with graph visualization. Inspired by TextGrapher.
Representing a text in a simple way is a challenging problem. This project tries to extract key information from a text with NLP methods, including named entity recognition (NER), relation detection, keyword extraction, and frequent-word extraction, and then displays that information as a graph.
Named Entity Recognition: Uses spaCy to extract named entities such as persons, organizations, and locations from news articles.
Relationship Extraction: Identifies relationships between entities using a pretrained model and generates JSON files representing node-edge relationships.
- Each model and the NER code generate separate JSON files.
- The differences between the two JSON files are then used to generate a final JSON file.
- The final JSON file is used for BFS searching and for plotting the graph.
- The final JSON file contains the NER tag, the node label, and the relationships between nodes.
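The merge and search steps listed above can be sketched as follows. The JSON layout used here (a `nodes` mapping from node label to NER tag, plus an `edges` list with `source`, `target`, and `relation` keys) is an assumption made for illustration, as are the function names and the sample data; the project's actual files may be laid out differently. "Finding the difference" is interpreted here as keeping the first file's edges and adding only those edges from the second file that the first one lacks.

```python
import json
from collections import deque

def merge_graphs(ner_graph, rel_graph):
    """Combine two node-edge dicts, adding only edges the first one lacks."""
    nodes = dict(ner_graph["nodes"])
    for label, tag in rel_graph["nodes"].items():
        nodes.setdefault(label, tag)  # keep the first file's tag when both have one
    seen = {(e["source"], e["target"], e["relation"]) for e in ner_graph["edges"]}
    edges = list(ner_graph["edges"])
    for e in rel_graph["edges"]:
        key = (e["source"], e["target"], e["relation"])
        if key not in seen:
            seen.add(key)
            edges.append(e)
    return {"nodes": nodes, "edges": edges}

def bfs(graph, start):
    """Return the node labels reachable from start, in breadth-first order."""
    adjacency = {}
    for e in graph["edges"]:
        # Treat edges as undirected for the purpose of the search.
        adjacency.setdefault(e["source"], []).append(e["target"])
        adjacency.setdefault(e["target"], []).append(e["source"])
    order, visited, queue = [], {start}, deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbour in adjacency.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(neighbour)
    return order

# Hypothetical sample data standing in for the two generated JSON files.
ner_graph = {
    "nodes": {"Alice": "PERSON", "Acme": "ORG"},
    "edges": [{"source": "Alice", "target": "Acme", "relation": "works at"}],
}
rel_graph = {
    "nodes": {"Acme": "ORG", "Paris": "GPE"},
    "edges": [
        {"source": "Alice", "target": "Acme", "relation": "works at"},
        {"source": "Acme", "target": "Paris", "relation": "based in"},
    ],
}
final_graph = merge_graphs(ner_graph, rel_graph)
final_json = json.dumps(final_graph, indent=2)  # text of the final JSON file
reachable = bfs(final_graph, "Alice")
```

In this sketch the duplicate "works at" edge is dropped during the merge, and the search from "Alice" visits every node connected to it.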
from news_graph import NewsMining
content = 'Input your text here'
Miner = NewsMining()
Miner.main(content)
This will generate graph.html.
Node colors in the graph:
- Red: Location
- Blue: Person
- Green: Organization
- Grey: Other
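One way to express this legend in code is a lookup table with a grey fallback. This is a minimal sketch: the names `TAG_COLORS` and `node_color` are hypothetical, and the choice of which spaCy entity labels count as "Location" (`GPE`, `LOC`) is an assumption rather than something taken from the project.

```python
# Colors come from the legend; the label grouping is an assumption.
TAG_COLORS = {
    "GPE": "red",      # countries, cities, states
    "LOC": "red",      # other locations
    "PERSON": "blue",
    "ORG": "green",
}

def node_color(ner_tag):
    """Return the legend color for an NER tag, falling back to grey."""
    return TAG_COLORS.get(ner_tag, "grey")
```

For example, `node_color("PERSON")` gives `"blue"`, while a tag outside the table, such as `"DATE"`, falls back to `"grey"`.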
The following code initializes the spaCy language model for English:
import spacy
nlp = spacy.load('en_core_web_lg')
The model loaded here is 'en_core_web_lg', which is a large English language model trained on web text data.
The code defines a Python class named NewsMining, which encapsulates the news-mining functionality.
Initializing the NewsMining Class
The constructor method (__init__) initializes the attributes of the NewsMining class.
The code snippet also includes additional methods such as clean_spaces, remove_noisy, and collect_ners, which perform tasks like cleaning text, removing noisy characters, and collecting named entities, respectively.
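The bodies of these helper methods are not shown here, so the following is only a guess at the behavior described, written as standalone functions that reuse the method names. Treating parenthesized asides as "noisy" text and representing entities as `(text, label)` pairs are both assumptions.

```python
import re

def clean_spaces(text):
    """Collapse tabs, newlines, and repeated spaces into single spaces."""
    return re.sub(r"\s+", " ", text).strip()

def remove_noisy(text):
    """Drop parenthesized asides, treated here as noise (an assumption)."""
    return re.sub(r"\([^)]*\)", "", text)

def collect_ners(tagged_entities):
    """Group (text, label) pairs by label, in first-seen order, without duplicates."""
    grouped = {}
    for text, label in tagged_entities:
        bucket = grouped.setdefault(label, [])
        if text not in bucket:
            bucket.append(text)
    return grouped

raw = "Apple  (AAPL)\topened a store\nin Paris."
cleaned = clean_spaces(remove_noisy(raw))
ners = collect_ners([("Apple", "ORG"), ("Paris", "GPE"), ("Apple", "ORG")])
```

Here `cleaned` becomes a single-spaced sentence without the ticker aside, and `ners` lists each entity once under its label.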
Explanation of Python Code for News Mining
Extracting Triples
The extract_triples method takes a sentence as input and returns Subject-Verb-Object (SVO) triples:
def extract_triples(self, sent):
    svo = []
    # Dependency tuples for the sentence, plus each word's children.
    tuples = self.syntax_parse(sent)
    child_dict_list = self.build_parse_chile_dict(sent, tuples)
    for tup in tuples:
        rel = tup[-1]  # the dependency relation is the last element
        if rel in self.SUBJECTS:
            sub_wd = tup[1]
            verb_wd = tup[3]
            # Look up the object attached to this verb, if any.
            obj = self.complete_VOB(verb_wd, child_dict_list)
            subj = sub_wd
            verb = verb_wd.text
            if not obj:
                svo.append([subj, verb])
            else:
                svo.append([subj, verb + ' ' + obj])
    return svo
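To make the loop above concrete, here is a self-contained sketch that runs the same subject-verb-object pairing on hand-written dependency tuples instead of a spaCy parse. The tuple layout `(index, word, head_index, head_word, relation)`, the `SUBJECTS` and `OBJECTS` label sets, and the simple object lookup standing in for `complete_VOB` are all assumptions made for illustration.

```python
SUBJECTS = {"nsubj", "nsubjpass"}  # assumed subject relations
OBJECTS = {"dobj", "attr"}         # assumed object relations

def extract_triples(tuples):
    """Pair each subject with its governing verb and, if present, that verb's object."""
    # Map each head position to the object words that depend on it.
    objects_by_head = {}
    for _, word, head_index, _, rel in tuples:
        if rel in OBJECTS:
            objects_by_head.setdefault(head_index, []).append(word)

    svo = []
    for _, word, head_index, head_word, rel in tuples:
        if rel in SUBJECTS:
            obj = " ".join(objects_by_head.get(head_index, []))
            if not obj:
                svo.append([word, head_word])
            else:
                svo.append([word, head_word + " " + obj])
    return svo

# Hand-written parses of "Google acquired DeepMind" and "Prices rose".
parsed = [
    (1, "Google", 2, "acquired", "nsubj"),
    (2, "acquired", 0, "acquired", "ROOT"),
    (3, "DeepMind", 2, "acquired", "dobj"),
    (4, "Prices", 5, "rose", "nsubj"),
    (5, "rose", 0, "rose", "ROOT"),
]
triples = extract_triples(parsed)
```

The first sentence yields a subject paired with "verb object", while the second, which has no object, yields only the subject and verb, matching the two branches of the original method.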
The extract_keywords method extracts the top 10 keywords from a list of word-postag pairs:
def extract_keywords(self, words_postags):
    return self.textranker.extract_keywords(words_postags, 10)
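The implementation behind `self.textranker` is not shown, but its name suggests the TextRank algorithm. The following is a generic TextRank sketch under that assumption, not the project's code: the function name, the part-of-speech filter, the co-occurrence window, and the damping factor are all illustrative choices.

```python
from collections import defaultdict

def textrank_keywords(words_postags, num_keywords=10, window=2,
                      keep_tags=("NOUN", "PROPN", "ADJ"),
                      damping=0.85, iterations=50):
    """Rank candidate words by PageRank over a co-occurrence graph."""
    # Keep only candidate words, lower-cased.
    words = [w.lower() for w, tag in words_postags if tag in keep_tags]
    # Undirected graph: words within `window` positions of each other are linked.
    neighbours = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1 : i + 1 + window]:
            if other != word:
                neighbours[word].add(other)
                neighbours[other].add(word)
    # Repeated PageRank updates; each word shares its score among its neighbours.
    scores = {word: 1.0 for word in neighbours}
    for _ in range(iterations):
        scores = {
            word: (1 - damping) + damping * sum(
                scores[n] / len(neighbours[n]) for n in neighbours[word]
            )
            for word in neighbours
        }
    # Highest score first; ties are broken alphabetically for stable output.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:num_keywords]

pairs = [("Graph", "NOUN"), ("news", "NOUN"), ("the", "DET"), ("graph", "NOUN"),
         ("entity", "NOUN"), ("graph", "NOUN"), ("relation", "NOUN")]
top = textrank_keywords(pairs, num_keywords=2, window=1)
```

In this sample, "graph" co-occurs with every other candidate word, so it ranks first; "the" is filtered out by its part-of-speech tag before the graph is built.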
The main method is the entry point for news mining: as in the usage example above, it takes the input text and generates graph.html.
The get_events method returns the extracted events and Named Entity Recognition (NER) results:
def get_events(self):
    return self.events, self.result_dict
Instantiating the NewsMining Class
The NewsMining class is instantiated as news_miner:
news_miner = NewsMining()