Vidal data mining is a project based on data scraping and mining using Python and Unitex. The idea of the project is to extract drug (medication) information from the VIDAL website, then match each drug with every possible prescription of it found in a large medical corpus. The corpus is the data file corpus-medical.txt, which contains a history of doctors' visit reports.
Notes:
- Vidal website contains information about medications and parapharmacy products.
- The project targets the French language.
Vidal website:
Content of corpus-medical.txt:
- Execute the Python script scrapper.py to extract drug substances from the VIDAL website.
  - Usage: scrapper.py letter1-letter2
  - The letter1-letter2 argument represents the range of letters to scrape, for example 'A-Z'.
  - The script generates two files, subst.dic and info.txt:
    - subst.dic contains all extracted substances, with the Unitex dictionary suffix added to match the Unitex .dic dictionary format.
    - info.txt contains extraction statistics: the number of substances for each letter and the total number of extracted substances.
- Execute the Python script enrch.py to enrich the collected substance dictionary. The script scrapes new substances from the file corpus-medical.txt and adds them to a new dictionary, subst_enri.dic. It also deletes duplicate entries and sorts the substances in both files, subst.dic and subst_enri.dic.
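The two scripting steps above can be sketched as follows. This is only an illustration, not the project's actual code: the function names are hypothetical, the suffix `,.N+subst` is an assumed example of a Unitex dictionary entry suffix, and lowercasing the substance names is an assumption made to simplify duplicate removal.

```python
import string
from collections import Counter

# Assumed Unitex entry suffix; the real scripts may use a different code.
DIC_SUFFIX = ",.N+subst"


def parse_letter_range(arg):
    """Turn an argument such as 'A-C' into ['A', 'B', 'C']."""
    first, last = (part.strip().upper() for part in arg.split("-"))
    letters = string.ascii_uppercase
    if len(first) != 1 or len(last) != 1:
        raise ValueError("expected single letters, e.g. 'A-Z'")
    start, end = letters.index(first), letters.index(last)
    if start > end:
        raise ValueError("range must be ascending, e.g. 'A-Z'")
    return list(letters[start:end + 1])


def normalize(substances):
    """Strip whitespace, lowercase, and drop empty or duplicate names."""
    return {s.strip().lower() for s in substances if s.strip()}


def to_dic_entries(substances):
    """Return sorted, de-duplicated dictionary lines with the suffix added."""
    return [name + DIC_SUFFIX for name in sorted(normalize(substances))]


def letter_stats(substances):
    """Return statistics lines: one count per first letter, then the total."""
    cleaned = normalize(substances)
    counts = Counter(name[0].upper() for name in cleaned)
    lines = [f"{letter}: {counts[letter]}" for letter in sorted(counts)]
    lines.append(f"Total: {len(cleaned)}")
    return lines
```

Note that Unitex may expect dictionary files in a specific encoding (UTF-16 LE in many setups), so the encoding used when writing the .dic files should match the Unitex preferences.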
- Open Unitex and select French as the language.
- Move the files subst_enri.dic and subst.dic to the Unitex Dela folder, located in the user's Documents folder.
  - Example of my path: D:\Users\Asus\Documents\Unitex-GramLab\Unitex\French\Dela
- Apply preprocessing and lexical parsing to corpus-medical.txt.
- Open subst_enri.dic in DELA and compress the dictionary into FST. Two files, subst_enri.bin and subst_enri.inf, should be generated in the Dela folder.
- Apply the same steps to subst.dic.
- Open projetpy.grf in FSGraph to visualize the extraction graphs.
  - projetpy.grf is the main graph. It consists of 3 subgraphs (3 possible matches): projetpy1.grf, projetpy2.grf and projetpy3.grf.
  - Notes:
    - <n+subst> matches a dictionary word. In our case, it is the drug name scraped earlier (subst_enri.dic and subst.dic).
    - <MOT> matches a word, like \w in regular expressions.
    - <NB> matches a number, like \d in regular expressions.
    - For more information on the graph syntax, please refer to the Unitex documentation.
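To make the regular-expression analogy in the notes above concrete, here is a rough Python sketch. It does not reproduce the actual projetpy graphs: the pattern "drug name, then a number, then a word" (for example a dose and its unit) is a hypothetical path chosen for illustration, and the function name is an assumption.

```python
import re


def find_prescriptions(text, substances):
    """Find 'drug number word' sequences, e.g. 'doliprane 500 mg'.

    Rough analogy with the Unitex masks:
      <n+subst> -> alternation over the dictionary entries
      <NB>      -> \\d+
      <MOT>     -> \\w+
    """
    if not substances:
        return []
    # Longest names first so that longer entries win over their prefixes.
    names = "|".join(
        re.escape(s) for s in sorted(substances, key=len, reverse=True)
    )
    pattern = re.compile(
        rf"\b(?P<drug>{names})\s+(?P<dose>\d+)\s+(?P<unit>\w+)",
        re.IGNORECASE,
    )
    return [
        (m.group("drug"), m.group("dose"), m.group("unit"))
        for m in pattern.finditer(text)
    ]
```

Unlike this sketch, Unitex works on a tokenized text and looks words up in the compiled dictionaries, so the real graphs can match inflected forms and multi-word entries that a plain alternation would miss.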
- Apply lexical resources to the previously preprocessed text.
  - Select subst_enri.bin and subst.bin in user resources and dela.fr in system resources.
- The final step consists of locating patterns and building concordances:
  - Choose Locate Pattern.
  - Select the projetpy.grf graph.
  - Select all_matches and merge with output text.
  - Index all occurrences in text.
  - Build the concordance to visualize the results.

The results are stored in the corpus-medical_snt\concord.html file, located in the same folder as corpus-medical.txt. Use a web browser for better formatting.