Performance Drop Issues (AIDA test A) #3
Comments
The ARGMAX results represent the "Local Mention" prior, and they should be much higher (cf. Table 3 in our paper). Which p(e|m) indexes do you use? Are you using the ones that we provided, which can be found here: https://polybox.ethz.ch/index.php/s/IOWjGrU3mjyzDSV/authenticate ? It seems there is a big overlap between PBOH and LocalMention (the "common Loopy - ARGMAX" part). I will try to re-run it tonight on a fresh machine if you still cannot solve this issue. Can you please send me your full output log file by e-mail?
Hi, thanks for your fast response. I have not changed the method or the index itself. All I changed was updating the index paths in the code, adding UTF-8 encoding when using Source.fromFile(), and changing the AIDA dataset name (the one given in AIDA.scala is "testa_testb_aggregate", but I didn't find a file with this name, so I used the output file from "aida-yago2-dataset.jar"). Also, I only ran AIDA test A and skipped all the other datasets to save time. I am going to run it again to see if the result is the same. If so, I will send you the output log.
I am not sure what the output file from "aida-yago2-dataset.jar" is, but your testa_testb_aggregate should contain the AIDA-A and AIDA-B datasets and be generated as described on the MPI website. It should look as follows (sorry, it has a license from MPI and I cannot upload the full file myself). One word per line, with tab-separated annotations when the word is part of a mention:
My dataset does have these lines, so the dataset should be fine.
Something is clearly wrong with the p(e|m) index that you use. "perc missing mentions from index : 14.97" is the percentage of mentions m that are not found in the dictionary, while "perc missing entities from mention index : 17.02" is the percentage of gold entities that do not appear in the respective mention entry. Together, these should be less than 5%. Looking at your log file, I see that even common names like "kurdish", "tunisia" or "boston" are missing. Can you please check whether they appear in your p(e|m) file (called mek-top-freq-crosswikis-plus-wikipedia-lowercase-top64.txt, which should be constructed as a concatenation of the 2 files mek-top-freq-crosswikis-plus-wikipedia-lowercase-top64.txt.part_a*)?
This should give a non-empty output:
namely:
You need to create a new file containing the contents of both files mek-top-freq-crosswikis-plus-wikipedia-lowercase-top64.txt.part_a*, and update its path here: https://github.com/dalab/pboh-entity-linking/blob/master/src/main/scala/index/AllIndexesBox.scala#L19 . Similarly, you need to update the paths of all other index files listed in the same Scala file. Let me know if it works.
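The concatenation and lookup described above can be sketched in Python as follows. This is a minimal illustration, not part of the PBoH code base: the function names `concatenate_parts` and `find_missing_mentions` are hypothetical, and the lookup assumes that each line of the p(e|m) file starts with the mention followed by a tab, which should be checked against the actual file.

```python
import glob
import shutil


def concatenate_parts(part_pattern, out_path):
    """Concatenate all files matching part_pattern, in sorted order, into out_path."""
    parts = sorted(glob.glob(part_pattern))
    if not parts:
        raise FileNotFoundError(f"no files match {part_pattern}")
    with open(out_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Stream the bytes so large index parts are not loaded into memory.
                shutil.copyfileobj(src, out)
    return parts


def find_missing_mentions(index_path, mentions):
    """Return the mentions that never occur as the first tab-separated field.

    ASSUMPTION: the mention is the first tab-separated field of each line.
    """
    wanted = set(mentions)
    found = set()
    with open(index_path, encoding="utf-8") as f:
        for line in f:
            key = line.split("\t", 1)[0].strip()
            if key in wanted:
                found.add(key)
    return sorted(wanted - found)
```

For example, after calling `concatenate_parts` with the `.part_a*` pattern and the target `.txt` path, `find_missing_mentions(path, ["kurdish", "tunisia", "boston"])` would be expected to return an empty list if the index is complete, under the format assumption stated above.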
Thank you so much. It was hard for me to pinpoint the problem. I will check the indexes.
According to the paper, the performance of PBoH on AIDA test A is 86.63/85.48. Due to the upgrade of Gerbil, the performance of PBoH given here is 75.19/73.3.
However, when I try to reproduce the result, I get 64.84/64.32.
1. I used the index files from polybox. The locations of the index files are updated.
2. I changed the dataset path from
val file = "/media/hofmann-scratch/Octavian/entity_linking/marinah/AIDA/testa_testb_aggregate"
to "AIDA-YAGO2-dataset.tsv", which is generated from files downloaded from MPI-info.
3. I use
to run the code because the command
scala -J-Xmx90g target/PBoH-1.0-SNAPSHOT-jar-with-dependencies.jar testPBOHOnAllDatasets max-product
Did I make any mistakes in the process? How can I reproduce the result in Gerbil?
Thanks.