
Identical score if only one of two identifiers match #89

Open
wetneb opened this issue Sep 17, 2020 · 7 comments

wetneb (Owner) commented Sep 17, 2020

Originally posted by @hroest at OpenRefine/OpenRefine#3191:

When reconciling with Wikidata, for example, a candidate will often get a score of 100% if only the name (the Lemma) matches the item's label or one of its aliases. This is the case even if a second column is provided that can be used as an external identifier. The score will still be 100 even if that second identifier does not match, and even when there is another candidate where both columns match, basically making the single-column match indistinguishable from the two-column match.

Maybe I am doing this wrong, I would appreciate some help.

Proposed solution

The score should be higher for the match where both the Lemma name and the identifier match and lower for those that do not match.

Alternatives considered

  • there are none; I have to go through all Lemmata by hand to find the right one

Additional context

I am working with data where there are often items that have the same name and need a secondary identifier to be distinguished.

An example:

Lemma: Teufen
External Identifier: 007804
External Identifier used: HLS https://www.wikidata.org/wiki/Property:P902
Correct match: https://www.wikidata.org/wiki/Q67209

Here is how I do the reconciliation:
[screenshot: reconciliation setup]
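For reference, the underlying reconciliation query that such a two-column setup sends looks roughly like the following (a minimal sketch based on the query format visible in the raw API links later in this thread; "q0" is just an arbitrary query key chosen by the client):

```python
import json

# Minimal sketch of the reconciliation query produced by a two-column setup
# like the one above: the Lemma column becomes the query string, the HLS
# identifier column becomes a property constraint on P902.
queries = {
    "q0": {
        "query": "Teufen",                               # Lemma column
        "properties": [{"pid": "P902", "v": "007804"}],  # HLS identifier column
    }
}
print(json.dumps(queries, indent=2))
```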

Here is the result:
[screenshot: reconciliation results]

What I expect: the correct match https://www.wikidata.org/wiki/Q67209, where both the Lemma and the external identifier P902 match, should receive the highest score.

What happens instead:

There are 5 hits that all have a match score of 100.

wetneb (Owner, Author) commented Sep 17, 2020

What happens here is that there is no item with the requested identifier value:
https://www.wikidata.org/w/index.php?sort=relevance&search=haswbstatement%3AP902%3D007084&title=Special:Search&profile=advanced&fulltext=1&advancedSearch-current=%7B%7D&ns0=1&ns120=1
The item that you consider the correct match has a different value: 001303

Therefore, the reconciliation service first searches for items with this identifier, finds none, and so falls back to a normal search without the identifier. The first five items have the exact requested string as label or alias, hence they all get the maximum score.
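In other words, the behaviour is roughly the following (a simplified sketch using Wikidata's public search API rather than the service's actual code; the haswbstatement query is the same kind of search as the link above):

```python
import requests

API = "https://www.wikidata.org/w/api.php"

def search(query):
    # Plain full-text search over Wikidata items.
    params = {"action": "query", "list": "search",
              "srsearch": query, "format": "json"}
    return requests.get(API, params=params).json()["query"]["search"]

# Step 1: look for items carrying the requested identifier.
hits = search("haswbstatement:P902=007804")
if not hits:
    # Step 2: no item has this P902 value, so fall back to a name-only
    # search; every exact label/alias match then gets the maximum score.
    hits = search("Teufen")

for hit in hits[:5]:
    print(hit["title"])
```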

It could indeed make sense to give them all less than the maximum score (because they all lack the P902=007804 claim).

At the moment I am reluctant to make changes to the scoring mechanism - not because I think it is flawless, just because I think there is value in stability of that mechanism. In the future, I would like reconciliation clients to be able to rely on more granular scores (identifier match, label match) rather than a single score whose computation is quite opaque:
OpenRefine/OpenRefine#3139

thadguidry (Contributor) commented Sep 17, 2020

@wetneb If this helps: in Freebase, I remember that a single column added as a disambiguator would drop the score by 50% if it didn't match. I don't remember the algorithm for additional disambiguator columns, but the first column added would drop it from 100 to 50, for instance. Maybe Andy used 5 or 10% drops for additional columns; videos around the internet of it being demonstrated might help clue us in to the approximate algorithm. It's possible Tom might remember about additional columns; I only recall the first disambiguator column added and its percentage drop on a non-match.
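Purely as an illustration of that kind of scheme (invented numbers and a hypothetical helper, not the actual Freebase algorithm), the penalty could be applied along these lines:

```python
def adjusted_score(base_score, property_matches):
    """Illustrative disambiguator penalty: the first non-matching extra
    column halves the score, each further non-matching column removes
    another 10%. Invented numbers, not the actual Freebase algorithm."""
    penalties = [0.5] + [0.9] * (len(property_matches) - 1)
    score = base_score
    for matched, penalty in zip(property_matches, penalties):
        if not matched:
            score *= penalty
    return score

print(adjusted_score(100, [False]))         # name-only match: 100 -> 50.0
print(adjusted_score(100, [True, False]))   # second column misses: 100 -> 90.0
```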

Agree on giving power to the user to apply their own weighting through smarter clients including OpenRefine.

tfmorris commented:

It could indeed make sense to give them all less than the maximum score (because they all lack the P902=007804 claim).

I think this absolutely makes sense and is what the users would expect, particularly if 100 is meant to convey "perfect match."

At the moment I am reluctant to make changes to the scoring mechanism - not because I think it is flawless, just because I think there is value in stability of that mechanism.

I would favor continuous improvement over stability. The requested behavior sounds like a clear improvement to me.

wetneb (Owner, Author) commented Sep 17, 2020

The problem with scoring tweaks is that they generally sound very reasonable when looking at a particular use case, but it is hard to ensure they do not affect other legitimate use cases in a detrimental way. So the risk in starting down this path is being drawn into a series of follow-up fixes to cater to the needs of whoever next reports a regression in their own workflow.

So, personally, if I had time to dedicate to this I would rather make progress on OpenRefine/OpenRefine#3139 than on this issue, because I fear the downstream consequences of tweaking a scoring mechanism that has been stable for a long time, and I do not believe in a one-score-fits-all paradigm.

Once people can rely on individual scoring features in their reconciliation workflows, then it might become easier to tweak the global score (with the understanding that people should instead rely on the individual features if they care about reproducibility).

But that should not prevent people from running their own versions of the service with modified scoring mechanisms (locally or as a publicly hosted instance).

tfmorris commented:

I guess we'll have to hope that Wikidata fixes it in their production reconciliation service then, but that could be a very long wait.

If someone does host a service that fixes this before then, I'd favor making that service the bundled OpenRefine service.

hroest commented Sep 18, 2020

Therefore, the reconciliation service first searches for items with this identifier, finds none, and so falls back to a normal search without the identifier. The first five items have the exact requested string as label or alias, hence they all get the maximum score.

Hmm, it seems like there are two types of scoring going on: one by identifier and one by name. It would be great if there were greater transparency, so the user could see which mechanism was used and could potentially sort by which column actually matched (identifier or name).

wetneb (Owner, Author) commented Sep 19, 2020

That is exactly what OpenRefine/OpenRefine#3139 is about.

If you look at the raw response from the service, you will see two "features" in the candidate, each indicating whether the name or the identifier matches:
https://wikidata.reconci.link/en/api?queries=%7B%22q0%22%3A%7B%22query%22%3A%22Teufen%22%2C%22properties%22%3A%5B%7B%22pid%22%3A%22P902%22%2C%22v%22%3A%22does_not_exist%22%7D%5D%7D%7D
(in the future I plan to add more of these features).
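For instance, a quick way to inspect those features (a minimal sketch fetching the same query as the link above; "result", "score" and "features" follow the reconciliation API response shape, and the individual feature ids are printed exactly as the service returns them):

```python
import json
import requests

# Fetch the query linked above and print each candidate's per-feature
# scores next to its single global score.
queries = {"q0": {"query": "Teufen",
                  "properties": [{"pid": "P902", "v": "does_not_exist"}]}}
response = requests.get("https://wikidata.reconci.link/en/api",
                        params={"queries": json.dumps(queries)})
for candidate in response.json()["q0"]["result"]:
    print(candidate["id"], candidate["name"], candidate["score"],
          candidate.get("features"))
```

A client could then sort or filter candidates on these per-feature values instead of the single global score, which is essentially what OpenRefine/OpenRefine#3139 asks for.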

I would like to make these scores available in OpenRefine itself (for now it ignores these values returned by the service).
