Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: Jaccard UDF giving incorrect results in some cases #383

Open
j-svensmark opened this issue Sep 10, 2023 · 1 comment · May be fixed by #417
Open

Bug: Jaccard UDF giving incorrect results in some cases #383

j-svensmark opened this issue Sep 10, 2023 · 1 comment · May be fixed by #417
Assignees
Labels
bug Something isn't working

Comments

@j-svensmark
Copy link

I found a bug in the the Jaccard distance function https://github.com/GoogleCloudPlatform/bigquery-utils/blob/master/udfs/community/jaccard.sqlx (the function below is a copy of the Jaccard UDF, cast as a TEMP FUNCTION)

CREATE TEMP FUNCTION Jaccard(a STRING, b STRING)
RETURNS FLOAT64
LANGUAGE js AS r"""
  let intersectSize = 0;

  if(a && b) {
    // Split word into characters, and make characters unique
    const sa = [...new Set(String(a).split(''))]
    const sb = [...new Set(String(b).split(''))]

    const la = (sa.length > sb.length) ? sa.length : sb.length
    const lb = (sa.length > sb.length) ? sb.length : sa.length

    for(let i = 0; i < la; i++) {
      for(let j = 0; j < lb; j++) {
        intersectSize += (sa[i] === sb[j]) ? 1 : 0
      }
    }

    // Compute Jaccard distance
    const union = (sa.length + sb.length) - intersectSize
    return parseFloat(((intersectSize / union) * 100 + Number.EPSILON) / 100).toFixed(2)
  } else {
    return 0.0
  }
""";

with datas as (
select 
'12' as a,
'132' as b
)

select
 Jaccard(a, b)
from datas

This should give 2/3 = 0.667 (since here there the intersection is size 2, and the union is size 3), but instead gives 0.25.

I believe the bug can be fixed by replacing

    const la = (sa.length > sb.length) ? sa.length : sb.length
    const lb = (sa.length > sb.length) ? sb.length : sa.length

with

    const la = sa.length
    const lb = sb.length

With the current version of these lengths, the array indices become out of bounds.

@j-svensmark
Copy link
Author

j-svensmark commented Sep 10, 2023

It might be possible to further optimize and generalize this function to arrays of any types, rather than just strings by casting it in SQL

create temp function `Jaccard`(a ANY TYPE, b ANY TYPE)
returns FLOAT64 as (
  (select count(distinct agrp) from unnest(a) as agrp inner join unnest(b) as bgrp on agrp = bgrp)/
  (select count(1) from (select * from unnest(a) union distinct select * from unnest(b)))
);

with datas as (
select 
['1', '2'] as a,
['1', '3', '2'] as b,
[1, 2] as a_int,
[1, 3, 2] as b_int
)

select
 Jaccard(a, b),
 Jaccard(a_int, b_int)
from datas

(See https://stackoverflow.com/a/36705483)

@afleisc afleisc self-assigned this May 21, 2024
@afleisc afleisc linked a pull request May 29, 2024 that will close this issue
@afleisc afleisc added the bug Something isn't working label May 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants