Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cosine and jaro on perfect matches #22

Open
jleblay opened this issue Dec 6, 2018 · 1 comment
Open

cosine and jaro on perfect matches #22

jleblay opened this issue Dec 6, 2018 · 1 comment

Comments

@jleblay
Copy link

jleblay commented Dec 6, 2018

Hi,
I am starting to use this library and came across those potential glitches.
I would expect cosine, jaro and jarowinkler to return exactly 1.0 on equal strings, but I get the following (jarowinkler omitted, but shows the same behaviour as jaro on this):

server_encoding                         | UTF8 
client_encoding                         | UTF8
pg_similarity.cosine_is_normalized      | on
pg_similarity.jaro_is_normalized        | on

cl_generated_1000=# select version();
                                                   version                                                    
--------------------------------------------------------------------------------------------------------------
 PostgreSQL 9.6.9 on x86_64-apple-darwin14.5.0, compiled by Apple LLVM version 7.0.0 (clang-700.1.76), 64-bit
(1 row)
# set pg_similarity.cosine_tokenizer = 'alnum';
SET
# set pg_similarity.jaro_tokenizer = 'alnum';
SET
# select cosine('Michael C', 'Michael C') < 1.,
         cosine('Brésil', 'Brésil') < 1.,
         jaro('Stratégie Internationale', 'Stratégie Internationale') < 1.,
         jaro('http://example.org/Annecy', 'http://example.org/Annecy') < 1.;
 ?column? | ?column? | ?column? | ?column? 
----------+----------+----------+----------
 t        | t        | t        | t
(1 row)

# set pg_similarity.cosine_tokenizer = 'word';
SET
# set pg_similarity.jaro_tokenizer = 'word';
SET
# select cosine('Michael C', 'Michael C') < 1.,
         cosine('Brésil', 'Brésil') < 1.,
         jaro('Stratégie Internationale', 'Stratégie Internationale') < 1.,
         jaro('http://example.org/Annecy', 'http://example.org/Annecy') < 1.;
 ?column? | ?column? | ?column? | ?column? 
----------+----------+----------+----------
 t        | f        | t        | t
(1 row)

# set pg_similarity.cosine_tokenizer = 'gram';
SET
# set pg_similarity.jaro_tokenizer = 'gram';
SET
# select cosine('Michael C', 'Michael C') < 1.,
         cosine('Brésil', 'Brésil') < 1.,
         jaro('Stratégie Internationale', 'Stratégie Internationale') < 1.,
         jaro('http://example.org/Annecy', 'http://example.org/Annecy') < 1.;
 ?column? | ?column? | ?column? | ?column? 
----------+----------+----------+----------
 f        | f        | t        | t
(1 row)

# set pg_similarity.cosine_tokenizer = 'camelcase';
SET
# set pg_similarity.jaro_tokenizer = 'camelcase';
SET
# select cosine('Michael C', 'Michael C') < 1.,
         cosine('Brésil', 'Brésil') < 1.,
         jaro('Stratégie Internationale', 'Stratégie Internationale') < 1.,
         jaro('http://example.org/Annecy', 'http://example.org/Annecy') < 1.;
 ?column? | ?column? | ?column? | ?column? 
----------+----------+----------+----------
 t        | f        | t        | t
(1 row)
@eulerto
Copy link
Owner

eulerto commented Feb 12, 2019

All pg_similarity functions return double precision. double precision is inexact [1]. Use a cast like:

select cast(cosine('Michael C', 'Michael C') as numeric(3,2)) < 1.0;

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants