normalize and canonize names #2

Open
ppKrauss opened this issue Mar 23, 2017 · 0 comments

Rules to normalize:

  • same name: names with the same ASCII representation, as returned by the unaccent(name) function.
  • similar name: names with the same Metaphone-pt code.
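
A minimal sketch of the "same name" rule, assuming a Python-side approximation of unaccent(name). It only strips combining marks via Unicode NFD decomposition, so it may differ from the database function for characters that do not decompose. The helper group_by_ascii is hypothetical, and the Metaphone-pt rule is not covered here.

```python
import unicodedata
from collections import defaultdict


def unaccent(name: str) -> str:
    """Approximate unaccent(): decompose to NFD and drop combining marks."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def group_by_ascii(names):
    """Group name variants that share the same unaccented representation."""
    groups = defaultdict(list)
    for name in names:
        groups[unaccent(name)].append(name)
    return dict(groups)


if __name__ == "__main__":
    variants = [
        "ANTONIO SETUBAL SILVESTRE",
        "ANTÔNIO SETUBAL SILVESTRE",
        "ANTONIO SETÚBAL SILVESTRE",
        "ANTÔNIO SETÚBAL SILVESTRE",
    ]
    for key, members in group_by_ascii(variants).items():
        print(key, len(members))
```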

Rules to canonize (choosing the official name):

  1. WHEN same CPF and similar name (or same first names)
    1.1. check the official name with a CPF-resolver
    1.2. (if not possible) use the most recent

  2. WHEN no CPF and same name (and same birthDate)
    2.1. use the most recent when there is more than 6 years of difference
    2.2. (otherwise) use the "most accented" version or the "most standard pt-BR" one (e.g. prefer i instead of y)
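
Rule 2 could be sketched as below. This is one possible reading, not a settled design: it assumes the "6 years" gap is measured between the most recent record and the most accented one, that "most accented" means the highest count of diacritics, and that ties go to the newer record. The Record type and pick_canonical are hypothetical names, and the "most standard pt-BR" preference (i over y) is left out.

```python
import unicodedata
from dataclasses import dataclass


@dataclass
class Record:
    name: str
    year: int  # assumed to be the year at the end of the source label


def accent_count(name: str) -> int:
    """Count combining marks after NFD decomposition."""
    return sum(1 for ch in unicodedata.normalize("NFD", name)
               if unicodedata.combining(ch))


def pick_canonical(records, max_gap_years=6):
    """Apply rule 2.1, falling back to rule 2.2 (assumed interpretation)."""
    most_recent = max(records, key=lambda r: r.year)
    # Ties on accent count are broken by the newer record.
    most_accented = max(records, key=lambda r: (accent_count(r.name), r.year))
    if most_recent.year - most_accented.year > max_gap_years:
        return most_recent   # rule 2.1
    return most_accented     # rule 2.2


if __name__ == "__main__":
    records = [
        Record("ANTONIO SETUBAL SILVESTRE", 2016),
        Record("ANTÔNIO SETUBAL SILVESTRE", 2008),
        Record("ANTONIO SETÚBAL SILVESTRE", 2004),
        Record("ANTÔNIO SETÚBAL SILVESTRE", 2012),
    ]
    print(pick_canonical(records).name)
```

Under these assumptions the 2012 spelling wins for the records above, because the 2016 record is only 4 years newer than the most accented one.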

... Log a notice message for each conflict resolution ...

Example:

               name               | birthdate  |           source         
----------------------------------+------------+---------------------------
 ANTONIO SETUBAL SILVESTRE        | 1963-01-04 | br:tse;ce:candidatos:2016
 ANTÔNIO SETUBAL SILVESTRE        | 1963-01-04 | br:tse;ce:candidatos:2008
 ANTONIO SETÚBAL SILVESTRE        | 1963-01-04 | br:tse;ce:candidatos:2004
 ANTÔNIO SETÚBAL SILVESTRE        | 1963-01-04 | br:tse;ce:candidatos:2012
 FABRICIO JOSE SATIRO DE OLIVEIRA | 1975-07-01 | br:tse;sc:candidatos:2010
 FABRICIO JOSÉ SATIRO DE OLIVEIRA | 1975-07-01 | br:tse;sc:candidatos:2004
 FABRÍCIO JOSÉ SATIRO DE OLIVEIRA | 1975-07-01 | br:tse;sc:candidatos:2000
 FABRICIO JOSÉ SÁTIRO DE OLIVEIRA | 1975-07-01 | br:tse;sc:candidatos:2008
 FABRÍCIO JOSÉ SÁTIRO DE OLIVEIRA | 1975-07-01 | br:tse;sc:candidatos:2012

Most accented: "ANTÔNIO SETÚBAL SILVESTRE" (2012); most recent: "ANTONIO SETUBAL SILVESTRE" (2016)...

After canonization, delete the records and register all variants in the info JSON:

{
  "etc": "etc",
  "synonymous": [
    {"name": "ANTONIO SETUBAL SILVESTRE", "source": 123},
    {"name": "ANTÔNIO SETUBAL SILVESTRE", "source": 456}
  ]
}
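
Building that info JSON could look like the sketch below. The build_info helper is hypothetical, and so is the choice to leave the canonical spelling out of the list; only the "synonymous" key and the name/source fields come from the example above.

```python
import json


def build_info(canonical_name, variants, extra=None):
    """Return an info dict listing every non-canonical variant.

    variants: iterable of (name, source_id) pairs.
    extra: other info fields to keep alongside the variant list.
    """
    info = dict(extra or {})
    info["synonymous"] = [
        {"name": name, "source": source}
        for name, source in variants
        if name != canonical_name  # assumption: canonical name is not repeated
    ]
    return info


if __name__ == "__main__":
    info = build_info(
        "ANTÔNIO SETÚBAL SILVESTRE",
        [
            ("ANTONIO SETUBAL SILVESTRE", 123),
            ("ANTÔNIO SETUBAL SILVESTRE", 456),
            ("ANTÔNIO SETÚBAL SILVESTRE", 789),
        ],
        extra={"etc": "etc"},
    )
    # ensure_ascii=False keeps the accented characters readable in the output.
    print(json.dumps(info, ensure_ascii=False, indent=2))
```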