difference between java implementation and python #47

Open
idanmoradarthas opened this issue Nov 25, 2018 · 8 comments


idanmoradarthas commented Nov 25, 2018

I have attached my_set.zip.

When I run the Java code:

import com.tdunning.math.stats.TDigest;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;

import java.io.*;
import java.util.stream.StreamSupport;

public class TDigestTry {
    public static void main(String[] args) throws IOException {
        ClassLoader classLoader = TDigestTry.class.getClassLoader();
        File file = new File(classLoader.getResource("my_set.csv").getFile());
        Reader in = new FileReader(file);
        Iterable<CSVRecord> records = CSVFormat.EXCEL.withHeader().parse(in);
        TDigest digest = TDigest.createAvlTreeDigest(20);
        StreamSupport.stream(records.spliterator(), false)
                .map(record -> Double.parseDouble(record.get("change rate")))
                .forEach(digest::add);
        System.out.println(digest.quantile(0.05));
        System.out.println(digest.quantile(0.95));
    }
}

I'm getting the results:

3.0
5.0

But when I run this Python code:

from pathlib import Path

import pandas
from tdigest import TDigest

if __name__ == '__main__':
    frame = pandas.read_csv(Path(__file__).parents[0].joinpath("resources").joinpath("my_set.csv"))
    digest = TDigest()
    digest.batch_update(frame["change rate"].values)
    print(f"Quantile 0.05 = {digest.percentile(5)};\t\tQuantile 0.95 = {digest.percentile(95)}")

I'm getting the results:

Quantile 0.05 = 2.6495903059149586;		Quantile 0.95 = 3689686.790917569

How come there's such a large difference between the 0.95 quantiles?

P.S. I get the same results when I use:

for value in frame["change rate"].values:
    digest.update(value)
CamDavidsonPilon (Owner) commented:

Looks strange - are you able to share the data?

idanmoradarthas (Author) commented:

It's attached in the second line of my original comment, or at this link:

https://github.com/CamDavidsonPilon/tdigest/files/2613099/my_set.zip


phucbui95 commented Jan 4, 2019

Hi, I have the same issue here:

from tdigest import TDigest

t = TDigest()
t.batch_update([1649.0, 69.0, 69.0, 69.0, 69.0, 69.0, 69.0, 69.0])
print(t.percentile(25))

The result was -523.5.
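For reference, a sketch of the exact answer using NumPy's default linear-interpolation percentile (an assumption about the intended ground truth, not part of the original report) gives 69.0, which makes -523.5 clearly wrong:

```python
import numpy as np

# Seven of the eight values are 69.0, so every percentile at or below
# p87.5 falls between two 69.0 order statistics and must equal 69.0.
data = [1649.0, 69.0, 69.0, 69.0, 69.0, 69.0, 69.0, 69.0]
print(np.percentile(data, 25))  # 69.0
```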


rlele5 commented Nov 26, 2020

Hi, I noticed this too. Thanks for everything you've done! In the cases it does work, it seems to be doing a great job! However, with this simple example posted above I'm now a little worried. Do you know when this error case occurs? Is this something inherent to TDigests? When should we watch out for something like this happening?

rlele5 mentioned this issue Nov 27, 2020

rlele5 commented Nov 27, 2020

@CamDavidsonPilon , @idanmoradarthas , @phucbui95 - I found where this regression was introduced:

commit 1f76113e3fce0d5b7189fd083085ba64d4129e05 (HEAD)
Author: Cameron Davidson-Pilon <[email protected]>
Date:   Tue Dec 26 23:04:33 2017 -0500

    bump to v0.5, drop python2 support

The commit before returns the correct results:

Author: Cameron Davidson-Pilon <[email protected]>
Date:   Mon Feb 20 11:22:54 2017 -0500

    test against 3.6

Fix: roll the implementation of def percentile(self, p) back to the implementation in 9cd2536. @CamDavidsonPilon, are you able to explain this further? I still have to run your test cases to confirm, but this seems promising from some manual tests.


Examples:
Code:

from tdigest import TDigest
t = TDigest()
t.batch_update([1649.0, 69.0, 69.0, 69.0, 69.0, 69.0, 69.0, 69.0])

for percent in [0, 25, 50, 75, 100]:
    value = t.percentile(percent)
    print(f"p{percent} = {value}")

Output with 1f76113 (broken):

p0 = 69.0
p25 = -1906.0
p50 = -1116.0
p75 = -326.0
p100 = 464.0

Output with 9cd2536 (working):

p0 = 69.0
p25 = 69.0
p50 = 69.0
p75 = 69.0
p100 = 1649.0
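As a sanity check (taking NumPy's default linear-interpolation percentile as the ground truth here, which is an assumption), the exact percentiles agree with the 9cd2536 output:

```python
import numpy as np

data = [1649.0, 69.0, 69.0, 69.0, 69.0, 69.0, 69.0, 69.0]
# With seven of eight values equal to 69.0, only percentiles above p87.5
# are pulled toward the outlier, and p100 reaches it exactly.
for percent in [0, 25, 50, 75, 100]:
    print(f"p{percent} = {np.percentile(data, percent)}")
```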


rlele5 commented Nov 27, 2020

Addendum - this may not be as simple as it seems, but it may be a step in the right direction. I further compared against a test case (bigJump) in the reference implementation (https://github.com/tdunning/t-digest/blob/ff3232bc25a69961fc7bf4911f8de0026bd28c44/core/src/test/java/com/tdunning/math/stats/TDigestTest.java).

Here is the Java test - notice the assertions, then compare its values with those from the working Python implementation (9cd2536):

        TDigest digest = factory(100).create();
        for (int i = 1; i < 20; i++) {
            digest.add(i);
        }
        digest.add(1_000_000);

        assertEquals(18, digest.quantile(0.89999999), 0);
        assertEquals(19, digest.quantile(0.9), 0);
        assertEquals(19, digest.quantile(0.949999999), 0);
        assertEquals(1_000_000, digest.quantile(0.95), 0);

Working python implementation results (9cd2536):
Code:

from tdigest import TDigest

l = [i for i in range(20)]
l.append(1000000)
print(l)
t = TDigest()
t.batch_update(l)

for percent in [89.999999, 90, 94.9999999, 95,]:
    value = t.percentile(percent)
    print(f"p{percent} = {value}")

Output:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 1000000]
p89.999999 = 18.39999979
p90 = 18.400000000000002
p94.9999999 = 225014.93950018956
p95 = 225014.94999999963

Notice that the p94.9999999 and p95 results are still quite different from the reference implementation. This will probably only happen in extreme scenarios, but I thought I'd point it out. Perhaps the reference implementation has some other conditions that handle these extremities. I don't think the python implementation is wrong, but this just may be one of the accuracy tradeoffs made with TDigest (just a guess).
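For comparison, here is a sketch of what plain linear-interpolation percentiles (NumPy's default, which is a different quantile definition than t-digest's) give for the same data. Neither digest output matches this exactly, which is consistent with the guess above that the tail behaviour reflects t-digest's own quantile definition rather than a simple bug:

```python
import numpy as np

# 21 values: 0..19 plus one extreme outlier.
data = list(range(20)) + [1000000]
# With linear interpolation over 21 points, only percentiles above p95
# interpolate toward the outlier; p95 itself lands exactly on 19.
for percent in [89.999999, 90, 94.9999999, 95]:
    print(f"p{percent} = {np.percentile(data, percent)}")
```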

CamDavidsonPilon (Owner) commented:

@rlele5 unfortunately I'm not involved much in this repo anymore - if you suggest a fix, with an appropriate test, I'd be happy to review and merge it.


rlele5 commented Nov 27, 2020

@CamDavidsonPilon , sounds good. I'll have to go back and understand why the change was made in the first place and will see if there are any further changes needed.
