difference between java implementation and python #47

Open
idanmoradarthas opened this issue Nov 25, 2018 · 8 comments


idanmoradarthas commented Nov 25, 2018

I have attached my_set.zip.

When I run the Java code:

import com.tdunning.math.stats.TDigest;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;

import java.io.*;
import java.util.stream.StreamSupport;

public class TDigestTry {
    public static void main(String[] args) throws IOException {
        ClassLoader classLoader = TDigestTry.class.getClassLoader();
        File file = new File(classLoader.getResource("my_set.csv").getFile());
        Reader in = new FileReader(file);
        Iterable<CSVRecord> records = CSVFormat.EXCEL.withHeader().parse(in);
        TDigest digest = TDigest.createAvlTreeDigest(20);
        StreamSupport.stream(records.spliterator(), false)
                .map(record -> Double.parseDouble(record.get("change rate")))
                .forEach(digest::add);
        System.out.println(digest.quantile(0.05));
        System.out.println(digest.quantile(0.95));
    }
}

I'm getting the results:

3.0
5.0

But when I run this Python code:

from pathlib import Path

import pandas
from tdigest import TDigest

if __name__ == '__main__':
    frame = pandas.read_csv(Path(__file__).parents[0].joinpath("resources").joinpath("my_set.csv"))
    digest = TDigest()
    digest.batch_update(frame["change rate"].values)
    print(f"Quantile 0.05 = {digest.percentile(5)};\t\tQuantile 0.95 = {digest.percentile(95)}")

I'm getting the results:

Quantile 0.05 = 2.6495903059149586;		Quantile 0.95 = 3689686.790917569

How come there's such a large difference between the 0.95 quantiles?

P.S. I get the same results when I use:

for value in frame["change rate"].values:
    digest.update(value)
CamDavidsonPilon (Owner) commented:

Looks strange - are you able to share the data?

idanmoradarthas (Author) commented:

It's attached in the second line of my original comment, or at this link:

https://github.com/CamDavidsonPilon/tdigest/files/2613099/my_set.zip


phucbui95 commented Jan 4, 2019

Hi, I have the same issue here:

from tdigest import TDigest

t = TDigest()
t.batch_update([1649.0, 69.0, 69.0, 69.0, 69.0, 69.0, 69.0, 69.0])
print(t.percentile(25))

The result was -523.5.
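For reference, a sketch of the exact answer using NumPy's default linear-interpolation percentile (an assumption about the intended ground truth, not part of the original report) gives 69.0, which makes -523.5 clearly wrong:

```python
import numpy as np

# Seven of the eight values are 69.0, so every percentile at or below
# p87.5 falls between two 69.0 order statistics and must equal 69.0.
data = [1649.0, 69.0, 69.0, 69.0, 69.0, 69.0, 69.0, 69.0]
print(np.percentile(data, 25))  # 69.0
```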


rlele5 commented Nov 26, 2020

Hi, I noticed this too. Thanks for everything you've done! In the cases it does work, it seems to be doing a great job! However, with this simple example posted above I'm now a little worried. Do you know when this error case occurs? Is this something inherent to TDigests? When should we watch out for something like this happening?

rlele5 mentioned this issue Nov 27, 2020

rlele5 commented Nov 27, 2020

@CamDavidsonPilon , @idanmoradarthas , @phucbui95 - I found where this regression was introduced:

commit 1f76113e3fce0d5b7189fd083085ba64d4129e05 (HEAD)
Author: Cameron Davidson-Pilon <[email protected]>
Date:   Tue Dec 26 23:04:33 2017 -0500

    bump to v0.5, drop python2 support

The commit before returns the correct results:

Author: Cameron Davidson-Pilon <[email protected]>
Date:   Mon Feb 20 11:22:54 2017 -0500

    test against 3.6

Fix: roll the implementation of def percentile(self, p) back to the implementation in 9cd2536. @CamDavidsonPilon, are you able to explain this further? I still have to run your test cases to confirm, but this seems promising from some manual tests.


Examples:
Code:

from tdigest import TDigest
t = TDigest()
t.batch_update([1649.0, 69.0, 69.0, 69.0, 69.0, 69.0, 69.0, 69.0])

for percent in [0, 25, 50, 75, 100]:
    value = t.percentile(percent)
    print(f"p{percent} = {value}")

Output with 1f76113 (broken):

p0 = 69.0
p25 = -1906.0
p50 = -1116.0
p75 = -326.0
p100 = 464.0

Output with 9cd2536 (working):

p0 = 69.0
p25 = 69.0
p50 = 69.0
p75 = 69.0
p100 = 1649.0
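As a sanity check (taking NumPy's default linear-interpolation percentile as the ground truth here, which is an assumption), the exact percentiles agree with the 9cd2536 output:

```python
import numpy as np

data = [1649.0, 69.0, 69.0, 69.0, 69.0, 69.0, 69.0, 69.0]
# With seven of eight values equal to 69.0, only percentiles above p87.5
# are pulled toward the outlier, and p100 reaches it exactly.
for percent in [0, 25, 50, 75, 100]:
    print(f"p{percent} = {np.percentile(data, percent)}")
```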


rlele5 commented Nov 27, 2020

Addendum - this may not be as simple as it seems, but it may be a step in the right direction. I further compared against a test case (bigJump) in the reference implementation (https://github.com/tdunning/t-digest/blob/ff3232bc25a69961fc7bf4911f8de0026bd28c44/core/src/test/java/com/tdunning/math/stats/TDigestTest.java).

Here is the Java test - notice the assertions, then compare its values with those from the working Python implementation (9cd2536):

        TDigest digest = factory(100).create();
        for (int i = 1; i < 20; i++) {
            digest.add(i);
        }
        digest.add(1_000_000);

        assertEquals(18, digest.quantile(0.89999999), 0);
        assertEquals(19, digest.quantile(0.9), 0);
        assertEquals(19, digest.quantile(0.949999999), 0);
        assertEquals(1_000_000, digest.quantile(0.95), 0);

Working python implementation results (9cd2536):
Code:

from tdigest import TDigest

l = [i for i in range(20)]
l.append(1000000)
print(l)
t = TDigest()
t.batch_update(l)

for percent in [89.999999, 90, 94.9999999, 95,]:
    value = t.percentile(percent)
    print(f"p{percent} = {value}")

Output:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 1000000]
p89.999999 = 18.39999979
p90 = 18.400000000000002
p94.9999999 = 225014.93950018956
p95 = 225014.94999999963

Notice that the p94.9999999 and p95 results are still quite different from the reference implementation. This will probably only happen in extreme scenarios, but I thought I'd point it out. Perhaps the reference implementation has some other conditions that handle these extremities. I don't think the python implementation is wrong, but this just may be one of the accuracy tradeoffs made with TDigest (just a guess).
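For comparison, here is a sketch of what plain linear-interpolation percentiles (NumPy's default, which is a different quantile definition than t-digest's) give for the same data. Neither digest output matches this exactly, which is consistent with the guess above that the tail behaviour reflects t-digest's own quantile definition rather than a simple bug:

```python
import numpy as np

# 21 values: 0..19 plus one extreme outlier.
data = list(range(20)) + [1000000]
# With linear interpolation over 21 points, only percentiles above p95
# interpolate toward the outlier; p95 itself lands exactly on 19.
for percent in [89.999999, 90, 94.9999999, 95]:
    print(f"p{percent} = {np.percentile(data, percent)}")
```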

CamDavidsonPilon (Owner) commented:

@rlele5 unfortunately I'm not involved much in this repo anymore - if you suggest a fix, with an appropriate test, I'd be happy to review and merge it.


rlele5 commented Nov 27, 2020

@CamDavidsonPilon , sounds good. I'll have to go back and understand why the change was made in the first place and will see if there are any further changes needed.
