Add Accurate Count-Distinct aggregator (RoaringBitmap) #493

vidma · 2015-10-04T07:56:29Z

Internally, this uses RoaringBitmap, a compressed alternative to BitSet (it's rather fast, and commonly used in projects such as Spark, Druid, etc.).

This is very simple, but IMHO algebird is still missing this :)

I'll add tests if you're willing to merge it (our tests are currently Spark/DataFrame dependent).

@johnynek
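For readers unfamiliar with the idea, exact count-distinct over a bitmap can be sketched as follows. This is not the code from the PR: it assumes non-negative Int ids, uses the standard library's uncompressed `BitSet` as a stand-in for RoaringBitmap, and defines minimal hypothetical `Semigroup` and `Aggregator` traits instead of depending on Algebird.

```scala
import scala.collection.immutable.BitSet

object CountDistinctSketch {
  // Minimal stand-ins for Algebird's traits so the sketch compiles on its own.
  trait Semigroup[T] { def plus(a: T, b: T): T }

  trait Aggregator[A, B, C] {
    def prepare(a: A): B
    def semigroup: Semigroup[B]
    def present(b: B): C
    // Throws on empty input; a Monoid-based aggregator would return a zero instead.
    def apply(inputs: Iterable[A]): C =
      present(inputs.iterator.map(prepare).reduce((l, r) => semigroup.plus(l, r)))
  }

  // Set union is associative, so bitsets form a semigroup.
  object BitSetSemigroup extends Semigroup[BitSet] {
    def plus(a: BitSet, b: BitSet): BitSet = a | b
  }

  // Each id becomes a one-element bitset, bitsets are unioned,
  // and the distinct count is the cardinality of the union.
  object ExactCountDistinct extends Aggregator[Int, BitSet, Long] {
    def prepare(id: Int): BitSet = BitSet(id) // id must be non-negative
    val semigroup: Semigroup[BitSet] = BitSetSemigroup
    def present(bits: BitSet): Long = bits.size.toLong
  }
}
```

For example, `CountDistinctSketch.ExactCountDistinct(Seq(3, 1, 3, 7, 1))` counts three distinct ids. A plain `BitSet` allocates space proportional to the largest id, which is the cost a compressed bitmap is meant to avoid.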

ianoc · 2015-10-04T19:25:30Z

I think our main concern with this will be the addition of the dependency. It looks like a pretty clean dep: the packages it pulls in seem to be used only in tests. @johnynek, do you think it's small enough to be in core, or should we have it in an algebird-X?

johnynek · 2015-10-04T20:21:36Z

Yes the dependency is the trick here. This would add a new dependency for everyone, even those that don't use this feature.

A second concern is that this would be the second bitset dependency (we took another one, which may have been a mistake):
https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/BloomFilter.scala#L22

We definitely should not have two compressed bitset dependencies. They claim this performs better than Ewah. If that's true, we should consider just moving to this.

In the meantime, it might be better to make the aggregator without adding an additional dependency, just by using the current one, and then we can evaluate whether we should change to use only the RoaringBitmap implementation. That would be my opinion.

Also, note, there is a faster aggregation for RoaringBitmap according to the docs:
https://github.com/lemire/RoaringBitmap/blob/master/src/main/java/org/roaringbitmap/FastAggregation.java

which could be used as the implementation for sumOption on the semigroup, which should speed things up when this is used with spark or scalding.

One more comment: if we did merge this, the monoid should probably be in the mutable namespace, since this data structure is mutable (even though we are not mutating it here).
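The sumOption suggestion can be illustrated with a small sketch. It is an assumption-laden stand-in, not the PR's code: the `Semigroup` trait below is a simplified, hypothetical version of Algebird's (taking an `Iterator` for brevity), and the standard library's `BitSet` plays the role of RoaringBitmap, with a single mutable accumulator standing in for a batch union such as the one FastAggregation provides.

```scala
import scala.collection.immutable.BitSet
import scala.collection.mutable

object SumOptionSketch {
  trait Semigroup[T] {
    def plus(a: T, b: T): T
    // Default: pairwise fold, producing one intermediate value per step.
    def sumOption(items: Iterator[T]): Option[T] =
      items.reduceOption((a, b) => plus(a, b))
  }

  object BitSetSemigroup extends Semigroup[BitSet] {
    def plus(a: BitSet, b: BitSet): BitSet = a | b

    // Batch version: union everything into one mutable accumulator and
    // convert to an immutable result once at the end.
    override def sumOption(items: Iterator[BitSet]): Option[BitSet] =
      if (!items.hasNext) None
      else {
        val acc = mutable.BitSet.empty
        items.foreach(b => acc |= b)
        Some(acc.toImmutable)
      }
  }
}
```

The override is only an optimization: it has to return the same result as folding with `plus`, which holds here because union is associative and commutative.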

avibryant · 2015-10-05T03:51:25Z

algebird-core/src/main/scala/com/twitter/algebird/RoaringBitmapAggregator.scala

+ }
+}
+
+class RoaringBitampSemigroup extends Semigroup[RoaringBitmap] {


Bitamp should be Bitmap (also appears elsewhere)

avibryant · 2015-10-05T15:40:45Z

Agreed that we should pick a compressed bitset to use. It seems worth noting that the author of the EWAH implementation we use (Lemire) is also the author of the RoaringBitmap paper and implementation, which gives his claims about which one is faster a lot of weight.

vidma · 2015-10-05T18:57:57Z

Good point on sumOption. Finally Algebird's Spark support might do some good for us (I hope).

"Agreed that we should pick a compressed bitset to use."

Sounds good. Then I'll look into sumOption and add some tests.

vidma · 2015-10-05T21:00:37Z

btw, @johnynek, regarding sumOption...

Looking again at https://github.com/twitter/algebird/pull/397/files, I see your comment:

"I don't see a way to use sumOption in sumByKey or aggregateByKey with reimplementing or skipping map-side combining. I wish spark had something like scalding's sumByLocalKeys."

I wish too! Any new ideas on how to add this sumByLocalKeys to Spark?

johnynek · 2015-10-05T21:49:58Z

Guess I forgot about that. No new ideas on how to solve it.


vidma · 2015-10-05T22:57:14Z

Continuing the sumByLocalKeys off-topic: I'd say the only useful place is a combiner after a shuffle, if using a sort-based shuffle (locality), but it's not so simple, I guess...

P.S. As I understand it, even Scalding calls sumOption only after the shuffle [SummingCache uses .plus].

johnynek · 2015-10-05T23:58:55Z

Yes, for now we only use sumOption after the shuffle. We could do this on the map side as well, but the problem there is always how many items to cache in memory to possibly improve summing. A JIT-like approach would be great: do some measurements on a few items, then use those to do the rest. That, or store history and check a history service to see how many items should be kept for each key before calling sumOption.
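The trade-off described here, how many items to hold per key before calling sumOption, can be sketched roughly as below. `BatchingCombiner` and its fixed `batchSize` are hypothetical names for illustration only; this is not how Scalding or Summingbird implement map-side summing, and a measured or history-driven threshold would replace the constant.

```scala
import scala.collection.mutable

object MapSideBatchingSketch {
  // Buffers up to `batchSize` values per key; when a key's buffer fills up,
  // it is collapsed to a single value with the supplied sumOption.
  final class BatchingCombiner[K, V](batchSize: Int)(sumOption: Iterator[V] => Option[V]) {
    require(batchSize >= 2, "batchSize must be at least 2")
    private val buffers = mutable.Map.empty[K, mutable.ArrayBuffer[V]]

    def put(key: K, value: V): Unit = {
      val buf = buffers.getOrElseUpdate(key, mutable.ArrayBuffer.empty[V])
      buf += value
      if (buf.size >= batchSize) {
        val summed = sumOption(buf.iterator) // fully evaluated before clearing
        buf.clear()
        summed.foreach(v => buf += v)
      }
    }

    // Emits one partially summed value per key and empties the cache.
    def flush(): Map[K, V] = {
      val out = buffers.iterator.flatMap { case (k, buf) =>
        sumOption(buf.iterator).map(v => k -> v)
      }.toMap
      buffers.clear()
      out
    }
  }
}
```

A larger `batchSize` gives sumOption more items to work with per call at the cost of more memory per key, which is the tension the comment points at.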


ianoc · 2015-10-06T00:18:57Z

Scalding only uses it reduce-side natively right now, though Summingbird can optionally use sumOption map-side.

I've used it map-side with two general strategies before:

1. If we can quickly compute the topN (some code used in SB is in algebird-util here), then we can do something with those.
2. If we know the overall space cardinality is small, or we are happy to rely on lots of LRU evictions, then putting lists in for the values works pretty well.

In scalding using sumByLocalKeys it looks something like: .map(specialType).sumByLocalKeys.map(_.present).group.forceToReducers.sum

CLAassistant · 2019-07-18T15:12:33Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

vidmantas zemleris seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.

Add exact count-distinct

e6b2096

avibryant reviewed Oct 5, 2015

vidma force-pushed the features/add-exact-count-distinct-monoid branch from 4527954 to f4e38ef Compare October 5, 2015 21:32

Use sumOption is RoaringBitmap

9e00b8b

vidma force-pushed the features/add-exact-count-distinct-monoid branch from f4e38ef to 9e00b8b Compare October 5, 2015 21:33
