filtering performance question #27

Open
cariaso opened this issue Dec 13, 2020 · 2 comments
Comments

@cariaso

cariaso commented Dec 13, 2020

More of a performance question than a bug report, but I don't see a more suitable location.

I get that groq has capabilities that JMESPath does not, but as they do share some similarities I'm comparing them.

I need to filter some JSON data. For this test I'm working with 10k entries. Here is an example that shows 2 records. The real data has a few more columns, some with a bit of nested complexity, but my queries don't touch any of that.

[{
  "base__uid": 1664200,
  "base__chrom": "chr6",
  "base__pos": 1312763,
  "base__ref_base": "T"
},{
  "base__uid": 1669279,
  "base__chrom": "chr6",
  "base__pos": 4116028,
  "base__ref_base": "G"
}]

In JMESPath this query:

[?base__chrom=='chr6']|[?base__pos>=`1000000`]|[?base__pos<=`5000000`]

takes 7ms

and it seems to be equivalent to this groq query

*[base__chrom=='chr6' && base__pos>= 1000000 && base__pos <= 5000000]

which takes 5662ms.

So far my queries aren't particularly complex, but regardless of their complexity, runtime for groq seems linear and dominated by the number of records in my input. At 100k records, it always takes ~30s. 10k records = ~3s. etc.
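To check how runtime scales with record count, a small timing harness along these lines may help. It is only a sketch: `makeRecords` and `timeFilter` are hypothetical helper names, and the synthetic values are made up to match the shape of the sample records above.

```javascript
// Hypothetical helper: build n synthetic records shaped like the sample above.
function makeRecords(n) {
  const records = [];
  for (let i = 0; i < n; i++) {
    records.push({
      base__uid: i,
      base__chrom: i % 2 === 0 ? "chr6" : "chr7",
      base__pos: (i * 7919) % 10000000,
      base__ref_base: "ACGT"[i % 4],
    });
  }
  return records;
}

// Time one run of an async filter function over the given records.
// Returns the elapsed wall-clock milliseconds and the number of accepted records.
async function timeFilter(filterFn, records) {
  const start = Date.now();
  const accepted = await filterFn(records);
  return { ms: Date.now() - start, count: accepted.length };
}
```

Passing a function that wraps the `evaluate(groqtree, { dataset: recs })` and `value.get()` calls shown below, with `makeRecords(10000)` and then `makeRecords(100000)`, should show whether the time grows in proportion to the input size.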

I'm filtering with this code

value = await evaluate(groqtree, { dataset: allrec });
accepted = await value.get();

Do my performance numbers seem appropriate? Anything I should dig into in hopes of getting better performance from groq?

@judofyr
Collaborator

judofyr commented Dec 16, 2020

> So far my queries aren't particularly complex, but regardless of their complexity, runtime for groq seems linear and dominated by the number of records in my input.

This is expected. groq-js is a naive implementation and doesn't index the documents in any way to speed up query performance. Right now there's no performance gain from using groq-js over just calling .filter(…) in JavaScript.
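For comparison, the filter from the question could be written directly with `.filter` roughly as follows. This is a sketch that assumes the field names from the sample records; `filterByRegion` is a hypothetical name, not part of groq-js.

```javascript
// Plain-JavaScript version of the filter in the question (a sketch, not groq-js code).
// Keeps records on `chrom` whose position lies in [minPos, maxPos].
function filterByRegion(records, chrom, minPos, maxPos) {
  return records.filter(
    (r) =>
      r.base__chrom === chrom &&
      r.base__pos >= minPos &&
      r.base__pos <= maxPos
  );
}

// The two sample records from the question.
const sample = [
  { base__uid: 1664200, base__chrom: "chr6", base__pos: 1312763, base__ref_base: "T" },
  { base__uid: 1669279, base__chrom: "chr6", base__pos: 4116028, base__ref_base: "G" },
];

// Both sample records are on chr6 and fall inside the range, so both are kept.
const accepted = filterByRegion(sample, "chr6", 1000000, 5000000);
console.log(accepted.map((r) => r.base__uid));
```

Like the groq-js evaluation, this still scans every record once, so it is linear in the input size too.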

> which takes 5662ms.

As for the performance itself: no specific work has really been done to make it efficient. There may be a lot of low-hanging fruit that could speed things up.

@scottrippey

I'm running into similar performance issues. I'm attempting to make a reusable groq testing library, and for some queries across 1500 entries, it's taking 15+ seconds. These queries are perfectly fast in production.
I assume that the lack of indexing would essentially prevent this from running any faster ... do you think there's any possibility that groq-js could add support for indexing?
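To make concrete what indexing could mean here, the following is a hand-rolled sketch for the range query from the original question. It is not a groq-js feature; `buildIndex`, `lowerBound`, and `queryRange` are hypothetical names, and it assumes integer positions and the field names from the sample records.

```javascript
// Hypothetical index: records grouped by chromosome, each group sorted by position.
function buildIndex(records) {
  const byChrom = new Map();
  for (const r of records) {
    if (!byChrom.has(r.base__chrom)) byChrom.set(r.base__chrom, []);
    byChrom.get(r.base__chrom).push(r);
  }
  for (const group of byChrom.values()) {
    group.sort((a, b) => a.base__pos - b.base__pos);
  }
  return byChrom;
}

// Binary search: first index in `group` whose base__pos is >= pos.
function lowerBound(group, pos) {
  let lo = 0;
  let hi = group.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (group[mid].base__pos < pos) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

// Range query: all records on `chrom` with minPos <= base__pos <= maxPos.
function queryRange(index, chrom, minPos, maxPos) {
  const group = index.get(chrom) || [];
  const start = lowerBound(group, minPos);
  const end = lowerBound(group, maxPos + 1); // assumes integer positions
  return group.slice(start, end);
}
```

The index is built once, with a sort per chromosome; after that each query costs a binary search plus the size of the result rather than a full scan. A general-purpose query engine would have to choose such structures per query, which is a much larger task than this single hard-coded case.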


3 participants