
Fitting tools, combined fits, partial wave analysis, and machine learning #5

Closed · jpivarski opened this issue Jun 28, 2023 · 29 comments
Labels: 2023 PyHEP.dev, fitting, ML, topical-group

Comments

@jpivarski
Collaborator

Extracting statistical results from data with systematics, correlations, etc. at large scale.

@maxgalli
Collaborator

maxgalli commented Jul 7, 2023

Interested in this topic! I'm mostly interested in investigating interoperability between Combine (the tool still used in CMS to perform combinations and fits in pretty much every analysis) and modern packages (pyhf, cabinetry, zfit, etc.).

@rkansal47
Collaborator

Also interested in discussing Python-based alternatives to Combine!

@lgray
Collaborator

lgray commented Jul 7, 2023

jaxfit @nsmith-

@vgvassilev
Collaborator

I am interested in bringing the work we did on RooFit and AD into Combine.

cc: @grimmmyshini, @guitargeek, @sudo-panda, @davidlange6

@pfackeldey
Collaborator

pfackeldey commented Jul 8, 2023

I'd like to discuss with experts of the different fitting libraries the benefits and potential drawbacks of using JAX PyTrees. I made a small (proof-of-concept for now) package (dilax), which implements binned likelihood fits in pure JAX with PyTrees. This enables vectorisation on many (new?) levels, such as running multiple simultaneous fits for a likelihood profile on a GPU. In addition, everything becomes differentiable by construction. After a discussion with @phinate, he started a GitHub discussion in pyhf (scikit-hep/pyhf#2196) where all the concepts are written up in greater detail.

+1
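A minimal sketch of the PyTree idea above (not dilax's actual API; the templates and numbers are made up for illustration): parameters stored in a PyTree make whole fits mappable with jax.vmap and differentiable with jax.grad.

```python
import jax
import jax.numpy as jnp

signal = jnp.array([5.0, 10.0, 3.0])        # made-up signal template
background = jnp.array([50.0, 52.0, 35.0])  # made-up background template
observed = jnp.array([57.0, 63.0, 37.0])

# Parameters live in a PyTree (here a dict), so JAX transformations like
# vmap and grad apply to whole fits at once.
def nll(params, observed):
    expected = params["mu"] * signal + params["bkg_norm"] * background
    # Poisson log-likelihood terms, up to a constant
    return -jnp.sum(observed * jnp.log(expected) - expected)

# Vectorisation "for free": scan mu for a likelihood profile in one vmap,
# which runs as a single batched computation on CPU or GPU.
mus = jnp.linspace(0.0, 2.0, 41)
profile = jax.vmap(lambda m: nll({"mu": m, "bkg_norm": 1.0}, observed))(mus)

# Differentiable by construction: gradients come back as the same PyTree.
grads = jax.grad(nll)({"mu": 1.0, "bkg_norm": 1.0}, observed)
```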

@ccochato
Collaborator

ccochato commented Jul 9, 2023

+1

@redeboer
Collaborator

I'm interested in this topic, particularly fitting tools and amplitude analysis / PWA!

Some thoughts:

  • Amplitude analysis is a bit of a niche, but if there are participants who are developing tools for it (or doing an analysis themselves), we should definitely set up a session for it! A suggestion:
      ◦ Formulating models symbolically with a CAS and using them as templates for numerical backends like JAX, TF, Numba, etc. (see the sketch below this comment). I gave a talk on this at CHEP 2023 and have been using SymPy+JAX for (amplitude) fitting for some years now (see a general demo of the fitting package here, as well as an example with zfit here). I'm not aware of other HEP projects doing this, so I'm interested to discuss whether it's worth pursuing for other packages and frameworks as well (not just Python).¹
  • General question: what do we mean by ML under the heading of "fitting"? For sure, it's a form of fitting, but my impression is that when ML is used in HEP, it's usually for tracking, background reduction, etc., or tricks that benefit a specific (amplitude) analysis. If so, it may be worth putting it under a different heading (perhaps moving it to "complete ML workflows in analysis: facilities capabilities and user interface" #19 altogether?).

Footnotes

  1. Google seems to be trying something along those lines with sympy2jax. And (sorry to mention this at PyHEP) I have the impression that Julia could be used for such a workflow as well, with Symbolics.jl and JuliaDiff.
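A minimal sketch of the SymPy → JAX workflow described above, assuming a SymPy version that supports the "jax" lambdify backend (the Gaussian model is purely illustrative):

```python
import sympy as sp
import jax
import jax.numpy as jnp

x, mu, sigma = sp.symbols("x mu sigma")
# The symbolic model acts as a backend-independent template
gauss = sp.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sp.sqrt(2 * sp.pi))

# Lambdify the expression to a JAX-traceable function...
pdf = sp.lambdify((x, mu, sigma), gauss, modules="jax")

# ...so the resulting NLL can be JIT-compiled and differentiated
@jax.jit
def nll(params, data):
    mu_, sigma_ = params
    return -jnp.sum(jnp.log(pdf(data, mu_, sigma_)))

data = jnp.array([0.1, -0.4, 0.3])
print(nll((0.0, 1.0), data), jax.grad(nll)((0.0, 1.0), data))
```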

@redeboer
Collaborator

Suggestion for a new topical issue
Another idea that is relevant to amplitude analysis, but that we may want to discuss more generally: last month, the PDG announced that they now offer an API. Their Python API is still under development, so I feel that we, as the PyHEP community, should get involved in its development.
Perhaps @eduardo-rodrigues has thoughts on this? Is it worth creating a topical issue on this?

@alexander-held
Collaborator

alexander-held commented Jul 10, 2023

I'm also interested in this. Another aspect of this topic is the orchestration of model construction / metadata handling, which ties in with earlier steps in an analysis workflow (and #4). Regarding AD: I'm also curious to learn how much of the functionality is exposed to users (i.e., can I easily take arbitrary derivatives myself, or are current implementations limited to internally providing derivatives with respect to the parameters for the minimizer?).
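With a JAX-based likelihood, "arbitrary derivatives" are available essentially for free; a toy sketch (the NLL here is illustrative, not from any particular package):

```python
import jax
import jax.numpy as jnp

# Toy Gaussian NLL; in a real tool this would be the model's full likelihood
def nll(pars, data):
    mu, sigma = pars
    return jnp.sum(0.5 * ((data - mu) / sigma) ** 2 + jnp.log(sigma))

data = jnp.array([0.1, 0.4, -0.2])
pars = jnp.array([0.0, 1.0])

grad = jax.grad(nll)(pars, data)     # first derivatives w.r.t. all parameters
hess = jax.hessian(nll)(pars, data)  # full Hessian, beyond what a minimizer needs
```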

@redeboer: probably best to open a new issue as this might get lost here and is not related to the thread. Presumably interesting e.g. for scikit-hep/particle.

@eduardo-rodrigues
Member

Hi folks. Thanks for the ping. I'm aware of the new PDG API and am in fact in touch with Juerg, the director :-). I do need to find time to have a proper look and comment... But it is not forgotten, and how Particle sits/evolves vis-à-vis the new pdg package is indeed a relevant question.

@redeboer
Collaborator

redeboer commented Jul 10, 2023

@redeboer: probably best to open a new issue as this might get lost here and is not related to the thread. Presumably interesting e.g. for scikit-hep/particle.

✅ --> scikit-hep/particle#513

@alexander-held
Collaborator

General question: what do we mean by ML under the heading of "fitting"?

One thing that fits into that box is simulation-based inference à la MadMiner, say, or various anomaly-detection methods.

@phinate

phinate commented Jul 10, 2023

Can't attend in person, sadly, but I would love to be involved in any discussions here if possible (time zones permitting)!

@mdsokoloff
Collaborator

Hi All: I've been working on GooFit (https://github.com/GooFit/GooFit) for a decade now. One of its primary goals is doing time-dependent amplitude analyses with large data sets (think hundreds of thousands of events to millions). While all the underlying code is C++, the package has Python bindings for most methods. In addition, the (Python) DecayLanguage package that lives in Scikit-HEP (https://github.com/scikit-hep/decaylanguage) produces CUDA code for GooFit from AmpGen decay descriptor files (https://github.com/GooFit/AmpGen).

GooFit sounds like RooFit, and its user interface mimics that of RooFit in many ways. It runs on NVIDIA GPUs, under OpenMP on x86 servers, and on single CPUs (the last is useful for debugging).

While GooFit has been used primarily for amplitude analyses, it can also be used effectively for coverage tests, fitting simple one-dimensional functions, etc.
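For a flavor of those Python bindings, a sketch of a simple 1D Gaussian fit, adapted from memory of the GooFit README (argument orders may differ slightly between GooFit versions):

```python
import numpy as np
from goofit import GaussianPdf, Observable, UnbinnedDataSet, Variable

# Observable and dataset; from_matrix(filter=True) drops out-of-range entries
xvar = Observable("xvar", -5, 5)
data = UnbinnedDataSet(xvar)
data.from_matrix([np.random.normal(0.2, 1.1, 100000)], filter=True)

mean = Variable("mean", 0, 1, -10, 10)  # name, value, step, lower, upper
sigma = Variable("sigma", 1, 0.1, 1.5)  # name, value, lower, upper
gauss = GaussianPdf("gauss", xvar, mean, sigma)

gauss.fitTo(data)  # runs the fit; parameter objects are updated in place
print(mean.value, sigma.value)
```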

I am very interested in using AD within GooFit. From preliminary discussions with experts, GooFit's architecture should allow us to use/adapt Clad (https://compiler-research.org/clad/) in a fairly straightforward way.

At the end of the day, we would like to make most of GooFit's functionality available to users through Python interfaces that do not require developing new C++ code. It will be very interesting to see what a potential user community wants to do.

@nsmith-
Collaborator

nsmith- commented Jul 10, 2023

I'm very interested in a jax-based statistical inference package, towards both binned and unbinned fits.

  • @redeboer I had only thought about sympy-assisted model building, but glad to see you've made real progress there!
  • I'll +1 the HEP Statistics Serialization Standard (HS3) as a topic worth discussing.
  • Perhaps it is time to revitalize "Proposal: Outline of pydistcore API" (scikit-hep/pyhf#608) towards a common statistical-interpretation API (an evolution of RooStats, say), as this plus a serialization standard would allow switching out backends as necessary for performance.

In my experience attempting a jax port of the CMS Higgs combination, I found that the many un-vectorized parameters we have become a debilitating JIT-compilation bottleneck in jax (a toy illustration of the effect is sketched below). But this situation may have changed since I last checked in 2021.
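A toy illustration of that bottleneck, under the assumption that the parameters enter the trace as individual scalars (the numbers are made up):

```python
import jax
import jax.numpy as jnp

N = 5000  # think thousands of nuisance parameters

# One traced scalar per parameter: the traced program (jaxpr) grows with N,
# so tracing and compilation time grow with it too.
def nll_scalars(params):          # params: dict of N separate scalars
    return sum((v - 1.0) ** 2 for v in params.values())

# The same likelihood over one packed vector: a handful of vectorized ops,
# independent of N.
def nll_packed(theta):            # theta: shape (N,)
    return jnp.sum((theta - 1.0) ** 2)

jax.jit(nll_packed)(jnp.ones(N))                          # compiles quickly
# jax.jit(nll_scalars)({f"p{i}": 1.0 for i in range(N)})  # much slower to trace
```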

@phinate

phinate commented Jul 10, 2023

I'm very interested in a jax-based statistical inference package, towards both binned and unbinned fits.

@nsmith- Is this better scoped as a statistical modelling package, where one would find the appropriate abstraction that fits both binned/unbinned paradigms? Inference would just be extra layers on top of minimization, which I've already abstracted in relaxed for the most common cases encountered in pyhf (profile likelihood-based testing) -- the only important API requirement is the existence of a .logpdf method. (Upper limits are a small extension over that with a root-finder.)
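A sketch of that duck-typed contract: any object exposing a .logpdf(pars, data) method can be fed to a generic, jaxopt-based fit (all names here are illustrative, not relaxed's actual API):

```python
import jax.numpy as jnp
import jaxopt

class GaussianModel:
    # Any model exposing .logpdf(pars, data) satisfies the contract
    def logpdf(self, pars, data):
        mu, sigma = pars
        return jnp.sum(-0.5 * ((data - mu) / sigma) ** 2 - jnp.log(sigma))

def fitted_nll(model, data, init):
    # Unconstrained MLE via LBFGS; inference layers would build on top of this
    solver = jaxopt.LBFGS(fun=lambda p: -model.logpdf(p, data))
    return solver.run(jnp.asarray(init)).state.value  # NLL at the minimum

data = jnp.array([0.2, -0.3, 0.5])
print(fitted_nll(GaussianModel(), data, (0.0, 1.0)))
```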

@nsmith-
Collaborator

nsmith- commented Jul 10, 2023

@phinate yes! I guess your relaxed is an implementation of scikit-hep/pyhf#608?

@phinate

phinate commented Jul 10, 2023

@phinate yes! I guess your relaxed is an implementation of scikit-hep/pyhf#608?

Oh, I suppose so, in a not-well-tested kind of way :) Just asymptotic calcs though, and it probably needs a quick going-through to be truly agnostic to the model representation, but it is just a thin wrapper around jaxopt with HEP-like quantities/semantics!

would be happy to build this out more to support whatever model abstraction we can come up with!

@redeboer added the fitting and ML labels on Jul 11, 2023
@JMolinaHN
Collaborator

Hi everyone,
I wanted to bring up a key point concerning amplitude analysis: the integration of the probability density function (PDF).
The speed of convergence hinges significantly on this aspect, and it's why parallel processing becomes crucial, particularly when processing large datasets with intricate integrals. Tools like GooFit have been invaluable in this regard, standing out as some of the best available solutions for this type of processing.

However, given the advancements in today's computational capabilities, I believe it might be beneficial to explore alternative approaches. For instance, we could precompute the integrals and devise an efficient method for accessing those values as necessary (see the sketch below). Another potential strategy is a chi-squared (χ²) fit with reduced granularity. While this is typically quite fast, it reintroduces the challenge of integration.
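One way to read the "precompute the integrals" idea, as a toy sketch (the model, grids, and names are all illustrative): cache the PDF normalisation on a grid of parameter values once, then interpolate at fit time.

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

GRID = np.linspace(-5.0, 5.0, 2001)  # integration grid over the observable

def norm_integral(mu, sigma):
    # Brute-force normalisation of an (unnormalised) Gaussian on [-5, 5]
    vals = np.exp(-((GRID - mu) ** 2) / (2.0 * sigma**2))
    dx = GRID[1] - GRID[0]
    return (vals.sum() - 0.5 * (vals[0] + vals[-1])) * dx  # trapezoid rule

# Precompute once over the parameter ranges the minimizer will explore ...
mus = np.linspace(-1.0, 1.0, 21)
sigmas = np.linspace(0.5, 2.0, 16)
table = np.array([[norm_integral(m, s) for s in sigmas] for m in mus])
lookup = RegularGridInterpolator((mus, sigmas), table)

# ... then each minimizer step does a cheap lookup instead of an integral
print(lookup([(0.3, 1.2)]))
```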

Beyond these technical aspects, there's another issue I've been considering: the generalization and user-level accessibility of fitting tools. It often feels like we lack a consistent standard across fitting tools. For instance, finding a tool that effectively handles both B and D decays can be challenging. Similarly, analyzing decays of more than three bodies can become complex, often requiring custom or adapted code that can be hard to decipher.

We need to address the readability of this code and work towards user-level code that interfaces with the base code. Again, I bring up GooFit as an example: it does a great job of shielding the user from the intricacies of the CUDA code needed to perform an analysis. Despite this, I find that there's room for improvement in the user experience, and I believe it would be fruitful for us to discuss these issues during the workshop.

@JMolinaHN pinned this issue on Jul 11, 2023
@redeboer
Collaborator

We need to address the readability of this code and work towards user-level code that interfaces with the base code. Again, I bring up GooFit as an example: it does a great job of shielding the user from the intricacies of the CUDA code needed to perform an analysis. Despite this, I find that there's room for improvement in the user experience, and I believe it would be fruitful for us to discuss these issues during the workshop.

I fully agree!

Would it be an idea to organise a dedicated session for amplitude analysis (UX and documentation specifically)? If so, who would be interested? @JMolinaHN @mdsokoloff @jonas-eschle?

@JMolinaHN
Collaborator

@redeboer Of course a discussion on amplitude analysis would be more than interesting! (In view of the latest results, I think we need it.) From my point of view, I refuse to believe that a likelihood analysis can't be done in some decays like Dpipipi or Dkpipi. We all know those decays are challenging because of the pipi (in general, pp), but in some sense we should be adequately sensitive to problems like that.

@mattbellis
Collaborator

+1

@ianna
Collaborator

ianna commented Jul 19, 2023

+1

@nikoladze
Collaborator

+1

@jonas-eschle
Collaborator

I'm very interested in a jax-based statistical inference package, towards both binned and unbinned fits.

@nsmith- Is this better scoped as a statistical modelling package, where one would find the appropriate abstraction that fits both binned/unbinned paradigms? Inference would just be extra layers on top of minimization, which I've already abstracted in relaxed for the most common cases encountered in pyhf (profile likelihood-based testing) -- the only important API requirement is the existence of a .logpdf method. (Upper limits are a small extension over that with a root-finder.)

This is basically what zfit already solves: it combines binned and unbinned (and mixed) fits. I think it's crucially more than relaxed, which (afaiu) allows using histogram templates as an unbinned PDF, but there is more to it: analytic shapes, numerical integration & sampling methods, arbitrary correlations, etc.
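For concreteness, a sketch of a simultaneous binned + unbinned fit in zfit (the toy model is illustrative and zfit's API may have shifted between versions; statistically one would of course use disjoint channels rather than the same data twice):

```python
import numpy as np
import zfit

obs = zfit.Space("x", limits=(-5, 5))
mu = zfit.Parameter("mu", 0.2, -1, 1)
sigma = zfit.Parameter("sigma", 1.1, 0.1, 5)
gauss = zfit.pdf.Gauss(mu=mu, sigma=sigma, obs=obs)

# Unbinned channel
data = zfit.Data.from_numpy(obs=obs, array=np.random.normal(0, 1, 5000))
nll_unbinned = zfit.loss.UnbinnedNLL(model=gauss, data=data)

# Binned channel: the same model and data, converted to a binned space
binning = zfit.binned.RegularBinning(50, -5, 5, name="x")
obs_binned = zfit.Space("x", binning=binning)
nll_binned = zfit.loss.BinnedNLL(model=gauss.to_binned(obs_binned),
                                 data=data.to_binned(obs_binned))

# Losses add, giving one simultaneous (mixed) fit
result = zfit.minimize.Minuit().minimize(nll_unbinned + nll_binned)
print(result.params)
```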

I also agree with the others; the three main topics that I see are:

  • the fitting-tools landscape and interfaces (how to define an interface for a distribution, parameters, etc.). There are a lot of tools around, each with its own purpose; the main goal should be to bring them closer together.
  • backends: SymPy and JAX are popular ones, but they're not without drawbacks (e.g., TensorFlow is partially more powerful than JAX, and there is aesara (ping @redeboer ;)) that optimizes SymPy expressions). How can all of these best work together?
  • a statistical language and serialization standard (HS3, decaylanguage, pyhf HistFactory JSON) to have a common format for serializing (and publishing/exchanging!) likelihoods and models, including amplitude-fit models.

@nsmith-
Collaborator

nsmith- commented Jul 21, 2023

what zfit already solves: it combines binned and unbinned (and mixed) fits

In this regard, zfit and RooFit are alone at the moment. What I would like to understand is how their representations of mixed binned-unbinned data compare/contrast.

As an aside, Combine can also produce unbinned "pseudo-Asimov" datasets to take advantage of asymptotic methods. Is this something done elsewhere? (I am just ignorant here.)

TensorFlow is partially more powerful than JAX

Curious about this!

@alexander-held
Collaborator

As an aside, Combine can also produce unbinned "pseudo-Asimov" datasets to take advantage of asymptotic methods.

@nsmith- I'm curious to learn more about this. Is this in the docs?

@nsmith-
Collaborator

nsmith- commented Jul 21, 2023

There is a brief discussion here: http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part3/runningthetool/#asimov-datasets

@matthewfeickert
Collaborator

This topic seems perhaps too broad, and while I expect that during the week it will organically split out across different areas, the areas where I'm most likely to spend time discussing are:

@riga unpinned this issue on Jul 25, 2023