Decrease memory footprint of computing FMR metric #251

Open
AlekseySh opened this issue Dec 1, 2022 · 6 comments
AlekseySh commented Dec 1, 2022

Initial discussion is in #250

@AlekseySh AlekseySh self-assigned this Dec 1, 2022
@AlekseySh AlekseySh removed their assignment Dec 1, 2022
@AlekseySh AlekseySh added this to To do in OML-planning via automation Dec 1, 2022
@AlekseySh AlekseySh moved this from To do to backlog in OML-planning Dec 1, 2022
AlekseySh commented Dec 1, 2022

from @dapladoc :

I thought about algorithms for finding quantiles online, in a batch mode. I found that there is the P² algorithm (link), and here is a discussion about it (link).
Then I found more articles about quantile estimation: see the references in this article (link). It looks like all of these algorithms work only element-by-element, so there is no way to parallelize them. I'm not a specialist in this, though.
Maybe it could be done efficiently in Cython. I'll try to implement P² (or some other algorithm) in Cython.
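
For reference, a minimal sketch of the P² idea written from its published description (illustrative code, not part of OML; the class name and API are made up):

import numpy as np


class P2Quantile:
    """Streaming estimate of a single quantile p using 5 markers and O(1) memory."""

    def __init__(self, p: float):
        self.p = p
        self.heights = []  # marker heights; the middle one (index 2) tracks the p-quantile
        self.pos = np.arange(1, 6, dtype=float)  # actual marker positions n_i
        self.desired = np.array([1, 1 + 2 * p, 1 + 4 * p, 3 + 2 * p, 5], dtype=float)  # desired positions
        self.increments = np.array([0, p / 2, p, (1 + p) / 2, 1], dtype=float)

    def update(self, x: float) -> None:
        if len(self.heights) < 5:  # bootstrap: collect and sort the first 5 observations
            self.heights.append(x)
            if len(self.heights) == 5:
                self.heights = np.array(sorted(self.heights), dtype=float)
            return
        q, n = self.heights, self.pos
        # find the cell containing x, updating the extreme markers if needed
        if x < q[0]:
            q[0], k = x, 0
        elif x >= q[4]:
            q[4], k = x, 3
        else:
            k = int(np.searchsorted(q, x, side="right")) - 1
        n[k + 1:] += 1                 # markers above the cell shift one position to the right
        self.desired += self.increments
        for i in range(1, 4):          # move the inner markers towards their desired positions
            d = self.desired[i] - n[i]
            if (d >= 1 and n[i + 1] - n[i] > 1) or (d <= -1 and n[i - 1] - n[i] < -1):
                d = np.sign(d)
                # piecewise-parabolic prediction of the new marker height
                qp = q[i] + d / (n[i + 1] - n[i - 1]) * (
                    (n[i] - n[i - 1] + d) * (q[i + 1] - q[i]) / (n[i + 1] - n[i])
                    + (n[i + 1] - n[i] - d) * (q[i] - q[i - 1]) / (n[i] - n[i - 1])
                )
                if not (q[i - 1] < qp < q[i + 1]):  # fall back to linear interpolation
                    j = i + int(d)
                    qp = q[i] + d * (q[j] - q[i]) / (n[j] - n[i])
                q[i] = qp
                n[i] += d

    @property
    def value(self) -> float:
        return float(self.heights[2]) if len(self.heights) >= 5 else float(np.median(self.heights))

The update is inherently per-element, which is exactly the sequential limitation mentioned above; what it buys is O(1) memory instead of storing all negative distances.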

Another solution is to approximate the quantile (and the fnmr@fmr) by means of a histogram that could be computed incrementally. I can try to estimate the accuracy of this approach on some synthetic data.
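
A minimal sketch of the histogram idea (illustrative, not OML code; the class name and the value range are assumptions):

import torch


class StreamingHistogramQuantile:
    """Fixed-bin histogram accumulated batch by batch; quantiles are read from the CDF."""

    def __init__(self, lo: float, hi: float, n_bins: int = 10_000):
        # the value range has to be known (or guessed) in advance; accuracy ~ bin width
        self.lo, self.hi = lo, hi
        self.edges = torch.linspace(lo, hi, n_bins + 1)
        self.counts = torch.zeros(n_bins)

    def update(self, batch: torch.Tensor) -> None:
        # histc processes the whole batch at once, so there is no per-element Python loop
        self.counts += torch.histc(batch.float(), bins=len(self.counts), min=self.lo, max=self.hi)

    def quantile(self, q: float) -> float:
        cdf = torch.cumsum(self.counts, dim=0) / self.counts.sum()
        idx = int(torch.sum(cdf < q))          # first bin where the CDF reaches q
        return float(self.edges[min(idx + 1, len(self.edges) - 1)])  # right edge of that bin

Unlike random subsampling, the approximation error here is deterministic and bounded by the bin width; the drawback is that lo/hi must cover the actual range of distances.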

At the moment I don't have other ideas, but I will think about it, and maybe something new will come up.

AlekseySh commented:

from me:

@dapladoc thank you for your thoughts and links! Please don't start working on a Cython implementation (unless you really want to); let's discuss it for now. I am afraid that the possible profit is much smaller than the potential effort needed to implement this algorithm. Let's just discuss the possibilities for now.

I like the idea of an approximation. I thought about an even simpler solution: we can consider only a part of the pairwise distances if the matrix is too big. For example, randomly pick 10% of them or something like this.

It's still unclear what the actual problem is: the footprint of copying the distances or the memory overhead inside the quantile algorithm.

dapladoc commented Dec 2, 2022

I like the idea of an approximation. I thought about an even simpler solution: we can consider only a part of the pairwise distances if the matrix is too big. For example, randomly pick 10% of them or something like this.

I don't like this idea. It is a hard-to-control approach; I think it could have very high variance.

Here is a small simulation. I drew pos_dist and neg_dist from gamma distributions. The size of pos_dist is 1,000, and the size of neg_dist is 1,000,000. Then, in a loop, I randomly take 10% of the negative distances, evaluate fnmr@fmr, and store it. Afterwards, I plot histograms of the fnmr@fmr approximations and the true fnmr@fmr for different fmr_vals.

The simulation of fnmr@fmr approximation on random subsets.
import numpy as np
import torch
import matplotlib.pyplot as plt
from tqdm import tqdm

from oml.functional.metrics import calc_fnmr_at_fmr  # import path may differ between OML versions

# ground-truth fnmr@fmr on the full set of distances
pos_dist = torch.from_numpy(np.random.gamma(shape=1, scale=1, size=1000))
neg_dist = torch.from_numpy(np.random.gamma(shape=6, scale=1, size=1_000_000))
fmr_vals = (0.1, 0.5, 1, 5)
fnmr_true = calc_fnmr_at_fmr(pos_dist, neg_dist, fmr_vals).numpy()

# approximation: recompute fnmr@fmr on random ~10% subsets of the negative distances
fnmr_approx = []
for i in tqdm(range(10000)):
    _neg_dist = neg_dist[torch.rand(neg_dist.shape) > 0.9]  # keep ~10% of the negatives
    fnmr_approx.append(calc_fnmr_at_fmr(pos_dist, _neg_dist, fmr_vals).tolist())
fnmr_approx = np.array(fnmr_approx)

# histograms of the approximations vs the true values
fig, ax = plt.subplots(len(fnmr_true))
for i in range(len(fnmr_true)):
    ax[i].hist(fnmr_approx[:, i], bins=100)
    ax[i].axvline(fnmr_true[i], color='r')
    ax[i].set_title(f'fmr_val = {fmr_vals[i]}')
    print(f'Std(fnmr@fmr({fmr_vals[i]})) = {fnmr_approx[:, i].std():.4e}')
    print(f'Bias(fnmr@fmr({fmr_vals[i]})) = {np.mean(np.abs(fnmr_approx[:, i] - fnmr_true[i])):.4e}')
fig.set_size_inches(6.4, 4.8 * 3)

And here is its output

Std(fnmr@fmr(0.1)) = 5.5389e-01
Bias(fnmr@fmr(0.1)) = 4.0927e-01
Std(fnmr@fmr(0.5)) = 3.2004e-01
Bias(fnmr@fmr(0.5)) = 2.2448e-01
Std(fnmr@fmr(1)) = 1.5432e-01
Bias(fnmr@fmr(1)) = 6.6530e-02
Std(fnmr@fmr(5)) = 9.9451e-02
Bias(fnmr@fmr(5)) = 7.1290e-02

Here are histograms of fnmr@fmr approximations. The red lines are "ground truth".

[image: fnmr_approximations_errors]

Usually we are interested in fnmr@fmr for small values of fmr, and that is exactly where the bias is the largest.

AlekseySh commented:

The graphs above seem okay to me, since I consider fnmr@fmr only as an additional (auxiliary) metric. So, it's not super important whether 30% or 32% of the positive distances are smaller than 10% of the negative ones.
Of course, if I were submitting these results somewhere, it would not be accurate enough.

AlekseySh commented:

If we want to work on this problem systematically, we first have to measure the following (for example, on the SOP dataset):

  1. The memory needed to compute the 3 basic metrics (CMC, precision, AP)
  2. The memory needed to do p.1 + copy the positive and negative distances
  3. The memory needed to do p.2 + compute the fnmr@fmr metric

Then it would be clear what should be optimized and how.
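
A rough way to obtain these numbers (illustrative sketch, not OML code; psutil is an external dependency, and the stage functions in the usage comments are placeholders):

import os

import psutil


def rss_mb() -> float:
    # current resident set size of this process, in MiB
    return psutil.Process(os.getpid()).memory_info().rss / 2 ** 20


def report(stage: str) -> None:
    print(f"{stage}: RSS = {rss_mb():.1f} MiB")


# usage sketch; ideally run each stage in a fresh process to exclude caching effects
# report("baseline")
# compute_basic_metrics(...)                      # p.1, placeholder for CMC / precision / AP
# report("after basic metrics")
# pos_dist, neg_dist = split_distances(...)       # p.2, placeholder for copying the distances
# report("after copying pos/neg distances")
# calc_fnmr_at_fmr(pos_dist, neg_dist, fmr_vals)  # p.3
# report("after fnmr@fmr")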

PS. IMHO we can postpone the problem for now (#250 is enough) and focus on tasks with higher priorities. In the end, it's probably not optimal to optimize the FMR metric without (at least) having examples from the verification domain.

dapladoc commented Dec 2, 2022

The graphs above seem okay to me, since I consider fnmr@fmr only as an additional (auxiliary) metric. So, it's not super important whether 30% or 32% of the positive distances are smaller than 10% of the negative ones. Of course, if I were submitting these results somewhere, it would not be accurate enough.

It really depends on the domain. For security-related tasks (e.g., biometrics), 99.9% and 99.99% are noticeably different results.

PS. IMHO we can postpone the problem for now (#250 is enough) and focus on tasks with higher priorities. In the end, it's probably not optimal to optimize the FMR metric without (at least) having examples from the verification domain.

I agree with you. More thorough memory measurements should be done in order to find possible ways to solve this issue.

@AlekseySh AlekseySh moved this from backlog to To do in OML-planning Mar 12, 2023
@AlekseySh AlekseySh moved this from To do to backlog in OML-planning Mar 13, 2023
@AlekseySh AlekseySh linked a pull request Apr 1, 2024 that will close this issue