Decrease memory footprint of computing FMR metric #251

Open
AlekseySh opened this issue Dec 1, 2022 · 6 comments
AlekseySh commented Dec 1, 2022

Initial discussion is in #250

@AlekseySh AlekseySh self-assigned this Dec 1, 2022
@AlekseySh AlekseySh removed their assignment Dec 1, 2022
@AlekseySh AlekseySh added this to To do in OML-planning via automation Dec 1, 2022
@AlekseySh AlekseySh moved this from To do to backlog in OML-planning Dec 1, 2022
AlekseySh commented Dec 1, 2022

from @dapladoc :

I thought about algorithms for finding quantiles online, in a batch mode. I found that there is the P² algorithm (link), and here is a discussion about it (link).
Then I found more articles about quantile estimation: see the references in this article (link). It looks like all of these algorithms work only element-by-element, so there is no way to parallelize them. I'm not a specialist in this, though.
Maybe it could be done efficiently in Cython. I'll try to implement P² (or some other algorithm) in Cython.
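
For reference, a minimal sketch of the P² idea written from its published description (illustrative code, not part of OML; the class name and API are made up):

import numpy as np


class P2Quantile:
    """Streaming estimate of a single quantile p using 5 markers and O(1) memory."""

    def __init__(self, p: float):
        self.p = p
        self.heights = []  # marker heights; the middle one (index 2) tracks the p-quantile
        self.pos = np.arange(1, 6, dtype=float)  # actual marker positions n_i
        self.desired = np.array([1, 1 + 2 * p, 1 + 4 * p, 3 + 2 * p, 5], dtype=float)  # desired positions
        self.increments = np.array([0, p / 2, p, (1 + p) / 2, 1], dtype=float)

    def update(self, x: float) -> None:
        if len(self.heights) < 5:  # bootstrap: collect and sort the first 5 observations
            self.heights.append(x)
            if len(self.heights) == 5:
                self.heights = np.array(sorted(self.heights), dtype=float)
            return
        q, n = self.heights, self.pos
        # find the cell containing x, updating the extreme markers if needed
        if x < q[0]:
            q[0], k = x, 0
        elif x >= q[4]:
            q[4], k = x, 3
        else:
            k = int(np.searchsorted(q, x, side="right")) - 1
        n[k + 1:] += 1                 # markers above the cell shift one position to the right
        self.desired += self.increments
        for i in range(1, 4):          # move the inner markers towards their desired positions
            d = self.desired[i] - n[i]
            if (d >= 1 and n[i + 1] - n[i] > 1) or (d <= -1 and n[i - 1] - n[i] < -1):
                d = np.sign(d)
                # piecewise-parabolic prediction of the new marker height
                qp = q[i] + d / (n[i + 1] - n[i - 1]) * (
                    (n[i] - n[i - 1] + d) * (q[i + 1] - q[i]) / (n[i + 1] - n[i])
                    + (n[i + 1] - n[i] - d) * (q[i] - q[i - 1]) / (n[i] - n[i - 1])
                )
                if not (q[i - 1] < qp < q[i + 1]):  # fall back to linear interpolation
                    j = i + int(d)
                    qp = q[i] + d * (q[j] - q[i]) / (n[j] - n[i])
                q[i] = qp
                n[i] += d

    @property
    def value(self) -> float:
        return float(self.heights[2]) if len(self.heights) >= 5 else float(np.median(self.heights))

The update is inherently per-element, which is exactly the sequential limitation mentioned above; what it buys is O(1) memory instead of storing all negative distances.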

Another solution is to approximate the quantile (and the fnmr@fmr) by means of a histogram that could be computed incrementally. I can try to estimate the accuracy of this approach on some synthetic data.
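
A minimal sketch of the histogram idea (illustrative, not OML code; the class name and the value range are assumptions):

import torch


class StreamingHistogramQuantile:
    """Fixed-bin histogram accumulated batch by batch; quantiles are read from the CDF."""

    def __init__(self, lo: float, hi: float, n_bins: int = 10_000):
        # the value range has to be known (or guessed) in advance; accuracy ~ bin width
        self.lo, self.hi = lo, hi
        self.edges = torch.linspace(lo, hi, n_bins + 1)
        self.counts = torch.zeros(n_bins)

    def update(self, batch: torch.Tensor) -> None:
        # histc processes the whole batch at once, so there is no per-element Python loop
        self.counts += torch.histc(batch.float(), bins=len(self.counts), min=self.lo, max=self.hi)

    def quantile(self, q: float) -> float:
        cdf = torch.cumsum(self.counts, dim=0) / self.counts.sum()
        idx = int(torch.sum(cdf < q))          # first bin where the CDF reaches q
        return float(self.edges[min(idx + 1, len(self.edges) - 1)])  # right edge of that bin

Unlike random subsampling, the approximation error here is deterministic and bounded by the bin width; the drawback is that lo/hi must cover the actual range of distances.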

At the moment I don't have other ideas, but I will think about it, and maybe something new will come up.

AlekseySh commented:

from me:

@dapladoc thank you for your thoughts and links! Please don't start working on a Cython implementation (unless you really want to); let's discuss it for now. I am afraid that the possible profit is much smaller than the potential effort needed to implement this algorithm. Let's just discuss the possibilities for now.

I like the idea of an approximation. I thought about an even simpler solution: we can consider only a part of the pairwise distances if the matrix is too big. For example, randomly pick 10% of them or something like this.

It's still unclear what the actual problem is: the footprint of copying the distances or the memory overhead inside the quantile algorithm.

dapladoc commented Dec 2, 2022

I like the idea of an approximation. I thought about an even simpler solution: we can consider only a part of the pairwise distances if the matrix is too big. For example, randomly pick 10% of them or something like this.

I don't like this idea. It is a hard-to-control approach; I think it could have very high variance.

Here is a small simulation. I drew pos_dist and neg_dist from gamma distributions. The size of pos_dist is 1,000, and the size of neg_dist is 1,000,000. Then, in a loop, I randomly take 10% of the negative distances, evaluate fnmr@fmr, and store it. Afterwards, I plot histograms of the fnmr@fmr approximations and the true fnmr@fmr for different fmr_vals.

The simulation of fnmr@fmr approximation on random subsets.
import numpy as np
import torch
import matplotlib.pyplot as plt
from tqdm import tqdm

from oml.functional.metrics import calc_fnmr_at_fmr  # import path may differ between OML versions

# ground-truth fnmr@fmr on the full set of distances
pos_dist = torch.from_numpy(np.random.gamma(shape=1, scale=1, size=1000))
neg_dist = torch.from_numpy(np.random.gamma(shape=6, scale=1, size=1_000_000))
fmr_vals = (0.1, 0.5, 1, 5)
fnmr_true = calc_fnmr_at_fmr(pos_dist, neg_dist, fmr_vals).numpy()

# approximation: recompute fnmr@fmr on random ~10% subsets of the negative distances
fnmr_approx = []
for i in tqdm(range(10000)):
    _neg_dist = neg_dist[torch.rand(neg_dist.shape) > 0.9]  # keep ~10% of the negatives
    fnmr_approx.append(calc_fnmr_at_fmr(pos_dist, _neg_dist, fmr_vals).tolist())
fnmr_approx = np.array(fnmr_approx)

# histograms of the approximations vs the true values
fig, ax = plt.subplots(len(fnmr_true))
for i in range(len(fnmr_true)):
    ax[i].hist(fnmr_approx[:, i], bins=100)
    ax[i].axvline(fnmr_true[i], color='r')
    ax[i].set_title(f'fmr_val = {fmr_vals[i]}')
    print(f'Std(fnmr@fmr({fmr_vals[i]})) = {fnmr_approx[:, i].std():.4e}')
    print(f'Bias(fnmr@fmr({fmr_vals[i]})) = {np.mean(np.abs(fnmr_approx[:, i] - fnmr_true[i])):.4e}')
fig.set_size_inches(6.4, 4.8 * 3)

And here is its output

Std(fnmr@fmr(0.1)) = 5.5389e-01
Bias(fnmr@fmr(0.1)) = 4.0927e-01
Std(fnmr@fmr(0.5)) = 3.2004e-01
Bias(fnmr@fmr(0.5)) = 2.2448e-01
Std(fnmr@fmr(1)) = 1.5432e-01
Bias(fnmr@fmr(1)) = 6.6530e-02
Std(fnmr@fmr(5)) = 9.9451e-02
Bias(fnmr@fmr(5)) = 7.1290e-02

Here are histograms of fnmr@fmr approximations. The red lines are "ground truth".

[image: fnmr_approximations_errors]

Usually we are interested in fnmr@fmr for small values of fmr, and that is exactly where the bias is the largest.

AlekseySh commented:

The graphs above seem okay to me, since I consider fnmr@fmr only as an additional (auxiliary) metric. So, it's not super important whether 30% or 32% of the positive distances are smaller than 10% of the negative ones.
Of course, if I were submitting these results somewhere, it would not be accurate enough.

AlekseySh commented:

If we want to work on this problem systematically, we first have to measure the following (for example, on the SOP dataset):

  1. The memory needed to compute the 3 basic metrics (CMC, precision, AP)
  2. The memory needed to do p.1 + copy the positive and negative distances
  3. The memory needed to do p.2 + compute the fnmr@fmr metric

Then it would be clear what should be optimized and how.
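
A rough way to obtain these numbers (illustrative sketch, not OML code; psutil is an external dependency, and the stage functions in the usage comments are placeholders):

import os

import psutil


def rss_mb() -> float:
    # current resident set size of this process, in MiB
    return psutil.Process(os.getpid()).memory_info().rss / 2 ** 20


def report(stage: str) -> None:
    print(f"{stage}: RSS = {rss_mb():.1f} MiB")


# usage sketch; ideally run each stage in a fresh process to exclude caching effects
# report("baseline")
# compute_basic_metrics(...)                      # p.1, placeholder for CMC / precision / AP
# report("after basic metrics")
# pos_dist, neg_dist = split_distances(...)       # p.2, placeholder for copying the distances
# report("after copying pos/neg distances")
# calc_fnmr_at_fmr(pos_dist, neg_dist, fmr_vals)  # p.3
# report("after fnmr@fmr")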

PS. IMHO we can postpone the problem for now (#250 is enough) and focus on tasks with higher priorities. In the end, it's probably not optimal to optimize the FMR metric without (at least) having examples from the verification domain.

dapladoc commented Dec 2, 2022

The graphs above seem okay to me, since I consider fnmr@fmr only as an additional (auxiliary) metric. So, it's not super important whether 30% or 32% of the positive distances are smaller than 10% of the negative ones. Of course, if I were submitting these results somewhere, it would not be accurate enough.

It really depends on the domain. For security-related tasks (e.g., biometrics), 99.9% and 99.99% are noticeably different results.

PS. IMHO we can postpone the problem for now (#250 is enough) and focus on tasks with higher priorities. In the end, it's probably not optimal to optimize the FMR metric without (at least) having examples from the verification domain.

I agree with you. More thorough memory measurements should be done in order to find possible ways to solve this issue.

@AlekseySh AlekseySh moved this from backlog to To do in OML-planning Mar 12, 2023
@AlekseySh AlekseySh moved this from To do to backlog in OML-planning Mar 13, 2023
@AlekseySh AlekseySh linked a pull request Apr 1, 2024 that will close this issue