
Adding or removing one sample results in absolutely different embeddings #50

Open
cowjen01 opened this issue Feb 21, 2022 · 1 comment

Comments

@cowjen01

My code is the following:

pymde.seed(0)
mde = pymde.preserve_neighbors(
    matrix[: 1001], # matrix[: 1000]
    embedding_dim=2,
    init='random',
    device='cpu',
    constraint=pymde.Centered(),
    verbose=self.verbose
)
embeddings = mde.embed(verbose=self.verbose)
embeddings = embeddings.cpu().numpy()

When I use the first 1,000 samples from the input matrix, I get very different results than when I use one more sample (1,001).

Here is the log:

Feb 21 07:21:55 PM: Computing 5-nearest neighbors, with max_distance=None
Feb 21 07:21:55 PM: Exact nearest neighbors by brute force 
Feb 21 07:21:55 PM: Your dataset appears to contain duplicated items (rows); when embedding, you should typically have unique items.
Feb 21 07:21:55 PM: The following items have duplicates [261 262 264 385 394 490 521 542 547 592 715]
Feb 21 07:21:55 PM: Fitting a centered embedding into R^2, for a graph with 1001 items and 9562 edges.
Feb 21 07:21:55 PM: `embed` method parameters: eps=1.0e-05, max_iter=300, memory_size=10
Feb 21 07:21:55 PM: iteration 000 | distortion 0.773313 | residual norm 0.0166138 | step length 30.3 | percent change 1.09275
Feb 21 07:21:55 PM: iteration 030 | distortion 0.372009 | residual norm 0.00494183 | step length 1 | percent change 5.72445
Feb 21 07:21:55 PM: iteration 060 | distortion 0.305200 | residual norm 0.00271112 | step length 1 | percent change 3.55324
Feb 21 07:21:56 PM: iteration 090 | distortion 0.284056 | residual norm 0.00196794 | step length 1 | percent change 2.22588
Feb 21 07:21:56 PM: iteration 120 | distortion 0.277153 | residual norm 0.000870837 | step length 1 | percent change 0.436913
Feb 21 07:21:56 PM: iteration 150 | distortion 0.275639 | residual norm 0.00086974 | step length 1 | percent change 1.04672
Feb 21 07:21:56 PM: iteration 180 | distortion 0.272377 | residual norm 0.00140454 | step length 1 | percent change 1.2704
Feb 21 07:21:56 PM: iteration 210 | distortion 0.269552 | residual norm 0.000706442 | step length 1 | percent change 0.560233
Feb 21 07:21:56 PM: iteration 240 | distortion 0.267543 | residual norm 0.00103134 | step length 1 | percent change 0.558733
Feb 21 07:21:56 PM: iteration 270 | distortion 0.265752 | residual norm 0.000605354 | step length 1 | percent change 0.259163
Feb 21 07:21:56 PM: iteration 299 | distortion 0.265053 | residual norm 0.000348569 | step length 1 | percent change 0.0578442
Feb 21 07:21:56 PM: Finished fitting in 0.660 seconds and 300 iterations.
Feb 21 07:21:56 PM: average distortion 0.265 | residual norm 3.5e-04

And here the output embeddings:

[Screenshot: output embeddings, 2022-02-21 19:26:46]

[Screenshot: output embeddings, 2022-02-21 19:26:56]

Is this expected behaviour? I thought adding one sample should not make this much of a difference.

Thank you for helping me out!

@akshayka
Member

It depends on how close the new sample is, on average, to the first 1,000 samples. If it's a nearest neighbor of many of the original samples, then the embedding may look a bit different.

A few things you can try:

  • Use the align function to align the first embedding to the first 1000 rows of the second embedding (https://pymde.org/api/index.html#pymde.align)
  • Supply a small value of eps to the embed method (mde.embed(verbose=True, eps=1e-6) for example)
  • Do an incremental embedding by using an Anchored constraint to pin the first 1000 samples to the original embedding.

If you give me access to the data I can play around with your example when I have some free time.

Additionally, I see that the log contains the following line:

Feb 21 07:21:55 PM: Your dataset appears to contain duplicated items (rows); when embedding, you should typically have unique items.

Having duplicates is typically ill-advised (and can sometimes lead to unexpected behavior), since duplicates don't really make sense in the context of the embedding problem: you don't need two representations of the same thing.
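For reference, one quick way to drop duplicate rows before embedding, sketched here with NumPy on a synthetic matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
matrix = rng.normal(size=(10, 3))
matrix[7] = matrix[2]  # introduce a duplicate row

# np.unique(axis=0) finds unique rows; return_index gives the position of
# each row's first occurrence. Sorting those indices preserves the
# original row order while keeping only first occurrences.
_, first_idx = np.unique(matrix, axis=0, return_index=True)
deduped = matrix[np.sort(first_idx)]
```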
