
TDVP convergence issues in multi-process execution #1771

Open
andrea-lizzit opened this issue Apr 16, 2024 · 4 comments

@andrea-lizzit

Distributing a job across multiple MPI processes results in much worse R_hat values for TDVP.

  • Expected result: changing the number of processes over which the job is distributed does not change the results.
  • Actual result: changing the number of processes over which the job is distributed changes the R_hat metric and often produces incorrect results.

The plot below shows three simulation runs on the Ising model. The upper row shows an observable measured during time evolution; the bottom row shows the R_hat metric of "Generator". Running the script on one process with n_samples=1000 produces meaningful results (orange lines). When the same script is distributed over two processes, the Markov chain does not converge and the measured observable takes different values (blue lines). Increasing the samples to n_samples=2000 on two processes restores the correct behaviour (green lines).

[Plot: netket_multiprocess_test — observable (top row) and R_hat of "Generator" (bottom row) for the three runs]

Example script:

import netket as nk
import netket.experimental as nkx
from netket.operator.spin import sigmaz, sigmax
import numpy as np
from pathlib import Path
rng = np.random.default_rng()

n_samples = 1000
def main(L, W0, T, output_dir):
    g = nk.graph.Hypercube(length=L, n_dim=1, pbc=False)
    hi = nk.hilbert.Spin(s=1/2, N=L)
    H = nk.operator.Ising(hi, graph=g, h=W0)
    
    rbm = nk.models.RBM(alpha=2, param_dtype=complex)
    sampler = nk.sampler.MetropolisLocal(hi)
    vstate = nk.vqs.MCState(sampler, rbm, n_samples=n_samples)

    integrator = nkx.dynamics.Heun(dt=0.001)
    qgt = nk.optimizer.qgt.QGTJacobianDense(holomorphic=True)
    te = nkx.TDVP(H, vstate, integrator, qgt=qgt)
    obs = {"Corr": sigmaz(hi, 1) * sigmaz(hi, 4), "Z": sigmaz(hi, 4), "X": sigmax(hi, 4)}

    res = te.run(T=T, out=str(output_dir / "data"), obs=obs)
    return res

if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser(description='LRTFIM')
    parser.add_argument("L", type=int, help="System size")
    parser.add_argument("W0", type=float, help="Field strength")
    parser.add_argument("-T", type=float, help="Simulation time", default=10.0)
    parser.add_argument("--output_dir", type=str, help="Output directory", default="output")
    args = parser.parse_args()

    L, W0 = args.L, args.W0
    output_dir = Path(args.output_dir) / ("test_netket_" + str(n_samples))
    res = main(L, W0, args.T, output_dir)

The issue seems to persist when running on multiple GPUs with native jax parallelism.
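
For reference, a minimal sketch (not from the original report) of what is meant by native jax parallelism here, assuming NetKet's experimental sharding flag is the mechanism in question; the flag has to be set before netket is imported:

# Minimal sketch: enable NetKet's native jax sharding instead of MPI.
# Assumes the NETKET_EXPERIMENTAL_SHARDING environment flag.
import os
os.environ["NETKET_EXPERIMENTAL_SHARDING"] = "1"  # must be set before importing netket

import jax
import netket as nk

# All visible GPUs are used automatically; samples are sharded across them.
print(jax.devices())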

@PhilipVinc
Member

I guess this is because by default netket uses 16 chains per device (which, by the way, is extraordinarily inefficient for GPUs) and n_discard=n_samples/10 (also very very inefficient).

So for 1000 samples and 1 device your chains have length 1000/16 ≈ 62, with 100 steps of thermalisation after every change of parameters.

With 2000 samples and 2 devices you have chains of length 2000/32 with 200 steps of thermalisation after every change of parameters, so it should work even better than the case above.

With 1000 samples and 2 devices you have chains of length 1000/32 ≈ 31 and 100 steps of thermalisation after every change of parameters.
So the chains are shorter, though with that much thermalisation I'm surprised it plays a role...
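
To make the chain layout explicit, a minimal sketch (assuming the n_chains keyword of nk.sampler.MetropolisLocal and the n_discard_per_chain keyword of nk.vqs.MCState; the numbers follow the defaults described above):

import netket as nk

hi = nk.hilbert.Spin(s=1/2, N=20)
rbm = nk.models.RBM(alpha=2, param_dtype=complex)

# 16 chains per process is the default; with 2 processes that is 32 chains in total.
sampler = nk.sampler.MetropolisLocal(hi, n_chains=32)

vstate = nk.vqs.MCState(
    sampler,
    rbm,
    n_samples=1000,           # total samples, split over all chains: 1000/32 ≈ 31 per chain
    n_discard_per_chain=100,  # thermalisation steps after every change of parameters
)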

@PhilipVinc
Member

@andrea-lizzit can you share the parameters for the simulations you plotted above? The command-line arguments you used to launch them.

@andrea-lizzit
Author

Here it is:
python test_netket.py 20 1 -T 1 --output_dir test_netket
To produce the different plots I manually changed n_samples on line 8 to the values 1000 or 2000, and changed the Slurm environment variables to distribute the job over 1 or 2 MPI processes.
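
A minimal sketch of exposing n_samples as a command-line flag instead of editing line 8 (the --n_samples flag is hypothetical, not part of the script above):

import argparse

parser = argparse.ArgumentParser(description='LRTFIM')
parser.add_argument("L", type=int, help="System size")
parser.add_argument("W0", type=float, help="Field strength")
parser.add_argument("-T", type=float, help="Simulation time", default=10.0)
parser.add_argument("--output_dir", type=str, help="Output directory", default="output")
parser.add_argument("--n_samples", type=int, default=1000, help="Total number of MC samples")
args = parser.parse_args()

n_samples = args.n_samples  # then passed to nk.vqs.MCState(..., n_samples=n_samples)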

@PhilipVinc
Member

Thanks. I'll try to find the time to look into it. However, a quick thing I noticed is that you're starting from a random state, which changes every time. Already setting

    vstate = nk.vqs.MCState(sampler, rbm, n_samples=n_samples, seed=1234)

should already make the runs more reproducible.
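
If the initial Markov-chain configurations should also be fixed, a minimal sketch (assuming MCState's separate sampler_seed keyword):

    vstate = nk.vqs.MCState(
        sampler,
        rbm,
        n_samples=n_samples,
        seed=1234,          # seeds the RBM parameter initialisation
        sampler_seed=1234,  # seeds the initial Markov-chain configurations
    )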
