
TDVP convergence issues in multi-process execution #1771

Open
andrea-lizzit opened this issue Apr 16, 2024 · 4 comments

@andrea-lizzit

Distributing a job across multiple MPI processes results in much worse R_hat values for TDVP.

  • Expected result: changing the number of processes over which the job is distributed does not change the results.
  • Actual result: changing the number of processes over which the job is distributed changes the R_hat metric and often produces incorrect results.

The plot below shows three simulation runs on the Ising model. The upper row shows an observable measured during time evolution; the bottom row shows the R_hat metric of "Generator". Running the script on one process with n_samples=1000 produces meaningful results (orange lines). When the same script is distributed over two processes, the Markov chain does not converge and the measured observable takes different values (blue lines). Increasing the samples to n_samples=2000 on two processes restores the correct behaviour (green lines).

[Plot: netket_multiprocess_test — observable (top row) and R_hat of "Generator" (bottom row) for the three runs]

Example script:

import netket as nk
import netket.experimental as nkx
from netket.operator.spin import sigmaz, sigmax
import numpy as np
from pathlib import Path
rng = np.random.default_rng()

n_samples = 1000
def main(L, W0, T, output_dir):
    g = nk.graph.Hypercube(length=L, n_dim=1, pbc=False)
    hi = nk.hilbert.Spin(s=1/2, N=L)
    H = nk.operator.Ising(hi, graph=g, h=W0)
    
    rbm = nk.models.RBM(alpha=2, param_dtype=complex)
    sampler = nk.sampler.MetropolisLocal(hi)
    vstate = nk.vqs.MCState(sampler, rbm, n_samples=n_samples)

    integrator = nkx.dynamics.Heun(dt=0.001)
    qgt = nk.optimizer.qgt.QGTJacobianDense(holomorphic=True)
    te = nkx.TDVP(H, vstate, integrator, qgt=qgt)
    obs = {"Corr": sigmaz(hi, 1) * sigmaz(hi, 4), "Z": sigmaz(hi, 4), "X": sigmax(hi, 4)}

    res = te.run(T=T, out=str(output_dir / "data"), obs=obs)
    return res

if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser(description='LRTFIM')
    parser.add_argument("L", type=int, help="System size")
    parser.add_argument("W0", type=float, help="Field strength")
    parser.add_argument("-T", type=float, help="Simulation time", default=10.0)
    parser.add_argument("--output_dir", type=str, help="Output directory", default="output")
    args = parser.parse_args()

    L, W0 = args.L, args.W0
    output_dir = Path(args.output_dir) / ("test_netket_" + str(n_samples))
    res = main(L, W0, args.T, output_dir)

The issue seems to persist when running on multiple GPUs with native jax parallelism.
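
For reference, a minimal sketch (not from the original report) of what is meant by native jax parallelism here, assuming NetKet's experimental sharding flag is the mechanism in question; the flag has to be set before netket is imported:

# Minimal sketch: enable NetKet's native jax sharding instead of MPI.
# Assumes the NETKET_EXPERIMENTAL_SHARDING environment flag.
import os
os.environ["NETKET_EXPERIMENTAL_SHARDING"] = "1"  # must be set before importing netket

import jax
import netket as nk

# All visible GPUs are used automatically; samples are sharded across them.
print(jax.devices())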

@PhilipVinc
Member

I guess this is because by default netket uses 16 chains per device (which, by the way, is extraordinarily inefficient for GPUs) and n_discard=n_samples/10 (also very very inefficient).

So for 1000 samples and 1 device your chains have length 1000/16 ≈ 62, with 100 steps of thermalisation after every change of parameters.

With 2000 samples and 2 devices you have chains of length 2000/32 with 200 steps of thermalisation after every change of parameters, so it should work even better than the case above.

With 1000 samples and 2 devices you have chains of length 1000/32 ≈ 31 and 100 steps of thermalisation after every change of parameters.
So the chains are shorter, though with that much thermalisation I'm surprised it plays a role...
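
To make the chain layout explicit, a minimal sketch (assuming the n_chains keyword of nk.sampler.MetropolisLocal and the n_discard_per_chain keyword of nk.vqs.MCState; the numbers follow the defaults described above):

import netket as nk

hi = nk.hilbert.Spin(s=1/2, N=20)
rbm = nk.models.RBM(alpha=2, param_dtype=complex)

# 16 chains per process is the default; with 2 processes that is 32 chains in total.
sampler = nk.sampler.MetropolisLocal(hi, n_chains=32)

vstate = nk.vqs.MCState(
    sampler,
    rbm,
    n_samples=1000,           # total samples, split over all chains: 1000/32 ≈ 31 per chain
    n_discard_per_chain=100,  # thermalisation steps after every change of parameters
)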

@PhilipVinc
Member

@andrea-lizzit can you share the parameters for the simulations you plotted above? The command-line arguments you used to launch them.

@andrea-lizzit
Author

Here it is:
python test_netket.py 20 1 -T 1 --output_dir test_netket
To produce the different plots I manually changed n_samples on line 8 to the values 1000 or 2000, and changed the Slurm environment variables to distribute the job over 1 or 2 MPI processes.
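
A minimal sketch of exposing n_samples as a command-line flag instead of editing line 8 (the --n_samples flag is hypothetical, not part of the script above):

import argparse

parser = argparse.ArgumentParser(description='LRTFIM')
parser.add_argument("L", type=int, help="System size")
parser.add_argument("W0", type=float, help="Field strength")
parser.add_argument("-T", type=float, help="Simulation time", default=10.0)
parser.add_argument("--output_dir", type=str, help="Output directory", default="output")
parser.add_argument("--n_samples", type=int, default=1000, help="Total number of MC samples")
args = parser.parse_args()

n_samples = args.n_samples  # then passed to nk.vqs.MCState(..., n_samples=n_samples)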

@PhilipVinc
Member

Thanks. I'll try to find the time to look into it. However, a quick thing I noticed is that you're starting from a random state, which changes every time. Already setting

    vstate = nk.vqs.MCState(sampler, rbm, n_samples=n_samples, seed=1234)

should already make the runs more reproducible.
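
If the initial Markov-chain configurations should also be fixed, a minimal sketch (assuming MCState's separate sampler_seed keyword):

    vstate = nk.vqs.MCState(
        sampler,
        rbm,
        n_samples=n_samples,
        seed=1234,          # seeds the RBM parameter initialisation
        sampler_seed=1234,  # seeds the initial Markov-chain configurations
    )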
