
PBSControllerLauncher: Unable to connect_client_sync() #619

Open
lukas-koschmieder opened this issue Oct 28, 2021 · 16 comments

@lukas-koschmieder

I have been using ipyparallel 6 for a while and would like to migrate to ipyparallel 7, mainly because the new Cluster API lets you manage the entire process from a Jupyter Notebook. Unfortunately, I am having difficulty connecting a client to my cluster.

I have created a new IPython profile with a custom ipcluster_config.py, a modified version of my existing, working config for ipp 6 (see below).

I can successfully start a cluster spawning two PBS jobs (controller and engine).

import ipyparallel as ipp

cluster=ipp.Cluster(
    n=128, 
    controller_ip='*',
    profile='pbs'
)
# Using existing profile dir: '/network/datamic/home/lukask/.aixvipmap/.ipython/profile_pbs'
await cluster.start_cluster()
# Job submitted with job id: '23777'
# Starting 4 engines with <class 'ipyparallel.cluster.launcher.PBSEngineSetLauncher'>
# Job submitted with job id: '23778'
# <Cluster(cluster_id='1635407068-xsc7', profile='pbs', controller=<running>, engine_sets=['1635407070'])>

But if I run the following line, the notebook only shows that the kernel is busy, and the call never returns.

rc = cluster.connect_client_sync()

Am I using the API incorrectly? What might be the problem?

ipcluster_config.py

c.Cluster.engine_launcher_class = 'ipyparallel.cluster.launcher.PBSEngineSetLauncher'

c.Cluster.controller_launcher_class = 'ipyparallel.cluster.launcher.PBSControllerLauncher'

c.PBSControllerLauncher.batch_template = '''
#PBS -N ipcontroller
#PBS -j oe
#PBS -l walltime=01:00:00
#PBS -l nodes=1:ppn=1

cd $PBS_O_WORKDIR

conda activate ipp7

ipcontroller --profile-dir={profile_dir}
'''

c.PBSEngineSetLauncher.batch_template = '''
#PBS -N ipengine
#PBS -j oe
#PBS -l walltime=01:00:00
#PBS -l nodes={n//4}:ppn=4

cd $PBS_O_WORKDIR

conda activate ipp7

module load intel
mpiexec -n {n} ipengine --profile-dir={profile_dir}
'''
@lukas-koschmieder
Author

The program hangs in the while loop at

while not all(os.path.isfile(f) for f in paths):
because the connection files created in PROFILE/security are named ipcontroller-client.json and ipcontroller-engine.json, whereas the program expects the filenames to include the cluster_id, e.g. ipcontroller-1635416202-ou8n-client.json and ipcontroller-1635416202-ou8n-engine.json. Is this also fixed in #606?
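The filename mismatch can be sketched in a few lines (the security directory path and cluster id below are hypothetical, for illustration only):

```python
import os.path

# Hypothetical profile security directory and cluster id.
security_dir = "/home/user/.ipython/profile_pbs/security"
cluster_id = "1635416202-ou8n"

# Filenames the wait loop polls for when a cluster_id is set:
expected = [
    os.path.join(security_dir, f"ipcontroller-{cluster_id}-{role}.json")
    for role in ("client", "engine")
]

# Filenames written by a template that hard-codes
# `ipcontroller --profile-dir=...` without `--cluster-id`:
actual = [
    os.path.join(security_dir, f"ipcontroller-{role}.json")
    for role in ("client", "engine")
]

# The expected names never appear on disk, so the loop spins forever.
assert not set(expected) & set(actual)
```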

@minrk
Member

minrk commented Oct 28, 2021

This is why I need to get CI tests for all the non-slurm batch launchers (#604)!

I do believe the issue is fixed in dev, but those custom templates will still reintroduce the problem. Adding --cluster-id={cluster_id} to the commands will fix it. The generic fix is to use {program_and_args} instead of ipengine --profile-dir={profile_dir}.

I believe these templates will work:

c.PBSControllerLauncher.batch_template = '''
#PBS -N ipcontroller
#PBS -V
#PBS -j oe
#PBS -l walltime=01:00:00
#PBS -l nodes=1:ppn=1

cd $PBS_O_WORKDIR

conda activate ipp7

{program_and_args}
'''

c.PBSEngineSetLauncher.batch_template = '''
#PBS -N ipengine
#PBS -j oe
#PBS -V
#PBS -l walltime=01:00:00
#PBS -l nodes={n//4}:ppn=4

cd $PBS_O_WORKDIR

conda activate ipp7

module load intel
mpiexec -n {n} {program_and_args}
'''

The next release uses environment variables to pass things like the cluster id, which means you must add #PBS -V to your options.
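Incidentally, the launchers fill these templates with an evaluating formatter rather than plain str.format, which is why expressions like {n//4} are allowed in them. A toy sketch of that behavior (not ipyparallel's actual implementation):

```python
import string

class EvalFormatter(string.Formatter):
    """Toy stand-in for the evaluating formatter ipyparallel's batch
    launchers use, so that expressions like {n//4} work in templates."""

    def get_field(self, name, args, kwargs):
        # Evaluate the field as a Python expression against the kwargs.
        return eval(name, {}, kwargs), name

template = "#PBS -l nodes={n//4}:ppn=4\nmpiexec -n {n} {program_and_args}"
out = EvalFormatter().format(
    template, n=16, program_and_args="ipengine --cluster-id=abc"
)
print(out)
# #PBS -l nodes=4:ppn=4
# mpiexec -n 16 ipengine --cluster-id=abc
```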

@lukas-koschmieder
Author

lukas-koschmieder commented Oct 28, 2021

Thank you for the quick reply! The general method works. 👍

Is it possible to instantiate a Cluster without an existing IPython profile and ipcluster_config.py, by passing c.PBSControllerLauncher.batch_template and c.PBSEngineSetLauncher.batch_template directly to the class constructor from a Jupyter Notebook?

Pseudocode:

controller_template='''
#PBS -N ipcontroller
#PBS -j oe
#PBS -l walltime=01:00:00
#PBS -l nodes=1:ppn=1
##PBS -q {queue}

cd $PBS_O_WORKDIR

conda activate ipp7

{program_and_args}
'''

engine_template = '''
#PBS -N ipengine
#PBS -j oe
#PBS -l walltime=01:00:00
#PBS -l nodes={n//4}:ppn=4
##PBS -q {queue}

cd $PBS_O_WORKDIR

conda activate ipp7

module load intel
mpiexec -n {n} {program_and_args}
'''

cluster=ipp.Cluster(
    n=4, 
    controller_ip='*',
    profile='pbs-2021-10-28',
    extra_options={ 
        'c.PBSControllerLauncher.batch_template':controller_template,
        'c.PBSEngineSetLauncher.batch_template':engine_template
    })

@minrk
Member

minrk commented Oct 28, 2021

Yes! You populate the cluster.config object, which is the same as c in your ipcluster_config.py:

cluster=ipp.Cluster(
    n=4, 
    controller_ip='*',
    profile='pbs-2021-10-28',
)
# this is the same config object you would configure in ipcluster_config.py
# you don't have to call it `c`, but if you do, the rest will look familiar
c = cluster.config

c.PBSControllerLauncher.batch_template = controller_template
c.PBSEngineSetLauncher.batch_template = engine_template

await cluster.start_cluster()

@lukas-koschmieder
Author

Fantastic! Thank you!

@minrk
Copy link
Member

minrk commented Oct 28, 2021

Adding lots of examples to my documentation todo list...

@minrk
Member

minrk commented Oct 28, 2021

I'll make an 8.0 beta tomorrow. It would be great if you could test it out!

@lukas-koschmieder
Author

Okay, great, I will test it.

@lukas-koschmieder
Author

I've got another question and a potential addition for the documentation todo list: how do you configure the controller dynamically from Python / a Jupyter Notebook (as a replacement for ipcontroller_config.py)? For instance, how would you set c.HubFactory.ip = '*'?

@minrk
Copy link
Member

minrk commented Oct 29, 2021

That can be c.Cluster.controller_ip via config, or since it's on the Cluster object, it can be a constructor argument:

Cluster(controller_ip="*")

HubFactory is removed and replaced by IPController, so if you do still have an ipcontroller_config.py, it would be c.IPController.ip = '*'.

The ambiguity is because there are really two things you are configuring:

  1. Cluster (and thereby Launchers) which start processes like ipcontroller, and
  2. ipcontroller, ipengine themselves

Some common options for configuring the controller itself can be set on the Cluster, but for the most part ipcontroller is configured directly through either ipcontroller_config.py or ControllerLauncher.controller_args. Cluster(controller_ip="*") is really a shortcut for c.ControllerLauncher.controller_args.append("--ip=*").
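The shortcut can be illustrated with plain Python; this is a sketch of the equivalence, not ipyparallel's code, and build_controller_cmd is a hypothetical helper:

```python
def build_controller_cmd(controller_ip=None, extra_args=()):
    """Assemble the ipcontroller command line the launcher would run
    (illustrative only; not ipyparallel's actual implementation)."""
    args = ["ipcontroller"]
    if controller_ip:
        # Cluster(controller_ip="*") is shorthand for appending this flag,
        # just like c.ControllerLauncher.controller_args.append("--ip=*").
        args.append(f"--ip={controller_ip}")
    args.extend(extra_args)
    return args

print(build_controller_cmd(controller_ip="*"))
# ['ipcontroller', '--ip=*']
```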

@minrk minrk changed the title Unable to connect_client_sync() PBSControllerLauncher: Unable to connect_client_sync() Oct 29, 2021
@minrk
Member

minrk commented Oct 29, 2021

I just published 8.0.0b1; please give it a try if you can.

@lukas-koschmieder
Author

lukas-koschmieder commented Oct 29, 2021

Okay, I'm currently in the middle of something but I will give it a try this afternoon/evening.

@minrk
Member

minrk commented Oct 29, 2021

No rush! I've only got a few more minutes of work before the weekend. I'll probably aim to do a release around the end of next week.

@lukas-koschmieder
Author

lukas-koschmieder commented Oct 31, 2021

I've installed the new beta version 8.0.0b1 and everything is looking fine, except for start_cluster_sync, which now appears to be significantly slower. The controller starts immediately, but there is a noticeable delay before the engines come up. I haven't tested whether this delay scales with the number of engines; I was using 4 engines in my test.

import time
start = time.time()
cluster.start_cluster_sync()
end = time.time()
print(end - start)

8.0.0b1 output

Job submitted with job id: '23909'
Starting 4 engines with <class 'ipyparallel.cluster.launcher.PBSEngineSetLauncher'>
Job submitted with job id: '23910'

30.14998745918274

7.1.0 (conda-forge) output

Job submitted with job id: '23911'
Starting 6 engines with <class 'ipyparallel.cluster.launcher.PBSEngineSetLauncher'>
Job submitted with job id: '23912'

1.15324068069458

Edit: If the release is next week, unfortunately I won't be able to participate in additional beta testing because I am on holiday until Nov 8th.

@minrk
Member

minrk commented Nov 1, 2021

except for start_cluster_sync, which now appears to be significantly slower.

That makes sense. It's the new Cluster.send_engines_connection_env option, which means by default start_cluster waits for the controller to finish starting before starting the engines, because the connection info is passed via environment through the Launcher. To disable this and rely on the connection files on disk (pre-8.0 behavior):

cluster = Cluster(send_engines_connection_env=False, engines='pbs', controller='pbs', controller_ip='*')

then the engine and controller jobs should both be submitted immediately.

@minrk
Member

minrk commented Nov 11, 2021

@lukas-koschmieder I just published 8.0.0rc1. Can you test and then close here if you think everything is resolved?
