
SLURM submission limited to PBSpro installs #20

Open
lesleygray opened this issue Jun 13, 2022 · 9 comments
Labels
bug Something isn't working

Comments

@lesleygray

Hi Marek,

Thank you for sharing your wonderful pipeline! MINTIE is working very well in local mode, run on our queue through an interactive job.

We have had problems with the cluster implementation, as Bpipe requires the qstat -x flag that is included in PBSpro; our qstat install is the one packaged with slurm-torque 18.08.4-1.el7, which does not support it.

Execution Command
nohup srun mintie -w -p params.txt cases/*.fastq.gz controls/*.fastq.gz &
This successfully submits 'fastq_dedup' to the queue as one job per sample.

Error
The pipeline hangs after 'fastq_dedup' completes successfully. The SLURM exit status is COMPLETE and the output fastq files are generated.

Outputs in .bpipe/bpipe.log:

bpipe.Utils	[38]	INFO	|11:57:27 Executing command: qstat -x 5451428 
bpipe.executor.TorqueStatusMonitor	[38]	WARNING	|11:57:27 Error occurred in processing torque output: java.lang.Exception: Error parsing torque output: unexpected error: Unknown option: x 

Environment
The MINTIE installation is version 0.3.9, installed via miniconda3/mamba. The package versions are in the yaml here:
mintie.yml.txt

The BPIPE scheduling configuration is as follows:

executor="slurm"

//controls the total number of procs MINTIE can spawn
//if running locally, ensure that concurrency is not
//set to more than the number of procs available. If
//running on a cluster, this can be increased
concurrency=10

//following commands are for running on a cluster
walltime="5-20:00:00"
queue="bigmem"
mem_param="mem-per-cpu"
memory="30"
proc_mode=1
usePollerFileWatcher=true
useLegacyTorqueJobPolling=true
procs=10
account="grayl"

//add server-specific module to load
modules="miniconda3"

commands {
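
(The commands section got cut off when pasting the config above. For context, per-command overrides in bpipe.config just follow the usual bpipe layout, roughly like the sketch below - the stage name and values are illustrative, not my real settings:)

commands {
    //example per-stage resource overrides (illustrative only)
    fastq_dedup {
        procs=4
        memory="16"
        walltime="12:00:00"
    }
}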

Thank you in advance for taking a look at this.
Lesley

@ssadedin

Hi @lesleygray - I see you've enabled the useLegacyTorqueJobPolling option, which is indeed intended for this scenario, but it seems Bpipe is not respecting that flag. To help debug it, could you check the Bpipe logs? If Bpipe is recognising the flag, it should print a message like this into the log:

Using legacy torque status polling

Are you seeing that? If you can let me know, it'll help a lot in figuring out why Bpipe isn't obeying the flag in your case.
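
A quick way to check from the pipeline directory is something like this (assuming the log is the same .bpipe/bpipe.log you quoted above):

//print any log lines mentioning the legacy torque polling mode
new File('.bpipe/bpipe.log').eachLine { line ->
    if (line.toLowerCase().contains('legacy torque'))
        println line
}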

Thanks!

@ssadedin

Oops, I just noticed you specified slurm as the executor, so I realise now this is definitely a bug: Bpipe should never be executing qstat when the executor is set to Slurm; it should be querying jobs with scontrol or squeue instead.

Unfortunately I don't have a Slurm cluster to test with right now, but if you're willing, I can make a fix and give you an updated bpipe version to try out. Let me know if you'd be up for that.
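
For reference, a Slurm-side status check would look more like the sketch below - just an illustration of the kind of call the executor should be making, not the actual bpipe code (the job id is the one from your log):

//query the job state directly from Slurm instead of going through qstat -x
def jobId = "5451428"
def proc = ["squeue", "-h", "-j", jobId, "-o", "%T"].execute()
proc.waitFor()
def state = proc.text.trim()
//empty output usually means the job has already left the queue (completed or expired)
println "Job ${jobId} state: ${state ?: 'no longer in queue'}"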

Thanks!

@lesleygray
Author

Sorry for the delay, Simon. No, I cannot see that message in the logs.

Here is a snippet:

...
bpipe.executor.CustomCommandExecutor    [40]    INFO    |11:49:38 Starting command: bash /data/Bioinfo/bioinfo-resources/apps/miniconda3/miniconda3-py39/envs/lesley_WGSEnv/envs/mintie/opt/bpipe-0.9.11/bin/../bin/bpipe-slurm.sh start
bpipe.executor.SlurmCommandExecutor     [40]    INFO    |11:49:38 Started command with id 5451428
bpipe.executor.TorqueCommandExecutor    [40]    INFO    |11:49:38 Forwarding file .bpipe/commandtmp/1/1.out
bpipe.ForwardHost       [40]    INFO    |11:49:38 Forwarding file .bpipe/commandtmp/1/1.out using forwarder bpipe.Forwarder@7c31947e
bpipe.ForwardHost       [40]    INFO    |11:49:38 Forwarding file .bpipe/commandtmp/1/1.err using forwarder bpipe.Forwarder@58ec15f2
bpipe.PipelineContext   [40]    INFO    |11:49:38 Create storage layer bpipe.storage.LocalFileSystemStorageLayer for output SP-17-4474-1A/SP-17-4474-1A.1.fastq.gz
bpipe.PipelineContext   [40]    INFO    |11:49:38 Create storage layer bpipe.storage.LocalFileSystemStorageLayer for output SP-17-4474-1A/SP-17-4474-1A.2.fastq.gz
bpipe.executor.ThrottledDelegatingCommandExecutor       [40]    INFO    |11:49:38 Waiting for command to complete before releasing 2 resources
bpipe.executor.TorqueStatusMonitor      [40]    INFO    |11:49:38 Starting torque status monitor ...
bpipe.Utils     [38]    INFO    |11:49:39 Executing command: qstat -x 5451428
bpipe.executor.TorqueStatusMonitor      [38]    WARNING |11:49:39 Error occurred in processing torque output: java.lang.Exception: Error parsing torque output: unexpected error: Unknown option: x
...

I am certainly happy to do some testing, task away!

@mcmero
Collaborator

mcmero commented Jul 5, 2022

I've also tried running a conda-installed MINTIE on a slurm cluster and I'm getting the same issue. I also tested with all bpipe versions from 0.9.9.9 to 0.9.11. A manual installation may fix the problem, but that is pretty fiddly.

I'm also happy to try testing with a patched version @ssadedin.

@ssadedin

ssadedin commented Oct 4, 2022

Sorry all for taking a long while to follow up.

In the end I realised this problem is probably addressed by a fix that has been in the codebase for quite a while; it's just that the version of bpipe installed by default with MINTIE is a few years old.

@mcmero it would be great to validate whether the latest codebase in master works with MINTIE. If so, I will be releasing it officially as bpipe 0.9.12 shortly, and it would be great to include that with MINTIE by default - what do you think?

Sorry again for taking ages to follow up!

@lesleygray
Author

Thanks for your response, Simon. Marek, I am still happy to run some testing if needed.

@mcmero
Collaborator

mcmero commented Oct 10, 2022

Thanks @ssadedin. Any chance you could send me a binary of the master build?

@mcmero
Collaborator

mcmero commented Nov 4, 2022

@ssadedin I've managed to compile the latest bpipe successfully, but it's still giving the same qstat error. Any ideas?

@lonsbio

lonsbio commented Nov 4, 2022

Coincidentally, I've come across this issue on something else unrelated to MINTIE. I've been debugging with an Ubuntu VM with slurm installed, running as both head node and server, which seems to be enough to trigger the error when switching from the local to the slurm executor.

I've tried legacy polling too, and the only clue I have is that, since the Slurm executor extends the Torque one, the useLegacyJobPolling assignment in the constructor may be overriding the config file. By my understanding of the logic, don't we want it to be true here, so that the legacy polling is used rather than the new pooled polling? (A rough sketch of what I mean follows the snippet below.)

class SlurmCommandExecutor extends TorqueCommandExecutor implements CommandExecutor {

    public static final long serialVersionUID = 0L

    /**
     * Constructor
     */
    SlurmCommandExecutor() {
        super(new File(System.getProperty("bpipe.home") + "/bin/bpipe-slurm.sh"))
        
        // The pooled status polling only works for PBS Torque
        this.useLegacyJobPolling = false
    }
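
Something along these lines is what I would have expected instead - just a sketch, since I'm not sure how the user's bpipe.config is actually surfaced to the executor, so the property lookup below is a placeholder:

    SlurmCommandExecutor() {
        super(new File(System.getProperty("bpipe.home") + "/bin/bpipe-slurm.sh"))

        //placeholder lookup: however the config value actually reaches the executor,
        //an explicit useLegacyTorqueJobPolling=true from the user should win here
        this.useLegacyJobPolling =
            Boolean.valueOf(System.getProperty("bpipe.useLegacyTorqueJobPolling", "false"))
    }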

@mcmero added the bug label Feb 9, 2024