Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Engine: Fix bug introduced when refactoring upload_calculation #6348

Closed

Conversation

sphuber
Copy link
Contributor

@sphuber sphuber commented Apr 16, 2024

In 6898ff4 the implementation of the
processing of the local_copy_list in the upload_calculation method
was changed. Originally, the files specified by the local_copy_list
were first copied into the SandboxFolder before copying its contents
to the working directory using the transport. The commit allowed the
order in which the local and sandbox files were copied, so the local
files were now no longer copied through the sandbox. Rather, they were
copied to a temporary directory on disk, which was then copied over
using the transport. The problem is that if the latter would copy over a
directory that was already created by the copying of the sandbox, an
exception would be raised.

For example, if the sandbox contained the directory folder and the
local_copy_list contained the items (_, 'file', './folder/file')
this would work just fine in the original implementation as the file
would be written to the folder on the remote folder. The current
implementation, however, would write the file content to folder/file
in a local temporary directory, and then iterate over the directories
and copy them over. Essentially it would be doing:

Transport.put('/tmpdir/folder', 'folder')

but since folder would already exist on the remote working directory
the local folder would be nested and so the final path on the remote
would be /workingdir/folder/folder/file.

The correct approach is to copy each item of the local_copy_list from
the local temporary directory individually using the Transport.put
interface and not iterate over the top-level entries of the temporary
directory at the end.

@sphuber sphuber added the priority/critical-blocking must be resolved before next release label Apr 16, 2024
@sphuber sphuber force-pushed the fix/local-copy-list-subdirectory branch from b851b73 to 13d400b Compare April 16, 2024 18:07
@sphuber
Copy link
Contributor Author

sphuber commented Apr 17, 2024

Note that this is a critical bug that is currently on main as it breaks for example things as simple as running a PwCalculation from aiida-quantumespresso so this has to be fixed before release.

@sphuber sphuber force-pushed the fix/local-copy-list-subdirectory branch from 13d400b to 20fc810 Compare April 19, 2024 09:32
@mbercx
Copy link
Member

mbercx commented Apr 24, 2024

Thanks @sphuber! @DrFedro also reported an issue related to this to me, i.e. that the pseudo folder would be copied inside an already existing one, causing QE to crash:

$ tree
.
├── CRASH
├── _aiidasubmit.sh
├── _scheduler-stderr.txt
├── _scheduler-stdout.txt
├── aiida.in
├── aiida.out
├── out
└── pseudo
    └── pseudo
        └── Si.pbesol-n-rrkjus_psl.1.0.0.UPF

Can confirm that the changes in this PR fix that issue.

@sphuber
Copy link
Contributor Author

sphuber commented Apr 24, 2024

Thanks @sphuber! @DrFedro also reported an issue related to this to me, i.e. that the pseudo folder would be copied inside an already existing one, causing QE to crash:

So he is running of the main branch, correct? Because that bug should not be present in an actual release. Can we merge this PR then? Or do you have doubts about the changes?

@mbercx
Copy link
Member

mbercx commented Apr 24, 2024

So he is running of the main branch, correct?

Yes, it seems so, and reverting to v2.5.1 fixed the issue.

Can we merge this PR then? Or do you have doubts about the changes?

I still want to have a proper look at the code, so I can also make sure I understand the changes in 6898ff4. Should have time for this tomorrow or Friday.

@mbercx
Copy link
Member

mbercx commented May 1, 2024

Looking into this some more, won't the following line cause similar woes?

https://github.com/sphuber/aiida-core/blob/20fc81065731a05eb5ee1da95976bda504d7f67d/src/aiida/engine/daemon/execmanager.py#L293

I was playing around with the following code:

import os
import pathlib
import shutil
from tempfile import TemporaryDirectory
from aiida import orm, load_profile

load_profile()

localhost = orm.load_computer('localhost')

remote_workdir = '/Users/mbercx/project/core/jupyter/workdir'
pseudo_path = '/Users/mbercx/project/core/jupyter/data'

folder_data = orm.FolderData(tree=pseudo_path)

shutil.rmtree(remote_workdir, ignore_errors=True)

def copy_local(transport, folder_data):
    with TemporaryDirectory() as tmpdir:

        dirpath = pathlib.Path(tmpdir)

        data_node = folder_data

        filepath_target = (dirpath / 'pseudo').resolve().absolute()
        filepath_target.parent.mkdir(parents=True, exist_ok=True)

        data_node.base.repository.copy_tree(filepath_target, 'pseudo')

        transport.put(f'{dirpath}/*', transport.getcwd())

with localhost.get_transport() as transport:

    transport.mkdir(remote_workdir)
    transport.chdir(remote_workdir)

    copy_local(transport, folder_data)
    transport.copy(os.path.join(pseudo_path, 'pseudo'), 'pseudo')

The code above will give the following directory tree:

.
├── data
│   └── pseudo
│       ├── Ba.upf
│       └── Si.upf
├── sandbox.ipynb
└── workdir
    └── pseudo
        ├── Ba.upf
        ├── Si.upf
        └── pseudo
            ├── Ba.upf
            └── Si.upf

But switching the order of the copy_local and transport.copy will not result in the nested psuedo folder.

@sphuber
Copy link
Contributor Author

sphuber commented May 1, 2024

But switching the order of the copy_local and transport.copy will not result in the nested psuedo folder.

Sure, but that is because you are calling the following

transport.copy(os.path.join(pseudo_path, 'pseudo'), 'pseudo')

And that is saying copy the contents of the source pseudo inside the pseudo target directory. It is not saying update the contents of the target pseudo dir with the contents of the source pseudo. What you are trying to do is:

transport.copy(os.path.join(pseudo_path, 'pseudo/*'), 'pseudo')

So you are globbing the contents of pseudo and copying that into the target pseudo. If you change this line, then your script works as expected. This is also exactly the change I added in this PR for the local copy.

So I don't think there is a regression in the behavior of transport.copy as it is used for the remote_copy_list is there?

@mbercx
Copy link
Member

mbercx commented May 1, 2024

So I don't think there is a regression in the behavior of transport.copy as it is used for the remote_copy_list is there?

I don't think there is a regression, I was just wondering if we should make a similar change for remote_copy_list, since now the copying behavior for the two is different when the folder is already present due to a previous copying operation.

I rewrote the example to rely on the functions in the execmanager to see if I can break things (as I like to do):

from logging import LoggerAdapter
import shutil

from aiida import orm, load_profile
from aiida.common import AIIDA_LOGGER
from aiida.engine.daemon.execmanager import _copy_remote_files, _copy_local_files

load_profile()

random_calc_job = orm.load_node(36)
logger = LoggerAdapter(logger=AIIDA_LOGGER.getChild('execmanager'))

localhost = orm.load_computer('localhost')

remote_workdir = '/Users/mbercx/project/core/jupyter/workdir'
pseudo_path = '/Users/mbercx/project/core/jupyter/data'

shutil.rmtree(remote_workdir, ignore_errors=True)

folder_data = orm.FolderData(tree=pseudo_path)
folder_data.store()

local_copy_list_item = (folder_data.uuid, 'pseudo', 'pseudo')
remote_copy_list_item = (localhost.uuid, '/Users/mbercx/project/core/jupyter/data/pseudo/*', 'pseudo')

with localhost.get_transport() as transport:

    transport.mkdir(remote_workdir)
    transport.chdir(remote_workdir)

    _copy_local_files(logger, random_calc_job, transport, None, [local_copy_list_item])
    _copy_remote_files(logger, random_calc_job, localhost, transport, [remote_copy_list_item], ())

Critical of course are the local_copy_list_item and remote_copy_list_item definitions:

local_copy_list_item = (folder_data.uuid, 'pseudo', 'pseudo')
remote_copy_list_item = (localhost.uuid, '/Users/mbercx/project/core/jupyter/data/pseudo/*', 'pseudo')

Here, all is well, since I use the glob * to copy the contents of /Users/mbercx/project/core/jupyter/data/pseudo, as you noted. Removing the glob results in the nested pseudo dirs, but perhaps this is expected.

If we remove the glob, and invert the _copy_local_files and _copy_remote_files operations (as we currently allow):

local_copy_list_item = (folder_data.uuid, 'pseudo', 'pseudo')
remote_copy_list_item = (localhost.uuid, '/Users/mbercx/project/core/jupyter/data/pseudo', 'pseudo')

with localhost.get_transport() as transport:

    transport.mkdir(remote_workdir)
    transport.chdir(remote_workdir)

    _copy_remote_files(logger, random_calc_job, localhost, transport, [remote_copy_list_item], ())
    _copy_local_files(logger, random_calc_job, transport, None, [local_copy_list_item])

the behavior is different, i.e. there is no nested pseudo folder. I guess now my question is if this difference in behavior is problematic.

@unkcpz
Copy link
Member

unkcpz commented May 2, 2024

Since _copy_remote_files and _copy_local_files are not methods for public use, maybe it not a issue that the behaviors are not parallel with each other?

@sphuber
Copy link
Contributor Author

sphuber commented May 2, 2024

Since _copy_remote_files and _copy_local_files are not methods for public use, maybe it not a issue that the behaviors are not parallel with each other?

Sure, but they are actually used by the user, albeit it indirectly. The local_copy_list and remote_copy_list instructions are actually passed directly to these functions. If they now require different instructions to achieve the same behavior, I can see how that would be confusing and inconsistent. For example if in local_copy_list you can/have to use relative_source/* and for remote_copy_list you don't use the glob. So I agree with @mbercx that we should look at it being consistent.

I need to try and find some time to try the example scripts on the latest release, to see what the original behavior was. I think the current changes on main only change the behavior of local_copy_list which is why I had to change it. But remote_copy_list should not have changed, so I think the highlighted problem of nesting should be present in v2.5.1 as well.

@sphuber
Copy link
Contributor Author

sphuber commented May 5, 2024

TLDR: The solution of this PR is wrong. Not even so much for the discrepancy in behavior of local and remote copy lists, but really because the implementation of _copy_local_files does not directly respect the copying instructions.

This is a tricky one. The first question is what the behavior of Transport.put and Transport.copy (intuitively) should be. If you copy dir_a to dir_a, what should happen if a) dir_a does already exist in the target or b) dir_a does not exist. If we follow the behavior of cp, then there should be a difference in the two cases. If the target dir_a already exists, the dir_a source will be copied inside the existing directory, i.e., it will be nested. This is the current behavior of Transport.put and Transport.copy so it stands to be argued that that is the expected behavior.

The problem is really due to a detail of the original implementation of upload_calculation. There are three sources of input files that are provided by the calcjob plugin that are copied to the remote working directory:

  1. The sandbox folder
  2. The local_copy_list
  3. The remote_copy_list

The implementation did not literally copy the contents of each sequentially to the working directory. Rather, it would copy the instructions of the local_copy_list to the sandbox folder, and then it would copy the sandbox folder wholesale to the working dir. Critically, it did not use the Transport interface for the first operation, but simply used normal Python file operations since the sandbox sits on the local file system. The remote_copy_list was actually done through the transport interface directly to the working directory.

In the new implementation, this was changed, where the 3 copying steps are directly copied to the remote working dir, and the local_copy_list contents are not merged into the sandbox folder first. But this now leads to problem if the sandbox folder creates a folder on the remote, that also exists in the local_copy_list at which point the latter will now be nested. This happens for PwCalculation for example, because the plugin manually creates the pseudo folder in the sandbox, and then also copies the UpfData files to ./pseudo/filename through the local_copy_list.

In principle, getting rid of the "hack" of merging local_copy_list with the sandbox is a good thing in my opinion. However, the exact manner I did it in is causing the breaking of original functionality I believe. The _copy_local_files has to copy the files from the instructions to a temporary file on the local disk because the Transport interface does not allow file handles, and AiiDA's repository interface can only provide streaming handles. The problem is that I first copy all items of local_copy_list to a temp dir on disk, and then copy all the contents to the remote working dir, however, this is not identical to copying the local_copy_list instructions over one by one. So the solution, I think, is to copy each item of local_copy_list as soon as it is written to temporary file locally, instead of copying them all locally and then copying the temp dir in one go.

@sphuber sphuber force-pushed the fix/local-copy-list-subdirectory branch from 20fc810 to c2a5903 Compare May 5, 2024 17:10
@sphuber sphuber changed the title LocalTransport: Accept existing directories in puttree Engine: Fix bug introduced when refactoring upload_calculation May 5, 2024
@sphuber
Copy link
Contributor Author

sphuber commented May 5, 2024

Thanks for the critical review @mbercx . My original solution was horribly flawed and was trying to cover up a bad refactoring of the upload_calculation in an earlier commit. I am now pretty confident that I have found the correct solution which doesn't introduce differences between local and remote copy operations and does not have to touch the behavior of the transports whatsoever. I have tested the change with aiida-quantumespresso and it now works as expected.

Scratch that... it is still not quite that simple 😭

@sphuber sphuber force-pushed the fix/local-copy-list-subdirectory branch from c2a5903 to db3c9f1 Compare May 5, 2024 19:10
@sphuber sphuber requested a review from mbercx May 6, 2024 18:01
@sphuber
Copy link
Contributor Author

sphuber commented May 6, 2024

@mbercx could you give this another review please? The behavior of the remote_copy_list and local_copy_list in your example is now the same once again and I believe the original behavior of the latest release is also restored.

Copy link
Member

@mbercx mbercx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sphuber! I think the critical question is indeed what the behavior of Transport.put and Transport.copy should be. I think it'd be hard to change their behavior from what you describe above, so I agree it makes sense to keep cp-like behavior.

I had a closer look at the code and did some more field testing on the behavior of the various FileCopyOperations. I left two comments so far, of which the double comment on line 364 re copying the contents of a FileType.DIRECTORY is the most critical.

# Now copy the contents of the temporary folder to the remote working directory using the transport
for filepath in dirpath.iterdir():
transport.put(str(filepath), filepath.name)
transport.makedirs(str(pathlib.Path(target).parent), ignore_existing=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a note that because of line 359 and this one, the copy behaviour of local_copy_list is not the same as cp, which would simply fail in case the parent folder that you are trying to copy into doesn't exist.

I wonder if this was also the previous behavior of local_copy_list. The QE plugin made the pseudo directory in the sandbox folder exactly because otherwise the copy command would fail, I assume.

Finally, remote_copy_list does fail when trying to copy files into a parent folder that doesn't exist.

@@ -360,15 +361,14 @@ def _copy_local_files(logger, node, transport, inputs, local_copy_list):
if data_node.base.repository.get_object(filename_source).file_type == FileType.DIRECTORY:
# If the source object is a directory, we copy its entire contents
data_node.base.repository.copy_tree(filepath_target, filename_source)
transport.put(f'{dirpath}/*', target or '.')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to my comment below, I was wondering if this means that the local_copy_list behavior is once again different from cp. Funny enough, it does seem similar when using -r and adding a forward slash after the source directory:

❯ rm -rf *; mkdir pseudo; cp -r ../test_qe/pseudo ./pseudo; tree
.
└── pseudo
    └── pseudo
        ├── Ba.upf
        └── Si.upf

3 directories, 2 files
❯ rm -rf *; mkdir pseudo; cp -r ../test_qe/pseudo/ ./pseudo; tree
.
└── pseudo
    ├── Ba.upf
    └── Si.upf

2 directories, 2 files

Kind of similar to rsync, I guess.

Copy link
Member

@mbercx mbercx May 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, actually, after having a closer look, the behaviour seems different than I expected. With the following code:

import shutil

from logging import LoggerAdapter

from aiida import orm, load_profile
from aiida.common import AIIDA_LOGGER
from aiida.common.folders import SandboxFolder
from aiida.engine.daemon.execmanager import _copy_local_files, _copy_sandbox_files

load_profile()

random_calc_job = orm.load_node(36)
logger = LoggerAdapter(logger=AIIDA_LOGGER.getChild('execmanager'))

localhost = orm.load_computer('localhost')

remote_workdir = '/Users/mbercx/project/core/jupyter/workdir'
test_path = '/Users/mbercx/project/core/jupyter/test_qe'

shutil.rmtree(remote_workdir, ignore_errors=True)

folder_data = orm.FolderData(tree=test_path)
folder_data.store()

local_copy_list = [
    (folder_data.uuid, 'pseudo', 'pseudo'),
]

with SandboxFolder() as sandbox_folder:

    sandbox_folder.get_subfolder('pseudo', create=True)

    with localhost.get_transport() as transport:

        transport.mkdir(remote_workdir)
        transport.chdir(remote_workdir)

        _copy_sandbox_files(logger, random_calc_job, transport, sandbox_folder)
        _copy_local_files(logger, random_calc_job, transport, None, local_copy_list)

and the contents of test_qe:

test_qe
└── pseudo
    ├── Ba.upf
    └── Si.upf

I indeed get a nested folder:

workdir/
└── pseudo
    └── pseudo
        ├── Ba.upf
        └── Si.upf

Not creating the pseudo folder in the sandbox leads to the non-nested result. However, if I make the local_copy_list:

local_copy_list = [
    (folder_data.uuid, 'pseudo', '.'),
]

Then the workdir becomes:

workdir/
├── Ba.upf
└── Si.upf

Is that what we want? Now we are really doing the cp -r pseudo/ . (with forward slash) option, which means copy the contents of the directory to the target path.

@mbercx
Copy link
Member

mbercx commented May 8, 2024

@sphuber just a note: I'm leaving on holiday tomorrow until the 20th, so will most likely not have time to review until after that... I agree the release should come out soon though. Maybe @giovannipizzi (due to his experience) or @khsrali (due to the fact that he's working on transports) can get involved in the review?

I think the discrepancy between local_copy_list and remote_copy_list in allowing the absence of parent folders is fine, although we could also allow this for remote_copy_list as well by making the parent folders as we do for remote_copy_list.

The fact that local_copy_list copies the contents of a directory instead of the directory itself seems somewhat strange. I suppose this was necessary to maintain backwards compatibility with the previous approach of merging the sandbox folder and local_copy_list?

@sphuber
Copy link
Contributor Author

sphuber commented May 8, 2024

The fact that local_copy_list copies the contents of a directory instead of the directory itself seems somewhat strange

If and only if the target directory already exists right? Otherwise it just copies as is.

The problem here is indeed really the fact that the old implementation did not go directly through the Transport but had an intermediate merging step that used a different API and so we kind of now have to maintain that behavior in order to not break existing plugins, as demonstrated by PwCalculation breaking.

@mbercx
Copy link
Member

mbercx commented May 8, 2024

If and only if the target directory already exists right? Otherwise it just copies as is.

Not sure if that's true, see my example near the end of #6348 (comment)

@GeigerJ2
Copy link
Contributor

Thanks @sphuber! @DrFedro also reported an issue related to this to me, i.e. that the pseudo folder would be copied inside an already existing one, causing QE to crash:

Just a note that I also stumbled on this behavior now, and installing from the PR here fixes it. So I'll try to dedicate some time to reviewing the code and reading the discussion, maybe that could be helpful for merging.

@rikigigi
Copy link
Member

I tested this PR for a while with aiida-quantumespresso, and so far I did not found any issues in the code behavior. I confirm that the main branch is completely broken. Can you add a test that covers this case before merging?
Thank you

@sphuber
Copy link
Contributor Author

sphuber commented May 22, 2024

Can you add a test that covers this case before merging?

This PR includes a regression test already Seems I have removed it in one of the iterations. I will see to adding it again.

In 6898ff4 the implementation of the
processing of the `local_copy_list` in the `upload_calculation` method
was changed. Originally, the files specified by the `local_copy_list`
were first copied into the `SandboxFolder` before copying its contents
to the working directory using the transport. The commit allowed the
order in which the local and sandbox files were copied, so the local
files were now no longer copied through the sandbox. Rather, they were
copied to a temporary directory on disk, which was then copied over
using the transport. The problem is that if the latter would copy over a
directory that was already created by the copying of the sandbox, an
exception would be raised.
In 6898ff4 the implementation of the
processing of the `local_copy_list` in the `upload_calculation` method
was changed. Originally, the files specified by the `local_copy_list`
were first copied into the `SandboxFolder` before copying its contents
to the working directory using the transport. The commit allowed the
order in which the local and sandbox files were copied, so the local
files were now no longer copied through the sandbox. Rather, they were
copied to a temporary directory on disk, which was then copied over
using the transport. The problem is that if the latter would copy over a
directory that was already created by the copying of the sandbox, an
exception would be raised.

For example, if the sandbox contained the directory `folder` and the
`local_copy_list` contained the items `(_, 'file', './folder/file')`
this would work just fine in the original implementation as the `file`
would be written to the `folder` on the remote folder. The current
implementation, however, would write the file content to `folder/file`
in a local temporary directory, and then iterate over the directories
and copy them over. Essentially it would be doing:

    Transport.put('/tmpdir/folder', 'folder')

but since `folder` would already exist on the remote working directory
the local folder would be _nested_ and so the final path on the remote
would be `/workingdir/folder/folder/file`.

The correct approach is to copy each item of the `local_copy_list` from
the local temporary directory _individually_ using the `Transport.put`
interface and not iterate over the top-level entries of the temporary
directory at the end.
@sphuber sphuber force-pushed the fix/local-copy-list-subdirectory branch from db3c9f1 to ad15c97 Compare May 22, 2024 07:47
@mbercx
Copy link
Member

mbercx commented May 29, 2024

@sphuber as discussed, I've written a bunch of tests to check the edge cases I've been studying, first based on the latest release tag (v2.5.1):

https://github.com/mbercx/aiida-core/blob/test/upload-calc/tests/engine/daemon/test_execmanager.py

These should all pass for the previous implementation, see

https://github.com/mbercx/aiida-core/actions/runs/9285274053/job/25549435202

I've then added the same tests as a commit on top of this PR:

https://github.com/mbercx/aiida-core/blob/test/upload-calc-order-fix/tests/engine/daemon/test_execmanager.py

Here, 3 tests fail:

https://github.com/mbercx/aiida-core/actions/runs/9285404390/job/25549825351

  1. The sandbox folder is created with the pseudo folder and the local_copy_list tries to copy the contents of a SingleFileData to the target folder, see here. This would fail with a IsADirectoryError in v2.5.1, now it creates an unexpected file hierarchy:

    >               assert serialize_file_hierarchy(filepath_workdir) == expected_hierarchy 
    E               AssertionError: assert {'pseudo': {'... 'Ba pseudo'}} == {'pseudo': {'... 'Ba pseudo'}}
    E                 Differing items:
    E                 {'pseudo': {'pseudo': 'Ba pseudo'}} != {'pseudo': {'Ba.upf': 'Ba pseudo'}}
    E                 Use -v to get more diff
    

    I.e. a nested pseudo folder.

  2. The sandbox folder is created with the pseudo folder and the local_copy_list tries to copy the contents of the pseudo folder in a FolderData to the target pseudo folder, see here. Here also a nested hierarchy ({'pseudo': {'pseudo': {'Ba.upf': 'Ba pseudo'}}}) is created instead of the previous one ({'pseudo': {'Ba.upf': 'Ba pseudo'}}).

  3. Same as (2), but this time with a nested folder.

So the main point now is that the current copying behaviour of the PR does not preserve that of v2.5.1. I'll see if I can fix that first.

@mbercx
Copy link
Member

mbercx commented Jun 3, 2024

Ok, I've extended the number of tests, added tests for the SSH transport, and managed to get it down to one failure (for SSH):

https://github.com/mbercx/aiida-core/actions/runs/9345893736/job/25719640192

See all changes here: mbercx@fa0e9c5

It seems the issue is that the local and SSH transport return different errors when trying to remotely copy a file to a directory that doesn't yet exist. The local transport returns a FileNotFoundError raised by the shutil.copy command:

shutil.copy(the_source, the_destination)

The SSH transport returns an OSError instead:

raise OSError(
'Error while executing cp. Exit code: {}, ' "stdout: '{}', stderr: '{}', " "command: '{}'".format(
retval, stdout, stderr, command
)
)

For some reason, the engine deals with these two error differently:

try:
transport.copy(remote_abs_path, dest_rel_path)
except FileNotFoundError:
logger.warning(
f'[submission of calculation {node.pk}] Unable to copy remote '
f'resource from {remote_abs_path} to {dest_rel_path}! NOT Stopping but just ignoring!.'
)
except OSError:
logger.warning(
f'[submission of calculation {node.pk}] Unable to copy remote '
f'resource from {remote_abs_path} to {dest_rel_path}! Stopping.'
)
raise

I'm now trying to figure out why these two are treated differently.

EDIT: the reason is explained here: 101a8d6

So similar to the source file not existing, if the target path is a file in a directory that doesn't exist, the failure will not be transient and hence we should have the SSH transport return a FileNotFoundError to be consistent with the local transport.

@sphuber
Copy link
Contributor Author

sphuber commented Jun 3, 2024

I'm now trying to figure out why these two are treated differently.

The OSError is a broad catch-all exception. The FileNoteFoundError is a subclass of OSerror and more specific. When copying remote files from remote_copy_list we don't want to raise when the source does not exist, but just warns. If the copy fails due to another reason, the problem may be more severe, and so it makes sense to let the calculation error out.

What we should do is check for the SshTransport what kind of OSError it is throwing. You can figure this out with through the errno attribute of the exception. You can use the constants in the errno module to figure out the meaning. For example:

try:
    os.remove(filename)
except OSError as e:
    if e.errno == errno.ENOENT:
        print('File does not exist')

See this PEP as well for details.

If the reason for the SshTransport is raising is the file is missing, we should have transport also raise the more specific FileNotFoundError. And then the test will pass

@mbercx
Copy link
Member

mbercx commented Jun 3, 2024

Thanks @sphuber! Unfortunately, the .errno attribute of the exception that is raised seems to be None...

Instead I've added a check to see if the parent destination directory exists in the SshTransport.copy() method, see:

mbercx@011f5f7#diff-1284fda2c2014c8c418475c99a82308d82ab96cb33e6c49d978c0031f3b1ef3d

It's more patchy for sure, but at least my tests now all pass locally. 😅 I'm praying they also pass on GitHub:

https://github.com/mbercx/aiida-core/actions/runs/9347879530/job/25725810490

@mbercx
Copy link
Member

mbercx commented Jun 3, 2024

Also double-checked my branch adding these tests on v2.5.1, and can confirm they all pass, save for the same error discussed above.

@mbercx
Copy link
Member

mbercx commented Jun 3, 2024

My suggestion for proceeding is to split up the PR in 3 commits:

  1. SshTransport: Return FileNotFoundError if destination parent does not exist
  2. Transports: overwrite existing folders if overwrite == True
  3. Engine: Recover behavior for upload_calculation after refactor

@sphuber
Copy link
Contributor Author

sphuber commented Jun 4, 2024

Superseded by #6447

@sphuber sphuber closed this Jun 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority/critical-blocking must be resolved before next release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants