pull: breaks if imported dir ends with "/" #10426

Open
afaul opened this issue May 13, 2024 · 1 comment · May be fixed by #10446
Labels
A: data-sync (Related to dvc get/fetch/import/pull/push)
bug (Did we break something?)
p2-medium (Medium priority, should be done, but less important)

Comments

afaul commented May 13, 2024

Bug Report

Description

When a directory is imported with dvc import and the source path ends with "/", dvc pull is unable to fetch the imported files in a clean clone of the repository.

Reproduce

Run this bash script to reproduce the bug.

rm -rf dvc-test
mkdir dvc-test
cd dvc-test

mkdir repoA
cd repoA
python3 -m venv env
source env/bin/activate
pip install -q dvc
pip install -q dvc-s3
git init
dvc init
mkdir data
dvc import https://github.com/iterative/dataset-registry.git tutorials/nlp/ -o data/   # broken: source path ends with "/"
# dvc import https://github.com/iterative/dataset-registry.git tutorials/nlp -o data/   # working: no trailing "/"
git add data/nlp.dvc data/.gitignore
git commit -m "commit"
deactivate

cd ..
git clone repoA repoB

cd repoB
python3 -m venv env
source env/bin/activate
pip install -q dvc
pip install -q dvc-s3
dvc pull
deactivate

cd ..

ls -l repoA/data
ls -l repoB/data
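
Until this is fixed, one possible workaround is to strip the trailing "/" from the source path before calling dvc import, which matches the "working" variant in the script above. The sketch below only builds the argument list. build_import_command is a hypothetical helper and not part of DVC, and it assumes the dvc import <url> <path> -o <out> form used in the script.

```python
def build_import_command(url: str, path: str, out: str) -> list[str]:
    """Build a `dvc import` argument list (hypothetical helper, not part of DVC).

    Trailing slashes are removed from the source path so that the
    dependency is recorded as "tutorials/nlp" rather than "tutorials/nlp/".
    """
    # Keep the original value if stripping would leave an empty string (e.g. "/").
    clean_path = path.rstrip("/") or path
    return ["dvc", "import", url, clean_path, "-o", out]


cmd = build_import_command(
    "https://github.com/iterative/dataset-registry.git",
    "tutorials/nlp/",
    "data/",
)
print(" ".join(cmd))
```

The resulting list could be passed to subprocess.run(cmd, check=True) inside a repository where DVC is installed. Only the source path is normalized here. The output path data/ is left untouched because the report does not indicate that it is involved.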

Expected

dvc pull should fetch the imported data in the clone, just as dvc import did in the original repository.

Environment information

Output of dvc doctor:

DVC version: 3.50.1 (pip)
-------------------------
Platform: Python 3.12.3 on Linux-6.8.9-arch1-2-x86_64-with-glibc2.39
Subprojects:
	dvc_data = 3.15.1
	dvc_objects = 5.1.0
	dvc_render = 1.0.2
	dvc_task = 0.4.0
	scmrepo = 3.3.3
Supports:
	http (aiohttp = 3.9.5, aiohttp-retry = 2.8.3),
	https (aiohttp = 3.9.5, aiohttp-retry = 2.8.3),
	s3 (s3fs = 2024.3.1, boto3 = 1.34.69)
Config:
	Global: /home/afaul/.config/dvc
	System: /etc/xdg/dvc
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/mapper/vg_ssd-lv_home
Caches: local
Remotes: None
Workspace directory: ext4 on /dev/mapper/vg_ssd-lv_home
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/69d484a87eba683f3683324f5c8f57f4

Additional Information (if any):

% dvc pull --verbose
2024-05-13 19:55:34,755 DEBUG: v3.50.1 (pip), CPython 3.12.3 on Linux-6.8.9-arch1-2-x86_64-with-glibc2.39
2024-05-13 19:55:34,755 DEBUG: command: /home/afaul/Downloads/dvc-test/repoB/env/bin/dvc pull --verbose
2024-05-13 19:55:35,660 DEBUG: Creating external repo https://github.com/iterative/dataset-registry.git@f59388cd04276e75d70b2136597aaa27e7937cc3
2024-05-13 19:55:35,660 DEBUG: erepo: git clone 'https://github.com/iterative/dataset-registry.git' to a temporary dir              
Collecting                                                                                                |4.00 [00:01, 3.50entry/s]
Fetching                                                                                                                            
Building workspace index                                                                                  |1.00 [00:00,  379entry/s]
Comparing indexes                                                                                        |7.00 [00:00, 1.25kentry/s]
2024-05-13 19:55:36,999 WARNING: No file hash info found for '/home/afaul/Downloads/dvc-test/repoB/data/nlp/./.gitignore'. It won't be created.
2024-05-13 19:55:36,999 DEBUG: failed to create '/home/afaul/Downloads/dvc-test/repoB/data/nlp/./.gitignore' from 'None'            
Traceback (most recent call last):
  File "/home/afaul/Downloads/dvc-test/repoB/env/lib/python3.12/site-packages/dvc_data/index/checkout.py", line 94, in _create_files
    src_fs, src_path = storage_obj.get(entry)
                       ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/afaul/Downloads/dvc-test/repoB/env/lib/python3.12/site-packages/dvc_data/index/index.py", line 198, in get
    raise ValueError
ValueError

2024-05-13 19:55:37,002 WARNING: No file hash info found for '/home/afaul/Downloads/dvc-test/repoB/data/nlp/./Posts.xml.zip'. It won't be created.
2024-05-13 19:55:37,002 DEBUG: failed to create '/home/afaul/Downloads/dvc-test/repoB/data/nlp/./Posts.xml.zip' from 'None'         
Traceback (most recent call last):
  File "/home/afaul/Downloads/dvc-test/repoB/env/lib/python3.12/site-packages/dvc_data/index/checkout.py", line 94, in _create_files
    src_fs, src_path = storage_obj.get(entry)
                       ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/afaul/Downloads/dvc-test/repoB/env/lib/python3.12/site-packages/dvc_data/index/index.py", line 198, in get
    raise ValueError
ValueError

2024-05-13 19:55:37,003 WARNING: No file hash info found for '/home/afaul/Downloads/dvc-test/repoB/data/nlp/./pipeline.zip'. It won't be created.
2024-05-13 19:55:37,003 DEBUG: failed to create '/home/afaul/Downloads/dvc-test/repoB/data/nlp/./pipeline.zip' from 'None'          
Traceback (most recent call last):
  File "/home/afaul/Downloads/dvc-test/repoB/env/lib/python3.12/site-packages/dvc_data/index/checkout.py", line 94, in _create_files
    src_fs, src_path = storage_obj.get(entry)
                       ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/afaul/Downloads/dvc-test/repoB/env/lib/python3.12/site-packages/dvc_data/index/index.py", line 198, in get
    raise ValueError
ValueError

Applying changes                                                                                          |0.00 [00:00,     ?file/s]
2024-05-13 19:55:37,004 DEBUG: Removing '/home/afaul/Downloads/dvc-test/repoB/data/nlp'
No remote provided and no default remote set.
Everything is up to date.
2024-05-13 19:55:37,005 ERROR: failed to pull data from the cloud - Checkout failed for following targets:
data/nlp
Is your cache up to date?
<https://error.dvc.org/missing-files>
Traceback (most recent call last):
  File "/home/afaul/Downloads/dvc-test/repoB/env/lib/python3.12/site-packages/dvc/commands/data_sync.py", line 35, in run
    stats = self.repo.pull(
            ^^^^^^^^^^^^^^^
  File "/home/afaul/Downloads/dvc-test/repoB/env/lib/python3.12/site-packages/dvc/repo/__init__.py", line 58, in wrapper
    return f(repo, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/afaul/Downloads/dvc-test/repoB/env/lib/python3.12/site-packages/dvc/repo/pull.py", line 42, in pull
    stats = self.checkout(
            ^^^^^^^^^^^^^^
  File "/home/afaul/Downloads/dvc-test/repoB/env/lib/python3.12/site-packages/dvc/repo/__init__.py", line 58, in wrapper
    return f(repo, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/afaul/Downloads/dvc-test/repoB/env/lib/python3.12/site-packages/dvc/repo/checkout.py", line 184, in checkout
    raise CheckoutError([relpath(out_path) for out_path in failed], stats)
dvc.exceptions.CheckoutError: Checkout failed for following targets:
data/nlp
Is your cache up to date?
<https://error.dvc.org/missing-files>

2024-05-13 19:55:37,011 DEBUG: Analytics is enabled.
2024-05-13 19:55:37,073 DEBUG: Trying to spawn ['daemon', 'analytics', '/tmp/tmpmpwaa9dc', '-v']
2024-05-13 19:55:37,083 DEBUG: Spawned ['daemon', 'analytics', '/tmp/tmpmpwaa9dc', '-v'] with pid 3404
2024-05-13 19:55:37,085 DEBUG: Removing '/tmp/tmpr14fojvgdvc-clone'
2024-05-13 19:55:37,089 DEBUG: Removing '/tmp/tmpt_4flv1tdvc-cache'
shcheklein added the triage (Needs to be triaged) label on May 19, 2024
dberenbaum added the bug, p2-medium, and A: data-sync labels and removed the triage label on May 20, 2024
dberenbaum (Contributor) commented

The difference is that the dependency path is saved as tutorials/nlp/ instead of tutorials/nlp. We should either strip the final "/" there or treat the two forms as equivalent in the dvc-data index and everywhere else.
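
To illustrate why the two spellings are not interchangeable by default, here is a minimal sketch. path_to_key and normalize_dep_path are hypothetical helpers for illustration only. They are not the dvc or dvc-data implementation, and the naive split is an assumption about how a path might be turned into a key.

```python
def path_to_key(path: str) -> tuple[str, ...]:
    """Naively split a POSIX-style path into a tuple key (hypothetical helper)."""
    return tuple(path.split("/"))


def normalize_dep_path(path: str) -> str:
    """Strip trailing slashes from a dependency path (hypothetical helper).

    A path made only of slashes is returned unchanged.
    """
    return path.rstrip("/") or path


raw = "tutorials/nlp/"
clean = normalize_dep_path(raw)

# The trailing slash produces an extra empty component in the naive key,
# so lookups keyed on one form will not match entries stored under the other.
print(path_to_key(raw))
print(path_to_key(clean))
print(path_to_key(raw) == path_to_key(clean))
```

Normalizing once, at the point where the dependency path is saved, corresponds to the first option in the comment above. Treating both forms as equivalent everywhere would instead require every comparison site to apply the same rule.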

georgeyk added a commit to georgeyk/dvc that referenced this issue May 31, 2024
georgeyk linked a pull request on May 31, 2024 that will close this issue
3 participants