dvc repro --pull: invalidates cache #10412

Closed
legendof-selda opened this issue May 2, 2024 · 8 comments · Fixed by #10435
Labels
A: pipelines, A: run-cache, awaiting response, p2-medium

Comments

@legendof-selda

Bug Report

repro: dvc repro --pull invalidates the cache; running dvc pull and then dvc repro works.

Description

When I use dvc repro --pull on my DVC pipelines, it invalidates the cache for certain stages and runs them again.
However, if I run dvc pull and then dvc repro, it respects the cache and skips those stages. This is the behaviour I expect from dvc repro --pull as well.

For some reason, the hash of the stage outputs also changes on every run.

Reproduce

  1. Delete all DVC-tracked files in the directory.
  2. Run dvc repro --pull.

Expected

dvc repro --pull should behave the same way as running dvc pull followed by dvc repro.
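
For reference, a minimal sketch of the two command sequences being compared (run from the repo root, remotes as in the environment below):

$ # sequence A: works as expected
$ dvc pull           # fetch the cached outputs into the workspace
$ dvc repro          # stages are skipped, the cache is respected

$ # sequence B: expected to be equivalent, but invalidates the cache
$ dvc repro --pull   # pulls and reproduces in one step; certain stages rerun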

Environment information

Output of dvc doctor:

$ dvc doctor

DVC version: 3.49.0 (pip)
-------------------------
Platform: Python 3.11.3 on Linux-6.5.0-1017-azure-x86_64-with-glibc2.35
Subprojects:
        dvc_data = 3.15.1
        dvc_objects = 5.1.0
        dvc_render = 1.0.2
        dvc_task = 0.4.0
        scmrepo = 3.3.1
Supports:
        azure (adlfs = 2024.4.0, knack = 0.11.0, azure-identity = 1.12.0),
        http (aiohttp = 3.9.4, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.9.4, aiohttp-retry = 2.8.3)
Config:
        Global: /home/user/.config/dvc
        System: /etc/xdg/dvc
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/mapper/data_vg-lv_data
Caches: local
Remotes: azure
Workspace directory: ext4 on /dev/mapper/data_vg-lv_data
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/cc226011e8332423634ca4bb8822ef4b

Additional Information (if any):

dvc repro --pull data/dvc.yaml -v: repro_pull_verbose.log

@dberenbaum (Contributor)

Are you able to provide a reproducible example? I'm not able to reproduce with https://github.com/iterative/example-get-started:

$ dvc repro --pull
WARNING: Failed to pull run cache: run-cache is not supported for http filesystem: https://remote.dvc.org/get-started
Importing 'get-started/data.xml (https://github.com/iterative/dataset-registry)' -> 'data/data.xml'

Stage 'prepare' is cached - skipping run, checking out outputs

Stage 'featurize' is cached - skipping run, checking out outputs

Stage 'train' is cached - skipping run, checking out outputs

Stage 'evaluate' is cached - skipping run, checking out outputs
Use `dvc push` to send your updates to remote storage.

dberenbaum added the awaiting response label on May 3, 2024
@legendof-selda (Author) commented May 6, 2024

I forgot to upload the dvc.yaml file used:

vars:
  - BLOB_URL: "<URL>"
  - data-version.toml
stages:
  download_full_parquet:
    desc: Download the exported parquet files
    params:
      - data/data-version.toml:
          - DATA_DIR
    cmd: >2
      <script to download from $BLOB_URL>
    wdir: ..
    outs:
      - data/full_output:
          persist: true
          cache: true
  download_dbg_parquet:
    desc: Download the exported dbg parquet files
    params:
      - data/data-version.toml:
          - DEBUG_DATA_DIR
    cmd: >2
      <script to download from $BLOB_URL>
    wdir: ..
    outs:
      - data/dbg_output:
          persist: true
          cache: true
  generate_full_json:
    desc: Generate the json intermediaries for the full dataset.
    cmd: python <script to generate json data> data/full_output data/full_json
    wdir: ../
    deps:
      - data/full_output
      - data/cached
      - .hashes/code.hash
      - poetry.lock
    outs:
      - data/full_json:
          cache: true
          persist: false
  generate_dbg_json:
    desc: Generate the json intermediaries for the debug dataset.
    cmd: python <script to generate json data> data/dbg_output data/dbg_json
    wdir: ../
    deps:
      - data/dbg_output
      - data/cached
      - .hashes/code.hash
      - poetry.lock
    outs:
      - data/dbg_json:
          cache: true
          persist: false
  generate_full_hdf5:
    desc: Transform the raw json files into hdf5 files the model can understand.
    cmd: python <script to transform json to hdf5> data/full_json data/full_wholesalers
    wdir: ../
    deps:
      - poetry.lock
      - .hashes/code.hash
      - data/full_json
      - data/fred/state_economics.json
      - data/fred/fred_signals_included.json
      - data/fred/national_economics.json
      - data/gas_price/output/state_gas_price.json
      - data/industry_trend/processed_trend.csv
      - data/state_region_mapping/state_region_mapping.json
    outs:
      - data/full_wholesalers:
          cache: true
          persist: true
  generate_dbg_hdf5:
    desc: Transform the raw debug json files into hdf5 files the model can understand.
    cmd: python <script to transform json to hdf5> data/dbg_json data/dbg_wholesalers
    wdir: ../
    deps:
      - poetry.lock
      - .hashes/code.hash
      - data/dbg_json
      - data/fred/state_economics.json
      - data/fred/fred_signals_included.json
      - data/fred/national_economics.json
      - data/gas_price/output/state_gas_price.json
      - data/industry_trend/processed_trend.csv
      - data/state_region_mapping/state_region_mapping.json
    outs:
      - data/dbg_wholesalers:
          cache: true
          persist: true

@legendof-selda (Author) commented May 6, 2024

I think the issue is perhaps due to persist: true?

I assume that on pull, persisted stage outputs are downloaded. I have seen that stages set to persist: true require pulling the data first: if the output path's content has changed and persist is true, DVC doesn't skip the stage, like when I forget to run dvc checkout while switching branches.

With --pull, does it pull the outputs first, especially for persisted stages?
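
A hypothetical illustration of the branch-switching case mentioned above (the branch name is made up):

$ git checkout feature-branch   # forgot to run dvc checkout after switching
$ dvc repro                     # the persisted output no longer matches dvc.lock,
$                               # so DVC does not skip the stage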

@dberenbaum (Contributor)

Ah, you figured it out. persist=True is the issue. When persist=True, it creates a circular dependency, where the output depends not only on the listed dependencies, but also on the existing output. DVC has no way of tracking the contents of the previous iteration of that output, so it does not save that stage to the run-cache. If anything, I think the bug may be that DVC ever skips a stage with persist=True.
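
A hypothetical sequence showing the consequence, using the download_full_parquet stage from the dvc.yaml above:

$ dvc repro                # stage runs; because its output has persist: true,
$                          # no entry is saved to the run-cache
$ rm -rf data/full_output  # simulate an empty workspace (e.g. a fresh clone)
$ dvc repro --pull         # no run-cache entry to restore, so the stage reruns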

@legendof-selda (Author)

So is this by design? I assume that for dvc repro --pull we need to pull the outputs that the workflow depends on. If persist is true, it should check out the cache entry that is currently in dvc.lock. This issue doesn't pop up if I do a normal dvc pull and then dvc repro.

dberenbaum added the bug, p2-medium, A: run-cache, and A: pipelines labels and removed the awaiting response label on May 9, 2024
@dberenbaum (Contributor)

so is this by design?

I think so, although I agree it's confusing. There are two ways that dvc can skip a stage:

  1. If the workspace hasn't changed from what's in dvc.lock.
  2. If the stage can be restored from the run-cache (it has been run before and the data is in the cache).

In dvc repro --pull, 1 fails for all your stages because there is nothing in your workspace. This is the difference from doing dvc pull first. At the time dvc checks whether the stage has changed, there is nothing in the workspace yet, so it looks like the data has been deleted.

However, 2 succeeds for all your stages except for the persist one since that one was never saved to the run-cache for the reasons mentioned above.

I think what you want is dvc repro --pull --allow-missing. That will make 1 succeed when the data is missing. Maybe it makes sense for --pull to imply --allow-missing.
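
A sketch of the suggested invocation (same pipeline as above):

$ dvc repro --pull --allow-missing   # 1 now succeeds for unchanged stages even though
$                                    # their outputs are missing from the workspace,
$                                    # so only genuinely changed stages rerun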

dberenbaum added the awaiting response label and removed the bug label on May 10, 2024
@legendof-selda (Author)

Isn't --allow-missing for ignoring the "data doesn't exist in remote cache" error?

For the 2nd case: if the workspace is empty, pull the data for the given hash key. If the workspace contains different data, either rerun the stage or raise an error saying the data differs, and provide a command-line flag or prompt to force-pull the persisted stages. This would be similar to how dvc pull behaves when there is data not tracked by dvc (like when a process is killed midway and leaves dvc changes that aren't tracked).

@dberenbaum (Contributor)

Isn't --allow-missing for ignoring the "data doesn't exist in remote cache" error?

Unfortunately, it is used to mean different things under different commands today. This is not the meaning for repro.

if the workspace is empty, pull the data for the given hash key

This is what --allow-missing does, except that it skips the stage entirely and doesn't waste time pulling the data.
