Fix: DVC Not Working — Remote Push Errors, Pipeline DAG Issues, and Git Integration

FixDevs

Quick Answer

How to fix DVC errors — dvc push authentication failed, dvc pull file missing, pipeline stage not reproducing, cache out of disk space, dvc add vs dvc stage, conflict with git LFS, and S3/GCS remote setup.

The Error

You set up DVC and try to push to a remote — authentication fails:

$ dvc push
ERROR: failed to push data to the cloud - Authentication failed
unable to access S3

Or you clone a repo with DVC files and the data is missing:

$ dvc pull
WARNING: No remote provided
ERROR: failed to pull data from the cloud

Or a pipeline stage refuses to reproduce despite changed code:

$ dvc repro
Stage 'train' didn't change, skipping
# But you just edited train.py!

Or the cache fills up the disk:

$ dvc add data/large_file.csv
ERROR: unable to write file to cache: [Errno 28] No space left on device

Or dvc add and git add interact badly:

$ dvc add data/large_file.csv
$ git add data/large_file.csv   # Mistake — adds the large file to git
$ git push
remote: error: File large_file.csv exceeds GitHub's file size limit

DVC (Data Version Control) brings Git-like workflows to large data files and ML pipelines. Files live in a cache and remote storage; Git tracks small .dvc metadata files instead of the data itself. The model is powerful for ML reproducibility but creates specific failure modes around remote configuration, pipeline DAGs, and the dual Git/DVC interaction. This guide covers each.

Why This Happens

DVC stores actual file contents in a content-addressed cache (.dvc/cache/). Git tracks tiny .dvc pointer files (a few hundred bytes each) that contain hashes. The cache pushes to remote storage (S3, GCS, Azure, SSH). When you dvc pull, DVC reads the hashes from .dvc files and downloads matching content from the remote.
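To make "content-addressed" concrete, here is a minimal Python sketch of the idea: hash the file's bytes, then derive a cache location from the hash. The helper names are hypothetical, and the `files/md5/<first two characters>/<rest>` layout is an assumption based on DVC 3.x; older versions lay the cache out differently, and directories are handled with an extra manifest that this sketch ignores.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash file contents in chunks so large files never load fully into memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path_for(digest: str, cache_root: Path = Path(".dvc/cache")) -> Path:
    """Map a hash to a cache location (assumed DVC 3.x layout, not read from DVC itself)."""
    return cache_root / "files" / "md5" / digest[:2] / digest[2:]
```

Two files with identical bytes map to the same cache entry, which is why renaming or copying a tracked file costs no extra storage.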

The pipeline DAG is defined in dvc.yaml. DVC tracks each stage’s dependencies (code files, data inputs, parameters) and caches outputs. A stage is “fresh” if all dependencies are unchanged — DVC computes hashes of every dep, not just timestamps. Confusion happens when a dep change isn’t picked up because it’s not declared in dvc.yaml.

Fix 1: Adding Data Files

# Add a single file
dvc add data/raw/dataset.csv

# DVC creates:
# - data/raw/dataset.csv.dvc  (small metadata file — commit this to git)
# - .gitignore entry for data/raw/dataset.csv  (auto-added)
# - File moved to cache, symlinked/copied back

git add data/raw/dataset.csv.dvc data/raw/.gitignore
git commit -m "Track dataset with DVC"

Add an entire directory:

dvc add data/raw/
# Creates data/raw.dvc instead of one file per element

Common Mistake: Running git add data/raw/dataset.csv after dvc add. DVC adds the file to .gitignore automatically, but if you’ve previously committed the file or use git add -f, you bypass the ignore and push the actual data to git. Always check git status after dvc add — only the .dvc file and .gitignore change should appear staged.
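If you want to automate that check, for example in a pre-commit hook, something like the following sketch could flag the problem. It is a plain function with a hypothetical name; it assumes you feed it the staged paths (for instance from `git diff --cached --name-only`) and the list of .dvc files in the repo.

```python
from typing import Iterable, List


def find_double_tracked(staged: Iterable[str], dvc_files: Iterable[str]) -> List[str]:
    """Return staged paths that DVC already tracks and that should stay out of Git."""
    # "data/raw/dataset.csv.dvc" tracks "data/raw/dataset.csv";
    # "data/raw.dvc" tracks the directory "data/raw".
    dvc_outputs = {p[: -len(".dvc")] for p in dvc_files if p.endswith(".dvc")}
    flagged = []
    for path in staged:
        for out in dvc_outputs:
            # Flag the output itself or anything inside a tracked directory.
            if path == out or path.startswith(out.rstrip("/") + "/"):
                flagged.append(path)
                break
    return flagged
```

An empty result means only metadata is staged. Anything returned should be unstaged with git reset before you commit.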

Verify what’s tracked:

dvc status                  # Compare the workspace against .dvc files and dvc.lock
dvc list . -R --dvc-only    # List only DVC-tracked files, recursively
git status                  # Should show only .dvc and .gitignore changes (no large files)

Updating tracked data:

# Edit the file
echo "new row" >> data/raw/dataset.csv

# Re-add — updates the .dvc file with new hash
dvc add data/raw/dataset.csv

# Commit the updated .dvc file
git add data/raw/dataset.csv.dvc
git commit -m "Update dataset"

Fix 2: Remote Storage Setup

# Configure S3 remote
dvc remote add -d myremote s3://my-bucket/dvc-storage

# Configure GCS remote
dvc remote add -d myremote gs://my-bucket/dvc-storage

# Configure Azure
dvc remote add -d myremote azure://my-container/dvc-storage

# SSH remote (your own server)
dvc remote add -d myremote ssh://[email protected]/path/to/storage

# Local network share
dvc remote add -d myremote /mnt/shared/dvc-storage

# Google Drive (free, slow)
dvc remote add -d myremote gdrive://YOUR_FOLDER_ID

-d makes it the default remote.

Authentication for cloud remotes:

# AWS S3 — uses standard AWS credentials
# (~/.aws/credentials, AWS_ACCESS_KEY_ID env vars, IAM instance profile)
aws configure   # If not done

# Or set explicitly per remote (use --local for both values)
dvc remote modify myremote access_key_id YOUR_KEY --local
dvc remote modify myremote secret_access_key YOUR_SECRET --local
# --local stores in .dvc/config.local (gitignored)

GCS — uses Google Cloud SDK auth:

gcloud auth application-default login

# Or service account
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/key.json

Test connectivity:

dvc remote list
dvc status -c   # Show what differs between the local cache and the default remote

Pro Tip: Use --local flag for credentials that shouldn’t be committed:

dvc remote modify myremote access_key_id YOUR_KEY --local

This writes to .dvc/config.local instead of .dvc/config. The .local file is gitignored by default — credentials never reach git.
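A quick way to confirm nothing sensitive ended up in the committed config is to scan .dvc/config for credential-style keys. The sketch below is a simple line scan, not a real config parser, and the key list is illustrative rather than exhaustive (only access_key_id and secret_access_key come from this article).

```python
from typing import List

# Illustrative, non-exhaustive set of option names that usually hold secrets.
SECRET_KEYS = {
    "access_key_id",
    "secret_access_key",
    "session_token",
    "password",
    "connection_string",
    "sas_token",
}


def committed_secrets(config_text: str) -> List[str]:
    """Return the secret-looking option names found in the given config text."""
    found = []
    for line in config_text.splitlines():
        key, sep, _value = line.partition("=")
        if sep and key.strip() in SECRET_KEYS:
            found.append(key.strip())
    return found
```

Run it against the contents of .dvc/config (not .dvc/config.local). Any hit should be moved with dvc remote modify ... --local, and the exposed key rotated.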

Fix 3: Push, Pull, and Fetch

# Push tracked files to remote (after dvc add)
dvc push

# Pull tracked files from remote (after git clone)
dvc pull

# Fetch (download but don't checkout)
dvc fetch

# Check status
dvc status -c   # Compare local cache to remote
dvc status      # Compare workspace to local cache

Common errors:

ERROR: failed to push data to the cloud - 403 Forbidden

A 403 usually means the credentials are valid but lack write permission on the bucket or prefix. Check the IAM policy attached to the user or role DVC runs as.

ERROR: Unable to find DVC file with output 'data/raw/dataset.csv'

Run dvc add first to create the .dvc file.

WARNING: No data found in cache for path 'data/raw/dataset.csv'

The cache was cleared (or you’re on a fresh clone). Run dvc pull to download from remote.

Push specific files:

dvc push data/raw/dataset.csv
dvc push -r myremote data/processed/

Pull specific files:

dvc pull data/raw/dataset.csv
dvc pull -R data/   # All DVC-tracked targets under data/

Fix 4: Pipeline Stages with dvc.yaml

DVC pipelines define reproducible workflows. Each stage has commands, dependencies, and outputs:

# dvc.yaml
stages:
  prepare:
    cmd: python src/prepare.py data/raw/dataset.csv data/prepared/
    deps:
      - src/prepare.py
      - data/raw/dataset.csv
    outs:
      - data/prepared/

  train:
    cmd: python src/train.py data/prepared/ models/model.pkl
    deps:
      - src/train.py
      - data/prepared/
    params:
      - train.learning_rate
      - train.epochs
    outs:
      - models/model.pkl
    metrics:
      - metrics.json:
          cache: false

  evaluate:
    cmd: python src/evaluate.py models/model.pkl data/test/ evaluation.json
    deps:
      - src/evaluate.py
      - models/model.pkl
      - data/test/
    metrics:
      - evaluation.json:
          cache: false
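The dvc.yaml above implies a graph: one stage depends on another when one of its deps is the other stage's out. The Python sketch below rebuilds that graph from a hand-copied dictionary so you can see which stages sit upstream or downstream of a given one. It matches paths by exact string only, which is a simplification; DVC's own resolution also handles nested paths.

```python
from typing import Dict, List, Set

# Hand-copied from the dvc.yaml above; only deps and outs matter for the graph.
STAGES: Dict[str, Dict[str, List[str]]] = {
    "prepare": {"deps": ["src/prepare.py", "data/raw/dataset.csv"], "outs": ["data/prepared/"]},
    "train": {"deps": ["src/train.py", "data/prepared/"], "outs": ["models/model.pkl"]},
    "evaluate": {"deps": ["src/evaluate.py", "models/model.pkl", "data/test/"], "outs": []},
}


def parents(stage: str) -> Set[str]:
    """Stages whose outputs this stage consumes directly."""
    deps = set(STAGES[stage]["deps"])
    return {name for name, spec in STAGES.items() if deps & set(spec["outs"])}


def upstream(stage: str) -> Set[str]:
    """The stage plus everything it depends on, directly or transitively."""
    seen = {stage}
    stack = [stage]
    while stack:
        for parent in parents(stack.pop()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def downstream(stage: str) -> Set[str]:
    """The stage plus every stage that depends on it."""
    return {name for name in STAGES if stage in upstream(name)}
```

Here upstream("train") contains prepare and train, while downstream("train") contains train and evaluate. The real graph is available from dvc dag.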

Run the pipeline:

dvc repro                      # Run only stages whose deps have changed
dvc repro -f                   # Force re-run all stages
dvc repro train                # Run train plus any upstream stages it depends on
dvc repro --downstream train   # Run train plus the stages that depend on it

Stage doesn’t re-run when expected:

$ dvc repro
Stage 'train' didn't change, skipping

This means DVC computed the same hash for all dependencies. Common causes:

  1. Dependency missing from dvc.yaml: you changed a file DVC isn’t tracking as a dep
  2. The edit was outside what DVC watches: changes to cmd, deps, and params are detected through dvc.lock, so check whether you edited something else
  3. Cached output is reused: even if you delete the output, DVC restores it from cache instead of re-running

Force re-run with -f:

dvc repro -f train   # Re-run regardless of cache state

Common Mistake: Forgetting to add a code file as a dependency. If prepare.py imports utils.py and you change utils.py, DVC won’t notice — utils.py isn’t in deps. Add it explicitly:

prepare:
  cmd: python src/prepare.py data/raw/ data/prepared/
  deps:
    - src/prepare.py
    - src/utils.py           # Add this
    - data/raw/
  outs:
    - data/prepared/

Or use deps: - src/ to track the whole directory (broader but safer).
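To catch missing code dependencies before they bite, you can inspect a script's imports and compare them with the stage's deps. The sketch below is one rough way to do it with Python's ast module. The function names are hypothetical, and it only looks at direct, absolute imports that resolve to a .py file next to the script, so packages, relative imports, and transitive imports are out of scope.

```python
import ast
from pathlib import Path
from typing import List, Set


def local_imports(script: Path) -> Set[str]:
    """Return sibling .py files that `script` imports by top-level module name."""
    tree = ast.parse(script.read_text())
    modules: Set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules.add(node.module.split(".")[0])
    found: Set[str] = set()
    for name in modules:
        candidate = script.parent / f"{name}.py"
        if candidate.exists():  # stdlib and third-party modules are skipped here
            found.add(candidate.as_posix())
    return found


def undeclared_deps(script: Path, declared: List[str]) -> List[str]:
    """Local imports of `script` that are missing from a stage's deps list."""
    return sorted(local_imports(script) - set(declared))
```

For the example above, undeclared_deps(Path("src/prepare.py"), ["src/prepare.py", "data/raw/"]) would report src/utils.py when run from the repo root.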

Fix 5: Parameters and Experiments

# params.yaml
train:
  learning_rate: 0.001
  epochs: 50
  batch_size: 32

prepare:
  test_split: 0.2
  random_seed: 42

Reference in dvc.yaml:

stages:
  train:
    cmd: python src/train.py
    deps: [src/train.py, data/prepared/]
    params: [train.learning_rate, train.epochs, train.batch_size]
    outs: [models/model.pkl]

Reading params in Python:

# src/train.py
import yaml

with open("params.yaml") as f:
    params = yaml.safe_load(f)["train"]

lr = params["learning_rate"]
epochs = params["epochs"]

Run experiments with different params:

# Override a param for one experiment
dvc exp run -S train.learning_rate=0.01 -S train.epochs=100

# Queue a small sweep (one experiment per value), then run the queue
dvc exp run --queue -S 'train.learning_rate=0.1,0.01,0.001'
dvc queue start

# List experiments
dvc exp show

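# Promote the best one to a git branch
# (the order of the two arguments differs between DVC 2.x and 3.x; confirm with dvc exp branch --help)
dvc exp branch exp-abc123 best-lr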

dvc exp show displays a table comparing experiments by metric — your model selection dashboard.
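When a sweep covers more than one parameter, it can help to generate the -S overrides programmatically. The following sketch expands a grid into one argument list per experiment. The function name is hypothetical and it only builds the arguments; running them is left to you.

```python
from itertools import product
from typing import Dict, List, Sequence


def override_sets(grid: Dict[str, Sequence[object]]) -> List[List[str]]:
    """Expand a parameter grid into one list of -S arguments per experiment."""
    names = list(grid)
    combos = product(*(grid[name] for name in names))
    return [
        [arg for name, value in zip(names, combo) for arg in ("-S", f"{name}={value}")]
        for combo in combos
    ]
```

Each returned list can be appended to a command such as ["dvc", "exp", "run", "--queue"] and passed to subprocess.run. Keep grids small: the number of experiments is the product of the list lengths.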

Fix 6: Avoiding Cache Bloat

$ dvc add data/large_file.csv
ERROR: unable to write file to cache: No space left on device

DVC’s cache grows with every version of every file. Large datasets multiply quickly.
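Before cleaning anything, it is worth measuring how large the cache actually is. This is a small standard-library sketch that sums file sizes under a directory and reports free disk space. It assumes the default .dvc/cache location, and the function names are made up for illustration.

```python
import os
import shutil
from pathlib import Path


def dir_size_bytes(root: Path) -> int:
    """Sum the sizes of regular files under root, ignoring symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            file_path = Path(dirpath) / name
            if not file_path.is_symlink():
                total += file_path.stat().st_size
    return total


def cache_report(cache_dir: Path = Path(".dvc/cache")) -> str:
    """One-line summary of cache size and free space on the cache's disk."""
    used = dir_size_bytes(cache_dir)
    free = shutil.disk_usage(cache_dir).free
    return f"cache: {used / 1e9:.2f} GB, free on disk: {free / 1e9:.2f} GB"
```

Running print(cache_report()) from the repo root before and after a cleanup shows how much space was reclaimed.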

Clean unused cache entries:

dvc gc -w                        # Keep only cache entries used by the current workspace
dvc gc -a -T -w                  # Also keep anything used by any branch or tag
dvc gc -a -T -w -c -r myremote   # Same, and also clean the remote (-c)

-w = current workspace; -a = all branches; -T = all tags; -c = also remove unused data from the remote given with -r.

Configure cache location if disk is full:

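# Point DVC at a cache directory on a larger disk
# (existing cache contents are not moved automatically; move them yourself if you need them)
dvc cache dir /mnt/large-disk/dvc-cache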

Shared cache for teams on the same machine:

dvc cache dir --global /shared/dvc-cache
dvc config cache.shared group --global

Sets the cache to a shared directory with group-writeable permissions.

Configure symlinks instead of copies (saves disk for large files):

dvc config cache.type symlink,hardlink,copy

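DVC tries symlink first, then hardlink, then copy. Both link types save space because the workspace file points at the cache instead of duplicating it, but linked files are read-only: run dvc unprotect before editing one in place. Hardlinks only work when the cache and the workspace are on the same filesystem; symlinks work across filesystems but break if the cache directory moves.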
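Whether the cache and the workspace share a filesystem affects which link types can work at all, since hardlinks cannot cross filesystem boundaries. The sketch below checks that with os.stat and suggests an ordering. The second function is only a heuristic of my own, not something DVC exposes.

```python
import os
from pathlib import Path
from typing import List


def same_filesystem(a: Path, b: Path) -> bool:
    """True when both paths live on the same device."""
    return os.stat(a).st_dev == os.stat(b).st_dev


def usable_link_types(workspace: Path, cache: Path) -> List[str]:
    """Suggest a cache.type order; hardlink is dropped when the devices differ."""
    if same_filesystem(workspace, cache):
        return ["symlink", "hardlink", "copy"]
    return ["symlink", "copy"]
```

Joining the result with commas gives a value you could pass to dvc config cache.type.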

Fix 7: Git Integration and .dvc Files

A .dvc file looks like this:

# data/raw/dataset.csv.dvc
outs:
- md5: a1b2c3d4e5f6789abc...
  size: 1234567
  path: dataset.csv

Small enough to commit to Git. The hash tells DVC which content to fetch from the cache or remote.

.gitignore rules DVC creates:

# .gitignore (created automatically by dvc add)
/data/raw/dataset.csv

Common Mistake: Editing the .dvc file by hand. The hash must match the file content exactly. If you edit .dvc to point at a different hash, DVC can’t find the content and dvc checkout fails. Always re-run dvc add to update the hash.

Restoring an old version:

# Check out an old git commit
git checkout abc123

# DVC files now point at old data — sync workspace
dvc checkout
# Or pull from remote if not in local cache
dvc pull

Common workflow for data updates:

# 1. Update the data
python prepare_new_data.py > data/raw/dataset.csv

# 2. Re-add to DVC (updates the .dvc hash)
dvc add data/raw/dataset.csv

# 3. Commit the .dvc file
git add data/raw/dataset.csv.dvc
git commit -m "Update dataset to v2"

# 4. Push to DVC remote
dvc push

# 5. Push to Git
git push

Fix 8: CI/CD with DVC

# .github/workflows/train.yml
name: Train Model

on:
  push:
    branches: [main]
  workflow_dispatch:

jobs:
  train:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - run: pip install dvc[s3]
      - run: pip install -r requirements.txt

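      - name: Pull data
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: dvc pull

      - name: Run pipeline
        run: dvc repro

      # The push step needs the same credentials; step-level env is not inherited
      - name: Push artifacts
        if: success()
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: dvc push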

      - name: Show metrics
        run: dvc metrics show

dvc[s3] — install the S3 storage backend. Use dvc[gs] for GCS, dvc[azure] for Azure, dvc[ssh] for SSH.

Cache the DVC cache between runs:

- uses: actions/cache@v4
  with:
    path: .dvc/cache
    key: dvc-${{ hashFiles('**/*.dvc', 'dvc.lock') }}

Saves time on repeated runs that share data.

Platform Differences and Backend Choices

DVC’s storage layer is pluggable: the same dvc add and dvc push commands work against half a dozen backends, but each has its own auth model, latency profile, and failure mode. The orchestration layer also overlaps with MLflow and Metaflow — knowing where DVC fits prevents tool sprawl.

Local backend. A directory on disk (/mnt/shared/dvc-storage) or an NFS mount. Fastest for a single workstation; useful for teams sharing an NFS server. Atomicity is the catch — DVC writes content-addressed files, but if two processes push the same hash simultaneously over NFS, you can hit truncated reads on lockless mounts. Use a real cloud backend the moment you have more than one machine.

Amazon S3. The most common production choice. Auth uses standard AWS credentials (instance profile on EC2, OIDC in GitHub Actions, ~/.aws/credentials locally). Watch the egress cost: dvc pull of a 10GB dataset across regions adds up, so keep the bucket in the same region as your compute. If DVC picks the wrong region, set it explicitly with dvc remote modify myremote region us-east-1. The endpointurl option is mainly for S3-compatible services such as MinIO, not for AWS itself.

Google Cloud Storage. Auth via gcloud auth application-default login or GOOGLE_APPLICATION_CREDENTIALS. GCS object listings are strongly consistent, so dvc fetch sees objects as soon as dvc push has written them. Pin compute and bucket to the same region to keep latency and egress costs down.

Azure Blob Storage. Auth via AZURE_STORAGE_CONNECTION_STRING, SAS tokens, or Azure CLI login. Configure: dvc remote add -d myremote azure://container/prefix. If transfers of very large files are slow, raise the number of parallel jobs with dvc push -j 16, or persist the setting with dvc remote modify myremote jobs 16.

SSH backend. Push to your own server over SSH (ssh://user@host/path). Useful for on-prem or air-gapped environments. Use key-based auth in CI, where nobody is present to answer a password prompt. Throughput depends on the server and network, so it is often slower than a cloud backend; fine for a small team.

Git LFS comparison. Git LFS stores files alongside your Git provider (GitHub LFS, GitLab LFS) with hard quotas and bandwidth limits. DVC stores files in your own cloud bucket with no quota beyond what you pay your cloud provider for. For ML projects with datasets larger than a few GB, DVC wins on cost. Git LFS is simpler if your data fits in the free tier and you don’t need pipelines.

DVC vs MLflow vs Metaflow — different layers. DVC is data + pipelines + experiments-as-Git-branches. MLflow is experiment tracking + model registry + serving. Metaflow is workflow orchestration with built-in step execution (often on AWS Batch). They overlap on “experiment tracking” but the philosophies differ: DVC ties experiments to Git commits, MLflow stores them in a separate database, Metaflow records them in S3 as flow runs. Common stacks: DVC for data and pipelines, MLflow for the registry, Metaflow only if you need cloud-native step execution.

GitHub Actions integration. Runners are ephemeral, so DVC’s cache is empty on every run unless you restore it. Cache .dvc/cache keyed on dvc.lock hash, then dvc pull only fetches what’s missing. For private remotes, store cloud credentials in repository secrets and pass them as env vars to the step — never commit them to .dvc/config. Use dvc[s3] (or your backend variant) in the install step, not bare dvc.

Monorepo with multiple dvc.yaml files. DVC supports subdirectories via the global --cd <path> option or by running commands from inside the subdirectory. Each subdirectory gets its own dvc.yaml and dvc.lock. The cache is shared at the repo root unless you split it via dvc cache dir. For a monorepo with three ML projects, run pipelines as cd projects/recsys && dvc repro: this keeps the lock files local to each project and avoids merge conflicts on a single root dvc.lock.
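In a monorepo it is easy to lose track of which directories hold pipelines. This sketch walks a repo, finds every dvc.yaml outside .dvc and .git, and builds the per-project command described above. The function names are hypothetical.

```python
from pathlib import Path
from typing import List


def pipeline_dirs(repo_root: Path) -> List[Path]:
    """Directories under repo_root that contain a dvc.yaml, skipping DVC/Git internals."""
    skip = {".dvc", ".git"}
    dirs = []
    for dvc_yaml in sorted(repo_root.rglob("dvc.yaml")):
        relative = dvc_yaml.relative_to(repo_root)
        if not skip & set(relative.parts):
            dirs.append(dvc_yaml.parent)
    return dirs


def repro_commands(repo_root: Path) -> List[str]:
    """One shell command per pipeline directory, relative to the repo root."""
    return [
        f"cd {d.relative_to(repo_root).as_posix()} && dvc repro"
        for d in pipeline_dirs(repo_root)
    ]
```

The output is just a list of strings, so you can print it for review or feed it to a CI matrix.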

Windows-specific gotchas. Symlinks require either developer mode or admin shell. Without them, DVC falls back to copying, which doubles disk usage. Enable developer mode in Settings → For developers, then dvc config cache.type symlink,hardlink,copy. Path separators on Windows can confuse stage definitions — keep dvc.yaml paths forward-slashed, and DVC normalizes them per OS.

Still Not Working?

DVC vs Git LFS

  • DVC — Designed for ML workflows: pipelines, experiments, metrics, parameters. Works with any storage (S3, GCS, SSH).
  • Git LFS — Generic large file storage in Git. Simpler model, fewer features. Tied to your Git hosting provider.

For ML projects, DVC’s pipeline/experiment tracking is the key differentiator. For binary asset versioning in a non-ML project, Git LFS is simpler.

dvc.lock File

DVC (since version 1.0) uses dvc.lock (similar to package-lock.json in npm) to record the exact hashes of each stage's deps and outs. Commit this file:

git add dvc.lock dvc.yaml
git commit -m "Update pipeline"

Without dvc.lock, collaborators may get different pipeline results from the same code. Always commit it.

Integration with MLflow / Weights & Biases

DVC handles data and pipelines; MLflow/W&B handle experiment metadata. They complement:

  • Use DVC for data versioning and reproducible pipelines
  • Use MLflow for model registry and experiment metric tracking
  • Use both — log to MLflow inside DVC pipeline stages

For MLflow-specific patterns, see MLflow not working. For W&B integration, see Weights & Biases not working.

Studio and CML for Collaboration

DVC Studio (web UI) and CML (continuous machine learning for CI/CD comments) are separate paid/free products from Iterative. Useful for team workflows but not required for basic DVC use.

Working with Pandas / Polars

DVC stores file contents, but reading them depends on your code. For pandas DataFrame operations on DVC-tracked files, see pandas SettingWithCopyWarning.

Git Performance with Many .dvc Files

A repo with thousands of .dvc files has the same git performance characteristics as a repo with thousands of text files. Use dvc add on directories instead of individual files when possible to reduce .dvc file count.

For git-specific issues that affect DVC’s workflow, see git fatal not a git repository.

Reproducing Across Different Hardware

dvc repro recomputes hashes deterministically from file content, so the inputs are reproducible. But the outputs of a training stage often differ across machines: a GPU run produces slightly different model weights than a CPU run, mixed-precision training varies by hardware, and even the same GPU model with a different driver can shift floating-point output. Don’t expect bit-identical model files from dvc repro on a different machine. Pin random seeds in your training code, log key metrics with dvc metrics, and treat hash mismatches on training outputs as expected — review the metric diff, not the binary diff.
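Reviewing the metric diff can be as simple as comparing two metrics dictionaries with a tolerance. Here is a minimal sketch using math.isclose. The 0.1% relative tolerance is an arbitrary assumption, and the metric names in the usage note are invented.

```python
import math
from typing import Dict


def metric_drift(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    rel_tol: float = 1e-3,
) -> Dict[str, float]:
    """Return metrics that differ by more than rel_tol, mapped to candidate - baseline."""
    drift = {}
    for name in sorted(baseline.keys() & candidate.keys()):
        if not math.isclose(baseline[name], candidate[name], rel_tol=rel_tol):
            drift[name] = candidate[name] - baseline[name]
    return drift
```

Comparing {"accuracy": 0.9120, "loss": 0.3050} with {"accuracy": 0.9121, "loss": 0.3500} flags only loss, since the accuracy change falls inside the tolerance. The built-in dvc metrics diff gives the raw differences without a threshold.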

Pipeline Stage Skipped When You Expected a Re-Run

Three things make DVC skip a stage: (1) all deps hash the same as the lock file, (2) all params resolve to the same values, (3) the cmd string is identical. If a stage refuses to re-run after you “obviously” changed something, the change is probably outside the declared deps. Run dvc status first — it shows exactly which dep changed. If dvc status is clean, your edit didn’t touch a tracked file. Either add the file to deps, or use dvc repro -f stage_name to force.
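The three conditions can be written down as a small comparison between the current state of a stage and what the lock file recorded. This Python sketch is only a model of that logic with made-up hash values and a hypothetical function name; it is not how DVC is implemented internally.

```python
from typing import Any, Dict, List


def reasons_to_rerun(current: Dict[str, Any], locked: Dict[str, Any]) -> List[str]:
    """List the differences between a stage's current state and its locked state."""
    reasons = []
    if current["cmd"] != locked["cmd"]:
        reasons.append("cmd changed")
    for kind in ("deps", "params"):
        now, before = current[kind], locked[kind]
        for name in sorted(set(now) | set(before)):
            if now.get(name) != before.get(name):
                reasons.append(f"{kind[:-1]} changed: {name}")
    return reasons
```

An empty list corresponds to "didn't change, skipping". A file that is not listed under deps never appears in either dictionary, so editing it cannot produce a reason to re-run.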

Sharing Experiments Without Pushing Branches

dvc exp saves experiments as custom Git refs (under refs/exps), not as branches. To share an experiment without polluting git branch -a, push it with dvc exp push origin <exp-name>; teammates retrieve it with dvc exp pull origin <exp-name>. Only promoted experiments (dvc exp branch) become real branches.

FixDevs

Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.
