Fix: MLflow Not Working — Tracking URI, Artifact Store, and Model Registry Errors
Quick Answer
How to fix MLflow errors — no tracking server, artifact path not accessible, model version not found, experiment not found, MLFLOW_TRACKING_URI not set, autolog not recording metrics, and MLflow UI showing no runs.
The Error
You log metrics but the MLflow UI is empty — no runs, no experiments:
```python
mlflow.log_metric("accuracy", 0.94)
# No error, but nothing appears in the UI
```

Or you try to load a registered model and get a version error:

```text
MlflowException: Registered Model with name='my_model' not found.
```

Or artifact logging fails with a path error:

```text
MlflowException: API request to http://localhost:5000/api/2.0/mlflow/runs/log-artifact failed with exception HTTPConnectionPool: Max retries exceeded
```

Or you run mlflow ui and it starts, but all your runs from the training script are missing.
MLflow separates tracking (metrics, parameters, tags), artifacts (files, models), and the model registry into three layers — each with its own storage backend. Misconfiguring any one of them produces silent failures or confusing errors. This guide covers all three.
Why This Happens
MLflow defaults to storing everything in a local ./mlruns directory relative to where you run your Python script. The MLflow UI, when launched with mlflow ui, looks in ./mlruns relative to where that command runs. If your training script and mlflow ui run from different directories, they use different mlruns folders and never see each other’s data.
The fix for any environment beyond a single local machine is to set MLFLOW_TRACKING_URI explicitly and point everything — training scripts, the UI, and any code that loads models — at the same backend.
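One defensive pattern is to fail fast when the variable is missing instead of silently logging to a local directory nobody looks at. A minimal sketch — the helper name `require_tracking_uri` is ours, not an MLflow API:

```python
import os


def require_tracking_uri() -> str:
    # Fail fast instead of silently writing to ./mlruns relative to
    # wherever the script happened to be launched from
    uri = os.environ.get("MLFLOW_TRACKING_URI")
    if not uri:
        raise RuntimeError(
            "MLFLOW_TRACKING_URI is not set — metrics would go to a local "
            "./mlruns directory that the UI may never see"
        )
    return uri
```

Call it once at the top of a training script before any MLflow logging.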
Fix 1: Runs Not Appearing in the UI
The most common MLflow issue: training runs successfully but the UI shows “No experiments found.”
Root cause: tracking URI mismatch. Check where your script is logging vs. where the UI is reading:
```python
import mlflow

# Where is your script logging?
print(mlflow.get_tracking_uri())
# Default: file:///path/to/your/script/directory/mlruns
```

```bash
# Where is the UI reading from?
mlflow ui
# Reads from ./mlruns in the CURRENT DIRECTORY
```

If you ran python train.py from /home/user/project/ and mlflow ui from /home/user/, they use completely different mlruns directories.
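When you do use a local file store, an absolute file:// URI sidesteps the relative-path trap entirely. A small stdlib sketch — the helper name `local_store_uri` is ours:

```python
from pathlib import Path


def local_store_uri(path: str = "mlruns") -> str:
    # Resolve to an absolute file:// URI so the training script and
    # `mlflow ui` agree on the store no matter where each is launched from
    return Path(path).resolve().as_uri()


print(local_store_uri())  # an absolute URI such as file:///home/user/project/mlruns
```

Pass the result to `mlflow.set_tracking_uri()` or export it as `MLFLOW_TRACKING_URI`.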
Fix — set MLFLOW_TRACKING_URI consistently:
```bash
# Set once in your shell — all MLflow calls in this session use this path
export MLFLOW_TRACKING_URI=file:///home/user/project/mlruns

# Now run training and UI from anywhere — they all point to the same store
python train.py
mlflow ui  # http://localhost:5000
```

Or set it in your training script:
```python
import mlflow

# Set before any logging — absolute path avoids directory confusion
mlflow.set_tracking_uri("file:///home/user/project/mlruns")

with mlflow.start_run():
    mlflow.log_metric("accuracy", 0.94)
```

Or use a tracking server (the right approach for teams):
```bash
# Start the tracking server — stores run data in mlflow.db, serves the UI on port 5000
mlflow server \
  --backend-store-uri sqlite:///mlflow.db \
  --default-artifact-root ./mlartifacts \
  --host 0.0.0.0 \
  --port 5000
```

```python
import mlflow

# Point all scripts at the server
mlflow.set_tracking_uri("http://localhost:5000")

with mlflow.start_run():
    mlflow.log_metric("loss", 0.12)
```

Fix 2: Experiment Not Found or Runs Go to Wrong Experiment
```text
mlflow.exceptions.MlflowException: Experiment '0' does not exist.
mlflow.exceptions.MlflowException: Could not find experiment with name 'my_experiment'
```

By default, MLflow logs to an experiment called “Default” (ID 0). If you reference an experiment that doesn’t exist, the call fails. Runs from different scripts mixed into “Default” also make the UI hard to read.
Create and set an experiment before logging:
```python
import mlflow

# Set by name — creates the experiment if it doesn't exist
mlflow.set_experiment("fraud_detection_v2")

# Explicit create-or-get pattern (more control over location and tags)
experiment_name = "fraud_detection_v2"
experiment = mlflow.get_experiment_by_name(experiment_name)
if experiment is None:
    mlflow.create_experiment(
        name=experiment_name,
        artifact_location="s3://my-bucket/mlflow/fraud_detection_v2",  # Optional
        tags={"team": "data-science", "project": "fraud"},
    )
mlflow.set_experiment(experiment_name)

# Now all runs go to this experiment
with mlflow.start_run(run_name="xgboost_baseline"):
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("auc", 0.87)
```

Organize runs with tags for filtering in the UI:
```python
with mlflow.start_run(run_name="experiment_001") as run:
    mlflow.set_tags({
        "model_type": "gradient_boosting",
        "dataset_version": "v3",
        "environment": "dev",
    })
    mlflow.log_params({"n_estimators": 100, "max_depth": 5, "learning_rate": 0.1})
    mlflow.log_metric("accuracy", 0.91)
    mlflow.log_metric("f1_score", 0.88)
    print(f"Run ID: {run.info.run_id}")
```

Log metrics over time (training curves):
```python
with mlflow.start_run():
    for epoch in range(100):
        train_loss = train_one_epoch()
        val_loss = validate()
        # The step parameter creates a time series in the UI
        mlflow.log_metric("train_loss", train_loss, step=epoch)
        mlflow.log_metric("val_loss", val_loss, step=epoch)
```

Fix 3: Artifact Logging Failures
```text
MlflowException: API request to .../log-artifact failed
MlflowException: No such file or directory: '/tmp/mlflow-artifacts/...'
OSError: [Errno 2] No such file or directory
```

MLflow stores artifacts (model files, plots, feature importance charts) in an artifact store that’s separate from the metrics database. Mismatches between where the server expects artifacts and where the client tries to write them cause these errors.
Log files as artifacts:
```python
import json

import matplotlib.pyplot as plt
import mlflow

with mlflow.start_run():
    # Log a single file
    mlflow.log_artifact("feature_importance.csv")

    # Log a file to a subdirectory in the artifact store
    mlflow.log_artifact("confusion_matrix.png", artifact_path="plots")

    # Log an entire directory
    mlflow.log_artifacts("./output/", artifact_path="model_outputs")

    # Log in-memory content by writing a temp file first
    config = {"model": "xgboost", "version": "1.0"}
    with open("/tmp/config.json", "w") as f:
        json.dump(config, f)
    mlflow.log_artifact("/tmp/config.json")

    # Log a matplotlib figure directly
    fig, ax = plt.subplots()
    ax.plot([1, 2, 3], [0.8, 0.85, 0.9], label="accuracy")
    ax.legend()
    mlflow.log_figure(fig, "training_curve.png")  # MLflow 1.13+
    plt.close(fig)
```

Artifact store configuration for remote backends:
```bash
# S3 backend
mlflow server \
  --backend-store-uri postgresql://user:pass@host/mlflow \
  --default-artifact-root s3://my-mlflow-bucket/artifacts \
  --host 0.0.0.0

# Google Cloud Storage
mlflow server \
  --default-artifact-root gs://my-mlflow-bucket/artifacts

# Azure Blob Storage
mlflow server \
  --default-artifact-root wasbs://[email protected]/artifacts
```

When using S3, the training machine must have AWS credentials with write access to the artifact bucket — not just the MLflow server:
```bash
# The training script may run on a different machine than the MLflow server —
# that machine needs S3 write access too
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
# Or use an IAM instance profile if on EC2
```

Pro Tip: Use mlflow.log_artifact for final artifacts (trained model, evaluation report). Use mlflow.log_metric at every epoch for training curves. Save large intermediate files (preprocessed datasets, checkpoints) directly to S3/GCS — only log the final model through MLflow. The artifact store isn’t a general-purpose file system.
Fix 4: Model Logging and Loading Errors
```text
MlflowException: Run 'abc123' not found.
MlflowException: Model flavor 'sklearn' is not supported.
mlflow.exceptions.MlflowException: Registered Model with name='classifier' not found.
```

MLflow provides flavors for common frameworks. Each flavor knows how to save and load models in a way that preserves the prediction interface.
Log models using the correct flavor:
```python
import mlflow
import mlflow.sklearn
import mlflow.pytorch
import mlflow.tensorflow
import mlflow.xgboost

# scikit-learn
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)

with mlflow.start_run() as run:
    mlflow.sklearn.log_model(
        sk_model=model,
        artifact_path="model",
        registered_model_name="fraud_classifier",  # Registers in Model Registry
        input_example=X_train[:5],  # Captures input schema
    )

# PyTorch
with mlflow.start_run():
    mlflow.pytorch.log_model(
        pytorch_model=net,
        artifact_path="model",
        registered_model_name="image_classifier",
    )

# XGBoost
with mlflow.start_run():
    mlflow.xgboost.log_model(
        xgb_model=booster,
        artifact_path="model",
    )
```

Load a model by run ID:
```python
import mlflow

run_id = "abc123def456"

# Load by run ID and artifact path
model = mlflow.sklearn.load_model(f"runs:/{run_id}/model")

# Load as a generic Python function (works for any flavor)
pyfunc_model = mlflow.pyfunc.load_model(f"runs:/{run_id}/model")
predictions = pyfunc_model.predict(X_test)
```

Load a model from the Model Registry:
```python
import mlflow.pyfunc

# Load a specific version
model = mlflow.pyfunc.load_model("models:/fraud_classifier/3")

# Load by stage (Production, Staging, Archived)
model = mlflow.pyfunc.load_model("models:/fraud_classifier/Production")

# MLflow 2.x: use aliases instead of deprecated stages
model = mlflow.pyfunc.load_model("models:/fraud_classifier@champion")
```

MLflow 2.x deprecated model stages (Production, Staging) in favor of aliases. If you’re on MLflow 2.x and getting deprecation warnings:
```python
import mlflow.pyfunc
from mlflow import MlflowClient

client = MlflowClient()

# Old pattern — deprecated in 2.x
client.transition_model_version_stage(
    name="fraud_classifier",
    version="3",
    stage="Production",  # DeprecationWarning
)

# New pattern — set an alias
client.set_registered_model_alias(
    name="fraud_classifier",
    alias="champion",
    version="3",
)

# Load by alias
model = mlflow.pyfunc.load_model("models:/fraud_classifier@champion")
```

Fix 5: autolog Not Recording Metrics
```python
mlflow.sklearn.autolog()
# Model trains, run appears in UI, but params/metrics are empty
```

autolog hooks into framework training methods to automatically capture parameters, metrics, and models. If nothing is recorded, autolog was usually enabled after the training call it should have hooked, or the training entry point isn’t one autolog instruments (for example, a custom training loop that never calls fit()).
Call autolog before any model initialization:
```python
import mlflow

# WRONG — autolog after model creation may miss constructor params
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(n_estimators=100)
mlflow.sklearn.autolog()  # Too late — already initialized

# CORRECT — autolog before everything
mlflow.sklearn.autolog()  # Hook in first

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

model = GradientBoostingClassifier(n_estimators=100, max_depth=3)

# fit() triggers the autolog hooks
model.fit(X_train, y_train)
```

Enable autolog for all supported frameworks at once:
```python
import mlflow

mlflow.autolog(
    log_input_examples=True,    # Log sample inputs
    log_model_signatures=True,  # Log input/output schema
    log_models=True,            # Log the model artifact
    disable=False,
    exclusive=False,
    disable_for_unsupported_versions=False,
    silent=False,
)
```

Disable autolog for specific frameworks if it conflicts with your manual logging:
```python
mlflow.sklearn.autolog(disable=True)
```

PyTorch Lightning autolog — must be set before the Trainer:
```python
import mlflow
import pytorch_lightning as pl

mlflow.pytorch.autolog()  # Before Trainer instantiation

trainer = pl.Trainer(max_epochs=10)
trainer.fit(model, train_loader, val_loader)
# MLflow captures train_loss and val_loss per epoch automatically
```

Common Mistake: Using mlflow.autolog() inside a with mlflow.start_run(): block and expecting it to capture the outer run. autolog creates its own nested run for framework calls. For clean logging, set autolog before start_run, or use manual logging inside the with block and skip autolog.
Fix 6: MLflow Server Setup for Teams
Running mlflow ui locally only works for solo development. For teams, you need a tracking server with a real database backend.
Minimal production setup with PostgreSQL + S3:
```bash
# Install MLflow with extras
pip install mlflow[extras] psycopg2-binary boto3

# Start the server
mlflow server \
  --backend-store-uri "postgresql://mlflow_user:password@db-host:5432/mlflow" \
  --default-artifact-root "s3://company-mlflow/artifacts" \
  --host 0.0.0.0 \
  --port 5000 \
  --workers 4
```

Docker Compose setup for local team development:
```yaml
# docker-compose.yml
version: "3.8"

services:
  mlflow-db:
    image: postgres:15
    environment:
      POSTGRES_USER: mlflow
      POSTGRES_PASSWORD: mlflow
      POSTGRES_DB: mlflow
    volumes:
      - mlflow-db-data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "mlflow"]
      interval: 5s
      retries: 5

  mlflow-server:
    image: python:3.12-slim
    depends_on:
      mlflow-db:
        condition: service_healthy
    command: >
      bash -c "pip install mlflow psycopg2-binary &&
      mlflow server
      --backend-store-uri postgresql://mlflow:mlflow@mlflow-db:5432/mlflow
      --default-artifact-root /mlartifacts
      --host 0.0.0.0
      --port 5000"
    ports:
      - "5000:5000"
    volumes:
      - mlflow-artifacts:/mlartifacts

  training:
    build: .
    environment:
      MLFLOW_TRACKING_URI: http://mlflow-server:5000
    depends_on:
      - mlflow-server

volumes:
  mlflow-db-data:
  mlflow-artifacts:
```

For service dependency and health check patterns in this Docker Compose setup, see docker-compose depends_on not working.
Initialize the database schema on first run:
```bash
mlflow db upgrade postgresql://mlflow_user:password@db-host:5432/mlflow
```

Without running db upgrade, the server fails to start with a schema error.
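Once the server is up, training clients can probe it before logging anything — the tracking server exposes a /health endpoint. A minimal stdlib sketch; the function name `server_reachable` is ours:

```python
import urllib.error
import urllib.request


def server_reachable(tracking_uri: str, timeout: float = 5.0) -> bool:
    # The MLflow tracking server answers GET /health with 200 OK
    try:
        url = f"{tracking_uri.rstrip('/')}/health"
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


print(server_reachable("http://localhost:5000"))
```

Running this check at startup turns a confusing "Max retries exceeded" mid-training failure into an immediate, obvious one.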
Fix 7: Version Compatibility Issues
MLflow has fast release cycles. Mixing MLflow versions between the training environment and the serving/loading environment causes compatibility errors:
```text
MlflowException: The model's mlflow version (2.8.0) is incompatible with the current version (2.12.0).
mlflow.exceptions.MlflowException: Unsupported format for model artifacts.
```

Check versions across your environments:
```bash
# Training environment
python -c "import mlflow; print(mlflow.__version__)"

# Serving environment
mlflow --version
```

Pin MLflow in requirements.txt to avoid drift:
```text
mlflow==2.14.0
mlflow-skinny==2.14.0  # Lighter install without server dependencies
```

mlflow-skinny is the client-only package — it doesn’t install the server, UI, or heavyweight dependencies. Use it in training environments and containers where you only need to log, not serve:
```bash
# Training container — only needs to log metrics
pip install mlflow-skinny boto3

# Serving container — needs full MLflow
pip install mlflow
```

MLflow model format evolution — models logged with older MLflow versions can usually be loaded by newer versions, but not vice versa. If you must load an old model with new MLflow, check the MLmodel file in the artifact:
```bash
# Inspect the model metadata
cat mlartifacts/0/run_id/artifacts/model/MLmodel
# Shows: mlflow_version: 2.8.0, flavors, signature, etc.
```

Fix 8: Querying Runs Programmatically
The MLflow UI is useful for exploration, but production workflows need to query run history programmatically.
```python
import mlflow
from mlflow import MlflowClient

client = MlflowClient(tracking_uri="http://localhost:5000")

# Search runs in an experiment
experiment = client.get_experiment_by_name("fraud_detection_v2")
runs = client.search_runs(
    experiment_ids=[experiment.experiment_id],
    filter_string="metrics.auc > 0.85 AND params.model_type = 'xgboost'",
    order_by=["metrics.auc DESC"],
    max_results=10,
)

for run in runs:
    print(f"Run {run.info.run_id}: AUC={run.data.metrics['auc']:.4f}")

# Get the best run (results are already sorted by AUC descending)
best_run = runs[0]
print(f"Best run ID: {best_run.info.run_id}")
print(f"Best AUC: {best_run.data.metrics['auc']}")
print(f"Params: {best_run.data.params}")

# Load the best model
best_model = mlflow.sklearn.load_model(f"runs:/{best_run.info.run_id}/model")
```

Download artifacts from a run:
```python
from mlflow import MlflowClient

client = MlflowClient()

# List artifacts in a run
artifacts = client.list_artifacts(run_id, path="model")
for artifact in artifacts:
    print(artifact.path, artifact.is_dir, artifact.file_size)

# Download to a local path
local_path = client.download_artifacts(
    run_id=run_id,
    path="model",
    dst_path="/tmp/downloaded_model",
)
```

Compare runs with pandas:
```python
import mlflow

# Export all runs to a pandas DataFrame for analysis
df = mlflow.search_runs(
    experiment_names=["fraud_detection_v2"],
    filter_string="status = 'FINISHED'",
)

# Column names: params.*, metrics.*, tags.*, start_time, run_id, etc.
print(
    df[["params.n_estimators", "metrics.auc", "metrics.f1_score"]]
    .sort_values("metrics.auc", ascending=False)
)
```

Still Not Working?
MLflow with Remote Tracking and Local Artifacts
If your training runs on a cloud VM but you want artifacts available locally for debugging, log them to the server’s artifact store as usual, then download them after the run:
```python
import mlflow
from mlflow import MlflowClient

with mlflow.start_run() as run:
    # Artifact is stored at the server's artifact root by default
    mlflow.log_artifact("model.pkl", artifact_path="model")

# Download the artifact after the run
client = MlflowClient()
client.download_artifacts(run.info.run_id, "model", "/local/path")
```

Integration with Training Frameworks
For PyTorch training loop patterns that work well with MLflow logging (especially mixed precision and gradient accumulation), see PyTorch not working. For scikit-learn pipelines logged as MLflow models, see scikit-learn not working. For experiment notebooks where you prototype before moving to tracked training runs, see Jupyter not working.
MLflow on Kubernetes
The mlflow server command runs a single-process Flask app — not suitable for high-concurrency production workloads. For Kubernetes, use the official Helm chart or a managed service (Databricks Managed MLflow, AWS SageMaker MLflow Tracking). The server should run behind a load balancer with multiple --workers (Gunicorn workers) for parallel request handling.
Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.