Fix: scalene Not Working — Web UI, GPU Profiling, and AI Suggestion Errors
Quick Answer
How to fix scalene errors — scalene command not found, web UI port conflict, no GPU detected, profile.json empty, AI optimize requires OpenAI key, native code not attributed, and Jupyter integration.
The Error
You install scalene and the command isn’t found:
$ scalene my_script.py
bash: scalene: command not found
Or the web UI fails to open:
$ scalene my_script.py
# Profile completes but web UI doesn't open in browser
Or GPU profiling shows nothing:
$ scalene --gpu my_script.py
# Profile shows CPU and memory but GPU column is all zeros
Or AI optimization suggestions don’t work:
$ scalene --cli my_script.py
# Output mentions "ask AI" but no actual suggestions appear
Or native code (NumPy, PyTorch) shows as opaque blocks:
# my_script.py
import numpy as np
big_matrix = np.random.rand(2_000, 2_000)  # example input so the script runs
arr = np.dot(big_matrix, big_matrix.T)
# Scalene flags numpy.dot as slow but doesn't show WHY
scalene is a combined profiler: it measures CPU (Python and native separately), memory (with leak detection), GPU usage, and energy consumption in a single run. Its headline feature is the AI-assisted optimization mode, in which an OpenAI model such as GPT-4 reads the profile and suggests faster code. It’s heavier than py-spy but gives more dimensions of insight. This guide covers the common setup issues.
Why This Happens
scalene installs via pip but writes its scripts to the user-script directory, which may not be on PATH. The web UI opens automatically when scalene finishes — works locally but fails in headless environments (CI, remote SSH, Docker). GPU profiling requires NVIDIA libraries (pynvml) which aren’t installed by default and require a CUDA-enabled GPU. The AI optimization feature integrates with OpenAI’s API and needs an API key in env vars.
Fix 1: Installation and PATH
pip install scalene
# Or for the latest with GPU support
pip install "scalene[gpu]"
Verify the command is on PATH:
which scalene
# /usr/local/bin/scalene or ~/.local/bin/scalene or similar
If “command not found”:
# Find where scalene was installed
python -m pip show scalene | grep Location
# Add the scripts dir to PATH
export PATH="$HOME/.local/bin:$PATH"
# Or run via Python module
python -m scalene my_script.py
Python module form works regardless of PATH:
python -m scalene my_script.py
python -m scalene --cli my_script.py
python -m scalene --html --outfile profile.html my_script.py
Common Mistake: Installing scalene globally via pip but running scripts inside a virtualenv. The venv may not have scalene installed, so scalene resolves to the global binary, which profiles the global Python rather than your venv’s Python. Always pip install scalene inside the venv you’re profiling.
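The PATH and virtualenv checks above can also be run from inside the interpreter you intend to profile. The following is a minimal sketch using only the standard library; `diagnose_scalene` is a hypothetical helper name, not part of scalene.

```python
import importlib.util
import shutil
import sys


def diagnose_scalene() -> dict:
    """Report where (and whether) scalene is visible to this interpreter."""
    spec = importlib.util.find_spec("scalene")
    return {
        # Interpreter that would run `python -m scalene`
        "python": sys.executable,
        # True when sys.prefix differs from the base install, i.e. a venv is active
        "in_virtualenv": sys.prefix != sys.base_prefix,
        # Path of the `scalene` launcher on PATH, or None if the shell cannot find it
        "scalene_on_path": shutil.which("scalene"),
        # True when this interpreter can import the scalene package
        "scalene_importable": spec is not None,
    }


if __name__ == "__main__":
    for key, value in diagnose_scalene().items():
        print(f"{key}: {value}")
```

If `in_virtualenv` is True but `scalene_on_path` points outside the venv (or `scalene_importable` is False), the launcher most likely belongs to a different Python than the one you want to profile.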
Fix 2: Output Modes
scalene has multiple output modes for different contexts:
# Default: opens web UI in browser
scalene my_script.py
# CLI mode (terminal output, no browser)
scalene --cli my_script.py
# HTML file for sharing or CI artifacts
scalene --html --outfile profile.html my_script.py
# JSON for tooling integration
scalene --json --outfile profile.json my_script.py
# All three at once
scalene --cli --html --json --outfile profile my_script.py
CLI mode output:
Memory usage: ...
│   │ Line │ Source                                │ Time % │ Memory % │
│   │    1 │ import numpy as np                    │        │          │
│   │    2 │                                       │        │          │
│ N │    3 │ data = np.random.rand(1_000_000, 100) │ 2% ▆   │ 120 MB ▆ │
│ N │    4 │ result = data @ data.T                │ 87% █  │ 80 MB ▆  │
Column meanings:
- Time %: % of total CPU time
- Memory %: % of total memory allocated
- N indicator: line uses native (C) code
- P indicator: line uses Python
- Black square: pure CPU; orange: GPU; etc.
Pro Tip: For CI integration, use --cli and redirect the output to a file:
scalene --cli my_script.py > profile.txt 2>&1
Then upload profile.txt as a CI artifact. The web UI is great for local exploration but useless in headless CI environments.
Fix 3: GPU Profiling
scalene --gpu my_script.py
Requirements:
- NVIDIA GPU (no AMD/Intel support yet)
- pynvml installed: pip install nvidia-ml-py or pip install "scalene[gpu]"
- NVIDIA drivers loaded
Verify GPU access:
nvidia-smi # Should list your GPUs
python -c "import pynvml; pynvml.nvmlInit(); print(pynvml.nvmlDeviceGetCount())"
Common Mistake: Running scalene --gpu on a machine without NVIDIA hardware. Scalene runs but reports 0% GPU usage everywhere: no error, just useless output. Verify GPU access first, and only use the --gpu flag when you actually have one.
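The one-liner above raises an exception when pynvml or the driver is missing. A slightly more forgiving version of the same check, as a sketch: `visible_nvidia_gpus` is a hypothetical helper name, and it treats "library missing" and "driver missing" the same way (zero GPUs).

```python
def visible_nvidia_gpus() -> int:
    """Return the number of GPUs NVML can see, or 0 if NVML is unavailable."""
    try:
        import pynvml  # provided by the nvidia-ml-py package
    except ImportError:
        return 0
    try:
        pynvml.nvmlInit()
        try:
            return int(pynvml.nvmlDeviceGetCount())
        finally:
            # Always release the NVML handle once it has been initialised
            pynvml.nvmlShutdown()
    except Exception:
        # nvmlInit raises when the driver or the NVML shared library is missing
        return 0


if __name__ == "__main__":
    count = visible_nvidia_gpus()
    print(f"NVIDIA GPUs visible: {count}")
```

A result of 0 suggests that passing --gpu would only produce an all-zero GPU column.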
For multi-GPU systems:
CUDA_VISIBLE_DEVICES=0 scalene --gpu my_script.py
# Profile only GPU 0
For PyTorch GPU profiling that goes deeper than scalene’s high-level stats, see PyTorch not working.
Fix 4: AI-Assisted Optimization
Scalene’s AI feature reads the profile and suggests optimized code:
# Set API key
export OPENAI_API_KEY=sk-...
# Run with AI suggestions enabled
scalene --cli my_script.py
# Look for "Optimize" links in the output
In the web UI, click “🧠 Optimize” next to a hot line. Scalene sends that code (with context) to OpenAI and shows the suggested optimization.
Azure OpenAI alternative:
# Set Azure OpenAI endpoint instead
export OPENAI_API_BASE=https://your-azure.openai.azure.com/
export OPENAI_API_KEY=your-azure-key
Or use --ai-provider:
scalene --ai-provider azure --cli my_script.py
Common Mistake: Expecting AI suggestions to work without an OpenAI key. The feature is opt-in: you must set OPENAI_API_KEY for it to function. Without the key, the “Optimize” button does nothing.
Treat AI suggestions as a starting point, not an answer. The model sees the line in isolation; it does not know your call site, your types, or your hot loop. Read the suggestion, then verify with a benchmark.
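One way to do that verification is a small timeit harness that checks correctness first and speed second. This is a sketch: `original` mirrors the list-comprehension example used elsewhere in this article, and `suggested` is a hypothetical rewrite of the kind an assistant might propose, not actual Scalene output.

```python
import timeit


def original(n: int) -> int:
    # Baseline: build a list, then sum it
    data = [i ** 2 for i in range(n)]
    return sum(data)


def suggested(n: int) -> int:
    # Hypothetical suggestion: feed a generator to sum() so no list is built
    return sum(i * i for i in range(n))


def compare(n: int = 100_000, repeat: int = 5) -> dict:
    # Correctness first: a faster function that returns something else is a bug
    assert original(n) == suggested(n)
    # Best-of-N wall time per implementation, in seconds
    return {
        fn.__name__: min(timeit.repeat(lambda: fn(n), number=1, repeat=repeat))
        for fn in (original, suggested)
    }


if __name__ == "__main__":
    for name, seconds in compare().items():
        print(f"{name}: {seconds:.4f}s")
```

Run it on inputs that resemble your real workload. A suggestion that wins on a toy input can lose on production-sized data.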
Fix 5: Profiling Specific Code (Decorators)
By default, scalene profiles everything. To focus on specific functions:
from scalene import scalene_profiler
@scalene_profiler.profile
def slow_function():
    data = [i ** 2 for i in range(1_000_000)]
    return sum(data)
slow_function()
Then run normally:
scalene my_script.py
Or profile only specific modules (--profile-only matches against file names):
scalene --profile-only "my_module" my_script.py
# Only lines in my_module.py appear in the profile
Exclude noise (third-party libs that dominate the profile):
scalene --profile-exclude "site-packages" my_script.py
# Excludes all third-party code
Pro Tip: For large applications, always set --profile-only to your project’s package. Without it, the profile is dominated by framework internals (FastAPI, SQLAlchemy, asyncio) and your actual hot code is buried. Restricting the profile makes the signal much clearer.
Fix 6: Memory Profiling
scalene’s memory profiling tracks both Python objects and native allocations:
scalene --cli my_script.py
# Memory column shows allocations per line
Memory leak detection:
scalene --memory-leak-detector my_script.py
# Highlights allocations that grow without being freed
Sampling and report size:
# Sampling (default): low overhead, may miss small allocations
scalene my_script.py
# Trim the report to lines with non-zero activity
scalene --reduced-profile my_script.py
Common Mistake: Comparing scalene’s memory numbers to top or htop. The numbers measure different things: scalene tracks allocations attributable to specific Python lines, while top shows total RSS (resident set size) including caches and shared libs. Expect scalene’s numbers to be smaller, because they are allocations, not total memory in use.
For pure memory questions, memray gives more detail than scalene. scalene’s strength is combining memory with CPU — useful when “is this slow or is this allocating too much?” is the actual question.
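The gap between "allocations attributed to a line" and "RSS as top reports it" can be seen without scalene at all. The sketch below uses the standard library's tracemalloc as a stand-in for per-line allocation tracking (it is not how scalene measures memory) and the Unix-only resource module for peak RSS. Both function names are hypothetical.

```python
import resource
import sys
import tracemalloc


def peak_rss_bytes() -> int:
    """Peak resident set size of this process (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS
    return peak if sys.platform == "darwin" else peak * 1024


def traced_vs_rss(n: int = 1_000_000) -> tuple:
    """Return (bytes attributed to the allocation below, peak RSS of the process)."""
    tracemalloc.start()
    data = [float(i) for i in range(n)]  # the allocation being attributed
    _, traced_peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    del data
    return traced_peak, peak_rss_bytes()


if __name__ == "__main__":
    traced, rss = traced_vs_rss()
    print(f"traced allocations: {traced / 1e6:.1f} MB")
    print(f"peak RSS:           {rss / 1e6:.1f} MB")
```

The RSS figure includes the interpreter itself, loaded shared libraries, and allocator slack, which is why it does not line up with per-line allocation totals.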
Fix 7: Jupyter Integration
%load_ext scalene
To profile a single statement, define the function and then use the %scrun line magic:
def slow_thing():
    data = [i ** 2 for i in range(1_000_000)]
    return sum(data)
%scrun slow_thing()
The %scrun magic profiles that one statement. Output is displayed inline.
For longer cells:
%%scalene
import numpy as np
arr = np.random.rand(10_000_000)
result = arr.sum()
print(result)
Line-level vs cell-level: %scrun profiles a single statement; %%scalene profiles the whole cell. Both show line-by-line breakdowns.
The magic only works inside the notebook process — it cannot reach into a remote kernel running in a different container. If your notebook talks to a remote IPython kernel, run scalene on the remote host instead and copy the HTML output back.
Fix 8: Reducing Profiling Overhead
# Default: ~5-15% overhead
scalene my_script.py
# Reduced profile: report only lines with non-zero activity
scalene --reduced-profile my_script.py
# Custom CPU sampling interval
scalene --cpu-sampling-rate 0.05 my_script.py
# Per scalene --help the value is the time between samples in seconds,
# so a larger value means fewer samples and less overhead
# Adjust how often memory is sampled
scalene --allocation-sampling-window 1024 my_script.py
# Per scalene --help the window is measured in bytes,
# so a larger window means fewer memory samples
Profile vs measure: for measuring production performance, sample less often. For deep debugging, sample more often.
Production Incident Lens — Overhead Is a First-Class Concern
Reach for scalene when the question is “is this code slow because of CPU, memory, or both?” Do not reach for it as the first profiler on a hot production worker. Scalene’s overhead is real (5-15% default) and its instrumentation touches the allocator — heavier than py-spy’s pure sampling. The right venue for scalene is staging, a benchmark, or a soak test, not the customer-facing fleet.
When scalene is the right tool:
- A nightly benchmark regressed and you need to know whether the new code added CPU work or allocations.
- A load test in staging produces a memory growth curve and you need per-line attribution.
- A worker in production is healthy on CPU but OOM-kills weekly — you reproduce in staging, then scalene.
When py-spy is the right tool instead:
- A live worker is spiking right now and you need a snapshot without restarting.
- You only care about CPU; memory is irrelevant.
- You cannot add packages to the application’s environment. py-spy runs as a separate binary, though attaching still requires ptrace permission on the host.
Real incident pattern: the silent memory creep. Metrics show RSS growing 50MB per day on a long-running worker. Restarting fixes it temporarily. The team blames “a leak” but tracemalloc shows nothing obvious. scalene with --memory-leak-detector is the right next move, but only in a reproduction environment: scalene has to launch the process it profiles, and its allocation tracking adds memory and CPU overhead of its own, which on an already-growing production worker can trigger the OOM kill you are trying to avoid. Reproduce the workload in staging, run it under scalene there, identify the allocator, ship the fix, then verify in production with metrics alone.
Common Mistake: Running scalene --memory-leak-detector on a worker that already has 4GB RSS and only 8GB cgroup limit. The detector itself allocates to track allocations — you can push the worker over the limit and crash the very process you were investigating. Stage the investigation, then port the fix.
For deeper memory-only investigations, scalene’s leak detector is a starting point — switch to memray when you need allocation timelines and call-graph attribution. For CPU-only “where is the time going right now” questions on a live host, switch to py-spy.
Still Not Working?
scalene vs py-spy vs memray
- scalene — All-in-one: CPU + memory + GPU + energy + AI suggestions. Best for “what’s slow AND why?”
- py-spy — CPU only, sampling, attach to running processes. Best for production diagnostics. See py-spy not working.
- memray — Memory only, allocation tracking. Best for finding leaks. See memray not working.
scalene is the heaviest of the three but provides the most context. Use it for “this is slow, let me understand why”; use py-spy for “this is hung, what’s blocking?”; use memray for “this is leaking, where?”
Distinguishing Python vs Native Time
scalene’s most distinctive feature is the “% time in C” column, which shows how much time was spent in native code (NumPy, PyTorch internals) vs Python:
Time % Python │ Time % C │ Line
           5% │      85% │ result = np.dot(a, b)
          85% │       5% │ result = [x * y for x in a for y in b]
The numpy line spends most time in C (good: moving to numpy was the right optimization). The pure-Python list comprehension spends most time in Python (where you’d see your code on the call stack, but the work is interpreter-bound).
Pro Tip: When optimizing, look at the Python % column. If it’s high, you have room to optimize by moving to NumPy/Cython/Rust. If C % is high, your code is already calling fast native code — further speedup needs algorithmic improvements, not language tricks.
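The Python-vs-native split can be reproduced with the standard library alone. In this sketch, `python_loop` does its work in interpreter bytecode, while `native_sum` hands the same loop to the C implementation of the built-in sum(). Both function names are illustrative, and the timings depend on your machine and Python version.

```python
import timeit


def python_loop(values: list) -> int:
    # Every iteration executes bytecode: a profiler attributes this to Python time
    total = 0
    for v in values:
        total += v
    return total


def native_sum(values: list) -> int:
    # The loop runs inside the C implementation of sum(): native time
    return sum(values)


def time_both(n: int = 1_000_000, repeat: int = 3) -> dict:
    values = list(range(n))
    # Both implementations must agree before their timings mean anything
    assert python_loop(values) == native_sum(values)
    return {
        "python_loop": min(timeit.repeat(lambda: python_loop(values), number=1, repeat=repeat)),
        "native_sum": min(timeit.repeat(lambda: native_sum(values), number=1, repeat=repeat)),
    }


if __name__ == "__main__":
    for name, seconds in time_both().items():
        print(f"{name}: {seconds:.4f}s")
```

On CPython the built-in is typically the faster of the two, which is the same effect the NumPy line shows in the table above.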
CI Integration
# .github/workflows/profile.yml
- name: Profile benchmark
  run: |
    pip install scalene
    scalene --cli --outfile profile.txt benchmarks/main.py
- uses: actions/upload-artifact@v4
  with:
    name: profile
    path: profile.txt
Compare profile.txt across commits to spot performance regressions.
Working with Tests
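scalene --cli --outfile profile.txt -- pytest tests/slow_test.py
The -- separates scalene flags from the command to run. Use a separator whenever the profiled command has its own flags. Scalene’s README documents a triple dash (---) for this purpose, so if the double dash is rejected, check scalene --help for your version.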
Profile only the slow test, not the whole suite — scalene’s overhead can quadruple a 60-second suite into a 4-minute one if every fixture runs under it. Mark the suspect test, run scalene against just that test, then turn the flag off.
When scalene Doesn’t Help
If scalene shows everything is fast but your app is slow:
- External I/O: DB queries, HTTP calls. Use APM (Datadog, Honeycomb, Sentry) or strace/tcpdump.
- Lock contention: GIL, asyncio event loop saturation. Use py-spy’s dump mode.
- GPU memory bandwidth: actual compute is fast but waiting on memory transfers. Use NVIDIA Nsight.
A profile that shows nothing hot is itself useful information — it tells you the bottleneck is outside the Python process, which redirects your investigation. Don’t keep re-running scalene with different flags expecting different results.
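A quick way to confirm that the bottleneck is outside the process is to compare wall-clock time with CPU time for the suspect call. This is a sketch using only the standard library; `cpu_fraction` is a hypothetical helper name.

```python
import time
from typing import Callable


def cpu_fraction(fn: Callable[[], object]) -> float:
    """Fraction of wall-clock time this process spent on CPU while running fn."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    fn()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return cpu / wall if wall > 0 else 0.0


if __name__ == "__main__":
    # time.sleep stands in for a blocking DB query or HTTP call
    waiting = cpu_fraction(lambda: time.sleep(0.5))
    computing = cpu_fraction(lambda: sum(i * i for i in range(2_000_000)))
    print(f"waiting:   {waiting:.2f}")
    print(f"computing: {computing:.2f}")
```

A value near zero means the call spent its time waiting rather than computing, so a CPU profiler has little to show. Note that process_time counts all threads, so multi-threaded native code can push the ratio above 1.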
Energy Consumption Profiling
scalene can estimate energy use per code path (RAPL on supported Intel/AMD CPUs):
scalene --cli my_script.py
# Look for Energy column when available
The energy column shows joules consumed per line, which is useful for sustainability-conscious teams optimizing for cost or carbon. Requires Linux + supported hardware (Intel Sandy Bridge or newer, AMD Zen). On macOS and Windows, the column is omitted.
Multi-Process Profiling
scalene doesn’t have a direct equivalent to --subprocesses from py-spy, but you can profile parent and children separately:
# Parent
scalene --cli --outfile parent.txt my_script.py
# For workers, instrument them to write per-process profiles
# Use os.getpid() to differentiate output files
For multiprocessing patterns that affect profiling strategy, see Python multiprocessing not working.
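The os.getpid() idea can be sketched as a small path helper. `per_process_outfile` is a hypothetical name, and how the resulting path is handed to the profiler depends on how your workers are launched, which is not shown here.

```python
import os
from pathlib import Path


def per_process_outfile(prefix: str, suffix: str = ".txt", directory: str = ".") -> Path:
    """Build an output path that is unique to the calling process."""
    # The PID keeps concurrent workers from overwriting each other's profile
    return Path(directory) / f"{prefix}.{os.getpid()}{suffix}"


if __name__ == "__main__":
    print(per_process_outfile("worker"))
```

PIDs are reused over time, so for long-lived fleets consider adding a timestamp to the prefix as well.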
Programmatic Use
from scalene import scalene_profiler
scalene_profiler.start()
# Code to profile
do_heavy_work()
scalene_profiler.stop()
Useful when you want to profile only a portion of a long-running app: call start/stop around the specific section.
Comparing Profiles Across Runs
scalene doesn’t have built-in diff support, but you can compare JSON outputs:
scalene --json --outfile before.json my_script.py
# Make changes
scalene --json --outfile after.json my_script.py
# Manual diff or use a custom script to compare line-by-line metrics
python compare_profiles.py before.json after.json
Regression detection in CI follows this pattern: store a baseline JSON, run on every PR, and alert if any line’s CPU% or memory regresses by more than a threshold.
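A compare_profiles.py along these lines might look like the sketch below. The JSON layout is an assumption: it expects a top-level "files" mapping whose entries have a "lines" list with "lineno", "n_cpu_percent_python", and "n_cpu_percent_c" keys. Open one of your own profile.json files and adjust the key names if your scalene version differs.

```python
import json
import sys


def load_line_cpu(profile: dict) -> dict:
    """Map (filename, lineno) -> total CPU % (Python + native) for one profile."""
    result = {}
    # ASSUMED schema: {"files": {path: {"lines": [{"lineno": ..., ...}]}}}
    for filename, info in profile.get("files", {}).items():
        for line in info.get("lines", []):
            cpu = line.get("n_cpu_percent_python", 0.0) + line.get("n_cpu_percent_c", 0.0)
            result[(filename, line.get("lineno"))] = cpu
    return result


def regressions(before: dict, after: dict, threshold: float = 5.0) -> list:
    """Lines whose CPU % grew by more than `threshold` percentage points."""
    old, new = load_line_cpu(before), load_line_cpu(after)
    found = []
    for key, cpu in new.items():
        delta = cpu - old.get(key, 0.0)
        if delta > threshold:
            found.append((key[0], key[1], round(delta, 2)))
    # Largest regression first
    return sorted(found, key=lambda item: item[2], reverse=True)


if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as f_before, open(sys.argv[2]) as f_after:
        for filename, lineno, delta in regressions(json.load(f_before), json.load(f_after)):
            print(f"{filename}:{lineno} CPU +{delta} points")
```

Matching by line number is fragile when the code between the two runs has shifted. For a real CI gate, comparing per-function totals is likely more robust.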
Visualization Modes
In the web UI, click column headers to sort by:
- CPU % (Python or native)
- Memory (current or peak)
- GPU %
- Line execution count
Different sorts reveal different bottlenecks — sorting by memory finds allocators, by CPU finds compute hotspots, by execution count finds tight loops.
scalene Output Is Empty or All Zeros
A run completes but every column reads 0%. Common causes:
- The script finished too fast for any sample to land. Scalene samples on a timer; a script that runs in 50ms gets zero samples. Wrap the workload in a loop that runs for at least several seconds.
- The script crashed before profiling started. Look at stderr — scalene wraps the target, so its own startup output can hide the target’s traceback. Run the target without scalene first to confirm it works.
- --profile-only was set to a path that matched nothing. Drop the flag, confirm output appears, then re-add it with the correct prefix.
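For the first cause, wrapping the workload in a time-bounded loop can be sketched as follows. `run_for` is a hypothetical helper, and the workload shown is a placeholder for your own function.

```python
import time
from typing import Callable


def run_for(seconds: float, workload: Callable[[], object]) -> int:
    """Call workload repeatedly until `seconds` of wall time have passed."""
    deadline = time.perf_counter() + seconds
    iterations = 0
    while time.perf_counter() < deadline:
        workload()
        iterations += 1
    return iterations


if __name__ == "__main__":
    # Placeholder workload: replace with the fast function you want sampled
    count = run_for(5.0, lambda: sum(i * i for i in range(10_000)))
    print(f"ran workload {count} times")
```

Because the loop repeats the same work, per-line percentages stay representative while the absolute times scale with the iteration count.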
Web UI Will Not Open in WSL or Remote SSH
Scalene tries to launch a browser when the run finishes. In WSL2, headless containers, or over SSH, the launch silently fails. Use --html --outfile profile.html and open the file manually, or use --cli to skip the browser entirely. For SSH, ssh -L 8088:localhost:8088 plus scalene --web can forward the live UI, but the static HTML file is usually the simpler path.
Cannot Distinguish “Slow” from “Allocating” in the Report
scalene shows CPU time and memory as separate columns, but they sit next to each other in the line view and a hot line often shows both — high CPU because allocation is itself work. The trick: look at the Memory column independently of CPU. If the line allocates 200MB but uses 2% CPU, it is GC pressure dressed up as a hot loop. The fix is usually a generator instead of a list, or a __slots__ declaration, not a faster algorithm.
For the same line showing low memory but high CPU, the bottleneck is real compute. Native-percent column tells you whether the work is in Python (interpreter-bound, optimization possible) or in C (already calling fast code — algorithmic change required).
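The "generator instead of a list" fix mentioned above can be checked with tracemalloc before re-profiling. In this sketch, both functions compute the same sum; only the first materialises the intermediate list. The function names are illustrative.

```python
import tracemalloc
from typing import Callable


def peak_bytes(fn: Callable[[], object]) -> int:
    """Peak bytes allocated by Python objects while fn runs."""
    tracemalloc.start()
    try:
        fn()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def with_list(n: int = 200_000) -> int:
    # Builds the full list in memory before summing
    return sum([i * i for i in range(n)])


def with_generator(n: int = 200_000) -> int:
    # Produces one value at a time; no intermediate list
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    print(f"list:      {peak_bytes(with_list) / 1e6:.2f} MB peak")
    print(f"generator: {peak_bytes(with_generator) / 1e6:.2f} MB peak")
```

If the peak drops but the run time does not, the line was allocation-heavy rather than compute-heavy, which matches the distinction drawn above.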
Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.