
Fix: py-spy Not Working — Attach Permission, Empty Output, and Native Frame Errors

FixDevs


Quick Answer

How to fix py-spy errors: ptrace “Operation not permitted” on attach, blank flamegraphs, missing native (C extension) frames, top mode showing no Python frames, empty dump output, and child processes missing from profiles.

The Error

You try to attach py-spy to a running process and get permission denied:

$ py-spy dump --pid 12345
Error: Operation not permitted: ptrace failed

Or the flamegraph generated is empty or just shows <idle>:

$ py-spy record -o profile.svg --pid 12345 --duration 30
# After 30 seconds, profile.svg shows almost nothing

Or py-spy can’t read the interpreter state at all:

Error: Unable to get python interpreter state, can't profile

Or native (C extension) frames don’t appear:

$ py-spy record -o profile.svg -- python my_script.py
# Flamegraph shows numpy.zeros calls but no breakdown of what numpy does internally

Or subprocess profiling misses child processes:

$ py-spy record -o profile.svg -- python my_script.py
# Script spawns worker processes; py-spy only profiles the parent

py-spy is the gold-standard Python CPU profiler — Rust-based, sample-based (low overhead), can attach to already-running processes without restart. It’s the right tool for “production is slow, why?” debugging — but ptrace permission issues, sampling-vs-deterministic confusion, and native frame handling produce specific failures. This guide covers each.

Why This Happens

py-spy reads a running Python process’s memory from the outside (process_vm_readv on Linux, vm_read on macOS, ReadProcessMemory on Windows) and reconstructs the interpreter’s call stack at a fixed rate (default 100 samples per second). No instrumentation, no code changes: just observation. Because it’s sample-based, very fast functions may not appear in profiles; only frames captured at sample points show up.

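On Linux, reading another process’s memory this way, and pausing it via ptrace (which py-spy does by default while it samples), goes through the kernel’s ptrace access checks. Default kernel hardening on many distributions (the Yama LSM with ptrace_scope=1) only lets a process trace its own descendants. py-spy can therefore profile a program it launches itself, but attaching to an arbitrary existing process requires root, the CAP_SYS_PTRACE capability, or relaxing the kernel setting. On macOS, py-spy always needs to run as root.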

Fix 1: Installation and Basic Use

pip install py-spy
# Or via cargo for the latest
cargo install py-spy

Three main commands:

# Profile a new process — outputs flamegraph SVG
py-spy record -o profile.svg -- python my_script.py

# Live top view (like Linux top, but for Python)
py-spy top --pid 12345

# One-shot stack dump for all threads
py-spy dump --pid 12345

Common Mistake: Running py-spy record against a one-shot script and expecting the flamegraph to show the script’s logic. py-spy samples — a script that runs in 100ms only gets a handful of samples and produces a useless flamegraph. Always profile workloads that run for at least several seconds; ideally minutes.
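The sample-count arithmetic behind this is easy to check before you profile. The sketch below is a rough back-of-the-envelope model, assuming each sample independently lands in a function with probability equal to that function’s share of CPU time (real samples are correlated, so treat the numbers as optimistic). expected_samples and relative_error are illustrative names, not part of py-spy.

```python
import math

def expected_samples(rate_hz: float, seconds: float) -> float:
    """Roughly how many stack samples py-spy collects: rate x duration."""
    return rate_hz * seconds

def relative_error(total_samples: float, share: float) -> float:
    """Relative standard error of a function's sample count.

    Binomial model: with n samples and probability p per sample, the count
    has mean n * p and standard deviation sqrt(n * p * (1 - p)).
    """
    hits = total_samples * share
    return math.sqrt(total_samples * share * (1 - share)) / hits

# A 100 ms script at the default 100 Hz: only about 10 samples in total.
print(expected_samples(100, 0.1))

# A function using 5% of CPU, profiled for 60 s at 100 Hz:
# about 300 hits with roughly 5-6% relative noise on that count.
print(round(relative_error(expected_samples(100, 60), 0.05), 3))
```

With about 10 samples from a 100 ms run, a function would need to own a large share of the runtime before its bar in the flamegraph meant anything.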

Profile until script exits:

py-spy record -o profile.svg -- python my_script.py
# Note the `--` before python

Profile for fixed duration:

py-spy record -o profile.svg --duration 60 --pid 12345
# Records 60 seconds of samples from PID 12345

Fix 2: Permission Denied on Attach

$ py-spy dump --pid 12345
Error: Permission denied: cannot read process memory

Or:

ptrace: Operation not permitted (os error 1)

Linux blocks ptrace to non-child processes by default for security.

Quick fix — sudo:

sudo py-spy dump --pid 12345
sudo py-spy record -o profile.svg --pid 12345 --duration 30

Permanent fix — relax ptrace scope:

# Check current setting
cat /proc/sys/kernel/yama/ptrace_scope
# 0 = ptrace any process
# 1 = ptrace only descendants (default on Ubuntu/Debian)
# 2 = require CAP_SYS_PTRACE
# 3 = ptrace disabled

# Temporarily allow ptrace any process
sudo sysctl kernel.yama.ptrace_scope=0

# Permanently
echo "kernel.yama.ptrace_scope = 0" | sudo tee /etc/sysctl.d/10-ptrace.conf
sudo sysctl -p /etc/sysctl.d/10-ptrace.conf
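Before reaching for sudo, you can check which of these cases applies. The following is a minimal Linux-only sketch: read_ptrace_scope, real_uid, and attach_hint are hypothetical helper names, and the hints cover only the UID and Yama rules described here, not setcap-granted capabilities, container capability sets, or other security modules.

```python
from pathlib import Path
from typing import Optional

def read_ptrace_scope(path: Path = Path("/proc/sys/kernel/yama/ptrace_scope")) -> Optional[int]:
    """Return the Yama ptrace_scope value, or None if Yama isn't enabled."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None

def real_uid(pid: int) -> int:
    """Real UID of a process, from the Uid: line of /proc/<pid>/status."""
    for line in Path(f"/proc/{pid}/status").read_text().splitlines():
        if line.startswith("Uid:"):
            return int(line.split()[1])  # Uid: real effective saved filesystem
    raise RuntimeError(f"no Uid: line for pid {pid}")

def attach_hint(scope: Optional[int], my_uid: int, target_uid: int) -> str:
    """Rough guess at whether `py-spy --pid` will be allowed (UIDs and Yama only)."""
    if scope == 3:
        return "blocked: scope 3 disables attach for everyone and can't be lowered until reboot"
    if my_uid == 0:
        return "ok: running as root (in a container, only if it has CAP_SYS_PTRACE)"
    if my_uid != target_uid:
        return "needs root or CAP_SYS_PTRACE: the target runs as another user"
    if scope is None or scope == 0:
        return "ok: same user and no Yama restriction"
    if scope == 1:
        return "needs sudo or setcap for --pid: scope 1 only allows tracing descendants"
    return "needs CAP_SYS_PTRACE: scope 2 is admin-only"
```

Call it as attach_hint(read_ptrace_scope(), os.getuid(), real_uid(12345)) with the PID you want to profile (os.getuid comes from the standard library os module).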

Reduce security risk — grant just py-spy the capability:

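sudo setcap cap_sys_ptrace=eip "$(readlink -f "$(which py-spy)")"
# setcap needs the real file, not a symlink. Now py-spy can attach to any process without sudo

This is safer than ptrace_scope=0: only py-spy gets the elevated permission, not every process you run. Anyone who can execute that binary inherits the capability, though, so restrict who can run it (for example, chmod 750 and a dedicated operators group). Reinstalling or upgrading py-spy replaces the file and drops the capability, so rerun setcap afterwards.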

Docker containers — add the capability:

docker run --cap-add SYS_PTRACE ...

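If security policy keeps SYS_PTRACE off the app container, run py-spy in a sidecar that shares the app’s PID namespace:

services:
  app:
    image: myapp
  py-spy:
    image: python:3.12
    pid: "service:app"   # Share the app's PID namespace
    cap_add: [SYS_PTRACE]
    depends_on: [app]
    volumes:
      - ./profiles:/output   # Keep the SVG after the sidecar exits
    # The stock python image doesn't include py-spy, so install it first.
    # --pid 1 is the app's entrypoint; pick another PID if that's a shell wrapper.
    command: sh -c "pip install py-spy && py-spy record -o /output/profile.svg --pid 1 --duration 60"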

Pro Tip: Set cap_sys_ptrace on the py-spy binary once via setcap. After that, profile any process without sudo or kernel config changes. The security risk is bounded to py-spy specifically, not your shell or all binaries.

Fix 3: Sampling Rate

py-spy record -o profile.svg --rate 250 --pid 12345
# Sample at 250 Hz instead of default 100

Higher rates = more detail but more overhead:

Rate               Overhead (approx.)   Use case
1 Hz               ~0.01%               Very long observations (hours)
100 Hz (default)   ~1%                  General profiling
500 Hz             ~5%                  Short bursts, fine-grained
1000 Hz            ~10%                 Very fast functions

For production profiling where overhead matters, stick to 100 Hz or less. For debugging brief operations, 500-1000 Hz catches more.

Sampling vs deterministic profilers:

  • Sampling (py-spy, scalene) — periodic snapshots, low overhead, may miss fast events
  • Deterministic (cProfile) — records every call, high overhead, captures everything

py-spy’s sampling means very brief functions called rarely won’t show up. For “what’s slow in this hot loop?”, py-spy is perfect; for “did this function get called at all?”, cProfile is better.
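To see the difference concretely, the sketch below (function names are illustrative) runs a small workload under cProfile. cProfile reports that rare_fast was called exactly once, while a 100 Hz sampler such as py-spy would almost never catch it, because it finishes in far less than the 10 ms between samples. It uses pstats.Stats.get_stats_profile(), which needs Python 3.9 or later.

```python
import cProfile
import pstats

def rare_fast():
    # Called once and done in microseconds: a 100 Hz sampler will almost never see it.
    return 42

def hot_loop(n):
    # Where nearly all the time goes: this is what a py-spy flamegraph would show.
    total = 0
    for i in range(n):
        total += i * i
    return total

def workload():
    rare_fast()
    return hot_loop(200_000)

def call_counts(func):
    """Run func under cProfile and return {function name: call count string}."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    # ncalls is a string such as "1", or "3/1" for recursive functions.
    profile = pstats.Stats(profiler).get_stats_profile()
    return {name: fp.ncalls for name, fp in profile.func_profiles.items()}

counts = call_counts(workload)
print(counts["rare_fast"], counts["hot_loop"])
```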

Fix 4: Native Frames (C Extensions)

py-spy record -o profile.svg --native --pid 12345

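--native resolves C extension frames (NumPy, PyTorch, lxml, etc.) instead of showing them as opaque blocks. Native unwinding isn’t available on every platform py-spy supports (macOS in particular), so check that the flag is listed in py-spy record --help before relying on it.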

Without --native:

my_func
  numpy.dot
  <native code>   ← Just a black box

With --native:

my_func
  numpy.dot
  __dgemm_kernel_avx2   ← Actual BLAS function
  matmul_internal

Common Mistake: Profiling NumPy/PyTorch code without --native and concluding the bottleneck is “matrix multiplication” — but the real question is “which BLAS kernel” or “is GEMM blocking on memory bandwidth.” Native frames reveal the actual hot spots.

--native overhead — adds 10-50% to sampling overhead because it unwinds the native stack on every sample. For long-running production profiles, leave native off; for targeted hot-path investigations, turn it on.

For PyTorch profiling that benefits from native frames, see PyTorch not working.

Fix 5: Subprocess and Multi-Process Profiling

py-spy record -o profile.svg --subprocesses -- python my_script.py

--subprocesses follows child processes spawned via os.fork, subprocess.Popen, or multiprocessing. Each subprocess gets its own sample stream in the same flamegraph.

Without --subprocesses:

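# my_script.py
from multiprocessing import Pool

def slow_function(n):
    # CPU-bound stand-in for real per-item work
    return sum(i * i for i in range(n * 100))

if __name__ == "__main__":  # required with the "spawn" start method (macOS, Windows)
    with Pool(4) as p:
        p.map(slow_function, range(1000))

py-spy record -o profile.svg -- python my_script.py
# Profile shows only the parent process; the workers are invisible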

With --subprocesses:

py-spy record -o profile.svg --subprocesses -- python my_script.py
# Profile includes all worker processes

To attach to an already-running parent process (a Gunicorn master, a container’s main process) and include its children:

py-spy record -o profile.svg --pid 12345 --subprocesses --duration 60

For multiprocessing patterns that affect profiling, see Python multiprocessing not working.

Fix 6: Top Mode (Live)

py-spy top --pid 12345

Live TUI like Linux top but for Python — shows which functions are using CPU right now:

%Own   %Total   Function (filename)
80.0%  80.0%   slow_function (my_script.py)
15.0%  95.0%   process_data (my_script.py)
 5.0%  5.0%   <built-in method>

Sort modes (press the key while top is running; the sample output above trims the OwnTime and TotalTime columns):

Key   Sort by
1     %Own
2     %Total
3     OwnTime
4     TotalTime
q     Quit

top is great for “is this still happening?” queries. Attach to a running process for a few seconds, see what’s hot, detach.

Pro Tip: Use py-spy top as a quick diagnostic before deeper analysis. If top shows your hot function consistently, you have a CPU problem worth profiling further. If top shows everything is idle but the process is slow, your bottleneck is I/O — switch to other tools (strace for syscalls, iotop for disk).
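To tell “busy” from “waiting” before deciding whether py-spy is the right tool, you can compare a process’s CPU time over a short interval. This is a Linux-only sketch with hypothetical helper names: it reads utime and stime (fields 14 and 15 of /proc/<pid>/stat, in clock ticks) and converts the difference to a percentage of one core, so a multi-threaded process can report more than 100.

```python
import os
import time
from pathlib import Path

def cpu_ticks(stat_text: str) -> int:
    """utime + stime (in clock ticks) from the text of /proc/<pid>/stat."""
    # Field 2 (the command name) is in parentheses and may contain spaces,
    # so split after the last ')' and count fields from there.
    fields = stat_text.rpartition(")")[2].split()
    # fields[0] is field 3 (state), so utime (field 14) is fields[11], stime fields[12].
    return int(fields[11]) + int(fields[12])

def cpu_percent(pid: int, interval: float = 1.0) -> float:
    """Percent of one core the process used over `interval` seconds (Linux only)."""
    path = Path(f"/proc/{pid}/stat")
    clk_tck = os.sysconf("SC_CLK_TCK")  # ticks per second, usually 100
    before = cpu_ticks(path.read_text())
    time.sleep(interval)
    after = cpu_ticks(path.read_text())
    return 100.0 * (after - before) / clk_tck / interval
```

A reading near zero while requests are slow points at I/O or lock waits, where py-spy dump or strace will tell you more than a CPU flamegraph.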

Fix 7: Dump for Stuck/Hung Processes

py-spy dump --pid 12345

Prints the current Python stack of every thread — instant snapshot, no sampling. Perfect for hung processes:

Thread 0x7f8b2c19c700 (active+gil): "MainThread"
    fetch_data (my_module.py:42)
    main (my_module.py:78)
    <module> (my_script.py:5)

Thread 0x7f8b1a3fd700 (idle): "Thread-1"
    wait (threading.py:312)
    join (threading.py:355)
    main (my_module.py:80)

(active+gil) = currently executing Python; (idle) = waiting (I/O, lock, sleep).

Common Mistake: Using top on a hung process. py-spy top shows CPU usage — a deadlocked process shows 0% CPU and you learn nothing. Use dump for hangs; it shows where every thread is parked.

Dump a stuck process on a remote machine over SSH:

ssh prod-server "sudo py-spy dump --pid 12345"

Output is plain text — easy to grep and share with teammates.
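When a dump covers dozens of threads, a quick tally of where each one is parked makes the pattern obvious. The sketch below assumes the text layout shown above (a Thread header with the state in parentheses, followed by indented frames, innermost first); summarize_dump is a hypothetical helper, and the parser may need adjusting if your py-spy version formats dumps differently.

```python
import re
from collections import Counter

# Header lines look like: Thread 0x7f8b2c19c700 (active+gil): "MainThread"
THREAD_HEADER = re.compile(r"^Thread\s+\S+\s+\(([^)]*)\)")

def summarize_dump(text: str) -> Counter:
    """Count threads by (state, innermost frame) in `py-spy dump` output."""
    summary = Counter()
    state = None
    for line in text.splitlines():
        header = THREAD_HEADER.match(line)
        if header:
            state = header.group(1)
        elif state is not None and line.startswith((" ", "\t")) and line.strip():
            # First indented line after a header is the frame currently executing
            function = line.strip().split(" (")[0]
            summary[(state, function)] += 1
            state = None  # ignore the outer frames of this thread
    return summary
```

Save the output with py-spy dump --pid 12345 > dump.txt and pass the file’s contents to summarize_dump; most_common() then shows whether threads cluster in one frame, such as a socket read.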

Fix 8: Reading Flamegraphs

py-spy generates SVG flamegraphs — interactive in any browser.

py-spy record -o profile.svg --pid 12345 --duration 60
open profile.svg   # macOS; on Linux, xdg-open profile.svg

Flamegraph anatomy:

  • X-axis (width) = share of samples in which that frame was on the stack, including time spent in its callees (left-to-right order is not chronological)
  • Y-axis (depth) = call stack — deeper = nested calls
  • Color = arbitrary, helps distinguish adjacent frames
  • Click a frame = zoom in to that subtree
  • Search box = highlight matching frames

To find bottlenecks:

  1. Look for wide blocks at the top — these are the leaves of the stack, the actual work being done
  2. Look for stacks that recur — same function called from many paths is a candidate for caching
  3. Ignore tall narrow towers — they’re deep but not consuming much time

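Icicle and reversed views: check py-spy record --help for an orientation option in your version. Otherwise, --format raw writes collapsed stacks that inferno (cargo install inferno) can re-render:

py-spy record -o profile.txt --format raw --pid 12345
inferno-flamegraph --inverted profile.txt > icicle.svg     # root at top, leaves at bottom
inferno-flamegraph --reverse profile.txt > reversed.svg    # stacks flipped: leaves become roots

The reversed view is the one to use when you want to start from a hot function and ask “what’s calling X?”; the icicle view only flips the picture vertically.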

Speedscope format for richer analysis:

py-spy record -o profile.json --format speedscope --pid 12345

Then open in speedscope.app — interactive timeline, multiple visualization modes, filtering.

Production Incident Lens — Attaching Without Adding Risk

The reason teams reach for py-spy mid-incident is exactly the reason they hesitate: the box is on fire and you need to look inside without making it worse. Two failure modes show up repeatedly during real incidents.

Failure mode 1 — py-spy itself spikes CPU on attach. Attaching to a process with thousands of threads, deep stacks, or heavy native code can momentarily push CPU higher while symbols resolve. On a worker already running at 85% CPU, that initial burst can tip latency over the SLO. Mitigation: start with --rate 50 --duration 30 on a single worker behind a healthy load balancer, never the whole fleet, and never the leader of a quorum.

# Lowest-risk first attach
py-spy record -o /tmp/profile.svg --pid <one-worker-pid> --rate 50 --duration 30

Failure mode 2 — ptrace permission leaks across the blast radius. The kernel.yama.ptrace_scope=0 you set during the incident often gets forgotten, and if you also wrote it to /etc/sysctl.d/, it survives reboots until someone audits it months later. Every process on the box can now ptrace every other process running as the same user, a real privilege-escalation accelerant if the host is later compromised. The correct pattern is setcap cap_sys_ptrace=eip on the py-spy binary itself: scoped to that binary and, with restrictive execute permissions, to the operators allowed to run it.

Blast radius checklist before you attach to production:

  1. One worker, not the fleet — pull it out of rotation if possible.
  2. --rate 50, --duration 30 for the first sample. Increase only if results are too sparse.
  3. Run from a sidecar container with shared PID namespace, not inside the app container.
  4. Capture the SVG to a writable volume you can grab even if the worker is later replaced.
  5. Reset any kernel toggles you flipped: return ptrace_scope to its previous value (1 on Ubuntu) and delete any /etc/sysctl.d/ file you added once the investigation closes.

Real incident pattern — the “p99 climbed at 03:00 and we don’t know why” page. Metrics show CPU is up, logs show nothing unusual, and the obvious suspects (deploys, traffic spike) are clean. The right move is py-spy dump --pid <pid> first — instant snapshot, zero sampling overhead, tells you what every thread is doing right now. If most threads are in socket.recv or psycopg2, the answer is downstream and profiling further won’t help. If most threads are in CPU-bound Python, that’s when you switch to record for a flamegraph.

Pro Tip: Treat py-spy access like a sharp tool. Document who has CAP_SYS_PTRACE, in which environments, and audit it quarterly. The first time you reach for py-spy at 3am is not the time to also be debugging permissions.

Still Not Working?

py-spy vs scalene vs cProfile

  • py-spy — Sampling, low overhead, attach to running processes. Best for production debugging.
  • scalene — Sampling + CPU + memory + GPU. Slightly more overhead than py-spy. Best when you want everything in one tool.
  • cProfile — Deterministic stdlib profiler, captures every call. Best for unit-test-style profiling of specific code.
  • memray — Memory profiling specifically. See memray not working.

For “production is slow,” start with py-spy. For “this test is slow,” use cProfile or pytest-benchmark.

Profiling pytest Tests

py-spy record -o test-profile.svg -- pytest tests/slow_test.py

Or use pytest-py-spy:

pip install pytest-py-spy
pytest --py-spy tests/

This is the cleanest way to investigate a slow CI suite — record once, share the SVG in the PR comment, let reviewers see exactly which test eats the budget.

Profiling Async Code

Async code is tricky to profile — many tasks share the event loop. py-spy handles asyncio reasonably:

py-spy record --pid 12345 -o asyncio-profile.svg

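Near the root of each stack you should find the event loop machinery, run_forever and _run_once (from asyncio/base_events.py), with your coroutines stacked above them. Because nearly all loop work runs under _run_once, its width alone tells you little; look at how that width is divided. If it is spread across many coroutines and the loop is rarely idle, the loop is saturated: you need more workers or to move blocking work off the loop. If a single coroutine takes most of it, that coroutine is the culprit, often because it makes a synchronous call (time.sleep, a blocking client library, heavy CPU work) that stalls every other task.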
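To practise reading this pattern, here is a small, hypothetical reproduction: one coroutine makes a blocking time.sleep call, and a heartbeat task measures how late the event loop wakes it. The lag it reports is the stall every other task on the loop would also see; a flamegraph of the same program should show time.sleep under blocking_handler taking most of the width beneath _run_once.

```python
import asyncio
import time

async def blocking_handler(seconds):
    await asyncio.sleep(0)  # yield once so the heartbeat task starts first
    # A synchronous call inside a coroutine stalls every other task on the loop.
    time.sleep(seconds)

async def heartbeat(interval, beats, lags):
    loop = asyncio.get_running_loop()
    for _ in range(beats):
        start = loop.time()
        await asyncio.sleep(interval)
        # How much later than requested the loop resumed this task
        lags.append(loop.time() - start - interval)

async def main(block_seconds=0.3, interval=0.05, beats=3):
    lags = []
    await asyncio.gather(heartbeat(interval, beats, lags), blocking_handler(block_seconds))
    return max(lags)

if __name__ == "__main__":
    print(f"worst event-loop lag: {asyncio.run(main()):.3f}s")
```

With the defaults the script finishes in well under a second, which is too short for a useful profile; raise block_seconds and beats before running it under py-spy record.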

Profiling Production with Minimal Risk

# 30 seconds of sampling at 50 Hz on a single worker
py-spy record -o profile.svg --pid <worker-pid> --duration 30 --rate 50

At 50 Hz the overhead is roughly half the ~1% quoted above for the 100 Hz default, and the 30 second cap limits how long even that lasts, so the risk to production traffic is low. For longer observations, lower the rate further.
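If you script production attaches, baking these limits into the command builder keeps a tired operator from typing --rate 1000 at 3am. A minimal sketch under that assumption: safe_record_cmd and its 100 Hz ceiling are illustrative choices, and the flags it emits (-o, --pid, --rate, --duration, --native) are the ones used throughout this guide.

```python
import shlex

def safe_record_cmd(pid: int, output: str, rate: int = 50, duration: int = 30,
                    native: bool = False) -> list:
    """Build a conservative `py-spy record` command for one production worker."""
    if rate > 100:
        raise ValueError("keep production sampling at or below the 100 Hz default")
    cmd = ["py-spy", "record", "-o", output, "--pid", str(pid),
           "--rate", str(rate), "--duration", str(duration)]
    if native:
        cmd.append("--native")  # noticeably more expensive; opt in deliberately
    return cmd

print(shlex.join(safe_record_cmd(4242, "/tmp/profile.svg")))
```

Pass the resulting list to subprocess.run(cmd, check=True) on the host or sidecar that has ptrace permission.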

Pro tip for Docker: Don’t run py-spy inside the same container as the app for production profiling. Use a sidecar container that shares the PID namespace — keeps py-spy off the production hot path.

Combining with logging / metrics

Production profiling complements logs and metrics — not replaces them. Use py-spy when:

  • Metrics show CPU is high
  • Logs don’t reveal the cause
  • You need to see what code path is responsible

Structured logs answer “what happened”; metrics answer “how much”; py-spy answers “where in the code did it happen.” Reach for it once the first two say something is wrong but neither says where.

Output Formats

py-spy record -o profile.svg                 # Flamegraph SVG
py-spy record -o profile.json --format speedscope   # Speedscope JSON
py-spy record -o profile.raw --format raw          # Raw sample data

Raw format is useful for custom analysis — feed into your own scripts to compute custom metrics.
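For example, you can rebuild the self-time versus total-time split that a flamegraph draws. The sketch below assumes the raw output is in collapsed-stack form, as in flamegraph.pl’s format: one stack per line, frames separated by semicolons from root to leaf, and the sample count after the last space. aggregate_collapsed is a hypothetical name.

```python
from collections import Counter

def aggregate_collapsed(lines):
    """Self and total sample counts per frame from collapsed-stack lines."""
    self_counts, total_counts = Counter(), Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Frames may contain spaces, so the count is whatever follows the last one
        stack, _, count = line.rpartition(" ")
        n = int(count)
        frames = stack.split(";")
        self_counts[frames[-1]] += n   # only the leaf was executing
        for frame in set(frames):      # count recursive frames once per stack
            total_counts[frame] += n
    return self_counts, total_counts

sample = [
    "main (app.py:10);load (app.py:20);parse (app.py:30) 70",
    "main (app.py:10);load (app.py:20) 10",
    "main (app.py:10);render (app.py:40) 20",
]
self_counts, total_counts = aggregate_collapsed(sample)
print(self_counts.most_common(1))
```

Self counts point at where the CPU actually went (the leaf frames); total counts show which callers those samples passed through.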

Live Profiling for FastAPI / Uvicorn

# Find the worker PID
ps aux | grep uvicorn

# Profile
py-spy top --pid <worker-pid>

Watch the workers handle requests in real time. For Uvicorn worker configuration, see Uvicorn not working.
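With several workers (uvicorn --workers N, or a Gunicorn master), grep output mixes the supervisor and its children. pgrep -P <master-pid> lists direct children; if you would rather walk the whole tree from Python, here is a Linux-only sketch with hypothetical helper names that reads each process’s parent PID from /proc/<pid>/stat.

```python
from pathlib import Path

def parent_pid(stat_text: str) -> int:
    """PPID (field 4) from /proc/<pid>/stat; split after the last ')' since comm may contain spaces."""
    return int(stat_text.rpartition(")")[2].split()[1])

def process_tree(proc: Path = Path("/proc")) -> dict:
    """Map every visible pid to its parent pid (Linux only)."""
    tree = {}
    for entry in proc.iterdir():
        if entry.name.isdigit():
            try:
                tree[int(entry.name)] = parent_pid((entry / "stat").read_text())
            except OSError:
                continue  # the process exited (or is hidden) while we were scanning
    return tree

def descendants(tree: dict, root: int) -> list:
    """All pids below `root`, e.g. the workers under a uvicorn or Gunicorn master."""
    found, frontier = [], [root]
    while frontier:
        parent = frontier.pop()
        children = [pid for pid, ppid in tree.items() if ppid == parent]
        found.extend(children)
        frontier.extend(children)
    return sorted(found)
```

descendants(process_tree(), master_pid) returns every worker PID below the master, ready for py-spy top --pid or py-spy dump --pid.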

py-spy Hangs the Target Process on Attach

A rare but real failure: py-spy attaches, the target process freezes for several seconds, then resumes. By default py-spy pauses the target while it reads each sample, and the first attach also loads symbols for the interpreter and any native extensions. With many threads or deep stacks, or with binaries on slow or network-mounted storage, that pause can stretch. Mitigations:

  • Pre-warm the worker before profiling — issue a few requests so files are page-cached.
  • Skip --native for the first sample. It is the most expensive flag.
  • Avoid network-mounted Python installations in production. Local disk only.
  • Add --nonblocking so py-spy reads stacks without pausing the target, accepting that some samples may be inconsistent.

If the hang persists, switch to dump (snapshot only, no sampling loop) and skip record for that host.

Kernel and Container Edge Cases

gVisor and some other sandboxed runtimes silently disable ptrace. py-spy attaches with no error but returns empty stacks. Verify the runtime supports ptrace before debugging “py-spy broken.” Check /proc/<pid>/status for TracerPid — if you can’t see the field, the kernel is hiding it from you and py-spy will fail.
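The TracerPid field also explains another cause of Operation not permitted: a process can have only one ptrace tracer, so if gdb or strace is already attached, py-spy can’t pause the target (its default mode) until that tracer detaches. A minimal Linux-only sketch, with hypothetical helper names:

```python
from pathlib import Path
from typing import Optional

def tracer_pid(status_text: str) -> Optional[int]:
    """TracerPid from /proc/<pid>/status: 0 means untraced, None means the field is missing."""
    for line in status_text.splitlines():
        if line.startswith("TracerPid:"):
            return int(line.split()[1])
    return None

def explain_tracer(pid: int) -> str:
    value = tracer_pid(Path(f"/proc/{pid}/status").read_text())
    if value is None:
        return "no TracerPid field: the kernel or sandbox hides it, expect py-spy to fail"
    if value == 0:
        return "not traced: no other tracer holds this process"
    return f"already traced by pid {value}: detach gdb/strace first"
```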

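For Kubernetes, the pod that runs py-spy needs securityContext.capabilities.add: [SYS_PTRACE]. Pod Security Standards at the baseline or restricted level block that capability (it isn’t on baseline’s allowed list), and PodSecurityPolicy, removed in Kubernetes 1.25, could block it too. You need an exception namespace or a privileged debug pod.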

Symbols Look Like Hex Addresses

Stack frames showing as 0x7f8b2c19c700 instead of function names mean symbol resolution failed. Common causes:

  • Stripped Python binary (Alpine images often ship without debug symbols). Use python:3.12-slim or python:3.12 instead of python:3.12-alpine when you expect to profile.
  • Stale debug info path. py-spy uses the same paths as gdb; if your debug info is in a separate /usr/lib/debug tree, mount it into the profiling container.
  • The target process is a frozen binary (PyInstaller, Nuitka). py-spy may not resolve frames inside the bundled interpreter. Profile the development build instead.

FixDevs

Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.

