Fix: memray Not Working — Tracking Errors, Flamegraph Empty, and Native Allocations
Quick Answer
How to fix memray errors — memray run command not found, flamegraph shows no data, native allocations not tracked, live mode TUI broken, attach to running process fails, and pytest integration.
The Error
You install memray and run a script — get a binary file but can’t read it:
$ pip install memray
$ memray run my_script.py
# Output: memray-my_script.py.12345.bin
$ cat memray-my_script.py.12345.bin
# Binary garbage: how do I read this?
Or the generated flamegraph is empty:
$ memray flamegraph memray-my_script.py.12345.bin
# Writes an HTML file, but the flamegraph is just one tiny block at the top
Or allocations made inside C extensions can’t be traced to their source:
import numpy as np
arr = np.zeros(1_000_000_000) # Requests ~8 GB
# The report stops at this Python line: no C frames show what happened inside NumPy
Or live mode TUI doesn’t work in your terminal:
$ memray run --live my_script.py
# Terminal goes blank, no display, or weird characters
Or attaching to a running process fails:
$ memray attach 12345
# Error: cannot attach (ptrace permissions)
memray is the heavyweight Python memory profiler: written by Bloomberg, it tracks every allocation (Python and native), supports live monitoring of running processes, and generates flamegraphs. The tooling is excellent, but the default workflow has a “track first, view later” pattern that confuses developers used to live profilers, and native stack traces require explicit opt-in. This guide covers each.
Why This Happens
memray records allocations to a binary file during the program run. The file contains the call stacks and sizes for every alloc — converting it to a human-readable view (flamegraph, summary, tree) happens as a separate memray <command> step. New users expect a “run and see results” workflow like py-spy; memray’s “run, then analyze” model takes adjustment.
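Allocations made inside C/Rust extensions (NumPy, PyTorch) are recorded by default, because memray hooks the system allocator functions (malloc, calloc, mmap, and similar) rather than relying on Python’s tracemalloc. What is missing by default is the native call stack: each allocation is attributed to the Python line that triggered it, and the C/C++/Rust frames underneath stay hidden until you enable --native explicitly.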
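As a rough illustration of the run-then-analyze model described above, the sketch below only builds the two command lines (record, then report) and does not execute them. The helper name memray_commands and its parameters are hypothetical, invented for this example; run, -o, --native, and flamegraph are the memray CLI pieces used elsewhere in this guide.

```python
from typing import List, Tuple


def memray_commands(script: str, output: str = "profile.bin",
                    native: bool = False) -> Tuple[List[str], List[str]]:
    """Return (record_cmd, report_cmd) as argv lists; nothing is executed."""
    # Step 1: record allocations into a binary capture file.
    record = ["memray", "run", "-o", output]
    if native:
        record.append("--native")  # also capture C/C++/Rust frames
    record.append(script)
    # Step 2: turn the capture file into a human-readable report.
    report = ["memray", "flamegraph", output]
    return record, report


record_cmd, report_cmd = memray_commands("my_script.py", native=True)
print(" ".join(record_cmd))
print(" ".join(report_cmd))
```

Each list can be handed to subprocess.run(cmd, check=True) if you want to drive both steps from a script.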
Fix 1: Basic Recording and Viewing
# Record allocations
memray run my_script.py
# Generates: memray-my_script.py.<pid>.bin
# Quick summary
memray summary memray-my_script.py.12345.bin
# Flamegraph (HTML)
memray flamegraph memray-my_script.py.12345.bin
# Writes memray-flamegraph-my_script.py.12345.html (open it in a browser)
# Allocation tree
memray tree memray-my_script.py.12345.bin
# Stats
memray stats memray-my_script.py.12345.bin
The 3-step workflow:
- Run with memray run
- Open the .bin file with a viewer command
- Browse the report
Specify output file:
memray run -o my_profile.bin my_script.py
memray flamegraph my_profile.bin
Common Mistake: Looking for live output during memray run. The recording mode shows no progress bar and no in-terminal stats; it just generates the binary file. For live monitoring, use --live (covered below).
Profile a module/script with args:
memray run -m my_package.main --arg1 value1
memray run my_script.py arg1 arg2
Fix 2: Native Allocations
import numpy as np
import torch
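arr = np.zeros(100_000_000) # 800 MB native alloc
tensor = torch.zeros(50_000_000) # 200 MB native alloc
Without --native, memray still records these allocations, but it attributes them to the Python line that made the call. The C/C++ frames inside NumPy or PyTorch are invisible, so you cannot tell which internal routine requested the memory.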
Enable native tracking:
memray run --native my_script.py
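memray flamegraph memray-my_script.py.12345.bin
Native mode is a recording option, so only memray run takes the flag; the reporters pick up native frames from the capture file. It unwinds the C stack at each intercepted allocation call (malloc, calloc, mmap, and so on), so the report shows native frames for: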
- NumPy / SciPy array allocations
- PyTorch tensor allocations
- pandas DataFrame internal buffers
- Anything any C/Rust extension allocates via standard libc
Overhead — native tracking adds 2-5x slowdown vs Python-only profiling. Worth it when debugging C-extension memory; skip it for pure-Python profiling.
Pro Tip: For ML / data science workloads (PyTorch, TensorFlow, pandas, NumPy), always use --native. Without it, the multi-GB tensor allocations still appear, but only against the Python line that triggered them, with no view of which native routine did the allocating. The slowdown is acceptable for debugging sessions.
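The sizes quoted in the code comments above follow from element count times bytes per element. A minimal sketch of that arithmetic, assuming the default dtypes (float64 for np.zeros, float32 for torch.zeros); buffer_megabytes is an illustrative helper, not part of any library:

```python
def buffer_megabytes(element_count: int, bytes_per_element: int) -> float:
    """Size of a dense numeric buffer in decimal megabytes."""
    return element_count * bytes_per_element / 1_000_000


numpy_mb = buffer_megabytes(100_000_000, 8)  # float64: 8 bytes per element
torch_mb = buffer_megabytes(50_000_000, 4)   # float32: 4 bytes per element
print(f"NumPy array: {numpy_mb:.0f} MB, torch tensor: {torch_mb:.0f} MB")
```

Running the same estimate for your own arrays gives a figure to compare against what the memray report attributes to the allocating line.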
Fix 3: Empty Flamegraph
If memray flamegraph shows just one tiny block, your script either ran too briefly or didn’t allocate significantly.
Force longer profiling:
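# my_script.py
def actually_do_work():
    data = [i ** 2 for i in range(10_000_000)]
    return sum(data)
actually_do_work()
# Add more work if needed: a microsecond-long script has few allocations to track
Use --leaks mode to focus on leaked allocations:
memray flamegraph --leaks memray-script.bin
This shows only allocations that weren’t freed by program end, which focuses the flamegraph on actual leaks. For accurate leak reports, record with --trace-python-allocators (or set PYTHONMALLOC=malloc), because pymalloc keeps freed memory in its own pools instead of returning it to the system allocator.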
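Use --temporary-allocation-threshold for short-lived allocs:
memray flamegraph --temporary-allocation-threshold 1 memray-script.bin
This reports only allocations that were freed again almost immediately (here, after at most one other allocation), which is useful for finding code paths that thrash the allocator. The option belongs to the reporter commands such as flamegraph, not to memray run, and its value counts intervening allocations, not bytes.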
Common Mistake: Profiling a short script (< 100ms) and concluding memray is broken. memray’s overhead per allocation is meaningful — very brief scripts may have so few allocations they don’t make for a meaningful flamegraph. Add more work, or profile a longer test/workload.
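Before concluding that the profiler is broken, it can help to check whether the workload allocates anything substantial at all. The sketch below uses the standard library’s tracemalloc (not memray) as a quick sanity check. It only sees allocations made through Python’s allocators, and peak_bytes and small_workload are illustrative names.

```python
import tracemalloc
from typing import Callable


def peak_bytes(func: Callable[[], object]) -> int:
    """Run func once and return the peak traced Python memory in bytes."""
    tracemalloc.start()
    try:
        func()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def small_workload() -> int:
    # Stand-in for the script you intend to profile.
    data = [i ** 2 for i in range(100_000)]
    return sum(data)


peak = peak_bytes(small_workload)
print(f"Peak traced memory: {peak / 1024:.0f} KiB")
```

If the peak is only a few kilobytes, a near-empty flamegraph is the expected result, not a bug.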
Fix 4: Live Mode TUI
memray run --live my_script.py
Live mode opens a terminal UI showing allocations in real time as the script runs. Useful for long-running scripts.
Live mode controls:
| Key | Action |
|---|---|
| t | Switch between Total/Own memory views |
| ← → | Navigate sort columns |
| s | Toggle ordering |
| q | Quit |
TUI doesn’t render properly — usually a terminal compatibility issue:
# Try different terminal types
TERM=xterm-256color memray run --live my_script.py
TERM=screen memray run --live my_script.py
Or run live mode in a separate process:
# Terminal 1
memray run --live-remote -p 9000 my_script.py
# Terminal 2
memray live 9000
--live-remote opens a socket on the specified port and waits for a viewer; memray live connects to it from a second terminal, which can be a second SSH session on the same host.
Fix 5: Attach to Running Process
# Find PID
ps aux | grep python
# Attach
memray attach 12345
Required permissions:
On Linux, attaching needs ptrace permission:
# Either run as root
sudo memray attach 12345
# Or enable ptrace for unprivileged processes
sudo sysctl kernel.yama.ptrace_scope=0
# Or per-process: have the target call prctl(PR_SET_PTRACER, ...) to allow a specific tracer
Detach with:
memray detach 12345
# Do not send SIGUSR1 to stop tracking: without a handler installed, its default action terminates the process
Common Mistake: Attaching to a process and getting “ptrace permission denied” without realizing it’s a kernel security setting. With kernel.yama.ptrace_scope set to 1 (a common distribution default), a process may only ptrace its own descendants. To attach to an arbitrary process, set it to 0 (less secure) or use sudo.
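To see which ptrace policy applies before attaching, you can read the Yama setting directly. A minimal sketch, assuming a Linux host that exposes /proc/sys/kernel/yama/ptrace_scope; the function names are illustrative, and the file is simply reported as unavailable on other systems.

```python
from pathlib import Path
from typing import Optional

YAMA_PATH = Path("/proc/sys/kernel/yama/ptrace_scope")


def read_ptrace_scope(path: Path = YAMA_PATH) -> Optional[int]:
    """Return the Yama ptrace_scope value, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def describe_ptrace_scope(scope: Optional[int]) -> str:
    """Translate a ptrace_scope value into what it means for attaching."""
    if scope is None:
        return "Yama ptrace_scope not available on this system"
    if scope == 0:
        return "unrestricted: attaching to your own processes is allowed"
    if scope == 1:
        return "restricted: only descendants can be traced; use sudo or lower the setting"
    return "locked down: attaching needs elevated privileges (2) or is disabled (3)"


print(describe_ptrace_scope(read_ptrace_scope()))
```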
Attach + live mode:
memray attach 12345
When no -o option is given, attach starts a live tracking session, so you can peek into a running production-ish process’s memory pattern. Pass -o capture.bin to record to a file instead.
Fix 6: pytest Integration
pip install pytest-memray
# test_my_code.py
import pytest
@pytest.mark.limit_memory("100 MB")
def test_memory_use():
    # Test fails if it allocates > 100 MB
    data = [i ** 2 for i in range(1_000_000)]
    assert sum(data) > 0
@pytest.mark.limit_leaks("1 KB")
def test_no_leaks():
    # Test fails if any single call site leaks more than 1 KB
    result = compute_something()  # placeholder for the code under test
    assert result
pytest --memray # Enable memray for all tests, prints summary
pytest --memray test_my_code.py::test_memory_use # Profile one test
Common Mistake: Setting overly tight limits like limit_memory("10 MB") on tests that legitimately need more. The test fails not because of a bug but because the limit was unrealistic. Profile the test first to know its actual memory baseline, then add 50% headroom for the limit.
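The headroom rule above is simple arithmetic. A small sketch that turns a measured baseline into a limit string for the marker; limit_with_headroom is a hypothetical helper, and the 50% default mirrors the suggestion in the text.

```python
import math


def limit_with_headroom(baseline_mb: float, headroom: float = 0.5) -> str:
    """Return a limit string such as '96 MB' from a measured baseline."""
    if baseline_mb <= 0:
        raise ValueError("baseline_mb must be positive")
    # Round up so the limit never falls below baseline * (1 + headroom).
    limit = math.ceil(baseline_mb * (1 + headroom))
    return f"{limit} MB"


suggested = limit_with_headroom(64)
print(suggested)
```

The returned string can be pasted into @pytest.mark.limit_memory(...) once the baseline has been measured with pytest --memray.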
For pytest fixture patterns that work with memray, see pytest fixture not found.
Fix 7: CI Integration and Regression Detection
# .github/workflows/memory.yml
name: Memory Regression Check
on: [push, pull_request]
jobs:
  memory:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install memray pytest pytest-memray
      - run: pytest --memray --memray-bin-path=memray-reports
      - uses: actions/upload-artifact@v4
        with:
          name: memray-reports
          path: memray-reports/
Compare two profiles to detect regressions:
memray stats baseline.bin
memray stats new.bin
# memray has no built-in diff subcommand; compare peak memory and the top allocators between the two reports
For continuous monitoring in production-like environments, periodic profiling jobs catch slow memory creep before it hits prod.
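One way to turn two profiling runs into a pass/fail signal is to compare their peak memory figures against a tolerance. The sketch below assumes you have already read the two peak values (in bytes) out of the reports by hand or with your own parsing; is_regression and the 10% default tolerance are illustrative choices, not memray features.

```python
def is_regression(baseline_peak: int, new_peak: int, tolerance: float = 0.10) -> bool:
    """True when new_peak exceeds baseline_peak by more than the tolerance."""
    if baseline_peak <= 0:
        raise ValueError("baseline_peak must be positive")
    return new_peak > baseline_peak * (1 + tolerance)


# Example figures (made up): 100 MB baseline versus two candidate runs.
print(is_regression(100_000_000, 105_000_000))
print(is_regression(100_000_000, 120_000_000))
```

A CI step can exit non-zero when this returns True, which keeps noise from small run-to-run variation out of the signal.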
Fix 8: Reading the Flamegraph
The flamegraph’s columns/rows mean different things than you might think:
- X-axis (width) = total allocated memory at that call site
- Y-axis (depth) = call stack — deeper = more nested
- Color = arbitrary, distinguishes adjacent frames
To find leaks:
- Generate a --leaks flamegraph
- Look for wide blocks deep in the stack
- Wide block = lots of memory allocated, never freed
To find hot allocators:
memray summary memray-script.bin
Shows the top allocators by total bytes. Often surprising: Pydantic validation, JSON serialization, and pandas DataFrame construction commonly dominate.
Pro Tip: memray’s tree mode is often more useful than the flamegraph for digging into allocations:
memray tree memray-script.bin
It’s a navigable tree where you can drill down into call paths. Click a function to see what it allocated. For tracking down a specific leak source, tree is faster than scanning a flamegraph.
Still Not Working?
memray vs py-spy vs cProfile
- memray — Memory profiling. Best for finding leaks and high allocators.
- py-spy — CPU profiling. Sample-based, low overhead, attach without restart. Best for understanding where time goes.
- cProfile — Stdlib CPU profiler. Higher overhead but built-in.
- scalene — Memory + CPU + GPU. Newer, full-featured.
For memory-specific debugging, memray wins. For CPU + memory combined, scalene is worth a look.
Tracking Python Allocators Only
memray run --trace-python-allocators my_script.py
Tracks each call to Python’s memory allocator (pymalloc) separately. Useful for understanding small-object churn that doesn’t show up in regular allocation tracking.
Profiling Multi-Process Applications
memray run --follow-fork my_script.py
# Tracks child processes spawned via fork()
Each child gets its own .bin file. Multiprocessing apps (Celery, Gunicorn workers) need this flag to see their workers’ allocations.
For multiprocessing patterns that interact with memory profiling, see Python multiprocessing not working.
Large .bin Files
For long-running profiles, the .bin can be gigabytes:
memray run --aggregate my_script.py
--aggregate records aggregated stats instead of every allocation: much smaller file, less detail.
Profiling Tests / FastAPI / Django
# memray run expects a script path or -m <module>, so launch installed tools as modules
memray run --aggregate -o profile.bin -m pytest tests/
memray flamegraph profile.bin
# FastAPI request handler
memray run --aggregate -o profile.bin -m uvicorn app:app
# Then send requests; press Ctrl+C; analyze
Combining with Structured Logging
For long-running services, periodic memory snapshots via logging:
import resource
import sys
def log_memory():
    usage = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # On Linux, ru_maxrss is in KB; on macOS, in bytes
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    print(f"Peak memory: {usage / divisor:.0f} MB")
Inline Python Use (Context Manager)
For profiling a specific block of code without external CLI:
import memray
with memray.Tracker("output.bin"):
    # Code in here is profiled
    data = [i ** 2 for i in range(1_000_000)]
    process(data)  # placeholder for your own function
# Tracking stops at context exit
# Then analyze:
# memray flamegraph output.bin
This pattern is useful for profiling specific functions inside a larger app without restarting under memray run.
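If the same code must also run where memray is not installed, the tracker can be wrapped so that profiling becomes optional. This is a sketch under that assumption: maybe_track is a hypothetical helper, and the only memray call it relies on is memray.Tracker(path) as shown above.

```python
import contextlib
import os
import tempfile

try:
    import memray  # optional dependency
except ImportError:
    memray = None


@contextlib.contextmanager
def maybe_track(path: str):
    """Profile the enclosed block with memray if available, else run it as-is."""
    if memray is None:
        yield False  # not profiled
        return
    with memray.Tracker(path):
        yield True  # profiled; capture is written to `path`


with tempfile.TemporaryDirectory() as tmp:
    # Tracker refuses to overwrite an existing file, so use a fresh path.
    target = os.path.join(tmp, "block.bin")
    with maybe_track(target) as tracked:
        total = sum(i ** 2 for i in range(1_000))
    print("tracked:", tracked, "total:", total)
```

In a real service you would write the capture somewhere persistent instead of a temporary directory, then run memray flamegraph on it afterwards.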
Tracking Stack Depth
Default stack depth is 50 frames — may not show enough for deeply nested code:
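memray run --max-stack-depth 100 my_script.py
Higher depth gives more context but produces larger .bin files. 100 is enough for most apps; raise to 200+ for deep recursion or complex frameworks. Confirm with memray run --help that your installed version accepts this flag before relying on it.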
Custom Memory Allocators (PyTorch, JAX)
PyTorch’s CUDA allocator and JAX’s allocators are outside libc’s malloc — memray’s --native doesn’t catch them. For GPU memory:
# PyTorch
import torch
print(torch.cuda.memory_summary()) # PyTorch's own memory report
# JAX
import jax
print(jax.devices()[0].memory_stats())
For PyTorch GPU memory issues, the built-in torch.cuda.memory_summary() is the right tool; memray only sees CPU memory.
When to Reach for memray vs Alternatives
- Memory leak in long-running service: memray with --leaks mode
- High memory at peak: memray full profile, look for largest allocators
- OOM kill in CI: memray with --aggregate to keep file size small
- Native extension suspected: memray with --native
- General “is my code slow” question: py-spy first; memory profiling is secondary
For PyTorch-specific memory debugging, see PyTorch not working. For NumPy/Pandas patterns that often dominate memory, see NumPy not working.
Production Incident: Profiling Overhead vs Observability Cost
memray is the right answer for diagnosing a leak, and the wrong answer for “always-on” production observability. Recording every allocation on a busy service can multiply CPU time 3-10x and inflate latency past your SLO. The incident pattern is predictable: a leak in prod gets blamed on the profiler the moment it is enabled, because the overhead is doing exactly what the docs warned about.
Decide before you attach:
- memray run with default tracking on a worker handling > 1k req/s is likely to breach p99 budgets within seconds
- --aggregate cuts the .bin size and the disk I/O that goes with it, but loses per-allocation context: good for capacity baselining, weak for leak hunting
- --native adds further overhead on top; only enable it when you have reason to believe the leak is in a C extension
Safer rollout pattern on Kubernetes:
- Take one replica out of the load balancer (kubectl label pod ... role=debug)
- Attach memray to that pod only, with --aggregate
- Send a controlled slice of traffic via a shadow router or a probe job
- Detach after 5-10 minutes; analyze the .bin offline
Symptoms when the overhead is the real incident:
- Latency p99 jumps the second profiling starts and recovers when it stops
- The OOM you were investigating disappears under profiling because the allocator path changes
- Other replicas, untouched, continue serving fine — confirmation that the profiler, not the bug, caused the user-facing pain
Cost dimension: the .bin files for --aggregate runs of a busy service still reach 100s of MB per minute. Plan storage: dump to a sidecar volume or a separate object store, never to the workload’s primary disk where it can race for inodes with the application.
Pro Tip: for permanent leak hunting in production, prefer continuous low-overhead tools (prometheus_client exposing process_resident_memory_bytes, jemalloc stats, RSS deltas via Kubernetes metrics) and reach for memray only when a specific replica’s RSS climbs past a threshold. Treat memray as a debugger, not a monitor.
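The “reach for memray only past a threshold” rule can be expressed as a small predicate over RSS samples that your monitoring already collects. A minimal sketch: should_profile, the three-sample window, and the 1 GiB threshold are all illustrative assumptions, not recommendations from memray.

```python
from typing import Sequence


def should_profile(rss_samples: Sequence[int], threshold: int,
                   consecutive: int = 3) -> bool:
    """True when the last `consecutive` RSS samples all exceed `threshold` bytes."""
    if consecutive <= 0 or len(rss_samples) < consecutive:
        return False
    return all(sample > threshold for sample in rss_samples[-consecutive:])


GIB = 1024 ** 3
# Made-up samples: RSS climbing from 0.8 GiB to 1.3 GiB.
samples = [int(0.8 * GIB), int(1.1 * GIB), int(1.2 * GIB), int(1.3 * GIB)]
decision = should_profile(samples, threshold=GIB)
print("attach memray:", decision)
```

Requiring several consecutive samples avoids attaching a profiler because of a single transient spike.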
memray-reporter Disk Pressure in CI
Long pytest runs with --memray and --memray-bin-path keep one .bin per test. Hundreds of tests means GBs of artifacts uploaded per run. Leave out --memray-bin-path on routine runs so the dumps go to a temporary directory, use --memray-bin-prefix to label the files you do keep, or only enable --memray for the leak-suspect suite.
--follow-fork Drops Child Output
Forking workers (Celery, multiprocessing) sometimes produce empty or truncated child .bin files when the child is killed before it can flush its capture. Give children a chance to exit cleanly: call Pool.close() and Pool.join() explicitly in your code, and stop the parent with SIGTERM plus a grace period instead of SIGKILL. Without the join, the parent can exit and tear down its workers while they still hold unwritten records.
Flamegraph Missing Frames Under PyO3 / Cython
Rust extensions built with PyO3 and Cython modules built without debug symbols show up as <unknown> frames in a --native flamegraph. Rebuild the extension with debug symbols (RUSTFLAGS="-C debuginfo=2" for PyO3, CFLAGS="-g" for the C code that Cython generates) and rerun the profile so the previously opaque blocks resolve to actual function names. CYTHON_TRACE=1 enables Python-level tracing hooks and does not add debug symbols.
Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.