Fix: memray Not Working — Tracking Errors, Flamegraph Empty, and Native Allocations

Q: How do I fix "memray Not Working — Tracking Errors, Flamegraph Empty, and Native Allocations"?

How to fix memray errors — memray run command not found, flamegraph shows no data, native allocations not tracked, live mode TUI broken, attach to running process fails, and pytest integration.

The Error

You install memray and run a script, and get a binary file you can’t read:

$ pip install memray
$ memray run my_script.py
# Output: memray-my_script.py.12345.bin
$ cat memray-my_script.py.12345.bin
# Binary garbage — how do I read this?

Or the generated flamegraph is empty:

$ memray flamegraph memray-my_script.py.12345.bin
# Opens HTML — but the flamegraph is just one tiny block at the top

Or allocations made inside C extensions show up with no native frames:

import numpy as np
arr = np.zeros(1_000_000_000)   # Requests ~8 GB
# memray attributes the memory to this Python line, but the C call stack underneath is missing

Or live mode TUI doesn’t work in your terminal:

$ memray run --live my_script.py
# Terminal goes blank, no display, or weird characters

Or attaching to a running process fails:

$ memray attach 12345
# Error: cannot attach — ptrace permissions

memray is a heavyweight Python memory profiler written by Bloomberg. It tracks every allocation (Python and native), supports live monitoring of running processes, and generates flamegraphs. The tooling is excellent, but the default workflow has a “track first, view later” pattern that confuses developers used to live profilers, and native stack frames require explicit opt-in. This guide covers each.

Why This Happens

memray records allocations to a binary file during the program run. The file contains the call stacks and sizes for every alloc — converting it to a human-readable view (flamegraph, summary, tree) happens as a separate memray <command> step. New users expect a “run and see results” workflow like py-spy; memray’s “run, then analyze” model takes adjustment.

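Allocations made by native extensions (C/C++/Rust code in NumPy, PyTorch, and similar) are recorded by default, because memray hooks the system allocator functions such as malloc and free. What is missing by default is the native call stack: reports show only the Python frame that triggered the allocation. Pass --native to memray run to capture the C-level frames as well.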

Fix 1: Basic Recording and Viewing

# Record allocations
memray run my_script.py
# Generates: memray-my_script.py.<pid>.bin

# Quick summary
memray summary memray-my_script.py.12345.bin

# Flamegraph (HTML)
memray flamegraph memray-my_script.py.12345.bin
# Writes memray-flamegraph-my_script.py.12345.html (open it in a browser)

# Allocation tree
memray tree memray-my_script.py.12345.bin

# Stats
memray stats memray-my_script.py.12345.bin

The 3-step workflow:

Run with memray run
Open the .bin file with a viewer command
Browse the report

Specify output file:

memray run -o my_profile.bin my_script.py
memray flamegraph my_profile.bin

Common Mistake: Looking for live output during memray run. The recording mode runs silently — no progress bar, no in-terminal stats, just generates the binary file. For live monitoring, use --live (covered below).

Profile a module/script with args:

memray run -m my_package.main --arg1 value1
memray run my_script.py arg1 arg2
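
If the shell reports that memray is not found even though pip install succeeded, the console script is usually just missing from PATH. The sketch below is one way to work around that, assuming memray is installed for the interpreter that runs it: it falls back to the python -m memray module form and builds the record and report commands from this section without executing them. The helper names are made up for illustration.

```python
import shutil
import sys


def memray_command() -> list[str]:
    """Return an argv prefix that invokes memray in the current environment."""
    # Prefer the console script if it is on PATH.
    exe = shutil.which("memray")
    if exe is not None:
        return [exe]
    # Otherwise fall back to the module form, which uses the interpreter
    # running this script (and therefore its site-packages).
    return [sys.executable, "-m", "memray"]


def build_profile_commands(script: str, output: str = "my_profile.bin") -> list[list[str]]:
    """Build the record and report commands without executing them."""
    base = memray_command()
    return [
        base + ["run", "-o", output, script],
        base + ["flamegraph", output],
    ]


if __name__ == "__main__":
    for argv in build_profile_commands("my_script.py"):
        print(" ".join(argv))
```

Each returned list can be passed to subprocess.run as-is, which avoids shell quoting problems when the script path contains spaces.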

Fix 2: Native Allocations

import numpy as np
import torch

arr = np.zeros(100_000_000)   # 800 MB native alloc
tensor = torch.zeros(50_000_000)   # 200 MB native alloc

Without --native, memray still records these allocations, but it attributes them to the Python line that made the call. The C/C++ frames inside NumPy or PyTorch are not shown.

Enable native stack traces:

memray run --native my_script.py
memray flamegraph memray-my_script.py.12345.bin

The flag belongs on memray run; the reporters pick up native frames from the capture file. With --native, memray unwinds the native stack on each allocation, so reports can show which C-level function is responsible for:

NumPy / SciPy array allocations
PyTorch tensor allocations (CPU memory)
pandas DataFrame internal buffers
Anything a C/C++/Rust extension allocates through the system allocator

Overhead: native mode adds roughly a 2-5x slowdown vs Python-only profiling. Worth it when debugging C-extension memory; skip it for pure-Python profiling.

Pro Tip: For ML / data science workloads (PyTorch, TensorFlow, pandas, NumPy), reach for --native when the report points at a single library call and you need to know what happens underneath it. Without it, a multi-GB tensor allocation appears as one opaque Python frame with no hint of which internal routine requested the memory. The slowdown is acceptable for debugging sessions.

Fix 3: Empty Flamegraph

If memray flamegraph shows just one tiny block, your script either ran too briefly or didn’t allocate significantly.

Force longer profiling:

# my_script.py
def actually_do_work():
    data = [i ** 2 for i in range(10_000_000)]
    return sum(data)

actually_do_work()
# Add more work if needed — a microsecond-long script has no allocations to track

Use --leaks mode to focus on leaked allocations:

memray flamegraph --leaks memray-script.bin

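This shows only allocations that had not been freed when tracking stopped, which focuses the flamegraph on actual leaks. For leak hunting, the memray documentation recommends running the program with PYTHONMALLOC=malloc, because pymalloc’s pooled arenas otherwise blur which objects are still alive.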

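Use --temporary-allocations for short-lived allocs:

memray flamegraph --temporary-allocations memray-script.bin

This reports allocations that are freed almost immediately after being made, which is useful for finding code paths that thrash the allocator. --temporary-allocation-threshold N widens the definition: an allocation counts as temporary if at most N other allocations happen before it is freed. Both are reporter options, not memray run options.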

Common Mistake: Profiling a short script (< 100ms) and concluding memray is broken. Very brief scripts make so few allocations that the flamegraph has almost nothing to draw. Add more work, or profile a longer test/workload.

Fix 4: Live Mode TUI

memray run --live my_script.py

Live mode opens a terminal UI showing allocations in real time as the script runs. Useful for long-running scripts.

Live mode controls:

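Key	Action
`t`	Sort by total memory (default)
`o`	Sort by own memory
`a`	Sort by allocation count
`<` `>`	Switch between threads
`q`	Quit

Bindings differ slightly between releases; the footer of the TUI lists the ones your version supports.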

TUI doesn’t render properly — usually a terminal compatibility issue:

# Try different terminal types
TERM=xterm-256color memray run --live my_script.py
TERM=screen memray run --live my_script.py

Or run live mode in a separate process:

# Terminal 1
memray run --live-remote -p 9000 my_script.py

# Terminal 2
memray live 9000

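--live-remote opens a socket on the specified port and waits for a viewer; memray live then connects to that port from another terminal on the same host. To watch a remote machine, SSH in and run memray live there, or forward the port over SSH.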

Fix 5: Attach to Running Process

# Find PID
ps aux | grep python

# Attach
memray attach 12345

Required permissions:

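On Linux, memray attach drives gdb or lldb, so one of them must be installed, and the debugger needs ptrace permission on the target:

# Either run as root
sudo memray attach 12345

# Or enable ptrace for unprivileged processes
sudo sysctl kernel.yama.ptrace_scope=0

# Or per-process: have the target call prctl(PR_SET_PTRACER, ...) to allow a specific tracer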

Stop tracking with:

memray detach 12345
# Or bound the session up front (duration in seconds)
memray attach --duration 60 -o attached.bin 12345

Common Mistake: Attaching to a process and getting “ptrace permission denied” without realizing it’s a kernel security setting. With kernel.yama.ptrace_scope set to 1 (the default on many distributions), a process may only ptrace its own descendants. To attach to an unrelated process, set it to 0 (less secure) or use sudo.
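
Before changing any kernel setting, it is worth checking what the current value actually is. A small sketch using only the standard library: it reads the Yama setting from /proc and translates it, returning a fallback message on systems where the file does not exist (macOS, or Linux kernels built without Yama). The function names are illustrative.

```python
from pathlib import Path
from typing import Optional

YAMA_PATH = Path("/proc/sys/kernel/yama/ptrace_scope")

# Meanings as documented for the Linux Yama security module.
SCOPE_MEANING = {
    0: "classic: a process may ptrace any other process running under the same uid",
    1: "restricted: a process may only ptrace its own descendants",
    2: "admin-only: ptrace requires CAP_SYS_PTRACE",
    3: "disabled: no process may attach with ptrace",
}


def read_ptrace_scope(path: Path = YAMA_PATH) -> Optional[int]:
    """Return the current Yama scope, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def describe(scope: Optional[int]) -> str:
    """Translate a scope value into a one-line explanation."""
    if scope is None:
        return "Yama ptrace_scope not available on this system"
    return SCOPE_MEANING.get(scope, f"unknown scope value {scope}")


if __name__ == "__main__":
    print(describe(read_ptrace_scope()))
```

A value of 1 means attaching without sudo is expected to fail for any process you did not launch from the attaching process itself.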

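Attach + live mode:

memray attach 12345
memray attach -o attached.bin 12345

Without -o, memray attach opens the live TUI on the target, which lets you peek into a running production-ish process’s memory pattern. With -o, it writes a capture file for later analysis instead.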

Fix 6: pytest Integration

pip install pytest-memray

# test_my_code.py
import pytest

def compute_something():
    # Hypothetical function under test; replace with your own code
    return [str(i) for i in range(1_000)]

@pytest.mark.limit_memory("100 MB")
def test_memory_use():
    # Test fails if its peak memory use exceeds 100 MB
    data = [i ** 2 for i in range(1_000_000)]
    assert sum(data) > 0

@pytest.mark.limit_leaks("1 KB")
def test_no_leaks():
    # Test fails if any single call stack leaks more than 1 KB
    result = compute_something()
    assert result

pytest --memray   # Enable memray for all tests, prints summary
pytest --memray test_my_code.py::test_memory_use   # Profile one test

Common Mistake: Setting overly tight limits like limit_memory("10 MB") on tests that legitimately need more. The test fails not because of a bug but because the limit was unrealistic. Profile the test first to know its actual memory baseline, then add 50% headroom for the limit.
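
The "baseline plus 50%" rule is simple arithmetic, sketched here as a helper that turns a measured peak into a limit string. This is an illustration, not part of pytest-memray: the function name is made up, and it assumes the marker treats "MB" as 1024 * 1024 bytes, which is worth confirming against the plugin documentation.

```python
import math


def suggest_limit(baseline_bytes: int, headroom: float = 0.5) -> str:
    """Return a limit string such as '150 MB' for a measured baseline.

    headroom=0.5 means "baseline plus 50%". The result is rounded up to a
    whole number of megabytes so the limit never lands below the target.
    """
    if baseline_bytes < 0 or headroom < 0:
        raise ValueError("baseline_bytes and headroom must be non-negative")
    target = baseline_bytes * (1 + headroom)
    # Assumption: 1 MB is treated as 1024 * 1024 bytes.
    megabytes = max(1, math.ceil(target / (1024 * 1024)))
    return f"{megabytes} MB"
```

For a test that peaks at 100 MB, suggest_limit(100 * 1024 * 1024) returns "150 MB", which can be pasted into the limit_memory marker.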

For pytest fixture patterns that work with memray, see pytest fixture not found.

Fix 7: CI Integration and Regression Detection

# .github/workflows/memory.yml
name: Memory Regression Check

on: [push, pull_request]

jobs:
  memory:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install memray pytest pytest-memray
      - run: pytest --memray --memray-bin-path=memray-reports
      - uses: actions/upload-artifact@v4
        with:
          name: memray-reports
          path: memray-reports/

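Compare two profiles to detect regressions. memray does not ship a dedicated diff command, so export the numbers from each capture and compare them:

memray stats --json -o baseline.json baseline.bin
memray stats --json -o new.json new.bin
# Diff the two JSON reports (total bytes allocated, peak memory, top allocators)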

For continuous monitoring in production-like environments, periodic profiling jobs catch slow memory creep before it hits prod.
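
One way to turn two profiles into a pass/fail CI check is to export a single number from each run and compare the two with a tolerance. A minimal sketch, assuming each run has been summarized into a JSON file: the field name total_bytes_allocated is an assumption made for illustration, so check the keys in your own reports before using it.

```python
import json
from pathlib import Path


def load_metric(report_path: str, key: str = "total_bytes_allocated") -> float:
    """Read one numeric field from a JSON report (the key name is an assumption)."""
    data = json.loads(Path(report_path).read_text())
    return float(data[key])


def has_regressed(baseline: float, current: float, tolerance: float = 0.10) -> bool:
    """True if `current` exceeds `baseline` by more than `tolerance` (10% by default)."""
    if baseline <= 0:
        return current > 0
    return (current - baseline) / baseline > tolerance
```

In a CI job, the script would call sys.exit(1) when has_regressed returns True, so that the step fails and the uploaded artifacts can be inspected.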

Fix 8: Reading the Flamegraph

The flamegraph’s axes mean different things than you might think:

X-axis (width) = memory attributed to that frame and everything it calls; by default, measured at the moment of peak usage
Y-axis (depth) = call stack, with the root at the top and callees growing downward
Color = arbitrary, distinguishes adjacent frames
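
The width rule is easier to see with numbers. The toy calculation below is not memray code; it only mimics how a flamegraph aggregates sizes: every frame's width is the sum of all allocations whose call stack passes through it. The frame names and byte counts are made up.

```python
from collections import defaultdict

# Each record: (call stack from outermost to innermost frame, bytes allocated).
allocations = [
    (("main", "load_data", "parse_row"), 600),
    (("main", "load_data", "read_file"), 300),
    (("main", "build_index"), 100),
]


def frame_widths(records):
    """Sum bytes for every stack prefix: a frame's width includes its callees."""
    widths = defaultdict(int)
    for stack, size in records:
        for depth in range(1, len(stack) + 1):
            widths[stack[:depth]] += size
    return dict(widths)


if __name__ == "__main__":
    for stack, size in sorted(frame_widths(allocations).items()):
        print(f"{' > '.join(stack):40s} {size:5d} bytes")
```

Here main spans all 1000 bytes and load_data spans 900, even though load_data allocates nothing itself. A wide block therefore marks a path worth following downward, not necessarily the function that called the allocator.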

To find leaks:

Generate --leaks flamegraph
Look for wide blocks deep in the stack
Wide block = lots of memory allocated, never freed

To find hot allocators:

memray summary memray-script.bin

Shows top N allocators by total bytes. Often surprising — Pydantic validation, JSON serialization, and pandas DataFrame construction commonly dominate.

Pro Tip: memray’s tree mode is often more useful than the flamegraph for digging into allocations:

memray tree memray-script.bin

It shows the call tree in the terminal, largest allocators first, so you can follow a call path down to the function that allocated. For tracking down a specific leak source, tree is often faster than scanning a flamegraph.

Still Not Working?

memray vs py-spy vs cProfile

memray — Memory profiling. Best for finding leaks and high allocators.
py-spy — CPU profiling. Sample-based, low overhead, attach without restart. Best for understanding where time goes.
cProfile — Stdlib CPU profiler. Higher overhead but built-in.
scalene — Memory + CPU + GPU. Newer, full-featured.

For memory-specific debugging, memray wins. For CPU + memory combined, scalene is worth a look.

Tracking Python Allocators Only

memray run --trace-python-allocators my_script.py

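Also records every request made to Python’s small-object allocator (pymalloc), instead of only the larger blocks pymalloc obtains from the system allocator. Useful for understanding small-object churn that doesn’t show up in regular allocation tracking. Expect slower runs and larger capture files.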

Profiling Multi-Process Applications

memray run --follow-fork my_script.py
# Tracks child processes spawned via fork()

Each child gets its own .bin file. Multiprocessing apps (Celery, Gunicorn workers) need this flag to see their workers’ allocations.

For multiprocessing patterns that interact with memory profiling, see Python multiprocessing not working.

Large .bin Files

For long-running profiles, the .bin can be gigabytes:

memray run --aggregate my_script.py

--aggregate records aggregated stats instead of every allocation — much smaller file, less detail.

Profiling Tests / FastAPI / Django

memray run --aggregate -o profile.bin -m pytest tests/
memray flamegraph profile.bin

# FastAPI app served by uvicorn
memray run --aggregate -o profile.bin -m uvicorn app:app
# Then send requests; press Ctrl+C; analyze

Combining with Structured Logging

For long-running services, periodic memory snapshots via logging:

import resource
import sys

def log_memory():
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is in kilobytes on Linux and in bytes on macOS
    peak_bytes = peak if sys.platform == "darwin" else peak * 1024
    print(f"Peak memory: {peak_bytes / (1024 * 1024):.0f} MB")

Inline Python Use (Context Manager)

For profiling a specific block of code without external CLI:

import memray

with memray.Tracker("output.bin"):   # raises if output.bin already exists
    # Code in here is profiled
    data = [i ** 2 for i in range(1_000_000)]
    total = sum(data)

# Tracking stops at context exit
# Then analyze:
#   memray flamegraph output.bin

This pattern is useful for profiling specific functions inside a larger app without restarting under memray run.

Tracking Stack Depth

Default stack depth is 50 frames — may not show enough for deeply nested code:

memray run --max-stack-depth 100 my_script.py

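Higher depth gives more context but produces larger .bin files. 100 is enough for most apps; raise to 200+ for deep recursion or complex frameworks. Flag names vary between memray releases, so confirm that --max-stack-depth appears in memray run --help for your installed version before relying on it.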

Custom Memory Allocators (PyTorch, JAX)

PyTorch’s CUDA allocator and JAX’s device allocators manage GPU memory outside the system malloc, so memray doesn’t see it, with or without --native. For GPU memory:

# PyTorch
import torch
print(torch.cuda.memory_summary())   # PyTorch's own memory report

# JAX
import jax
print(jax.devices()[0].memory_stats())

For PyTorch GPU memory issues, the built-in torch.cuda.memory_summary() is the right tool — memray only sees CPU memory.

When to Reach for memray vs Alternatives

Memory leak in long-running service — memray with --leaks mode
High memory at peak — memray full profile, look for largest allocators
OOM kill in CI — memray with --aggregate to keep file size small
Native extension suspected — memray with --native
General “is my code slow” question — py-spy first; memory profiling is secondary

For PyTorch-specific memory debugging, see PyTorch not working. For NumPy/Pandas patterns that often dominate memory, see NumPy not working.

Production Incident: Profiling Overhead vs Observability Cost

memray is the right answer for diagnosing a leak, and the wrong answer for “always-on” production observability. Recording every allocation on a busy service can multiply CPU time 3-10x and inflate latency past your SLO. The incident pattern is predictable: a leak in prod gets blamed on the profiler the moment it is enabled, because the overhead is doing exactly what the docs warned about.

Decide before you attach:

memray run with default tracking on a worker handling > 1k req/s will breach p99 budgets within seconds
--aggregate cuts the .bin size and most of the overhead but loses per-allocation context — good for capacity baselining, weak for leak hunting
--native adds stack-unwinding cost on top of that; only enable it when you need C-level frames to pinpoint an allocation inside an extension

Safer rollout pattern on Kubernetes:

Take one replica out of the load balancer (kubectl label pod ... role=debug)
Attach memray to that pod only, with --aggregate
Send a controlled slice of traffic via a shadow router or a probe job
Detach after 5-10 minutes; analyze the .bin offline

Symptoms when the overhead is the real incident:

Latency p99 jumps the second profiling starts and recovers when it stops
The OOM you were investigating disappears under profiling because the allocator path changes
Other replicas, untouched, continue serving fine — confirmation that the profiler, not the bug, caused the user-facing pain

Cost dimension: the .bin files for --aggregate runs of a busy service still reach 100s of MB per minute. Plan storage: dump to a sidecar volume or a separate object store, never to the workload’s primary disk where it can race for inodes with the application.

Pro Tip: for permanent leak hunting in production, prefer continuous low-overhead tools (prometheus_client exposing process_resident_memory_bytes, jemalloc stats, RSS deltas via Kubernetes metrics) and reach for memray only when a specific replica’s RSS climbs past a threshold. Treat memray as a debugger, not a monitor.
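
The "reach for memray only past a threshold" rule can be written down as a small decision function over RSS samples that your metrics system already collects. This is a sketch under stated assumptions: the function name, the 5% growth default, and the idea of sampling once a minute are illustrative choices, not memray or Kubernetes features.

```python
def rss_is_climbing(samples: list[int], threshold_bytes: int, min_growth: float = 0.05) -> bool:
    """Decide whether a replica looks like a profiling candidate.

    samples: RSS readings in bytes, oldest first (for example one per minute).
    Returns True only if the latest reading is at or above the threshold AND
    RSS grew by at least `min_growth` (5% by default) across the window.
    """
    if len(samples) < 2:
        return False
    first, last = samples[0], samples[-1]
    if last < threshold_bytes:
        return False
    if first <= 0:
        return True
    return (last - first) / first >= min_growth
```

Requiring both conditions keeps a replica that is large but stable (a warm cache, for example) from triggering a profiling session that would only add overhead.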

memray-reporter Disk Pressure in CI

Long pytest runs with --memray and --memray-bin-path keep one .bin per test. Hundreds of tests means GBs of artifacts uploaded per run. Without --memray-bin-path the dumps go to a temporary directory, so set the path only in jobs that need the artifacts, use --memray-bin-prefix to give the files predictable names for selective upload, or enable --memray only for the leak-suspect suite.
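
Another option is to trim the report directory before the upload step. A minimal sketch using only the standard library, assuming the dumps end in .bin and that the largest files are the interesting ones; the function name and the default of 20 are arbitrary choices for illustration.

```python
from pathlib import Path


def keep_largest_dumps(report_dir: str, keep: int = 20, pattern: str = "*.bin") -> list[Path]:
    """Delete all but the `keep` largest files matching `pattern`; return what was removed."""
    files = sorted(
        (p for p in Path(report_dir).glob(pattern) if p.is_file()),
        key=lambda p: p.stat().st_size,
        reverse=True,
    )
    removed = files[keep:]
    for path in removed:
        path.unlink()
    return removed
```

Run it as a step between pytest and the artifact upload, for example keep_largest_dumps("memray-reports", keep=20), so only the heaviest captures leave the runner.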

`--follow-fork` Drops Child Output

Forking workers (Celery, multiprocessing) sometimes produce empty or truncated child .bin files when the child is killed before it can flush its capture. Give children a clean shutdown path: call Pool.close() and Pool.join() explicitly in your code, and stop services with a signal the workers handle gracefully rather than SIGKILL. Without the join, the parent can exit and tear down its workers mid-write, leaving child captures that the reporters cannot read.

Flamegraph Missing Frames Under PyO3 / Cython

When profiling with --native, frames from Rust extensions built with PyO3 and from Cython modules built without debug symbols show up as <unknown> in the flamegraph. Rebuild the extension with debug symbols (RUSTFLAGS="-C debuginfo=2" for PyO3, CFLAGS="-g" for Cython) and rerun the profile so the previously opaque blocks resolve to actual function names.
