Fix: Python threading Not Running in Parallel (GIL Limitations)
Quick Answer
How to fix Python threading not achieving parallelism due to the GIL — when to use multiprocessing, concurrent.futures, or asyncio instead, and what the GIL actually blocks.
The Error
You add threads to speed up your Python code but see no performance improvement — or worse, it runs slower:
import threading
import time
def cpu_task(n):
    # Heavy computation
    total = sum(i * i for i in range(n))
    return total
start = time.time()
threads = [threading.Thread(target=cpu_task, args=(10_000_000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(f"Threaded: {time.time() - start:.2f}s")  # ~8s: SLOWER than sequential!
start = time.time()
for _ in range(4): cpu_task(10_000_000)
print(f"Sequential: {time.time() - start:.2f}s")  # ~2s: faster
Threads run but provide no speedup for CPU-bound work.
Why This Happens
CPython (the standard Python interpreter) has the Global Interpreter Lock (GIL) — a mutex that allows only one thread to execute Python bytecode at a time. Even on a multi-core CPU, Python threads cannot run Python code simultaneously. The threads exist as real OS threads, they are scheduled by the kernel, and they take turns acquiring the GIL — but only one is ever holding it.
The GIL exists because CPython’s memory management (reference counting for garbage collection) is not thread-safe. Every Py_INCREF and Py_DECREF would otherwise need an atomic operation, which is expensive on most architectures. The GIL gives single-threaded code a free pass on synchronization at the cost of multi-threaded scaling. Removing it cleanly requires fundamental changes to the interpreter — which is exactly what PEP 703 and the Python 3.13 free-threaded build attempt.
Adding threads to CPU-bound code can actually make it slower than sequential code because the threads still serialize on the GIL, but you pay extra cost in context switches, GIL handoff, and cache invalidation between cores. This is the classic trap: “I added threads and my code got slower.” That is not a bug — that is the GIL.
What the GIL blocks:
- Parallel execution of Python bytecode across CPU cores.
- CPU-bound tasks (computation, data processing, string manipulation) gain nothing from threads.
What the GIL does NOT block:
- I/O operations — when a thread waits for network, disk, or sleep, it releases the GIL, allowing other threads to run.
- C extensions that release the GIL (NumPy, OpenSSL, database drivers using libpq or libmysqlclient).
- Multiple processes — each process has its own GIL.
So threads are useful for I/O-bound work but useless, and often harmful, for CPU-bound work written in pure Python.
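To see the I/O side of this split, here is a minimal sketch that uses time.sleep() as a stand-in for a blocking network or disk call. The names fake_io and run_threaded and the 0.2-second delay are illustrative assumptions, not part of any library. Because a waiting thread releases the GIL, the four waits overlap instead of running back to back.

```python
import threading
import time

def fake_io(delay):
    # time.sleep() releases the GIL while waiting, like a blocking socket read
    time.sleep(delay)

def run_threaded(n_threads, delay):
    """Start n_threads waiting threads and return the total wall-clock time."""
    start = time.perf_counter()
    threads = [threading.Thread(target=fake_io, args=(delay,)) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

elapsed = run_threaded(4, 0.2)
print(f"4 threads x 0.2s wait: {elapsed:.2f}s total")  # close to 0.2s, not 0.8s
```

Swapping fake_io for a pure-Python loop should remove the overlap on a standard GIL build, which is the behavior described above.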
Platform and Environment Differences
GIL behavior is not uniform across Python interpreters, versions, or builds. The decision tree for “what should I actually use” depends heavily on what you are running.
CPython 3.0 - 3.12 (standard interpreter). Full GIL. Threads share one core for Python code. This is what the vast majority of production deployments run.
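CPython 3.13+ free-threaded build (PEP 703). Python 3.13 (October 2024) ships an opt-in build that disables the GIL: python3.13t. You install it separately. The GIL is off by default in that build, and you can force it to stay off with PYTHON_GIL=0 or -X gil=0 (otherwise importing an incompatible C extension can re-enable it). Check at runtime with sys._is_gil_enabled(). In 3.13 the free-threaded build is officially experimental: many C extensions are not yet compatible, and single-threaded code is noticeably slower than on the GIL build (the official docs cite roughly 40% overhead on pyperformance) because of thread-safe reference counting (biased reference counting) and disabled interpreter optimizations. Python 3.14 (October 2025) makes the free-threaded build officially supported (PEP 779) and narrows the single-threaded gap to roughly 5-10%, but it remains an optional, non-default build. Check C-extension compatibility before relying on it in production.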
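A quick way to find out which of these situations applies to a given interpreter is to ask it. The sketch below is a best-effort check: the helper name gil_status is hypothetical, and it assumes the GIL is enabled whenever sys._is_gil_enabled() is missing (that is, on Python 3.12 and earlier).

```python
import sys
import sysconfig

def gil_status():
    """Report whether this is a free-threaded build and whether the GIL is active."""
    # Py_GIL_DISABLED is 1 on free-threaded builds; it is 0 or None elsewhere
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() only exists on Python 3.13+; assume the GIL is on otherwise
    check = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = check() if check is not None else True
    return {"free_threaded_build": free_threaded_build, "gil_enabled": gil_enabled}

print(gil_status())
```

On a free-threaded build, gil_enabled can still come back True if the GIL was re-enabled at runtime, so it is worth logging both values.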
CPython 3.14 (October 2025). The default start method for multiprocessing on Linux changed from fork to forkserver (macOS and Windows already default to spawn). This is relevant when comparing threads vs processes: see Python multiprocessing not working for the implications.
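Because the default start method differs across platforms and Python versions, one option is to choose it explicitly instead of relying on the default. The following is a small sketch; pick_start_method is a hypothetical helper, and the preference order is an assumption you may want to change.

```python
import multiprocessing as mp

def pick_start_method(preferred=("forkserver", "spawn")):
    """Return the first preferred start method that this platform supports."""
    available = mp.get_all_start_methods()
    for method in preferred:
        if method in available:
            return method
    # Fall back to whatever the platform lists first
    return available[0]

print("platform default:", mp.get_start_method())
print("explicit choice:", pick_start_method())
```

Passing the result to mp.get_context() gives a context object whose Pool and Process behave the same way regardless of the interpreter's default, provided the calling code sits under an if __name__ == "__main__": guard.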
PyPy. PyPy has a GIL, like CPython. Its JIT makes single-threaded code faster but does not let threads run Python code in parallel. PyPy briefly experimented with Software Transactional Memory (PyPy-STM) as a GIL replacement around 2014-2016, but the project was paused and never reached production. Use PyPy when you need single-threaded speed, not for parallel threads.
Jython. Runs on the JVM and has no GIL. Java threads run Python code in true parallel. The downside: Jython is stuck at Python 2.7 syntax for the official release, with Python 3 support still incomplete. Most modern libraries do not work.
IronPython. Runs on .NET and also has no GIL. IronPython 3 supports Python 3.4 syntax (with ongoing work toward newer versions). Like Jython, it lacks support for most C-extension-heavy scientific libraries (NumPy, pandas) because those require CPython’s C API.
MicroPython, CircuitPython, GraalPy. Each has its own threading model. MicroPython is single-threaded by default. GraalPy (GraalVM Python) supports true parallel threads but is still maturing for full library compatibility.
ARM vs x86 thread scheduling. The cost of a GIL handoff depends on the CPU, the OS scheduler, and the price of atomic operations, so it can differ between x86 servers and ARM chips such as Apple Silicon (M1/M2/M3/M4) and AWS Graviton. CPU-bound threaded Python is serialized on all of them; only the switching overhead varies. This rarely changes the right architecture choice, but it does mean benchmarks vary across platforms.
Choosing between threading, multiprocessing, and asyncio.
- threading: I/O-bound work, when you need shared in-process state and a small number of concurrent tasks (10-100s).
- multiprocessing: CPU-bound work, when you need to use multiple cores. Pickling overhead matters for large data.
- asyncio: I/O-bound work with high concurrency (1000s of connections) and code you can write in async def style. Lower memory footprint than threads.
- concurrent.futures: high-level wrapper around the first two; pick ThreadPoolExecutor or ProcessPoolExecutor to switch.
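The last option in that list can be sketched as a single helper that switches executor class based on the workload. The names run_parallel and word_count are hypothetical, and the cpu_bound flag is an assumption about how you classify the task.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def run_parallel(fn, items, cpu_bound=False, max_workers=4):
    """Map fn over items with processes (CPU-bound) or threads (I/O-bound)."""
    executor_cls = ProcessPoolExecutor if cpu_bound else ThreadPoolExecutor
    with executor_cls(max_workers=max_workers) as executor:
        # executor.map returns results in input order
        return list(executor.map(fn, items))

def word_count(text):
    return len(text.split())

if __name__ == "__main__":
    docs = ["a b c", "d e", "f"]
    print(run_parallel(word_count, docs))  # thread pool; prints [3, 2, 1]
```

With cpu_bound=True the function and its arguments must be picklable, which in practice means fn is defined at module top level and the call happens under the main guard.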
GPU and CUDA workloads. PyTorch, TensorFlow, and JAX release the GIL during their compiled kernels, so threads can issue GPU work in parallel even on CPython. But fork()-based multiprocessing after CUDA initialization is unsafe: the child process cannot reuse the parent's CUDA context and typically fails or hangs. Always use the spawn or forkserver start method for ML training pipelines.
Fix 1: Use multiprocessing for CPU-Bound Work
Replace threading with multiprocessing for CPU-bound parallelism. Each process has its own Python interpreter and GIL:
Broken (threading for CPU work):
import threading
results = []
lock = threading.Lock()
def cpu_task(n):
    total = sum(i * i for i in range(n))
    with lock:
        results.append(total)
threads = [threading.Thread(target=cpu_task, args=(10_000_000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
# No speedup: the GIL prevents parallel execution
Fixed (multiprocessing):
from multiprocessing import Pool
def cpu_task(n):
    return sum(i * i for i in range(n))
if __name__ == "__main__":
    with Pool(4) as pool:  # 4 worker processes
        results = pool.map(cpu_task, [10_000_000] * 4)
    print(results)
# Up to ~4x faster on a 4-core machine
multiprocessing.Pool runs each task in a separate process with its own GIL, so the tasks execute in true parallel on multiple CPU cores.
When to use which:
- CPU-bound (computation, data processing): multiprocessing
- I/O-bound (network, disk, database): threading or asyncio
- Mixed (CPU + I/O): concurrent.futures with the appropriate executor
- Many lightweight concurrent tasks: asyncio
Fix 2: Use concurrent.futures for a Unified API
concurrent.futures provides a consistent interface that works with both threads and processes:
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
import urllib.request
def io_task(url):
    """I/O-bound: use ThreadPoolExecutor"""
    with urllib.request.urlopen(url) as response:
        return len(response.read())
def cpu_task(n):
    """CPU-bound: use ProcessPoolExecutor"""
    return sum(i * i for i in range(n))
urls = ["https://example.com"] * 10
numbers = [10_000_000] * 4
if __name__ == "__main__":  # needed for process pools on spawn-based platforms
    # I/O-bound: threads work well
    with ThreadPoolExecutor(max_workers=10) as executor:
        io_results = list(executor.map(io_task, urls))
    # CPU-bound: use processes
    with ProcessPoolExecutor(max_workers=4) as executor:
        cpu_results = list(executor.map(cpu_task, numbers))
concurrent.futures is higher-level than threading or multiprocessing directly. Use as_completed() for handling results as they finish:
from concurrent.futures import ProcessPoolExecutor, as_completed
def process_chunk(chunk):
    return sum(x * x for x in chunk)
data_chunks = [range(i, i + 1_000_000) for i in range(0, 10_000_000, 1_000_000)]
if __name__ == "__main__":  # needed for process pools on spawn-based platforms
    with ProcessPoolExecutor() as executor:
        # Map each future back to the index of the chunk it is processing
        futures = {executor.submit(process_chunk, chunk): i for i, chunk in enumerate(data_chunks)}
        for future in as_completed(futures):
            chunk_index = futures[future]
            result = future.result()
            print(f"Chunk {chunk_index} done: {result}")
Fix 3: Use asyncio for I/O-Bound Concurrency
For I/O-bound tasks (HTTP requests, database queries, file operations), asyncio is more efficient than threads — one thread handles thousands of concurrent operations:
Threading for I/O (works, but has overhead):
import threading
import urllib.request
def fetch(url):
    with urllib.request.urlopen(url) as r:
        return r.read()
threads = [threading.Thread(target=fetch, args=("https://example.com",)) for _ in range(100)]
for t in threads: t.start()
for t in threads: t.join()
# 100 threads: high memory usage, OS context switching overhead
asyncio for I/O (more efficient):
import asyncio
import aiohttp
async def fetch(session, url):
    async with session.get(url) as response:
        return await response.read()
async def main():
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, "https://example.com") for _ in range(100)]
        results = await asyncio.gather(*tasks)
        return results
asyncio.run(main())
# 100 concurrent requests, 1 thread: much lower overhead
asyncio uses cooperative multitasking: one OS thread handles all I/O concurrency by switching between tasks when they wait for I/O. There is no GIL contention because only one task runs at a time, but I/O waits are overlapped.
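The overlap of I/O waits can be observed without any third-party package. This is a standard-library-only sketch in which asyncio.sleep() stands in for awaiting a network response; fake_request and the 0.1-second delay are illustrative assumptions.

```python
import asyncio
import time

async def fake_request(i, delay=0.1):
    # asyncio.sleep() stands in for awaiting a network response
    await asyncio.sleep(delay)
    return i

async def main(n=50):
    start = time.perf_counter()
    # All n coroutines wait concurrently on a single thread
    results = await asyncio.gather(*(fake_request(i) for i in range(n)))
    return results, time.perf_counter() - start

results, elapsed = asyncio.run(main())
print(f"{len(results)} tasks in {elapsed:.2f}s")  # roughly 0.1s, not 50 x 0.1s
```

The total time stays close to a single delay because every task yields to the event loop while it waits.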
Fix 4: Use NumPy / C Extensions That Release the GIL
Many scientific computing libraries (NumPy, SciPy, and parts of pandas) release the GIL during heavy C-level operations. Threading can work for these:
import threading
import numpy as np
def matrix_multiply(size):
    a = np.random.rand(size, size)
    b = np.random.rand(size, size)
    return np.dot(a, b)  # NumPy releases the GIL during computation
# This actually runs in parallel because NumPy releases the GIL
threads = [threading.Thread(target=matrix_multiply, args=(1000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
# Faster than sequential: NumPy C code runs in parallel
Check if a library releases the GIL by profiling with multiple threads. If adding threads speeds up the work, the library releases the GIL during the heavy operation.
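That check can be turned into a small measurement. The sketch below compares sequential and threaded wall-clock time for the same callable; thread_speedup is a hypothetical helper, and releases_gil uses time.sleep() as a stand-in for a C call that drops the GIL. A ratio near 1 suggests the work holds the GIL, while a ratio near the thread count suggests it releases it.

```python
import threading
import time

def thread_speedup(fn, n_threads=4):
    """Return sequential_time / threaded_time for n_threads calls of fn."""
    start = time.perf_counter()
    for _ in range(n_threads):
        fn()
    sequential = time.perf_counter() - start

    threads = [threading.Thread(target=fn) for _ in range(n_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    threaded = time.perf_counter() - start
    return sequential / threaded

def pure_python():
    sum(i * i for i in range(200_000))

def releases_gil():
    time.sleep(0.05)  # stand-in for a C call that drops the GIL

print(f"pure Python: {thread_speedup(pure_python):.1f}x")
print(f"GIL-releasing call: {thread_speedup(releases_gil):.1f}x")
```

To test a real library, pass a zero-argument callable that performs the heavy operation, and keep in mind that BLAS-backed NumPy calls may already use several cores on their own.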
Fix 5: Profile to Confirm the Bottleneck
Before switching from threading to multiprocessing, confirm the bottleneck is CPU-bound (GIL-limited) vs I/O-bound:
import cProfile
import pstats
def my_task():
    # Your code here
    result = sum(i * i for i in range(10_000_000))
    return result
with cProfile.Profile() as pr:
    my_task()
stats = pstats.Stats(pr)
stats.sort_stats("cumulative")
stats.print_stats(10)  # Top 10 functions by cumulative time
Check CPU usage during threading:
# Run your threaded program, then in another terminal:
top -p $(pgrep -f "python your_script.py")
# If CPU usage is ~100% (one core), it's GIL-limited
# If CPU usage is ~400% (four cores), threads ARE running in parallel (C extension work that releases the GIL)
Use py-spy for low-overhead profiling:
pip install py-spy
py-spy top --pid $(pgrep -f python)
Fix 6: Python 3.13+ Free-Threaded Mode (GIL Disabled)
Python 3.13 introduced an experimental build option to disable the GIL (--disable-gil). This is available as a separate build of CPython and is not the default:
# Install free-threaded Python (Python 3.13+)
# On Ubuntu via pyenv:
PYTHON_CONFIGURE_OPTS="--disable-gil" pyenv install 3.13.0
# Verify GIL is disabled
python -c "import sys; print(sys._is_gil_enabled())"
# False: GIL is disabled

import threading
def cpu_task():
    return sum(i * i for i in range(10_000_000))
# With GIL disabled, threads run in true parallel
threads = [threading.Thread(target=cpu_task) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
# Up to ~4x faster on 4 cores
Warning: Free-threaded Python 3.13 is experimental. Many third-party libraries (especially C extensions) are not yet compatible. Use it for testing and exploration, not production workloads.
Fix 7: Practical Patterns for Real-World Use
Web scraping — use threads (I/O-bound):
from concurrent.futures import ThreadPoolExecutor
import requests
def scrape(url):
    return requests.get(url).text
urls = [f"https://example.com/page/{i}" for i in range(100)]
with ThreadPoolExecutor(max_workers=20) as executor:
    pages = list(executor.map(scrape, urls))
Data processing pipeline: use processes (CPU-bound):
from concurrent.futures import ProcessPoolExecutor
import pandas as pd
def process_chunk(filepath):
    df = pd.read_csv(filepath)
    # Heavy transformation
    return df.groupby("category").sum()
files = [f"data_{i}.csv" for i in range(20)]
if __name__ == "__main__":  # needed for process pools on spawn-based platforms
    with ProcessPoolExecutor() as executor:
        results = list(executor.map(process_chunk, files))
    combined = pd.concat(results)
Mixed I/O and CPU: chain executors:
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
import requests
def download(url):
    return requests.get(url).content  # I/O-bound
def process(data):
    return len(data) * 2  # CPU-bound (simplified)
urls = [f"https://example.com/file/{i}" for i in range(10)]
if __name__ == "__main__":  # needed for process pools on spawn-based platforms
    # Step 1: download in threads (I/O-bound)
    with ThreadPoolExecutor(max_workers=10) as executor:
        raw_data = list(executor.map(download, urls))
    # Step 2: process in processes (CPU-bound)
    with ProcessPoolExecutor() as executor:
        results = list(executor.map(process, raw_data))
Still Not Working?
Benchmark before optimizing. Use time.perf_counter() to measure actual execution time with and without parallelism. If the task is fast enough that overhead dominates, parallelism makes it slower.
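A minimal timing helper along those lines is sketched here. The names best_of and small_task are hypothetical, and taking the minimum of a few runs is one common convention for reducing noise, not the only valid one.

```python
import time

def best_of(fn, repeats=3):
    """Run fn() `repeats` times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)

def small_task():
    return sum(i * i for i in range(10_000))

baseline = best_of(small_task)
print(f"sequential baseline: {baseline * 1000:.2f} ms")
```

If the baseline is only a few milliseconds, the cost of starting threads or processes is likely to be larger than the work itself.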
Check pickling overhead for multiprocessing. Data passed between processes must be pickled (serialized). For large datasets, pickling time can exceed the parallelism benefit. Pass file paths or database queries instead of raw data when possible.
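To get a feel for that cost, you can time one pickle round trip of the data you plan to send. In this sketch the dataset and the file name data_0.csv are made up for illustration, and pickle_cost is a hypothetical helper.

```python
import pickle
import time

def pickle_cost(obj):
    """Return (size_in_bytes, seconds) for one pickle round trip of obj."""
    start = time.perf_counter()
    payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    pickle.loads(payload)
    return len(payload), time.perf_counter() - start

big_rows = [{"id": i, "value": i * 0.5} for i in range(100_000)]  # made-up dataset
path = "data_0.csv"  # hypothetical file that a worker would read by itself

for label, obj in [("raw rows", big_rows), ("file path", path)]:
    size, seconds = pickle_cost(obj)
    print(f"{label}: {size} bytes, {seconds * 1000:.2f} ms")
```

Every task argument and return value pays this cost, so a pipeline that ships megabytes per task can lose more to serialization than it gains from extra cores.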
Consider joblib for scientific computing. joblib provides a high-level parallel computing interface commonly used with scikit-learn:
from joblib import Parallel, delayed
def cpu_task(n):
    return sum(i * i for i in range(n))
results = Parallel(n_jobs=4)(
    delayed(cpu_task)(n) for n in [10_000_000] * 4
)
joblib also supports memory mapping for large NumPy arrays (mmap_mode="r"), avoiding the pickle overhead entirely by sharing the array through the filesystem.
Check OMP_NUM_THREADS and MKL_NUM_THREADS. NumPy, SciPy, and scikit-learn link against OpenMP/MKL/BLAS, which spawn their own thread pools outside the GIL. If you launch 4 processes and each runs NumPy with 8 OpenMP threads, you have 32 threads contending for the same cores. Set OMP_NUM_THREADS=1 before forking workers to avoid oversubscription:
OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 python my_script.py
Check that you are not running on a single-vCPU container. AWS Lambda, Cloud Run, and small Fargate tasks may have only 1 vCPU regardless of how many cores the host has. Multiprocessing on a 1-vCPU container gives you zero speedup. Verify with os.cpu_count() and, on Linux, len(os.sched_getaffinity(0)). The second number reflects CPU affinity and cpuset limits, but neither call accounts for CFS CPU quotas such as a fractional CPU limit on a container.
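A best-effort version of that check is sketched below. usable_cpus is a hypothetical helper; it prefers os.process_cpu_count() (available from Python 3.13), falls back to os.sched_getaffinity() where the platform provides it, and otherwise uses os.cpu_count(). None of these see CFS CPU quotas, so treat the result as an upper bound inside quota-limited containers.

```python
import os

def usable_cpus():
    """Best-effort count of CPUs this process is allowed to run on."""
    # Python 3.13+ exposes the affinity-aware count directly
    if hasattr(os, "process_cpu_count"):
        return os.process_cpu_count() or 1
    # Available on Linux: the set of CPUs this process may be scheduled on
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1

print(f"os.cpu_count(): {os.cpu_count()}, usable: {usable_cpus()}")
```

Using this value for max_workers avoids starting more worker processes than the process can actually schedule.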
Check for the asyncio + threading mix-up. Calling synchronous blocking I/O from inside an async def function blocks the entire event loop. If you must call a blocking library, wrap it with asyncio.to_thread() or loop.run_in_executor(). See Python async/sync mix for the patterns.
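The asyncio.to_thread() approach can be sketched as follows. blocking_call is a hypothetical stand-in for a synchronous library function, and asyncio.to_thread() requires Python 3.9 or newer.

```python
import asyncio
import time

def blocking_call(seconds):
    # Stand-in for a synchronous library call (for example a blocking HTTP client)
    time.sleep(seconds)
    return seconds

async def main():
    start = time.perf_counter()
    # Each blocking call runs in a worker thread, so the event loop stays free
    results = await asyncio.gather(
        asyncio.to_thread(blocking_call, 0.2),
        asyncio.to_thread(blocking_call, 0.2),
        asyncio.sleep(0.2, result="loop stayed responsive"),
    )
    return results, time.perf_counter() - start

results, elapsed = asyncio.run(main())
print(results, f"{elapsed:.2f}s")
```

Calling blocking_call(0.2) directly inside main() would instead stall every other task for the full duration of each call.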
Check for Python interpreter pinning to one core. Some container runtimes (especially older Kubernetes with strict CPU limits and CFS quotas) effectively serialize Python’s worker processes onto one CPU. Profile with perf stat -e cs to see context-switch counts.
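Check for asyncio.gather hiding failures. If you mix threading with asyncio (asyncio.to_thread) and a worker raises, gather defaults to return_exceptions=False: it raises the first exception to the caller right away, while the remaining awaitables are not cancelled and keep running in the background, where you never see their results. That can look like “threading not making progress.” See Python asyncio.gather error for the pattern.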
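One way to make worker failures visible is to pass return_exceptions=True and inspect each result. In this sketch flaky_worker is a hypothetical job that fails on purpose for one input.

```python
import asyncio

def flaky_worker(i):
    # Hypothetical blocking job; worker 1 fails on purpose
    if i == 1:
        raise ValueError(f"worker {i} failed")
    return i * 10

async def main():
    jobs = [asyncio.to_thread(flaky_worker, i) for i in range(3)]
    # return_exceptions=True returns each failure in place instead of raising the first one
    return await asyncio.gather(*jobs, return_exceptions=True)

results = asyncio.run(main())
for i, result in enumerate(results):
    if isinstance(result, Exception):
        print(f"job {i} raised: {result!r}")
    else:
        print(f"job {i} returned: {result}")
```

Results keep the order of the submitted jobs, so a failure can be matched to its input by position.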
For multiprocessing-specific errors (freeze_support, pickle errors), see Fix: Python multiprocessing not working. For asyncio runtime errors, see Fix: Python asyncio not running.
Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.