Fix: Python threading Not Running in Parallel (GIL Limitations)

Q: How do I fix "Python threading Not Running in Parallel (GIL Limitations)"?

How to fix Python threading not achieving parallelism due to the GIL — when to use multiprocessing, concurrent.futures, or asyncio instead, and what the GIL actually blocks.

The Error

You add threads to speed up your Python code but see no performance improvement — or worse, it runs slower:

import threading
import time

def cpu_task(n):
    # Heavy computation
    total = sum(i * i for i in range(n))
    return total

start = time.perf_counter()
threads = [threading.Thread(target=cpu_task, args=(10_000_000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(f"Threaded: {time.perf_counter() - start:.2f}s")  # ~8s, SLOWER than sequential! (timings vary by machine)

start = time.perf_counter()
for _ in range(4): cpu_task(10_000_000)
print(f"Sequential: {time.perf_counter() - start:.2f}s")  # ~2s, faster

Threads run but provide no speedup for CPU-bound work.

Why This Happens

CPython (the standard Python interpreter) has the Global Interpreter Lock (GIL) — a mutex that allows only one thread to execute Python bytecode at a time. Even on a multi-core CPU, Python threads cannot run Python code simultaneously. The threads exist as real OS threads, they are scheduled by the kernel, and they take turns acquiring the GIL — but only one is ever holding it.

The GIL exists because CPython’s memory management (reference counting for garbage collection) is not thread-safe. Every Py_INCREF and Py_DECREF would otherwise need an atomic operation, which is expensive on most architectures. The GIL gives single-threaded code a free pass on synchronization at the cost of multi-threaded scaling. Removing it cleanly requires fundamental changes to the interpreter — which is exactly what PEP 703 and the Python 3.13 free-threaded build attempt.

Adding threads to CPU-bound code can actually make it slower than sequential code because the threads still serialize on the GIL, but you pay extra cost in context switches, GIL handoff, and cache invalidation between cores. This is the classic trap: “I added threads and my code got slower.” That is not a bug — that is the GIL.

What the GIL blocks:

Parallel execution of Python bytecode across CPU cores.
CPU-bound tasks (computation, data processing, string manipulation) gain nothing from threads.

What the GIL does NOT block:

I/O operations — when a thread waits for network, disk, or sleep, it releases the GIL, allowing other threads to run.
C extensions that release the GIL (NumPy, OpenSSL, database drivers using libpq or libmysqlclient).
Multiple processes — each process has its own GIL.

So threads help with I/O-bound work but give no speedup (and can add overhead) for CPU-bound pure-Python work.
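
To see the I/O side of this in isolation, here is a minimal sketch in which time.sleep stands in for a blocking network or disk call. The function names and the 0.2-second delay are illustrative assumptions, not part of any real workload. Because a waiting thread releases the GIL, the four waits overlap instead of adding up.

```python
import threading
import time

def io_task(delay):
    # time.sleep stands in for a blocking network or disk call;
    # the thread releases the GIL while it waits.
    time.sleep(delay)

def run_threaded(n_threads, delay):
    """Run n_threads waiting tasks concurrently and return elapsed seconds."""
    threads = [threading.Thread(target=io_task, args=(delay,)) for _ in range(n_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

elapsed = run_threaded(4, 0.2)
print(f"4 threads x 0.2s wait: {elapsed:.2f}s")  # close to 0.2s, not 0.8s
```

Swapping io_task for a pure-Python loop should remove the overlap, which is the behavior shown in the first example.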

Platform and Environment Differences

GIL behavior is not uniform across Python interpreters, versions, or builds. The decision tree for “what should I actually use” depends heavily on what you are running.

CPython 3.0 - 3.12 (standard interpreter). Full GIL. Threads share one core for Python code. This is what the vast majority of production deployments run, and the default (non-free-threaded) builds of 3.13 and later behave the same way.

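CPython 3.13+ free-threaded build (PEP 703). Python 3.13 (October 2024) ships an opt-in build without the GIL, usually installed as python3.13t. In that build the GIL is off by default; setting PYTHON_GIL=0 (or passing -X gil=0) keeps it off even when an imported C extension that has not declared free-threading support would otherwise re-enable it. Check at runtime with sys._is_gil_enabled(). In 3.13 the free-threaded build is officially experimental: many C extensions are not yet compatible, and single-threaded code is noticeably slower than on the GIL build (the 3.13 documentation cites roughly 40% on the pyperformance suite, largely because the specializing adaptive interpreter is disabled there). Python 3.14 (October 2025) moves the build to officially supported but still optional status under PEP 779 and reduces the single-threaded overhead to roughly 5-10%. Production use still depends on whether every C extension you rely on supports free threading.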
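
A minimal sketch of a runtime check that works on both old and new interpreters. The helper name gil_status is an assumption for illustration; sys._is_gil_enabled() only exists on Python 3.13+, so the sketch falls back to assuming the GIL is on when the function is missing.

```python
import sys
import sysconfig

def gil_status():
    """Report whether this is a free-threaded build and whether the GIL is active."""
    # Py_GIL_DISABLED is 1 on free-threaded builds, and 0 or None otherwise.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() only exists on Python 3.13+; assume the GIL is on elsewhere.
    check = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = check() if check is not None else True
    return {"free_threaded_build": free_threaded_build, "gil_enabled": gil_enabled}

print(gil_status())
```

A free-threaded build can still report gil_enabled as True if an incompatible extension module caused the interpreter to turn the GIL back on, so checking both values is more informative than checking one.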

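CPython 3.14 (October 2025). The default multiprocessing start method on Linux (and other POSIX platforms except macOS) changed from fork to forkserver; macOS and Windows continue to default to spawn. This is relevant when comparing threads vs processes, because forkserver and spawn workers do not inherit the parent's in-memory state the way fork workers do. See Python multiprocessing not working for the implications.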
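
Because the default start method differs across platforms and Python versions, one option is to pin it explicitly rather than rely on the default. The sketch below is one way to do that with multiprocessing.get_context; the helper pick_start_method and the preference order are assumptions for illustration.

```python
import multiprocessing as mp

def square(n):
    return n * n

def pick_start_method(preferred=("forkserver", "spawn")):
    """Return the first preferred start method this platform supports."""
    available = mp.get_all_start_methods()
    for method in preferred:
        if method in available:
            return method
    return available[0]

if __name__ == "__main__":
    # A context pins the start method for this pool only, without
    # changing the process-wide default.
    ctx = mp.get_context(pick_start_method())
    with ctx.Pool(2) as pool:
        print(pool.map(square, [1, 2, 3]))  # [1, 4, 9]
```

The if __name__ == "__main__" guard matters here: with spawn or forkserver the main module is imported again in the child or server process, and unguarded pool creation would run a second time.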

PyPy. PyPy has a GIL, similar to CPython. Its JIT makes single-threaded Python code faster but does not let threads run Python code in parallel. PyPy experimented with Software Transactional Memory (PyPy-STM) as a GIL replacement around 2014-2016, but the project was paused and never reached production. Use PyPy when you need single-threaded speed, not for parallel threads.

Jython. Runs on the JVM and has no GIL. Java threads run Python code in true parallel. The downside: Jython is stuck at Python 2.7 syntax for the official release, with Python 3 support still incomplete. Most modern libraries do not work.

IronPython. Runs on .NET and also has no GIL. IronPython 3 supports Python 3.4 syntax (with ongoing work toward newer versions). Like Jython, it lacks support for most C-extension-heavy scientific libraries (NumPy, pandas) because those require CPython’s C API.

MicroPython, CircuitPython, GraalPy. Each has its own threading model. MicroPython is single-threaded by default. GraalPy (GraalVM Python) currently implements a GIL as well and is still maturing for full library compatibility; check its documentation for the current threading model.

ARM vs x86 thread scheduling. The cost of a GIL handoff depends on the CPU, the OS scheduler, and the cache hierarchy, so the threading overhead you measure on Apple Silicon (M1/M2/M3/M4) or AWS Graviton can differ from what you see on x86. CPU-bound threaded Python is still serialized on every platform. This rarely changes the right architecture choice, but it does mean benchmarks vary across platforms.

Choosing between threading, multiprocessing, and asyncio.

threading: I/O-bound work, when you need shared in-process state and a small number of concurrent tasks (10-100s).
multiprocessing: CPU-bound work, when you need to use multiple cores. Pickling overhead matters for large data.
asyncio: I/O-bound work with high concurrency (1000s of connections) and code you can write in async def style. Lower memory footprint than threads.
concurrent.futures: high-level wrapper around the first two; pick ThreadPoolExecutor or ProcessPoolExecutor to switch.

GPU and CUDA workloads. PyTorch, TensorFlow, and JAX release the GIL during their compiled kernels, so threads can issue GPU work in parallel even on CPython. But fork()-based multiprocessing after CUDA initialization fails or deadlocks (PyTorch, for example, raises "Cannot re-initialize CUDA in forked subprocess"); always use the spawn or forkserver start methods for ML training pipelines.

Fix 1: Use multiprocessing for CPU-Bound Work

Replace threading with multiprocessing for CPU-bound parallelism. Each process has its own Python interpreter and GIL:

Broken — threading for CPU work:

import threading

results = []
lock = threading.Lock()

def cpu_task(n):
    total = sum(i * i for i in range(n))
    with lock:
        results.append(total)

threads = [threading.Thread(target=cpu_task, args=(10_000_000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
# No speedup — GIL prevents parallel execution

Fixed — multiprocessing:

from multiprocessing import Pool

def cpu_task(n):
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    with Pool(4) as pool:  # 4 worker processes
        results = pool.map(cpu_task, [10_000_000] * 4)
    print(results)
    # Up to ~4x faster on a 4-core machine (minus process startup and pickling overhead)

multiprocessing.Pool runs each task in a separate process with its own GIL. True parallel execution on multiple CPU cores.

When to use which:
Workload	Use
CPU-bound (computation, data processing)	`multiprocessing`
I/O-bound (network, disk, database)	`threading` or `asyncio`
Mixed (CPU + I/O)	`concurrent.futures` with appropriate executor
Many lightweight concurrent tasks	`asyncio`

Fix 2: Use concurrent.futures for a Unified API

concurrent.futures provides a consistent interface that works with both threads and processes:

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
import time

def io_task(url):
    """I/O-bound — use ThreadPoolExecutor"""
    import urllib.request
    with urllib.request.urlopen(url) as response:
        return len(response.read())

def cpu_task(n):
    """CPU-bound — use ProcessPoolExecutor"""
    return sum(i * i for i in range(n))

urls = ["https://example.com"] * 10
numbers = [10_000_000] * 4

if __name__ == "__main__":  # required for process pools under spawn/forkserver
    # I/O-bound: threads work well
    with ThreadPoolExecutor(max_workers=10) as executor:
        io_results = list(executor.map(io_task, urls))

    # CPU-bound: use processes
    with ProcessPoolExecutor(max_workers=4) as executor:
        cpu_results = list(executor.map(cpu_task, numbers))

concurrent.futures is higher-level than threading or multiprocessing directly. Use as_completed() for handling results as they finish:

from concurrent.futures import ProcessPoolExecutor, as_completed

def process_chunk(chunk):
    return sum(x * x for x in chunk)

data_chunks = [range(i, i + 1_000_000) for i in range(0, 10_000_000, 1_000_000)]

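if __name__ == "__main__":  # required for process pools under spawn/forkserver
    with ProcessPoolExecutor() as executor:
        # Map each future back to the index of the chunk it is processing
        futures = {executor.submit(process_chunk, chunk): i for i, chunk in enumerate(data_chunks)}
        for future in as_completed(futures):  # yields futures in completion order
            chunk_index = futures[future]
            result = future.result()
            print(f"Chunk {chunk_index} done: {result}")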

Fix 3: Use asyncio for I/O-Bound Concurrency

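For I/O-bound tasks (HTTP requests, database queries, other network calls), asyncio is often more efficient than threads, because one thread can handle thousands of concurrent operations. Ordinary file reads and writes have no native async API in asyncio and still block the event loop unless you move them to a thread.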

Threading for I/O (works, but has overhead):

import threading
import urllib.request

def fetch(url):
    with urllib.request.urlopen(url) as r:
        return r.read()

threads = [threading.Thread(target=fetch, args=("https://example.com",)) for _ in range(100)]
for t in threads: t.start()
for t in threads: t.join()
# 100 threads — high memory usage, OS context switching overhead

asyncio for I/O (more efficient):

import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.read()

async def main():
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, "https://example.com") for _ in range(100)]
        results = await asyncio.gather(*tasks)
    return results

asyncio.run(main())
# 100 concurrent requests, 1 thread — much lower overhead

asyncio uses cooperative multitasking — one OS thread handles all I/O concurrency by switching between tasks when they wait for I/O. No GIL contention because only one task runs at a time, but I/O waits are overlapped.
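
A stdlib-only sketch of the same overlap, with asyncio.sleep standing in for a network call so it runs without aiohttp. The names fake_request and run_all, the 0.05-second delay, and the cap of 10 in-flight requests are all illustrative assumptions. Bounding concurrency with a Semaphore is a common companion to gather when the remote side cannot take every request at once.

```python
import asyncio
import time

async def fake_request(i, limiter, delay=0.05):
    # The semaphore caps how many coroutines are "in flight" at once.
    async with limiter:
        await asyncio.sleep(delay)  # stand-in for awaiting a real response
        return i

async def run_all(n_requests, max_in_flight):
    limiter = asyncio.Semaphore(max_in_flight)
    tasks = [fake_request(i, limiter) for i in range(n_requests)]
    return await asyncio.gather(*tasks)  # results come back in submission order

start = time.perf_counter()
results = asyncio.run(run_all(n_requests=50, max_in_flight=10))
elapsed = time.perf_counter() - start
print(f"{len(results)} requests in {elapsed:.2f}s on one thread")
```

Run one at a time, 50 waits of 0.05s would take about 2.5s; with 10 in flight the total should be closer to five batches of 0.05s.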

Fix 4: Use NumPy / C Extensions That Release the GIL

Many scientific computing libraries (NumPy, SciPy, pandas) release the GIL during heavy C-level operations. Threading works for these:

import threading
import numpy as np

def matrix_multiply(size):
    a = np.random.rand(size, size)
    b = np.random.rand(size, size)
    return np.dot(a, b)  # NumPy releases GIL during computation

# This actually runs in parallel because NumPy releases the GIL
threads = [threading.Thread(target=matrix_multiply, args=(1000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
# Often faster than sequential, since NumPy's C/BLAS code runs in parallel (less so if BLAS already uses every core)

Check if a library releases the GIL by profiling with multiple threads. If adding threads speeds up the work, the library releases the GIL during the heavy operation.
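
The check described above can be sketched as a small helper that times the same call sequentially and then across threads. thread_speedup, pure_python, and releases_gil are hypothetical names; time.sleep is used as a stand-in for a library call that releases the GIL.

```python
import threading
import time

def thread_speedup(func, n_threads=4):
    """Return sequential_time / threaded_time for n_threads calls of func."""
    start = time.perf_counter()
    for _ in range(n_threads):
        func()
    sequential = time.perf_counter() - start

    threads = [threading.Thread(target=func) for _ in range(n_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    threaded = time.perf_counter() - start
    return sequential / threaded

def pure_python():
    sum(i * i for i in range(200_000))  # holds the GIL throughout

def releases_gil():
    time.sleep(0.05)  # releases the GIL while waiting

print(f"pure Python: {thread_speedup(pure_python):.1f}x")
print(f"GIL-releasing call: {thread_speedup(releases_gil):.1f}x")
```

A ratio near 1 (or below) suggests the call holds the GIL; a ratio approaching n_threads suggests it releases it. Two caveats: a library that already runs its own internal thread pool may show a ratio near 1 even though it releases the GIL, and on a free-threaded build pure-Python code can scale as well.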

Fix 5: Profile to Confirm the Bottleneck

Before switching from threading to multiprocessing, confirm the bottleneck is CPU-bound (GIL-limited) vs I/O-bound:

import cProfile
import pstats

def my_task():
    # Your code here
    result = sum(i * i for i in range(10_000_000))
    return result

with cProfile.Profile() as pr:
    my_task()

stats = pstats.Stats(pr)
stats.sort_stats("cumulative")
stats.print_stats(10)  # Top 10 functions by cumulative time

Check CPU usage during threading:

# Run your threaded program, then in another terminal:
top -p $(pgrep -f "python your_script.py")
# If CPU usage is ~100% (one core), it's GIL-limited
# If CPU usage is ~400% (four cores), threads ARE running in parallel (I/O or C extension work)

Use py-spy for low-overhead profiling:

pip install py-spy
py-spy top --pid $(pgrep -f "python your_script.py")

Fix 6: Python 3.13+ Free-Threaded Mode (GIL Disabled)

Python 3.13 introduced an experimental build option to disable the GIL (--disable-gil). This is available as a separate build of CPython and is not the default:

# Install free-threaded Python (Python 3.13+)
# On Ubuntu via pyenv:
PYTHON_CONFIGURE_OPTS="--disable-gil" pyenv install 3.13.0

# Verify GIL is disabled
python -c "import sys; print(sys._is_gil_enabled())"
# False — GIL is disabled

import threading

def cpu_task():
    return sum(i * i for i in range(10_000_000))

# With GIL disabled, threads run in true parallel
threads = [threading.Thread(target=cpu_task) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
# Up to ~4x faster on 4 cores

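Warning: Free-threaded Python 3.13 is experimental. Python 3.14 makes the free-threaded build officially supported (PEP 779), though it remains optional. Many third-party libraries (especially C extensions) are not yet compatible, so test your full dependency set before relying on it for production workloads.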

Fix 7: Practical Patterns for Real-World Use

Web scraping — use threads (I/O-bound):

from concurrent.futures import ThreadPoolExecutor
import requests

def scrape(url):
    # A timeout keeps one stalled request from hanging a worker thread
    return requests.get(url, timeout=10).text

urls = [f"https://example.com/page/{i}" for i in range(100)]

with ThreadPoolExecutor(max_workers=20) as executor:
    pages = list(executor.map(scrape, urls))

Data processing pipeline — use processes (CPU-bound):

from concurrent.futures import ProcessPoolExecutor
import pandas as pd

def process_chunk(filepath):
    df = pd.read_csv(filepath)
    # Heavy transformation
    return df.groupby("category").sum()

files = [f"data_{i}.csv" for i in range(20)]

if __name__ == "__main__":  # required for process pools under spawn/forkserver
    with ProcessPoolExecutor() as executor:
        results = list(executor.map(process_chunk, files))

    combined = pd.concat(results)

Mixed I/O and CPU — chain executors:

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
import requests

def download(url):
    return requests.get(url, timeout=10).content  # I/O-bound

def process(data):
    return len(data) * 2  # CPU-bound (simplified)

urls = [f"https://example.com/file/{i}" for i in range(10)]

if __name__ == "__main__":  # required for process pools under spawn/forkserver
    # Step 1: download in threads (I/O-bound)
    with ThreadPoolExecutor(max_workers=10) as executor:
        raw_data = list(executor.map(download, urls))

    # Step 2: process in processes (CPU-bound)
    with ProcessPoolExecutor() as executor:
        results = list(executor.map(process, raw_data))

Still Not Working?

Benchmark before optimizing. Use time.perf_counter() to measure actual execution time with and without parallelism. If the task is fast enough that overhead dominates, parallelism makes it slower.
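
A minimal sketch of such a measurement, using a deliberately trivial task so the dispatch overhead is visible. tiny_task and timed are hypothetical helpers, and the item count is arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def tiny_task(x):
    return x + 1  # far too little work to be worth dispatching

def timed(fn):
    """Return (result, elapsed seconds) for a zero-argument callable."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start

items = range(20_000)

seq_result, seq_time = timed(lambda: [tiny_task(x) for x in items])

with ThreadPoolExecutor(max_workers=4) as executor:
    pool_result, pool_time = timed(lambda: list(executor.map(tiny_task, items)))

print(f"sequential: {seq_time:.4f}s, thread pool: {pool_time:.4f}s")
```

For work this small the pool version is usually the slower of the two, because submitting and collecting each task costs more than the task itself. The same comparison with your real task shows whether parallelism pays for its overhead.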

Check pickling overhead for multiprocessing. Data passed between processes must be pickled (serialized). For large datasets, pickling time can exceed the parallelism benefit. Pass file paths or database queries instead of raw data when possible.
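
To get a rough feel for that cost before launching any processes, you can measure how large an argument becomes once serialized. This is a sketch under assumptions: pickle_cost is a hypothetical helper, the 200,000-row list is made-up data, and data_0.csv is a placeholder path the worker would open itself.

```python
import pickle
import time

def pickle_cost(obj):
    """Return (size in bytes, seconds to serialize) for one argument."""
    start = time.perf_counter()
    payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    return len(payload), time.perf_counter() - start

rows = [[float(i), float(i) * 2.0] for i in range(200_000)]  # in-memory data
path = "data_0.csv"  # hypothetical file the worker could open itself

data_size, data_time = pickle_cost(rows)
path_size, path_time = pickle_cost(path)
print(f"raw data: {data_size:,} bytes in {data_time:.3f}s")
print(f"file path: {path_size} bytes")
```

The raw data pays this serialization cost once per task on the way in, plus deserialization in the worker, while the path costs a few dozen bytes.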

Consider joblib for scientific computing. joblib provides a high-level parallel computing interface commonly used with scikit-learn:

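from joblib import Parallel, delayed

def cpu_task(n):
    return sum(i * i for i in range(n))

# n_jobs=4 runs up to four calls at once in worker processes
results = Parallel(n_jobs=4)(
    delayed(cpu_task)(n) for n in [10_000_000] * 4
)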

joblib also supports memory mapping for large NumPy arrays (mmap_mode="r"), avoiding the pickle overhead entirely by sharing the array through the filesystem.

Check OMP_NUM_THREADS and MKL_NUM_THREADS. NumPy, SciPy, and scikit-learn link against OpenMP/MKL/BLAS, which spawn their own thread pools outside the GIL. If you launch 4 processes and each runs NumPy with 8 OpenMP threads, you have 32 threads contending for the same cores. Set OMP_NUM_THREADS=1 before NumPy is imported (most simply in the environment that launches the script) to avoid oversubscription:

OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 python my_script.py

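Check that you are not running on a single-vCPU container. AWS Lambda, Cloud Run, and small Fargate tasks may have only 1 vCPU (or a fraction of one) regardless of how many cores the host has. Multiprocessing on a 1-vCPU container gives you zero speedup. Compare os.cpu_count() with len(os.sched_getaffinity(0)) (Linux; it reflects the CPU affinity mask) or os.process_cpu_count() on Python 3.13+. None of these account for cgroup CPU quotas such as docker run --cpus, so also check the container's configured CPU limit.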
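
A best-effort sketch of that check that degrades gracefully across platforms and Python versions. usable_cpus is a hypothetical helper name; os.process_cpu_count() exists only on Python 3.13+ and os.sched_getaffinity() only on some Unix platforms (notably Linux), so both are probed with hasattr.

```python
import os

def usable_cpus():
    """Best-effort count of CPUs this process may actually run on."""
    # Python 3.13+: accounts for CPU affinity where the platform supports it.
    if hasattr(os, "process_cpu_count"):
        count = os.process_cpu_count()
    # Some Unix platforms (notably Linux): CPUs this process is allowed to use.
    elif hasattr(os, "sched_getaffinity"):
        count = len(os.sched_getaffinity(0))
    else:
        count = os.cpu_count()
    return count or 1  # the calls above can return None

print(f"os.cpu_count(): {os.cpu_count()}, usable: {usable_cpus()}")
```

These calls report affinity masks; a CPU quota (for example a fractional CPU limit on a container) may not be visible through them, so treat the result as an upper bound when sizing a process pool.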

Check for the asyncio + threading mix-up. Calling synchronous blocking I/O from inside an async def function blocks the entire event loop. If you must call a blocking library, wrap it with asyncio.to_thread() or loop.run_in_executor(). See Python async/sync mix for the patterns.
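
A minimal sketch of the asyncio.to_thread() pattern (Python 3.9+), with time.sleep standing in for a synchronous library call. blocking_call and the 0.2-second delay are illustrative assumptions.

```python
import asyncio
import time

def blocking_call(delay):
    time.sleep(delay)  # stand-in for a synchronous library call
    return delay

async def main():
    start = time.perf_counter()
    # Each blocking call runs in the default thread pool, so the
    # event loop stays free and the three waits overlap.
    results = await asyncio.gather(
        asyncio.to_thread(blocking_call, 0.2),
        asyncio.to_thread(blocking_call, 0.2),
        asyncio.to_thread(blocking_call, 0.2),
    )
    return results, time.perf_counter() - start

results, elapsed = asyncio.run(main())
print(results, f"{elapsed:.2f}s")
```

Calling blocking_call(0.2) directly inside main() three times would take about 0.6s and stall every other task on the loop for that whole time.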

Check for Python interpreter pinning to one core. Some container runtimes (especially older Kubernetes with strict CPU limits and CFS quotas) effectively serialize Python’s worker processes onto one CPU. Profile with perf stat -e cs to see context-switch counts.

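Check for asyncio.gather hiding failures. If you mix threading with asyncio (asyncio.to_thread) and a worker raises, gather with the default return_exceptions=False propagates the first exception to the caller immediately, while the remaining awaitables are not cancelled and keep running in the background. That can look like “threading not making progress” or like work continuing after an error. Pass return_exceptions=True to collect every outcome, or use asyncio.TaskGroup (Python 3.11+) if you want sibling tasks cancelled on the first failure. See Python asyncio.gather error for the pattern.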
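
A short sketch of collecting every outcome from thread-backed tasks with return_exceptions=True. The worker function and its deliberate failure on n == 2 are made up for illustration.

```python
import asyncio

def worker(n):
    if n == 2:
        raise ValueError(f"task {n} failed")
    return n * 10

async def main():
    # return_exceptions=True returns exceptions as results instead of
    # raising the first one, so every outcome can be inspected.
    return await asyncio.gather(
        *(asyncio.to_thread(worker, n) for n in range(4)),
        return_exceptions=True,
    )

results = asyncio.run(main())
for n, outcome in enumerate(results):
    status = "failed" if isinstance(outcome, Exception) else "ok"
    print(n, status, outcome)
```

Results keep the order of the inputs, so the failed task can be matched back to its argument by index.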

For multiprocessing-specific errors (freeze_support, pickle errors), see Fix: Python multiprocessing not working. For asyncio runtime errors, see Fix: Python asyncio not running.