
Fix: Polars Not Working — AttributeError, InvalidOperationError, and ShapeError

FixDevs

Quick Answer

How to fix Polars errors — AttributeError groupby not found, InvalidOperationError from Python lambdas, ShapeError broadcasting mismatch, lazy vs eager collect confusion, type casting failures, and ColumnNotFoundError in with_columns.

The Error

You switch from Pandas to Polars and the familiar API doesn’t exist:

AttributeError: 'DataFrame' object has no attribute 'groupby'

Or you try to filter with a lambda and get a cryptic failure:

InvalidOperationError: expression not allowed in this context

Or a column operation crashes with a shape mismatch:

ShapeError: unable to add a column of length 3 to a DataFrame of height 5

Or you run a scan_csv pipeline and nothing happens — no error, no data, just a LazyFrame query plan printed to the console.

Polars is not a drop-in Pandas replacement. It has a different execution model, stricter type system, and an expression-based API that requires rethinking how you write transformations. These errors are all fixable once you understand the patterns.

Why This Happens

Polars separates eager and lazy execution explicitly. Operations on DataFrame run immediately; operations on LazyFrame build a query plan that only executes on .collect(). The expression system (pl.col("x") > 5) compiles to optimized Rust code — Python lambdas bypass this and are only allowed in specific slower-path methods.

The Pandas migration friction comes from subtle renames (.groupby() → .group_by()), removed conveniences (no .loc, .iloc, .values), and a stricter type system where nulls and NaN are distinct and shapes must always match.

Fix 1: Pandas API Errors — Method Names Changed

Polars deliberately renamed or removed several Pandas methods. These all surface as AttributeError.

groupby → group_by (with underscore):

import polars as pl

df = pl.DataFrame({"category": ["A", "A", "B"], "value": [10, 20, 15]})

# WRONG
result = df.groupby("category").agg(...)  # AttributeError

# CORRECT
result = df.group_by("category").agg(pl.col("value").sum())

No .loc or .iloc — use expressions instead:

# WRONG — Polars has no index-based selection
df.iloc[0:5]         # AttributeError
df.loc["label"]      # AttributeError

# CORRECT — slice by position
first_five = df.slice(0, 5)         # First 5 rows
first_five = df.head(5)             # Equivalent

# Filter by condition (replaces .loc[mask])
filtered = df.filter(pl.col("value") > 10)

# Select rows by index (integer position)
row = df[2]       # Single row as DataFrame
rows = df[1:4]    # Slice

.values → .to_numpy():

# WRONG
arr = df["value"].values   # AttributeError

# CORRECT
arr = df["value"].to_numpy()

# Or convert whole DataFrame
arr = df.to_numpy()

.iterrows() → .iter_rows(named=True):

# WRONG
for idx, row in df.iterrows():   # AttributeError
    print(row["value"])

# CORRECT
for row in df.iter_rows(named=True):
    print(row["value"])           # row is a dict

# Or iterate as tuples (faster)
for row in df.iter_rows():
    print(row[1])                 # Tuple access by position

.apply() → .map_elements() (renamed in Polars 0.19, removed in 1.0):

# WRONG (Polars 1.0+)
df.with_columns(pl.col("value").apply(lambda x: x * 2))   # AttributeError

# CORRECT
df.with_columns(
    doubled=pl.col("value").map_elements(lambda x: x * 2, return_dtype=pl.Int64)
)

Pro Tip: Before spending time on a workaround, check if Polars has a native expression for what you’re doing. df.apply(func) for squaring values becomes pl.col("x") ** 2 — zero Python overhead, and much faster.

Fix 2: Lazy vs Eager — Don’t Forget .collect()

pl.scan_csv(), pl.scan_parquet(), and other scan_* functions return a LazyFrame — a query plan, not data. Nothing executes until you call .collect().

import polars as pl

# scan_csv returns a LazyFrame — no data loaded yet
lf = pl.scan_csv("large_file.csv")
print(type(lf))  # <class 'polars.LazyFrame'>

# Filters and selections added to the query plan — still no execution
lf = lf.filter(pl.col("country") == "US").select(["name", "country", "revenue"])

# STILL nothing executed — lf just prints the query plan
print(lf)   # Prints "PLAN" not data

# Execute the plan — this is when disk I/O and filtering actually happen
df = lf.collect()
print(type(df))  # <class 'polars.DataFrame'>
print(df.shape)  # (n_rows, 3)

Use lazy evaluation by default for files. Polars optimizes the query plan before executing — it pushes filters down to the file reader (reading only matching rows) and projects only the columns you need:

# Reads the ENTIRE CSV then filters — inefficient
df = pl.read_csv("100gb_file.csv").filter(pl.col("year") == 2025)

# Pushes the filter to disk read — only scans matching rows
df = pl.scan_csv("100gb_file.csv").filter(pl.col("year") == 2025).collect()

Inspect the query plan before collecting to understand what Polars will do:

lf = pl.scan_csv("data.csv").filter(pl.col("x") > 5).select(["x", "y"])
print(lf.explain())            # Unoptimized plan
print(lf.explain(optimized=True))  # After predicate/projection pushdown

For very large files that don’t fit in memory, streaming processes the data in chunks:

df = (
    pl.scan_csv("huge_file.csv")
    .filter(pl.col("status") == "active")
    .group_by("region")
    .agg(pl.col("revenue").sum())
    .collect(streaming=True)   # Processes in batches, bounded memory
)

LazyFrame.schema was removed in Polars 1.0. To inspect columns and types without collecting:

# OLD (0.x, broken in 1.0)
schema = lf.schema   # AttributeError in 1.0

# CORRECT (1.0+)
schema = lf.collect_schema()
print(schema)   # Schema({'col1': Int64, 'col2': Utf8, ...})

Fix 3: InvalidOperationError — Use Polars Expressions, Not Python Lambdas

InvalidOperationError: expression not allowed in this context

Polars expressions (pl.col("x") > 5, pl.col("name").str.starts_with("A")) compile to optimized Rust. Python lambdas in .filter() or similar contexts break the expression system entirely.

import polars as pl

df = pl.DataFrame({"x": [1, 5, 10, 3, 8], "name": ["alice", "bob", "carol", "dave", "eve"]})

# WRONG — lambdas not allowed in filter
df.filter(lambda row: row["x"] > 5)    # InvalidOperationError

# CORRECT — use Polars expressions
df.filter(pl.col("x") > 5)

# String operations use the .str namespace
df.filter(pl.col("name").str.starts_with("a"))

# Combine conditions with & (and) and | (or)
df.filter((pl.col("x") > 3) & (pl.col("name").str.len_chars() > 3))

When you genuinely need a Python function, use map_elements() — but understand the performance cost:

import polars as pl

df = pl.DataFrame({"text": ["hello world", "foo bar", "baz"]})

# map_elements: Python called once per element (slow for large datasets)
df.with_columns(
    word_count=pl.col("text").map_elements(
        lambda s: len(s.split()),
        return_dtype=pl.Int32,
    )
)

# Always specify return_dtype — without it, Polars infers from the first element,
# which can produce unexpected types on later rows

map_batches() is faster — it passes an entire Series to your function at once rather than element by element. Use it when your function can operate on a whole Series:

import polars as pl

df = pl.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})

# map_batches: Python called once with the full Series
df.with_columns(
    normalized=pl.col("x").map_batches(
        lambda s: (s - s.mean()) / s.std(),
        return_dtype=pl.Float64,
    )
)

Performance hierarchy (fastest to slowest):

  1. Native Polars expressions — pl.col("x") * 2, pl.col("x").log()
  2. map_batches() — Python called once per Series
  3. map_elements() — Python called once per element

Before reaching for map_elements, check the Polars expressions API — string methods, date operations, list operations, and statistics are all built in.

Fix 4: Type Casting Errors — Strict vs Lenient

InvalidOperationError: cannot cast Utf8 to Int64 in strict mode

Polars defaults to strict=True in .cast() — if any value can’t be converted, the entire operation fails. This is the right behavior for clean data but breaks on real-world data with missing markers.

import polars as pl

df = pl.DataFrame({"amount": ["100", "250", "N/A", "400", "null"]})

# WRONG — fails because "N/A" and "null" can't become Int64
df.with_columns(pl.col("amount").cast(pl.Int64))   # InvalidOperationError

# CORRECT — non-convertible values become null
df.with_columns(pl.col("amount").cast(pl.Int64, strict=False))
# [100, 250, null, 400, null]

# Fill nulls after casting
df.with_columns(
    pl.col("amount").cast(pl.Int64, strict=False).fill_null(0)
)
# [100, 250, 0, 400, 0]

Specify types at read time — more efficient than reading and casting:

df = pl.read_csv(
    "transactions.csv",
    schema_overrides={
        "amount": pl.Float64,
        "quantity": pl.Int32,
        "user_id": pl.Utf8,   # Keep as string even if it looks numeric
    },
    null_values=["N/A", "null", "", "NA"],
)

Polars separates null and NaN — two distinct concepts that Pandas conflates. null is a missing value (all types). NaN is a floating-point representation of “not a number” (only Float32/Float64). They need different handling:

import polars as pl
import math

df = pl.DataFrame({"x": [1.0, float("nan"), None, 4.0]})

print(df.select(pl.col("x").is_null()))   # [false, false, true, false]
print(df.select(pl.col("x").is_nan()))    # [false, true, false, false]

# fill_null handles missing values (None/null)
# fill_nan handles NaN (floating point only)
df.with_columns(pl.col("x").fill_nan(0.0).fill_null(0.0))
# [1.0, 0.0, 0.0, 4.0]

If you’re reading data that has CSV "NaN" strings, map them to Polars nulls at read time:

df = pl.read_csv("data.csv", null_values=["NaN", "nan", "N/A", ""])

Fix 5: ColumnNotFoundError and with_columns Chaining

ColumnNotFoundError: column 'total' not found

The most common cause: you create a column in one with_columns() call and try to reference it in the same call. New columns aren’t visible within the same with_columns() invocation.

import polars as pl

df = pl.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [2, 3, 1]})

# WRONG — 'total' doesn't exist yet when 'discount' is computed
df.with_columns(
    total=pl.col("price") * pl.col("qty"),
    discount=pl.col("total") * 0.1,   # ColumnNotFoundError
)

# CORRECT — chain two with_columns calls
df.with_columns(
    total=pl.col("price") * pl.col("qty"),
).with_columns(
    discount=pl.col("total") * 0.1,
)

select() vs with_columns() — these are different operations that Pandas users often confuse:

# select() — returns only the listed columns (like SQL SELECT)
df.select("price", "qty")            # DataFrame with 2 columns
df.select(pl.col("price") * 1.1)     # Computed column, original dropped

# with_columns() — keeps all original columns and adds/replaces
df.with_columns(adjusted_price=pl.col("price") * 1.1)  # 3 columns: price, qty, adjusted_price

Rename columns to fix mismatches between datasets:

df.rename({"old_name": "new_name", "another_old": "another_new"})

Check column names before referencing them:

print(df.columns)   # List of column names
print(df.schema)    # Dict of {name: dtype}

Fix 6: ShapeError — Broadcasting Rules

ShapeError: unable to add a column of length 3 to a DataFrame of height 5

Polars is strict about shapes. A Series added to a DataFrame must either match the DataFrame’s height exactly or have length 1 (which broadcasts). Unlike NumPy or Pandas, there is no silent truncation or repetition.

import polars as pl

df = pl.DataFrame({"x": [1, 2, 3, 4, 5]})   # height = 5

# WRONG — Series has wrong length
s = pl.Series([10, 20, 30])   # length 3
df.with_columns(y=s)          # ShapeError

# CORRECT — Series matches height
s = pl.Series([10, 20, 30, 40, 50])
df.with_columns(y=s)          # Works

# CORRECT — Scalar broadcasts to all rows
df.with_columns(constant=pl.lit(42))   # 42 in every row

# CORRECT — Expressions operate row-by-row (automatic length match)
df.with_columns(doubled=pl.col("x") * 2)

For group-level aggregations that you want to join back to the original DataFrame, use .over() (window function) instead of .group_by().agg():

df = pl.DataFrame({
    "category": ["A", "A", "B", "B", "B"],
    "value": [10, 20, 15, 25, 5],
})

# group_by returns one row per group (height changes)
totals = df.group_by("category").agg(total=pl.col("value").sum())
# totals has 2 rows, df has 5 — can't add this back with with_columns

# CORRECT — over() keeps original height, broadcasts group result
df.with_columns(
    group_total=pl.col("value").sum().over("category")
)
# Every row gets the sum for its category group

Fix 7: group_by and Aggregation Syntax

Polars aggregation is explicit — you must list every column you want in the output. There is no as_index=False or automatic column retention.

import polars as pl

df = pl.DataFrame({
    "category": ["A", "A", "B", "B", "B"],
    "sub": ["x", "y", "x", "y", "x"],
    "value": [10, 20, 15, 25, 5],
})

# Basic aggregation
result = df.group_by("category").agg(
    pl.col("value").sum(),
    pl.col("value").mean().alias("avg_value"),
    pl.col("value").count().alias("n"),
)

# Multiple grouping columns
result = df.group_by("category", "sub").agg(
    total=pl.col("value").sum(),
)

# Group order is non-deterministic by default — use maintain_order for consistent output
result = df.group_by("category", maintain_order=True).agg(
    pl.col("value").sum()
)

# Multiple aggregations on the same column
result = df.group_by("category").agg([
    pl.col("value").sum().alias("total"),
    pl.col("value").mean().alias("avg"),
    pl.col("value").min().alias("min"),
    pl.col("value").max().alias("max"),
    pl.col("value").std().alias("std_dev"),
])

Window functions with .over() — like SQL’s PARTITION BY, they compute an aggregate per group but keep all original rows:

df.with_columns(
    # Sum per category, broadcast back to each row
    category_total=pl.col("value").sum().over("category"),
    # Rank within category
    category_rank=pl.col("value").rank(descending=True).over("category"),
    # Cumulative sum within category (in original row order)
    cumsum=pl.col("value").cum_sum().over("category"),
)

Fix 8: Reading Files and Schema Inference Problems

Polars infers column types from a sample of rows (infer_schema_length, which defaults to 100 for read_csv). If your data has type-breaking values past that sample, the CSV read fails mid-stream.

Increase inference scan depth or disable it entirely:

import polars as pl

# Scan more rows before inferring (slower but safer)
df = pl.read_csv("data.csv", infer_schema_length=10_000)

# Read everything as strings, then cast manually (safest)
df = pl.read_csv("data.csv", infer_schema_length=0)
# df.dtypes are all Utf8; cast what you need
df = df.with_columns(
    amount=pl.col("amount").cast(pl.Float64, strict=False),
    count=pl.col("count").cast(pl.Int32, strict=False),
)

Map missing value strings to null at read time:

df = pl.read_csv(
    "data.csv",
    null_values=["N/A", "NA", "null", "NULL", "-", ""],
)

You can also specify per-column null values as a dict when different columns use different conventions.

Use Parquet for large production pipelines — it stores schema alongside data and reads dramatically faster than CSV:

# Write once
df.write_parquet("data.parquet")

# Read (schema always correct, no inference needed)
df = pl.read_parquet("data.parquet")

# Lazy scan with predicate pushdown (reads only matching rows from disk)
df = (
    pl.scan_parquet("large_data.parquet")
    .filter(pl.col("year") == 2025)
    .select(["date", "revenue", "region"])
    .collect()
)

Common Mistake: Using read_csv for files that are gigabytes in size. Use scan_csv(...).collect() instead so Polars can optimize the read with projection and predicate pushdown. The difference can be 10x in both time and peak memory.

Still Not Working?

Polars 0.x Code Breaks on 1.0

The most disruptive 0.x → 1.0 changes:

Old (0.x)        New (1.0)                         Notes
.apply()         .map_elements()                   With return_dtype arg
.groupby()       .group_by()                       Underscore added
lf.schema        lf.collect_schema()               LazyFrame only
.replace()       .replace() + .replace_strict()    Behavior split
pl.map()         pl.map_batches()                  Global function

Run python -c "import polars; print(polars.__version__)" to confirm which version you’re on, then check the official upgrade guide.

Performance: Polars Is Slower Than Expected

If Polars feels slower than Pandas on small DataFrames — it often is. Polars’ Rust execution engine has startup overhead that only pays off on larger datasets (typically 100k+ rows). For small local tables, this is normal. The gains become significant at millions of rows.

If large operations are slow, check whether you’re accidentally using eager evaluation when lazy would benefit from predicate pushdown. And always profile before using map_elements() — it surrenders Polars’ performance advantage.

Using Polars with PyTorch or NumPy

Polars integrates cleanly with NumPy and PyTorch through Arrow zero-copy:

import polars as pl
import numpy as np
import torch

df = pl.DataFrame({"x": [1.0, 2.0, 3.0], "y": [4.0, 5.0, 6.0]})

# To NumPy (zero-copy if no nulls)
arr = df.to_numpy()

# To PyTorch tensor
tensor = torch.from_numpy(df.to_numpy())

For training loops and model pipelines that consume Polars DataFrames, see PyTorch not working for tensor device and dtype patterns.

Migrating Large Pandas Codebases

The official Polars migration guide maps common Pandas patterns to Polars equivalents. For Pandas-specific errors you encounter while migrating, see pandas SettingWithCopyWarning and pandas merge key error.

Installing Optional Extras

Some Polars features require additional dependencies:

# Excel support (read_excel / write_excel)
pip install "polars[fastexcel]"

# Cloud storage (S3, GCS, Azure Blob)
pip install "polars[cloud]"

# All extras
pip install "polars[all]"

For installation failures — particularly when building from source on unusual platforms — see Python packaging not working.


FixDevs

Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.
