Skip to content

Fix: ClickHouse Not Working — MergeTree ORDER BY, INSERT Batching, LowCardinality, and Replication

FixDevs · (Updated: )

Part of:  Database Errors

Quick Answer

How to fix ClickHouse errors — table engine choice, ORDER BY for primary key, INSERT too many small parts, LowCardinality types, Nullable performance, ZooKeeper for Replicated tables, and HTTP vs Native client.

The Error

You create a table without ORDER BY and the CREATE fails:

DB::Exception: Storage MergeTree requires either ORDER BY or PRIMARY KEY.

Or INSERT performance is terrible — 100 rows/second instead of millions:

# Inserting one row at a time:
for row in rows:
    client.execute("INSERT INTO events VALUES", [row])
# Each call creates a small part. ClickHouse merges them in background but slows down.

Or SELECT count(*) over a column is slow despite a billion-row table:

SELECT count() FROM events WHERE user_country = 'JP';
-- Scans 100 GB instead of using the index.

Or queries fail with replication errors:

DB::Exception: ZooKeeper session has expired. Please try reconnecting.

Why This Happens

ClickHouse is a columnar OLAP database optimized for analytics. Schema design choices that look stylistic in OLTP databases — column types, NULL handling, primary keys — directly determine whether queries scan kilobytes or terabytes. ClickHouse trades flexibility (no point updates, no enforced unique constraints, no transactions across multiple tables) for raw analytics throughput, and most “ClickHouse is slow” reports trace back to a table created with row-store assumptions.

The mental model shift: every column is stored separately, compressed independently, and read independently. A SELECT col_a FROM t WHERE col_b = x only touches col_a and col_b files; the other 50 columns of your table are irrelevant cost-wise. ORDER BY is not a uniqueness constraint — it’s a physical sort order plus sparse index. MergeTree doesn’t enforce keys; it just merges parts in the background. Once you internalize that, the rest of the quirks (LowCardinality, async_insert, ReplacingMergeTree FINAL) follow naturally.

  • MergeTree engine variants. MergeTree, ReplacingMergeTree, SummingMergeTree, AggregatingMergeTree, ReplicatedMergeTree. Each has different semantics and tradeoffs. Picking wrong gives correct-looking output but wrong behavior at scale.
  • INSERT model. ClickHouse expects batched inserts. Row-by-row inserts create one “part” per call, and parts merge in background. Too many parts → slow merges, errors, OOM.
  • Primary key isn’t enforced. ORDER BY is the “primary key” (used for sparse index), but doesn’t dedupe — MergeTree allows duplicates. Use ReplacingMergeTree for dedup-on-merge.
  • LowCardinality dictionary encoding. Columns with few distinct values benefit from LowCardinality(String) — 10-100x storage reduction and faster filters. Without it, every row stores the full string.

Fix 1: Always Specify ORDER BY

CREATE TABLE events (
    event_time DateTime,
    user_id UInt64,
    event_type LowCardinality(String),
    country LowCardinality(String),
    revenue Float64
) ENGINE = MergeTree()
ORDER BY (event_time, user_id);

ORDER BY does two things:

  1. Physical sort. Data is stored in this order on disk.
  2. Sparse index. ClickHouse skips parts that don’t match the ORDER BY prefix.

For queries like WHERE event_time > now() - INTERVAL 1 HOUR AND user_id = 42, both predicates use the ORDER BY index — fast.

Choose ORDER BY by:

  • Most-selective filter first. If event_time is always in your WHERE, put it first.
  • Bounded cardinality. Avoid high-cardinality columns first (e.g. UUIDs) — they don’t compress or skip well.
  • 3-5 columns max. Too many keys = wide index, slower writes.

For wider table tuning:

CREATE TABLE events (...) ENGINE = MergeTree()
ORDER BY (event_time, user_id)
PARTITION BY toYYYYMM(event_time)  -- Optional: separate parts per month
TTL event_time + INTERVAL 90 DAY DELETE;  -- Auto-delete old rows

PARTITION BY lets you drop entire partitions in O(1) (ALTER TABLE events DROP PARTITION 202401) — much faster than DELETE.

Pro Tip: Don’t over-partition. Each partition is a directory; thousands of small partitions hurts performance. Aim for partitions of 1-100M rows.

Fix 2: Batch INSERTs

ClickHouse expects batches. The recommended pattern:

# Bad — one INSERT per row:
for event in events:
    client.execute("INSERT INTO events VALUES", [event])

# Good — one INSERT per batch:
client.execute("INSERT INTO events VALUES", events)  # All at once

# Better for very large batches — async insert:
client.execute("INSERT INTO events SETTINGS async_insert=1, wait_for_async_insert=1 VALUES", events)

For streaming inserts where you can’t accumulate large batches client-side, use async_insert:

INSERT INTO events SETTINGS async_insert = 1, wait_for_async_insert = 0
VALUES (now(), 1, 'click', 'JP', 0.0);

ClickHouse buffers async inserts server-side and flushes them in batches (default 200ms or 10 MB). Per-row cost drops dramatically.

For very high write throughput, use Buffer engine or write to a Kafka table function — ClickHouse pulls from Kafka in batches automatically.

Common Mistake: Setting async_insert = 1 without wait_for_async_insert. Inserts return immediately, before they’re flushed. If the server crashes between buffer and flush, data is lost. Use wait_for_async_insert = 1 for durability.

Fix 3: Use LowCardinality for String Enums

-- BAD: every row stores the full string
CREATE TABLE events (event_type String, country String, ...);

-- GOOD: dictionary-encoded
CREATE TABLE events (event_type LowCardinality(String), country LowCardinality(String), ...);

LowCardinality(String) stores strings as int dictionary keys. For fields with ≤ ~10K distinct values, this is 5-100x more compact and faster to filter.

Rule of thumb: use LowCardinality when you have:

  • Country codes
  • Event type enums
  • HTTP status codes
  • Service names
  • Region/zone identifiers
  • Any “category” string

Don’t use for:

  • User IDs (high cardinality)
  • URLs (often unique)
  • Free-form text

Pro Tip: You can convert later: ALTER TABLE events MODIFY COLUMN event_type LowCardinality(String). But ClickHouse rewrites the column — expensive on big tables. Set it correctly at table create.

Fix 4: Avoid Nullable Unless You Need It

-- Slow: Nullable adds overhead (separate bitmap)
CREATE TABLE events (revenue Nullable(Float64), ...);

-- Fast: use a sentinel value if NULL semantics aren't critical
CREATE TABLE events (revenue Float64 DEFAULT 0, ...);

Nullable(T) columns have ~2x storage overhead and slower filters. For numeric columns where 0 or -1 can serve as “no value,” skip Nullable.

For strings, use empty string instead of NULL:

CREATE TABLE events (referer String DEFAULT '', ...);

If you genuinely need NULL semantics (distinct from “empty”), use Nullable but be deliberate.

Fix 5: Match Engine to Use Case

EngineUse case
MergeTreeGeneral-purpose. Allows duplicates.
ReplacingMergeTree(version_col)Deduplicate on background merge.
SummingMergeTree(sum_col)Sum-aggregate on background merge.
AggregatingMergeTreeCustom aggregates via aggregate functions.
CollapsingMergeTree(sign)Mark rows as deleted (-1 sign).
VersionedCollapsingMergeTreeVersioned collapsing.

For event sourcing where you want the “latest version of each entity”:

CREATE TABLE users (
    user_id UInt64,
    name String,
    updated_at DateTime
) ENGINE = ReplacingMergeTree(updated_at)
ORDER BY user_id;

ClickHouse keeps the row with the highest updated_at for each user_id after background merges. Note: merges are async — you may see duplicates briefly. Use FINAL modifier (SELECT * FROM users FINAL) to force dedup at query time (slower).

For pre-aggregated rollups:

CREATE MATERIALIZED VIEW events_daily
ENGINE = SummingMergeTree()
ORDER BY (date, country, event_type)
AS SELECT
    toDate(event_time) AS date,
    country,
    event_type,
    count() AS event_count,
    sum(revenue) AS total_revenue
FROM events
GROUP BY date, country, event_type;

Each new INSERT to events triggers a small INSERT into events_daily. Background merges sum the matching rows. Queries against events_daily are massively faster than aggregating raw events.

Common Mistake: Querying a ReplacingMergeTree table without FINAL and expecting dedup. Without FINAL (or aggregate argMax), you see all versions.

Fix 6: Replication via ZooKeeper / Keeper

For HA, use ReplicatedMergeTree:

CREATE TABLE events ON CLUSTER my_cluster (...) ENGINE = ReplicatedMergeTree(
    '/clickhouse/tables/{shard}/events',  -- ZK path
    '{replica}'                            -- Replica name
)
ORDER BY (event_time, user_id);

{shard} and {replica} are macros defined per-server in config.

Each shard has multiple replicas. INSERTs to one replica propagate to others. SELECTs can hit any replica.

For multi-shard tables (sharded by hash), wrap in Distributed:

CREATE TABLE events_dist ON CLUSTER my_cluster AS events
ENGINE = Distributed(my_cluster, default, events, rand());

Inserts to events_dist route to shards by rand() (or your hash). Queries fan out across all shards.

Note: ClickHouse Keeper is a built-in ZooKeeper replacement. Newer deployments should use Keeper — fewer dependencies, better performance, ClickHouse-native config.

For monitoring replication lag:

SELECT * FROM system.replication_queue WHERE table = 'events';

A growing queue means a replica is falling behind — investigate before it becomes critical.

Fix 7: HTTP vs Native Protocol

ClickHouse exposes two protocols:

  • Native (port 9000): Binary, used by official clickhouse-client and most language libraries. Fastest.
  • HTTP (port 8123): Plain HTTP with JSON or TSV responses. Easier to debug; works from any HTTP client.

For Python with the native driver:

from clickhouse_driver import Client

client = Client(host="localhost", port=9000, user="default", password="")
result = client.execute("SELECT count() FROM events")

For HTTP-based:

import requests
r = requests.post(
    "http://localhost:8123",
    params={"query": "SELECT count() FROM events FORMAT JSON"},
)
print(r.json())

For Node:

import { createClient } from "@clickhouse/client";

const client = createClient({ url: "http://localhost:8123" });
const result = await client.query({
  query: "SELECT count() FROM events",
  format: "JSON",
});
console.log(await result.json());

The official JS client uses HTTP under the hood. There’s no widely-used native Node driver.

Pro Tip: For high-throughput INSERT, prefer native (Python). For occasional queries from a web app, HTTP is fine.

Fix 8: Common Query Patterns

Time-bucketed counts:

SELECT
    toStartOfHour(event_time) AS hour,
    count() AS events
FROM events
WHERE event_time >= now() - INTERVAL 1 DAY
GROUP BY hour
ORDER BY hour;

Top-N per group:

SELECT country, event_type, count() AS c
FROM events
WHERE event_time >= today()
GROUP BY country, event_type
ORDER BY c DESC
LIMIT 10 BY country;

LIMIT N BY column is a ClickHouse extension — returns top N per group, more efficient than window functions.

Percentile:

SELECT
    quantile(0.99)(latency_ms) AS p99,
    quantile(0.95)(latency_ms) AS p95,
    quantile(0.50)(latency_ms) AS p50
FROM requests
WHERE event_time >= now() - INTERVAL 1 HOUR;

ClickHouse’s quantile is approximate by default — fast on huge tables. For exact, use quantileExact.

JOIN large tables:

-- Use the smaller table on the right:
SELECT users.name, events.event_type
FROM events
INNER JOIN users ON users.id = events.user_id
WHERE events.event_time >= today();

ClickHouse builds a hash table from the right side. Put the smaller table there.

For very large joins where neither side fits in memory:

SET join_algorithm = 'partial_merge';  -- Disk-based merge join

ClickHouse vs Other Columnar OLAP Engines

ClickHouse is one of several columnar OLAP options. Each handles the same fundamental problem (analytics over billions of rows) with different tradeoffs around deployment model, query patterns, and concurrency.

ClickHouse. Self-hosted or ClickHouse Cloud. Stupid-fast on filtering + aggregation, single-table dominance, ingestion measured in millions of rows per second per node. Joins are the weak spot — possible but require careful tuning. Sweet spot: real-time analytics on event streams, observability backends, ad-tech rollups.

DuckDB. Embedded analytics engine — runs in-process, no server. Reads Parquet, CSV, JSON, and Arrow files directly. Best for single-node interactive analytics on datasets up to ~1 TB. Excellent for notebook workflows. Sweet spot: data science exploration, ETL pipelines, lakehouse query layer.

BigQuery. Fully managed, serverless, scales to petabytes transparently. Pay per byte scanned (which can surprise). SQL is standards-compliant; integration with the GCP ecosystem is excellent. Sweet spot: GCP-hosted data warehouses, ad-hoc analytics where ops simplicity matters more than per-query cost.

Snowflake. Multi-cloud managed warehouse with separated storage and compute. Warehouses can be paused, resumed, and resized independently. Strong SQL semantics, robust governance. Most expensive option but ops burden is minimal. Sweet spot: enterprise BI, multi-tenant analytics SaaS.

Trino (PrestoSQL). Federated query engine — runs over S3, Hive, JDBC, Kafka, anything. Doesn’t store data itself. Pair with Iceberg or Hudi for a modern lakehouse. Sweet spot: query layer over an existing data lake, multi-source joins.

Apache Druid. Real-time + historical OLAP optimized for sub-second queries with high concurrency. More operational complexity than ClickHouse, but better at “thousands of concurrent dashboard users.” Sweet spot: customer-facing analytics, high-concurrency dashboards.

The rough rule: self-hosted real-time event analytics → ClickHouse. Lakehouse query engine → Trino. Single-node Parquet crunching → DuckDB. Managed cloud warehouse → BigQuery or Snowflake. High-concurrency customer-facing dashboards → Druid. ClickHouse fills the sweet spot of “as fast as Druid, simpler than Druid, far cheaper than Snowflake/BigQuery, self-hosted.”

Still Not Working?

A few less-obvious failures:

  • Too many parts error. Background merges can’t keep up. Throttle inserts, increase batch size, or check disk I/O. system.parts shows part counts per table.
  • OOM during a large query. Set max_memory_usage per query. For aggregations, enable max_bytes_before_external_group_by to spill to disk.
  • DB::Exception: Code: 159. Timeout exceeded. Increase max_execution_time or break the query into smaller chunks.
  • Replication queue grows unbounded. Network split or one replica is down. Restart the lagging replica or remove it temporarily.
  • Code: 252. DB::Exception: Too many partitions. Each partition is a directory. Don’t partition by hour if you have years of data — partition by month or week instead.
  • SELECT returns duplicate rows after INSERT. ReplacingMergeTree hasn’t merged yet. Use FINAL or argMax if you need dedupe immediately.
  • Cannot bind to address. Port conflict. ClickHouse defaults to 8123 (HTTP) and 9000 (Native). Other tools (Portainer, MinIO) use 9000 too.
  • High cardinality LowCardinality field warning. If a column has >10K distinct values, LowCardinality adds overhead. ClickHouse can warn or error depending on settings — check low_cardinality_max_dictionary_size.
  • Materialized view stops updating. MVs are triggers, not maintained projections — they run on each INSERT to the source. If the source is Distributed, MV doesn’t fire on remote inserts; attach MVs to the underlying local tables on each shard instead.
  • Dictionary lookups slow after restart. Dictionaries lazy-load by default. Force eager load with dictGet warm-up at startup or SYSTEM RELOAD DICTIONARIES.
  • TTL DELETE not removing rows. TTL fires on merges, not on a schedule. Force with OPTIMIZE TABLE events FINAL or wait for the next background merge.

For related analytics and OLAP database issues, see DuckDB not working, BigQuery not working, Snowflake not working, and Postgres slow query.

F

FixDevs

Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.

Was this article helpful?

Related Articles