
Fix: MySQL Replication Not Working — Replica Lag, Stopped Replication, or GTID Errors

FixDevs

Quick Answer

How to fix MySQL replication issues — SHOW REPLICA STATUS errors, relay log corruption, GTID configuration, replication lag, skipping errors, and replica promotion.

The Problem

MySQL replication has stopped with an error:

SHOW REPLICA STATUS\G
-- Replica_IO_Running: Yes
-- Replica_SQL_Running: No
-- Last_SQL_Error: Error 'Duplicate entry '42' for key 'PRIMARY'' on query.
--   Default database: 'myapp'. Query: 'INSERT INTO users ...'

Or the replica is lagging behind the primary:

SHOW REPLICA STATUS\G
-- Seconds_Behind_Source: 3842  -- Over an hour behind

Or GTID replication fails after restoring a backup:

Got fatal error 1236 from source when reading data from binary log:
'Cannot replicate because the source purged required binary logs.
Replicate the missing transactions from elsewhere, or provision a new replica
from backup. Consider increasing the source's binary log expiration period.'

Why This Happens

MySQL replication works by replaying binary log events from the primary on the replica. It breaks when:

  • Replication error on the replica — the most common cause: a statement that succeeded on the primary fails on the replica (duplicate key, row not found, constraint violation). The SQL thread stops until the error is resolved.
  • Replica lag — the replica can’t apply binary log events as fast as the primary generates them. Typically caused by slow queries, resource constraints, or a single-threaded replica applying multi-threaded primary writes.
  • GTID gaps — when using GTID replication, the replica tracks which transactions it has applied. If GTIDs are missing (from a restore that didn’t preserve GTID state), replication can’t figure out where to resume.
  • Binary log expired — if the primary purges binary logs before the replica reads them (due to replica downtime or binlog_expire_logs_seconds too low), the replica loses its position.
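The last failure mode is the easiest to prevent: make sure the primary retains binlogs longer than any replica outage you plan to survive. A minimal sketch, with the retention value hardcoded as an example (on a live primary it comes from `binlog_expire_logs_seconds`, the MySQL 8.0 replacement for `expire_logs_days`):

```shell
#!/bin/sh
# Sanity-check binlog retention against the longest replica outage you
# plan for. expire_secs is hardcoded as an example; on a live primary,
# fetch it with: mysql -N -e "SELECT @@binlog_expire_logs_seconds"
expire_secs=2592000                 # MySQL 8.0 default: 30 days
max_outage_secs=$(( 3 * 86400 ))    # assumed worst-case replica downtime: 3 days

echo "binlog retention: $(( expire_secs / 86400 )) days"
if [ "$expire_secs" -le "$max_outage_secs" ]; then
  echo "WARNING: retention is shorter than the planned outage window"
fi
```

If the check fires, raise `binlog_expire_logs_seconds` on the primary before the next replica maintenance window.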

Fix 1: Diagnose and Resolve SQL Thread Errors

Start with SHOW REPLICA STATUS\G to identify the error:

-- Run on the replica
SHOW REPLICA STATUS\G

-- Key fields to check:
-- Replica_IO_Running: Yes/No  — is the IO thread receiving binlogs?
-- Replica_SQL_Running: Yes/No — is the SQL thread applying them?
-- Last_IO_Error: error message from IO thread
-- Last_SQL_Error: error message from SQL thread
-- Seconds_Behind_Source: replication lag in seconds
-- Exec_Source_Log_File / Exec_Source_Log_Pos: current position being applied

Fix duplicate key errors (most common):

-- Option 1: Skip the offending transaction (use with caution)
-- For GTID replication:
SET GTID_NEXT = 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx:N';
BEGIN; COMMIT;  -- Empty transaction to "skip" the GTID
SET GTID_NEXT = 'AUTOMATIC';
START REPLICA;

-- For non-GTID replication (legacy):
SET GLOBAL SQL_REPLICA_SKIP_COUNTER = 1;  -- Skip 1 event
START REPLICA;

-- Option 2: Fix the data on the replica, then restart
-- If the row is missing on replica but exists on primary:
INSERT INTO users (id, email) VALUES (42, '[email protected]');
START REPLICA;

-- Option 3: Use pt-table-sync (Percona Toolkit) for data discrepancies
-- (shell command, not SQL; requires a prior pt-table-checksum run that
--  populated the checksums table):
pt-table-sync --replicate myapp.checksums h=replica,u=root,p=pass \
  --print --execute

Warning: Skipping transactions can cause data drift between primary and replica. After fixing, run pt-table-checksum to verify data consistency.
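Before choosing between skipping and repairing, pin down exactly which row is in conflict. A small parsing sketch — the status text here is a canned sample matching the error above; live output would come from `mysql -e "SHOW REPLICA STATUS\G"`:

```shell
#!/bin/sh
# Extract the conflicting key value from Last_SQL_Error so you can inspect
# that row on both servers before deciding to skip or repair.
# $status is a canned sample; live output would come from:
#   status=$(mysql -e "SHOW REPLICA STATUS\G")
status="Last_SQL_Error: Error 'Duplicate entry '42' for key 'PRIMARY'' on query."

dup=$(echo "$status" | sed -n "s/.*Duplicate entry '\([^']*\)'.*/\1/p")
if [ -n "$dup" ]; then
  echo "Conflicting key value: $dup"     # → Conflicting key value: 42
  echo "Compare it on both servers: SELECT * FROM users WHERE id = $dup;"
fi
```

If the row on the replica already matches the primary, skipping is safe; if it differs, repair the data instead.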

Fix 2: Reduce Replication Lag

Enable parallel replication (variable names below are from MySQL 8.0.26+; on 5.7 through 8.0.25, the equivalents are slave_parallel_workers and slave_parallel_type):

-- On the replica — enable multi-threaded replication
-- Check current setting:
SHOW GLOBAL VARIABLES LIKE 'replica_parallel_workers';

-- Enable parallel workers (start with 4-8, matching primary CPU cores).
-- The applier must be stopped to change replica_parallel_type:
STOP REPLICA SQL_THREAD;
SET GLOBAL replica_parallel_workers = 8;
SET GLOBAL replica_parallel_type = 'LOGICAL_CLOCK';
-- LOGICAL_CLOCK: applies transactions that overlapped on primary in parallel
-- DATABASE: applies transactions for different databases in parallel (simpler, less effective)
START REPLICA SQL_THREAD;

# my.cnf — persist the settings
[mysqld]
replica_parallel_workers = 8
replica_parallel_type = LOGICAL_CLOCK
replica_preserve_commit_order = ON  # Ensures replica commits in same order as primary

Tune binary log group commit on the primary (groups more transactions per binlog sync, at a small added commit latency):

# my.cnf on primary
[mysqld]
binlog_group_commit_sync_delay = 1000    # Microseconds — wait up to 1ms to group more transactions
binlog_group_commit_sync_no_delay_count = 100  # Or until 100 transactions ready

Check what’s causing lag:

-- See what the SQL thread is applying
SHOW PROCESSLIST;
-- Look for "System lock" or slow queries in the replica SQL thread

-- Check replica SQL thread wait states
SELECT * FROM performance_schema.events_waits_current
WHERE THREAD_ID = (
  SELECT THREAD_ID FROM performance_schema.threads
  WHERE NAME = 'thread/sql/replica_sql'
);
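To tell whether a lagging replica is actually gaining ground, take two Seconds_Behind_Source samples a fixed interval apart and extrapolate. A back-of-the-envelope sketch with hardcoded sample readings (live values would come from two SHOW REPLICA STATUS calls):

```shell
#!/bin/sh
# Estimate whether the replica is catching up from two lag samples taken
# $interval seconds apart. lag1/lag2 are sample readings; live values come
# from Seconds_Behind_Source in two SHOW REPLICA STATUS calls.
lag1=3842; lag2=3722; interval=60

delta=$(( lag1 - lag2 ))        # seconds of lag recovered over the interval
if [ "$delta" -le 0 ]; then
  echo "Not catching up: lag grew by $(( -delta ))s in ${interval}s"
else
  eta=$(( lag2 * interval / delta ))
  echo "Recovering ${delta}s of lag per ${interval}s; ~${eta}s to catch up"
fi
```

If the estimate is hours out, or lag is growing, rebuilding the replica (Fix 4) is usually faster than waiting.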

Fix 3: Fix GTID Replication Issues

Check GTID state:

-- On primary
SHOW GLOBAL VARIABLES LIKE 'gtid_mode';
SHOW MASTER STATUS;  -- SHOW BINARY LOG STATUS from MySQL 8.2
-- Shows Executed_Gtid_Set: source_uuid:1-12345

-- On replica
SHOW REPLICA STATUS\G
-- Executed_Gtid_Set: what replica has applied
-- Retrieved_Gtid_Set: what replica has received but not yet applied
-- Auto_Position: 1 (using GTID auto-positioning)

Fix GTID gaps after a restore:

-- If replica is missing transactions that primary has already purged:

-- Option 1: Provision a new replica from a fresh backup
-- This is the safest approach

-- Option 2: Inject empty transactions to fill the gaps
-- Find the missing GTIDs (GTID_SUBTRACT() computes the set difference):
-- SELECT GTID_SUBTRACT('source_uuid:1-1000', @@GLOBAL.gtid_executed);
-- Primary has: source_uuid:1-1000
-- Replica has: source_uuid:1-800:850-1000
-- Missing:     source_uuid:801-849

-- Inject empty transactions on the replica:
STOP REPLICA;
SET GTID_NEXT = 'source_uuid:801';
BEGIN; COMMIT;
SET GTID_NEXT = 'source_uuid:802';
BEGIN; COMMIT;
-- ... repeat for each missing GTID
-- Or use a loop in a shell script for large ranges:
-- for i in $(seq 801 849); do
--   mysql -e "SET GTID_NEXT='source_uuid:$i'; BEGIN; COMMIT;"
-- done
SET GTID_NEXT = 'AUTOMATIC';
START REPLICA;
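For a wide gap, running one mysql invocation per GTID (as in the commented loop above) is slow. A sketch that instead batches the empty transactions into a single SQL file, using the example UUID and 801-849 range from above:

```shell
#!/bin/sh
# Emit the empty transactions for a missing GTID range as one SQL file,
# so they can be applied over a single connection (mysql < fill_gtid_gap.sql)
# instead of one mysql invocation per GTID.
# UUID and the range are the example values from above.
UUID="source_uuid"
START=801; END=849

{
  i=$START
  while [ "$i" -le "$END" ]; do
    echo "SET GTID_NEXT='$UUID:$i'; BEGIN; COMMIT;"
    i=$(( i + 1 ))
  done
  echo "SET GTID_NEXT='AUTOMATIC';"
} > fill_gtid_gap.sql

head -1 fill_gtid_gap.sql   # → SET GTID_NEXT='source_uuid:801'; BEGIN; COMMIT;
```

Review the generated file, run STOP REPLICA, apply it, then START REPLICA.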

Set up GTID replication correctly:

# my.cnf — both primary and replica
[mysqld]
gtid_mode = ON
enforce_gtid_consistency = ON
log_bin = mysql-bin
log_replica_updates = ON  # Replica logs applied events to its own binlog (default ON in 8.0; needed for chained replication and promotion)

-- On the replica — connect using GTID auto-positioning
CHANGE REPLICATION SOURCE TO
  SOURCE_HOST = 'primary-host',
  SOURCE_USER = 'replication_user',
  SOURCE_PASSWORD = 'replication_password',
  SOURCE_AUTO_POSITION = 1;  -- GTID auto-positioning
START REPLICA;

Fix 4: Rebuild a Replica from Scratch

When replication is too far out of sync, rebuilding is faster than catching up:

# Method 1: Using mysqldump with GTID state
# On primary:
mysqldump \
  --single-transaction \
  --source-data=2 \
  --set-gtid-purged=ON \
  --all-databases \
  -u root -p > full_backup.sql
# (--source-data replaced --master-data in mysqldump 8.0.26; use the old name on older clients)

# Transfer to replica and restore:
mysql -u root -p < full_backup.sql

# On replica — check GTID_PURGED was set:
mysql -e "SHOW GLOBAL VARIABLES LIKE 'gtid_purged';"

# Connect to primary and start replication:
mysql -e "
  CHANGE REPLICATION SOURCE TO
    SOURCE_HOST='primary-host',
    SOURCE_USER='repl',
    SOURCE_PASSWORD='password',
    SOURCE_AUTO_POSITION=1;
  START REPLICA;
"

# Method 2: Using Percona XtraBackup (faster for large databases)
# On primary:
xtrabackup --backup --target-dir=/backup/full \
  --user=root --password=pass

xtrabackup --prepare --target-dir=/backup/full

# Transfer the prepared backup to the replica (keep it outside the datadir):
rsync -avz /backup/full/ replica:/backup/full/

# On replica — the datadir must be empty before restoring:
systemctl stop mysql
rm -rf /var/lib/mysql/*
xtrabackup --move-back --target-dir=/backup/full
chown -R mysql:mysql /var/lib/mysql
systemctl start mysql

# Get the GTID position from backup:
cat /backup/full/xtrabackup_binlog_info
# mysql-bin.000042  1234567  source_uuid:1-54321

mysql -e "
  RESET REPLICA ALL;
  RESET MASTER;  -- gtid_executed must be empty before setting gtid_purged
  SET GLOBAL gtid_purged='source_uuid:1-54321';
  CHANGE REPLICATION SOURCE TO
    SOURCE_HOST='primary', SOURCE_USER='repl',
    SOURCE_PASSWORD='password', SOURCE_AUTO_POSITION=1;
  START REPLICA;
"
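Rather than copy-pasting the GTID set by hand, it can be pulled out of xtrabackup_binlog_info mechanically. A sketch with a canned sample line (this assumes the file's third whitespace-separated field is the executed GTID set, as in the example above):

```shell
#!/bin/sh
# Build the replica-provisioning SQL from xtrabackup_binlog_info.
# $line is a canned sample; on the replica use:
#   line=$(cat /backup/full/xtrabackup_binlog_info)
line="mysql-bin.000042  1234567  source_uuid:1-54321"

gtid_set=$(echo "$line" | awk '{print $3}')   # third field: executed GTID set
cat <<SQL
RESET REPLICA ALL;
RESET MASTER;  -- gtid_executed must be empty before setting gtid_purged
SET GLOBAL gtid_purged='$gtid_set';
CHANGE REPLICATION SOURCE TO SOURCE_HOST='primary', SOURCE_AUTO_POSITION=1;
START REPLICA;
SQL
```

Verify the emitted statements look right before piping them into mysql.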

Fix 5: Monitor Replication Health

Set up proactive monitoring to catch issues before they become critical:

-- Check replication health (run regularly)
SELECT
  SERVICE_STATE AS io_state,
  LAST_ERROR_NUMBER,
  LAST_ERROR_MESSAGE,
  LAST_ERROR_TIMESTAMP
FROM performance_schema.replication_connection_status;

SELECT
  SERVICE_STATE AS sql_state,
  LAST_ERROR_NUMBER,
  LAST_ERROR_MESSAGE,
  LAST_ERROR_TIMESTAMP,
  LAST_APPLIED_TRANSACTION
FROM performance_schema.replication_applier_status_by_worker;

-- Lag monitoring
SELECT
  CHANNEL_NAME,
  SERVICE_STATE,
  LAST_QUEUED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP,
  LAST_APPLIED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP,
  TIMESTAMPDIFF(
    SECOND,
    LAST_APPLIED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP,
    NOW()
  ) AS lag_seconds
FROM performance_schema.replication_applier_status_by_worker;

Shell script for monitoring:

#!/bin/bash
# replication_check.sh — alert if replication is broken or lagging

MYSQL="mysql -u monitor -ppassword -h replica-host"
MAX_LAG=30  # Alert if lag > 30 seconds

STATUS=$($MYSQL -e "SHOW REPLICA STATUS\G" 2>/dev/null)

IO_RUNNING=$(echo "$STATUS" | grep "Replica_IO_Running:" | awk '{print $2}')
SQL_RUNNING=$(echo "$STATUS" | grep "Replica_SQL_Running:" | awk '{print $2}')
LAG=$(echo "$STATUS" | grep "Seconds_Behind_Source:" | awk '{print $2}')
ERROR=$(echo "$STATUS" | grep "Last_SQL_Error:" | cut -d: -f2-)

if [ "$IO_RUNNING" != "Yes" ] || [ "$SQL_RUNNING" != "Yes" ]; then
  echo "CRITICAL: Replication stopped! Error:$ERROR"
  # Send alert here
elif [ "$LAG" != "NULL" ] && [ "$LAG" -gt "$MAX_LAG" ]; then
  echo "WARNING: Replication lag ${LAG}s exceeds threshold of ${MAX_LAG}s"
fi
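To make this proactive rather than one-off, run the script on a schedule. A minimal crontab sketch — the script path and the use of `mail` are assumptions (the -E flag, from bsd-mailx, skips sending when the body is empty); wire it into whatever alerting you already have:

```
# Hypothetical crontab entry: check every minute, mail any non-empty output
* * * * * /usr/local/bin/replication_check.sh 2>&1 | mail -E -s "MySQL replication alert" [email protected]
```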

Fix 6: Replication User and Permissions

A misconfigured replication user is a common setup mistake:

-- On primary — create replication user
CREATE USER 'repl'@'replica-ip' IDENTIFIED BY 'strong-password';
-- (mysql_native_password is deprecated in 8.0 and removed in 8.4; with the
--  default caching_sha2_password, set SOURCE_SSL=1 or GET_SOURCE_PUBLIC_KEY=1
--  on the replica)
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'replica-ip';
-- (FLUSH PRIVILEGES is not needed after CREATE USER / GRANT)

-- Verify the grant
SHOW GRANTS FOR 'repl'@'replica-ip';

-- On replica — connect
CHANGE REPLICATION SOURCE TO
  SOURCE_HOST = '10.0.0.1',
  SOURCE_PORT = 3306,
  SOURCE_USER = 'repl',
  SOURCE_PASSWORD = 'strong-password',
  SOURCE_AUTO_POSITION = 1;

-- For SSL/TLS replication (recommended for production)
CHANGE REPLICATION SOURCE TO
  SOURCE_HOST = 'primary-host',
  SOURCE_USER = 'repl',
  SOURCE_PASSWORD = 'password',
  SOURCE_AUTO_POSITION = 1,
  SOURCE_SSL = 1,
  SOURCE_SSL_CA = '/etc/mysql/ssl/ca.pem',
  SOURCE_SSL_CERT = '/etc/mysql/ssl/client-cert.pem',
  SOURCE_SSL_KEY = '/etc/mysql/ssl/client-key.pem';

Still Not Working?

Binary log format mismatch — if the primary uses binlog_format=STATEMENT and the replica has issues executing non-deterministic queries (NOW(), RAND()), switch the primary to ROW format. ROW format replicates the actual row changes, not the SQL statements, and is more reliable:

-- Check current format
SHOW GLOBAL VARIABLES LIKE 'binlog_format';

-- Change to ROW on the primary; applies to new sessions only, so existing
-- connections keep the old format until they reconnect. Persist it in my.cnf.
SET GLOBAL binlog_format = 'ROW';

read_only not set on replica — if the replica doesn’t have read_only = ON, application writes directly to the replica cause data drift and replication conflicts. Always set read_only = ON and super_read_only = ON on replicas:

[mysqld]
read_only = ON
super_read_only = ON  # Prevents even SUPER privilege users from writing

Semi-synchronous replication timeouts — if rpl_semi_sync_source_timeout is reached and the primary falls back to asynchronous replication, the replica may have missed committed transactions. Check Rpl_semi_sync_source_no_tx counter.

For related MySQL issues, see Fix: MySQL Deadlock Detected and Fix: MySQL Lock Wait Timeout Exceeded.


FixDevs

Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.
