Tuning a Ceph cluster for high-throughput workloads

A default Ceph install gets you a working cluster, not a fast one. This is the full tuning playbook I use in production: kernel network parameters, TCP stack hardening, CPU governor and idle state management, and route initialization. These are the kinds of changes that make a measurable difference under real workloads.

Why defaults fall short

Out of the box, Linux network buffers are sized conservatively. For a Ceph cluster juggling replication traffic, scrubbing, recovery, and client I/O simultaneously, that creates unnecessary backpressure. The goal is to let the network breathe and to keep the CPUs ready to handle it.

Network buffer tuning

The highest-impact change is expanding socket buffer sizes. Ceph’s messenger layer is latency-sensitive; larger buffers let the kernel batch more data before applying backpressure, which matters most during recovery and rebalancing when connection counts spike.

# /etc/sysctl.d/90-ceph-network.conf

# Socket receive/send buffers
net.core.rmem_default = 212992
net.core.wmem_default = 212992
net.core.rmem_max     = 134217728
net.core.wmem_max     = 134217728

# TCP auto-tuning: min / default / max
net.ipv4.tcp_rmem = 4096 87380 134217728
net.ipv4.tcp_wmem = 4096 65536 134217728

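# Let the kernel auto-tune receive buffers per connection
net.ipv4.tcp_moderate_rcvbuf = 1
# Do not cache per-destination TCP metrics between connections
net.ipv4.tcp_no_metrics_save = 1
# Per-socket limit for ancillary (option) memory
net.core.optmem_max          = 268435456
# Share of the receive buffer counted as overhead: 1/2^2 = 25%
net.ipv4.tcp_adv_win_scale   = 2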

By default, the kernel caches TCP metrics per destination and reuses them on reconnect, including a congestion window and slow-start threshold learned on a previously bad connection. Setting tcp_no_metrics_save = 1 makes every new connection start from fresh estimates, which matters when OSDs restart or fail over.
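To sanity-check the 134217728-byte (128 MiB) buffer ceiling against your own links, it helps to compute the bandwidth-delay product, which is the amount of data that can be in flight on one connection. The sketch below is a minimal helper; the function name is made up for illustration, and the link speed and round-trip time you feed it are assumptions you should replace with measured values.

```shell
# bdp_bytes <link_gbit_per_s> <rtt_microseconds>
# Bandwidth-delay product: bytes in flight = bandwidth x round-trip time.
# 1 Gbit/s is 125 bytes per microsecond, so the result is gbit * 125 * rtt_us.
bdp_bytes() {
    echo $(( $1 * 125 * $2 ))
}
```

For example, bdp_bytes 25 200 models a 25 Gbit/s link with a 200 microsecond round trip. The maximum in tcp_rmem and tcp_wmem only needs to be at least as large as the biggest product you expect; auto-tuning picks the actual size per connection.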

Connection handling and port exhaustion

In a large cluster the number of simultaneous TCP connections grows fast: OSDs, monitors, managers, and clients all connect to each other. A wide ephemeral port range and deep connection queues prevent silent drops under load.

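# Port range and backlog
# Caution: a range starting at 1024 overlaps Ceph's listening ports
# (monitors on 3300/6789, OSDs from 6800 up). Consider listing those in
# net.ipv4.ip_local_reserved_ports so outgoing connections cannot take them.
net.ipv4.ip_local_port_range  = 1024 65535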
net.core.somaxconn            = 65535
net.ipv4.tcp_max_syn_backlog  = 8192
net.core.netdev_max_backlog   = 166660
net.core.netdev_budget        = 20000
net.core.netdev_budget_usecs  = 2000

# TIME_WAIT handling (tcp_tw_reuse covers outgoing connections and relies on TCP timestamps)
net.ipv4.tcp_max_tw_buckets   = 1440000
net.ipv4.tcp_tw_reuse         = 1

# Keepalive: detect dead peers faster
net.ipv4.tcp_keepalive_time   = 300
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_keepalive_intvl  = 15

# Not a keepalive setting: how long orphaned sockets stay in FIN_WAIT_2
net.ipv4.tcp_fin_timeout      = 15

# UDP buffer floors. Ceph's messenger traffic is TCP, so these only affect
# ancillary UDP services running on the node.
net.ipv4.udp_rmem_min = 16384
net.ipv4.udp_wmem_min = 16384
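
To see whether connection counts and TIME_WAIT build-up are actually a concern on a given node, a quick tally of TCP socket states is often enough. The following is a small sketch that summarises the output of ss -tan read from standard input; the function name is hypothetical, and it assumes the usual ss layout with a header line and the state in the first column.

```shell
# count_tcp_states: read `ss -tan` output on stdin and print "STATE COUNT"
# per TCP state, sorted by state name. The first line (header) is skipped.
count_tcp_states() {
    awk 'NR > 1 { n[$1]++ } END { for (s in n) print s, n[s] }' | LC_ALL=C sort
}
```

Run it as ss -tan | count_tcp_states during a recovery event and compare the TIME-WAIT figure with tcp_max_tw_buckets.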

TCP performance features

A few TCP knobs are worth being explicit about. SACK prevents retransmitting the entire window when a single segment is lost. Fast Open eliminates a round-trip on repeat connections for applications that request it. MTU probing lets TCP recover when path MTU discovery is black-holed on mixed-MTU networks, which are common in storage clusters where jumbo frames are enabled on only some paths.

# Selective acknowledgements
net.ipv4.tcp_sack         = 1
# Required for PAWS and RTT measurement; tcp_tw_reuse also depends on it
net.ipv4.tcp_timestamps   = 1
# Enable Fast Open for both client (1) and server (2)
net.ipv4.tcp_fastopen     = 3
# Probe for a smaller MSS when an ICMP black hole is detected
net.ipv4.tcp_mtu_probing  = 1
# Accept ECN when a peer asks for it, never request it (kernel default); needs ECN-marking switches
net.ipv4.tcp_ecn          = 2
# Keep path MTU discovery enabled
net.ipv4.ip_no_pmtu_disc  = 0

# Critical for bursty Ceph I/O patterns
net.ipv4.tcp_slow_start_after_idle = 0

# SYN hardening
net.ipv4.tcp_syncookies     = 1
net.ipv4.tcp_synack_retries = 5
net.ipv4.tcp_rfc1337        = 1

Ceph I/O is inherently bursty. When connections go quiet and then resume, the kernel normally shrinks the congestion window back toward its initial value and ramps up again. Disabling tcp_slow_start_after_idle keeps the window at the size it had already grown to, avoiding a slow ramp-up exactly when you need full throughput most.
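
One way to observe the effect is to look at the congestion windows of live connections before and after a quiet period. The sketch below summarises the cwnd fields that ss -tin prints; the function name is made up for this example, and it assumes the cwnd:<n> token format used by current iproute2 releases.

```shell
# cwnd_range: read `ss -tin` output on stdin and report the smallest and
# largest congestion window (in segments) plus the number of connections
# that exposed a cwnd value. Prints nothing if no cwnd field is found.
cwnd_range() {
    grep -o 'cwnd:[0-9][0-9]*' | cut -d: -f2 |
    awk 'NR == 1 { min = max = $1 }
         { if ($1 < min) min = $1; if ($1 > max) max = $1 }
         END { if (NR) print "min=" min, "max=" max, "conns=" NR }'
}
```

Typical use is ss -tin | cwnd_range on an OSD host, once while the cluster is busy and once just after an idle spell.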

A note on congestion control: BBR is worth considering for general workloads, but it doesn’t play well with Ceph. It tends to overestimate available bandwidth on storage networks and can cause OSD timeout instability under sustained load. Stick with CUBIC. Verify your active algorithm with sysctl net.ipv4.tcp_congestion_control.

Neighbour table and conntrack

On larger clusters the ARP neighbour table fills up silently, dropping packets and causing spiky latency that’s hard to diagnose. Connection tracking also needs headroom for the volume of short-lived OSD connections during recovery events.

# ARP table — size up for large cluster node counts
net.ipv4.neigh.default.gc_thresh1 = 2048
net.ipv4.neigh.default.gc_thresh2 = 4096
net.ipv4.neigh.default.gc_thresh3 = 8192

# Netfilter connection tracking (these keys exist only while the nf_conntrack module is loaded)
net.netfilter.nf_conntrack_max                    = 1048576
net.netfilter.nf_conntrack_tcp_timeout_established = 600
net.netfilter.nf_conntrack_generic_timeout         = 30
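
Because both tables fail quietly, it is worth checking how close a node runs to these limits. The helpers below are a rough sketch: the function names are hypothetical, the neighbour count only covers IPv4 entries, and the conntrack files are present only when nf_conntrack is loaded.

```shell
# usage_pct <used> <limit>: integer percentage of a limit in use.
usage_pct() {
    echo $(( $1 * 100 / $2 ))
}

# Report IPv4 neighbour entries against the hard limit (gc_thresh3).
neigh_usage() {
    used=$(ip -4 neigh show | wc -l)
    limit=$(sysctl -n net.ipv4.neigh.default.gc_thresh3)
    echo "neighbour table: $used/$limit ($(usage_pct "$used" "$limit")%)"
}

# Report tracked connections against nf_conntrack_max.
conntrack_usage() {
    used=$(cat /proc/sys/net/netfilter/nf_conntrack_count)
    limit=$(cat /proc/sys/net/netfilter/nf_conntrack_max)
    echo "conntrack: $used/$limit ($(usage_pct "$used" "$limit")%)"
}
```

Calling neigh_usage and conntrack_usage during recovery gives a sense of how much headroom the values above leave.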

Route initialization

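Setting initrwnd on the default route controls the initial receive window advertised on new connections that use that route. The value is counted in MSS-sized segments. A higher value reduces the number of round-trips needed to reach full throughput on paths with a high bandwidth-delay product, such as links between storage nodes in different racks. Traffic to nodes on the same subnet follows the connected route rather than the default route, so it is not affected unless you set the option on that route as well.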

# Apply to the default route — substitute your actual gateway
ip route change default via <gateway> dev <interface> proto static onlink initrwnd 100

# Persist this via /etc/networkd-dispatcher/routable.d/ or an
# ExecStartPost in your network unit — it does not survive a reboot
# unless explicitly re-applied.
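
Since initrwnd is multiplied by the connection's MSS, the window in bytes depends on the MTU. The sketch below does that arithmetic; the function name is made up, and it assumes IPv4 with 40 bytes of IP and TCP headers and no other per-packet overhead.

```shell
# initrwnd_bytes <segments> <mtu>
# Initial receive window in bytes, assuming MSS = MTU - 40 (IPv4 + TCP headers).
initrwnd_bytes() {
    echo $(( $1 * ($2 - 40) ))
}
```

With the value used above, initrwnd_bytes 100 1500 gives the window for a standard MTU and initrwnd_bytes 100 9000 the window with jumbo frames. ip route show default confirms whether the option is currently set.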

CPU governor and idle state management

Network tuning alone won’t eliminate latency spikes if CPUs are sleeping between requests. Storage nodes serving low-latency reads pay a real penalty waking from deeper C-states. Two settings work together here: pinning the governor to performance mode, and disabling deep idle states entirely.

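# Install cpupower (Ubuntu needs the kernel-matched tools package; on Debian the package is linux-cpupower)
apt-get install linux-tools-common linux-tools-$(uname -r)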

# Pin governor to performance on all cores
cpupower frequency-set -g performance

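# Disable deep C-states
# -D takes an exit latency in microseconds, not a C-state index, and disables
# every idle state whose latency is equal to or higher than that value.
# -D 1 usually leaves only POLL; to keep C1, pass a value just above the C1
# latency reported by `cpupower idle-info`.
cpupower idle-set -D 1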

# Verify
cpupower frequency-info
cpupower idle-info

Pinning to performance and disabling deep C-states increases idle power draw, sometimes significantly. On dedicated storage nodes this is typically acceptable. NVMe-backed OSDs see the largest latency improvement; HDD-backed clusters get a more modest gain and may not justify the power cost.
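
To confirm the governor really took effect on every core, you can read the cpufreq entries in sysfs directly. This is a minimal sketch; the function name is hypothetical, and the optional argument exists only so the check can be pointed at a different directory.

```shell
# non_performance_cpus [sysfs_cpu_root]
# Print each CPU whose scaling governor is not "performance".
# Prints nothing when every CPU with a cpufreq entry is pinned.
non_performance_cpus() {
    root=${1:-/sys/devices/system/cpu}
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -r "$f" ] || continue
        gov=$(cat "$f")
        [ "$gov" = "performance" ] || echo "${f%/cpufreq/scaling_governor}: $gov"
    done
}
```

An empty result from non_performance_cpus means all cores are on the performance governor; any line it prints names a core that is not.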

The cpupower settings don’t survive a reboot on their own. Drop the two commands into a systemd oneshot service that is enabled at boot to keep them applied across restarts.
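
One way to persist the two cpupower commands is a oneshot service. The sketch below only prints a unit file so you can review it before installing; the unit name ceph-cpu-tuning.service and the /usr/bin/cpupower path are assumptions, not something the setup above prescribes.

```shell
# Print a oneshot unit that re-applies the CPU settings at boot.
# The cpupower path below is an assumption; adjust it for your distribution.
render_cpu_tuning_unit() {
    cat <<'EOF'
[Unit]
Description=Pin CPU governor and restrict idle states for Ceph nodes

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/cpupower frequency-set -g performance
ExecStart=/usr/bin/cpupower idle-set -D 1

[Install]
WantedBy=multi-user.target
EOF
}
```

Redirect the output to /etc/systemd/system/ceph-cpu-tuning.service, then run systemctl daemon-reload and systemctl enable --now ceph-cpu-tuning.service.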

Applying everything

# Write sysctl config
vim /etc/sysctl.d/90-ceph-network.conf

# Apply immediately without rebooting
sysctl --system

# Confirm a key value
sysctl net.ipv4.tcp_slow_start_after_idle
# → net.ipv4.tcp_slow_start_after_idle = 0

# Apply CPU settings (re-run after each reboot, or wire into a systemd unit)
cpupower frequency-set -g performance
cpupower idle-set -D 1
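
Confirming one key by hand does not scale to the whole file, so a small drift check can compare every configured value with what the kernel reports. This is a sketch under a few assumptions: the function names are invented for this example, trailing # comments are stripped from values, and keys that do not exist on the host (for example conntrack keys without the module loaded) are reported as unset.

```shell
# parse_sysctl_conf: read a sysctl.d file on stdin and print "key value"
# lines, skipping blank lines and comments and collapsing whitespace.
parse_sysctl_conf() {
    sed -e 's/^[[:space:]]*//' -e '/^[#;]/d' -e '/^$/d' |
    awk -F= 'NF >= 2 {
        key = $1
        val = substr($0, index($0, "=") + 1)
        sub(/[ \t]*#.*$/, "", val)           # drop a trailing comment
        gsub(/^[ \t]+|[ \t]+$/, "", key)     # trim the key
        gsub(/^[ \t]+|[ \t]+$/, "", val)     # trim the value
        gsub(/[ \t]+/, " ", val)             # single spaces between fields
        print key, val
    }'
}

# check_sysctl_drift <file>: print one line per key whose live value
# differs from the file. Prints nothing when everything matches.
check_sysctl_drift() {
    parse_sysctl_conf < "$1" | while read -r key want; do
        have=$(sysctl -n "$key" 2>/dev/null | tr -s '[:space:]' ' ' | sed 's/ $//')
        [ "$have" = "$want" ] || echo "DRIFT $key: want '$want', have '${have:-<unset>}'"
    done
}
```

Running check_sysctl_drift /etc/sysctl.d/90-ceph-network.conf after sysctl --system should produce no output if every setting was accepted.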

These settings are battle-tested on production Ceph Reef and Quincy clusters. Your mileage will vary based on hardware, workload mix, and topology. Always benchmark before and after with fio or ceph tell osd.* bench to confirm you’re actually gaining.

March 22, 2026

Brett Petch
