Network Performance Tuning: TCP, MTU, and Latency Optimisation

You’ve fixed the bug and the network “works”, but it’s slow. Tuning network performance is mostly about understanding three things: how TCP fills a pipe, where buffering hurts you, and where round-trips are the bottleneck.

The bandwidth-delay product

The fundamental equation:

Max throughput = TCP window size / Round-trip time

If you have a 100ms RTT link and a default 64 KB window, you cannot get more than ~5 Mbps no matter how fat the underlying pipe is. Modern Linux negotiates much larger windows automatically (window scaling), but on long-fat-network paths this is still the most common single cause of disappointing throughput.
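
Window scaling only helps if the kernel is allowed to grow its socket buffers to match the bandwidth-delay product. A minimal sketch for checking and raising the ceilings; the 64 MB values here are illustrative, size them to your own BDP:

# BDP for 1 Gbps x 100 ms is about 12.5 MB; buffers must be at least this large
sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem      # min / default / max, in bytes

echo "net.core.rmem_max=67108864" | sudo tee -a /etc/sysctl.conf
echo "net.core.wmem_max=67108864" | sudo tee -a /etc/sysctl.conf
echo "net.ipv4.tcp_rmem=4096 131072 67108864" | sudo tee -a /etc/sysctl.conf
echo "net.ipv4.tcp_wmem=4096 16384 67108864" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p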

TCP congestion control

Linux defaults to CUBIC, which works well in stable data centers but underperforms on lossy or bufferbloated paths. BBR, developed at Google, models the link’s actual bandwidth and RTT instead of treating loss as the only signal — it’s often dramatically better for long-haul transfers and video streaming.

# Check current
sysctl net.ipv4.tcp_congestion_control

# Switch to BBR (kernel 4.9+)
echo "net.core.default_qdisc=fq" | sudo tee -a /etc/sysctl.conf
echo "net.ipv4.tcp_congestion_control=bbr" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
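
To confirm the change took, check what the kernel advertises and what live connections are actually using (ss -ti prints the congestion algorithm per socket):

sysctl net.ipv4.tcp_available_congestion_control   # "bbr" must be listed
sysctl net.ipv4.tcp_congestion_control             # should now say bbr
ss -ti                                             # look for "bbr" on each connection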

MTU and fragmentation

The default Ethernet MTU is 1500 bytes. If anything in the path has a smaller MTU and the ICMP “fragmentation needed” messages that Path MTU Discovery depends on are blocked, you get the dreaded “works for small requests, hangs for large ones” symptom. To find the real path MTU:

ping -M do -s 1472 target            # 1472 + 28 IP/ICMP = 1500
# Increase/decrease the size until you find the largest that doesn't fragment
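
Binary-searching the size by hand works, but tracepath automates the probing and shows where along the route the MTU drops:

tracepath -n target            # reports the discovered pmtu; -n skips DNS lookups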

Inside data centers and cloud VPCs, jumbo frames (MTU 9000) can dramatically reduce CPU per byte for large transfers — but every device on the path must agree.
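
A minimal sketch for turning on jumbo frames on one host and proving the path end to end; eth0 and peer are placeholders, and the ip change does not survive a reboot unless you persist it in your network configuration:

sudo ip link set dev eth0 mtu 9000   # takes effect immediately, not persistent
ip link show eth0                    # verify the new MTU
ping -M do -s 8972 peer              # 8972 + 28 = 9000; proves every hop agrees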

Latency: the round-trip you can’t escape

Bandwidth keeps growing year over year. Latency hasn’t improved meaningfully in decades because propagation delay is bounded by the speed of light in fiber. Tactics that actually help:

  • Move closer. A CDN PoP at the edge beats any clever protocol.
  • Reuse connections. HTTP keep-alive, HTTP/2 multiplexing, connection pools: every new TCP+TLS handshake costs 2–3 RTTs before your request even leaves (one for TCP, plus one for TLS 1.3 or two for TLS 1.2). The curl sketch after this list shows how to measure it.
  • Batch requests. One large request with 10 things beats 10 small requests, even at the same byte count.
  • QUIC and HTTP/3. 0-RTT session resumption and no cross-stream head-of-line blocking, especially helpful on lossy mobile networks.
  • Pre-connect and prefetch. Open the TCP+TLS connection while the user is still reading, so the request is instant when they click.
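
To put numbers on the handshake cost, curl’s --write-out timers split a request into TCP-connect, TLS, and first-byte phases; giving curl two URLs in one invocation lets the second transfer reuse the warm connection (example.com is a placeholder):

# Two transfers in one invocation; -w prints the timers after each one.
# On the second line, time_connect and time_appconnect drop to ~0: reuse.
curl -s -o /dev/null -o /dev/null \
     -w 'tcp %{time_connect}s  tls %{time_appconnect}s  total %{time_total}s\n' \
     https://example.com/ https://example.com/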

Bufferbloat

Excessive buffering anywhere in the path turns a 30 ms link into a 500 ms one under load. Modern qdiscs (fq_codel, cake) actively manage buffers to keep latency low even when fully utilised. Most modern Linux distros use fq_codel by default — check with tc qdisc show.
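
A quick check, and a sketch of switching an interface over if it is still on the older default (eth0 is a placeholder; the tc change is not persistent across reboots):

tc qdisc show dev eth0                          # look for fq_codel or cake
sudo tc qdisc replace dev eth0 root fq_codel    # actively managed queue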

Measure, don’t guess

Always benchmark before and after a tuning change. iperf3 for raw throughput, tcpdump + tshark for round-trip and retransmit analysis, and end-to-end timings (real curl requests, real browser RUM data) for what actually matters to users.
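
A typical before/after harness, assuming you control both endpoints (server.example is a placeholder; -t sets duration, -P parallel streams, -R reverses direction):

# On the server
iperf3 -s

# On the client: 30-second run, 4 parallel streams, then the reverse path
iperf3 -c server.example -t 30 -P 4
iperf3 -c server.example -t 30 -P 4 -R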

What to learn next

Performance work depends on understanding the protocols themselves: revisit TCP basics, HTTP/HTTPS, and CDN basics. For diagnosis, see debugging by OSI layer.
