
Low-Latency C++

This is where C++ separates itself from every other language. In quantitative finance, high-frequency trading, and real-time systems, microseconds matter. This section covers the techniques used to write the fastest possible C++ code.


Table of Contents

1 — Why Low-Latency Matters

2 — CPU Cache Hierarchy

3 — Memory Allocation

4 — Branch Prediction

5 — Lock-Free Data Structures

6 — Avoiding System Calls

7 — CPU Pinning & NUMA

8 — Compiler Optimizations

9 — SIMD

10 — Profiling

11 — Low-Latency Checklist

Practice Questions


Glossary — Key Terms at a Glance

Term Meaning
Cache line 64-byte block — the unit of data transfer between CPU and memory
L1/L2/L3 CPU cache levels — L1 fastest (~1ns), L3 shared (~10ns)
SoA Structure of Arrays — cache-friendly data layout
Arena allocator Pre-allocate large block, hand out chunks — O(1) alloc
Branch misprediction CPU guessed wrong branch direction — ~15-20 cycle penalty
Lock-free Concurrent access without mutexes — uses atomics/CAS
SPSC queue Single Producer Single Consumer — fastest IPC primitive
CAS Compare-And-Swap — atomic read-modify-write operation
NUMA Non-Uniform Memory Access — memory latency depends on socket
SIMD Single Instruction Multiple Data — process 4-8 values per instruction
rdtsc Read Time Stamp Counter — cycle-accurate benchmarking

1 — Why Low-Latency Matters

1.1 HFT Context

In HFT (High-Frequency Trading):

  • 1 microsecond advantage = millions in annual profit
  • Market data arrives every ~10–100ΞΌs
  • Order-to-fill must be <10ΞΌs at top firms
  • Every allocation, branch, and cache miss counts
Network tick β†’ Parse β†’ Strategy β†’ Risk Check β†’ Order β†’ Network
        ↑                                              ↑
    ~5ΞΌs                                           ~5ΞΌs
        └── Your code must execute in ~1-5ΞΌs β”€β”€β”€β”€β”€β”€β”˜

2 — CPU Cache Hierarchy

2.1 Cache Lines & Data-Oriented Design

The single most important concept for performance:

┌───────────┐  ~1ns    32-64 KB
│  L1 Cache │  ←── per core
├───────────┤  ~4ns    256 KB - 1MB
│  L2 Cache │  ←── per core
├───────────┤  ~10ns   8-64 MB
│  L3 Cache │  ←── shared across cores
├───────────┤  ~100ns
│  Main RAM │  ←── 16-128 GB
├───────────┤  ~10ms
│  Disk/SSD │
└───────────┘

L1 cache hit is 100x faster than main memory!

Cache Line Facts:

  • CPU reads memory in cache lines (typically 64 bytes)
  • Accessing one byte loads the entire 64-byte line
  • Accessing adjacent memory = cache hits = fast
  • Random access = cache misses = slow

Data-Oriented Design — SoA vs AoS:

// BAD: Array of Structures (AoS) — poor cache locality
struct Order {
    double price;        // 8 bytes
    int quantity;        // 4 bytes
    char side;           // 1 byte
    // 3 bytes padding
    std::string symbol;  // 32 bytes — heap allocated!
};
std::vector<Order> orders;  // each order spread across memory

// GOOD: Structure of Arrays (SoA) — cache-friendly
struct Orders {
    std::vector<double> prices;
    std::vector<int> quantities;
    std::vector<char> sides;
    std::vector<uint32_t> symbolIds;  // use IDs instead of strings
};
// When iterating prices, they're contiguous in memory → cache hits

🔍 Why SoA? When you iterate over all prices, the entire prices array sits in contiguous memory. Each cache line fetch brings 8 doubles = 8 useful values. With AoS, each fetch brings 1 price mixed with quantity/side/symbol — wasting 80%+ of cache line bandwidth.

Measuring Cache Performance:

# Linux perf
perf stat -e cache-references,cache-misses,L1-dcache-load-misses ./program

# Valgrind cachegrind
valgrind --tool=cachegrind ./program

3 — Memory Allocation

⚠️ new and malloc are too slow for the hot path. They can take locks, fall back to system calls, and fragment memory.

3.1 Arena Allocator

Pre-allocate a large block, hand out chunks — O(1) allocation:

class Arena {
    char* memory;
    size_t offset;
    size_t capacity;
    
public:
    Arena(size_t size) : memory(new char[size]), offset(0), capacity(size) {}
    ~Arena() { delete[] memory; }
    
    void* allocate(size_t bytes, size_t alignment = 8) {
        // round offset up; alignment must be a power of two
        size_t aligned = (offset + alignment - 1) & ~(alignment - 1);
        if (aligned + bytes > capacity) return nullptr;
        void* ptr = memory + aligned;
        offset = aligned + bytes;
        return ptr;
    }
    
    void reset() { offset = 0; }  // O(1) "free all"
    
    Arena(const Arena&) = delete;
    Arena& operator=(const Arena&) = delete;
};

// Usage:
Arena arena(1024 * 1024);  // 1MB pool
auto* order = new(arena.allocate(sizeof(Order))) Order{};
// No syscall, no locking, O(1) allocation

πŸ” Why arenas? A single new call takes ~100ns (system call + malloc internals). Arena allocation takes ~2ns (pointer bump). On a hot path executing in 1-5ΞΌs, every allocation counts.


3.2 Object Pool

For fixed-size objects — free-list based O(1) alloc/dealloc:

template<typename T, size_t N>
class ObjectPool {
    union Slot {
        T object;
        Slot* next;
        Slot() {}
        ~Slot() {}
    };
    
    Slot slots[N];
    Slot* free_list;
    
public:
    ObjectPool() {
        free_list = &slots[0];
        for (size_t i = 0; i < N - 1; ++i)
            slots[i].next = &slots[i + 1];
        slots[N - 1].next = nullptr;
    }
    
    template<typename... Args>
    T* allocate(Args&&... args) {
        if (!free_list) return nullptr;
        Slot* slot = free_list;
        free_list = slot->next;
        return new(&slot->object) T(std::forward<Args>(args)...);
    }
    
    void deallocate(T* obj) {
        obj->~T();
        auto* slot = reinterpret_cast<Slot*>(obj);
        slot->next = free_list;
        free_list = slot;
    }
};

3.3 Stack Allocation

// SLOW: heap allocation
auto* arr = new int[100];
delete[] arr;

// FAST: stack allocation
int arr[100];

// FAST: std::array (stack)
std::array<int, 100> arr;

// For dynamic sizes, reserve upfront
std::vector<int> v;
v.reserve(1000);  // single allocation, no reallocations

Rule: If the size is known at compile time, use stack. If dynamic, reserve capacity upfront to avoid reallocations.


4 β€” Branch Prediction

4.1 Branchless Code & Hints

Modern CPUs predict which way a branch will go. A misprediction costs ~15-20 cycles (~5-7ns at 3GHz).

Branch-Free Code:

// SLOW: branch-heavy
int abs_val(int x) {
    if (x < 0) return -x;
    return x;
}

// FAST: branchless (relies on arithmetic right shift of negatives, which
// all mainstream compilers provide; guaranteed only since C++20)
int abs_val(int x) {
    int mask = x >> 31;           // 0 if positive, -1 if negative
    return (x + mask) ^ mask;     // branchless absolute value
}

// SLOW: conditional increment
if (condition) count++;

// FAST: branchless
count += (condition);  // condition is 0 or 1

Likely/Unlikely Hints:

// GCC/Clang
if (__builtin_expect(error_occurred, 0)) {  // unlikely
    handle_error();
}

// C++20:
if (value > threshold) [[likely]] {
    fast_path();
} else [[unlikely]] {
    slow_path();
}

Sort Before Branching: If data is sorted, the branch predictor can learn the pattern:

std::sort(data.begin(), data.end());
for (int x : data) {
    if (x > threshold) {
        sum += x;  // branch becomes predictable after the boundary
    }
}

5 β€” Lock-Free Data Structures

⚠️ Mutexes are slow on the hot path (a contended lock can cost ~1μs in context switches). For inter-thread communication in latency-critical code, use lock-free structures.

5.1 SPSC Queue

Lock-Free Queue (Single Producer Single Consumer):

template<typename T, size_t Capacity>
class SPSCQueue {
    static_assert((Capacity & (Capacity - 1)) == 0, "Capacity must be power of 2");
    
    std::array<T, Capacity> buffer;
    alignas(64) std::atomic<size_t> head{0};  // written by consumer
    alignas(64) std::atomic<size_t> tail{0};  // written by producer
    // alignas(64) prevents false sharing
    
public:
    bool push(const T& item) {
        size_t t = tail.load(std::memory_order_relaxed);
        size_t next = (t + 1) & (Capacity - 1);
        
        if (next == head.load(std::memory_order_acquire))
            return false;  // full
        
        buffer[t] = item;
        tail.store(next, std::memory_order_release);
        return true;
    }
    
    bool pop(T& item) {
        size_t h = head.load(std::memory_order_relaxed);
        
        if (h == tail.load(std::memory_order_acquire))
            return false;  // empty
        
        item = buffer[h];
        head.store((h + 1) & (Capacity - 1), std::memory_order_release);
        return true;
    }
};

πŸ” Why alignas(64)? head and tail are written by different threads. Without alignment, they’d share a cache line (false sharing), causing constant cache invalidation β€” 10-100x slower.


5.2 Lock-Free Stack (CAS)

template<typename T>
class LockFreeStack {
    struct Node {
        T data;
        Node* next;
    };
    
    std::atomic<Node*> head{nullptr};
    
public:
    void push(T value) {
        // note: a real hot path would take nodes from a pool rather than call new here
        Node* new_node = new Node{std::move(value), nullptr};
        new_node->next = head.load(std::memory_order_relaxed);
        
        while (!head.compare_exchange_weak(
            new_node->next, new_node,
            std::memory_order_release,
            std::memory_order_relaxed)) {
            // CAS failed — another thread modified head
            // new_node->next is automatically updated to current head
            // retry
        }
    }
};

πŸ” Why compare_exchange_weak? It can spuriously fail (for performance reasons on some architectures), but in a loop that’s fine. compare_exchange_strong guarantees no spurious failures but may be slower.


6 — Avoiding System Calls

6.1 Batch I/O & Busy-Waiting

⚠️ System calls involve a kernel transition: ~1-10μs each. On a hot path, that’s unacceptable.

Batch I/O:

// SLOW: frequent I/O — one syscall per message
for (auto& msg : messages) {
    write(fd, msg.data(), msg.size());  // syscall each iteration
}

// FAST: batch I/O — single syscall
std::string batch;
batch.reserve(total_size);
for (auto& msg : messages) {
    batch += msg;
}
write(fd, batch.data(), batch.size());  // single syscall

// FAST: writev for scatter-gather I/O
struct iovec iov[N];
// ... fill iov with pointers and lengths ...
writev(fd, iov, N);  // single syscall, multiple buffers

Busy-Waiting vs Sleeping:

// HOT PATH: busy-wait (spins, uses CPU, but zero latency)
while (!data_ready.load(std::memory_order_acquire)) {
    // spin — CPU core dedicated to this
}

// COLD PATH: sleep (yields CPU, higher latency)
while (!data_ready.load(std::memory_order_acquire)) {
    std::this_thread::yield();
}

πŸ” In HFT, dedicated cores busy-wait on the hot path. The CPU cost is worth the nanoseconds saved.


7 — CPU Pinning & NUMA

7.1 Affinity & Core Isolation

CPU Affinity — pin threads to specific cores:

#include <pthread.h>

void pin_thread_to_core(int core_id) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core_id, &cpuset);
    pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);
}

pin_thread_to_core(2);  // pin trading thread to core 2

NUMA Awareness — allocate on local memory node:

#include <numa.h>
void* ptr = numa_alloc_onnode(size, numa_node_of_cpu(sched_getcpu()));

Core Isolation (Linux kernel):

# In /etc/default/grub:
GRUB_CMDLINE_LINUX="isolcpus=2,3 nohz_full=2,3 rcu_nocbs=2,3"
# Cores 2 and 3 are dedicated to your application — no kernel tasks

Why isolate cores? Without isolation, the Linux scheduler can preempt your trading thread for kernel housekeeping tasks, adding microseconds of jitter. Isolated cores run only your code.


8 — Compiler Optimizations

8.1 Flags & Attributes

Optimization Flags:

g++ -O3 -march=native -flto -DNDEBUG program.cpp
# -O3: aggressive optimization
# -march=native: use all CPU features (AVX, SSE, etc.)
# -flto: link-time optimization (cross-file inlining)
# -DNDEBUG: disable assert()

Hot/Cold Attributes:

[[gnu::hot]] void process_order(Order& order) {
    // Compiler optimizes this more aggressively
}

[[gnu::cold]] void handle_error(int code) {
    // Compiler moves this to cold section (less cache pressure on hot code)
}

Prevent Compiler from Optimizing Away (for benchmarks):

void do_not_optimize(auto& value) {
    asm volatile("" : : "r,m"(value) : "memory");
}
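
Typical use, with the barrier helper repeated so the snippet stands alone (bench_sum is an invented example, written as a template rather than with an auto parameter so it also compiles pre-C++20):

```cpp
#include <cstdint>

// GCC/Clang: an empty asm that claims to read value and clobber memory,
// so the optimizer must actually compute it.
template <typename T>
inline void do_not_optimize(T& value) {
    asm volatile("" : : "r,m"(value) : "memory");
}

// Without the barrier, -O3 can see that an ignored sum feeds nothing
// observable and delete the loop, leaving an empty "benchmark".
std::uint64_t bench_sum(const int* data, int n) {
    std::uint64_t sum = 0;
    for (int i = 0; i < n; ++i) sum += data[i];
    do_not_optimize(sum);  // keeps sum live even if the caller discards it
    return sum;
}
```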

9 — SIMD

9.1 Single Instruction Multiple Data

Definition: SIMD processes multiple data points in one CPU instruction β€” typically 4-8x throughput for arithmetic.

#include <immintrin.h>

// Add 8 floats at once (AVX)
void add_arrays(float* a, float* b, float* result, int n) {
    for (int i = 0; i < n; i += 8) {  // assumes n is a multiple of 8; handle the tail separately
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        __m256 vc = _mm256_add_ps(va, vb);
        _mm256_storeu_ps(result + i, vc);
    }
}
// 8x throughput for floating-point operations!

πŸ” Modern compilers auto-vectorize with -O3 -march=native, but manual SIMD gives more control for critical inner loops.


10 — Profiling

10.1 Tools & Measurement

⚠️ Never optimize without profiling first. The bottleneck is rarely where you think it is.

Tool — What it Measures — Platform
perf — CPU cycles, cache misses, branch misses — Linux
VTune — Deep CPU analysis, threading — Cross-platform
gprof — Function-level profiling — Linux
Valgrind/Cachegrind — Cache simulation — Linux
Google Benchmark — Microbenchmarks — Cross-platform
Tracy — Real-time frame profiler — Cross-platform

Quick Benchmarking:

#include <chrono>

auto start = std::chrono::steady_clock::now();  // steady_clock is monotonic; high_resolution_clock may not be
// ... code to benchmark ...
auto end = std::chrono::steady_clock::now();

auto duration = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start);
std::cout << duration.count() << " ns\n";

rdtsc — CPU Cycle Counter (most precise):

inline uint64_t rdtsc() {
    unsigned int lo, hi;
    // note: rdtsc is not serializing; pair with a fence or use rdtscp for strict ordering
    __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

uint64_t start = rdtsc();
// ... hot path code ...
uint64_t cycles = rdtsc() - start;
// At 3GHz: 1 cycle ≈ 0.33ns
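
Whichever clock you use, a single sample is mostly noise (interrupts, frequency scaling, cache state). The usual practice is to repeat the measurement and report min/median/p99 rather than a mean; a small sketch with invented names:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct LatencyStats {
    std::uint64_t min, median, p99;
};

// Sort the samples once, then read percentiles by index.
// min = best case, median = typical, p99 = tail latency (jitter).
// Assumes at least one sample was collected.
LatencyStats summarize(std::vector<std::uint64_t> samples) {
    std::sort(samples.begin(), samples.end());
    return { samples.front(),
             samples[samples.size() / 2],
             samples[(samples.size() * 99) / 100] };
}
```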

11 — Low-Latency Checklist

11.1 The 11 Rules

# Rule — Why
1 No heap allocation on hot path — malloc/new can lock and syscall (~100ns)
2 No locks on hot path — mutex = ~1μs context switch
3 No virtual functions on hot path — vtable = cache miss + indirection
4 No exceptions on hot path — exception = expensive stack unwinding
5 No I/O on hot path — I/O = syscall + blocking
6 Use contiguous memory (arrays, not linked lists) — cache locality
7 Pad atomics to cache line (64B) — prevent false sharing
8 Pin threads to cores — prevent context switches
9 Use lock-free queues (SPSC/MPSC) — zero-overhead communication
10 Pre-compute everything possible — move work off the hot path
11 Profile, don’t guess — bottleneck is rarely where you think

Practice Questions

Q1. Explain why L1 cache is ~100x faster than main memory. How does this affect data structure design?

Q2. Convert an AoS (Array of Structures) layout to SoA (Structure of Arrays). Measure the difference with a loop iterating over one field.

Q3. Implement an arena allocator. Compare its allocation time against new using rdtsc.

Q4. Write a branchless min(a, b) function. Explain why it’s faster than if-else on unpredictable data.

Q5. Implement an SPSC (Single Producer Single Consumer) lock-free queue. Explain each memory ordering choice.

Q6. What is false sharing? Demonstrate it with two threads writing to adjacent atomic<int> variables, then fix it with alignas(64).

Q7. Explain the difference between busy-waiting and sleeping. When is each appropriate in a trading system?

Q8. What does -O3 -march=native -flto do? Why should you always compile with these for production?

Q9. Write AVX code to compute the dot product of two float arrays. Compare throughput against scalar code.

Q10. Given a hot path that takes 8μs, you need to reduce it to 3μs. Describe your profiling and optimization strategy step by step.


Key Takeaways

  1. Cache is king — design data structures for contiguous access
  2. Allocate memory upfront — pools, arenas, pre-reserved vectors
  3. Profile before optimizing — measure, don’t guess
  4. Lock-free for inter-thread communication on the hot path
  5. Pin threads to cores — eliminate context switches
  6. Branchless code where branch predictor can’t help
  7. Compile with -O3 -march=native -flto for production builds
  8. Low-latency is a mindset — every line on the hot path must justify its existence
