
Low-Latency C++

This is where C++ separates itself from every other language. In quantitative finance, high-frequency trading, and real-time systems, microseconds matter. This section covers the techniques used to write the fastest possible C++ code.


Table of Contents

1 — Why Low-Latency Matters

2 — CPU Cache Hierarchy

3 — Memory Allocation

4 — Branch Prediction

5 — Lock-Free Data Structures

6 — Avoiding System Calls

7 — CPU Pinning & NUMA

8 — Compiler Optimizations

9 — SIMD

10 — Profiling

11 — Low-Latency Checklist

Practice Questions


Glossary — Key Terms at a Glance

Term Meaning
Cache line 64-byte block — the unit of data transfer between CPU and memory
L1/L2/L3 CPU cache levels — L1 fastest (~1ns), L3 shared (~10ns)
SoA Structure of Arrays — cache-friendly data layout
Arena allocator Pre-allocate large block, hand out chunks — O(1) alloc
Branch misprediction CPU guessed wrong branch direction — ~15-20 cycle penalty
Lock-free Concurrent access without mutexes — uses atomics/CAS
SPSC queue Single Producer Single Consumer — fastest IPC primitive
CAS Compare-And-Swap — atomic read-modify-write operation
NUMA Non-Uniform Memory Access — memory latency depends on socket
SIMD Single Instruction Multiple Data — process 4-8 values per instruction
rdtsc Read Time Stamp Counter — cycle-accurate benchmarking

1 — Why Low-Latency Matters

1.1 HFT Context

In HFT (High-Frequency Trading):

  • 1 microsecond advantage = millions in annual profit
  • Market data arrives every ~10–100ΞΌs
  • Order-to-fill must be <10ΞΌs at top firms
  • Every allocation, branch, and cache miss counts
Network tick β†’ Parse β†’ Strategy β†’ Risk Check β†’ Order β†’ Network
        ↑                                              ↑
    ~5ΞΌs                                           ~5ΞΌs
        └── Your code must execute in ~1-5ΞΌs β”€β”€β”€β”€β”€β”€β”˜

2 — CPU Cache Hierarchy

2.1 Cache Lines & Data-Oriented Design

The single most important concept for performance:

┌───────────┐  ~1ns    32-64 KB
│  L1 Cache │  ←── per core
├───────────┤  ~4ns    256 KB - 1MB
│  L2 Cache │  ←── per core
├───────────┤  ~10ns   8-64 MB
│  L3 Cache │  ←── shared across cores
├───────────┤  ~100ns
│  Main RAM │  ←── 16-128 GB
├───────────┤  ~10ms
│  Disk/SSD │
└───────────┘

L1 cache hit is 100x faster than main memory!

Cache Line Facts:

  • CPU reads memory in cache lines (typically 64 bytes)
  • Accessing one byte loads the entire 64-byte line
  • Accessing adjacent memory = cache hits = fast
  • Random access = cache misses = slow

Data-Oriented Design — SoA vs AoS:

// BAD: Array of Structures (AoS) — poor cache locality
struct Order {
    double price;        // 8 bytes
    int quantity;        // 4 bytes
    char side;           // 1 byte
    // 3 bytes padding
    std::string symbol;  // 32 bytes — heap allocated!
};
std::vector<Order> orders;  // each order spread across memory

// GOOD: Structure of Arrays (SoA) — cache-friendly
struct Orders {
    std::vector<double> prices;
    std::vector<int> quantities;
    std::vector<char> sides;
    std::vector<uint32_t> symbolIds;  // use IDs instead of strings
};
// When iterating prices, they're contiguous in memory → cache hits

🔍 Why SoA? When you iterate over all prices, the entire prices array sits in contiguous memory. Each cache line fetch brings 8 doubles = 8 useful values. With AoS, each fetch brings 1 price mixed with quantity/side/symbol — wasting 80%+ of cache line bandwidth.

Measuring Cache Performance:

# Linux perf
perf stat -e cache-references,cache-misses,L1-dcache-load-misses ./program

# Valgrind cachegrind
valgrind --tool=cachegrind ./program

3 — Memory Allocation

⚠️ new and malloc are too slow for the hot path. They can take locks, fall back to system calls, and fragment memory.

3.1 Arena Allocator

Pre-allocate a large block, hand out chunks — O(1) allocation:

class Arena {
    char* memory;
    size_t offset;
    size_t capacity;
    
public:
    Arena(size_t size) : memory(new char[size]), offset(0), capacity(size) {}
    ~Arena() { delete[] memory; }
    
    void* allocate(size_t bytes, size_t alignment = 8) {
        // round offset up; alignment must be a power of two
        size_t aligned = (offset + alignment - 1) & ~(alignment - 1);
        if (aligned + bytes > capacity) return nullptr;
        void* ptr = memory + aligned;
        offset = aligned + bytes;
        return ptr;
    }
    
    void reset() { offset = 0; }  // O(1) "free all"
    
    Arena(const Arena&) = delete;
    Arena& operator=(const Arena&) = delete;
};

// Usage:
Arena arena(1024 * 1024);  // 1MB pool
auto* order = new(arena.allocate(sizeof(Order))) Order{};
// No syscall, no locking, O(1) allocation

πŸ” Why arenas? A single new call takes ~100ns (system call + malloc internals). Arena allocation takes ~2ns (pointer bump). On a hot path executing in 1-5ΞΌs, every allocation counts.


3.2 Object Pool

For fixed-size objects — free-list based O(1) alloc/dealloc:

template<typename T, size_t N>
class ObjectPool {
    union Slot {
        T object;
        Slot* next;
        Slot() {}
        ~Slot() {}
    };
    
    Slot slots[N];
    Slot* free_list;
    
public:
    ObjectPool() {
        free_list = &slots[0];
        for (size_t i = 0; i < N - 1; ++i)
            slots[i].next = &slots[i + 1];
        slots[N - 1].next = nullptr;
    }
    
    template<typename... Args>
    T* allocate(Args&&... args) {
        if (!free_list) return nullptr;
        Slot* slot = free_list;
        free_list = slot->next;
        return new(&slot->object) T(std::forward<Args>(args)...);
    }
    
    void deallocate(T* obj) {
        obj->~T();
        auto* slot = reinterpret_cast<Slot*>(obj);
        slot->next = free_list;
        free_list = slot;
    }
};

3.3 Stack Allocation

// SLOW: heap allocation
auto* arr = new int[100];
delete[] arr;

// FAST: stack allocation
int arr[100];

// FAST: std::array (stack)
std::array<int, 100> arr;

// For dynamic sizes, reserve upfront
std::vector<int> v;
v.reserve(1000);  // single allocation, no reallocations

Rule: If the size is known at compile time, use stack. If dynamic, reserve capacity upfront to avoid reallocations.


4 β€” Branch Prediction

4.1 Branchless Code & Hints

Modern CPUs predict which way a branch will go. A misprediction costs ~15-20 cycles (~5-7ns at 3GHz).

Branch-Free Code:

// SLOW: branch-heavy
int abs_val(int x) {
    if (x < 0) return -x;
    return x;
}

// FAST: branchless (relies on arithmetic right shift of negatives, which
// all mainstream compilers provide; guaranteed only since C++20)
int abs_val(int x) {
    int mask = x >> 31;           // 0 if positive, -1 if negative
    return (x + mask) ^ mask;     // branchless absolute value
}

// SLOW: conditional increment
if (condition) count++;

// FAST: branchless
count += (condition);  // condition is 0 or 1

Likely/Unlikely Hints:

// GCC/Clang
if (__builtin_expect(error_occurred, 0)) {  // unlikely
    handle_error();
}

// C++20:
if (value > threshold) [[likely]] {
    fast_path();
} else [[unlikely]] {
    slow_path();
}

Sort Before Branching: If data is sorted, the branch predictor can learn the pattern:

std::sort(data.begin(), data.end());
for (int x : data) {
    if (x > threshold) {
        sum += x;  // branch becomes predictable after the boundary
    }
}

5 β€” Lock-Free Data Structures

⚠️ Mutexes are slow on the hot path (a contended lock can cost ~1μs in context switches). For inter-thread communication in latency-critical code, use lock-free structures.

5.1 SPSC Queue

Lock-Free Queue (Single Producer Single Consumer):

template<typename T, size_t Capacity>
class SPSCQueue {
    static_assert((Capacity & (Capacity - 1)) == 0, "Capacity must be power of 2");
    
    std::array<T, Capacity> buffer;
    alignas(64) std::atomic<size_t> head{0};  // written by consumer
    alignas(64) std::atomic<size_t> tail{0};  // written by producer
    // alignas(64) prevents false sharing
    
public:
    bool push(const T& item) {
        size_t t = tail.load(std::memory_order_relaxed);
        size_t next = (t + 1) & (Capacity - 1);
        
        if (next == head.load(std::memory_order_acquire))
            return false;  // full
        
        buffer[t] = item;
        tail.store(next, std::memory_order_release);
        return true;
    }
    
    bool pop(T& item) {
        size_t h = head.load(std::memory_order_relaxed);
        
        if (h == tail.load(std::memory_order_acquire))
            return false;  // empty
        
        item = buffer[h];
        head.store((h + 1) & (Capacity - 1), std::memory_order_release);
        return true;
    }
};

πŸ” Why alignas(64)? head and tail are written by different threads. Without alignment, they’d share a cache line (false sharing), causing constant cache invalidation β€” 10-100x slower.


5.2 Lock-Free Stack (CAS)

template<typename T>
class LockFreeStack {
    struct Node {
        T data;
        Node* next;
    };
    
    std::atomic<Node*> head{nullptr};
    
public:
    void push(T value) {
        // note: a real hot path would take nodes from a pool rather than call new here
        Node* new_node = new Node{std::move(value), nullptr};
        new_node->next = head.load(std::memory_order_relaxed);
        
        while (!head.compare_exchange_weak(
            new_node->next, new_node,
            std::memory_order_release,
            std::memory_order_relaxed)) {
            // CAS failed — another thread modified head
            // new_node->next is automatically updated to current head
            // retry
        }
    }
};

πŸ” Why compare_exchange_weak? It can spuriously fail (for performance reasons on some architectures), but in a loop that’s fine. compare_exchange_strong guarantees no spurious failures but may be slower.


6 — Avoiding System Calls

6.1 Batch I/O & Busy-Waiting

⚠️ System calls involve a kernel transition: ~1-10μs each. On a hot path, that’s unacceptable.

Batch I/O:

// SLOW: frequent I/O — one syscall per message
for (auto& msg : messages) {
    write(fd, msg.data(), msg.size());  // syscall each iteration
}

// FAST: batch I/O — single syscall
std::string batch;
batch.reserve(total_size);
for (auto& msg : messages) {
    batch += msg;
}
write(fd, batch.data(), batch.size());  // single syscall

// FAST: writev for scatter-gather I/O
struct iovec iov[N];
// ... fill iov with pointers and lengths ...
writev(fd, iov, N);  // single syscall, multiple buffers

Busy-Waiting vs Sleeping:

// HOT PATH: busy-wait (spins, uses CPU, but zero latency)
while (!data_ready.load(std::memory_order_acquire)) {
    // spin — CPU core dedicated to this
}

// COLD PATH: sleep (yields CPU, higher latency)
while (!data_ready.load(std::memory_order_acquire)) {
    std::this_thread::yield();
}

πŸ” In HFT, dedicated cores busy-wait on the hot path. The CPU cost is worth the nanoseconds saved.


7 — CPU Pinning & NUMA

7.1 Affinity & Core Isolation

CPU Affinity — pin threads to specific cores:

#include <pthread.h>

void pin_thread_to_core(int core_id) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core_id, &cpuset);
    pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);
}

pin_thread_to_core(2);  // pin trading thread to core 2

NUMA Awareness — allocate on local memory node:

#include <numa.h>
void* ptr = numa_alloc_onnode(size, numa_node_of_cpu(sched_getcpu()));

Core Isolation (Linux kernel):

# In /etc/default/grub:
GRUB_CMDLINE_LINUX="isolcpus=2,3 nohz_full=2,3 rcu_nocbs=2,3"
# Cores 2 and 3 are dedicated to your application — no kernel tasks

Why isolate cores? Without isolation, the Linux scheduler can preempt your trading thread for kernel housekeeping tasks, adding microseconds of jitter. Isolated cores run only your code.


8 — Compiler Optimizations

8.1 Flags & Attributes

Optimization Flags:

g++ -O3 -march=native -flto -DNDEBUG program.cpp
# -O3: aggressive optimization
# -march=native: use all CPU features (AVX, SSE, etc.)
# -flto: link-time optimization (cross-file inlining)
# -DNDEBUG: disable assert()

Hot/Cold Attributes:

[[gnu::hot]] void process_order(Order& order) {
    // Compiler optimizes this more aggressively
}

[[gnu::cold]] void handle_error(int code) {
    // Compiler moves this to cold section (less cache pressure on hot code)
}

Prevent Compiler from Optimizing Away (for benchmarks):

void do_not_optimize(auto& value) {
    asm volatile("" : : "r,m"(value) : "memory");
}
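
Typical use, with the barrier helper repeated so the snippet stands alone (bench_sum is an invented example, written as a template rather than with an auto parameter so it also compiles pre-C++20):

```cpp
#include <cstdint>

// GCC/Clang: an empty asm that claims to read value and clobber memory,
// so the optimizer must actually compute it.
template <typename T>
inline void do_not_optimize(T& value) {
    asm volatile("" : : "r,m"(value) : "memory");
}

// Without the barrier, -O3 can see that an ignored sum feeds nothing
// observable and delete the loop, leaving an empty "benchmark".
std::uint64_t bench_sum(const int* data, int n) {
    std::uint64_t sum = 0;
    for (int i = 0; i < n; ++i) sum += data[i];
    do_not_optimize(sum);  // keeps sum live even if the caller discards it
    return sum;
}
```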

9 — SIMD

9.1 Single Instruction Multiple Data

Definition: SIMD processes multiple data points in one CPU instruction β€” typically 4-8x throughput for arithmetic.

#include <immintrin.h>

// Add 8 floats at once (AVX)
void add_arrays(float* a, float* b, float* result, int n) {
    for (int i = 0; i < n; i += 8) {  // assumes n is a multiple of 8; handle the tail separately
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        __m256 vc = _mm256_add_ps(va, vb);
        _mm256_storeu_ps(result + i, vc);
    }
}
// 8x throughput for floating-point operations!

πŸ” Modern compilers auto-vectorize with -O3 -march=native, but manual SIMD gives more control for critical inner loops.


10 — Profiling

10.1 Tools & Measurement

⚠️ Never optimize without profiling first. The bottleneck is rarely where you think it is.

Tool — What it Measures — Platform
perf — CPU cycles, cache misses, branch misses — Linux
VTune — Deep CPU analysis, threading — Cross-platform
gprof — Function-level profiling — Linux
Valgrind/Cachegrind — Cache simulation — Linux
Google Benchmark — Microbenchmarks — Cross-platform
Tracy — Real-time frame profiler — Cross-platform

Quick Benchmarking:

#include <chrono>

auto start = std::chrono::steady_clock::now();  // steady_clock is monotonic; high_resolution_clock may not be
// ... code to benchmark ...
auto end = std::chrono::steady_clock::now();

auto duration = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start);
std::cout << duration.count() << " ns\n";

rdtsc — CPU Cycle Counter (most precise):

inline uint64_t rdtsc() {
    unsigned int lo, hi;
    // note: rdtsc is not serializing; pair with a fence or use rdtscp for strict ordering
    __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

uint64_t start = rdtsc();
// ... hot path code ...
uint64_t cycles = rdtsc() - start;
// At 3GHz: 1 cycle ≈ 0.33ns
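
Whichever clock you use, a single sample is mostly noise (interrupts, frequency scaling, cache state). The usual practice is to repeat the measurement and report min/median/p99 rather than a mean; a small sketch with invented names:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct LatencyStats {
    std::uint64_t min, median, p99;
};

// Sort the samples once, then read percentiles by index.
// min = best case, median = typical, p99 = tail latency (jitter).
// Assumes at least one sample was collected.
LatencyStats summarize(std::vector<std::uint64_t> samples) {
    std::sort(samples.begin(), samples.end());
    return { samples.front(),
             samples[samples.size() / 2],
             samples[(samples.size() * 99) / 100] };
}
```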

11 — Low-Latency Checklist

11.1 The 11 Rules

# Rule — Why
1 No heap allocation on hot path — malloc/new can lock and syscall (~100ns)
2 No locks on hot path — mutex = ~1μs context switch
3 No virtual functions on hot path — vtable = cache miss + indirection
4 No exceptions on hot path — exception = expensive stack unwinding
5 No I/O on hot path — I/O = syscall + blocking
6 Use contiguous memory (arrays, not linked lists) — cache locality
7 Pad atomics to cache line (64B) — prevent false sharing
8 Pin threads to cores — prevent context switches
9 Use lock-free queues (SPSC/MPSC) — zero-overhead communication
10 Pre-compute everything possible — move work off the hot path
11 Profile, don’t guess — bottleneck is rarely where you think

Practice Questions

Q1. Explain why L1 cache is ~100x faster than main memory. How does this affect data structure design?

Q2. Convert an AoS (Array of Structures) layout to SoA (Structure of Arrays). Measure the difference with a loop iterating over one field.

Q3. Implement an arena allocator. Compare its allocation time against new using rdtsc.

Q4. Write a branchless min(a, b) function. Explain why it’s faster than if-else on unpredictable data.

Q5. Implement an SPSC (Single Producer Single Consumer) lock-free queue. Explain each memory ordering choice.

Q6. What is false sharing? Demonstrate it with two threads writing to adjacent atomic<int> variables, then fix it with alignas(64).

Q7. Explain the difference between busy-waiting and sleeping. When is each appropriate in a trading system?

Q8. What does -O3 -march=native -flto do? Why should you always compile with these for production?

Q9. Write AVX code to compute the dot product of two float arrays. Compare throughput against scalar code.

Q10. Given a hot path that takes 8μs, you need to reduce it to 3μs. Describe your profiling and optimization strategy step by step.


Key Takeaways

  1. Cache is king — design data structures for contiguous access
  2. Allocate memory upfront — pools, arenas, pre-reserved vectors
  3. Profile before optimizing — measure, don’t guess
  4. Lock-free for inter-thread communication on the hot path
  5. Pin threads to cores — eliminate context switches
  6. Branchless code where branch predictor can’t help
  7. Compile with -O3 -march=native -flto for production builds
  8. Low-latency is a mindset — every line on the hot path must justify its existence
