
Hardware Acceleration for Quantitative Finance

JIT compilation, SIMD instructions, GPU computing with CUDA, and FPGAs — the hardware acceleration techniques used in high-performance financial systems.

When Software Optimisation Is Not Enough

You have written efficient algorithms, chosen the right data structures, profiled your code, and eliminated bottlenecks. But your Monte Carlo simulation still takes too long, your real-time risk engine cannot keep up with market data, or your backtesting framework needs hours to test a strategy over a decade of tick data.

This is where hardware acceleration comes in — techniques that exploit specific hardware capabilities to achieve performance that pure algorithmic optimisation cannot reach.


JIT Compilation: Numba

Just-In-Time compilation takes Python code and compiles it to optimised machine code at runtime. Numba is the most popular JIT compiler for numerical Python, and it can deliver C-like performance with minimal code changes.

import numpy as np
from numba import njit

@njit
def calculate_returns(prices):
    n = len(prices)
    returns = np.empty(n - 1)
    for i in range(n - 1):
        returns[i] = (prices[i + 1] - prices[i]) / prices[i]
    return returns

@njit
def monte_carlo_option_price(S0, K, r, sigma, T, n_sims, n_steps):
    dt = T / n_steps
    payoff_sum = 0.0
    for sim in range(n_sims):
        S = S0
        for step in range(n_steps):
            z = np.random.standard_normal()
            S = S * np.exp((r - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z)
        payoff = max(S - K, 0.0)
        payoff_sum += payoff
    return np.exp(-r * T) * (payoff_sum / n_sims)

# First call compiles the function (~1 second)
price = monte_carlo_option_price(100.0, 100.0, 0.05, 0.2, 1.0, 1_000_000, 252)
# Subsequent calls run at compiled speed (~100x faster than pure Python)

The @njit decorator tells Numba to compile the function to machine code. The key constraint: Numba works best with numerical code — loops over arrays, mathematical operations. It does not support arbitrary Python objects or string manipulation.
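
You can observe the compile-then-run behaviour directly by timing two consecutive calls to the function above. Exact numbers depend on your machine, so treat this as a minimal sketch:

import time

# First call: Numba compiles monte_carlo_option_price for these argument types
start = time.perf_counter()
monte_carlo_option_price(100.0, 100.0, 0.05, 0.2, 1.0, 10_000, 252)
print(f"first call (incl. compilation): {time.perf_counter() - start:.2f}s")

# Second call: the cached machine code is reused
start = time.perf_counter()
monte_carlo_option_price(100.0, 100.0, 0.05, 0.2, 1.0, 10_000, 252)
print(f"second call (compiled only):    {time.perf_counter() - start:.2f}s")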

When to Use Numba

  • Numerical loops that NumPy cannot vectorise easily
  • Monte Carlo simulations with path-dependent logic
  • Custom rolling window calculations (see the sketch after this list)
  • Any CPU-bound numerical code where you want C speed without writing C
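
As an example of the rolling-window case, here is a minimal sketch of a rolling maximum (the function name and window size are illustrative). The nested loop is exactly the kind of pattern that is awkward to vectorise in plain NumPy but compiles well under Numba:

from numba import njit
import numpy as np

@njit
def rolling_max(values, window):
    n = len(values)
    out = np.full(n, np.nan)
    for i in range(window - 1, n):
        # Scan the window ending at i; this data-dependent inner loop
        # is hard to express as a single vectorised NumPy call
        m = values[i - window + 1]
        for j in range(i - window + 2, i + 1):
            if values[j] > m:
                m = values[j]
        out[i] = m
    return out

prices = np.random.uniform(100, 200, 1_000_000)
highs = rolling_max(prices, 20)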

SIMD: Single Instruction, Multiple Data

Modern CPUs can process multiple data values simultaneously using SIMD instructions. Instead of adding two numbers, a SIMD instruction adds 4, 8, or 16 numbers in a single operation.

NumPy already uses SIMD internally for many operations. But you can exploit it more directly:

# NumPy already uses SIMD under the hood
import numpy as np

prices = np.random.uniform(100, 200, 1_000_000)
volumes = np.random.uniform(1000, 100_000, 1_000_000)

# This uses SIMD internally — processes multiple elements per instruction
notionals = prices * volumes  # Vector multiply, not a loop

# For custom operations, Numba can generate SIMD code
from numba import njit, prange

@njit(parallel=True)
def weighted_average_parallel(values, weights):
    n = len(values)
    total_weight = 0.0
    weighted_sum = 0.0
    for i in prange(n):  # prange enables SIMD and multi-threading
        weighted_sum += values[i] * weights[i]
        total_weight += weights[i]
    return weighted_sum / total_weight

In C++, you can use SIMD intrinsics directly for maximum control:

#include <immintrin.h>  // AVX2 intrinsics
#include <cstddef>

// Add 4 doubles simultaneously using AVX2
void add_vectors(const double* a, const double* b, double* result, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        // Unaligned loads: the caller's arrays are not guaranteed
        // to be 32-byte aligned
        __m256d va = _mm256_loadu_pd(&a[i]);
        __m256d vb = _mm256_loadu_pd(&b[i]);
        __m256d vr = _mm256_add_pd(va, vb);
        _mm256_storeu_pd(&result[i], vr);
    }
    // Handle remaining elements
    for (; i < n; i++) {
        result[i] = a[i] + b[i];
    }
}
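
Two practical notes on this sketch: it uses the unaligned variants _mm256_loadu_pd and _mm256_storeu_pd because the caller's arrays are not guaranteed to be 32-byte aligned (switch to the aligned _mm256_load_pd only if you control allocation), and the file must be compiled with AVX2 enabled, for example with -mavx2 on GCC or Clang.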

GPU Computing with CUDA

GPUs have thousands of cores designed for parallel computation. While each core is simpler than a CPU core, the sheer parallelism makes GPUs dramatically faster for suitable workloads.

CuPy: NumPy on GPUs

The easiest way to use GPU acceleration in Python is CuPy — a drop-in replacement for NumPy that runs on NVIDIA GPUs:

import numpy as np
import cupy as cp

prices = np.random.uniform(100, 200, 1_000_000)
volumes = np.random.uniform(1000, 100_000, 1_000_000)

# Move data to GPU
prices_gpu = cp.array(prices)
volumes_gpu = cp.array(volumes)

# Same API as NumPy, but runs on GPU
notionals_gpu = prices_gpu * volumes_gpu
mean_notional = cp.mean(notionals_gpu)

# Move result back to CPU
result = float(mean_notional)

For large arrays (millions of elements), CuPy can be 10-100x faster than NumPy. The overhead is in transferring data between CPU and GPU memory, so it works best when you can keep data on the GPU for multiple operations.
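
To illustrate keeping data resident on the GPU, here is a sketch (the array shapes are illustrative) that chains several operations and pays the transfer cost only once at the end:

import cupy as cp

# Illustrative: 10,000 daily observations for 1,000 assets
prices_gpu = cp.random.uniform(100, 200, (10_000, 1_000))

# Chain several operations on the GPU — no host transfers in between
returns_gpu = cp.diff(prices_gpu, axis=0) / prices_gpu[:-1]
cov_gpu = cp.cov(returns_gpu, rowvar=False)

# A single transfer back to the CPU at the end
cov = cp.asnumpy(cov_gpu)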

CUDA Kernels for Custom Logic

For maximum flexibility, you can write custom CUDA kernels:

import math
import numpy as np
from numba import cuda
from numba.cuda.random import (create_xoroshiro128p_states,
                               xoroshiro128p_normal_float64)

@cuda.jit
def monte_carlo_kernel(results, S0, K, r, sigma, T, n_steps, rng_states):
    idx = cuda.grid(1)
    if idx < results.shape[0]:
        dt = T / n_steps
        S = S0
        for step in range(n_steps):
            # Each thread draws from its own RNG stream
            z = xoroshiro128p_normal_float64(rng_states, idx)
            S = S * math.exp((r - 0.5 * sigma**2) * dt + sigma * math.sqrt(dt) * z)
        results[idx] = max(S - K, 0.0)

# Launch 1 million simulations across GPU threads
n_sims = 1_000_000
results = cuda.device_array(n_sims)
rng_states = create_xoroshiro128p_states(n_sims, seed=42)

threads_per_block = 256
blocks = (n_sims + threads_per_block - 1) // threads_per_block
monte_carlo_kernel[blocks, threads_per_block](
    results, 100.0, 100.0, 0.05, 0.2, 1.0, 252, rng_states
)

# Discount the average payoff at exp(-r * T)
option_price = np.exp(-0.05 * 1.0) * results.copy_to_host().mean()
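
The design here is one thread per simulation path: each thread writes to its own slot in results and draws from its own random stream, indexed by thread id, so no inter-thread synchronisation is needed. The rng_states array comes from numba.cuda.random.create_xoroshiro128p_states, which allocates one independent generator state per thread.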

When GPUs Make Sense

  • Monte Carlo simulations (embarrassingly parallel)
  • Matrix operations for portfolio optimisation
  • Machine learning model training and inference
  • Real-time risk computation across thousands of positions

GPUs do not help for sequential, branchy logic — the overhead of data transfer and kernel launch outweighs any benefit.


FPGAs: The Ultimate in Low Latency

Field-Programmable Gate Arrays are hardware chips that can be configured for specific tasks. Unlike CPUs and GPUs that execute instructions sequentially or in waves, FPGAs implement logic directly in hardware — processing data in nanoseconds.

In finance, FPGAs are used for:

  • Market data parsing — decode exchange feed messages in hardware, before the data even reaches the CPU
  • Order routing — make routing decisions at wire speed
  • Risk checks — pre-trade risk validation in nanoseconds

FPGAs require specialised skills (hardware description languages like Verilog or VHDL) and are typically only used by high-frequency trading firms where the latency advantage justifies the development cost.


Choosing the Right Acceleration

Technique    | Speedup        | Effort    | Best For
-------------|----------------|-----------|-----------------------------------
Numba JIT    | 10-100x        | Low       | Numerical Python loops
SIMD         | 2-8x           | Medium    | Batch processing
GPU (CuPy)   | 10-100x        | Low       | Large array operations
GPU (CUDA)   | 50-1000x       | High      | Custom parallel algorithms
FPGA         | Hardware speed | Very high | Ultra-low-latency, specific tasks

Start with the simplest option that meets your needs. Numba JIT is often sufficient: adding @njit to a bottleneck function can deliver up to 100x improvements on numerical loops. Only move to more complex solutions when you have measured and confirmed that simpler approaches are insufficient.

For the network layer considerations that often determine whether hardware acceleration is worth the investment, see our latency guide. And if you are choosing between Rust and C++ for your performance-critical code, understanding hardware acceleration helps inform that decision.
