When Software Optimisation Is Not Enough
You have written efficient algorithms, chosen the right data structures, profiled your code, and eliminated bottlenecks. But your Monte Carlo simulation still takes too long, your real-time risk engine cannot keep up with market data, or your backtesting framework needs hours to test a strategy over a decade of tick data.
This is where hardware acceleration comes in — techniques that exploit specific hardware capabilities to achieve performance that pure algorithmic optimisation cannot reach.
JIT Compilation: Numba
Just-In-Time compilation takes Python code and compiles it to optimised machine code at runtime. Numba is the most popular JIT compiler for numerical Python, and it can deliver C-like performance with minimal code changes.
```python
import numpy as np
from numba import njit

@njit
def calculate_returns(prices):
    n = len(prices)
    returns = np.empty(n - 1)
    for i in range(n - 1):
        returns[i] = (prices[i + 1] - prices[i]) / prices[i]
    return returns

@njit
def monte_carlo_option_price(S0, K, r, sigma, T, n_sims, n_steps):
    dt = T / n_steps
    payoff_sum = 0.0
    for sim in range(n_sims):
        S = S0
        for step in range(n_steps):
            z = np.random.standard_normal()
            S = S * np.exp((r - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z)
        payoff = max(S - K, 0.0)
        payoff_sum += payoff
    return np.exp(-r * T) * (payoff_sum / n_sims)

# First call compiles the function (~1 second)
price = monte_carlo_option_price(100, 100, 0.05, 0.2, 1.0, 1_000_000, 252)
# Subsequent calls run at compiled speed (~100x faster than pure Python)
```
The `@njit` decorator tells Numba to compile the function to machine code. The key constraint: Numba works best with numerical code, meaning loops over arrays and mathematical operations. It does not support arbitrary Python objects, and string handling is limited.
When to Use Numba
- Numerical loops that NumPy cannot vectorise easily
- Monte Carlo simulations with path-dependent logic
- Custom rolling window calculations (see the sketch after this list)
- Any CPU-bound numerical code where you want C speed without writing C
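As a concrete sketch of the rolling-window case above, here is a minimal Numba version of a trailing rolling volatility. The function name and window choice are illustrative, not taken from any particular library:

```python
import numpy as np
from numba import njit

@njit
def rolling_volatility(returns, window):
    # Trailing standard deviation over a fixed window; the explicit loop
    # is slow in pure Python but compiles to tight machine code under @njit.
    n = len(returns)
    out = np.full(n, np.nan)
    for i in range(window - 1, n):
        out[i] = np.std(returns[i - window + 1 : i + 1])
    return out

vols = rolling_volatility(np.random.standard_normal(1_000_000), 20)
```

As with the option pricer above, the first call pays the compilation cost and subsequent calls reuse the compiled function.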
SIMD: Single Instruction, Multiple Data
Modern CPUs can process multiple data values simultaneously using SIMD instructions. Instead of adding two numbers, a SIMD instruction adds 4, 8, or 16 numbers in a single operation.
NumPy already uses SIMD internally for many operations. But you can exploit it more directly:
```python
# NumPy already uses SIMD under the hood
import numpy as np

prices = np.random.uniform(100, 200, 1_000_000)
volumes = np.random.uniform(1000, 100_000, 1_000_000)

# This uses SIMD internally — processes multiple elements per instruction
notionals = prices * volumes  # Vector multiply, not a loop

# For custom operations, Numba can generate SIMD code
from numba import njit, prange

@njit(parallel=True)
def weighted_average_parallel(values, weights):
    n = len(values)
    total_weight = 0.0
    weighted_sum = 0.0
    for i in prange(n):  # prange enables SIMD and multi-threading
        weighted_sum += values[i] * weights[i]
        total_weight += weights[i]
    return weighted_sum / total_weight
```
In C++, you can use SIMD intrinsics directly for maximum control:
```cpp
#include <immintrin.h>  // AVX2 intrinsics
#include <cstddef>      // size_t

// Add 4 doubles simultaneously using AVX2
void add_vectors(const double* a, const double* b, double* result, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        // Use unaligned loads/stores: the caller's pointers are not guaranteed
        // to have the 32-byte alignment that aligned loads would require.
        __m256d va = _mm256_loadu_pd(&a[i]);
        __m256d vb = _mm256_loadu_pd(&b[i]);
        __m256d vr = _mm256_add_pd(va, vb);
        _mm256_storeu_pd(&result[i], vr);
    }
    // Handle remaining elements
    for (; i < n; i++) {
        result[i] = a[i] + b[i];
    }
}
```
GPU Computing with CUDA
GPUs have thousands of cores designed for parallel computation. While each core is simpler than a CPU core, the sheer parallelism makes GPUs dramatically faster for suitable workloads.
CuPy: NumPy on GPUs
The easiest way to use GPU acceleration in Python is CuPy — a drop-in replacement for NumPy that runs on NVIDIA GPUs:
```python
import cupy as cp

# Move data to GPU (prices and volumes are the NumPy arrays from earlier)
prices_gpu = cp.array(prices)
volumes_gpu = cp.array(volumes)

# Same API as NumPy, but runs on GPU
notionals_gpu = prices_gpu * volumes_gpu
mean_notional = cp.mean(notionals_gpu)

# Move result back to CPU
result = float(mean_notional)
```
For large arrays (millions of elements), CuPy can be 10-100x faster than NumPy. The overhead is in transferring data between CPU and GPU memory, so it works best when you can keep data on the GPU for multiple operations.
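A rough sketch of that pattern, with illustrative variable names: transfer once, chain several operations on the device, then bring a single scalar back:

```python
import cupy as cp
import numpy as np

prices = np.random.uniform(100, 200, 5_000_000)

prices_gpu = cp.asarray(prices)     # one host-to-device transfer

log_prices = cp.log(prices_gpu)     # stays on the GPU
returns_gpu = cp.diff(log_prices)   # stays on the GPU
daily_vol = cp.std(returns_gpu)     # stays on the GPU

annualised_vol = float(daily_vol) * 252 ** 0.5  # single scalar back to the host
```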
CUDA Kernels for Custom Logic
For maximum flexibility, you can write custom CUDA kernels:
```python
import math
import numpy as np
from numba import cuda
from numba.cuda.random import create_xoroshiro128p_states, xoroshiro128p_normal_float64

@cuda.jit
def monte_carlo_kernel(results, S0, K, r, sigma, T, n_steps, rng_states):
    idx = cuda.grid(1)
    if idx < results.shape[0]:
        dt = T / n_steps
        S = S0
        for step in range(n_steps):
            z = xoroshiro128p_normal_float64(rng_states, idx)
            S = S * math.exp((r - 0.5 * sigma**2) * dt + sigma * math.sqrt(dt) * z)
        results[idx] = max(S - K, 0.0)

# Launch 1 million simulations across GPU threads
n_sims = 1_000_000
results = cuda.device_array(n_sims)
threads_per_block = 256
blocks = (n_sims + threads_per_block - 1) // threads_per_block
rng_states = create_xoroshiro128p_states(n_sims, seed=42)

monte_carlo_kernel[blocks, threads_per_block](
    results, 100.0, 100.0, 0.05, 0.2, 1.0, 252, rng_states
)
option_price = np.exp(-0.05 * 1.0) * results.copy_to_host().mean()  # discount payoffs at r * T
```
When GPUs Make Sense
- Monte Carlo simulations (embarrassingly parallel)
- Matrix operations for portfolio optimisation (sketched after this list)
- Machine learning model training and inference
- Real-time risk computation across thousands of positions
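As one illustration of the matrix case above, a portfolio variance calculation maps almost directly onto CuPy. The sizes and variable names here are made up for the sketch:

```python
import cupy as cp

n_assets, n_days = 2_000, 2_520
returns_gpu = cp.random.standard_normal((n_days, n_assets)) * 0.01
weights_gpu = cp.full(n_assets, 1.0 / n_assets)

# Covariance matrix and the quadratic form w' C w, computed entirely on the GPU
cov_gpu = cp.cov(returns_gpu, rowvar=False)        # (n_assets, n_assets)
port_var = weights_gpu @ cov_gpu @ weights_gpu
port_vol = float(cp.sqrt(port_var)) * 252 ** 0.5   # annualised portfolio volatility
```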
GPUs do not help for sequential, branchy logic — the overhead of data transfer and kernel launch outweighs any benefit.
FPGAs: The Ultimate in Low Latency
Field-Programmable Gate Arrays are hardware chips that can be configured for specific tasks. Unlike CPUs and GPUs that execute instructions sequentially or in waves, FPGAs implement logic directly in hardware — processing data in nanoseconds.
In finance, FPGAs are used for:
- Market data parsing — decode exchange feed messages in hardware, before the data even reaches the CPU
- Order routing — make routing decisions at wire speed
- Risk checks — pre-trade risk validation in nanoseconds
FPGAs require specialised skills (hardware description languages like Verilog or VHDL) and are typically only used by high-frequency trading firms where the latency advantage justifies the development cost.
Choosing the Right Acceleration
| Technique | Speedup | Effort | Best For |
|---|---|---|---|
| Numba JIT | 10-100x | Low | Numerical Python loops |
| SIMD | 2-8x | Medium | Batch processing |
| GPU (CuPy) | 10-100x | Low | Large array operations |
| GPU (CUDA) | 50-1000x | High | Custom parallel algorithms |
| FPGA | Nanosecond-level latency | Very high | Ultra-low-latency, specific tasks |
Start with the simplest option that meets your needs. Numba JIT is often sufficient: adding `@njit` to a bottleneck function can yield order-of-magnitude speedups with a one-line change. Only move to more complex solutions once you have measured and confirmed that the simpler approaches fall short.
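A minimal way to make that measurement, assuming the `calculate_returns` function from the Numba example earlier is already defined (timings vary by machine):

```python
import time
import numpy as np

prices = np.random.uniform(100, 200, 1_000_000)

def calculate_returns_python(prices):
    # Same loop as the Numba version, without @njit
    n = len(prices)
    returns = np.empty(n - 1)
    for i in range(n - 1):
        returns[i] = (prices[i + 1] - prices[i]) / prices[i]
    return returns

calculate_returns(prices)  # warm-up call pays the one-off compilation cost

for fn in (calculate_returns_python, calculate_returns):
    start = time.perf_counter()
    fn(prices)
    print(f"{fn.__name__}: {time.perf_counter() - start:.4f}s")
```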
For the network layer considerations that often determine whether hardware acceleration is worth the investment, see our latency guide. And if you are choosing between Rust and C++ for your performance-critical code, understanding hardware acceleration helps inform that decision.