From Laptop to a Million Scenarios
You've built a financial model in Python. It runs beautifully on your laptop — perhaps a portfolio stress-testing model that grinds through a few hundred economic scenarios. You've shown it to the right people, they're suitably impressed, and now someone has uttered the dreaded words: "Can we scale this up?"
Suddenly, 150 scenarios isn't enough. They want a million. And they want it running in the cloud. Serverless, naturally, because nobody wants the overhead of managing dedicated infrastructure.
This article walks through how to architect a large-scale serverless batch processing pipeline on AWS — one that takes a Python financial model and runs it at serious scale. Having taken this exact journey, from a humble Lambda function processing 150 scenarios to over a million using the setup described below, I can confirm: it works, and it's rather satisfying when it does.
The Problem (and Why Lambda Won't Cut It)
Let's set the scene. You have a portfolio model. It takes a set of financial scenarios — interest rate paths, equity shocks, credit spread movements, and so on — and runs your portfolio through each one. At small scale, AWS Lambda is perfectly adequate. Cheap, simple, no infrastructure to worry about.
But Lambda has limits. A 15-minute execution timeout, 10GB of memory, and no sensible way to coordinate thousands of parallel invocations without building your own orchestration layer. Once you need to process hundreds of thousands (or millions) of scenarios, you need something more industrial.
The answer is a serverless batch processing architecture using AWS Step Functions, AWS Batch, and AWS Fargate. Think of it as the next step up from Lambda — purpose-built for exactly this kind of workload.
The Architecture: A Bird's Eye View
Here's how the pieces fit together, end to end.
API Gateway and Authentication
Everything starts with an API call. AWS API Gateway provides the HTTPS endpoint, and you bolt on authentication (Cognito, API keys, or a custom authoriser) so that only authorised clients can trigger a run. This is the single entry point to the pipeline.
Step Functions: The Orchestrator
AWS Step Functions is the conductor of this entire orchestra. It manages the workflow as a state machine: what runs first, what runs in parallel, and what happens when something fails — which, in distributed computing, you should always plan for.
The workflow runs in three stages:
Stage 1 — Shard the data. A Fargate task splits your scenario dataset into manageable chunks, which we'll call "shards." If you have a million scenarios and want each processing task to handle 500, that's 2,000 shards. This task writes the shard definitions (essentially pointers to subsets of your input data) and registers them for processing.
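The sharding arithmetic itself is simple. A minimal sketch of the chunking logic — the shard size and the start/end pointer format are illustrative; in practice each shard definition would point at a range or set of keys in your input data:

```python
import math

def make_shards(num_scenarios: int, shard_size: int) -> list[dict]:
    """Split a scenario range into shard definitions (start/end index pointers)."""
    num_shards = math.ceil(num_scenarios / shard_size)
    return [
        {
            "shard_id": i,
            "start": i * shard_size,
            "end": min((i + 1) * shard_size, num_scenarios),
        }
        for i in range(num_shards)
    ]

shards = make_shards(num_scenarios=1_000_000, shard_size=500)
print(len(shards))  # 2000 shards, matching the worked example above
```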
Stage 2 — Process in parallel. Step Functions triggers AWS Batch, which spins up thousands of concurrent Fargate tasks. Each task picks up a shard of scenarios, runs your Python model against them, and writes the results. AWS Batch handles the scheduling, queuing, and retry logic. You just tell it how many jobs you need and let it get on with things.
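A convenient way to express "one task per shard" is an AWS Batch array job: a single submission that fans out into N child tasks. A hedged sketch using boto3 — the job name, queue, and definition names are hypothetical, and in the full pipeline this submission is made by the Step Functions workflow rather than by hand:

```python
def build_array_job(job_name: str, job_queue: str, job_def: str,
                    num_shards: int, shard_size: int) -> dict:
    """Build submit_job kwargs for one Batch array job covering every shard."""
    return {
        "jobName": job_name,
        "jobQueue": job_queue,
        "jobDefinition": job_def,
        "arrayProperties": {"size": num_shards},  # one child task per shard
        "containerOverrides": {
            "environment": [{"name": "SHARD_SIZE", "value": str(shard_size)}]
        },
    }

if __name__ == "__main__":
    import boto3

    batch = boto3.client("batch")
    # Hypothetical queue and job definition names.
    resp = batch.submit_job(**build_array_job(
        "scenario-run", "scenario-queue", "scenario-model:1",
        num_shards=2000, shard_size=500,
    ))
    print(resp["jobId"])
```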
Stage 3 — Collate the results. Once every shard is processed, Step Functions calls a final Fargate task that stitches the individual outputs back together into one coherent dataset. This is where your aggregated risk figures, portfolio P&L distributions, or whatever your model produces come together in a single, consumable output.
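The collation step itself is mostly plumbing: find every shard's output, read it, and stitch the pieces together. A sketch assuming a hypothetical bucket and a JSON-file-per-shard key layout — a real pipeline might read the output locations from the tracking table instead of listing S3:

```python
def collate(shard_results: list[list[dict]]) -> list[dict]:
    """Flatten per-shard result lists into one combined dataset."""
    return [row for shard in shard_results for row in shard]

if __name__ == "__main__":
    import json
    import boto3

    s3 = boto3.client("s3")
    bucket, run_id = "scenario-results", "run-001"  # hypothetical names
    shards = []
    # Paginate: a large run has far more shard objects than one list call returns.
    for page in s3.get_paginator("list_objects_v2").paginate(
        Bucket=bucket, Prefix=f"{run_id}/shards/"
    ):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
            shards.append(json.loads(body))
    combined = collate(shards)
    print(len(combined))
```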
It's a classic scatter-gather pattern, and Step Functions handles the coordination beautifully — including retries, error handling, and giving you visibility into exactly which step you're on at any given moment.
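The three stages translate into a compact state machine. A heavily abbreviated sketch of the definition in Amazon States Language, expressed as a Python dict — the `.sync` service integrations make each state wait for its job to finish, and most required parameters (task definitions, network config, job queue and definition names) are omitted here:

```python
import json

state_machine = {
    "StartAt": "ShardData",
    "States": {
        "ShardData": {
            "Type": "Task",
            "Resource": "arn:aws:states:::ecs:runTask.sync",  # Fargate sharding task
            "Parameters": {"LaunchType": "FARGATE"},  # task definition etc. omitted
            "Next": "ProcessShards",
        },
        "ProcessShards": {
            "Type": "Task",
            "Resource": "arn:aws:states:::batch:submitJob.sync",  # the array job
            "Parameters": {"ArrayProperties": {"Size": 2000}},  # plus queue/definition
            "Retry": [{"ErrorEquals": ["States.ALL"], "MaxAttempts": 2}],
            "Next": "CollateResults",
        },
        "CollateResults": {
            "Type": "Task",
            "Resource": "arn:aws:states:::ecs:runTask.sync",  # Fargate collation task
            "End": True,
        },
    },
}

print(json.dumps(state_machine, indent=2))
```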
AWS Batch and Fargate: The Heavy Lifting
AWS Batch is the job scheduler. You tell it "I need 2,000 containers to run this Docker image with these parameters," and it works out how to make that happen. Fargate provides the compute — serverless containers that spin up on demand without you provisioning a single server.
Each Fargate task runs your Python model against its allocated shard of scenarios. When it's done, it writes the results and reports back. No servers to patch, no clusters to manage, and no on-call overhead for the underlying infrastructure.
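Inside the container, each task needs to work out which shard is its own. For Batch array jobs, AWS sets the `AWS_BATCH_JOB_ARRAY_INDEX` environment variable on every child task, which maps neatly onto shard numbering. A sketch — the `SHARD_SIZE` and `TOTAL_SCENARIOS` variables are assumptions about how run parameters get passed in:

```python
import os

def shard_bounds(array_index: int, shard_size: int, total: int) -> tuple[int, int]:
    """Map a Batch array child index to this task's scenario index range."""
    start = array_index * shard_size
    return start, min(start + shard_size, total)

if __name__ == "__main__":
    # AWS Batch sets this on each child of an array job.
    idx = int(os.environ["AWS_BATCH_JOB_ARRAY_INDEX"])
    size = int(os.environ.get("SHARD_SIZE", "500"))
    total = int(os.environ.get("TOTAL_SCENARIOS", "1000000"))
    start, end = shard_bounds(idx, size, total)
    print(f"task {idx}: scenarios [{start}, {end})")
    # run_model(start, end) — your model code, unchanged from the Lambda days
```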
Tracking Progress with DynamoDB
When you've got thousands of tasks running concurrently, you need to know what's going on. A DynamoDB table serves as the tracking layer — each shard gets a row recording its status (pending, running, complete, failed), the task ID processing it, and the output location.
DynamoDB is ideal here because it handles thousands of concurrent writes with ease, scales automatically, and you pay per request. It gives you a real-time view of progress: a quick query tells you that 1,847 of 2,000 shards are complete, 150 are still running, and 3 have failed and need retrying. It also stores the output object ID for each shard, linking the tracking data to the actual results.
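In code, each task flips its shard's row as it finishes, and a progress check is a scan-and-count. A sketch against a hypothetical tracking table — note that `status` is a DynamoDB reserved word, hence the attribute-name alias:

```python
from collections import Counter

def progress_summary(items: list[dict]) -> Counter:
    """Count shards by status from the tracking rows."""
    return Counter(item["status"] for item in items)

if __name__ == "__main__":
    import boto3

    table = boto3.resource("dynamodb").Table("shard-tracking")  # hypothetical table
    # Mark a shard complete and record where its output landed.
    table.update_item(
        Key={"run_id": "run-001", "shard_id": 42},
        UpdateExpression="SET #s = :s, output_key = :k",
        ExpressionAttributeNames={"#s": "status"},  # 'status' is reserved
        ExpressionAttributeValues={":s": "complete", ":k": "run-001/shards/42.json"},
    )
    # Fine at a few thousand rows; paginate the scan for bigger tables.
    items = table.scan()["Items"]
    print(progress_summary(items))
```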
Data Storage: S3 and DynamoDB Working Together
For storing actual results data — scenario outputs, intermediate calculations, final aggregated datasets — the pattern that works well is a combination of S3 and DynamoDB:
- S3 stores the blob objects — the actual data files (Parquet, CSV, JSON, or whatever format your model outputs).
- DynamoDB stores the metadata — what each object is, when it was created, which run it belongs to, where it lives in S3, and any summary statistics.
This separation keeps things clean. S3 is phenomenally cheap for storing large objects and handles virtually unlimited data. DynamoDB gives you fast, indexed lookups on the metadata without having to list S3 buckets — which, at scale, is both slow and surprisingly expensive.
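A sketch of the write path, with hypothetical bucket, key layout, and table names — a real pipeline would likely write Parquet rather than JSON, but the blob-plus-metadata split is the same:

```python
import datetime

def metadata_record(run_id: str, shard_id: int, bucket: str, key: str,
                    n_rows: int) -> dict:
    """Metadata row pointing at the S3 object that holds the actual data."""
    return {
        "run_id": run_id,
        "shard_id": shard_id,
        "s3_uri": f"s3://{bucket}/{key}",
        "created_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "row_count": n_rows,
    }

if __name__ == "__main__":
    import json
    import boto3

    results = [{"scenario": i, "pnl": 0.0} for i in range(500)]  # placeholder output
    bucket, key = "scenario-results", "run-001/shards/42.json"  # hypothetical layout
    # Blob to S3, metadata to DynamoDB — lookups never need to list the bucket.
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=json.dumps(results))
    boto3.resource("dynamodb").Table("result-metadata").put_item(
        Item=metadata_record("run-001", 42, bucket, key, len(results))
    )
```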
What About SQL?
You could absolutely use a relational database — RDS PostgreSQL or Aurora — instead of DynamoDB for the metadata layer. SQL gives you richer querying, joins, and a more familiar interface for most developers.
The trade-off is that RDS requires provisioned capacity (or Aurora Serverless, which has its own quirks around cold starts), and it won't handle thousands of concurrent writes as gracefully as DynamoDB without careful connection pooling. For pure metadata tracking in a high-concurrency batch pipeline, DynamoDB tends to be the better fit. For downstream analytics and reporting on the results, SQL is often more practical.
Many setups use both — DynamoDB during the run, then ETL the results into a SQL database for analysis afterwards. Belt and braces, as they say.
Parallelism Within Each Fargate Task
Your Fargate tasks don't have to be single-threaded. If you allocate a task with 4 vCPUs, you can use Python's multiprocessing module (or concurrent.futures.ProcessPoolExecutor) to process multiple scenarios simultaneously within each container.
This gives you two levels of parallelism: AWS Batch handles the horizontal scaling across thousands of containers, and multiprocessing handles the vertical scaling within each container. The sweet spot depends on your model — CPU-bound financial models benefit enormously from multiprocessing, while I/O-bound workloads might get more from threading or async approaches.
One word of caution: Python's Global Interpreter Lock means that threading won't give you genuine parallelism for CPU-bound work. Use multiprocessing instead. This is one of those Python quirks that catches people out at least once, so worth being aware of from the start.
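A sketch of the within-task level of parallelism, with a trivial stand-in for the actual valuation function — `ProcessPoolExecutor` sidesteps the GIL by using processes rather than threads:

```python
from concurrent.futures import ProcessPoolExecutor

def price_scenario(scenario: dict) -> float:
    """Stand-in for the CPU-bound model: value the portfolio under one scenario."""
    shock = scenario["rate_shock"]
    return 100.0 * (1 - shock)  # trivial placeholder for the real valuation

def run_shard(scenarios: list[dict], workers: int = 4) -> list[float]:
    """Fan the shard out across the task's vCPUs using separate processes."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(price_scenario, scenarios))

if __name__ == "__main__":
    # Match `workers` to the vCPUs allocated to the Fargate task.
    shard = [{"rate_shock": s / 1000} for s in range(500)]
    results = run_shard(shard, workers=4)
    print(len(results))
```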
Fargate Spot vs On-Demand: The Cost Question
AWS Fargate Spot instances run on spare capacity and cost up to 70% less than on-demand pricing. The catch? AWS can reclaim them with two minutes' notice. For batch processing, Spot is often an excellent choice. Your tasks are short-lived, idempotent (they can be safely retried if interrupted), and you've got DynamoDB tracking which shards are complete. If a Spot instance gets reclaimed mid-shard, the shard simply gets reprocessed. You might lose a few minutes of compute, but you save a small fortune over the course of a large run.
The pragmatic approach: run most of your tasks on Spot and keep a smaller on-demand allocation as a fallback for the stragglers. AWS Batch supports mixed compute environments, so you can configure this without too much fuss.
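In boto3 terms, the fallback is a job queue attached to two compute environments in priority order: Spot first, on-demand second. The environment and queue names here are hypothetical, and the two compute environments (one of type `FARGATE_SPOT`, one of type `FARGATE`) are assumed to exist already:

```python
def queue_order(spot_env: str, on_demand_env: str) -> list[dict]:
    """Prefer the Fargate Spot environment; fall back to on-demand for stragglers."""
    return [
        {"order": 1, "computeEnvironment": spot_env},
        {"order": 2, "computeEnvironment": on_demand_env},
    ]

if __name__ == "__main__":
    import boto3

    batch = boto3.client("batch")
    batch.create_job_queue(
        jobQueueName="scenario-queue",  # hypothetical name
        state="ENABLED",
        priority=1,
        computeEnvironmentOrder=queue_order(
            "scenario-spot-env", "scenario-ondemand-env"
        ),
    )
```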
Networking: The Bit Everyone Forgets
To run Fargate tasks, you need a VPC with subnets. This sounds straightforward enough, but there's a detail that trips up nearly everyone on their first large-scale run: each Fargate task consumes one IP address in the subnet.
If you're spinning up 2,000 concurrent tasks, you need at least 2,000 available IP addresses across your subnets. A /24 CIDR block gives you 251 usable addresses (AWS reserves five per subnet), which is nowhere near enough. You'll want a /20 or larger (4,091 usable addresses) to give yourself headroom. Running out of IP addresses mid-run is not a situation you want to find yourself in.
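The arithmetic is worth sanity-checking before you provision anything. Python's `ipaddress` module makes it a one-liner — the subtraction of five reflects the addresses AWS reserves in every subnet, and for simplicity this treats the capacity as a single subnet:

```python
import ipaddress
import math

AWS_RESERVED_PER_SUBNET = 5  # AWS reserves 5 addresses in each subnet

def usable_ips(cidr: str) -> int:
    """Fargate tasks a subnet of this size can hold (one ENI/IP per task)."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET

def tightest_prefix(tasks: int) -> int:
    """Longest prefix (smallest subnet) that still fits `tasks` concurrent tasks."""
    return 32 - math.ceil(math.log2(tasks + AWS_RESERVED_PER_SUBNET))

print(usable_ips("10.0.0.0/24"))  # 251 — nowhere near 2,000
print(usable_ips("10.0.0.0/20"))  # 4091 — comfortable headroom
print(tightest_prefix(2000))      # 21, i.e. a /21 is the bare minimum
```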
Plan your VPC CIDR ranges and subnet sizing carefully from the start. Expanding them later is possible but tedious, and it's much easier to get right the first time.
Spreading Across Availability Zones and Regions
To access enough Fargate Spot capacity for large runs, you'll want subnets spread across multiple Availability Zones within a region. AWS Batch can distribute tasks across AZs automatically, which both improves Spot availability and gives you resilience if one AZ has issues.
For truly massive scale, you can even look at running across multiple AWS regions. This adds meaningful complexity — you'll need to replicate your container images to ECR in each region, coordinate job submissions across regions, and aggregate results back to a central location — but it gives you access to a substantially larger pool of Spot capacity. Most setups won't need this, but it's worth knowing the option exists when scenario counts grow beyond what a single region can comfortably support.
The Results: From 150 to Over a Million
This architecture has been used to scale a financial model from running roughly 150 scenarios inside a single Lambda function to processing over a million scenarios in a single batch run. The model code itself barely changed — the same Python, packaged into a Docker container, running in parallel across thousands of Fargate tasks.
The scaling is largely linear: double the scenarios, double the tasks, roughly double the cost, same wall-clock time. That's the beauty of embarrassingly parallel workloads — they scale predictably and efficiently.
Wrapping Up
If you've got a Python financial model that's outgrown Lambda and you need to process scenarios at serious scale, the combination of Step Functions, AWS Batch, and Fargate gives you a fully serverless architecture that can handle millions of scenarios without you managing a single server.
The key ingredients: API Gateway as the entry point, Step Functions for orchestration, AWS Batch and Fargate for elastic compute, DynamoDB for progress tracking and metadata, S3 for data storage, and careful VPC planning so you don't run out of IP addresses at the worst possible moment.
It's not trivial to set up — there are a good number of moving parts, and you'll spend more time than you'd like reading AWS documentation about subnet CIDR ranges. But once it's running, it's remarkably robust, cost-effective with Spot pricing, and scales well beyond what any single machine could handle.