The Data Challenge in Finance
Financial firms are drowning in data. Market data feeds deliver millions of events per second. Trade records accumulate by the millions daily. Alternative data sources — satellite imagery, social media sentiment, web traffic — add another layer of volume. Regulatory reporting requires comprehensive historical data access.
Processing this data reliably, at scale, and on time is what data engineering is about. Get it right and your quant researchers have clean, timely data to build models on. Get it wrong and everyone downstream suffers: models train on stale data, reports are late, and risk calculations are incomplete.
Batch vs Streaming
Data pipelines fall into two categories, and most financial systems use both.
Batch Processing
Process large volumes of data at scheduled intervals. "Every night at midnight, calculate the end-of-day positions and P&L for every account."
```python
# Simplified batch pipeline
def daily_eod_pipeline(date: str):
    # Extract
    trades = read_trades_from_database(date)
    market_data = read_closing_prices(date)
    positions = read_current_positions()

    # Transform
    enriched_trades = enrich_with_market_data(trades, market_data)
    daily_pnl = calculate_pnl(enriched_trades, positions, market_data)
    updated_positions = update_positions(positions, trades)

    # Load
    write_to_data_warehouse(daily_pnl)
    write_to_reporting_database(updated_positions)
    generate_regulatory_report(daily_pnl, updated_positions)
```
Batch is appropriate for: end-of-day processing, historical analysis, regulatory reporting, model training.
Stream Processing
Process data as it arrives, in real time or near real time. "As each trade executes, update the position, recalculate risk, and check compliance."
```python
# Simplified streaming pipeline using Kafka
from kafka import KafkaConsumer

consumer = KafkaConsumer('trades', bootstrap_servers='kafka:9092')

for message in consumer:
    trade = deserialise(message.value)
    update_position(trade)
    updated_risk = recalculate_risk(trade.symbol)
    if updated_risk > risk_limit:
        send_alert(trade.symbol, updated_risk)
    publish_to_dashboard(trade, updated_risk)
```
Stream is appropriate for: real-time risk monitoring, live P&L, trade surveillance, market data processing.
The Modern Data Stack
Apache Kafka: The Message Bus
Kafka is the backbone of most streaming architectures in finance. It is a distributed, durable event log (commonly used as a message queue) that can handle millions of messages per second, with persistence and replayability built in.
Producers publish messages to topics. Consumers subscribe to topics and process messages. Messages are persisted, so if a consumer fails, it can restart and pick up where it left off.
Key concepts:
- Topics — named channels (`trades`, `market-data`, `risk-alerts`)
- Partitions — the unit of parallelism within a topic
- Consumer groups — multiple consumers sharing the load
- Retention — messages kept for configurable time (days, weeks, forever)
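To make the producer side concrete, here is a minimal sketch using the kafka-python client. The broker address, topic name, and trade fields are illustrative assumptions, not anything prescribed above.

```python
# Minimal producer sketch (broker address, topic, and fields are assumed)
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers='kafka:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
)

trade = {"symbol": "AAPL", "side": "BUY", "qty": 100, "price": 189.42}

# Keying by symbol routes every message for one symbol to the same partition,
# which preserves per-symbol ordering for downstream consumers.
producer.send('trades', key=trade["symbol"].encode('utf-8'), value=trade)
producer.flush()
```

Keying by symbol is a common choice when downstream consumers, like the risk check above, care about the order of events per instrument.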
Apache Spark: Batch and Micro-Batch
Spark processes large datasets in parallel across a cluster. It is the standard for batch processing at scale and can handle streaming via Spark Structured Streaming.
```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("DailyPnL").getOrCreate()

# Read trades from data lake
trades = spark.read.parquet("s3://data-lake/trades/date=2024-01-15/")

# Aggregate P&L by account and symbol
pnl = trades.groupBy("account_id", "symbol").agg(
    F.sum((F.col("exit_price") - F.col("entry_price")) * F.col("quantity")).alias("pnl"),
    F.count("*").alias("trade_count"),
    F.sum("quantity").alias("total_volume"),
)

# Write results to data warehouse
pnl.write.mode("overwrite").parquet("s3://warehouse/daily-pnl/date=2024-01-15/")
```
Apache Airflow: Orchestration
Airflow schedules and monitors pipelines. Workflows are defined as directed acyclic graphs (DAGs), which ensures tasks run in the right order, with retries and failure handling built in.
```python
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

dag = DAG(
    'daily_eod_pipeline',
    schedule_interval='0 22 * * 1-5',  # 10 PM on weekdays
    start_date=datetime(2024, 1, 1),
)

extract = PythonOperator(task_id='extract_trades', python_callable=extract_trades, dag=dag)
transform = PythonOperator(task_id='calculate_pnl', python_callable=calculate_pnl, dag=dag)
load = PythonOperator(task_id='write_reports', python_callable=write_reports, dag=dag)

extract >> transform >> load  # Define execution order
```
The Data Lake Pattern
A data lake stores raw data in its original format — no upfront schema required. Data is structured and cleaned when it is read (schema-on-read) rather than when it is written (schema-on-write).
```
Data Lake (S3 / Azure Blob)
├── raw/            # Raw data as received
│   ├── trades/
│   ├── market-data/
│   └── reference-data/
├── processed/      # Cleaned and normalised
│   ├── trades/
│   └── market-data/
└── curated/        # Business-ready datasets
    ├── daily-pnl/
    ├── positions/
    └── risk-reports/
```
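The raw/ layer holds data exactly as it arrived; structure is imposed only when it is read. A minimal schema-on-read sketch with PySpark, assuming the raw trades land as JSON (the path and field names are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, LongType

spark = SparkSession.builder.appName("SchemaOnRead").getOrCreate()

# The schema lives in the reading code, not in the storage layer, so the
# same raw files can be re-read later with a different schema.
trade_schema = StructType([
    StructField("symbol", StringType()),
    StructField("price", DoubleType()),
    StructField("qty", LongType()),
    StructField("side", StringType()),
])

raw_trades = spark.read.schema(trade_schema).json("s3://data-lake/raw/trades/")
```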
The lake stores data in Parquet format for analytical workloads, partitioned by date for efficient querying. Cloud object storage (S3) provides virtually unlimited, cheap, durable storage.
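Because partitions follow the Hive-style `date=YYYY-MM-DD` convention used in the examples above, a reader can skip every partition except the day it needs. A sketch with PyArrow, where the bucket and dataset paths are assumptions:

```python
import pyarrow.dataset as ds

# Hive-style partitioning (date=YYYY-MM-DD) lets the scan prune
# every partition except the requested day.
dataset = ds.dataset(
    "s3://warehouse/daily-pnl/",
    format="parquet",
    partitioning="hive",
)
daily_pnl = dataset.to_table(filter=ds.field("date") == "2024-01-15").to_pandas()
```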
Data Quality
Bad data in, bad decisions out. Data quality checks are not optional in finance:
```python
from datetime import datetime

def validate_trade_data(df):
    checks = []

    # Completeness
    null_counts = df.isnull().sum()
    checks.append(("No nulls in required fields",
                   null_counts[["symbol", "price", "qty"]].sum() == 0))

    # Range checks
    checks.append(("All prices positive", (df["price"] > 0).all()))
    checks.append(("All quantities non-zero", (df["qty"] != 0).all()))

    # Consistency
    checks.append(("Sides valid", df["side"].isin(["BUY", "SELL"]).all()))

    # Freshness
    latest = df["timestamp"].max()
    checks.append(("Data is recent",
                   (datetime.now() - latest).total_seconds() < 300))

    for name, passed in checks:
        status = "PASS" if passed else "FAIL"
        print(f"  [{status}] {name}")

    return all(passed for _, passed in checks)
```
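In practice these checks sit at the boundary of a pipeline and should fail the run before anything is loaded. A hypothetical guard step, where the input path is an assumption:

```python
import pandas as pd

# Hypothetical guard in the daily pipeline: validate before loading.
trades_df = pd.read_parquet("s3://data-lake/processed/trades/date=2024-01-15/")
if not validate_trade_data(trades_df):
    raise ValueError("Trade data failed validation; aborting the load step")
```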
Getting Started
You do not need Kafka and Spark on day one. Start with:
- Simple Python scripts reading files from a directory
- Pandas for transformation
- A database for storage
- Cron jobs (or Airflow) for scheduling
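All four of those pieces fit in a few dozen lines. A minimal sketch, assuming trade CSVs land in an incoming/ directory and a local SQLite database (every name here is illustrative):

```python
# Minimal starter pipeline: files in, pandas transform, database out.
import glob
import sqlite3
import pandas as pd

def run_once() -> None:
    conn = sqlite3.connect("pipeline.db")
    for path in glob.glob("incoming/*.csv"):
        df = pd.read_csv(path, parse_dates=["timestamp"])
        # Transform: keep valid rows and add a notional column
        df = df[df["price"] > 0].assign(notional=lambda d: d["price"] * d["qty"])
        # Load: append into the database
        df.to_sql("trades", conn, if_exists="append", index=False)
    conn.close()

if __name__ == "__main__":
    run_once()  # schedule with cron, e.g. "0 22 * * 1-5"
```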
Scale up as your data volumes demand it. The architecture patterns above are well-established and documented — the challenge is usually understanding when each tool is appropriate, not how to use it.