
Big Data Pipelines in Finance

How financial firms process massive datasets — batch and streaming architectures, ETL patterns, data lakes, and the tools that power modern data infrastructure.

The Data Challenge in Finance

Financial firms are drowning in data. Market data feeds deliver millions of events per second. Trade records accumulate by the millions daily. Alternative data sources — satellite imagery, social media sentiment, web traffic — add another layer of volume. Regulatory reporting requires comprehensive historical data access.

Processing this data reliably, at scale, and on time is what data engineering is about. Get it right and your quant researchers have clean, timely data to build models on. Get it wrong and everyone downstream suffers: models train on stale data, reports are late, and risk calculations are incomplete.


Batch vs Streaming

Data pipelines fall into two categories, and most financial systems use both.

Batch Processing

Process large volumes of data at scheduled intervals. "Every night at midnight, calculate the end-of-day positions and P&L for every account."

# Simplified batch pipeline
def daily_eod_pipeline(date: str):
    # Extract
    trades = read_trades_from_database(date)
    market_data = read_closing_prices(date)
    positions = read_current_positions()

    # Transform
    enriched_trades = enrich_with_market_data(trades, market_data)
    daily_pnl = calculate_pnl(enriched_trades, positions, market_data)
    updated_positions = update_positions(positions, trades)

    # Load
    write_to_data_warehouse(daily_pnl)
    write_to_reporting_database(updated_positions)
    generate_regulatory_report(daily_pnl, updated_positions)

Batch is appropriate for: end-of-day processing, historical analysis, regulatory reporting, model training.

Stream Processing

Process data as it arrives, in real time or near real time. "As each trade executes, update the position, recalculate risk, and check compliance."

# Simplified streaming pipeline using Kafka
from kafka import KafkaConsumer

consumer = KafkaConsumer('trades', bootstrap_servers='kafka:9092')

for message in consumer:
    trade = deserialise(message.value)
    update_position(trade)
    updated_risk = recalculate_risk(trade.symbol)
    if updated_risk > risk_limit:
        send_alert(trade.symbol, updated_risk)
    publish_to_dashboard(trade, updated_risk)

Streaming is appropriate for: real-time risk monitoring, live P&L, trade surveillance, market data processing.


The Modern Data Stack

Apache Kafka: The Message Bus

Kafka is the backbone of most streaming architectures in finance. It is a distributed event log that behaves like a message queue, handling millions of messages per second with built-in durability and replayability.

Producers publish messages to topics. Consumers subscribe to topics and process messages. Messages are persisted, so if a consumer fails, it can restart and pick up where it left off.

Key concepts:

  • Topics — named channels (trades, market-data, risk-alerts)
  • Partitions — parallelism unit within a topic
  • Consumer groups — multiple consumers sharing the load
  • Retention — messages kept for configurable time (days, weeks, forever)
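
For completeness, here is what the producer side of that pattern can look like, using the same kafka-python client as the consumer example above. The broker address, topic name, and trade fields are illustrative assumptions, not a prescribed setup.

import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers='kafka:9092',
    # Serialise dicts to JSON bytes before publishing
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
)

trade = {"symbol": "AAPL", "price": 185.20, "qty": 100, "side": "BUY"}  # illustrative event

# Keying by symbol routes all events for one symbol to the same partition,
# which preserves per-symbol ordering for consumers in a consumer group
producer.send('trades', key=trade["symbol"].encode('utf-8'), value=trade)
producer.flush()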

Apache Spark: Batch and Micro-Batch

Spark processes large datasets in parallel across a cluster. It is the standard for batch processing at scale and can handle streaming via Spark Structured Streaming.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("DailyPnL").getOrCreate()

# Read trades from data lake
trades = spark.read.parquet("s3://data-lake/trades/date=2024-01-15/")

# Aggregate P&L by account and symbol
pnl = trades.groupBy("account_id", "symbol").agg(
    F.sum((F.col("exit_price") - F.col("entry_price")) * F.col("quantity")).alias("pnl"),
    F.count("*").alias("trade_count"),
    F.sum("quantity").alias("total_volume"),
)

# Write results to data warehouse
pnl.write.mode("overwrite").parquet("s3://warehouse/daily-pnl/date=2024-01-15/")
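
The streaming side uses the same DataFrame API. The following is a minimal Structured Streaming sketch, assuming trade events arrive as JSON on the 'trades' topic from the Kafka example with the fields shown in the schema, and that the spark-sql-kafka connector is on the classpath. It maintains a running P&L per symbol and prints each update.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("LivePnL").getOrCreate()

# Assumed shape of a trade event published to the 'trades' topic
schema = StructType([
    StructField("symbol", StringType()),
    StructField("entry_price", DoubleType()),
    StructField("exit_price", DoubleType()),
    StructField("quantity", DoubleType()),
])

# Read the topic as an unbounded stream (requires the spark-sql-kafka connector)
trades = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "trades")
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("t"))
    .select("t.*")
)

# Running P&L per symbol, updated as each micro-batch arrives
live_pnl = trades.groupBy("symbol").agg(
    F.sum((F.col("exit_price") - F.col("entry_price")) * F.col("quantity")).alias("pnl")
)

# Print each updated result; a real pipeline would target a dashboard or store
query = (
    live_pnl.writeStream
    .outputMode("complete")
    .format("console")
    .option("checkpointLocation", "/tmp/checkpoints/live-pnl")
    .start()
)
query.awaitTermination()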

Apache Airflow: Orchestration

Airflow schedules and monitors pipelines. You define workflows as directed acyclic graphs (DAGs); Airflow ensures tasks run in the right order and handles retries and failures.

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

dag = DAG(
    'daily_eod_pipeline',
    schedule_interval='0 22 * * 1-5',  # 10 PM on weekdays
    start_date=datetime(2024, 1, 1),
)

extract = PythonOperator(task_id='extract_trades', python_callable=extract_trades, dag=dag)
transform = PythonOperator(task_id='calculate_pnl', python_callable=calculate_pnl, dag=dag)
load = PythonOperator(task_id='write_reports', python_callable=write_reports, dag=dag)

extract >> transform >> load  # Define execution order

The Data Lake Pattern

A data lake stores raw data in its original format — no upfront schema required. Data is structured and cleaned when it is read (schema-on-read) rather than when it is written (schema-on-write).

Data Lake (S3 / Azure Blob)
├── raw/                    # Raw data as received
│   ├── trades/
│   ├── market-data/
│   └── reference-data/
├── processed/              # Cleaned and normalised
│   ├── trades/
│   └── market-data/
├── curated/                # Business-ready datasets
│   ├── daily-pnl/
│   ├── positions/
│   └── risk-reports/

The lake stores data in Parquet format for analytical workloads, partitioned by date for efficient querying. Cloud object storage (S3) provides virtually unlimited, cheap, durable storage.
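
As a small illustration of the raw-to-processed hop, the sketch below uses pandas with the pyarrow engine and local paths standing in for S3; the file name, column names, and timestamp field are assumptions.

import pandas as pd

# Raw file exactly as received from the upstream system (assumed to be CSV)
raw = pd.read_csv("data-lake/raw/trades/2024-01-15.csv")

# Apply structure on the way to processed/: with schema-on-read, the raw file
# keeps its original shape and cleaning happens at this step
processed = raw.rename(columns=str.lower)
processed["trade_date"] = pd.to_datetime(processed["timestamp"]).dt.date

# Write date-partitioned Parquet so queries can prune to the dates they need
processed.to_parquet(
    "data-lake/processed/trades/",
    engine="pyarrow",
    partition_cols=["trade_date"],
)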


Data Quality

Bad data in, bad decisions out. Data quality checks are not optional in finance:

from datetime import datetime

def validate_trade_data(df):
    checks = []

    # Completeness
    null_counts = df.isnull().sum()
    checks.append(("No nulls in required fields",
                   null_counts[["symbol", "price", "qty"]].sum() == 0))

    # Range checks
    checks.append(("All prices positive", (df["price"] > 0).all()))
    checks.append(("All quantities non-zero", (df["qty"] != 0).all()))

    # Consistency
    checks.append(("Sides valid", df["side"].isin(["BUY", "SELL"]).all()))

    # Freshness
    latest = df["timestamp"].max()
    checks.append(("Data is recent", (datetime.now() - latest).total_seconds() < 300))

    for name, passed in checks:
        status = "PASS" if passed else "FAIL"
        print(f"  [{status}] {name}")

    return all(passed for _, passed in checks)
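
One way to wire such checks into a pipeline step, reusing the placeholder extract and load helpers from the batch example and assuming they work with a pandas DataFrame, is to fail loudly so bad data never reaches the warehouse:

# Hypothetical wiring: the helpers are the placeholders from the batch pipeline above
trades = read_trades_from_database("2024-01-15")

if not validate_trade_data(trades):
    # Aborting here keeps stale or malformed trades out of downstream reports
    raise ValueError("Trade data failed validation; aborting load")

write_to_data_warehouse(trades)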

Getting Started

You do not need Kafka and Spark on day one. Start with:

  1. Simple Python scripts reading files from a directory
  2. Pandas for transformation
  3. A database for storage
  4. Cron jobs (or Airflow) for scheduling (a minimal sketch of this setup follows)
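
To make that concrete, here is a minimal starter pipeline; the file pattern, column names, and table name are illustrative. Scheduling it with a cron entry such as 0 22 * * 1-5 python daily_pipeline.py mirrors the Airflow schedule used earlier.

import glob
import sqlite3

import pandas as pd

# Extract: pick up every trade file dropped into the incoming directory
frames = [pd.read_csv(path) for path in glob.glob("incoming/trades_*.csv")]
trades = pd.concat(frames, ignore_index=True)

# Transform: P&L per account and symbol
trades["pnl"] = (trades["exit_price"] - trades["entry_price"]) * trades["quantity"]
daily_pnl = trades.groupby(["account_id", "symbol"], as_index=False)["pnl"].sum()

# Load: append the day's results to a local SQLite database
with sqlite3.connect("warehouse.db") as conn:
    daily_pnl.to_sql("daily_pnl", conn, if_exists="append", index=False)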

Scale up as your data volumes demand it. The architecture patterns above are well-established and documented — the challenge is usually understanding when each tool is appropriate, not how to use it.
