Why Format Choice Matters
In finance, you move a lot of data around. Market data feeds, trade records, risk reports, model outputs, regulatory filings — it all needs to be stored somewhere and read back efficiently. The format you choose directly impacts storage costs, read and write speed, data integrity, and pipeline complexity.
Choosing the wrong format for a large dataset can turn a 30-second pipeline step into a 30-minute one. Getting this right is one of those quiet decisions that compounds over time.
The Common Formats
CSV: The Universal Default
Everyone knows CSV. It is human-readable, universally supported, and dead simple. You can open it in Excel, edit it in a text editor, and every programming language can parse it.
```
date,symbol,price,volume
2024-01-15,AAPL,185.60,52340000
2024-01-15,GOOGL,140.25,28100000
```
Strengths: Universal compatibility, human readable, great for small datasets and debugging.
Weaknesses: No type information (is "100" a number or a string?), no compression, slow to parse at scale, ambiguous handling of special characters and encodings, no schema enforcement.
For loading into Pandas, CSV works fine for files under ~100MB. Beyond that, you should be thinking about better options.
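Because CSV carries no type information, it pays to be explicit when loading it. A minimal sketch of a defensive read, assuming a hypothetical trades.csv with the columns shown above:

```python
import pandas as pd

# Without explicit dtypes, pandas guesses: zero-padded identifiers become
# integers and date strings stay as plain objects. Pinning the types up
# front avoids those surprises. File and column names are illustrative.
df = pd.read_csv(
    "trades.csv",
    dtype={"symbol": "string", "volume": "int64"},
    parse_dates=["date"],
)
print(df.dtypes)
```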
JSON: Structured and Flexible
{ "date": "2024-01-15", "symbol": "AAPL", "price": 185.60, "volume": 52340000, "metadata": { "exchange": "NASDAQ", "currency": "USD" } }
Strengths: Handles nested and semi-structured data well, self-describing, standard for REST APIs, good type support (numbers vs strings vs booleans).
Weaknesses: Verbose (lots of repeated keys), slow to parse large files, not efficient for purely tabular data.
JSON is the right choice for API communication, configuration files, and semi-structured data. It is the wrong choice for storing a billion trade records.
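When nested JSON records do need to end up in a table for analysis, they can be flattened first. A short sketch with invented records shaped like the example above:

```python
import pandas as pd

# Hypothetical API-style records with a nested metadata object.
records = [
    {"date": "2024-01-15", "symbol": "AAPL", "price": 185.60,
     "metadata": {"exchange": "NASDAQ", "currency": "USD"}},
    {"date": "2024-01-15", "symbol": "GOOGL", "price": 140.25,
     "metadata": {"exchange": "NASDAQ", "currency": "USD"}},
]

# json_normalize flattens the nested fields into columns such as
# metadata.exchange and metadata.currency.
df = pd.json_normalize(records)
print(df.columns.tolist())
```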
Parquet: The Modern Standard for Analytics
Parquet is a columnar binary format that has become the de facto standard for data engineering. It is compressed, fast, and preserves full type information.
```python
import pandas as pd

# Write: typically 5-10x smaller than equivalent CSV
df.to_parquet("trades.parquet", compression="snappy")

# Read: 5-10x faster than CSV
df = pd.read_parquet("trades.parquet")

# Read only specific columns — a major advantage
prices = pd.read_parquet("trades.parquet", columns=["symbol", "price", "date"])
```
Strengths: Excellent compression, very fast reads (especially column-selective reads), full type preservation, schema embedded in the file, widely supported across the data ecosystem.
Weaknesses: Not human-readable, requires libraries to work with, slight overhead for very small files.
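One pattern worth knowing: a Parquet dataset can be partitioned by a column, so queries for a single partition only touch that partition's files. A sketch using a small invented frame, assuming pyarrow is installed as the Parquet engine:

```python
import pandas as pd

# Small hypothetical frame standing in for a few days of trades.
df = pd.DataFrame({
    "date": ["2024-01-15", "2024-01-15", "2024-01-16"],
    "symbol": ["AAPL", "GOOGL", "AAPL"],
    "price": [185.60, 140.25, 186.10],
    "volume": [52_340_000, 28_100_000, 49_900_000],
})

# Partitioning by date writes one subdirectory per day, so a query for
# a single day only has to read that day's files.
df.to_parquet("trades_by_date", partition_cols=["date"], compression="snappy")

# Reading the directory back reassembles all partitions into one frame.
daily = pd.read_parquet("trades_by_date", columns=["symbol", "price"])
print(daily)
```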
Other Formats Worth Knowing
Avro — row-oriented binary format with schema evolution support. Popular in streaming systems like Kafka.
ORC — similar to Parquet, common in the Hadoop ecosystem.
Protocol Buffers / FlatBuffers — extremely fast serialisation for inter-service communication, common in low-latency systems.
Columnar vs Row Storage
This distinction is important and applies to both file formats and databases.
Row storage (CSV, JSON, traditional databases) keeps all fields for a record together. Reading one row is fast. Reading one column across millions of rows is slow because you read everything.
Columnar storage (Parquet, ORC, ClickHouse) keeps all values for one column together. Reading specific columns is extremely fast. Compression is better because similar values are adjacent.
For analytics — "give me the average price across 10 million trades" — columnar is dramatically faster because it only touches the price column, ignoring the dozens of other fields. This same principle applies at the database level: row-oriented databases like PostgreSQL versus columnar engines like ClickHouse or Redshift. Your database design choices should factor this in.
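A rough way to see the difference yourself is to time a single-column read from the same data stored as CSV and as Parquet. The sketch below uses synthetic data; the exact speedup depends on your machine and dataset, but the column-selective Parquet read is consistently much faster:

```python
import time

import numpy as np
import pandas as pd

# Synthetic wide table standing in for a trades dataset with many fields.
n = 1_000_000
df = pd.DataFrame({f"col_{i}": np.random.rand(n) for i in range(20)})
df["price"] = np.random.rand(n) * 200

df.to_csv("wide.csv", index=False)
df.to_parquet("wide.parquet")

# Row-oriented: the CSV parser still scans every field of every row,
# even though only one column is kept.
t0 = time.perf_counter()
csv_mean = pd.read_csv("wide.csv", usecols=["price"])["price"].mean()
print(f"CSV:     {time.perf_counter() - t0:.2f}s")

# Columnar: Parquet jumps straight to the price column's data pages.
t0 = time.perf_counter()
pq_mean = pd.read_parquet("wide.parquet", columns=["price"])["price"].mean()
print(f"Parquet: {time.perf_counter() - t0:.2f}s")
```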
Practical Recommendations
| Use Case | Recommended Format |
|---|---|
| Small datasets, prototyping | CSV |
| API requests and responses | JSON |
| Data pipeline storage | Parquet |
| Streaming events | Avro or JSON |
| Inter-service communication | Protocol Buffers |
| High-frequency market data | Binary formats or TSDB |
The general rule: if data will be written once and read many times (which is most analytical data in finance), Parquet is almost always the right choice. The compression saves storage costs, the columnar layout speeds up queries, and the type information prevents subtle bugs.
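In practice this often means a one-off conversion step at the edge of the pipeline. A minimal sketch, with hypothetical directory and column names:

```python
from pathlib import Path

import pandas as pd

# Convert a directory of daily CSV extracts to Parquet so every later
# read benefits from compression and column-selective access.
for csv_path in Path("raw_csv").glob("*.csv"):
    df = pd.read_csv(csv_path, parse_dates=["date"])
    df.to_parquet(csv_path.with_suffix(".parquet"), compression="snappy")
```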
How you build pipelines around these formats and how you load them into analysis tools determines whether your data workflows are fast or frustrating. It is one of those decisions worth getting right early.