Polars is 8.7x faster than pandas. DuckDB is 9.4x faster. Both handle larger-than-RAM data. Here's when to use each — with benchmarks.
I ran a GroupBy on 100 million rows last week. Pandas took 107 seconds, then crashed with an out-of-memory error. Polars finished in 11 seconds. DuckDB finished in 9 seconds — and it never loaded the full dataset into RAM. Same data, same machine, three completely different experiences.
This isn't a contrived benchmark. It's the daily reality for anyone working with datasets that outgrew "fits in a Jupyter notebook" territory. And in 2026, the tooling has finally caught up to the problem.
Let's get the numbers out of the way first.
| Metric | Pandas | Polars | DuckDB |
|---|---|---|---|
| GitHub Stars | 45,000+ | 37,600+ | 30,000+ |
| Monthly PyPI Downloads | 200M+ | Growing rapidly | 20M+ |
| Language | Python (C under the hood) | Rust | C++ |
| Latest Version (April 2026) | 3.0 | 1.x | 1.5 |
| API Style | DataFrame | DataFrame + SQL | SQL-first + DataFrame |
| Parallelism | Single-threaded | Multi-threaded by default | Multi-threaded by default |
| Larger-than-RAM | No | Yes (streaming mode) | Yes (out-of-core) |
| Arrow Native | Partial (3.0) | Yes | Yes |
| Funding | NumFOCUS nonprofit | $25M from Accel | DuckDB Foundation nonprofit |
Three tools. Three different architectures. Three different philosophies.
Most benchmark articles test trivial operations on small datasets. That's not useful. Here's what happens when you push these tools on operations that reflect real data engineering work.
This is the bread and butter of analytics — "give me the sum of sales by region by quarter."
| Tool | Time | Memory Usage | Notes |
|---|---|---|---|
| Pandas | 100s+ or OOM crash | Entire dataset in RAM | Single-threaded |
| Polars | Under 30s | 30-60% less than Pandas | All cores used |
| DuckDB | Under 30s | Lowest (spill-to-disk) | Vectorized execution |
| Operation | Pandas | Polars | DuckDB |
|---|---|---|---|
| CSV Reading (large file) | Baseline | 7.7x faster | ~5x faster |
| Join (two large tables) | Baseline | 5x faster | ~4x faster |
| Overall vs Pandas | 1x | 8.7x faster | 9.4x faster |
This is where things get interesting. Polars showed 30-60% lower peak memory than Pandas on large joins. But DuckDB? DuckDB uses the least memory of all three because it can spill intermediate results to disk automatically. You never hit an OOM error — it just gets slower.
Polars achieves its memory efficiency differently: through lazy evaluation. When you build a Polars query lazily, it optimizes the execution plan before running anything. It pushes filters down, eliminates unnecessary columns, and can process data in streaming batches when you call collect(engine="streaming").
Pandas loads your entire dataset into memory as a collection of NumPy arrays (or, since version 3.0, optionally as Arrow arrays). Every operation creates a copy of the data (though 3.0's copy-on-write default helps here). It's single-threaded — your 16-core machine uses one core for .groupby().
Pandas 3.0, released January 21, 2026, made real improvements:
- Copy-on-write is now the default, which eliminates SettingWithCopyWarning and reduces unnecessary memory copies.
- Arrow-backed strings are now the default, making string operations like str.contains() run 5-10x faster.

These are genuine improvements. But they don't change the fundamental architecture: pandas is still single-threaded, still loads everything into memory, and still copies data more than it needs to.
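Copy-on-write is easy to see in a few lines. A minimal sketch with made-up data (on pandas 2.x the behavior is opt-in via an option; on 3.0 it's simply the default):

```python
import pandas as pd

# On pandas 2.x, copy-on-write must be switched on explicitly; on 3.0 it is
# the default and the option may no longer exist, hence the try/except.
try:
    pd.options.mode.copy_on_write = True
except Exception:
    pass

df = pd.DataFrame({"city": ["NY", "LA", "NY"], "sales": [10, 20, 30]})

# Under copy-on-write, a filtered frame behaves like an independent copy:
subset = df[df["city"] == "NY"]
subset["sales"] = 0  # no SettingWithCopyWarning, parent untouched

print(df["sales"].tolist())  # [10, 20, 30]: the parent is unchanged
```

The payoff is predictability: writing to a derived frame never silently mutates (or fails to mutate) its parent.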
Polars is written in Rust and exposed to Python via PyO3. It uses Apache Arrow's columnar memory format natively. Every operation parallelizes across all available CPU cores automatically.
The key insight in Polars is its lazy evaluation engine. Instead of executing operations immediately (like pandas), you build a query plan:
```python
import polars as pl

# This builds a plan — nothing executes yet
result = (
    pl.scan_parquet("sales_data/*.parquet")
    .filter(pl.col("year") >= 2024)
    .group_by("region", "quarter")
    .agg(pl.col("revenue").sum())
    .sort("revenue", descending=True)
    .collect()  # NOW it executes, with optimizations applied
)
```
Before execution, Polars optimizes the plan: it pushes the year filter down to the file scan (so it never reads rows it doesn't need), projects only the columns used in the query, and parallelizes the group-by across cores.
For datasets larger than RAM, you swap .collect() for .collect(engine="streaming"), and Polars processes the data in batches without loading everything at once. There's a documented case of processing a 31GB CSV file on a machine with far less RAM.
DuckDB is not a DataFrame library. It's an embedded analytical database — "SQLite for analytics." It runs inside your Python process (no server, no network calls) and executes SQL queries using a vectorized, columnar engine.
```python
import duckdb

# Query Parquet files directly — no loading step
result = duckdb.sql("""
    SELECT region, quarter, SUM(revenue) AS total_revenue
    FROM 'sales_data/*.parquet'
    WHERE year >= 2024
    GROUP BY region, quarter
    ORDER BY total_revenue DESC
""").fetchdf()  # Returns a pandas DataFrame
```
DuckDB's magic is that it queries files directly — Parquet, CSV, JSON, even remote files on S3 — without an explicit loading step. Its out-of-core processing means it automatically spills data to disk when memory fills up. You don't configure this. It just works.
Version 1.5, released March 2026, brought further improvements. The 1.4 LTS release added AES-256 encryption at rest — a requirement for healthcare, finance, and legal use cases that previously forced teams onto heavier solutions.
Here's what most comparison articles miss entirely: you don't have to choose just one.
All three tools now speak Apache Arrow, which means data can move between them with zero serialization cost:
```python
import duckdb
import polars as pl

# Start with DuckDB for heavy SQL aggregation
heavy_result = duckdb.sql("""
    SELECT customer_id, SUM(amount) AS total
    FROM 'transactions/*.parquet'
    GROUP BY customer_id
    HAVING total > 10000
""")

# Convert to Polars for complex transformations — zero copy via Arrow
polars_df = heavy_result.pl()

# Apply Polars-specific operations
enriched = (
    polars_df
    .with_columns(
        pl.col("total").rank().alias("rank"),
        pl.col("total").log().alias("log_total"),
    )
    .filter(pl.col("rank") <= 100)
)

# Convert to pandas for sklearn or visualization — Arrow-backed
pandas_df = enriched.to_pandas()
```
The .pl() call converts a DuckDB result to a Polars DataFrame via Arrow with zero copy. The .to_pandas() call from Polars is also Arrow-backed. No serialization. No data duplication. The same memory buffers get reused.
DuckDB can also query Polars DataFrames directly:
```python
import duckdb
import polars as pl

# Create a Polars DataFrame
df = pl.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "score": [95, 87, 92],
})

# Query it with SQL — zero copy, no data movement
result = duckdb.sql("SELECT * FROM df WHERE score > 90")
```
This is the real power move in 2026: use DuckDB for SQL-heavy work, Polars for DataFrame transformations, and pandas for the last mile (visualization, sklearn integration). Arrow makes the handoffs free.
No. Polars has a fundamentally different execution model. Lazy evaluation, query optimization, and streaming are not speed improvements on the same approach — they're a different approach. You can't add lazy evaluation to pandas with a library. It's architectural.
DuckDB runs in-process with zero infrastructure. There's no server. No daemon. No port. You import duckdb and write SQL. Calling it a "database" and comparing it to PostgreSQL misses the point. It competes with pandas and Polars, not with Postgres.
Pandas has 200 million monthly downloads. Every data science tutorial, every kaggle notebook, every sklearn example uses pandas. The ecosystem integration is unmatched — seaborn, plotly, matplotlib, scikit-learn, and hundreds of other libraries expect pandas DataFrames. Pandas 3.0 with copy-on-write and Arrow strings is genuinely better. Pandas isn't dead. It's the last mile.
Speed isn't the only axis. DuckDB's SQL interface matters if your team thinks in SQL. Polars' type safety matters if you're building production pipelines. Pandas' ecosystem matters if you need to plug into sklearn or create a quick visualization. The right tool depends on the job, not the benchmark.
Here's how I'd actually decide:
| Your Situation | Use This | Why |
|---|---|---|
| Quick EDA, small data (under 1GB) | Pandas | Fastest to write, best ecosystem |
| Production data pipeline | Polars | Type-safe, lazy execution, testable |
| SQL-heavy analytics | DuckDB | Native SQL, zero infrastructure |
| Data larger than RAM | DuckDB or Polars (streaming) | Both handle it; DuckDB is simpler |
| Team knows SQL, not Python | DuckDB | They can be productive immediately |
| Team knows pandas, wants speed | Polars | Easier migration than learning SQL |
| ML preprocessing | Pandas (last mile) + Polars or DuckDB | sklearn expects pandas |
| Ad-hoc file queries | DuckDB | SELECT * FROM 'file.parquet' — done |
| Complex window functions in code | Polars | Expression API is more composable |
| Regulated industry (encryption) | DuckDB | AES-256 at rest since v1.4 |
For most data teams in 2026, the answer isn't one tool. It's all three: DuckDB for SQL-heavy aggregation, Polars for pipeline transformations, and pandas for the last mile.
The Apache Arrow integration between all three makes handoffs essentially free. You're not choosing a religion. You're building a toolbox.
If you're currently using pandas and want to try Polars, here's the translation table for common operations:
```python
# PANDAS
import pandas as pd

df = pd.read_csv("data.csv")
result = (
    df[df["age"] > 30]
    .groupby("city")["salary"]
    .mean()
    .sort_values(ascending=False)
    .head(10)
)
```
```python
# POLARS (eager mode — pandas-like)
import polars as pl

df = pl.read_csv("data.csv")
result = (
    df.filter(pl.col("age") > 30)
    .group_by("city")
    .agg(pl.col("salary").mean())
    .sort("salary", descending=True)
    .head(10)
)
```
```python
# POLARS (lazy mode — optimized)
import polars as pl

result = (
    pl.scan_csv("data.csv")
    .filter(pl.col("age") > 30)
    .group_by("city")
    .agg(pl.col("salary").mean())
    .sort("salary", descending=True)
    .head(10)
    .collect()
)
```
The mental model shift: in pandas, you chain method calls that each produce a new DataFrame. In Polars, you build expressions using pl.col() and compose them. The lazy version (scan_csv + collect) is what you should use in production — it lets Polars optimize the entire query before executing.
```python
# PANDAS
import pandas as pd

df = pd.read_parquet("data.parquet")
result = df.groupby(["region", "year"])["revenue"].sum().reset_index()
result = result[result["revenue"] > 100000].sort_values("revenue", ascending=False)
```
```sql
-- DUCKDB (just SQL)
SELECT region, year, SUM(revenue) AS revenue
FROM 'data.parquet'
GROUP BY region, year
HAVING revenue > 100000
ORDER BY revenue DESC
```
That's it. No import, no read_parquet, no reset_index. DuckDB reads the Parquet file directly in the SQL query. If you know SQL, you already know DuckDB.
One thing I care about that most comparison articles ignore: who controls these tools?
Pandas is a NumFOCUS fiscally sponsored project. Community-governed, volunteer-maintained, been stable for over a decade. Not going anywhere.
DuckDB is owned by the DuckDB Foundation, a non-profit. The intellectual property was purposefully moved to the foundation to ensure DuckDB remains MIT-licensed in perpetuity, independent of any commercial entity. DuckDB Labs is the commercial company, and MotherDuck is the cloud offering — but the core project is protected. This is the gold standard for open source governance.
Polars is backed by Polars Inc., a VC-funded company with $25 million raised ($21M Series A from Accel in September 2025). The project is MIT-licensed, but the company controls development direction. This is the same model that worked for companies like HashiCorp (until it didn't — remember the BSL license switch?). I don't think Polars will pull a HashiCorp, but the structural risk exists.
This matters if you're choosing a tool for a 10-year production system. DuckDB's nonprofit foundation structure gives it the strongest long-term guarantee. Pandas has community inertia. Polars has VC money, which is great for development velocity but comes with expectations of returns.
Pandas is not dead. But pandas is now the wrong default for new data projects.
For the last decade, "learn data science" meant "learn pandas." Every tutorial, every bootcamp, every YouTube video started with import pandas as pd. That made sense when pandas was the only real option. It doesn't make sense anymore.
Here's my position: new data projects in 2026 should start with Polars as the default DataFrame library and add DuckDB for SQL-heavy work. Pandas should be the integration layer — the tool you convert to when you need sklearn, seaborn, or another library that hasn't adopted the Arrow standard yet.
I think Polars has a slight edge over DuckDB for most data engineering work because:

- The expression API composes better for complex, multi-step transformations.
- Lazy execution makes pipelines explicit, optimizable, and easy to test.
- You stay in Python the whole way instead of switching between SQL strings and code.
But I reach for DuckDB when:

- The work is genuinely SQL-shaped, or the team already thinks in SQL.
- I want ad-hoc queries over files with zero setup.
- The data is far larger than RAM and I want spill-to-disk without any tuning.
- Encryption at rest is a hard requirement.
The hybrid approach — DuckDB for SQL, Polars for transformations, pandas for the last mile — is not a compromise. It's the optimal architecture. Apache Arrow makes it work without performance penalties.
One more thing: if you're still using pandas 2.x, upgrade to 3.0. Copy-on-write alone will save you hours of debugging SettingWithCopyWarning. The Arrow string backend is a free performance win. And the Arrow PyCapsule interface means your pandas DataFrames can now talk to Polars and DuckDB without serialization overhead.
The data tooling ecosystem in Python finally makes sense. Three tools, three specializations, one memory format connecting them all. It's the best it's ever been.