Using ConnectorX and DuckDB in Python: Step by Step Guide
Introduction
When working with large datasets, execution time and memory efficiency become critical. Traditional methods of extracting data from relational databases into Python often load everything into memory through a single connection, which can be painfully slow. That’s where ConnectorX and DuckDB come in handy. Together, they make data extraction and analytics in Python fast and memory-efficient.
What is ConnectorX?
- ConnectorX is an open-source library built to load data from databases directly into pandas, Polars, or NumPy efficiently. Instead of fetching rows one at a time through drivers such as psycopg2 or SQLAlchemy, ConnectorX fetches chunks of data in parallel and streams them directly into Python. It supports many databases: MySQL, SQLite, PostgreSQL, SQL Server, BigQuery, Snowflake, and more.
What is DuckDB?
- DuckDB is an in-process SQL OLAP database. It can query CSV, Parquet, JSON, and Arrow datasets, and even pandas/Polars DataFrames, and it works directly inside Python and R.
```shell
# installing the libraries
pip install connectorx
pip install duckdb
```
Hands-On
```python
# importing the libraries
import connectorx as cx
import pandas as pd
import duckdb

# Database connection string
conn = "postgresql://username:password@localhost:5432/mydb"

# SQL query
query = "SELECT * FROM sales LIMIT 1000000"

# Load data into pandas using ConnectorX
df_sales = cx.read_sql(conn, query)
print(df_sales.head(10))

# Run SQL directly on the pandas DataFrame
region_sales = duckdb.query("""
    SELECT region, SUM(amount) AS total_sales
    FROM df_sales
    GROUP BY region
    ORDER BY total_sales DESC
""").to_df()
print(region_sales)
```
Use Cases
- Fast ETL: Extract from PostgreSQL/MySQL → Analyze with DuckDB.
- Data Science: Query CSV/Parquet files quickly without heavy Spark clusters.
- Interactive Analytics: Run SQL on Pandas/Polars DataFrames with minimal overhead.