Using ConnectorX and DuckDB in Python: Step by Step Guide

Introduction

When working with large datasets, execution time and efficiency comes into play. Traditional methods of extracting data from the relational databases into Python often involve loading everything into memory, which can be painful and very slow. That’s where connectorX and DuckDB come in handy. Together, they make data extraction and analytics in python very fast and memory-efficient.

What is ConnectorX?

  • ConnectorX is an open-source library built to load data from databases directly into pandas, Polars, or NumPy efficiently. Instead of fetching row by row via psycopg2 or sqlalchemy ConnectorX parallely fetch chunks of data and stream them directly into Python.

  • Supports many databases: MySQL, SQLite, PostgreSQL, SQL Server, BigQuery, Snowflake, and many more.

What is DuckDB?

  • DuckDB is an in-process SQL OLAP database. Can query CSV, Parquet, JSON, Arrow datasets, and even pandas/Polars DataFrames.
  • Works directly inside Python and R.

Data processing comparison between Conventional and connectorx+DuckDB 

Installation

You can easily install these libraires by using pip in python 
# installing the libraries
pip install connectorx
pip install duckdb

Hands-On

Below is the example of loading data from PostgreSQL into Pandas with ConnectorX and Analyzing data using DuckDB.

import connectorx as cx # importing the libraries
import pandas as pd
import duckdb

# Database connection string
conn = "postgresql://username:password@localhost:5432/mydb"

# SQL query
query = "SELECT * FROM sales LIMIT 1000000"

# Load data into Pandas using ConnectorX
df_sales = cx.read_sql(conn, query)
print(df_sales.head(10))

# Run SQL directly on Pandas DataFrame
region_sales = duckdb.query("""
    SELECT region, SUM(amount) AS total_sales
    FROM df
    GROUP BY region
    ORDER BY total_sales DESC
""").to_df()

print(region_sales)

Use Cases

  1. Fast ETL: Extract from PostgreSQL/MySQL → Analyze with DuckDB.
  2. Data Science: Query CSV/Parquet files quickly without heavy Spark clusters.
  3. Interactive Analytics: Run SQL on Pandas/Polars DataFrames with minimal overhead.

Conclusion

ConnectorX solves the data loading bottleneck by parallelizing data extraction instead of row by row extraction.DuckDB solves the analytics bottleneck by providing a super-fast SQL engine inside Python.
Together, they form a powerful workflow for anyone working with large datasets without heavy Spark clusters and minimal overhead.

Comments

Popular posts from this blog

Step-by-Step Guide to Setting Up AWS SES with Configuration Sets

Integrating Amazon Cognito with API Gateway for Secure API Access

How to Secure Data with AWS KMS Server-Side Encryption

How to Configure AWS SES Event Destinations: Step-by-Step Methods

How to Manage Secrets Securely with AWS Secrets Manager and Lambda

Creating a Scalable Lambda Layer for PostgreSQL or MySQL Drivers in Python