Melchi: A Simplified Snowflake to DuckDB Replication Tool
Overview
Melchi is designed to streamline the process of replicating data from Snowflake to DuckDB, providing both initial data ingestion and ongoing updates via Change Data Capture (CDC). This tool allows you to maintain a synchronized local copy of your Snowflake data without the hassle of complicated ETL processes.
How Melchi Works
- Specify a list of Snowflake tables for replication.
- Run a simple setup command in your terminal.
- Melchi takes care of:
  - Creating corresponding tables in DuckDB based on the schemas of your Snowflake tables.
  - Configuring streams and change tracking tables in Snowflake for each table being replicated.
  - Establishing change tracking tables in DuckDB to monitor update times.

After the initial setup, simply execute the `sync_data` command whenever updates are needed. Melchi efficiently identifies inserts, updates, and deletes from Snowflake and applies these changes to DuckDB. All you need to do is set up a role in Snowflake with the appropriate permissions; Melchi handles the rest, delivering a low-maintenance and effective solution for data synchronization.
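For a sense of what the setup step automates, here is a minimal sketch of creating one of the Snowflake change tracking objects by hand. It assumes the `snowflake-connector-python` package and hypothetical placeholder names; Melchi's actual setup statements may differ.

```python
# Minimal sketch of the kind of Snowflake-side object Melchi provisions
# during setup. Object names are hypothetical placeholders, and Melchi's
# actual statements may differ.
import os
import snowflake.connector

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT_IDENTIFIER"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    role="YOUR_ROLE",
    warehouse="YOUR_WAREHOUSE",
)

# A stream records row-level inserts, updates, and deletes on the source
# table so later syncs can replay only what changed.
conn.cursor().execute(
    "CREATE STREAM IF NOT EXISTS melchi_cdc_db.streams.your_table_stream "
    "ON TABLE your_db.your_schema.your_table"
)
```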
Why Choose Melchi?
Melchi addresses significant gaps in the current data replication landscape and is a strong fit for several key scenarios:
Key Use Cases
- Evaluating DuckDB Performance
  - Assess DuckDB against your Snowflake workloads without impacting production.
  - Experiment and prototype analytics workflows locally.
- Efficient Data Transfer Between Warehouses
  - Unlike many CDC tools that primarily handle OLTP-to-warehouse replication, Melchi focuses on warehouse-to-warehouse transfers, optimizing for analytical workloads while properly handling warehouse-specific data types (e.g., ARRAY, VARIANT, GEOGRAPHY).
- Local Development Environment
  - Work with production data locally, avoiding Snowflake costs and enabling rapid iteration on analytics queries without needing internet connectivity.
- Cost Reduction
  - Lower Snowflake compute costs during development and testing phases.
  - Cache frequently accessed data locally and test expensive queries without incurring warehouse charges.
We invite you to explore how Melchi can benefit your use cases! Share your experiences in our Discord community.
Quick Start
Set up a local copy of your Snowflake data in under 5 minutes by following these simple steps:
- Clone the repository:

```bash
git clone https://github.com/ryanwith/melchi.git
cd melchi
python3 -m venv venv && source venv/bin/activate  # For Windows: venv\Scripts\activate
pip install -r requirements.txt
```
- Create `config/config.yaml`:

```yaml
source:
  type: snowflake
  account: ${SNOWFLAKE_ACCOUNT_IDENTIFIER}
  user: ${SNOWFLAKE_USER}
  password: ${SNOWFLAKE_PASSWORD}
  role: YOUR_ROLE
  warehouse: YOUR_WAREHOUSE
  change_tracking_database: melchi_cdc_db
  change_tracking_schema: streams

target:
  type: duckdb
  database: output/local.duckdb
  change_tracking_schema: melchi

tables_config:
  path: "config/tables_to_transfer.csv"
```
- Specify tables to transfer in `config/tables_to_transfer.csv`:

```csv
database,schema,table,cdc_type
your_db,your_schema,your_table,full_refresh
```
- Set environment variables in `.env`:

```
SNOWFLAKE_ACCOUNT_IDENTIFIER=your_account
SNOWFLAKE_USER=your_username
SNOWFLAKE_PASSWORD=your_password
```
- Initialize setup and sync:

```bash
python main.py setup --config config/config.yaml
python main.py sync_data --config config/config.yaml
```
Now your data is available locally in DuckDB. Query it easily:

```python
import duckdb

conn = duckdb.connect('output/local.duckdb')
conn.execute('SELECT * FROM your_schema.your_table').fetchall()
```
For further details, see the Configuration section for more setup options and explore the CDC Types for advanced change tracking strategies.
Technical Architecture
Overview
Melchi leverages Snowflake's native change tracking combined with custom metadata tables to synchronize data efficiently. Here's how it works:
Core Components
- CDC Tracking Layer
  - Utilizes Snowflake's standard streams, append-only streams, and table copies to capture changes efficiently.
- Metadata Management
  - Tracks table schemas, primary keys, CDC configurations, and sync state.
- Data Movement Pipeline (sketched below)
  - [Snowflake Source] → [Change Detection] → [Batched Processing] → [DuckDB Target]
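The pipeline stage is easiest to see in code. Below is a heavily simplified sketch of a batched move from a Snowflake stream into DuckDB; the object names are hypothetical and Melchi's real implementation differs.

```python
# Simplified, hypothetical sketch of the batched data movement pipeline;
# object names are illustrative and Melchi's real implementation differs.
import os
import duckdb
import snowflake.connector

BATCH_SIZE = 10_000  # illustrative; Melchi's batch size is configurable

sf = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT_IDENTIFIER"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
)
duck = duckdb.connect("output/local.duckdb")

cur = sf.cursor()
# Change detection: a Snowflake stream exposes rows changed since the last
# read, with METADATA$ACTION marking each row as an INSERT or DELETE.
cur.execute(
    "SELECT id, amount, METADATA$ACTION "
    "FROM melchi_cdc_db.streams.your_table_stream"
)

# Batched processing: move rows in fixed-size chunks to bound memory use.
while True:
    batch = cur.fetchmany(BATCH_SIZE)
    if not batch:
        break
    deletes = [(row[0],) for row in batch if row[2] == "DELETE"]
    inserts = [(row[0], row[1]) for row in batch if row[2] == "INSERT"]
    if deletes:
        duck.executemany("DELETE FROM your_schema.your_table WHERE id = ?", deletes)
    if inserts:
        duck.executemany("INSERT INTO your_schema.your_table VALUES (?, ?)", inserts)
```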
Key Design Decisions
- Transactional Consistency:
  - All operations are conducted within transactions, ensuring no partial updates are committed.
- Memory Management:
  - Processes data in configurable batch sizes to optimize resource utilization.
- Type System:
  - Automates type mapping to handle complex types while preserving precision (see the sketch after this list).
- Error Handling:
  - Features robust retry logic for transient failures and detailed error logging.
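For illustration, automated type mapping amounts to a lookup table plus precision-aware rules. The excerpt below is a hypothetical mapping, not Melchi's actual one:

```python
# Hypothetical excerpt of a Snowflake -> DuckDB type mapping;
# the real mapping in Melchi may differ.
SNOWFLAKE_TO_DUCKDB = {
    "NUMBER(38,0)": "BIGINT",         # assumes values fit in 64 bits
    "NUMBER(38,9)": "DECIMAL(38,9)",  # preserve precision and scale
    "VARCHAR": "VARCHAR",
    "TIMESTAMP_NTZ": "TIMESTAMP",
    "VARIANT": "JSON",                # semi-structured data kept as JSON
    "ARRAY": "JSON",
    "GEOGRAPHY": "VARCHAR",           # e.g., stored as WKT text
}

def map_type(snowflake_type: str) -> str:
    """Return a DuckDB column type for a Snowflake column type."""
    return SNOWFLAKE_TO_DUCKDB.get(snowflake_type, "VARCHAR")
```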
Sync Process Flow
For initial setup:

```mermaid
graph LR
    A[Configure Tables] --> B[Create CDC Objects]
    B --> C[Setup DuckDB Schema]
    C --> D[Initialize Metadata]
```

For regular synchronization:

```mermaid
graph TD
    A[Check for Changes] --> B{CDC Type?}
    B -->|Standard| C[Process Changes]
    B -->|Append-Only| D[Process Inserts]
    B -->|Full Refresh| E[Copy Table]
    C --> F[Apply Changes]
    D --> F
    E --> F
    F --> G[Update Metadata]
    G --> H[Cleanup CDC Data]
```
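In code, the regular synchronization flow reduces to a dispatch on each table's CDC type. The helpers below are hypothetical stand-ins for Melchi's internals, shown only to mirror the flowchart:

```python
# Hypothetical sketch mirroring the sync flowchart above; these helpers
# stand in for Melchi's internals and are not its real API.
def read_stream(table): ...           # fetch pending changes from Snowflake
def apply_changes(table, rows): ...   # merge inserts/updates/deletes into DuckDB
def apply_inserts(table, rows): ...   # append new rows into DuckDB
def copy_full_table(table): ...       # drop and re-create the DuckDB table
def update_sync_metadata(table): ...  # record the last-synced timestamp
def cleanup_cdc_data(table): ...      # advance/clear consumed CDC data

def sync_table(table: str, cdc_type: str) -> None:
    """Dispatch on the per-table CDC type, as in the flowchart."""
    if cdc_type == "standard_stream":
        apply_changes(table, read_stream(table))
    elif cdc_type == "append_only_stream":
        apply_inserts(table, read_stream(table))
    elif cdc_type == "full_refresh":
        copy_full_table(table)
    else:
        raise ValueError(f"unknown cdc_type: {cdc_type}")
    update_sync_metadata(table)
    cleanup_cdc_data(table)
```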
Limitations
- Geography and Geometry columns are not supported under certain CDC types, and some operations require primary keys to be defined in Snowflake.
- Changing a table's entry in the transfer configuration requires a full table replacement.
- Tables with identical schema and table names cannot both be replicated into DuckDB, even if they reside in different databases within Snowflake.
Future Enhancements
- Support for additional data warehouses such as BigQuery and Redshift.
- Potential features such as real-time CDC and built-in data validation.
CDC Types
Melchi supports several Change Data Capture (CDC) strategies, configured per table in `tables_to_transfer.csv`:
- Full Refresh (`full_refresh`):
  - Completely drops and re-creates the target table on each sync. Ideal for small lookup tables.
- Standard Stream (`standard_stream`):
  - Uses standard streams to capture all changes; efficient for large tables with frequent updates and deletions.
- Append-Only Stream (`append_only_stream`):
  - Captures only new records; suitable for insert-only tables like logs or time-series data.
Select the appropriate type based on the nature of your data and operational requirements.
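As a rough sketch of what each strategy implies on the DuckDB side, the following uses the `duckdb` Python API with hypothetical table and column names; Melchi's actual SQL may differ:

```python
# Hedged sketch of how each CDC type could be applied in DuckDB;
# table and column names are hypothetical, and Melchi's SQL may differ.
import duckdb

conn = duckdb.connect("output/local.duckdb")

def full_refresh(rows):
    # full_refresh: drop and re-create the target table, then reload it.
    conn.execute("CREATE OR REPLACE TABLE your_schema.your_table (id BIGINT, amount DOUBLE)")
    if rows:
        conn.executemany("INSERT INTO your_schema.your_table VALUES (?, ?)", rows)

def apply_standard(inserted, deleted_ids):
    # standard_stream: remove deleted/updated keys, then insert current rows.
    if deleted_ids:
        conn.executemany("DELETE FROM your_schema.your_table WHERE id = ?", deleted_ids)
    if inserted:
        conn.executemany("INSERT INTO your_schema.your_table VALUES (?, ?)", inserted)

def apply_append_only(inserted):
    # append_only_stream: new rows are simply appended.
    if inserted:
        conn.executemany("INSERT INTO your_schema.your_table VALUES (?, ?)", inserted)
```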
Contribution
We're looking for contributors to help improve Melchi! Here's how you can get involved:
- Fork the Repository: Start by forking the Melchi repository to your GitHub account.
- Clone Your Fork: Clone it to your system and set up your virtual environment.
- Make Changes: Develop your feature or fix, following the project's coding standards and adding relevant tests.
- Submit a Pull Request: Push your changes to your fork and submit a PR in the main repository for review.
Your contributions make Melchi better for everyone, and we greatly appreciate your efforts!