Melchi
Effortlessly sync your Snowflake data to DuckDB with Melchi.
Pitch

Melchi simplifies replicating your Snowflake data to DuckDB, handling both the initial ingestion and continuous updates. By automating table creation and change tracking, it enables fast evaluations and local development without the complexity of traditional ETL, letting you assess DuckDB's performance while cutting Snowflake compute costs.

Description

Melchi: A Simplified Snowflake to DuckDB Replication Tool

Overview 📚

Melchi is designed to streamline the process of replicating data from Snowflake to DuckDB, providing both initial data ingestion and ongoing updates via Change Data Capture (CDC). This tool allows you to maintain a synchronized local copy of your Snowflake data without the hassle of complicated ETL processes.

How Melchi Works

  1. Specify a list of Snowflake tables for replication.
  2. Run a simple setup command in your terminal.
  3. Melchi takes care of:
    • Creating corresponding tables in DuckDB based on the schemas of your Snowflake tables.
    • Configuring streams and change tracking tables in Snowflake for each table being replicated.
    • Establishing change tracking tables in DuckDB to monitor update times.

After the initial setup, simply run the sync_data command whenever you need updates. Melchi identifies inserts, updates, and deletes in Snowflake and applies them to DuckDB. All you need to do is set up a Snowflake role with the appropriate permissions; Melchi handles the rest, delivering a low-maintenance solution for data synchronization.
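
The exact privileges depend on your environment, but such a role generally needs read access to the source tables plus rights to create streams and change-tracking objects. Below is a rough, illustrative sketch using the snowflake-connector-python package; MELCHI_ROLE and all object names are placeholders, not names Melchi requires, so check Melchi's documentation for the authoritative list of grants.

import os
import snowflake.connector

# Connect as an administrator to grant privileges to the hypothetical MELCHI_ROLE.
conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT_IDENTIFIER"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    role="SECURITYADMIN",
)

grants = [
    # Read access to the source data.
    "GRANT USAGE ON WAREHOUSE your_warehouse TO ROLE MELCHI_ROLE",
    "GRANT USAGE ON DATABASE your_db TO ROLE MELCHI_ROLE",
    "GRANT USAGE ON SCHEMA your_db.your_schema TO ROLE MELCHI_ROLE",
    "GRANT SELECT ON TABLE your_db.your_schema.your_table TO ROLE MELCHI_ROLE",
    # Rights to create streams and change-tracking tables in the CDC location
    # named in config.yaml (change_tracking_database/change_tracking_schema).
    "GRANT USAGE ON DATABASE melchi_cdc_db TO ROLE MELCHI_ROLE",
    "GRANT CREATE STREAM, CREATE TABLE ON SCHEMA melchi_cdc_db.streams TO ROLE MELCHI_ROLE",
]

cur = conn.cursor()
for stmt in grants:
    cur.execute(stmt)
cur.close()
conn.close()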

Why Choose Melchi? 🤔

Melchi fills a gap in the current data replication landscape and is a good fit for several key scenarios:

Key Use Cases

  1. Evaluating DuckDB Performance
    • Assess DuckDB against your Snowflake workloads without impacting production.
    • Experiment and prototype analytics workflows locally.
  2. Efficient Data Transfer Between Warehouses
    • Unlike many CDC tools that focus on OLTP-to-warehouse replication, Melchi handles warehouse-to-warehouse transfers, optimizing for analytical workloads and properly supporting warehouse-specific data types (e.g., ARRAY, VARIANT, GEOGRAPHY).
  3. Local Development Environment
    • Work with production data locally, avoiding Snowflake compute costs and iterating rapidly on analytics queries, even without an internet connection.
  4. Cost Reduction
    • Lower Snowflake compute costs during development and testing phases.
    • Cache frequently accessed data locally and test expensive queries without incurring warehouse charges.

We invite you to explore how Melchi can benefit your use cases! Share your experiences in our Discord community.

Quick Start 🚀

Set up a local copy of your Snowflake data in under 5 minutes by following these simple steps:

  1. Clone the repository:
    git clone https://github.com/ryanwith/melchi.git  
    cd melchi  
    python3 -m venv venv && source venv/bin/activate  # For Windows: venv\Scripts\activate  
    pip install -r requirements.txt  
    
  2. Create config/config.yaml:
    source:  
      type: snowflake  
      account: ${SNOWFLAKE_ACCOUNT_IDENTIFIER}  
      user: ${SNOWFLAKE_USER}  
      password: ${SNOWFLAKE_PASSWORD}  
      role: YOUR_ROLE  
      warehouse: YOUR_WAREHOUSE  
      change_tracking_database: melchi_cdc_db  
      change_tracking_schema: streams  
    target:  
      type: duckdb  
      database: output/local.duckdb  
      change_tracking_schema: melchi  
    tables_config:  
      path: "config/tables_to_transfer.csv"  
    
  3. Specify tables to transfer in config/tables_to_transfer.csv:
    database,schema,table,cdc_type  
    your_db,your_schema,your_table,full_refresh  
    
  4. Set environment variables in .env:
    SNOWFLAKE_ACCOUNT_IDENTIFIER=your_account  
    SNOWFLAKE_USER=your_username  
    SNOWFLAKE_PASSWORD=your_password  
    
  5. Initialize setup and sync:
    python main.py setup --config config/config.yaml  
    python main.py sync_data --config config/config.yaml  
    

Your data is now in DuckDB. Query it directly:

import duckdb

# Open the DuckDB file Melchi writes to (the target database from config.yaml).
conn = duckdb.connect('output/local.duckdb')
rows = conn.execute('SELECT * FROM your_schema.your_table').fetchall()
print(rows)
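
If you prefer dataframes, the same connection can return a pandas DataFrame via the DuckDB Python API (assuming pandas is installed):

import duckdb

conn = duckdb.connect('output/local.duckdb')
# Fetch the result as a pandas DataFrame instead of a list of tuples.
df = conn.execute('SELECT * FROM your_schema.your_table').df()
print(df.head())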

For further details, see the Configuration section for more setup options and explore the CDC Types for advanced change tracking strategies.

Technical Architecture 🔧

Overview

Melchi leverages Snowflake's native change tracking combined with custom metadata tables to synchronize data efficiently. Here's how it works:

Core Components

  1. CDC Tracking Layer
    • Uses Snowflake's standard streams, append-only streams, and full table copies to capture changes efficiently (see the sketch after this list).
  2. Metadata Management
    • Manages table schemas, primary keys, per-table CDC configurations, and sync-state tracking.
  3. Data Movement Pipeline
    [Snowflake Source] โ†’ [Change Detection] โ†’ [Batched Processing] โ†’ [DuckDB Target]  
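
As a concrete illustration of the tracking layer, the snippet below reads pending changes from a hypothetical standard stream with snowflake-connector-python; it is a sketch of the general pattern, not Melchi's internal code.

import os
import snowflake.connector

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT_IDENTIFIER"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
)

# Selecting from a standard stream returns the changed rows plus CDC metadata
# columns: METADATA$ACTION ('INSERT' or 'DELETE') and METADATA$ISUPDATE (TRUE
# when a DELETE/INSERT pair represents an update).
cur = conn.cursor()
cur.execute("SELECT * FROM melchi_cdc_db.streams.your_table_stream")
for row in cur.fetchall():
    pass  # route each row to an insert, update, or delete handler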
    

Key Design Decisions

  1. Transactional Consistency:
    • All operations are conducted within transactions, ensuring no partial updates are committed.
  2. Memory Management:
    • Processes data in configurable batch sizes to keep resource usage bounded (see the sketch after this list).
  3. Type System:
    • Maps Snowflake types to DuckDB types automatically, handling complex types while preserving precision.
  4. Error Handling:
    • Features robust retry logic for transient failures and detailed error logging.
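
As a minimal sketch of the batched transfer loop referenced above (assuming snowflake-connector-python and the duckdb package; the table names, batch size, and three-column INSERT are placeholders, and Melchi's real pipeline also performs type conversion and CDC bookkeeping):

import os
import duckdb
import snowflake.connector

BATCH_SIZE = 10_000  # placeholder for a configurable batch size

sf_conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT_IDENTIFIER"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
)
duck = duckdb.connect("output/local.duckdb")

cur = sf_conn.cursor()
cur.execute("SELECT * FROM your_db.your_schema.your_table")

# fetchmany() keeps only one batch in memory at a time, so large tables can
# be transferred without loading them whole.
while True:
    batch = cur.fetchmany(BATCH_SIZE)
    if not batch:
        break
    duck.executemany(
        "INSERT INTO your_schema.your_table VALUES (?, ?, ?)",  # one ? per column
        batch,
    )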

Sync Process Flow

For initial setup:

graph LR  
   A[Configure Tables] --> B[Create CDC Objects]  
   B --> C[Setup DuckDB Schema]  
   C --> D[Initialize Metadata]  

For regular synchronization:

graph TD  
   A[Check for Changes] --> B{CDC Type?}  
   B -->|Standard| C[Process Changes]  
   B -->|Append-Only| D[Process Inserts]  
   B -->|Full Refresh| E[Copy Table]  
   C --> F[Apply Changes]  
   D --> F  
   E --> F  
   F --> G[Update Metadata]  
   G --> H[Cleanup CDC Data]  
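
In code, that per-table dispatch might look roughly like the following; the function and the messages are illustrative, not Melchi's actual internals.

def sync_table(table: str, cdc_type: str) -> None:
    """Mirror the synchronization flow above for a single table (illustrative)."""
    if cdc_type == "full_refresh":
        print(f"copying {table} in full")                       # Copy Table
    elif cdc_type == "append_only_stream":
        print(f"applying new inserts to {table}")               # Process Inserts
    elif cdc_type == "standard_stream":
        print(f"applying inserts/updates/deletes to {table}")   # Process Changes
    else:
        raise ValueError(f"unknown cdc_type: {cdc_type}")
    print(f"updating sync metadata for {table}")                # Update Metadata
    print(f"cleaning up CDC data for {table}")                  # Cleanup CDC Data

sync_table("your_table", "standard_stream")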

Limitations

  • Geography and Geometry columns are not supported under certain CDC types, and some operations require primary keys to be defined in Snowflake.
  • Changing a table's transfer configuration requires a full replacement of that table.
  • Tables that share the same schema and table names cannot both be replicated into DuckDB, even if they live in different Snowflake databases (for example, db1.analytics.users and db2.analytics.users would map to the same DuckDB table).

Future Enhancements

  1. Support for additional data warehouses, such as BigQuery and Redshift.
  2. Potential features such as real-time CDC and built-in data validation.

CDC Types

Melchi supports various Change Data Capture (CDC) strategies configured per table in tables_to_transfer.csv:

  1. Full Refresh (full_refresh):
    • Completely drops and re-creates the target table on each sync. Ideal for small lookup tables.
  2. Standard Stream (standard_stream):
    • Uses standard streams to capture all changes; efficient for large tables with frequent updates and deletions.
  3. Append-Only Stream (append_only_stream):
    • Captures only new records; suitable for insert-only tables like logs or time-series data.

Select the appropriate type based on the nature of your data and operational requirements.
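
For instance, a tables_to_transfer.csv that mixes all three strategies might look like this (the database, schema, and table names are placeholders):

database,schema,table,cdc_type
your_db,reference,country_codes,full_refresh
your_db,sales,orders,standard_stream
your_db,events,page_views,append_only_stream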

Contribution

We're looking for contributors to help improve Melchi! Here's how you can get involved:

  1. Fork the Repository: Start by forking the Melchi repository to your GitHub account.
  2. Clone Your Fork: Clone it to your system and set up your virtual environment.
  3. Make Changes: Develop your feature or fix, following the project's coding standards and adding relevant tests.
  4. Submit a Pull Request: Push your changes to your fork and submit a PR in the main repository for review.

Your contributions make Melchi better for everyone, and we greatly appreciate your efforts!