Best Practices for Data Ingestion into Snowflake -- your opinions and experiences wanted #146

MaRo99382 · 2024-11-16T09:21:57Z

MaRo99382
Nov 16, 2024

Hey everyone,

As we all work with SnowDDL, it's a given that we're using Snowflake. I wanted to open up this discussion to gather insights, opinions, and shared experiences on the different solutions available for getting data into Snowflake.

I'm particularly interested in batch processing with file formats like Parquet, JSON, or CSV. However, if you have experience with streaming data pipelines, your input is definitely welcome too!

For batch processing, I've observed that there are typically two main stages:

1. Extracting data from source systems and staging it in a cloud storage service (e.g., AWS S3, Azure Blob Storage, etc.).
2. Loading data from the staging area into Snowflake tables using COPY INTO commands, whether these are run manually or automated through services like Snowpipe.

I’d love to hear what solutions you all use for both of these stages:

Which tools, systems, or frameworks do you rely on for data extraction and staging?
How do you handle the data loading step? Do you write COPY INTO statements by hand, or do you use an existing tool to automate this process?
How do you manage orchestration? Are you using workflow tools like Airflow, or built-in Snowflake features?
What solutions have worked well for you, and what have been challenging or problematic? What specific pain points have you encountered, and how did you address them?

Looking forward to your thoughts and insights!

littleK0i · 2024-11-19T10:29:08Z

littleK0i
Nov 19, 2024
Maintainer

It is an interesting topic. I can probably write a book about it at this point.

A few ideas:

Keep SnowDDL config, ETL logic and orchestrator code together in the same repository. Big changes normally involve both "schema" evolution and "business logic" evolution. Sometimes orchestration logic needs to be changed. It is really nice to see everything in one commit. Also, CI/CD becomes easier to manage.
Use declarative approach to orchestration. Create some building blocks with actual logic like inside. Define and call these blocks using YAML configs. Refactoring becomes a breeze.
It should be possible to create an independent sub-environment (env_prefix) and run full ETL on it. This helps to improve development speed and quality tremendously.
Asset-based orchestration and global lineage is good. Dagster approach is good. More and more people will realize it as the time goes by.

Custom ETL works well when organized in the following pairs:

(1) From source to object storage, e.g. S3
Connect to external source, pull data, write files AND metadata. Metadata must contain column structure and data types.

(2) From S3 to Snowflake or any other DBMS
Read column structure from actual table. Read columns from metadata in S3. Compare column structures and generate custom COPY INTO command. Address columns which were added / dropped / changed data type / require transformations. Consider running TRUNCATE or DELETE in the same transaction as COPY INTO in order to protect from duplicates. It costs a bit more, but having actual duplicates and broken reports is much worse.

This approach makes your ETL highly modular and resilient to schema changes and sudden outages.

Best ingestion format for Snowflake is CSV. Other data formats are substantially slower. I guess it is expensive to generate VARIANT data structure which is used for other formats.

It might be a good idea to avoid Snowpipe, if possible. Snowpipe makes it harder to address duplicates and reloads. Monitoring is harder. You still need to implement reload logic anyway, so maybe just go ahead with DELETE + COPY INTO from the start.

1 reply

MaRo99382 Nov 23, 2024
Author

Thank you @littleK0i!
Those are all really interesting points.
Especially the "orchestration by YAML configs" and "write metadata to S3 with data files".
Do you know any example projects / repositories that implements those? I would like to understand it better but it is a bit abstract just reading it like that.

What parts of the orchestration can or should you put into YAML configs?
"Address columns which were added / dropped / changed data type / require transformations" -> do you have an alerting system telling you that there are differences and then you do the changes manually once you received an alert?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Best Practices for Data Ingestion into Snowflake -- your opinions and experiences wanted #146

{{title}}

Replies: 1 comment 1 reply

{{title}}

{{title}}

Select a reply

Best Practices for Data Ingestion into Snowflake -- your opinions and experiences wanted #146

MaRo99382 Nov 16, 2024

Replies: 1 comment · 1 reply

littleK0i Nov 19, 2024 Maintainer

MaRo99382 Nov 23, 2024 Author

MaRo99382
Nov 16, 2024

Replies: 1 comment 1 reply

littleK0i
Nov 19, 2024
Maintainer

MaRo99382 Nov 23, 2024
Author