Snowflake vs Redshift

I recently played a major part in a migration from Redshift to Snowflake. I'm writing this for myself in case I need to reference it in the future, but I hope others find it useful as well, since there was a lack of similar posts when I was in the discovery phase of the project.

Briefly about Snowflake

Snowflake is a pretty well-known piece of software in the data industry. However, if you haven't encountered it before, it can be a little intimidating. I'll summarize my knowledge of Snowflake here in terms an engineer should find accessible.

If you're only going to learn one thing about Snowflake, make it micro-partitions. If you have a basic idea of how databases work, this single piece of knowledge takes you 80% of the way to understanding Snowflake. Snowflake lays out data in small immutable chunks called micro-partitions. As of this writing their size is capped at 16MB, which also sets the maximum size of text/JSON columns. Ingestion is limited by the speed at which you can create these micro-partitions and write them out (I believe they are simply stored on S3 or an equivalent object store). Updates and deletes can be very expensive, because whole micro-partitions have to be rewritten to materialize the change. Reads can be quick, depending on the query planner's ability to prune micro-partitions from the scan. Multiple warehouses can access the same data at the same time. Time travel, zero-copy cloning and all the related features follow naturally from immutable micro-partitions: any point-in-time snapshot is just a set of pointers to micro-partitions.
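As an illustration of how cloning and time travel fall out of this design almost for free, here is a sketch in Snowflake SQL (the table name is made up):

```sql
-- Zero-copy clone: only micro-partition pointers are copied, not the data itself
CREATE TABLE orders_backup CLONE orders;

-- Time travel: read the table as it was one hour ago,
-- i.e. resolve the set of micro-partition pointers from that point in time
SELECT * FROM orders AT(OFFSET => -3600);
```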

Secondly, it's important to know that databases (storage) in Snowflake are persistent, but warehouses (compute) are ephemeral. They spin up and down in a matter of seconds and are billed per second, with a one-minute minimum. The cost of storage is very low (usually under 10% of the total bill). The cost of compute is something you have firmly in hand by setting scaling policies on your warehouses. Attributing costs is also easy: just assign a warehouse per cost center.
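A warehouse definition along these lines (the name and size are illustrative) shows how the suspend/resume behaviour and per-cost-center attribution are configured:

```sql
-- One warehouse per cost center makes the bill easy to attribute
CREATE WAREHOUSE marketing_wh WITH
  WAREHOUSE_SIZE = 'SMALL'
  AUTO_SUSPEND   = 60    -- suspend after 60s idle; compute billing stops
  AUTO_RESUME    = TRUE; -- spins back up automatically on the next query
```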

Briefly about Redshift

While Snowflake feels, right off the bat, like a solid database designed by people who knew what they were doing, Redshift very much does not. To my understanding, Redshift was hastily put together by taking the PostgreSQL query parser and wire protocol (at version 8) and slapping them onto a custom query engine. While AWS has done some great work in recent years, they still have a long way to go. Everything feels ad-hoc and bolted on somehow.

There are three main variants of Redshift: Serverless, RA3 and DC2. DC2 was initially the only option; it stores data directly on the nodes, so if your data grows you need to add more nodes, which then sit there mostly idle, making it very expensive if you have a lot of data. RA3 came second and is somewhat similar to Snowflake in that it persists data on S3 and uses local caches to speed things up; however, it still has limitations, e.g. resizing the cluster takes it down for ~15 minutes. Serverless aims to be even more reactive and decoupled, but still ends up being pretty inflexible and expensive.

Some problems we had with Redshift:

  • Because of the 15-minute scaling downtime, we kept the cluster at a fixed size, meaning we didn't have enough capacity for the ETL at night, but also didn't use the capacity during the day, so we overpaid
  • Because all queries go through the leader node, it is the bottleneck, so there are limits to how far you can scale. Some operations can only be performed on the leader as well, adding to the problem
  • dbt issues a lot of metadata queries, which are all answered by the leader
  • Limited concurrency support often led to circular locks on tables and failed ETL runs
  • JSON support is pretty weak

A word about the scale

Without going too much into the details, I'm talking about tens of TB of data and thousands of tables.

The migration

I really like the idea (I think it comes from Martin Fowler's Refactoring book) that (paraphrased):

Refactoring is a 3-step process: (1) Pre-refactoring, (2) Refactoring and (3) Post-refactoring

What I take away from that is to insert a step into the process that does not change the behaviour at all, but makes the subsequent refactoring easy. In a migration such as this one, which is going to touch hundreds of jobs, you really don't want to change both the job code and the interface at the same time.

You may want to introduce an interface that all the jobs use (pre-refactoring) and then swap that interface from Redshift to Snowflake in a single atomic operation, so that if you mess something up, you can easily revert it. You do not want to maintain two versions of code that differ in more than one simple way.

In our case we had massaged our architecture from ETL to ELT, with a strong emphasis on the raw data being the ground truth: all the later layers of the data warehouse must be completely recreatable from the raw data by a series of transformations encoded in SQL.

Additionally, we have ensured that all our raw data is stored in some sensible format on S3.

That meant we could split our migration into 2 phases:

  1. Migrate the raw layer
  2. Migrate the transformations

Both of these could be executed in parallel, so we had lists of all the jobs that needed to be migrated and we assigned the migration tasks (which were detailed in advance) to the team members.

The migration of the raw layer was the more difficult part, because of the heterogeneity among the various extractor jobs, but thanks to this migration the codebase ended up in a better place overall.

The transformation phase was a lot easier, thanks to dbt. There are packages, and dbt itself has built-in cross-database macros; you just need to make sure you're using them in your models and they will be portable between the supported warehouses. It was not perfect, and I ended up writing some macros myself, most notably to replace the Redshift JSON_EXTRACT_PATH_TEXT function with Snowflake's GET_PATH.
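The macro I ended up with looked roughly like the following dbt (Jinja + SQL) sketch. The macro name is mine, it assumes the column is already a VARIANT on Snowflake, and real paths may need more handling than this single-key case:

```sql
-- macros/json_extract.sql (hypothetical macro name)
-- Dispatches to the warehouse-appropriate JSON extraction function
{% macro json_extract(column, key) %}
  {% if target.type == 'snowflake' %}
    GET_PATH({{ column }}, '{{ key }}')
  {% else %}
    JSON_EXTRACT_PATH_TEXT({{ column }}, '{{ key }}')
  {% endif %}
{% endmacro %}
```

Models then call `{{ json_extract('payload', 'user_id') }}` instead of either function directly, which keeps them portable between the two warehouses.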

Finally, as a post-refactoring step, we cleaned up all the Redshift-specific code that was awkwardly left lying around. In many cases Snowflake has better options for things that Redshift can't do well.

Thoughts on Redshift

Overall, I have used Redshift twice in my career and both times it ended up not being the right tool for the job. Partly because we chose poorly, but also because I feel the way AWS presents their products can be misleading.

For instance, the first time we used Redshift, it sat on a query path hit directly from the UI. Nothing in the documentation suggested what kind of problems we might encounter if we did that, though in retrospect it was a poor decision, because OLAP databases are not suited for low-latency querying. A more general-purpose OLAP database can make it work, and e.g. Snowflake or ClickHouse would handle this use case better. The reason it didn't work was that p99 latency was off the charts: p50 was in the 200ms range, which is slow but acceptable for a dashboarding use case, but p99 was 2-10 seconds! That completely destroys the UX, especially because of the randomness of it. As I understand it, the latency comes from the leader node: it receives the query, parses it, compiles it into a binary and ships that to the compute nodes, which then run it.

Secondly, using Redshift to back BI tooling can work if the data and the team are small. At a certain point, though, you have to start trading usability for cost: either you scale up and down, taking the cluster offline for ~15 minutes at a time, or you run it close to max spec even at times when you don't need it. Additionally, because only the compute nodes can scale, the leader is always going to be the bottleneck, especially with use cases like dbt, where many metadata queries are executed. I honestly don't know if this problem is better on Serverless, but in our testing it was worse and more expensive. Then again, we didn't spend much time trying to optimize it.

Therefore I don't really understand the point of Redshift. The only way I can see it working is if all the ETL is done before Redshift, with the results stored on S3 and loaded into Redshift, which is then only used for querying ready-made tables. I would avoid using it as a general-purpose data warehouse, and I would avoid it if latency is a concern.

Thoughts on Snowflake

Snowflake is a different beast altogether. It can spin compute up and down within seconds (or faster), so the second concern from Redshift goes away. You can simply create a warehouse for each use case (e.g. different parts of the org, different projects, different tools) and size each one correctly. When a warehouse is not needed, it automatically suspends and you stop paying for it.

The storage costs are higher than raw S3 but negligible overall; they usually come out to about 10% of the bill.

Warehouses can scale up and out: up means you get bigger machines, out means you get more of them. So scaling up handles bigger queries and scaling out handles more queries in parallel.
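In SQL terms, the two directions look like this (the warehouse name is illustrative, and multi-cluster scaling out requires the Enterprise edition):

```sql
-- Scale up: bigger machines for heavier individual queries
ALTER WAREHOUSE bi_wh SET WAREHOUSE_SIZE = 'LARGE';

-- Scale out: more clusters for higher query concurrency
ALTER WAREHOUSE bi_wh SET MIN_CLUSTER_COUNT = 1, MAX_CLUSTER_COUNT = 4;
```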

With respect to user-facing features, there are 3 categories where Snowflake has a significant upper hand over Redshift:

  1. UI and Apps
  2. Data Governance
  3. Data Sharing

UI and Apps

Overall, Snowflake has a nice UI called Snowsight with a rudimentary text editor where users can write and execute SQL queries. It even has some basic charting capability, so you can create dashboards directly in Snowsight. The one thing I would still like to see is Vim key bindings in the editor.

On top of that there are some interesting new options coming up, such as Streamlit, which lets you write interactive web apps that work with data in Snowflake using Python.

On the admin side, there are also nice amenities for monitoring the spending and load of the warehouses. Also the query plan visualizer is worth mentioning.

Data Governance

Snowflake has two interesting features here. (1) Role hierarchies: users assume roles, and roles can inherit their privileges from other roles. At the end of the day, all queries are still logged in the audit log under the user that was logged in. (2) Data masking: this is an amazing feature that lets you assign "tags" (actually key-value pairs, so they should arguably be called something else) and then, based on these tags, mask the data in certain columns or tables from groups of people based on their assumed roles. This lets you set up a rule that e.g. Marketing can see the users table but not the users' physical addresses, while analysts cannot see the PII at all. Importantly, the masking is applied when the data is queried, so the same table looks different depending on who's looking. You can also apply different UDFs to do the masking; e.g. you can hash the masked value, so you can still do things like count(distinct <col>) over the masked data.
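A masking policy along these lines (the role, policy and table names are invented) captures both rules, including the hashing trick:

```sql
-- Hide addresses from analysts entirely, hash them for marketing,
-- and show them in the clear to everyone else
CREATE MASKING POLICY mask_address AS (val STRING) RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() = 'ANALYST'   THEN NULL
    WHEN CURRENT_ROLE() = 'MARKETING' THEN SHA2(val)  -- count(distinct ...) still works
    ELSE val
  END;

ALTER TABLE users MODIFY COLUMN address SET MASKING POLICY mask_address;
```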

Data Sharing

This is a very powerful idea, because it lets third parties build applications or data sets and make them easily available to you. The same goes the other way around: if you want to sell your data, you can do so through the data marketplace. From the Snowflake perspective, all you see is a new database you can read from. And the kicker is that dbt, even though it normally connects to only one data source, doesn't mind! You just need to explicitly specify the database in your query.
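In a dbt model, this just means fully qualifying the shared database (all names here are hypothetical):

```sql
-- models/enriched_users.sql: join our own data with a shared dataset
SELECT u.*, w.temperature
FROM {{ ref('users') }} AS u
JOIN shared_weather_db.public.daily_weather AS w  -- database provided via a data share
  ON u.city = w.city
```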


I think by this point it's clear which way I'm leaning. If you're on AWS, Snowflake is still probably the best option for any mid-size and above company. For us it ended up being about 40% cheaper, our ETL ended up being 70% faster on Snowflake than on Redshift, and on top of that we got a better UI and better data governance!