If you are starting out in software engineering today, there is a lot of material available online: videos, books, tutorials, and online classes. Most of it is structured to teach you either how a certain concept works or how a specific technology works. So when you start your first job, you expect to begin on a new green-field project, choose the tech stack, and go to town.
But then you join a company that already has a legacy application. Nobody knows how it really works, but it supports 80% of the business.
And so:
Any change to an existing application is a migration.
My guesstimate is that 80% of a software engineer's time outside of meetings is spent shepherding some long-running migration. So it makes sense to think about how to get better at migrations as a means of getting better as a software engineer.
What is a migration
A migration is the process of applying a change on a system.
Let's unpack this. "Process" implies this is going to take a while. "Applying" means that you have to do it. "A change" signifies that the new system is different from the old one. And "a system" can really be anything, from company processes through organizations to software.
Some examples:
Introducing an RFC process at the company is a migration that consists of preparing the process definitions and ceremonies, getting initial approvals from management, presenting it to everyone, and finally following up to make sure it's adhered to.
Replacing a legacy service with a service written in Rust while maintaining the API compatibility can look something like writing the new service, deploying it with a dark traffic split, checking all is working and finally switching the client traffic to the new service.
Migrating a data warehouse means first rewriting all your SQL in a format that is compatible with both warehouses, bringing the new warehouse online, checking all the data and calculations and finally switching all the users to the new warehouse.
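To make the dark traffic split from the service-replacement example a bit more concrete, here is a minimal sketch in Python. It assumes both services can be called as plain functions; `legacy_service`, `new_service`, and `handle_with_dark_traffic` are hypothetical names for illustration, and a real setup would more likely mirror traffic at the load balancer or proxy level.

```python
from typing import Callable, List, Tuple

Handler = Callable[[str], str]


def legacy_service(request: str) -> str:
    # Hypothetical stand-in for the old service.
    return request.upper()


def new_service(request: str) -> str:
    # Hypothetical stand-in for the rewrite. It deliberately behaves
    # differently on empty input so the comparison has something to find.
    if request == "":
        return "<empty>"
    return request.upper()


def handle_with_dark_traffic(
    request: str,
    primary: Handler,
    shadow: Handler,
    mismatches: List[Tuple[str, str, str]],
) -> str:
    # The client only ever sees the primary (old) response.
    response = primary(request)
    try:
        shadow_response = shadow(request)
        if shadow_response != response:
            mismatches.append((request, response, shadow_response))
    except Exception as exc:
        # A crash in the new service must never affect the client.
        mismatches.append((request, response, f"error: {exc!r}"))
    return response


mismatches: List[Tuple[str, str, str]] = []
responses = [
    handle_with_dark_traffic(r, legacy_service, new_service, mismatches)
    for r in ["hello", "", "migrate"]
]
```

After this run, `responses` contains only what the old service returned, and `mismatches` records the one request where the new service disagreed. Once the mismatch list stays empty for long enough, switching the client traffic becomes a much less scary step.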
Why everything is a migration
Even the most mundane tasks, like updating a README in a repo, are migrations. The PR merge triggers CI/CD, and Kubernetes performs a rolling update of the service. This is a migration that's already been automated, so you don't experience it directly, but it's still good to know what happens when there is a failure in these processes.
A migration consists of 3 steps:
- Pre-refactoring
- Change
- Post-refactoring
First you have to prepare the old system for the change. Most commonly this is some kind of cleanup that is necessary to make the change atomic later. It can literally be a refactoring, but it can also mean bringing up a whole new service or setting up a new warehouse.
Then you can finally make the switch: run the deployment, switch the load balancer, or present the RFC process.
After the change, if everything went well, you likely still need to clean up. Maybe you had to introduce some constructs to make the migration easy or now that you're on the new system, some old ways can be replaced with new ones.
What makes migrations hard
Often there is an easy "migration" when system uptime is not required. For example when you develop locally and you mess up your database, you can just nuke it and start from the beginning. Not a problem because nobody else depended on that database to be available.
In production however, the most common source of complications is the requirement to keep the system running without downtime.
This usually means that migrations are done by bringing the changed system online and then atomically, reversibly switching to it.
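A minimal sketch of such a switch, assuming the old and the new system are both already running and can be called as plain functions. The `Router` class and its method names are hypothetical, and the sketch is single-threaded; a real load balancer or feature flag system needs its own atomicity guarantees.

```python
from typing import Callable, Dict, Optional

Handler = Callable[[str], str]


class Router:
    """Sends every request to exactly one active backend."""

    def __init__(self, backends: Dict[str, Handler], active: str) -> None:
        if active not in backends:
            raise KeyError(active)
        self._backends = dict(backends)
        self._active = active
        self._previous: Optional[str] = None

    def handle(self, request: str) -> str:
        return self._backends[self._active](request)

    def switch_to(self, name: str) -> None:
        # Validate first, so a typo cannot leave us without a backend.
        if name not in self._backends:
            raise KeyError(name)
        self._previous = self._active
        # The switch itself is this one assignment.
        self._active = name

    def roll_back(self) -> None:
        if self._previous is None:
            raise RuntimeError("nothing to roll back to")
        self._active = self._previous
        self._previous = None


router = Router(
    {"old": lambda r: f"old:{r}", "new": lambda r: f"new:{r}"},
    active="old",
)
before = router.handle("ping")
router.switch_to("new")
during = router.handle("ping")
router.roll_back()
after = router.handle("ping")
```

Both backends stay online the whole time. The migration is only a change of which one receives traffic, which is what makes undoing it as cheap as doing it.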
How to get better at migrating
The most important thing to keep in mind is that things can and often do go wrong during a migration. What differentiates good migrations from bad migrations is how easy it is to get back to a working system when something goes wrong.
So to get better at migrations, these are the things to think about:
- Get better at finding what can go wrong (make sure you have your known unknowns covered)
- Prepare scenarios for each of the potential problems on how to restore to a working state
- Make sure all your migration steps are atomic, meaning they can be applied and reversed with a single change
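The last two points can be sketched as a tiny migration runner: every step carries its own way back, and a failure undoes the already applied steps in reverse order. This is only an illustration under the assumption that each step either applies fully or not at all; `Step`, `run_migration`, and the example steps are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    name: str
    apply: Callable[[], None]
    revert: Callable[[], None]


def run_migration(steps: List[Step]) -> bool:
    """Apply steps in order; on failure, revert the applied ones."""
    applied: List[Step] = []
    for step in steps:
        try:
            step.apply()
            applied.append(step)
        except Exception:
            # Assumes the failed step left nothing behind (it is atomic),
            # so only the previously applied steps need reverting.
            for done in reversed(applied):
                done.revert()
            return False
    return True


# Hypothetical system state, kept in a dict for the sake of the example.
state: Dict[str, object] = {"column_added": False, "traffic": "old"}


def add_column() -> None:
    state["column_added"] = True


def drop_column() -> None:
    state["column_added"] = False


def switch_traffic() -> None:
    # Simulates the thing that goes wrong during the migration.
    raise RuntimeError("health check failed")


def restore_traffic() -> None:
    state["traffic"] = "old"


ok = run_migration([
    Step("add column", add_column, drop_column),
    Step("switch traffic", switch_traffic, restore_traffic),
])
```

Here the second step fails, so the first one is reverted and the system ends up exactly where it started. Writing the `revert` for each step up front is also a good way to discover which steps are not really reversible.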