Zero-downtime Rails migrations: four moves that cover 90% of cases

Every Rails developer has the same scar: you ran a migration during a deploy, it grabbed a lock, and the app went dark for ninety seconds while a NOT NULL constraint validated against four million rows. The fix is not a tool you install. It is a small set of moves you internalize until the dangerous ones feel wrong in your hands. This post is the rubric I reach for — four moves that cover roughly ninety percent of the migrations a working Rails app actually needs, plus the handful of cases where none of them save you. No magic gem required, though I’ll mention one. Mostly it’s understanding which operations Postgres can do while holding a cheap lock, and which ones quietly rewrite a whole table.

What “zero-downtime” actually means here

Zero-downtime migration means the schema change runs while the application keeps serving requests, without a maintenance window and without a query queue backing up behind a held lock. It is a Postgres locking problem dressed as a Rails problem.

The scope I care about: a single primary Postgres database, a Rails app deployed continuously, tables large enough that a full rewrite or a long ACCESS EXCLUSIVE lock would be felt by users. If your largest table has ten thousand rows, most of this is academic — run the migration and move on.

The thing to hold in your head is the lock hierarchy. Postgres takes different lock strengths for different DDL. ADD COLUMN with a constant default takes a brief ACCESS EXCLUSIVE lock but does not rewrite the table (since Postgres 11). CREATE INDEX blocks writes for the duration. VALIDATE CONSTRAINT takes a weaker SHARE UPDATE EXCLUSIVE lock that allows reads and writes. The whole game is keeping strong locks short and pushing slow work under weak locks.
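To make the hierarchy concrete, here is a minimal sketch that encodes the four operations above as a lookup table. The operation labels and helper names are invented for this illustration; the lock modes and what they conflict with follow the Postgres explicit-locking documentation.

```ruby
# Lock mode each DDL operation takes on the table.
# The hash keys are labels made up for this sketch.
LOCK_TAKEN = {
  add_column_constant_default: :access_exclusive,       # brief, no rewrite on Postgres 11+
  create_index:                :share,                  # held for the whole build
  create_index_concurrently:   :share_update_exclusive,
  validate_constraint:         :share_update_exclusive
}.freeze

# What ordinary application traffic is blocked while each mode is held.
BLOCKS = {
  access_exclusive:       { reads: true,  writes: true  },
  share:                  { reads: false, writes: true  },
  share_update_exclusive: { reads: false, writes: false }
}.freeze

# kind is :reads or :writes
def blocks?(operation, kind)
  BLOCKS.fetch(LOCK_TAKEN.fetch(operation)).fetch(kind)
end

def describe(operation)
  mode   = LOCK_TAKEN.fetch(operation).to_s.upcase.tr("_", " ")
  reads  = blocks?(operation, :reads)  ? "blocks reads"  : "allows reads"
  writes = blocks?(operation, :writes) ? "blocks writes" : "allows writes"
  "#{operation}: #{mode}, #{reads}, #{writes}"
end

LOCK_TAKEN.each_key { |operation| puts describe(operation) }
```

The table says nothing about duration, which is the other half of the problem: an ACCESS EXCLUSIVE lock held for milliseconds is usually fine, while a SHARE lock held for ten minutes is not.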

The four moves

Here is the rubric. Almost every safe migration is one of these, or a sequence of them.

| Move | When | The discipline |
| --- | --- | --- |
| Additive only | New column, new table, new index | Never combine adding a column with backfilling it in the same migration |
| Concurrent index | Any index on a non-trivial table | `CREATE INDEX CONCURRENTLY` outside a transaction |
| Batched backfill | Populating a new column | Update in chunks, off the deploy path |
| Expand/contract | Renames, removals, type changes, `NOT NULL` | Split the change across multiple deploys |

Additive only. Adding a column or table is safe because it touches the catalog, not the data — provided the default is a constant. Adding a column with a default that gets backfilled into existing rows used to rewrite the table; modern Postgres stores constant defaults as metadata, but a default: -> { "gen_random_uuid()" } is volatile and still rewrites. Keep adds boring.

Concurrent index. A plain CREATE INDEX locks the table against writes until it finishes. CONCURRENTLY builds it without blocking writes, at the cost of two table scans and no transaction wrapper:

class AddIndexToOrdersOnCustomerId < ActiveRecord::Migration[7.1]
  disable_ddl_transaction!

  def change
    add_index :orders, :customer_id, algorithm: :concurrently
  end
end

Forget disable_ddl_transaction! and the migration fails: Rails wraps it in a transaction, and Postgres refuses to run CREATE INDEX CONCURRENTLY inside a transaction block.

Batched backfill. Never UPDATE a large table in one statement inside a migration; you’ll hold row locks and bloat the WAL. Loop in batches, ideally in a separate migration or a one-off task so the schema change and the data change can fail independently.
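One way to sketch the batching is to cut the primary-key span into fixed-size ranges and update one range at a time. The helper name id_batches is invented here, and the Order model, currency column and pause length in the usage comment are hypothetical.

```ruby
# Split an id span into inclusive ranges covering at most batch_size ids each.
# Returns [] for an empty table (nil bounds) or an inverted span.
def id_batches(min_id, max_id, batch_size)
  raise ArgumentError, "batch_size must be positive" unless batch_size.positive?
  return [] if min_id.nil? || max_id.nil? || min_id > max_id

  (min_id..max_id).step(batch_size).map do |first|
    first..[first + batch_size - 1, max_id].min
  end
end

# Hypothetical usage from a one-off task in a Rails app:
#
#   id_batches(Order.minimum(:id), Order.maximum(:id), 5_000).each do |ids|
#     Order.where(id: ids, currency: nil).update_all(currency: "USD")
#     sleep 0.1 # assumed pause so replicas and autovacuum can keep up
#   end
```

Each UPDATE touches a bounded number of rows and commits on its own, so row locks stay short. The `currency: nil` filter makes a re-run skip rows that are already filled. Sparse id spaces produce some nearly empty batches, which is usually an acceptable trade for not loading ids into memory.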

Expand/contract. The meta-move. Any destructive or transforming change becomes a sequence: first expand the schema so old and new code both work, deploy, migrate the data, then contract by removing the old shape. I’ll walk a full example below.

Pitfalls and anti-patterns

The classic killer is add_column :users, :status, :string, null: false, default: "active" shipped in the same deploy as code that reads users.status. The migration is fine. The deploy ordering is not: the new code can reach a server before the migration finishes, or the migration can finish before old code drains, and one of them sees a column that doesn’t match its expectations.

Renaming a column is the trap everyone falls into once. rename_column is a single fast DDL, so it looks safe. But the running app still references the old name. The instant the rename commits, every in-flight request using the old name throws. Renames are never atomic at the application layer, only at the database layer.

Removing a column has a subtler version of the same bug. Rails caches the column list at boot. If you drop a column that old, still-running processes think exists, their INSERT statements break. The fix is ignored_columns:

class User < ApplicationRecord
  self.ignored_columns += ["legacy_token"]
end

Deploy that first, then drop the column in a later deploy.

Other reliable foot-guns: adding a NOT NULL constraint that validates the whole table under a strong lock; adding a foreign key without validate: false; wrapping a CONCURRENTLY index in the default transaction. The strong_migrations gem catches most of these at migration time and is worth adding on day one: it turns tribal knowledge into a migration that refuses to run.
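For the NOT NULL case, one commonly used sequence moves the full-table scan under a weak lock by going through a CHECK constraint first. The sketch below only builds the SQL so the order is visible. The function name and the constraint naming scheme are invented for illustration, identifiers are not quoted, and it assumes Postgres 12 or newer, where SET NOT NULL can rely on an already validated CHECK constraint instead of scanning the table.

```ruby
# Build the statements for adding NOT NULL to an existing column without a
# long ACCESS EXCLUSIVE hold. Assumes Postgres 12+.
def safe_not_null_steps(table, column)
  constraint = "#{table}_#{column}_not_null" # hypothetical naming convention
  [
    # Brief strong lock; existing rows are not checked yet.
    "ALTER TABLE #{table} ADD CONSTRAINT #{constraint} CHECK (#{column} IS NOT NULL) NOT VALID",
    # Full scan, but under SHARE UPDATE EXCLUSIVE, so reads and writes continue.
    "ALTER TABLE #{table} VALIDATE CONSTRAINT #{constraint}",
    # Brief strong lock; the validated constraint already proves there are no NULLs.
    "ALTER TABLE #{table} ALTER COLUMN #{column} SET NOT NULL",
    # The CHECK constraint is now redundant.
    "ALTER TABLE #{table} DROP CONSTRAINT #{constraint}"
  ]
end

puts safe_not_null_steps("users", "status")
```

The first two statements need to run in separate transactions, otherwise the strong lock taken by the first is held for the whole validation. In a Rails 6.1+ migration the likely equivalents are add_check_constraint with validate: false, validate_check_constraint, change_column_null and remove_check_constraint, with the first step in its own migration.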

A worked example: renaming a column safely

Say you want to rename users.name to users.full_name. Here is how the expand/contract pattern looks, spread across three deploys. I’m describing the shape of the sequence, not a specific project — the structure is the same every time.

Deploy 1 — expand. Add the new column, additive and safe. Ship code that writes to both columns and reads from the old one. The schema now has both shapes; the app behaves as before.

class AddFullNameToUsers < ActiveRecord::Migration[7.1]
  def change
    add_column :users, :full_name, :string
  end
end

Backfill. In a separate step, copy existing data in batches:

# Only touch rows that still need the copy, so a re-run skips finished work.
User.unscoped.where(full_name: nil).in_batches(of: 5_000) do |batch|
  batch.update_all("full_name = name")
end

This runs outside the deploy path. If it dies halfway, you re-run it; nothing user-facing breaks because the app still reads name.

Deploy 2 — switch reads. Now that full_name is populated and kept in sync, ship code that reads from full_name. Still writing to both. At this point name is dead weight but harmless.

Deploy 3 — contract. Stop the dual-write and add name to ignored_columns, then deploy. Once no running process still knows about name, drop the column in a follow-up migration.

The painful renames are the ones I tried to do in one deploy because “it’s just a rename.” It is never just a rename.

— Self note

Three deploys and a follow-up migration for one rename feels absurd until the first time it saves you a 2 a.m. incident. The cost is calendar time and discipline, not engineering difficulty. (How much of this rigor a given project warrants is really a scoping question; early MVPs can often skip it.)

What done looks like

You’ve done this right when the deploy is unremarkable. Specifically:

The migration acquired only short-lived locks. You can watch this live in pg_stat_activity while the migration runs, and confirm it afterwards in the Postgres logs (with log_lock_waits enabled) and your slow-query logs: no long ACCESS EXCLUSIVE holds, no lock queues.
Old and new code both ran correctly against the schema at every intermediate state. There was never a moment where a running process saw a column shape it didn’t expect.
The data migration was idempotent and resumable. Re-running the backfill changed nothing the second time.
Nothing required a maintenance window, a read-only mode, or a “deploying, back in 5” banner.
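For the lock check in the first item, a query along these lines is one option. pg_stat_activity only shows sessions that exist right now, so it is something to run while the migration is in flight. The constant and helper names are invented for this sketch; the columns and pg_blocking_pids are standard in Postgres 9.6 and later.

```ruby
# Sessions currently waiting on a lock, and the pids blocking them.
BLOCKED_SESSIONS_SQL = <<~SQL
  SELECT pid,
         pg_blocking_pids(pid) AS blocked_by,
         now() - query_start   AS running_for,
         left(query, 80)       AS query
  FROM pg_stat_activity
  WHERE wait_event_type = 'Lock'
  ORDER BY query_start
SQL

# Format one result row (a Hash with string keys) as a single line.
def format_blocked(row)
  "pid #{row['pid']} blocked by #{row['blocked_by']} " \
    "for #{row['running_for']}: #{row['query']}"
end

# Hypothetical usage from a Rails console while a migration runs:
#
#   ActiveRecord::Base.connection.select_all(BLOCKED_SESSIONS_SQL).each do |row|
#     puts format_blocked(row)
#   end
```

An empty result during the migration is the outcome you want. Rows whose blocked_by points at the migration's pid are the lock queue forming.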

The honest test is reversibility. At each step, could you have rolled back the code deploy without a schema rollback, and vice versa? If yes, the change was genuinely decoupled. If a rollback would have stranded the app against an incompatible schema, you skipped a step.
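The reversibility test can be modeled as a toy check: list the columns each code version touches and the columns each schema state has, then require every code version that may be live during a step to fit that step's schema. Everything here (the version names, the plan format, the helper names) is invented for this sketch, using the users.name to users.full_name rename as the example.

```ruby
require "set"

# Columns each code version reads or writes.
CODE = {
  original:   Set[:name],
  dual_write: Set[:name, :full_name],
  new_only:   Set[:full_name]
}.freeze

# Columns present in each schema state.
SCHEMA = {
  before:     Set[:name],
  expanded:   Set[:name, :full_name],
  contracted: Set[:full_name]
}.freeze

def compatible?(code, schema)
  CODE.fetch(code).subset?(SCHEMA.fetch(schema))
end

# A plan is a list of [schema_state, code_versions_that_may_be_live] pairs.
# It is safe when every live version works against the schema at that step.
def safe_plan?(plan)
  plan.all? do |schema, versions|
    versions.all? { |version| compatible?(version, schema) }
  end
end

EXPAND_CONTRACT = [
  [:expanded,   %i[original dual_write]],
  [:expanded,   %i[dual_write new_only]],
  [:contracted, %i[new_only]]
].freeze

ONE_SHOT_RENAME = [[:contracted, %i[original new_only]]].freeze

puts "expand/contract safe? #{safe_plan?(EXPAND_CONTRACT)}"
puts "one-shot rename safe? #{safe_plan?(ONE_SHOT_RENAME)}"
```

The one-shot rename fails because the original code is still live against a schema that no longer has name, which is the same failure the rename pitfall describes.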

When this doesn’t apply

This whole apparatus is overhead, and overhead you don’t need is just cost. Pre-launch, with no production traffic and no users to inconvenience, run whatever migration you like and reset the database if it goes wrong — the expand/contract dance is pure ceremony there.

It also breaks down at extreme scale, where even a metadata-only DDL change can stall behind autovacuum or a long-running transaction, and you need lock timeouts and retry logic on top of these moves. And some changes — partitioning a hot table, a major type change on a huge column — genuinely warrant a planned window. Knowing which case you’re in is the actual skill.
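The retry logic mentioned above can be sketched as a small wrapper: give up quickly when the lock is not available, back off, and try again. The helper name and its parameters are invented for illustration. The error class is passed in so the sketch runs without Rails; in a Rails app on Postgres it would likely be ActiveRecord::LockWaitTimeout, which is an assumption about your adapter and version.

```ruby
# Run the block, retrying with exponential backoff when it raises error_class.
# Re-raises the error once `attempts` tries have failed.
def with_lock_retries(error_class:, attempts: 5, base_delay: 1.0, sleeper: method(:sleep))
  tries = 0
  begin
    tries += 1
    yield tries
  rescue error_class
    raise if tries >= attempts

    sleeper.call(base_delay * (2**(tries - 1))) # base_delay, then 2x, 4x, ...
    retry
  end
end

# Hypothetical usage inside a migration that calls disable_ddl_transaction!:
#
#   with_lock_retries(error_class: ActiveRecord::LockWaitTimeout) do
#     execute "SET lock_timeout = '2s'"
#     add_column :users, :full_name, :string
#   end
```

The short lock_timeout is what keeps the migration from parking in the lock queue and making every later query wait behind it. The wrapper needs to run outside the default DDL transaction, because a failed statement aborts the surrounding transaction and the retry would fail with it.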

The claim

Here’s something falsifiable to take away: if a Rails migration holds an ACCESS EXCLUSIVE lock for longer than it takes Postgres to update a few catalog rows, it is either rewriting a table or waiting on one — and you will see exactly which in pg_stat_activity, not in code review. Every safe migration I’ve written keeps that lock window down in the milliseconds. If yours doesn’t, you haven’t found a clever exception; you’ve found the rewrite you didn’t know was there.