Architecture
Zero-Downtime Tenant Migrations with Native Replication
Moving a tenant's database between servers without dropping a single query. Here is how native replication, connection draining, and proxy routing make zero-downtime migrations work.
The problem with traditional database migrations
Moving a database from one server to another typically involves downtime. The standard approach is: stop writes, take a backup, restore on the new server, verify, switch connections. Depending on the database size, this process takes minutes to hours. During that window, the application is either down or running in read-only mode.
For a single-database application, scheduled maintenance windows make this acceptable. For a multi-tenant platform where each tenant expects continuous availability, downtime is not an option. You cannot tell 200 customers that their databases will be unavailable for 30 minutes while you move one tenant to a new server.
The solution is native database replication. Instead of stop-backup-restore, the migration uses the database engine's built-in replication protocol to synchronize data in real time. The tenant's application keeps running on the source server while the target server catches up. The actual switch happens in seconds, not minutes.
How native replication works per engine
Each database engine has its own replication protocol. The migration system uses whichever protocol is native to the engine being migrated. No intermediate format, no export-import, no custom synchronization code.
PostgreSQL: Streaming replication. The target server connects to the source as a streaming replica. It receives the write-ahead log (WAL) in real time and applies every transaction as it happens on the source. The replica stays within seconds of the primary at all times. When replication lag reaches zero, the migration is ready for cutover.
PostgreSQL streaming replication is the same mechanism used for high-availability setups and read replicas in production. It is battle-tested, supports large datasets, and handles schema changes during replication.
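As a rough sketch, establishing the target as a streaming replica could look like the following. The hostname, user, and data directory are illustrative, not TenantsDB specifics:

```shell
# Take a base backup from the source and generate standby configuration
# (--write-recovery-conf writes primary_conninfo into postgresql.auto.conf).
pg_basebackup \
  --host=source.db.internal \
  --username=replicator \
  --pgdata=/var/lib/postgresql/data \
  --wal-method=stream \
  --write-recovery-conf

# Start the target; it now tails the source's WAL as a streaming replica.
pg_ctl -D /var/lib/postgresql/data start

# On the source, check per-replica lag in bytes:
psql -h source.db.internal -c "SELECT application_name, \
  pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes \
  FROM pg_stat_replication;"
```

When `lag_bytes` reaches zero and stays there, the replica has applied everything the primary has written.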
MySQL: Binary log replication. The target server connects to the source and reads the binary log, which records every data modification. Each event is replayed on the target in order. MySQL replication supports both statement-based and row-based formats. The migration uses row-based replication for consistency, ensuring that every row on the target matches the source exactly.
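A sketch of attaching the target to the source's binary log, assuming MySQL 8.0+ with GTIDs enabled and a replication account already created on the source (hostnames and credentials are illustrative):

```sql
-- Row-based format on the source guarantees row-for-row parity:
--   binlog_format = ROW   (in the source's my.cnf)

-- On the target, point replication at the source:
CHANGE REPLICATION SOURCE TO
  SOURCE_HOST = 'source.db.internal',
  SOURCE_USER = 'replicator',
  SOURCE_PASSWORD = '...',
  SOURCE_AUTO_POSITION = 1;   -- GTID-based positioning

START REPLICA;

-- Inspect Seconds_Behind_Source to measure replication lag:
SHOW REPLICA STATUS;
```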
MongoDB: Replica set synchronization. MongoDB's replication is built around the oplog, an ordered log of every write operation. The target instance connects and tails the oplog, applying each operation as it arrives. Initial sync copies the full dataset, then the oplog keeps the target current. MongoDB handles this natively as part of its replica set protocol.
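In replica-set terms, the sketch below (run in mongosh against the source's replica set) shows how adding the target as a non-voting member would trigger exactly this sequence: initial sync, then continuous oplog tailing. The host name is illustrative:

```javascript
// Add the target as a hidden-from-elections member: it syncs data and
// tails the oplog but cannot become primary or vote.
rs.add({ host: "target.db.internal:27017", priority: 0, votes: 0 })

// Per-member replication progress; optimes converge as the target catches up.
rs.printSecondaryReplicationInfo()
```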
Redis: Replication via REPLICAOF. Redis replication starts with a full dataset transfer (RDB snapshot), followed by continuous command streaming. Every write command executed on the source is forwarded to the replica in real time. Redis replication is fast and low-overhead, making it suitable for migrating even high-throughput key-value workloads.
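A sketch of the same flow via redis-cli, with an illustrative source address:

```shell
# Triggers a full RDB transfer, then continuous command streaming.
redis-cli -h target.db.internal REPLICAOF source.db.internal 6379

# Watch sync state on the target:
redis-cli -h target.db.internal INFO replication
# master_link_status:up plus matching master_repl_offset and
# slave_repl_offset values indicate the replica is current.
```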
In every case, the migration uses the protocol that the database engine was designed around. There is no custom synchronization layer adding complexity or risk.
The migration timeline
A typical migration follows five phases. The total elapsed time depends on data volume, but the application impact is limited to the cutover phase, which pauses queries for under 2 seconds.
Phase 1: Provisioning (60 to 90 seconds). A new server is provisioned in the target environment. The database engine is installed and configured. TLS certificates are generated. The server is health-checked and ready to receive data.
Phase 2: Initial sync (varies by data size). The full dataset is transferred from source to target. For PostgreSQL, this is a base backup followed by WAL streaming. For MongoDB, this is the initial sync phase. For MySQL, this is a snapshot followed by binlog streaming. For Redis, this is the RDB transfer.
During this phase, the tenant's application continues running normally on the source server. Reads and writes are unaffected. The target server is not yet serving traffic.
Phase 3: Catch-up replication (seconds to minutes). After the initial sync, the target server stays in sync with ongoing writes through continuous replication. The replication lag, measured as the delay between a write on the source and its application on the target, decreases as the target catches up. The system monitors replication lag and waits for it to approach zero.
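The wait-for-catch-up step can be sketched as a simple polling loop. This is illustrative, not TenantsDB internals: `check_lag` stands in for whatever engine-specific measurement applies (WAL positions, binlog positions, oplog timestamps, or Redis offsets), normalized to seconds.

```python
import time

def wait_for_catchup(check_lag, threshold=1.0, poll_interval=1.0, timeout=300.0):
    """Poll replication lag (in seconds) until it drops below `threshold`.

    `check_lag` is a caller-supplied function returning the current lag.
    Returns True once the target is close enough for cutover, or False
    if the timeout elapses first.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check_lag() < threshold:
            return True
        time.sleep(poll_interval)
    return False

# Simulated lag that shrinks on each poll, standing in for a real measurement:
lags = iter([42.0, 10.0, 3.5, 0.4])
print(wait_for_catchup(lambda: next(lags), poll_interval=0.0))  # True
```

The timeout matters: if lag never converges (for example, write volume exceeds replication throughput), the migration should fall back rather than wait forever.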
Phase 4: Cutover (under 2 seconds). This is the only phase where the tenant's application is affected. The proxy drains active queries for the tenant, allowing in-flight queries to complete but holding new queries briefly. Once all active queries have finished, the proxy switches the routing from the source server to the target server. New queries are released and execute against the target.
The drain-and-switch process is designed to be as brief as possible. In practice, the pause is under 2 seconds for most workloads. The tenant's application experiences a brief delay, not an error. Connections are not dropped. Queries are held, not rejected.
Phase 5: Cleanup. The source server's copy of the tenant's data is no longer needed. For shared-to-dedicated migrations, the tenant's database on the shared server is removed. For dedicated-to-dedicated migrations (region changes), the old VM is terminated. A safety backup taken before the migration is retained for a configurable period.
The drain process
The cutover phase depends on a clean drain of active queries. This ensures that no query is mid-execution when the routing switches.
When the system determines that replication lag is near zero, it initiates a drain for the specific tenant being migrated. The drain works as follows:
1. The proxy stops accepting new queries for this tenant. New queries are held in a buffer.
2. Active queries that are already executing are allowed to complete. The proxy tracks the active query count per tenant.
3. Once the active query count reaches zero, the routing switch happens.
4. The buffered queries are released and execute against the new server.
The drain has a timeout. If active queries do not complete within 60 seconds, the system forces the switch. In this case, any still-running queries on the source may fail. The forced cutover is a safety mechanism to prevent migrations from hanging indefinitely due to a long-running query.
In practice, forced cutover is rare. Most tenant workloads consist of queries that complete in milliseconds. The drain window is usually under 1 second.
Only the migrating tenant is affected by the drain. All other tenants continue to execute queries normally throughout the entire process.
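The drain-and-switch sequence can be sketched as a small state machine. The names and structure below are illustrative, not TenantsDB internals: a counter and a buffer stand in for the proxy's live connection tracking.

```python
from collections import deque

class TenantDrain:
    """Minimal sketch of the per-tenant drain-and-switch described above."""

    def __init__(self, source, target):
        self.route = source        # server queries currently execute against
        self.target = target
        self.draining = False
        self.active = 0            # in-flight query count for this tenant
        self.buffer = deque()      # queries held during the drain

    def submit(self, query):
        """Route a new query, or hold it if a drain is in progress."""
        if self.draining:
            self.buffer.append(query)   # held, not rejected
            return None
        self.active += 1
        return self.route

    def query_finished(self):
        """Called when an in-flight query completes."""
        self.active -= 1
        if self.draining and self.active == 0:
            self._switch()

    def start_drain(self):
        self.draining = True
        if self.active == 0:
            self._switch()

    def _switch(self):
        self.route = self.target        # routing flips to the new server
        self.draining = False
        released = list(self.buffer)    # buffered queries now run on target
        self.buffer.clear()
        return released

d = TenantDrain("source-db", "target-db")
d.submit("SELECT 1")    # executes on source-db
d.start_drain()         # one query still in flight: hold new ones
d.submit("SELECT 2")    # buffered, not rejected
d.query_finished()      # last in-flight query done -> switch
print(d.route)          # target-db
```

A real implementation would add the 60-second forced-cutover timer around `start_drain`, but the hold-complete-switch-release ordering is the core of the mechanism.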
Safety mechanisms
Database migrations carry inherent risk. The system includes multiple safety mechanisms to ensure that a failed migration does not result in data loss.
Pre-migration backup. Before replication begins, an automated backup of the tenant's database is taken and stored in S3. If anything goes wrong during the migration, the backup provides a recovery point.
Replication verification. Before cutover, the system verifies that the target server has received all data. For PostgreSQL, this means confirming that the WAL position on the target matches the source. For MongoDB, this means confirming that the oplog timestamp on the target matches the source. For MySQL, this means confirming binlog position parity.
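For PostgreSQL, this parity check reduces to comparing 64-bit WAL positions. A sketch of the comparison, with hypothetical LSN values (in practice the source value comes from `pg_current_wal_lsn()` and the target value from `pg_last_wal_replay_lsn()`):

```python
def parse_lsn(lsn):
    """Parse a PostgreSQL WAL position like '16/B374D848' into an integer.

    The text form is two hex fields: the high and low 32 bits of the
    64-bit position. Comparing the integers tells you whether the target
    has replayed everything the source has written.
    """
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def caught_up(source_lsn, target_lsn):
    return parse_lsn(target_lsn) >= parse_lsn(source_lsn)

print(caught_up("16/B374D848", "16/B374D848"))  # True: safe to cut over
print(caught_up("16/B374D848", "16/B3740000"))  # False: target still behind
```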
Automatic fallback. If native replication fails to establish or falls too far behind, the system automatically falls back to a backup-and-restore approach. The pre-migration backup is restored on the target server. This is slower but guarantees that the migration completes even when replication encounters issues.
Rollback window. After a successful migration from shared to dedicated infrastructure, the original data on the shared server is retained for a configurable period. If an issue is discovered after cutover, the routing can be switched back to the source server.
What the application sees
From the application's perspective, a migration looks like a brief increase in query latency during the cutover phase. Connections are not dropped. Connection strings do not change. No application code changes are required.
The brief latency increase during cutover is the only visible effect. All subsequent queries route to the new server at normal latency. If the dedicated server is in a region closer to the tenant's users, latency may actually decrease after migration.
Migration types
The same replication-based migration handles multiple scenarios.
Shared to dedicated (L1 to L2). The tenant's database moves from a shared server to a dedicated VM. This is the most common migration type, triggered when a tenant needs dedicated resources, a specific region, or physical isolation for compliance.
Dedicated to dedicated (L2 to L2). The tenant's database moves between dedicated VMs. This is used for region changes, such as moving a tenant's database from Europe to the US, or for infrastructure upgrades.
Dedicated to shared (L2 to L1). The tenant's database moves from a dedicated VM back to shared infrastructure. This is used when dedicated infrastructure is no longer needed, such as after a compliance audit period ends or a high-traffic season passes.
All three migration types use the same replication-based process. The direction of the migration does not change the mechanism.
Per-database granularity
A tenant with multiple databases can migrate them independently. If a tenant has PostgreSQL and MongoDB, you can promote PostgreSQL to dedicated while keeping MongoDB on shared infrastructure.
Each database migrates independently using its own engine's replication protocol. There is no dependency between them. You can migrate them simultaneously or sequentially.
Monitoring a migration
The migration status is visible through the CLI and API at every phase.
The status field reflects the current phase:
migrating_sync: Data is being replicated. The percentage indicates initial sync progress. The status stays migrating_sync while catch-up replication runs after the initial sync.
migrating: Cutover is in progress. This status lasts only seconds.
ready: Migration complete. The tenant is live on the new server.
If a migration fails, the status changes to failed with an error message. The tenant remains on the original server, unaffected. The pre-migration backup is available for manual recovery if needed.
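A client watching these statuses can be sketched as a polling loop. `get_status` here is an assumed stand-in for the CLI or API call; only the status values are from the documentation above:

```python
import time

TERMINAL = {"ready", "failed"}

def watch_migration(get_status, poll_interval=1.0, timeout=600.0):
    """Poll until the migration reaches a terminal status.

    `get_status` returns one of: 'migrating_sync', 'migrating',
    'ready', 'failed'. Returns the final status, or 'timeout' if
    the deadline passes first.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status in TERMINAL:
            return status
        time.sleep(poll_interval)
    return "timeout"

# Simulated status sequence for a successful migration:
states = iter(["migrating_sync", "migrating_sync", "migrating", "ready"])
print(watch_migration(lambda: next(states), poll_interval=0.0))  # ready
```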
The engineering behind the simplicity
A zero-downtime migration that takes one command to execute requires significant engineering underneath. Connection draining per tenant without affecting other tenants. Replication monitoring per database engine. Automatic fallback from replication to backup-restore. Proxy routing updates that take effect mid-connection without dropping sessions. Safety backups before every migration. Status tracking across asynchronous processes.
This is infrastructure that each database engine handles differently. PostgreSQL streaming replication has different failure modes than MongoDB oplog tailing. MySQL binary log replication has different lag measurement than Redis command streaming. The migration system normalizes these differences behind a single interface.
Building this yourself is possible. The replication protocols are documented and well-understood. The challenge is not any single component. It is the integration of all components into a reliable, automated process that works across engines, handles failures gracefully, and never loses data.
TenantsDB handles this for PostgreSQL, MySQL, MongoDB, and Redis through a single command. Every migration uses native replication, includes safety backups, supports automatic fallback, and completes with under 2 seconds of application impact.
Start free with up to 5 tenants at docs.tenantsdb.com.