Deadlock during cleanup when --attempt-instant-ddl succeeds (GhostTableMigrated signal never drained) #1735

@peterbollen

Overview

When --attempt-instant-ddl is used and the instant ALTER succeeds, gh-ost can deadlock during final cleanup and hang forever instead of exiting.

Root cause

initiateApplier writes a GhostTableMigrated changelog row. The streamer's changelog listener callback (Migrator.onChangelogStateEvent) publishes that signal synchronously via base.SendWithContext(ctx, mgtr.ghostTableMigrated, true). The send happens inside notifyListeners, which holds EventsStreamer.listenersMutex for the duration of the callback, so the mutex stays held until the send completes.

On the normal migration path there is a dedicated receiver:

if !mgtr.migrationContext.Resume {
    <-mgtr.ghostTableMigrated
}

But on the instant-DDL success path (go/logic/migrator.go), Migrate() returns early right after finalCleanup() and never receives from ghostTableMigrated:

if err := mgtr.applier.AttemptInstantDDL(); err == nil {
    if err := mgtr.finalCleanup(); err != nil {
        return nil
    }
    ...
    return nil
}

So the changelog send blocks forever, keeping listenersMutex held. finalCleanup() then closes the binlog reader. The reader's rows-event decode callback (EventsStreamer.shouldDecodeRowsEvent) needs the same mutex, and BinlogSyncer.Close() waits (via a WaitGroup) for that goroutine to exit. The result is a permanent deadlock: gh-ost hangs and never completes the migration.
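To make the blocking pattern concrete, here is a minimal standalone sketch, not gh-ost code. The streamer type, lockWithin, and simulate are hypothetical names. It assumes the signal channel is unbuffered and ignores the context passed to SendWithContext. It only shows that a send made while holding a mutex keeps that mutex held until something receives.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// streamer is a hypothetical stand-in for EventsStreamer: it runs its
// listener callback while holding a mutex, as the issue describes.
type streamer struct {
	listenersMutex sync.Mutex
}

// notifyListeners holds the mutex for the whole duration of the callback.
func (s *streamer) notifyListeners(callback func()) {
	s.listenersMutex.Lock()
	defer s.listenersMutex.Unlock()
	callback()
}

// lockWithin polls the mutex and reports whether it could be acquired
// before the timeout. It stands in for any other code path that needs
// the same mutex (shouldDecodeRowsEvent in the issue).
func (s *streamer) lockWithin(timeout time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if s.listenersMutex.TryLock() {
			s.listenersMutex.Unlock()
			return true
		}
		time.Sleep(time.Millisecond)
	}
	return false
}

// simulate starts a callback that sends on an unbuffered channel while the
// mutex is held. If drain is true, the caller receives the signal first.
// It returns whether a second user of the mutex could make progress.
func simulate(drain bool) bool {
	s := &streamer{}
	ghostTableMigrated := make(chan bool) // assumption: unbuffered
	started := make(chan struct{})

	go s.notifyListeners(func() {
		close(started)             // the mutex is held from here on
		ghostTableMigrated <- true // blocks until someone receives
	})

	<-started
	if drain {
		<-ghostTableMigrated // lets the callback return and unlock
	}
	return s.lockWithin(200 * time.Millisecond)
}

func main() {
	fmt.Println("mutex free without receiver:", simulate(false))
	fmt.Println("mutex free with receiver:   ", simulate(true))
}
```

Without a receiver the second mutex user never gets the lock within the timeout, which corresponds to the decode callback in the issue being stuck while BinlogSyncer.Close() waits for it.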

Reproduction

Run an instant-DDL-eligible migration against MySQL 8.0 with --attempt-instant-ddl, e.g. adding a column with a default:

gh-ost --attempt-instant-ddl --execute \
  --alter="ADD COLUMN c INT NOT NULL DEFAULT 1" \
  --host=127.0.0.1 --port=3306 --database=db --table=t ...

gh-ost applies the instant DDL, logs the "migrated instantly" success, but then hangs in cleanup instead of exiting.

Proposed fix

Drain the GhostTableMigrated signal on the instant-DDL success path before finalCleanup(), mirroring the receive already present on the normal path, guarded by !Resume (resume migrations never emit the signal). I have a PR ready with the fix plus regression tests.
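The shape of the proposed change can be sketched as follows. This is an illustration under stated assumptions, not the actual PR: the migrator struct, its function fields, migrateInstant, and demo are hypothetical stand-ins, and the error handling is simplified relative to the snippet quoted above.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// migrator is a hypothetical, trimmed-down stand-in for gh-ost's Migrator.
type migrator struct {
	resume             bool
	ghostTableMigrated chan bool
	attemptInstantDDL  func() error
	finalCleanup       func() error
}

// migrateInstant sketches the instant-DDL success path with the drain added.
// It returns handled=false when the caller should fall back to a regular
// migration.
func (m *migrator) migrateInstant() (handled bool, err error) {
	if err := m.attemptInstantDDL(); err != nil {
		return false, nil
	}
	// Drain the signal before cleanup so the changelog sender can finish.
	// Resumed migrations never emit it, so receiving would block forever.
	if !m.resume {
		<-m.ghostTableMigrated
	}
	if err := m.finalCleanup(); err != nil {
		return true, err
	}
	return true, nil
}

// demo wires up a fake sender and a cleanup step that fails if the sender
// is still blocked, then runs the sketched path.
func demo(resume bool) error {
	m := &migrator{resume: resume, ghostTableMigrated: make(chan bool)}
	senderDone := make(chan struct{})
	if resume {
		close(senderDone) // no signal is emitted on resume
	} else {
		go func() {
			m.ghostTableMigrated <- true
			close(senderDone)
		}()
	}
	m.attemptInstantDDL = func() error { return nil }
	m.finalCleanup = func() error {
		select {
		case <-senderDone:
			return nil
		case <-time.After(time.Second):
			return errors.New("changelog sender still blocked during cleanup")
		}
	}
	_, err := m.migrateInstant()
	return err
}

func main() {
	fmt.Println("fresh migration:", demo(false))
	fmt.Println("resumed migration:", demo(true))
}
```

Placing the receive before finalCleanup() matters in this sketch: cleanup only proceeds once the sender has been released.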
