- Cutover would stall after `lock tables` wait-timeout due do waiting on a channel that would never be written to. This has been identified, reproduced, fixed, confirmed.
- Change of table names. Heres the story:
- Because were testing this even while `pt-online-schema-change` is being used in production, the `_tbl_old` naming convention makes for a collision.
- "old" table name is now `_tbl_del`, "del" standing for "delete"
- ghost table name is now `_tbl_gho`
- when issuing `--test-on-replica`, we keep the ghost table around, and were also briefly renaming original table to "old". Well this collides with a potentially existing "old" table on master (one that hasnt been dropped yet).
`--test-on-replica` uses `_tbl_ght` (ghost-test)
- similar problem with `--execute-on-replica`, and in this case the table doesnt stick around; calling it `_tbl_ghr` (ghost-replica)
- changelog table is now `_tbl_ghc` (ghost-changelog)
- To clarify, I dont want to go down the path of creating "old" tables with 2 or 3 or 4 or 5 or infinite leading underscored. I think this is very confusing and actually not operations friendly. Its OK that the migration will fail saying "hey, you ALREADY have an old table here, why dont you take care of it first", rather than create _yet_another_ `____tbl_old` table. Were always confused on which table it actually is that gets migrated, which is safe to `drop`, etc.
- just after rowcopy completing, just before cutover, during cutover: marking as point in time _of interest_ so as to increase logging frequency.
- Rowcopy time is bounded by copy end-time
- Retries are configurable via `--default-retries` (default: `60`)
- `migrator` notes the hostname
- `applier` and `inspector` note `impliedKey` (`@@hostname` and `@@port`)
- Added lots of code comments
- Adding documentation for "triggerless design"
- Supporting multi-step, safe cut-over phase, where queries are blocked throughout the phase, and worst case scenario is table outage (no data corruption)
- Self-rollsback in case of failure (restored original table)
The idea is that the user is forced to specify the cut-over type they wish to use, given that each type has some drawbacks.
- More data in status hint
- `select count(*)` is deferred till after we validate migration is valid. Also, it is skipped on `--noop`
- Supporting `--allow-nullable-unique-key`
- Tool will bail out if chosen key has nullable columns and the above is not provided
- Fixed `OriginalBinlogRowImage` comaprison (lower/upper case issue)
- Introduced reasonable streamer reconnect sleep time
- Fixed version printing
- `rowCopyCompleteFlag` is a hint that allows us to escape the infinite loop of rowcopy once we are sure we have reached the end
- Operation would terminate after events lock noticed but before applying all events: race condition where the event would be captured asynchronously. The event is now handled sequentially with the DML events, hence now safe.
- Multiple rowcopy operations would still write to `rowCopyComplete` channel. This is still the case, but now we only wait for the first and then just flush (read and discard) any others, to avoid blocking
- Events DML listener is only added after table creation: the problem was that with very busy tables, the events func buffer would fill up, and the "tables-created" event would be blocked.
- `waitForEventsUpToLock()` unifies the waiting on all variants of complete-migration
- With `--test-on-replica`, now stopping replication "nicely", using `master_pos_wait()`
- With `--test-on-replica`, not throttling on replication after replication is stopped (duh)
- More debug output
- when `--test-on-replica`, the tested replica is implicitly a control replica
- added `replication-lag-query`, an alternate query to `SHOW SLAVE STATUS` to get replication lag
- throttling takes both the above into consideration
- Introduced `SwapTablesTimeoutSeconds`; `RENAME` is limited by this timeout
- If `RENAME` fails (due to the above), we throttle and retry
- `SwapTablesAtomic()` sets `lock_wait_timeout` and notifies with connection id
- `GrabVoluntaryLock()` intentionally grabs (and later releases) voluntary lock. It notifies when it is taken and awaits instructions as for when it could be released.
- `IssueBlockingQueryOnVoluntaryLock()` does what it says. It notifies with its connection_id so that it can be easily traced
- `stopWritesAndCompleteMigrationOnMasterViaLock()` does the thang. Oh dear this was agonizing and the code is a pain to look at, though under the limitations I do believe it is as clean as I could hope for.