Merge branch 'master' into tz-a-different-approach
commit 5d92da4a74
@@ -43,6 +43,16 @@ password=123456

See `exact-rowcount`

### critical-load-interval-millis

`--critical-load` defines a threshold that, when met, causes `gh-ost` to panic and bail out. The default behavior is to bail out immediately upon meeting this threshold.

This may sometimes lead to a migration bailing out on a very short spike which, while impacting production and worth investigating in its own right, isn't reason enough to kill a 10-hour migration.

When `--critical-load-interval-millis` is specified (e.g. `--critical-load-interval-millis=2500`), `gh-ost` gives a second chance: when it meets the `critical-load` threshold, it doesn't bail out. Instead, it starts a timer (in this example: `2.5` seconds) and re-checks `critical-load` when the timer expires. If `critical-load` is met again, `gh-ost` panics and bails out. If not, execution continues.

This is somewhat similar to a Nagios `n`-times test, where `n` in our case is always `2`.
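
The behavior described above boils down to a check, wait, re-check pattern. Below is a minimal, self-contained sketch of that idea; it is not `gh-ost`'s actual implementation, and `criticalLoadMet` is a hypothetical callback standing in for whatever reads the relevant status variable.

```go
package main

import (
    "fmt"
    "time"
)

// shouldBailOut sketches the second-chance behavior described above.
// criticalLoadMet is a hypothetical callback reporting whether the
// critical-load threshold is currently exceeded.
func shouldBailOut(criticalLoadMet func() bool, interval time.Duration) bool {
    if !criticalLoadMet() {
        return false // below threshold; keep migrating
    }
    if interval == 0 {
        return true // default behavior: bail out on the first hit
    }
    time.Sleep(interval)     // e.g. 2.5s with --critical-load-interval-millis=2500
    return criticalLoadMet() // bail out only if the threshold is met again
}

func main() {
    spike := true
    check := func() bool { defer func() { spike = false }(); return spike } // spike passes after the first check
    fmt.Println(shouldBailOut(check, 2500*time.Millisecond))                // false: the spike was gone on re-check
}
```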
### cut-over

Optional. Default is `safe`. See more discussion in [cut-over](cut-over.md)
@@ -2,7 +2,7 @@

Even though `gh-ost` relies on Row Based Replication (RBR), that does not mean you can't keep your Statement Based Replication (SBR).

`gh-ost` is happy to connect to a replica, and in fact prefers and suggests doing so. On this replica, it is happy to:

- issue the heavyweight `INFORMATION_SCHEMA` queries that analyze the table structure
- issue a `select count(*) from mydb.mytable`, should `--exact-rowcount` be provided
- connect itself as a fake replica to get the binary log stream (see the sketch after this list)
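
To make the third item concrete, here is a minimal sketch of registering as a fake replica and streaming row events, using the `go-mysql` replication package. The host, credentials and binlog coordinates are placeholders, and this is only an illustration of the technique, not `gh-ost`'s exact code path.

```go
package main

import (
    "context"
    "fmt"

    "github.com/go-mysql-org/go-mysql/mysql"
    "github.com/go-mysql-org/go-mysql/replication"
)

func main() {
    // Present ourselves to the replica as just another replica; the server id
    // must be unique within the replication topology.
    syncer := replication.NewBinlogSyncer(replication.BinlogSyncerConfig{
        ServerID: 99999,
        Flavor:   "mysql",
        Host:     "replica.example.com", // placeholder
        Port:     3306,
        User:     "gh-ost",   // placeholder
        Password: "password", // placeholder
    })
    // Start streaming from known binlog coordinates (placeholders).
    streamer, err := syncer.StartSync(mysql.Position{Name: "mysql-bin.000123", Pos: 4})
    if err != nil {
        panic(err)
    }
    for {
        ev, err := streamer.GetEvent(context.Background())
        if err != nil {
            panic(err)
        }
        // With binlog_format=ROW, rows events are the table's changelog.
        if rows, ok := ev.Event.(*replication.RowsEvent); ok {
            fmt.Printf("%s on %s.%s: %d row(s)\n", ev.Header.EventType, rows.Table.Schema, rows.Table.Table, len(rows.Rows))
        }
    }
}
```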
@@ -11,7 +11,7 @@ All of the above can be executed on the master, but we're more comfortable that

Please note the third item: `gh-ost` connects as a fake replica and pulls the binary logs. This is how `gh-ost` finds the table's changelog: it looks up entries in the binary log.

The magic is that your master can keep producing SBR, but if you have a replica with `log-slave-updates`, you can also configure it to have `binlog_format='ROW'`. Such a replica accepts SBR statements from its master, and writes RBR statements onto its own binary logs.

`gh-ost` is happy to modify the `binlog_format` on the replica for you:

- If you supply `--switch-to-rbr`, `gh-ost` will convert the binlog format for you, and restart replication to make sure this takes effect (sketched below).
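
Roughly speaking, converting the replica and restarting replication comes down to the statements below. This is a hedged sketch using `database/sql` with the `go-sql-driver/mysql` driver and a placeholder DSN; it is not `gh-ost`'s literal code.

```go
package main

import (
    "database/sql"

    _ "github.com/go-sql-driver/mysql" // MySQL driver
)

// switchToRBR is a rough sketch of converting a replica to row based replication.
// Replication is stopped and restarted so that the new format takes effect.
func switchToRBR(db *sql.DB) error {
    for _, statement := range []string{
        "STOP SLAVE",
        "SET GLOBAL binlog_format = 'ROW'",
        "START SLAVE",
    } {
        if _, err := db.Exec(statement); err != nil {
            return err
        }
    }
    return nil
}

func main() {
    // Placeholder DSN: point it at the replica you intend to convert.
    db, err := sql.Open("mysql", "gh-ost:password@tcp(replica.example.com:3306)/")
    if err != nil {
        panic(err)
    }
    defer db.Close()
    if err := switchToRBR(db); err != nil {
        panic(err)
    }
}
```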
@@ -46,7 +46,7 @@ Note that you may dynamically change both `replication-lag-query` and the `throt

`--max-load='Threads_running=100,Threads_connected=500'`

Metrics must be valid, numeric [status variables](http://dev.mysql.com/doc/refman/5.6/en/server-status-variables.html)
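
The flag value is simply a comma-delimited list of `status-name=threshold` pairs. As an illustration only (this mirrors the documented format, not necessarily `gh-ost`'s own parser), parsing it might look like:

```go
package main

import (
    "fmt"
    "strconv"
    "strings"
)

// parseLoadMap parses a comma delimited "status-name=threshold" list,
// e.g. "Threads_running=100,Threads_connected=500".
func parseLoadMap(s string) (map[string]int64, error) {
    loadMap := make(map[string]int64)
    if s == "" {
        return loadMap, nil
    }
    for _, pair := range strings.Split(s, ",") {
        name, thresholdText, found := strings.Cut(pair, "=")
        if !found {
            return nil, fmt.Errorf("expected name=threshold, got %q", pair)
        }
        threshold, err := strconv.ParseInt(thresholdText, 10, 64)
        if err != nil {
            return nil, err
        }
        loadMap[name] = threshold
    }
    return loadMap, nil
}

func main() {
    loadMap, err := parseLoadMap("Threads_running=100,Threads_connected=500")
    if err != nil {
        panic(err)
    }
    // A status value at or above its threshold triggers throttling (--max-load)
    // or a panic (--critical-load).
    fmt.Println(loadMap)
}
```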
#### Throttle query
@@ -80,7 +80,7 @@ In addition to the above, you are able to take control and throttle the operatio

Any single factor in the above that suggests the migration should throttle causes throttling. That is, once some component decides to throttle, you cannot override it; you cannot force continued execution of the migration.

`gh-ost` collects different throttle-related metrics at different times, independently. It asynchronously reads the collected metrics and checks whether they satisfy conditions/thresholds.

The first check that suggests throttling stops the evaluation; the status message will note that first satisfied check as the reason for throttling.
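
In other words, the checks short-circuit: the first one to fire both wins and supplies the reported reason. A minimal sketch of that evaluation order follows; the check names are illustrative, not `gh-ost`'s internals.

```go
package main

import "fmt"

// throttleCheck is a hypothetical named check: it reports whether to throttle.
type throttleCheck struct {
    name string
    met  func() bool
}

// shouldThrottle walks the checks in order and stops at the first one that fires.
func shouldThrottle(checks []throttleCheck) (throttle bool, reason string) {
    for _, check := range checks {
        if check.met() {
            return true, check.name // the first satisfied check becomes the reported reason
        }
    }
    return false, ""
}

func main() {
    checks := []throttleCheck{
        {"commanded by user", func() bool { return false }},
        {"replication lag", func() bool { return true }},
        {"max-load", func() bool { return true }}, // never evaluated: lag already fired
    }
    fmt.Println(shouldThrottle(checks)) // true replication lag
}
```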
@@ -97,7 +97,7 @@ Copy: 0/2915 0.0%; Applied: 0; Backlog: 0/100; Elapsed: 42s(copy), 42s(total); s

Throttling time is limited by the availability of the binary logs. When throttling begins, `gh-ost` suspends reading the binary logs, and expects to resume reading from the same binary log where it paused.

Your availability of binary logs is typically determined by the [expire_logs_days](https://dev.mysql.com/doc/refman/5.6/en/server-system-variables.html#sysvar_expire_logs_days) variable. If you have `expire_logs_days = 10` (or check `select @@global.expire_logs_days`), then you should be able to throttle for up to `10` days.

Having said that, throttling for so long is far-fetched, in that the `gh-ost` process itself must be kept alive during that time, and the binary logs that accumulate in the meantime could take days to replay once the migration resumes.
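
If you want to check that retention ahead of time, the query mentioned above can be run from any client. A trivial sketch via `database/sql` with a placeholder DSN:

```go
package main

import (
    "database/sql"
    "fmt"

    _ "github.com/go-sql-driver/mysql" // MySQL driver
)

func main() {
    // Placeholder DSN; point it at the server whose binlog retention matters to you.
    db, err := sql.Open("mysql", "gh-ost:password@tcp(replica.example.com:3306)/")
    if err != nil {
        panic(err)
    }
    defer db.Close()

    var expireLogsDays int64
    if err := db.QueryRow("select @@global.expire_logs_days").Scan(&expireLogsDays); err != nil {
        panic(err)
    }
    // Roughly the upper bound on how long a throttled migration can wait before binlogs rotate away.
    fmt.Printf("binary logs retained for ~%d days\n", expireLogsDays)
}
```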
@@ -1,4 +1,4 @@

# Understanding gh-ost output

`gh-ost` attempts to be verbose to the point where you really know what it's doing, without completely spamming you.
You can control output levels:
@@ -90,6 +90,7 @@ type MigrationContext struct {
    ThrottleCommandedByUser int64
    maxLoad LoadMap
    criticalLoad LoadMap
    CriticalLoadIntervalMilliseconds int64
    PostponeCutOverFlagFile string
    CutOverLockTimeoutSeconds int64
    ForceNamedCutOverCommand bool
@@ -100,6 +100,7 @@ func main() {
    maxLoad := flag.String("max-load", "", "Comma delimited status-name=threshold. e.g: 'Threads_running=100,Threads_connected=500'. When status exceeds threshold, app throttles writes")
    criticalLoad := flag.String("critical-load", "", "Comma delimited status-name=threshold, same format as `--max-load`. When status exceeds threshold, app panics and quits")
    flag.Int64Var(&migrationContext.CriticalLoadIntervalMilliseconds, "critical-load-interval-millis", 0, "When 0, migration bails out upon meeting critical-load immediately. When non-zero, a second check is done after given interval, and migration only bails out if 2nd check still meets critical load")
    quiet := flag.Bool("quiet", false, "quiet")
    verbose := flag.Bool("verbose", false, "verbose")
    debug := flag.Bool("debug", false, "debug mode (very verbose)")
@@ -130,6 +130,20 @@ func (this *Throttler) collectControlReplicasLag() {
    }
}

// criticalLoadIsMet checks whether any `--critical-load` status variable is at or above its configured threshold
func (this *Throttler) criticalLoadIsMet() (met bool, variableName string, value int64, threshold int64, err error) {
    criticalLoad := this.migrationContext.GetCriticalLoad()
    for variableName, threshold = range criticalLoad {
        value, err = this.applier.ShowStatusVariable(variableName)
        if err != nil {
            return false, variableName, value, threshold, err
        }
        if value >= threshold {
            return true, variableName, value, threshold, nil
        }
    }
    return false, variableName, value, threshold, nil
}

// collectGeneralThrottleMetrics reads the once-per-sec metrics, and stores them onto this.migrationContext
func (this *Throttler) collectGeneralThrottleMetrics() error {
@@ -144,15 +158,23 @@ func (this *Throttler) collectGeneralThrottleMetrics() error {
            this.migrationContext.PanicAbort <- fmt.Errorf("Found panic-file %s. Aborting without cleanup", this.migrationContext.PanicFlagFile)
        }
    }

    criticalLoadMet, variableName, value, threshold, err := this.criticalLoadIsMet()
    if err != nil {
        return setThrottle(true, fmt.Sprintf("%s %s", variableName, err))
    }
    if criticalLoadMet && this.migrationContext.CriticalLoadIntervalMilliseconds == 0 {
        this.migrationContext.PanicAbort <- fmt.Errorf("critical-load met: %s=%d, >=%d", variableName, value, threshold)
    }
    if criticalLoadMet && this.migrationContext.CriticalLoadIntervalMilliseconds > 0 {
        log.Errorf("critical-load met once: %s=%d, >=%d. Will check again in %d millis", variableName, value, threshold, this.migrationContext.CriticalLoadIntervalMilliseconds)
        go func() {
            timer := time.NewTimer(time.Millisecond * time.Duration(this.migrationContext.CriticalLoadIntervalMilliseconds))
            <-timer.C
            if criticalLoadMetAgain, variableName, value, threshold, _ := this.criticalLoadIsMet(); criticalLoadMetAgain {
                this.migrationContext.PanicAbort <- fmt.Errorf("critical-load met again after %d millis: %s=%d, >=%d", this.migrationContext.CriticalLoadIntervalMilliseconds, variableName, value, threshold)
            }
        }()
    }

    // Back to throttle considerations