goroutine panic and recover
Do Panics Crash the Entire Program?
Some code at work recently panicked and brought the server down. The panic came from a nil pointer dereference in a goroutine started from an HTTP handler. Up to this point, I thought that a panic in a goroutine was contained to that goroutine, so the failure would stay isolated without bringing down the whole program.
I was completely wrong about this. Let's check out what the Go article Defer, Panic, and Recover says:
Panic is a built-in function that stops the ordinary flow of control and begins panicking. When the function F calls panic, execution of F stops, any deferred functions in F are executed normally, and then F returns to its caller. To the caller, F then behaves like a call to panic. The process continues up the stack until all functions in the current goroutine have returned, at which point the program crashes. Panics can be initiated by invoking panic directly. They can also be caused by runtime errors, such as out-of-bounds array accesses.
Recover is a built-in function that regains control of a panicking goroutine. Recover is only useful inside deferred functions. During normal execution, a call to recover will return nil and have no other effect. If the current goroutine is panicking, a call to recover will capture the value given to panic and resume normal execution.
Breaking this down into key points:
- panic immediately stops normal execution in the current goroutine.
- As the stack unwinds, Go runs the deferred functions of each function in that goroutine.
- The panic propagates up the call stack only within that goroutine; it does not cross goroutine boundaries.
- If recover() is called inside a deferred function, it can catch the panic and prevent the program from crashing.
- If the panic is not recovered, the program crashes after the goroutine’s stack unwinds.
So, basically it means:
Each goroutine has its own stack. A panic in one goroutine only unwinds that goroutine's stack, but an unrecovered panic in any goroutine still terminates the entire program. To handle a panic, recover must be called from a deferred function running in the panicking goroutine itself.
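To make that concrete, here is a minimal sketch of a recover placed inside the goroutine that panics. The names runIsolated and user are made up for this example; they are not from the Go article.

```go
package main

import (
	"fmt"
	"sync"
)

// user is a hypothetical type used only to trigger a nil pointer dereference.
type user struct{ name string }

// runIsolated (a hypothetical helper) runs task in a new goroutine, waits for
// it to finish, and returns the recovered panic value, or nil if task did not
// panic. The deferred recover lives inside the new goroutine, which is the
// only place that can observe that goroutine's panic.
func runIsolated(task func()) (recovered any) {
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done() // runs last, after the recover below
		defer func() {
			recovered = recover()
		}()
		task()
	}()
	wg.Wait()
	return recovered
}

func main() {
	// A deferred recover placed here in main would not help: it only sees
	// panics raised on main's own goroutine.
	var u *user
	r := runIsolated(func() {
		fmt.Println(u.name) // nil pointer dereference: a runtime panic
	})
	fmt.Println("recovered:", r)
	fmt.Println("main is still running")
}
```

Removing the deferred recover from the goroutine makes the same program exit with a panic trace, even though main never touched the nil pointer.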
Therefore, to prevent a goroutine spawned in an HTTP handler, either for a concurrent task or a background job, from bringing down the program, each such goroutine needs its own deferred recover as a safeguard.
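One way to avoid repeating that safeguard by hand is a small wrapper around the go statement. This is only a sketch under my own assumptions: safeGo and its onPanic callback are hypothetical names, not part of the standard library.

```go
package main

import (
	"log"
	"runtime/debug"
)

// safeGo (hypothetical) starts task in a new goroutine with a deferred
// recover. If task panics, onPanic receives the panic value and the stack
// trace captured inside the panicking goroutine.
func safeGo(onPanic func(v any, stack []byte), task func()) {
	go func() {
		defer func() {
			if v := recover(); v != nil {
				onPanic(v, debug.Stack())
			}
		}()
		task()
	}()
}

func main() {
	done := make(chan struct{})
	safeGo(
		func(v any, stack []byte) {
			defer close(done)
			log.Printf("recovered panic in background job: %v\n%s", v, stack)
		},
		func() {
			var items []int
			log.Println(items[3]) // index out of range: a runtime panic
		},
	)
	<-done
	log.Println("the process is still alive")
}
```

The callback is where logging and error reporting would go. Note that a panic inside onPanic itself is not recovered, so it should stay simple.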
Where did this misconception come from?
Most panics I have seen in HTTP servers never brought down the program; they only terminated the request itself. Since the server handles each request in its own goroutine, I just assumed that any goroutine behaved the same way. However, as we have seen earlier, that is not true. The net/http documentation for Handler explains what actually happens:
If ServeHTTP panics, the server (the caller of ServeHTTP) assumes that the effect of the panic was isolated to the active request. It recovers the panic, logs a stack trace to the server error log, and either closes the network connection or sends an HTTP/2 RST_STREAM, depending on the HTTP protocol.
So the server does have a recover safeguard, but only for the goroutine that runs the handler. As discussed earlier, panics must be handled per goroutine: a recover in the goroutine that started another goroutine won't prevent a panic in the new one from bringing down the program.
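The server's safeguard is easy to observe with a local test server. The sketch below uses net/http/httptest; the routes and the helper okAfterPanic are hypothetical names chosen for this example.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"net/http/httptest"
)

// newMux builds a hypothetical mux with one panicking and one healthy route.
func newMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/panic", func(w http.ResponseWriter, r *http.Request) {
		// This runs on the goroutine net/http started for the connection,
		// so the server's own deferred recover catches it.
		panic("handler failed")
	})
	mux.HandleFunc("/ok", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "ok")
	})
	return mux
}

// okAfterPanic requests /panic, then /ok on the same server, and returns the
// body of the second response.
func okAfterPanic() (string, error) {
	srv := httptest.NewUnstartedServer(newMux())
	srv.Config.ErrorLog = log.New(io.Discard, "", 0) // hide the logged stack trace
	srv.Start()
	defer srv.Close()

	// The server recovers the panic and drops the connection, so this request
	// normally fails on the client side. Either outcome is fine here.
	if resp, err := http.Get(srv.URL + "/panic"); err == nil {
		resp.Body.Close()
	}

	resp, err := http.Get(srv.URL + "/ok")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	body, err := okAfterPanic()
	fmt.Println(body, err)
}
```

If the /panic handler instead ran its panicking code behind a go statement, the panic would happen on a goroutine the server never wrapped, and the whole process would exit.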
Should every goroutine have a recover?
Someone already suggested that on Reddit: Recover from panics in all Goroutines you start.
Many engineers advocate for the "let it crash" error handling strategy for panics, like the philosophy followed by Erlang. If something is panicking, the program might be in such a bad state that it's safer to let it crash and restart. If the server has a proper scaling mechanism configured, it might not affect many users and makes it pretty clear something is wrong and needs to be fixed.
However, the same way the http package from the Go standard library assumes the effect of the panic is isolated to the active request and handles it gracefully, I think it is fair to apply that same rule to goroutines started from that request and handle them gracefully as well.
By the way, that is not the only case where recover is used by the Go standard library. Defer, Panic, and Recover also mentions:
For a real-world example of panic and recover, see the json package from the Go standard library. It encodes an interface with a set of recursive functions. If an error occurs when traversing the value, panic is called to unwind the stack to the top-level function call, which recovers from the panic and returns an appropriate error value (see the 'error' and 'marshal' methods of the encodeState type in encode.go).
The convention in the Go libraries is that even when a package uses panic internally, its external API still presents explicit error return values.
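The same convention can be sketched in a few lines. This is not the json package's code, just an illustration of the pattern with hypothetical names (Sum, walk, sumError): the recursive helper panics with a private type, and the exported function converts it into an error.

```go
package main

import "fmt"

// sumError wraps errors raised by the internal walker so the exported
// function can tell them apart from genuine bugs.
type sumError struct{ err error }

// Sum adds up every int in a nested []any value. Internally it unwinds with
// panic, but callers only ever see an ordinary error.
func Sum(v any) (total int, err error) {
	defer func() {
		if r := recover(); r != nil {
			se, ok := r.(sumError)
			if !ok {
				panic(r) // not ours: a real bug, so keep panicking
			}
			total, err = 0, se.err
		}
	}()
	return walk(v), nil
}

// walk recurses through the value and panics on anything it cannot sum.
func walk(v any) int {
	switch x := v.(type) {
	case int:
		return x
	case []any:
		n := 0
		for _, e := range x {
			n += walk(e)
		}
		return n
	default:
		panic(sumError{fmt.Errorf("unsupported type %T", v)})
	}
}

func main() {
	fmt.Println(Sum([]any{1, []any{2, 3}}))   // 6 <nil>
	fmt.Println(Sum([]any{1, []any{"two"}})) // 0 unsupported type string
}
```

Recovering only the package's own panic type keeps unrelated bugs loud, which is a reasonable default for any recover that is not at a process or request boundary.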
I think it depends on your architecture. In a microservice with several deployed instances, a few instances going down may not affect the product significantly, so letting the panic terminate the server could be acceptable. For monolithic projects, however, it's safer to handle panics more gracefully and prevent goroutines started from HTTP handlers from bringing down the entire server. The impact is much larger in this case: instead of one part of the app breaking, the whole application becomes unavailable due to an error in a single request.
As long as the recover mechanism logs the failure and reports it to error monitoring tools, handling panics in goroutines seems like a reasonable approach to improve server resilience.