Maintaining the balance between too much and too little
We can find out with a simple experiment using Go's default log package and benchmarks.
Our experiment will compare the execution of these two functions.
func Reverse(s string) string {
    var r string
    for _, x := range s {
        r = string(x) + r
    }
    return r
}

func ReverseLog(s string) {
    r := Reverse(s)
    log.Printf("The reverse of %s is %s", s, r)
}
In general, logging is an expensive operation, but just how expensive is it?
To figure out how much overhead a single log statement can add, we can use Go's benchmark tests.
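The benchmark functions themselves aren't shown here, but based on the sub-benchmark names in the output they might look roughly like this (a sketch; the input string is an assumption):

// In reverse_test.go (imports "testing"); sub-benchmark names like
// "No Logs" show up as "No_Logs" in the benchmark output.
func BenchmarkReverse(b *testing.B) {
    b.Run("No Logs", func(b *testing.B) {
        for i := 0; i < b.N; i++ {
            Reverse("benchmark")
        }
    })
    b.Run("Default Logs", func(b *testing.B) {
        for i := 0; i < b.N; i++ {
            ReverseLog("benchmark")
        }
    })
}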
$ go test -bench . -benchmem
goos: darwin
goarch: amd64
pkg: github.com/madflojo/logging-benchmarks
BenchmarkReverse/No_Logs-4 8108514 142 ns/op 6 B/op 3 allocs/op
BenchmarkReverse/Default_Logs-4 579961 2144 ns/op 70 B/op 6 allocs/op
From these tests, we can see that a normal execution of our function takes 142 ns. But once we add a single log statement, our execution time increases to 2.1 µs.
But why does logging add so much overhead? We can use Go's CPU profiling to break down where we spend our time.
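One way to capture and inspect that profile is with the standard Go tooling (nothing here is specific to this repository):

$ go test -bench . -cpuprofile cpu.out
$ go tool pprof cpu.out
(pprof) top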
Most of the execution time is spent writing to standard out.
We know a single log statement is 2 µs of overhead, but that's not so much, right? Let's see how much overhead this is at 2,000 requests per second.
Not too bad, but let's be realistic: most applications have more than one log entry per request. Let's say on average we have five per request.
So far the overhead may not seem like much. But let's put this in context by figuring out how many executions we could have performed in 20ms.
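A quick back-of-the-envelope calculation using the benchmark numbers above (approximations, not measurements):

2 µs × 2,000 requests/second = ~4 ms of logging overhead per second
2 µs × 5 log entries × 2,000 requests/second = ~20 ms per second
20 ms ÷ 142 ns per execution ≈ 140,000 additional Reverse() executions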
Most modern log libraries are asynchronous, right? Doesn't that mean they eliminate the logging overhead?
Let's experiment again using the Uber Zap ⚡️ logger, a logging library known for its performance-focused async logging.
func ReverseZap(s string) {
    r := Reverse(s)
    logger.Info("Reverser ran", zap.String("original", s), zap.String("reversed", r))
}
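The logger used here is assumed to be a package-level *zap.Logger; one minimal way to construct it (a sketch, not necessarily the configuration used for these benchmarks):

var logger *zap.Logger

func init() {
    // zap.NewProduction builds a JSON logger; a real service would
    // likely tune the config (level, sampling, output) instead.
    l, err := zap.NewProduction()
    if err != nil {
        panic(err)
    }
    logger = l
}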
Zap is quick, a lot faster than the default logging library. But the performance is not free: it is still more than 3× slower than not logging (473 ns vs. 142 ns), and it allocates more bytes per operation.
$ go test -bench . -benchmem
goos: darwin
goarch: amd64
pkg: github.com/madflojo/logging-benchmarks
BenchmarkReverse/No_Logs-4 8108514 142 ns/op 6 B/op 3 allocs/op
BenchmarkReverse/Default_Logs-4 579961 2144 ns/op 70 B/op 6 allocs/op
BenchmarkReverse/Zap_Logs-4 2480517 473 ns/op 137 B/op 4 allocs/op
Async loggers are not magic; they all work under the same principle of putting log data into a buffer and having another thread/goroutine/worker perform the slow work.
Even though this work is out of the transaction path, the overall application is still spending resources logging.
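To make the principle concrete, here is a deliberately naive sketch of a buffer-plus-worker logger (an illustration of the pattern, not how Zap is implemented; it uses the standard log package for the slow write):

// naiveAsyncLogger queues messages in a channel; a background goroutine
// does the slow write. Callers return quickly, but the process as a
// whole still spends CPU and memory on logging.
type naiveAsyncLogger struct {
    ch chan string
}

func newNaiveAsyncLogger(buffer int) *naiveAsyncLogger {
    l := &naiveAsyncLogger{ch: make(chan string, buffer)}
    go func() {
        for msg := range l.ch {
            log.Println(msg) // the expensive work, off the request path
        }
    }()
    return l
}

func (l *naiveAsyncLogger) Info(msg string) {
    select {
    case l.ch <- msg: // fast path: enqueue and return
    default: // buffer full: drop rather than block the caller
    }
}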
If logging is such a performance drain, why even log in the first place?
With the addition of metrics and tracing to the observability stack, it's understandable to question the value of logs. However, logs still provide insights other telemetry methods do not.
Before Prometheus and other metrics tools became standard, it was common to extrapolate business and system metrics from logs. This meant having to log every event in the system so that those metrics could be created. Now we can easily create thousands of different metrics with much less overhead.
Like metrics, before tracing existed, it was necessary to log at least once for every transaction. Afterward, you would have to piece together a request from each service's logs: a time-consuming and usually fruitless exercise.
When working with a platform with tracing and metrics implemented, we can focus on logging the application's internals. Request level logs can be pushed down to debug, and request errors can be made into warn logs.
Logging becomes less about request-level tracking and more about what the service is doing: DB connectivity, reconnection attempts, dependency errors, etc.
Since logs are still important, we must find the balance between logging enough to support the system and maintaining adequate performance.
When we do log, it's important to relay the most information possible with the fewest log calls. There are a few best practices we can follow to help achieve this goal.
Log levels are important because they both drive action and change the performance overhead of a log statement. Each log level tells ops what to do with the information. Some log levels are disabled by default; when disabled, the performance overhead is negligible.
Level | Default On/Off | Description |
---|---|---|
Critical, Fatal | On | Critical errors: the service is crashing or in a terrible state. You want to wake someone up for this and fix it immediately. |
Error | On | Errors that are important to fix; most likely, someone needs to wake up at night to address them. This level should typically be used for system-level problems rather than individual requests. |
Warn | On | Errors someone should know about so they can react to trends. This is where request-level problems go. |
Info | On | General information, e.g. connected to a DB or fetched config. Don't log "request received" messages here. |
Debug | Off | Request-level logs go here; also use this for internals that are useful but not important to log all the time. |
Trace | Off | Log raw transactional details here, but be careful not to log anything sensitive. Logs are not secure; they are open to many eyes. |
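As a sketch of what this looks like in practice with Zap (the field names and values are illustrative, not from the original example):

func main() {
    // Info is the minimum level: Debug calls become cheap no-ops,
    // while Info and above are emitted.
    cfg := zap.NewProductionConfig()
    cfg.Level = zap.NewAtomicLevelAt(zap.InfoLevel)
    logger, err := cfg.Build()
    if err != nil {
        panic(err)
    }
    defer logger.Sync()

    // Request-level detail is pushed down to Debug (disabled here)...
    logger.Debug("request received", zap.String("request_id", "0000000001"))
    // ...while service internals and request-level problems stay at Info/Warn.
    logger.Info("connected to database", zap.String("host", "db.internal"))
    logger.Warn("request failed", zap.String("request_id", "0000000001"), zap.String("error", "password expired"))
}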
Let's break down these "real life" example log messages from an Ops perspective.
{
  "timestamp": 1605067527000,
  "service": "my-service-1",
  "level": "ERROR",
  "request_id": "0000000001",
  "message": "Workflow execution failed"
}
From an Ops perspective, the message above says something failed, but not what failed or why. Compare it with a message that carries the context needed to act:
{
  "timestamp": 1605067527000,
  "level": "ERROR",
  "request_id": "0000000001",
  "message": "Task example.task1 failed during example.workflow42 execution - call to db failed: password expired"
}
A common reason for "over logging" is to put log statements in both request handlers and the functions those handlers call. Not only does this increase the complexity of the code, but it also increases the number of logs per request.
[Diagram: log statements scattered across the request handler (x 2) and each function it calls (x 1 each)]
⚠️ Don't log at each function call; log the results from the handler
Instead of logging within each function, pass the error to the request handler allowing the handler to log appropriately. This not only keeps code cleaner and easier to test, but it also makes it easier to add request level context.
func MyFunction(...) (string, error) {
    // do work
    if err != nil {
        return "", fmt.Errorf("unable to perform function because - %s", err)
    }
    return results, nil
}

func RequestHandler(...) {
    r, err := MyFunction(...)
    if err != nil {
        logger.Warn("Request %s from IP %s failed - %s", req.RequestID, req.IPAddr, err)
    }
    // ... use r to build the response ...
}
Like this talk? Check out some of my other talks:
Or if you really like this, buy an offline copy!