Logging for High Performance Applications

Maintaining the balance between too much and too little

For high-throughput, low-latency applications, a single unoptimized operation can cause performance degradation.

Logging and Performance 🌲🏎️

How much does logging affect performance? We can find out with a simple experiment using Go's default log package and benchmark tests.

 

Our experiment will compare the execution of these two functions.

// Reverse returns s with its characters in reverse order.
func Reverse(s string) string {
  var r string
  for _, x := range s {
    r = string(x) + r
  }
  return r
}

// ReverseLog reverses s and logs the result using the standard library logger.
func ReverseLog(s string) {
  r := Reverse(s)
  log.Printf("The reverse of %s is %s", s, r)
}

In general, logging is an expensive operation, but how expensive is it, anyway?

Just how expensive is logging? 💰

To figure out how much overhead a single log statement can add, we can use Go's benchmark tests.
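
As a sketch, the benchmarks behind the output below could look something like this (the talk doesn't show its benchmark file, so the bodies here are assumptions; the sub-benchmark names No_Logs and Default_Logs match the output):

package main

import "testing"

func BenchmarkReverse(b *testing.B) {
	b.Run("No_Logs", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			Reverse("benchmark")
		}
	})
	b.Run("Default_Logs", func(b *testing.B) {
		// The default logger writes to stderr, so this measures the
		// formatting and the write, not just the Reverse call.
		for i := 0; i < b.N; i++ {
			ReverseLog("benchmark")
		}
	})
}

Running these produces the following results.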

$ go test -bench . -benchmem
goos: darwin
goarch: amd64
pkg: github.com/madflojo/logging-benchmarks
BenchmarkReverse/No_Logs-4       8108514      142 ns/op     6 B/op    3 allocs/op
BenchmarkReverse/Default_Logs-4  579961       2144 ns/op    70 B/op   6 allocs/op

From these tests, we can see that a normal execution of our function takes 142ns. But once we add a single log statement, our execution time increases to 2.1μs.

Understanding the overhead 🔎

But why does logging add so much overhead? We can use Go's CPU profiling to break down where we spend our time.
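
One way to capture and inspect such a profile with the standard Go tooling (the profile file name here is arbitrary):

$ go test -bench . -cpuprofile cpu.out
$ go tool pprof -top cpu.out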

Most of the execution time is spent writing to standard out. 

Overhead at scale ⏱

We know a single log statement adds about 2μs of overhead, but that's not so much, right? Let's see how much overhead this is at 2,000 requests per second.

2μs × 2,000 = 4,000μs (4ms)

Not too bad, but let's be realistic: most applications have more than one log entry per request. Let's say, on average, we have five per request.

(2μs × 2,000) × 5 = 20,000μs (20ms)

So far the overhead may not seem like much. But let's put it in context by figuring out how many unlogged executions (at roughly 0.1μs each) we could have performed in 20ms.

((2μs × 2,000) × 5) ÷ 0.1μs = 200,000

But What about Async Logging? 🔁

Most modern log libraries are asynchronous, right? Doesn't that mean they eliminate the logging overhead?

Let's experiment again using the Uber Zap ⚡️ logger, a library known for its performance-focused logging.

// ReverseZap reverses s and logs the result with the zap structured logger.
func ReverseZap(s string) {
  r := Reverse(s)
  logger.Info("Reverser ran", zap.String("original", s), zap.String("reversed", r))
}
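
The logger variable above isn't defined in the talk's snippet; as a sketch, it could be set up with zap's production preset like so:

package main

import "go.uber.org/zap"

// logger is the package-level zap logger used by ReverseZap.
var logger *zap.Logger

func init() {
	// zap.NewProduction returns a structured JSON logger writing to stderr.
	l, err := zap.NewProduction()
	if err != nil {
		panic(err)
	}
	logger = l
}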

It's Fast, but Not a Free Lunch 🍱

Zap is quick, a lot faster than the default logging library. But the performance is not free: it is still over 3x slower than not logging (473ns vs. 142ns), and it allocates more bytes per operation.

$ go test -bench . -benchmem
goos: darwin
goarch: amd64
pkg: github.com/madflojo/logging-benchmarks
BenchmarkReverse/No_Logs-4       8108514      142 ns/op     6 B/op    3 allocs/op
BenchmarkReverse/Default_Logs-4  579961       2144 ns/op    70 B/op   6 allocs/op
BenchmarkReverse/Zap_Logs-4      2480517      473 ns/op     137 B/op  4 allocs/op

Breaking down how Async loggers work 🔍

Async loggers are not magic; they all work on the same principle: put log data into a buffer and have another thread/goroutine/worker perform the slow work.

Even though this work is out of the transaction path, the overall application is still spending resources logging.
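
To make the principle concrete, here is a minimal sketch of the buffer-and-worker pattern (illustrative only, not how any particular library is implemented):

package main

import (
	"fmt"
	"os"
)

// asyncLogger buffers messages and hands the slow I/O to a worker goroutine.
type asyncLogger struct {
	buf chan string
}

func newAsyncLogger(size int) *asyncLogger {
	l := &asyncLogger{buf: make(chan string, size)}
	go func() {
		// The worker does the slow formatting/writing off the request path.
		for msg := range l.buf {
			fmt.Fprintln(os.Stderr, msg)
		}
	}()
	return l
}

// Log enqueues a message and returns quickly.
func (l *asyncLogger) Log(msg string) {
	select {
	case l.buf <- msg:
	default:
		// Buffer full: a real logger must either block (slowing the app,
		// as noted below) or drop the message.
	}
}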

What have we learned so far? 📝

  • A single sync log call can drastically slow execution
  • Async logs are fast, but they use more resources and still impact execution time
  • The more transactions you process, the more logs you create
  • When logging too much, it's possible to fill the async buffer/queue, which can slow an application even further

If logging is such a performance drain, why even log in the first place?

A slow application that can be managed is more valuable than a fast application that is broken

Are you sure Logs are still relevant? 👴

With the addition of metrics and tracing to the observability stack, it's understandable to question the value of logs. However, logs still provide insights that other telemetry methods do not.

📈 Metrics give us stats

🔍 Tracing gives us request level insights

🌲 Logging tells us what is happening within the system

Metrics give us Stats 📈

Before Prometheus and other metrics tools became standard, it was common to extrapolate business and system metrics from logs. This meant having to log every event in the system so that those metrics could be created. Now we can easily create thousands of different metrics with much less overhead.

  • Don't create log messages for the sole purpose of tracking data; use metrics instead (see the sketch after this list)
  • Metrics have less overhead per metric than logs
  • Use metrics to track business data such as types of requests, users, etc.
  • Also, use metrics to track system data such as database connections, worker pool sizes, etc.
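
The talk mentions Prometheus; as a sketch of the "use metrics, not logs" advice above (the metric, label, and port names are made up), counting requests with client_golang instead of logging each one could look like:

package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// requestsTotal tracks requests by type; incrementing a counter is far
// cheaper than formatting and writing a log line per request.
var requestsTotal = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "myservice_requests_total",
		Help: "Number of requests processed, by request type.",
	},
	[]string{"type"},
)

func handle(reqType string) {
	requestsTotal.WithLabelValues(reqType).Inc()
	// ... do the actual work ...
}

func main() {
	// Expose the metrics endpoint for Prometheus to scrape.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":9090", nil)
}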

Tracing gives us Request Visibility 🔍

Like metrics, before tracing existed, it was necessary to log at least once for every transaction. Afterward, you would have to piece together a request from each service's logs: a time-consuming and usually fruitless exercise.

  • Tracing allows you to reduce how much you log per request (see the sketch after this list)
  • Most tracing tools provide sampling, keeping the overhead of tracing low with the cost of less visibility
  • It's important to use protocol-based errors (e.g., HTTP or gRPC status codes) with tracing to signal that the transaction failed
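
As a sketch of marking a failed transaction in a trace (the talk doesn't name a tracing library; OpenTelemetry is used here as one common choice, and the function names are hypothetical):

package main

import (
	"context"
	"errors"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/codes"
)

func doWork(ctx context.Context) error {
	return errors.New("dependency timed out") // placeholder failure
}

func handleRequest(ctx context.Context) error {
	ctx, span := otel.Tracer("my-service").Start(ctx, "handleRequest")
	defer span.End()

	if err := doWork(ctx); err != nil {
		// Record the failure on the span instead of emitting a log line
		// for every request; only failed transactions need attention.
		span.RecordError(err)
		span.SetStatus(codes.Error, err.Error())
		return err
	}
	return nil
}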

Logs tell us what the system is doing 🌲

When working on a platform with tracing and metrics implemented, we can focus on logging the application's internals. Request-level logs can be pushed down to debug, and request errors can be logged as warnings.

 

Logging becomes less about request-level tracking and more about what the service is doing: DB connectivity, reconnection attempts, dependency errors, etc.
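
A sketch of that split, reusing a zap logger (the types and function names here are hypothetical):

package main

import "go.uber.org/zap"

type Request struct{ ID string }

func process(req Request) error { return nil } // placeholder

// HandleRequest logs request-level detail at debug and request failures at
// warn, while service internals stay at info.
func HandleRequest(logger *zap.Logger, req Request) {
	// Request-level logs are pushed down to debug (off by default).
	logger.Debug("request received", zap.String("request_id", req.ID))

	if err := process(req); err != nil {
		// Request errors become warnings: react to trends, not single events.
		logger.Warn("request failed", zap.String("request_id", req.ID), zap.Error(err))
	}
}

func reconnectDB(logger *zap.Logger) {
	// What the service itself is doing stays at info.
	logger.Info("reconnecting to database")
}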

Logs Change in the Age of Observability, but They Still Matter 🖥

  • Metrics can tell you something is wrong
  • Tracing can tell you which transactions failed due to something being wrong
  • Logs tell us what went wrong with both the system and the request
  • With tracing, you don't need to log every request, just the failed ones

Since logs are still important, we must find the balance between logging enough to support the system and maintaining adequate performance.

Logs are expensive, so make every log operation count

Say a lot, with just a little 📣

When we do log, it's important to relay the most information possible with the fewest log calls. There are a few best practices we can follow to help achieve this goal.

  • Use appropriate log levels
  • Provide as much context as possible
  • Use structured logs to make it easier to relay context
  • Context is not limited to transactional details; context explains the why and the what
  • Reduce logs per request; pass errors up the stack to request handlers for logging

Getting Levels right 🆙

Log levels are important because they both drive action and change the performance overhead of a log statement. Each log level tells ops what to do with the information. Some log levels are disabled by default; when disabled, the performance overhead is negligible.

  • Critical, Fatal (on by default): Critical errors; the service is crashing or in a terrible state. You want to wake someone up for this and fix it immediately.
  • Error (on by default): Errors that are important to fix; most likely, someone needs to wake up at night to address them. This should not typically be used for individual requests, but rather for system-level problems.
  • Warn (on by default): Errors someone should know about and react to in trends. This is where you put request-level problems.
  • Info (on by default): Just information, i.e. connected to a DB or fetched config. Don't log "request received" messages here.
  • Debug (off by default): Request-level logs go here; also use this for internals that are useful but not important to log all the time.
  • Trace (off by default): Log raw transactional details here, but be careful not to log anything sensitive. Logs are not secure; they are open to many.
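
Because debug and trace are off by default, their cost is mostly a level check. As a sketch of setting the level with zap (note zap itself has no separate trace level):

package main

import (
	"go.uber.org/zap"
	"go.uber.org/zap/zapcore"
)

func newLogger() *zap.Logger {
	cfg := zap.NewProductionConfig()
	// Info and above are emitted; Debug stays off until the level is
	// lowered, e.g. through a config change.
	cfg.Level = zap.NewAtomicLevelAt(zapcore.InfoLevel)

	logger, err := cfg.Build()
	if err != nil {
		panic(err)
	}

	// While the level is Info, this call is dropped after a cheap check.
	logger.Debug("nearly free when the debug level is disabled")
	return logger
}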

Anatomy of a log message

Let's break down this "real life" example log message from an ops perspective.

{
	"timestamp": 1605067527000,
	"service": "my-service-1",
	"level": "ERROR",
	"request_id": "0000000001",
	"message": "Workflow execution failed"
}

What we know:

  • The ERROR level means we should be worried about this failure
  • We know which service this log is from
  • We can use the request ID to find other logs

What we Don't know:

  • Why did the workflow fail?
  • Which workflow failed?
  • Is this a critical workflow?

Anatomy of a Good log message ✅

{
	"timestamp": 1605067527000,
	"level": "ERROR",
	"request_id": "0000000001",
	"message": "Task example.task1 failed during example.workflow42 execution - call to db failed: password expired"
}

What we now know:

  • We know which workflow failed - (example.workflow42)
  • We know which task in that workflow failed - (example.task1)
  • We know why it failed - (password expired)
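
A message like this doesn't have to be assembled in one place. As a sketch (the function names are hypothetical; the strings come from the example above), wrapping errors on the way up the stack produces it naturally:

package main

import (
	"errors"
	"fmt"
)

var errPasswordExpired = errors.New("password expired")

func dbCall() error {
	return fmt.Errorf("call to db failed: %w", errPasswordExpired)
}

func runTask(task, workflow string) error {
	if err := dbCall(); err != nil {
		// Each layer adds the context it knows about.
		return fmt.Errorf("Task %s failed during %s execution - %w", task, workflow, err)
	}
	return nil
}

Calling runTask("example.task1", "example.workflow42") and logging the returned error at the handler yields exactly the message shown above, which is the same pattern covered in "Pass Errors back to the Handler" below.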

Logging in common functions 🚨

A common reason for "over logging" is putting log statements in both request handlers and the functions those handlers call. Not only does this increase the complexity of the code, but it also increases the number of logs per request.

[Diagram: log entries (🌲) added by the request handler and by each function it calls, multiplying the logs produced per request]

⚠ïļ Don't log at each function call, log the results from the handler

Pass Errors back to the Handler 🔁

Instead of logging within each function, pass the error to the request handler, allowing the handler to log appropriately. This not only keeps code cleaner and easier to test, but it also makes it easier to add request-level context.

func MyFunction(...) (string, error) {
	// do work
	if err != nil {
		// Return an empty string and wrap the error with context for the caller.
		return "", fmt.Errorf("unable to perform function because - %w", err)
	}
	return results, nil
}

func RequestHandler(...) {
	r, err := MyFunction(...)
	if err != nil {
		// The handler adds the request-level context before logging.
		logger.Warn("request failed",
			zap.String("request_id", req.RequestID),
			zap.String("ip", req.IPAddr),
			zap.Error(err))
		return
	}
	// use r to build the response
}

Logging Best Practices in Summary 📖

  • Make sure the log level used reflects the action you want someone to take
  • Warnings should trigger an action when there are many of them; a single warning is not important
  • Errors and Criticals should trigger immediate action; use these only when someone should wake up and do something.
  • When logging, add as much context as possible to the message so that ops know what broke and why it broke
  • It's easier to add context by sending errors back to request handlers and letting the handlers worry about logging

Finding balance is hard, but we can improve by practicing every day

What this talk has taught us 🧙‍♂️

  • Logging too much is very easy to do by logging in the wrong places
  • Logging too much impacts performance, even with async logging
  • Logging too little makes an application unmanageable
  • Implementing metrics and tracing helps reduce how much logging is required; without them, you probably need to log at least once per request, but be careful
  • Use log levels to give yourself the ability to drill further with a log config change
  • When you do log, make it worthwhile; don't log junk

EOF

Benjamin Cane

Like this talk? Check out some of my other talks:

Or if you really liked this, buy an offline copy!