November 23, 2008

How Clean is Your Log?

Think back on the last time you urgently needed to figure out what a live, production piece of software was doing. It's very likely that nearly the first thing you did was to pull up it's log. It's also very likely that the majority of the content was either useless, unintelligible, or flat-out misleading.

The reason most logs fall into this state is that they are a piece-meal accumulation of statements which were thrown in over time. This is a problem because the only possible reason to waste the computer's resources producing them is to be as correct, intelligible, and useful as possible. In short, log files should be explicitly written for consumption by people.

As with anything else, clear writing requires the author to carefully select the content, and to create it with the reader in mind. In the case of log files, we know other engineers (or scripts they write) are going to read them, and they are going to want to know how the server has been behaving. Moreover, no one reads log files unless something is going wrong. Therefore, log files should be written to make it as easy as possible to debug why a server isn't behaving as it should. To that end, here are a few specific guidelines I've found most useful:


Make it easy to track down errors

It should go without saying, but it is imperative to make it as easy as possible to identify where - in the code - an error occurred. In Java this may be printing out the stack trace; in C/C++/Objective-C, this may be using the __FILE__ and __LINE__ macros. Most languages have some similar feature. Along these same lines, try to avoid having logging statements which look too much alike. This makes it easy to misinterpret the log, and completely identical statements make it impossible to know exactly where an error occurred.


Log all inputs and outputs

The fastest way to track down the reasons for strange output is to know whether the input was strange, too. This is especially true for a server (i.e. a malformed request), but it applies to other programs as well, provided you have a sufficiently broad view of "inputs" and "outputs".

I consider an input/output to be any data which enters (or leaves) the program's runtime context. This includes: user input (mouse/keyboard events and the like), files read/written to disk, data sent/received on a socket, or requests/results from a database query. I recommend at least logging something every time such an input or output occurs, and if reasonable, the actual content itself.

Logging inputs and outputs provides a number of major benefits. First, you know exactly what the system was asked to do. This allows you to immediately know whether a problem is in your program or outside of it. Second, it gives you a pretty clear indication of what will reproduce a problem (if there is one). Finally, it allows you to report errors to the owners of the "other" system in case you received an unexpected input.


Record how long it takes

Errors are frequently just that a system is taking too long to respond. Being able to quantify the time required for each request is an extremely valuable way to isolate the problem. However, it is not sufficient to simply record how long your system took. You must also record timings for each request you make outside of your application's runtime context. That way, you'll know right away if your program appears slow because it's waiting for something else: whether it's a web service call or reading a file from disk.


Don't log the expected case

If something is supposed to happen, you generally don't need to know about in the log (unless it's one of the things mentioned above). Bear in mind that no one is interested in reading a program's log unless the program is misbehaving. At that point, the person wants to read as little of the log as possible. Including a lot of "business as usual" statements in the log only makes the process more difficult.


Use log levels to indicate action required

Since a log file's purpose is to assist in figuring out why a program isn't running correctly, it's extremely valuable to use log levels to indicate how "badly" things are going. I recommend thinking of the action required as a guide to what log level to use:

  • FATAL: page someone immediately
  • ERROR: send an email immediately
  • WARN: send in a daily report
  • INFO: include for later debugging purposes

Depending upon your logging system, the various levels often have different names, but these should map to whatever you use. Even if you don't actually have an automated system to page people based upon your log file, it's extremely clarifying to think about each level in these terms.


Make it easy to parse

Finally, bear in mind that most uses of a log file involve searching or parsing them using automated tools. Be sure to include standard fields in each line so that these tools can easily extract information from them. Here are some examples of what fields you may want to include:

  • time (down to milliseconds and time zone)
  • the number of the line in the log statement (to ensure each line is unique)
  • the current thread
  • the current user ID / IP address / other client-related identifier
  • the log level
  • the file/class


There's a lot more to be said on logging, but if every log I've read implemented these few principles, my life would have been a lot easier. If there's some piece of wisdom that's too good to be missed, feel free to add it in the comments!

3 comments:

  1. Another useful piece of information in a log entry would be the category (from a predefined list) what's being accomplished, i.e. "login_request" or "parse_document."

    ReplyDelete
  2. Do you differentiate log files from trace files?

    My product produces both. Log files are always on files which are used to diagnosis general product activity. This are primarily used by system administrators of our product. An example of this is an web server log for getting per connection information.

    Trace files are more specific and only turned on during testing or when a problem is occurring to provide debuggable information. Example of this include, for windows, WPP tracing.

    For us, these two things are similar and contain much of what you discuss above but extremely different. "logs" have a specific format and are meant to be consumed through parsing tools, though they often provide much of the information you mention, every release that we have makes it harder to change them.

    "traces" are of use only to the development team (in fact you cannot even read Window's WPP traces without special files to translate them for you). Obviously the second is more likely to be badly handled, but are much easier to improve if necessary to diagnosis issues.

    ReplyDelete
  3. I've never worked with trace files before, so I'd say the post pretty much only applies to log files.

    ReplyDelete