November 23, 2008

How Clean is Your Log?

Think back on the last time you urgently needed to figure out what a live, production piece of software was doing. It's very likely that nearly the first thing you did was to pull up it's log. It's also very likely that the majority of the content was either useless, unintelligible, or flat-out misleading.

The reason most logs fall into this state is that they are a piece-meal accumulation of statements which were thrown in over time. This is a problem because the only possible reason to waste the computer's resources producing them is to be as correct, intelligible, and useful as possible. In short, log files should be explicitly written for consumption by people.

As with anything else, clear writing requires the author to carefully select the content, and to create it with the reader in mind. In the case of log files, we know other engineers (or scripts they write) are going to read them, and they are going to want to know how the server has been behaving. Moreover, no one reads log files unless something is going wrong. Therefore, log files should be written to make it as easy as possible to debug why a server isn't behaving as it should. To that end, here are a few specific guidelines I've found most useful:

Make it easy to track down errors

It should go without saying, but it is imperative to make it as easy as possible to identify where - in the code - an error occurred. In Java this may be printing out the stack trace; in C/C++/Objective-C, this may be using the __FILE__ and __LINE__ macros. Most languages have some similar feature. Along these same lines, try to avoid having logging statements which look too much alike. This makes it easy to misinterpret the log, and completely identical statements make it impossible to know exactly where an error occurred.

Log all inputs and outputs

The fastest way to track down the reasons for strange output is to know whether the input was strange, too. This is especially true for a server (i.e. a malformed request), but it applies to other programs as well, provided you have a sufficiently broad view of "inputs" and "outputs".

I consider an input/output to be any data which enters (or leaves) the program's runtime context. This includes: user input (mouse/keyboard events and the like), files read/written to disk, data sent/received on a socket, or requests/results from a database query. I recommend at least logging something every time such an input or output occurs, and if reasonable, the actual content itself.

Logging inputs and outputs provides a number of major benefits. First, you know exactly what the system was asked to do. This allows you to immediately know whether a problem is in your program or outside of it. Second, it gives you a pretty clear indication of what will reproduce a problem (if there is one). Finally, it allows you to report errors to the owners of the "other" system in case you received an unexpected input.

Record how long it takes

Errors are frequently just that a system is taking too long to respond. Being able to quantify the time required for each request is an extremely valuable way to isolate the problem. However, it is not sufficient to simply record how long your system took. You must also record timings for each request you make outside of your application's runtime context. That way, you'll know right away if your program appears slow because it's waiting for something else: whether it's a web service call or reading a file from disk.

Don't log the expected case

If something is supposed to happen, you generally don't need to know about in the log (unless it's one of the things mentioned above). Bear in mind that no one is interested in reading a program's log unless the program is misbehaving. At that point, the person wants to read as little of the log as possible. Including a lot of "business as usual" statements in the log only makes the process more difficult.

Use log levels to indicate action required

Since a log file's purpose is to assist in figuring out why a program isn't running correctly, it's extremely valuable to use log levels to indicate how "badly" things are going. I recommend thinking of the action required as a guide to what log level to use:

  • FATAL: page someone immediately
  • ERROR: send an email immediately
  • WARN: send in a daily report
  • INFO: include for later debugging purposes

Depending upon your logging system, the various levels often have different names, but these should map to whatever you use. Even if you don't actually have an automated system to page people based upon your log file, it's extremely clarifying to think about each level in these terms.

Make it easy to parse

Finally, bear in mind that most uses of a log file involve searching or parsing them using automated tools. Be sure to include standard fields in each line so that these tools can easily extract information from them. Here are some examples of what fields you may want to include:

  • time (down to milliseconds and time zone)
  • the number of the line in the log statement (to ensure each line is unique)
  • the current thread
  • the current user ID / IP address / other client-related identifier
  • the log level
  • the file/class

There's a lot more to be said on logging, but if every log I've read implemented these few principles, my life would have been a lot easier. If there's some piece of wisdom that's too good to be missed, feel free to add it in the comments!

November 16, 2008

The Right Size for a Method

One occasionally hears of some programming zealot who swears up and down that methods should be kept to 'n' lines or less. The actual value may vary, but the arbitrary nature of the claim remains. However, there is a kernel of truth there, and it has to do with preserving Unit Economy.

We all know that the longer a method is, the more we have to keep in our minds to understand it. There are likely to be more local variables, more conditional statements, more exceptions caught and thrown, and more side-effects of all those lines of code. Furthermore, the problem grows faster and faster as there are more lines of code since they all potentially can have an impact on one another. Keeping methods short has a disproportionately large benefit.

Of course, claiming that there's some absolute "correct" number is clearly nonsensical. The same number of lines in C, Lisp, Java, Assembler, or Ruby will accomplish radically different things, and even how one legitimately counts lines will change dramatically. What does not change, though, is the need for the reader (and author) of the code to understand it as a whole. To this end, one should strive to keep the number of discrete tasks a method accomplishes to within the range of what people generally can remember at once: between one and six.

Each task within a method may have several lines of code of its own; how many tends to vary widely. Consider the process of reading a series of rows from a database. There may be a task to establish a database connection, another to create the query, another to read the values, and perhaps one more to close everything down. Each of these may be composed of anywhere from one to many lines of code.

Tasks may even have subtasks. Consider the example of building a login dialog. At some point, there is likely to be some code which creates a variety of controls, and places them on the screen (e.g. an image control for the company logo, a text field to capture the user name, etc). In the method which does this, one may consider the process of creating the components a single task which has a number of subtasks: one for each component.

In both cases, the important consideration is how organizing the method into tasks and subtasks helps preserve Unit Economy. By creating tasks which have strong Cohesion (i.e. you can name what that group of code does) and loose Coupling (i.e. you can actually separate that group of lines from the others), you give the reader ready-made abstractions within your method. In the first example, the reader can readily tell that there's a section for setting up the connection, and be able to mentally file that away as one unit without the need to remember each line of code in it. In the latter example, the reader can categorize the series of UI element creation subtasks as a single "build all the UI components" task, and again be able to abstract the entire thing away under a single unit. Even if there are a dozen or more individual components, it still can be considered a single task, that is, a single mental unit.

This ability to create abstractions within a single method is why there is no absolute "right" size for a method. Since grouping like things into tasks and subtasks preserves the reader's (and author's) Unit Economy, it is quite possible to have a method which is very long in absolute terms, and still quite comprehensible. It also implies that a fairly short method can be "too long" if it fails to provide this kind of mental structure. The proper length will always be determined by the amount of units (tasks) which one has to keep in mind, and the complexity of how those tasks are interrelated.

November 09, 2008

Dependency Injection

In my last post, I spent a good deal of space ranting against the Singleton pattern, and at the end, I promised an alternative. The one I had in mind was Dependency Injection (if you're not familiar with the term, I highly suggest you read the linked Wikipedia article before proceeding).

There are a large number of benefits to Dependency Injection (DI), but I'm going to focus on a few which fall under the categories of improved Cohesion, and improved Coupling.

Improved Cohesion

The strongest cohesive benefit of DI is that it greatly encourages thinking of objects as self-sufficient, stand-alone, software components. Each one must explicitly declare what settings it can accept (i.e. its configurable attributes), what other objects it interacts with (i.e. its dependencies), and what services it provides (i.e. its public methods). Since objects must be configured from the outside, they must provide all the necessary attributes and methods to do so. This requires the author to consider exactly what the function of the object is, and to consider what its public interface should be. The thinking process involved in DI leads to stronger cohesion in the individual objects.

Another major cohesive benefit is that objects don't need to include code to configure themselves. A substantial amount of code in non-DI objects tends to be for looking up configuration values (key-value pairs from files, structured data from a database, object references from a JNDI store, etc). The parsing and handling of this information is generally not the purpose of the class; it is merely an incidental price to make the object sufficiently flexible. By injecting such values from the outside, the class becomes much more focused on its real function, and less on the incidentals of configuration.

Improved Coupling

The crucial and most significant benefit of DI is that objects truly only know about one another's interfaces. In non-DI code, each object may be written in terms of an interface, but somewhere it needed to create or look up an instance using some hard-coded value. This could be as simple as calling a constructor, or as complex as reading a key from a file and looking up the actual object from JNDI. In either case, the object is tied to the actual implementation of that class. In a DI class, there is no connection to the actual implementation class in any way, thereby reducing coupling back to only the shared interface.

The second benefit is simply an implication of the first: that the number and arrangement of objects becomes tremendously more flexible. Since no object ties itself to any other instance, any number of objects can be created and configured to use any combination of dependencies. Perhaps multiple instances will all share the same reference to a dependency; perhaps they will each have their own instance. It may be that one instance of a dependency has been given a version which caches responses while another does not. The fact that every object is given its dependencies (instead of creating them) provides radically looser coupling. As such, the possibilities for arranging classes is dramatically increased.

I've only scratched the surface of the value of Dependency Injection as a design pattern. If you have worked with it before, you undoubtedly already understand the tremendous benefits it provides. If not, I strongly recommend you check out a Dependency Injection framework for your next project. Once you wrap your head around it, you'll never want to go back.

November 03, 2008

Singletons Deemed Dangerous

If you ask a software developer to name a few common design patterns, nine times out of ten, you'll hear about the venerable Singleton pattern. After all, it's one of the simplest design patterns, and one of the half-dozen or so creator patterns in the Gang of Four book. Sadly, this is an incredibly over-used pattern: to the point of qualifying as an anti-pattern.

I say anti-pattern because it is most commonly used as a way of creating an object-oriented global variable. This commonly occurs when the author of the code imagines that only one of a specific resource should ever be available within the system, and he wishes to make sure there is only one instance to match. Other times, the object is needed in several places which don't have a convenient way to pass an instance around, and a singleton is used to make sure the object is always available. In any case, the net result is the introduction of what is effectively a global variable into the code.

Once in place, a Singleton causes a number of insidious problems. The most difficult is that it is impossible to change how many objects of that class there are. It is frequently the case that it seems like there should only be one instance of an object at first, but that requirements change down the road, and multiple objects are needed later on. One example of this is a database connection. Since many different parts of a program may rely on a database, and most programs only address a single database, it is tempting to make the database connection pool a Singleton. Unfortunately, this doesn't allow for a situation where multiple databases must be addressed at once, or if some of the data moves behind a remote service call instead of the database.

A second major dilemma produced by Singletons is that it is impossible to substitute an alternate implementation. Either the behavior of the Singleton is changed for everyone, or not at all. This can lead to an explosion of complexity both within the Singleton (as it is adapted and configured to suit more and more separate and conflicting needs), and around it (as other classes need to provide more and more context). In ordinary circumstances, it would be easy enough to provide a group of subclasses with a shared interface to manage this complexity. Since the nature of Singletons makes this impossible, one is stuck with more complex means of altering the class's behavior.

A third problem is that Singletons make code which uses them very hard to unit test. Since every class which uses a Singleton fetches it at will from a global, static location, it is nearly impossible to test that class without also stubbing out everything which the Singleton interacts with (in addition to whatever stubs were necessary for the class itself). In effect, one must consider the functionality of the Singleton as being part of every class which uses it, since the two are so tightly Coupled as to be inseparable.

Finally, Singletons are not at all friendly to a programmer's Unit Economy. In order to successfully alter or maintain any class, one must bear in mind how it is used by every known class. By using layers and avoiding tight Coupling, one can make this a fairly manageable task. However, a Singleton destroys this effort completely. It is inherently available to any object of any layer, and it is inherently tightly coupled to any class which uses it. Both of these things mean that modifying a Singleton is a much larger undertaking than any other sort of class because of the potentially enormous mental context required to understand the ramifications of doing so.

So, what's the solution? There really are cases where everything in the system should refer to the same instance of an object, and there really are situations where an object is needed in a lot of places that aren't really connected to one another. Fair enough... I'll discuss one excellent alternative next time.