Good metrics

What makes a good metric?

When I say “metric” for the purposes of this post, I mean something specific. I don’t mean every bit of data you can collect, or every rollup and statistical trick you can apply to produce an interesting number. I mean things that will be included in dashboards, used to evaluate performance, or used to drive alerting and scaling decisions. By “metrics” I mean practical numbers, whether human- or machine-facing.

As for what makes a metric good, I’d point to three main factors.

Intuitive

  • Metrics should be trivial to understand. Reaching a conclusion should not require additional thought or mental combination with other metrics. It should not be possible to draw opposite conclusions from the same metric.
  • Metrics should take advantage of automatic mental processes and not ignore or contravene them. The correct conclusion from a metric should never be the opposite of the initial impression.

Some examples of unintuitive metrics:

  • CPU Usage. 98% CPU usage seems like a bad thing: the server is almost at capacity. The natural conclusion is that we need to scale up or out to relieve the pressure. Working from this metric alone, you’ll probably end up with many of your servers sitting at 80% CPU usage, since that feels more comfortable. That’s a bad outcome, though: with a constant workload, you’re paying for 25% more CPU capacity than you use. With a burst-style workload, on the other hand, overprovisioning CPU does make sense. This metric could therefore be improved by bringing in knowledge of the workload your servers typically handle. Track a high water mark for bursts (e.g. the mean plus two standard deviations, or the 95th percentile) and make your server provisioning decisions based on that - see the sketch after this list.
  • Error count. An increase in the number of errors could indicate a major problem, or it could just reflect higher than normal usage; without the request volume alongside it, you can’t tell which.
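
To make the CPU example concrete, here is a minimal sketch of the high-water-mark idea: provision against the 95th percentile of recent CPU samples rather than the latest reading. The 80% target, the sample values, and the function name are illustrative assumptions, not a prescription.

    import statistics

    def cpu_headroom_metric(samples, target=0.80):
        """Summarize a window of CPU usage samples (0.0-1.0) for provisioning."""
        # The 95th percentile of the window acts as the burst high water mark.
        p95 = statistics.quantiles(samples, n=20)[-1]
        return {
            "p95": p95,
            "mean": statistics.mean(samples),
            # Positive headroom: burst peaks still fit under the target.
            # Negative headroom: time to scale up or out.
            "headroom_vs_target": target - p95,
        }

    # A bursty box that idles low but regularly spikes close to capacity.
    print(cpu_headroom_metric([0.30, 0.35, 0.32, 0.40, 0.92, 0.95, 0.38, 0.33,
                               0.31, 0.36, 0.90, 0.34, 0.37, 0.39, 0.41, 0.88,
                               0.35, 0.32, 0.93, 0.36]))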

Actionable

  • Metrics should clearly motivate decisions. It’s alright if that decision is to stay the course - with metrics backing it, you can be confident it is the right decision. Metrics should not indicate multiple courses of action depending on interpretation.

Some examples of unactionable metrics:

  • Lines of code. Is a developer who ships more code more productive, or more likely to burn out? Or is their high line count down to more verbose languages or supporting tools? It’s completely unclear what to do once you have this number. We could improve it by bucketing it. The range of normal code contributions is probably quite wide, but we should be able to establish a lower bound for acceptable contributions from a developer in a given timeframe. For your organization that may be a single line, leaving us with a bucket for zero lines and a bucket for any contribution at all - but we’ve made the metric actionable (assuming we can also identify which employees are expected to be writing code in the first place). It’s not incredibly useful, but if we were to hire a dev who did absolutely nothing, this metric would motivate some action.
  • Test coverage. Test coverage is another that can be made useful by bucketing. If coverage right now is 95%, who’s to say that last 5% is worth adding? And who would commit to saying that coverage should never drop below its current level? But if we again set a loose lower bound - say 70% - falling below it should clearly trigger a re-evaluation of the test setup. A sketch of both buckets follows after this list.
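
Here is a minimal sketch of the bucketing idea for both of these numbers. The 70% coverage floor and the zero-lines rule come from the discussion above; the function names and return strings are illustrative assumptions.

    def coverage_bucket(coverage_pct, floor=70.0):
        """Collapse a coverage percentage into a signal that motivates one action."""
        return "ok" if coverage_pct >= floor else "re-evaluate test setup"

    def contribution_bucket(lines_in_period):
        """All raw line counts can really tell us: did any code land at all?"""
        return "contributing" if lines_in_period > 0 else "investigate"

    print(coverage_bucket(95.0))    # ok
    print(coverage_bucket(64.2))    # re-evaluate test setup
    print(contribution_bucket(0))   # investigate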

Some examples of actionable metrics:

  • Percent of files touched in the average pull request. If this is high, that clearly indicates a need for refactoring, since there’s a lot of interdependence in the project.
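
A minimal sketch of how this might be computed, assuming the per-PR file counts come from your source control API; the numbers and the function name are made up for illustration.

    import statistics

    def avg_pct_files_touched(files_touched_per_pr, total_files):
        """Average percentage of the repository touched by each pull request."""
        return statistics.mean(n / total_files * 100 for n in files_touched_per_pr)

    # A 400-file project where pull requests routinely touch dozens of files:
    # a high number here is a refactoring signal.
    print(f"{avg_pct_files_touched([38, 52, 44, 61, 30], total_files=400):.1f}%")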

Complete

  • Metrics should bring in all relevant information and should not ignore part of the problem. This does not mean a metric needs to incorporate every possible indicator - just that everything relevant should be represented in the final number. It’s entirely acceptable to build a scaling metric from a single system usage indicator, but skipping usage entirely and relying on the job count on a particular server will fall short of the mark - see the sketch after this list.
  • It’s part of the above, but I want to call it out specifically: metrics should be up to date. Quickly incorporating recent information is important.
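
As a sketch of both points, here is a scaling decision that combines an actual usage indicator with queue depth rather than relying on job count alone, and that refuses to act on stale data. All thresholds, field names, and the function name are assumptions for illustration.

    import time

    def scale_decision(sample, max_age_s=60.0):
        """Decide whether to scale out based on a recent usage sample."""
        if time.time() - sample["timestamp"] > max_age_s:
            # Stale input: acting on it would violate "up to date".
            return "refresh metrics before deciding"
        if sample["cpu_p95"] > 0.85 or sample["queue_depth"] > 100:
            return "scale out"
        return "hold"

    print(scale_decision({"timestamp": time.time(), "cpu_p95": 0.91, "queue_depth": 40}))
    # -> scale out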

Some examples of incomplete metrics:

  • Badly-formatted packets entering a network. This could be used to indicate a possible attack, but it needs a bit more in order to be complete. We should combine it with a comparison against the signatures of known hacking tools (this sort of metric is available in your average intrusion detection system).
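
A minimal sketch of the combined signal, assuming the malformed-packet rate and signature-match count are already being collected; the threshold and return strings are illustrative, not any particular IDS’s API.

    def attack_indicator(malformed_per_min, signature_hits, malformed_threshold=500):
        """Combine packet malformation volume with known-tool signature matches."""
        if signature_hits > 0:
            return "likely attack: known tool signature seen"
        if malformed_per_min > malformed_threshold:
            return "investigate: malformed traffic spike, no known signature"
        return "normal"

    print(attack_indicator(malformed_per_min=1200, signature_hits=0))
    # -> investigate: malformed traffic spike, no known signature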

I’m sure I’ve missed something here. This is the product of only a few hours' thought. I’m interested to hear what good & bad metrics you have encountered - do the bad ones fail for one of these reasons? Drop a comment below.