Metrics

Metrics are always an interesting topic. What do you measure and why? And how do you keep people from gaming them?

Image by Mikes-Photography from Pixabay

In my last post, I mentioned the DORA Metrics. Here I want to talk more about metrics in general. But first, a story.

Wilderness First Responder

For those who don't know, I used to do a lot of outdoor stuff - rock and ice climbing, trail running, mountain biking, skiing, etc. I still do some, but having kids has definitely slowed me down. When I was very serious about it, I was spending a lot of time in the back-country where medical care was not always readily available, so I decided to become a Wilderness First Responder (WFR).

WFR is a certification program that teaches you how to treat patients in the back-country where definitive medical care is not readily available. It's a 10-day course. It's not med school or anything, but you do learn a lot of anatomy and physiology. You learn a lot of practical skills like doing exams, moving patients safely, splinting, doing hypothermia wraps, etc. There is also a lot of emphasis on evacuation criteria: who needs to be evacuated, and how rapidly? A large part of making those decisions is monitoring vitals.

Vital Signs are metrics

You learn a variety of vital signs to check for, but the two best-known ones are Heart Rate and Respiratory Rate (HR/RR). These metrics are easy to check in the wilderness because they don't require any special tools. They are also a good indication of the function of your high-level bodily systems. I should also point out that there is a quality associated with these numbers as well. How strongly is their heart beating? How difficult is it for them to breathe? In medicine that is important, but we're going to ignore the quality part here and just focus on the number.

What is normal?

The first thing you learn is that there are some normal ranges for those numbers. An HR of 10 or 250 should be alarming, as should an RR of 4 or 60. You quickly learn that the "normal" range for the number is quite large. For HR (at rest), typical is 60-100 BPM, but 40 isn't unusual in athletes. So if someone shows up with an HR of 80, is that good? The question to ask in this case is: "What is normal for that person?" That forms the baseline. We don't care so much about the absolute value (unless it is approaching 0), but we do care about the change from baseline. If an extremely fit athlete, who normally has an HR of 40, shows up with an HR of 80 BPM, I might be concerned.

How is that patient doing?

I say might be concerned, because one metric alone doesn't always tell the whole story. You should avoid looking at just one metric in a vacuum. You need context. So the next question to ask is: "How does the patient appear to be doing?" If their HR/RR is elevated or depressed, but they seem to not be struggling, I'm a little less worried.

It's the trend not the actual value

I'm less worried, but the person isn't out of danger yet. The difference from the baseline is important, but more important is the trend over time. Are the values stable, or going up or down? Are they approaching normal or heading to the extremes? How quickly? That is all very important information and generally factors into the decision making far more than the actual value.

It's a leading indicator

Why do we pick HR and RR to monitor? As I mentioned, they are easy to monitor. Combined, they are also a good leading indicator for shock. Shock is a very serious condition and hard to reverse. By monitoring HR/RR over time and in relation to each other, we can predict when a patient is slipping into shock and treat them before it is too late. There are plenty of lagging indicators as well, but by the time we notice those, it is too late.

It's not a target

This last point might be obvious, but it is worth saying because it plays into the later discussion. We use HR/RR as an indicator, not a target. That is to say, we don't do things to the patient to try to manipulate the HR/RR. We do things to the patient to try to improve their condition, and then we use the HR/RR as an indicator that things got better or at least didn't get worse. Here it is pretty obvious because, while a patient has some control over their RR, there is nothing we or the patient can do to directly control their HR. It's just not possible.

Another Real-Life Example: Driving

Enough medicine. Let's switch gears. Let's talk about driving. While driving we have lots of indicators: speedometer, tachometer, oil light, etc. I want to use this analogy to drive home the last point. Unlike HR/RR for a patient, when you are driving a car, you do have direct control over your speed. With the gas pedal, you can manipulate it. Many people try to drive at exactly the speed limit (or, in the US, 5-10 miles per hour above). If you want, you could use the speedometer as a target instead of an indicator.

Metrics make bad targets

If you use the speedometer as a target rather than an indicator, that can cause problems. If all your attention is focused solely on maintaining the speed limit, you will miss important things like traffic slowing, wet roads, icy/snowy roads, sharp bends, fog, etc. Simply trying to maintain the speed limit in those situations can cause you to crash. What you should be doing is using the speedometer as an indicator. You should drive at an appropriate speed for the conditions while monitoring the speedometer to make sure you aren't speeding and to get an idea of how long it is going to take to get wherever you are going.

Applying this to software

So how does all this apply to software? Here are a few observations.

Use Metrics As Indicators Not Targets

There is a saying, often called Goodhart's law: "As soon as a metric becomes a target, it is no longer a useful metric." The reason is that the metric becomes too easily gamed at the expense of other attributes of the system that you care about. The target should be a specific outcome, not a specific metric. You should be able to deduce the health of the system and the success of your interventions from the metric, but you shouldn't try to drive the number.

Never Rely On Just One Metric

You should never rely on a single metric. A metric gives you a snapshot from one particular angle. There are undoubtedly other aspects of the system that you care about. Try to pick another metric to represent those aspects. For example, in the DORA metrics, 3 of the 4 are about speed, but there is one quality metric. That quality metric is there to make sure we aren't inadvertently sacrificing quality for speed. The DORA metrics work well as a group; you shouldn't just pick off one or two to monitor.

Pick Leading Indicators When Possible

There are leading and lagging indicators. Whenever possible we want to track leading indicators. We want to know when things are going downhill before we run into problems. It is easier to avoid problems than it is to fix them after they happen.

Always Take A Baseline

Before you start messing with the system, first observe and get a good baseline. Know where you are so you can measure your progress. Know what is normal for you.

Keep in mind that trends are almost always more important than absolute values. This is particularly true if we are looking at leading indicators. We really want to know which way things are moving, and we want to be able to correlate that with our actions over time. We want to be able to see whether the numbers go up or down when we do X.
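To make the baseline-and-trend idea concrete, here is a minimal sketch in Python. The weekly lead-time numbers are made up for illustration, and the function names (baseline, trend) are my own, not from any library. It treats the baseline as the average of the observations taken before the intervention and the trend as a least-squares slope over the observations taken after it.

```python
from statistics import mean


def baseline(values):
    """Average of the observations taken before changing anything."""
    return mean(values)


def trend(values):
    """Least-squares slope of the values against their position in the series.

    A negative result means the metric is falling by roughly that much
    per observation; a positive result means it is rising.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations to compute a trend")
    x_mean = (n - 1) / 2
    y_mean = mean(values)
    numerator = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    denominator = sum((i - x_mean) ** 2 for i in range(n))
    return numerator / denominator


# Hypothetical lead times in days, one observation per week.
before = [30, 32, 31, 29, 33]   # observed before the intervention
after = [30, 27, 25, 22, 20]    # observed after the intervention

base = baseline(before)
change = after[-1] - base       # latest value relative to the baseline
slope = trend(after)            # direction and speed of movement

print(f"baseline: {base} days")
print(f"latest vs. baseline: {change:+} days")
print(f"trend: {slope:+.1f} days per week")
```

The latest value on its own (20 days) says little. The change from baseline and the slope are what tell you which way things are moving and how fast.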

Use Metrics For Internal Not External Comparison

Keep your own scoreboard, not someone else's. In running, there is this idea of a PR, a personal record. It's your fastest time on a particular course. It's about celebrating your own accomplishments without getting caught up in and distracted by what everybody else is doing. It's about comparing yourself to your past self, not someone else. It goes back to the "What is a normal heart rate for this patient?" question. What's normal for you is not normal for someone else. So stop comparing yourself to others.

Experiment

I encourage you to experiment with this. Here is a simple formula:

  • Pick an aspect of your software development process that you'd like to improve, or a specific problem you'd like to solve.
  • Come up with an intervention.
  • Come up with some related metrics.
  • Take a baseline.
  • Make a prediction.
  • Implement the intervention.
  • Observe the trends.
  • Decide whether to keep the change or not.

An Example

Maybe you aren't currently doing unit testing, but you just found a bug in production that could easily have been caught by unit tests, so you say, "Let's add unit testing to our process." The problem you want to solve is "We end up with easily preventable bugs in production." Your intervention is "We are going to start writing unit tests and automating them."

Your related metrics might be the DORA Metrics. For a baseline, rather than delaying your intervention while you collect new data, you determine that you already have the data you need; you just have to go calculate it. You pull up your records and come up with your DORA Metrics for the past year. You deployed roughly every 2 months. Your lead time for changes is about 4 months. Your change failure rate is 2 out of 6, or 33%, and your mean time to recover is 1 month. That is your starting point.
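If your records are in a spreadsheet or a ticket system, the calculation might look something like the sketch below. Everything here is an assumption for illustration: the Deployment record, its field names, and the dates are invented so that they roughly reproduce the numbers above (6 deployments in a year, 2 of them failures, about 4 months of lead time, about 1 month to recover). Your real data will be messier.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Deployment:
    """One production deployment (hypothetical record layout)."""
    committed: date                   # when work on the change was committed
    deployed: date                    # when it reached production
    failed: bool = False              # did it cause a failure in production?
    restored: Optional[date] = None   # when service was restored, if it failed


def dora_metrics(deployments, period_days=365):
    """Summarize a list of deployments covering period_days."""
    n = len(deployments)
    if n == 0:
        raise ValueError("no deployments in the period")
    failures = [d for d in deployments if d.failed]
    lead_times = [(d.deployed - d.committed).days for d in deployments]
    recoveries = [(d.restored - d.deployed).days for d in failures if d.restored]
    return {
        "days_between_deployments": period_days / n,
        "lead_time_days": sum(lead_times) / n,
        "change_failure_rate": len(failures) / n,
        "mean_time_to_recover_days": (
            sum(recoveries) / len(recoveries) if recoveries else None
        ),
    }


# Made-up history: a deployment every other month, two of which failed.
failed_months = {4, 10}
history = []
for month in (2, 4, 6, 8, 10, 12):
    deployed = date(2023, month, 1)
    failed = month in failed_months
    history.append(Deployment(
        committed=deployed - timedelta(days=120),
        deployed=deployed,
        failed=failed,
        restored=deployed + timedelta(days=30) if failed else None,
    ))

metrics = dora_metrics(history)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Expressing deployment frequency as days between deployments is just a convenience for a team that deploys this rarely. A team that deploys daily would more likely count deployments per week.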

Now you make some predictions. You expect your change failure rate to go down. You should be catching more potential problems with your unit tests. Your mean time to recover might go up because unit tests will catch the easy things, which means anything that gets through is going to be harder to fix. Your deployment frequency is driven more by your customer so that probably won't change much. Right now you spend time fixing bugs that QA finds, creating this rework cycle, so hopefully, the tests will find these bugs before they make it to QA and there will be less rework there. That should cause your lead time for changes to go down.

Now that you have your predictions, you can implement the intervention for a few iterations and observe the trends. Compare the results to your predictions. Were your predictions accurate? If not, why? What did you miss? You probably don't have perfect controls, so what else might have changed over the course of your experiment that might affect your numbers? If the numbers look good, then keep the change. If not, roll it back.
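Comparing results to predictions can be as simple as writing down the expected direction for each metric up front and checking it later. The sketch below is one way to do that in Python. The baseline figures come from the example above, but the observed figures are invented, and the 5% tolerance for calling a change "flat" is an arbitrary assumption you would want to tune.

```python
def direction(before, after, tolerance=0.05):
    """Classify the relative change from before to after as up, down, or flat."""
    if before == 0:
        raise ValueError("baseline must be non-zero to compute a relative change")
    change = (after - before) / abs(before)
    if change > tolerance:
        return "up"
    if change < -tolerance:
        return "down"
    return "flat"


def check_predictions(baseline_metrics, observed_metrics, predictions):
    """Return {metric: (predicted, actual, matched)} for every prediction."""
    report = {}
    for name, predicted in predictions.items():
        actual = direction(baseline_metrics[name], observed_metrics[name])
        report[name] = (predicted, actual, predicted == actual)
    return report


# Baseline from the example: the past year of records.
baseline_metrics = {
    "change_failure_rate": 0.33,
    "mean_time_to_recover_days": 30,
    "days_between_deployments": 60,
    "lead_time_days": 120,
}

# Predictions written down before implementing the intervention.
predictions = {
    "change_failure_rate": "down",
    "mean_time_to_recover_days": "up",
    "days_between_deployments": "flat",
    "lead_time_days": "down",
}

# Hypothetical numbers observed after a few iterations with unit tests.
observed_metrics = {
    "change_failure_rate": 0.17,
    "mean_time_to_recover_days": 28,
    "days_between_deployments": 61,
    "lead_time_days": 95,
}

report = check_predictions(baseline_metrics, observed_metrics, predictions)
for name, (predicted, actual, matched) in report.items():
    status = "as predicted" if matched else "surprise, find out why"
    print(f"{name}: predicted {predicted}, observed {actual} ({status})")
```

With these made-up numbers, three predictions hold and one does not: mean time to recover went down rather than up. A mismatch like that is not a failure. It is the prompt to ask what you missed.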