Reasoning about modern software systems can be hard, and debugging them can be harder. Whether you’re working on a multithreaded service or re-entrant UI code, tracking responses from network services or running many asynchronous IO tasks, understanding the overall state of your application, especially when there’s a hard-to-reproduce bug, can be very hard.
Common styles of older multithreaded code are the worst: shared state and locks, manual coordination between threads, fighting with race conditions, deadlocks and livelocks. It’s very hard to diagnose problems, and all too common for an application that has been running for weeks to suddenly lock up, with nobody ever finding out why.
Newer styles of programming, borrowed from functional languages and only recently gone mainstream, avoid sharing mutable state by passing data and work between threads as packaged tasks. Race conditions and deadlocks can be avoided this way, but you can still hit problems with throughput and starvation, where some types of task never finish.
And there are plenty of single-threaded applications where UI events or network requests lead to re-entrant behaviour: a UI or network callback fires unexpectedly while you’re in the middle of another UI or network call. Even just coping with multiple streams of data can lead to bugs where, occasionally, one stream isn’t fully processed.
Of course, many applications mix these styles and have several kinds of work going on at the same time. Maybe an incoming data stream requires two different calculations that run in the background at different speeds but whose results need to be combined, and maybe you also need to send the results to another service and wait for a reply before storing the final state. If a bug causes it to occasionally process only half the data, it’s going to be a long day working out where it went wrong.
One reason this is difficult is that we often try to think the process through for one data item at a time, because that’s how an old single-threaded application would work, and then try to imagine many of these happening at once. Most of us can’t see all the places where these multiple activities could affect each other. For many problems this is still productive: when you can reproduce the problem, or when it’s clear what type of data triggers it, you can focus on a small section of code. But sometimes a problem only happens in a full production environment and no-one can find a pattern in the data that triggers it, so you need to find a different way in.
I’ve found that ignoring the individual data items, ignoring the actual steps in processing, and stepping back to think only about the overall aggregate set of work tasks can help. If it’s hard to imagine and reason about what’s going on in detail, we can reason about how many things are going on. How many different types of action are there? In the example above there are six: receive, calc1, calc2, send, wait-for-reply, save. How many of each type of action have we done or planned? Maybe we can just count them: count incoming data items, count actions completed, count actions still on the queue, count messages sent. Some applications may already be logging every action, and tools like LogStash or Splunk may let you pull out these counts. In other applications you can easily log them together in a single block as a representation of the overall high-level state of the application.
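As one possible shape for this, here’s a minimal sketch in Python of counting each action type and logging all the counts in one block. The ActionCounters class and its method names are hypothetical, not taken from any particular library; the six action names come from the example above.

```python
# A minimal sketch of aggregate action counting, assuming the six action
# types from the example above; all names here are hypothetical.
import logging
import threading
from collections import Counter

ACTIONS = ["receive", "calc1", "calc2", "send", "wait_for_reply", "save"]

class ActionCounters:
    """Thread-safe counters for each high-level action type."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = Counter({name: 0 for name in ACTIONS})

    def record(self, action: str) -> None:
        """Increment the counter for one action of the given type."""
        with self._lock:
            self._counts[action] += 1

    def snapshot(self) -> dict:
        """Return a consistent copy of all the counters."""
        with self._lock:
            return dict(self._counts)

    def log_state(self) -> None:
        """Log every counter in a single block as a view of overall state."""
        logging.info("action counts: %s", self.snapshot())
```

Calling record("receive") and friends at each step, and calling log_state() on a timer, gives you one log line that describes the aggregate state of the whole application.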
Counting actions can also help you see that many things are just deferred tasks. The list of IDs you’re waiting for replies on: just a list of tasks for later. The queue passing data to the background worker thread: just a list of tasks for later. Buffered data: just a list of tasks for later. Making the quantity of each kind of task visible lets you see the dynamic shape of your application, which parts are used the most and which parts depend on others, and it shows you this while the application is running.
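A small sketch of what that visibility might look like, alongside the counters above; the queue, pending-reply set and buffer here are hypothetical stand-ins for whatever your application actually uses.

```python
# A sketch of exposing the sizes of deferred work; these containers are
# hypothetical examples of "a list of tasks for later".
import queue

work_queue = queue.Queue()   # tasks waiting for the background worker
pending_replies = set()      # IDs we are still waiting on replies for
buffered_items = []          # data buffered for later processing

def deferred_work_snapshot() -> dict:
    """Count everything that is really just a list of tasks for later."""
    return {
        "queued_tasks": work_queue.qsize(),
        "pending_replies": len(pending_replies),
        "buffered_items": len(buffered_items),
    }
```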
Then we can reason about how these numbers should relate to each other. In my example all six actions should happen the same number of times, once for each received data item, and in a specific order. It’s fine if the received counter runs ahead of the others for a while, as long as they catch up. If the sent counter is higher than the received counter, we have a bug. If the wait-for-reply counter gets incremented after the save counter, we have a bug.
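Those relationships are easy to check mechanically. A sketch, assuming the snapshot shape from the counter sketch above:

```python
# A sketch of checking expected relationships between the counters,
# assuming the dict returned by ActionCounters.snapshot() above.
def check_invariants(counts: dict) -> list:
    """Return a description of every counter relationship that looks wrong."""
    problems = []
    # Every later step should have happened at most as often as 'receive'.
    for action in ["calc1", "calc2", "send", "wait_for_reply", "save"]:
        if counts[action] > counts["receive"]:
            problems.append(
                f"{action} ({counts[action]}) exceeds receive ({counts['receive']})"
            )
    # A save should never have happened before its wait-for-reply,
    # so in aggregate the save count should never be ahead.
    if counts["save"] > counts["wait_for_reply"]:
        problems.append("save has run more often than wait_for_reply")
    return problems
```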
Even for problems we can’t consistently reproduce, when they do happen we get valuable information about where to look. If some of the data is not being processed, we can quickly see which counter is lower than the others and therefore which step is holding things up. Then we can do a deeper investigation into how that action gets triggered, or into the error handling in the action itself.
Thinking like this can also improve your design, by helping you separate the high-level application logic that coordinates these actions from the detail code that runs them. Fundamentally this is just applying standard design goals again: separation of concerns, and separation of levels of abstraction. Designing this way will probably prevent many of these types of bug occurring in the first place, just because we started counting.
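One way that separation might look, assuming the ActionCounters sketch above; the handle_item coordinator and the callables it is given are hypothetical names for illustration only.

```python
# A sketch of separating coordination from detail: the coordinator only
# sequences the six actions and counts each one, while the real work lives
# in the callables passed in via `actions`. All names are hypothetical.
def handle_item(item, actions, counters):
    """High-level coordinator for one data item."""
    counters.record("receive")
    a = actions["calc1"](item)
    counters.record("calc1")
    b = actions["calc2"](item)
    counters.record("calc2")
    actions["send"](a, b)
    counters.record("send")
    reply = actions["wait_for_reply"](item)
    counters.record("wait_for_reply")
    actions["save"](item, reply)
    counters.record("save")
```

Keeping the coordination this thin means the counting comes almost for free, and the part of the code that is hardest to reason about stays small.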