Case Study: Versatile CraneView Refactoring
For the past 8 months I’ve been working with Versatile, a construction technology startup aiming to provide data and aid decision-making in construction projects. I spent time pair-programming with many of their engineers, helping them build maintainable software using TDD and make existing software more maintainable through refactoring and added tests. This is a case study of how I accompanied them in refactoring a critical Python system at the heart of their business domain.
Versatile’s CraneView device is placed between a crane’s hook and its actual payload, providing telemetry such as GPS coordinates, observed payload weight, altitude and so on. The Versatile backend software then consumes the events sent from cranes and deduces what the crane actually did by analyzing how those events change over time.
Like any growth startup, Versatile had to bootstrap very quickly and make the most of their initial funding. Being a hardware/software endeavor, they had to come up with a robust device that could withstand the harsh environmental and weather conditions of a construction site. Thus, the CraneView device consists of a microprocessor hooked to commonly available sensors, with a Python daemon collecting measurements and sending telemetry via MQTT to a cloud IoT backend, where the rest of the Versatile stack consumes the events and derives data from them.
Over time, due to changing requirements, backtracking and the general state of affairs of an early-stage startup, the Python code, which was not covered by a comprehensive test suite, began to decay. As the company grew, new engineers joined, some engineers left, and the team maintaining the codebase lived with significant fear of change. Although they had the means to deliver software updates to devices automatically over the air, and even had a Continuous Delivery pipeline in place, they became reluctant to change the software for fear of unknown side effects and production downtime.
Towards A Brighter Future
The tipping point came when the need to develop a new monitoring flow arose, but was hampered by the team’s deep fear of change. The desired architecture would wrap each sensor with logic to manage the lifecycle of the measurement code, allowing various interventions by a piece of supervision logic when the measurement code’s state is deemed unhealthy. Most of the existing implementation resided in two Python files, comprising all measurement logic, business logic and networking, with separate files providing drivers for the various sensors, all running as a single process. The new architecture would run separate units of logic in separate processes so that a single sensor logic could be restarted without downtime for the rest of the system.
At this point, I had already been working with Versatile for several months. The company’s CTO approached me to help deal with the pains of hypergrowth they had experienced, and to instill a culture of sustainable delivery and engineering excellence. Over a period of several months I pair-programmed with numerous engineers to improve feedback cycles and reduce fear of change by adding test coverage to existing projects, followed by extensive refactors, and by creating new services using Outside-In TDD.
With the CraneView device software, my goal was to combat fear of change by creating a safety net in the form of a smoke test, and then making a series of small, non-breaking changes to gradually reach the desired goal. I suggested that I start pair-programming with a couple of engineers from the team and work towards what I call “mitosis”: gradually separating the different pieces of logic into well-defined subsystems inside the existing process, then breaking it down into separate processes in a single final step, not unlike the methodology for breaking a monolithic web server down into microservices.
The Safety Net
Because the codebase was mission-critical and lacked test coverage, we could not risk making any change before having at least a gross assurance that we were not breaking anything, so the first order of business was creating a single smoke test to prove that the software generally works as expected. The smoke test was initially run manually: it subscribes to the MQTT topic, expects to receive a fixed number of events from a specific device over a period of time, and asserts that all fields are present as expected - that all relevant metadata matches the device configuration, and that all measurements are valid (non-zero, within expected ranges, etc.). This test will not detect all bugs, but if something catastrophic happens to the general flow of data, it will be caught. Given a lab device that can be controlled by the tester (be it a human or code), this test could go even further, asserting specific values, and serve as a kind of end-to-end test.
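As a sketch, the field-validation step of such a smoke test might look like the following; the event schema, field names and value ranges here are hypothetical illustrations, not Versatile’s actual telemetry format:

```python
# Hypothetical schema: the field names and ranges below are invented
# for illustration, not taken from the real CraneView telemetry.
REQUIRED_FIELDS = {"device_id", "timestamp", "latitude", "longitude",
                   "weight_kg", "altitude_m"}

def validate_event(event: dict) -> list[str]:
    """Return a list of validation problems; an empty list means the
    event looks healthy."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS - event.keys()]
    if not problems:
        if not -90 <= event["latitude"] <= 90:
            problems.append("latitude out of range")
        if not -180 <= event["longitude"] <= 180:
            problems.append("longitude out of range")
        if event["weight_kg"] <= 0:
            problems.append("weight must be positive")
    return problems
```

The smoke test would run this check against every event received from the MQTT subscription during the observation window.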
After writing the smoke test and verifying that it could catch gross breakages, it was time to start the actual refactoring. My goal was to help the team write a suite of fast, integrative tests (of the types I call Acceptance Tests or Component Tests) that could provide them with tighter validation of business logic at the level of the whole system (acceptance tests) and later at the level of a subsystem such as a specific sensor process (component tests). Having a fast suite of integrative tests, where each test takes milliseconds to run, yields a quick feedback cycle, facilitating rapid iterative development, making engineers happier, and resulting in better code. To achieve such speed, we had to replace every I/O operation with a test double that does not perform I/O. The system had three types of I/O: reading measurements from a sensor, writing data to local storage, and publishing events to MQTT.
To replace the I/O operations, we gradually, concern by concern, introduced a Hexagonal Architecture into the system. This architecture uses dependency injection to instantiate the system (or a subsystem), injecting it with all the I/O-related dependencies it needs in order to run. An outer layer, called the Outer Hexagon, is responsible for instantiating all adapters, and uses them to construct the Inner Hexagon, where all logic resides. For instance, the subsystem responsible for transmitting events to the cloud would require a MessagePublisher, implemented in this case using MQTT. We wrapped all measurement, file system and networking pieces of logic with appropriate Adapters, instantiating the adapters in the process’s main function (the outer hexagon) and injecting them into the existing logic as constructor parameters.
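A minimal sketch of this wiring: MessagePublisher is the port mentioned above, while EventTransmitter and the in-memory adapter are names invented here for illustration. The production outer hexagon would construct an MQTT-backed implementation of the same interface instead:

```python
from abc import ABC, abstractmethod

class MessagePublisher(ABC):
    """Port: the inner hexagon depends only on this interface."""
    @abstractmethod
    def publish(self, topic: str, payload: dict) -> None: ...

class EventTransmitter:
    """Inner hexagon: pure logic, performs no I/O of its own.
    All I/O arrives through the injected adapter."""
    def __init__(self, publisher: MessagePublisher, device_id: str):
        self._publisher = publisher
        self._device_id = device_id

    def transmit(self, measurements: dict) -> None:
        event = {"device_id": self._device_id, **measurements}
        self._publisher.publish("telemetry", event)

class InMemoryPublisher(MessagePublisher):
    """Fake adapter for tests; a production adapter would wrap an
    MQTT client behind the same interface."""
    def __init__(self):
        self.published = []

    def publish(self, topic, payload):
        self.published.append((topic, payload))
```

Because the inner hexagon never names MQTT, the test harness can inject the in-memory fake while the production main function injects the real client, with no change to the logic in between.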
Another issue that could complicate acceptance tests was that the logic was spread across multiple threads with different lifecycles, making it impossible to make deterministic assertions on the state at any given point. To overcome this, acceptance tests run the entire business flow in a single thread (the test thread). This is achieved by extracting any main / thread loop to the outer hexagon, with the inner hexagon exposing a “tick” function that performs a single step of the loop (for instance, polling a single measurement). The test can then “tick” different flows in the system instead of busy-waiting for actions to happen or for data to flow, providing determinism and saving time on sleeps.
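A minimal sketch of the “tick” pattern, with invented names: the production outer hexagon owns the thread and the actual loop, while a test drives the inner hexagon one deterministic step at a time:

```python
import threading

class MeasurementLoop:
    """Inner hexagon: a single 'tick' performs one step of the flow.
    No loop, no sleep, no thread of its own."""
    def __init__(self, read_sensor, publish):
        self._read_sensor = read_sensor
        self._publish = publish

    def tick(self):
        self._publish(self._read_sensor())

def run_forever(loop: MeasurementLoop, stop: threading.Event):
    """Production outer hexagon: owns the thread lifecycle and loops
    until asked to stop."""
    while not stop.is_set():
        loop.tick()
```

A test simply calls `tick()` directly on the test thread, so every assertion observes a fully settled state with no races and no waiting.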
Note that this step was somewhat of a “blind” refactor, since there was no viable test suite we could run continuously during the refactor to make sure nothing had been broken. The solution was to take care, work in pairs, and periodically deploy the code to a production-like environment to verify it still worked as expected. The aforementioned safety net could be used for that, but its feedback cycle is too slow to run continuously mid-refactor.
Fake It Till You Fake It
In order to be able to rely on test doubles (fake adapters) in the fast, integrative tests, we had to prove that the fakes behave exactly like the adapters they replace. This is done by writing a suite of contract tests, one per adapter, that runs twice: once against the fake adapter and again against the real adapter. If the same test passes against both adapters, we know that they behave the same way and can safely use the fake adapter in the integrative tests.
The file system and network adapters were simple: the file system, essentially a key-value store mapping file names to binary content, can be replaced by a hash table (a dictionary in Pythonese), and the network can be replaced with an in-memory queue (a Python queue) that allows the test to block until an event is received, then make assertions on that event. As such, and since both versions of each adapter must inherently conform to the same interface, it’s easy to write tests for these. For instance, instantiate the real file system adapter with the path of a temp file, instantiate the fake adapter, then see that both can set and get data.
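A sketch of such a contract test for the file system adapter, with hypothetical class and method names; the point is that the exact same test function runs against both the fake and the real adapter:

```python
import os
import tempfile

class FakeFileStore:
    """Fake adapter: a dictionary standing in for the file system."""
    def __init__(self):
        self._data = {}

    def set(self, name: str, content: bytes):
        self._data[name] = content

    def get(self, name: str) -> bytes:
        return self._data[name]

class FileStore:
    """Real adapter: files under a directory on disk."""
    def __init__(self, root: str):
        self._root = root

    def set(self, name: str, content: bytes):
        with open(os.path.join(self._root, name), "wb") as f:
            f.write(content)

    def get(self, name: str) -> bytes:
        with open(os.path.join(self._root, name), "rb") as f:
            return f.read()

def contract_test(store):
    """Runs unchanged against either adapter; passing against both
    proves they behave the same way."""
    store.set("config", b"threshold=5")
    assert store.get("config") == b"threshold=5"

contract_test(FakeFileStore())
with tempfile.TemporaryDirectory() as root:
    contract_test(FileStore(root))
```

In a real suite this would typically be a parametrized test (e.g. a pytest fixture yielding each adapter in turn) rather than two explicit calls.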
More complicated were the sensors; it’s easy to fake, say, a GPS adapter - the fake adapter simply allows the test to report a coordinate manually - but how do you write a contract test? The test needs to run on a machine connected to the same type of GPS that’s installed on the real device. The easiest way to do that would be on a lab device, alongside the end-to-end test. So a (future) part of the refactor project would entail creating a build configuration that deploys the code to a lab device and runs these tests as part of the system’s CI flow.
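For illustration, a fake GPS adapter along these lines might look like this (the names are hypothetical):

```python
class FakeGps:
    """Fake GPS adapter: the test reports coordinates manually
    instead of reading them from real hardware."""
    def __init__(self):
        self._coordinate = None

    def set_coordinate(self, lat: float, lon: float):
        """Called by the test to simulate a GPS fix."""
        self._coordinate = (lat, lon)

    def read(self):
        if self._coordinate is None:
            raise RuntimeError("no GPS fix yet")
        return self._coordinate
```

The contract test for this adapter is the hard part: it can only meaningfully run against the real GPS driver on a lab device, which is why it belongs in the hardware-connected CI stage described above.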
After extracting all I/O and threading logic to injectable adapters, we now had an inner hexagon that was completely devoid of I/O and two outer hexagons: one for the production system, that kept behaving exactly as it did before (since at no point did we change any behavior), and another for the integrative tests, which we call the Test Harness. This second outer hexagon composes the inner hexagon with fakes and exposes the fakes for the test to interact with. Since it is completely in-memory, each test case can construct its own instance of the test harness, so tests can run in parallel and in complete isolation.
Before moving forward with the refactor, I advised the team to deploy the code to a lab device, run the smoke test, and make sure that everything still worked. As expected, this took a few back-and-forth iterations to flush out bugs missed in the blind refactoring process.
Eventually the team felt confident enough that the codebase was stable, and we could proceed to the next step: writing acceptance tests. These tests are intended to cover integrative flows that represent product requirements, for instance: “the device transmits an event with all required measurements”, or “the device alerts when the battery level is lower than the configured threshold”. Each of these flows is represented by a test case that is completely isolated from any other test. Any configuration or metadata that can impact the expected results (such as the battery level threshold) is set by the test as part of its setup phase.
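A minimal, hypothetical sketch of the battery-threshold acceptance test: a tiny test harness wires invented fakes into a much-simplified piece of inner-hexagon logic, and the test sets its own threshold in its setup phase:

```python
class FakeBattery:
    """Fake battery sensor adapter; the test sets the level directly."""
    def __init__(self):
        self.level = 100

    def read(self):
        return self.level

class FakePublisher:
    """Fake network adapter; records events instead of publishing to MQTT."""
    def __init__(self):
        self.events = []

    def publish(self, event):
        self.events.append(event)

class BatteryMonitor:
    """Simplified stand-in for the inner-hexagon logic under test."""
    def __init__(self, battery, publisher, threshold):
        self._battery = battery
        self._publisher = publisher
        self._threshold = threshold

    def tick(self):
        level = self._battery.read()
        if level < self._threshold:
            self._publisher.publish({"type": "battery_alert", "level": level})

class Harness:
    """Test-only outer hexagon: composes the inner hexagon with fakes
    and exposes the fakes for the test to interact with."""
    def __init__(self, battery_threshold):
        self.battery = FakeBattery()
        self.publisher = FakePublisher()
        self.monitor = BatteryMonitor(self.battery, self.publisher,
                                      battery_threshold)

def test_alerts_when_battery_below_threshold():
    harness = Harness(battery_threshold=20)   # setup: test picks the config
    harness.battery.level = 15                # simulate a low battery
    harness.monitor.tick()                    # drive one deterministic step
    assert harness.publisher.events == [{"type": "battery_alert", "level": 15}]

test_alerts_when_battery_below_threshold()
```

Since the harness is entirely in-memory, each test constructs its own instance, keeping tests isolated and parallelizable.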
Completing The Mitosis
Having covered all known product flows with fast, integrative tests, the team could now proceed with the initial goal of breaking the logic down into separate subsystems that could only talk to each other via hard interfaces. This necessitated creating a new type of adapter, for IPC (inter-process communication); whereas under the monolithic process the different subsystems could talk to each other via queues or function calls, the new architecture would require an explicit communication mechanism. Eventually a TCP-based solution was chosen, and as with the other adapters, a fake implementation using in-memory queues was introduced into the test harness’ outer hexagon.
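A sketch of such an IPC port and its in-memory fake, with hypothetical names; the production adapter would wrap TCP sockets behind the same interface:

```python
import queue
from abc import ABC, abstractmethod

class Channel(ABC):
    """Port for inter-subsystem communication; the subsystems depend
    only on this interface, never on the transport."""
    @abstractmethod
    def send(self, message: dict) -> None: ...
    @abstractmethod
    def receive(self, timeout: float = 1.0) -> dict: ...

class InMemoryChannel(Channel):
    """Fake adapter for the test harness; a production implementation
    would use TCP between processes."""
    def __init__(self):
        self._queue = queue.Queue()

    def send(self, message):
        self._queue.put(message)

    def receive(self, timeout=1.0):
        return self._queue.get(timeout=timeout)
```

Because the subsystems only see the Channel interface, swapping the in-memory fake for the TCP adapter is a pure wiring change in the outer hexagons.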
At this point, running all subsystems in the same process was simply an implementation detail. In one fell swoop, the team simply moved the initialization for each subsystem into its own main file, effectively creating multiple “outer hexagons”, one per process.
Now the original goal could be achieved: creating a new monitoring and supervision flow to allow finer control of the device’s software in the face of failures in the various subsystems - be it networking or sensors. Because the system had been refactored to a hexagonal architecture and covered by an extensive suite of fast, integrative tests, it was now possible to add new behavior using TDD, facilitating emergent design and thus helping prevent code decay. We wrote a test requiring the desired new monitoring / supervision behavior in one of the subsystems, ran it to observe the expected failing behavior, then implemented the new behavior for that one subsystem.
After implementing the new behavior in one of the subsystems, we noticed that we were repeating the same test for the other subsystems, each test instantiating the full test harness while exercising only one subsystem. This is a smell hinting that a smaller-scoped test is in order; in this case, a component test for each subsystem. These component tests would, like the acceptance test harness, instantiate the inner hexagon of the subsystem-under-test with the appropriate fake adapters, use these fakes to simulate the failure scenario(s) specific to the subsystem, then assert that the subsystem declares itself as unhealthy in the API exposed to the monitoring subsystem.
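For illustration, a component test along these lines might look like the following, with invented names and a much-simplified health model:

```python
class FakeSensor:
    """Fake sensor adapter: the test can flip it into a failing state."""
    def __init__(self):
        self.failing = False

    def read(self):
        if self.failing:
            raise IOError("simulated sensor failure")
        return 42.0

class SensorSubsystem:
    """Simplified inner hexagon of one subsystem: it exposes its health
    through an attribute the monitoring subsystem can query."""
    def __init__(self, sensor):
        self._sensor = sensor
        self.healthy = True

    def tick(self):
        try:
            self._sensor.read()
            self.healthy = True
        except IOError:
            self.healthy = False

def test_declares_unhealthy_on_sensor_failure():
    sensor = FakeSensor()
    subsystem = SensorSubsystem(sensor)   # only this subsystem is wired up
    sensor.failing = True                 # simulate the failure scenario
    subsystem.tick()
    assert subsystem.healthy is False

test_declares_unhealthy_on_sensor_failure()
```

Note the narrow scope: only the subsystem-under-test is instantiated, not the full test harness, which is exactly what makes these tests cheap to multiply per subsystem.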
This type of test can also be used to cover branching behaviors that are not part of the major flows, for instance decisions pertaining to one of the subsystems but not the others. In general, as a software system becomes more complex, the suite of component tests grows faster than the acceptance suite, as this is the proper place for specifying the nuanced behavior of each subsystem.
Fear of change cripples most software development teams. The conditions that led Versatile to an unmaintainable codebase at the heart of their business domain are not unique; they plague almost all startups, not only for want of experienced engineers, but also because the very nature of a startup requires agility and experimentation until finding the right product-market fit - at which point, the codebase might have undergone a lot of sporadic changes, often without much thought for the long term. It’s not unavoidable, but it’s still the most common case.
Using simple refactoring techniques and retroactively covering the codebase with a good suite of tests, resulting in a design similar to what would have emerged had the system been TDDed to begin with, is a proven and effective way of combating fear of change and facilitating the continuing growth of a startup and its product.