Big Design Up-Front or Emergent Design? Hexagonal Architecture Gives Us Both
This blog post was originally published in the Orbs Engineering blog in 2018.
When we started coding the Orbs reference implementation in the Go programming language, we had several guidelines in mind:
The code must be written outside-in using TDD
The code must have bounded contexts, expressed as interfaces generated from Protobuf
The feedback cycle for a core developer needs to be sub-second
These 3 guidelines are, unfortunately, somewhat contradictory; outside-in TDD is all about emergent design, while Protobuf-first interfaces enforce upfront-design of the bounded contexts between these interfaces; and outside-in TDD starts with an E2E test, which takes seconds or dozens of seconds to run — many orders of magnitude slower than the acceptable duration for running tests for a fast feedback cycle. In this post, I’d like to share our thought process, conflicts and resolutions, as well as to present the approach we decided to take — an approach that ended up solving even more problems than we initially considered. But before we do that, let’s talk a bit about the rationale behind each of the three guidelines.
There are two major schools of thought within the TDD movement: outside-in and inside-out. Without going into details here, I’ll just mention that I’m an avid practitioner of outside-in TDD, a methodology that I’ve used in the past to craft software systems that adapt to changing requirements, and one that reduces the amount of unused code.
There are multiple reasons for TDDing the entire system from the get-go: Having confidence in the implementation, reducing fear of change, and — most importantly — setting the standard for the core team and the contributor community. In addition, being a blockchain product, we were especially interested in proving correctness and avoiding nasty bugs, so having a regression suite is a must.
IDL-first Interfaces Using Protobuf
…a language-neutral, platform-neutral, extensible mechanism for serializing structured data.
If you think about it from a non-developer’s perspective, a blockchain is composed of two parts: a protocol, specifying how nodes should talk to each other, consensus models, the data format of stored blocks, etc; and a network (or multiple networks) running nodes that perform computations, reach consensus, and persist the ledger. While all (or most?) blockchain projects also provide a reference implementation, it is the protocol that matters the most — since anyone can fork a blockchain to create their own separate network.
However, there’s a tight relationship between code and specification (the protocol). If the spec diverges from the code (which inevitably happens in all software projects), it becomes stale and loses relevance. But at the same time, you never know if your spec is foolproof until you’ve implemented it. So a good solution would couple the spec to the reference implementation in a way that allows quick validation of the spec, and breaks the reference implementation when the spec changes.
The Orbs spec describes several major parts of the system, or Bounded Contexts, each with its own concerns: the Consensus Algorithm, Block Storage, Gossip, and so on. At Orbs, we decided to use Protobuf as the IDL, with code generation in Go for all bounded context interfaces. Initially all of these interfaces live in the same process, but this design should allow us to split the system into microservices if and when this becomes a good idea.
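To make the coupling between spec and code concrete, here is a minimal sketch of how a generated interface can break the reference implementation at compile time. Everything in it is an assumption made for illustration: the BlockStorage interface, the CommitBlockInput and CommitBlockOutput messages and the blockStorageService type are hypothetical stand-ins written by hand, not the actual generated Orbs code.

```go
package main

import "fmt"

// CommitBlockInput and CommitBlockOutput stand in for message types that a
// Protobuf code generator would emit. Names and fields are hypothetical.
type CommitBlockInput struct {
	Height uint64
	Data   []byte
}

type CommitBlockOutput struct {
	LastCommittedHeight uint64
}

// BlockStorage stands in for a generated bounded-context interface. In a
// spec-first workflow it would be regenerated whenever the IDL changes.
type BlockStorage interface {
	CommitBlock(in *CommitBlockInput) (*CommitBlockOutput, error)
}

// blockStorageService is a hand-written implementation of the interface.
type blockStorageService struct {
	lastHeight uint64
}

// Compile-time check: if the interface changes and the implementation does
// not follow, the build fails on this line.
var _ BlockStorage = (*blockStorageService)(nil)

// CommitBlock accepts only the block that directly follows the last one.
func (s *blockStorageService) CommitBlock(in *CommitBlockInput) (*CommitBlockOutput, error) {
	if in.Height != s.lastHeight+1 {
		return nil, fmt.Errorf("expected height %d, got %d", s.lastHeight+1, in.Height)
	}
	s.lastHeight = in.Height
	return &CommitBlockOutput{LastCommittedHeight: s.lastHeight}, nil
}

func main() {
	var storage BlockStorage = &blockStorageService{}
	out, err := storage.CommitBlock(&CommitBlockInput{Height: 1, Data: []byte("genesis")})
	if err != nil {
		fmt.Println("commit failed:", err)
		return
	}
	fmt.Println("last committed height:", out.LastCommittedHeight)
}
```

Changing the signature of CommitBlock in the interface, as a regenerated file might, stops this program from compiling until the service is updated, which is the kind of breakage described above.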
Quick Feedback Cycle
As a core team composed of veteran software engineers, we all understood the importance of getting relevant and quick feedback on our code changes. This means that, ideally, the IDE automatically re-runs all relevant tests when a piece of code changes, and that these tests finish running before we even notice that they ran. If I have to wait many seconds or minutes between test runs, I will run them less often, and it becomes more likely that when I eventually do run them (or when CI runs them for me), I’ll find out that I’ve introduced multiple breakages, forcing me to go back to my code and analyze exactly what broke. If I break something and a test immediately fails, I don’t have to think at all; I just hit Ctrl+Z.
This is even more important when practicing TDD — I’d like to see my test fail, as expected, before I start writing code to make it pass, and I’d like to see it pass as soon as I’m done implementing it. And I’d like it to run often while I’m in the refactor phase, to increase the chances that I actually do any refactoring.
Finally, quick feedback cycles reduce wear on office furniture.
To Emerge or Not To Emerge
Now let’s go back to the conflicting guidelines discussed earlier. The first conflict lies between the wish to facilitate emergent design in order to increase adaptability to changes and reduce unused code, and the wish to impose upfront design of the bounded contexts of the system via Protobuf. Obviously neither approach on its own would be good enough; if we went only with emergent design, we wouldn’t necessarily end up implementing the specified protocol, which is, as mentioned above, the actual novelty of any blockchain project. But if we went only with upfront-design, we would be making the same mistakes that we all know from projects with upfront-design — late integration, untestable code, fear of change and so on.
After a lot of discussion and conflict, we came to an agreement: We should start from the outside-in, creating a walking skeleton of the system with all the necessary harnesses, to facilitate TDD from the first line of code. A walking skeleton forces the developer to pay much of the integration cost upfront, instead of paying it later with interest, after much of the code has already been written. We will then drive the design to meet the spec in spirit, and eventually we will replace the emerged interfaces with the code-generated interfaces and hope for the best.
This, however, immediately gave birth to the second conflict:
E2E tests are SLOW and FLAKY
Writing a walking skeleton can be a frustrating exercise, as it deals with a lot of mechanics and very little of the actual logic of the system. It is the phase where you integrate with a lot of the outside world, only to come up with a glorified Hello, World server.
We knew that if we started with a full-blown E2E test, with HTTP, file system, and TCP for inter-node communications, we would have a lot of moving parts, a ton of fragility and mounds of frustration. In addition, as people are very good at getting used to things, we would be introducing the core team to slow tests, thus reducing the probability that we would ever have fast tests. We wanted the team to be a vigilant proponent of fast tests.
Armed with these concepts, we issued a decree: Instead of a testing pyramid, we should talk about a testing matrix, which distinguishes two dimensions of testing: 1) speed (slow/fast) and 2) scope (small/large).
In the top-left corner, we have the small and fast tests. These are unit tests that only deal with business logic, without any IO. They perform pure computation and are CPU-bound. Slightly below are unit tests that have some temporal behavior, for instance the unit tests of a timer, ticker or trigger. These tests require some waiting (be it busy-wait or callbacks) and are flaky if using very small wait times, due to scheduling concerns in a managed runtime. My experience has been that any wait time below 1 millisecond will produce the occasional spurious failure.
In the opposite, bottom-right corner, we have our E2E tests: a small number of tests that exercise a system as close to production as possible, starting up Docker containers, using the real file system, networking, and so on. These tests take many seconds to start up and run, and are both IO-bound and IO-intensive.
This leaves three other types of tests to discuss, and we will come back to them later.
Adapters! Adapters Everywhere!
So, to recap: we wanted to have a fast suite of tests, and to make sure the core team is vigilant about keeping the suite fast, but in the same breath, we wanted to outside-in TDD the codebase, which necessitates starting from an E2E test. We wanted to facilitate emergent design so that our system is adaptive to changes, but in the same breath we also wanted to upfront-design the bounded contexts inside the system.
We agreed to start with outside-in and then apply the upfront-design, hoping that TDD would validate our design, but we didn’t want to pay the premium of an E2E walking skeleton.
The solution is simple. Each facet of the system that interacts with the world outside the system is represented by an adapter. This hides the specifics of the external dependency (be it network, database, file system, etc), and exposes a domain-specific API that speaks the semantic language of the system. For instance, in the following drawing, a BlockPersistence adapter exposes WriteBlock and ReadBlock operations, while hiding away the details of talking to the file system, file names, paths, or even sharding the blockchain across multiple files or volumes. This is, again, very basic software engineering.
Hexagonal Architecture To The Rescue
And indeed, the concept has been floating around for years. Coined by Alistair Cockburn, Hexagonal Architecture is the idea of looking at the system not as layers that represent different levels of responsibility (persistence, business logic, view, etc, known as n-tier architecture), but rather as a collection of modules — or components — with high affinity that talk to each other and to other systems outside of the bounds of its domain. This allows us to focus on the semantics of the relevant area of responsibility in the context of the system’s domain, instead of the mechanics of said area of responsibility.
The idea is simple, but thinking of all adapters as separate abstractions, even when they have similar implementations (such as the same database vendor or the file system), yields much better domain affinity. Meanwhile, thinking about a collection of DatabaseAccessObjects in a single Data Access Layer often results in generic, template-style indirections.
And it gets better. If we assume that all mechanical concerns are mere implementation details, we can put them off until after the initial phases of development and just implement them naively in-memory. We develop outside-in from the first test to the user-facing facet of the system, then to business logic and finally to some other system for persistence or networking or whatnot (for instance, writing a block to persistent storage). Whenever we reach such a place during outside-in TDD, we first implement it inline (holding an array of blocks inside the business logic code), then refactor it out as an in-memory implementation and introduce an adapter interface. This also helps reduce cognitive load for the developers: we can defer the specifics of the “real” implementation (e.g. the woes of file system operations on ext4) to a later point and tackle it as a separate task, with our minds dedicated to that specific problem.
As time goes on, we collect more evidence of the requirements that we have of our adapter (BlockPersistence in our case). Once we decide to implement a filesystem persistence module, we will have an emergent API that is covered by tests. At this point, we can create an Integration Test that runs against the in-memory implementation, then run it against the filesystem implementation as a kind of Contract Test. These tests are slow and small in scope, putting them in the bottom-left quadrant, but the good news is that we’re only going to have a few of them. They don’t need to run often. We can decide to skip them locally and run them only in CI, depending on the scale of our build and test suite.
But How Does That Help Me?
If we go back to our original list of guidelines, we wanted to make sure that:
1. The code must be written outside-in using TDD
2. The code must implement hard interfaces, generated from Protobuf
3. The feedback cycle for a core developer needs to be sub-second
Using dependency injection, we provide the set of all adapter interfaces to a factory method that constructs the logic part of the node.
This factory method is called from two places: once from the main function, passing in the real implementations of the adapters and creating a production-ready node; and once from the test harness of a suite of tests we dubbed Acceptance Tests, passing in in-memory adapters. These tests belong in the upper-right quadrant. They are scoped at the node level, allowing us to test the integration of all components of the system, but they perform no IO at all, so they are relatively quick. The acceptance suite initially ran in just under two seconds, and later grew in size and scope so that it took ~10 seconds to run (code can be found here).
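A minimal sketch of this wiring, under stated assumptions: NewNode, Node, the Gossip adapter, the CommitTransaction method and the in-memory adapters are all hypothetical names invented for the example, and the node logic is reduced to a few lines. Only the idea of passing the adapter interfaces into a factory comes from the text.

```go
package main

import (
	"errors"
	"fmt"
)

type Block struct {
	Height uint64
	Data   []byte
}

var ErrBlockNotFound = errors.New("block not found")

// Adapter interfaces (ports) that the node logic depends on.
type BlockPersistence interface {
	WriteBlock(b Block) error
	ReadBlock(height uint64) (Block, error)
}

type Gossip interface {
	Broadcast(b Block) error
}

// Node is the logic part; it knows nothing about disks or sockets.
type Node struct {
	persistence BlockPersistence
	gossip      Gossip
	height      uint64
}

// NewNode is the factory method: every adapter is injected by the caller.
func NewNode(p BlockPersistence, g Gossip) *Node {
	return &Node{persistence: p, gossip: g}
}

// CommitTransaction wraps data in the next block, stores it and gossips it.
func (n *Node) CommitTransaction(data []byte) (uint64, error) {
	next := Block{Height: n.height + 1, Data: data}
	if err := n.persistence.WriteBlock(next); err != nil {
		return 0, err
	}
	if err := n.gossip.Broadcast(next); err != nil {
		return 0, err
	}
	n.height = next.Height
	return n.height, nil
}

// In-memory adapters, as an acceptance test harness would supply them.
type memPersistence struct{ blocks map[uint64]Block }

func (m *memPersistence) WriteBlock(b Block) error {
	m.blocks[b.Height] = b
	return nil
}

func (m *memPersistence) ReadBlock(h uint64) (Block, error) {
	b, ok := m.blocks[h]
	if !ok {
		return Block{}, ErrBlockNotFound
	}
	return b, nil
}

type memGossip struct{ sent []Block }

func (g *memGossip) Broadcast(b Block) error {
	g.sent = append(g.sent, b)
	return nil
}

func main() {
	// Acceptance-style wiring. A production main would call the same
	// factory with file-backed and network-backed adapters instead.
	persistence := &memPersistence{blocks: map[uint64]Block{}}
	gossip := &memGossip{}
	node := NewNode(persistence, gossip)

	height, err := node.CommitTransaction([]byte("tx-1"))
	fmt.Println(height, err, len(gossip.sent))
}
```

Because the harness holds references to the in-memory adapters, a test can both drive the node and inspect what it stored or broadcast, without any IO.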
The Acceptance Suite has a critical role: it was the starting point of our outside-in TDD process; it allowed us to drive new behavior and features into the system; it helped identify and deal with flakiness, since it exercises the whole system and helps deadlocks and race conditions surface; and it serves as a last safety net before we head out to the evil, uncertain world of slow tests.
Using Hexagonal Architecture, we used upfront design to forge the inner bounded contexts of the system, while allowing design to emerge inside each bounded context (a component of the system, such as the Consensus Algorithm), as well as outside of them, in mechanical areas where we wanted to tease out a mature API via emergent design before committing to implementation details (an adapter, such as the Block Persistence). Each bounded context is tested using a suite of Component Tests (top-right quadrant).
Read more about Component Tests in the subsequent blog post.