When it comes to design, challenges can be overwhelming. Yet by taking some basic principles into account, design challenges can be turned into opportunities. Here are some rules to bear in mind when it comes to design in software engineering:
Principle 1: The production environment and codebase are full of Chesterton’s Gates.
There is a reason that someone built something. In the words of Chesterton, “If you don’t see the use of it, I certainly won’t let you clear it away. Go away and think. Then, when you can come back and tell me that you do see the use of it, I may allow you to destroy it.”
Chesterton then goes on to say: “There are reformers who get over this difficulty by assuming that all their fathers were fools; but if that be so, we can only say that folly appears to be a hereditary disease.”
Until a software engineer can explain the use of the thing, there is really no way to judge whether the reason behind it was sound. Partially destroying a thing, with a rollback plan and a failure budget in place, is a great way to learn what the use of the thing is.
Principle 2: If two systems must agree for them to work, someday they will inevitably disagree. – Hyrum’s Law
If your system relies on information held in two different sources of truth that are changed independently, and by different sets of engineers, the two will inevitably disagree some day. You have to make sure that downstream systems are aware of the assumptions made by upstream systems, and that a disagreement is noticed when it happens.
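One way to notice a disagreement early is to compare the two sources on a schedule and report every key on which they differ. The following is a minimal sketch of that idea; the function name and the sample data (a DNS inventory and a load balancer backend list) are hypothetical and only serve as an illustration.

```python
from typing import Dict, Optional, Tuple


def find_disagreements(
    source_a: Dict[str, str], source_b: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return every key whose value differs between the two sources.

    A key that is missing from one source is reported with None on that side.
    """
    disagreements = {}
    for key in sorted(source_a.keys() | source_b.keys()):
        a, b = source_a.get(key), source_b.get(key)
        if a != b:
            disagreements[key] = (a, b)
    return disagreements


# Hypothetical data: two sources of truth maintained by different teams.
dns_records = {"web-1": "10.0.0.1", "web-2": "10.0.0.2"}
lb_backends = {"web-1": "10.0.0.1", "web-2": "10.0.0.9", "web-3": "10.0.0.3"}

for host, (dns_ip, lb_ip) in find_disagreements(dns_records, lb_backends).items():
    print(f"{host}: DNS says {dns_ip}, load balancer says {lb_ip}")
```

A check like this does not prevent disagreement, but it turns a silent drift into a visible report.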
Principle 3: All software development is maintenance software development.
It’s often tempting to build a brand new system because it will be cleaner. It’s frequently more fun, too. This, however, leads to an endless stream of reimplementations that add new bugs, drop some old features without discussion, and create outages because, inevitably, the transition does not go as smoothly as planned.
If you build a culture of fixing, refactoring, and targeted rebuilding of systems, you are more agile (and Agile), and can reduce the needless churn introduced by rewriting. In practice, nobody ever really gets to do more than 6 months of “greenfield” development, anyway. After that point, they are iterating on an existing system, regretting some of the choices they made a few months ago. Why not just build a team that is really good at gradual change, instead?
Principle 4: “Any sufficiently complicated system has at its root an unmaintainable flat text file.” – Huber’s Law of Systems
Generally, systems end up needing configuration, and that configuration tends to scale with the complexity of the system. As the system becomes more complicated, the configuration grows until it becomes unmanageable.
This file usually ends up being text, because the features of revision control, such as history and the possibility of peer review, are considered very useful.
Principle 5: Decrease variance, increase mean.
Generally, users are happier if their experience is consistent, even if the performance is worse overall. Thus, accepting a somewhat worse mean in exchange for lower variance generally provides a much better user experience.
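A small worked example may make the trade-off concrete. The latency figures below are invented for illustration, and the `summarize` helper is hypothetical; it only uses Python's standard `statistics` module.

```python
import statistics

# Hypothetical latencies in milliseconds for ten requests each.
fast_but_erratic = [20, 20, 20, 20, 20, 20, 20, 20, 20, 800]
slower_but_steady = [110, 120, 115, 125, 118, 122, 117, 121, 119, 123]


def summarize(samples):
    """Return the mean, population standard deviation, and worst case."""
    return {
        "mean": statistics.mean(samples),
        "stdev": statistics.pstdev(samples),
        "worst": max(samples),
    }


for name, samples in [
    ("fast but erratic", fast_but_erratic),
    ("slower but steady", slower_but_steady),
]:
    s = summarize(samples)
    print(f"{name}: mean={s['mean']} ms, stdev={s['stdev']:.1f} ms, worst={s['worst']} ms")
```

The erratic series has the lower mean (98 ms against 119 ms), yet one request in ten takes 800 ms. The steady series is slower on average, but no user ever waits longer than 125 ms.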
Principle 6: Find problems before you start serving.
Your server should do as much internal consistency and sanity checking as it can on startup, before it reports as healthy. The best time to find a problem is before you start receiving queries.
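As a rough sketch of what this could look like, the hypothetical `Server` class below runs its checks inside `start()` and only flips its health flag once they all pass. The config keys (`port`, `backends`) are assumptions made for the example, not part of any real framework.

```python
class Server:
    def __init__(self, config: dict):
        self.config = config
        self.healthy = False  # stays False until start() succeeds

    def _startup_checks(self) -> list:
        """Collect every problem found, so one run reports them all."""
        problems = []
        port = self.config.get("port")
        if not isinstance(port, int) or not 1 <= port <= 65535:
            problems.append(f"invalid port: {port!r}")
        if not self.config.get("backends"):
            problems.append("no backends configured")
        return problems

    def start(self) -> None:
        problems = self._startup_checks()
        if problems:
            # Fail loudly before any query can reach this server.
            raise RuntimeError("refusing to serve: " + "; ".join(problems))
        self.healthy = True


server = Server({"port": 8080, "backends": ["backend-a"]})
server.start()
print(server.healthy)
```

A load balancer that only routes to instances reporting healthy would never send traffic to a server that failed these checks.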
Presubmit tests are good at checking that your config does what you think it does. Production tests are good at checking that the config is evaluated in the way it should be.
Engineers should not fall into the trap of trying to write tests to make sure that the config is evaluated in the way they think it should be evaluated. Instead, they should write tests to make sure that the config actually does what they think it should do. Then, they should separately test that the config is processed the way they expect it to be.
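One possible reading of this separation, sketched with a made-up routing config format (`prefix -> backend`): the first test asserts on the outcome the config is supposed to produce, and the second test exercises the processing code on its own. The parser, the format, and all names here are hypothetical.

```python
from typing import List, Optional, Tuple


def parse_routes(text: str) -> List[Tuple[str, str]]:
    """Parse 'prefix -> backend' lines; blank lines and '#' comments are skipped."""
    routes = []
    for line_no, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        prefix, sep, backend = (part.strip() for part in line.partition("->"))
        if not (prefix and sep and backend):
            raise ValueError(f"line {line_no}: expected 'prefix -> backend'")
        routes.append((prefix, backend))
    return routes


def route(routes: List[Tuple[str, str]], path: str) -> Optional[str]:
    """Pick the backend with the longest matching prefix, or None."""
    matches = [(p, b) for p, b in routes if path.startswith(p)]
    return max(matches, key=lambda m: len(m[0]))[1] if matches else None


# Hypothetical config under test.
CONFIG = """
# route everything to the frontend, except the API
/ -> frontend
/api -> api-server
"""


def test_config_does_what_we_intend():
    routes = parse_routes(CONFIG)
    assert route(routes, "/api/users") == "api-server"
    assert route(routes, "/index.html") == "frontend"


def test_config_processing_rejects_malformed_lines():
    try:
        parse_routes("/api api-server")
    except ValueError:
        return
    raise AssertionError("malformed line was accepted")


test_config_does_what_we_intend()
test_config_processing_rejects_malformed_lines()
```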
Principle 7: The principles of ‘Pets vs. cattle’
This principle refers to not caring about single device/machine failures.
A desktop computer, a laptop, and a smartphone can all be considered “pets”: they probably have a unique name, they perform unique functions or tasks, and you taught each of them to operate in the way it does. When any of them is sick, you nurse it back to health, and you feel sad if it dies.
Machines in prod are cattle: They don’t have names; they have numbers. They all go through the same process, and they’re largely indistinguishable. If one is sick, we can simply get a new one, because doing so is less work than nursing the sick one back to health. This is the only way to manage tens of thousands of cattle without overloading ourselves.
Ideally, any component of which we have more than a couple of instances is automated such that increasing the number of components by an order of magnitude is no additional work. This principle can apply to machines, deployments, copies of the software, user accounts, clusters, data centers, and so on.
Principle 8: Design systems for continuous failure.
By designing systems so that failure is a common part of normal operation, software engineers ensure that unexpected failures are handled correctly.
For example, consider a system that’s designed such that if a task shuts down, other instances take over the load automatically. If that task unexpectedly crashes, engineers can observe that other instances automatically take over the load without a problem, and therefore can expect them to do so in the future. Contrast this with a system that requires an engineer to perform a drain before taking a task out for maintenance. If a component in that system fails, you know an engineer will need to intervene manually to perform the drain.
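The first kind of system can be sketched in a few lines. In this hypothetical example, the caller treats a dead instance as a routine event and simply moves on to the next one; the `Instance` class and `send` function are illustrative stand-ins, not a real client library.

```python
class Instance:
    """A stand-in for one task of a replicated service."""

    def __init__(self, name: str, up: bool = True):
        self.name = name
        self.up = up

    def handle(self, request: str) -> str:
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


def send(instances, request: str) -> str:
    """Try each instance in turn; a failed instance is expected, not exceptional."""
    errors = []
    for instance in instances:
        try:
            return instance.handle(request)
        except ConnectionError as exc:
            errors.append(str(exc))
    raise RuntimeError("all instances failed: " + "; ".join(errors))


pool = [Instance("task-0", up=False), Instance("task-1"), Instance("task-2")]
print(send(pool, "GET /"))
```

Because the failover path runs whenever any task is down, including during planned maintenance, it is exercised constantly rather than only during an emergency.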
Principle 9: Don’t detect failure—detect the absence of success.
If software engineers are going to implement a failure detection mechanism, it has to share the same failure characteristics as the component in which they are trying to detect failure.
A given service or component can fail in an infinite number of ways, so trying to write rules that can catch them all is nearly impossible (but is an easy trap to fall into: “Oh, this failed… I should write an alert to catch that particular failure.”). Instead, try to catch when things don’t succeed. Doing so is advantageous for a couple of reasons:
- Absence of success is easier to detect than failure.
- This strategy is more robust against future changes. For example, someone might add in a new way a component can fail, or remove a way in which a component can fail.
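A minimal sketch of this idea is a watchdog that remembers when the last success happened and alerts once too much time has passed, whatever the cause. The class name and threshold are hypothetical, and timestamps are passed in explicitly to keep the example deterministic; a real service would more likely read a monotonic clock.

```python
class SuccessWatchdog:
    """Alert when no success has been recorded for too long."""

    def __init__(self, max_silence_s: float, now: float):
        self.max_silence_s = max_silence_s
        self.last_success = now

    def record_success(self, now: float) -> None:
        self.last_success = now

    def should_alert(self, now: float) -> bool:
        # No list of known failure modes: only the absence of success matters.
        return now - self.last_success > self.max_silence_s


watchdog = SuccessWatchdog(max_silence_s=60, now=0)
watchdog.record_success(now=50)
print(watchdog.should_alert(now=100))  # 50 s since last success
print(watchdog.should_alert(now=111))  # 61 s since last success
```

The watchdog needs no update when a new failure mode appears, because every failure mode ends the same way: successes stop arriving.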
Principle 10: Site Reliability Engineering (SRE) scales with the number of differences, not the total size of a system.
The more commonalities systems and services share, the more systems and services SRE can support.
It is vitally important to maximize commonalities among systems and services. Examples include keeping files in standard locations, using consistent naming conventions, and using standardized monitoring strategies (ideally, the same monitoring for multiple different systems). An SRE team can usually handle twice as many instances of the same job without any significant changes, but onboarding another service with half as many instances requires significant work.
Although these design principles might sound simple to implement, they can be challenging to put into practice in everyday design. After all, oversimplification can become a design issue of its own. So, by exercising a balanced and cautious approach, software engineering can become a successful journey.