Engineering Preventive Mechanisms into Software Systems

A software engineer's approach to designing for failure, resiliency, and efficiency, exploring how traditional engineering concepts and ideas fit into tech.


I'm not a professional engineer in the traditional sense, and if you're reading this blog, you're probably not one either. I'll let you decide whether titles like "software engineer" and "architect" are appropriate for tech knowledge workers, but regardless of your conclusion, this much holds: while much of the mathematics and traditional engineering knowledge required to become a licensed engineer isn't relevant to software engineers, many of the theoretical ideas and design principles are. We don't design the angles in a car's frame to optimize for manufacturing cost and safety, but we do write the code that controls the car's steering, braking, and acceleration. Our code, and how we write it, can be mission-critical, safety-critical, and every other kind of critical. The quality of what we, as software engineers, do matters.

I can't think of any safety-critical code I've personally written (though some of you may have), but practically everything we write is mission-critical to someone. In this post, I want to highlight the idea of preventative mechanisms and how we, as software engineers, can integrate them into our code.

I've been working on improving my skills in systems engineering. "Systems engineer" as a job title can mean many things, ranging from traditional engineering to software. Software systems engineers typically work on low-level problems, integrating, designing, and building systems in languages like Rust and C++ to squeeze out every bit of performance possible. Don't let leetcode fool you: algorithmic complexity and which kind of for loop you use don't matter much for 95% of use cases. Systems engineering is the other 5%, where saving a few milliseconds can translate into significant time and cost savings. I don't enjoy leetcode (and definitely don't think it's an effective hiring tool), but at the level where these engineering ideas and theories matter, you start to encounter genuinely interesting problems (at least in my opinion). But that's enough about that.

At the systems engineering level, because small improvements can have outsized impacts (see: The Cumulative Advantage), building in quality becomes very important. Systems engineering in software encompasses more than performance optimization. While improving execution speed matters, true systems engineering focuses on designing systems to scale and to fail well. There are three concepts from traditional engineering that I think are valuable for us as systems engineers, or just as regular software engineers:

Fault Tolerance

Fault tolerance is the ability of a system to maintain proper operation despite failures or faults in one or more of its components. - Wikipedia

Resilient distributed systems are designed to keep functioning even in the event of failures; fault-tolerant systems keep functioning when faults or problems occur. Traditionally, if a website got too much traffic, the service might not be able to handle it all and would go down. So we introduced the load balancer, which spreads a process's load across several servers. Beyond that, we created ways to scale servers up and down based on demand. More recently, we've built architectures that don't depend on any one server and can run on servers around the world with minimal persistent overhead, resulting in extremely resilient websites. This evolution in distributing work across machines has increased the fault tolerance of our information systems to the point where a website really could stay online around the world even if all of North America lost its internet connection.
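
As a small illustration at the code level, here's a minimal Rust sketch of client-side failover: a request is attempted against a list of replica endpoints so that one unhealthy server doesn't take the whole operation down. The endpoint names and the `fetch_from` stand-in are hypothetical; a real implementation would use an actual HTTP client, timeouts, and backoff.

```rust
/// Minimal client-side failover sketch: try each replica in turn so the
/// failure of any single server does not fail the whole operation.
/// `fetch_from` is a hypothetical stand-in for a real network call.
fn fetch_from(endpoint: &str) -> Result<String, String> {
    // Simulate one unhealthy replica; a real implementation would issue
    // an HTTP request with a timeout and retry/backoff policy.
    if endpoint.contains("replica-1") {
        Err(format!("{endpoint}: connection refused"))
    } else {
        Ok(format!("response from {endpoint}"))
    }
}

fn fetch_with_failover(endpoints: &[&str]) -> Result<String, String> {
    let mut last_err = String::from("no endpoints configured");
    for endpoint in endpoints {
        match fetch_from(endpoint) {
            Ok(body) => return Ok(body), // first healthy replica wins
            Err(e) => last_err = e,      // remember the failure, keep going
        }
    }
    Err(last_err) // every replica failed: surface the last error to the caller
}

fn main() {
    let replicas = ["https://replica-1.example.com", "https://replica-2.example.com"];
    match fetch_with_failover(&replicas) {
        Ok(body) => println!("ok: {body}"),
        Err(e) => eprintln!("all replicas failed: {e}"),
    }
}
```

The fault (one dead replica) still happens; the system is simply designed so that it doesn't matter.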

Fail-Safe and Fail-Secure

A fail-safe is a design feature or practice that, in the event of a failure of the design feature, inherently responds in a way that will cause minimal or no harm to other equipment, to the environment or to people. - Wikipedia

The concept here is that systems should be designed to fail. Not that we should try to make them fail, but that we should design how they fail. That means pushing things past their limits and then handling those failures in a way that doesn't harm the system or the components around it.

The difference between fail-safe and fail-secure is best illustrated with magnetic door locks. When the power goes out, what should happen? Yes, we should have battery backups and a generator, but think beyond that. If the system is fail-safe, the door unlocks, allowing anyone inside to get out. If it is fail-secure, the door stays locked until it is manually overridden or power is restored.

| | Fail-Safe | Fail-Secure |
| --- | --- | --- |
| Goal | Minimize harm or danger | Maintain system security |
| Priority | Safety | Security |
| Response to failure | Enter a safe state, often reducing functionality | Restrict access or functionality to maintain security |
| Examples of usage | Stopping processes, reducing access rights | Locking systems, blocking unauthorized access |

Sometimes it makes sense to stop further changes, and sometimes it doesn't. Resiliency is about continuing to operate even when things go wrong, and things will always go wrong. In some situations, the right call is to make a best effort to keep whatever still works running even when parts of the system have failed; in others, the right call is to stop operating entirely.
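
To make the distinction concrete, here's a small Rust sketch modeled on the door-lock example: the same failure (a power loss) produces a different default state depending on which policy the system was designed around. The types and names are purely illustrative, not from any real access-control library.

```rust
/// Illustrative sketch of fail-safe vs. fail-secure using the magnetic
/// door-lock example. None of these types come from a real library;
/// they just encode the two policies.
#[derive(Debug, PartialEq)]
enum DoorState {
    Unlocked,
    Locked,
}

enum FailurePolicy {
    /// Prioritize safety: on failure, let people get out.
    FailSafe,
    /// Prioritize security: on failure, keep the door locked.
    FailSecure,
}

/// The state the door falls back to when power is lost.
fn state_on_power_loss(policy: &FailurePolicy) -> DoorState {
    match policy {
        FailurePolicy::FailSafe => DoorState::Unlocked,
        FailurePolicy::FailSecure => DoorState::Locked,
    }
}

fn main() {
    assert_eq!(state_on_power_loss(&FailurePolicy::FailSafe), DoorState::Unlocked);
    assert_eq!(state_on_power_loss(&FailurePolicy::FailSecure), DoorState::Locked);
    println!("both policies behave as designed when power is lost");
}
```

The important part is that the failure state is a deliberate design decision, written down before the failure ever happens, rather than whatever the system happens to do when power disappears.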

Preventative Mechanisms

A poka-yoke is any mechanism in a process that helps an equipment operator avoid (yokeru) mistakes (poka) and defects by preventing, correcting, or drawing attention to human errors as they occur. - Wikipedia

Written in Japanese as ポカヨケ (poka-yoke), the term comes from the idea that mistakes should be prevented by design. It's often called "mistake-proofing," "error prevention," or, less formally, "idiot-proofing." The idea is that the system's mechanisms should take preventative action against any error, whether made by the operator or caused by the system's environment, that can be detected.

This is really what this post is about and why I decided to write it. I've been building low-level SDKs in Rust for a PTZ camera as part of a multi-year project to rebuild a prototype broadcast control interface/dashboard I created several years ago. It's a project where the dependencies simply have never been built, so I'm not only building the interface and logic but also creating the SDKs to send packets to hardware, process video frames, and so on. There are lots of opportunities to prevent errors, both in the data packets and in higher-level inputs. I'm approaching it from the perspective of building a user dashboard/control interface while also open-sourcing the SDKs so that others can use them to create their own automations and tools. It's a very fun project for someone like me who enjoys working on video/media-related technology.

One thing I've consistently come back to, both while designing the SDKs and the user-facing interfaces, is finding ways to prevent failures and mistakes. I make mistakes regularly; if you don't, you're not doing anything challenging. What sets good problem solvers apart is the ability to prevent future mistakes. The prevalent philosophy says you should learn from your mistakes, and while I certainly do, I don't think that's always the right approach. Instead, I think you should analyze and prevent your mistakes. If a mistake is preventable, you shouldn't try to train yourself not to make it; you should make it so that you, and everyone else, can't make it at all. I'm not sure that approach can or should be applied to every area of life (we sometimes need to learn the hard way), but in system design it certainly makes sense.
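
In Rust, the most natural poka-yoke is the type system: make the invalid input unrepresentable so the mistake can't be made in the first place. Here's a hedged sketch of what that might look like for something like a pan-speed parameter in a camera SDK; the `PanSpeed` type, its range, and the packet bytes are hypothetical stand-ins, not the actual SDK's API.

```rust
/// A hypothetical poka-yoke for a PTZ-style SDK: a pan speed that can only
/// be constructed inside its valid range, so callers can't send an
/// out-of-range value to the hardware by mistake. The type, range, and
/// packet layout are illustrative only.
#[derive(Debug, Clone, Copy)]
pub struct PanSpeed(u8);

impl PanSpeed {
    pub const MIN: u8 = 1;
    pub const MAX: u8 = 24;

    /// The only way to get a PanSpeed: out-of-range input is rejected
    /// before it can ever become a malformed packet.
    pub fn new(raw: u8) -> Result<Self, String> {
        if (Self::MIN..=Self::MAX).contains(&raw) {
            Ok(Self(raw))
        } else {
            Err(format!("pan speed {raw} out of range ({}..={})", Self::MIN, Self::MAX))
        }
    }

    pub fn value(&self) -> u8 {
        self.0
    }
}

/// Functions further up the stack accept the validated type, not a bare u8,
/// so "forgot to validate" is not a mistake anyone can make.
pub fn pan_left(speed: PanSpeed) -> Vec<u8> {
    // Hypothetical packet layout, purely for illustration.
    vec![0x81, 0x01, 0x06, 0x01, speed.value()]
}

fn main() {
    match PanSpeed::new(40) {
        Ok(speed) => println!("packet: {:?}", pan_left(speed)),
        Err(e) => eprintln!("rejected before it reached the camera: {e}"),
    }
}
```

Once the only way to obtain a `PanSpeed` is through its constructor, the compiler enforces the guardrail for every caller, which is exactly the kind of mistake-proofing poka-yoke describes.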

What Is Over-Engineering?

I often hear "you don't want to over-engineer something" or "that solution is over-engineered," but the term rarely has any real, universal meaning. Instead, it falls into the "I'll know it when I see it" category of mistakes. I think we can give it a more meaningful definition, at least as far as software engineering is concerned. For me, whether something is over-engineered depends on the level of abstraction you're working at:

  • At the lowest level, everything should be as fault-tolerant as possible. Don't let errors cause large, out-of-proportion problems when the issue can be caught and handled in a predictable way.
  • A step above, where a user interacts with the system's primitives, it makes sense to fail safely or securely. You might not know how your system will be used further up the chain, but you do know you can handle problems in a way that makes recovery easier and more predictable.
  • Finally, once you're a few layers up, it's okay to put on some training wheels. At this stage, you should know what the user or programmer is trying to do and can provide a set of constraints that keep operation within normal bounds. Preventative mechanisms at this level know enough about the desired outcome to provide meaningful, helpful guardrails against misuse.

Notice that I didn't specify where these layers are. It's all relative to the level at which you're working, but I do think at least three layers (a base, primitives, and an API) is a fairly common pattern throughout most systems. Over-engineering happens when this pattern is broken. You shouldn't put too many guardrails on an extremely low-level mechanism, because you may later find it too rigid. Likewise, you shouldn't have to worry about handling low-level faults inside a high-level API. This approach lets you focus on the right problems at the right stages, instead of engineering solutions to problems that should already have been solved in a lower, already-abstracted layer.
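
As a rough sketch of what these three layers can look like in code (the names and boundaries here are invented purely for illustration): the base layer reports faults as values, the primitive layer decides how to fail, and the top-level API constrains what callers can ask for in the first place.

```rust
// Illustrative layering only; the functions, registers, and limits are made up.

// Layer 1: the base reports faults as values instead of panicking,
// so the callers above can decide what a failure should mean.
fn read_sensor_raw(register: u8) -> Result<u16, String> {
    if register > 0x0F {
        return Err(format!("register 0x{register:02X} does not exist"));
    }
    Ok(0x0123) // stand-in for a real hardware read
}

// Layer 2: a primitive that fails safe, falling back to a known-good
// default instead of propagating every low-level fault upward.
fn read_temperature_celsius() -> f32 {
    match read_sensor_raw(0x04) {
        Ok(raw) => raw as f32 * 0.1,
        Err(_) => 20.0, // documented safe default when the sensor is unreachable
    }
}

// Layer 3: the API puts up guardrails, rejecting requests that can't be
// valid before any lower layer has to deal with them.
fn set_target_temperature(celsius: f32) -> Result<(), String> {
    if celsius < 10.0 || celsius > 30.0 {
        return Err(format!("target {celsius} °C is outside the supported range"));
    }
    // ...hand off to the layers below...
    Ok(())
}

fn main() {
    println!("current: {:.1} °C", read_temperature_celsius());
    if let Err(e) = set_target_temperature(95.0) {
        eprintln!("rejected at the API layer: {e}");
    }
}
```

Each layer only worries about the class of problem that belongs to it, which is the point of the pattern.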

In summary, I hope you've enjoyed this engineering approach to software design. I like thinking about problems this way because it gives greater depth to what is actually being solved and what still needs to be solved. I saw a tweet the other day saying that software engineers and physicists think they can first-principles their way to any solution, which is sometimes true. But a lot of well-thought-out ideas can be borrowed from adjacent fields. We don't have to learn everything the hard way; sometimes, it helps to slow down and learn.