Design for Failure
“Nothing is certain but death and taxes.” Device failure is just as certain. The questions are when it will happen and what the impact will be. Starting at a high level, we need to ask: what are the potential points of system failure?
- Communications – networks
- Processing – systems
- Devices
There are different levels of potential failure. For example, an electrical grid can fail at the national, regional, county, city, section or city-block level. Large-scale failures impact many systems and can have cascading effects. Small failures impact fewer systems, but the two kinds share significant similarities. Examining both macro and micro failures provides some perspective.
The following tables show how each source of failure affects three key system categories (power, processing and communications) and list possible mitigating solutions for each.
Macro View: System Wide
Source of Failure | Impact on Power | Impact on Processing | Impact on Communications |
Environment | Recovery strategy | Distribution and redundancy | Local processing and recovery strategy |
Human Error | Recovery strategy with process sequencing | User limits; behavior monitoring | User limits; behavior monitoring |
Malicious Behavior | Local power | Minimize failure rules; disaster recovery | Security, monitoring, alternative pathways and local processing |
Device Failures | Device independence | Duplicate data collection | Alternative pathways or local communication |
Micro View: One Device
Source of Failure | Impact on Power | Impact on Processing | Impact on Communications |
Environment | Redundancy | Minimize Impact, monitoring | Alternative Pathways |
Human Error | Test backup power | Data sanitization | Testing pathways |
Malicious Behavior | Redundancy | Behavior Monitoring | Alternative pathways |
Device Failures | N/A | Design/Monitoring | Communication Loss |
The solutions presented fall into the following categories:
- Redundancy – To prevent failure or impact
- Recovery – Reduce loss and recovery time
- Restrictions – Minimize user/system impact
- Monitoring – Identify issues quickly
- Local or edge computing – Mitigates communication loss and minimizes exposure
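As a rough illustration of how several of these categories can combine on a single device, the sketch below tries redundant communication pathways in order, records each failed attempt for monitoring, and falls back to a bounded local buffer that is replayed during recovery. The `ResilientSender` class, the `Pathway` callable type and the buffer size are hypothetical names chosen for this example; they do not come from any particular IoT stack.

```python
from collections import deque
from typing import Callable, Deque, List

# Hypothetical transport: takes a payload and returns True on success.
Pathway = Callable[[bytes], bool]


class ResilientSender:
    """Tries each pathway in order and buffers locally when all of them fail."""

    def __init__(self, pathways: List[Pathway], buffer_size: int = 100):
        self.pathways = pathways
        # Local (edge) store; when full, the oldest payload is dropped.
        self.buffer: Deque[bytes] = deque(maxlen=buffer_size)
        # Monitoring: a record of every failed attempt.
        self.failures: List[str] = []

    def _try_send(self, payload: bytes) -> bool:
        for index, pathway in enumerate(self.pathways):
            try:
                if pathway(payload):
                    return True
                self.failures.append(f"pathway {index} rejected payload")
            except OSError as exc:  # covers ConnectionError and TimeoutError
                self.failures.append(f"pathway {index} raised {exc!r}")
        return False

    def send(self, payload: bytes) -> bool:
        """Send now if any pathway works; otherwise keep the payload for recovery."""
        if self._try_send(payload):
            return True
        self.buffer.append(payload)
        return False

    def recover(self) -> int:
        """Replay buffered payloads in order, stopping at the first failure."""
        sent = 0
        while self.buffer:
            if not self._try_send(self.buffer[0]):
                break
            self.buffer.popleft()
            sent += 1
        return sent
```

When every pathway is down, `send` returns `False` and the payload stays in the buffer; calling `recover()` once a pathway is back replays the buffered payloads in their original order.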
Base Design Principles for Systems
The four key principles of design are security, privacy, control and accountability. Privacy requires both security and control, and control requires accountability.
These four principles break down into the following system design requirements:
- Validation – Allowed, verified and correct
- Redundancy – Multiple solution sets
- Monitoring – Information acquisition and identification of issues
- Restriction – Limiting use and access
- Mitigation – Minimizing impact or consequences
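To make the Validation requirement (allowed, verified and correct) more concrete, here is a minimal sketch for a single incoming device reading. It assumes a hypothetical registry of shared secrets, an HMAC-SHA256 signature over `device_id:value`, and an arbitrary valid range; none of these specifics come from the text above, and a real system would choose its own message format and key management.

```python
import hashlib
import hmac
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Reading:
    device_id: str
    value: float
    signature: str  # hex HMAC-SHA256 of "device_id:value" (assumed format)


# Hypothetical registry: device id -> shared secret.
REGISTERED_DEVICES = {"sensor-1": b"demo-secret"}
# Hypothetical plausible range for the measured value.
VALID_RANGE = (-40.0, 85.0)


def sign(device_id: str, value: float, key: bytes) -> str:
    """Compute the signature a registered device would attach to a reading."""
    message = f"{device_id}:{value}".encode()
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def validate(reading: Reading) -> List[str]:
    """Return the list of failed checks; an empty list means the reading is accepted."""
    key = REGISTERED_DEVICES.get(reading.device_id)
    if key is None:
        return ["not allowed"]  # unregistered device: stop here (restriction)
    failures = []
    expected = sign(reading.device_id, reading.value, key)
    # Constant-time comparison avoids leaking signature bytes through timing.
    if not hmac.compare_digest(expected, reading.signature):
        failures.append("not verified")
    low, high = VALID_RANGE
    if not (low <= reading.value <= high):
        failures.append("not correct")
    return failures
```

Returning the list of failed checks, instead of a single boolean, leaves room for the Monitoring requirement: each rejected reading can be reported with the reason it was rejected.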
The following chart shows how the four design principles translate into system services for each potential failure driver.
Failure Driver | Security | Privacy | Control/Rules | Accountability |
Servers | Validation | Restriction | Monitoring | Restriction |
Devices | Validation | Restriction | Monitoring | Restriction |
User Behavior | Restriction | Monitoring | Monitoring | Restriction |
Hostile Actions | Mitigation | Restriction | Monitoring | Restriction |
Communication | Validation | Restriction | Monitoring | Restriction |
Command and Control | Redundancy, Validation | Restriction, Validation | Restriction | Monitoring |
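One way to put a chart like this to work is to encode it as data, so a design review can query it instead of rereading the table. The sketch below transcribes the chart into a dictionary; the names `CHART`, `requirements_for` and `drivers_using` are hypothetical helpers introduced only for illustration.

```python
from typing import Dict, List, Tuple

PRINCIPLES: Tuple[str, ...] = ("Security", "Privacy", "Control/Rules", "Accountability")

# Each row lists the requirements per principle, in the order of PRINCIPLES.
CHART: Dict[str, Tuple[Tuple[str, ...], ...]] = {
    "Servers": (("Validation",), ("Restriction",), ("Monitoring",), ("Restriction",)),
    "Devices": (("Validation",), ("Restriction",), ("Monitoring",), ("Restriction",)),
    "User Behavior": (("Restriction",), ("Monitoring",), ("Monitoring",), ("Restriction",)),
    "Hostile Actions": (("Mitigation",), ("Restriction",), ("Monitoring",), ("Restriction",)),
    "Communication": (("Validation",), ("Restriction",), ("Monitoring",), ("Restriction",)),
    "Command and Control": (
        ("Redundancy", "Validation"),
        ("Restriction", "Validation"),
        ("Restriction",),
        ("Monitoring",),
    ),
}


def requirements_for(driver: str, principle: str) -> Tuple[str, ...]:
    """Look up the requirements the chart assigns to one driver under one principle."""
    return CHART[driver][PRINCIPLES.index(principle)]


def drivers_using(requirement: str, principle: str) -> List[str]:
    """List the failure drivers that rely on a requirement under a principle."""
    return [d for d in CHART if requirement in requirements_for(d, principle)]
```

For example, `drivers_using("Validation", "Security")` lists every failure driver whose security depends on validation, which is a quick way to see what is exposed if a validation service goes down.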
Architectural Design Components by Category
An advanced IoT architecture designed for failure includes the following components:
(This is a partial list. Many processes are used in multiple categories.)
Principle | Processes | Description |
Security | Distributed Processing Edge Processing Communications Verification Server Device User Relationship Management Operations Validation processes Monitoring processes Redundancy Restrictions Verification Mitigation processes |
Edge and distributed processing minimize and mitigate system failures related to power, communication and intrusions. Security means all operations are managed, restricted, controlled and monitored. Monitoring with reporting, event triggering, issue tracking and resolution is required to validate resolution processes. |
Privacy | Roles definition; Access management; User validation; User verification; Device registration |
Privacy includes:
|
Control | User management; Device registration; Device access; Policies and permissions; Limited user access; Restricted access; Device state control; Life cycle management; User monitoring; Device monitoring; Server monitoring; Clearly defined ownership |
Control in this section relates to user and device management, access and operation. Policies and permissions define access, management, relationships and device operations. All information, operations and processes must be managed and monitored. Device states help define access. |
Accountability | User; Device; Functions; Communications; Intrusions |
Accountability implies an action to stop or mitigate unwanted behavior. Unwanted behavior is detected through monitoring or analysis of activities. Accountability is required to ensure that the behavior is stopped or mitigated. |
Command and control | Backup processes; Disaster recovery; Migration processes; Server failure recovery; Server maintenance; Upgrade processes; Communication failure; Information relay; Telemetry management |
Command and control here relates to core systems and process planning. What are normal operations, and what is the plan for minimizing loss? How do you return to operations? |
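The Control row above notes that policies and permissions define access and that device states help define access. A minimal sketch of that idea follows, in which an action is permitted only when both the user's role and the device's current state allow it, and every decision is logged to support accountability. The roles, actions and states are hypothetical examples, not a prescribed set.

```python
from enum import Enum
from typing import Dict, List, Set, Tuple


class DeviceState(Enum):
    PROVISIONING = "provisioning"
    ACTIVE = "active"
    MAINTENANCE = "maintenance"
    RETIRED = "retired"


# Hypothetical policy: which actions each role may perform.
ROLE_PERMISSIONS: Dict[str, Set[str]] = {
    "owner": {"read", "configure", "retire"},
    "operator": {"read", "configure"},
    "viewer": {"read"},
}

# Hypothetical life cycle rule: which actions each device state accepts.
STATE_ACTIONS: Dict[DeviceState, Set[str]] = {
    DeviceState.PROVISIONING: {"configure"},
    DeviceState.ACTIVE: {"read", "configure", "retire"},
    DeviceState.MAINTENANCE: {"read", "configure"},
    DeviceState.RETIRED: set(),
}

# Every decision is recorded as (role, action, state, allowed).
audit_log: List[Tuple[str, str, str, bool]] = []


def is_allowed(role: str, action: str, state: DeviceState) -> bool:
    """Permit an action only if both the role and the device state allow it."""
    allowed = (
        action in ROLE_PERMISSIONS.get(role, set())  # unknown roles get nothing
        and action in STATE_ACTIONS[state]
    )
    audit_log.append((role, action, state.value, allowed))
    return allowed
```

Denying by default (unknown roles and retired devices permit nothing) keeps a misconfiguration from silently widening access, and the audit log gives monitoring something to analyze when unwanted behavior has to be traced.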