Design for Failure 

“Nothing is certain but death and taxes.” Like death and taxes, we know our devices will fail. The question is when, and what the impact will be. Starting from a high level, we need to ask: what are the potential components of system failure?

  • Communications – networks
  • Processing – systems
  • Devices 

There are different levels of potential failure. For example, an electrical grid can fail at the national, regional, county, city, section, or city block level. Large-scale failures impact many systems, with potential cascading effects. Small failures impact fewer systems, but there are significant similarities. Examining macro and micro failures provides some perspective.

The following tables consider how failures impact key system categories of power, processing capacities and communication systems in relation to the cause and possible mitigating solutions. 

Macro View: System Wide

Source of Failure    System Impact by Category
                     Power                     Processing                 Communications
Environment          Recovery strategy         Distributed and redundant  Local processing and recovery strategy
Human Error          Recovery strategy with    User limits, behavior      User limits, behavior monitoring
                     process sequencing        monitoring
Malicious Behavior   Local power               Minimize failure rules,    Security, monitoring, alternative
                                               disaster recovery          pathways and local processing
Device Failures      Device independence       Duplicate data collection  Alternative pathways or local communication


Micro View: One Device

Source of Failure    System Impact by Category
                     Power               Processing                    Communications
Environment          Redundancy          Minimize impact, monitoring   Alternative pathways
Human Error          Test backup power   Data sanitization             Testing pathways
Malicious Behavior   Redundancy          Behavior monitoring           Alternative pathways
Device Failures      N/A                 Design/monitoring             Communication loss

The solutions presented fall into the following categories:

  1. Redundancy – To prevent failure or impact
  2. Recovery – Reduce loss and time
  3. Restrictions – Minimize user/system impact
  4. Monitoring – Identify issues quickly
  5. Local or edge computing – Handle communication loss and minimize exposure
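
The categories above can be sketched in a few lines. The following is a minimal illustration, not a definitive implementation: it assumes a device that has several communication pathways (redundancy), reports failed attempts (monitoring), keeps readings locally when every pathway is down (local or edge handling), and resends them later (recovery). Restrictions are not shown. All names (`send_with_fallback`, `flush_buffer`, `Reading`, `Pathway`) are hypothetical and chosen only for this example.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical names throughout; this is not tied to any real IoT platform.
Reading = Dict[str, float]
Pathway = Callable[[Reading], bool]  # returns True when delivery is confirmed


def send_with_fallback(
    reading: Reading,
    pathways: List[Pathway],
    local_buffer: List[Reading],
    report: Optional[Callable[[str], None]] = None,
) -> str:
    """Try each pathway in order (redundancy); buffer locally if all fail."""
    for index, send in enumerate(pathways):
        try:
            if send(reading):
                return f"sent via pathway {index}"
            problem = "delivery not confirmed"
        except OSError as exc:
            problem = str(exc)
        # Monitoring: surface every failed attempt so issues are seen quickly.
        if report is not None:
            report(f"pathway {index} failed: {problem}")
    # Local or edge handling: keep the reading on the device for later recovery.
    local_buffer.append(reading)
    return "buffered locally"


def flush_buffer(local_buffer: List[Reading], send: Pathway) -> int:
    """Recovery: resend buffered readings in order; stop at the first failure."""
    sent = 0
    while local_buffer:
        try:
            delivered = send(local_buffer[0])
        except OSError:
            delivered = False
        if not delivered:
            break
        local_buffer.pop(0)
        sent += 1
    return sent
```

In this sketch a reading is never dropped silently: it is either delivered over some pathway or kept in the local buffer until a pathway works again.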

Base Design Principles for Systems

The four key principles of design are security, privacy, control and accountability. Privacy requires security, privacy requires control, and control requires accountability.

These four principles break down into the following system design requirements:

  • Validation – Allowed, verified and correct
  • Redundancy – Multiple solution sets
  • Monitoring – Information acquisition and identification of issues
  • Restriction – Limiting use and access
  • Mitigation – Minimizing impact or consequences
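
As an illustration of the validation requirement ("allowed, verified and correct"), the sketch below checks an incoming device message in three steps. It is an assumption-laden example, not a prescribed design: the device registry, the shared-secret HMAC scheme, the numeric payload format and the function names (`sign`, `validate`) are all hypothetical. Only the Python standard library `hmac` and `hashlib` modules are used.

```python
import hashlib
import hmac

# Hypothetical registry: device id -> shared secret (illustrative values only).
REGISTERED_DEVICES = {"sensor-01": b"example-secret"}


def sign(device_id: str, payload: bytes) -> str:
    """Helper for the example: sign a payload with the device's secret."""
    key = REGISTERED_DEVICES[device_id]
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def validate(device_id: str, payload: bytes, signature: str,
             low: float, high: float) -> str:
    # Allowed: is the device registered at all?
    key = REGISTERED_DEVICES.get(device_id)
    if key is None:
        return "rejected: device not allowed"
    # Verified: does the signature match the payload?
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected.encode(), signature.encode()):
        return "rejected: not verified"
    # Correct: is the payload a number inside the expected range?
    try:
        value = float(payload.decode("utf-8"))
    except (UnicodeDecodeError, ValueError):
        return "rejected: malformed value"
    if not (low <= value <= high):
        return "rejected: value out of range"
    return "accepted"
```

The order of the checks is a design choice in this sketch: the cheapest and most restrictive test (is the device allowed) runs first, so unregistered senders are rejected before any payload parsing happens.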

The following chart considers how the four design principles translate into system services in relation to potential failure drivers.

                     Security                Privacy                   Control/Rules             Accountability
Servers              Validation              Restriction               Monitoring                Restriction
Devices              Validation              Restriction               Monitoring                Restriction
Users                Behavior restriction    Monitoring                Monitoring                Restriction
Hostile Actions      Mitigation              Restriction               Monitoring                Restriction
Communication        Validation              Restriction               Monitoring                Restriction
Command and Control  Redundancy              Validation, Restriction   Validation, Restriction   Monitoring

Architectural Design Components by Category

An advanced IoT architecture designed for failure includes the following components:
(This is a partial list. Many processes are used in multiple categories.)

Each principle below is listed with its processes, followed by a description.

Principle: Security

Processes:

  • Distributed processing
  • Edge processing
  • Communications verification
  • Server
  • Device
  • User
  • Relationship management
  • Operations
  • Validation processes
  • Monitoring processes
  • Redundancy
  • Restrictions
  • Verification
  • Mitigation processes

Description:

Edge and distributed processing minimize and mitigate system failures related to power, communication and intrusions. Security means all operations are managed, restricted, controlled and monitored.

Monitoring with reporting, event triggering, issue tracking and resolution is required to validate resolution processes.

Principle: Privacy

Processes:

  • Roles definition
  • Access management
  • User validation
  • User verification
  • Device registration

Description:

Privacy includes:

  • Information access by roles
  • Access to devices by roles driven by states
  • Device relationship and interaction
  • Access to operations and functions based on roles
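
Role-based access driven by device state can be sketched as a small policy lookup. The example below is illustrative only: the roles, device states, operations and the `POLICY` table are hypothetical and would be defined by the actual system. It denies by default, so anything not explicitly granted is refused.

```python
from enum import Enum


class DeviceState(Enum):
    # Hypothetical life cycle states, chosen for illustration.
    REGISTERED = "registered"
    ACTIVE = "active"
    MAINTENANCE = "maintenance"
    RETIRED = "retired"


# Hypothetical policy table: (role, device state) -> operations allowed.
POLICY = {
    ("owner", DeviceState.ACTIVE): {"read", "configure"},
    ("owner", DeviceState.MAINTENANCE): {"read"},
    ("technician", DeviceState.MAINTENANCE): {"read", "configure", "update_firmware"},
    ("viewer", DeviceState.ACTIVE): {"read"},
}


def is_allowed(role: str, state: DeviceState, operation: str) -> bool:
    """Default deny: an operation is allowed only if the policy grants it."""
    return operation in POLICY.get((role, state), set())
```

With this table, an owner can configure an active device but only read it during maintenance, while a technician gains access only once the device enters the maintenance state.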

Principle: Control

Processes:

  • User management
  • Device registration
  • Device access
  • Policies and permissions
  • Limited user access
  • Restricted access
  • Device state control
  • Life cycle management
  • User monitoring
  • Device monitoring
  • Server monitoring
  • Clearly defined ownership

Description:

Control in this section relates to user and device management, access and operation. Policies and permissions define access, management, relationships and device operations. All information, operations and processes must be managed and monitored. Device states help define access.

Principle: Accountability

Processes:

  • User
  • Device
  • Functions
  • Communications
  • Intrusions

Description:

Accountability implies an action to stop or mitigate unwanted behavior. Unwanted behavior is detected through monitoring or analysis of activities. Accountability is required to ensure that the behavior is stopped or mitigated.
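
One way to picture the link between monitoring and accountability is a monitor that both detects unwanted behavior and acts on it. The sketch below is a simplified assumption, not the method described in this document: it treats "too many actions within a time window" as the unwanted behavior, blocks the actor as the mitigating action, and records each decision in an audit log. The class name `BehaviorMonitor` and its parameters are hypothetical.

```python
from collections import defaultdict, deque


class BehaviorMonitor:
    """Detect an actor exceeding a rate limit, then block and log it."""

    def __init__(self, max_actions: int, window_seconds: float):
        self.max_actions = max_actions
        self.window_seconds = window_seconds
        self._history = defaultdict(deque)  # actor -> recent timestamps
        self.blocked = set()
        self.audit_log = []  # (timestamp, actor, decision)

    def record(self, actor: str, timestamp: float) -> bool:
        """Return True if the action is permitted, False if it is refused."""
        if actor in self.blocked:
            self.audit_log.append((timestamp, actor, "denied: blocked"))
            return False
        history = self._history[actor]
        history.append(timestamp)
        # Drop actions that fall outside the monitoring window.
        while history and timestamp - history[0] > self.window_seconds:
            history.popleft()
        if len(history) > self.max_actions:
            # Accountability: act on the detection instead of only reporting it.
            self.blocked.add(actor)
            self.audit_log.append((timestamp, actor, "blocked: rate exceeded"))
            return False
        return True
```

A real system would likely use richer signals and a way to lift the block after review; the point here is only that detection is paired with an action and a record of that action.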

Principle: Command and control

Processes:

  • Backup processes
  • Disaster recovery
  • Migration processes
  • Server failure recovery
  • Server maintenance
  • Upgrade processes
  • Communication failure
  • Information relay
  • Telemetry management

Description:

Command and control here relates to core systems and process planning.

What are normal operations, and what is the plan for minimizing loss? How do you return to operations?