Design for Failure

“Nothing is certain but death and taxes.” Like death and taxes, device failure is certain. The questions are when it will happen and what the impact will be. Starting from a high level, we need to ask: what are the potential points of system failure?

Communications – networks
Processing – systems
Devices

There are different levels of potential failure. For example, an electrical grid can fail at the national, regional, county, city, section, or city block level. Large-scale failures impact many systems, with potential cascading effects. Small failures impact fewer systems, but they share significant similarities with large ones. Examining both macro and micro failures provides some perspective.

The following tables list, for each source of failure, possible mitigating solutions in three key system categories: power, processing, and communications.

Macro View: System Wide

Source of Failure	System Impact by Category
Source of Failure	Power	Processing	Communications
Environment	Recovery strategy	Distribution and redundancy	Local processing and recovery strategy
Human Error	Recovery strategy with process sequencing	User limits, behavior monitoring	User limits, behavior monitoring
Malicious Behavior	Local power	Minimize failure rules, disaster recovery	Security, monitoring, alternative pathways and local processing
Device Failures	Device Independence	Duplicate data collection	Alternative pathways or local communication

Micro View: One Device

Source of Failure	System Impact by Category
Source of Failure	Power	Processing	Communications
Environment	Redundancy	Minimize Impact, monitoring	Alternative Pathways
Human Error	Test backup power	Data sanitization	Testing pathways
Malicious Behavior	Redundancy	Behavior Monitoring	Alternative pathways
Device Failures	N/A	Design/Monitoring	Communication Loss

The solutions presented fall into the following categories:

Redundancy – To prevent failure or impact
Recovery – Reduce loss and time
Restrictions – Minimize user/system impact
Monitoring – Identify issues quickly
Local or edge computing – Mitigate communication loss and minimize exposure
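
As an illustration of how redundancy, recovery, monitoring and local processing can work together on a single device, the following is a minimal sketch, not a definitive implementation. It assumes each communication pathway is a function that returns True when a reading is delivered. The names ResilientSender, Pathway and max_buffered are hypothetical and do not come from any particular library.

```python
from collections import deque
from typing import Callable, Deque, List

# Hypothetical type: a pathway takes a reading and returns True if delivered.
Pathway = Callable[[dict], bool]


class ResilientSender:
    """Try each communication pathway in order; keep readings locally if all fail."""

    def __init__(self, pathways: List[Pathway], max_buffered: int = 1000):
        self.pathways = pathways  # Redundancy: alternative pathways
        # Local storage is bounded so a long outage cannot exhaust memory.
        # When the buffer is full, the oldest reading is dropped.
        self.buffer: Deque[dict] = deque(maxlen=max_buffered)
        self.failures = 0  # Monitoring: count of failed delivery attempts

    def _deliver(self, reading: dict) -> bool:
        for pathway in self.pathways:
            try:
                if pathway(reading):
                    return True
            except OSError:
                # Network errors such as ConnectionError are OSError subclasses.
                pass
            self.failures += 1
        return False

    def send(self, reading: dict) -> bool:
        """Queue the reading, then flush oldest first.

        Returns True only if the local buffer was fully drained.
        """
        self.buffer.append(reading)
        while self.buffer:
            if not self._deliver(self.buffer[0]):
                return False  # Recovery: readings stay buffered for a later retry
            self.buffer.popleft()
        return True
```

A caller would pass the pathways in order of preference, for example ResilientSender([primary, backup]), and could read the failures counter to raise an alert when a pathway degrades.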

Base Design Principles for Systems

The four key principles of design are security, privacy, control and accountability. Privacy requires both security and control, and control requires accountability.

These four principles break down into the following system design requirements:

Validation – Allowed, verified and correct
Redundancy – Multiple solution sets
Monitoring – Information acquisition and issue identification
Restriction – Limiting use and access
Mitigation – Minimizing impact or consequences
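
The requirements above can be sketched as a small gate in front of a device's command handler. This is only an illustration under stated assumptions: the command names, the target range, the per-sender budget and the CommandGate class are all hypothetical. Validation checks that a command is verified, allowed and correct; restriction caps how many commands one sender may issue; monitoring records every decision; and rejecting a bad command before it reaches the device is a simple form of mitigation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

# Hypothetical command set and limits, chosen only for illustration.
ALLOWED_COMMANDS = {"read", "set_target"}
TARGET_RANGE = (10.0, 30.0)


@dataclass
class CommandGate:
    known_senders: Set[str]
    max_commands: int = 3  # Restriction: budget of accepted commands per sender
    events: List[str] = field(default_factory=list)  # Monitoring: decision log
    counts: Dict[str, int] = field(default_factory=dict)

    def _reject(self, sender: str, reason: str) -> bool:
        self.events.append(f"rejected {sender}: {reason}")
        return False

    def submit(self, sender: str, command: str, value: float = 0.0) -> bool:
        # Validation, part 1: verified (the sender is known)
        if sender not in self.known_senders:
            return self._reject(sender, "unverified sender")
        # Validation, part 2: allowed (the command is on the allow list)
        if command not in ALLOWED_COMMANDS:
            return self._reject(sender, f"command {command!r} not allowed")
        # Validation, part 3: correct (the value is within the safe range)
        low, high = TARGET_RANGE
        if command == "set_target" and not (low <= value <= high):
            return self._reject(sender, f"value {value} out of range")
        # Restriction: limit how much any one sender can do
        used = self.counts.get(sender, 0)
        if used >= self.max_commands:
            return self._reject(sender, "command budget exhausted")
        self.counts[sender] = used + 1
        self.events.append(f"accepted {sender}: {command}")
        return True
```

In this sketch rejected commands do not consume the sender's budget. A stricter design could count them as well, so that repeated invalid requests are themselves restricted.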

The following chart shows how the four design principles translate into design requirements for each system service and potential failure driver.

	Security	Privacy	Control/Rules	Accountability
Servers	Validation	Restriction	Monitoring	Restriction
Devices	Validation	Restriction	Monitoring	Restriction
User Behavior	Restriction	Monitoring	Monitoring	Restriction
Hostile Actions	Mitigation	Restriction	Monitoring	Restriction
Communication	Validation	Restriction	Monitoring	Restriction
Command and Control	Redundancy Validation	Restriction Validation	Restriction	Monitoring

Architectural Design Components by Category

An advanced IoT architecture designed for failure includes the following components:
(This is a partial list. Many processes are used in multiple categories.)

Principle	Processes	Description
Security	Distributed Processing Edge Processing Communications Verification Server Device User Relationship Management Operations Validation processes Monitoring processes Redundancy Restrictions Verification Mitigation processes	Edge and distributed processing minimize and mitigate system failures related to power, communication and intrusions. Security means all operations are managed, restricted, controlled and monitored. Monitoring with reporting, event triggering, issue tracking and resolution is required to validate resolution processes.
Privacy	Roles definition Access Management User Validation User Verification Device registration	Privacy includes: information access by roles; access to devices by roles, driven by states; device relationship and interaction; access to operations and functions based on roles.
Control	User management Device registration Device access Policies and permissions Limited user access Restricted access Device state control Life cycle management User monitoring Device monitoring Server Monitoring Clearly defined ownership	Control in this section is related to user/device management, access and operation. Policies and permissions define access, management, relationships and device operations. All information, operations and processes must be managed and monitored. Device states help define access.
Accountability	User Device Functions Communications Intrusions	Accountability implies an action to stop or mitigate unwanted behavior. Unwanted behavior is detected through monitoring or analysis of activities. Accountability is required to ensure that unwanted behavior is stopped or mitigated.
Command and Control	Backup processes Disaster Recovery Migration processes Server Failure recovery Server Maintenance Upgrade Processes Communication failure Information relay Telemetry management	Command and control here relates to core systems and process planning. What are normal operations, and what is the plan for minimizing loss? How do you return to operations?
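
The privacy and control rows describe access to devices by roles, driven by device states. One way this could look in practice is a deny-by-default policy table keyed on role and state. The sketch below is an assumption-laden illustration: the roles, states, operations and the is_allowed function are hypothetical, and a real system would load its policies from managed configuration rather than a hard-coded dictionary.

```python
from enum import Enum
from typing import List, Tuple


class DeviceState(Enum):
    PROVISIONING = "provisioning"
    ACTIVE = "active"
    RETIRED = "retired"


# Hypothetical policy table: (role, device state) -> permitted operations.
# Any combination not listed here is denied.
POLICY = {
    ("installer", DeviceState.PROVISIONING): {"register", "configure"},
    ("owner", DeviceState.ACTIVE): {"read", "operate", "configure"},
    ("guest", DeviceState.ACTIVE): {"read"},
    ("owner", DeviceState.RETIRED): {"read"},
}

# Accountability: every decision is recorded, whether allowed or denied.
audit_log: List[Tuple[str, str, str, bool]] = []


def is_allowed(role: str, state: DeviceState, operation: str) -> bool:
    """Return True if the role may perform the operation in this device state."""
    allowed = operation in POLICY.get((role, state), set())
    audit_log.append((role, state.value, operation, allowed))
    return allowed
```

For example, is_allowed("guest", DeviceState.ACTIVE, "operate") is denied because the guest role is limited to reading, and an installer loses all access once a device leaves the provisioning state. The audit log gives monitoring and accountability processes a record to analyze.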