What are key features of reliability?
MIT’s John Carroll explains that “reliability represents an intersection of effectiveness, safety, and resilience…”. He cites the organizational and cultural attributes of high-reliability organizations (HROs) codified by Karl Weick and Kathleen Sutcliffe. The five attributes are:
• Preoccupation with Failure
• Reluctance to Simplify
• Sensitivity to Operations
• Commitment to Resilience
• Deference to Expertise
For more information on these concepts see their book Managing the Unexpected: Sustained Performance in a Complex World, 2015 edition.
Carroll also points out that in complex operations, no one person can be aware of every potential failure mode, because the big picture extends beyond any individual’s knowledge. Low-frequency, high-consequence events pose a particular problem: a given event may never have occurred before.
“Expecting a person to interpret and respond to a unique event in the moment is like blaming the goalie in soccer or ice hockey for every goal—reducing shots on goal is everyone’s job in a team sport, whereas the goalie is the last line of defense”, says Carroll.
Dr. Carroll examines lessons from process safety and the 2005 BP Texas City refinery disaster that killed 15 and injured 180. “Process safety hazards are often invisible and can involve combinations of multiple pieces of equipment, materials in process, human actions, and computer software that cannot be understood just by looking at the screen. Nor will everyone doing what is in the procedure manual necessarily avoid accidents, since procedures are frequently missing, incomplete, confusing or wrong,” he explains.
Kathleen Sutcliffe (Johns Hopkins University) writes that HROs make extensive use of detailed procedures. She notes that “the Diablo Canyon nuclear power plant had 4,303 separate, multistep written procedures, each one revised up to 27 times…”. While acknowledging that procedures are very important, she explains that “it is impossible to write procedures to anticipate all the situations and conditions… blind adherence [to procedures] is unwise if it reduces the ability to adapt or react swiftly to unexpected surprises.”
The tension between production goals and reliability
Peter Madsen (BYU) and Vinit Desai (University of Colorado, Denver) explain that incentives for efficiency and short-term profitability are ever-present while major incidents are rare. This can result in drift away from valuing reliability and safety until a major mishap occurs. The Deepwater Horizon disaster is an example of this dynamic. In some organizations, the importance of learning from events is judged by their impact on profit after a major incident takes place. HROs are on guard to manage conflicting production and reliability goals by continuously verifying that the “real goals of the organization are the same as public goals”.
Madsen explains that, because HROs rarely experience disasters firsthand, learning only from their own incidents yields relatively little. Organizations that seek high reliability must also learn from incidents and near misses at other companies and in other sectors around the world.
After the 1979 Three Mile Island accident, the nuclear power industry formed the Institute of Nuclear Power Operations (INPO), which facilitates broad-based learning that promotes reliability. INPO functions as a center for safety excellence that goes much further than compliance with regulations. After the Deepwater Horizon disaster, the President’s Oil Spill Commission recommended that the oil industry create its own INPO-type organization.
A best practice from another sector is the FAA’s Aviation Safety Reporting System (ASRS). The NASA-administered program receives more than 50,000 anonymous near-miss reports each year. Former National Transportation Safety Board chairman Chris Hart reports that the program has significantly improved the safety and reliability of operations in aviation.
Do regulators and networks need to worry about their reliability?
Paul Schulman (Mills College) and Emery Roe (University of California, Berkeley) discuss the role of safety regulators for reliability-seeking organizations and sectors. They observe that effective regulations and regulators can raise performance standards across a sector, which helps prevent rivals from undercutting safety to gain a competitive advantage. An effective regulator also shapes operational practices and culture regarding reliability.
The authors go on to explain that safety regulators should continuously examine and improve their own performance. Regulators need to evaluate the effectiveness of their regulations and inspection activities to see if they are promoting primarily minimum compliance rather than actual improvements in safety and reliability. The self-evaluation by regulators should also ask “to what degree might adversarial relations between regulators and organizations lead to formalization and rigidity of safety management?” Regulators should also ask if their approach results in reliance on a small set of compliance metrics that are retrospective rather than prospective.
Schulman and Roe argue that the challenge of reliability effectiveness “has been neglected by both regulatory agencies and their individual regulatees: the challenge is increasingly one of promoting the management of reliability across organizations, particularly interconnected organizations in networks [supply chain, etc.]. Networked reliability and risk are now in combination one of the most important challenges facing the understanding and pursuit of system reliability in the modern era.”
Reliability and sustainability
Nuclear power plants in the U.S. are most often viewed as examples of high-reliability organizations. In terms of near-term avoidance of major incidents and unplanned outages, their operations are reliable. However, if the time frame is expanded to include environmental sustainability, the chronic unsolved problem of nuclear waste generation and storage can force us to rethink what we mean by reliability. Schulman and Roe explain that “reliability analysis, cast on an extended time frame, must inevitably address risks and consequences that current ‘reliable’ operations are likely to impose on future generations.”