Category Archives: Failure Analysis
In DACI’s 1st Quarter 2012 newsletter I predicted that a catastrophic safety event would eventually occur due to lithium batteries (please see “Li-Ion Battery Pack Hazards and our Psychic Prediction“). The recent fires in the initial flights of the new Boeing Dreamliner have come close to fulfilling that prophecy.
From “Detecting Lithium-Ion Cell Internal Faults In Real Time” (Celina Mikolajczak, John Harmon, Kevin White, Quinn Horn, and Ming Wu, in the Mar 1, 2010 issue of Power Electronics Technology) it is known that internal cell faults in lithium batteries can lead to thermal runaway, subsequently resulting in fires and/or explosions. Therefore the question arises: do the Boeing lithium batteries have an advanced internal construction that prevents cell faults, or mitigates thermal runaway in the event of a fault? If not, the Boeing team or vendor responsible for the battery system design is in big, big, trouble.
Although deficiencies in basic battery chemistry and/or construction appear to offer the best root cause hypothesis for the fires, there are also other possible factors. For example, it has been reported that perhaps the charging system malfunctioned, causing the batteries to overheat. However, a properly designed charger for an aircraft application would have fail-safe protection, preventing an overcharge. Plus, it was also reported that charging sensors did not detect an overvoltage. Although these factors sound reassuring, they are not sufficient to eliminate the charger from consideration. For example, one can hypothesize a charging waveform that contains spurious high frequency oscillations that create high rms charging currents. This would not necessarily result in overvoltage, but could result in overheating.
It is also possible that battery “cell defects” are nothing more than cell imbalances that vary according to production tolerances. In other words, the lithium battery, by its very nature, tends towards thermal runaway unless the internal cells are very tightly matched. This sensitivity would become more pronounced with a higher number of cells and higher mass, which would explain why no explosions have occurred in small button-style batteries, but do occur in the larger batteries.
There are other scenarios, including the thorny possibility that some combination of conditions conspired to create the failure. And, of course, the root cause may be highly intermittent, making detection extremely difficult. Such hypotheses are undoubtedly being examined by the Boing engineers. I wish them well, and hope that they are allowed to perform their work calmly, methodically, and thoroughly.
Note: Because it may take quite a long time to conclusively establish a root cause, I would suggest that Boeing immediately begin planning to retrofit the lithium system with one containing battery types that have not shown the proclivity to explode; e.g. nickel metal-hydride, or sealed lead acid gel. Heavier, yes, but in this case safety and the economic timeline indicate that it would be wise to be prepared with a retrofit design.
(For some brief guidelines on design failure crisis management, please see Scenario #6: “Coping with Design Panic,” in The Design Analysis Handbook, Appendix A, “How to Survive an Engineering Project.”
Many years ago I was contacted by someone who worked for a very large defense contractor. The gentleman (Mr. X) had the responsibility for helping ensure that the electronics modules used by his company met stringent reliability requirements, one of which was a minimum allowable Mean Time Between Failures (MTBF). He had read one of our DACI newsletters that mentioned such reliability predictions, and gave me a call.
“My problem,” he said (I paraphrase), “is that MTBF predictions per Military Handbook 217 don’t make any sense.” He subsequently provided detailed backup studies, including a data collection — using real fielded hardware — that showed the predicted times to failure for the hardware did not match the field experience. The predicted numbers were not just too low (as some folks claim for MIL-HDBK-217), they were also too high, or sometimes about right. In other words, they were pretty random, indicating that MIL-HDBK-217 had no more predictive value than you would get by using a Magic 8 Ball.
But that’s not all. “These reliability predictions,” he continued, “are worse than useless, because engineering managers are cramming in heavy heat sinks, or using other cooling techniques, to drive down the MTBF numbers. The result is a potential decrease of overall system reliability, as well as increased weight and cost, based on this MTBF nonsense.”
Until I heard from Mr. X, I had prepared numerous MTBF reports using MIL-HDBK-217, assuming (what a horrible word, I’ve learned) that the methodology was science-based. After reviewing the data, however, I agreed with Mr. X that MTBFs were indeed nonsense, and said so in the DACI newsletter. This sparked a minor controversy, including being threatened by a representative of a reliability firm (one that did a lot of business with the government) that DACI would be “out of business” because of our stance on the issue.
Well, DACI survived. Today, though, and sadly, my impression is that lots of folks still use MIL-HDBK-217-type cookbook calculations for MTBFs, which are essentially a waste of money, other than the important side benefit (that has nothing to do with MTBF predictions) of examining components for potential overstress. But that task can be done as part of a good WCA, skipping all of the costly and misleading MTBF pseudoscience.
Instead of trying to predict reliability, it’s better to ensure reliability by employing “physics of failure,” the scientific process of studying the chemistry, mechanics, and physics of specific materials and assemblies.
Bottom line: Skip the handbook-style MTBF nonsense, and use those dollars instead to keep abreast of materials science, as applicable to your specific products. (If for some reason you absolutely must prepare an MTBF report, use a Magic 8 Ball: it will be much quicker and just as accurate.)
p.s. Prior to my education by Mr. X, I had been deeply involved with the electronics design for a very ambitious spacecraft project. Thinking MTBF to be an important metric, I asked the project manager what the preliminary MTBF was for the system. He smiled and asked me to meet him privately.
Later, alone in his office, I was furtively told that the MTBF calculations indicated that the system was doomed to failure, so it had been decreed that the project was not going to use MTBFs. The rationale was that each system component would be examined on a case-by-case basis to ensure that its materials and assembly were suitable for its intended task. In essence, this can be viewed as an early example of the physics of failure approach. And yes, the mission was a complete success.
P.S. IF YOU’RE DRIVING A TOYOTA, TURN OFF THE CRUISE CONTROL
(see “Toyota Sudden Acceleration Document Withheld From Feds: CNN” by Sharon Silke Carty in the 03/02/12 Huffington Post)
Valuable evidence may be contained in interconnected subassemblies that are initially assumed to be “innocent.” Disconnecting cables, unplugging circuits cards, and other disturbances can cause irretrievable loss of data.
Example: After its investigation of Toyota “sudden acceleration” incidents, the National Highway Traffic Safety Administration (as well as Toyota) claimed that the electronics were not at fault. Yet the NHTSA based its conclusions, at least in part, on a postmortem that was performed on a dismantled vehicle, thereby disturbing the accelerator system’s linkages with the cruise control system.
(For an earlier comment on this issue from our Engineering Thinking blog, please see “Toyota Unintended Acceleration: ‘No Electronics-Based Cause’: Not True & Misleading“)