Failures-in-time (FITs), DPPM and Probability Calculation Questions

johnb1

#1
Hello forum members, I am glad I found this very informative web-site. My name is John and I am a new member and this is my first post so I hope I have chosen the correct forum.

I performed an MTBF prediction on my electronics during the design phase of the project, but I have come to realize that it did not accurately predict the actual reliability of the hardware, i.e. the measured failures exceeded the predicted failures in time.

In the system, I have 22 components powered/running in parallel, with a published DPPM of 200 (40 DPPM for higher rated parts) and a published 9 FITs; any failure causes a catastrophic failure in the system. Each component has 4 die (88 per system) that could fail. I plan on having 5000 devices in the field (440000 die), and these devices will be powered/running 24/7. My dilemma is figuring out what formulas I can use to calculate the correlation between the DPPM and FIT data I have received, and how I can determine the actual reliability of this system.

Also, to prevent single-event catastrophic failures, we are adding a redundant component to act as a back-up in the event of a die/component failure. Can someone point me in the right direction for calculating the probability of two failures occurring in one system?

Thanks in advance, any help is much appreciated.
 

harry

Super Moderator
#2
Re: Failure in time (FITs), DPPM and Probability Calculation Questions

Some definitions / Terminology

DPPM: Defective Parts Per Million
MTBF: Mean Time Between Failures
FIT: Failures-in-Time. The number of failures per 1E9 (billion) device-hours
Confidence Level %: Statistical confidence level
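As a rough illustration of how these units relate, here is a minimal Python sketch. It assumes a constant failure rate (so that MTBF is simply the reciprocal of the failure rate), which is the usual assumption behind FIT figures but is not stated in the thread. The function names are made up for this example, and the 9 FIT and 200 DPPM inputs are the figures quoted in post #1.

```python
HOURS_PER_BILLION = 1e9  # FIT is defined per 1E9 device-hours


def fit_to_mtbf_hours(fit):
    """MTBF in hours for a constant failure rate given in FIT (assumed model)."""
    return HOURS_PER_BILLION / fit


def mtbf_hours_to_fit(mtbf_hours):
    """Failure rate in FIT for a given MTBF in hours (same assumption)."""
    return HOURS_PER_BILLION / mtbf_hours


def dppm_to_fraction(dppm):
    """Fraction of parts that are defective for a given DPPM."""
    return dppm / 1e6


if __name__ == "__main__":
    # 9 FIT corresponds to an MTBF of roughly 1.1e8 hours per component.
    print(f"MTBF at 9 FIT: {fit_to_mtbf_hours(9.0):.3e} hours")
    # 200 DPPM corresponds to a defective fraction of 0.0002.
    print(f"Defective fraction at 200 DPPM: {dppm_to_fraction(200):.4f}")
```

Note that the two metrics have different units: DPPM is a dimensionless fraction of parts, while FIT is a rate per unit time.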
 

johnb1

#3
Harry. Thanks for the reminder of the definitions. My system has a predicted MTBF of 650000 hours, but clearly it is not accurate. I have experienced component failures way in excess of that prediction/model. What I am now looking to do is correlate the DPPM at the component level (200) and introduce FITs (9) into that formula to get the worst-case scenario in terms of reliability. Also, we've added a redundant component to protect against chip failure, and I need to calculate the probability of a failure occurring on two chips at the same time. Can you direct me to the right place?
 

harry

Super Moderator
#4
"... My system has an MTBF of 650000 hours but clearly it is not accurate. I have experienced component failures way in excess of that prediction/model ..."
Did you use the Mil-Std to work out the MTBF? It is recognized that the Mil-Std is overly conservative. Commercial approaches favor the Bellcore/Telcordia (SR-332) standard because it is an improvement on the Mil-Std and closer to reality.

With regard to your other questions, I'll have to leave them to other Covers who have more knowledge in that area.
 

johnb1

#5
Harry. Thanks for your continued interest. Yes, I use MIL-HDBK-217 as a reference, but I mostly use software tools that automatically derive the failure rates, FITs, DPPM etc. The Handbook allows for conservative prediction models, but in reality it does not make good predictions for components that have e.g. floating gate logic or NAND etc. I have 50 components that have 200 DPPM and 9.93 FITs. We have added two additional components to act as a redundant back-up in the event of a failure. The question I have is: what is the probability that the redundant parts will also fail? Thanks, John.
 

w_grunfeld

#6
John,
It seems you are mixing up some metrics that are related but not identical.
There is no direct correlation between DPPM and reliability (expressed as FIT). DPPM is a production metric that counts every defect, some related to reliability and some not. The simplest example: a smeared marking is a defect and is counted in the DPPM, but it has no reliability meaning.
Most manufacturing defects are, at least theoretically, screened out by the various inspections and final test. In reliability it is assumed that all parts are defect-free at time t=0. FIT is a reliability metric that you can get from the manufacturer; it is calculated from the number of parts that failed during the manufacturer's life test, and the industry standard is to calculate the FIT at a 60% confidence level. Still, it is a lot more accurate than "predicting" it with any generic prediction method such as MIL-HDBK-217, Prism, 217+ or Telcordia.
As for the two parts added for redundancy, your question is too general for me to say which method is correct. Are these 2 components in active standby or static standby (kicking in upon failure)? In short, you need to draw up a reliability block diagram showing the redundant components with the series and parallel branches. Once you have that, there are numerous sources that will direct you to the correct way of calculating the system reliability. It is not the same as the "probability that both components will fail at the same time", as you imply.
See for instance: http://www.ece.cmu.edu/~koopman/des_s99/traditional_reliability/index.html
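To make the series/parallel idea concrete, here is a minimal Python sketch of the two basic block-diagram rules. It rests on several assumptions that the thread does not confirm: a constant failure rate, independent failures, active (hot) redundancy with both parts powered, and perfect failover. The one-year mission time and the function names are illustrative only. A cold-standby arrangement would need a different formula.

```python
import math


def reliability(fit, hours):
    """Survival probability over `hours` for a constant failure rate in FIT."""
    return math.exp(-fit * 1e-9 * hours)


def series(reliabilities):
    """Series branch: every block must survive."""
    return math.prod(reliabilities)


def parallel(reliabilities):
    """Active parallel branch: fails only if every block fails."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)


if __name__ == "__main__":
    one_year = 24 * 365                      # illustrative mission time, hours
    r_part = reliability(9.93, one_year)     # one part at the quoted 9.93 FIT
    r_chain = series([r_part] * 22)          # 22 parts, any failure is fatal
    r_pair = parallel([r_part, r_part])      # one part with an active spare
    print(f"single part : {r_part:.6f}")
    print(f"22 in series: {r_chain:.6f}")
    print(f"redundant pair unreliability: {1.0 - r_pair:.3e}")
```

The point of the sketch is that redundancy changes the system reliability through the parallel rule, which is a statement about both parts failing within the mission time, not about them failing at the same instant.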
 

johnb1

#7
Willy-
Thanks for your detailed response. I should be more specific. The vendor I speak of published a DPPM of 200 for their NAND flash. I don't know if they actually tested a million parts or if it was just a prediction/model; nevertheless I accept it. Any die failure causes a catastrophic event, and therefore a system failure. In my system I potentially have 880000 die (potential failures). Can I express DPPM as: 200/880000 = 0.00022 or 0.02%? I also know that the vendor has published a FIT of 9.93. Therefore in a billion hours of operation I can expect 9.93 failures; is this expression correct? Now, I may have 5000 devices in the field and I may run them 24/7, so I can expect the hours of operation to be 4.38e7. I accept the 9.93 FIT data, therefore I can expect to have 9.93 failures per billion hours of operation, which is approximately 42.6 failures over this time (1 year). Is this correct? Can I factor in the DPPM from the vendor as I have expressed it, or do I only use the FITs?
I have experienced 14 failures in a few weeks of operation in a sample set of 100 devices. Naturally there is concern, so we have moved towards our secondary supplier; nevertheless I am faced with predicting real-world data and not the vague predictions used by MIL-HDBK-217. Looking forward to your reply.
 

w_grunfeld

#8
Leave the DPPM out! Use only the FIT. I did not repeat your calculation, but it sounds right: number of expected failures = FIT x operating hours x number of ICs (not dies!), with the FIT taken as failures per 1E9 hours.
The real-life failure rate will likely be higher in any case, first because the FIT as given by the manufacturer is at 60% CL. Also, the manufacturer's FIT relates to components tested on a test board with the ICs plugged in and nothing else. In your system there are added failure modes, such as those caused by substandard solder joints or other manufacturing imperfections, handling (ESD damage) and others.
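For readers wondering what "FIT at 60% CL" means in practice, here is a minimal Python sketch of a one-sided upper confidence bound on a constant failure rate, computed from a Poisson model of the life-test failure count. This is the same bound usually written with a chi-square percentile. It is only a sketch: the vendor's actual test hours, acceleration factors, and method are not given in the thread, and the 1e8 device-hours below is a purely hypothetical figure.

```python
import math


def poisson_cdf(r, mu):
    """P(X <= r) for a Poisson count with mean mu."""
    term = math.exp(-mu)
    total = term
    for k in range(1, r + 1):
        term *= mu / k
        total += term
    return total


def upper_bound_expected_failures(r, confidence):
    """Mean mu at which observing r or fewer failures has probability 1 - confidence."""
    lo, hi = 0.0, 1.0
    while poisson_cdf(r, hi) > 1.0 - confidence:
        hi *= 2.0                      # widen the bracket until it contains the root
    for _ in range(100):               # plain bisection
        mid = 0.5 * (lo + hi)
        if poisson_cdf(r, mid) > 1.0 - confidence:
            lo = mid
        else:
            hi = mid
    return hi


def fit_upper_bound(failures, device_hours, confidence=0.60):
    """Upper bound on the failure rate, in FIT, at the given confidence level."""
    return upper_bound_expected_failures(failures, confidence) / device_hours * 1e9


if __name__ == "__main__":
    hypothetical_device_hours = 1e8    # assumption, not vendor data
    for failures in (0, 1, 2):
        fit = fit_upper_bound(failures, hypothetical_device_hours)
        print(f"{failures} failures -> {fit:.2f} FIT at 60% CL")
```

A 60% bound is not very demanding: under this model the true rate still has a 40% chance of being higher than the quoted figure, which is one reason field results can exceed the data sheet number.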
So I wouldn't rush to change suppliers before factoring in all the above. Finally, a few weeks of operation is not enough time to decide either way. Seeing whether or not the actual field results are compatible with the FIT data, and at what confidence, is a lot more complicated than just multiplying a few factors. You would need to use a Weibull analysis; consult www.weibull.com (Life data analysis) to understand the underlying theory.
Willy
 

Tim Folkerts

Super Moderator
#9
Let me try my hand at expanding on the good advice Willy has given.

DPPM would be used to predict the probability that the item works at t=0. If each part has a 200 ppm defect rate, then the odds that a set of 22 would all be good would be simply (1-200/1,000,000)^22 = 99.56%. So when you first build the items, you might expect about 0.44% to be bad before you even start.
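The t=0 calculation can be sketched in a few lines of Python. The only assumption is that defects occur independently from part to part; the function name is made up for this example.

```python
def prob_all_good(dppm, n_parts):
    """Probability that n_parts independent parts are all defect-free at t=0."""
    return (1.0 - dppm / 1e6) ** n_parts


if __name__ == "__main__":
    for dppm in (200, 40):             # the two DPPM figures quoted in post #1
        p_good = prob_all_good(dppm, 22)
        print(f"{dppm} DPPM: all 22 good = {p_good:.4%}, "
              f"at least one bad = {1.0 - p_good:.4%}")
```

For small defect rates the result is close to the simple product 22 x DPPM / 1E6, which is a handy sanity check.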

For your failure rate calculation, there are a couple things to look at.

  • I agree that there are 4.38e7 hours of operation for the 5000 devices in 1 yr, but you said you have 22 components/device, so there would be 4.38e7 hr * 22 = 9.6E8 component hr.
  • Whichever number you use, the next calculation seems wrong. 9.93e-9 failures/hr * 4.38e7 hr = 0.43 failures (not 43 failures). If I multiply by 22, I get 9.6 failures per year for 5000 devices. In either case, this is far fewer than the 14 failures in a few weeks for 100 devices that you observed.
You ought to check to see where and how the failures occur. As Willy pointed out, there are many other places for failures to occur besides in these components. If it really IS the chips, then they appear to be much worse than 9.93 FIT.
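Here is a minimal Python sketch of the arithmetic above, plus a rough check of how surprising 14 failures would be if the parts really ran at 9.93 FIT. It assumes a constant failure rate and a Poisson failure count. "A few weeks" is not quantified in the thread, so the 4-week figure is an assumption made purely for illustration.

```python
import math

FIT = 9.93                    # vendor figure quoted in the thread
HOURS_PER_YEAR = 24 * 365     # 8760 hours, running 24/7


def expected_failures(fit, units, hours_each):
    """Expected failure count = rate per hour x number of units x hours per unit."""
    return fit * 1e-9 * units * hours_each


def poisson_tail(k_min, mu, terms=200):
    """P(X >= k_min) for a Poisson count with mean mu, summed term by term."""
    term = math.exp(-mu) * mu ** k_min / math.factorial(k_min)
    total = 0.0
    k = k_min
    for _ in range(terms):
        total += term
        k += 1
        term *= mu / k
    return total


fleet_devices = expected_failures(FIT, 5000, HOURS_PER_YEAR)          # about 0.43
fleet_components = expected_failures(FIT, 5000 * 22, HOURS_PER_YEAR)  # about 9.6

ASSUMED_WEEKS = 4             # assumption: the thread only says "a few weeks"
sample_mean = expected_failures(FIT, 100 * 22, ASSUMED_WEEKS * 7 * 24)
p_14_or_more = poisson_tail(14, sample_mean)

if __name__ == "__main__":
    print(f"fleet, counting devices   : {fleet_devices:.2f} failures/yr")
    print(f"fleet, counting components: {fleet_components:.2f} failures/yr")
    print(f"100-device sample expects : {sample_mean:.4f} failures")
    print(f"P(14 or more failures)    : {p_14_or_more:.1e}")
```

Under these assumptions the chance of seeing 14 failures is vanishingly small, which supports the conclusion that either the failures come from somewhere other than these chips or the chips are far worse than the published FIT.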

Tim F
 

johnb1

#10
Willy/Tim-
Salient points indeed. OK, I'll leave DPPM out and use FITs. Then I'll re-calculate my numbers. I apologize for being cryptic at times. The vendor has corroborated the failure modes of the NAND and confirmed that their published FITs are not consistent with our findings. This is a painful subject on both sides, I can assure you. I will follow up with probability questions about the secondary back-up NAND in a later email. Thanks for your insight and feedback; I find this very valuable.
 