Failures-in-time (FITs), DPPM and Probability Calculation Questions


johnb1

Hello forum members, I am glad I found this very informative website. My name is John, I am a new member, and this is my first post, so I hope I have chosen the correct forum.

I performed an MTBF prediction on my electronics during the design phase of the project, but I have come to the realization that it did not accurately predict the actual reliability of the hardware, i.e., the measured failures exceeded the predicted failures in time.

In the system, I have 22 components powered/running in parallel, with a published DPPM of 200 (40 DPPM for higher-rated parts) and a published 9 FITs; any failure causes a catastrophic failure in the system. Each component has 4 die (88 per system) that could fail. I plan on having 5000 devices in the field (440000 die), and these devices will be powered/running 24/7. My dilemma is figuring out what formulas I can use to correlate the DPPM and FIT data I have received, and how I can determine the actual reliability of this system.

Also, to prevent single-event catastrophic failures, we are placing a redundant component to act as a back-up in the event of a die/component failure. Can someone point me in the right direction for calculating the probability of two failures occurring in the one system?

Thanks in advance, any help is much appreciated.
 

harry

Trusted Information Resource
Re: Failures-in-time (FITs), DPPM and Probability Calculation Questions

Some definitions / Terminology

DPPM: Defective Parts Per Million
MTBF: Mean Time Between Failures
FIT: Failures-in-Time. The number of failures per 1E9 (billion) device-hours
Confidence Level %: Statistical confidence level
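Under the usual constant-failure-rate assumption (none of the posts state it explicitly, so treat it as an assumption), these definitions translate into a couple of one-line conversions. A minimal Python sketch using the figures quoted later in this thread:

```python
# Sketch of the standard relationships between FIT, failure rate and MTBF,
# assuming a constant (exponential) failure rate.

FIT_SCALE = 1e9  # FIT = failures per 1e9 device-hours

def fit_to_failure_rate(fit):
    """Failure rate (lambda) in failures per device-hour."""
    return fit / FIT_SCALE

def fit_to_mtbf(fit):
    """MTBF in hours; MTBF = 1/lambda only for a constant failure rate."""
    return FIT_SCALE / fit

def expected_failures(fit, device_hours):
    """Expected failure count over the given cumulative device-hours."""
    return fit_to_failure_rate(fit) * device_hours

# Figures quoted in this thread:
print(fit_to_mtbf(9.93))                 # a single 9.93-FIT part: ~1.0e8 h MTBF
print(expected_failures(9.93, 4.38e7))   # ~0.43 failures over 4.38e7 device-hours
```

Note that MTBF = 1e9/FIT holds only under the constant-failure-rate assumption; wear-out mechanisms (such as NAND write endurance) violate it.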
 

johnb1

Harry, thanks for the reminder of the definitions. My system has a predicted MTBF of 650,000 hours, but clearly it is not accurate: I have experienced component failures well in excess of that prediction/model. What I am now looking to do is correlate the component-level DPPM (200) with the FIT figure (9) to get a worst-case reliability estimate. Also, we've added a redundant component to protect against chip failure, and I need to calculate the probability of a failure occurring on two chips at the same time. Can you direct me to the right place?
 

harry

Trusted Information Resource
.................. My system has an MTBF of 650000 hours but clearly it is not accurate. I have experienced component failures way in excess of that prediction/model.......................

Did you use the Mil-Std to work out the MTBF? It is recognized that the Mil-Std is overly conservative. Commercial practice favors the Bellcore/Telcordia (SR-332) standard because it improves on the Mil-Std and is closer to reality.

With regard to your other questions, I'll have to leave those to Covers who have more knowledge in that area.
 

johnb1

Harry, thanks for your continued interest. Yes, I use MIL-HDBK-217 as a reference, but mostly I use software tools that automatically derive the failure rates, FITs, DPPM, etc. The Handbook allows for conservative prediction models, but in reality it does not make good predictions for components that contain, e.g., floating-gate logic or NAND. I have 50 components that have 200 DPPM and 9.93 FITs. We have added two additional components to act as a redundant back-up in the event of a failure. The question I have is: what is the probability that the redundant parts will also fail? Thanks, John.
 

w_grunfeld

.................. We have added two additional components to act as a redundant back-up in the event of a failure. The question I have is what is the probability that the redundant parts will also fail ..................
John,
It seems you are mixing up some metrics that are related but not identical.
There is no direct correlation between DPPM and reliability (expressed as FIT). DPPM is a production metric that counts every defect, some related to reliability and some not. The simplest example: a smeared marking is a defect and is counted in the DPPM, but it has no reliability meaning.
Most manufacturing defects are, at least theoretically, screened out by the various inspections and the final test. In reliability it is assumed that all parts are defect-free at time t=0. FIT is a reliability metric that you can get from the manufacturer; it is calculated from the number of parts that failed during the manufacturer's life test, and the industry standard is to quote the FIT at a 60% confidence level. Still, it is a lot more accurate than "predicting" it with any generic prediction method such as MIL-HDBK-217, PRISM, 217Plus or Telcordia.
As for the two parts added for redundancy, your question is too general to answer with a specific method. Are these two components active standby or static standby (kicking in upon failure)? In short, you need to draw up a reliability block diagram showing the redundant components with the series and parallel branches. Once you have that, there are numerous sources that will direct you to the correct way of calculating the system reliability. It is not the same as the "probability that both components will fail at the same time", as you imply.
See for instance: http://www.ece.cmu.edu/~koopman/des_s99/traditional_reliability/index.html
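To make the block-diagram point concrete: for the simplest case of 1-out-of-2 active parallel redundancy with constant failure rates (my assumption here; a static-standby arrangement needs a different model, e.g. one including switch reliability), the system survives unless both parts fail. A sketch:

```python
import math

def reliability(fit, hours):
    """Survival probability of one part over `hours`, constant failure rate."""
    lam = fit / 1e9  # failures per device-hour
    return math.exp(-lam * hours)

def active_parallel_1oo2(fit, hours):
    """1-out-of-2 active redundancy: the system fails only if BOTH parts fail."""
    r = reliability(fit, hours)
    return 1.0 - (1.0 - r) ** 2

hours = 8760  # one year of 24/7 operation
print(reliability(9.93, hours))           # single part, one year
print(active_parallel_1oo2(9.93, hours))  # redundant pair, one year
```

With active redundancy the chance that a pair fails within the year drops from roughly 8.7e-5 (one part) to about 7.6e-9 (both parts), which is exactly why the series/parallel structure of the block diagram, not a single "both fail at once" probability, drives the system number.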
 

johnb1

Willy,
Thanks for your detailed response. I should be more specific. The vendor I speak of published a DPPM of 200 for their NAND flash. I don't know whether they actually tested a million parts or it was just a prediction/model; nevertheless, I accept it. I expect any die failure to cause a catastrophic event, and therefore a system failure. In my system I potentially have 880000 die (potential failures). Can I express DPPM as 200/880000 = 0.00022, or 0.02%? I also know that the vendor has published a FIT of 9.93; therefore, in a billion hours of operation I can expect 9.93 failures. Is this expression correct? Now, I may have 5000 devices in the field, and running them 24/7 I can expect the hours of operation to be 4.38e7. Accepting the 9.93 FIT figure, I can expect 9.93 failures per billion hours of operation, which is approximately 42.6 failures over this time (1 year). Is this correct? Can I factor in the vendor's DPPM as I have expressed, or do I only use the FITs?
I have experienced 14 failures in a few weeks of operation in a sample set of 100 devices. Naturally there is concern, so we have moved towards our secondary supplier; nevertheless, I am faced with predicting real-world behavior and not the vague predictions of MIL-HDBK-217. Looking forward to your reply.
 

w_grunfeld

.................. I can expect 9.93 failures per billion hours of operation, which is approximately 42.6 failures over this time (1 year). Is this correct? Can I factor in the vendor's DPPM as I have expressed, or do I only use the FITs? ..................

Leave the DPPM out! Use only the FIT. I did not repeat your calculation, but the approach sounds right: number of expected failures = (FIT / 10^9) × operating hours × number of ICs (not dies!).
The real-life failure count will likely be higher in any case: first, because the FIT as given by the manufacturer is at 60% CL. Also, the manufacturer's FIT relates to components on a test board with the ICs plugged in and nothing else. In your system there are added failure modes, such as those caused by substandard solder joints or other manufacturing imperfections, handling (ESD damage) and others.
So I wouldn't rush to change suppliers before factoring in all the above. Finally, a few weeks of operation is not enough time to decide either way. To see whether or not the actual field results are compatible with the FIT data, and at what confidence, is a lot more complicated than just multiplying a few factors. You would need to use a Weibull analysis; consult www.weibull.com (Life Data Analysis) to understand the underlying theory.
Willy
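As a rough illustration of Willy's last point, one can at least check whether the observed field failures are consistent with the vendor FIT using a Poisson model (my sketch, not Willy's method; it assumes a constant failure rate and uses the thread's numbers: 100 devices, 22 parts each at 9.93 FIT, roughly three weeks of 24/7 operation):

```python
import math

def poisson_tail(k, lam, terms=50):
    """P(X >= k) for a Poisson r.v. with mean lam, summed directly
    from the upper tail; adequate when lam is small and k is modest."""
    return sum(math.exp(-lam) * lam**i / math.factorial(i)
               for i in range(k, k + terms))

fit = 9.93
parts = 100 * 22          # 100 fielded devices, 22 components each
hours = 3 * 7 * 24        # ~3 weeks of 24/7 operation
lam = fit / 1e9 * parts * hours   # expected failures under the vendor FIT

print(lam)                    # ~0.011 expected failures in the window
print(poisson_tail(14, lam))  # probability of seeing 14 or more failures
```

The expected count over that window is about 0.011 failures, so observing 14 is essentially impossible under the 9.93-FIT assumption; the excess must come from somewhere else (assembly defects, ESD, infant mortality), which is Willy's point. A proper treatment of time-to-failure data would still use Weibull life-data analysis.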
 

Tim Folkerts

Trusted Information Resource
Let me try my hand at expanding on the good advice Willy has given.

DPPM would be used to predict the chance that the item works at t=0. If each part has a 200 ppm defect rate, then the odds that a set of 22 parts are all good is simply (1 - 200/1,000,000)^22 = 99.56%. So when you first build the items, you might expect roughly 0.4% to be bad before you even start.

For your failure rate calculation, there are a couple things to look at.

  • I agree that there are 4.38e7 hours of operation for the 5000 devices in 1 yr, but you said you have 22 components/device, so there would be 4.38e7 hr × 22 = 9.6e8 component-hr.
  • Whichever number you use, the next calculation seems wrong: 9.93e-9 failures/hr × 4.38e7 hr = 0.43 failures (not 43 failures). If I multiply by 22, I get 9.6 failures per year for the 5000 devices. Either way, the prediction is far below the 14 failures you saw in a few weeks on 100 devices.
You ought to check where and how the failures occur. As Willy pointed out, there are many other places for failures to occur besides these components. If it really IS the chips, then they appear to be much worse than 9.93 FIT.

Tim F
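Tim's two calculations can be reproduced numerically (a sketch; the 22-components-per-device count and the 200 DPPM / 9.93 FIT figures are taken from the thread):

```python
# Yield at t=0 implied by the 200 DPPM figure, for 22 parts per device
dppm = 200
parts_per_device = 22
yield_t0 = (1 - dppm / 1e6) ** parts_per_device
print(yield_t0)            # ~0.9956, i.e. roughly 0.44% of devices bad at t=0

# Expected failures per year across the fleet implied by the 9.93 FIT figure
fit = 9.93
devices = 5000
hours_per_year = 8760      # 24/7 operation
component_hours = devices * hours_per_year * parts_per_device  # ~9.6e8
failures_per_year = fit / 1e9 * component_hours
print(failures_per_year)   # ~9.6 expected failures per year, fleet-wide
```

Both results match Tim's arithmetic: a t=0 fallout below half a percent, and a fleet-wide prediction of under ten failures per year, which cannot be reconciled with 14 failures in a few weeks on only 100 devices.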
 

johnb1

Willy/Tim-
Salient points indeed. OK, I'll leave DPPM out and use FITs, then re-calculate my numbers. I apologize for being cryptic at times. The vendor has corroborated the failure modes of the NAND and confirmed that their published FITs are not consistent with our findings. This is a painful subject on both sides, I can assure you. I will follow up with probability questions about the secondary back-up NAND in a later post. Thanks for your insight and feedback; I find this very valuable.
 