What you need is to perform an attribute gage R&R.
I'm assuming your test is not destructive. If it is, you can still do R&R but it gets messier.
Take a look at Juran's Handbook, 5th edition, section 23.51 on "Measure of Inspector and Test Accuracy". This reading outlines a plan for comparing the results of two inspectors (human in the example, but they could just as easily be your machines).
You will need a quantity of parts spanning the range of defects.
You will also need a "check" inspector, most likely a human or the outside third-party inspector that D. Scott mentioned. This check inspector will examine each part under test and decide its "good" versus "bad" status.
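To give a feel for the comparison, here is a rough Python sketch that scores one inspector (or machine) against the check inspector's calls. This is my own illustration, not the procedure from Juran or from MINITAB: the function name, the "good"/"bad" labels, and the sample data are all made up, and the statistics (percent agreement, miss rate, false alarm rate, Cohen's kappa) are common choices for attribute agreement rather than anything specified above.

```python
def compare_to_check(inspector, check):
    """Score one inspector's good/bad calls against the check inspector.

    Both arguments are lists of "good"/"bad" strings, one entry per part,
    in the same part order. The check inspector is treated as the truth.
    """
    if len(inspector) != len(check):
        raise ValueError("both lists must cover the same parts")
    n = len(check)
    if n == 0:
        raise ValueError("need at least one part")

    pairs = list(zip(inspector, check))
    both_bad = sum(1 for i, c in pairs if i == "bad" and c == "bad")
    both_good = sum(1 for i, c in pairs if i == "good" and c == "good")
    missed = sum(1 for i, c in pairs if i == "good" and c == "bad")
    false_alarm = sum(1 for i, c in pairs if i == "bad" and c == "good")

    true_bad = both_bad + missed          # parts the check inspector called bad
    true_good = both_good + false_alarm   # parts the check inspector called good

    # Observed agreement and the agreement expected by chance alone.
    p_observed = (both_bad + both_good) / n
    p_insp_bad = (both_bad + false_alarm) / n
    p_check_bad = true_bad / n
    p_chance = p_insp_bad * p_check_bad + (1 - p_insp_bad) * (1 - p_check_bad)

    return {
        "agreement": p_observed,
        # None when the sample has no bad (or no good) parts to judge on.
        "miss_rate": missed / true_bad if true_bad else None,
        "false_alarm_rate": false_alarm / true_good if true_good else None,
        # Cohen's kappa: agreement corrected for chance.
        "kappa": (p_observed - p_chance) / (1 - p_chance) if p_chance < 1 else None,
    }


# Made-up data: 10 parts, 4 of them bad according to the check inspector.
check = ["bad"] * 4 + ["good"] * 6
machine = ["bad", "bad", "bad", "good", "good", "good", "good", "good", "good", "bad"]

result = compare_to_check(machine, check)
for name, value in result.items():
    print(name, value)
```

Run it once per machine against the same check results. A real study would also repeat each part more than once per inspector so you can look at repeatability, which this sketch does not cover.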
MINITAB software also has an attribute R&R routine as of release 13. It's about $700 off the shelf I think, but you can probably negotiate. There may be other software packages that do this.
If you really get hung up on the analysis, I could do it for you. I would want the results in some electronic format: an Excel spreadsheet or even a text file with good delimiters. My days of manual transcription from checksheets and charts are over, I hope!
