Any statistical method of data analysis must filter the noise out from the excessive variation, so the first step is to start right at the beginning and think about how you are going to design the study to reveal what you want to know.
For example, if you have any more than one operator carrying out the studies on the four arms, you will not be able to tell with any degree of certainty whether the differences you see can be attributed to the kit, the operator or anything else.
If you want to flush out bias, you will need an artefact or part of some sort with known values. This artefact will need to be stable during the studies, or once again you will encounter differences that you may not be able to explain.
You have not mentioned that the studies are intended to reveal the magnitude and type of variation you will encounter when measuring your real parts, so we needn't worry about making them representative. However, the environment may need to be monitored during the study in case some of the results reveal influences which you will need to explain.
You will need to securely clamp the artefact and the candidate arm and devise a program to measure it. Ideally, the key details of this program would be documented, such as the arm manufacturer, the software and revision, the clamping method, type of probe, calibration results, number of probings per feature, datuming method, etc. The operator will need to be familiar with the method beforehand to avoid the rehearsal effect during the study.
The feature of interest will need to be measured repeatedly by each arm, giving at least 20 individual measurements per arm. The measurements of the feature should be plotted in time order on an individuals and moving range (ImR) chart. If any measurements breach the upper or lower control limits calculated on this chart, then the measurement process is not consistent, and the measurement procedure should be examined and, if necessary, corrected (this is the part where the noise is being filtered).
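As a rough sketch of the chart calculation described above, the limits can be computed in a few lines of Python. The scaling factors 2.66 and 3.268 are the standard ones for ImR charts built on two-point moving ranges; the function names and the readings are made up for illustration and are not taken from any particular package.

```python
from statistics import mean

def imr_limits(values):
    """Centre lines and limits for an individuals and moving range chart."""
    if len(values) < 2:
        raise ValueError("need at least two measurements")
    # Moving range = absolute difference between successive measurements.
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    x_bar = mean(values)
    mr_bar = mean(moving_ranges)
    return {
        "x_bar": x_bar,
        "mr_bar": mr_bar,
        "x_ucl": x_bar + 2.66 * mr_bar,  # upper limit, individuals chart
        "x_lcl": x_bar - 2.66 * mr_bar,  # lower limit, individuals chart
        "mr_ucl": 3.268 * mr_bar,        # upper limit, moving range chart
        "moving_ranges": moving_ranges,
    }

def out_of_limits(values):
    """Return indices of measurements and moving ranges outside the limits."""
    lim = imr_limits(values)
    flagged_x = [i for i, v in enumerate(values)
                 if v > lim["x_ucl"] or v < lim["x_lcl"]]
    # Moving range k compares measurements k and k + 1, so report index k + 1.
    flagged_mr = [i + 1 for i, r in enumerate(lim["moving_ranges"])
                  if r > lim["mr_ucl"]]
    return flagged_x, flagged_mr

# Hypothetical readings from one arm (mm), with one disturbed measurement.
readings = [10.0, 10.1] * 10
readings[10] = 11.0
print(out_of_limits(readings))
```

Any index this reports is a prompt to go back and look at the measurement procedure, not a value to delete and forget.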
Finally, you should arrive at four ImR charts, each of which should contain consistent measurements. The average moving range on each chart, divided by 1.128, gives an estimate of the standard deviation that describes the repeatability, so you can compare this across the four arms. The average (X bar) on the individuals chart defines the location of the data, and if there is a known size on the artefact you can compare the average against it to assess bias.
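To illustrate that comparison, here is a minimal sketch that turns each arm's consistent measurements into a repeatability estimate and a bias. It assumes the usual 1.128 divisor for two-point moving ranges; the arm names, readings and reference value are hypothetical.

```python
from statistics import mean

D2_N2 = 1.128  # bias-correction factor for moving ranges of two values

def summarize_arm(values, reference=None):
    """Average, repeatability sigma and (optionally) bias for one arm."""
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    x_bar = mean(values)
    sigma_repeat = mean(moving_ranges) / D2_N2
    bias = None if reference is None else x_bar - reference
    return {"x_bar": x_bar, "sigma_repeat": sigma_repeat, "bias": bias}

# Hypothetical data: two of the four arms, artefact certified at 10.000 mm.
arms = {
    "arm_A": [10.002, 10.004, 10.001, 10.003, 10.002],
    "arm_B": [10.010, 10.006, 10.012, 10.007, 10.011],
}
for name, readings in arms.items():
    s = summarize_arm(readings, reference=10.000)
    print(name, round(s["sigma_repeat"], 4), round(s["bias"], 4))
```

These summaries only mean something once the ImR chart for that arm shows no points outside its limits, and a real study would use at least 20 readings per arm rather than the five shown here.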
From this point, if you want to properly compare the characteristics while taking the noise into consideration, I suggest you look at the Quality Digest (www.qualitydigest.com) articles "When are instruments equivalent?" Part 1 (May 2019) and Part 2 (June 2019) by Dr Donald J Wheeler and James Beagle III. These articles contain simple formulas and example calculations to compute all you need.
Notice that none of what I have said compares the outcomes against tolerance. You can of course do this, but only once you know that the differences you see are not noise.
Hope this helps.