Hi Adam
All I ended up doing was calculating agreement by counts.
3 staff view an item (or image in your case) a total of 3 times each.
They can record a maximum of 9 faults in any named category for an item (3 staff x 3 viewings). So the headline figure for agreement is (number of viewings that agree) / (number of viewings).
(9 faults of type A) / (9 views) = 100%
(8 faults of type A) / (9 views) = 88.9% (one person found fault A on 2 out of 3 viewings).
Of course, the problem with this is that you can never have less than 50% agreement in any category: if fewer than half of the viewings record the fault, then the majority agree that it is absent.
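A minimal Python sketch of the headline figure, for anyone who would rather not do it in Excel. It assumes that "agree" means the larger of the two groups (viewings that recorded the fault versus viewings that did not), which is my reading of the counts above rather than a formal definition. The function name is made up for illustration.

```python
def headline_agreement(fault_count, views=9):
    """Fraction of viewings on the majority side for one fault category.

    fault_count: how many of the viewings recorded the fault.
    views: total viewings of the item (3 staff x 3 viewings = 9 here).
    Assumption: viewings that did NOT record the fault agree with each other.
    """
    if not 0 <= fault_count <= views:
        raise ValueError("fault_count must be between 0 and views")
    agree = max(fault_count, views - fault_count)
    return agree / views


for k in (9, 8, 5, 4, 0):
    print(k, f"{headline_agreement(k):.1%}")
# 9 100.0%
# 8 88.9%
# 5 55.6%
# 4 55.6%
# 0 100.0%
```

With 9 viewings the lowest value this can return is 5/9 (about 55.6%), which is the floor problem described above.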
So I defined other statistics based upon the same basic idea.
Within-staff agreement = (number of items with total agreement) / (number of items viewed)
Staff #1 viewed 100 items. 95 items were judged as having either no faults, OR the same fault on all three viewings. Within-staff agreement = 95/100 = 95%
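The same calculation as a rough Python sketch. It assumes each viewing is recorded as a single fault label, or None for no fault, and that an item counts as agreed only when all three recordings are identical. The data layout and the function name are invented for the example, and the sample data is made up to reproduce the 95/100 case.

```python
def within_staff_agreement(viewings_by_item):
    """Fraction of items on which one member of staff was fully consistent.

    viewings_by_item: one tuple per item holding that person's recordings,
    e.g. ("A", "A", "A") or (None, None, None). None means "no fault".
    """
    if not viewings_by_item:
        raise ValueError("need at least one item")
    # An item is consistent when every viewing gave the same recording.
    consistent = sum(1 for v in viewings_by_item if len(set(v)) == 1)
    return consistent / len(viewings_by_item)


# Made-up data: 60 items with no fault, 35 with fault A every time,
# and 5 where fault A was seen on only 2 of the 3 viewings.
items = (
    [(None, None, None)] * 60
    + [("A", "A", "A")] * 35
    + [("A", None, "A")] * 5
)
print(within_staff_agreement(items))  # 0.95
```

If an item can carry several fault categories at once, each recording could be a frozenset of categories instead of a single label; the comparison works the same way.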
You get the basic idea. Measure agreement by number of items, or number of faults in each category, then divide by the maximum counts you could get, either faults, or items. With a bit of thinking, you can make sense of what the various statistics mean and decide which ones are most appropriate for your purpose.
If you really want to go to town on this, generate a matrix of agreement between every pair of viewing rounds, here for 3 staff and 3 views (A, B, C = staff; 1, 2, 3 = view number):
A1   1
A2   .96  1
A3   .94  .98  1
B1   .78  .78  ...  1
B2   .76  .81  ...  ...  1
B3   .59  .59  ...  ...  ...  1
C1   .88  .89  ...  ...  ...  ...  1
C2   .91  .96  ...  ...  ...  ...  ...  1
C3   .82  .84  ...  ...  ...  ...  ...  ...  1
     A1   A2   A3   B1   B2   B3   C1   C2   C3
(... = value not shown)
This will allow you to see exactly who agrees with whom and by how much.
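A matrix like this can be built with a few lines of Python instead of by hand. The sketch below assumes agreement between two rounds is the fraction of items given the same recording in both, which is one reasonable choice and not necessarily the one used for the figures above. The round labels, the data, and the function name are all made up.

```python
def pairwise_agreement(rounds):
    """Lower-triangular agreement between every pair of viewing rounds.

    rounds: dict mapping a round label (e.g. "A1") to a list of recordings,
    one per item, in the same item order for every round.
    Returns (labels, matrix), where matrix[(row, col)] is the fraction of
    items with identical recordings in both rounds.
    """
    labels = list(rounds)
    n_items = len(rounds[labels[0]])
    if n_items == 0 or any(len(r) != n_items for r in rounds.values()):
        raise ValueError("every round must cover the same non-empty item list")
    matrix = {}
    for i, row in enumerate(labels):
        for col in labels[: i + 1]:  # lower triangle, including the diagonal
            matches = sum(x == y for x, y in zip(rounds[row], rounds[col]))
            matrix[(row, col)] = matches / n_items
    return labels, matrix


# Made-up recordings for 4 items; None means "no fault".
rounds = {
    "A1": ["A", None, "B", None],
    "A2": ["A", None, "B", "A"],
    "B1": [None, None, "B", None],
}
labels, matrix = pairwise_agreement(rounds)
for i, row in enumerate(labels):
    cells = "  ".join(f"{matrix[(row, col)]:.2f}" for col in labels[: i + 1])
    print(row, cells)
# A1 1.00
# A2 0.75  1.00
# B1 0.75  0.50  1.00
```

With 12 staff and 3 viewings this gives a 36 x 36 triangle, which is the part that gets painful to maintain by hand.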
It's a lot of work to do manually. In my case, I can't get all of the data onto one Excel worksheet: I only have 256 columns and I have over 300 columns of data. (By the time I have finished, there will be 12 staff, 3 viewings, 10 fault categories and 50 items, with all data entered by me, all items with removable numbers, randomised for each round of viewing, and all calculations created manually in Excel. At the moment I'm having bad dreams about being attacked by fault categories and COUNTIF functions!)
The above method is crude and not based on any statistical theory, but it does summarise concisely. You will find that you need to make a lot of use of the COUNTIF function.
I would post you part of my spreadsheet, but I'm not at work at the moment and in any case, it's a bit of a mess (making it up as you go along, you know the sort of thing).
Hope this helps.
NC