Reply to Thread

Message:: [QUOTE="dd27, post: 2624261, member: 80416"]No answer yet from PCGS, but let me explain why a formal inter-rater reliability study is important. First one needs to determine what one means by "accurate". Is accuracy some sort of pre-determined '<a href="http://idioms.thefreedictionary.com/gold+standard" target="_blank" class="externalLink ProxyLink" data-proxy-href="http://idioms.thefreedictionary.com/gold+standard" rel="nofollow">gold standard</a>'?[1] Or is the "percent agreement" between independent raters? Of course, there is no definitive gold standard when it comes to grading coins. As others have pointed out, it is subjective to some extent, even with guidelines or standards; it is more art than science. Along those lines, one should specify which guidelines one uses, e.g., <a href="http://www.us-coin-values-advisor.com/grading-coins.html#70Point" target="_blank" class="externalLink ProxyLink" data-proxy-href="http://www.us-coin-values-advisor.com/grading-coins.html#70Point" rel="nofollow">ANA Grading Standards</a> [2], or the <a href="http://www.pcgs.com/grades/" target="_blank" class="externalLink ProxyLink" data-proxy-href="http://www.pcgs.com/grades/" rel="nofollow">PCGS grading standards</a>. One could develop an acceptable gold standard, by, for example, soliciting nominations and votes from numismatic organizations, coin clubs, coin dealers, the TPGs, and coin collectors, regarding the best coin graders (who are not affiliated with any of the major TPGs). A group of 15 such experts could independently grade coins in groups of 3, i.e., three graders per coin, and then either average the grades, or the three graders could discuss their grades and agree on a consensus grade. This grade would then become the gold standard for that particular coin, which would then be submitted to each of the TPGs under study. Such a research project would then require a statistical analysis of accuracy by comparing TPGs' grades vis-a-vis the gold standard grade. Or one could evaluate the degree of congruence (agreement) between two (or more) independent raters. Either way determining how to measure accuracy is not easy. For example, one would first have to determine the appropriate type of statistical analysis, interrater reliability (IRR) or interrater agreement (IRA). As LeBreton & Senter (2008, pp. 816-817) explain: <blockquote>"IRR [interrater reliability] refers to the relative consistency in ratings provided by multiple judges of multiple targets. Estimates of IRR are used to address whether judges rank order targets in a manner that is relatively consistent with other judges. The concern here is not with the equivalence of scores but rather with the equivalence of relative rankings. In contrast, IRA [interrater agreement] refers to the absolute consensus in scores furnished by multiple judges for one or more targets. Estimates of IRA are used to address whether scores furnished by judges are interchangeable or equivalent in terms of their absolute value.The concepts of IRR and IRA both address questions concerning whether or not ratings furnished by one judge are ‘‘similar’’ to ratings furnished by one or more other judges. These concepts simply differ in how they go about defining inter-rater similarity. Agreement emphasizes the interchangeability or the absolute consensus between judges and is typically indexed via some estimate of within-group rating dispersion. Reliability emphasizes the relative consistency or the rank order similarity between judges and is typically indexed via some form of a correlation coefficient. Both IRR and IRA are perfectly reasonable approaches to estimating rater similarity; however, they are designed to answer different research questions. Consequently, researchers need to make sure their estimates match their research questions." [3] </blockquote>Or, as Gisev, Bell, & Chen (2013, p. 330) note: <blockquote>"Interrater agreement indices assess the extent to which the responses of 2 or more independent raters are concordant. Interrater reliability indices assess the extent to which raters consistently distinguish between different responses. A number of indices exist, and some common examples include Kappa, the Kendall coefficient of concordance, Bland-Altman plots, and the intraclass correlation coefficient. Guidance on the selection of an appropriate index is provided. In conclusion, selection of an appropriate index to evaluate interrater agreement or interrater reliability is dependent on a number of factors including the context in which the study is being undertaken, the type of variable under consideration, and the number of raters making assessments." [4]</blockquote> To complicate matters, in some cases even if one simply wants to calculate the extent of agreement between independent raters, one might nonetheless use an IRR analysis because the 70-point grading scale would be considered analogous to a continuous variable, even though it is technically an ordinal variable. (See Categorical and Continuous Variables, near the bottom of the page, at <a href="https://statistics.laerd.com/statistical-guides/types-of-variable.php" target="_blank" class="externalLink ProxyLink" data-proxy-href="https://statistics.laerd.com/statistical-guides/types-of-variable.php" rel="nofollow">Types of Variables</a>.) My point in bringing in this academic stuff is to highlight the complexities involved in establishing a reliable accuracy statistic. Even if a company wants to conduct an internal evaluation of grader consistency and accuracy, it requires careful planning, knowledge of research methodology (or program evaluation methodology, which is similar), and statistical analysis. The best guide to the statistical analysis is the <a href="https://www.amazon.com/Handbook-Inter-Rater-Reliability-Definitive-Measuring/dp/0970806280/" target="_blank" class="externalLink ProxyLink" data-proxy-href="https://www.amazon.com/Handbook-Inter-Rater-Reliability-Definitive-Measuring/dp/0970806280/" rel="nofollow">Handbook of Inter-Rater Reliability: The Definitive Guide to Measuring the Extent of Agreement Among Raters (4th Ed.)</a>. [5] Footnotes1. gold standard - "...a well-established and widely accepted model or paradigm of excellence by which similar things are judged or measured." <a href="http://idioms.thefreedictionary.com/gold+standard" target="_blank" class="externalLink ProxyLink" data-proxy-href="http://idioms.thefreedictionary.com/gold+standard" rel="nofollow">Farlex Dictionary of Idioms</a> 2. Bressett, K. E. & Bowers, Q. D. (2006). The official American Numismatic Association grading standards for United States coins (6th Ed.). Atlanta, GA: Whitman Publishing. [ISBN-13: 978-0794819934] 3. LeBreton, J. M., & Senter, J. L. (2008). Answers to 20 questions about interrater reliability and interrater agreement. Organizational Research Methods, 11(4), 815-852. 4. Gisev, N., Bell, J. S., & Chen, T. F. (2013). Interrater agreement and interrater reliability: key concepts, approaches, and applications. Research in Social and Administrative Pharmacy, 9(3), 330-338. 5. Gwet, K. L. (2014). Handbook of Inter-Rater Reliability: The Definitive Guide to Measuring the Extent of Agreement Among Raters (4th Ed.). Gaithersburg, MD: Advanced Analytics [ISBN: 978-0970806284][/QUOTE]

Create my FREE Account!

Log in or Sign up

Reply to Thread

Useful Searches