How Good is CoinTalk at Grading?

Discussion in 'US Coins Forum' started by physics-fan3.14, Jul 27, 2019.

  1. physics-fan3.14

    physics-fan3.14 You got any more of them.... prooflikes? Supporter

    I'm a huge believer in the idea of "the Wisdom of the Crowds." If you look at the statistics, crowdsourced projects are often more efficient, more accurate, more profitable, and yield better products. I'm a strong believer in the idea of Wikipedia - if you compare the average number of errors and the depth of detail in Wikipedia articles with those of a standard published encyclopedia, Wikipedia usually has fewer errors and more details than a standard peer-reviewed "authoritative reference." For a fascinating read and a strong argument for crowdsourcing over traditional authority, I highly recommend the book "Wisdom of Crowds", available here:

    All this got me to thinking... does this apply to coin grading as well? The recent NGC initiative where they'd offer a grade opinion based on a photo on Ebay really stuck with me. After all, this is literally what we do here on CoinTalk. How good are we compared to the "authority"?

    I looked at Guess the Grade threads on the US coins forum since April. I ruled out anything with fewer than 5 posts, and any details-graded coins, since those don't have standard grades. Turns out, there have been 31 GTG posts since April. So here's what I found.

    There were 6 GTGs with circulated coins, with 117 total guesses. This sample size is quite small, and will require revisiting in a few months to get more statistically reliable results. However, the preliminary results are as follows: the average TPG grade was 42.67, and the average CoinTalk grade was 47.26. On average, CT overgraded by 4.6 points (in circulated grades, this generally corresponds to one increment higher - for example, 45 vs. 40). Again, this sample size is too small to give reliable data. The widest spread was over 10 points off.

    I was actually rather surprised at the amount of error in the circulated grades, since these grades are largely determined by amount of wear. There are standardized references (such as Photograde, and the ANA grading guide) which show pictures of each coin in each grade. I am curious to see if this difference between CT and the TPGs changes with a larger sample size.

    For uncirculated coins, there were 25 GTGs with 369 total guesses. The average TPG grade was 64.7, the average CoinTalk grade was 64.26. This gives a difference of -0.45, meaning on average, CoinTalk slightly undergraded the coins posted. This ranged from a 0.03 difference (meaning CoinTalk nailed the TPG grade almost exactly) to a -3.15 difference, meaning CoinTalk significantly undergraded the coin compared to the TPG.

    I also broke this down into the most common types of coins for GTGs - Morgans, Buffalos, and Walkers. The results were fairly consistent among the types, although the Buffalo did have the biggest difference. I need more data to refine this area, and will post more details in my next update.

    The standard deviation of the error in the uncirculated coins between CT and the TPGs is 1.29. That means that the majority of CT guesses are within about a point of the TPGs. On average, CoinTalk was about a one point difference from the TPG - an acceptable margin in my opinion.
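
    The arithmetic behind figures like these can be sketched in a few lines. This is only an illustration with made-up numbers: the `threads` data and the `thread_error` helper are hypothetical, and it assumes the error for each GTG thread is the average CoinTalk guess minus the TPG grade, which may not be exactly how the numbers in the post were computed.

```python
from statistics import mean, stdev

# Hypothetical data: one (TPG grade, CoinTalk guesses) pair per GTG thread.
threads = [
    (65, [64, 65, 65, 66, 64]),
    (64, [63, 64, 62, 64, 63]),
    (66, [65, 66, 64, 65, 65]),
]

def thread_error(tpg_grade, guesses):
    """Signed crowd error: average guess minus TPG grade (negative = undergraded)."""
    return mean(guesses) - tpg_grade

# One signed error per thread, then the average and the spread across threads.
errors = [thread_error(tpg, guesses) for tpg, guesses in threads]
avg_error = mean(errors)       # overall over/undergrading tendency
error_spread = stdev(errors)   # sample standard deviation of the per-thread errors

print(f"average error: {avg_error:+.2f}, standard deviation: {error_spread:.2f}")
```

    A negative average means the crowd undergraded on balance, while the standard deviation describes how much individual threads scatter around that average.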

    What we find from this (admittedly small sample size) is that CoinTalk is generally fairly accurate in its grading. In many threads, there is a wide variance between the highest and the lowest grades, but on average we're pretty close. Some variables which will affect the CoinTalk grades include: photo quality, hidden surface issues, differences of opinion in eye appeal, grader experience, and difficulty in grading certain types of coins.

    There are a couple of lessons we can learn from this, especially for newer graders. I highly encourage all of you to participate in the guess the grade threads. You will learn to calibrate your grade to the accepted standard authority. However, if you find yourself more than about a point different from the TPG, you need to figure out why. Is there something you missed? Is there something that you graded too harshly? Is there something on which you have a difference of opinion? Do you find a coin attractive or unattractive, and weigh that too heavily compared to the objective qualities of the coin? These guess the grade threads are some of the most valuable threads on CoinTalk for these reasons - learning to grade coins is an essential skill for the collector. That is why I always try to explain and justify my grade rather than just throwing out a number - and I encourage you all to do the same in the interest of education.

    For the people who post GTG threads, there are also a few takeaways. Always strive to have the best pictures possible. Often, after several posts complaining of poor pictures, the poster will retake pics and the grades will shift significantly. This can be avoided, and yield more accurate results, if the pictures you post are high quality to begin with. I know you're excited to share your coin. I know you want to post the thread, and you're impatient. I get it, I do the same. But you also know that those pictures could be better... take the best pics you can before you post the thread and you'll get much better guesses.

    Also, you'll get a lot more guesses if you post a poll. The average number of guesses with a poll versus without is sometimes an order of magnitude higher - many, many people will click on a poll and never post a thing, rather than have to actually post a guess. The average grade guessed on these threads is usually more accurate. Give a wide range above and below your coin to offer an adequate variance of responses - but offering a poll generally gives much more accurate data. @Lehigh96 is particularly good at this, but few other posters regularly offer a poll on their GTGs. I realize now that we should.

    From a purely numerical perspective, CoinTalk does pretty well. I intend to continue this analysis for the next few months, and will update the boards once I have a larger sample size (I did this same study a couple of years ago, but didn't post it - the results were similar).

    If you want to learn how to grade more accurately, or understand the grading process, CoinTalk members can recommend some books and resources to assist your learning. You only have to ask.

    So, thoughts?

    CT vs TPG.jpg
  3. messydesk

    messydesk Well-Known Member

    I think I'd like to see this presented with the grading scale normalized to 1 point per grading step. That is to say, guessing XF45 on an XF40 means you're one point high, just like guessing MS67 on a 66 would. A box plot for each coin would also be informative.

    I'll go back and actually read your post now. ;)
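
    The normalization suggested here could look something like the sketch below. The `GRADE_STEPS` list is an assumption (the commonly used numeric grades on the 70-point scale), and `step_error` is a hypothetical helper name, not anything from the original analysis.

```python
# Assumed list of commonly used numeric grades, lowest to highest.
GRADE_STEPS = [1, 2, 3, 4, 6, 8, 10, 12, 15, 20, 25, 30, 35, 40, 45,
               50, 53, 55, 58] + list(range(60, 71))

# Map each grade to its position on the scale, so adjacent grades differ by 1.
STEP_INDEX = {grade: i for i, grade in enumerate(GRADE_STEPS)}

def step_error(guess, tpg_grade):
    """Error in grading steps rather than raw points (positive = guessed high)."""
    return STEP_INDEX[guess] - STEP_INDEX[tpg_grade]

print(step_error(45, 40))  # XF45 guessed on an XF40
print(step_error(67, 66))  # MS67 guessed on an MS66
```

    Both calls count as one step high, so circulated and uncirculated errors can be averaged on the same footing. A guess that is not in the list raises a `KeyError`, which is a deliberate choice here to flag non-standard grades instead of silently rounding them.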
  4. ddddd

    ddddd Member

    You make some great points and I appreciate your analysis!

    Having polls on GTG threads is something that we should start doing more (I haven't done it but will try to remember going forward).

    One thing to note is that a fair share of the GTG threads are the outliers (and that will skew your results). We tend to post coins that either look under-graded or over-graded (again something that I do too). Toned coins are another tricky aspect as it's hard to judge how much of a bump the TPG gave for color. If we posted more blast white examples, the grades should be even closer on average.
  5. MeowtheKitty

    MeowtheKitty Well-Known Member

  6. RonSanderson

    RonSanderson Supporter! Supporter

    I agree that some GTG threads seem to be about coins that the poster feels are misgraded, and might like a second opinion. In this case the results are even more intriguing, showing that CT agrees with the TPGs more than the posters might expect.
    Lembeck13, Sunflower_Coins and ddddd like this.
  7. messydesk

    messydesk Well-Known Member

    Indeed, a good analysis. I agree with ddddd that there is some bias away from the ordinary coins when GTG threads are posted. A good methodology to use would be to grab a Heritage photo, crop out the grade (duh) and post one each week as The Official CoinTalk Crowdsourced Grading Experiment. Post the same instructions each week. You have now controlled for the photo quality, presentation, and instructions to the participants. Another thing you could do is a second, parallel study using high quality photos, such as those on CoinFacts to see if the accuracy is increased (mean closer to actual grade, lower standard deviation).
    Lembeck13, Paul M., Mainebill and 5 others like this.
  8. physics-fan3.14

    physics-fan3.14 You got any more of them.... prooflikes? Supporter

    That is something I do address in the commentary, although I don't normalize for it in the numbers (I'll do that next time).

    I'm fully aware that some of the posts are making a point as to how overgraded or undergraded the coins are. However, what I found is that this actually averages out - a similar number of so-called "overgraded" and "undergraded" coins are posted, and the CT guesses generally balanced out. Fewer of the non-ddddd posts are outliers than you might expect ;)

    Given that eye appeal (and the color bump) are important aspects of the grading process, I don't think this actually distorts the data. CT graders should be able to discern quality toning and adjust their grades for eye appeal according to market demand. How else can we achieve accurate grades as compared to the TPGs?
    Paul M. and ddddd like this.
  9. physics-fan3.14

    physics-fan3.14 You got any more of them.... prooflikes? Supporter

    Standardized photographs would absolutely increase accuracy. However, given that my initial inspiration was the NGC grade review of Ebay pictures, I'm not sure that would actually improve the intended results of this study. While CoinTalk photos are, on average, better than Ebay, there is still a sufficiently wide variance that the skill of the grader in interpreting the photographs has to be considered.

    I do agree, though, that running a standardized test would result in useful and interesting data. I will begin this test tomorrow.
  10. ddddd

    ddddd Member

    What I meant to say was that the color bump can be hard to account for if one doesn't mention the generation of holder. In my experience, the color bumps have been treated differently over the years (much more so than non-toned coins).

    Edit...see here for even more wackiness when it comes to color bumps:
    Paul M. likes this.
  11. eddiespin

    eddiespin Fast Eddie

  12. physics-fan3.14

    physics-fan3.14 You got any more of them.... prooflikes? Supporter

    You are absolutely correct. There is wackiness in grading. There is variability from the TPGs. There is inconsistency in the color bump. I really can't argue that. All I can do is take an average over a sufficiently large sample size, and find what the average result is (hopefully, over a sufficiently large sample size, this sort of thing diminishes in significance). I'll also say that, for the most part, the wildly toned coins make up a small percentage of the overall GTGs, so it doesn't distort the average grade discrepancy as much as you might think (except for your intentional posts, of course ;) )

    There is more grade discrepancy between CT and the TPGs from AU vs MS posts than from color bump posts.
    Paul M., RonSanderson and ddddd like this.
  13. Michael Scarn

    Michael Scarn Member

    I have listed a few ideas for potential extensions of your research that might not be too difficult, since you have already captured all the required data. Hopefully they are useful/interesting!

    From what I remember (it has been a while since I read this book), the Wisdom of Crowds phenomenon works best when you have disparate, independent estimates. Because previous CT GTGs are visible (and I assume usually read by individuals submitting later posts), I wonder if you might find a statistical relationship between the direction and/or magnitude of the “error” (i.e. CT user grade - TPG grade) of the first poster and that of the thread average. I would hypothesize that the error of the final CT GTG average might be bigger and/or biased in situations where the first posted estimate is a large positive or negative deviation from the TPG grade. This could be graphically displayed as a scatter plot (initial error, average error) with each dot being a separate coin. Differing sample sizes may cause too much noise and weaken any relationship that may exist, however. This may also explain why polls are more accurate (more responses where people may be less affected by prior guesses).
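
    One rough way to check for this kind of anchoring is a correlation between the first poster's error and the average error of everyone who posted afterwards. The sketch below uses made-up threads and a hand-written Pearson correlation; the data and the choice to exclude the first guess from the "later" average (so the two quantities are not correlated by construction) are assumptions for illustration.

```python
from math import sqrt
from statistics import mean

# Hypothetical data: (TPG grade, guesses in posting order) for each GTG thread.
threads = [
    (64, [66, 65, 65, 64]),
    (65, [63, 64, 64, 65]),
    (63, [63, 64, 64, 63]),
    (66, [64, 65, 66, 65]),
]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Error of the first guess vs. average error of all later guesses, per thread.
first_errors = [guesses[0] - tpg for tpg, guesses in threads]
later_errors = [mean(guesses[1:]) - tpg for tpg, guesses in threads]

r = pearson(first_errors, later_errors)
print(f"correlation between first-guess error and later average error: {r:.2f}")
```

    A value near zero would suggest later posters are not being pulled toward the first guess; a strongly positive value would be consistent with anchoring, though with only a few dozen threads it would be weak evidence either way.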

    It would also be interesting to compare the size of the error of the CT GTG average versus number of estimates for a particular GTG. This could help provide support for how many responses are needed to get an acceptable estimate (without having to assume some sort of distribution in the guesses, a potential problem since these are not IID as noted above).
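
    As a sketch of this idea, the error of the running average can be tracked as each new guess comes in. The function name and the sample numbers are hypothetical.

```python
from itertools import accumulate

def running_abs_error(tpg_grade, guesses):
    """Absolute error of the average of the first k guesses, for k = 1..n."""
    running_totals = accumulate(guesses)  # cumulative sums in posting order
    return [abs(total / k - tpg_grade)
            for k, total in enumerate(running_totals, start=1)]

# Hypothetical thread: TPG grade 64, guesses listed in posting order.
for k, err in enumerate(running_abs_error(64, [66, 64, 63, 63]), start=1):
    print(f"after {k} guesses: average is off by {err:.2f}")
```

    Averaging these curves over many threads would show roughly how many guesses it takes before the crowd estimate stops improving, without assuming any particular distribution for the guesses.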

    Finally, I think it could be interesting to summarize how much “better” the average CT grade for a coin is versus the median CT grader {i.e. Median(abs(CT-TPG)) - abs(Average(CT) - TPG)}.
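
    The formula in braces translates almost directly into code. This is a minimal sketch with a hypothetical function name and made-up guesses.

```python
from statistics import mean, median

def crowd_advantage(tpg_grade, guesses):
    """Median(abs(CT - TPG)) - abs(Average(CT) - TPG) for a single coin.

    Positive values mean the crowd average beat the typical individual grader.
    """
    individual_errors = [abs(g - tpg_grade) for g in guesses]
    crowd_error = abs(mean(guesses) - tpg_grade)
    return median(individual_errors) - crowd_error

# Hypothetical coin graded 65 by the TPG, with five CoinTalk guesses.
print(crowd_advantage(65, [63, 64, 66, 67, 65]))
```

    In this made-up case the high and low guesses cancel, so the average lands on the TPG grade even though the median grader was a point off.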

    Apologies for the enthusiasm; I’m a recovering math/quant guy....
    Lembeck13 and Paul M. like this.
  14. micbraun

    micbraun coindiccted

    I thought grading from pictures is for the birds? Now there’s a study :)

    Let me help you with this. I’ll post a couple of GTG threads in the next days...
  15. physics-fan3.14

    physics-fan3.14 You got any more of them.... prooflikes? Supporter

    @Michael Scarn , those are some great suggestions. "Poster Bias" is something which I have definitely noticed over the years (if a well respected poster states a grade, more posters are going to post that grade....) This, in part, can be handled by the blind poll which I would like to see incorporated in more threads - you can't see the results of the poll until you select your own answer. I think this leads to more accurate results of the level of CT graders. Your other comments are also intriguing, and I will follow up on them.
    Last edited: Jul 27, 2019
    Paul M. likes this.
  16. physics-fan3.14

    physics-fan3.14 You got any more of them.... prooflikes? Supporter

    Fantastic! If you would, please include a poll so we can get good results.
  17. micbraun

    micbraun coindiccted

    Sure, I always include a poll. And I believe I only posted high res pictures in the past, it’s pointless with bad pictures.
  18. desertgem

    desertgem MODERATOR Senior Errer Collecktor Moderator

    Great Work !! Now how about one on accuracy of error coin identification?

    Just kidding !!! I wouldn't put anyone on that task :) Jim
  19. GDJMSP

    GDJMSP Numismatist Moderator

    Question for ya. While you were doing your analysis, did you try to account for the changes in TPG grading standards? And, when you were searching for threads, did you limit your search to only threads labeled as GTG?

    I'm asking the 2nd question because that alone would greatly limit your sample size. Of course I also recognize that not limiting it that way also adds a much greater degree of difficulty when searching out threads. However, if you were to spend the time on that search, I believe it would increase your sample size exponentially, possibly giving you much better results in regard to accuracy.

    One last thing to consider. Break your sampling down by time frame, meaning the years the posts were made. I think you'll find some interesting differences in results in that aspect alone.
    Paul M. likes this.
  20. calcol

    calcol Supporter! Supporter

    Interesting post and great analysis. Thanks for doing the work.

    Personally, I rarely participate in GTG. Yeah, I look at the pics of the coins ... can't resist that and am glad they get posted ... but don't feel comfortable grading from pics. Especially for AU and UNC coins, variations in lighting and imaging techniques can cause a huge change in perceived grade. For example, it's possible to make adjustment marks on early gold coins virtually disappear with imaging techniques. And I'm not that great at grading AU or UNC coins with the coin in hand ... influenced too much by eye appeal.

    I do look closely at ROF (real or fake) posts. And have made one or two posts on that myself.

    But I encourage fellow CTers to keep up the GTG posts ... wanna see them coins with or w/o a GTG tag! :)

    Paul M. likes this.
  21. physics-fan3.14

    physics-fan3.14 You got any more of them.... prooflikes? Supporter

    I made no attempt to account for TPG standards changes.

    I found threads by browsing through recent threads - I went as far back as April of this year (all these threads were from the past 3 months). Not everything was labelled as GTG, but most of them were. I only used threads with more than 5 guesses, and only coins which had been graded by NGC or PCGS. I also filtered out a couple which had been details graded, as those are hard to account for.
    Paul M. and Randy Abercrombie like this.