So some of you probably know by now that I've been trying to explain, or understand, the hammer prices of coins sold at auction. As a data-scientist (buzz-word; I'm a clinical epidemiologist), I've been coding and tweaking data on various coin types (aurei, several denarii and several medieval coins). This involves a bit of automated data-mining.

In this post, I'm focusing on the data-mining. I've coded an algorithm that 'looks' in the data for the following variables:

The estimated price [variable name: estimate]
The hammer price [hammer]
The assigned grade [grade]
The auction house [auction_house]
Is there toning (e.g. 'cabinet toning') [toning]
Any super-enthusiastic notes (e.g. "superb") [positive_remarks]
Is the coin rare (e.g. "extremely rare") [rarity]
Is the coin damaged (e.g. broken) [damaged]
Is the coin slabbed or graded (e.g. NGC) [slabbed]
Are there problems related to minting (e.g. rusty die) [minting_problems]
Are there remarkable minting 'benefits' ("well struck") [minting_pros]
Is the coin die-matched [die_match]
Is there a provenance (e.g. "from the x-collection") [provenance]
Is this coin a plate coin [plate_coin]

So far, that makes 14 variables. Now I'm looking for your input. Except for the estimate, hammer price and auction house, these variables are based on different strings of text. For example, the variable "toning" is based on the following strings:

bronzepatina cabinet patina tone toned toning tönung

...the variable "positive_remarks" on the following:

attractive wonderful interesting important type fascinating attractively eye appeal splendid Strong portrait desirable pleasing superb premium quality

... and the variable "rarity" is based on these:

rare rarest scarce unrecorded selten unpublished unique key coin mule uncommon finest known Rarity seltene recorded Unrecorded

So, a description containing the sentence "Good VF, toned. Attractive style. Rare."
will result in:

Toning = true
Positive_remarks = true
Rarity = true

You might agree (or not) that these variables say something about the value of a coin (you can actually test this). Now what I'm looking for is the following: I need some input on these variables. So, here they are:

Damage (mild damage):
prüfhieb test cuts test cutss edge cut edge cuts banker banker-mark banker mark banker's mark bankers' marks countermark damage damaged metal flaw graffiti cleaning marks weakness scrapes kratzer corrosion korrosion porous grainy rough flan flaw hairlines damaged damaged scraped flan-crack bent korrosionsspuren schrötlingsrisse roughness porosity deposits flaw scratches scratch chop-mark circulated smoothed cleaned hairline Schrötlingsriss chipped creased korrodiert corroded Schrötlingsriß chip crimped overcleaned

Damage (severe damage):
broken broke crack repaired repair crystallized hole holed bronzepest Altered surfaces tooled Artificial toning Retoned pierced Durchbrüche fragment fragments GEBROCHENES GEKLEBTES Abbruch

Minting_problems:
unregelmäßiger schrötling worn reverse die worn obverse die stempelausbruch doppelschl irregular flan prägeschwächen worn die stempelfehler wavy flan schrötlingsfehler knapper schrötling off-center off center flat strike weak strike dezentriert belagreste die crack die rust die clash low relief die wear flow lines gewellt rusted blundered

Minting_pros:
large flan wide flan broad flan breiter schrötling luster lustrous iridescent stempelglanz hohes relief well struck good strike well-centered well centered perfect strike as struck full strike Lustre

Provenance:
provenance collection sammlung erworben from the inventory from a private provenienzen Pedigree purchased

Die_match:
same obv. die same rev. die same dies stempelgleich die match same die

Plate_coin:
plate coin this coin published this coin illustrated this coin cited this coin, illustrated this coin

Slabbed:
ngc slabbed certified encapsulated Slabbed PNG PCGS PNG holder
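For readers who want to experiment with the idea, the keyword matching described above can be sketched roughly as follows. This is only an illustration under assumptions, not the actual algorithm: the dictionary holds a small subset of the lists from this post, the names `KEYWORDS` and `flag_description` are made up for the example, and matching on whole words (case-insensitive) is a guess at how the lookup works.

```python
import re

# Hypothetical subset of the keyword lists from the post (not the full lists).
KEYWORDS = {
    "toning": ["cabinet", "patina", "tone", "toned", "toning", "tönung"],
    "positive_remarks": ["attractive", "wonderful", "superb", "eye appeal"],
    "rarity": ["rare", "rarest", "scarce", "unique", "selten"],
}

def flag_description(description, keywords=KEYWORDS):
    """Return {variable: bool}, True if any keyword occurs as a whole word."""
    text = description.lower()
    flags = {}
    for variable, terms in keywords.items():
        # \b keeps "tone" from matching inside a longer word such as "limestone".
        flags[variable] = any(
            re.search(r"\b" + re.escape(term.lower()) + r"\b", text)
            for term in terms
        )
    return flags

print(flag_description("Good VF, toned. Attractive style. Rare."))
```

For the example sentence, all three flags in this subset come out true, which agrees with the result given in the post.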
I'm not at all qualified to comment on these. But I'm curious about one minor point: if merely being cleaned results in an ancient coin being classified as "damaged," then isn't just about every ancient coin in the world that's for sale or has ever been sold "damaged" unless it just came out of the ground? You're not talking about U.S. coins here, after all. Or are you assuming that if a dealer even bothers to mention that a coin has been cleaned, that implies that the coin was damaged in the process?
It is interesting to find you here talking about your automated tools to scrape auction descriptions the same week that @Suarez is posting about the difficulties he is having with his artisanal coin scraping software. I went to my collection database and the second coin had the condition notes "minor chipping to the edges - typical". "Chipping" isn't one of your words. We could all pitch in and find words, but perhaps something like WordNet is better for finding synonyms. Consider publishing your tool as open source on GitHub so that people working on tools like CoinProject and Coryssa can use it to get more machine-readable data for analysis. I see you have a few foreign words but not many. Look at http://www.muenzen-hardelt.de/dic/diction1.html for more ideas.
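One way to catch inflected forms such as "chipping" without listing every variant is to match on word stems instead of whole words. The sketch below only illustrates that idea; the helper name `matches_stem` is hypothetical, and nothing in the thread says the original tool works this way.

```python
import re

def matches_stem(stem, text):
    """True if any word in text starts with the given stem (case-insensitive)."""
    pattern = r"\b" + re.escape(stem) + r"\w*"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

note = "minor chipping to the edges - typical"

# Whole-word matching on "chip" misses the inflected form.
whole_word_hit = re.search(r"\bchip\b", note) is not None

print(whole_word_hit)              # False
print(matches_stem("chip", note))  # True
```

The trade-off is over-matching: the stem "rare" would also fire on "rarely", so short stems probably need a manual check before they go into a list.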
Since half of your words are synonyms and many will have several that apply at the same time and to varying degrees, how will you handle searches? Are German and English the only relevant languages? I feel you are trying to digitize an analog subject. That is a hard task.
Thanks all for the comments.

This is of course correct. However, it's my understanding (and limited experience) that the comment "cleaned" or "overcleaned" in ancient coinage, especially at the higher end, decreases the value.

I use publicly available data from sixbid. At the moment, I'm not focusing on auction houses, although this certainly is a possibility.

Certainly interesting, and I've looked at coryssa before. Right now, I'm using data from sixbid because it presents the data in the same standardized manner, which makes text-mining easier.

I am not sure I understand your question, but to give an example: suppose I'm constructing a dataset on Hadrian denarii. This one will be included as well. The variables will be as follows:

estimate: 1711
hammer: 11976
grade: extremely fine
auction_house: Leu Winterthur
toning: TRUE ("toned")
positive_remarks: TRUE ("beautifully"; "wonderful"; "exquisite")
rarity: TRUE ("very rare")
damaged: FALSE
slabbed: FALSE
minting_problems: FALSE
minting_pros: FALSE
die_match: FALSE
provenance: TRUE ("collections")
plate_coin: FALSE

So, synonyms are not an issue, as long as the synonyms are unique to the variable. Also, I should probably mention that I have no real intention of using the algorithm other than for personal use (e.g. identifying an auction house for a specific coin).
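The condition "unique to the variable" can be checked mechanically. The sketch below is one possible check, with an assumed helper name (`overlapping_terms`) and a small subset of the lists from the opening post. It reports every term of one variable that appears as a whole word inside a term of another variable.

```python
import re

def overlapping_terms(keywords):
    """List (var_a, term_a, var_b, term_b) where term_a occurs as a whole word in term_b."""
    items = [(var, term.lower()) for var, terms in keywords.items() for term in terms]
    hits = []
    for var_a, term_a in items:
        for var_b, term_b in items:
            if var_a == var_b:
                continue  # overlaps within one variable do not change its flag
            if re.search(r"\b" + re.escape(term_a) + r"\b", term_b):
                hits.append((var_a, term_a, var_b, term_b))
    return hits

# Subset of the lists from the opening post, used only for illustration.
sample = {
    "toning": ["toned", "toning", "patina"],
    "damaged": ["Artificial toning", "Retoned", "corrosion"],
}

print(overlapping_terms(sample))
# [('toning', 'toning', 'damaged', 'artificial toning')]
```

In this subset, a description reading "Artificial toning" would set both toning and damaged to TRUE under whole-word matching, so that pair may need a rule for which variable wins.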
You are basically missing contextual information. Provenance should be FALSE for the example you gave; why should anyone care about a 2013 provenance? I am not commenting on the former collector, but there were some negative posts about his auction house and tactics earlier on CT.
I wasn't suggesting you look at Coryssa for input. I understand what you are trying to do. (I wrote a processor for numisbids.com, which is similar, to generate output for CoinProject.com.) I am pointing out that there are a lot of input formats, and a lot of data to gather. I am sure the folks here will help you find synonyms. It would be great if your output could be encoded in the nomisma.org format promoted by the ANS. It would be good for numismatics as a field of study if the data science folks were on the same page. http://nomisma.org/ontology has a proposal for the fields used to describe a coin, including wear, corrosion, countermark, secondaryTreatment, etc. They have been working on it for a long time. It would be a useful result in itself to learn whether you find their system great or full of holes for what you are trying to do.
One addition to the damage category, at least for trade dollars, would be "chop marked"? Of course, if you're only interested in ancients, I would guess this has no bearing.