A couple of days ago Melanie commented that she'd gotten a weird image in the anti-spam verification box here:
Seriously, my first captcha had Greek letters! I don't know how to do that on my keyboard. Was I supposed to transliterate to English?
I've never seen Greek letters, but the question led me to idly click around to find out exactly what the captcha program, which is called reCAPTCHA, is doing. I had a vague memory that it was supposed to be using the human entries to digitize scanned content, but I didn't know how it worked and had wondered: if the content hasn't been digitized yet, how does it know if your entry is correct?
The explanation is on the reCAPTCHA website and also in a 2008 article that appeared in Science (pdf), and it accounts for why you always get two words.
- One of the two words is already known to the software, and it serves as the shibboleth that proves you are human (which lets you post your comment) and verifies that you can probably decipher a distorted word in print.
- The other word is unknown to the software and represents a part of a scanned document that optical character recognition (OCR) has failed to decode satisfactorily.
Or, as the reCAPTCHA people put it,
Each new word that cannot be read correctly by OCR is given to a user in conjunction with another word for which the answer is already known. The user is then asked to read both words. If they solve the one for which the answer is known, the system assumes their answer is correct for the new one. The system then gives the new image to a number of other people to determine, with higher confidence, whether the original answer was correct.
There, isn't that interesting? And this explains how Greek letters made it in, probably via one of the unknown words: OCR couldn't recognize it not because it was distorted, but because it was in the Greek alphabet.
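For the programmers in the audience, here is a rough sketch of the two-word scheme described above. It is only my guess at the logic, not reCAPTCHA's actual code: the class name, the method names, and the agreement threshold of 3 are all made up for illustration, and the real system surely does more (the quoted description only says the new image goes to "a number of other people").

```python
from collections import Counter

# Hypothetical threshold: how many matching human readings we want
# before trusting a transcription of an unknown word.
AGREEMENT_NEEDED = 3


class TwoWordChallenge:
    """Toy model of the known-word / unknown-word check (illustrative only)."""

    def __init__(self, agreement_needed=AGREEMENT_NEEDED):
        self.agreement_needed = agreement_needed
        self.votes = {}  # unknown image id -> Counter of typed readings

    def submit(self, known_answer, typed_known, unknown_id, typed_unknown):
        """Return True if the user passes the human check.

        The user is judged only on the known word. If they get it right,
        their reading of the unknown word is recorded as one vote.
        """
        if typed_known.strip().lower() != known_answer.strip().lower():
            return False  # failed the control word, so discard the other entry
        counts = self.votes.setdefault(unknown_id, Counter())
        counts[typed_unknown.strip().lower()] += 1
        return True

    def accepted_reading(self, unknown_id):
        """Return the most common reading once enough people agree, else None."""
        counts = self.votes.get(unknown_id)
        if not counts:
            return None
        reading, n = counts.most_common(1)[0]
        return reading if n >= self.agreement_needed else None
```

So a single correct control word gets your comment posted, but the unknown word only counts as "digitized" once several people have independently typed the same thing for it.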
That is fascinating. I am so glad you looked it up because I was too lazy to look it up myself. I was sort of wondering how it knows your entry is correct if you are digitizing unknown text-- don't know why it didn't occur to me that there are two words, and therefore one is known and one unknown. It seems obvious in hindsight.
I think the Greek word was logos, by the way.
Posted by: MelanieB | 21 July 2012 at 02:18 PM
And now I'm going to be wondering which word is known and which is unknown every time I fill in a reCAPTCHA.
Posted by: MelanieB | 21 July 2012 at 02:19 PM
If only we could harness the human mental power wasted while idly wondering in response to Internet trivia.
Posted by: bearing | 21 July 2012 at 08:06 PM