A Brief History of reCAPTCHA
reCAPTCHA is an excellent example of dual-purpose technology, a term I made up just now to describe software with two major functions. In the first version of reCAPTCHA, users were asked to identify a distorted bit of text to prove they were not spam bots; this was reCAPTCHA's main purpose. Simultaneously, users were helping to transcribe unknown words from scanned documents. According to Tom Scott, once roughly a dozen people agreed on an unknown word, it became the accepted transcription, which helped to digitize old books and newspapers.
The original design is described pretty well here, in an episode of "How I Built This." Eventually, Google acquired the reCAPTCHA technology, and many of the original creators moved on to Duolingo.
An Overview
reCAPTCHA came out a few years after CAPTCHA, or Completely Automated Public Turing Test to tell Computers and Humans Apart. Why they chose this contrived acronym is beyond me…why not call it Completely Automated Public Turing Assignment Identifying Non-humans? That would have spelled CAPTAIN. Oh well…
The first version of reCAPTCHA was text-based and doubled as a transcription tool. Even before modern AI and machine learning, spammers could build bots that succeeded some of the time, and they also employed people to solve reCAPTCHA challenges. reCAPTCHA version 2 was something of a black box, but Google's developers designed it to assign trust based on cookies: if users seemed suspect, or if there was not sufficient information about them, they were asked to identify pictures. For example, they might be asked to determine which pictures contained fire hydrants.
AI and machine learning proved able to solve these picture challenges, so Google released reCAPTCHA version 3 in late 2018. Instead of presenting a challenge, version 3 runs invisibly and assigns each request a score between 0.0 (very likely a bot) and 1.0 (very likely a human), leaving it to the site owner to decide what to do with borderline traffic.
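A minimal sketch of how a site owner might act on that score. The 0.0–1.0 score range is how reCAPTCHA v3 actually works; the function name and the three-tier policy below are my own illustration, not Google's API:

```python
def classify_request(score: float, threshold: float = 0.5) -> str:
    """Decide how to treat a request based on a reCAPTCHA v3 score.

    v3 returns a score between 0.0 (very likely a bot) and
    1.0 (very likely a human); choosing the threshold is left
    to the site owner.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("reCAPTCHA v3 scores fall between 0.0 and 1.0")
    if score >= threshold:
        return "allow"       # probably human: let the request through
    elif score >= threshold / 2:
        return "challenge"   # borderline: fall back to an explicit challenge
    else:
        return "block"       # very likely a bot
```

In practice a site might route "challenge" traffic to a v2 picture puzzle, which is one reason the older versions have not entirely disappeared.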
The Cybersecurity Connection
This is a scientific paper presented at Black Hat; the authors were able to solve 70% of reCAPTCHA challenges (version 2, presumably). They outline their tools and methods in detail, including the use of neural networks, deep learning, and, poetically, Google Reverse Image Search.
We are able to create over 63,000 cookies in a single day without triggering any mechanisms or getting blocked, and are only limited by the physical capabilities of the machine. This indicates that there is no mechanism to prohibit the creation of cookies from a single IP address. The only restriction we detected was triggered by a massive number of concurrent requests (i.e., for detecting DoS attacks). The lack of a safeguard can be justified by the fact that creating cookies at a large scale has not been required by attacks before. Indeed, we present a novel misuse of tracking cookies, which makes them a valuable commodity for fraudsters.
— Suphannee Sivakorn, Jason Polakis, and Angelos D. Keromytis
Privacy Concerns
This is an article by FastCompany outlining potential privacy concerns. From the article:
…But there’s the trade-off. “It makes sense and makes it more user-friendly, but it also gives Google more data,” he says. Google would not clarify what it does with the data it captures about user behavior via reCaptcha, only that it is used for improving reCaptcha and general security purposes.
The FastCompany article is skeptical in tone: it rejects Google's claims that the data will be used responsibly, and it seems to share the broader concern that the company simply has too much power over user data.
Closing Thoughts
I added a honeypot to our Kiwanis website, and anyone who knows cybersecurity would probably tell me that this is not a complete solution to the problem. Still, it mitigated our spam problem quite a bit, although occasionally we will get something like this:
Yes, I literally added a question asking them whether or not they are a bot. No one else seems to do this, and I cannot imagine a detail like this going very well in a job interview…but why not? I have yet to meet a bot intelligent enough to fill in “no” for this field…I think.
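For the curious, the idea can be sketched in a few lines of server-side code. The field names here are hypothetical, and our site's actual implementation may differ, but this captures both tricks: a hidden honeypot field that humans never see, and the literal "are you a bot?" question:

```python
def looks_like_spam(form: dict) -> bool:
    """Flag a form submission as spam using two cheap checks.

    - A honeypot field hidden from humans (e.g. via CSS): bots
      tend to auto-fill every field, so any value here is suspect.
    - A visible "are you a bot?" question: humans answer "no".
    """
    # The invisible field should always arrive empty from a human.
    if form.get("website", "").strip():
        return True
    # The surprisingly effective literal question.
    if form.get("are_you_a_bot", "no").strip().lower() != "no":
        return True
    return False
```

The honeypot field is typically hidden with CSS rather than `type="hidden"`, since some bots are smart enough to skip explicitly hidden inputs but still fill in anything styled out of view.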
Terms such as "honeypot" and "reCAPTCHA" may be outside of the common vernacular, but unfortunately everyone in the world is probably familiar with spam. At its core is the "arms race" Tom Scott described: distorted text gave way to pictures, and then to an invisible program running in the background that may be superior, but that may also carry problematic implications.
Criminals will continue to enhance their technology, Google will respond in kind, and occasionally cybersecurity researchers will publish impressive papers on how they bypassed security systems such as this.