A Brief History of reCAPTCHA

Source: https://github.com/google/recaptcha/issues/286. I did not want to violate the Medium.com terms of service, but it is pretty obvious what most of these terms are and I have provided a link to the uncensored source.

reCAPTCHA is an excellent example of dual-purpose technology, a term I made up just now to describe software with two major functions. In the first version of reCAPTCHA, users were asked to ID a distorted bit of text to prove they were not spam bots — this was reCAPTCHA’s main purpose. Simultaneously, users were actually helping to transcribe unknown words. According to Tom Scott, if maybe a dozen people agreed on the unknown word, then this became the accepted transcription and helped to digitize old books and newspapers.
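The agreement rule Tom Scott describes can be sketched in a few lines of Python. This is only an illustration, not reCAPTCHA's actual implementation: the function name, the normalization step, and the threshold of twelve (taken from "maybe a dozen") are all assumptions.

```python
from collections import Counter

# "Maybe a dozen" per the description above; the real threshold is an assumption.
AGREEMENT_THRESHOLD = 12


def accepted_transcription(answers, threshold=AGREEMENT_THRESHOLD):
    """Return the word enough users agreed on, or None if there is no consensus yet.

    `answers` is a list of strings typed by different users for the same
    unknown word. This helper is hypothetical and exists only for illustration.
    """
    # Ignore blank submissions and compare answers case-insensitively.
    normalized = [a.strip().lower() for a in answers if a.strip()]
    if not normalized:
        return None
    word, votes = Counter(normalized).most_common(1)[0]
    return word if votes >= threshold else None
```

With twelve users typing "Morning" and three typing "moming", the function would accept "morning"; with only five answers in total, it would keep waiting.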

The original design is described pretty well here, in an episode of “How I Built This.” Google acquired reCAPTCHA in 2009, and many of the original creators moved on to Duolingo.

The entire history and overview provided in about six minutes

An Overview

reCAPTCHA came out a few years after CAPTCHA, or Completely Automated Public Turing Test to tell Computers and Humans Apart. Why they chose this contrived acronym is beyond me…why not call it Completely Automated Public Turing Assignment Identifying Non-humans? That would have spelled CAPTAIN. Oh well…

The first version of reCAPTCHA was text-based, and doubled as a transcription tool. Even before AI and machine learning took off, spammers could build bots that succeeded some of the time, and they also paid people to solve reCAPTCHA challenges. reCAPTCHA version 2 was something of a black box, but Google developers designed it to assign trust based on cookies. If the user seemed suspect, or if there was not sufficient information, they were asked to identify pictures. For example, they might be asked to determine which pictures showed fire hydrants.

AI and machine learning were able to successfully solve these picture challenges, so reCAPTCHA version 3 came out in late 2018.

Not everyone loves it.

The Cybersecurity Connection

This is a research paper that was presented at Black Hat; the writers were able to solve 70% of reCAPTCHA challenges (version 2, presumably). They outline their tools and methods in detail, including the use of neural networks, deep learning, and, poetically, Google Reverse Image Search.

We are able to create over 63,000 cookies in a single day without triggering any mechanisms or getting blocked, and are only limited by the physical capabilities of the machine. This indicates that there is no mechanism to prohibit the creation of cookies from a single IP address. The only restriction we detected was triggered by a massive number of concurrent requests (i.e., for detecting DoS attacks). The lack of a safeguard can be justified by the fact that creating cookies at a large scale has not been required by attacks before. Indeed, we present a novel misuse of tracking cookies, which makes them a valuable commodity for fraudsters.
— Suphannee Sivakorn, Jason Polakis, and Angelos D. Keromytis

Privacy Concerns

This is an article by Fast Company outlining potential privacy concerns. From the article:

…But there’s the trade-off. “It makes sense and makes it more user-friendly, but it also gives Google more data,” he says. Google would not clarify what it does with the data it captures about user behavior via reCaptcha, only that it is used for improving reCaptcha and general security purposes.

The Fast Company article is skeptical in tone. It rejects Google’s claims that the data will be used responsibly, and seems to share concerns that the company, in general, simply has too much power over user data.

Closing Thoughts

I added a honeypot to our Kiwanis website. Anyone who knows cybersecurity would probably tell me that this is not a complete solution to the problem, but it seemed to mitigate our spam quite a bit. Occasionally, though, we will still get something like this:

“Are you a bot?” “Yes”

Yes, I literally added a question asking them whether or not they are a bot. No one else seems to do this, and I cannot imagine a detail like this going very well in a job interview…but why not? I have yet to meet a bot intelligent enough to fill in “no” for this field…I think.
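For the curious, a server-side check along these lines might look roughly like the Python below. It is a sketch, not the code running on the Kiwanis site: the field names `website` and `are_you_a_bot` are hypothetical, and it assumes the bot question is a required field, so a blank answer is treated as spam.

```python
# Hypothetical field names; a real form would use whatever names its HTML defines.
HONEYPOT_FIELD = "website"            # hidden from humans with CSS, so it should stay empty
BOT_QUESTION_FIELD = "are_you_a_bot"  # the visible "Are you a bot?" question


def looks_like_spam(form):
    """Return True if a submitted form (dict of field name -> string) looks automated."""
    # A human never sees the honeypot field, so any value in it is suspicious.
    if form.get(HONEYPOT_FIELD, "").strip():
        return True
    # Assumption: the question is required, so anything other than "no" is rejected.
    answer = form.get(BOT_QUESTION_FIELD, "").strip().lower()
    return answer != "no"
```

A submission that leaves the hidden field empty and answers "No" passes; one that fills in the hidden field, or answers "Yes", is flagged. Neither check would stop a bot written specifically for this form, which is why it is not a complete solution.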

Terms such as “honeypot” and “reCAPTCHA” may be outside of the common vernacular, but unfortunately everyone in the world is probably familiar with spam. At the core of the fight against it is the “arms race” Tom Scott described, which explains how distorted text turned to pictures, and then to an invisible program running in the background that may be superior, but may also raise privacy problems of its own.

Criminals will continue to enhance their technology, Google will respond in kind, and occasionally cybersecurity researchers will publish impressive papers on how they bypassed security systems such as this.

A software engineer who writes about software engineering. Shocking, I know.

Evan SooHoo
