CMU innovation

CAPTCHA and reCAPTCHA


Computer science professor Manuel Blum had been thinking about human computation and authentication schemes for quite some time. So in 2000, when Udi Manber, then chief scientist at Yahoo!, mentioned that bots were logging on as humans, hogging Yahoo! email accounts and invading chat rooms, Manuel said, “Of course, you need a computer sentry, a Completely Automatic Public Turing test to tell Computers and Humans Apart.” And so CAPTCHAs, those annoying squiggly words that you have to type before you can enter certain sites or perform certain transactions, were born.

This post is part of “The impact of CMU” series, sponsored by Larry Jennings, trustee.

Image: example of a CAPTCHA

Yes, those are the ones. They were invented at Carnegie Mellon by Manuel and his new PhD student at the time, Luis von Ahn, himself now an acclaimed computer science professor and entrepreneur. The company that ended up being formed to commercialize CAPTCHAs, with a twist, was reCAPTCHA. The company was sold to Google in 2009.

Let’s back up a bit.

The problem.  Yahoo! had a chat room problem. People in chat rooms were plagued by bots, those automatic web robots that run automated tasks over the Internet. Bots were not a life-or-death problem, but they annoyed people in the chat rooms. Bots tricked people into thinking that they were human, only to serve up an advertisement (or perform some other function) instead of a valid person-to-person response. Yahoo! was king of chat rooms, and it needed the equivalent of a security guard or bouncer that let humans in but kept bots out.

Manuel saw this as a cryptographic problem. The great thing about this problem for Manuel, who loves challenges, is that it sounded impossible to solve. As Manuel puts it,

“You have to create a test that: 1) humans can pass, but 2) no program can pass. And here’s the killer, 3) while the program can’t pass the test, it must nevertheless be able to grade it! The program that grades the test can’t pass the test! It smacks of cryptography, about which I had thought a lot, so it was natural for me to want to solve it.”

Manuel and Luis tried lots of different ideas: IQ tests, psychological tests, tests of understanding, tests of the ability to distinguish what’s serious from what’s humorous. “However,” Manuel laments, “test after test, it turned out that computer programs had already been written that were better at passing those tests than humans. Which meant that none of them worked for our purpose.”

As a collaborative researcher, Manuel reached out to some colleagues, including Professor Richard Fateman, who was teaching a course in optical character recognition (OCR) at UC Berkeley with Henry Baird, then head of OCR at Xerox PARC. OCR gives computers the ability to read printed text, and Fateman pointed out that the technology of the day could not read nearly as well as humans.

The solution.  Thinking about it, Manuel and Luis realized that children not only learn to read at a young age, but also learn to read under difficult conditions, what Manuel ended up calling “strange and non-standard circumstances,” meaning varied lighting, font size, media (books vs. computer screens), etc.:

“We can read stop signs even with bird doo, graffiti, snow and shadows. We can read the muddled sign but OCR cannot. That was a revelation – OCR can’t read unless the conditions are really good! But people can; people of all ages can read, in all kinds of circumstances.”

Manuel went on a hunt to confirm his assumptions. People could, of course, read a crumpled newspaper or an advertisement that skewed its words as part of the design, and he found example after example of people reading where OCR could not.

At the same time, Manuel learned through an undergraduate student at CMU, Scott Crosby, about the GIMP, a free program for altering images. Manuel describes the program:

“GIMP allowed you to take stuff and modify it sort of like Photoshop. You could write and then modify it. I wanted to put words on a rubber sheet and then stretch it. I knew that kids would still be able to read the words. And a computer that stretched a word would still know what the word was, even though it couldn’t read it. It could grade the test even though it couldn’t pass it. That led to Luis creating our first moderately successful test called Gimpy. I wanted a gotcha-like name for it, and after many attempts came up with CAPTCHA, which stood for Completely Automatic Public Turing test to tell Computers and Humans Apart.”
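The "grade it without passing it" property Manuel describes can be illustrated with a rough Python sketch. This is not the Gimpy code; the word list, the function names and the in-memory `store` are all hypothetical, and `distort` is only a text placeholder for the image warping a real CAPTCHA would do. The point is that the grader never has to read the distorted word, because it remembers which word it distorted.

```python
import secrets

# Hypothetical word list; a real test would draw from a large dictionary.
WORDS = ["rubber", "sheet", "stretch", "gimpy", "sentry"]


def distort(word: str) -> str:
    """Placeholder for image warping.

    A real CAPTCHA would render the word as a stretched, noisy image.
    Spacing out the letters here is trivially reversible and is only
    a stand-in so the sketch stays self-contained.
    """
    return " ".join(word)


def new_challenge(store: dict) -> tuple[str, str]:
    """Pick a word, remember it server-side, hand out only the distorted form."""
    challenge_id = secrets.token_hex(8)
    word = secrets.choice(WORDS)
    store[challenge_id] = word
    return challenge_id, distort(word)


def grade(store: dict, challenge_id: str, response: str) -> bool:
    """Grade by comparing with the remembered word, never by reading the image."""
    answer = store.pop(challenge_id, None)  # each challenge can be used once
    return answer is not None and response.strip().lower() == answer
```

Removing the challenge from `store` once it has been graded is a design choice assumed here so that a correct answer cannot be replayed.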

Manuel emphasizes that an important part of this idea, and a basis for the subsequent reCAPTCHA, is its additional application to OCR and to artificial intelligence (AI). “As long as people haven’t written programs to break it, we have good CAPTCHAs. Once their programs break it, we have good OCR.”

Manuel and Luis started thinking about commercializing the innovation. As it turned out, Andrei Broder had patented the idea of a CAPTCHA (though he didn’t call it that) three years before Manuel and Luis had even heard of the problem – but without a good implementation. Co-licensing was an option but was not pursued, and the squiggly CAPTCHAs became ubiquitous.

reCAPTCHA.  Luis got the idea of using crowd sourcing, transforming the concept of a CAPTCHA into reCAPTCHA and killing two birds with one stone: authenticating humans while, at the same time, outperforming OCR. This turned out to be a huge win-win. Luis had worked on a CMU project to digitize books. Based on this work, Luis talked to people at The New York Times who were trying to digitize the Times’ archives. The newspaper folks were resigned to using imperfect OCR, which couldn’t recognize all of the words on a page, particularly if the page was in anything less than pristine condition. Luis’ idea was to present the public with two words, one that OCR could read and one that it couldn’t.

Image: example of a reCAPTCHA

He theorized that whoever could read the stretched word would do a pretty good job deciphering the OCR-unreadable one. Employing the concept of crowd sourcing by using multiple people to decipher the same difficult-to-read word, Luis realized that they could get the best possible guess at the difficult word. He was right. The New York Times jumped at the chance to quickly and cheaply read the difficult words, advancing their archival digitization project significantly.
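The two-word scheme can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not reCAPTCHA's actual implementation: the function names and the `min_votes` and `min_share` thresholds are hypothetical. A guess at the unreadable word is counted only when the same person types the known word correctly, and a transcription is accepted once enough independent guesses agree.

```python
from collections import Counter
from typing import Optional


def check_submission(control_answer: str, typed_control: str,
                     typed_unknown: str, votes: Counter) -> bool:
    """Admit the user only if the known (control) word is typed correctly.

    When it is, the user's reading of the OCR-unreadable word is recorded
    as one vote; otherwise the guess is discarded.
    """
    if typed_control.strip().lower() != control_answer.lower():
        return False
    votes[typed_unknown.strip().lower()] += 1
    return True


def consensus(votes: Counter, min_votes: int = 3,
              min_share: float = 0.6) -> Optional[str]:
    """Return the agreed transcription, or None if agreement is too weak.

    Both thresholds are illustrative assumptions.
    """
    total = sum(votes.values())
    if total < min_votes:
        return None
    word, count = votes.most_common(1)[0]
    return word if count / total >= min_share else None
```

For example, if three people who pass the control word type "upon" and a fourth types "upou", `consensus` returns "upon" under these thresholds, while a single vote on its own is not enough to settle the word.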

reCAPTCHA launched out of CMU with a sizable contract. The startup was profitable from day one. The newspaper benefited from having a thorny problem solved; reCAPTCHA benefited by making money on each word. Manuel jokes that “the ‘P’ for public” in the acronym for CAPTCHA became “a ‘P’ for private.”

Manuel also credits Luis with taking the research from the lab to the marketplace. Luis had always been interested in entrepreneurship, although he never thought of himself as an entrepreneur. The genius and ultimate success of reCAPTCHA lay in combining CAPTCHAs with crowd sourcing. Luis recounts,

“We got inquiries from people and companies, one of which was the [New York] Times. They were willing to pay. I saw no reason not to do this. Since we couldn’t do this within the confines of a university, we had to form a company.”

reCAPTCHA was founded in 2007 and was acquired by Google two years later. In one of life’s little ironies, Udi Manber, whose problem started it all, is now at Google. As Luis describes it, they were not shooting for an exit:

“It happened because we got approached by Amazon to be acquired as a result of the Kindle. Facebook also approached us. I knew Google because I had history with them. I had sold a game to Google a few years beforehand. Google ended up being the highest bidder. But Google actually made the most sense to me for other reasons too. They really cared about the project. They would continue it. That was important to me, and to my co-founders and employees.”

Another startup.  Today, Luis is on his second startup, Duolingo, which is leveraging crowd sourcing to translate the Internet. Because of his past success with reCAPTCHA, Luis has been able to fund his new startup to the tune of more than $18 million from name-brand VCs on both coasts. Duolingo offers courses in six languages and has apps for both iPhone and Android, with a user base approaching six million.

Luis found CMU supportive of his entrepreneurial ambitions:

“They [CMU] left me alone to do what I wanted to do. I have learned that if you hire really good people and let them do what they want, what they’re good at, that’s a recipe for success. They did that for me. I also have the advantage of being around great tech people at CMU. Without them I wouldn’t be able to do it. My colleagues and students have participated in the success of reCAPTCHA; now other colleagues and students have the chance to do the same with Duolingo.”

reCAPTCHA today.  The reCAPTCHA project still lives within Google. Luis tells me that Google is using it to digitize two million books a year. reCAPTCHAs are touched by 100 million people a day. And if you are annoyed by those two words, just remember that the alternative is more spam!
