CAPTCHA

CAPTCHA is an acronym for Completely Automated Public Turing Test To Tell Computers and Humans Apart. CAPTCHA may be described as a cyber security tool often used on websites. Websites that offer a service or collect data require the user to give inputs. This process can be thwarted with the use of bots. Bots can present themselves as humans with fake identities.

CAPTCHAs are designed in many ways but all of them have the same goal: to let humans into the system and deny admission to bots. They attempt to do this by presenting problems that are hard for bots to solve but relatively easy for humans to solve. With the coming of Artificial Intelligence (AI), computers have become smarter at solving CAPTCHAs. Therefore CAPTCHAs themselves have evolved to counteract intelligent bots.

Discussion

Where do we need CAPTCHAs?
Automated programs or bots are fast compared to humans. This gives bots an unfair advantage. For example, bots can book tickets in bulk and then sell them later at a higher price in the black market. Likewise, bots can generate bulk votes and thereby skew the results of an online voting system. Bots can also be used to spam, troll or trigger DDoS attacks. Here are some use cases where CAPTCHAs are useful:
- Booking of tickets.
- Online voting systems.
- Creation of new online or email accounts.
- Preventing dictionary attacks on password systems.
- Blocking spam comments that promote products on blog posts.
- Blocking automated liking or sharing of web pages on social networking sites.
- Protecting site content from scrapers or search engine bots.
- Preventing theft of personal information in chat rooms.
Why is a CAPTCHA sometimes called a reverse Turing test?
The Turing test was conceived in 1950 by Alan M. Turing as a way to check if machines were as intelligent as humans. If a human interrogator interacting electronically with a machine and another human cannot tell which of them is the machine, then the machine passes the Turing test.
With CAPTCHA, the interrogator is not a human but a computer. Hence the term reverse Turing test is sometimes used.
What are the different types of CAPTCHAs?
reCAPTCHA V2 shows a checkbox and optionally a set of images. Source: Shet 2014.
The following is a broad classification:
- Text-based: Users have to type the characters from a distorted text.
- Image-based: Users are asked to select images belonging to a certain group, such as those containing cats.
- Video-based: Users are shown short video clips and asked to answer a question based on them. NuCAPTCHA is an example.
- Audio-based: Users listen to an audio clip and type what they hear.
- Action-based: Users are asked to perform some action. reCAPTCHA V2 asks users to click a checkbox. MotionCAPTCHA asks users to trace a shape on the canvas. Dynamic Cognitive Game (DCG) is another example.
- Question-based: Users are presented a question that they have to answer correctly. Examples include "What is 1 + six?" or "Flower, resting, lawyer, campsite: the word starting with 'c' is?"
- Fun-based: Users are asked to solve a puzzle or play a game. Bongo is an example. PlayThru from Are You A Human is another example.
- Invisible: User verification is done in the background without requiring specific user inputs. Google introduced this with its Invisible reCAPTCHA.
- Hybrid: Combinations of the above.
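As an illustration of the question-based type, here is a minimal Python sketch that generates a question in the style of "What is 1 + six?". The function names and the single-digit range are assumptions made for this example and do not come from any particular CAPTCHA product.

```python
import random

NUMBER_WORDS = ["zero", "one", "two", "three", "four",
                "five", "six", "seven", "eight", "nine"]


def make_question(rng):
    """Return (question_text, expected_answer) for a simple addition challenge."""
    a = rng.randint(0, 9)
    b = rng.randint(0, 9)
    # Spell one operand as a word so the question is not a plain arithmetic string.
    return f"What is {a} + {NUMBER_WORDS[b]}?", str(a + b)


def check_answer(expected, submitted):
    """Compare after trimming whitespace; the expected answer stays on the server."""
    return submitted.strip() == expected


question, answer = make_question(random.Random())
print(question)
```

A question this simple is easy for a bot with basic language processing, so it only illustrates the idea.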
How are CAPTCHAs related to AI?
Computers are getting faster and having access to more memory than before. The algorithms that power them are getting better. So tasks that were once difficult for computers are becoming easier. This means CAPTCHAs that were previously unsolvable are becoming solvable by bots.
Advances in optical character recognition (OCR) have enabled bots to solve text-based CAPTCHAs. Image-based CAPTCHAs can be cracked due to advances in image processing, pattern recognition and object recognition. Question-based CAPTCHAs can be solved due to advances in natural language processing (NLP). Audio-based CAPTCHAs too can be solved due to advances in speech-to-text processing. With reCAPTCHA, humans assist machines in solving difficult problems. All these contribute to machines getting smarter every day.
Ultimately, we have to come up with better CAPTCHAs to defeat smarter bots. A CAPTCHA that's secure today may not be so tomorrow. reCAPTCHA's creator Luis von Ahn commented in 2012 that CAPTCHAs could become useless in another ten years.
What techniques are typically used to solve CAPTCHAs?
Let's note that text-based CAPTCHAs are presented as images. To solve them, the first step is to segment the text into individual characters. One way to enhance OCR is to remove noise. Image transformations such as rotating, shifting, mirroring and warping can lead to better OCR. With audio, waveform analysis can help in solving the CAPTCHA. Machine learning that solves the segmentation and recognition problems simultaneously has been shown to give better results.
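The noise removal and segmentation steps can be sketched in plain Python on a tiny binary image (1 for ink, 0 for background). This is a simplified illustration under the assumption that characters are separated by empty columns. The function names are hypothetical, and real CAPTCHAs with touching or overlapping characters would defeat it.

```python
def remove_isolated_pixels(img):
    """Clear any ink pixel that has no ink neighbour (a crude noise filter)."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            if img[y][x]:
                # Count ink in the 3x3 window, excluding the pixel itself.
                neighbours = sum(
                    img[ny][nx]
                    for ny in range(max(0, y - 1), min(h, y + 2))
                    for nx in range(max(0, x - 1), min(w, x + 2))
                ) - 1
                if neighbours == 0:
                    out[y][x] = 0
    return out


def segment_columns(img):
    """Return (start, end) column ranges of ink separated by empty columns."""
    w = len(img[0])
    ink = [any(row[x] for row in img) for x in range(w)]
    segments, start = [], None
    for x, has_ink in enumerate(ink):
        if has_ink and start is None:
            start = x
        elif not has_ink and start is not None:
            segments.append((start, x))
            start = None
    if start is not None:
        segments.append((start, w))
    return segments
```

Each column range returned by `segment_columns` would then be cropped and passed to a character recognizer.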
Some sites don't implement CAPTCHA in a secure manner. They can be vulnerable due to reuse of session ID of a known CAPTCHA image.
Web services have come up to solve CAPTCHAs: Death by CAPTCHA, 2Captcha. Where the economics make sense, some of these use humans to solve the CAPTCHAs. Some web services forward the CAPTCHA to pornographic sites. Visitors need to solve the CAPTCHA before they can view pornographic content. It's been argued that this is not really economical.
Are there guidelines to making good CAPTCHAs?
Some guidelines are worth noting:
- Accessibility: Give users options. Audio CAPTCHAs are suited for the visually impaired. Allow users to request another CAPTCHA if a particularly hard one comes up. With touchscreen interfaces, text-based CAPTCHAs are less suited compared to CAPTCHAs that require clicks or drag-and-drop actions.
- Dynamic generation: Avoid serving CAPTCHAs from a fixed database. CAPTCHAs should be generated dynamically. Avoid using trivial distortions. Avoid reuse of CAPTCHAs.
- Security: Avoid sending the solutions to the client along with the challenge.
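The last two guidelines can be sketched as a server-side store that keeps each answer on the server, hands the client only an opaque token, and lets each challenge be used once. This is a minimal in-memory sketch. The class name `ChallengeStore`, the 120-second lifetime and the method names are assumptions for illustration only.

```python
import secrets
import time


class ChallengeStore:
    """In-memory store; a real deployment would use a shared cache or database."""

    def __init__(self, ttl_seconds=120):
        self.ttl = ttl_seconds
        self._pending = {}  # token -> (answer, expiry time)

    def issue(self, answer, now=None):
        """Register a freshly generated answer and return an opaque token."""
        now = time.time() if now is None else now
        token = secrets.token_urlsafe(16)
        self._pending[token] = (answer, now + self.ttl)
        return token  # only the token and the challenge itself go to the client

    def verify(self, token, submitted, now=None):
        """Check a response. The token is consumed on the first attempt."""
        now = time.time() if now is None else now
        entry = self._pending.pop(token, None)  # pop prevents reuse
        if entry is None:
            return False
        answer, expires = entry
        if now > expires:
            return False
        # Constant-time comparison of the submitted and expected answers.
        return secrets.compare_digest(submitted.encode(), answer.encode())
```

Because `verify` removes the token whether or not the answer is right, a replayed token or a second guess against the same challenge is rejected.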
What are reCAPTCHAs?
reCAPTCHA helps with digitization. Source: Von Ahn et al. 2008, fig. 1.
reCAPTCHA was invented by researchers at Carnegie Mellon University in 2007. It used distorted characters of text. Rather than using just one word, a pair of words was presented to the user. One of these words was known to the CAPTCHA system but the other one was unknown. Such unknown words, rather than being randomly generated, were picked up from old books or articles that were being digitized. Since state-of-the-art OCR systems had difficulty deciphering these words, why not crowdsource this task to humans via CAPTCHA? It was with this idea that reCAPTCHA was born.
Thus reCAPTCHA in its original form helped in digitizing scanned text. In 2008, it was claimed that the system had transcribed 440 million words via 40,000 sites.
reCAPTCHA was acquired by Google in 2009. It evolved into No CAPTCHA reCAPTCHA (aka reCAPTCHA V2) in 2014. In 2017, Google introduced Invisible reCAPTCHA. With these newer versions, the CAPTCHA system tracks user behaviour to determine if the interaction is coming from a bot. This includes mouse movement, scrolling of the page, time taken to submit a form, and many more.
How effective are CAPTCHAs in stopping bots?
One of the early CAPTCHAs, Gimpy, was used by Yahoo. In October 2002, a program was able to solve Gimpy CAPTCHAs. Back in 2005, W3C stated that many CAPTCHA systems could be solved with 88-100% accuracy. At PARC, one researcher was able to crack Asirra CAPTCHA with 7.5% probability. Blogger Jeff Atwood reports that CAPTCHAs of Yahoo, Hotmail and Google were broken in early 2008.
A Stanford team showed in 2010 that their Decaptcha tool could bypass many text-based CAPTCHAs. At a conference in 2012, a program was able to solve Google's audio CAPTCHA 99.1% of the time. In 2013, Vicarious AI claimed to be able to solve CAPTCHAs 90% of the time. Google itself reported that distorted text can be solved by AI with 99.8% accuracy. Using machine learning, researchers in 2014 were able to crack reCAPTCHA with 33% success rate and Baidu at 39%.
This does not mean that CAPTCHAs are useless. It just means that current CAPTCHAs have to be better than what AI is capable of solving.
Are there alternatives to using CAPTCHAs?
The WCAG Working Group of W3C has a list of alternatives to using CAPTCHAs. Another list is by Karl Groves. It's important to note that alternatives may not work all the time since smart bots may find a way, now or in the future, to bypass them.
Honeypots and timestamp analysis are alternatives that have proven effective. Another option is to use an anti-spam service such as Akismet, Mollom and SBlam. It's possible to enforce user verification via emails or text messages sent to their mobiles. While this may defeat bots, it reduces usability for humans. Game-based CAPTCHAs are still CAPTCHAs but they are less annoying and more fun to users. It has been claimed that PlayThru takes on average 10-12 seconds compared to 12 seconds of text-based CAPTCHA.
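Honeypots and timestamp analysis can be sketched together in a few lines of Python. The sketch assumes a hidden form field (here hypothetically named `website`) that humans never see or fill, and a minimum plausible fill time of 3 seconds. Both the field name and the threshold are illustrative assumptions.

```python
MIN_FILL_SECONDS = 3.0  # assumed threshold; tune per form


def looks_automated(form, rendered_at, submitted_at, honeypot_field="website"):
    """Return True if a submission trips the honeypot or arrives too quickly.

    form         -- dict of submitted field names to values
    rendered_at  -- server time (seconds) when the form was served
    submitted_at -- server time (seconds) when the form came back
    """
    # Honeypot: the field is hidden from humans, so any value suggests a bot.
    if form.get(honeypot_field, "").strip():
        return True
    # Timestamp analysis: humans need some time to read and fill a form.
    return (submitted_at - rendered_at) < MIN_FILL_SECONDS
```

The render time should be kept on the server or signed before being sent to the client, otherwise a bot can simply forge it.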
What are the accessibility issues surrounding CAPTCHAs?
A selection of CAPTCHAs hard for humans to solve. Source: Munsell 2012.
In response to smarter bots, CAPTCHAs have gotten more sophisticated. Unfortunately, this makes them harder for humans as well. A study from 2009 found that while CAPTCHAs reduced spam, they also reduced conversion rates. Another study with the Animoto web app showed that conversion rates were better by 33% without CAPTCHAs. A survey from 2010 showed that audio CAPTCHAs are hard for non-native speakers of English. Even with text-based CAPTCHAs, some text can be hard to solve. In 2000, the success rate was 97% but this dropped to 92% in 2012.
It has been claimed that CAPTCHAs ignore issues that senior citizens and the visually impaired face. A study from 2009 with visually impaired users showed that with audio CAPTCHAs the success rate was only 45% and it took users 65 seconds to solve one.
Harry Brignull has even questioned the approach, "Using a CAPTCHA is a way of announcing to the world that you’ve got a spam problem, that you don’t know how to deal with it, and that you’ve decided to offload the frustration of the problem onto your user-base."
Can you name some providers of CAPTCHAs?
Google's reCAPTCHA is the best known. As of March 2017, more than a million websites are using it. Also, 11.2% of the top 10k sites are using it. NoMoreCaptchas is a possible alternative to Invisible reCAPTCHA since it does not require explicit input from users.
PICATCHA is an image-based CAPTCHA that also serves as an advertising medium. Microsoft had a research project named Asirra that used image-based CAPTCHA. Confident CAPTCHA claims a 96% success rate and usage of 50 million verifications per month. Other examples are Ironclad CAPTCHA, PlayThru, NuCAPTCHA and Solve Media. The site captchas.net offers a free service. BotDetect CAPTCHA uses image and sound. JCAPTCHA, implemented in Java, can be downloaded and integrated into websites. Dice Captcha shows pictures of dice.
In my automated tests, how can I test form submissions that have CAPTCHAs?
One approach is to disable CAPTCHAs during testing. Depending on the system, this could be done in code or via a flag in the database. Attempting to automatically solve CAPTCHAs is a fundamentally flawed approach. If test automation can solve them, it really suggests that the CAPTCHA is not strong enough to prevent bots from passing themselves off as humans.
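One way to disable the check in code is a configuration flag that only the test environment sets. The following Python sketch assumes an environment variable named `CAPTCHA_DISABLED` and a hypothetical `handle_signup` handler. Neither belongs to any specific framework.

```python
import os


def captcha_enabled(env=None):
    """CAPTCHA is on unless the (assumed) CAPTCHA_DISABLED variable is set to "1"."""
    env = os.environ if env is None else env
    return env.get("CAPTCHA_DISABLED", "0") != "1"


def handle_signup(form, verify_captcha, env=None):
    """Reject the form if CAPTCHA is enabled and verification fails.

    verify_captcha -- callable taking the user's CAPTCHA response, returning bool
    """
    if captcha_enabled(env) and not verify_captcha(form.get("captcha_response", "")):
        return "rejected"
    return "accepted"
```

The flag defaults to enabled, so forgetting to set it in production leaves the CAPTCHA on.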

Milestones

1996

DEC uses a pixelated image of the US flag as CAPTCHA for gathering opinion polls before the US Presidential Election 1996. This method doesn't work very well. Simple programs can click the flag correctly.

1997

Block diagram showing text-based CAPTCHA generation process. Source: Lillibridge et al. 1998, fig. 3.

Scientists at DEC led by Andrei Broder design a noise-induced text-based CAPTCHA. This is used by AltaVista to prevent bots from adding URLs to the search engine's platform.

2000

Luis von Ahn, Manuel Blum, Nicholas Hopper and John Langford of Carnegie Mellon University come up with a better text-based CAPTCHA. They coin the term CAPTCHA.

2007

reCAPTCHA is invented at Carnegie Mellon University. Later, after reCAPTCHA becomes part of Google, this version is deprecated in May 2016 and shut down in March 2018.

Dec 2008

Inventors of reCAPTCHA introduce the audio version of the same.

Sep 2009

Google acquires reCAPTCHA since this can also assist with digitization as part of Google Books and Google News Archive Search. Also known as reCAPTCHA v1, it's shut down in March 2018.

Dec 2014

Google releases reCAPTCHA v2, also known as No CAPTCHA reCAPTCHA. Users merely have to click a checkbox that says "I'm not a robot".

Mar 2017

Google launches a variant of reCAPTCHA v2 and calls it Invisible reCAPTCHA. This does not require explicit user inputs.

Oct 2018

Different versions of reCAPTCHA for website integration. Source: Google Developers 2019.

Google launches reCAPTCHA v3. This doesn't require users to solve any CAPTCHA and gives greater control to website owners. reCAPTCHA v3 returns a score (1.0 for good interaction, 0.0 for a likely bot). Based on this score, website owners can take appropriate action. In March 2019, it's reported that reCAPTCHA v3 has been partially tricked by an AI program using Reinforcement Learning.
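How a website owner might act on the score can be sketched as follows. The sketch assumes the site has already sent the user's token to Google's verification endpoint and parsed the JSON reply into a dict with `success`, `score` and `action` fields. The thresholds of 0.7 and 0.3 and the three outcome names are illustrative assumptions, not values recommended by the source.

```python
def decide(result, expected_action, allow_at=0.7, challenge_at=0.3):
    """Map a parsed reCAPTCHA v3 verification result to a site-specific outcome.

    result          -- dict parsed from the verification response
    expected_action -- the action name this page passed when requesting the token
    """
    # A failed verification or a token issued for another action is not trusted.
    if not result.get("success") or result.get("action") != expected_action:
        return "block"
    score = result.get("score", 0.0)
    if score >= allow_at:
        return "allow"      # likely a good interaction
    if score >= challenge_at:
        return "challenge"  # uncertain: ask for extra verification, e.g. email
    return "block"          # likely a bot
```

A middle band leaves room for the alternatives discussed earlier, such as email or text message verification, instead of a hard rejection.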

Cite As

Devopedia. 2021. "CAPTCHA." Version 8, June 28. Accessed 2023-11-12. https://devopedia.org/captcha

Contributed by
3 authors

Last updated on
2021-06-28 16:27:27
