Are CAPTCHAs Still Bot-hard? Generalized Visual CAPTCHA Solving with Agentic Vision Language Model

1Shanghai Jiao Tong University 2National University of Singapore 3Tel Aviv University

Introduction

Visual CAPTCHAs, such as reCAPTCHA v2, hCaptcha, and GeeTest, are mainstream online security mechanisms for deterring bots, built on the assumption that their puzzles are AI-hard yet human-friendly. Although many deep learning–based solvers have been designed and trained to crack one specific type of visual CAPTCHA, vendors can cheaply switch to out-of-distribution variants of the same type, or even to entirely new CAPTCHA types. However, the emergence of general-purpose AI models (e.g., ChatGPT) challenges the AI-hard assumption underlying current CAPTCHA practice, potentially compromising the reliability of visual CAPTCHAs.

"... we are now officially in the age beyond CAPTCHAs."

"... the emergence of large language models such as GPT-4 has further complicated the problem of chatbot detection."

"Future research should invest in alternative methods of human verification or machine verification that adapt with the trajectory of LLMs."

In this work, we report Halligan, the first general CAPTCHA-solving solution. Built on a state-of-the-art vision language model (VLM), it can effectively solve unseen visual CAPTCHAs without any adaptation. Our rationale is that almost any CAPTCHA can be reduced to a search problem in which (i) the CAPTCHA question is transformed into an optimization objective and (ii) the CAPTCHA body is transformed into the search space for that objective. With well-designed prompts built on existing VLMs, this transformation generalizes to almost any existing CAPTCHA.
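The reduction can be illustrated with a minimal sketch: a generic solver maximizes an objective over a search space, and each CAPTCHA type only supplies its own objective and space. The slider example and its score function below are hypothetical stand-ins for prompted VLM calls, not Halligan's actual implementation.

```python
def solve_as_search(score_fn, search_space):
    """Generic reduction: pick the candidate that maximizes the objective."""
    return max(search_space, key=score_fn)

# Toy instantiation for a slider CAPTCHA: the objective is a hypothetical
# VLM alignment score that peaks when the puzzle piece's x-offset matches
# the gap (here, at x = 137); the search space is every candidate offset.
vlm_alignment_score = lambda x: -abs(x - 137)

print(solve_as_search(vlm_alignment_score, range(0, 300)))  # -> 137
```

The same skeleton covers click- or selection-style CAPTCHAs by swapping in a subset-of-tiles search space and a per-tile relevance objective.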

Approach

Challenges

To solve arbitrary visual CAPTCHAs, a VLM agent must abstract each puzzle into a generalized representation that retains only the relevant information, so that a consistent problem-solving approach can be applied. In addition, the VLM's natural-language output must be reformulated into a formal language that supports precise, actionable steps.

Figure: Challenges of generalized visual CAPTCHA solving.

Metamodel

We designed a CAPTCHA metamodel that standardizes the structure of frames, elements, and keypoints, enabling the agent to represent any visual CAPTCHA in a consistent format. This allows for efficient storage of solution-related data and precise interactions using formal expressions (e.g., drag frame A, slide element B to C, click on keypoint D).
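The frame/element/keypoint structure can be sketched as a few dataclasses. The class and field names below are illustrative assumptions, not Halligan's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Keypoint:
    x: int
    y: int
    label: str = ""

@dataclass
class Element:
    name: str
    bbox: tuple                 # (x, y, w, h) in frame coordinates
    interactable: bool = True

@dataclass
class Frame:
    name: str
    elements: list = field(default_factory=list)
    keypoints: list = field(default_factory=list)

# A slider CAPTCHA abstracted into the metamodel: one frame A containing a
# draggable element B and a target keypoint C.
frame = Frame("A",
              elements=[Element("B", (10, 80, 40, 40))],
              keypoints=[Keypoint(210, 100, "C")])

# Formal, actionable interaction derived from the abstract model.
action = f"slide element {frame.elements[0].name} to {frame.keypoints[0].label}"
print(action)  # -> slide element B to C
```

Because every CAPTCHA is expressed in this one vocabulary, the solving step can emit the same small set of formal actions (drag, slide, click) regardless of the underlying puzzle.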

Figure: The CAPTCHA metamodel.

Halligan Overview


Figure: Overview of Halligan, which consists of three steps. (1) The Objective Identification step infers the task objective. (2) The CAPTCHA Abstraction step generates an abstract CAPTCHA model of the target CAPTCHA, in which salient entities such as frames and interactable UI elements are represented. (3) The CAPTCHA Solving step formalizes a search problem and generates a solution as a piece of Python code that interacts with and solves the target CAPTCHA.
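The three steps above can be sketched as a pipeline of stubbed functions. Here the VLM calls are replaced with fixed return values and the emitted interaction snippet is illustrative, not Halligan's actual output format.

```python
def identify_objective(screenshot):
    # Step 1 (stub): a prompted VLM would infer the task objective here.
    return "slide the puzzle piece into the gap"

def abstract_captcha(screenshot):
    # Step 2 (stub): salient entities in metamodel form
    # (frames, elements, keypoints), as produced by VLM-guided abstraction.
    return {"frame": "A", "element": "piece", "keypoint": (137, 100)}

def solve(model, objective):
    # Step 3 (stub): the objective conditions the search in the real system;
    # here we directly emit executable interaction code for the found target.
    x, y = model["keypoint"]
    return f"drag('{model['element']}', to=({x}, {y}))"

screenshot = object()  # placeholder for a captured CAPTCHA image
code = solve(abstract_captcha(screenshot), identify_objective(screenshot))
print(code)  # -> drag('piece', to=(137, 100))
```

The emitted string stands in for the Python interaction code that the real system executes against the target CAPTCHA.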

Experiments

Comparison Study

We developed an interactive benchmark of 26 visual CAPTCHAs, totaling 2,600 challenges, to compare Halligan against state-of-the-art CAPTCHA solvers. Halligan achieves a solve rate of 60.7% on the benchmark, on par with the baselines.


Field Study

To evaluate Halligan's ability to generalize to unforeseen visual CAPTCHAs, we drew inspiration from CAPTCHA measurement studies and infiltrated human-driven CAPTCHA farms. Over a 30-day period, Halligan achieved a solve rate of 70.6% on previously unseen visual CAPTCHAs in the wild.


Limitations

Temporal

Since the search space is static once constructed, Halligan and VLM agents in general struggle with CAPTCHAs whose visual state changes over time, independent of user interactions. A temporal metamodel combined with a video-based language model could offer a potential solution.


Figure: A CAPTCHA developed by Lam et al. at the Anthropic x Menlo Ventures Hackathon 2024 (Source).

Domain-Specific

Some visual CAPTCHAs require specialized domain knowledge beyond visual reasoning and pattern matching, making them harder to solve. Domain experts must equip VLM agents with the right tools to parse and reason about these challenges effectively.


Figure: Chess CAPTCHA by lichess.org that requires a chess engine (Source).
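One hedged sketch of "equipping the agent with tools" is a simple tool registry that the agent dispatches to for domain-specific puzzles. The registry pattern and the stubbed chess evaluator below are our illustration, not something described in the paper; a real deployment would back the tool with an actual engine (e.g., Stockfish).

```python
TOOLS = {}

def tool(name):
    """Decorator that registers a domain-specific tool under a name."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

@tool("chess_best_move")
def chess_best_move(fen):
    # Stub: a real tool would hand the position to a chess engine and
    # return its best move; the fixed answer here is for illustration.
    return "e2e4"

def agent_solve(captcha_type, payload):
    # The agent routes puzzles it cannot solve visually to registered tools.
    if captcha_type == "chess":
        return TOOLS["chess_best_move"](payload)
    raise NotImplementedError(f"no tool registered for {captcha_type!r}")

print(agent_solve("chess", "<FEN string of the puzzle position>"))  # -> e2e4
```

The same registry could host OCR, audio transcription, or other specialist tools, keeping the VLM agent's core loop unchanged.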

Related Work

CAPTCHA Measurement Studies

  1. C-FRAME: Characterizing and measuring in-the-wild CAPTCHA attacks.
  2. An Empirical Study & Evaluation of Modern CAPTCHAs.
  3. Gotta CAPTCHA 'Em All: A Survey of 20 Years of the Human-or-Computer Dilemma.
  4. Towards Understanding the Security of Modern Image CAPTCHAs and Underground CAPTCHA-Solving Services.
  5. Re:CAPTCHAs—Understanding CAPTCHA-Solving Services in an Economic Context.

LLM Agents: Visual Programming

  1. ViperGPT: Visual Inference via Python Execution for Reasoning.
  2. Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models.
  3. Executable Code Actions Elicit Better LLM Agents.
  4. Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models.