Are CAPTCHAs Still Bot-hard? Generalized Visual CAPTCHA Solving with Agentic Vision Language Model

Comparison Study

We developed an interactive benchmark of 26 visual CAPTCHAs, totaling 2,600 challenges, to compare Halligan against state-of-the-art CAPTCHA solvers. We selected CAPTCHAs based on availability (offered as paid commercial services), popularity (identified from the top 1 million websites), and AI-completeness (visual puzzles that require AGI for human-level performance).

Results

Halligan solved 1,577 out of 2,600 challenges, a solve rate of 60.7%. As a generalist agent, it can solve every type of CAPTCHA in the benchmark without few-shot examples, pre-training, or fine-tuning. As long as visual CAPTCHAs are converted into abstract representations that VLM agents can both understand and manipulate, they can be solved with reasonable success. The next section explains the failure cases.

Failure Cases

(1) Close: 280 out of 1,023 (27.4%) solutions were nearly correct but fell short: the solution was either just outside the tolerance range or very similar to the ground truth. Examples include a single incorrect character in text-based CAPTCHAs, sliding or dragging solutions that are a few pixels off, and quantification tasks (e.g., object counting) that are off by one.

(2) External Tools: 218 out of 1,023 (21.3%) failures stemmed from limitations in large vision models. Bad outputs from mark() and focus() (i.e., from the object detection model they use) led to faulty reasoning downstream.

(3) Solution Comparison: 488 out of 1,023 (47.7%) failures were due to suboptimal solutions chosen during search, arising from ask(), rank(), and compare(), which use gpt-4o for visual question answering.

(4) Search Objective Construction: 37 out of 1,023 (3.6%) failures were the result of misunderstanding the task instructions.
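
The reported figures can be cross-checked with a few lines of arithmetic. The sketch below is only a sanity check, not part of Halligan: the variable and category names are illustrative, and treating category (4) as the remainder of all failures is an assumption made here. Under that assumption the remainder is 37, which corresponds to the reported 3.6%.

```python
# Figures reported in the Results section.
total_challenges = 2600
solved = 1577

# Every unsolved challenge counts as a failure.
failures = total_challenges - solved

# Failure counts as reported for categories (1) to (3).
counts = {
    "close": 280,
    "external_tools": 218,
    "solution_comparison": 488,
}

# Assumption: category (4) accounts for all remaining failures.
counts["search_objective_construction"] = failures - sum(counts.values())


def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


solve_rate = pct(solved, total_challenges)
shares = {name: pct(count, failures) for name, count in counts.items()}

print(f"solve rate: {solved} / {total_challenges} = {solve_rate}%")
for name, count in counts.items():
    print(f"{name}: {count} / {failures} = {shares[name]}%")
```

Solution Comparison accounts for nearly half of all failures, so the visual question answering step is where improvements would have the largest effect on the overall solve rate.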

Solve Times

After abstracting a visual CAPTCHA and framing it as a search problem, the median time to compose a solution is 4.5 seconds, while the median execution time is 6.3 seconds. The figure below shows a box plot of execution times grouped by CAPTCHA type. Notably, different CAPTCHA types from the same service exhibit high variability (e.g., arkose, geeTest). Solve times also vary within individual CAPTCHA types, with some showing greater variance than others. Halligan's solve times are comparable to those of humans (3.1 to 42 seconds) and ad hoc bots (0.016 to 17.5 seconds).