This page provides detailed documentation of all the prompt templates used in designing
our VLM agents for specific tasks. For visualization and demonstration purposes, we list
all the tools available at each stage. However, in practice, tools are selectively exposed
based on the interactable elements present. For example, the slide() tool will only be available
when a SLIDEABLE interactable entity is detected. See our source code for more details.
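To illustrate how selective exposure might work, the sketch below maps detected interactable types to the action tools that would be surfaced to the agent. The mapping and function names are illustrative assumptions, not the actual implementation; refer to the source code for the real logic.
```python
# Hypothetical sketch of selective tool exposure (illustrative only; not the
# actual implementation). Maps detected interactable types to action tools.
INTERACTABLE_TOOLS = {
    "SLIDEABLE_X": ["slide_x"],
    "SLIDEABLE_Y": ["slide_y"],
    "SWAPPABLE": ["explore"],
    "DRAGGABLE": ["drag"],
    "SELECTABLE": ["select"],
    "INPUTTABLE": ["enter"],
    "CLICKABLE": ["click", "get_all_choices"],
    "POINTABLE": ["point"],
    "NEXT": ["click"],
}

def exposed_tools(detected: set[str]) -> list[str]:
    """Return the action tools to expose for the detected interactable types."""
    tools: set[str] = set()
    for kind in detected:
        tools.update(INTERACTABLE_TOOLS.get(kind, []))
    return sorted(tools)
```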
Input:
For each frame:
Frame image
Frame caption in the format `Frame f`
Prompt:
## Objective
The task image contains n frame(s).
First, provide a one-sentence visual description of each frame.
Second, identify the relationships between frames.
Third, using this information, identify the sequential events and the final visual criteria that lead to solving the task.
Use these Python tools:
describe(frame_id: int, description: str)
relate(frame_id1: int, frame_id2: int = None, relationship: str = '')
objective(description: str)
## Guidelines
1. The output should be a Python block.
2. The third step should be generic enough; it should not reveal the answer.
3. In the third step, all events should begin with atomic verbs like `click`, `slide`, `drag`, `draw` and refer to choices/states rather than objects.
4. Please describe rectangle boxes with hints (e.g., case sensitivity), instructions, or placeholders as input fields.
Output:
An executable Python block that uses the tools to add descriptions and relations to frames and to identify the objective.
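For illustration, a plausible output for a two-frame slider task could look like the following; the frame contents and wording are hypothetical and depend entirely on the task image.
```python
# Hypothetical stage-1 output for a two-frame slider task (illustrative only).
describe(1, "A puzzle image with a missing square piece and a gap outline.")
describe(2, "A horizontal slider track with a draggable handle below the image.")
relate(1, 2, "Sliding the handle in Frame 2 moves the missing piece across Frame 1.")
objective("Slide the handle until the piece aligns with the gap, then submit the answer.")
```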
Input:
For each frame:
Frame image
Frame caption in the format `Frame f`
Prompt:
## Objective
You are given a list of frames, their descriptions, relationships between frames, and the main task objective.
Descriptions:
List of frame descriptions in the format `Frame f: description`.
Relations:
List of frame relations in the format `Frame f1 to f2: relation`.
Objective:
Task Objective.
Write a Python script `def structure_abstraction(frames: list[Frame])` to identify
all element/frame-level interactables using these tools:
Frame.get_element(position: Literal['up', 'down', 'left', 'right'], details: str) -> Element:
Get one specific element by its relative position in the frame and its detailed visual description.
position: where the element is located in the frame
details: color, shape, and visual features of the element
Frame.split(rows: int, columns: int) -> list[Frame]:
Split the entire frame into SELECTABLE choices.
Frame.grid(tiles: int) -> list[list[Element]]:
Convert the frame into multiple SWAPPABLE tile elements.
tiles: the total number of cells (columns * rows).
Frame.set_frame_as(interactable: str) -> None:
POINTABLE: You can point/click on specific thing(s) in an area as an answer
INPUTTABLE: You can type a text answer in this input box as an answer
SELECTABLE: You can click to toggle this frame as an answer choice
NEXT: You can click this frame to submit/skip the task
Element.set_element_as(interactable: str) -> None:
DRAGGABLE: Can be freely moved anywhere
SWAPPABLE: Can manually exchange places with another SWAPPABLE
SLIDEABLE_X: Can be dragged along a horizontal track
SLIDEABLE_Y: Can be dragged along a vertical track
INPUTTABLE: You can type a text answer in this input box as an answer
CLICKABLE: UI button that can be clicked to transition a frame or cycle through choices
NEXT: UI button that can be clicked to submit/skip the task
## Guidelines
1. If a submit or skip button exists, which can be a text or an icon button, mark it as NEXT; otherwise, don't use NEXT.
2. There can be zero or one NEXT, but there must be exactly one of any other interactable type.
3. The dependent frame in a relationship is not interactable.
4. Focus on extracting all interactables rather than providing an answer to the task.
5. Frame and Element are already defined.
Output:
An executable Python block that uses the tools to annotate interactables.
```python
def structure_abstraction(frames: list[Frame]):
...
```
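As a sketch of the expected output, a structure abstraction for the same hypothetical slider task might look like the following; the frame indices, element positions, and descriptions are assumptions for illustration.
```python
# Hypothetical structure abstraction for a two-frame slider task (illustrative only).
def structure_abstraction(frames: list[Frame]):
    # Frame 1 (index 0): the puzzle image observed while sliding; it depends on
    # Frame 2, so it is not marked as interactable.
    # Frame 2 (index 1): the slider track with a handle and a submit button.
    handle = frames[1].get_element(
        position="left",
        details="circular blue handle with a right-arrow icon at the left end of the track",
    )
    handle.set_element_as("SLIDEABLE_X")

    submit = frames[1].get_element(
        position="right",
        details="rounded grey button labeled 'Submit'",
    )
    submit.set_element_as("NEXT")
```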
Input:
For each frame:
Frame image
Frame caption in the format `Frame f (interactable type)`
For each interactable element in frame:
Element image
Element caption in the format `Element f-e (interactable type)`
Prompt:
Solve tasks using your vision, coding, and language skills.
When using code, the user cannot provide any other feedback or perform any other action beyond executing the code you suggest.
The user can't modify your code, so do not suggest incomplete code that requires users to modify it.
Don't use a code block if it's not intended to be executed by the user.
## Objective
Task Objective.
## Frame Relationships
List of frame relations in the format `Frame f1 to f2: relation`.
Compose a Python script solution. Examples:
In-context learning examples of using tools.
You can perform these actions:
click(target: Union[Frame, Element]) -> None:
Click a UI button.
get_all_choices(prev_arrow: Element, next_arrow: Element, observe: Frame) -> list[SelectChoice]:
Cycle through all choices by clicking arrow buttons.
Returns all cycled choices from the frame.
drag(start: Element, end: Point) -> list[DragChoice]:
Drag element from start to end point.
Returns drag_choices (list[DragChoice]) to make minor adjustments at the endpoint.
enter(field: Union[Frame, Element], text: str) -> None:
Click on an input field and enter text.
point(to: Point) -> None:
Click on a point on a frame.
select(choice: Union[Frame, Element]) -> None:
Select a choice.
slide_x(handle: Element, direction: Literal['left', 'right'], observe_frame: Frame) -> list[SlideChoice]:
Drag and move slider handle left/right while observing changes in a frame.
Returns observation (list[SlideChoice]): observations of the frame while sliding.
slide_y(handle: Element, direction: Literal['up', 'down'], observe_frame: Frame) -> list[SlideChoice]:
Drag and move slider handle up/down while observing changes in a frame.
Returns observation (list[SlideChoice]): observations of the frame while sliding.
explore(grid: Frame) -> list[SwapChoice]:
Get all possible ways to swap elements in the grid.
Returns choices (list[SwapChoice]): all possible swaps.
SelectChoice.image -> PIL.Image.Image:
SelectChoice.select -> None:
(For get_all_choices) Select this choice.
DragChoice.preview -> PIL.Image.Image:
(For drag) A preview of the drag-and-drop results.
DragChoice.drop() -> None:
(For drag) Confirm this as the final choice and drop here.
SwapChoice.preview -> PIL.Image.Image:
(For explore) A preview image of the grid after swap.
SwapChoice.grid -> list[list[Element]]:
(For explore) A 2D list previewing the grid of elements after the swap.
Useful for solutions that need to compare elements (e.g., visual identity, position) in the grid.
For example, use compare() to check for identical elements in a row or column.
SwapChoice.swap() -> None:
(For explore) Executes the swap previewed in this choice.
SlideChoice.image -> PIL.Image.Image:
SlideChoice.refine() -> list[SlideChoice]:
(For slide_x, slide_y) Reduce the search space for further analysis by narrowing down to
a subset of choices around this option.
SlideChoice.release() -> None:
(For slide_x, slide_y) Confirm this as the final choice and stop sliding.
You can use these tools in the Python script to help you:
mark(images: list[PIL.Image.Image], object: str) -> list[PIL.Image.Image]:
Annotate object bounding boxes in each image.
Helps answer questions that require counting and finding objects.
focus(image: PIL.Image.Image, description: str) -> list[PIL.Image.Image]:
Zooms in on specific regions of the image that match the description.
Helps answer questions that require detailed visual analysis.
Returns a list of focused regions.
ask(images: list[PIL.Image.Image], question: str, answer_type: str) -> list[Any]:
Ask a question about the visual state of a batch of images.
`answer_type` can be `bool`, `int`, `str`.
Returns answers (list[Any]), a list of `answer_type` outputs for each image.
rank(images: list[PIL.Image.Image], task_objective: str) -> list[int]:
Ranks each image in the `images` list based on the criteria specified in `task_objective`.
Returns image_ids (list[int]), a list of image IDs ordered by descending rank.
compare(images: list[PIL.Image.Image], task_objective: str, reference: PIL.Image.Image = None) -> list[bool]:
Compare each image with the `reference` image and check if it satisfies `task_objective`.
Returns comparison (list[bool]), a list of True/False for each image in `images`.
match(e1: Element, e2: Element) -> bool:
Check if two elements are visually similar or identical.
Works best for grid items.
Frame.show_keypoints(region: Literal['all', 'top', 'bottom', 'left', 'right']) -> PIL.Image.Image:
Annotate keypoints on the frame. Each keypoint has a numeric ID.
Keypoints allow you to work in 2D space (drag-and-drop / drawing / pointing).
Returns image (PIL.Image.Image): the frame image with all keypoints annotated on it.
Frame.get_keypoint(id: int) -> Point:
Get a keypoint in the frame by its ID.
Frame.get_interactable(id: int) -> Element:
Get an interactable element in the frame by its ID.
Frame.image -> PIL.Image.Image:
Get the current visual state of the frame.
Element.image -> PIL.Image.Image:
Get the current visual state of the element.
Point.show_neighbours() -> PIL.Image.Image:
Reduce the search space for further analysis by narrowing down to keypoints surrounding this point.
Point.get_neighbour(id: int) -> Point:
Get a neighbouring keypoint by its ID.
## Guidelines
1. You can provide the answer for math and text challenges without using tools.
2. Ensure that all values (e.g., pixels, colors) are derived through image processing rather than hardcoded as magic values.
3. Compose the solution without referencing the possible answer in the objective (e.g., flowers -> icon).
4. You should implement all placeholders.
Output:
An executable Python block that uses the tools to explore, evaluate, and execute a solution.
```python
def solve(frames: list[Frame]):
...
```
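Continuing the hypothetical slider task, a solution script in the expected format might look like the following; the interactable IDs and the objective wording are assumptions for illustration, not output from the actual agent.
```python
# Hypothetical solution for the slider task (illustrative only).
def solve(frames: list[Frame]):
    # Interactable IDs follow the Element f-e captions provided in the input;
    # the IDs used here are assumed for illustration.
    handle = frames[1].get_interactable(0)   # SLIDEABLE_X handle
    submit = frames[1].get_interactable(1)   # NEXT button

    # Explore: slide the handle right while observing the puzzle frame.
    choices = slide_x(handle, direction="right", observe_frame=frames[0])

    # Evaluate: rank the observed states by how well the piece fills the gap.
    ranked_ids = rank([c.image for c in choices],
                      task_objective="the puzzle piece exactly fills the gap")

    # Execute: release the handle at the best position and submit.
    choices[ranked_ids[0]].release()
    click(submit)
```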