Watson agent-based check evals

Run an agent-based watson check (e.g. contact lenses coherence) on the golden dataset stored on Turing and persist per-subcheck results to Turing.

The eval is parallelized: the flask command enqueues one RQ job per investigation.

Prerequisites

Make sure you have a reasonably fresh Kay on the document_parsing profile:

APP=fr_api flask kay refresh --profile document_parsing

Local doctorai-api running — the agent-based check calls it for LLM inference.

The eval fans out many concurrent investigation jobs, so the default 2 uvicorn workers become the bottleneck. Boot doctorai with more workers via DOCTORAI_WORKERS:

# in doctorai/
DOCTORAI_WORKERS=8 task dev-server

Rule of thumb: match (or slightly exceed) the number of RQ workers you plan to run. Each doctorai worker is a full process with its own RAM footprint, so cap the count at what your machine can hold.

One or more local RQ workers: each investigation is a separate job, so the throughput of the eval is bounded by the number of workers consuming the queue.

Single worker:

APP=fr_api bin/flask rq worker -k

The -k flag enables Kay — required because the agent-based check needs realistic investigation data.

Multiple workers: either open N terminals and run the same command in each, e.g.:

# terminal 1
APP=fr_api bin/flask rq worker -k
# terminal 2
APP=fr_api bin/flask rq worker -k
# ... etc

Or use process-compose / mprocs — see backend/README.md § "How to run several workers" for the supported setups.
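The N-terminals approach above can also be scripted. The following is a minimal sketch, not one of the supported setups: `worker_env`, `spawn_workers` and `stop_workers` are hypothetical helper names, and the script is assumed to run from the directory that contains `bin/flask`.

```python
import os
import subprocess

# Command and environment variable taken from the single-worker example above.
WORKER_ARGV = ["bin/flask", "rq", "worker", "-k"]


def worker_env(base_env=None):
    """Return a copy of the environment with APP=fr_api set."""
    env = dict(os.environ if base_env is None else base_env)
    env["APP"] = "fr_api"
    return env


def spawn_workers(count):
    """Start `count` RQ workers as child processes and return their handles."""
    if count < 1:
        raise ValueError("count must be >= 1")
    env = worker_env()
    return [subprocess.Popen(WORKER_ARGV, env=env) for _ in range(count)]


def stop_workers(processes):
    """Ask every worker to terminate, then wait for each one to exit."""
    for process in processes:
        process.terminate()
    for process in processes:
        process.wait()
```

Calling `workers = spawn_workers(4)` before launching the eval and `stop_workers(workers)` once the queue is drained would mirror four terminals. The children inherit this process's stdout, so their logs interleave.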

Golden dataset row(s) on Turing — at least one fraud.watson_agent_golden_dataset row whose fraud_investigation_id exists in your local Kay DB.

Command reference

Enqueue an agent-based watson check eval, one RQ job per investigation.

Reads investigation IDs from fraud.watson_agent_golden_dataset for --check-name and enqueues one job per investigation. Each job re-runs the check and writes its rows to fraud.watson_agent_local_eval_runs independently.

Default is dry-run (jobs log rows but skip Turing INSERTs); pass --execute to persist.

Example usage:

flask fraud run_eval_watson_check -k --check-name are_contact_lenses_documents_coherent --eval-name alaner_test_20260507 --limit 10 -f --execute

Restrict to specific investigations:

flask fraud run_eval_watson_check -k --check-name are_contact_lenses_documents_coherent --investigation-id <uuid1> --investigation-id <uuid2> -f --execute

Source code in components/fr/internal/fraud_detection/commands/watson_eval.py

@fraud.command()
@command_with_dry_run
@click.option(
    "--check-name",
    required=True,
    type=click.Choice(AutomatedInvestigationCheckFunctionCall.get_values()),
    help="Function-call check to evaluate (must be agent-based).",
)
@click.option(
    "--eval-name",
    required=False,
    default=None,
    type=str,
    help="Override the default eval_run_id ({user}_{check_name}_{ts}).",
)
@click.option(
    "--limit",
    required=False,
    default=None,
    type=int,
    help="Cap the number of investigations to enqueue (smoke testing).",
)
@click.option(
    "--investigation-id",
    "investigation_ids_arg",
    required=False,
    multiple=True,
    type=click.UUID,
    help=(
        "Restrict the run to the given investigation UUID(s). Repeatable. "
        "IDs not present in the golden dataset for --check-name are dropped "
        "with a warning."
    ),
)
@click.option("-f", "--force", help="Skip confirmation", is_flag=True, default=False)
def run_eval_watson_check(
    dry_run: bool,
    check_name: AutomatedInvestigationCheckFunctionCall,
    eval_name: str | None,
    limit: int | None,
    investigation_ids_arg: tuple[UUID, ...],
    force: bool,
) -> None:
    """Enqueue an agent-based watson check eval, one RQ job per investigation.

    Reads investigation IDs from ``fraud.watson_agent_golden_dataset`` for ``--check-name`` and enqueues one job per investigation.
    Each job re-runs the check and writes its rows to ``fraud.watson_agent_local_eval_runs`` independently.

    Default is dry-run (jobs log rows but skip Turing INSERTs); pass
    ``--execute`` to persist.

    Example usage:

    >>> flask fraud run_eval_watson_check -k --check-name are_contact_lenses_documents_coherent --eval-name alaner_test_20260507 --limit 10 -f --execute

    Restrict to specific investigations:

    >>> flask fraud run_eval_watson_check -k --check-name are_contact_lenses_documents_coherent --investigation-id <uuid1> --investigation-id <uuid2> -f --execute

    """
    if get_agent_based_check(check_name) is None:
        raise click.UsageError(
            f"No agent-based check registered for {check_name.value!r}. "
            "This command only handles agent-based checks."
        )

    eval_run_id = build_eval_run_id(check_name, eval_name=eval_name)
    requested_ids = list(investigation_ids_arg) if investigation_ids_arg else None
    investigation_ids = fetch_golden_investigation_ids(
        check_name, limit=limit, investigation_ids=requested_ids
    )

    if requested_ids is not None:
        dropped = sorted(set(requested_ids) - set(investigation_ids))
        if dropped:
            current_logger.warning(
                "Some --investigation-id values are not in the golden dataset",
                check_name=check_name,
                dropped_count=len(dropped),
                dropped_investigation_ids=[str(i) for i in dropped],
            )

    current_logger.info(
        "Preparing watson agent-based check eval",
        eval_run_id=eval_run_id,
        check_name=check_name,
        total_investigations=len(investigation_ids),
        queue=EVAL_QUEUE,
        dry_run=dry_run,
    )

    if not investigation_ids:
        current_logger.info(
            f"No investigations in golden dataset for {check_name}; nothing to do."
        )
        return

    if not force and not click.confirm(
        f"Enqueue {check_name} on {len(investigation_ids)} investigations "
        f"to {EVAL_QUEUE} (eval_run_id={eval_run_id}, dry_run={dry_run})?"
    ):
        return

    queue = current_rq.get_queue(EVAL_QUEUE)
    for investigation_id in investigation_ids:
        queue.enqueue(
            run_watson_agent_based_check_eval_for_investigation,
            eval_run_id=eval_run_id,
            fraud_investigation_id=investigation_id,
            check_name=check_name,
            dry_run=dry_run,
            job_timeout=_JOB_TIMEOUT_SECONDS,
        )

    current_logger.info(
        "Enqueued watson agent-based check eval jobs",
        eval_run_id=eval_run_id,
        queue=EVAL_QUEUE,
        jobs_enqueued=len(investigation_ids),
        dry_run=dry_run,
    )
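The `--investigation-id` handling in the command above comes down to a set difference between the IDs that were requested and the IDs the golden dataset returned. Below is a self-contained sketch of that step. `split_requested_ids` is a hypothetical helper, and the hard-coded `golden` list stands in for whatever `fetch_golden_investigation_ids` returns.

```python
from uuid import UUID


def split_requested_ids(requested, golden):
    """Split requested IDs into (kept, dropped) against the golden dataset.

    `kept` preserves the request order. `dropped` is sorted, mirroring the
    sorted(set(requested_ids) - set(investigation_ids)) expression above.
    """
    golden_set = set(golden)
    kept = [i for i in requested if i in golden_set]
    dropped = sorted(set(requested) - golden_set)
    return kept, dropped


# Made-up UUIDs, for illustration only.
a = UUID("00000000-0000-0000-0000-00000000000a")
b = UUID("00000000-0000-0000-0000-00000000000b")
c = UUID("00000000-0000-0000-0000-00000000000c")

kept, dropped = split_requested_ids(requested=[c, a], golden=[a, b])
print("kept:", [str(i) for i in kept])
print("dropped:", [str(i) for i in dropped])
```

Here `c` is requested but absent from the golden list, so it would be the ID reported in the warning, while `a` is the only one enqueued.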

Running the eval

Default is dry-run (jobs log rows but skip Turing INSERTs); pass --execute to persist.

# Dry-run, 4 investigations, no confirmation prompt
APP=fr_api flask fraud run_eval_watson_check \
    --check-name are_contact_lenses_documents_coherent \
    --limit 4 \
    -f

# Real run, custom eval name
APP=fr_api flask fraud run_eval_watson_check \
    --check-name are_contact_lenses_documents_coherent \
    --eval-name oma_contact_lenses_2026_05_08_pre_prompt_change \
    --execute

If --eval-name is omitted, the eval_run_id defaults to {$USER}_{check}_{utc_timestamp}.
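To make the default concrete, here is one way such an identifier could be assembled. This is a sketch, not the actual `build_eval_run_id`: the function name `default_eval_run_id`, the fallback to the `USER` environment variable, and the timestamp format are all assumptions.

```python
import os
from datetime import datetime, timezone


def default_eval_run_id(check_name, eval_name=None, user=None, now=None):
    """Return eval_name if given, else '{user}_{check_name}_{utc_timestamp}'."""
    if eval_name:
        return eval_name
    user = user or os.environ.get("USER", "unknown")
    now = now or datetime.now(timezone.utc)
    # Assumed timestamp format; the real command may format it differently.
    return f"{user}_{check_name}_{now.strftime('%Y%m%dT%H%M%SZ')}"
```

Passing an explicit, descriptive name (as in the second example above) keeps runs easy to find later; the generated default is mainly a guard against two runs sharing an ID.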

Inspecting results

With --execute, workers write rows to Turing as each job completes, so you can inspect results mid-run:

SELECT *
FROM fraud.watson_agent_local_eval_runs
WHERE eval_run_id = '<the eval_run_id printed by the command>'
ORDER BY fraud_investigation_id, subcheck_name;
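When running this query from a script rather than a SQL console, it is usually safer to bind the eval_run_id as a parameter than to paste it into the string. The sketch below shows the shape of that call. It uses an in-memory SQLite database attached under the name `fraud` as a stand-in for Turing, so the `?` placeholder style, the `fetch_eval_rows` helper, the selected columns, and the sample rows are assumptions for illustration only.

```python
import sqlite3

QUERY = """
SELECT fraud_investigation_id, subcheck_name, summary
FROM fraud.watson_agent_local_eval_runs
WHERE eval_run_id = ?
ORDER BY fraud_investigation_id, subcheck_name
"""


def fetch_eval_rows(connection, eval_run_id):
    """Return the rows of one eval run, with the ID bound as a parameter."""
    return connection.execute(QUERY, (eval_run_id,)).fetchall()


# Stand-in for Turing: an in-memory SQLite database attached as "fraud".
connection = sqlite3.connect(":memory:")
connection.execute("ATTACH DATABASE ':memory:' AS fraud")
connection.execute(
    "CREATE TABLE fraud.watson_agent_local_eval_runs ("
    "eval_run_id TEXT, fraud_investigation_id TEXT, "
    "subcheck_name TEXT, summary TEXT)"
)
# Made-up rows, for illustration only.
connection.executemany(
    "INSERT INTO fraud.watson_agent_local_eval_runs VALUES (?, ?, ?, ?)",
    [
        ("run_a", "inv-2", "subcheck_x", "[EarlyExit] missing docs"),
        ("run_a", "inv-1", "subcheck_y", "explanation text"),
        ("run_b", "inv-1", "subcheck_x", "explanation text"),
    ],
)

rows = fetch_eval_rows(connection, "run_a")
for row in rows:
    print(row)
```

Only the two `run_a` rows come back, ordered by investigation and then subcheck. Against the real warehouse, the connection object and placeholder syntax depend on the driver in use.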

`summary` prefixes

The summary column carries a prefix that identifies the row's nature:

| Prefix | Meaning |
| --- | --- |
| `[EarlyExit] ...` | Preprocessor short-circuited (missing docs, parsing rejected, etc.). |
| `[AgentCallError] ...` | DoctorAI network failure (timeout / HTTP error). Outcome is `NonRelevant`. |
| `[Error] ...` | Runner-side exception during `check.run` (e.g. DB lookup failed). |
| (no prefix) | Agent ran; this is the per-comparison `explanation` from doctorai. |
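When scanning many rows, it can help to bucket them by prefix before reading individual explanations. The sketch below does that for the prefixes listed above. The label strings and the `classify_summary` and `tally` helpers are hypothetical names chosen for this example.

```python
from collections import Counter

# Prefixes documented above, mapped to a short label (labels are made up).
PREFIX_LABELS = {
    "[EarlyExit]": "early_exit",
    "[AgentCallError]": "agent_call_error",
    "[Error]": "runner_error",
}


def classify_summary(summary):
    """Return a label for a row based on the prefix of its summary."""
    for prefix, label in PREFIX_LABELS.items():
        if summary.startswith(prefix):
            return label
    # No known prefix: the agent ran and this is its explanation.
    return "agent_ran"


def tally(summaries):
    """Count rows per label."""
    return Counter(classify_summary(s) for s in summaries)
```

A high `agent_call_error` count typically points at the doctorai side (for example, too few workers for the load) rather than at the check itself, so it is worth looking at before comparing outcomes.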