Watson agent-based check evals¶
Run an agent-based watson check (e.g. contact lenses coherence) on the golden dataset stored on Turing and persist per-subcheck results to Turing.
The eval is parallelized: the flask command enqueues one RQ job per investigation.
Prerequisites¶
- A ~fresh Kay on the `document_parsing` profile.
- A local doctorai-api running — the agent-based check calls it for LLM inference. The eval fans out many concurrent investigation jobs, so the default 2 uvicorn workers will be the bottleneck; boot doctorai with more workers via `DOCTORAI_WORKERS`. Rule of thumb: match (or slightly exceed) the number of RQ workers you plan to run. Each worker is a full process with a full RAM footprint, so cap based on your machine.
- One or more local RQ workers — one job per investigation runs in parallel, so eval throughput is bounded by the number of workers consuming the queue.
  Single worker: run `APP=fr_api bin/flask rq worker -k` in one terminal. The `-k` flag enables Kay — required because the agent-based check needs realistic investigation data.
  Multiple workers: either open N terminals and run the same command in each, e.g.:

  ```shell
  # terminal 1
  APP=fr_api bin/flask rq worker -k

  # terminal 2
  APP=fr_api bin/flask rq worker -k

  # ... etc
  ```

  Or use process-compose / mprocs — see backend/README.md § "How to run several workers" for the supported setups.
- Golden dataset row(s) on Turing — at least one `fraud.watson_agent_golden_dataset` row whose `fraud_investigation_id` exists in your local Kay DB.
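To confirm that last prerequisite, a quick count on Turing can help. This is a sketch: the `check_name` column name is an assumption inferred from the `--check-name` flag below, so verify it against the actual table schema.

```sql
-- Sanity check: golden-dataset rows available for the check under eval.
-- NOTE: the check_name column is assumed from the --check-name flag.
SELECT COUNT(*) AS golden_rows
FROM fraud.watson_agent_golden_dataset
WHERE check_name = 'are_contact_lenses_documents_coherent';
```

If this returns 0, the eval will have nothing to enqueue.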
Command reference¶
Enqueue an agent-based watson check eval, one RQ job per investigation.
Reads investigation IDs from `fraud.watson_agent_golden_dataset` for `--check-name` and enqueues one job per investigation. Each job re-runs the check and writes its rows to `fraud.watson_agent_local_eval_runs` independently.
Default is dry-run (jobs log rows but skip Turing INSERTs); pass `--execute` to persist.
Example usage:
```shell
flask fraud run_eval_watson_check -k --check-name are_contact_lenses_documents_coherent --eval-name alaner_test_20260507 --limit 10 -f --execute
```
Restrict to specific investigations by passing `--investigation-id` (repeatable; `<id>` are placeholders):

```shell
flask fraud run_eval_watson_check -k --check-name are_contact_lenses_documents_coherent --investigation-id <id> --investigation-id <id> -f --execute
```
Source code in `components/fr/internal/fraud_detection/commands/watson_eval.py`
Running the eval¶
Default is dry-run (jobs log rows but skip Turing INSERTs); pass `--execute` to persist.
```shell
# Dry-run, 4 investigations, no confirmation prompt
APP=fr_api flask fraud run_eval_watson_check \
  --check-name are_contact_lenses_documents_coherent \
  --limit 4 \
  -f

# Real run, custom eval name
APP=fr_api flask fraud run_eval_watson_check \
  --check-name are_contact_lenses_documents_coherent \
  --eval-name oma_contact_lenses_2026_05_08_pre_prompt_change \
  --execute
```
`--eval-name` defaults to `{$USER}_{check}_{utc_timestamp}` if omitted.
Inspecting results¶
Workers stream rows to Turing as they complete. Peek mid-run:
```sql
SELECT *
FROM fraud.watson_agent_local_eval_runs
WHERE eval_run_id = '<the eval_run_id printed by the command>'
ORDER BY fraud_investigation_id, subcheck_name;
```
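For a quicker progress check mid-run, counting distinct investigations that have reported at least one row works too (column names as in the full query above):

```sql
-- Progress check: how many investigations have landed rows so far.
SELECT COUNT(DISTINCT fraud_investigation_id) AS investigations_done
FROM fraud.watson_agent_local_eval_runs
WHERE eval_run_id = '<the eval_run_id printed by the command>';
```

Compare against the number of jobs the command said it enqueued.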
Summary prefixes¶
The `summary` column carries a prefix that identifies the row's nature:
| Prefix | Meaning |
|---|---|
| `[EarlyExit] ...` | Preprocessor short-circuited (missing docs, parsing rejected, etc.). |
| `[AgentCallError] ...` | DoctorAI network failure (timeout / HTTP error). Outcome is `NonRelevant`. |
| `[Error] ...` | Runner-side exception during `check.run` (e.g. DB lookup failed). |
| (no prefix) | Agent ran; this is the per-comparison explanation from doctorai. |
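A breakdown of a run across these buckets can be sketched with `LIKE` matches on the prefixes above. The bucket labels here are illustrative, and `GROUP BY 1` assumes a Postgres-style engine:

```sql
-- Rough per-run breakdown by summary prefix (prefixes per the table above).
SELECT
  CASE
    WHEN summary LIKE '[EarlyExit]%' THEN 'EarlyExit'
    WHEN summary LIKE '[AgentCallError]%' THEN 'AgentCallError'
    WHEN summary LIKE '[Error]%' THEN 'Error'
    ELSE 'agent_ran'
  END AS row_kind,
  COUNT(*) AS n
FROM fraud.watson_agent_local_eval_runs
WHERE eval_run_id = '<the eval_run_id printed by the command>'
GROUP BY 1
ORDER BY n DESC;
```

A large `AgentCallError` bucket usually means doctorai was underprovisioned relative to the RQ worker count (see Prerequisites).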