| # | Agent | Type | IC (internal) | EC (external) | RC (retest) |
|---|---|---|---|---|---|
| 1 | Human | Baseline | 0.90 | 0.66 | 0.94 |
| 2 | Human Simulacra | RAG | 0.79 | 0.63 | 0.87 |
| 3 | Li et al. (2025) | Prompting | 0.73 | 0.59 | 0.98 |
| 4 | DeepPersona | Prompting | 0.72 | 0.54 | 0.92 |
| 5 | Character.ai | Commercial | 0.71 | 0.71 | 0.46 |
| 6 | Twin-2K-500 | Prompting | 0.53 | 0.26 | 0.95 |
| 7 | Consistent LLM | Fine-tuned | 0.31 | 0.30 | 0.14 |
| 8 | OpenCharacter | Fine-tuned | 0.16 | 0.15 | 0.14 |
A multi-turn interrogation framework for evaluating persona agent consistency. Applies interrogation methodology to systematically probe LLM-based persona agents through logically chained questions, exposing contradictions in internal, external, and retest consistency.
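The core idea — pairing a base question with a logically entailed follow-up and checking the two answers against each other — can be sketched as follows. This is a minimal illustration, not the framework's actual implementation: the toy agent, the question chain, and the arithmetic entailment check are all hypothetical stand-ins.

```python
def toy_agent(question):
    # Hypothetical persona agent answering from a fixed profile.
    profile = {"birth_year": 1990}
    if question == "What year were you born?":
        return str(profile["birth_year"])
    if question == "How old were you in 2020?":
        return str(2020 - profile["birth_year"])
    return "unknown"

def entailed(base_answer, followup_answer):
    # Toy entailment check: the stated birth year plus the reported
    # age in 2020 must sum to 2020, or the answers contradict.
    try:
        return int(base_answer) + int(followup_answer) == 2020
    except ValueError:
        return False

def internal_consistency(agent, chains):
    # Score = fraction of question chains whose answers do not
    # contradict each other (analogous to the IC column above).
    hits = sum(entailed(agent(base), agent(follow)) for base, follow in chains)
    return hits / len(chains)

chains = [("What year were you born?", "How old were you in 2020?")]
score = internal_consistency(toy_agent, chains)  # 1.0 for this consistent agent
```

A real interrogation would replace the arithmetic check with an LLM- or rule-based contradiction judge and chain many more questions per persona fact.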
An evidence-grounded diagnostic reasoning agent for chest X-rays. Integrates an LLM with clinically grounded diagnostic tools to produce responses based on explicit image-derived evidence such as quantitative measurements, spatial observations, and visual overlays.
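The evidence-grounding pattern — call quantitative and spatial tools first, then compose a response in which every claim cites an explicit piece of image-derived evidence — can be sketched as below. The tool names, threshold, and returned values are illustrative assumptions, not the agent's real toolset.

```python
def measure_cardiothoracic_ratio(image):
    # Stand-in quantitative tool: would compute heart width / thorax width.
    return 0.62

def locate_opacity(image):
    # Stand-in spatial tool: would return the region of a detected opacity.
    return "right lower lobe"

def diagnose(image):
    # Gather explicit evidence before reasoning, so the final response
    # is grounded in measurements and observations, not free-form text.
    ctr = measure_cardiothoracic_ratio(image)
    region = locate_opacity(image)
    evidence = [f"cardiothoracic ratio = {ctr:.2f}", f"opacity in {region}"]
    # Illustrative rule: a CTR above 0.5 is a common cardiomegaly cue.
    finding = "possible cardiomegaly" if ctr > 0.5 else "heart size normal"
    return {"finding": finding, "evidence": evidence}

report = diagnose("chest_xray.png")
```

In the described agent, an LLM would orchestrate such tool calls and attach visual overlays; the dictionary output here simply shows how each finding stays traceable to its evidence.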