Building PEACE-Compliant Interview AI with LangGraph
Architecture walkthrough of a 4-node LangGraph pipeline for forensic interviewing — from fine-tuning Phi-4 to achieving 10/10 PEACE compliance at ~750ms latency.
Forensic interviewing is one of those domains where getting the language wrong has real consequences. Traditional script-based systems force interviewers through rigid question trees — ignoring the context of what a witness just said. I set out to build something better: an LLM-powered agent that dynamically selects interview strategies while maintaining strict PEACE framework compliance.
Why PEACE, and Why It's Hard for LLMs
The PEACE framework (Planning & Preparation, Engage & Explain, Account, Closure, Evaluate) is the gold standard for non-coercive interviewing used by law enforcement worldwide. It requires questions to be open-ended, single WH-form (who/what/when/where/why/how), non-leading, and contextually grounded in what the witness has already said. Out-of-the-box LLMs fail spectacularly at this — they generate compound questions, yes/no prompts, and leading suggestions. In my benchmarks, base Phi-4 scored 0-1 out of 10 for PEACE compliance.
The 4-Node LangGraph Pipeline
I chose LangGraph for its explicit state management and support for cyclic graphs — both critical for interview flows that loop between question types. The pipeline processes each witness response through four nodes in sequence, with the full round-trip completing in about 900ms.
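Before digging into each node, here is a framework-agnostic sketch of how state threads through the four stages. In the real system these are LangGraph nodes over a shared typed state; the function names, state fields, and stub outputs below are illustrative assumptions, not the project's actual code.

```python
# Framework-agnostic sketch of the 4-node flow. In the real system each
# function is a LangGraph node over a shared typed state; all names here
# are illustrative.
from dataclasses import dataclass, field

@dataclass
class InterviewState:
    witness_response: str
    graph_context: dict = field(default_factory=dict)  # Node 1 output
    classification: str = ""                           # Node 2 output
    strategy: str = ""                                 # Node 3 output
    question: str = ""                                 # Node 4 output

def node_extract_context(state):    # Node 1 (~100ms): Neo4j read/write
    state.graph_context = {"open_gaps": ["location_of_vehicle"]}  # stub
    return state

def node_classify(state):           # Node 2 (~1ms): lightweight classifier
    state.classification = "cooperative"  # stub
    return state

def node_select_strategy(state):    # Node 3 (~0ms): rule-based selector
    state.strategy = "probe_anchor"  # stub
    return state

def node_generate_question(state):  # Node 4 (~750ms): fine-tuned Phi-4-mini
    gap = state.graph_context["open_gaps"][0]
    state.question = f"What did you notice about the {gap}?"  # stub
    return state

def run_turn(response: str) -> InterviewState:
    state = InterviewState(witness_response=response)
    for node in (node_extract_context, node_classify,
                 node_select_strategy, node_generate_question):
        state = node(state)
    return state
```

LangGraph adds conditional edges and cycles on top of this linear skeleton, but the state-threading idea is the same.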
Node 1: Neo4j Context Extraction (~100ms)
A Neo4j knowledge graph tracks 7 Critical Information Requirements (CIRs) and 43 Essential Elements of Information (EEIs). Each witness response is parsed, entities extracted, and the graph updated with quality scores from 1 (missing) to 5 (verifiable). This gives the system a living map of what has been covered and where gaps remain.
```cypher
// Quality-driven EEI tracking in Neo4j
MERGE (e:EEI {name: $eei_name, cir: $cir})
SET e.quality = CASE
    WHEN e.quality IS NULL THEN $quality
    WHEN $quality > e.quality THEN $quality
    ELSE e.quality
END
SET e.last_anchor = $anchor_text
RETURN e.quality AS current_quality
```
Node 2: Fast Classification (~1ms)
A lightweight classifier categorizes the witness response — cooperative, evasive, confused, or contradictory. This drives strategy selection downstream. Evasive answers trigger probing techniques; confused answers trigger simplification and rephrasing.
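The post doesn't detail the classifier internals, so here is a minimal keyword-and-length heuristic just to make the four categories concrete; a production classifier would likely be a trained model, and the marker lists below are invented for illustration.

```python
# Toy heuristic for the four response categories. Marker phrases and the
# short-answer threshold are illustrative assumptions, not the real model.
EVASIVE_MARKERS = ("can't remember", "don't know", "not sure", "maybe")
CONFUSED_MARKERS = ("what do you mean", "i don't understand", "confused")
CONTRADICTION_MARKERS = ("actually", "no wait", "i mean")

def classify_response(text: str) -> str:
    t = text.lower()
    if any(m in t for m in CONFUSED_MARKERS):
        return "confused"
    if any(m in t for m in CONTRADICTION_MARKERS):
        return "contradictory"
    # Very short or hedging answers are treated as evasive
    if any(m in t for m in EVASIVE_MARKERS) or len(t.split()) < 4:
        return "evasive"
    return "cooperative"
```

Whatever the implementation, the key property is speed: at ~1ms this node adds essentially nothing to the turn budget.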
Node 3: Strategy Selection (~0ms)
Seven interview strategies map directly from PEACE principles. The selector considers current CIR coverage, response classification, and topic progression rules to choose the optimal next move.
The 7 strategies:
- probe_any — Low-quality response, probe for any detail
- probe_anchor — Has info but no verifiable detail, push for specifics
- probe_specific — Has anchor, request deeper specifics
- next_eei — Current EEI satisfied, move to next element
- next_cir — CIR complete, transition to next category
- refocus — Topic drift detected, redirect to interview scope
- closure — All CIRs adequately covered, wrap up
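The selection logic can be sketched as a small rule function over graph state. The quality thresholds and argument names here are assumptions layered onto the strategy descriptions above, not the project's actual code.

```python
from typing import Optional

def select_strategy(eei_quality: Optional[int], cir_complete: bool,
                    all_cirs_covered: bool, on_topic: bool) -> str:
    """Map graph state to one of the 7 strategies. Thresholds are
    illustrative: 1-2 = low quality, 3 = info but no anchor,
    4 = anchored, 5 = verifiable (EEI satisfied)."""
    if all_cirs_covered:
        return "closure"
    if not on_topic:
        return "refocus"
    if cir_complete:
        return "next_cir"
    if eei_quality is None or eei_quality <= 2:
        return "probe_any"
    if eei_quality == 3:
        return "probe_anchor"
    if eei_quality == 4:
        return "probe_specific"
    return "next_eei"  # quality 5: current element satisfied
```

Because the rules read directly off the knowledge graph, this step costs effectively nothing at runtime, which is how the node stays at ~0ms.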
Node 4: Question Generation (~750ms)
The fine-tuned Phi-4-mini (3.8B) generates the actual question. The prompt includes the selected strategy, knowledge graph context, conversation history, and the specific CIR gap being targeted. A post-processing step validates WH-prefix format, rejects yes/no constructions, and filters repetition.
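A minimal sketch of what that validation step might look like, assuming the rules are exactly as described (WH prefix, no yes/no starters, single WH-form, no repetition); the real filter may differ in its word lists and edge cases.

```python
# Illustrative post-processing validator for generated questions.
# Word lists and the single-WH rule are assumptions based on the
# PEACE constraints described above.
import re

WH_WORDS = ("who", "what", "when", "where", "why", "how")
YES_NO_STARTERS = ("do", "does", "did", "is", "are", "was", "were",
                   "can", "could", "will", "would", "have", "has")

def validate_question(q: str, history: list[str]) -> bool:
    """Accept only single, open-ended WH questions not already asked."""
    q_norm = q.strip().lower()
    first = q_norm.split()[0] if q_norm else ""
    if first in YES_NO_STARTERS:   # reject yes/no constructions
        return False
    if first not in WH_WORDS:      # must lead with a WH word
        return False
    if q.count("?") > 1:           # reject compound questions
        return False
    # Single WH-form: exactly one WH word in the whole question
    wh_count = sum(1 for w in re.findall(r"\b\w+\b", q_norm) if w in WH_WORDS)
    if wh_count > 1:
        return False
    return q_norm not in (h.strip().lower() for h in history)  # no repeats
```

A failed validation would trigger regeneration, so a cheap, strict check like this is worth running before anything reaches the witness.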
Fine-Tuning Phi-4 with LoRA
Why Phi-4-mini? At 3.8B parameters, it's small enough for production deployment on a single GPU with sub-second latency, yet large enough for nuanced language generation. The fine-tune uses LoRA with rank 128, rsLoRA scaling, and alpha 256.
```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=128,
    lora_alpha=256,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    use_rslora=True,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
# Adapter size: 353MB — no base model redistribution needed
```

The 353MB adapter is the entire fine-tune. You load the base Phi-4-mini from Hugging Face and merge the adapter at inference time. This makes deployment and version control straightforward — swap adapters without touching the base model.
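One detail worth making concrete is what `use_rslora` changes. Classic LoRA scales the adapter update by alpha/r, while rsLoRA (rank-stabilized LoRA, per the rsLoRA paper) scales by alpha/√r, which avoids under-scaling at high ranks. The arithmetic for this config:

```python
# Compare classic LoRA vs rsLoRA scaling for r=128, alpha=256.
import math

r, alpha = 128, 256

standard_scale = alpha / r            # classic LoRA scaling factor
rslora_scale = alpha / math.sqrt(r)   # rsLoRA scaling factor

print(standard_scale, round(rslora_scale, 1))  # → 2.0 22.6
```

At rank 128 the difference is an order of magnitude, which is why rsLoRA matters for high-rank fine-tunes like this one.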
Results
The results exceeded expectations. In a standardized 10-turn interview benchmark across multiple scenarios, the fine-tuned model achieved perfect PEACE compliance while the base model consistently failed.
Key metrics:
- PEACE compliance: 10/10 (vs 0-1/10 baseline)
- Average latency: ~750ms per turn
- Full pipeline round-trip: ~900ms
- Strategy accuracy: 94%
- Adapter size: 353MB
Lessons Learned
Three takeaways from this project. First, graph-based context tracking vastly outperforms sliding-window approaches for structured interviews — you need persistent state across the full conversation. Second, fine-tuning small models beats prompting large models when the task domain is well-defined and you have quality training data. Third, LangGraph's explicit state management prevented the "agent chaos" common in fully autonomous systems — every decision point is traceable.
The architecture is designed to be adaptable to other structured conversation domains beyond forensic interviewing — any scenario requiring compliance-driven, context-aware questioning.