Building PEACE-Compliant Interview AI with LangGraph
Architecture walkthrough of a 4-node LangGraph pipeline for forensic interviewing — from fine-tuning Phi-4 to achieving 10/10 PEACE compliance at ~750ms latency.
Forensic interviewing is one of those domains where getting the language wrong has real consequences. Traditional script-based systems force interviewers through rigid question trees — ignoring the context of what a witness just said. I set out to build something better: an LLM-powered agent that dynamically selects interview strategies while maintaining strict PEACE framework compliance.
Why PEACE, and Why It's Hard for LLMs
The PEACE framework (Planning & Preparation, Engage & Explain, Account, Closure, Evaluate) is the gold standard for non-coercive interviewing used by law enforcement worldwide. It requires questions to be open-ended, single WH-form (who/what/when/where/why/how), non-leading, and contextually grounded in what the witness has already said. Out-of-the-box LLMs fail spectacularly at this — they generate compound questions, yes/no prompts, and leading suggestions. In my benchmarks, base Phi-4 scored 0-1 out of 10 for PEACE compliance.
The 4-Node LangGraph Pipeline
I chose LangGraph for its explicit state management and support for cyclic graphs — both critical for interview flows that loop between question types. The pipeline processes each witness response through four nodes in sequence, with the full round-trip completing in about 900ms.
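Before digging into each node, here is a framework-agnostic sketch of how state threads through the four stages. In the real system these are LangGraph nodes over a shared typed state; the function names, state fields, and stub outputs below are illustrative assumptions, not the project's actual code.

```python
# Framework-agnostic sketch of the 4-node flow. In the real system each
# function is a LangGraph node over a shared typed state; all names here
# are illustrative.
from dataclasses import dataclass, field

@dataclass
class InterviewState:
    witness_response: str
    graph_context: dict = field(default_factory=dict)  # Node 1 output
    classification: str = ""                           # Node 2 output
    strategy: str = ""                                 # Node 3 output
    question: str = ""                                 # Node 4 output

def node_extract_context(state):    # Node 1 (~100ms): Neo4j read/write
    state.graph_context = {"open_gaps": ["location_of_vehicle"]}  # stub
    return state

def node_classify(state):           # Node 2 (~1ms): lightweight classifier
    state.classification = "cooperative"  # stub
    return state

def node_select_strategy(state):    # Node 3 (~0ms): rule-based selector
    state.strategy = "probe_anchor"  # stub
    return state

def node_generate_question(state):  # Node 4 (~750ms): fine-tuned Phi-4-mini
    gap = state.graph_context["open_gaps"][0]
    state.question = f"What did you notice about the {gap}?"  # stub
    return state

def run_turn(response: str) -> InterviewState:
    state = InterviewState(witness_response=response)
    for node in (node_extract_context, node_classify,
                 node_select_strategy, node_generate_question):
        state = node(state)
    return state
```

LangGraph adds conditional edges and cycles on top of this linear skeleton, but the state-threading idea is the same.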
Node 1: Neo4j Context Extraction (~100ms)
A Neo4j knowledge graph tracks 7 Critical Information Requirements (CIRs) and 43 Essential Elements of Information (EEIs). Each witness response is parsed, entities extracted, and the graph updated with quality scores from 1 (missing) to 5 (verifiable). This gives the system a living map of what has been covered and where gaps remain.
```cypher
// Quality-driven EEI tracking in Neo4j
MERGE (e:EEI {name: $eei_name, cir: $cir})
SET e.quality = CASE
    WHEN e.quality IS NULL THEN $quality
    WHEN $quality > e.quality THEN $quality
    ELSE e.quality
END
SET e.last_anchor = $anchor_text
RETURN e.quality AS current_quality
```
Node 2: Fast Classification (~1ms)
A lightweight classifier categorizes the witness response — cooperative, evasive, confused, or contradictory. This drives strategy selection downstream. Evasive answers trigger probing techniques; confused answers trigger simplification and rephrasing.
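The post doesn't detail the classifier internals, so here is a minimal keyword-and-length heuristic just to make the four categories concrete; a production classifier would likely be a trained model, and the marker lists below are invented for illustration.

```python
# Toy heuristic for the four response categories. Marker phrases and the
# short-answer threshold are illustrative assumptions, not the real model.
EVASIVE_MARKERS = ("can't remember", "don't know", "not sure", "maybe")
CONFUSED_MARKERS = ("what do you mean", "i don't understand", "confused")
CONTRADICTION_MARKERS = ("actually", "no wait", "i mean")

def classify_response(text: str) -> str:
    t = text.lower()
    if any(m in t for m in CONFUSED_MARKERS):
        return "confused"
    if any(m in t for m in CONTRADICTION_MARKERS):
        return "contradictory"
    # Very short or hedging answers are treated as evasive
    if any(m in t for m in EVASIVE_MARKERS) or len(t.split()) < 4:
        return "evasive"
    return "cooperative"
```

Whatever the implementation, the key property is speed: at ~1ms this node adds essentially nothing to the turn budget.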
Node 3: Strategy Selection (~0ms)
Seven interview strategies map directly from PEACE principles. The selector considers current CIR coverage, response classification, and topic progression rules to choose the optimal next move.
The 7 strategies:
- probe_any — Low-quality response, probe for any detail
- probe_anchor — Has info but no verifiable detail, push for specifics
- probe_specific — Has anchor, request deeper specifics
- next_eei — Current EEI satisfied, move to next element
- next_cir — CIR complete, transition to next category
- refocus — Topic drift detected, redirect to interview scope
- closure — All CIRs adequately covered, wrap up
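The selection logic can be sketched as a small rule function over graph state. The quality thresholds and argument names here are assumptions layered onto the strategy descriptions above, not the project's actual code.

```python
from typing import Optional

def select_strategy(eei_quality: Optional[int], cir_complete: bool,
                    all_cirs_covered: bool, on_topic: bool) -> str:
    """Map graph state to one of the 7 strategies. Thresholds are
    illustrative: 1-2 = low quality, 3 = info but no anchor,
    4 = anchored, 5 = verifiable (EEI satisfied)."""
    if all_cirs_covered:
        return "closure"
    if not on_topic:
        return "refocus"
    if cir_complete:
        return "next_cir"
    if eei_quality is None or eei_quality <= 2:
        return "probe_any"
    if eei_quality == 3:
        return "probe_anchor"
    if eei_quality == 4:
        return "probe_specific"
    return "next_eei"  # quality 5: current element satisfied
```

Because the rules read directly off the knowledge graph, this step costs effectively nothing at runtime, which is how the node stays at ~0ms.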
Node 4: Question Generation (~750ms)
The fine-tuned Phi-4-mini (3.8B) generates the actual question. The prompt includes the selected strategy, knowledge graph context, conversation history, and the specific CIR gap being targeted. A post-processing step validates WH-prefix format, rejects yes/no constructions, and filters repetition.
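A minimal sketch of what that validation step might look like, assuming the rules are exactly as described (WH prefix, no yes/no starters, single WH-form, no repetition); the real filter may differ in its word lists and edge cases.

```python
# Illustrative post-processing validator for generated questions.
# Word lists and the single-WH rule are assumptions based on the
# PEACE constraints described above.
import re

WH_WORDS = ("who", "what", "when", "where", "why", "how")
YES_NO_STARTERS = ("do", "does", "did", "is", "are", "was", "were",
                   "can", "could", "will", "would", "have", "has")

def validate_question(q: str, history: list[str]) -> bool:
    """Accept only single, open-ended WH questions not already asked."""
    q_norm = q.strip().lower()
    first = q_norm.split()[0] if q_norm else ""
    if first in YES_NO_STARTERS:   # reject yes/no constructions
        return False
    if first not in WH_WORDS:      # must lead with a WH word
        return False
    if q.count("?") > 1:           # reject compound questions
        return False
    # Single WH-form: exactly one WH word in the whole question
    wh_count = sum(1 for w in re.findall(r"\b\w+\b", q_norm) if w in WH_WORDS)
    if wh_count > 1:
        return False
    return q_norm not in (h.strip().lower() for h in history)  # no repeats
```

A failed validation would trigger regeneration, so a cheap, strict check like this is worth running before anything reaches the witness.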
Fine-Tuning Phi-4 with LoRA
Why Phi-4-mini? At 3.8B parameters, it's small enough for production deployment on a single GPU with sub-second latency, yet large enough for nuanced language generation. The fine-tune uses LoRA with rank 128, rsLoRA scaling, and alpha 256.
```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=128,
    lora_alpha=256,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    use_rslora=True,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
# Adapter size: 353MB — no base model redistribution needed
```

The 353MB adapter is the entire fine-tune. You load the base Phi-4-mini from Hugging Face and merge the adapter at inference time. This makes deployment and version control straightforward — swap adapters without touching the base model.
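One detail worth making concrete is what `use_rslora` changes. Classic LoRA scales the adapter update by alpha/r, while rsLoRA (rank-stabilized LoRA, per the rsLoRA paper) scales by alpha/√r, which avoids under-scaling at high ranks. The arithmetic for this config:

```python
# Compare classic LoRA vs rsLoRA scaling for r=128, alpha=256.
import math

r, alpha = 128, 256

standard_scale = alpha / r            # classic LoRA scaling factor
rslora_scale = alpha / math.sqrt(r)   # rsLoRA scaling factor

print(standard_scale, round(rslora_scale, 1))  # → 2.0 22.6
```

At rank 128 the difference is an order of magnitude, which is why rsLoRA matters for high-rank fine-tunes like this one.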
Results
The results exceeded expectations. In a standardized 10-turn interview benchmark across multiple scenarios, the fine-tuned model achieved perfect PEACE compliance while the base model consistently failed.
Key metrics:
- PEACE compliance: 10/10 (vs 0-1/10 baseline)
- Average latency: ~750ms per turn
- Full pipeline round-trip: ~900ms
- Strategy accuracy: 94%
- Adapter size: 353MB
Lessons Learned
Three takeaways from this project. First, graph-based context tracking vastly outperforms sliding-window approaches for structured interviews — you need persistent state across the full conversation. Second, fine-tuning small models beats prompting large models when the task domain is well-defined and you have quality training data. Third, LangGraph's explicit state management prevented the "agent chaos" common in fully autonomous systems — every decision point is traceable.
The architecture is designed to be adaptable to other structured conversation domains beyond forensic interviewing — any scenario requiring compliance-driven, context-aware questioning.