B2B AI Voice Agent Collections

Sub-200ms deterministic dialogue for enterprise collections and medical intake — VAD tuning, CRM state sync, and FSM-enforced hallucination prevention in regulated conversational states.


The Latency Budget

Every voice AI system is, at its core, a latency problem. The human ear perceives delays above 400 milliseconds as unnatural pauses. Delays above 800 milliseconds signal a broken or indifferent system — and in B2B collections or medical intake contexts, that perception destroys trust before a single word of the script has been delivered.

The end-to-end pipeline for a voice agent has five sequential stages: voice activity detection, speech-to-text transcription, LLM inference, text-to-speech synthesis, and audio delivery. Each stage consumes time. The architecture question is not how to make any one stage fast — it is how to make all five stages sum to less than 600 milliseconds at p50, while keeping p99 below 1.2 seconds on degraded network conditions.
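As a rough illustration of how such a budget could be checked, the sketch below computes nearest-rank percentiles over end-to-end latency samples and compares them with the 600ms and 1.2s targets stated above. The function and type names are hypothetical, and the samples are assumed to be measured end to end rather than summed from per-stage percentiles, since per-stage p99 values do not add up to the pipeline p99.

```typescript
// Targets quoted in the text above.
const P50_BUDGET_MS = 600;
const P99_BUDGET_MS = 1200;

// Nearest-rank percentile over latency samples in milliseconds.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error("no samples");
  const sorted = samplesMs.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

interface BudgetReport {
  p50Ms: number;
  p99Ms: number;
  withinBudget: boolean;
}

// Hypothetical helper: true only when both percentiles are under budget.
function checkLatencyBudget(samplesMs: number[]): BudgetReport {
  const p50Ms = percentile(samplesMs, 50);
  const p99Ms = percentile(samplesMs, 99);
  return {
    p50Ms,
    p99Ms,
    withinBudget: p50Ms < P50_BUDGET_MS && p99Ms < P99_BUDGET_MS,
  };
}
```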

This is the engineering problem this system was built to solve.

Voice Activity Detection: Precision Over Recall

Voice Activity Detection (VAD) is the first stage and the most consequential to get wrong. A VAD that triggers too eagerly clips the beginning of user speech. A VAD that is too conservative adds latency: it keeps waiting to confirm silence after the speaker has already stopped.

The production configuration uses Silero VAD — a neural network model running at 16kHz with a 32ms chunk size — over the WebRTC VAD alternative. WebRTC VAD is a classical detector built on Gaussian mixture models over sub-band energies; it performs poorly on telephony codecs (G.711 PCMU at 8kHz) and degrades significantly under packet loss. Silero's neural approach is codec-agnostic and reaches 98.4% F1 on the AVA-Speech benchmark even at 8kHz input.

The two tuning parameters that determine system behavior are onset_threshold and offset_threshold.

onset_threshold (default: 0.5) — the probability above which a frame is classified as speech. Lowering this value to 0.35 increases sensitivity, catching whispered utterances and heavily accented speech, at the cost of false activations on hold music, background conversation, and HVAC noise. For enterprise B2B environments where calls originate from office phones, 0.4 is the calibrated value. For consumer mobile contexts, 0.5 is appropriate.

offset_threshold (default: 0.35) — the probability below which a frame is classified as silence. The gap between onset and offset creates hysteresis: once the VAD has decided speech is active, the probability must fall below the lower offset value, not merely below the onset value, before a frame counts toward the end of the utterance. This prevents fragmentation of speech into multiple short segments during natural pauses within a sentence.

Alongside these thresholds, a minimum speech duration of 250ms filters out click artifacts and single phoneme false positives, and a maximum silence duration of 400ms terminates the utterance buffer.
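The interaction between the two thresholds and the two duration limits can be sketched as a small state machine over per-frame speech probabilities. This is an illustrative sketch only: `VadConfig` and `segmentSpeech` are hypothetical names, not part of Silero's API, and the input is assumed to be one probability per 32ms frame.

```typescript
// Hypothetical configuration mirroring the parameters described above.
interface VadConfig {
  onsetThreshold: number;  // probability at or above which speech starts
  offsetThreshold: number; // probability below which a frame counts as silence
  frameMs: number;         // duration of one frame
  minSpeechMs: number;     // shorter segments are discarded as artifacts
  maxSilenceMs: number;    // trailing silence that closes a segment
}

interface Segment {
  startMs: number;
  endMs: number;
}

function segmentSpeech(probs: number[], cfg: VadConfig): Segment[] {
  const segments: Segment[] = [];
  let active = false;
  let startMs = 0;
  let lastSpeechEndMs = 0;
  let silenceMs = 0;

  // Close the current segment, keeping it only if it is long enough.
  const close = (): void => {
    if (lastSpeechEndMs - startMs >= cfg.minSpeechMs) {
      segments.push({ startMs, endMs: lastSpeechEndMs });
    }
    active = false;
    silenceMs = 0;
  };

  for (let i = 0; i < probs.length; i++) {
    const t = i * cfg.frameMs;
    const p = probs[i];
    if (!active) {
      // Idle: only a frame at or above the onset threshold starts speech.
      if (p >= cfg.onsetThreshold) {
        active = true;
        startMs = t;
        lastSpeechEndMs = t + cfg.frameMs;
        silenceMs = 0;
      }
    } else if (p < cfg.offsetThreshold) {
      // Active: accumulate silence until the limit closes the segment.
      silenceMs += cfg.frameMs;
      if (silenceMs >= cfg.maxSilenceMs) close();
    } else {
      // Active and at or above offset: speech continues (hysteresis band).
      silenceMs = 0;
      lastSpeechEndMs = t + cfg.frameMs;
    }
  }
  if (active) close();
  return segments;
}
```

A frame with probability 0.37 shows the hysteresis: with onset 0.4 and offset 0.35 it cannot start a segment, but it keeps an already active segment open.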

Endpointing: Knowing When the Human Is Done

Endpointing — detecting turn completion — is distinct from VAD. VAD determines whether sound contains speech. Endpointing determines whether the speaker has finished their conversational turn. These are different problems with different failure modes.

Acoustic endpointing uses the silence duration after speech ends to infer turn completion. The threshold is calibrated at 700ms for collections contexts, where speakers often pause mid-sentence to recall account numbers, and 400ms for structured intake flows where the dialogue tree expects short, bounded responses. A single threshold applied uniformly across both contexts would either cut off collections callers mid-recall or introduce perceptible latency in intake flows.

Semantic endpointing augments acoustic signals with linguistic completeness detection. A small, fast classifier (a fine-tuned DistilBERT at 66M parameters, latency 18ms on CPU) evaluates whether the current transcript constitutes a complete utterance. This catches syntactically complete sentences uttered with a rising intonation — a pattern that acoustic endpointing interprets as an incomplete turn, waiting for silence that will not come.

The production system runs acoustic and semantic endpointing in parallel. The acoustic endpoint fires if silence exceeds the configured threshold. The semantic endpoint fires if the classifier returns completeness probability > 0.88 AND at least 600ms of speech has been captured. The earlier of the two signals triggers LLM inference.

Under this configuration, false positive rate (cutting off the speaker) is 3.2% and false negative rate (waiting too long after completion) is 1.8% in controlled testing against a corpus of 14,000 transcribed call recordings.
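The parallel endpointing rule described above reduces to a small decision function evaluated on every frame; the first call that returns a positive decision triggers inference. The sketch below uses the thresholds quoted in the text, while the type and function names are hypothetical and the classifier probability is assumed to be supplied by the caller.

```typescript
type FlowKind = "collections" | "intake";

interface EndpointInput {
  flow: FlowKind;
  trailingSilenceMs: number; // silence since the last speech frame
  speechCapturedMs: number;  // speech accumulated in this turn
  completenessProb: number;  // semantic classifier output in [0, 1]
}

type EndpointDecision =
  | { fire: false }
  | { fire: true; source: "acoustic" | "semantic" };

// Values taken from the text: 700ms for collections, 400ms for intake.
const SILENCE_THRESHOLD_MS: Record<FlowKind, number> = {
  collections: 700,
  intake: 400,
};
const COMPLETENESS_MIN = 0.88;
const MIN_SPEECH_MS = 600;

function decideEndpoint(input: EndpointInput): EndpointDecision {
  // Acoustic endpoint: silence exceeds the per-flow threshold.
  if (input.trailingSilenceMs > SILENCE_THRESHOLD_MS[input.flow]) {
    return { fire: true, source: "acoustic" };
  }
  // Semantic endpoint: confident completeness AND enough captured speech.
  if (
    input.completenessProb > COMPLETENESS_MIN &&
    input.speechCapturedMs >= MIN_SPEECH_MS
  ) {
    return { fire: true, source: "semantic" };
  }
  return { fire: false };
}
```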

LLM Selection and Streaming Architecture

The LLM layer in a voice pipeline operates under constraints that do not apply in text-based applications. Time-to-first-token (TTFT) matters more than total throughput because TTS synthesis can begin streaming as soon as the first sentence boundary is available. A model that generates tokens slowly but with low TTFT is preferable to a faster model with a long prefill latency.

The production system uses Claude Haiku as the primary inference model for constrained dialogue states — short, predictable responses where the dialogue tree bounds the solution space. For unconstrained states (objection handling, escalation, nuanced negotiation), Claude Sonnet is invoked with a reduced context window limited to the last 12 conversational turns plus the customer state blob.

Streaming completions from the LLM are piped to a sentence-boundary detector that fires TTS synthesis as each complete sentence is identified. The TTS system (Cartesia Sonic in production, with ElevenLabs as fallback) accepts sentence-level streaming input and begins generating audio for sentence N while sentence N+1 is still being generated by the LLM. Audio chunks are buffered in a jitter buffer (80ms target depth) before delivery to the telephony layer.
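A minimal sketch of the sentence-boundary step follows, assuming text deltas arrive as plain strings. `SentenceChunker` is a hypothetical name, and the boundary rule (terminal punctuation followed by whitespace) is a deliberate simplification: it leaves decimals such as "$1,250.00" intact but would split on abbreviations like "Dr.", which a production detector would need to handle.

```typescript
// Buffers streamed text and releases each sentence as soon as it completes.
class SentenceChunker {
  private buffer = "";

  // Feed one streamed delta; returns the sentences it completed, if any.
  push(delta: string): string[] {
    this.buffer += delta;
    const out: string[] = [];
    // Boundary: '.', '!' or '?' followed by at least one whitespace char.
    const boundary = /[.!?]\s+/;
    let match = boundary.exec(this.buffer);
    while (match !== null) {
      const end = match.index + 1; // keep the punctuation mark
      out.push(this.buffer.slice(0, end).trim());
      this.buffer = this.buffer.slice(match.index + match[0].length);
      match = boundary.exec(this.buffer);
    }
    return out;
  }

  // Call once the stream ends to release any trailing fragment.
  flush(): string[] {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest.length > 0 ? [rest] : [];
  }
}
```

Each string returned by `push` would be handed to TTS immediately, so synthesis of one sentence overlaps with generation of the next.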

The result is a perceived response latency of 190ms at p50 and 340ms at p99 under production load, measured from VAD offset to the first byte of agent TTS audio. The human threshold for perceptible delay is 400ms.

CRM State Synchronization

The defining engineering challenge in multi-session voice agent deployments is not the voice pipeline itself — it is maintaining coherent customer state across calls, agents, and sessions in a way that survives CRM latency, partial failures, and concurrent access.

At call initiation, the orchestration layer fetches the full customer state object from the CRM (Salesforce via the Composite API, or HubSpot via the CRM v3 API) in a single batched request that returns account status, open cases, payment history (last 90 days), previous call dispositions, and any agent-set flags. This fetch is issued in parallel with IVR greeting audio, so it completes before the customer has finished navigating to the agent. P95 CRM fetch latency: 280ms.

The fetched state is normalized into a typed CustomerContext object and injected into the LLM system prompt as a structured JSON block. This is not retrieval-augmented generation — there is no embedding lookup, no semantic search over historical notes. The full context object is passed directly. This is deliberate: in collections and intake flows, precision matters more than token economy. Missing a critical data point because it fell outside the top-k retrieved chunks is an unacceptable failure mode.

interface CustomerContext {
  accountId:          string;
  accountStatus:      'current' | 'delinquent' | 'charged_off' | 'disputed';
  outstandingBalance: number;
  daysPastDue:        number;
  lastPaymentDate:    string;            // ISO 8601
  lastPaymentAmount:  number;
  openCases:          CaseRecord[];
  callHistory:        CallDisposition[]; // Last 10 calls
  agentFlags:         Record<string, string>;
  complianceFlags:    ComplianceFlag[];
}

During the call, state mutations — payment commitments, promise-to-pay dates, dispute flags, call dispositions — are queued in an in-memory buffer attached to the call session. On call end, the buffer is flushed to the CRM via a prioritized write queue with exponential backoff retry. If the CRM write fails after three retries, the mutation is forwarded to a dead-letter queue for manual review and a follow-up system flag is set on the account.

This architecture tolerates CRM downtime gracefully: calls continue and state is captured locally, with eventual consistency provided by the retry queue and, for writes that exhaust their retries, by dead-letter review.
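The flush-and-retry behavior can be sketched as follows. This is an assumption-laden illustration, not the production queue: the `Mutation` shape, the `FlushDeps` interface, and the 500ms base delay are all hypothetical, the CRM write is injected as a function, and prioritization is omitted. It reads "three retries" as one initial attempt plus three retries.

```typescript
interface Mutation {
  accountId: string;
  field: string;
  value: string;
}

// Injected dependencies so the sketch stays independent of any CRM client.
interface FlushDeps {
  write: (m: Mutation) => Promise<void>; // rejects when the CRM write fails
  deadLetter: (m: Mutation) => void;     // hand-off for manual review
  sleep: (ms: number) => Promise<void>;
}

const MAX_RETRIES = 3;
const BASE_DELAY_MS = 500; // assumed value, not stated in the text

// Exponential backoff: 500ms, 1000ms, 2000ms for retries 1, 2, 3.
function backoffDelayMs(retry: number): number {
  return BASE_DELAY_MS * Math.pow(2, retry - 1);
}

async function flushMutations(
  buffer: Mutation[],
  deps: FlushDeps
): Promise<{ written: number; deadLettered: number }> {
  const result = { written: 0, deadLettered: 0 };
  for (const m of buffer) {
    let ok = false;
    // Attempt 0 is the initial write; attempts 1..MAX_RETRIES are retries.
    for (let attempt = 0; attempt <= MAX_RETRIES && !ok; attempt++) {
      if (attempt > 0) await deps.sleep(backoffDelayMs(attempt));
      try {
        await deps.write(m);
        ok = true;
      } catch (err) {
        // Swallow the failure and fall through to the next retry.
      }
    }
    if (ok) {
      result.written++;
    } else {
      deps.deadLetter(m);
      result.deadLettered++;
    }
  }
  return result;
}
```

Because the buffer in this sketch lives in memory, a process crash before the flush would lose the queued mutations; a durable buffer would be needed to close that gap.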

Deterministic Dialogue Trees: Mathematical Hallucination Prevention

The industry default — prompting an LLM with instructions like "do not discuss competitors" or "only offer payment plans within approved ranges" — is not a compliance architecture. It is a hope architecture. LLMs are probabilistic systems. Under adversarial prompting, emotional pressure, or simply unusual phrasing, instruction-following breaks down. In collections or medical intake, that breakdown is a regulatory event.

The production system treats the LLM as a constrained intent classifier, not a dialogue generator, for all regulated conversational states.

The dialogue is modeled as a Finite State Machine (FSM) with an explicit state graph. Each state has a fixed template response — interpolated with customer data — and a fixed set of valid transitions. Transitions are triggered by user intent classifications, not by LLM free-form output.

type DialogueState =
  | 'greeting'
  | 'identity_verification'
  | 'balance_disclosure'
  | 'payment_negotiation'
  | 'promise_to_pay_capture'
  | 'dispute_intake'
  | 'call_close';

interface StateDefinition {
  state:        DialogueState;
  template:     string;             // Interpolated with CustomerContext
  validIntents: Intent[];           // Only these can trigger transitions
  transitions:  Partial<Record<Intent, DialogueState>>; // Keyed by valid intents only
  escalate:     boolean;            // True: route to human agent
}

When the LLM receives user input, it is instructed to output only a JSON object conforming to { intent: Intent, confidence: number, entities: Record<string, string> }. The structured output is validated against a JSON Schema before it is used. If validation fails — if the LLM produces free text, malformed JSON, or an intent not in the valid set for the current state — the system logs the anomaly, increments a compliance counter, and transitions to a safe default state rather than attempting to parse the malformed output.

No scripted agent language is generated by the LLM. The FSM selects the appropriate template string for the current state and interpolates it with CustomerContext values. The LLM sees input, classifies intent, and returns a structured object. Nothing more.
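The validate-then-transition step might look like the sketch below. Several things here are assumptions made for illustration: the intent labels, the two-state excerpt of the graph, the choice of `call_close` as the safe default, and the use of hand-written checks in place of a JSON Schema validator. Entity validation is omitted for brevity.

```typescript
type DialogueState =
  | "greeting" | "identity_verification" | "balance_disclosure"
  | "payment_negotiation" | "promise_to_pay_capture"
  | "dispute_intake" | "call_close";

// Hypothetical intent labels; the source does not enumerate them.
type Intent = "identity_confirmed" | "wants_to_pay" | "disputes_debt" | "ends_call";
const INTENTS: Intent[] = ["identity_confirmed", "wants_to_pay", "disputes_debt", "ends_call"];

function isIntent(x: string): x is Intent {
  return INTENTS.indexOf(x as Intent) !== -1;
}

// Illustrative excerpt of a state graph, not the production one.
const GRAPH: Partial<Record<DialogueState, Partial<Record<Intent, DialogueState>>>> = {
  identity_verification: { identity_confirmed: "balance_disclosure", ends_call: "call_close" },
  balance_disclosure: {
    wants_to_pay: "payment_negotiation",
    disputes_debt: "dispute_intake",
    ends_call: "call_close",
  },
};

const SAFE_STATE: DialogueState = "call_close"; // assumed safe default

interface StepResult {
  next: DialogueState;
  anomaly: string | null; // non-null values would feed the compliance counter
}

function step(current: DialogueState, rawLlmOutput: string): StepResult {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawLlmOutput);
  } catch (err) {
    return { next: SAFE_STATE, anomaly: "malformed_json" };
  }
  if (typeof parsed !== "object" || parsed === null) {
    return { next: SAFE_STATE, anomaly: "schema_violation" };
  }
  const obj = parsed as Record<string, unknown>;
  const intent = obj["intent"];
  const confidence = obj["confidence"];
  if (typeof intent !== "string" || typeof confidence !== "number") {
    return { next: SAFE_STATE, anomaly: "schema_violation" };
  }
  if (!isIntent(intent)) {
    return { next: SAFE_STATE, anomaly: "unknown_intent" };
  }
  const transitions = GRAPH[current] || {};
  const target = transitions[intent];
  if (target === undefined) {
    return { next: SAFE_STATE, anomaly: "intent_not_valid_for_state" };
  }
  return { next: target, anomaly: null };
}

// Agent speech comes only from templates filled with known values.
function render(template: string, values: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (_match: string, key: string) => {
    if (!Object.prototype.hasOwnProperty.call(values, key)) {
      throw new Error("missing template value: " + key);
    }
    return values[key];
  });
}
```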

This design guarantees by construction that:

—No balance figure outside CustomerContext.outstandingBalance can be stated
—No payment arrangement outside approved plan parameters can be offered
—No FDCPA-prohibited language can be generated
—Every call follows the exact same compliance-auditable path for each state

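Hallucinated agent language in this architecture is not a low-probability risk to be mitigated — it is structurally impossible in regulated states. The residual risk is intent misclassification, which can route a call to the wrong state but cannot produce unscripted language.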

Compliance Architecture

The system handles two distinct compliance regimes: FDCPA (Fair Debt Collection Practices Act) for collections contexts, and HIPAA for medical intake contexts.

FDCPA compliance is enforced at the state machine level. Prohibited calling times (before 8am and after 9pm in the debtor's local timezone) are checked at call initiation; the call is rejected if the time window is violated. Mini-Miranda disclosure — "This is an attempt to collect a debt, and any information obtained will be used for that purpose" — is rendered as a fixed template in the balance_disclosure state and cannot be omitted or modified. Call recording consent is captured as a conditional state transition before the dialogue proceeds.
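The calling-time gate can be sketched with the standard `Intl.DateTimeFormat` API, assuming the debtor's IANA timezone is already known from the account record. The function names are hypothetical, and the check works at minute resolution, treating 8:00am and 9:00pm themselves as inside the window.

```typescript
// Minutes since local midnight for an instant in a given IANA timezone.
function localMinutesOfDay(at: Date, timeZone: string): number {
  const text = new Intl.DateTimeFormat("en-US", {
    timeZone,
    hour: "2-digit",
    minute: "2-digit",
    hour12: false,
  }).format(at); // e.g. "20:15"
  const parts = text.split(":");
  // Some runtimes render midnight as "24", hence the modulo.
  const hours = parseInt(parts[0], 10) % 24;
  const minutes = parseInt(parts[1], 10);
  return hours * 60 + minutes;
}

const WINDOW_START_MIN = 8 * 60;  // 8:00am local
const WINDOW_END_MIN = 21 * 60;   // 9:00pm local

// True when a call placed at `at` falls inside the permitted window.
function isWithinCallingWindow(at: Date, debtorTimeZone: string): boolean {
  const m = localMinutesOfDay(at, debtorTimeZone);
  return m >= WINDOW_START_MIN && m <= WINDOW_END_MIN;
}
```

The same instant can be callable in one timezone and blocked in another, which is why the check takes the debtor's timezone rather than the server's.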

HIPAA compliance for medical intake follows the same structural approach. PHI collected during intake (symptom descriptions, medication names, date of birth confirmation) is tagged at entity extraction time and routed through a separate encrypted store with access controls distinct from the main call record. All PHI entities are redacted from the general-purpose call log before it is written to the analytics pipeline.
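The tag-and-redact step could be sketched as below, assuming entity extraction yields character offsets and a PHI flag per entity. `TaggedEntity` and `splitPhi` are hypothetical names; encryption and access control for the separate PHI store are outside the scope of this sketch.

```typescript
interface TaggedEntity {
  type: string;  // e.g. "medication", "date_of_birth"
  start: number; // inclusive character offset into the transcript
  end: number;   // exclusive character offset
  phi: boolean;  // set at entity extraction time
}

interface PhiRecord {
  type: string;
  value: string;
}

interface SplitResult {
  redacted: string;  // safe for the general-purpose call log
  phi: PhiRecord[];  // destined for the separate encrypted store
}

function splitPhi(transcript: string, entities: TaggedEntity[]): SplitResult {
  const spans = entities
    .filter((e) => e.phi)
    .slice()
    .sort((a, b) => a.start - b.start);
  const phi: PhiRecord[] = [];
  let redacted = "";
  let cursor = 0;
  for (const e of spans) {
    // Copy the non-PHI text between the previous span and this one.
    redacted += transcript.slice(cursor, Math.max(cursor, e.start));
    if (e.end > cursor) {
      redacted += "[" + e.type + "]";
      cursor = e.end;
    }
    phi.push({ type: e.type, value: transcript.slice(e.start, e.end) });
  }
  redacted += transcript.slice(cursor);
  return { redacted, phi };
}
```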

Measured Outcomes

Across four enterprise deployments — two B2B collections operations and two healthcare intake applications — over 18 months of production operation:

—p50 end-to-end voice latency: 190ms (VAD offset to first TTS audio byte)
—p99 end-to-end voice latency: 340ms
—Intent classification accuracy: 94.7% on held-out test set
—FSM hallucination events: zero in regulated states across 2.3M calls
—CRM write success rate: 99.94% (remaining 0.06% recovered via retry queue)
—FDCPA compliance audit exceptions: zero across two annual audits
—Right-party contact rate improvement: +31% over prior IVR system (attributed to reduced abandonment from perceived latency)

The architecture does not make a probabilistic system more reliable through better prompting. It makes reliability a property of the system design rather than a property of the model.