6-Layer Pipeline

Every claim passes through a deterministic pipeline before becoming a memory. No raw data is stored without validation.

Overview

The pipeline comprises seven stages in total: the six numbered layers (L0–L5) plus a lightweight pre-filter (L0.5). Each stage is responsible for a specific validation or transformation step.

| Stage | Name | Purpose |
|-------|------|---------|
| L0 | Policy Firewall | Classify claim type, block PII and garbage |
| L0.5 | Sentence Classifier | Filter greetings, questions, commands |
| L1 | Claim Normalizer | Extract (subject, predicate, object) triple |
| L2 | Confidence Scorer | Sigmoid scoring with multi-factor inputs |
| L3 | Conflict Detector | Vector + keyword contradiction search |
| L4 | Embed & Store | Generate 1536-dim vector, write to LanceDB |
| L5 | Tier Router | Assign tier based on confidence thresholds |

L0: Policy Firewall

The first gate classifies the incoming claim into one of several types and applies policy rules:

| Claim Type | Examples | Policy |
|------------|----------|--------|
| identity | "My name is Huy" | Always allow, high priority |
| occupational | "I work at VNG" | Allow, stable decay |
| preference | "I prefer dark mode" | Allow, stable decay |
| behavioral | "I run 5km daily" | Allow, moderate decay |
| temporal | "Meeting tomorrow at 2pm" | Allow, ephemeral decay |
| relational | "My wife is named Lan" | Allow, stable decay |
| pii | Credit card, SSN | Block |
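The policy table above can be sketched as a lookup plus a PII check. This is a minimal illustration: the type names mirror the table, but the `looksLikePii` patterns and the function names are assumptions, not the real rules.

```typescript
// Sketch of the L0 policy firewall. Policies mirror the table above;
// the PII regexes are illustrative placeholders, not the shipped detector.
type Policy = { action: "allow" | "block"; decay?: "stable" | "moderate" | "ephemeral" };

const POLICIES: Record<string, Policy> = {
  identity:     { action: "allow", decay: "stable" },
  occupational: { action: "allow", decay: "stable" },
  preference:   { action: "allow", decay: "stable" },
  behavioral:   { action: "allow", decay: "moderate" },
  temporal:     { action: "allow", decay: "ephemeral" },
  relational:   { action: "allow", decay: "stable" },
  pii:          { action: "block" },
};

// Hypothetical PII detector: credit-card-like digit runs or SSN-shaped numbers.
function looksLikePii(text: string): boolean {
  return /\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b/.test(text) ||
         /\b\d{3}-\d{2}-\d{4}\b/.test(text);
}

function policyFor(claimType: string, text: string): Policy {
  if (looksLikePii(text)) return POLICIES.pii;      // PII always blocks, regardless of type
  return POLICIES[claimType] ?? { action: "block" }; // unknown types fail closed
}
```

Failing closed on unknown claim types matches the "no raw data is stored without validation" guarantee.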

L0.5: Sentence Classifier

A fast regex pre-filter removes non-storable content before expensive LLM processing:

  • Greetings: "xin chào" ("hello"), "hello", "hi there" → DROP
  • Questions: "what is X?", "bao nhiêu?" ("how much?") → DROP (questions are not claims)
  • Commands: "please do X", "hãy làm" ("do it") → DROP
  • Code blocks: Fenced code → DROP (code is not a claim)
  • Assertions: Anything containing factual content → PASS to L1
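A pre-filter along these lines can be expressed as a handful of regexes. The patterns below are illustrative stand-ins for the real set, which the source does not enumerate:

```typescript
// Minimal sketch of the L0.5 pre-filter; patterns are examples, not the shipped rules.
function prefilter(sentence: string): "DROP" | "PASS" {
  const s = sentence.trim().toLowerCase();
  const greetings = /^(hi|hello|hey|xin chào|chào)\b/;                        // greetings
  const questions = /\?\s*$|^\s*(what|who|when|where|why|how|bao nhiêu)\b/;   // questions
  const commands  = /^(please|hãy|let's)\b/;                                  // imperatives
  const codeBlock = /^```/;                                                   // fenced code
  if (greetings.test(s) || questions.test(s) || commands.test(s) || codeBlock.test(s)) {
    return "DROP";
  }
  return "PASS"; // assertions fall through to L1
}
```

Because this runs before any LLM call, it is effectively free and keeps non-claims out of the expensive stages.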

L1: Claim Normalizer

The normalizer transforms raw text into a structured triple. It uses a two-stage approach:

  1. Regex fast-path — handles ~70% of common patterns (e.g., "I am X", "My X is Y", "I like X"). Supports English and Vietnamese.
  2. LLM fallback — for complex or ambiguous sentences, the LLM extracts the triple with kind and decay class assignment.
```typescript
// Output of the L1 normalizer
{
  subject: "user",         // who the claim is about
  predicate: "works at",   // the relationship
  object: "Google",        // the value
  kind: "occupational",    // claim classification
  decayClass: "STABLE"     // how fast the claim decays
}
```
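The regex fast-path can be sketched as an ordered pattern table. The patterns and predicate strings below are illustrative examples of the "I am X" / "My X is Y" family, not the full ~70%-coverage set:

```typescript
// Illustrative L1 regex fast-path; the real normalizer covers many more patterns
// (including Vietnamese forms) and falls back to an LLM extractor on no match.
interface Triple { subject: string; predicate: string; object: string; }

const PATTERNS: Array<[RegExp, string]> = [
  [/^i work at (.+)$/i,  "works at"],
  [/^i am (.+)$/i,       "is"],
  [/^i like (.+)$/i,     "likes"],
  [/^my name is (.+)$/i, "is named"],
];

function fastPath(text: string): Triple | null {
  for (const [re, predicate] of PATTERNS) {
    const m = text.trim().match(re);
    if (m) return { subject: "user", predicate, object: m[1] };
  }
  return null; // no match → hand the sentence to the LLM fallback
}
```

Returning `null` rather than guessing keeps the fast-path conservative; ambiguous sentences always reach the LLM stage.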

L2: Confidence Scorer

Confidence is computed using a sigmoid function with 4 factors:

C = σ(α·S + β·K − γ·F + δ·T)

Where:

  • S (Source weight) — user_explicit: 1.0, agent_inferred: 0.3, group_chat: 0.5
  • K (Corroboration) — number of supporting memories (0–1 normalized)
  • F (Conflict penalty) — number of contradicting memories
  • T (Kind bonus) — identity: 1.0, occupational: 0.9, preference: 0.8, behavioral: 0.7, temporal: 0.4
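The formula translates directly into code. The factor definitions come from the list above, but the weight values α, β, γ, δ are illustrative defaults; the source does not specify them:

```typescript
// Sketch of the L2 scorer: C = σ(α·S + β·K − γ·F + δ·T).
// The weights below are assumed defaults, not documented values.
const ALPHA = 1.0, BETA = 0.5, GAMMA = 0.5, DELTA = 1.0;

const sigmoid = (x: number): number => 1 / (1 + Math.exp(-x));

// S: source weight, K: corroboration (0–1), F: contradicting-memory count, T: kind bonus.
function confidence(S: number, K: number, F: number, T: number): number {
  return sigmoid(ALPHA * S + BETA * K - GAMMA * F + DELTA * T);
}
```

With these weights, a user-stated identity claim with no corroboration or conflicts (S = 1.0, K = 0, F = 0, T = 1.0) scores σ(2) ≈ 0.88, landing it in the WORKING tier; each contradiction pushes the score down.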

L3: Conflict Detector

Searches for contradictions using hybrid matching:

  1. Vector similarity search (cosine, top-5 results with threshold 0.85)
  2. Keyword overlap on subject + predicate
  3. Semantic contradiction detection via LLM comparison

When a conflict is found:

  • The older claim is moved to CHALLENGED tier
  • Both claims get a conflictGroup link
  • System entropy increases
  • If conflictAutoResolve is enabled, the higher-confidence claim wins automatically
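The resolution rules above can be sketched as a small mutation over two claims. The `Claim` shape and the `conflictGroup` naming scheme are assumptions for illustration:

```typescript
// Sketch of L3 conflict handling; Claim fields and the group-id format are assumed.
interface Claim { id: string; tier: string; confidence: number; conflictGroup?: string; }

function resolveConflict(
  older: Claim,
  newer: Claim,
  conflictAutoResolve: boolean
): { winner: Claim } | null {
  const group = `conflict:${older.id}:${newer.id}`;
  older.conflictGroup = group;            // both claims share a conflictGroup link
  newer.conflictGroup = group;
  older.tier = "CHALLENGED";              // the older claim is challenged
  if (!conflictAutoResolve) return null;  // left for manual resolution
  const winner = newer.confidence >= older.confidence ? newer : older;
  const loser  = winner === newer ? older : newer;
  loser.tier = "CHALLENGED";              // the lower-confidence claim stays challenged
  return { winner };
}
```

The winner's own tier is left to L5, which re-routes claims from confidence; the sketch only demotes the loser.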

L4: Embedding & Storage

The claim is embedded using OpenAI's text-embedding-3-small (1536 dimensions) and stored in LanceDB with 36 columns including metadata, confidence history, and linking information.
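The stored record can be sketched as a typed row. The field names below are illustrative and cover only a fraction of the 36 columns; the actual schema and the LanceDB write call are not shown here:

```typescript
// Hedged sketch of an L4 row; field names are assumed, and only a subset
// of the 36 columns is shown.
interface MemoryRecord {
  id: string;
  vector: number[];             // 1536-dim embedding from text-embedding-3-small
  subject: string;
  predicate: string;
  object: string;
  kind: string;
  decayClass: string;
  confidence: number;
  confidenceHistory: number[];  // per-update confidence trail
  conflictGroup?: string;       // set by L3 when a contradiction is found
  tier: string;                 // assigned by L5
  createdAt: string;
}

const example: MemoryRecord = {
  id: "mem-001",
  vector: new Array(1536).fill(0), // placeholder; real values come from the embedding API
  subject: "user",
  predicate: "works at",
  object: "Google",
  kind: "occupational",
  decayClass: "STABLE",
  confidence: 0.88,
  confidenceHistory: [0.88],
  tier: "WORKING",
  createdAt: "2024-01-01T00:00:00Z",
};
```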

L5: Tier Router

The final stage assigns a tier based on confidence thresholds:

| Tier | Range | Auto-Inject? | Behavior |
|------|-------|--------------|----------|
| QUARANTINE | 0.00–0.29 | No | Hidden, awaiting evidence |
| CANDIDATE | 0.30–0.49 | No | Searchable, not injected |
| WORKING | 0.50–0.89 | Yes | Active working memory |
| FACT | 0.90–1.00 | Yes (priority) | Verified, high confidence |
| CHALLENGED | Any | No | In conflict, needs resolution |
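The routing itself is a straight threshold cascade over the table above, with the conflict flag taking precedence over any score:

```typescript
// L5 tier routing from the confidence thresholds in the table above.
function routeTier(confidence: number, inConflict: boolean): string {
  if (inConflict)         return "CHALLENGED"; // conflicts override confidence
  if (confidence >= 0.90) return "FACT";
  if (confidence >= 0.50) return "WORKING";
  if (confidence >= 0.30) return "CANDIDATE";
  return "QUARANTINE";
}
```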