6-Layer Pipeline

Every claim passes through a deterministic pipeline before becoming a memory. No raw data is stored without validation.

Overview

The pipeline comprises seven stages in total: the six numbered layers (L0–L5) plus a lightweight pre-filter (L0.5). Each stage is responsible for a specific validation or transformation step.

| Stage | Name | Purpose |
|-------|------|---------|
| L0 | Policy Firewall | Classify claim type, block PII and garbage |
| L0.5 | Sentence Classifier | Filter greetings, questions, commands |
| L1 | Claim Normalizer | Extract (subject, predicate, object) triple |
| L2 | Confidence Scorer | Sigmoid scoring with multi-factor inputs |
| L3 | Conflict Detector | Vector + keyword contradiction search |
| L4 | Embed & Store | Generate 1536-dim vector, write to LanceDB |
| L5 | Tier Router | Assign tier based on confidence thresholds |

L0: Policy Firewall

The first gate classifies the incoming claim into one of several types and applies policy rules:

| Claim Type | Examples | Policy |
|------------|----------|--------|
| identity | "My name is Huy" | Always allow, high priority |
| occupational | "I work at VNG" | Allow, stable decay |
| preference | "I prefer dark mode" | Allow, stable decay |
| behavioral | "I run 5km daily" | Allow, moderate decay |
| temporal | "Meeting tomorrow at 2pm" | Allow, ephemeral decay |
| relational | "My wife is named Lan" | Allow, stable decay |
| pii | Credit card, SSN | Block |
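The policy table above can be sketched as a lookup plus a PII check. This is a minimal illustration: the type names mirror the table, but the `looksLikePii` patterns and the function names are assumptions, not the real rules.

```typescript
// Sketch of the L0 policy firewall. Policies mirror the table above;
// the PII regexes are illustrative placeholders, not the shipped detector.
type Policy = { action: "allow" | "block"; decay?: "stable" | "moderate" | "ephemeral" };

const POLICIES: Record<string, Policy> = {
  identity:     { action: "allow", decay: "stable" },
  occupational: { action: "allow", decay: "stable" },
  preference:   { action: "allow", decay: "stable" },
  behavioral:   { action: "allow", decay: "moderate" },
  temporal:     { action: "allow", decay: "ephemeral" },
  relational:   { action: "allow", decay: "stable" },
  pii:          { action: "block" },
};

// Hypothetical PII detector: credit-card-like digit runs or SSN-shaped numbers.
function looksLikePii(text: string): boolean {
  return /\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b/.test(text) ||
         /\b\d{3}-\d{2}-\d{4}\b/.test(text);
}

function policyFor(claimType: string, text: string): Policy {
  if (looksLikePii(text)) return POLICIES.pii;      // PII always blocks, regardless of type
  return POLICIES[claimType] ?? { action: "block" }; // unknown types fail closed
}
```

Failing closed on unknown claim types matches the "no raw data is stored without validation" guarantee.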

L0.5: Sentence Classifier

A fast regex pre-filter removes non-storable content before expensive LLM processing:

  • Greetings: "xin chào" ("hello"), "hello", "hi there" → DROP
  • Questions: "what is X?", "bao nhiêu?" ("how much?") → DROP (questions are not claims)
  • Commands: "please do X", "hãy làm" ("do it") → DROP
  • Code blocks: Fenced code → DROP (code is not a claim)
  • Assertions: Anything containing factual content → PASS to L1
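A pre-filter along these lines can be expressed as a handful of regexes. The patterns below are illustrative stand-ins for the real set, which the source does not enumerate:

```typescript
// Minimal sketch of the L0.5 pre-filter; patterns are examples, not the shipped rules.
function prefilter(sentence: string): "DROP" | "PASS" {
  const s = sentence.trim().toLowerCase();
  const greetings = /^(hi|hello|hey|xin chào|chào)\b/;                        // greetings
  const questions = /\?\s*$|^\s*(what|who|when|where|why|how|bao nhiêu)\b/;   // questions
  const commands  = /^(please|hãy|let's)\b/;                                  // imperatives
  const codeBlock = /^```/;                                                   // fenced code
  if (greetings.test(s) || questions.test(s) || commands.test(s) || codeBlock.test(s)) {
    return "DROP";
  }
  return "PASS"; // assertions fall through to L1
}
```

Because this runs before any LLM call, it is effectively free and keeps non-claims out of the expensive stages.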

L1: Claim Normalizer

The normalizer transforms raw text into a structured triple. It uses a two-stage approach:

  1. Regex fast-path — handles ~70% of common patterns (e.g., "I am X", "My X is Y", "I like X"). Supports English and Vietnamese.
  2. LLM fallback — for complex or ambiguous sentences, the LLM extracts the triple with kind and decay class assignment.
```typescript
// Output of the L1 normalizer
{
  subject: "user",         // who the claim is about
  predicate: "works at",   // the relationship
  object: "Google",        // the value
  kind: "occupational",    // claim classification
  decayClass: "STABLE"     // how fast the claim decays
}
```
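The regex fast-path can be sketched as an ordered pattern table. The patterns and predicate strings below are illustrative examples of the "I am X" / "My X is Y" family, not the full ~70%-coverage set:

```typescript
// Illustrative L1 regex fast-path; the real normalizer covers many more patterns
// (including Vietnamese forms) and falls back to an LLM extractor on no match.
interface Triple { subject: string; predicate: string; object: string; }

const PATTERNS: Array<[RegExp, string]> = [
  [/^i work at (.+)$/i,  "works at"],
  [/^i am (.+)$/i,       "is"],
  [/^i like (.+)$/i,     "likes"],
  [/^my name is (.+)$/i, "is named"],
];

function fastPath(text: string): Triple | null {
  for (const [re, predicate] of PATTERNS) {
    const m = text.trim().match(re);
    if (m) return { subject: "user", predicate, object: m[1] };
  }
  return null; // no match → hand the sentence to the LLM fallback
}
```

Returning `null` rather than guessing keeps the fast-path conservative; ambiguous sentences always reach the LLM stage.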

L2: Confidence Scorer

Confidence is computed using a sigmoid function with 4 factors:

C = σ(α·S + β·K − γ·F + δ·T)

Where:

  • S (Source weight) — user_explicit: 1.0, agent_inferred: 0.3, group_chat: 0.5
  • K (Corroboration) — number of supporting memories (0–1 normalized)
  • F (Conflict penalty) — number of contradicting memories
  • T (Kind bonus) — identity: 1.0, occupational: 0.9, preference: 0.8, behavioral: 0.7, temporal: 0.4
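The formula translates directly into code. The factor definitions come from the list above, but the weight values α, β, γ, δ are illustrative defaults; the source does not specify them:

```typescript
// Sketch of the L2 scorer: C = σ(α·S + β·K − γ·F + δ·T).
// The weights below are assumed defaults, not documented values.
const ALPHA = 1.0, BETA = 0.5, GAMMA = 0.5, DELTA = 1.0;

const sigmoid = (x: number): number => 1 / (1 + Math.exp(-x));

// S: source weight, K: corroboration (0–1), F: contradicting-memory count, T: kind bonus.
function confidence(S: number, K: number, F: number, T: number): number {
  return sigmoid(ALPHA * S + BETA * K - GAMMA * F + DELTA * T);
}
```

With these weights, a user-stated identity claim with no corroboration or conflicts (S = 1.0, K = 0, F = 0, T = 1.0) scores σ(2) ≈ 0.88, landing it in the WORKING tier; each contradiction pushes the score down.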

L3: Conflict Detector

Searches for contradictions using hybrid matching:

  1. Vector similarity search (cosine, top-5 results with threshold 0.85)
  2. Keyword overlap on subject + predicate
  3. Semantic contradiction detection via LLM comparison

When a conflict is found:

  • The older claim is moved to CHALLENGED tier
  • Both claims get a conflictGroup link
  • System entropy increases
  • If conflictAutoResolve is enabled, the higher-confidence claim wins automatically
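The resolution rules above can be sketched as a small mutation over two claims. The `Claim` shape and the `conflictGroup` naming scheme are assumptions for illustration:

```typescript
// Sketch of L3 conflict handling; Claim fields and the group-id format are assumed.
interface Claim { id: string; tier: string; confidence: number; conflictGroup?: string; }

function resolveConflict(
  older: Claim,
  newer: Claim,
  conflictAutoResolve: boolean
): { winner: Claim } | null {
  const group = `conflict:${older.id}:${newer.id}`;
  older.conflictGroup = group;            // both claims share a conflictGroup link
  newer.conflictGroup = group;
  older.tier = "CHALLENGED";              // the older claim is challenged
  if (!conflictAutoResolve) return null;  // left for manual resolution
  const winner = newer.confidence >= older.confidence ? newer : older;
  const loser  = winner === newer ? older : newer;
  loser.tier = "CHALLENGED";              // the lower-confidence claim stays challenged
  return { winner };
}
```

The winner's own tier is left to L5, which re-routes claims from confidence; the sketch only demotes the loser.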

L4: Embedding & Storage

The claim is embedded using OpenAI's text-embedding-3-small (1536 dimensions) and stored in LanceDB with 36 columns including metadata, confidence history, and linking information.
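The stored record can be sketched as a typed row. The field names below are illustrative and cover only a fraction of the 36 columns; the actual schema and the LanceDB write call are not shown here:

```typescript
// Hedged sketch of an L4 row; field names are assumed, and only a subset
// of the 36 columns is shown.
interface MemoryRecord {
  id: string;
  vector: number[];             // 1536-dim embedding from text-embedding-3-small
  subject: string;
  predicate: string;
  object: string;
  kind: string;
  decayClass: string;
  confidence: number;
  confidenceHistory: number[];  // per-update confidence trail
  conflictGroup?: string;       // set by L3 when a contradiction is found
  tier: string;                 // assigned by L5
  createdAt: string;
}

const example: MemoryRecord = {
  id: "mem-001",
  vector: new Array(1536).fill(0), // placeholder; real values come from the embedding API
  subject: "user",
  predicate: "works at",
  object: "Google",
  kind: "occupational",
  decayClass: "STABLE",
  confidence: 0.88,
  confidenceHistory: [0.88],
  tier: "WORKING",
  createdAt: "2024-01-01T00:00:00Z",
};
```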

L5: Tier Router

The final stage assigns a tier based on confidence thresholds:

| Tier | Range | Auto-Inject? | Behavior |
|------|-------|--------------|----------|
| QUARANTINE | 0.00–0.29 | No | Hidden, awaiting evidence |
| CANDIDATE | 0.30–0.49 | No | Searchable, not injected |
| WORKING | 0.50–0.89 | Yes | Active working memory |
| FACT | 0.90–1.00 | Yes (priority) | Verified, high confidence |
| CHALLENGED | Any | No | In conflict, needs resolution |
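The routing itself is a straight threshold cascade over the table above, with the conflict flag taking precedence over any score:

```typescript
// L5 tier routing from the confidence thresholds in the table above.
function routeTier(confidence: number, inConflict: boolean): string {
  if (inConflict)         return "CHALLENGED"; // conflicts override confidence
  if (confidence >= 0.90) return "FACT";
  if (confidence >= 0.50) return "WORKING";
  if (confidence >= 0.30) return "CANDIDATE";
  return "QUARANTINE";
}
```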