Agentic Memory
Executive Summary
This document describes a Proof of Concept for Agentic Memory in the Semantic Router. Agentic Memory enables AI agents to remember information across sessions, providing continuity and personalization.
⚠️ POC Scope: This is a proof of concept, not a production design. The goal is to validate the core memory flow (retrieve → inject → extract → store) with acceptable accuracy. Production hardening (error handling, scaling, monitoring) is out of scope.
Core Capabilities
| Capability | Description |
|---|---|
| Memory Retrieval | Embedding-based search with simple pre-filtering |
| Memory Saving | LLM-based extraction of facts and procedures |
| Cross-Session Persistence | Memories stored in Milvus (survives restarts; production backup/HA not tested) |
| User Isolation | Memories scoped per user_id (see note below) |
⚠️ User Isolation - Milvus Performance Note:
| Approach | POC | Production (10K+ users) |
|---|---|---|
| Simple filter | ✅ Filter by user_id after search | ❌ Degrades: searches all users, then filters |
| Partition Key | ❌ Overkill | ✅ Physical separation, O(log N) per user |
| Scalar Index | ❌ Overkill | ✅ Index on user_id for fast filtering |
POC: Uses simple metadata filtering (sufficient for testing).
Production: Configure user_id as a Partition Key or Scalar Indexed Field in the Milvus schema.
Key Design Principles
- Simple pre-filter decides if query should search memory
- Context window from history for query disambiguation
- LLM extracts facts and classifies type when saving
- Threshold-based filtering on search results
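Taken together, these principles form a single loop per turn: retrieve, inject, extract, store. The toy sketch below shows only the shape of that loop; `memoryFlow`, its map-backed store, and the stubbed extraction step are hypothetical stand-ins, not router code.

```go
package main

import "fmt"

// memoryFlow is a hypothetical stand-in for the router's memory path.
// A map replaces Milvus and the LLM extraction step is stubbed: the point
// is the retrieve → inject → extract → store ordering, nothing more.
type memoryFlow struct {
	memories map[string][]string // user_id → stored facts
}

func (m *memoryFlow) handleTurn(userID, query, extractedFact string) string {
	// 1. Retrieve: assume the pre-filter already decided the query is memory-relevant.
	relevant := m.memories[userID]
	// 2. Inject: retrieved facts are prepended to the LLM context.
	llmContext := fmt.Sprintf("User context: %v | Query: %s", relevant, query)
	// 3. Extract (LLM-based in the real flow, stubbed here) and 4. Store.
	m.memories[userID] = append(m.memories[userID], extractedFact)
	return llmContext
}

func main() {
	flow := &memoryFlow{memories: map[string][]string{}}
	// Session A: a fact is extracted and stored.
	flow.handleTurn("u1", "planning a trip", "budget for Hawaii is $10K")
	// Session B: the stored fact is retrieved and injected.
	fmt.Println(flow.handleTurn("u1", "what's my budget?", "answered from memory"))
}
```

The later sections replace each stub with the real mechanism: embedding search for step 1, system-message injection for step 2, and async LLM extraction for steps 3-4.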
Explicit Assumptions (POC)
| Assumption | Implication | Risk if Wrong |
|---|---|---|
| LLM extraction is reasonably accurate | Some incorrect facts may be stored | Memory contamination (fixable via Forget API) |
| 0.6 similarity threshold is a starting point | May need tuning (miss relevant or include irrelevant) | Adjustable based on retrieval quality logs |
| Milvus is available and configured | Feature disabled if down | Graceful degradation (no crash) |
| Embedding model produces 384-dim vectors | Must match Milvus schema | Startup failure (detectable) |
| History available via Response API chain | Required for context | Skip memory if unavailable |
Table of Contents
- Problem Statement
- Architecture Overview
- Memory Types
- Pipeline Integration
- Memory Retrieval
- Memory Saving
- Memory Operations
- Data Structures
- API Extension
- Configuration
- Failure Modes and Fallbacks
- Success Criteria
- Implementation Plan
- Future Enhancements
1. Problem Statement
Current State
The Response API provides conversation chaining via previous_response_id, but knowledge is lost across sessions:
Session A (March 15):
User: "My budget for the Hawaii trip is $10,000"
→ Saved in session chain
Session B (March 20) - NEW SESSION:
User: "What's my budget for the trip?"
→ No previous_response_id → Knowledge LOST ❌
Desired State
With Agentic Memory:
Session A (March 15):
User: "My budget for the Hawaii trip is $10,000"
→ Extracted and saved to Milvus
Session B (March 20) - NEW SESSION:
User: "What's my budget for the trip?"
→ Pre-filter: memory-relevant ✓
→ Search Milvus → Found: "budget for Hawaii is $10K"
→ Inject into LLM context
→ Assistant: "Your budget for the Hawaii trip is $10,000!" ✅
2. Architecture Overview
┌─────────────────────────────────────────────────────────────────────────┐
│ AGENTIC MEMORY ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ExtProc Pipeline │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ Request → Fact? → Tool? → Security → Cache → MEMORY → LLM │ │
│ │ │ │ ↑↓ │ │
│ │ └───────┴──── signals used ────────┘ │ │
│ │ │ │
│ │ Response ← [extract & store] ←─────────────────┘ │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌─────────────────────┴─────────────────────┐ │
│ │ │ │
│ ┌─────────▼─────────┐ ┌────────────▼───┐ │
│ │ Memory Retrieval │ │ Memory Saving │ │
│ │ (request phase) │ │(response phase)│ │
│ ├───────────────────┤ ├────────────────┤ │
│ │ 1. Check signals │ │ 1. LLM extract │ │
│ │ (Fact? Tool?) │ │ 2. Classify │ │
│ │ 2. Build context │ │ 3. Deduplicate │ │
│ │ 3. Milvus search │ │ 4. Store │ │
│ │ 4. Inject to LLM │ │ │ │
│ └─────────┬─────────┘ └────────┬───────┘ │
│ │ │ │
│ │ ┌──────────────┐ │ │
│ └────────►│ Milvus │◄─────────────┘ │
│ └──────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Component Responsibilities
| Component | Responsibility | Location |
|---|---|---|
| Memory Filter | Decision + search + inject | pkg/extproc/req_filter_memory.go |
| Memory Extractor | LLM-based fact extraction | pkg/memory/extractor.go (new) |
| Memory Store | Storage interface | pkg/memory/store.go |
| Milvus Store | Vector database backend | pkg/memory/milvus_store.go |
| Existing Classifiers | Fact/Tool signals (reused) | pkg/extproc/processor_req_body.go |
Storage Architecture
Issue #808 suggests a multi-layer storage architecture. We implement this incrementally:
┌─────────────────────────────────────────────────────────────────────────┐
│ STORAGE ARCHITECTURE (Phased) │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ PHASE 1 (MVP) │ │
│ │ ┌─────────────────────────────────────────────────────────┐ │ │
│ │ │ Milvus (Vector Index) │ │ │
│ │ │ • Semantic search over memories │ │ │
│ │ │ • Embedding storage │ │ │
│ │ │ • Content + metadata │ │ │
│ │ └──── ─────────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ PHASE 2 (Performance) │ │
│ │ ┌─────────────────────────────────────────────────────────┐ │ │
│ │ │ Redis (Hot Cache) │ │ │
│ │ │ • Fast metadata lookup │ │ │
│ │ │ • Recently accessed memories │ │ │
│ │ │ • TTL/expiration support │ │ │
│ │ └─────────────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ PHASE 3+ (If Needed) │ │
│ │ ┌───────────────────────┐ ┌───────────────────────┐ │ │
│ │ │ Graph Store (Neo4j) │ │ Time-Series Index │ │ │
│ │ │ • Memory links │ │ • Temporal queries │ │ │
│ │ │ • Relationships │ │ • Decay scoring │ │ │
│ │ └───────────────────────┘ └───────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
| Layer | Purpose | When Needed | Status |
|---|---|---|---|
| Milvus | Semantic vector search | Core functionality | ✅ MVP |
| Redis | Hot cache, fast access, TTL | Performance optimization | 🔶 Phase 2 |
| Graph (Neo4j) | Memory relationships | Multi-hop reasoning queries | ⚪ If needed |
| Time-Series | Temporal queries, decay | Importance scoring by time | ⚪ If needed |
Design Decision: We start with Milvus only. Additional layers are added based on demonstrated need, not speculation. The `Store` interface abstracts storage, allowing backends to be added without changing retrieval/saving logic.
3. Memory Types
| Type | Purpose | Example | Status |
|---|---|---|---|
| Semantic | Facts, preferences, knowledge | "User's budget for Hawaii is $10,000" | ✅ MVP |
| Procedural | How-to, steps, processes | "To deploy payment-service: run npm build, then docker push" | ✅ MVP |
| Episodic | Session summaries, past events | "On Dec 29 2024, user planned Hawaii vacation with $10K budget" | ⚠️ MVP (limited) |
| Reflective | Self-analysis, lessons learned | "Previous budget response was incomplete - user prefers detailed breakdowns" | 🔮 Future |
⚠️ Episodic Memory (MVP Limitation): Session-end detection is not implemented. Episodic memories are only created when the LLM extraction explicitly produces a summary-style output. Reliable session-end triggers are deferred to Phase 2.
🔮 Reflective Memory: Self-analysis and lessons learned. Not in scope for this POC. See Appendix A.
Memory Vector Space
Memories cluster by content/topic, not by type. Type is metadata:
┌────────────────────────────────────────────────────────────────────────┐
│ MEMORY VECTOR SPACE │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ BUDGET/MONEY │ │ DEPLOYMENT │ │
│ │ CLUSTER │ │ CLUSTER │ │
│ │ │ │ │ │
│ │ ● budget=$10K │ │ ● npm build │ │
│ │ (semantic) │ │ (procedural) │ │
│ │ ● cost=$5K │ │ ● docker push │ │
│ │ (semantic) │ │ (procedural) │ │
│ └─────────────────┘ └─────────────────┘ │
│ │
│ ● = memory with type as metadata │
│ Query matches content → type comes from matched memory │
│ │
└────────────────────────────────────────────────────────────────────────┘
Response API vs. Agentic Memory: When Does Memory Add Value?
Critical Distinction: Response API already sends full conversation history to the LLM when previous_response_id is present. Agentic Memory's value is for cross-session context.
┌─────────────────────────────────────────────────────────────────────────┐
│ RESPONSE API vs. AGENTIC MEMORY: CONTEXT SOURCES │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ SAME SESSION (has previous_response_id): │
│ ───────────────────────────────────────── │
│ Response API provides: │
│ └── Full conversation chain (all turns) → sent to LLM │
│ │
│ Agentic Memory: │
│ └── STILL VALUABLE - current session may not have the answer │
│ └── Example: 100 turns planning vacation, but budget never said │
│ └── Days ago: "I have 10K spare, is that enough for a week in │
│ Thailand?" → LLM extracts: "User has $10K budget for trip" │
│ └── Now: "What's my budget?" → answer in memory, not this chain │
│ │
│ NEW SESSION (no previous_response_id): │
│ ────────────────────────────────────── │
│ Response API provides: │
│ └── Nothing (no chain to follow) │
│ │
│ Agentic Memory: │
│ └── ADDS VALUE - retrieves cross-session context │
│ └── "What was my Hawaii budget?" → finds fact from March session │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Design Decision: Memory retrieval adds value in both scenarios — new sessions (no chain) and existing sessions (query may reference other sessions). We always search when pre-filter passes.
Known Redundancy: When the answer IS in the current chain, we still search memory (~10-30ms wasted). We can't cheaply detect "is the answer already in history?" without understanding the query semantically. For POC, we accept this overhead.
Phase 2 Solution: Context Compression solves this properly — instead of Response API sending full history, we send compressed summaries + recent turns + relevant memories. Facts are extracted during summarization, eliminating redundancy entirely.
4. Pipeline Integration
Current Pipeline (main branch)
1. Response API Translation
2. Parse Request
3. Fact-Check Classification
4. Tool Detection
5. Decision & Model Selection
6. Security Checks
7. PII Detection
8. Semantic Cache Check
9. Model Routing → LLM
Enhanced Pipeline with Agentic Memory
REQUEST PHASE:
─────────────
1. Response API Translation
2. Parse Request
3. Fact-Check Classification ──┐
4. Tool Detection ├── Existing signals
5. Decision & Model Selection ─ ─┘
6. Security Checks
7. PII Detection
8. Semantic Cache Check ───► if HIT → return cached
9. 🆕 Memory Decision:
└── if (NOT Fact) AND (NOT Tool) AND (NOT Greeting) → continue
└── else → skip to step 12
10. 🆕 Build context + rewrite query [~1-5ms]
11. 🆕 Search Milvus, inject memories [~10-30ms]
12. Model Routing → LLM
RESPONSE PHASE:
──────────────
13. Parse LLM Response
14. Cache Update
15. 🆕 Memory Extraction (async goroutine, if auto_store enabled)
└── Runs in background, does NOT add latency to response
16. Response API Translation
17. Return to Client
Step 10 details: Query rewriting strategies (context prepend, LLM rewrite, HyDE) are explained in Appendix C.
5. Memory Retrieval
Flow
┌─────────────────────────────────────────────────────────────────────────┐
│ MEMORY RETRIEVAL FLOW │
├───────────────────────────────────────────── ────────────────────────────┤
│ │
│ 1. MEMORY DECISION (reuse existing pipeline signals) │
│ ────────────────────────────────────────────────── │
│ │
│ Pipeline already classified: │
│ ├── ctx.IsFact (Fact-Check classifier) │
│ ├── ctx.RequiresTool (Tool Detection) │
│ └── isGreeting(query) (simple pattern) │
│ │
│ Decision: │
│ ├── Fact query? → SKIP (general knowledge) │
│ ├── Tool query? → SKIP (tool provides answer) │
│ ├── Greeting? → SKIP (no context needed) │
│ └── Otherwise → SEARCH MEMORY │
│ │
│ 2. BUILD CONTEXT + REWRITE QUERY │
│ ───────────────────────────── │
│ History: ["Planning vacation", "Hawaii sounds nice"] │
│ Query: "How much?" │
│ │
│ Option A (MVP): Context prepend │
│ → "How much? Hawaii vacation planning" │
│ │
│ Option B (v1): LLM rewrite │
│ → "What is the budget for the Hawaii vacation?" │
│ │
│ 3. MILVUS SEARCH │
│ ───────────── │
│ Embed context → Search with user_id filter → Top-k results │
│ │
│ 4. THRESHOLD FILTER │
│ ──────────────── │
│ Keep only results with similarity > 0.6 │
│ ⚠️ Threshold is configurable; 0.6 is starting value, tune via logs │
│ │
│ 5. INJECT INTO LLM CONTEXT │
│ ──────────────────────── │
│ Add as system message: "User's relevant context: ..." │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Implementation
MemoryFilter Struct
// pkg/extproc/req_filter_memory.go
type MemoryFilter struct {
store memory.Store // Interface - can be MilvusStore or InMemoryStore
}
func NewMemoryFilter(store memory.Store) *MemoryFilter {
return &MemoryFilter{store: store}
}
Note: `store` is the `Store` interface (Section 8), not a specific implementation. At runtime this is typically `MilvusStore` for production or `InMemoryStore` for testing.
Memory Decision (Reuses Existing Pipeline)
⚠️ Known Limitation: The `IsFact` classifier was designed for general-knowledge fact-checking (e.g., "What is the capital of France?"). It may incorrectly classify personal-fact questions ("What is my budget?") as fact queries, causing memory to be skipped.
POC Mitigation: We add a personal-indicator check. If the query contains personal pronouns ("my", "I", "me"), we override `IsFact` and search memory anyway.
Future: Retrain or augment the fact-check classifier to distinguish general from personal facts.
// pkg/extproc/req_filter_memory.go
// shouldSearchMemory decides if query should trigger memory search
// Reuses existing pipeline classification signals with personal-fact override
func shouldSearchMemory(ctx *RequestContext, query string) bool {
// Check for personal indicators (overrides IsFact for personal questions)
hasPersonalIndicator := containsPersonalPronoun(query)
// 1. Fact query → skip UNLESS it contains personal pronouns
if ctx.IsFact && !hasPersonalIndicator {
logging.Debug("Memory: Skipping - general fact query")
return false
}
// 2. Tool required → skip (tool provides answer)
if ctx.RequiresTool {
logging.Debug("Memory: Skipping - tool query")
return false
}
// 3. Greeting/social → skip (no context needed)
if isGreeting(query) {
logging.Debug("Memory: Skipping - greeting")
return false
}
// 4. Default: search memory (conservative - don't miss context)
return true
}
func containsPersonalPronoun(query string) bool {
// Simple check for personal context indicators
personalPatterns := regexp.MustCompile(`(?i)\b(my|i|me|mine|i'm|i've|i'll)\b`)
return personalPatterns.MatchString(query)
}
func isGreeting(query string) bool {
// Match greetings that are ONLY greetings, not "Hi, what's my budget?"
lower := strings.ToLower(strings.TrimSpace(query))
// Short greetings only (<= 20 chars and matching a pattern)
if len(lower) > 20 {
return false
}
greetings := []string{
`^(hi|hello|hey|howdy)[\s\!\.\,]*$`,
`^(hi|hello|hey)[\s\,]*(there)?[\s\!\.\,]*$`,
`^(thanks|thank you|thx)[\s\!\.\,]*$`,
`^(bye|goodbye|see you)[\s\!\.\,]*$`,
`^(ok|okay|sure|yes|no)[\s\!\.\,]*$`,
}
for _, p := range greetings {
if regexp.MustCompile(p).MatchString(lower) {
return true
}
}
return false
}
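To sanity-check these heuristics, the standalone snippet below re-declares the two helpers (patterns hoisted and the greeting list abbreviated for brevity). The length guard is what keeps a mixed utterance like "Hi, what's my budget?" from being swallowed as a greeting:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var personal = regexp.MustCompile(`(?i)\b(my|i|me|mine|i'm|i've|i'll)\b`)

func containsPersonalPronoun(q string) bool { return personal.MatchString(q) }

func isGreeting(q string) bool {
	lower := strings.ToLower(strings.TrimSpace(q))
	if len(lower) > 20 { // anything longer carries real content
		return false
	}
	patterns := []string{
		`^(hi|hello|hey|howdy)[\s\!\.\,]*$`,
		`^(thanks|thank you|thx)[\s\!\.\,]*$`,
		`^(bye|goodbye|see you)[\s\!\.\,]*$`,
	}
	for _, p := range patterns {
		if regexp.MustCompile(p).MatchString(lower) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isGreeting("Hello!"))                // pure greeting → skip memory
	fmt.Println(isGreeting("Hi, what's my budget?")) // mixed utterance → not a greeting
	fmt.Println(containsPersonalPronoun("What is my budget?"))             // overrides IsFact
	fmt.Println(containsPersonalPronoun("What is the capital of France?")) // general fact
}
```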
Context Building
// buildSearchQuery builds an effective search query from history + current query
// MVP: context prepend, v1: LLM rewrite for vague queries
func buildSearchQuery(history []Message, query string) string {
// If query is self-contained, use as-is
if isSelfContained(query) {
return query
}
// MVP: Simple context prepend
context := summarizeHistory(history)
return query + " " + context
// v1 (future): LLM rewrite for vague queries
// if isVague(query) {
// return rewriteWithLLM(history, query)
// }
}
func isSelfContained(query string) bool {
// Self-contained: "What's my budget for the Hawaii trip?"
// NOT self-contained: "How much?", "And that one?", "What about it?"
vaguePatterns := []string{`^how much\??$`, `^what about`, `^and that`, `^this one`}
for _, p := range vaguePatterns {
if regexp.MustCompile(`(?i)`+p).MatchString(query) {
return false
}
}
return len(query) > 20 // Short queries are often vague
}
func summarizeHistory(history []Message) string {
// Extract key terms from last 3 user messages
var terms []string
count := 0
for i := len(history) - 1; i >= 0 && count < 3; i-- {
if history[i].Role == "user" {
terms = append(terms, extractKeyTerms(history[i].Content))
count++
}
}
return strings.Join(terms, " ")
}
// v1: LLM-based query rewriting (future enhancement)
func rewriteWithLLM(history []Message, query string) string {
prompt := fmt.Sprintf(`Conversation context: %s
Rewrite this vague query to be self-contained: "%s"
Return ONLY the rewritten query.`, summarizeHistory(history), query)
// Call LLM endpoint
resp, _ := http.Post(llmEndpoint+"/v1/chat/completions", ...)
return parseResponse(resp)
// "how much?" → "What is the budget for the Hawaii vacation?"
}
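The helpers above (`extractKeyTerms`, `isSelfContained`) are left unspecified in this POC. The self-contained sketch below fills them with naive stand-ins, a keep-words-longer-than-four-characters stub for key terms and a plain length check for self-containment, purely to make the MVP context-prepend path runnable:

```go
package main

import (
	"fmt"
	"strings"
)

type Message struct {
	Role    string
	Content string
}

// extractKeyTerms is a naive stand-in for the unspecified helper:
// keep lowercased words longer than four characters as "key terms".
func extractKeyTerms(s string) string {
	var kept []string
	for _, w := range strings.Fields(strings.ToLower(s)) {
		if len(w) > 4 {
			kept = append(kept, strings.Trim(w, ".,!?"))
		}
	}
	return strings.Join(kept, " ")
}

// buildSearchQuery: MVP context prepend for vague (short) queries.
func buildSearchQuery(history []Message, query string) string {
	if len(query) > 20 { // crude self-containment check, as in the POC sketch
		return query
	}
	var terms []string
	for i := len(history) - 1; i >= 0 && len(terms) < 3; i-- {
		if history[i].Role == "user" {
			terms = append(terms, extractKeyTerms(history[i].Content))
		}
	}
	return query + " " + strings.Join(terms, " ")
}

func main() {
	history := []Message{
		{Role: "user", Content: "Planning vacation"},
		{Role: "assistant", Content: "Sounds fun!"},
		{Role: "user", Content: "Hawaii sounds nice"},
	}
	// Vague "How much?" gets topical terms appended before embedding.
	fmt.Println(buildSearchQuery(history, "How much?"))
}
```

The output ("How much? hawaii sounds planning vacation") is what gets embedded, which is why the vague query now lands near the budget memory in vector space.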
Full Retrieval
// pkg/extproc/req_filter_memory.go
func (f *MemoryFilter) RetrieveMemories(
ctx context.Context,
reqCtx *RequestContext, // pipeline signals (IsFact, RequiresTool)
query string,
userID string,
history []Message,
) ([]*memory.RetrieveResult, error) {
// 1. Memory decision (skip if fact/tool/greeting)
if !shouldSearchMemory(reqCtx, query) {
logging.Debug("Memory: Skipping - not memory-relevant")
return nil, nil
}
// 2. Build search query (context prepend or LLM rewrite)
searchQuery := buildSearchQuery(history, query)
// 3. Search Milvus
results, err := f.store.Retrieve(ctx, memory.RetrieveOptions{
Query: searchQuery,
UserID: userID,
Limit: 5,
Threshold: 0.6,
})
if err != nil {
return nil, err
}
logging.Infof("Memory: Retrieved %d memories", len(results))
return results, nil
}
// InjectMemories adds memories to the LLM request
func (f *MemoryFilter) InjectMemories(
requestBody []byte,
memories []*memory.RetrieveResult,
) ([]byte, error) {
if len(memories) == 0 {
return requestBody, nil
}
// Format memories as context
var sb strings.Builder
sb.WriteString("## User's Relevant Context\n\n")
for _, mem := range memories {
sb.WriteString(fmt.Sprintf("- %s\n", mem.Memory.Content))
}
// Add as system message
return injectSystemMessage(requestBody, sb.String())
}
6. Memory Saving
Triggers
Memory extraction is triggered by three events:
| Trigger | Description | Status |
|---|---|---|
| Every N turns | Extract after every 10 turns | ✅ MVP |
| End of session | Create episodic summary when session ends | 🔮 Future |
| Context drift | Extract when topic changes significantly | 🔮 Future |
Note: Session end detection and context drift detection require additional implementation. For MVP, we rely on the "every N turns" trigger only.
Flow
┌─────────────────────────────────────────────────────────────────────────┐
│ MEMORY SAVING FLOW │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ TRIGGERS: │
│ ───────── │
│ ├── Every N turns (e.g., 10) ← MVP │
│ ├── End of session ← Future (needs detection) │
│ └── Context drift detected ← Future (needs detection) │
│ │
│ Runs: Async (background) - no user latency │
│ │
│ 1. GET BATCH │
│ ───────── │
│ Get last 10-15 turns from session │
│ │
│ 2. LLM EXTRACTION │
│ ────────────── │
│ Prompt: "Extract important facts. Include context. │
│ Return JSON: [{type, content}, ...]" │
│ │
│ LLM returns: │
│ [{"type": "semantic", "content": "budget for Hawaii is $10K"}] │
│ │
│ 3. DEDUPLICATION │
│ ───────────── │
│ For each extracted fact: │
│ - Embed content │
│ - Search existing memories (same user, same type) │
│ - If similarity > 0.9: UPDATE existing (merge/replace) │
│ - If similarity 0.7-0.9: CREATE new (gray zone, conservative) │
│ - If similarity < 0.7: CREATE new │
│ │
│ Example: │
│ Existing: "User's budget for Hawaii is $10,000" │
│ New: "User's budget is now $15,000" │
│ → Similarity ~0.92 → UPDATE existing with new value │
│ │
│ 4. STORE IN MILVUS │
│ ─────────────── │
│ Memory { id, type, content, embedding, user_id, created_at } │
│ │
│ 5. SESSION END (future): Create episodic summary │
│ ───────────────────────────────────────────── │
│ "On Dec 29, user planned Hawaii vacation with $10K budget" │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Note on `user_id`: when we refer to `user_id` for memory scoping, we mean the logged-in user (the authenticated user identity), not the per-session user we currently track. Mapping between the two will need to be configured in the semantic router itself.
Implementation
// pkg/memory/extractor.go
type MemoryExtractor struct {
store memory.Store // Interface - can be MilvusStore or InMemoryStore
llmEndpoint string // LLM endpoint for fact extraction
batchSize int // Extract every N turns (default: 10)
turnCounts map[string]int
mu sync.Mutex
}
// ProcessResponse extracts and stores memories (runs async)
//
// Triggers (MVP: only first one implemented):
// - Every N turns (e.g., 10) ← MVP
// - End of session ← Future: needs session end detection
// - Context drift detected ← Future: needs drift detection
//
func (e *MemoryExtractor) ProcessResponse(
ctx context.Context,
sessionID string,
userID string,
history []Message,
) error {
e.mu.Lock()
e.turnCounts[sessionID]++
turnCount := e.turnCounts[sessionID]
e.mu.Unlock()
// MVP: Only extract every N turns
// Future: Also trigger on session end or context drift
if turnCount % e.batchSize != 0 {
return nil
}
// Get recent batch
batchStart := max(0, len(history) - e.batchSize - 5)
batch := history[batchStart:]
// LLM extraction
extracted, err := e.extractWithLLM(batch)
if err != nil {
return err
}
// Store with deduplication
for _, fact := range extracted {
existing, similarity := e.findSimilar(ctx, userID, fact.Content, fact.Type)
if similarity > 0.9 && existing != nil {
// Very similar → UPDATE existing memory
existing.Content = fact.Content // Use newer content
existing.UpdatedAt = time.Now()
if err := e.store.Update(ctx, existing.ID, existing); err != nil {
logging.Warnf("Failed to update memory: %v", err)
}
continue
}
// similarity < 0.9 → CREATE new memory
mem := &Memory{
ID: generateID("mem"),
Type: fact.Type,
Content: fact.Content,
UserID: userID,
Source: "conversation",
CreatedAt: time.Now(),
}
if err := e.store.Store(ctx, mem); err != nil {
logging.Warnf("Failed to store memory: %v", err)
}
}
return nil
}
// findSimilar searches for existing similar memories
func (e *MemoryExtractor) findSimilar(
ctx context.Context,
userID string,
content string,
memType MemoryType,
) (*Memory, float32) {
results, err := e.store.Retrieve(ctx, memory.RetrieveOptions{
Query: content,
UserID: userID,
Types: []MemoryType{memType},
Limit: 1,
Threshold: 0.7, // Only consider reasonably similar
})
if err != nil || len(results) == 0 {
return nil, 0
}
return results[0].Memory, results[0].Score
}
// extractWithLLM uses LLM to extract facts
//
// ⚠️ POC Limitation: LLM extraction is best-effort. Failures are logged but do not
// block the response. Incorrect extractions may occur.
//
// Future: Self-correcting memory (see Section 14 - Future Enhancements):
// - Track memory usage (access_count, last_accessed)
// - Score memories based on usage + age + retrieval feedback
// - Periodically prune low-score, unused memories
// - Detect contradictions → auto-merge or flag for resolution
//
func (e *MemoryExtractor) extractWithLLM(messages []Message) ([]ExtractedFact, error) {
prompt := `Extract important information from these messages.
IMPORTANT: Include CONTEXT for each fact.
For each piece of information:
- Type: "semantic" (facts, preferences) or "procedural" (instructions, how-to)
- Content: The fact WITH its context
BAD: {"type": "semantic", "content": "budget is $10,000"}
GOOD: {"type": "semantic", "content": "budget for Hawaii vacation is $10,000"}
Messages:
` + formatMessages(messages) + `
Return JSON array (empty if nothing to remember):
[{"type": "semantic|procedural", "content": "fact with context"}]`
// Call LLM with timeout
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
reqBody := map[string]interface{}{
"model": "qwen3",
"messages": []map[string]string{
{"role": "user", "content": prompt},
},
}
jsonBody, _ := json.Marshal(reqBody)
req, _ := http.NewRequestWithContext(ctx, "POST",
e.llmEndpoint+"/v1/chat/completions",
bytes.NewReader(jsonBody))
req.Header.Set("Content-Type", "application/json")
resp, err := http.DefaultClient.Do(req)
if err != nil {
logging.Warnf("Memory extraction LLM call failed: %v", err)
return nil, err // Caller handles gracefully
}
defer resp.Body.Close()
if resp.StatusCode != 200 {
logging.Warnf("Memory extraction LLM returned %d", resp.StatusCode)
return nil, fmt.Errorf("LLM returned %d", resp.StatusCode)
}
facts, err := parseExtractedFacts(resp.Body)
if err != nil {
// JSON parse error - LLM returned malformed output
logging.Warnf("Memory extraction parse failed: %v", err)
return nil, err // Skip this batch, don't store garbage
}
return facts, nil
}
7. Memory Operations
This section lists all operations that can be performed on memories. Each is implemented as a method on the Store interface (see Section 8).
| Operation | Description | Trigger | Interface Method | Status |
|---|---|---|---|---|
| Store | Save new memory to Milvus | Auto (LLM extraction) or explicit API | Store() | ✅ MVP |
| Retrieve | Semantic search for relevant memories | Auto (on query) | Retrieve() | ✅ MVP |
| Update | Modify existing memory content | Deduplication or explicit API | Update() | ✅ MVP |
| Forget | Delete specific memory by ID | Explicit API call | Forget() | ✅ MVP |
| ForgetByScope | Delete all memories for user/project | Explicit API call | ForgetByScope() | ✅ MVP |
| Consolidate | Merge related memories into summary | Scheduled / on threshold | Consolidate() |