# Methodology
This page explains the technical approaches and design patterns used in the Scientific Literature Explorer.
## 1. Retrieval-Augmented Generation (RAG)
```mermaid
%%{init: {'theme':'base','themeVariables':{'fontFamily':'sans-serif'}}}%%
flowchart LR
Paper([Paper Text]) --> Chunk[Chunking<br/>1000 chars<br/>200 overlap]
Chunk --> Vec[TF-IDF<br/>Vectorization<br/>sklearn]
Vec --> Index[(Chunk Index<br/>Sparse Matrix)]
Query([User Query]) --> QVec[Vectorize Query<br/>Same TF-IDF]
QVec --> Sim[Cosine<br/>Similarity]
Index --> Sim
Sim --> TopK[Top-K Chunks<br/>default: 5]
TopK --> Compress[ScaleDown<br/>Compression<br/>40-60% reduction]
Compress --> LLM[Gemini<br/>Generation]
LLM --> Answer([Answer with<br/>Citations])
style Paper fill:#424242,stroke:#212121,stroke-width:3px,color:#fff,rx:10,ry:10
style Query fill:#424242,stroke:#212121,stroke-width:3px,color:#fff,rx:10,ry:10
style Index fill:#1565c0,stroke:#0d47a1,stroke-width:3px,color:#fff,rx:5,ry:5
style Compress fill:#f57c00,stroke:#e65100,stroke-width:3px,color:#fff,rx:5,ry:5
style Answer fill:#2e7d32,stroke:#1b5e20,stroke-width:3px,color:#fff,rx:10,ry:10
style Chunk fill:#00838f,stroke:#006064,stroke-width:2px,color:#fff,rx:5,ry:5
style Vec fill:#00838f,stroke:#006064,stroke-width:2px,color:#fff,rx:5,ry:5
style QVec fill:#00838f,stroke:#006064,stroke-width:2px,color:#fff,rx:5,ry:5
style Sim fill:#6a1b9a,stroke:#4a148c,stroke-width:2px,color:#fff,rx:5,ry:5
style TopK fill:#6a1b9a,stroke:#4a148c,stroke-width:2px,color:#fff,rx:5,ry:5
style LLM fill:#1976d2,stroke:#0d47a1,stroke-width:2px,color:#fff,rx:5,ry:5
linkStyle default stroke:#e0e0e0,stroke-width:2px
```
The RAG pattern ensures answers are grounded in actual paper content rather than relying solely on the LLM's training data:
- Chunking: Papers are split into overlapping segments (1000 chars, 200 overlap) to ensure no information is lost at boundaries
- TF-IDF Vectorization: scikit-learn's `TfidfVectorizer` creates sparse vector representations with English stop-word removal
- Cosine Similarity: Queries are matched against the chunk index; the top-k (default 5) most similar chunks are retrieved (sketched below)
- Source Tracking: Every chunk retains its source label (e.g., `arxiv:2511.14362`) for citation tracing
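The retrieval flow can be sketched in a few lines with scikit-learn. This is a minimal illustration, not the project's actual module: the function names and in-memory data layout are assumptions, while the chunk size, overlap, stop-word setting, and default k mirror the description above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping chunks so boundary content is never lost."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def build_index(papers: dict[str, str]):
    """Chunk every paper, remember each chunk's source, and fit a TF-IDF index."""
    chunks, sources = [], []
    for source, text in papers.items():        # e.g. {"arxiv:2511.14362": "..."}
        for chunk in chunk_text(text):
            chunks.append(chunk)
            sources.append(source)             # kept for citation tracing
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(chunks)  # sparse chunk index
    return vectorizer, matrix, chunks, sources

def retrieve(query: str, vectorizer, matrix, chunks, sources, k: int = 5):
    """Vectorize the query with the same TF-IDF model and rank by cosine similarity."""
    scores = cosine_similarity(vectorizer.transform([query]), matrix).ravel()
    top = scores.argsort()[::-1][:k]           # top-k most similar chunks
    return [(sources[i], chunks[i], float(scores[i])) for i in top]
```

The retrieved `(source, chunk, score)` triples are what flow into the compression step described next.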
## 2. Context Compression (ScaleDown)
```mermaid
%%{init: {'theme':'base','themeVariables':{'fontFamily':'sans-serif'}}}%%
flowchart TD
Raw[Raw Retrieved Chunks<br/>1500 tokens<br/>Redundant, verbose] --> SD{ScaleDown API}
Query[User Question] --> SD
SD --> Analyze[Query-Aware<br/>Analysis]
Analyze --> Remove[Remove Redundancy]
Remove --> Preserve[Preserve Semantics]
Preserve --> Optimize[Optimize for<br/>gemini-2.5-flash<br/>tokenizer]
Optimize --> Compressed[Compressed Context<br/>600 tokens<br/>40-60% reduction]
Compressed --> Benefits
subgraph Benefits [Benefits]
B1[Lower API Cost]
B2[Faster Response]
B3[More Context Fits<br/>in Window]
B4[Better Focus]
end
style Raw fill:#c62828,stroke:#b71c1c,stroke-width:3px,color:#fff,rx:5,ry:5
style Query fill:#424242,stroke:#212121,stroke-width:2px,color:#fff,rx:5,ry:5
style Compressed fill:#2e7d32,stroke:#1b5e20,stroke-width:3px,color:#fff,rx:5,ry:5
style SD fill:#f57c00,stroke:#e65100,stroke-width:3px,color:#fff,rx:5,ry:5
style Analyze fill:#00838f,stroke:#006064,stroke-width:2px,color:#fff,rx:5,ry:5
style Remove fill:#00838f,stroke:#006064,stroke-width:2px,color:#fff,rx:5,ry:5
style Preserve fill:#00838f,stroke:#006064,stroke-width:2px,color:#fff,rx:5,ry:5
style Optimize fill:#00838f,stroke:#006064,stroke-width:2px,color:#fff,rx:5,ry:5
style Benefits fill:#6a1b9a,stroke:#4a148c,stroke-width:2px,color:#fff,rx:5,ry:5
style B1 fill:#1976d2,stroke:#0d47a1,stroke-width:2px,color:#fff,rx:5,ry:5
style B2 fill:#1976d2,stroke:#0d47a1,stroke-width:2px,color:#fff,rx:5,ry:5
style B3 fill:#1976d2,stroke:#0d47a1,stroke-width:2px,color:#fff,rx:5,ry:5
style B4 fill:#1976d2,stroke:#0d47a1,stroke-width:2px,color:#fff,rx:5,ry:5
linkStyle default stroke:#e0e0e0,stroke-width:2px
```
Raw retrieved chunks are often redundant. ScaleDown's compression:
- Reduces token count by 40-60% while preserving semantics
- Uses the user's question (the `prompt` parameter) as a guide to prioritize relevant information
- Optimizes for the target model's tokenizer (`gemini-2.5-flash`)
- Sets `"rate": "auto"` so ScaleDown determines the optimal compression level (see the sketch below)
## 3. Multi-Stage Reasoning Workflow
Inspired by research on self-verification and chain-of-verification (CoVe):
- Chain-of-Thought: Forces step-by-step reasoning, reducing reasoning errors
- Self-Verification: A separate LLM call cross-references every claim against source documents
- Self-Critique: An independent evaluator checks for completeness and accuracy
- Stages are configurable: enable, disable, or reorder them via the CLI (a minimal sketch follows)
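A minimal sketch of how such a stage pipeline can be wired, assuming a `call_llm` helper (the resilient call from section 5); the stage prompts, names, and signatures are illustrative, not the project's actual code.

```python
from typing import Callable

def call_llm(prompt: str) -> str:
    """Stub for the resilient Gemini call described in section 5."""
    raise NotImplementedError

# Each stage takes (question, context, draft) and returns an improved draft.
Stage = Callable[[str, str, str], str]

def chain_of_thought(question: str, context: str, draft: str) -> str:
    return call_llm(f"Answer step by step.\nSources:\n{context}\n\nQ: {question}")

def self_verify(question: str, context: str, draft: str) -> str:
    return call_llm("Cross-reference every claim against the sources and fix "
                    f"unsupported ones.\nSources:\n{context}\n\nDraft:\n{draft}")

def self_critique(question: str, context: str, draft: str) -> str:
    return call_llm("Evaluate this answer for completeness and accuracy, then "
                    f"return a revised version.\n\nDraft:\n{draft}")

def run_workflow(question: str, context: str, stages: list[Stage]) -> str:
    """Run whichever stages the CLI enabled, in the order given."""
    draft = ""
    for stage in stages:
        draft = stage(question, context, draft)
    return draft

# Triage (section 4) may drop the critique stage for conceptual questions:
# answer = run_workflow(q, ctx, [chain_of_thought, self_verify])
```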
## 4. Question Triage
```mermaid
%%{init: {'theme':'base','themeVariables':{'fontFamily':'sans-serif'}}}%%
flowchart LR
Q([Question]) --> Classify{Gemini Triage<br/>+ Keyword Extract}
Classify -->|GENERAL<br/>What is CNN?| Direct[Direct Answer<br/>No Papers<br/>~5s]
Classify -->|CONCEPTUAL<br/>Explain attention| Minimal[Basic Search<br/>Skip Critique<br/>~15s]
Classify -->|RESEARCH<br/>Latest NAS methods?| Full[Full Discovery<br/>COT+Verify+Critique<br/>~45s]
Direct --> Answer1([Answer])
Minimal --> Papers1[Light Paper Fetch]
Papers1 --> Workflow1[Workflow<br/>critique=OFF]
Workflow1 --> Answer2([Answer])
Full --> Papers2[Deep Paper Discovery]
Papers2 --> Workflow2[Full Workflow<br/>All Stages ON]
Workflow2 --> Answer3([Answer])
style Q fill:#424242,stroke:#212121,stroke-width:3px,color:#fff,rx:10,ry:10
style Classify fill:#d32f2f,stroke:#b71c1c,stroke-width:3px,color:#fff,rx:5,ry:5
style Direct fill:#2e7d32,stroke:#1b5e20,stroke-width:3px,color:#fff,rx:5,ry:5
style Minimal fill:#f57c00,stroke:#e65100,stroke-width:3px,color:#fff,rx:5,ry:5
style Full fill:#c62828,stroke:#b71c1c,stroke-width:3px,color:#fff,rx:5,ry:5
style Answer1 fill:#1565c0,stroke:#0d47a1,stroke-width:2px,color:#fff,rx:10,ry:10
style Answer2 fill:#1565c0,stroke:#0d47a1,stroke-width:2px,color:#fff,rx:10,ry:10
style Answer3 fill:#1565c0,stroke:#0d47a1,stroke-width:2px,color:#fff,rx:10,ry:10
style Papers1 fill:#00838f,stroke:#006064,stroke-width:2px,color:#fff,rx:5,ry:5
style Papers2 fill:#00838f,stroke:#006064,stroke-width:2px,color:#fff,rx:5,ry:5
style Workflow1 fill:#6a1b9a,stroke:#4a148c,stroke-width:2px,color:#fff,rx:5,ry:5
style Workflow2 fill:#6a1b9a,stroke:#4a148c,stroke-width:2px,color:#fff,rx:5,ry:5
linkStyle default stroke:#e0e0e0,stroke-width:2px
```
A single Gemini call classifies questions into three tiers:

- General: Simple factual questions → answered directly (no paper fetch, ~5s)
- Conceptual: Needs depth but not specific papers → uses the workflow but may skip critique (~15s)
- Research: Needs actual papers → full discovery + workflow pipeline (~45s)
This saves 60-90 seconds for simple questions by skipping paper discovery entirely.
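One way to picture the triage step is a mapping from tier label to pipeline configuration. The labels and rough timings come from the diagram above; the dataclass, prompt wording, and `call_llm` stub are illustrative assumptions.

```python
from dataclasses import dataclass

def call_llm(prompt: str) -> str:
    """Stub for the Gemini call (see section 3's sketch)."""
    raise NotImplementedError

@dataclass
class TierConfig:
    fetch_papers: bool    # run paper discovery at all?
    deep_discovery: bool  # light fetch vs. full discovery
    run_critique: bool    # enable the self-critique stage?

# Tier → pipeline configuration (timings are the rough figures above).
TIER_CONFIGS = {
    "GENERAL":    TierConfig(False, False, False),  # direct answer, ~5s
    "CONCEPTUAL": TierConfig(True,  False, False),  # basic search, ~15s
    "RESEARCH":   TierConfig(True,  True,  True),   # full pipeline, ~45s
}

def triage(question: str) -> tuple[TierConfig, list[str]]:
    """Single Gemini call that classifies the question and extracts keywords."""
    reply = call_llm(
        "Classify this question as GENERAL, CONCEPTUAL, or RESEARCH, then "
        f"list search keywords, one per line:\n{question}"
    )
    tier, *keywords = reply.strip().splitlines()
    # Unknown labels fall back to the full pipeline (a sketch-level choice).
    return TIER_CONFIGS.get(tier.strip(), TIER_CONFIGS["RESEARCH"]), keywords
```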
## 5. Resilient LLM Strategy
```mermaid
%%{init: {'theme':'base','themeVariables':{'fontFamily':'sans-serif'}}}%%
flowchart TD
Start([API Call]) --> Gemini{Gemini API}
Gemini -->|Success 200| Success([Return Response])
Gemini -->|Rate Limit 429| Retry{Retry Count<br/>< 5?}
Retry -->|Yes| Wait[Exponential Backoff<br/>5s, 10s, 20s, 40s, 60s]
Wait --> Gemini
Retry -->|No, All Failed| Fallback[ScaleDown<br/>Compression-as-Generation]
Fallback --> FallbackSuccess([Return Compressed<br/>Extraction])
Gemini -->|Other Error| Fail([Raise Exception])
style Start fill:#424242,stroke:#212121,stroke-width:3px,color:#fff,rx:10,ry:10
style Gemini fill:#1976d2,stroke:#0d47a1,stroke-width:3px,color:#fff,rx:5,ry:5
style Success fill:#2e7d32,stroke:#1b5e20,stroke-width:3px,color:#fff,rx:10,ry:10
style FallbackSuccess fill:#f57c00,stroke:#e65100,stroke-width:3px,color:#fff,rx:10,ry:10
style Fail fill:#c62828,stroke:#b71c1c,stroke-width:3px,color:#fff,rx:10,ry:10
style Retry fill:#6a1b9a,stroke:#4a148c,stroke-width:2px,color:#fff,rx:5,ry:5
style Wait fill:#00838f,stroke:#006064,stroke-width:2px,color:#fff,rx:5,ry:5
style Fallback fill:#ef6c00,stroke:#e65100,stroke-width:2px,color:#fff,rx:5,ry:5
linkStyle default stroke:#e0e0e0,stroke-width:2px
```
Implementation:
```text
Primary: Gemini 2.5 Flash (full generation)
│
├── Rate limited (429)?
│     └── Retry with exponential backoff (5× up to 60s)
│           └── Still limited?
│                 └── Fallback: ScaleDown compression-as-generation
│
└── Research Agent rate-limited?
      └── Heuristic keyword extraction (regex-based, no API call)
```
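A sketch of that retry-then-degrade logic, assuming hypothetical `gemini_generate` and `RateLimitError` stand-ins plus the `compress_context` sketch from section 2; the backoff schedule and the regex fallback mirror the tree above.

```python
import re
import time

class RateLimitError(Exception):
    """Stand-in for the SDK's HTTP 429 error type."""

def gemini_generate(prompt: str) -> str:
    """Stub for the primary Gemini 2.5 Flash call."""
    raise NotImplementedError

def call_with_fallback(prompt: str, context: str) -> str:
    """Retry on 429 with exponential backoff, then degrade gracefully to
    ScaleDown compression-as-generation (compress_context, section 2)."""
    for delay in (5, 10, 20, 40, 60):        # five attempts, backoff capped at 60s
        try:
            return gemini_generate(prompt)   # success: return the response
        except RateLimitError:
            time.sleep(delay)                # rate limited: back off, retry
        # any other exception propagates unchanged ("Raise Exception" branch)
    return compress_context(context, prompt, api_key="...")  # final fallback

def heuristic_keywords(question: str, max_terms: int = 6) -> list[str]:
    """Regex-based keyword extraction for when the research agent itself is
    rate-limited: no API call, just crude token filtering."""
    stop = {"what", "is", "are", "the", "a", "an", "of", "in", "for", "how"}
    words = re.findall(r"[a-z][a-z0-9-]+", question.lower())
    return [w for w in words if w not in stop][:max_terms]
```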
## Next: Anti-Hallucination Details
See Anti-Hallucination Pipeline for the full verification workflow.