How It Works¶
This page explains the end-to-end flow of the Scientific Literature Explorer from user question to final answer.
End-to-End Flow¶
The system follows a 9-step pipeline that intelligently routes questions, discovers papers, retrieves relevant content, compresses context, and runs multi-stage verification.
Step-by-Step Breakdown¶
1. Question Triage — The Research Agent classifies the question as `general`, `conceptual`, or `research` using Gemini. General questions (e.g., "What is CNN?") are answered directly without paper retrieval, saving significant latency.
2. Keyword Extraction — In the same Gemini call as triage, search keywords and an ArXiv query are extracted. This merged call halves the initial latency.
3. Paper Discovery — The system searches ArXiv's Atom API for relevant papers. PDFs are downloaded in parallel using `ThreadPoolExecutor` and text is extracted via PyPDF2.
4. Chunking & Indexing — Paper text is split into overlapping chunks (default: 1000 chars, 200 overlap) and indexed using TF-IDF vectorization (scikit-learn).
5. Retrieval — The top-k most relevant chunks are retrieved using cosine similarity against the TF-IDF matrix. (Steps 3-5 are sketched in code after this list.)
6. Context Compression — Retrieved chunks are compressed through the ScaleDown API, reducing token count while preserving semantic meaning. This is critical for fitting more information into the LLM's context window at lower cost.
7. Reasoning Workflow — The compressed context passes through a configurable multi-stage pipeline:
    - Chain-of-Thought (COT): Gemini generates a detailed, step-by-step analysis with mandatory inline citations
    - Self-Verification: A faster Gemini call checks every citation against the source documents
    - Self-Critique (optional): Evaluates completeness, accuracy, and clarity
8. Artifact Storage — Each stage's output is saved as compressed markdown with metadata (timestamps, token counts, compression stats).
9. Session Persistence — Conversations are saved as JSON, enabling multi-turn interactions with history context.
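The retrieval side of the pipeline (steps 3-5) can be pictured with a minimal Python sketch. It assumes the requests, feedparser, PyPDF2, and scikit-learn packages; the function names, query string, and defaults are illustrative, not the project's actual code.

```python
# Minimal sketch of steps 3-5: discover papers on arXiv, extract text, chunk it,
# and retrieve the top-k chunks with TF-IDF + cosine similarity.
import io
from concurrent.futures import ThreadPoolExecutor

import feedparser
import requests
from PyPDF2 import PdfReader
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

ARXIV_API = "http://export.arxiv.org/api/query"

def search_arxiv(query: str, max_results: int = 5) -> list[str]:
    """Query arXiv's Atom API and return PDF URLs for the top results."""
    resp = requests.get(ARXIV_API, params={"search_query": query, "max_results": max_results}, timeout=30)
    feed = feedparser.parse(resp.text)
    # Each entry id looks like http://arxiv.org/abs/XXXX.XXXXXvN; swap /abs/ for /pdf/.
    return [entry.id.replace("/abs/", "/pdf/") for entry in feed.entries]

def download_and_extract(url: str) -> str:
    """Download one PDF and pull plain text out of it with PyPDF2 (step 3)."""
    pdf_bytes = requests.get(url, timeout=60).content
    reader = PdfReader(io.BytesIO(pdf_bytes))
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def chunk(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping character windows (step 4 defaults)."""
    return [text[i:i + size] for i in range(0, len(text), size - overlap)]

def retrieve(chunks: list[str], question: str, k: int = 5) -> list[str]:
    """Rank chunks by cosine similarity against a TF-IDF index (step 5)."""
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(chunks)
    scores = cosine_similarity(vectorizer.transform([question]), matrix).ravel()
    return [chunks[i] for i in scores.argsort()[::-1][:k]]

if __name__ == "__main__":
    urls = search_arxiv("all:retrieval augmented generation")
    with ThreadPoolExecutor(max_workers=4) as pool:        # parallel PDF downloads
        texts = list(pool.map(download_and_extract, urls))
    chunks = [c for text in texts for c in chunk(text)]
    top_chunks = retrieve(chunks, "How does RAG reduce hallucination?")
```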
Role of ScaleDown API¶
ScaleDown is the context compression backbone of this system. It is NOT an LLM — it's a specialized service that intelligently compresses text while preserving semantic meaning.
What ScaleDown Does¶
| Function | How It's Used |
|---|---|
| Context Compression | Retrieved RAG chunks are compressed before being sent to Gemini, reducing token usage by 40-60% |
| Query-Aware Compression | The user's question is passed as the prompt parameter, so ScaleDown preserves information most relevant to the query |
| Artifact Compression | COT traces, verification tables, and critique outputs are compressed before storage |
| Fallback Generation | When Gemini is rate-limited (429), ScaleDown's compression endpoint (which internally uses GPT-4o) is used as a pseudo-generation fallback |
ScaleDown API Endpoint¶
%%{init: {'theme':'base', 'themeVariables': { 'primaryColor':'#f57c00','primaryTextColor':'#fff','primaryBorderColor':'#e65100','lineColor':'#424242'}}}%%
sequenceDiagram
participant Client
participant SD as ScaleDown API
Client->>SD: POST /compress/raw/
Note over Client,SD: Headers: x-api-key<br/>Content-Type: application/json
Note left of Client: Payload<br/>context: 1500 tokens<br/>prompt: user question<br/>model: gemini-2.5-flash<br/>scaledown rate: auto
SD->>SD: Analyze context + prompt relevance
SD->>SD: Apply query-aware compression
SD->>SD: Optimize for target model tokenizer
SD-->>Client: Response (2.3s)
Note right of SD: compressed_prompt: 600 tokens<br/>original_tokens: 1500<br/>compressed_tokens: 600<br/>successful: true<br/>latency_ms: 2341
Note over Client,SD: 60% reduction in token count
Raw API Example:
curl -X POST https://api.scaledown.xyz/compress/raw/ \
-H "x-api-key: YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{
"context": "The text to compress...",
"prompt": "The user question (guides compression)",
"model": "gemini-2.5-flash",
"scaledown": { "rate": "auto" }
}'
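The same call from Python, as a hedged sketch using requests. The compressed_prompt field name mirrors the response shown in the diagram above; the helper name and fallback behavior are illustrative.

```python
import requests

def compress_context(context: str, question: str, api_key: str) -> str:
    """Query-aware compression: the user question guides what ScaleDown keeps."""
    resp = requests.post(
        "https://api.scaledown.xyz/compress/raw/",
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
        json={
            "context": context,
            "prompt": question,               # preserves information relevant to this question
            "model": "gemini-2.5-flash",      # target model whose tokenizer is optimized for
            "scaledown": {"rate": "auto"},
        },
        timeout=60,
    )
    resp.raise_for_status()
    # Fall back to the original context if the response shape is unexpected.
    return resp.json().get("compressed_prompt", context)
```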
ScaleDown as Fallback¶
When Gemini hits its rate limit (429 errors), the system automatically falls back to ScaleDown:
ScaleDown's internal model processes the system prompt + user question as "context" and produces a compressed, relevant extraction. While not a full generative response, it captures the key facts from the provided context.
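A sketch of what that fallback path might look like, reusing the hypothetical compress_context helper above. ask_gemini stands in for the project's Gemini wrapper (sketched under "Thinking Budget" below), and GeminiRateLimitError is the exception named under "Rate Limit Handling".

```python
def answer_with_fallback(prompt: str, question: str, api_key: str) -> str:
    try:
        return ask_gemini(prompt)  # normal generative path
    except GeminiRateLimitError:
        # Pseudo-generation: ScaleDown compresses the same prompt down to the
        # facts most relevant to the question and that extraction is returned.
        return compress_context(prompt, question, api_key)
```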
Role of Google Gemini¶
Gemini 2.5 Flash is the intelligence layer — it generates answers, classifies questions, extracts keywords, and powers the anti-hallucination pipeline.
Gemini Usage in the System¶
| Component | Model Config | Purpose |
|---|---|---|
| Question Triage + Keywords | temperature=0.0 | Classifies question complexity and extracts search keywords in one call |
| COT Reasoning | max_tokens=8192, thinking_budget=2048 | Full chain-of-thought analysis with citations |
| Self-Verification | max_tokens=4096, thinking_budget=1024 | Faster/cheaper — checks citations against source text |
| Self-Critique | max_tokens=4096, thinking_budget=1024 | Faster/cheaper — evaluates answer quality |
| Direct Answers | max_tokens=8192, thinking_budget=2048 | For general questions that don't need papers |
Thinking Budget¶
Gemini 2.5 Flash supports a thinkingConfig parameter that caps internal reasoning tokens. We use:
- 2048 tokens for primary reasoning (COT, direct answers)
- 1024 tokens for lighter tasks (verify, critique)
This prevents the model from spending excessive time on internal reasoning, reducing latency by ~40%.
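A minimal sketch of how those budgets could be applied per call, assuming the google-genai Python SDK; the wrapper name and the heavy/light split are illustrative.

```python
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_GEMINI_KEY")

def ask_gemini(prompt: str, heavy: bool = True) -> str:
    # 2048 thinking tokens for COT and direct answers, 1024 for verify/critique.
    budget = 2048 if heavy else 1024
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=prompt,
        config=types.GenerateContentConfig(
            max_output_tokens=8192 if heavy else 4096,
            thinking_config=types.ThinkingConfig(thinking_budget=budget),
        ),
    )
    return response.text
```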
Rate Limit Handling¶
# Exponential backoff: 5s, 10s, 20s, 40s, 60s
# After 5 retries → raises GeminiRateLimitError
# Caller catches it → falls back to ScaleDown
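One possible shape of that retry loop, as a sketch: the delays and the exception name come from the comments above, everything else is illustrative.

```python
import time

class GeminiRateLimitError(Exception):
    """Raised once the retry budget is exhausted; callers fall back to ScaleDown."""

def call_with_backoff(fn, *args, delays=(5, 10, 20, 40, 60), **kwargs):
    for delay in delays:
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            if "429" not in str(exc):   # only rate-limit errors are retried
                raise
            time.sleep(delay)           # exponential backoff: 5s, 10s, 20s, 40s, 60s
    raise GeminiRateLimitError("Gemini still rate-limited after 5 retries")
```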
Next: Setup Instructions¶
See Getting Started for installation and configuration.