Possible Improvements¶
Ideas for enhancing the system, organized by implementation complexity.
Short-Term¶
These improvements can be implemented quickly with high impact.
| Improvement | Impact | Effort | Why It Matters |
|---|---|---|---|
| Semantic embeddings (Sentence Transformers / Qwen3-Embedding) | Much better retrieval quality | Medium | TF-IDF misses semantic similarity; embeddings capture meaning |
| Cross-encoder re-ranking after initial TF-IDF retrieval | Higher precision top-k | Low | Second-pass ranking eliminates noise from first-stage retrieval |
| Async HTTP calls (aiohttp) for parallel Gemini + ScaleDown | Lower latency | Medium | Currently sequential; parallel calls could save 50% time |
ScaleDown Python SDK (pip install scaledown) |
Cleaner code, batch support, built-in retry | Low | Raw HTTP calls are verbose; SDK handles boilerplate |
| Response caching — cache Gemini responses by (question, context_hash) | Eliminates repeat latency | Low | Same question on same papers = instant response |
Better PDF extraction — use pymupdf or pdfplumber |
Better text quality, especially tables | Low | PyPDF2 struggles with complex layouts; these libraries are more robust |
Medium-Term¶
These require more design work but have significant benefits.
| Improvement | Impact | Effort | Why It Matters |
|---|---|---|---|
| ScaleDown SemanticOptimizer | Replace TF-IDF entirely | Medium | Their FAISS-based semantic search is faster and more accurate than TF-IDF |
| ScaleDown Pipeline — chain HasteOptimizer → Compressor | Structured compression pipeline | Medium | Eliminates manual orchestration; built-in observability |
| Multi-source support — Semantic Scholar API, PubMed, IEEE Xplore | Much wider paper coverage | High | ArXiv-only limits domains (medicine, older CS papers, etc.) |
| Streaming responses — stream Gemini output token-by-token | Better UX for long answers | Medium | Users see progress instead of waiting 15s for full response |
| Web UI (Streamlit/Gradio) | Broader accessibility | Medium | CLI limits non-technical users; web UI is more approachable |
| Configurable thinking budget per question complexity | Better quality/speed trade-off | Low | General questions waste thinking budget; research questions need more |
| Section-level citations — extract page/section from chunks | More precise citations | Medium | "arxiv:1706.03762 Section 3.2" is better than just "arxiv:1706.03762" |
Long-Term¶
These are ambitious improvements requiring substantial effort.
| Improvement | Impact | Effort | Why It Matters |
|---|---|---|---|
| ScaleDown Pareto Merging — dynamic model merging | Potentially 30% cost reduction | High | Intelligently routes queries to optimal model in cost/quality trade-off space |
| Knowledge graph — build citation graph across papers | Deep cross-paper analysis | High | Multi-hop reasoning: "How does paper A's findings relate to paper B's methods?" |
| Fine-tuned embeddings on scientific text | Domain-specific retrieval | High | Sentence Transformers trained on arXiv abstracts would outperform general models |
| Evaluation framework — automated hallucination detection | Quantified quality metrics | Medium | ScaleDown's evaluation pipeline can score answer quality automatically |
| Multi-agent architecture — separate agents for search, analysis, verification | Better specialization | High | Single LLM does everything; specialist agents could improve quality |
| Real-time monitoring — track latency, costs, cache hits | System observability | Medium | Currently no metrics; hard to optimize without measurement |
| Database backend — replace JSON files with SQLite/Postgres | Multi-user, scalability | Medium | File-based storage doesn't scale; DB enables concurrent access |
Retrieval Improvements¶
Semantic Embeddings¶
Current: TF-IDF (keyword-based)
Proposed: Sentence Transformers or Qwen3-Embedding
Implementation:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(chunks)
# Then use FAISS or cosine similarity
Benefits: - Captures semantic meaning (not just keywords) - Handles synonyms, paraphrasing - Better cross-domain retrieval
Cross-Encoder Re-Ranking¶
Current: TF-IDF scores are final
Proposed: Second-pass with cross-encoder
Implementation:
from sentence_transformers import CrossEncoder
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
scores = reranker.predict([(query, chunk) for chunk in top_k])
reranked = [chunk for _, chunk in sorted(zip(scores, top_k), reverse=True)]
Benefits: - Higher precision (fewer irrelevant chunks) - Better top-5 than top-20 → top-5
ScaleDown SemanticOptimizer¶
Current: Custom TF-IDF pipeline
Proposed: Use ScaleDown's built-in semantic search
Implementation:
from scaledown import SemanticOptimizer
optimizer = SemanticOptimizer(api_key=SCALEDOWN_API_KEY)
results = optimizer.query(
chunks=chunks,
query=user_question,
top_k=5
)
Benefits: - FAISS-backed semantic search - Integrated with ScaleDown compression - No need to maintain custom retrieval code
Compression Improvements¶
ScaleDown Pipeline¶
Current: Manual compression calls
Proposed: ScaleDown Pipeline class
Implementation:
from scaledown import Pipeline, HasteOptimizer, Compressor
pipeline = Pipeline([
HasteOptimizer(rate="auto"),
Compressor(model="gemini-2.5-flash")
])
result = pipeline.run(context=raw_text, prompt=user_question)
Benefits: - Structured, maintainable - Built-in retries and error handling - Observability (latency, compression stats)
Multi-Source Support¶
Current: ArXiv only
Proposed: Add Semantic Scholar, PubMed, IEEE Xplore
Semantic Scholar API¶
import requests
response = requests.get(
"https://api.semanticscholar.org/graph/v1/paper/search",
params={"query": user_query, "limit": 10}
)
papers = response.json()["data"]
Benefits: - 200M+ papers across all domains - Free API with generous rate limits - Includes citations, references, metadata
PubMed API¶
from Bio import Entrez
Entrez.email = "your_email@example.com"
handle = Entrez.esearch(db="pubmed", term=query, retmax=10)
results = Entrez.read(handle)
Benefits: - Medical and life sciences papers - High-quality, peer-reviewed
User Experience Improvements¶
Streaming Responses¶
Current: Wait for full response
Proposed: Stream tokens as they're generated
Implementation:
Benefits: - Immediate feedback - Users can read while generation continues - Better perceived latency
Web UI¶
Current: CLI only
Proposed: Streamlit or Gradio web app
Streamlit Example:
import streamlit as st
question = st.text_input("Ask a research question")
if st.button("Ask"):
with st.spinner("Researching..."):
answer = ask_question(question)
st.markdown(answer)
Benefits: - No terminal knowledge required - Richer UI (charts, images, interactive tables) - Shareable URLs
Cost & Performance Improvements¶
Response Caching¶
Current: Every question hits API
Proposed: Cache by (question, context_hash)
Implementation:
import hashlib
cache = {}
key = f"{question}:{hashlib.md5(context.encode()).hexdigest()}"
if key in cache:
return cache[key]
response = gemini_client.generate(prompt)
cache[key] = response
return response
Benefits: - Instant responses for repeat questions - Lower API costs - Reduced rate limit pressure
Async Parallelization¶
Current: Sequential API calls
Proposed: Parallel with asyncio
Implementation:
import asyncio
async def parallel_calls():
tasks = [
gemini_async.generate(prompt1),
scaledown_async.compress(context1),
scaledown_async.compress(context2)
]
return await asyncio.gather(*tasks)
Benefits: - 50% latency reduction - Better resource utilization
ScaleDown Batching¶
Current: One API call per compression
Proposed: Batch multiple compressions
Implementation:
from scaledown import batch_compress
results = batch_compress([
{"context": chunk1, "prompt": query},
{"context": chunk2, "prompt": query},
{"context": chunk3, "prompt": query}
])
Benefits: - Lower latency (1 network round-trip vs 3) - Potential cost savings
Evaluation & Monitoring¶
Automated Hallucination Detection¶
Current: Manual verification
Proposed: ScaleDown evaluation pipeline
Implementation:
from scaledown import evaluate_response
score = evaluate_response(
question=user_question,
answer=cot_answer,
sources=retrieved_chunks
)
Benefits: - Quantified quality metrics - Automatic flagging of low-confidence answers - A/B testing of prompts and models
System Metrics¶
Proposed: Track latency, costs, cache hits
Implementation:
import time
start = time.time()
response = gemini_client.generate(prompt)
latency = time.time() - start
metrics.log({"latency": latency, "tokens": len(response), "cost": calculate_cost(response)})
Benefits: - Identify bottlenecks - Optimize slow stages - Cost tracking per session
Next Steps¶
Priority order:
1. ScaleDown Python SDK (easiest, immediate code cleanup)
2. Better PDF extraction (pymupdf instead of PyPDF2)
3. Semantic embeddings (biggest retrieval quality gain)
4. Response caching (instant repeat queries)
5. Async parallelization (50% latency reduction)
See Project Structure for where these changes would fit in the codebase.