Possible Improvements¶

Ideas for enhancing the system, organized by implementation complexity.

Short-Term¶

These improvements can be implemented quickly with high impact.

Improvement	Impact	Effort	Why It Matters
Semantic embeddings (Sentence Transformers / Qwen3-Embedding)	Much better retrieval quality	Medium	TF-IDF misses semantic similarity; embeddings capture meaning
Cross-encoder re-ranking after initial TF-IDF retrieval	Higher precision top-k	Low	Second-pass ranking eliminates noise from first-stage retrieval
Async HTTP calls (aiohttp) for parallel Gemini + ScaleDown	Lower latency	Medium	Currently sequential; parallel calls could save 50% time
ScaleDown Python SDK (`pip install scaledown`)	Cleaner code, batch support, built-in retry	Low	Raw HTTP calls are verbose; SDK handles boilerplate
Response caching — cache Gemini responses by (question, context_hash)	Eliminates repeat latency	Low	Same question on same papers = instant response
Better PDF extraction — use `pymupdf` or `pdfplumber`	Better text quality, especially tables	Low	PyPDF2 struggles with complex layouts; these libraries are more robust

Medium-Term¶

These require more design work but have significant benefits.

Improvement	Impact	Effort	Why It Matters
ScaleDown SemanticOptimizer	Replace TF-IDF entirely	Medium	Their FAISS-based semantic search is faster and more accurate than TF-IDF
ScaleDown Pipeline — chain HasteOptimizer → Compressor	Structured compression pipeline	Medium	Eliminates manual orchestration; built-in observability
Multi-source support — Semantic Scholar API, PubMed, IEEE Xplore	Much wider paper coverage	High	ArXiv-only limits domains (medicine, older CS papers, etc.)
Streaming responses — stream Gemini output token-by-token	Better UX for long answers	Medium	Users see progress instead of waiting 15s for full response
Web UI (Streamlit/Gradio)	Broader accessibility	Medium	CLI limits non-technical users; web UI is more approachable
Configurable thinking budget per question complexity	Better quality/speed trade-off	Low	General questions waste thinking budget; research questions need more
Section-level citations — extract page/section from chunks	More precise citations	Medium	"arxiv:1706.03762 Section 3.2" is better than just "arxiv:1706.03762"

Long-Term¶

These are ambitious improvements requiring substantial effort.

Improvement	Impact	Effort	Why It Matters
ScaleDown Pareto Merging — dynamic model merging	Potentially 30% cost reduction	High	Intelligently routes queries to optimal model in cost/quality trade-off space
Knowledge graph — build citation graph across papers	Deep cross-paper analysis	High	Multi-hop reasoning: "How does paper A's findings relate to paper B's methods?"
Fine-tuned embeddings on scientific text	Domain-specific retrieval	High	Sentence Transformers trained on arXiv abstracts would outperform general models
Evaluation framework — automated hallucination detection	Quantified quality metrics	Medium	ScaleDown's evaluation pipeline can score answer quality automatically
Multi-agent architecture — separate agents for search, analysis, verification	Better specialization	High	Single LLM does everything; specialist agents could improve quality
Real-time monitoring — track latency, costs, cache hits	System observability	Medium	Currently no metrics; hard to optimize without measurement
Database backend — replace JSON files with SQLite/Postgres	Multi-user, scalability	Medium	File-based storage doesn't scale; DB enables concurrent access

Retrieval Improvements¶

Semantic Embeddings¶

Current: TF-IDF (keyword-based)

Proposed: Sentence Transformers or Qwen3-Embedding

Implementation:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(chunks)

# Then use FAISS or cosine similarity

Benefits: - Captures semantic meaning (not just keywords) - Handles synonyms, paraphrasing - Better cross-domain retrieval

Cross-Encoder Re-Ranking¶

Current: TF-IDF scores are final

Proposed: Second-pass with cross-encoder

Implementation:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
scores = reranker.predict([(query, chunk) for chunk in top_k])
reranked = [chunk for _, chunk in sorted(zip(scores, top_k), reverse=True)]

Benefits: - Higher precision (fewer irrelevant chunks) - Better top-5 than top-20 → top-5

ScaleDown SemanticOptimizer¶

Current: Custom TF-IDF pipeline

Proposed: Use ScaleDown's built-in semantic search

Implementation:

from scaledown import SemanticOptimizer

optimizer = SemanticOptimizer(api_key=SCALEDOWN_API_KEY)
results = optimizer.query(
    chunks=chunks,
    query=user_question,
    top_k=5
)

Benefits: - FAISS-backed semantic search - Integrated with ScaleDown compression - No need to maintain custom retrieval code

Compression Improvements¶

ScaleDown Pipeline¶

Current: Manual compression calls

Proposed: ScaleDown Pipeline class

Implementation:

from scaledown import Pipeline, HasteOptimizer, Compressor

pipeline = Pipeline([
    HasteOptimizer(rate="auto"),
    Compressor(model="gemini-2.5-flash")
])

result = pipeline.run(context=raw_text, prompt=user_question)

Benefits: - Structured, maintainable - Built-in retries and error handling - Observability (latency, compression stats)

Multi-Source Support¶

Current: ArXiv only

Proposed: Add Semantic Scholar, PubMed, IEEE Xplore

Semantic Scholar API¶

import requests

response = requests.get(
    "https://api.semanticscholar.org/graph/v1/paper/search",
    params={"query": user_query, "limit": 10}
)
papers = response.json()["data"]

Benefits: - 200M+ papers across all domains - Free API with generous rate limits - Includes citations, references, metadata

PubMed API¶

from Bio import Entrez

Entrez.email = "your_email@example.com"
handle = Entrez.esearch(db="pubmed", term=query, retmax=10)
results = Entrez.read(handle)

Benefits: - Medical and life sciences papers - High-quality, peer-reviewed

User Experience Improvements¶

Streaming Responses¶

Current: Wait for full response

Proposed: Stream tokens as they're generated

Implementation:

for chunk in gemini_client.generate_stream(prompt):
    print(chunk, end="", flush=True)

Benefits: - Immediate feedback - Users can read while generation continues - Better perceived latency

Web UI¶

Current: CLI only

Proposed: Streamlit or Gradio web app

Streamlit Example:

import streamlit as st

question = st.text_input("Ask a research question")
if st.button("Ask"):
    with st.spinner("Researching..."):
        answer = ask_question(question)
    st.markdown(answer)

Benefits: - No terminal knowledge required - Richer UI (charts, images, interactive tables) - Shareable URLs

Cost & Performance Improvements¶

Response Caching¶

Current: Every question hits API

Proposed: Cache by (question, context_hash)

Implementation:

import hashlib

cache = {}
key = f"{question}:{hashlib.md5(context.encode()).hexdigest()}"

if key in cache:
    return cache[key]

response = gemini_client.generate(prompt)
cache[key] = response
return response

Benefits: - Instant responses for repeat questions - Lower API costs - Reduced rate limit pressure

Async Parallelization¶

Current: Sequential API calls

Proposed: Parallel with asyncio

Implementation:

import asyncio

async def parallel_calls():
    tasks = [
        gemini_async.generate(prompt1),
        scaledown_async.compress(context1),
        scaledown_async.compress(context2)
    ]
    return await asyncio.gather(*tasks)

Benefits: - 50% latency reduction - Better resource utilization

ScaleDown Batching¶

Current: One API call per compression

Proposed: Batch multiple compressions

Implementation:

from scaledown import batch_compress

results = batch_compress([
    {"context": chunk1, "prompt": query},
    {"context": chunk2, "prompt": query},
    {"context": chunk3, "prompt": query}
])

Benefits: - Lower latency (1 network round-trip vs 3) - Potential cost savings

Evaluation & Monitoring¶

Automated Hallucination Detection¶

Current: Manual verification

Proposed: ScaleDown evaluation pipeline

Implementation:

from scaledown import evaluate_response

score = evaluate_response(
    question=user_question,
    answer=cot_answer,
    sources=retrieved_chunks
)

Benefits: - Quantified quality metrics - Automatic flagging of low-confidence answers - A/B testing of prompts and models

System Metrics¶

Proposed: Track latency, costs, cache hits

Implementation:

import time

start = time.time()
response = gemini_client.generate(prompt)
latency = time.time() - start

metrics.log({"latency": latency, "tokens": len(response), "cost": calculate_cost(response)})

Benefits: - Identify bottlenecks - Optimize slow stages - Cost tracking per session

Next Steps¶

Priority order: 1. ScaleDown Python SDK (easiest, immediate code cleanup) 2. Better PDF extraction (pymupdf instead of PyPDF2) 3. Semantic embeddings (biggest retrieval quality gain) 4. Response caching (instant repeat queries) 5. Async parallelization (50% latency reduction)

See Project Structure for where these changes would fit in the codebase.