Challenges and insights in developing and evaluating RAG assistants

Generative AI and Official Statistics Workshop 2025

Insee

12/05/2025

Initial example

Is this a good answer? Hard to tell.

Answer from ChatGPT

Answer from Google

Both answers are better: precise, contextual

Why RAG?

  • Retrieval-Augmented Generation (RAG) combines:
    • Information retrieval from a knowledge base
    • Text generation using an LLM with contextualized information

Objective:

  • Produce accurate information (no hallucination)
  • Produce verifiable information (source citation)
  • Provide up-to-date answers
  • Interpret the meaning of the question, unlike traditional bag-of-words text queries

Why do that?

  • Asking Google is great, but the user needs good keywords
    • Assume the user knows what she wants…
    • … and has some literacy
  • LLMs are more and more used as search engines
    • How can we best structure information on our website so that responses are relevant?
  • We have 20+ years of experience in understanding how Google works
    • We also need to understand how LLMs work
    • Experimenting with RAG is a good way to do that

Typical RAG pipeline

Challenge: parsing and chunking documents

  • How should we parse the documents?
  • How should we handle tables?
  • How should we handle document metadata that can be useful?
  • Should we split the pages?
  • How long should each chunk be?
  • How should we chunk? (see the chunking sketch below)
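As a concrete illustration of these chunking questions, here is a minimal sketch using LangChain's RecursiveCharacterTextSplitter (we use LangChain for document handling); the token encoder, file name and exact sizes are illustrative assumptions, not our final settings.

```python
# Minimal chunking sketch with LangChain (sizes and file name are illustrative).
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Token-based splitting; the tiktoken encoding chosen here is an assumption.
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=1100,     # medium-sized chunks
    chunk_overlap=100,   # small overlap to keep context across chunk boundaries
)

# One parsed insee.fr page as a LangChain Document (hypothetical file).
docs = [Document(page_content=open("parsed_page.txt").read(),
                 metadata={"source": "insee-page-id"})]
chunks = splitter.split_documents(docs)
```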

Challenge: embedding and retrieval

  • Which embedding model should I choose?
  • Is the best-performing embedding on MTEB relevant for my use case?
  • Which backend should I use for embeddings? (vLLM, Ollama…)
  • Which vector database should I use? (ChromaDB, Qdrant…)
  • How do I keep my vector database always available to my RAG in production?
  • Should I use semantic search only, or hybrid search?
  • How many documents should I retrieve? (see the retrieval sketch below)
  • Should I rerank? How?
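A hedged sketch of one possible retrieval setup: embed the chunks with an OpenAI-compatible endpoint (as served by vLLM), store them in ChromaDB, then query the top-k chunks. Endpoint URL, collection name, k and the example question are assumptions; `chunks` comes from the chunking sketch above.

```python
# Hedged retrieval sketch: index chunks in ChromaDB, then query the top-k.
import chromadb
from openai import OpenAI

# OpenAI-compatible embedding endpoint served by vLLM (URL is an assumption).
emb_client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def embed(texts: list[str]) -> list[list[float]]:
    resp = emb_client.embeddings.create(
        model="BAAI/bge-multilingual-gemma2", input=texts
    )
    return [d.embedding for d in resp.data]

client = chromadb.PersistentClient(path="./chroma")
collection = client.get_or_create_collection("insee-pages")

texts = [c.page_content for c in chunks]  # `chunks` from the chunking sketch
collection.add(
    ids=[str(i) for i in range(len(texts))],
    documents=texts,
    embeddings=embed(texts),
    metadatas=[c.metadata for c in chunks],
)

# Semantic search only; hybrid search would also combine a keyword index.
question = "How is the unemployment rate measured?"   # illustrative question
results = collection.query(query_embeddings=embed([question]), n_results=5)
retrieved_chunks = results["documents"][0]
retrieved_metadatas = results["metadatas"][0]
```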

Challenge: generation

  • Which generative model should I use?
  • How should I prompt it to ensure context citation and avoid hallucinations? (see the prompting sketch below)
  • How should I prompt it if several different use cases are covered?
  • Which backend should I use?
  • How should I expose it to users? Should I expose it to a happy few first?
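A sketch of one way to prompt the generator through vLLM's OpenAI-compatible chat endpoint, asking it to answer only from the retrieved context and to cite its sources. The prompt wording, URL and sampling parameters are illustrative assumptions, not our production prompt; `retrieved_chunks`, `retrieved_metadatas` and `question` come from the retrieval sketch above.

```python
# Hedged generation sketch: grounded answer with source citations, streamed.
from openai import OpenAI

gen_client = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")  # assumed URL

SYSTEM_PROMPT = (
    "You are an assistant for official statistics. Answer only from the provided "
    "context and cite the source of every fact. If the context does not contain "
    "the answer, say that you do not know."
)

# Number the chunks so the model can cite them as [1], [2], ...
context = "\n\n".join(
    f"[{i + 1}] ({meta.get('source', 'unknown')}) {text}"
    for i, (text, meta) in enumerate(zip(retrieved_chunks, retrieved_metadatas))
)

stream = gen_client.chat.completions.create(
    model="mistralai/Mistral-Small-24B-Instruct-2501",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
    temperature=0.2,  # low temperature to limit invention (assumption)
    stream=True,      # streaming keeps the interface responsive
)
for event in stream:
    delta = event.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```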

RAG is hard

RAG needs evaluation

  • Evaluation challenges:
    • Is the retrieved context relevant?
    • Is the generation faithful to the context?
    • Is the answer useful to end users (e.g. analysts, statisticians)?
  • Generic metrics are not that useful
    • Better to define use-case-related objectives
    • And adapt pipelines to that end
  • Existing plug-and-play frameworks show limitations
    • To build a good RAG, you need to go into the details

Evaluation

Many metrics exist

```mermaid
flowchart TD
    A[RAG Evaluation] --> B[Retrieval Quality]
    A --> C[Generation Quality]
    A --> D[End-to-End Evaluation]
    B --> E[Precision, Recall, F-score,<br> NDCG, MRR...]
    C --> F[Accuracy, Faithfulness,<br> Relevance, ROUGE/BLEU/METEOR,<br> Hallucination...]
    D --> G[Helpfulness, Consistency,<br> Conciseness,<br> Latency, Satisfaction...]
```

An attempt to classify RAG metrics

They are not that helpful

  • RAG quality depends on so many dimensions…
  • … that we came to understand Kierkegaard’s concept of the “vertigo of freedom”

We need to know what we want

  • The best way forward: read Hamel Husain’s blog
    • Notably the Husain (2024) and Husain (2025) posts
    • A pragmatic approach
  • Better to start with a limited set of metrics

“The kind of dashboard that foreshadows failure.” Husain (2025)

What do we want?

  • No hallucinations!
    • How many invented references or facts?
    • Hallucination rate
  • Retrieve relevant content
    • Does the retriever find the relevant page/document for a given question?
    • Top-k retrieval
  • Have a useful companion to official statistics
    • Given the sources, is the answer satisfactory?
    • Satisfaction rate (the three metrics are sketched below)
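A minimal sketch of how these three metrics could be computed from labelled interactions; the field names are assumptions about how the logs might be stored, not our actual schema.

```python
# Hedged sketch: the three headline metrics from manually labelled interactions.
from dataclasses import dataclass
from typing import Optional

@dataclass
class LabelledInteraction:
    hallucinated: bool          # expert spotted an invented fact or reference
    relevant_in_top_k: bool     # expected page was among the retrieved chunks
    thumbs_up: Optional[bool]   # user feedback, None if none was given

def headline_metrics(logs: list[LabelledInteraction], k: int = 5) -> dict[str, float]:
    rated = [log for log in logs if log.thumbs_up is not None]
    return {
        "hallucination_rate": sum(log.hallucinated for log in logs) / len(logs),
        f"top{k}_retrieval": sum(log.relevant_in_top_k for log in logs) / len(logs),
        "satisfaction_rate": sum(log.thumbs_up for log in rated) / max(len(rated), 1),
    }
```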

How did we evaluate?

Methodology

  • Evaluating RAG along different dimensions
  • Main models used:
    • Embedding: BAAI/bge-multilingual-gemma2
    • Generation: mistralai/Mistral-Small-24B-Instruct-2501
  • Collected domain-expert Q&A
    • Small sample: 62 questions
    • More questions will come later

Technical details

  • 2 vLLM instances (OpenAI-API-compatible endpoints) in the backend
    • Running on an Nvidia H100 GPU
  • ChromaDB as the vector database
  • LangChain for document handling
  • Streamlit for the front-end user interface

Note

This is quite a demanding pipeline:

  • Embedding and generation instances must be available at each user query

1. Collect expert-level annotations

To challenge retrieval before any product launch

  • Once again, Husain (2024) is right:
    • Many frameworks can create tricky questions using an LLM (RAGAS, Giskard…)
    • But nothing works better than starting from expert questions and answers in a spreadsheet (see the sketch below)

Tip

Collect existing questions from the insee.fr website (e.g. here) or write original ones
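A sketch of how such a spreadsheet can be used to measure top-k retrieval; the file and column names are hypothetical, and `collection` and `embed` refer to the retrieval sketch earlier.

```python
# Hedged sketch: top-k retrieval measured on the expert Q&A spreadsheet.
import pandas as pd

qa = pd.read_excel("expert_questions.xlsx")  # hypothetical file with the 62 questions

def hit_at_k(question: str, expected_source: str, k: int = 5) -> bool:
    # `collection` and `embed` come from the retrieval sketch above.
    results = collection.query(query_embeddings=embed([question]), n_results=k)
    retrieved_sources = [meta.get("source") for meta in results["metadatas"][0]]
    return expected_source in retrieved_sources

qa["hit"] = [
    hit_at_k(q, s) for q, s in zip(qa["question"], qa["expected_source"])
]
print(f"Top-5 retrieval: {qa['hit'].mean():.0%}")
```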

1. Collect expert-level annotations

To challenge retrieval before any product launch

  • Helped us iterate toward a “satisfying” strategy for parsing and chunking
    • Need medium-sized chunks (around 1,100 tokens)
    • No more than 1,500 tokens, to avoid “lost in the middle” effects
  • Cast tables aside
    • Hard to chunk, hard to interpret out of context
    • Prioritizing text content

Note

This dataset can later be reused for any parametric change in our RAG pipeline

2. Collect user feedback

To ensure we satisfy user needs

  • Having an interface gamifies the evaluation process…
    • … which helps with collecting evaluations
  • We want users to give honest feedback along different dimensions:
    • Sources used, quality of the answer
    • Free-form comments for manual inspection, to understand what does not work
    • Simple feedback (👍️/👎️) to track the satisfaction rate (see the Streamlit sketch below)

Tip

We need good satisfaction rates before A/B testing!
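For illustration, a sketch of what the feedback widgets could look like in the Streamlit front end; widget keys, wording and the in-memory logging are assumptions, and st.feedback requires a recent Streamlit release.

```python
# Hedged sketch of feedback collection in the Streamlit front end.
import streamlit as st

st.markdown(st.session_state.get("last_answer", ""))  # answer rendered above the widgets

thumbs = st.feedback("thumbs", key="simple_feedback")  # 👍/👎 for the satisfaction rate
sources_ok = st.checkbox("The cited sources are relevant", key="sources_feedback")
comment = st.text_area("What did not work? (free form)", key="comment_feedback")

if st.button("Send feedback"):
    # In-memory log for the sketch; a real deployment would persist this.
    st.session_state.setdefault("feedback_log", []).append(
        {"thumbs": thumbs, "sources_ok": sources_ok, "comment": comment}
    )
    st.success("Thank you!")
```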

Retriever quality

  • Helped monitor retriever quality
  • Helped understand problems in website parsing

RAG behavior

  • Helped us find a satisfying prompt

First user feedback

  • First feedback is mostly positive:
    • Streaming is fast
  • Main negative feedback:
    • The retriever returns outdated papers (how to prioritize recent content?)

Easy to put into production thanks to:

  • Our modular approach (ChromaDB and vLLM APIs)
  • Our Kubernetes infrastructure, SSPCloud

Remaining challenges

  • Need to prioritize recent content (one possible direction is sketched below)
  • Need to prioritize national statistics unless the question is about a specific area
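One possible direction for the recency problem (a sketch of an idea, not something we have deployed): decay the similarity score with the age of the publication before ranking. The two-year half-life is an arbitrary assumption.

```python
# Hedged sketch: recency-weighted ranking of retrieved documents.
from datetime import date

def recency_weighted_score(similarity: float, published: date,
                           half_life_days: float = 730.0) -> float:
    """Down-weight older documents; the two-year half-life is an assumption."""
    age_days = (date.today() - published).days
    return similarity * 0.5 ** (age_days / half_life_days)

# A slightly less similar but much more recent paper can now rank first.
print(recency_weighted_score(0.80, date(2024, 6, 1)))
print(recency_weighted_score(0.85, date(2015, 6, 1)))
```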

Conclusion

  • RAG quality depends first (and, in my opinion, mostly) on how documents are parsed and processed
    • Back to the information retrieval problem!
    • See Barnett et al. (2024)
    • A lot of RAG resources focus on short documents…
  • We struggled for a long time because of poor technical choices
    • Other choices (e.g. reranking, generative models…) can be handled once the pipeline is good

Conclusion

  • llms.txt (llmstxt.org/): a proposal to standardize website content for LLM ingestion
  • After SEO, we will have GEO (Generative Engine Optimization)
    • We want easy access to reliable information

References

Barnett, Scott, Stefanus Kurniawan, Srikanth Thudumu, Zach Brannelly, and Mohamed Abdelrazek. 2024. “Seven Failure Points When Engineering a Retrieval Augmented Generation System”. https://arxiv.org/abs/2401.05856.
Husain, Hamel. 2024. “Your AI Product Needs Evals”. https://hamel.dev/blog/posts/evals/.
———. 2025. “A Field Guide to Rapidly Improving AI Products”. https://hamel.dev/blog/posts/field-guide/.