Building an Institutional Chatbot in the Era of Generative AI
The rise of artificial intelligence in education opens the way to new modes of interaction between students and institutions. At the heart of this dynamic, chatbots stand out as a promising way to answer frequently asked questions, guide newcomers, and ease the load on administrative services. But these tools must also cope with an unstable environment in which information is constantly evolving.
At INSA Toulouse, we undertook the creation of an institutional chatbot capable of facing these challenges. After testing different architectures, from the most basic to the most sophisticated, it was ultimately the RAG (Retrieval-Augmented Generation) model that proved to be the most suitable.
A Modern History of Chatbots
The roots of chatbots go back to Alan Turing and his famous question about the ability of machines to think. Since ELIZA, the first conversational program, born in the 1960s, progress has been spectacular. The introduction of deep learning and transformers enabled decisive breakthroughs. Today, systems such as ChatGPT or Siri can understand natural language and generate remarkably fluent responses.
Among the notable architectures:
- GPT (Generative Pre-trained Transformer), the current reference in text generation
- MoE (Mixture of Experts), used in Mixtral-8x7B
- RNNs (recurrent neural networks), now largely superseded but historically important
Data Collection: A Scraper for INSA
To feed our chatbot, we developed a Java scraper targeting the public sites of INSA Toulouse and its Moodle platform. Results:
- 6,215 pages collected in 20 minutes
- Approximately 4.45 million words extracted
- 55% of documents with identifiable update date

Some limitations emerged: image content could not be read, scanned PDFs were unusable, and complex tables were poorly extracted. Even so, the volume of raw text proved sufficient for our first experiments.
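For illustration, the crawling logic can be sketched as follows (the project's actual scraper is written in Java; the Python version below, with its placeholder seed URL and page limit, only shows the general idea):

```python
# Minimal crawl-and-extract sketch (illustrative only; the real scraper was Java).
# The seed URL, page limit, and word counting are placeholder assumptions.
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

SEED = "https://www.insa-toulouse.fr/"   # placeholder seed URL
MAX_PAGES = 100                          # small limit for the sketch

def crawl(seed: str, max_pages: int) -> dict[str, str]:
    """Breadth-first crawl restricted to the seed's domain; returns url -> text."""
    domain = urlparse(seed).netloc
    queue, seen, pages = deque([seed]), {seed}, {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        soup = BeautifulSoup(html, "html.parser")
        pages[url] = soup.get_text(separator=" ", strip=True)
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages

if __name__ == "__main__":
    corpus = crawl(SEED, MAX_PAGES)
    print(len(corpus), "pages,", sum(len(t.split()) for t in corpus.values()), "words")
```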
First Attempt: Building a Model from Scratch
We attempted to design a Small Language Model (SLM) with PyTorch, testing several tokenization configurations (character-level and subword).
The datasets used:
- Shakespeare (2 MB, Early Modern English)
- Wikipedia in French (10 GB)
Despite some improvements in loss, the results remained very far from our expectations: the model neither learned syntax properly nor produced coherent sentences.
| Name | Training Device | Language | Dataset | Final Loss | Context Size | Batch Size | Duration (h:mm:ss) |
|---|---|---|---|---|---|---|---|
| Scratch-v0.0 | Intel Core i7-8700 | English | Shakespeare (2 MB) | 1.1268 | 192 | 32 | 03:24:00 |
| Scratch-v0.1 | Intel Core i7-8700 | English | Shakespeare (2 MB) | 0.8046 | 192 | 32 | 06:11:00 |
| Scratch-v1.0 | 2×GPU NVIDIA RTX A4500 | English | Shakespeare (2 MB) | 0.9041 | 512 | 32 | 00:28:00 |
| Scratch-v2.0 | 2×GPU NVIDIA RTX A4500 | French | Wikipedia (10 GB) | 1.0603 | 512 | 32 | 00:27:00 |
| Scratch-v2.3 | 2×GPU NVIDIA RTX A4500 | French | Wikipedia (10 GB) | 0.6484 | 512 | 48 | 03:31:00 |
The main limitation was hardware: RTX A4500-class GPUs are not enough for training at this depth. Results comparable to GPT-2 would have required weeks of training on distributed infrastructure.
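As an illustration of the character-level configuration, a minimal tokenization and batching sketch in PyTorch could look like this (the corpus file name is a placeholder; the 512-token context and batch size of 32 match the table above, everything else is simplified):

```python
# Minimal char-level tokenization and batch sampling sketch (PyTorch).
# Context size 512 and batch size 32 match the later runs; the rest is simplified.
import torch

text = open("corpus.txt", encoding="utf-8").read()    # e.g. the Shakespeare file
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}             # char -> id
itos = {i: c for c, i in stoi.items()}                 # id -> char

def encode(s: str) -> torch.Tensor:
    return torch.tensor([stoi[c] for c in s], dtype=torch.long)

def decode(ids: torch.Tensor) -> str:
    return "".join(itos[int(i)] for i in ids)

data = encode(text)
block_size, batch_size = 512, 32

def get_batch() -> tuple[torch.Tensor, torch.Tensor]:
    """Sample random (input, target) windows shifted by one character."""
    ix = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([data[i : i + block_size] for i in ix])
    y = torch.stack([data[i + 1 : i + block_size + 1] for i in ix])
    return x, y
```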
Second Attempt: Fine-tuning GPT-2
We then opted for adapting an existing model: GPT-2. The idea was to use a pre-trained model, then specialize it with an internal dataset (100 documents from our scraping).
Training was done locally, with:
- 3 epochs
- batch size of 2
- Intel Core i7-13700H CPU
- a dedicated HuggingFace tokenizer
Despite these efforts, only 3.82% of responses were fully satisfactory according to two human evaluators. Moreover, any data update would require tedious and energy-intensive retraining.
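For reference, the fine-tuning setup was a fairly standard HuggingFace recipe. A minimal sketch, assuming the scraped documents have been concatenated into a plain text file (the file name, output directory, and tokenizer choice here are assumptions, not the project's exact configuration):

```python
# Minimal GPT-2 fine-tuning sketch with HuggingFace Transformers.
# File name and output directory are placeholders; epochs and batch size match the text.
from datasets import load_dataset
from transformers import (DataCollatorForLanguageModeling, GPT2LMHeadModel,
                          GPT2TokenizerFast, Trainer, TrainingArguments)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Placeholder: one scraped document per line in insa_corpus.txt
dataset = load_dataset("text", data_files={"train": "insa_corpus.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

args = TrainingArguments(
    output_dir="gpt2-insa",          # placeholder output directory
    num_train_epochs=3,              # 3 epochs, as in the text
    per_device_train_batch_size=2,   # batch size of 2
    no_cuda=True,                    # CPU-only training (Core i7-13700H)
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```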
RAG: The Hybrid Solution That Changes Everything
The Retrieval-Augmented Generation approach combines the best of both worlds: semantic search and response generation.
The process:
- Documents are split into 1000-character segments
- Each segment is transformed into a vector via MiniLM
- Vectors are indexed in FAISS
- At query time, the closest segments are retrieved and injected, together with the question, into Mixtral-8x7B

Main advantage: the model can rely on up-to-date documents, without requiring retraining.
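A minimal sketch of this pipeline, using sentence-transformers for the MiniLM embeddings and FAISS for the index (the exact model name, the index type, and the way Mixtral-8x7B is called are assumptions, not the project's precise configuration):

```python
# Minimal RAG sketch: 1000-character chunks -> MiniLM embeddings -> FAISS index
# -> retrieve top-k chunks -> build a prompt for the generator (e.g. Mixtral-8x7B).
# Model name, index type, and the corpus file are placeholder assumptions.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def chunk(text: str, size: int = 1000) -> list[str]:
    """Split a document into fixed-size character segments."""
    return [text[i : i + size] for i in range(0, len(text), size)]

# documents: list of raw page texts produced by the scraper
documents = [open("page1.txt", encoding="utf-8").read()]   # placeholder corpus
chunks = [c for doc in documents for c in chunk(doc)]

embeddings = embedder.encode(chunks, convert_to_numpy=True).astype("float32")
index = faiss.IndexFlatL2(embeddings.shape[1])             # exact L2 index
index.add(embeddings)

def retrieve(question: str, k: int = 4) -> list[str]:
    """Return the k chunks closest to the question in embedding space."""
    q = embedder.encode([question], convert_to_numpy=True).astype("float32")
    _, ids = index.search(q, k)
    return [chunks[i] for i in ids[0]]

def build_prompt(question: str) -> str:
    """Assemble the retrieved context and the question into one prompt."""
    context = "\n\n".join(retrieve(question))
    return (f"Answer in French, using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")

# The resulting prompt is then sent to the generator (Mixtral-8x7B in our case)
# through whatever inference endpoint hosts the model.
```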
Testing the Assistant: IAN
We developed an interface with Streamlit, giving birth to IAN (INSA Artificial Intelligence). The assistant is required to provide:
- Responses in French
- A formal and concise tone
- A clear indication when information is unavailable
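As a rough illustration, these constraints can be expressed in the system prompt of a small Streamlit front end (the wording and function names below are placeholders, not IAN's exact prompt or code):

```python
# Minimal Streamlit front-end sketch. SYSTEM_PROMPT paraphrases IAN's constraints
# (answer in French, formal and concise tone, say when information is missing);
# ask_ian() is a placeholder standing in for the RAG pipeline call shown earlier.
import streamlit as st

SYSTEM_PROMPT = (
    "You are IAN, the INSA Toulouse assistant. Always answer in French, "
    "in a formal and concise tone. If the information is not present in the "
    "provided documents, say so clearly."
)

def ask_ian(question: str) -> str:
    """Placeholder: prepend SYSTEM_PROMPT, retrieve context, and call the LLM."""
    return "Réponse générée par le pipeline RAG."   # placeholder answer

st.title("IAN - INSA Toulouse assistant")
question = st.text_input("Your question:")
if question:
    st.write(ask_ian(question))
```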

On a test of 8 questions, 7 responses were deemed relevant, both for questions about regulations and for general interactions.
Evaluating Relevance: Contradictions and Context
We developed several tools to detect system weaknesses:
- Cosine similarity computation between chunks
- Cross-encoder re-rankers (MiniLM)
- A binary classifier estimating whether a question is answerable
An interesting finding: the spread of FAISS scores can indicate whether the retrieved documents are useful. A narrow range suggests low relevance; a wide range suggests diverse results and thus good coverage.
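A sketch of two of these tools, the cross-encoder re-ranker and the score-range heuristic (the model name and the threshold value are illustrative assumptions):

```python
# Sketch: cross-encoder re-ranking of retrieved chunks and the FAISS
# score-range heuristic. The model name and range threshold are assumptions.
import numpy as np
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(question: str, candidates: list[str]) -> list[str]:
    """Order retrieved chunks by cross-encoder relevance to the question."""
    scores = reranker.predict([(question, c) for c in candidates])
    order = np.argsort(scores)[::-1]
    return [candidates[i] for i in order]

def looks_answerable(faiss_distances: np.ndarray, min_range: float = 0.15) -> bool:
    """Heuristic: a wider spread of FAISS scores suggests diverse, useful chunks."""
    return float(faiss_distances.max() - faiss_distances.min()) >= min_range
```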
Conclusion: A Promising Model
Our work demonstrates that RAG, combined with a well-structured database, can become a reliable tool to help students in an academic setting. It far surpasses the models we trained or fine-tuned locally.
Not everything is perfect, of course. We still need to:
- Improve ambiguity management
- Add structured sources (schedules, SQL databases)
- Integrate multimodality (images, forms)
But the foundations are solid.
Technical Perspectives
We are considering several improvements:
- Divide the FAISS index into specialized sub-indexes (see the sketch after this list)
- Use a classifier to decide whether a query should be rewritten
- Explore structured data to enrich responses
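As a very rough sketch of the first avenue, queries could be routed to theme-specific FAISS sub-indexes (the themes, vector dimension, and the trivial router below are hypothetical):

```python
# Hypothetical sketch: route each query to a theme-specific FAISS sub-index.
# Themes, the 384-d MiniLM dimension, and the stub router are placeholder choices.
import faiss
import numpy as np

DIM = 384   # MiniLM embedding size
sub_indexes = {
    "regulations": faiss.IndexFlatL2(DIM),
    "courses": faiss.IndexFlatL2(DIM),
    "student_life": faiss.IndexFlatL2(DIM),
}

def route(question: str) -> str:
    """Placeholder: a small classifier would pick the most relevant theme."""
    return "regulations"

def search(question_vec: np.ndarray, question: str, k: int = 4):
    """Search only the sub-index chosen by the router."""
    index = sub_indexes[route(question)]
    return index.search(question_vec.astype("float32").reshape(1, -1), k)
```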
These avenues open the way to a powerful and agile school assistant.
Acknowledgments
Thanks to Philippe Leleux, Eric Alata, and Céline Peyraube for their support. This project was conducted rigorously: no generative AI tool was used for research or analysis, only to improve the clarity of the text.