Building an Institutional Chatbot in the Era of Generative AI
The rise of artificial intelligence in education opens the way to new modes of interaction between students and institutions. At the heart of this dynamic, chatbots stand out as a promising way to answer frequently asked questions, guide newcomers, and ease the load on administrative services. But these tools must also cope with an unstable environment in which information is constantly evolving.
At INSA Toulouse, we undertook the creation of an institutional chatbot capable of facing these challenges. After testing different architectures, from the most basic to the most sophisticated, it was ultimately the RAG (Retrieval-Augmented Generation) model that proved to be the most suitable.
A Modern History of Chatbots
The roots of chatbots go back to Alan Turing and his famous question about the ability of machines to think. Since ELIZA, the first conversational program, born in the 1960s, progress has been spectacular. The introduction of deep learning and transformers enabled decisive breakthroughs. Today, systems such as ChatGPT or Siri can understand natural language and generate remarkably fluent responses.
Among the notable architectures:
- GPT (Generative Pre-trained Transformer), the current reference in text generation
- MoE (Mixture of Experts), used in Mixtral-8x7B
- RNNs (recurrent neural networks), now largely superseded but historically important
Data Collection: A Scraper for INSA
To feed our chatbot, we developed a Java scraper targeting the public sites of INSA Toulouse and its Moodle platform. Results:
- 6,215 pages collected in 20 minutes
- Approximately 4.45 million words extracted
- 55% of documents with identifiable update date

Some limitations emerged: image content could not be read, scanned PDFs were unusable, and complex tables were poorly extracted. Even so, the volume of raw text proved sufficient for our first experiments.
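For illustration, the crawling logic can be sketched as follows (the project's actual scraper is written in Java; the Python version below, with its placeholder seed URL and page limit, only shows the general idea):

```python
# Minimal crawl-and-extract sketch (illustrative only; the real scraper was Java).
# The seed URL, page limit, and word counting are placeholder assumptions.
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

SEED = "https://www.insa-toulouse.fr/"   # placeholder seed URL
MAX_PAGES = 100                          # small limit for the sketch

def crawl(seed: str, max_pages: int) -> dict[str, str]:
    """Breadth-first crawl restricted to the seed's domain; returns url -> text."""
    domain = urlparse(seed).netloc
    queue, seen, pages = deque([seed]), {seed}, {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        soup = BeautifulSoup(html, "html.parser")
        pages[url] = soup.get_text(separator=" ", strip=True)
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages

if __name__ == "__main__":
    corpus = crawl(SEED, MAX_PAGES)
    print(len(corpus), "pages,", sum(len(t.split()) for t in corpus.values()), "words")
```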
First Attempt: Building a Model from Scratch
We attempted to design a Small Language Model (SLM) with PyTorch, testing several tokenization configurations (character-level and subword).
The datasets used:
- Shakespeare (2 MB, Early Modern English)
- Wikipedia in French (10 GB)
Despite some improvements in loss, the results remained very far from our expectations: the model neither learned syntax properly nor produced coherent sentences.
| Name | Training Device | Language | Dataset | Final Loss | Context Size | Batch Size | Duration (h:mm:ss) |
|---|---|---|---|---|---|---|---|
| Scratch-v0.0 | Intel Core i7-8700 | English | Shakespeare (2 MB) | 1.1268 | 192 | 32 | 03:24:00 |
| Scratch-v0.1 | Intel Core i7-8700 | English | Shakespeare (2 MB) | 0.8046 | 192 | 32 | 06:11:00 |
| Scratch-v1.0 | 2×GPU NVIDIA RTX A4500 | English | Shakespeare (2 MB) | 0.9041 | 512 | 32 | 00:28:00 |
| Scratch-v2.0 | 2×GPU NVIDIA RTX A4500 | French | Wikipedia (10 GB) | 1.0603 | 512 | 32 | 00:27:00 |
| Scratch-v2.3 | 2×GPU NVIDIA RTX A4500 | French | Wikipedia (10 GB) | 0.6484 | 512 | 48 | 03:31:00 |
The main limitation was hardware: RTX A4500-class GPUs are not enough for training at this depth. Results comparable to GPT-2 would have required weeks of training on distributed infrastructure.
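As an illustration of the character-level configuration, a minimal tokenization and batching sketch in PyTorch could look like this (the corpus file name is a placeholder; the 512-token context and batch size of 32 match the table above, everything else is simplified):

```python
# Minimal char-level tokenization and batch sampling sketch (PyTorch).
# Context size 512 and batch size 32 match the later runs; the rest is simplified.
import torch

text = open("corpus.txt", encoding="utf-8").read()    # e.g. the Shakespeare file
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}             # char -> id
itos = {i: c for c, i in stoi.items()}                 # id -> char

def encode(s: str) -> torch.Tensor:
    return torch.tensor([stoi[c] for c in s], dtype=torch.long)

def decode(ids: torch.Tensor) -> str:
    return "".join(itos[int(i)] for i in ids)

data = encode(text)
block_size, batch_size = 512, 32

def get_batch() -> tuple[torch.Tensor, torch.Tensor]:
    """Sample random (input, target) windows shifted by one character."""
    ix = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([data[i : i + block_size] for i in ix])
    y = torch.stack([data[i + 1 : i + block_size + 1] for i in ix])
    return x, y
```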
Second Attempt: Fine-tuning GPT-2
We then opted for adapting an existing model: GPT-2. The idea was to use a pre-trained model, then specialize it with an internal dataset (100 documents from our scraping).
Training was done locally, with:
- 3 epochs
- batch size of 2
- Intel Core i7-13700H CPU
- a dedicated HuggingFace tokenizer
Despite these efforts, only 3.82% of responses were fully satisfactory according to two human evaluators. Moreover, any data update would require tedious and energy-intensive retraining.
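For reference, the fine-tuning setup was a fairly standard HuggingFace recipe. A minimal sketch, assuming the scraped documents have been concatenated into a plain text file (the file name, output directory, and tokenizer choice here are assumptions, not the project's exact configuration):

```python
# Minimal GPT-2 fine-tuning sketch with HuggingFace Transformers.
# File name and output directory are placeholders; epochs and batch size match the text.
from datasets import load_dataset
from transformers import (DataCollatorForLanguageModeling, GPT2LMHeadModel,
                          GPT2TokenizerFast, Trainer, TrainingArguments)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Placeholder: one scraped document per line in insa_corpus.txt
dataset = load_dataset("text", data_files={"train": "insa_corpus.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

args = TrainingArguments(
    output_dir="gpt2-insa",          # placeholder output directory
    num_train_epochs=3,              # 3 epochs, as in the text
    per_device_train_batch_size=2,   # batch size of 2
    no_cuda=True,                    # CPU-only training (Core i7-13700H)
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```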
RAG: The Hybrid Solution That Changes Everything
The Retrieval-Augmented Generation approach combines the best of both worlds: semantic search and response generation.
The process:
- Documents are split into 1000-character segments
- Each segment is transformed into a vector via MiniLM
- Vectors are indexed in FAISS
- At query time, the closest segments are retrieved and injected, together with the question, into Mixtral-8x7B

Main advantage: the model can rely on up-to-date documents, without requiring retraining.
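A minimal sketch of this pipeline, using sentence-transformers for the MiniLM embeddings and FAISS for the index (the exact model name, the index type, and the way Mixtral-8x7B is called are assumptions, not the project's precise configuration):

```python
# Minimal RAG sketch: 1000-character chunks -> MiniLM embeddings -> FAISS index
# -> retrieve top-k chunks -> build a prompt for the generator (e.g. Mixtral-8x7B).
# Model name, index type, and the corpus file are placeholder assumptions.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def chunk(text: str, size: int = 1000) -> list[str]:
    """Split a document into fixed-size character segments."""
    return [text[i : i + size] for i in range(0, len(text), size)]

# documents: list of raw page texts produced by the scraper
documents = [open("page1.txt", encoding="utf-8").read()]   # placeholder corpus
chunks = [c for doc in documents for c in chunk(doc)]

embeddings = embedder.encode(chunks, convert_to_numpy=True).astype("float32")
index = faiss.IndexFlatL2(embeddings.shape[1])             # exact L2 index
index.add(embeddings)

def retrieve(question: str, k: int = 4) -> list[str]:
    """Return the k chunks closest to the question in embedding space."""
    q = embedder.encode([question], convert_to_numpy=True).astype("float32")
    _, ids = index.search(q, k)
    return [chunks[i] for i in ids[0]]

def build_prompt(question: str) -> str:
    """Assemble the retrieved context and the question into one prompt."""
    context = "\n\n".join(retrieve(question))
    return (f"Answer in French, using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")

# The resulting prompt is then sent to the generator (Mixtral-8x7B in our case)
# through whatever inference endpoint hosts the model.
```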
Testing the Assistant: IAN
We developed an interface with Streamlit, giving birth to IAN (INSA Artificial Intelligence). The assistant is required to provide:
- Responses in French
- A formal and concise tone
- A clear indication when information is unavailable
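As a rough illustration, these constraints can be expressed in the system prompt of a small Streamlit front end (the wording and function names below are placeholders, not IAN's exact prompt or code):

```python
# Minimal Streamlit front-end sketch. SYSTEM_PROMPT paraphrases IAN's constraints
# (answer in French, formal and concise tone, say when information is missing);
# ask_ian() is a placeholder standing in for the RAG pipeline call shown earlier.
import streamlit as st

SYSTEM_PROMPT = (
    "You are IAN, the INSA Toulouse assistant. Always answer in French, "
    "in a formal and concise tone. If the information is not present in the "
    "provided documents, say so clearly."
)

def ask_ian(question: str) -> str:
    """Placeholder: prepend SYSTEM_PROMPT, retrieve context, and call the LLM."""
    return "Réponse générée par le pipeline RAG."   # placeholder answer

st.title("IAN - INSA Toulouse assistant")
question = st.text_input("Your question:")
if question:
    st.write(ask_ian(question))
```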

On a test of 8 questions, 7 responses were deemed relevant, both for questions about regulations and for general interactions.
Evaluating Relevance: Contradictions and Context
We developed several tools to detect system weaknesses:
- Cosine similarity computation between chunks
- Cross-encoder re-rankers (MiniLM)
- A binary classifier estimating whether a question is answerable
An interesting finding: the spread of FAISS scores can indicate whether the retrieved documents are useful. A narrow range suggests low relevance; a wide range suggests diverse results and thus good coverage.
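A sketch of two of these tools, the cross-encoder re-ranker and the score-range heuristic (the model name and the threshold value are illustrative assumptions):

```python
# Sketch: cross-encoder re-ranking of retrieved chunks and the FAISS
# score-range heuristic. The model name and range threshold are assumptions.
import numpy as np
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(question: str, candidates: list[str]) -> list[str]:
    """Order retrieved chunks by cross-encoder relevance to the question."""
    scores = reranker.predict([(question, c) for c in candidates])
    order = np.argsort(scores)[::-1]
    return [candidates[i] for i in order]

def looks_answerable(faiss_distances: np.ndarray, min_range: float = 0.15) -> bool:
    """Heuristic: a wider spread of FAISS scores suggests diverse, useful chunks."""
    return float(faiss_distances.max() - faiss_distances.min()) >= min_range
```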
Conclusion: A Promising Model
Our work demonstrates that RAG, combined with a well-structured database, can become a reliable tool to help students in an academic setting. It far surpasses the models we trained or fine-tuned locally.
Not everything is perfect, of course. We still need to:
- Improve ambiguity management
- Add structured sources (schedules, SQL databases)
- Integrate multimodality (images, forms)
But the foundations are solid.
Technical Perspectives
We are considering several improvements:
- Divide the FAISS index into specialized sub-indexes (see the sketch after this list)
- Use a classifier to decide whether a query should be rewritten
- Explore structured data to enrich responses
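As a very rough sketch of the first avenue, queries could be routed to theme-specific FAISS sub-indexes (the themes, vector dimension, and the trivial router below are hypothetical):

```python
# Hypothetical sketch: route each query to a theme-specific FAISS sub-index.
# Themes, the 384-d MiniLM dimension, and the stub router are placeholder choices.
import faiss
import numpy as np

DIM = 384   # MiniLM embedding size
sub_indexes = {
    "regulations": faiss.IndexFlatL2(DIM),
    "courses": faiss.IndexFlatL2(DIM),
    "student_life": faiss.IndexFlatL2(DIM),
}

def route(question: str) -> str:
    """Placeholder: a small classifier would pick the most relevant theme."""
    return "regulations"

def search(question_vec: np.ndarray, question: str, k: int = 4):
    """Search only the sub-index chosen by the router."""
    index = sub_indexes[route(question)]
    return index.search(question_vec.astype("float32").reshape(1, -1), k)
```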
These avenues open the way to a powerful and agile school assistant.
Acknowledgments
Thanks to Philippe Leleux, Eric Alata, and Céline Peyraube for their support. This project was conducted rigorously: no generative AI tool was used for research or analysis, only to improve the clarity of the text.