
1. Overview
Modern web applications are increasingly integrating with Large Language Models (LLMs) to build solutions like chatbots and virtual assistants.
However, while LLMs are powerful, they’re prone to generating hallucinations, and their responses may not always be relevant, appropriate, or factually accurate.
One solution for evaluating LLM responses is to use an LLM itself, preferably a separate one.
To achieve this, Spring AI defines the Evaluator interface and provides two implementations to check the relevance and factual accuracy of an LLM response: RelevancyEvaluator and FactCheckingEvaluator.
In this tutorial, we’ll explore how to use Spring AI Evaluators to test LLM responses. We’ll use the two basic implementations provided by Spring AI to evaluate the responses from a Retrieval-Augmented Generation (RAG) chatbot.
2. Building a RAG Chatbot
Before we can start testing LLM responses, we’ll need a chatbot to test. For our demonstration, we’ll build a simple RAG chatbot that answers user questions based on a set of documents.
We’ll use Ollama, an open-source tool, to pull and run our chat completion and embedding models locally.
2.1. Dependencies
Let’s start by adding the necessary dependencies to our project’s pom.xml file:
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-ollama-spring-boot-starter</artifactId>
    <version>1.0.0-M5</version>
</dependency>
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-markdown-document-reader</artifactId>
    <version>1.0.0-M5</version>
</dependency>
The Ollama starter dependency helps us to establish a connection with the Ollama service.
Additionally, we import Spring AI’s markdown document reader dependency, which we’ll use to convert .md files into documents that we can store in the vector store.
Since the current version, 1.0.0-M5, is a milestone release, we’ll also need to add the Spring Milestones repository to our pom.xml:
<repositories>
    <repository>
        <id>spring-milestones</id>
        <name>Spring Milestones</name>
        <url>https://repo.spring.io/milestone</url>
        <snapshots>
            <enabled>false</enabled>
        </snapshots>
    </repository>
</repositories>
This repository is where milestone versions are published, as opposed to the standard Maven Central repository.
Given that we’re using multiple Spring AI starters in our project, let’s also include the Spring AI Bill of Materials (BOM) in our pom.xml:
<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>org.springframework.ai</groupId>
            <artifactId>spring-ai-bom</artifactId>
            <version>1.0.0-M5</version>
            <type>pom</type>
            <scope>import</scope>
        </dependency>
    </dependencies>
</dependencyManagement>
With this addition, we can now remove the version tag from both of our Spring AI dependencies.
The BOM eliminates the risk of version conflicts and ensures our Spring AI dependencies are compatible with each other.
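For illustration, once the BOM manages the versions, the two dependency declarations from earlier simplify to:
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-ollama-spring-boot-starter</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-markdown-document-reader</artifactId>
</dependency>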
2.2. Configuring a Chat Completion and an Embedding Model
Next, let’s configure our chat completion and embedding models in the application.yaml file:
spring:
  ai:
    ollama:
      chat:
        options:
          model: llama3.3
      embedding:
        options:
          model: nomic-embed-text
      init:
        pull-model-strategy: when_missing
Here, we specify the llama3.3 model provided by Meta as our chat completion model and the nomic-embed-text model provided by Nomic AI as our embedding model. Feel free to try this implementation with different models.
Additionally, we set the pull-model-strategy to when_missing. This ensures that Spring AI pulls the specified models if they’re not available locally.
Once we configure valid models, Spring AI automatically creates beans of type ChatModel and EmbeddingModel, allowing us to interact with the chat completion and embedding models, respectively.
Let’s use them to define the additional beans required for our chatbot:
@Bean
public VectorStore vectorStore(EmbeddingModel embeddingModel) {
    return SimpleVectorStore
      .builder(embeddingModel)
      .build();
}

@Bean
public ChatClient contentGenerator(ChatModel chatModel, VectorStore vectorStore) {
    return ChatClient.builder(chatModel)
      .defaultAdvisors(new QuestionAnswerAdvisor(vectorStore))
      .build();
}
First, we define a VectorStore bean and use the SimpleVectorStore implementation, which is an in-memory implementation that emulates a vector store using the java.util.Map class.
In a production application, we can consider using a real vector store such as ChromaDB.
Next, using the ChatModel and VectorStore beans, we create a bean of type ChatClient, which is our main entry point for interacting with our chat completion model.
We configure it with a QuestionAnswerAdvisor, which uses the vector store to retrieve relevant portions of the stored documents based on the user’s question and provides them as context to the chat model.
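As a minimal usage sketch of the contentGenerator bean (the question text here is just a placeholder), we can send a prompt and read the answer as a plain String:
// The QuestionAnswerAdvisor injects relevant document chunks as context before the model is called
String answer = contentGenerator.prompt()
  .user("How many days of annual leave do I get?")
  .call()
  .content();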
2.3. Populating Our In-Memory Vector Store
For our demonstration, we’ve included a leave-policy.md file containing sample information about leave policies in the src/main/resources/documents directory.
Now, to populate the vector store with our document during application startup, we’ll create a VectorStoreInitializer class that implements the ApplicationRunner interface:
@Component
class VectorStoreInitializer implements ApplicationRunner {

    private final VectorStore vectorStore;
    private final ResourcePatternResolver resourcePatternResolver;

    // standard constructor

    @Override
    public void run(ApplicationArguments args) throws IOException {
        List<Document> documents = new ArrayList<>();
        Resource[] resources = resourcePatternResolver.getResources("classpath:documents/*.md");
        Arrays.stream(resources).forEach(resource -> {
            MarkdownDocumentReader markdownDocumentReader = new MarkdownDocumentReader(resource,
              MarkdownDocumentReaderConfig.defaultConfig());
            documents.addAll(markdownDocumentReader.read());
        });
        vectorStore.add(new TokenTextSplitter().split(documents));
    }

}
Inside the run() method, we first use the injected ResourcePatternResolver to fetch all the markdown files from the src/main/resources/documents directory. While we’re only working with a single markdown file here, this approach is extensible.
Then, we convert the fetched resources into Document objects using the MarkdownDocumentReader class.
Finally, we add the documents to the vector store after splitting them into smaller chunks using the TokenTextSplitter class.
When we invoke the add() method, Spring AI automatically converts our plaintext content into vector representation before storing it in the vector store. We don’t need to explicitly convert it using the EmbeddingModel bean.
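As a quick sanity check, we can run a similarity search against the populated store and verify that it returns results. Here’s a minimal sketch, where the query string is only an example:
// Semantically search the store; the query text is embedded automatically as well
List<Document> results = vectorStore.similaritySearch("How many days of sick leave are allowed?");
assertThat(results).isNotEmpty();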
3. Setting up Ollama With Testcontainers
To facilitate local development and testing, we’ll use Testcontainers to set up the Ollama service, the prerequisite for which is an active Docker instance.
3.1. Test Dependencies
First, let’s add the necessary test dependencies to our pom.xml:
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-spring-boot-testcontainers</artifactId>
    <scope>test</scope>
</dependency>
<dependency>
    <groupId>org.testcontainers</groupId>
    <artifactId>ollama</artifactId>
    <scope>test</scope>
</dependency>
We import the Spring AI Testcontainers dependency for Spring Boot and the Ollama module of Testcontainers.
These dependencies provide the necessary classes to spin up an ephemeral Docker instance for the Ollama service.
3.2. Defining Testcontainers Beans
Next, let’s create a @TestConfiguration class that defines our Testcontainers beans:
@TestConfiguration(proxyBeanMethods = false)
class TestcontainersConfiguration {

    @Bean
    public OllamaContainer ollamaContainer() {
        return new OllamaContainer("ollama/ollama:0.5.7");
    }

    @Bean
    public DynamicPropertyRegistrar dynamicPropertyRegistrar(OllamaContainer ollamaContainer) {
        return registry -> {
            registry.add("spring.ai.ollama.base-url", ollamaContainer::getEndpoint);
        };
    }

}
We pin the Ollama image to a specific stable version, 0.5.7, when creating the OllamaContainer bean.
Then, we define a DynamicPropertyRegistrar bean to configure the base-url of the Ollama service. This allows our application to connect to the started container.
Now, we can use this configuration in our integration tests by annotating our test classes with the @Import(TestcontainersConfiguration.class) annotation.
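For example, a test class that uses this setup could look like the following skeleton, where the class name is hypothetical:
@SpringBootTest
@Import(TestcontainersConfiguration.class)
class ChatbotEvaluationLiveTest {

    // autowire the beans under test and add integration tests here

}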
4. Using Spring AI Evaluators
Now that we’ve built our RAG chatbot and set up a local test environment, let’s see how we can use the two available implementations of Spring AI’s Evaluator interface to test the responses it generates.
4.1. Configuring the Evaluation Model
The quality of our testing ultimately depends on the quality of the evaluation model we use. We’ll use the bespoke-minicheck model, an open-source model trained by Bespoke Labs specifically for grounded fact-checking. It ranks at the top of the LLM-AggreFact leaderboard and produces only a yes/no response.
Let’s configure it in our application.yaml file:
com:
  baeldung:
    evaluation:
      model: bespoke-minicheck
Next, we’ll create a separate ChatClient bean to interact with our evaluation model:
@Bean
public ChatClient contentEvaluator(
  OllamaApi ollamaApi,
  @Value("${com.baeldung.evaluation.model}") String evaluationModel
) {
    ChatModel chatModel = OllamaChatModel.builder()
      .ollamaApi(ollamaApi)
      .defaultOptions(OllamaOptions.builder()
        .model(evaluationModel)
        .build())
      .modelManagementOptions(ModelManagementOptions.builder()
        .pullModelStrategy(PullModelStrategy.WHEN_MISSING)
        .build())
      .build();
    return ChatClient.builder(chatModel)
      .build();
}
Here, we define a new ChatClient bean using the OllamaApi bean that Spring AI creates for us and our custom evaluation model property, which we inject using the @Value annotation.
It’s important to note that we use a custom property for our evaluation model and manually create the corresponding ChatModel instance, since the OllamaAutoConfiguration class only lets us configure a single chat model via the spring.ai.ollama.chat.options.model property, which we’ve already used for our content generation model.
4.2. Evaluating Relevance of LLM Response With RelevancyEvaluator
Spring AI provides the RelevancyEvaluator implementation to check whether an LLM response is relevant to the user’s query and the retrieved context from the vector store.
First, let’s create a bean for it:
@Bean
public RelevancyEvaluator relevancyEvaluator(
  @Qualifier("contentEvaluator") ChatClient chatClient) {
    return new RelevancyEvaluator(chatClient.mutate());
}
We use the @Qualifier annotation to inject the contentEvaluator ChatClient bean we defined earlier and create an instance of the RelevancyEvaluator class.
Now, let’s test our chatbot’s response for relevancy:
String question = "How many days sick leave can I take?";
ChatResponse chatResponse = contentGenerator.prompt()
  .user(question)
  .call()
  .chatResponse();
String answer = chatResponse.getResult().getOutput().getContent();
List<Document> documents = chatResponse.getMetadata().get(QuestionAnswerAdvisor.RETRIEVED_DOCUMENTS);

EvaluationRequest evaluationRequest = new EvaluationRequest(question, documents, answer);
EvaluationResponse evaluationResponse = relevancyEvaluator.evaluate(evaluationRequest);
assertThat(evaluationResponse.isPass()).isTrue();

String nonRelevantAnswer = "A lion is the king of the jungle";
evaluationRequest = new EvaluationRequest(question, documents, nonRelevantAnswer);
evaluationResponse = relevancyEvaluator.evaluate(evaluationRequest);
assertThat(evaluationResponse.isPass()).isFalse();
We start by invoking our contentGenerator ChatClient with a question and extract the generated answer and the documents used to generate it from the returned ChatResponse.
Then, we create an EvaluationRequest containing the question, the retrieved documents, and the chatbot’s answer. We pass it to the relevancyEvaluator bean and assert that the answer is relevant using the isPass() method.
However, when we pass a completely unrelated answer about lions, the evaluator correctly identifies it as non-relevant.
4.3. Evaluating Factual Accuracy of LLM Response With FactCheckingEvaluator
Similarly, Spring AI provides a FactCheckingEvaluator implementation to validate the factual accuracy of the LLM response against the retrieved context.
Let’s create a FactCheckingEvaluator bean as well using our contentEvaluator ChatClient:
@Bean
public FactCheckingEvaluator factCheckingEvaluator(
  @Qualifier("contentEvaluator") ChatClient chatClient) {
    return new FactCheckingEvaluator(chatClient.mutate());
}
Finally, let’s test the factual accuracy of our chatbot’s response:
String question = "How many days sick leave can I take?";
ChatResponse chatResponse = contentGenerator.prompt()
  .user(question)
  .call()
  .chatResponse();
String answer = chatResponse.getResult().getOutput().getContent();
List<Document> documents = chatResponse.getMetadata().get(QuestionAnswerAdvisor.RETRIEVED_DOCUMENTS);

EvaluationRequest evaluationRequest = new EvaluationRequest(question, documents, answer);
EvaluationResponse evaluationResponse = factCheckingEvaluator.evaluate(evaluationRequest);
assertThat(evaluationResponse.isPass()).isTrue();

String wrongAnswer = "You can take no leaves. Get back to work!";
evaluationRequest = new EvaluationRequest(question, documents, wrongAnswer);
evaluationResponse = factCheckingEvaluator.evaluate(evaluationRequest);
assertThat(evaluationResponse.isPass()).isFalse();
Similar to the previous approach, we create an EvaluationRequest with the question, retrieved documents, and chatbot’s answer, and pass it to our factCheckingEvaluator bean.
We assert that the chatbot’s response is factually accurate based on the retrieved context. Additionally, we retest the evaluation with a hardcoded factually wrong answer and assert that the isPass() method returns false for it.
It’s worth noting that if we passed our hardcoded wrongAnswer to the RelevancyEvaluator, then the evaluation would pass, as even though the response is factually incorrect, it’s still relevant to the topic of sick leaves that the user asked about.
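To illustrate, here’s a quick check we could append to the test above, reusing the same variables and beans:
// Factually wrong, but still on-topic, so the relevancy evaluation passes
EvaluationRequest relevancyRequest = new EvaluationRequest(question, documents, wrongAnswer);
assertThat(relevancyEvaluator.evaluate(relevancyRequest).isPass()).isTrue();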
5. Conclusion
In this article, we’ve explored testing LLM responses using Spring AI’s Evaluator interface.
We built a simple RAG chatbot that answers user questions based on a set of documents and used Testcontainers to set up the Ollama service, creating a local test environment.
Then, we used the RelevancyEvaluator and FactCheckingEvaluator implementations provided by Spring AI to evaluate the relevance and factual accuracy of our chatbot’s responses.
As always, all the code examples used in this article are available over on GitHub.