Evaluate and Improve your Chatbots with TruLens

Tuesday, November 14, 2023 by alfredo_lhuissier73
Improve your LLM applications with TruLens

In this tutorial, I will show you how to build and evaluate a contextual chatbot (a.k.a. a conversational LLM with memory) using Langchain and TruLens. We will track our bot's responses and check key moderation metrics, like hate and maliciousness, while also keeping an eye on performance and cost.

🔎 TruLens: Evaluate and Track LLM Applications

TruLens offers a powerful suite of tools to monitor and revise the performance of LLM-based applications. Through its evaluation capabilities, TruLens enables users to assess the quality of inputs, outputs, and internal processes, incorporating built-in feedback mechanisms like groundedness, relevance, and moderation assessment, while also accommodating custom evaluation needs. It also provides essential instrumentation for a wide range of LLM applications, including question answering, retrieval-augmented generation, and agent-based solutions. This instrumentation allows users to closely monitor diverse usage metrics and metadata, offering valuable insights into the model's performance across various applications.

Prerequisites

To follow along, you will need Python 3.10 with conda (or another virtual environment manager), an OpenAI API key, and a HuggingFace access token.

โš™๏ธ Setting Up

Let's begin by creating a virtual environment inside a new folder.

mkdir trulens_chatbot
cd trulens_chatbot
conda create -n "trulens_chatbot" python=3.10
conda activate trulens_chatbot

And install the necessary libraries:

pip install streamlit langchain trulens-eval openai

Streamlit simplifies the process of storing sensitive data by offering built-in file-based secrets management, ensuring secure storage and easy access to your application's API keys.

To incorporate our OpenAI API key and HuggingFace Access Token into Streamlit secrets, we'll proceed by creating a .streamlit/secrets.toml file within our project directory:

mkdir .streamlit
touch .streamlit/secrets.toml

Next, add the following lines, replacing the placeholder values with your own keys:

OPENAI_API_KEY = "YOUR_API_KEY"
HUGGINGFACE_API_KEY = "YOUR_ACCESS_TOKEN"

We're now ready to start building!

🤖 Building the Chatbot

Create a chatbot.py file and open it.

touch chatbot.py

Start by importing the necessary libraries and loading the environment variables.

import streamlit as st
import os
from langchain.chains import LLMChain
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.memory import ConversationBufferMemory
from trulens_eval import TruChain, Feedback, OpenAI, Huggingface, Tru

# Load environment variables from .streamlit/secrets.toml
os.environ["OPENAI_API_KEY"] = st.secrets["OPENAI_API_KEY"]
os.environ["HUGGINGFACE_API_KEY"] = st.secrets["HUGGINGFACE_API_KEY"]

# Instantiate the feedback providers and the TruLens workspace
# (the API keys must be set before the providers are created)
hugs = Huggingface()
openai = OpenAI()
tru = Tru()

Chain building

After that, let's build our LLM chain with a simple prompt that we can later improve based on our evaluation results.

# Build LLM chain
template = """You are a chatbot having a conversation with a human.
        {chat_history}
        Human: {human_input}
        Chatbot:"""
prompt = PromptTemplate(
    input_variables=["chat_history", "human_input"], template=template
)
memory = ConversationBufferMemory(memory_key="chat_history")
llm = ChatOpenAI(model_name="gpt-3.5-turbo")
chain = LLMChain(llm=llm, prompt=prompt, memory=memory, verbose=True)

TruLens integration

Once you've created your LLM chain, you can use TruLens for evaluation and tracking. TruLens has a number of out-of-the-box Feedback Functions, and is also an extensible framework for LLM evaluation.

A Feedback Function scores the output of an LLM application by analyzing generated text from an LLM (or a downstream model or application built on it) and metadata.
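
Because TruLens is extensible, you can also wrap a plain Python function as a feedback. Here's a minimal, hypothetical sketch (the response_length_ok helper is our own example, not part of the library):

# Hypothetical custom feedback: any callable that maps the generated
# text to a score between 0 and 1 can be wrapped in Feedback.
def response_length_ok(response: str) -> float:
    """Return 1.0 when the response stays under 500 characters, else 0.0."""
    return 1.0 if len(response) <= 500 else 0.0

f_length = Feedback(response_length_ok).on_output()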

In this case, we will track the relevance of the bot's answers, and also evaluate the outputs for hate, violence, self-harm, and maliciousness.

# Question/answer relevance between overall question and answer.
f_relevance = Feedback(openai.relevance).on_input_output()

# Moderation metrics on output
f_hate = Feedback(openai.moderation_hate, higher_is_better=False).on_output()
f_violent = Feedback(openai.moderation_violence, higher_is_better=False).on_output()
f_selfharm = Feedback(openai.moderation_selfharm, higher_is_better=False).on_output()
f_maliciousness = Feedback(openai.maliciousness_with_cot_reasons, higher_is_better=False).on_output()

We can now instantiate TruChain, a wrapper that will track the outputs of the LLM chain.

# TruLens Eval chain recorder
chain_recorder = TruChain(
    chain, app_id="contextual-chatbot", feedbacks=[f_relevance, f_hate, f_violent, f_selfharm, f_maliciousness]
)

Chatbot UI with Streamlit

We will leverage Streamlit's chat elements, st.chat_message and st.chat_input, along with st.session_state, to quickly build a ChatGPT-like UI.

# Streamlit frontend
st.title("Contextual Chatbot")

if "messages" not in st.session_state:
    st.session_state.messages = []

for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

if prompt := st.chat_input("What is up?"):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.markdown(prompt)

    with st.chat_message("assistant"):
        # Record with TruLens
        with chain_recorder as recording:
            full_response = chain.run(prompt)
        message_placeholder = st.empty()
    message_placeholder.markdown(full_response + "▌")
        message_placeholder.markdown(full_response)
    st.session_state.messages.append(
        {"role": "assistant", "content": full_response})

Finally, we can initialize TruLens' dashboard at the end of the file.

tru.run_dashboard()

Here's the complete chatbot.py file, assembled from the snippets above:
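
import streamlit as st
import os
from langchain.chains import LLMChain
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.memory import ConversationBufferMemory
from trulens_eval import TruChain, Feedback, OpenAI, Huggingface, Tru

# Load environment variables from .streamlit/secrets.toml
os.environ["OPENAI_API_KEY"] = st.secrets["OPENAI_API_KEY"]
os.environ["HUGGINGFACE_API_KEY"] = st.secrets["HUGGINGFACE_API_KEY"]

# Instantiate the feedback providers and the TruLens workspace
hugs = Huggingface()
openai = OpenAI()
tru = Tru()

# Build LLM chain
template = """You are a chatbot having a conversation with a human.
        {chat_history}
        Human: {human_input}
        Chatbot:"""
prompt = PromptTemplate(
    input_variables=["chat_history", "human_input"], template=template
)
memory = ConversationBufferMemory(memory_key="chat_history")
llm = ChatOpenAI(model_name="gpt-3.5-turbo")
chain = LLMChain(llm=llm, prompt=prompt, memory=memory, verbose=True)

# Question/answer relevance between overall question and answer.
f_relevance = Feedback(openai.relevance).on_input_output()

# Moderation metrics on output
f_hate = Feedback(openai.moderation_hate, higher_is_better=False).on_output()
f_violent = Feedback(openai.moderation_violence, higher_is_better=False).on_output()
f_selfharm = Feedback(openai.moderation_selfharm, higher_is_better=False).on_output()
f_maliciousness = Feedback(openai.maliciousness_with_cot_reasons, higher_is_better=False).on_output()

# TruLens Eval chain recorder
chain_recorder = TruChain(
    chain, app_id="contextual-chatbot",
    feedbacks=[f_relevance, f_hate, f_violent, f_selfharm, f_maliciousness]
)

# Streamlit frontend
st.title("Contextual Chatbot")

if "messages" not in st.session_state:
    st.session_state.messages = []

for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

if prompt := st.chat_input("What is up?"):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.markdown(prompt)

    with st.chat_message("assistant"):
        # Record with TruLens
        with chain_recorder as recording:
            full_response = chain.run(prompt)
        message_placeholder = st.empty()
        message_placeholder.markdown(full_response + "▌")
        message_placeholder.markdown(full_response)
    st.session_state.messages.append(
        {"role": "assistant", "content": full_response})

# Start the TruLens dashboard alongside the app
tru.run_dashboard()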

Let's run the chatbot by entering the following command in the terminal:

streamlit run chatbot.py

A new tab should open automatically in your browser at http://localhost:8501/

Contextual Chatbot

And you can check TruLens' dashboard at http://localhost:8502/

TruLens Dashboard

๐Ÿง Evaluation and Improvement

Now that the chatbot is running on http://localhost:8501/, let's assess its output using TruLens Eval.

Moderation Test
TruLens Eval Results
TruLens Eval Graphs

Not so bad! Let's see if we can obtain better performance by modifying the initial prompt in the LLM chain.

Change your prompt template to something like this:

...
# Build LLM chain
template = """You are a very understanding chatbot that loves to make people feel good and display empathy towards humans
        that seem to be going through a difficult time. You always try to cheer people up and make them smile. You are very
        considerate and your mission in this universe is to elevate others.
        {chat_history}
        Human: {human_input}
        Chatbot:"""
prompt = PromptTemplate(
    input_variables=["chat_history", "human_input"], template=template
)
...

And now let's try talking again with our new friend 😂.

Moderation Test 2

Remember that the moderation scores are based on outputs, not inputs.

TruLens Results 2

Seems reasonable. The responses are considerably more thoughtful and the scores are consistent with that. And best of all, this therapy session cost me only about $0.001.

Jokes aside, let's swap the model "gpt-3.5-turbo" for a more powerful one, like "gpt-4", and see what happens.

Inside your LLM definition, modify the model name like this:

# Build LLM chain
...
memory = ConversationBufferMemory(memory_key="chat_history")
llm = ChatOpenAI(model_name="gpt-4") # <--- Change this
chain = LLMChain(llm=llm, prompt=prompt, memory=memory, verbose=True)
...

You can also change your recorder's app_id so you can differentiate it in TruLens' results display.

...
# TruLens Eval chain recorder
chain_recorder = TruChain(
    chain, app_id="gpt-4", feedbacks=[f_relevance, f_hate, f_violent, f_selfharm, f_maliciousness]
)
...

TruLens Results 3

It's noticeably faster (and more expensive). Let's run one last test with an older model, "gpt-3.5-turbo-0301". As before, change the model name and the app_id, and see what happens.

# Build LLM chain
...
memory = ConversationBufferMemory(memory_key="chat_history")
llm = ChatOpenAI(model_name="gpt-3.5-turbo-0301") # <--- Change this
chain = LLMChain(llm=llm, prompt=prompt, memory=memory, verbose=True)
...
# TruLens Eval chain recorder
chain_recorder = TruChain(
    chain, app_id="gpt-3.5-turbo-0301", feedbacks=[f_relevance, f_hate, f_violent, f_selfharm, f_maliciousness]
)

TruLens Results 4

GPT-4 is the clear winner 🥳

๐ŸŽ Wrapping Up

We just built a chatbot with contextual memory and integrated it with TruLens' complete evaluation suite. Thanks to TruLens Eval, we can continuously monitor and improve the performance of any LLM application and compare different configurations and models.

In summary, when developing LLM applications it is crucial to assess how specific chain configuration choices affect response quality, cost, and latency. Furthermore, TruLens and Langchain together form an effective combination for building reliable chatbots: Langchain manages the conversational context that LLM applications depend on, while TruLens provides a comprehensive framework for monitoring and evaluating each iteration of the application.

Last but not least, you could easily deploy this app by uploading it to GitHub and connecting the repository to Streamlit Community Cloud.
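
Streamlit Community Cloud installs dependencies from a requirements.txt file at the root of the repository, so, assuming the packages we installed earlier, a minimal one would look like this:

streamlit
langchain
trulens-eval
openai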

Thanks for following along!
