Chroma Tutorial: How to give GPT-3.5 chatbot memory-like capability

Wednesday, May 24, 2023 by septian_adi_nugraha408

Why should my chatbot have memory-like capability?

In this tutorial, we will walk through the steps to integrate a Chroma database with OpenAI's GPT-3.5 model, giving a chatbot a memory-like capability. This feature enables the chatbot to reference past exchanges while formulating its responses, essentially acting as the bot's "memory". The memory mechanism not only helps the chatbot maintain context over an extended interaction, it also helps work around the limited context window size of certain OpenAI models while conserving tokens for more efficient usage. It's a simple addition that can noticeably improve the quality of the user experience with your AI app.

What are Embeddings?

Embeddings are a type of vector representation where similar items are represented by close vectors and dissimilar items by distant vectors. In the context of natural language processing (NLP), embeddings are used to represent words, sentences, or even entire documents. If you've dabbled in computer vision or used object detection libraries like OpenCV, the concept of vector representation should ring a bell, as it's a go-to technique for pinpointing similarities between images. These embeddings capture the semantic meaning of the text, allowing us to perform mathematical operations on them, like computing the cosine similarity between two vectors to measure how similar two pieces of text are.

In this particular case, embeddings are vital because they allow us to store and retrieve past conversations based on semantic similarity rather than exact text matching. This means our chatbot can understand the meaning of previous conversations, rather than just remembering the exact words. So, rather than searching for phrases that contain specific words, we can search the database for relevant items using their vector representations.
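
To make this concrete, here's a tiny, self-contained sketch of cosine similarity. The three-dimensional vectors below are made up purely for illustration; real embeddings (such as OpenAI's) have hundreds or thousands of dimensions.

# Toy illustration of cosine similarity between embedding vectors.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

cat = [0.8, 0.1, 0.3]     # hypothetical embedding for "cat"
kitten = [0.7, 0.2, 0.3]  # hypothetical embedding for "kitten"
car = [0.1, 0.9, 0.2]     # hypothetical embedding for "car"

print(cosine_similarity(cat, kitten))  # ~0.99, semantically similar
print(cosine_similarity(cat, car))     # ~0.29, much less similar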

What is ChromaDB?

To quote the official documentation, Chroma is "the open-source embedding database". Chroma stores embeddings along with their metadata and, with its built-in functionality, helps embed documents (convert them into vectors) and query the stored embeddings based on those embedded documents.
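
To get a feel for how that works, here's a minimal sketch of the basic Chroma workflow. The collection name and documents are just examples, and since no embedding function is specified, Chroma falls back to its built-in default (later in this tutorial we'll plug in OpenAI's).

import chromadb

# Create an in-memory client and a collection.
client = chromadb.Client()
collection = client.create_collection(name="demo")

# Add a couple of documents; Chroma embeds and stores them for us.
collection.add(
    documents=["The wizard lives in a tall blue tower", "Dragons hoard piles of gold"],
    ids=["id_1", "id_2"]
)

# Query by meaning rather than by exact keywords.
results = collection.query(query_texts=["Where does the wizard live?"], n_results=1)
print(results["documents"])  # the tower sentence should come back first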

Prerequisites

  • Basic knowledge of Python
  • Access to OpenAI's GPT-3.5 (an OpenAI API key)
  • A Chroma database set up

Outline

  1. Initializing the Project
  2. Setting Up the Required Libraries
  3. Writing the Project Files
  4. Testing the Basic Chatbot
  5. Setting Up Chroma Database
  6. Testing the Enhanced Chatbot

Discussion

Initializing the Project

Let's get coding! First of all, we need to initialize the project; let's call it chroma-openai. Make the project directory and, as a best practice, create a new virtual environment specifically for this project. This helps keep the project's dependencies isolated from our global environment.

# Let's make the directory
mkdir chroma-openai
# move into the project directory
cd chroma-openai

# create a new virtual environment for this project. You can give it whatever name you like; for this tutorial, I choose chromaenv.
python3 -m venv chromaenv
# this will create a new virtual environment called `chromaenv` in our project.

Afterwards, we activate the virtual environment. Depending on the OS we're working on, here's how we can do it.

  • On Windows
.\chromaenv\Scripts\activate
  • On Linux/MacOS
source chromaenv/bin/activate

Our terminal should look like this after the env activation, with the virtual environment's name inside the parentheses.

python env activation

Setting Up the Required Libraries

Next, we need to set up all the required libraries. To keep it simple, we install openai for making calls to the GPT-3.5 model (and for providing the embedding function), chromadb to store the embeddings, python-dotenv to load our environment variables, and halo for a sweet loading indicator on each request.

# In this tutorial, I use pip3 to install python libraries, but if you use `pip` instead, feel free to switch it.

# OpenAI library
pip3 install openai

# ChromaDB for storing the embeddings
pip3 install chromadb

# python-dotenv for loading environment variables from the .env file
pip3 install python-dotenv

# Halo for loading indicator
pip3 install halo

Writing the Project Files

Now, back to the coding part. Make a new file with your favorite IDE/code editor and call it main.py, as it's the only Python file we create in this project.

main.py 🐍

First things first, let's import all the required dependencies. For now, we only import the openai, halo, and python-dotenv libraries, along with the standard os and pprint modules.

from dotenv import load_dotenv
import os
import openai
import pprint
from halo import Halo

After importing all the required dependencies, we load the constant variables from the .env file, which we will create shortly.

load_dotenv()

Now, we will write the most important function, which handles sending requests to the OpenAI model and returning the response.

def generate_response(messages):
    # Create a loading spinner
    spinner = Halo(text='Loading...', spinner='dots')
    spinner.start()

    # Load the OpenAI key and model name from environment variables
    openai.api_key = os.getenv("OPENAI_KEY")
    model_name = os.getenv("MODEL_NAME")

    # Create a chat completion with the provided messages
    response = openai.ChatCompletion.create(
            model=model_name,
            messages=messages,
            temperature=0.5,
            max_tokens=250)

    # Stop the spinner once the response is received
    spinner.stop()

    # Pretty-print the messages sent to the model
    pp = pprint.PrettyPrinter(indent=4)
    print("Request:")
    pp.pprint(messages)

    # Print the usage statistics for the completion
    print(f"Completion tokens: {response['usage']['completion_tokens']}, Prompt tokens: {response['usage']['prompt_tokens']}, Total tokens: {response['usage']['total_tokens']}")

    # Return the message part of the response
    return response['choices'][0]['message']

Now, where will this function be called? In the good old main() function, of course! So, let's write it!

def main():
    # Initialize the messages with a system message, let's say we're talking to a wise wizard bot.
    messages=[
        {"role": "system", "content": "You are a kind and wise wizard"}
        ]

    # Continue chatting until the user types "quit"
    while True:
        input_text = input("You: ")
        if input_text.lower() == "quit":
            break

        # Add the user's message to the messages
        messages.append({"role": "user", "content": input_text})

        # Get a response from the model and add it to the messages
        response = generate_response(messages)
        messages.append(response)

        # Print the assistant's response
        print(f"Wizard: {response['content']}")

Finally, we add the entry point of this script, which, as we Pythonistas probably already know, is the familiar if clause:

if __name__ == "__main__":
    main()

Let's go over it one more time! The main.py file should contain this script:

# main.py

from dotenv import load_dotenv
import os
import openai
import pprint
from halo import Halo

# Load environment variables from a .env file
load_dotenv()

# Function to generate a response from the model
def generate_response(messages):
    # Create a loading spinner
    spinner = Halo(text='Loading...', spinner='dots')
    spinner.start()

    # Load the OpenAI key and model name from environment variables
    openai.api_key = os.getenv("OPENAI_KEY")
    model_name = os.getenv("MODEL_NAME")

    # Create a chat completion with the provided messages
    response = openai.ChatCompletion.create(
            model=model_name,
            messages=messages,
            temperature=0.5,
            max_tokens=250)

    # Stop the spinner once the response is received
    spinner.stop()

    # Pretty-print the messages sent to the model
    pp = pprint.PrettyPrinter(indent=4)
    print("Request:")
    pp.pprint(messages)

    # Print the usage statistics for the completion
    print(f"Completion tokens: {response['usage']['completion_tokens']}, Prompt tokens: {response['usage']['prompt_tokens']}, Total tokens: {response['usage']['total_tokens']}")

    # Return the message part of the response
    return response['choices'][0]['message']

# Main function to run the chatbot
def main():
    # Initialize the messages with a system message
    messages=[
        {"role": "system", "content": "You are a kind and wise wizard"}
        ]

    # Continue chatting until the user types "quit"
    while True:
        input_text = input("You: ")
        if input_text.lower() == "quit":
            break

        # Add the user's message to the messages
        messages.append({"role": "user", "content": input_text})

        # Get a response from the model and add it to the messages
        response = generate_response(messages)
        messages.append(response)

        # Print the assistant's response
        print(f"Wizard: {response['content']}")

# Run the main function when the script is run
if __name__ == "__main__":
    main()

.env 🌏

Next, we define the constant variables in the .env file. This is considered a best practice, as we will eventually push this project to a public repository such as GitHub. We don't want our API keys to be exposed to the public and risk misuse.

# your own API key
OPENAI_KEY=sk-xxxxxxx
# for this tutorial, we will use GPT-3.5 (Turbo), which is faster and cheaper while on par with GPT-3 Davinci in terms of performance.
MODEL_NAME=gpt-3.5-turbo-0301

requirements.txt 📄

Finally, we'll record our dependencies list into a file called requirements.txt. This is also considered a best practice, as other developers will be able to install the required libraries with one simple command.

# "Freeze" the dependencies, storing the list into a file
pip3 freeze > requirements.txt

# If you want to install the dependencies of this project, run this
pip3 install -r requirements.txt

Testing the Basic Chatbot

Now the moment of truth! Or the first half of it, at least. Let's run the file on your favorite terminal with this command.

python3 main.py

Should things go well, our terminal will show the input prompt. Let's start talking to the wizard! Notice how the sweet loading indicator shows up while we wait for the response.

chat the bot!

After a short while, the response should show up. In this project, we also show the actual requests sent and the total tokens used for the conversation.

the bot's response

Let's talk to the bot some more.

more conversations

At this point, we notice how the request grows with each subsequent exchange; the more we chat, the more tokens each request uses. However, up to this point, this has been one of the only ways for the bot to retain some semblance of "memory": the bot remembers aspects of the conversation only until it hits the context limit, at which point we have to reset it and lose the previous chat history.

That's where the Chroma database comes in! With the power of embedding storage and queries, storing conversations and recalling them will be a piece of cake.

Setting Up Chroma Database

A keen reader might observe that we already installed chromadb with our pip3 install commands earlier. Now we need to make some changes to the main.py file.

First, we import the necessary libraries and packages

 from dotenv import load_dotenv
 import os
 import openai
 import pprint
 from halo import Halo
+import chromadb
+from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

Next, we need to initialize ChromaDB. At the beginning of the main function, we add these variables. As ChromaDB needs IDs as identifiers for its embeddings, we initialize a counter variable that we increment to produce the ID of each record. Also notice that we delete the initial system message here; we will move it inside the while loop so the messages list is rebuilt on every turn instead of accumulating the whole chat history.

 def main():
-    messages=[
-        {"role": "system", "content": "You are a kind and wise wizard"}
-        ]
+    chroma_client = chromadb.Client()
+    embedding_function = OpenAIEmbeddingFunction(api_key=os.getenv("OPENAI_KEY"), model_name=os.getenv("EMBEDDING_MODEL"))
+    collection = chroma_client.create_collection(name="conversations", embedding_function=embedding_function)
+    current_id = 0
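
Note that the embedding function reads a new environment variable, EMBEDDING_MODEL, which we also need to add to our .env file. OpenAI's text-embedding-ada-002 is a reasonable choice here:

# add to .env: the OpenAI model used to embed documents
EMBEDDING_MODEL=text-embedding-ada-002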

Inside the while loop, we initialize lists to store the chat history, its metadata, and the ID of each entry.

     while True:
+        chat_history = []
+        chat_metadata = []
+        history_ids = []

         messages=[
             {"role": "system", "content": "You are a kind and wise wizard"}
             ]
         input_text = input("You: ")
         if input_text.lower() == "quit":
             break

Below that code, we query the stored embeddings at the beginning of each loop. In this project, we only query chat history from the assistant role (the bot). What really happens here is that our input text is first converted into a vector, which is then used to query the two nearest embeddings in the Chroma database.

+        results = collection.query(
+            query_texts=[input_text],
+            where={"role": "assistant"},
+            n_results=2
+        )

Finally, we add the results as input for the next conversation turn. That way, only the two most relevant chat history entries are added to the next chat input, avoiding irrelevant entries and thus further saving token cost.

+        for res in results['documents'][0]:
+            messages.append({"role": "user", "content": f"previous chat: {res}"})

         messages.append({"role": "user", "content": input_text})
         response = generate_response(messages)

+        chat_metadata.append({"role":"user"})
+        chat_history.append(input_text)
+        chat_metadata.append({"role":"assistant"})
+        chat_history.append(response['content'])
+        current_id += 1
+        history_ids.append(f"id_{current_id}")
+        current_id += 1
+        history_ids.append(f"id_{current_id}")
+        collection.add(
+            documents=chat_history,
+            metadatas=chat_metadata,
+            ids=history_ids
+        )
         print(f"Wizard: {response['content']}")

 if __name__ == "__main__":
     main()

To wrap it up, the overall code should look like this:

from dotenv import load_dotenv
import os
import openai
import pprint
from halo import Halo
import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction


# Load environment variables
load_dotenv()
pp = pprint.PrettyPrinter(indent=4)

def generate_response(messages):
    spinner = Halo(text='Loading...', spinner='dots')
    spinner.start()
    openai.api_key = os.getenv("OPENAI_KEY")
    model_name = os.getenv("MODEL_NAME")
    response = openai.ChatCompletion.create(
            model=model_name,
            messages=messages,
            temperature=0.5,
            max_tokens=250)

    spinner.stop()
    print("Request:")
    pp.pprint(messages)

    print(f"Completion tokens: {response['usage']['completion_tokens']}, Prompt tokens: {response['usage']['prompt_tokens']}, Total tokens: {response['usage']['total_tokens']}")
    return response['choices'][0]['message']

def main():
    chroma_client = chromadb.Client()
    embedding_function = OpenAIEmbeddingFunction(api_key=os.getenv("OPENAI_KEY"), model_name=os.getenv("EMBEDDING_MODEL"))
    collection = chroma_client.create_collection(name="conversations", embedding_function=embedding_function)
    current_id = 0
    while True:
        chat_history = []
        chat_metadata = []
        history_ids = []

        messages=[
            {"role": "system", "content": "You are a kind and wise wizard"}
            ]
        input_text = input("You: ")
        if input_text.lower() == "quit":
            break

        results = collection.query(
            query_texts=[input_text],
            where={"role": "assistant"},
            n_results=2
        )

        # append the query result into the messages
        for res in results['documents'][0]:
            messages.append({"role": "user", "content": f"previous chat: {res}"})

        # append user input at the end of conversation chain
        messages.append({"role": "user", "content": input_text})
        response = generate_response(messages)

        chat_metadata.append({"role":"user"})
        chat_history.append(input_text)
        chat_metadata.append({"role":"assistant"})
        chat_history.append(response['content'])
        current_id += 1
        history_ids.append(f"id_{current_id}")
        current_id += 1
        history_ids.append(f"id_{current_id}")
        collection.add(
            documents=chat_history,
            metadatas=chat_metadata,
            ids=history_ids
        )
        print(f"Wizard: {response['content']}")

if __name__ == "__main__":
    main()

Testing the Enhanced Chatbot

Finally, the real moment of truth! Run the script again. This time, take note of the request printed to stdout. On subsequent chats, we notice that the script no longer sends the whole conversation history as input to the model, but only the system message, the two most relevant query results, and our most recent input. The query results, retrieved based on their similarity to the input, serve as the chatbot's "memory", surfacing only when the user brings up the related topic in their chat.
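
For illustration, the messages list sent to the model on a later turn now has roughly this shape (the wording of the retrieved entries below is hypothetical):

# Hypothetical example: system prompt, two retrieved "memories" taken from the
# assistant's earlier replies, and the newest user input, rather than the full history.
[
    {"role": "system", "content": "You are a kind and wise wizard"},
    {"role": "user", "content": "previous chat: I am called Eldrin, young traveler."},
    {"role": "user", "content": "previous chat: I dwell in the Azure Tower beyond the hills."},
    {"role": "user", "content": "What was your name again?"}
]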

enhanced chatbot log

Note that I had the chatbot take a little quiz to test its memory, and it worked just fine!

Wrap it up!

As we have demonstrated with our little chatbot, we can do wonders by using ChromaDB as memory for our chatbot. Besides being genuinely useful for retaining important information from our chat histories, this kind of memory also helps save tokens, as it weeds out irrelevant chat history entries. With a few lines of code, we can implement a kind of long-term memory for our bots, whether we're using the GPT-3.5 model or any other! And if you want to test your AI skills, join our Artificial Intelligence Hackathons and create amazing AI-based apps in just 1, 2, or 7 days!
