Cohere tutorial: Text embedder with Cohere

by Adrian Banachowicz on SEP 15, 2022

Text embedding is a machine learning technique that creates a vector representation of a piece of text. This vector can then be used as input to a machine learning algorithm. The goal of text embedding is to capture the meaning of the text in a form that is suitable for machine learning.

There are many different ways to create text embeddings, but the most common is to use a neural network. A neural network is a machine learning algorithm that is very good at learning complex relationships. The input to a neural network is a vector, and the output is another vector. The neural network learns to map input vectors to output vectors in a way that captures the relationships between them.

To create text embeddings, the neural network is first trained on a large corpus of text. The training data is a set of sentences, each represented as a vector. These vectors are created by taking the word vectors of the words in the sentence and summing them together. The neural network is then trained to map the sentence vectors to vectors of a fixed size.
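
As a toy illustration of the sentence-vector construction described above, here is a minimal sketch; the word vectors below are made up for the example (real models use hundreds of dimensions):

```python
import numpy as np

# Made-up 3-dimensional word vectors; real models use hundreds of dimensions.
word_vectors = {
    "cats": np.array([0.2, 0.9, 0.1]),
    "like": np.array([0.5, 0.1, 0.4]),
    "milk": np.array([0.3, 0.8, 0.2]),
}

# A sentence vector built by summing the word vectors of its words.
sentence = ["cats", "like", "milk"]
sentence_vector = np.sum([word_vectors[word] for word in sentence], axis=0)
print(sentence_vector)  # [1.  1.8 0.7]
```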

Once the neural network has been trained, it can be used to create embeddings for new pieces of text. The new text is first represented as a vector, and the neural network maps that vector to a fixed-size embedding. The result is a text embedding that captures the meaning of the text.

Text embeddings can be used for a variety of machine learning tasks. For example, they can be used to improve the performance of a machine learning algorithm that is used to classify texts. Text embeddings can also be used to find similar pieces of text, or to cluster texts together.

There are many different ways to create text embeddings, and the choice of method will depend on the application. However, neural networks are a powerful and widely used method for creating text embeddings.
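
Once texts are embedded, tasks like similarity search reduce to vector math. A minimal sketch with NumPy, using made-up three-dimensional vectors in place of real embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means similar direction."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up 3-dimensional vectors standing in for real embeddings,
# which would come from a model such as Co:here's embed endpoint.
emb_sport    = np.array([0.9, 0.2, 0.1])
emb_football = np.array([0.8, 0.3, 0.0])
emb_space    = np.array([0.1, 0.2, 0.9])

# Texts about related topics end up with similar (high-cosine) vectors.
print(cosine_similarity(emb_sport, emb_football))  # high
print(cosine_similarity(emb_sport, emb_space))     # low
```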

💬 Co:here

Co:here is a powerful neural network, which can generate, embed, and classify text. In this tutorial, we will use Co:here to embed descriptions. To use Co:here you need to create an account on Co:here and get an API key.

We will be programming in Python, so we need to install the cohere library with pip:

pip install cohere

First, we have to instantiate cohere.Client. The arguments of Client are the API key, which you generated before, and the version 2021-11-08. I will create the class CoHere; it will be useful in the next steps.

import cohere

class CoHere:
    def __init__(self, api_key):
        self.co = cohere.Client(f'{api_key}', '2021-11-08')
        self.examples = []

💾 Dataset

The main part of each neural network is a dataset. In this tutorial, I will use a dataset that includes 1,000 descriptions across 10 classes. If you want to use the same one, you can download it here.

The downloaded dataset has 10 folders, and each folder contains 100 .txt files with descriptions. The file names are the labels of the descriptions, e.g. sport_3.txt.

We will compare a Random Forest with the Co:here Classifier, so we have to prepare the data in two ways. For the Random Forest we will use the Co:here Embedder, which is the focus of this tutorial. The Co:here classifier requires samples, where each sample is designed as a list [description, label]; I covered that in my previous tutorial (here).

Loading paths of examples

In the beginning, we need to load all the data. To do that, we create the function load_examples. In this function we will use three libraries:

os.path to go into the folder with the data. The code is executed in the path where the Python file.py is located. This is an internal library, so we do not need to install it.

numpy is useful for working with arrays. In this tutorial, we will use it to generate random numbers. You have to install this library with pip: pip install numpy.

glob helps us to read all file and folder names. This is part of the Python standard library, so no installation is needed.

The downloaded dataset should be extracted into the folder data. With os.path.join we can build platform-independent paths to the folders.

folders_path = os.path.join('data', '*')

On Windows, the returned value is equal to data\*.

Then we can use the glob function to get all folder names.

folders_name = glob(folders_path)

folders_name is a list which contains the paths of the folders (on Windows, Windows-style paths). In this tutorial, these are the names of the labels.

['data\\business', 'data\\entertainment', 'data\\food', 'data\\graphics', 'data\\historical', 'data\\medical', 'data\\politics', 'data\\space', 'data\\sport', 'data\\technologie']

The Co:here training dataset cannot be bigger than 50 examples and each class has to have at least 5 examples, but for the Random Forest we can use 1,000 examples. With a for loop we can get the names of each file. The entire function looks like this:

import os.path
from glob import glob
import numpy as np

def load_examples(no_of_ex):
    examples_path = []

    folders_path = os.path.join('data', '*')
    folders_name = glob(folders_path)

    for folder in folders_name:
        files_path = os.path.join(folder, '*')
        files_name = glob(files_path)
        for i in range(no_of_ex // len(folders_name)):
            random_example = np.random.randint(0, len(files_name))
            examples_path.append(files_name[random_example])
    return examples_path

The last loop randomly takes N paths for each label and appends them to the new list examples_path.

Load descriptions

Now, we have to create a training set. To make it, we will load examples with load_examples(). Each path contains the name of a class, which we will use to create the samples. The descriptions need to be read from the files; they cannot be too long, so in this tutorial the length will be limited to 100 characters. A list [description, class_name] is appended to the list texts, and that list is returned.

def examples(no_of_ex):
    texts = []
    examples_path = load_examples(no_of_ex)
    for path in examples_path:
        class_name = path.split(os.sep)[1]
        with open(path, 'r', encoding="utf8") as file:
            text = file.read()[:100]
            texts.append([text, class_name])
    return texts
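
To try both functions without downloading the real dataset, we can build a tiny fake data folder with the same layout; the labels, file names, and contents below are invented purely for the demo, and the sampling and loading logic is inlined:

```python
import os
import tempfile
from glob import glob
import numpy as np

# Build a tiny fake "data" folder with the same layout as the tutorial's dataset:
# one subfolder per label, each holding a few .txt files with descriptions.
tmp = tempfile.mkdtemp()
for label in ["sport", "food"]:
    folder = os.path.join(tmp, "data", label)
    os.makedirs(folder)
    for i in range(3):
        with open(os.path.join(folder, f"{label}_{i}.txt"), "w", encoding="utf8") as f:
            f.write(f"A {label} description number {i}.")

os.chdir(tmp)  # the tutorial's functions assume "data" sits in the working directory

# Same sampling logic as load_examples(), asking for 4 examples in total.
examples_path = []
folders_name = glob(os.path.join("data", "*"))
for folder in folders_name:
    files_name = glob(os.path.join(folder, "*"))
    for _ in range(4 // len(folders_name)):
        examples_path.append(files_name[np.random.randint(0, len(files_name))])

# Same loading logic as examples(): [description, class_name] pairs.
texts = []
for path in examples_path:
    class_name = path.split(os.sep)[1]
    with open(path, "r", encoding="utf8") as file:
        texts.append([file.read()[:100], class_name])

print(texts)  # e.g. [['A sport description number 2.', 'sport'], ...]
```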

🔥 Co:here Embedder

We go back to the CoHere class. We have to add one method - a method to embed examples.

The second Cohere method embeds text. This method has several arguments, such as:

model - the size of the model.

texts - the list of texts to embed.

truncate - if the text is longer than the token limit, which part of the text should be kept: LEFT, RIGHT or NONE.

You can find all of them here.

In this tutorial, the cohere method will be implemented as a method of our CoHere class.

# pandas and train_test_split have to be imported first
import pandas as pd
from sklearn.model_selection import train_test_split

def embed(self, no_of_ex):
    # as good developers, we should split the dataset
    data = pd.DataFrame(examples(no_of_ex))
    self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(
        list(data[0]), list(data[1]), test_size=0.2, random_state=0)
    # in the next lines we create a numeric form of the X_train and X_test data
    self.X_train_embeded = self.co.embed(texts=self.X_train,
                                         model="large",
                                         truncate="LEFT").embeddings
    self.X_test_embeded = self.co.embed(texts=self.X_test,
                                        model="large",
                                        truncate="LEFT").embeddings

X_train_embeded will be an array of numbers, which looks like this:

[ 386, 0.39653537, -0.409076, 0.5956299, -0.06624506, 2.0539167, 0.7133603,...
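
The embeddings are just fixed-length numeric vectors, so any scikit-learn model can consume them directly. A minimal sketch, with random vectors standing in for real embeddings:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Mock "embeddings": 40 vectors of dimension 8 for two classes with shifted means,
# standing in for the vectors returned by self.co.embed(...).embeddings.
X = np.vstack([rng.normal(0.0, 1.0, (20, 8)),   # class "sport"
               rng.normal(3.0, 1.0, (20, 8))])  # class "food"
y = ["sport"] * 20 + ["food"] * 20

forest = RandomForestClassifier(max_depth=10, random_state=0)
forest.fit(X, y)

# Class probabilities for a single vector, as done in the Streamlit app below.
probs = forest.predict_proba(X[0].reshape(1, -1))[0]
print(dict(zip(forest.classes_, probs)))
```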

📈 Web application - Streamlit

To create an application that compares the two probability displays, we will use Streamlit. This is an easy and very useful library.

Installation

pip install streamlit

We will need a text input for the Co:here API key.

In the Streamlit docs we can find the methods:

st.header() to make a header in our app

st.text_input() to send a text request

st.button() to create a button

st.write() to display the results of the Cohere model

st.progress() to display a progress bar

st.columns() to split the app

import streamlit as st
import numpy as np
from sklearn.ensemble import RandomForestClassifier

st.header("Co:here Text Classifier vs Random Forest")

api_key = st.text_input("API Key:", type="password")

cohere = CoHere(api_key)

cohere.list_of_examples(50)  # number of examples for the Cohere classifier,
                             # shown in the previous tutorial
cohere.embed(1000)           # number of examples for random forest

# initialization of random forest with sklearn library
forest = RandomForestClassifier(max_depth=10, random_state=0) 

col1, col2 = st.columns(2)

if col1.button("Classify"):
    # training process of random forest, to do it we use embedded text.
    forest.fit(cohere.X_train_embeded, cohere.y_train)
    # prediction process of random forest
    predict = forest.predict_proba(np.array(cohere.X_test_embeded[0]).reshape(1, -1))[0] 
    here = cohere.classify([cohere.X_test[0]])[0] # prediction process of cohere classifier
    col2.success(f"Correct prediction: {cohere.y_test[0]}") # display original label

    col1, col2 = st.columns(2)
    col1.header("Co:here classify") # predictions for cohere
    for con in here.confidence:
        col1.write(f"{con.label}: {np.round(con.confidence*100, 2)}%")
        col1.progress(con.confidence)

    col2.header("Random Forest") # predictions for random forest
    for con, pred in zip(here.confidence, predict):
        col2.write(f"{con.label}: {np.round(pred*100, 2)}%")
        col2.progress(pred)

To run the Streamlit app, use the command:

streamlit run name_of_your_file.py

The created app looks like this

💡 Conclusion

Text embedding is a powerful tool that can be used to improve the performance of machine learning algorithms. Neural networks are a widely used and effective method for creating text embeddings. Text embeddings can be used for tasks such as text classification, text similarity, and text clustering.

In this tutorial, we compared the Random Forest with the Co:here Classifier, but the possibilities of the Co:here Embedder are huge. You can build a lot of things with it.

Stay tuned for future tutorials! You can check the repository of this code here.

Thank you! - Adrian Banachowicz, Data Science Intern in New Native
