Cohere tutorial: Text classifier with Cohere
The Magic of Natural Language Processing
Welcome to the fascinating world of Natural Language Processing (NLP), a unique blend of computer science and linguistics that focuses on the interaction between computers and human languages. At its core, NLP is all about developing advanced algorithms that can understand and produce human language with remarkable accuracy.
The Ultimate Goal of NLP
The long-term objective of NLP is to create computational models of human language that can perform a wide array of tasks. These tasks range from automatic translation and summarization to question answering and information extraction, among others. The research in this field is highly interdisciplinary, involving experts from linguistics, cognitive science, artificial intelligence, and computer science.
The Diverse Methods in NLP
NLP employs a variety of methods, including rule-based methods, statistical methods, and neural computed methods. Rule-based methods rely on hand-crafted rules written by NLP experts. While these methods can be highly effective for specific tasks, they often require a lot of effort to maintain and are limited in their scope. On the other hand, statistical methods use large amounts of data to train computational models, which can then be used to perform various NLP tasks automatically.
The Role of Neural Networks in NLP
Neural networks, a type of machine learning algorithm, are particularly well-suited for NLP tasks. They have been used to create state-of-the-art models for tasks such as machine translation and classification, showcasing the immense potential of this technology.
Cohere
Cohere is a powerful neural network, which can generate, embed, and classify text. In this tutorial we will use Co:here to classify descriptions. To use it you need to create account on Co:here and get API key.
We will be programming in Python, so we need to install cohere
library by pip
pip install cohere
Firstly, we have to implement cohere.Client
. In arguments of Client should be API key, which you have generated before, and version 2021-11-08
. I will create the class CoHere
, it will be useful in the next steps.
class CoHere:
def __init__(self, api_key):
self.co = cohere.Client(f'{api_key}', '2021-11-08')
self.examples = []
💾 Dataset
The main part of each neural network is a dataset. In this tutorial, I will use a dataset that includes 1000 descriptions of 10 classes. If you want to use the same, you can download it here.
The downloaded dataset has 10 folders in each folder is 100 files.txt
with descriptions. The name of files is a label of description, e.g.sport_3.txt
.
In this field, tasks are to read descriptions and labels from files and to create data, which contains description and label as one sample of data. Cohere classifier requires samples, in which each sample should be designed as a list [description, label]
.
Loading paths of examples
In the beginning, we need to load all data, to do that. We create the function load_examples
. In this function we will use three external libraries:
os.path
to go into the folder with data. The code is executed in a path where is a python's file.py
. This is an internal library, so we do not need to install it.
numpy
this library is useful to work with arrays. In this tutorial, we will use it to generate random numbers. You have to install this library by pip pip install numpy
.
glob
helps us to read all files and folder names. This is an external library, so the installation is needed - pip install glob
.
The downloaded dataset should be extracted in the folder data
. By os.path.join
we can get universal paths of folders.
folders_path = os.path.join('data', '*')
In windows, a return is equal to data\*
.
Then we can use glob
method to get all names of folders.
folders_name = glob(folders_path)
folders_name
is a list, which contains window paths of folders. In this tutorial, these are the names of labels.
['data\\business', 'data\\entertainment', 'data\\food', 'data\\graphics', 'data\\historical', 'data\\medical', 'data\\politics', 'data\\space', 'data\\sport', 'data\\technologie']
Size of Co:here
training dataset can not be bigger than 50 examples and each class has to have at least 5 examples. With loop for
we can get the names of each file. The entire function looks like that:
import os.path
from glob import glob
import numpy as np
def load_examples():
examples_path = []
folders_path = os.path.join('data', '*')
folders_name = glob(folders_path)
for folder in folders_name:
files_path = os.path.join(folder, '*')
files_name = glob(files_path)
for i in range(50 // len(folders_name)):
random_example = np.random.randint(0, len(files_name))
examples_path.append(files_name[random_example])
return examples_path
The last loop is taking randomly 5 paths of each label and appending them into a new list examples_path
.
Load descriptions
Now, we have to create a training set. To make it we will load examples with load_examples()
. In each path is the name of a class, we will use it to create samples. Descriptions need to be read from files, a length can not be long, so in this tutorial, the length will be equal to 100. To list texts
is appended list of [descroption, class_name]
. Thus, a return is that list.
def examples():
texts = []
examples_path = load_examples()
for path in examples_path:
class_name = path.split(os.sep)[1]
with open(path, 'r', encoding="utf8") as file:
text = file.read()[:100]
texts.append([text, class_name])
return texts
🔥 Co:here classifier
We back to CoHere
class. We have to add two methods - to load examples and to classify input.
The first one is simple, co:here
list of examples has to be created with the additional cohere
's method - cohere.classify.Example
.
def list_of_examples(self):
for e in examples():
self.examples.append(Example(text=e[0], label=e[1]))
The second method is to classify the method from cohere
. The method has serval arguments, such as:
model
size of a model.
inputs
list of data to classify.
examples
list of a training set with examples
All of them you can find here.
In this tutorial, the cohere
method will be implemented as a method of our CoHere
class. An argument of this method is a list of descriptions to predict.
def classify(self, inputs):
return self.co.classify(
model='medium',
inputs=inputs,
examples=self.examples
).classifications
The return is input
, prediction
of input, and a list of confidence
. Confidence
is a likelihood list of each class.
cohere.Classification {
input:
prediction:
confidence: []
}
CoHere
class
import cohere
from loadExamples import examples
from cohere.classify import Example
class CoHere:
def __init__(self, api_key):
self.co = cohere.Client(f'{api_key}', '2021-11-08')
self.examples = []
def list_of_examples(self):
for e in examples():
self.examples.append(Example(text=e[0], label=e[1]))
def classify(self, inputs):
return self.co.classify(
model='medium',
taskDescription='',
outputIndicator='',
inputs=inputs,
examples=self.examples
).classifications
📈 Web application - Streamlit
To create an application, in which will be a text input box and a likelihood display, we will use Stramlit
. This is an easy and very useful library.
Installation
pip install streamlit
We will need two text inputs for co:here
API key and for text to predict.
In docs of streamlit we can find methods:
st.header()
to make a header on our app
st.test_input()
to send a text request
st.button()
to create button
st.write()
to display the results of cohere model.
st.progress()
to display a progrss bar
st.column()
to split an app
st.header("Your personal text classifier - Co:here application")
api_key = st.text_input("API Key:", type="password") #text box for API key
description = [st.text_input("Description:")] #text box for text to predict
cohere = CoHere(api_key) #initialization CoHere
cohere.list_of_examples() #loading training set
if st.button("Classify"):
here = cohere.classify(description)[0] #prediction
col1, col2 = st.columns(2)
for no, con in enumerate(here.confidence): #display likelihood for each label
if no % 2 == 0: # in two columns
col1.write(f"{con.label}: {np.round(con.confidence*100, 2)}%")
col1.progress(con.confidence)
else:
col2.write(f"{con.label}: {np.round(con.confidence * 100, 2)}%")
col2.progress(con.confidence)
To run the streamlit app use command
streamlit run name_of_your_file.py
The created app looks like this
Harnessing Cohere for Text Classification: A Comprehensive Guide
Unleashing the Power of Cohere Models
Cohere models are not just for text generation, they are equally adept at text classification. This tutorial demonstrated how to classify short texts using a small dataset. With just 50 examples across 10 classes, we achieved a high prediction likelihood, proving that a large dataset isn't always necessary.
Cohere: A Solution for Limited Data
In scenarios where producing a large dataset is challenging, Cohere models can be a fantastic solution. They can handle text classification tasks effectively even with a smaller amount of data.
Empowering Yourself with Cohere
Identify a problem in your surroundings and consider building a Cohere application to address it. The power to change is in your hands.
Looking Ahead
Stay tuned for future tutorials that will further explore the capabilities of Cohere models. The journey of learning never ends. The repository of this code can check here.
Thank you! - Adrian Banachowicz, Data Science Intern in New Native