Project Eval

application badge
Created by team chAI on December 24, 2022

Eval aims to address the problem of subjectively evaluating test answers. Traditionally, this task has been carried out manually by human graders, which can be time-consuming and prone to bias. To address this issue, the project utilizes Cohere powered APIs to automate the evaluation process. The use of Cohere APIs allows for the integration of advanced natural language processing techniques, enabling the system to accurately understand and analyze the content of test answers. The custom model built upon these APIs then scores the answers based on suitable metrics, which can be tailored to the specific requirements of the test or assessment. One potential application of this technology is in the field of education, where it could be used to grade assignments or exams in a more efficient and unbiased manner. It could also be utilized in professional settings for evaluating job applications or performance evaluations. In addition to increasing efficiency and reducing bias, the use of automated evaluation techniques has the potential to provide more consistent and reliable scoring. This can help to ensure that test-takers receive fair and accurate assessments of their knowledge and skills. The model for the same was evaluated based on 4 major metrics: - Semantic Search: this is the primary scoring strategy of Eval. It is used to semantically understand the answer given and evaluate based on content rather than simply scoring based on textual similarities. Cohere Embed was used to generate embeddings for 5 suggested answers for the question and the answer to be checked. Then we find the distance from the nearest neighbor out of the 5 suggestions and the answer. This distance is used to grade the answer. - Duplication Check: partially correct answers with duplication of text tended to get higher similarity scores compared to the ones without duplication. To stop students from using this exploit to gain extra marks, a duplication checker was implemented based on Jaccard-Similarity between sentences within the answer. - Grammar Check: this strategy aims to check the grammar of the answer and assign a score based on the number of grammatical errors. We used Cohere Generate endpoint to generate a grammatically correct version of the answer, then check for cosine similarity of the generated version with original version to check if the original version was grammatically correct. - Toxicity Check: this aims to detect for toxic content in the answer and penalize an answer if it is toxic. We trained a custom classification model on Cohere using the Social Media Toxicity Dataset by SurgeAI which gave a 98% precision on the test split. We also implemented a Custom Checks which allows users to give different weights to each of the three different metrics based on how important they are for the evaluation of the answer. This allows for a more personalized evaluation of the answer. We built our custom model into a Flask-based REST API server deployed on Replit to streamline usage and allow people to access the full-functionality of the model. We also built a highly interactive UI that allows for users to easily interact with the API and evaluate their answers as well as submit questions.

Category tags: