How to Summarize and Find Similar ArXiv Articles

Friday, October 13, 2023 by raghavan848

Introduction

The volume of research articles on platforms like arXiv can be overwhelming for scholars trying to stay updated with the latest findings. This tutorial will guide you through the process of summarizing long-form arXiv articles into key points and identifying similar papers.

These actions can help researchers quickly grasp the essence of a paper and contextualize it within the broader academic discourse, ensuring a comprehensive understanding and avoiding redundant research efforts.

This is article has two parts:

Generating the embeddings and building the Annoy index
Querying index to get related papers and generating summary

Part 1: Building the Annoy Index

Prerequisites

Before you begin, make sure you have Python 3.9 and pip installed on your system.

Steps

Python Packages Installation

Install the following Python packages using pip:

pip install annoy
pip install langchain
pip install openai
pip install sentence-transformers
pip install torch

Alternatively, you can create a requirements.txt file and install the packages using the command pip install -r requirements.txt with the following contents:

numpy==1.25.2
torch==2.0.1
annoy==1.17.3
langchain==0.0.237
sentence-transformers==2.2.2

Kaggle arXiv dataset

To proceed, create a Kaggle account and download the arXiv dataset with limited metadata from this dataset. After downloading, unzip the file to find a JSON file.

Preprocess the Data

Load your dataset and preprocess it into the desired format. Here, we're reading a JSON file containing ArXiv metadata and concatenating titles and abstracts with a '[SEP]' separator:

def preprocess(path):
    data = []
    with open(path,&#39;r&#39;) as f:
        for line in f:
            data.append(json.loads(line))
    sents = [entry[&#39;title&#39;]+&#39;[SEP]&#39;+entry[&#39;abstract&#39;] for entry in data]
    return sents

Generate Embeddings using SBERT

Initialize the SBERT model and generate embeddings for your preprocessed data. We're using the allenai-specter model, specially trained for scientific papers. For approximately ~2 million articles of arxiv upto december 2022, it took 8+ hours on RTX3080, 6 hours on RTX4090 and 1.5 hours on A100 (cloud).

Adjust the batch_size based on your GPU memory:

def generate_embeddings(sents,model):
    embeddings = model.encode(sents, batch_size=400, show_progress_bar=True, device=&#39;cuda&#39;, convert_to_numpy=True)
    np.save(&quot;embeddings.npy&quot;, embeddings)
    return embeddings

You can adjust the batch_size depending on the GPU memory.

GPU	Time to generate embeddings.npy
RTX 3080 (16GB)	8 hours
RTX 4090 (16 GB)	5 hours
A100 (80 GB) (on cloud)	1 hours

Index Embeddings with Annoy

Once you have the embeddings, the next step is to index them for fast similarity search. We're using the Annoy library because of its efficiency:

def generate_annoy(embeddings):
    n_trees = 256
    embedding_size = 768
    annoy_index = AnnoyIndex(embedding_size, &#39;angular&#39;)
    for i in range(len(embeddings)):
        annoy_index.add_item(i, embeddings[i])
    annoy_index.build(n_trees)
    annoy_index.save(&quot;annoy_index.ann&quot;)
    return annoy_index

Alternatively, if you do not have a GPU and are okay with the arXiv snapshot up to December 2022, you can use public S3 URLs to download the necessary datasets.

dataset	description	S3URL
annoy_index.ann	Annoy Index of 2M arxiv articles using the file arxiv-metadata-oai-snapshot.json	S3 URL : https://arxiv-r-1228.s3.us-west-1.amazonaws.com/annoy_index.ann
arxiv-metadata-oai-snapshot.json	Dataset of 2M arxiv articles downloaded from Kaggle	S3 URL: https://arxiv-r-1228.s3.us-west-1.amazonaws.com/arxiv-metadata-oai-snapshot.json
embeddings.npy	Embedding numpy file. Contains serialized embeddings of all 2M articles	s3 URL: https://arxiv-r-1228.s3.us-west-1.amazonaws.com/embeddings.npy

Part 2: Summarize and Search for Similar Articles on Arxiv

Description

This tutorial will guide you through the process of summarizing a long-form arXiv article into key points, generating an idea based on it, and identifying similar papers. We will make use of Sentence Transformers for embeddings, Annoy for indexing, and the OpenAI API for generating the summary.

Prerequisites

Before proceeding, ensure you have the following:

Python 3+
Flask for creating an endpoint
Knowledge of JSON, Annoy, and Sentence Transformers

Steps

Step 1: Setup and Install Dependencies

First, install the required packages:

pip install sentence_transformers arxiv annoy flask

Step 2: Load and Preprocess Arxiv Metadata

To summarize and find similar articles, we need the dataset's metadata. The preprocess function does this by:

Loading the JSON data
Extracting titles and abstracts
Combining them into sentences

import json
import time

path = "arxiv-metadata-oai-snapshot.json"
data = []

def preprocess(path):
    with open(path, 'r') as f:
        for line in f:
            data.append(json.loads(line))
    sents = [entry['title'] + '[SEP]' + entry['abstract'] for entry in data]
    return sents

Step 3: Generate Annoy Index

Annoy (Approximate Nearest Neighbors Oh Yeah) is used to search for similar vectors in large datasets. Here, we load an Annoy index given a filename.

from annoy import AnnoyIndex

def generate_annoy(fn):
    embedding_size = 768
    annoy_index = AnnoyIndex(embedding_size, 'angular')
    annoy_index.load(fn)
    return annoy_index

Step 4: Search Function

The search function takes a query, computes its embedding using Sentence Transformers, and then finds the closest matches in our Annoy index.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/allenai-specter', device='cuda')

def search(query, annoy_index, model):
    query_embedding = model.encode(query, convert_to_numpy=True)
    hits = annoy_index.get_nns_by_vector(query_embedding, 5, include_distances=True)
    return hits

Step 5: Display Results

Once we've found the closest matches, we need to format and display them.

def print_results(hits, sents):
    response = ""
    for i in range(len(hits[0])):
        response += "<b><a href=https://arxiv.org/abs/" + data[hits[0][i]]['id'] + ">" + sents[hits[0][i]].split('[SEP]')[0] + "</a></b><br>"
        response += "Abstract:" + sents[hits[0][i]].split('[SEP]')[1]
        response += "Authors:" +  data[hits[0][i]]['authors']
        response += "<br>"
    return response

Step 6: Using OpenAI for Summarization

We use OpenAI's API to generate a summary of the Arxiv article. The article, its title, abstract, and page content are sent to OpenAI.

from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.document_loaders import ArxivLoader

open_ai_key = "YOUR_OPENAI_API_KEY"
llm = OpenAI(openai_api_key=open_ai_key, model_name= "gpt-3.5-turbo-16k")

Step 7: Flask Endpoint

We create an endpoint in Flask that processes the Arxiv URL, summarizes the article, searches for similar articles, and returns a formatted HTML response.

from flask import Flask, request

app = Flask(__name__)

@app.route('/search', methods=['GET'])
def search_endpoint():
    args = request.args
    query = args.get('q')
    tokens = query.split('/')
    id = tokens[-1] ## https://arxiv.org/abs/2310.03717

    docs = ArxivLoader(query=id, load_max_docs=2).load()
    print(docs[0].metadata['Title'])
    title = docs[0].metadata['Title']
    abstract = docs[0].metadata['Summary']
    page_content = docs[0].page_content[:40000]
    
    article = title +  '[SEP]' + abstract  + '[SEP]' + page_content

    print("Some related papers:")
    related = search(article,an,model)
    html_response = print_results(search(article,an,model), sents)
    template = """ Take a deep breath. You are a researcher. Your task is to read the  RESEARCH ARTICLE and generate 3 KEY POINTS of it in your own words and generate AN IDEA OF FUTURE EXTENSION based on the RELATED ARTICLES. Generate one actionable idea for extending the RESEARCH.
    RESEARCH ARTICLE: {article}
    RELATED ARTICLES: {related}
    INSTRUCTIONS
    1. Read the TITLE, ABSTRACT and the CONTENT of the RESEARCH ARTICLE.
    2. Generate 3 KEY POINTS of the RESEARCH ARTICLE in your own words. Each Key Point should be a bullet point of 10 WORDS are less.
    3. Read the RELATED ARTICLES
    4. Generate an IDEA OF FUTURE EXTENSION of the RESEARCH ARTICLE based on the RELATED ARTICLES.
    5. The IDEA OF FUTURE EXTENSION should be ONE sentence.
    6. Generate one actionable idea for extending the RESEARCH with Light Bulb emoji.
    7. STRICTLY generate the response in json format using the TYPESCRIPT SCHEMA below. Insert a line break after each bullet point.
    SCHEMA
    response:
        KEYPOINT1: String,
        KEYPOINT2: String,
        KEYPOINT3: String,
        FUTURE_EXTENSION_IDEA: String,
        ACTIONABLE_IDEA: String"""
    prompt = PromptTemplate(template=template, input_variables=["article", "related"])
    llm_chain = LLMChain(prompt=prompt, llm=llm)
    output = ""
    try:
        output = llm_chain.run({'article': article, 'related':related})
        print (output)
    except Exception as e:
        print (e)

    html_content = "<html> <head> <title> Arxiv Summary for  " + title + " </title> </head> <body> <h1> "+ title + " </h1>  <h2> Summary </h2> <p>"
    jsonD = json.loads(output)
    html_content += "<br> 1." + jsonD['KEYPOINT1']
    html_content += "<br> 2." + jsonD['KEYPOINT2']
    html_content += "<br> 3. " + jsonD['KEYPOINT3']
    html_content += "<br> <b>FUTURE EXTENSION IDEA:</b> " + jsonD['FUTURE_EXTENSION_IDEA']
    html_content += "<br> <b>ACTIONABLE IDEA: </b>" + jsonD['ACTIONABLE_IDEA']

    html_content += "</p> Related Articles: <br>"


    html_content += html_response
    html_content += "</html>"
    return html_content

Step 8: Running the Flask Server

Finally, run your Flask application.

if __name__ == "__main__":
    app.run()

Then, open a browser and navigate to: http://127.0.0.1:5000/search?q=ARXIV_URL replacing ARXIV_URL with your desired Arxiv article URL.

Conclusion

You've now created a tool that summarizes Arxiv articles and finds similar articles based on their content. This tool can be extended with other features or integrated into larger applications to aid researchers and academics.

Explore more AI tutorials for all the levels of expertice, and test your skills at AI Hackathons at lablab.ai community!