Make a chatbot for GitHub repositories

Use LangChain and OpenAI to make your AI assistant for repositories.


Introduction

Have you ever stumbled upon an intriguing project on GitHub, only to find that navigating through READMEs and files to understand it was daunting? Have you ever wished for a tool that lets you ask direct questions about the project?

Today, we're going to build just that. Using Python, OpenAI, and the LangChain framework, we will create a tool that simplifies exploring GitHub projects by letting you build your own ChatGPT-style assistant with access to the content of a repository.

Find the repository with the full code in Scrape and chat with repositories.

Please note that this article is meant to guide you through the logic of this project and give you a high-level overview of the LangChain framework.

Stay tuned for an introduction to using LangChain from scratch.

What is LangChain?

LangChain is a versatile framework built to make it easier to develop apps that use language models like OpenAI's GPT-3 and GPT-4. Its two core ideas are being data-aware and being agentic: it's designed to help language models connect with all kinds of data sources and let them interact with their environments. The modules it comes with handle different tasks - different model types, prompt management, persistent memory states, and more.

What's cool is that LangChain can be used in all sorts of ways - think autonomous agents, personal assistants, question-answering systems, and even chatbots. It's super helpful for tricky tasks like querying structured data, understanding code, working with APIs, extracting information, summarizing documents, and evaluating generative models. Basically, if you're looking to get the most out of language models in your apps, LangChain is your go-to toolbox. It's a great resource, whether you're just starting out or you've been at this for a while.

Now, for our project today, we're going to use LangChain for all our 'AI-related' tasks: generating embedding vectors, creating and querying a vector database - that kind of stuff.

Find more about LangChain in their official docs.

Explore the LangChain modules used

LangChain has many modules available to take care of different tasks; in this project, we'll use the following:

  • File Directory loader: Document loaders are modules that allow your app to easily load different formats of documents. In our case, the scraper dumps .txt files into a directory, so we use the File Directory loader, which loads every file from a directory. We also specify that it should only pick up .txt files.

  • Recursive Character Splitter: Text splitters allow the app to split the loaded documents into smaller chunks more suited for the indexing process. There are many types available, but we'll be using the Recursive Character splitter in this project as it is ideal for general text. I decided on this because we are also indexing the README file and comments. If you only index code, you can use the CodeTextSplitter instead.

  • Text Embedding Model: For the LLM to properly understand the context, we need to generate embedding vectors. Embedding vectors are a way to represent words, phrases, or even larger pieces of text as numerical vectors. This is extremely useful in the field of natural language processing (NLP), as computers fundamentally deal with numbers, not words. These vectors are then stored in a database and can be used as context for the chat model. In this case, we'll use the OpenAI embeddings model since we will use the GPT models for the chat. (See the short sketch after this list for what an embedding vector looks like in practice.)

  • Deep Lake Vector Store: We need to store the embedding vectors in a proper database, and in this case, we'll use the Deep Lake module available on LangChain. Deep Lake is like a huge storage unit for embeddings and all the stuff that goes with them — think text, JSON files, pictures, sound and video files, and so on. You can keep this data wherever you like, whether that's on your own computer (local), in your cloud storage, or on Activeloop's storage (Deep Lake managed database). What's cool about Deep Lake is that it doesn't just store these embeddings; it can also search through them and their attributes. So, it's a really flexible tool for managing and using embeddings.

  • Chat Model: Of course, we need a chat model for the interaction part. LangChain easily integrates with OpenAI, so we'll use the OpenAI ChatGPT module to interact with our indexed repository.
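
To make embedding vectors less abstract, here is a minimal standalone sketch of what the embedding model returns when you embed a piece of text. It assumes an OPENAI_API_KEY environment variable is set; the sample question is just an example.

from langchain.embeddings import OpenAIEmbeddings

# Assumes OPENAI_API_KEY is set in the environment
embeddings = OpenAIEmbeddings(disallowed_special=())

# Turn a piece of text into a numerical vector
vector = embeddings.embed_query("How do I install this project?")

print(len(vector))  # e.g. 1536 dimensions for OpenAI's default embedding model
print(vector[:5])   # the first few floats of the vector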

I recommend cloning the repository and following along with this article while referencing it.

Scraping a repository

The logic dedicated to this part lives in the scraper.py file in the repository, which runs as part of main.py. The scraper creates a directory with the content of the repository as .txt files, one for each file in the repository.

The scraper file is in src/scraper.py.

This Python script is designed to scrape a GitHub repository and save the content of its files to a local directory.
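
The actual implementation is in src/scraper.py; purely as an illustration of the idea (not the repository's actual code), a minimal scraper built on the GitHub REST API could look like the sketch below. The function name, output layout, and example arguments are hypothetical.

import os
import requests

def scrape_repo(owner, repo, path="", out_dir="./repos_content"):
    """Recursively download a repository's files as .txt via the GitHub contents API."""
    url = f"https://api.github.com/repos/{owner}/{repo}/contents/{path}"
    response = requests.get(url)  # note: unauthenticated requests are rate-limited
    response.raise_for_status()
    for item in response.json():
        if item["type"] == "dir":
            scrape_repo(owner, repo, item["path"], out_dir)
        elif item["type"] == "file" and item.get("download_url"):
            content = requests.get(item["download_url"]).text
            os.makedirs(out_dir, exist_ok=True)
            # Flatten the file path and save everything with a .txt extension
            file_name = item["path"].replace("/", "_") + ".txt"
            with open(os.path.join(out_dir, file_name), "w", encoding="utf-8") as f:
                f.write(content)

scrape_repo("someuser", "somerepo")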

The main file and the Langchain interaction

This file does most of the heavy lifting. The first part of the main function is dedicated to the scraping process; let's focus on the LangChain modules.

Make sure to follow the instructions in the repository to install all of the required dependencies.

  1. Imports: First, the script imports necessary modules and functions. It alters the system path to include the 'src' directory and imports various functions and classes from the 'langchain', 'deeplake', and 'scraper' modules.
import os
import sys

# Make the 'src' directory importable
sys.path.append('src')

from dotenv import load_dotenv
from scraper import main as github_scraper_main
from langchain.embeddings import OpenAIEmbeddings
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import DeepLake
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain
import deeplake
  2. One part worth mentioning is how the environment variables are loaded; the script sets them as follows:
    # Load environment variables from .env file
    load_dotenv()
    os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY')
    os.environ['ACTIVELOOP_TOKEN'] = os.getenv('ACTIVELOOP_TOKEN')

In this case, we use the dotenv library to load them from a .env file.
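
For reference, the .env file for this project would contain something like the following. The values below are placeholders; DATASET_PATH is read later when creating the Deep Lake vector store and can be a local path or a Deep Lake hub:// path.

# .env (placeholder values)
OPENAI_API_KEY=sk-...
ACTIVELOOP_TOKEN=your-activeloop-token
DATASET_PATH=hub://your-org/your-dataset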

  3. The embedding model is also initialized with the following line:
# Config embeddings model
embeddings = OpenAIEmbeddings(disallowed_special=())
  4. Loading documents: After scraping, the script loads the scraped text files from the './repos_content/' directory using the DirectoryLoader class from 'langchain'. It prints the number of loaded documents.
    # Load the documents; in this case, only .txt files.
    loader = DirectoryLoader('./repos_content/', glob="**/*.txt", show_progress=True, use_multithreading=True)
    print("=" * 100)
    print('Loading docs...')
    docs = loader.load()
    print(f"Loaded {len(docs)} documents.")
  5. Splitting documents: The script then splits the loaded documents into chunks of 1000 characters with an overlap of 10 characters using the RecursiveCharacterTextSplitter class. It prints the number of generated chunks.
    # Split the docs
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=10, length_function=len)
    print("=" * 100)
    print('Splitting documents...')
    text = text_splitter.split_documents(docs)
    print(f'Generated {len(text)} chunks.')
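
If you're curious how the splitter behaves on its own, here is a tiny standalone sketch; the sample string and the deliberately small chunk size are made up for illustration.

from langchain.text_splitter import RecursiveCharacterTextSplitter

# A small chunk size so the splitting is easy to see
splitter = RecursiveCharacterTextSplitter(chunk_size=20, chunk_overlap=5, length_function=len)

chunks = splitter.split_text("LangChain makes it easy to build LLM-powered applications.")
for chunk in chunks:
    print(repr(chunk))  # each chunk is at most 20 characters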
  6. Creating vector DB: Next, the script creates a vector database using the DeepLake class from 'langchain'. It uses the previously defined OpenAIEmbeddings object as the embedding function. The path for the Deep Lake dataset is obtained from an environment variable. The script adds the chunks of text to the vector database and prints a confirmation message.
    # Get the Deep Lake dataset path from the environment
    deeplake_path = os.getenv('DATASET_PATH')
    db = DeepLake(dataset_path=deeplake_path, embedding_function=embeddings, overwrite=True)
    # Add the document chunks to the vector database
    db.add_documents(text)
    print('Vector database updated.')

As you can see, each step only takes a few lines of code, thanks to the LangChain framework.

Chat with the repository

Once the previous step is complete, the repository has been scraped and indexed in the vector database. Now we can run the chat.py file and actually ask questions!

  1. Initializing embeddings, vector store, and retriever: The script initializes an OpenAIEmbeddings object, a DeepLake object (which stores the vectors of the documents in the dataset), and a retriever object (which retrieves the most relevant documents from the vector store based on a query).
import os
from dotenv import load_dotenv
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import DeepLake
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain
from langchain.callbacks import get_openai_callback

# Load environment variables from .env file
load_dotenv()

# Set DeepLake dataset path
DEEPLAKE_PATH = os.getenv('DATASET_PATH')

# Initialize OpenAI embeddings and disallow special tokens
EMBEDDINGS = OpenAIEmbeddings(disallowed_special=())

# Initialize DeepLake vector store with OpenAI embeddings
deep_lake = DeepLake(
    dataset_path=DEEPLAKE_PATH,
    read_only=True,
    embedding_function=EMBEDDINGS,
)

# Initialize retriever and set search parameters
retriever = deep_lake.as_retriever()
retriever.search_kwargs.update({
    'distance_metric': 'cos',
    'fetch_k': 100,
    'maximal_marginal_relevance': True,
    'k': 10,
})
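
Before wiring the retriever into a chain, you can sanity-check it on its own. This short sketch reuses the retriever defined above; the question is just an example.

# Fetch the chunks most relevant to a sample question
relevant_docs = retriever.get_relevant_documents("How do I install this project?")
for doc in relevant_docs:
    print(doc.page_content[:100])  # preview the first 100 characters of each chunk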
  2. Initializing chat model and conversational retrieval chain: The script initializes a ChatOpenAI object with the specified language model and a ConversationalRetrievalChain object, which uses the chat model and the retriever to answer questions based on the chat history and the dataset.
# Initialize ChatOpenAI model; 'language_model' is set elsewhere in chat.py
# (gpt-3.5-turbo by default; use gpt-4 for better and more accurate responses)
model = ChatOpenAI(model_name=language_model, temperature=0.2)

# Initialize ConversationalRetrievalChain
qa = ConversationalRetrievalChain.from_llm(model, retriever=retriever)

# Initialize chat history
chat_history = []

def get_user_input():
    """Get user input and handle 'quit' command."""
    question = input("\nPlease enter your question (or 'quit' to stop): ")
    if question.lower() == 'quit':
        return None
    return question

def print_answer(question, answer):
    """Format and print question and answer."""
    print(f"\nQuestion: {question}\nAnswer: {answer}\n")

def main():
    """Main program loop."""
    while True:
        question = get_user_input()
        if question is None:  # User has quit
            break

        # Display token usage and approximate costs
        with get_openai_callback() as tokens_usage:
            result = qa({"question": question, "chat_history": chat_history})
            chat_history.append((question, result['answer']))
            print_answer(question, result['answer'])
            print(tokens_usage)

if __name__ == '__main__':
    main()

You will have a real ChatGPT-style chat in your terminal!

Conclusion

And that's it: a comprehensive overview of how to build a tool that leverages the power of Python, OpenAI, and the LangChain framework to navigate GitHub repositories more effectively. By combining several LangChain modules (the File Directory loader, the Recursive Character Splitter, the OpenAI embedding model, the Deep Lake Vector Store, and the ChatGPT model), we created a tool that can index a repository's content and engage in intelligent dialogue about it. This offers a powerful new approach to exploring and understanding GitHub projects, translating complex file structures and code into natural-language answers to user queries.
