LangChain practical projects — build a CLI chatbot

Build an AI assistant trained on specific docs

Introduction

In the rapidly evolving field of conversational AI, the ability to interact with and query structured data is paramount. This project showcases a Python-based command-line interface (CLI) application that leverages LangChain to index and chat with custom documentation. It comprises two main files, main.py for indexing the data and chat.py for querying it, and demonstrates a practical implementation of conversational retrieval: users ask questions and receive precise answers based on the indexed documents.

This is the first practical project after The ultimate LangChain series. Make sure to go through the series first so you have the required foundations.

Mastering AI Applications with Langchain & Python — The Ultimate LangChain guide

In this guide, we'll build a custom chatbot trained on the Chainstack documentation.

Technologies Used

The project uses a combination of cutting-edge technologies and libraries to achieve its functionality:

  • LangChain: A core library that facilitates conversational interfaces with data, enabling document indexing, retrieval, and conversational modeling.

  • BeautifulSoup: Used for parsing HTML content and extracting relevant document information.

  • DeepLake: A vector database used for storing and retrieving document embeddings.

  • OpenAIEmbeddings: Utilized for generating embeddings for the text data.

  • dotenv: A library to manage environment variables, ensuring secure and convenient configuration management.

Together, these technologies form the project's backbone, providing a robust and flexible framework for building a conversational interface with custom documentation.

Prerequisites

Required Knowledge

To successfully follow this guide and implement the project, completing the series Mastering AI Applications with Langchain & Python — The Ultimate LangChain guide is recommended. This series will provide you with the foundational knowledge and skills needed to understand and work with Langchain, a core component of this project.

Tools and Libraries

Before getting started, ensure you have the following installed and configured:

  • Python 3 and pip.

  • An OpenAI API key.

  • An Activeloop account and token, used for the DeepLake vector database.

Environment Setup

Follow these steps to set up the development environment:

You can either clone the repository or start from scratch; if you start from scratch, skip the first two steps and create your own project directory instead.

  1. Clone the Repository:

     git clone https://github.com/soos3d/chainstack-docs-chat.git
    
  2. Navigate to the Project's Directory:

     cd chainstack-docs-chat
    
  3. Create a New Python Virtual Environment:

     python3 -m venv docs-chat
    
  4. Activate the virtual environment:

     source docs-chat/bin/activate
    
  5. Install Dependencies:

     pip install -r requirements.txt
    
  6. Configure API Keys and Settings in the .env File: Add your OpenAI API key, Activeloop token, and other configuration settings to the .env file.

     # OpenAI 
     OPENAI_API_KEY=""
     EMBEDDINGS_MODEL="text-embedding-ada-002"
     LANGUAGE_MODEL="gpt-3.5-turbo" # options: gpt-4, gpt-3.5-turbo
    
     # Deeplake vector DB
     ACTIVELOOP_TOKEN=""
     DATASET_PATH="./chainstack_docs" # "hub://USER_ID/custom_dataset"  # Edit with your user id if you want to use the cloud db.
    
     # Scrape settings
     SITE_MAP="https://docs.chainstack.com/sitemap.xml"
    

You can also fill in the env.sample file and rename it to .env.

Following these steps, you will have a fully configured environment ready to explore the Chainstack docs chatbot project. The subsequent sections of this guide will delve into the details of the code, explaining how each component works.

Project Structure

Understanding the project's structure is essential for navigating the code and making modifications or extensions. Here's an overview of the main files and their roles:

1. .env

  • Role: This file stores secrets and configuration as environment variables. It includes API keys for OpenAI and Activeloop, the embedding model, language model, dataset path, and sitemap URL.

  • Usage: Used to manage sensitive information securely and easily configure various settings for the project.

2. main.py

  • Role: The main file responsible for scraping pages and creating the vector database. It includes functions for loading configuration, loading documents, splitting documents, creating the vector database, initializing the retriever, and setting up the conversational chain.

  • Usage: Run this file to index the documents from the Chainstack sitemap and create a vector database for retrieval.

3. chat.py

  • Role: This file handles user queries and interacts with the indexed data to provide answers. It includes functions for loading environment variables, initializing various components (e.g., embeddings, retriever, chat model), and managing the user interaction loop.

  • Usage: Run this file to start the chat interface, allowing users to enter questions and receive answers based on the indexed documents.

4. requirements.txt

  • Role: Lists the Python packages and libraries required to run the project.

  • Usage: Used in conjunction with pip install -r requirements.txt to install all necessary dependencies.

These files form the core structure of the Chainstack docs chatbot project. The main.py file focuses on indexing and preparing the data, while the chat.py file manages the interaction with users, leveraging the indexed data to provide relevant answers. The .env file keeps sensitive information out of the code, and requirements.txt simplifies the installation of dependencies.
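The contents of requirements.txt are not shown above. A plausible version, matching the imports used in this project, might look like the following; the exact package set, and in particular the need for tiktoken and lxml, are assumptions, so defer to the repository's own file:

langchain
openai
deeplake
beautifulsoup4
python-dotenv
tiktoken
lxml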

main.py: Indexing the Data

Once the environment is ready, create a new file named main.py and paste the following code:

import os
from bs4 import BeautifulSoup
from dotenv import load_dotenv
from langchain.document_loaders.sitemap import SitemapLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import DeepLake
from langchain.embeddings import OpenAIEmbeddings
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain
from langchain.callbacks import get_openai_callback


def remove_nav_and_header_elements(content: BeautifulSoup) -> str:
    # Remove navigation and header elements from HTML content
    nav_elements = content.find_all('nav')
    header_elements = content.find_all('header')
    for element in nav_elements + header_elements:
        element.decompose()
    return content.get_text()


def load_configuration():
    # Load environment variables from .env file
    load_dotenv()
    return {
        'OPENAI_API_KEY': os.getenv('OPENAI_API_KEY'),
        'ACTIVELOOP_TOKEN': os.getenv('ACTIVELOOP_TOKEN'),
        'SITE_MAP': os.getenv('SITE_MAP'),
        'DATASET_PATH': os.getenv('DATASET_PATH'),
        'LANGUAGE_MODEL': os.getenv('LANGUAGE_MODEL')
    }


def load_documents(config):
    # Load pages from Chainstack sitemap using SitemapLoader
    print('Load pages from Chainstack sitemap...')
    loader = SitemapLoader(
        config['SITE_MAP'],
        filter_urls=["https://docs.chainstack.com/docs/", "https://docs.chainstack.com/reference/"],
        parsing_function=remove_nav_and_header_elements
    )
    return loader.load()


def split_documents(documents):
    # Split documents into chunks using RecursiveCharacterTextSplitter
    """
    Check https://blog.davideai.dev/the-ultimate-langchain-series-text-splitters?source=more_series_bottom_blogs#heading-chunk-size-and-overlap
    to understand more about the splitter
    """

    # Extract metadata from the documents if you want to use it for something
    metadatas = [doc.metadata for doc in documents]

    print("=" * 100)
    print('Splitting documents...')
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=100,
    )
    text = text_splitter.split_documents(documents)
    print(f'Generated {len(text)} chunks.')
    return text


def create_vector_db(text, config):
    # Create and update a vector database using DeepLake
    print("=" * 100)
    print('Creating vector DB...')
    embeddings = OpenAIEmbeddings(disallowed_special=())
    deeplake_path = config['DATASET_PATH']
    db = DeepLake(dataset_path=deeplake_path, embedding_function=embeddings, overwrite=True)
    db.add_documents(text)
    print('Vector database updated.')
    return db


def initialize_retriever(db):
    # Initialize retriever with specific search parameters
    """
    Check https://blog.davideai.dev/the-ultimate-langchain-series-chat-with-your-data#heading-setting-up-the-retriever 
    to understand more about the retriever
    """
    retriever = db.as_retriever()
    retriever.search_kwargs.update({
        'distance_metric': 'cos',
        'fetch_k': 100,
        'maximal_marginal_relevance': True,
        'k': 10,
    })
    return retriever


def main():
    # Main execution flow
    config = load_configuration()
    os.environ['OPENAI_API_KEY'] = config['OPENAI_API_KEY']
    os.environ['ACTIVELOOP_TOKEN'] = config['ACTIVELOOP_TOKEN']

    documents = load_documents(config)
    text = split_documents(documents)
    db = create_vector_db(text, config)
    retriever = initialize_retriever(db)

    language_model = config['LANGUAGE_MODEL']
    model = ChatOpenAI(model_name=language_model, temperature=0)
    qa = ConversationalRetrievalChain.from_llm(model, retriever=retriever, return_source_documents=True)

    # Iterate through questions and print answers. This is to test the indexing process was successful
    questions = [
        "What are the Chainstack's core pillars?",
        "Does Chainstack have a guide on eth_getBlockReceipts?"
    ]
    chat_history = []
    for question in questions:
        with get_openai_callback() as tokens_usage:
            result = qa({"question": question, "chat_history": chat_history})
            chat_history.append((question, result['answer']))   

            first_document = result['source_documents'][0]
            metadata = first_document.metadata
            source = metadata['source']

            print(f"-> **Question**: {question}\n")
            print(f"**Answer**: {result['answer']}\n")
            print(f"++source++: {source}")
            print(tokens_usage)


if __name__ == "__main__":
    main()

Let's break down the code into its main components and explain how each part works.

Import Statements

The code begins by importing the necessary libraries and modules:

  • os: To interact with the operating system, mainly for environment variables.

  • BeautifulSoup: To parse HTML content.

  • load_dotenv: To load environment variables from a .env file.

  • Various LangChain modules: To handle document loading, text splitting, vector storage, embeddings, chat models, retrieval chains, and callbacks.

Function: remove_nav_and_header_elements(content: BeautifulSoup) -> str

This function takes a BeautifulSoup object containing HTML content and removes all navigation and header elements (nav and header tags). It returns the cleaned text as a string. This is necessary because the sitemap loader, by default, scrapes the entire page, including a lot of unnecessary data that only adds noise. You will probably need to adapt this function to the specific website you plan to scrape, as this one is tailored to the Chainstack docs.
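If your target site wraps its content in other boilerplate tags, extend the cleaner accordingly. Here is a minimal sketch that also strips footers and sidebars; the extra tag names are assumptions about a hypothetical site's markup, so inspect the actual HTML first:

from bs4 import BeautifulSoup


def clean_page(content: BeautifulSoup) -> str:
    # Drop layout elements that would only add noise to the embeddings.
    # 'footer' and 'aside' are illustrative; pick tags based on your site.
    for element in content.find_all(['nav', 'header', 'footer', 'aside']):
        element.decompose()
    return content.get_text()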

Function: load_configuration()

This function loads environment variables from the .env file, including API keys, sitemap URL, dataset path, and language model. It returns a dictionary containing these configurations.

The sitemap URL is configured in the .env file.

Function: load_documents(config)

This function uses the SitemapLoader class from LangChain to load pages from the Chainstack sitemap. It filters URLs and applies the remove_nav_and_header_elements function to clean the content. It returns the loaded documents.
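To index a different site, only the sitemap URL and the filters need to change. A minimal sketch with placeholder URLs:

from langchain.document_loaders.sitemap import SitemapLoader

# example.com is a placeholder; reuse the cleaning function defined above.
loader = SitemapLoader(
    "https://example.com/sitemap.xml",
    filter_urls=["https://example.com/docs/"],  # index only the docs section
    parsing_function=remove_nav_and_header_elements,
)
documents = loader.load()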

Function: split_documents(documents)

This function splits the loaded documents into chunks using the RecursiveCharacterTextSplitter class from LangChain. It returns the split text chunks.
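To get a feel for how chunk_size and chunk_overlap interact, you can run the same splitter on a plain string; a quick sketch:

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
sample = "LangChain splits on paragraphs first, then sentences, then words. " * 5
chunks = splitter.split_text(sample)  # returns plain strings, unlike split_documents
print(f"{len(chunks)} chunks, first chunk: {chunks[0]!r}")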

Function: create_vector_db(text, config)

This function creates and updates a vector database using the DeepLake class from LangChain. It uses OpenAI embeddings and adds the text chunks to the database. It returns the updated database.

The default configuration creates a local vector database; edit the DATASET_PATH environment variable in the .env file to change its name or location, or to switch to a cloud version, as sketched below.
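The only change needed for a cloud-hosted dataset is the DATASET_PATH value. A sketch of the relevant .env entries, where YOUR_USER_ID is a placeholder for your Activeloop user ID:

# .env, cloud variant
DATASET_PATH="hub://YOUR_USER_ID/chainstack_docs"
ACTIVELOOP_TOKEN="your-activeloop-token"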

Function: initialize_retriever(db)

This function initializes a retriever with specific search parameters, such as cosine distance, fetch count, maximal marginal relevance, and k value. It returns the initialized retriever.

Check out The ultimate LangChain series — chat with your data to learn more about the retriever.
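As a quick sanity check, independent of the conversational chain, you can query the vector store directly through the standard vector store interface; a minimal sketch, assuming db is the DeepLake store created above:

# Fetch the three closest chunks for a test query and print their sources.
docs = db.similarity_search("How do I deploy a node?", k=3)
for doc in docs:
    print(doc.metadata.get('source'))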

Function: main()

This is the main execution flow of the script, and it performs the following steps:

  1. Load Configuration: Calls load_configuration() to load environment variables.

  2. Load Documents: Calls load_documents(config) to load and clean the documents.

  3. Split Documents: Calls split_documents(documents) to split the documents into chunks.

  4. Create Vector DB: Calls create_vector_db(text, config) to create the vector database.

  5. Initialize Retriever: Calls initialize_retriever(db) to set up the retriever.

  6. Set Up Chat Model: Initializes the ChatOpenAI model and sets up the ConversationalRetrievalChain.

  7. Iterate Through Questions: Iterates through predefined questions, queries the model, and prints the answers along with their sources. This is done as a test to make sure the indexing process was successful.

Execution Entry Point: if __name__ == "__main__": main()

This line ensures the main() function is called when the script is executed directly. It serves as the entry point for the script's execution.

Now you can run the script with the following command:

python3 main.py

This starts the indexing process. You will see output similar to the following in the console:

Load pages from Chainstack sitemap...
Fetching pages: 100%|############################################################################| 488/488 [04:35<00:00,  1.77it/s]
====================================================================================================
Splitting documents...
Generated 3912 chunks.
====================================================================================================
Creating vector DB...
./chainstack_docs loaded successfully.
Evaluating ingest: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:27<00:00
Dataset(path='./chainstack_docs', tensors=['embedding', 'ids', 'metadata', 'text'])

  tensor     htype      shape       dtype  compression
  -------   -------    -------     -------  ------- 
 embedding  generic  (3912, 1536)  float32   None   
    ids      text     (3912, 1)      str     None   
 metadata    json     (3912, 1)      str     None   
   text      text     (3912, 1)      str     None   
Vector database updated.
-> **Question**: What are the Chainstack's core pillars?

**Answer**: Chainstack's core pillars are:

1. Unbeatable pricing - Chainstack offers competitive pricing options for its services. You can check their pricing options and use the calculator on their website or contact them for more information.

2. Unbounded performance - Chainstack does not impose rate limiting or hard caps on its services. This allows for optimal performance and scalability.

3. Unlimited flexibility - Chainstack provides the flexibility to customize your blockchain infrastructure according to your specific needs. This includes options such as customizing transaction pool pricing, adding extra node resources, load balancing, and more.

++source++: https://docs.chainstack.com/docs/pricing-introduction
Tokens Used: 1060
        Prompt Tokens: 945
        Completion Tokens: 115

chat.py: Chat interface

In the same directory, create a new file named chat.py and paste the following code:

import os
from dotenv import load_dotenv
from langchain.vectorstores import DeepLake
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain
from langchain.embeddings import OpenAIEmbeddings
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler


def load_environment_variables():
    """Load environment variables from .env file."""
    load_dotenv()
    os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY')
    os.environ['ACTIVELOOP_TOKEN'] = os.getenv('ACTIVELOOP_TOKEN')

def initialize_embeddings():
    """Initialize OpenAI embeddings and disallow special tokens."""
    return OpenAIEmbeddings(disallowed_special=())

def initialize_deeplake(embeddings):
    """Initialize DeepLake vector store with OpenAI embeddings."""
    return DeepLake(
        dataset_path=os.getenv('DATASET_PATH'),
        read_only=True,
        embedding=embeddings,
    )

def initialize_retriever(deep_lake):
    """Initialize retriever and set search parameters."""
    retriever = deep_lake.as_retriever()
    retriever.search_kwargs.update({
        'distance_metric': 'cos',
        'fetch_k': 100,
        'maximal_marginal_relevance': True,
        'k': 10,
    })
    return retriever

def initialize_chat_model():
    """Initialize ChatOpenAI model."""
    return ChatOpenAI(streaming=True, callbacks=[StreamingStdOutCallbackHandler()], model_name=os.getenv('LANGUAGE_MODEL'), temperature=0.0)

def initialize_conversational_chain(model, retriever):
    """Initialize ConversationalRetrievalChain."""
    return ConversationalRetrievalChain.from_llm(model, retriever=retriever, return_source_documents=True)

def get_user_input():
    """Get user input and handle 'quit' command."""
    question = input("\nPlease enter your question (or 'quit' to stop): ")
    return None if question.lower() == 'quit' else question

# In case you want to format the result.
def print_answer(question, answer):
    """Format and print question and answer."""
    print(f"\nQuestion: {question}\nAnswer: {answer}\n")

def main():
    """Main program loop."""
    load_environment_variables()
    embeddings = initialize_embeddings()
    deep_lake = initialize_deeplake(embeddings)
    retriever = initialize_retriever(deep_lake)
    model = initialize_chat_model()
    qa = initialize_conversational_chain(model, retriever)

    # In this case the chat history is stored in memory only
    chat_history = []

    while True:
        question = get_user_input()
        if question is None:  # User has quit
            break

        # Get results based on question
        result = qa({"question": question, "chat_history": chat_history})
        chat_history.append((question, result['answer']))   

        # Take the first source to display
        first_document = result['source_documents'][0]
        metadata = first_document.metadata
        source = metadata['source']

        # We are streaming the response so no need to print those
        #print(f"-> **Question**: {question}\n")
        #print(f"**Answer**: {result['answer']}\n")
        print(f"\n\n++source++: {source}")

if __name__ == "__main__":
    main()

This code allows users to ask questions and receive answers based on indexed documents. Let's break down the code into its main components and explain how each part works.

Import Statements

The code imports the necessary libraries and modules:

  • os: To interact with the operating system, mainly for environment variables.

  • load_dotenv: To load environment variables from a .env file.

  • Various LangChain modules: To handle vector storage, chat models, retrieval chains, embeddings, and callbacks.

Function: load_environment_variables()

This function loads environment variables from a .env file, including the OpenAI API key and ActiveLoop token.

Function: initialize_embeddings()

This function initializes OpenAI embeddings and disallows special tokens. It returns the initialized embeddings.

Function: initialize_deeplake(embeddings)

This function initializes the DeepLake vector store with OpenAI embeddings. It returns the initialized DeepLake object.

Function: initialize_retriever(deep_lake)

This function initializes a retriever with specific search parameters, such as cosine distance, fetch count, maximal marginal relevance, and k value. It returns the initialized retriever.

Function: initialize_chat_model()

This function initializes the ChatOpenAI model with streaming enabled and a callback handler for streaming standard output. It returns the initialized chat model.

Function: initialize_conversational_chain(model, retriever)

This function initializes the ConversationalRetrievalChain with the given chat model and retriever; here, it also tells the chain to return the sources. It returns the initialized conversational chain.

Function: get_user_input()

This function gets user input from the command line and handles the 'quit' command. It returns the user's question, or None if the user wants to quit, so the script can stop gracefully.

Function: print_answer(question, answer)

This function formats and prints the question and answer. It's defined but not used in the code since we enabled streaming.

Function: main()

This is the main execution flow of the script, and it performs the following steps:

  1. Load Environment Variables: Calls load_environment_variables().

  2. Initialize Components: Initializes embeddings, DeepLake, retriever, chat model, and conversational chain.

  3. User Interaction Loop: Enters a loop that asks for user input, processes the question, retrieves the answer, and prints the source. The loop continues until the user enters 'quit'.

Note that chat_history is stored in memory only in this case, which means it is lost once the script stops. This is one of many improvements you can make to this starter code; a sketch of one approach follows.
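A minimal sketch of persisting the history between runs with the standard library; the file name is an assumption, and ConversationalRetrievalChain expects the history as a list of (question, answer) tuples:

import json
from pathlib import Path

HISTORY_FILE = Path("chat_history.json")  # hypothetical location


def load_history():
    # Restore previous (question, answer) pairs, or start fresh.
    if HISTORY_FILE.exists():
        return [tuple(pair) for pair in json.loads(HISTORY_FILE.read_text())]
    return []


def save_history(chat_history):
    # JSON stores tuples as lists, hence the conversion in load_history.
    HISTORY_FILE.write_text(json.dumps(chat_history))

You would call load_history() instead of starting from an empty list in main(), and save_history(chat_history) before the script exits.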

Execution Entry Point: if __name__ == "__main__": main()

This line ensures the main() function is called when the script is executed directly. It serves as the entry point for the script's execution.

Now you can run the script using the following:

python3 chat.py

Then ask away! Here is an example:

Deep Lake Dataset in ./chainstack_docs already exists, loading from the storage

Please enter your question (or 'quit' to stop): What methods can I use to get Ethereum blocks information?
To get Ethereum blocks information, you can use the following methods:

1. eth_blockNumber: This method returns the number of the most recent block on the Ethereum blockchain.

2. eth_getBlockByHash: This method retrieves a block by its block hash.

3. eth_getBlockByNumber: This method retrieves a block by its block number.

4. eth_getBlockTransactionCountByHash: This method returns the number of transactions in a block given its block hash.

5. eth_getBlockTransactionCountByNumber: This method returns the number of transactions in a block given its block number.

6. eth_newBlockFilter: This method creates a new filter that notifies when a new block is added to the Ethereum blockchain.

These methods allow you to access specific block details such as transactions, timestamp, height, header, and more.

++source++: https://docs.chainstack.com/reference/ethereum-blocks-rpc-methods

Considerations and improvements

This is all you need to build a relatively complex app, thanks to LangChain. Keep in mind that this is a starter app; even though it already gives useful, accurate responses, there is plenty of room for improvement.

  • You could add a custom prompt to handle the responses and interactions better (see the sketch after this list).

  • You could improve the memory logic.
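For the custom prompt idea, ConversationalRetrievalChain.from_llm accepts a combine_docs_chain_kwargs argument that lets you swap in your own question-answering prompt. A hedged sketch; the prompt wording is illustrative, not the project's actual prompt:

from langchain.prompts import PromptTemplate

qa_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "You are a helpful assistant for the Chainstack documentation.\n"
        "Answer using only the context below; if the answer is not there, say so.\n\n"
        "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    ),
)

qa = ConversationalRetrievalChain.from_llm(
    model,
    retriever=retriever,
    return_source_documents=True,
    combine_docs_chain_kwargs={"prompt": qa_prompt},
)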

To practice, try to make these enhancements and come up with more improvements!

Conclusion

The Chainstack docs chatbot project illustrates a powerful and practical application of conversational AI, leveraging cutting-edge technologies to create a Python-based Command Line Interface (CLI) application. Using LangChain, the project enables users to interact with custom documentation, ask questions, and receive precise answers based on indexed documents.

This is a starter project, and many improvements can be made. Practice by making a better app!

We'll see how to add a front end in the next projects.
