LangChain practical projects — build a CLI chatbot
Build an AI assistant trained on specific docs
Table of contents
- Introduction
- Technologies Used
- Prerequisites
- Project Structure
- main.py: Indexing the Data
- Import Statements
- Function: remove_nav_and_header_elements(content: BeautifulSoup) -> str
- Function: load_configuration()
- Function: load_documents(config)
- Function: split_documents(documents)
- Function: create_vector_db(text, config)
- Function: initialize_retriever(db)
- Function: main()
- Execution Entry Point: if __name__ == "__main__": main()
- chat.py: Chat interface
- Import Statements
- Function: load_environment_variables()
- Function: initialize_embeddings()
- Function: initialize_deeplake(embeddings)
- Function: initialize_retriever(deep_lake)
- Function: initialize_chat_model()
- Function: initialize_conversational_chain(model, retriever)
- Function: get_user_input()
- Function: print_answer(question, answer)
- Function: main()
- Execution Entry Point: if __name__ == "__main__": main()
- Considerations and improvements
- Conclusion
Introduction
In the rapidly evolving field of conversational AI, the ability to interact with and query structured data is paramount. This project showcases a Python-based Command Line Interface (CLI) application that leverages the power of LangChain to index and chat with custom documentation. The project comprises two main files, `main.py` for indexing the data and `chat.py` for querying, and demonstrates a practical implementation of conversational retrieval, allowing users to ask questions and receive precise answers based on indexed documents.
This is the first practical project after completing The ultimate LangChain series. Make sure to go over it before starting so you can get the foundations required.
Mastering AI Applications with Langchain & Python — The Ultimate LangChain guide
In this guide, we'll build a custom chatbot trained on the Chainstack documentation.
Technologies Used
The project uses a combination of cutting-edge technologies and libraries to achieve its functionality:
- LangChain: A core library that facilitates conversational interfaces with data, enabling document indexing, retrieval, and conversational modeling.
- BeautifulSoup: Used for parsing HTML content and extracting relevant document information.
- DeepLake: A vector database used for storing and retrieving document embeddings.
- OpenAIEmbeddings: Used to generate embeddings for the text data.
- dotenv: A library to manage environment variables, ensuring secure and convenient configuration management.
Together, these technologies form the project's backbone, providing a robust and flexible framework for building a conversational interface with custom documentation.
Prerequisites
Required Knowledge
To successfully follow this guide and implement the project, completing the series Mastering AI Applications with Langchain & Python — The Ultimate LangChain guide is recommended. This series will provide you with the foundational knowledge and skills needed to understand and work with Langchain, a core component of this project.
Tools and Libraries
Before getting started, ensure you have the following installed and configured:
- Python: Version 3.7 or newer is required. [Download Python](https://www.python.org/downloads/)
- OpenAI Account: An active OpenAI account with an OpenAI API key.
- Activeloop Account: An Activeloop account, complete with an Activeloop API key.
Environment Setup
Follow these steps to set up the development environment:
You can either clone the repository or start from scratch; skip step one if starting from zero.
Clone the Repository:
git clone https://github.com/soos3d/chainstack-docs-chat.git
Navigate to the Project's Directory:
cd chainstack-docs-chat
Create a New Python Virtual Environment:
python3 -m venv docs-chat
Activate the virtual environment:
source docs-chat/bin/activate
Install Dependencies:
pip install -r requirements.txt
Configure API Keys and Settings in the `.env` File: Add your OpenAI API key, Activeloop token, and other configuration settings to the `.env` file:

```env
# OpenAI
OPENAI_API_KEY=""
EMBEDDINGS_MODEL="text-embedding-ada-002"
LANGUAGE_MODEL="gpt-3.5-turbo" # gpt-4 gpt-3.5-turbo

# Deeplake vector DB
ACTIVELOOP_TOKEN=""
DATASET_PATH="./chainstack_docs" # "hub://USER_ID/custom_dataset" # Edit with your user id if you want to use the cloud db.

# Scrape settings
SITE_MAP="https://docs.chainstack.com/sitemap.xml"
```

You can also fill in the `env.sample` file and rename it to `.env`.
Following these steps, you will have a fully configured environment ready to explore the Chainstack docs chatbot project. The subsequent sections of this guide will delve into the details of the code, explaining how each component works.
Project Structure
Understanding the project's structure is essential for navigating the code and making modifications or extensions. Here's an overview of the main files and their roles:
1. .env
Role: This file stores secrets and configuration as environment variables. It includes API keys for OpenAI and Activeloop, the embedding model, language model, dataset path, and sitemap URL.
Usage: Used to manage sensitive information securely and easily configure various settings for the project.
2. main.py
Role: The main file responsible for scraping pages and creating the vector database. It includes functions for loading configuration, loading documents, splitting documents, creating the vector database, initializing the retriever, and setting up the conversational chain.
Usage: Run this file to index the documents from the Chainstack sitemap and create a vector database for retrieval.
3. chat.py
Role: This file handles user queries and interacts with the indexed data to provide answers. It includes functions for loading environment variables, initializing various components (e.g., embeddings, retriever, chat model), and managing the user interaction loop.
Usage: Run this file to start the chat interface, allowing users to enter questions and receive answers based on the indexed documents.
4. requirements.txt
Role: Lists the Python packages and libraries required to run the project.
Usage: Used in conjunction with `pip install -r requirements.txt` to install all necessary dependencies.
These files form the core structure of the Chainstack docs chatbot project. The `main.py` file focuses on indexing and preparing the data, while the `chat.py` file manages the interaction with users, leveraging the indexed data to provide relevant answers. The `.env` file ensures that sensitive information is handled securely, and `requirements.txt` simplifies the installation of dependencies.
main.py: Indexing the Data
Once the environment is ready, create a new file named `main.py` and paste the following code:
import os
from bs4 import BeautifulSoup
from dotenv import load_dotenv
from langchain.document_loaders.sitemap import SitemapLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import DeepLake
from langchain.embeddings import OpenAIEmbeddings
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain
from langchain.callbacks import get_openai_callback
def remove_nav_and_header_elements(content: BeautifulSoup) -> str:
# Remove navigation and header elements from HTML content
nav_elements = content.find_all('nav')
header_elements = content.find_all('header')
for element in nav_elements + header_elements:
element.decompose()
return str(content.get_text())
def load_configuration():
# Load environment variables from .env file
load_dotenv()
return {
'OPENAI_API_KEY': os.getenv('OPENAI_API_KEY'),
'ACTIVELOOP_TOKEN': os.getenv('ACTIVELOOP_TOKEN'),
'SITE_MAP': os.getenv('SITE_MAP'),
'DATASET_PATH': os.getenv('DATASET_PATH'),
'LANGUAGE_MODEL': os.getenv('LANGUAGE_MODEL')
}
def load_documents(config):
# Load pages from Chainstack sitemap using SitemapLoader
print('Load pages from Chainstack sitemap...')
loader = SitemapLoader(
config['SITE_MAP'],
filter_urls=["https://docs.chainstack.com/docs/", "https://docs.chainstack.com/reference/"],
parsing_function=remove_nav_and_header_elements
)
return loader.load()
def split_documents(documents):
# Split documents into chunks using RecursiveCharacterTextSplitter
"""
Check https://blog.davideai.dev/the-ultimate-langchain-series-text-splitters?source=more_series_bottom_blogs#heading-chunk-size-and-overlap
to understand more about the splitter
"""
# Extract metadata from the documents if you want to use it for something
metadatas = [doc.metadata for doc in documents]
print("=" * 100)
print('Splitting documents...')
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=100,
)
text = text_splitter.split_documents(documents)
print(f'Generated {len(text)} chunks.')
return text
def create_vector_db(text, config):
# Create and update a vector database using DeepLake
print("=" * 100)
print('Creating vector DB...')
embeddings = OpenAIEmbeddings(disallowed_special=())
deeplake_path = config['DATASET_PATH']
db = DeepLake(dataset_path=deeplake_path, embedding_function=embeddings, overwrite=True)
db.add_documents(text)
print('Vector database updated.')
return db
def initialize_retriever(db):
# Initialize retriever with specific search parameters
"""
Check https://blog.davideai.dev/the-ultimate-langchain-series-chat-with-your-data#heading-setting-up-the-retriever
to understand more about the retriever
"""
retriever = db.as_retriever()
retriever.search_kwargs.update({
'distance_metric': 'cos',
'fetch_k': 100,
'maximal_marginal_relevance': True,
'k': 10,
})
return retriever
def main():
# Main execution flow
config = load_configuration()
os.environ['OPENAI_API_KEY'] = config['OPENAI_API_KEY']
os.environ['ACTIVELOOP_TOKEN'] = config['ACTIVELOOP_TOKEN']
documents = load_documents(config)
text = split_documents(documents)
db = create_vector_db(text, config)
retriever = initialize_retriever(db)
language_model = config['LANGUAGE_MODEL']
model = ChatOpenAI(model_name=language_model, temperature=0)
qa = ConversationalRetrievalChain.from_llm(model, retriever=retriever, return_source_documents=True)
# Iterate through questions and print answers to verify the indexing process was successful
questions = [
"What are the Chainstack's core pillars?",
"Does Chainstack have a guide on eth_getBlockReceipts?"
]
chat_history = []
for question in questions:
with get_openai_callback() as tokens_usage:
result = qa({"question": question, "chat_history": chat_history})
chat_history.append((question, result['answer']))
first_document = result['source_documents'][0]
metadata = first_document.metadata
source = metadata['source']
print(f"-> **Question**: {question}\n")
print(f"**Answer**: {result['answer']}\n")
print(f"++source++: {source}")
print(tokens_usage)
if __name__ == "__main__":
main()
Let's break down the code into its main components and explain how each part works.
Import Statements
The code begins by importing the necessary libraries and modules:
- `os`: To interact with the operating system, mainly for environment variables.
- `BeautifulSoup`: To parse HTML content.
- `load_dotenv`: To load environment variables from a `.env` file.
- Various LangChain modules: To handle document loading, text splitting, vector storage, embeddings, chat models, retrieval chains, and callbacks.
Function: remove_nav_and_header_elements(content: BeautifulSoup) -> str
This function takes a BeautifulSoup object containing HTML content and removes all navigation and header elements (`nav` and `header` tags). It returns the cleaned text as a string. This is necessary because the sitemap loader, by default, scrapes the entire page, including a lot of unnecessary data that only adds noise. You will probably need to adapt this function to the specific website you plan to scrape, as this one is tailored to the Chainstack docs.
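For example, if the pages you scrape also contain footers or sidebars, you can strip those too. Here is a minimal sketch, assuming `footer` and `aside` tags (inspect your target site's HTML to pick the right ones):

```python
from bs4 import BeautifulSoup

def remove_unwanted_elements(content: BeautifulSoup) -> str:
    # Example tags to strip; adjust the tuple to the site you are scraping
    for tag_name in ('nav', 'header', 'footer', 'aside'):
        for element in content.find_all(tag_name):
            element.decompose()
    return str(content.get_text())
```

You would then pass this function as the `parsing_function` argument of `SitemapLoader`, just like the original.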
Function: load_configuration()
This function loads environment variables from the `.env` file, including API keys, the sitemap URL, dataset path, and language model. It returns a dictionary containing these configurations. The sitemap URL itself is set in the `.env` file.
Function: load_documents(config)
This function uses the `SitemapLoader` class from LangChain to load pages from the Chainstack sitemap. It filters URLs and applies the `remove_nav_and_header_elements` function to clean the content. It returns the loaded documents.
Function: split_documents(documents)
This function splits the loaded documents into chunks using the `RecursiveCharacterTextSplitter` class from LangChain. It returns the split text chunks.
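If you want a feel for how chunk size and overlap interact, you can run the splitter on a plain string in isolation. A quick standalone sketch, with deliberately tiny sizes for illustration (the project itself uses 1000/100):

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Tiny sizes so the overlap is easy to see in the output
splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
sample = "LangChain splits long documents into overlapping chunks. " * 10
chunks = splitter.split_text(sample)

print(f"{len(chunks)} chunks")
print(chunks[0][-30:])  # the tail of chunk 1 roughly...
print(chunks[1][:30])   # ...reappears at the head of chunk 2
```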
Function: create_vector_db(text, config)
This function creates and updates a vector database using the `DeepLake` class from LangChain. It uses OpenAI embeddings and adds the text chunks to the database. It returns the updated database.
The default version creates a local vector database; edit the `DATASET_PATH` environment variable in the `.env` file to change the name/location or to create a cloud version.
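For instance, to store the dataset in Activeloop's hosted storage instead of a local folder, point `DATASET_PATH` at a `hub://` path, as the comment in the sample configuration suggests (replace `YOUR_USER_ID` with your Activeloop user ID):

```env
DATASET_PATH="hub://YOUR_USER_ID/chainstack_docs"
```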
Function: initialize_retriever(db)
This function initializes a retriever with specific search parameters: cosine distance as the metric, a fetch count (`fetch_k`) of candidate chunks to pull, maximal marginal relevance to re-rank those candidates for a balance of relevance and diversity, and `k`, the number of chunks ultimately returned. It returns the initialized retriever.
Check out The ultimate LangChain series — chat with your data to learn more about the retriever.
Function: main()
This is the main execution flow of the script, and it performs the following steps:
1. Load Configuration: Calls `load_configuration()` to load environment variables.
2. Load Documents: Calls `load_documents(config)` to load and clean the documents.
3. Split Documents: Calls `split_documents(documents)` to split the documents into chunks.
4. Create Vector DB: Calls `create_vector_db(text, config)` to create the vector database.
5. Initialize Retriever: Calls `initialize_retriever(db)` to set up the retriever.
6. Set Up Chat Model: Initializes the `ChatOpenAI` model and sets up the `ConversationalRetrievalChain`.
7. Iterate Through Questions: Iterates through predefined questions, queries the model, and prints the answers along with the source and token usage (via `get_openai_callback`). This is done as a test to make sure the indexing process was successful.
Execution Entry Point: if __name__ == "__main__": main()
This line ensures the `main()` function is called when the script is executed directly. It serves as the entry point for the script's execution.
Now you can run the script with the following command:
python3 main.py
This starts the process and indexes the data. You will see output similar to this in the console:
Load pages from Chainstack sitemap...
Fetching pages: 100%|############################################################################| 488/488 [04:35<00:00, 1.77it/s]
====================================================================================================
Splitting documents...
Generated 3912 chunks.
====================================================================================================
Creating vector DB...
./chainstack_docs loaded successfully.
Evaluating ingest: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:27<00:00
Dataset(path='./chainstack_docs', tensors=['embedding', 'ids', 'metadata', 'text'])
tensor htype shape dtype compression
------- ------- ------- ------- -------
embedding generic (3912, 1536) float32 None
ids text (3912, 1) str None
metadata json (3912, 1) str None
text text (3912, 1) str None
Vector database updated.
-> **Question**: What are the Chainstack's core pillars?
**Answer**: Chainstack's core pillars are:
1. Unbeatable pricing - Chainstack offers competitive pricing options for its services. You can check their pricing options and use the calculator on their website or contact them for more information.
2. Unbounded performance - Chainstack does not impose rate limiting or hard caps on its services. This allows for optimal performance and scalability.
3. Unlimited flexibility - Chainstack provides the flexibility to customize your blockchain infrastructure according to your specific needs. This includes options such as customizing transaction pool pricing, adding extra node resources, load balancing, and more.
++source++: https://docs.chainstack.com/docs/pricing-introduction
Tokens Used: 1060
Prompt Tokens: 945
Completion Tokens: 115
chat.py: Chat interface
In the same directory, create a new file named `chat.py` and paste the following code:
import os
from dotenv import load_dotenv
from langchain.vectorstores import DeepLake
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain
from langchain.embeddings import OpenAIEmbeddings
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
def load_environment_variables():
"""Load environment variables from .env file."""
load_dotenv()
os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY')
os.environ['ACTIVELOOP_TOKEN'] = os.getenv('ACTIVELOOP_TOKEN')
def initialize_embeddings():
"""Initialize OpenAI embeddings and disallow special tokens."""
return OpenAIEmbeddings(disallowed_special=())
def initialize_deeplake(embeddings):
"""Initialize DeepLake vector store with OpenAI embeddings."""
return DeepLake(
dataset_path=os.getenv('DATASET_PATH'),
read_only=True,
embedding=embeddings,
)
def initialize_retriever(deep_lake):
"""Initialize retriever and set search parameters."""
retriever = deep_lake.as_retriever()
retriever.search_kwargs.update({
'distance_metric': 'cos',
'fetch_k': 100,
'maximal_marginal_relevance': True,
'k': 10,
})
return retriever
def initialize_chat_model():
"""Initialize ChatOpenAI model."""
return ChatOpenAI(streaming=True, callbacks=[StreamingStdOutCallbackHandler()], model_name=os.getenv('LANGUAGE_MODEL'), temperature=0.0)
def initialize_conversational_chain(model, retriever):
"""Initialize ConversationalRetrievalChain."""
return ConversationalRetrievalChain.from_llm(model, retriever=retriever, return_source_documents=True)
def get_user_input():
"""Get user input and handle 'quit' command."""
question = input("\nPlease enter your question (or 'quit' to stop): ")
return None if question.lower() == 'quit' else question
# In case you want to format the result.
def print_answer(question, answer):
"""Format and print question and answer."""
print(f"\nQuestion: {question}\nAnswer: {answer}\n")
def main():
"""Main program loop."""
load_environment_variables()
embeddings = initialize_embeddings()
deep_lake = initialize_deeplake(embeddings)
retriever = initialize_retriever(deep_lake)
model = initialize_chat_model()
qa = initialize_conversational_chain(model, retriever)
# In this case the chat history is stored in memory only
chat_history = []
while True:
question = get_user_input()
if question is None: # User has quit
break
# Get results based on question
result = qa({"question": question, "chat_history": chat_history})
chat_history.append((question, result['answer']))
# Take the first source to display
first_document = result['source_documents'][0]
metadata = first_document.metadata
source = metadata['source']
# We are streaming the response so no need to print those
#print(f"-> **Question**: {question}\n")
#print(f"**Answer**: {result['answer']}\n")
print(f"\n\n++source++: {source}")
if __name__ == "__main__":
main()
This code allows users to ask questions and receive answers based on indexed documents. Let's break down the code into its main components and explain how each part works.
Import Statements
The code imports the necessary libraries and modules:
- `os`: To interact with the operating system, mainly for environment variables.
- `load_dotenv`: To load environment variables from a `.env` file.
- Various LangChain modules: To handle vector storage, chat models, retrieval chains, embeddings, and callbacks.
Function: load_environment_variables()
This function loads environment variables from the `.env` file, including the OpenAI API key and Activeloop token.
Function: initialize_embeddings()
This function initializes OpenAI embeddings and disallows special tokens. It returns the initialized embeddings.
Function: initialize_deeplake(embeddings)
This function initializes the DeepLake vector store with OpenAI embeddings. It returns the initialized DeepLake object.
Function: initialize_retriever(deep_lake)
This function initializes a retriever with specific search parameters, such as cosine distance, fetch count, maximal marginal relevance, and k value. It returns the initialized retriever.
Function: initialize_chat_model()
This function initializes the ChatOpenAI model with streaming enabled and a callback handler for streaming standard output. It returns the initialized chat model.
Function: initialize_conversational_chain(model, retriever)
This function initializes the ConversationalRetrievalChain with the given chat model and retriever; here, it also tells the chain to return the sources. It returns the initialized conversational chain.
Function: get_user_input()
This function gets user input from the command line and handles the 'quit' command. It returns the user's question, or `None` if the user wants to quit, allowing the script to stop gracefully.
Function: print_answer(question, answer)
This function formats and prints the question and answer. It's defined but not used in the code since we enabled streaming.
Function: main()
This is the main execution flow of the script, and it performs the following steps:
1. Load Environment Variables: Calls `load_environment_variables()`.
2. Initialize Components: Initializes embeddings, DeepLake, retriever, chat model, and conversational chain.
3. User Interaction Loop: Enters a loop that asks for user input, processes the question, retrieves the answer, and prints the source. The loop continues until the user enters 'quit'.
Note that the `chat_history` in this case is stored in memory, meaning it is lost once the script stops. Improving this is one of many enhancements you can make to this starter code.
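If you want the conversation to survive restarts, one simple approach is to serialize `chat_history` to disk between sessions. This is a minimal sketch, assuming a local JSON file (the `history.json` name is arbitrary):

```python
import json
import os

HISTORY_FILE = "history.json"  # arbitrary filename for this sketch

def load_chat_history():
    """Load previous (question, answer) pairs from disk, if any."""
    if os.path.exists(HISTORY_FILE):
        with open(HISTORY_FILE, "r") as f:
            return [tuple(pair) for pair in json.load(f)]
    return []

def save_chat_history(chat_history):
    """Persist the (question, answer) pairs as a JSON list."""
    with open(HISTORY_FILE, "w") as f:
        json.dump(chat_history, f)
```

You would call `load_chat_history()` instead of starting with an empty list in `main()`, and `save_chat_history(chat_history)` after each turn (or when the user quits).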
Execution Entry Point: if __name__ == "__main__": main()
This line ensures the `main()` function is called when the script is executed directly. It serves as the entry point for the script's execution.
Now you can run the script with the following command:
python3 chat.py
Then ask away! Here is an example:
Deep Lake Dataset in ./chainstack_docs already exists, loading from the storage
Please enter your question (or 'quit' to stop): What methods can I use to get Ethereum blocks information?
To get Ethereum blocks information, you can use the following methods:
1. eth_blockNumber: This method returns the number of the most recent block on the Ethereum blockchain.
2. eth_getBlockByHash: This method retrieves a block by its block hash.
3. eth_getBlockByNumber: This method retrieves a block by its block number.
4. eth_getBlockTransactionCountByHash: This method returns the number of transactions in a block given its block hash.
5. eth_getBlockTransactionCountByNumber: This method returns the number of transactions in a block given its block number.
6. eth_newBlockFilter: This method creates a new filter that notifies when a new block is added to the Ethereum blockchain.
These methods allow you to access specific block details such as transactions, timestamp, height, header, and more.
++source++: https://docs.chainstack.com/reference/ethereum-blocks-rpc-methods
Considerations and improvements
This is all you need to build a relatively complex app, thanks to LangChain. Keep in mind this is a starter app; even though it already gives useful and good responses, there are several improvements worth making:
- You could add a custom prompt to better steer the responses and interactions; a minimal sketch follows below.
- You could improve the memory logic, for example by persisting the chat history as sketched in the `chat.py` section.

To practice, try to make these enhancements and come up with more improvements!
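As a starting point for the first suggestion, `ConversationalRetrievalChain.from_llm` accepts keyword arguments for its internal document-combining chain. Here is a minimal sketch, assuming the `model` and `retriever` objects from `chat.py`; the template text is illustrative, so adjust the wording to your needs:

```python
from langchain.prompts import PromptTemplate

# Illustrative template; the chain fills in {context} and {question}
custom_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "You are a helpful assistant for the Chainstack documentation.\n"
        "Answer using only the context below; if the answer is not there, say you don't know.\n\n"
        "Context:\n{context}\n\n"
        "Question: {question}\n"
        "Answer:"
    ),
)

qa = ConversationalRetrievalChain.from_llm(
    model,
    retriever=retriever,
    return_source_documents=True,
    combine_docs_chain_kwargs={"prompt": custom_prompt},  # overrides the default QA prompt
)
```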
Conclusion
The Chainstack docs chatbot project illustrates a powerful and practical application of conversational AI, leveraging cutting-edge technologies to create a Python-based Command Line Interface (CLI) application. Using LangChain, the project enables users to interact with custom documentation, ask questions, and receive precise answers based on indexed documents.
This is a starter project, and many improvements can be made. Practice by making a better app!
We'll see how to add a front end in the next projects.