The ultimate LangChain series — text splitters

Learn how to use text splitters in LangChain

Introduction

Welcome to the fourth article in this series. So far, we have explored how to set up a LangChain project and load documents; now it's time to process our sources and introduce text splitters, the next step in building an LLM-based application.

Check out the first three parts of the series:

What is a text splitter in LangChain

A text splitter is an algorithm or method that breaks down a large piece of text into smaller chunks or segments. The goal is to create manageable pieces that can be processed individually, which is often necessary when dealing with large documents or datasets.

There are different kinds of splitters in LangChain depending on your use case; the splitter we'll see the most is the RecursiveCharacterTextSplitter, which is ideal for general documents, such as text or a mix of text and code, and so on.

Text splitters in LangChain come with some controls to manage the size and quality of the chunks (see the sketch after this list):

  1. length_function: This parameter determines how the length of a chunk is calculated. By default, it simply counts the number of characters, but you could also pass a token counter function here, which would count the number of words or other tokens in a chunk instead of characters.

  2. chunk_size: This parameter sets the maximum size of the chunks. The size is measured according to the length_function parameter.

  3. chunk_overlap: This parameter sets the maximum overlap between chunks. Overlapping chunks means that some part of the text will be included in more than one chunk. For instance, this can be useful in some situations to maintain context or continuity between chunks.

  4. add_start_index: This parameter is a boolean flag that determines whether to include the starting position of each chunk within the original document in the metadata. Including this information could be useful for tracking the origin of each chunk in the original document.
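Here's a minimal sketch that ties these four parameters together on a short made-up string (the sample text and values are purely illustrative):

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Sample text, purely illustrative
sample = (
    "LangChain text splitters break long documents into chunks. "
    "Each chunk can then be embedded and searched on its own. "
    "Overlap keeps a little shared context between neighbouring chunks."
)

splitter = RecursiveCharacterTextSplitter(
    chunk_size = 80,         # max size of each chunk, measured by length_function
    chunk_overlap = 15,      # up to 15 characters repeated between consecutive chunks
    length_function = len,   # default: plain character count
    add_start_index = True,  # store each chunk's offset in the metadata
)

for doc in splitter.create_documents([sample]):
    print(doc.metadata["start_index"], repr(doc.page_content))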

Complete list of text splitters.

Split some documents

Now, let's move on to the second step: after we've loaded a document, we'll dive into various text splitters, using one of the PDF examples introduced in the previous article.

Find out how to load a PDF.

Recursive Character text splitter

The RecursiveCharacterTextSplitter is often recommended for handling general text due to its adaptability. This text splitter operates on a list of characters that serve as delimiters, or 'split points', within the text. It attempts to create chunks by splitting on these characters, one by one, in the order they are listed, until the resulting chunks reach a manageable size.

The default character list is ["\n\n", "\n", " ", ""]. The text splitter first attempts to split the text at every double newline ("\n\n"), which typically separates paragraphs in the text. If the resulting chunks are too large, it then tries to split at every newline ("\n"), which often separates sentences. If the chunks are still too large, it finally tries to split at every space (" "), which separates words. If the chunks remain too large, it splits at every character (""), though in most cases, this level of granularity is unnecessary.
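To make the recursion concrete, here's a small sketch that passes the default separator list explicitly and splits a short made-up string:

from langchain.text_splitter import RecursiveCharacterTextSplitter

text = "First paragraph, a bit longer than the limit.\n\nSecond paragraph.\nA final line."

splitter = RecursiveCharacterTextSplitter(
    separators = ["\n\n", "\n", " ", ""],  # the defaults, written out explicitly
    chunk_size = 40,
    chunk_overlap = 0,
)

# Each returned string fits within chunk_size, split at the 'highest' separator possible
print(splitter.split_text(text))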

The advantage of this approach is that it tries to preserve as much semantic context as possible by keeping paragraphs, sentences, and words intact. These units of text tend to have strong semantic relationships, meaning the words within them are often closely related in meaning, which is beneficial for many natural language processing tasks.

Find more about the RecursiveCharacterTextSplitter on LangChain docs.

Now that we have an idea of how it works, let's try it in practice and continue building from the SpaceX CRS-5 Mission Press Kit PDF.

Follow the steps from the previous article to get here.

from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Use the PyPDFLoader to load and parse the PDF
loader = PyPDFLoader("./pdf_files/SpaceX_NASA_CRS-5_PressKit.pdf")
pages = loader.load_and_split()
print(f'Loaded {len(pages)} pages from the PDF')

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 200,
    chunk_overlap  = 10,
    length_function = len,
    add_start_index = True,
)

texts = text_splitter.split_documents(pages)
print(f'Split the pages in {len(texts)} chunks')
print(texts[0])
print(texts[1])

Now here is what's happening:

  1. Load and Parse the PDF.

  2. Set Up Text Splitter:

    • Create an instance of the RecursiveCharacterTextSplitter. The parameters passed to the constructor are:

      • chunk_size: This defines the maximum size of the chunks that the text should be split into. In this case, it's set to 200, so each chunk will contain at most 200 characters. This is an example size; we'll discuss this later.

      • chunk_overlap: This is the maximum overlap between chunks. Here, it's set to 10, so there could be up to 10 characters of overlap between consecutive chunks.

      • length_function: This is a function used to calculate the length of chunks. In this case, it's just the built-in len function, so the length of a chunk is simply its number of characters.

      • add_start_index: This parameter determines whether to include the starting position of each chunk within the original document in the metadata. Here, it's set to True, so this information will be included.

  3. Split the Text:

    • The split_documents method is then called on the text_splitter instance, with the pages list as the argument. This method goes through each page in the pages list and splits the text of the page into chunks according to the parameters set when initializing the text_splitter. The result is a list of chunks, and the number of chunks is printed.
  4. Print the First Two Chunks: Finally, we print the first two chunks from the texts list to the console. Each chunk is a Document object whose page_content attribute holds the chunk's text and whose metadata dictionary holds information about the chunk, including its starting position within the original document (start_index), as enabled by the add_start_index parameter.

Along the way, we print the number of pages the PDF was loaded into and how many chunks the splitter created. In this example, we have 26 pages and 151 chunks.
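A quick way to sanity-check the result is to inspect a single chunk's text and metadata (the exact values will depend on your document):

first_chunk = texts[0]
print(first_chunk.page_content)   # the chunk's text
print(first_chunk.metadata)       # e.g. source, page number and start_index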

Chunk size and overlap

When working with text data, setting up the parameters correctly is important. In our example, the numbers used for chunk size and overlap were chosen arbitrarily, but we need to decide on them in a real-world scenario.

Firstly, we have to split the text in a way that won't exceed the token limit of our embedding model. "Embedding" might sound like a complex term, but in reality, it's a way for us to turn words, sentences, or whole documents into numerical vectors or 'embeddings'. These vectors capture the meaning and relationships of words and sentences in a way that computers can understand.

The embedding model we'll be using is OpenAI's text-embedding-ada-002, which is a good fit for many types of applications. This model can handle up to 8191 tokens, so we have to make sure our chunks of text have fewer tokens than this.

You might wonder, what's a 'token'? It's not the same as a character. Roughly speaking, one token is about four characters long. This means our model can handle a lot of characters, but we need to be careful not to make our chunks too big, or we might lose some context.
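If you want to check token counts yourself, here's a small sketch using the tiktoken package (an optional extra, not used elsewhere in this series):

# pip install tiktoken
import tiktoken

encoding = tiktoken.encoding_for_model("text-embedding-ada-002")

sample = "SpaceX's Dragon spacecraft carries cargo to the International Space Station."
print(len(sample))                   # number of characters
print(len(encoding.encode(sample)))  # number of tokens, roughly a quarter as many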

Based on my experience, keeping chunks between 500 and 1000 characters is best. This size seems to work well without losing important information.

As for the overlap parameter, this refers to how much text we want to repeat between chunks. Setting it to 10-20% of the chunk size is generally recommended. This way, we ensure some connection between the chunks without causing too much repetition. If the overlap is too large, it can slow down the process and increase costs.

So, based on that, here are the configurations I typically use with relatively long texts:

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap  = 50,
    length_function = len,
    add_start_index = True,
)

# or

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap  = 100,
    length_function = len,
    add_start_index = True,
)

Custom length function

In this example, we are using the default length function, len, which counts the number of characters, but we can write and pass more complex functions. For instance, we can calculate the length based on tokens instead of characters. To do this, we can use Hugging Face's Transformers library; let's install a few more packages:

pip install transformers torch tensorflow

And here is a breakdown of each of them:

  1. transformers: This is a library developed by Hugging Face that provides general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBERT, etc.) for Natural Language Understanding (NLU) and Natural Language Generation (NLG). It contains thousands of pre-trained models in more than 100 languages and works with deep learning frameworks such as PyTorch and TensorFlow. It's designed to be research-friendly, easy to use, and efficient.

  2. torch (PyTorch): PyTorch is an open-source machine learning library developed by Facebook's AI Research lab. It's based on the Torch library and is used for computer vision and natural language processing applications. It's known for being more intuitive and easier to debug than other libraries and for its strong GPU acceleration support.

  3. tensorflow: TensorFlow is another open-source machine learning library developed by the Google Brain team. It's designed to provide a flexible, efficient, and scalable platform for machine learning and supports a wide range of neural network architectures. TensorFlow also offers robust support for distributed computing, allowing you to train large models across multiple GPUs or even across multiple machines.

Then let's add our custom function (just like we did to scrape custom data from the sitemap loader) and use it as a parameter in the text splitter constructor:

from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from transformers import AutoTokenizer


# Use the PyPDFLoader to load and parse the PDF
loader = PyPDFLoader("./pdf_files/SpaceX_NASA_CRS-5_PressKit.pdf")
pages = loader.load_and_split()
print(f'Loaded {len(pages)} pages from the PDF')

tokenizer = AutoTokenizer.from_pretrained('gpt2')

def tokens(text: str) -> int:
    return len(tokenizer.encode(text))

# Note that we reduce the chunk size to 250 since we are working with tokens now
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 250,
    chunk_overlap  = 20,
    length_function = tokens,
    add_start_index = True,
)

texts = text_splitter.split_documents(pages)
print(f'Split the pages in {len(texts)} chunks')
print(texts[0])
print(texts[1])

Note that now we are working with tokens, so I changed the chunk size from 1000 to 250, and the overlap from 100 to 20.

Running both, you'll see that the token-based splitter creates more even chunks, which might help the model capture the context.

Having said that, the regular splitter works extremely well and might be the best choice for handling simple text, since it's easier to manage.

Code splitters

As we mentioned earlier, LangChain offers a wide range of splitters depending on your use case; let's now see what we can use if we are only working with code.

Find the code text splitter in the docs.

The CodeTextSplitter allows you to split a piece of code into smaller parts, for example, to analyze or process them separately. It does this based on language-specific syntax rules and conventions. In practice, this is done with RecursiveCharacterTextSplitter.from_language, which configures the recursive splitter with separators (characters or character sequences) specific to the language you choose.

Let's see how we can use it to split our own code:

from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    Language,
)

# Print a list of the available languages
for code in Language:
    print(code)

# The code to split
python = """
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from transformers import AutoTokenizer


# Use the PyPDFLoader to load and parse the PDF
loader = PyPDFLoader("./pdf_files/SpaceX_NASA_CRS-5_PressKit.pdf")
pages = loader.load_and_split()
print(f'Loaded {len(pages)} pages from the PDF')

tokenizer = AutoTokenizer.from_pretrained('gpt2')

def tokens(text: str) -> int:
    return len(tokenizer.encode(text))

# Note that we reduce the chunk size to 250 since we are working with tokens now
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 250,
    chunk_overlap  = 20,
    length_function = tokens,
    add_start_index = True,
)

texts = text_splitter.split_documents(pages)
print(f'Split the pages in {len(texts)} chunks')
print(texts[0])
print(texts[1])
"""

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=50, chunk_overlap=0
)

python_docs = python_splitter.create_documents([python])
print(python_docs)

By running this, you will first print a list of available languages in this format:

Language.PROGRAMMING_LANGUAGE

Then we split the code from one of the previous examples using Language.PYTHON.
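The same pattern works for any of the listed languages; for example, here's a quick sketch splitting a made-up JavaScript snippet with Language.JS:

js_code = """
function helloWorld() {
  console.log("Hello, World!");
}

helloWorld();
"""

js_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.JS, chunk_size=60, chunk_overlap=0
)
print(js_splitter.create_documents([js_code]))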

So this method is ideal if you are working exclusively with code bases.

Another useful splitter is the NLTKTextSplitter, based on the Natural Language Toolkit (NLTK), which is ideal if you are working with speeches and similar natural language text.
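Here's a rough sketch of how it can be used, assuming nltk is installed (pip install nltk) and its 'punkt' sentence tokenizer data has been downloaded:

# python -c "import nltk; nltk.download('punkt')"
from langchain.text_splitter import NLTKTextSplitter

speech = (
    "Ladies and gentlemen. Today we launch a new mission. "
    "Tomorrow we will learn from the data it sends back."
)

nltk_splitter = NLTKTextSplitter(chunk_size=100)
print(nltk_splitter.split_text(speech))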

Conclusion

This was another big article, but now you have all you need to start exploring how to effectively generate chunks for your LLM-based applications.
