The ultimate LangChain series — data loaders

Learn how to use LangChain document loaders

Introduction

This is the third module of our journey to master LangChain, and we'll cover the first step of application development with LangChain: the realm of Data Loaders. We'll explore their role, examine the variety of loaders available within the LangChain framework, and walk you through the steps of incorporating them into your own code.

Check out the first two parts of the series:

What are data loaders in LangChain

In this context, a "loader" is a utility or function that takes data from a specific format or source and transforms it into a format that a language model can use. In this case, the target format is referred to as a "Document".
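To make the target format concrete, here is a minimal sketch of what a Document looks like, with the two fields you will see throughout this article (the import path may vary slightly between LangChain versions):

from langchain.schema import Document

# A Document is simply some text plus metadata about where it came from
doc = Document(
    page_content="Falcon 9 is a two-stage rocket designed and manufactured by SpaceX.",
    metadata={"source": "example.txt", "page": 0},
)

print(doc.page_content)
print(doc.metadata)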

Loaders are essential components in the data preprocessing pipeline for machine learning. The great part is that LangChain does the heavy lifting behind the scenes, saving you a lot of the work of properly formatting your data by hand.

Data sources can be incredibly diverse, from files of various formats, such as CSV, SQL, PDF, and image files, to data obtained from public or proprietary online services and datasets like Wikipedia, Google Drive, or Twitter. Imagine having to design custom code to accommodate each unique data type you encounter — yet, this would merely be the tip of the iceberg.

Kinds of loaders available in LangChain

Loaders in LangChain are grouped into three categories:

  1. Transform loaders: These loaders convert data from specific formats into Document format, essentially text. They can process a wide range of file types, including text, PowerPoint, images, HTML, PDF, etc. Some of the specific loaders in this category include CSV, SQL, Jupyter Notebook, Pandas DataFrame, and more. A key package used by these loaders is the Unstructured Python package.

  2. Public dataset or service loaders: These loaders retrieve and process data from public datasets and services. No special access permissions are required for these loaders. They include loaders for Hacker News, Wikipedia, YouTube transcripts, and more.

  3. Proprietary dataset or service loaders: These loaders work with datasets and services that are not in the public domain. They typically require access tokens or other parameters to access the data. Examples include loaders for Google Drive, AWS S3, Azure Blob Storage, Google Cloud Storage, Reddit, and Twitter.

Check a comprehensive list of data loaders on LangChain's docs.
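For a quick taste of the first category, here is a minimal sketch using the CSV transform loader (the data.csv path is just a placeholder for this example):

from langchain.document_loaders import CSVLoader

# Each row of the CSV becomes its own Document
loader = CSVLoader("./data.csv")
docs = loader.load()
print(docs[0].page_content)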

Use data loaders in LangChain

As we mentioned, this is the first step of developing your language model-based apps; now, let's see how we can use those loaders. If you didn't follow this series from the beginning, make sure to go back to episode 1 to learn how to set up your environment properly.

Note that we installed the base dependencies in the first episode, but you might need to install loader-specific dependencies; no worries, they will be explained as we go.

In the following examples, we'll play with a few different loaders to give you a good idea of how they work.

PDF loader

As the name suggests, this one allows you to load PDF files. There are different options powered by different specialized packages, depending on your needs. Let's cover a few with code examples:

List of PDF loaders on LangChain docs.

General PDF use

If you are working with a general PDF, you can use the PDF loader powered by the PyPDF library. PyPDF is a comprehensive Python library designed for manipulating PDF files. Its capabilities extend to splitting and merging documents, cropping page layouts, and transforming the structure of PDF files. LangChain gives you the option to use a loader based on it, which will be ideal for loading a specific PDF file you want to work with.

Let's finally code a bit and see how to use this loader to load and parse the SpaceX CRS-5 Mission Press Kit! Make sure to activate the environment we built in Chapter 1, then save the PDF file in your project; I created a new directory named pdf_files for it.

Install the PyPDF library:

pip install pypdf

Then create a new file named main.py; the first step is to import the loader we need, and you will see this as a common theme: we can import any loader from langchain.document_loaders. You might be surprised, but the following code is all you need to load and parse a PDF document using LangChain!

from langchain.document_loaders import PyPDFLoader

# Use the PyPDFLoader to load and parse the PDF
loader = PyPDFLoader("./pdf_files/SpaceX_NASA_CRS-5_PressKit.pdf")
pages = loader.load_and_split()

Now we have a list named pages containing our parsed document, incredible! The PyPDFLoader creates a list where each element is a page from the PDF; each element includes two fields:

  • page_content, which holds the actual text content of the page.

  • metadata, which is an object with the source (in this case, the NASA file) and the page number.

Add those print statements to your code to display how many pages we got, plus the first page in the console:

print(len(pages))
print(pages[0])

If you run python3 main.py you should get the following result:

$ python3 main.py
26
page_content='1 \n \n \nSpaceX CRS-5 Mission Press Kit  \n \n \nCONTENTS  \n \n3 Mission Overview  \n7 Mission Timeline  \n9 Graphics – Rendezvous, Grapple and Berthing, Departure and Re -Entry  \n11 International Space Station Overview  \n14 CASIS Payloads  \n15 Falcon 9 Overview  \n18 Dragon Overview  \n20 SpaceX Facilities  \n22 SpaceX Overview  \n24 SpaceX Leadership  \n \n \nSPACEX MEDIA CONTACT  \n \nJohn Taylor  \nDirector of Communications  \n310-363-6703  \nmedia@spacex.com   \n \n \nNASA PUBLIC AFFAIRS  CONTACT S  \n \n \n \n \n  Joshua Buck  \nPublic Affairs Officer  \nHuman Exploration and Operations  \nNASA Headquarters  \n202-358-1100  \n \nStephanie Schierholz  \nPublic Affairs Officer  \nHuman Exploration and Operations  \nNASA Headquarters  \n202-358-1100  Michael Curie  \nNews Chief  \nLaunch Operations  \nNASA Kennedy Space Center  \n321-867-2468  \n \nGeorge Diller  \nPublic Affairs Officer  \nLaunch Operations  \nNASA Kennedy Space Center  \n321-867-2468  Dan Huot  \nPublic Affairs Officer  \nInternational Space Station  \nNASA Johnson Space Center  \n281-483-5111' metadata={'source': './pdf_files/SpaceX_NASA_CRS-5_PressKit.pdf', 'page': 1}

At this point the pages list is ready to be passed into a text splitter 🤯 Wasn't this easy?

But what if we have an entire directory full of PDFs?

Load a PDF directory

No worries, in that case, you can use the PyPDF Directory loader, which has the same principle, but it loads every PDF file from the directory. Let's check it out. Download some more cool PDFs to add to the pdf_files directory; I used the following:

Advisory Circulars are documents that pilots use to learn extra important things!

Dump those files in the same directory, and let's try the directory loader.

from langchain.document_loaders import PyPDFDirectoryLoader

# Use the PyPDFDirectoryLoader to load and parse the PDFs from a directory
loader = PyPDFDirectoryLoader("./pdf_files/")
docs = loader.load_and_split()

print(len(docs))
print(docs[0])

This time you'll see we parsed 61 pages, covering all three documents.
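Since each element keeps its metadata, you can easily check which file a given page came from; a quick sketch:

# Check which PDF the first parsed page belongs to
print(docs[0].metadata["source"])
print(docs[0].metadata["page"])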

More PDF use cases

By now, you get the gist; LangChain has different kinds of PDF loaders available powered by different Python packages. For example, you could use the MathPix loader if you need to work with PDFs with mathematical formulas; MathPix offers an API to recognize mathematical symbols and can be used this way.

Import it using the same syntax and use it like the PyPDF loader:

from langchain.document_loaders import MathpixPDFLoader

loader = MathpixPDFLoader("./pdf_files/my_algebra.pdf")
data = loader.load()
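Keep in mind that the Mathpix loader calls the hosted Mathpix API, so you will need Mathpix credentials configured for it to work; check the LangChain docs for the exact setup.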

The PyMuPDF loader is another example; it works like the PyPDF loader but is faster.
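Assuming you have the PyMuPDF package installed (pip install pymupdf), the usage follows the exact same pattern; here is a minimal sketch reusing the press kit file from before:

from langchain.document_loaders import PyMuPDFLoader

# Works like the PyPDF loader, but uses the faster PyMuPDF backend
loader = PyMuPDFLoader("./pdf_files/SpaceX_NASA_CRS-5_PressKit.pdf")
data = loader.load()
print(data[0].metadata)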

Find a full list of PDF loaders in the LangChain Docs.

YouTube loader

This is one of my favorite use cases; it allows you to retrieve and parse the transcripts of YouTube videos directly from the URL. This loader uses the YouTube API to pull the transcript, thumbnail, and other data. As you can imagine, it's really simple to use with LangChain; let's try it out by parsing the transcript of this nice video about early computing.

Even though you installed the langchain library, you might need to install these extra packages; I recommend installing pytube so we can grab the video metadata as well.

pip install youtube-transcript-api pytube

Then use this code in your main file:

from langchain.document_loaders import YoutubeLoader

# Use add_video_info=True to get the video metadata; requires pytube
loader = YoutubeLoader.from_youtube_url("https://www.youtube.com/watch?v=O5nskjZ_GoI", add_video_info=True)
video = loader.load()
print(video)

Done and dusted! Now you have the transcript and video data in the video variable, ready for the next step. This returns a response similar to the PDF loader:

  • A Document object with the following elements:

    • page_content: the text transcript of the video

    • metadata: this holds information about the video, like source, title, description, view_count, thumbnail_url, publish_date, length, and author. This is all data you can use in your apps!
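For example, here is a quick sketch of pulling a couple of those fields out of the loaded Document (the loader returns a list, so we index into it first):

# Grab a few metadata fields and a slice of the transcript
print(video[0].metadata["title"])
print(video[0].metadata["author"])
print(video[0].page_content[:200])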

Find more about the YouTube loader on the LangChain docs.

As you can see, they all follow the same (simple) principle, but I want to explore one more loader next, so you can have a complete picture.

Sitemap loader

The sitemap loader is very useful if you want to efficiently scrape and index entire websites; this could be a very good use case for documentation, for example.

A sitemap is a file where you provide information about your site's pages, videos, and other files and the relationships between them. Search engines like Google read this file to crawl your site. We can take advantage of this to load and index entire websites very efficiently. Let's use the Chainstack docs sitemap for this example (the blockchain infra provider where I am a developer advocate). I'm using this example because it has all the use cases I can use to show you the important features!

First thing, you might need to install the following libraries, which are used by this loader:

pip install lxml bs4

Then let's import the loader and crawl the sitemap:

from langchain.document_loaders.sitemap import SitemapLoader

loader = SitemapLoader(
    "https://docs.chainstack.com/sitemap.xml",
)

documents = loader.load()
print(len(documents))
print(documents[0])

Same principle as before: it will crawl every page in the sitemap and store it in the documents list. If you run this program, you will crawl the entire Chainstack documentation, which you might want to do, but what if you don't need every page? Wouldn't it be good to be able to filter for the sections we want? Guess what? LangChain gives us that option. Let's say we only want the docs section; we can add a list of the sections or pages to include using the filter_urls parameter in the loader constructor:

from langchain.document_loaders.sitemap import SitemapLoader

loader = SitemapLoader(
    "https://docs.chainstack.com/sitemap.xml",
    filter_urls=["https://docs.chainstack.com/docs/"]
)

documents = loader.load()
print(len(documents))
print(documents[0])

Running this code, you will index a little over 100 pages and notice that the content stored in the documents variable is similar to what we saw earlier. The object in the list will have a page_content field with the text and some metadata.

So by filtering, we were already able to take only the pages we wanted, but if you print out the first page, you will notice the text has a lot of noise in it; specifically, the loader scraped all the menus and navigation as well, which will certainly create problems down the road. How can we fix that? The sitemap loader scrapes the pages using BeautifulSoup4, a popular Python scraping library, and luckily we can pass a custom scraping function to the loader.

I'll skip the process of checking this, but if you inspect one of the Chainstack docs pages, you will see the noise is coming from the <nav> and <header> tags, so let's make a function using BeautifulSoup4 to fix this:

from langchain.document_loaders.sitemap import SitemapLoader
from bs4 import BeautifulSoup

def remove_nav_and_header_elements(content: BeautifulSoup) -> str:
    # Find all 'nav' and 'header' elements in the BeautifulSoup object
    nav_elements = content.find_all('nav')
    header_elements = content.find_all('header')

    # Remove each 'nav' and 'header' element from the BeautifulSoup object
    for element in nav_elements + header_elements:
        element.decompose()

    return str(content.get_text())

loader = SitemapLoader(
    "https://docs.chainstack.com/sitemap.xml",
    filter_urls=["https://docs.chainstack.com/docs/"],
    parsing_function=remove_nav_and_header_elements  
)

documents = loader.load()
print(len(documents))
print(documents[0])

The parsing_function parameter allows us to pass the function we created and tell the loader to use it instead of the default. You will notice how much cleaner the response is, and this is a good way to get only the content relevant to us.

And with this, we saw how to index an entire website from its sitemap. There are many other data loaders available in LangChain, and I suggest you explore the list to find the loader most suitable for your needs.

Find the loader list on the LangChain docs.

You will see that they all follow pretty much the same principles we explored in this article.

Conclusion

Wow, this was a long one, but it gives you a solid foundation to use any data loader you might need from the LangChain collection. The next step will be to learn about text splitters, which come into play right after we load the data.
