When building LLM-powered applications with LlamaIndex, one of the first steps is getting your data into the system. LlamaIndex provides a robust set of tools for document loading and data connectors, making it easy to ingest and process various data formats. In this blog post, we'll dive into these features and explore how they can streamline your data preparation process.
LlamaIndex supports a wide range of document formats out of the box, making it incredibly versatile for different use cases. Let's look at some common document types and how to load them:
Loading a simple text file is straightforward:
```python
from llama_index import SimpleDirectoryReader

documents = SimpleDirectoryReader('path/to/your/directory').load_data()
```
This code snippet will load all text files from the specified directory.
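Conceptually, the reader walks the directory and wraps each file's contents in a document object. Here's a minimal plain-Python sketch of that idea (the `SimpleDoc` class and `load_text_files` helper are illustrative stand-ins, not part of LlamaIndex):

```python
import os
import tempfile
from dataclasses import dataclass

@dataclass
class SimpleDoc:
    # Illustrative stand-in for LlamaIndex's Document class
    text: str
    filename: str

def load_text_files(directory):
    """Read every file in `directory` into a SimpleDoc (hypothetical helper)."""
    docs = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            with open(path, encoding="utf-8") as f:
                docs.append(SimpleDoc(text=f.read(), filename=name))
    return docs

# Demo on a throwaway directory
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "a.txt"), "w", encoding="utf-8") as f:
        f.write("hello")
    docs = load_text_files(d)
    print(len(docs), docs[0].filename)  # 1 a.txt
```

The real `SimpleDirectoryReader` does considerably more (file-type detection, metadata, recursion), but the core loop is the same: one file in, one document out.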
For PDF files, LlamaIndex uses the `pypdf` library under the hood:
```python
from llama_index import download_loader

PDFReader = download_loader("PDFReader")
loader = PDFReader()
documents = loader.load_data(file='path/to/your/file.pdf')
```
You can easily load content from web pages using the `BeautifulSoupWebReader`:
```python
from llama_index import download_loader

BeautifulSoupWebReader = download_loader("BeautifulSoupWebReader")
loader = BeautifulSoupWebReader()
documents = loader.load_data(urls=['https://example.com'])
```
LlamaIndex doesn't stop at local files. It provides connectors to various external data sources, allowing you to integrate diverse datasets into your LLM applications.
To connect to Google Docs, you'll need to set up authentication first. Once that's done, you can use the `GoogleDocsReader`:
```python
from llama_index import download_loader

GoogleDocsReader = download_loader('GoogleDocsReader')
loader = GoogleDocsReader()
documents = loader.load_data(document_ids=['your_document_id'])
```
LlamaIndex also supports loading data from Notion databases:
```python
from llama_index import download_loader

NotionPageReader = download_loader('NotionPageReader')
reader = NotionPageReader(integration_token='your_integration_token')
documents = reader.load_data(database_id='your_database_id')
```
For data sources not natively supported, you can create custom document loaders. Here's a simple example:
```python
from llama_index import Document

class MyCustomLoader:
    def load_data(self):
        # Your custom logic here
        text = "This is a custom document"
        return [Document(text=text)]

loader = MyCustomLoader()
documents = loader.load_data()
```
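A slightly more realistic sketch is a loader that turns each row of a CSV file into its own document. This `CSVLoader` is a hypothetical example, not a built-in LlamaIndex class, and it returns plain dicts here so it runs standalone; in a real application you'd return `Document` objects instead:

```python
import csv
import io

class CSVLoader:
    """Hypothetical custom loader: one document per CSV row.
    Swap the dicts for llama_index Document objects in real use."""

    def __init__(self, file_obj):
        self.file_obj = file_obj

    def load_data(self):
        reader = csv.DictReader(self.file_obj)
        docs = []
        for i, row in enumerate(reader):
            # Flatten the row into readable text for the LLM
            text = ", ".join(f"{k}: {v}" for k, v in row.items())
            docs.append({"text": text, "metadata": {"row": i}})
        return docs

sample = io.StringIO("name,role\nAda,engineer\nGrace,admiral\n")
documents = CSVLoader(sample).load_data()
print(documents[0]["text"])  # name: Ada, role: engineer
```

The key design point is the `load_data()` method: as long as your loader returns a list of documents, it slots into the rest of the LlamaIndex pipeline.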
Once you've loaded your documents, LlamaIndex provides various tools to process and prepare them for use with LLMs:
Large documents often need to be split into smaller chunks:
```python
from llama_index import SimpleDirectoryReader
from llama_index.node_parser import SimpleNodeParser

documents = SimpleDirectoryReader('path/to/your/directory').load_data()

parser = SimpleNodeParser.from_defaults(chunk_size=1024, chunk_overlap=20)
nodes = parser.get_nodes_from_documents(documents)
```
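To see what `chunk_size` and `chunk_overlap` actually control, here is a simplified character-based sketch of the splitting idea. LlamaIndex's real parser works on tokens and tries to respect sentence boundaries, so this `chunk_text` helper is only an approximation of the concept:

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split `text` into chunks of at most `chunk_size` characters,
    where each chunk shares `chunk_overlap` characters with the previous one.
    Illustrative only; not LlamaIndex's actual algorithm."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

The overlap ensures that a sentence cut at a chunk boundary still appears intact in the neighboring chunk, which helps retrieval quality at the cost of some redundancy.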
You can extract metadata from your documents to enhance the context:
```python
from llama_index import SimpleDirectoryReader

reader = SimpleDirectoryReader(
    'path/to/your/directory',
    filename_as_id=True,
    file_metadata=lambda path: {"filename": path},
)
documents = reader.load_data()
```

Note that the `file_metadata` callable receives the file path as a string and should return a dict of metadata to attach to each document.
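A metadata callable like the one above is just a function from a file path to a dict, so you can derive whatever fields are useful for your application. This `make_metadata` helper is an illustrative example:

```python
from pathlib import Path

def make_metadata(path_str):
    """Illustrative metadata callable: derive fields from a file path."""
    p = Path(path_str)
    return {
        "filename": p.name,        # e.g. 'q3.pdf'
        "extension": p.suffix,     # e.g. '.pdf'
        "parent": p.parent.name,   # immediate containing folder
    }

print(make_metadata("reports/2024/q3.pdf"))
# {'filename': 'q3.pdf', 'extension': '.pdf', 'parent': '2024'}
```

Metadata like this flows through to the nodes built from each document, giving the LLM (and your retrieval filters) extra context about where a passage came from.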
Document loading and data connectors in LlamaIndex provide a powerful foundation for ingesting and preparing data for your LLM applications. By leveraging these tools, you can easily work with various data formats and sources, setting the stage for building sophisticated AI-powered systems.
Remember, the key to effective data ingestion is understanding your data sources and choosing the right tools for the job. Experiment with different loaders and connectors to find the best fit for your project's needs.