This post was generated by an LLM
In this tutorial, we’ll walk through a practical, step-by-step method to build a Retrieval-Augmented Generation (RAG) system using Ollama for the Large Language Model (LLM) and custom prompts to guide the model’s behavior. This approach emphasizes simplicity, privacy, and customization for local deployment.
🧰 Prerequisites
Before starting, ensure you have the following installed:
- Python 3.11+
python3 --version
# Expected output: Python 3.11.7 or higher
- Ollama
Download and run Ollama from ollama.com. This tool allows you to run LLMs locally (e.g., LLaMA 3); see the pull command just after this list.
- ChromaDB
A vector database for storing and retrieving document embeddings. Install via pip:
pip install chromadb
- LangChain and LangChain Community Modules
For the chain and splitter helpers plus the Ollama and ChromaDB integrations:
pip install langchain langchain-community
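With Ollama installed, pull the model used throughout this tutorial (llama3 here; substitute whichever model you prefer):
ollama pull llama3
# Optional: confirm the model is available locally
ollama list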
🛠 Step 1: Set Up Your Environment
- Create a Virtual Environment
python3 -m venv venv
source venv/bin/activate # On macOS/Linux
# or
venv\Scripts\activate # On Windows
- Install Dependencies
pip install -r requirements.txt
# Example `requirements.txt`:
# langchain
# langchain-community
# chromadb
# ollama
- Load Environment Variables
Create a .env file:
OLLAMA_MODEL=llama3
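To read this value in Python you can use python-dotenv (an extra dependency, not in the example requirements above); a minimal sketch:
# Assumes `pip install python-dotenv` (not listed in the example requirements)
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the current directory
OLLAMA_MODEL = os.getenv("OLLAMA_MODEL", "llama3")  # falls back to "llama3" if unset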
📚 Step 2: Load and Prepare Your Data
Use a document loader to read your source files. TextLoader handles plain text; for PDFs you could swap in a loader such as PyPDFLoader. For this example, we’ll use a simple text file.
from langchain_community.document_loaders import TextLoader
# Load a text file
loader = TextLoader("data/sample.txt")
documents = loader.load()
Split Documents
Use a TextSplitter to break documents into smaller chunks for embedding:
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
split_documents = splitter.split_documents(documents)
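Before embedding, it’s worth a quick sanity check on what the splitter produced:
# Quick check: how many chunks were created, and what does one look like?
print(f"{len(split_documents)} chunks created")
print(split_documents[0].page_content[:200])  # preview the first 200 characters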
🧠 Step 3: Embed and Store in ChromaDB
Use OllamaEmbeddings to convert text into vector embeddings and store them in ChromaDB.
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
# Initialize embeddings
embeddings = OllamaEmbeddings(model="llama3")
# Create a ChromaDB instance
db = Chroma.from_documents(split_documents, embeddings, persist_directory="chroma_db")
📌 Note: ChromaDB will persist data in the chroma_db directory. Add this to your .gitignore to avoid version control conflicts.
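Because the store is persisted, you don’t have to re-embed on every run. In a later session you can reopen the existing directory instead (a minimal sketch, assuming the same embedding model is used):
# Reopen the persisted store instead of re-embedding (same embedding model required)
db = Chroma(persist_directory="chroma_db", embedding_function=embeddings)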
🧩 Step 4: Define Custom Prompts
Custom prompts guide the LLM’s behavior. Here’s an example of a prompt template for generating alternative questions and answering based on context:
from langchain.prompts import PromptTemplate
def get_prompt():
    # Prompt for generating alternative questions
    QUERY_PROMPT = PromptTemplate(
        input_variables=["question"],
        template="""You are an AI assistant. Generate 5 different versions of the given question to retrieve relevant documents:
{question}""",
    )
    # Prompt for answering based on context
    ANSWER_PROMPT = PromptTemplate(
        input_variables=["context", "question"],
        template="""Answer the question based ONLY on the following context:
{context}
Question: {question}""",
    )
    return QUERY_PROMPT, ANSWER_PROMPT
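Before wiring these into a chain, you can render them with sample values to confirm the placeholders line up; a quick optional check:
# Render the templates with sample values to confirm the placeholders line up
query_prompt, answer_prompt = get_prompt()
print(query_prompt.format(question="What is the document about?"))
print(answer_prompt.format(context="(retrieved text goes here)", question="What is the document about?"))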
🔄 Step 5: Build the RAG Pipeline
Combine retrieval and generation using create_retrieval_chain and create_stuff_documents_chain.
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_community.llms import Ollama
# Initialize the LLM
llm = Ollama(model="llama3")
# Get prompts
query_prompt, answer_prompt = get_prompt()
# Create chains
retriever = db.as_retriever()
combine_chain = create_stuff_documents_chain(llm, answer_prompt)
retrieval_chain = create_retrieval_chain(retriever, combine_chain)
# Query the RAG system
# The answer prompt expects a "question" variable, while create_retrieval_chain
# retrieves using the "input" key, so the query is passed under both keys.
question = "What is the main idea of the document?"
response = retrieval_chain.invoke({"input": question, "question": question})
print(response["answer"])
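The response is a dict that also carries the retrieved chunks under the context key, which is handy for checking what the model actually saw:
# Optional: inspect which chunks were retrieved for this answer
for i, doc in enumerate(response["context"], start=1):
    print(f"--- Chunk {i} ---")
    print(doc.page_content[:200])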
📌 Step 6: Enhance with Multi-Query Retrieval (Optional)
To improve retrieval accuracy, use MultiQueryRetriever to generate multiple queries from the input:
from langchain.retrievers import MultiQueryRetriever
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Use the query prompt to generate multiple queries
retriever = MultiQueryRetriever.from_llm(
    retriever=db.as_retriever(),
    llm=llm,
    prompt=query_prompt
)

# Join the retrieved documents into a single context string for the prompt
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Combine with the answer chain
chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | answer_prompt
    | llm
    | StrOutputParser()
)

response = chain.invoke("What is the main idea of the document?")
print(response)
🧪 Step 7: Test and Iterate
- Test with Different Prompts
Modify the get_prompt() function to refine the LLM’s behavior (e.g., for summarization, question-answering, or code generation).
- Add More Documents
Expand your dataset by adding more text/PDF files and re-running the embedding process.
- Monitor Performance
Use logging or a simple web interface (e.g., Streamlit) to test the RAG system interactively; a minimal sketch follows below.
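As a rough illustration, a tiny Streamlit front end might look like this (assuming streamlit is installed and that the retrieval_chain from Step 5 lives in a module of your own, here hypothetically named rag_pipeline.py):
# app.py -- minimal Streamlit front end (sketch; assumes `pip install streamlit`)
import streamlit as st

from rag_pipeline import retrieval_chain  # hypothetical module wrapping Steps 2-5

st.title("Local RAG with Ollama")
question = st.text_input("Ask a question about your documents")

if question:
    result = retrieval_chain.invoke({"input": question, "question": question})
    st.write(result["answer"])
Run it with streamlit run app.py.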
🛡 Key Considerations
- Privacy: All data remains on your local machine, ensuring no sensitive information is exposed.
- Customization: Tailor prompts to suit your use case (e.g., legal, medical, or technical domains).
- Scalability: For larger datasets, consider using a more powerful vector database (e.g., Pinecone or Weaviate).
📦 Conclusion
By following this pragmatic approach, you’ve built a local RAG system that leverages Ollama for LLM inference, ChromaDB for vector storage, and custom prompts to control the model’s output. This setup is ideal for privacy-sensitive applications, experimentation, or integration into larger workflows.
For further exploration, check out the GitHub repository for the full code and additional features.
This post has been uploaded to share ideas and explanations for questions I might have, relating to no specific topic in particular. It may not be factually accurate and I may not endorse or agree with the topic or explanation – please contact me if you would like any content taken down and I will comply with all reasonable requests made in good faith.
– Dan