This post was generated by an LLM
In this tutorial, we’ll walk through a practical, step-by-step method to build a Retrieval-Augmented Generation (RAG) system using Ollama for the Large Language Model (LLM) and custom prompts to guide the model’s behavior. This approach emphasizes simplicity, privacy, and customization for local deployment.
🧰 Prerequisites
Before starting, ensure you have the following installed:
- Python 3.11+
python3 --version
# Expected output: Python 3.11.7 or higher
- Ollama
Download and run Ollama from ollama.com. This tool allows you to run LLMs locally (e.g., LLaMA 3); see the pull command just after this list.
- ChromaDB
A vector database for storing and retrieving document embeddings. Install via pip:
pip install chromadb
- LangChain and LangChain Community Modules
For the chain and splitter helpers plus the Ollama and ChromaDB integrations:
pip install langchain langchain-community
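With Ollama installed, pull the model used throughout this tutorial (llama3 here; substitute whichever model you prefer):
ollama pull llama3
# Optional: confirm the model is available locally
ollama list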
🛠 Step 1: Set Up Your Environment
- Create a Virtual Environment
python3 -m venv venv
source venv/bin/activate # On macOS/Linux
# or
venv\Scripts\activate # On Windows
- Install Dependencies
pip install -r requirements.txt
# Example `requirements.txt`:
# langchain
# langchain-community
# chromadb
# ollama
- Load Environment Variables
Create a .env file:
OLLAMA_MODEL=llama3
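To read this value in Python you can use python-dotenv (an extra dependency, not in the example requirements above); a minimal sketch:
# Assumes `pip install python-dotenv` (not listed in the example requirements)
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the current directory
OLLAMA_MODEL = os.getenv("OLLAMA_MODEL", "llama3")  # falls back to "llama3" if unset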
📚 Step 2: Load and Prepare Your Data
Use a document loader to read your source files. TextLoader handles plain text; for PDFs you could swap in a loader such as PyPDFLoader. For this example, we’ll use a simple text file.
from langchain_community.document_loaders import TextLoader
# Load a text file
loader = TextLoader("data/sample.txt")
documents = loader.load()
Split Documents
Use a TextSplitter to break documents into smaller chunks for embedding:
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
split_documents = splitter.split_documents(documents)
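Before embedding, it’s worth a quick sanity check on what the splitter produced:
# Quick check: how many chunks were created, and what does one look like?
print(f"{len(split_documents)} chunks created")
print(split_documents[0].page_content[:200])  # preview the first 200 characters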
🧠 Step 3: Embed and Store in ChromaDB
Use OllamaEmbeddings to convert text into vector embeddings and store them in ChromaDB.
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
# Initialize embeddings
embeddings = OllamaEmbeddings(model="llama3")
# Create a ChromaDB instance
db = Chroma.from_documents(split_documents, embeddings, persist_directory="chroma_db")
📌 Note: ChromaDB will persist data in the chroma_db directory. Add this to your .gitignore to avoid version control conflicts.
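Because the store is persisted, you don’t have to re-embed on every run. In a later session you can reopen the existing directory instead (a minimal sketch, assuming the same embedding model is used):
# Reopen the persisted store instead of re-embedding (same embedding model required)
db = Chroma(persist_directory="chroma_db", embedding_function=embeddings)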
🧩 Step 4: Define Custom Prompts
Custom prompts guide the LLM’s behavior. Here’s an example of a prompt template for generating alternative questions and answering based on context:
from langchain.prompts import PromptTemplate
def get_prompt():
    # Prompt for generating alternative questions
    QUERY_PROMPT = PromptTemplate(
        input_variables=["question"],
        template="""You are an AI assistant. Generate 5 different versions of the given question to retrieve relevant documents:
{question}""",
    )
    # Prompt for answering based on context
    ANSWER_PROMPT = PromptTemplate(
        input_variables=["context", "question"],
        template="""Answer the question based ONLY on the following context:
{context}
Question: {question}""",
    )
    return QUERY_PROMPT, ANSWER_PROMPT
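Before wiring these into a chain, you can render them with sample values to confirm the placeholders line up; a quick optional check:
# Render the templates with sample values to confirm the placeholders line up
query_prompt, answer_prompt = get_prompt()
print(query_prompt.format(question="What is the document about?"))
print(answer_prompt.format(context="(retrieved text goes here)", question="What is the document about?"))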
🔄 Step 5: Build the RAG Pipeline
Combine retrieval and generation using create_retrieval_chain and create_stuff_documents_chain.
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_community.llms import Ollama
# Initialize the LLM
llm = Ollama(model="llama3")
# Get prompts
query_prompt, answer_prompt = get_prompt()
# Create chains
retriever = db.as_retriever()
combine_chain = create_stuff_documents_chain(llm, answer_prompt)
retrieval_chain = create_retrieval_chain(retriever, combine_chain)
# Query the RAG system
# The answer prompt expects a "question" variable, while create_retrieval_chain
# retrieves using the "input" key, so the query is passed under both keys.
question = "What is the main idea of the document?"
response = retrieval_chain.invoke({"input": question, "question": question})
print(response["answer"])
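The response is a dict that also carries the retrieved chunks under the context key, which is handy for checking what the model actually saw:
# Optional: inspect which chunks were retrieved for this answer
for i, doc in enumerate(response["context"], start=1):
    print(f"--- Chunk {i} ---")
    print(doc.page_content[:200])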
📌 Step 6: Enhance with Multi-Query Retrieval (Optional)
To improve retrieval accuracy, use MultiQueryRetriever to generate multiple queries from the input:
from langchain.retrievers import MultiQueryRetriever
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Use the query prompt to generate multiple queries
retriever = MultiQueryRetriever.from_llm(
    retriever=db.as_retriever(),
    llm=llm,
    prompt=query_prompt
)

# Join the retrieved documents into a single context string for the prompt
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Combine with the answer chain
chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | answer_prompt
    | llm
    | StrOutputParser()
)

response = chain.invoke("What is the main idea of the document?")
print(response)
🧪 Step 7: Test and Iterate
- Test with Different Prompts
Modify the get_prompt() function to refine the LLM’s behavior (e.g., for summarization, question-answering, or code generation).
- Add More Documents
Expand your dataset by adding more text/PDF files and re-running the embedding process.
- Monitor Performance
Use logging or a simple web interface (e.g., Streamlit) to test the RAG system interactively; a minimal sketch follows below.
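As a rough illustration, a tiny Streamlit front end might look like this (assuming streamlit is installed and that the retrieval_chain from Step 5 lives in a module of your own, here hypothetically named rag_pipeline.py):
# app.py -- minimal Streamlit front end (sketch; assumes `pip install streamlit`)
import streamlit as st

from rag_pipeline import retrieval_chain  # hypothetical module wrapping Steps 2-5

st.title("Local RAG with Ollama")
question = st.text_input("Ask a question about your documents")

if question:
    result = retrieval_chain.invoke({"input": question, "question": question})
    st.write(result["answer"])
Run it with streamlit run app.py.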
🛡 Key Considerations
- Privacy: All data remains on your local machine, ensuring no sensitive information is exposed.
- Customization: Tailor prompts to suit your use case (e.g., legal, medical, or technical domains).
- Scalability: For larger datasets, consider using a more powerful vector database (e.g., Pinecone or Weaviate).
📦 Conclusion
By following this pragmatic approach, you’ve built a local RAG system that leverages Ollama for LLM inference, ChromaDB for vector storage, and custom prompts to control the model’s output. This setup is ideal for privacy-sensitive applications, experimentation, or integration into larger workflows.
For further exploration, check out the GitHub repository for the full code and additional features.
This post has been uploaded to share ideas and explanations for questions I might have, relating to no specific topic in particular. It may not be factually accurate and I may not endorse or agree with the topic or explanation – please contact me if you would like any content taken down and I will comply with all reasonable requests made in good faith.
– Dan