The biggest barrier to enterprise AI adoption is data privacy. Sending sensitive company documents to a public cloud LLM is often a non-starter. This is where local RAG (Retrieval-Augmented Generation) comes in.
In this guide, we’ll build a fully local RAG pipeline using Ollama, Microsoft’s Phi-3 (a tiny but mighty model), and a local vector database.
The RAG Architecture
RAG works by “stuffing” relevant context into the model’s prompt before it generates an answer. This allows the model to “know” about your private data without being re-trained.
```mermaid
graph TD
    User((User Query)) --> Embed[Embedding Model]
    Embed --> Search[Vector Similarity Search]
    DB[(Local Vector DB)] --> Search
    Search -- "Relevant Context" --> Prompt[Enhanced Prompt]
    Prompt --> LLM[Local LLM - Phi-3]
    LLM --> Response[Final Answer]
```
The Stack
- Ollama: The engine that runs LLMs locally.
- Phi-3 Mini: A 3.8B parameter model that outperforms models twice its size.
- ChromaDB or FAISS: Local storage for your vector embeddings.
- LangChain or LlamaIndex: The orchestration layer.
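If you follow along in Python, the snippets in this guide assume two client libraries; here is a minimal setup sketch (the package names are the standard PyPI ones, but pin versions for your own environment):

```python
# Assumed dependencies for the Python examples below:
#   pip install ollama chromadb
import ollama    # client for the local Ollama server (default: http://localhost:11434)
import chromadb  # embedded vector database stored on local disk
```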
Step-by-Step Implementation
1. Setting up Ollama
First, ensure Ollama is running and pull the Phi-3 model:
```bash
ollama serve
ollama pull phi3
```
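As a quick sanity check from Python, you can ask the freshly pulled model for a one-line reply (a minimal sketch assuming the server started by `ollama serve` is reachable on its default port):

```python
import ollama

# Smoke test: if this prints text, the Ollama server and phi3 are working.
reply = ollama.generate(model="phi3", prompt="Reply with the single word: ready")
print(reply["response"])
```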
2. Generating Embeddings
To search your documents, you first need to turn them into vectors (lists of numbers). We use a local embedding model such as mxbai-embed-large (pull it first with `ollama pull mxbai-embed-large`) so that your data never leaves your machine.
```python
import ollama

def get_embedding(text):
    # Embed the text locally via Ollama's embedding endpoint.
    response = ollama.embeddings(model="mxbai-embed-large", prompt=text)
    return response["embedding"]
```
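Before retrieval can happen, your documents need to be chunked, embedded, and stored. Here is a minimal indexing sketch using ChromaDB (the `./rag_db` path, the `docs` collection name, and the `chunks` list are placeholders for your own data and chunking strategy):

```python
import chromadb

# Persist the index on disk so it survives restarts.
client = chromadb.PersistentClient(path="./rag_db")
collection = client.get_or_create_collection(name="docs")

# `chunks` is assumed to be a list of text snippets (e.g. paragraphs or
# ~500-token slices) produced by your own document loader.
for i, chunk in enumerate(chunks):
    collection.add(
        ids=[f"chunk-{i}"],
        embeddings=[get_embedding(chunk)],
        documents=[chunk],
    )
```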
3. The Retrieval Loop
When a user asks a question, we find the most relevant chunks in our vector database and pass them to Phi-3.
```python
# Simplified RAG loop: `vector_db`, `query_embedding`, and `user_query`
# are placeholders for your vector store, the embedded question, and the
# raw question text.
context = vector_db.query(query_embedding, n_results=3)

prompt = f"""
Use the following context to answer the question.
Context: {context}
Question: {user_query}
Answer:
"""

response = ollama.generate(model="phi3", prompt=prompt)
print(response["response"])
```
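In the snippet above, `vector_db.query(...)` stands in for whatever store you chose. With the ChromaDB collection from the indexing sketch, the retrieval half would look like this (field names follow ChromaDB's query response, which returns parallel lists per query):

```python
# Embed the question with the same model used for the documents, then pull
# the three nearest chunks from the local ChromaDB collection.
query_embedding = get_embedding(user_query)
results = collection.query(query_embeddings=[query_embedding], n_results=3)

# Join the matched chunks into a single context string for the prompt.
context = "\n\n".join(results["documents"][0])
```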
Performance Optimization
Running AI locally requires managing hardware resources.
- Quantization: By default, Ollama uses `q4_K_M` quantization, which reduces memory usage by roughly 70% with minimal loss in accuracy.
- GPU Acceleration: If you have an NVIDIA GPU (CUDA) or Apple Silicon (Metal), Ollama will automatically offload the heavy lifting to the GPU (see the sketch below for per-request tuning).
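Both knobs can also be influenced per request through the `options` dictionary of the Ollama client; the values below are illustrative, not tuned recommendations:

```python
import ollama

# num_ctx caps the context window (less RAM), num_gpu sets how many model
# layers Ollama offloads to the GPU, and temperature controls sampling.
response = ollama.generate(
    model="phi3",
    prompt="Summarize the retrieved context in one sentence.",
    options={"num_ctx": 2048, "num_gpu": 32, "temperature": 0.2},
)
print(response["response"])
```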
Real-World Use Cases
- Codebase Assistant: Index your entire Git repo and ask “Where is the auth logic located?”
- Private Wiki: A chatbot that knows all your internal documentation and Slack history.
- Legal/Medical Analysis: Processing sensitive records without cloud exposure.
Why Phi-3?
Microsoft’s Phi-3 is revolutionary because it was trained on “textbook-quality” data. It’s small enough to run on a standard laptop (even a 16GB Mac or a mid-range PC) while being smart enough to handle complex reasoning.
Conclusion
Building local RAG pipelines is no longer a futuristic dream. With tools like Ollama and compact models like Phi-3, any developer can build a private, intelligent system in a single afternoon.
Love microservices? See how this fits into a larger system in my Microservices Anti-Patterns post.