Practical ApplicationsOctober 24, 202516 min readshipped

DGX Lab: Building a Complete RAG Infrastructure - From Ollama to Qdrant to AnythingLLM - Day 3

DGX Lab: Building a Complete RAG Infrastructure - Day 3

Lab Session Info

Date: October 24, 2025 DGX System: NVIDIA DGX Workstation (ARM64) Session Duration: 45 minutes Primary Focus: RAG Infrastructure & System Architecture

I just deployed Qdrant and AnythingLLM to my DGX workstation. This wasn't just about adding two more services. It was about completing a vision: a self-hosted ML infrastructure that can handle everything from medical AI fine-tuning to production RAG workflows without sending a single byte to the cloud.

The system now runs 8 services. Each one serves a purpose. Together, they create something powerful.

The Strategic Vision: Why Build This?

When I started fine-tuning medical models (like the work with catastrophic forgetting I wrote about yesterday), I realized I needed more than just inference. I needed:

A way to ground models in authoritative medical literature (RAG)
Vector storage for embeddings (thousands of medical documents)
Multiple interfaces for different use cases (chat vs document Q&A)
Intelligent routing between models based on task
Everything local (medical data can't leave the infrastructure)

This isn't about collecting tools. It's about building a system where fine-tuned models can leverage curated medical knowledge, where document Q&A runs entirely on-premise, and where I can experiment with cutting-edge AI without worrying about API costs or data privacy.

The 8-Service Architecture

Here's what's running on a single DGX workstation:

Inference Layer

Ollama (port 11434) - Primary LLM server, GPU-accelerated
llama.cpp (port 8080) - Alternative inference engine, lower memory

Gateway Layer

LiteLLM (port 4000) - OpenAI-compatible API gateway
Arch-Router (port 8082) - Intelligent request routing

Interface Layer

Open WebUI (port 3000) - ChatGPT-like interface for general chat
AnythingLLM (port 3001) - Document RAG and workspace management

Storage Layer

Qdrant (port 6333/6334) - Vector database for embeddings

Management Layer

DGX Dashboard (port 9000) - Centralized monitoring and control

Architecture Philosophy

Each layer is independent but integrated. Services can be swapped, scaled, or replaced without breaking the system. This modularity is critical for experimentation.

Why These Components Matter

Qdrant: The Memory System

Vector databases store embeddings (numerical representations of text) that enable semantic search. When you ask "What are the side effects of drug X?", Qdrant finds the most relevant documents by vector similarity, not keyword matching.

Why Qdrant over alternatives:

Native ARM64 support (DGX is ARM-based)
Fast (written in Rust)
Simple REST API
Can scale from laptop to cluster
Open source

For medical AI, this means I can embed thousands of research papers, clinical guidelines, and drug information sheets. When the fine-tuned model generates an answer, it's grounded in actual medical literature.

AnythingLLM: The RAG Platform

Think of it as "ChatGPT for your documents" but entirely self-hosted. Upload PDFs, create workspaces, ask questions, get cited answers.

What makes it powerful:

Workspace isolation (separate medical docs from general knowledge)
Multi-user support (teams can collaborate)
Agent capabilities (can call external tools)
Works with local models (no API keys, no cloud)
Built-in embedding generation

Key integration: Pre-configured to use Ollama running on the host via host.docker.internal networking. Docker container talks to local LLMs without exposing services to the network.

Open WebUI: The General Interface

Sometimes you just want to chat with a model. No documents, no RAG, just conversation.

Use cases:

Quick prototyping ("How should I structure this experiment?")
Code generation
General questions
Model comparison (switch between Gemma, Llama, etc.)

Why run both AnythingLLM and Open WebUI:

Different tools for different jobs
Open WebUI: Lightweight, fast, simple chat
AnythingLLM: Heavy lifting, document analysis, workspace management
They complement each other (ports 3000 vs 3001, side by side)

Ollama: The Inference Engine

This is where the GPU earns its keep. Ollama manages model loading, memory allocation, and inference. With ARM64 optimization, it's fast.

Current models:

Gemma 3 2B (fast, efficient)
Gemma 2 9B (balanced)
Medical fine-tuned variants (custom models)

The fine-tuning work ties directly into this. Train a model on medical QA, deploy it via Ollama, access it through AnythingLLM for document-grounded answers.

The Integration: How It All Works Together

Here's a real workflow:

Workflow 1: Medical Document Q&A

Upload 50 medical research papers to AnythingLLM
AnythingLLM generates embeddings using Ollama's embedding model
Embeddings stored in Qdrant vector database
User asks: "What are the latest treatments for condition X?"
AnythingLLM queries Qdrant for relevant document chunks
Sends context + query to Ollama (fine-tuned medical model)
Model generates grounded answer with citations
User sees response in AnythingLLM interface

Everything happens locally. No data leaves the DGX.

Workflow 2: Model Fine-Tuning + Deployment

Fine-tune Gemma 3 on medical QA dataset (5,000+ examples)
Export fine-tuned model to Ollama format
Load into Ollama: ollama create gemma3-medical -f Modelfile
Configure AnythingLLM to use new model
Test against medical literature in RAG workflow
Iterate based on results

The catastrophic forgetting lesson? It directly applies here. With 5,000+ training examples and proper LoRA configuration, the fine-tuned model can answer medical questions while still being grounded in up-to-date literature via RAG.

Workflow 3: Multi-Model Routing

User asks question via Open WebUI
Arch-Router analyzes query complexity
Simple question → Gemma 2B (fast, cheap)
Complex reasoning → Gemma 9B (slower, better)
Medical question → Fine-tuned medical model + RAG
LiteLLM provides OpenAI-compatible API
External tools can integrate seamlessly

The Technical Details

Docker Networking Challenge

AnythingLLM runs in a container. Ollama runs on the host. How do they communicate?

Solution: host.docker.internal

# docker-compose.yml for AnythingLLM
extra_hosts:
  - "host.docker.internal:host-gateway"
environment:
  - OLLAMA_BASE_PATH=http://host.docker.internal:11434

The container can now reach the host's Ollama service without exposing it to the entire network. Security and simplicity.

Resource Management

8 services on one machine. How do they share resources?

CPU/Memory:

Each service has reasonable defaults
Docker containers have memory limits
systemd manages restart policies
Dashboard monitors resource usage

GPU:

Ollama and llama.cpp share GPU via queue
One inference at a time (single GPU)
Context switching handled automatically
No manual GPU management needed

Storage:

Qdrant: ./data/ volume (persistent)
AnythingLLM: ./storage/ volume (persistent)
Docker images: ~10GB total
Models: 5-20GB depending on size

Auto-Start Configuration

Everything survives reboots:

systemd services:

# Check status
systemctl --user status ollama
systemctl status litellm

Docker containers:

restart: unless-stopped  # Auto-restart on boot

Verification:

# After reboot, all 8 services should be healthy
curl localhost:9000/api/services | jq

Why This Architecture Matters

For Medical AI

The fine-tuning work is just the beginning. With RAG infrastructure:

Models can cite sources (critical for medical)
Knowledge stays current (update docs, not model)
Reduces hallucination (grounded in literature)
Enables audit trails (which sources influenced answer)

For Privacy

All inference, all storage, all processing happens on-premise:

HIPAA-compliant deployment possible
No vendor lock-in
No API rate limits
No cloud costs
Complete control over data

For Experimentation

With 8 integrated services:

Test different embedding models (which works best for medical text?)
Compare vector databases (Qdrant vs LanceDB vs Chroma)
Benchmark RAG performance (retrieval accuracy, latency)
Try new models instantly (just download with Ollama)

The Numbers

Deployment time: 45 minutes for both Qdrant and AnythingLLM Services running: 8/8 (100% uptime since deployment) Total resource usage:

CPU: ~15% average (spikes to 80% during inference)
RAM: 8GB base + model size (4-8GB per loaded model)
GPU: On-demand (only during active inference)
Disk: 25GB (services + models + data)

Performance:

Qdrant vector search: <100ms for 10k vectors
AnythingLLM document upload: ~2 seconds per MB
Ollama inference: 30-80 tokens/second (model dependent)
End-to-end RAG query: 2-5 seconds

Challenges Solved

Challenge 1: Service Discovery

Problem: How do services find each other?

Solution: Fixed ports + host networking

Each service has a well-known port
Dashboard knows all endpoints
host.docker.internal for container-to-host

Challenge 2: Health Monitoring

Problem: How to know if everything's working?

Solution: DGX Dashboard with health checks

Polls each service every 30 seconds
Multiple endpoints tried (/health, /healthz, /ping)
Web UI shows real-time status

Challenge 3: Self-Monitoring Confusion

Problem: Dashboard was monitoring itself (showing "down" when down)

Solution: Removed self-check

Cleaner UI (7 services, not 8)
More logical (if dashboard is down, you can't see it anyway)
Users reported confusion resolved

What This Enables

Immediate Use Cases

Medical Literature Q&A - Upload guidelines, query with natural language
Research Paper Analysis - Ask questions across hundreds of papers
Fine-Tuned Model Testing - Deploy and evaluate custom models
Multi-User AI Access - Team can share infrastructure
Experiment Tracking - Document workflows with integrated tools

Future Possibilities

Multi-Model Agents - Chain multiple fine-tuned specialists
Continuous Learning - Retrain models as new data arrives
A/B Testing - Compare model versions on same documents
Production Deployment - Same stack scales to production
Teaching Platform - Show others how to build this

Lessons Learned

1. Start With the End in Mind

I didn't deploy random tools. Each service serves the vision:

Fine-tune models → Need inference (Ollama)
Ground in literature → Need RAG (AnythingLLM)
RAG needs embeddings → Need vectors (Qdrant)
Users need interfaces → Need UIs (Open WebUI + AnythingLLM)
System needs management → Need monitoring (Dashboard)

2. Docker Simplifies Deployment

Both Qdrant and AnythingLLM deployed in minutes with docker-compose. Benefits:

Isolated environments
Easy updates (pull new image, restart)
Portable (same compose file works anywhere)
Persistent storage (volumes survive restarts)

3. Documentation During Deployment

I created README files while deploying, not after. Result:

All configuration decisions captured
Setup reproducible
Integration patterns documented
Future troubleshooting easier

4. Integration > Features

AnythingLLM has dozens of features I don't use yet. What matters:

Works with local Ollama (no cloud dependency)
Stores vectors in Qdrant (or LanceDB)
Multi-user capable (team can collaborate)
Workspace isolation (separate concerns)

Pick tools that integrate well, not tools with the most checkboxes.

5. ARM64 Considerations

The DGX runs ARM64, not x86. Important:

Check architecture support before deploying
Most modern tools support ARM64 (Docker images tagged linux/arm64)
Rust-based tools (like Qdrant) compile for ARM easily
Ollama has native ARM64 builds

Next Steps

Week 1: Validation

Upload test medical documents to AnythingLLM
Verify end-to-end RAG workflow
Benchmark query latency and accuracy
Test multi-user access

Week 2: Integration

Deploy fine-tuned medical model to Ollama
Configure AnythingLLM to use custom model
Compare baseline vs fine-tuned performance
Measure citation accuracy

Week 3: Optimization

Switch AnythingLLM from LanceDB to Qdrant
Benchmark vector search performance
Optimize embedding generation
Test concurrent user load

Week 4: Production

Enable authentication (multi-user)
Set up backup strategy
Create architecture diagram
Document troubleshooting runbook
Plan Phase 2 enhancements

The Bigger Picture

This isn't just about running 8 services. It's about building the foundation for serious medical AI work:

Current capabilities:

Fine-tune models on medical QA datasets
Ground answers in authoritative literature
Test models against real-world documents
Deploy without cloud dependencies

Future capabilities:

Multi-model agents (routing by specialty)
Continuous learning (retrain on new literature)
Federated deployment (multiple DGX boxes)
Production-grade medical AI applications

Three days ago, this DGX was a bare system. Today, it's a complete ML infrastructure capable of supporting medical AI research and deployment. The catastrophic forgetting lessons? They'll directly apply to the next round of fine-tuning, now backed by RAG to prevent knowledge loss.

Try It Yourself

Quick Start (Docker required):

# Deploy Qdrant
mkdir -p (local path) && cd (local path)
docker run -d -p 6333:6333 -v $PWD/data:/qdrant/storage qdrant/qdrant

# Deploy AnythingLLM (with Ollama)
mkdir -p (local path) && cd (local path)
docker run -d -p 3001:3001 \
  --cap-add SYS_ADMIN \
  -v $PWD/storage:/app/server/storage \
  -e OLLAMA_BASE_PATH=http://host.docker.internal:11434 \
  --add-host host.docker.internal:host-gateway \
  mintplexlabs/anythingllm

# Access AnythingLLM
open http://localhost:3001

Verify everything:

curl http://localhost:6333/healthz  # Qdrant
curl http://localhost:3001           # AnythingLLM

Upload documents, create a workspace, ask questions. Welcome to self-hosted RAG.

About the Author: Justin Johnson builds AI systems and writes about practical AI development.

justinhjohnson.com | Twitter | LinkedIn | Run Data Run | Subscribe

DGX Lab: Building a Complete RAG Infrastructure - Day 3

The Strategic Vision: Why Build This?

The 8-Service Architecture

Inference Layer

Gateway Layer

Interface Layer

Storage Layer

Management Layer

Why These Components Matter

Qdrant: The Memory System

AnythingLLM: The RAG Platform

Open WebUI: The General Interface

Ollama: The Inference Engine

The Integration: How It All Works Together

Workflow 1: Medical Document Q&A

Workflow 2: Model Fine-Tuning + Deployment

Workflow 3: Multi-Model Routing

The Technical Details

Docker Networking Challenge

Resource Management

Auto-Start Configuration

Why This Architecture Matters

For Medical AI

For Privacy

For Experimentation

The Numbers

Challenges Solved

Challenge 1: Service Discovery

Challenge 2: Health Monitoring

Challenge 3: Self-Monitoring Confusion

What This Enables

Immediate Use Cases

Future Possibilities

Lessons Learned

1. Start With the End in Mind

2. Docker Simplifies Deployment

3. Documentation During Deployment

4. Integration > Features

5. ARM64 Considerations

Next Steps

Week 1: Validation

Week 2: Integration

Week 3: Optimization

Week 4: Production

The Bigger Picture

Related Articles

Related Articles

Get the next experiment