AI Systems & Architecture | March 21, 2024 | 3 min read | shipped

Building a Markdown RAG System: A Practical Guide to Document-Grounded AI

Overview

Learn how to build a lightweight retrieval-augmented generation (RAG) system that answers natural-language questions about markdown documentation. This implementation favors simplicity: it relies on keyword and fuzzy search rather than a vector database, while keeping responses grounded in the source documents.

System Architecture

The RAG system consists of four main components working together to provide document-grounded AI responses:

Document Processor: Parses and chunks markdown files
Search Engine: Finds relevant document sections
LLM Integration: Generates contextual responses
Chat Interface: Handles user interactions

flowchart TD
    User[User] --> |Question| Chat[Chat Interface]
    Chat --> |Query| Search[Search Engine]
    Search --> |Retrieves| Docs[Markdown Documents]
    Search --> |Results| LLM[LLM Processor]
    LLM --> |Response| Chat
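The flow in the diagram can be sketched as a single function that retrieves chunks and then asks the model to answer from them. This is a minimal sketch, not the article's implementation: `Chunk`, `Answer`, `SearchFn`, `GenerateFn`, and `answerQuestion` are hypothetical names, the model call is injected as a plain function rather than tied to any provider, and returning early when nothing is retrieved is an assumption.

```typescript
// Hypothetical shapes for illustration; the article does not define these.
interface Chunk { path: string; title: string; content: string; }
interface Answer { text: string; sources: string[]; }

type SearchFn = (query: string, limit: number) => Chunk[];
type GenerateFn = (systemPrompt: string, question: string) => Promise<string>;

// Search first, then ask the model to answer using only the retrieved chunks.
async function answerQuestion(
  question: string,
  search: SearchFn,
  generate: GenerateFn,
  limit = 5
): Promise<Answer> {
  const chunks = search(question, limit);
  if (chunks.length === 0) {
    // Assumption: with no retrieved context, skip the model call entirely.
    return { text: "I could not find this in the documentation.", sources: [] };
  }

  const context = chunks
    .map(c => `Source: ${c.path}\nTitle: ${c.title}\nContent:\n${c.content}`)
    .join("\n---\n");

  const text = await generate(`Answer using ONLY this context:\n\n${context}`, question);
  return { text, sources: chunks.map(c => c.path) };
}
```

Injecting `search` and `generate` keeps the pipeline testable with stubs and leaves the choice of search engine and model provider open.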

Document Processing

Intelligent Chunking

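The chunker splits content on paragraph boundaries, fills each chunk up to maxChunkSize characters, and carries the trailing paragraphs into the next chunk as overlap (when they fit within overlapSize) to preserve context: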

interface ChunkOptions {
  maxChunkSize: number;
  overlapSize: number;
}

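function chunkDocument(content: string, options: ChunkOptions): string[] {
  const { maxChunkSize, overlapSize } = options;
  const chunks: string[] = [];
  
  // Split by paragraphs (one or more blank lines) and drop empty pieces
  const paragraphs = content.split(/\n\s*\n/).filter(p => p.trim().length > 0);
  let currentChunk = '';
  
  for (const paragraph of paragraphs) {
    if (currentChunk.length + paragraph.length > maxChunkSize && currentChunk.length > 0) {
      chunks.push(currentChunk.trimEnd());
      // Keep up to the last three paragraphs as overlap for context.
      // Trim first so the trailing separator does not count as a paragraph.
      const lastParagraphs = currentChunk
        .trimEnd()
        .split(/\n\s*\n/)
        .slice(-3)
        .join('\n\n');
      
      // Carry the overlap forward only if it fits within overlapSize
      currentChunk = lastParagraphs.length <= overlapSize 
        ? lastParagraphs + '\n\n' 
        : '';
    }
    
    currentChunk += paragraph + '\n\n';
  }
  
  // Flush the final chunk, which the loop itself never pushes
  if (currentChunk.trim().length > 0) {
    chunks.push(currentChunk.trimEnd());
  }
  
  return chunks;
}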

Metadata Extraction

Each document chunk includes rich metadata for better context:

interface DocumentChunk {
  id: string;
  path: string;
  title: string;
  content: string;
  metadata: Record<string, any>;
  headings: string[];
  tokens: string[];
  createdAt: Date;
  updatedAt: Date;
}
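As a rough illustration of how the `title`, `headings`, and `tokens` fields might be populated, the sketch below scans a markdown string for ATX headings (`#` through `######`) and skips anything inside fenced code. The names `ExtractedMetadata`, `extractMetadata`, and `tokenize` are assumptions for this example, and the tokenizer is deliberately simple (lowercase ASCII letters and digits only); the article does not show its own extraction code.

```typescript
// Hypothetical result shape covering three of the DocumentChunk fields.
interface ExtractedMetadata {
  title: string;
  headings: string[];
  tokens: string[];
}

// Simplified tokenizer: lowercase, split on anything that is not a-z or 0-9.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(t => t.length > 0);
}

function extractMetadata(markdown: string, fallbackTitle: string): ExtractedMetadata {
  const headings: string[] = [];
  let firstH1: string | undefined;
  let inFence = false;

  for (const line of markdown.split("\n")) {
    // Toggle on fence delimiters so "# comment" lines in code are ignored.
    if (/^\s*(`{3}|~{3})/.test(line)) {
      inFence = !inFence;
      continue;
    }
    if (inFence) continue;

    // ATX heading: 1-6 hashes, a space, the text, optional closing hashes.
    const match = /^(#{1,6})\s+(.+?)\s*#*\s*$/.exec(line);
    if (match) {
      headings.push(match[2]);
      if (match[1].length === 1 && firstH1 === undefined) firstH1 = match[2];
    }
  }

  // Prefer the first H1, then the first heading of any level, then the fallback.
  let title = fallbackTitle;
  if (firstH1 !== undefined) title = firstH1;
  else if (headings.length > 0) title = headings[0];

  return { title, headings, tokens: tokenize(markdown) };
}
```

A file name without its extension is one reasonable choice for `fallbackTitle` when a chunk contains no headings.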

Search Implementation

Instead of using complex vector embeddings, we implement a hybrid search approach combining:

Fuzzy Matching: Using Fuse.js for typo tolerance
Term Frequency: For content relevance scoring
Phrase Matching: For exact matches

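class SearchEngine {
  // Processed chunks to search over; tokenize, THRESHOLD, and the scoring
  // helpers (calculateTF, phraseMatchBoost, headingMatchScore) are defined elsewhere
  private documents: DocumentChunk[] = [];
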
  private calculateRelevance(query: string[], document: DocumentChunk): number {
    const tfScore = this.calculateTF(query, document);
    const phraseScore = this.phraseMatchBoost(query.join(' '), document);
    const headingScore = this.headingMatchScore(query, document);
    
    return (tfScore * 0.6) + (phraseScore * 0.3) + (headingScore * 0.1);
  }
  
  search(query: string, limit: number = 5): SearchResult[] {
    const queryTokens = tokenize(query);
    const results = this.documents.map(doc => ({
      document: doc,
      score: this.calculateRelevance(queryTokens, doc)
    }));
    
    return results
      .filter(result => result.score > THRESHOLD)
      .sort((a, b) => b.score - a.score)
      .slice(0, limit);
  }
}
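The three scoring helpers called by `calculateRelevance` are not shown in the article. One possible shape for them is sketched below as standalone functions, under the assumption that each returns a value between 0 and 1 so the 0.6 / 0.3 / 0.1 weights remain comparable. `ScoredChunk` and `relevance` are hypothetical names, and the formulas are illustrative rather than the author's.

```typescript
// Minimal chunk shape for this sketch (a subset of DocumentChunk).
interface ScoredChunk {
  content: string;
  headings: string[];
  tokens: string[];
}

// Term frequency: fraction of the chunk's tokens that are query terms.
function calculateTF(query: string[], doc: ScoredChunk): number {
  if (query.length === 0 || doc.tokens.length === 0) return 0;
  const terms = new Set(query);
  const hits = doc.tokens.filter(t => terms.has(t)).length;
  return hits / doc.tokens.length;
}

// Phrase boost: 1 if the full query appears verbatim (case-insensitive), else 0.
function phraseMatchBoost(phrase: string, doc: ScoredChunk): number {
  const needle = phrase.trim().toLowerCase();
  if (needle.length === 0) return 0;
  return doc.content.toLowerCase().includes(needle) ? 1 : 0;
}

// Heading score: fraction of query terms that occur in any heading.
function headingMatchScore(query: string[], doc: ScoredChunk): number {
  if (query.length === 0) return 0;
  const headingText = doc.headings.join(" ").toLowerCase();
  const matched = query.filter(t => headingText.includes(t)).length;
  return matched / query.length;
}

// Same weighting as calculateRelevance in the class above.
function relevance(query: string[], doc: ScoredChunk): number {
  return (
    calculateTF(query, doc) * 0.6 +
    phraseMatchBoost(query.join(" "), doc) * 0.3 +
    headingMatchScore(query, doc) * 0.1
  );
}
```

With this normalization, raw term frequency tends to be small for long chunks, so the value chosen for `THRESHOLD` would need to be tuned against real documents.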

LLM Integration

The system uses a carefully crafted prompt structure to ensure high-quality, grounded responses:

function buildSystemPrompt(relevantDocs: DocumentChunk[]): string {
  const contextSections = relevantDocs.map(doc => `
Source: ${doc.path}
Title: ${doc.title}
Content:
${doc.content}
---`).join('\n\n');
  
  return `You are an AI assistant specialized in answering questions about the documentation.
Answer questions based ONLY on the following information:

${contextSections}

Guidelines:
1. Use ONLY the provided information
2. If information is not in context, acknowledge the limitation
3. Cite sources when possible
4. Format responses in Markdown
5. Be concise but thorough
6. Explain technical terms`;
}

Context Management

The system maintains conversation history while keeping context relevant:

function buildPromptWithRecentContext(
  messages: Message[],
  relevantDocs: DocumentChunk[]
): { systemPrompt: string; userMessages: Message[] } {
  const systemPrompt = buildSystemPrompt(relevantDocs);
  const recentMessages = messages.slice(-6); // Keep conversation focused
  
  return { systemPrompt, userMessages: recentMessages };
}
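To make the hand-off to a model concrete, the sketch below combines a system prompt with the most recent turns into one ordered message list. The `ChatTurn` and `WireMessage` shapes and the `toWireMessages` name are assumptions for illustration (the article does not show its `Message` type), and the role/content layout is a generic chat format rather than any specific provider's API.

```typescript
// Hypothetical shapes; the article's Message type is not shown.
interface ChatTurn {
  role: "user" | "assistant";
  content: string;
}
interface WireMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Put the system prompt first, followed by at most maxTurns recent turns.
function toWireMessages(
  systemPrompt: string,
  history: ChatTurn[],
  maxTurns = 6
): WireMessage[] {
  // Guard against slice(-0), which would return the entire history.
  const recent = maxTurns > 0 ? history.slice(-maxTurns) : [];
  return [{ role: "system", content: systemPrompt }, ...recent];
}
```

Because the system prompt is rebuilt from the latest search results on every request, only the conversation turns need trimming; a window of six matches the `slice(-6)` used above.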

Performance Optimization

The system caches processed documents in memory and reprocesses them only when the cache is empty or more than one hour old, which keeps response times fast without serving stale content for long:

let documentsCache: DocumentChunk[] = [];
let lastProcessed: Date | null = null;

async function getProcessedDocuments() {
  const shouldReprocess = 
    documentsCache.length === 0 || 
    !lastProcessed ||
    (new Date().getTime() - lastProcessed.getTime() > 3600000);
  
  if (shouldReprocess) {
    documentsCache = await processMarkdownDocuments(MARKDOWN_DIRECTORIES);
    lastProcessed = new Date();
  }
  
  return documentsCache;
}
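One case the cache above leaves open is several requests arriving while the cache is stale, each of which would start its own reprocessing run. A possible refinement, sketched here under assumptions, is to share a single in-flight promise among concurrent callers. `createTimedCache`, `Loader`, and the injectable `now` clock are hypothetical names introduced for this example.

```typescript
type Loader<T> = () => Promise<T>;

// Returns a getter that reloads at most once per ttlMs and shares one
// in-flight load among all callers that arrive while it is running.
function createTimedCache<T>(
  load: Loader<T>,
  ttlMs: number,
  now: () => number = Date.now
): () => Promise<T> {
  let value: T | undefined;
  let loadedAt: number | null = null;
  let inFlight: Promise<T> | null = null;

  return async function get(): Promise<T> {
    const fresh = loadedAt !== null && now() - loadedAt <= ttlMs;
    if (fresh) return value as T;

    if (inFlight === null) {
      const pending = load().then(result => {
        value = result;
        loadedAt = now();
        return result;
      });
      inFlight = pending;
      // Clear the marker on success or failure so a later call can retry.
      const clear = () => {
        if (inFlight === pending) inFlight = null;
      };
      pending.then(clear, clear);
    }
    return inFlight;
  };
}
```

In the article's terms, `load` would wrap `processMarkdownDocuments(MARKDOWN_DIRECTORIES)` and `ttlMs` would be 3600000. Passing `now` explicitly also makes the expiry logic testable without waiting an hour.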

User Interface

The system provides a clean, responsive chat interface with:

Markdown Rendering: For formatted responses
Source Citations: Linking to original documents
Context Awareness: Understanding conversation flow

function ChatMessage({ message }: ChatMessageProps) {
  return (
    <div className="flex justify-start">
      <div className="max-w-[80%] rounded-lg p-3 bg-gray-100">
        <MarkdownRenderer content={message.content} />
        
        {/* Render the block only when there is at least one source */}
        {message.sources && message.sources.length > 0 && (
          <div className="mt-2 pt-2 border-t">
            <div className="text-xs font-medium">Sources:</div>
            {message.sources.map(source => (
              <SourceCitation key={source.chunkId} source={source} />
            ))}
          </div>
        )}
      </div>
    </div>
  );
}

Key Benefits

Simplicity: No complex vector database setup
Performance: Fast in-memory search and caching
Accuracy: Document-grounded responses
Maintainability: Easy to update and extend
Integration: Works with existing markdown docs

Limitations & Considerations

- Search is keyword-based and may miss semantic relationships
- In-memory processing limits document scale
- No persistent conversation storage
- Requires careful prompt engineering

Future Enhancements

Vector Search: Add optional embedding-based search
Conversation Storage: Implement persistence
Streaming Responses: Improve response UX
Advanced Context: Add semantic understanding
Multi-Modal: Support for images and diagrams

Conclusion

This lightweight RAG implementation demonstrates that you can build a powerful document-grounded AI system without complex infrastructure. By focusing on practical search strategies and careful prompt engineering, we achieve high-quality responses while maintaining simplicity and performance.

About the Author: Justin Johnson builds AI systems and writes about practical AI development.
