Building a Production RAG Chatbot: From Zero to Portfolio AI Assistant
How I built a sophisticated Retrieval-Augmented Generation system that indexes my entire portfolio and responds as me using OpenAI GPT-4o-mini
Live Demo: Try the chatbot | View Index: RAG Index | Cost: Less than $0.001 per conversation
Introduction
Here's a problem every developer faces: You want an AI assistant that knows everything about your work, but generic chatbots hallucinate facts, can't access your specific projects, and cost a fortune to train on your data.
The naive solution? Fine-tune a model on your resume. Result: expensive training, poor performance on new content, and no way to update knowledge without retraining.
This is what I built with my Portfolio RAG Chatbot: a serverless Next.js application that indexes my entire portfolio (blog posts, resume, projects) into searchable chunks, uses custom relevance scoring to find the most relevant context, and leverages OpenAI's GPT-4o-mini to respond as me, grounded entirely in that retrieved content.
We'll explore document chunking strategies, custom relevance scoring algorithms, OpenAI integration patterns, and production optimizations that power a chatbot that costs less than $0.001 per conversation while providing accurate, contextual responses about my experience and skills.
The Problem: Generic AI vs. Personal Knowledge
Real-World Requirements
A production portfolio chatbot must handle:
Knowledge Requirements
- Access to all blog posts and technical articles
- Complete resume and work experience
- Project details and technical implementations
- Real-time updates when content changes
Response Quality
- Accurate, factual responses (no hallucination)
- First-person responses as the portfolio owner
- Technical depth appropriate to the question
- Context-aware follow-up conversations
Performance Requirements
- Sub-second response times
- Cost-effective operation (< $0.01 per conversation)
- Scalable to multiple concurrent users
- Easy content updates without retraining
The Traditional AI Approaches
Attempt 1: Fine-tuning
// Expensive, inflexible, requires retraining for updates
const fineTunedModel = await openai.fineTuning.jobs.create({
training_file: "portfolio_data.jsonl",
model: "gpt-3.5-turbo",
// Cost: $100+ for training, $0.02 per 1K tokens
});
Attempt 2: Prompt Engineering
// Limited context, prone to hallucination
const response = await openai.chat.completions.create({
model: "gpt-4",
messages: [
{
role: "system",
content:
"You are David Kwon. Here's a summary of his work: [static summary]",
},
],
// Problem: Static knowledge, no access to detailed content
});
Attempt 3: Vector Embeddings
// Complex setup, requires vector database
const embedding = await openai.embeddings.create({
model: "text-embedding-ada-002",
input: portfolioContent,
});
// Problem: Overkill for structured content, expensive
The RAG Solution
Retrieval-Augmented Generation combines the best of both worlds (a minimal end-to-end sketch follows this list):
- Retrieval: Find relevant content from your actual portfolio
- Augmentation: Inject that content into the AI prompt
- Generation: Let the AI respond using only your verified content
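Put concretely, the whole loop fits in one function. Here is a minimal sketch that assumes the getRelevantContext helper and rag-service path described later in this post; the answerAsDavid name is just for illustration:

// Minimal RAG pipeline sketch (getRelevantContext comes from the service shown later)
import OpenAI from "openai";
import { getRelevantContext } from "@/lib/services/rag-service";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

export async function answerAsDavid(question: string): Promise<string> {
  // 1. Retrieval: find the most relevant portfolio chunks
  const { context } = await getRelevantContext(question, 5);

  // 2. Augmentation: inject the retrieved content into the prompt
  const systemPrompt = `Answer as David Kwon using ONLY this context:\n${context}`;

  // 3. Generation: let the model respond from the verified content
  const completion = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      { role: "system", content: systemPrompt },
      { role: "user", content: question },
    ],
  });

  return completion.choices[0]?.message?.content ?? "";
}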
RAG Pipeline Flow
Here's the complete sequence of how your question flows through the RAG system:
Step-by-Step Process
"What AWS experience do you have?"
Searches 62 chunks from knowledge base (blog posts + resume)
Scores each chunk: exact matches (+10), title matches (+5), tag matches (+3), content matches (+1)
Retrieves top 5 chunks with highest relevance scores (avg: 15.4)
Builds system prompt with retrieved chunks and sends to OpenAI GPT-4o-mini
GPT-4o-mini generates response using only the provided context (no hallucination)
Real-Time Console Output
Document index built: {
  totalChunks: 62,
  totalWords: 125000,
  sources: { blogs: 45, resume: 17, projects: 0 }
}
RAG Retrieval: {
  query: "What AWS experience do you have?...",
  chunksRetrieved: 5,
  avgScore: 15.4,
  sources: ["AWS Lambda Experience", "Serverless ETL", "Cloud Architecture"]
}
Architecture Overview
System Components
graph TB
A[User Question] --> B[Document Indexer]
B --> C[RAG Service]
C --> D[Relevance Scoring]
D --> E[Context Retrieval]
E --> F[OpenAI GPT-4o-mini]
F --> G[Response as David]
H[Blog Posts] --> B
I[Resume Data] --> B
J[Project Data] --> B
B --> K[Document Chunks]
K --> L[In-Memory Cache]
L --> C
Detailed Pipeline Flow
sequenceDiagram
participant U as User
participant F as Frontend
participant API as API Route
participant RAG as RAG Service
participant IDX as Document Index
participant AI as OpenAI GPT-4o-mini
U->>F: "What AWS experience do you have?"
F->>API: POST /api/chat
API->>RAG: getRelevantContext(query)
RAG->>IDX: Search 62 chunks
IDX-->>RAG: Return all chunks
RAG->>RAG: Calculate relevance scores
RAG->>RAG: Sort by score (avg: 15.4)
RAG->>RAG: Select top 5 chunks
RAG-->>API: Return context + metadata
API->>AI: Send prompt with context
AI-->>API: Generate response
API-->>F: Return response
F-->>U: Display answer as David
Core Technologies
- Next.js 14: App Router with API routes
- OpenAI GPT-4o-mini: Cost-effective language model
- Custom RAG Engine: Document indexing and retrieval
- TypeScript: Type-safe implementation
Implementation Deep Dive
1. Document Indexing System
The foundation of our RAG system is intelligent document chunking that preserves context while creating searchable segments.
// lib/services/document-indexer.ts
export interface DocumentChunk {
id: string;
content: string;
metadata: {
source: string;
sourceType: "blog" | "resume" | "project";
title: string;
category?: string;
tags?: string[];
wordCount: number;
dateAdded?: string;
};
}
function chunkText(
text: string,
maxChunkSize: number = 1000,
overlap: number = 200
): string[] {
const words = text.split(/\s+/);
const chunks: string[] = [];
// Create overlapping chunks for context preservation
for (let i = 0; i < words.length; i += maxChunkSize - overlap) {
const chunk = words.slice(i, i + maxChunkSize).join(" ");
if (chunk.trim()) {
chunks.push(chunk);
}
}
return chunks.length > 0 ? chunks : [text];
}
Key Design Decisions (a sketch of the indexer that applies them follows this list):
- 800-word chunks with 150-word overlap - Balances context preservation with search precision
- Metadata enrichment - Each chunk includes source, type, and extracted keywords
- Hierarchical indexing - Separate chunks for different content types (blog posts, resume sections)
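The indexer that produces these chunks isn't reproduced in full here. A simplified sketch of how blog posts could be turned into enriched DocumentChunk records; the getAllBlogPosts helper and the post field names (slug, title, tags, date, content) are illustrative assumptions standing in for the real data source:

// lib/services/document-indexer.ts (sketch; helper and field names are assumptions)
interface DocumentIndex {
  chunks: DocumentChunk[];
  metadata: { totalChunks: number };
}

async function buildDocumentIndex(): Promise<DocumentIndex> {
  const chunks: DocumentChunk[] = [];

  for (const post of await getAllBlogPosts()) {
    // Chunk the post body (800-word chunks, 150-word overlap) and enrich with metadata
    chunkText(post.content, 800, 150).forEach((content, i) => {
      chunks.push({
        id: `blog-${post.slug}-${i}`,
        content,
        metadata: {
          source: post.slug,
          sourceType: "blog",
          title: post.title,
          tags: post.tags,
          wordCount: content.split(/\s+/).length,
          dateAdded: post.date,
        },
      });
    });
  }

  // Resume sections and projects are indexed the same way with their own sourceType
  return { chunks, metadata: { totalChunks: chunks.length } };
}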
2. Relevance Scoring Algorithm
Our custom scoring system combines multiple signals to find the most relevant content:
export function calculateRelevanceScore(
query: string,
chunk: DocumentChunk
): number {
const queryWords = query.toLowerCase().split(/\s+/);
const contentLower = chunk.content.toLowerCase();
const titleLower = chunk.metadata.title.toLowerCase();
const tags = chunk.metadata.tags || [];
let score = 0;
// Exact phrase match (highest weight)
if (contentLower.includes(query.toLowerCase())) {
score += 10;
}
// Title matches (high weight)
queryWords.forEach((word) => {
if (titleLower.includes(word)) {
score += 5;
}
});
// Tag matches (medium weight)
queryWords.forEach((word) => {
if (tags.some((tag) => tag.includes(word) || word.includes(tag))) {
score += 3;
}
});
// Content word matches (base weight)
queryWords.forEach((word) => {
if (word.length > 3 && contentLower.includes(word)) {
score += 1;
}
});
// Contextual boosting based on query intent
const queryLower = query.toLowerCase();
if (queryLower.includes("experience") || queryLower.includes("work")) {
if (chunk.metadata.category === "experience") {
score += 5; // Boost experience chunks for work queries
}
}
if (queryLower.includes("skill") || queryLower.includes("technology")) {
if (chunk.metadata.category === "skills") {
score += 3; // Boost skills chunks for tech queries
}
}
return score;
}
Scoring Strategy:
- Exact phrase matches: +10 points (highest priority)
- Title relevance: +5 points per matching word
- Tag relevance: +3 points per matching tag
- Content relevance: +1 point per matching word
- Contextual boosting: +3-5 points based on query intent
3. RAG Service Implementation
The RAG service orchestrates retrieval with intelligent caching and detailed metadata:
// lib/services/rag-service.ts
export interface RetrievalResult {
chunks: DocumentChunk[];
scores: number[];
context: string;
metadata: {
totalChunksSearched: number;
chunksRetrieved: number;
maxScore: number;
minScore: number;
avgScore: number;
};
}
// Cache the index to avoid rebuilding on every request
let cachedIndex: DocumentIndex | null = null;
let lastIndexTime: number = 0;
const INDEX_CACHE_TTL = 5 * 60 * 1000; // 5 minutes
export async function getRelevantContext(
query: string,
topK: number = 5
): Promise<RetrievalResult> {
const index = await getDocumentIndex();
// Calculate relevance scores for all chunks
const scoredChunks = index.chunks
.map((chunk) => ({
chunk,
score: calculateRelevanceScore(query, chunk),
}))
.filter((item) => item.score > 0) // Only include relevant chunks
.sort((a, b) => b.score - a.score) // Sort by highest score first
.slice(0, topK); // Take top K results
const chunks = scoredChunks.map((item) => item.chunk);
const scores = scoredChunks.map((item) => item.score);
// Build context string with metadata
const context = chunks
.map((chunk, index) => {
return `[Source: ${chunk.metadata.title}]
[Type: ${chunk.metadata.sourceType}]
[Relevance Score: ${scores[index]}]
Content:
${chunk.content}`;
})
.join("\n\n---\n\n");
return {
chunks,
scores,
context,
metadata: {
totalChunksSearched: index.metadata.totalChunks,
chunksRetrieved: chunks.length,
maxScore: scores.length > 0 ? Math.max(...scores) : 0,
minScore: scores.length > 0 ? Math.min(...scores) : 0,
avgScore:
scores.length > 0
? scores.reduce((sum, s) => sum + s, 0) / scores.length
: 0,
},
};
}
Performance Optimizations (a sketch of the cache helper follows this list):
- 5-minute caching - Avoids rebuilding the index on every request
- Score filtering - Only processes chunks with score > 0
- Top-K retrieval - Limits context to the most relevant chunks
- Metadata tracking - Provides visibility into retrieval quality
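The getDocumentIndex call that powers the cache isn't shown above. A minimal sketch of it, using the module-level cache variables already declared and an assumed buildDocumentIndex helper, could look like this:

// Sketch: serving the document index from the module-level cache declared above
async function getDocumentIndex(): Promise<DocumentIndex> {
  const now = Date.now();

  // Rebuild only when the cache is empty or older than the 5-minute TTL
  if (!cachedIndex || now - lastIndexTime > INDEX_CACHE_TTL) {
    cachedIndex = await buildDocumentIndex();
    lastIndexTime = now;
  }

  return cachedIndex;
}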
4. OpenAI Integration
The API route handles the complete RAG pipeline with detailed logging:
// app/api/chat/route.ts
export async function POST(request: NextRequest) {
try {
const { messages } = await request.json();
const lastUserMessage = messages
.filter((m: { role: string }) => m.role === "user")
.pop();
// Get relevant context using RAG
const retrievalResult = await getRelevantContext(
lastUserMessage.content,
5
);
const relevantContext = retrievalResult.context;
// Log retrieval details for debugging
console.log("๐ RAG Retrieval Details:", {
query: lastUserMessage.content.substring(0, 50) + "...",
chunksRetrieved: retrievalResult.metadata.chunksRetrieved,
avgScore: retrievalResult.metadata.avgScore.toFixed(2),
sources: retrievalResult.chunks.map((c) => c.metadata.title),
});
// Create system prompt with context
const systemPrompt = `You are David Kwon, a Full-Stack Engineer & Cloud Architect. You are answering questions about yourself in first person.
Your expertise includes:
- Full-stack development with React, TypeScript, Next.js, Go
- Cloud architecture with AWS (Lambda, DynamoDB, CloudFront, Step Functions, etc.)
- Building scalable systems, GraphQL APIs, serverless architectures
- System design, performance optimization, and developer experience
IMPORTANT INSTRUCTIONS:
- Only use information from the context below - do NOT make up facts
- If the context doesn't contain enough information, say "I don't have that information in my knowledge base"
- Be professional but conversational
- Reference specific projects, blog posts, or experiences when relevant
- Keep responses concise but informative (2-3 paragraphs max)
- Use technical terms appropriately but explain complex concepts clearly
RELEVANT CONTEXT FROM PORTFOLIO (with relevance scores):
---
${relevantContext}
---
Answer in first person as David Kwon based on the context above.`;
// Create chat completion
const completion = await openai.chat.completions.create({
model: "gpt-4o-mini", // Most cost-effective: $0.15/1M input, $0.60/1M output tokens
messages: [
{ role: "system", content: systemPrompt },
...messages.slice(-5), // Only include last 5 messages for context
],
temperature: 0.7,
max_tokens: 500,
});
return NextResponse.json({
message: completion.choices[0]?.message?.content,
});
} catch (error) {
console.error("Chat API error:", error);
return NextResponse.json(
{ error: "Failed to process chat request" },
{ status: 500 }
);
}
}
Key Features:
- Context injection - Relevant chunks are embedded in the system prompt
- Conversation memory - The last 5 messages provide context
- Detailed logging - Console output shows retrieval metrics
- Error handling - Graceful fallbacks for API failures
Frontend Implementation
Chat Interface Components
The frontend provides a responsive chat interface with suggested questions and real-time updates:
// components/home/chatbot-section.tsx
function useChatbot() {
const [messages, setMessages] = useState<Message[]>([INITIAL_MESSAGE]);
const [input, setInput] = useState("");
const [isLoading, setIsLoading] = useState(false);
const [suggestedQuestions, setSuggestedQuestions] = useState<
SuggestedQuestion[]
>([]);
const handleSubmit = async (e: React.FormEvent) => {
e.preventDefault();
if (!input.trim() || isLoading) return;
const userMessage: Message = {
role: "user",
content: input.trim(),
timestamp: new Date(),
};
setMessages((prev) => [...prev, userMessage]);
setInput("");
setIsLoading(true);
try {
const response = await fetch("/api/chat", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
messages: [...messages, userMessage].map((m) => ({
role: m.role,
content: m.content,
})),
}),
});
const data = await response.json();
const assistantMessage: Message = {
role: "assistant",
content: data.message,
timestamp: new Date(),
};
setMessages((prev) => [...prev, assistantMessage]);
} catch (error) {
console.error("Chat error:", error);
// Handle error gracefully
} finally {
setIsLoading(false);
}
};
return {
messages,
input,
setInput,
isLoading,
suggestedQuestions,
handleSubmit,
};
}
Suggested Questions System
Dynamic question generation based on conversation context:
// lib/utils/suggested-questions.ts
export function generateSuggestedQuestions(
messages: Message[]
): SuggestedQuestion[] {
const lastMessage = messages[messages.length - 1];
const hasExperienceQuestion = messages.some(
(m) =>
m.content.toLowerCase().includes("experience") ||
m.content.toLowerCase().includes("work")
);
if (messages.length === 1) {
// Initial suggestions
return [
{ text: "What's your experience with AWS?", category: "technical" },
{ text: "Tell me about your recent projects", category: "projects" },
{ text: "What technologies do you work with?", category: "technical" },
{ text: "How did you build this portfolio?", category: "meta" },
];
}
if (hasExperienceQuestion) {
return [
{ text: "What was your role at [Company]?", category: "experience" },
{ text: "What technologies did you use there?", category: "technical" },
{ text: "What were your biggest achievements?", category: "experience" },
];
}
// Context-aware suggestions based on conversation
return [
{ text: "Can you elaborate on that?", category: "follow-up" },
{ text: "What challenges did you face?", category: "experience" },
{ text: "How did you solve that problem?", category: "technical" },
];
}
Performance & Cost Analysis
Cost Breakdown
Model Selection Analysis:
| Model | Input Cost | Output Cost | Quality | Recommendation |
|---|---|---|---|---|
| GPT-4o-mini | $0.15/1M tokens | $0.60/1M tokens | High | Current choice |
| GPT-3.5-turbo | $0.50/1M tokens | $1.50/1M tokens | Medium | More expensive, lower quality |
| GPT-4o | $2.50/1M tokens | $10.00/1M tokens | Highest | ~17x more expensive |
Real-world Usage (the arithmetic is spelled out in the snippet below):
- Average conversation: ~800 tokens input, ~200 tokens output
- Cost per conversation: ~$0.0002 (about 0.02 cents)
- 1,000 conversations: ~$0.20 total cost
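That per-conversation figure follows directly from the GPT-4o-mini prices in the table above:

// Back-of-the-envelope cost check using the prices listed above
const inputCost = (800 / 1_000_000) * 0.15;     // ~$0.00012 for ~800 input tokens
const outputCost = (200 / 1_000_000) * 0.60;    // ~$0.00012 for ~200 output tokens
const perConversation = inputCost + outputCost; // ~$0.00024, i.e. about 0.02 cents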
Performance Metrics
Response Times:
- Document indexing: ~200ms (cached after first request)
- RAG retrieval: ~50ms
- OpenAI API call: ~800ms
- Total response time: ~1.1 seconds
Scalability:
- Concurrent users: Limited by OpenAI rate limits (not the application)
- Memory usage: ~50MB for the document index
- Database: None required (file-based indexing)
Optimization Strategies
- Caching: 5-minute document index cache
- Chunking: Optimal 800-word chunks with overlap
- Filtering: Only process chunks with relevance score > 0
- Model selection: GPT-4o-mini for the best cost/quality ratio
- Context limiting: Top 5 most relevant chunks only
Production Deployment
Environment Setup
# .env.local
OPENAI_API_KEY=sk-proj-your-actual-api-key-here
Vercel Deployment
// vercel.json
{
"functions": {
"app/api/chat/route.ts": {
"maxDuration": 30
}
},
"env": {
"OPENAI_API_KEY": "@openai-api-key"
}
}
Security Considerations
- API Key Protection: Server-side only, never exposed to client
- Rate Limiting: Consider implementing for production use
- Input Validation: Sanitize user inputs before processing (a minimal sketch follows this list)
- Error Handling: Graceful fallbacks for API failures
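Neither rate limiting nor validation is implemented in the route shown above. Here is a minimal sketch of what request validation could look like before the RAG pipeline runs; the limits and the validateChatRequest name are placeholders, not values from the production app:

// Sketch: basic request validation before running the RAG pipeline
interface ChatMessage {
  role: string;
  content: string;
}

const MAX_MESSAGE_LENGTH = 1000; // placeholder limit
const MAX_MESSAGES = 20;         // placeholder limit

function validateChatRequest(body: { messages?: unknown }): ChatMessage[] {
  const messages = body?.messages;
  if (!Array.isArray(messages) || messages.length === 0 || messages.length > MAX_MESSAGES) {
    throw new Error("Invalid messages payload");
  }
  return messages.map((m) => {
    if (typeof m?.role !== "string" || typeof m?.content !== "string") {
      throw new Error("Malformed message");
    }
    // Truncate overly long inputs instead of forwarding them to the model
    return { role: m.role, content: m.content.slice(0, MAX_MESSAGE_LENGTH).trim() };
  });
}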
Monitoring & Debugging
Console Logging
The system provides detailed logging for debugging:
// Example console output
Document index built: {
  totalChunks: 45,
  totalWords: 125000,
  sources: { blogs: 35, resume: 10, projects: 0 }
}
RAG Retrieval: {
  query: "What AWS experience do you have?...",
  chunksRetrieved: 5,
  avgScore: 8.2,
  sources: ["AWS Lambda Experience", "Serverless ETL", "Cloud Architecture"]
}
Visual Inspection
The /rag-index page provides a visual interface to inspect indexed content (a minimal page sketch follows this list):
- View all document chunks
- Search and filter functionality
- Relevance scores and metadata
- Source attribution
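Such a page can be a simple server component. This is a rough sketch, assuming getDocumentIndex is exported from the RAG service and the page lives at app/rag-index/page.tsx; the real page adds search, filtering, and score display on top of this:

// app/rag-index/page.tsx (sketch; assumes getDocumentIndex is exported)
import { getDocumentIndex } from "@/lib/services/rag-service";

export default async function RagIndexPage() {
  const index = await getDocumentIndex();
  return (
    <main>
      <h1>RAG Index ({index.metadata.totalChunks} chunks)</h1>
      {index.chunks.map((chunk) => (
        <section key={chunk.id}>
          <h2>{chunk.metadata.title}</h2>
          <p>
            {chunk.metadata.sourceType} · {chunk.metadata.wordCount} words
          </p>
          {/* Show a preview of each chunk's content */}
          <pre>{chunk.content.slice(0, 300)}</pre>
        </section>
      ))}
    </main>
  );
}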
Future Enhancements
1. Semantic Search with Embeddings
// Future: Vector-based similarity search
const queryEmbedding = (
  await openai.embeddings.create({
    model: "text-embedding-ada-002",
    input: query,
  })
).data[0].embedding;
// Calculate cosine similarity against precomputed chunk embeddings
const similarity = cosineSimilarity(queryEmbedding, chunkEmbedding);
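The cosineSimilarity helper referenced above is a straightforward dot-product computation, for example:

// Cosine similarity between two embedding vectors
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Guard against zero-length vectors
  return dot / (Math.sqrt(normA) * Math.sqrt(normB) || 1);
}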
2. Hybrid Search Strategy
// Combine keyword and semantic scoring
const keywordScore = calculateRelevanceScore(query, chunk);
const semanticScore = calculateSemanticSimilarity(query, chunk);
const finalScore = keywordScore * 0.7 + semanticScore * 0.3;
3. Streaming Responses
// Real-time response streaming
const stream = await openai.chat.completions.create({
model: "gpt-4o-mini",
messages: [...],
stream: true
});
for await (const chunk of stream) {
  const delta = chunk.choices[0]?.delta?.content ?? "";
  // Forward each delta to the client as it arrives
}
4. User Feedback Loop
// Collect user feedback to improve scoring
interface UserFeedback {
query: string;
response: string;
rating: 1 | 2 | 3 | 4 | 5;
chunksUsed: string[];
}
// Use feedback to tune relevance scoring weights
Lessons Learned
What Worked Well
- Custom relevance scoring - More effective than simple keyword matching
- Contextual boosting - Query intent recognition improves accuracy
- Cost optimization - GPT-4o-mini provides excellent quality at low cost
- Caching strategy - 5-minute cache balances performance and freshness
- Detailed logging - Essential for debugging and optimization
What Could Be Improved
- Semantic search - Embeddings would improve relevance for complex queries
- Response streaming - Better UX for longer responses
- User feedback - Need data to tune scoring algorithms
- Content updates - Manual cache invalidation could be automated
- Rate limiting - Production deployment needs usage controls
Key Insights
- RAG beats fine-tuning - More flexible, cheaper, easier to update
- Relevance scoring is critical - Quality of retrieval determines response quality
- Context injection works - AI responds accurately when given proper context
- Cost optimization matters - Model selection has massive cost implications
- Debugging is essential - Detailed logging enables continuous improvement
Conclusion
Building a production RAG chatbot taught me that the magic isn't in the AI modelโit's in the retrieval system. By focusing on intelligent document chunking, sophisticated relevance scoring, and context injection, we created a chatbot that:
- Costs less than $0.001 per conversation
- Responds accurately using only verified content
- Scales effortlessly with content updates
- Provides detailed debugging visibility
The RAG approach proved superior to fine-tuning because it's more flexible, cost-effective, and maintainable. The key was building a robust retrieval system that finds the right content and injects it into the AI prompt with proper context.
Next steps: Implement semantic search with embeddings, add user feedback loops, and explore streaming responses for even better user experience.
The code is open-source and available in my portfolio repository. Feel free to adapt it for your own use caseโthe patterns work for any domain where you need an AI assistant with access to specific knowledge.
Resources
- Repository: GitHub - Kwonfolio
- Live Demo: Portfolio Chatbot
- RAG Index: View Indexed Content
- OpenAI Documentation: Chat Completions API
- Next.js API Routes: API Routes Guide
This article is part of my technical blog series. Check out my other posts on DynamoDB patterns, workflow editors, and system architecture.