AI & LLM

Building a RAG System with Pinecone and LangChain [2025]

House of Loops Team · July 29, 2025 · 14 min read

Building a RAG System with Pinecone and LangChain

Retrieval-Augmented Generation (RAG) is transforming how we build AI applications. By combining the reasoning capabilities of Large Language Models with precise information retrieval from vector databases, RAG systems provide accurate, contextual responses grounded in your own data. This comprehensive guide shows you how to build a production-ready RAG system using Pinecone, LangChain, and n8n.

Understanding RAG Architecture

RAG systems address a fundamental LLM problem: hallucination. Instead of relying solely on the model's training data, a RAG pipeline retrieves relevant information from your knowledge base and supplies it to the model as context.

How RAG Works

┌─────────────────────────────────────────────────────────────┐
│                    RAG System Flow                           │
│                                                              │
│  1. User Query                                               │
│     "What are our Q4 revenue projections?"                  │
│                    │                                         │
│                    ▼                                         │
│  2. Query Embedding                                          │
│     [0.123, -0.456, 0.789, ...]  (1536 dimensions)         │
│                    │                                         │
│                    ▼                                         │
│  3. Vector Search (Pinecone)                                │
│     ┌──────────────────────────────────────────┐           │
│     │ • Financial Report Q4 (similarity: 0.92) │           │
│     │ • Revenue Forecast Doc (similarity: 0.88)│           │
│     │ • Board Meeting Notes (similarity: 0.85) │           │
│     └──────────────────────────────────────────┘           │
│                    │                                         │
│                    ▼                                         │
│  4. Context Assembly                                         │
│     Relevant passages + user query                          │
│                    │                                         │
│                    ▼                                         │
│  5. LLM Generation (GPT-4/Claude)                           │
│     "Based on our Q4 financial report, revenue              │
│      projections are $X million, representing a Y%          │
│      increase from Q3..."                                   │
└─────────────────────────────────────────────────────────────┘

Key Components

  1. Vector Database (Pinecone): Stores document embeddings for fast similarity search
  2. Embedding Model: Converts text to vector representations
  3. LLM (GPT-4/Claude): Generates responses using retrieved context
  4. Orchestration (LangChain): Connects components and manages workflow
  5. Automation (n8n): Handles document ingestion and updates
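
At query time, these components compose into a short pipeline. The sketch below is illustrative only; `embeddings`, `pinecone`, and `llm` are placeholders for the service classes built later in this guide.

// rag-pipeline-sketch.js - minimal outline of the query path (illustrative)
async function answerQuestion(question, { embeddings, pinecone, llm }) {
  // 1. Embed the user query
  const queryVector = await embeddings.createEmbedding(question);

  // 2. Retrieve the most similar chunks from the vector database
  const matches = await pinecone.query(queryVector, { topK: 5 });

  // 3. Assemble the retrieved passages into a context block
  const context = matches.map(m => m.metadata.text).join('\n\n');

  // 4. Ask the LLM to answer using only that context
  const prompt = `Answer using only this context:\n${context}\n\nQuestion: ${question}`;
  const response = await llm.invoke(prompt);
  return response.content;
}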

Setting Up Pinecone

Create Pinecone Index

# Uses the classic pod-based pinecone-client (v2.x); newer SDK releases
# replace pinecone.init() with the Pinecone class.
import pinecone

# Initialize Pinecone
pinecone.init(
    api_key="your-api-key",
    environment="us-west1-gcp"
)

# Create index for OpenAI embeddings (1536 dimensions)
index_name = "knowledge-base"

if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        name=index_name,
        dimension=1536,  # OpenAI ada-002 embedding size
        metric="cosine",  # or "euclidean" or "dotproduct"
        pods=1,
        pod_type="p1.x1",  # Standard pod
        metadata_config={
            "indexed": ["source", "timestamp", "category"]
        }
    )

# Connect to index
index = pinecone.Index(index_name)

# Check index stats
print(index.describe_index_stats())

Understanding Index Configuration

Dimensions: Must match your embedding model

  • OpenAI text-embedding-ada-002: 1536
  • OpenAI text-embedding-3-small: 1536
  • OpenAI text-embedding-3-large: 3072
  • Cohere embed-english-v3.0: 1024

Metrics:

  • Cosine: Best for most use cases, measures angle between vectors
  • Euclidean: Measures straight-line distance
  • Dotproduct: Faster but requires normalized vectors

Pod Types:

  • p1.x1: Standard (~1M vectors at 768 dimensions, ~$70/month)
  • p1.x2: 2x capacity (~2M vectors)
  • s1.x1: Storage-optimized (5M vectors, slower queries)
  • p2: Performance-optimized (fastest queries)
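
Because the index dimension is fixed at creation time, it is worth checking that the embedding model and the index agree before you upsert anything. A minimal sketch of such a guard; the lookup table mirrors the list above, and `assertDimensionMatch` is our own illustrative helper, not part of the Pinecone API.

// dimension-check.js - guard against a model/index dimension mismatch (illustrative)
const EXPECTED_DIMENSIONS = {
  'text-embedding-ada-002': 1536,
  'text-embedding-3-small': 1536,
  'text-embedding-3-large': 3072,
  'embed-english-v3.0': 1024,
};

function assertDimensionMatch(model, indexDimension) {
  const expected = EXPECTED_DIMENSIONS[model];
  if (expected === undefined) {
    throw new Error(`Unknown embedding model: ${model}`);
  }
  if (expected !== indexDimension) {
    throw new Error(
      `Index dimension ${indexDimension} does not match ${model} (expects ${expected})`
    );
  }
}

// Usage
assertDimensionMatch('text-embedding-3-small', 1536);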

Document Processing Pipeline

Chunking Strategy

Proper chunking is critical for RAG performance:

// document-chunker.js
class DocumentChunker {
  constructor(options = {}) {
    this.chunkSize = options.chunkSize || 1000; // Characters
    this.chunkOverlap = options.chunkOverlap || 200; // Overlap for context
    this.separators = options.separators || ['\n\n', '\n', '. ', ' '];
  }

  chunk(text, metadata = {}) {
    const chunks = [];
    let startIndex = 0;

    while (startIndex < text.length) {
      // Determine chunk end
      let endIndex = Math.min(startIndex + this.chunkSize, text.length);

      // Try to break at natural boundaries (paragraph, sentence, etc.)
      if (endIndex < text.length) {
        endIndex = this.findBestSplit(text, startIndex, endIndex);
      }

      const chunk = text.slice(startIndex, endIndex).trim();

      if (chunk.length > 0) {
        chunks.push({
          text: chunk,
          metadata: {
            ...metadata,
            chunk_index: chunks.length,
            start_char: startIndex,
            end_char: endIndex,
          },
        });
      }

      // Stop once the final chunk has been emitted to avoid an infinite loop
      if (endIndex >= text.length) break;

      // Move to next chunk with overlap (always advance to guarantee progress)
      startIndex = Math.max(endIndex - this.chunkOverlap, startIndex + 1);
    }

    return chunks;
  }

  findBestSplit(text, start, end) {
    // Try each separator in order of preference
    for (const separator of this.separators) {
      const lastIndex = text.lastIndexOf(separator, end);
      if (lastIndex > start) {
        return lastIndex + separator.length;
      }
    }
    return end;
  }

  // Special chunking for code files
  chunkCode(code, language, metadata = {}) {
    const chunks = [];

    // Split by functions/classes
    const functionRegex = {
      javascript: /(?:function|class|const|let|var)\s+\w+/g,
      python: /(?:def|class)\s+\w+/g,
      java: /(?:public|private|protected)?\s*(?:static)?\s*(?:class|interface|void|[\w<>]+)\s+\w+/g,
    };

    const regex = functionRegex[language] || functionRegex.javascript;
    const matches = [...code.matchAll(regex)];

    if (matches.length === 0) {
      // No functions found, use regular chunking
      return this.chunk(code, { ...metadata, type: 'code' });
    }

    // Create chunks based on function boundaries
    for (let i = 0; i < matches.length; i++) {
      const start = matches[i].index;
      const end = i < matches.length - 1 ? matches[i + 1].index : code.length;
      const chunk = code.slice(start, end).trim();

      chunks.push({
        text: chunk,
        metadata: {
          ...metadata,
          type: 'code',
          language,
          function_name: matches[i][0],
          chunk_index: i,
        },
      });
    }

    return chunks;
  }

  // Markdown-aware chunking
  chunkMarkdown(markdown, metadata = {}) {
    const chunks = [];
    const lines = markdown.split('\n');
    let currentChunk = '';
    let currentHeading = '';
    let currentLevel = 0;

    for (const line of lines) {
      // Check for headers
      const headerMatch = line.match(/^(#{1,6})\s+(.+)$/);

      if (headerMatch) {
        // Save previous chunk
        if (currentChunk.trim().length > 0) {
          chunks.push({
            text: currentChunk.trim(),
            metadata: {
              ...metadata,
              heading: currentHeading,
              heading_level: currentLevel,
              chunk_index: chunks.length,
            },
          });
        }

        // Start new chunk
        currentHeading = headerMatch[2];
        currentLevel = headerMatch[1].length;
        currentChunk = line + '\n';
      } else {
        currentChunk += line + '\n';

        // Check if chunk is getting too large
        if (currentChunk.length > this.chunkSize) {
          chunks.push({
            text: currentChunk.trim(),
            metadata: {
              ...metadata,
              heading: currentHeading,
              heading_level: currentLevel,
              chunk_index: chunks.length,
            },
          });
          currentChunk = '';
        }
      }
    }

    // Add final chunk
    if (currentChunk.trim().length > 0) {
      chunks.push({
        text: currentChunk.trim(),
        metadata: {
          ...metadata,
          heading: currentHeading,
          heading_level: currentLevel,
          chunk_index: chunks.length,
        },
      });
    }

    return chunks;
  }
}

// Usage
const chunker = new DocumentChunker({
  chunkSize: 1000,
  chunkOverlap: 200,
});

const document = 'Your long document text here...';
const chunks = chunker.chunk(document, {
  source: 'company-docs',
  document_id: 'doc-123',
  timestamp: new Date().toISOString(),
});

Generating Embeddings

// embedding-service.js
const OpenAI = require('openai');

class EmbeddingService {
  constructor(apiKey) {
    this.openai = new OpenAI({ apiKey });
    this.model = 'text-embedding-3-small';
    this.batchSize = 100; // OpenAI allows up to 2048 inputs per request
  }

  async createEmbedding(text) {
    const response = await this.openai.embeddings.create({
      model: this.model,
      input: text,
      encoding_format: 'float',
    });

    return response.data[0].embedding;
  }

  async createEmbeddingsBatch(texts) {
    // Split into batches
    const batches = [];
    for (let i = 0; i < texts.length; i += this.batchSize) {
      batches.push(texts.slice(i, i + this.batchSize));
    }

    // Process batches
    const allEmbeddings = [];
    for (const batch of batches) {
      const response = await this.openai.embeddings.create({
        model: this.model,
        input: batch,
        encoding_format: 'float',
      });

      allEmbeddings.push(...response.data.map(d => d.embedding));

      // Rate limiting
      await this.sleep(100);
    }

    return allEmbeddings;
  }

  sleep(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
  }

  // Calculate cost for embedding generation
  calculateCost(numTokens) {
    // text-embedding-3-small: $0.02 / 1M tokens
    const pricePerMillion = 0.02;
    return (numTokens / 1000000) * pricePerMillion;
  }

  // Estimate tokens
  estimateTokens(text) {
    // Rough estimate: ~4 characters per token
    return Math.ceil(text.length / 4);
  }
}

module.exports = EmbeddingService;

Upserting to Pinecone

// pinecone-service.js
const { Pinecone } = require('@pinecone-database/pinecone');

class PineconeService {
  constructor(apiKey, environment, indexName) {
    this.client = new Pinecone({
      apiKey,
      environment,
    });
    this.index = this.client.index(indexName);
  }

  async upsertChunks(chunks, embeddings) {
    const vectors = chunks.map((chunk, i) => ({
      id: chunk.metadata.chunk_id || `chunk-${Date.now()}-${i}`,
      values: embeddings[i],
      metadata: {
        text: chunk.text,
        source: chunk.metadata.source,
        document_id: chunk.metadata.document_id,
        chunk_index: chunk.metadata.chunk_index,
        timestamp: chunk.metadata.timestamp,
        // Add any custom metadata
        ...chunk.metadata,
      },
    }));

    // Pinecone allows max 100 vectors per upsert
    const batchSize = 100;
    for (let i = 0; i < vectors.length; i += batchSize) {
      const batch = vectors.slice(i, i + batchSize);
      await this.index.upsert(batch);
    }

    return vectors.length;
  }

  async query(embedding, options = {}) {
    const { topK = 5, filter = {}, includeMetadata = true, includeValues = false } = options;

    const results = await this.index.query({
      vector: embedding,
      topK,
      filter,
      includeMetadata,
      includeValues,
    });

    return results.matches;
  }

  async deleteByMetadata(filter) {
    // deleteMany accepts a metadata filter object (or an array of IDs)
    await this.index.deleteMany(filter);
  }

  async updateMetadata(id, metadata) {
    await this.index.update({
      id,
      setMetadata: metadata,
    });
  }

  async fetch(ids) {
    // Fetch stored records by ID (used by the parent document retriever later on)
    return await this.index.fetch(ids);
  }

  async getStats() {
    return await this.index.describeIndexStats();
  }

  // Hybrid search: combine vector similarity with metadata filtering
  async hybridSearch(embedding, filters = {}, options = {}) {
    const { topK = 10, minScore = 0.7, rerank = true } = options;

    // First pass: vector search with filters
    const results = await this.query(embedding, {
      topK: topK * 2, // Get more results for reranking
      filter: filters,
      includeMetadata: true,
    });

    // Filter by minimum score
    let filtered = results.filter(r => r.score >= minScore);

    // Rerank based on additional criteria
    if (rerank) {
      filtered = this.rerankResults(filtered);
    }

    return filtered.slice(0, topK);
  }

  rerankResults(results) {
    // Boost recent documents: a result's final score is its similarity
    // minus a small penalty per year of age
    const now = Date.now();
    const msPerYear = 1000 * 60 * 60 * 24 * 365;

    const finalScore = result => {
      const ageYears = (now - new Date(result.metadata.timestamp).getTime()) / msPerYear;
      return result.score - ageYears * 0.1;
    };

    return results.sort((a, b) => finalScore(b) - finalScore(a));
  }
}

module.exports = PineconeService;

LangChain Integration

Building the RAG Chain

// rag-chain.js
const { ChatOpenAI } = require('@langchain/openai');
const { PromptTemplate } = require('@langchain/core/prompts');
const { RunnableSequence } = require('@langchain/core/runnables');
const { StringOutputParser } = require('@langchain/core/output_parsers');

class RAGChain {
  constructor(pineconeService, embeddingService) {
    this.pinecone = pineconeService;
    this.embeddings = embeddingService;

    // Initialize LLM
    this.llm = new ChatOpenAI({
      modelName: 'gpt-4-turbo-preview',
      temperature: 0.1, // Low temperature for factual responses
      maxTokens: 1000,
    });

    // Create prompt template
    this.promptTemplate = PromptTemplate.fromTemplate(`
You are a helpful assistant that answers questions based on the provided context.

Context:
{context}

Question: {question}

Instructions:
- Answer based ONLY on the provided context
- If the context doesn't contain the answer, say "I don't have enough information to answer that"
- Be concise and accurate
- Cite specific parts of the context when possible

Answer:`);
  }

  async query(question, options = {}) {
    const { topK = 5, filters = {}, includeReferences = true } = options;

    // 1. Create embedding for the question
    const questionEmbedding = await this.embeddings.createEmbedding(question);

    // 2. Search Pinecone for relevant chunks
    const searchResults = await this.pinecone.hybridSearch(questionEmbedding, filters, {
      topK,
      minScore: 0.7,
    });

    if (searchResults.length === 0) {
      return {
        answer: "I don't have any relevant information to answer that question.",
        references: [],
        confidence: 0,
      };
    }

    // 3. Format context from search results
    const context = this.formatContext(searchResults);

    // 4. Generate answer using LLM
    const chain = RunnableSequence.from([this.promptTemplate, this.llm, new StringOutputParser()]);

    const answer = await chain.invoke({
      context,
      question,
    });

    // 5. Calculate confidence score
    const avgScore = searchResults.reduce((sum, r) => sum + r.score, 0) / searchResults.length;

    return {
      answer,
      references: includeReferences ? this.formatReferences(searchResults) : [],
      confidence: avgScore,
      searchResults: searchResults.length,
    };
  }

  formatContext(results) {
    return results
      .map((result, i) => {
        const source = result.metadata.source || 'Unknown';
        const text = result.metadata.text;
        return `[Source ${i + 1}: ${source}]\n${text}`;
      })
      .join('\n\n---\n\n');
  }

  formatReferences(results) {
    return results.map((result, i) => ({
      index: i + 1,
      source: result.metadata.source,
      document_id: result.metadata.document_id,
      score: result.score,
      excerpt: result.metadata.text.substring(0, 200) + '...',
    }));
  }

  // Conversational RAG with chat history
  async queryWithHistory(question, chatHistory = [], options = {}) {
    // Reformulate question based on chat history
    const reformulatedQuestion = await this.reformulateQuestion(question, chatHistory);

    // Get answer
    const response = await this.query(reformulatedQuestion, options);

    return {
      ...response,
      reformulated_question: reformulatedQuestion,
    };
  }

  async reformulateQuestion(question, chatHistory) {
    if (chatHistory.length === 0) {
      return question;
    }

    const historyText = chatHistory
      .slice(-3) // Last 3 exchanges
      .map(h => `Human: ${h.question}\nAssistant: ${h.answer}`)
      .join('\n\n');

    const reformulationPrompt = `
Given the following conversation history and a new question, reformulate the question to be standalone.

Conversation history:
${historyText}

New question: ${question}

Standalone question:`;

    const response = await this.llm.invoke(reformulationPrompt);
    return response.content.trim();
  }

  // Multi-query retrieval
  async multiQueryRetrieval(question, options = {}) {
    // Generate multiple variations of the question
    const variations = await this.generateQueryVariations(question);

    // Search for each variation
    const allResults = [];
    for (const variation of variations) {
      const embedding = await this.embeddings.createEmbedding(variation);
      const results = await this.pinecone.query(embedding, {
        topK: 3,
        ...options,
      });
      allResults.push(...results);
    }

    // Deduplicate and rank
    const uniqueResults = this.deduplicateResults(allResults);
    const topResults = uniqueResults.slice(0, options.topK || 5);

    // Generate answer
    const context = this.formatContext(topResults);
    const answer = await this.llm.invoke(await this.promptTemplate.format({ context, question }));

    return {
      answer: answer.content,
      references: this.formatReferences(topResults),
      query_variations: variations,
    };
  }

  async generateQueryVariations(question) {
    const prompt = `
Generate 3 different ways to ask the following question:

Original question: ${question}

Variations:
1.`;

    const response = await this.llm.invoke(prompt);
    const variations = response.content
      .split('\n')
      .filter(line => line.match(/^\d+\./))
      .map(line => line.replace(/^\d+\.\s*/, '').trim());

    return [question, ...variations];
  }

  deduplicateResults(results) {
    const seen = new Set();
    const unique = [];

    for (const result of results) {
      const id = result.metadata.chunk_id || result.id;
      if (!seen.has(id)) {
        seen.add(id);
        unique.push(result);
      }
    }

    // Sort by score
    return unique.sort((a, b) => b.score - a.score);
  }
}

module.exports = RAGChain;

n8n Workflow for Document Ingestion

Automated Document Processing

{
  "name": "Document Ingestion Pipeline",
  "nodes": [
    {
      "name": "Webhook - New Document",
      "type": "n8n-nodes-base.webhook",
      "parameters": {
        "path": "ingest-document",
        "responseMode": "responseNode",
        "options": {}
      }
    },
    {
      "name": "Download Document",
      "type": "n8n-nodes-base.httpRequest",
      "parameters": {
        "url": "={{ $json.document_url }}",
        "responseFormat": "text"
      }
    },
    {
      "name": "Detect Document Type",
      "type": "n8n-nodes-base.function",
      "parameters": {
        "functionCode": "const url = $json.document_url;\nconst extension = url.split('.').pop().toLowerCase();\n\nreturn {\n  content: $json.body,\n  type: extension,\n  metadata: {\n    source: $json.source || 'upload',\n    document_id: $json.document_id,\n    timestamp: new Date().toISOString()\n  }\n};"
      }
    },
    {
      "name": "Chunk Document",
      "type": "n8n-nodes-base.code",
      "parameters": {
        "language": "javascript",
        "jsCode": "const DocumentChunker = require('./document-chunker');\n\nconst chunker = new DocumentChunker({\n  chunkSize: 1000,\n  chunkOverlap: 200\n});\n\nconst content = $input.item.json.content;\nconst type = $input.item.json.type;\nconst metadata = $input.item.json.metadata;\n\nlet chunks;\nif (type === 'md' || type === 'markdown') {\n  chunks = chunker.chunkMarkdown(content, metadata);\n} else if (['js', 'py', 'java'].includes(type)) {\n  chunks = chunker.chunkCode(content, type, metadata);\n} else {\n  chunks = chunker.chunk(content, metadata);\n}\n\nreturn chunks;"
      }
    },
    {
      "name": "Generate Embeddings",
      "type": "n8n-nodes-base.openAi",
      "parameters": {
        "operation": "embeddings",
        "model": "text-embedding-3-small",
        "text": "={{ $json.text }}"
      }
    },
    {
      "name": "Upsert to Pinecone",
      "type": "n8n-nodes-base.httpRequest",
      "parameters": {
        "method": "POST",
        "url": "https://{{ $env.PINECONE_INDEX }}-{{ $env.PINECONE_ENV }}.svc.pinecone.io/vectors/upsert",
        "authentication": "genericCredentialType",
        "headers": {
          "Api-Key": "={{ $env.PINECONE_API_KEY }}",
          "Content-Type": "application/json"
        },
        "body": {
          "vectors": [
            {
              "id": "={{ $json.metadata.document_id }}-chunk-{{ $json.metadata.chunk_index }}",
              "values": "={{ $json.embedding }}",
              "metadata": "={{ $json.metadata }}"
            }
          ]
        }
      }
    },
    {
      "name": "Send Success Response",
      "type": "n8n-nodes-base.respondToWebhook",
      "parameters": {
        "respondWith": "json",
        "responseBody": "={{ { success: true, chunks_processed: $items().length } }}"
      }
    }
  ]
}
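
Once the workflow is active, any service can trigger ingestion by POSTing to the webhook. A hedged example call is shown below; the n8n base URL is a placeholder for your own instance, and the payload fields match what the workflow above reads (`document_url`, `document_id`, `source`).

// trigger-ingestion.js - example call to the ingestion webhook (base URL is a placeholder)
async function triggerIngestion(documentUrl, documentId) {
  const response = await fetch('https://your-n8n-instance.example.com/webhook/ingest-document', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      document_url: documentUrl,
      document_id: documentId,
      source: 'upload',
    }),
  });
  return await response.json(); // e.g. { success: true, chunks_processed: 12 }
}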

Advanced RAG Techniques

Contextual Compression

Reduce context size while preserving relevance:

class ContextualCompressor {
  constructor(llm) {
    this.llm = llm;
  }

  async compressContext(question, documents) {
    const compressionPrompt = `
Given a question and a document, extract only the parts that are relevant to answering the question.

Question: ${question}

Document:
{document}

Relevant excerpts (preserve exact quotes):`;

    const compressed = [];

    for (const doc of documents) {
      const prompt = compressionPrompt.replace('{document}', doc.metadata.text);
      const response = await this.llm.invoke(prompt);

      if (response.content.trim().length > 0) {
        compressed.push({
          ...doc,
          compressed_text: response.content.trim(),
        });
      }
    }

    return compressed;
  }
}
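
One way to use the compressor is between retrieval and prompt assembly. The sketch below assumes a `RAGChain` instance named `rag` from earlier and that the compressed excerpt replaces the raw chunk text; it is an illustrative wiring, not part of the chain itself.

// compression-wiring.js - illustrative: slot compression between retrieval and generation
async function queryWithCompression(rag, question) {
  const compressor = new ContextualCompressor(rag.llm);

  // Retrieve a slightly larger candidate set, then compress each chunk
  const embedding = await rag.embeddings.createEmbedding(question);
  const results = await rag.pinecone.hybridSearch(embedding, {}, { topK: 8 });
  const compressed = await compressor.compressContext(question, results);

  // Build the prompt from the compressed excerpts rather than the full chunks
  const context = compressed
    .map((doc, i) => `[Source ${i + 1}]\n${doc.compressed_text}`)
    .join('\n\n---\n\n');

  const response = await rag.llm.invoke(await rag.promptTemplate.format({ context, question }));
  return response.content;
}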

Hypothetical Document Embeddings (HyDE)

Improve retrieval by generating hypothetical answers:

class HyDERetriever {
  constructor(llm, pineconeService, embeddingService) {
    this.llm = llm;
    this.pinecone = pineconeService;
    this.embeddings = embeddingService;
  }

  async retrieve(question, topK = 5) {
    // Generate hypothetical answer
    const hydePrompt = `
Write a detailed answer to the following question. Make it specific and factual.

Question: ${question}

Answer:`;

    const hypotheticalAnswer = await this.llm.invoke(hydePrompt);

    // Embed the hypothetical answer
    const embedding = await this.embeddings.createEmbedding(hypotheticalAnswer.content);

    // Search using hypothetical answer embedding
    const results = await this.pinecone.query(embedding, { topK });

    return results;
  }
}

Parent Document Retrieval

Retrieve small chunks but provide larger context:

class ParentDocumentRetriever {
  constructor(pineconeService) {
    this.pinecone = pineconeService;
  }

  async retrieve(embedding, options = {}) {
    const { topK = 5 } = options;

    // Search for small chunks
    const childResults = await this.pinecone.query(embedding, { topK: topK * 2 });

    // Get parent documents
    const parentDocs = new Map();

    for (const result of childResults) {
      const parentId = result.metadata.parent_id;

      if (!parentDocs.has(parentId)) {
        // Fetch full parent document
        const parent = await this.pinecone.fetch([parentId]);
        parentDocs.set(parentId, {
          ...parent,
          max_score: result.score,
        });
      } else {
        // Update score if this chunk is more relevant
        const existing = parentDocs.get(parentId);
        if (result.score > existing.max_score) {
          existing.max_score = result.score;
        }
      }
    }

    // Sort by best child score
    return Array.from(parentDocs.values())
      .sort((a, b) => b.max_score - a.max_score)
      .slice(0, topK);
  }
}
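
Note that this retriever assumes each child chunk carries a `parent_id` in its metadata and that a record for the parent itself was upserted under that ID. A minimal ingestion-side sketch of that convention; `documentId`, `fullText`, and `chunker` are placeholders from the earlier examples.

// Illustrative: tag child chunks with a parent_id so they can be grouped later
const parentId = `doc-${documentId}`;

const childChunks = chunker.chunk(fullText, {
  document_id: documentId,
  parent_id: parentId, // ParentDocumentRetriever groups and fetches by this ID
});
// The parent document itself must also be stored under `parentId`
// so that pinecone.fetch([parentId]) can return it.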

Evaluation and Monitoring

RAG Metrics

class RAGEvaluator {
  constructor(llm) {
    this.llm = llm;
  }

  async evaluateAnswer(question, answer, groundTruth, retrievedDocs) {
    const metrics = {};

    // 1. Answer Relevancy
    metrics.relevancy = await this.evaluateRelevancy(question, answer);

    // 2. Answer Correctness (if ground truth available)
    if (groundTruth) {
      metrics.correctness = await this.evaluateCorrectness(answer, groundTruth);
    }

    // 3. Faithfulness (answer grounded in retrieved docs)
    metrics.faithfulness = await this.evaluateFaithfulness(answer, retrievedDocs);

    // 4. Context Relevancy
    metrics.contextRelevancy = await this.evaluateContextRelevancy(question, retrievedDocs);

    return metrics;
  }

  async evaluateRelevancy(question, answer) {
    const prompt = `
Rate how well the answer addresses the question on a scale of 0-1.

Question: ${question}
Answer: ${answer}

Rating (0-1):`;

    const response = await this.llm.invoke(prompt);
    return parseFloat(response.content.trim());
  }

  async evaluateFaithfulness(answer, documents) {
    const context = documents.map(d => d.metadata.text).join('\n\n');

    const prompt = `
Rate how well the answer is supported by the context on a scale of 0-1.
1.0 means fully supported, 0.0 means not supported at all.

Context:
${context}

Answer: ${answer}

Rating (0-1):`;

    const response = await this.llm.invoke(prompt);
    return parseFloat(response.content.trim());
  }

  async evaluateCorrectness(answer, groundTruth) {
    const prompt = `
Rate how correct the answer is compared to the ground truth on a scale of 0-1.

Ground Truth: ${groundTruth}
Answer: ${answer}

Rating (0-1):`;

    const response = await this.llm.invoke(prompt);
    return parseFloat(response.content.trim());
  }

  async evaluateContextRelevancy(question, documents) {
    const relevantCount = documents.filter(d => d.score > 0.7).length;
    return relevantCount / documents.length;
  }
}
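
A hedged usage sketch, meant for offline runs against a small test set; the `rag` and `evaluator` instances are assumed to be constructed as in the sections above, and the ground truth may be omitted when no reference answer exists.

// evaluate-sample.js - illustrative offline evaluation of a single query
async function evaluateQuery(rag, evaluator, question, groundTruth = null) {
  const started = Date.now();

  // Retrieve the same documents the chain would see, then generate an answer
  const embedding = await rag.embeddings.createEmbedding(question);
  const retrievedDocs = await rag.pinecone.query(embedding, { topK: 5 });
  const response = await rag.query(question, { topK: 5 });

  const metrics = await evaluator.evaluateAnswer(
    question,
    response.answer,
    groundTruth,
    retrievedDocs
  );

  return { ...metrics, confidence: response.confidence, latency_ms: Date.now() - started };
}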

Monitoring RAG Performance

// Track RAG metrics
const trackRAGMetrics = async (query, response, duration) => {
  await metrics.recordQuery({
    query,
    num_results: response.searchResults,
    avg_confidence: response.confidence,
    duration_ms: duration,
    timestamp: new Date(),
  });

  // Alert on low confidence
  if (response.confidence < 0.5) {
    await sendAlert({
      type: 'low_confidence_answer',
      query,
      confidence: response.confidence,
    });
  }
};

Production Best Practices

1. Caching

class RAGCache {
  constructor(redis) {
    this.redis = redis;
    this.ttl = 3600; // 1 hour
  }

  async get(question) {
    const cached = await this.redis.get(`rag:${this.hashQuestion(question)}`);
    return cached ? JSON.parse(cached) : null;
  }

  async set(question, response) {
    await this.redis.setex(
      `rag:${this.hashQuestion(question)}`,
      this.ttl,
      JSON.stringify(response)
    );
  }

  hashQuestion(question) {
    const crypto = require('crypto');
    return crypto.createHash('md5').update(question.toLowerCase()).digest('hex');
  }
}

2. Rate Limiting

const rateLimit = require('express-rate-limit');

const ragLimiter = rateLimit({
  windowMs: 60 * 1000, // 1 minute
  max: 20, // 20 requests per minute
  message: 'Too many queries, please try again later',
});

app.post('/api/rag/query', ragLimiter, async (req, res) => {
  // Handle RAG query
});
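
Putting the pieces together, the handler body might look like the sketch below. It assumes a `RAGChain` instance named `rag` and a `RAGCache` instance named `cache` (constructed as in the sections above and the Complete Example below), reuses `trackRAGMetrics` from the monitoring section, and keeps error handling minimal.

// Illustrative handler body: check the cache, fall back to the RAG chain
app.post('/api/rag/query', ragLimiter, async (req, res) => {
  try {
    const { question } = req.body;

    // Serve cached answers when available
    const cached = await cache.get(question);
    if (cached) {
      return res.json({ ...cached, cached: true });
    }

    const start = Date.now();
    const response = await rag.query(question, { topK: 5, includeReferences: true });

    await cache.set(question, response);
    await trackRAGMetrics(question, response, Date.now() - start);

    res.json(response);
  } catch (error) {
    res.status(500).json({ error: 'Failed to answer query' });
  }
});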

3. Cost Optimization

// Track costs
const trackCosts = {
  embedding: numTokens => (numTokens * 0.00002) / 1000, // $0.02/1M tokens
  llm: (inputTokens, outputTokens) => {
    return (inputTokens * 0.01) / 1000 + (outputTokens * 0.03) / 1000;
  },
  pinecone: numQueries => numQueries * 0.000004, // Approximate
};

// Optimize by caching and batch processing
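
As a rough per-query estimate, the helpers above can be combined as in the sketch below; the token counts are character-based approximations (as in the embedding service), not exact tokenizer output, and `estimateQueryCost` is our own illustrative helper.

// Illustrative per-query cost estimate using the rough helpers above
function estimateQueryCost({ question, context, answer }) {
  const approxTokens = text => Math.ceil(text.length / 4);

  const embeddingCost = trackCosts.embedding(approxTokens(question));
  const llmCost = trackCosts.llm(
    approxTokens(context) + approxTokens(question), // input tokens
    approxTokens(answer)                            // output tokens
  );
  const retrievalCost = trackCosts.pinecone(1);

  return embeddingCost + llmCost + retrievalCost;
}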

Complete Example

// main.js - Complete RAG implementation
const DocumentChunker = require('./document-chunker');
const EmbeddingService = require('./embedding-service');
const PineconeService = require('./pinecone-service');
const RAGChain = require('./rag-chain');

// Initialize services
const embeddings = new EmbeddingService(process.env.OPENAI_API_KEY);
const pinecone = new PineconeService(
  process.env.PINECONE_API_KEY,
  process.env.PINECONE_ENV,
  'knowledge-base'
);
const rag = new RAGChain(pinecone, embeddings);

// Example: Ingest document
async function ingestDocument(text, metadata) {
  const chunker = new DocumentChunker();
  const chunks = chunker.chunk(text, metadata);

  const texts = chunks.map(c => c.text);
  const embeddingVectors = await embeddings.createEmbeddingsBatch(texts);

  await pinecone.upsertChunks(chunks, embeddingVectors);

  console.log(`Ingested ${chunks.length} chunks`);
}

// Example: Query
async function query(question) {
  const response = await rag.query(question, {
    topK: 5,
    includeReferences: true,
  });

  console.log('Answer:', response.answer);
  console.log('Confidence:', response.confidence);
  console.log('References:', response.references);

  return response;
}

// Run examples
(async () => {
  // Ingest
  await ingestDocument('Your document text here...', { source: 'docs', category: 'technical' });

  // Query
  await query('What is the main topic discussed?');
})();

Join the Community

Building production RAG systems requires expertise in AI, databases, and system architecture. The House of Loops community brings together AI engineers and developers building advanced RAG applications.

Join us to:

  • Share RAG implementations and best practices
  • Get feedback on your RAG architecture
  • Access production-ready RAG templates
  • Participate in AI/LLM workshops
  • Connect with developers building similar systems

Join House of Loops Today and get $100K+ in startup credits including OpenAI and Pinecone credits to build your RAG system.


Building a RAG system? Our community has AI engineers ready to help optimize your implementation!


House of Loops Team

House of Loops is a technology-focused community for learning and implementing advanced automation workflows using n8n, Strapi, AI/LLM, and DevSecOps tools.
