Building a RAG Pipeline with Spring AI and pgvector (No Python Required)

Posted on January 27, 2026 by Govind · AI Engineering · Updated January 28, 2026 · 15 min read

Building a RAG Pipeline with Spring AI and pgvector

The “Python Tax” is officially repealed.

Table of Contents

  • Building a RAG Pipeline with Spring AI and pgvector
  • 1. The Architecture: Keep It Single-Stack
  • 2. Setting Up the Foundation
    • Dependencies (Gradle)
    • Database Schema (The Search Index)
  • 3. The Ingestion Pipeline (ETL)
    • The Document Reader
  • 4. The Retrieval & Generation (The “Chat”)
  • 5. Advanced Techniques for 2026
    • Metadata Filtering (Hybrid Search)
  • Re-Ranking (The Precision Booster)
  • 6. Comparison: Spring AI vs. LangChain4j
  • 7. Why “No Python” Matters for Enterprises
  • Conclusion
  • FAQ

For too long, the ‘AI Engineering’ world has been gatekept by Python. If you wanted to build a RAG (Retrieval Augmented Generation) pipeline, you had to spin up a FastAPI service, manage a fragile `requirements.txt`, and bridge it to your robust Java backend via REST. It was brittle, operationally complex, and frankly, unnecessary.

As of 2026, with the maturity of Spring AI 1.0+ and the widespread adoption of PostgreSQL pgvector, Java developers can now build end-to-end, production-grade GenAI applications without writing a single line of Python. This guide is your blueprint.

Why this Stack? (The “Boring” Stack)
  • Spring AI: Provides a portable API across OpenAI, Bedrock, and Gemini. It handles the “glue” code securely.
  • pgvector: Turns your existing Postgres instance into a Vector Database. No new vendors, no new contracts.
  • Java 23+: With Virtual Threads, Java allows for highly concurrent ingestion pipelines that smoke Python’s async loops.

1. The Architecture: Keep It Single-Stack

In the Python-centric world, a RAG architecture typically involves a mess of microservices. In the Spring world, we collapse this complexity.

[Document Source]
    ⬇ (ETL)
[Spring Batch / Spring Integration]
    ⬇ (EmbeddingModel)
[PostgreSQL (pgvector)]
    ⬇ (Vector Search)
[Spring AI ChatClient] ➡ [LLM (GPT-4 / Claude)]

Notice what’s missing: Vector DB Glue Code. Because we are using Postgres, our transactional data (e.g., “Is this user a premium member?”) lives right next to our vector data. We can join them in a single SQL query. That is a superpower specialized Vector DBs generally lack.
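To make that concrete, here is a hedged sketch of such a join. The `users` table and its `is_premium` flag are illustrative, not part of the Spring AI schema, and it assumes each chunk was stored with a `userId` metadata key at ingestion time:

-- Illustrative only: combine transactional data and vector search in one statement
SELECT vs.content
FROM vector_store vs
JOIN users u ON u.id = (vs.metadata->>'userId')::uuid
WHERE u.is_premium = true
ORDER BY vs.embedding <=> $1  -- $1 = the query embedding; <=> is pgvector's cosine distance
LIMIT 5;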

2. Setting Up the Foundation

Dependencies (Gradle)

First, let’s pull in the Spring AI BOM and the pgvector starter. Note that in 2026, we are using the `1.0.0` (or newer) release train.

dependencies {
    // The core starters (1.0 GA artifact names; the older *-spring-boot-starter IDs were renamed)
    implementation 'org.springframework.ai:spring-ai-starter-model-openai'
    implementation 'org.springframework.ai:spring-ai-starter-vector-store-pgvector'

    // Tika reader and QuestionAnswerAdvisor (used in the ingestion and chat examples below)
    implementation 'org.springframework.ai:spring-ai-tika-document-reader'
    implementation 'org.springframework.ai:spring-ai-advisors-vector-store'
    
    // For robust ETL processing
    implementation 'org.springframework.boot:spring-boot-starter-batch'
    
    // Postgres driver
    implementation 'org.postgresql:postgresql'
    implementation 'org.springframework.boot:spring-boot-starter-jdbc'
}

dependencyManagement {
    imports {
        mavenBom "org.springframework.ai:spring-ai-bom:1.0.0"
    }
}

Database Schema (The Search Index)

You don’t need a complex migration script. Spring AI can auto-initialize the schema, but as Senior Engineers, we prefer explicit control. Enable the extension and create the HNSW index for speed.

-- Enable the extensions (run once)
CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS "uuid-ossp"; -- provides uuid_generate_v4() used below

-- The standard Spring AI table structure
CREATE TABLE IF NOT EXISTS vector_store (
    id uuid DEFAULT uuid_generate_v4() PRIMARY KEY,
    content text,
    metadata json,
    embedding vector(1536) -- OpenAI uses 1536 dimensions
);

-- CRITICAL: Create an HNSW index for performance
-- Without this, queries will be full table scans (slow!)
CREATE INDEX ON vector_store USING hnsw (embedding vector_cosine_ops);

Performance Tip: Avoid `IVFFlat` (Inverted File Flat) unless your dataset is static. In 2026, `HNSW` (Hierarchical Navigable Small World) is the gold standard for dynamic datasets with frequent updates, offering a better recall/performance trade-off.
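If you would rather let Spring AI manage the schema, the store is driven by configuration properties. A minimal sketch, assuming the `spring.ai.vectorstore.pgvector.*` property names from the 1.0 documentation (verify against your exact release):

spring:
  ai:
    vectorstore:
      pgvector:
        initialize-schema: true        # create the table and index on startup
        index-type: HNSW               # matches the manual index above
        distance-type: COSINE_DISTANCE
        dimensions: 1536               # must match your embedding model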

3. The Ingestion Pipeline (ETL)

A RAG system is only as good as its data. “Garbage In, Garbage Out.” We need to Chunk, Embed, and Store.

The Document Reader

Spring AI provides `DocumentReader` interfaces for PDF, JSON, and Text. Here is a robust service that ingests a document:

import java.util.List;

import org.springframework.ai.document.Document;
import org.springframework.ai.reader.tika.TikaDocumentReader;
import org.springframework.ai.transformer.splitter.TokenTextSplitter;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.core.io.Resource;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class IngestionService {

    private final VectorStore vectorStore;
    private final TokenTextSplitter textSplitter;

    public IngestionService(VectorStore vectorStore) {
        this.vectorStore = vectorStore;
        // Split by tokens (better for LLM context windows).
        // Args: chunk size (tokens), min chunk chars, min length to embed, max chunks, keep separators.
        this.textSplitter = new TokenTextSplitter(800, 350, 5, 10000, true);
    }

    @Transactional
    public void ingestFile(Resource file) {
        // 1. Read
        TikaDocumentReader loader = new TikaDocumentReader(file);
        List<Document> documents = loader.get();

        // 2. Transform (Chunking)
        // This is crucial. Sending a 50-page PDF to an embedding model fails.
        // We break it into context-window-sized semantic chunks.
        List<Document> splitDocuments = textSplitter.apply(documents);

        // 3. Load (Embed & Persist)
        // Spring AI handles the call to the OpenAI embedding API
        // and the SQL INSERT/UPSERT behind the scenes.
        vectorStore.add(splitDocuments);
    }
}
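To exercise the service locally, you can wire a one-shot runner that ingests a document at startup. A minimal sketch; the classpath location `docs/handbook.pdf` is a made-up example:

import org.springframework.beans.factory.annotation.Value;
import org.springframework.boot.ApplicationRunner;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.Resource;

@Configuration
public class IngestionStartup {

    // Hypothetical sample document; point this at your own corpus.
    @Value("classpath:docs/handbook.pdf")
    private Resource sampleDoc;

    @Bean
    ApplicationRunner ingestOnStartup(IngestionService ingestionService) {
        return args -> ingestionService.ingestFile(sampleDoc);
    }
}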

4. The Retrieval & Generation (The “Chat”)

Now for the fun part. In Spring AI 1.0, the `ChatClient` has evolved into a fluent, highly testable API. We will use the `QuestionAnswerAdvisor` pattern to handle the RAG logic automatically.

import java.util.Map;

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.chat.client.advisor.vectorstore.QuestionAnswerAdvisor;
import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
@RequestMapping("/api/chat")
public class RagController {

    private final ChatClient chatClient;

    public RagController(ChatClient.Builder builder, VectorStore vectorStore) {
        // We configure the internal RAG logic here (builder-style APIs per the 1.0 GA release)
        this.chatClient = builder
                .defaultAdvisors(QuestionAnswerAdvisor.builder(vectorStore)
                        .searchRequest(SearchRequest.builder()
                                .topK(5)                  // retrieve the top 5 most similar chunks
                                .similarityThreshold(0.7) // filter out noise
                                .build())
                        .build())
                .build();
    }

    @PostMapping
    public Map<String, String> chat(@RequestBody String userQuery) {
        // The framework automatically:
        // 1. Vectorizes the 'userQuery'
        // 2. Queries pgvector for context
        // 3. Stuffs context into the prompt
        // 4. Calls the LLM
        String response = chatClient.prompt()
                .user(userQuery)
                .call()
                .content();

        return Map.of("response", response);
    }
}

That’s it. A few dozen lines for a full RAG endpoint. No LangChain spaghetti. No separate Python service.

5. Advanced Techniques for 2026

Basic RAG is easy. Production RAG is hard. Here is how to handle the edge cases.

Metadata Filtering (Hybrid Search)

Pure semantic search is often imprecise. If a user asks “What were my earnings in 2024?”, a vector search might return earnings from 2023 because they look “semantically similar.”

We solve this with Metadata Filtering. This is where pgvector shines—it combines JSONB filtering with vector search.

FilterExpressionBuilder b = new FilterExpressionBuilder();
// Create a filter: ONLY search documents belonging to this user AND year 2024.
// These keys must have been written into each chunk's metadata at ingestion time.
Filter.Expression filter = b.and(
        b.eq("userId", currentUser.getId()),
        b.eq("year", 2024)
).build();

List<Document> results = vectorStore.similaritySearch(
        SearchRequest.builder()
                .query(userQuery)
                .filterExpression(filter)
                .build()
);

This is implemented as a standard SQL `WHERE` clause on the `metadata` column in Postgres (cast to `jsonb` for filtering), so filtering and vector ranking happen in a single query. It is incredibly fast.
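Conceptually, the query looks something like this (illustrative only, not the exact SQL the store emits):

SELECT content, metadata
FROM vector_store
WHERE metadata::jsonb->>'userId' = '42'
  AND (metadata::jsonb->>'year')::int = 2024
ORDER BY embedding <=> $1  -- $1 = the query embedding
LIMIT 5;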

Re-Ranking (The Precision Booster)

Sometimes vector search retrieves “related” but irrelevant documents. In 2026, it is standard practice to add a Re-ranking step. You retrieve 20 documents from Postgres, and then pass them through a specialized Cross-Encoder model (like Cohere Rerank) to sort them by true relevance.

Spring AI supports this via `DocumentRetriever` chains, allowing you to plug in a re-ranker transparently.
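The retrieve-then-re-rank shape looks roughly like this. Note that `rerankClient` is a made-up wrapper around a cross-encoder service such as Cohere Rerank; substitute whatever client your re-ranker actually provides:

// Over-fetch from pgvector, then let a cross-encoder pick the true top 5.
List<Document> candidates = vectorStore.similaritySearch(
        SearchRequest.builder()
                .query(userQuery)
                .topK(20) // deliberately retrieve more than we need
                .build());

// rerankClient (hypothetical) scores each (query, document) pair and
// returns the candidates sorted by true relevance.
List<Document> context = rerankClient.rerank(userQuery, candidates)
        .stream()
        .limit(5)
        .toList();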

6. Comparison: Spring AI vs. LangChain4j

Feature        | Spring AI                                       | LangChain4j
Philosophy     | Spring-Native, Opinionated, Integration-heavy   | Framework-agnostic, Agent-heavy, Cutting-edge
Configuration  | Standard `application.yml` properties           | More programmatic builder patterns
Agent Support  | Growing (Function Calling), but simpler         | First-class citizen (Autonomous Agents)

Verdict: If you are building a transactional Enterprise App where AI is a feature (e.g., a “Co-pilot” for a dashboard), use Spring AI. It fits your lifecycle. If you are building a pure AI Agent that runs autonomously, LangChain4j might offer more flexibility.

7. Why “No Python” Matters for Enterprises

It is not just about language preference. It is about Operational Homogeneity.

  1. Unified Security: You use the same Spring Security context, OIDC/OAuth2 flows, and Vault secrets for your AI logic as you do for your banking logic.
  2. Single CI/CD Pipeline: One Jenkins/GitHub Actions pipeline builds a single JAR. No juggling Docker images to pin Python versions.
  3. Type Safety: Java Records map JSON responses to strong types. No more `KeyError` at runtime because an LLM hallucinated a JSON field name.
  4. Thread Management: Virtual Threads in Java 21+ let you handle thousands of concurrent LLM requests (which are I/O-bound) with a minimal memory footprint, far outperforming standard Python deployments without complex tuning (see the sketch below).
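To make point 4 concrete, here is a minimal sketch of fanning embedding calls out over virtual threads. `embedOne` is a stand-in for any blocking, I/O-bound call (an embedding request, an LLM call):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Each task blocks on network I/O, but blocking is cheap on virtual threads:
// thousands of in-flight calls cost only a few KB of stack each.
List<float[]> embedAll(List<String> chunks) throws Exception {
    try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
        List<Future<float[]>> futures = chunks.stream()
                .map(chunk -> executor.submit(() -> embedOne(chunk))) // embedOne: your blocking call
                .toList();
        List<float[]> embeddings = new ArrayList<>();
        for (Future<float[]> future : futures) {
            embeddings.add(future.get());
        }
        return embeddings;
    }
}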

Conclusion

The days of needing a separate “AI Team” writing Python scripts in a silo are over. With Spring AI and pgvector, AI Engineering is now just… Software Engineering.

You have the database (Postgres). You have the runtime (JVM). You have the framework (Spring). You have everything you need to build the next generation of intelligent applications today.

FAQ

What about local LLMs (Ollama)?

Spring AI supports Ollama out of the box. You simply change a property `spring.ai.ollama.base-url` and the `ChatClient` implementation switches transparently from OpenAI to your local Llama 3 instance. This is perfect for local dev loops.
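For example, with the Ollama starter on the classpath (property names per the Ollama starter documentation; the model name is whatever you have pulled locally):

spring:
  ai:
    ollama:
      base-url: http://localhost:11434  # default Ollama endpoint
      chat:
        options:
          model: llama3                 # any locally pulled model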

Does pgvector scale to 100M+ vectors?

For massive scale (100M+), dedicated vector DBs (Milvus, Weaviate) or specialized indexing (DiskANN) might be necessary. However, for 99% of corporate use cases (Support Docs, Wikis, User History), data fits comfortably within Postgres pgvector limits (10M-50M vectors is very doable with HNSW).

How do I handle updates to documents?

This is the hard part of RAG. You need an “Upsert” logic. In the `IngestionService` above, you would ideally checksum the file content before embedding. If the checksum hasn’t changed, skip re-embedding (saving money). If it has, delete old chunks by `documentId` and re-insert new ones.
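A hedged sketch of that flow, building on the `IngestionService` above. `ChecksumRepository` is a made-up store (a small JDBC table keyed by document ID works fine), and it assumes a `documentId` metadata key was written at ingestion time:

import java.security.MessageDigest;
import java.util.HexFormat;

public void ingestIfChanged(String documentId, Resource file) throws Exception {
    byte[] bytes = file.getContentAsByteArray();
    String checksum = HexFormat.of().formatHex(
            MessageDigest.getInstance("SHA-256").digest(bytes));

    // checksumRepository is hypothetical: any table mapping documentId -> hash works.
    if (checksum.equals(checksumRepository.findChecksum(documentId))) {
        return; // unchanged: skip re-embedding and save the API spend
    }

    // Changed: delete the stale chunks for this document, then re-ingest.
    // Delete-by-filter requires Spring AI 1.0+.
    vectorStore.delete(new FilterExpressionBuilder()
            .eq("documentId", documentId)
            .build());
    ingestFile(file);

    checksumRepository.save(documentId, checksum);
}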

Govind

For over 15 years, I have worked as a hands-on Java Architect and Senior Engineer, specializing in building and scaling high-performance, enterprise-level applications. My career has focused primarily on the FinTech, Telecommunications, and E-commerce sectors, where I've led teams designing systems that handle millions of transactions per day.

Check out my profile here: https://simplifiedlearningblog.com/author/
