Building a RAG Pipeline with Spring AI and pgvector
The “Python Tax” is officially repealed.
For too long, the ‘AI Engineering’ world has been gatekept by Python. If you wanted to build a RAG (Retrieval Augmented Generation) pipeline, you had to spin up a FastAPI service, manage a fragile `requirements.txt`, and bridge it to your robust Java backend via REST. It was brittle, operationally complex, and frankly, unnecessary.
As of 2026, with the maturity of Spring AI 1.0+ and the widespread adoption of PostgreSQL pgvector, Java developers can now build end-to-end, production-grade GenAI applications without writing a single line of Python. This guide is your blueprint.
- Spring AI: Provides a portable API across OpenAI, Bedrock, and Gemini, and handles the integration “glue” (clients, auth, request/response mapping) for you.
- pgvector: Turns your existing Postgres instance into a Vector Database. No new vendors, no new contracts.
- Java 21+: Virtual Threads enable highly concurrent, I/O-bound ingestion pipelines that comfortably outpace Python’s async event loop.
1. The Architecture: Keep It Single-Stack
In the Python-centric world, a RAG architecture typically involves a mess of microservices. In the Spring world, we collapse this complexity.
Notice what’s missing: Vector DB Glue Code. Because we are using Postgres, our transactional data (e.g., “Is this user a premium member?”) lives right next to our vector data. We can join them in a single SQL query. That is a superpower specialized Vector DBs generally lack.
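As a sketch of that join superpower: assume a hypothetical users table with a premium flag, and chunks tagged with a userId key in their metadata (neither is part of Spring AI’s schema; both are illustrative). One query filters by membership and ranks by similarity:
-- Sketch only: 'users' and the metadata keys are illustrative
SELECT vs.content
FROM vector_store vs
JOIN users u ON u.id = (vs.metadata->>'userId')::uuid
WHERE u.premium = true
ORDER BY vs.embedding <=> $1  -- '<=>' is pgvector's cosine distance operator
LIMIT 5;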
2. Setting Up the Foundation
Dependencies (Gradle)
First, let’s pull in the Spring AI BOM and the pgvector starter. Note that in 2026, we are using the `1.0.0` (or newer) release train.
dependencies {
    // The core model and vector-store starters (artifact IDs were renamed for the 1.0 release train)
    implementation 'org.springframework.ai:spring-ai-starter-model-openai'
    implementation 'org.springframework.ai:spring-ai-starter-vector-store-pgvector'
    // QuestionAnswerAdvisor used in the RAG controller below
    implementation 'org.springframework.ai:spring-ai-advisors-vector-store'
    // Tika-based DocumentReader used in the ingestion pipeline
    implementation 'org.springframework.ai:spring-ai-tika-document-reader'
    // For robust ETL processing
    implementation 'org.springframework.boot:spring-boot-starter-batch'
    // Postgres driver
    implementation 'org.postgresql:postgresql'
    implementation 'org.springframework.boot:spring-boot-starter-jdbc'
}
dependencyManagement {
imports {
mavenBom "org.springframework.ai:spring-ai-bom:1.0.0"
}
}
Database Schema (The Search Index)
You don’t need a complex migration script. Spring AI can auto-initialize the schema, but as senior engineers, we prefer explicit control. Enable the extensions and create an HNSW index for speed.
-- Enable the extensions (run once)
CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS "uuid-ossp"; -- provides uuid_generate_v4() used below
-- The standard Spring AI table structure
CREATE TABLE IF NOT EXISTS vector_store (
id uuid DEFAULT uuid_generate_v4() PRIMARY KEY,
content text,
metadata json,
embedding vector(1536) -- OpenAI uses 1536 dimensions
);
-- CRITICAL: Create an HNSW index for performance
-- Without this, queries will be full table scans (slow!)
CREATE INDEX ON vector_store USING hnsw (embedding vector_cosine_ops);
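If you would rather let Spring AI manage the schema, the pgvector starter exposes properties for exactly this. A minimal application.yml sketch matching the DDL above:
spring:
  ai:
    vectorstore:
      pgvector:
        initialize-schema: false     # we own the DDL above; set true to let Spring AI create it
        index-type: HNSW
        distance-type: COSINE_DISTANCE
        dimensions: 1536             # must match the embedding model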
3. The Ingestion Pipeline (ETL)
A RAG system is only as good as its data. “Garbage In, Garbage Out.” We need to Chunk, Embed, and Store.
The Document Reader
Spring AI provides `DocumentReader` interfaces for PDF, JSON, and Text. Here is a robust service that ingests a document:
import java.util.List;

import org.springframework.ai.document.Document;
import org.springframework.ai.reader.tika.TikaDocumentReader;
import org.springframework.ai.transformer.splitter.TokenTextSplitter;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.core.io.Resource;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class IngestionService {

    private final VectorStore vectorStore;
    private final TokenTextSplitter textSplitter;

    public IngestionService(VectorStore vectorStore) {
        this.vectorStore = vectorStore;
        // Split by tokens (better for LLM context windows).
        // Args: chunk size (tokens), min chunk size (chars),
        // min chunk length to embed, max chunks, keep separators
        this.textSplitter = new TokenTextSplitter(800, 350, 5, 10000, true);
    }

    @Transactional
    public void ingestFile(Resource file) {
        // 1. Read
        TikaDocumentReader loader = new TikaDocumentReader(file);
        List<Document> documents = loader.get();

        // 2. Transform (Chunking)
        // This is crucial. Sending a 50-page PDF to an embedding model fails.
        // We break it into context-window-sized semantic chunks.
        List<Document> splitDocuments = textSplitter.apply(documents);

        // 3. Load (Embed & Persist)
        // Spring AI handles the call to the OpenAI embedding API
        // and the SQL INSERT behind the scenes.
        vectorStore.add(splitDocuments);
    }
}
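To wire this up, here is a minimal sketch that ingests a classpath document on startup. The file name is illustrative; in production you would trigger ingestion from an upload endpoint or a Spring Batch job instead:
import org.springframework.beans.factory.annotation.Value;
import org.springframework.boot.CommandLineRunner;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.Resource;

@Configuration
class IngestionConfig {

    @Bean
    CommandLineRunner ingestOnStartup(IngestionService ingestionService,
            @Value("classpath:docs/handbook.pdf") Resource handbook) {
        // Runs once at application boot.
        return args -> ingestionService.ingestFile(handbook);
    }
}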
4. The Retrieval & Generation (The “Chat”)
Now for the fun part. In Spring AI 1.0, the `ChatClient` has evolved into a fluent, highly testable API. We will use the `QuestionAnswerAdvisor` pattern to handle the RAG logic automatically.
import java.util.Map;

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.chat.client.advisor.vectorstore.QuestionAnswerAdvisor;
import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
@RequestMapping("/api/chat")
public class RagController {

    private final ChatClient chatClient;

    public RagController(ChatClient.Builder builder, VectorStore vectorStore) {
        // We configure the internal RAG logic here (1.0 builder-style API)
        this.chatClient = builder
                .defaultAdvisors(QuestionAnswerAdvisor.builder(vectorStore)
                        .searchRequest(SearchRequest.builder()
                                .topK(5)                  // Retrieve the 5 most similar chunks
                                .similarityThreshold(0.7) // Filter out noise
                                .build())
                        .build())
                .build();
    }

    @PostMapping
    public Map<String, String> chat(@RequestBody String userQuery) {
        // The framework automatically:
        // 1. Vectorizes the 'userQuery'
        // 2. Queries pgvector for context
        // 3. Stuffs the context into the prompt
        // 4. Calls the LLM
        String response = chatClient.prompt()
                .user(userQuery)
                .call()
                .content();
        return Map.of("response", response);
    }
}
That’s it. A few dozen lines of code for a full RAG endpoint. No LangChain spaghetti. No separate Python service.
5. Advanced Techniques for 2026
Basic RAG is easy. Production RAG is hard. Here is how to handle the edge cases.
Metadata Filtering (Hybrid Search)
Pure semantic search is often imprecise. If a user asks “What were my earnings in 2024?”, a vector search might return earnings from 2023 because they look “semantically similar.”
We solve this with Metadata Filtering. This is where pgvector shines: it combines JSON metadata filtering with vector search in a single query.
FilterExpressionBuilder b = new FilterExpressionBuilder();

// Create a filter: ONLY search documents belonging to this user AND year 2024
Filter.Expression filter = b.and(
        b.eq("userId", currentUser.getId()),
        b.eq("year", 2024)
).build();

List<Document> results = vectorStore.similaritySearch(
        SearchRequest.builder()
                .query(userQuery)
                .filterExpression(filter)
                .build()
);
Under the hood, this compiles to a standard SQL `WHERE` clause on the `metadata` column in Postgres, evaluated in the same query as the vector distance ranking, so there is no second round trip.
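For those filter keys to exist at query time, they have to be attached during ingestion. A minimal sketch inside ingestFile, just before the vectorStore.add(...) call (ownerId is an assumed parameter; the keys are illustrative and must match your filters):
// Tag every chunk with filterable metadata before persisting.
// 'ownerId' is assumed to be passed into ingestFile.
for (Document doc : splitDocuments) {
    doc.getMetadata().put("userId", ownerId);
    doc.getMetadata().put("year", 2024);
}
vectorStore.add(splitDocuments);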
Re-Ranking (The Precision Booster)
Sometimes vector search retrieves “related” but irrelevant documents. In 2026, it is standard practice to add a Re-ranking step. You retrieve 20 documents from Postgres, and then pass them through a specialized Cross-Encoder model (like Cohere Rerank) to sort them by true relevance.
Spring AI’s modular RAG components (such as `DocumentRetriever` chains and post-retrieval processing) let you plug a re-ranker in transparently.
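A minimal sketch of the over-fetch-then-re-rank shape, assuming a hypothetical rerankClient that wraps a cross-encoder API (Spring AI does not ship one; this is the integration point you implement):
// Over-fetch 20 candidates from pgvector, then let a cross-encoder re-order them.
List<Document> candidates = vectorStore.similaritySearch(
        SearchRequest.builder().query(userQuery).topK(20).build());

// Hypothetical client: scores each (query, document) pair by true relevance.
List<Document> reranked = rerankClient.rerank(userQuery, candidates);

// Keep only the best 5 for the prompt context.
List<Document> topDocuments = reranked.subList(0, Math.min(5, reranked.size()));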
6. Comparison: Spring AI vs. LangChain4j
| Feature | Spring AI | LangChain4j |
|---|---|---|
| Philosophy | Spring-Native, Opinionated, Integration-heavy | Framework-agnostic, Agent-heavy, Cutting-edge |
| Configuration | Standard `application.yml` properties | More programmatic builder patterns |
| Agent Support | Growing (Function Calling), but simpler | First-class citizen (Autonomous Agents) |
Verdict: If you are building a transactional Enterprise App where AI is a feature (e.g., a “Co-pilot” for a dashboard), use Spring AI. It fits your lifecycle. If you are building a pure AI Agent that runs autonomously, LangChain4j might offer more flexibility.
7. Why “No Python” Matters for Enterprises
It is not just about language preference. It is about Operational Homogeneity.
- Unified Security: You use the same Spring Security context, OIDC/OAuth2 flows, and Vault secrets for your AI logic as you do for your banking logic.
- Single CI/CD Pipeline: One Jenkins/GitHub Actions pipeline builds a single JAR. No second container image pinned to a specific Python version.
- Type Safety: Java Records map JSON responses to strong types. No more `KeyError` at runtime because an LLM hallucinated a JSON field name.
- Thread Management: Virtual Threads in Java 21+ allow you to handle thousands of concurrent LLM requests (which are I/O bound) with a minimal memory footprint, far outperforming standard Python deployments without complex optimizations. Enabling them is a one-line property, shown below.
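For reference, enabling virtual threads for request handling in Spring Boot 3.2+ is a single property:
spring:
  threads:
    virtual:
      enabled: true   # Tomcat request handling and task executors use virtual threads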
Conclusion
The days of needing a separate “AI Team” writing Python scripts in a silo are over. With Spring AI and pgvector, AI Engineering is now just… Software Engineering.
You have the database (Postgres). You have the runtime (JVM). You have the framework (Spring). You have everything you need to build the next generation of intelligent applications today.
FAQ
Can I swap OpenAI for a local model?
Spring AI supports Ollama out of the box. Swap in the Ollama starter, point `spring.ai.ollama.base-url` at your local instance, and the `ChatClient` abstraction stays exactly the same, switching transparently from OpenAI to your local Llama 3 instance. This is perfect for local dev loops.
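A sketch of that local-dev setup, assuming the Ollama starter (spring-ai-starter-model-ollama) is on the classpath and a model has been pulled locally:
spring:
  ai:
    ollama:
      base-url: http://localhost:11434   # Ollama's default port
      chat:
        options:
          model: llama3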
Does pgvector scale, or do I need a dedicated vector database?
For massive scale (100M+ vectors), dedicated vector DBs (Milvus, Weaviate) or specialized indexing (DiskANN) might be necessary. However, for the vast majority of corporate use cases (support docs, wikis, user history), the data fits comfortably within pgvector limits; 10M-50M vectors is very doable with HNSW.
How do I keep the index in sync when documents change?
This is the hard part of RAG. You need “upsert” logic. In the `IngestionService` above, you would ideally checksum the file content before embedding. If the checksum hasn’t changed, skip re-embedding (saving money). If it has, delete the old chunks by `documentId` and re-insert the new ones.
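A minimal sketch of that guard, assuming a checksums store you maintain yourself (a small JDBC table or map), chunks tagged with a documentId key in their metadata, and a VectorStore that supports delete-by-filter (available in recent Spring AI releases):
// Sketch: 'checksums' is an assumed lookup you maintain (e.g., a small JDBC table).
public void upsertFile(String documentId, Resource file) throws IOException {
    String checksum = DigestUtils.sha256Hex(file.getInputStream()); // commons-codec

    if (checksum.equals(checksums.get(documentId))) {
        return; // content unchanged: skip the (paid) embedding call entirely
    }

    // Delete stale chunks for this document, then re-chunk and re-embed.
    FilterExpressionBuilder b = new FilterExpressionBuilder();
    vectorStore.delete(b.eq("documentId", documentId).build());

    ingestFile(file);
    checksums.put(documentId, checksum);
}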

For over 15 years, I have worked as a hands-on Java Architect and Senior Engineer, specializing in building and scaling high-performance, enterprise-level applications. My career has focused primarily on the FinTech, Telecommunications, and E-commerce sectors, where I’ve led teams in designing systems that handle millions of transactions per day.
Check out my profile here: https://simplifiedlearningblog.com/author/