Build a RAG chatbot with LangChain, OpenAI embeddings, and Neon PostgreSQL. pgvector, no Pinecone, full Python code, 30 minutes.
Most RAG tutorials use Pinecone or Chroma. Both require a separate vector database. A separate bill. A separate set of credentials. A separate thing that can break at 2 AM.
I'm going to show you how to skip that entirely and use PostgreSQL — the database you probably already have.
I run two production apps on Neon PostgreSQL: ismatsamadov.com and birjob.com. When I needed to add RAG to a side project, I didn't spin up a Pinecone cluster. I added a vector column to my existing Postgres database. It took about five minutes. The whole chatbot was working in under 30.
This tutorial walks you through the exact same setup. By the end, you'll have a working RAG chatbot that answers questions about your own documents using LangChain, OpenAI embeddings, and Neon PostgreSQL with pgvector. No Pinecone. No Chroma. No separate vector DB.
Here's the architecture in plain English:
1. Your documents are split into chunks, and each chunk is embedded with OpenAI's text-embedding-3-small model.
2. The embeddings are stored in a vector column in Neon PostgreSQL.
3. When a user asks a question, the question is embedded with the same model.
4. pgvector finds the most similar chunks with a similarity search.
5. Those chunks go into the prompt as context, and the LLM generates the answer.

That's it. No magic. No complicated orchestration. Just embeddings, SQL, and a language model.
The stack:

- LangChain for orchestration
- OpenAI text-embedding-3-small for embeddings and GPT-4o for generation
- Neon PostgreSQL with pgvector as the vector store
I need to make the case before we write code, because most people default to Pinecone without thinking about it.
You probably already run Postgres. If you have a web app, there's a good chance you have a PostgreSQL database somewhere. Adding a vector column to an existing table is a one-line migration. Adding Pinecone is a new service, new SDK, new billing page, new API keys, new failure mode.
The performance is actually better. This surprised me too. According to benchmarks from Timescale, pgvector combined with pgvectorscale delivers 28x lower p95 latency than Pinecone's storage-optimized index and 16x higher query throughput. At 75% lower cost. Those aren't theoretical numbers — they're from real workload comparisons.
ACID compliance matters. Your vectors and your relational data live in the same transaction. When you delete a document, you delete its embeddings in the same commit. No sync jobs. No eventual consistency headaches. No "the document was deleted but the vectors are still there" bugs.
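As a concrete sketch of what that buys you (assuming your chunks carry a `source` key in their metadata, as the LangChain loaders write): removing a document and its embeddings is one statement in one transaction, because they live in the same table.

```sql
-- Hypothetical cleanup: chunk text and embedding vectors are columns
-- of the same row, so a single DELETE removes both atomically.
BEGIN;
DELETE FROM documents WHERE metadata->>'source' = './docs/old_manual.txt';
COMMIT;
```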
It scales far enough for most projects. pgvector handles up to ~10 million vectors comfortably with HNSW indexes. If you're working with 100M+ vectors, sure, you might need a dedicated vector database. But most RAG projects have thousands to low millions of documents. PostgreSQL handles that without breaking a sweat.
Here's how the options compare:
| Feature | pgvector (Neon) | Pinecone | Chroma |
|---|---|---|---|
| Cost (10K vectors) | ~$0/mo (free tier) | $70/mo (starter) | Free (self-hosted) |
| Managed | Yes | Yes | No (unless cloud) |
| ACID transactions | Yes | No | No |
| Latency (p95) | ~5ms | ~140ms | Varies |
| Max vectors | ~10M (comfortable) | Billions | Millions |
| Separate service | No | Yes | Yes (usually) |
| SQL queries | Yes | No | No |
The RAG market hit $2.33 billion in 2025 and is projected to reach $3.33 billion by 2026, growing to $9.86 billion by 2030 at a 38.4% CAGR. 80% of enterprise developers now say RAG is the most effective way to ground LLMs in factual data. This isn't a niche pattern anymore. It's the default architecture for knowledge-grounded AI.
But here's what those market reports don't tell you: most of that growth is going to happen on infrastructure people already have. Adding a vector column to Postgres is a much easier sell to your ops team than "we need a new managed vector database service."
Before we start, you need:

- A Neon account (the free tier is enough)
- An OpenAI API key
- Python 3.9 or newer, with pip
That's it. No Docker. No Kubernetes. No vector database account.
First, create a Neon project if you don't have one. The free tier is enough.
pgvector comes pre-installed on Neon. You don't need to compile anything. You don't need to install anything on the server. Just run one SQL command:
CREATE EXTENSION IF NOT EXISTS vector;
Now create the table that will store your document chunks and their embeddings:
CREATE TABLE documents (
id BIGSERIAL PRIMARY KEY,
content TEXT NOT NULL,
metadata JSONB DEFAULT '{}',
embedding VECTOR(1536)
);
The VECTOR(1536) type is from pgvector. 1536 is the dimension count for OpenAI's text-embedding-3-small model. If you use a different embedding model, change this number to match.
Now add an HNSW index. This is important — without it, every similarity search does a full table scan:
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
HNSW (Hierarchical Navigable Small World) is the faster index type. pgvector also supports IVFFlat, but HNSW gives better recall at query time without needing to tune the number of lists. Use HNSW unless you have a specific reason not to.
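For reference, here's roughly what a similarity search looks like in raw SQL once the index exists. This is a sketch: `:query_embedding` stands in for a 1536-dimension vector your application supplies. `<=>` is pgvector's cosine-distance operator, which the `vector_cosine_ops` index above accelerates.

```sql
SELECT id, content, 1 - (embedding <=> :query_embedding) AS similarity
FROM documents
ORDER BY embedding <=> :query_embedding  -- ascending distance = closest first
LIMIT 4;
```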
pip install langchain langchain-openai langchain-community langchain-postgres psycopg2-binary python-dotenv
Here's what each package does:
- langchain — core framework (v1.2.14 as of March 2026)
- langchain-openai — OpenAI integration for embeddings and chat models
- langchain-community — community integrations, including document loaders
- langchain-postgres — PostgreSQL vector store integration
- psycopg2-binary — PostgreSQL driver
- python-dotenv — loads environment variables from .env files

Create a .env file in your project root:
OPENAI_API_KEY=sk-your-key-here
DATABASE_URL=postgresql://user:pass@ep-something.us-east-1.aws.neon.tech/neondb?sslmode=require
RAG is only as good as your chunks. Too big and you feed the LLM irrelevant context. Too small and you lose meaning. I've found 500-1,000 characters with a 100-character overlap works well for most text.
import os
from dotenv import load_dotenv
from langchain_community.document_loaders import TextLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
load_dotenv()
# Load all .txt files from a directory
loader = DirectoryLoader(
"./docs",
glob="**/*.txt",
loader_cls=TextLoader
)
documents = loader.load()
print(f"Loaded {len(documents)} documents")
# Split into chunks
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=800,
chunk_overlap=100,
length_function=len,
separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = text_splitter.split_documents(documents)
print(f"Split into {len(chunks)} chunks")
RecursiveCharacterTextSplitter is my go-to. It tries to split on paragraph breaks first, then sentences, then words. This keeps semantic units together better than a simple character-count split.
If you're loading PDFs instead of text files, swap TextLoader for PyPDFLoader:
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("./docs/manual.pdf")
documents = loader.load()
This is where the magic happens — or rather, where the straightforward math happens. Each chunk gets turned into a 1536-dimensional vector using OpenAI's embedding model, then stored directly in your Neon database.
from langchain_openai import OpenAIEmbeddings
from langchain_postgres import PGVector
# Initialize embeddings
embeddings = OpenAIEmbeddings(
model="text-embedding-3-small",
openai_api_key=os.getenv("OPENAI_API_KEY")
)
# Connection string for Neon
connection_string = os.getenv("DATABASE_URL")
# Create vector store and add documents
vector_store = PGVector.from_documents(
documents=chunks,
embedding=embeddings,
connection=connection_string,
collection_name="my_documents",
pre_delete_collection=False,
)
print(f"Stored {len(chunks)} chunks in Neon PostgreSQL")
Let's talk cost. OpenAI's text-embedding-3-small costs $0.02 per 1 million tokens. If you're indexing 10,000 documents averaging 500 tokens each, that's 5 million tokens total. Cost: $0.10. Ten cents. You could index your entire company's knowledge base for the price of a gumball.
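The arithmetic behind that claim is simple enough to sanity-check. This snippet just reproduces the back-of-envelope math above, using OpenAI's published price of $0.02 per 1M tokens for text-embedding-3-small; the document count and average length are the example's.

```python
# $0.02 per 1M tokens for text-embedding-3-small (OpenAI's listed price).
PRICE_PER_MILLION_TOKENS = 0.02

def embedding_cost_usd(num_docs: int, avg_tokens_per_doc: int) -> float:
    """Estimated one-time cost to embed a corpus."""
    total_tokens = num_docs * avg_tokens_per_doc
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS

print(embedding_cost_usd(10_000, 500))  # 5M tokens -> 0.1 (ten cents)
```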
The PGVector.from_documents call does three things: generates embeddings via the OpenAI API, connects to your Neon database, and inserts each chunk with its embedding vector. If the table doesn't exist yet, it creates one automatically (though I prefer creating it manually with the HNSW index as shown in Step 1).
Now we connect the retriever (pgvector search) to the language model (GPT-4o) using LangChain's expression language:
from langchain_openai import ChatOpenAI
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
# Initialize the LLM
llm = ChatOpenAI(
model="gpt-4o",
temperature=0.2,
openai_api_key=os.getenv("OPENAI_API_KEY")
)
# Create retriever from the vector store
retriever = vector_store.as_retriever(
search_type="similarity",
search_kwargs={"k": 4}
)
# Define the prompt
system_prompt = """You are a helpful assistant that answers questions based on
the provided context. If you don't know the answer based on the context,
say "I don't have enough information to answer that."
Context:
{context}"""
prompt = ChatPromptTemplate.from_messages([
("system", system_prompt),
("human", "{input}")
])
# Build the chain
question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)
Here's what happens when a user asks a question:
1. The question is embedded using the same text-embedding-3-small model.
2. pgvector returns the 4 most similar chunks (the k=4 parameter).
3. create_stuff_documents_chain stuffs those chunks into the {context} slot of the prompt.
4. GPT-4o generates an answer grounded in that context.

The temperature=0.2 keeps answers grounded and factual. Higher temperatures make the model more creative, which is the opposite of what you want in a RAG chatbot.
Let's wrap this in a terminal chat interface:
def chat():
print("RAG Chatbot ready. Type 'quit' to exit.\n")
while True:
question = input("You: ").strip()
if question.lower() in ("quit", "exit", "q"):
print("Goodbye!")
break
if not question:
continue
response = rag_chain.invoke({"input": question})
print(f"\nAssistant: {response['answer']}\n")
# Optionally show sources
if response.get("context"):
print("Sources:")
for i, doc in enumerate(response["context"], 1):
source = doc.metadata.get("source", "unknown")
print(f" {i}. {source}")
print()
if __name__ == "__main__":
chat()
That's it. Run it, ask questions about your documents, get answers with sources. No fancy UI. No web framework. Just a terminal loop that works.
Here's everything combined into a single script you can copy-paste and run. Save this as rag_chatbot.py:
import os
from dotenv import load_dotenv
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_postgres import PGVector
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
load_dotenv()
# --- Configuration ---
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
DATABASE_URL = os.getenv("DATABASE_URL")
DOCS_DIR = "./docs"
COLLECTION_NAME = "my_documents"
# --- Step 1: Load documents ---
print("Loading documents...")
loader = DirectoryLoader(
DOCS_DIR,
glob="**/*.txt",
loader_cls=TextLoader
)
documents = loader.load()
print(f"Loaded {len(documents)} documents")
# --- Step 2: Split into chunks ---
print("Splitting into chunks...")
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=800,
chunk_overlap=100,
length_function=len,
separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = text_splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks")
# --- Step 3: Create embeddings and store in Neon ---
print("Generating embeddings and storing in Neon PostgreSQL...")
embeddings = OpenAIEmbeddings(
model="text-embedding-3-small",
openai_api_key=OPENAI_API_KEY
)
vector_store = PGVector.from_documents(
documents=chunks,
embedding=embeddings,
connection=DATABASE_URL,
collection_name=COLLECTION_NAME,
pre_delete_collection=False,
)
print(f"Stored {len(chunks)} chunks with embeddings")
# --- Step 4: Set up the RAG chain ---
print("Setting up RAG chain...")
llm = ChatOpenAI(
model="gpt-4o",
temperature=0.2,
openai_api_key=OPENAI_API_KEY
)
retriever = vector_store.as_retriever(
search_type="similarity",
search_kwargs={"k": 4}
)
system_prompt = """You are a helpful assistant that answers questions based on
the provided context. If you don't know the answer based on the context,
say "I don't have enough information to answer that."
Context:
{context}"""
prompt = ChatPromptTemplate.from_messages([
("system", system_prompt),
("human", "{input}")
])
question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)
# --- Step 5: Chat loop ---
def chat():
print("\nRAG Chatbot ready. Type 'quit' to exit.\n")
while True:
question = input("You: ").strip()
if question.lower() in ("quit", "exit", "q"):
print("Goodbye!")
break
if not question:
continue
response = rag_chain.invoke({"input": question})
print(f"\nAssistant: {response['answer']}\n")
if response.get("context"):
print("Sources:")
for i, doc in enumerate(response["context"], 1):
source = doc.metadata.get("source", "unknown")
print(f" {i}. {source}")
print()
if __name__ == "__main__":
chat()
Before running, make sure you:
1. Have a docs/ folder with some .txt files in it
2. Have a .env file with your OPENAI_API_KEY and DATABASE_URL

Then:
python rag_chatbot.py
I've built a few of these now. Here are the mistakes I've made so you don't have to.
Wrong embedding dimensions. If you create a VECTOR(1536) column but use an embedding model that outputs 768 dimensions, every insert will fail. Match the column size to your model. text-embedding-3-small = 1536. text-embedding-3-large = 3072. Check before you create the table.
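One way to catch this early is a startup guard. A minimal sketch: the model-to-dimension table below matches OpenAI's documented output sizes, and VECTOR_COLUMN_DIMS is a hypothetical config value you'd keep in sync with your CREATE TABLE.

```python
# Documented output sizes for OpenAI embedding models.
EMBEDDING_DIMS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "text-embedding-ada-002": 1536,
}
VECTOR_COLUMN_DIMS = 1536  # must match VECTOR(n) in your CREATE TABLE

def assert_dims_match(model: str) -> None:
    """Fail fast at startup instead of on the first INSERT."""
    dims = EMBEDDING_DIMS[model]
    if dims != VECTOR_COLUMN_DIMS:
        raise ValueError(
            f"{model} outputs {dims}-dim vectors but the table "
            f"column is VECTOR({VECTOR_COLUMN_DIMS})"
        )

assert_dims_match("text-embedding-3-small")  # passes silently
```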
Missing HNSW index. Without the index, pgvector falls back to exact nearest-neighbor search, which means a sequential scan of every row. Fine for 1,000 vectors. Unusable at 100,000. Always create the HNSW index. It's one SQL statement.
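You can verify the index is actually being used with a quick EXPLAIN (a diagnostic sketch; `:query_embedding` is a placeholder for a real vector):

```sql
EXPLAIN ANALYZE
SELECT id FROM documents
ORDER BY embedding <=> :query_embedding
LIMIT 4;
-- You want an "Index Scan using ..." on the HNSW index in the plan,
-- not a "Seq Scan on documents".
```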
Chunk size too large. If your chunks are 4,000 characters each, you're feeding huge blocks of text to the retriever. The LLM gets a wall of text and has to find the relevant part itself. Keep chunks between 500-1000 characters. Smaller is usually better than larger.
Chunk size too small. On the flip side, if you chunk at 100 characters, you lose all context. A sentence fragment doesn't carry enough meaning for accurate retrieval. Find the middle ground. 800 characters with 100 overlap is my default.
Neon free tier limits. The free tier gives you 0.5 GB of storage. With 1536-dimensional float32 vectors, each embedding takes about 6 KB. That means you can store roughly 80,000 vectors before hitting the limit. For most projects, that's more than enough. If you need more, the Launch plan at $19/month gives you 10 GB.
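The capacity math checks out with a quick calculation: 1536 float32 values per vector against a 0.5 GB budget. This counts raw vector bytes only; per-row overhead (id, content, metadata, the index itself) is why ~80,000 is the safer planning number.

```python
# Rough capacity of Neon's 0.5 GB free tier for 1536-dim float32 vectors.
DIMS = 1536
BYTES_PER_FLOAT32 = 4

bytes_per_vector = DIMS * BYTES_PER_FLOAT32   # 6144 bytes, about 6 KB
free_tier_bytes = 0.5 * 1024**3               # 0.5 GB
max_vectors = int(free_tier_bytes // bytes_per_vector)
print(bytes_per_vector, max_vectors)  # 6144 87381 (raw; ~80k after overhead)
```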
Forgetting to handle connection pooling. Neon uses a connection pooler by default. If you're running multiple concurrent requests, make sure your connection string uses the pooled endpoint (it has -pooler in the hostname). The direct connection has a limit of around 20 concurrent connections on the free tier.
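For illustration, the pooled endpoint is the same hostname with "-pooler" appended to the endpoint ID (the first hostname label). The helper below is hypothetical, not part of any Neon SDK; in practice you'd just copy the pooled connection string from the Neon console.

```python
# Hypothetical helper: derive the pooled endpoint from a direct
# Neon connection string by suffixing the endpoint ID with "-pooler".
from urllib.parse import urlparse

def to_pooled(database_url: str) -> str:
    host = urlparse(database_url).hostname
    endpoint_id, rest = host.split(".", 1)
    return database_url.replace(host, f"{endpoint_id}-pooler.{rest}")

url = "postgresql://user:pass@ep-cool-name-123.us-east-1.aws.neon.tech/neondb?sslmode=require"
print(to_pooled(url))
```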
Not setting sslmode=require. Neon requires SSL. If your connection string doesn't include sslmode=require, you'll get cryptic connection errors. This one has wasted more of my time than I care to admit.
Here's my strong take, and I'll own it: pgvector is the right choice for 90% of RAG projects. Maybe 95%.
Most teams add Pinecone because they read a tutorial that used Pinecone. Or because some influencer on Twitter said vector databases are the future. Or because their AI engineer wanted to try something new. None of those are good reasons.
If you already run PostgreSQL — and you almost certainly do — adding a vector column is a 5-minute job. You run CREATE EXTENSION vector, add a column, create an index. Done. Your vectors live next to your data. Same backups. Same monitoring. Same credentials. Same failure domain.
A separate vector database is a separate bill, a separate failure mode, and a separate thing to keep in sync. Every time someone deletes a user from your main database, you need to remember to delete their vectors from Pinecone too. Or you build a sync job. Or you accept that your vector store will drift out of sync with reality. None of those options are good.
"But pgvector can't scale to billions of vectors!" True. Pinecone and purpose-built vector databases handle massive scale better. But how many vectors do you actually have? I've worked on production RAG systems with 50,000 documents. That's 50,000 vectors. pgvector doesn't even notice. Even at a million vectors with HNSW, query times stay under 10ms.
The companies that genuinely need Pinecone-level scale are building products like Notion AI or Perplexity, where they're indexing the entire web. You're probably not doing that. You're probably indexing your company's docs, your support tickets, or your product catalog. PostgreSQL handles that without any drama.
Neon makes this even easier because pgvector is pre-installed — you don't need to compile anything or manage extensions. And if you want to go further, Neon also offers pgrag, an extension that lets you build entire RAG pipelines in SQL. No Python required.
The RAG market is projected to hit $9.86 billion by 2030. Large enterprises already make up 72.2% of that market. These companies have PostgreSQL everywhere. They're not going to rip out their existing database infrastructure to use a startup's vector database. They're going to add pgvector. And so should you.