AI features are easy to demo and hard to ship. The gap is mostly plumbing — vector storage, observability, cost ceilings, prompt management — not model choice. Here's the lean stack we use to get from prototype to production in a couple of weeks.
The stack

- OpenAI or Anthropic for the LLM calls
- Supabase Postgres + pgvector for embeddings and chat history
- n8n for back-office automations (or Vercel cron + Edge Functions if simpler)
- Vercel AI SDK for streaming, tool calls, and structured output
- PostHog or Helicone for usage + cost tracking
Total monthly cost for a small product with a few hundred daily AI requests: under $50, often under $20.
RAG, properly
If your AI feature answers questions from your own knowledge base, you need retrieval. The mistake we see: people store embeddings in a separate vector DB and end up with two source-of-truth problems. Use pgvector — it's a column type on the same Supabase Postgres.
-- Schema for a simple RAG store on Supabase
create extension if not exists vector;

create table documents (
  id uuid primary key default gen_random_uuid(),
  content text not null,
  metadata jsonb,
  embedding vector(1536) -- OpenAI text-embedding-3-small
);

-- Approximate cosine-distance index. ivfflat clusters the rows that already
-- exist, so it works best when created after the table has some data.
create index documents_embedding_idx
  on documents using ivfflat (embedding vector_cosine_ops)
  with (lists = 100);

Then a similarity-search function the API can call:
create or replace function match_documents(
  query_embedding vector(1536),
  match_count int default 5
)
returns table (id uuid, content text, similarity float)
language sql stable as $$
  select id, content,
    1 - (embedding <=> query_embedding) as similarity
  from documents
  order by embedding <=> query_embedding
  limit match_count;
$$;

Guardrails that actually pay off
In rough priority order:
- Per-user rate limits at the route level (Upstash, 5 req/min works for most)
- Hard daily cost ceiling per workspace — pause new requests if breached, page yourself
- Output validation — Zod schemas on every structured response
- Refusal grounding for RAG — instruct the model to answer 'I don't know' if context is empty
- Prompt logging that excludes PII (or excludes prompts entirely for sensitive products)
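
Two of these guardrails are small enough to sketch in plain TypeScript. This is a minimal illustration, not a definitive implementation: `WorkspaceBudget`, `checkBudget`, and `buildGroundedPrompt` are hypothetical names, the spend figure is assumed to be tracked elsewhere (for example in Postgres), and the exact prompt wording is ours to tune.

```typescript
// Hypothetical shapes for illustration; none of this comes from a library.
interface WorkspaceBudget {
  dailyCeilingUsd: number; // hard daily ceiling for the workspace
  spentTodayUsd: number; // running total, assumed to be reset daily elsewhere
}

type BudgetDecision =
  | { allowed: true; remainingUsd: number }
  | { allowed: false; reason: string };

// Hard cost ceiling: refuse the request if it would push spend past the limit.
function checkBudget(
  budget: WorkspaceBudget,
  estimatedCostUsd: number
): BudgetDecision {
  const projected = budget.spentTodayUsd + estimatedCostUsd;
  if (projected > budget.dailyCeilingUsd) {
    return { allowed: false, reason: "daily cost ceiling reached" };
  }
  return { allowed: true, remainingUsd: budget.dailyCeilingUsd - projected };
}

const NO_CONTEXT_ANSWER = "I don't know";

// Refusal grounding: the prompt tells the model what to say when retrieval
// returned nothing or the context does not contain the answer.
function buildGroundedPrompt(question: string, chunks: string[]): string {
  const context =
    chunks.length > 0
      ? chunks.map((chunk, i) => `[${i + 1}] ${chunk}`).join("\n")
      : "(empty)";
  return [
    "Answer using only the context below.",
    `If the context is empty or does not contain the answer, reply exactly: ${NO_CONTEXT_ANSWER}`,
    "",
    "Context:",
    context,
    "",
    `Question: ${question}`,
  ].join("\n");
}
```

In a route handler, a blocked `checkBudget` result is where you would return an error to the caller and trigger the page to yourself.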
Evals, not vibes
Track model quality the same way you'd track a regular API: with tests. Build a small set of input → expected-output pairs early, run them on every prompt change, and grade with a cheaper model. Don't ship prompt changes that drop your eval score, no matter how much better they 'feel'.
Vibes-based prompt engineering is the prompt equivalent of testing by bug bash: it works once and then quietly degrades.
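
A minimal sketch of that loop, assuming the model outputs for each case have already been collected. `EvalCase`, `scoreOutputs`, and `shouldShip` are hypothetical names, and the default grader here is a plain normalised exact match standing in for the cheaper grading model.

```typescript
// One input → expected-output pair in the eval set.
interface EvalCase {
  input: string;
  expected: string;
}

// A grader decides whether an actual output is acceptable for a case.
// Assumption: swap this for a call to a cheaper model in a real setup.
type Grader = (expected: string, actual: string) => boolean;

const exactMatch: Grader = (expected, actual) =>
  expected.trim().toLowerCase() === actual.trim().toLowerCase();

// Returns the fraction of cases that pass, between 0 and 1.
function scoreOutputs(
  cases: EvalCase[],
  outputs: string[],
  grade: Grader = exactMatch
): number {
  if (cases.length === 0 || cases.length !== outputs.length) {
    throw new Error("need one output per eval case, and at least one case");
  }
  const passed = cases.filter((c, i) => grade(c.expected, outputs[i] ?? "")).length;
  return passed / cases.length;
}

// Gate for prompt changes: never ship a candidate that scores below baseline.
function shouldShip(baselineScore: number, candidateScore: number): boolean {
  return candidateScore >= baselineScore;
}
```

Run `scoreOutputs` once for the current prompt and once for the candidate, then let `shouldShip` decide in CI rather than in a review thread.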
When to skip AI altogether
If a regex, a SQL query, or a five-line algorithm could do the job — do that. AI features add cost, latency, error surface, and review burden. They're worth it for tasks that genuinely need fuzzy understanding (summarisation, freeform Q&A, classification of messy text). They're not worth it for problems with a clean deterministic answer.
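
As a small illustration of the deterministic route, here is a sketch that pulls order IDs out of a support message with a regex instead of a model call. The `ORD-` plus six digits format and the `extractOrderIds` name are made up for the example.

```typescript
// Hypothetical ID format: "ORD-" followed by exactly six digits.
const ORDER_ID = /\bORD-\d{6}\b/g;

// Returns every order ID found in the text, or an empty array if none.
function extractOrderIds(text: string): string[] {
  return text.match(ORDER_ID) ?? [];
}
```

No tokens, no latency budget, and the failure mode is an empty array you can test for.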

