Published on March 6, 2026 • 14 min read
Over the past couple of years I've integrated AI into production systems across three stacks: .NET Core backends, Python/FastAPI services, and Next.js frontends. The patterns that survive contact with production look quite different from the ones in tutorials. This article is a walkthrough of the architecture I actually use — diagrams first, commentary second.
Most examples show you how to call an API. POST /v1/chat/completions, get a response, done. That works fine until you need to swap providers mid-project, your primary provider goes down at 2 AM, or a client's bill triples because their usage pattern changed.
The architectures below are designed around one question: what does this look like in six months when requirements shift?
The backbone of most of my backend AI work is a provider-agnostic interface layer. The business logic never imports OpenAI or Anthropic directly — it talks to an interface, and the concrete implementation is wired up at startup.
graph TD
A[Business Logic / Use Cases] --> B[ILLMProvider Interface]
B --> C[OpenAI Adapter]
B --> D[Anthropic Adapter]
B --> E[Azure OpenAI Adapter]
B --> F[AWS Bedrock Adapter]
B --> G[Groq Adapter]
H[HTTP Request] --> I[Auth Middleware]
I --> J[Rate Limit Middleware]
J --> K[Cost Tracking Middleware]
K --> L[Request Router]
L --> B
B --> M[Response]
M --> N[Cost Logger]
N --> O[HTTP Response]
The middleware pipeline is where most of the operational value lives. Auth and rate limiting are obvious. Cost tracking is the one that saves projects — every request gets tagged with a cost estimate before it hits the provider, so you're never surprised by a monthly bill.
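To make the cost-tagging step concrete, here is a minimal sketch in Python (used for brevity, even though this layer is .NET). The prices, the four-characters-per-token heuristic, and the names `PRICE_PER_1K_TOKENS`, `CostEstimate`, and `estimate_cost` are all illustrative assumptions, not real provider pricing or an existing API.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per 1,000 tokens as (input, output).
# Real prices differ by provider and change often; load them from config.
PRICE_PER_1K_TOKENS = {
    "cheap-model": (0.0001, 0.0002),
    "premium-model": (0.005, 0.015),
}


@dataclass
class CostEstimate:
    model: str
    input_tokens: int
    max_output_tokens: int
    usd: float


def estimate_cost(model: str, prompt: str, max_output_tokens: int) -> CostEstimate:
    """Estimate the worst-case cost of a request before it is sent."""
    # Rough heuristic (an assumption): about 4 characters per token of English text.
    input_tokens = max(1, len(prompt) // 4)
    in_price, out_price = PRICE_PER_1K_TOKENS[model]
    # Worst case: the model uses its whole output allowance.
    usd = (input_tokens * in_price + max_output_tokens * out_price) / 1000
    return CostEstimate(model, input_tokens, max_output_tokens, usd)


estimate = estimate_cost("premium-model", "x" * 4000, max_output_tokens=500)
print(f"{estimate.input_tokens} tokens in, about ${estimate.usd:.4f}")
```

The estimate is deliberately pessimistic: it charges for the full output allowance, so the tagged figure is an upper bound rather than the final bill.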
The request router is what makes the whole thing interesting. It looks at the request metadata (task type, priority tier, current provider health) and picks which concrete adapter to use. A document summarization job during off-peak hours might route to Groq. The same job during a latency-sensitive user-facing flow routes to GPT-4o.
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public interface ILLMProvider
{
    Task<LLMResponse> CompleteAsync(LLMRequest request, CancellationToken ct = default);

    // Return the stream directly: wrapping IAsyncEnumerable in a Task
    // would force callers to await before they can start iterating.
    IAsyncEnumerable<string> StreamAsync(LLMRequest request, CancellationToken ct = default);

    ProviderCapabilities GetCapabilities();
}

public class LLMRequest
{
    public string Model { get; set; }
    public List<Message> Messages { get; set; }
    public int? MaxTokens { get; set; }
    public string TaskType { get; set; } // "summarization" | "extraction" | "reasoning"
    public ProviderTier RequiredTier { get; set; }
}
The TaskType and RequiredTier fields let the router make intelligent decisions without the business logic caring about providers at all.
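The same idea (business logic that depends only on an interface, with the concrete adapter chosen once at startup) can be sketched in a few lines of Python. The adapters here are stand-ins that echo their input; `EchoAdapterA`, `EchoAdapterB`, `ADAPTERS`, and `build_provider` are hypothetical names, and a real adapter would call the vendor's SDK inside `complete`.

```python
import asyncio
from dataclasses import dataclass
from typing import Protocol


@dataclass
class LLMResponse:
    content: str
    provider: str


class LLMProvider(Protocol):
    async def complete(self, prompt: str) -> LLMResponse: ...


class EchoAdapterA:
    """Stand-in for a vendor adapter; a real one would call the vendor SDK here."""

    async def complete(self, prompt: str) -> LLMResponse:
        return LLMResponse(content=f"[A] {prompt}", provider="vendor-a")


class EchoAdapterB:
    """Second stand-in adapter with the same shape."""

    async def complete(self, prompt: str) -> LLMResponse:
        return LLMResponse(content=f"[B] {prompt}", provider="vendor-b")


ADAPTERS = {"vendor-a": EchoAdapterA, "vendor-b": EchoAdapterB}


def build_provider(name: str) -> LLMProvider:
    """Wire up the concrete adapter once, at startup, from configuration."""
    return ADAPTERS[name]()


async def summarize(provider: LLMProvider, text: str) -> str:
    # Business logic only knows the LLMProvider protocol, never a vendor.
    response = await provider.complete(f"Summarize: {text}")
    return response.content


provider = build_provider("vendor-b")  # switching vendors is a config change
result = asyncio.run(summarize(provider, "quarterly report"))
print(result)  # [B] Summarize: quarterly report
```

Nothing in `summarize` changes when the configured name changes, which is the property the interface layer is there to protect.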
Python tends to be where I build the AI-heavy processing pipelines — document intelligence, OCR enhancement, structured extraction. The pattern separates the HTTP layer cleanly from the AI service layer.
graph TD
A[FastAPI Endpoints] --> B[AI Service Layer]
B --> C[OpenAI Adapter]
B --> D[Anthropic Adapter]
B --> E[Groq Adapter]
subgraph OCR Pipeline
F[File Upload Endpoint] --> G[File Validation]
G --> H[Tesseract OCR]
H --> I[Raw Text]
I --> J[LLM Enhancement]
J --> K[Structured Output Parser]
K --> L[Pydantic Model]
L --> M[Storage / Response]
end
B --> J
The OCR pipeline is the most interesting part. Tesseract gives you raw text, but raw OCR output from real-world documents is messy — inconsistent whitespace, merged words, missing punctuation. Running that through an LLM enhancement step before parsing dramatically improves extraction accuracy.
class AIService:
    # OutputParser is a placeholder name for whatever turns raw LLM text
    # into the validated Pydantic model.
    def __init__(self, provider_registry: ProviderRegistry, parser: OutputParser):
        self._registry = provider_registry
        self._parser = parser

    async def enhance_ocr_output(
        self,
        raw_text: str,
        document_type: DocumentType,
        task_config: TaskConfig,
    ) -> EnhancedOCRResult:
        # Pick a provider for this task, then ask it to clean up the OCR text.
        provider = self._registry.select(task_config)
        prompt = self._build_enhancement_prompt(raw_text, document_type)
        response = await provider.complete(prompt)
        return self._parser.parse(response.content, document_type)
The ProviderRegistry.select() call is the same cost-routing logic as the .NET layer, just in Python. Same concept, different runtime.
One thing worth calling out: the Pydantic model at the end of the pipeline is the contract between the AI service and everything downstream. It validates that the LLM actually returned structured data in the expected shape. If validation fails, the pipeline retries with a stricter prompt — not with a crash.
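The validate-then-retry loop might look roughly like this. To keep the sketch dependency-free, a dataclass plus `json` stands in for the Pydantic model; in the real pipeline Pydantic does the validation. `Invoice`, `parse_invoice`, `extract_with_retry`, and `fake_llm` are hypothetical names invented for the example.

```python
import json
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Invoice:
    invoice_number: str
    total: float


def parse_invoice(raw: str) -> Invoice:
    """Validate LLM output; raise ValueError if it is not the expected shape."""
    try:
        data = json.loads(raw)
        return Invoice(invoice_number=str(data["invoice_number"]), total=float(data["total"]))
    except (json.JSONDecodeError, KeyError, TypeError, ValueError) as exc:
        raise ValueError(f"invalid invoice payload: {exc}") from exc


def extract_with_retry(call_llm: Callable[[str], str], prompts: List[str]) -> Invoice:
    """Try each prompt in order (loosest first, strictest last) until one validates."""
    last_error: Optional[Exception] = None
    for prompt in prompts:
        try:
            return parse_invoice(call_llm(prompt))
        except ValueError as exc:
            last_error = exc  # remember why this attempt failed, then escalate
    raise RuntimeError("all prompts failed validation") from last_error


def fake_llm(prompt: str) -> str:
    # Simulates a model that only returns clean JSON when told to be strict.
    if "ONLY JSON" in prompt:
        return '{"invoice_number": "INV-7", "total": "129.50"}'
    return "Sure! The invoice number is INV-7."


invoice = extract_with_retry(
    fake_llm,
    ["Extract the invoice.", "Extract the invoice. Return ONLY JSON."],
)
print(invoice)  # Invoice(invoice_number='INV-7', total=129.5)
```

The retry list is bounded on purpose: if even the strictest prompt fails, the caller gets one explicit error instead of an unbounded loop of provider calls.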
On the Next.js side, I use the Vercel AI SDK for streaming and build the RAG pipeline as a server action or API route depending on what the app needs.
sequenceDiagram
participant User
participant Frontend
participant API Route
participant Embedder
participant VectorDB
participant LLM
User->>Frontend: Submit query
Frontend->>API Route: POST /api/chat (streaming)
API Route->>Embedder: embed(query)
Embedder-->>API Route: query_vector
API Route->>VectorDB: similarity_search(query_vector, k=5)
VectorDB-->>API Route: relevant_chunks[]
API Route->>LLM: complete(system + chunks + query)
LLM-->>API Route: stream tokens
API Route-->>Frontend: SSE stream
Frontend-->>User: Render tokens as they arrive
The streaming pattern is what makes AI-powered interfaces feel fast. Even if the full response takes 8 seconds, the user sees words appearing after ~400ms. That perceived latency difference matters enormously for UX.
// API route (simplified)
import { streamText } from 'ai';
import { openai } from '@ai-sdk/openai';

export async function POST(req: Request) {
  const { messages } = await req.json();
  const lastMessage = messages[messages.length - 1].content;

  // RAG: embed + retrieve
  // (embedder, vectorDB and buildSystemPrompt are app-specific helpers)
  const queryVector = await embedder.embed(lastMessage);
  const chunks = await vectorDB.search(queryVector, { k: 5 });
  const context = chunks.map(c => c.content).join('\n\n');

  // Stream response
  const result = await streamText({
    model: openai('gpt-4o-mini'),
    system: buildSystemPrompt(context),
    messages,
  });
  return result.toDataStreamResponse();
}
The model choice here (gpt-4o-mini) isn't arbitrary. For a RAG assistant where the heavy lifting is retrieval rather than reasoning, a smaller, faster model works well. The context does most of the work; the LLM just needs to synthesize it coherently.
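The retrieval step itself is conceptually small. Here is a brute-force sketch of "embed, rank by similarity, keep the top k" in Python. A real vector database uses an approximate index rather than a full scan, and the two-dimensional vectors are toy data; `cosine_similarity` and `top_k` are names made up for the sketch.

```python
import math
from typing import List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vector: List[float], chunks: List[Tuple[List[float], str]], k: int = 5) -> List[str]:
    """Return the texts of the k chunks most similar to the query, best first."""
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vector, chunk[0]),
        reverse=True,
    )
    return [text for _, text in ranked[:k]]


# Toy (embedding, text) pairs standing in for an indexed document store.
chunks = [
    ([1.0, 0.0], "refund policy"),
    ([0.0, 1.0], "shipping times"),
    ([0.9, 0.1], "return window"),
]

# Same join the API route uses to build the prompt context.
context = "\n\n".join(top_k([1.0, 0.0], chunks, k=2))
print(context)
```

With this toy data the query vector lines up with "refund policy" and nearly with "return window", so those two chunks become the context and "shipping times" is dropped.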
This is the pattern that makes AI integration economically sustainable at scale. The decision tree runs before every request and picks the cheapest provider that meets the quality bar for that task.
graph TD
A[Incoming LLM Request] --> B{Task Classification}
B -->|Simple extraction<br/>keyword match<br/>classification| C[Free Tier: Groq]
B -->|Moderate reasoning<br/>summarization<br/>standard chat| D[Mid Tier: Claude Haiku<br/>or GPT-4o-mini]
B -->|Complex reasoning<br/>code generation<br/>multi-step analysis| E[Premium Tier:<br/>GPT-4o or Claude Opus]
C -->|Provider down?| D
D -->|Provider down?| E
E -->|All down?| F[Fallback: Queue for retry]
C --> G[$0.00005 / request]
D --> H[$0.001 / request]
E --> I[$0.015 / request]
G --> J[Cost Logger]
H --> J
I --> J
J --> K[Budget Gate]
K -->|Under budget| L[Return Response]
K -->|Over budget| M[Downgrade tier<br/>or reject request]
The cost difference is stark: $0.00005 per request on Groq's free tier versus $0.015 on premium models, a 300x gap. For a product running 10,000 daily requests, that's the difference between $0.50/day and $150/day. Task classification isn't perfect, but even routing 70% of requests to the free tier and leaving the other 30% on premium brings the bill to about $45/day, roughly a 70% cut.
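The arithmetic is easy to check for any traffic mix. A small sketch, using the per-request figures from the diagram; `daily_cost` and the shape of the `mix` argument are assumptions made for the example.

```python
from typing import Dict

# Per-request prices in USD, taken from the routing diagram.
PRICE = {"free": 0.00005, "mid": 0.001, "premium": 0.015}


def daily_cost(requests_per_day: int, mix: Dict[str, float], price: Dict[str, float]) -> float:
    """Blended daily cost for a traffic mix given as fractions that sum to 1."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic mix must sum to 1")
    return sum(requests_per_day * share * price[tier] for tier, share in mix.items())


all_free = daily_cost(10_000, {"free": 1.0}, PRICE)
all_premium = daily_cost(10_000, {"premium": 1.0}, PRICE)
blended = daily_cost(10_000, {"free": 0.7, "premium": 0.3}, PRICE)

print(f"all free:    ${all_free:.2f}/day")
print(f"all premium: ${all_premium:.2f}/day")
print(f"70/30 split: ${blended:.2f}/day")
```

For 10,000 requests a day this works out to $0.50, $150.00, and $45.35 respectively, so a 70/30 split already removes about 70% of the all-premium bill.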
The budget gate at the end is a hard limit. Once a tenant has burned through their daily budget, each further request is queued, downgraded, or rejected with a clear error. There is no silent overspend.
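One way to express the gate's decision is a small pure function. This is a sketch under assumptions: it checks the estimate before the call, it leaves out the queueing option, and `GateDecision`, `TenantBudget`, and `budget_gate` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class GateDecision(Enum):
    ALLOW = "allow"
    DOWNGRADE = "downgrade"
    REJECT = "reject"


@dataclass
class TenantBudget:
    daily_limit_usd: float
    spent_today_usd: float = 0.0


def budget_gate(
    budget: TenantBudget,
    estimated_usd: float,
    cheaper_usd: Optional[float] = None,
) -> GateDecision:
    """Decide what to do with a request given the tenant's remaining daily budget."""
    remaining = budget.daily_limit_usd - budget.spent_today_usd
    if estimated_usd <= remaining:
        return GateDecision.ALLOW
    # Over budget at the requested tier: fall back to a cheaper tier if one fits.
    if cheaper_usd is not None and cheaper_usd <= remaining:
        return GateDecision.DOWNGRADE
    return GateDecision.REJECT


nearly_spent = TenantBudget(daily_limit_usd=1.00, spent_today_usd=0.99)
print(budget_gate(nearly_spent, estimated_usd=0.015, cheaper_usd=0.001))  # GateDecision.DOWNGRADE
print(budget_gate(nearly_spent, estimated_usd=0.015))                     # GateDecision.REJECT
```

Keeping the decision separate from the side effects (logging, queueing, returning an error) makes the limit easy to unit test per tenant.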
class CostRouter {
  // providers, budgetService, healthCheck and failover() are members of the
  // full class; they are omitted here to keep the routing logic in focus.
  async selectProvider(request: LLMRequest): Promise<ProviderConfig> {
    const tier = this.classify(request);
    const budget = await this.budgetService.check(request.tenantId);

    if (budget.exceeded && tier === ProviderTier.Premium) {
      return this.providers[ProviderTier.Mid]; // graceful downgrade
    }

    const primary = this.providers[tier];
    const healthy = await this.healthCheck.isHealthy(primary.id);
    return healthy ? primary : this.failover(tier);
  }

  // Map the task type to the cheapest tier that meets its quality bar.
  private classify(request: LLMRequest): ProviderTier {
    if (request.taskType === 'extraction' || request.taskType === 'classification') {
      return ProviderTier.Free;
    }
    if (request.taskType === 'summarization' || request.taskType === 'chat') {
      return ProviderTier.Mid;
    }
    return ProviderTier.Premium;
  }
}
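The router above calls a failover step without showing it. Following the diagram (free falls back to mid, mid to premium, and everything down means queue for retry), a minimal Python sketch could look like this. `FAILOVER_ORDER`, `failover`, and the `is_healthy` callback are assumptions made for the example, not the actual implementation.

```python
from typing import Callable, Optional

# Cheapest first, mirroring the fallback arrows in the diagram.
FAILOVER_ORDER = ["free", "mid", "premium"]


def failover(start_tier: str, is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Walk upward from start_tier and return the first healthy tier.

    Returns None when every tier from start_tier upward is down, which the
    caller should treat as "queue the request for retry".
    """
    start = FAILOVER_ORDER.index(start_tier)
    for tier in FAILOVER_ORDER[start:]:
        if is_healthy(tier):
            return tier
    return None


down = {"free", "mid"}
print(failover("free", lambda tier: tier not in down))  # premium
print(failover("premium", lambda tier: False))          # None
```

Failover only ever moves to a more expensive tier, so an outage on the cheap provider raises spend. That is one more reason to keep the budget check in the same request path.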
These four patterns don't have to coexist in a single project — most projects use two or three. But they share a common thread: the AI provider is always behind an abstraction, cost is always tracked, and there's always a fallback.
The provider-agnostic interface (Diagrams 1 and 2) means I can swap Anthropic for OpenAI without touching business logic. The cost router (Diagram 4) means I can add a new cheap provider and immediately start routing budget-sensitive requests to it. The streaming frontend pattern (Diagram 3) works with any of them because the API contract is stable.
What I've found over multiple projects: the architecture decisions that seem like over-engineering on day one — the interfaces, the cost tracking, the provider abstraction — pay back within a month as requirements shift. AI providers update their models, change pricing, have outages. The abstraction layer is what keeps those events from becoming emergency deploys.
Working on AI integration for a project and want to talk through the architecture? Let's get into the details — these problems are more interesting when they're specific.