
AI Integration Architecture: .NET, Python, and Frontend Patterns That Actually Ship

Published on March 6, 2026 • 14 min read

Over the past couple of years I've integrated AI into production systems across three stacks: .NET Core backends, Python/FastAPI services, and Next.js frontends. The patterns that survive contact with production look quite different from the ones in tutorials. This article is a walkthrough of the architecture I actually use — diagrams first, commentary second.

The Problem With "AI Integration" Examples

Most examples show you how to call an API. POST /v1/chat/completions, get a response, done. That works fine until you need to swap providers mid-project, your primary provider goes down at 2 AM, or a client's bill triples because their usage pattern changed.

The architectures below are designed around one question: what does this look like in six months when requirements shift?


Diagram 1: .NET Core + LLM Integration

The backbone of most of my backend AI work is a provider-agnostic interface layer. The business logic never imports OpenAI or Anthropic directly — it talks to an interface, and the concrete implementation is wired up at startup.

graph TD
    A[Business Logic / Use Cases] --> B[ILLMProvider Interface]

    B --> C[OpenAI Adapter]
    B --> D[Anthropic Adapter]
    B --> E[Azure OpenAI Adapter]
    B --> F[AWS Bedrock Adapter]
    B --> G[Groq Adapter]

    H[HTTP Request] --> I[Auth Middleware]
    I --> J[Rate Limit Middleware]
    J --> K[Cost Tracking Middleware]
    K --> L[Request Router]
    L --> B
    B --> M[Response]
    M --> N[Cost Logger]
    N --> O[HTTP Response]

The middleware pipeline is where most of the operational value lives. Auth and rate limiting are obvious. Cost tracking is the one that saves projects — every request gets tagged with a cost estimate before it hits the provider, so you're never surprised by a monthly bill.
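
As a rough illustration of pre-flight cost tagging, here is a minimal Python sketch. The price table, the model names, and the four-characters-per-token heuristic are all assumptions for the example, not real provider pricing or a real tokenizer.

```python
from dataclasses import dataclass

# Hypothetical USD prices per 1K tokens; real prices differ and change over time.
PRICE_PER_1K_TOKENS = {
    "cheap-model": {"input": 0.0001, "output": 0.0002},
    "premium-model": {"input": 0.005, "output": 0.015},
}

CHARS_PER_TOKEN = 4  # rough heuristic, assumed here instead of a real tokenizer


@dataclass
class CostEstimate:
    model: str
    input_tokens: int
    max_output_tokens: int
    estimated_usd: float


def estimate_cost(model: str, prompt: str, max_output_tokens: int) -> CostEstimate:
    """Estimate the worst-case cost of a request before it is sent."""
    prices = PRICE_PER_1K_TOKENS[model]
    # Ceiling division so a partial token still counts as one.
    input_tokens = -(-len(prompt) // CHARS_PER_TOKEN)
    usd = (input_tokens / 1000) * prices["input"] + (
        max_output_tokens / 1000
    ) * prices["output"]
    return CostEstimate(model, input_tokens, max_output_tokens, usd)


# Usage: tag the request with this estimate before it reaches the provider.
estimate = estimate_cost("premium-model", "x" * 4000, max_output_tokens=1000)
```

The estimate is deliberately pessimistic because it assumes the full output budget is used; the actual cost can be logged after the response arrives.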

The request router is what makes the whole thing interesting. It looks at the request metadata (task type, priority tier, current provider health) and picks which concrete adapter to use. A document summarization job during off-peak hours might route to Groq. The same job during a latency-sensitive user-facing flow routes to GPT-4o.

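using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public interface ILLMProvider
{
    Task<LLMResponse> CompleteAsync(LLMRequest request, CancellationToken ct = default);
    // IAsyncEnumerable is returned directly so callers can consume it with `await foreach`.
    IAsyncEnumerable<string> StreamAsync(LLMRequest request, CancellationToken ct = default);
    ProviderCapabilities GetCapabilities();
}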

public class LLMRequest
{
    // Defaults avoid nullable-reference warnings when the request is built incrementally.
    public string Model { get; set; } = string.Empty;
    public List<Message> Messages { get; set; } = new();
    public int? MaxTokens { get; set; }
    public string TaskType { get; set; } = string.Empty;  // "summarization" | "extraction" | "reasoning"
    public ProviderTier RequiredTier { get; set; }
}

The TaskType and RequiredTier fields let the router make intelligent decisions without the business logic caring about providers at all.


Diagram 2: Python/FastAPI + AI Service

Python tends to be where I build the AI-heavy processing pipelines — document intelligence, OCR enhancement, structured extraction. The pattern separates the HTTP layer cleanly from the AI service layer.

graph TD
    A[FastAPI Endpoints] --> B[AI Service Layer]

    B --> C[OpenAI Adapter]
    B --> D[Anthropic Adapter]
    B --> E[Groq Adapter]

    subgraph OCR Pipeline
        F[File Upload Endpoint] --> G[File Validation]
        G --> H[Tesseract OCR]
        H --> I[Raw Text]
        I --> J[LLM Enhancement]
        J --> K[Structured Output Parser]
        K --> L[Pydantic Model]
        L --> M[Storage / Response]
    end

    B --> J

The OCR pipeline is the most interesting part. Tesseract gives you raw text, but raw OCR output from real-world documents is messy — inconsistent whitespace, merged words, missing punctuation. Running that through an LLM enhancement step before parsing dramatically improves extraction accuracy.

class AIService:
    def __init__(self, provider_registry: ProviderRegistry, parser):
        self._registry = provider_registry
        # Injected parser that turns LLM output into an EnhancedOCRResult.
        self._parser = parser

    async def enhance_ocr_output(
        self,
        raw_text: str,
        document_type: DocumentType,
        task_config: TaskConfig
    ) -> EnhancedOCRResult:
        provider = self._registry.select(task_config)

        prompt = self._build_enhancement_prompt(raw_text, document_type)
        response = await provider.complete(prompt)

        return self._parser.parse(response.content, document_type)

The ProviderRegistry.select() call is the same cost-routing logic as the .NET layer, just in Python. Same concept, different runtime.
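
The registry itself isn't shown above, so here is one way it could look: a minimal sketch, assuming providers are registered per task type with a default fallback. `LLMProvider`, `Completion`, `EchoProvider`, and the `task_type` field are hypothetical names for illustration, and the real selection logic would also weigh cost and provider health.

```python
import asyncio
from dataclasses import dataclass
from typing import Dict, Protocol


@dataclass
class Completion:
    content: str


class LLMProvider(Protocol):
    """The contract every adapter satisfies (hypothetical shape)."""

    name: str

    async def complete(self, prompt: str) -> Completion: ...


@dataclass
class TaskConfig:
    task_type: str


class EchoProvider:
    """Stand-in adapter that echoes the prompt instead of calling an API."""

    def __init__(self, name: str) -> None:
        self.name = name

    async def complete(self, prompt: str) -> Completion:
        return Completion(content=f"[{self.name}] {prompt}")


class ProviderRegistry:
    def __init__(self, default: LLMProvider) -> None:
        self._default = default
        self._by_task: Dict[str, LLMProvider] = {}

    def register(self, task_type: str, provider: LLMProvider) -> None:
        self._by_task[task_type] = provider

    def select(self, task_config: TaskConfig) -> LLMProvider:
        # Fall back to the default provider for unregistered task types.
        return self._by_task.get(task_config.task_type, self._default)


# Usage: wire the registry once at startup, then select per request.
registry = ProviderRegistry(default=EchoProvider("mid"))
registry.register("extraction", EchoProvider("cheap"))
provider = registry.select(TaskConfig("extraction"))
reply = asyncio.run(provider.complete("hi"))
```

Because callers only see the `complete` method, swapping a stand-in for a real adapter doesn't touch the service code.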

One thing worth calling out: the Pydantic model at the end of the pipeline is the contract between the AI service and everything downstream. It validates that the LLM actually returned structured data in the expected shape. If validation fails, the pipeline retries with a stricter prompt — not with a crash.
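
The validate-then-retry loop can be sketched as follows. To keep the example dependency-free it uses a standard-library dataclass and manual checks as a stand-in for the Pydantic model; the `Invoice` shape and the idea of passing an ordered list of prompts (loosest first) are assumptions for illustration.

```python
import json
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Invoice:
    vendor: str
    total: float


def parse_invoice(raw: str) -> Invoice:
    """Validate LLM output; raises ValueError on malformed or mis-shaped data."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if not isinstance(data.get("vendor"), str):
        raise ValueError("vendor must be a string")
    total = data.get("total")
    if isinstance(total, bool) or not isinstance(total, (int, float)):
        raise ValueError("total must be a number")
    return Invoice(vendor=data["vendor"], total=float(total))


def extract_with_retry(complete: Callable[[str], str], prompts: List[str]) -> Invoice:
    """Try each prompt in order until the model's output validates."""
    last_error: Optional[Exception] = None
    for prompt in prompts:
        try:
            return parse_invoice(complete(prompt))
        except ValueError as exc:
            last_error = exc  # move on to the next, stricter prompt
    raise RuntimeError(f"no prompt produced valid output: {last_error}")


# Usage with a fake model: the first reply is malformed, the second is valid.
replies = iter(["not json", '{"vendor": "Acme", "total": 12.5}'])
result = extract_with_retry(lambda prompt: next(replies), ["loose", "strict"])
```

With Pydantic, `parse_invoice` would collapse to a model validation call, but the retry structure around it stays the same.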


Diagram 3: Frontend AI Integration

On the Next.js side, I use the Vercel AI SDK for streaming and build the RAG pipeline as a server action or API route depending on what the app needs.

sequenceDiagram
    participant User
    participant Frontend
    participant API Route
    participant Embedder
    participant VectorDB
    participant LLM

    User->>Frontend: Submit query
    Frontend->>API Route: POST /api/chat (streaming)
    API Route->>Embedder: embed(query)
    Embedder-->>API Route: query_vector
    API Route->>VectorDB: similarity_search(query_vector, k=5)
    VectorDB-->>API Route: relevant_chunks[]
    API Route->>LLM: complete(system + chunks + query)
    LLM-->>API Route: stream tokens
    API Route-->>Frontend: SSE stream
    Frontend-->>User: Render tokens as they arrive

The streaming pattern is what makes AI-powered interfaces feel fast. Even if the full response takes 8 seconds, the user sees words appearing after ~400ms. That perceived latency difference matters enormously for UX.
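
For readers unfamiliar with the wire format, this is roughly what framing tokens as server-sent events looks like. It is a generic sketch of plain SSE, not the format the Vercel AI SDK emits (the SDK uses its own stream protocol), and the `[DONE]` sentinel is a common convention assumed here rather than part of the SSE specification.

```python
from typing import Iterable, Iterator


def sse_events(tokens: Iterable[str]) -> Iterator[str]:
    """Frame each token as one server-sent event."""
    for token in tokens:
        # A payload containing newlines must be split across several data: lines.
        lines = token.split("\n")
        yield "".join(f"data: {line}\n" for line in lines) + "\n"
    # Assumed end-of-stream marker so the client knows when to stop reading.
    yield "data: [DONE]\n\n"


events = list(sse_events(["Hel", "lo\nworld"]))
```

Each event ends with a blank line, which is what lets the browser hand tokens to the UI as soon as they arrive rather than waiting for the full response.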

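// API route (simplified). embedder, vectorDB, and buildSystemPrompt are app-specific helpers defined elsewhere.
import { streamText } from 'ai';
import { openai } from '@ai-sdk/openai';

export async function POST(req: Request) {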
  const { messages } = await req.json();
  const lastMessage = messages[messages.length - 1].content;

  // RAG: embed + retrieve
  const queryVector = await embedder.embed(lastMessage);
  const chunks = await vectorDB.search(queryVector, { k: 5 });
  const context = chunks.map(c => c.content).join('\n\n');

  // Stream response
  const result = await streamText({
    model: openai('gpt-4o-mini'),
    system: buildSystemPrompt(context),
    messages,
  });

  return result.toDataStreamResponse();
}

The model choice here (gpt-4o-mini) isn't arbitrary — for a RAG assistant where the heavy lifting is retrieval rather than reasoning, a smaller faster model works well. The context does most of the work; the LLM just needs to synthesize it coherently.
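
To make the retrieval step from the sequence diagram concrete, here is a minimal in-memory sketch of similarity search and context assembly. It assumes embeddings are plain lists of floats and uses brute-force cosine similarity; a real vector database would use an index instead, and the function names are illustrative.

```python
import math
from typing import List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: List[float], chunks: List[Tuple[str, List[float]]], k: int = 5) -> List[str]:
    """Return the text of the k chunks most similar to the query vector."""
    ranked = sorted(chunks, key=lambda c: cosine(query, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_context(query: List[float], chunks: List[Tuple[str, List[float]]], k: int = 5) -> str:
    """Join the retrieved chunks the same way the API route does."""
    return "\n\n".join(top_k(query, chunks, k))


# Usage with toy 2-D embeddings.
chunks = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
context = build_context([1.0, 0.0], chunks, k=2)
```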


Diagram 4: Multi-Provider Cost Routing

This is the pattern that makes AI integration economically sustainable at scale. The decision tree runs before every request and picks the cheapest provider that meets the quality bar for that task.

graph TD
    A[Incoming LLM Request] --> B{Task Classification}

    B -->|Simple extraction<br/>keyword match<br/>classification| C[Free Tier: Groq]
    B -->|Moderate reasoning<br/>summarization<br/>standard chat| D[Mid Tier: Claude Haiku<br/>or GPT-4o-mini]
    B -->|Complex reasoning<br/>code generation<br/>multi-step analysis| E[Premium Tier:<br/>GPT-4o or Claude Opus]

    C -->|Provider down?| D
    D -->|Provider down?| E
    E -->|All down?| F[Fallback: Queue for retry]

    C --> G[$0.00005 / request]
    D --> H[$0.001 / request]
    E --> I[$0.015 / request]

    G --> J[Cost Logger]
    H --> J
    I --> J
    J --> K[Budget Gate]
    K -->|Under budget| L[Return Response]
    K -->|Over budget| M[Downgrade tier<br/>or reject request]

The cost difference is stark: $0.00005 per request on Groq's free tier vs $0.015 on premium models. For a product running 10,000 daily requests, that's the difference between $0.50/day and $150/day. Task classification isn't perfect, but even routing 70% of requests to the free tier cuts costs dramatically.
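
The arithmetic is easy to check. The sketch below uses the per-request figures from the diagram; the 70/30 split with the remainder going to the premium tier is an assumption chosen as a pessimistic case.

```python
REQUESTS_PER_DAY = 10_000

# Per-request costs in USD, taken from the diagram above.
COST_PER_REQUEST = {"free": 0.00005, "mid": 0.001, "premium": 0.015}


def daily_cost(mix: dict) -> float:
    """Daily spend for a traffic mix given as {tier: share of requests}."""
    return sum(
        REQUESTS_PER_DAY * share * COST_PER_REQUEST[tier]
        for tier, share in mix.items()
    )


all_free = daily_cost({"free": 1.0})        # about $0.50 per day
all_premium = daily_cost({"premium": 1.0})  # about $150 per day

# Assumed mix: 70% routed to the free tier, the remaining 30% to premium.
blended = daily_cost({"free": 0.7, "premium": 0.3})  # about $45.35 per day
savings = 1 - blended / all_premium                   # roughly 70% lower
```

Under that assumption the bill drops from about $150 to about $45 per day, and it falls further if part of the remaining 30% can run on the mid tier.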

The budget gate at the end is a hard limit. If a tenant has burned their daily budget, requests either get queued, downgraded, or rejected with a clear error — no silent overspend.
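
A budget gate along these lines could be sketched as follows. The `TenantBudget` and `Decision` types are hypothetical, and the gate here only chooses between allowing, downgrading, and rejecting; queuing would be handled by whatever calls it.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Decision(Enum):
    ALLOW = "allow"
    DOWNGRADE = "downgrade"
    REJECT = "reject"


@dataclass
class TenantBudget:
    daily_limit_usd: float
    spent_usd: float = 0.0


def gate(
    budget: TenantBudget,
    estimated_usd: float,
    cheaper_usd: Optional[float] = None,
) -> Decision:
    """Decide whether a request fits in what is left of the tenant's daily budget."""
    remaining = budget.daily_limit_usd - budget.spent_usd
    if estimated_usd <= remaining:
        return Decision.ALLOW
    # The requested tier doesn't fit; see whether a cheaper tier would.
    if cheaper_usd is not None and cheaper_usd <= remaining:
        return Decision.DOWNGRADE
    return Decision.REJECT


# Usage: a tenant with one cent left asks for a premium-priced request.
nearly_spent = TenantBudget(daily_limit_usd=1.0, spent_usd=0.99)
decision = gate(nearly_spent, estimated_usd=0.015, cheaper_usd=0.001)
```

Checking against the remaining budget rather than a simple exceeded flag means the last request of the day can't push a tenant over the limit.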

class CostRouter {
  async selectProvider(request: LLMRequest): Promise<ProviderConfig> {
    const tier = this.classify(request);
    const budget = await this.budgetService.check(request.tenantId);

    if (budget.exceeded && tier === ProviderTier.Premium) {
      return this.providers[ProviderTier.Mid]; // graceful downgrade
    }

    const primary = this.providers[tier];
    const healthy = await this.healthCheck.isHealthy(primary.id);

    return healthy ? primary : this.failover(tier);
  }

  private classify(request: LLMRequest): ProviderTier {
    if (request.taskType === 'extraction' || request.taskType === 'classification') {
      return ProviderTier.Free;
    }
    if (request.taskType === 'summarization' || request.taskType === 'chat') {
      return ProviderTier.Mid;
    }
    return ProviderTier.Premium;
  }
}
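
The `failover` method isn't shown in the class above. Following the arrows in the diagram (free to mid to premium, then queue for retry), one possible shape is sketched below in Python; the tier names as strings and the `is_healthy` callback are assumptions for the example.

```python
from typing import Callable, List, Optional

# Escalation order taken from the diagram: cheapest tier first.
FAILOVER_ORDER: List[str] = ["free", "mid", "premium"]


def failover(start_tier: str, is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Walk up the tiers from start_tier and return the first healthy one.

    Returns None when every remaining tier is down, which the caller
    would treat as "queue the request for retry".
    """
    start = FAILOVER_ORDER.index(start_tier)
    for tier in FAILOVER_ORDER[start:]:
        if is_healthy(tier):
            return tier
    return None


# Usage: the free and mid tiers are down, so the request escalates to premium.
down = {"free", "mid"}
chosen = failover("free", lambda tier: tier not in down)
```

Note that failover only escalates upward in cost, so it should still pass through the budget check before the request is sent.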

Putting It Together

These four patterns don't have to coexist in a single project — most projects use two or three. But they share a common thread: the AI provider is always behind an abstraction, cost is always tracked, and there's always a fallback.

The provider-agnostic interface (Diagrams 1 and 2) means I can swap Anthropic for OpenAI without touching business logic. The cost router (Diagram 4) means I can add a new cheap provider and immediately start routing budget-sensitive requests to it. The streaming frontend pattern (Diagram 3) works with any of them because the API contract is stable.

What I've found over multiple projects: the architecture decisions that seem like over-engineering on day one — the interfaces, the cost tracking, the provider abstraction — pay back within a month as requirements shift. AI providers update their models, change pricing, have outages. The abstraction layer is what keeps those events from becoming emergency deploys.


Working on AI integration for a project and want to talk through the architecture? Let's get into the details — these problems are more interesting when they're specific.