Published on January 15, 2025 • Updated March 2026 • 20 min read
The cloud isn't just about hosting anymore. With AI becoming a core part of modern applications, we need to think differently about how we architect systems. I've been diving deep into cloud architecture lately, especially around AI integration, and I want to share what I've learned about building systems that are both scalable and intelligent.
Let's be honest - AI is expensive. Training models, running inference, storing massive datasets - it all costs money. But the cloud gives you something you can't get on-premises: the ability to scale AI workloads up and down with demand.
I recently worked on a project where we needed to process thousands of documents with OCR and natural language processing. On-premises, this would have required a massive upfront investment in GPU hardware that would sit idle 80% of the time. In the cloud, we spin up processing power when we need it and pay only for what we use.
After building several AI-integrated applications, I've found a few patterns that consistently work well:
Instead of tight coupling between your application and AI services, use events to trigger AI processing:
User uploads document → Event triggered → AI service processes → Results stored → User notified
This approach gives you:
Not everything needs AI, and not all AI needs to be real-time. I use a tiered approach:
Tier 1 - Rule-based logic: Fast, cheap, handles 80% of cases
Tier 2 - Simple ML models: Handles edge cases that rules can't
Tier 3 - Advanced AI: Complex reasoning, expensive but powerful
This way, you get the benefits of AI without the cost of running complex models for simple decisions.
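A minimal sketch of the tiered approach, assuming the cheap model and the expensive AI call are passed in as delegates. The names `TieredClassifier`, `simpleModel`, `advancedAi`, the invoice rule, and the 0.9 confidence threshold are all illustrative assumptions, not taken from a real project.

```csharp
using System;

public static class TieredClassifier
{
    // Tier 1: a cheap deterministic rule. Returns null when no rule applies.
    // The "INVOICE" keyword rule is a made-up example.
    public static string TryRules(string text) =>
        text.Contains("INVOICE", StringComparison.OrdinalIgnoreCase) ? "invoice" : null;

    // Walks the tiers from cheapest to most expensive and stops at the first usable answer.
    // simpleModel and advancedAi are hypothetical stand-ins for real model calls.
    public static (string Label, int Tier) Classify(
        string text,
        Func<string, (string Label, double Confidence)> simpleModel,
        Func<string, string> advancedAi,
        double minConfidence = 0.9)
    {
        var ruleLabel = TryRules(text);
        if (ruleLabel != null) return (ruleLabel, 1);

        var guess = simpleModel(text);
        if (guess.Confidence >= minConfidence) return (guess.Label, 2);

        // Only inputs that neither rules nor the simple model can handle reach the expensive tier.
        return (advancedAi(text), 3);
    }
}
```

Returning the tier alongside the label makes it easy to log how much traffic each tier actually absorbs.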
I've worked extensively with Azure's AI services, and here's my typical stack:
These services are mature, well-documented, and integrate seamlessly with other Azure services.
For tasks that require more sophisticated understanding:
The key is using the right tool for the job. Don't use GPT-4 to extract a phone number from text when a simple regex or Azure Form Recognizer (since renamed Azure AI Document Intelligence) will do.
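To make the regex point concrete, here is a small sketch. The pattern only covers simple North American formats and is an assumption for illustration; `PhoneExtractor` is a hypothetical name.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PhoneExtractor
{
    // Matches numbers such as 555-123-4567 or (555) 123-4567.
    // Illustrative pattern only; real-world data usually needs a broader one.
    private static readonly Regex Pattern = new Regex(
        @"(?:\(\d{3}\)\s?|\b\d{3}[-. ])\d{3}[-. ]\d{4}\b",
        RegexOptions.Compiled);

    // Returns the first phone number found in the text, or null if there is none.
    public static string FirstPhoneNumber(string text)
    {
        var match = Pattern.Match(text);
        return match.Success ? match.Value : null;
    }
}
```

A call like this costs microseconds and nothing per request, which is the baseline any model-based extraction has to beat.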
Let me show you how this looks in practice with a real project I worked on:
The Challenge: A client needed to process legal documents, extract key information, and generate summaries.
The Solution:
The Architecture Benefits:
// The _-prefixed clients are assumed to be application-level wrappers injected through
// the function class constructor, so Run is an instance method rather than static.
[FunctionName("ProcessDocument")]
public async Task Run(
    [BlobTrigger("documents/{name}")] Stream documentStream,
    string name,
    [CosmosDB(DatabaseName = "DocumentDB", CollectionName = "Documents")] IAsyncCollector<Document> documentsOut,
    ILogger log)
{
    try
    {
        // Step 1: OCR Processing
        var ocrResult = await _formRecognizerClient.AnalyzeDocumentAsync(documentStream);

        // Step 2: Entity Extraction
        var entities = await _customModelClient.ExtractEntitiesAsync(ocrResult.Content);

        // Step 3: Summarization (for important documents only)
        string summary = null;
        if (entities.Importance > 0.8)
        {
            summary = await _openAIClient.GenerateSummaryAsync(ocrResult.Content);
        }

        // Step 4: Store Results
        var document = new Document
        {
            Id = Guid.NewGuid().ToString(),
            FileName = name,
            Content = ocrResult.Content,
            Entities = entities,
            Summary = summary,
            ProcessedAt = DateTime.UtcNow
        };
        await documentsOut.AddAsync(document);

        // Step 5: Notify User
        await _notificationService.NotifyProcessingComplete(document.Id);
    }
    catch (Exception ex)
    {
        // Structured logging keeps the file name searchable as a property
        log.LogError(ex, "Failed to process document {Name}", name);
        await _notificationService.NotifyProcessingFailed(name, ex.Message);
    }
}
AI can get expensive fast. Here's how I keep costs under control:
Cache AI results aggressively. If the same document (or an identical chunk of text) comes through more than once, store the result keyed by a hash of the content and reuse it instead of paying for the call again.
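One way to sketch the caching idea is to key results by a SHA-256 hash of the content. This assumes an in-memory dictionary and a delegate standing in for the AI call; `AnalysisCache` and `analyze` are hypothetical names, and a production system would more likely use a shared store with expiry.

```csharp
using System;
using System.Collections.Concurrent;
using System.Security.Cryptography;
using System.Text;
using System.Threading.Tasks;

public class AnalysisCache
{
    private readonly ConcurrentDictionary<string, Task<string>> _cache =
        new ConcurrentDictionary<string, Task<string>>();

    // Hex-encoded SHA-256 of the content, used as the cache key.
    public static string KeyFor(string content)
    {
        using (var sha = SHA256.Create())
        {
            var hash = sha.ComputeHash(Encoding.UTF8.GetBytes(content));
            return BitConverter.ToString(hash).Replace("-", "");
        }
    }

    // Returns the cached task for identical content; otherwise calls analyze and stores the task.
    // analyze is a hypothetical stand-in for the expensive AI call.
    // Note: under heavy contention GetOrAdd may invoke the factory more than once.
    public Task<string> GetOrAnalyzeAsync(string content, Func<string, Task<string>> analyze) =>
        _cache.GetOrAdd(KeyFor(content), _ => analyze(content));
}
```

Because the sketch caches the task itself, a failed call stays cached too; a real implementation would want to evict faulted entries.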
Instead of processing documents one at a time, batch them together. Some AI services offer discounted pricing for asynchronous batch operations, and batching also cuts per-request overhead.
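A small helper for grouping work into fixed-size batches might look like the sketch below. `Batcher` is a hypothetical name, and the right batch size depends on the limits of whichever service you call.

```csharp
using System;
using System.Collections.Generic;

public static class Batcher
{
    // Splits items into consecutive batches of at most batchSize elements, preserving order.
    public static List<List<T>> Split<T>(IReadOnlyList<T> items, int batchSize)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));

        var batches = new List<List<T>>();
        for (int start = 0; start < items.Count; start += batchSize)
        {
            int end = Math.Min(start + batchSize, items.Count);
            var batch = new List<T>();
            for (int i = start; i < end; i++) batch.Add(items[i]);
            batches.Add(batch);
        }
        return batches;
    }
}
```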
Use the cheapest model that gives you acceptable results. Don't use GPT-4 when GPT-3.5 will do the job.
AI service pricing varies by region. Deploy in regions that offer the best price/performance ratio for your workload.
AI introduces new security considerations:
Keep detailed logs of:
AI systems need different DevOps practices:
Track which version of which model processed each piece of data. When you update models, you need to be able to compare results.
Test new models against existing ones with real data before full deployment.
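A rough sketch of that comparison: run both versions over the same inputs and measure how often they agree. The delegates are hypothetical stand-ins for the current and candidate models, and agreement rate is just one possible metric; labeled data would allow a proper accuracy comparison.

```csharp
using System;
using System.Collections.Generic;

public static class ModelComparison
{
    // Runs both model versions over the same inputs and returns the fraction
    // of inputs on which they produce identical output.
    public static double AgreementRate(
        IReadOnlyList<string> inputs,
        Func<string, string> current,
        Func<string, string> candidate)
    {
        if (inputs.Count == 0) return 1.0;

        int same = 0;
        foreach (var input in inputs)
        {
            if (current(input) == candidate(input)) same++;
        }
        return (double)same / inputs.Count;
    }
}
```

The inputs where the two versions disagree are usually the most useful ones to review by hand before switching over.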
Monitor more than just uptime:
Have a plan for when new models perform worse than expected. Sometimes you need to roll back to a previous model version quickly.
The AI landscape changes fast. Here's how I build systems that can adapt:
Don't tie your business logic directly to specific AI services. Use abstraction layers that let you swap providers:
public interface IDocumentAnalyzer
{
    Task<AnalysisResult> AnalyzeAsync(Stream document);
}

public class AzureDocumentAnalyzer : IDocumentAnalyzer
{
    // Azure-specific implementation
}

public class AWSDocumentAnalyzer : IDocumentAnalyzer
{
    // AWS-specific implementation
}
Build pipelines that can be reconfigured without code changes:
{
  "pipeline": [
    {
      "step": "ocr",
      "service": "azure-form-recognizer",
      "enabled": true
    },
    {
      "step": "entity-extraction",
      "service": "custom-model-v2",
      "enabled": true
    },
    {
      "step": "summarization",
      "service": "azure-openai",
      "enabled": false
    }
  ]
}
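Reading a configuration like this could be sketched with `System.Text.Json` as follows. `PipelineConfig` is a hypothetical helper; how each service name maps to an actual implementation is left out and would depend on your own registry.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

public static class PipelineConfig
{
    // Returns the (step, service) pairs whose "enabled" flag is true, in declaration order.
    public static List<(string Step, string Service)> EnabledSteps(string json)
    {
        var steps = new List<(string Step, string Service)>();
        using (var doc = JsonDocument.Parse(json))
        {
            foreach (var item in doc.RootElement.GetProperty("pipeline").EnumerateArray())
            {
                if (item.GetProperty("enabled").GetBoolean())
                {
                    steps.Add((
                        item.GetProperty("step").GetString(),
                        item.GetProperty("service").GetString()));
                }
            }
        }
        return steps;
    }
}
```

With this in place, turning summarization on becomes a configuration change rather than a deployment.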
Standardize on data formats that work across different AI providers. JSON with well-defined schemas is usually a safe bet.
Not every AI capability needs to be custom-built:
If you're looking to add AI to your applications:
Here's why clients love AI-integrated cloud solutions:
The intersection of cloud and AI is just getting started. We're seeing exciting developments in:
In my next article, I'll dive into why I prefer .NET for backend development and how it integrates beautifully with modern cloud and AI services.
Updated March 2026
The original version of this article focused on managed cloud services and event-driven pipelines. That foundation still holds, but the way I wire AI into systems has shifted considerably with the rise of agentic patterns. Here's what's changed in practice.
Modern LLMs can invoke structured tools mid-conversation. Instead of asking the model to return JSON you parse yourself, you define a schema and the model decides when and how to call it:
// Define a tool the agent can call
var tools = new[]
{
    new Tool
    {
        Name = "search_documents",
        Description = "Search the document repository by keyword or entity",
        Parameters = new { query = "string", limit = "integer" }
    },
    new Tool
    {
        Name = "extract_entities",
        Description = "Extract named entities from a document chunk",
        Parameters = new { text = "string", entity_types = "array" }
    }
};

// The LLM decides which tool to call based on the user's request
var response = await _llmClient.CompleteWithToolsAsync(prompt, tools);
At Hyland I use this pattern for enterprise document workflows — the agent calls search and extraction tools as it reasons through a user's request, rather than the application hard-coding the sequence.
The bigger shift is moving from single-shot completions to multi-turn reasoning loops. The application drives the loop; the model decides when it's done:
User intent received
→ LLM reasons about intent
→ Calls tool (search, extract, classify)
→ Receives tool result
→ Reasons again
→ Calls another tool if needed
→ Produces final response when satisfied
What makes this different from older pipelines is that the model's reasoning determines the path, not pre-programmed conditionals. For document processing at Hyland, this means the agent handles edge cases it was never explicitly programmed for — it figures out when to pull in a second tool, when to ask for clarification, and when the confidence is high enough to proceed.
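The loop itself can be sketched in a few lines. This is a simplified illustration, not the Hyland implementation: `AgentStep`, `AgentLoop`, the `model` delegate, and the `maxTurns` cap are all assumptions, with the delegate standing in for a real LLM call that sees the transcript so far.

```csharp
using System;
using System.Collections.Generic;

// One model turn: either a tool call (ToolName set) or a final answer (FinalAnswer set).
public class AgentStep
{
    public string ToolName { get; set; }
    public string ToolInput { get; set; }
    public string FinalAnswer { get; set; }
}

public static class AgentLoop
{
    // Runs until the model returns a final answer or maxTurns is reached.
    // model and tools are hypothetical stand-ins for an LLM client and real tool implementations.
    public static string Run(
        string userIntent,
        Func<IReadOnlyList<string>, AgentStep> model,
        IDictionary<string, Func<string, string>> tools,
        int maxTurns = 8)
    {
        var transcript = new List<string> { "user: " + userIntent };
        for (int turn = 0; turn < maxTurns; turn++)
        {
            var step = model(transcript);
            if (step.FinalAnswer != null) return step.FinalAnswer;

            // Unknown tools are reported back to the model instead of crashing the loop.
            var result = tools.TryGetValue(step.ToolName ?? "", out var tool)
                ? tool(step.ToolInput)
                : "error: unknown tool " + step.ToolName;
            transcript.Add("tool " + step.ToolName + ": " + result);
        }
        throw new InvalidOperationException("Agent did not finish within " + maxTurns + " turns.");
    }
}
```

The turn cap matters in practice: since the model decides when it is done, the application needs its own limit on how long it is allowed to keep going.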
For complex tasks I run multiple specialized agents rather than one general one. The orchestrator coordinates:
Orchestrator
├── Research Agent → gathers context, searches knowledge base
├── Analysis Agent → extracts structure, identifies gaps
└── Synthesis Agent → writes final output, formats for delivery
I use this pattern in my own development workflow with parallel Claude Code agents — each agent gets an isolated git worktree so they can't step on each other, and an orchestrator coordinates phases and merges results. The same principle applies to production AI systems: specialized, isolated agents that compose cleanly.
My production LLM proxy — built for Kitchen-Core and the pattern I now reuse across projects — routes requests across 7 providers based on cost, capability, and availability:
| Provider | Tier | Approx. cost/1K tokens | Best for |
|----------|------|------------------------|----------|
| Groq (free tier) | Free | ~$0.00005 | Fast, cheap, high-volume |
| Groq (paid) | Low | ~$0.0002 | Structured extraction |
| Anthropic Haiku | Mid | ~$0.00025 | Balanced reasoning |
| GPT-4o Mini | Mid | ~$0.00015 | Code, structured output |
| GPT-4o | Premium | ~$0.005 | Complex reasoning |
| Claude Opus | Premium | ~$0.015 | Advanced agents |
| Azure AI Foundry | Variable | Depends on model | Enterprise compliance |
The routing logic is tiered:
Incoming request
→ Is it a simple extraction/classification? → Groq free tier ($0.00005)
→ Is it mid-complexity reasoning? → Haiku or GPT-4o Mini (~$0.0002)
→ Is it complex agent orchestration? → GPT-4o or Opus ($0.005–$0.015)
→ Did the primary provider fail? → Fallback to next tier
On Kitchen-Core's document pipeline, this reduced LLM costs by routing ~70% of high-volume classification requests to Groq's free tier while preserving premium models for the reasoning-heavy steps that actually need them. The difference between $0.00005 and $0.015 per 1K tokens is 300x, and it adds up fast at scale.
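The tier selection can be sketched as a lookup that returns providers in the order to try them. This is an illustration rather than the Kitchen-Core code: the `Complexity` enum, the provider identifiers, and the choice to fall back toward more capable tiers are all assumptions.

```csharp
using System;
using System.Collections.Generic;

public enum Complexity { Simple = 0, Mid = 1, Complex = 2 }

public static class TierRouter
{
    // Hypothetical provider identifiers per tier, cheapest tier first.
    private static readonly string[][] Tiers =
    {
        new[] { "groq-free" },
        new[] { "claude-haiku", "gpt-4o-mini" },
        new[] { "gpt-4o", "claude-opus" }
    };

    // Returns the providers to try in order: the tier matching the request first,
    // then the more capable tiers as fallbacks.
    public static List<string> Candidates(Complexity complexity)
    {
        var candidates = new List<string>();
        for (int tier = (int)complexity; tier < Tiers.Length; tier++)
        {
            candidates.AddRange(Tiers[tier]);
        }
        return candidates;
    }
}
```

How a request gets its `Complexity` label is the hard part and is not shown here; a cheap classifier or a per-endpoint default are two common options.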
The routing above only works cleanly if your business logic never talks to a specific provider directly. Here's the interface pattern I use:
public interface ILLMProvider
{
    string Name { get; }
    Task<CompletionResponse> CompleteAsync(CompletionRequest request);
    Task<CompletionResponse> CompleteWithToolsAsync(CompletionRequest request, IEnumerable<Tool> tools);
    Task StreamAsync(CompletionRequest request, Func<string, Task> onChunk);
    ProviderCapabilities GetCapabilities();
    decimal EstimateCost(CompletionRequest request);
}

// Each provider implements the same interface
public class GroqProvider : ILLMProvider { ... }
public class AnthropicProvider : ILLMProvider { ... }
public class AzureAIProvider : ILLMProvider { ... }
public class BedrockProvider : ILLMProvider { ... }
The router sits above this layer:
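public class LLMRouter
{
    private readonly IEnumerable<ILLMProvider> _providers;
    private readonly IRoutingStrategy _strategy;

    // Providers and strategy are supplied by the caller (for example via dependency injection)
    public LLMRouter(IEnumerable<ILLMProvider> providers, IRoutingStrategy strategy)
    {
        _providers = providers;
        _strategy = strategy;
    }

    public async Task<CompletionResponse> RouteAsync(CompletionRequest request)
    {
        var ranked = _strategy.Rank(request, _providers);
        foreach (var provider in ranked)
        {
            try
            {
                return await provider.CompleteAsync(request);
            }
            catch (ProviderException ex) when (ex.IsRetryable)
            {
                // Retryable failure: fall through to the next provider in the ranked list
            }
        }
        throw new AllProvidersFailedException();
    }
}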
This gives you two things the original article mentioned as goals but didn't show concretely:
Adding a provider means implementing ILLMProvider, not changing anything else.
In practice, I extended this with a GetCapabilities() method so the router can also check whether a provider supports tool use, streaming, or vision before routing to it. Some tasks require specific capabilities and the routing strategy needs to know.
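A capability check of this kind could be sketched with a flags enum. The `Capability` enum and `CapabilityFilter` below are hypothetical; the article's `ProviderCapabilities` type is not shown, so its real shape may differ.

```csharp
using System;
using System.Collections.Generic;

[Flags]
public enum Capability { None = 0, ToolUse = 1, Streaming = 2, Vision = 4 }

public static class CapabilityFilter
{
    // Keeps only the providers whose capabilities include every required flag,
    // preserving the input order so an earlier ranking still applies.
    public static List<string> Eligible(
        IEnumerable<KeyValuePair<string, Capability>> providers,
        Capability required)
    {
        var result = new List<string>();
        foreach (var provider in providers)
        {
            if ((provider.Value & required) == required) result.Add(provider.Key);
        }
        return result;
    }
}
```

Filtering before ranking means the router never spends a retry on a provider that could not have handled the request in the first place.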
Since the original article referenced a document processing project in general terms, here are the real numbers from what became Kitchen-Core:
ILLMProvider interface
The hybrid OCR approach is worth highlighting: running Tesseract locally costs nothing per document. Sending only the low-confidence chunks to an LLM for correction keeps the pipeline fast and cheap while still producing clean structured output.
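The hybrid idea can be sketched as a confidence gate in front of the LLM call. This is an illustration, not the Kitchen-Core code: `OcrChunk`, the `correct` delegate, and the 0.85 threshold are assumptions, and the OCR step itself is assumed to have already produced per-chunk confidence scores.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// A piece of OCR output with the engine's confidence in it (0.0 to 1.0).
public class OcrChunk
{
    public string Text { get; set; }
    public double Confidence { get; set; }
}

public static class HybridOcr
{
    // Keeps high-confidence OCR text as-is and sends only low-confidence chunks to correct.
    // correct is a hypothetical stand-in for the LLM correction call.
    public static async Task<List<string>> CleanAsync(
        IEnumerable<OcrChunk> chunks,
        Func<string, Task<string>> correct,
        double threshold = 0.85)
    {
        var output = new List<string>();
        foreach (var chunk in chunks)
        {
            output.Add(chunk.Confidence >= threshold
                ? chunk.Text
                : await correct(chunk.Text));
        }
        return output;
    }
}
```

Tuning the threshold trades cost against quality: a lower value sends fewer chunks to the LLM but lets more OCR errors through.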
Interested in adding AI capabilities to your application? Let's discuss your specific use case - I love helping businesses leverage AI to solve real problems.