Edge AI Inference: Running Models at the CDN Layer
The fastest inference call is the one that never crosses an ocean. Edge AI moves quantized models to the CDN layer -- Cloudflare Workers AI, Deno, Vercel -- placing intelligence at the same tier as your static assets. This is a production guide to when it works, when it fails, and what it actually costs.
The fastest inference call is the one that never crosses an ocean.
For two decades, CDNs existed to cache bytes: images, scripts, HTML fragments. The contract was simple. Put static content near the user. Reduce round-trip time. Charge per gigabyte.
That contract just changed. The CDN layer now runs models. Not toy demos. Production classifiers, embedding generators, PII detectors, content moderators. Small models, quantized aggressively, executing at the same network tier where your CSS lives.
This is not a theoretical shift. Cloudflare Workers AI serves inference from 300+ data centers. Deno Deploy runs ONNX models at the edge. Vercel ships AI SDK helpers that front edge-deployed functions. The infrastructure is real. The question is no longer "can we?" but "when should we, and what do we give up?"
We have been running edge inference in production for classification and embedding workloads. This article is what we learned: the architecture, the cost model, the quantization trade-offs, and the operational patterns that survive contact with real traffic.
I. The Latency Argument
Latency is not a number. It is a user experience.
A centralized GPU cluster in us-east-1 delivers first-token latency of 200-800ms for a typical LLM call. Add network transit for a user in Oslo, and you are looking at 300-1000ms before the first byte of intelligence arrives. For a user in Sao Paulo, double it.
Edge inference changes the arithmetic:
Centralized (us-east-1):
User (Oslo) -> CDN edge (Oslo) -> Origin (Virginia) -> GPU cluster -> back
Network RTT: ~120ms * 2 = ~240ms
Inference: ~200-600ms
Total TTFT: ~440-840ms
Edge inference:
User (Oslo) -> CDN edge (Oslo) -> local model -> back
Network RTT: ~5ms * 2 = ~10ms
Inference: ~50-300ms (quantized, small model)
Total TTFT: ~60-310ms
For classification tasks -- "is this comment toxic?", "what language is this?", "does this contain PII?" -- the edge path is 3-8x faster. Not because the model is faster. Because the network vanished.
First-token latency matters most for interactive flows: autocomplete, inline moderation, real-time classification, embedding generation for search-as-you-type. These are the tasks where 400ms feels broken and 60ms feels instant. The human perceptual threshold for "instantaneous" is roughly 100ms. Edge inference puts you under it for the right workloads.
For long-form generation -- multi-paragraph summaries, code completion, chain-of-thought reasoning -- the edge story is weaker. You need larger models, longer context windows, and more memory than edge nodes typically offer. The latency advantage exists but is dwarfed by generation time.
Know the shape of your workload before you pick the topology.
II. What "Edge" Actually Means in 2026
The word "edge" is overloaded to the point of uselessness. Let us be precise.
CDN edge nodes are lightweight compute instances running in Points of Presence (PoPs) -- the same data centers that serve your cached assets. They have constrained memory (128MB-512MB per isolate), limited execution time (CPU-time budgets typically ranging from 50ms to 30s, depending on platform and plan), and no persistent GPU. Inference runs on CPUs or, increasingly, purpose-built accelerators co-located with CDN infrastructure.
Regional edge is different. These are full compute regions closer to users than a single centralized cloud, but not as distributed as CDN PoPs. Think Fly.io machines, Railway regions, or AWS Local Zones.
On-device is the ultimate edge: the user's browser or phone. WebGPU inference, Core ML, ONNX Runtime Web.
This article focuses primarily on CDN-layer edge -- the most constrained and most widely distributed tier. The patterns apply, with modifications, to regional edge and on-device.
Distribution vs. Capability:
CDN Edge (300+ PoPs)     | lowest latency, smallest models
Regional Edge (20-50)    | more memory, medium models
Centralized Cloud (3-8)  | full GPU, largest models
More locations at the top of the list; more capability per location at the bottom.
III. Model Quantization: The Art of Controlled Degradation
You cannot run a 7B-parameter model in float32 at the CDN edge. The math is unforgiving: 7B params * 4 bytes = 28GB. A Cloudflare Worker gets 128MB.
Quantization is the bridge. Reduce precision, reduce size, accept some quality loss. The question is always: how much quality do you sacrifice, and can your application tolerate it?
The Quantization Landscape
INT8 -- 8-bit integer quantization. Cuts model size by ~4x versus float32. Quality loss is typically 1-3% on standard benchmarks. This is the conservative choice. Most models survive INT8 without meaningful degradation for classification, embedding, and short-generation tasks.
INT4 -- 4-bit quantization. Cuts size by ~8x. Quality loss is 3-8% depending on the model and task. Viable for classification and embedding. Risky for nuanced generation.
GPTQ (Post-Training Quantization) -- A calibration-based method that finds optimal quantization parameters using a small calibration dataset. Produces high-quality INT4 models. The standard for offline quantization of large language models.
AWQ (Activation-Aware Weight Quantization) -- Protects the 1% of "salient" weights that matter most to model quality, quantizing everything else aggressively. Often outperforms GPTQ at the same bit width. Our default choice for edge deployment.
# Quantizing a model with AWQ for edge deployment
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = "mistralai/Mistral-7B-Instruct-v0.3"
quant_path = "mistral-7b-awq-int4"
# Load the full-precision model and its tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
quant_config = {
    "zero_point": True,   # asymmetric quantization with a zero point
    "q_group_size": 128,  # weights quantized in groups of 128
    "w_bit": 4,           # 4-bit weights
    "version": "GEMM"     # kernel variant used for the quantized matmuls
}
# Calibrate, quantize, and write the INT4 model and tokenizer to disk
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
Practical Guidance
For edge inference, our sizing rule (sketched in code after this list):
- Under 500M parameters, INT8: Runs comfortably in most edge environments. Embedding models (BGE, E5, GTE), small classifiers, language detectors.
- 500M-3B parameters, INT4/AWQ: Fits in constrained environments with careful memory management. Phi-3 Mini, Gemma 2B, TinyLlama.
- 3B-7B parameters, INT4/AWQ: Requires regional edge or generous memory budgets. Feasible on Cloudflare Workers AI (which manages model placement), tight elsewhere.
- Above 7B: Centralized. Do not force it.
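A minimal sketch of that rule as a routing hint, in TypeScript; the tier names and thresholds are heuristics lifted from the list above, not platform guarantees.

// Heuristic deployment-tier selection; thresholds mirror the sizing rule above.
type Tier = 'cdn-edge' | 'regional-edge' | 'centralized';
function deploymentTier(paramsMillions: number, quant: 'int8' | 'int4'): Tier {
  if (quant === 'int8' && paramsMillions < 500) return 'cdn-edge';
  if (quant === 'int4' && paramsMillions <= 3000) return 'cdn-edge';       // tight: needs careful memory management
  if (quant === 'int4' && paramsMillions <= 7000) return 'regional-edge';  // or a managed placement layer like Workers AI
  return 'centralized';                                                    // do not force it
}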
Measure quality on your actual task distribution, not on academic benchmarks. A model that scores 2% lower on MMLU might score 15% lower on your specific PII detection corpus. Build task-specific evals and run them against every quantization variant.
IV. Cloudflare Workers AI: Architecture and Reality
Cloudflare's Workers AI is the most mature CDN-layer inference platform. Understanding its architecture reveals both the potential and the constraints of edge AI.
How It Works
Workers AI does not run models in your Worker isolate. Your Worker makes an API call to @cf/ namespaced models. Cloudflare routes that call to the nearest PoP that has the model loaded. The model runs on Cloudflare's inference hardware -- a mix of GPUs and purpose-built accelerators distributed across their network.
// Cloudflare Workers AI -- text classification
export default {
async fetch(request: Request, env: Env): Promise<Response> {
const { text } = await request.json<{ text: string }>();
// Classification at the edge
const result = await env.AI.run(
"@cf/huggingface/distilbert-sst-2",
{ text }
);
return Response.json({
label: result[0].label,
confidence: result[0].score,
served_from: request.cf?.colo // which PoP handled this
});
}
};
// Embedding generation -- same edge path
export default {
async fetch(request: Request, env: Env): Promise<Response> {
const { texts } = await request.json<{ texts: string[] }>();
const embeddings = await env.AI.run(
"@cf/baai/bge-base-en-v1.5",
{ text: texts }
);
// Store in Vectorize or return directly
return Response.json({ embeddings: embeddings.data });
}
};
The Model Catalog
As of early 2026, Workers AI offers ~30 models across categories: text generation (Llama, Mistral, Phi), text classification, embeddings (BGE, GTE), translation, image classification, speech-to-text (Whisper), and image generation (Stable Diffusion variants). The catalog grows quarterly.
The models are pre-quantized. You do not choose the quantization method. Cloudflare optimizes for their hardware profile. This is a trade-off: you lose control over quantization parameters, but you gain automatic distribution and hardware-aware optimization.
Limitations You Will Hit
Cold start. The first request to a model at a given PoP incurs model loading time -- 1-5 seconds depending on model size. Subsequent requests are fast. For latency-sensitive production use, you need warm-keeping strategies: synthetic health checks, minimum request rates, or acceptance of occasional cold-start penalties.
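One mitigation we use is a cheap warm-keeping route that synthetic checks (ideally issued from regions near your users) hit on a schedule. A sketch, reusing the DistilBERT model from earlier; the route name and probe text are arbitrary.

// Warm-keeping sketch: external synthetic checks hit /warm on a schedule so
// the nearest PoP keeps the model loaded. The probe latency doubles as a
// cheap cold-start signal for monitoring.
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const url = new URL(request.url);
    if (url.pathname === '/warm') {
      const started = Date.now();
      await env.AI.run('@cf/huggingface/distilbert-sst-2', { text: 'warm-up probe' });
      return Response.json({ warm: true, latency_ms: Date.now() - started });
    }
    return new Response('not found', { status: 404 });
  }
};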
Model selection. You run what Cloudflare offers. No custom models (yet). No fine-tuned variants of catalog models. If you need a specialized model, Workers AI is not your path.
Context windows. Smaller than centralized deployments. Llama models on Workers AI typically support 2048-4096 tokens, not the 32K-128K you get from a dedicated vLLM instance.
Rate limits and pricing. Generous free tier (10,000 neurons/day), but production workloads need the paid tier. Pricing is per-neuron (a Cloudflare-specific unit roughly proportional to compute), which makes cost prediction less intuitive than per-token pricing.
Streaming only for some models. Text generation supports streaming; classification and embedding calls are batch-response only.
Wrangler Configuration
# wrangler.toml
name = "edge-inference-service"
main = "src/index.ts"
compatibility_date = "2026-01-01"
[ai]
binding = "AI"
Simple. The complexity lives in knowing which models to use and when to fall back.
V. ONNX Runtime Web and WebGPU: The Browser as Inference Engine
The CDN edge is not the only edge. The browser is.
ONNX Runtime Web runs ONNX models in the browser using WebAssembly (CPU) or WebGPU (GPU acceleration). This is true on-device inference -- no server call at all. The model downloads once, caches in the browser, and runs entirely client-side.
// Browser-side inference with ONNX Runtime Web
import * as ort from 'onnxruntime-web';
// WASM tuning; the provider list below prefers WebGPU and falls back to WASM
ort.env.wasm.numThreads = 4;
// A lightweight tokenizer is assumed to be loaded elsewhere in the app
declare const tokenizer: { encode(text: string): number[] };
async function classifyText(text: string): Promise<string> {
  // In production, create the session once and reuse it -- creation includes
  // model download/loading and graph optimization
  const session = await ort.InferenceSession.create(
    '/models/distilbert-classification.onnx',
    { executionProviders: ['webgpu', 'wasm'] }
  );
  // Tokenize and build the input tensor (add an attention_mask input if your
  // ONNX export requires one)
  const encoded = tokenizer.encode(text);
  const inputIds = new ort.Tensor('int64',
    BigInt64Array.from(encoded.map(BigInt)),
    [1, encoded.length]
  );
  const results = await session.run({ input_ids: inputIds });
  const logits = results.logits.data as Float32Array;
  // Assumes the usual SST-2 head layout: index 0 = negative, index 1 = positive
  return logits[1] > logits[0] ? 'positive' : 'negative';
}
When Browser Inference Wins
- Privacy-absolute workflows. Legal document analysis where text must never leave the device. Medical pre-screening. Personal finance categorization.
- Offline capability. PWAs that classify, embed, or moderate without connectivity.
- Cost elimination. If you serve 10M classification requests/month, moving them to the browser saves the entire inference bill. The user's device pays the compute cost.
When It Does Not
- Model size. A 500MB ONNX model download on first visit is a non-starter for most web applications. Practical limit: 50-150MB for acceptable UX.
- Device variance. A 2024 MacBook Pro with WebGPU will run inference 20x faster than a 2020 budget Android phone with WASM fallback. You must design for the floor, not the ceiling.
- WebGPU availability. Supported in Chrome and Edge; Firefox support is still behind flags, and Safari support is incomplete as of early 2026. Your fallback to WASM must be seamless.
Build for progressive enhancement: try WebGPU, fall back to WASM, fall back to edge API, fall back to centralized. Each tier is a capability, not a requirement.
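A sketch of that fallback chain; runOnnxLocally stands in for the ONNX Runtime Web path above, and both URLs are placeholders for your own endpoints.

// Progressive enhancement: each tier is tried only if the one above it fails.
declare function runOnnxLocally(text: string): Promise<string>;
async function classifyWithFallback(text: string): Promise<string> {
  try {
    // Tiers 1-2: on-device (WebGPU preferred, WASM fallback handled by ORT)
    return await runOnnxLocally(text);
  } catch { /* model not cached, OOM, unsupported device -- fall through */ }
  try {
    // Tier 3: CDN-edge inference
    const res = await fetch('https://edge.example.com/classify', {
      method: 'POST',
      body: JSON.stringify({ text })
    });
    if (res.ok) return (await res.json()).label;
  } catch { /* edge unavailable -- fall through */ }
  // Tier 4: centralized API -- slowest, most capable
  const res = await fetch('https://api.example.com/classify', {
    method: 'POST',
    body: JSON.stringify({ text })
  });
  return (await res.json()).label;
}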
VI. Practical Use Cases: Where Edge Inference Earns Its Keep
Not every ML task belongs at the edge. Here is where we deploy edge inference in production, and why.
Classification
Language detection, sentiment analysis, intent routing. These are the canonical edge inference tasks. Models are small (50-200M parameters), latency matters, and the output is a label -- not generated text. A DistilBERT classifier at the edge returns a label in 15-40ms. The same call to a centralized API takes 150-400ms including network.
Embedding Generation
Search-as-you-type, semantic similarity, recommendation. Embedding models (BGE-base, E5-small) are 100-400M parameters and produce fixed-length vectors. Generate embeddings at the edge, query your vector database from the nearest region. The embedding step is the latency-sensitive part; the vector search can tolerate a regional hop.
User types query
|
v
Edge: generate embedding (BGE-base, ~30ms)
|
v
Regional: vector search (Pinecone/Qdrant, ~20ms)
|
v
Edge: format and return results (~5ms)
Total: ~55ms vs ~250ms centralized
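A sketch of that flow as a single Worker, assuming the BGE model from earlier and a Vectorize index bound as VECTORS; swapping in a fetch() to Pinecone or Qdrant at the search step changes nothing upstream.

// Search-as-you-type sketch: embed at the edge, then query a vector index.
// The VECTORS binding name is an assumption; replace the query() call with a
// fetch() to your regional vector database if your index lives there instead.
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const { query } = await request.json<{ query: string }>();
    // Edge: embedding generation (tens of milliseconds on a warm PoP)
    const embedding = await env.AI.run('@cf/baai/bge-base-en-v1.5', { text: [query] });
    const vector = embedding.data[0];
    // Vector search: one hop to the index, tolerant of a regional round trip
    const results = await env.VECTORS.query(vector, { topK: 5 });
    return Response.json({
      hits: results.matches.map((m) => ({ id: m.id, score: m.score }))
    });
  }
};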
PII Detection
Pre-flight data scanning before data leaves the edge. Run a small NER model at the edge to detect names, emails, phone numbers, addresses. Flag or redact before the request reaches your application server. This is not just a latency play -- it is a data governance play. PII that never leaves the edge PoP never enters your centralized logs.
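A minimal pre-flight sketch; the regexes catch structured identifiers (emails, phone numbers), and a small edge NER model -- where your platform's catalog offers one -- layers on top for the names and addresses that patterns miss.

// Pre-flight PII redaction at the edge. Regexes handle structured identifiers;
// an NER model can be layered on top before the payload leaves the PoP.
const PII_PATTERNS: Record<string, RegExp> = {
  email: /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g,
  phone: /\+?\d[\d\s().-]{7,}\d/g,
};
function redact(text: string): { redacted: string; flags: string[] } {
  const flags: string[] = [];
  let redacted = text;
  for (const [kind, pattern] of Object.entries(PII_PATTERNS)) {
    const next = redacted.replace(pattern, `[${kind.toUpperCase()}]`);
    if (next !== redacted) flags.push(kind);  // record what was found and removed
    redacted = next;
  }
  return { redacted, flags };
}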
Content Moderation
Toxicity scoring, NSFW detection, spam classification. These are gate functions: they run before content enters your system. Edge placement means moderation decisions happen at intake speed, not at queue-processing speed. For user-generated content platforms, this cuts the window of exposure from seconds to milliseconds.
Where Edge Inference Does Not Belong
- Multi-turn conversation. Requires state, context windows, and model sizes that exceed edge budgets.
- Long-form generation. Token-by-token generation at the edge is slower than a single round trip to a centralized GPU for anything beyond a sentence.
- RAG with large corpora. Retrieval-augmented generation needs vector stores, document chunks, and synthesis models that do not fit edge constraints.
- Fine-tuned domain models. If your model is custom-trained, Workers AI cannot serve it. You need your own infrastructure.
VII. The Cost Model: Edge vs. Centralized GPU
Cost comparisons in inference are treacherous. Units differ. Pricing models differ. But the directional economics are clear.
Centralized GPU
A dedicated A100 instance (AWS p4d.24xlarge, 8x A100) costs ~$32/hour. At full utilization with batching, it serves roughly 500-2000 requests/second for a 7B model, depending on sequence length. That is $0.004-$0.018 per 1000 requests at full load. But you pay for the instance whether it is busy or idle. Utilization below 60% doubles your effective cost.
Edge Inference (Cloudflare Workers AI)
Workers AI pricing is per-neuron, with the free tier covering 10,000 neurons/day. On the paid plan, costs scale with usage. For classification and embedding tasks, typical cost is $0.01-$0.05 per 1000 requests. No idle cost. No GPU reservation.
The Break-Even
Cost comparison (classification, 1M requests/day):
Centralized GPU (reserved):
Instance: $32/hr * 24 = $768/day
At 1M req/day: $0.77 per 1000 requests
(assuming single-model, modest batching)
Centralized GPU (serverless, e.g., Replicate):
~$0.10-$0.30 per 1000 requests
Edge (Workers AI):
~$0.01-$0.05 per 1000 requests
Edge (browser/ONNX):
$0.00 (user's device)
Edge wins decisively for bursty, classification-heavy workloads. Centralized wins for sustained, high-throughput generation. The crossover point depends on your traffic pattern, model complexity, and whether you can keep GPUs hot.
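A quick way to see the crossover is to express reserved-GPU cost as a function of utilization; the figures below reuse the rough estimates above and are illustrative only.

// Effective reserved-GPU cost per 1k requests as a function of utilization.
// Inputs reuse the estimates above ($32/hr, up to ~2000 req/s at full load).
function effectiveCostPer1k(hourlyUsd: number, peakRps: number, utilization: number): number {
  const servedPerHour = peakRps * 3600 * utilization;
  return (hourlyUsd / servedPerHour) * 1000;
}
effectiveCostPer1k(32, 2000, 1.0);  // ~$0.004 per 1k requests at full load
effectiveCostPer1k(32, 2000, 0.1);  // ~$0.044 per 1k -- a bursty workload already
                                    // lands in the edge pricing band, before any
                                    // over-provisioning for peak traffic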
For most applications, the answer is both: edge for the fast path, centralized for the heavy path. The architecture is not either/or. It is a routing decision.
VIII. Architecture Patterns That Survive Production
Pattern 1: Edge-First with Cloud Fallback
The default pattern. Route all inference-eligible requests to the edge. If the edge model cannot handle the request (too complex, context too long, model unavailable), fall back to centralized.
Request
|
v
+----------------+
| Edge Router |
| (complexity |
| estimator) |
+-------+--------+
|
+--------+--------+
| |
Simple/Fast Complex/Long
| |
v v
+----------+ +---------------+
| Edge AI | | Centralized |
| (Workers | | GPU Cluster |
| AI, 30ms| | (vLLM, 400ms)|
+----------+ +---------------+
| |
+--------+--------+
|
v
Response
// Edge-first routing in a Cloudflare Worker
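// Assumed for this sketch: the request shape and the edge model map.
// The model ids match the catalog entries used earlier in this article.
interface InferenceRequest {
  text: string;
  task: 'classify' | 'embed' | 'generate';
}
const EDGE_MODELS: Record<string, string> = {
  classify: '@cf/huggingface/distilbert-sst-2',
  embed: '@cf/baai/bge-base-en-v1.5'
};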
async function handleInference(
request: Request,
env: Env
): Promise<Response> {
const { text, task } = await request.json<InferenceRequest>();
// Fast path: classification and embedding at edge
if (task === 'classify' || task === 'embed') {
try {
const result = await env.AI.run(
EDGE_MODELS[task],
{ text }
);
return Response.json({ result, tier: 'edge' });
} catch (e) {
// Fall through to centralized
}
}
// Slow path: generation and complex tasks centralized
const response = await fetch(env.CENTRAL_INFERENCE_URL, {
method: 'POST',
headers: {
'Authorization': `Bearer ${env.INFERENCE_API_KEY}`,
'Content-Type': 'application/json'
},
body: JSON.stringify({ text, task })
});
const result = await response.json();
return Response.json({ ...result, tier: 'centralized' });
}
Pattern 2: Split Inference
Run the first stage of a pipeline at the edge, send intermediate results to centralized for the expensive stage. Common for search: embed at edge, retrieve and re-rank centrally.
Edge PoP (Oslo) Region (eu-west-1)
+-----------------+ +-------------------+
| Embed query | vector | Vector search |
| (BGE, 30ms) |--------->| Re-rank (cross- |
| | | encoder, 80ms) |
+-----------------+ | Generate snippet |
| (LLM, 200ms) |
+-------------------+
Pattern 3: Speculative Edge Decoding
A frontier pattern borrowed from speculative decoding in LLMs. A small edge model generates a draft response. A centralized model verifies and corrects. If the edge draft is good enough (and for simple queries, it often is), the user sees sub-100ms latency. If not, the centralized model's correction arrives within the normal latency window.
This works for predictable queries: FAQ responses, templated answers, simple completions. It does not work for creative or reasoning-heavy tasks where the small model's drafts are consistently wrong.
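A sketch of the routing half of the pattern; the draft model id, the confidence scorer, and the verification call are placeholders -- the hard part, deciding when a draft is good enough, lives in draftConfidence().

// Speculative edge decoding sketch: serve the edge draft when confidence is
// high and verify asynchronously; otherwise fall back to centralized.
// The model id, draftConfidence() heuristic, and endpoint are placeholders.
function draftConfidence(prompt: string, draft: { response?: string }): number {
  return draft.response && draft.response.length > 0 ? 0.95 : 0.0; // stand-in scorer
}
async function speculativeAnswer(
  request: Request,
  env: Env,
  ctx: ExecutionContext
): Promise<Response> {
  const { prompt } = await request.json<{ prompt: string }>();
  const draft = await env.AI.run('@cf/meta/llama-3-8b-instruct', { prompt });
  if (draftConfidence(prompt, draft) >= 0.9) {
    // Good enough: the user sees the draft at edge latency; verification runs
    // after the response so quality regressions still surface in logs.
    ctx.waitUntil(fetch(env.CENTRAL_INFERENCE_URL, {
      method: 'POST',
      body: JSON.stringify({ prompt, draft: draft.response, mode: 'verify' })
    }));
    return Response.json({ answer: draft.response, tier: 'edge-draft' });
  }
  // Draft not trusted: take the normal centralized path.
  const central = await fetch(env.CENTRAL_INFERENCE_URL, {
    method: 'POST',
    body: JSON.stringify({ prompt })
  });
  return Response.json({ ...(await central.json()), tier: 'centralized' });
}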
Pattern 4: Edge Preprocessing, Central Reasoning
The most conservative pattern. The edge handles tokenization, PII detection, input validation, and formatting. The central cluster handles all inference. Edge adds 5-15ms of preprocessing but removes garbage-in requests, reduces token count, and enforces data governance before the request ever hits a GPU.
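A sketch of that gatekeeping worker; the four-characters-per-token estimate and the inline email regex are simplifications, and the origin is the same CENTRAL_INFERENCE_URL binding used in Pattern 1.

// Edge preprocessing sketch: validate, budget-check, and redact before the
// request ever reaches a GPU. The token estimate and regex are placeholders;
// the full redaction sketch from the PII section slots in here.
const MAX_INPUT_TOKENS = 4096;
async function preprocessAndForward(request: Request, env: Env): Promise<Response> {
  const body = await request.json<{ text?: string }>();
  if (!body.text || body.text.trim().length === 0) {
    return Response.json({ error: 'empty input' }, { status: 400 });
  }
  // Cheap token estimate (~4 characters per token) rejects oversized payloads early
  const estimatedTokens = Math.ceil(body.text.length / 4);
  if (estimatedTokens > MAX_INPUT_TOKENS) {
    return Response.json({ error: 'input too long', estimatedTokens }, { status: 413 });
  }
  // Data governance: strip obvious structured PII before the payload leaves the PoP
  const redacted = body.text.replace(/[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g, '[EMAIL]');
  return fetch(env.CENTRAL_INFERENCE_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text: redacted, estimatedTokens })
  });
}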
IX. The Operational Reality: Monitoring Edge Inference at Scale
Edge inference introduces monitoring challenges that centralized deployments do not have.
The Observability Problem
Your model runs in 300+ locations. Cold start behavior differs by PoP. Hardware varies. Network conditions vary. A quality degradation in Asia-Pacific might not appear in your aggregate metrics.
You need per-PoP monitoring. Not optional.
// Structured logging for edge inference observability
interface InferenceMetric {
timestamp: number;
colo: string; // Cloudflare PoP identifier
model: string;
task: 'classify' | 'embed' | 'generate';
latency_ms: number;
tokens_in: number;
tokens_out: number;
cold_start: boolean;
fallback: boolean;
confidence: number; // model confidence score
error?: string;
}
async function instrumentedInference(
env: Env,
request: Request,
task: string,
input: unknown
): Promise<{ result: unknown; metric: InferenceMetric }> {
const start = performance.now();
const colo = request.cf?.colo ?? 'unknown';
try {
const result = await env.AI.run(MODEL_MAP[task], input);
const metric: InferenceMetric = {
timestamp: Date.now(),
colo,
model: MODEL_MAP[task],
task: task as InferenceMetric['task'],
latency_ms: performance.now() - start,
tokens_in: estimateTokens(input),
tokens_out: estimateTokens(result),
cold_start: (performance.now() - start) > COLD_START_THRESHOLD,
fallback: false,
confidence: extractConfidence(result)
};
// Ship to analytics (non-blocking)
env.ANALYTICS.writeDataPoint(metric);
return { result, metric };
} catch (error) {
// ... error tracking with PoP context
}
}
What to Alert On
- P99 latency per PoP exceeding 3x the P50 baseline. Cold starts cause spikes; sustained elevation means something is wrong.
- Fallback rate above threshold. If more than 5% of edge requests fall back to centralized, investigate. Model loading failures, memory pressure, or request patterns that exceed edge capability.
- Confidence drift. If average model confidence for a given task drops over time, your input distribution has shifted. The model may need retraining or your edge/central routing threshold needs adjustment.
- Cold start frequency. Track how often each PoP serves a cold-start request. If a PoP is cycling constantly, your traffic is too sparse there for edge inference to be worthwhile. Route that PoP to centralized and save the cold-start penalty. A sketch of these checks as code follows this list.
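Those rules reduce to checks over per-PoP aggregates. A sketch; the aggregate shape, and the drift and cold-start thresholds, are assumptions about your analytics pipeline.

// Alert evaluation sketch over per-PoP aggregates.
interface PoPAggregate {
  colo: string;
  p50_ms: number;
  p99_ms: number;
  fallback_rate: number;    // 0..1
  cold_start_rate: number;  // 0..1
  avg_confidence: number;
  baseline_confidence: number;
}
function evaluateAlerts(agg: PoPAggregate): string[] {
  const alerts: string[] = [];
  if (agg.p99_ms > 3 * agg.p50_ms) alerts.push(`${agg.colo}: P99 above 3x the P50 baseline`);
  if (agg.fallback_rate > 0.05) alerts.push(`${agg.colo}: fallback rate above 5%`);
  if (agg.avg_confidence < agg.baseline_confidence - 0.05)
    alerts.push(`${agg.colo}: confidence drift -- input distribution may have shifted`);
  if (agg.cold_start_rate > 0.2)
    alerts.push(`${agg.colo}: frequent cold starts -- consider routing this PoP to centralized`);
  return alerts;
}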
The Dashboard You Actually Need
Forget vanity metrics. Your edge inference dashboard has three panels:
- Latency heatmap by PoP. Color-coded world map showing P50 inference latency at each location. Red means "investigate or reroute."
- Fallback waterfall. For every request that fell back from edge to centralized: why, from where, and how much latency it added.
- Cost per useful inference. Not cost per request -- cost per request that returned a result the application actually used. A classification with 0.51 confidence that gets re-checked centrally is not a useful edge inference; it is a wasted one.
X. Limitations: The Honest List
Edge inference is not magic. Here is what will bite you.
Model size ceiling. The largest model you can practically run at the CDN edge is 3-7B parameters, quantized to INT4. That rules out the models most people think of when they say "AI" -- GPT-4-class, Claude-class, Llama 70B. Edge AI is small-model AI.
No fine-tuning at edge. You cannot deploy custom fine-tuned models to Workers AI or similar platforms (as of early 2026). If your use case requires domain-specific models, you need your own serving infrastructure, which means regional edge at best.
Memory pressure. Edge isolates have hard memory limits. A model that fits in memory might still OOM if your request batching is too aggressive or if the runtime has memory fragmentation. Test under realistic concurrent load, not single-request benchmarks.
Cold start is real. Model loading takes 1-5 seconds. For low-traffic deployments, every request might be a cold start. Warm-keeping adds cost and complexity.
Quality variance across quantization. A model quantized for edge may produce subtly different results than the same model at full precision. For classification, this usually does not matter. For generation or extraction, it can matter enormously. Always evaluate at the quantization level you deploy.
Vendor lock-in. Workers AI models are Cloudflare-specific API calls. ONNX Runtime Web is portable but requires you to manage model distribution. There is no universal edge inference API. Plan your abstraction layer early.
XI. Why This Matters for Edge-First Architecture
We build on Cloudflare Workers. We think in edge-first terms. This is not tribal loyalty; it is a bet on where the capability curve is heading.
The trajectory is clear: models get smaller and better. Quantization gets less lossy. Edge hardware gets more capable. WebGPU matures. The set of tasks you can run at the CDN layer expands every quarter.
Three years ago, "AI at the edge" meant feature flags and A/B test logic. Two years ago, it meant basic text classification. Today, it means embedding generation, multi-language NER, content moderation, and small generative models. Next year, it will mean more.
The architecture that wins is the one that can absorb this expansion without redesign. Edge-first with cloud fallback is that architecture. It starts conservative -- classify and embed at the edge, generate centrally -- and naturally absorbs new capability as models shrink and edge hardware grows.
The capability migration over time:
2024: Classification, language detection
2025: Embeddings, NER, content moderation
2026: Small generation, PII detection, speculative decoding
2027: Multi-modal classification, real-time translation (projected)
2028: On-device + edge collaborative inference (projected)
──────────────────────────────────────────────>
More tasks move from centralized to edge
For Gothar's clients, this translates to concrete value: lower latency for user-facing intelligence, reduced data exposure through edge-local processing, and infrastructure costs that scale with usage rather than reservations. The clients who adopt edge-first inference patterns now will have a structural advantage as the capability frontier moves outward.
References
- Cloudflare Workers AI Documentation. https://developers.cloudflare.com/workers-ai/
- ONNX Runtime Web. https://onnxruntime.ai/docs/tutorials/web/
- WebGPU W3C Specification. https://www.w3.org/TR/webgpu/
- Lin, J. et al. "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." MLSys 2024. https://arxiv.org/abs/2306.00978
- Frantar, E. et al. "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." ICLR 2023. https://arxiv.org/abs/2210.17323
- Cloudflare AI Gateway and Model Routing. https://developers.cloudflare.com/ai-gateway/
- Vercel AI SDK Documentation. https://sdk.vercel.ai/docs
- Deno Deploy and Edge Functions. https://deno.com/deploy
- Leviathan, Y. et al. "Fast Inference from Transformers via Speculative Decoding." ICML 2023. https://arxiv.org/abs/2211.17192
- vLLM: Easy, Fast, and Cheap LLM Serving. https://github.com/vllm-project/vllm
- Cloudflare Network Map and PoP Locations. https://www.cloudflare.com/network/
- ONNX Model Zoo. https://github.com/onnx/models
There is something fitting about intelligence at the edge. For centuries, decisions were centralized -- in courts, in capitals, in mainframes. The arc of distributed systems bends toward subsidiarity: handle things at the lowest level that can handle them competently. A CDN node that can classify a comment does not need to ask Virginia for permission. A browser that can embed a query does not need to cross an ocean. This is not just engineering efficiency. It is a quiet argument about where intelligence should live: not in a cathedral, but in the parish. Close to the ground. Close to the need. Close to the human who asked the question and deserves an answer before they finish blinking.