Portability, Proximity, Provenance: A CTO's Second Playbook for the AI-Native Company
*Follow-up to The AI-Native CTO*
This piece goes deep on three levers that quietly determine whether your AI strategy compounds or calcifies: portability (can you move models without pain?), proximity (can you run intelligence where it matters—on device and at the edge?), and provenance (can you prove where data and outputs came from, in a way auditors and partners will accept?). These are not decorative concerns. They are the difference between owning a system and merely renting one.
We'll make the case, show the seams, and give you a short operational spine you can implement in a sprint.
---
I. Portability as Power
Why portability moved from nice-to-have to balance-sheet item
When model cost curves move faster than your contracts, exit becomes a capability. If you can export a model, retarget it to another runtime, and keep your application protocol stable, you can arbitrage latency and price—and you can say "no" when a vendor bundles you into a corner.
Two pragmatic pillars make this tractable in 2026:
- Model interchange: ONNX (Open Neural Network Exchange) is the closest thing to a lingua franca. It standardizes operators and a file format so trained models move between frameworks, compilers, and runtimes without a rewrite. For your team, that means the difference between refactoring an app and swapping a serialized graph.
- Serving abstraction: A unified inference API lets apps call local or remote models the same way. Projects like vLLM present an OpenAI-compatible endpoint while delivering high-throughput decoding, prefix caches, and sharded state under the hood. Keep your application speaking one dialect; move the back end at will. On NVIDIA hardware, TensorRT-LLM gives you another "face": a deeply optimized engine you can slot behind the same front door.
Put bluntly: portability is the operationalization of your negotiating position.
A portability rubric you can enforce next quarter
Adopt three rules and audit them quarterly:
- Exit plan in CI: Every material model exports successfully (e.g., ONNX) and spins up on at least two serving stacks (say, vLLM and TensorRT-LLM). The export/stand-up/validate cycle is a test target, not a "someday"; a minimal export check is sketched after this list.
- Stable interface: Product teams talk to inference via a single API (OpenAI-compatible REST or internal gRPC). No vendor SDKs in product code. Stub in routing so you can fail over or "burst" to alternate capacity without touching business logic.
- Comparative evals: Your evaluation harness runs against all supported stacks and reports latency, cost per successful task, and quality deltas. A model that can't move is a risk; a model that moves but degrades without you noticing is a liability.
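To make the first rule concrete, here is a minimal sketch of an export check as a CI test, assuming a PyTorch model and pytest; the toy model, file name, and tolerance are placeholders for your own. The second half of the rule (standing the exported model up on a second serving stack) belongs in a separate CI job with the same pass/fail discipline.
# ci/test_onnx_export.py -- fail the build if a material model stops exporting
import torch, onnx, onnxruntime as ort

def test_model_exports_and_matches():
    model = torch.nn.Linear(16, 4).eval()             # stand-in for a real production model
    example = torch.randn(1, 16)
    torch.onnx.export(model, example, "model.onnx", opset_version=17)
    onnx.checker.check_model(onnx.load("model.onnx"))  # structural validity of the exported graph
    session = ort.InferenceSession("model.onnx")
    ort_out = session.run(None, {session.get_inputs()[0].name: example.numpy()})[0]
    torch_out = model(example).detach().numpy()
    assert abs(ort_out - torch_out).max() < 1e-4       # parity with the source framework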
"But our ops are unique." Good—prove it, don't maroon it
You will find non-portable edges: custom ops, fused kernels, tokenizer quirks. That is not an argument against portability; it's a list of TODOs. Wrap them in adapters. Document what doesn't export. Track how much of your throughput depends on non-standard ops. When the market shifts—and it will—being able to mostly move, with known gaps, is still leverage.
A tiny example (the whole point in about a dozen lines)
Your app calls a familiar endpoint; ops decide where it lands.
# vLLM with an OpenAI-compatible server
pip install vllm
python -m vllm.entrypoints.openai.api_server --model your/model --host 0.0.0.0 --port 8000
# application code (stays the same if you swap runtimes)
import os, requests
# ops choose the target via configuration; the app only knows the contract
url = os.getenv("INFERENCE_URL", "http://localhost:8000/v1/chat/completions")
payload = {"model": "your/model", "messages": [{"role": "user", "content": "Summarize this policy."}]}
print(requests.post(url, json=payload, timeout=10).json()["choices"][0]["message"]["content"])
Tomorrow you point INFERENCE_URL at a TensorRT-LLM-backed service or a managed endpoint; the product code is unchanged. You just bought yourself options.
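The routing stub mentioned in the rubric can be just as small. A minimal sketch, assuming two OpenAI-compatible endpoints supplied by ops; PRIMARY_URL and FALLBACK_URL are illustrative names, not an established convention.
# a thin failover shim: product code calls chat(); ops decide the endpoints
import os, requests

ENDPOINTS = [os.getenv("PRIMARY_URL", "http://localhost:8000/v1/chat/completions"),
             os.getenv("FALLBACK_URL", "")]

def chat(messages, model="your/model", timeout=10):
    last_error = None
    for url in [u for u in ENDPOINTS if u]:
        try:
            resp = requests.post(url, json={"model": model, "messages": messages}, timeout=timeout)
            resp.raise_for_status()
            return resp.json()["choices"][0]["message"]["content"]
        except requests.RequestException as exc:   # on failure, try the next endpoint
            last_error = exc
    raise RuntimeError("all inference endpoints failed") from last_error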
---
II. Proximity: On-Device and Edge as a First-Class Design Choice
Running intelligence near the user is not a stunt; it's a feature with three non-substitutable benefits:
- Latency that a round trip can't touch (sub-blink text, real-time transforms).
- Privacy by locality (sensitive data never leaves the device).
- Resilience when connectivity is bad or regulated boundaries block cloud use.
The on-device landscape is real now
Apple Core ML
Core ML now explicitly targets generative workloads: stateful transformers, advanced compression, and efficient execution of transformer ops. The developer pitch is clear: run fully on device for responsiveness and privacy. If you ship to iOS/macOS and you're not exploring this path for at least some tasks (summarize, redact, classify), you're overpaying for round trips.
Apple's research notes about Foundation Models signal the company's chosen footprint for on-device tasks (small, reliable, "production-quality" text operations) and the intent to keep those experiences snappy and contained on Apple silicon. Translation for a CTO: expect fast paths for compact LLMs and official hooks that won't break annually.
Android AICore + Gemini Nano
Google's AICore service exposes Gemini Nano—Google's smallest general model—for on-device tasks via ML Kit GenAI APIs. That means you can ship summarize/rewrite/classify flows that run offline and honor local data boundaries. This isn't a conference demo; it's documented platform surface. If you're already using ML Kit, this path is surprisingly low friction.
One more consequential shift: NNAPI, the venerable acceleration API introduced in Android 8.1, is deprecated as of Android 15; Google provides a migration guide. For many teams this means less direct NNAPI plumbing and more reliance on higher-level system services (AICore, Lite runtime paths) or framework providers. Plan your deprecation work; don't wake up surprised.
WebGPU (the thin client learns new tricks)
In the browser, WebGPU reached Candidate Recommendation status, with the W3C inviting implementations and publishing rolling CR drafts through late 2025. This is not just graphics bragging—it unlocks practical GPU compute in the web sandbox: client-side feature extraction, vector math, tokenization, and small-model inference without install steps. The "thin client" is getting a real math engine.
Picking the cut line: what runs where?
A simple, durable rule: first token locally if it helps the experience, heavy synthesis where cost and context are abundant. In practice:
- Device-worthy: redaction/pre-classification; preview summaries; privacy-sensitive transforms; "first token" UX; fallback modes.
- Server or edge: long-context synthesis, multi-doc reasoning, batch enrichment, anything requiring cross-user corpora.
Instrument both sides. If a device path underperforms, you'll see it in time-to-useful-token and in completion quality under low-signal conditions.
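A minimal sketch of that cut line in code, assuming a task label is available at call time; the task names, the default, and the metric wrapper are all illustrative.
# route by task class; measure time-to-useful-token wherever the call lands
import time

DEVICE_TASKS = {"redact", "pre_classify", "preview_summary"}   # privacy- or latency-sensitive

def route(task: str) -> str:
    # heavy synthesis (long-context, multi-doc, batch) defaults to server/edge
    return "device" if task in DEVICE_TASKS else "server"

def time_to_first_token(generate):
    """generate is any callable returning a token iterator; returns (first_token, latency_ms)."""
    start = time.perf_counter()
    first = next(iter(generate()))
    return first, (time.perf_counter() - start) * 1000.0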
Engineering footnotes you'll thank yourself for later
- Model versioning by target: Keep builds for (server, Apple, Android, WebGPU) with per-target conversion & quantization steps in CI. Measure quality drift; don't assume an 8-bit server model and an 8-bit device model behave the same. (A build-matrix sketch follows this list.)
- Pipelines respect privacy by default: If you can do PII filtering, redaction, or early ranking on device, do it—and record that choice in your model cards. Customers notice. Auditors notice more.
- Edge is not "just another region." Expect cache invalidation and partial feature states. Build explicit health probes for device/edge paths so product doesn't silently degrade.
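A sketch of the per-target build matrix from the first footnote, assuming CI runs one conversion job per entry; every script name and threshold here is invented.
# one CI job per target: convert, quantize, evaluate, and gate on quality drift
BUILD_MATRIX = {
    "server":  {"convert": "tools/export_onnx.py",    "quantize": "int8", "max_quality_drop": 0.01},
    "apple":   {"convert": "tools/convert_coreml.py", "quantize": "int8", "max_quality_drop": 0.02},
    "android": {"convert": "tools/package_aicore.py", "quantize": "int8", "max_quality_drop": 0.02},
    "webgpu":  {"convert": "tools/pack_webgpu.py",    "quantize": "int8", "max_quality_drop": 0.03},
}

def gate_on_drift(target: str, baseline_score: float, target_score: float) -> None:
    allowed = BUILD_MATRIX[target]["max_quality_drop"]
    drift = baseline_score - target_score
    if drift > allowed:
        raise AssertionError(f"{target}: quality drift {drift:.3f} exceeds budget {allowed}")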
---
III. Provenance: From "Don't Scrape Me" to Verifiable Lineage
You cannot lead an AI program in 2026 unless you can answer, with receipts: Where did this data come from? What are we allowed to do with it? What did we emit, and how do we prove it? The calendars are no longer abstract.
The regulatory clock is explicit
- The EU AI Act entered into force 1 August 2024. Prohibited-use rules and AI literacy duties kicked in 2 February 2025. General-purpose AI obligations became applicable 2 August 2025. High-risk system rules ramp across 2026–2027 (with embedded regulated products given extra runway to 2027). If you operate in or sell to the EU, your obligations aren't hypothetical. They are dated.
- Community timelines align: independent trackers summarize the staggered application—12 months post-entry-into-force for GPAI, 24–36 months for high-risk classes. Use them to sanity-check your internal plan.
- Harmonised standards (the practical, testable ways to prove compliance) are running late and may take up to ~3 years from request to publication. CEN/CENELEC have publicly discussed acceleration measures, but even with fast-path procedures, finalization and the "legal presumption of conformity" via Official Journal publication will lag. Plan to be compliant on principles before standards save you.
The supply-chain response: C2PA content credentials
For media and document lineage, the C2PA specification gives you a concrete way to attach content credentials—cryptographically signed manifests that travel with assets. Version 2.2 (May 2025) tightened important mechanics: update manifests, binding modes, redaction semantics. If your product ingests or emits text, images, audio, or video at scale, plan for C2PA both inbound (trust signals) and outbound (transparent claims).
A good mental model: a JPEG with a passport. Every transformation leaves a stamp. When a customer, a partner, or a regulator asks "who made this and how," you have more than a log—you have a verifiable embedded story.
A minimal provenance spine you can implement in a sprint
- Data Rights Register: For each corpus—license basis (contract/terms/statute), permitted uses (train/fine-tune/retrieve), geofencing, retention windows, contact for re-permissioning. Ingestion jobs consult the register before they run; a minimal lookup sketch follows this list.
- Model & Prompt Cards: You've seen the papers; treat them as living operational docs. Record intended use, eval setup, and known limits. They're tedious only until someone asks. Then they're oxygen.
- Credentials on I/O: Attach C2PA manifests to assets you emit; preserve manifests on assets you ingest; validate and surface trust in your UI. This is more than compliance—your users will begin to expect it where content risk is high.
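A minimal sketch of the register as a gate in front of ingestion, assuming one record per corpus with the fields listed above; the corpus entry and contact address are invented. The important property is that it fails closed: a corpus not in the register does not get ingested.
# ingestion consults the register before it runs; no entry, no ingestion
from datetime import date

DATA_RIGHTS_REGISTER = {
    "support_tickets_2025": {
        "license_basis": "contract",
        "permitted_uses": {"retrieve", "fine_tune"},
        "geofence": {"EU", "US"},
        "retention_until": date(2027, 12, 31),
        "repermission_contact": "legal@example.com",
    },
}

def may_ingest(corpus: str, use: str, region: str, today: date | None = None) -> bool:
    entry = DATA_RIGHTS_REGISTER.get(corpus)
    if entry is None:
        return False                                   # unknown corpus: fail closed
    return (use in entry["permitted_uses"]
            and region in entry["geofence"]
            and (today or date.today()) <= entry["retention_until"])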
---
IV. The Programmer as Architect of Execution
Intent made executable
Programming is not code; programming is intent made executable. We stand between a human's foggy desire and a machine's pitiless literalism, translating "make it pop but professional, edgy but safe" into a choreography a silicon choir can actually sing. Code is the fossil of that translation—the sheet music after the melody has already taken up residence in your head. Execution is the point. Code is the sediment it leaves behind.
Then the punchline arrived: large language models prefer prose.
After decades of shaving language down to tokens, keywords, loops—speaking to machines like monks of minimalism—the machines now behave as if they'd like the whole novel, please. Layered context. Redundancy. Hints and hedges and implied intent. The pendulum swung from "formal language" to "performative language," and the surprise is not that it moved; the surprise is that it moves back and forth. We aren't evolving away from code so much as oscillating between poles: accessibility and determinism, fluency and formality, "everyone can ship" and "nobody understands what shipped."
Prompts are not better than code, and code is not nobler than prompts. They simply sit at different coordinates:
- Prompts: approachable, expressive, context-heavy, but probabilistic. You get a distribution—not a guarantee.
- Code: exacting, brittle to newcomers, but deterministic. You get the same output for the same input, or you file a bug.
If you want a new term (one that arguably shouldn't exist), call an LLM a *probabilistic compiler*: it maps an imprecise, human-sized specification to a precise artifact most of the time. When it fails, it fails with confidence. The debugging technique is not printf but prompt surgery, eval harnesses, and guardrails. Different tools, same job: you are still the keeper of intent.
To be clear: LLMs do not replace programmers any more than compilers or IDEs did. They widen the aperture, change the economics, and generate a new class of failure modes. What remains is judgment: what should exist, what is safe to exist, and what we'll sign our names to when auditors ask how it came to exist at all.
Reality check: the energy cost of inference
Units matter. So does new data.
- The right lens is energy per query (watt-hours, Wh), not a raw "watts" claim. Recent, methodical estimates put typical LLM text prompts on the order of ~0.24–0.34 Wh per query for mainstream, optimized systems, with Epoch AI's independent estimate for GPT-4o around ~0.3 Wh.
- Earlier, much larger figures (multi-Wh) circulated—some analyses and press summaries cited ~2.9–3 Wh and even higher—but these are increasingly viewed as overestimates for current, optimized deployments; the spread reflects different methodologies, hardware, and workloads.
- For comparison, a typical laptop under normal use often draws ~30–70 watts (power, not energy), with modern ultrabooks idling in the single-digit watts and peaking higher under load. That's a device draw, not a per-query energy figure—but it's useful context when people casually compare "a prompt" to "my laptop."
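For intuition, take the ~0.3 Wh midpoint and a 50 W laptop draw: 0.3 Wh / 50 W = 0.006 h, or about 22 seconds of laptop runtime per prompt.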
What programmers do now (same as always, just louder)
We continue to clarify intent and constrain execution. We decide whether a feature belongs in human-language space (prompt chains) or code space (deterministic functions). We write evaluations so the model is graded before it ships. We annotate data for provenance so a future regulator can retrace our steps. We design fallbacks so the product remains useful when the probabilistic side falters. And we accept responsibility for the system's behavior, including the parts generated by something that, strictly speaking, cannot want anything at all.
When you strip away the novelty, the shape of the work persists: therapy for computers, at scale. You sit in the liminal space where the human request is negotiated into something a machine can actually do, knowing that the map (code, prompt, eval) is never the territory (execution), and that your craft is not in the artifact but in the alignment between the two.
> Programmers are architects of execution.
> Code and prompts are both scaffolding. The job is to realize intent with guarantees the business can sign. Where guarantees matter most, keep determinism. Where exploration matters most, let probability breathe. Then weld the two together with tests, schemas, and evals.
---
V. Putting the Three Together
You can view portability, proximity, and provenance as separate programs; they're really one agency triangle:
- Portability keeps you free to move when prices, politics, or performance change.
- Proximity gives users speed and privacy you simply can't fake server-side.
- Provenance gives you permission to operate—and the receipts to prove it.
Omit any leg, and the other two wobble. A portable stack that cannot prove lineage is just a movable risk; a private, on-device experience that can't export models or switch runtimes is a local dead end; perfect provenance on a stack you can't steer is an expensive confession.
---
VI. A 30-60-90 You Can Actually Run
Days 1–30: Make "exit" real
- Implement model export checks (e.g., ONNX) in CI for your top three models.
- Stand up a shadow serving path (vLLM) that mirrors your current production API. Route 1% of eval traffic; compare latency/$/quality (a comparison sketch follows this list).
- Publish an inference API contract. Ban vendor SDKs from product code. Provide a shim library if you must.
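A minimal sketch of the shadow comparison, assuming both stacks expose the OpenAI-compatible contract; PROD_URL, SHADOW_URL, the model name, and the two eval prompts are placeholders for your own eval set and scorer.
# send the same eval prompts to both stacks; record latency and capture outputs for scoring
import os, time, requests, statistics

STACKS = {
    "prod":   os.getenv("PROD_URL",   "http://localhost:8000/v1/chat/completions"),
    "shadow": os.getenv("SHADOW_URL", "http://localhost:8001/v1/chat/completions"),
}
EVAL_PROMPTS = ["Summarize this policy.", "Extract the renewal date."]   # stand-ins for your eval set

def compare():
    results = {}
    for name, url in STACKS.items():
        latencies, outputs = [], []
        for prompt in EVAL_PROMPTS:
            start = time.perf_counter()
            resp = requests.post(url, json={"model": "your/model",
                                            "messages": [{"role": "user", "content": prompt}]},
                                 timeout=30)
            latencies.append(time.perf_counter() - start)
            outputs.append(resp.json()["choices"][0]["message"]["content"])
        results[name] = {"p50_latency_s": statistics.median(latencies), "outputs": outputs}
    return results   # feed the outputs into your quality scorer; compare deltas per stack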
Days 31–60: Ship proximity
- Identify two on-device candidates (privacy-sensitive or latency-sensitive).
- For Apple: Core ML conversion + quantization + eval; for Android: use AICore / ML Kit GenAI where possible. Ship one feature per platform. Measure time-to-useful-token and offline correctness.
- Add a WebGPU experiment in the browser (tokenization/vector math). Flip it behind a feature flag.
Days 61–90: Close the loop on provenance
- Stand up the Data Rights Register and wire ingestion jobs to it.
- Start attaching C2PA content credentials to the outputs of at least one media pipeline; record validation on ingestion.
- Map your use cases to the EU AI Act timeline (GPAI obligations have applied since Aug 2, 2025; prohibitions since Feb 2, 2025; high-risk rules phase in across 2026–2027). Brief the exec team with the dated plan.
---
VII. Architecture Sketches
A. Portable Serving
              +------------------+                        +------------------+
App code ---> | Inference Client | ----- OpenAI REST ---> |   vLLM Cluster   |
              +------------------+                        +------------------+
                       | same contract, alternate paths
              +--------+---------+
              v                  v
      (Managed Endpoint)   (TensorRT-LLM)
The contract is the product; the engine is an implementation detail.
B. Proximity Split
[Device]                               [Edge/Cloud]
PII filter / redact (LLM-small)        Long-context synth / retrieval
First-token hint / preview             Batch enrichment / orchestration
Offline summarize / classify           Global signals / cross-user graphs
C. Provenance Loop (C2PA)
Ingest Asset -> Verify Manifest -> Store + Surface Trust
             -> Transform -> Update Manifest -> Emit with Credentials
Each arrow is code you can write this sprint. Each box reduces audit time later.
---
VIII. Risks & Rough Edges (so you plan, not panic)
- ONNX gaps: Not every cutting-edge op exports cleanly; keep a map of custom/fused ops and their fallbacks. Your CI should fail noisily when exports drift.
- AICore/NNAPI churn: Android's migration away from direct NNAPI usage means your app logic should prefer higher-level APIs; test across device generations. Budget migration time now—don't discover it during your holiday freeze.
- WebGPU variability: It's a Candidate Recommendation with active drafts; feature support evolves. Keep a graceful degradation path and capability checks.
- EU AI Act standards lag: Don't wait for harmonised standards to "rescue" you. Build your own controls (data sheets, model cards, eval harness, content credentials) and update them as CEN/CENELEC publishes. Expect standards to arrive after some obligations apply.
---
IX. Why This Matters More Than Hype
Because these three levers mostly operate in the dark. They don't headline your launch blog. They keep your teams, budgets, and ethics from getting quietly cornered by success: the happy state where usage doubles, a jurisdiction tightens, a cloud discount expires, a handset generation ships, and your beautiful demos meet the world's boring constraints. If you've made portability, proximity, and provenance routine, you can adjust without drama.
If you haven't, the bill comes due—in legal letters, in surprise GPU invoices, in mobile reviews that say "feels slow," in meetings where you explain why a model that used to cost pennies now costs dollars and can't move.
There is a tone to good engineering leadership that is part professor, part novelist. Professors insist on proofs; novelists track cause and effect through characters who lie and change. Treat your models, runtimes, devices, and documents like characters in a story with consequences. Keep the receipts. Keep the exits. Keep the work close to the people it serves.
Then the rest—features, campaigns, quarterly letters—remains what it should be: a visible surface atop a system that knows how to move.
---
Quick Reference
Standards & Specifications
- EU AI Act – Prohibited uses (Feb 2025), GPAI (Aug 2025), High-risk (2026–2027)
- C2PA 2.2 – Content credentials with update manifests and binding modes
- ONNX – Model interchange format and operator catalog
- WebGPU – W3C Candidate Recommendation for browser GPU compute
Runtime & Tooling
- vLLM – OpenAI-compatible high-throughput serving
- TensorRT-LLM – NVIDIA-optimized LLM inference
- Core ML – Apple on-device ML with generative support
- AICore / Gemini Nano – Android on-device GenAI via ML Kit
Key Metrics
- Energy per query: ~0.24–0.34 Wh for optimized LLM inference
- Model export coverage: % of production models with ONNX + multi-runtime validation
- Provenance coverage: % of I/O with C2PA credentials attached/validated