
I think about token usage more than is probably healthy. I read token-optimizing repos for fun. I have opinions about prompt caching. I will happily lose an afternoon to someone’s Claude Code skill that makes the model talk like a caveman to cut output by two-thirds (why use many token when few token do trick). And I think about the other cost too, the one that never shows up on an invoice: the power, the water, the physical weight of all this compute humming away somewhere I’ll never see.
So when Jensen Huang held up a chip last week and called it the new PC, it didn’t land on me as a product announcement. It landed as a question I’d already been chewing on. When the meter is running on every token, who actually ends up holding the bill?
There’s a longer shadow behind that question, and it’s the part that nags at me. This rhymes with something we already did. In the early Web 2.0 years we moved everything to the cloud because it was cheap and easy and somebody else’s machine to run. We were right that it was convenient. What we didn’t price in was the landfill we’d build: petabytes of bloated, duplicated, abandoned data sitting on servers nobody remembers turning on, drawing power to stay alive for no reason anyone can name. I have a strong suspicion this AI shift has the same shape. We’re optimizing for “it just works and costs nothing I can see,” which is exactly the setup that produces a mess you only notice years later.

The pitch
NVIDIA and Microsoft told a capability story. At Computex, Huang unveiled the RTX Spark superchip and said, plainly, “this is going to be the new PC.” The framing is that your machine becomes smart enough to run real AI on its own, with an agent that watches your screen and acts on your behalf. The OEM list was not boutique: Dell, HP, Lenovo, ASUS, Microsoft Surface, MSI. This is going into the mainstream supply chain.
The specs back the pitch. The chip pairs a Blackwell GPU with a Grace CPU sharing one 128GB memory pool, which kills the usual bottleneck of shuffling data between GPU memory and system RAM. NVIDIA mentions that this is the same silicon that debuted in early 2025 as Project DIGITS, a Linux developer box. A year later it got a Windows wrapper and a consumer story bolted on. Same chip. Different argument about who it’s for.
That gap is the part I find interesting. The capability existed a year ago. What changed is the reason to want it on your desk instead of in a rack somewhere.
The thing that actually changed
The reason, I think, is the bill. The per-token price of AI has been falling fast, and it has not helped, because consumption has been growing faster than the price has been dropping. VentureBeat reported that the architectures companies built during the flat-fee era, long-context agents and elaborate retrieval pipelines, became liabilities the moment pricing shifted to metered usage. When tokens were a sunk cost, nobody counted them. Now they’re a line item, and the line is moving the wrong way.
The driver is agents. A chatbot answers when you ask. An agent that monitors your inbox, your logs, your calendar runs whether you’re there or not, and you can’t throttle it without breaking the thing it was hired to do. The token consumption of always-on AI looks nothing like the token consumption of a person typing questions. That’s the cost curve that scared people, and I don’t think it’s a coincidence that the hardware pitch arrived right behind it.
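A rough back-of-the-envelope sketch of that gap, in Python. Every number here is invented for illustration: the call rates, the tokens per call, and both prices are assumptions, not figures from VentureBeat or any vendor. The only point is the shape of the arithmetic, where volume can outrun a falling price.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    calls_per_day: float    # model calls made per day
    tokens_per_call: float  # input plus output tokens per call

    def monthly_tokens(self, days: int = 30) -> float:
        return self.calls_per_day * self.tokens_per_call * days


def monthly_cost(workload: Workload, price_per_million: float, days: int = 30) -> float:
    """Metered bill for one month at a flat per-million-token price."""
    return workload.monthly_tokens(days) / 1_000_000 * price_per_million


# Hypothetical workloads, not measurements.
chat = Workload("person typing questions", calls_per_day=20, tokens_per_call=1_500)
agent = Workload("always-on agent", calls_per_day=24 * 60 / 5, tokens_per_call=8_000)  # one call every 5 minutes

# Hypothetical prices in dollars per million tokens: the price halves.
OLD_PRICE = 10.0
NEW_PRICE = 5.0

for w in (chat, agent):
    print(f"{w.name}: {w.monthly_tokens():,.0f} tokens/month, "
          f"${monthly_cost(w, OLD_PRICE):,.2f} at the old price, "
          f"${monthly_cost(w, NEW_PRICE):,.2f} at the new price")
```

With these made-up inputs the agent burns about 77 times the tokens of the person typing, so even at half the price its bill sits far above what the chat user paid at the old one.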
Who actually pays
Here is where my thinking lands, and I’ll flag it as a hunch rather than a fact. I don’t believe regular people are going to pay per token to get AI. They’re going to get it one of two ways: folded into something they already pay for, or running on hardware they already own.
The folded-in version is already here. Google’s own shareholder letter called the last quarter its strongest ever for consumer AI, and the Gemini-powered AI Overviews sit inside Search for roughly two billion people whether they wanted them or not. Nobody chose a token budget. The AI is just in the box now, next to the search results they came for. Amazon is running the same logic with Alexa: keep it bundled for the home, sell the metered tokens to businesses through AWS.
The hardware version is the other half, and it’s what RTX Spark and the new wave of AI PCs are reaching for. The cost of inference, in that model, moves from a vendor’s cloud bill to your electricity bill, which you were already paying and don’t itemize by query. I’m not sure consumers will frame it that way consciously. I think they’ll just notice that the feature works and doesn’t cost extra, and that’s the whole sell.
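To make that shift concrete, here is a minimal sketch of the two ways the same query could be paid for. The function names and every input are hypothetical, and the local figure deliberately ignores the purchase price of the hardware, so treat it as the arithmetic of the framing rather than a real comparison.

```python
def local_energy_cost(watts: float, seconds: float, price_per_kwh: float) -> float:
    """Electricity cost of one on-device inference run."""
    kwh = watts * seconds / 3_600_000  # watt-seconds to kilowatt-hours
    return kwh * price_per_kwh


def metered_cost(tokens: int, price_per_million: float) -> float:
    """Cloud cost of the same run at a flat per-million-token price."""
    return tokens / 1_000_000 * price_per_million


# Hypothetical inputs: a 100 W draw for 36 seconds at $0.20 per kWh,
# against 2,000 tokens at $5 per million.
local = local_energy_cost(watts=100, seconds=36, price_per_kwh=0.20)
cloud = metered_cost(tokens=2_000, price_per_million=5.0)
print(f"local electricity: ${local:.6f}, metered cloud: ${cloud:.6f}")
```

Whichever number comes out lower for a given set of inputs, only one of them shows up as its own line on a bill.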
Which leaves enterprises. They’re the ones who can’t bundle their way out, because their AI use is the product, not a feature inside one. They’re the ones running agents at volume against a meter. The token fight, the real one with budgets and FinOps teams and architecture reviews, is happening in enterprise, and I think it stays there.

Apple already ran this play
None of this is new. Apple has been doing the on-device version for years, and I don’t think it was mainly about being clever. Their foundation models post describes a roughly 3-billion-parameter model that runs locally on the device, with a bigger model on Private Cloud Compute for the heavy lifting. The privacy framing is real and they’ve earned it. But there’s a quieter reason a company with a billion-plus active devices would want routine inference happening on the phone: at that scale, you cannot afford to eat the cloud cost of every small AI task yourself. Push the cheap stuff to the device. Reserve the expensive cloud for when it’s actually needed.
What’s happening now is that Microsoft is copying the architecture, not the privacy pitch. At Build, the Windows AI Platform showed an on-device runtime that can run small models locally and hand off to Azure when the task is too big. That’s the same hybrid Apple drew up. The interesting thing is watching the rest of the industry arrive at a design Apple shipped quietly while everyone was busy calling them behind on AI.
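The hybrid pattern itself is simple enough to sketch. This is not Apple's or Microsoft's API: the class, the threshold, and the four-characters-per-token estimate are all assumptions made up for illustration, and a real runtime would decide on far more than prompt length.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class HybridRouter:
    run_local: Callable[[str], str]   # stand-in for a small on-device model
    run_cloud: Callable[[str], str]   # stand-in for a larger hosted model
    local_token_limit: int = 2_000    # hypothetical cutoff for on-device work

    def estimate_tokens(self, prompt: str) -> int:
        # Crude heuristic: assume roughly four characters per token.
        return max(1, len(prompt) // 4)

    def route(self, prompt: str, needs_heavy_reasoning: bool = False) -> str:
        """Return 'local' for routine work, 'cloud' when the task is too big."""
        if needs_heavy_reasoning or self.estimate_tokens(prompt) > self.local_token_limit:
            return "cloud"
        return "local"

    def run(self, prompt: str, needs_heavy_reasoning: bool = False) -> Tuple[str, str]:
        target = self.route(prompt, needs_heavy_reasoning)
        handler = self.run_local if target == "local" else self.run_cloud
        return target, handler(prompt)


# Stub handlers so the sketch runs without any model installed.
router = HybridRouter(
    run_local=lambda p: f"[on-device] {p}",
    run_cloud=lambda p: f"[cloud] {p}",
)
print(router.run("Summarize this email thread."))
print(router.run("Plan a multi-step migration.", needs_heavy_reasoning=True))
```

The design choice that matters is the default: routine work stays on the device unless something flags it as too big, which is what keeps the cheap stuff off the cloud bill.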
The other thing pushing inference outward
There’s a second force here that doesn’t get mentioned in the keynotes, which is that people don’t want the data center. A March 2026 Gallup poll found that majorities oppose building new AI data centers near where they live. This isn’t abstract: community opposition has blocked or delayed tens of billions in projects, with opposition groups active across dozens of states. It’s gone political, too. Sanders and AOC introduced a moratorium bill this spring.
Inference on your own device sidesteps all of it. No new substation, no water draw for cooling, no zoning fight in somebody’s town. The thing was already plugged into the wall. I’m not claiming the chipmakers designed for the backlash. But the backlash and the cost curve are pushing in the same direction, and that’s a strong tailwind for moving routine work off the centralized grid.
The complicated part
First, today’s cloud prices are not real prices. The frontier labs are widely understood to be pricing inference below what it costs them, buying market share. So the cost pressure enterprises feel right now is the subsidized number. When that floor corrects, the math for moving things local gets stronger, but it also means I’m partly betting on a future that hasn’t happened.
Second, on-device doesn’t kill the data center, it just changes what it’s for. Anthropic’s recent piece on AI building itself makes the case that the binding constraint on AI progress may end up being the supply chain, energy and compute, rather than intelligence itself. They mention, almost in passing, that GitHub went from about a billion code commits in all of 2025 to 275 million in a single week by mid-2026. Demand for compute is not shrinking. The frontier still wants enormous centralized machines. What might change is the split: the routine, high-volume, latency-sensitive work drifts to the edge, while the heavy reasoning stays central. Two different jobs, two different places.
Third, the hardware isn’t free either. The first RTX Spark systems are priced for the premium end, not the family laptop. The consumer version of this is a few years out, riding the same curve every chip rides down.
Where this goes
I don’t think there’s one AI race. I think the market is quietly splitting into layers that barely compete: consumer AI bundled into services people already pay for, consumer AI baked into hardware they already own, and enterprise AI fighting the actual token war in the cloud. Google and Amazon look strong on the bundled-services layer. Apple, Microsoft, and now NVIDIA are reaching for the device layer. The metered-token brawl, the one with real money on the line per query, lives in enterprise.
That maps onto something I’ve been turning over for a while. We spent a decade centralizing computation into a handful of clouds because it was cheaper and easier. The economics that justified that may be starting to push the other way for a meaningful slice of the work. If routine inference comes home to the device, that’s a small re-decentralization of a thing we just finished centralizing, and it happened not because anyone wanted a principled distributed future but because the spreadsheet stopped working.
But I keep coming back to the landfill. Moving inference to the edge doesn’t make the waste go away, it just spreads it out. Instead of forgotten data rotting in one company’s data center, you get a billion devices each quietly burning cycles on agents nobody’s watching, background inference that runs because it can, not because anyone asked. The Web 2.0 cloud migration didn’t fail. It just left a bill we’re still paying and mostly pretending not to see. My worry isn’t that the AI version of this won’t work. It’s that it’ll work exactly well enough that we stop counting again.
The question I started with is the one I’d still want answered: who pays for the tokens. The answer seems to be that consumers mostly won’t, not directly, and the people who can’t avoid the meter are going to spend the next few years figuring out how to turn it down. The rest of us will get our AI bundled and on-device and free-feeling, which is the most expensive kind of free there is.