tokens
30 articles about tokens in AI news
Gemma 4 26B A4B Hits 45.7 tokens/sec Decode Speed on MacBook Air via MLX Community
A community benchmark shows the Gemma 4 26B A4B model running at 45.7 tokens/sec decode speed on a MacBook Air using the MLX framework. This highlights rapid progress in efficient local deployment of mid-size language models on consumer Apple Silicon.
Fireworks AI Launches 'Fire Pass' with Kimi K2.5 Turbo at 250 Tokens/Second
Fireworks AI has launched a new 'Fire Pass' subscription offering access to Kimi K2.5 Turbo at speeds up to 250 tokens/second. The service includes a free trial followed by a $7 weekly subscription.
Zhipu AI Announces GLM-5.1 Series, Featuring 1M Context and 128K Output Tokens
Zhipu AI has announced the GLM-5.1 model series, featuring a 1 million token context window and support for 128K output tokens. The update includes multiple model sizes and API availability.
MCP vs CLI: When to Skip MCP Servers and Save 37% on Tokens
Benchmarks show MCP servers can add 37% more input tokens vs. direct CLI commands. Learn when to use CLI for efficiency and when MCP's structure is worth the cost.
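The overhead the benchmark measures comes largely from the structured envelope wrapped around each tool call. A rough illustration of the effect, where the envelope shape and the 4-characters-per-token heuristic are simplifying assumptions, not MCP's actual wire format or any real tokenizer:

```python
# Why a structured tool call costs more input tokens than a bare CLI
# command: the JSON envelope is pure overhead. The envelope below is a
# simplified assumption, not the actual MCP wire format.
import json

cli_call = "git log --oneline -5"

mcp_call = json.dumps({
    "jsonrpc": "2.0",
    "method": "tools/call",
    "params": {"name": "git_log", "arguments": {"oneline": True, "limit": 5}},
    "id": 1,
})

def approx_tokens(text):
    # Crude proxy for token count: roughly 4 characters per token.
    return max(1, len(text) // 4)

overhead = approx_tokens(mcp_call) / approx_tokens(cli_call)
print(f"structured call is ~{overhead:.1f}x the size of the CLI command")
```

The exact ratio depends on the tool schema and tokenizer, but the direction is consistent: every field name and bracket in the envelope is billed as input tokens.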
NVIDIA Spending ~$75K Per Engineer on AI Compute Tokens, Indicating Multi-Billion Dollar Annual Budget
NVIDIA is reportedly allocating approximately $75,000 in AI compute tokens per engineer annually, translating to a multi-billion dollar organization-wide budget for AI development resources.
Jensen Huang's AI Productivity Mandate: Engineers Must Spend 50% of Salary on AI Tokens
NVIDIA CEO Jensen Huang argues that a $500K engineer should spend at least $250K annually on AI inference tokens, framing token consumption as essential as CAD tools for chip design. He claims this level of spending removes perceived difficulty, time, and resource constraints in development.
PRISM Study: Mid-Training on 27B Tokens Boosts Math Scores by +15 to +40 Points, Enables Effective RL
A comprehensive study shows mid-training on 27B high-quality tokens consistently improves reasoning in LLMs. This 'retention-aware' phase restructures 90% of weights, creating a configuration where RL can succeed.
Sam Altman Aims for '5T Tokens Per Day' as OpenAI Reportedly Scales GPT-5.4
Sam Altman stated his goal is to flood the market with AI tokens, comparing intelligence to a utility. A separate, unverified report claims GPT-5.4 is processing '5T tokens per day' in its first week.
HyperTokens Break the Forgetting Cycle: A New Architecture for Continual Multimodal AI Learning
Researchers introduce HyperTokens, a transformer-based system that generates task-specific tokens on demand for continual video-language learning. This approach dramatically reduces catastrophic forgetting while maintaining fixed memory costs, enabling AI models to learn sequentially without losing previous knowledge.
Diffusion Architecture Breaks Speed Barrier: Inception's Mercury 2 Hits 1,000 Tokens/Second
Inception's Mercury 2 achieves unprecedented text generation speeds of 1,000 tokens per second using diffusion architecture borrowed from image AI. This represents a 10x speed advantage over leading models like Claude 4.5 Haiku and GPT-5 Mini without requiring custom hardware.
Anthropic Tightens Security: OAuth Tokens Banned from Third-Party Tools in Major Policy Shift
Anthropic has implemented a significant security policy change, prohibiting the use of OAuth tokens and its Agent SDK in third-party tools. This move comes amid growing enterprise adoption and heightened security concerns in the AI industry.
Claude Code's Secret Weapon: How the /btw Command Saves Tokens and Keeps You in Flow
Use the /btw command to ask quick, contextual questions without resetting your main task's conversation, saving tokens and preventing workflow interruptions.
Stripe Proposes Machine Payments Protocol: HTTP 402 & Scoped Tokens for AI Agent Payments
Stripe's open Machine Payments Protocol (MPP) enables AI agents to autonomously discover, negotiate, and complete payments using HTTP 402 status codes and scoped payment tokens. It supports both fiat and crypto rails, eliminating the need for human-in-the-loop payment flows.
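The summary above describes the 402-then-retry shape of the protocol. A minimal sketch of that flow follows; the header name, the `acquire_scoped_token` helper, and the token fields are all hypothetical illustrations, not the actual MPP specification:

```python
# Hypothetical sketch of an HTTP 402 payment flow for an AI agent.
# Header names and token format are assumptions, not the MPP spec.

def acquire_scoped_token(amount):
    # Placeholder: a real agent would call its wallet or payments
    # provider here. A scoped token authorizes only this amount.
    return {"scope": "single-charge", "amount": amount}

def handle_response(status, headers, retry_with_token):
    """If the server answers 402 Payment Required, obtain a scoped
    payment token and retry; otherwise pass the status through."""
    if status != 402:
        return status
    quote = headers.get("X-Payment-Amount")      # assumed header
    token = acquire_scoped_token(amount=quote)
    return retry_with_token(token)

# Simulated flow: the first request gets 402, the retry with a
# correctly scoped token succeeds.
result = handle_response(
    402,
    {"X-Payment-Amount": "0.25 USD"},
    retry_with_token=lambda tok: 200 if tok["amount"] == "0.25 USD" else 402,
)
```

The key design point is the scoping: because the token is bound to a single charge and amount, a misbehaving agent cannot reuse it for arbitrary spending.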
CLAUDE.md Promises 63% Reduction in Claude Output Tokens with Drop-in Prompt File
A new prompt engineering file called CLAUDE.md claims to reduce Claude's output token usage by 63% without code changes. The drop-in file aims to make Claude's code generation more efficient by structuring its responses.
Stop Claude Code's Web Fetches from Burning 700K Tokens on HTML Junk
A new MCP server, token-enhancer, strips scripts, nav bars, and ads from web pages before they hit Claude's context, cutting token waste by 90%+.
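The core idea of this kind of pre-context cleanup can be sketched with the standard library alone; the tag list and behavior below are illustrative assumptions, not token-enhancer's actual implementation:

```python
# Sketch of stripping non-content HTML before it reaches a model's
# context. Which tags count as "junk" is an assumption here.
from html.parser import HTMLParser

SKIP = {"script", "style", "nav", "header", "footer", "aside"}

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting level inside skipped elements
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every skipped element.
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

def strip_junk(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

page = "<nav>Home | About</nav><p>Actual article text.</p><script>track()</script>"
print(strip_junk(page))  # -> Actual article text.
```

On real pages the savings come from the sheer volume of scripts, styles, and navigation chrome relative to article text, which is why the reported reductions can exceed 90%.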
Context Cartography: Formal Framework Proposes 7 Operators to Govern LLM Context, Moving Beyond 'More Tokens'
Researchers propose 'Context Cartography,' a formal framework for managing LLM context as a structured space, defining 7 operators to move information between zones like 'black fog' and 'visible field.' It argues that simply expanding context windows is insufficient due to transformer attention limitations.
Stop Wasting Tokens in Your CLAUDE.md: The Layered Configuration System
Separate global, project, and file-type rules into different CLAUDE.md files to cut token waste and make Claude Code more effective.
Stop Burning Tokens Blindly: Use vibe-budget to Estimate Claude Code Costs Before You Start
The new vibe-budget CLI tool lets you estimate the token cost and price of any AI coding project before you write a single prompt.
Support Tokens: The Hidden Mathematical Structure Making LLMs More Robust
Researchers have discovered a surprising mathematical constraint in transformer attention mechanisms that reveals a 'support token' structure similar to support vector machines. This insight enables a simple but powerful training modification that improves LLM robustness without sacrificing performance.
GeoSR Achieves SOTA on VSI-Bench with Geometry Token Fusion
GeoSR improves spatial reasoning by masking 2D vision tokens to prevent shortcuts and using gated fusion to amplify geometry information, achieving state-of-the-art results on key benchmarks.
The Cognitive Divergence: AI Context Windows Expand as Human Attention Declines, Creating a Delegation Feedback Loop
A new arXiv paper documents the exponential growth of AI context windows (512 tokens in 2017 to 2M in 2026) alongside a measured decline in human sustained-attention capacity. It introduces the 'Delegation Feedback Loop' hypothesis, where easier AI delegation may further erode human cognitive practice. This is a foundational study on human-AI interaction dynamics.
Qwen 3.6 Plus Preview Launches on OpenRouter with Free 1M Token Context, Disrupting API Pricing
Alibaba's Qwen team has released a preview of Qwen 3.6 Plus on OpenRouter with a 1 million token context window, charging $0 for both input and output tokens. This directly undercuts paid long-context offerings from Anthropic and OpenAI.
Meta-Harness Framework Automates AI Agent Engineering, Achieves 6x Performance Gap on Same Model
A new framework called Meta-Harness automates the optimization of AI agent harnesses—the system prompts, tools, and logic that wrap a model. By analyzing raw failure logs at scale, it improved text classification by 7.7 points while using 4x fewer tokens, demonstrating that harness engineering is a major leverage point as model capabilities converge.
Claude Code Digest — Mar 26–Mar 29
Stop defaulting to MCP servers: they can add 37% more input tokens than direct CLI commands.
Memory Sparse Attention (MSA) Achieves 100M Token Context with Near-Linear Complexity
A new attention architecture, Memory Sparse Attention (MSA), breaks the 100M token context barrier while maintaining 94% accuracy at 1M tokens. It uses document-wise RoPE and end-to-end sparse attention to outperform RAG systems and frontier models.
Claude Code Digest — Mar 24–Mar 27
Stop wasting tokens: MCP servers can add 37% more input tokens than direct CLI commands.
ReDiPrune: Training-Free Token Pruning Before Projection Boosts MLLM Efficiency 6x, Gains 2% Accuracy
Researchers propose ReDiPrune, a plug-and-play method that prunes visual tokens before the vision-language projector in multimodal LLMs. On EgoSchema with LLaVA-NeXT-Video-7B, it achieves a +2.0% accuracy gain while reducing computation by over 6× in TFLOPs.
SIDReasoner: A New Framework for Reasoning-Enhanced Generative Recommendation
Researchers propose SIDReasoner, a two-stage framework that improves LLM-based recommendation by enhancing reasoning over Semantic IDs. It strengthens the alignment between item tokens and language, enabling better interpretability and cross-domain generalization without extensive labeled reasoning data.
How to Prevent Cost Explosions with MCP Gateway Budget Enforcement
Standard MCP gateways lack economic governance. Add per-tool cost modeling and budget-aware tokens to prevent agents from burning through thousands of dollars in minutes.
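A minimal sketch of what budget-aware gating could look like, assuming a simple per-tool dollar cost table; the class and API here are hypothetical, not any real gateway's interface:

```python
# Hypothetical budget enforcement wrapper for tool calls. The cost
# model (dollars per call) is an illustrative assumption; a real
# gateway would estimate cost from token counts and model pricing.

class BudgetExceeded(Exception):
    pass

class BudgetGateway:
    def __init__(self, budget_usd, cost_model):
        self.remaining = budget_usd
        self.cost_model = cost_model   # tool name -> estimated $/call

    def call(self, tool, fn, *args):
        cost = self.cost_model.get(tool, 0.01)  # default estimate
        if cost > self.remaining:
            raise BudgetExceeded(
                f"{tool} needs ${cost:.2f}, ${self.remaining:.2f} left"
            )
        self.remaining -= cost
        return fn(*args)

gw = BudgetGateway(0.05, {"web_search": 0.02, "code_exec": 0.04})
gw.call("web_search", lambda q: f"results for {q}", "mcp pricing")

# The second call would exceed the remaining budget, so the gateway
# refuses it before the agent spends anything.
try:
    gw.call("code_exec", lambda: None)
except BudgetExceeded:
    pass
```

The point is that enforcement happens before dispatch: an agent stuck in a retry loop hits the budget ceiling instead of the credit card.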
Claude Code Digest — Mar 21–Mar 24
MCP servers can add 37% more input tokens; use the CLI for efficiency.