tokens
30 articles about tokens in AI news
Gemma 4 26B A4B Hits 45.7 tokens/sec Decode Speed on MacBook Air via MLX Community
A community benchmark shows the Gemma 4 26B A4B model running at 45.7 tokens/sec decode speed on a MacBook Air using the MLX framework. This highlights rapid progress in efficient local deployment of mid-size language models on consumer Apple Silicon.
Fireworks AI Launches 'Fire Pass' with Kimi K2.5 Turbo at 250 Tokens/Second
Fireworks AI has launched a new 'Fire Pass' subscription offering access to Kimi K2.5 Turbo at speeds up to 250 tokens/second. The service includes a free trial followed by a $7 weekly subscription.
Zhipu AI Announces GLM-5.1 Series, Featuring 1M Context and 128K Output Tokens
Zhipu AI has announced the GLM-5.1 model series, featuring a 1 million token context window and support for 128K output tokens. The update includes multiple model sizes and API availability.
MCP vs CLI: When to Skip MCP Servers and Save 37% on Tokens
Benchmarks show MCP servers can add 37% more input tokens vs. direct CLI commands. Learn when to use CLI for efficiency and when MCP's structure is worth the cost.
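The overhead the benchmark measures comes largely from the structured envelope wrapped around each tool call. A rough illustration of the effect, where the envelope shape and the 4-characters-per-token heuristic are simplifying assumptions, not MCP's actual wire format or any real tokenizer:

```python
# Why a structured tool call costs more input tokens than a bare CLI
# command: the JSON envelope is pure overhead. The envelope below is a
# simplified assumption, not the actual MCP wire format.
import json

cli_call = "git log --oneline -5"

mcp_call = json.dumps({
    "jsonrpc": "2.0",
    "method": "tools/call",
    "params": {"name": "git_log", "arguments": {"oneline": True, "limit": 5}},
    "id": 1,
})

def approx_tokens(text):
    # Crude proxy for token count: roughly 4 characters per token.
    return max(1, len(text) // 4)

overhead = approx_tokens(mcp_call) / approx_tokens(cli_call)
print(f"structured call is ~{overhead:.1f}x the size of the CLI command")
```

The exact ratio depends on the tool schema and tokenizer, but the direction is consistent: every field name and bracket in the envelope is billed as input tokens.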
NVIDIA Spending ~$75K Per Engineer on AI Compute Tokens, Indicating Multi-Billion Dollar Annual Budget
NVIDIA is reportedly allocating approximately $75,000 in AI compute tokens per engineer annually, translating to a multi-billion dollar organization-wide budget for AI development resources.
Jensen Huang's AI Productivity Mandate: Engineers Must Spend 50% of Salary on AI Tokens
NVIDIA CEO Jensen Huang argues that a $500K engineer should spend at least $250K annually on AI inference tokens, framing token consumption as essential as CAD tools for chip design. He claims this level of spending removes perceived difficulty, time, and resource constraints in development.
PRISM Study: Mid-Training on 27B Tokens Boosts Math Scores by +15 to +40 Points, Enables Effective RL
A comprehensive study shows mid-training on 27B high-quality tokens consistently improves reasoning in LLMs. This 'retention-aware' phase restructures 90% of weights, creating a configuration where RL can succeed.
Sam Altman Aims for '5T Tokens Per Day' as OpenAI Reportedly Scales GPT-5.4
Sam Altman stated his goal is to flood the market with AI tokens, comparing intelligence to a utility. A separate, unverified report claims GPT-5.4 is processing '5T tokens per day' in its first week.
HyperTokens Break the Forgetting Cycle: A New Architecture for Continual Multimodal AI Learning
Researchers introduce HyperTokens, a transformer-based system that generates task-specific tokens on demand for continual video-language learning. This approach dramatically reduces catastrophic forgetting while maintaining fixed memory costs, enabling AI models to learn sequentially without losing previous knowledge.
Diffusion Architecture Breaks Speed Barrier: Inception's Mercury 2 Hits 1,000 Tokens/Second
Inception's Mercury 2 achieves unprecedented text generation speeds of 1,000 tokens per second using diffusion architecture borrowed from image AI. This represents a 10x speed advantage over leading models like Claude 4.5 Haiku and GPT-5 Mini without requiring custom hardware.
Anthropic Tightens Security: OAuth Tokens Banned from Third-Party Tools in Major Policy Shift
Anthropic has implemented a significant security policy change, prohibiting the use of OAuth tokens and its Agent SDK in third-party tools. This move comes amid growing enterprise adoption and heightened security concerns in the AI industry.
Claude Code's Secret Weapon: How the /btw Command Saves Tokens and Keeps You in Flow
Use the /btw command to ask quick, contextual questions without resetting your main task's conversation, saving tokens and preventing workflow interruptions.
Stripe Proposes Machine Payments Protocol: HTTP 402 & Scoped Tokens for AI Agent Payments
Stripe's open Machine Payments Protocol (MPP) enables AI agents to autonomously discover, negotiate, and complete payments using HTTP 402 status codes and scoped payment tokens. It supports both fiat and crypto rails, eliminating the need for human-in-the-loop payment flows.
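The summary above describes the 402-then-retry shape of the protocol. A minimal sketch of that flow follows; the header name, the `acquire_scoped_token` helper, and the token fields are all hypothetical illustrations, not the actual MPP specification:

```python
# Hypothetical sketch of an HTTP 402 payment flow for an AI agent.
# Header names and token format are assumptions, not the MPP spec.

def acquire_scoped_token(amount):
    # Placeholder: a real agent would call its wallet or payments
    # provider here. A scoped token authorizes only this amount.
    return {"scope": "single-charge", "amount": amount}

def handle_response(status, headers, retry_with_token):
    """If the server answers 402 Payment Required, obtain a scoped
    payment token and retry; otherwise pass the status through."""
    if status != 402:
        return status
    quote = headers.get("X-Payment-Amount")      # assumed header
    token = acquire_scoped_token(amount=quote)
    return retry_with_token(token)

# Simulated flow: the first request gets 402, the retry with a
# correctly scoped token succeeds.
result = handle_response(
    402,
    {"X-Payment-Amount": "0.25 USD"},
    retry_with_token=lambda tok: 200 if tok["amount"] == "0.25 USD" else 402,
)
```

The key design point is the scoping: because the token is bound to a single charge and amount, a misbehaving agent cannot reuse it for arbitrary spending.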
CLAUDE.md Promises 63% Reduction in Claude Output Tokens with Drop-in Prompt File
A new prompt engineering file called CLAUDE.md claims to reduce Claude's output token usage by 63% without code changes. The drop-in file aims to make Claude's code generation more efficient by structuring its responses.
Stop Claude Code's Web Fetches from Burning 700K Tokens on HTML Junk
A new MCP server, token-enhancer, strips scripts, nav bars, and ads from web pages before they hit Claude's context, cutting token waste by 90%+.
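The core idea of this kind of pre-context cleanup can be sketched with the standard library alone; the tag list and behavior below are illustrative assumptions, not token-enhancer's actual implementation:

```python
# Sketch of stripping non-content HTML before it reaches a model's
# context. Which tags count as "junk" is an assumption here.
from html.parser import HTMLParser

SKIP = {"script", "style", "nav", "header", "footer", "aside"}

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting level inside skipped elements
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every skipped element.
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

def strip_junk(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

page = "<nav>Home | About</nav><p>Actual article text.</p><script>track()</script>"
print(strip_junk(page))  # -> Actual article text.
```

On real pages the savings come from the sheer volume of scripts, styles, and navigation chrome relative to article text, which is why the reported reductions can exceed 90%.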
Context Cartography: Formal Framework Proposes 7 Operators to Govern LLM Context, Moving Beyond 'More Tokens'
Researchers propose 'Context Cartography,' a formal framework for managing LLM context as a structured space, defining 7 operators to move information between zones like 'black fog' and 'visible field.' It argues that simply expanding context windows is insufficient due to transformer attention limitations.
Stop Wasting Tokens in Your CLAUDE.md: The Layered Configuration System
Separate global, project, and file-type rules into different CLAUDE.md files to cut token waste and make Claude Code more effective.
Stop Burning Tokens Blindly: Use vibe-budget to Estimate Claude Code Costs Before You Start
The new vibe-budget CLI tool lets you estimate the token cost and price of any AI coding project before you write a single prompt.
Support Tokens: The Hidden Mathematical Structure Making LLMs More Robust
Researchers have discovered a surprising mathematical constraint in transformer attention mechanisms that reveals a 'support token' structure similar to support vector machines. This insight enables a simple but powerful training modification that improves LLM robustness without sacrificing performance.
GeoSR Achieves SOTA on VSI-Bench with Geometry Token Fusion
GeoSR improves spatial reasoning by masking 2D vision tokens to prevent shortcuts and using gated fusion to amplify geometry information, achieving state-of-the-art results on key benchmarks.
The Cognitive Divergence: AI Context Windows Expand as Human Attention Declines, Creating a Delegation Feedback Loop
A new arXiv paper documents the exponential growth of AI context windows (512 tokens in 2017 to 2M in 2026) alongside a measured decline in human sustained-attention capacity. It introduces the 'Delegation Feedback Loop' hypothesis, where easier AI delegation may further erode human cognitive practice. This is a foundational study on human-AI interaction dynamics.
Qwen 3.6 Plus Preview Launches on OpenRouter with Free 1M Token Context, Disrupting API Pricing
Alibaba's Qwen team has released a preview of Qwen 3.6 Plus on OpenRouter with a 1 million token context window, charging $0 for both input and output tokens. This directly undercuts paid long-context offerings from Anthropic and OpenAI.
Meta-Harness Framework Automates AI Agent Engineering, Achieves 6x Performance Gap on Same Model
A new framework called Meta-Harness automates the optimization of AI agent harnesses—the system prompts, tools, and logic that wrap a model. By analyzing raw failure logs at scale, it improved text classification by 7.7 points while using 4x fewer tokens, demonstrating that harness engineering is a major leverage point as model capabilities converge.
Claude Code Digest — Mar 26–Mar 29
Stop defaulting to MCP servers: they can add 37% more input tokens than direct CLI commands.
Memory Sparse Attention (MSA) Achieves 100M Token Context with Near-Linear Complexity
A new attention architecture, Memory Sparse Attention (MSA), breaks the 100M token context barrier while maintaining 94% accuracy at 1M tokens. It uses document-wise RoPE and end-to-end sparse attention to outperform RAG systems and frontier models.
Claude Code Digest — Mar 24–Mar 27
Stop wasting tokens: MCP servers can add 37% more input tokens than direct CLI commands.
ReDiPrune: Training-Free Token Pruning Before Projection Boosts MLLM Efficiency 6x, Gains 2% Accuracy
Researchers propose ReDiPrune, a plug-and-play method that prunes visual tokens before the vision-language projector in multimodal LLMs. On EgoSchema with LLaVA-NeXT-Video-7B, it achieves a +2.0% accuracy gain while reducing computation by over 6× in TFLOPs.
SIDReasoner: A New Framework for Reasoning-Enhanced Generative Recommendation
Researchers propose SIDReasoner, a two-stage framework that improves LLM-based recommendation by enhancing reasoning over Semantic IDs. It strengthens the alignment between item tokens and language, enabling better interpretability and cross-domain generalization without extensive labeled reasoning data.
How to Prevent Cost Explosions with MCP Gateway Budget Enforcement
Standard MCP gateways lack economic governance. Add per-tool cost modeling and budget-aware tokens to prevent agents from burning through thousands of dollars in minutes.
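A minimal sketch of what budget-aware gating could look like, assuming a simple per-tool dollar cost table; the class and API here are hypothetical, not any real gateway's interface:

```python
# Hypothetical budget enforcement wrapper for tool calls. The cost
# model (dollars per call) is an illustrative assumption; a real
# gateway would estimate cost from token counts and model pricing.

class BudgetExceeded(Exception):
    pass

class BudgetGateway:
    def __init__(self, budget_usd, cost_model):
        self.remaining = budget_usd
        self.cost_model = cost_model   # tool name -> estimated $/call

    def call(self, tool, fn, *args):
        cost = self.cost_model.get(tool, 0.01)  # default estimate
        if cost > self.remaining:
            raise BudgetExceeded(
                f"{tool} needs ${cost:.2f}, ${self.remaining:.2f} left"
            )
        self.remaining -= cost
        return fn(*args)

gw = BudgetGateway(0.05, {"web_search": 0.02, "code_exec": 0.04})
gw.call("web_search", lambda q: f"results for {q}", "mcp pricing")

# The second call would exceed the remaining budget, so the gateway
# refuses it before the agent spends anything.
try:
    gw.call("code_exec", lambda: None)
except BudgetExceeded:
    pass
```

The point is that enforcement happens before dispatch: an agent stuck in a retry loop hits the budget ceiling instead of the credit card.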
Claude Code Digest — Mar 21–Mar 24
MCP servers can add 37% more input tokens; use the CLI for efficiency.