Inference Optimization
30 articles about inference optimization in AI news
arXiv Survey Maps KV Cache Optimization Landscape: 5 Strategies for Million-Token LLM Inference
A comprehensive arXiv review categorizes five principal KV cache optimization techniques—eviction, compression, hybrid memory, novel attention, and combinations—to address the linear memory scaling bottleneck in long-context LLM inference. The analysis finds no single dominant solution, with optimal strategy depending on context length, hardware, and workload.
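Most eviction-style approaches in that taxonomy cap memory by discarding cached key/value entries once the context exceeds a budget. A minimal sketch of one such policy (a few "sink" tokens plus a recent window, in the spirit of sliding-window and attention-sink caches; the shapes and the policy itself are illustrative assumptions, not the survey's own code):

```python
import torch

def evict_kv(k: torch.Tensor, v: torch.Tensor,
             budget: int = 4096, n_sink: int = 4):
    """Illustrative eviction policy: keep the first `n_sink` tokens plus the
    most recent (budget - n_sink) tokens. Shapes: [batch, heads, seq, dim]."""
    seq_len = k.shape[2]
    if seq_len <= budget:
        return k, v  # under budget, nothing to evict
    keep_recent = budget - n_sink
    idx = torch.cat([
        torch.arange(n_sink),                          # attention-sink tokens
        torch.arange(seq_len - keep_recent, seq_len),  # recent window
    ])
    return k[:, :, idx, :], v[:, :, idx, :]

# Example: a 10k-token cache trimmed to a 4k budget for one layer
k = torch.randn(1, 8, 10_000, 128)
v = torch.randn(1, 8, 10_000, 128)
k_small, v_small = evict_kv(k, v)
print(k_small.shape)  # torch.Size([1, 8, 4096, 128])
```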
Nvidia Claims MLPerf Inference v6.0 Records with 288-GPU Blackwell Ultra Systems, Highlights 2.7x Software Gains
MLCommons released MLPerf Inference v6.0 results, introducing multimodal and video model tests. Nvidia set records using 288-GPU Blackwell Ultra systems and achieved a 2.7x performance jump on DeepSeek-R1 via software optimizations alone.
Image Prompt Packaging Cuts Multimodal Inference Costs Up to 91%
A new method called Image Prompt Packaging (IPPg) embeds structured text directly into images, reducing token-based inference costs by 35.8–91% across GPT-4.1, GPT-4o, and Claude 3.5 Sonnet. Performance outcomes are highly model-dependent, with GPT-4.1 showing simultaneous accuracy and cost gains on some tasks.
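The underlying trick (render structured text into an image so it is billed as visual input rather than text tokens) can be illustrated with a few lines of Pillow. The rendering parameters and payload format below are placeholder assumptions, not the IPPg paper's packaging scheme:

```python
import base64
import io

from PIL import Image, ImageDraw

def pack_prompt_as_image(text: str, width: int = 1024, line_height: int = 18) -> str:
    """Render prompt text onto a white image and return it base64-encoded,
    ready to attach as the image part of a multimodal API request."""
    lines = text.splitlines() or [text]
    img = Image.new("RGB", (width, line_height * (len(lines) + 2)), "white")
    draw = ImageDraw.Draw(img)
    for i, line in enumerate(lines):
        draw.text((10, 10 + i * line_height), line, fill="black")  # default font
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode("ascii")

packed = pack_prompt_as_image("ORDER_ID: 1234\nSTATUS: shipped\nITEMS: 3")
```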
Inference Beauty Today Announces Global Platform Expansion, Powering Personalized Beauty Discovery for 100+ Retailers and Brands
Inference Beauty Today has expanded its AI-powered personalized beauty discovery platform globally, now serving over 100 retailers and brands across five markets. This signals the maturation of specialized, third-party AI recommendation engines in the beauty and personal care sector.
Fractal Emphasizes LLM Inference Efficiency as Generative AI Moves to Production
AI consultancy Fractal highlights the critical shift from generative AI experimentation to production deployment, where inference efficiency—cost, latency, and scalability—becomes the primary business constraint. This marks a maturation phase where operational metrics trump model novelty.
Fine-Tuning Llama 3 with Direct Preference Optimization (DPO): A Code-First Walkthrough
A technical guide details the end-to-end process of fine-tuning Meta's Llama 3 using Direct Preference Optimization (DPO), from raw preference data to a deployment-ready model. This provides a practical blueprint for customizing LLM behavior.
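A compressed sketch of the kind of pipeline such a walkthrough covers, using Hugging Face's trl library; exact argument names vary across trl versions, and the model ID, dataset, and hyperparameters here are placeholders:

```python
# pip install trl transformers datasets  (the exact API shifts between trl releases)
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # gated; any causal LM works for the sketch
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# DPO trains on (prompt, chosen, rejected) preference triples
train_dataset = Dataset.from_list([
    {"prompt": "Summarize: ...", "chosen": "Concise summary.", "rejected": "Rambling answer."},
])

args = DPOConfig(output_dir="llama3-dpo", beta=0.1, per_device_train_batch_size=1)
trainer = DPOTrainer(model=model, args=args, train_dataset=train_dataset,
                     processing_class=tokenizer)  # `tokenizer=` in older trl releases
trainer.train()
trainer.save_model("llama3-dpo")
```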
HyEvo Framework Automates Hybrid LLM-Code Workflows, Cuts Inference Cost 19x vs. SOTA
Researchers propose HyEvo, an automated framework that generates agentic workflows combining LLM nodes for reasoning with deterministic code nodes for execution. It reduces inference cost by up to 19x and latency by 16x while outperforming existing methods on reasoning benchmarks.
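The architectural idea is easy to picture: spend tokens only where open-ended reasoning is needed and let plain code handle the deterministic steps. A toy hybrid workflow follows; the stubbed LLM node and the split between nodes are hypothetical, not HyEvo's generated workflows:

```python
import json

def llm_node(prompt: str) -> str:
    """Stands in for an LLM call; a real workflow would hit an inference API.
    Here it returns a canned structured answer so the sketch runs offline."""
    return json.dumps({"quantities": [3, 4], "operation": "multiply"})

def code_node(spec: str) -> int:
    """Deterministic execution node: no tokens spent, exact arithmetic."""
    parsed = json.loads(spec)
    a, b = parsed["quantities"]
    return a * b if parsed["operation"] == "multiply" else a + b

# Hybrid workflow: one cheap LLM call to parse, then free deterministic code.
print(code_node(llm_node("Alice has 3 boxes of 4 apples. How many apples?")))  # 12
```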
Why Companies End Up Using Triton Inference Server: A Simple Case Study
A case study explains the common journey from a simple ML experiment to a production system requiring a robust inference server like NVIDIA's Triton, highlighting its role in managing multi-model, multi-framework deployments at scale.
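Once models sit behind Triton, clients hit a single HTTP or gRPC endpoint regardless of the backing framework. A minimal sketch with the official tritonclient package; the model name, tensor names, and shapes are placeholders that must match the deployed model's config.pbtxt:

```python
import numpy as np
import tritonclient.http as httpclient  # pip install tritonclient[http]

client = httpclient.InferenceServerClient(url="localhost:8000")

# Tensor and model names below are hypothetical examples.
batch = np.random.rand(1, 3, 224, 224).astype(np.float32)
inp = httpclient.InferInput("INPUT__0", list(batch.shape), "FP32")
inp.set_data_from_numpy(batch)
out = httpclient.InferRequestedOutput("OUTPUT__0")

result = client.infer(model_name="resnet50", inputs=[inp], outputs=[out])
print(result.as_numpy("OUTPUT__0").shape)
```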
Quantized Inference Breakthrough for Next-Gen Recommender Systems: OneRec-V2 Achieves 49% Latency Reduction with FP8
New research shows FP8 quantization can dramatically speed up modern generative recommender systems like OneRec-V2, achieving 49% lower latency and 92% higher throughput with no quality loss. This breakthrough bridges the gap between LLM optimization techniques and industrial recommendation workloads.
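As a rough illustration of what FP8 post-training quantization does to a weight or activation tensor, here is a generic per-tensor scale-and-cast in PyTorch (not OneRec-V2's serving stack; torch.float8_e4m3fn requires a recent PyTorch build):

```python
import torch

def quantize_fp8(x: torch.Tensor):
    """Per-tensor FP8 (e4m3) quantization: scale into the representable range,
    cast to float8, and keep the scale for dequantization."""
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # ~448
    scale = x.abs().max().clamp(min=1e-12) / fp8_max
    return (x / scale).to(torch.float8_e4m3fn), scale

def dequantize_fp8(x_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return x_fp8.to(torch.float32) * scale

w = torch.randn(4096, 4096)
w_fp8, s = quantize_fp8(w)
print((w - dequantize_fp8(w_fp8, s)).abs().mean())  # mean quantization error
```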
IonRouter Emerges as Cost-Efficient Challenger to OpenAI's Inference Dominance
YC-backed Cumulus Labs launches IonRouter, a high-throughput inference API that promises to slash AI deployment costs by optimizing for Nvidia's Grace Hopper architecture. The service offers OpenAI-compatible endpoints while enabling teams to run open-source or fine-tuned models without cold starts.
The Hidden Cost of Mixture-of-Experts: New Research Reveals Why MoE Models Struggle at Inference
A groundbreaking paper introduces the 'qs inequality,' revealing how Mixture-of-Experts architectures suffer a 'double penalty' during inference that can make them 4.5x slower than dense models. The research shows training efficiency doesn't translate to inference performance, especially with long contexts.
AWS Expands Claude AI Access Across Southeast Asia with Global Cross-Region Inference
Amazon Bedrock now offers Global Cross-Region Inference for Anthropic's Claude models in Thailand, Malaysia, Singapore, Indonesia, and Taiwan. This enables enterprise customers to access Claude Opus 4.6, Sonnet 4.6, and Haiku 4.5 through a resilient, distributed architecture designed for high-throughput AI applications.
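Calling a model through a cross-region inference profile looks like any other Bedrock call; only the model identifier changes. A hedged boto3 sketch, with the region and profile ID as placeholders to be replaced with the identifiers actually enabled in your account:

```python
import boto3

# Region and inference-profile ID are placeholders; behind the profile,
# Bedrock routes the request across regions for capacity and resilience.
client = boto3.client("bedrock-runtime", region_name="ap-southeast-1")

response = client.converse(
    modelId="global.anthropic.claude-sonnet-example-v1:0",  # hypothetical profile ID
    messages=[{"role": "user", "content": [{"text": "Summarize our Q3 latency report."}]}],
    inferenceConfig={"maxTokens": 512, "temperature": 0.2},
)
print(response["output"]["message"]["content"][0]["text"])
```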
NVIDIA's Inference Breakthrough: Real-World Testing Reveals 100x Performance Gains Beyond Promises
NVIDIA's GTC 2024 promise of 30x inference improvements appears conservative as real-world testing reveals up to 100x gains on rack-scale NVL72 systems. This represents a paradigm shift in AI deployment economics and capabilities.
X Post Reveals Audible Quality Differences in GPU vs. NPU AI Inference
A developer demonstrated audible quality differences in AI text-to-speech output depending on whether inference ran on GPU, CPU, or NPU hardware, highlighting a key efficiency-versus-fidelity trade-off for on-device AI.
MemFactory Framework Unifies Agent Memory Training & Inference, Reports 14.8% Gains Over Baselines
Researchers introduced MemFactory, a unified framework treating agent memory as a trainable component. It supports multiple memory paradigms and shows up to 14.8% relative improvement over baseline methods.
Meta's Adaptive Ranking Model: A Technical Breakthrough for Efficient LLM-Scale Inference
Meta has developed a novel Adaptive Ranking Model (ARM) architecture designed to drastically reduce the computational cost of serving large-scale ranking models for ads. This represents a core infrastructure breakthrough for deploying LLM-scale models in production at massive scale.
Ollama Now Supports Apple MLX Backend for Local LLM Inference on macOS
Ollama, the popular framework for running large language models locally, has added support for Apple's MLX framework as a backend. This enables more efficient execution of models like Llama 3.2 and Mistral on Apple Silicon Macs.
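Backend selection happens inside Ollama, so client code is unchanged. A minimal sketch against the local REST API; the model tag is an example and must already have been pulled with `ollama pull`:

```python
import requests

# Ollama's local server listens on port 11434 by default; the MLX backend,
# where available, is chosen by the runtime rather than by the caller.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.2",
          "prompt": "Explain KV caching in one sentence.",
          "stream": False},
    timeout=120,
)
print(resp.json()["response"])
```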
Throughput Optimization as a Strategic Lever in Large-Scale AI Systems
A new arXiv paper argues that optimizing data pipeline and memory throughput is now a strategic necessity for training large AI models, citing specific innovations like OVERLORD and ZeRO-Offload that deliver measurable efficiency gains.
Why Cheaper LLMs Can Cost More: The Hidden Economics of AI Inference in 2026
A Medium article outlines a practical framework for balancing performance, cost, and operational risk in real-world LLM deployment, arguing that focusing solely on model cost can lead to higher total expenses.
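The gist of such frameworks is that per-task cost depends on token prices multiplied by usage patterns: output verbosity, retries, and how many calls a weaker model needs to finish the same job. A toy comparison with entirely hypothetical prices and token counts:

```python
def cost_per_1k_tasks(calls_per_task, in_tok, out_tok, in_price, out_price, retry_rate=0.0):
    """Dollars per 1,000 completed tasks; prices are per 1M tokens. All numbers hypothetical."""
    per_call = (in_tok * in_price + out_tok * out_price) / 1_000_000
    return 1_000 * calls_per_task * per_call * (1 + retry_rate)

# Strong model: pricier per token, but one call per task and few retries.
premium = cost_per_1k_tasks(1, 2_000, 300, in_price=3.00, out_price=15.00, retry_rate=0.02)

# Cheap model: a fraction of the token price, but wordier, retried more often,
# and needing ~4 calls (decompose, retry, verify) to finish the same task.
budget = cost_per_1k_tasks(4, 2_000, 1_200, in_price=0.50, out_price=2.00, retry_rate=0.20)

print(f"premium: ${premium:.2f}   budget: ${budget:.2f} per 1k tasks")
# With these made-up numbers the 'cheap' model ends up costing more overall.
```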
AgenticGEO: Self-Evolving AI Framework for Generative Search Engine Optimization Outperforms 14 Baselines
Researchers propose AgenticGEO, an AI framework that evolves content strategies to maximize inclusion in generative search engine outputs. It uses MAP-Elites and a Co-Evolving Critic to reduce costly API calls, achieving state-of-the-art performance across 3 datasets.
Beyond Cosine Similarity: How Embedding Magnitude Optimization Can Transform Luxury Search & Recommendation
New research reveals that controlling embedding magnitude—not just direction—significantly boosts retrieval and RAG performance. For luxury retail, this means more accurate product discovery, personalized recommendations, and enhanced clienteling through superior semantic search.
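The distinction is easy to see numerically: cosine similarity discards vector norms, while dot-product scoring keeps them, so magnitude can carry an extra signal such as confidence or salience. A small NumPy illustration with synthetic vectors:

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

query = np.array([1.0, 0.0])
doc_a = np.array([0.9, 0.1])          # similar direction, small norm
doc_b = 3.0 * np.array([0.85, 0.15])  # slightly less aligned, larger magnitude

for name, doc in [("doc_a", doc_a), ("doc_b", doc_b)]:
    print(name, "cosine:", round(cosine(query, doc), 3),
          "dot:", round(float(query @ doc), 3))
# Cosine ranks doc_a first; dot-product scoring ranks doc_b first because its
# norm (e.g. trained to reflect salience or confidence) is larger.
```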
GR4AD: Kuaishou's Production-Ready Generative Recommender for Ads Delivers 4.2% Revenue Lift
Researchers from Kuaishou present GR4AD, a generative recommendation system designed for high-throughput ad serving. It introduces innovations in tokenization (UA-SID), decoding (LazyAR), and optimization (RSPO) to balance performance with cost. Online A/B tests on 400M users show a 4.2% ad revenue improvement.
Robust DPO with Stochastic Negatives Improves Multimodal Sequential Recommendations
New research introduces RoDPO, a method that improves recommendation ranking by using stochastic sampling from a dynamic candidate pool for negative selection during Direct Preference Optimization training. This addresses the false negative problem in implicit feedback, achieving up to 5.25% NDCG@5 gains on Amazon benchmarks.
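The sampling idea itself is simple: rather than pairing each preferred item with one fixed negative, draw negatives stochastically from a candidate pool that excludes known positives. A schematic sketch, not the paper's exact pool construction or weighting:

```python
import random

def sample_negatives(candidate_pool, user_positives, k=4, rng=random):
    """Draw k negatives uniformly from the pool, skipping items the user has
    interacted with (which would be false negatives if treated as dispreferred)."""
    eligible = [item for item in candidate_pool if item not in user_positives]
    return rng.sample(eligible, k)

pool = [f"item_{i}" for i in range(1000)]
positives = {"item_3", "item_42", "item_97"}
for step in range(2):  # negatives are re-drawn at every training step
    print(sample_negatives(pool, positives, k=3))
```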
Text-to-Speech Cost Plummets from $0.15/Word to Free Local Models Using 3GB RAM
Within 12 months, high-quality text-to-speech has gone from cloud services charging $0.15 per word to free local models that run in only 3GB of RAM, signaling a broader price collapse in AI inference.
Google's TurboQuant Compresses LLM KV Cache 6x with Zero Accuracy Loss, Cutting GPU Memory by 80%
Google researchers introduced TurboQuant, a method that compresses LLM KV cache from 32-bit to 3-bit precision without accuracy degradation. This reduces GPU memory consumption by over 80% and speeds up inference 8x on H100 GPUs.
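Setting TurboQuant's specifics aside, the basic mechanics of low-bit KV cache quantization can be sketched as a groupwise affine round-trip over the cached tensors. The 3-bit example below is illustrative only and is not Google's algorithm:

```python
import torch

def quantize_kv(x: torch.Tensor, bits: int = 3):
    """Groupwise affine quantization over the head dimension.
    Returns integer codes plus scale and zero-point for dequantization."""
    levels = 2 ** bits - 1
    lo = x.amin(dim=-1, keepdim=True)
    hi = x.amax(dim=-1, keepdim=True)
    scale = (hi - lo).clamp(min=1e-8) / levels
    codes = torch.round((x - lo) / scale).clamp(0, levels).to(torch.uint8)
    return codes, scale, lo

def dequantize_kv(codes, scale, lo):
    return codes.to(scale.dtype) * scale + lo

k = torch.randn(1, 8, 2048, 128)  # [batch, heads, seq, head_dim]
codes, scale, lo = quantize_kv(k)
err = (k - dequantize_kv(codes, scale, lo)).abs().mean()
print(err)  # small reconstruction error at ~3 bits per value plus metadata
```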
Prompt Compression in Production Task Orchestration: A Pre-Registered Randomized Trial
A new arXiv study shows that aggressive prompt compression can increase total AI inference costs by causing longer outputs, while moderate compression (50% retention) reduces costs by 28%. The findings challenge the 'compress more' heuristic for production AI systems.
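A retention-rate compressor in its simplest form ranks prompt segments and keeps the top fraction by token budget. The sketch below scores sentences by naive word overlap with the instruction; it is purely illustrative, not the study's compressor, but it makes the retention knob concrete:

```python
def compress_prompt(instruction: str, context_sentences: list[str],
                    retention: float = 0.5) -> str:
    """Keep the highest-scoring sentences until ~`retention` of the context
    words remain. Scoring is naive word overlap with the instruction."""
    inst_words = set(instruction.lower().split())
    scored = sorted(context_sentences,
                    key=lambda s: len(inst_words & set(s.lower().split())),
                    reverse=True)
    budget = retention * sum(len(s.split()) for s in context_sentences)
    kept, used = [], 0
    for sent in scored:
        if used + len(sent.split()) > budget:
            break
        kept.append(sent)
        used += len(sent.split())
    # Restore the original ordering of the kept sentences.
    kept = [s for s in context_sentences if s in kept]
    return instruction + "\n" + " ".join(kept)
```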
MinerU-Diffusion: A 2.5B Parameter Diffusion Model for OCR Achieves 3.2x Speedup Over Autoregressive Methods
Researchers introduced MinerU-Diffusion, a 2.5B parameter diffusion model for OCR that replaces autoregressive decoding with parallel block-wise diffusion. It achieves up to 3.2x faster inference while improving robustness on complex documents with tables and formulas.
Reuters Analysis: China's AI Strategy Shifts from Chip Dominance to Open-Source Distribution
A Reuters analysis suggests China's AI advancement may stem from dominating open-source distribution and software optimization, not just semiconductor supremacy. This strategic pivot leverages existing hardware constraints to build ecosystem influence.
CausalDPO: A New Method to Make LLM Recommendations More Robust to Distribution Shifts
Researchers propose CausalDPO, a causal extension to Direct Preference Optimization (DPO) for LLM-based recommendations. It addresses DPO's tendency to amplify spurious correlations, improving out-of-distribution generalization by an average of 17.17%.
We Hosted a 35B LLM on an NVIDIA DGX Spark — A Technical Post-Mortem
A detailed, practical guide to deploying the Qwen3.5-35B model on NVIDIA's GB10 Blackwell hardware. The article serves as a crucial case study on the real-world challenges and solutions for on-premise LLM inference.
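For readers who want to reproduce the gist of such a deployment, here is an offline-inference sketch with vLLM; the model ID is a stand-in for the article's Qwen3.5-35B, and the engine arguments are placeholders that typically need quantization or memory tuning to fit a single GB10-class box:

```python
from vllm import LLM, SamplingParams

# Model ID and engine arguments are placeholders; pick a (quantized) variant
# and tune gpu_memory_utilization / max_model_len to fit the available memory.
llm = LLM(model="Qwen/Qwen2.5-32B-Instruct",
          gpu_memory_utilization=0.90,
          max_model_len=8192)

params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(["Summarize the trade-offs of on-prem LLM serving."], params)
print(outputs[0].outputs[0].text)
```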