IBM Research has announced a breakthrough in AI inferencing that combines speculative decoding with paged attention to improve the cost performance of large language models (LLMs). According to the company, the development promises to make customer care chatbots more efficient and cost-effective.
In recent years, LLMs have improved chatbots' ability to understand customer queries and provide accurate responses. However, the high cost and slow speed of serving these models have hindered broader AI adoption. Speculative decoding is an optimization technique that accelerates inferencing by generating tokens faster, cutting latency by a factor of two to three and improving the customer experience.
Reducing latency, however, has traditionally come at the cost of throughput, that is, the number of users who can use the model simultaneously, which drives up operating costs. IBM Research has tackled this trade-off by cutting the latency of its open-source Granite 20B code model in half while quadrupling its throughput.
Speculative Decoding: Efficiency in Token Generation
LLMs use a transformer architecture, which is inefficient at generating text: a separate forward pass over all previously generated tokens is typically required to produce each new token. Speculative decoding modifies this process so that several prospective tokens are evaluated at once. If those tokens are validated, a single forward pass can yield multiple tokens, increasing inferencing speed.
This technique can be carried out by a smaller, more efficient model or by part of the main model itself. By processing tokens in parallel, speculative decoding makes better use of each GPU, potentially doubling or tripling inferencing speed. Early implementations of speculative decoding, introduced by DeepMind and Google researchers, relied on a separate draft model, while newer methods such as the Medusa speculator eliminate the need for a second model.
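To make the mechanics concrete, here is a minimal Python sketch of the draft-and-verify loop behind draft-model speculative decoding. The `draft_model` and `target_model` callables, the greedy acceptance rule, and the lookahead of four tokens are illustrative assumptions rather than IBM's implementation; a production system would score every drafted position in one batched forward pass of the large model.

```python
# Minimal draft-and-verify loop for speculative decoding (greedy variant).
# `draft_model` and `target_model` are stand-in callables mapping a token
# sequence to a predicted next token; in practice they would be a small and
# a large LLM sharing one tokenizer.
from typing import Callable, List

Token = int
NextTokenFn = Callable[[List[Token]], Token]

def speculative_decode(
    target_model: NextTokenFn,
    draft_model: NextTokenFn,
    prompt: List[Token],
    max_new_tokens: int = 32,
    lookahead: int = 4,
) -> List[Token]:
    tokens = list(prompt)
    while len(tokens) < len(prompt) + max_new_tokens:
        # 1. The cheap draft model proposes `lookahead` tokens autoregressively.
        draft: List[Token] = []
        for _ in range(lookahead):
            draft.append(draft_model(tokens + draft))

        # 2. The large target model checks each proposal. In a real system all
        #    of these positions are verified in a single batched forward pass,
        #    which is where the speed-up comes from.
        context = list(tokens)
        for i in range(lookahead):
            prediction = target_model(context + draft[:i])
            if prediction != draft[i]:
                # First mismatch: keep the target model's own token and
                # discard the rest of the draft.
                tokens.append(prediction)
                break
            tokens.append(draft[i])
        else:
            # Every drafted token was accepted, so the verification pass also
            # yields one extra "bonus" token.
            tokens.append(target_model(context + draft))
    return tokens[: len(prompt) + max_new_tokens]
```

With a well-aligned draft model, each iteration of this loop emits several tokens for the cost of roughly one verification pass by the large model, which is the source of the two-to-three-times latency reduction described above.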
IBM researchers adapted the Medusa speculator by conditioning future tokens on each other rather than on the model’s next predicted token. This approach, combined with an efficient fine-tuning method using small and large batches of text, aligns the speculator’s responses closely with the LLM, significantly boosting inferencing speeds.
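The PyTorch sketch below illustrates what "conditioning future tokens on each other" can look like: each speculation head folds the token drafted by the previous head back into its input instead of relying only on the base model's last hidden state. The class name, layer sizes, and activation are hypothetical choices for illustration and do not reproduce IBM's released speculator architecture.

```python
import torch
import torch.nn as nn

class ChainedSpeculator(nn.Module):
    """Toy multi-head speculator: head k+1 is conditioned on the token drafted
    by head k, not only on the base LLM's final hidden state. All sizes and
    layer choices are illustrative assumptions."""

    def __init__(self, hidden_size: int, vocab_size: int, n_heads: int = 3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        # One small projection + LM head per speculated position.
        self.proj = nn.ModuleList(
            nn.Linear(2 * hidden_size, hidden_size) for _ in range(n_heads)
        )
        self.lm_heads = nn.ModuleList(
            nn.Linear(hidden_size, vocab_size) for _ in range(n_heads)
        )

    def forward(self, last_hidden: torch.Tensor, last_token: torch.Tensor) -> torch.Tensor:
        # last_hidden: (batch, hidden_size) final hidden state of the base LLM
        # last_token:  (batch,) id of the token the base LLM just predicted
        state, token, drafts = last_hidden, last_token, []
        for proj, lm_head in zip(self.proj, self.lm_heads):
            # Fold the previously speculated token back into the state so each
            # successive draft token is conditioned on the one before it.
            state = torch.tanh(proj(torch.cat([state, self.embed(token)], dim=-1)))
            token = lm_head(state).argmax(dim=-1)
            drafts.append(token)
        return torch.stack(drafts, dim=1)  # (batch, n_heads) draft token ids
```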
Paged Attention: Optimizing Memory Usage
Reducing LLM latency often compromises throughput because of the added strain on GPU memory. Dynamic batching can mitigate this, but not when speculative decoding is also competing for the same memory. IBM researchers addressed this by employing paged attention, an optimization technique inspired by the virtual memory and paging concepts used in operating systems.
Traditional attention algorithms store key-value (KV) sequences in contiguous memory, which leads to fragmentation. Paged attention instead divides these sequences into smaller blocks, or pages, that can be accessed as needed. This minimizes redundant computation and lets the speculator generate multiple candidates for each predicted token without duplicating the entire KV-cache, freeing up memory.
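As a rough illustration of the bookkeeping involved, the toy allocator below splits the KV-cache into fixed-size pages and lets speculative branches share a parent sequence's pages through a block table instead of copying them. The class and method names, block size, and reference-counting scheme are assumptions made for illustration; real paged-attention implementations manage GPU tensors and handle copy-on-write for partially filled shared pages.

```python
BLOCK_SIZE = 16  # tokens per KV-cache page (illustrative)

class PagedKVCache:
    """Toy block-table bookkeeping for a paged KV-cache."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))      # physical pages not in use
        self.ref_counts = [0] * num_blocks              # sharing count per page
        self.block_tables: dict[int, list[int]] = {}    # seq_id -> physical pages
        self.lengths: dict[int, int] = {}               # seq_id -> tokens stored

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Reserve a slot for one new token; returns (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:
            # Current page is full (or the sequence is new): grab a fresh page.
            block = self.free_blocks.pop()
            self.ref_counts[block] = 1
            table.append(block)
        self.lengths[seq_id] = length + 1
        return table[-1], length % BLOCK_SIZE

    def fork(self, parent_id: int, child_id: int) -> None:
        """Let a speculative candidate share the parent's pages instead of
        duplicating the whole KV-cache."""
        table = list(self.block_tables[parent_id])
        for block in table:
            self.ref_counts[block] += 1
        self.block_tables[child_id] = table
        self.lengths[child_id] = self.lengths[parent_id]

    def release(self, seq_id: int) -> None:
        """Return a finished sequence's pages once nothing else references them."""
        for block in self.block_tables.pop(seq_id, []):
            self.ref_counts[block] -= 1
            if self.ref_counts[block] == 0:
                self.free_blocks.append(block)
        self.lengths.pop(seq_id, None)
```

Because `fork` only copies the small block table rather than the KV tensors themselves, many speculative candidates can share one prompt's cache, which is how paged attention recovers the throughput that speculative decoding would otherwise consume.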
Future Implications
IBM has integrated speculative decoding and paged attention into its Granite 20B code model. The IBM speculator has been open-sourced on Hugging Face, enabling other developers to adapt these techniques for their LLMs. IBM plans to implement these optimization techniques across all models on its watsonx platform, enhancing enterprise AI applications.