
Wednesday, May 17, 2023

Peking University Researchers Introduce FastServe

How significant is this advance, and what does it reveal?

Peking University Researchers Introduce FastServe: A Distributed Inference Serving System for Large Language Models (LLMs)

By Aneesh Tickoo - May 16, 2023

https://arxiv.org/abs/2305.05920

Large language model (LLM) improvements create opportunities in various fields and are inspiring a new wave of interactive AI applications. The most noteworthy is ChatGPT, which enables people to converse informally with an AI agent to solve problems ranging from software engineering to language translation. Thanks to its remarkable capabilities, ChatGPT is one of the fastest-growing applications in history. Many companies have followed the trend by releasing LLMs and ChatGPT-like products, including Microsoft’s New Bing, Google’s Bard, Meta’s LLaMA, Stanford’s Alpaca, Databricks’ Dolly, and UC Berkeley’s Vicuna.

LLM inference differs from the inference of other deep neural network (DNN) models, such as ResNet, because it has distinctive characteristics. Interactive AI applications built on LLMs depend on inference to function, and their interactive design demands short job completion times (JCT) to deliver an engaging user experience. For instance, users expect an immediate response when they submit input to ChatGPT. At the same time, the number and complexity of LLMs put the inference serving infrastructure under great strain, so businesses set up expensive clusters with accelerators such as GPUs and TPUs to handle LLM inference workloads.

DNN inference jobs are typically deterministic and highly predictable, i.e., the model and the hardware largely determine a job’s execution time. For instance, with the same ResNet model on a given GPU, execution time varies only slightly across input images. LLM inference jobs, by contrast, follow a distinctive autoregressive pattern: an inference job runs through multiple iterations, and each iteration produces one output token that is appended to the input to generate the next token in the following iteration. The output length, which is unknown at the outset, determines both the execution time and the final input length. Existing inference serving systems such as Clockwork and Shepherd cater to deterministic model inference tasks like those of ResNet. ...
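To make the autoregressive pattern concrete, here is a minimal Python sketch of the iteration loop described above. The generate function and the toy model are illustrative stand-ins, not FastServe’s actual code:

    # Minimal sketch of autoregressive decoding: each iteration emits one
    # token and appends it to the input, so total execution time depends on
    # the output length, which is unknown until the stop token appears.
    def generate(model, prompt_tokens, eos_token, max_new_tokens=256):
        tokens = list(prompt_tokens)
        for _ in range(max_new_tokens):
            next_token = model(tokens)    # one forward pass = one iteration
            tokens.append(next_token)     # output token becomes new input
            if next_token == eos_token:   # data-dependent stop condition
                break
        return tokens[len(prompt_tokens):]

    # Toy stand-in for an LLM: counts upward, then emits the stop token.
    EOS = 0
    def toy_model(tokens):
        return EOS if len(tokens) >= 8 else tokens[-1] + 1

    print(generate(toy_model, [1, 2, 3], EOS))  # -> [4, 5, 6, 7, 8, 0]

Because the loop length is data-dependent, a scheduler cannot know a job’s total execution time in advance, which is exactly the unpredictability that distinguishes LLM serving from ResNet-style inference.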
