Residential College | false |
Status | Forthcoming |
Planck: Optimizing LLM Inference Performance in Pipeline Parallelism with Fine-Grained SLO Constraint | |
Lin, Yanying1,2; Peng, Shijie1,2; Wu, Shuaipeng1,3; Li, Yanbo1,3; Lu, Chengzhi1,2; Xu, Chengzhong4 | |
2024 | |
Source Publication | Proceedings of the IEEE International Conference on Web Services, ICWS |
ISSN | 2836-3876 |
Pages | 1306-1313 |
Abstract | Pipeline parallelism is an important strategy for improving inference performance in Large Language Models (LLMs). However, we find that different stages of LLM pipelines exhibit distinct performance and request characteristics, which degrades system performance in online inference scenarios. To address this issue, we propose Planck, a performance optimization framework tailored for LLM pipeline inference. By balancing request traffic, queue length, and execution time at each stage, Planck introduces a progressive SLO (Service Level Objective) allocation method and a per-stage instance performance controller. Planck allocates SLOs to each pipeline stage at fine granularity and dynamically adjusts them according to the request distribution to control queue length. By optimizing queue lengths across the different stages of the model pipeline, Planck effectively reduces waiting time and tail latency. Evaluations conducted on a real cloud cluster with diverse workloads demonstrate that Planck effectively reduces the P99 latency and queue length of each pipeline stage. |
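The abstract describes splitting an end-to-end SLO budget across pipeline stages and adjusting each stage's share based on its queue backlog. The sketch below illustrates that general idea; the function name, the execution-time weighting, and the queue-pressure adjustment factor are all assumptions for illustration, not Planck's actual algorithm.

```python
# Hypothetical sketch of progressive per-stage SLO allocation, inspired by
# the abstract; all names and weights here are illustrative assumptions.

def allocate_stage_slos(total_slo_ms, exec_times_ms, queue_lengths):
    """Split an end-to-end SLO budget across pipeline stages.

    Each stage's base share is proportional to its measured execution
    time, then nudged toward stages with above-average queue backlog so
    they receive more headroom (controlling queue length, as the
    abstract describes).
    """
    assert len(exec_times_ms) == len(queue_lengths)
    # Base share proportional to execution time.
    total_exec = sum(exec_times_ms)
    base = [total_slo_ms * t / total_exec for t in exec_times_ms]
    # Queue-pressure adjustment: shift a small fraction of the budget
    # toward stages whose backlog exceeds the average.
    avg_q = sum(queue_lengths) / len(queue_lengths)
    adjust = [0.1 * total_slo_ms * (q - avg_q) / (avg_q + 1)
              for q in queue_lengths]
    slos = [max(1.0, b + a) for b, a in zip(base, adjust)]
    # Renormalize so per-stage SLOs still sum to the end-to-end budget.
    scale = total_slo_ms / sum(slos)
    return [s * scale for s in slos]

# Example: 4-stage pipeline, 200 ms end-to-end SLO; stage 2 has the
# longest queue, so it receives extra headroom beyond its base share.
slos = allocate_stage_slos(200.0, [30.0, 50.0, 50.0, 70.0], [2, 8, 4, 2])
print([round(s, 1) for s in slos])  # prints [22.0, 66.0, 50.0, 62.0]
```

The renormalization step keeps the per-stage budgets consistent with the end-to-end SLO even after the queue-based adjustment shifts headroom between stages.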
Keyword | LLM Serving; Pipeline Bubble; Pipeline Parallelism; SLO Constraint |
DOI | 10.1109/ICWS62655.2024.00157 |
URL | View the original |
Language | English |
Scopus ID | 2-s2.0-85210239739 |
Document Type | Conference paper |
Collection | DEPARTMENT OF COMPUTER AND INFORMATION SCIENCE |
Affiliation | 1.Chinese Academy of Sciences, Shenzhen Institute of Advanced Technology, China 2.University of Chinese Academy of Sciences, China 3.Southern University of Science and Technology, China 4.University of Macau, Macao |
Recommended Citation GB/T 7714 | Lin, Yanying,Peng, Shijie,Wu, Shuaipeng,et al. Planck: Optimizing LLM Inference Performance in Pipeline Parallelism with Fine-Grained SLO Constraint[J]. Proceedings of the IEEE International Conference on Web Services, ICWS, 2024, 1306-1313. |
APA | Lin, Yanying., Peng, Shijie., Wu, Shuaipeng., Li, Yanbo., Lu, Chengzhi., Xu, Chengzhong., & Ye, Kejiang (2024). Planck: Optimizing LLM Inference Performance in Pipeline Parallelism with Fine-Grained SLO Constraint. Proceedings of the IEEE International Conference on Web Services, ICWS, 1306-1313. |
MLA | Lin, Yanying, et al. "Planck: Optimizing LLM Inference Performance in Pipeline Parallelism with Fine-Grained SLO Constraint". Proceedings of the IEEE International Conference on Web Services, ICWS (2024): 1306-1313. |
Files in This Item: | There are no files associated with this item. |
Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.