Status: Forthcoming
Planck: Optimizing LLM Inference Performance in Pipeline Parallelism with Fine-Grained SLO Constraint
Lin, Yanying (1,2); Peng, Shijie (1,2); Wu, Shuaipeng (1,3); Li, Yanbo (1,3); Lu, Chengzhi (1,2); Xu, Chengzhong (4); Ye, Kejiang (1)
2024
Source Publication: Proceedings of the IEEE International Conference on Web Services, ICWS
ISSN: 2836-3876
Pages: 1306-1313
Abstract: Pipeline parallelism is an important strategy for improving inference performance in Large Language Models (LLMs). However, we find that different stages of an LLM pipeline exhibit distinct performance and request characteristics, which degrades system performance in online inference scenarios. To address this issue, we propose Planck, a performance optimization framework tailored for LLM pipeline inference. By balancing request traffic, queue length, and execution time at each stage, Planck introduces a progressive SLO (Service Level Objective) allocation method and a per-stage instance performance controller. Planck allocates SLOs to each pipeline stage at fine granularity and adjusts them dynamically according to the request distribution to control queue length. By optimizing queue lengths across the stages of the model pipeline, Planck effectively reduces waiting time and tail latency. Evaluations conducted on a real cloud cluster with diverse workloads demonstrate that Planck effectively reduces P99 latency and queue length at every pipeline stage.
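As a rough illustration of the progressive SLO allocation idea described in the abstract, the Python sketch below splits an end-to-end latency SLO into per-stage budgets weighted by each stage's measured execution time and current queue pressure, so that congested stages receive a larger slice of the budget. All names (StageStats, allocate_stage_slos) and the proportional weighting scheme are assumptions made for this sketch, not the paper's actual controller.

# Illustrative sketch only; the weighting below is a hypothetical stand-in
# for Planck's progressive SLO allocation, reconstructed from the abstract.
from dataclasses import dataclass

@dataclass
class StageStats:
    name: str
    exec_time_ms: float  # recent mean execution time of this pipeline stage
    queue_len: int       # current queue length at this stage

def allocate_stage_slos(stages, e2e_slo_ms, queue_weight=0.5):
    # Split an end-to-end latency SLO into per-stage budgets. A stage's
    # share grows with its execution time and its queue pressure, mirroring
    # the idea of adapting per-stage SLOs to the observed request
    # distribution in order to control queue length.
    max_q = max(s.queue_len for s in stages) or 1
    scores = [s.exec_time_ms * (1.0 + queue_weight * s.queue_len / max_q)
              for s in stages]
    total = sum(scores)
    return {s.name: e2e_slo_ms * sc / total for s, sc in zip(stages, scores)}

if __name__ == "__main__":
    stages = [StageStats("stage-0", exec_time_ms=12.0, queue_len=3),
              StageStats("stage-1", exec_time_ms=20.0, queue_len=9),
              StageStats("stage-2", exec_time_ms=15.0, queue_len=1)]
    for name, budget in allocate_stage_slos(stages, e2e_slo_ms=200.0).items():
        print(f"{name}: {budget:.1f} ms of the 200 ms end-to-end SLO")

In a running system, a controller of this kind would re-run the allocation as queue lengths drift, tightening the budgets of lightly loaded stages to absorb bursts at congested ones.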
Keywords: LLM Serving; Pipeline Bubble; Pipeline Parallelism; SLO Constraint
DOI: 10.1109/ICWS62655.2024.00157
Language: English
Scopus ID: 2-s2.0-85210239739
Document Type: Conference paper
Collection: Department of Computer and Information Science
Affiliations:
1. Chinese Academy of Sciences, Shenzhen Institute of Advanced Technology, China
2.University of Chinese Academy of Sciences, China
3.Southern University of Science and Technology, China
4.University of Macau, Macao
Recommended Citation
GB/T 7714: Lin, Yanying, Peng, Shijie, Wu, Shuaipeng, et al. Planck: Optimizing LLM Inference Performance in Pipeline Parallelism with Fine-Grained SLO Constraint[C]//Proceedings of the IEEE International Conference on Web Services, ICWS, 2024: 1306-1313.
APA: Lin, Y., Peng, S., Wu, S., Li, Y., Lu, C., Xu, C., & Ye, K. (2024). Planck: Optimizing LLM Inference Performance in Pipeline Parallelism with Fine-Grained SLO Constraint. Proceedings of the IEEE International Conference on Web Services, ICWS, 1306-1313.
MLA: Lin, Yanying, et al. "Planck: Optimizing LLM Inference Performance in Pipeline Parallelism with Fine-Grained SLO Constraint." Proceedings of the IEEE International Conference on Web Services, ICWS, 2024, pp. 1306-1313.
Files in This Item:
There are no files associated with this item.
Related Services
Similar articles in Google Scholar
Similar articles in Baidu academic
Similar articles in Bing Scholar
[Lin, Yanying]'s Articles
[Peng, Shijie]'s Articles
[Wu, Shuaipeng]'s Articles
Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.