Residential College | false |
Status | Forthcoming |
Planck: Optimizing LLM Inference Performance in Pipeline Parallelism with Fine-Grained SLO Constraint | |
Lin, Yanying1,2; Peng, Shijie1,2; Wu, Shuaipeng1,3; Li, Yanbo1,3; Lu, Chengzhi1,2; Xu, Chengzhong4 | |
2024 | |
Source Publication | Proceedings of the IEEE International Conference on Web Services, ICWS |
ISSN | 2836-3876 |
Pages | 1306-1313 |
Abstract | Pipeline parallelism is an important strategy for improving inference performance in Large Language Models (LLMs). However, we find that different stages of LLM pipelines exhibit distinct performance and request characteristics, which degrades system performance in online inference scenarios. To address this issue, we propose Planck, a performance optimization framework tailored for LLM pipeline inference. By balancing request traffic, queue length, and execution time at each stage, Planck introduces a progressive SLO (Service Level Objective) allocation method and a per-stage instance performance controller. Planck allocates SLOs to each pipeline stage at fine granularity and dynamically adjusts them according to the request distribution to control queue length. By optimizing queue lengths across the different stages of the model pipeline, Planck effectively reduces waiting time and tail latency. Evaluations conducted on a real cloud cluster with diverse workloads demonstrate that Planck effectively reduces the P99 latency and queue length of each pipeline stage. |
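The abstract describes splitting an end-to-end SLO budget across pipeline stages and adjusting each stage's share based on its queue backlog. The sketch below illustrates that general idea; the function name, the execution-time weighting, and the queue-pressure adjustment factor are all assumptions for illustration, not Planck's actual algorithm.

```python
# Hypothetical sketch of progressive per-stage SLO allocation, inspired by
# the abstract; all names and weights here are illustrative assumptions.

def allocate_stage_slos(total_slo_ms, exec_times_ms, queue_lengths):
    """Split an end-to-end SLO budget across pipeline stages.

    Each stage's base share is proportional to its measured execution
    time, then nudged toward stages with above-average queue backlog so
    they receive more headroom (controlling queue length, as the
    abstract describes).
    """
    assert len(exec_times_ms) == len(queue_lengths)
    # Base share proportional to execution time.
    total_exec = sum(exec_times_ms)
    base = [total_slo_ms * t / total_exec for t in exec_times_ms]
    # Queue-pressure adjustment: shift a small fraction of the budget
    # toward stages whose backlog exceeds the average.
    avg_q = sum(queue_lengths) / len(queue_lengths)
    adjust = [0.1 * total_slo_ms * (q - avg_q) / (avg_q + 1)
              for q in queue_lengths]
    slos = [max(1.0, b + a) for b, a in zip(base, adjust)]
    # Renormalize so per-stage SLOs still sum to the end-to-end budget.
    scale = total_slo_ms / sum(slos)
    return [s * scale for s in slos]

# Example: 4-stage pipeline, 200 ms end-to-end SLO; stage 2 has the
# longest queue, so it receives extra headroom beyond its base share.
slos = allocate_stage_slos(200.0, [30.0, 50.0, 50.0, 70.0], [2, 8, 4, 2])
print([round(s, 1) for s in slos])  # prints [22.0, 66.0, 50.0, 62.0]
```

The renormalization step keeps the per-stage budgets consistent with the end-to-end SLO even after the queue-based adjustment shifts headroom between stages.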
Keyword | LLM Serving; Pipeline Bubble; Pipeline Parallelism; SLO Constraint |
DOI | 10.1109/ICWS62655.2024.00157 |
URL | View the original |
Language | English |
Scopus ID | 2-s2.0-85210239739 |
Document Type | Conference paper |
Collection | DEPARTMENT OF COMPUTER AND INFORMATION SCIENCE |
Affiliation | 1.Chinese Academy of Sciences, Shenzhen Institute of Advanced Technology, China 2.University of Chinese Academy of Sciences, China 3.Southern University of Science and Technology, China 4.University of Macau, Macao |
Recommended Citation GB/T 7714 | Lin, Yanying,Peng, Shijie,Wu, Shuaipeng,et al. Planck: Optimizing LLM Inference Performance in Pipeline Parallelism with Fine-Grained SLO Constraint[J]. Proceedings of the IEEE International Conference on Web Services, ICWS, 2024, 1306-1313. |
APA | Lin, Yanying., Peng, Shijie., Wu, Shuaipeng., Li, Yanbo., Lu, Chengzhi., Xu, Chengzhong., & Ye, Kejiang (2024). Planck: Optimizing LLM Inference Performance in Pipeline Parallelism with Fine-Grained SLO Constraint. Proceedings of the IEEE International Conference on Web Services, ICWS, 1306-1313. |
MLA | Lin, Yanying, et al. "Planck: Optimizing LLM Inference Performance in Pipeline Parallelism with Fine-Grained SLO Constraint". Proceedings of the IEEE International Conference on Web Services, ICWS (2024): 1306-1313. |
Files in This Item: | There are no files associated with this item. |
Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.