publications
Publications by category in reverse chronological order.
All Publications
BootSeer: Analyzing and Mitigating Initialization Bottlenecks in Large-Scale LLM Training
Li, Rui, Zhi, Xiaoyun, Chi, Jinxin, Yu, Menghan, Huang, Lixin, Zhu, Jia, Zhang, Weilun, Ma, Xing, Liu, Wenjia, Zhu, Zhicheng, et al.
arXiv preprint arXiv:2507.12619 (2025)
Large Language Models (LLMs) have become a cornerstone of modern AI, driving breakthroughs in natural language processing and expanding into multimodal tasks involving images, audio, and video. As with most computational software, it is important to distinguish between ordinary runtime performance and startup overhead. Prior research has focused on runtime performance: improving training efficiency and stability. This work focuses instead on the increasingly critical issue of startup overhead in training: the delay before training jobs begin execution. Startup overhead is particularly important in large, industrial-scale LLMs, where failures occur more frequently and multiple teams operate in iterative update-debug cycles. In one of our training clusters, more than 3.5% of GPU time is wasted due to startup overhead alone. In this work, we present the first in-depth characterization of LLM training startup overhead based on real production data. We analyze the components of startup cost, quantify its direct impact, and examine how it scales with job size. These insights motivate the design of Bootseer, a system-level optimization framework that addresses three primary startup bottlenecks: (a) container image loading, (b) runtime dependency installation, and (c) model checkpoint resumption. To mitigate these bottlenecks, Bootseer introduces three techniques: (a) hot block record-and-prefetch, (b) dependency snapshotting, and (c) striped HDFS-FUSE. Bootseer has been deployed in a production environment and evaluated on real LLM training workloads, demonstrating a 50% reduction in startup overhead.
HotSwap: Enabling Live Dependency Sharing in Serverless Computing
Li, Rui, Cooperman, Gene, Tiwari, Devesh
Proceedings of the IEEE International Conference on Cloud Computing (CLOUD) (2025)
This work presents HotSwap, a novel provider-side cold-start optimization for serverless computing. This optimization reduces cold-start time when booting and loading dependencies at runtime inside a function container. Previous research has extensively focused on reducing cold-start latency for specific functions. However, little attention has been given to skewed production workloads. In such cases, cross-function optimization becomes essential. Without cross-function optimization, a cloud provider is left with two equally poor options: (i) Either the cloud provider gives up optimization for each function in the long tail (which is slow); or (ii) the cloud provider applies function-specific optimizations (e.g., cache function images) to every function in the long tail (which violates the vendor's cache constraints). HotSwap demonstrates cross-function optimization using a novel pre-warming strategy. In this strategy, a pre-initialized live dependency image is migrated to the new function instance. At the same time, HotSwap respects the provider's cache constraints, because a single pre-warmed dependency image in the cache can be shared among all serverless functions that require that image. HotSwap has been tested on seven representative functions from FunctionBench. In those tests, HotSwap accelerates dependency loading for those serverless functions with large dependency requirements by a factor ranging from 2.2 to 3.2. Simulation experiments using Azure traces indicate that HotSwap can save 88% of space, compared with a previous function-specific method, PreBaking, when sharing a dependency image among ten different functions.
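A back-of-the-envelope calculation makes the sharing argument concrete. The sizes below are hypothetical, chosen only to illustrate why one shared pre-warmed dependency image scales better than caching one image per function; they are not measurements from the paper.

```python
# Hypothetical illustration of the cache-space argument: per-function caching
# (e.g., a PreBaking-style approach) stores one pre-warmed image per function,
# while dependency sharing stores a single image for every function that needs
# the same dependency set.
n_functions = 10          # long-tail functions sharing one dependency set
image_size_mb = 500       # hypothetical size of one pre-warmed dependency image

per_function_cache_mb = n_functions * image_size_mb   # one cached image per function
shared_cache_mb = image_size_mb                       # one shared dependency image

savings = 1 - shared_cache_mb / per_function_cache_mb
print(f"space saved by sharing: {savings:.0%}")       # 90% with these toy numbers,
                                                      # in the ballpark of the 88% above
```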
Automated Intelligent Healing for Cloud-Scale Data Centers
Li, Rui, Cheng, Zhinan, Lee, Patrick P. C., Wang, Pinghui, Qiang, Yi, Lan, Lin, He, Cheng, Lu, Jinlong, Wang, Mian, Ding, Xinquan
40th International Symposium on Reliable Distributed Systems (SRDS) (2021) - Acceptance Rate: 25.5% (27/106)
Modern cloud-scale data centers necessitate self-healing (i.e., the automation of detecting and repairing component failures) to support reliable and scalable cloud services in the face of prevalent failures. Traditional policy-based self-healing solutions rely on expert knowledge to define the proper policies for choosing repair actions, and hence are error-prone and non-scalable in practical deployment. We propose AIHS, an automated intelligent healing system that applies machine learning to achieve scalable self-healing in cloud-scale data centers. AIHS is designed as a full-fledged, general pipeline that supports various machine learning models for predicting accurate repair actions based on raw monitoring logs. We conduct extensive trace-driven and production experiments, and show that AIHS achieves higher prediction accuracy than current self-healing solutions and successfully fixes 92.4% of a total of 33.7 million production failures over seven months. AIHS also reduces the average unavailable time of each failed server by 51% compared to policy-based self-healing. AIHS is now deployed in production cloud-scale data centers at Alibaba with a total of 600K servers. We open-source a Python prototype that reproduces the self-healing pipeline of AIHS for public validation.
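To make the idea of mapping raw monitoring logs to repair actions concrete, here is a minimal sketch using scikit-learn. The log snippets, repair-action labels, and model choice are hypothetical and are not the features or models used in AIHS.

```python
# Minimal sketch (illustrative only): classify a failure's raw monitoring-log
# text into a repair action, in the spirit of an ML-driven self-healing
# pipeline. The data and model below are hypothetical.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical training data: log snippets and the repair action chosen for each failure.
logs = [
    "kernel: I/O error on device sda, logical block 123456",
    "smartd: device sdb, 8 currently unreadable (pending) sectors",
    "systemd: network.service failed, link down on eth0",
    "oom-killer: killed process 4321 (java), total-vm 32GB",
]
actions = ["replace_disk", "replace_disk", "repair_network", "reboot"]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),            # raw log text -> features
    ("clf", RandomForestClassifier(n_estimators=100, random_state=0)),
])
pipeline.fit(logs, actions)

# Predict a repair action for a new failure's log.
print(pipeline.predict(["smartd: device sdc, reallocated sector count rising"]))
```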
EBSNN: Extended Byte Segment Neural Network for Network Traffic Classification
Xiao, Xi, Xiao, Wentao, Li, Rui, Luo, Xiapu, Zheng, Hai-Tao, Xia, Shu-Tao
IEEE Transactions on Dependable and Secure Computing (2021)
Network traffic classification is important to intrusion detection and network management. Most existing methods are based on machine learning techniques and rely on features extracted manually from flows or packets. However, with the rapid growth of network applications, it is difficult for these approaches to handle new complex applications. In this article, we design a novel neural network, the Extended Byte Segment Neural Network (EBSNN), to classify network traffic. EBSNN first divides a packet into header segments and payload segments, which are then fed into encoders composed of recurrent neural networks with an attention mechanism. Based on the outputs, another encoder learns the high-level representation of the whole packet. In particular, side-channel features are learned from header segments to improve the performance. Finally, the label of the packet is obtained by the softmax function. Furthermore, EBSNN can classify network flows by examining only the first few packets. Thorough experiments on real-world datasets show that EBSNN achieves better performance than state-of-the-art methods in both the application identification task and the website identification task.
SAM: Self-Attention based Deep Learning Method for Online Traffic Classification
Xie, Guorui, Li, Qing, Jiang, Yong, Dai, Tao, Shen, Gengbiao, Li, Rui, Sinnott, Richard, Xia, Shutao
Proceedings of the Workshop on Network Meets AI & ML, ACM SIGCOMM 2020 Workshop (2020) - Acceptance Rate: 24% (13/38)
Network traffic classification categorizes traffic classes based on protocols (e.g., HTTP or DNS) or applications (e.g., Facebook or Gmail). Its accuracy is a key foundation of network management tasks such as Quality-of-Service (QoS) control and anomaly detection. To further improve the accuracy of traffic classification, recent research has introduced deep-learning-based methods. However, most of them utilize the payload (user data), which raises privacy concerns. Besides, they generally do not consider the dependency among bytes in a packet, which we believe can be exploited for more accurate classification. In this work, we treat the initial bytes of a network packet as a language and propose a novel Self-Attention based Method (SAM) for traffic classification. The average F1-scores of SAM on protocol and application classification are 98.62% and 98.93%, respectively. With the higher accuracy of SAM, better QoS control and anomaly detection can be achieved.
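A minimal sketch of the general idea, assuming PyTorch: treat the first bytes of a packet as a token sequence, apply self-attention over the byte positions, and classify the pooled representation. The dimensions, pooling, and classifier head are illustrative choices, not SAM's exact architecture.

```python
# Illustrative sketch: self-attention over the initial bytes of a packet,
# followed by mean pooling and a linear classifier. Not the paper's exact model.
import torch
import torch.nn as nn

class SelfAttentionClassifier(nn.Module):
    def __init__(self, n_classes: int, n_bytes: int = 50, d_model: int = 64):
        super().__init__()
        self.byte_embed = nn.Embedding(256, d_model)      # one embedding per byte value
        self.pos_embed = nn.Embedding(n_bytes, d_model)   # learned positional embeddings
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.fc = nn.Linear(d_model, n_classes)

    def forward(self, packet_bytes: torch.Tensor) -> torch.Tensor:
        # packet_bytes: (batch, n_bytes) integers in [0, 255]
        pos = torch.arange(packet_bytes.size(1), device=packet_bytes.device)
        x = self.byte_embed(packet_bytes) + self.pos_embed(pos)
        x, _ = self.attn(x, x, x)        # self-attention across byte positions
        return self.fc(x.mean(dim=1))    # mean-pool, then predict protocol/application

model = SelfAttentionClassifier(n_classes=10)
logits = model(torch.randint(0, 256, (8, 50)))   # batch of 8 truncated packets -> (8, 10)
```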
Novel Dynamic Multiple Classification System for Network Traffic
Xiao, Xi, Li, Rui, Zheng, Hai-Tao, Ye, Runguo, Kumar Sangaiah, Arun, Xia, Shutao
Information Sciences (2019)
Byte Segment Neural Network for Network Traffic Classification
Li, Rui, Xiao, Xi, Ni, Shiguang, Zheng, Haitao, Xia, Shutao
Proceedings of the IEEE/ACM 26th International Symposium on Quality of Service (2018) - Acceptance Rate: 20.8% (26/125)
Network traffic classification, which maps network traffic to application-layer protocols, is a fundamental technique for network management and security tasks such as Quality of Service, network measurement, and network monitoring. Recent research focuses on extracting features from flows or datagrams of a specific protocol for traditional machine learning methods. However, with the rapid growth of network applications, previous works cannot handle complex novel protocols well. In this paper, we introduce the recurrent neural network to network traffic classification and design a novel neural network, the Byte Segment Neural Network (BSNN). BSNN takes network datagrams as input and gives the classification results directly. In BSNN, a datagram is first broken into several byte segments. Then, these segments are fed to encoders based on the recurrent neural network. The information extracted by the encoders is combined into a representation vector of the whole datagram. Finally, we apply the softmax function to this vector to predict the application protocol of the datagram. BSNN has several key advantages: 1) it requires no prior knowledge of target applications; 2) it can handle both connection-oriented and connection-less protocols; 3) it supports multi-classification for protocols; 4) it shows outstanding accuracy on both traditional protocols and complex novel protocols. Our thorough experiments on real-world data with different protocols indicate that BSNN achieves an average F1-measure of about 95.82% in multi-classification over five protocols: QQ, PPLive, DNS, 360, and BitTorrent. It also shows excellent performance in detecting novel protocols. Furthermore, compared with two recent state-of-the-art works, BSNN outperforms both the traditional machine learning-based method and the packet inspection method.
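The segment-then-encode idea can be sketched as follows, assuming PyTorch. The segment length, hidden sizes, and the way segment representations are combined are illustrative choices, not the exact BSNN design.

```python
# Illustrative sketch: split a datagram into fixed-size byte segments, encode
# each segment with a GRU, summarize the segment representations with a second
# GRU, and classify the protocol (softmax is applied implicitly by
# CrossEntropyLoss during training). Not the paper's exact architecture.
import torch
import torch.nn as nn

class ByteSegmentClassifier(nn.Module):
    def __init__(self, n_protocols: int, seg_len: int = 8, d_emb: int = 32, d_hid: int = 64):
        super().__init__()
        self.seg_len = seg_len
        self.embed = nn.Embedding(256, d_emb)                           # embed raw byte values
        self.seg_encoder = nn.GRU(d_emb, d_hid, batch_first=True)      # per-segment encoder
        self.datagram_encoder = nn.GRU(d_hid, d_hid, batch_first=True) # combines segment vectors
        self.classifier = nn.Linear(d_hid, n_protocols)

    def forward(self, datagram: torch.Tensor) -> torch.Tensor:
        # datagram: (batch, n_bytes), with n_bytes divisible by seg_len for simplicity
        b, n = datagram.shape
        segments = datagram.view(b, n // self.seg_len, self.seg_len)
        x = self.embed(segments.reshape(-1, self.seg_len))   # (b * n_seg, seg_len, d_emb)
        _, h = self.seg_encoder(x)                           # h: (1, b * n_seg, d_hid)
        seg_repr = h.squeeze(0).view(b, n // self.seg_len, -1)
        _, h = self.datagram_encoder(seg_repr)               # summarize all segments
        return self.classifier(h.squeeze(0))                 # protocol logits

model = ByteSegmentClassifier(n_protocols=5)
logits = model(torch.randint(0, 256, (4, 64)))               # 4 datagrams of 64 bytes -> (4, 5)
```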
LogFold: Enhancing Log Anomaly Detection through Sequence Folding and Reconstruction
Shi, X., Li, Rui, Du, Q., He, C., Tian, F.
30th Asia-Pacific Software Engineering Conference (APSEC) (2023)
Modern large-scale systems and networks necessitate automated anomaly detection to support the high availability and quality of services. Since logs are an essential data source that can accurately reflect the state of a system, log anomaly detection has attracted a lot of attention from researchers in both academia and industry. As artificial intelligence technology advances, plenty of work has adopted deep learning to detect log anomalies and achieved promising results. Nevertheless, such approaches usually suffer from a lack of labels, excessive log sequence length, and low throughput when deployed to real-world systems. To address these challenges, we propose LogFold, an unsupervised Transformer-based log anomaly detection approach. In LogFold, we propose fold embedding, which compresses long log sequences to enhance the efficiency of anomaly detection, and we design a sequence reconstruction technique to enhance the effectiveness of anomaly detection. Our evaluation shows LogFold achieves 90.55% and 99.90% F1-scores on the HDFS and BGL datasets, respectively, outperforming state-of-the-art methods. Besides, the fold embedding layer achieves compression rates of 36.55% and 64.86% on the HDFS and BGL datasets, respectively, which helps to improve the throughput of LogFold.
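A minimal sketch of reconstruction-based log anomaly detection, assuming PyTorch: encode a sequence of log keys (template IDs), reconstruct it, and score a sequence by its reconstruction error. The fold-embedding compression and LogFold's exact Transformer architecture are not reproduced here.

```python
# Illustrative sketch: a small Transformer encoder reconstructs a log-key
# sequence; sequences with high reconstruction error are flagged as anomalous.
# Hyperparameters and architecture are hypothetical.
import torch
import torch.nn as nn

class LogReconstructor(nn.Module):
    def __init__(self, n_log_keys: int, d_model: int = 64):
        super().__init__()
        self.embed = nn.Embedding(n_log_keys, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.decode = nn.Linear(d_model, n_log_keys)          # predict each log key back

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        return self.decode(self.encoder(self.embed(seq)))     # (batch, seq_len, n_log_keys)

def anomaly_score(model: LogReconstructor, seq: torch.Tensor) -> torch.Tensor:
    # Per-sequence reconstruction loss; higher means more likely anomalous.
    logits = model(seq)
    return nn.functional.cross_entropy(
        logits.transpose(1, 2), seq, reduction="none"
    ).mean(dim=1)

model = LogReconstructor(n_log_keys=100)
sequences = torch.randint(0, 100, (2, 32))    # 2 sequences of 32 log keys each
print(anomaly_score(model, sequences))        # one score per sequence
```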
Robust data preprocessing for machine-learning-based disk failure prediction in cloud production environments
Han, Shujie, Wu, Jun, Xu, Erci, He, Cheng, Lee, Patrick PC, Qiang, Yi, Zheng, Qixing, Huang, Tao, Huang, Zixi, Li, Rui
arXiv preprint arXiv:1912.09722 (2019)
To provide proactive fault tolerance for modern cloud data centers, extensive studies have proposed machine learning (ML) approaches to predict imminent disk failures for early remedy and evaluated their approaches directly on public datasets (e.g., Backblaze SMART logs). However, in real-world production environments, the data quality is imperfect (e.g., inaccurate labeling, missing data samples, and complex failure types), thereby degrading the prediction accuracy. We present RODMAN, a robust data preprocessing pipeline that refines data samples before feeding them into ML models. We start with a large-scale trace-driven study of over three million disks from Alibaba Cloud's data centers, and motivate the practical challenges in ML-based disk failure prediction. We then design RODMAN with three data preprocessing techniques, namely failure-type filtering, spline-based data filling, and automated pre-failure backtracking, that are applicable to general ML models. Evaluation on both the Alibaba and Backblaze datasets shows that RODMAN improves the prediction accuracy compared to using no data preprocessing under various settings.
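The spline-based data-filling step can be illustrated with pandas, which delegates spline interpolation to SciPy. The attribute name and values below are hypothetical; they only show how missing samples in a per-disk time series could be filled before model training.

```python
# Illustrative sketch: fill missing SMART samples in a per-disk series with a
# cubic spline (requires SciPy). Values and the attribute name are hypothetical.
import numpy as np
import pandas as pd

# Hypothetical daily samples of one SMART attribute for one disk, with gaps.
smart = pd.Series(
    [100.0, 98.0, np.nan, 95.0, np.nan, np.nan, 88.0, 85.0],
    name="smart_187_normalized",
)

filled = smart.interpolate(method="spline", order=3)   # cubic-spline data filling
print(filled)
```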