2025-03-13

cs.AR - Architecture

Title	Authors	Published	PDF	Abstract
Demystifying FPGA Hard NoC Performance	Sihao Liu, Jake Ke, Tony Nowatzki, Jason Cong	2025-03-13	Download	With the advent of modern multi-chiplet FPGA architectures, vendors have begun integrating hardened NoC to address the scalability, resource usage, and frequency disadvantages of soft NoCs.
CODEI: Resource-Efficient Task-Driven Co-Design of Perception and Decision Making for Mobile Robots Applied to Autonomous Vehicles	Dejan Milojevic, Gioele Zardini, Miriam Elser, Andrea Censi, Emilio Frazzoli	2025-03-13	Download	This paper discusses the integration challenges and strategies for designing mobile robots, by focusing on the task-driven, optimal selection of hardware and software to balance safety, efficiency, an...
Efficient Implementation of CRYSTALS-KYBER Key Encapsulation Mechanism on ESP32	Fabian Segatz, Muhammad Ihsan Al Hafiz	2025-03-13	Download	Kyber, an IND-CCA2-secure lattice-based post-quantum key-encapsulation mechanism, is the winner of the first post-quantum cryptography standardization process of the US National Institute of Standards...
Faster Inference of LLMs using FP8 on the Intel Gaudi	Joonhyung Lee, Shmulik Markovich-Golan, Daniel Ohayon, Yair Hanani, Gunho Park, Byeongwook Kim, Asaf Karnieli, Uri Livne, Haihao Shen, Tai Huang, Se Jung Kwon, Dongsoo Lee	2025-03-13	Download	Low-precision data types are essential in modern neural networks during both training and inference as they enhance throughput and computational capacity by better exploiting available hardware resour...

cs.DC - Distributed, Parallel, and Cluster Computing

Title	Authors	Published	PDF	Abstract
Resource Heterogeneity-Aware and Utilization-Enhanced Scheduling for Deep Learning Clusters	Abeda Sultana, Nabin Pakka, Fei Xu, Xu Yuan, Li Chen, Nian-Feng Tzeng	2025-03-13	Download	Scheduling deep learning (DL) models to train on powerful clusters with accelerators like GPUs and TPUs, presently falls short, either lacking fine-grained heterogeneity awareness or leaving resources...
Design and Analysis of an Extreme-Scale, High-Performance, and Modular Agent-Based Simulation Platform	Lukas Johannes Breitwieser	2025-03-13	Download	Agent-based modeling is indispensable for studying complex systems across many domains. However, existing simulation platforms exhibit two major issues: performance and modularity.
Galvatron: Automatic Distributed Training for Large Transformer Models	Esmail Gumaan	2025-03-13	Download	Training multi-billion to trillion-parameter language models efficiently on GPU clusters requires leveraging multiple parallelism strategies. We present Galvatron, a novel open-source framework (dubbe...
Efficient Precoding in XL-MIMO-AFDM System	Jun Zhu, Yin Xu, Dazhi He, Haoyang Li, Yunfeng Guan, Wenjun Zhang, Tianyao Ma, Haozhi Yuan	2025-03-13	Download	This paper explores the potential of affine frequency division multiplexing (AFDM) to mitigate the multiuser interference (MUI) problem by employing time-domain precoding in extremely-large-scale mult...
Concurrent Scheduling of High-Level Parallel Programs on Multi-GPU Systems	Fabian Knorr, Philip Salzmann, Peter Thoman, Thomas Fahringer	2025-03-13	Download	Parallel programming models can encourage performance portability by moving the responsibility for work assignment and data distribution from the programmer to a runtime system.
SPPO: Efficient Long-sequence LLM Training via Adaptive Sequence Pipeline Parallel Offloading	Qiaoling Chen, Shenggui Li, Wei Gao, Peng Sun, Yonggang Wen, Tianwei Zhang	2025-03-13	Download	In recent years, Large Language Models (LLMs) have exhibited remarkable capabilities, driving advancements in real-world applications. However, training LLMs on increasingly long input sequences impos...
Collaborative Speculative Inference for Efficient LLM Inference Serving	Luyao Gao, Jianchun Liu, Hongli Xu, Xichong Zhang, Yunming Liao, Liusheng Huang	2025-03-13	Download	Speculative inference is a promising paradigm employing small speculative models (SSMs) as drafters to generate draft tokens, which are subsequently verified in parallel by the target large language m...
Message Size Matters: AlterBFT's Approach to Practical Synchronous BFT in Public Clouds	Nenad Milošević, Daniel Cason, Zarko Milošević, Robert Soulé, Fernando Pedone	2025-03-13	Download	Synchronous consensus protocols offer a significant advantage over their asynchronous and partially synchronous counterparts by providing higher fault tolerance -- an essential benefit in distributed ...
Samoyeds: Accelerating MoE Models with Structured Sparsity Leveraging Sparse Tensor Cores	Chenpeng Wu, Qiqi Gu, Heng Shi, Jianguo Yao, Haibing Guan	2025-03-13	Download	The escalating size of Mixture-of-Experts (MoE) based Large Language Models (LLMs) presents significant computational and memory challenges, necessitating innovative solutions to enhance efficiency wi...
Shaved Ice: Optimal Compute Resource Commitments for Dynamic Multi-Cloud Workloads	Murray Stokely, Neel Nadgir, Jack Peele, Orestis Kostakis	2025-03-13	Download	Cloud providers have introduced pricing models to incentivize long-term commitments of compute capacity. These long-term commitments allow the cloud providers to get guaranteed revenue for their inves...
Efficient Federated Fine-Tuning of Large Language Models with Layer Dropout	Shilong Wang, Jianchun Liu, Hongli Xu, Jiaming Yan, Xianjun Gao	2025-03-13	Download	Fine-tuning plays a crucial role in enabling pre-trained LLMs to evolve from general language comprehension to task-specific expertise. To preserve user data privacy, federated fine-tuning is often em...
Introducing MareNostrum5: A European pre-exascale energy-efficient system designed to serve a broad spectrum of scientific workloads	Fabio Banchelli, Marta Garcia-Gasulla, Filippo Mantovani, Joan Vinyals, Josep Pocurull, David Vicente, Beatriz Eguzkitza, Flavio C. C. Galeazzo, Mario C. Acosta, Sergi Girona	2025-03-13	Download	MareNostrum5 is a pre-exascale supercomputer at the Barcelona Supercomputing Center (BSC), part of the EuroHPC Joint Undertaking. With a peak performance of 314 petaflops, MareNostrum5 features a hybr...

cs.NI - Networking and Internet Architecture

Title	Authors	Published	PDF	Abstract
Motor Rotation Speed Estimation based on Magnetic Inductive Sensing	Rahul Hoskeri, Hua Huang	2025-03-13	Download	Rotation speed is a key metric for many applications, such as calibrating electric motors in a factory, monitoring a car's engine health, detecting faults in electrical appliances, and more.

cs.OS - Operating Systems

Title	Authors	Published	PDF	Abstract
Samoyeds: Accelerating MoE Models with Structured Sparsity Leveraging Sparse Tensor Cores	Chenpeng Wu, Qiqi Gu, Heng Shi, Jianguo Yao, Haibing Guan	2025-03-13	Download	The escalating size of Mixture-of-Experts (MoE) based Large Language Models (LLMs) presents significant computational and memory challenges, necessitating innovative solutions to enhance efficiency wi...

cs.PF - Performance

Title	Authors	Published	PDF	Abstract
Design and Analysis of an Extreme-Scale, High-Performance, and Modular Agent-Based Simulation Platform	Lukas Johannes Breitwieser	2025-03-13	Download	Agent-based modeling is indispensable for studying complex systems across many domains. However, existing simulation platforms exhibit two major issues: performance and modularity.
Super-Linear Speedup by Generalizing Runtime Repeated Recursion Unfolding in Prolog	Thom Fruehwirth	2025-03-13	Download	Runtime repeated recursion unfolding was recently introduced as a just-in-time program transformation strategy that can achieve super-linear speedup.
Introducing MareNostrum5: A European pre-exascale energy-efficient system designed to serve a broad spectrum of scientific workloads	Fabio Banchelli, Marta Garcia-Gasulla, Filippo Mantovani, Joan Vinyals, Josep Pocurull, David Vicente, Beatriz Eguzkitza, Flavio C. C. Galeazzo, Mario C. Acosta, Sergi Girona	2025-03-13	Download	MareNostrum5 is a pre-exascale supercomputer at the Barcelona Supercomputing Center (BSC), part of the EuroHPC Joint Undertaking. With a peak performance of 314 petaflops, MareNostrum5 features a hybr...