
2025-03-13

cs.AR - Architecture

| Title | Authors | Date | PDF | Abstract |
| --- | --- | --- | --- | --- |
| Demystifying FPGA Hard NoC Performance | Sihao Liu, Jake Ke, Tony Nowatzki, Jason Cong | 2025-03-13 | Download | With the advent of modern multi-chiplet FPGA architectures, vendors have begun integrating hardened NoC to address the scalability, resource usage, and frequency disadvantages of soft NoCs. |
| CODEI: Resource-Efficient Task-Driven Co-Design of Perception and Decision Making for Mobile Robots Applied to Autonomous Vehicles | Dejan Milojevic, Gioele Zardini, Miriam Elser, Andrea Censi, Emilio Frazzoli | 2025-03-13 | Download | This paper discusses the integration challenges and strategies for designing mobile robots, by focusing on the task-driven, optimal selection of hardware and software to balance safety, efficiency, an... |
| Efficient Implementation of CRYSTALS-KYBER Key Encapsulation Mechanism on ESP32 | Fabian Segatz, Muhammad Ihsan Al Hafiz | 2025-03-13 | Download | Kyber, an IND-CCA2-secure lattice-based post-quantum key-encapsulation mechanism, is the winner of the first post-quantum cryptography standardization process of the US National Institute of Standards... |
| Faster Inference of LLMs using FP8 on the Intel Gaudi | Joonhyung Lee, Shmulik Markovich-Golan, Daniel Ohayon, Yair Hanani, Gunho Park, Byeongwook Kim, Asaf Karnieli, Uri Livne, Haihao Shen, Tai Huang, Se Jung Kwon, Dongsoo Lee | 2025-03-13 | Download | Low-precision data types are essential in modern neural networks during both training and inference as they enhance throughput and computational capacity by better exploiting available hardware resour... |

cs.DC - Distributed, Parallel, and Cluster Computing

| Title | Authors | Date | PDF | Abstract |
| --- | --- | --- | --- | --- |
| Resource Heterogeneity-Aware and Utilization-Enhanced Scheduling for Deep Learning Clusters | Abeda Sultana, Nabin Pakka, Fei Xu, Xu Yuan, Li Chen, Nian-Feng Tzeng | 2025-03-13 | Download | Scheduling deep learning (DL) models to train on powerful clusters with accelerators like GPUs and TPUs, presently falls short, either lacking fine-grained heterogeneity awareness or leaving resources... |
| Design and Analysis of an Extreme-Scale, High-Performance, and Modular Agent-Based Simulation Platform | Lukas Johannes Breitwieser | 2025-03-13 | Download | Agent-based modeling is indispensable for studying complex systems across many domains. However, existing simulation platforms exhibit two major issues: performance and modularity. |
| Galvatron: Automatic Distributed Training for Large Transformer Models | Esmail Gumaan | 2025-03-13 | Download | Training multi-billion to trillion-parameter language models efficiently on GPU clusters requires leveraging multiple parallelism strategies. We present Galvatron, a novel open-source framework (dubbe... |
| Efficient Precoding in XL-MIMO-AFDM System | Jun Zhu, Yin Xu, Dazhi He, Haoyang Li, Yunfeng Guan, Wenjun Zhang, Tianyao Ma, Haozhi Yuan | 2025-03-13 | Download | This paper explores the potential of affine frequency division multiplexing (AFDM) to mitigate the multiuser interference (MUI) problem by employing time-domain precoding in extremely-large-scale mult... |
| Concurrent Scheduling of High-Level Parallel Programs on Multi-GPU Systems | Fabian Knorr, Philip Salzmann, Peter Thoman, Thomas Fahringer | 2025-03-13 | Download | Parallel programming models can encourage performance portability by moving the responsibility for work assignment and data distribution from the programmer to a runtime system. |
| SPPO:Efficient Long-sequence LLM Training via Adaptive Sequence Pipeline Parallel Offloading | Qiaoling Chen, Shenggui Li, Wei Gao, Peng Sun, Yonggang Wen, Tianwei Zhang | 2025-03-13 | Download | In recent years, Large Language Models (LLMs) have exhibited remarkable capabilities, driving advancements in real-world applications. However, training LLMs on increasingly long input sequences impos... |
| Collaborative Speculative Inference for Efficient LLM Inference Serving | Luyao Gao, Jianchun Liu, Hongli Xu, Xichong Zhang, Yunming Liao, Liusheng Huang | 2025-03-13 | Download | Speculative inference is a promising paradigm employing small speculative models (SSMs) as drafters to generate draft tokens, which are subsequently verified in parallel by the target large language m... |
| Message Size Matters: AlterBFT's Approach to Practical Synchronous BFT in Public Clouds | Nenad Milošević, Daniel Cason, Zarko Milošević, Robert Soulé, Fernando Pedone | 2025-03-13 | Download | Synchronous consensus protocols offer a significant advantage over their asynchronous and partially synchronous counterparts by providing higher fault tolerance -- an essential benefit in distributed ... |
| Samoyeds: Accelerating MoE Models with Structured Sparsity Leveraging Sparse Tensor Cores | Chenpeng Wu, Qiqi Gu, Heng Shi, Jianguo Yao, Haibing Guan | 2025-03-13 | Download | The escalating size of Mixture-of-Experts (MoE) based Large Language Models (LLMs) presents significant computational and memory challenges, necessitating innovative solutions to enhance efficiency wi... |
| Shaved Ice: Optimal Compute Resource Commitments for Dynamic Multi-Cloud Workloads | Murray Stokely, Neel Nadgir, Jack Peele, Orestis Kostakis | 2025-03-13 | Download | Cloud providers have introduced pricing models to incentivize long-term commitments of compute capacity. These long-term commitments allow the cloud providers to get guaranteed revenue for their inves... |
| Efficient Federated Fine-Tuning of Large Language Models with Layer Dropout | Shilong Wang, Jianchun Liu, Hongli Xu, Jiaming Yan, Xianjun Gao | 2025-03-13 | Download | Fine-tuning plays a crucial role in enabling pre-trained LLMs to evolve from general language comprehension to task-specific expertise. To preserve user data privacy, federated fine-tuning is often em... |
| Introducing MareNostrum5: A European pre-exascale energy-efficient system designed to serve a broad spectrum of scientific workloads | Fabio Banchelli, Marta Garcia-Gasulla, Filippo Mantovani, Joan Vinyals, Josep Pocurull, David Vicente, Beatriz Eguzkitza, Flavio C. C. Galeazzo, Mario C. Acosta, Sergi Girona | 2025-03-13 | Download | MareNostrum5 is a pre-exascale supercomputer at the Barcelona Supercomputing Center (BSC), part of the EuroHPC Joint Undertaking. With a peak performance of 314 petaflops, MareNostrum5 features a hybr... |

cs.NI - Networking and Internet Architecture

| Title | Authors | Date | PDF | Abstract |
| --- | --- | --- | --- | --- |
| Motor Rotation Speed Estimation based on Magnetic Inductive Sensing | Rahul Hoskeri, Hua Huang | 2025-03-13 | Download | Rotation speed is a key metric for many applications, such as calibrating electric motors in a factory, monitoring a car's engine health, detecting faults in electrical appliances, and more. |

cs.OS - Operating Systems

| Title | Authors | Date | PDF | Abstract |
| --- | --- | --- | --- | --- |
| Samoyeds: Accelerating MoE Models with Structured Sparsity Leveraging Sparse Tensor Cores | Chenpeng Wu, Qiqi Gu, Heng Shi, Jianguo Yao, Haibing Guan | 2025-03-13 | Download | The escalating size of Mixture-of-Experts (MoE) based Large Language Models (LLMs) presents significant computational and memory challenges, necessitating innovative solutions to enhance efficiency wi... |

cs.PF - Performance

| Title | Authors | Date | PDF | Abstract |
| --- | --- | --- | --- | --- |
| Design and Analysis of an Extreme-Scale, High-Performance, and Modular Agent-Based Simulation Platform | Lukas Johannes Breitwieser | 2025-03-13 | Download | Agent-based modeling is indispensable for studying complex systems across many domains. However, existing simulation platforms exhibit two major issues: performance and modularity. |
| Super-Linear Speedup by Generalizing Runtime Repeated Recursion Unfolding in Prolog | Thom Fruehwirth | 2025-03-13 | Download | Runtime repeated recursion unfolding was recently introduced as a just-in-time program transformation strategy that can achieve super-linear speedup. |
| Introducing MareNostrum5: A European pre-exascale energy-efficient system designed to serve a broad spectrum of scientific workloads | Fabio Banchelli, Marta Garcia-Gasulla, Filippo Mantovani, Joan Vinyals, Josep Pocurull, David Vicente, Beatriz Eguzkitza, Flavio C. C. Galeazzo, Mario C. Acosta, Sergi Girona | 2025-03-13 | Download | MareNostrum5 is a pre-exascale supercomputer at the Barcelona Supercomputing Center (BSC), part of the EuroHPC Joint Undertaking. With a peak performance of 314 petaflops, MareNostrum5 features a hybr... |
