2025-03-13
cs.AR - Hardware Architecture
| Title | Authors | Published | Download | Abstract |
|---|---|---|---|---|
| Demystifying FPGA Hard NoC Performance | Sihao Liu, Jake Ke, Tony Nowatzki, Jason Cong | 2025-03-13 | Download | With the advent of modern multi-chiplet FPGA architectures, vendors have begun integrating hardened NoC to address the scalability, resource usage, and frequency disadvantages of soft NoCs. |
| CODEI: Resource-Efficient Task-Driven Co-Design of Perception and Decision Making for Mobile Robots Applied to Autonomous Vehicles | Dejan Milojevic, Gioele Zardini, Miriam Elser, Andrea Censi, Emilio Frazzoli | 2025-03-13 | Download | This paper discusses the integration challenges and strategies for designing mobile robots, by focusing on the task-driven, optimal selection of hardware and software to balance safety, efficiency, an... |
| Efficient Implementation of CRYSTALS-KYBER Key Encapsulation Mechanism on ESP32 | Fabian Segatz, Muhammad Ihsan Al Hafiz | 2025-03-13 | Download | Kyber, an IND-CCA2-secure lattice-based post-quantum key-encapsulation mechanism, is the winner of the first post-quantum cryptography standardization process of the US National Institute of Standards... |
| Faster Inference of LLMs using FP8 on the Intel Gaudi | Joonhyung Lee, Shmulik Markovich-Golan, Daniel Ohayon, Yair Hanani, Gunho Park, Byeongwook Kim, Asaf Karnieli, Uri Livne, Haihao Shen, Tai Huang, Se Jung Kwon, Dongsoo Lee | 2025-03-13 | Download | Low-precision data types are essential in modern neural networks during both training and inference as they enhance throughput and computational capacity by better exploiting available hardware resour... |
cs.DC - Distributed, Parallel, and Cluster Computing
| Title | Authors | Published | Download | Abstract |
|---|---|---|---|---|
| Resource Heterogeneity-Aware and Utilization-Enhanced Scheduling for Deep Learning Clusters | Abeda Sultana, Nabin Pakka, Fei Xu, Xu Yuan, Li Chen, Nian-Feng Tzeng | 2025-03-13 | Download | Scheduling deep learning (DL) models to train on powerful clusters with accelerators like GPUs and TPUs, presently falls short, either lacking fine-grained heterogeneity awareness or leaving resources... |
| Design and Analysis of an Extreme-Scale, High-Performance, and Modular Agent-Based Simulation Platform | Lukas Johannes Breitwieser | 2025-03-13 | Download | Agent-based modeling is indispensable for studying complex systems across many domains. However, existing simulation platforms exhibit two major issues: performance and modularity. |
| Galvatron: Automatic Distributed Training for Large Transformer Models | Esmail Gumaan | 2025-03-13 | Download | Training multi-billion to trillion-parameter language models efficiently on GPU clusters requires leveraging multiple parallelism strategies. We present Galvatron, a novel open-source framework (dubbe... |
| Efficient Precoding in XL-MIMO-AFDM System | Jun Zhu, Yin Xu, Dazhi He, Haoyang Li, Yunfeng Guan, Wenjun Zhang, Tianyao Ma, Haozhi Yuan | 2025-03-13 | Download | This paper explores the potential of affine frequency division multiplexing (AFDM) to mitigate the multiuser interference (MUI) problem by employing time-domain precoding in extremely-large-scale mult... |
| Concurrent Scheduling of High-Level Parallel Programs on Multi-GPU Systems | Fabian Knorr, Philip Salzmann, Peter Thoman, Thomas Fahringer | 2025-03-13 | Download | Parallel programming models can encourage performance portability by moving the responsibility for work assignment and data distribution from the programmer to a runtime system. |
| SPPO: Efficient Long-sequence LLM Training via Adaptive Sequence Pipeline Parallel Offloading | Qiaoling Chen, Shenggui Li, Wei Gao, Peng Sun, Yonggang Wen, Tianwei Zhang | 2025-03-13 | Download | In recent years, Large Language Models (LLMs) have exhibited remarkable capabilities, driving advancements in real-world applications. However, training LLMs on increasingly long input sequences impos... |
| Collaborative Speculative Inference for Efficient LLM Inference Serving | Luyao Gao, Jianchun Liu, Hongli Xu, Xichong Zhang, Yunming Liao, Liusheng Huang | 2025-03-13 | Download | Speculative inference is a promising paradigm employing small speculative models (SSMs) as drafters to generate draft tokens, which are subsequently verified in parallel by the target large language m... |
| Message Size Matters: AlterBFT's Approach to Practical Synchronous BFT in Public Clouds | Nenad Milošević, Daniel Cason, Zarko Milošević, Robert Soulé, Fernando Pedone | 2025-03-13 | Download | Synchronous consensus protocols offer a significant advantage over their asynchronous and partially synchronous counterparts by providing higher fault tolerance -- an essential benefit in distributed ... |
| Samoyeds: Accelerating MoE Models with Structured Sparsity Leveraging Sparse Tensor Cores | Chenpeng Wu, Qiqi Gu, Heng Shi, Jianguo Yao, Haibing Guan | 2025-03-13 | Download | The escalating size of Mixture-of-Experts (MoE) based Large Language Models (LLMs) presents significant computational and memory challenges, necessitating innovative solutions to enhance efficiency wi... |
| Shaved Ice: Optimal Compute Resource Commitments for Dynamic Multi-Cloud Workloads | Murray Stokely, Neel Nadgir, Jack Peele, Orestis Kostakis | 2025-03-13 | Download | Cloud providers have introduced pricing models to incentivize long-term commitments of compute capacity. These long-term commitments allow the cloud providers to get guaranteed revenue for their inves... |
| Efficient Federated Fine-Tuning of Large Language Models with Layer Dropout | Shilong Wang, Jianchun Liu, Hongli Xu, Jiaming Yan, Xianjun Gao | 2025-03-13 | Download | Fine-tuning plays a crucial role in enabling pre-trained LLMs to evolve from general language comprehension to task-specific expertise. To preserve user data privacy, federated fine-tuning is often em... |
| Introducing MareNostrum5: A European pre-exascale energy-efficient system designed to serve a broad spectrum of scientific workloads | Fabio Banchelli, Marta Garcia-Gasulla, Filippo Mantovani, Joan Vinyals, Josep Pocurull, David Vicente, Beatriz Eguzkitza, Flavio C. C. Galeazzo, Mario C. Acosta, Sergi Girona | 2025-03-13 | Download | MareNostrum5 is a pre-exascale supercomputer at the Barcelona Supercomputing Center (BSC), part of the EuroHPC Joint Undertaking. With a peak performance of 314 petaflops, MareNostrum5 features a hybr... |
cs.NI - Networking and Internet Architecture
| Title | Authors | Published | Download | Abstract |
|---|---|---|---|---|
| Motor Rotation Speed Estimation based on Magnetic Inductive Sensing | Rahul Hoskeri, Hua Huang | 2025-03-13 | Download | Rotation speed is a key metric for many applications, such as calibrating electric motors in a factory, monitoring a car's engine health, detecting faults in electrical appliances, and more. |
cs.OS - Operating Systems
| Title | Authors | Published | Download | Abstract |
|---|---|---|---|---|
| Samoyeds: Accelerating MoE Models with Structured Sparsity Leveraging Sparse Tensor Cores | Chenpeng Wu, Qiqi Gu, Heng Shi, Jianguo Yao, Haibing Guan | 2025-03-13 | Download | The escalating size of Mixture-of-Experts (MoE) based Large Language Models (LLMs) presents significant computational and memory challenges, necessitating innovative solutions to enhance efficiency wi... |
cs.PF - Performance
| Title | Authors | Published | Download | Abstract |
|---|---|---|---|---|
| Design and Analysis of an Extreme-Scale, High-Performance, and Modular Agent-Based Simulation Platform | Lukas Johannes Breitwieser | 2025-03-13 | Download | Agent-based modeling is indispensable for studying complex systems across many domains. However, existing simulation platforms exhibit two major issues: performance and modularity. |
| Super-Linear Speedup by Generalizing Runtime Repeated Recursion Unfolding in Prolog | Thom Fruehwirth | 2025-03-13 | Download | Runtime repeated recursion unfolding was recently introduced as a just-in-time program transformation strategy that can achieve super-linear speedup. |
| Introducing MareNostrum5: A European pre-exascale energy-efficient system designed to serve a broad spectrum of scientific workloads | Fabio Banchelli, Marta Garcia-Gasulla, Filippo Mantovani, Joan Vinyals, Josep Pocurull, David Vicente, Beatriz Eguzkitza, Flavio C. C. Galeazzo, Mario C. Acosta, Sergi Girona | 2025-03-13 | Download | MareNostrum5 is a pre-exascale supercomputer at the Barcelona Supercomputing Center (BSC), part of the EuroHPC Joint Undertaking. With a peak performance of 314 petaflops, MareNostrum5 features a hybr... |