2025-10-16

cs.AR - Architecture

| Title | Authors | Published | PDF | Abstract |
| --- | --- | --- | --- | --- |
| From Loop Nests to Silicon: Mapping AI Workloads onto AMD NPUs with MLIR-AIR | Erwei Wang, Samuel Bayliss, Andra Bisca, Zachary Blair, Sangeeta Chowdhary, Kristof Denolf, Jeff Fifield, Brandon Freiberger, Erika Hunhoff, Phil James-Roxby, Jack Lo, Joseph Melber, Stephen Neuendorffer, Eddie Richter, Andre Rosti, Javier Setoain, Gagandeep Singh, Endri Taka, Pranathi Vasireddy, Zhewen Yu, Niansong Zhang, Jinming Zhuang | 2025-10-16 | Download | General-purpose compilers abstract away parallelism, locality, and synchronization, limiting their effectiveness on modern spatial architectures. |
| ColumnDisturb: Understanding Column-based Read Disturbance in Real DRAM Chips and Implications for Future Systems | İsmail Emir Yüksel, Ataberk Olgun, F. Nisa Bostancı, Haocong Luo, A. Giray Yağlıkçı, Onur Mutlu | 2025-10-16 | Download | We experimentally demonstrate a new widespread read disturbance phenomenon, ColumnDisturb, in real commodity DRAM chips. By repeatedly opening or keeping a DRAM row (aggressor row) open, we show that ... |
| Deadlock-free routing for Full-mesh networks without using Virtual Channels | Alejandro Cano, Cristóbal Camarero, Carmen Martínez, Ramón Beivide | 2025-10-16 | Download | High-radix, low-diameter networks like HyperX and Dragonfly use a Full-mesh core, and rely on multiple virtual channels (VCs) to avoid packet deadlocks in adaptive routing. |
| Tawa: Automatic Warp Specialization for Modern GPUs with Asynchronous References | Hongzheng Chen, Bin Fan, Alexander Collins, Bastian Hagedorn, Evghenii Gaburov, Masahiro Masuda, Matthew Brookhart, Chris Sullivan, Jason Knight, Zhiru Zhang, Vinod Grover | 2025-10-16 | Download | Modern GPUs feature specialized hardware units that enable high-performance, asynchronous dataflow execution. However, the conventional SIMT programming model is fundamentally misaligned with this tas... |
| MX+: Pushing the Limits of Microscaling Formats for Efficient Large Language Model Serving | Jungi Lee, Junyong Park, Soohyun Cha, Jaehoon Cho, Jaewoong Sim | 2025-10-16 | Download | Reduced-precision data formats are crucial for cost-effective serving of large language models (LLMs). While numerous reduced-precision formats have been introduced thus far, they often require intrus... |
| Low Power Vision Transformer Accelerator with Hardware-Aware Pruning and Optimized Dataflow | Ching-Lin Hsiung, Tian-Sheuan Chang | 2025-10-16 | Download | Current transformer accelerators primarily focus on optimizing self-attention due to its quadratic complexity. However, this focus is less relevant for vision transformers with short token lengths, wh... |
| Computing-In-Memory Aware Model Adaption For Edge Devices | Ming-Han Lin, Tian-Sheuan Chang | 2025-10-16 | Download | Computing-in-Memory (CIM) macros have gained popularity for deep learning acceleration due to their highly parallel computation and low power consumption. |
| Kelle: Co-design KV Caching and eDRAM for Efficient LLM Serving in Edge Computing | Tianhua Xia, Sai Qian Zhang | 2025-10-16 | Download | Running Large Language Models (LLMs) on edge devices is crucial for reducing latency, improving real-time processing, and enhancing privacy. By performing inference directly on the device, data does n... |
| Systolic Array Acceleration of Diagonal-Optimized Sparse-Sparse Matrix Multiplication for Efficient Quantum Simulation | Yuchao Su, Srikar Chundury, Jiajia Li, Frank Mueller | 2025-10-16 | Download | Hamiltonian simulation is a key workload in quantum computing, enabling the study of complex quantum systems and serving as a critical tool for classical verification of quantum devices. |

cs.DC - Distributed, Parallel, and Cluster Computing

| Title | Authors | Published | PDF | Abstract |
| --- | --- | --- | --- | --- |
| An Elastic Job Scheduler for HPC Applications on the Cloud | Aditya Bhosale, Kavitha Chandrasekar, Laxmikant Kale, Sara Kokkila-Schumacher | 2025-10-16 | Download | The last few years have seen an increase in adoption of the cloud for running HPC applications. The pay-as-you-go cost model of these cloud resources has necessitated the development of specialized pr... |
| NEMO: Faster Parallel Execution for Highly Contended Blockchain Workloads (Full version) | François Ezard, Can Umut Ileri, Jérémie Decouchant | 2025-10-16 | Download | Following the design of more efficient blockchain consensus algorithms, the execution layer has emerged as the new performance bottleneck of blockchains, especially under high contention. |
| Targeted Attacks and Defenses for Distributed Federated Learning in Vehicular Networks | Utku Demir, Tugba Erpek, Yalin E. Sagduyu, Sastry Kompella, Mengran Xue | 2025-10-16 | Download | In emerging networked systems, mobile edge devices such as ground vehicles and unmanned aerial system (UAS) swarms collectively aggregate vast amounts of data to make machine learning decisions such a... |
| Hive Hash Table: A Warp-Cooperative, Dynamically Resizable Hash Table for GPUs | Md Sabbir Hossain Polak, David Troendle, Byunghyun Jang | 2025-10-16 | Download | Hash tables are essential building blocks in data-intensive applications, yet existing GPU implementations often struggle with concurrent updates, high load factors, and irregular memory access patter... |
| Multi-modal video data-pipelines for machine learning with minimal human supervision | Mihai-Cristian Pîrvu, Marius Leordeanu | 2025-10-16 | Download | The real-world is inherently multi-modal at its core. Our tools observe and take snapshots of it, in digital form, such as videos or sounds, however much of it is lost. |
| Balls and Bins and the Infinite Process with Random Deletions | Petra Berenbrink, Tom Friedetzky, Peter Kling, Lars Nagel | 2025-10-16 | Download | We consider an infinite balls-into-bins process with deletions where in each discrete step $t$ a coin is tossed as to whether, with probability $\beta(t) \in (0,1)$, a new ball is allocated using the Gree... |
| Deadlock-free routing for Full-mesh networks without using Virtual Channels | Alejandro Cano, Cristóbal Camarero, Carmen Martínez, Ramón Beivide | 2025-10-16 | Download | High-radix, low-diameter networks like HyperX and Dragonfly use a Full-mesh core, and rely on multiple virtual channels (VCs) to avoid packet deadlocks in adaptive routing. |
| xLLM Technical Report | Tongxuan Liu, Tao Peng, Peijun Yang, Xiaoyang Zhao, Xiusheng Lu, Weizhe Huang, Zirui Liu, Xiaoyu Chen, Zhiwei Liang, Jun Xiong, Donghe Jin, Minchao Zhang, Jinrong Guo, Yingxu Deng, Xu Zhang, Xianzhe Dong, Siqi Wang, Siyu Wu, Yu Wu, Zihan Tang, Yuting Zeng, Yanshu Wang, Jinguang Liu, Meng Kang, Menxin Li, Yunlong Wang, Yiming Liu, Xiaolong Ma, Yifan Wang, Yichen Zhang, Jinrun Yin, Keyang Zheng, Jiawei Yin, Jun Zhang, Ziyue Wang, Xiaobo Lin, Liangyu Liu, Liwei Lan, Yang Liu, Chunhua Peng, Han Liu, Songcheng Ren, Xuezhu Wang, Yunheng Shen, Yi Wang, Guyue Liu, Yitao Hu, Hui Chen, Tong Yang, Hailong Yang, Jing Li, Guiguang Ding, Ke Zhang | 2025-10-16 | Download | We introduce xLLM, an intelligent and efficient Large Language Model (LLM) inference framework designed for high-performance, large-scale enterprise-grade serving, with deep optimizations for diverse ... |
| The Bidding Games: Reinforcement Learning for MEV Extraction on Polygon Blockchain | Andrei Seoev, Leonid Gremyachikh, Anastasiia Smirnova, Yash Madhwal, Alisa Kalacheva, Dmitry Belousov, Ilia Zubov, Aleksei Smirnov, Denis Fedyanin, Vladimir Gorgadze, Yury Yanovich | 2025-10-16 | Download | In blockchain networks, the strategic ordering of transactions within blocks has emerged as a significant source of profit extraction, known as Maximal Extractable Value (MEV). |
| MPI-over-CXL: Enhancing Communication Efficiency in Distributed HPC Systems | Miryeong Kwon, Donghyun Gouk, Hyein Woo, Junhee Kim, Jinwoo Baek, Kyungkuk Nam, Sangyoon Ji, Jiseon Kim, Hanyeoreum Bae, Junhyeok Jang, Hyunwoo You, Junseok Moon, Myoungsoo Jung | 2025-10-16 | Download | MPI implementations commonly rely on explicit memory-copy operations, incurring overhead from redundant data movement and buffer management. This overhead notably impacts HPC workloads involving inten... |
| JASDA: Introducing Job-Aware Scheduling in Scheduler-Driven Job Atomization | Michal Konopa, Jan Fesl, Ladislav Beránek | 2025-10-16 | Download | The increasing complexity and temporal variability of workloads on MIG-enabled GPUs challenge the scalability of traditional centralized scheduling. |
| ScalePool: Hybrid XLink-CXL Fabric for Composable Resource Disaggregation in Unified Scale-up Domains | Hyein Woo, Miryeong Kwon, Jiseon Kim, Eunjee Na, Hanjin Choi, Seonghyeon Jang, Myoungsoo Jung | 2025-10-16 | Download | This paper proposes ScalePool, a novel cluster architecture designed to interconnect numerous accelerators using unified hardware interconnects rather than traditional long-distance networking. |
| FairBatching: Fairness-Aware Batch Formation for LLM Inference | Hongtao Lyu, Boyue Liu, Mingyu Wu, Haibo Chen | 2025-10-16 | Download | Large language model (LLM) inference systems face a fundamental tension between minimizing Time-to-First-Token (TTFT) latency for new requests and maintaining a high, steady token generation rate (low... |
| From Attention to Disaggregation: Tracing the Evolution of LLM Inference | Madabattula Rajesh Kumar, Srinivasa Rao Aravilli, Mustafa Saify, Shashank Srivastava | 2025-10-16 | Download | The evolution of Large Language Models from the Transformer architecture to models with trillions of parameters has shifted the primary bottleneck from model training to real time inference. |
| Incentive-Based Federated Learning: Architectural Elements and Future Directions | Chanuka A. S. Hewa Kaluannakkage, Rajkumar Buyya | 2025-10-16 | Download | Federated learning promises to revolutionize machine learning by enabling collaborative model training without compromising data privacy. However, practical adaptability can be limited by critical fac... |
| Proof-Carrying Fair Ordering: Asymmetric Verification for BFT via Incremental Graphs | Pengkun Ren, Hai Dong, Nasrin Sohrabi, Zahir Tari, Pengcheng Zhang | 2025-10-16 | Download | Byzantine Fault-Tolerant (BFT) consensus protocols ensure agreement on transaction ordering despite malicious actors, but unconstrained ordering power enables sophisticated value extraction attacks li... |

cs.NI - Networking and Internet Architecture

| Title | Authors | Published | PDF | Abstract |
| --- | --- | --- | --- | --- |
| Targeted Attacks and Defenses for Distributed Federated Learning in Vehicular Networks | Utku Demir, Tugba Erpek, Yalin E. Sagduyu, Sastry Kompella, Mengran Xue | 2025-10-16 | Download | In emerging networked systems, mobile edge devices such as ground vehicles and unmanned aerial system (UAS) swarms collectively aggregate vast amounts of data to make machine learning decisions such a... |
| Decoherence-Aware Entangling and Swapping Strategy Optimization for Entanglement Routing in Quantum Networks | Shao-Min Huang, Cheng-Yang Cheng, Ming-Huang Chien, Jian-Jhih Kuo, Chih-Yu Wang | 2025-10-16 | Download | Quantum teleportation enables high-security communications through end-to-end quantum entangled pairs. End-to-end entangled pairs are created by using swapping processes to consume short entangled pai... |
| Intelligent Dynamic Handover via AI-assisted Signal Quality Prediction in 6G Multi-RAT Networks | Maria Lamprini A. Bartsioka, Anastasios Giannopoulos, Sotirios Spantideas | 2025-10-16 | Download | The emerging paradigm of 6G multiple Radio Access Technology (multi-RAT) networks, where cellular and Wireless Fidelity (WiFi) transmitters coexist, requires mobility decisions that remain reliable un... |
| Automated Extraction of Protocol State Machines from 3GPP Specifications with Domain-Informed Prompts and LLM Ensembles | Miao Zhang, Runhan Feng, Hongbo Tang, Yu Zhao, Jie Yang, Hang Qiu, Qi Liu | 2025-10-16 | Download | Mobile telecommunication networks are foundational to global infrastructure and increasingly support critical sectors such as manufacturing, transportation, and healthcare. |
| Energy-Latency Optimization for Dynamic 5G Mobile Radio Access Networks | Gabriela N. Caspa H., Carlos A. Astudillo, Nelson L. S. da Fonseca | 2025-10-16 | Download | In 5G networks, base station (BS) disaggregation and new services present challenges in radio access network (RAN) configuration, particularly in meeting their bandwidth and latency constraints. |

cs.PF - Performance

| Title | Authors | Published | PDF | Abstract |
| --- | --- | --- | --- | --- |
| Stability and Heavy-traffic Delay Optimality of General Load Balancing Policies in Heterogeneous Service Systems | Yishun Luo, Martin Zubeldia | 2025-10-16 | Download | We consider a load balancing system consisting of $n$ single-server queues working in parallel, with heterogeneous service rates. Jobs arrive to a central dispatcher, which has to dispatch them to one... |
