2025-10-16

cs.AR - Architecture

Title	Authors	Publication date	PDF	Abstract
From Loop Nests to Silicon: Mapping AI Workloads onto AMD NPUs with MLIR-AIR	Erwei Wang, Samuel Bayliss, Andra Bisca, Zachary Blair, Sangeeta Chowdhary, Kristof Denolf, Jeff Fifield, Brandon Freiberger, Erika Hunhoff, Phil James-Roxby, Jack Lo, Joseph Melber, Stephen Neuendorffer, Eddie Richter, Andre Rosti, Javier Setoain, Gagandeep Singh, Endri Taka, Pranathi Vasireddy, Zhewen Yu, Niansong Zhang, Jinming Zhuang	2025-10-16	Download	General-purpose compilers abstract away parallelism, locality, and synchronization, limiting their effectiveness on modern spatial architectures.
ColumnDisturb: Understanding Column-based Read Disturbance in Real DRAM Chips and Implications for Future Systems	İsmail Emir Yüksel, Ataberk Olgun, F. Nisa Bostancı, Haocong Luo, A. Giray Yağlıkçı, Onur Mutlu	2025-10-16	Download	We experimentally demonstrate a new widespread read disturbance phenomenon, ColumnDisturb, in real commodity DRAM chips. By repeatedly opening or keeping a DRAM row (aggressor row) open, we show that ...
Deadlock-free routing for Full-mesh networks without using Virtual Channels	Alejandro Cano, Cristóbal Camarero, Carmen Martínez, Ramón Beivide	2025-10-16	Download	High-radix, low-diameter networks like HyperX and Dragonfly use a Full-mesh core, and rely on multiple virtual channels (VCs) to avoid packet deadlocks in adaptive routing.
Tawa: Automatic Warp Specialization for Modern GPUs with Asynchronous References	Hongzheng Chen, Bin Fan, Alexander Collins, Bastian Hagedorn, Evghenii Gaburov, Masahiro Masuda, Matthew Brookhart, Chris Sullivan, Jason Knight, Zhiru Zhang, Vinod Grover	2025-10-16	Download	Modern GPUs feature specialized hardware units that enable high-performance, asynchronous dataflow execution. However, the conventional SIMT programming model is fundamentally misaligned with this tas...
MX+: Pushing the Limits of Microscaling Formats for Efficient Large Language Model Serving	Jungi Lee, Junyong Park, Soohyun Cha, Jaehoon Cho, Jaewoong Sim	2025-10-16	Download	Reduced-precision data formats are crucial for cost-effective serving of large language models (LLMs). While numerous reduced-precision formats have been introduced thus far, they often require intrus...
Low Power Vision Transformer Accelerator with Hardware-Aware Pruning and Optimized Dataflow	Ching-Lin Hsiung, Tian-Sheuan Chang	2025-10-16	Download	Current transformer accelerators primarily focus on optimizing self-attention due to its quadratic complexity. However, this focus is less relevant for vision transformers with short token lengths, wh...
Computing-In-Memory Aware Model Adaption For Edge Devices	Ming-Han Lin, Tian-Sheuan Chang	2025-10-16	Download	Computing-in-Memory (CIM) macros have gained popularity for deep learning acceleration due to their highly parallel computation and low power consumption.
Kelle: Co-design KV Caching and eDRAM for Efficient LLM Serving in Edge Computing	Tianhua Xia, Sai Qian Zhang	2025-10-16	Download	Running Large Language Models (LLMs) on edge devices is crucial for reducing latency, improving real-time processing, and enhancing privacy. By performing inference directly on the device, data does n...
Systolic Array Acceleration of Diagonal-Optimized Sparse-Sparse Matrix Multiplication for Efficient Quantum Simulation	Yuchao Su, Srikar Chundury, Jiajia Li, Frank Mueller	2025-10-16	Download	Hamiltonian simulation is a key workload in quantum computing, enabling the study of complex quantum systems and serving as a critical tool for classical verification of quantum devices.

cs.DC - Distributed, Parallel, and Cluster Computing

Title	Authors	Publication date	PDF	Abstract
An Elastic Job Scheduler for HPC Applications on the Cloud	Aditya Bhosale, Kavitha Chandrasekar, Laxmikant Kale, Sara Kokkila-Schumacher	2025-10-16	Download	The last few years have seen an increase in adoption of the cloud for running HPC applications. The pay-as-you-go cost model of these cloud resources has necessitated the development of specialized pr...
NEMO: Faster Parallel Execution for Highly Contended Blockchain Workloads (Full version)	François Ezard, Can Umut Ileri, Jérémie Decouchant	2025-10-16	Download	Following the design of more efficient blockchain consensus algorithms, the execution layer has emerged as the new performance bottleneck of blockchains, especially under high contention.
Targeted Attacks and Defenses for Distributed Federated Learning in Vehicular Networks	Utku Demir, Tugba Erpek, Yalin E. Sagduyu, Sastry Kompella, Mengran Xue	2025-10-16	Download	In emerging networked systems, mobile edge devices such as ground vehicles and unmanned aerial system (UAS) swarms collectively aggregate vast amounts of data to make machine learning decisions such a...
Hive Hash Table: A Warp-Cooperative, Dynamically Resizable Hash Table for GPUs	Md Sabbir Hossain Polak, David Troendle, Byunghyun Jang	2025-10-16	Download	Hash tables are essential building blocks in data-intensive applications, yet existing GPU implementations often struggle with concurrent updates, high load factors, and irregular memory access patter...
Multi-modal video data-pipelines for machine learning with minimal human supervision	Mihai-Cristian Pîrvu, Marius Leordeanu	2025-10-16	Download	The real-world is inherently multi-modal at its core. Our tools observe and take snapshots of it, in digital form, such as videos or sounds, however much of it is lost.
Balls and Bins and the Infinite Process with Random Deletions	Petra Berenbrink, Tom Friedetzky, Peter Kling, Lars Nagel	2025-10-16	Download	We consider an infinite balls-into-bins process with deletions where in each discrete step $t$ a coin is tossed as to whether, with probability $\beta(t) \in (0,1)$, a new ball is allocated using the Gree...
Deadlock-free routing for Full-mesh networks without using Virtual Channels	Alejandro Cano, Cristóbal Camarero, Carmen Martínez, Ramón Beivide	2025-10-16	Download	High-radix, low-diameter networks like HyperX and Dragonfly use a Full-mesh core, and rely on multiple virtual channels (VCs) to avoid packet deadlocks in adaptive routing.
xLLM Technical Report	Tongxuan Liu, Tao Peng, Peijun Yang, Xiaoyang Zhao, Xiusheng Lu, Weizhe Huang, Zirui Liu, Xiaoyu Chen, Zhiwei Liang, Jun Xiong, Donghe Jin, Minchao Zhang, Jinrong Guo, Yingxu Deng, Xu Zhang, Xianzhe Dong, Siqi Wang, Siyu Wu, Yu Wu, Zihan Tang, Yuting Zeng, Yanshu Wang, Jinguang Liu, Meng Kang, Menxin Li, Yunlong Wang, Yiming Liu, Xiaolong Ma, Yifan Wang, Yichen Zhang, Jinrun Yin, Keyang Zheng, Jiawei Yin, Jun Zhang, Ziyue Wang, Xiaobo Lin, Liangyu Liu, Liwei Lan, Yang Liu, Chunhua Peng, Han Liu, Songcheng Ren, Xuezhu Wang, Yunheng Shen, Yi Wang, Guyue Liu, Yitao Hu, Hui Chen, Tong Yang, Hailong Yang, Jing Li, Guiguang Ding, Ke Zhang	2025-10-16	Download	We introduce xLLM, an intelligent and efficient Large Language Model (LLM) inference framework designed for high-performance, large-scale enterprise-grade serving, with deep optimizations for diverse ...
The Bidding Games: Reinforcement Learning for MEV Extraction on Polygon Blockchain	Andrei Seoev, Leonid Gremyachikh, Anastasiia Smirnova, Yash Madhwal, Alisa Kalacheva, Dmitry Belousov, Ilia Zubov, Aleksei Smirnov, Denis Fedyanin, Vladimir Gorgadze, Yury Yanovich	2025-10-16	Download	In blockchain networks, the strategic ordering of transactions within blocks has emerged as a significant source of profit extraction, known as Maximal Extractable Value (MEV).
MPI-over-CXL: Enhancing Communication Efficiency in Distributed HPC Systems	Miryeong Kwon, Donghyun Gouk, Hyein Woo, Junhee Kim, Jinwoo Baek, Kyungkuk Nam, Sangyoon Ji, Jiseon Kim, Hanyeoreum Bae, Junhyeok Jang, Hyunwoo You, Junseok Moon, Myoungsoo Jung	2025-10-16	Download	MPI implementations commonly rely on explicit memory-copy operations, incurring overhead from redundant data movement and buffer management. This overhead notably impacts HPC workloads involving inten...
JASDA: Introducing Job-Aware Scheduling in Scheduler-Driven Job Atomization	Michal Konopa, Jan Fesl, Ladislav Beránek	2025-10-16	Download	The increasing complexity and temporal variability of workloads on MIG-enabled GPUs challenge the scalability of traditional centralized scheduling.
ScalePool: Hybrid XLink-CXL Fabric for Composable Resource Disaggregation in Unified Scale-up Domains	Hyein Woo, Miryeong Kwon, Jiseon Kim, Eunjee Na, Hanjin Choi, Seonghyeon Jang, Myoungsoo Jung	2025-10-16	Download	This paper proposes ScalePool, a novel cluster architecture designed to interconnect numerous accelerators using unified hardware interconnects rather than traditional long-distance networking.
FairBatching: Fairness-Aware Batch Formation for LLM Inference	Hongtao Lyu, Boyue Liu, Mingyu Wu, Haibo Chen	2025-10-16	Download	Large language model (LLM) inference systems face a fundamental tension between minimizing Time-to-First-Token (TTFT) latency for new requests and maintaining a high, steady token generation rate (low...
From Attention to Disaggregation: Tracing the Evolution of LLM Inference	Madabattula Rajesh Kumar, Srinivasa Rao Aravilli, Mustafa Saify, Shashank Srivastava	2025-10-16	Download	The evolution of Large Language Models from the Transformer architecture to models with trillions of parameters has shifted the primary bottleneck from model training to real time inference.
Incentive-Based Federated Learning: Architectural Elements and Future Directions	Chanuka A. S. Hewa Kaluannakkage, Rajkumar Buyya	2025-10-16	Download	Federated learning promises to revolutionize machine learning by enabling collaborative model training without compromising data privacy. However, practical adaptability can be limited by critical fac...
Proof-Carrying Fair Ordering: Asymmetric Verification for BFT via Incremental Graphs	Pengkun Ren, Hai Dong, Nasrin Sohrabi, Zahir Tari, Pengcheng Zhang	2025-10-16	Download	Byzantine Fault-Tolerant (BFT) consensus protocols ensure agreement on transaction ordering despite malicious actors, but unconstrained ordering power enables sophisticated value extraction attacks li...

cs.NI - Networking and Internet Architecture

Title	Authors	Publication date	PDF	Abstract
Targeted Attacks and Defenses for Distributed Federated Learning in Vehicular Networks	Utku Demir, Tugba Erpek, Yalin E. Sagduyu, Sastry Kompella, Mengran Xue	2025-10-16	Download	In emerging networked systems, mobile edge devices such as ground vehicles and unmanned aerial system (UAS) swarms collectively aggregate vast amounts of data to make machine learning decisions such a...
Decoherence-Aware Entangling and Swapping Strategy Optimization for Entanglement Routing in Quantum Networks	Shao-Min Huang, Cheng-Yang Cheng, Ming-Huang Chien, Jian-Jhih Kuo, Chih-Yu Wang	2025-10-16	Download	Quantum teleportation enables high-security communications through end-to-end quantum entangled pairs. End-to-end entangled pairs are created by using swapping processes to consume short entangled pai...
Intelligent Dynamic Handover via AI-assisted Signal Quality Prediction in 6G Multi-RAT Networks	Maria Lamprini A. Bartsioka, Anastasios Giannopoulos, Sotirios Spantideas	2025-10-16	Download	The emerging paradigm of 6G multiple Radio Access Technology (multi-RAT) networks, where cellular and Wireless Fidelity (WiFi) transmitters coexist, requires mobility decisions that remain reliable un...
Automated Extraction of Protocol State Machines from 3GPP Specifications with Domain-Informed Prompts and LLM Ensembles	Miao Zhang, Runhan Feng, Hongbo Tang, Yu Zhao, Jie Yang, Hang Qiu, Qi Liu	2025-10-16	Download	Mobile telecommunication networks are foundational to global infrastructure and increasingly support critical sectors such as manufacturing, transportation, and healthcare.
Energy-Latency Optimization for Dynamic 5G Mobile Radio Access Networks	Gabriela N. Caspa H., Carlos A. Astudillo, Nelson L. S. da Fonseca	2025-10-16	Download	In 5G networks, base station (BS) disaggregation and new services present challenges in radio access network (RAN) configuration, particularly in meeting their bandwidth and latency constraints.

cs.PF - Performance

Title	Authors	Publication date	PDF	Abstract
Stability and Heavy-traffic Delay Optimality of General Load Balancing Policies in Heterogeneous Service Systems	Yishun Luo, Martin Zubeldia	2025-10-16	Download	We consider a load balancing system consisting of $n$ single-server queues working in parallel, with heterogeneous service rates. Jobs arrive to a central dispatcher, which has to dispatch them to one...