
2025-11-13

cs.AR - Architecture

| Title | Authors | Published | PDF | Abstract |
| --- | --- | --- | --- | --- |
| Inside VOLT: Designing an Open-Source GPU Compiler | Shinnung Jeong, Chihyo Ahn, Huanzhi Pu, Jisheng Zhao, Hyesoon Kim, Blaise Tine | 2025-11-13 | Download | Recent efforts in open-source GPU research are opening new avenues in a domain that has long been tightly coupled with a few commercial vendors. |
| Tiny Chiplets Enabled by Packaging Scaling: Opportunities in ESD Protection and Signal Integrity | Emad Haque, Pragnya Sudershan Nalla, Jeff Zhang, Sachin S. Sapatnekar, Chaitali Chakrabarti, Yu Cao | 2025-11-13 | Download | The scaling of advanced packaging technologies provides abundant interconnection resources for 2.5D/3D heterogeneous integration (HI), thereby enabling the construction of larger-scale VLSI systems wi... |
| FengHuang: Next-Generation Memory Orchestration for AI Inferencing | Jiamin Li, Lei Qu, Tao Zhang, Grigory Chirkov, Shuotao Xu, Peng Cheng, Lidong Zhou | 2025-11-13 | Download | This document presents a vision for a novel AI infrastructure design that has been initially validated through inference simulations on state-of-the-art large language models. |
| Beamspace Equalization for mmWave Massive MIMO: Algorithms and VLSI Implementations | Seyed Hadi Mirfarshbafan, Christoph Studer | 2025-11-13 | Download | Massive multiuser multiple-input multiple-output (MIMO) and millimeter-wave (mmWave) communication are key physical layer technologies in future wireless systems. |
| Critical Path Aware Timing-Driven Global Placement for Large-Scale Heterogeneous FPGAs | He Jiang, Yi Guo, Shikai Guo, Huijiang Liu, Xiaochen Li, Ning Wang, Zhixiong Di | 2025-11-13 | Download | Timing optimization during global placement is critical for achieving optimal circuit performance and remains a key challenge in modern Field Programmable Gate Array (FPGA) design. |
| Combined power management and congestion control in High-Speed Ethernet-based Networks for Supercomputers and Data Centers | Miguel Sánchez de la Rosa, Francisco J. Andújar, Jesus Escudero-Sahuquillo, José L. Sánchez, Francisco J. Alfaro-Cortés | 2025-11-13 | Download | The demand for computing in our daily lives has led to the proliferation of data centers that power many indispensable services. On the other hand, computing has become essential for some research for v... |
| The Role of Advanced Computer Architectures in Accelerating Artificial Intelligence Workloads | Shahid Amin, Syed Pervez Hussnain Shah | 2025-11-13 | Download | The remarkable progress in Artificial Intelligence (AI) is foundationally linked to a concurrent revolution in computer architecture. As AI models, particularly Deep Neural Networks (DNNs), have grow... |
| AssertMiner: Module-Level Spec Generation and Assertion Mining using Static Analysis Guided LLMs | Hongqin Lyu, Yonghao Wang, Jiaxin Zhou, Zhiteng Chao, Tiancheng Wang, Huawei Li | 2025-11-13 | Download | Assertion-based verification (ABV) is a key approach to checking whether a logic design complies with its architectural specifications. Existing assertion generation methods based on design specificat... |
| Lit Silicon: A Case Where Thermal Imbalance Couples Concurrent Execution in Multiple GPUs | Marco Kurzynski, Shaizeen Aga, Di Wu | 2025-11-13 | Download | GPU systems are increasingly powering modern datacenters at scale. Despite being highly performant, GPU systems suffer from performance variation at the node and cluster levels. |

cs.DC - Distributed, Parallel, and Cluster Computing

| Title | Authors | Published | PDF | Abstract |
| --- | --- | --- | --- | --- |
| HPCAgentTester: A Multi-Agent LLM Approach for Enhanced HPC Unit Test Generation | Rabimba Karanjai, Lei Xu, Weidong Shi | 2025-11-13 | Download | Unit testing in High-Performance Computing (HPC) is critical but challenged by parallelism, complex algorithms, and diverse hardware. Traditional methods often fail to address non-deterministic behavi... |
| Inside VOLT: Designing an Open-Source GPU Compiler | Shinnung Jeong, Chihyo Ahn, Huanzhi Pu, Jisheng Zhao, Hyesoon Kim, Blaise Tine | 2025-11-13 | Download | Recent efforts in open-source GPU research are opening new avenues in a domain that has long been tightly coupled with a few commercial vendors. |
| EarthSight: A Distributed Framework for Low-Latency Satellite Intelligence | Ansel Kaplan Erol, Seungjun Lee, Divya Mahajan | 2025-11-13 | Download | Low-latency delivery of satellite imagery is essential for time-critical applications such as disaster response, intelligence, and infrastructure monitoring. |
| FengHuang: Next-Generation Memory Orchestration for AI Inferencing | Jiamin Li, Lei Qu, Tao Zhang, Grigory Chirkov, Shuotao Xu, Peng Cheng, Lidong Zhou | 2025-11-13 | Download | This document presents a vision for a novel AI infrastructure design that has been initially validated through inference simulations on state-of-the-art large language models. |
| STAGE: A Symbolic Tensor grAph GEnerator for distributed AI system co-design | Changhai Man, Joongun Park, Hanjiang Wu, Huan Xu, Srinivas Sridharan, Tushar Krishna | 2025-11-13 | Download | Optimizing the performance of large language models (LLMs) on large-scale AI training and inference systems requires a scalable and expressive mechanism to model distributed workload execution. |
| How Machine Learning-Data Driven Replication Strategies Enhance Fault Tolerance in Large-Scale Distributed Systems | Almond Kiruthu Murimi | 2025-11-13 | Download | This research paper investigates how machine learning-driven data replication strategies can enhance fault tolerance in large-scale distributed systems. |
| FastGraph: Optimized GPU-Enabled Algorithms for Fast Graph Building and Message Passing | Aarush Agarwal, Raymond He, Jan Kieseler, Matteo Cremonesi, Shah Rukh Qasim | 2025-11-13 | Download | We introduce FastGraph, a novel GPU-optimized k-nearest neighbor algorithm specifically designed to accelerate graph construction in low-dimensional spaces (2-10 dimensions), critical for high-perform... |
| Unlocking Dynamic Inter-Client Spatial Dependencies: A Federated Spatio-Temporal Graph Learning Method for Traffic Flow Forecasting | Feng Wang, Tianxiang Chen, Shuyue Wei, Qian Chu, Yi Zhang, Yifan Sun, Zhiming Zheng | 2025-11-13 | Download | Spatio-temporal graphs are powerful tools for modeling complex dependencies in traffic time series. However, the distributed nature of real-world traffic data across multiple stakeholders poses signif... |
| On The Performance of Prefix-Sum Parallel Kalman Filters and Smoothers on GPUs | Simo Särkkä, Ángel F. García-Fernández | 2025-11-13 | Download | This paper presents an experimental evaluation of parallel-in-time Kalman filters and smoothers using graphics processing units (GPUs). In particular, the paper evaluates different all-prefix-sum algo... |
| Massively Parallel Proof-Number Search for Impartial Games and Beyond | Tomáš Čížek, Martin Balko, Martin Schmid | 2025-11-13 | Download | Proof-Number Search is a best-first search algorithm with many successful applications, especially in game solving. As large-scale computing clusters become increasingly accessible, parallelization is... |
| Workload Schedulers -- Genesis, Algorithms and Differences | Leszek Sliwko, Vladimir Getov | 2025-11-13 | Download | This paper presents a novel approach to categorization of modern workload schedulers. We provide descriptions of three classes of schedulers: Operating Systems Process Schedulers, Cluster Systems Jobs... |
| Pk-IOTA: Blockchain empowered Programmable Data Plane to secure OPC UA communications in Industry 4.0 | Rinieri Lorenzo, Gori Giacomo, Melis Andrea, Girau Roberto, Prandini Marco, Callegati Franco | 2025-11-13 | Download | The OPC UA protocol is becoming the de facto standard for Industry 4.0 machine-to-machine communication. It stands out as one of the few industrial protocols that provide robust security features desi... |
| Selection of Supervised Learning-based Sparse Matrix Reordering Algorithms | Tao Tang, Youfu Jiang, Yingbo Cui, Jianbin Fang, Peng Zhang, Lin Peng, Chun Huang | 2025-11-13 | Download | Sparse matrix ordering is a vital optimization technique often employed for solving large-scale sparse matrices. Its goal is to minimize the matrix bandwidth by reorganizing its rows and columns, thus... |
| Noise-Aware Optimization in Nominally Identical Manufacturing and Measuring Systems for High-Throughput Parallel Workflows | Christina Schenk, Miguel Hernández-del-Valle, Luis Calero-Lumbreras, Marcus Noack, Maciej Haranczyk | 2025-11-13 | Download | Device-to-device variability in experimental noise critically impacts reproducibility, especially in automated, high-throughput systems like additive manufacturing farms. |
| Dynamic Edge Server Selection in Time-Varying Environments: A Reliability-Aware Predictive Approach | Jaime Sebastian Burbano, Arnova Abdullah, Eldiyar Zhantileuov, Mohan Liyanage, Rolf Schuster | 2025-11-13 | Download | Latency-sensitive embedded applications increasingly rely on edge computing, yet dynamic network congestion in multi-server architectures challenges proper edge server selection. |
| dHPR: A Distributed Halpern Peaceman–Rachford Method for Non-smooth Distributed Optimization Problems | Zhangcheng Feng, Defeng Sun, Yancheng Yuan, Guojun Zhang | 2025-11-13 | Download | This paper introduces the distributed Halpern Peaceman–Rachford (dHPR) method, an efficient algorithm for solving distributed convex composite optimization problems with non-smooth objectives, which ... |
| Speculative Decoding in Decentralized LLM Inference: Turning Communication Latency into Computation Throughput | Jingwei Song, Wanyi Chen, Xinyuan Song, Max, Chris Tong, Gufeng Chen, Tianyi Zhao, Eric Yang, Bill Shi, Lynn Ai | 2025-11-13 | Download | Speculative decoding accelerates large language model (LLM) inference by using a lightweight draft model to propose tokens that are later verified by a stronger target model. |
| Harli: SLO-Aware Co-location of LLM Inference and PEFT-based Finetuning on Model-as-a-Service Platforms | Ao Xu, Han Zhao, Weihao Cui, Quan Chen, Yukang Chen, Shulai Zhang, Shuang Chen, Jiemin Jiang, Zhibin Yu, Minyi Guo | 2025-11-13 | Download | Large language models (LLMs) are increasingly deployed under the Model-as-a-Service (MaaS) paradigm. To meet stringent quality-of-service (QoS) requirements, existing LLM serving systems disaggregate ... |
| Optimizing CPU Cache Utilization in Cloud VMs with Accurate Cache Abstraction | Mani Tofigh, Edward Guo, Weiwei Jia, Xiaoning Ding, Zirui Neil Zhao, Jianchen Shan | 2025-11-13 | Download | This paper shows that cache-based optimizations are often ineffective in cloud virtual machines (VMs) due to limited visibility into and control over provisioned caches. |
| Lit Silicon: A Case Where Thermal Imbalance Couples Concurrent Execution in Multiple GPUs | Marco Kurzynski, Shaizeen Aga, Di Wu | 2025-11-13 | Download | GPU systems are increasingly powering modern datacenters at scale. Despite being highly performant, GPU systems suffer from performance variation at the node and cluster levels. |
| MoFa: A Unified Performance Modeling Framework for LLM Pretraining | Lu Zhao, Rong Shi, Shaoqing Zhang, Shangchao Su, Ziqing Yin, Zhiyan Cui, Hongfeng Sun, Baoguo He, Yueqiang Chen, Liang Dong, Xiyuan Li, Lingbin Wang, Lijun Ma, Qiang Huang, Ting Liu, Chong Wang, Can Wei | 2025-11-13 | Download | The exponential growth in LLM scales, with parameters soaring from billions to trillions, has necessitated distributed pretraining across large clusters comprising thousands to tens of thousands of de... |
| A Meta-Heuristic Load Balancer for Cloud Computing Systems | Leszek Sliwko, Vladimir Getov | 2025-11-13 | Download | This paper presents a strategy to allocate services on a Cloud system without overloading nodes and maintaining the system stability with minimum cost. |
| SMoFi: Step-wise Momentum Fusion for Split Federated Learning on Heterogeneous Data | Mingkun Yang, Ran Zhu, Qing Wang, Jie Yang | 2025-11-13 | Download | Split Federated Learning is a system-efficient federated learning paradigm that leverages the rich computing resources at a central server to train model partitions. |

cs.NI - Networking and Internet Architecture

| Title | Authors | Published | PDF | Abstract |
| --- | --- | --- | --- | --- |
| LM4Opt-RA: A Multi-Candidate LLM Framework with Structured Ranking for Automating Network Resource Allocation | Tasnim Ahmed, Siana Rizwan, Naveed Ejaz, Salimur Choudhury | 2025-11-13 | Download | Building on advancements in Large Language Models (LLMs), we can tackle complex analytical and mathematical reasoning tasks requiring nuanced contextual understanding. |
| Millimeter-Wave UAV Channel Model with Height-Dependent Path Loss and Shadowing in Urban Scenarios | Abdul Saboor, Evgenii Vinogradov | 2025-11-13 | Download | Uncrewed Aerial Vehicles (UAVs) serving as Aerial Base Stations (ABSs) are expected to extend 6G millimeter-Wave (mmWave) coverage and improve link reliability in urban areas. |
| Towards an Agentic Workflow for Internet Measurement Research | Alagappan Ramanathan, Eunju Kang, Dongsu Han, Sangeetha Abdu Jyothi | 2025-11-13 | Download | Internet measurement research faces an accessibility crisis: complex analyses require custom integration of multiple specialized tools that demands specialized domain expertise. |
| Causal Model-Based Reinforcement Learning for Sample-Efficient IoT Channel Access | Aswin Arun, Christo Kurisummoottil Thomas, Rimalpudi Sarvendranath, Walid Saad | 2025-11-13 | Download | Despite the advantages of multi-agent reinforcement learning (MARL) for wireless use cases such as medium access control (MAC), their real-world deployment in Internet of Things (IoT) is hindered by th... |
| P4-TAS: P4-Based Time-Aware Shaper for Time-Sensitive Networking | Fabian Ihle, Moritz Flüchter, Michael Menth | 2025-11-13 | Download | Time-Sensitive Networking (TSN) is a set of IEEE standards that extends Ethernet with real-time capabilities. Among its mechanisms, TSN can coordinate transmission times network-wide to minimize queue... |
| Pk-IOTA: Blockchain empowered Programmable Data Plane to secure OPC UA communications in Industry 4.0 | Rinieri Lorenzo, Gori Giacomo, Melis Andrea, Girau Roberto, Prandini Marco, Callegati Franco | 2025-11-13 | Download | The OPC UA protocol is becoming the de facto standard for Industry 4.0 machine-to-machine communication. It stands out as one of the few industrial protocols that provide robust security features desi... |
| Dynamic Edge Server Selection in Time-Varying Environments: A Reliability-Aware Predictive Approach | Jaime Sebastian Burbano, Arnova Abdullah, Eldiyar Zhantileuov, Mohan Liyanage, Rolf Schuster | 2025-11-13 | Download | Latency-sensitive embedded applications increasingly rely on edge computing, yet dynamic network congestion in multi-server architectures challenges proper edge server selection. |
| Learning-Based Channel Access in Wi-Fi: A Multi-Armed Bandit Approach | Miguel Casasnovas, Francesc Wilhelmi, Richard Combes, Maksymilian Wojnar, Katarzyna Kosek-Szott, Szymon Szott, Anders Jonsson, Luis Esteve, Boris Bellalta | 2025-11-13 | Download | Due to its static protocol design, IEEE 802.11 (aka Wi-Fi) channel access lacks adaptability to address dynamic network conditions, resulting in inefficient spectrum utilization, unnecessary contentio... |
| See and Beam: Leveraging LiDAR Sensing and Specular Surfaces for Indoor mmWave Connectivity | Raj Sai Sohel Bandari, Amod Ashtekar, Omar Ibrahim, Mohammed E. Eltayeb | 2025-11-13 | Download | Millimeter-wave (mmWave) communication enables multi-gigabit-per-second data rates but is highly susceptible to path loss and blockage, especially indoors. |

cs.OS - Operating Systems

| Title | Authors | Published | PDF | Abstract |
| --- | --- | --- | --- | --- |
| Vmem: A Lightweight Hot-Upgradable Memory Management for In-production Cloud Environment | Hao Zheng, Qiang Wang, Longxiang Wang, Xishi Qiu, Yibin Shen, Xiaoshe Dong, Naixuan Guan, Jia Wei, Fudong Qiu, Xingjun Zhang, Yun Xu, Mao Zhao, Yisheng Xie, Shenglong Zhao, Min He, Yu Li, Xiao Zheng, Ben Luo, Jiesheng Wu | 2025-11-13 | Download | Traditional memory management suffers from metadata overhead, architectural complexity, and stability degradation, problems intensified in cloud environments. |
| Optimizing CPU Cache Utilization in Cloud VMs with Accurate Cache Abstraction | Mani Tofigh, Edward Guo, Weiwei Jia, Xiaoning Ding, Zirui Neil Zhao, Jianchen Shan | 2025-11-13 | Download | This paper shows that cache-based optimizations are often ineffective in cloud virtual machines (VMs) due to limited visibility into and control over provisioned caches. |
| Taiji: A DPU Memory Elasticity Solution for In-production Cloud Environments | Hao Zheng, Longxiang Wang, Yun Xu, Qiang Wang, Yibin Shen, Xiaoshe Dong, Bang Di, Jia Wei, Shenyu Dong, Xingjun Zhang, Weichen Chen, Zhao Han, Sanqian Zhao, Dongdong Huang, Jie Qi, Yifan Yang, Zhao Gao, Yi Wang, Jinhu Li, Xudong Ren, Min He, Hang Yang, Xiao Zheng, Haijiao Hao, Jiesheng Wu | 2025-11-13 | Download | The growth of cloud computing drives data centers toward higher density and efficiency. Data processing units (DPUs) enhance server network and storage performance but face challenges such as long har... |

cs.PF - Performance

| Title | Authors | Published | PDF | Abstract |
| --- | --- | --- | --- | --- |
| The Configuration Wall: Characterization and Elimination of Accelerator Configuration Overhead | Josse Van Delm, Anton Lydike, Joren Dumoulin, Jonas Crols, Xiaoling Yi, Ryan Antonio, Jackson Woodruff, Tobias Grosser, Marian Verhelst | 2025-11-13 | Download | Contemporary compute platforms increasingly offload compute kernels from CPU to integrated hardware accelerators to reach maximum performance per Watt. |
| EDGC: Entropy-driven Dynamic Gradient Compression for Efficient LLM Training | Qingao Yi, Jiaang Duan, Hanwen Hu, Qin Hua, Haiyan Zhao, Shiyou Qian, Dingyu Yang, Jian Cao, Jinghua Tang, Yinghao Yu, Chenzhi Liao, Kangjin Wang, Liping Zhang | 2025-11-13 | Download | Training large language models (LLMs) poses significant challenges regarding computational resources and memory capacity. Although distributed training techniques help mitigate these issues, they stil... |
| Optimizing CPU Cache Utilization in Cloud VMs with Accurate Cache Abstraction | Mani Tofigh, Edward Guo, Weiwei Jia, Xiaoning Ding, Zirui Neil Zhao, Jianchen Shan | 2025-11-13 | Download | This paper shows that cache-based optimizations are often ineffective in cloud virtual machines (VMs) due to limited visibility into and control over provisioned caches. |
| Steering Pretrained Drafters during Speculative Decoding | Frédéric Berdoz, Peer Rheinboldt, Roger Wattenhofer | 2025-11-13 | Download | Speculative decoding accelerates language model inference by separating generation into fast drafting and parallel verification. Its main limitation is drafter-verifier misalignment, which limits toke... |
