2025-10-23

cs.AR - Architecture

Title	Authors	Date	PDF	Abstract
FIFOAdvisor: A DSE Framework for Automated FIFO Sizing of High-Level Synthesis Designs	Stefan Abi-Karam, Rishov Sarkar, Suhail Basalama, Jason Cong, Callie Hao	2025-10-23	Download	Dataflow hardware designs enable efficient FPGA implementations via high-level synthesis (HLS), but correctly sizing first-in-first-out (FIFO) channel buffers remains challenging.
Lincoln AI Computing Survey (LAICS) and Trends	Albert Reuther, Peter Michaleas, Michael Jones, Vijay Gadepally, Jeremy Kepner	2025-10-23	Download	In the past year, generative AI (GenAI) models have received a tremendous amount of attention, which in turn has increased attention to computing systems for training and inference for GenAI.
Hardware-Aware DNN Compression for Homogeneous Edge Devices	Kunlong Zhang, Guiying Li, Ning Lu, Peng Yang, Ke Tang	2025-10-23	Download	Deploying deep neural networks (DNNs) across homogeneous edge devices (the devices with the same SKU labeled by the manufacturer) often assumes identical performance among them.
Squire: A General-Purpose Accelerator to Exploit Fine-Grain Parallelism on Dependency-Bound Kernels	Rubén Langarita, Jesús Alastruey-Benedé, Pablo Ibáñez-Marín, Santiago Marco-Sola, Miquel Moretó, Adrià Armejach	2025-10-23	Download	Multiple HPC applications are often bottlenecked by compute-intensive kernels implementing complex dependency patterns (data-dependency bound).
In-DRAM True Random Number Generation Using Simultaneous Multiple-Row Activation: An Experimental Study of Real DRAM Chips	Ismail Emir Yuksel, Ataberk Olgun, F. Nisa Bostanci, Oguzhan Canpolat, Geraldo F. Oliveira, Mohammad Sadrosadati, Abdullah Giray Yaglikci, Onur Mutlu	2025-10-23	Download	In this work, we experimentally demonstrate that it is possible to generate true random numbers at high throughput and low latency in commercial off-the-shelf (COTS) DRAM chips by leveraging simultane...
HALOC-AxA: An Area/-Energy-Efficient Approximate Adder for Image Processing Application	Hasnain A. Ziad, Ashiq A. Sakib	2025-10-23	Download	The design of approximate adders has been widely researched to advance energy-efficient hardware for computation-intensive multimedia applications, such as image, audio, or video processing.

cs.DC - Distributed, Parallel, and Cluster Computing

Title	Authors	Date	PDF	Abstract
xMem: A CPU-Based Approach for Accurate Estimation of GPU Memory in Deep Learning Training Workloads	Jiabo Shi, Dimitrios Pezaros, Yehia Elkhatib	2025-10-23	Download	The global scarcity of GPUs necessitates more sophisticated strategies for Deep Learning jobs in shared cluster environments. Accurate estimation of how much GPU memory a job will require is fundament...
CudaForge: An Agent Framework with Hardware Feedback for CUDA Kernel Optimization	Zijian Zhang, Rong Wang, Shiyang Li, Yuebo Luo, Mingyi Hong, Caiwen Ding	2025-10-23	Download	Developing efficient CUDA kernels is increasingly critical for AI applications such as large-scale LLM training. However, manual kernel design is both costly and time-consuming, motivating automatic a...
JSTprove: Pioneering Verifiable AI for a Trustless Future	Jonathan Gold, Tristan Freiberg, Haruna Isah, Shirin Shahabi	2025-10-23	Download	The integration of machine learning (ML) systems into critical industries such as healthcare, finance, and cybersecurity has transformed decision-making processes, but it also brings new challenges ar...
Lincoln AI Computing Survey (LAICS) and Trends	Albert Reuther, Peter Michaleas, Michael Jones, Vijay Gadepally, Jeremy Kepner	2025-10-23	Download	In the past year, generative AI (GenAI) models have received a tremendous amount of attention, which in turn has increased attention to computing systems for training and inference for GenAI.
Decentralized Exchange that Mitigate a Bribery Attack	Nitin Awathare	2025-10-23	Download	Despite the popularity of Hashed Time-Locked Contracts (HTLCs) because of their use in wide areas of applications such as payment channels, atomic swaps, etc., their use in exchange is still questionab...
Morpheus: Lightweight RTT Prediction for Performance-Aware Load Balancing	Panagiotis Giannakopoulos, Bart van Knippenberg, Kishor Chandra Joshi, Nicola Calabretta, George Exarchakos	2025-10-23	Download	Distributed applications increasingly demand low end-to-end latency, especially in edge and cloud environments where co-located workloads contend for limited resources.
GPU-Accelerated Primal Heuristics for Mixed Integer Programming	Akif Çördük, Piotr Sielski, Alice Boucher, Kumar Aatish	2025-10-23	Download	We introduce a fusion of GPU accelerated primal heuristics for Mixed Integer Programming. Leveraging GPU acceleration enables exploration of larger search regions and faster iterations.
Accurate Performance Predictors for Edge Computing Applications	Panagiotis Giannakopoulos, Bart van Knippenberg, Kishor Chandra Joshi, Nicola Calabretta, George Exarchakos	2025-10-23	Download	Accurate prediction of application performance is critical for enabling effective scheduling and resource management in resource-constrained dynamic edge environments.
Symmetry in Software Platforms as an Architectural Principle	Bjorn Remseth	2025-10-23	Download	Software platforms often act as structure preserving systems. They provide consistent interfaces and behaviors that remain stable under specific transformations that we denote as symmetries.
FLAS: a combination of proactive and reactive auto-scaling architecture for distributed services	Víctor Rampérez, Javier Soriano, David Lizcano, Juan A. Lara	2025-10-23	Download	Cloud computing has established itself as the support for the vast majority of emerging technologies, mainly due to the characteristic of elasticity it offers.
In-DRAM True Random Number Generation Using Simultaneous Multiple-Row Activation: An Experimental Study of Real DRAM Chips	Ismail Emir Yuksel, Ataberk Olgun, F. Nisa Bostanci, Oguzhan Canpolat, Geraldo F. Oliveira, Mohammad Sadrosadati, Abdullah Giray Yaglikci, Onur Mutlu	2025-10-23	Download	In this work, we experimentally demonstrate that it is possible to generate true random numbers at high throughput and low latency in commercial off-the-shelf (COTS) DRAM chips by leveraging simultane...
HHEML: Hybrid Homomorphic Encryption for Privacy-Preserving Machine Learning on Edge	Yu Hin Chan, Hao Yang, Shiyu Shen, Xingyu Fan, Shengzhe Lyu, Patrick S. Y. Hung, Ray C. C. Cheung	2025-10-23	Download	Privacy-preserving machine learning (PPML) is an emerging topic to handle secure machine learning inference over sensitive data in untrusted environments.
HGraphScale: Hierarchical Graph Learning for Autoscaling Microservice Applications in Container-based Cloud Computing	Zhengxin Fang, Hui Ma, Gang Chen, Rajkumar Buyya	2025-10-23	Download	Microservice architecture has become a dominant paradigm in application development due to its advantages of being lightweight, flexible, and resilient.
Collective Communication for 100k+ GPUs	Min Si, Pavan Balaji, Yongzhou Chen, Ching-Hsiang Chu, Adi Gangidi, Saif Hasan, Subodh Iyengar, Dan Johnson, Bingzhe Liu, Regina Ren, Deep Shah, Ashmitha Jeevaraj Shetty, Greg Steinbrecher, Yulun Wang, Bruce Wu, Xinfeng Xie, Jingyi Yang, Mingran Yang, Kenny Yu, Minlan Yu, Cen Zhao, Wes Bland, Denis Boyda, Suman Gumudavelli, Prashanth Kannan, Cristian Lumezanu, Rui Miao, Zhe Qu, Venkat Ramesh, Maxim Samoylov, Jan Seidel, Srikanth Sundaresan, Feng Tian, Qiye Tan, Shuqiang Zhang, Yimeng Zhao, Shengbao Zheng, Art Zhu, Hongyi Zeng	2025-10-23	Download	The increasing scale of large language models (LLMs) necessitates highly efficient collective communication frameworks, particularly as training workloads extend to hundreds of thousands of GPUs.
ADP-VRSGP: Decentralized Learning with Adaptive Differential Privacy via Variance-Reduced Stochastic Gradient Push	Xiaoming Wu, Teng Liu, Xin Wang, Ming Yang, Jiguo Yu	2025-10-23	Download	Differential privacy is widely employed in decentralized learning to safeguard sensitive data by introducing noise into model updates. However, existing approaches that use fixed-variance noise often ...
A Full Stack Framework for High Performance Quantum-Classical Computing	Xin Zhan, K. Grace Johnson, Aniello Esposito, Barbara Chapman, Marco Fiorentino, Kirk M. Bresniker, Raymond G. Beausoleil, Masoud Mohseni	2025-10-23	Download	To address the growing needs for scalable High Performance Computing (HPC) and Quantum Computing (QC) integration, we present our HPC-QC full stack framework and its hybrid workload development capabi...
AsyncHZP: Hierarchical ZeRO Parallelism with Asynchronous Scheduling for Scalable LLM Training	Huawei Bai, Yifan Huang, Wenqi Shi, Ansheng You, Feifan Shao, Tengfei Han, Minghui Yu	2025-10-23	Download	The training efficiency and scalability of language models on massive clusters currently remain a critical bottleneck. Mainstream approaches like ND parallelism are often cumbersome and complex, while...

cs.NI - Networking and Internet Architecture

Title	Authors	Date	PDF	Abstract
AI-Enabled Digital Twins for Next-Generation Networks: Forecasting Traffic and Resource Management in 5G/6G	John Sengendo, Fabrizio Granelli	2025-10-23	Download	As 5G and future 6G mobile networks become increasingly more sophisticated, the requirements for agility, scalability, resilience, and precision in real-time service provisioning cannot be met using t...
Trust, But Verify: An Empirical Evaluation of AI-Generated Code for SDN Controllers	Felipe Avencourt Soares, Muriel F. Franco, Eder J. Scheid, Lisandro Z. Granville	2025-10-23	Download	Generative Artificial Intelligence (AI) tools have been used to generate human-like content across multiple domains (e.g., sound, image, text, and programming).
On the cybersecurity of LoRaWAN-based system: a Smart-Lighting case study	Florian Hofer, Barbara Russo	2025-10-23	Download	Cyber-physical systems and the Internet of Things (IoT) are key technologies in the Industry 4.0 vision. They incorporate sensors and actuators to interact with the physical environment.
Multicast-partitioning in Time-triggered Stream Planning for Time-Sensitive Networks	Heiko Geppert, Frank Dürr, Simon Naß, Kurt Rothermel	2025-10-23	Download	Multicast allows sending a message to multiple recipients without having to create and send a separate message for each recipient. This preserves network bandwidth, which is particularly important in ...
MAC Aggregation over Lossy Channels in DTLS 1.3	Eric Wagner, David Heye, Jan Bauer, Klaus Wehrle, Martin Serror	2025-10-23	Download	Aggregating Message Authentication Codes (MACs) promises to save valuable bandwidth in resource-constrained environments. The idea is simple: Instead of appending an authentication tag to each message...
Rediscovering Recurring Routing Results	Xiao Song, John Heidemann	2025-10-23	Download	Routing is central to networking performance, including: (1) latency in anycast services and websites served from multiple locations, (2) networking expenses and throughput in multi-homed enterprises, ...
Collective Communication for 100k+ GPUs	Min Si, Pavan Balaji, Yongzhou Chen, Ching-Hsiang Chu, Adi Gangidi, Saif Hasan, Subodh Iyengar, Dan Johnson, Bingzhe Liu, Regina Ren, Deep Shah, Ashmitha Jeevaraj Shetty, Greg Steinbrecher, Yulun Wang, Bruce Wu, Xinfeng Xie, Jingyi Yang, Mingran Yang, Kenny Yu, Minlan Yu, Cen Zhao, Wes Bland, Denis Boyda, Suman Gumudavelli, Prashanth Kannan, Cristian Lumezanu, Rui Miao, Zhe Qu, Venkat Ramesh, Maxim Samoylov, Jan Seidel, Srikanth Sundaresan, Feng Tian, Qiye Tan, Shuqiang Zhang, Yimeng Zhao, Shengbao Zheng, Art Zhu, Hongyi Zeng	2025-10-23	Download	The increasing scale of large language models (LLMs) necessitates highly efficient collective communication frameworks, particularly as training workloads extend to hundreds of thousands of GPUs.

cs.PF - Performance

Title	Authors	Date	PDF	Abstract
xMem: A CPU-Based Approach for Accurate Estimation of GPU Memory in Deep Learning Training Workloads	Jiabo Shi, Dimitrios Pezaros, Yehia Elkhatib	2025-10-23	Download	The global scarcity of GPUs necessitates more sophisticated strategies for Deep Learning jobs in shared cluster environments. Accurate estimation of how much GPU memory a job will require is fundament...
Prefetching Cache Optimization Using Graph Neural Networks: A Modular Framework and Conceptual Analysis	F. I. Qowy	2025-10-23	Download	Caching and prefetching techniques are fundamental to modern computing, serving to bridge the growing performance gap between processors and memory.