2025-12-16

cs.AR - Architecture

Title	Authors	Published	PDF	Abstract
Understanding Bottlenecks for Efficiently Serving LLM Inference With KV Offloading	William Meng, Benjamin Lee, Hong Wang	2025-12-16	Download	KV cache offloading enables long-context LLM inference by storing caches in CPU DRAM, but PCIe bandwidth limitations create severe bottlenecks.
Focus: A Streaming Concentration Architecture for Efficient Vision-Language Models	Chiyue Wei, Cong Guo, Junyao Zhang, Haoxuan Shan, Yifan Xu, Ziyue Zhang, Yudong Liu, Qinsi Wang, Changchun Zhou, Hai "Helen" Li, Yiran Chen	2025-12-16	Download	Vision-Language Models (VLMs) have demonstrated strong performance on tasks such as video captioning and visual question answering. However, their growing scale and video-level inputs lead to signific...
PADE: A Predictor-Free Sparse Attention Accelerator via Unified Execution and Stage Fusion	Huizheng Wang, Hongbin Wang, Zichuan Wang, Zhiheng Yue, Yang Wang, Chao Li, Yang Hu, Shouyi Yin	2025-12-16	Download	Attention-based models have revolutionized AI, but the quadratic cost of self-attention incurs severe computational and memory overhead. Sparse attention methods alleviate this by skipping low-relevan...
TEMP: A Memory Efficient Physical-aware Tensor Partition-Mapping Framework on Wafer-scale Chips	Huizheng Wang, Taiquan Wei, Zichuan Wang, Dingcheng Jiang, Qize Yang, Jiaxin Liu, Jingxiang Hou, Chao Li, Jinyi Deng, Yang Hu, Shouyi Yin	2025-12-16	Download	Large language models (LLMs) demand significant memory and computation resources. Wafer-scale chips (WSCs) provide high computation power and die-to-die (D2D) bandwidth but face a unique trade-off bet...
ReadyPower: A Reliable, Interpretable, and Handy Architectural Power Model Based on Analytical Framework	Qijun Zhang, Shang Liu, Yao Lu, Mengming Li, Zhiyao Xie	2025-12-16	Download	Power is a primary objective in modern processor design, requiring accurate yet efficient power modeling techniques. Architecture-level power models are necessary for early power optimization and desi...
Adaptive Cache Pollution Control for Large Language Model Inference Workloads Using Temporal CNN-Based Prediction and Priority-Aware Replacement	Songze Liu, Hongkun Du, Shaowen Wang	2025-12-16	Download	Large Language Models (LLMs), such as GPT and LLaMA, introduce unique memory access characteristics during inference due to frequent token sequence lookups and embedding vector retrievals.
The Impact Market to Save Conference Peer Review: Decoupling Dissemination and Credentialing	Karthikeyan Sankaralingam	2025-12-16	Download	Top-tier academic conferences are failing under the strain of two irreconcilable roles: (1) rapid dissemination of all sound research and (2) scarce credentialing for prestige and career advancement.

cs.DC - Distributed, Parallel, and Cluster Computing

Title	Authors	Published	PDF	Abstract
Optimizing Sensor Node Localization for Achieving Sustainable Smart Agriculture System Connectivity	Mohamed Naeem	2025-12-16	Download	The innovative agriculture system is revolutionizing how we farm, making it one of the most critical innovations of our time! Yet it faces significant connectivity challenges, particularly with the se...
Understanding Bottlenecks for Efficiently Serving LLM Inference With KV Offloading	William Meng, Benjamin Lee, Hong Wang	2025-12-16	Download	KV cache offloading enables long-context LLM inference by storing caches in CPU DRAM, but PCIe bandwidth limitations create severe bottlenecks.
PruneX: A Hierarchical Communication-Efficient System for Distributed CNN Training with Structured Pruning	Alireza Olama, Andreas Lundell, Izzat El Hajj, Johan Lilius, Jerker Björkqvist	2025-12-16	Download	Inter-node communication bandwidth increasingly constrains distributed training at scale on multi-node GPU clusters. While compact models are the ultimate deployment target, conventional pruning-aware...
Improving Slow Transfer Predictions: Generative Methods Compared	Jacob Taegon Kim, Alex Sim, Kesheng Wu, Jinoh Kim	2025-12-16	Download	Monitoring data transfer performance is a crucial task in scientific computing networks. By predicting performance early in the communication phase, potentially sluggish transfers can be identified an...
Performance and Stability of Barrier Mode Parallel Systems with Heterogeneous and Redundant Jobs	Brenton Walker, Markus Fidler	2025-12-16	Download	In some models of parallel computation, jobs are split into smaller tasks and can be executed completely asynchronously. In other situations the parallel tasks have constraints that require them to sy...
A Hybrid Reactive-Proactive Auto-scaling Algorithm for SLA-Constrained Edge Computing	Suhrid Gupta, Muhammed Tawfiqul Islam, Rajkumar Buyya	2025-12-16	Download	Edge computing decentralizes computing resources, allowing for novel applications in domains such as the Internet of Things (IoT) in healthcare and agriculture by reducing latency and improving perfor...
Privacy-Preserving Feature Valuation in Vertical Federated Learning Using Shapley-CMI and PSI Permutation	Unai Laskurain, Aitor Aguirre-Ortuzar, Urko Zurutuza	2025-12-16	Download	Federated Learning (FL) is an emerging machine learning paradigm that enables multiple parties to collaboratively train models without sharing raw data, ensuring data privacy.
Cornserve: Efficiently Serving Any-to-Any Multimodal Models	Jeff J. Ma, Jae-Won Chung, Jisang Ahn, Yizhuo Liang, Akshay Jajoo, Myungjin Lee, Mosharaf Chowdhury	2025-12-16	Download	We present Cornserve, an efficient online serving system for an emerging class of multimodal models called Any-to-Any models. Any-to-Any models accept combinations of text and multimodal data (e.g.
Real-Time Service Subscription and Adaptive Offloading Control in Vehicular Edge Computing	Chuanchao Gao, Arvind Easwaran	2025-12-16	Download	Vehicular Edge Computing (VEC) has emerged as a promising paradigm for enhancing the computational efficiency and service quality in intelligent transportation systems by enabling vehicles to wireless...

cs.NI - Networking and Internet Architecture

Title	Authors	Published	PDF	Abstract
Improving Slow Transfer Predictions: Generative Methods Compared	Jacob Taegon Kim, Alex Sim, Kesheng Wu, Jinoh Kim	2025-12-16	Download	Monitoring data transfer performance is a crucial task in scientific computing networks. By predicting performance early in the communication phase, potentially sluggish transfers can be identified an...
Hybrid Cognitive IoT with Cooperative Caching and SWIPT-EH: A Hierarchical Reinforcement Learning Framework	Nadia Abdolkhani, Walaa Hamouda	2025-12-16	Download	This paper proposes a hierarchical deep reinforcement learning (DRL) framework based on the soft actor-critic (SAC) algorithm for hybrid underlay-overlay cognitive Internet of Things (CIoT) networks w...
Performance and Stability of Barrier Mode Parallel Systems with Heterogeneous and Redundant Jobs	Brenton Walker, Markus Fidler	2025-12-16	Download	In some models of parallel computation, jobs are split into smaller tasks and can be executed completely asynchronously. In other situations the parallel tasks have constraints that require them to sy...
Assessing the Carbon Footprint of Virtual Meetings: A Quantitative Analysis of Camera Usage	Félix Mortas	2025-12-16	Download	This paper quantifies the carbon emissions related to data consumption during video calls, focusing on the impact of having the camera on versus off.
FUSION: Forecast-Embedded Agent Scheduling with Service Incentive Optimization over Distributed Air-Ground Edge Networks	Houyi Qi, Minghui Liwang, Seyyedali Hosseinalipour, Liqun Fu, Sai Zou, Xianbin Wang, Wei Ni, Yiguang Hong	2025-12-16	Download	In this paper, we introduce a first-of-its-kind forecasting-driven, incentive-inherent service provisioning framework for distributed air-ground integrated networks that explicitly accounts for human-...
A Threshold-Triggered Deep Q-Network-Based Framework for Self-Healing in Autonomic Software-Defined IIoT-Edge Networks	Agrippina Mwangi, León Navarro-Hilfiker, Lukasz Brewka, Mikkel Gryning, Elena Fumagalli, Madeleine Gibescu	2025-12-16	Download	Stochastic disruptions such as flash events arising from benign traffic bursts and switch thermal fluctuations are major contributors to intermittent service degradation in software-defined industrial...
Cooperative Caching Towards Efficient Spectrum Utilization in Cognitive-IoT Networks	Nadia Abdolkhani, Walaa Hamouda	2025-12-16	Download	In cognitive Internet of Things (CIoT) networks, efficient spectrum sharing is essential to address increasing wireless demands. This paper presents a novel deep reinforcement learning (DRL)-based app...
Hierarchical Deep Reinforcement Learning for Robust Access in Cognitive IoT Networks under Smart Jamming Attacks	Nadia Abdolkhani, Walaa Hamouda	2025-12-16	Download	In this paper, we address the challenge of dynamic spectrum access in a cognitive Internet of Things (CIoT) network where a secondary user (SU) operates under both energy constraints and adversarial i...
Country-in-the-Middle: Measuring Paths between People and their Governments	Alisha Ukani, Katherine Izhikevich, Shambhavi Mittal, Manan Patel, Samvrit Srinath, Kristy Ly, kc claffy, Alex C. Snoeren	2025-12-16	Download	Understanding where Internet services are hosted, and how users reach them, has captured the interest of government regulators and others concerned with the privacy of data flows.

cs.OS - Operating Systems

Title	Authors	Published	PDF	Abstract
EVICPRESS: Joint KV-Cache Compression and Eviction for Efficient LLM Serving	Shaoting Feng, Yuhan Liu, Hanchen Li, Xiaokun Chen, Samuel Shen, Kuntai Du, Zhuohan Gu, Rui Zhang, Yuyang Huang, Yihua Cheng, Jiayi Yao, Qizheng Zhang, Ganesh Ananthanarayanan, Junchen Jiang	2025-12-16	Download	Reusing KV cache is essential for high efficiency of Large Language Model (LLM) inference systems. With more LLM users, the KV cache footprint can easily exceed GPU memory capacity, so prior work has ...

cs.PF - Performance

Title	Authors	Published	PDF	Abstract
From HNSW to Information-Theoretic Binarization: Rethinking the Architecture of Scalable Vector Search	Seyed Moein Abtahi, Majid Fekri, Tara Khani, Akramul Azim	2025-12-16	Download	Modern semantic search and retrieval-augmented generation (RAG) systems rely predominantly on in-memory approximate nearest neighbor (ANN) indexes over high-precision floating-point vectors, resulting...
Performance and Stability of Barrier Mode Parallel Systems with Heterogeneous and Redundant Jobs	Brenton Walker, Markus Fidler	2025-12-16	Download	In some models of parallel computation, jobs are split into smaller tasks and can be executed completely asynchronously. In other situations the parallel tasks have constraints that require them to sy...
A Threshold-Triggered Deep Q-Network-Based Framework for Self-Healing in Autonomic Software-Defined IIoT-Edge Networks	Agrippina Mwangi, León Navarro-Hilfiker, Lukasz Brewka, Mikkel Gryning, Elena Fumagalli, Madeleine Gibescu	2025-12-16	Download	Stochastic disruptions such as flash events arising from benign traffic bursts and switch thermal fluctuations are major contributors to intermittent service degradation in software-defined industrial...
Adaptive Cache Pollution Control for Large Language Model Inference Workloads Using Temporal CNN-Based Prediction and Priority-Aware Replacement	Songze Liu, Hongkun Du, Shaowen Wang	2025-12-16	Download	Large Language Models (LLMs), such as GPT and LLaMA, introduce unique memory access characteristics during inference due to frequent token sequence lookups and embedding vector retrievals.