2025-12-16

cs.AR - Architecture

| Title | Authors | Published | PDF | Abstract |
| --- | --- | --- | --- | --- |
| Understanding Bottlenecks for Efficiently Serving LLM Inference With KV Offloading | William Meng, Benjamin Lee, Hong Wang | 2025-12-16 | Download | KV cache offloading enables long-context LLM inference by storing caches in CPU DRAM, but PCIe bandwidth limitations create severe bottlenecks. |
| Focus: A Streaming Concentration Architecture for Efficient Vision-Language Models | Chiyue Wei, Cong Guo, Junyao Zhang, Haoxuan Shan, Yifan Xu, Ziyue Zhang, Yudong Liu, Qinsi Wang, Changchun Zhou, Hai "Helen" Li, Yiran Chen | 2025-12-16 | Download | Vision-Language Models (VLMs) have demonstrated strong performance on tasks such as video captioning and visual question answering. However, their growing scale and video-level inputs lead to signific... |
| PADE: A Predictor-Free Sparse Attention Accelerator via Unified Execution and Stage Fusion | Huizheng Wang, Hongbin Wang, Zichuan Wang, Zhiheng Yue, Yang Wang, Chao Li, Yang Hu, Shouyi Yin | 2025-12-16 | Download | Attention-based models have revolutionized AI, but the quadratic cost of self-attention incurs severe computational and memory overhead. Sparse attention methods alleviate this by skipping low-relevan... |
| TEMP: A Memory Efficient Physical-aware Tensor Partition-Mapping Framework on Wafer-scale Chips | Huizheng Wang, Taiquan Wei, Zichuan Wang, Dingcheng Jiang, Qize Yang, Jiaxin Liu, Jingxiang Hou, Chao Li, Jinyi Deng, Yang Hu, Shouyi Yin | 2025-12-16 | Download | Large language models (LLMs) demand significant memory and computation resources. Wafer-scale chips (WSCs) provide high computation power and die-to-die (D2D) bandwidth but face a unique trade-off bet... |
| ReadyPower: A Reliable, Interpretable, and Handy Architectural Power Model Based on Analytical Framework | Qijun Zhang, Shang Liu, Yao Lu, Mengming Li, Zhiyao Xie | 2025-12-16 | Download | Power is a primary objective in modern processor design, requiring accurate yet efficient power modeling techniques. Architecture-level power models are necessary for early power optimization and desi... |
| Adaptive Cache Pollution Control for Large Language Model Inference Workloads Using Temporal CNN-Based Prediction and Priority-Aware Replacement | Songze Liu, Hongkun Du, Shaowen Wang | 2025-12-16 | Download | Large Language Models (LLMs), such as GPT and LLaMA, introduce unique memory access characteristics during inference due to frequent token sequence lookups and embedding vector retrievals. |
| The Impact Market to Save Conference Peer Review: Decoupling Dissemination and Credentialing | Karthikeyan Sankaralingam | 2025-12-16 | Download | Top-tier academic conferences are failing under the strain of two irreconcilable roles: (1) rapid dissemination of all sound research and (2) scarce credentialing for prestige and career advancement. |

cs.DC - Distributed, Parallel, and Cluster Computing

| Title | Authors | Published | PDF | Abstract |
| --- | --- | --- | --- | --- |
| Optimizing Sensor Node Localization for Achieving Sustainable Smart Agriculture System Connectivity | Mohamed Naeem | 2025-12-16 | Download | The innovative agriculture system is revolutionizing how we farm, making it one of the most critical innovations of our time! Yet it faces significant connectivity challenges, particularly with the se... |
| Understanding Bottlenecks for Efficiently Serving LLM Inference With KV Offloading | William Meng, Benjamin Lee, Hong Wang | 2025-12-16 | Download | KV cache offloading enables long-context LLM inference by storing caches in CPU DRAM, but PCIe bandwidth limitations create severe bottlenecks. |
| PruneX: A Hierarchical Communication-Efficient System for Distributed CNN Training with Structured Pruning | Alireza Olama, Andreas Lundell, Izzat El Hajj, Johan Lilius, Jerker Björkqvist | 2025-12-16 | Download | Inter-node communication bandwidth increasingly constrains distributed training at scale on multi-node GPU clusters. While compact models are the ultimate deployment target, conventional pruning-aware... |
| Improving Slow Transfer Predictions: Generative Methods Compared | Jacob Taegon Kim, Alex Sim, Kesheng Wu, Jinoh Kim | 2025-12-16 | Download | Monitoring data transfer performance is a crucial task in scientific computing networks. By predicting performance early in the communication phase, potentially sluggish transfers can be identified an... |
| Performance and Stability of Barrier Mode Parallel Systems with Heterogeneous and Redundant Jobs | Brenton Walker, Markus Fidler | 2025-12-16 | Download | In some models of parallel computation, jobs are split into smaller tasks and can be executed completely asynchronously. In other situations the parallel tasks have constraints that require them to sy... |
| A Hybrid Reactive-Proactive Auto-scaling Algorithm for SLA-Constrained Edge Computing | Suhrid Gupta, Muhammed Tawfiqul Islam, Rajkumar Buyya | 2025-12-16 | Download | Edge computing decentralizes computing resources, allowing for novel applications in domains such as the Internet of Things (IoT) in healthcare and agriculture by reducing latency and improving perfor... |
| Privacy-Preserving Feature Valuation in Vertical Federated Learning Using Shapley-CMI and PSI Permutation | Unai Laskurain, Aitor Aguirre-Ortuzar, Urko Zurutuza | 2025-12-16 | Download | Federated Learning (FL) is an emerging machine learning paradigm that enables multiple parties to collaboratively train models without sharing raw data, ensuring data privacy. |
| Cornserve: Efficiently Serving Any-to-Any Multimodal Models | Jeff J. Ma, Jae-Won Chung, Jisang Ahn, Yizhuo Liang, Akshay Jajoo, Myungjin Lee, Mosharaf Chowdhury | 2025-12-16 | Download | We present Cornserve, an efficient online serving system for an emerging class of multimodal models called Any-to-Any models. Any-to-Any models accept combinations of text and multimodal data (e.g. |
| Real-Time Service Subscription and Adaptive Offloading Control in Vehicular Edge Computing | Chuanchao Gao, Arvind Easwaran | 2025-12-16 | Download | Vehicular Edge Computing (VEC) has emerged as a promising paradigm for enhancing the computational efficiency and service quality in intelligent transportation systems by enabling vehicles to wireless... |

cs.NI - Networking and Internet Architecture

| Title | Authors | Published | PDF | Abstract |
| --- | --- | --- | --- | --- |
| Improving Slow Transfer Predictions: Generative Methods Compared | Jacob Taegon Kim, Alex Sim, Kesheng Wu, Jinoh Kim | 2025-12-16 | Download | Monitoring data transfer performance is a crucial task in scientific computing networks. By predicting performance early in the communication phase, potentially sluggish transfers can be identified an... |
| Hybrid Cognitive IoT with Cooperative Caching and SWIPT-EH: A Hierarchical Reinforcement Learning Framework | Nadia Abdolkhani, Walaa Hamouda | 2025-12-16 | Download | This paper proposes a hierarchical deep reinforcement learning (DRL) framework based on the soft actor-critic (SAC) algorithm for hybrid underlay-overlay cognitive Internet of Things (CIoT) networks w... |
| Performance and Stability of Barrier Mode Parallel Systems with Heterogeneous and Redundant Jobs | Brenton Walker, Markus Fidler | 2025-12-16 | Download | In some models of parallel computation, jobs are split into smaller tasks and can be executed completely asynchronously. In other situations the parallel tasks have constraints that require them to sy... |
| Assessing the Carbon Footprint of Virtual Meetings: A Quantitative Analysis of Camera Usage | Félix Mortas | 2025-12-16 | Download | This paper quantifies the carbon emissions related to data consumption during video calls, focusing on the impact of having the camera on versus off. |
| FUSION: Forecast-Embedded Agent Scheduling with Service Incentive Optimization over Distributed Air-Ground Edge Networks | Houyi Qi, Minghui Liwang, Seyyedali Hosseinalipour, Liqun Fu, Sai Zou, Xianbin Wang, Wei Ni, Yiguang Hong | 2025-12-16 | Download | In this paper, we introduce a first-of-its-kind forecasting-driven, incentive-inherent service provisioning framework for distributed air-ground integrated networks that explicitly accounts for human-... |
| A Threshold-Triggered Deep Q-Network-Based Framework for Self-Healing in Autonomic Software-Defined IIoT-Edge Networks | Agrippina Mwangi, León Navarro-Hilfiker, Lukasz Brewka, Mikkel Gryning, Elena Fumagalli, Madeleine Gibescu | 2025-12-16 | Download | Stochastic disruptions such as flash events arising from benign traffic bursts and switch thermal fluctuations are major contributors to intermittent service degradation in software-defined industrial... |
| Cooperative Caching Towards Efficient Spectrum Utilization in Cognitive-IoT Networks | Nadia Abdolkhani, Walaa Hamouda | 2025-12-16 | Download | In cognitive Internet of Things (CIoT) networks, efficient spectrum sharing is essential to address increasing wireless demands. This paper presents a novel deep reinforcement learning (DRL)-based app... |
| Hierarchical Deep Reinforcement Learning for Robust Access in Cognitive IoT Networks under Smart Jamming Attacks | Nadia Abdolkhani, Walaa Hamouda | 2025-12-16 | Download | In this paper, we address the challenge of dynamic spectrum access in a cognitive Internet of Things (CIoT) network where a secondary user (SU) operates under both energy constraints and adversarial i... |
| Country-in-the-Middle: Measuring Paths between People and their Governments | Alisha Ukani, Katherine Izhikevich, Shambhavi Mittal, Manan Patel, Samvrit Srinath, Kristy Ly, kc claffy, Alex C. Snoeren | 2025-12-16 | Download | Understanding where Internet services are hosted, and how users reach them, has captured the interest of government regulators and others concerned with the privacy of data flows. |

cs.OS - Operating Systems

| Title | Authors | Published | PDF | Abstract |
| --- | --- | --- | --- | --- |
| EVICPRESS: Joint KV-Cache Compression and Eviction for Efficient LLM Serving | Shaoting Feng, Yuhan Liu, Hanchen Li, Xiaokun Chen, Samuel Shen, Kuntai Du, Zhuohan Gu, Rui Zhang, Yuyang Huang, Yihua Cheng, Jiayi Yao, Qizheng Zhang, Ganesh Ananthanarayanan, Junchen Jiang | 2025-12-16 | Download | Reusing KV cache is essential for high efficiency of Large Language Model (LLM) inference systems. With more LLM users, the KV cache footprint can easily exceed GPU memory capacity, so prior work has ... |

cs.PF - Performance

| Title | Authors | Published | PDF | Abstract |
| --- | --- | --- | --- | --- |
| From HNSW to Information-Theoretic Binarization: Rethinking the Architecture of Scalable Vector Search | Seyed Moein Abtahi, Majid Fekri, Tara Khani, Akramul Azim | 2025-12-16 | Download | Modern semantic search and retrieval-augmented generation (RAG) systems rely predominantly on in-memory approximate nearest neighbor (ANN) indexes over high-precision floating-point vectors, resulting... |
| Performance and Stability of Barrier Mode Parallel Systems with Heterogeneous and Redundant Jobs | Brenton Walker, Markus Fidler | 2025-12-16 | Download | In some models of parallel computation, jobs are split into smaller tasks and can be executed completely asynchronously. In other situations the parallel tasks have constraints that require them to sy... |
| A Threshold-Triggered Deep Q-Network-Based Framework for Self-Healing in Autonomic Software-Defined IIoT-Edge Networks | Agrippina Mwangi, León Navarro-Hilfiker, Lukasz Brewka, Mikkel Gryning, Elena Fumagalli, Madeleine Gibescu | 2025-12-16 | Download | Stochastic disruptions such as flash events arising from benign traffic bursts and switch thermal fluctuations are major contributors to intermittent service degradation in software-defined industrial... |
| Adaptive Cache Pollution Control for Large Language Model Inference Workloads Using Temporal CNN-Based Prediction and Priority-Aware Replacement | Songze Liu, Hongkun Du, Shaowen Wang | 2025-12-16 | Download | Large Language Models (LLMs), such as GPT and LLaMA, introduce unique memory access characteristics during inference due to frequent token sequence lookups and embedding vector retrievals. |
