Recently Added Papers

Tracks new papers from arXiv and followed conferences and journals. Browse by publication time; search results are ranked by relevance first.
arXiv cs.CR | Dependability and Security | arXiv | benchmark | approach | empirical study | Vulnerability detection and software security | Confidentiality, integrity | cs.CR

Anchors that Don't Lift: Understanding Supply Chain Driven Kernel Lock-In and Governance-Mediated Mitigation Strategies in SOHO Devices

Ritwik Badola, Rajdeep Ghosh, Ashita Gupta, Chester Rebeiro, Mainack Mondal
2026/06/10 01:51 (2 days ago)

Small Office/Home Office (SOHO) devices are widely popular, yet often attacked due to security vulnerabilities in their firmware, affecting thousands of devices. These security vulnerabilities often stem from outdated Linux kernel versions included in SOHO device firmware. Naturally, prior work audited the extent and impact of this issue by simple Linux version extraction and version-number-based vulnerability mapping. However, it is unclear how many of these anticipated vulnerabilities actually exist in the heavily customized SOHO kernels and whether there are any barriers to updating Linux kernels in SOHO firmware. To address this gap, we uncover actual kernel-related vulnerabilities found in 306 SOHO devices using a high-precision template-based CVE detection mechanism on GPL source releases of more than 900 firmware images from these devices. Next, as a first, we trace the supply chain of these vulnerable SOHO devices at scale and identify kernel lock-in as a significant security issue -- SOHO vendors are effectively locked to specific (often older) kernel versions due to the system-on-chip (SoC) SDKs they use. This kernel lock-in produces a vulnerability debt that is inherited along the supply chain from SoC vendor to firmware creators (ODM/OEM) to router/IP-camera vendor and ultimately borne by end users. All five SoC vendors in our dataset had used SDKs with Linux kernels that had reached EoL more than a year before their usage in a SOHO device. Finally, we explore the mitigation potential of individual, regulatory and community governance by analyzing social media posts, regulations and community efforts. Our results show that regulation compliance is insufficient and only SoC vendors who engage with communities for kernel upgrades offer a viable path towards mitigation. The data and code for this work are available at https://doi.org/10.5281/zenodo.20433799

arXiv LLM4Code / Program Repair | AI for Software Engineering | arXiv | approach | Automating SE tasks with LLM and foundation models | AI-enabled recommender systems for automated SE | Collaborative AI for SE | hep-ph | astro-ph.CO | physics.comp-ph

DarkAgents

Michele Lucente, Silvia Pascoli, Filippo Sala, Matteo Zandi
2026/06/10 01:39 (2 days ago)

We present DarkAgents: a multi-agent system that leverages the reasoning and code-generation capabilities of large language models (LLMs), together with deterministic, tested, human-written code, to build orchestrated pipelines for theoretical astroparticle physics research. While related approaches have been proposed in collider physics and cosmology, DarkAgents targets the specific challenges of this domain, such as model building, complex pipeline computations, multiple constraints and assumption auditing. The framework can be powered by different agentic command-line tools, including Mistral's, Anthropic's, OpenAI's and local LLMs via Ollama. As a first implementation, we apply DarkAgents to the study of cosmological first-order transitions, starting from a classically scale-invariant particle-physics model and ending with the fit to the NANOGrav nanohertz gravitational-wave spectrum. DarkAgent-PT provides as output i) the best-fit values of model parameters, ii) their existing experimental and observational constraints, iii) an audit report of the assumptions and priors entering both i) and ii), of particular relevance for astroparticle physics. Our test runs identify inconsistencies in some fits in the literature and produce novel ones based on the dissipative bulk-flow GW template. The code is publicly available at https://github.com/PhysicsZandi/DarkAgents.

arXiv cs.CR | Dependability and Security | arXiv | approach | empirical study | Confidentiality, integrity, privacy | cs.CR

OpenPCC: Open and Confidential LLM Serving on Commodity TEEs

Haoling Zhou, Shixuan Zhao, Chao Wang, Zhiqiang Lin
2026/06/10 01:33 (2 days ago)

Generative AI applications such as personal AI agents, image generators, and chat assistants offer advanced capabilities to improve user experience. Behind the scenes, Large Language Models (LLMs) that power these services require a massive amount of computation and are usually deployed in the cloud, available as APIs, meaning that a user's request has to be sent to a Cloud Inference Service (CIS) for processing. However, the strong capabilities of LLMs also mean that users' requests now contain far more sensitive personal or confidential enterprise information, demanding equally strong protection in the CIS. While early industry efforts such as Apple Private Cloud Compute (PCC) and Google Private AI Compute have emerged to show the potential of secure CIS, others cannot adopt them for deployment because they rely on proprietary hardware and closed ecosystems. In addition, they all suffer from their own design glitches that can undermine the ambitious goal of bringing true privacy protection to end users. In this paper, we present our analysis of the fundamental requirements of building a secure yet open CIS. We then present OpenPCC, a Confidential CIS framework that does not rely on proprietary hardware but instead uses commercially available TEEs. We implement an open-source prototype and characterize it end-to-end on a Llama-3 8B vLLM workload, separating OpenPCC's own cost from the underlying TEE hardware. Our analysis and evaluation demonstrate the feasibility and security of the system.

arXiv cs.SE | Testing and Analysis | arXiv | approach | Automated test generation | cs.SE

Operationalizing Property-Based Testing for Data-Intensive Scalable Computing Systems

Yaoxuan Wu, Ingrid Lee, Ahmad Humayun, Muhammad Ali Gulzar, Miryung Kim
2026/06/10 01:25 (2 days ago)

While fuzzing effectively catches crashes, its shallow oracles often miss semantic drifts and optimization-related errors in data-intensive scalable computing (DISC) frameworks. Property-based testing (PBT) addresses this limitation by checking general semantic invariants across diverse workloads and inputs, rather than relying on specific expected outputs. However, systematically operationalizing PBT for DISC systems remains difficult because it requires both reusable property definitions and effective instantiation into valid workloads and data. We present DiscPBT, a property-based testing engine for Apache Spark. DiscPBT introduces eight reusable meta-properties for DISC semantic testing, spanning equivalence rewriting, data decomposition, computation decomposition, and operator-local semantic relations. To operationalize these meta-properties, DiscPBT provides reusable generators for synthesizing valid workload skeletons and input data, together with an instantiation framework that realizes each meta-property in schema-compatible contexts through compatible operators, expressions, and UDFs. Our evaluation on PySpark shows that DiscPBT achieves 1.2$\times$ higher branch coverage and 1153$\times$ greater plan diversity than CometFuzz. Across 66 concrete properties, DiscPBT reveals cross-version semantic drift as well as subtle corner-case pitfalls involving NaN and empty inputs that are not captured by crash-based fuzzing alone. These results demonstrate the value of systematic PBT for uncovering semantic issues in DISC frameworks.
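
To make the idea of a data-decomposition property concrete, here is a minimal sketch in plain Python rather than Spark. It is not DiscPBT's implementation: `pipeline` is a hypothetical stand-in workload, and the random generator is a deliberately simple substitute for the paper's workload and data generators. The property checked is that running the workload on a whole dataset gives the same result as running it on two partitions and merging.

```python
import random

def pipeline(rows):
    """Hypothetical workload: keep even values, double them, return sorted."""
    return sorted(2 * x for x in rows if x % 2 == 0)

def check_data_decomposition(trials=200, seed=0):
    """Check pipeline(A + B) == merge(pipeline(A), pipeline(B)) on random inputs.

    Returns None if the property held on every trial, otherwise the
    first input list that violated it (a counterexample).
    """
    rng = random.Random(seed)
    for _ in range(trials):
        rows = [rng.randint(-50, 50) for _ in range(rng.randint(0, 20))]
        cut = rng.randint(0, len(rows))  # split point; may yield an empty partition
        left, right = rows[:cut], rows[cut:]
        whole = pipeline(rows)
        merged = sorted(pipeline(left) + pipeline(right))
        if whole != merged:
            return rows
    return None

print(check_data_decomposition())
```

Because the oracle is a relation between two runs and not a fixed expected output, the same check applies unchanged to any generated input, including the empty-input corner case.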

arXiv cs.CR | Dependability and Security | arXiv | benchmark | approach | Vulnerability detection and software security | cs.CR

A Longitudinal Study of Recently Observed Malicious Domains: Characteristics, Infrastructure, and Abuse Patterns

Fathima Mashood, Mohamed Nabeel
2026/06/10 01:06 (2 days ago)

We present a longitudinal study of approximately 1.52 million malicious domains observed on VirusTotal (VT) between January and May 2026. Domains were selected on the basis of detection by at least five independent VT scanning engines and a first-seen date within the study window. We group the dataset into compromised domains and attacker-created domains, which account for approximately 89.3% of the dataset. Combining WHOIS registration records and passive DNS (PDNS) data with the VT dataset, we characterise attacker behaviour across eight dimensions: temporal distribution, compromised vs. attacker-created classification, domain age at first detection, registrar and TLD preferences, DNS query volume as a damage proxy, hosting infrastructure concentration (IP and ASN level), bulk registration patterns, and brand impersonation. Key findings include: the majority of attacker-created domains are short-lived registrations used within weeks of creation; a small number of registrars and TLDs account for most abuse; Cloudflare infrastructure is heavily exploited for domain fronting; bulk registration events involving thousands of domains from a single registrar on a single day are widespread; and several global brands, particularly WhatsApp and Google, are heavily impersonated. We share the annotated dataset in the GitHub repo https://github.com/mufimash/malicious_domains for further research.
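
The selection rule in this abstract (at least five detecting engines and a first-seen date inside the study window) can be sketched as a simple filter. This is an illustration only, not the authors' code: the function name, the record layout, the sample domains, and the exact window boundaries (1 January to 31 May 2026) are assumptions.

```python
from datetime import date

# Assumed boundaries; the abstract only says "between January and May 2026".
WINDOW_START = date(2026, 1, 1)
WINDOW_END = date(2026, 5, 31)
MIN_ENGINES = 5  # "at least five independent VT scanning engines"

def is_selected(engine_detections: int, first_seen: date) -> bool:
    """Return True if a domain meets both selection criteria."""
    in_window = WINDOW_START <= first_seen <= WINDOW_END
    return engine_detections >= MIN_ENGINES and in_window

# Hypothetical records: (domain, number of detecting engines, first-seen date).
records = [
    ("example-a.test", 7, date(2026, 2, 14)),
    ("example-b.test", 3, date(2026, 3, 1)),     # too few engines
    ("example-c.test", 9, date(2025, 12, 30)),   # first seen before the window
]
kept = [name for name, hits, seen in records if is_selected(hits, seen)]
print(kept)
```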

arXiv cs.CR | Architecture and Design | arXiv | approach | Architecture quality attributes | cs.CR | cs.LG

Do Transformers Actually Help Intrusion Detection? A Temporal Sequence Evaluation on CIC-IDS2017

Zach Moczkodan, Hany Ragab
2026/06/10 00:57 (2 days ago)

Recent deep learning approaches for network intrusion detection increasingly incorporate temporal architectures such as recurrent networks and Transformers, often reporting near-perfect performance on CIC-IDS2017. However, many existing studies neither supply their temporal modules with genuine sequence inputs nor evaluate under realistic, leakage-free conditions, making it unclear whether reported gains arise from true sequence-modeling capability. In this work, we reformulate CIC-IDS2017 as a temporal intrusion-detection task by constructing ordered flow sequences from network conversations and benchmarking nine classical and deep learning architectures under a random split, two leakage-free splits, and a padding-scheme ablation. The central finding is that padding convention, not architecture, determines the Transformer's performance: on genuinely sequential (non-padded) windows the Transformer achieves the highest macro-F1 of any model in the experiment (0.89); under zero-pad+mask evaluation it drops markedly (-0.24 macro-F1), while LSTM, GRU, and 1D-CNN remain stable. Under leakage-free group evaluation the Random Forest is the most robust model (+0.009), while the Transformer's false-alarm rate grows from 0.04% to 2.7%, a 67-fold increase invisible under conventional protocols. These findings demonstrate that evaluation methodology -- specifically padding convention and split protocol -- has a larger effect on reported performance than architectural choice, and that widely used random splits with repeat-last padding can overestimate model robustness by up to 0.24 macro-F1. We advocate leakage-free splits, explicit padding disclosure, and sequence-aware benchmarking as standard practice in future IDS research. Code and implementation details are available at https://github.com/zachmocz/temporal-ids-bench.
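
A leakage-free protocol of the kind advocated here can be approximated by a group-wise split, in which every flow from the same conversation lands on one side of the train/test boundary. The sketch below rests on assumptions: the abstract does not say how groups are defined, and `group_split` and the conversation IDs are hypothetical.

```python
import random

def group_split(group_ids, test_fraction=0.25, seed=0):
    """Split sample indices so that no group appears in both train and test.

    group_ids[i] is the group (e.g. conversation ID) of sample i.
    Returns (train_indices, test_indices).
    """
    groups = sorted(set(group_ids))
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train_idx = [i for i, g in enumerate(group_ids) if g not in test_groups]
    test_idx = [i for i, g in enumerate(group_ids) if g in test_groups]
    return train_idx, test_idx

# Hypothetical conversation IDs for eight flow windows.
ids = ["a", "a", "b", "b", "c", "c", "d", "d"]
train_idx, test_idx = group_split(ids)
print(train_idx, test_idx)
```

A plain random split over the same eight samples could place two windows of conversation "a" on opposite sides, which is the kind of leakage the abstract warns about.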

arXiv AI Coding 3Y | AI for Software Engineering | arXiv | benchmark | approach | empirical study | Automating SE tasks with LLM and foundation models | Efficacy measurement beyond traditional metrics | cs.AI | cs.LG

CIAware-Bench: Benchmarking Control Intervention Awareness Across Frontier LLMs

Joachim Schaeffer, Thomas Jiralerspong, Alexander Panfilov, Guillaume Lajoie, Jonas Geiping, Yoshua Bengio, Roland S. Zimmermann
2026/06/10 00:24 (2 days ago)

AI control protocols oversee untrusted models by monitoring their actions and modifying potentially unsafe steps, often using a trusted model. This partially tampers with the untrusted model's trajectory. If the untrusted model detects such an intervention, it may infer properties of the monitor and adapt to evade control. We introduce CIAware-Bench, a benchmark for measuring control intervention (CI) awareness across frontier models. CIAware-Bench tests whether models can distinguish their own trajectories from those modified by a control intervention. The benchmark comprises a suite of four task domains (essay writing, BigCodeBench, Bash Arena, and SHADE-Arena), while varying trajectory watermarking, side-task presence, and the control protocol. Evaluating eleven frontier models, we find low to moderate CI awareness under default settings (up to 0.87; random chance balanced binary classification accuracy is 0.5) with substantial variation across task domains and model pairs. Detection is generally easier across model families, suggesting that models exploit provider-specific differences in style or post-training. Overall, CI awareness is not a fixed model-level property, and should be measured for each new model release and deployment scenario. We release CIAware-Bench to track CI awareness and inform control protocols whose interventions are harder to detect.
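
The 0.5 chance baseline quoted in this abstract can be illustrated with a small balanced-accuracy helper. This is a sketch under assumptions: the abstract does not say exactly how the benchmark computes its score, and the labels below (1 for a modified trajectory, 0 for the model's own) are hypothetical.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of true-positive rate and true-negative rate for binary labels."""
    pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    neg = [p for t, p in zip(y_true, y_pred) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    tpr = sum(1 for p in pos if p == 1) / len(pos)
    tnr = sum(1 for p in neg if p == 0) / len(neg)
    return (tpr + tnr) / 2

# Hypothetical balanced set: 1 = trajectory was modified, 0 = own trajectory.
labels = [1, 1, 0, 0]
always_modified = [1, 1, 1, 1]  # a detector that flags everything
print(balanced_accuracy(labels, always_modified))
print(balanced_accuracy(labels, labels))
```

A detector that ignores its input scores 0.5 here, so reported values are best read as distance above that baseline.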

arXiv cs.SE / arXiv AI Agents for Software Engineering / arXiv AI Coding 3Y | AI for Software Engineering | arXiv | other | Automating SE tasks with LLM and foundation models | cs.SE | cs.HC

Making Software Meaningful

Eagon Meng, Abutalib Namazov, Carmel Schare, Alcino Cunha, Daniel Jackson
2026/06/10 00:16 (2 days ago)

Adopting a single measure can improve the usability, modularity and accountability of software: a commitment to explicit meaning. This entails constructing and agreeing upon a representation of the behavior of the software, as observed in the domain of application. The phenomena comprising this behavior become a vocabulary that grounds all discourse about the software, among all stakeholders, and for all artifacts and activities. These phenomena are individuals; actions they participate in; and facts that result from actions. They can be organized, by partitioning the set of actions, into concepts, offering larger units of meaning. Examples of exploiting meaning are given in three areas: designing for usability (by aligning user and designer on a single shared meaning); generating modular code with LLMs (by mapping units of meaning to units of code, achieving not only modularity but also legibility); and making agents accountable (by having them adhere to a code of conduct that defines their intended behavior).

arXiv cs.SE | Testing and Analysis | arXiv | approach | empirical study | Automated test generation | cs.SE

GapFuzz: Cross-Plane Divergence Fuzzing for Distributed SDN Controllers

Moustapha Awwalou Diouf, Samuel Ouya, Jacques Klein, Tegawendé F. Bissyandé
2026/06/10 00:05 (2 days ago)

Distributed Software-Defined Networking (SDN) clusters replicate flow state asynchronously between a master node and its backups, leaving a window during which two backup nodes can each commit a contradictory rule, the master can serialize both into the data plane, and the kernel datapath can latch onto an action that no node believes authoritative. Existing SDN fuzzers miss this fault: they confine their oracle to the control plane, target a single controller, or do not steer concurrency to provoke replication races. We present GapFuzz, a stateful concurrency fuzzer for distributed SDN clusters. GapFuzz injects pairs of contradictory Northbound requests on two non-master nodes with controlled inter-injection delay $Δt$, and reconstructs the global cross-plane state by querying every replica and the kernel-datapath action through ovs-appctl ofproto/trace. A two-phase timing search detects whether a divergence exists, then doubles and bisects on $Δt$ to bound the injection-time window; a lifetime probe labels each verdict transient or persistent and assigns it to one of four cross-plane state classes derived from the ONOS 2.7 source. On a three-node ONOS 2.7 cluster, GapFuzz produces a divergent verdict in 81.7% of attempts ($N=50$, Wilson 95% CI $[77.3, 85.4]$%); every divergence sits between the cluster's authoritative state and the kernel datapath. Phase 2 separates a 5 ms race window for one template from a doubling-cap regime ($Δt_{\max}=10.24$ s) for six others, and 99.4% of divergences persist past 30 s. Replacing the kernel-datapath probe with the OpenFlow user-space probe used by prior fuzzers drops detection by 26.6 percentage points overall and by 46.5 points after excluding canonicalization-forced verdicts.
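
The "doubles and bisects" timing search can be sketched as follows. This is not GapFuzz's code: `diverges` is a hypothetical oracle that reports whether a divergence is observed at a given inter-injection delay, the search assumes divergence is monotone in the delay (present below some threshold, absent above it), and the 10 ms starting delay is an assumption. Only the 10240 ms cap comes from the abstract's doubling-cap value of 10.24 s.

```python
def bound_window(diverges, start_ms=10, cap_ms=10240):
    """Bound the delay window in which a divergence is observed.

    Returns (lo, hi) in milliseconds: divergence seen at lo, not seen at hi.
    hi is None when divergence is still seen at the doubling cap.
    Returns (0, start_ms) when no divergence is seen at the starting delay.
    """
    if not diverges(start_ms):
        return (0, start_ms)
    lo, hi = start_ms, start_ms * 2
    # Doubling phase: grow the delay until divergence disappears or the cap is passed.
    while hi <= cap_ms and diverges(hi):
        lo, hi = hi, hi * 2
    if hi > cap_ms:
        return (lo, None)
    # Bisection phase: narrow [lo, hi] down to 1 ms resolution.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if diverges(mid):
            lo = mid
        else:
            hi = mid
    return (lo, hi)

# Hypothetical oracle: divergence occurs only for delays up to 37 ms.
print(bound_window(lambda dt: dt <= 37))
# Hypothetical oracle: divergence at every delay, so the search stops at the cap.
print(bound_window(lambda dt: True))
```

Doubling finds a bracketing interval in a logarithmic number of probes, which matters when each probe is a full injection run against a live cluster.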

arXiv cs.CR | Dependability and Security | arXiv | approach | Vulnerability detection and software security | cs.CR

When Discovery Outpaces Remediation: Modeling AI-Accelerated Vulnerability Discovery in Interconnected Systems

Mohamamad Reza Faghani
2026/06/09 23:58 (2 days ago)

Advanced AI systems for code analysis, binary analysis, fuzzing orchestration, and penetration-test planning may significantly increase the rate at which latent vulnerabilities are discovered. While improved discovery can benefit defenders, it can also overload remediation pipelines and accelerate adversarial weaponization. This paper develops a queueing and network-theoretic model of AI-accelerated vulnerability discovery in interconnected systems. We represent an enterprise as a weighted dependency graph with replenishing vulnerability pools, finite remediation capacity, triage degradation, exploit window compression, and dynamic compromise propagation. We derive stability conditions for vulnerability backlogs, formulate a dynamic coupling between unresolved backlog and cascade risk, and evaluate mitigation strategies through simulation. Results indicate that when actionable discovery arrivals exceed remediation throughput, backlogs grow rapidly and systemic risk increases nonlinearly. In hub-dominated topologies, segmentation can reduce propagated compromise more effectively than remediation speed alone, while the strongest defense combines remediation automation with reduced network coupling.
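
The basic stability intuition (a backlog grows without bound once arrivals exceed remediation throughput) can be shown with a deterministic toy model. This sketch is far simpler than the paper's model: it has no dependency graph, triage degradation, or compromise propagation, and the function name and rates are assumptions chosen for illustration.

```python
def simulate_backlog(arrival_rate, remediation_rate, steps, backlog=0.0):
    """Track an unresolved-vulnerability backlog over discrete time steps.

    Each step adds arrival_rate findings and removes up to remediation_rate;
    the backlog cannot go below zero.
    """
    history = []
    for _ in range(steps):
        backlog = max(0.0, backlog + arrival_rate - remediation_rate)
        history.append(backlog)
    return history

# Arrivals above capacity: the backlog grows by 2 every step.
print(simulate_backlog(arrival_rate=12, remediation_rate=10, steps=5))
# Arrivals below capacity: an initial backlog of 3 drains and stays at zero.
print(simulate_backlog(arrival_rate=8, remediation_rate=10, steps=5, backlog=3.0))
```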

arXiv cs.SE / arXiv cs.CR | Dependability and Security | arXiv | approach | empirical study | Confidentiality, integrity, privacy | cs.CR | cs.AI | cs.SE

Understanding and mitigating the risks of OpenClaw for non-technical users: A practical guide with Skill

Junchang Zheng, Junfeng Tan, Jialiang Lin
2026/06/09 23:41 (2 days ago)

OpenClaw has rapidly emerged as a transformative artificial intelligence (AI) agent framework, and its ability to autonomously execute complex, multi-step tasks has attracted an ever-growing and diverse user base. However, this capability comes with significant risks. While existing research has made important strides in characterizing these threats, such work is predominantly directed at technically sophisticated audiences. It remains largely inaccessible to non-technical users. This demographic now makes up an increasingly large and underserved portion of the community, yet it is these very users who most urgently need practical and straightforward guidance. In response, we bridge this gap through a series of interconnected efforts designed to lower the risk barrier for non-technical OpenClaw users. First, we identify and categorize seven core risks that OpenClaw users may encounter in daily usage, explaining each in plain language so that non-technical users can readily grasp the nature and potential consequences of these threats. Second, for each identified risk, we distill a set of corresponding defensive strategies into clear and actionable operational steps that are easy to follow. Third, to make protection even easier, we provide a companion OpenClaw Skill that automates key security configurations, enabling users to safeguard their systems with minimal manual intervention. Through this work, we demonstrate that safeguarding against the risks of intelligent agents need not be the exclusive domain of security experts, and that non-technical users can meaningfully participate in reducing these risks through simple, practical actions.

arXiv LLM4Code / Program Repair / arXiv AI Coding 3Y | AI for Software Engineering | arXiv | benchmark | approach | Automating SE tasks with LLM and foundation models | AI-enabled recommender systems for automated SE | Efficacy measurement beyond traditional metrics | cs.AI | cs.CL

Mind the Gap: Can Frontier LLMs Pass a Standardized Office Proficiency Exam?

Tengchao Lv, Dongdong Zhang, Jiayu Ding, Yilin Jia, Yuzhong Zhao, Yupan Huang, Wenshan Wu, Xiangyang Zhou, Shaohan Huang, Nan Yang, Li Dong, Lei Cui, Furu Wei
2026/06/09 22:59 (2 days ago)

The deployment of Large Language Model (LLM) agents for computer automation is accelerating, yet their ability to navigate complex, professional-grade productivity software is largely untested. We argue that Office automation is an ideal environment for benchmarking document-automation capability, as it requires long-horizon planning and reasoning, precise parameter configuration, and multi-application integration. To quantify this capability, we introduce an evaluation based on China's National Computer Rank Examination (NCRE), featuring 200 comprehensive practical-operation tasks across Word, Excel, and PowerPoint. Each task is scored on a 100-point rubric scale using 7,118 machine-gradable criteria, and Score Rate (SR) denotes the mean percentage of rubric points earned across these tasks. We benchmark 7 frontier LLMs and observe stark limitations: single-turn models score a maximum of 36.6%. A stronger agentic system with execution feedback, iterative repair, and broader Office automation access reaches 68.8%, but remains below the 95.5% community-reference score used as a scoring sanity check. Ultimately, our experiments demonstrate that despite recent advancements in code generation, achieving reliable fine-grained Office document automation remains a significant challenge for current code-generating LLM and agent systems.
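
Score Rate, as defined in this abstract, is the mean percentage of rubric points earned across tasks. A minimal sketch, assuming every task has the same 100-point maximum as the abstract states; the function name and the sample scores are hypothetical.

```python
def score_rate(earned_points, max_points=100):
    """Mean percentage of rubric points earned across tasks."""
    if not earned_points:
        raise ValueError("at least one task score is required")
    percentages = [100.0 * p / max_points for p in earned_points]
    return sum(percentages) / len(percentages)

# Hypothetical per-task scores on the 100-point rubric.
print(score_rate([50, 100, 0]))
```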

arXiv cs.SE / arXiv cs.CR / arXiv LLM4Code / Program Repair / arXiv AI Coding 3Y | Dependability and Security | arXiv | approach | Vulnerability detection and software security | Confidentiality, integrity | cs.CR | cs.SE

Context-Based Adversarial Attacks on AI Code Generators: Vulnerability Analysis and Implications

Walther A. Del Orbe, John D. Hastings, Varghese Vaidyan
2026/06/09 22:51 (2 days ago)

AI-powered code generation systems have transformed software development but introduce critical inference-time security vulnerabilities. This research presents a systematic investigation of context-based adversarial attacks, where strategically crafted contextual inputs, including comments, documentation, and variable names, bias large language models toward generating exploitable code. Through 2,800 controlled experiments across CodeT5+, CodeLlama, GPT-3.5-Turbo, and GPT-4, we quantify attack effectiveness and defense mechanisms. Results demonstrate that adversarial conditions increase vulnerability generation 10.7x (from 3.5% to 37.4%), with direct instruction attacks achieving 100% success on GPT-3.5-Turbo. Cross-model transferability reaches 60-100%, indicating systemic architectural vulnerabilities rather than model-specific flaws. Our dual-layer defense framework achieves 89.1% detection rate with 0.3% false positives and 520ms latency, demonstrating practical feasibility for real-time deployment in development environments.

arXiv AI Agents for Software Engineering / arXiv SWE-bench / Defects4J / arXiv AI Coding 3Y | AI for Software Engineering | arXiv | benchmark | approach | Automating SE tasks with LLM and foundation models | cs.AI

Frontier Coding Agents Use Metaprogramming to Adapt to Unfamiliar Programming Languages

Aman Sharma, Sushrut Thorat, Paras Chopra
2026/06/09 22:44 (2 days ago)

LLM-based coding agents are usually evaluated in familiar software settings: mainstream languages, common libraries, and public repositories. These benchmarks remain important, but they can hide how agents behave when the language itself is unfamiliar. We evaluate six contemporary coding agents on four esoteric programming languages using a sequential setup with file editing, local execution, and hidden-test grading. Our protocol exposes capability differences between these agents that mainstream coding and agentic benchmarks such as SWE-Bench Verified and Terminal-Bench 2.0 compress into much narrower bands. We observe that the strongest agents, Claude Opus 4.6 and GPT-5.4 xhigh, often avoid writing the target language directly. On Brainfuck and Befunge-98, they write Python programs that generate target-language code and debug those generators locally. Forbidding this metaprogramming strategy causes large performance drops. Text guidance distilled from this strategy does not materially improve weaker agents. In contrast, Opus-derived Python helper code for building generators, with no solved benchmark programs or hidden-test answers, sharply improves Sonnet 4.6 and GPT-5.4 mini on the same problems, while Haiku 4.5 remains low. More interpreter calls and output tokens improve stronger agents but leave weaker agents near their original performance, indicating that these resources amplify useful strategies rather than create them. Together, these results show that strong coding agents adapt to unfamiliar languages by using tools, feedback, and workspace state to build a working model of the target language. Metaprogramming is the clearest case, but the broader gap is constructing and debugging a strategy that works under the target language's rules.

arXiv LLM4Code / Program Repair / arXiv AI Coding 3Y | AI for Software Engineering | arXiv | benchmark | approach | Automating SE tasks with LLM and foundation models | AI-enabled recommender systems for automated SE | Prompt engineering for SE | cs.CE | cs.AI | cs.LG | physics.comp-ph

A Constrained Natural-Language Interface for Variational Multi-Physics Finite Element Simulations in FEniCS

Nilay Upadhyay, Wesley F. Reinhart
2026/06/09 22:38 (2 days ago)

Large language models can reduce the manual effort required to set up finite element simulations, but they introduce reliability risks when generated solver code lies on the critical path. We present a constrained natural-language interface for multi-physics finite element analysis in which the LLM is limited to front-end tasks: parsing prompts into structured JSON, generating Gmsh code only for non-catalog geometries, and using retry feedback for those stages. It never writes FEniCS solver templates, derives weak forms, or writes the numerical solver core. A deterministic dispatcher maps the validated specification to five human-written FEniCS/UFL templates: linear elasticity, hyperelasticity, elastoplasticity, thermo-mechanical coupling, and phase-field fracture. We validate this deterministic template layer against analytical solutions and published 2D/3D benchmarks. Smooth cases reach sub-percent agreement on adequate meshes, while harder nonlinear cases reach the 2-5 percent range. We also evaluate the LLM-facing front end directly. In a 15-prompt parser benchmark, first-pass valid parses were obtained for 9 cases, and all remaining cases were repaired after retry, giving a final valid parse rate of 100.0 percent, 100.0 percent problem-class accuracy, and 97.1 percent field-extraction accuracy. In a 10-case custom-geometry benchmark routed through the real LLM-to-Gmsh path, first-pass and final success were both 90.0 percent, with one unrecovered invalid-geometry failure. These results show that the parser and constrained prompt/validation design are effective on these benchmarks. As an end-to-end demonstration, the system generates and analyzes a 3D elastoplastic L-bracket with a fillet and bolt hole from one natural-language prompt. The contribution is a measured architecture for natural-language-driven variational simulation, not open-ended autonomous code generation.
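
The division of labour described here (the LLM emits structured JSON, and a deterministic dispatcher picks a human-written template) can be sketched as a validate-then-dispatch step. This is an illustration under assumptions, not the paper's code: the field names, the class identifiers, and the `dispatch` function are hypothetical. Only the five problem classes come from the abstract.

```python
import json

# The five template classes named in the abstract; identifiers are assumed.
ALLOWED_CLASSES = {
    "linear_elasticity",
    "hyperelasticity",
    "elastoplasticity",
    "thermo_mechanical",
    "phase_field_fracture",
}
REQUIRED_FIELDS = ("problem_class", "geometry", "material")  # hypothetical schema

def dispatch(raw_json):
    """Validate an LLM-produced spec and return the template class to run.

    Raises ValueError on any problem, so a caller can feed the message
    back to the LLM as retry feedback.
    """
    try:
        spec = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(spec, dict):
        raise ValueError("spec must be a JSON object")
    missing = [name for name in REQUIRED_FIELDS if name not in spec]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if spec["problem_class"] not in ALLOWED_CLASSES:
        raise ValueError(f"unknown problem class: {spec['problem_class']!r}")
    return spec["problem_class"]

good = '{"problem_class": "hyperelasticity", "geometry": "beam", "material": {"E": 1.0}}'
print(dispatch(good))
```

Since the dispatcher only selects among fixed templates, a malformed or adversarial LLM output can fail validation but cannot alter the solver code.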

arXiv cs.SE / arXiv SWE-bench / Defects4J / arXiv AI Coding 3Y | Dependability and Security | arXiv | benchmark | Vulnerability detection and software security | cs.SE

Early Comparative Evaluation of Transformer Models for Multilingual Software Vulnerability Detection

Fiza Naseer, Javad Khan, Muhammad Yaqoob, Alexios Mylonas
2026/06/09 22:33 (2 days ago)

Software vulnerability detection is increasingly important as modern applications combine multiple programming languages. This paper presents an early comparative evaluation of BERT, RoBERTa, and CodeBERT for binary vulnerability detection across HTML, Python, JavaScript, and PHP using the CVEFixes dataset and language-wise three-fold stratified cross-validation. The results show clear performance differences across languages, indicating that multilingual vulnerability detection requires more language-aware and robust transformer-based modelling strategies.

arXiv cs.CR | Dependability and Security | arXiv | approach | Confidentiality, integrity, privacy | cs.SD | cs.AI | cs.CR | cs.LG

What Do Deepfake Speech Detectors Actually Hear?

Vojtěch Staněk, Veronika Jirmusová, Anton Firc, Kamil Malinka, Jakub Reš, Martin Perešíni
2026/06/09 22:21 (2 days ago)

Deepfake speech detectors often output a single score without explaining why an audio sample is flagged, where in the signal the evidence lies, or what cues drive the decision. We propose an audio-native explainability pipeline using Integrated Gradients on time-aligned self-supervised representations to localize decision evidence over time. We apply the proposed method to three WavLM-based detectors (AASIST, CA-MHFA, SLS) on ASVspoof 5 and manually annotate the highest-attribution regions to provide a semantic meaning of the most important cues. Despite similar performance, the detectors rely on different cues: AASIST emphasizes non-speech/environment cues, CA-MHFA focuses on localized phoneme artifacts, and SLS relies on word boundaries and spectral integrity. We move beyond speculative reasoning and validate our findings by causal masking of the primary detector cues. Observed performance degradation further supports the explained detector semantics.

arXiv cs.CR | Dependability and Security | arXiv | benchmark | approach | Confidentiality, integrity, privacy | cs.SD | cs.AI | cs.CR | cs.LG

Ethical and Technical Limits of Deepfake Speech Datasets

Vojtěch Staněk, Eva Trnovská, Kamil Malinka, Anton Firc
2026/06/09 22:20 (2 days ago)

Claims about the robustness and fairness of deepfake speech detectors are only as credible as the datasets used to train and evaluate those systems. We present a dataset-level audit of the deepfake speech landscape. We compile and analyze 39 deepfake speech datasets, examining key attributes including accessibility, documentation, demographic and language coverage, dataset scale, and the underlying bona fide speech sources. Our audit reveals two important takeaways. Firstly, fairness assessment is largely infeasible because most datasets lack demographic metadata, and only a few contain gender or language labels. This prevents any meaningful subgroup analysis and leaves other demographic attributes unaddressed. Secondly, we identify substantial overlap in underlying bona fide source corpora across datasets, which can undermine cross-dataset evaluation and lead to overstated generalization claims.

arXiv cs.CR | Dependability and Security | arXiv | benchmark | approach | Formal methods and model checking | cs.SD | cs.AI | cs.CR | cs.LG

RAT: Reference-Augmented Training for ASV Anti-Spoofing

Vojtěch Staněk, Anton Firc, Jakub Reš, Kamil Malinka
2026/06/09 22:20 (2 days ago)

We introduce a spoofing countermeasure architecture conditioned on speaker-reference recordings, but observe that it converges to a solution that effectively ignores the reference during inference. Surprisingly, training with a reference channel induces invariance that improves deepfake detection, even when the reference is absent or mismatched during inference. Based on this observation, we propose a Reference-Augmented Training (RAT) strategy. RAT yields improved detection performance compared to single-utterance baselines, even when the reference recording is replaced with a zero vector at inference. Through rigorous analysis, we demonstrate that the optimization process rapidly diminishes the reference contributions, leading to inference largely independent of the reference channel. Using RAT, we achieve state-of-the-art 2.57% EER and 0.074 minDCF on the ASVspoof 5 benchmark with a single detector, surpassing even large ensemble systems.

arXiv cs.CR · AI for Software Engineering · arXiv · benchmark · approach · empirical study · Automating SE tasks with LLM and foundation models · Prompt engineering for SE · Trustworthy AI for SE · cs.CR · arXiv cs.CR

Comparative Analysis of Inference-Time Defense Methods for Multimodal Large Language Models

Bulat Nutfullin, Vladimir Evgrafov, Dmitry Namiot
2026/06/09 22:13 (2 days ago)

Multimodal large language models (MLLMs) now appear in safety-critical applications, but the visual channel leaves them open to adversarial attacks that predominantly text-oriented safety alignment addresses only in part. Retraining a model for each new vulnerability class is usually too expensive to be practical. We report a comparative empirical evaluation of three inference-time defense methods and their combinations, run on eight models from the InternVL and Qwen-VL families across seven safety benchmarks that span four attack classes and total 9,000 evaluation samples. Every figure below comes from the same unified proxy classifier. Five findings emerge from the evaluation. First, within the evaluated models and benchmarks, no single defense dominates across all settings: what works depends on the model's baseline safety and on the attack type. Second, combining defenses directly drives benign-query over-refusal to 97-100% across all eight evaluated models, and SmoothVLM on its own reaches 99.2-100%. Third, a simple safety prompt keeps utility largely intact (0.0-18.2% over-refusal across all eight models, five of them below 7%, although two exceeded 15%) while still yielding moderate safety gains. Fourth, different attack classes expose different weaknesses across the evaluated setup, which is why multi-benchmark evaluation matters. Fifth, in a preliminary whitebox test on two models (n=20), text-level defenses suppressed a PGD visual attack that had succeeded without any defense: the defenses act at the output stage, where gradient optimization has limited direct leverage in the tested configuration. Read together, these results argue for adaptive defense selection rather than a single fixed defense configuration.

arXiv cs.SE / arXiv LLM4Code / Program Repair · AI for Software Engineering · arXiv · approach · survey · Automating SE tasks with LLM and foundation models · AI-enabled recommender systems for automated SE · Prompt engineering for SE · cs.SE · arXiv cs.SE · arXiv LLM4Code / Program Repair

From Quality Properties to Practice: A Guideline and Workflow for Explainability Requirements

Martin Obaidi, Jakob Droste, Hannah Deters, Marc Herrmann, Michel Krahl, Kurt Schneider
2026/06/09 21:56 (2 days ago)

Explainability is increasingly required in AI-enabled software systems to support transparency, user trust, and compliance. Yet, explainability requirements are often written ad hoc, and unguided large language model support can yield vague, inconsistent, or incomplete statements. This paper presents a sequential, guideline-driven workflow for formulating explainability requirements and evaluates its tool-based operationalization. We first elicited candidate quality properties through a structured literature review and developer interviews. We then prioritized these properties in an online survey with practitioners (n=20) and derived a concise guideline of ten core properties with actionable formulation instructions. Next, we operationalized the guideline in a web-based tool that supports an iterative workflow of drafting, property-based checks, and revision. We evaluated the workflow in two complementary studies. In a workshop with requirements engineers (n=6), tool support reduced formulation time by 23.5% on average (Wilcoxon p=0.021). In an independent online study with software developers (n=18), tool-supported and manually written requirements received comparable ratings for implementability and formulation quality, with a descriptive slight preference tendency toward the tool-supported versions. Overall, our results suggest that combining a prioritized quality guideline with lightweight LLM support can reduce formulation effort while producing requirements that are perceived comparably to manually written ones.

arXiv cs.SE / arXiv LLM4Code / Program Repair · AI for Software Engineering · arXiv · approach · empirical study · survey · Automating SE tasks with LLM and foundation models · AI-enabled recommender systems for automated SE · cs.SE · arXiv cs.SE · arXiv LLM4Code / Program Repair

Writing Better Software Explanations: A Guideline-Based Approach

Martin Obaidi, Jean-Carl Kremser, Hannah Deters, Jakob Droste, Marc Herrmann, Kurt Schneider
2026/06/09 21:54 (2 days ago)

As software systems increasingly rely on natural-language explanations to address user-reported explanation needs in requirements communication and support, ensuring that such explanations are consistent, relevant, and well formulated remains a major challenge. Purely automatic large language model (LLM) generation often lacks reliable grounding and controllable output quality. In this paper, we present a guideline-based formulation support tool for software explanations that combines LLM-assisted text generation with an empirically derived quality guideline. The tool structures the writing process into generation, quality checking, and iterative revision, while keeping domain control with developers. We evaluated the approach in a two-phase study consisting of an interview-based developer experiment and a controlled user survey. Six industry practitioners with software development or DevOps experience formulated explanations for real explanation needs in a human-only manual condition and in a human-with-LLM-support condition. In this small-scale evaluation, tool-supported formulation was on average 24.4% faster, although inferential analyses indicated only a trend for efficiency. In a subsequent user study with 17 participants and 204 paired comparisons, tool-supported explanations were rated significantly higher in overall satisfaction than manual explanations (p=0.003, rank-biserial correlation=0.86). Our findings suggest potential efficiency gains and higher perceived formulation quality through guideline-driven LLM assistance. Future work should examine long-term industrial use and integration into existing development workflows.

arXiv cs.SE · AI for Software Engineering · arXiv · approach · empirical study · survey · Sustainable AI for SE · Automating SE tasks with LLM and foundation models · Prompt engineering for SE · cs.SE · cs.AI · cs.HC · arXiv cs.SE

From Perception to Action: Can UI Interventions Foster Sustainable LLM Chatbot

Nitish Patkar, Pooja Rani, Jack Glässer, Simon Lüscher, Martin Kropp
2026/06/09 21:39 (2 days ago)

LLM-powered chatbots are increasingly embedded in everyday workflows, raising sustainability concerns due to their energy use. Most mitigation strategies emphasize model or infrastructure efficiency, while the user-interface (UI) layer remains underexplored despite its potential to shape interaction behavior. We investigate whether sustainability-oriented UI interventions can increase users' energy awareness and encourage more energy-responsible chatbot use without reducing usability. We first conducted a baseline survey with 77 participants to assess awareness and receptiveness to intervention concepts. Guided by prior work on persuasive technology and choice architecture, we implemented a web-based chatbot prototype with a three-mode switch (Energy-efficient, Balanced, Performance), per-response energy feedback, pre-send energy estimates, a usage metrics dashboard, and energy analogies. We then evaluated the prototype in a five-day field study with 11 participants. In the baseline survey, 94.8% of respondents reported at least some awareness of AI energy use, yet 88.3% misestimated actual consumption. Although concern about environmental impact was high, only 39.0% indicated willingness to accept a performance trade-off for lower energy use. In the field study, Energy-efficient mode accounted for 55.8% of logged prompts, while 90.9% self-reported actively choosing Eco-mode when high accuracy was not required. Participants did not reduce prompt length, suggesting mode switching as the primary behavioral mechanism. Sustainability-oriented UI interventions can improve awareness and support more energy-responsible interaction patterns in LLM chatbots. These effects are best interpreted as behavioral and model-based estimates that complement backend efficiency work, and the provided prototype and replication package support further research on energy-aware conversational AI design.

arXiv cs.CR · Dependability and Security · arXiv · approach · Vulnerability detection and software security · cs.CR · cs.CL · arXiv cs.CR

Training LLMs to Enforce Multi-Level Instruction Hierarchies via Gravity-Weighted Direct Preference Optimization

Lena S. Bolliger, Lena A. Jäger
2026/06/09 21:39 (2 days ago)

Production LLMs receive instructions from sources with very different levels of trust, yet attend to every token with uniform architectural privilege. This is the structural vulnerability that enables malicious prompt injections and, more broadly, leaves models without a principled way to resolve conflicts between legitimate but competing instructions. A common training-based response is to teach models an explicit instruction hierarchy; existing approaches, however, formalize hierarchies of only three or four levels, treat all violations as equally severe, and rarely evaluate the full set of pairwise level interactions. We formalize a k-level instruction hierarchy problem and instantiate it for k=5, yielding ten pairwise priority relations that a compliant model must enforce. We then introduce Gravity-Weighted DPO (GW-DPO), a preference-optimization objective whose per-sample offset scales with the structural distance between conflicting levels under a linear or bilateral schedule, the latter weighting severity by both the privilege gap and the privilege of the victim level. Combined with hierarchy-specific delimiter tokens (Chen et al., 2025) and Instructional Segment Embeddings (ISE; Wu et al., 2025), GW-DPO with the bilateral schedule Pareto-improves over standard DPO and the linear variant on Llama-3.1-8B-Instruct, raising macro pairwise priority adherence while keeping over-refusal at half the standard DPO rate. Ablations isolate ISE as a refusal-threshold calibrator and recast five- versus three-level training as a generality-specialization tradeoff.
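The counting argument and the distance-scaled offset described above can be illustrated with a small sketch. It is not the paper's formulation: the level numbering (0 as most privileged), the functions `linear_offset` and `bilateral_offset`, the weight `gamma`, and the reading of the "victim" as the higher-privilege level are all assumptions made for illustration. The loss is the standard DPO logistic loss with a per-sample margin subtracted from the logit.

```python
import math
from itertools import combinations

K = 5  # number of hierarchy levels; 0 = most privileged (assumed numbering)

# Every unordered pair of distinct levels is one priority relation.
pairs = list(combinations(range(K), 2))  # C(5, 2) = 10 pairs

def linear_offset(hi, lo, gamma=0.5):
    """Hypothetical linear schedule: offset grows with the privilege gap."""
    return gamma * (lo - hi)

def bilateral_offset(hi, lo, gamma=0.5):
    """Hypothetical bilateral schedule: gap weighted by victim privilege.

    The victim is taken to be the higher-privilege level `hi`; a more
    privileged victim (smaller index) receives a larger weight.
    """
    victim_weight = (K - hi) / K
    return gamma * (lo - hi) * victim_weight

def offset_dpo_loss(logratio_chosen, logratio_rejected, offset, beta=0.1):
    """DPO logistic loss with a per-sample margin subtracted from the logit."""
    logit = beta * (logratio_chosen - logratio_rejected) - offset
    return -math.log(1.0 / (1.0 + math.exp(-logit)))
```

Under these assumptions, a conflict between levels 0 and 4 receives a larger offset than one between levels 0 and 1, so the same log-ratio gap yields a larger loss and a stronger training signal for the more severe violation.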

arXiv cs.SE · Testing and Analysis · arXiv · approach · Software testing · cs.SE · arXiv cs.SE

Modular2Simple: A Tool for Modular Scenario Creation Based on the OpenSCENARIO Format

Nikolai Khriapov, Mohamed Taha Drif, Renjue Li, Cas Widdershoven
2026/06/09 21:31 (2 days ago)

The rapid advancement of autonomous driving systems (ADS) has introduced significant challenges, particularly in the creation of realistic and complex scenarios for testing and validation. This paper introduces Modular2Simple, a tool designed to address these challenges by simplifying and enhancing the process of creating complex ADS scenarios. Modular2Simple seamlessly integrates with the CARLA simulator and is applicable to any software that supports the OpenSCENARIO format. By leveraging existing simple scenarios in the OpenSCENARIO format, the tool enables developers to create easily customizable modular scenarios through the combination of multiple simple or modular scenarios, significantly simplifying the scenario creation process while maintaining flexibility in scenario design. This approach not only facilitates the development of complex scenarios, reducing both development time and effort, but also promotes scenario reuse and customization, which leads to a significant reduction in code complexity and enhanced efficiency in scenario design and testing compared to traditional scenario development methods.

arXiv cs.SE / arXiv cs.CR · Dependability and Security · arXiv · benchmark · approach · empirical study · Vulnerability detection and software security · Confidentiality · integrity · cs.CR · cs.SE · arXiv cs.SE · arXiv cs.CR

Securing Code Understanding: Detecting Natural Backdoor Vulnerability in Code Language Models

Yuchen Chen, Weisong Sun, Haocheng Huang, Yuan Xiao, Chunrong Fang, Yiran Zhang, Tingting Xu, Zhenpeng Chen, An Guo, Peizhuo Lv, Xiaofang Zhang, Zhenyu Chen, Yang Liu, Baowen Xu
2026/06/09 21:28 (2 days ago)

Code Language Models (CodeLMs) have become integral to software engineering, significantly advancing code intelligence tasks. However, their widespread adoption has raised critical security concerns, particularly regarding susceptibility to backdoor attacks. Recent studies have uncovered naturally occurring backdoors, referred to as natural backdoors, in normally trained deep learning models. Despite posing threats as serious as those introduced through data poisoning, security implications of natural backdoor vulnerabilities in CodeLMs remain poorly understood. In this paper, we conduct a thorough empirical study of natural backdoor vulnerabilities in CodeLMs across various model architectures and code intelligence tasks. Specifically, we examine potential natural backdoor vulnerabilities across 44 scenarios, demonstrating that natural backdoors are prevalent and intrinsic to CodeLMs. We reveal differences between injected and natural backdoor vulnerabilities at both the model and parameter levels. We then analyze the transferability of natural backdoor vulnerabilities from three perspectives: datasets, model architectures, and shared knowledge. We further investigate the causes of natural backdoors from two aspects: training datasets and the model training procedure. We evaluate existing backdoor defense techniques, including pre-training, in-training, and post-training defenses, in mitigating natural backdoors. Finally, we propose ScanNBT, a novel detection method designed to improve comprehensive detection of natural backdoor vulnerabilities in CodeLMs. We aim for our findings to enhance understanding of these vulnerabilities and provide insights for strengthening CodeLM security against backdoor threats.

arXiv AI Coding 3Y · AI for Software Engineering · arXiv · benchmark · approach · Automating SE tasks with LLM and foundation models · cs.CL · cs.AI · arXiv AI Coding 3Y

Attention-Discounted Adaptive Sampler for Masked Diffusion Language Models

Yusuf Sahin, Ahmed Rockey Saikia, Volkan Cevher, Paolo Favaro
2026/06/09 21:17 (2 days ago)

Masked diffusion language models can reduce inference steps by revealing multiple tokens per denoising iteration, but this parallelism is fragile: positions that are individually confident may be unsafe to commit together when their predictions are coupled. Existing training-free samplers such as Top-k, Fast-dLLM, and EB-Sampler mainly control how many tokens to reveal, while often ranking candidates by token-wise scores that ignore interactions within the selected set. We propose ADAS, a training-free reranking rule for parallel masked diffusion decoding. ADAS leaves the base sampler's stopping rule unchanged and modifies only subset construction: it greedily discounts a candidate when it attends strongly to already selected positions whose predictions remain uncertain. Unlike graph-constrained methods that turn attention into hard compatibility constraints, ADAS keeps attention continuous and uses it as a soft marginal penalty. Across LLaDA-8B-Base and Dream-7B-Base on GSM8K, MATH500, HumanEval, and MBPP, plugging ADAS into Top-k, Fast-dLLM, and EB-Sampler improves low-NFE performance at matched denoiser evaluations by 9.11 and 10.46 percentage points on average, respectively, with 3.1% per-forward runtime overhead. These results show that soft attention-discounted reranking is a simple and modular way to improve quality in highly parallel decoding for masked diffusion language models.
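The greedy, soft-penalty subset construction described above might look roughly like the following sketch. The scoring rule (confidence minus `lam` times attention to already selected positions, weighted by their uncertainty `1 - conf`) is an assumption chosen for illustration; the abstract does not give the exact discount, and `adas_rerank`, `lam` and `budget` are hypothetical names. The number of tokens to reveal (`budget`) is taken as given by the base sampler.

```python
def adas_rerank(conf, attn, budget, lam=1.0):
    """Greedily pick `budget` positions with a soft attention discount.

    conf[i]    : token-wise confidence of candidate position i
    attn[i][j] : attention weight from position i to position j
    A candidate is penalized when it attends strongly to positions that
    are already selected but still uncertain (low confidence).
    """
    selected = []
    remaining = set(range(len(conf)))
    while remaining and len(selected) < budget:
        def score(i):
            penalty = sum(attn[i][j] * (1.0 - conf[j]) for j in selected)
            return conf[i] - lam * penalty
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Position 1 is confident on its own but attends strongly to position 0.
conf = [0.70, 0.68, 0.60]
attn = [[0.0, 0.5, 0.5],
        [0.9, 0.0, 0.1],
        [0.05, 0.05, 0.0]]
print(adas_rerank(conf, attn, budget=2))           # discounted selection
print(adas_rerank(conf, attn, budget=2, lam=0.0))  # plain top-2 by confidence
```

With `lam=0.0` the rule reduces to ranking by token-wise confidence, which is the behaviour the abstract attributes to existing samplers.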

arXiv cs.CR · Dependability and Security · arXiv · benchmark · approach · Confidentiality · integrity · privacy · cs.CR · cs.CL · arXiv cs.CR

RedAct: Redacting Agent Capability Traces for Procedural Skill Protection

Shuwen Xu, Zhitao He, Yi R. Fung
2026/06/09 20:57 (2 days ago)

Users rely on execution traces to observe agent behavior, diagnose failures, and ensure accountability. These traces contain rich procedural detail, including tool invocations, intermediate decisions, and error-recovery logic. Yet this detail can expose private procedural skills, allowing downstream methods to recover key formulas, thresholds, and strategies without access to model weights or skill files. To quantify this risk and evaluate protection, we construct CapTraceBench, a benchmark of 75 specialized long-horizon tasks and 154 curated skills across seven domains. We also introduce RedAct (https://github.com/XuShuwenn/RedAct), a protected trace release framework that localizes protected key information, rewrites traces while preserving verifier-critical evidence, and embeds behavioral watermarks for downstream provenance analysis. Across representative trace reuse methods, RedAct reduces normalized skill transfer (NST) from 44.7-67.1% on raw traces to below the no-skill baseline, while preserving audit evidence. Its standalone behavioral watermarks reach 93.6-100.0% true detection with a false alarm rate of at most 1.9%. These results frame public agent traces as security interfaces and show that selective redaction can reduce procedural capability leakage without removing audit evidence.

arXiv cs.CR · Dependability and Security · arXiv · approach · Confidentiality · integrity · privacy · cs.CR · cs.AI · cs.LG · arXiv cs.CR

A Bayesian Network Approach for Enhancing Security-Focused Decision Support Systems

Carolina Fernández-Martínez, Shuaib Siddiqui, Vanesa Daza
2026/06/09 20:35 (2 days ago)

The adoption and integration of heterogeneous stacks in most of today's open-source based networks brings clear benefits like interoperability and availability of advanced features. Yet, on the other hand the increasing number of interconnecting components and moving parts requires maintaining an ever increasing base of interdisciplinary knowledge of different tools in different domains to ensure proper operation. To alleviate such efforts, this work proposes a Decision Support System (DSS) to guide infrastructure operators through the selection of security approaches (e.g. tools) to adopt in their environments. This framework easily captures the end-user high-level requirements on the security triad for different domains and runs inference on the designated models to provide the identified tools (security mechanisms) that better serve such needs. The presented DSS aims at delivering an understandable and extensible framework to accommodate varying requirements and Bayesian Network (BN) models. The architecture and modelling of the system are proposed, aligned with its theoretical framework. Its performance is evaluated in terms of time and prediction accuracy.

arXiv cs.CR · Dependability and Security · arXiv · approach · Confidentiality · integrity · privacy · cs.IT · cs.CR · cs.LG · arXiv cs.CR

Secure Aggregation with Top-K Sparsification in Decentralized Federated Learning

Hengxuan Tang, Jinbao Zhu, Xiaohu Tang
2026/06/09 20:33 (2 days ago)

Secure aggregation is a vital component for mitigating gradient leakage in federated learning, but its communication cost conventionally scales with the gradient dimension. This becomes prohibitive for large models and even more pronounced in decentralized federated learning with limited bandwidth and unreliable nodes. Top-K gradient sparsification is an effective approach to reduce communication by transmitting only a few entries of the full gradient, while maintaining competitive model accuracy. Nevertheless, the top-K entries selected by each user are unpredictable and vary across users, which poses a challenge for efficient sparse secure aggregation. This paper studies information-theoretic secure aggregation with top-K sparsification in decentralized federated learning under user dropouts and user collusion. We propose a communication-efficient sparse secure aggregation scheme that offloads dimension-dependent overhead to an offline phase and protects private gradients using random masks and permutations. Experimental results demonstrate that our scheme preserves accuracy comparable to full-gradient aggregation even with only 1% gradient sparsification, while substantially reducing the communication cost.
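To make the tension between sparsification and masking concrete, here is a toy sketch. It uses classic zero-sum additive masking over integers modulo a prime, which is a swapped-in textbook technique and not the scheme proposed in the paper; it also ignores dropouts, collusion and permutations. The names `top_k_sparsify`, `zero_sum_masks`, `masked_upload`, `aggregate` and the modulus `P` are illustrative assumptions.

```python
import random

P = 2**31 - 1  # modulus for masked arithmetic (illustrative choice)

def top_k_sparsify(grad, k):
    """Keep the k largest-magnitude entries; return {index: value}."""
    idx = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)[:k]
    return {i: grad[i] for i in sorted(idx)}

def zero_sum_masks(n_users, dim, rng):
    """Random masks whose coordinate-wise sum is 0 modulo P."""
    masks = [[rng.randrange(P) for _ in range(dim)] for _ in range(n_users - 1)]
    last = [(-sum(col)) % P for col in zip(*masks)]
    return masks + [last]

def masked_upload(sparse, mask, dim):
    """Dense masked vector: the sparse entry (or 0) plus the mask, mod P."""
    return [(sparse.get(i, 0) + mask[i]) % P for i in range(dim)]

def aggregate(uploads):
    """Sum the uploads; the masks cancel, leaving the sparse-gradient sum."""
    total = [sum(col) % P for col in zip(*uploads)]
    # Map residues in [0, P) back to signed integers.
    return [t - P if t > P // 2 else t for t in total]
```

Note that each upload in this toy version is still a dense vector of length `dim`, because hiding which indices a user selected requires masking every coordinate. That dimension-dependent cost is the overhead the abstract says the proposed scheme moves to an offline phase.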

arXiv AI Agents for Software Engineering / arXiv AI Coding 3Y · AI for Software Engineering · arXiv · approach · Automating SE tasks with LLM and foundation models · cs.AI · arXiv AI Agents for Software Engineering · arXiv AI Coding 3Y

AutoPDE: Reliable Agentic PDE Solving via Explicitly Represented Solver Strategies

Huanshuo Dong, Keyao Zhang, Hong Wang, Zhezheng Hao, Zhiwei Zhuang, Ziyan Liu, Jiacong Wang, Gengyuan Liu, Xin Jin
2026/06/09 20:02 (2 days ago)

Numerical solvers for partial differential equations (PDEs) are core computational tools in science and engineering. Building reliable PDE solvers requires not only executable code, but a numerical solver strategy, a set of decisions about discretization, stabilization, solver configuration, and resolution control, that matches the PDE structure. Recent LLM-based coding agents have begun to reduce the programming burden by generating and debugging solver implementations. However, they typically move directly from a PDE problem to solver code, leaving the solver strategy implicit in implementation details. Feedback from a failed solve is therefore routed back to code edits rather than to the underlying strategy, so numerical decisions remain hard to check before code is generated and hard to revise using numerical evidence when it fails. To address this limitation, we propose AutoPDE, a code agent that maintains the solver strategy as an explicitly represented object throughout the solving process: an independent, inspectable object that is built before any code is written and can be revised, using numerical evidence, whenever a solve fails. AutoPDE builds and maintains this object in three stages, all drawing from a library of reusable PDE-solving skills: PDE analysis identifies the equation type and algebraic structure; numerical method selection chooses a numerical method that matches the analysis result and commits to a discretization, stabilization, and linear solver accordingly; and adaptive tuning runs low-cost pilot solves to calibrate resolution and tolerances under the prescribed accuracy and runtime budget. We evaluate AutoPDE on the PDE Agent Bench, where experimental results show that AutoPDE achieves a pass rate of 54.5%, improving over the strongest baseline by 14.2 percentage points.

arXiv cs.CR · AI for Software Engineering · arXiv · benchmark · approach · Automating SE tasks with LLM and foundation models · Collaborative AI for SE · Prompt engineering for SE · cs.CR · cs.AI · arXiv cs.CR

Toward Secure LLM Agents: Threat Surfaces, Attacks, Defenses, and Evaluation

Yuchen Ling, Shengcheng Yu, Zhenyu Chen, Chunrong Fang
2026/06/09 20:01 (2 days ago)

Large language model (LLM) agents are rapidly moving from conversational interfaces to software components that plan, invoke tools, maintain memory, and act on external environments. This transition changes the nature of security risk. In agentic settings, failures are no longer limited to unsafe text generation. Untrusted content may redirect control flow, misuse tool privileges, corrupt persistent state, leak sensitive information, or trigger harmful external actions. At the same time, research on LLM agent security is expanding quickly but remains fragmented across attack families, defense layers, application domains, and evaluation settings. This paper synthesizes 247 papers through a lifecycle-based, systems-oriented framework that models agent security around the interaction of information flow, delegated authority, and persistent state. We organize the literature around four questions: how LLM agent security should be modeled, which threat surfaces and attack families dominate, what defenses have been proposed and with what tradeoffs, and how security claims are evaluated. We find that prompt injection and tool-mediated control-flow hijacking still dominate the field, while persistent state corruption and multi-agent propagation are becoming central emerging concerns. We further find that current defenses provide useful building blocks but remain weakly compositional, and that existing benchmarks still underrepresent long-horizon, stateful, and deployment-sensitive risks. We argue that secure LLM agents require explicit trust boundaries, principled privilege control, provenance-aware state management, and evaluation practices aligned with realistic operational settings.

arXiv cs.CR · Dependability and Security · arXiv · approach · Vulnerability detection and software security · cs.CR · cs.LG · arXiv cs.CR

MemVenom: Triggered Poisoning of Multimodal Memories in Web Agents

Yv Zhang, Hao Sun, Hao Fang, Kuofeng Gao, Fan Mo, Bin Chen, Shu-Tao Xia, Yaowei Wang
2026/06/09 19:53 (2 days ago)

External memory has become a core component of modern web agents, enabling long-horizon reasoning through the retrieval of past experiences. However, this paradigm introduces a critical vulnerability: malicious content injected into memory can be persistently recalled and repeatedly influence agent behavior. In this work, we identify and systematically study multimodal memory poisoning, an overlooked yet practical attack surface in web-agent systems. We propose MemVenom, a unified black-box attack framework that poisons graph-structured external memory with coordinated text-image evidence. Our method consists of a two-stage design: (1) a trigger-conditioned retrieval attack that ensures high-probability recall of malicious memory, and (2) a post-retrieval attack induction that leverages adversarial perturbations and stealthy OCR injection to override the original user objective. Unlike prior attacks that operate on prompts or text-only memory, our approach enables persistent, reusable, and goal-agnostic attacks without modifying model parameters or re-optimizing malicious tasks. Experiments across multiple web-agent frameworks and vision-language models demonstrate that MemVenom achieves strong end-to-end attack success with minimal impact on benign performance, reaching up to 99.15% on GPT-5-family web agents, while transferring effectively across architectures and model scales.

arXiv cs.SE / arXiv AI Agents for Software Engineering / arXiv AI Coding 3Y · AI for Software Engineering · arXiv · benchmark · approach · Automating SE tasks with LLM and foundation models · cs.SE · arXiv cs.SE · arXiv AI Agents for Software Engineering · arXiv AI Coding 3Y

DeNovoSWE: Scaling Long-Horizon Environments for Generating Entire Repositories from Scratch

Jiale Zhao, Guoxin Chen, Fanzhe Meng, Wayne Xin Zhao, Ruihua Song, Ji-Rong Wen, Kai Jia
2026/06/09 19:37 (2 days ago)

As the capabilities of LLM-based code agents continue to advance, their expected role is expanding beyond localized bug fixing in existing codebases toward architecting and implementing complete software repositories from high-level specifications. However, training agents for such long-horizon software engineering tasks remains difficult due to the scarcity of large-scale, verifiable whole-repository generation data. In this paper, we introduce DeNovoSWE, a large-scale dataset for whole-repository generation. DeNovoSWE comprises 4,818 high-quality instances, where each instance requires generating a complete repository from documentation. Our dataset is automatically constructed through a carefully designed sandboxed agentic workflow, enabling scalable curation without human annotation. The construction follows a "divide and conquer" and critic-repair philosophy. To balance data quality and diversity, we further introduce a difficulty-aware trajectory filtering strategy. Fine-tuning Qwen3-30B-A3B on DeNovoSWE substantially improves long-horizon SWE performance, raising its score on the challenging BeyondSWE-Doc2Repo benchmark from 5.8% to 47.2%.

arXiv cs.CR · Dependability and Security · arXiv · approach · Confidentiality · integrity · privacy · cs.CR · arXiv cs.CR

Fingerprinting All AI Cluster I/O Without Mutually Trusted Processors

Naci Cankaya, Jakub Kryś, Jonathan Ng, Luke Marks, Felix Krückel
2026/06/09 19:33 (2 days ago)

In preparation for potential international agreements on artificial intelligence, the development of verification infrastructure for AI data centres is vital. We propose a method for cryptographically committing all information entering and leaving a data centre: hashes are computed by network taps placed on all the information-carrying wires between the cluster and the outside world, enabling an auditor to retroactively challenge the preimage data to be sent to a privacy-preserving verification facility performing compliance checks. Our goal is to make it infeasible to covertly exfiltrate the results of undisclosed workloads in the cluster through the tapped wires. To this end, we specify the architecture of a "Secure Gateway Device", which handles the erasure of covert channels that post-hoc verification on hashed data cannot address: analogue and timing side-channels, as well as steganography in network protocol headers. The architecture eliminates the need for any processors trusted by both the Prover and the Verifier, leveraging passive optical fibre splitters and coin-flip protocols for random number generation where needed. We expect development costs of a demonstration device to be roughly equivalent to the cost of a small team of engineers for a few months, with a comparatively small bill of materials.
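The hash-then-challenge idea can be sketched in a few lines. This is a minimal illustration assuming SHA-256 and one digest per captured chunk; the abstract does not specify the hash function, the chunking, or the log format, and `commit` and `verify` are hypothetical helper names.

```python
import hashlib
import hmac

def commit(chunk: bytes) -> str:
    """Digest a tap would record for one captured chunk of traffic."""
    return hashlib.sha256(chunk).hexdigest()

def verify(chunk: bytes, digest: str) -> bool:
    """Auditor-side check that a revealed preimage matches a logged digest."""
    return hmac.compare_digest(commit(chunk), digest)

# The tap logs digests as traffic passes; the raw data stays with the prover.
traffic = [b"chunk-0001", b"chunk-0002", b"chunk-0003"]
tap_log = [commit(c) for c in traffic]

# Later, the auditor challenges entry 1 and the prover reveals its preimage.
print(verify(traffic[1], tap_log[1]))     # matching preimage is accepted
print(verify(b"substituted", tap_log[1])) # a different preimage is rejected
```

A bare hash binds the prover to the data but does not hide low-entropy content, and it says nothing about timing or header covert channels, which is the gap the abstract assigns to the gateway device.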

arXiv cs.SE / arXiv AI Agents for Software Engineering / arXiv AI Coding 3Y · AI for Software Engineering · arXiv · approach · empirical study · Automating SE tasks with LLM and foundation models · Sustainable AI for SE · Efficacy measurement beyond traditional metrics · cs.SE · arXiv cs.SE · arXiv AI Agents for Software Engineering · arXiv AI Coding 3Y

Watts and Debts of Agentic Frameworks: An Empirical Study (Registered Report)

Aneetta Sara Shany, Chandrasekar S, Karthik Vaidhyanathan
2026/06/09 19:03 (2 days ago)

Context: Every agentic AI system shipped to production carries two hidden risks: accumulated Technical Debt (TD) and unmonitored runtime energy costs. While functional benchmarking is common, the empirical link between internal structural quality (specifically TD) and dynamic energy consumption during execution remains unexplored, creating a blind spot for practitioners and organizations managing sustainability and operational budgets at scale. Goal: We propose a confirmatory empirical study correlating Self-Admitted Technical Debt (SATD) with hardware-level runtime energy consumption across agentic frameworks, to determine whether code quality can drive energy-aware design decisions. Method: We will evaluate five open-source agentic frameworks by executing a standardized task suite in a strictly controlled environment. SATD will be extracted via automated Python-based comment mining and categorized via LLM-based classification using a fine-tuned prompt, while runtime energy will be measured at the hardware level. Our study will investigate three core research questions: (RQ1) the presence of TD within these frameworks; (RQ2) the variance in runtime energy consumption across architectures; and (RQ3) the statistical correlation between a framework's TD and its task-level energy consumption. Conclusion: The findings will establish whether automated source code analysis can serve as a reliable, early-warning proxy for energy-efficient framework selection, thereby advancing both green software engineering and agentic AI quality research.

arXiv cs.CR · Dependability and Security · arXiv · approach · Formal methods and model checking · cs.CR · cs.LG · arXiv cs.CR

Do LLMs Make Neural Distinguishers Wise?

Tatsuya Sakagami, Masashi Hisai, Naoto Yanai
2026/06/09 18:51 (2 days ago)

Neural distinguishers are a cryptanalysis method for symmetric-key cryptography that trains machine learning models on pairs of plaintexts and ciphertexts with specific differences in order to recover a secret key. To the best of our knowledge, no existing work has explored the use of large language models (LLMs) for neural distinguishers. In this paper, we propose LLM-based neural distinguishers through a prompt design and conduct extensive experiments with them on SPECK-32/64 to investigate whether LLMs can strengthen neural distinguishers. We then found three key insights. First, by comparing the results of LLM-based neural distinguishers with ResNet in the existing work, we demonstrate that LLMs provide no observable improvement in the performance of neural distinguishers. Second, we confirm that, at high rounds, the choice of differences is no longer effective for LLM-based neural distinguishers as well as ResNet. Third, we show that the performance of LLM-based neural distinguishers can be significantly improved by incorporating only the XOR operation results as a prompt design.
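For readers unfamiliar with the setup, the sketch below generates the kind of labelled data a neural distinguisher is trained on: ciphertext pairs from reduced-round SPECK-32/64, where label 1 means the plaintexts had a fixed XOR difference and label 0 means they were unrelated. The round count, the input difference `(0x0040, 0x0000)` and the helper `make_sample` are illustrative assumptions, not details taken from the abstract, and nothing here reflects the paper's prompt design.

```python
import random

MASK = 0xFFFF          # SPECK-32/64 works on 16-bit words
ALPHA, BETA = 7, 2     # rotation amounts for the 32-bit block variant

def rol(x, r):
    return ((x << r) | (x >> (16 - r))) & MASK

def ror(x, r):
    return ((x >> r) | (x << (16 - r))) & MASK

def enc_round(x, y, k):
    """One SPECK round: rotate, add modulo 2^16, XOR the round key."""
    x = ((ror(x, ALPHA) + y) & MASK) ^ k
    y = rol(y, BETA) ^ x
    return x, y

def expand_key(key, rounds):
    """Key schedule; `key` is four 16-bit words (l2, l1, l0, k0)."""
    ks = [key[3]]
    l = [key[2], key[1], key[0]]
    for i in range(rounds - 1):
        l[i % 3], nxt = enc_round(l[i % 3], ks[i], i)
        ks.append(nxt)
    return ks

def encrypt(pt, ks):
    x, y = pt
    for k in ks:
        x, y = enc_round(x, y, k)
    return x, y

def make_sample(rng, rounds=5, diff=(0x0040, 0x0000)):
    """Return (c0, c1, label); label 1 = fixed-difference pair, 0 = random."""
    key = tuple(rng.randrange(1 << 16) for _ in range(4))
    ks = expand_key(key, rounds)
    p0 = (rng.randrange(1 << 16), rng.randrange(1 << 16))
    label = rng.randrange(2)
    if label:
        p1 = (p0[0] ^ diff[0], p0[1] ^ diff[1])
    else:
        p1 = (rng.randrange(1 << 16), rng.randrange(1 << 16))
    return encrypt(p0, ks), encrypt(p1, ks), label
```

A distinguisher, whether a ResNet or an LLM given the pair in a prompt, is then asked to predict the label from `(c0, c1)` alone; accuracy above 50% indicates that the reduced-round cipher leaks the difference.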

arXiv cs.CR | Dependability and Security | arXiv | approach | Vulnerability detection and software security | cs.LG | cs.AI | cs.CR | arXiv cs.CR

In Defense of Information Leakage in Concept-based Models

Mateo Espinosa Zarlenga
2026/06/09 18:19 (2 days ago)

Concept-based models (CMs), deep neural networks that ground their predictions on representations aligned with human-understandable concepts (e.g., "round", "stripes", etc.), have been shown to learn representations that leak concept-irrelevant information. As the traditional narrative goes, this leakage is undesirable and should be eradicated as it leads to uninterpretable models. In this paper, we posit that this conventional view of leakage in CMs is not only ill-posed, as the evidence of how leakage makes a model less interpretable is often inconclusive, but also bound to lead to impractical CMs under common real-world constraints. Specifically, we argue that in real-world settings where concept incompleteness is the norm, some leakage is often necessary for constructing accurate and intervenable CMs. To this end, we propose that there is such a thing as benign leakage and show that, by optimizing a reframing of the typical CM training objective, CMs can encourage and exploit this form of leakage without sacrificing accuracy or intervenability.

arXiv LLM4Code / Program Repair / arXiv AI Agents for Software Engineering / arXiv SWE-bench / Defects4J / arXiv AI Coding 3Y | AI for Software Engineering | arXiv | benchmark | approach | Automating SE tasks with LLM and foundation models | Collaborative AI for SE | cs.MA | cs.AI | arXiv LLM4Code / Program Repair | arXiv AI Agents for Software Engineering

Decentralized Multi-Agent Systems with Shared Context

Yuzhen Mao, Azalia Mirhoseini
2026/06/09 18:13 (2 days ago)

Multi-agent systems (MAS) can scale large language model reasoning at test time by decomposing complex problems into parallel subtasks. However, most existing MAS rely on centralized orchestration, where a main agent assigns work, collects outputs, and merges results. As the number of subtasks grows, this controller becomes a communication and integration bottleneck. We propose Decentralized Language Models (DeLM), a MAS framework that decentralizes coordination through parallel agents, a shared verified context, and a task queue. Agents asynchronously claim subtasks, read accumulated progress, perform local reasoning, and write back compact verified updates. The shared context acts as a common communication substrate, enabling agents to build on one another's verified progress without routing every update through a central controller. Empirically, DeLM improves both software-engineering test-time scaling and long-context reasoning. On SWE-bench Verified, DeLM achieves the best performance across Avg.@1, Pass@2, and Pass@4, with gains of up to 10.5 percentage points over the strongest baseline, while reducing cost per task by roughly 50%. On LongBench-v2 Multi-Doc QA, DeLM achieves the highest average accuracy across four frontier model families, improving over the strongest baseline by up to 5.7 percentage points. The code is available on our project website at https://yuzhenmao.github.io/DeLM/.
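
The claim, read, reason, write-back loop can be sketched with a plain task queue and a lock-protected shared context. This is only a toy model of the coordination pattern: `run_decentralized`, `solve`, and `verify` are hypothetical names, and threads stand in for LLM agents.

```python
import queue
import threading


def run_decentralized(subtasks, solve, verify, n_agents=4):
    """Run agents that claim subtasks and publish verified updates."""
    tasks = queue.Queue()
    for t in subtasks:
        tasks.put(t)

    shared_context = {}        # subtask -> verified update
    lock = threading.Lock()

    def agent():
        while True:
            try:
                task = tasks.get_nowait()          # claim a subtask
            except queue.Empty:
                return                              # nothing left to claim
            with lock:
                snapshot = dict(shared_context)    # read accumulated progress
            result = solve(task, snapshot)         # local reasoning
            if verify(task, result):               # only verified updates land
                with lock:
                    shared_context[task] = result

    threads = [threading.Thread(target=agent) for _ in range(n_agents)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return shared_context


# Toy usage: "solving" squares the task id; only even results pass verification.
context = run_decentralized(
    range(6),
    solve=lambda task, ctx: task * task,
    verify=lambda task, result: result % 2 == 0,
)
print(sorted(context.items()))
```

No agent waits on a central controller here: the queue hands out work and the shared dictionary carries progress, which is the property the abstract attributes to the shared verified context.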

arXiv cs.CR | Dependability and Security | arXiv | approach | Confidentiality, integrity, privacy | cs.CR | cs.AI | cs.CE | q-fin.CP

Post-Quantum Secure Federated DeFi for Inclusive Banking

Swati Sachan, Dale Fickett, Richard Buchinger, Theo Miller
2026/06/09 18:06 (2 days ago)

Recent advances in error-corrected qubits have accelerated the timeline for practical quantum computing. This poses a threat to the cryptographic primitives used to secure financial systems, government infrastructure, communication networks, and DeFi (Decentralized Finance) ecosystems. This paper introduces a post-quantum secure federated DeFi framework that enables inter-bank collaboration to improve the inclusivity of individuals underserved by local lenders due to limited financial histories. Multiple banks contribute encrypted information batches to a virtual server, where lattice-based Fully Homomorphic Encryption (FHE) enables end-to-end homomorphic computation. The server fuses local data-driven probabilistic assessments, expert beliefs, and verifiable evidence generated by the NASA-IBM Prithvi Geospatial Foundation Model (GFM), all in encrypted form. Decentralized technologies are employed to ensure tamper-proof evidence and auditable accountability for all encrypted data exchanges between institutions and the server. The framework is tested on agricultural lending decisions for rural borrowers in Virginia.

arXiv cs.CR | Dependability and Security | arXiv | approach | Confidentiality, integrity, privacy | cs.CR | cs.FL | arXiv cs.CR

Layer Order Semantics for Automata-Based Cybersecurity

Faruk Alpay, Taylan Alpay
2026/06/09 17:57 (2 days ago)

Layered cybersecurity pipelines transform evidence before they decide on it, and the order of those transformations determines which security facts become visible to which layer. This paper gives layer order a finite-state semantics built from a layer-order automaton, deterministic sequential security transducers, evidence markers, and a final decision automaton. The worked case is HTTP request desynchronization: front-end and back-end processors compute incompatible request boundaries, and the same trace is detected or missed according to whether framing evidence reaches the parser-differential layer before it commits. The results separate completed-trace recognition, online editing, decision synthesis, and faithful enforcement; characterize faithful online enforcement as the regular prefix-closed case under causal visibility; and show that regular policies beyond that boundary remain recognizable without becoming deployable enforcers. The framework is monolithically equivalent to finite-output deterministic edit automata, while preserving layer-local invariants such as marker birth, marker survival, and reorder-sensitive visibility. A concrete parser-pair semantics identifies the forbidden marker factor with CL.TE, TE.CL, TE.TE, and HTTP/2-downgrade boundary disagreement under the stated abstraction, and a contextual reorder congruence classifies which component permutations induce the same decision language. The result is an automata-theoretic account of order-sensitive security failures and a compositional vocabulary for auditing, synthesizing, and comparing layered enforcement pipelines.
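
The order sensitivity can be illustrated with a deliberately simplified two-layer pipeline. This is not the paper's automaton construction: the evidence marker is reduced to "both `Content-Length` and `Transfer-Encoding` are present", and the layer names `detect` and `normalize` are hypothetical.

```python
def framing_marker(request):
    """Evidence marker: Content-Length and Transfer-Encoding both present."""
    names = {name.lower() for name in request["headers"]}
    return {"content-length", "transfer-encoding"} <= names


def detect(request):
    """Parser-differential layer: flags the request if the marker is visible."""
    flagged = request.get("flagged", False) or framing_marker(request)
    return {**request, "flagged": flagged}


def normalize(request):
    """Editing layer: drops Transfer-Encoding, which erases the marker."""
    headers = {
        name: value
        for name, value in request["headers"].items()
        if name.lower() != "transfer-encoding"
    }
    return {**request, "headers": headers}


def run_pipeline(layers, request):
    for layer in layers:
        request = layer(request)
    return request.get("flagged", False)


ambiguous = {"headers": {"Content-Length": "6", "Transfer-Encoding": "chunked"}}
print(run_pipeline([detect, normalize], ambiguous))  # detector sees the marker
print(run_pipeline([normalize, detect], ambiguous))  # marker removed first
```

The same trace is caught or missed purely as a function of layer order, which is the phenomenon the finite-state semantics is meant to capture.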

arXiv cs.CR | Dependability and Security | arXiv | approach | Confidentiality, integrity, privacy | econ.GN | cs.CR | arXiv cs.CR

From Transactions to Records: Reconceptualizing Blockchain Systems through a Lifecycle Lens

Tom Barbereau, Ruggero Montalto, Christian Beyer
2026/06/09 17:34 (2 days ago)

Current blockchain research and analytics tend to prioritize observable on-chain transactions, obscuring the processes through which cryptocurrencies are created, publicised, retained, and disposed of. In response, this paper considers distributed ledger technologies from records management principles in ISO 15489-1:2016. Setting off by specifying the parallels -- that is transactions as "records", crypto-asset units as "information assets", and blockchains as "aggregations" -- we introduce a seven-stage lifecycle for blockchain data. We apply the framework to Bitcoin, a fungible token, and a non-fungible token. On this basis, we argue that blockchain systems are not merely transactional infrastructures but record management systems with distinctive characteristics. We discuss how the on-chain/off-chain boundary and privacy-enhancing technologies can complicate lifecycle visibility, with particular relevance for crypto-crime research and investigation. As a meta-level framework, the lifecycle perspective enables positioning existing research, decomposing legal, regulatory, technological, and operational challenges by stage, and informing lifecycle-aware approaches to blockchain governance, analytics, and regulation.

arXiv cs.CR | Dependability and Security | arXiv | approach | empirical study | Confidentiality, integrity, privacy | cs.CR | arXiv cs.CR

snaproot: Decentralized File Integrity Verification Using Blockchain-Anchored Cryptographic Hashing

Arslan Brömme, Tarkan Yavas
2026/06/09 17:25 (2 days ago)

The rapid growth of digital content has made reliable integrity verification increasingly important. Existing solutions rely either on centralized authorities, which introduce trust dependencies and single points of failure, or on decentralized storage systems that incur prohibitive resource overhead. In this paper, we present snaproot, a lightweight system that implements the hash-anchoring paradigm of Haber and Stornetta on the Solana blockchain to provide efficient, decentralized file integrity verification. snaproot generates a SHA-256 hash of a file and stores it immutably on-chain as a permanent reference record. Verification is performed by recomputing the hash and comparing it to the stored value, yielding a deterministic binary outcome. We describe a four-tier trust architecture comprising three realized tiers and one prospective tier for long-term persistence beyond the lifetime of any single blockchain. We present a formal threat model, a security analysis grounded in the second-preimage resistance of SHA-256, and an empirical evaluation on Solana Devnet across file sizes from 1 KB to 500 MB. A central conceptual contribution is the explicit separation between existence proof, the key-independent claim that a file existed at a given time, and authorship proof, the key-dependent binding between a record and a specific wallet identity. This separation allows existence guarantees to survive key loss while preserving stronger authorship claims where keys are retained. We position snaproot against OpenTimestamps, OriginStamp, and Chainpoint and discuss limitations with respect to pre-registration manipulation and AI-generated content.
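
The hash-then-compare flow can be sketched in a few lines. The `InMemoryLedger` class below is a hypothetical stand-in for the on-chain record store; no Solana API is used or implied, and the timestamp value is arbitrary.

```python
import hashlib
import io


def digest_stream(stream, chunk_size=1 << 20):
    """SHA-256 hex digest of a binary stream, read in chunks."""
    h = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        h.update(chunk)
    return h.hexdigest()


class InMemoryLedger:
    """Hypothetical stand-in for the on-chain reference records."""

    def __init__(self):
        self._records = {}

    def anchor(self, digest, timestamp):
        # First write wins, mimicking an immutable reference record.
        self._records.setdefault(digest, timestamp)

    def lookup(self, digest):
        return self._records.get(digest)


def verify(stream, ledger):
    """Binary outcome: True only if the recomputed digest was anchored."""
    return ledger.lookup(digest_stream(stream)) is not None


ledger = InMemoryLedger()
ledger.anchor(digest_stream(io.BytesIO(b"report v1")), timestamp=1_780_000_000)
print(verify(io.BytesIO(b"report v1"), ledger))  # unchanged content
print(verify(io.BytesIO(b"report v2"), ledger))  # modified content
```

Because the record is keyed by the digest alone, this check corresponds to the existence proof described above; binding a record to a wallet identity would be a separate, key-dependent step.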

arXiv cs.CR | Dependability and Security | arXiv | benchmark | approach | Confidentiality, integrity, privacy | cs.CR | arXiv cs.CR

Two-Way Confidential VMs (2cVM): Collaborative Confidential Computing for Mutually Distrustful Parties

Jordi Thijsman, Merlijn Sebrechts, Stefan Lefever, Filip De Turck, Bruno Volckaert
2026/06/09 17:15 (2 days ago)

Collaborative computation across organizations is often constrained by the need to process sensitive data and proprietary code without exposing them to untrusted infrastructure or participants. Cryptographic approaches such as fully homomorphic encryption and secure multi-party computation provide strong confidentiality but remain impractical for general workloads due to their extreme computational cost. We present the Two-Way Confidential Virtual Machine (2cVM), a two-layer architecture that pairs a hardware trusted execution environment with an intra-workload isolation layer. Unlike regular Confidential Virtual Machines, 2cVM enforces mutual isolation between co-resident workloads, ensuring that participants retain control over their data and code. All computation in 2cVM is governed by a Commitment Manifest that enumerates participants, component composition, permitted data channels, and authorized outputs; the manifest is locked to the VM and incorporated into attestation evidence, making the policy immutable and independently verifiable throughout the VM's lifetime. A proof-of-concept realization combines AMD SEV-SNP for hardware protection with the WebAssembly Component Model for fine-grained sandboxing of participant code. Evaluation on commodity hardware across four benchmark classes shows that the two isolation layers do not accumulate linearly: once a workload executes inside the WebAssembly sandbox, the marginal cost of enabling hardware memory protection is small. Overhead is workload-dependent, governed primarily by memory access pattern, ranging from negligible for sequential workloads to approximately 2x for irregular, pointer-chasing access patterns. These results indicate that 2cVM provides a practical and verifiable foundation for privacy-preserving collaborative computation.

arXiv cs.SE | Requirements and Modeling | arXiv | approach | empirical study | Requirements traceability and dependencies | Requirements engineering | cs.SE | arXiv cs.SE

Exploring and Complementing End Users' Requirements in IoT enabled System

Haotian Li, Xiaohong Chen, Zhi Jin, Shuyuan Xiao, Chenxu Wang, Haoxiang Yan, Xiaoyi Chen
2026/06/09 17:01 (2 days ago)

End users create IoT automation rules via trigger action programming, but their expressions are often fragmented, capturing device operations rather than high level intents. This gap leads to missing conditions, logical conflicts, and overlooked safety constraints, risking hazardous behaviors. To address this, we propose an intent driven requirements completion approach that reframes rule completion as a dual process: reconstructing intent from fragmented rules, then regenerating rules from that intent, with safety embedded throughout. We introduce a Bidirectional Requirements Traceability Tree, a three layer model linking rules, intents, and quality concerns, and design a multiagent framework that combines LLM reasoning with structured traceability. This enables completions that are both functionally complete and inherently safe, while remaining traceable and explainable. Evaluation shows our method significantly outperforms the baselines, improving the rule completion rate by 43% and reducing logical conflicts by over 21%. By grounding completion in intent understanding, we shift the paradigm from user to system responsibility, and from functional correctness to holistic trustworthiness.

arXiv cs.CR | Dependability and Security | arXiv | approach | empirical study | survey | Confidentiality, integrity, privacy | cs.CR | cs.AI | arXiv cs.CR

From Data Heterogeneity to Convergence: A Data-Centric Review of Federated Learning

Huong Nguyen, Mickaël Bettinelli, Amirhossein Ghaffari, Alexandre Benoit, Hong-Tri Nguyen, Susanna Pirttikangas, Lauri Lovén
2026/06/09 17:00 (2 days ago)

Federated Learning (FL) has emerged as a promising solution for data hunger in centralized learning. This paradigm preserves privacy by letting multiple clients collaboratively train a shared-task model without exposing their local data. While being a key component in any learning system, data is also a primary source of vulnerabilities and challenges, and a major determinant of stable, well-converged training. Existing FL reviews describe general foundations, security practices, opportunities, challenges, and applications, without delving into diverse aspects of data or considering problems from the data perspective. They rarely provide a data-lens synthesis that links concrete data properties, split protocols, and defenses to convergence speed and stability. This survey fills that gap with three advances. First, we decompose non-IID data into measurable traits and rank their influence on convergence as strong, medium, or light, explaining the mechanisms behind each and reconciling evidence across images, texts, and graphs. Second, we connect experimental splitting practices to the real phenomena they emulate, expose the artifacts they introduce, and show how those artifacts affect target accuracy. Third, we analyze how data-related vulnerabilities and their proposed defenses affect convergence, reporting performance under clean and adversarial conditions to make the convergence-robustness trade-off explicit. To our knowledge, this is the first survey to provide a complete understanding of data-related challenges that govern FL. With clear takeaways distilled for each concern, our work serves as actionable guidance, helping practitioners design their system with predictable convergence and stability.

arXiv cs.CR | AI for Software Engineering | arXiv | benchmark | approach | Automating SE tasks with LLM and foundation models | Trustworthy AI for SE | cs.CV | cs.AI | cs.CR | arXiv cs.CR

Improving Adversarial Transferability on Vision-Language Pre-training Models via Surrogate-Specific Bias Correction

Lijia Yu, Jiuxin Cao, Yuchen Qiang, Changhao Chen, Yifei Huang, Bo Liu
2026/06/09 16:34 (2 days ago)

Adversarial examples reveal vulnerabilities in Vision-Language Pre-training (VLP) models and provide insights for improving robustness. A key property is cross-model transferability, which enables transfer-based black-box attacks. However, existing attacks often rely heavily on the surrogate model, causing cross-model performance drops. One reason is that adversarial optimization may follow surrogate model responses more than input semantics, making the update direction effective on the surrogate but less transferable to unseen targets. We refer to this dependency as surrogate-specific bias. Motivated by this observation, DeBias-Attack improves transferability by correcting surrogate-specific bias in adversarial optimization directions. It maintains two perturbation branches. The main branch optimizes a perturbation on the original image and obtains the adversarial gradient used to disrupt image-text alignment. The reference branch optimizes a perturbation on a weak-semantic image constructed from the dataset mean image with small Gaussian noise resampled at each iteration. Since this weak-semantic image contains little clear visual content, its optimization reflects surrogate responses more than image semantics, and its reference gradient estimates surrogate-specific bias. DeBias-Attack removes the aligned projection of the main gradient on the reference gradient before updating the adversarial image, then performs context-aware text substitution using the updated adversarial image. DeBias-Attack is the first transfer-based VLP attack that corrects surrogate-specific bias through gradient correction. Experiments show strong performance across VLP models, downstream tasks, and open-source and closed-source multimodal large language models.
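
The gradient-correction step can be written as a vector projection. The sketch below assumes that "aligned projection" means the component of the main gradient that points in the same direction as the reference gradient (positive inner product), so a main gradient that opposes the reference is left unchanged. That reading, the function name, and the use of flat Python lists are assumptions, not details confirmed by the abstract.

```python
def remove_aligned_projection(main_grad, ref_grad, eps=1e-12):
    """Subtract from main_grad its positively aligned component along ref_grad."""
    dot = sum(g * r for g, r in zip(main_grad, ref_grad))
    ref_sq = sum(r * r for r in ref_grad)
    # Assumption: only a positively aligned component counts as bias.
    if dot <= 0 or ref_sq < eps:
        return list(main_grad)
    scale = dot / ref_sq
    return [g - scale * r for g, r in zip(main_grad, ref_grad)]


main = [1.0, 1.0]
reference = [1.0, 0.0]          # stand-in for the surrogate-specific direction
corrected = remove_aligned_projection(main, reference)
print(corrected)
```

After the correction the update direction is orthogonal to the reference gradient, so whatever the surrogate would respond to regardless of image content no longer steers the perturbation.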

arXiv cs.CR | Dependability and Security | arXiv | approach | Confidentiality, integrity, privacy | cs.CR | cs.AR | cs.DC | arXiv cs.CR

A Hybrid Edge-Cloud Architecture for Low-Latency Entitlement Verification in Resource-Constrained Devices

Pravin Nagare, Aditya Sabbineni, Devendra Dahiphale, Faiz Gouri, Pratik Thantharate
2026/06/09 16:04 (2 days ago)

As digital media consumption shifts toward large-scale Over-the-Top (OTT) platforms, the efficiency of the control plane, specifically entitlement and identity verification, has become a critical factor in user experience. Current architectures often rely on synchronous cloud-tethered validation flows that introduce significant latency, especially on resource-constrained consumer electronics. This paper proposes a Hybrid Edge-Cloud Entitlement Framework designed to minimize user-perceived friction. By implementing a secure, local caching layer within device middleware and utilizing an Adaptive Entitlement Cache with Proactive Refresh (AEC-PR) algorithm, we decouple the user interaction from backend network variability. We evaluate the performance on ARM Cortex-A series hardware, demonstrating that localized cryptographic verification reduces authorization latency from a mean of 422.8ms to 18.4ms (a 95.6% reduction) while mitigating implementation-level side-channel risks through deterministic Ed25519 arithmetic and TEE isolation.
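
The abstract does not spell out AEC-PR, so the following is only a generic sketch of a local entitlement cache with proactive refresh: entries are served locally, queued for background refresh once they pass a fraction of their lifetime, and fetched synchronously only on a miss or after expiry. The class name, the `refresh_fraction` parameter, and the injected `clock` are all assumptions; the adaptive tuning and the Ed25519/TEE parts are omitted.

```python
import time


class EntitlementCache:
    """Local cache that refreshes entries before they expire (sketch)."""

    def __init__(self, fetch, ttl=300.0, refresh_fraction=0.8, clock=time.monotonic):
        self._fetch = fetch                      # slow backend lookup
        self._ttl = ttl
        self._refresh_after = ttl * refresh_fraction
        self._clock = clock
        self._entries = {}                       # key -> (value, fetched_at)
        self.pending_refresh = set()

    def _store(self, key):
        value = self._fetch(key)
        self._entries[key] = (value, self._clock())
        return value

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return self._store(key)              # cold miss: blocking fetch
        value, fetched_at = entry
        age = self._clock() - fetched_at
        if age >= self._ttl:
            return self._store(key)              # expired: blocking fetch
        if age >= self._refresh_after:
            self.pending_refresh.add(key)        # still valid, refresh soon
        return value

    def refresh_pending(self):
        """Intended to run off the user-facing path, e.g. from a timer."""
        while self.pending_refresh:
            self._store(self.pending_refresh.pop())


cache = EntitlementCache(fetch=lambda user: {"user": user, "plan": "premium"})
print(cache.get("alice"))
```

The point of the pattern is that a user interaction only pays backend latency on a cold miss or a fully expired entry; otherwise the refresh happens out of band.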

arXiv cs.CR | Dependability and Security | arXiv | approach | empirical study | Reliability, availability, and safety | cs.CR | cs.AI | arXiv cs.CR

Assessing Automated Prompt Injection Attacks in Agentic Environments

David Hofer, Edoardo Debenedetti, Florian Tramèr
2026/06/09 15:54 (2 days ago)

Indirect prompt injection poses a critical threat to LLM agents that interact with untrusted external data, yet automated attack methods, although proven effective for jailbreaking, remain underexplored in realistic agentic settings. We present a comprehensive empirical evaluation of automated prompt injection attacks against LLM agents, adapting both white-box (GCG) and black-box (TAP) methods to the agentic setting within the AgentDojo framework. We evaluate across 80 task pairs spanning four domains and multiple models, and find that black-box optimization substantially outperforms gradient-based methods, a gap we attribute to GCG's optimization instability under reasonable compute budgets. We also find that TAP's effectiveness depends on the attacker model, as both general capability and safety tuning affect attack success: stronger models produce more effective injections, while safety-tuned attackers can refuse to generate adversarial prompts. Task-universal attacks transfer effectively to unseen tasks and out-of-distribution domains, but attacks optimized on smaller open-source models do not transfer to frontier models like GPT-5. These findings highlight automated prompt injection as a credible but model-dependent threat, with significant barriers remaining for model-agnostic exploitation.

arXiv cs.CR | Dependability and Security | arXiv | benchmark | approach | Confidentiality, integrity, privacy | cs.CR | cs.NI | arXiv cs.CR

A Deployment-Oriented Framework for Explainable AI-Assisted eBPF/XDP Mitigation at the IoT Edge

Abdurrahman Tolay
2026/06/09 15:36 (2 days ago)

Internet of Things (IoT) deployments combine heterogeneous, resource-constrained devices with weak security configurations, exposed services, limited logging, patching constraints, and long lifecycles. Signature- and threshold-based controls remain useful baselines, but they are insufficient as standalone mechanisms in dynamic IoT networks. Likewise, offline artificial intelligence (AI) benchmark performance alone does not establish operational deployability. This article presents a conceptual framework and research agenda for a Linux-based IoT edge gateway that combines resource-aware flow-level AI-assisted risk scoring, event-level explainability, and bounded mitigation through eBPF/XDP. The controller applies reversible, time-limited actions subject to critical-device safeguards, updates packet-level enforcement state, and records structured logs. The architecture separates complex reasoning and policy control in user space from concise packet-handling decisions in the kernel. It also defines a future hardware-aware evaluation pathway covering detection quality, resource cost, response timing, rollback behaviour, and legitimate-traffic preservation. The paper does not report new experimental meas

arXiv cs.CR | Dependability and Security | arXiv | approach | Confidentiality, integrity, privacy | cs.CR | arXiv cs.CR

When VR Meets BCI: (Un)Observable Brainwave-aware Privacy Reconstruction in the Metaverse via Unrestricted Inbuilt Motion Sensors

Tao Ni, Zehua Sun, Qingchuan Zhao, Wei-Bin Lee, Cong Wang
2026/06/09 15:28 (2 days ago)

Metaverse devices, such as virtual reality (VR), have seen substantial development and widespread applications in numerous areas. Although recent studies have revealed privacy leakages in VR, these vulnerabilities were limited in the scope of observable behaviors in virtual scenes (e.g., what a user is seeing). In this work, we uncover the feasibility of going beyond the scope of observable user behaviors to unobservable brain EEG-correlated representations (e.g., what a user is perceiving) by leveraging unrestricted motion sensors in VR headsets to reconstruct brain EEG signals, a seemingly neglected but promising vector. The insight is that the inbuilt motion sensors (e.g., accelerometers) in the VR headset can capture subtle vibrations induced by pupillary responses, which are highly correlated with users' visual stimuli and in-brain perceptions. Therefore, we design and implement BraVeSpy to systematically investigate and demonstrate the feasibility of this severe privacy leakage originating from brain EEG-correlated representations reconstructed from variations of inbuilt motion sensors. Our extensive evaluation results from different VR devices show that BraVeSpy, for the first time in the Metaverse, can reveal unobservable privacy, where we successfully unveiled perceptive images in the brain with 52.0%-67.2% accuracy. In particular, we also find that BraVeSpy outperforms the current approaches that are limited to coarse-grained inference of observable behaviors and achieves over 85.0% accuracy in inferring user activity-related sensitive information, such as fingerprinting websites, apps, and streaming videos, and over 96.0% accuracy in user de-anonymization, gaze movement tracking, and virtual keystroke inference.

arXiv cs.CR | Dependability and Security | arXiv | approach | survey | Confidentiality, integrity, privacy | cs.CR | arXiv cs.CR

AgentCanary: A Security Evaluation Framework for Autonomous AI Agents in Real Executable Environments

Peiyang Li, Songping Wang, Yi Huang, Yanhua Shi, Chenhao Zhang, Qi Li, Yueming Lyu, Caifeng Shan, Fengting Li, Chao Feng, Chuanqun Zhu, Liang Chen
2026/06/09 14:55 (2 days ago)

Autonomous AI agents have driven the transition from conversation to task execution, shifting security failures from textual deception to system compromise. Although security evaluation is crucial for proactive risk prevention, prior work is constrained by fundamental bottlenecks, including fragmented risk coverage, static or low-fidelity execution environments, and single-dimensional and coarse-grained assessment metrics. To address these challenges, we propose AgentCanary, a comprehensive security evaluation framework for autonomous AI agents. AgentCanary provides a systematic solution along three contributions. First, comprehensive risk coverage: we introduce an orthogonal Entry $\times$ Impact risk taxonomy that decouples how adversarial influence enters the agent from what harm it ultimately causes, and instantiate it as a scenario-aligned task suite spanning realistic deployment workflows. Second, a high-fidelity real executable environment: rather than static Q&A or mocked tool responses, agents interact with real tools against dynamically provisioned task artifacts, with persistent state across multi-step interactions that naturally supports long-horizon attack evaluation. Third, trajectory-grounded multi-dimensional evaluation: evaluation consumes the full agent trajectory rather than the reply text or a single tool call, enabling decomposed scoring along three orthogonal dimensions, Outcome Safety, Security Awareness, and Task Utility. We evaluate a broad set of frontier models on AgentCanary against multiple established adversarial attack methods across three agent frameworks. The results reveal that current agents often fail to recognize the attacks they face, particularly under compromised skills, persistent state, and long-horizon execution attacks, and provide a systematic baseline for developing more reliable and secure agent systems.

arXiv cs.CR | Dependability and Security | arXiv | approach | empirical study | Confidentiality, integrity, privacy | cs.LG | cs.AI | cs.CL | cs.CR

Advancing the State-of-the-Art in Empirical Privacy Auditing

Nicole Mitchell, Galen Andrew, Arun Ganesh, Brendan McMahan, Peter Kairouz
2026/06/09 14:50 (2 days ago)

Parameter-efficient fine-tuning of large language models (LLMs) can exhibit problematic memorization of individual training examples. Empirical privacy auditing (EPA) quantifies this risk by measuring realistic data leakage on membership inference (MI) or reconstruction attacks. A key challenge in EPA is designing "canary" examples that are mixed with the privacy-sensitive training data. We propose generating synthetic canaries via high-temperature sampling ($T \geq 0.8$) from LLMs, using prompts tailored to the privacy-sensitive training data. These canaries act as high-influence outliers, ensuring high identifiability and hence strong audits. Further, since the canaries are themselves non-private, they are inspectable and can be inserted with repetition without jeopardizing the privacy of the real data. An important use of models fine-tuned on privacy-sensitive data is the generation of synthetic data. This also comes with privacy risk. We introduce a powerful synthetic data audit based on fine-tuning an auxiliary model on the synthetic data. Auditing the auxiliary model for the original canaries then provides a strong estimate of the privacy leakage through the synthetic data. Finally, leveraging our strong auditing methodologies, we perform a systematic investigation into the interacting effects of model capacity and canary entropy on memorization.

arXiv cs.CR | Dependability and Security | arXiv | approach | Confidentiality, integrity, privacy | cs.CR | arXiv cs.CR

HE-DAP: Homomorphic Encryption-based Dynamic Adaptive Parameter Optimization for Statistical Computation

Yun-Soo Park, Hyunmin Choi, Hyoungshick Kim, Mun-Kyu Lee
2026/06/09 14:46 (2 days ago)

Homomorphic encryption (HE) enables privacy-preserving analytics but remains hindered by high computational overhead. We find that the inverse square root, a key primitive in many statistical and machine learning workloads, exhibits inconsistent and often suboptimal performance across HE libraries and hardware. This stems from a core trade-off between two costly HE operations: evaluating high-degree Chebyshev polynomials to speed up Newton's method versus performing bootstrapping to manage ciphertext noise. Because their relative costs vary by up to 6x across environments, any fixed configuration proves inherently inefficient. To address this challenge, we present HE-DAP, a cross-platform optimization framework that automatically navigates this trade-off. By profiling an environment's unique performance characteristics, HE-DAP finds the optimal balance between polynomial degree and iteration count to accelerate the encrypted inverse square root computation for a given accuracy target. Our evaluation on Lattigo, HEaaN-CPU, and HEaaN-GPU shows that HE-DAP's adaptive approach yields significant performance gains. It accelerates the core inverse square root computation by up to 2.35x over the fixed configuration in PP-STAT while maintaining high numerical accuracy (MRE $\leq 3.1 \times 10^{-8}$). We further demonstrate that optimizing this fundamental building block directly enhances the end-to-end performance of complex statistical analyses, confirming the practical benefits of our environment-aware approach. By automatically adapting to heterogeneous execution environments, HE-DAP demonstrates that principled parameter optimization can make privacy-preserving statistical analytics practical at scale.
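
The trade-off between initial-guess quality and iteration count can be seen on plaintext floats. The sketch below uses the standard Newton iteration for the inverse square root, $y_{n+1} = y_n (3 - x y_n^2) / 2$, which needs only additions and multiplications and is therefore the usual form for HE evaluation. It runs in the clear, and a hand-picked constant stands in for the Chebyshev initial approximation; both simplifications are assumptions of this example.

```python
def newton_inv_sqrt(x, y0, iterations):
    """Approximate 1/sqrt(x) from the initial guess y0 (division-free in y)."""
    y = y0
    for _ in range(iterations):
        y = y * (3.0 - x * y * y) / 2.0
    return y


def iterations_needed(x, y0, tol=1e-8, max_iter=50):
    """Number of Newton steps until the relative error drops to tol."""
    exact = x ** -0.5
    y = y0
    for n in range(max_iter + 1):
        if abs(y - exact) / exact <= tol:
            return n
        y = y * (3.0 - x * y * y) / 2.0
    raise ValueError("did not converge; y0 must satisfy 0 < y0 < sqrt(3 / x)")


x = 0.5
print(iterations_needed(x, y0=1.0))   # coarse initial guess
print(iterations_needed(x, y0=1.4))   # better initial guess, fewer steps
```

Under encryption a better initial guess costs a higher-degree polynomial while each extra Newton step consumes multiplicative depth and may trigger bootstrapping, which is why the best split depends on the relative cost of the two operations in a given environment.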

arXiv cs.SE / arXiv AI Agents for Software Engineering | AI for Software Engineering | arXiv | benchmark | approach | empirical study | Collaborative AI for SE | Prompt engineering for SE | cs.SE | arXiv cs.SE | arXiv AI Agents for Software Engineering

MASTOR: A Multi-Agent Approach to Semantic Test Oracle Generation for RESTful APIs

Sida Deng, Rubing Huang, Zhenzhen Yang, Man Zhang, Xuan Xie, Rongcun Wang
2026/06/09 14:35 (2 days ago)

Existing automated RESTful API testing approaches commonly rely on simple checks (e.g., HTTP status codes, schema conformance), which are insufficient for detecting semantic faults, business logic violations, and state-dependent inconsistencies. To address this, we propose MASTOR, a Multi-Agent approach for generating Semantic Test Oracles for RESTful APIs based on implementation source code. MASTOR consists of two phases: source analysis and oracle generation. The former employs a source extraction agent to construct a source context for each endpoint operation by analyzing a transitive import closure of relevant source files. The latter employs two parallel oracle-generation paths over the collected contexts: a single-operation path producing status and field oracles per operation, and a multi-operation path generating behavioral consistency oracles for operation sequences by leveraging cross-operation semantic associations. Both paths apply a challenger-agent review, where a dedicated reviewer identifies weaknesses and issues improvement hints to guide targeted regeneration, followed by oracle normalization to filter out structurally invalid oracles. We evaluated MASTOR on a benchmark of 13 open-source RESTful API projects (296 operations, 251,303 lines of code) from the WFD and PRAB datasets. MASTOR achieved an average mutation score of 75.4%, generating 10,022 oracles. These oracles were translated into executable assertions via ToJUnit and ToPostmanAssertify, and into human-readable descriptions via ToReadable. In a baseline comparison on 50 selected operations, MASTOR outperformed Direct Prompting by 30.1 percentage points (69.9% vs. 39.8%) and SATORI by 49.4 percentage points (69.9% vs. 20.5%).

arXiv cs.CR | Dependability and Security | arXiv | approach | Formal methods and model checking | cs.CR | cs.AI

The Distributed Detectability Band Against Marginal-Preserving Attacks

Zhang Qinqin, Gao Yuze
2026/06/09 14:04 (2 days ago)

AI-control monitors score individual agent actions to detect misbehavior, but real harm can be distributed across many benign-looking steps, each individually below any per-step alarm. We construct a marginal-preserving, correlation-encoded distributed-sabotage attack using a Gaussian-copula AR(1) construction: the per-step monitor-score marginal is held exactly equal to benign, so mean, max, top-k tail, and threshold monitors (Monitor A) are defeated by construction, while harm is encoded in the temporal correlation structure. We sequence the paper around three reviewer-mandated gates. (1) Realizability gate: the stealthy attack achieves KS-distance to benign of 0.013 (effectively zero) at all tested harm levels up to 3.0, confirming that harm is fully decoupled from the per-step marginal and realizability is not harm-limited. (2) Monitor-A-vs-B reconciliation: we show formally that the attack, built against Monitor A's score marginal, remains marginal-preserving under a different-score Monitor B (the correlation/sequence family: CUSUM, SPRT, HMM-LR, runs test, autocorrelation, windowed logistic), and scope worst-case claims to score functions that admit a temporal signature. (3) Non-empty detectability band: Monitor A achieves AUC 0.52 (chance); Monitor B spans AUC 0.79-0.97 at the same 1% FPR target, and as harm is amortized over more steps Monitor A collapses to chance while Monitor B holds at AUC ~0.95. These results demonstrate a non-empty detectability band and characterize the sub-threshold sabotage frontier: distribution-shape monitors fail by construction; temporal-correlation monitors can detect but are not trivially optimal.
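
The marginal-preserving idea can be illustrated with a small sketch. It is not the authors' code: it assumes a Uniform(0, 1) benign score marginal, a correlation of 0.8 for the attack, and hypothetical function names. A latent Gaussian AR(1) series is pushed through the normal CDF, so the per-step marginal is the same for any correlation, while a lag-1 autocorrelation statistic (one member of the temporal family the abstract lists) separates the two sequences.

```python
import math
import random
from statistics import NormalDist, mean

def ar1_gaussian(n, rho, rng):
    """Stationary AR(1) series; every step is marginally N(0, 1)."""
    z = [rng.gauss(0.0, 1.0)]
    scale = math.sqrt(1.0 - rho * rho)  # keeps the variance at 1
    for _ in range(n - 1):
        z.append(rho * z[-1] + scale * rng.gauss(0.0, 1.0))
    return z

def copula_scores(n, rho, rng):
    """Map the latent series through the normal CDF.
    The result has Uniform(0, 1) marginals regardless of rho (assumed benign marginal)."""
    std = NormalDist()
    return [std.cdf(v) for v in ar1_gaussian(n, rho, rng)]

def tail_rate(xs, threshold=0.99):
    """Marginal-only statistic: fraction of steps above a per-step alarm threshold."""
    return sum(x > threshold for x in xs) / len(xs)

def lag1_autocorr(xs):
    """Temporal statistic: sample autocorrelation at lag 1."""
    m = mean(xs)
    num = sum((a - m) * (b - m) for a, b in zip(xs, xs[1:]))
    den = sum((x - m) ** 2 for x in xs)
    return num / den

rng = random.Random(0)
benign = copula_scores(5000, 0.0, rng)   # independent steps
attack = copula_scores(5000, 0.8, rng)   # same marginal, correlated steps

print("mean       ", round(mean(benign), 3), round(mean(attack), 3))
print("tail rate  ", tail_rate(benign), tail_rate(attack))
print("lag-1 corr ", round(lag1_autocorr(benign), 3), round(lag1_autocorr(attack), 3))
```

In this toy setting the mean and tail rate are close for both sequences, while the lag-1 autocorrelation differs sharply; how harm would be encoded in the correlation is outside the sketch.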

arXiv cs.SE | Analytics | arXiv | approach | empirical study | Software metrics and measurements | cs.SE

Beyond Coverage and Kill Scores: Empirically Measuring Test Suite Behavioural Gaps

Partha Protim Paul, Reid Holmes
2026/06/09 12:46 (2 days ago)

Traditional test adequacy metrics measure a system's implementation, not whether it adheres to its expected behaviour. While developers rely heavily on code coverage and mutation testing to assess test suite quality, these metrics are fundamentally implementation-centric and cannot detect gaps between what the code is expected to do and what it actually does. Unfortunately, there has been no way to reliably detect these discrepancies; in this paper we introduce an automated proof-of-concept approach to investigate these gaps. The approach extracts expected method-level behaviours from natural language documentation and source code, maps them to existing test cases, and identifies gaps between expected and validated behaviours. We evaluate the approach across ten popular open-source Java libraries comprising 8,922 methods, extracting 20,729 behaviours with 93.1% precision. Our empirical analysis conservatively estimates that 17.5% of detected expected behaviours remain entirely untested, which we term the test suite's behavioural gap. To determine if these gaps are merely an artifact of human-driven testing, we evaluate state-of-the-art automated test generators (EVOSUITE / ASTER), finding that they similarly fail to validate at least 20.6% / 27.1% of detected expected behaviours. We further demonstrate that behavioural gaps are not predicted by traditional structural metrics: the majority of untested behaviours occur in methods that already have high line coverage, and over half persist in methods with high mutation kill score. These results suggest behavioural coverage acts as an independent dimension of test suite adequacy that can complement traditional structural metrics.

arXiv LLM4Code / Program Repair | AI for Software Engineering | arXiv | benchmark | approach | Automating SE tasks with LLM and foundation models | AI-enabled recommender systems for automated SE | cs.AI

Self-Distillation Policy Optimization via Visual Feedback: Bridging Code and Visual Artifacts

Haoyu Dong
2026/06/09 10:28 (2 days ago)

Code-generating large language models (LLMs) increasingly produce visual artifacts such as charts, web pages, and slides by writing programs that are executed by non-differentiable renderers, committing to code before observing the render. As a result, otherwise executable code often yields artifacts with visually salient defects, including overlapping elements, clipped text, broken alignment, low contrast, and overflow. We study visual-feedback self-distillation for code-generated visual artifacts. We propose Visual-SDPO, a self-distillation policy-optimization framework that treats rendered visual feedback as privileged context for a weight-sharing teacher and distills this feedback into a coding student. To make supervision spatially targeted rather than uniform, we introduce Visual-Grounded Code Credit Weighting, which traces each detected defect back to the code statements responsible for the affected elements and amplifies the distillation signal on those statements. A sequence-level GRPO (Group Relative Policy Optimization) term complements the dense token-level objective by rewarding executable, visually high-quality rollouts, while failed executions remain learnable through the self-distillation path by passing execution errors as privileged context to the teacher. We instantiate Visual-SDPO for chart, web/UI, and slide generation with a unified Qwen3-VL-8B-Instruct backbone. Across chart-to-code, UI-to-code, and slide-generation benchmarks (ChartMimic, Design2Code, and AeSlides), Visual-SDPO improves over the zero-shot base by more than 10 absolute points in the primary metric and over GRPO by at least 2.4 points, with fewer training steps and no added inference-time cost.

arXiv cs.CR | Dependability and Security | arXiv | benchmark | approach | Confidentiality, integrity, privacy | cs.LG | cs.CR

Privacy-Preserving Credit Risk Prediction with Alternative Data

Hongzhe Zhang, Jiarong Xu, Jing He, Xiao Fang
2026/06/09 10:27 (2 days ago)

Credit risk prediction is a critical problem in the consumer credit industry. Traditionally, financial institutions construct credit risk prediction models using borrowers' demographic, financial, and credit history data, collectively referred to as traditional data. Recent studies have demonstrated that alternative data, such as borrowers' mobile phone communication data, enable lenders to acquire fuller and more accurate profiles of borrowers' creditworthiness, thereby improving credit risk prediction performance. Nevertheless, alternative data are held by external entities independent of financial institutions. Directly sharing alternative data with financial institutions infringes on consumer privacy, yet existing credit risk prediction studies largely overlook this issue. To address this gap, we define a new problem, namely privacy-preserving credit risk prediction with alternative data, which simultaneously considers three practical constraints: the privacy-preserving constraint that protects consumer privacy, the model-confidentiality constraint that learns and stores the model centrally at the financial institution, and the lossless constraint that maintains the performance of the learned model. To solve this problem, we develop PrivacyCredit, a novel privacy-preserving machine learning method. We then theoretically demonstrate the privacy-preserving, model-confidential, and lossless properties of PrivacyCredit. Through extensive experiments using a real-world credit dataset linked with alternative data, we demonstrate the predictive value of securely incorporating alternative data into credit risk prediction and show that PrivacyCredit achieves the same predictive performance as the model learned from the insecure plaintext combination of traditional and alternative data. We further evaluate its model-confidentiality property and computational efficiency.

arXiv cs.CR | Dependability and Security | arXiv | benchmark | approach | Dependability and security for embedded and cyber-physical systems | cs.CR

Semantic Multi-Agent Intrusion Detection for IoT: Zero-Day and Adversarial Threats with Risk-Aware Reasoning

Saeid Jamshidi
2026/06/09 10:18 (2 days ago)

The rapid proliferation of Internet of Things (IoT) devices has enabled unprecedented automation and connectivity, but it has also substantially increased the attack surface, exposing networks to sophisticated cyber threats, including zero-day and adversarial intrusions. Traditional Intrusion Detection Systems (IDS) struggle to generalize to unseen attacks, often require substantial computational resources, and lack interpretability, particularly in resource-constrained and heterogeneous IoT networks. Recent advances, including Deep Learning (DL), open-set detection, and Large Language Model (LLM)-based semantic reasoning, address some of these challenges but typically focus on zero-day and adversarial threats and rarely combine semantic reasoning with multi-agent systems. To overcome these limitations, we propose a semantic multi-agent IDS that integrates four specialized agents (Scout, Mutator, Auditor, and Arbiter) that leverage semantic embeddings and multi-stage probabilistic decision fusion. The Scout induces structured hypotheses from semantic embeddings; the Mutator generates adversarially constrained variants; the Auditor evaluates consistency and filters unreliable outputs; and the Arbiter produces interpretable, risk-aware alerts. Extensive experiments across multiple real-world IoT datasets demonstrate that the proposed system achieves 95.9% overall detection accuracy, reduces false-positive rates to 6.8%, improves zero-day detection to 87.9%, and maintains computational efficiency suitable for edge deployment.

arXiv cs.CR | Dependability and Security | arXiv | approach | empirical study | Confidentiality, integrity, privacy | cs.CR | cs.MA

Game-Theoretic Multi-Agent Control for Robust Contextual Reasoning in LLMs

Saeid Jamshidi, Amin Nikanjam, Arghavan Moradi Dakhel, Kawser Wazed Nafi, Foutse Khomh
2026/06/09 10:18 (2 days ago)

Large Language Models (LLMs) in multi-turn interactions maintain evolving context rather than generating isolated responses, making them vulnerable to prompt-injection and context-poisoning attacks in which locally plausible adversarial fragments gradually distort reasoning trajectories. Existing defenses mainly filter individual outputs and often ignore context evolution across turns, leaving long-horizon reasoning exposed. Although the Model Context Protocol (MCP) standardizes context exchange and tool invocation, it functions as a passive routing layer and does not enforce contextual stability. To address these limitations, we introduce the Game-Theoretic Secure Model Context Protocol (GT-MCP), a controller-driven multi-agent method that treats context management as a closed-loop dynamical process. GT-MCP coordinates three heterogeneous LLM agents and selects outputs through a trust function that jointly evaluates causal consistency against a validated context graph, semantic agreement among agents, and distributional drift over time. When instability is detected, a rollback-based self-healing mechanism restores the validated context and prevents unsupported fragments from propagating. Empirical evaluation over 500 interaction turns under an adaptive adversarial threat model shows that contextual drift remains bounded in 99.6% of turns, with recovery required in only 0.4%. Per-turn utility remains tightly concentrated, with median = -0.19, P05 = -0.72, and P95 = 0.30; severe degradation below -1 occurs in only 0.4% of cases, and no injection attempt succeeds at the controller level. Selected outputs maintain stable win rates above 98%, and computational overhead remains predictable, with latency per token = 1.63e-3 s.

arXiv cs.SE | Human and Social Aspects | arXiv | empirical study | survey | Teams, communities, and companies | cs.SE

Communication Skills in Software Engineering: A Multivocal Review

Dannilo Rabelo, Deisy Peres, Emmanuel Dias, Thayssa Rocha, Enne Rebeca de Freitas, Kiev Gama, Gustavo Pinto
2026/06/09 10:16 (2 days ago)

Communication skills are increasingly recognized as essential in Software Engineering, yet discussions about them remain fragmented across academic and gray literature. This fragmentation is problematic because it limits a broader understanding of how communication is valued, taught, and applied in both educational and professional settings. Through a multivocal literature review, we found strong convergence between academic and gray sources in treating communication as a core competency, while also identifying differences in emphasis, with academia focusing on conceptualization and empirical evidence and gray literature stressing practical consequences and emerging industry practices.

arXiv cs.SE | AI for Software Engineering | arXiv | approach | empirical study | Automating SE tasks with LLM and foundation models | cs.SE

From Awareness to Action: How Developers Engage with Accessibility Innovation in LLM-Assisted Development

Thayssa Águila da Rocha, Luciane Silva, Ana Duarte, Marcelle Pereira Mota, Gustavo Pinto
2026/06/09 09:57 (2 days ago)

Developers often struggle to design truly accessible digital solutions in corporate environments. In these environments, accessibility is usually treated as a compliance requirement rather than an innovation opportunity. By analyzing 14 LLM-based accessibility project proposals and focus group discussions with 9 participants at a Brazilian tech company, we found that inclusive innovation can emerge particularly when initiatives are led by People with Disabilities (PWD) themselves. If organizations adopt similar participatory approaches, accessibility would evolve from an afterthought into a driving force for technological excellence and cultural transformation.

arXiv cs.SE / arXiv cs.CR | Dependability and Security | arXiv | approach | Vulnerability detection and software security | cs.CR | cs.SE

The Linux IOCTL Census: A Source-Derived Database of the Linux Kernel Control-Code Surface

Michael J. Bommarito
2026/06/09 09:28 (2 days ago)

The ioctl system call is Linux's catch-all device-control interface. A userspace program opens a device node and hands the driver a numeric command code and an argument buffer, and the driver does whatever that code means, whether configuring hardware, reading back state, or moving data into and out of the kernel. Drivers define these commands themselves, by the thousand, and parse their arguments in kernel context, which makes ioctl handlers one of the broadest and least uniform local attack surfaces in the kernel. A handler that trusts an argument length it never validates can read or write kernel memory out of bounds, and the command space is catalogued in no central place. We present the Linux IOCTL Census, a source-derived and queryable inventory of that surface. An allmodconfig build compiles 878 modules across 169 subtrees, and over them a single deterministic libclang pass over the kernel source recovers 586 ioctl dispatch entry points, 1,289 decoded _IOC command codes, 3,583 controlled-input sinks, and 1,298 permission gates. A second pass encodes the kernel's own documented threat model as a queryable column, separating the capability-ungated ioctl surface, an upper bound on unprivileged reach rather than proven reach, from the part a hard capability gate puts out of scope. We backtest the census against 22 recent in-tree ioctl CVEs and release the structural tier as open data, on a schema shared with the companion Windows IOCTL Census so a single query spans both operating systems.
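
As background for the "decoded _IOC command codes" mentioned above, the following sketch unpacks a command number into its fields. It is not part of the census tooling: the helper names are illustrative, and the bit widths follow the generic `asm-generic/ioctl.h` layout (8-bit number, 8-bit type, 14-bit size, 2-bit direction), which some architectures override.

```python
from typing import NamedTuple

# Generic layout; a few architectures use 13 size bits and 3 direction bits.
NR_BITS, TYPE_BITS, SIZE_BITS, DIR_BITS = 8, 8, 14, 2
NR_SHIFT = 0
TYPE_SHIFT = NR_SHIFT + NR_BITS
SIZE_SHIFT = TYPE_SHIFT + TYPE_BITS
DIR_SHIFT = SIZE_SHIFT + SIZE_BITS

# Direction is named from userspace's point of view.
DIRECTIONS = {0: "none", 1: "write", 2: "read", 3: "read/write"}
DIR_VALUES = {name: value for value, name in DIRECTIONS.items()}

class IoctlCode(NamedTuple):
    direction: str   # data transfer direction
    type_char: str   # driver "magic" character
    nr: int          # command number within the driver
    size: int        # size in bytes of the argument structure

def decode_ioc(cmd: int) -> IoctlCode:
    """Split a 32-bit _IOC-style command code into its four fields."""
    return IoctlCode(
        direction=DIRECTIONS[(cmd >> DIR_SHIFT) & ((1 << DIR_BITS) - 1)],
        type_char=chr((cmd >> TYPE_SHIFT) & ((1 << TYPE_BITS) - 1)),
        nr=(cmd >> NR_SHIFT) & ((1 << NR_BITS) - 1),
        size=(cmd >> SIZE_SHIFT) & ((1 << SIZE_BITS) - 1),
    )

def encode_ioc(code: IoctlCode) -> int:
    """Inverse of decode_ioc, mirroring the kernel's _IOC() macro."""
    return (
        (DIR_VALUES[code.direction] << DIR_SHIFT)
        | (code.size << SIZE_SHIFT)
        | (ord(code.type_char) << TYPE_SHIFT)
        | (code.nr << NR_SHIFT)
    )

# 0x80044501 is _IOR('E', 0x01, int), the input subsystem's EVIOCGVERSION.
print(decode_ioc(0x80044501))
```

Legacy commands that predate the `_IOC` macros (many terminal ioctls, for example) do not follow this encoding, so a decoder like this only applies to codes built with `_IO`, `_IOR`, `_IOW`, or `_IOWR`.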

arXiv LLM4Code / Program Repair / arXiv AI Coding 3Y | AI for Software Engineering | arXiv | benchmark | approach | AI-enabled recommender systems for automated SE | Automating SE tasks with LLM and foundation models | cs.CL

OpenRTLSet: A Fully Open-Source Dataset for Large Language Model-based Verilog Module Design

Jinghua Wang, Lily Jiaxin Wan, Sanjana Pingali, Scott Smith, Manvi Jha, Shalini Sivakumar, Xing Zhao, Kaiwen Cao, Deming Chen
2026/06/09 09:17 (2 days ago)

OpenRTLSet introduces the largest fully open-source dataset for hardware design, offering over 131,000 diverse Verilog code samples to the research community and industry. Our dataset uniquely combines Verilog code from GitHub repositories (102k modules), VHDL translations (5k modules), and synthesizable C/C++ translations (24k modules), all freely accessible without proprietary restrictions. Using the reasoning model DeepSeek-R1, we generated paired natural language descriptions for each code sample, enabling fine-tuning of various language model families (e.g., Qwen and Granite) for Verilog code generation. Our dataset explores multiple options, including Verilator-generated C++ files as additional context during labeling, quantization techniques (INT4 vs. BF16), and performance differences across model sizes (7B-32B parameters). OpenRTLSet demonstrates that open-source approaches can achieve superior performance in hardware design tasks, establishing a new foundation for accessible research and commercial use in this domain.

arXiv cs.CR | Dependability and Security | arXiv | benchmark | approach | empirical study | Confidentiality, integrity, privacy | cs.CR | cs.CL

Benchmarking and Exploring the Capabilities of LLMs for Attack Investigations

Aniket Anand, Yiwei Hou, Daniel Fields, Alex Kantchelian, David Tao, Kurt Thomas, Grant Ho
2026/06/09 09:09 (2 days ago)

This paper presents AuditBench, a new benchmark dataset for evaluating the capabilities of LLMs at investigating security-related system audit logs. We design and use this benchmark to explore the performance of LLMs on four log-investigation tasks that incident response teams commonly perform, ranging from triaging alerts generated by detectors to identifying persistence mechanisms on compromised systems. AuditBench consists of system audit logs collected from Linux and Windows machines, and spans over 50 different security investigation scenarios, including both malicious and benign activity. Using our benchmark, we evaluate and analyze the performance of five frontier LLMs at analyzing audit logs for attack investigations. Our analysis illuminates how LLM performance and error profiles vary according to different design choices, such as differences in model size, data representation, prompt construction, and specific investigation tasks. Additionally, we characterize the quality of the explanations produced by LLMs and the types of errors that models make across our benchmark. Collectively, our work provides a foundation for assessing the capabilities of LLMs for investigating security logs, novel insights for practitioners using LLMs in security operations, and important directions for future research.

arXiv cs.SE / arXiv cs.CR / arXiv LLM4Code / Program Repair / arXiv AI Coding 3Y | Testing and Analysis | arXiv | approach | empirical study | Automated test generation | Program analysis | cs.CR | cs.SE

RECON: An LLM-Enhanced Backward Constraint Analysis Framework

Babangida Bappah, Lamine Noureddine, Umar Farooq, Aisha Ali-Gombe
2026/06/09 08:09 (2 days ago)

While traditional techniques, such as symbolic execution, provide a principled foundation for precise constraint reasoning in program analysis, they struggle to scale to modern software systems mainly due to path explosion, the need for function modeling, and the loss of semantic intent at low-level program representations. In complex execution environments such as Android, characterized by extensive framework interactions and event-driven behavior, these limitations are amplified further. Thus, in this paper, we present a novel large language model (LLM)-enhanced backward constraint analysis framework that combines the precision of static program analysis with LLMs' semantic understanding to extract precise execution constraints from Android bytecode. Our approach, titled RECON, performs backward path discovery from target method(s) to the application entry point(s), discovers method-level control-flow constraints, and leverages LLM reasoning to transform bytecode conditions into interpretable specifications. We evaluated RECON using five LLMs across 78 Android constraint-extraction scenarios and compared it with traditional symbolic execution on real-world applications. Results demonstrate that our approach operates 5.8X faster than traditional symbolic execution, with a 100% success rate, while maintaining logical equivalence and providing significantly more precise and interpretable output. We further evaluated RECON for malware analysis on 100 samples. The results indicate an 84% success rate in generating semantic constraints that lead to the execution of dangerous API behaviors and in detecting complex constraints across multiple execution paths.
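
The backward path discovery step can be pictured with a toy example. This is a minimal sketch rather than RECON itself: it assumes a call graph has already been extracted as a plain dictionary, all method names are made up, and constraint extraction and LLM reasoning are left out. It walks from the target method toward the callers until it reaches an entry point.

```python
from collections import defaultdict

def backward_paths(call_graph, target, entry_points):
    """Enumerate acyclic call chains entry -> ... -> target by walking callers backward."""
    # Invert the caller -> callees map into callee -> callers.
    callers = defaultdict(set)
    for caller, callees in call_graph.items():
        for callee in callees:
            callers[callee].add(caller)

    paths = []
    stack = [(target, (target,))]          # (current method, chain from it to the target)
    while stack:
        node, chain = stack.pop()
        if node in entry_points:
            paths.append(list(chain))      # reached an entry point: record the chain
            continue
        for caller in sorted(callers.get(node, ())):
            if caller not in chain:        # skip recursion cycles
                stack.append((caller, (caller,) + chain))
    return sorted(paths)

# Hypothetical call graph (caller -> callees) for illustration only.
call_graph = {
    "MainActivity.onCreate": ["Net.init", "Ui.setup"],
    "SmsReceiver.onReceive": ["Net.init"],
    "Net.init": ["Crypto.load"],
    "Crypto.load": ["SmsManager.sendTextMessage"],
    "Ui.setup": [],
}
entries = {"MainActivity.onCreate", "SmsReceiver.onReceive"}

for path in backward_paths(call_graph, "SmsManager.sendTextMessage", entries):
    print(" -> ".join(path))
```

Starting from the target keeps the search restricted to methods that can actually reach it, which is the usual motivation for backward rather than forward exploration.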

TOSEM Recent Articles | AI for Software Engineering | ACM Transactions on Software Engineering and Methodology | benchmark | approach | survey | Automating SE tasks with LLM and foundation models | Prompt engineering for SE | ACM TOSEM | Technical debt | Documentation

Faster Code, Deeper Debt? A Multivocal Literature Review on Technical Debt and Its Early Signs in LLM-Assisted Software Development

Ramtin Ehsani, Shriya Rawal, Yuanfang Cai, Preetha Chatterjee
2026/06/09 08:00 (2 days ago)

With the rapid adoption of LLM-assisted coding, the need to manage the technical debt these systems introduce has become urgent. In this paper, we conduct a multivocal literature review of 104 sources (31 formal, 73 grey) to examine how LLM-assisted development contributes to technical debt and what strategies, metrics, and benchmarks exist to mitigate it. We find that LLMs often amplify traditional forms of technical debt, particularly code, design, and documentation debts, while also introducing new LLM-specific debts. Notably, we identify fast-integration debt, where rapidly generated code prioritizes speed over quality, triggering a domino effect that leads to governance debt and increased long-term maintenance costs. Additional emerging categories include prompt, ethical, data, and provenance debt, reflecting new challenges unique to LLM adoption. To address these, strategies suggested in the literature include human-in-the-loop frameworks, prompt engineering, and data quality alignment. In practice, tools such as SonarQube are commonly used to detect technical debt indicators, while research prototypes such as CodeSmellEval are emerging to assess how LLMs contribute to debts. However, no standardized benchmarks or LLM-specific metrics yet exist, leaving an important gap. Based on these findings, we outline insights and future directions to ensure reliable integration of LLMs into software engineering workflows.

TOSEM Recent Articles | Dependability and Security | ACM Transactions on Software Engineering and Methodology | approach | Confidentiality, integrity, privacy | ACM TOSEM | Computer science | Artifact (error)

Mobile App Analysis in the New Era: Challenges and the Road Ahead

Jiale Wu, Tianming Liu, Yanjie Zhao, Haoyu Wang
2026/06/09 08:00 (2 days ago)

Mobile applications (apps) have grown into complex, multi-platform systems that support diverse services and operate within varying regulatory and distribution contexts. Traditional analysis techniques have established a strong foundation for ensuring quality, security, and compliance, but face increasing challenges as mobile apps evolve. This paper presents a three-layer perspective to examine these challenges. The Artifact Layer highlights how heterogeneous code bases and protection mechanisms reduce the analyzability of packaged apps. The Runtime Layer addresses growing execution complexity, from multi-modal interactions and super-app architectures to AI-driven autonomous operations. The Ecosystem Layer considers broader contexts including emerging platforms, fragmented distribution channels, and region-specific regulatory requirements. New characteristics across these layers interact and compound, creating difficulties that existing approaches are not designed to handle. For each layer, we identify key challenges and potential research directions, including framework-aware analysis, adaptive testing, and policy-aware compliance verification. We also discuss how large language models bring new analytical capabilities to the field while their integration into apps raises new security and privacy concerns. This paper aims to clarify where established methods fall short and highlight directions for future research in this evolving field.

arXiv cs.CR | Dependability and Security | arXiv | benchmark | approach | Confidentiality, integrity, privacy | cs.LG | cs.CR

Alignment Defends LLMs from Property Inference Attacks

Pengrun Huang, Chhavi Yadav, Ruihan Wu, Kamalika Chaudhuri
2026/06/09 06:15 (2 days ago)

Large language models (LLMs) are increasingly fine-tuned on domain-specific datasets that may contain sensitive, dataset-level properties. Recent work has shown that such dataset-level information can be effectively extracted through property inference attacks, posing a confidentiality risk. Existing defenses against these attacks primarily operate by modifying the training data distribution and hence require access to the original data and retraining the model, limiting their applicability to settings where data is unavailable or models are already deployed. In this work, we propose alignment-based defenses for mitigating property inference attacks in LLMs. Our approach reshapes the model's output distribution towards a target property ratio via post-training alignment, without modifying the training data. In particular, we adapt two widely used RLHF frameworks--Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO)--as our defenses by constructing preference pairs and defining a specific reward function, respectively. Through comprehensive experiments, we show that our alignment-based defenses effectively mitigate property inference attacks while maintaining a strong utility-confidentiality tradeoff.

arXiv cs.SE / arXiv AI Coding 3Y | AI for Software Engineering | arXiv | approach | Automating SE tasks with LLM and foundation models | Efficacy measurement beyond traditional metrics | Prompt engineering for SE | cs.SE

TestMap: Evidence Infrastructure for Foundation-Model-Assisted Test Generation

Hunter Leary, Luke Hanuska, Chris Brown
2026/06/09 06:04 (2 days ago)

Foundation models (FMs) can generate plausible unit tests, but determining whether those tests are correct, useful, maintainable, and worth integrating remains difficult. Generated tests must be mapped to the code they target, inserted into real projects, built, executed, measured against the baseline suite, repaired when necessary, and compared across models and generation strategies. This validation process is fragmented across build systems, test runners, coverage tools, mutation tools, static analyzers, and experiment scripts. The problem is especially important because generated tests are both code artifacts and validation artifacts: they must themselves be validated before they can be trusted as evidence about the system under test. This paper presents TestMap, an open-source infrastructure prototype that automates evidence-backed foundation-model-assisted test generation for C#/.NET repositories. TestMap supports repository analysis, source-test mapping, baseline execution, code metric collection, test smell detection, coverage measurement, mutation testing, model-guided test generation, validation, repair, and repository-specific experiment tracking. Rather than reporting only final passing tests, TestMap records the lifecycle of each generated candidate, including failed, repaired, low-impact, and evidence-positive outcomes. These intermediate outcomes can reveal model limitations, missing context, repair cost, toolchain inefficiencies, or possible faults in the system under test. Using TestMap as a design case, we describe the architecture and evidence model needed to make generated tests observable, repeatable, and comparable across repositories, models, prompts, and generation strategies. We conclude with lessons learned and open challenges, including oracle and assertion quality, metric attribution, test maintainability, flakiness, execution cost, and developer acceptance.

arXiv cs.SE / arXiv LLM4Code / Program Repair | AI for Software Engineering | arXiv | benchmark | approach | Automating SE tasks with LLM and foundation models | AI-enabled recommender systems for automated SE | cs.AI | cs.LG | cs.SE

Less Context, Better Agents: Efficient Context Engineering for Long-Horizon Tool-Using LLM Agents

Abhilasha Lodha, Mahsa Pahlavikhah Varnosfaderani, Abir Chakraborty, Abhinav Mithal
2026/06/09 06:01 (2 days ago)

Large language models deployed as autonomous agents for enterprise workflows face a key challenge: verbose tool responses from enterprise systems can cause context overflow, stale-state errors, and high inference cost. We study this problem in automated expense itemization in Microsoft Dynamics 365 Finance and Operations using Model Context Protocol tools. We evaluate four GPT-5 configurations on a 50-task hotel expense benchmark: no user model, full conversation history, context pruned to the last 5 tool call/response pairs, and pruning with automated summarization. Results are averaged across 5 independent runs, with the user model held constant for the context-engineering comparison. The no-user-model baseline achieves only 8.0% complete itemization. Full-context retention improves completion to 71.0%, but consumes 1,480,996 tokens and 14.56 hours per benchmark. Pruning to the last 5 tool calls improves completion to 79.0% while reducing token use to 535,274 and runtime to 5.39 hours. Adding summarization achieves the best result: 91.6% complete itemization and 99.64% average amount itemized, with 553,374 tokens and 5.79 hours. We further report confidence intervals, effect-size analysis, sensitivity over pruning and summary windows, failure analysis, results across five expense types grouped into three categories, and cross-model evidence with Claude Sonnet 4.5. These results show that, for this class of enterprise tool-use workflow, selective retention of recent tool interactions plus compact summarization can improve both reliability and efficiency compared with full-history retention.
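
The two context-engineering configurations (pruning to the last 5 tool call/response pairs, with and without summarization) can be sketched roughly as follows. This is an illustration under assumptions, not the paper's implementation: `ToolExchange`, `build_context`, and `list_calls` are hypothetical names, messages are plain strings, and the summarizer is a trivial stand-in for what would presumably be a model call.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class ToolExchange:
    call: str       # tool invocation issued by the agent
    response: str   # (possibly verbose) tool output

def list_calls(older: List[ToolExchange]) -> str:
    """Stand-in summarizer: names the earlier calls instead of asking a model."""
    return "; ".join(ex.call for ex in older)

def build_context(
    system_prompt: str,
    history: List[ToolExchange],
    keep_last: int = 5,
    summarize: Optional[Callable[[List[ToolExchange]], str]] = None,
) -> List[str]:
    """Keep the most recent exchanges verbatim; optionally compress the rest."""
    recent = history[-keep_last:] if keep_last > 0 else []
    older = history[: len(history) - len(recent)]
    messages = [system_prompt]
    if summarize is not None and older:
        messages.append("Summary of earlier steps: " + summarize(older))
    for ex in recent:
        messages.append("CALL " + ex.call)
        messages.append("RESPONSE " + ex.response)
    return messages

# Eight made-up exchanges; only the last five survive verbatim.
history = [ToolExchange(f"get_line({i})", f"payload {i}") for i in range(8)]
pruned = build_context("You itemize hotel expenses.", history)
pruned_with_summary = build_context(
    "You itemize hotel expenses.", history, summarize=list_calls
)
print(len(pruned), len(pruned_with_summary))
```

Keeping the summary as a single message ahead of the recent exchanges bounds context growth while still telling the agent which earlier steps already happened.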

arXiv AI Coding 3Y | AI for Software Engineering | arXiv | benchmark | Automating SE tasks with LLM and foundation models | cs.LG | cs.AI | cs.CV

Density Ridge Selective Prediction for LLM and VLM Hallucination Detection under Calibration Label Scarcity

Nina I. Shamsi
2026/06/09 05:36 (2 days ago)

Hallucination detection in large language and vision-language models is increasingly framed as selective prediction, where a detector assigns a confidence score and abstains when confidence is low. Unsupervised sampling detectors (Semantic Entropy, EigenScore) avoid labels but plateau in quality, while supervised probes (SAPLMA) attain stronger in-distribution scores yet degrade sharply when calibration labels are scarce. We recover the response manifold of an LLM as the density ridge of a kernel density estimate built on a six-dimensional kinematic feature map of hidden state generation trajectories. A test generation is scored by the negated Euclidean distance from its projected feature point to the nearest ridge vertex, yielding a low-dimensional geometric skeleton of the stochastic output distribution. We evaluate against Semantic Entropy, SAR, EigenScore, SAPLMA, and log-probability on seven QA benchmarks (HaluEval-QA, TriviaQA, GSM8K, POPE, ScienceQA, A-OKVQA) using nine text and vision LLMs in a deliberately label-scarce protocol ($n_{\text{cal}}{=}200$ queries, $N{=}5$ generations). Our ridge-based score outperforms these baselines on AUROC by 5-20 points, while showing tempered degradation under calibration-label scarcity.
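
The scoring rule described above (negated Euclidean distance to the nearest ridge vertex, followed by abstention at low confidence) is simple enough to sketch. The ridge vertices, the test points, and the threshold below are invented for illustration; recovering the ridge from a kernel density estimate and computing the six-dimensional features are not shown.

```python
import math
from typing import Sequence

Point = Sequence[float]

def ridge_score(point: Point, ridge_vertices: Sequence[Point]) -> float:
    """Confidence score: negated distance to the closest ridge vertex (higher is better)."""
    return -min(math.dist(point, vertex) for vertex in ridge_vertices)

def answer_or_abstain(point: Point, ridge_vertices: Sequence[Point], threshold: float) -> str:
    """Selective prediction: abstain when the score falls below the threshold."""
    return "answer" if ridge_score(point, ridge_vertices) >= threshold else "abstain"

# Two made-up ridge vertices in the six-dimensional feature space.
ridge = [(0.0,) * 6, (1.0,) * 6]
near = (0.0, 0.0, 0.0, 0.0, 0.0, 0.5)   # close to the first vertex
far = (3.0,) * 6                         # far from both vertices

print(ridge_score(near, ridge), answer_or_abstain(near, ridge, threshold=-1.0))
print(ridge_score(far, ridge), answer_or_abstain(far, ridge, threshold=-1.0))
```

In a label-scarce setting, the threshold would presumably be the only quantity tuned on the small calibration set, since the ridge itself is built without labels.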

arXiv cs.CR | Dependability and Security | arXiv | approach | survey | Confidentiality, integrity, privacy | cs.CR | cs.AI

Local Is Not a Sufficient Privacy Boundary: Governing OS-Integrated On-Device AI

Jonghyun Chung, Sanket Badhe
2026/06/09 05:07 (2 days ago)

As AI systems move into operating systems, privacy no longer turns only on whether a model runs locally. A local assistant may assemble email, calendar entries, files, screenshots, notifications, and app intents; retain embeddings or summaries; invoke tools; emit telemetry; or route difficult requests to cloud infrastructure. Local inference reduces some exposure, but it answers only one question: where computation occurs. It does not answer who may assemble context, what derived state persists, which actions are authorized, or how updates change the system's authority. We develop an OS-centered privacy framework for on-device AI that treats privacy as an institutional accountability problem rather than a deployment attribute. The framework specifies a threat model, a six-part privacy risk taxonomy, privacy-by-architecture controls, and a four-level audit rubric. We demonstrate the rubric through a documentation-bounded comparison of Apple Intelligence/Foundation Models, Android AICore/Gemini Nano, and Microsoft Recall. Meaningful privacy in on-device AI depends on constrained information flow, bounded authority, visible user control, and auditable governance across the operating-system lifecycle.

arXiv cs.CR | Dependability and Security | arXiv | benchmark | approach | Confidentiality, integrity, privacy | cs.CR

Proof of Source of Funds: Efficient On-chain Provenance of Cryptoassets

Alireza Kavousi, István András Seres, Zhipeng Wang
2026/06/09 05:07 (2 days ago)

Regulatory compliance is increasingly mandatory for decentralized finance and privacy-enhancing technologies. Current approaches rely on binary inclusion/exclusion lists or retroactive graph analysis by centralized blockchain intelligence firms. This approach strips honest users of their financial privacy, leads to false positives and negatives, and forces decentralized platforms to bear the burden of on-chain transaction monitoring. In this work, we propose a paradigm shift: moving from platform-side surveillance to user-side provenance. We introduce Proof of Source of Funds (PoSoF), a novel cryptographic framework that shifts the burden to the user. Rather than the platform tracing funds, the user locally generates a zero-knowledge proof demonstrating that their deposit originates exclusively from a set of compliant sources. The platform is thus relieved of chain-analysis duties, requiring a constant-time, O(1) verification to enforce admission control. We formulate a unified temporal Directed Acyclic Graph (DAG) abstraction that formalizes both UTXO and account-based ledger histories within a generalized value-flow model. Users extract a compliant sub-DAG of their transaction history and utilize Incrementally Verifiable Computation (IVC) to prove rigorous state-transition predicates that protect against various attack vectors. Crucially, PoSoF provides verifiable cryptographic provenance; it guarantees the legitimacy of the funds without leaking the intermediate transaction topology, intermediary addresses, or the specific origins utilized. We formally define the security properties of PoSoF and evaluate an Ethereum-compatible prototype. Our benchmarks demonstrate that fully private, proactive compliance is highly practical, requiring only ~1.8 s to incrementally update a user's PoSoF per new transaction, and a constant-time ~1.5 ms (~800k gas) for final on-chain EVM verification.
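As a rough illustration of the statement being proved, the sketch below checks in plaintext that every origin reachable backwards from a deposit lies in a compliant set. It contains no zero-knowledge proof or IVC; the graph encoding (`parents` as a dict of predecessor lists) and the function name are assumptions, and the paper's actual state-transition predicates are richer than this reachability check.

```python
def originates_from_compliant(deposit, parents, compliant):
    """Return True if every root ancestor of `deposit` is a compliant source.

    `parents` maps a node to the list of nodes whose value flowed into it.
    A node with no recorded parents is treated as an origin of funds.
    """
    stack, seen = [deposit], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        preds = parents.get(node, [])
        if not preds and node not in compliant:
            return False  # reached an origin outside the compliant set
        stack.extend(preds)
    return True
```

In PoSoF the user would prove a predicate of this kind without revealing the traversed nodes; here the whole sub-DAG is visible to the checker, which is exactly what the protocol avoids.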

arXiv cs.CR | Dependability and Security | arXiv | benchmark | approach | Confidentiality, integrity, privacy | cs.CR | cs.AR

GRAFT: Graphlet-Triggered Backdoor Attack on GNN-Based Hardware Security Systems

Sanaz Kazemi Abharian, Sai Manoj Pudukotai Dinakarrao
2026/06/09 04:47 (2 days ago)

The globalization of the integrated circuit (IC) supply chain increases the risk of security threats, such as hardware Trojans (HTs) and the theft of intellectual property (IP). Graph Neural Networks (GNNs), among the most powerful deep learning methods for processing graph-structured data, have been widely adopted to detect such threats. However, GNNs are susceptible to backdoor attacks that can maliciously manipulate output predictions toward an adversarial target. These attacks are not only difficult to detect but also compromise the integrity of GNN-based security systems. Most prior work embeds backdoor triggers using randomly generated subgraphs or gradient-guided generative subgraphs. However, such triggers are impractical for GNN-based hardware security applications as they do not guarantee the preservation of circuit functionality. In this paper, we propose GRAFT, a graphlet-triggered backdoor attack targeting GNN-based hardware security. GRAFT embeds graphlet-based triggers at either the register-transfer level (RTL) or gate level of the design while preserving the circuit's original function. We evaluate GRAFT on the ISCAS-85 and TrustHub datasets. Our experimental results demonstrate that GRAFT can effectively evade HT detection and IP piracy detection, achieving an attack success rate (ASR) of up to 100%.

arXiv cs.CR | Dependability and Security | arXiv | approach | Reliability, availability, and safety | cs.LG | cs.CR

Quality Is Not a Safety Proxy Under Quantization

Sahil Kadadekar
2026/06/09 04:33 (2 days ago)

Quantized checkpoints are often screened first with quality metrics and only later, if at all, with direct safety tests. This paper audits that shortcut on a matched 51-row matrix spanning 6 models, 4 families, a 7-level GGUF ladder, and AWQ/GPTQ INT4 checkpoints. In this matrix the shortcut fails: all 36 quality-safety pairings split direction across models, and 9 hidden-danger rows plus 1 near-hidden-danger row show quality stable or improved while refusal falls by 12-68 percentage points. Seven of the 11 AWQ/GPTQ rows are hidden-danger. A four-probe mechanistic follow-up over the 17 Hugging Face-backed FP16/AWQ/GPTQ cells does not rescue it: entropy, refusal-direction, and calibration probes are weak or null separators of dangerous rows, and although probe-identified safety-associated neurons absorb 1.39$\times$ more quantization error overall ($p < 5 \times 10^{-7}$), the effect is not regime-specific. Claude Sonnet 4 relabels 11,470 items in a predefined stratified set, agrees with the primary gemma3:12b judge on 89.9\% of rows ($\kappa = 0.873$, 95\% CI [0.866, 0.881]), and changes 0/10 hidden-danger cells. A calibrated study-internal behavioral screen -- the Refusal Template Stability Index (RTSI), built from four refusal-template drift features and calibrated on this matrix -- routes 10/10 hidden- or near-hidden-danger rows to direct safety testing (Wilson 95\% CI lower bound 0.72) while leaving 23 of 45 non-baseline rows in a low-risk bucket under both in-sample scoring and row-level leave-one-out validation; on the same matrix, the best single-feature baselines (unique-prefix-rate-delta, raw refusal-rate delta) recover 9/10 and 8/10 respectively at matched bucket size, and cross-stack transfer requires recalibration. For the quantized checkpoints, model families, and safety outcomes studied here, retained quality cannot waive direct safety evaluation.
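The reported Wilson lower bound of 0.72 for 10/10 can be reproduced with the standard Wilson score interval. The sketch below assumes the usual two-sided 95% level (z ≈ 1.96); the function name is illustrative.

```python
import math


def wilson_lower(successes, n, z=1.959964):
    """Lower end of the Wilson score interval for a binomial proportion."""
    p = successes / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / (1 + z * z / n)


lower_10_of_10 = wilson_lower(10, 10)  # rounds to 0.72
```

With all 10 rows recovered, the bound reduces to 1 / (1 + z²/n) ≈ 1 / 1.384, which is why a perfect 10/10 still leaves a lower bound well below 1 at this sample size.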

arXiv cs.CR | Dependability and Security | arXiv | approach | Vulnerability detection and software security | cs.CR

RadKey: An LLM-Guided RF Backscatter System for Through-Wall Keystroke Inference

Qijun Wang, Chunqi Qian, Huacheng Zeng
2026/06/09 04:26 (2 days ago)

In today's digitally connected world, keyboards remain the primary interface for inputting sensitive information, making them a persistent target for eavesdropping attacks. While prior keystroke inference techniques have exploited side-channel signals such as acoustics and vibrations, they typically rely on conspicuous, short-range sensors and require victim-specific data for model training, limiting their practicality, scalability, and stealth. In this paper, we present RadKey, an RF backscatter system for covert, long-range, through-wall keystroke eavesdropping. RadKey comprises two components: a compact batteryless backscatter tag and an RF reader. The tag captures keystroke-induced vibrations and acoustic signals, modulating them onto the frequency shift of its backscattered RF signal using two magnetically-coupled LC resonators. This design also enables spectral separation between the excitation and backscatter signals, mitigating self-interference for the RF reader and thus extending eavesdropping range. The RF reader demodulates the backscattered RF signal to infer typed content. It employs a dedicated signal processing pipeline that extracts user- and keyboard-independent keystroke features across time and frequency domains, enabling strong generalizability. To further enhance adaptability, RadKey integrates an LLM for online adaptation, leveraging LLM outputs as pseudo ground-truth labels to refine the classifier during runtime. We have built a prototype of the full RadKey system and evaluated it through extensive over-the-air experiments. Results show that RadKey achieves accurate and robust keystroke inference across diverse users in real-world settings. A demo video is available at: https://radkey-submission.github.io/RadKey/

arXiv cs.SE / arXiv AI Agents for Software Engineering / arXiv SWE-bench / Defects4J / arXiv AI Coding 3Y | AI for Software Engineering | arXiv | benchmark | approach | Automating SE tasks with LLM and foundation models | cs.SE | cs.AI

What makes a harness a harness: necessary and sufficient conditions for an agent harness

Sanderson Oliveira de Macedo
2026/06/09 03:35 (2 days ago)

The term agent harness now circulates widely in software engineering with generative artificial intelligence. It names the layer that wraps a language model and turns it into a coding agent able to act on a repository. The usage is loose and polysemous. Sometimes the term denotes the whole product (Claude Code, Codex CLI); sometimes it denotes the evaluation scaffold that runs an agent against tasks (the SWE-bench harness); sometimes it gets conflated with an agent framework, an SDK, an IDE plugin, or an orchestrator. What is missing is a reference definition that works as an instrument, one that includes and excludes cases consistently. We build that definition through a conceptual analysis that combines works with persistent identifiers and primary grey-literature sources, such as official documentation, glossaries, and engineering reports. We reconstruct the genealogy of the term, from the horse's tack to the classic test harness, to the machine-learning evaluation harness, and finally to the agent harness. We then propose a constitutive definition that states the necessary and sufficient conditions for a system to be an agent harness, we operationalize it as an inclusion and exclusion test, and we draw the boundary of the concept against an agent framework, an agent SDK, an IDE plugin, an eval harness, and an orchestrator. We apply the definition to six real harnesses (Claude Code, Codex CLI, Aider, Cline, OpenHands, and SWE-agent) and to deliberate edge cases; the test includes and excludes consistently. We close with a research agenda organized by design tension axes. The contribution is an operational definition of agent harness, with a shared vocabulary, able to guide engineering practice and the scientific comparison of agentic systems.

arXiv cs.CR | Dependability and Security | arXiv | benchmark | approach | Confidentiality, integrity, privacy | cs.CR | cs.NI

Secrets Best Not Shared: DNS Privacy Enhancements for the Constrained IoT

Martine S. Lenders, Thomas C. Schmidt, Matthias Wählisch
2026/06/09 03:22 (2 days ago)

Attackers often identify DNS traffic to disrupt or compromise Internet services. While prior work has focused on encrypting queries using DNS over TLS, HTTPS, or QUIC to counter such attacks, we consider IETF protocols designed for resource-constrained IoT devices and empirically analyze the potential of obfuscating DNS traffic in addition to encryption. We create a dataset of machine-to-machine-compatible data objects along with the corresponding DNS resolution processes, evaluating 296 deployment scenarios of resolving host names, including DNS over the Constrained Application Protocol (CoAP) and an onion routing flavor of CoAP under varying link-layer conditions. We compare them to DNS over HTTPS. Using Random Forest and a header field analysis, we identify the fields that leak the most information. Our findings show that DNS over CoAP with equalized packet lengths, block-wise transfer, and header compression reduces the accuracy of identifying DNS frames to 86% and further to 77% with payload compression. Our approach outperforms DNS over HTTPS, where classifiers always identify DNS frames based on IP addresses. The dataset is publicly available.

arXiv cs.CR | Dependability and Security | arXiv | approach | Confidentiality, integrity, privacy | cs.CR | cs.LG

SoK: Colluding Adversaries in Machine Learning Pipelines

Vasisht Duddu, Lipeng He, Asim Waheed, N. Asokan
2026/06/09 03:16 (2 days ago)

Machine learning (ML) models are susceptible to various security, privacy, and fairness risks. Adversaries with different characteristics (i.e., objectives, knowledge, and capabilities) can collude by executing one attack to amplify others. Existing work lacks a systematic framework to explore collusion among adversaries, and to study the implications of the adversaries' characteristics. We present a framework covering collusion (a) between train- and inference-time adversaries, and (b) among inference-time adversaries. Our framework accounts for factors enabling collusion between adversaries. We propose a guideline to conjecture about the potential for collusion using enabling factors. We use it to explain prior work, conjecture about unexplored collusions, and empirically validate five such cases. Finally, we discuss how adversaries' characteristics influence the potential for collusion.

arXiv AI Coding 3Y | Testing and Analysis | arXiv | benchmark | approach | empirical study | Software testing | cs.CL | cs.LG

CodeAlchemy: Synthetic Code Rewriting at Scale

Ankit Gupta, Aditya Prasad, Rameswar Panda
2026/06/09 03:15 (2 days ago)

Pre-training on raw code teaches syntax but provides sparse signal for diverse real-world task formats. While synthetic data has proven transformative for language models, code remains largely unexplored beyond limited quality improvements. We present CodeAlchemy, a synthetic data generation framework that transforms publicly sourced code into semantically-rich training data through 5 strategies: CodeEnhance (quality-aware rewriting), CodeQA (template-based problems), CodeDev (developer tasks), CodeDialogue (multi-turn conversations), and CodeTrace (execution traces). We process 3 corpora across 15 languages to generate 500B+ tokens of synthetic data plus 350B reasoning tokens, orders of magnitude more than prior efforts. CodeTrace instruments and executes 1.3M+ files across 14 languages and 5K libraries, capturing control flow, state tracking, and library knowledge. We introduce DevEval (developer tasks) and TraceEval (execution prediction) benchmarks; frontier models like Claude Sonnet 4.5 achieve only 5.6% exact match on TraceEval, revealing critical gaps in semantic understanding. Our 3B models achieve 83.5% on HumanEval, 63.2% on MBPP, 8.09% win rate on DevEval, and 15.36 ROUGE-2 on TraceEval, outperforming frontier models 10x the size including 27B Gemma-3 and 32B Granite-4.0.

arXiv cs.CR | Dependability and Security | arXiv | approach | survey | Vulnerability detection and software security | Confidentiality, integrity | cs.CR | cs.CY

The Human Vulnerabilities & Exploits (HVE) Framework

Avichai Ben, Tom Rahav, Daniel Illaev, Aviv Nahon, Avi Grushka
2026/06/09 03:06 (2 days ago)

The cybersecurity community has invested over two decades in building standardized frameworks, namely the Common Vulnerabilities and Exposures (CVE) system, the Common Vulnerability Scoring System (CVSS), and the Common Weakness Enumeration (CWE), to identify, classify, and remediate threats to digital infrastructure. However, an emerging body of research reveals that a vast majority of successful cyberattacks exploit not software flaws, but human behavioral and psychological vulnerabilities. Social engineering, fraud, and scam attacks, which manipulate human cognition, emotion, and trust, do not have an equivalent standardized framework. Meanwhile, behavioral science and psychology research has established robust theoretical foundations, such as dual-process theory, prospect theory, social influence frameworks, and visceral state models, which explain precisely why and how these attacks succeed. This paper introduces the Human Vulnerabilities & Exploits (HVE) Framework, a structured approach for identifying, classifying, and mitigating the behavioral and psychological vulnerabilities exploited in scams, social engineering, and other human-centric fraud and attacks, analogous in concept to how CVE helps classify software vulnerabilities: it provides a shared, machine-readable taxonomy with structured identifiers, multi-dimensional severity scoring via the Human Vulnerability Severity Score (HVSS), and actionable remediation guidance through Human Vulnerability Patches (HVPs). This introduction synthesizes the relevant literature across cybersecurity standardization, behavioral science, and fraud defense to establish the theoretical and practical foundations for the HVE framework, whose architecture and technical specifications are detailed in subsequent sections.

arXiv cs.CR | Dependability and Security | arXiv | benchmark | empirical study | survey | Confidentiality, integrity, privacy | cs.CR

The Chronicles of Radio Frequency Fingerprinting

Abdul Aziz, Ingrid Huso, Savio Sciancalepore, Gabriele Oligeri
2026/06/09 02:09 (3 days ago)

Radio Frequency Fingerprinting (RFF) has evolved from an early idea for radar emitter identification into a broad research field for wireless device identification and spectrum monitoring for security. Rather than presenting a conventional literature survey, this work provides a critical historical analysis of RFF organized around the field's major conceptual paradigm shifts from 1993 to 2026. We discuss the evolution of RFF across its fundamental methodological phases, beginning with early transient-based approaches, in which transmitter turn-on behavior, unintentional modulation, and hardware nonlinearities were treated as the primary fingerprint sources. We then examine the transition to digital communications, during which attention shifted to steady-state impairments and to engineered features extracted from signals. Next, we discuss the Machine Learning period, which standardized the RFF workflow around feature extraction, dimensionality reduction, and supervised classification, followed by the Deep Learning period, in which representation learning from raw IQ samples significantly improved performance and expanded the application space. Beyond a chronological list of methods and best practices, this paper critically examines the changing assumptions and persistent limitations that have driven these transitions. We highlight the central challenges that continue to shape the field, including channel dependence, receiver sensitivity, limited dataset realism, poor cross-domain generalization, open-set recognition, and adversarial robustness. By organizing more than three decades of work into a coherent narrative, this paper clarifies the evolution of RFF, identifies persistent limitations, and outlines the key research directions required to move the field toward dependable real-world adoption.

arXiv cs.SE / arXiv AI Agents for Software Engineering / arXiv AI Coding 3Y | AI for Software Engineering | arXiv | benchmark | approach | AI-enabled recommender systems for automated SE | Automating SE tasks with LLM and foundation models | Collaborative AI for SE | cs.SE | cs.AI | cs.MA

FASE: Fast Adaptive Semantic Entropy for Code Quality

Shizhe Lin, Ladan Tahvildari
2026/06/09 01:53 (3 days ago)

Multi-agent code generation offers a promising paradigm for autonomous software development by simulating the human software engineering lifecycle. However, system reliability remains hindered by LLM hallucinations and error propagation across interacting agents. While semantic entropy provides a principled way to quantify uncertainty without ground-truth answers, current methods often rely on costly LLM-driven equivalence checks. In this work, we introduce Fast Adaptive Semantic Entropy (FASE), a novel metric that approximates functional correctness based on the minimum spanning tree of structural and semantic dissimilarity graphs. Evaluations on HumanEval and BigCodeBench demonstrate that FASE outperforms state-of-the-art semantic entropy by LLM entailment, achieving a 25% average improvement in Spearman correlation and a 19% increase in ROCAUC score against Pass@1 from ground-truth test cases when using the Qwen3-Embedding-8B model. Furthermore, by eliminating costly LLM-driven equivalence evaluation, FASE incurs negligible computational overhead, requiring only approximately 0.3% of the runtime cost of traditional semantic entropy approaches. These results position FASE as a practical, cost-effective solution for optimizing uncertainty quantification in real-world multi-agent workflows.
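The abstract says FASE is built on the minimum spanning tree of dissimilarity graphs but does not say how the tree is turned into a score. The sketch below shows only the generic ingredient: Prim's algorithm over a symmetric pairwise dissimilarity matrix, returning the total MST weight as one possible dispersion measure over sampled code candidates. Treating total weight as the uncertainty signal is an assumption for illustration, not the authors' definition.

```python
def mst_weight(dist):
    """Total weight of a minimum spanning tree (Prim's algorithm).

    `dist` is a symmetric n x n matrix (list of lists) of pairwise
    dissimilarities, e.g. between embeddings of sampled code candidates.
    """
    n = len(dist)
    if n == 0:
        return 0.0
    in_tree = [False] * n
    best = [float("inf")] * n  # cheapest known edge linking each node to the tree
    best[0] = 0.0
    total = 0.0
    for _ in range(n):
        # Pick the node outside the tree with the cheapest connecting edge.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        total += best[u]
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
    return total
```

Under this reading, near-identical samples give a weight close to zero while semantically scattered samples give a larger one, and no LLM equivalence check is needed once the dissimilarities are available.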

arXiv AI Coding 3Y | Evolution | arXiv | approach | Evolution and maintenance | cs.AI | cs.CL

SIGA: Self-Evolving Coding-Agent Adapters for Scientific Simulation

Matthew Ho, Brian Liu, Jixuan Chen, Audrey Wang, Lianhui Qin
2026/06/09 01:35 (3 days ago)

Advanced scientific simulators expose specialized input languages that turn simulation goals into executable configurations, but learning them can cost domain scientists hours to days. We study simulator setup as a problem of agent-tool interface grounding: what minimal simulator-specific adaptations are needed for an off-the-shelf coding agent to operate real scientific software? Our intuition is that coding agents already know how to navigate files, edit code, run commands, and repair outputs, but they lack the simulator's executable contract: its vocabulary, structural constraints, validation rules, and termination conditions. We introduce SIGA, a Simulator-Interface Grounding Adapter that supplies this contract through retrieval, procedural memory, in-trajectory validation, and validation-enforced termination. We primarily evaluate SIGA on GEOS, an open-source multiphysics simulator used in subsurface science. SIGA produces a complete GEOS deck in about five minutes with TreeSim above 0.90, matching an extended-budget human expert who took about three hours, a roughly 36x wall-clock speedup. On a harder held-out set, grounding raises TreeSim from 0.720 to 0.789, a roughly 10% relative gain over the bare agent, and can reduce the across-seed standard deviation by 16x. Self-evolution further improves SIGA by rewriting adapter contents from prior trajectories, yielding the highest held-out GEOS mean and matching or outperforming the strongest hand-designed configuration. Transfers to OpenFOAM and LAMMPS show that the dominant mechanism shifts by interface: validation matters most when structural completeness is the bottleneck, while memory and retrieval matter most when domain correctness is the bottleneck. These results suggest that lightweight, self-improvable grounding layers can turn general coding agents into practical operators of scientific software.
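The two headline ratios follow directly from the figures quoted in the abstract. The check below uses the rounded inputs ("about three hours", "about five minutes"), so the results are approximate in the same way the abstract's are; variable names are illustrative.

```python
# Wall-clock speedup: roughly three hours of expert time vs. roughly five minutes.
expert_minutes = 3 * 60
agent_minutes = 5
speedup = expert_minutes / agent_minutes  # 36.0

# Relative TreeSim gain on the held-out set: 0.720 (bare agent) -> 0.789 (grounded).
bare, grounded = 0.720, 0.789
relative_gain = (grounded - bare) / bare  # about 0.096, i.e. "roughly 10%"
```

The absolute TreeSim gain is 0.069; the "roughly 10%" figure is relative to the bare-agent score.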

arXiv cs.CR | Dependability and Security | arXiv | approach | Formal methods and model checking | cs.CY | cs.CR | cs.SI

Human-Centred Risk Mitigation for AI-Mediated Information Manipulation: A SOCMINT Framework Based on Information Manipulation Sets

Antonio Scala
2026/06/09 01:12 (3 days ago)

AI-mediated information manipulation increasingly takes the form of social cyber attacks that target trust, attention, credibility, reputation, and decision-making rather than only technical infrastructures or isolated false contents. Existing defensive approaches often oscillate between incident-level analysis, which fragments campaigns into weak signals, and attribution-first analysis, which may delay mitigation until responsibility is established. This paper proposes a SOCMINT framework based on Information Manipulation Sets (IMS) as an intermediate operational unit between individual incidents and strategic attribution. Building on the VIGINUM/EEAS use of IMS in counter-FIMI analysis, the framework treats manipulation as a coherent process involving narratives, accounts, infrastructures, temporal patterns, cross-platform migration, synthetic amplification, and cognitive targeting. The proposed pipeline moves from signal detection and diagnostic triage to IMS hypothesis construction, confidence/severity assessment, mitigation selection, and iterative update. A compact scenario illustrates how IMS-based analysis captures what content-level and attribution-first approaches miss. The paper also proposes a tabletop evaluation protocol to assess decision quality, confidence calibration, and mitigation proportionality. The main implication is that human-centred risk mitigation requires not only better detection, but also structured reasoning under uncertainty, auditable decision-making, and safeguards against over-securitising legitimate dissent.

arXiv AI Agents for Software Engineering / arXiv Program Repair Core | AI for Software Engineering | arXiv | benchmark | approach | AI-enabled recommender systems for automated SE | Automating SE tasks with LLM and foundation models | Prompt engineering for SE | cs.CV

HDSL: A Hierarchical Domain-Specific Language for Structured 3D Indoor Scene Generation and Localized Editing with LLM Agents

Letian Li, Chao Shen, Shuzhao Xie, Chenghao Gu, ZhengXiao He, Yu Meng, Xin Yang, Wenyuan Jiang, Zhi Wang
2026/06/09 01:02 (3 days ago)

Text-driven indoor scene generation and editing require an intermediate representation that language models can both produce and revise. Existing LLM-based systems often rely on scene graphs or global constraint lists, which are compact but underspecify local geometry and make instruction-based edits difficult to localize. We frame this problem as structured program generation and local program repair, and propose Hierarchical Descriptive Scene Language (HDSL), an XML/CSS-style domain-specific language for structured 3D indoor scenes. HDSL represents rooms, regions, objects, and support surfaces as a tree with local coordinates, making complex scenes easier to plan recursively and easier to retrieve for editing. Our pipeline uses LLM agents to generate HDSL subtrees with bounded verification, grounds non-virtual nodes through multimodal asset retrieval, and applies force-directed layout optimization to repair boundary and collision errors. For editing, Hierarchical Retrieval-Augmented Generation retrieves the relevant subtree, asks the LLM to rewrite only that local context, and merges the result back through a deterministic three-way merge. In our reproduced benchmark, HDSL improves average object coverage, text-scene alignment, and generation time over full text-to-scene baselines while remaining competitive with recent layout-only reproductions on geometry metrics; for editing, HRAG reduces token use by $5.22\times$ and runtime by $6.19\times$, produces valid DSL for all eight paired edits, and better preserves unrelated scene objects.

arXiv cs.CR | Dependability and Security | arXiv | approach | Vulnerability detection and software security | quant-ph | cs.CR

A Bell-State Extension of Loop-Back Quantum Key Distribution

Luis Adrián Lizama-Pérez
2026/06/09 00:46 (3 days ago)

Bidirectional quantum key distribution (QKD) protocols face persistent challenges related to classical disclosure, confinement of the signal space to predictable subspaces, and limited detectability under substitution or entanglement-swapping attacks. In this work, we present a Bell-state extension of the Loop-Back QKD architecture that improves efficiency and detectability while preserving its defining feature of a simplified, measurement-free remote terminal. The protocol employs entangled Bell states together with deterministic local Pauli encoding at the remote node. A central element is that Alice privately prepares and knows the initial Bell state, which serves as a hidden reference enabling her to interpret the Bell-state transition induced by Bob, while preventing an adversary from reconstructing the encoding without access to this reference. By exploiting both intra- and inter-family Bell transitions, the scheme expands the effective signal space beyond the subspace restrictions of earlier two-way protocols. Alice performs a Bell-state measurement to deterministically infer Bob's operation without any basis sifting. Although the traveling subsystem remains locally maximally mixed, concealing the initial Bell family amplifies disturbance under separable substitution strategies, yielding an intrinsic detection probability of approximately 3/4 per round. From an efficiency perspective, the protocol lifts the intrinsic post-selection limitation of single-qubit Loop-Back schemes: the effective throughput is bounded only by the Bell-state measurement success probability, reaching up to 50% in linear-optical implementations. These features make the proposed scheme particularly suitable for mobile or edge-based QKD scenarios, where passive remote nodes must operate under high loss and limited interaction times.
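The core decoding idea (Alice knows the initial Bell state, Bob applies a local Pauli, Alice's Bell-state measurement reveals which one) can be illustrated with a small state-vector sketch. This is the textbook action of Pauli operators on one half of a Bell pair, not the paper's full protocol; using the operator set {I, X, Z, XZ} and reading "family" as Φ versus Ψ are assumptions made here. Amplitudes are kept unnormalised and global phase is ignored.

```python
# Amplitudes over the basis |00>, |01>, |10>, |11> (unnormalised).
BELL = {
    "Phi+": (1, 0, 0, 1),
    "Phi-": (1, 0, 0, -1),
    "Psi+": (0, 1, 1, 0),
    "Psi-": (0, 1, -1, 0),
}

# Pauli operators acting on the second (travelling) qubit only.
PAULI = {
    "I": lambda s: s,
    "X": lambda s: (s[1], s[0], s[3], s[2]),      # bit flip
    "Z": lambda s: (s[0], -s[1], s[2], -s[3]),    # phase flip
    "XZ": lambda s: (-s[1], s[0], -s[3], s[2]),   # Z then X (Y up to a phase)
}


def identify(state):
    """Name the Bell state that `state` equals up to a global sign."""
    for name, bell in BELL.items():
        if abs(sum(a * b for a, b in zip(bell, state))) == 2:
            return name
    raise ValueError("not a Bell state")


def transition(initial, op):
    """Bell state obtained after applying `op` to the travelling qubit."""
    return identify(PAULI[op](BELL[initial]))


def decode(initial, measured):
    """Infer the applied operator from the known initial and measured states."""
    for op in PAULI:
        if transition(initial, op) == measured:
            return op
    raise ValueError("no single Pauli maps initial to measured")
```

In this sketch Z keeps the state within the Φ or Ψ family while X and XZ move it across, and for any fixed initial state the four operators lead to four distinct outcomes, so a party who knows the initial state can decode without basis sifting while one who does not cannot tell which operator was applied from the outcome alone.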

arXiv cs.CR | Dependability and Security | arXiv | benchmark | approach | empirical study | Vulnerability detection and software security | Dependability and security for embedded and cyber-physical systems | cs.CR | cs.HC | cs.LG

What the Eyes See, the LLMs Miss: Exploiting Human Perception for Adversarial Text Attacks

Qin Yang, Lu Malloy, Joshua Lee, Xiaohan Chang, Meisam Mohammady, Doowon Kim, Yuan Hong
2026/06/09 00:21 (3 days ago)

Large language model (LLM)-powered content moderation systems have become a critical defense against harmful online content. However, these systems primarily operate on tokenized text and largely ignore the visual cues that humans naturally rely on when interpreting content. We show that this discrepancy creates a fundamental perceptual mismatch: content that is readily recognized as harmful by humans can become effectively invisible to automated moderation systems. To study this vulnerability, we introduce a class of Human-Perceptible Adversarial Attacks (HPAA), in which harmful expressions are embedded into otherwise benign text through visually salient typographic manipulations. Our key insight is that typographic features, including spacing, visual emphasis, and spatial arrangement, can be strategically combined to preserve human recognition of harmful content while substantially reducing machine detectability. Operating in black-box settings with only a small query budget, our attack automatically generates evasive content without requiring model access or gradient information. We evaluate the attack across multiple datasets and ten deployed moderation systems, including commercial APIs and state-of-the-art open-source guardrails. Results reveal a striking gap between human and machine perception: with only three detector queries, generated attacks achieve over 86\% human recognition while maintaining detection rates below 1\% across the evaluated systems. We further conduct ablation studies to identify the typographic factors driving successful evasion, analyze why current moderation architectures fail to capture these signals, and discuss practical defenses. Our findings expose a fundamental blind spot in today's LLM-based moderation ecosystem and highlight the need for moderation systems that reason about content in a manner more consistent with human perceptual understanding.

arXiv cs.CR | Dependability and Security | arXiv | approach | Confidentiality, integrity, privacy | cs.CR | cs.AI | arXiv cs.CR

Observability for Delegated Execution in Agentic AI Systems

Abhinav Mishra, Kumar Sharad
2026/06/09 00:10 (3 days ago)

Delegation-scoped execution is not identifiable from standard observables: audit logs and execution traces can be identical under multiple incompatible delegation assignments. This gap is especially acute in LLM-based agentic systems, where agents dynamically select tools, vary execution sequences across runs for the same instruction, and spawn cooperating sub-agents. These dynamics fragment and interleave traces, making delegation-scoped reconstruction from causal structure alone structurally underdetermined. Although individual actions are authorized and logged, existing audit, tracing, and security schemas lack the semantics to reconstruct what actions occurred under a given delegation across heterogeneous systems. We focus on delegation-scoped attribution and access/share footprint reconstruction, not intent inference or reasoning reconstruction. We present an agent-aware observability substrate consisting of a lightweight gateway and a common information model that binds delegation context at execution time. This enables reliable cross-tool delegation-scoped reconstruction and direct forensic queries without heuristic time-window correlation.
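The idea of binding delegation context at execution time can be illustrated with a minimal sketch. Everything here is hypothetical: the `Event` fields, the `DelegationGateway` class, and its `call` and `footprint` methods are names invented for illustration, not the information model or gateway described in the paper.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Event:
    """One logged action (hypothetical schema, not the paper's model)."""
    delegation_id: str  # bound when the call executes, not inferred afterwards
    agent: str
    tool: str
    resource: str


class DelegationGateway:
    """Toy gateway: every tool call passes through it and is tagged."""

    def __init__(self) -> None:
        self.log: List[Event] = []

    def call(self, delegation_id: str, agent: str, tool: str,
             resource: str, fn: Callable[[str], object]) -> object:
        # Record the delegation context first, then run the tool.
        self.log.append(Event(delegation_id, agent, tool, resource))
        return fn(resource)

    def footprint(self, delegation_id: str) -> List[str]:
        # Direct query by delegation id; no time-window correlation needed.
        return sorted({e.resource for e in self.log
                       if e.delegation_id == delegation_id})


gateway = DelegationGateway()
# Calls from two delegations interleave, as they might with sub-agents.
gateway.call("d1", "planner", "read_file", "report.txt", len)
gateway.call("d2", "sub-agent", "read_file", "notes.txt", len)
gateway.call("d1", "sub-agent", "share", "report.txt", str.upper)
print(gateway.footprint("d1"))
```

Because each record carries its delegation identifier, interleaved traces can be separated by a simple filter, which is the property the abstract contrasts with heuristic correlation.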

arXiv SWE-bench / Defects4J / arXiv AI Coding 3Y | Analytics | arXiv | benchmark | approach | Mining software repositories | cs.AI | arXiv SWE-bench / Defects4J | arXiv AI Coding 3Y

From 0-to-1 to 1-to-N: Reproducible Engineering Evidence for MetaAI Recursive Self-Design

Dun Li, Jiatao Li, Hongzhi Li
2026/06/08 23:45 (3 days ago)

Recursive self-design refers to AI-assisted modification of the mechanisms by which an AI system is built, evaluated, and improved. This paper treats MetaAI not as a mature paradigm, but as a working term for a human-seeded, AI-expanded development pattern in which the design space itself becomes a target of modification. We propose an operational evidence framework with four criteria: inspectable target system, meta-level modifier, feedback-directed selection, and recursive continuation. We then map public systems, including Darwin Goedel Machine (DGM), STOP, Goedel Agent, and ShinkaEvolve, against these criteria. DGM provides the most direct currently reported evidence: its published results show improvement from 20% to 50% on SWE-bench Verified and from 14.2% to 30.7% on full Polyglot after 80 iterations, with ablations suggesting that both open-ended exploration and self-improvement contribute. Finally, we provide MetaAI-Mini, a reproducible HumanEval-based protocol and codebase. Because no completed model run is included in this build, MetaAI-Mini is reported as a protocol rather than as an experimental result.

arXiv cs.SE | Requirements and Modeling | arXiv | approach | empirical study | Modeling and model-driven engineering | cs.RO | cs.PL | cs.SE | arXiv cs.SE

Modeling Components and Connections in Cyber-Physical Systems

Kate Sanborn, Tanuj Kenchannavar, Vakul Nath, Jonathan Sprinkle
2026/06/08 23:39 (3 days ago)

Text-based configuration files for cyber-physical systems show the hierarchy of component modules well but often hide the details of connections and interfaces between modules. A model-based visual approach to these configuration files can better capture this information. The XML structure of Robot Operating System (ROS) launch files can be improved using a modeling approach. This paper presents ROSLaunchVisual, a model-integrated environment built on WebGME for designing, visualizing, and managing ROS launch files. The tool raises the level of abstraction by allowing developers to create and modify launch files using a graphical interface that represents nodes, publishers, subscribers, and arguments as interconnected components. The tool provides a dynamic system analysis that can then be used in the static development and analysis of new and existing launch files. ROSLaunchVisual incorporates features such as metamodel-driven validation, automatic import/export of launch files, and visual communication mapping. Plugins further enhance functionality by updating libraries, checking for semantic errors, and managing remaps. By making launch file creation more intuitive and less error-prone, ROSLaunchVisual improves development efficiency and system understanding, especially in collaborative or large-scale robotics projects.

arXiv cs.SE | AI for Software Engineering | arXiv | approach | survey | Automating SE tasks with LLM and foundation models | cs.SE | arXiv cs.SE

Agentic Persona Generation with Critique-Refinement: An Industrial Evaluation

Mohammad Hossein Amini, David Dewar, Shiva Nejati, Mehrdad Sabetzadeh
2026/06/08 23:34 (3 days ago)

Personas are widely used in software engineering to support requirements elicitation, design, and validation, but their manual creation is costly, time-consuming, and hard to scale. Recent LLM-based approaches automate persona generation from textual data; however, they typically rely on single-shot generation and subjective evaluations, limiting practical reliability. We present PerGent, an industry-grade method for persona generation built around an iterative critique-refinement loop. Specifically, PerGent uses a generator and a critic LLM agent, coordinated by an orchestrator, to iteratively refine personas using external resources such as interviews, surveys, and job postings through a critique-refinement loop with a user-defined maximum number of rounds. We deploy and evaluate PerGent in an industrial setting at Kinaxis, comparing it with three baselines, including one-shot methods. In an expert in-situ evaluation, PerGent achieved the highest expert approval rate (96.9%), exceeding all baselines. We further compare PerGent-generated personas with best-practice personas manually created by domain experts prior to the adoption of LLMs. Compared to baselines, PerGent reproduces a larger proportion of expert content while also contributing substantial new content beyond the pre-LLM personas. We conclude with lessons learned from deploying and evaluating PerGent at Kinaxis.

arXiv cs.CR | Dependability and Security | arXiv | approach | empirical study | Formal methods and model checking | cs.DC | cs.CR | arXiv cs.CR

Parent-Hash DAG: A Cost Analysis of Constant-Time Append for On-Chain Registries

Ian C. Moore, Fernando Paredes Garcia
2026/06/08 23:03 (3 days ago)

Provenance trees are append-only directed acyclic graphs of artifact registrations anchored on a public blockchain, recently introduced as the data substrate of operator-gated provenance infrastructure. Their defining data-structural pattern is a parent-hash directed acyclic graph (PHDAG), in which each append performs a constant number of storage writes to previously-untouched slots. This pattern has not previously been isolated as a standalone primitive, formally bounded with explicit constants, or benchmarked against the standard alternative, the incremental Merkle tree (IMT). We formalize PHDAG append as O(1) in gas cost, independent of registry size and tree depth, and develop a stochastic cost model for IMT in which per-insert cost is a random variable over the leaf index, deriving closed-form expressions for its mean and variance. We validate both analyses empirically on Base Sepolia across tree depths 1 to 25. PHDAG is observed to be depth-invariant at 76,276 gas (standard deviation about 6 gas), while IMT cost grows linearly with depth. The crossover below which IMT is cheaper falls far beneath the depths of every production registry surveyed. We further establish trustless registry reconstruction from public event logs in linear time with no off-chain dependency.
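The contrast between a constant-work append and a depth-dependent Merkle insert can be sketched off-chain. This is a toy model under stated assumptions: SHA-256 stands in for the on-chain hash, the counters `writes` and `hashes` are rough proxies rather than gas, and the class names and the particular incremental Merkle tree layout are illustrative choices, not the contracts or cost model analyzed in the paper.

```python
import hashlib

ROOT = b"\x00" * 32  # assumed sentinel parent for top-level registrations


def h(*parts: bytes) -> bytes:
    return hashlib.sha256(b"".join(parts)).digest()


class ParentHashRegistry:
    """Toy parent-hash DAG: each append writes one previously unused slot."""

    def __init__(self) -> None:
        self.parent_of = {}  # node hash -> parent hash
        self.writes = 0

    def append(self, payload: bytes, parent: bytes = ROOT) -> bytes:
        if parent != ROOT and parent not in self.parent_of:
            raise KeyError("unknown parent")
        node = h(parent, payload)
        if node in self.parent_of:
            raise ValueError("already registered")
        self.parent_of[node] = parent  # work is independent of registry size
        self.writes += 1
        return node


class IncrementalMerkleTree:
    """Toy fixed-depth incremental Merkle tree that rehashes the root path."""

    def __init__(self, depth: int) -> None:
        self.depth = depth
        self.zeros = [b"\x00" * 32]  # hash of an empty subtree per level
        for _ in range(depth):
            self.zeros.append(h(self.zeros[-1], self.zeros[-1]))
        self.filled = list(self.zeros[:depth])  # cached left siblings
        self.next_index = 0
        self.root = self.zeros[depth]
        self.hashes = 0

    def insert(self, leaf: bytes) -> None:
        if self.next_index >= 2 ** self.depth:
            raise OverflowError("tree is full")
        idx, node = self.next_index, leaf
        for level in range(self.depth):
            if idx % 2 == 0:
                self.filled[level] = node          # node is a left child
                node = h(node, self.zeros[level])
            else:
                node = h(self.filled[level], node)  # node is a right child
            self.hashes += 1
            idx //= 2
        self.root = node
        self.next_index += 1


registry = ParentHashRegistry()
first = registry.append(b"artifact-1")
registry.append(b"artifact-2", parent=first)
tree = IncrementalMerkleTree(depth=20)
tree.insert(h(b"artifact-1"))
print(registry.writes, tree.hashes)
```

In this sketch the registry's per-append work stays fixed while the tree's per-insert hash count equals its depth. The stochastic dependence on leaf index that the abstract mentions arises in implementations whose storage writes vary with the index, which this simplified counter does not capture.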

arXiv cs.CR | Dependability and Security | arXiv | approach | Confidentiality, integrity, privacy | cs.CL | cs.CR | arXiv cs.CR

Clinically Grounded Privacy Evaluation of Medical LMs

Sasha Ronaghi, Sana Tonekaboni, Lena Stempfle, Vivian Utti, Jordan Li Cahoon, Nathaniel Hendrix, Ayin Vala, Marzyeh Ghassemi, Emily Alsentzer
2026/06/08 23:02 (3 days ago)

Medical language models (LMs) can memorize and reproduce protected health information, but privacy evaluations often focus on recovery of training text rather than disclosure under realistic threat models. We introduce a clinically grounded framework that evaluates leakage along a graded axis of adversarial access, ranging from publicly inferable demographics to leaked note fragments. At each tier, we measure verbatim memorization of patient-specific text and semantic leakage of sensitive diagnoses. Applying the framework to an LM pretrained on 378k clinical notes, we find that routine encounter metadata (i.e. name, date of birth, provider, practice, visit date) elicits high rates of verbatim memorization across a patient's timeline and sensitive-diagnosis recovery (AUROC 0.91 for abortion, 0.81 for HIV). At the same time, exact-match memorization can overstate disclosure: 36% of memorized tokens reflect templated documentation. Our work highlights the risks of training on longitudinal clinical data, providing a practical framework for contextual privacy evaluation of medical LMs.

arXiv cs.SE / arXiv LLM4Code / Program Repair / arXiv AI Agents for Software Engineering / arXiv AI Coding 3Y | AI for Software Engineering | arXiv | approach | Automating SE tasks with LLM and foundation models | AI-enabled recommender systems for automated SE | Trustworthy AI for SE | cs.CL | cs.LG | cs.SE | arXiv cs.SE

Code Is More Than Text: Uncertainty Estimation for Code Generation

Yuling Shi, Caiqi Zhang, Yuexian Li, Haopeng Wang, Yeheng Chen, Nigel Collier, Xiaodong Gu
2026/06/08 22:52 (3 days ago)

Large language models (LLMs) are increasingly deployed as code generators, where silently wrong programs pose real safety and reliability risks. Reliable uncertainty estimation (UE) is essential for selective prediction, human-in-the-loop review, and downstream agentic decisions. Yet most existing code UE methods are inherited from natural language (NL) generation and ignore properties that make code distinct. We argue that code differs from NL in three ways: a single wrong token can break an entire program (token fragility); algorithmic intent and concrete implementation can disagree independently (intent-code gap); and programs can be executed (executability). We instantiate these properties as three orthogonal uncertainty axes: lexical (Top-K token entropy), algorithmic (pseudo-code consistency), and functional (behavioral consistency). Across five code LLMs, our three-axis ensemble improves average AUROC from 0.696 for the strongest NL-derived baseline to 0.776 (+8.1 points). Notably, on Qwen3-14B, our single-pass Top-K token entropy matches the strongest multi-pass baseline while being over 3x cheaper; across models, it remains a competitive low-cost signal. These results suggest that code UE deserves code-specific design rather than direct NL ports.
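A lexical signal of the Top-K token entropy kind can be sketched as follows. The abstract does not say how the top-K probabilities are normalized or how per-token values are aggregated, so the renormalization over the K candidates and the mean over tokens are assumptions, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def topk_entropy(logprobs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one token's top-K candidates.

    Assumption: the K log-probabilities are renormalized to sum to 1.
    """
    probs = [math.exp(lp) for lp in logprobs]
    total = sum(probs)
    if total <= 0.0:
        raise ValueError("top-K probabilities sum to zero")
    probs = [p / total for p in probs]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def sequence_uncertainty(token_topk_logprobs: List[Sequence[float]]) -> float:
    """Assumed aggregation: mean top-K entropy over generated tokens."""
    if not token_topk_logprobs:
        raise ValueError("empty generation")
    entropies = [topk_entropy(t) for t in token_topk_logprobs]
    return sum(entropies) / len(entropies)


# Hypothetical top-3 log-probabilities for a two-token generation.
confident = [[math.log(0.98), math.log(0.01), math.log(0.01)]] * 2
hesitant = [[math.log(0.4), math.log(0.3), math.log(0.3)]] * 2
print(sequence_uncertainty(confident) < sequence_uncertainty(hesitant))
```

The signal needs only the log-probabilities already returned by a single decoding pass, which is consistent with the abstract's description of it as a low-cost alternative to multi-pass methods.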

arXiv cs.CR | Dependability and Security | arXiv | benchmark | approach | Reliability, availability, and safety | cs.LG | cs.AI | cs.CR | cs.RO

Safe-RULE: Safe Reinforcement UnLEarning

Shixiong Jiang, Taozheng Zhu, Fanxin Kong
2026/06/08 22:33 (3 days ago)

Offline safe reinforcement learning (Safe RL) enables policy learning without online interactions, making it suitable for safety-critical systems such as robotics systems. However, its reliance on static datasets exposes offline Safe RL to data poisoning attacks, where adversaries inject malicious samples that compromise safety and induce unsafe policy behavior. In this work, we propose a new learning paradigm, named safe reinforcement unlearning (Safe-RULE), used as a defense framework to remove the influence of poisoned data without retraining from scratch or requiring access to the original training environment. We further extend reinforcement unlearning to offline Safe RL by explicitly accounting for both task performance and safety constraints during the unlearning process. Experiments across benchmark Safe RL tasks demonstrate that our approach effectively enhances safety performance against data poisoning attacks.

arXiv cs.CR | AI for Software Engineering | arXiv | approach | Automating SE tasks with LLM and foundation models | Prompt engineering for SE | cs.CR | cs.AI | arXiv cs.CR

FuseFSS: Efficient Secure LLM Inference with Function Secret Sharing

Yuhan Ma, Yong Li, Stefan Schmid
2026/06/08 22:30 (3 days ago)

Two-server secure inference allows a client to query a hosted large language model (LLM) without revealing prompts or embeddings. Recent GPU systems based on function secret sharing (FSS) make linear layers efficient, but fixed-point nonlinearities and helper operations remain a bottleneck because each operator is typically implemented as a bespoke protocol with its own comparisons, wrap-around corrections, and preprocessing material. We present FuseFSS, a compiler that replaces per-operator protocol design with a single compilation pipeline. For each scalar fixed-point operator, a compact specification lists its interval partition, low-degree arithmetic pieces, and required predicate bits. The compiler emits two batched FSS evaluations on the public masked value: one packed comparison that returns all predicate bits, and one vector interval lookup that returns the active coefficients and constants. Compared to the current state-of-the-art FSS-based GPU secure inference, FuseFSS preserves accuracy while achieving a 1.24× to 1.50× end-to-end speedup and reducing online communication by 9% to 16% on BERT and GPT-style models; preprocessing is also lighter, with 14% to 23% lower key-generation time and 20% to 24% smaller keys.
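The shape of such an operator specification can be illustrated in the clear. This sketch has no secret sharing, masking, or fixed-point arithmetic; it only shows the two-step structure of computing predicate bits and then looking up the active polynomial piece. `OperatorSpec`, its fields, and the clipped-ReLU example are hypothetical and are not FuseFSS's specification format.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class OperatorSpec:
    """Plaintext stand-in for a piecewise low-degree operator description."""
    breakpoints: List[float]         # sorted interval boundaries
    coefficients: List[List[float]]  # one polynomial per interval, constant term first


def predicate_bits(spec: OperatorSpec, x: float) -> List[int]:
    # One comparison per breakpoint, computed together as a batch.
    return [1 if x >= b else 0 for b in spec.breakpoints]


def interval_lookup(spec: OperatorSpec, bits: List[int]) -> List[float]:
    # Breakpoints are sorted, so the number of set bits is the interval index.
    return spec.coefficients[sum(bits)]


def evaluate(spec: OperatorSpec, x: float) -> float:
    coeffs = interval_lookup(spec, predicate_bits(spec, x))
    result = 0.0
    for c in reversed(coeffs):  # Horner's rule, highest-order term first
        result = result * x + c
    return result


# Clipped ReLU: 0 for x < 0, x for 0 <= x < 6, and 6 for x >= 6.
clipped_relu = OperatorSpec(
    breakpoints=[0.0, 6.0],
    coefficients=[[0.0, 0.0], [0.0, 1.0], [6.0, 0.0]],
)
print([evaluate(clipped_relu, x) for x in (-1.0, 3.0, 7.0)])
```

Describing an operator as data rather than as a hand-written protocol is what lets a single pipeline handle many nonlinearities; the cryptographic evaluation itself is outside the scope of this sketch.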

arXiv cs.CR | Dependability and Security | arXiv | approach | Confidentiality, integrity, privacy | cs.CR | cs.AI | arXiv cs.CR

SecureClaw: Clawing Back Control of LLM Agents

Yuhan Ma, Stefan Schmid
2026/06/08 22:29 (3 days ago)

Tool-using large language model (LLM) agents face two distinct security failures: unauthorized external actions and exposure of sensitive plaintext inside the runtime before any final output check can intervene. Existing defenses usually protect one boundary, either the planner/runtime or the action sink, and therefore do not by themselves secure both surfaces. We present SecureClaw, a dual-boundary architecture that places authorization at the effect sink and plaintext confinement at the read boundary. Sensitive reads pass through a trusted gateway that replaces raw values with opaque handles and, in the evaluated deployment, bounded summaries as an explicit declassification interface. Writes that change external state follow a PREVIEW→COMMIT protocol in which only a trusted executor may commit the exact canonical request authorized by policy. The runtime can still plan over summaries and symbolic references, but cannot directly dereference secrets or perform side effects. Across AgentDojo, AgentLeak, and Agent Security Bench (ASB), SecureClaw is the only defense we evaluate in a common harness that simultaneously retains usable task utility and achieves 0% attack success rate (ASR) on ASB, 0.64% ASR on AgentDojo, and 3.23% overall leak on AgentLeak's attacked parity lane, which measures final-output and internal-relay leakage.
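The two boundaries can be sketched in a few lines. This is an illustrative toy, not SecureClaw's implementation: the `Gateway` and `Executor` classes, the handle format, the use of canonical JSON with a SHA-256 digest, and the allow-list policy are all assumptions made for the example.

```python
import hashlib
import json
import secrets


class Gateway:
    """Trusted read boundary: the runtime sees handles, never raw values."""

    def __init__(self) -> None:
        self._vault = {}

    def read(self, secret_value: str) -> str:
        handle = "handle:" + secrets.token_hex(8)
        self._vault[handle] = secret_value
        return handle

    def resolve(self, value):
        # Intended to be called only by trusted code such as the executor.
        if isinstance(value, str):
            return self._vault.get(value, value)
        return value


def canonical(request: dict) -> str:
    # Assumed canonical form: JSON with sorted keys and fixed separators.
    return json.dumps(request, sort_keys=True, separators=(",", ":"))


class Executor:
    """Trusted effect sink: commits only the exact request it previewed."""

    def __init__(self, gateway: Gateway, allowed_actions) -> None:
        self._gateway = gateway
        self._allowed = set(allowed_actions)
        self._pending = {}
        self.effects = []  # stands in for real external side effects

    def preview(self, request: dict) -> str:
        if request.get("action") not in self._allowed:
            raise PermissionError("action not authorized by policy")
        form = canonical(request)
        digest = hashlib.sha256(form.encode("utf-8")).hexdigest()
        self._pending[digest] = form
        return digest

    def commit(self, request: dict, digest: str) -> None:
        if self._pending.get(digest) != canonical(request):
            raise PermissionError("request differs from the previewed one")
        del self._pending[digest]
        # Handles are dereferenced here, outside the untrusted runtime.
        resolved = {k: self._gateway.resolve(v) for k, v in request.items()}
        self.effects.append(resolved)


gateway = Gateway()
executor = Executor(gateway, allowed_actions={"send_email"})
token = gateway.read("s3cret-token")  # the runtime plans with this handle
request = {"action": "send_email", "to": "ops@example.com", "body": token}
ticket = executor.preview(request)
executor.commit(request, ticket)
```

In this toy, a request altered between preview and commit no longer matches the stored canonical form and is rejected, and the planning code only ever holds the opaque handle.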