<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>hgpu.org</title>
	<atom:link href="https://hgpu.org/?feed=rss2" rel="self" type="application/rss+xml" />
	<link>https://hgpu.org</link>
	<description>High performance computing on Graphics Processing Units</description>
	<lastBuildDate>Sun, 12 Apr 2026 21:17:07 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>
<site xmlns="com-wordpress:feed-additions:1">56702024</site>	<item>
		<title>MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU</title>
		<link>https://hgpu.org/?p=30722</link>
					<comments>https://hgpu.org/?p=30722#respond</comments>
		
		<dc:creator><![CDATA[hgpu]]></dc:creator>
		<pubDate>Sun, 12 Apr 2026 21:17:07 +0000</pubDate>
				<category><![CDATA[Computer science]]></category>
		<category><![CDATA[CUDA]]></category>
		<category><![CDATA[paper]]></category>
		<category><![CDATA[LLM]]></category>
		<category><![CDATA[nVidia]]></category>
		<category><![CDATA[nVidia GH200]]></category>
		<category><![CDATA[Package]]></category>
		<guid isPermaLink="false">https://hgpu.org/?p=30722</guid>

					<description><![CDATA[We present MegaTrain, a memory-centric system that efficiently trains 100B+ parameter large language models at full precision on a single GPU. Unlike traditional GPU-centric systems, MegaTrain stores parameters and optimizer states in host memory (CPU memory) and treats GPUs as transient compute engines. For each layer, we stream parameters in and compute gradients out, minimizing [&#8230;]]]></description>
										<content:encoded><![CDATA[<p>We present MegaTrain, a memory-centric system that efficiently trains 100B+ parameter large language models at full precision on a single GPU. Unlike traditional GPU-centric systems, MegaTrain stores parameters and optimizer states in host memory (CPU memory) and treats GPUs as transient compute engines. For each layer, we stream parameters in and compute gradients out, minimizing persistent device state. To battle the CPU-GPU bandwidth bottleneck, we adopt two key optimizations. 1) We introduce a pipelined double-buffered execution engine that overlaps parameter prefetching, computation, and gradient offloading across multiple CUDA streams, enabling continuous GPU execution. 2) We replace persistent autograd graphs with stateless layer templates, binding weights dynamically as they stream in, eliminating persistent graph metadata while providing flexibility in scheduling. On a single H200 GPU with 1.5TB host memory, MegaTrain reliably trains models up to 120B parameters. It also achieves 1.84× the training throughput of DeepSpeed ZeRO-3 with CPU offloading when training 14B models. MegaTrain also enables 7B model training with 512k token context on a single GH200.</p>
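<p>As a rough illustration of the pipelined double-buffered overlap described above, the following PyTorch sketch prefetches each layer&#8217;s weights on a copy stream while the compute stream consumes the previous buffer, using events for synchronization. The function name, the two-buffer layout, and the matmul stand-in for a layer&#8217;s forward pass are illustrative assumptions, not MegaTrain&#8217;s API.</p>
<pre><code>
import torch

def stream_layers(x, host_weights):
    """x: activations on the GPU; host_weights: pinned CPU tensors, one per layer."""
    compute = torch.cuda.current_stream()
    copy = torch.cuda.Stream()
    bufs = [torch.empty_like(host_weights[0], device="cuda") for _ in range(2)]
    ready = [torch.cuda.Event() for _ in range(2)]   # weights have landed on device
    freed = [torch.cuda.Event() for _ in range(2)]   # buffer is no longer in use
    for e in freed:
        e.record(compute)                            # both buffers start out free

    for i, w in enumerate(host_weights):
        slot = i % 2
        with torch.cuda.stream(copy):
            copy.wait_event(freed[slot])             # never overwrite a live buffer
            bufs[slot].copy_(w, non_blocking=True)   # H2D prefetch overlaps compute
            ready[slot].record(copy)
        compute.wait_event(ready[slot])              # gate on this layer's weights only
        x = x @ bufs[slot]                           # stand-in for the layer's forward
        freed[slot].record(compute)
    return x

# Usage: weights must sit in pinned host memory for truly asynchronous copies,
# e.g. ws = [torch.randn(1024, 1024).pin_memory() for _ in range(8)]
#      y = stream_layers(torch.randn(1, 1024, device="cuda"), ws)
</code></pre>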
]]></content:encoded>
					
					<wfw:commentRss>https://hgpu.org/?feed=rss2&#038;p=30722</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">30722</post-id>	</item>
		<item>
		<title>Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization</title>
		<link>https://hgpu.org/?p=30721</link>
					<comments>https://hgpu.org/?p=30721#respond</comments>
		
		<dc:creator><![CDATA[hgpu]]></dc:creator>
		<pubDate>Sun, 12 Apr 2026 21:17:07 +0000</pubDate>
				<category><![CDATA[Computer science]]></category>
		<category><![CDATA[CUDA]]></category>
		<category><![CDATA[paper]]></category>
		<category><![CDATA[Heterogeneous systems]]></category>
		<category><![CDATA[nVidia]]></category>
		<category><![CDATA[Triton]]></category>
		<guid isPermaLink="false">https://hgpu.org/?p=30721</guid>

					<description><![CDATA[We present Kernel-Smith, a framework for high-performance GPU kernel and operator generation that combines a stable evaluation-driven evolutionary agent with an evolution-oriented post-training recipe. On the agent side, Kernel-Smith maintains a population of executable candidates and iteratively improves them using an archive of top-performing and diverse programs together with structured execution feedback on compilation, correctness, [&#8230;]]]></description>
										<content:encoded><![CDATA[<p>We present Kernel-Smith, a framework for high-performance GPU kernel and operator generation that combines a stable evaluation-driven evolutionary agent with an evolution-oriented post-training recipe. On the agent side, Kernel-Smith maintains a population of executable candidates and iteratively improves them using an archive of top-performing and diverse programs together with structured execution feedback on compilation, correctness, and speedup. To make this search reliable, we build backend-specific evaluation services for Triton on NVIDIA GPUs and MACA on MetaX GPUs. On the training side, we convert long-horizon evolution trajectories into step-centric supervision and reinforcement learning signals by retaining correctness-preserving, high-gain revisions, so that the model is optimized as a strong local improver inside the evolutionary loop rather than as a one-shot generator. Under a unified evolutionary protocol, Kernel-Smith-235B-RL achieves state-of-the-art overall performance on KernelBench with the NVIDIA Triton backend, attaining the best average speedup ratio and outperforming frontier proprietary models including Gemini-3.0-pro and Claude-4.6-opus. We further validate the framework on the MetaX MACA backend, where our Kernel-Smith-MACA-30B surpasses large-scale counterparts such as DeepSeek-V3.2-think and Qwen3-235B-2507-think, highlighting its potential for seamless adaptation across heterogeneous platforms. Beyond benchmark results, the same workflow produces upstream contributions to production systems including SGLang and LMDeploy, demonstrating that LLM-driven kernel optimization can transfer from controlled evaluation to practical deployment.</p>
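<p>The control flow of such an evaluation-driven evolutionary agent can be sketched in a few lines of Python. Here mutate wraps the LLM reviser and evaluate wraps the backend evaluation service; the bounded, speedup-sorted archive is our simplification of the paper&#8217;s elite-plus-diversity archive, not its actual policy.</p>
<pre><code>
import random

def evolve(seed, mutate, evaluate, generations=50, archive_size=8):
    """mutate(program, feedback) and evaluate(program) -> (ok, speedup, feedback)
    are hypothetical stand-ins for the LLM reviser and the evaluation service."""
    ok, speedup, feedback = evaluate(seed)
    archive = [(speedup, feedback, seed)]              # (speedup, feedback, program)
    for _ in range(generations):
        _, parent_fb, parent = random.choice(archive)  # sample an archived parent
        child = mutate(parent, parent_fb)              # LLM acts as a local improver
        ok, speedup, feedback = evaluate(child)        # compile, check, and time it
        if ok:                                         # keep only correct candidates
            archive.append((speedup, feedback, child))
            archive.sort(key=lambda t: -t[0])          # best speedup first
            del archive[archive_size:]                 # bounded archive
    return archive[0]                                  # champion entry
</code></pre>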
]]></content:encoded>
					
					<wfw:commentRss>https://hgpu.org/?feed=rss2&#038;p=30721</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">30721</post-id>	</item>
		<item>
		<title>CuTeGen: An LLM-Based Agentic Framework for Generation and Optimization of High-Performance GPU Kernels using CuTe</title>
		<link>https://hgpu.org/?p=30720</link>
					<comments>https://hgpu.org/?p=30720#respond</comments>
		
		<dc:creator><![CDATA[hgpu]]></dc:creator>
		<pubDate>Sun, 12 Apr 2026 21:17:07 +0000</pubDate>
				<category><![CDATA[Computer science]]></category>
		<category><![CDATA[CUDA]]></category>
		<category><![CDATA[paper]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[Matrix multiplication]]></category>
		<category><![CDATA[nVidia]]></category>
		<category><![CDATA[nVidia GeForce RTX 4090]]></category>
		<guid isPermaLink="false">https://hgpu.org/?p=30720</guid>

					<description><![CDATA[High-performance GPU kernels are critical to modern machine learning systems, yet developing efficient implementations remains a challenging, expert-driven process due to the tight coupling between algorithmic structure, memory hierarchy usage, and hardware-specific optimizations. Recent work has explored using large language models (LLMs) to generate GPU kernels automatically, but generated implementations often struggle to maintain correctness [&#8230;]]]></description>
										<content:encoded><![CDATA[<p>High-performance GPU kernels are critical to modern machine learning systems, yet developing efficient implementations remains a challenging, expert-driven process due to the tight coupling between algorithmic structure, memory hierarchy usage, and hardware-specific optimizations. Recent work has explored using large language models (LLMs) to generate GPU kernels automatically, but generated implementations often struggle to maintain correctness and achieve competitive performance across iterative refinements. We present CuTeGen, an agentic framework for automated generation and optimization of GPU kernels that treats kernel development as a structured generate&#8211;test&#8211;refine workflow. Unlike approaches that rely on one-shot generation or large-scale search over candidate implementations, CuTeGen focuses on progressive refinement of a single evolving kernel through execution-based validation, structured debugging, and staged optimization. A key design choice is to generate kernels using the CuTe abstraction layer, which exposes performance-critical structures such as tiling and data movement while providing a more stable representation for iterative modification. To guide performance improvement, CuTeGen incorporates workload-aware optimization prompts and delayed integration of profiling feedback. Experimental results on matrix multiplication and activation workloads demonstrate that the framework produces functionally correct kernels and achieves competitive performance relative to optimized library implementations.</p>
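<p>Conceptually, the generate&#8211;test&#8211;refine workflow amounts to a loop over a single evolving kernel, with debugging gated before optimization and profiling feedback deliberately delayed. The sketch below is a schematic reading of that loop; all five callables and the profile_after threshold are hypothetical stand-ins, not CuTeGen&#8217;s interfaces.</p>
<pre><code>
def refine_kernel(spec, generate, run_tests, profile, revise,
                  budget=20, profile_after=10):
    """generate/revise wrap the LLM, run_tests returns a list of failures,
    profile returns a performance report; all are illustrative stand-ins."""
    kernel = generate(spec)                        # one-shot initial CuTe kernel
    correct_steps = 0
    for _ in range(budget):
        failures = run_tests(kernel)               # execution-based validation
        if failures:                               # structured debugging first;
            kernel = revise(kernel, hint=failures) # never optimize a broken kernel
            continue
        correct_steps += 1
        if correct_steps > profile_after:          # delayed profiling feedback
            hint = profile(kernel)
        else:
            hint = "apply the next staged optimization"
        kernel = revise(kernel, hint=hint)
    return kernel
</code></pre>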
]]></content:encoded>
					
					<wfw:commentRss>https://hgpu.org/?feed=rss2&#038;p=30720</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">30720</post-id>	</item>
		<item>
		<title>Agentic Code Optimization via Compiler-LLM Cooperation</title>
		<link>https://hgpu.org/?p=30719</link>
					<comments>https://hgpu.org/?p=30719#respond</comments>
		
		<dc:creator><![CDATA[hgpu]]></dc:creator>
		<pubDate>Sun, 12 Apr 2026 21:17:07 +0000</pubDate>
				<category><![CDATA[Computer science]]></category>
		<category><![CDATA[paper]]></category>
		<category><![CDATA[Code generation]]></category>
		<category><![CDATA[LLM]]></category>
		<category><![CDATA[Package]]></category>
		<guid isPermaLink="false">https://hgpu.org/?p=30719</guid>

					<description><![CDATA[Generating performant executables from high-level languages is critical to software performance across a wide range of domains. Modern compilers perform this task by passing code through a series of well-studied optimizations at progressively lower levels of abstraction, but may miss optimization opportunities that require high-level reasoning about a program&#8217;s purpose. Recent work has proposed [&#8230;]]]></description>
										<content:encoded><![CDATA[<p>Generating performant executables from high-level languages is critical to software performance across a wide range of domains. Modern compilers perform this task by passing code through a series of well-studied optimizations at progressively lower levels of abstraction, but may miss optimization opportunities that require high-level reasoning about a program&#8217;s purpose. Recent work has proposed using LLMs to fill this gap. While LLMs can achieve large speedups on some programs, they frequently generate code that is incorrect. In this work, we propose a method to balance the correctness of conventional compiler optimizations with the &#8220;creativity&#8221; of LLM-based code generation: compiler-LLM cooperation. Our approach integrates existing compiler optimization passes with LLM-based code generation at multiple levels of abstraction, retaining the best features of both types of code optimization. We realize our approach with a multi-agent system that includes (1) LLM-based optimization agents for each level of abstraction, (2) individual compiler constituents as tools, (3) an LLM-based test generation agent that probes the correctness and performance of generated code, and (4) a guiding LLM that orchestrates the other components. The strategy enables LLM-based optimization of input programs at multiple levels of abstraction and introduces a method for distributing computational budget between levels. Our extensive evaluation shows that compiler-LLM cooperation outperforms both existing compiler optimizations and level-specific LLM-based baselines, producing speedups of up to 1.25x.</p>
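<p>A schematic of the cooperation loop may help: at each abstraction level an LLM agent proposes rewrites that are kept only if they pass tests and run faster, after which trusted compiler passes run and the program is lowered. Everything named below, including the even per-level budget split, is an illustrative reading, not the paper&#8217;s interface.</p>
<pre><code>
def cooperate(program, levels, passes_tests, faster, budget=30):
    """levels: (llm_rewrite, compiler_passes, lower) triples, highest abstraction
    first; every callable here is an illustrative stand-in."""
    per_level = budget // len(levels)              # split the LLM call budget
    for llm_rewrite, compiler_passes, lower in levels:
        for _ in range(per_level):
            candidate = llm_rewrite(program)       # "creative" transformation
            if passes_tests(candidate) and faster(candidate, program):
                program = candidate                # keep only verified wins
        program = compiler_passes(program)         # conventional, trusted passes
        program = lower(program)                   # descend one abstraction level
    return program
</code></pre>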
]]></content:encoded>
					
					<wfw:commentRss>https://hgpu.org/?feed=rss2&#038;p=30719</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">30719</post-id>	</item>
		<item>
		<title>DVM: Real-Time Kernel Generation for Dynamic AI Models</title>
		<link>https://hgpu.org/?p=30718</link>
					<comments>https://hgpu.org/?p=30718#respond</comments>
		
		<dc:creator><![CDATA[hgpu]]></dc:creator>
		<pubDate>Sun, 12 Apr 2026 21:17:07 +0000</pubDate>
				<category><![CDATA[Computer science]]></category>
		<category><![CDATA[paper]]></category>
		<category><![CDATA[Code generation]]></category>
		<category><![CDATA[LLM]]></category>
		<category><![CDATA[Package]]></category>
		<guid isPermaLink="false">https://hgpu.org/?p=30718</guid>

					<description><![CDATA[Dynamism is common in AI computation, e.g., dynamic tensor shapes and dynamic control flow in models. Due to long compilation times, existing runtime compilation degrades model efficiency, while offline compilers either suffer from long compilation times and a large device memory footprint to cover all the possible execution instances of a [&#8230;]]]></description>
										<content:encoded><![CDATA[<p>Dynamism is common in AI computation, e.g., dynamic tensor shapes and dynamic control flow in models. Due to long compilation times, existing runtime compilation degrades model efficiency, while offline compilers either suffer from long compilation times and a large device memory footprint to cover all the possible execution instances of a dynamic model, or sacrifice optimization opportunities for usability. In this paper, we rethink the feasibility of runtime compilation for dynamic models and identify that the key to making it work is to speed up compilation or hide its overhead. To do this, we propose a real-time compiler, DVM. In DVM, we design a runtime operator compiler based on a bytecode virtual machine to perform effective and efficient compilation for each dynamic operator instance given its input. Specifically, instead of compiling programs into machine code, we encode the operator program into bytecode on the CPU and decode the bytecode into virtual instructions for direct execution on the NPU. Based on the runtime operator compiler, we further propose an operator fuser, which performs symbol-deduction-based fusion on static graphs and runtime fusion on dynamic graphs. Both pattern- and stacking-based fusion are supported to increase fusion opportunities. Evaluation on operators, subgraphs, and models shows that, compared with TorchInductor, PyTorch-eager and MindSpore-graph-O0, we are up to 11.77× better in operator/model efficiency and up to 5 orders of magnitude faster in maximum compilation time.</p>
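<p>The encode-on-CPU/decode-on-device split can be illustrated with a toy bytecode interpreter. The opcode set and the NumPy &#8220;device&#8221; below are stand-ins chosen for brevity; DVM&#8217;s actual virtual instructions target the NPU.</p>
<pre><code>
import numpy as np

ADD, MUL, RELU = range(3)                 # toy opcode set

def encode(program):
    """CPU side: serialize an operator program into flat bytecode tuples.
    (Here the toy program is already flat, so this is the identity.)"""
    return list(program)

def execute(bytecode, regs):
    """'Device' side: decode each virtual instruction and dispatch directly,
    with no machine-code generation step in between."""
    for op, a, b, dst in bytecode:
        if op == ADD:
            regs[dst] = regs[a] + regs[b]
        elif op == MUL:
            regs[dst] = regs[a] * regs[b]
        elif op == RELU:                  # unary: operand b is ignored
            regs[dst] = np.maximum(regs[a], 0.0)
    return regs

regs = {0: np.array([1.0, -2.0]), 1: np.array([3.0, 4.0])}
out = execute(encode([(MUL, 0, 1, 2), (RELU, 2, 2, 3)]), regs)
print(out[3])                             # [3. 0.]
</code></pre>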
]]></content:encoded>
					
					<wfw:commentRss>https://hgpu.org/?feed=rss2&#038;p=30718</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">30718</post-id>	</item>
		<item>
		<title>DRTriton: Large-Scale Synthetic Data Reinforcement Learning for Triton Kernel Generation</title>
		<link>https://hgpu.org/?p=30706</link>
					<comments>https://hgpu.org/?p=30706#respond</comments>
		
		<dc:creator><![CDATA[hgpu]]></dc:creator>
		<pubDate>Wed, 25 Mar 2026 22:04:10 +0000</pubDate>
				<category><![CDATA[Computer science]]></category>
		<category><![CDATA[CUDA]]></category>
		<category><![CDATA[paper]]></category>
		<category><![CDATA[LLM]]></category>
		<category><![CDATA[nVidia]]></category>
		<category><![CDATA[Triton]]></category>
		<guid isPermaLink="false">https://hgpu.org/?p=30706</guid>

					<description><![CDATA[Developing efficient CUDA kernels is a fundamental yet challenging task in the generative AI industry. Recent research leverages Large Language Models (LLMs) to automatically convert PyTorch reference implementations to CUDA kernels, significantly reducing engineering effort. State-of-the-art LLMs, such as GPT-5.2 and Claude-Sonnet-4.5, still struggle with this specific task. To address this challenge, we propose [&#8230;]]]></description>
										<content:encoded><![CDATA[<p>Developing efficient CUDA kernels is a fundamental yet challenging task in the generative AI industry. Recent research leverages Large Language Models (LLMs) to automatically convert PyTorch reference implementations to CUDA kernels, significantly reducing engineering effort. State-of-the-art LLMs, such as GPT-5.2 and Claude-Sonnet-4.5, still struggle with this specific task. To address this challenge, we propose DRTriton, a scalable learning framework for training LLMs to convert PyTorch code into highly optimized Triton kernels, which are then compiled to CUDA kernels at runtime. DRTriton consists of three key components: (i) a data synthesis algorithm, CSP-DAG, that guarantees full coverage and unbiased uniform sampling over the operator space with controlled difficulty; (ii) a curriculum reinforcement learning scheme with a decoupled reward that efficiently optimizes conversion success rate and inference speed simultaneously; and (iii) a test-time search algorithm that further improves the inference speed of the generated Triton kernels. Notably, despite being trained exclusively on synthetic data, DRTriton generalizes effectively to real-world CUDA kernels that are challenging even for human experts. Experimental results show that DRTriton-7B achieves speedups on 92% of the KernelBench Level 2 tasks, compared to 23% for GPT-5.2 and 19% for Claude-Sonnet-4.5.</p>
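<p>For readers unfamiliar with the target representation, the sketch below shows the flavor of a PyTorch-to-Triton conversion on the simplest possible operator, elementwise addition. It is a generic textbook kernel, not DRTriton output; the block size of 1024 is an arbitrary choice.</p>
<pre><code>
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n                               # guard the ragged last block
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

def add(x, y):                                    # PyTorch reference: x + y
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK=1024)
    return out

# Usage: x = torch.randn(4096, device="cuda")
#        torch.testing.assert_close(add(x, x), x + x)
</code></pre>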
]]></content:encoded>
					
					<wfw:commentRss>https://hgpu.org/?feed=rss2&#038;p=30706</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">30706</post-id>	</item>
		<item>
		<title>High-level Programming of Vulkan-based GPUs Through OpenMP</title>
		<link>https://hgpu.org/?p=30705</link>
					<comments>https://hgpu.org/?p=30705#respond</comments>
		
		<dc:creator><![CDATA[hgpu]]></dc:creator>
		<pubDate>Wed, 25 Mar 2026 22:04:09 +0000</pubDate>
				<category><![CDATA[Computer science]]></category>
		<category><![CDATA[CUDA]]></category>
		<category><![CDATA[OpenCL]]></category>
		<category><![CDATA[paper]]></category>
		<category><![CDATA[AMD Radeon RX 550]]></category>
		<category><![CDATA[nVidia]]></category>
		<category><![CDATA[OpenMP]]></category>
		<category><![CDATA[Tesla P40]]></category>
		<category><![CDATA[Vulkan]]></category>
		<guid isPermaLink="false">https://hgpu.org/?p=30705</guid>

					<description><![CDATA[Modern applications often involve complex, structured or data-parallel computations on large datasets. Traditionally, GPUs have served as the primary accelerators for such tasks, mostly through compute-focused models like CUDA and OpenCL. Vulkan is a more recent cross-platform API, widely adopted for both high-performance graphics and compute. These models require lower-level programming, as developers have to [&#8230;]]]></description>
										<content:encoded><![CDATA[<p>Modern applications often involve complex, structured or data-parallel computations on large datasets. Traditionally, GPUs have served as the primary accelerators for such tasks, mostly through compute-focused models like CUDA and OpenCL. Vulkan is a more recent cross-platform API, widely adopted for both high-performance graphics and compute. These models require lower-level programming, as developers have to be aware of architectural details; this is not easily accomplished given the dramatic rise in hardware heterogeneity. It has thus become increasingly desirable to adopt higher-level models that abstract away the low-level hardware and API details, and simplify GPU programming. In this paper we present a full-fledged OpenMP translator and runtime offloading infrastructure that targets the Vulkan Compute pipeline. While previous works usually focus on OpenCL or CUDA, this is the first time an OpenMP compiler targets Vulkan shaders. As such, apart from the support for off-the-shelf NVIDIA and AMD GPUs, we are the first to provide OpenMP support for mobile and embedded GPUs, such as VideoCore GPUs. The proposed translator, which is based on an open-source compilation framework, receives standard OpenMP code and converts it to tunable Vulkan shaders. Our approach preserves the simplicity of higher-level programming, while still achieving high performance, as demonstrated by our experimental results.</p>
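<p>To make the translation concrete: a data-parallel OpenMP loop conceptually maps to a Vulkan compute shader with one invocation per iteration. The Python snippet below only emits an illustrative GLSL string to show that mapping; the buffer bindings and workgroup size are hypothetical, not the translator&#8217;s actual output.</p>
<pre><code>
def emit_compute_shader(body_glsl, local_size=64):
    """Emit a Vulkan/GLSL compute shader whose global invocation ID plays
    the role of the loop index; buffer bindings here are hypothetical."""
    return f"""#version 450
layout(local_size_x = {local_size}) in;
layout(std430, binding = 0) buffer BufA {{ float a[]; }};
layout(std430, binding = 1) buffer BufB {{ float b[]; }};
void main() {{
    uint i = gl_GlobalInvocationID.x;  // one invocation per loop iteration
    {body_glsl}
}}"""

# A loop such as
#   #pragma omp target teams distribute parallel for
#   for (int i = 0; i < n; i++) a[i] = 2.0f * b[i];
# would, conceptually, become one dispatch of:
print(emit_compute_shader("a[i] = 2.0f * b[i];"))
</code></pre>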
]]></content:encoded>
					
					<wfw:commentRss>https://hgpu.org/?feed=rss2&#038;p=30705</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">30705</post-id>	</item>
		<item>
		<title>Mixed-precision numerics in scientific applications: survey and perspectives</title>
		<link>https://hgpu.org/?p=30704</link>
					<comments>https://hgpu.org/?p=30704#respond</comments>
		
		<dc:creator><![CDATA[hgpu]]></dc:creator>
		<pubDate>Wed, 25 Mar 2026 22:04:09 +0000</pubDate>
				<category><![CDATA[Computer science]]></category>
		<category><![CDATA[paper]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[Mixed precision]]></category>
		<category><![CDATA[nVidia]]></category>
		<category><![CDATA[nVidia V100]]></category>
		<category><![CDATA[Review]]></category>
		<guid isPermaLink="false">https://hgpu.org/?p=30704</guid>

					<description><![CDATA[The explosive demand for artificial intelligence (AI) workloads has led to a significant increase in silicon area dedicated to lower-precision computations on recent high-performance computing hardware designs. However, mixed-precision capabilities, which can achieve performance improvements of up to 8x compared to double-precision in extremely compute-intensive workloads, remain largely untapped in most scientific applications. A growing [&#8230;]]]></description>
										<content:encoded><![CDATA[<p>The explosive demand for artificial intelligence (AI) workloads has led to a significant increase in silicon area dedicated to lower-precision computations on recent high-performance computing hardware designs. However, mixed-precision capabilities, which can achieve performance improvements of up to 8x compared to double-precision in extremely compute-intensive workloads, remain largely untapped in most scientific applications. A growing number of efforts have shown that mixed-precision algorithmic innovations can deliver superior performance without sacrificing accuracy. These developments should prompt computational scientists to seriously consider whether their scientific modeling and simulation applications could benefit from the acceleration offered by new hardware and mixed-precision algorithms. In this survey, we (1) review progress across diverse scientific domains &#8211; fluid dynamics, weather and climate, quantum chemistry, and computational genomics &#8211; that have begun adopting mixed-precision strategies; (2) examine state-of-the-art algorithmic techniques such as iterative refinement, splitting and emulation schemes, and adaptive precision solvers; (3) assess their implications for accuracy, performance, and resource utilization; and (4) survey the emerging software ecosystem that enables mixed-precision methods at scale. We conclude with perspectives and recommendations on cross-cutting opportunities, domain-specific challenges, and the role of co-design between application scientists, numerical analysts, and computer scientists. Collectively, this survey underscores that mixed-precision numerics can reshape computational science by aligning algorithms with the evolving landscape of hardware capabilities.</p>
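<p>Iterative refinement, one of the surveyed techniques, is compact enough to sketch: solve in low precision, then repeatedly correct with residuals computed in high precision. The NumPy example below is a minimal illustration on an assumed well-conditioned system; a production code would factor the float32 matrix once (e.g. an LU decomposition) and reuse it for every solve.</p>
<pre><code>
import numpy as np

def solve_refined(A, b, iters=5):
    """Mixed-precision iterative refinement: float32 solves, float64 residuals."""
    A32 = A.astype(np.float32)
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x                                    # residual in double
        d = np.linalg.solve(A32, r.astype(np.float32))   # cheap low-precision fix
        x += d.astype(np.float64)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 200)) + 200 * np.eye(200)  # well conditioned
b = rng.standard_normal(200)
print(np.linalg.norm(A @ solve_refined(A, b) - b))       # near double precision
</code></pre>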
]]></content:encoded>
					
					<wfw:commentRss>https://hgpu.org/?feed=rss2&#038;p=30704</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">30704</post-id>	</item>
		<item>
		<title>AutoKernel: Autonomous GPU Kernel Optimization via Iterative Agent-Driven Search</title>
		<link>https://hgpu.org/?p=30703</link>
					<comments>https://hgpu.org/?p=30703#respond</comments>
		
		<dc:creator><![CDATA[hgpu]]></dc:creator>
		<pubDate>Wed, 25 Mar 2026 22:04:09 +0000</pubDate>
				<category><![CDATA[Computer science]]></category>
		<category><![CDATA[CUDA]]></category>
		<category><![CDATA[paper]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[nVidia]]></category>
		<category><![CDATA[nVidia B200]]></category>
		<category><![CDATA[nVidia H100]]></category>
		<category><![CDATA[Package]]></category>
		<category><![CDATA[Triton]]></category>
		<guid isPermaLink="false">https://hgpu.org/?p=30703</guid>

					<description><![CDATA[Writing high-performance GPU kernels is among the most labor-intensive tasks in machine learning systems engineering. We present AutoKernel, an open-source framework that applies an autonomous agent loop to GPU kernel optimization for arbitrary PyTorch models. Given a model, AutoKernel profiles it to identify computational bottlenecks, ranks them by Amdahl&#8217;s law impact, and iteratively refines Triton [&#8230;]]]></description>
										<content:encoded><![CDATA[<p>Writing high-performance GPU kernels is among the most labor-intensive tasks in machine learning systems engineering. We present AutoKernel, an open-source framework that applies an autonomous agent loop to GPU kernel optimization for arbitrary PyTorch models. Given a model, AutoKernel profiles it to identify computational bottlenecks, ranks them by Amdahl&#8217;s law impact, and iteratively refines Triton or CUDA C++ kernel implementations through hundreds of experiments without human intervention. A five-stage correctness harness covering smoke tests, shape sweeps, numerical stability, determinism verification, and edge-case coverage ensures every candidate kernel is validated before any speedup is recorded. The system comprises over 9,000 lines of Python, 18 starter kernel implementations across two backends, a six-tier optimization playbook, and integration with the KernelBench benchmark suite. AutoKernel covers nine kernel types spanning the dominant operations in modern transformer architectures. On an NVIDIA H100, our Triton kernels outperform both PyTorch eager and torch.compile (max-autotune) on the majority of tested configurations: 5.29x over eager on RMSNorm, 2.82x on softmax, and 2.21x on cross-entropy, while beating torch.compile by 2.83x, 3.44x, and 2.94x respectively. In community deployment, an AutoKernel-optimized kernel achieved first place on the vectorsum_v2 B200 leaderboard. The full system is available.</p>
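<p>The Amdahl&#8217;s-law ranking step is easy to make concrete. In this sketch, per-op runtimes and estimated kernel speedups (numbers loosely borrowed from the results above) are scored by the whole-model speedup their optimization would yield; the function and its inputs are illustrative, not AutoKernel&#8217;s profiler output.</p>
<pre><code>
def amdahl_rank(op_times, est_speedups):
    """Rank ops by the whole-model speedup their optimization would yield
    under Amdahl's law."""
    total = sum(op_times.values())
    def overall(op):
        f = op_times[op] / total              # fraction of total runtime
        s = est_speedups[op]                  # estimated per-kernel speedup
        return 1.0 / ((1.0 - f) + f / s)      # Amdahl's law
    return sorted(op_times, key=overall, reverse=True)

times = {"rmsnorm": 3.0, "softmax": 2.0, "matmul": 9.0}     # ms per step
gains = {"rmsnorm": 5.29, "softmax": 2.82, "matmul": 1.10}
print(amdahl_rank(times, gains))   # ['rmsnorm', 'softmax', 'matmul']
</code></pre>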
]]></content:encoded>
					
					<wfw:commentRss>https://hgpu.org/?feed=rss2&#038;p=30703</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">30703</post-id>	</item>
		<item>
		<title>Triton-Sanitizer: A Fast and Device-Agnostic Memory Sanitizer for Triton with Rich Diagnostic Context</title>
		<link>https://hgpu.org/?p=30696</link>
					<comments>https://hgpu.org/?p=30696#respond</comments>
		
		<dc:creator><![CDATA[hgpu]]></dc:creator>
		<pubDate>Sun, 22 Mar 2026 20:58:43 +0000</pubDate>
				<category><![CDATA[Computer science]]></category>
		<category><![CDATA[paper]]></category>
		<category><![CDATA[AMD]]></category>
		<category><![CDATA[AMD Radeon Instinct MI250X]]></category>
		<category><![CDATA[Deep learning]]></category>
		<category><![CDATA[nVidia]]></category>
		<category><![CDATA[nVidia GeForce RTX 4090]]></category>
		<category><![CDATA[ROCm]]></category>
		<category><![CDATA[Triton]]></category>
		<guid isPermaLink="false">https://hgpu.org/?p=30696</guid>

					<description><![CDATA[Memory access errors remain one of the most pervasive bugs in GPU programming. Existing GPU sanitizers such as compute-sanitizer detect memory access errors by instrumenting every memory instruction in low-level IRs or binaries, which imposes high overhead and provides minimal memory access error diagnostic context for fixing problems. We present Triton-Sanitizer, the first device-agnostic memory [&#8230;]]]></description>
										<content:encoded><![CDATA[<p>Memory access errors remain one of the most pervasive bugs in GPU programming. Existing GPU sanitizers such as compute-sanitizer detect memory access errors by instrumenting every memory instruction in low-level IRs or binaries, which imposes high overhead and provides minimal memory access error diagnostic context for fixing problems. We present Triton-Sanitizer, the first device-agnostic memory sanitizer designed for Triton, a domain-specific language for developing portable, efficient GPU kernels for deep learning workloads. Triton-Sanitizer leverages Triton&#8217;s tile-oriented semantics to construct symbolic expressions for memory addresses and masks, verifies them with an SMT solver, and selectively falls back to eager simulation for indirect accesses. This hybrid analysis enables precise detection of memory access errors without false positives while avoiding the cost of per-access instrumentation. Beyond detection, Triton-Sanitizer generates rich diagnostic reports that attribute violations to the tensors nearest to the violated addresses, track the complete call path, and expose the symbolic operations responsible for incorrect addresses. Evaluated on seven widely used open-source repositories of Triton kernels, Triton-Sanitizer uncovered 24 previously unknown memory access errors, of which 8 have already been fixed and upstreamed by us. Compared to compute-sanitizer, Triton-Sanitizer achieves speedups ranging from 1.07× to 14.66×, with an average improvement of 1.62×, demonstrating its ability to enhance performance, precision, and usability in memory access error detection.</p>
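<p>The symbolic core of the approach can be demonstrated with the z3 SMT solver: build symbolic expressions for a tile access&#8217;s offset and mask, then ask whether any in-mask lane can fall outside the tensor. The tile shape and mask below are hypothetical and far simpler than what Triton-Sanitizer extracts from real kernels.</p>
<pre><code>
import z3

def mask_guards_load(n_elements, block=128):
    """Ask the solver whether any in-mask lane can reach an offset
    outside [0, n_elements); the offset/mask shapes are hypothetical."""
    pid, lane = z3.Ints("pid lane")
    off = pid * block + lane                     # symbolic address expression
    mask = off < n_elements                      # symbolic mask, as in tl.load
    s = z3.Solver()
    s.add(pid >= 0, lane >= 0, lane < block)     # lane structure of the tile
    s.add(mask, z3.Or(off < 0, off >= n_elements))
    return s.check() == z3.unsat                 # unsat: no violating lane exists

print(mask_guards_load(1000))                    # True: the mask is sufficient
</code></pre>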
]]></content:encoded>
					
					<wfw:commentRss>https://hgpu.org/?feed=rss2&#038;p=30696</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">30696</post-id>	</item>
	</channel>
</rss>
