Learn by Doing: TorchInductor Reduction Kernels
This is the second post in the “Learn by Doing” series, where I try to explain the inner workings of reduction kernels in TorchInductor. The motivation behind this post is that most posts on the internet cover GEMM kernels, yet reduction kernels are equally important in ML workloads. In this post I will explain how TorchInductor generates reduction kernels by walking through an example.
Reduction Overview
Reduction operations are fundamental in DL models: summing elements, finding maximum values, or computing averages across dimensions of tensors. In PyTorch, these operations are typically performed with functions like torch.sum, torch.max, and torch.mean. TorchInductor optimizes them by generating efficient reduction kernels for various hardware backends; in this post we will focus on the GPU backend and specifically on Triton code generation. In LLMs, reductions are used widely beyond GEMM operations. For example, the softmax in the attention mechanism involves a reduction to compute the sum of exponentials, and RMSNorm and LayerNorm involve reductions to compute the mean and variance across specific dimensions of the input tensor.
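To make the reduction patterns above concrete, here is a minimal pure-Python sketch (not Inductor's generated Triton code) of the two reductions inside a numerically stable softmax and the mean/variance reductions behind LayerNorm-style normalization. The function names are mine, for illustration only:

```python
import math

def softmax(row):
    """Numerically stable softmax over one row: two reductions
    (a max, then a sum of exponentials), the same structure a
    generated attention-softmax kernel must implement."""
    m = max(row)                            # reduction 1: row max
    exps = [math.exp(x - m) for x in row]   # elementwise exp
    s = sum(exps)                           # reduction 2: sum of exp
    return [e / s for e in exps]

def mean_var(row):
    """LayerNorm-style statistics: mean and variance are each a
    reduction over the normalized dimension."""
    n = len(row)
    mean = sum(row) / n                           # reduction: sum
    var = sum((x - mean) ** 2 for x in row) / n   # reduction: sum of squares
    return mean, var

probs = softmax([1.0, 2.0, 3.0])
print(sum(probs))          # softmax output sums to 1.0
print(mean_var([1.0, 2.0, 3.0]))
```

When these patterns appear inside a `torch.compile`-d model, TorchInductor fuses the surrounding elementwise work into the reduction loop rather than emitting each step as a separate kernel.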