Hey @phlippe, the gradients w.r.t. the kernel can be checked like this:

    import jax
    from jax import random

    # Initialization (MyConv is the masked convolution module from the original question)
    feat_dim = 128
    module = MyConv(c_out=feat_dim, kernel_size=3)
    inp = random.normal(random.PRNGKey(123), (1, 5, 5, feat_dim))
    params = module.init(random.PRNGKey(0), inp)

    # Gradient of the center output pixel w.r.t. the parameters
    # (jax.grad differentiates w.r.t. its first argument by default)
    grad_fn = jax.grad(lambda params, x: module.apply(params, x)[:, 2, 2, :].sum())
    grads = grad_fn(params, inp)

    # For each input/output channel pair (i, j), print a (3, 3) boolean map showing
    # which spatial kernel positions received a zero gradient
    for i in range(3):
        for j in range(3):
            print()
            print(grads['params']['Conv_0']['kernel'][..., i, j] == 0)

Output:
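For reference, here is a minimal sketch (assuming `MyConv` is the masked convolution module from the original question) of the complementary check the next reply refers to: gradients of the same center output pixel taken w.r.t. the input rather than the parameters, obtained by passing `argnums=1` to `jax.grad`:

```python
import jax
from jax import random

# MyConv is assumed to be the masked convolution module from the original question.
feat_dim = 128
module = MyConv(c_out=feat_dim, kernel_size=3)
inp = random.normal(random.PRNGKey(123), (1, 5, 5, feat_dim))
params = module.init(random.PRNGKey(0), inp)

# Differentiate w.r.t. the second argument (the input) instead of the parameters.
input_grad_fn = jax.grad(lambda p, x: module.apply(p, x)[:, 2, 2, :].sum(), argnums=1)
input_grads = input_grad_fn(params, inp)   # shape (1, 5, 5, feat_dim)

# Aggregate over channels: rough (5, 5) map of which input positions received a zero gradient.
print(input_grads[0].sum(axis=-1) == 0)
```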
-
Hi @cgarciae, thanks for your response! The calculation of the gradients w.r.t. the input is more of a debugging step for a single layer, and becomes relevant once you stack multiple layers. For example, consider a two-layer network with these masks. When calculating the gradients w.r.t. the parameters of the input layer, we backpropagate through the features that feed into the output layer, since we can change these features by changing the parameters of the input layer, and the features have a direct impact on the output/loss. Hence, you can see the gradient above as an intermediate step of what is done in a deep NN. The gradient we would like to see w.r.t. the input has zeros in the second-to-last row, since the weights for these elements are set to zero by the mask and hence should not have any effect on the output. In other words, this is the expected output:
Now consider the behavior above with the gradients w.r.t. the inputs. Essentially, the gradients indicate that changing the intermediate features at position [3,2] (row 3, col 2, zero-indexed) leads to a change in the output features at [2,2]. However, this is not intended, since in the output layer the kernel has been masked such that position [3,2] has a weight of zero for output [2,2]. This influences the gradients w.r.t. the parameters of the input layer, and if you stack more layers, for example in a PixelCNN, it affects layers further down the network. In practice, the network is still trainable because the leaked gradients are a couple of orders of magnitude smaller than those of the non-masked elements. After a bit of further investigation, it seems that this behavior does not originate from Flax itself but from JAX's convolution operator, so I opened an issue on the JAX repo to get to the bottom of this behavior.
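A rough sketch of the two-layer setup described above, again assuming `MyConv` is the masked convolution from the original question: stacking two masked layers and taking the gradient of the center output pixel w.r.t. all parameters shows where the leaked input gradients of the second layer feed into the parameter gradients of the first layer.

```python
import jax
import jax.numpy as jnp
import flax.linen as nn
from jax import random

# MyConv is assumed to be the masked convolution from the original question,
# applying a PixelCNN-style causal mask.
class TwoLayer(nn.Module):
    feat_dim: int = 128

    @nn.compact
    def __call__(self, x):
        x = MyConv(c_out=self.feat_dim, kernel_size=3)(x)
        return MyConv(c_out=self.feat_dim, kernel_size=3)(x)

model = TwoLayer()
inp = random.normal(random.PRNGKey(123), (1, 5, 5, 128))
params = model.init(random.PRNGKey(0), inp)

# Gradient of the center output pixel w.r.t. all parameters. The first layer's
# kernel gradient picks up (tiny) contributions routed through intermediate
# position [3, 2], even though the second layer's mask assigns that position
# zero weight for output [2, 2].
grads = jax.grad(lambda p, x: model.apply(p, x)[:, 2, 2, :].sum())(params, inp)
print(jax.tree_util.tree_map(lambda g: jnp.abs(g).max(), grads))  # gradient magnitudes per parameter
```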
-
I think this is an artifact of the Winograd convolution kernel that NVIDIA implemented to accelerate small convolution kernels. Note that the values that are supposed to be zero are actually pretty small (~10^-9).
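One way to sanity-check this explanation (a sketch, assuming `MyConv` is the masked convolution from the original question and that its mask zeroes out the kernel rows below the center pixel) is to compare the magnitude of the supposedly-zero input gradients on the default GPU backend against the same computation committed to the CPU, where the convolution is typically not computed with the Winograd transform:

```python
import jax
import jax.numpy as jnp
from jax import random

# MyConv is assumed to be the masked convolution from the original question.
feat_dim = 128
module = MyConv(c_out=feat_dim, kernel_size=3)
inp = random.normal(random.PRNGKey(123), (1, 5, 5, feat_dim))
params = module.init(random.PRNGKey(0), inp)

grad_fn = jax.grad(lambda p, x: module.apply(p, x)[:, 2, 2, :].sum(), argnums=1)

# Default backend (GPU here, where cuDNN may pick a Winograd kernel).
gpu_grads = grad_fn(params, inp)

# Same computation with the data committed to the CPU.
cpu = jax.devices("cpu")[0]
cpu_grads = grad_fn(jax.device_put(params, cpu), jax.device_put(inp, cpu))

# Input rows below the center output pixel should not influence it at all.
print(jnp.abs(gpu_grads[0, 3:]).max())  # on GPU: tiny (~1e-9) round-off rather than exact zero
print(jnp.abs(cpu_grads[0, 3:]).max())  # on CPU: expected to be exactly 0.0
```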
-
Hi @jheek. May I ask whether there is currently an implementation of Winograd convolution available in Flax? Or would I need to implement a Winograd convolutional layer manually and replace the original convolutional layer in the model definition?