
How To Use TensorFlow BatchNormalization With GradientTape?

Suppose we have a simple Keras model that uses BatchNormalization: model = tf.keras.Sequential([ tf.keras.layers.InputLayer(input_shape=(1,)),

Solution 1:

When training with tf.GradientTape, the BatchNormalization layer should be called with the argument training=True.

Example:

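import tensorflow as tf
KL = tf.keras.layers
KM = tf.keras.models

inp = KL.Input((64, 64, 3))
x = inp
x = KL.Conv2D(3, kernel_size=3, padding='same')(x)
x = KL.BatchNormalization()(x, training=True)  # always normalize with batch statistics
model = KM.Model(inp, x)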

Then the moving statistics (moving mean and moving variance) are updated properly:

>>> model.layers[2].weights[2]
<tf.Variable 'batch_normalization/moving_mean:0' shape=(3,) dtype=float32, numpy=
array([-0.00062087,  0.00015137, -0.00013239], dtype=float32)>
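The effect of the training flag can be checked on a single layer. The following is a minimal sketch, assuming TensorFlow 2.x in eager mode; the input data and variable names are made up for illustration. It calls one BatchNormalization layer first in inference mode and then in training mode, and compares the moving mean before and after.

```python
import numpy as np
import tensorflow as tf

# Illustrative input: 32 samples, 3 features, mean around 5 (assumed data).
rng = np.random.RandomState(0)
x = tf.constant(rng.normal(5.0, 2.0, size=(32, 3)), dtype=tf.float32)

bn = tf.keras.layers.BatchNormalization()

# Inference mode: builds the layer but leaves the moving statistics untouched.
bn(x, training=False)
mean_before = bn.moving_mean.numpy().copy()

# Training mode: normalizes with batch statistics and updates the moving averages.
bn(x, training=True)
mean_after = bn.moving_mean.numpy()

print("moving_mean before:", mean_before)
print("moving_mean after: ", mean_after)
```

With the default momentum of 0.99, a single training call moves the moving mean only a small step toward the batch mean, which is why the values in the output above are so close to zero.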

Solution 2:

I just give up. I spent quite a bit of time trying to make sense of a model that looks like:

model = tf.keras.Sequential([
                     tf.keras.layers.BatchNormalization(),
])

And I do give up because of what that thing produces: (image not preserved)

My intuition was that BatchNorm these days is not as straightforward as it used to be, and that is why it scales the original distribution but not so much a new distribution (which is a shame), but ain't nobody got time for that.

Edit: the reason for that behavior is that BN only computes batch moments and normalizes with them during training. During training it also maintains running averages of the mean and variance, and once you switch to evaluation those running averages are used as constants. That is, evaluation should not depend on batch statistics, because it may be run on a single input. Since the constants were estimated on a different distribution, you get a higher error during evaluation.
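The train/eval difference described here can be reproduced without TensorFlow. Below is a rough NumPy sketch, not the Keras implementation: TinyBatchNorm is a hypothetical, simplified layer with no learned scale or shift, and the momentum and epsilon values are assumptions chosen to resemble common defaults.

```python
import numpy as np

class TinyBatchNorm:
    """Hypothetical, simplified batch norm over a 1-D batch (no gamma/beta)."""

    def __init__(self, momentum=0.99, eps=1e-3):
        self.momentum = momentum
        self.eps = eps
        self.moving_mean = 0.0
        self.moving_var = 1.0

    def __call__(self, x, training):
        if training:
            # Training: normalize with the batch moments and update the running averages.
            mean, var = x.mean(), x.var()
            m = self.momentum
            self.moving_mean = m * self.moving_mean + (1 - m) * mean
            self.moving_var = m * self.moving_var + (1 - m) * var
        else:
            # Evaluation: the running averages are used as constants.
            mean, var = self.moving_mean, self.moving_var
        return (x - mean) / np.sqrt(var + self.eps)

rng = np.random.default_rng(0)
bn = TinyBatchNorm()

# "Train" on batches drawn from N(5, 2^2) so the running averages converge.
for _ in range(1000):
    bn(rng.normal(5.0, 2.0, size=64), training=True)

# Evaluate on the training distribution and on a shifted one.
same = bn(rng.normal(5.0, 2.0, size=1000), training=False)
shifted = bn(rng.normal(20.0, 2.0, size=1000), training=False)

print("mean of output, same distribution:   ", same.mean())
print("mean of output, shifted distribution:", shifted.mean())
```

Data from the training distribution comes out roughly centered at zero, while the shifted data stays far from zero because the stored constants no longer match it.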

Solution 3:

With Gradient Tape mode, you would usually find gradients like:

with tf.GradientTape() as tape:
    y_pred = model(features)
    loss = your_loss_function(y_pred, y_true)
# compute gradients after leaving the tape context
gradients = tape.gradient(loss, model.trainable_variables)

train_op = model.optimizer.apply_gradients(zip(gradients, model.trainable_variables))

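However, if your model contains a BatchNormalization or Dropout layer (or any layer that behaves differently in the train and test phases), calling the model without stating the phase is a problem: TF may fail to build the graph, and in TF 2.x Keras the call falls back to inference behavior, so the batch statistics are neither used nor updated.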

A good practice is to pass the training argument explicitly whenever you call the model. When optimizing use model(features, training=True) and when predicting use model(features, training=False), so that the train/test phase is chosen explicitly for such layers.

For PREDICT and EVAL phase, use

training = (mode == tf.estimator.ModeKeys.TRAIN)
y_pred = model(features, training=training)

For TRAIN phase, use

with tf.GradientTape() as tape:
    y_pred = model(features, training=training)  # training is True in the TRAIN phase
    loss = your_loss_function(y_pred, y_true)
# compute gradients after leaving the tape context
gradients = tape.gradient(loss, model.trainable_variables)

train_op = model.optimizer.apply_gradients(zip(gradients, model.trainable_variables))
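For reference, here is a self-contained version of such a training step. It is a sketch assuming TensorFlow 2.x in eager mode. The toy model, the SGD optimizer, the mean-squared-error loss, and the random data are all placeholders chosen for illustration, not part of the original question.

```python
import tensorflow as tf

# Placeholder model containing a BatchNormalization layer (assumed architecture).
model = tf.keras.Sequential([
    tf.keras.layers.Dense(4),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(1),
])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
loss_fn = tf.keras.losses.MeanSquaredError()  # Keras losses take (y_true, y_pred)

# Random stand-in data: 16 samples with 3 features and one target each.
features = tf.random.normal((16, 3))
y_true = tf.random.normal((16, 1))

def train_step(features, y_true):
    with tf.GradientTape() as tape:
        # training=True: use batch statistics and update the moving averages.
        y_pred = model(features, training=True)
        loss = loss_fn(y_true, y_pred)
    # Differentiate only with respect to the trainable variables.
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss

loss = train_step(features, y_true)

# training=False: use the stored moving averages, e.g. for evaluation or prediction.
y_eval = model(features, training=False)
print("loss:", float(loss), "eval output shape:", tuple(y_eval.shape))
```

The same model object serves both phases; only the training argument at call time changes.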

Note that iperov's answer works as well, except that you will need to set the training phase manually for those layers when building the model:

# model built for the training phase
x = BatchNormalization()(x, training=True)
x = Dropout(rate=0.25)(x, training=True)

# model built for the evaluation/prediction phase
x = BatchNormalization()(x, training=False)
x = Dropout(rate=0.25)(x, training=False)

I'd recommend having a single get_model function that returns the model, and changing the phase with the training argument when calling the model.

Note:

If you use model.variables when computing gradients, you'll get this warning:

Gradients do not exist for variables 
['layer_1_bn/moving_mean:0', 
'layer_1_bn/moving_variance:0', 
'layer_2_bn/moving_mean:0', 
'layer_2_bn/moving_variance:0'] 
when minimizing the loss.

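This can be resolved by computing gradients only with respect to the trainable variables: replace model.variables with model.trainable_variables. The moving mean and moving variance are updated by the layer itself during training calls, not by gradient descent, so they have no gradients.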
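To see which variables are affected, the two lists can be compared directly. The snippet below is a small sketch assuming TensorFlow 2.x; the two-layer model is a made-up example, and the exact variable names printed depend on the Keras version.

```python
import tensorflow as tf

# Made-up model with one BatchNormalization layer.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(2),
    tf.keras.layers.BatchNormalization(name="layer_1_bn"),
])
model(tf.zeros((4, 3)), training=False)  # call once so the variables are created

n_all = len(model.variables)
n_trainable = len(model.trainable_variables)
non_trainable_names = [v.name for v in model.non_trainable_variables]

print("all variables:      ", n_all)
print("trainable variables:", n_trainable)
print("non-trainable:      ", non_trainable_names)
```

The only non-trainable variables in this model are the two moving statistics of the BatchNormalization layer, which are the ones named in the warning.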
