## Bias correction in Adam: * beta or * 1/(1-beta)?

I’m investigating the TensorFlow implementation of the Adam optimiser.

When comparing the code in the implementation to several published pseudocode versions of Adam, it looks like the bias correction in the TensorFlow version is different to what I would expect.

I would like to know if this is a mistake or if there are other versions of the Adam optimiser with different approaches to bias correction. I haven’t seen the “wrong” version described anywhere else.

The pseudocode from the original paper applies the bias correction by dividing the moment estimates. In particular, $\hat{m}_t \leftarrow \frac{m_t}{1-\beta_1^t}$ and $\hat{v}_t \leftarrow \frac{v_t}{1-\beta_2^t}$.
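For reference, the paper's update with these two corrections can be written out for a single scalar parameter. This is a minimal sketch, not library code: the function name `adam_step_paper` is hypothetical, the default hyperparameters are the values commonly quoted for Adam, and `t` is assumed to start at 1 with both moments initialised to zero.

```python
import math


def adam_step_paper(theta, m, v, grad, t,
                    lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter, following the paper's pseudocode.

    Illustrative sketch only; names and defaults are assumptions.
    """
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad

    # Bias correction: divide by (1 - beta^t), where t is the step count.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

    # Parameter update using the corrected estimates.
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# First step from zero-initialised moments.
theta, m, v = adam_step_paper(theta=1.0, m=0.0, v=0.0, grad=2.0, t=1)
```

At `t = 1` the corrections give `m_hat = grad` and `v_hat = grad * grad`, so the first step has magnitude close to `lr` whatever the scale of a non-zero gradient.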

The TensorFlow code is:

```python
def _apply_sparse_shared(self, grad, var, indices, scatter_add):
  beta1_power, beta2_power = self._get_beta_accumulators()
  beta1_power = math_ops.cast(beta1_power, var.dtype.base_dtype)
  beta2_power = math_ops.cast(beta2_power, var.dtype.base_dtype)
  lr_t = math_ops.cast(self._lr_t, var.dtype.base_dtype)
  beta1_t = math_ops.cast(self._beta1_t, var.dtype.base_dtype)
  beta2_t = math_ops.cast(self._beta2_t, var.dtype.base_dtype)
  epsilon_t = math_ops.cast(self._epsilon_t, var.dtype.base_dtype)
  lr = (lr_t * math_ops.sqrt(1 - beta2_power) / (1 - beta1_power))
  # m_t = beta1 * m + (1 - beta1) * g_t
  m = self.get_slot(var, "m")
  m_scaled_g_values = grad * (1 - beta1_t)
  m_t = state_ops.assign(m, m * beta1_t, use_locking=self._use_locking)
  with ops.control_dependencies([m_t]):
    m_t = scatter_add(m, indices, m_scaled_g_values)
  # v_t = beta2 * v + (1 - beta2) * (g_t * g_t)
  v = self.get_slot(var, "v")
  v_scaled_g_values = (grad * grad) * (1 - beta2_t)
  v_t = state_ops.assign(v, v * beta2_t, use_locking=self._use_locking)
  with ops.control_dependencies([v_t]):
    v_t = scatter_add(v, indices, v_scaled_g_values)
  v_sqrt = math_ops.sqrt(v_t)
  var_update = state_ops.assign_sub(
      var, lr * m_t / (v_sqrt + epsilon_t), use_locking=self._use_locking)
  return control_flow_ops.group(*[var_update, m_t, v_t])
```
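To make the listing easier to compare against pseudocode, here is a line-by-line transcription of it for a single scalar parameter in plain Python. It is a sketch under stated assumptions, not the TensorFlow implementation: the name `adam_step_listing` is hypothetical, every index is assumed to be updated so that `scatter_add` reduces to ordinary addition, and `beta1_power` and `beta2_power` are assumed to hold `beta1 ** t` and `beta2 ** t` (the listing does not show how the accumulators are maintained).

```python
import math


def adam_step_listing(theta, m, v, grad, t,
                      lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Scalar, dense transcription of the listing above (illustrative only)."""
    # Assumption: the accumulators hold beta^t at step t (t starts at 1).
    beta1_power = beta1 ** t
    beta2_power = beta2 ** t

    # lr = lr_t * sqrt(1 - beta2_power) / (1 - beta1_power)
    lr_scaled = lr * math.sqrt(1 - beta2_power) / (1 - beta1_power)

    # assign(m, m * beta1_t) followed by scatter_add(m, indices, grad * (1 - beta1_t))
    m = m * beta1
    m = m + grad * (1 - beta1)

    # assign(v, v * beta2_t) followed by scatter_add(v, indices, grad^2 * (1 - beta2_t))
    v = v * beta2
    v = v + (grad * grad) * (1 - beta2)

    # assign_sub(var, lr * m_t / (sqrt(v_t) + epsilon_t))
    theta = theta - lr_scaled * m / (math.sqrt(v) + eps)
    return theta, m, v


# First step from zero-initialised moments.
theta, m, v = adam_step_listing(theta=1.0, m=0.0, v=0.0, grad=2.0, t=1)
```

Feeding the same inputs to this transcription and to a direct implementation of the paper's pseudocode is one way to check numerically whether the two formulations actually differ.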

The offending lines here are:

`m_t = state_ops.assign(m, m * beta1_t, use_locking=self._use_locking)`

and

`v_t = state_ops.assign(v, v * beta2_t, use_locking=self._use_locking)`

That is, $\hat{m}_t \leftarrow m_t \cdot \beta_1^t$ and $\hat{v}_t \leftarrow v_t \cdot \beta_2^t$.

**My question:** Is this a mistake, or is there some other accepted approach to the bias correction that I have missed?