If there are no kinks in the objective (e.g. use of tanh nonlinearities and softmax), then a relative error of 1e-4 is too high. Also keep in mind that the deeper the network, the higher the relative errors will be. So if you are gradient checking the input data for a 10-layer network, a relative error of 1e-2 might be okay because the errors build up on the way. Conversely, an error of 1e-2 for a single differentiable function likely indicates an incorrect gradient.

A common pitfall is using single precision floating point to compute the gradient check. It is often the case that you might get high relative errors (as high as 1e-2) even with a correct gradient implementation. In my experience I've sometimes seen relative errors plummet from 1e-2 to 1e-8 by switching to double precision.

Stick around the active range of floating point. It's a good idea to read through "What Every Computer Scientist Should Know About Floating-Point Arithmetic", as it may demystify your errors and enable you to write more careful code. For example, in neural nets it can be common to normalize the loss function over the batch. However, if your gradients per datapoint are very small, then additionally dividing them by the number of data points starts to give very small numbers, which in turn leads to more numerical issues. This is why I like to always print the raw numerical/analytic gradient, and make sure that the numbers you are comparing are not extremely small (e.g. roughly 1e-10 and smaller in absolute value is worrying). If they are, you may want to temporarily scale your loss function up by a constant to bring them to a "nicer" range where floats are more dense, ideally on the order of 1.0, where your float exponent is 0.
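As a rough illustration of these points, here is a minimal centered-difference gradient check sketch in double precision that also prints the raw numeric/analytic values so you can spot magnitudes drifting toward 1e-10. The helper names (`rel_error`, `numerical_gradient`) and the max-based denominator in the relative error are my own illustrative choices, not fixed by the text:

```python
import numpy as np

def rel_error(a, b):
    # Elementwise relative error |a - b| / max(|a|, |b|), with a small floor on the denominator.
    return np.abs(a - b) / np.maximum(1e-12, np.maximum(np.abs(a), np.abs(b)))

def numerical_gradient(f, x, h=1e-5):
    # Centered finite differences: (f(x+h) - f(x-h)) / (2h), one coordinate at a time.
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        ix = it.multi_index
        old = x[ix]
        x[ix] = old + h
        fxph = f(x)
        x[ix] = old - h
        fxmh = f(x)
        x[ix] = old                      # restore the original value
        grad[ix] = (fxph - fxmh) / (2.0 * h)
        it.iternext()
    return grad

# Toy usage: a smooth loss with a known analytic gradient.
x = np.random.randn(3, 4).astype(np.float64)   # double precision, not float32
f = lambda x: np.sum(x ** 2)                   # toy differentiable "loss"
analytic = 2.0 * x                             # its analytic gradient

numeric = numerical_gradient(f, x)
# Print raw values: if these are around 1e-10 or smaller, consider temporarily
# scaling the loss up by a constant before trusting the relative error.
print(numeric.ravel()[:3], analytic.ravel()[:3])
print("max relative error:", np.max(rel_error(numeric, analytic)))
```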
One source of inaccuracy to be aware of during gradient checking is the problem of kinks. Kinks refer to non-differentiable parts of an objective function, introduced by functions such as ReLU (\(max(0,x)\)), the SVM loss, Maxout neurons, etc. Consider gradient checking the ReLU function at \(x = -1e-6\). Since \(x < 0\), the analytic gradient at that point is exactly zero. However, the numerical gradient can suddenly compute a non-zero gradient because \(f(x+h)\) might cross over the kink (e.g. if \(h > 1e-6\)) and introduce a non-zero contribution. You might think that this is a pathological case, but in fact this case can be very common. For example, an SVM for CIFAR-10 contains up to 450,000 \(max(0,x)\) terms because there are 50,000 examples and each example yields 9 terms to the objective.
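To see the effect concretely, here is a small sketch of that ReLU example (the step size \(h = 1e-5\) is an arbitrary illustrative choice): the centered difference at \(x = -1e-6\) reports a spurious non-zero gradient because \(x + h\) crosses the kink at zero.

```python
import numpy as np

relu = lambda x: np.maximum(0.0, x)   # ReLU; its analytic gradient is 0 for x < 0

x = -1e-6   # just below the kink at 0
h = 1e-5    # step larger than |x|, so x + h lands on the positive side of the kink

analytic = 0.0                                    # true local gradient at x < 0
numeric = (relu(x + h) - relu(x - h)) / (2 * h)   # centered difference crosses the kink

print(analytic, numeric)   # 0.0 vs. roughly 0.45: a mismatch caused by the kink, not a bug
```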