If there are no kinks in the objective (e.g. use of tanh nonlinearities and softmax), then a relative error of 1e-4 is too high. Also keep in mind that the deeper the network, the higher the relative errors will be. So if you are gradient checking the input data for a 10-layer network, a relative error of 1e-2 might be okay because the errors build up on the way. Conversely, an error of 1e-2 for a single differentiable function likely indicates an incorrect gradient.

A common pitfall is using single precision floating point to compute the gradient check. It is often the case that you might get high relative errors (as high as 1e-2) even with a correct gradient implementation. In my experience I've sometimes seen relative errors plummet from 1e-2 to 1e-8 by switching to double precision.

Stick around the active range of floating point. It's a good idea to read through "What Every Computer Scientist Should Know About Floating-Point Arithmetic", as it may demystify your errors and enable you to write more careful code. For example, in neural nets it can be common to normalize the loss function over the batch. However, if your gradients per datapoint are very small, then additionally dividing them by the number of data points starts to give very small numbers, which in turn leads to more numerical issues. This is why I like to always print the raw numerical/analytic gradient, and make sure that the numbers you are comparing are not extremely small (e.g. roughly 1e-10 and smaller in absolute value is worrying). If they are, you may want to temporarily scale your loss function up by a constant to bring them to a "nicer" range where floats are more dense, ideally on the order of 1.0, where your float exponent is 0.
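As a rough illustration of these points, here is a minimal centered-difference gradient check sketch in double precision that also prints the raw numeric/analytic values so you can spot magnitudes drifting toward 1e-10. The helper names (`rel_error`, `numerical_gradient`) and the max-based denominator in the relative error are my own illustrative choices, not fixed by the text:

```python
import numpy as np

def rel_error(a, b):
    # Elementwise relative error |a - b| / max(|a|, |b|), with a small floor on the denominator.
    return np.abs(a - b) / np.maximum(1e-12, np.maximum(np.abs(a), np.abs(b)))

def numerical_gradient(f, x, h=1e-5):
    # Centered finite differences: (f(x+h) - f(x-h)) / (2h), one coordinate at a time.
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        ix = it.multi_index
        old = x[ix]
        x[ix] = old + h
        fxph = f(x)
        x[ix] = old - h
        fxmh = f(x)
        x[ix] = old                      # restore the original value
        grad[ix] = (fxph - fxmh) / (2.0 * h)
        it.iternext()
    return grad

# Toy usage: a smooth loss with a known analytic gradient.
x = np.random.randn(3, 4).astype(np.float64)   # double precision, not float32
f = lambda x: np.sum(x ** 2)                   # toy differentiable "loss"
analytic = 2.0 * x                             # its analytic gradient

numeric = numerical_gradient(f, x)
# Print raw values: if these are around 1e-10 or smaller, consider temporarily
# scaling the loss up by a constant before trusting the relative error.
print(numeric.ravel()[:3], analytic.ravel()[:3])
print("max relative error:", np.max(rel_error(numeric, analytic)))
```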
One source of inaccuracy to be aware of during gradient checking is the problem of kinks. Kinks refer to non-differentiable parts of an objective function, introduced by functions such as ReLU (\(max(0,x)\)), the SVM loss, Maxout neurons, etc. Consider gradient checking the ReLU function at \(x = -1e-6\). Since \(x < 0\), the analytic gradient at that point is exactly zero. However, the numerical gradient can suddenly compute a non-zero gradient because \(f(x+h)\) might cross over the kink (e.g. if \(h > 1e-6\)) and introduce a non-zero contribution. You might think that this is a pathological case, but in fact this case can be very common. For example, an SVM for CIFAR-10 contains up to 450,000 \(max(0,x)\) terms because there are 50,000 examples and each example yields 9 terms to the objective.
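To see the effect concretely, here is a small sketch of that ReLU example (the step size \(h = 1e-5\) is an arbitrary illustrative choice): the centered difference at \(x = -1e-6\) reports a spurious non-zero gradient because \(x + h\) crosses the kink at zero.

```python
import numpy as np

relu = lambda x: np.maximum(0.0, x)   # ReLU; its analytic gradient is 0 for x < 0

x = -1e-6   # just below the kink at 0
h = 1e-5    # step larger than |x|, so x + h lands on the positive side of the kink

analytic = 0.0                                    # true local gradient at x < 0
numeric = (relu(x + h) - relu(x - h)) / (2 * h)   # centered difference crosses the kink

print(analytic, numeric)   # 0.0 vs. roughly 0.45: a mismatch caused by the kink, not a bug
```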