Wassily Hoeffding (1914–1991) was one of the founding fathers of non-parametric statistics (picture credit: http://stat-or.unc.edu)
Deep learning is the talk of the town these days, and with the advent of frameworks like TensorFlow, Keras and scikit-learn, anyone can implement it with ease. This is why everyone's first hunch when dealing with data is to somehow apply deep learning to it, or at least some form of machine learning. However, what most of us don't realize is that to have a theoretical guarantee over learning, and then testing in such a way that error is minimized when the model is deployed in the real world, we need considerably large data sets. And such large data sets are very hard to get.
This theoretical guarantee is of utmost importance when dealing with medical or health-related data, because to generate confidence intervals (ranges within which your in-sample point predictions can sway with a given probability, or confidence level) we need good probabilistic bounds. One such bound, and also the most widely used, is Hoeffding's inequality:

P(|ν − μ| > ε) ≤ 2e^(−2ε²n)
Simply put, it says that the probability that the sample mean ν of n independent observations of a (bounded) random variable differs from the true (population) mean μ by more than some value ε is small. This ε can also be understood as the error. How small is the probability? Well, it is 2e^(−2ε²n) small. All of us are familiar with negative exponentials: they decay very quickly. We then bound this quantity by α, an upper bound on 2e^(−2ε²n) such that 2e^(−2ε²n) ≤ α. α is the significance level, and it gives us a (1 − α) confidence interval: with probability at least 1 − α, the sample statistic will not deviate from the population statistic by more than ε.
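To see the bound in action, here is a quick simulation sketch (not from the original post): we estimate the mean of a fair coin, whose true mean μ = 0.5 is assumed for illustration, and check that the empirical probability of deviating by more than ε stays below the Hoeffding bound.

```python
import math
import random

random.seed(0)
n, eps, trials = 100, 0.1, 10_000
mu = 0.5  # true mean of a fair coin (assumed for this sketch)

# Empirical probability that |sample mean - mu| > eps over many repetitions
deviations = 0
for _ in range(trials):
    sample_mean = sum(random.random() < mu for _ in range(n)) / n
    if abs(sample_mean - mu) > eps:
        deviations += 1

empirical = deviations / trials
bound = 2 * math.exp(-2 * eps**2 * n)  # Hoeffding's bound: 2e^(-2ε²n)
print(empirical, bound)  # the empirical probability should not exceed the bound
```

Note that the bound (about 0.27 here) is quite loose compared to the observed frequency; Hoeffding's inequality trades tightness for generality, holding for any bounded random variable regardless of its distribution.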
We can use this bound, set by Hoeffding's inequality, to calculate the sample size needed to attain this small error. All we have to do is a little algebra to solve for n.
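Spelling out that algebra, starting from the bound on the deviation probability:

```latex
2e^{-2\varepsilon^2 n} \le \alpha
\;\Rightarrow\; e^{-2\varepsilon^2 n} \le \frac{\alpha}{2}
\;\Rightarrow\; -2\varepsilon^2 n \le \ln\frac{\alpha}{2}
\;\Rightarrow\; n \ge \frac{1}{2\varepsilon^2}\ln\frac{2}{\alpha}
```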
For example, how many test examples should we have if we want to be 95% confident that our model's output on the sample will not deviate by more than 0.1 in the real world? We can simply plug in the numbers. From 2e^(−2ε²n) ≤ α, we have n ≥ (1/(2ε²)) ln(2/α). Here ε = 0.1 and α = 0.05, which gives n ≥ (1/0.02) ln(40) ≈ 184.44, so we need at least 185 test examples.
Following is the Python code we can use to calculate the sample size for any values of ε and α:
import math

def Hoeffding_SampleSize(error, alpha):
    # Solve 2e^(-2*error^2*n) <= alpha for n:  n >= ln(2/alpha) / (2*error^2)
    n = math.log(2 / alpha) / (2 * error**2)
    return math.ceil(n)  # round up to the next whole sample