Python Code and Normal Distribution: Writing CDF from Scratch

Normal Distribution CDF

Normal Distribution is one of ‘the’ most applicable distribution in statistics. For some time I have been trying to develop my own library code to get myself in shape for the sport of data science. Lately, I have also been thinking much about complexity of algorithms. But before I could develop functions for distributions I faced a problem of evaluating areas under the distribution curves. At first it was hard to find some efficient descretized formulae and I wondered if the world still loads standard score tables as data structures in programs and use them to evaluate percentiles?! To cut it short I have found the following resources helpful (in the reference section) to write the code for calculating points under a standard normal Distribution Curve.

But first… A little wisdom in Horner’s Method of coding polynomials:

1) We could evaluate a polynomial of the form a + bx + cx^2 + dx^3 by coding as a + b*x + c*x*x + d*x*x*x. 2) But we can save computational power by coding it as ((d*x + c)*x + b)*x + a. 3) The polynomials below were coded this way; bringing down the complexity of this algorithm from O(n^2) to O(n).

phi(x) returns gives a probability that a statistic is less than Z. This equates to the area of the distribution below Z.e.g: Pr(Z = 0.69) = 0.7549. This value is usually given in Z tables.

phi_compcum(x) this Complementary cumulative function returns a probability that a statistic is greater than Z. This equates tothe area of the distribution above Z. e.g: Pr(Z = 0.69) = 1 – 0.7549 = 0.2451

phi_cumformu(x) returns the Cumulative from mean gives a probability that a statistic is between 0 (mean) and Z.

q(p) returns the inverse of the phi().

Following is the code:

import math

###################################################################################################

def erf(x):
    ''' John D. Cook's implementation.http://www.johndcook.com
    >> Formula 7.1.26 given in Abramowitz and Stegun.
    >> Formula appears as 1 – (a1t1 + a2t2 + a3t3 + a4t4 + a5t5)exp(-x2)
    >> A little wisdom in Horner's Method of coding polynomials:
        1) We could evaluate a polynomial of the form a + bx + cx^2 + dx^3 by coding as a + b*x + c*x*x + d*x*x*x.
        2) But we can save computational power by coding it as ((d*x + c)*x + b)*x + a.
        3) The formula below was coded this way bringing down the complexity of this algorithm from O(n2) to O(n).'''

    # constants
    a1 =  0.254829592
    a2 = -0.284496736
    a3 =  1.421413741
    a4 = -1.453152027
    a5 =  1.061405429
    p  =  0.3275911

    # Save the sign of x
    sign = 1
    if x < 0:
        sign = -1
    x = abs(x)

    # Formula 7.1.26 given in Abramowitz and Stegun.
    t = 1.0/(1.0 + p*x)
    y = 1.0 - (((((a5*t + a4)*t) + a3)*t + a2)*t + a1)*t*math.exp(-x*x)

    return sign*y

####################################################################################################

def phi(x):
    '''Cumulative gives a probability that a statistic
    is less than Z. This equates to the area of the
    distribution below Z.
    e.g:  Pr(Z = 0.69) = 0.7549. This value is usually
    given in Z tables.'''

    return 0.5*( 1.0 + erf(x/math.sqrt(2)) )

#####################################################################################################

def phi_compcum(x):
    ''' Complementary cumulative gives a probability
    that a statistic is greater than Z. This equates to
    the area of the distribution above Z.
    e.g: Pr(Z  =  0.69) = 1 - 0.7549 = 0.2451'''

    return abs(phi(x) - 1)

#####################################################################################################

def phi_cumformu(x):
    '''Cumulative from mean gives a probability
    that a statistic is between 0 (mean) and Z.
    e.g: Pr(0 = Z = 0.69) = 0.2549'''

    return phi_compcum(0) - phi_compcum(x)

def formula(t):
    #constants
    c0 = 2.515517
    c1 = 0.802853
    c2 = 0.010328
    d0 = 1.432788
    d1 = 0.189269
    d2 = 0.001308   

    # Formula
    p = t - ((c2*t + c1)*t + c0) / (((d2*t + d1)*t + d0)*t + 1.0)
    return  p

def q(p):
    if (p < 0.5):
        #F^-1(p) = - G^-1(p)
        return -formula( math.sqrt(-2.0*math.log(p)) )

    else:
        #F^-1(p) = G^-1(1-p)
        return formula( math.sqrt(-2.0*math.log(1-p)) )

    return q

References:

http://www.johndcook.com/normal_cdf_inverse.html
Statistical Algorithms, R. Chattamvelli. Narosa Publishing. 2012
http://www.johndcook.com/blog/2009/01/19/stand-alone-normal-gaussian-distribution-function/
Handbook of Mathematical Functions, With Formulas, Graphs, and Mathematical Tables. Abramowitz and Steagun. Dover Publications. 1974
http://en.wikipedia.org/wiki/Error_function

#Statistics #bigdata #Bioinformatics #Probability #DataAnalysis #ComputationalStatistics #CDF #Python #code #Computing #function #datamining #Mathematics #NormalDistribution #Algorithm #CumulativeDensityFunction #R #datascience #ScientificProgramming #Science

Python Code and Normal Distribution: Writing CDF from Scratch

Recent Posts

Comentarios