Connor Owen cowen33@gatech.edu
In statistics and simulation, there are many cases in which we would like to model a process or system using a random variable. There are many ways to do this; for example, one could decide that the interarrival time between customers of a store can be modeled by a exponential random variable, and estimate the exponential distribution parameters using maximum likelihood estimation. However, this method requires the assumption that the interarrival variable is indeed exponential, whereas this might not necessarily be the case.
For this reason, there exists a class of estimators called non-parametric estimators, that do not rely on any underlying distribution and rely only on the underlying data. This can be especially useful if the underlying random variable is unknown, does not fit well with other modeling techniques, or if the process is complex. Kernel Density Estimation (KDE) is one such non-parametric estimator that is widely used.
For the purposes of this discussion, a KDE can be seen as an improvement on a histogram. When modeling a probability density function, we can use a density (or relative frequency) histogram, which is another non-parametric method. Here we will look at an example of some randomly generated data of a exponential random variable disturbed by one of two classes of Gaussian random variable.
import numpy as np
sample_size = 1000
#exponential random variable
v = np.random.exponential(scale = 15, size = sample_size)
#first Gaussian random variable
u = np.random.normal(loc = 5 , scale = 3, size = sample_size )
#second Gaussian random variable
w = np.random.normal(loc = 75 , scale = 5, size = sample_size )
#new random variable is v plus either u or w, with more emphasis on u
z = np.array([i + np.random.choice([j,k], p=[0.75, 0.25]) for i,j,k in zip(v, u, w)])
Now that we have an array of our new random variable, we will plot a relative frequenecy histogram usng the hist( ) function from maplotlib.
%matplotlib inline
import matplotlib.pyplot as plt
plt.hist(z, bins=35, density = True)
plt.show()
While useful, histograms have a few drawbacks. These include that they are not smooth, depend on the bucket size, and depend on where the buckets end. This is where KDE comes in handy, since its estimator is smooth, and does not depend on where the buckets end. KDE still has the bucket size problem (called bandwidth); however, there are methods to optimize the selection of this parameter.
Given a random, independent and identically distributed random variable (like the one we generated above), the kernel density estimator of the function generating the shape is:
$$ \hat{f}_h(x) = \frac{1}{nh} \sum_{i = 1}^n K \left(\frac{x - x_i}{h}\right)$$where $h$ is our bandwith parameter, and $K$ is our kernel. There are many kernels to choose from, such as a uniform kernel, or a triangle kernel; however, for this tutorial we will cover the Gaussian kernel:
$$ K_\mathrm{Gaussian}(u) = \frac{1}{\sqrt{2\pi}} e^{\left(-\frac{1}{2}u^2\right)} $$While the kernel is an important choice, perhaps the most consequential choice will be our bandwith parameter, $h$.
from scipy import stats
kde1 = stats.gaussian_kde(z, bw_method = 0.1) #Here we set the bandwidth to 0.1
kde2 = stats.gaussian_kde(z, bw_method = 'scott') #Here we use the inbuilt 'Scott's Rule'
kde2.factor #This line prints out the bandwidth from the 2nd KDE
Here are two examples of different Gaussian kernels with varying bandwidths. For kde1, we set ourselves the bandwidth to 0.1, and for kde2, we use the built in "Scott's Rule" as implemented in SciPy. We see that even though the bandwidths are not very far apart, they create noticeably different estimations of the pdf.
fig = plt.figure()
ax = fig.add_subplot(111)
x_eval = np.linspace(-50, 150, num=1000)
ax.plot(x_eval, kde1(x_eval), 'k-', label="Custom", color = 'black', linewidth=2.5)
ax.plot(x_eval, kde2(x_eval), linestyle='--', label="Scott's Rule", color = 'orange',linewidth=2.5)
_= ax.hist(z, density=True, bins=35, color='#405a8a', alpha=0.5) #assign to variable to supress output
_= ax.legend(loc='best')
Compared to our histogram, the 1st KDE seems to fit well. However, as discussed before, the histogram itself can have drawbacks, so this KDE could be overfitting on the data. There are risk functions that can be minimized that estimate the error of the fit, such as mean integrated square error. In theory, these fits would be optimal; however there is no way to know the true distribution in practice, so the optimality is also an estimation.