Interest in machine learning and data science has grown rapidly in recent years. More and more students are enrolling in online data science courses that are great at teaching them how to fit machine-learning algorithms to simple data sets. Most of these online courses do a fantastic job of explaining complex machine-learning techniques; however, only a few of them delve into the mathematical statistics behind the fancy algorithms. The fundamentals of statistics are grossly undervalued in these courses. For example, there are many so-called data scientists who cannot distinguish between discrete and continuous data. It may seem trivial, but I’ve seen many people simply assume that their data is continuous when in fact it is discrete; the most common example is count data that follows a Poisson distribution. Understanding the properties of the various distributions is extremely important in making sense of your data.

To understand the properties of a certain distribution, it is always helpful to simulate data points and plot them visually. With the help of Python 3, we will go through and simulate the most common simple distributions in the world of data science. We won’t explain each distribution in detail; this research can be done in your own time (we provide useful links and resources). Here we will only simulate various popular distributions that can be helpful in many applications. The first step is to import the required libraries.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

**Standard Normal Distribution**

The Normal Distribution contains the word “Normal” because it is arguably the distribution that describes the most types of phenomena. For example, IQ scores, heights, and shoe sizes are all approximately normally distributed. You can find a detailed explanation of the normal distribution here; it explains the notation used and the central limit theorem.
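The central limit theorem mentioned above is easy to see empirically: averages of repeated draws from even a clearly non-normal distribution pile up in a normal-looking bell. A minimal sketch using a uniform distribution (the sample sizes and seed here are illustrative choices, not from the original):

```python
import numpy as np

np.random.seed(123)

# Each row is one experiment: 50 draws from a uniform(0, 1) distribution,
# which is decidedly not normal.
samples = np.random.uniform(size=(10000, 50))

# The central limit theorem says the row means should be approximately
# normal with mean 0.5 and standard deviation sqrt(1/12) / sqrt(50).
means = samples.mean(axis=1)
print(means.mean())  # close to 0.5
print(means.std())   # close to 0.041
```

Plotting `means` with the same `sns.distplot` call used below for the standard normal gives a visual confirmation of the bell shape.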

sns.set(color_codes=True)
# Set random seed for reproducibility
np.random.seed(123)
# Simulate the standard normal distribution (loc = 0 and scale = 1)
x = np.random.normal(size=500, loc=0, scale=1)
# Plot
fig = plt.figure(figsize=(10, 6))
ax = sns.distplot(x, fit=stats.norm, axlabel="values",
                  kde_kws={"color": "r", "lw": 3, "label": "KDE"},
                  fit_kws={"color": "black", "lw": 3, "label": "StdNormal"})
plt.legend(labels=['Kernel Density', 'Standard Normal Dist'])
plt.title("Standard Normal Distribution")
plt.show(block=False)

**Binomial Distribution**

The Binomial Distribution is discrete and is used to model the number of successes in a fixed number of trials. When we simulate this distribution, it’s useful to set the size parameter, which defines how many times we want to run the experiment. Flipping a coin is the most intuitive way to think about the binomial distribution.

np.random.seed(123)
x = np.random.binomial(size=100, n=10, p=0.5)
# Plot
fig = plt.figure(figsize=(10, 6))
ax = sns.distplot(x, axlabel="Values",
                  kde_kws={"color": "r", "lw": 3, "label": "KDE"})
# Annotate each bar with its height (the empirical probability)
for p in ax.patches:
    ax.annotate("%.2f" % p.get_height(),
                (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', fontsize=11, color='black',
                xytext=(0, 20), textcoords='offset points')
plt.legend(labels=['Kernel Density', 'Binomial Distribution'])
plt.ylim(0, 0.6)
plt.title("Binomial Distribution")
plt.show(block=False)

In the plot above we have flipped a coin 10 times and repeated this experiment 100 times. For instance, the proportion of experiments in which the coin landed on heads (denoted as a success) exactly 5 times is 0.30. Note that this is an empirical distribution and not the theoretical one. Also, as n (the number of flips) gets larger, the binomial distribution is increasingly well approximated by the normal distribution. Notice that the kernel density (red line) closely resembles the normal distribution. Have a look at Khan Academy for a detailed explanation of the distribution.
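The normal approximation can also be checked numerically by comparing the exact binomial probability at a central value with the corresponding normal density. A small sketch, where the choice of n = 1000 flips and 10,000 repetitions is arbitrary and only for illustration:

```python
import numpy as np
from scipy import stats

np.random.seed(123)
n, p = 1000, 0.5
x = np.random.binomial(n=n, p=p, size=10000)

# Matching normal parameters: mean n*p and std sqrt(n*p*(1-p))
mu = n * p
sigma = np.sqrt(n * p * (1 - p))
print(x.mean(), x.std())  # close to mu = 500 and sigma = 15.81

# Exact P(X = 500) vs. the normal density at the same point
print(stats.binom.pmf(500, n, p))      # roughly 0.025
print(stats.norm.pdf(500, mu, sigma))  # roughly 0.025
```

The two printed probabilities agree to about three decimal places, which is the normal approximation at work.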

**Poisson Distribution**

The Poisson Distribution is used to model events that occur at random points in time, where we are interested in the number of occurrences of the event: for example, the number of goals in a match or the number of calls recorded per day. Lambda is defined as the rate of the event multiplied by the length of the time interval. Stat Trek is a good place to get started on the Poisson distribution.

np.random.seed(123)
x = np.random.poisson(lam=3, size=10000)
fig = plt.figure(figsize=(10, 6))
ax = sns.distplot(x, axlabel="values", kde=False)
plt.title("Poisson Distribution")
plt.ylabel("frequency")
plt.show(block=False)

Note that in this simulation we plot the frequency and not the density of the distribution. To plot with the density on the y-axis, you’d only need to change ‘kde = False’ to ‘kde = True’ in the code above.
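Whether you plot frequencies or densities, the simulated proportions should track the theoretical Poisson probability mass function, which `scipy.stats.poisson` provides. A quick check of the simulation above (the loop bound of 6 is just for brevity):

```python
import numpy as np
from scipy import stats

np.random.seed(123)
lam = 3
x = np.random.poisson(lam=lam, size=10000)

# Compare the empirical proportion of each count with the theoretical pmf
for k in range(6):
    empirical = np.mean(x == k)
    theoretical = stats.poisson.pmf(k, lam)
    print(k, round(empirical, 4), round(theoretical, 4))
```

With 10,000 samples the two columns agree to roughly two decimal places; increasing the sample size tightens the match further.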

**Exponential Distribution**

Referring back to the Poisson distribution and the example of the number of goals scored per match, a natural question arises: how would one model the interval of time between the goals? The popular Exponential distribution answers exactly this. Here, lambda is the rate parameter, and a lower rate corresponds to a flatter, more spread-out curve.

np.random.seed(123)
x = np.random.exponential(size=500, scale=1.5)
fig = plt.figure(figsize=(10, 6))
ax = sns.distplot(x, fit=stats.expon, axlabel="values",
                  kde_kws={"color": "r", "lw": 3, "label": "KDE"},
                  fit_kws={"color": "black", "lw": 3, "label": "exp"})
plt.legend(labels=['Kernel Density', 'Exponential Dist'])
plt.title("Exponential Distribution")
plt.show(block=False)

Have a look here for a detailed explanation of the Exponential distribution and its applications.
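The connection between the Exponential and Poisson distributions can be verified by simulation: if the gaps between events are exponential with scale 1/lambda, then the number of events in a unit interval is Poisson(lambda). A sketch under that assumption, with an illustrative rate of 3 events per interval (note that `np.random.exponential` takes the scale, i.e. the mean gap, not the rate):

```python
import numpy as np

np.random.seed(123)
lam = 3.0  # illustrative rate, e.g. 3 goals per match on average

# Gaps between events in a Poisson process are exponential with scale 1/lam
gaps = np.random.exponential(scale=1 / lam, size=100000)
print(gaps.mean())  # close to 1/3

# Cumulative gaps give event times; counting events per unit interval
# recovers Poisson(lam)-distributed counts
arrival_times = np.cumsum(gaps)
counts = np.bincount(arrival_times.astype(int))
print(counts[:1000].mean())  # close to 3
```

This is why the two distributions appear together so often: the Poisson counts the events, while the Exponential times the waits between them.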