What are the core statistical concepts for data scientists?

  • Introduction
  • Sampling and Population
  • Normal distribution
  • Central tendency
  • Central Limit Theorem / Law of Large Numbers
  • Variance and standard deviation
  • Skewness
  • Correlation and Covariance
  • P-value
  • Expected value of random variables
  • Conditional probability
  • Bayes Theorem

Data science is a cross-disciplinary field. One of its essential components is statistics. Without a basic knowledge of statistics, it would be hard to understand or interpret the data.

The purpose of statistics is to draw inferences about a population based on sample information. There is also considerable overlap between machine learning and statistics.

To become a data scientist, it’s important to understand how statistics work.

  1. Sampling and Population

In statistics, a population is the complete set of observations we want to draw conclusions about. For example, all of the college students in India together form one population. People aged 25 in America form another population, comprising everyone who fits that description.

Analysing an entire population is not always feasible or possible, so we use a sample of it to process the data.

A ‘sample’ is defined as a subset of the population: a smaller group from which observations can be made and conclusions drawn. For instance, 1,000 college students in India are a sample of the population of all college students in India.
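As a quick illustration, the sketch below (using NumPy, with a made-up population of ages) draws a random sample and shows that the sample mean approximates the population mean:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical population: ages of 100,000 people (values are made up)
population = rng.integers(18, 30, size=100_000)

# Draw a random sample of 1,000 people without replacement
sample = rng.choice(population, size=1_000, replace=False)

# The sample mean approximates the population mean
print(round(population.mean(), 2), round(sample.mean(), 2))
```

The larger the sample, the closer its mean tends to sit to the population mean.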

  2. Normal distribution

A probability distribution is a statistical function that describes all the possible outcomes of a random event. Consider a column (a feature) in a dataframe: this variable can take on a range of values, as described by its probability distribution function.

A dataframe has several columns, and each column has its own set of values. The probability distribution tells us which values are more likely to appear than others.

The normal or Gaussian distribution is one such probability distribution. It is symmetric about the mean, meaning that values near the mean occur more often than values far from the mean. Plotted on a graph, a normal distribution gives the familiar bell-shaped curve.

The following graph shows the shape of a typical normal distribution curve, created with NumPy’s random sampling functions.

Image source: https://wiki.analytica.com/index.php?title=File%3ANormal(0,1).png

The peak of the graph indicates the most likely value for our variable, and as we move away from this point the probability decreases.

The following image presents a more formal representation of the normal distribution. The percentages indicate what proportion of the data falls within each region: roughly 68% of values lie within one standard deviation of the mean, about 95% within two, and about 99.7% within three. As we move away from the mean, we see fewer and fewer values.

The extremes (values far from the mean) have a low probability of being observed, so while they do occur, they are rare.

Image source: Normal distribution
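The bell shape can be checked numerically. The sketch below draws values from a standard normal distribution with NumPy and verifies the well-known rule that roughly 68% of values fall within one standard deviation of the mean and about 95% within two:

```python
import numpy as np

rng = np.random.default_rng(0)

# 100,000 draws from a standard normal distribution (mean 0, std 1)
data = rng.normal(loc=0.0, scale=1.0, size=100_000)

# Fraction of values falling within 1 and 2 standard deviations of the mean
within_1sd = np.mean(np.abs(data) < 1)
within_2sd = np.mean(np.abs(data) < 2)

print(round(within_1sd, 3), round(within_2sd, 3))
```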

  3. Central tendency

Central tendency is the summary of a large set of values. It’s the statistical concept that measures the “middle” or centre point of a data set.

The two most popular central tendencies are mean and median.

The mean is calculated using the formula:

x̄ = Σx / n

Where,

  • x̄ = mean
  • Σx = sum of all values
  • n = number of values in the dataset

Generally, the mean is used to describe “typical” behaviour, but it is very sensitive to outliers in a dataset.

The median is the middle score when values are sorted by size (smallest to largest); when there is an even number of values, it is the average of the two middle ones. Quartiles provide additional insight into central tendency, since they show where scores sit within the distribution: the second quartile (Q2) is the median itself, while the first (Q1) and third (Q3) quartiles mark the points below which 25% and 75% of the data fall.

For example, consider the following dataset: { 5.67, 7, 8.2, 10.1 }

Here the median (Q2) is (7 + 8.2) / 2 = 7.6, the value that separates the lower two data points from the upper two.
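These measures are easy to compute. A minimal sketch with NumPy, using the dataset above:

```python
import numpy as np

data = np.array([5.67, 7, 8.2, 10.1])

mean = data.mean()        # sum of all values divided by their count
median = np.median(data)  # average of the two middle values here (even count)

# Quartiles: Q1, Q2 (the median itself), Q3
q1, q2, q3 = np.percentile(data, [25, 50, 75])

print(mean, median, (q1, q3))
```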

  4. Central Limit Theorem / Law of Large Numbers

The central limit theorem states that if we draw random samples from a population with mean μ and standard deviation σ, the distribution of the sample means approaches a normal distribution as the sample size increases, no matter what the population's own distribution looks like. Simply stated, sample means tend toward the normal distribution as the sample size grows, which means our mean-based estimates become more reliable with larger samples regardless of whether the original data is normal. The closely related law of large numbers says that the sample mean itself converges to the population mean as the sample size grows.

The central limit theorem also underpins the concept of a margin of error, which expresses the degree of confidence we have in our estimates: at a 95% confidence level, for instance, roughly 95% of intervals constructed this way from future samples would contain the true population value, and raising the confidence level widens the interval.
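A small simulation makes the theorem concrete. The sketch below (population, sample size, and counts are arbitrary choices) draws samples from a clearly non-normal population and shows that the sample means still cluster around the population mean:

```python
import numpy as np

rng = np.random.default_rng(1)

# A clearly non-normal population: exponential, right-skewed, mean 2.0
population = rng.exponential(scale=2.0, size=100_000)

# Means of 2,000 samples of size 50 each
sample_means = np.array([
    rng.choice(population, size=50).mean() for _ in range(2_000)
])

# The sample means cluster around the population mean, and their spread
# shrinks roughly like sigma / sqrt(n), even though the data is skewed
print(round(sample_means.mean(), 2), round(sample_means.std(), 2))
```

Plotting `sample_means` as a histogram would show the bell shape emerging despite the skewed source data.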

  5. Variance and standard deviation

Variance is essentially a measure of how far a set of values spreads out from the mean. It is calculated as the average squared deviation from the mean:

σ² = Σ(x − μ)² / n

Meanwhile, standard deviation (σ) is derived from variance and describes how far a typical value lies from the mean: it is simply the square root of the variance. Because the standard deviation is expressed in the same units as the data (variance is in squared units), it is the more common way to quantify variability and describe how numbers spread around their mean. Values lying several standard deviations from the mean can also flag outliers in a dataset.
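A minimal sketch with NumPy, using an arbitrary example dataset:

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Variance: average squared deviation from the mean
variance = np.mean((data - data.mean()) ** 2)

# Standard deviation: square root of the variance, in the data's own units
std_dev = np.sqrt(variance)

print(variance, std_dev)  # same results as np.var(data) and np.std(data)
```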

  6. Skewness

Skewness describes the asymmetry of a dataset, i.e. its deviation from the symmetric bell shape of the normal distribution. A “skewed” distribution is one whose tail is longer on one side than the other. In essence, skewness quantifies how asymmetrical a variable is and tells us whether to expect extreme scores on one particular side of our data set.

If you’ve ever read a book that discusses “tail events,” such as The Black Swan, this is the territory skewness describes. A classic example is income: most people earn amounts near the typical range, but a small number of very high earners stretch the right tail, producing positive skew. When a distribution is skewed, the mean is pulled toward the long tail, so descriptive statistics such as the mean and standard deviation should be interpreted with care.
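Skewness can be quantified with the moment-based formula E[(x − mean)³] / σ³. The sketch below uses a hand-rolled helper (not a library function) to compare a symmetric sample with a right-skewed one:

```python
import numpy as np

def skewness(x):
    """Moment-based sample skewness: mean((x - mean)^3) / std^3."""
    x = np.asarray(x, dtype=float)
    dev = x - x.mean()
    return np.mean(dev ** 3) / np.mean(dev ** 2) ** 1.5

rng = np.random.default_rng(2)
symmetric = rng.normal(size=100_000)          # skewness close to 0
right_skewed = rng.exponential(size=100_000)  # theoretical skewness is 2

print(round(skewness(symmetric), 2), round(skewness(right_skewed), 2))
```

A positive result means a longer right tail; a negative one, a longer left tail.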

  7. Correlation and Covariance

Correlation refers to the strength and direction of the linear relationship between two variables, whereas covariance refers to how two variables vary together around their means. The most common reference values for correlation are:

  • -1 = perfect negative correlation
  • 0 = no correlation
  • +1 = perfect positive correlation

A value of zero indicates there’s no linear relationship between the two variables.

Correlations are frequently used in data analysis, but many people, even with a background in math and statistics, never fully grasp what they mean at the most basic level.

Image source: https://www.geeksforgeeks.org/mathematics-covariance-and-correlation/

Consider these examples:

  1. A) A positive correlation would indicate that as wine consumption increases, hospital admittance rates increase.
  2. B) A negative correlation would indicate that as wine consumption increases, hospital admittance rates decrease.
  3. C) The lack of a correlation between wine and hospital admissions indicates no relationship between the two variables, or alternatively that there exists some other variable that’s driving them both (e.g., alcohol consumption).

Covariance is the average of the products of the two variables’ deviations from their means; correlation is simply covariance rescaled by the two standard deviations, which is why it always lies between -1 and +1. Covariance is also frequently used in data analysis, for instance in a covariance matrix that records the pairwise covariances among several variables such as { x, y, z }, but it’s not as well known as other descriptive statistics.
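The relationship between the two measures can be shown directly. In the sketch below (with synthetic data), the correlation computed by NumPy matches covariance divided by the product of the standard deviations:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data: y is a noisy increasing function of x
x = rng.normal(size=1_000)
y = 2 * x + rng.normal(scale=0.5, size=1_000)

cov_xy = np.cov(x, y)[0, 1]        # covariance: unbounded, unit-dependent
corr_xy = np.corrcoef(x, y)[0, 1]  # correlation: always between -1 and +1

# Correlation is covariance divided by the product of the standard deviations
manual_corr = cov_xy / (x.std(ddof=1) * y.std(ddof=1))

print(round(corr_xy, 3), round(manual_corr, 3))
```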

  8. P-value

The p-value is the probability of getting the observed value, or a more extreme one, assuming that the null hypothesis is true, e.g. that there’s no relationship between our two variables. A p-value of 0.05 means we would expect values at least this extreme 5% of the time if there were truly no relationship between the variables.

Image source: https://blog.minitab.com/en/adventures-in-statistics-2/understanding-hypothesis-tests-significance-levels-alpha-and-p-values-in-statistics

In statistics, a null hypothesis is a hypothesis which we are trying to disprove. In data science, most null hypotheses deal with correlations between variables: e.g., “there is no relationship between X and Y” or “X and Y have a weak/non-existent linear correlation.”

If you reject the null hypothesis, you have evidence that some relationship exists, though the p-value alone does not tell you whether it is strong or weak, linear or non-linear. Another common example is examining whether the empirical data supports the idea of a “winner’s curse.”

For example, let’s assume that you’re interested in exploring option trading strategies for your business (i.e., an investment) and are trying to figure out if there is any correlation between option returns and the volatility of those options.
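One intuitive way to obtain a p-value without distributional formulas is a permutation test (a general-purpose technique, not specific to options data). The sketch below, with synthetic groups, shuffles the group labels and counts how often a mean difference at least as extreme as the observed one arises by chance:

```python
import numpy as np

rng = np.random.default_rng(4)

# Two synthetic groups; the null hypothesis is "no difference in means"
group_a = rng.normal(loc=0.0, size=200)
group_b = rng.normal(loc=0.3, size=200)
observed_diff = abs(group_a.mean() - group_b.mean())

# Permutation test: shuffle the group labels and count how often a mean
# difference at least as extreme as the observed one arises by chance
pooled = np.concatenate([group_a, group_b])
n_perm = 5_000
count = 0
for _ in range(n_perm):
    rng.shuffle(pooled)
    if abs(pooled[:200].mean() - pooled[200:].mean()) >= observed_diff:
        count += 1

p_value = count / n_perm
print(p_value)  # a small p-value means the data is unlikely under the null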

  9. Expected value of random variables

The expected value of a random variable is the probability-weighted average of all the possible values that variable can take. If you’re familiar with basic probability theory, this concept will make sense to you:

E(X) = Σ x · P(X = x), summed over every possible value x.

If not, it may take some time to understand the concept of a random variable. For now, just understand that it represents all possible values randomly generated for a specific variable.
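For example, for a fair six-sided die the formula works out to 3.5:

```python
# Expected value of a fair six-sided die: each face has probability 1/6
values = [1, 2, 3, 4, 5, 6]

# E(X) = sum over x of x * P(X = x)
expected = sum(x * (1 / 6) for x in values)

print(expected)  # ≈ 3.5
```

Note that the expected value need not be a value the variable can actually take, as here: no die face shows 3.5.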

  10. Conditional probability

Conditional probability is what we call the likelihood of an event occurring given that another event has already occurred. In data science and statistics, we are frequently interested in determining which factors lead to specific results (e.g., how can X increase my revenue?).

Conditional probability is a close cousin of Bayes’ theorem, which we will touch on in the next point.

Let’s say you have 6 blue dots and 4 yellow dots split between two boxes; for instance, suppose box A holds 2 blue and 3 yellow dots while box B holds the remaining 4 blue and 1 yellow.

If we tell you to randomly pick one dot from either box, the probability of getting a blue dot is 6 out of 10 = 0.6.

What if we tell you to pick one dot from box A only? The probability of picking a blue one drops to 2 out of 5 = 0.4.

The probability of event A given event B is denoted as p(A|B).

This conditional probability is calculated by dividing the probability of both events happening by the probability of the given event: p(A|B) = p(A and B) / p(B).
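The box example can be computed directly. The split below (2 blue + 3 yellow in box A, 4 blue + 1 yellow in box B) is an assumed illustration:

```python
# Conditional probability for the two-box example, using the assumed split:
# box A holds 2 blue + 3 yellow dots, box B holds 4 blue + 1 yellow
p_blue = 6 / 10            # P(blue): 6 blue dots out of 10 overall
p_box_a = 5 / 10           # P(box A): 5 of the 10 dots sit in box A
p_blue_and_box_a = 2 / 10  # P(blue and box A): 2 blue dots are in box A

# p(A|B) = p(A and B) / p(B)
p_blue_given_box_a = p_blue_and_box_a / p_box_a

print(p_blue_given_box_a)  # 0.4, lower than the unconditional 0.6
```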

  11. Bayes’ Theorem

Bayes’ theorem is one of the key concepts to understand for both data science and statistics. If you’re unfamiliar with probability theory or conditional probabilities, then Bayes’ Theorem may be confusing to you at first.

P(A | B) = P(B | A) · P(A) / P(B)

where A and B are events and:

P(A | B) is a conditional probability: the probability of event A occurring given that B is true. It is also called the posterior probability of A given B.

P(B | A) is also a conditional probability: the probability of event B occurring given that A is true. It can also be interpreted as the likelihood of A given a fixed B because P(B | A)=L(A | B).

P(A) and P(B) are the probabilities of observing A and B respectively without any given conditions; they are known as the marginal probability or prior probability.

B must be an event with non-zero probability: P(B) ≠ 0.

Reference: Bayes’ theorem

The core concept of Bayes’ theorem, however, is that it allows us to adjust our prior beliefs (or “priors”), given new information. For example, if your prior beliefs are that it is 40% likely to rain this weekend and you receive a weather report indicating that there is a 60% chance of rain during the weekend, then your posterior probability would be:

P(Rain|Weather Report) = P(Weather Report|Rain) * P(Rain) / P(Weather Report)

In data science and statistics, we use Bayes’ theorem very frequently to make decisions. In fact, you can even use it to make predictions about the behavior of future people or markets.  It’s a rather powerful idea that has been applied in nearly every conceivable field.
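The rain example becomes computable once we assume concrete likelihoods for the weather report; the numbers below (a 90% hit rate when it rains and a 20% false-alarm rate when it doesn't) are illustrative assumptions only:

```python
# Bayes' theorem on the rain example. The likelihoods are illustrative
# assumptions: the forecaster reports rain 90% of the time when it does
# rain, and 20% of the time when it does not.
p_rain = 0.4                # prior: P(Rain)
p_report_given_rain = 0.9   # P(Weather Report | Rain)
p_report_given_dry = 0.2    # P(Weather Report | No rain)

# Law of total probability: P(Weather Report)
p_report = (p_report_given_rain * p_rain
            + p_report_given_dry * (1 - p_rain))

# Posterior: P(Rain | Weather Report) = P(Report | Rain) * P(Rain) / P(Report)
p_rain_given_report = p_report_given_rain * p_rain / p_report

print(round(p_rain_given_report, 2))  # 0.75
```

Seeing the report raises the belief in rain from the 40% prior to a 75% posterior.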

Takeaway

We have covered some basic statistical concepts and methods that are common in the field of data science. Once you understand the basics of statistics, you can work your way up to more advanced topics. We’d be glad to hear about some of the statistical concepts you may have worked on. Do share.
