Discrete Probability & Central Limit Theorem

Wed Oct 09 2024

TLDR: Discrete probability distributions assign probabilities to a countable set of outcomes. The Poisson distribution models the number of events in a fixed interval, while the binomial distribution models the number of successes in a fixed number of trials. The Central Limit Theorem states that the distribution of sample means approaches a normal distribution as the sample size grows. Sampling methods such as simple random sampling produce the samples, and confidence intervals use them to estimate population parameters.

In statistics, discrete probability distributions describe the likelihood of different outcomes when the possible outcomes can be listed, either as a finite set or as a countable sequence such as 0, 1, 2, and so on. Two properties are essential: the outcomes must be mutually exclusive, meaning no two outcomes can happen simultaneously, and their probabilities must sum to 1.

Example: Tossing a Coin Three Times

Let’s take the example of tossing a coin three times. The possible outcomes are:

TTT, TTH, THT, THH, HTT, HTH, HHT, HHH

Now, we can construct a probability distribution for the number of heads that appear:

Number of Heads ( $x$ )	Probability ( $P(x)$ )
0	1/8 = 0.125
1	3/8 = 0.375
2	3/8 = 0.375
3	1/8 = 0.125

To calculate the mean (expected value) and standard deviation for this distribution, we use the following formulas:

Mean (μ):
$$\begin{aligned} \mu &= \sum [x \cdot P(x)] \\ &= (0 \cdot 0.125) + (1 \cdot 0.375) + (2 \cdot 0.375) + (3 \cdot 0.125) \\ &= 1.5 \end{aligned}$$
Variance (σ²):
$$\begin{aligned} \sigma^2 &= \sum [(x - \mu)^2 \cdot P(x)] \\ &= (0 - 1.5)^2 \cdot 0.125 + (1 - 1.5)^2 \cdot 0.375 + (2 - 1.5)^2 \cdot 0.375 + (3 - 1.5)^2 \cdot 0.125 \\ &= 0.75 \end{aligned}$$
Standard Deviation (σ):
$$\begin{aligned} \sigma &= \sqrt{\sigma^2} \\ &= \sqrt{0.75} \\ &\approx 0.866 \end{aligned}$$

Thus, the expected number of heads is 1.5, with a standard deviation of approximately 0.87.
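The table and the three formulas above can be checked with a few lines of Python. This is a minimal sketch that assumes a fair coin, so all eight outcomes are equally likely; the variable names are illustrative only.

```python
from collections import Counter
from itertools import product
from math import sqrt

# Enumerate all 2^3 equally likely outcomes of three fair coin tosses.
outcomes = ["".join(tosses) for tosses in product("HT", repeat=3)]

# Count how many outcomes produce each number of heads, then normalize.
counts = Counter(outcome.count("H") for outcome in outcomes)
pmf = {x: counts[x] / len(outcomes) for x in sorted(counts)}

# Apply the definitions of mean, variance, and standard deviation.
mean = sum(x * p for x, p in pmf.items())
variance = sum((x - mean) ** 2 * p for x, p in pmf.items())
std_dev = sqrt(variance)

print(pmf)       # {0: 0.125, 1: 0.375, 2: 0.375, 3: 0.125}
print(mean)      # 1.5
print(variance)  # 0.75
print(round(std_dev, 3))  # 0.866
```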

Poisson Distribution

The Poisson distribution models the probability of a given number of events happening in a fixed interval of time or space. It is appropriate when events happen independently at a constant mean rate. A key formula for the Poisson distribution is:

$$P(X = k) = \frac{e^{-\lambda} \lambda^k}{k!}$$

Where:

$P(X = k)$ is the probability of exactly $k$ events, for $k = 0, 1, 2, \ldots$
$\lambda$ is the mean number of events in the interval.
$e$ is Euler's number, approximately $2.718$.
$k!$ is the factorial of $k$.
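As a rough illustration, the formula translates directly into Python using only the standard library. The function name `poisson_pmf` and the rate `lam = 2.0` are assumptions chosen for this sketch, not values from the text.

```python
from math import exp, factorial

def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k events when the mean number of events is lam."""
    if k < 0:
        return 0.0  # a count of events cannot be negative
    return exp(-lam) * lam ** k / factorial(k)

lam = 2.0  # hypothetical mean rate, e.g. 2 events per interval
for k in range(5):
    print(k, round(poisson_pmf(k, lam), 4))

# The probabilities over all k should add up to (almost exactly) 1.
print(sum(poisson_pmf(k, lam) for k in range(50)))
```

Summing the probabilities over a long enough range of $k$ is a quick sanity check that the implementation matches a valid probability distribution.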

Binomial Distribution

The binomial distribution describes the probability of a certain number of successes in a fixed number of independent trials, each with the same probability of success. It is characterized by two parameters: the number of trials ($n$) and the probability of success in each trial ($p$). The formula for the binomial distribution is:

$$P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$$

Where:

$P(X = k)$ is the probability of $k$ successes.
$\binom{n}{k}$ is the number of ways to choose $k$ successes out of $n$ trials.
$p$ is the probability of success in a single trial.
$(1 - p)$ is the probability of failure in a single trial.
$n$ is the total number of trials.
$k$ is the number of successes.
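The three-toss coin example is itself a binomial distribution with $n = 3$ and $p = 0.5$, which gives a convenient check on the formula. The sketch below uses a hypothetical helper named `binomial_pmf` and the standard library's `math.comb` for the binomial coefficient.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    if not 0 <= k <= n:
        return 0.0  # impossible number of successes
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Number of heads in three fair coin tosses: n = 3, p = 0.5.
probabilities = [binomial_pmf(k, 3, 0.5) for k in range(4)]
print(probabilities)  # [0.125, 0.375, 0.375, 0.125]
```

The output matches the probability table built by hand from the eight outcomes, and the binomial mean $np = 1.5$ and variance $np(1 - p) = 0.75$ agree with the values computed for that example.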

Central Limit Theorem

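The Central Limit Theorem (CLT) is a fundamental concept in statistics. It states that, for independent samples drawn from a population with a finite mean and variance, the distribution of the sample mean approaches a normal distribution as the sample size $n$ increases, regardless of the shape of the original population distribution. A common rule of thumb treats $n \geq 30$ as sufficiently large, although strongly skewed populations may need larger samples.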
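A small simulation can make the theorem more concrete. The sketch below assumes a skewed population, an exponential distribution with mean 1 and standard deviation 1, chosen purely for illustration. It measures what share of sample means fall within one standard error ($1/\sqrt{n}$) of the true mean; for a normal distribution that share is about 68%.

```python
import random

def sample_means(n, trials, rng):
    """Draw `trials` samples of size n from Exponential(1); return their means."""
    return [sum(rng.expovariate(1.0) for _ in range(n)) / n for _ in range(trials)]

def share_within_one_se(means, n):
    """Fraction of sample means within one standard error of the true mean (1.0)."""
    standard_error = 1 / n ** 0.5  # population sd is 1 for Exponential(1)
    return sum(abs(m - 1.0) <= standard_error for m in means) / len(means)

rng = random.Random(0)  # fixed seed so the run is repeatable
for n in (1, 5, 30, 100):
    means = sample_means(n, 5000, rng)
    print(n, round(share_within_one_se(means, n), 3))
```

With $n = 1$ the share is well above 68% because the population is skewed; as $n$ grows it should settle near the normal value, which is the behavior the CLT describes.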

Sampling Methods

Simple Random Sampling:
- Each member of the population is assigned a number, and random numbers are selected without replacement. For instance, if you have a population of 1,000 and need to select 50 people, you would randomly select 50 unique numbers between 1 and 1,000.
Systematic Random Sampling:
- This method involves selecting every $k$-th individual from a list after choosing a random starting point. For instance, if the population size is 1,000 and you need 50 individuals, you could select every 20th person (1,000/50 = 20) after randomly selecting a starting point within the first 20 individuals.
Stratified Sampling:
- The population is divided into subgroups (strata) based on certain characteristics, and random samples are taken from each stratum. This method ensures that each subgroup is represented in the sample.
Cluster Sampling:
- The population is divided into clusters, and a random sample of clusters is selected. All individuals within the chosen clusters are included in the sample. This method is useful when it is difficult to obtain a list of the entire population.

Recommendation: Simple random sampling tends to minimize bias, especially if you have access to a full population list. Systematic sampling is simpler to carry out but can introduce bias if the list has a repeating pattern that lines up with the sampling interval. Stratified sampling is useful when you want to ensure representation from different subgroups, while cluster sampling is beneficial when the population is geographically dispersed.
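The four methods can be sketched with Python's `random` module. The population of numbers 1 to 1,000 follows the examples above; the two strata and the 20 clusters of 50 members are hypothetical groupings invented for this illustration, as are all function names.

```python
import random

def simple_random(population, size, rng):
    """Pick `size` distinct members at random, without replacement."""
    return rng.sample(population, size)

def systematic(population, size, rng):
    """Pick every k-th member after a random start within the first k."""
    k = len(population) // size
    start = rng.randrange(k)
    return population[start::k][:size]

def stratified(strata, per_stratum, rng):
    """Draw a simple random sample of `per_stratum` members from each stratum."""
    return {name: rng.sample(members, per_stratum) for name, members in strata.items()}

def cluster(clusters, n_clusters, rng):
    """Pick whole clusters at random and keep every member of each one."""
    chosen = rng.sample(list(clusters), n_clusters)
    return [member for name in chosen for member in clusters[name]]

rng = random.Random(42)  # fixed seed so the run is repeatable
population = list(range(1, 1001))

srs = simple_random(population, 50, rng)
sys_sample = systematic(population, 50, rng)

# Hypothetical strata: the lower and upper halves of the list.
strata = {"lower": population[:500], "upper": population[500:]}
strat_sample = stratified(strata, 25, rng)

# Hypothetical clusters: 20 consecutive blocks of 50 members each.
clusters = {i: population[i * 50:(i + 1) * 50] for i in range(20)}
cluster_sample = cluster(clusters, 2, rng)

print(len(srs), len(sys_sample), len(cluster_sample))  # 50 50 100
```

Note that cluster sampling fixes the number of clusters rather than the number of individuals, so the final sample size depends on how large the chosen clusters are.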

Estimating Population Parameters: Confidence Intervals

A confidence interval provides a range of values within which we are confident the population parameter lies. For example, if we want to estimate the mean income of households based on a sample, we can construct a 95% confidence interval.

Example Problem

Suppose we survey 100 households and find a sample mean income of \$50,000 with a standard deviation of \$5,000. The 95% confidence interval for the population mean is given by:

$$CI = \bar{x} \pm z \frac{\sigma}{\sqrt{n}}$$

Where:

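$\bar{x}$ is the sample mean (in this case, \$50,000).
$z$ is the z-value for a 95% confidence level, which is $1.96$.
$\sigma$ is the standard deviation. Here the sample standard deviation (\$5,000) stands in for the unknown population value, which is a reasonable approximation for a sample of this size.
$n$ is the sample size ($100$).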

Substituting the values:

$$CI = 50000 \pm 1.96 \cdot \frac{5000}{\sqrt{100}} = 50000 \pm 1.96 \times 500 = 50000 \pm 980$$

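Thus, the 95% confidence interval is \$49,020 to \$50,980. This means we are 95% confident that the true population mean falls within this interval: if the survey were repeated many times, about 95% of the intervals built this way would contain the true mean.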
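The same calculation can be reproduced in Python. This sketch uses `statistics.NormalDist` from the standard library to look up the z-value instead of hard-coding 1.96; the function name `mean_confidence_interval` is an assumption made for illustration.

```python
from math import sqrt
from statistics import NormalDist

def mean_confidence_interval(sample_mean, std_dev, n, confidence=0.95):
    """Normal-based confidence interval for a population mean."""
    # Two-sided interval: leave (1 - confidence) / 2 in each tail.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * std_dev / sqrt(n)
    return sample_mean - margin, sample_mean + margin

low, high = mean_confidence_interval(50_000, 5_000, 100)
print(round(low), round(high))  # 49020 50980
```

Passing a different `confidence` value, such as 0.99, widens the interval because the corresponding z-value is larger.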

Conclusion

Key statistical concepts such as discrete probability distributions, the Poisson and binomial distributions, the Central Limit Theorem, sampling methods, and confidence intervals play crucial roles in understanding data and making informed decisions. The use of formulas allows for precise calculations, helping to quantify uncertainty and variability, whether we’re estimating average household incomes or calculating the likelihood of specific outcomes.