Probability for Computer Scientists

I want to learn more about Combinatorics and Probability because I believe those are two areas of discrete mathematics that I don't know enough about. I found the pdf/book "Probability for Computer Scientists" online, a work is taken from the lecture notes for the course Probability for Computer Scientists at Stanford University, and I am going to take notes on it here.

Date Created:

Last Edited:

2 529

References

Probability for Computer Scientists


Symbol	Meaning	Command
	Empty set	`\oslash`
	Intersection symbol for sets	`\cap`
	Union symbol for sets	`\cup`
	Generally used to denote multiplication	`\cross`
	Floor of variable x	`\lfloor x \rfloor`
	Ceiling of variable x	`\lceil x \rceil`
	Binomial Coefficient	`\begin{pmatrix}n\\r\end{pmatrix}`
	Infinity	`\infty`
	Subset equals	`\subseteq`
	Similar To	`\sim`
	Integral	`\int`

Note

This is just an overview. I did not go too deep into any topic. I mainly just wanted to learn terminology and what I should eventually learn more in-depth.

Probability for Computer Scientists

This work is taken from the lecture notes for the course Probability for Computer Scientists at Stanford University.

Counting

Sum Rule of Counting

If the outcome of an experiment can either be one of m outcomes or one of n outcomes, where none of the outcomes in the set of m outcomes is the same as any of the outcomes in the set of n outcomes, then there are possible outcomes of the experiment.
Rewritten using set notation, the Sum Rule states that if the outcomes of an experiment can either be drawn form set A or set B, where , then the number of outcomes of the experiment is .

Product Rule of Counting

If an experiment has two parts, where the first part can result in one of m outcomes and the second part can result in one of n outcomes regardless of the outcome of the first part, the total number of outcomes for the experiment is mn.
Rewritten in set notation, the Product Rule states that if an experiment with two parts has an outcome from set A in the first part, where , and an outcome from set B in the second part (regardless of the outcome in the first part, where , then the total number of outcomes of the experiment is .

The Inclusion-Exclusion Principle

If an outcome of an experiment can either be drawn from set A or set B, and sets A and B may potentially overlap (not guaranteed that ), then the number of outcomes for the experiment is .

General Principle of Counting

If an experiment has r parts such that part i has outcomes for all , then the total number of outcomes for the experiment is

General Pigeonhole Principle

If m objects are places in n buckers, then at least one bucket must contain at least objects.

Combinatorics

Permutation Rule

A permutation is an ordered arrangement of n distinct objects. Those n objects can be permuted in ways.

Binary Search tree (BST) is a binary tree that satisfies the following properties for every node n in the tree:

n's value is greater than all the values in the left subtree
n's value is less than all the values in its right subtree
both n's left and right subtrees are binary search trees

Permutations of Indistinct Objects

Binomial Coefficient

A combination is an unordered selection of r objects from a set of n objects. If all objects are distinct, the number of ways of making the selection is:

The term is defined as a binomial coefficient and is often read as n choose r

Multinomial Coefficient

If n objects are distinct, then the number of ways to select groups of objects, such that group i has size and is:

Divider Method

Suppose yo u want to place indistinguishable items into r containers. The divider method works by imagining that you are going to solve this problem by sorting two types of object, your n original elements and dividers. Thus, you are permuting objects, of which are the same and of which are the dame (the dividers). The total number of outcomes is:

Probability

Event Spaces and Sample Spaces

A sample space S is the set of all possible outcomes of an experiment. For example:
- Coin Flip:
- Flipping Two Coins:
- Roll of 6 Sided Die:
- Number of emails in a day:
An event space E is some subset of S that we ascribe meaning to:
- Coin Flip:
- Flipping Two Coins:
- Roll of 6 Sided Die:
- Number of Emails in a day:
What probability is:

where n is the number of trials performed and n(E) is the number of trials with an outcome in E. In English this reads: say you perform n trials of an experiment. The probability of a desired event E is defined a s the ratio of the number of trials that result in an outcome in E to the number of trials performed.

Axioms of Probability

Basic truths about probabilities:
- Axiom 1:
- Axiom 2:
- Axiom 3:

Provable Identities of Probability

- Identity 1:
- Identity 2:
- Identity 3:
General Inclusion-Exclusion Identity:

Equally Likely Outcomes

Some sample spaces have outcomes that are all equally likely. We like those sample spaces; they make it simple to compute probabilities. Examples include coin flips and rolling dies.
For a sample space win which all outcomes are equally likely:
and for any event ,

Conditional Probability

In English, a conditional probability answers the question: "What is the chance of an event happening, given that I already observed some other event ?"
Definition of Conditional Probability: The probability of given that (i.e. conditioned on) event F already happened:

The Chain Rule: the definition of conditional probability can be rewritten as:
- It states that the probability of observing events # and F is the probability of observing F, multiplied by the probability of observing E given that you have observed F.
- Generalized Version:

The Law of Total Probability

In the picture like the one in Figure 4.1, event F can be thought of as having two parts, the part that is in (that is, ), and the part that isn't . This is true because E and E^C are mutually exclusive sets of outcomes which together cover the entire sample space. This is generally true:
- This observation is called the law of total probability; however, it most commonly referred to as the chain rule.
For events E and F:

In probability, the subscript c in is used to indicate the non-occurrence of an event, or its complement. The complement of an event is the subset of outcomes in the sample space that are not part of the event.

If you can divide your sample space into any number of events that are mutually exclusive and exhaustive - every outcome in sample space falls into exactly one of those events - then:
- The word total refers to the fact that the events in must combine to form the totality of the sample space.

Bayes' Theorem

Bayes' theorem (or Bayes' rule) is one of the most ubiquitous results in probability for computer scientists. Very often we know the conditional probability in one direction - - but we would like to know the conditional probability in the other direction. Bayes' theorem provides a way to convert from one to the other.
Bayes Theorem - the most common form of bayes theorem is:

Each term in Bayes' rule formula has its own name. The term is often called the posterior, the term is often called the prior, the term is called the likelihood (or the update), and is often called the normalization constant.

Conditional Probability Rules, Conditioning on G:

Independence

Independence is a big deal in machine learning and probabilistic modeling. Knowing the joint probability of many events (the probability of the and of the events) requires exponential amounts of data. By making independence and conditional independence claims, computers can essentially decompose how to calculate the joint probability, making it faster to compute, and requiring less data to learn probabilities.
Independence: Two events, and , are independent if and only if:
Otherwise, they are called dependent events.

Conditional Independence

Two events and are called conditionally independent given a third event , if
- or equivalently
An important caveat about conditional independence is that ordinary independence does not imply conditional independence, nor the other way around.

Random Variables

A random variable (RV) is a variable that probabilistically takes on different values.
An event is a situation, a random variable is an object. The situation in which a random variable takes on a particular value (or range of values) is an event.
There are many different types of random variable (indicator, binary, choice, Bernoulli, etc.). The two main families of random variable types are discrete and continuous.

Probability Mass Function

For a discrete random variable, the most important thing to know is the probability that the random variable will take on each of its possible values., The probability mass function (PMF) of a random variable is a function that maps possible outcomes of a random variable to the corresponding probabilities. Because it is a function, we can plot PMF graphs where the x-axis contains the values that the random variable can take on and the y-axis contains the probability of the random variable taking on said value

The probability mass function, , defines the probability of X taking place on the value x.

Expectation

A useful piece of information about a random variable is the average value of the random variable over many repetitions of the experiment it represents. This average is called the expectation. The expectation of a discrete random variable is defined as:

It goes by other names: mean, expected value, weighted average, center of mass, and first moment.

Properties of Expectation

Expectations preserve linearity. This means

So if you have expectation of a sum of quantities, this is equal to the sum of the expectations of those quantities.
One can calculate the expected value of a function of a random variable when one knows the probability distribution of but does not explicitly know the distribution of :

Variance

Variance is a formal qualification of spread. There is more than one way to quantify spread; variance uses the average square distance from the mean.
The standard deviation is the square root of variance.

Bernoulli and Binomial Random Variables

A Bernoulli random variable is the simplest kind of random variable. it can take on two values, 1 and 0.If X is a Bernoulli random variable, denoted

Bernoulli random variables and indicator variables are two aspects of the same concept.

Binomial Random Variable

A binomial random variable is a random variable that represents the number of successes in n successive independent trials of a Bernoulli experiment, e.g., the number of heads in n coin flips.
If is a Binomial random variable, we denote this , where p is the probability of success in a given trial. A binomial random variable has the following properties:

Poisson and Other Discrete Distributions

Poisson Random Variable

A Poisson random variable approximates Binomial random variables where n is large, p is small, and is moderate. is called the rate.
If X is a random variable, denoted , then:

Geometric Distribution

The variable X is a geometric random variable, denoted , if X is the number of independent trials until the first success and p is the probability of success on each trial:

Continuous Distributions

Probability Density Function

So far, all random variables we have seen have been discrete. Continuous Random variables can take on values .
Real values are defined with infinite precision; as a result, the probability that a random variable takes on a specific value is not very meaningful when the random variable is continuous. The Probability Mass Function does not apply here.
In a continuous world, every random variable has a probability density function (PDF), which says how likely it is that a random variable takes on a particular value, relative to other values that it could take on. The PDF has the nice property that you can integrate over it to find the probability that the random variable takes on values within a range .
The random variable is a continuous random variable if there is a function for , called the probability density function (PDF), such that:

The following properties must hold:

Cumulative Distribution Function

A cumulative distribution function (CDF) is a function which takes in a number and returns the probability that a random variable takes on a value less than (or equal to) that number. For a continuous random variable X, the cumulative distribution function is:

Most probability questions can be solved by knowing the cumulative distribution function:

Expectation and Variance

For a continuous random variable :

Uniform Random Variable

The most basic of all the continuous random variables is the uniform random variable, which is equally likely to take any value in its range . The variable is a uniform random variable, denoted , if it has the probability density function (PDF):

Normal Distributions

The single most important random variable type is the Normal (aka Gaussian) random variable, parameterized by a mean and a variance . If is a normal variable, we write . The normal is important for many reasons: it is generated from the summation of independent random variables and as a result occurs often in nature. Many things in the world are not distributed normally but data scientists and computer scientists model them as Normal distributions anyways. Why? Because it is the most entropic (conservative) distribution that we can apply to data with a measured mean and variance.

The probability density function (PDF) for a Normal is:

By definition, a normal has and .
Standard Normal: mean =0 and variance = 1.

Joint Distributions

Often you will work on problems where there are several random variables (often interacting with one another). We are going to start formally looking at how those interactions play out. We will think of joint probabilities with two events and .

Independent Random Variables

Intuitively: knowing the value of tells us nothing about the distribution of . If two variables are not independent, they are called dependent.

Statistics of Multiple Random Variables

Covariance is a quantitative measure of the extent to which the deviation of one variable from its mean matches the deviation of the other form its mean.

Covariance is interesting because it is a quantitative measurement of the relationship between two variables. Correlation between two variables, , is the covariance of the two variables normalized by the variance of each variable. The normalization cancels the units out and normalizes the measure so that it is always in the range .
Two variables are uncorrelated if there correlation is 0.

When people use the term correlation, they are actually referring to a specific type of correlation called Pearson correlation. It measures the degree tow which there is a linear relationship between the two variables. An alternative measure is Spearman correlation which has a formula almost identical to your regular correlation score, with the exception that the underlying random variables are first transformed into their rank.

Bayesian Networks

Central Limit Theorem

The central limit theorem proves that the averages of samples from any distribution themselves must be normally distributed:

Maximum Likelihood Estimation

Almost all machine algorithms work like this:

Specify a probabilistic model that has parameters.
Learn the value of those parameters from data.

Given a model, the parameters are the numbers that yield the actual distribution.

There are two main approaches to estimate the value of parameters: Maximum Likelihood Estimation (MLE) and Maximum A Posteriori (MAP).
The idea behind Maximum Likelihood Estimation is to select the parameters that make the observed data the most likely.
In the case of discrete distributions, likelihood is a synonym for the joint probability of your data. In the case of continuous distribution, likelihood refers to the joint probability density of your data.

Maximum A Posteriori

The paradigm of Maximum A Posteriori (MAP) is that we should choose the value for our parameters that is most likely given the data.

Naive Bayes

Naive Bayes is a type of machine learning algorithm called a classifier. It is used to predict the probability of a discrete label random variable based on the state of feature random variables X. Naive Bayes is a supervised classification Machine Learning algorithm.

The Naive Bayes Assumption is that each feature of X is conditionally independent of one anther given . That assumption is naive, but often useful.

Linear Regression and Gradient Descent

Regression is a second classification of machine learning algorithms. You have a prediction function , but you would like to predict a that takes on a continuous number.
The idea behind gradient ascent is that if you continuously take small steps in the direction of your gradient, you will eventually make it to a local maxima.

Probability for Computer Scientists

References

Note

Probability for Computer Scientists

Counting

Sum Rule of Counting

Product Rule of Counting

The Inclusion-Exclusion Principle

General Principle of Counting

General Pigeonhole Principle

Combinatorics

Permutation Rule

Permutations of Indistinct Objects

Binomial Coefficient

Multinomial Coefficient

Divider Method

Probability

Event Spaces and Sample Spaces

Axioms of Probability

Provable Identities of Probability

Equally Likely Outcomes

Conditional Probability

Conditional Probability

The Law of Total Probability

Bayes' Theorem

Independence

Conditional Independence

Random Variables

Probability Mass Function

Expectation

Properties of Expectation

Variance

Bernoulli and Binomial Random Variables

Binomial Random Variable

Poisson and Other Discrete Distributions

Poisson Random Variable

Geometric Distribution

Continuous Distributions

Probability Density Function

Cumulative Distribution Function

Expectation and Variance

Uniform Random Variable

Normal Distributions

Joint Distributions

Independent Random Variables

Statistics of Multiple Random Variables

Bayesian Networks

Central Limit Theorem

Maximum Likelihood Estimation

Maximum A Posteriori

Naive Bayes

Linear Regression and Gradient Descent

Review

Comments

User Comments