Introductory Statistics

I want to review statistics, so I am reading this textbook.

Date Created:
Last Edited:

References


  • Introductory Statistics, (OpenStax) Barbara Illowsky, Susan Dean


This textbook was written to increase student access to high-quality learning materials, maintaining the highest standards of academic rigor at little or no cost.


Sampling and Data


Definitions of Statistics, Probability, and Key Terms

The science of statistics deals with the collection, analysis, interpretation, and presentation of data. Organizing and summarizing data is called descriptive statistics. Two ways to summarize data are by graphing and by using numbers. The formal methods for drawing conclusions from good data are called inferential statistics. Statistical inference uses probability to determine how confident we can be that our conclusions are correct. Probability is a mathematical tool used to study randomness. It deals with the chance (the likelihood) of an event occurring. The theory of probability began with the study of games of chance such as poker. In statistics, we want to study a population. A population is a collection of persons, things, or objects under study. To study the population, we select a sample. The idea of sampling is to select a portion (or subset) of the larger population and study that portion (the sample) to gain information about the population. Data are the result of sampling from a population. From the sample data, we can calculate a statistic. A statistic is a number that represents a property of the sample. The statistic is an estimate of a population parameter. A parameter is a numerical characteristic of the whole population that can be estimated by a statistic.

One of the main concerns in the field of statistics is how accurately a statistic estimates a parameter. The accuracy really depends on how well the sample represents the population. The sample must contain the characteristics of the population in order to be a representative sample. We are interested in both the sample statistic and the population parameter in inferential statistics.

A variable is a characteristic or measurement that can be determined for each member of a population. Variables may be numerical or categorical. Numerical variables take on values with equal units such as weight in pounds and time in hours. Categorical variables place the person or thing into a category. Data are the actual values of the variable. A datum is a single value.

Data, Sampling, and Variation in Data and Sampling

Qualitative data are the result of categorizing or describing attributes of a population. Qualitative data are also often called categorical data. Quantitative data are always numbers. Quantitative data are the result of counting or measuring attributes of a population. Quantitative data can be either discrete or continuous. All data that are the result of counting are called quantitative discrete data. In a pie chart, categories of data are represented by wedges in a circle and are proportional in size to the percent of individuals in each category. In a bar graph, the length of the bar for each category is proportional to the number or percent of individuals in each category. A Pareto chart consists of bars that are sorted into order by category size.

A sample should have the same characteristics as the population it is representing. There are several different methods of random sampling. In each form of random sampling, each member of a population initially has an equal chance of being selected for the sample. Each method has pros and cons. The easiest method to describe is called a simple random sample. Any group of n individuals is equally likely to be chosen as any other group of n individuals if the simple random sampling technique is used. In other words, each sample of the same size has an equal chance of being selected. Other well-known random sampling methods:

  • stratified sample: divide the population into groups called strata and then take a proportionate number from each stratum.
  • cluster sample: divide the population into clusters (groups) and then randomly select some of the clusters. All the members from these clusters are in the cluster sample.
  • systematic sample: randomly select a starting point and take every nth piece of data from a listing of the population. This is frequently chosen because it is a simple method.
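
As a rough sketch of how these methods differ, here is a small Python example using the standard library's `random` module on a hypothetical population of 100 member IDs (the population, sample sizes, and cluster sizes are all made up for illustration):

```python
import random

random.seed(0)  # reproducible for the example

# Hypothetical population: 100 member IDs (illustrative only)
population = list(range(1, 101))

# Simple random sample: every group of 10 members is equally likely
simple = random.sample(population, 10)

# Systematic sample: random start, then every 10th member from the listing
start = random.randrange(10)
systematic = population[start::10]

# Cluster sample: split into 10 clusters of 10, randomly pick 2 whole clusters
clusters = [population[i:i + 10] for i in range(0, 100, 10)]
cluster_sample = [m for c in random.sample(clusters, 2) for m in c]

print(len(simple), len(systematic), len(cluster_sample))
```

A stratified sample would work like the cluster sketch, except that a proportionate number of members is drawn from every stratum instead of taking a few whole clusters.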

A type of sampling that is non-random is convenience sampling. Convenience sampling involves using results that are readily available. The results of convenience sampling may be very good in some cases and highly biased (favor certain outcomes) in others.

True random sampling is done with replacement. That is, once a member is picked, that member goes back into the population and may be chosen more than once. However, for practical reasons, in most populations, simple random sampling is done without replacement. Surveys are typically done without replacement. Sampling without replacement instead of with replacement only becomes a mathematical issue when the population is small. When you analyze data, it is important to be aware of sampling errors and non-sampling errors. The actual process of sampling causes sampling errors. Factors not related to the sampling process cause non-sampling errors. In reality, a sample will never be exactly representative of the population, so there will always be some sampling error. In statistics, a sampling bias is created when a sample is collected from a population and some members of the population are not as likely to be chosen as others.

Common problems to be aware of in statistical studies:

  • Problems with samples: A sample must be representative of the population. A sample that is not representative of the population is biased. Biased samples that are not representative of the population give results that are inaccurate and not valid.
  • Self-selected samples: Responses only by people who choose to respond, such as call-in surveys, are often unreliable.
  • Sample size issues: Samples that are too small may be unreliable. Larger samples are better, if possible. In some situations, having small samples is unavoidable and can still be used to draw conclusions.
  • Undue influence: Collecting data or asking questions in a way that influences the response.
  • Non-response or refusal of subject to participate: The collected responses may no longer be representative of the population.
  • Causality: A relationship between two variables does not mean that one causes the other to occur. They may be related (correlated) because of their relationship through a different variable.
  • Self-funded or self-interest studies: A study performed by a person or organization in order to support their claim.
  • Misleading use of data: improperly displayed graphs, incomplete data, or lack of context
  • Confounding: When the effects of multiple factors on a response cannot be separated. Confounding makes it difficult or impossible to draw valid conclusions about the effect of each factor.

Frequency, Frequency Tables, and Levels of Measurement

A simple way to round off answers is to carry your final answer one more decimal place than was present in the original data. Round off only the final answer. Do not round off any intermediate results, if possible. If it becomes necessary to round off intermediate results, carry them to at least twice as many decimal places as the final answer.

The way a set of data is measured is called its level of measurement. Data can be classified into four levels of measurement:

  • Nominal Scale Level
    • Data that is measured using a nominal scale is qualitative (categorical). Nominal scale data are not ordered. Nominal scale data cannot be used in calculations.
  • Ordinal Scale Level
    • Data that is measured using an ordinal scale is similar to nominal scale data but there is a big difference: ordinal scale data can be ordered. Ordinal scale data still cannot be used in calculations.
  • Interval Scale Level
    • Data that is measured using the interval scale is similar to ordinal level data because it has a definite ordering, but there is a difference: the differences between interval scale data can be measured, though the data does not have a starting point (a true zero).
  • Ratio Scale Level
    • Data that is measured using the ratio scale takes care of the ratio problem and gives you the most information. Ratio scale data is like interval scale data, but it has a 0 point, so ratios can be calculated.

A frequency is the number of times a value of the data occurs. A relative frequency is the ratio (fraction or proportion) of the number of times of the data occurs in the set of all outcomes to the total number of outcomes. Cumulative relative frequency is the accumulation of the previous relative frequencies. To find the cumulative relative frequencies, add all the previous relative frequencies to the relative frequency for the current row.
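
A minimal Python sketch of building a frequency table from hypothetical data (the sleep-hours values below are invented for illustration):

```python
from collections import Counter

# Hypothetical data: hours of sleep reported by 20 students
data = [5, 6, 6, 6, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 9, 9, 9, 10, 10]

counts = Counter(data)
n = len(data)

cumulative = 0.0
rows = []
for value in sorted(counts):
    freq = counts[value]
    rel_freq = freq / n        # frequency / total number of outcomes
    cumulative += rel_freq     # running total of previous relative frequencies
    rows.append((value, freq, rel_freq, cumulative))

# The final cumulative relative frequency is always 1 (all data accounted for)
print(rows[-1])
```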

Experimental Design and Ethics

The purpose of an experiment is to investigate the relationship between two variables. When one variable causes change in another, we call the first variable the explanatory variable. The affected variable is called the response variable. In a randomized experiment, the researcher manipulates values of the explanatory variables and measures the resulting changes in the response variable. The different values of the explanatory variables are called treatments. An experimental unit is a single object or individual to be measured. Additional variables that can cloud a study are called lurking variables. In order to prove that the explanatory variable is causing a change in the response variable, it is necessary to isolate the explanatory variable. The researcher must design her experiment in such a way that there is only one difference between groups being compared: the planned treatments. This is accomplished by random assignment of experimental units to treatment groups. A control group helps researchers balance the effects of being in an experiment with the effects of the active treatments. Blinding in a randomized experiment preserves the power of suggestion. When a person involved in a research study is blinded, he does not know who is receiving the active treatment(s) and who is receiving the placebo treatment. A double-blind experiment is one in which both the subjects and the researchers involved are blinded.

Ethics

Professional organizations, like the American Statistical Association, clearly define expectations for researchers. There are even laws in the federal code about the use of research data.


Descriptive Statistics


Descriptive statistics is the area of statistics that studies numerical and graphical ways to describe and display data. A statistical graph is a tool that helps you learn about the shape or distribution of a sample or a population. Some of the types of graphs that are used to summarize and organize data are the dot plot, the bar graph, the histogram, the stem-and-leaf plot, the frequency polygon, the pie chart, and the box plot.

Stem and Leaf Graphs, Line Graphs, and Bar Graphs

One simple graph, the stem-and-leaf graph or stem plot, comes from the field of exploratory data analysis. It is a good choice when the data sets are small. To create the plot, divide each observation of data into a stem and a leaf. The leaf consists of a final significant digit. Draw a vertical line to the right of the stems, then write the leaves in increasing order next to their corresponding stem.

Stem and Leaf Graph

The stem plot is a quick way to graph data and gives an exact picture of the data. You want to look for an overall pattern and any outliers. An outlier is an observation of data that does not fit the rest of the data. It is sometimes called an extreme value. When you graph an outlier, it will appear not to fit the pattern of the graph. Some outliers are due to mistakes while others may indicate something unusual is happening.
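
The construction rule above is simple enough to sketch in a few lines of Python; the exam scores here are hypothetical, with tens digits as stems and ones digits as leaves:

```python
# Hypothetical exam scores (illustrative data)
scores = [61, 63, 68, 72, 74, 74, 79, 81, 85, 85, 88, 93, 97]

stems = {}
for s in sorted(scores):
    stems.setdefault(s // 10, []).append(s % 10)  # stem = tens, leaf = ones

lines = [f"{stem} | {' '.join(str(leaf) for leaf in leaves)}"
         for stem, leaves in sorted(stems.items())]
print("\n".join(lines))  # first line is "6 | 1 3 8"
```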

Histograms, Frequency Polygons, and Time Series Graphs

One advantage of a histogram is that it can readily display large data sets. A rule of thumb is to use a histogram when the data set consists of 100 values or more. A histogram consists of continuous (adjoining) boxes. It has both a horizontal and a vertical axis. The horizontal axis is labeled with what the data represents. The vertical axis is labeled either frequency or relative frequency (or percent frequency or probability). The histogram (like the stem plot) will give you the shape of the data, the center, and the spread of the data.

The relative frequency is equal to the frequency for an observed value of the data divided by the total number of data values in the sample.

To construct a histogram, first decide how many bars or intervals, also called classes, represent the data. Many histograms consist of five to 15 bars or classes for clarity. The number of bars needs to be chosen. Choose a starting point for the first interval that is less than the smallest data value. A convenient starting point is a lower value carried out to one more decimal place than the value with the most decimal places.
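
The class-interval bookkeeping can be sketched in plain Python (no plotting); the data set, the choice of five classes, and the 0.05 offset below the minimum are all hypothetical, chosen only to illustrate the rule above:

```python
# Hypothetical data set and a five-class histogram computed by hand
data = [60.5, 61, 62, 63, 63, 64, 64, 64, 66, 67, 68, 68, 69, 70, 71, 72, 73, 74]

num_classes = 5
# Start one more decimal place below the smallest value, per the rule above
start = min(data) - 0.05
width = (max(data) - start) / num_classes
edges = [start + i * width for i in range(num_classes + 1)]

counts = []
for i in range(num_classes):
    lo, hi = edges[i], edges[i + 1]
    if i < num_classes - 1:
        counts.append(sum(1 for x in data if lo <= x < hi))
    else:
        counts.append(sum(1 for x in data if lo <= x))  # last class includes the max
print(counts, sum(counts))
```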

Frequency polygons are analogous to line graphs, and just as line graphs make continuous data visually easy to interpret, so too do frequency polygons. To construct a frequency polygon, first examine the data and decide on the number of intervals, or class intervals, to use on the x-axis and y-axis. After choosing the appropriate ranges, begin plotting the data points.

Frequency Polygons

Measures of the Location of the Data

The common measures of location are quartiles and percentiles. Quartiles are special percentiles. The first quartile, Q1, is the same as the 25th percentile, and the third quartile, Q3, is the same as the 75th percentile. The median is called both the second quartile and the 50th percentile. Percentiles are mostly used with very large populations. The median is the number that measures the center of the data. You can think of the median as the middle value, but it does not actually have to be one of the observed values. The interquartile range (IQR) is a number that indicates the spread of the middle half, or the middle 50%, of the data. It is the difference between the third quartile (Q3) and the first quartile (Q1): IQR = Q3 − Q1. The IQR can help identify potential outliers. A value is suspected to be a potential outlier if it is less than (1.5)(IQR) below the first quartile or more than (1.5)(IQR) above the third quartile.

To find the kth percentile:

  • (With k being the percentile, i the index (ranking position of a data value), and n the total number of data values, respectively)
  • Order the data from smallest to largest
  • Calculate i = (k/100)(n + 1)
    • If i is an integer, then the kth percentile is the data value in the ith position in the ordered set of data.
    • If i is not an integer, then round i up and round i down and take the average of the values at those positions.

To find the percentile of a given value, calculate ((x + 0.5y)/n)(100), where x is the number of data values counting from the bottom of the data list up to but not including the given value, and y is the number of data values equal to the given value.
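
The position rule translates directly into code. This sketch implements the (k/100)(n + 1) method described above on a hypothetical data set and then uses it to get the quartiles, the IQR, and the 1.5·IQR outlier fences:

```python
def percentile(data, k):
    """kth percentile using the (k/100)(n + 1) position rule from the notes."""
    s = sorted(data)
    n = len(s)
    i = (k / 100) * (n + 1)
    if i == int(i):
        return s[int(i) - 1]        # positions are 1-indexed
    lo, hi = int(i) - 1, int(i)     # round i down and up, then average
    return (s[lo] + s[hi]) / 2

# Hypothetical data set of 14 values
data = [1, 11.5, 6, 7.2, 4, 8, 9, 10, 6.8, 8.3, 2, 2, 10, 1]

q1 = percentile(data, 25)
q3 = percentile(data, 75)
iqr = q3 - q1
fences = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)  # potential-outlier cutoffs
print(q1, q3, iqr, fences)
```

Note that other conventions for computing percentiles exist (libraries such as NumPy offer several interpolation modes), so results from this rule may differ slightly from other software.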

In the interest of time, I am going to start only noting things that I think are important or new to me. For example, I won't note equations like those shown above, since those operations can easily be accomplished with code.

A percentile indicates the relative standing of a data value when data are sorted into numerical order from smallest to largest. k percent of data values are less than or equal to the kth percentile.

Box Plots

Box Plots (also called box-and-whisker plots or box-whisker plots) give a good graphical image of the concentration of the data. They also show how far the extreme values are from most of the data. A box plot is constructed from five values: the minimum value, the first quartile, the median, the third quartile, and the maximum value. The middle 50% of the data fall inside the box.

Box Plot

Measures of the Center of Data

The two most widely used measures of the center of the data are the mean and the median. The mean is the most common measure of the center. The letter used to represent the sample mean is x̄. The Greek letter μ represents the population mean. One of the requirements for the sample mean to be a good estimate of the population mean is for the sample taken to be truly random. Another measure of the center is the mode. The mode is the most frequent value. There can be more than one mode. The Law of Large Numbers says that if you take samples of larger and larger size from any population, the mean x̄ of the sample is very likely to get closer and closer to μ. You can think of a sampling distribution as a relative frequency distribution with a great many samples. A statistic is a number calculated from a sample.
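
Python's standard library `statistics` module covers all three measures of center; the sample below is hypothetical:

```python
import statistics

# Hypothetical sample
sample = [2, 3, 3, 4, 5, 5, 5, 6, 7, 10]

mean = statistics.mean(sample)      # the sample mean, written x-bar in the notes
median = statistics.median(sample)  # middle value of the ordered data
mode = statistics.mode(sample)      # most frequent value

print(mean, median, mode)
```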

A distribution is symmetrical if a vertical line can be drawn at some point in the histogram such that the shape to the left and the right of the vertical line are mirror images of each other. In a perfectly symmetrical distribution, the mean and the median are the same. Generally, if the distribution of data is skewed to the left, the mean is less than the median, which is often less than the mode. If the distribution of data is skewed to the right, the mode is often less than the median, which is less than the mean.

Measures of the Spread of the Data

An important characteristic of any set of data is the variation in the data. In some data sets, the data values are concentrated closely near the mean; in other data sets, the data values are more widely spread out from the mean. The most common measure of variation, or spread, is the standard deviation. The standard deviation is a number that measures how far data values are from the mean. The standard deviation provides a measure of the overall variation in a data set. The standard deviation is small when the data are all concentrated close to the mean, exhibiting little variation or spread. The standard deviation is larger when the data values are more spread out from the mean, exhibiting more variation. The standard deviation can be used to determine whether a data value is close to or far from the mean. The difference between a data value and the mean is called its deviation. The variance is the average of the squares of the deviations. The symbol σ² represents the population variance; the population standard deviation σ is the square root of the population variance. The symbol s² represents the sample variance; the sample standard deviation s is the square root of the sample variance.

If the numbers come from a census of the entire population and not a sample, when we calculate the average of the squared deviations to find the variance, we divide by N, the number of items in the population. If the data are from a sample rather than a population, when we calculate the average of the squared deviations, we divide by n − 1, one less than the number of items in the sample.
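
The standard library makes the divide-by-N versus divide-by-(n − 1) distinction explicit: `statistics.pvariance` treats the data as the whole population, while `statistics.variance` treats it as a sample. A quick sketch on hypothetical values:

```python
import statistics

# Hypothetical measurements
values = [9, 9.5, 9.5, 10, 10, 10, 10, 10.5, 10.5, 10.5, 10.5, 11, 11, 11, 11.5]

# Treating the values as an entire population: divide the squared deviations by N
pop_var = statistics.pvariance(values)
# Treating the values as a sample: divide by n - 1 instead
samp_var = statistics.variance(values)

# The sample variance is always the larger of the two for the same data
print(pop_var, samp_var, statistics.stdev(values))
```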

How much the statistic varies from one sample to another is known as the sampling variability of a statistic. You typically measure the sampling variability of a statistic by its standard error. The standard error of the mean is an example of a standard error. It is a special standard deviation and is known as the standard deviation of the sampling distribution of the mean. If you add the deviations of all values of data, the sum is always zero.


Probability Topics


Probability deals with the chance of an event occurring.

Terminology

Probability is a measure that is associated with how certain we are of outcomes of a particular experiment or activity. An experiment is a planned operation carried out under controlled conditions. If the result is not predetermined, then the experiment is said to be a chance experiment. A result of an experiment is called an outcome. The sample space of an experiment is the set of all possible outcomes. An event is any combination of outcomes. Upper case letters like A and B represent events. The probability of an event A is written P(A). The probability of any outcome is the long-term relative frequency of that outcome. Probabilities are between zero and one, inclusive (that is, zero and one and all numbers between these values). P(A) = 0 means the event A can never happen. P(A) = 1 means the event A always happens. Equally likely means that each outcome of an experiment occurs with equal probability. To calculate the probability of an event A when all outcomes in the sample space are equally likely, count the number of outcomes for event A and divide by the total number of outcomes in the sample space. An important characteristic of probability experiments is known as the law of large numbers, which states that as the number of repetitions of an experiment is increased, the relative frequency obtained in the experiment tends to become closer and closer to the theoretical probability.

An outcome is in the event A OR B if the outcome is in A or is in B or is in both A and B. An outcome is in the event A AND B if the outcome is in both A and B at the same time.

The complement of event A is denoted A′. A′ consists of all outcomes that are NOT in A. The conditional probability of A given B is written P(A|B). P(A|B) is the probability that event A will occur given that event B has already occurred. A conditional reduces the sample space.

Independent and Mutually Exclusive Events

Independent and mutually exclusive do not mean the same thing. Two events A and B are independent if any one of the following is true:

  • P(A|B) = P(A)
  • P(B|A) = P(B)
  • P(A AND B) = P(A)P(B)

Two events A and B are independent if the knowledge that one occurred does not affect the chance the other occurs. To show two events are independent, you must show only one of the above conditions. If two events are NOT independent, then we say that they are dependent. By contrast, two events A and B are mutually exclusive if they cannot occur at the same time; that is, P(A AND B) = 0.

If it is not known whether A and B are independent, assume they are dependent until you can show otherwise.

Two Basic Rules of Probability

When calculating probability, there are two rules to consider when determining if two events are independent or dependent and if they are mutually exclusive or not.

The Multiplication Rule

If A and B are two events defined on a sample space, then P(A AND B) = P(B)P(A|B), which may also be written as P(A AND B) = P(A)P(B|A). If A and B are independent, then P(A|B) = P(A), and P(A AND B) = P(B)P(A|B) becomes P(A AND B) = P(A)P(B).

The Addition Rule

If A and B are defined on a sample space, then P(A OR B) = P(A) + P(B) − P(A AND B). If A and B are mutually exclusive, then P(A AND B) = 0, and P(A OR B) = P(A) + P(B) − P(A AND B) becomes P(A OR B) = P(A) + P(B).
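
Because a standard 52-card deck is a small, equally likely sample space, the addition rule can be verified exactly by enumeration. This sketch uses `Fraction` to avoid floating-point rounding; the choice of events (hearts and face cards) is just for illustration:

```python
from fractions import Fraction

# Enumerate a standard 52-card deck and check the addition rule exactly
ranks = list(range(1, 14))          # 1 = ace ... 11, 12, 13 = jack, queen, king
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = [(r, s) for r in ranks for s in suits]

is_heart = lambda card: card[1] == "hearts"
is_face = lambda card: card[0] >= 11

def prob(event):
    # equally likely outcomes: count outcomes in the event / total outcomes
    return Fraction(sum(1 for c in deck if event(c)), len(deck))

p_or = prob(lambda c: is_heart(c) or is_face(c))
check = prob(is_heart) + prob(is_face) - prob(lambda c: is_heart(c) and is_face(c))
print(p_or, check)  # both are 11/26
```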

Contingency Tables

A contingency table provides a way of portraying data that can facilitate calculating probabilities. The table helps in determining conditional probabilities quite easily. The table displays sample values in relation to two different variables that may be dependent or contingent on one another.


Discrete Random Variables


A random variable describes the outcomes of a statistical experiment in words. The values of a random variable can vary with each repetition of an experiment. Upper case letters like X and Y denote a random variable. Lower case letters like x and y denote the value of a random variable. If X is a random variable, then X is written in words, and x is given as a number.

A discrete probability distribution function has two characteristics:

  1. Each probability is between zero and one, inclusive.
  2. The sum of the probabilities is one.

Mean or Expected Value and Standard Deviation

The expected value is often referred to as the long-term average or mean. This means that over the long term of doing an experiment over and over, you would expect this average.
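
Computing the expected value and standard deviation of a discrete distribution is a direct weighted sum; the probability distribution below is hypothetical:

```python
import math

# Hypothetical discrete distribution: X = number of activities per day
pdf = {0: 0.2, 1: 0.3, 2: 0.3, 3: 0.2}

# Both characteristics of a discrete pdf should hold: each P is in [0, 1],
# and the probabilities sum to one
assert all(0 <= p <= 1 for p in pdf.values())
assert abs(sum(pdf.values()) - 1) < 1e-9

mu = sum(x * p for x, p in pdf.items())                        # expected value
sigma = math.sqrt(sum((x - mu) ** 2 * p for x, p in pdf.items()))

print(mu, sigma)
```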

Binomial Distribution

There are three characteristics of a binomial distribution:

  1. There are a fixed number of trials. Think of trials as repetitions of an experiment. The letter n denotes the number of trials.
  2. There are only two possible outcomes, called success and failure, for each trial. The letter p denotes the probability of a success on one trial, and q denotes the probability of a failure on one trial, where p + q = 1.
  3. The trials are independent and are repeated using identical conditions. Because the trials are independent, the outcome of one trial does not help in predicting the outcome of another trial. Another way of saying this is that for each individual trial, the probability of a success and probability of a failure remain the same. For example, randomly guessing at a true-false statistics question has only two outcomes. If a success is guessing correctly, then a failure is guessing incorrectly.

The outcomes of a binomial experiment fit a binomial probability distribution. The random variable X = the number of successes obtained in the n independent trials. The mean, μ, and variance, σ², for the binomial probability distribution are μ = np and σ² = npq. The standard deviation, σ, is then σ = √(npq). Any experiment that has characteristics two and three and where n = 1 is called a Bernoulli Trial.
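
The binomial probability mass function can be written from scratch with `math.comb`; the values of n and p below are hypothetical:

```python
import math

def binom_pmf(n, p, x):
    """P(X = x) for a binomial distribution: C(n, x) * p^x * q^(n - x)."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 20, 0.41          # hypothetical: 20 trials, 41% chance of success each
mean = n * p                      # mu = np
sd = math.sqrt(n * p * (1 - p))   # sigma = sqrt(npq)

# Sanity check: the probabilities over all possible x sum to one
total = sum(binom_pmf(n, p, x) for x in range(n + 1))
print(mean, sd, total)
```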

Other Distributions to Look Into:

  • Geometric Distribution
  • Hypergeometric Distribution
  • Poisson Distribution
  • Discrete Distribution


Continuous Random Variables


The graph of a continuous probability distribution is a curve. Probability is represented by area under the curve. The curve is called the probability density function (abbreviated pdf). We use the symbol f(x) to represent the curve. f(x) is the function that corresponds to the graph; we use the density function f(x) to draw the graph of the probability distribution.

Area under the curve is given by a different function called the cumulative distribution function (abbreviated as cdf). The cumulative distribution function is used to evaluate probability as area. For continuous probability distributions, probability = area. The uniform distribution is a continuous probability distribution and is concerned with events that are equally likely to occur.

The Exponential Distribution

The exponential distribution is often concerned with the amount of time until some specific event occurs. Values for an exponential random variable occur in the following way: there are fewer large values and more small values. Exponential distributions are commonly used in calculations of product reliability or the length of time a product lasts. The memoryless property says that knowledge of what has occurred in the past has no effect on future probabilities.
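
For an exponential distribution with decay rate m, the survival probability is P(X > x) = e^(−mx), and the memoryless property P(X > r + t | X > r) = P(X > t) can be checked numerically. The rate and the values of r and t here are hypothetical:

```python
import math

m = 0.25  # hypothetical decay rate; the mean is 1/m = 4

def survival(x):
    """P(X > x) for an exponential distribution with rate m."""
    return math.exp(-m * x)

# Memoryless property: having already waited r, the chance of waiting
# t more is the same as the chance of waiting t from the start
r, t = 3.0, 2.0
conditional = survival(r + t) / survival(r)
print(conditional, survival(t))  # both equal e^(-0.5)
```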


The Normal Distribution


The normal, continuous distribution is the most important of all the distributions. The normal distribution has two parameters (two numerical descriptive measures): the mean (μ) and the standard deviation (σ). If X is a quantity to be measured that has a normal distribution with mean (μ) and standard deviation (σ), we designate this by writing X ~ N(μ, σ).

Normal Distribution

Standard Normal Distribution

The standard normal distribution is a normal distribution of standardized values called z-scores. A z-score is measured in units of the standard deviation. The z-score tells you how many standard deviations the value x is above (to the right of) or below (to the left of) the mean, μ: z = (x − μ)/σ. The empirical rule, also known as the 68-95-99.7 rule:

  • About 68% of values lie within 1 standard deviation of the mean
  • About 95% of values lie within 2 standard deviations of the mean
  • About 99.7% of values lie within 3 standard deviations of the mean

P(X < x) = area under the curve to the left of the value x, and P(X > x) = area under the curve to the right of the value x.
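
Z-scores and normal areas need no special library: the standard normal cdf can be written with `math.erf`. The example distribution N(5, 2) is hypothetical, and the "about 95%" figure of the empirical rule falls out of the cdf:

```python
import math

def z_score(x, mu, sigma):
    """Number of standard deviations x lies from the mean."""
    return (x - mu) / sigma

def area_left(z):
    """Area under the standard normal curve to the left of z (the cdf)."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu, sigma = 5, 2                 # hypothetical X ~ N(5, 2)
z = z_score(9, mu, sigma)        # 9 is two standard deviations above the mean

# Empirical rule check: area within 2 standard deviations is about 95%
within_2 = area_left(2) - area_left(-2)
print(z, within_2)
```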


The Central Limit Theorem


The central limit theorem is one of the most powerful and useful ideas in all of statistics. It is concerned with drawing finite samples of size n from a population with a known mean, μ, and a known standard deviation, σ. The first alternative says that if we collect samples of size n with a large enough n, calculate each sample's mean, and create a histogram of those means, then the resulting histogram will tend to have an approximate normal bell shape. The second alternative says that if we again collect samples of size n that are large enough, calculate the sum of each sample and create a histogram, then the resulting histogram will again tend to have a normal bell shape.

The size of the sample, , that is required to be large enough depends on the original population from which the samples are drawn (the sample size should be at least 40 or the data should come from a normal distribution). If the original population is far from normal, then more observations are needed for the sample means or sums to be normal. Sampling is done with replacement.

The Central Limit Theorem for Sample Means

Suppose X is a random variable with a distribution that may be known or unknown (it can be any distribution). Using a subscript that matches the random variable, suppose:

  • μ_X = the mean of X
  • σ_X = the standard deviation of X

If you draw random samples of size n, then as n increases, the random variable X̄, which consists of sample means, tends to be normally distributed and X̄ ~ N(μ_X, σ_X/√n).

The central limit theorem for sample means says that if you keep drawing larger and larger samples (such as rolling one, two, five, and finally ten dice) and calculating their means, the sample means form their own normal distribution (the sampling distribution). The normal distribution has the same mean as the original distribution and a variance that equals the original variance divided by the sample size. Standard deviation is the square root of variance, so the standard deviation of the sampling distribution is the standard deviation of the original distribution divided by the square root of n. The variable n is the number of values that are averaged together, not the number of times the experiment is done.

If you draw random samples of size n, the distribution of the random variable X̄, which consists of sample means, is called the sampling distribution of the mean. The sampling distribution of the mean approaches a normal distribution as n, the sample size, increases.
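
A quick simulation makes the theorem concrete. The parent distribution below is a deliberately skewed, made-up weighted die; even so, the means of samples of size 40 cluster tightly around the parent mean, with a spread close to σ/√n:

```python
import random
import statistics

random.seed(1)  # reproducible for the example

# Skewed parent distribution: a hypothetical die that mostly rolls low
def draw():
    return random.choices([1, 2, 3, 4, 5, 6], weights=[6, 5, 4, 3, 2, 1])[0]

n = 40  # sample size (at least 40, per the rule of thumb above)
means = [statistics.mean(draw() for _ in range(n)) for _ in range(2000)]

# The sampling distribution of the mean narrows like sigma / sqrt(n)
spread = statistics.stdev(means)
print(round(statistics.mean(means), 2), round(spread, 3))
```

Plotting a histogram of `means` would show the approximate bell shape the theorem predicts, despite the skewed parent.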

The Central Limit Theorem for Sums

Suppose X is a random variable with a distribution that may be known or unknown and suppose:

  • μ_X = the mean of X
  • σ_X = the standard deviation of X

If you draw random samples of size n, then as n increases, the random variable ΣX consisting of sums tends to be normally distributed and ΣX ~ N(nμ_X, (√n)(σ_X)).

The central limit theorem for sums says that if you keep drawing larger and larger samples and taking their sums, the sums form their own normal distribution (the sampling distribution), which approaches a normal distribution as the sample size increases. The normal distribution has a mean equal to the original mean multiplied by the sample size and a standard deviation equal to the original standard deviation multiplied by the square root of the sample size.


Confidence Intervals


We use sample data to make generalizations about an unknown population. This part of statistics is called inferential statistics. The sample data help us to make an estimate of a population parameter. We realize that the point estimate is most likely not the exact value of the population parameter, but close to it. After calculating point estimates, we construct interval estimates, called confidence intervals. The confidence interval is a random variable. A confidence interval is a type of estimate that is an interval of numbers. It provides a range of reasonable values in which we expect the population parameter to fall. There is no guarantee that a given confidence interval does capture the parameter, but there is a predictable probability of success. A confidence interval is created for an unknown population parameter like a population mean, μ. Confidence intervals for some parameters have the form:

(point estimate - margin of error, point estimate + margin of error)

The margin of error depends on the confidence level or percentage of confidence and the standard error of the mean.

A Single Population Mean Using the Normal Distribution

A confidence interval for a population mean with a known standard deviation is based on the fact that the sample means follow an approximately normal distribution. To construct a confidence interval for a single unknown population mean μ, where the population standard deviation is known, we need x̄ as an estimate for μ and we need the margin of error. Here, the margin of error is called the error bound for a population mean (EBM). The sample mean x̄ is the point estimate of the unknown population mean μ. The confidence interval estimate will have the form: (x̄ − EBM, x̄ + EBM). The margin of error depends on the confidence level (abbreviated CL). The confidence level is often considered the probability that the calculated confidence interval estimate will contain the true population parameter. The confidence level is the percent of confidence intervals that contain the true population parameter when repeated samples are taken. Most often, it is the choice of the person constructing the confidence interval to choose a confidence level of 90% or higher because that person wants to be reasonably certain of his or her conclusions. α is the probability that the interval does not contain the unknown population parameter, so CL + α = 1.

To construct a confidence interval estimate for an unknown population mean, we need data from a random sample. The steps to construct and interpret the confidence interval are:

  • Calculate the sample mean x̄ from the sample data.
  • Find the z-score that corresponds to the confidence level.
  • Calculate the error bound EBM.
  • Construct the confidence interval.
  • Write a sentence that interprets the estimate in the context of the problem.
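The steps above can be sketched in Python. The sample values and the assumed known σ below are made up purely for illustration:

```python
# A minimal sketch of a z-based confidence interval for a mean when the
# population standard deviation sigma is known. Data and sigma are
# hypothetical.
import math
from statistics import mean

sample = [98, 102, 101, 97, 103, 99, 100, 104]  # hypothetical data
sigma = 2.5          # assumed known population standard deviation
z_95 = 1.96          # z-score for a 95% confidence level

x_bar = mean(sample)                         # point estimate of mu
ebm = z_95 * sigma / math.sqrt(len(sample))  # error bound for the mean
ci = (x_bar - ebm, x_bar + ebm)              # (x_bar - EBM, x_bar + EBM)
print(ci)
```

We are 95% confident that the interval printed above contains the true population mean, in the repeated-sampling sense described earlier.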

A Single Population Mean Using the Student t Distribution

In practice, we rarely know the population standard deviation σ. When σ is unknown, we estimate it with the sample standard deviation s and use the Student's t-distribution in place of the standard normal distribution.

Properties of the Student's t-Distribution:

  • The graph of the Student's t-distribution is similar to the standard normal curve.
  • The Student's t-distribution has more probability in its tails than the standard normal distribution because the spread of the t-distribution is greater than the spread of the standard normal. So the graph of the Student's t-distribution will be thicker in the tails and shorter in the center than the graph of the standard normal distribution.
  • The exact shape of the Student's t-distribution depends on the degrees of freedom. As the degrees of freedom increase, the graph of the Student's t-distribution becomes more like the graph of the standard normal distribution.
  • The underlying population of individual observations is assumed to be normally distributed with unknown population mean μ and unknown population standard deviation σ. The size of the underlying population is generally not relevant unless it is very small. If it is bell shaped (normal), then the assumption is met and doesn't need discussion. Random sampling is assumed, but that is a completely separate assumption from normality.
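A t-based interval follows the same recipe as the z-based one, with the sample standard deviation s and a t critical value (with n − 1 degrees of freedom) in place of σ and z. A sketch with hypothetical data:

```python
# A minimal sketch of a t-based confidence interval, used when sigma is
# unknown and estimated by the sample standard deviation s. Data are
# hypothetical; scipy.stats supplies the t critical value.
import math
from statistics import mean, stdev
from scipy import stats

sample = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7]
n = len(sample)
x_bar = mean(sample)
s = stdev(sample)                      # sample standard deviation
t_crit = stats.t.ppf(0.975, df=n - 1)  # two-tailed 95%, n - 1 degrees of freedom
ebm = t_crit * s / math.sqrt(n)        # error bound using s in place of sigma
ci = (x_bar - ebm, x_bar + ebm)
print(ci)
```

Note that the t critical value is larger than the corresponding z-score of 1.96, reflecting the heavier tails of the t-distribution.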


Hypothesis Testing with One Sample


Confidence intervals are one way to estimate a population parameter. A hypothesis test involves collecting data from a sample and evaluating the data. Then, the statistician makes a decision as to whether or not there is sufficient evidence, based upon analyses of the data, to reject the null hypothesis.

Hypothesis testing consists of two contradictory hypotheses or statements, a decision based on the data, and a conclusion. To perform a hypothesis test, a statistician will:

  1. Set up two contradictory hypotheses.
  2. Collect sample data.
  3. Determine the correct distribution to perform the hypothesis test.
  4. Analyze sample data by performing the calculations that ultimately will allow you to reject or decline to reject the null hypothesis.
  5. Make a decision and write a meaningful conclusion.

Null and Alternative Hypotheses

The actual test begins by considering two hypotheses. They are called the null hypothesis and the alternative hypothesis. These hypotheses contain two opposing viewpoints:

  • H₀: The null hypothesis: It is a statement of no difference between sample means or proportions or no difference between a sample mean or proportion and a population mean or proportion. In other words, the difference equals 0.
  • Hₐ: The alternative hypothesis: It is a claim about the population that is contradictory to H₀ and what we conclude when we reject H₀.

Since the null and alternative hypotheses are contradictory, you must examine evidence to decide if you have enough evidence to reject the null hypothesis or not. The evidence is in the form of sample data. After you have determined which hypothesis the sample supports, you make a decision to either reject the null hypothesis or not to reject the null hypothesis.

Outcomes and Type I and Type II Errors

The four possible outcomes of a hypothesis test are:

  1. The decision is not to reject H₀ when H₀ is true (this is a correct decision).
  2. The decision is to reject H₀ when H₀ is true, known as a Type I error.
  3. The decision is not to reject H₀ when H₀ is false, known as a Type II error.
  4. The decision is to reject H₀ when H₀ is false; the probability of this correct decision is called the Power of the Test.

Each of the errors occurs with a particular probability. The Greek letters α and β represent the probabilities.

  • α is the probability of a Type I error = P(Type I error) = the probability of rejecting the null hypothesis when the null hypothesis is true.
  • β is the probability of a Type II error = P(Type II error) = the probability of not rejecting the null hypothesis when the null hypothesis is false.

α and β should be as small as possible because they are probabilities of errors. They are rarely zero. The Power of the Test is 1 − β. Ideally, we want a high power that is as close to one as possible.
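A tiny simulation (with made-up parameters) illustrates that α really is the Type I error rate: when the null hypothesis is true, a two-tailed test at the 5% level rejects about 5% of the time.

```python
# Simulate many samples from a population where H0 is actually true
# (mu = mu0) and count how often a 5%-level z-test rejects anyway.
# All numbers here are illustrative assumptions.
import math
import random
import statistics

random.seed(0)
mu0, sigma, n, trials = 0.0, 1.0, 30, 2000
rejections = 0
for _ in range(trials):
    sample = [random.gauss(mu0, sigma) for _ in range(n)]
    z = (statistics.mean(sample) - mu0) / (sigma / math.sqrt(n))
    if abs(z) > 1.96:          # two-tailed test at alpha = 0.05
        rejections += 1        # a Type I error: rejecting a true H0
rate = rejections / trials
print(rate)                    # should be near alpha = 0.05
```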

Distribution Needed for Hypothesis Testing

Particular distributions are associated with hypothesis testing. Perform tests of a population mean using a normal distribution or a Student's t-distribution. We perform tests of a population proportion using a normal distribution.

When you perform a hypothesis test of a single population mean using a Student's t-distribution (often called a t-test), there are fundamental assumptions that need to be met in order for the test to work properly. Your data should come from a simple random sample drawn from a population that is approximately normally distributed. You use the sample standard deviation to approximate the population standard deviation.

When you perform a hypothesis test of a single population mean using a normal distribution (often called a z-test), you take a simple random sample from the population. The population you are testing is normally distributed or your sample size is sufficiently large. When you perform a hypothesis test of a single population proportion p, you take a simple random sample from the population. You must meet the conditions for a binomial distribution, which are: there are a certain number of independent trials, the outcomes of any trial are success or failure, and each trial has the same probability of a success p.

We use sample data to calculate the actual probability of getting the test result, called the p-value. The p-value is the probability that, if the null hypothesis is true, the results from another randomly selected sample will be as extreme as or more extreme than the results obtained from the given sample. A large p-value calculated from the data indicates that we should not reject the null hypothesis. The smaller the p-value, the more unlikely the outcome, and the stronger the evidence is against the null hypothesis.

A systematic way to make a decision of whether to reject or not reject the null hypothesis is to compare the p-value and a preset or preconceived α (also called the significance level). A preset α is the probability of a Type I error (rejecting the null hypothesis when the null hypothesis is true). It may or may not be given to you at the beginning of the problem.

You make the decision to reject or not reject H₀ as follows:

  • If α > p-value, reject H₀. The results of the sample data are significant. There is sufficient evidence to conclude that H₀ is an incorrect belief and that the alternative hypothesis, Hₐ, may be correct.
  • If α ≤ p-value, do not reject H₀. The results of the sample data are not significant. There is not sufficient evidence to conclude that the alternative hypothesis, Hₐ, may be correct.
  • When you do not reject H₀, it does not mean that you should believe that H₀ is true. It simply means that the sample data have failed to provide sufficient evidence to cast serious doubt about the truthfulness of H₀.
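The p-value decision rule can be sketched end to end. The summary numbers below (hypothesized mean, known σ, sample size, observed sample mean) are hypothetical:

```python
# A sketch of a two-tailed z-test of a mean, comparing the p-value to a
# preset alpha. All sample summary numbers are made up for illustration.
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu0, sigma, n, x_bar = 100.0, 15.0, 36, 106.0  # assumed values
z = (x_bar - mu0) / (sigma / math.sqrt(n))     # test statistic
p_value = 2 * (1 - normal_cdf(abs(z)))         # two-tailed p-value
alpha = 0.05

if alpha > p_value:
    decision = "reject H0"          # sample result is significant
else:
    decision = "do not reject H0"   # insufficient evidence against H0
print(z, p_value, decision)
```

Here z = 2.4, so the p-value is well below 0.05 and we reject H₀.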


Hypothesis Testing with Two Samples


To compare two means or two proportions, you work with two groups. The groups are classified either as independent or matched pairs. Independent groups consist of two samples that are independent, that is, sample values from one population are not related in any way to sample values selected from the other population. Matched pairs consist of two samples that are dependent. The parameter tested using matched pairs is the population mean. The parameters tested using independent groups are either population means or population proportions.
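The two designs call for different tests; a sketch with made-up data, using scipy.stats for both:

```python
# Contrast of the two designs: an independent-groups t-test versus a
# matched-pairs test (a paired t-test on the differences). Data are
# hypothetical.
from scipy import stats

# Independent groups: the two samples are unrelated.
group_a = [85, 90, 88, 75, 78, 94, 98, 79]
group_b = [91, 77, 84, 80, 88, 90, 92, 86]
t_ind, p_ind = stats.ttest_ind(group_a, group_b, equal_var=False)

# Matched pairs: each value in one sample is paired with one in the
# other (e.g. the same subject measured before and after).
before = [200, 195, 210, 188, 205]
after = [192, 196, 200, 185, 199]
t_rel, p_rel = stats.ttest_rel(before, after)

print(p_ind, p_rel)
```

Using the independent-groups test on paired data (or vice versa) gives the wrong answer because it models the dependence structure incorrectly.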

There is more here. I just don't feel like reading about this now.


The Chi-Square Distribution


The three major applications of the chi-square distribution:

  • the goodness-of-fit test, which determines if data fit a particular distribution, such as in the lottery example
  • the test of independence, which determines if events are independent, such as in the movie example
  • the test of a single variance, which tests variability, such as in the coffee example

Comparison of the Chi-Square Tests

You have seen the χ² test statistic used in three different circumstances. The following bulleted list is a summary that will help you decide which test is the appropriate one to use.

  • Goodness-of-Fit: Use the goodness-of-fit test to decide whether a population with an unknown distribution "fits" a known distribution. In this case there will be a single qualitative survey question or a single outcome of an experiment from a single population. Goodness-of-fit is typically used to see if the population is uniform (all outcomes occur with equal frequency), the population is normal, or the population is the same as another population with a known distribution. The null and alternative hypotheses are:
    • H₀: The population fits the given distribution.
    • Hₐ: The population does not fit the given distribution.
  • Independence: Use the test for independence to decide whether two variables (factors) are independent or dependent. In this case there will be two qualitative survey questions or experiments and a contingency table will be constructed. The goal is to see if the two variables are unrelated (independent) or related (dependent). The null and alternative hypotheses are:
    • H₀: The two variables (factors) are independent.
    • Hₐ: The two variables (factors) are dependent.
  • Homogeneity: Use the test for homogeneity to decide if two populations with unknown distributions have the same distribution as each other. In this case there will be a single qualitative survey question or experiment given to two different populations. The null and alternative hypotheses are:
    • H₀: The two populations follow the same distribution.
    • Hₐ: The two populations have different distributions.
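The goodness-of-fit test above can be sketched with scipy.stats. The observed counts below are made up: 120 hypothetical die rolls tested against a uniform distribution (20 expected per face):

```python
# A sketch of a chi-square goodness-of-fit test: do these hypothetical
# die-roll counts fit a uniform distribution? stats.chisquare defaults
# to equal expected frequencies.
from scipy import stats

observed = [15, 25, 20, 18, 22, 20]   # made-up counts of faces 1..6
chi2, p = stats.chisquare(observed)   # expected: 120 / 6 = 20 per face
print(chi2, p)
```

Here χ² = 2.9 with 5 degrees of freedom, a large p-value, so we do not reject H₀: the counts are consistent with a fair die.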

Test of Single Variance: To test variability, use the chi-square test of a single variance. The test may be left-, right-, or two-tailed, and its hypotheses are always expressed in terms of the variance (or standard deviation).
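The single-variance test statistic is (n − 1)s²/σ₀² with n − 1 degrees of freedom. A sketch with assumed numbers (n = 25 observations, sample standard deviation s = 4.5, hypothesized σ₀ = 5):

```python
# A sketch of a chi-square test of a single variance. The summary
# numbers are hypothetical; here we test the claim sigma < 5, a
# left-tailed test.
from scipy import stats

n, s, sigma0 = 25, 4.5, 5.0
chi2 = (n - 1) * s**2 / sigma0**2        # test statistic, df = n - 1
p_left = stats.chi2.cdf(chi2, df=n - 1)  # left-tailed p-value
print(chi2, p_left)
```

With a right-tailed claim (σ > σ₀) the p-value would instead be the upper tail area, `stats.chi2.sf(chi2, df=n - 1)`.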


Linear Regression and Correlation


The type of data described in the examples is bivariate data - bi for two variables. In reality, statisticians use multivariate data, meaning many variables.

A scatter plot shows the direction of a relationship between the variables. You can determine the strength of the relationship by looking at the scatter plot and seeing how close the points are to a line, a power function, an exponential function, or to some other type of function.
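Beyond eyeballing the scatter plot, one numeric summary of the strength and direction of a linear relationship is the sample correlation coefficient r; a quick sketch with illustrative data:

```python
# The correlation coefficient r measures the strength and direction of
# a linear relationship: r near +1 or -1 is strong, near 0 is weak.
# Data below are made up and roughly linear.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])  # roughly y = 2x

r = np.corrcoef(x, y)[0, 1]  # off-diagonal entry of the 2x2 matrix
print(r)
```

Note that r only measures linear association; a strong curved (e.g. exponential) relationship can still have an unimpressive r.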

The rest of this chapter goes over regression, which I have already studied many times.


F Distribution and One-Way ANOVA


For hypothesis tests comparing averages between more than two groups, statisticians have developed a method called Analysis of Variance (ANOVA).

One-Way ANOVA

The purpose of a one-way ANOVA test is to determine the existence of a statistically significant difference among several group means. The test actually uses variance to help determine if the means are equal or not. In order to perform a one-way ANOVA test, there are five basic assumptions to be fulfilled:

  1. Each population from which a sample is taken is assumed to be normal.
  2. All samples are randomly selected and independent.
  3. The populations are assumed to have equal standard deviations (or variances).
  4. The factor is a categorical variable.
  5. The response is a numerical variable.

The distribution used for the hypothesis test is a new one. It is called the F Distribution named after Sir Ronald Fisher, an English statistician.

Analysis of variance compares the means of a response variable for several groups. ANOVA compares the variation within each group to the variation of the mean of each group. The ratio of these two is the F statistic from an F distribution with (number of groups – 1) as the numerator degrees of freedom and (number of observations – number of groups) as the denominator degrees of freedom. These statistics are summarized in the ANOVA table.

The graph of the F distribution is always positive and skewed right, though the shape can be mounded or exponential depending on the combination of numerator and denominator degrees of freedom. The F statistic is the ratio of a measure of the variation in the group means to a similar measure of the variation within the groups. If the null hypothesis is correct, then the numerator should be small compared to the denominator. A small F statistic will result, and the area under the F curve to the right will be large, representing a large p-value. When the null hypothesis of equal group means is incorrect, then the numerator should be large compared to the denominator, giving a large F statistic and a small area (small p-value) to the right of the statistic under the F curve.
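The whole procedure can be sketched with scipy.stats.f_oneway, which computes the F statistic and p-value directly. The three groups below are hypothetical:

```python
# A sketch of a one-way ANOVA: the F statistic is the ratio of
# between-group variation to within-group variation. Group data are
# made up for illustration.
from scipy import stats

group1 = [51, 45, 33, 45, 67]
group2 = [23, 43, 23, 43, 45]
group3 = [56, 76, 74, 87, 56]

f_stat, p_value = stats.f_oneway(group1, group2, group3)
# Numerator df = 3 groups - 1 = 2; denominator df = 15 obs - 3 = 12.
print(f_stat, p_value)
```

A large F (and small p-value) suggests the group means are not all equal; an F near 1 is consistent with the null hypothesis of equal means.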
