Regression
I want to go through the Wikipedia series on Machine Learning and Data mining. Data mining is the process of extracting and discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems.
In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships between a dependent variable (often called the outcome or response variable, or a label in machine learning parlance) and one or more error-free independent variables (often called regressors, predictors, covariates, explanatory variables, or features). The most common form of regression analysis is linear regression, in which one finds the line (or a more complex linear combination) that most closely fits the data according to a specific mathematical criterion.
For specific mathematical reasons, this allows the researcher to estimate the conditional expectation (or population average value) of the dependent variable when the independent variables take on a given set of values. Less common forms of regression use slightly different procedures to estimate alternative location parameters or estimate the conditional expectation across a broader collection of non-linear models.
Regression is used for prediction and forecasting and to infer causal relationships between independent and dependent variables. The earliest form of regression appeared in Isaac Newton's work on equinoxes in 1700. The method of least squares was published by Legendre in 1805 and by Gauss in 1809.
In practice, researchers first select a model they would like to estimate and then use their chosen method to estimate the parameters of that model. Regression models involve the following components:
- The unknown parameters, often denoted as a scalar or vector $\beta$
- The independent variables, which are observed in data and are often denoted as a vector $X_i$ (where $i$ denotes a row of data)
- The dependent variable, which is observed in data and often denoted using the scalar $Y_i$
- The error terms, which are not directly observed in data and are often denoted using the scalar $e_i$

Most regression models propose that $Y_i$ is a function of $X_i$ and $\beta$, with $e_i$ representing an additive error term that may stand in for un-modeled determinants of $Y_i$ or random statistical noise:

$$Y_i = f(X_i, \beta) + e_i$$
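To make these components concrete, here is a minimal sketch in Python assuming a hypothetical linear choice of $f$, so that $f(X_i, \beta) = \beta_0 + \beta_1 X_i$; the parameter values, noise scale, and use of NumPy are illustrative assumptions on my part, not anything specified above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" parameters beta = (beta_0, beta_1); values are illustrative.
beta_true = np.array([2.0, 0.5])

def f(x, beta):
    """One linear choice of f(X_i, beta): beta_0 + beta_1 * X_i."""
    return beta[0] + beta[1] * x

x = rng.uniform(0, 10, size=100)        # independent variable X_i, observed
e = rng.normal(0.0, 1.0, size=100)      # error term e_i, not directly observed
y = f(x, beta_true) + e                 # dependent variable Y_i = f(X_i, beta) + e_i
```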
The researcher's goal is to estimate the function $f(X_i, \beta)$ that most closely fits the data. The form of the function $f$ must first be specified. Once researchers determine their preferred statistical model, different forms of regression analysis provide tools to estimate the parameters $\beta$. For example, least squares finds the value of $\beta$ that minimizes the sum of squared errors $\sum_i \left(Y_i - f(X_i, \beta)\right)^2$. There must be sufficient data to estimate a regression model. To estimate a least squares model with $k$ distinct parameters, one must have $N \geq k$ distinct data points.
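As a follow-up to the sketch above, here is a hedged illustration of least squares estimation on the same kind of simulated data, using NumPy's `np.linalg.lstsq`; the data-generating values are again assumed purely for demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)                       # observed X_i
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, size=100)     # simulated Y_i

# Design matrix with an intercept column: k = 2 parameters, so N >= 2 points needed.
X = np.column_stack([np.ones_like(x), x])

# Least squares picks beta_hat to minimize sum_i (Y_i - f(X_i, beta))^2.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("estimated beta:", beta_hat)                     # should land near [2.0, 0.5]
```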
By itself, regression is simply a calculation using the data. In order to interpret the output of a regression as a meaningful statistical quantity that measures real-world relationships, researchers often rely on a number of classical assumptions (informally checked in the sketch after this list):
- The sample is representative of the population at large
- The independent variables are measured with no error
- Deviations from the model have an expected value of zero, conditional on covariates
- The variance of the residuals is constant across observations
- The residuals are uncorrelated with one another. Mathematically, the variance-covariance matrix of the errors is diagonal.
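For illustration only, the sketch below computes residuals from a least squares fit on simulated data and informally eyeballs two of the assumptions above, zero conditional mean and constant variance; the data, the median split, and the diagnostics themselves are assumptions of the sketch, not part of the article:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, size=100)     # same simulated data as above

X = np.column_stack([np.ones_like(x), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat

# Zero conditional mean: residuals should average to roughly zero.
print("mean residual:", residuals.mean())

# Constant variance: compare residual spread on the lower vs. upper half of x.
low = x < np.median(x)
print("residual std (low x): ", residuals[low].std())
print("residual std (high x):", residuals[~low].std())
```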