Regression
I want to go through the Wikipedia series on Machine Learning and Data mining. Data mining is the process of extracting and discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems.
In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships between a dependent variable (often called the outcome or response variable, or a label in machine learning parlance) and one or more error-free independent variables (often called regressors, predictors, covariates, explanatory variables, or features). The most common form of regression analysis is linear regression, in which one finds the line (or a more complex linear combination) that most closely fits the data according to a specific mathematical criterion.
For specific mathematical reasons, this allows the researcher to estimate the conditional expectation (or population average value) of the dependent variable when the independent variables take on a given set of values. Less common forms of regression use slightly different procedures to estimate alternative location parameters or estimate the conditional expectation across a broader collection of non-linear models.
Regression is used for prediction and forecasting and to infer causal relationships between independent and dependent variables. The earliest form of regression appeared in Isaac Newton's work on equinoxes in 1700. The method of least squares was published by Legendre in 1805 and by Gauss in 1809.
In practice, researchers first select a model they would like to estimate and then use their chosen method to estimate the parameters of that model. Regression models involve the following components:
- The unknown parameters, often denoted as a scalar or vector $\beta$
- The independent variables, which are observed in data and are often denoted as a vector $X_i$ (where $i$ denotes a row of data)
- The dependent variable, which is observed in data and often denoted using the scalar $Y_i$
- The error terms, which are not directly observed in data and are often denoted using the scalar $e_i$

Most regression models propose that $Y_i$ is a function of $X_i$ and $\beta$, with $e_i$ representing an additive error term that may stand in for un-modeled determinants of $Y_i$ or random statistical noise:

$$Y_i = f(X_i, \beta) + e_i$$
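To make these components concrete, here is a minimal sketch in Python assuming a hypothetical linear choice of $f$, so that $f(X_i, \beta) = \beta_0 + \beta_1 X_i$; the parameter values, noise scale, and use of NumPy are illustrative assumptions on my part, not anything specified above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" parameters beta = (beta_0, beta_1); values are illustrative.
beta_true = np.array([2.0, 0.5])

def f(x, beta):
    """One linear choice of f(X_i, beta): beta_0 + beta_1 * X_i."""
    return beta[0] + beta[1] * x

x = rng.uniform(0, 10, size=100)        # independent variable X_i, observed
e = rng.normal(0.0, 1.0, size=100)      # error term e_i, not directly observed
y = f(x, beta_true) + e                 # dependent variable Y_i = f(X_i, beta) + e_i
```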
The researcher's goal is to estimate the function $f(X_i, \beta)$ that most closely fits the data. The form of the function $f$ must first be specified. Once researchers determine their preferred statistical model, different forms of regression analysis provide tools to estimate the parameters $\beta$. For example, least squares finds the value of $\beta$ that minimizes the sum of squared errors $\sum_i \left(Y_i - f(X_i, \beta)\right)^2$. There must be sufficient data to estimate a regression model. To estimate a least squares model with $k$ distinct parameters, one must have $N \geq k$ distinct data points.
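As a follow-up to the sketch above, here is a hedged illustration of least squares estimation on the same kind of simulated data, using NumPy's `np.linalg.lstsq`; the data-generating values are again assumed purely for demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)                       # observed X_i
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, size=100)     # simulated Y_i

# Design matrix with an intercept column: k = 2 parameters, so N >= 2 points needed.
X = np.column_stack([np.ones_like(x), x])

# Least squares picks beta_hat to minimize sum_i (Y_i - f(X_i, beta))^2.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("estimated beta:", beta_hat)                     # should land near [2.0, 0.5]
```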
By itself, regression is simply a calculation using the data. In order to interpret the output of a regression as a meaningful statistical quantity that measures real-world relationships, researchers often rely on a number of classical assumptions (informally checked in the sketch after this list):
- The sample is representative of the population at large
- The independent variables are measured with no error
- Deviations from the model have an expected value of zero, conditional on covariates
- The variance of the residuals is constant across observations
- The residuals are uncorrelated with one another. Mathematically, the variance-covariance matrix of the errors is diagonal.
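For illustration only, the sketch below computes residuals from a least squares fit on simulated data and informally eyeballs two of the assumptions above, zero conditional mean and constant variance; the data, the median split, and the diagnostics themselves are assumptions of the sketch, not part of the article:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, size=100)     # same simulated data as above

X = np.column_stack([np.ones_like(x), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat

# Zero conditional mean: residuals should average to roughly zero.
print("mean residual:", residuals.mean())

# Constant variance: compare residual spread on the lower vs. upper half of x.
low = x < np.median(x)
print("residual std (low x): ", residuals[low].std())
print("residual std (high x):", residuals[~low].std())
```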