Linear regression is used for predicting a quantitative response Y on the basis of
a single predictor variable or multiple predictor variables X. It is a well-known and
very popular method for modeling the statistical relationship between the response
and the explanatory variables. When modeling this relationship, the values of the
explanatory variables are known and are used to describe the response variable as
well as possible. A proper model that accurately captures the relationship between
these variables can be used to make predictions on data that were not used to build
the model. Simple linear regression is a model that contains only one explanatory
variable. The formula of simple linear regression is:
Y = β₀ + β₁ × X + ε,
where
Y is the response variable,
X represents the predictor variable,
β₀ is called the overall intercept or the overall population mean of the response,
β₁ is the average effect on Y of a one-unit increase in X, holding all other predictors
fixed; it is also called the slope term,
ε is the error term that represents the deviation between the observed and predicted
values.
β₀ and β₁ are the unknown constants that the model has to estimate based on the
data (James et al., 2013).
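To make the notation concrete, data can be simulated from this model. The sketch below (in Python) uses arbitrarily chosen values β₀ = 2, β₁ = 0.5, and σ = 1 purely for illustration; they are not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(42)

# "True" parameters, chosen arbitrarily for this illustration
beta0, beta1, sigma = 2.0, 0.5, 1.0

n = 100
x = rng.uniform(0, 10, size=n)        # predictor values X
eps = rng.normal(0, sigma, size=n)    # error term e with mean 0
y = beta0 + beta1 * x + eps           # response: Y = b0 + b1 * X + e
```

Each simulated response deviates from the line β₀ + β₁ × x only through the error term.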
The goal is to obtain coefficient estimates β̂₀ and β̂₁ such that the linear model
fits the available data well, so that ŷᵢ ≈ β̂₀ + β̂₁ × xᵢ for i = 1, . . . , n, where ŷᵢ is
the prediction for the i-th observation. In other words, the aim is to find an
intercept β̂₀ and slope β̂₁ such that the resulting line is as close as possible to the
data points. The most common approach to do this involves minimizing the least
squares criterion (see section 2.3.1).
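For simple linear regression, the least squares estimates have closed forms: β̂₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² and β̂₀ = ȳ − β̂₁x̄. A minimal sketch computing them on simulated data (the data and true parameters here are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=200)  # simulated data

x_bar, y_bar = x.mean(), y.mean()
# Closed-form least squares estimates for the slope and intercept
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar
y_hat = beta0_hat + beta1_hat * x               # fitted values
```

With enough data, the estimates recover the simulated parameters closely.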
Several assumptions are made in linear models: the residuals are independent,
the residuals are normally distributed, the residuals have a mean of 0 at all values of
X, the residuals have constant variance, and the model is linear in the parameters.
When applying linear models it is important to make sure that these assumptions
are met; otherwise the statistical inference based on the results may not be adequate.
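Two of these assumptions (zero-mean residuals and constant variance) can be checked numerically, as sketched below on simulated data; in practice, diagnostic plots of residuals against fitted values are the usual tool. The split at the median of X is only a rough illustrative check, not a formal test:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=200)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=200)  # simulated data

# Least squares fit and residuals
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
resid = y - (beta0 + beta1 * x)

# With an intercept in the model, residuals average to zero by construction
mean_resid = resid.mean()

# Rough homoscedasticity check: residual spread at low vs. high X
low = resid[x < np.median(x)].std()
high = resid[x >= np.median(x)].std()
```

Similar spreads in the two halves are consistent with constant variance.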
A variance-covariance matrix is a way to better demonstrate homogeneous variance
and independence of the residuals:
V = cov(ε) = σ²I =
⎛ σ²  0  ⋯  0 ⎞
⎜ 0  σ²  ⋯  0 ⎟
⎜ ⋮   ⋮  ⋱  ⋮ ⎟
⎝ 0   0  ⋯  σ² ⎠
In the variance-covariance matrix the diagonal values are the variances, and if they
are all the same then this represents variance homogeneity. The zeros off the
diagonal indicate that there is no correlation, and thus no dependence, between the
residuals.
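Since this matrix is just σ² times the identity, it is straightforward to construct; the small sketch below uses an arbitrary σ² = 2 and n = 4:

```python
import numpy as np

sigma2, n = 2.0, 4
# Variance sigma^2 on the diagonal, zeros (no correlation) elsewhere
V = sigma2 * np.eye(n)
```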
Although the standard linear model assumes independent errors and homoscedastic
(constant) variance, among other assumptions, data cannot always satisfy these
requirements, so more flexible models are also available. In fact, it is possible to add
an appropriate variance or correlation structure to the model in R. Next, we introduce
the mathematical notation of different variance and correlation structures and
how to include them in linear models.