Before we start, you can find a
brief introduction to this predictive analytics article series by clicking HERE.
Part 1 of the series is correlation
& linear regression. When two variables are related to one another they are
said to be correlated. A very simple example in a sporting context is player
anthropometrics. When plotted on a scatter diagram the height and weight of
your squad should correlate: i.e. taller people tend to weigh more.
If you can quantify the strength
and direction of a correlation, you can then start to use your knowledge of one
variable to predict another. For example, if you know the height of your player
you can predict their weight. The process you use to assess the strength of a
relationship is called correlation
analysis; however the process you use to predict one variable from another
is called regression analysis. The
strength of your correlation analysis will determine the accuracy of your
regression analysis.
To demonstrate correlation
analysis and linear regression I will use something a little more interesting
than height and weight. The majority of toplevel teams will be using
technology to monitor their players in training (GPS) and match play (Prozone).
If your aim is to collate all data into one database, data taken from different
sources needs to be accurate and comparable. We know from research (Harley et
al., 2010) that caution must be taken when using these systems interchangeably,
as GPS reported higher values for total distance covered (TD) during match play
when compared to Prozone. One possible solution to allow the use of both data
sets in one database is to create a linear regression model to predict GPS TD
from Prozone match data.
In order to do this we need to
evaluate the strength of the relationship (correlation) between GPS and Prozone
TD data. It is important to firstly visualize the relationship using a scatter
diagram. Below is a screenshot of what this should look like in Excel for Mac
2011. Please note this is simulated data,
generated for the purpose of this article and does not accurately represent the
true relationship between GPS and Prozone data.

Figure 1.1 Visualisation of the correlation between GPS and Prozone total distance match data. 

When visualizing the data like
this it is clear to see a positive, nearly perfect correlation (which is to be
expected). In order to objectify this relationship we can calculate Pearson’s Correlation Coefficient. This
can be easily done using Excel as long as your data is accurately input into a
list, as shown in Figure 1.1.
To do this use the Excel formula =CORREL()
Figure 1.2 shows the correlation
coefficient for our dataset is 0.98,
this can be categorized as nearly perfect (for more information on categorizing
the strength of your correlation see Will Hopkins
– A New View of Statistics).
To calculate this we used the formula =CORREL(A2:A31,B2:B31).

Figure 1.2 Correlation between GPD and Prozone data using =CORREL() function 

So how do we get from a
correlation to a prediction? In simple terms, when conducting a regression you
forecast one variable from another. The equation is as follows:
The regression coefficient, b is obtained as follows:
 Find r_{xy}, the correlation between
the predictor variable x and the
forecast variable y.
 Multiple r_{xy} by the standard deviation
of the forecast variable y (S_{y}).
 Divide the
result by the standard deviation of the predictor variable x (S_{x}).
Besides the regression
coefficient you need to know the intercept,
denoted as a, which you can calculate as follows:
 Multiple the
regression coefficient b, by the mean
of x.
 Subtract the
result of step 1 from the mean of y.
So the equation for the intercept
a is as follows:
a = y  bx
Figure 1.3 shows how this all
comes together in an Excel worksheet.

Figure 1.3 Calculation of Linear Regression 
Having calculated the regression
coefficient and the intercept we can now forecast GPS TD based on Prozone TD.
For Example:
y = bx + a
GPS TD = Regression Coefficient *
Prozone TD + Intercept
GPS TD = E10 * 12000 + E11
This returns 12524. Given the
observed relationship between Prozone and GPS total distance data, we can
predict that 12000 metres derived from Prozone data would equal 12524 metres in
GPS data.
Conducting this analysis manually
helps to understand the underlying equations that are used in a linear
regression and helps to show how the correlation is related to the regression
formula. I would suggest trying to calculate your regression this way at least
once to gain an understanding of these calculations. However, this is a waste
of time for everyday work. Luckily there is an Excel function that can
calculate this for you =LINEST().
Figure 1.4 shows how you can use this function on this data set.

Figure 1.4 Calculation of Linear Regression using =LINEST() function. 
The =LINEST() function can be used to bypass all the intermediate
calculations shown in Figure 1.3. This function calculates the regression
coefficient and intercept directly. They appear in cells D2:E2 of Figure 1.4.
It must be noted the =LINEST() function is an Array Formula this means instead of
simply pressing Enter, you hold down
Ctrl and the Shift keys as you press Enter.
So in the case of our dataset in
Figure 1.4, the sequence of events is as follows:
 Select cells
D2:E2.
 Type the
formula =LINEST(A2:A31,B2:B31).
 Hold down Ctrl and Shift, and press Enter.
You should now see the regression
coefficient appear in cell D2 and the intercept in E2. If you look in the
formula box, you should see the arrayentered formula surrounded by curled
brackets, as shown in cell F2. Excel will automatically create these brackets
when an array formula is entered you do not need to type these.
Just as in Figure 1.3, you can
use the regression coefficient and intercept to create a forecast value based
on a predictor value. The formula to do that is shown in cell F5.
Using this formula in cell E5,
you can enter any Prozone TD data in cell D5 and a Predicted GPS TD will be
returned in cell E5.
Excel does make all of this even
easier, the =TREND() function puts
everything into one step. It calculates the regression coefficient and
intercept and applies them to one or more predictor values and displays a
forecast value.
=TREND() will be
covered in more detail later in this article series. I have not covered it in
this section based on the guidance of Conrad Carlsberg in his book (See Introduction
for a link). Conrad states, “By looking at the equations and some of the other
results that pertain to the equations, you can do a better job of evaluating
whether and how you want to use it.
That wraps up Part 1 of the
Predictive Analytics article series. Part 2 will cover how to apply the
techniques we have learnt to conduct multiple regression analysis, where you
can use more than one predictor variable to return a forecast variable.