This section describes the prediction model in mathematical detail, but is not a proof of its effectiveness. MicroBrothers welcomes any comments you may have about the methods described in this section.
Many statistical problems involve one or more independent variables. These variables are used to build a prediction model that estimates future outcomes from past results. Simple problems with just one independent variable are often solved with a technique called linear regression. More complex problems, however, involve more than one independent variable. If the model is linear in its coefficients and uses more than one variable, it is called a multiple linear regression model. The general form of such a model used to estimate future outcomes can be expressed as:

y = b0 + b1 x1 + b2 x2 + ... + bk xk

where y is the predicted outcome, x1 through xk are the k independent variables, and b0 through bk are the coefficients of the model.
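As an illustration of this general model (not SureLock's actual code), the coefficients b0 through bk can be fit by ordinary least squares; the sample data below is invented:

```python
import numpy as np

def fit_multiple_linear_regression(X, y):
    """Least-squares fit of y = b0 + b1*x1 + ... + bk*xk.
    X is an (n_samples, k) matrix of independent variables;
    returns the coefficients b0 through bk."""
    A = np.column_stack([np.ones(len(y)), X])  # prepend a column of ones for b0
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    return b

# Example with k = 2 and known coefficients b0=1, b1=2, b2=3
X = np.array([[1., 2.], [2., 1.], [3., 4.], [4., 3.], [5., 5.]])
y = 1 + 2 * X[:, 0] + 3 * X[:, 1]
```

With noise-free data generated from known coefficients, the fit recovers them exactly.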
PFA allows you to use 28 separate statistical categories when you create prediction formulas. These 28 categories are independent variables that serve as inputs to the model. Therefore, the value of k in the standard equation is 28.
Each input data measurement is one game. For each of the 28 statistics, i=1..28, the input measurement to the model is:
where Rvo is the visitor team's league rank on offense
Coefficients in the Model
The coefficients in the equation, b1 through b28, are related to the prediction formula weights by a multiplier, and the constant term, b0, is simply the home field advantage value in the formula. The formula's percentage weights, called w1 through w28, are always positive and always sum to 100 (normalized). Another variable, the point spread multiplier (PSM), varies the magnitude of the predicted margin of victory, as shown in the relationship below:
y = b0 + PSM ( w1 x1 + ... + wk xk )
From this relationship we see that the coefficients b1 through b28 are related to the prediction formula’s percentage weights by:
bi = PSM × wi
and thus we can solve for the percentage weights w1 through w28 by first solving the standard regression model for b1 through b28 and then back-substituting into the equation above.
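To make the relationship concrete, the two forms of the prediction equation can be evaluated side by side. This is an illustrative sketch, not SureLock's code, and the sample numbers (home field advantage, PSM, weights, inputs) are invented:

```python
def predict(b0, psm, weights, x):
    """Evaluate y = b0 + PSM * (w1*x1 + ... + wk*xk)."""
    return b0 + psm * sum(w * xi for w, xi in zip(weights, x))

def predict_from_coefficients(b0, coeffs, x):
    """The same prediction via the regression form, using bi = PSM * wi."""
    return b0 + sum(bi * xi for bi, xi in zip(coeffs, x))

# Hypothetical two-statistic formula: weights sum to 100, PSM = 0.05
b0, psm = 3.0, 0.05
weights = [60.0, 40.0]
x = [10.0, 5.0]
coeffs = [psm * w for w in weights]   # back-substitute bi = PSM * wi
```

Both forms give the same projected margin of victory, which is why the weights can be recovered from the regression coefficients.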
Output from the Model
The output from our model, y, is the projected margin of victory of the home team over the visiting team.
To obtain an equation that estimates y, we collect data samples (x) from past games. The next step is solving for the coefficients b0 through b28 to produce an optimum formula.
Solving the Model Using Regression
Solving the linear regression model for all 28 statistics is time-consuming even on the fastest computers and does not necessarily produce a more accurate formula. In addition, as previously noted, the coefficients in a prediction formula are all positive. Therefore, a more effective solution is to use a subset of the 28 statistics, each with a positive coefficient, and then solve the model for that subset.
After we select a sample size, there are still two important questions to answer: how many statistics should we use, and which statistics should we use? Both questions can be answered with a method called forward selection. Forward selection is a stepwise procedure that requires several iterations to find an "optimum subset" of statistics to include in the model. The basic process is as follows:
1) Each statistical category (Points Scored, First Downs, and so on) is solved independently using linear regression. The category which produces the highest correlation to the actual game results is selected.
2) Each of the remaining categories is tried in combination with the category selected in step 1, and linear regression is used to solve the two-variable model. The category that produces the highest correlation to the actual game results and has a positive coefficient is added to the model.
Step 2 is repeated until the number of categories in the model equals the number of statistics that you selected.
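The steps above can be sketched as a greedy loop. This is an illustrative reconstruction of forward selection with a positive-coefficient constraint, not SureLock's implementation; the matrix shapes and sample data are assumptions:

```python
import numpy as np

def forward_select(X, y, n_keep):
    """Greedy forward selection of statistical categories.

    X: (games, categories) matrix of per-game statistics
    y: actual game margins of victory
    n_keep: number of categories the user wants in the formula
    """
    chosen = []
    remaining = list(range(X.shape[1]))
    while len(chosen) < n_keep and remaining:
        best, best_r = None, -np.inf
        for j in remaining:
            cols = chosen + [j]
            A = np.column_stack([np.ones(len(y)), X[:, cols]])
            b, *_ = np.linalg.lstsq(A, y, rcond=None)
            if b[-1] <= 0:
                continue  # the new category must carry a positive coefficient
            r = np.corrcoef(A @ b, y)[0, 1]  # correlation with actual results
            if r > best_r:
                best, best_r = j, r
        if best is None:
            break  # no remaining category has a positive coefficient
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Synthetic example: column 0 and column 2 actually drive the outcome
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = 2 * X[:, 0] + X[:, 2]
```

Each pass refits the full candidate model rather than testing categories in isolation, which is what makes the procedure stepwise.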
Once the statistics are selected, SureLock solves for the coefficients. One problem that occurs when solving for the coefficients is called multi-collinearity. This occurs when two or more statistics are highly correlated to each other. For example, points scored and first downs may correlate closely in a given game. Multi-collinearity causes the standard regression solution to produce coefficients with large variances, and sometimes a solution is not even possible.
Ridge regression is another form of multiple linear regression, but is known as a biased estimation technique. It allows a certain amount of bias in the coefficients in order to reduce their relative magnitudes. Another nice feature of ridge regression is that the solution of the coefficients yields positive values (most of the time), which is exactly what SureLock needs to create prediction formulas.
SureLock uses ridge regression to solve for the coefficients b0 through b28 in the prediction model. Ridge regression is fairly complex, and we lack the space to discuss it here. If you are interested, the topic is covered in several books on statistics and forecasting.
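For readers who want a starting point, the closed-form version of ridge regression is short. This is a generic textbook sketch (penalty parameter `lam` and the choice not to penalize b0 are assumptions), not SureLock's solver:

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression: solve (A'A + lam*I) b = A'y.

    The ridge penalty lam shrinks the coefficient magnitudes, which keeps
    the solution stable when columns of X are highly correlated
    (the multi-collinearity problem described above)."""
    A = np.column_stack([np.ones(len(y)), X])  # prepend the constant term b0
    penalty = np.eye(A.shape[1])
    penalty[0, 0] = 0.0                        # do not penalize b0
    return np.linalg.solve(A.T @ A + lam * penalty, A.T @ y)

# With lam = 0 this reduces to ordinary least squares;
# larger lam values shrink the slope coefficients toward zero.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 3 * X[:, 0] + 1
```

The bias the text mentions is visible directly: as `lam` grows, the fitted slope drops below its least-squares value.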
The Finished Formula
After ridge regression, some coefficients may still have negative values due to precision errors in floating-point arithmetic. These negative coefficients are very close to zero and can be set to zero without degrading accuracy. SureLock then normalizes the set of coefficients so that the sum of the percentage weights equals 100, and sets the point spread multiplier to an optimum value. The result is a formula that Pro Football Analyst can use to accurately forecast future games.
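The clamp-and-normalize step can be sketched in a few lines. This is a hypothetical helper illustrating the arithmetic, not SureLock's code, and it leaves the "optimum" PSM search out of scope:

```python
def finish_formula(b):
    """Clamp near-zero negative coefficients and renormalize.

    b: ridge coefficients b1..b28 (b0, the home field advantage,
    is handled separately). Returns percentage weights summing to 100
    and the implied point spread multiplier."""
    clipped = [max(bi, 0.0) for bi in b]  # negatives are ~0; set them to zero
    total = sum(clipped)
    psm = total / 100.0                   # chosen so the weights sum to 100
    weights = [bi / psm for bi in clipped]
    return weights, psm

# A tiny negative coefficient is zeroed, then the rest renormalize to 100
weights, psm = finish_formula([4.0, -1e-9, 6.0])
```

Note that clamping preserves the relationship bi = PSM x wi for the surviving categories.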