
Curve Fitting & Regression Techniques

Sample Research Questions

How can regression techniques be used to explain the relationship between x and y?
How can the value of y be predicted based on multiple input factors x_1, x_2, x_3?
How can interpolation methods be used to generate the simplest possible curve passing through n data points?
How can [Regression Technique] be used to explain the relationship between [explanatory variable #1, #2, #3...] and [dependent variable]?

Linear Regression

Suppose we have n non-collinear data points, (x_1, y_1), (x_2, y_2), \dots, (x_n, y_n).
A linear regression model, \hat y = ax + b, may be defined to predict a value of y for each corresponding input value of x.
Depending on how the line is drawn, some predicted values \hat{y_i} are less than the actual values y_i.
For instance, \hat{y_1} = ax_1 + b and \hat{y_3} = ax_3 + b underestimate the values of y_1 and y_3, respectively.
In other cases, the predicted value \hat{y_i} is greater than the actual value y_i.
\hat{y_2} = ax_2 + b, for example, overestimates the value of y_2.
Given that infinitely many combinations of coefficients (a, b) are possible, which linear equation \hat y = ax + b would be the curve of best fit, offering the most accurate representation of all individual data points?
Key Questions
Is there a metric that quantifies and compares how closely linear regression models fit a set of data points?
Should the regression model prioritize fitting certain data points more closely than others or should it aim to minimize the overall error across all data points?
Sum of Squared Errors (SSE) is a collective measure of error that penalizes both positive and negative deviations from all data points equally.
Linear regression seeks to find the set of coefficients that minimizes the SSE:
Minimize \sum_{i=1}^n (y_i - \hat{y_i})^2
Investigation:
Find (a, b) such that:
(y_1 - ax_1 - b)^2 + (y_2 - ax_2 - b)^2 + (y_3 - ax_3 - b)^2 is minimized.
The optimal solution, (a^*, b^*), provides the coefficients of the line of best fit that minimizes the SSE:
\hat y = a^*x + b^*
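As a numerical cross-check of this investigation, here is a minimal Python sketch (the three data points are made up purely for illustration) that computes the closed-form least-squares slope and intercept, reports the resulting SSE, and compares the answer with numpy's built-in degree-1 fit.

```python
import numpy as np

# Illustrative data only: three non-collinear points (x_i, y_i)
x = np.array([1.0, 2.0, 4.0])
y = np.array([2.0, 3.0, 7.0])

# Closed-form least-squares estimates for y_hat = a*x + b
a = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - a * x.mean()
sse = np.sum((y - (a * x + b)) ** 2)
print(f"a* = {a:.4f}, b* = {b:.4f}, SSE = {sse:.4f}")

# Cross-check against numpy's built-in degree-1 polynomial fit
a_np, b_np = np.polyfit(x, y, 1)
assert np.allclose([a, b], [a_np, b_np])
```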
Extension:
What loss functions, other than SSE, could be used to penalize deviations of the model from the data points and guide the optimal selection of the model parameters (a, b)?
What if more complex, non-linear regression models like \hat y = ax^2 + bx + c promise a better fit?

Polynomial Interpolation

Extending the previous discussion of linear regression models of the form \hat y = ax + b,
the polynomial interpolation method finds a polynomial function that fits a set of data points exactly.
Polynomial Interpolation Theorem
Given any n+1 points, (x_0, y_0), (x_1, y_1), \dots, (x_n, y_n), in the xy-plane with distinct x-coordinates such that x_0 < x_1 < x_2 < \cdots < x_n, there is a unique polynomial of degree at most n passing through each of these points.
y = a_nx^n + a_{n-1}x^{n-1} + \cdots + a_2x^2 + a_1x + a_0
Substituting the known pairs of x's and y's that the polynomial passes through, the following system of equations must be satisfied:
y_0 = a_nx_0^n + a_{n-1}x_0^{n-1} + \cdots + a_2x_0^2 + a_1x_0 + a_0
y_1 = a_nx_1^n + a_{n-1}x_1^{n-1} + \cdots + a_2x_1^2 + a_1x_1 + a_0
\vdots
y_n = a_nx_n^n + a_{n-1}x_n^{n-1} + \cdots + a_2x_n^2 + a_1x_n + a_0
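Equivalently, the system can be written in matrix form. The coefficient matrix is the Vandermonde matrix of the x-coordinates, which is invertible whenever the x-coordinates are distinct, guaranteeing a unique solution for the coefficients:

\begin{pmatrix} x_0^n & x_0^{n-1} & \cdots & x_0 & 1 \\ x_1^n & x_1^{n-1} & \cdots & x_1 & 1 \\ \vdots & \vdots & & \vdots & \vdots \\ x_n^n & x_n^{n-1} & \cdots & x_n & 1 \end{pmatrix} \begin{pmatrix} a_n \\ a_{n-1} \\ \vdots \\ a_0 \end{pmatrix} = \begin{pmatrix} y_0 \\ y_1 \\ \vdots \\ y_n \end{pmatrix}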
Investigation
Find a cubic polynomial function whose graph passes through these four points:
(1, 2), (2, -3), (3, -5), (4, 0)
Solve the following system of equations to find the coefficients:
p(x) = a_3x^3 + a_2x^2 + a_1x + a_0
p(1) = 2, \quad p(2) = -3, \quad p(3) = -5, \quad p(4) = 0
2 = a_3(1)^3 + a_2(1)^2 + a_1(1) + a_0
-3 = a_3(2)^3 + a_2(2)^2 + a_1(2) + a_0
-5 = a_3(3)^3 + a_2(3)^2 + a_1(3) + a_0
0 = a_3(4)^3 + a_2(4)^2 + a_1(4) + a_0
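This system is small enough to solve by hand, but a quick numerical sketch in Python (using numpy purely as a check) confirms the coefficients and verifies that the resulting cubic passes through all four points:

```python
import numpy as np

# Data points from the investigation: (1, 2), (2, -3), (3, -5), (4, 0)
xs = np.array([1.0, 2.0, 3.0, 4.0])
ys = np.array([2.0, -3.0, -5.0, 0.0])

# Each row of V encodes one equation: a3*x^3 + a2*x^2 + a1*x + a0 = y
V = np.vander(xs, 4)                    # columns: x^3, x^2, x, 1
a3, a2, a1, a0 = np.linalg.solve(V, ys)
print(f"p(x) = {a3:.4f}x^3 + {a2:.4f}x^2 + {a1:.4f}x + {a0:.4f}")

# Verify the cubic interpolates every data point (up to rounding error)
assert np.allclose(np.polyval([a3, a2, a1, a0], xs), ys)
```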
Extension:
Systematic approaches to reliably obtain solutions to large systems of equations with m equations and n unknown variables (in the HL syllabus, we typically worked with m ≤ 3, n ≤ 3)
Lagrange polynomial interpolation
y = \frac{(x - x_1)(x - x_2)\cdots(x - x_n)}{(x_0 - x_1)(x_0 - x_2)\cdots(x_0 - x_n)} y_0 + \frac{(x - x_0)(x - x_2)\cdots(x - x_n)}{(x_1 - x_0)(x_1 - x_2)\cdots(x_1 - x_n)} y_1 + \cdots + \frac{(x - x_0)(x - x_1)\cdots(x - x_{n-1})}{(x_n - x_0)(x_n - x_1)\cdots(x_n - x_{n-1})} y_n
Equivalently, we can use product notation:
y = \sum_{i=0}^{n} \left( y_i \prod_{\substack{j=0 \\ j \ne i}}^{n} \frac{x - x_j}{x_i - x_j} \right)
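The product formula translates almost line for line into code. The sketch below (the function name and test values are illustrative, not from any standard library) evaluates the Lagrange interpolating polynomial at a point and checks it against the cubic investigation above:

```python
def lagrange_interpolate(x, xs, ys):
    """Evaluate the Lagrange interpolating polynomial through (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        # Basis polynomial L_i(x) = product over j != i of (x - x_j)/(x_i - x_j)
        basis = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                basis *= (x - xj) / (xi - xj)
        total += yi * basis
    return total

# Same four points as the cubic investigation; the interpolant must reproduce them
xs, ys = [1, 2, 3, 4], [2, -3, -5, 0]
assert abs(lagrange_interpolate(2.0, xs, ys) - (-3.0)) < 1e-9
```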
Newton polynomial interpolation
\begin{aligned}
y &= b_0 + b_1(x - x_0) + b_2(x - x_0)(x - x_1) + \cdots + b_n(x - x_0)(x - x_1) \cdots (x - x_{n-1}) \\
b_0 &= y_0 \\
b_1 &= [y_0, y_1] = \frac{y_1 - y_0}{x_1 - x_0} \\
b_2 &= [y_0, y_1, y_2] = \frac{ \frac{y_2 - y_1}{x_2 - x_1} - \frac{y_1 - y_0}{x_1 - x_0} }{x_2 - x_0} \\
&\vdots \\
b_n &= [y_0, y_1, \dots, y_n] = \frac{[y_1, y_2, \dots, y_n] - [y_0, y_1, \dots, y_{n-1}]}{x_n - x_0}
\end{aligned}
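The divided differences above can be built up iteratively. Below is a minimal Python sketch (helper names are illustrative) that computes the coefficients b_0, ..., b_n order by order and evaluates the Newton form; it is checked against the same four points used earlier:

```python
import numpy as np

def newton_coefficients(xs, ys):
    """Return the divided-difference coefficients b_0, ..., b_n."""
    xs = np.asarray(xs, dtype=float)
    b = np.array(ys, dtype=float)
    n = len(xs)
    for order in range(1, n):
        # Replace entries with divided differences of one order higher
        b[order:] = (b[order:] - b[order - 1:-1]) / (xs[order:] - xs[:n - order])
    return b

def newton_evaluate(x, xs, b):
    """Evaluate b_0 + b_1(x - x_0) + ... + b_n(x - x_0)...(x - x_{n-1})."""
    result, product = 0.0, 1.0
    for bi, xi in zip(b, xs):
        result += bi * product
        product *= (x - xi)
    return result

xs, ys = [1, 2, 3, 4], [2, -3, -5, 0]
b = newton_coefficients(xs, ys)
assert np.isclose(newton_evaluate(2.0, xs, b), -3.0)
```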
Could we perform regression (curve fitting) with more than one independent variable?

Multiple Linear Regression

For a bivariate data set, the regression equations above modelled y as a function of a single variable x.
X      Y
x_1    y_1
x_2    y_2
x_3    y_3
y = \beta_0 + \beta_1 x, \quad y = \beta_0 + \beta_1 x + \beta_2 x^2
For multivariate data sets, several input variables are mapped to a single output.
X_1        X_2        X_3        Y
x_{1,1}    x_{2,1}    x_{3,1}    y_1
x_{1,2}    x_{2,2}    x_{3,2}    y_2
x_{1,3}    x_{2,3}    x_{3,3}    y_3
\hat y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3, \quad \hat y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1^2 + \beta_5 x_2 x_3
Loss Function (SSE)
As in simple linear regression, define a sum of squared errors (SSE) loss function that penalizes deviations between the actual value y and the predicted value \hat y.
Apply partial differentiation to solve for the optimal selection of coefficients that minimizes the loss function, L(\beta_0, \beta_1, \beta_2, \beta_3).
For \hat y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3,
L(\beta_0, \beta_1, \beta_2, \beta_3) = \sum_{i=1}^n (y_i - \hat{y_i})^2
L(\beta_0, \beta_1, \beta_2, \beta_3) = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2} - \beta_3 x_{i3})^2
Partial Differentiation
\frac{\partial f(x,y,z)}{\partial x} = \lim_{h \to 0} \frac{f(x+h,y,z) - f(x,y,z)}{h}
Just as \frac{df(x)}{dx} = 0 is the condition for optimizing a single-variable function, setting each partial derivative to zero is the analogous optimization technique for multivariate functions.
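The definition can also be explored numerically with a small forward-difference quotient; the toy function below is only an illustration, not part of the regression derivation:

```python
# Toy function f(x, y, z) = x^2 * y + z; its exact partial derivative w.r.t. x is 2xy
def f(x, y, z):
    return x ** 2 * y + z

h = 1e-6
x, y, z = 2.0, 3.0, 1.0
df_dx = (f(x + h, y, z) - f(x, y, z)) / h
print(df_dx)  # approximately 2 * x * y = 12
```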
Investigation
\frac{\partial L(\beta_0,\beta_1,\beta_2,\beta_3)}{\partial \beta_0} = \frac{\partial L(\beta_0,\beta_1,\beta_2,\beta_3)}{\partial \beta_1} = \frac{\partial L(\beta_0,\beta_1,\beta_2,\beta_3)}{\partial \beta_2} = \frac{\partial L(\beta_0,\beta_1,\beta_2,\beta_3)}{\partial \beta_3} = 0
Minimizing the loss function L with respect to \beta_0, \beta_1, \beta_2, \beta_3 gives rise to the least-squares normal equations.
\left. \frac{\partial L}{\partial \beta_0} \right|_{\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_k} = -2 \sum_{i=1}^{n} \left( y_i - \hat{\beta}_0 - \sum_{j=1}^{k} \hat{\beta}_j x_{ij} \right) = 0
\left. \frac{\partial L}{\partial \beta_j} \right|_{\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_k} = -2 \sum_{i=1}^{n} \left( y_i - \hat{\beta}_0 - \sum_{j=1}^{k} \hat{\beta}_j x_{ij} \right) x_{ij} = 0, \quad j = 1, 2, \dots, k
n \hat{\beta}_0 + \hat{\beta}_1 \sum_{i=1}^{n} x_{i1} + \hat{\beta}_2 \sum_{i=1}^{n} x_{i2} + \cdots + \hat{\beta}_k \sum_{i=1}^{n} x_{ik} = \sum_{i=1}^{n} y_i
\hat{\beta}_0 \sum_{i=1}^{n} x_{i1} + \hat{\beta}_1 \sum_{i=1}^{n} x_{i1}^2 + \hat{\beta}_2 \sum_{i=1}^{n} x_{i1} x_{i2} + \cdots + \hat{\beta}_k \sum_{i=1}^{n} x_{i1} x_{ik} = \sum_{i=1}^{n} x_{i1} y_i
\vdots
\hat{\beta}_0 \sum_{i=1}^{n} x_{ik} + \hat{\beta}_1 \sum_{i=1}^{n} x_{ik} x_{i1} + \hat{\beta}_2 \sum_{i=1}^{n} x_{ik} x_{i2} + \cdots + \hat{\beta}_k \sum_{i=1}^{n} x_{ik}^2 = \sum_{i=1}^{n} x_{ik} y_i
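In practice the normal equations are solved with linear algebra. A minimal Python sketch (the data are synthetic, generated only to illustrate) builds the design matrix with an intercept column, solves (A^T A)β = A^T y, and cross-checks the answer against numpy's least-squares routine:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: n observations of three explanatory variables x1, x2, x3
n = 50
X = rng.normal(size=(n, 3))
beta_true = np.array([1.5, -2.0, 0.5, 3.0])           # beta_0, ..., beta_3
y = beta_true[0] + X @ beta_true[1:] + rng.normal(scale=0.1, size=n)

# Design matrix with a leading column of ones for the intercept beta_0
A = np.column_stack([np.ones(n), X])

# Solve the normal equations (A^T A) beta = A^T y
beta_hat = np.linalg.solve(A.T @ A, A.T @ y)
print("estimated coefficients:", np.round(beta_hat, 3))

# Cross-check with numpy's (numerically more stable) least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)
```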
Extension
What are some real-life situations or problems that invite the appropriate use of multiple linear regression?
How can we incorporate non-linear terms into multiple linear regression models?