Intro: Unfortunately, we live in a complex but wonderful world that cannot always be sufficiently modeled by basic linear systems. Naturally, nonlinear models are in widespread use and range from distribution models such as the normal distribution to more convoluted models such as deep neural networks. Fortunately, Kenneth Levenberg and Donald Marquardt devised a fairly robust and simple algorithm for optimizing nonlinear models via nonlinear least squares.
Linear Least Squares
Before jumping into nonlinear least squares, it is prudent to review linear least squares. “Least squares”, as the name implies, is a method for finding the minimal (least) squared error between a model and data. Consequently, it is used to determine model coefficients in over-determined systems. In simpler words, it is a method for curve fitting data. With the assistance of my JavaScript math/plotting library, one can generate some random data.
/* Maths.js code @ https://github.com/natezimmer/maths.js */
var m = Math.random()*20;
var b = Math.random()*20;
var X = Matrixs.range(100).addNoise(0.9); // Creates a 0-99 value array
var Y = X.multiply(m).add(b); // Line equation
Plots.create([Y],{type:'scatter',div:'myPlotData'});

So what is the best fit line for this data? The approach is generic for any linear model. First write out the model.
$$ f(x) = m \cdot x + b $$ Now we take the gradient with respect to the unknown parameters. Note: not with respect to $x$. $x$ is known data and not what we are solving for, so it is treated as constant, whereas the slope $m$ and offset $b$ are the unknowns. This gradient is evaluated for each of the m samples and the rows are collected into an array $A$.
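For the line model, each of those gradient rows is very simple:
$$ \nabla f = \bigg[ \frac{\partial f}{\partial m} , \quad \frac{\partial f}{\partial b} \bigg] = \big[ x , \quad 1 \big] $$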
$$ A = \begin{bmatrix} \underline{ \nabla f} \\ \nabla f_1 \\ \nabla f_2 \\ \cdots \\ \nabla f_m \end{bmatrix} = \begin{bmatrix} \underline{m} & \underline{b} \\ x_1 & 1 \\ x_2 & 1 \\ \cdots & \cdots \\ x_m & 1 \end{bmatrix} $$
Array $A$ has dimensions [m,2] since it has m samples and 2 features. Since our model is linear, we can use this array directly in the over-determined least squares normal equation shown below:
$$ [m ; b] = \big( A^T \cdot A \big)^{-1} \cdot A^T \cdot f(x) $$
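For the curious, here is where that formula comes from. Write the coefficients as a vector $\beta = [m ; b]$ and set the gradient of the squared error with respect to $\beta$ to zero:
$$ \nabla_\beta \, \big\| A \cdot \beta - f(x) \big\|^2 = 2 \cdot A^T \cdot \big( A \cdot \beta - f(x) \big) = 0 \quad \Rightarrow \quad \beta = \big( A^T \cdot A \big)^{-1} \cdot A^T \cdot f(x) $$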
With this mathmagical formula one can easily fit a huge assortment of curves. Let's fit the data generated above.
/* Maths.js code @ https://github.com/natezimmer/maths.js */
var A = X.catHorizontal(Matrixs.ones(X.value.length,1)); // [m,2] array: one gradient row per sample
res = A.transpose().multiply(A).invert().multiply(A.transpose()).multiply(Y).flatten(); // Normal Equation
var X2 = Matrixs.range(100);
Ynew = X2.multiply(res[0]).add(res[1]); // Best fit line from the solved slope and offset
Plots.create([Y],{type:'scatter',div:'myPlotData2'});
Plots.add([Ynew],{type:'lines',div:'myPlotData2'});
A line is pretty basic. The same principle extends to any model that is linear in its coefficients. For a more complex example, I will fit a 4th order polynomial to a noisy 3rd order polynomial:
$$ f(x)= -1.3 \cdot x^3 + 13 \cdot x^2 -5 \cdot x + 30 $$
Recall the gradient for the 4th order polynomial of the form:
$$ f(x) = a \cdot x^4 + b \cdot x^3 + c \cdot x^2 + d \cdot x +e $$ $$ \nabla f = \big[ x^4 , x^3, x^2 , x^1, 1 \big] $$ $$ [a;b;c;d;e] = \big( A^T \cdot A \big)^{-1} \cdot A^T \cdot f(x) $$
Here is a plot of the 3rd order poly as well as the 4th order least squares best fit.
X = Matrixs.range(0,0.1,10);
Y = X.pow(3).multiply(-1.3).add(X.pow(2).multiply(13)).add(X.multiply(-5)).add(30).addNoise(0.3); // create Y
var A = X.pow(4).catHorizontal(X.pow(3)).catHorizontal(X.pow(2)).catHorizontal(X.pow(1)).catHorizontal(Matrixs.ones(X.value.length,1)); // Create A matrix, 5 features
res = A.transpose().multiply(A).invert().multiply(A.transpose()).multiply(Y).flatten(); // Normal Equation
Ynew = X.pow(4).multiply(res[0]).add(X.pow(3).multiply(res[1])).add(X.pow(2).multiply(res[2])).add(X.multiply(res[3])).add(res[4]); // Use least squares result to create new poly
Plots.create([X,Y],{type:'scatter',div:'myPlotData3'});
Plots.add([X,Ynew],{type:'lines',div:'myPlotData3'});
As observed, a higher order function can generally fit unknown lower order data nearly perfectly. I could take linear regression much further, but this was simply a soft introduction to make way for nonlinear least squares.
Nonlinear Least Squares
Consider the following equation:
$$ f(x) = e^{-a \cdot x} + b \cdot \sin(c \cdot x) $$ Consider the equation’s gradient with respect to its parameters $(a,b,c)$:
$$ \nabla f = \bigg[ -x \cdot e^{-a \cdot x},\quad \sin(c \cdot x ),\quad b \cdot x \cdot \cos( c \cdot x) \bigg] $$ Now look at the previous gradients that were calculated. Those gradients involved only the independent variable $x$; this one contains the unknown parameters themselves, so the model is no longer linear in its parameters and the normal equation cannot be applied directly. Furthermore, taking analytic derivatives of more complicated functions may not even be practical. These factors are the core challenge of nonlinear least squares (NLS). An important step in NLS is obtaining the array of partial derivatives of the model, which we will now call $J$, or the Jacobian.
Numerical Jacobian:
The Jacobian is essentially an array of functional gradients and it takes a similar form to the linear matrix A as seen below:
$$ J = \begin{bmatrix} \underline{ \nabla f} \\ \nabla f_1 \\ \nabla f_2 \\ \cdots \\ \nabla f_m \end{bmatrix} = \begin{bmatrix} \underline{\partial f / \partial a} & \underline{ \partial f / \partial b } & \underline{ \partial f / \partial c } & \cdots \\ \partial f_1 / \partial a & \partial f_1 / \partial b & \partial f_1 / \partial c & \cdots \\ \partial f_2 / \partial a & \partial f_2 / \partial b & \partial f_2 / \partial c & \cdots \\ \cdots & \cdots & \cdots & \cdots \\ \partial f_m / \partial a & \partial f_m / \partial b & \partial f_m / \partial c & \cdots \end{bmatrix} $$
Similar to the previous linear matrix A, the Jacobian will have dimensions [samples,features]. While numerical calculation of the Jacobian is fundamentally inefficient, it is a simple starting point for solving new non-linear systems. A numerical derivative is trivial to calculate when one recalls the limit definition of a derivative.
$$ f'(x) = \lim_{\Delta x \to 0} \frac{f(x + \Delta x) -f(x)}{\Delta x} \approx \frac{f(x + 0.000001) -f(x)}{0.000001} $$ This approximation is similar to the one I used in my previous blog post regarding the conversion from Laplace transforms to Z-transforms. Note there are a multitude of methods for taking numerical derivatives, but I find this one is generally ‘good enough’ for generic applications. This method can be extended to a 2D array as seen in my numerical Jacobian code.
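To make that concrete, here is a minimal plain-JavaScript sketch of a forward-difference Jacobian. It is illustrative only, not the actual Maths.js implementation, and the function name and the model(params, x) signature are my own:

// Forward-difference Jacobian sketch -- illustrative, not the Maths.js source.
// model(params, x) returns the model's value at a single input x.
function numericalJacobian(model, params, xs) {
    var h = 0.000001; // perturbation step, matching the approximation above
    var base = xs.map(function (x) { return model(params, x); }); // f(p) at every sample
    return xs.map(function (x, i) { // one row per sample ...
        return params.map(function (_, j) { // ... one column per parameter
            var perturbed = params.slice();
            perturbed[j] += h;
            return (model(perturbed, x) - base[i]) / h; // (f(p + h) - f(p)) / h
        });
    });
}

Each row corresponds to one sample and each column to one parameter, matching the [samples, features] layout described above.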
Pseudo Inverse:
Another mathematical construct highly recommended for nonlinear least squares is the pseudo inverse. The pseudo inverse provides an optimal compromise for a rank deficient matrix. Through traditional matrix inversion methods, an inverse cannot be found if the matrix's determinant is 0. Singular (or nearly singular) matrices are often encountered in NLS, and consequently having the pseudo inverse is crucial. The pseudo inverse can be calculated with the assistance of Singular Value Decomposition (SVD). SVD decomposes a matrix X into three parts such that X = UΣV^T. Two of these parts, U and V, are orthogonal, and Σ is a diagonal matrix of singular values. By plugging this into the normal equation and applying matrix simplification, we arrive at a simple result:
$$ X = U \cdot \Sigma \cdot V^T $$ $$ (X^T \cdot X)^{-1} \cdot X^T \cdot Y = $$ $$ \bigg( (U \cdot \Sigma \cdot V^T)^T \cdot (U \cdot \Sigma \cdot V^T) \bigg)^{-1}\cdot (U \cdot \Sigma \cdot V^T)^T \cdot Y = $$ $$ V \cdot \Sigma^{-1} \cdot U^T \cdot Y $$ (When $\Sigma$ contains zero singular values, only the nonzero entries are inverted, which is what makes this a pseudo inverse rather than a true inverse.)
Here is an example of computing the pseudo inverse in Maths.js.
var A = Matrixs.make([[1,2,3],[4,5,6],[7,8,9]]);
A.pinv().print();
With this we can obtain an approximate inverse of a previously non-invertible singular matrix:
$$ A^{+} = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}^{+} \approx \begin{bmatrix} -0.639 & -0.167 & 0.306 \\ -0.056 & -0.000 & 0.056 \\ 0.528 & 0.167 & -0.194 \end{bmatrix} $$
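For readers who want to see the SVD route spelled out, here is a rough sketch using the Numeric.js library from the references, whose numeric.svd returns U, the singular values S, and V. The function name and tolerance are my own choices, not the Maths.js pinv() source; the key detail is that singular values near zero are dropped rather than inverted:

// Pseudo inverse via SVD, sketched with Numeric.js -- illustrative, not the Maths.js pinv() source.
function pinvSketch(A) {
    var svd = numeric.svd(A); // A = U * diag(S) * V^T, with S an array of singular values
    var tol = 1e-10 * svd.S[0]; // treat tiny singular values as zero
    var Sinv = svd.S.map(function (s) { return s > tol ? 1 / s : 0; }); // invert only the nonzero ones
    var VSinv = svd.V.map(function (row) { // V * diag(Sinv)
        return row.map(function (v, j) { return v * Sinv[j]; });
    });
    return numeric.dot(VSinv, numeric.transpose(svd.U)); // V * Sigma^-1 * U^T
}

pinvSketch([[1,2,3],[4,5,6],[7,8,9]]); // should land near the matrix shown above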
Levenberg Marquardt:
With a Jacobian and a reliable matrix inverse, one can begin to implement the Levenberg Marquardt algorithm. This algorithm is a hybrid between gradient descent and Gauss-Newton. It is recommended one knows both, but in short summary: gradient descent tends to move too slowly, and Gauss-Newton can often move too aggressively. Both have convergence problems, and so an algorithm was developed as a hybrid between the two methods. The first new vector to define is the residual vector $r$, or more practically speaking, the error vector.
$$ r(x) = \hat{y}(x) - y(x) $$ Note that $ \hat{y}(x) $ is our estimated model of $ y(x) $. Naturally, the goal is to minimize the RMS value of the error (residual). Another new quantity to define is an approximation of the Hessian, the matrix of second derivatives of the squared-error objective. This matrix is often approximated as the following:
$$ H(x) \approx J^T \cdot J $$
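To see why this is reasonable, differentiate the squared-error objective twice: the true Hessian contains a second term involving the residuals themselves, and it is this term that Gauss-Newton and LMA drop, which is a good approximation when the residuals are small or the model is nearly linear:
$$ \nabla^2 \bigg( \frac{1}{2} \| r \|^2 \bigg) = J^T \cdot J + \sum_{i=1}^{m} r_i \cdot \nabla^2 r_i \approx J^T \cdot J $$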
The near-final form of the Levenberg step $p$ is the following: $$ p = (J^T \cdot J + \lambda I)^{-1} \cdot J^T \cdot r $$ Here $p$ is the step that the parameters will take upon completion of the calculation; this is consequently an iterative process, as are most nonlinear solvers. Secondly, $ \lambda $ is the damping parameter; sometimes LMA is called a damped Gauss-Newton method. When $ \lambda $ is large, each step behaves like a small gradient descent step. When $ \lambda $ is small, each step is nearly equivalent to a Gauss-Newton step. Note, $ \lambda $ should be updated every iteration based upon whether $ r_{k+1} > r_{k}$. If that is true, the error increased, so the previous step should be reversed and $ \lambda $ should be increased by a factor of 10 (commonly). The following is pseudo code for the algorithm:
$$ \text{loop } k \text{ iterations:} $$ $$ p_{k+1} = \big(J^T_k \cdot J_k + \lambda_k I\big)^{+} \cdot J^T_k \cdot r_k , \qquad \text{where } (\cdot)^{+} = V \cdot \Sigma^{-1} \cdot U^T $$ $$ p_{k+1} = (r_{k+1} > r_{k}) \quad ? \quad p_{k} : p_{k+1} $$ $$ \lambda = (r_{k+1} > r_{k}) \quad ? \quad \lambda \cdot 10 : \lambda \cdot 0.1 $$ $$ x = x - p_{k+1} $$
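To make the pseudo code concrete, here is a rough plain-JavaScript sketch of the loop. It is not the Maths.js Solvers.levenbergMarquardt source: it reuses the numericalJacobian helper sketched earlier, leans on Numeric.js (linked in the references) for the matrix algebra, solves the damped system directly rather than forming the pseudo inverse, and the starting $ \lambda $ and the factor of 10 are just common defaults:

// Rough Levenberg Marquardt loop using Numeric.js for the matrix algebra -- illustrative only.
// Assumes the numericalJacobian(model, params, xs) helper sketched earlier in this post.
function levenbergMarquardtSketch(model, params, xs, ys, iterations) {
    var lambda = 0.001; // damping parameter (a common starting default)
    var p = params.slice();
    function residual(q) { return xs.map(function (x, i) { return model(q, x) - ys[i]; }); }
    function sumSq(r) { return r.reduce(function (s, v) { return s + v * v; }, 0); }
    var r = residual(p);
    var err = sumSq(r);
    for (var k = 0; k < iterations; k++) {
        var J = numericalJacobian(model, p, xs);
        var JT = numeric.transpose(J);
        var H = numeric.dot(JT, J); // approximate Hessian J^T * J
        for (var i = 0; i < H.length; i++) { H[i][i] += lambda; } // add lambda * I to the diagonal
        var step = numeric.solve(H, numeric.dot(JT, r)); // (J^T J + lambda I)^-1 * J^T * r
        var trial = p.map(function (v, i) { return v - step[i]; });
        var rTrial = residual(trial);
        var errTrial = sumSq(rTrial);
        if (errTrial < err) { // error dropped: accept the step and trust Gauss-Newton more
            p = trial; r = rTrial; err = errTrial; lambda = lambda * 0.1;
        } else { // error grew: reject the step and damp harder (more like gradient descent)
            lambda = lambda * 10;
        }
    }
    return p;
}

A call would look something like levenbergMarquardtSketch(function(p, x){ return Math.exp(-p[0]*x) + p[1]*Math.sin(p[2]*x); }, [-1, 2, 1], xValues, yValues, 100); the Maths.js solver used below wraps the same idea behind Solvers.levenbergMarquardt.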
My JavaScript source for this can be found here. It is also running in your browser right now ;). Next, this method will be applied to the nonlinear function previously shown.
function nls(param){ this.param = param; } // Create object holding the model parameters
nls.prototype.fnc = function(x) // Create object's function
{
    var X = Matrixs.make(x);
    var a = this.param[0];
    var b = this.param[1];
    var c = this.param[2];
    var Y = X.multiply(a).apply(Math.exp).add(X.multiply(c).apply(Math.cos).multiply(b));
    return Y.value;
}
var X = Matrixs.range(-5,0.05,5);
var Y = Matrixs.make((new nls([-0.5,4,1.5])).fnc(X)).addNoise(0.3); // Noisy 'truth' data from known parameters
var guess = [-1,2,1];
var testModel = new nls(guess);
var inputObj = {input: X.value, output: Y.value};
var resultObj = Solvers.levenbergMarquardt(inputObj,testModel);
Plots.create([X,Y],{type:"scatter", size:10,div:'myPlotData5'});
Plots.add([X,Matrixs.make(testModel.fnc(X.value))],{type:"line",div:'myPlotData5'});

Note that in the code above, a 'guess' was required. This is a fundamental weakness of the Levenberg Marquardt algorithm: a guess, and a reasonably good one, is required, or else the solver will get stuck in a local minimum.
Solving Neural Networks
While LMA struggles with initial guesses, models whose inputs and outputs can be pre-conditioned are ideal for it. Neural networks, for example, generally constrain their inputs and outputs, which makes initial guesses far less of a problem. Consequently, LMA becomes a rapid batch solver for smaller scale neural networks. Here is a classic example:
$$ f = a \quad \text{xor} \quad b $$ Truth table: $$ \begin{bmatrix} \underline{ a } & \underline{ b } & | &\underline{ f } \\ 0 & 0 & | & 0 \\ 0 & 1 & | & 1 \\ 1 & 0 & | & 1 \\ 1 & 1 & | & 0 \end{bmatrix} $$
var xx = new Models.neuralNet();
xx.setInputNumber(2);
xx.setLayerSizes([4]); // Hidden layer size(s)
xx.setOutputNumber(1);
xx.init();
var inputs = [[0,0],[0,1],[1,0],[1,1]];
var outputs = [[0],[1],[1],[0]];
var inputObjs = {input:inputs,output:outputs};
var resultObj = Solvers.levenbergMarquardt(inputObjs,xx);
A = Matrixs.make(xx.fnc(inputs)).round(1); // Rounded network outputs should reproduce the truth table

Visualization of neural networks is still a feature waiting to be written in my library. Anyhow, that's all for now folks.
References:
Imagingshop LMA
JS Numeric’s SVD
Wiki SVD
Wiki LMA
Matlab LSQ