
A Brief Review of Supervised Learning

There are a number of algorithms that are typically used for system identification, adaptive control, adaptive signal processing, and machine learning. These algorithms all have particular similarities and differences, but they all need to process some type of experimental data, and how we collect and process that data determines the most suitable algorithm to use. In adaptive control, there is a device referred to as the self-tuning regulator. In this case, the algorithm measures the states as outputs, estimates the model parameters, and outputs the control signals. In reinforcement learning, the algorithm processes rewards, estimates value functions, and outputs actions. Although one may refer to the recursive least squares (RLS) algorithm in the self-tuning regulator as a supervised learning algorithm and to reinforcement learning as an unsupervised learning algorithm, the two are in fact very similar.

1.1 Least Squares Estimates

The least squares (LS) algorithm is a well-known and robust algorithm for fitting experimental data to a model. The first step is for the user to define a mathematical structure or model that he/she believes will fit the data. The second step is to design an experiment to collect data under suitable conditions. “Suitable conditions” usually means the operating conditions under which the system will typically operate. The next step is to run the estimation algorithm, which can take several forms, and, finally, validate the identified or “learned” model. The LS algorithm is often used to fit the data. Let us look at the case of the classical two-dimensional linear regression fit that we are all familiar with:

$$ y(n) = a\,x(n) + b \qquad (1) $$

This is a simple linear regression model, where the input is the sampled signal $x(n)$ and the output is $y(n)$. The model structure defined is a straight line. Therefore, we are assuming that the data collected will fit a straight line. This can be written in the form

$$ y(n) = \varphi^T(n)\,\theta \qquad (2) $$

where $\varphi(n) = [\,x(n) \;\; 1\,]^T$ and $\theta = [\,a \;\; b\,]^T$. How one chooses $\varphi(n)$ determines the model structure, and this reflects how one believes the data should behave. This is the essence of machine learning, and virtually all university students will at some point learn the basic statistics of linear regression. Behind the computations of the linear regression algorithm is the scalar cost function, given by

$$ V(\hat\theta) = \tfrac{1}{2}\sum_{n=1}^{N}\bigl(y(n) - \varphi^T(n)\,\hat\theta\bigr)^2 \qquad (3) $$

The term $\hat\theta$ is the estimate of the LS parameter $\theta$. The goal is for the estimate $\hat\theta$ to minimize the cost function $V(\hat\theta)$. To find the "optimal" value of the parameter estimate $\hat\theta$, one takes the partial derivative of the cost function $V(\hat\theta)$ with respect to $\hat\theta$ and sets this derivative to zero.

Therefore, one gets

$$ \frac{\partial V(\hat\theta)}{\partial \hat\theta} = -\sum_{n=1}^{N}\varphi(n)\bigl(y(n) - \varphi^T(n)\,\hat\theta\bigr) \qquad (4) $$

Setting this derivative to zero, we get

$$ \sum_{n=1}^{N}\varphi(n)\,y(n) = \sum_{n=1}^{N}\varphi(n)\,\varphi^T(n)\,\hat\theta \qquad (5) $$

Solving for $\hat\theta$, we get the LS solution

$$ \hat\theta = \left[\sum_{n=1}^{N}\varphi(n)\,\varphi^T(n)\right]^{-1}\sum_{n=1}^{N}\varphi(n)\,y(n) \qquad (6) $$

where the inverse, $\bigl[\sum_{n=1}^{N}\varphi(n)\,\varphi^T(n)\bigr]^{-1}$, exists. If the inverse does not exist, then the system is not identifiable. For example, if in the straight-line case one only had a single point, then the inverse would not span the two-dimensional space and it would not exist. Or, for example, if one had exactly the same point over and over again, then the inverse would not exist. One needs at least two independent points to draw a straight line. The matrix $\sum_{n=1}^{N}\varphi(n)\,\varphi^T(n)$ is referred to as the information matrix and is related to how well one can estimate the parameters. The inverse of the information matrix is the covariance matrix, and it is proportional to the variance of the parameter estimates. Both these matrices are positive definite and symmetric. These are very important properties which are used extensively in analyzing the behavior of the algorithm. In the literature, one will often see the covariance matrix referred to as $P$. We can rewrite Eq. (5) in the form

$$ \sum_{n=1}^{N}\varphi(n)\bigl(y(n) - \varphi^T(n)\,\hat\theta\bigr) = 0 \qquad (7) $$

and one can define the prediction errors as

$$ \varepsilon(n) = y(n) - \varphi^T(n)\,\hat\theta \qquad (8) $$

The term within brackets in Eq. (7) is known as the prediction error or, as some people will refer to it, the innovations. The term represents the error in predicting the output of the system. In this case, the output term $y(n)$ is the correct answer, which is what we want to estimate. Since we know the correct answer, this is referred to as supervised learning. Notice that the value of the prediction error times the data vector is equal to zero, as in Eq. (7). We then say that the prediction errors are orthogonal to the data, or that the data sits in the null space of the prediction errors. In simplistic terms, this means that, if one has chosen a good model structure $\varphi(n)$, then the prediction errors should appear as white noise. Always plot the prediction errors as a quick check to see how good your predictor is. If the errors appear to be correlated (i.e., not white noise), then you can improve your model and get a better prediction.
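These calculations are easy to verify numerically. The following is a minimal sketch (in Python with NumPy; the line parameters, sample size, and noise level are made-up values for illustration) that fits the straight-line model of Eq. (1) using the summation form of Eq. (6), forms the information and covariance matrices, and checks the orthogonality condition of Eq. (7).

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up "experiment": samples of a straight line y = a*x + b plus measurement noise.
a_true, b_true = 2.0, -1.0
N = 200
x = rng.uniform(-5.0, 5.0, size=N)
y = a_true * x + b_true + 0.1 * rng.standard_normal(N)

# Regressor phi(n) = [x(n), 1]^T stacked over the N samples.
Phi = np.column_stack([x, np.ones(N)])

# Information matrix and its inverse (the covariance matrix P).
info = Phi.T @ Phi            # sum of phi(n) phi(n)^T
P = np.linalg.inv(info)       # covariance matrix

# LS estimate, Eq. (6): theta_hat = [sum phi phi^T]^{-1} sum phi y
theta_hat = P @ (Phi.T @ y)
print("theta_hat =", theta_hat)    # should be close to [2.0, -1.0]

# Prediction errors, Eq. (8), and orthogonality to the data, Eq. (7).
eps = y - Phi @ theta_hat
print("Phi^T eps =", Phi.T @ eps)  # numerically close to zero
```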

One does not typically write the linear regression in the form of Eq. (2), but will typically add a white noise term, and then the linear regression takes the form

$$ y(n) = \varphi^T(n)\,\theta + e(n) \qquad (9) $$

where $e(n)$ is a white noise term. Equation (9) can represent an infinite number of possible model structures. For example, let us assume that we want to learn the dynamics of a second-order linear system or the parameters of a second-order infinite impulse response (IIR) filter. Then we could choose the second-order model structure given by

$$ y(n) = -a_1\,y(n-1) - a_2\,y(n-2) + b_1\,u(n-1) + b_2\,u(n-2) + e(n) \qquad (10) $$

Then the model structure would be defined in $\varphi(n)$ as

$$ \varphi(n) = [\,-y(n-1) \;\; -y(n-2) \;\; u(n-1) \;\; u(n-2)\,]^T, \qquad \theta = [\,a_1 \;\; a_2 \;\; b_1 \;\; b_2\,]^T \qquad (11) $$
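As a concrete illustration of the "suitable experiment" for this model, the sketch below (Python; the parameter values, noise level, and record length are assumptions made for illustration, not values from the text) simulates the second-order ARX model of Eq. (10) driven by a white noise input. This is the kind of data record one would collect before running the estimator.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed "true" second-order ARX / IIR filter parameters (illustration only).
a1, a2 = -1.5, 0.7        # autoregressive coefficients (stable filter)
b1, b2 = 1.0, 0.5         # input coefficients
N = 500

u = rng.standard_normal(N)          # white noise excitation
e = 0.05 * rng.standard_normal(N)   # white noise disturbance e(n)
y = np.zeros(N)

# Difference equation of Eq. (10):
# y(n) = -a1*y(n-1) - a2*y(n-2) + b1*u(n-1) + b2*u(n-2) + e(n)
for n in range(2, N):
    y[n] = -a1 * y[n - 1] - a2 * y[n - 2] + b1 * u[n - 1] + b2 * u[n - 2] + e[n]
```

A persistently exciting input such as white noise helps ensure that the information matrix is invertible, i.e., that the parameters are identifiable from the data.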

In general, one can write an arbitrary $n$th-order autoregressive exogenous (ARX) model structure as

$$ y(n) = -a_1\,y(n-1) - \cdots - a_{n_a}\,y(n-n_a) + b_1\,u(n-1) + \cdots + b_{n_b}\,u(n-n_b) + e(n) \qquad (12) $$

and $\varphi(n)$ takes the form

$$ \varphi(n) = [\,-y(n-1) \;\; \cdots \;\; -y(n-n_a) \;\; u(n-1) \;\; \cdots \;\; u(n-n_b)\,]^T \qquad (13) $$

One then collects the data from a suitable experiment (easier said than done!) and then computes the parameters using Eq. (6). The vector $\varphi(n)$ can take many different forms; in fact, it can contain nonlinear functions of the data, for example, logarithmic terms or square terms, and it can have different delay terms. To a large degree, one can use one's professional judgment as to what to put into $\varphi(n)$. One will often write the data in matrix form, in which case the matrix $\Phi$ is defined as

$$ \Phi = \begin{bmatrix} \varphi^T(1) \\ \varphi^T(2) \\ \vdots \\ \varphi^T(N) \end{bmatrix} \qquad (14) $$

and the output vector as

$$ Y = [\,y(1) \;\; y(2) \;\; \cdots \;\; y(N)\,]^T \qquad (15) $$

Then one can write the LS estimate as

$$ \hat\theta = \bigl(\Phi^T \Phi\bigr)^{-1}\Phi^T Y \qquad (16) $$

Furthermore, one can write the prediction errors as

$$ E = Y - \Phi\,\hat\theta \qquad (17) $$

We can also write the orthogonality condition as

$$ \Phi^T E = 0 \qquad (18) $$
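Putting the pieces together, here is a minimal sketch (Python; it reuses the made-up second-order ARX parameters from the earlier simulation so that it remains self-contained) that builds $\Phi$ and $Y$ as in Eqs. (14) and (15), computes the estimate of Eq. (16), and checks the orthogonality condition of Eq. (18).

```python
import numpy as np

rng = np.random.default_rng(1)

# Regenerate the assumed second-order ARX data (same illustration as above).
a1, a2, b1, b2 = -1.5, 0.7, 1.0, 0.5
N = 500
u = rng.standard_normal(N)
e = 0.05 * rng.standard_normal(N)
y = np.zeros(N)
for n in range(2, N):
    y[n] = -a1 * y[n - 1] - a2 * y[n - 2] + b1 * u[n - 1] + b2 * u[n - 2] + e[n]

# Build Phi (Eq. 14) and Y (Eq. 15); each row is phi^T(n) = [-y(n-1), -y(n-2), u(n-1), u(n-2)].
Phi = np.array([[-y[n - 1], -y[n - 2], u[n - 1], u[n - 2]] for n in range(2, N)])
Y = y[2:N]

# LS estimate (Eq. 16). lstsq solves the same normal equations but is numerically
# safer than explicitly forming (Phi^T Phi)^{-1}.
theta_hat, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
print("theta_hat =", theta_hat)   # should be close to [a1, a2, b1, b2]

# Prediction errors (Eq. 17) and orthogonality condition (Eq. 18).
E = Y - Phi @ theta_hat
print("Phi^T E =", Phi.T @ E)     # numerically close to zero
```

Plotting $E$ (or its sample autocorrelation) is the quick whiteness check mentioned earlier: correlated residuals suggest that the chosen model structure, for example the orders $n_a$ and $n_b$, should be improved.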

The LS method of parameter identification or machine learning is very well developed, and there are many properties associated with the technique. In fact, much of the work in statistical inference is derived from the few equations described in this section. This is the beginning of many scientific investigations, including work in the social sciences.
