
Machine Learning Notes (Washington University) - Classification Specialization - Week One & Week Two

1. Linear classifier

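A linear classifier uses training data to learn a weight (coefficient) for each word. The score of an input is the weighted sum of its features, and the sign of the score gives the predicted class.

We use gradient ascent to find the model with the highest likelihood.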
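As a minimal sketch of how such a classifier scores a sentence: the word weights below are hypothetical values chosen for illustration (in practice they are learned from training data), and `score` and `predict` are names introduced here, not from the course.

```python
# Hypothetical word weights; a real model would learn these from training data.
weights = {"awesome": 1.5, "good": 1.0, "bad": -1.0, "awful": -2.0}

def score(sentence, weights, intercept=0.0):
    """Add up the weight of every word occurrence; unknown words contribute 0."""
    words = sentence.lower().split()
    return intercept + sum(weights.get(word, 0.0) for word in words)

def predict(sentence, weights):
    """Predict +1 when the score is positive, otherwise -1."""
    return 1 if score(sentence, weights) > 0 else -1

print(score("bad awful", weights))       # -3.0
print(predict("awesome food", weights))  # 1
```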

 

2. Sigmoid function

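How can we map the score, which ranges from -∞ to +∞, to a probability between 0 and 1?

We can use the sigmoid function:

$$\mathrm{sigmoid}(\mathrm{score}) = \frac{1}{1 + e^{-\mathrm{score}}}$$

For a linear classifier the score is $w^\top h(x)$, so

$$P(y = +1 \mid x, w) = \frac{1}{1 + e^{-w^\top h(x)}}$$

This is called a generalized linear model.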
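The mapping from score to probability can be sketched in a few lines of Python. The branch on the sign of the score is an implementation choice made here to avoid overflow in `math.exp` for large negative scores; it is not something the notes prescribe.

```python
import math

def sigmoid(score):
    """Map a real-valued score to a probability between 0 and 1."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    # For negative scores use the equivalent form e^s / (1 + e^s),
    # which never calls exp on a large positive number.
    z = math.exp(score)
    return z / (1.0 + z)

print(sigmoid(0.0))  # 0.5
```

A score of 0 sits on the decision boundary and maps to probability 0.5; large positive scores approach 1 and large negative scores approach 0.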

 

3. Categorical inputs (e.g., country or zip code)

1-hot encoding: one feature per category; exactly one of these features has value 1 at a time, and all the others are 0.

Bag of words: the count of each word is a feature, so different words give different features.
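Both encodings can be sketched as follows. The category list and the example sentence are made up for illustration, and `one_hot` and `bag_of_words` are hypothetical helper names.

```python
from collections import Counter

def one_hot(value, categories):
    """Return a list with 1 at the position of `value` and 0 everywhere else."""
    return [1 if value == category else 0 for category in categories]

def bag_of_words(text):
    """Count how many times each whitespace-separated word occurs in `text`."""
    return Counter(text.lower().split())

countries = ["Brazil", "Chile", "Zimbabwe"]  # hypothetical category list
print(one_hot("Chile", countries))           # [0, 1, 0]

counts = bag_of_words("the sushi was great the service was great")
print(counts["great"])                       # 2
```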

 

4. Multiclass classification

One-versus-all

Train one classifier per category, each separating that category from all the others. The class whose classifier outputs the highest probability wins.
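The prediction step can be sketched like this, assuming each class already has a trained weight vector. The class names, the weights, and the feature vector are hypothetical.

```python
import math

def sigmoid(score):
    # Simple form; adequate for the moderate scores used in this sketch.
    return 1.0 / (1.0 + math.exp(-score))

def predict_one_vs_all(features, class_weights):
    """Return the class whose binary classifier assigns the highest probability."""
    probs = {
        label: sigmoid(sum(w * h for w, h in zip(weights, features)))
        for label, weights in class_weights.items()
    }
    return max(probs, key=probs.get), probs

# Hypothetical weights for three one-versus-all classifiers over h(x) = [1, x1, x2].
class_weights = {
    "triangle": [0.0, 1.0, -1.0],
    "heart": [0.0, -1.0, 1.0],
    "donut": [-1.0, 0.5, 0.5],
}
label, probs = predict_one_vs_all([1.0, 2.0, 0.5], class_weights)
print(label)  # triangle
```

Note that the per-class probabilities come from separate binary models, so they need not sum to 1.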

 

5. Likelihood function

We need a likelihood function to fit the model to the training data. For positive data points we want the predicted probability of the positive class to be 1, and for negative data points we want it to be 0. No weights can achieve that perfectly, so the likelihood function measures the quality of fit of a model with coefficients w, and the highest likelihood defines the best model. That is where gradient ascent comes in. The likelihood function is defined as:

$$\ell(w) = \prod_{i=1}^{N} P(y_i \mid x_i, w)$$

Now the task is to choose w to make the likelihood function as large as possible.

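We use gradient ascent to find the maximum. In practice we maximize the log-likelihood $\ell\ell(w) = \ln \ell(w)$, which has the same maximizer, and repeat the update

$$w^{(t+1)} = w^{(t)} + \eta \, \nabla \ell\ell(w^{(t)})$$

where $\eta$ is the step size. We stop when the magnitude of the gradient is small enough:

$$\lVert \nabla \ell\ell(w^{(t)}) \rVert < \epsilon$$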

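The j-th component of the gradient of the log-likelihood $\ell\ell(w)$ is:

$$\frac{\partial \ell\ell(w)}{\partial w_j} = \sum_{i=1}^{N} h_j(x_i) \left( \mathbb{1}[y_i = +1] - P(y = +1 \mid x_i, w) \right)$$

where $\mathbb{1}[y_i = +1]$ is the indicator function (1 if $y_i = +1$, otherwise 0) and $h_j(x_i)$ is the feature value.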
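Putting the update rule, the stopping test, and the gradient together gives a loop like the one below. This is a minimal sketch under stated assumptions: labels are coded as +1 and -1, the first feature is a constant 1 for the intercept, and the step size, tolerance, iteration cap, and toy data are all hypothetical values picked for illustration.

```python
import math

def sigmoid(score):
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    z = math.exp(score)
    return z / (1.0 + z)

def gradient(H, y, w):
    """Log-likelihood gradient: sum_i h_j(x_i) * (1[y_i = +1] - P(y = +1 | x_i, w))."""
    grad = [0.0] * len(w)
    for h, label in zip(H, y):
        p = sigmoid(sum(wj * hj for wj, hj in zip(w, h)))
        error = (1.0 if label == 1 else 0.0) - p
        for j, hj in enumerate(h):
            grad[j] += hj * error
    return grad

def gradient_ascent(H, y, step_size=0.1, tolerance=1e-3, max_iter=10000):
    """Start from w = 0 and climb until the gradient norm drops below `tolerance`."""
    w = [0.0] * len(H[0])
    for _ in range(max_iter):
        g = gradient(H, y, w)
        if math.sqrt(sum(gj * gj for gj in g)) < tolerance:
            break
        w = [wj + step_size * gj for wj, gj in zip(w, g)]
    return w

# Toy data with h(x) = [1, x]; the labels overlap, so the classes are not separable.
H = [[1.0, -2.0], [1.0, -1.0], [1.0, -0.5], [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]]
y = [-1, -1, 1, -1, 1, 1]
w = gradient_ascent(H, y)
print(w)
```

The toy data is deliberately not linearly separable, so the maximum is attained at finite coefficients and the stopping test can be met.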

 

6. Learning curve

The learning curve plots the log-likelihood over all data points against the number of iterations. We use this curve to choose the step size. If the step size is too large, it can cause divergence or wild oscillations. If the step size is too small, convergence is too slow. An advanced tip is to try a step size that decreases with the iterations.
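The quantity on the vertical axis of the learning curve, and one possible decreasing step size, can be sketched as follows. Labels are assumed to be +1 or -1, and the particular decay schedule with its `decay` parameter is a hypothetical choice, not one taken from the course.

```python
import math

def log_likelihood(H, y, w):
    """Sum over data points of ln P(y_i | x_i, w), for labels in {+1, -1}."""
    total = 0.0
    for h, label in zip(H, y):
        score = sum(wj * hj for wj, hj in zip(w, h))
        # ln P(y | x, w) = -ln(1 + exp(-y * score)), rearranged to avoid overflow.
        m = -label * score
        total -= max(m, 0.0) + math.log1p(math.exp(-abs(m)))
    return total

def decaying_step_size(initial_step, iteration, decay=100.0):
    """Hypothetical schedule: the step shrinks as the iteration count grows."""
    return initial_step / (1.0 + iteration / decay)

print(decaying_step_size(0.1, 0))    # 0.1
print(decaying_step_size(0.1, 100))  # 0.05
```

Evaluating `log_likelihood` after every update and plotting the values against the iteration number gives the learning curve; a curve that drops or oscillates suggests the step size is too large.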

 

7. Overfitting

Overfitting is often associated with large coefficient magnitudes and overconfident predictions, because maximizing the likelihood pushes toward a larger w that assigns a higher probability to the training data. For linearly separable data, the coefficients grow to infinity.
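One way to see why the coefficients diverge is the short derivation below, assuming labels are coded as $y_i \in \{+1, -1\}$ and the sigmoid model from Section 2. If some $w$ separates the data, every margin $y_i \, w^\top h(x_i)$ is positive, and scaling $w$ by a constant $c > 0$ gives

```latex
\ell\ell(c\,w) = -\sum_{i=1}^{N} \ln\!\left(1 + e^{-c \, y_i \, w^\top h(x_i)}\right)
```

Each term increases with $c$ and approaches 0 as $c \to \infty$, so the log-likelihood keeps improving as the coefficients grow and no finite $w$ attains the maximum.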

 

8. Regularization

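Total quality = measure of fit - measure of magnitude of coefficients

measure of fit = log of the likelihood function, $\ell\ell(w)$

measure of magnitude of coefficients = $\lVert w \rVert_2^2$ (the squared L2 norm, which penalizes large coefficients)

We choose w to maximize the function below, where $\lambda \ge 0$ controls the strength of the penalty:

$$\ell\ell(w) - \lambda \lVert w \rVert_2^2$$

With step size $\eta$, the gradient ascent update becomes:

$$w_j^{(t+1)} = w_j^{(t)} + \eta \left( \frac{\partial \ell\ell(w^{(t)})}{\partial w_j} - 2 \lambda w_j^{(t)} \right)$$

The extra term $-2 \lambda w_j^{(t)}$ moves each coefficient toward zero.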
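The effect of the penalty can be sketched on a tiny separable dataset. Assumptions here: labels are +1 and -1, the names `regularized_step` and `l2_penalty` are hypothetical, and the sketch penalizes every coefficient including the intercept (in practice the intercept is often left unpenalized).

```python
import math

def sigmoid(score):
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    z = math.exp(score)
    return z / (1.0 + z)

def regularized_step(H, y, w, step_size, l2_penalty):
    """One gradient-ascent step on log-likelihood minus l2_penalty * ||w||^2."""
    new_w = []
    for j, wj in enumerate(w):
        partial = 0.0
        for h, label in zip(H, y):
            p = sigmoid(sum(a * b for a, b in zip(w, h)))
            partial += h[j] * ((1.0 if label == 1 else 0.0) - p)
        # The -2 * l2_penalty * wj term pulls the coefficient toward zero.
        new_w.append(wj + step_size * (partial - 2.0 * l2_penalty * wj))
    return new_w

# Linearly separable toy data with h(x) = [1, x].
H = [[1.0, -1.0], [1.0, 1.0]]
y = [-1, 1]
w_plain, w_reg = [0.0, 0.0], [0.0, 0.0]
for _ in range(500):
    w_plain = regularized_step(H, y, w_plain, 0.1, 0.0)  # no penalty
    w_reg = regularized_step(H, y, w_reg, 0.1, 1.0)      # with penalty
print(w_plain[1], w_reg[1])
```

Without the penalty the slope coefficient keeps growing for as long as the loop runs; with the penalty it settles at a small finite value.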

 
