Machine Learning Notes (Washington University) - Classification Specialization - Week One & Week Two


2024-09-21 09:15:09, 218 views

1. Linear classifier

A linear classifier uses training data to learn a weight (coefficient) for each word. The score of an input is the weighted sum of its features.

We use gradient ascent to find the model with the highest likelihood.

2. Sigmoid function

How can we map a score, which ranges from -∞ to +∞, to a probability between 0 and 1?

We can use the sigmoid function:

$$\mathrm{sigmoid}(\mathrm{score}) = \frac{1}{1 + e^{-\mathrm{score}}}$$

For a linear classifier the score is $w^T h(x_i)$, so

$$P(y = +1 \mid x_i, w) = \frac{1}{1 + e^{-w^T h(x_i)}}$$

This is called a generalized linear model.
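As a minimal sketch of the mapping from score to probability, the snippet below computes the sigmoid of a weighted sum of features. The function names and the example weights and features are assumptions made for illustration; they are not from the course.

```python
import math

def sigmoid(score):
    """Map a real-valued score to a probability in (0, 1)."""
    # Two branches keep math.exp from overflowing for very large |score|.
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    e = math.exp(score)
    return e / (1.0 + e)

def predict_probability(weights, features):
    """P(y = +1 | x, w) for a linear classifier: sigmoid of the weighted sum."""
    score = sum(w * h for w, h in zip(weights, features))
    return sigmoid(score)

# Hypothetical example: a constant feature plus the counts of two words.
weights = [0.0, 1.0, -1.5]
features = [1.0, 2.0, 1.0]
print(predict_probability(weights, features))  # score is 0.5, so the result is above 0.5
```

A score of 0 maps to a probability of exactly 0.5, which is the decision boundary of the classifier.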

3. Categorical inputs (countries or zip codes)

1-hot encoding: create one feature per category; exactly one of these features has value 1 at a time and all the others are 0.

Bag of words: the count of each word is a feature, so different words give different features.
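Both encodings can be sketched in a few lines. The helper names, the category list, and the vocabulary below are hypothetical and only serve as an illustration.

```python
from collections import Counter

def one_hot(value, categories):
    """Return a list with a 1 at the position of `value` and 0 everywhere else."""
    return [1 if value == c else 0 for c in categories]

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in `text` (split on whitespace)."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

countries = ["Argentina", "Brazil", "USA"]
print(one_hot("Brazil", countries))  # [0, 1, 0]

vocabulary = ["awesome", "awful", "sushi"]
print(bag_of_words("The sushi was awesome awesome", vocabulary))  # [2, 0, 1]
```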

4. Multiclass classification

One-versus-all

Train one binary classifier per category (that category against all the others); the class whose classifier outputs the highest probability wins.
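The prediction step of one-versus-all can be sketched as follows, assuming the per-class weight vectors have already been learned. The class labels and weights are made up for the example.

```python
import math

def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))

def one_versus_all_predict(classifiers, features):
    """Pick the class whose binary classifier gives the highest probability.

    `classifiers` maps each class label to the weight list of the binary
    classifier trained for "this class versus everything else".
    """
    probabilities = {
        label: sigmoid(sum(w * h for w, h in zip(weights, features)))
        for label, weights in classifiers.items()
    }
    best_label = max(probabilities, key=probabilities.get)
    return best_label, probabilities

# Hypothetical learned weights for three classes and two features.
classifiers = {
    "triangle": [0.5, -1.0],
    "heart": [-0.2, 2.0],
    "donut": [0.0, 0.1],
}
label, probabilities = one_versus_all_predict(classifiers, [1.0, 1.5])
print(label, probabilities)
```

The per-class probabilities come from independent classifiers, so they do not need to sum to 1.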

5. Likelihood function

We need a likelihood function to train the model. For positive data points we want the predicted probability of +1 to be 1, and for negative data points we want it to be 0, but no weights can achieve that perfectly. The likelihood function therefore measures the quality of fit of a model with coefficients w, and the highest likelihood defines the best model. That is where gradient ascent comes in. The likelihood function is defined as:

$$\ell(w) = \prod_{i=1}^{N} P(y_i \mid x_i, w)$$
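To make the definition concrete, the sketch below multiplies, over all data points, the probability the model assigns to the observed label. The two-point data set and the weight vectors are hypothetical.

```python
import math

def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))

def prob_of_label(weights, features, label):
    """Probability the model assigns to the observed label (+1 or -1)."""
    p_pos = sigmoid(sum(w * h for w, h in zip(weights, features)))
    return p_pos if label == +1 else 1.0 - p_pos

def likelihood(weights, data):
    """Product of the per-point probabilities; `data` is a list of (features, label)."""
    result = 1.0
    for features, label in data:
        result *= prob_of_label(weights, features, label)
    return result

def log_likelihood(weights, data):
    """Sum of log probabilities; it has the same maximizer as the likelihood."""
    return sum(math.log(prob_of_label(weights, f, y)) for f, y in data)

# Hypothetical data: a constant feature plus one real-valued feature.
data = [([1.0, 2.0], +1), ([1.0, -1.0], -1)]
print(likelihood([0.0, 0.0], data))  # every point gets probability 0.5, so 0.25
print(likelihood([0.0, 1.0], data))  # a better fit gives a higher likelihood
```

In practice the log likelihood is preferred because a product of many probabilities underflows quickly, while a sum of logs does not.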

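Now the task is to choose w to make the likelihood function as large as possible. In practice we maximize the log likelihood $\ell\ell(w) = \ln \ell(w)$, which has the same maximizer as the likelihood $\ell(w)$, and we use gradient ascent to find the maximum:

$$w^{(t+1)} = w^{(t)} + \eta \, \nabla \ell\ell(w^{(t)})$$

We stop when the magnitude of the gradient is small enough:

$$\lVert \nabla \ell\ell(w^{(t)}) \rVert < \epsilon$$

The gradient is:

$$\frac{\partial \ell\ell(w)}{\partial w_j} = \sum_{i=1}^{N} h_j(x_i) \left( \mathbb{1}[y_i = +1] - P(y = +1 \mid x_i, w) \right)$$

where η is the step size, $\mathbb{1}[y_i = +1]$ is 1 for positive data points and 0 otherwise, and h_j(x_i) is the feature value.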
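The update rule, stopping criterion, and gradient can be put together in a short sketch. This is an illustration under assumptions, not the course's reference implementation: the function names, the default step size and tolerance, and the small data set (chosen so that it is not linearly separable) are all made up.

```python
import math

def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))

def gradient(weights, data):
    """Gradient of the log likelihood; `data` is a list of (features, label)."""
    grad = [0.0] * len(weights)
    for features, label in data:
        p_pos = sigmoid(sum(w * h for w, h in zip(weights, features)))
        error = (1.0 if label == +1 else 0.0) - p_pos  # indicator minus P(y = +1)
        for j, h_j in enumerate(features):
            grad[j] += h_j * error
    return grad

def gradient_ascent(data, n_features, step_size=0.1, tolerance=1e-3, max_iter=10000):
    """Step uphill until the gradient norm drops below `tolerance`."""
    weights = [0.0] * n_features
    for _ in range(max_iter):
        grad = gradient(weights, data)
        if math.sqrt(sum(g * g for g in grad)) < tolerance:
            break
        weights = [w + step_size * g for w, g in zip(weights, grad)]
    return weights

# Hypothetical, non-separable data: constant feature plus one real feature.
data = [([1.0, -2.0], -1), ([1.0, -1.0], -1), ([1.0, 0.0], +1),
        ([1.0, 1.0], -1), ([1.0, 2.0], +1), ([1.0, 3.0], +1)]
weights = gradient_ascent(data, n_features=2)
print(weights)
```

The `max_iter` cap is a safety net: on linearly separable data the gradient never reaches zero, so the tolerance test alone would not terminate.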

6. Learning curve

The learning curve plots the log likelihood over all data points against the number of iterations, and we use this curve to choose the step size. If the step size is too large, gradient ascent can diverge or oscillate wildly. If the step size is too small, convergence is too slow. An advanced tip is to try a step size that decreases with the iterations.
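One way to produce such a curve is to record the log likelihood after every update. The sketch below does that for a given step size; the `decay` parameter, which gives a step of `step_size / (1 + decay * t)` at iteration t, is just one possible decreasing schedule chosen for illustration, and the data set is hypothetical.

```python
import math

def sigmoid(score):
    """Numerically stable sigmoid, safe even when a large step blows up the weights."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    e = math.exp(score)
    return e / (1.0 + e)

def log_prob(score, label):
    """Stable log P(y = label | x, w) = -log(1 + exp(-label * score))."""
    z = label * score
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dot(weights, features):
    return sum(w * h for w, h in zip(weights, features))

def learning_curve(data, step_size, n_iter, decay=0.0):
    """Log likelihood before the first update and after each of n_iter updates."""
    weights = [0.0] * len(data[0][0])
    curve = [sum(log_prob(dot(weights, h), y) for h, y in data)]
    for t in range(n_iter):
        grad = [0.0] * len(weights)
        for h, y in data:
            error = (1.0 if y == +1 else 0.0) - sigmoid(dot(weights, h))
            for j, h_j in enumerate(h):
                grad[j] += h_j * error
        step = step_size / (1.0 + decay * t)  # constant step when decay is 0
        weights = [w + step * g for w, g in zip(weights, grad)]
        curve.append(sum(log_prob(dot(weights, h), y) for h, y in data))
    return curve

# Hypothetical, non-separable data: constant feature plus one real feature.
data = [([1.0, -2.0], -1), ([1.0, -1.0], -1), ([1.0, 0.0], +1),
        ([1.0, 1.0], -1), ([1.0, 2.0], +1), ([1.0, 3.0], +1)]
small = learning_curve(data, step_size=0.1, n_iter=50)
large = learning_curve(data, step_size=2.0, n_iter=50)
print(small[0], small[-1])  # rises steadily with the small step
print(large[0], large[1])   # the first large step already lowers the log likelihood
```

Plotting `small` and `large` against the iteration number gives the learning curves to compare.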

7. Overfitting

Overfitting is often associated with large coefficient magnitudes and overconfident predictions, because maximum likelihood estimation keeps looking for a larger w that assigns a higher probability to the training data. For linearly separable data, the coefficients grow to infinity.
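The growth of the coefficients on separable data can be observed directly. The sketch below runs plain gradient ascent for more and more iterations on a tiny, hypothetical separable data set and prints the weights together with the most confident predicted probability.

```python
import math

def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))

def fit(data, step_size, n_iter):
    """Plain (unregularized) gradient ascent on the log likelihood."""
    weights = [0.0] * len(data[0][0])
    for _ in range(n_iter):
        grad = [0.0] * len(weights)
        for h, y in data:
            p_pos = sigmoid(sum(w * v for w, v in zip(weights, h)))
            error = (1.0 if y == +1 else 0.0) - p_pos
            for j, h_j in enumerate(h):
                grad[j] += h_j * error
        weights = [w + step_size * g for w, g in zip(weights, grad)]
    return weights

# Hypothetical separable data: negatives have x < 0, positives have x > 0.
separable = [([1.0, -2.0], -1), ([1.0, -1.0], -1),
             ([1.0, 1.0], +1), ([1.0, 2.0], +1)]
for n_iter in (10, 100, 1000, 10000):
    w = fit(separable, step_size=0.1, n_iter=n_iter)
    most_confident = sigmoid(w[0] + w[1] * 2.0)
    print(n_iter, [round(v, 3) for v in w], round(most_confident, 6))
```

The weight on x never stops increasing, and the predicted probability for the point at x = 2 creeps towards 1, which is the overconfidence described above.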

8. Regularization

Total quality = measure of fit - measure of magnitude of coefficients

measure of fit = log of the likelihood function

measure of magnitude of coefficients = squared L2 norm of w, which penalizes large coefficients

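We choose w to maximize the function below, where $\ell\ell(w)$ is the log of the likelihood function and λ controls the strength of the penalty:

$$\ell\ell(w) - \lambda \lVert w \rVert_2^2$$

and the gradient ascent update becomes:

$$w_j^{(t+1)} = w_j^{(t)} + \eta \left( \frac{\partial \ell\ell(w^{(t)})}{\partial w_j} - 2 \lambda w_j^{(t)} \right)$$

The extra term $-2 \lambda w_j$ is negative when $w_j$ is positive and positive when $w_j$ is negative, so it moves every coefficient towards zero.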
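A minimal sketch of gradient ascent with an L2 penalty follows. It assumes every coefficient is penalized, including the constant term; leaving the intercept unpenalized is a common variant. The function name, parameter values, and data set are hypothetical.

```python
import math

def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))

def fit_l2(data, l2_penalty, step_size=0.1, n_iter=500):
    """Gradient ascent on log likelihood minus l2_penalty * ||w||^2."""
    weights = [0.0] * len(data[0][0])
    for _ in range(n_iter):
        grad = [0.0] * len(weights)
        for h, y in data:
            p_pos = sigmoid(sum(w * v for w, v in zip(weights, h)))
            error = (1.0 if y == +1 else 0.0) - p_pos
            for j, h_j in enumerate(h):
                grad[j] += h_j * error
        # The -2 * l2_penalty * w term pulls each coefficient towards zero.
        weights = [w + step_size * (g - 2.0 * l2_penalty * w)
                   for w, g in zip(weights, grad)]
    return weights

# Hypothetical separable data, where unregularized weights keep growing.
separable = [([1.0, -2.0], -1), ([1.0, -1.0], -1),
             ([1.0, 1.0], +1), ([1.0, 2.0], +1)]
unregularized = fit_l2(separable, l2_penalty=0.0)
regularized = fit_l2(separable, l2_penalty=1.0)
print(unregularized, regularized)
```

With the penalty switched on, the weights settle at a finite value even though the data are separable; λ is usually chosen on a validation set.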
