首页 > 代码库 > 【DeepLearning】Exercise:Sparse Autoencoder
【DeepLearning】Exercise:Sparse Autoencoder
习题的链接:http://deeplearning.stanford.edu/wiki/index.php/Exercise:Sparse_Autoencoder
我的实现:
sampleIMAGES.m
function patches = sampleIMAGES()% sampleIMAGES% Returns 10000 patches for trainingload IMAGES; % load images from disk patchsize = 8; % we‘ll use 8x8 patches numpatches = 10000;% Initialize patches with zeros. Your code will fill in this matrix--one% column per patch, 10000 columns. patches = zeros(patchsize*patchsize, numpatches);%% ---------- YOUR CODE HERE --------------------------------------% Instructions: Fill in the variable called "patches" using data % from IMAGES. % % IMAGES is a 3D array containing 10 images% For instance, IMAGES(:,:,6) is a 512x512 array containing the 6th image,% and you can type "imagesc(IMAGES(:,:,6)), colormap gray;" to visualize% it. (The contrast on these images look a bit off because they have% been preprocessed using using "whitening." See the lecture notes for% more details.) As a second example, IMAGES(21:30,21:30,1) is an image% patch corresponding to the pixels in the block (21,21) to (30,30) of% Image 1for i=1:numpatches %randomly pick one of the 10 images ind = round(rand(1)*9)+1; %[0~9]+1=[1,10] %randomly sample an 8×8 image patch startx = round(rand(1)*504)+1; %[0~504]+1=[1,505] starty = round(rand(1)*504)+1; tempPatch = IMAGES(startx:startx+patchsize-1,starty:starty+patchsize-1,ind); %convert the image patch into a 64-dimensional vector tempVec = reshape(tempPatch, patchsize*patchsize, 1); patches(:,i) = tempVec;end%% ---------------------------------------------------------------% For the autoencoder to work well we need to normalize the data% Specifically, since the output of the network is bounded between [0,1]% (due to the sigmoid activation function), we have to make sure % the range of pixel values is also bounded between [0,1]patches = normalizeData(patches);end%% ---------------------------------------------------------------function patches = normalizeData(patches)% Squash data to [0.1, 0.9] since we use sigmoid as the activation% function in the output layer% Remove DC (mean of images). patches = bsxfun(@minus, patches, mean(patches));% Truncate to +/-3 standard deviations and scale to -1 to 1pstd = 3 * std(patches(:));patches = max(min(patches, pstd), -pstd) / pstd;% Rescale from [-1,1] to [0.1,0.9]patches = (patches + 1) * 0.4 + 0.1;end
computeNumericalGradient.m
function numgrad = computeNumericalGradient(J, theta)% numgrad = computeNumericalGradient(J, theta)% theta: a vector of parameters% J: a function that outputs a real-number. Calling y = J(theta) will return the% function value at theta. % Initialize numgrad with zerosnumgrad = zeros(size(theta));%% ---------- YOUR CODE HERE --------------------------------------% Instructions: % Implement numerical gradient checking, and return the result in numgrad. % (See Section 2.3 of the lecture notes.)% You should write code so that numgrad(i) is (the numerical approximation to) the % partial derivative of J with respect to the i-th input argument, evaluated at theta. % I.e., numgrad(i) should be the (approximately) the partial derivative of J with % respect to theta(i).% % Hint: You will probably want to compute the elements of numgrad one at a time. % The function aims to check whether W1grad&W2grad&b1grad&b2grad in% sparseAutoencoderCost have been computed correctly.%theta(i) denotes i_th parameterlength = size(theta,1);EPSILON = 1e-4;E = eye(length);for i=1:length numgrad(i)=(J(theta + EPSILON*E(:,i)) - J(theta - EPSILON*E(:,i)))/(2*EPSILON);end%% ---------------------------------------------------------------end
sparseAutoencoderCost.m
function [cost,grad] = sparseAutoencoderCost(theta, visibleSize, hiddenSize, ... lambda, sparsityParam, beta, data)% visibleSize: the number of input units (probably 64) % hiddenSize: the number of hidden units (probably 25) % lambda: weight decay parameter% sparsityParam: The desired average activation for the hidden units (denoted in the lecture% notes by the greek alphabet rho, which looks like a lower-case "p").% beta: weight of sparsity penalty term% data: Our 64x10000 matrix containing the training data. So, data(:,i) is the i-th training example. % The input theta is a vector (because minFunc expects the parameters to be a vector). % We first convert theta to the (W1, W2, b1, b2) matrix/vector format, so that this % follows the notation convention of the lecture notes. % W1(i,j) denotes the weight from j_th node in input layer to i_th node% in hidden layer. Thus it is a hiddenSize*visibleSize matrixW1 = reshape(theta(1:hiddenSize*visibleSize), hiddenSize, visibleSize); % W2(i,j) denotes the weight from j_th node in hidden layer to i_th node% in output layer. Thus it is a visibleSize*hiddenSize matrixW2 = reshape(theta(hiddenSize*visibleSize+1:2*hiddenSize*visibleSize), visibleSize, hiddenSize);% b1(i) denotes the i_th bias in input layer to i_th node in hidden layer.% Thus it is a hiddenSize*1 vectorb1 = theta(2*hiddenSize*visibleSize+1:2*hiddenSize*visibleSize+hiddenSize);% b2(i) denotes the i_th bias in hidden layer to i_th node in output layer.% Thus it is a visibleSize*1 vectorb2 = theta(2*hiddenSize*visibleSize+hiddenSize+1:end);%% ---------- YOUR CODE HERE --------------------------------------% Instructions: Compute the cost/optimization objective J_sparse(W,b) for the Sparse Autoencoder,% and the corresponding gradients W1grad, W2grad, b1grad, b2grad.%% W1grad, W2grad, b1grad and b2grad should be computed using backpropagation.% Note that W1grad has the same dimensions as W1, b1grad has the same dimensions% as b1, etc. Your code should set W1grad to be the partial derivative of J_sparse(W,b) with% respect to W1. I.e., W1grad(i,j) should be the partial derivative of J_sparse(W,b) % with respect to the input parameter W1(i,j). Thus, W1grad should be equal to the term % [(1/m) \Delta W^{(1)} + \lambda W^{(1)}] in the last block of pseudo-code in Section 2.2 % of the lecture notes (and similarly for W2grad, b1grad, b2grad).% % Stated differently, if we were using batch gradient descent to optimize the parameters,% the gradient descent update to W1 would be W1 := W1 - alpha * W1grad, and similarly for W2, b1, b2. % % 1. Set \Delta W^{(1)}, \Delta b^{(1)} to 0 for all layer l% Cost and gradient variables (your code needs to compute these values). % Here, we initialize them to zeros. cost = 0;rho = zeros(hiddenSize,1);W1grad = zeros(size(W1)); W2grad = zeros(size(W2));b1grad = zeros(size(b1)); b2grad = zeros(size(b2));% 2.For i=1 to mm = size(data,2);%{For Big-data, which we cannot save activation information for all nodes% Compute rho before backpropagationfor i=1:m x = data(:,i); z2 = W1 * x + b1; a2 = sigmoid(z2); % Accumulate rho rho = rho + a2;endrho = rho ./ m;KLterm = beta*(-sparsityParam ./ rho + (1-sparsityParam) ./ (1-rho));for i=1:m x = data(:,i); % 2a. Use backpropagation to compute diff(J_sparse(W,b;x,y), W^{(1)}) % and diff(J_sparse(W,b;x,y), b^{(1)}) % 2a.1. Perform a feedforward pass, computing the activations for % hidden layer and output layer. z2 = W1 * x + b1; a2 = sigmoid(z2); z3 = W2 * a2 + b2; a3 = sigmoid(z3); % Accumulate the cost cost = cost + 1/2 * ((x-a3)‘*(x-a3)); % 2a.2. For the output layer, set delta3 % delta3 is a visibleSize*1 vector delta3 = -(x-a3) .* sigmoidDiff(z3); % 2a.3. For the hidden layer, set delta2 % delta2 is a hiddenSize*1 vector delta2 = (W2‘*delta3 + KLterm) .* sigmoidDiff(z2); % 2a.4. Compute the desired partial derivatives JW1diff = delta2 * x‘; Jb1diff = delta2; JW2diff = delta3 * a2‘; Jb2diff = delta3; % 2b. Update \Delta W^{(1)} W1grad = W1grad + JW1diff; W2grad = W2grad + JW2diff; % 2c. Update \Delta b^{(1)} b1grad = b1grad + Jb1diff; b2grad = b2grad + Jb2diff;end% Compute KL penalty termKLpen = beta * sum(sparsityParam*log(sparsityParam ./ rho) + (1-sparsityParam)*log((1-sparsityParam) ./ (1-rho)));% Compute weight decay termtempW1 = W1 .* W1;tempW2 = W2 .* W2;WD = (lambda/2)*(sum(sum(tempW1))+sum(sum(tempW2)));cost = cost ./ m + WD + KLpen;W1grad = W1grad ./ m + lambda .* W1;W2grad = W2grad ./ m + lambda .* W2;b1grad = b1grad ./ m;b2grad = b2grad ./ m;%}% For small data, save activation information during computing rhoz2_all = zeros(hiddenSize, m);a2_all = zeros(hiddenSize, m);z3_all = zeros(visibleSize, m);a3_all = zeros(visibleSize, m);for i=1:m x = data(:,i); z2_all(:,i) = W1 * x + b1; a2_all(:,i) = sigmoid(z2_all(:,i)); z3_all(:,i) = W2 * a2_all(:,i) + b2; a3_all(:,i) = sigmoid(z3_all(:,i)); % Accumulate rho rho = rho + a2_all(:,i);endrho = rho ./ m;KLterm = beta*(-sparsityParam ./ rho + (1-sparsityParam) ./ (1-rho));for i=1:m x = data(:,i); % 2a. Use backpropagation to compute diff(J_sparse(W,b;x,y), W^{(1)}) % and diff(J_sparse(W,b;x,y), b^{(1)}) % 2a.1. Perform a feedforward pass, computing the activations for % hidden layer and output layer. z2 = z2_all(:,i); a2 = a2_all(:,i); z3 = z3_all(:,i); a3 = a3_all(:,i); % Accumulate the cost cost = cost + 1/2 * ((x-a3)‘*(x-a3)); % 2a.2. For the output layer, set delta3 % delta3 is a visibleSize*1 vector delta3 = -(x-a3) .* sigmoidDiff(z3); % 2a.3. For the hidden layer, set delta2 % delta2 is a hiddenSize*1 vector delta2 = (W2‘*delta3 + KLterm) .* sigmoidDiff(z2); % 2a.4. Compute the desired partial derivatives JW1diff = delta2 * x‘; Jb1diff = delta2; JW2diff = delta3 * a2‘; Jb2diff = delta3; % 2b. Update \Delta W^{(1)} W1grad = W1grad + JW1diff; W2grad = W2grad + JW2diff; % 2c. Update \Delta b^{(1)} b1grad = b1grad + Jb1diff; b2grad = b2grad + Jb2diff;end% Compute KL penalty termKLpen = beta * sum(sparsityParam*log(sparsityParam ./ rho) + (1-sparsityParam)*log((1-sparsityParam) ./ (1-rho)));% Compute weight decay termtempW1 = W1 .* W1;tempW2 = W2 .* W2;WD = (lambda/2)*(sum(sum(tempW1))+sum(sum(tempW2)));cost = cost ./ m + WD + KLpen;W1grad = W1grad ./ m + lambda .* W1;W2grad = W2grad ./ m + lambda .* W2;b1grad = b1grad ./ m;b2grad = b2grad ./ m;%-------------------------------------------------------------------% 3.Update the parametersAfter computing the cost and gradient, we will % convert the gradients back to a vector format (suitable for minFunc). % Specifically, we will unroll your gradient matrices into a vector.grad = [W1grad(:) ; W2grad(:) ; b1grad(:) ; b2grad(:)];end%-------------------------------------------------------------------% Here‘s an implementation of the sigmoid function, which you may find useful% in your computation of the costs and the gradients. This inputs a (row or% column) vector (say (z1, z2, z3)) and returns (f(z1), f(z2), f(z3)). function sigm = sigmoid(x) sigm = 1 ./ (1 + exp(-x));end% define the differential of sigmoidfunction sigmDiff = sigmoidDiff(x) sigmDiff = sigmoid(x) .* (1-sigmoid(x));end
最终训练结果:
【DeepLearning】Exercise:Sparse Autoencoder
声明:以上内容来自用户投稿及互联网公开渠道收集整理发布,本网站不拥有所有权,未作人工编辑处理,也不承担相关法律责任,若内容有误或涉及侵权可进行投诉: 投诉/举报 工作人员会在5个工作日内联系你,一经查实,本站将立刻删除涉嫌侵权内容。