首页 > 代码库 > 使用神经网络来识别手写数字【译(三)- 用Python代码实现

使用神经网络来识别手写数字【译(三)- 用Python代码实现

 

实现我们分类数字的网络

 

好,让我们使用随机梯度下降和 MNIST训练数据来写一个程序来学习怎样失败手写数字。 我们也难怪Python (2.7) 来实现。只有 74 行代码!我们需要的第一个东西是 MNIST数据。如果有 github 账号,你可以将这些代码库克隆下来,

 

git clone https://github.com/mnielsen/neural-networks-and-deep-learning.git

 

或者你可以到这里 下载。

Incidentally, 当我先前说到 MNIST 数据集时,我说它被分成 60,000 个训练图片,和 10,000张测试图片。这是官方的说法。实际上,我们准备用不同的分法。 We‘ll leave the test images as is, but split the 60,000-image MNIST training set into two parts: a set of 50,000 images, which we‘ll use to train our neural network, and a separate 10,000 image validation set. We won‘t use the validation data in this chapter, but later in the book we‘ll find it useful in figuring out how to set certain hyper-parameters of the neural network - things like the learning rate, and so on, which aren‘t directly selected by our learning algorithm. Although the validation data isn‘t part of the original MNIST specification, many people use MNIST in this fashion, and the use of validation data is common in neural networks. When I refer to the "MNIST training data" from now on, I‘ll be referring to our 50,000 image data set, not the original 60,000 image data set**As noted earlier, the MNIST data set is based on two data sets collected by NIST, the United States‘ National Institute of Standards and Technology. To construct MNIST the NIST data sets were stripped down and put into a more convenient format by Yann LeCun, Corinna Cortes, and Christopher J. C. Burges. See this link for more details. The data set in my repository is in a form that makes it easy to load and manipulate the MNIST data in Python. I obtained this particular form of the data from the LISA machine learning laboratory at the University of Montreal (link)..

 

Apart from the MNIST data we also need a Python library calledNumpy, for doing fast linear algebra. If you don‘t already have Numpy installed, you can get it here.

让我们讲述一下神经网络代码的核心功能,在我给出完整清单前。核心是一个 Network 类,我们用了表现一个神经网络。下面这些代码是用来初始化一个Network对象:

 

class Network(object):    def __init__(self, sizes):        self.num_layers = len(sizes)        self.sizes = sizes        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]        self.weights = [np.random.randn(y, x)                         for x, y in zip(sizes[:-1], sizes[1:])]

 

 

这些代码,列表的 sizes 包含各个层的神经元的数量。例如,如果我们想创建一个第一层有有两神经元第二层有3个神经元,最后一层有一个神经元 的 Network对象,我们这样设置:

net = Network([2, 3, 1])

 

偏移量和权重,用随机数来初始化,使用 Numpy np.random.randn函数来生成Gaussian distributions with mean 00 and standard deviation 11. This random initialization gives our stochastic gradient descent algorithm a place to start from. In later chapters we‘ll find better ways of initializing the weights and biases, but this will do for now. Note that the Network initialization code assumes that the first layer of neurons is an input layer, and omits to set any biases for those neurons, since biases are only ever used in computing the outputs from later layers.

 

Note also that the biases and weights are stored as lists of Numpy matrices. So, for example net.weights[1] is a Numpy matrix storing the weights connecting the second and third layers of neurons. (It‘s not the first and second layers, since Python‘s list indexing starts at0.) Since net.weights[1] is rather verbose, let‘s just denote that matrix ww. It‘s a matrix such that wjkwjk is the weight for the connection between the kthkth neuron in the second layer, and the jthjth neuron in the third layer. This ordering of the jj and kk indices may seem strange - surely it‘d make more sense to swap the jj and kk indices around? The big advantage of using this ordering is that it means that the vector of activations of the third layer of neurons is:

a=σ(wa+b).(22)(22)a′=σ(wa+b).

There‘s quite a bit going on in this equation, so let‘s unpack it piece by piece. aa is the vector of activations of the second layer of neurons. To obtain a′a′ we multiply aa by the weight matrix ww, and add the vector bb of biases. We then apply the function σσelementwise to every entry in the vector wa+bwa+b. (This is calledvectorizing the function σσ.) It‘s easy to verify that Equation (22)gives the same result as our earlier rule, Equation (4), for computing the output of a sigmoid neuron.

 

 

Exercise

  • Write out Equation (22) in component form, and verify that it gives the same result as the rule (4) for computing the output of a sigmoid neuron.

 

With all this in mind, it‘s easy to write code computing the output from a Network instance. We begin by defining the sigmoid function:

def sigmoid(z):    return 1.0/(1.0+np.exp(-z))

 

Note that when the input z is a vector or Numpy array, Numpy automatically applies the function sigmoid elementwise, that is, in vectorized form.

 

We then add a feedforward method to the Network class, which, given an input a for the network, returns the corresponding output**It is assumed that the input a is an (n, 1)Numpy ndarray, not a (n,) vector. Here, n is the number of inputs to the network. If you try to use an (n,) vector as input you‘ll get strange results. Although using an (n,) vector appears the more natural choice, using an (n, 1) ndarray makes it particularly easy to modify the code to feedforward multiple inputs at once, and that is sometimes convenient.. All the method does is applies Equation (22) for each layer:

 def feedforward(self, a):        """Return the output of the network if "a" is input."""        for b, w in zip(self.biases, self.weights):            a = sigmoid(np.dot(w, a)+b)        return a

 

 

Of course, the main thing we want our Network objects to do is to learn. To that end we‘ll give them an SGD method which implements stochastic gradient descent. Here‘s the code. It‘s a little mysterious in a few places, but I‘ll break it down below, after the listing.

 

 def SGD(self, training_data, epochs, mini_batch_size, eta,            test_data=None):        """Train the neural network using mini-batch stochastic        gradient descent.  The "training_data" is a list of tuples        "(x, y)" representing the training inputs and the desired        outputs.  The other non-optional parameters are        self-explanatory.  If "test_data" is provided then the        network will be evaluated against the test data after each        epoch, and partial progress printed out.  This is useful for        tracking progress, but slows things down substantially."""        if test_data: n_test = len(test_data)        n = len(training_data)        for j in xrange(epochs):            random.shuffle(training_data)            mini_batches = [                training_data[k:k+mini_batch_size]                for k in xrange(0, n, mini_batch_size)]            for mini_batch in mini_batches:                self.update_mini_batch(mini_batch, eta)            if test_data:                print "Epoch {0}: {1} / {2}".format(                    j, self.evaluate(test_data), n_test)            else:                print "Epoch {0} complete".format(j)

 

 

The training_data is a list of tuples (x, y) representing the training inputs and corresponding desired outputs. The variables epochs andmini_batch_size are what you‘d expect - the number of epochs to train for, and the size of the mini-batches to use when sampling. eta is the learning rate, ηη. If the optional argument test_data is supplied, then the program will evaluate the network after each epoch of training, and print out partial progress. This is useful for tracking progress, but slows things down substantially.

The code works as follows. In each epoch, it starts by randomly shuffling the training data, and then partitions it into mini-batches of the appropriate size. This is an easy way of sampling randomly from the training data. Then for each mini_batch we apply a single step of gradient descent. This is done by the codeself.update_mini_batch(mini_batch, eta), which updates the network weights and biases according to a single iteration of gradient descent, using just the training data in mini_batch. Here‘s the code for the update_mini_batch method:

 def update_mini_batch(self, mini_batch, eta):        """Update the network‘s weights and biases by applying        gradient descent using backpropagation to a single mini batch.        The "mini_batch" is a list of tuples "(x, y)", and "eta"        is the learning rate."""        nabla_b = [np.zeros(b.shape) for b in self.biases]        nabla_w = [np.zeros(w.shape) for w in self.weights]        for x, y in mini_batch:            delta_nabla_b, delta_nabla_w = self.backprop(x, y)            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]        self.weights = [w-(eta/len(mini_batch))*nw                         for w, nw in zip(self.weights, nabla_w)]        self.biases = [b-(eta/len(mini_batch))*nb                        for b, nb in zip(self.biases, nabla_b)]

 

Most of the work is done by the line

delta_nabla_b, delta_nabla_w = self.backprop(x, y)

 

This invokes something called the backpropagation algorithm, which is a fast way of computing the gradient of the cost function. So update_mini_batch works simply by computing these gradients for every training example in the mini_batch, and then updatingself.weights and self.biases appropriately.

 

I‘m not going to show the code for self.backprop right now. We‘ll study how backpropagation works in the next chapter, including the code for self.backprop. For now, just assume that it behaves as claimed, returning the appropriate gradient for the cost associated to the training example x.

Let‘s look at the full program, including the documentation strings, which I omitted above. Apart from self.backprop the program is self-explanatory - all the heavy lifting is done in self.SGD andself.update_mini_batch, which we‘ve already discussed. Theself.backprop method makes use of a few extra functions to help in computing the gradient, namely sigmoid_prime, which computes the derivative of the σσ function, and self.cost_derivative, which I won‘t describe here. You can get the gist of these (and perhaps the details) just by looking at the code and documentation strings. We‘ll look at them in detail in the next chapter. Note that while the program appears lengthy, much of the code is documentation strings intended to make the code easy to understand. In fact, the program contains just 74 lines of non-whitespace, non-comment code. All the code may be found on GitHub here.

 

 

"""network.py~~~~~~~~~~A module to implement the stochastic gradient descent learningalgorithm for a feedforward neural network.  Gradients are calculatedusing backpropagation.  Note that I have focused on making the codesimple, easily readable, and easily modifiable.  It is not optimized,and omits many desirable features."""#### Libraries# Standard libraryimport random# Third-party librariesimport numpy as npclass Network(object):    def __init__(self, sizes):        """The list ``sizes`` contains the number of neurons in the        respective layers of the network.  For example, if the list        was [2, 3, 1] then it would be a three-layer network, with the        first layer containing 2 neurons, the second layer 3 neurons,        and the third layer 1 neuron.  The biases and weights for the        network are initialized randomly, using a Gaussian        distribution with mean 0, and variance 1.  Note that the first        layer is assumed to be an input layer, and by convention we        won‘t set any biases for those neurons, since biases are only        ever used in computing the outputs from later layers."""        self.num_layers = len(sizes)        self.sizes = sizes        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]        self.weights = [np.random.randn(y, x)                        for x, y in zip(sizes[:-1], sizes[1:])]    def feedforward(self, a):        """Return the output of the network if ``a`` is input."""        for b, w in zip(self.biases, self.weights):            a = sigmoid(np.dot(w, a)+b)        return a    def SGD(self, training_data, epochs, mini_batch_size, eta,            test_data=None):        """Train the neural network using mini-batch stochastic        gradient descent.  The ``training_data`` is a list of tuples        ``(x, y)`` representing the training inputs and the desired        outputs.  The other non-optional parameters are        self-explanatory.  If ``test_data`` is provided then the        network will be evaluated against the test data after each        epoch, and partial progress printed out.  This is useful for        tracking progress, but slows things down substantially."""        if test_data: n_test = len(test_data)        n = len(training_data)        for j in xrange(epochs):            random.shuffle(training_data)            mini_batches = [                training_data[k:k+mini_batch_size]                for k in xrange(0, n, mini_batch_size)]            for mini_batch in mini_batches:                self.update_mini_batch(mini_batch, eta)            if test_data:                print "Epoch {0}: {1} / {2}".format(                    j, self.evaluate(test_data), n_test)            else:                print "Epoch {0} complete".format(j)    def update_mini_batch(self, mini_batch, eta):        """Update the network‘s weights and biases by applying        gradient descent using backpropagation to a single mini batch.        The ``mini_batch`` is a list of tuples ``(x, y)``, and ``eta``        is the learning rate."""        nabla_b = [np.zeros(b.shape) for b in self.biases]        nabla_w = [np.zeros(w.shape) for w in self.weights]        for x, y in mini_batch:            delta_nabla_b, delta_nabla_w = self.backprop(x, y)            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]        self.weights = [w-(eta/len(mini_batch))*nw                        for w, nw in zip(self.weights, nabla_w)]        self.biases = [b-(eta/len(mini_batch))*nb                       for b, nb in zip(self.biases, nabla_b)]    def backprop(self, x, y):        """Return a tuple ``(nabla_b, nabla_w)`` representing the        gradient for the cost function C_x.  ``nabla_b`` and        ``nabla_w`` are layer-by-layer lists of numpy arrays, similar        to ``self.biases`` and ``self.weights``."""        nabla_b = [np.zeros(b.shape) for b in self.biases]        nabla_w = [np.zeros(w.shape) for w in self.weights]        # feedforward        activation = x        activations = [x] # list to store all the activations, layer by layer        zs = [] # list to store all the z vectors, layer by layer        for b, w in zip(self.biases, self.weights):            z = np.dot(w, activation)+b            zs.append(z)            activation = sigmoid(z)            activations.append(activation)        # backward pass        delta = self.cost_derivative(activations[-1], y) *             sigmoid_prime(zs[-1])        nabla_b[-1] = delta        nabla_w[-1] = np.dot(delta, activations[-2].transpose())        # Note that the variable l in the loop below is used a little        # differently to the notation in Chapter 2 of the book.  Here,        # l = 1 means the last layer of neurons, l = 2 is the        # second-last layer, and so on.  It‘s a renumbering of the        # scheme in the book, used here to take advantage of the fact        # that Python can use negative indices in lists.        for l in xrange(2, self.num_layers):            z = zs[-l]            sp = sigmoid_prime(z)            delta = np.dot(self.weights[-l+1].transpose(), delta) * sp            nabla_b[-l] = delta            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())        return (nabla_b, nabla_w)    def evaluate(self, test_data):        """Return the number of test inputs for which the neural        network outputs the correct result. Note that the neural        network‘s output is assumed to be the index of whichever        neuron in the final layer has the highest activation."""        test_results = [(np.argmax(self.feedforward(x)), y)                        for (x, y) in test_data]        return sum(int(x == y) for (x, y) in test_results)    def cost_derivative(self, output_activations, y):        """Return the vector of partial derivatives \partial C_x /        \partial a for the output activations."""        return (output_activations-y)#### Miscellaneous functionsdef sigmoid(z):    """The sigmoid function."""    return 1.0/(1.0+np.exp(-z))def sigmoid_prime(z):    """Derivative of the sigmoid function."""    return sigmoid(z)*(1-sigmoid(z))

 

 

How well does the program recognize handwritten digits? Well, let‘s start by loading in the MNIST data. I‘ll do this using a little helper program, mnist_loader.py, to be described below. We execute the following commands in a Python shell,

 

>>> import mnist_loader>>> training_data, validation_data, test_data =http://www.mamicode.com/ ... mnist_loader.load_data_wrapper()

 

 

Of course, this could also be done in a separate Python program, but if you‘re following along it‘s probably easiest to do in a Python shell.

After loading the MNIST data, we‘ll set up a Network with 3030 hidden neurons. We do this after importing the Python program listed above, which is named network,

 

>>> import network>>> net = network.Network([784, 30, 10])

 

 

Finally, we‘ll use stochastic gradient descent to learn from the MNIST training_data over 30 epochs, with a mini-batch size of 10, and a learning rate of η=3.0η=3.0,

 

>>> net.SGD(training_data, 30, 10, 3.0, test_data=http://www.mamicode.com/test_data)

 

 

Note that if you‘re running the code as you read along, it will take some time to execute - for a typical machine (as of 2015) it will likely take a few minutes to run. I suggest you set things running, continue to read, and periodically check the output from the code. If you‘re in a rush you can speed things up by decreasing the number of epochs, by decreasing the number of hidden neurons, or by using only part of the training data. Note that production code would be much, much faster: these Python scripts are intended to help you understand how neural nets work, not to be high-performance code! And, of course, once we‘ve trained a network it can be run very quickly indeed, on almost any computing platform. For example, once we‘ve learned a good set of weights and biases for a network, it can easily be ported to run in Javascript in a web browser, or as a native app on a mobile device. In any case, here is a partial transcript of the output of one training run of the neural network. The transcript shows the number of test images correctly recognized by the neural network after each epoch of training. As you can see, after just a single epoch this has reached 9,129 out of 10,000, and the number continues to grow,

 

Epoch 0: 9129 / 10000Epoch 1: 9295 / 10000Epoch 2: 9348 / 10000...Epoch 27: 9528 / 10000Epoch 28: 9542 / 10000Epoch 29: 9534 / 10000

 

 

That is, the trained network gives us a classification rate of about 9595percent - 95.4295.42 percent at its peak ("Epoch 28")! That‘s quite encouraging as a first attempt. I should warn you, however, that if you run the code then your results are not necessarily going to be quite the same as mine, since we‘ll be initializing our network using (different) random weights and biases. To generate results in this chapter I‘ve taken best-of-three runs.

Let‘s rerun the above experiment, changing the number of hidden neurons to 100100. As was the case earlier, if you‘re running the code as you read along, you should be warned that it takes quite a while to execute (on my machine this experiment takes tens of seconds for each training epoch), so it‘s wise to continue reading in parallel while the code executes.

 

>>> net = network.Network([784, 100, 10])>>> net.SGD(training_data, 30, 10, 3.0, test_data=http://www.mamicode.com/test_data)

 

 

Sure enough, this improves the results to 96.5996.59 percent. At least in this case, using more hidden neurons helps us get better results**Reader feedback indicates quite some variation in results for this experiment, and some training runs give results quite a bit worse. Using the techniques introduced in chapter 3 will greatly reduce the variation in performance across different training runs for our networks..

Of course, to obtain these accuracies I had to make specific choices for the number of epochs of training, the mini-batch size, and the learning rate, ηη. As I mentioned above, these are known as hyper-parameters for our neural network, in order to distinguish them from the parameters (weights and biases) learnt by our learning algorithm. If we choose our hyper-parameters poorly, we can get bad results. Suppose, for example, that we‘d chosen the learning rate to be η=0.001η=0.001,

 

>>> net = network.Network([784, 100, 10])>>> net.SGD(training_data, 30, 10, 0.001, test_data=http://www.mamicode.com/test_data)

 

 

The results are much less encouraging,

Epoch 0: 1139 / 10000Epoch 1: 1136 / 10000Epoch 2: 1135 / 10000...Epoch 27: 2101 / 10000Epoch 28: 2123 / 10000Epoch 29: 2142 / 10000

 

However, you can see that the performance of the network is getting slowly better over time. That suggests increasing the learning rate, say to η=0.01η=0.01. If we do that, we get better results, which suggests increasing the learning rate again. (If making a change improves things, try doing more!) If we do that several times over, we‘ll end up with a learning rate of something like η=1.0η=1.0 (and perhaps fine tune to 3.03.0), which is close to our earlier experiments. So even though we initially made a poor choice of hyper-parameters, we at least got enough information to help us improve our choice of hyper-parameters.

 

In general, debugging a neural network can be challenging. This is especially true when the initial choice of hyper-parameters produces results no better than random noise. Suppose we try the successful 30 hidden neuron network architecture from earlier, but with the learning rate changed to η=100.0η=100.0:

>>> net = network.Network([784, 30, 10])>>> net.SGD(training_data, 30, 10, 100.0, test_data=http://www.mamicode.com/test_data)

 

At this point we‘ve actually gone too far, and the learning rate is too high:

Epoch 0: 1009 / 10000Epoch 1: 1009 / 10000Epoch 2: 1009 / 10000Epoch 3: 1009 / 10000...Epoch 27: 982 / 10000Epoch 28: 982 / 10000Epoch 29: 982 / 10000

 

Now imagine that we were coming to this problem for the first time. Of course, we know from our earlier experiments that the right thing to do is to decrease the learning rate. But if we were coming to this problem for the first time then there wouldn‘t be much in the output to guide us on what to do. We might worry not only about the learning rate, but about every other aspect of our neural network. We might wonder if we‘ve initialized the weights and biases in a way that makes it hard for the network to learn? Or maybe we don‘t have enough training data to get meaningful learning? Perhaps we haven‘t run for enough epochs? Or maybe it‘s impossible for a neural network with this architecture to learn to recognize handwritten digits? Maybe the learning rate is too low? Or, maybe, the learning rate is too high? When you‘re coming to a problem for the first time, you‘re not always sure.

 

The lesson to take away from this is that debugging a neural network is not trivial, and, just as for ordinary programming, there is an art to it. You need to learn that art of debugging in order to get good results from neural networks. More generally, we need to develop heuristics for choosing good hyper-parameters and a good architecture. We‘ll discuss all these at length through the book, including how I chose the hyper-parameters above.

 

Exercise

 

 

  • Try creating a network with just two layers - an input and an output layer, no hidden layer - with 784 and 10 neurons, respectively. Train the network using stochastic gradient descent. What classification accuracy can you achieve?

 

 

Earlier, I skipped over the details of how the MNIST data is loaded. It‘s pretty straightforward. For completeness, here‘s the code. The data structures used to store the MNIST data are described in the documentation strings - it‘s straightforward stuff, tuples and lists of Numpy ndarray objects (think of them as vectors if you‘re not familiar with ndarrays):

 

"""mnist_loader~~~~~~~~~~~~A library to load the MNIST image data.  For details of the datastructures that are returned, see the doc strings for ``load_data``and ``load_data_wrapper``.  In practice, ``load_data_wrapper`` is thefunction usually called by our neural network code."""#### Libraries# Standard libraryimport cPickleimport gzip# Third-party librariesimport numpy as npdef load_data():    """Return the MNIST data as a tuple containing the training data,    the validation data, and the test data.    The ``training_data`` is returned as a tuple with two entries.    The first entry contains the actual training images.  This is a    numpy ndarray with 50,000 entries.  Each entry is, in turn, a    numpy ndarray with 784 values, representing the 28 * 28 = 784    pixels in a single MNIST image.    The second entry in the ``training_data`` tuple is a numpy ndarray    containing 50,000 entries.  Those entries are just the digit    values (0...9) for the corresponding images contained in the first    entry of the tuple.    The ``validation_data`` and ``test_data`` are similar, except    each contains only 10,000 images.    This is a nice data format, but for use in neural networks it‘s    helpful to modify the format of the ``training_data`` a little.    That‘s done in the wrapper function ``load_data_wrapper()``, see    below.    """    f = gzip.open(../data/mnist.pkl.gz, rb)    training_data, validation_data, test_data = cPickle.load(f)    f.close()    return (training_data, validation_data, test_data)def load_data_wrapper():    """Return a tuple containing ``(training_data, validation_data,    test_data)``. Based on ``load_data``, but the format is more    convenient for use in our implementation of neural networks.    In particular, ``training_data`` is a list containing 50,000    2-tuples ``(x, y)``.  ``x`` is a 784-dimensional numpy.ndarray    containing the input image.  ``y`` is a 10-dimensional    numpy.ndarray representing the unit vector corresponding to the    correct digit for ``x``.    ``validation_data`` and ``test_data`` are lists containing 10,000    2-tuples ``(x, y)``.  In each case, ``x`` is a 784-dimensional    numpy.ndarry containing the input image, and ``y`` is the    corresponding classification, i.e., the digit values (integers)    corresponding to ``x``.    Obviously, this means we‘re using slightly different formats for    the training data and the validation / test data.  These formats    turn out to be the most convenient for use in our neural network    code."""    tr_d, va_d, te_d = load_data()    training_inputs = [np.reshape(x, (784, 1)) for x in tr_d[0]]    training_results = [vectorized_result(y) for y in tr_d[1]]    training_data = zip(training_inputs, training_results)    validation_inputs = [np.reshape(x, (784, 1)) for x in va_d[0]]    validation_data = zip(validation_inputs, va_d[1])    test_inputs = [np.reshape(x, (784, 1)) for x in te_d[0]]    test_data = zip(test_inputs, te_d[1])    return (training_data, validation_data, test_data)def vectorized_result(j):    """Return a 10-dimensional unit vector with a 1.0 in the jth    position and zeroes elsewhere.  This is used to convert a digit    (0...9) into a corresponding desired output from the neural    network."""    e = np.zeros((10, 1))    e[j] = 1.0    return e 

 

I said above that our program gets pretty good results. What does that mean? Good compared to what? It‘s informative to have some simple (non-neural-network) baseline tests to compare against, to understand what it means to perform well. The simplest baseline of all, of course, is to randomly guess the digit. That‘ll be right about ten percent of the time. We‘re doing much better than that!

What about a less trivial baseline? Let‘s try an extremely simple idea: we‘ll look at how dark an image is. For instance, an image of a 22 will typically be quite a bit darker than an image of a 11, just because more pixels are blackened out, as the following examples illustrate:

 

技术分享

 

This suggests using the training data to compute average darknesses for each digit, 0,1,2,,90,1,2,…,9. When presented with a new image, we compute how dark the image is, and then guess that it‘s whichever digit has the closest average darkness. This is a simple procedure, and is easy to code up, so I won‘t explicitly write out the code - if you‘re interested it‘s in the GitHub repository. But it‘s a big improvement over random guessing, getting 2,2252,225 of the 10,00010,000test images correct, i.e., 22.2522.25 percent accuracy.

It‘s not difficult to find other ideas which achieve accuracies in the 2020 to 5050 percent range. If you work a bit harder you can get up over 5050 percent. But to get much higher accuracies it helps to use established machine learning algorithms. Let‘s try using one of the best known algorithms, the support vector machine or SVM. If you‘re not familiar with SVMs, not to worry, we‘re not going to need to understand the details of how SVMs work. Instead, we‘ll use a Python library called scikit-learn, which provides a simple Python interface to a fast C-based library for SVMs known as LIBSVM.

If we run scikit-learn‘s SVM classifier using the default settings, then it gets 9,435 of 10,000 test images correct. (The code is available here.) That‘s a big improvement over our naive approach of classifying an image based on how dark it is. Indeed, it means that the SVM is performing roughly as well as our neural networks, just a little worse. In later chapters we‘ll introduce new techniques that enable us to improve our neural networks so that they perform much better than the SVM.

That‘s not the end of the story, however. The 9,435 of 10,000 result is for scikit-learn‘s default settings for SVMs. SVMs have a number of tunable parameters, and it‘s possible to search for parameters which improve this out-of-the-box performance. I won‘t explicitly do this search, but instead refer you to this blog post by Andreas Mueller if you‘d like to know more. Mueller shows that with some work optimizing the SVM‘s parameters it‘s possible to get the performance up above 98.5 percent accuracy. In other words, a well-tuned SVM only makes an error on about one digit in 70. That‘s pretty good! Can neural networks do better?

In fact, they can. At present, well-designed neural networks outperform every other technique for solving MNIST, including SVMs. The current (2013) record is classifying 9,979 of 10,000 images correctly. This was done by Li Wan, Matthew Zeiler, Sixin Zhang, Yann LeCun, and Rob Fergus. We‘ll see most of the techniques they used later in the book. At that level the performance is close to human-equivalent, and is arguably better, since quite a few of the MNIST images are difficult even for humans to recognize with confidence, for example:

 

技术分享

 

我相信你会同意那些很难分类! With images like these in the MNIST data set it‘s remarkable that neural networks can accurately classify all but 21 of the 10,000 test images. Usually, when programming we believe that solving a complicated problem like recognizing the MNIST digits requires a sophisticated algorithm. But even the neural networks in the Wan et al paper just mentioned involve quite simple algorithms, variations on the algorithm we‘ve seen in this chapter. All the complexity is learned, automatically, from the training data. In some sense, the moral of both our results and those in more sophisticated papers, is that for some problems:

sophisticated algorithm ≤ simple learning algorithm + good training data.

 

 

向深度学习迈进

 

While our neural network gives impressive performance, that performance is somewhat mysterious. The weights and biases in the network were discovered automatically. And that means we don‘t immediately have an explanation of how the network does what it does. Can we find some way to understand the principles by which our network is classifying handwritten digits? And, given such principles, can we do better?

To put these questions more starkly, suppose that a few decades hence neural networks lead to artificial intelligence (AI). Will we understand how such intelligent networks work? Perhaps the networks will be opaque to us, with weights and biases we don‘t understand, because they‘ve been learned automatically. In the early days of AI research people hoped that the effort to build an AI would also help us understand the principles behind intelligence and, maybe, the functioning of the human brain. But perhaps the outcome will be that we end up understanding neither the brain nor how artificial intelligence works!

To address these questions, let‘s think back to the interpretation of artificial neurons that I gave at the start of the chapter, as a means of weighing evidence. Suppose we want to determine whether an image shows a human face or not:

 

Credits: 1. Ester Inbar. 2. Unknown. 3. NASA, ESA, G. Illingworth, D. Magee, and P. Oesch (University of California, Santa Cruz), R. Bouwens (Leiden University), and the HUDF09 Team. Click on the images for more details.

技术分享 技术分享 技术分享

We could attack this problem the same way we attacked handwriting recognition - by using the pixels in the image as input to a neural network, with the output from the network a single neuron indicating either "Yes, it‘s a face" or "No, it‘s not a face".

Let‘s suppose we do this, but that we‘re not using a learning algorithm. Instead, we‘re going to try to design a network by hand, choosing appropriate weights and biases. How might we go about it? Forgetting neural networks entirely for the moment, a heuristic we could use is to decompose the problem into sub-problems: does the image have an eye in the top left? Does it have an eye in the top right? Does it have a nose in the middle? Does it have a mouth in the bottom middle? Is there hair on top? And so on.

If the answers to several of these questions are "yes", or even just "probably yes", then we‘d conclude that the image is likely to be a face. Conversely, if the answers to most of the questions are "no", then the image probably isn‘t a face.

Of course, this is just a rough heuristic, and it suffers from many deficiencies. Maybe the person is bald, so they have no hair. Maybe we can only see part of the face, or the face is at an angle, so some of the facial features are obscured. Still, the heuristic suggests that if we can solve the sub-problems using neural networks, then perhaps we can build a neural network for face-detection, by combining the networks for the sub-problems. Here‘s a possible architecture, with rectangles denoting the sub-networks. Note that this isn‘t intended as a realistic approach to solving the face-detection problem; rather, it‘s to help us build intuition about how networks function. Here‘s the architecture:

 

技术分享

 

It‘s also plausible that the sub-networks can be decomposed. Suppose we‘re considering the question: "Is there an eye in the top left?" This can be decomposed into questions such as: "Is there an eyebrow?"; "Are there eyelashes?"; "Is there an iris?"; and so on. Of course, these questions should really include positional information, as well - "Is the eyebrow in the top left, and above the iris?", that kind of thing - but let‘s keep it simple. The network to answer the question "Is there an eye in the top left?" can now be decomposed:

 

技术分享

 

这些问题 too can be broken down, further and further through multiple layers. Ultimately, we‘ll be working with sub-networks that answer questions so simple they can easily be answered at the level of single pixels. Those questions might, for example, be about the presence or absence of very simple shapes at particular points in the image. Such questions can be answered by single neurons connected to the raw pixels in the image.

The end result is a network which breaks down a very complicated question - does this image show a face or not - into very simple questions answerable at the level of single pixels. It does this through a series of many layers, with early layers answering very simple and specific questions about the input image, and later layers building up a hierarchy of ever more complex and abstract concepts. Networks with this kind of many-layer structure - two or more hidden layers - are called deep neural networks.

 

 

当然,我没有说过怎样递归分解成子网络。 It certainly isn‘t practical to hand-design the weights and biases in the network. Instead, we‘d like to use learning algorithms so that the network can automatically learn the weights and biases - and thus, the hierarchy of concepts - from training data. Researchers in the 1980s and 1990s tried using stochastic gradient descent and backpropagation to train deep networks. Unfortunately, except for a few special architectures, they didn‘t have much luck. The networks would learn, but very slowly, and in practice often too slowly to be useful.

2006年以来,一系列可用户深度学习神经网络的新技术被开发出来。这些深度学习技术是基于随机梯度下降算法和反向传播算法的。但也引入了新的思想。 These techniques have enabled much deeper (and larger) networks to be trained - people now routinely train networks with 5 to 10 hidden layers. And, it turns out that these perform far better on many problems than shallow neural networks, i.e., networks with just a single hidden layer. The reason, of course, is the ability of deep nets to build up a complex hierarchy of concepts. It‘s a bit like the way conventional programming languages use modular design and ideas about abstraction to enable the creation of complex computer programs. Comparing a deep network to a shallow network is a bit like comparing a programming language with the ability to make function calls to a stripped down language with no ability to make such calls. Abstraction takes a different form in neural networks than it does in conventional programming, but it‘s just as important.

使用神经网络来识别手写数字【译(三)- 用Python代码实现