# A very simple explanation of artificial neural networks

2021-08-31 06:08:31 TechWeb

Introduction

I don't know machine learning, but last month on GitHub I found a minimalist, entry-level neural network tutorial whose sample code is written in Go. It is simple and easy to understand, explains each idea with a single line of formula, wastes no words, and I thoroughly enjoyed reading it.

Something this good deserves to be seen by more people, but the original is in English and could not be shared directly. So I first contacted the author for permission to translate, then Little Bear translated the project, and the result is the article you see here. The process was laborious and took a month. If you like it after reading, feel free to give it a thumbs up and share it with more people.

The content is divided into two parts:

Part 1: The simplest artificial neural network

Part 2: The most basic backpropagation algorithm

Artificial neural networks are the foundation of artificial intelligence. Only with a solid foundation can you work AI magic!

Reminder: there are many formulas, but they only look intimidating; with a little patience they are not hard to read. The main text starts below!

Part 1: The simplest artificial neural network

The simplest possible artificial neural network, explained and demonstrated through theory and code.

Theory

Simulated neurons

Inspired by the way the human brain works, artificial neural networks consist of interconnected simulated neurons that store patterns and communicate with each other. In its simplest form, a simulated neuron has one or more input values $x_1, \dots, x_n$, each with its own weight $w_1, \dots, w_n$, and one output value $y$.

In the simplest case, the output value is the sum of the input values multiplied by their weights:

$$y = \sum_{i=1}^{n} x_i w_i$$
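The weighted sum described above can be sketched in a few lines of Go. This is only an illustration: the function name `weightedSum` and the input and weight values are assumptions, not taken from the original project.

```go
package main

import "fmt"

// weightedSum returns the sum of each input multiplied by its weight.
// It assumes len(inputs) == len(weights).
func weightedSum(inputs, weights []float64) float64 {
	sum := 0.0
	for i, x := range inputs {
		sum += x * weights[i]
	}
	return sum
}

func main() {
	// Illustrative values only; they are not taken from the article.
	inputs := []float64{0.2, 0.4}
	weights := []float64{1.0, 1.0}
	fmt.Println(weightedSum(inputs, weights))
}
```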

A simple example

The purpose of the network is to approximate a complex function using many parameters, so that a given series of input values yields a specific output value; these parameters are usually hard for us to work out ourselves.

Suppose we have a network with two input values $x_1$ and $x_2$, which correspond to two weights $w_1$ and $w_2$.

We now need to adjust the weights so that the inputs produce our preset output value.

Because we do not know the optimal values in advance, weights are usually initialized randomly. For simplicity, we initialize both of them to 1 here.

In this case, what we get is

$$y = x_1 w_1 + x_2 w_2 = x_1 + x_2$$

Error value

If the output value $y$ does not match our expected target value, there is an error.

For example, if we want the target value to be $t$, then the difference is $y - t$.

We usually measure the error with the squared error (this is the cost function):

$$E = (y - t)^2$$

If there are multiple sets of input and output values, the error is the average of the squared errors of all sets:

$$E = \frac{1}{n} \sum_{i=1}^{n} (y_i - t_i)^2$$

We use the squared error to measure how far the output value is from our expected target value. Squaring removes the effect of negative deviations and emphasizes values that deviate greatly, whether positively or negatively.
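As a small sketch of this error measure, the Go snippet below computes the squared error for one pair and its average over several pairs. The function names and sample numbers are assumptions chosen for illustration.

```go
package main

import "fmt"

// squaredError returns (y - t)^2 for one output/target pair.
func squaredError(y, t float64) float64 {
	d := y - t
	return d * d
}

// meanSquaredError averages the squared error over all pairs.
// It assumes outputs and targets have the same non-zero length.
func meanSquaredError(outputs, targets []float64) float64 {
	total := 0.0
	for i := range outputs {
		total += squaredError(outputs[i], targets[i])
	}
	return total / float64(len(outputs))
}

func main() {
	// Deviations of +0.5 and -0.5 contribute the same amount.
	fmt.Println(squaredError(1.5, 1.0), squaredError(0.5, 1.0))
	// The large deviation (3.0 vs 1.0) dominates the average.
	fmt.Println(meanSquaredError([]float64{1.5, 0.5, 3.0}, []float64{1.0, 1.0, 1.0}))
}
```

Squaring makes the two small deviations equal in weight and lets the single large deviation dominate the mean, which is the behavior the paragraph above describes.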

To correct the error, we need to adjust the weights so that the result approaches our target value. In our example, lowering one of the weights from 1.0 to 0.5 is enough to reach the target, because the weighted sum then equals the target value.

However, neural networks often involve many different input and output values, in which case we need a learning algorithm to carry out this step for us automatically.

We now need to use the error to find which weights should be adjusted, and by how much, so that the error is minimized. Before that, let's look at the concept of a gradient.

A gradient is essentially a vector pointing in the direction of the steepest slope of a function. We denote the gradient by $\nabla$; in short, it is the vector of the partial derivatives of the function with respect to its variables.

For a function of two variables, it takes the following form:

$$\nabla f(x, y) = \left[ \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right]$$

For a concrete function $f(x, y)$, each component of the gradient is obtained by differentiating $f$ with respect to one variable while treating the other as a constant.
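To make the idea of a gradient concrete, here is a minimal Go sketch. The function $f(x, y) = 2x^2 + 3y^3$ is a hypothetical example chosen for illustration, and `numericGrad` is an assumed helper that checks the analytic gradient $[4x, 9y^2]$ against central finite differences.

```go
package main

import "fmt"

// f is a hypothetical two-variable function chosen only for illustration.
func f(x, y float64) float64 {
	return 2*x*x + 3*y*y*y
}

// gradF is the analytic gradient of f: [df/dx, df/dy] = [4x, 9y^2].
func gradF(x, y float64) [2]float64 {
	return [2]float64{4 * x, 9 * y * y}
}

// numericGrad approximates the gradient with central differences.
func numericGrad(x, y, h float64) [2]float64 {
	return [2]float64{
		(f(x+h, y) - f(x-h, y)) / (2 * h),
		(f(x, y+h) - f(x, y-h)) / (2 * h),
	}
}

func main() {
	fmt.Println(gradF(1, 2))
	fmt.Println(numericGrad(1, 2, 1e-5))
}
```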

Gradient descent can be understood simply as using the gradient to find the direction of steepest slope of our function, and then repeatedly taking small steps in the opposite direction, in order to find the weights at which the error function reaches its global (or sometimes local) minimum.

We use a constant called the learning rate to represent the size of this small step in the opposite direction; in formulas we denote it by $\epsilon$.

If $\epsilon$ is too large, we may overshoot the minimum; if it is too small, the network takes longer to learn and may also get stuck in a shallow local minimum.
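The effect of the learning rate can be sketched on a toy problem. The error function $E(w) = w^2$, the starting point and the three learning rates below are all assumptions chosen for illustration. Because this function has a single minimum, the sketch shows only slow progress and overshooting, not local minima.

```go
package main

import "fmt"

// descend runs `steps` iterations of gradient descent on E(w) = w^2,
// whose derivative is 2w, starting from w0 with learning rate eps.
func descend(w0, eps float64, steps int) float64 {
	w := w0
	for i := 0; i < steps; i++ {
		w -= eps * 2 * w
	}
	return w
}

func main() {
	// Too small, reasonable, and too large learning rates.
	for _, eps := range []float64{0.001, 0.1, 1.1} {
		fmt.Printf("eps=%v -> w=%v\n", eps, descend(1.0, eps, 20))
	}
}
```

With the smallest rate the weight barely moves in 20 steps, with the middle rate it gets close to the minimum at 0, and with the largest rate every step overshoots and the weight moves further away.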

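For the two weights in our example, $w_1$ and $w_2$, we need the gradient of the error function with respect to these two weights:

$$\nabla_w E = \left[ \frac{\partial E}{\partial w_1}, \frac{\partial E}{\partial w_2} \right]$$

Remember our formulas $E = (y - t)^2$ and $y = x_1 w_1 + x_2 w_2$? We can substitute the second into the first and compute the partial derivative for $w_1$ and $w_2$ separately, using the chain rule from calculus:

$$\frac{\partial E}{\partial w_i} = 2 (x_1 w_1 + x_2 w_2 - t) \, x_i = 2 (y - t) \, x_i$$

For brevity, from here on we use the term $\delta$ to mean $2 (y - t)$.

Once we have the gradient, we plug in the learning rate $\epsilon$ and update each weight as follows:

$$w_i \leftarrow w_i - \epsilon \, \delta \, x_i$$

Then we repeat the process until the error is minimized and approaches zero.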
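The whole procedure for a single neuron with two weights can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `train`, the inputs, the target, the learning rate and the iteration count are assumptions, and the gradient $2 (y - t) x_i$ follows from the squared error $E = (y - t)^2$.

```go
package main

import "fmt"

// train adjusts two weights so that x1*w1 + x2*w2 approaches target t,
// using the gradient dE/dw_i = 2*(y - t)*x_i of the squared error E = (y - t)^2.
func train(x, w [2]float64, t, eps float64, iterations int) [2]float64 {
	for n := 0; n < iterations; n++ {
		y := x[0]*w[0] + x[1]*w[1]
		delta := 2 * (y - t)
		for i := range w {
			w[i] -= eps * delta * x[i] // step against the gradient
		}
	}
	return w
}

func main() {
	// Inputs, target and learning rate are illustrative assumptions.
	x := [2]float64{0.2, 0.4}
	w := train(x, [2]float64{1, 1}, 0.5, 0.5, 100)
	fmt.Println(w, x[0]*w[0]+x[1]*w[1])
}
```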

Code example

The accompanying example uses gradient descent to train a neural network with two input values and one output value on a small data set.

Once training succeeds, the network outputs ~0 when the inputs are two 1s, and ~1 when the inputs are 1 and 0.

How to run?

Go

    PS D:\github\ai-simplest-network-master\src> go build -o bin/test.exe
    PS D:\github\ai-simplest-network-master\bin> ./test.exe

    err:  1.7930306267024234
    err:  1.1763080417089242
    ...
    err:  0.00011642621631266815
    err:  0.00010770190838306002
    err:  9.963134967988221e-05
    Finished after 111 iterations

    Results ----------------------
    [1 1] => [0.007421243532258703]
    [1 0] => [0.9879921757260246]

Docker

    docker build -t simplest-network .
    docker run --rm simplest-network

Part 2: The most basic backpropagation algorithm

Backpropagation (abbreviated BP), short for "backward propagation of errors", is a common method for training artificial neural networks, used in combination with an optimization method such as gradient descent.

Backpropagation can be used to train neural networks with at least one hidden layer. Let's start from the theory and combine it with code to master the backpropagation algorithm.

Theory

Introduction to the perceptron

A perceptron is a processing unit that takes an input $x$, transforms it with an activation function $f$, and outputs the result $y = f(x)$.

In a neural network, the input value of a node is the weighted sum of the output values of the nodes in the previous layer, plus the bias of the previous layer:

$$x_j = \sum_{i=1}^{I} y_i w_{ij} + b_i$$

If we treat the bias as one more node in the layer, with a constant value of -1, we can simplify the formula to

$$x_j = \sum_{i=1}^{I+1} y_i w_{ij}$$

Activation function

Why do we need activation functions? Without them, the output of every node would be linear, so the whole neural network would compute a linear function of its input values. Because a composition of linear functions is still linear, we must introduce nonlinear functions to make the neural network different from a linear regression model.
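The claim that stacked linear layers collapse into a single linear layer can be checked numerically. The sketch below is an illustration only: the function names and all weight and input values are assumptions.

```go
package main

import "fmt"

// linearLayer computes a single weighted sum with no activation function.
func linearLayer(inputs, weights []float64) float64 {
	sum := 0.0
	for i, x := range inputs {
		sum += x * weights[i]
	}
	return sum
}

// twoLayerLinear feeds the inputs through two hidden linear nodes and
// then one linear output node.
func twoLayerLinear(x []float64, hidden [2][]float64, out []float64) float64 {
	h := []float64{linearLayer(x, hidden[0]), linearLayer(x, hidden[1])}
	return linearLayer(h, out)
}

// collapse folds both layers into one equivalent weight vector.
func collapse(hidden [2][]float64, out []float64) []float64 {
	merged := make([]float64, len(hidden[0]))
	for i := range merged {
		merged[i] = hidden[0][i]*out[0] + hidden[1][i]*out[1]
	}
	return merged
}

func main() {
	// All weights and inputs are arbitrary illustrative values.
	hidden := [2][]float64{{0.5, -1.0}, {2.0, 0.25}}
	out := []float64{1.5, -0.5}
	x := []float64{3.0, 4.0}
	fmt.Println(twoLayerLinear(x, hidden, out), linearLayer(x, collapse(hidden, out)))
}
```

Both printed values agree, so without a nonlinear activation the extra layer adds no expressive power.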

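For an input $x$, typical activation functions take the following forms:

Sigmoid function:

$$f(x) = \frac{1}{1 + e^{-x}}$$

Rectified linear unit (ReLU):

$$f(x) = \max(0, x)$$

tanh function:

$$f(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$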
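These three activation functions are short to write in Go using the standard `math` package. The function names below are assumptions chosen for this sketch.

```go
package main

import (
	"fmt"
	"math"
)

// sigmoid squashes any real input into the range (0, 1).
func sigmoid(x float64) float64 { return 1 / (1 + math.Exp(-x)) }

// relu passes positive inputs through and clamps negative inputs to 0.
func relu(x float64) float64 { return math.Max(0, x) }

// tanh squashes any real input into the range (-1, 1).
func tanh(x float64) float64 { return math.Tanh(x) }

func main() {
	for _, x := range []float64{-2, 0, 2} {
		fmt.Printf("x=%v sigmoid=%.3f relu=%.3f tanh=%.3f\n", x, sigmoid(x), relu(x), tanh(x))
	}
}
```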

Backpropagation

The backpropagation algorithm can be used to train artificial neural networks, and is especially suited to networks with more than two layers.

The principle is to use a forward pass to compute the network output and the error, and then to update the weights backwards, toward the input layer, according to the gradient of the error.

Terminology

$x_i, x_j, x_k$ are the input values of nodes in layers I, J and K, respectively.

$y_i, y_j, y_k$ are the output values of nodes in layers I, J and K, respectively.

$\hat{y}_k$ is the expected output value of output node K.

$w_{ij}, w_{jk}$ are the weights from layer I to layer J and from layer J to layer K, respectively.

$t$ denotes the current association out of $T$ associations.

In the following example, we use these activation functions for the nodes of each layer:

Input layer -> identity function

Hidden layer -> sigmoid function

Output layer -> identity function

The forward pass

In the forward pass, we feed values into the input layer and obtain the result at the output layer.

The input of each hidden layer node is the weighted sum of the input values of the input layer:

$$x_j = \sum_{i=1}^{I} x_i w_{ij}$$

Because the activation function of the hidden layer is the sigmoid, its output is

$$y_j = f(x_j) = \frac{1}{1 + e^{-x_j}}$$

Likewise, the input value of the output layer is

$$x_k = \sum_{j=1}^{J} y_j w_{jk}$$

Because we chose the identity function as the activation function, the output of this layer equals its input value:

$$y_k = x_k$$
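A forward pass through a network with an identity input layer, one sigmoid hidden layer and a single identity output node might look like the sketch below. The function name `forward`, the weight layout and the 2x2x1 sizes used in `main` are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"math"
)

func sigmoid(x float64) float64 { return 1 / (1 + math.Exp(-x)) }

// forward propagates inputs through one sigmoid hidden layer and an
// identity output node. wIJ[i][j] connects input i to hidden node j and
// wJK[j] connects hidden node j to the single output node.
func forward(inputs []float64, wIJ [][]float64, wJK []float64) (yJ []float64, yK float64) {
	yJ = make([]float64, len(wJK))
	for j := range yJ {
		xJ := 0.0
		for i, yI := range inputs { // identity input layer: y_i = x_i
			xJ += yI * wIJ[i][j]
		}
		yJ[j] = sigmoid(xJ)
	}
	xK := 0.0
	for j, y := range yJ {
		xK += y * wJK[j]
	}
	return yJ, xK // identity output layer: y_k = x_k
}

func main() {
	// A 2x2x1 layout with every weight set to 0.5 is assumed here.
	wIJ := [][]float64{{0.5, 0.5}, {0.5, 0.5}}
	wJK := []float64{0.5, 0.5}
	yJ, yK := forward([]float64{1.0, 1.0}, wIJ, wJK)
	fmt.Printf("hidden=%.3f output=%.3f\n", yJ, yK)
}
```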

Once the input values have propagated through the network, we can calculate the error. Remember the squared error we learned in Part 1? If there are multiple associations, we can use the mean squared error to calculate the error (the factor $\frac{1}{2}$ is a convention that cancels when differentiating):

$$E = \frac{1}{2T} \sum_{t=1}^{T} (y_{kt} - \hat{y}_{kt})^2$$

The backward pass

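Now that we have the error, we can propagate it backwards and use it to correct the weights of the network.

From Part 1 we know that a weight is adjusted by the partial derivative of the error with respect to that weight, multiplied by the learning rate $\epsilon$, in the following form:

$$\Delta w = -\epsilon \frac{\partial E}{\partial w}$$

We compute the error gradient with the chain rule. For a single association with error $E = \frac{1}{2} (y_k - \hat{y}_k)^2$ and a weight between the hidden layer and the output layer:

$$\frac{\partial E}{\partial w_{jk}} = \frac{\partial E}{\partial y_k} \frac{\partial y_k}{\partial x_k} \frac{\partial x_k}{\partial w_{jk}} = (y_k - \hat{y}_k) \cdot 1 \cdot y_j = \delta_k \, y_j, \quad \text{where } \delta_k = y_k - \hat{y}_k$$

Therefore, the weight adjustment is

$$\Delta w_{jk} = -\epsilon \, \delta_k \, y_j$$

For multiple associations, the weight adjustment is the sum of the adjustments for each association (a constant factor such as $\frac{1}{T}$ can be absorbed into $\epsilon$):

$$\Delta w_{jk} = -\epsilon \sum_{t=1}^{T} \delta_{kt} \, y_{jt}$$

Similarly, for the weights feeding a hidden layer, continuing the example above, the weight adjustment between the input layer and the first hidden layer is

$$\Delta w_{ij} = -\epsilon \frac{\partial E}{\partial w_{ij}} = -\epsilon \, \delta_j \, y_i, \quad \text{where } \delta_j = \frac{\partial E}{\partial y_j} \frac{\partial y_j}{\partial x_j}$$

The weight adjustment over all associations is then the sum of the adjustments computed for each association:

$$\Delta w_{ij} = -\epsilon \sum_{t=1}^{T} \delta_{jt} \, y_{it}$$

Calculating $\delta_j$

Here we can go one step further. Above, we saw that $\delta_j = \frac{\partial E}{\partial y_j} \frac{\partial y_j}{\partial x_j}$.

For the first factor, we have

$$\frac{\partial E}{\partial y_j} = \sum_{k=1}^{K} \frac{\partial E}{\partial x_k} \frac{\partial x_k}{\partial y_j} = \sum_{k=1}^{K} \delta_k w_{jk}$$

For the second factor, because we use the sigmoid function, whose derivative is $f'(x) = f(x)(1 - f(x))$, we have

$$\frac{\partial y_j}{\partial x_j} = f'(x_j) = f(x_j) (1 - f(x_j))$$

Putting these together, $\delta_j$ is computed as

$$\delta_j = f(x_j) (1 - f(x_j)) \sum_{k=1}^{K} \delta_k w_{jk}$$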
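The sigmoid derivative identity $f'(x) = f(x)(1 - f(x))$ and the resulting hidden-node term $\delta_j = f'(x_j) \sum_k \delta_k w_{jk}$ can be sketched in Go as below. The helper names and the sample values are assumptions; the finite-difference check in `main` is there only to show that the identity holds numerically.

```go
package main

import (
	"fmt"
	"math"
)

func sigmoid(x float64) float64 { return 1 / (1 + math.Exp(-x)) }

// sigmoidPrime uses the identity f'(x) = f(x) * (1 - f(x)).
func sigmoidPrime(x float64) float64 {
	s := sigmoid(x)
	return s * (1 - s)
}

// hiddenDelta computes delta_j = f'(x_j) * sum_k(delta_k * w_jk) for one
// hidden node, given the deltas of the output nodes it feeds.
func hiddenDelta(xJ float64, deltaK, wJK []float64) float64 {
	sum := 0.0
	for k, d := range deltaK {
		sum += d * wJK[k]
	}
	return sigmoidPrime(xJ) * sum
}

func main() {
	// Compare the identity with a central finite difference at x = 1.
	h := 1e-6
	numeric := (sigmoid(1+h) - sigmoid(1-h)) / (2 * h)
	fmt.Println(sigmoidPrime(1), numeric)
	// Illustrative values: one output node with delta 0.2 and weight 0.5.
	fmt.Println(hiddenDelta(1, []float64{0.2}, []float64{0.5}))
}
```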

Algorithm summary

First, assign small random values to the network weights.

Repeat the following steps until the error is minimized:

For each association, make a forward pass through the neural network to get the output values

Calculate the error of each output node ($\delta_k = y_k - \hat{y}_k$)

Accumulate the gradient of each output weight ($\nabla_{w_{jk}} E = \delta_k \, y_j$)

Calculate $\delta$ for each node in the hidden layer ($\delta_j = f'(x_j) \sum_{k} \delta_k w_{jk}$)

Accumulate the gradient of each hidden layer weight ($\nabla_{w_{ij}} E = \delta_j \, y_i$)

Update all weights and reset the accumulated gradients ($w \leftarrow w - \epsilon \nabla_w E$)
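The steps summarized above can be sketched as one training epoch over all associations. This is a minimal illustration under several assumptions, not the original project's code: a fixed 2x2x1 network without bias nodes, the per-association error $\frac{1}{2}(y_k - \hat{y}_k)^2$, fixed starting weights standing in for small random values, and a two-row data set chosen only for the demo.

```go
package main

import (
	"fmt"
	"math"
)

func sigmoid(x float64) float64 { return 1 / (1 + math.Exp(-x)) }

// Network is a minimal 2-input, 2-hidden, 1-output net without bias nodes.
type Network struct {
	WIJ [2][2]float64 // WIJ[i][j]: input i -> hidden j
	WJK [2]float64    // WJK[j]: hidden j -> output
}

// forward returns the hidden outputs y_j and the network output y_k.
func (n *Network) forward(x [2]float64) (yJ [2]float64, yK float64) {
	for j := range yJ {
		yJ[j] = sigmoid(x[0]*n.WIJ[0][j] + x[1]*n.WIJ[1][j])
		yK += yJ[j] * n.WJK[j]
	}
	return yJ, yK
}

// epoch performs one pass of the summarized steps over all associations
// and returns the mean of 0.5*(y_k - target)^2 measured before the update.
func (n *Network) epoch(xs [][2]float64, ts []float64, eps float64) float64 {
	var gradIJ [2][2]float64 // fresh accumulators act as the "reset" step
	var gradJK [2]float64
	errSum := 0.0
	for t, x := range xs {
		yJ, yK := n.forward(x)
		deltaK := yK - ts[t] // output delta (identity activation)
		errSum += 0.5 * deltaK * deltaK
		for j := range yJ {
			gradJK[j] += deltaK * yJ[j]
			deltaJ := yJ[j] * (1 - yJ[j]) * deltaK * n.WJK[j]
			for i := range x {
				gradIJ[i][j] += deltaJ * x[i]
			}
		}
	}
	for j := range gradJK {
		n.WJK[j] -= eps * gradJK[j]
		for i := range gradIJ {
			n.WIJ[i][j] -= eps * gradIJ[i][j]
		}
	}
	return errSum / float64(len(xs))
}

func main() {
	// Fixed, arbitrary starting weights stand in for small random values.
	net := &Network{WIJ: [2][2]float64{{0.1, -0.2}, {0.3, 0.4}}, WJK: [2]float64{0.2, -0.1}}
	xs := [][2]float64{{1, 1}, {1, 0}}
	ts := []float64{0, 1}
	for i := 0; i < 2000; i++ {
		if e := net.epoch(xs, ts, 0.1); i%500 == 0 {
			fmt.Printf("epoch %d err %.6f\n", i, e)
		}
	}
}
```

Gradients are accumulated across all associations before any weight changes, so every association in an epoch is evaluated against the same weights.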

Backpropagation illustrated

In this example, we use concrete numbers to walk through every step in the neural network. The input values are [1.0, 1.0] and the expected output value is [0.5]. To keep things simple, we initialize every weight to 0.5 (although in practice random values are normally used). For the input, hidden and output layers we use the identity function, the sigmoid function and the identity function as activation functions, respectively, and the learning rate is 0.01.

Forward pass

At the start, we set the input values of the input layer nodes to $x_i = [1.0, 1.0]$.

Because the input layer uses the identity function as its activation function, $y_i = x_i = [1.0, 1.0]$.

Next, we propagate forward to layer J by taking the weighted sum of the previous layer:

$$x_j = \sum_{i} y_i w_{ij} = 1.0 \times 0.5 + 1.0 \times 0.5 = 1.0$$

Then we pass the input value of each layer J node through the sigmoid function to activate it (substituting $x_j = 1.0$ into $f(x) = \frac{1}{1 + e^{-x}}$ gives 0.731).

Finally, we pass this result on to the output layer:

$$x_k = \sum_{j} y_j w_{jk}$$

Because the activation function of the output layer is also the identity function, $y_k = x_k$.

Backward pass

The first step of backpropagation is to compute $\delta$ for the output node:

$$\delta_k = y_k - \hat{y}_k$$

From this we compute the weight gradient between the nodes of layers J and K:

$$\nabla_{w_{jk}} E = \delta_k \, y_j$$

Next, we compute $\delta$ for each hidden layer in the same way (in this example there is only one hidden layer):

$$\delta_j = f'(x_j) \sum_{k} \delta_k w_{jk}$$

The gradient for the weights between the nodes of layers I and J is:

$$\nabla_{w_{ij}} E = \delta_j \, y_i$$

The last step is to update all weights with the computed gradients. Note that if we had more than one association, we would accumulate the gradients over every association before updating the weights.

$$w \leftarrow w - \epsilon \nabla_w E$$

You can see that the weights change very little, but if we run the forward pass again with these new weights, we will generally get a smaller error than before. Let's check.

Comparing the output $y_k$ of the first forward pass with the output computed using the new weights, the difference $y_k - \hat{y}_k$ is smaller the second time.

So the error has been reduced! The reduction is small, but it is representative of a real scenario. By repeating the operation according to the algorithm, the error can generally be driven toward 0, at which point the training of the neural network is complete.
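The walkthrough can be replayed numerically with the values the text gives: inputs [1.0, 1.0], expected output 0.5, all weights 0.5, learning rate 0.01. The sketch below additionally assumes a 2x2x1 topology and the per-association error $\frac{1}{2}(y_k - \hat{y}_k)^2$, neither of which is stated in this section. Because all inputs and weights are equal, one number per layer is enough to represent the weights.

```go
package main

import (
	"fmt"
	"math"
)

func sigmoid(x float64) float64 { return 1 / (1 + math.Exp(-x)) }

// pass runs one forward pass through a symmetric 2x2x1 network in which
// all input->hidden weights equal wIJ and all hidden->output weights equal
// wJK. It returns the hidden output y_j and the network output y_k.
func pass(x, wIJ, wJK float64) (yJ, yK float64) {
	yJ = sigmoid(x*wIJ + x*wIJ) // two inputs feed each hidden node
	yK = yJ*wJK + yJ*wJK        // two hidden nodes feed the output node
	return yJ, yK
}

// step performs one backward pass and returns the updated weights.
func step(x, target, wIJ, wJK, eps float64) (float64, float64) {
	yJ, yK := pass(x, wIJ, wJK)
	deltaK := yK - target                  // output delta (identity activation)
	gradJK := deltaK * yJ                  // gradient of each hidden->output weight
	deltaJ := yJ * (1 - yJ) * deltaK * wJK // hidden delta via the sigmoid derivative
	gradIJ := deltaJ * x                   // gradient of each input->hidden weight
	return wIJ - eps*gradIJ, wJK - eps*gradJK
}

func main() {
	x, target, eps := 1.0, 0.5, 0.01
	wIJ, wJK := 0.5, 0.5
	_, before := pass(x, wIJ, wJK)
	wIJ, wJK = step(x, target, wIJ, wJK, eps)
	_, after := pass(x, wIJ, wJK)
	fmt.Printf("before=%.4f after=%.4f\n", before, after)
}
```

Under these assumptions the second output is slightly closer to the target than the first, matching the small reduction in error described above.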

Code example

In this example, a 2×2×1 network is trained to reproduce the XOR operator.

Here, $f$ is the sigmoid activation function of the hidden layer.

Note that the XOR operator cannot be modeled by the linear network from Part 1, because the data set is not linearly separable. That is, no single straight line can correctly divide the four XOR input values into two classes. If we replaced the sigmoid function with the identity function, this network would not work either.
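The limitation of a linear model on XOR can be observed directly. The sketch below fits $y = w_1 x_1 + w_2 x_2 + b$ to the XOR truth table with gradient descent on the mean squared error. The function name, the bias term $b$, the learning rate and the iteration count are assumptions for illustration; the bias is included to give the linear model its best chance.

```go
package main

import "fmt"

// fitLinear trains y = w1*x1 + w2*x2 + b on the XOR truth table with
// gradient descent on the mean squared error and returns its predictions.
func fitLinear(eps float64, iterations int) [4]float64 {
	xs := [4][2]float64{{0, 0}, {0, 1}, {1, 0}, {1, 1}}
	ts := [4]float64{0, 1, 1, 0}
	var w1, w2, b float64
	for n := 0; n < iterations; n++ {
		var g1, g2, gb float64
		for i, x := range xs {
			// Derivative of the mean squared error w.r.t. the prediction.
			d := 2 * (w1*x[0] + w2*x[1] + b - ts[i]) / 4
			g1 += d * x[0]
			g2 += d * x[1]
			gb += d
		}
		w1, w2, b = w1-eps*g1, w2-eps*g2, b-eps*gb
	}
	var out [4]float64
	for i, x := range xs {
		out[i] = w1*x[0] + w2*x[1] + b
	}
	return out
}

func main() {
	fmt.Printf("%.3f\n", fitLinear(0.1, 5000))
}
```

The best linear fit predicts the same value for all four inputs, so it cannot separate the two XOR classes no matter how long it trains.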

That's enough talk; now it's your turn to try it yourself! Experiment with different activation functions, learning rates and network topologies, and see how they perform.

Thanks to the original author for the authorization: