# Understanding the sigmoid activation function as the last layer in a network

by thigi   Last Updated March 09, 2018 18:19 PM

I have two CNN versions which are distinguished by a sigmoid layer.

1. CNN | last two layers: `CONV` + `SIGMOID`
2. CNN | last layer: `CONV`

The range of my ground truth values is `[0,1]`.

The loss function I use is the `L2` loss.

When I train both networks, the second one outperforms the first by far.

For example:

Network 1 (with sigmoid):
1. At the beginning: loss = 230
2. After 3 epochs: loss = 23

Network 2 (without sigmoid):
1. At the beginning: loss = 18
2. After 100 iterations: loss = 4

I do not understand why the version with the `SIGMOID` never gets near the solution reached without the sigmoid. I have been reading up on this, and some people say that the `L2` loss does not go well with the `SIGMOID`, which can be proven mathematically. In the end, I would understand some difference in the loss, but the difference here is huge.
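One way to see the sigmoid/`L2` interaction mathematically is through the chain rule: the `L2` gradient gets multiplied by the sigmoid's derivative, which vanishes when the sigmoid saturates. Below is a minimal sketch of that effect (not code from the question; the function names are my own):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Gradient of the L2 loss (s(x) - y)^2 with respect to the
# pre-activation x, via the chain rule:
#   d/dx = 2 * (s(x) - y) * s(x) * (1 - s(x))
def l2_grad_through_sigmoid(x, y):
    s = sigmoid(x)
    return 2.0 * (s - y) * s * (1.0 - s)

# A saturated sigmoid (x = 10, so s(x) is almost 1) facing the worst
# possible target (y = 0) still yields a near-zero gradient, so
# learning stalls even though the prediction is maximally wrong:
saturated_grad = l2_grad_through_sigmoid(10.0, 0.0)

# A linear output with the same prediction error keeps the full gradient:
linear_grad = 2.0 * (1.0 - 0.0)
```

With a linear last layer the gradient stays proportional to the error, which is one candidate explanation for the much faster loss decrease of network 2.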


#### 3 Answers

• Were your images normalized?
• If your images were normalized between 0 and 1, then the sigmoid should give a smaller loss than what you are getting now.
• In fact, with normalized images you should get a loss lower than that of your model 2, whose last layer is `CONV`.
• Sigmoid squashes values between 0 and 1, and your normalized images are already between 0 and 1, so this will give you a smaller loss.
• With images that are not normalized, model 2 will have the lower loss, and that makes sense: convolutions may produce any number, with no upper or lower bound on their outputs.
• Say your ground truth pixel value is 255 and you want the network to predict 255. The sigmoid predicts a number close to 1 (that is the maximum it can output), so the MSE will be (255-1)^2 = 64516, while a `CONV` last layer can actually produce a number close to 255. Let's say it produces 250, so the MSE will be (250-255)^2 = 25.
• Normalizing images also makes learning faster and gives better results under different lighting conditions.
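The pixel-value arithmetic above can be checked in a few lines (a hypothetical single-pixel example, assuming unnormalized targets in `[0, 255]`):

```python
# Ground truth pixel value on the unnormalized [0, 255] scale:
target = 255.0

sigmoid_pred = 1.0   # a sigmoid output can never exceed 1
conv_pred = 250.0    # an unbounded conv output can get close to the target

sigmoid_err = (target - sigmoid_pred) ** 2   # (255 - 1)^2 = 64516
conv_err = (target - conv_pred) ** 2         # (250 - 255)^2 = 25

# Dividing the target by 255 first puts it inside the sigmoid's range,
# and the huge error disappears:
normalized_err = (target / 255.0 - sigmoid_pred) ** 2
```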
Jai
March 09, 2018 10:11 AM

I would guess that there are two things at work here. First, your initialization seems to perform worse for the sigmoid than for the linear output layer. Maybe your output is normalized around 0.5, which would be close to 1 for the sigmoid and already pretty good for your other network.

The second problem, in my opinion, is the learning rate. The gradient of the sigmoid s(x) is s(x)(1 - s(x)), which is at most 0.25 and therefore quite small compared to the constant gradient of 1 for a linear output. So by setting a higher learning rate, the loss should decrease in a similar fashion.
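The 0.25 bound on the sigmoid's gradient is easy to verify numerically (a quick sketch, not code from either network):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Derivative of the sigmoid: s(x) * (1 - s(x))
def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# Scan the derivative over [-5, 5]; it peaks at x = 0 with value 0.25,
# while a linear output layer has gradient 1 everywhere:
peak = max(sigmoid_grad(i / 100.0) for i in range(-500, 501))
```

A learning rate roughly 4x higher would compensate for the gradient shrinkage at the sweet spot, though saturation away from x = 0 shrinks it much further.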

In the end the result should be similar if you train until convergence.

Thomas Pinetz
March 09, 2018 11:54 AM

You have a bug in your code. With an `L2` loss function and both the ground truth and your net's outputs bounded in `[0,1]` (as in the case of a sigmoid), there is no way your loss is over 1.
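The bound is immediate: each squared-error term is at most (1 - 0)^2 = 1, so the mean over terms cannot exceed 1 either. A small sketch (hypothetical values, assuming a *mean* reduction; a *summed* loss over many pixels could of course exceed 1):

```python
# Worst-case and typical predictions/targets, all inside [0, 1]:
preds   = [0.0, 1.0, 0.3, 0.9]
targets = [1.0, 0.0, 0.5, 0.1]

# Mean squared error: every term is at most 1, so the mean is too.
mse = sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)
```

So a reported starting loss of 230 with a sigmoid output suggests either unnormalized targets or a summed (not averaged) loss.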

Lugi
March 09, 2018 17:59 PM