I have two versions of a CNN that differ only by a sigmoid layer on the output.
The output range of my ground truth values is [0, 255]. The loss function I use is the MSE (L2 loss).
When I train both networks, the second one (the one without the sigmoid) outperforms the first one by far.
For example:

1. At the beginning: loss = 230
2. After 3 epochs: loss = 23
I do not understand why the version with the sigmoid never gets anywhere near the solution of the version without it. I have been reading up on this, and some people say that the L2 loss does not go well with the sigmoid, which can be proven mathematically. In the end, I would understand some sort of difference in the loss, but the difference here is huge.
The CNN without the sigmoid can actually produce a number close to 255. Say it produces 250 for a target of 255: the MSE is (250 - 255)^2 = 25. The sigmoid version, by contrast, can never output more than 1, so against the same target its squared error is at least (1 - 255)^2 = 64516.
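Here is that arithmetic as a quick script (made-up numbers, assuming targets on a 0-255 scale):

```python
# Best case for the linear output layer: a prediction near the target.
target = 255.0
linear_out = 250.0
print((linear_out - target) ** 2)   # 25.0

# Best case for the sigmoid output layer: it can never exceed 1.
sigmoid_out = 1.0
print((sigmoid_out - target) ** 2)  # 64516.0
```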
I would guess that there are two things at work here. First, your initialization seems to work worse for the sigmoid than for the linear output layer. Maybe your outputs are normalized around 0.5, which the sigmoid pushes close to 1, while a raw 0.5 would already be pretty good for your other network.
The second problem is, in my opinion, the learning rate. The gradient of the sigmoid s(x) is s(x)(1 - s(x)), which is at most 0.25 and usually much smaller, compared to a constant 1 for a linear output. Therefore, by setting a correspondingly higher learning rate, the loss should decrease in a similar fashion.
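A minimal sketch of the gradient argument (my own illustration, not your code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Derivative of the sigmoid, s(x) * (1 - s(x)), at a few pre-activations.
xs = np.array([0.0, 0.5, 2.0, 5.0])
s = sigmoid(xs)
print(s * (1 - s))   # approx [0.25, 0.235, 0.105, 0.0066]

# A linear output has a local derivative of 1 everywhere, so the sigmoid
# shrinks the backpropagated gradient by a factor of 4 or (much) more,
# which a larger learning rate can partly compensate for.
```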
In the end, the results should be similar if you train until convergence.
You have a bug in your code. With an L2 loss function and both the ground truth and your net's outputs bounded to [0, 1] (as in the case of a sigmoid), there is no way your loss can be over 1.
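If your targets really are on a 0-255 scale, one quick sanity check is to normalize them to [0, 1] and verify the bound (a sketch with random NumPy data, not your actual pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground truth on a 0-255 scale, normalized for the sigmoid.
targets = rng.uniform(0.0, 255.0, size=1000) / 255.0
preds = rng.uniform(0.0, 1.0, size=1000)   # anything a sigmoid could emit

mse = np.mean((preds - targets) ** 2)
print(mse)            # always <= 1 when both sides live in [0, 1]
assert mse <= 1.0
```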