Explanation for "saddle" pattern

If you'll recall, we had a discussion in class yesterday about the "saddle" pattern that shows up when testing a range of input values (x and y each from 0 to 1) on the xor problem. There was some question as to why the saddle always ran like "/" rather than "\". Basically, the center range of x and y insisted on returning high values, even when Doug added training data for (0.5, 0.5) to return 0.

My intuition was that it had something to do with the calculations involved in adjusting the network's weights during back-propagation. Math isn't my forte (unfortunately, for a comp sci major!), but I'm pretty sure I've confirmed that the backprop algorithm is biased toward answers of 1 over 0. Let me know if there's a hole in my logic here.

I've written out a few examples and posted them online in case anyone would like to see them. For each of 4 examples, I compared two cases of error that were the same distance from the goal (desiredOutput) but in opposite directions. Basic logic (at least mine!) would suggest that, whichever direction you are from the goal, you would want to adjust by the same amount (with opposite sign) for comparable distances. But in all but one case, the weightAdjustment was very different for a negative error than for a positive error of the same absolute value (distance from the desiredOutput).

The only case where the weightAdjustments had the same absolute value was Example 2, where the actualOutput was 0.5 for each and the goals were 1 and 0: same distance (positive and negative) and same actualOutput. Why? Because for some reason the actualOutput is factored into the final weightAdjustment. This is what causes the bias.

I'm sure Doug will be able to explain why the algorithm is configured this way; there must be a good reason. And like I said, maybe there's a hole in my logic. Any thoughts?

Comments

DougBlank

Kathy,

Thanks for attempting to tackle this problem. Your logic is fine, but I see a problem with your analysis. In order to compute the adjustment of a weight between a unit m and a unit i, you need to know the activation at both nodes. The activation at unit m is needed to compute the error, and the activation at unit i is needed to compute the change in weight. In the weight update:

weightUpdate[m][i] = (EPSILON * delta[m] * actualOutput[i]) + (MOMENTUM * weightUpdate[m][i])

the actualOutput[i] is the activation from the previous layer. So, you won't be able to do your analysis without the activation from the previous layer. But you can test values in the range 0 - 1 to see what the weight change would be.
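
A minimal sketch of that kind of test, in Python. It assumes a sigmoid output unit, so that delta[m] = (desiredOutput - actualOutput[m]) * actualOutput[m] * (1 - actualOutput[m]); that form of delta, the function names, and the EPSILON and MOMENTUM values are assumptions for illustration and are not taken from the class code.

```python
EPSILON = 0.5    # learning rate (assumed value)
MOMENTUM = 0.9   # momentum term (assumed value)


def output_delta(desired, actual_m):
    """Error term for output unit m, assuming a sigmoid activation."""
    return (desired - actual_m) * actual_m * (1.0 - actual_m)


def weight_update(delta_m, actual_i, previous_update=0.0):
    """Weight change between unit i (previous layer) and unit m."""
    return (EPSILON * delta_m * actual_i) + (MOMENTUM * previous_update)


def sweep(desired, actual_m, steps=5):
    """Weight change for previous-layer activations spread over 0 to 1."""
    delta_m = output_delta(desired, actual_m)
    return [(k / steps, weight_update(delta_m, k / steps))
            for k in range(steps + 1)]


if __name__ == "__main__":
    # Same distance from the goal, opposite directions.
    for desired, actual_m in [(1.0, 0.8), (0.0, 0.2)]:
        print("desired", desired, "actual", actual_m)
        for actual_i, update in sweep(desired, actual_m):
            print("  actualOutput[i] =", actual_i, "update =", update)
```

Under these assumptions the weight change scales directly with actualOutput[i], and it is zero whenever the previous-layer activation is zero, so the same output error can produce very different adjustments depending on the layer below.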

I'll take a look too, and see if I see anything suspicious. If you have any more ideas, please let us know.

Kathy Maffei

I knew it was too obvious to be true...