
Potential bug in NegativeLogLikelihood impl #150

Open · hweom opened this issue Oct 25, 2021 · 9 comments · May be fixed by #151

Labels: documentation · question

hweom commented Oct 25, 2021

I just started exploring deep learning and chose Juice as my starting framework, since I want to stick with Rust. I'm pretty new to the domain, so this might just be my mistake.

I was looking at NegativeLogLikelihood::compute_output() and I think there is a bug. Instead of

        for &label_value in native_labels {
            // Always indexes into the first sample's probabilities:
            // `label_value` is never offset by the batch position.
            let probability_value = native_probabilities[label_value as usize];
            writable_loss.push(-probability_value);
        }

it should be

        let mut offset = 0;
        for &label_value in native_labels {
            // Index into the current sample's row of class probabilities.
            let probability_value = native_probabilities[offset + label_value as usize];
            writable_loss.push(-probability_value);
            // Advance to the next sample's row.
            offset += self.num_classes;
        }

Otherwise aren't we comparing every label in the batch against the first sample's output?
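
To make the indexing concrete, here is a minimal standalone sketch (my own illustration, not the actual Juice buffers; the row-major layout and num_classes = 3 are assumptions) of both variants:

    // A batch of 2 samples with 3 classes each, probabilities flattened
    // row-major: [sample 0 | sample 1].
    fn main() {
        let native_probabilities = [0.1f32, 0.7, 0.2, 0.6, 0.3, 0.1];
        let native_labels = [1u32, 0]; // correct class per sample
        let num_classes = 3;

        // Current code: every label indexes into sample 0's row.
        let buggy: Vec<f32> = native_labels
            .iter()
            .map(|&label| -native_probabilities[label as usize])
            .collect();
        assert_eq!(buggy, vec![-0.7, -0.1]); // second value read from sample 0!

        // Proposed fix: offset by num_classes for each sample.
        let fixed: Vec<f32> = native_labels
            .iter()
            .enumerate()
            .map(|(batch_n, &label)| -native_probabilities[batch_n * num_classes + label as usize])
            .collect();
        assert_eq!(fixed, vec![-0.7, -0.6]); // each sample reads its own row
    }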

Interestingly, I tried changing it and there was no noticeable effect on the MNIST example...

drahnr commented Oct 29, 2021

@hweom sorry for the long delay. I see your point and would appreciate a PR for this issue with a testcase; happy to review. Practically this is not destructive: all values are summed in the following few lines, so some values just get more weight while others are omitted.

hweom commented Oct 30, 2021

Thanks for the answer. I'm still trying to wrap my head around the code. I think I understand why changing this part had no effect on the MNIST example -- since the NLL layer is the last one, its output is not used anywhere.

Now I have a different question :) The definition of NLL that I found gives the formula

y = -ln(x)

(here y is the output of the NLL layer and x is the input for the correct class).

However, NegativeLogLikelihood::compute_output() does this:

y = -x
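
As a concrete worked example (my own numbers): for a correct-class input x = 0.7, the definition gives y = -ln(0.7) ≈ 0.357, while the implementation returns y = -0.7.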

And similarly, the input gradient in NegativeLogLikelihood::compute_input_gradient() is computed like this:

// For each sample, set the gradient of the correct class to the constant -1.
for (batch_n, &label_value) in native_labels.iter().enumerate() {
    let index = (num_classes * batch_n) + label_value as usize;
    writable_gradient[index] = -1f32;
}

which is essentially ∂y/∂x = -1. This matches the way the output is computed, but doesn't quite match the definition of NLL.
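
Writing out the true gradient for comparison (my own worked step): y = -ln(x) gives ∂y/∂x = -1/x, which agrees with the implemented constant -1 only at x = 1.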

FWIW, I tried changing compute_input_gradient() to compute the real NLL gradient with

writable_gradient[index] = -1.0 / native_probabilities[index].max(0.001); // true NLL gradient -1/x, clamped to avoid blow-up near zero

(I didn't change compute_output() since it's not used, as mentioned above), but it actually caused training to stop converging.

Should the name of the layer be changed?

drahnr commented Oct 31, 2021

That is correct. The implementation is pretty old and was done by the original authors of the predecessor of Juice.

The only explanation I have is when looking at the expansion of ln(x), which is

ln(x) ~= -x + (1/2)x^2 - (1/3)x^3 + (1/4)x^4 - ...

Now if you terminate after the first term, you get the above. This saves a bunch of computation -- ln is still rather slow. Two or three terms wouldn't hurt, but they would come at a quite significant additional cost: three more multiplications, a lock call, and an array lookup in the gradient compute.

I added a commit that would improve the approximation up to x^3 in #151

hweom commented Oct 31, 2021

Is this a Taylor expansion of ln(x)? Around which point? It doesn't look like one: example.

I'm probably missing something obvious...

drahnr commented Oct 31, 2021

https://math.stackexchange.com/questions/585154/taylor-series-for-logx says almost the above; an offset of -1 or +1, respectively, is missing, so I guess that's another issue.

hweom commented Oct 31, 2021

You mean the most upvoted answer (https://math.stackexchange.com/a/585158)? Note that it has a formula for -ln(1-x), not ln(x).

drahnr commented Oct 31, 2021

That's what I meant in my previous comment.

hweom commented Oct 31, 2021

Well, if we take the Taylor expansion suggested by Wolfram,

ln(x) = (x - 1) - (1/2)(x - 1)^2 + (1/3)(x - 1)^3 - ...

negate it, truncate after three terms, and simplify, we get this: https://www.wolframalpha.com/input/?i=expand+-%28%28x-1%29+-+1%2F2%28x-1%29%5E2+%2B+1%2F3%28x-1%29%5E3%29

Or we can just add the offset and keep the math cleaner.
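
For what it's worth, a quick standalone check of that three-term expansion against f32::ln (my own sketch; neg_ln_approx is a hypothetical helper, not code from #151):

    // Three-term Taylor approximation of -ln(x) around x = 1:
    // -((x - 1) - (1/2)(x - 1)^2 + (1/3)(x - 1)^3)
    fn neg_ln_approx(x: f32) -> f32 {
        let t = x - 1.0;
        -(t - 0.5 * t * t + t * t * t / 3.0)
    }

    fn main() {
        for &x in &[0.5f32, 0.9, 1.0, 1.1] {
            println!("x = {x:.1}: -ln(x) = {:+.4}, approx = {:+.4}", -x.ln(), neg_ln_approx(x));
        }
    }

The polynomial is tight near x = 1 but stays bounded as x approaches 0, while the true -ln(x) blows up there -- exactly the region where a confidently wrong prediction should be penalized.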

drahnr commented Nov 29, 2021

I think the offset is not really relevant for learning.
