
How can you continue training a model from a file? #188

Closed
opfromthestart opened this issue Jan 31, 2023 · 15 comments · Fixed by #193 · May be fixed by #190
Labels: documentation, question

Comments

@opfromthestart
Contributor

**Question:**
I am not sure if this is a bug or just something I am doing wrong, but when I save the model to a file and then reload it as a layer, it does not have a loss similar to the saved version of the model, and it also trains much more slowly. Is there a setting needed to tell it how to continue? Do I need to save additional data to resume properly?
The project can be found here.
Lines 231–238 and 296–299 are the parts relevant to this question. If I train from scratch, the loss drops into the 80s after 20,000 iterations, but when I reload, it starts in the 130s and does not decrease significantly over 20,000 iterations.

@opfromthestart opfromthestart added the documentation and question labels on Jan 31, 2023
@drahnr
Member

drahnr commented Jan 31, 2023

Which GPU do you have, and could you provide your save files? It looks very fishy to me.

@drahnr
Member

drahnr commented Jan 31, 2023

warning: `logic-ai` (bin "logic-ai") generated 1 warning
    Finished dev [unoptimized + debuginfo] target(s) in 50.09s
     Running `../cargo_target/debug/logic-ai`
2 3 0 1 
3 0 1 2 
0 1 2 3 
1 2 3 0 

Did not load
0: 95.65555, 0.008372713
thread 'main' panicked at 'Could not write to file: Os { code: 2, kind: NotFound, message: "No such file or directory" }', src/main.rs:297:53
stack backtrace:
<snip>

What's the required path? Is the above output something you'd expect?

I'd be very happy to help!

@opfromthestart
Contributor Author

opfromthestart commented Jan 31, 2023

My GPU is an NVIDIA GeForce GTX 1650 Ti Mobile.
Here are two save files (they need to be renamed to futo.net and placed in the saves folder).
I pushed a new version of my project that should include the needed folder, so that error should no longer occur.
saves.zip

@drahnr
Member

drahnr commented Feb 1, 2023

Long story short, it's a bug. Some things became private, and the current automated testing doesn't cover such an example. #190 addresses the principal issue: the lack of an API for doing this at all. On the other hand, that API is very rough and requires knowledge of capnp, which is not ideal. I'll create an abstraction soon™, but until then I'd recommend patching your project with that PR.

@opfromthestart
Contributor Author

opfromthestart commented Feb 1, 2023

The example seems to store and retrieve the config, but not the actual network parameters (e.g. the ILayer object) themselves. What would I need to change to also save and load a Layer object alongside the config?

@drahnr
Member

drahnr commented Feb 1, 2023

There is a save and a load function; that should be what you need.
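
For reference, a minimal sketch of that round trip, assuming juice's capnp-based `Layer::save`/`Layer::load` and a CUDA backend (paths and helper names here are illustrative, not the project's exact code):

```rust
// Minimal sketch, assuming juice's `Layer::save`/`Layer::load`;
// exact signatures may differ between juice versions.
use std::rc::Rc;

use coaster::prelude::*;
use juice::layer::Layer;

fn checkpoint(net: &mut Layer<Backend<Cuda>>) {
    // Serialize the layer, learned weights included, to disk.
    net.save("saves/futo.net").expect("Could not write to file");
}

fn restore(backend: Rc<Backend<Cuda>>) -> Layer<Backend<Cuda>> {
    // Rebuild the layer, weights and all, from the serialized file.
    Layer::load(backend, "saves/futo.net").expect("Could not read file")
}
```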

@opfromthestart
Contributor Author

I already use the save and load functions. I implemented saving of the config, but the learning still stalls after a reload. Should I be getting the SequentialConfig from somewhere other than where I first create it? My guess is that this is a bug in the save and load functions rather than an inability to save configs.

@drahnr
Member

drahnr commented Feb 1, 2023

I'll dig deeper into this; at first glance, save and load appear OK. I have yet to finish a unit test for a trained network with equality checks.

@drahnr
Member

drahnr commented Feb 2, 2023

#190 does implement a unit test now, but the PartialEq implementation does not cover all items. The weights are checked for equivalence though, so that cannot be the root of the issue. The weight_gradients are not retained, but they are only accumulations for a minibatch anyway, and are reset after each one. So this investigation needs some more time.

@opfromthestart
Contributor Author

opfromthestart commented Feb 11, 2023

I wrote a simple xor example in the project linked at the top. In the main function, first run only the xor_train() function; once it has learned, stop it, change the line to xor_eval(), and see that the stored weights do not produce the correct results.

@drahnr
Member

drahnr commented Feb 12, 2023

I think the correct test would be:

train
...
train
eval
save
load
eval

and compare the output of the two eval invocations on the same input. Or is that what you meant?

I didn't get around to digging deeper yet.
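
Expressed as code, the round-trip check might look like this minimal sketch; `build_net`, `train_minibatch`, and `eval` are hypothetical stand-ins for the project's network construction, solver step, and forward pass, and `Layer::save`/`Layer::load` are assumed from juice:

```rust
// Hedged sketch of the train → eval → save → load → eval test.
// `build_net`, `train_minibatch`, and `eval` are hypothetical helpers,
// not juice API.
use std::rc::Rc;

use coaster::prelude::*;
use juice::layer::Layer;

fn roundtrip_check(backend: Rc<Backend<Cuda>>, input: &[f32]) {
    let mut net = build_net(backend.clone());
    for _ in 0..20_000 {
        train_minibatch(&mut net);
    }

    let before = eval(&mut net, input);
    net.save("saves/roundtrip.net").unwrap();
    let mut reloaded = Layer::load(backend, "saves/roundtrip.net").unwrap();
    let after = eval(&mut reloaded, input);

    // The same input must give (near-)identical output after a reload.
    for (a, b) in before.iter().zip(after.iter()) {
        assert!((a - b).abs() < 1e-5, "outputs diverged: {a} vs {b}");
    }
}
```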

@opfromthestart
Contributor Author

I revised the main function so that it does exactly that; the issue still persists.
Could it have to do with the forward function itself? It theoretically should not need to be mutable, so maybe something is being overwritten there?

@opfromthestart
Contributor Author

I think it may have to do with the loading of the bias weights. I added a third example that is just a single linear layer, and it learned to be the identity function. When I load it from the file, it has the same slope, but all the outputs are shifted. I'm guessing something in the weights is not being saved or loaded properly.

0: 0.3421242
Trained model before reload from disk:
[1.4901191e-6, 0.9999985]
Loaded net
Model after reload from disk:
[1.4076138, 2.407611]
There are 0 differences in weights.
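
A check that only diffs the weight matrices could miss exactly this, so here is a hedged sketch that diffs every learnable tensor, biases included. `tensor_to_vec` is a hypothetical host-sync helper, and `learnable_weights_data` is assumed from juice's Layer API (the name may differ between versions):

```rust
// Hedged sketch: diff every learnable tensor of two networks, biases
// included. `tensor_to_vec` is a hypothetical helper that syncs a
// SharedTensor back to host memory; `learnable_weights_data` is
// assumed from juice's Layer API.
use coaster::prelude::*;
use juice::layer::Layer;

fn diff_params(a: &Layer<Backend<Cuda>>, b: &Layer<Backend<Cuda>>) {
    let (wa, wb) = (a.learnable_weights_data(), b.learnable_weights_data());
    for (i, (ta, tb)) in wa.iter().zip(wb.iter()).enumerate() {
        let va = tensor_to_vec(&ta.read().unwrap());
        let vb = tensor_to_vec(&tb.read().unwrap());
        let diffs = va
            .iter()
            .zip(vb.iter())
            .filter(|&(x, y)| (x - y).abs() > 1e-6)
            .count();
        // A linear layer usually owns two tensors: [0] weight, [1] bias.
        println!("tensor {i}: {diffs} differing elements");
    }
}
```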

@drahnr
Member

drahnr commented Feb 22, 2023

I'll try to make some time to investigate further; personal life events are consuming a lot of my spare time lately.
