Conversation
rogerkuou
left a comment
Nice implementation @SarahAlidoost !
I only have two minor comments. Just see if they are useful. Feel free to merge!
```python
    model_path,
)
if verbose:
    print(f"Model saved to {model_path}")
```
Something I just found is that the print function will not write status to the Slurm log file in real time.
Shall we replace the print calls with logging?
To add to this, executing the Python script with -u does help, but logging still seems like the more structural solution since it gives more info.
Actually, it is good that the print statement is not in the Slurm log file. The print statements are mainly for the example notebook; on HPC, the verbose variable should be False. Instead, we implemented proper logging using torch.utils.tensorboard in #34.
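For completeness, the torch.utils.tensorboard usage looks roughly like this (the log directory, tag name, and loss values are illustrative, not the actual code from #34):

```python
from torch.utils.tensorboard import SummaryWriter

# SummaryWriter writes event files that the TensorBoard UI can
# display live, which doubles as structured logging on HPC.
writer = SummaryWriter(log_dir="runs/example")  # illustrative log dir

for epoch in range(3):
    train_loss = 1.0 / (epoch + 1)  # dummy value standing in for the real loss
    writer.add_scalar("Loss/train", train_loss, epoch)

writer.flush()
writer.close()
```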
Co-authored-by: Ou Ku <o.ku@esciencecenter.nl>
closes #28
This PR:
- changes `torch.optim.Adam` to `torch.optim.AdamW`, exposes `dropout` in the model, and uses a validation set in training (see explanation 2 below)

Explanations:
1. Our model has a lot of parameters (see the default arguments of the model), so just sampling the whole globe doesn't really give us enough training data. This can lead to high loss on the test and validation sets. One approach is to use overlapping tiles to create more samples, as done in "2.3 Data augmentation and pre-processing" of the MAESSTRO paper. That's why I added a stride option to the dataset. I also decreased the number of parameters, especially `embed_dim`, in the model in the example notebook. This is something to fix later when building a proper training workflow on larger data on HPC.
2. Another issue was over-fitting. I used a validation set during training, similar to what they did in the MAESSTRO code, but they actually used different years for training and validation (like 2012 vs 2011). Since I work with a small dataset and use stride to create more samples, there's some overlap between train, test, and validation. This is something to fix later when building a proper training workflow on larger data on HPC.
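The stride option for overlapping tiles mentioned above can be sketched like this (a standalone NumPy illustration; the function name and shapes are made up and not the dataset's actual API):

```python
import numpy as np

def make_tiles(field, tile_size, stride):
    """Cut square tiles from a 2-D field with a given stride.

    A stride smaller than tile_size produces overlapping tiles,
    multiplying the number of training samples from the same data.
    """
    tiles = []
    h, w = field.shape
    for i in range(0, h - tile_size + 1, stride):
        for j in range(0, w - tile_size + 1, stride):
            tiles.append(field[i:i + tile_size, j:j + tile_size])
    return np.stack(tiles)

field = np.arange(64 * 64, dtype=float).reshape(64, 64)
# Non-overlapping tiles: stride == tile_size -> (64/16)^2 = 16 samples
assert make_tiles(field, 16, 16).shape == (16, 16, 16)
# Overlapping tiles: stride 8 -> ((64-16)/8 + 1)^2 = 49 samples
assert make_tiles(field, 16, 8).shape == (49, 16, 16)
```

The trade-off, as noted above, is that overlapping tiles drawn from one small region will leak information between train, validation, and test splits unless the splits are separated before tiling.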