About fine-tuning hierarchical BERT model #31

Hi @Vitor-Almeida,

It's interesting that you could reproduce the BERT results by unfreezing a single layer.

With respect to your question, we used a fixed learning rate of 3e-5 across all models. No special scheduling (warmup, decay) or anything else was used. While we acknowledge that tuning the learning rate could possibly lead to better results, this process would be extremely resource-consuming, and we lacked the resources to tune learning rates (or other hyperparameters) across 6 models and 7 tasks with multiple seeds...
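
If it helps, here is a minimal sketch of what that setup could look like with the Hugging Face `TrainingArguments` (assuming you use the `Trainer` API; the values other than the learning rate and scheduler settings are placeholders, not our actual configuration):

```python
from transformers import TrainingArguments

# Fixed learning rate of 3e-5, no warmup, no decay:
# "constant" keeps the learning rate flat for the entire run.
training_args = TrainingArguments(
    output_dir="./outputs",        # placeholder path
    learning_rate=3e-5,            # same value for every model and task
    lr_scheduler_type="constant",  # no decay schedule
    warmup_ratio=0.0,              # no warmup
)
```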

The same applies to the hierarchical models; we use a fixed learning rate across all model layers.
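
In optimizer terms, that means a single parameter group rather than layer-wise (discriminative) learning rates. A rough sketch in plain PyTorch, using a vanilla BERT encoder as a stand-in for the full hierarchical model (the model name and optimizer choice here are illustrative):

```python
import torch
from transformers import AutoModel

# Stand-in for the hierarchical model (segment encoder + document encoder);
# the point is the same either way: a single parameter group.
model = AutoModel.from_pretrained("bert-base-uncased")

# One parameter group -> one fixed learning rate (3e-5) for every layer.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

# A layer-wise alternative (not what we did) would instead pass multiple
# groups, e.g. [{"params": layer.parameters(), "lr": some_lr}, ...] per layer.
```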

Answer selected by Vitor-Almeida