About fine-tuning hierarchical BERT model #31
-
Hi! Love your repo; I'm doing research in legal text classification and I'm using your repo a lot. I was wondering what technique exactly you used to fine-tune your hierarchical BERT model. I was able to reproduce your "normal" BERT results by only unfreezing the last layer of the BERT model, which I think is the standard way to fine-tune a BERT model. But with the hierarchical BERT model there are a few more added layers: a new embedding layer and two new transformer layers. How are those trained? Is the learning rate the same for all the layers? Are all of those new layers unfrozen? Thank you very much!
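For context, here is a minimal sketch of what "only unfreezing the last layer" can look like with the HuggingFace `transformers` API. The checkpoint name and `num_labels` are placeholders, not values taken from the repo:

```python
import torch
from transformers import AutoModelForSequenceClassification

# Hypothetical example: checkpoint and label count are placeholders.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=10
)

# Freeze every parameter, then re-enable gradients only for the last
# encoder layer, the pooler, and the classification head.
for param in model.parameters():
    param.requires_grad = False

for module in (model.bert.encoder.layer[-1], model.bert.pooler, model.classifier):
    for param in module.parameters():
        param.requires_grad = True
```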
Replies: 1 comment
-
Hi @Vitor-Almeida,
It's interesting that you could reproduce the BERT results by unfreezing only a single layer.
With respect to your question, we used a fixed learning rate of 3e-5 across all models. No special scheduling (warmup, decay) or anything else was used. While we acknowledge that tuning the learning rate could possibly lead to better results, this process would be extremely resource-consuming, and we lacked the resources to tune learning rates (or other hyperparameters) across 6 models and 7 tasks with multiple seeds...
The same applies to the hierarchical models: we use a fixed learning rate across all model layers.
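In other words, all parameters (the pre-trained encoder plus the newly added embedding and transformer layers) sit in a single optimizer group with the same fixed learning rate and no scheduler. A minimal sketch of that setup, assuming a PyTorch module `hier_bert` with a HuggingFace-style forward that returns a loss, and a `train_loader` defined elsewhere (both names are hypothetical):

```python
import torch

# All parameters in one group: the BERT backbone and the new
# hierarchical layers share the same fixed learning rate of 3e-5.
# `hier_bert` and `train_loader` are assumed to be defined elsewhere.
optimizer = torch.optim.AdamW(hier_bert.parameters(), lr=3e-5)

# No scheduler is attached, so the learning rate never changes.
for batch in train_loader:
    optimizer.zero_grad()
    loss = hier_bert(**batch).loss  # assumes HF-style output with a .loss field
    loss.backward()
    optimizer.step()
```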