Failure to reproduce the reported HyenaDNA results on NT tasks #65

multydoffer opened this issue on Apr 16, 2024

I've tried the official settings, but my results lag far behind the reported ones.
Given that the sequence length should be 2x-4x that of the downstream task, I tried two experiments (a quick checkpoint sanity check is sketched after the list):

  1. fine-tune from the officially released tiny-1k-256 weights, and
  2. train a 2k model with no sequence-length warmup (max length is 2k).

Both underperformed the reported results.
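In case it helps with debugging, this is the quick sanity check I run on the checkpoint before fine-tuning. It is just a throwaway script of mine (the path is a placeholder), not anything from the repo:

```python
import torch

# Inspect the checkpoint passed as train.pretrained_model_path and confirm the
# backbone keys/shapes look consistent with d_model=256, n_layer=2 from the config.
ckpt = torch.load("hyena-dna/outputs/weights.ckpt", map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)  # Lightning checkpoints nest weights under "state_dict"

for name, tensor in list(state_dict.items())[:10]:
    print(name, tuple(tensor.shape))
```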

Here is the config file for my nucleotide transformer tasks:

```yaml
# @package _global_
defaults:
  - /pipeline: nucleotide_transformer
  - override /scheduler: cosine_warmup_timm

model:
  _name_: dna_embedding
  d_model: 256
  n_layer: 2
  d_inner: ${eval:4 * ${.d_model}}
  vocab_size: 12
  resid_dropout: 0.0
  embed_dropout: 0.1
  fused_mlp: False  # figure out how to use fused MLP, maybe only with bf16 + a100
  fused_dropout_add_ln: True
  residual_in_fp32: True
  pad_vocab_size_multiple: 8
  layer:
    _name_: hyena
    emb_dim: 5
    filter_order: 64
    short_filter_order: 3
    l_max: 1026  # required to be set the same as the pretrained model if using; don't forget the +2! ${eval:${dataset.max_length}+2}
    modulate: True
    w: 10
    lr: ${optimizer.lr}
    wd: 0.0
    lr_pos_emb: 0.0

task:
  _name_: masked_multiclass
  loss: cross_entropy
  metrics:
    - accuracy
  torchmetrics: null

trainer:
  accelerator: gpu
  devices: 4
  num_nodes: 1
  accumulate_grad_batches: ${div_up:${train.global_batch_size}, ${eval:${trainer.devices} * ${dataset.batch_size} * ${trainer.num_nodes}}}
  max_epochs: 200
  precision: 16  # bf16 only a100
  gradient_clip_val: 1.0

# name                   maxlen  classes  samples  metric
# enhancer                 200      2      14968    MCC
# enhancer_types           200      3      14968    MCC
# H3                       500      2      13468    MCC
# H3K4me1                  500      2      28509    MCC
# H3K4me2                  500      2      27614    MCC
# H3K4me3                  500      2      33119    MCC
# H3K9ac                   500      2      25003    MCC
# H3K14ac                  500      2      29743    MCC
# H3K36me3                 500      2      31392    MCC
# H3K79me3                 500      2      25953    MCC
# H4                       500      2      13140    MCC
# H4ac                     500      2      30685    MCC
# promoter_all             300      2      53276    F1
# promoter_non_tata        300      2      47759    F1
# promoter_tata            300      2       5517    F1
# splice_sites_acceptor    600      2      19961    F1
# splice_sites_donor       600      2      19775    F1

dataset:
  batch_size: 32
  dataset_name: 'H3K4me1'
  tokenizer_name: char
  add_eos: false
  rc_aug: false  # reverse complement augmentation
  return_mask: false
  padding_side: left

scheduler:
  t_in_epochs: False
  t_initial: ${eval:${div_up:${dataset.train_len}, ${train.global_batch_size}} * ${trainer.max_epochs}}
  warmup_lr_init: 1e-6
  warmup_t: ${eval:${div_up:${dataset.train_len}, ${train.global_batch_size}} * ${trainer.max_epochs} * 0.01}
  lr_min: ${eval:0.1 * ${optimizer.lr}}

optimizer:
  lr: 6e-4
  weight_decay: 0.1

train:
  gpu_mem: ${eval:"round(float(__import__('subprocess').check_output('nvidia-smi -i 0 --query-gpu=memory.total --format=csv,noheader,nounits', shell=True).strip().decode()) / 1000)"}
  seed: 2222
  global_batch_size: ${eval:${trainer.devices}*${dataset.batch_size}}
  remove_test_loader_in_eval: true  # no test set in this benchmark
  pretrained_model_strict_load: False  # false allows encoder/decoder to be used if new model uses it
  # for loading backbone and not head, requires both of these flags below
  pretrained_model_path: hyena-dna/outputs/weights.ckpt
  pretrained_model_state_hook:
    _name_: load_backbone
    freeze_backbone: false  # seems to work much better if false (ie finetune entire model)
```
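For anyone cross-checking the schedule, here is a minimal sketch of what the interpolated scheduler values above resolve to; `train_len` is a placeholder since I don't have the exact split size in front of me:

```python
import math

# Values copied from the config above.
devices, batch_size = 4, 32
max_epochs = 200
lr = 6e-4

train_len = 25_000  # placeholder for dataset.train_len (actual train split size)

global_batch_size = devices * batch_size                    # train.global_batch_size -> 128
steps_per_epoch = math.ceil(train_len / global_batch_size)  # div_up(dataset.train_len, global_batch_size)
t_initial = steps_per_epoch * max_epochs                    # scheduler.t_initial
warmup_t = t_initial * 0.01                                 # scheduler.warmup_t (1% of total steps)
lr_min = 0.1 * lr                                           # scheduler.lr_min -> 6e-5

print(global_batch_size, t_initial, warmup_t, lr_min)
```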
