
Fine-tuning pipeline for custom dataset #2

danielduckworth opened this issue Sep 22, 2020 · 11 comments

@danielduckworth

Hi, I think this is great work.

Would you consider adding a notebook for the fine-tuning pipeline?

I have my own dataset of multiple-choice questions with answers and distractors that I'd like to try fine-tuning on.

@AMontgomerie
Owner

Hey, thanks!

The qg_training notebook contains the code for fine-tuning a pretrained T5 model. You can try using that if you like.

If you can get your data into a CSV with the questions in one column and the answers and contexts in another, you should be able to load it into QGDataset and run the notebook on your dataset.
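Something like this, as a rough sketch (the column names here are my guess; check what the notebook's QGDataset actually reads):

```python
import pandas as pd

# Hypothetical CSV layout: the target question in one column, the model input
# (answer + context, tagged with the special tokens) in the other.
df = pd.DataFrame({
    "question": ["What is the capital of France?"],
    "text": ["<answer> Paris <context> Paris is the capital of France."],
})
df.to_csv("my_dataset.csv", index=False)
```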

I only trained the model to generate the correct answer, and separately used NER to find the other multiple-choice options in the text. I'm not sure how it would perform if you want it to generate the answer and distractors at the same time. You could try concatenating them all together, separating them with the answer token, like:

"<answer> answer1 <answer> answer2 <answer> answer3 <answer> answer4 <context> context"

@danielduckworth
Author

Excellent, I'll have a play.

@danielduckworth
Author

danielduckworth commented Sep 24, 2020

I'm getting an error with the last cell:

```
TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>
     11 for epoch in range(1, EPOCHS + 1):
     12
---> 13     train()
     14     val_loss = evaluate(model, valid_loader)
     15     print_line()

TypeError: train() missing 2 required positional arguments: 'epoch' and 'best_val_loss'
```

@AMontgomerie
Owner

Huh, that's strange. train() should be train(epoch, best_val_loss). I'll update it.
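In the meantime, a minimal sketch of the fixed loop (it relies on the notebook's own train, evaluate, and print_line definitions, and assumes evaluate returns the validation loss):

```python
best_val_loss = float("inf")

for epoch in range(1, EPOCHS + 1):
    train(epoch, best_val_loss)  # pass the two required positional arguments
    val_loss = evaluate(model, valid_loader)
    best_val_loss = min(best_val_loss, val_loss)
    print_line()
```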

@danielduckworth
Author

Thanks. I had to reduce the batch size to 1 on my home GPU, but it looks like it's working. I haven't added the distractors yet; I'll just train on context, question, and answer for now and see how it goes.

@AMontgomerie
Owner

Yes, it's quite GPU-intensive. I think if you load the notebook in Google Colab and change the runtime type to GPU, you can probably increase the batch size to 4.
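Batch size is set where the data loaders are built; a rough sketch, assuming a standard PyTorch DataLoader over the QGDataset instances (train_set and valid_set are placeholder names):

```python
from torch.utils.data import DataLoader

BATCH_SIZE = 4  # fits on a Colab GPU; drop to 1 on a small home GPU

train_loader = DataLoader(train_set, batch_size=BATCH_SIZE, shuffle=True)
valid_loader = DataLoader(valid_set, batch_size=BATCH_SIZE)
```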

@danielduckworth
Author

Hi Adam, I've run the training process and have the 'qg_pretrained_t5_model_trained.pth' model file.

I modified questiongenerator.py to point to a local folder for the model, but it needs a bit of config stuff. How do I package this trained model for Hugging Face Transformers? Are there any docs I can look at?

@danielduckworth
Author

Never mind, I found model.save_pretrained() and tokenizer.save_pretrained().
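For anyone else who lands here, a minimal sketch of saving and reloading in the standard Hugging Face layout (the directory name is a placeholder):

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Writes config.json, the model weights, and the tokenizer files to one folder.
model.save_pretrained("qg_finetuned_t5")
tokenizer.save_pretrained("qg_finetuned_t5")

# Reload later, e.g. from questiongenerator.py:
model = T5ForConditionalGeneration.from_pretrained("qg_finetuned_t5")
tokenizer = T5Tokenizer.from_pretrained("qg_finetuned_t5")
```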

@danielduckworth
Author

Some of the generated questions are in German. What did I do wrong?

@AMontgomerie
Owner

That's strange. Are you sure there's no German text in your dataset?

@danielduckworth
Author

danielduckworth commented Sep 26, 2020

It's definitely all English. But maybe there are some Unicode errors? The CSV is UTF-8, but I imported it with pandas using latin1 encoding.
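For reference, a minimal sketch of reading the file with a matching encoding (the filename is a placeholder):

```python
import pandas as pd

# A UTF-8 file decoded as latin1 never raises an error: each multi-byte
# character is silently split into junk like "Ã©" instead of "é", so the
# text gets corrupted without any warning. Match the encoding to the file.
df = pd.read_csv("my_dataset.csv", encoding="utf-8")
```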
