ConstantLengthDataset Ignore Some Texts #1621

TianyiPeng · 2024-05-04T23:33:12Z

Not a big concern, but the current implementation drops the last chunk of the data whenever it is shorter than self.seq_length, i.e. the remainder left over after splitting the tokens into seq_length-sized chunks. It may be worth adding that last chunk back with padding tokens.

            for i in range(0, len(all_token_ids), self.seq_length):
                input_ids = all_token_ids[i : i + self.seq_length]
                if len(input_ids) == self.seq_length:
                    examples.append(input_ids)

https://github.com/huggingface/trl/blob/main/trl/trainer/utils.py#L475-L478
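To make the effect concrete, here is a minimal standalone sketch, not the TRL code itself. `chunk_drop_remainder` mirrors the quoted loop, while `chunk_pad_remainder` and its `pad_token_id` parameter are hypothetical names that illustrate the suggested padding variant.

```python
def chunk_drop_remainder(all_token_ids, seq_length):
    """Mirror of the quoted loop: keep only full-length chunks."""
    examples = []
    for i in range(0, len(all_token_ids), seq_length):
        input_ids = all_token_ids[i : i + seq_length]
        if len(input_ids) == seq_length:
            examples.append(input_ids)
    return examples


def chunk_pad_remainder(all_token_ids, seq_length, pad_token_id):
    """Hypothetical variant: pad the final short chunk instead of dropping it."""
    examples = []
    for i in range(0, len(all_token_ids), seq_length):
        input_ids = all_token_ids[i : i + seq_length]
        if len(input_ids) < seq_length:
            # Fill the missing positions with the padding token.
            input_ids = input_ids + [pad_token_id] * (seq_length - len(input_ids))
        examples.append(input_ids)
    return examples


tokens = list(range(10))  # 10 tokens, seq_length 4 -> remainder of 2
dropped = chunk_drop_remainder(tokens, seq_length=4)
padded = chunk_pad_remainder(tokens, seq_length=4, pad_token_id=-1)
print(dropped)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(padded)   # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, -1, -1]]
```

In this toy case tokens 8 and 9 never reach the model with the first function, and they are kept (plus two padding positions) with the second.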

younesbelkada · 2024-05-23T09:28:06Z

Thanks @TianyiPeng for the suggestion. The reason we don't add it is that padding the last chunk would require introducing custom attention masks, which would slow down training for setups that use, for instance, Flash Attention 2.
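As a rough illustration of what a padded last chunk would need, the sketch below builds the attention mask and labels alongside the padded ids. The helper `pad_with_mask` is hypothetical and not part of TRL. It assumes the common convention that mask value 0 marks padding and that label -100 is skipped by the loss (the default `ignore_index` of PyTorch's cross-entropy loss).

```python
def pad_with_mask(input_ids, seq_length, pad_token_id):
    """Hypothetical helper: pad a short chunk and mark which positions are real."""
    n_real = len(input_ids)
    n_pad = seq_length - n_real
    return {
        "input_ids": input_ids + [pad_token_id] * n_pad,
        # 1 = real token, 0 = padding the model should not attend to.
        "attention_mask": [1] * n_real + [0] * n_pad,
        # -100 keeps padding positions out of the loss (assumed convention).
        "labels": input_ids + [-100] * n_pad,
    }


last_chunk = pad_with_mask([8, 9], seq_length=4, pad_token_id=0)
print(last_chunk["attention_mask"])  # [1, 1, 0, 0]
```

Full-length chunks need no such mask, since every position is a real token, which is why dropping the remainder keeps the packed batches mask-free.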
