ShareGPT Dataset? #109

Open
teknium1 opened this issue May 7, 2023 · 8 comments

Comments

teknium1 commented May 7, 2023

Hello, I see ShareGPT's dataset is listed in the readme, but the download for the Alpaca-format version is not. Could it be listed? Very interested.

teknium1 commented May 7, 2023

Sorry, I found the dataset on your Hugging Face.

I looked it over, though, and the dataset format might be concerning.

I may be missing something, but a model trained on the ShareGPT Alpaca-format dataset may not learn coherently.

For example:

    {
        "instruction": "I have an interview about product speccing with the company Weekend Health. Give me an example of a question they might ask with regards about a new feature",
        "input": "",
        "output": "\"Can you walk us through the process you would use to gather requirements and create a specification for a new feature that our users have been requesting, such as a personalized nutrition plan feature?\""
    },
    {
        "instruction": "Give me a few more",
        "input": "",
        "output": "* \"How would you approach creating a specification for a new feature that integrates with third-party fitness trackers and allows for seamless data tracking and analysis for our users?\"\n* \"Can you give an example of a time when you had to navigate competing stakeholder"
    },

These two sequences will likely not be related to each other during training, which would make learning much more erratic than it would be with Vicuna's original dataset in their conversation format.
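
For illustration, here is a minimal sketch of one way to keep the turns connected: fold the earlier turns of a conversation into each record's "input" field, so a follow-up like "Give me a few more" still sees the question it refers to. The field names ("conversations", "from", "value") and file paths are assumptions about the usual ShareGPT dump layout, not part of this repo's pipeline.

```python
import json

def conversation_to_alpaca(conversation):
    """Turn one ShareGPT conversation into Alpaca records that carry context."""
    records = []
    history = []
    turns = conversation["conversations"]
    for i in range(0, len(turns) - 1, 2):
        human, gpt = turns[i], turns[i + 1]
        if human["from"] != "human" or gpt["from"] != "gpt":
            continue  # skip malformed pairs
        records.append({
            "instruction": human["value"],
            # Earlier turns go into "input" so the pair is self-contained.
            "input": "\n".join(history),
            "output": gpt["value"],
        })
        history.append(f"User: {human['value']}")
        history.append(f"Assistant: {gpt['value']}")
    return records

if __name__ == "__main__":
    with open("sharegpt.json") as f:  # input path is an assumption
        data = json.load(f)
    alpaca = [rec for conv in data for rec in conversation_to_alpaca(conv)]
    with open("sharegpt_alpaca_with_context.json", "w") as f:
        json.dump(alpaca, f, indent=2, ensure_ascii=False)
```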

@dumpmemory

Did you check the _context.json version?

@float-trip

Looked into this for a bit. sharegpt_context.json has the same issue to an extent. It seems that everyone is processing the ShareGPT data using Vicuna's pipeline, including the part that chunks long conversations based on token count.

So rather than throwing out data after hitting the context window, sharegpt_context.json contains a fair number of chats that start in the middle of things, with the first prompt being something like "[HM]: continue". Not sure whether training on this is harmful or helpful.
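
As a rough sketch (not the repo's actual pipeline), fragments like that could be filtered out by checking whether a chunk's first human turn is a bare continuation stub. The field names, the "[HM]:" marker, and the file names are assumptions; adjust them to match your copy of sharegpt_context.json.

```python
import json
import re

# Prompts like "continue", "go on", "keep going" suggest the chunk was split
# off the middle of a longer conversation.
CONTINUATION = re.compile(r"^(continue|go on|keep going)\b", re.IGNORECASE)

def first_human_turn(record):
    # The chunk's conversation text is assumed to live in "input" (falling
    # back to "instruction") as "[HM]: ... [AI]: ..." blocks.
    text = record.get("input") or record.get("instruction", "")
    match = re.search(r"\[HM\]:\s*(.*)", text)
    return match.group(1) if match else text

def looks_like_fragment(record):
    return bool(CONTINUATION.match(first_human_turn(record).strip()))

with open("sharegpt_context.json") as f:  # file name is an assumption
    records = json.load(f)

kept = [r for r in records if not looks_like_fragment(r)]
print(f"kept {len(kept)} of {len(records)} records")

with open("sharegpt_context_filtered.json", "w") as f:
    json.dump(kept, f, ensure_ascii=False)
```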

teknium1 commented May 8, 2023

It would be wise, imo, to alter the Vicuna pipeline being used so that it simply throws away the sequences that get split off, or, if needed, throws out all conversations that are too long. Maybe make a 2k-context version and a 4k one, since 4k LLaMA models have started to appear (they are not working well at all right now, but they will soon).
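
A minimal sketch of that "drop instead of chunk" idea, assuming a LLaMA tokenizer from transformers and the usual ShareGPT field layout (the checkpoint name and file paths are assumptions, not this repo's code):

```python
import json
from transformers import AutoTokenizer

# Any LLaMA-compatible tokenizer works; this checkpoint name is an assumption.
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")

def conversation_length(conversation):
    # Concatenate all turns and count tokens for the whole conversation.
    text = "\n".join(turn["value"] for turn in conversation["conversations"])
    return len(tokenizer(text)["input_ids"])

with open("sharegpt.json") as f:  # input path is an assumption
    data = json.load(f)

for budget in (2048, 4096):  # 2k and 4k context variants
    kept = [c for c in data if conversation_length(c) <= budget]
    out_path = f"sharegpt_max{budget}.json"
    with open(out_path, "w") as f:
        json.dump(kept, f, ensure_ascii=False)
    print(f"{out_path}: kept {len(kept)} of {len(data)} conversations")
```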

teknium1 commented May 8, 2023

I also think that, since a lot of datasets are doing this, it likely has something to do with the Vicuna "random stopping" issues.

dkqkxx commented May 10, 2023

At present there are some efforts to clean the ShareGPT dataset, and we will continue to keep an eye on them.

@teknium1

> At present there are some efforts to clean the ShareGPT dataset, and we will continue to keep an eye on them.

Can you link any of those?

dkqkxx commented May 10, 2023

> At present there are some efforts to clean the ShareGPT dataset, and we will continue to keep an eye on them.
>
> Can you link any of those?

https://paratranz.cn/projects/6725
https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered
