ShareGPT Dataset? #109

Open
teknium1 opened this issue May 7, 2023 · 8 comments

Comments

teknium1 commented May 7, 2023

Hello, I see ShareGPT's dataset is listed in the readme, but the download for the Alpaca-format version is not. Could it be listed? Very interested.

teknium1 commented May 7, 2023

Sorry, I found the dataset on your Hugging Face.

I looked it over, though, and the dataset format might be concerning.

I may be missing something, but a model trained on the ShareGPT Alpaca-format dataset may not learn coherently.

For example:

    {
        "instruction": "I have an interview about product speccing with the company Weekend Health. Give me an example of a question they might ask with regards about a new feature",
        "input": "",
        "output": "\"Can you walk us through the process you would use to gather requirements and create a specification for a new feature that our users have been requesting, such as a personalized nutrition plan feature?\""
    },
    {
        "instruction": "Give me a few more",
        "input": "",
        "output": "* \"How would you approach creating a specification for a new feature that integrates with third-party fitness trackers and allows for seamless data tracking and analysis for our users?\"\n* \"Can you give an example of a time when you had to navigate competing stakeholder"
    },

These two sequences will likely not be related to each other during training, which would make learning much more erratic than it would be with Vicuna's original dataset in their conversation format.
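
For illustration, here is a minimal sketch of one way to keep the turns connected: fold the earlier turns of a conversation into each record's "input" field, so a follow-up like "Give me a few more" still sees the question it refers to. The field names ("conversations", "from", "value") and file paths are assumptions about the usual ShareGPT dump layout, not part of this repo's pipeline.

```python
import json

def conversation_to_alpaca(conversation):
    """Turn one ShareGPT conversation into Alpaca records that carry context."""
    records = []
    history = []
    turns = conversation["conversations"]
    for i in range(0, len(turns) - 1, 2):
        human, gpt = turns[i], turns[i + 1]
        if human["from"] != "human" or gpt["from"] != "gpt":
            continue  # skip malformed pairs
        records.append({
            "instruction": human["value"],
            # Earlier turns go into "input" so the pair is self-contained.
            "input": "\n".join(history),
            "output": gpt["value"],
        })
        history.append(f"User: {human['value']}")
        history.append(f"Assistant: {gpt['value']}")
    return records

if __name__ == "__main__":
    with open("sharegpt.json") as f:  # input path is an assumption
        data = json.load(f)
    alpaca = [rec for conv in data for rec in conversation_to_alpaca(conv)]
    with open("sharegpt_alpaca_with_context.json", "w") as f:
        json.dump(alpaca, f, indent=2, ensure_ascii=False)
```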

@dumpmemory

Did you check the _context.json version?

@float-trip

Looked into this for a bit. sharegpt_context.json has the same issue to an extent. It seems that everyone is processing the ShareGPT data using Vicuna's pipeline, including the part that chunks long conversations based on token count.

So rather than throwing out data after hitting the context window, sharegpt_context.json contains a fair number of chats that start in the middle of things, with the first prompt being something like "[HM]: continue". Not sure whether training on this is harmful or helpful.
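
As a rough sketch (not the repo's actual pipeline), fragments like that could be filtered out by checking whether a chunk's first human turn is a bare continuation stub. The field names, the "[HM]:" marker, and the file names are assumptions; adjust them to match your copy of sharegpt_context.json.

```python
import json
import re

# Prompts like "continue", "go on", "keep going" suggest the chunk was split
# off the middle of a longer conversation.
CONTINUATION = re.compile(r"^(continue|go on|keep going)\b", re.IGNORECASE)

def first_human_turn(record):
    # The chunk's conversation text is assumed to live in "input" (falling
    # back to "instruction") as "[HM]: ... [AI]: ..." blocks.
    text = record.get("input") or record.get("instruction", "")
    match = re.search(r"\[HM\]:\s*(.*)", text)
    return match.group(1) if match else text

def looks_like_fragment(record):
    return bool(CONTINUATION.match(first_human_turn(record).strip()))

with open("sharegpt_context.json") as f:  # file name is an assumption
    records = json.load(f)

kept = [r for r in records if not looks_like_fragment(r)]
print(f"kept {len(kept)} of {len(records)} records")

with open("sharegpt_context_filtered.json", "w") as f:
    json.dump(kept, f, ensure_ascii=False)
```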

teknium1 commented May 8, 2023

It would be wise, imo, to alter the Vicuna pipeline being used so that it simply throws away the sequences that get split off, or, if needed, throws out all conversations that are too long. Maybe make a 2k-context version and a 4k one, since 4k LLaMA models have started to appear (they are not working well at all right now, but they will soon).
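
A minimal sketch of that "drop instead of chunk" idea, assuming a LLaMA tokenizer from transformers and the usual ShareGPT field layout (the checkpoint name and file paths are assumptions, not this repo's code):

```python
import json
from transformers import AutoTokenizer

# Any LLaMA-compatible tokenizer works; this checkpoint name is an assumption.
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")

def conversation_length(conversation):
    # Concatenate all turns and count tokens for the whole conversation.
    text = "\n".join(turn["value"] for turn in conversation["conversations"])
    return len(tokenizer(text)["input_ids"])

with open("sharegpt.json") as f:  # input path is an assumption
    data = json.load(f)

for budget in (2048, 4096):  # 2k and 4k context variants
    kept = [c for c in data if conversation_length(c) <= budget]
    out_path = f"sharegpt_max{budget}.json"
    with open(out_path, "w") as f:
        json.dump(kept, f, ensure_ascii=False)
    print(f"{out_path}: kept {len(kept)} of {len(data)} conversations")
```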

teknium1 commented May 8, 2023

I also think that, since a lot of datasets are doing this, it likely has something to do with the Vicuna "random stopping" issues.

dkqkxx commented May 10, 2023

At present there are some efforts to clean the ShareGPT dataset, and we will continue to keep an eye on them.

@teknium1

> At present there are some efforts to clean the ShareGPT dataset, and we will continue to keep an eye on them.

Can you link any of those?

dkqkxx commented May 10, 2023

> At present there are some efforts to clean the ShareGPT dataset, and we will continue to keep an eye on them.
>
> Can you link any of those?

https://paratranz.cn/projects/6725
https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered
