cvector: better prompt handling, add "mean vector" method #8069

Merged
6 commits merged on Jun 25, 2024

Conversation

@ngxson (Collaborator) commented Jun 22, 2024

Motivation

Ref comment: #7514 (comment)

After more consideration, I think we should not handle completions in cvector (at least for now), because it adds unnecessary complexity. Positive/negative prompts are now entirely up to the user to prepare. I also changed the example to use the llama-3 format.

This also fixes a bug where special tokens were not tokenized correctly.

With this change, I spotted a problem with PCA: the output vector is inverted (i.e. cvector_happy.gguf at +1.0 makes the model sad, while -1.0 makes it happy). I don't know why yet, but the quick fix is to invert the vector before saving it.

Mean method

Added "mean" as a dimensionality reduction method. It simply calculates the mean vector from all embeddings (a rough sketch of the computation follows the example outputs below).

The output turns out to be quite acceptable even with this simple method:

# +0.8 happy
YAY! I'm super excited to share a bedtime story with YOU! Here's a fun and exciting story for you!

# -0.8 happy
The darkness of despair weighs heavy upon your empty chest. The once-tainted soul, now forever shrouded in the darkness of a thousand tears.
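
As mentioned above, a rough sketch of what the mean method computes per layer (illustrative NumPy only, not the C++ implementation in this PR; it assumes the positive and negative hidden states are paired row by row):

```python
import numpy as np

def mean_direction(pos_embd: np.ndarray, neg_embd: np.ndarray) -> np.ndarray:
    """One layer's control vector as the mean of (positive - negative) hidden-state
    differences. pos_embd / neg_embd: [n_samples, n_embd], paired row by row."""
    return (pos_embd - neg_embd).mean(axis=0)
```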

Demo

./llama-cli -m ./llama-3.Q4_K_M.gguf -p "<|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nSing a song<|im_end|><|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" --special --control-vector-scaled ./control_vector.gguf -0.8 --control-vector-layer-range 10 31
...
"Oh, the darkness is spreading wide,
A heavy fog that won't subside,
The weight of the world, it's crushing me,
And the melodies that were once free!

@ngxson added the "Review Complexity : Low" label on Jun 22, 2024
@HatsuneMikuUwU33 (Contributor) commented Jun 23, 2024

> I think we should not handle completions in cvector (at least for now), because it adds unnecessary complexity.

What's the reason for removing it? If you don't want to use them, just make an empty file and run with --completions 1.

@ngxson (Collaborator, Author) commented Jun 23, 2024

@HatsuneMikuUwU33 The main reason is that the completions file is used to prepare the input data for training, while the main goal of cvector-generator is, as the name suggests, to calculate the vector file. The natures of these two steps are different.

At this stage, I don't want to mix these two steps in the same program because it makes the code too complex. The actual completion step mentioned in #7514 (comment) can be done better with llama-cli, which gives the user much more control over the input data.

If we absolutely want to have completions here, I'd suggest writing a dedicated program (maybe a shell script or Python) to do the data-preparation step.

CC @christianazinn

@ngxson changed the title from cvector: better prompt handling to cvector: better prompt handling, add "mean vector" method on Jun 23, 2024
@ngxson marked this pull request as ready for review on June 23, 2024 13:35
@ngxson requested a review from slaren on June 23, 2024 13:37
@jukofyork (Contributor) commented Jun 23, 2024

> With this change, I spotted a problem with PCA: the output vector is inverted (i.e. cvector_happy.gguf at +1.0 makes the model sad, while -1.0 makes it happy). I don't know why yet, but the quick fix is to invert the vector before saving it.

I'm just looking through the code now and I can't see anywhere where the data matrix is projected onto the principal component:

    PCA::run_pca(pca_params, ctx_train.v_diff, ctx_train.v_final);

    // write output vectors to gguf
    export_gguf(ctx_train.v_final, params.cvector_outfile, model_hint);

static void run_pca(
        struct pca_params & params,
        const std::vector<struct ggml_tensor *> & v_input, // shape of v_input[0]: [n_samples, n_embd]
        const std::vector<struct ggml_tensor *> & v_output) {
    printf("%s: Running PCA...\n", __func__);
    for (size_t il = 0; il < v_input.size(); ++il) {

        // prepare output vector
        struct ggml_tensor * ctrl_out = v_output[il];
        ggml_format_name(ctrl_out, "direction.%ld", il+1);

        // run power_iteration
        params.i_layer = il;
        params.n_layers = v_input.size();
        power_iteration(params, v_input[il], ctrl_out);
        printf("%s: Done layer %d / %d\n", __func__, (int) il+1, (int) v_input.size());
    }
}

    // get output tensor
    GGML_ASSERT(last_eigenvector);
    ggml_backend_tensor_get(last_eigenvector, output->data, 0, ggml_nbytes(last_eigenvector));
    //print_debug_tensor(output);
    ggml_gallocr_free(allocr);

You need to project the data matrix onto the eigenvector(s), calculate the mean, and see if the signs of the vector(s) need flipping so that adding the vector makes the mean go the way you want. There is no inherent directionality to the eigenvectors found, and it's pretty much random whether they point the way you want or not.

The Python code does this here:

        # calculate sign
        projected_hiddens = project_onto_direction(h, directions[layer])

        # order is [positive, negative, positive, negative, ...]
        positive_smaller_mean = np.mean(
            [
                projected_hiddens[i] < projected_hiddens[i + 1]
                for i in range(0, len(inputs) * 2, 2)
            ]
        )
        positive_larger_mean = np.mean(
            [
                projected_hiddens[i] > projected_hiddens[i + 1]
                for i in range(0, len(inputs) * 2, 2)
            ]
        )

        if positive_smaller_mean > positive_larger_mean:  # type: ignore
            directions[layer] *= -1

See: https://github.com/vgel/repeng/blob/main/repeng/extract.py

But I actually think (and have tested successfully in my own code) that not only should the sign be flipped, but the magnitude should also be scaled by the mean difference; otherwise all the vectors will just keep the norm of 1 returned by PCA, and it will be very hard to balance mixing control vectors from early and later layers, where the mean hidden state differs by 1-2 orders of magnitude.

@jukofyork (Contributor) commented

> Mean method
>
> Added "mean" as a dimensionality reduction method. It simply calculates the mean vector from all embeddings.
>
> The output turns out to be quite acceptable even with this simple method:
>
> # +0.8 happy
> YAY! I'm super excited to share a bedtime story with YOU! Here's a fun and exciting story for you!
>
> # -0.8 happy
> The darkness of despair weighs heavy upon your empty chest. The once-tainted soul, now forever shrouded in the darkness of a thousand tears.

This method is actually Linear Discriminant Analysis where you assume the covariance matrices are just the identity. Here is a nice explanation of why that is and why it might not be optimal:

https://sthalles.github.io/fisher-linear-discriminant/

[image: simple_projection]
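
To make the equivalence concrete, here is a minimal NumPy sketch (illustrative only, not code from this PR or from repeng; pos/neg are assumed to be [n_samples, n_embd] matrices of hidden states):

```python
import numpy as np

def lda_direction(pos: np.ndarray, neg: np.ndarray, identity_cov: bool = True) -> np.ndarray:
    """Fisher LDA direction w = Sigma^-1 (mu_pos - mu_neg); with Sigma = I it
    collapses to the plain difference of class means, i.e. the "mean" method."""
    mu_diff = pos.mean(axis=0) - neg.mean(axis=0)
    if identity_cov:
        return mu_diff                              # the "mean" method
    centered = np.vstack([pos - pos.mean(axis=0), neg - neg.mean(axis=0)])
    sigma = np.cov(centered, rowvar=False)          # pooled within-class covariance (approximate)
    return np.linalg.solve(sigma, mu_diff)          # full LDA direction
```

Setting identity_cov=True gives exactly the mean-difference direction, which is why the mean method can be read as LDA under that simplifying assumption.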

@ngxson (Collaborator, Author) commented Jun 24, 2024

> You need to project the data matrix onto the eigenvector(s), calculate the mean, and see if the signs of the vector(s) need flipping so that adding the vector makes the mean go the way you want. There is no inherent directionality to the eigenvectors found, and it's pretty much random whether they point the way you want or not.

Thanks for the explanation. I looked at the Python code earlier but didn't understand this part in particular. It's all clear now, and I'll try to bring this part over to the C++ implementation. For now I'll just remove my hotfix and leave a TODO there.

> But I actually think (and have tested successfully in my own code) that not only should the sign be flipped, but the magnitude should also be scaled by the mean difference; otherwise all the vectors will just keep the norm of 1 returned by PCA, and it will be very hard to balance mixing control vectors from early and later layers, where the mean hidden state differs by 1-2 orders of magnitude.

Cool! Maybe this is also related to the fact that the generated control vector is only effective if I apply it to layers higher than 10 (i.e. --control-vector-layer-range 10 31).

@jukofyork (Contributor) commented Jun 24, 2024

Hopefully this isn't confusing as I'm actually using more than 2 classes (num_dataset_types), but this is essentially what my code would do for 2 classes:

projected_scores = [self._project_data_onto_component(d, component) for d in data]
mean_differences = [self._compute_mean_difference(projected_scores[0], projected_scores[1])]  # 2 classes only!
for j in range(num_dataset_types - 1):
    scaled_direction = -mean_differences[j] * component
    direction_matrices[j][layer_index].append(torch.tensor(scaled_direction))

def _project_data_onto_component(self, data, component):
    return np.dot(data, component.reshape(-1, 1)).squeeze()

@staticmethod
def _compute_mean_difference(projected_scores1, projected_scores2):
    return np.mean(projected_scores1) - np.mean(projected_scores2)

To use the same logic as the old code where you just keep the norms of 1:

scaled_direction = -math.copysign(1.0, mean_differences[j]) * component

I'm also being careful to use the delta hidden_state[i] - hidden_state[i-1] instead of just hidden_state[i] (where hidden_state[0] is the hidden state before the first block, etc.), but the original Python code is very obtuse and I'm not sure if they did this or if it really makes any difference... It's always a good idea to normalize anything you can like this for PCA, or else you risk having the first principal component end up doing the normalization for you and screwing up all the subsequent components due to the orthogonality constraint!
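
For what it's worth, a minimal sketch of that delta preprocessing (illustrative NumPy; hidden_states is assumed to be a list of per-layer [n_samples, n_embd] matrices, index 0 being the state before the first block):

```python
import numpy as np

def layer_deltas(hidden_states: list[np.ndarray]) -> list[np.ndarray]:
    """Per-block contributions hidden_state[i] - hidden_state[i-1]. Feeding these
    deltas to PCA instead of the raw hidden states avoids the first component
    merely soaking up the overall growth in activation norm across layers."""
    return [hidden_states[i] - hidden_states[i - 1] for i in range(1, len(hidden_states))]
```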

@christianazinn (Contributor) commented

> @HatsuneMikuUwU33 The main reason is that the completions file is used to prepare the input data for training, while the main goal of cvector-generator is, as the name suggests, to calculate the vector file. The natures of these two steps are different.
>
> At this stage, I don't want to mix these two steps in the same program because it makes the code too complex. The actual completion step mentioned in #7514 (comment) can be done better with llama-cli, which gives the user much more control over the input data.
>
> If we absolutely want to have completions here, I'd suggest writing a dedicated program (maybe a shell script or Python) to do the data-preparation step.
>
> CC @christianazinn

This looks fine. Ideally I'd like to have an example Python/shell script that shows how to format the data, but this is not urgent.
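
For example, something along these lines would probably be enough (a hypothetical Python sketch; the file names positive.txt / negative.txt and the llama-3 template are meant to mirror the example in this PR, and none of it is part of the actual codebase):

```python
# Hypothetical helper: wrap raw prompts in the llama-3 chat template and write
# one prompt per line with newlines escaped, one file per class.
TEMPLATE = (
    "<|start_header_id|>system<|end_header_id|>\n\n"
    "Act as if you're extremely {trait}.<|eot_id|>"
    "<|start_header_id|>user<|end_header_id|>\n\n"
    "{prompt}<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)

def write_prompts(path: str, trait: str, prompts: list[str]) -> None:
    with open(path, "w") as f:
        for p in prompts:
            # escape real newlines so each prompt stays on a single line
            f.write(TEMPLATE.format(trait=trait, prompt=p).replace("\n", "\\n") + "\n")

write_prompts("positive.txt", "happy", ["Sing a song", "Tell me a bedtime story"])
write_prompts("negative.txt", "sad",   ["Sing a song", "Tell me a bedtime story"])
```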

@ngxson added the "merge ready" label on Jun 25, 2024
@ngxson merged commit 49c03c7 into ggerganov:master on Jun 25, 2024
63 checks passed
Labels
examples, merge ready, Review Complexity : Low
5 participants