Replies: 1 comment 1 reply
-
Hi, what's not clear in the code comment? This was added with the … It seems your question is more about the difference in architecture between the two models, so it would probably be better to read each model's paper to understand them. cc @sayakpaul if there's something else to add.
-
Both Stable Diffusion 2.1 and Stable Diffusion XL use the second-to-last hidden state of the text encoder to compute cross-attention in the UNet.
But why does the Stable Diffusion pipeline use the last hidden state of the text encoder, and apply an additional layer norm operation?
diffusers/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py
Lines 392 to 407 in 70f8d4b
which the Stable Diffusion XL pipeline does not have:
diffusers/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py
Lines 403 to 407 in 70f8d4b
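To make the difference between the two conditioning paths concrete, here is a minimal sketch using a tiny randomly initialized CLIP text encoder (the config values are placeholders for illustration, not the real SD/SDXL encoder sizes). Inside `CLIPTextModel`, `last_hidden_state` has already passed through `final_layer_norm`, while `hidden_states[-2]` is taken before it; that is why the SD pipeline's `clip_skip` branch re-applies `final_layer_norm` by hand, and why SDXL, which consumes the penultimate state un-normalized, has no such call.

```python
import torch
from transformers import CLIPTextConfig, CLIPTextModel

# Tiny randomly initialized CLIP text encoder, purely for illustration;
# real pipelines load pretrained weights with from_pretrained(...).
config = CLIPTextConfig(
    vocab_size=1000, hidden_size=32, intermediate_size=64,
    num_hidden_layers=4, num_attention_heads=2, max_position_embeddings=77,
)
text_encoder = CLIPTextModel(config).eval()

input_ids = torch.randint(0, config.vocab_size, (1, 77))
with torch.no_grad():
    out = text_encoder(input_ids, output_hidden_states=True)

# SD 1.x/2.x default path: the last hidden state, which the model has
# already passed through final_layer_norm internally.
prompt_embeds_sd = out.last_hidden_state

# SDXL-style path: the penultimate hidden state, taken *before*
# final_layer_norm and used as-is (no extra norm).
prompt_embeds_sdxl = out.hidden_states[-2]

# The SD pipeline's clip_skip branch also starts from an earlier hidden
# state, but then re-applies final_layer_norm manually, which is the
# extra call visible in pipeline_stable_diffusion.py above:
prompt_embeds_clip_skip = text_encoder.text_model.final_layer_norm(
    out.hidden_states[-2]
)
```

Sanity check: `last_hidden_state` equals `final_layer_norm(hidden_states[-1])`, confirming that the norm is baked into the default output but not into the intermediate hidden states.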