In the unsupervised setting of SimCSE, a sentence embedding model is trained by feeding the same sentence through the Transformer encoder twice. The key ingredient is the Dropout inside the Transformer, which acts as ‘noise’: because the dropout masks differ between the two forward passes, the identical input yields two distinct vectors, which are then treated as a positive pair. As the paper discusses, this plays the same role as the augmentations used in contrastive learning for images, and it turns out to work well for natural language as well.
While studying SimCSE, I was initially confused about where this Dropout lives. I assumed it would be configured in the head that compares the two representations during training, for example in an MLP (multi-layer perceptron). A closer reading of the paper, however, made it clear that the Dropout in question is the one already present inside the internal layers of the Transformer, which resolved my confusion. To make this concrete, a minimal sketch of the unsupervised objective follows.
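The sketch below is my own illustration, not the authors' code. It assumes a Hugging Face encoder such as the AutoModel used later in this post, [CLS] pooling, and a temperature of 0.05; the function and variable names are mine.
import torch
import torch.nn.functional as F

def simcse_unsup_loss(model, batch, temperature=0.05):
    """In-batch contrastive loss for unsupervised SimCSE (illustrative sketch)."""
    model.train()  # keep the Transformer's internal Dropout active
    # Encode the same batch twice; the differing dropout masks provide the "noise".
    z1 = model(**batch).last_hidden_state[:, 0]  # [CLS] vectors, first pass
    z2 = model(**batch).last_hidden_state[:, 0]  # [CLS] vectors, second pass
    # Cosine similarity between every sentence in pass 1 and every sentence in pass 2.
    sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / temperature
    # Each sentence's second view is its positive; the other sentences act as in-batch negatives.
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, labels)
The only thing distinguishing the two views of a sentence here is the dropout mask inside the encoder, which is exactly the point the paper makes.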
For instance, in a model like BERT, Dropout is configured in several places, including the attention layers and the feed-forward output layers, as the following snippet and its printed module structure show:
from transformers import AutoModel

# Load a pretrained Japanese BERT and print its module structure.
base_model_name = "cl-tohoku/bert-base-japanese-v3"
testmodel = AutoModel.from_pretrained(base_model_name)
print(testmodel)
# BertModel(
#   (embeddings): BertEmbeddings(
#     (word_embeddings): Embedding(32768, 768, padding_idx=0)
#     (position_embeddings): Embedding(512, 768)
#     (token_type_embeddings): Embedding(2, 768)
#     (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
#     (dropout): Dropout(p=0.1, inplace=False)
#   )
#   (encoder): BertEncoder(
#     (layer): ModuleList(
#       (0-11): 12 x BertLayer(
#         (attention): BertAttention(
#           (self): BertSelfAttention(
#             (query): Linear(in_features=768, out_features=768, bias=True)
#             (key): Linear(in_features=768, out_features=768, bias=True)
#             (value): Linear(in_features=768, out_features=768, bias=True)
#             (dropout): Dropout(p=0.1, inplace=False)
#           )
#           (output): BertSelfOutput(
#             (dense): Linear(in_features=768, out_features=768, bias=True)
#             (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
#             (dropout): Dropout(p=0.1, inplace=False)
#           )
#         )
#         (intermediate): BertIntermediate(
#           (dense): Linear(in_features=768, out_features=3072, bias=True)
#           (intermediate_act_fn): GELUActivation()
#         )
#         (output): BertOutput(
#           (dense): Linear(in_features=3072, out_features=768, bias=True)
#           (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
#           (dropout): Dropout(p=0.1, inplace=False)
#         )
#       )
#     )
#   )
#   (pooler): BertPooler(
#     (dense): Linear(in_features=768, out_features=768, bias=True)
#     (activation): Tanh()
#   )
# )
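Since these internal Dropout layers are active whenever the model is in training mode, two forward passes over the exact same input already yield slightly different [CLS] vectors, which is the ‘noise’ SimCSE exploits. The following quick check reuses the testmodel loaded above; the tokenizer call, example sentence, and variable names are my own additions (this particular tokenizer also requires fugashi and unidic-lite to be installed).
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(base_model_name)
inputs = tokenizer("これはテストの文です。", return_tensors="pt")

testmodel.train()  # keep the internal Dropout layers active
with torch.no_grad():
    v1 = testmodel(**inputs).last_hidden_state[:, 0]  # [CLS] vector, pass 1
    v2 = testmodel(**inputs).last_hidden_state[:, 0]  # [CLS] vector, pass 2

# Different dropout masks make the two vectors of the same sentence differ slightly.
print(F.cosine_similarity(v1, v2))  # close to, but not exactly, 1.0
If you want to experiment with the strength of this noise, the p=0.1 values shown in the printout correspond to the hidden_dropout_prob and attention_probs_dropout_prob fields of the BERT config.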
This point about where Dropout is configured seems to be a common source of confusion, as evidenced by the number of related issues raised in the GitHub repository of the SimCSE paper (SimCSE GitHub Issues). It is a subtle but important detail for anyone digging into this area.
Reference:
Gao, T., Yao, X., and Chen, D. (2021). SimCSE: Simple Contrastive Learning of Sentence Embeddings. EMNLP 2021.