*Memos:
- My post explains Transformer layer.
- My post explains RNN().
- My post explains LSTM().
- My post explains GRU().
- My post explains manual_seed().
- My post explains requires_grad.
Transformer() can get the 2D or 3D tensor of one or more elements computed by a Transformer from two 2D or 3D tensors (`src` and `tgt`) of one or more elements as shown below:
*Memos:
- The 1st argument for initialization is `d_model`(Optional-Default:`512`-Type:`int`):
  *Memos:
  - It must be `1 <= x`.
  - It must be the same as the number of elements in the deepest dimension of `src` and `tgt`.
  - It must be divisible by `nhead`.
- The 2nd argument for initialization is `nhead`(Optional-Default:`8`-Type:`int`). *It must be `1 <= x`.
- The 3rd argument for initialization is `num_encoder_layers`(Optional-Default:`6`-Type:`int`). *It must be `1 <= x`.
- The 4th argument for initialization is `num_decoder_layers`(Optional-Default:`6`-Type:`int`). *It must be `1 <= x`.
- The 5th argument for initialization is `dim_feedforward`(Optional-Default:`2048`-Type:`int`):
  *Memos:
  - It must be `0 <= x`.
  - `0` does nothing.
- The 6th argument for initialization is `dropout`(Optional-Default:`0.1`-Type:`int` or `float`). *It must be `0 <= x <= 1`.
- The 7th argument for initialization is `activation`(Optional-Default:`'relu'`-Type:`str` or activation function):
  *Memos:
  - `'relu'` or `'gelu'` can be set for `str`.
  - An activation function can be directly set. *Not just ReLU() or GELU() but also LeakyReLU(), Sigmoid(), Softmax(), etc. can be set.
- The 8th argument for initialization is `custom_encoder`(Optional-Default:`None`-Type:transformer encoder). *TransformerEncoder() can be set.
- The 9th argument for initialization is `custom_decoder`(Optional-Default:`None`-Type:transformer decoder). *TransformerDecoder() can be set.
- The 10th argument for initialization is `layer_norm_eps`(Optional-Default:`1e-05`-Type:`int` or `float`).
- The 11th argument for initialization is `batch_first`(Optional-Default:`False`-Type:`bool`).
- The 12th argument for initialization is `norm_first`(Optional-Default:`False`-Type:`bool`).
- The 13th argument for initialization is `bias`(Optional-Default:`True`-Type:`bool`). *My post explains `bias` argument.
- The 14th argument for initialization is `device`(Optional-Default:`None`-Type:`str`, `int` or device()):
  *Memos:
  - If it's `None`, get_default_device() is used. *My post explains `get_default_device()` and set_default_device().
  - `device=` can be omitted.
  - My post explains `device` argument.
- The 15th argument for initialization is `dtype`(Optional-Default:`None`-Type:dtype):
  *Memos:
  - If it's `None`, get_default_dtype() is used. *My post explains `get_default_dtype()` and set_default_dtype().
  - `dtype=` can be omitted.
  - My post explains `dtype` argument.
- The 1st argument is `src`(Required-Type:`tensor` of `float`):
  *Memos:
  - It must be the 2D or 3D tensor of one or more elements.
  - Its D must be the same as `tgt`'s.
  - The number of elements in the deepest dimension must be the same as `d_model` and `tgt`'s.
  - Its `device` and `dtype` must be the same as `tgt`'s and Transformer()'s.
  - The returned tensor's `requires_grad` is `True` even if `src`'s `requires_grad` is `False` (the default), because the parameters of Transformer() require gradients.
- The 2nd argument is `tgt`(Required-Type:`tensor` of `float`):
  *Memos:
  - It must be the 2D or 3D tensor of one or more elements.
  - Its D must be the same as `src`'s.
  - The number of elements in the deepest dimension must be the same as `d_model` and `src`'s.
  - Its `device` and `dtype` must be the same as `src`'s and Transformer()'s.
  - The returned tensor's `requires_grad` is `True` even if `tgt`'s `requires_grad` is `False` (the default), because the parameters of Transformer() require gradients.
- The 3rd argument is `src_mask`(Optional-Default:`None`-Type:`tensor` of `float` or `bool`). *It must be the 2D or 3D tensor of one or more elements.
- The 4th argument is `tgt_mask`(Optional-Default:`None`-Type:`tensor` of `float` or `bool`). *It must be the 2D or 3D tensor of one or more elements.
- The 5th argument is `memory_mask`(Optional-Default:`None`-Type:`tensor` of `float` or `bool`). *It must be the 2D or 3D tensor of one or more elements.
- The 6th argument is `src_key_padding_mask`(Optional-Default:`None`-Type:`tensor` of `float` or `bool`). *It must be the 1D (unbatched input) or 2D (batched input) tensor of one or more elements.
- The 7th argument is `tgt_key_padding_mask`(Optional-Default:`None`-Type:`tensor` of `float` or `bool`). *It must be the 1D (unbatched input) or 2D (batched input) tensor of one or more elements.
- The 8th argument is `memory_key_padding_mask`(Optional-Default:`None`-Type:`tensor` of `float` or `bool`). *It must be the 1D (unbatched input) or 2D (batched input) tensor of one or more elements.
- The 9th argument is `src_is_causal`(Optional-Default:`None`-Type:`bool`).
- The 10th argument is `tgt_is_causal`(Optional-Default:`None`-Type:`bool`).
- The 11th argument is `memory_is_causal`(Optional-Default:`False`-Type:`bool`).
- The `device` and `dtype`(`float`) of `src_mask`, `tgt_mask`, `memory_mask`, `src_key_padding_mask`, `tgt_key_padding_mask` and `memory_key_padding_mask` must be the same as Transformer()'s, `src`'s and `tgt`'s.
- The `dtype`(`bool`) of `src_mask`, `tgt_mask`, `memory_mask`, `src_key_padding_mask`, `tgt_key_padding_mask` and `memory_key_padding_mask` must be the same.
- `tran1.device` and `tran1.dtype` don't work because Transformer() has no such attributes.
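
The constraints on `d_model`, `nhead`, `src` and `tgt` listed above can be checked with a small sketch. The layer counts, `dim_feedforward=8` and the tensor sizes are arbitrary assumptions chosen to keep the model tiny, and the exact exception types may differ between PyTorch versions, so the sketch catches `Exception` broadly:

```python
import torch
from torch import nn

# d_model=4 is divisible by nhead=2, so construction succeeds.
model = nn.Transformer(d_model=4, nhead=2, num_encoder_layers=1,
                       num_decoder_layers=1, dim_feedforward=8)

# d_model=4 is not divisible by nhead=3, so construction is expected to fail.
try:
    nn.Transformer(d_model=4, nhead=3, num_encoder_layers=1,
                   num_decoder_layers=1, dim_feedforward=8)
    bad_nhead_rejected = False
except Exception as e:  # exception type is not assumed here
    bad_nhead_rejected = True
    print(type(e).__name__, e)

# The deepest dimension of src and tgt must equal d_model.
src = torch.randn(3, 4)  # unbatched: (source length, d_model)
tgt = torch.randn(5, 4)  # unbatched: (target length, d_model)
out = model(src=src, tgt=tgt)
print(out.shape)          # the output has the same shape as tgt
print(src.requires_grad)  # the input is left unchanged
print(out.requires_grad)  # the output tracks gradients via the parameters

# A deepest dimension of 6 does not match d_model=4, so the call is expected to fail.
try:
    model(src=torch.randn(3, 6), tgt=tgt)
    bad_width_rejected = False
except Exception as e:
    bad_width_rejected = True
    print(type(e).__name__, e)
```

The sequence lengths of `src` and `tgt` may differ (3 and 5 here); only the deepest dimension is tied to `d_model`.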
import torch
from torch import nn
tensor1 = torch.tensor([[8., -3., 0., 1.]])
tensor2 = torch.tensor([[5., 9., -4., 8.],
[-2., 7., 3., 6.]])
tensor1.requires_grad
tensor2.requires_grad
# False
torch.manual_seed(42)
tran1 = nn.Transformer(d_model=4, nhead=2)
tensor3 = tran1(src=tensor1, tgt=tensor2)
tensor3
# tensor([[1.5608, 0.1450, -0.6434, -1.0624],
# [0.8815, 1.0994, -1.1523, -0.8286]],
# grad_fn=<NativeLayerNormBackward0>)
tensor3.requires_grad
# True
tran1
# Transformer(
# (encoder): TransformerEncoder(
# (layers): ModuleList(
# (0-5): 6 x TransformerEncoderLayer(
# (self_attn): MultiheadAttention(
# (out_proj): NonDynamicallyQuantizableLinear(
# in_features=4, out_features=4, bias=True
# )
# )
# (linear1): Linear(in_features=4, out_features=2048, bias=True)
# (dropout): Dropout(p=0.1, inplace=False)
# (linear2): Linear(in_features=2048, out_features=4, bias=True)
# (norm1): LayerNorm((4,), eps=1e-05, elementwise_affine=True)
# (norm2): LayerNorm((4,), eps=1e-05, elementwise_affine=True)
# (dropout1): Dropout(p=0.1, inplace=False)
# (dropout2): Dropout(p=0.1, inplace=False)
# )
# )
# (norm): LayerNorm((4,), eps=1e-05, elementwise_affine=True)
# )
# (decoder): TransformerDecoder(
# (layers): ModuleList(
# (0-5): 6 x TransformerDecoderLayer(
# (self_attn): MultiheadAttention(
# (out_proj): NonDynamicallyQuantizableLinear(
# in_features=4, out_features=4, bias=True
# )
# )
# (multihead_attn): MultiheadAttention(
# (out_proj): NonDynamicallyQuantizableLinear(
# in_features=4, out_features=4, bias=True
# )
# )
# (linear1): Linear(in_features=4, out_features=2048, bias=True)
# (dropout): Dropout(p=0.1, inplace=False)
# (linear2): Linear(in_features=2048, out_features=4, bias=True)
# (norm1): LayerNorm((4,), eps=1e-05, elementwise_affine=True)
# (norm2): LayerNorm((4,), eps=1e-05, elementwise_affine=True)
# (norm3): LayerNorm((4,), eps=1e-05, elementwise_affine=True)
# (dropout1): Dropout(p=0.1, inplace=False)
# (dropout2): Dropout(p=0.1, inplace=False)
# (dropout3): Dropout(p=0.1, inplace=False)
# )
# )
# (norm): LayerNorm((4,), eps=1e-05, elementwise_affine=True)
# )
# )
tran1.encoder
# TransformerEncoder(
# (layers): ModuleList(
# (0-5): 6 x TransformerEncoderLayer(
# (self_attn): MultiheadAttention(
# (out_proj): NonDynamicallyQuantizableLinear(
# in_features=4, out_features=4, bias=True
# )
# )
# (linear1): Linear(in_features=4, out_features=2048, bias=True)
# (dropout): Dropout(p=0.1, inplace=False)
# (linear2): Linear(in_features=2048, out_features=4, bias=True)
# (norm1): LayerNorm((4,), eps=1e-05, elementwise_affine=True)
# (norm2): LayerNorm((4,), eps=1e-05, elementwise_affine=True)
# (dropout1): Dropout(p=0.1, inplace=False)
# (dropout2): Dropout(p=0.1, inplace=False)
# )
# )
# (norm): LayerNorm((4,), eps=1e-05, elementwise_affine=True)
# )
tran1.decoder
# TransformerDecoder(
# (layers): ModuleList(
# (0-5): 6 x TransformerDecoderLayer(
# (self_attn): MultiheadAttention(
# (out_proj): NonDynamicallyQuantizableLinear(
# in_features=4, out_features=4, bias=True
# )
# )
# (multihead_attn): MultiheadAttention(
# (out_proj): NonDynamicallyQuantizableLinear(
# in_features=4, out_features=4, bias=True
# )
# )
# (linear1): Linear(in_features=4, out_features=2048, bias=True)
# (dropout): Dropout(p=0.1, inplace=False)
# (linear2): Linear(in_features=2048, out_features=4, bias=True)
# (norm1): LayerNorm((4,), eps=1e-05, elementwise_affine=True)
# (norm2): LayerNorm((4,), eps=1e-05, elementwise_affine=True)
# (norm3): LayerNorm((4,), eps=1e-05, elementwise_affine=True)
# (dropout1): Dropout(p=0.1, inplace=False)
# (dropout2): Dropout(p=0.1, inplace=False)
# (dropout3): Dropout(p=0.1, inplace=False)
# )
# )
# (norm): LayerNorm((4,), eps=1e-05, elementwise_affine=True)
# )
tran1.d_model
# 4
tran1.nhead
# 2
tran1.batch_first
# False
torch.manual_seed(42)
tran2 = nn.Transformer(d_model=4, nhead=2)
tran2(src=tensor2, tgt=tensor3)
# tensor([[-0.8631, 1.6747, -0.6517, -0.1599],
# [-0.0919, 1.6377, -0.5336, -1.0122]],
# grad_fn=<NativeLayerNormBackward0>)
torch.manual_seed(42)
tran = nn.Transformer(d_model=4, nhead=2, num_encoder_layers=6,
num_decoder_layers=6, dim_feedforward=2048, dropout=0.1,
activation='relu', custom_encoder=None, custom_decoder=None,
layer_norm_eps=1e-05, batch_first=False, norm_first=False,
bias=True, device=None, dtype=None)
tran(src=tensor1, tgt=tensor2, src_mask=None, tgt_mask=None,
memory_mask=None, src_key_padding_mask=None,
tgt_key_padding_mask=None, memory_key_padding_mask=None,
src_is_causal=None, tgt_is_causal=None, memory_is_causal=False)
# tensor([[1.5608, 0.1450, -0.6434, -1.0624],
# [0.8815, 1.0994, -1.1523, -0.8286]],
# grad_fn=<NativeLayerNormBackward0>)
tensor1 = torch.tensor([[8., -3.], [0., 1.]])
tensor2 = torch.tensor([[5., 9.], [-4., 8.],
[-2., 7.], [3., 6.]])
torch.manual_seed(42)
tran = nn.Transformer(d_model=2, nhead=2)
tran(src=tensor1, tgt=tensor2)
# tensor([[1.0000, -1.0000],
# [-1.0000, 1.0000],
# [-1.0000, 1.0000],
# [-1.0000, 1.0000]], grad_fn=<NativeLayerNormBackward0>)
tensor1 = torch.tensor([[8.], [-3.], [0.], [1.]])
tensor2 = torch.tensor([[5.], [9.], [-4.], [8.],
[-2.], [7.], [3.], [6.]])
torch.manual_seed(42)
tran = nn.Transformer(d_model=1, nhead=1)
tran(src=tensor1, tgt=tensor2)
# tensor([[0.], [0.], [0.], [0.], [0.], [0.], [0.], [0.]],
# grad_fn=<NativeLayerNormBackward0>)
tensor1 = torch.tensor([[[8.], [-3.], [0.], [1.]]])
tensor2 = torch.tensor([[[5.], [9.], [-4.], [8.]],
[[-2.], [7.], [3.], [6.]]])
torch.manual_seed(42)
tran = nn.Transformer(d_model=1, nhead=1)
tran(src=tensor1, tgt=tensor2)
# tensor([[[0.], [0.], [0.], [0.]],
# [[0.], [0.], [0.], [0.]]], grad_fn=<NativeLayerNormBackward0>)
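
The examples above pass no masks. As a hedged sketch of batched use with `batch_first=True`, a causal `tgt_mask` and a `src_key_padding_mask`, the following may help. The sizes `N`, `S`, `T`, `E`, the single encoder and decoder layer and `dropout=0.0` are assumptions made only to keep the example small and repeatable. Both masks are `bool` so that their `dtype` matches, as the memos above require:

```python
import torch
from torch import nn

torch.manual_seed(0)

N, S, T, E = 2, 5, 3, 8  # batch size, source length, target length, d_model

# dropout=0.0 keeps the forward pass repeatable in training mode.
model = nn.Transformer(d_model=E, nhead=2, num_encoder_layers=1,
                       num_decoder_layers=1, dim_feedforward=16,
                       dropout=0.0, batch_first=True)

src = torch.randn(N, S, E)  # (batch, source length, d_model)
tgt = torch.randn(N, T, E)  # (batch, target length, d_model)

# Causal mask: True marks the positions a target token must not attend to.
tgt_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

# Padding mask: True marks padded source positions
# (here the last two tokens of the second sample).
src_key_padding_mask = torch.zeros(N, S, dtype=torch.bool)
src_key_padding_mask[1, 3:] = True

out = model(src=src, tgt=tgt, tgt_mask=tgt_mask,
            src_key_padding_mask=src_key_padding_mask,
            memory_key_padding_mask=src_key_padding_mask)
out_shape = tuple(out.shape)
print(out_shape)  # (N, T, E): same shape as tgt
```

The same padding mask is passed as `memory_key_padding_mask` so that the decoder also ignores the padded source positions when attending to the encoder output.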
Transformer().generate_square_subsequent_mask() can get the 2D square tensor (a causal mask) of zero or more elements, filled with `0.`(Default), `0.+0.j` or `False` on and below the diagonal and `-inf`(Default), `-inf+0.j` or `True` above the diagonal, as shown below:
*Memos:
- The 1st argument is `sz`(Required-Type:`int`). *It must be `0 <= x`.
- The 2nd argument is `device`(Optional-Default:`None`-Type:`str`, `int` or device()):
  *Memos:
  - If it's `None`, `cpu` is set.
  - `device=` can be omitted.
  - My post explains `device` argument.
- The 3rd argument is `dtype`(Optional-Default:`None`-Type:dtype):
  *Memos:
  - If it's `None`, `float32` is set.
  - `dtype=` can be omitted.
  - My post explains `dtype` argument.
import torch
from torch import nn
tran = nn.Transformer()
tran.generate_square_subsequent_mask(sz=3)
tran.generate_square_subsequent_mask(sz=3, device=None, dtype=None)
# tensor([[0., -inf, -inf],
# [0., 0., -inf],
# [0., 0., 0.]])
tran.generate_square_subsequent_mask(sz=5)
# tensor([[0., -inf, -inf, -inf, -inf],
# [0., 0., -inf, -inf, -inf],
# [0., 0., 0., -inf, -inf],
# [0., 0., 0., 0., -inf],
# [0., 0., 0., 0., 0.]])
tran.generate_square_subsequent_mask(sz=5, dtype=torch.complex64)
# tensor([[0.+0.j, -inf+0.j, -inf+0.j, -inf+0.j, -inf+0.j],
# [0.+0.j, 0.+0.j, -inf+0.j, -inf+0.j, -inf+0.j],
# [0.+0.j, 0.+0.j, 0.+0.j, -inf+0.j, -inf+0.j],
# [0.+0.j, 0.+0.j, 0.+0.j, 0.+0.j, -inf+0.j],
# [0.+0.j, 0.+0.j, 0.+0.j, 0.+0.j, 0.+0.j]])
tran.generate_square_subsequent_mask(sz=5, dtype=torch.bool)
# tensor([[False, True, True, True, True],
# [False, False, True, True, True],
# [False, False, False, True, True],
# [False, False, False, False, True],
# [False, False, False, False, False]])
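
To see why the float mask uses `-inf`, here is a minimal sketch. It assumes a recent PyTorch version in which `generate_square_subsequent_mask()` is a static method, and it assumes the default `float32` mask can be rebuilt by hand with `torch.triu()` and `torch.full()`. The random `scores` tensor only stands in for attention scores and is not taken from a real model:

```python
import torch
from torch import nn

sz = 4

# The method is static here, so no Transformer() instance is created.
mask = nn.Transformer.generate_square_subsequent_mask(sz)

# Hand-built equivalent: fill with -inf, then keep only the part above the diagonal.
manual = torch.triu(torch.full((sz, sz), float('-inf')), diagonal=1)
same = torch.equal(mask, manual)
print(same)

# Adding the mask to the scores before softmax gives future positions
# a weight of exactly zero, while each row still sums to 1.
torch.manual_seed(0)
scores = torch.randn(sz, sz)  # stand-in for attention scores
weights = torch.softmax(scores + mask, dim=-1)
print(weights)
```

Each row `i` of `weights` is non-zero only in columns `0` to `i`, which is the "no looking ahead" behaviour the mask is meant to enforce.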