Super Kai (Kazuya Ito)

Activation functions in PyTorch (4)


(1) GELU(Gaussian Error Linear Unit):

  • can convert an input value (x) to an output value by multiplying x by the probability that a standard Gaussian variable is less than or equal to x (the Gaussian CDF), optionally using a Tanh approximation. *0 is exclusive except when x = 0.
  • 's formula is y = x * Φ(x) = 0.5x * (1 + erf(x / √2)), where Φ is the standard Gaussian CDF, or with the Tanh approximation: y = 0.5x * (1 + tanh(√(2/π) * (x + 0.044715x³))). *Both of them get almost the same results.
  • is GELU() in PyTorch (see the code sketch below).
  • is used in:
    • Transformer. *Transformer() in PyTorch.
    • NLP (Natural Language Processing) models based on Transformer, such as ChatGPT, BERT (Bidirectional Encoder Representations from Transformers), etc. *Strictly speaking, ChatGPT and BERT are based on Large Language Models (LLMs), which are based on Transformer.
  • 's pros:
    • It mitigates Vanishing Gradient Problem.
    • It mitigates Dying ReLU Problem. *0 is still produced for the input value 0 so Dying ReLU Problem is not completely avoided.
  • 's cons:
    • It's computationally expensive because of the complex operations involving Erf (Error function) or Tanh.
  • 's graph in Desmos:

[GELU graph]
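A minimal sketch of how GELU() might be applied in PyTorch; the input tensor and printed values are only illustrative, and the approximate='tanh' option is assumed to be available (it is in recent PyTorch versions):

```python
import math
import torch
from torch import nn

gelu = nn.GELU()                         # exact form, uses Erf
gelu_tanh = nn.GELU(approximate='tanh')  # Tanh approximation

x = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0])

print(gelu(x))       # tensor([-0.0455, -0.1587,  0.0000,  0.8413,  1.9545])
print(gelu_tanh(x))  # almost the same values as above

# The same result as gelu(x), computed manually: 0.5 * x * (1 + erf(x / sqrt(2)))
print(0.5 * x * (1 + torch.erf(x / math.sqrt(2.0))))
```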

(2) Mish:

  • can convert an input value(x) to an output value by x * Tanh(Softplus(x)). *0 is exclusive except when x = 0.
  • 's formula is y = x * tanh(log(1 + e^x)).
  • is Mish() in PyTorch (see the code sketch below).
  • 's pros:
    • It mitigates Vanishing Gradient Problem.
    • It mitigates Dying ReLU Problem. *0 is still produced for the input value 0 so Dying ReLU Problem is not completely avoided.
  • 's cons:
    • It's computationally expensive because of the Tanh and Softplus operations.
  • 's graph in Desmos:

[Mish graph]
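A minimal sketch of how Mish() might be applied in PyTorch; the input tensor and printed values are only illustrative:

```python
import torch
from torch import nn
import torch.nn.functional as F

mish = nn.Mish()

x = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0])

print(mish(x))  # tensor([-0.2525, -0.3034,  0.0000,  0.8651,  1.9440])

# The same result computed manually: x * tanh(softplus(x))
print(x * torch.tanh(F.softplus(x)))
```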

(3) SiLU(Sigmoid-Weighted Linear Units):

  • can convert an input value(x) to an output value by x * Sigmoid(x). *0 is exclusive except when x = 0.
  • 's formula is y = x / (1 + e^(-x)).
  • is also called Swish.
  • is SiLU() in PyTorch (see the code sketch below).
  • 's pros:
    • It mitigates Vanishing Gradient Problem.
    • It mitigates Dying ReLU Problem. *0 is still produced for the input value 0 so Dying ReLU Problem is not completely avoided.
  • 's cons:
    • It's computationally expensive because of the Sigmoid operation.
  • 's graph in Desmos:

[SiLU graph]
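A minimal sketch of how SiLU() might be applied in PyTorch; the input tensor and printed values are only illustrative:

```python
import torch
from torch import nn

silu = nn.SiLU()

x = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0])

print(silu(x))  # tensor([-0.2384, -0.2689,  0.0000,  0.7311,  1.7616])

# The same result computed manually: x * sigmoid(x)
print(x * torch.sigmoid(x))
```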

(4) Softplus:

  • can convert an input value (x) to an output value between 0 and ∞. *0 is exclusive.
  • 's formula is y = log(1 + e^x).
  • is Softplus() in PyTorch (see the code sketch below).
  • 's pros:
    • It normalizes input values.
    • The convergence is stable.
    • It mitigates Vanishing Gradient Problem.
    • It mitigates Exploding Gradient Problem.
    • It avoids Dying ReLU Problem.
  • 's cons:
    • It's computationally expensive because of the log and exponential operations.
  • 's graph in Desmos:

[Softplus graph]
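A minimal sketch of how Softplus() might be applied in PyTorch; the input tensor and printed values are only illustrative:

```python
import torch
from torch import nn

softplus = nn.Softplus()  # defaults: beta=1.0, threshold=20.0

x = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0])

print(softplus(x))  # tensor([0.1269, 0.3133, 0.6931, 1.3133, 2.1269])

# The same result computed manually: log(1 + e^x)
print(torch.log(1.0 + torch.exp(x)))
```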
