Super Kai (Kazuya Ito)
Activation functions in PyTorch (4)


*Memos:

(1) GELU(Gaussian Error Linear Unit):

  • can convert an input value (x) to an output value by scaling x by its cumulative probability under a Gaussian distribution, optionally using a Tanh approximation. *0 is exclusive except when x = 0.
  • 's formula is y = x * Φ(x) = 0.5x(1 + erf(x / √2)), or with the Tanh approximation, y = 0.5x(1 + tanh(√(2/π)(x + 0.044715x^3))). *Both of them get almost the same results.
  • is GELU() in PyTorch (see the code sketch below).
  • is used in:
    • Transformer. *Transformer() in PyTorch.
    • NLP (Natural Language Processing) models based on Transformer such as ChatGPT, BERT (Bidirectional Encoder Representations from Transformers), etc. *Strictly speaking, ChatGPT and BERT are based on Large Language Models (LLMs), which are based on Transformer.
  • 's pros:
    • It mitigates Vanishing Gradient Problem.
    • It mitigates Dying ReLU Problem. *0 is still produced for the input value 0 so Dying ReLU Problem is not completely avoided.
  • 's cons:
    • It's computationally expensive because of complex operations including Erf (error function) or Tanh.
  • 's graph in Desmos:

[GELU graph in Desmos]
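
The following is a minimal sketch of GELU() in PyTorch, comparing the exact Erf-based form with the Tanh approximation. The input tensor values are arbitrary examples, and the commented outputs are approximate:

```python
import torch
from torch import nn

gelu_exact = nn.GELU()                   # exact form, computed with Erf
gelu_tanh = nn.GELU(approximate='tanh')  # Tanh approximation

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

print(gelu_exact(x))  # ≈ tensor([-0.0455, -0.1543, 0.0000, 0.3457, 1.9545])
print(gelu_tanh(x))   # almost the same results as the exact form
```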

(2) Mish:

  • can convert an input value (x) to an output value by x * Tanh(Softplus(x)). *0 is exclusive except when x = 0.
  • 's formula is y = x * tanh(log(1 + e^x)).
  • is Mish() in PyTorch (see the code sketch below).
  • 's pros:
    • It mitigates Vanishing Gradient Problem.
    • It mitigates Dying ReLU Problem. *0 is still produced for the input value 0 so Dying ReLU Problem is not completely avoided.
  • 's cons:
    • It's computationally expensive because of the Tanh and Softplus operations.
  • 's graph in Desmos:

[Mish graph in Desmos]
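
A minimal sketch of Mish() in PyTorch, checking it against x * Tanh(Softplus(x)) computed manually. The input values are arbitrary examples:

```python
import torch
from torch import nn
import torch.nn.functional as F

mish = nn.Mish()

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

print(mish(x))                        # 0.0 stays 0.0; negative inputs give slightly negative outputs
print(x * torch.tanh(F.softplus(x)))  # same results, computed manually
```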

(3) SiLU(Sigmoid-Weighted Linear Units):

  • can convert an input value (x) to an output value by x * Sigmoid(x). *0 is exclusive except when x = 0.
  • 's formula is y = x / (1 + e^(-x)).
  • is also called Swish.
  • is SiLU() in PyTorch (see the code sketch below).
  • 's pros:
    • It mitigates Vanishing Gradient Problem.
    • It mitigates Dying ReLU Problem. *0 is still produced for the input value 0 so Dying ReLU Problem is not completely avoided.
  • 's cons:
    • It's computationally expensive because of the Sigmoid operation.
  • 's graph in Desmos:

[SiLU graph in Desmos]
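
A minimal sketch of SiLU() in PyTorch, checking it against x * Sigmoid(x) computed manually. The input values are arbitrary examples, and the commented outputs are approximate:

```python
import torch
from torch import nn

silu = nn.SiLU()

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

print(silu(x))               # ≈ tensor([-0.2384, -0.1888, 0.0000, 0.3112, 1.7616])
print(x * torch.sigmoid(x))  # same results, computed manually
```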

(4) Softplus:

  • can convert an input value (x) to an output value between 0 and ∞. *0 is exclusive.
  • 's formula is y = log(1 + e^x).
  • is Softplus() in PyTorch (see the code sketch below).
  • 's pros:
    • It normalizes input values.
    • The convergence is stable.
    • It mitigates Vanishing Gradient Problem.
    • It mitigates Exploding Gradient Problem.
    • It avoids Dying ReLU Problem.
  • 's cons:
    • It's computationally expensive because of the log and exponential operations.
  • 's graph in Desmos:

[Softplus graph in Desmos]
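
A minimal sketch of Softplus() in PyTorch. The beta and threshold arguments shown are the documented defaults, the input values are arbitrary examples, and the commented outputs are approximate:

```python
import torch
from torch import nn

softplus = nn.Softplus(beta=1.0, threshold=20.0)  # these are the default arguments

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

print(softplus(x))                  # ≈ tensor([0.1269, 0.4741, 0.6931, 0.9741, 2.1269])
print(torch.log(1 + torch.exp(x)))  # same formula, computed manually
```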
