---
title: "Activation functions"
notebook-links: false
crossref:
  lof-title: "List of Figures"
number-sections: false
---
When choosing an activation function, consider the following:

- **Non-saturation:** Avoid activations that saturate (e.g., sigmoid, tanh) to prevent vanishing gradients.
- **Computational efficiency:** Choose activations that are cheap to compute (e.g., ReLU, Swish) for large models or real-time applications.
- **Smoothness:** Smooth activations (e.g., GELU, Mish) can help with optimization and convergence.
- **Domain knowledge:** Select activations based on the problem domain and desired output (e.g., softmax for multi-class classification).
- **Experimentation:** Try different activations and evaluate their performance on your specific task (see the quick comparison below).

[Slideshow](activations_slideshow.qmd)

{{< embed ActivationFunctions.ipynb#fig-overview >}}
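Before looking at the individual functions, a quick numerical comparison makes the saturation trade-off concrete. The sketch below is a minimal, NumPy-only illustration (independent of the embedded notebook): it evaluates a few of the activations defined on this page over the same grid of inputs, so you can see which outputs flatten out for large |x|.

``` python
# Minimal comparison sketch (NumPy only): evaluate a few activations on a
# common grid of inputs to see which ones saturate for large |x|.
import numpy as np

x = np.linspace(-6, 6, 7)

activations = {
    "sigmoid": lambda z: 1 / (1 + np.exp(-z)),   # flattens toward 0 and 1
    "tanh": np.tanh,                             # flattens toward -1 and 1
    "softsign": lambda z: z / (1 + np.abs(z)),   # flattens toward -1 and 1
    "relu": lambda z: np.maximum(0, z),          # unbounded for z > 0
}

for name, fn in activations.items():
    print(f"{name:>8}: {np.round(fn(x), 3)}")
```

The saturating functions map the extreme inputs to nearly identical outputs, which is exactly where their gradients vanish.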
## Sigmoid {#sec-sigmoid}

**Strengths:** Maps any real-valued number to a value between 0 and 1, making it suitable for binary classification problems.

**Weaknesses:** Saturates (output approaches 0 or 1) for inputs of large magnitude, leading to vanishing gradients during backpropagation.

**Usage:** Binary classification, logistic regression.

::: columns
::: {.column width="50%"}
$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$

``` python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))
```
:::
::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-sigmoid >}}
:::
:::
## Hyperbolic Tangent (Tanh) {#sec-tanh}

**Strengths:** Similar to sigmoid, but maps inputs to (-1, 1); the zero-centered output can make optimization easier.

**Weaknesses:** Also saturates, leading to vanishing gradients.

**Usage:** Similar to sigmoid, preferred when zero-centered outputs are desired.

::: columns
::: {.column width="50%"}
$$
\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}
$$

``` python
def tanh(x):
    return np.tanh(x)
```
:::
::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-tanh >}}
:::
:::
## Rectified Linear Unit (ReLU)

**Strengths:** Computationally cheap, easy to implement, and does not saturate for positive inputs.

**Weaknesses:** Not differentiable at $x = 0$, and units can "die": once a neuron only receives negative inputs its gradient is zero and it stops updating.

**Usage:** Default activation function in many deep learning frameworks, suitable for most neural networks.

::: columns
::: {.column width="50%"}
$$
\text{ReLU}(x) = \max(0, x)
$$

``` python
def relu(x):
    return np.maximum(0, x)
```
:::
::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-relu >}}
:::
:::
## Leaky ReLU

**Strengths:** Like ReLU, but keeps a small, non-zero slope for negative inputs, which helps prevent dying neurons.

**Weaknesses:** Still not differentiable at $x = 0$, and the slope $\alpha$ is an extra hyperparameter.

**Usage:** Alternative to ReLU, especially when dealing with dying neurons.

::: columns
::: {.column width="50%"}
$$
\text{Leaky ReLU}(x) =
\begin{cases}
x & \text{if } x > 0 \\
\alpha x & \text{if } x \leq 0
\end{cases}
$$

``` python
def leaky_relu(x, alpha=0.01):
    # alpha is a small constant slope for negative inputs (e.g., 0.01)
    return np.where(x > 0, x, x * alpha)
```
:::
::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-leaky_relu >}}
:::
:::
## Swish

**Strengths:** Self-gated, smooth, and non-saturating for positive inputs. The general form is $\text{Swish}(x) = x \cdot \sigma(\beta x)$, where $\beta$ is either a fixed constant or a learnable parameter; the version below uses $\beta = 1$.

**Weaknesses:** More expensive to compute than ReLU; the learnable-$\beta$ variant adds extra parameters.

**Usage:** Can be used in place of ReLU or other activations, but may not always outperform them.

::: columns
::: {.column width="50%"}
$$
\text{Swish}(x) = x \cdot \sigma(x)
$$

``` python
def swish(x):
    return x * sigmoid(x)
```

See also: [sigmoid](#sec-sigmoid)
:::
::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-swish >}}
:::
:::
## Mish

**Strengths:** Smooth, non-monotonic, and non-saturating for positive inputs.

**Weaknesses:** More expensive to compute than ReLU and not as well studied.

**Usage:** Alternative to ReLU, especially in computer vision tasks.

::: columns
::: {.column width="50%"}
$$
\text{Mish}(x) = x \cdot \tanh(\text{Softplus}(x))
$$

``` python
def mish(x):
    # uses softplus() defined in the Softplus section below
    return x * np.tanh(softplus(x))
```
:::
::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-mish >}}
:::
:::

See also: [softplus](#softplus), [tanh](#sec-tanh)
## Softmax

**Strengths:** Normalizes the outputs so they are non-negative and sum to 1, making it suitable for multi-class classification.

**Weaknesses:** Only suitable for output layers; the outputs are coupled, so it is not used as a hidden-layer activation.

**Usage:** Output layer activation for multi-class classification problems.

::: columns
::: {.column width="50%"}
$$
\text{Softmax}(x_i) = \frac{e^{x_i}}{\sum_{k=1}^{K} e^{x_k}}
$$

``` python
def softmax(x):
    # subtracting the max improves numerical stability
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()
```
:::
::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-softmax >}}
:::
:::
## Softsign

**Strengths:** Similar to tanh (outputs in (-1, 1)), but approaches its asymptotes more gradually.

**Weaknesses:** Not commonly used, and may not provide significant benefits over sigmoid or tanh.

**Usage:** Alternative to sigmoid or tanh in certain situations.

::: columns
::: {.column width="50%"}
$$
\text{Softsign}(x) = \frac{x}{1 + |x|}
$$

``` python
def softsign(x):
    return x / (1 + np.abs(x))
```
:::
::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-softsign >}}
:::
:::
## Softplus {#softplus}

**Strengths:** Smooth, everywhere-differentiable approximation of ReLU with strictly positive output.

**Weaknesses:** More expensive to compute than ReLU, and gradients still become small for very negative inputs.

**Usage:** Experimental or niche applications, e.g., when a smooth, positive output is required.

::: columns
::: {.column width="50%"}
$$
\text{Softplus}(x) = \log(1 + e^x)
$$

``` python
def softplus(x):
    # numerically stable form of log(1 + exp(x))
    return np.logaddexp(0, x)
```
:::
::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-softplus >}}
:::
:::
## ArcTan

**Strengths:** Smooth and continuous, with output bounded in $(-\pi/2, \pi/2)$.

**Weaknesses:** Saturates for inputs of large magnitude; not commonly used and may not outperform other activations.

**Usage:** Experimental or niche applications.

::: columns
::: {.column width="50%"}
$$
\text{ArcTan}(x) = \arctan(x)
$$

``` python
def arctan(x):
    return np.arctan(x)
```
:::
::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-arctan >}}
:::
:::
## Gaussian Error Linear Unit (GELU)

**Strengths:** Smooth, non-monotonic, and non-saturating for positive inputs.

**Weaknesses:** More expensive to compute than ReLU; the exact form requires the Gaussian CDF, so a tanh-based approximation is often used in practice.

**Usage:** Default activation in many Transformer architectures (e.g., BERT, GPT) and a common alternative to ReLU.

::: columns
::: {.column width="50%"}
$$
\text{GELU}(x) = x \cdot \Phi(x)
$$

where $\Phi(x)$ is the standard normal CDF.

``` python
def gelu(x):
    # tanh approximation of x * Phi(x)
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi)
                                  * (x + 0.044715 * np.power(x, 3))))
```
:::
::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-gelu >}}
:::
:::

See also: [tanh](#sec-tanh)
## Sigmoid Linear Unit (SiLU)

**Strengths:** Smooth, non-monotonic, and non-saturating for positive inputs; identical to Swish with $\beta = 1$.

**Weaknesses:** More expensive to compute than ReLU and not as extensively studied.

**Usage:** Alternative to ReLU, especially in computer vision architectures.

::: columns
::: {.column width="50%"}
$$
\text{SiLU}(x) = x \cdot \sigma(x)
$$

``` python
def silu(x):
    # equivalent to x * sigmoid(x)
    return x / (1 + np.exp(-x))
```
:::
::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-silu >}}
:::
:::
## GELU Approximation

$$
\text{GELU}(x) \approx 0.5\,x\left(1 + \tanh\!\left(\sqrt{2/\pi}\,\left(x + 0.044715\,x^{3}\right)\right)\right)
$$

**Strengths:** Fast, smooth, and non-saturating for positive inputs.

**Weaknesses:** An approximation; its output differs slightly from the exact GELU.

**Usage:** Alternative to the exact GELU, especially when computational efficiency is crucial.
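To make the "differs slightly" caveat concrete, here is a small sketch (assuming SciPy is available for the error function) that compares the exact GELU, $x \cdot \Phi(x)$, against the tanh approximation above; the maximum gap over a modest input range is small but non-zero.

``` python
# Compare the exact GELU, x * Phi(x), with the tanh approximation above.
# Phi is the standard normal CDF, written here via the error function.
import numpy as np
from scipy.special import erf

def gelu_exact(x):
    return 0.5 * x * (1 + erf(x / np.sqrt(2)))

def gelu_tanh(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi)
                                  * (x + 0.044715 * np.power(x, 3))))

x = np.linspace(-4, 4, 401)
print(np.max(np.abs(gelu_exact(x) - gelu_tanh(x))))  # small, but non-zero
```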
## SELU (Scaled Exponential Linear Unit)

$$
\text{SELU}(x) = \lambda
\begin{cases}
x & x > 0 \\
\alpha \left(e^{x} - 1\right) & x \leq 0
\end{cases}
$$

**Strengths:** Self-normalizing (activations tend toward zero mean and unit variance across layers), non-saturating for positive inputs, and cheap to compute.

**Weaknesses:** The self-normalizing property only holds with suitable weight initialization (LeCun normal) and AlphaDropout; $\lambda$ and $\alpha$ are fixed constants derived in the SELU paper rather than tuned hyperparameters.

**Usage:** Alternative to ReLU, especially in deep, fully connected networks.
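A minimal NumPy sketch of SELU, using the fixed constants reported in the SELU paper (Klambauer et al., 2017), $\lambda \approx 1.0507$ and $\alpha \approx 1.6733$:

``` python
# SELU sketch with the fixed constants from Klambauer et al. (2017):
# lambda ≈ 1.0507, alpha ≈ 1.6733 (derived, not tuned).
import numpy as np

def selu(x, lam=1.0507, alpha=1.6733):
    # np.where evaluates both branches, so clip the input to expm1
    # to avoid overflow warnings for large positive x
    return lam * np.where(x > 0, x, alpha * np.expm1(np.minimum(x, 0)))
```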