---
title: "Activation functions"
notebook-links: false
crossref:
  lof-title: "List of Figures"
number-sections: false
---
When choosing an activation function, consider the following:

- **Non-saturation:** Avoid activations that saturate (e.g., sigmoid, tanh) to prevent vanishing gradients.
- **Computational efficiency:** Choose activations that are cheap to compute (e.g., ReLU, Swish) for large models or real-time applications.
- **Smoothness:** Smooth activations (e.g., GELU, Mish) can help with optimization and convergence.
- **Domain knowledge:** Select activations based on the problem domain and desired output (e.g., softmax for multi-class classification).
- **Experimentation:** Try different activations and evaluate their performance on your specific task (see the quick comparison below).

[Slideshow](activations_slideshow.qmd)

{{< embed ActivationFunctions.ipynb#fig-overview >}}
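Before looking at the individual functions, a quick numerical comparison makes the saturation trade-off concrete. The sketch below is a minimal, NumPy-only illustration (independent of the embedded notebook): it evaluates a few of the activations defined on this page over the same grid of inputs, so you can see which outputs flatten out for large |x|.

``` python
# Minimal comparison sketch (NumPy only): evaluate a few activations on a
# common grid of inputs to see which ones saturate for large |x|.
import numpy as np

x = np.linspace(-6, 6, 7)

activations = {
    "sigmoid": lambda z: 1 / (1 + np.exp(-z)),   # flattens toward 0 and 1
    "tanh": np.tanh,                             # flattens toward -1 and 1
    "softsign": lambda z: z / (1 + np.abs(z)),   # flattens toward -1 and 1
    "relu": lambda z: np.maximum(0, z),          # unbounded for z > 0
}

for name, fn in activations.items():
    print(f"{name:>8}: {np.round(fn(x), 3)}")
```

The saturating functions map the extreme inputs to nearly identical outputs, which is exactly where their gradients vanish.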
## Sigmoid {#sec-sigmoid}

**Strengths:** Maps any real-valued number to a value between 0 and 1, making it suitable for binary classification problems.

**Weaknesses:** Saturates (output approaches 0 or 1) for inputs of large magnitude, leading to vanishing gradients during backpropagation.

**Usage:** Binary classification, logistic regression.

::: columns
::: {.column width="50%"}
$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$

``` python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))
```
:::
::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-sigmoid >}}
:::
:::
## Hyperbolic Tangent (Tanh) {#sec-tanh}

**Strengths:** Similar to sigmoid, but maps inputs to (-1, 1); the zero-centered output can make optimization easier.

**Weaknesses:** Also saturates, leading to vanishing gradients.

**Usage:** Similar to sigmoid, preferred when zero-centered outputs are desired.

::: columns
::: {.column width="50%"}
$$
\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}
$$

``` python
def tanh(x):
    return np.tanh(x)
```
:::
::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-tanh >}}
:::
:::
## Rectified Linear Unit (ReLU)

**Strengths:** Computationally cheap, easy to implement, and does not saturate for positive inputs.

**Weaknesses:** Not differentiable at $x = 0$, and units can "die": once a neuron only receives negative inputs its gradient is zero and it stops updating.

**Usage:** Default activation function in many deep learning frameworks, suitable for most neural networks.

::: columns
::: {.column width="50%"}
$$
\text{ReLU}(x) = \max(0, x)
$$

``` python
def relu(x):
    return np.maximum(0, x)
```
:::
::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-relu >}}
:::
:::
## Leaky ReLU

**Strengths:** Like ReLU, but keeps a small, non-zero slope for negative inputs, which helps prevent dying neurons.

**Weaknesses:** Still not differentiable at $x = 0$, and the slope $\alpha$ is an extra hyperparameter.

**Usage:** Alternative to ReLU, especially when dealing with dying neurons.

::: columns
::: {.column width="50%"}
$$
\text{Leaky ReLU}(x) =
\begin{cases}
x & \text{if } x > 0 \\
\alpha x & \text{if } x \leq 0
\end{cases}
$$

``` python
def leaky_relu(x, alpha=0.01):
    # alpha is a small constant slope for negative inputs (e.g., 0.01)
    return np.where(x > 0, x, x * alpha)
```
:::
::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-leaky_relu >}}
:::
:::
## Swish

**Strengths:** Self-gated, smooth, and non-saturating for positive inputs. The general form is $\text{Swish}(x) = x \cdot \sigma(\beta x)$, where $\beta$ is either a fixed constant or a learnable parameter; the version below uses $\beta = 1$.

**Weaknesses:** More expensive to compute than ReLU; the learnable-$\beta$ variant adds extra parameters.

**Usage:** Can be used in place of ReLU or other activations, but may not always outperform them.

::: columns
::: {.column width="50%"}
$$
\text{Swish}(x) = x \cdot \sigma(x)
$$

``` python
def swish(x):
    return x * sigmoid(x)
```

See also: [sigmoid](#sec-sigmoid)
:::
::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-swish >}}
:::
:::
## Mish

**Strengths:** Smooth, non-monotonic, and non-saturating for positive inputs.

**Weaknesses:** More expensive to compute than ReLU and not as well studied.

**Usage:** Alternative to ReLU, especially in computer vision tasks.

::: columns
::: {.column width="50%"}
$$
\text{Mish}(x) = x \cdot \tanh(\text{Softplus}(x))
$$

``` python
def mish(x):
    # uses softplus() defined in the Softplus section below
    return x * np.tanh(softplus(x))
```
:::
::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-mish >}}
:::
:::

See also: [softplus](#softplus), [tanh](#sec-tanh)
## Softmax

**Strengths:** Normalizes the outputs so they are non-negative and sum to 1, making it suitable for multi-class classification.

**Weaknesses:** Only suitable for output layers; the outputs are coupled, so it is not used as a hidden-layer activation.

**Usage:** Output layer activation for multi-class classification problems.

::: columns
::: {.column width="50%"}
$$
\text{Softmax}(x_i) = \frac{e^{x_i}}{\sum_{k=1}^{K} e^{x_k}}
$$

``` python
def softmax(x):
    # subtracting the max improves numerical stability
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()
```
:::
::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-softmax >}}
:::
:::
## Softsign

**Strengths:** Similar to tanh (outputs in (-1, 1)), but approaches its asymptotes more gradually.

**Weaknesses:** Not commonly used, and may not provide significant benefits over sigmoid or tanh.

**Usage:** Alternative to sigmoid or tanh in certain situations.

::: columns
::: {.column width="50%"}
$$
\text{Softsign}(x) = \frac{x}{1 + |x|}
$$

``` python
def softsign(x):
    return x / (1 + np.abs(x))
```
:::
::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-softsign >}}
:::
:::
## Softplus {#softplus}

**Strengths:** Smooth, everywhere-differentiable approximation of ReLU with strictly positive output.

**Weaknesses:** More expensive to compute than ReLU, and gradients still become small for very negative inputs.

**Usage:** Experimental or niche applications, e.g., when a smooth, positive output is required.

::: columns
::: {.column width="50%"}
$$
\text{Softplus}(x) = \log(1 + e^x)
$$

``` python
def softplus(x):
    # numerically stable form of log(1 + exp(x))
    return np.logaddexp(0, x)
```
:::
::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-softplus >}}
:::
:::
## ArcTan

**Strengths:** Smooth and continuous, with output bounded in $(-\pi/2, \pi/2)$.

**Weaknesses:** Saturates for inputs of large magnitude; not commonly used and may not outperform other activations.

**Usage:** Experimental or niche applications.

::: columns
::: {.column width="50%"}
$$
\text{ArcTan}(x) = \arctan(x)
$$

``` python
def arctan(x):
    return np.arctan(x)
```
:::
::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-arctan >}}
:::
:::
## Gaussian Error Linear Unit (GELU)

**Strengths:** Smooth, non-monotonic, and non-saturating for positive inputs.

**Weaknesses:** More expensive to compute than ReLU; the exact form requires the Gaussian CDF, so a tanh-based approximation is often used in practice.

**Usage:** Default activation in many Transformer architectures (e.g., BERT, GPT) and a common alternative to ReLU.

::: columns
::: {.column width="50%"}
$$
\text{GELU}(x) = x \cdot \Phi(x)
$$

where $\Phi(x)$ is the standard normal CDF.

``` python
def gelu(x):
    # tanh approximation of x * Phi(x)
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi)
                                  * (x + 0.044715 * np.power(x, 3))))
```
:::
::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-gelu >}}
:::
:::

See also: [tanh](#sec-tanh)
## Sigmoid Linear Unit (SiLU)

**Strengths:** Smooth, non-monotonic, and non-saturating for positive inputs; identical to Swish with $\beta = 1$.

**Weaknesses:** More expensive to compute than ReLU and not as extensively studied.

**Usage:** Alternative to ReLU, especially in computer vision architectures.

::: columns
::: {.column width="50%"}
$$
\text{SiLU}(x) = x \cdot \sigma(x)
$$

``` python
def silu(x):
    # equivalent to x * sigmoid(x)
    return x / (1 + np.exp(-x))
```
:::
::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-silu >}}
:::
:::
## GELU Approximation

$$
\text{GELU}(x) \approx 0.5\,x\left(1 + \tanh\!\left(\sqrt{2/\pi}\,\left(x + 0.044715\,x^{3}\right)\right)\right)
$$

**Strengths:** Fast, smooth, and non-saturating for positive inputs.

**Weaknesses:** An approximation; its output differs slightly from the exact GELU.

**Usage:** Alternative to the exact GELU, especially when computational efficiency is crucial.
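To make the "differs slightly" caveat concrete, here is a small sketch (assuming SciPy is available for the error function) that compares the exact GELU, $x \cdot \Phi(x)$, against the tanh approximation above; the maximum gap over a modest input range is small but non-zero.

``` python
# Compare the exact GELU, x * Phi(x), with the tanh approximation above.
# Phi is the standard normal CDF, written here via the error function.
import numpy as np
from scipy.special import erf

def gelu_exact(x):
    return 0.5 * x * (1 + erf(x / np.sqrt(2)))

def gelu_tanh(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi)
                                  * (x + 0.044715 * np.power(x, 3))))

x = np.linspace(-4, 4, 401)
print(np.max(np.abs(gelu_exact(x) - gelu_tanh(x))))  # small, but non-zero
```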
## SELU (Scaled Exponential Linear Unit)

$$
\text{SELU}(x) = \lambda
\begin{cases}
x & x > 0 \\
\alpha \left(e^{x} - 1\right) & x \leq 0
\end{cases}
$$

**Strengths:** Self-normalizing (activations tend toward zero mean and unit variance across layers), non-saturating for positive inputs, and cheap to compute.

**Weaknesses:** The self-normalizing property only holds with suitable weight initialization (LeCun normal) and AlphaDropout; $\lambda$ and $\alpha$ are fixed constants derived in the SELU paper rather than tuned hyperparameters.

**Usage:** Alternative to ReLU, especially in deep, fully connected networks.
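A minimal NumPy sketch of SELU, using the fixed constants reported in the SELU paper (Klambauer et al., 2017), $\lambda \approx 1.0507$ and $\alpha \approx 1.6733$:

``` python
# SELU sketch with the fixed constants from Klambauer et al. (2017):
# lambda ≈ 1.0507, alpha ≈ 1.6733 (derived, not tuned).
import numpy as np

def selu(x, lam=1.0507, alpha=1.6733):
    # np.where evaluates both branches, so clip the input to expm1
    # to avoid overflow warnings for large positive x
    return lam * np.where(x > 0, x, alpha * np.expm1(np.minimum(x, 0)))
```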