---
title: "Activation functions in Neural Networks"
author: "Sébastien De Greef"
format:
  revealjs: 
    theme: solarized
    navigation-mode: grid 
    controls-layout: bottom-right
    controls-tutorial: true
---

# Activation functions

When choosing an activation function, consider the following:

-   **Non-saturation:** Avoid activations that saturate (e.g., sigmoid, tanh) to prevent vanishing gradients.

-   **Computational efficiency:** Choose activations that are computationally efficient (e.g., ReLU, Swish) for large models or real-time applications.

-   **Smoothness:** Smooth activations (e.g., GELU, Mish) can help with optimization and convergence.

-   **Domain knowledge:** Select activations based on the problem domain and desired output (e.g., softmax for multi-class classification).

-   **Experimentation:** Try different activations and evaluate their performance on your specific task.

## Sigmoid {#sec-sigmoid}

**Strengths:** Maps any real-valued number to a value between 0 and 1, making it suitable for binary classification problems.

**Weaknesses:** Saturates (i.e., output values approach 0 or 1) for large inputs, leading to vanishing gradients during backpropagation.

**Usage:** Binary classification, logistic regression.

::: columns
::: {.column width="50%"}
$$ 
\sigma(x) = \frac{1}{1 + e^{-x}}
$$

``` python
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
```
:::

::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-sigmoid >}}
:::
:::

## Hyperbolic Tangent (Tanh) {#sec-tanh}

**Strengths:** Similar to sigmoid, but maps to (-1, 1), which can be beneficial for some models.

**Weaknesses:** Also saturates, leading to vanishing gradients.

**Usage:** Similar to sigmoid, but with a larger output range.

::: columns
::: {.column width="50%"}
$$
\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}
$$

``` python
def tanh(x):
    return np.tanh(x)
```
:::

::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-tanh >}}
:::
:::

## Rectified Linear Unit (ReLU)

**Strengths:** Computationally efficient, non-saturating, and easy to compute.

**Weaknesses:** Not differentiable at x=0, which can cause issues during optimization.

**Usage:** Default activation function in many deep learning frameworks, suitable for most neural networks.

::: columns
::: {.column width="50%"}
$$
\text{ReLU}(x) = \max(0, x)
$$

``` python
def relu(x):
    return np.maximum(0, x)
```
:::

::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-relu >}}
:::
:::

## Leaky ReLU

**Strengths:** Similar to ReLU, but allows a small fraction of the input to pass through, helping with dying neurons.

**Weaknesses:** Still non-differentiable at x=0.

**Usage:** Alternative to ReLU, especially when dealing with dying neurons.

::: columns
::: {.column width="50%"}
$$
\text{Leaky ReLU}(x) = 
\begin{cases} 
x & \text{if } x > 0 \\
\alpha x & \text{if } x \leq 0 
\end{cases}
$$

``` python
def leaky_relu(x, alpha=0.01):
    # where α is a small constant (e.g., 0.01)
    return np.where(x > 0, x, x * alpha)
```
:::

::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-leaky_relu >}}
:::
:::

## Swish

**Strengths:** Self-gated, adaptive, and non-saturating.

**Weaknesses:** Computationally expensive, requires additional learnable parameters.

**Usage:** Can be used in place of ReLU or other activations, but may not always outperform them.

::: columns
::: {.column width="50%"}
$$
\text{Swish}(x) = x \cdot \sigma(x)
$$

``` python
def swish(x):
    return x * sigmoid(x)
```

See also: [sigmoid](#sec-sigmoid)
:::

::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-swish >}}
:::
:::

## Mish

**Strengths:** Non-saturating, smooth, and computationally efficient.

**Weaknesses:** Not as well-studied as ReLU or other activations.

**Usage:** Alternative to ReLU, especially in computer vision tasks.

::: columns
::: {.column width="50%"}
$$
\text{Mish}(x) = x \cdot \tanh(\text{Softplus}(x))
$$

``` python
def mish(x):
    return x * np.tanh(softplus(x))
```
:::

::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-mish >}}
:::
:::

See also: [softplus](#softplus) [tanh](#sec-tanh)

## Softmax

**Strengths:** Normalizes output to ensure probabilities sum to 1, making it suitable for multi-class classification.

**Weaknesses:** Only suitable for output layers with multiple classes.

**Usage:** Output layer activation for multi-class classification problems.

::: columns
::: {.column width="50%"}
$$
\text{Softmax}(x_i) = \frac{e^{x_i}}{\sum_{k=1}^{K} e^{x_k}}
$$

``` python
def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()
```
:::

::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-softmax >}}
:::
:::

## Softsign

**Strengths:** Similar to sigmoid, but with a more gradual slope.

**Weaknesses:** Not commonly used, may not provide significant benefits over sigmoid or tanh.

**Usage:** Alternative to sigmoid or tanh in certain situations.

::: columns
::: {.column width="50%"}
$$
\text{Softsign}(x) = \frac{x}{1 + |x|}
$$

``` python
def softsign(x):
    return x / (1 + np.abs(x))
```
:::

::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-softsign >}}
:::
:::

## SoftPlus {#softplus}

**Strengths:** Smooth, continuous, and non-saturating.

**Weaknesses:** Not commonly used, may not outperform other activations.

**Usage:** Experimental or niche applications.

::: columns
::: {.column width="50%"}
$$
\text{Softplus}(x) = \log(1 + e^x)
$$

``` python
def softplus(x):
    return np.log1p(np.exp(x))
```
:::

::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-softplus >}}
:::
:::

## ArcTan

**Strengths:** Non-saturating, smooth, and continuous.

**Weaknesses:** Not commonly used, may not outperform other activations.

**Usage:** Experimental or niche applications.

::: columns
::: {.column width="50%"}
$$
arctan(x) = arctan(x)
$$

``` python
def arctan(x):
    return np.arctan(x)
```
:::

::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-arctan >}}
:::
:::

## Gaussian Error Linear Unit (GELU)

**Strengths:** Non-saturating, smooth, and computationally efficient.

**Weaknesses:** Not as well-studied as ReLU or other activations.

**Usage:** Alternative to ReLU, especially in Bayesian neural networks.

::: columns
::: {.column width="50%"}
$$
\text{GELU}(x) = x \cdot \Phi(x)
$$

``` python
def gelu(x):
    return 0.5 * x 
        * (1 + np.tanh(np.sqrt(2 / np.pi) 
        * (x + 0.044715 * np.power(x, 3))))
```
:::

::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-gelu >}}
:::
:::

See also: [tanh](#sec-tanh)

## Silu (SiLU)

$$
silu(x) = x * sigmoid(x)
$$

**Strengths:** Non-saturating, smooth, and computationally efficient.

**Weaknesses:** Not as well-studied as ReLU or other activations.

**Usage:** Alternative to ReLU, especially in computer vision tasks.

## GELU Approximation (GELU Approx.)

$$
f(x) ≈ 0.5 * x * (1 + tanh(√(2/π) * (x + 0.044715 * x^3)))
$$

**Strengths:** Fast, non-saturating, and smooth.

**Weaknesses:** Approximation, not exactly equal to GELU.

**Usage:** Alternative to GELU, especially when computational efficiency is crucial.

## SELU (Scaled Exponential Linear Unit)

$$
f(x) = \lambda 
\begin{cases} 
x & x > 0 \\
\alpha e^x - \alpha & x \leq 0 
\end{cases}
$$

**Strengths:** Self-normalizing, non-saturating, and computationally efficient.

**Weaknesses:** Requires careful initialization and α tuning.

**Usage:** Alternative to ReLU, especially in deep neural networks.

\listoffigures