Update README.md
README.md (CHANGED)

@@ -103,9 +103,9 @@ Example 3:
 The architecture is a modification of a standard decoder-only transformer.
 
 The llama-2-70b models have been modified from a standard transformer in the following ways:
 
-* It uses [grouped-query attention](https://arxiv.org/pdf/2305.13245.pdf) (GQA), a generalization of multi-query attention which uses an intermediate number of key-value heads.
 * It uses the [SwiGLU activation function](https://arxiv.org/abs/2002.05202)
 * It uses [rotary positional embeddings](https://arxiv.org/abs/2104.09864) (RoPE)
+* It uses [grouped-query attention](https://arxiv.org/pdf/2305.13245.pdf) (GQA), a generalization of multi-query attention which uses an intermediate number of key-value heads.
 
 | Hyperparameter | Value |
 |----------------|-------|
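For context on the bullet this commit moves: grouped-query attention sits between standard multi-head attention (one key-value head per query head) and multi-query attention (one key-value head shared by all query heads), using an intermediate number of key-value heads. Below is a minimal numpy sketch of that idea, not Llama's actual implementation; it omits causal masking and RoPE, and all names here are illustrative.

```python
import numpy as np

def grouped_query_attention(x, wq, wk, wv, n_heads, n_kv_heads):
    # Illustrative sketch of grouped-query attention (not Llama's code).
    # n_kv_heads == n_heads    -> standard multi-head attention
    # n_kv_heads == 1          -> multi-query attention
    # 1 < n_kv_heads < n_heads -> grouped-query attention
    seq_len, d_model = x.shape
    head_dim = d_model // n_heads
    group_size = n_heads // n_kv_heads  # query heads sharing one kv head

    # Queries get n_heads heads; keys and values get only n_kv_heads heads.
    q = (x @ wq).reshape(seq_len, n_heads, head_dim)
    k = (x @ wk).reshape(seq_len, n_kv_heads, head_dim)
    v = (x @ wv).reshape(seq_len, n_kv_heads, head_dim)

    out = np.empty_like(q)
    for h in range(n_heads):
        kv = h // group_size  # map each query head to its shared kv head
        scores = (q[:, h] @ k[:, kv].T) / np.sqrt(head_dim)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
        out[:, h] = weights @ v[:, kv]
    return out.reshape(seq_len, d_model)

# Toy usage: 8 query heads sharing 2 key-value heads.
rng = np.random.default_rng(0)
d_model, n_heads, n_kv_heads = 64, 8, 2
head_dim = d_model // n_heads
x = rng.standard_normal((10, d_model))
wq = rng.standard_normal((d_model, n_heads * head_dim))
wk = rng.standard_normal((d_model, n_kv_heads * head_dim))
wv = rng.standard_normal((d_model, n_kv_heads * head_dim))
print(grouped_query_attention(x, wq, wk, wv, n_heads, n_kv_heads).shape)  # (10, 64)
```

The practical payoff is that the key-value cache shrinks by a factor of n_heads / n_kv_heads, since only the n_kv_heads projections need to be stored per token during decoding.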