Update README.md
README.md (CHANGED)

@@ -103,9 +103,9 @@ Example 3:
 The architecture is a modification of a standard decoder-only transformer.
 
 The llama-2-70b models have been modified from a standard transformer in the following ways:
 
-* It uses [grouped-query attention](https://arxiv.org/pdf/2305.13245.pdf) (GQA), a generalization of multi-query attention which uses an intermediate number of key-value heads.
 * It uses the [SwiGLU activation function](https://arxiv.org/abs/2002.05202)
 * It uses [rotary positional embeddings](https://arxiv.org/abs/2104.09864) (RoPE)
+* It uses [grouped-query attention](https://arxiv.org/pdf/2305.13245.pdf) (GQA), a generalization of multi-query attention which uses an intermediate number of key-value heads.
 
 | Hyperparameter | Value |
 |----------------|-------|
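For context on the bullet this commit moves: grouped-query attention sits between standard multi-head attention (one key-value head per query head) and multi-query attention (one key-value head shared by all query heads), using an intermediate number of key-value heads. Below is a minimal numpy sketch of that idea, not Llama's actual implementation; it omits causal masking and RoPE, and all names here are illustrative.

```python
import numpy as np

def grouped_query_attention(x, wq, wk, wv, n_heads, n_kv_heads):
    # Illustrative sketch of grouped-query attention (not Llama's code).
    # n_kv_heads == n_heads    -> standard multi-head attention
    # n_kv_heads == 1          -> multi-query attention
    # 1 < n_kv_heads < n_heads -> grouped-query attention
    seq_len, d_model = x.shape
    head_dim = d_model // n_heads
    group_size = n_heads // n_kv_heads  # query heads sharing one kv head

    # Queries get n_heads heads; keys and values get only n_kv_heads heads.
    q = (x @ wq).reshape(seq_len, n_heads, head_dim)
    k = (x @ wk).reshape(seq_len, n_kv_heads, head_dim)
    v = (x @ wv).reshape(seq_len, n_kv_heads, head_dim)

    out = np.empty_like(q)
    for h in range(n_heads):
        kv = h // group_size  # map each query head to its shared kv head
        scores = (q[:, h] @ k[:, kv].T) / np.sqrt(head_dim)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
        out[:, h] = weights @ v[:, kv]
    return out.reshape(seq_len, d_model)

# Toy usage: 8 query heads sharing 2 key-value heads.
rng = np.random.default_rng(0)
d_model, n_heads, n_kv_heads = 64, 8, 2
head_dim = d_model // n_heads
x = rng.standard_normal((10, d_model))
wq = rng.standard_normal((d_model, n_heads * head_dim))
wk = rng.standard_normal((d_model, n_kv_heads * head_dim))
wv = rng.standard_normal((d_model, n_kv_heads * head_dim))
print(grouped_query_attention(x, wq, wk, wv, n_heads, n_kv_heads).shape)  # (10, 64)
```

The practical payoff is that the key-value cache shrinks by a factor of n_heads / n_kv_heads, since only the n_kv_heads projections need to be stored per token during decoding.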