turboderp committed on
Commit
2015387
1 Parent(s): ae95094

Upload README.md

Files changed (1)
  1. README.md +14 -7
README.md CHANGED
@@ -7,23 +7,30 @@ license: apache-2.0
  This is [Qwen2-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) with a Llama-3 vocabulary.

  The intended use is as a draft model for Llama-3-70B-Instruct. Llama3-8B-Instruct works for this purpose, but it's on
- the heavier side for drafting. Secondary purpose is just to explore the feasibility of vocabulary swaps.
+ the heavier side for drafting.
+
+ The secondary purpose is to explore the feasibility of vocabulary swaps, either for adapting small models like
+ Qwen2-0.5b to produce drafts for other models, or for interoperability between dissimilar language models in general.
+ The conclusion in this regard is that the method works, but, since finetuning is required, it will be expensive for
+ larger models. It would be interesting to explore low-rank or quantized finetuning as an alternative.

  ## Procedure

- The vocabulary was swapped by creating a new embedding layer (origianl model uses tied embeddings so the output layer is
+ The vocabulary was swapped by creating a new embedding layer (original model uses tied embeddings so the output layer is
  the same) and initializing it as follows:

- - every L3 token that has a corresponding Qwen2 token is initialized with the corresponding embedding
+ - every L3 token that is an exact match for a Qwen2 token is initialized with the corresponding embedding
  - every L3 token that decodes and re-encodes to multiple Qwen2 tokens is initialized with the mean of those embeddings
- - there are no L3 tokens that cannot be translated to one or more Qwen2 tokens (both vocabularies are complete)
+ - there are no L3 tokens that cannot be translated to one or more Qwen2 tokens (both vocabularies are complete).

  Swapping the vocabulary with the above method yields a mostly coherent but still very confused model. It especially
- struggles with numbers. (It likes to talk about people born in the year 1900670 having 695 beautiful children etc.)
+ struggles with numbers, and of course the embeddings for the Llama-3 control tokens do not have the significance they
+ would in an instruct-tuned model.

  This is remedied by subsequent finetuning, first on
  [this 2.41 million row sample from Common Crawl](https://huggingface.co/datasets/agentlans/common-crawl-sample), and
- subsequently on about 25000 completions produced by Llama3-8B-Instruct in the L3 instruct format for 3 epochs.
+ subsequently 3 epochs on about 25000 instruct-formatted completions produced by Llama3-8B-Instruct, included
+ [here](https://huggingface.co/turboderp/Qwama-0.5B-Instruct/blob/main/llama3-instruct-prompts.json) for reference.

  I did try tuning just the tied embeddings, but this didn't achieve good results.

@@ -61,4 +68,4 @@ Qwama-0.5B-instruct:

  ## EXL2 Quants

- EXL2 quants uploaded [here](https://huggingface.co/turboderp/Qwama-0.5B-Instruct-exl2).
+ EXL2 quants uploaded [here](https://huggingface.co/turboderp/Qwama-0.5B-Instruct-exl2).
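The initialization scheme described in the Procedure section can be sketched as follows. This is a minimal, self-contained illustration under stated assumptions, not the author's actual code: the greedy toy tokenizer and the helper names (`init_swapped_embeddings`, `donor_encode`) are hypothetical stand-ins for the real Llama-3 and Qwen2 tokenizers and the Qwen2-0.5B tied embedding matrix.

```python
# Sketch of the embedding-swap initialization: for each token string in the
# new (Llama-3-style) vocabulary, re-encode it with the donor (Qwen2-style)
# tokenizer and average the donor embedding rows. All names here are
# hypothetical stand-ins for the real tokenizers and embedding matrix.

def init_swapped_embeddings(new_vocab, donor_encode, donor_embed):
    """Build one embedding row per token string in new_vocab.

    - exact match: donor_encode returns a single id, so its row is copied
    - multi-token: the mean of the donor rows is used
    (every string is assumed encodable, i.e. the donor vocabulary is complete)
    """
    dim = len(donor_embed[0])
    new_embed = []
    for token_str in new_vocab:
        ids = donor_encode(token_str)
        row = [sum(donor_embed[i][d] for i in ids) / len(ids) for d in range(dim)]
        new_embed.append(row)
    return new_embed

# Toy donor vocabulary with 2-d embeddings (purely illustrative).
DONOR_VOCAB = {"ab": 0, "a": 1, "b": 2}
DONOR_EMBED = [[1.0, 0.0], [0.0, 1.0], [0.0, 3.0]]

def donor_encode(s):
    # Hypothetical greedy longest-match tokenizer over the toy vocab.
    ids, i = [], 0
    while i < len(s):
        piece = s[i:i + 2] if s[i:i + 2] in DONOR_VOCAB else s[i]
        ids.append(DONOR_VOCAB[piece])
        i += len(piece)
    return ids

emb = init_swapped_embeddings(["ab", "aab"], donor_encode, DONOR_EMBED)
# "ab" is an exact donor match, so its row is copied verbatim; "aab"
# re-encodes as ["a", "ab"], so it gets the mean of those two rows.
```

With real models, the decode/re-encode step would presumably go through the two Hugging Face tokenizers' `decode`/`encode` methods and the donor model's input embedding weights, with the resulting matrix installed as the new tied embedding before finetuning.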