chargoddard committed
Commit e34fafc (1 parent: 8de93b6)

Update README.md

Files changed (1):
  1. README.md (+16, -10)
README.md CHANGED
@@ -18,17 +18,23 @@ SuperNova-Medius is designed to excel in a variety of business use cases, includ

  The development of SuperNova-Medius involved a sophisticated multi-teacher, cross-architecture distillation process, with the following key steps:

- 1. **Logit Distillation from Llama-3.1-405B-Instruct**:
-    - We distilled the logits of Llama-3.1-405B-Instruct to Qwen2.5-14B using KL-divergence as the loss function. This allowed us to capture the nuanced distribution of Llama's outputs while adapting them to Qwen's architecture.
- 
- 2. **Logit and Hidden State Distillation from Qwen2.5-72B-Instruct**:
-    - Further distillation was performed using a combination of logit and hidden state distillation from Qwen2.5-72B-Instruct to ensure that SuperNova-Medius inherited the strong instruction-following capabilities and domain-specific knowledge of Qwen2.5.
- 
- 3. **Cross-Architecture Vocabulary Alignment**:
-    - Using `mergekit-tokensurgeon`, we aligned the vocabularies and hidden states of both teacher models, allowing for seamless integration of knowledge across the different architectures. This enabled SuperNova-Medius to effectively combine the strengths of both models.
- 
- 4. **Final Fusion and Fine-Tuning**:
-    - After aligning the vocabularies, a final fusion and fine-tuning step was conducted, using a specialized dataset from [EvolKit](https://github.com/arcee-ai/EvolKit) to ensure that SuperNova-Medius maintained coherence, fluency, and context understanding across a broad range of tasks.
+ 1. **Logit Distillation from Llama 3.1 405B**:
+    - We distilled the logits of Llama 3.1 405B using an offline approach.
+    - The top K logits for each token were stored to capture most of the probability mass while managing storage requirements.
+ 
+ 2. **Cross-Architecture Adaptation**:
+    - Using `mergekit-tokensurgeon`, we created a version of Qwen2.5-14B that uses the vocabulary of Llama 3.1 405B.
+    - This allowed for the use of Llama 3.1 405B logits in training the Qwen-based model.
+ 
+ 3. **Distillation to Qwen Architecture**:
+    - The adapted Qwen2.5-14B model was trained using the stored 405B logits as the target.
+ 
+ 4. **Parallel Qwen Distillation**:
+    - In a separate process, Qwen2-72B was distilled into a 14B model.
+ 
+ 5. **Final Fusion and Fine-Tuning**:
+    - The Llama-distilled Qwen model's vocabulary was reverted to Qwen vocabulary.
+    - After re-aligning the vocabularies, a final fusion and fine-tuning step was conducted, using a specialized dataset from [EvolKit](https://github.com/arcee-ai/EvolKit) to ensure that SuperNova-Medius maintained coherence, fluency, and context understanding across a broad range of tasks.

  ## Performance Evaluation

@@ -61,7 +67,7 @@ SuperNova-Medius is available for use under the Apache-2.0 license. For those wh
  - **Distillation Sources**: Qwen2.5-72B-Instruct, Llama-3.1-405B-Instruct
  - **Parameter Count**: 14 billion
  - **Training Dataset**: Custom instruction dataset generated with [EvolKit](https://github.com/arcee-ai/EvolKit)
- - **Distillation Technique**: Multi-architecture logit and hidden state distillation with cross-architecture vocabulary alignment.
+ - **Distillation Technique**: Multi-architecture offline logit distillation with cross-architecture vocabulary alignment.

  ## Summary

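The updated step 1 above describes capturing teacher logits offline and keeping only the top K per token. As a rough sketch of that idea only (not Arcee's actual pipeline; the checkpoint identifier, the value of K, and the on-disk format are all assumptions), using the Hugging Face `transformers` API:

```python
# Illustrative sketch of offline top-K logit capture from a teacher model.
# The checkpoint name, K, and the storage format are assumptions for the example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

TEACHER_ID = "meta-llama/Llama-3.1-405B-Instruct"  # assumed identifier
TOP_K = 64                                         # assumed value of K

tokenizer = AutoTokenizer.from_pretrained(TEACHER_ID)
teacher = AutoModelForCausalLM.from_pretrained(TEACHER_ID, torch_dtype=torch.bfloat16)
teacher.eval()

@torch.no_grad()
def capture_topk(text: str) -> dict:
    inputs = tokenizer(text, return_tensors="pt")
    logits = teacher(**inputs).logits                    # [1, seq_len, vocab_size]
    values, indices = torch.topk(logits, TOP_K, dim=-1)  # keep only the top-K entries
    return {
        "input_ids": inputs["input_ids"].cpu(),
        "topk_values": values.cpu(),                     # most of the probability mass
        "topk_indices": indices.cpu(),                   # vocabulary ids of those entries
    }

record = capture_topk("Summarize the benefits of knowledge distillation.")
torch.save(record, "llama405b_topk_logits.pt")           # reused later for offline training
```

Storing only K values and indices per position, rather than the full vocabulary-sized distribution, is what keeps the offline logit dataset tractable.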
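Steps 1 and 3 together amount to training the adapted student against those stored logits; the earlier revision of this section named KL divergence as the loss. Below is a minimal sketch of such a loss, assuming the student and teacher already share a vocabulary (which is what the tokensurgeon step arranges) and that the teacher distribution is renormalized over its retained top-K entries:

```python
# Sketch of a KL-divergence distillation loss computed only over the stored
# top-K teacher logits. Renormalizing the teacher over its top-K entries is an
# approximation of the full distribution.
import torch
import torch.nn.functional as F

def topk_kl_distillation_loss(
    student_logits: torch.Tensor,   # [batch, seq, vocab]
    topk_values: torch.Tensor,      # [batch, seq, K] stored teacher logits
    topk_indices: torch.Tensor,     # [batch, seq, K] vocabulary ids
    temperature: float = 1.0,
) -> torch.Tensor:
    # Teacher distribution restricted to (and renormalized over) the top-K entries.
    teacher_probs = F.softmax(topk_values / temperature, dim=-1)
    # Student log-probabilities over the full vocabulary, gathered at the same ids.
    student_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
    student_logprobs_at_k = torch.gather(student_logprobs, dim=-1, index=topk_indices)
    # KL(teacher || student), summed over the K entries, averaged over positions.
    kl = teacher_probs * (torch.log(teacher_probs + 1e-9) - student_logprobs_at_k)
    return kl.sum(dim=-1).mean()
```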
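Step 2 relies on `mergekit-tokensurgeon` to give Qwen2.5-14B the Llama 3.1 405B vocabulary so the stored logits line up with the student's output space. The sketch below shows only the underlying idea of such a swap, reusing embedding rows for tokens the two vocabularies share; it is not mergekit's implementation, handling of tokens missing from the original vocabulary (the hard part) is left out, and all names are placeholders.

```python
# Illustrative-only sketch of re-mapping an embedding matrix onto a donor
# vocabulary: tokens shared by both tokenizers keep their learned rows, and
# everything else falls back to the mean embedding. Real tools such as
# mergekit-tokensurgeon use more careful approximations for the missing tokens.
import torch

def remap_embedding_matrix(
    old_embeddings: torch.Tensor,   # [old_vocab_size, hidden_dim]
    old_vocab: dict[str, int],      # token -> id in the original tokenizer
    donor_vocab: dict[str, int],    # token -> id in the donor tokenizer
) -> torch.Tensor:
    # Fallback row for tokens the original model has never seen.
    mean_row = old_embeddings.mean(dim=0, keepdim=True)
    new_embeddings = mean_row.repeat(len(donor_vocab), 1).clone()
    for token, donor_id in donor_vocab.items():
        old_id = old_vocab.get(token)
        if old_id is not None:
            new_embeddings[donor_id] = old_embeddings[old_id]  # reuse the learned row
    return new_embeddings

# Toy usage: both the input embeddings and the LM head would need this treatment,
# and step 5 would apply the reverse mapping to return to the Qwen vocabulary.
old_vocab = {"<s>": 0, "hello": 1, "world": 2}
donor_vocab = {"<s>": 0, "hello": 1, "there": 2, "world": 3}
remapped = remap_embedding_matrix(torch.randn(3, 8), old_vocab, donor_vocab)
print(remapped.shape)  # torch.Size([4, 8])
```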