
Dynamic 8x7B Mixtral Model

Nous-Hermes-2-Mixtral-8x7B-17m-DPO-raw: 17 MoE FF Layers, 15 Dense FF Layers

Model Details

Model Description

This is a MoE layer-pruning experiment derived from Nous-Hermes-2-Mixtral-8x7B-DPO, so it uses the same ChatML format for conversations.
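
For reference, a single user turn can be formatted either by writing the ChatML tags by hand (as in the example under Uses) or through the tokenizer's chat template. The sketch below is an illustration only: it assumes the base model's ChatML template is bundled with this tokenizer, and the model path is a placeholder.

from transformers import AutoTokenizer

# Assumption: the tokenizer ships the base model's ChatML chat template.
tokenizer = AutoTokenizer.from_pretrained("path/to/Nous-Hermes-2-Mixtral-8x7B-17m-DPO-raw")  # placeholder path

messages = [{"role": "user", "content": "How are you? Write a story for me please"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
# Expected ChatML layout:
# <|im_start|>user
# How are you? Write a story for me please<|im_end|>
# <|im_start|>assistant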

In 15 of the 32 layers, the MoE block is merged into a normal feed-forward layer (17/32 layers remain MoE), which reduces the total parameter count from 47B to 14B.

The indices of the pruned layers are as follows:

[3, 4, 7, 10, 11, 23, 24, 25, 26, 27, 28, 29]
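
As a sanity check, the MoE/dense split can be inspected after loading the model (see the Uses section below for loading code). This is a minimal sketch and assumes the custom architecture keeps the usual Mixtral block_sparse_moe submodule on the remaining MoE layers and drops it on the pruned, dense ones:

# Sketch: list MoE vs. dense feed-forward layers in the loaded model.
# Assumption: pruned layers no longer expose a `block_sparse_moe` submodule.
moe_layers, dense_layers = [], []
for idx, layer in enumerate(model.model.layers):
    if hasattr(layer, "block_sparse_moe"):
        moe_layers.append(idx)
    else:
        dense_layers.append(idx)
print(f"MoE layers   ({len(moe_layers)}): {moe_layers}")
print(f"Dense layers ({len(dense_layers)}): {dense_layers}")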
  • Developed by: MistralAI, NousResearch, theblackcat
  • Model type: Modified Mixtral Architecture for dynamic MoE
  • License: apache-2.0

Model Sources

  • Repository: [More Information Needed]
  • Paper: [More Information Needed]
  • Demo: [More Information Needed]

Uses

This model is still at an experimental stage; we are still looking for the sweet spot that runs in just under 24 GB of GPU memory with a 4-bit quantization config.

import torch
from transformers import AutoTokenizer
# CustomMixtralForCausalLM is defined by this repo's custom modeling code
# (loaded alongside the checkpoint via trust_remote_code).

model_path = "path/to/Nous-Hermes-2-Mixtral-8x7B-17m-DPO-raw"  # local path or Hub repo id

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = CustomMixtralForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    load_in_4bit=True,        # 4-bit quantization via bitsandbytes
    trust_remote_code=True,
)

# Total parameter count, in billions
pytorch_total_params = sum(p.numel() for p in model.parameters())
print(pytorch_total_params / 1e9)

max_length = 100
# ChatML prompt, same format as the base Nous-Hermes-2-Mixtral-8x7B-DPO
input_text = """<|im_start|>user\nHow are you? Write a story for me please<|im_end|>\n<|im_start|>assistant\n"""
input_ids = tokenizer(input_text, return_tensors="pt")["input_ids"].to("cuda")
print(len(input_ids[0]))  # prompt length in tokens

output = model.generate(input_ids, max_length=max_length, temperature=0.7, repetition_penalty=1.1, do_sample=True)
print(tokenizer.decode(output[0]))
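
To check whether a given configuration actually stays under the 24 GB budget, peak GPU memory can be measured around a generation pass with standard PyTorch CUDA statistics; this is a sketch for verification, not part of the original recipe:

import torch

# Reset peak-memory statistics, run one generation, then read the peak back.
torch.cuda.reset_peak_memory_stats()
_ = model.generate(input_ids, max_length=max_length, temperature=0.7, repetition_penalty=1.1, do_sample=True)
peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak GPU memory during generation: {peak_gib:.1f} GiB")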