---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- moe
- moah
- mod
datasets:
- Locutusque/UltraTextbooks
---

# Model Card for MoM: Mixture of Mixture

## Model Details

### Model Description

**MoM: Mixture of Mixture**

This model is a first test of combining the [Jamba](https://huggingface.co/ai21labs/Jamba-v0.1) architecture with mixture of attention heads (MoAH) and mixture of depth (MoD).

Only the attention layers are in bf16 precision; the rest of the model is in 1.58-bit precision.

17M of the 1025M total parameters are in bf16, i.e. roughly 1.7% of the parameters (a sketch for checking this split is given at the end of this card).

The goal is to develop and test whether this kind of architecture can achieve fast inference without too much quality loss.

- **Model type:** Mixture of attention heads, mixture of depth, and mixture of experts with 1.58-bit linear layers, except for **attention**
- **License:** Apache License 2.0

### Model Sources

- **Repository:** https://github.com/ostix360/optimized-LLM

## How to Get Started with the Model

If you want to test this model, please check out the repository at this [commit](https://github.com/ostix360/optimized-LLM/tree/796cfe43cf16461b92102cf0f41e8960cd91340b).

## Training Details

- **wandb**: [training details](https://wandb.ai/ostix360/Mixture%20of%20mixture%20(mod,%20moah%20moe)/runs/0ayclh2i)

### Training Data

We use the first ~0.5B tokens of [Locutusque/UltraTextbooks](https://huggingface.co/datasets/Locutusque/UltraTextbooks) to train this model.

### Training Procedure

We use 8-bit Adam with the default beta and epsilon values (an optimizer sketch is given at the end of this card).

#### Preprocessing

The data are formatted to the model's maximum sequence length, i.e. 512 tokens (a tokenization sketch is given at the end of this card).

#### Training Hyperparameters

Please refer to the wandb metadata or to train.py in the repository for the hyperparameters.

## Technical Specifications

### Compute Infrastructure

#### Hardware

- one RTX 4070 Ti GPU

#### Software

- PyTorch, Transformers, etc.
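
## Example Sketches

The precision split described above (17M bf16 parameters out of 1025M, roughly 1.7%) can be checked on any instantiated PyTorch module with a small helper like the one below. This is a generic sketch, not code from the repository, and it assumes that the 1.58-bit linear layers do not store their weights in bf16.

```python
import torch
from torch import nn


def bf16_parameter_share(model: nn.Module) -> tuple[int, int, float]:
    """Count parameters stored in bfloat16 and their share of the total.

    Assumes the 1.58-bit layers keep their weights in a non-bf16 dtype, so
    only the layers left in bf16 (here, attention) contribute to the numerator.
    """
    total = sum(p.numel() for p in model.parameters())
    in_bf16 = sum(p.numel() for p in model.parameters() if p.dtype == torch.bfloat16)
    return in_bf16, total, in_bf16 / total


# For this model the share should come out around 17e6 / 1025e6 ≈ 0.017, i.e. ~1.7%.
```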
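
The training data setup (the first ~0.5B tokens of Locutusque/UltraTextbooks, sequences capped at 512 tokens) could look roughly like the following with the `datasets` and `transformers` libraries. The GPT-2 tokenizer and the `"text"` column name are stand-in assumptions; the repository's train.py is the authoritative preprocessing code.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

MAX_LENGTH = 512            # model max length stated in the card
TOKEN_BUDGET = 500_000_000  # ~0.5B tokens

# Stand-in tokenizer; the repository's train.py defines the real one.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Stream the corpus so only the first ~0.5B tokens ever need to be read.
dataset = load_dataset("Locutusque/UltraTextbooks", split="train", streaming=True)

def tokenize(example):
    # Assumes the text is stored in a "text" column.
    return tokenizer(example["text"], truncation=True, max_length=MAX_LENGTH)

seen_tokens = 0
for example in dataset.map(tokenize):
    seen_tokens += len(example["input_ids"])
    if seen_tokens >= TOKEN_BUDGET:
        break  # stop once roughly 0.5B tokens have been consumed
```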
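
The 8-bit Adam optimizer mentioned under Training Procedure is provided by the `bitsandbytes` library; a minimal sketch with the library's default beta and epsilon values is shown below. The learning rate is a placeholder, and whether plain Adam or AdamW was used should be checked against train.py and the wandb metadata.

```python
import bitsandbytes as bnb
import torch

# Stand-in module; in practice this would be the MoM model from the repository.
model = torch.nn.Linear(512, 512)

optimizer = bnb.optim.Adam8bit(
    model.parameters(),
    lr=1e-4,             # placeholder; the actual value is in the wandb metadata / train.py
    betas=(0.9, 0.999),  # library defaults, as stated in the card
    eps=1e-8,            # library default
)
```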