metadata
license: apache-2.0
datasets:
  - TriadParty/deepsword
language:
  - zh
  - en

Deepsword-34B-Base

Introducing Wrath, part of the Seven Deadly Sins series of models. ![](https://media.discordapp.net/attachments/1088992345824972840/1187269297811247195/dickboy._Chinese_Fangtian_Painting_Halberd_Manufactured_by_Mech_532eefe6-7d75-473c-b5ef-13e1f46bb09e.png?ex=659645b2&is=6583d0b2&hm=51125137c9b25e1f7447c35ea07e891393b374c8072e023b04c0f231a1533cd8 =200x200)

  • Continued pre-training with QLoRA on Yi-34B
  • High-quality martial arts novels
  • Thoughtful cleaning process

This model is designed to serve as the base model for the agent model used in script-killing (murder-mystery role-play) games. For this purpose, I collected approximately 10 GB of martial arts novels from various novel websites and PT sites. The raw collection contains a significant amount of duplicate and low-quality content, so I took the following steps to clean it:

1. Define Data Quality Dimensions

For martial arts novels, high-quality works are typified by authors such as Jin Yong, Gu Long, and Liang Yusheng. In their novels, plot complexity is a critical factor, and it is also the focal point for script quality.

2. Quantify Data Quality Dimensions

Given the emphasis on plot complexity, we approached this in several stages:

Chapter Summarization:

  • English: Use the LED-Large-Book-Summary model from Hugging Face.
  • Chinese: Use the Randeng-Pegasus-523M-Summary-Chinese model.

Vectorization and Complexity Analysis:

  • Convert plot summaries into vectors using a BERT-based model.
  • Measure transitions between chapters through cosine similarity or Euclidean distance.
  • Develop a complexity algorithm focused on standard deviation and peak analysis.

Metric Quantification:

  • Apply subjective weighting to the complexity metrics derived from chapter transitions.
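
The stages above can be sketched roughly as follows. This is a minimal illustration, not the exact algorithm used for this dataset: it assumes each chapter summary has already been turned into a vector by some BERT-based encoder, and the function names (`chapter_transitions`, `count_peaks`, `plot_complexity`), the peak threshold (mean plus one standard deviation), and the weights `w_std` and `w_peaks` are all hypothetical choices made for the example.

```python
import math
from statistics import mean, pstdev


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: an all-zero vector counts as "no change"
    return 1.0 - dot / (norm_a * norm_b)


def chapter_transitions(chapter_vectors):
    """Distance between each pair of consecutive chapter summary vectors."""
    return [
        cosine_distance(prev, curr)
        for prev, curr in zip(chapter_vectors, chapter_vectors[1:])
    ]


def count_peaks(values, threshold):
    """Count local maxima that rise above the threshold."""
    peaks = 0
    for i in range(1, len(values) - 1):
        is_local_max = values[i] > values[i - 1] and values[i] >= values[i + 1]
        if is_local_max and values[i] > threshold:
            peaks += 1
    return peaks


def plot_complexity(chapter_vectors, w_std=0.7, w_peaks=0.3):
    """Weighted mix of transition spread and peak density.

    The weights are illustrative stand-ins for the subjective weighting
    described above; they are not values taken from the model card.
    """
    transitions = chapter_transitions(chapter_vectors)
    if len(transitions) < 3:
        return 0.0  # too few chapters to analyse
    spread = pstdev(transitions)
    threshold = mean(transitions) + spread  # hypothetical peak cut-off
    peak_density = count_peaks(transitions, threshold) / len(transitions)
    return w_std * spread + w_peaks * peak_density
```

With toy two-dimensional vectors, a novel whose chapters all have the same summary vector scores 0.0, while one that alternates between two very different directions scores higher. Euclidean distance could be swapped in for `cosine_distance` without changing the rest of the sketch.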

3. Outcome

By employing these methods, we can effectively select the higher-quality novels. This refined dataset has been shared for further use. The next step is continued pre-training; the specific parameters are listed in my previous model descriptions.
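
As a rough illustration of the filtering step, the sketch below keeps the top-scoring fraction of novels once each one has a complexity score. The function name `select_top_novels` and the default `keep_fraction` are assumptions made for the example; the actual cut-off used for this dataset is not stated here.

```python
def select_top_novels(scores, keep_fraction=0.2):
    """Return the titles with the highest complexity scores.

    `scores` maps a novel title to its complexity score. `keep_fraction`
    is an illustrative cut-off, not a value taken from the model card.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    # Rank titles from most to least complex.
    ranked = sorted(scores, key=scores.get, reverse=True)
    # Always keep at least one title when there is anything to keep.
    keep = max(1, round(len(ranked) * keep_fraction)) if ranked else 0
    return ranked[:keep]
```

A percentile-style cut-off like this avoids having to pick an absolute score threshold, which would depend on the encoder and the weights chosen for the metric.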