---
license: apache-2.0
language:
- en
metrics:
- accuracy
pipeline_tag: text-generation
---

# Using NeyabAI

NeyabAI is a GPT-2-based text-generation model. A minimal inference sketch appears at the end of this README.

# Fine-Tuning NeyabAI on a Custom Dataset

This repository demonstrates how to fine-tune the NeyabAI (GPT-2) language model on a custom dataset using PyTorch and Hugging Face's Transformers library. The code provides an end-to-end example, from loading the dataset to training the model and evaluating its performance.

## Requirements

- Python 3.6+
- PyTorch
- Transformers (Hugging Face)
- NumPy

You can install the required packages using pip:

```bash
pip install torch transformers numpy
```

## Fine-Tuning Script

The following script outlines the steps for fine-tuning GPT-2 on a custom dataset:

```python
import numpy as np
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader, TensorDataset
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Load the pre-trained model and tokenizer
model_name = "XsoraS/NeyabAI"
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2TokenizerFast.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

# Example dataset
dataset = ["Your custom dataset goes here."]  # Replace with your actual dataset

# Tokenization function
def tokenize_function(text):
    return tokenizer(text, padding='max_length', truncation=True, max_length=512)

# Tokenize the dataset
tokenized_inputs = [tokenize_function(text) for text in dataset]
input_ids = [item['input_ids'] for item in tokenized_inputs]
attention_masks = [item['attention_mask'] for item in tokenized_inputs]

# Convert to torch tensors; for causal LM fine-tuning the labels are the input ids
input_ids = torch.tensor(input_ids)
attention_masks = torch.tensor(attention_masks)
labels = input_ids.clone()

# Create DataLoader
batch_size = 8
tensor_dataset = TensorDataset(input_ids, attention_masks, labels)
dataloader = DataLoader(tensor_dataset, batch_size=batch_size, shuffle=True)

# Configure device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Set up optimizer
optimizer = AdamW(model.parameters(), lr=3e-5)

# Define accuracy calculation (token-level agreement between predictions and labels)
def calculate_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=-1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

# Training loop (simplified)
model.train()
for epoch in range(3):  # Adjust the number of epochs as needed
    for batch in dataloader:
        batch = tuple(t.to(device) for t in batch)
        input_ids_batch, attention_masks_batch, labels_batch = batch

        outputs = model(input_ids_batch, attention_mask=attention_masks_batch, labels=labels_batch)
        loss = outputs.loss
        logits = outputs.logits

        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        preds = logits.detach().cpu().numpy()
        label_ids = labels_batch.cpu().numpy()
        acc = calculate_accuracy(preds, label_ids)
        print(f"Loss: {loss.item()}, Accuracy: {acc}")

print("Training complete!")
```

## Notes

- **Dataset:** Replace the `dataset` variable with your actual dataset.
- **Max Length:** Adjust the `max_length` parameter in `tokenize_function` based on the length of your input texts.
- **Batch Size and Learning Rate:** Tune `batch_size` and the learning rate (`lr`) according to your dataset and hardware capabilities.
- **Precision:** If you want to train in reduced precision, prefer mixed precision with `torch.cuda.amp` over calling `model.half()`; plain fp16 training with a standard optimizer is numerically unstable.
- **Epochs:** Adjust the number of epochs based on your convergence criteria.

## Acknowledgments

- This project uses the [Transformers](https://huggingface.co/transformers/) library by Hugging Face.
- Inspired by various fine-tuning examples and tutorials from the Hugging Face community.
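
## Quick Inference Example

Since the model is tagged `text-generation`, the quickest way to try it is with `generate`. The snippet below is a minimal sketch, assuming the Hub id `XsoraS/NeyabAI` (or a local directory saved with `model.save_pretrained`); the prompt and sampling parameters are illustrative choices, not part of the original repository.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# "XsoraS/NeyabAI" is the repository id used in the fine-tuning script above;
# swap in your own save path if you exported fine-tuned weights.
model_name = "XsoraS/NeyabAI"
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2TokenizerFast.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

prompt = "Once upon a time"  # illustrative prompt
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Sample a continuation; the generation parameters here are reasonable defaults,
# not values prescribed by this repository.
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=True,
        top_p=0.9,
        temperature=0.8,
        pad_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```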