--- datasets: - mozilla-foundation/common_voice_11_0 language: - el metrics: - wer license: apache-2.0 --- # Whisper small adapters model for Greek transcription We added adapters to whisper-small model then we finetuned it on Greek ASR. During training, the model is frozen and only the adapters are being trained. When trying to transcribe Greek, we need to activate the adapters, otherwise we can ignore the adapters and use the original whisper model. ## How to use Start by installing transformers with Whisper model with added adapters ```bash git clone https://gitlab.com/horizon-europe-voxreality/multilingual-translation/speech-translation-demo.git cd speech-translation-demo # You might need to switch to dev branch pip install -e transformers ``` The parameter `use_adapters` is used to decide whether we will use the adapters or not. It needs to be set to True only in the case of Greek. ```python from transformers import WhisperProcessor, WhisperForConditionalGenerationWithAdapters from datasets import Audio, load_dataset # load model and processor processor = WhisperProcessor.from_pretrained("voxreality/whisper-small-el-adapters") model = WhisperForConditionalGenerationWithAdapters.from_pretrained("voxreality/whisper-small-el-adapters") forced_decoder_ids = processor.get_decoder_prompt_ids(language="greek", task="transcribe") # load streaming dataset and read first audio sample ds = load_dataset("mozilla-foundation/common_voice_11_0", "el", split="test", streaming=True) ds = ds.cast_column("audio", Audio(sampling_rate=16_000)) input_speech = next(iter(ds))["audio"] input_features = processor(input_speech["array"], sampling_rate=input_speech["sampling_rate"], return_tensors="pt").input_features # Set use_adapters to False for languages other than Greek. # generate token ids predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids, use_adapters=True) # decode token ids to text transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True) ``` You can also use an HF pipeline: ```python from transformers import pipeline from datasets import Audio, load_dataset ds = load_dataset("mozilla-foundation/common_voice_11_0", "el", split="test", streaming=True) ds = ds.cast_column("audio", Audio(sampling_rate=16_000)) input_speech = next(iter(ds))["audio"] model = WhisperForConditionalGenerationWithAdapters.from_pretrained("voxreality/whisper-small-el-adapters") pipe = pipeline("automatic-speech-recognition", model=model, tokenizer="voxreality/whisper-small-el-adapters", "voxreality/whisper-small-el-adapters", device='cpu', batch_size=32) transcription = pipe(input_speech['array'], generate_kwargs = {"language":f"<|el|>","task": "transcribe", "use_adapters": True}) ```