metadata

title: Denoise And Diarization
emoji: 🐠
colorFrom: gray
colorTo: gray
sdk: gradio
sdk_version: 3.28.0
app_file: app.py
pinned: false

How to run:

Hugging Face Space
Run local inference:
1. GUI: python app.py
2. Command line: python main_pipeline.py --audio-path dialog.mp3 --out-folder-path out
Run with Docker:

docker login registry.hf.space
docker run -it -p 7860:7860 --platform=linux/amd64 \
    registry.hf.space/speechmaster-denoise-and-diarization:latest python app.py

About the pipeline:

denoise the audio
run VAD (voice activity detection)
extract a speaker embedding from each VAD fragment
cluster these embeddings
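
The four stages above can be sketched as one function that chains them. This is only an illustration of the data flow, not the code in main_pipeline.py: the name `diarize` and the four callables passed to it are hypothetical placeholders for whatever denoiser, VAD, embedding model, and clustering step are actually used.

```python
from typing import Callable, List, Sequence, Tuple

Samples = Sequence[float]
Segment = Tuple[float, float]  # (start_s, end_s) of one speech fragment


def diarize(
    audio: Samples,
    sample_rate: int,
    denoise: Callable[[Samples], Samples],
    vad: Callable[[Samples, int], List[Segment]],
    embed: Callable[[Samples], List[float]],
    cluster: Callable[[List[List[float]]], List[int]],
) -> List[Tuple[float, float, int]]:
    """Run the four stages in order; return (start_s, end_s, speaker_id)."""
    # 1. Denoise the whole recording.
    clean = denoise(audio)
    # 2. Find the speech fragments.
    segments = vad(clean, sample_rate)
    # 3. Compute one speaker embedding per fragment.
    embeddings = []
    for start_s, end_s in segments:
        lo, hi = int(start_s * sample_rate), int(end_s * sample_rate)
        embeddings.append(embed(clean[lo:hi]))
    # 4. Group the embeddings; each cluster label is one speaker.
    labels = cluster(embeddings)
    return [(s, e, label) for (s, e), label in zip(segments, labels)]
```

Passing the stages in as arguments makes it easy to swap one of them (for example, to skip denoising) without touching the rest.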

Inference time by hardware

Hardware	Inference time for file dialog.mp3
CPU (2 vCPU, Hugging Face)	453.8 s/it
GPU (Tesla V100)	8.23 s/it

Approaches

I know of several methods for this task:

separation: use speech separation models (these need long training and fine-tuning)
diarization
- speaker embedding + clustering with a known number of speakers
- overlapped speech detection
- speaker embedding + clustering with an unknown number of speakers
- per-word ASR + speaker embedding + clustering of speakers
end-to-end neural diarization (current SOTA models perform worse than plain diarization)

For this task I used speaker embedding + clustering with an unknown number of speakers.
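
When the number of speakers is unknown, one common option is agglomerative clustering that stops merging at a distance threshold instead of at a fixed cluster count. The sketch below is a minimal pure-Python version of that idea. It is an assumption, not necessarily the algorithm used in this repository: the function names, average linkage, cosine distance, and the threshold value 0.5 are all chosen for illustration.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; zero vectors are treated as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def cluster_unknown_speakers(
    embeddings: Sequence[Sequence[float]], threshold: float = 0.5
) -> List[int]:
    """Merge the closest clusters until the closest pair is above `threshold`."""
    clusters = [[i] for i in range(len(embeddings))]

    def average_linkage(c1: List[int], c2: List[int]) -> float:
        total = sum(
            cosine_distance(embeddings[i], embeddings[j]) for i in c1 for j in c2
        )
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        # Stop when even the closest pair is too far apart to be one speaker.
        if best[0] > threshold:
            break
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]

    # Convert the cluster membership lists into one label per embedding.
    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels
```

The number of speakers is then simply the number of distinct labels. The threshold controls the trade-off: too low splits one speaker into several clusters, too high merges different speakers.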

How I can improve it:

Fix preprocessing:
- estimate the SNR (signal-to-noise ratio) and skip denoising if the input is already clean
Add training:
- custom speaker recognition model
- custom overlapped speech detector
- custom speech separation model:
  - MossFormer
  - speechbrain
Use FaceVad if video is available
Improve speed and RAM usage:
- quantize models
- optimize models for the target hardware: onnx => openvino/tensorrt/caffe2 or coreml
- prune models
- distillation (train a small model with a big model)
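
The SNR check from the preprocessing item above could look roughly like the sketch below. It is a crude heuristic under stated assumptions, not part of the current pipeline: the quietest 10% of frames are assumed to contain only noise, and the frame length of 400 samples and the 20 dB threshold are illustrative values. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def frame_powers(samples: Sequence[float], frame_len: int = 400) -> List[float]:
    """Mean power of each non-overlapping frame (a trailing partial frame is dropped)."""
    powers = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        powers.append(sum(x * x for x in frame) / frame_len)
    return powers


def estimate_snr_db(
    samples: Sequence[float], frame_len: int = 400, noise_fraction: float = 0.1
) -> float:
    """Estimate SNR in dB, taking the quietest frames as the noise floor."""
    powers = sorted(frame_powers(samples, frame_len))
    if not powers:
        raise ValueError("audio is shorter than one frame")
    n_noise = max(1, int(len(powers) * noise_fraction))
    noise = sum(powers[:n_noise]) / n_noise
    total = sum(powers) / len(powers)
    # Clamp both terms to avoid log(0) and division by zero.
    speech = max(total - noise, 1e-12)
    noise = max(noise, 1e-12)
    return 10.0 * math.log10(speech / noise)


def needs_denoising(samples: Sequence[float], snr_threshold_db: float = 20.0) -> bool:
    """True if the estimated SNR is below the threshold, so denoising should run."""
    return estimate_snr_db(samples) < snr_threshold_db
```

A recording with no pauses at all would break the noise-floor assumption, so a real implementation would likely reuse the VAD output to pick the non-speech frames instead.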

Further improvements beyond the above:

remove overlapped speech using ASR
remove overlapped speech using overlap detection
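
Once overlapped regions are known (from an overlap detector or from ASR timestamps), removing them from the diarization output is plain interval subtraction. The sketch below assumes the overlaps are already given as (start, end) pairs in seconds. The function names and the tuple layout are hypothetical and are not taken from this repository.

```python
from typing import List, Tuple

Interval = Tuple[float, float]             # (start_s, end_s)
SpeakerSegment = Tuple[float, float, int]  # (start_s, end_s, speaker_id)


def subtract_intervals(segment: Interval, overlaps: List[Interval]) -> List[Interval]:
    """Return the parts of `segment` that are not covered by any overlap interval."""
    start, end = segment
    kept = []
    cursor = start
    for o_start, o_end in sorted(overlaps):
        if o_end <= cursor or o_start >= end:
            continue  # this overlap does not touch the remaining part
        if o_start > cursor:
            kept.append((cursor, o_start))
        cursor = max(cursor, o_end)
        if cursor >= end:
            break
    if cursor < end:
        kept.append((cursor, end))
    return kept


def remove_overlapped_speech(
    segments: List[SpeakerSegment], overlaps: List[Interval]
) -> List[SpeakerSegment]:
    """Cut every overlapped region out of every speaker segment."""
    result = []
    for start, end, speaker in segments:
        for kept_start, kept_end in subtract_intervals((start, end), overlaps):
            result.append((kept_start, kept_end, speaker))
    return result
```

Segments that lie fully inside an overlapped region disappear from the output, and segments that are only partly overlapped are split into their clean parts.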