---
license: cc-by-sa-4.0
datasets:
- HaifaCLGroup/KnessetCorpus
language:
- he
tags:
- hebrew
- nlp
- masked-language-model
- transformers
- BERT
- parliamentary-proceedings
- language-model
- Knesset
- DictaBERT
- fine-tuning

---
# Knesset-DictaBERT
**Knesset-DictaBERT** is a Hebrew language model fine-tuned on the [Knesset Corpus](https://huggingface.co/datasets/HaifaCLGroup/KnessetCorpus), 
which comprises Israeli parliamentary proceedings. 

This model is based on the [Dicta-BERT](https://huggingface.co/dicta-il/dictabert) architecture.
As a masked language model, it is designed to build contextual representations of Hebrew text and to predict masked tokens, with a specific focus on parliamentary language and context.


## Model Details

- **Model type**: BERT-based (Bidirectional Encoder Representations from Transformers)
- **Language**: Hebrew
- **Training Data**: [Knesset Corpus](https://huggingface.co/datasets/HaifaCLGroup/KnessetCorpus) (Israeli parliamentary proceedings)
- **Base Model**: [Dicta-BERT](https://huggingface.co/dicta-il/dictabert)

## Training Procedure

The model was fine-tuned using the masked language modeling (MLM) task on the Knesset Corpus. The MLM task involves predicting masked words in a sentence, allowing the model to learn contextual representations of words.
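
To make the objective concrete, here is a minimal sketch of how MLM training examples can be built from a tokenized sentence. It is an illustration under stated assumptions, not the training code used for this model: `mask_tokens` is a hypothetical helper, the 15% default masking rate is borrowed from the original BERT recipe (the rate used here is not stated), and the sketch always substitutes `[MASK]` rather than applying BERT's mixed replacement scheme.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) for one tokenized sentence.

    labels[i] holds the original token at masked positions and None elsewhere,
    so a loss would be computed only where the model has to predict something.
    mask_prob=0.15 is an assumed default taken from the original BERT setup.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)      # hide the token from the model
            labels.append(token)     # remember what it has to predict
        else:
            masked.append(token)
            labels.append(None)      # position is ignored by the loss
    return masked, labels

# English stand-in sentence, pre-split on whitespace for simplicity
tokens = "the committee will vote on the bill next week".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

During fine-tuning, the model sees the masked sequence and is penalized only at positions where a label is present.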

## Usage
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("GiliGold/Knesset-DictaBERT")
model = AutoModelForMaskedLM.from_pretrained("GiliGold/Knesset-DictaBERT")
model.eval()

# English gloss: "We have a [MASK] about this next week"
sentence = "יש לנו [MASK] על זה בשבוע הבא"

# Tokenize the input sentence and get predictions (no gradients needed)
inputs = tokenizer(sentence, return_tensors='pt')
with torch.no_grad():
    output = model(**inputs)

# Locate the [MASK] position instead of hard-coding its index
mask_positions = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
mask_token_index = mask_positions[0].item()
top_2_tokens = torch.topk(output.logits[0, mask_token_index, :], 2).indices

# Convert token IDs to tokens and print them
print('\n'.join(tokenizer.convert_ids_to_tokens(top_2_tokens.tolist())))

# Example output: ישיבה / דיון ("meeting" / "discussion")
```

## Evaluation
The model was evaluated on a test set comprising 10% of the Knesset Corpus, approximately 3.2 million sentences.
Perplexity was calculated on the full test set.
Due to time constraints, accuracy was calculated on a subset of the test set consisting of approximately 300,000 sentences (about 3.5 million tokens).

#### Perplexity
The perplexity of the original DictaBERT on the full test set is 22.87.

The perplexity of Knesset-DictaBERT on the full test set is 6.60.
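
For readers unfamiliar with the metric, perplexity is conventionally the exponential of the average negative log-likelihood that a model assigns to the correct tokens; lower is better. The sketch below shows that calculation. It assumes the standard definition applied to masked positions, `perplexity` is a hypothetical helper rather than the evaluation script behind the figures above, and the probabilities are invented for illustration.

```python
import math

def perplexity(true_token_probs):
    """Perplexity = exp(mean negative log-likelihood of the correct tokens)."""
    if not true_token_probs:
        raise ValueError("need at least one probability")
    nll = [-math.log(p) for p in true_token_probs]
    return math.exp(sum(nll) / len(nll))

# Probabilities a model might assign to the correct token at four masked
# positions (invented values, not real model output).
probs = [0.5, 0.25, 0.125, 0.25]
print(perplexity(probs))
```

If the reported figures follow this definition, a perplexity of 6.60 corresponds to an average loss of about 1.89 nats per predicted token, compared with about 3.13 nats for a perplexity of 22.87.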

#### Accuracy

- **Top-1 accuracy**

Knesset-DictaBERT identified the correct token as its top-1 prediction in 52.55% of the cases.

The original DictaBERT model achieved a top-1 accuracy of 48.02%.


- **Top-2 accuracy**

Knesset-DictaBERT identified the correct token within its top-2 predictions in 63.07% of the cases.

The original DictaBERT model achieved a top-2 accuracy of 58.60%.


- **Top-5 accuracy**

Knesset-DictaBERT identified the correct token within its top-5 predictions in 73.59% of the cases.

The original DictaBERT model achieved a top-5 accuracy of 68.98%.
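
The accuracy figures above count a prediction as correct when the true token appears among the model's k highest-scoring candidates for a masked position. A minimal sketch of that computation follows; `top_k_accuracy` is a hypothetical helper and the scores are toy values over a four-token vocabulary, so this shows the metric itself and not the evaluation code used for the reported numbers.

```python
def top_k_accuracy(score_rows, gold_ids, k):
    """Fraction of positions whose gold id is among the k highest-scoring ids."""
    if len(score_rows) != len(gold_ids):
        raise ValueError("score_rows and gold_ids must have the same length")
    if not gold_ids:
        raise ValueError("need at least one position")
    hits = 0
    for scores, gold in zip(score_rows, gold_ids):
        # Vocabulary ids sorted from highest to lowest score
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if gold in ranked[:k]:
            hits += 1
    return hits / len(gold_ids)

# Toy scores at three masked positions (illustrative values only)
scores = [
    [0.10, 0.70, 0.15, 0.05],  # gold id 1 is ranked first
    [0.40, 0.10, 0.30, 0.20],  # gold id 2 is ranked second
    [0.50, 0.20, 0.20, 0.10],  # gold id 3 is ranked last
]
gold = [1, 2, 3]
for k in (1, 2, 5):
    print(k, top_k_accuracy(scores, gold, k))
```

By construction the metric can only grow or stay equal as k increases, which matches the pattern in the reported top-1, top-2, and top-5 results.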

## Acknowledgments
This model is built upon the work of the Dicta team, and their contributions are gratefully acknowledged.

## Citation
If you use this model in your work, please cite:
```bibtex
@misc{goldin2024knessetdictaberthebrewlanguagemodel,
      title={Knesset-DictaBERT: A Hebrew Language Model for Parliamentary Proceedings}, 
      author={Gili Goldin and Shuly Wintner},
      year={2024},
      eprint={2407.20581},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2407.20581}, 
}
```