klasocki committed on
Commit
6373819
1 Parent(s): 54847c6

Fix a crazy bug and add README


Some tokens become joined into a single token after the punctuation has been removed,
for instance `@.@` becomes `@@`, which we then cannot find in the original string.
So switch from using `find` to using the pipeline's end indices,
and keep track of the punctuation we removed from the original string.
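
A minimal sketch of the failure mode, with illustrative strings rather than real pipeline output:

```python
# Removing '.' merges the wikitext escape '@.@' into '@@', a token that
# never occurs in the original string, so str.find cannot locate it.
s = "1 @.@ 9 in"
no_punct = s.replace('.', '')  # "1 @@ 9 in" -- this is what the model tokenizes
print(s.find("@@"))            # -1: searching the original string fails
# Tracking the indices of removed punctuation and using the model's 'end'
# offsets sidesteps searching for tokens in the original string entirely.
```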

README.md CHANGED
@@ -10,26 +10,116 @@ pinned: true
 app_port: 8000
 ---
 
 `sudo service docker start`
 
- `docker log [id]` for logs from the container.
- `docker build -t comma-fixer --target test .` for tests
- `git push hub` to deploy to huggingface hub, after adding a remote
- Multi-stage build brings down the size from 9GB+ to around 7GB.
- Less not possible most likely, due to the size of torch and models.
- Reported token classification F1 scores on commas for different languages, on a political speeches' dataset:
 
 | English | German | French | Italian |
 |---------|--------|--------|---------|
 | 0.819 | 0.945 | 0.831 | 0.798 |
 
- Evaluation of the baseline model on the wikitext-103-raw-v1 validation dataset:
 
 | precision | recall | F1 | support |
 |-----------|--------|------|---------|
 | 0.79 | 0.71 | 0.75 | 10079 |
 app_port: 8000
 ---
 
+ # Comma fixer
+ This repository contains a web service for fixing comma placement within a given text, for instance:
+ 
+ `"A sentence however, not quite good correct and sound."` -> `"A sentence, however, not quite good, correct and sound."`
+ 
+ It provides a webpage for testing the functionality, a REST API,
+ and Jupyter notebooks for evaluating and training comma-fixing models.
+ 
+ A web demo is hosted on [huggingface spaces](https://huggingface.co/spaces/klasocki/comma-fixer).
+ 
+ ## Development setup
+ 
+ Deploying the service for local development can be done by running `docker-compose up` in the root directory.
+ Note that you might have to run
 `sudo service docker start`
+ first.
+ 
+ The application should then be available at http://localhost:8000.
+ For the API, see the `openapi.yaml` file.
+ Docker-compose mounts a volume and listens for changes in the source code, so the application will be reloaded
+ and reflect them.
+ 
+ We use multi-stage builds to reduce the image size, ensure flexibility in requirements, and run tests before
+ each deployment.
+ However, while multi-stage building does reduce the size by nearly 3GB, the resulting image still contains deep
+ learning libraries and pre-downloaded models, and will take around 7GB of disk space.
+ 
+ Alternatively, you can set up a python environment by hand. It is recommended to use a virtualenv. Inside one, run
+ ```bash
+ pip install -e .[test]
+ ```
+ The `[test]` option makes sure to install test dependencies.
+ 
+ If you intend to perform training and evaluation of deep learning models, also install with the `[training]` option.
+ 
+ ### Running tests
+ To run the tests, execute
+ ```bash
+ docker build -t comma-fixer --target test .
+ ```
+ or run `python -m pytest tests/` if you already have a local python environment.
 
+ ### Deploying to huggingface spaces
+ In order to deploy the application, one needs to be added as a collaborator to the space and to have set up a
+ corresponding git remote.
+ The application is then continuously deployed on each push.
+ ```bash
+ git remote add hub https://huggingface.co/spaces/klasocki/comma-fixer
+ git push hub
+ ```
 
+ ## Evaluation
 
+ In order to evaluate, run `jupyter notebook notebooks/`, or copy the notebooks to a web hosting service with GPUs,
+ such as Google Colab or Kaggle, and clone this repository there.
 
+ We use the [oliverguhr/fullstop-punctuation-multilang-large](https://huggingface.co/oliverguhr/fullstop-punctuation-multilang-large)
+ model as the baseline.
+ It is a RoBERTa large model fine-tuned for the task of punctuation restoration on a dataset of political speeches
+ in English, German, French and Italian.
+ That is, it takes a sentence without any punctuation as input, and predicts the missing punctuation in token
+ classification fashion, thanks to which the original token structure stays unchanged.
+ We use a subset of its capabilities, focusing solely on commas and leaving other punctuation intact.
+ 
+ The authors report the following token classification F1 scores on commas for different languages on the original
+ dataset:
 
 | English | German | French | Italian |
 |---------|--------|--------|---------|
 | 0.819 | 0.945 | 0.831 | 0.798 |
 
+ The results of our evaluation of the baseline model out of domain, on the English wikitext-103-raw-v1 validation
+ dataset, are as follows:
 
 | precision | recall | F1 | support |
 |-----------|--------|------|---------|
 | 0.79 | 0.71 | 0.75 | 10079 |
 
+ We treat each comma as one token instance, as opposed to the original paper, which NER-tags all the tokens of the
+ word preceding a comma as comma-class tokens.
+ In our approach, for each comma in the prediction text obtained from the model (see the sketch after this list):
+ * If it should be there according to the ground truth, it counts as a true positive.
+ * If it should not be there, it counts as a false positive.
+ * If a comma from the ground truth is not predicted, it counts as a false negative.
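
A minimal sketch of this counting scheme (a hypothetical `comma_scores` helper, not the notebook's actual code), assuming the predicted and ground-truth texts differ only in commas:

```python
def comma_scores(predicted: str, ground_truth: str) -> tuple[float, float]:
    def comma_positions(text: str) -> set[int]:
        # index of the word each comma is attached to
        return {i for i, word in enumerate(text.split()) if word.endswith(',')}

    pred, gold = comma_positions(predicted), comma_positions(ground_truth)
    tp = len(pred & gold)  # commas predicted where the ground truth has them
    fp = len(pred - gold)  # commas predicted where they should not be
    fn = len(gold - pred)  # ground-truth commas that were missed
    precision = tp / (tp + fp) if pred else 0.0
    recall = tp / (tp + fn) if gold else 0.0
    return precision, recall

# Mirrors the notebook's sanity check: 2 commas predicted correctly and
# 3 missed should give 100% precision and 40% recall.
print(comma_scores("One, two three four, five six",
                   "One, two, three, four, five, six"))  # (1.0, 0.4)
```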
+ 
+ ## Training
+ While fine-tuning an encoder, BERT-like, pre-trained model for NER seems like the best approach to the problem,
+ since it preserves the sentence structure and only focuses on commas,
+ we doubt we could beat the baseline model with a similar approach given our limited GPU resources.
+ We could fine-tune the baseline on our data, focusing on commas, and see if it brings any improvement.
+ 
+ However, we thought that trying out pre-trained text-to-text or decoder-only LLMs for this task using PEFT could be
+ interesting, and wanted to check if we have enough resources for low-rank adaptation or prefix-tuning.
+ 
+ We adapt the code from [this tutorial](https://www.youtube.com/watch?v=iYr1xZn26R8) in order to fine-tune a
+ [bloom LLM](https://huggingface.co/bigscience/bloom-560m) to our task using
+ [LoRA](https://arxiv.org/pdf/2106.09685.pdf), as sketched below.
+ However, even with the smallest model from the family, we struggled with CUDA memory errors on the free Google
+ Colab GPU quotas, and could only train with a batch size of two.
+ After a short training, it seems the loss keeps fluctuating and the model only learns to repeat the
+ original phrase back.
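
A minimal sketch of this kind of LoRA setup with the `peft` library; the hyperparameters are illustrative assumptions, not our exact training configuration:

```python
# Illustrative LoRA setup for bloom-560m with peft; r, lora_alpha and
# lora_dropout are assumed values for the sketch, not the tutorial's settings.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,  # bloom is a decoder-only (causal) model
    r=8,                # rank of the low-rank update matrices
    lora_alpha=16,      # scaling factor for the update
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter weights are trainable
```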
+ 
+ If time permits, we plan to experiment with seq2seq pre-trained models, with increasing the gradient accumulation
+ steps, and with raising the percentage of data with commas.
+ The latter could help since wikitext contains highly diverse data, with many rows being empty strings,
+ headers, or short paragraphs.
commafixer/src/baseline.py CHANGED
@@ -1,52 +1,131 @@
 from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline, NerPipeline
 
 
 class BaselineCommaFixer:
     def __init__(self, device=-1):
         self._ner = _create_baseline_pipeline(device=device)
 
     def fix_commas(self, s: str) -> str:
         return _fix_commas_based_on_pipeline_output(
-            self._ner(_remove_punctuation(s)),
-            s
         )
 
 
 def _create_baseline_pipeline(model_name="oliverguhr/fullstop-punctuation-multilang-large", device=-1) -> NerPipeline:
     tokenizer = AutoTokenizer.from_pretrained(model_name)
     model = AutoModelForTokenClassification.from_pretrained(model_name)
     return pipeline('ner', model=model, tokenizer=tokenizer, device=device)
 
 
-def _remove_punctuation(s: str) -> str:
-    to_remove = ".,?-:"
-    for char in to_remove:
-        s = s.replace(char, '')
-    return s
-
-
-def _fix_commas_based_on_pipeline_output(pipeline_json: list[dict], original_s: str) -> str:
     result = original_s.replace(',', '')  # We will fix the commas, but keep everything else intact
-    current_offset = 0
 
     for i in range(1, len(pipeline_json)):
-        current_offset = _find_current_token(current_offset, i, pipeline_json, result)
         if _should_insert_comma(i, pipeline_json):
             result = result[:current_offset] + ',' + result[current_offset:]
-            current_offset += 1
     return result
 
 
-def _should_insert_comma(i, pipeline_json, new_word_indicator='▁') -> bool:
-    # Only insert commas for the final token of a word
     return pipeline_json[i - 1]['entity'] == ',' and pipeline_json[i]['word'].startswith(new_word_indicator)
 
 
-def _find_current_token(current_offset, i, pipeline_json, result, new_word_indicator='▁') -> int:
-    current_word = pipeline_json[i - 1]['word'].replace(new_word_indicator, '')
-    # Find the current word in the result string, starting looking at current offset
-    current_offset = result.find(current_word, current_offset) + len(current_word)
-    return current_offset
 
 
 if __name__ == "__main__":
 from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline, NerPipeline
+import re
 
 
 class BaselineCommaFixer:
+    """
+    A wrapper class for the oliverguhr/fullstop-punctuation-multilang-large baseline punctuation restoration model.
+    It adapts the model to perform comma fixing instead of full punctuation restoration; that is, it removes the
+    punctuation, runs the model, and then uses its outputs so that only commas are changed.
+    """
+
     def __init__(self, device=-1):
         self._ner = _create_baseline_pipeline(device=device)
 
     def fix_commas(self, s: str) -> str:
+        """
+        The main method for fixing commas using the baseline model.
+        In the future we should think about batching the calls to it; for now it processes requests string by string.
+        :param s: A string with commas to fix, without length restrictions.
+        Example: comma_fixer.fix_commas("One two thre, and four!")
+        :return: A string with commas fixed, example: "One, two, thre and four!"
+        """
+        s_no_punctuation, punctuation_indices = _remove_punctuation(s)
         return _fix_commas_based_on_pipeline_output(
+            self._ner(s_no_punctuation),
+            s,
+            punctuation_indices
         )
 
 
 def _create_baseline_pipeline(model_name="oliverguhr/fullstop-punctuation-multilang-large", device=-1) -> NerPipeline:
+    """
+    Creates the huggingface pipeline object.
+    Can also be used for pre-downloading the model and the tokenizer.
+    :param model_name: Name of the baseline model on the huggingface hub.
+    :param device: Device to use when running the pipeline; defaults to -1 for CPU, a non-negative number indicates
+    the id of the GPU to use.
+    :return: A token classification pipeline.
+    """
     tokenizer = AutoTokenizer.from_pretrained(model_name)
     model = AutoModelForTokenClassification.from_pretrained(model_name)
     return pipeline('ner', model=model, tokenizer=tokenizer, device=device)
 
 
+def _remove_punctuation(s: str) -> tuple[str, list[int]]:
+    """
+    Removes the punctuation (".,?-:") from the input text, since the baseline model has been trained on data without
+    punctuation. It also keeps track of the indices where punctuation was removed, so that we can restore the
+    original later. Commas are the exception: we remove them, but restore them with the model, hence we do not keep
+    track of removed comma indices.
+    :param s: For instance, "A short-string: with punctuation, removed."
+    :return: A tuple of a string, for instance
+    "A shortstring with punctuation removed"; and a list of indices where punctuation has been removed, in ascending
+    order.
+    """
+    to_remove_regex = r"[\.\?\-:]"
+    # We're not counting commas, since we will remove them later anyway. Only counting removals that will be restored
+    # in the final resulting string.
+    punctuation_indices = [m.start() for m in re.finditer(to_remove_regex, s)]
+    s = re.sub(to_remove_regex, '', s)
+    s = s.replace(',', '')
+    return s, punctuation_indices
+
+
+def _fix_commas_based_on_pipeline_output(pipeline_json: list[dict], original_s: str, punctuation_indices: list[int]) -> \
+        str:
+    """
+    This function takes the comma-fixing token classification pipeline output, and converts it to a string based on
+    the original string and the punctuation indices, so that the string contains all the original characters, except
+    commas, intact.
+    :param pipeline_json: Token classification pipeline output.
+    Contains five fields.
+    'entity' is the punctuation that should follow this token.
+    'word' is the token text together with the preceding space, if any.
+    'end' is the end index in the original string (with punctuation removed, in our case!).
+    Example: [{'entity': ':',
+               'score': 0.90034866,
+               'index': 1,
+               'word': '▁Exam',
+               'start': 0,
+               'end': 4},
+              {'entity': ':',
+               'score': 0.9157294,
+               'index': 2,
+               'word': 'ple',
+               'start': 4,
+               'end': 7}]
+    :param original_s: The original string, before removing punctuation.
+    :param punctuation_indices: The indices of the removed punctuation, except commas, so that we can correctly keep
+    track of the current offset in the original string.
+    :return: A string with commas fixed, and the other original punctuation from the input string restored.
+    """
     result = original_s.replace(',', '')  # We will fix the commas, but keep everything else intact
+
+    commas_inserted_or_punctuation_removed = 0
+    removed_punctuation_index = 0
 
     for i in range(1, len(pipeline_json)):
+        current_offset = pipeline_json[i - 1]['end'] + commas_inserted_or_punctuation_removed
+
+        commas_inserted_or_punctuation_removed, current_offset, removed_punctuation_index = (
+            _update_offset_by_the_removed_punctuation(
+                commas_inserted_or_punctuation_removed, current_offset, punctuation_indices, removed_punctuation_index
+            )
+        )
+
         if _should_insert_comma(i, pipeline_json):
             result = result[:current_offset] + ',' + result[current_offset:]
+            commas_inserted_or_punctuation_removed += 1
     return result
 
 
+def _update_offset_by_the_removed_punctuation(
+        commas_inserted_and_punctuation_removed, current_offset, punctuation_indices, removed_punctuation_index
+):
+    # Increase the counters for every punctuation character removed from the original string before the current
+    # offset.
+    while (removed_punctuation_index < len(punctuation_indices) and
+           punctuation_indices[removed_punctuation_index] < current_offset):
+        commas_inserted_and_punctuation_removed += 1
+        removed_punctuation_index += 1
+        current_offset += 1
+    return commas_inserted_and_punctuation_removed, current_offset, removed_punctuation_index
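
To make the offset bookkeeping concrete, a worked example with assumed values, traced by hand:

```python
# original:   "A short-string with commas"  -> '-' removed at index 7,
#             so punctuation_indices == [7]
# model sees: "A shortstring with commas"   -> the 'end' of token "with" is 18
# To insert a comma after "with" in the comma-stripped result string
# "A short-string with commas", the offset is shifted past the removed '-'
# (since 7 < 18), giving 19:
# result[:19] + ',' + result[19:] == "A short-string with, commas"
```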
 
 
+def _should_insert_comma(i, pipeline_json, new_word_indicator='▁') -> bool:
+    # Only insert commas for the final token of a word, that is, if the next word starts with a space.
+    return pipeline_json[i - 1]['entity'] == ',' and pipeline_json[i]['word'].startswith(new_word_indicator)
 
 
 if __name__ == "__main__":
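
A usage sketch for the class (the module's `__main__` body is cut off by the diff context; the example string comes from the `fix_commas` docstring):

```python
from commafixer.src.baseline import BaselineCommaFixer

fixer = BaselineCommaFixer()  # device=-1 runs the pipeline on CPU
print(fixer.fix_commas("One two thre, and four!"))
# Expected, per the docstring: "One, two, thre and four!"
```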
notebooks/evaluation.ipynb CHANGED
@@ -283,15 +283,6 @@
     }
    ]
   },
-  {
-   "cell_type": "markdown",
-   "source": [
-    "We have 2 commas predicted correctly, and 3 missed, so we are expecting 100% precision and 40% recall."
-   ],
-   "metadata": {
-    "id": "NzVo05UcoPlb"
-   }
-  },
   {
    "cell_type": "code",
    "source": [
@@ -346,6 +337,15 @@
    }
   ]
  },
+ {
+  "cell_type": "markdown",
+  "source": [
+   "We have 2 commas predicted correctly, and 3 missed, so we are expecting 100% precision and 40% recall."
+  ],
+  "metadata": {
+   "collapsed": false
+  }
+ },
  {
   "cell_type": "code",
   "source": [
openapi.yaml CHANGED
@@ -32,11 +32,10 @@ paths:
               s:
                 type: string
                 example: 'This is a sentence with wrong commas, at least some.'
-          description: A text with commas fixed, or unchanged if not necessary.
-            TODO WARNING - the text will have spaces normalized and trimmed at the start and end.
-            TODO some other punctuation may be changed as well
+          description: A text with commas fixed, or unchanged if not necessary. Everything other than
+            commas will stay as it was originally.
 
         400:
-          description: Input text query parameter missing.
+          description: A required field missing from the POST request body JSON.
 
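A request sketch against the API described above; the endpoint path is a placeholder assumption, the real one is defined in `openapi.yaml`:

```python
import requests

# '/fix-commas' is a hypothetical path; check openapi.yaml for the real one.
response = requests.post("http://localhost:8000/fix-commas",
                         json={"s": "This is a sentence with wrong commas, at least some."})
print(response.status_code)  # 400 if the required 's' field were missing
print(response.json())       # the response body with commas fixed
```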
setup.py CHANGED
@@ -21,7 +21,7 @@ setup(
     extras_require={
         'training': [
             'datasets==2.14.4',
-            'seqeval'
+            'seqeval',
             'notebook'
         ],
         'test': [
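
The missing comma mattered because of Python's implicit string literal concatenation; a quick illustration:

```python
# Adjacent string literals fuse into one, so the old extras list declared a
# single bogus requirement 'seqevalnotebook' instead of two packages.
assert ['seqeval' 'notebook'] == ['seqevalnotebook']
assert ['seqeval', 'notebook'] != ['seqevalnotebook']
```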
tests/test_baseline.py CHANGED
@@ -30,7 +30,15 @@ def test_fix_commas_leaves_correct_strings_unchanged(baseline_fixer, test_input)
      ['Even newlines\ntabs\tand others get preserved.',
       'Even newlines,\ntabs\tand others get preserved.'],
      ['I had no Creativity left, therefore, I come here, and write useless examples, for this test.',
-      'I had no Creativity left therefore, I come here and write useless examples for this test.']]
+      'I had no Creativity left therefore, I come here and write useless examples for this test.'],
+     [' This is a sentence. With, a lot of, useless punctuation!!??. O.o However we have to insert commas O-O, '
+      'nonetheless or we will fail this test.',
+      ' This is a sentence. With a lot of useless punctuation!!??. O.o However, we have to insert commas O-O '
+      'nonetheless, or we will fail this test.'],
+     [" The ship 's secondary armament consisted of fourteen 45 @-@ calibre 6 @-@ inch ( 152 mm ) quick @-@ firing ( QF ) guns mounted in casemates . Lighter guns consisted of eight 47 @-@ millimetre ( 1 @.@ 9 in ) three @-@ pounder Hotchkiss guns and four 47 @-@ millimetre 2 @.@ 5 @-@ pounder Hotchkiss guns . The ship was also equipped with four submerged 18 @-@ inch torpedo tubes two on each broadside .",
+      " The ship 's secondary armament consisted of fourteen 45 @-@ calibre 6 @-@ inch ( 152 mm ) quick @-@ firing ( QF ) guns mounted in casemates . Lighter guns consisted of eight 47 @-@ millimetre ( 1 @.@ 9 in ), three @-@ pounder Hotchkiss guns and four 47 @-@ millimetre 2 @.@ 5 @-@ pounder Hotchkiss guns . The ship was also equipped with four submerged 18 @-@ inch torpedo tubes, two on each broadside ."]
+     ]
 )
 def test_fix_commas_fixes_incorrect_commas(baseline_fixer, test_input, expected):
     result = baseline_fixer.fix_commas(s=test_input)
@@ -39,10 +47,11 @@ def test_fix_commas_fixes_incorrect_commas(baseline_fixer, test_input, expected)
 
 @pytest.mark.parametrize(
     "test_input, expected",
-    [['', ''],
-     ['Hello world...', 'Hello world'],
+    [['', ('', [])],
+     [' world...', (' world', [6, 7, 8])],
+     [',,,', ('', [])],
      ['This: test - string should not, have any commas inside it...?',
-      'This test string should not have any commas inside it']]
+      ('This test string should not have any commas inside it', [4, 11, 57, 58, 59, 60])]]
 )
 def test__remove_punctuation(test_input, expected):
     assert _remove_punctuation(test_input) == expected