---
language: en
license: mit
pipeline_tag: token-classification
tags:
  - token-classification
  - NER
widget:
  - text: >
        Food Name: GARLIC AND FINE HERBS
        Brand: CELEBRITY
        Food Category: Cheese
        Ingredients: Goat's milk, garlic, herbs, sea salt, potassium sorbate, microbial enzyme, bacterial culture
        
        Nutrition Facts per Serving (28g):
        
        Calories: 250 kcal
        Protein: 14.3g
        Fat: 21.4g
        Carbs: 0g
        Sugars: 3.57g
        Sodium: 571mg
        Label Nutrition Facts:
        
        Fat: 5.99g
        Saturated Fat: 4g
        Trans Fat: 0.199g
        Cholesterol: 19.9mg
        Sodium: 160mg
        Protein: 4g
        Calcium: 19.9mg
        Calories: 70 kcal
        Update Log:
        
        Nutrient Added: Value 5
        Nutrient Updated: Value 4

---
This BERT sequence-labeling model is designed for Named Entity Recognition (NER) on nutrition labels. It identifies and classifies nutritional elements in text, providing a structured interpretation of the content typically found on those labels.
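
A minimal usage sketch, assuming the checkpoint is available on the Hugging Face Hub under the `sgarbi/bert-fda-nutrition-ner` identifier named in this card and that the `transformers` library is installed. The helper names `load_ner_pipeline` and `format_entities` are hypothetical and introduced only for illustration.

```python
from typing import Dict, List, Tuple

MODEL_ID = "sgarbi/bert-fda-nutrition-ner"  # identifier taken from this card


def load_ner_pipeline(model_id: str = MODEL_ID):
    """Build a token-classification pipeline (downloads the weights on first use)."""
    # Imported lazily so the formatting helper below works without transformers.
    from transformers import pipeline

    # aggregation_strategy="simple" merges word pieces into whole entity spans.
    return pipeline(
        "token-classification",
        model=model_id,
        aggregation_strategy="simple",
    )


def format_entities(results: List[Dict]) -> List[Tuple[str, str, float]]:
    """Reduce aggregated pipeline output to (label, text, rounded score) tuples."""
    return [
        (r["entity_group"], r["word"], round(float(r["score"]), 3))
        for r in results
    ]


# Example call (requires network access to fetch the model):
# ner = load_ner_pipeline()
# print(format_entities(ner("Ingredients: Goat's milk, garlic, herbs, sea salt")))
```

The formatting step is kept separate from model loading so it can be reused on stored predictions without loading the model again.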


## Training Data Description

The training data for the `sgarbi/bert-fda-nutrition-ner` model was curated from publicly available U.S. government food datasets. It primarily originates from the FoodData Central website, which is hosted by the U.S. Department of Agriculture (USDA), and covers nutritional information and labeling for a wide range of food products.

### Data Source
- **Source**: FoodData Central (U.S. Department of Agriculture).
- **Dataset Link**: [FoodData Central downloads](https://fdc.nal.usda.gov/download-datasets.html)
- **Content**: The dataset includes detailed nutritional data, such as ingredient lists, nutritional values, serving sizes, and other essential label information.

### Preprocessing and Augmentation Steps
- **Extraction**: Key textual data, including nutritional facts and ingredient lists, were extracted from the FoodData Central dataset.
- **Normalization**: All text underwent normalization for consistency, including converting to lowercase and removing redundant formatting.
- **Entity Tagging**: Significant nutritional elements were manually tagged, creating a labeled dataset for training. This includes macronutrients, vitamins, minerals, and various specific dietary components.
- **Tokenization and Formatting**: The data was tokenized and formatted to meet the BERT model's input requirements.
- **Robustness Techniques**:
    - **Introducing Noise**: To enhance the model's ability to handle real-world, imperfect data, deliberate noise was introduced into the training set. This included:
        - **Sentence Swaps**: Random swapping of sentences within the text to promote the model's understanding of varied sentence structures.
        - **Introducing Misspellings**: Deliberately inserting common spelling errors to train the model to recognize and correctly process misspelled words frequently encountered in real-world scenarios such as inaccurate document scans.
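
The two noise techniques above can be sketched roughly as follows. This is an illustration only: the card does not specify the actual augmentation code, so the function names, the adjacent-character transposition used as a stand-in for a spelling error, and the `typo_rate` parameter are all assumptions.

```python
import random
from typing import List


def swap_sentences(lines: List[str], rng: random.Random) -> List[str]:
    """Return a copy of `lines` with two randomly chosen entries swapped."""
    out = list(lines)
    if len(out) >= 2:
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def misspell(word: str, rng: random.Random) -> str:
    """Transpose two adjacent characters (assumed stand-in for a typo)."""
    if len(word) < 4:
        return word  # leave very short tokens such as units untouched
    k = rng.randrange(len(word) - 1)
    return word[:k] + word[k + 1] + word[k] + word[k + 2:]


def add_noise(text: str, typo_rate: float = 0.1, seed: int = 0) -> str:
    """Swap one pair of lines, then misspell each word with probability `typo_rate`."""
    rng = random.Random(seed)
    lines = swap_sentences(text.splitlines(), rng)
    noisy = [
        " ".join(
            misspell(w, rng) if rng.random() < typo_rate else w
            for w in line.split(" ")
        )
        for line in lines
    ]
    return "\n".join(noisy)
```

Both transformations keep the number of tokens per line unchanged, so in a labeled dataset the entity tags could be moved together with their tokens. A fixed `seed` makes the augmentation reproducible.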

### Relevance to the Model
- A diverse, comprehensive dataset helps prepare the model for nutritional NER tasks.
- The noise and sentence variations introduced during training make the model more robust to imperfections in real-world nutritional data.




## Ethical Considerations

- The model was trained only on publicly available data from food product labels. No private or sensitive data was used.
- The model should not be used to make recommendations about nutrition or health; it only extracts nutritional entities from text. Any nutrition advice should come from qualified experts.
- The model may have biases related to the language and phrasing on certain types of food product labels. It should be re-evaluated periodically on new test sets.

## Label Map

The following is the label map used in the model, defining the various entity types that the model can recognize:

```python
label_map = {
    'O': 0,
    'B-MACRONUTRIENTS': 1,
    'I-MACRONUTRIENTS': 2,
    'B-PROXIMATES': 3,
    'I-PROXIMATES': 4,
    'B-PROTEINS': 5,
    'I-PROTEINS': 6,
    'B-FIBER': 7,
    'I-FIBER': 8,
    'B-CARBOHYDRATES': 9,
    'I-CARBOHYDRATES': 10,
    'B-SUGARS': 11,
    'I-SUGARS': 12,
    'B-ALCOHOLS': 13,
    'I-ALCOHOLS': 14,
    'B-PHYSICALPROPERTIES': 15,
    'I-PHYSICALPROPERTIES': 16,
    'B-ORGANICCOMPOUNDS': 17,
    'I-ORGANICCOMPOUNDS': 18,
    'B-WATER': 19,
    'I-WATER': 20,
    'B-STIMULANTS': 21,
    'I-STIMULANTS': 22,
    'B-PRESERVATIVES': 23,
    'I-PRESERVATIVES': 24,
    'B-MINERALS': 25,
    'I-MINERALS': 26,
    'B-VITAMINS': 27,
    'I-VITAMINS': 28,
    'B-CAROTENOIDS': 29,
    'I-CAROTENOIDS': 30,
    'B-OTHER': 31,
    'I-OTHER': 32,
    'B-AMINOACIDS': 33,
    'I-AMINOACIDS': 34,
    'B-LIPIDS': 35,
    'I-LIPIDS': 36,
    'B-ANTIOXIDANTS': 37,
    'I-ANTIOXIDANTS': 38,
    'B-PHYTOCHEMICALS': 39,
    'I-PHYTOCHEMICALS': 40,
    'B-DIETARYFIBER': 41,
    'I-DIETARYFIBER': 42,
    'B-INGREDIENTS': 43,
    'I-INGREDIENTS': 44,
    'B-QUANTITY': 45,
    'I-QUANTITY': 46,
    'B-COUNTRY_ORIGIN': 47,
    'I-COUNTRY_ORIGIN': 48,
    'B-SERVING_SIZE': 49,
    'I-SERVING_SIZE': 50,
    'B-PACKAGE_WEIGHT': 51,
    'I-PACKAGE_WEIGHT': 52,
}
```
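
The label map follows the BIO scheme: a `B-` tag opens an entity and `I-` tags continue it. A minimal sketch of turning predicted label ids back into entity spans is shown below. It uses an abbreviated copy of the map, and the helper name `bio_to_spans`, the sample tokens, and their tags are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Abbreviated copy of the label map; the full map works the same way.
label_map: Dict[str, int] = {
    "O": 0,
    "B-INGREDIENTS": 43,
    "I-INGREDIENTS": 44,
    "B-QUANTITY": 45,
    "I-QUANTITY": 46,
}
id2label: Dict[int, str] = {idx: name for name, idx in label_map.items()}


def bio_to_spans(tokens: List[str], label_ids: List[int]) -> List[Tuple[str, str]]:
    """Group B-/I- tagged tokens into (entity_type, text) spans."""
    spans: List[Tuple[str, List[str]]] = []
    current_type: Optional[str] = None
    for token, idx in zip(tokens, label_ids):
        tag = id2label[idx]
        if tag.startswith("B-"):
            current_type = tag[2:]
            spans.append((current_type, [token]))
        elif tag.startswith("I-") and current_type == tag[2:]:
            spans[-1][1].append(token)
        else:
            # "O", or a stray I- tag that does not continue the open span
            current_type = None
    return [(etype, " ".join(words)) for etype, words in spans]


# Hypothetical tokens and tags, for illustration only.
print(bio_to_spans(["sea", "salt", ",", "571", "mg"], [43, 44, 0, 45, 46]))
```

Stray `I-` tags are dropped in this sketch; another reasonable choice would be to treat them as the start of a new span.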

## Caveats and Recommendations

- Despite the noise added during training, the model may struggle with typos, uncommon ingredients, or unusual phrasing it has not seen.
- Performance should be monitored periodically, especially when applying the model to new types of text data.
- For best results, retrain the model on text data that matches the target use case.
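
One way to monitor performance on a new test set is entity-level precision, recall, and F1, where a prediction counts only if its type and boundaries match exactly. The sketch below assumes spans are represented as `(entity_type, start_token, end_token_exclusive)` tuples; that representation and the function name `span_prf` are illustrative choices, not part of the model.

```python
from typing import Dict, Set, Tuple

Span = Tuple[str, int, int]  # (entity_type, start_token, end_token_exclusive)


def span_prf(gold: Set[Span], pred: Set[Span]) -> Dict[str, float]:
    """Exact-match entity-level precision, recall, and F1."""
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical annotations: the second predicted span has a wrong boundary.
gold = {("MINERALS", 0, 1), ("QUANTITY", 1, 3)}
pred = {("MINERALS", 0, 1), ("QUANTITY", 1, 2)}
print(span_prf(gold, pred))
```

Tracking these numbers per entity type over time would show which categories degrade first on new kinds of label text.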
