---
language: en
license: mit
pipeline_tag: token-classification
tags:
  - token-classification
  - NER
widget:
  - text: >
      Food Name: GARLIC AND FINE HERBS
      Brand: CELEBRITY
      Food Category: Cheese
      Ingredients: Goat's milk, garlic, herbs, sea salt, potassium sorbate, microbial enzyme, bacterial culture
      Nutrition Facts per Serving (28g): Calories: 250 kcal Protein: 14.3g Fat: 21.4g Carbs: 0g Sugars: 3.57g Sodium: 571mg
      Label Nutrition Facts: Fat: 5.99g Saturated Fat: 4g Trans Fat: 0.199g Cholesterol: 19.9mg Sodium: 160mg Protein: 4g Calcium: 19.9mg Calories: 70 kcal
      Update Log: Nutrient Added: Value 5 Nutrient Updated: Value 4
---

This is a BERT-based sequence labeling model designed for Named Entity Recognition (NER) in the context of nutrition labeling. It identifies and classifies nutritional elements in text data, providing a structured interpretation of the content typically found on nutrition labels.

## Training Data Description

The training data for the `sgarbi/bert-fda-nutrition-ner` model was curated from the publicly available datasets of the U.S. Food and Drug Administration (FDA). The data primarily originates from the FoodData Central website and features comprehensive nutritional information and labeling for a wide array of food products.

### Data Source

- **Source**: U.S. Food and Drug Administration (FDA), FoodData Central.
- **Dataset Link**: [FDA FoodData Central](https://fdc.nal.usda.gov/download-datasets.html)
- **Content**: The dataset includes detailed nutritional data, such as ingredient lists, nutritional values, serving sizes, and other essential label information.

### Preprocessing and Augmentation Steps

- **Extraction**: Key textual data, including nutritional facts and ingredient lists, was extracted from the FDA dataset.
- **Normalization**: All text was normalized for consistency, including conversion to lowercase and removal of redundant formatting.
- **Entity Tagging**: Significant nutritional elements were manually tagged, creating a labeled dataset for training. This includes macronutrients, vitamins, minerals, and various specific dietary components.
- **Tokenization and Formatting**: The data was tokenized and formatted to meet the BERT model's input requirements.
- **Robustness Techniques**:
  - **Introducing Noise**: To enhance the model's ability to handle real-world, imperfect data, deliberate noise was introduced into the training set (see the sketch after this list). This included:
    - **Sentence Swaps**: Random swapping of sentences within the text to expose the model to varied sentence structures.
    - **Introducing Misspellings**: Deliberate insertion of common spelling errors to train the model to recognize and correctly process misspelled words frequently encountered in real-world scenarios, such as inaccurate document scans.
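The noise-injection strategies above can be reproduced with simple text transformations. The snippet below is a minimal sketch, assuming already-lowercased input (per the normalization step); the `swap_sentences` and `introduce_misspellings` helpers and the misspelling table are illustrative stand-ins, not the exact augmentation routines used to build the training set.

```python
import random

# Illustrative misspelling table; the actual pipeline likely used a broader list.
COMMON_MISSPELLINGS = {
    "protein": "protien",
    "sodium": "sodiem",
    "ingredients": "ingrediants",
    "calcium": "calcuim",
}

def swap_sentences(text: str, seed: int = 0) -> str:
    """Randomly reorder sentences so the model sees varied sentence structures."""
    rng = random.Random(seed)
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    rng.shuffle(sentences)
    return ". ".join(sentences) + "."

def introduce_misspellings(text: str, rate: float = 0.15, seed: int = 0) -> str:
    """Replace a fraction of known tokens with common misspellings (OCR-style noise)."""
    rng = random.Random(seed)
    tokens = text.split()
    for i, tok in enumerate(tokens):
        key = tok.strip(":,.")
        if key in COMMON_MISSPELLINGS and rng.random() < rate:
            tokens[i] = tok.replace(key, COMMON_MISSPELLINGS[key])
    return " ".join(tokens)

# Input is assumed lowercased by the normalization step described above.
sample = "protein: 14g. sodium: 571mg. ingredients: goat's milk, garlic, herbs."
print(swap_sentences(sample))
print(introduce_misspellings(sample, rate=1.0))
```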
### Relevance to the Model

- The use of a diverse and comprehensive dataset ensures that the model is well-equipped for nutritional NER tasks.
- The introduction of noise and sentence variations in the training data aids in building a more robust model, capable of accurately processing and analyzing real-world nutritional data that might contain imperfections.

## Ethical Considerations

- The model was trained only on publicly available data from food product labels. No private or sensitive data was used.
- The model should not be used to make recommendations about nutrition or health; it only extracts nutritional entities from text. Any nutrition advice should come from qualified experts.
- The model may have biases related to the language and phrasing on certain types of food product labels. It should be re-evaluated periodically on new test sets.

## Label Map

The following is the label map used in the model, defining the entity types that the model can recognize:

```python
label_map = {
    'O': 0,
    'B-MACRONUTRIENTS': 1, 'I-MACRONUTRIENTS': 2,
    'B-PROXIMATES': 3, 'I-PROXIMATES': 4,
    'B-PROTEINS': 5, 'I-PROTEINS': 6,
    'B-FIBER': 7, 'I-FIBER': 8,
    'B-CARBOHYDRATES': 9, 'I-CARBOHYDRATES': 10,
    'B-SUGARS': 11, 'I-SUGARS': 12,
    'B-ALCOHOLS': 13, 'I-ALCOHOLS': 14,
    'B-PHYSICALPROPERTIES': 15, 'I-PHYSICALPROPERTIES': 16,
    'B-ORGANICCOMPOUNDS': 17, 'I-ORGANICCOMPOUNDS': 18,
    'B-WATER': 19, 'I-WATER': 20,
    'B-STIMULANTS': 21, 'I-STIMULANTS': 22,
    'B-PRESERVATIVES': 23, 'I-PRESERVATIVES': 24,
    'B-MINERALS': 25, 'I-MINERALS': 26,
    'B-VITAMINS': 27, 'I-VITAMINS': 28,
    'B-CAROTENOIDS': 29, 'I-CAROTENOIDS': 30,
    'B-OTHER': 31, 'I-OTHER': 32,
    'B-AMINOACIDS': 33, 'I-AMINOACIDS': 34,
    'B-LIPIDS': 35, 'I-LIPIDS': 36,
    'B-ANTIOXIDANTS': 37, 'I-ANTIOXIDANTS': 38,
    'B-PHYTOCHEMICALS': 39, 'I-PHYTOCHEMICALS': 40,
    'B-DIETARYFIBER': 41, 'I-DIETARYFIBER': 42,
    'B-INGREDIENTS': 43, 'I-INGREDIENTS': 44,
    'B-QUANTITY': 45, 'I-QUANTITY': 46,
    'B-COUNTRY_ORIGIN': 47, 'I-COUNTRY_ORIGIN': 48,
    'B-SERVING_SIZE': 49, 'I-SERVING_SIZE': 50,
    'B-PACKAGE_WEIGHT': 51, 'I-PACKAGE_WEIGHT': 52,
}
```

## Caveats and Recommendations

- The model may struggle with typos, uncommon ingredients, or unusual phrasing not seen during training.
- Performance should be monitored periodically, especially when applying the model to new types of text data.
- For best results, retrain the model on text data that matches the target use case.
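For inference, the model can be loaded through the Hugging Face Transformers `pipeline` API. The snippet below is a minimal usage sketch; the `"simple"` aggregation strategy merges B-/I- subword predictions from the label map above into whole-entity spans, and the exact entities and scores printed depend on the released checkpoint.

```python
from transformers import pipeline

# Load the model from the Hub; "simple" aggregation groups subword pieces
# into entity spans (e.g. B-MINERALS + I-MINERALS -> MINERALS).
ner = pipeline(
    "token-classification",
    model="sgarbi/bert-fda-nutrition-ner",
    aggregation_strategy="simple",
)

text = (
    "Ingredients: goat's milk, garlic, herbs, sea salt, potassium sorbate. "
    "Nutrition facts per serving (28g): calories 250 kcal, protein 14.3g, sodium 571mg."
)

for entity in ner(text):
    # Each result carries the entity group, the matched text span, and a confidence score.
    print(f"{entity['entity_group']:<20} {entity['word']!r}  score={entity['score']:.2f}")
```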