---
library_name: span-marker
tags:
- span-marker
- token-classification
- ner
- named-entity-recognition
- generated_from_span_marker_trainer
datasets:
- SpeedOfMagic/ontonotes_english
metrics:
- precision
- recall
- f1
widget:
- text: Late Friday night, the Senate voted 87 - 7 to approve an estimated $13.5 billion measure that had been stripped of hundreds of provisions that would have widened, rather than narrowed, the federal budget deficit.
- text: Among classes for which details were available, yields ranged from 8.78%, or 75 basis points over two - year Treasury securities, to 10.05%, or 200 basis points over 10 - year Treasurys.
- text: According to statistics, in the past five years, Tianjin Bonded Area has attracted a total of over 3000 enterprises from 73 countries and regions all over the world and 25 domestic provinces, cities and municipalities to invest, reaching a total agreed investment value of more than 3 billion US dollars and a total agreed foreign investment reaching more than 2 billion US dollars.
- text: But Dirk Van Dongen, president of the National Association of Wholesaler - Distributors, said that last month's rise "isn't as bad an omen" as the 0.9% figure suggests.
- text: Robert White, Canadian Auto Workers union president, used the impending Scarborough shutdown to criticize the U.S. - Canada free trade agreement and its champion, Prime Minister Brian Mulroney.
pipeline_tag: token-classification
model-index:
- name: SpanMarker
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: Unknown
      type: SpeedOfMagic/ontonotes_english
      split: test
    metrics:
    - type: f1
      value: 0.9077127659574469
      name: F1
    - type: precision
      value: 0.9045852107076597
      name: Precision
    - type: recall
      value: 0.9108620229516947
      name: Recall
---

# SpanMarker

This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model trained on the [SpeedOfMagic/ontonotes_english](https://huggingface.co/datasets/SpeedOfMagic/ontonotes_english) dataset that can be used for Named Entity Recognition.

## Model Details

### Model Description
- **Model Type:** SpanMarker
- **Maximum Sequence Length:** 256 tokens
- **Maximum Entity Length:** 8 words
- **Training Dataset:** [SpeedOfMagic/ontonotes_english](https://huggingface.co/datasets/SpeedOfMagic/ontonotes_english)

### Model Sources
- **Repository:** [SpanMarker on GitHub](https://github.com/tomaarsen/SpanMarkerNER)
- **Thesis:** [SpanMarker For Named Entity Recognition](https://raw.githubusercontent.com/tomaarsen/SpanMarkerNER/main/thesis.pdf)

### Model Labels
| Label       | Examples                                                                                               |
|:------------|:-------------------------------------------------------------------------------------------------------|
| CARDINAL    | "tens of thousands", "One point three million", "two"                                                  |
| DATE        | "Sunday", "a year", "two thousand one"                                                                 |
| EVENT       | "World War Two", "Katrina", "Hurricane Katrina"                                                        |
| FAC         | "Route 80", "the White House", "Dylan 's Candy Bars"                                                   |
| GPE         | "America", "Atlanta", "Miami"                                                                          |
| LANGUAGE    | "English", "Russian", "Arabic"                                                                         |
| LAW         | "Roe", "the Patriot Act", "FISA"                                                                       |
| LOC         | "Asia", "the Gulf Coast", "the West Bank"                                                              |
| MONEY       | "twenty - seven million dollars", "one hundred billion dollars", "less than fourteen thousand dollars" |
| NORP        | "American", "Muslim", "Americans"                                                                      |
| ORDINAL     | "third", "First", "first"                                                                              |
| ORG         | "Wal - Mart", "Wal - Mart 's", "a Wal - Mart"                                                          |
| PERCENT     | "seventeen percent", "sixty - seven percent", "a hundred percent"                                      |
| PERSON      | "Kira Phillips", "Rick Sanchez", "Bob Shapiro"                                                         |
| PRODUCT     | "Columbia", "Discovery Shuttle", "Discovery"                                                           |
| QUANTITY    | "forty - five miles", "six thousand feet", "a hundred and seventy pounds"                              |
| TIME        | "tonight", "evening", "Tonight"                                                                        |
| WORK_OF_ART | "A Tale of Two Cities", "Newsnight", "Headline News"                                                   |

## Evaluation

### Metrics
| Label       | Precision | Recall | F1     |
|:------------|:----------|:-------|:-------|
| **all**     | 0.9046    | 0.9109 | 0.9077 |
| CARDINAL    | 0.8579    | 0.8524 | 0.8552 |
| DATE        | 0.8634    | 0.8893 | 0.8762 |
| EVENT       | 0.6719    | 0.6935 | 0.6825 |
| FAC         | 0.7211    | 0.7852 | 0.7518 |
| GPE         | 0.9725    | 0.9647 | 0.9686 |
| LANGUAGE    | 0.9286    | 0.5909 | 0.7222 |
| LAW         | 0.7941    | 0.7297 | 0.7606 |
| LOC         | 0.7632    | 0.8101 | 0.7859 |
| MONEY       | 0.8914    | 0.8885 | 0.8900 |
| NORP        | 0.9311    | 0.9643 | 0.9474 |
| ORDINAL     | 0.8227    | 0.9282 | 0.8723 |
| ORG         | 0.9217    | 0.9073 | 0.9145 |
| PERCENT     | 0.9145    | 0.9198 | 0.9171 |
| PERSON      | 0.9638    | 0.9643 | 0.9640 |
| PRODUCT     | 0.6778    | 0.8026 | 0.7349 |
| QUANTITY    | 0.7850    | 0.8000 | 0.7925 |
| TIME        | 0.6794    | 0.6730 | 0.6762 |
| WORK_OF_ART | 0.6562    | 0.6442 | 0.6502 |

## Uses

### Direct Use for Inference

```python
from span_marker import SpanMarkerModel

# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("supreethrao/instructNER_ontonotes5_xl")
# Run inference
entities = model.predict("Robert White, Canadian Auto Workers union president, used the impending Scarborough shutdown to criticize the U.S. - Canada free trade agreement and its champion, Prime Minister Brian Mulroney.")
```

### Downstream Use
You can finetune this model on your own dataset.

```python
from datasets import load_dataset
from span_marker import SpanMarkerModel, Trainer

# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("supreethrao/instructNER_ontonotes5_xl")

# Specify a Dataset with "tokens" and "ner_tags" columns
dataset = load_dataset("conll2003")  # For example CoNLL2003

# Initialize a Trainer using the pretrained model & dataset
trainer = Trainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
trainer.save_model("supreethrao/instructNER_ontonotes5_xl-finetuned")
```
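
The predictions returned by the `predict` call under Direct Use for Inference can be post-processed with plain Python. The sketch below assumes each prediction is a dictionary with `span`, `label`, and `score` keys; verify this against your installed SpanMarker version. The `entities` list here is hand-written for illustration and is not real model output, and `group_by_label` and `min_score` are hypothetical names, not part of the SpanMarker API.

```python
from collections import defaultdict

# Hypothetical predictions, hand-written for illustration. They mimic the
# assumed shape of `model.predict(...)` output; the scores are made up.
entities = [
    {"span": "Robert White", "label": "PERSON", "score": 0.99},
    {"span": "Canadian Auto Workers", "label": "ORG", "score": 0.97},
    {"span": "Scarborough", "label": "GPE", "score": 0.42},
    {"span": "Brian Mulroney", "label": "PERSON", "score": 0.98},
]


def group_by_label(entities, min_score=0.5):
    """Group entity spans by label, dropping predictions below min_score."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:
            grouped[entity["label"]].append(entity["span"])
    return dict(grouped)


print(group_by_label(entities))
```

A score threshold like this may be worth tuning per label, since the evaluation table shows noticeably lower precision for labels such as EVENT, TIME, and WORK_OF_ART than for PERSON or GPE.
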

## Training Details

### Training Set Metrics
| Training set          | Min | Median  | Max |
|:----------------------|:----|:--------|:----|
| Sentence length       | 1   | 18.1647 | 210 |
| Entities per sentence | 0   | 1.3655  | 32  |

### Training Hyperparameters
- learning_rate: 5e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- distributed_type: multi-GPU
- num_devices: 2
- total_train_batch_size: 32
- total_eval_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 3
- mixed_precision_training: Native AMP

### Framework Versions
- Python: 3.10.13
- SpanMarker: 1.5.0
- Transformers: 4.35.2
- PyTorch: 2.1.1
- Datasets: 2.15.0
- Tokenizers: 0.15.0

## Citation

### BibTeX
```
@software{Aarsen_SpanMarker,
    author = {Aarsen, Tom},
    license = {Apache-2.0},
    title = {{SpanMarker for Named Entity Recognition}},
    url = {https://github.com/tomaarsen/SpanMarkerNER}
}
```