Named Entity Recognition is a crucial task in Natural Language Processing, helping us identify and classify key information in text. While spaCy provides excellent pre-trained models, sometimes we need to recognize entities specific to our domain. That's where custom NER models come in handy!
Before we begin, make sure you have spaCy installed:
pip install spacy
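Optionally, download a pre-trained model as well. The walkthrough below trains from a blank pipeline, so this is only needed if you want to compare against, or later build on, an existing model: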
python -m spacy download en_core_web_sm
The first step in creating a custom NER model is preparing your training data. spaCy expects the data in a specific format. Here's an example:
TRAIN_DATA = [
    (
        "Apple is looking at buying U.K. startup for $1 billion",
        {"entities": [(0, 5, "ORG"), (27, 31, "GPE"), (44, 54, "MONEY")]},
    ),
    (
        "San Francisco considers banning sidewalk delivery robots",
        {"entities": [(0, 13, "GPE")]},
    ),
]
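Each item in the list is a tuple containing the text and a dictionary with entity annotations. Each annotation has the form (start_index, end_index, label), where the indices are character offsets into the text and end_index is exclusive, so text[0:5] is "Apple" in the first example.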
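Off-by-one offsets are an easy mistake when annotating by hand, so it can help to print what each span actually selects before training. The following is a minimal sketch in plain Python; `span_texts` and `find_bad_spans` are hypothetical helper names, not part of spaCy, and the checks are only basic sanity checks.

```python
# Same layout as the training data above: (text, {"entities": [(start, end, label), ...]}).
TRAIN_DATA = [
    (
        "Apple is looking at buying U.K. startup for $1 billion",
        {"entities": [(0, 5, "ORG"), (27, 31, "GPE"), (44, 54, "MONEY")]},
    ),
    (
        "San Francisco considers banning sidewalk delivery robots",
        {"entities": [(0, 13, "GPE")]},
    ),
]


def span_texts(text, annotations):
    """Return the (substring, label) pair that each annotation selects."""
    return [(text[start:end], label) for start, end, label in annotations["entities"]]


def find_bad_spans(text, annotations):
    """Hypothetical helper: flag spans that are out of range, empty, or padded with whitespace."""
    bad = []
    for start, end, label in annotations["entities"]:
        chunk = text[start:end]
        if not (0 <= start < end <= len(text)) or chunk != chunk.strip():
            bad.append((start, end, label))
    return bad


for text, annotations in TRAIN_DATA:
    print(span_texts(text, annotations), find_bad_spans(text, annotations))
```

If a span prints with a stray space or a clipped character, fix the offsets in the data rather than working around it later.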
Next, we'll create a blank spaCy model to train:
import random

import spacy
from spacy.training import Example
from spacy.util import minibatch
from thinc.api import compounding

# Written against the spaCy v3 API, where components are added by name.
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner", last=True)
Before training, we need to add our custom labels to the NER component:
# The label is the third element of each (start, end, label) annotation.
for _, annotations in TRAIN_DATA:
    for ent in annotations.get("entities"):
        ner.add_label(ent[2])
Now comes the exciting part: training our model! Here's a simple training loop:
# In spaCy v3, nlp.update expects Example objects rather than raw texts and dicts.
examples = [
    Example.from_dict(nlp.make_doc(text), annotations)
    for text, annotations in TRAIN_DATA
]
optimizer = nlp.initialize(lambda: examples)
for iteration in range(100):
    random.shuffle(examples)
    losses = {}
    # Batch size starts at 4 and grows slowly toward 32.
    batches = minibatch(examples, size=compounding(4.0, 32.0, 1.001))
    for batch in batches:
        nlp.update(batch, sgd=optimizer, drop=0.5, losses=losses)
    print("Losses", losses)
This loop shuffles the data, creates mini-batches, and updates the model for each batch. The drop parameter adds dropout for regularization.
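The batch sizes in the loop come from a compounding schedule with arguments 4, 32, and 1.001. As a rough illustration of what such a schedule does, here is a plain-Python sketch, assuming the size is multiplied by a constant factor after every batch and capped at the upper bound. `compounding_sizes` is a hypothetical stand-in written for this example, not the library function.

```python
from itertools import islice


def compounding_sizes(start, stop, compound):
    """Yield batch sizes that grow by the factor `compound` each step, capped at `stop`."""
    current = float(start)
    while True:
        yield min(current, stop)
        current *= compound


# The first few sizes grow very slowly from the starting value.
first_three = list(islice(compounding_sizes(4.0, 32.0, 1.001), 3))
print(first_three)

# Count how many batches it takes before the cap is reached.
steps_to_cap = next(
    i for i, size in enumerate(compounding_sizes(4.0, 32.0, 1.001)) if size >= 32.0
)
print(steps_to_cap)
```

With a factor this close to 1, the schedule needs on the order of two thousand batches to reach the cap, so on a tiny dataset the batch size effectively stays near its starting value.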
After training, it's time to see our model in action:
test_text = "When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously."
doc = nlp(test_text)
# Print each predicted entity span together with its label.
print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
Don't forget to save your hard work:
nlp.to_disk("./custom_ner_model")
You can load it later with:
loaded_nlp = spacy.load("./custom_ner_model")
More Data: The more quality training data you have, the better your model will perform.
Balanced Dataset: Ensure your dataset covers all entity types you want to recognize.
Iterative Improvement: Test your model, identify errors, and refine your training data accordingly.
Pre-trained Embeddings: Consider using pre-trained word embeddings to improve performance.
Hyperparameter Tuning: Experiment with different learning rates, batch sizes, and dropout values.
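To make the "test, identify errors, refine" step concrete, one option is to compare predicted spans against hand-labeled ones and list what was missed or invented. The sketch below uses exact-match scoring on (start, end, label) tuples in plain Python. `score_entities` is a hypothetical helper, and the `predicted` list is made-up model output used purely for illustration.

```python
def score_entities(gold, predicted):
    """Exact-match precision, recall, and F1 over (start, end, label) tuples."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Gold annotations for the first training sentence.
gold = [(0, 5, "ORG"), (27, 31, "GPE"), (44, 54, "MONEY")]
# Hypothetical predictions: one correct span, one wrong label, one span missed.
predicted = [(0, 5, "ORG"), (27, 31, "ORG")]

precision, recall, f1 = score_entities(gold, predicted)
missed = sorted(set(gold) - set(predicted))
spurious = sorted(set(predicted) - set(gold))
print(precision, recall, f1)
print("Missed:", missed)
print("Spurious:", spurious)
```

For a real model, the predicted tuples can be built from a processed doc as (ent.start_char, ent.end_char, ent.label_) for each ent in doc.ents. The missed and spurious lists then point at the kinds of examples worth adding to the training data.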
Creating custom NER models with spaCy opens up a world of possibilities for extracting domain-specific information from text. With these tools in your Python NLP toolkit, you're well on your way to tackling complex text analysis tasks. Happy entity recognizing!