Named Entity Recognition (NER) is a crucial task in Natural Language Processing that involves identifying and classifying named entities in text into predefined categories. These categories typically include person names, organizations, locations, dates, and more. In this blog post, we'll explore how to perform NER using spaCy, a popular and efficient NLP library in Python.
Before we dive into NER, let's set up our environment:
    import spacy

    # Load the small English model (it must already be installed)
    nlp = spacy.load("en_core_web_sm")

This snippet loads the small English model. Note that spacy.load does not download anything: the model is a separate package, so install it once beforehand with python -m spacy download en_core_web_sm. spaCy offers different model sizes, with larger models generally providing better accuracy at the cost of increased computational resources.
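If the model package is missing, spacy.load raises an OSError. The sketch below wraps the call so that the error says how to fix it. The helper name load_model is made up for this post and is not part of spaCy.

```python
import spacy


def load_model(name="en_core_web_sm"):
    """Load a spaCy pipeline, with a clearer error if it is not installed.

    `load_model` is a hypothetical helper for this post, not a spaCy API.
    """
    try:
        return spacy.load(name)
    except OSError as err:
        # spaCy raises OSError when it cannot find the model package or path
        raise RuntimeError(
            f"Model '{name}' is not installed. "
            f"Install it with: python -m spacy download {name}"
        ) from err
```

Once the model is installed, nlp = load_model() should behave the same as calling spacy.load("en_core_web_sm") directly.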
Let's start with a simple example:
    text = "Apple Inc. is planning to open a new store in New York City next month."
    doc = nlp(text)

    for ent in doc.ents:
        print(f"{ent.text} - {ent.label_}")
Output:
Apple Inc. - ORG
New York City - GPE
next month - DATE
In this example, spaCy correctly identifies "Apple Inc." as an organization (ORG), "New York City" as a geopolitical entity (GPE), and "next month" as a date.
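Each entity is a Span that also records where it sits in the original string, which helps when you need to highlight or store results. Here is a minimal sketch; the helper entities_as_dicts is a name chosen for illustration, and the demo uses a blank pipeline with one hand-set entity (an assumption, so that it runs without a trained model).

```python
import spacy


def entities_as_dicts(doc):
    """Return one dict per entity with its text, label, and character span."""
    return [
        {
            "text": ent.text,
            "label": ent.label_,
            "start": ent.start_char,  # offset of the first character in doc.text
            "end": ent.end_char,      # offset just past the last character
        }
        for ent in doc.ents
    ]


# Demo without a trained model: a blank pipeline plus one hand-set entity.
# With en_core_web_sm you would pass the doc produced by nlp(text) instead.
nlp_blank = spacy.blank("en")
demo_doc = nlp_blank("Apple Inc. is planning to open a new store.")
demo_doc.ents = [demo_doc.char_span(0, 10, label="ORG")]
print(entities_as_dicts(demo_doc))
```

The start and end values index into doc.text, so doc.text[start:end] gives back the entity text.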
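spaCy uses a wide range of entity labels. Common ones include PERSON (people), ORG (companies, agencies, institutions), GPE (countries, cities, states), LOC (non-GPE locations such as mountain ranges and bodies of water), DATE (absolute or relative dates), and MONEY (monetary values).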
To see the full list of labels and their descriptions:
    import spacy
    from spacy import displacy

    nlp = spacy.load("en_core_web_sm")
    ner = nlp.get_pipe("ner")

    for label in ner.labels:
        print(f"{label}: {spacy.explain(label)}")
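spacy.explain only knows the labels in spaCy's built-in glossary and returns None for anything else, such as a custom label. The sketch below adds a fallback for that case; describe_label is a hypothetical helper name, and newer spaCy versions may also emit a warning for unknown terms.

```python
import spacy


def describe_label(label):
    """Return spaCy's description of a label, or a placeholder if it has none."""
    description = spacy.explain(label)  # None when the label is not in the glossary
    return description if description is not None else "(no built-in description)"


for label in ["PERSON", "ORG", "GPE", "PROGRAMMING_LANGUAGE"]:
    print(f"{label}: {describe_label(label)}")
```

This does not need a loaded model, because the glossary ships with the spaCy package itself.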
SpaCy provides a handy visualization tool called displaCy:
    text = "When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company believed in the technology."
    doc = nlp(text)

    displacy.serve(doc, style="ent")
This starts a local web server (on port 5000 by default). Open http://localhost:5000 in your browser to see the text with highlighted entities.
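If you would rather save the visualization than serve it, displacy.render can return the markup as a string. This is a sketch under a few assumptions: entities_to_html is a made-up helper name, and the demo uses a blank pipeline with one hand-set entity so that it runs without a trained model.

```python
import spacy
from spacy import displacy


def entities_to_html(doc):
    """Render a Doc's entities as a standalone HTML page and return it as a string."""
    # jupyter=False forces a returned string even when run inside a notebook
    return displacy.render(doc, style="ent", page=True, jupyter=False)


# Demo without a trained model: blank pipeline plus one hand-set entity.
# With en_core_web_sm you would pass the doc produced by nlp(text) instead.
nlp_blank = spacy.blank("en")
demo_text = "Sebastian Thrun worked on self-driving cars at Google."
demo_doc = nlp_blank(demo_text)
start = demo_text.index("Google")
demo_doc.ents = [demo_doc.char_span(start, start + len("Google"), label="ORG")]

html = entities_to_html(demo_doc)
print(html[:80])
```

The returned string can then be written to a file, for example with pathlib.Path("entities.html").write_text(html, encoding="utf-8"), and opened in any browser.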
While spaCy's pre-trained models work well for general text, you might need to customize NER for specific domains. Here's a basic example of how to add custom entities:
    import spacy
    from spacy.language import Language
    from spacy.tokens import Span

    nlp = spacy.load("en_core_web_sm")

    # Register the function under a name so nlp.add_pipe can find it (spaCy v3+)
    @Language.component("tech_entities")
    def add_tech_entities(doc):
        new_ents = []
        for token in doc:
            if token.text in ["Python", "JavaScript", "C++"]:
                new_ents.append(Span(doc, token.i, token.i + 1, label="PROGRAMMING_LANGUAGE"))
        # This runs before "ner", so doc.ents is normally still empty here
        doc.ents = list(doc.ents) + new_ents
        return doc

    nlp.add_pipe("tech_entities", before="ner")

    text = "Developers use Python, JavaScript, and C++ for various projects."
    doc = nlp(text)

    for ent in doc.ents:
        print(f"{ent.text} - {ent.label_}")
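This example registers a custom pipeline component that labels programming languages as entities. It matches on exact token text, so it only catches the spellings listed in the code.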
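For simple term lists like this one, spaCy v3 also ships a built-in rule-based component, the entity ruler, which may save you from writing the matching loop by hand. The sketch below is an alternative to the custom component, not a replacement the original example requires. It uses a blank English pipeline (an assumption, so that it runs without a trained model), and the variable names are illustrative.

```python
import spacy

# A blank English pipeline keeps the sketch independent of any trained model
nlp_rules = spacy.blank("en")

# "entity_ruler" is the built-in factory name in spaCy v3
ruler = nlp_rules.add_pipe("entity_ruler")
ruler.add_patterns(
    [
        {"label": "PROGRAMMING_LANGUAGE", "pattern": "Python"},
        {"label": "PROGRAMMING_LANGUAGE", "pattern": "JavaScript"},
        {"label": "PROGRAMMING_LANGUAGE", "pattern": "C++"},
    ]
)

doc_rules = nlp_rules("Developers use Python, JavaScript, and C++ for various projects.")
found = [(ent.text, ent.label_) for ent in doc_rules.ents]
print(found)
```

With a trained pipeline such as en_core_web_sm, the same component can be added with nlp.add_pipe("entity_ruler", before="ner") so that the rules run ahead of the statistical model.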
Named Entity Recognition has numerous real-world applications:
To enhance NER performance, use a larger model such as en_core_web_md or en_core_web_lg for improved accuracy.

Named Entity Recognition with spaCy opens up a world of possibilities for extracting structured information from unstructured text. By mastering this technique, you'll be well-equipped to tackle a wide range of NLP tasks and build powerful text analysis applications.