Building a custom NER model using spaCy


What is NER?

Named entity recognition (NER) is a natural language processing technique that extracts entities from a given text and classifies them into predefined categories. In short, NER extracts entities such as person names, place names, and company names from text. NER plays an important role in information retrieval.

How does NER work?

After reading a text, humans can recognize common entities such as person names and dates. For a computer to do the same, we must help it learn, and this is where natural language processing (NLP) and machine learning (ML) come in. NLP lets the computer read text and interpret it by learning the patterns and rules of language; machine learning helps the machine learn and improve over time.

We can define NER as a two-step process: 1. identifying named entities; 2. classifying named entities.

Let's take an example.

import spacy

nlp = spacy.load('en_core_web_sm')  # small English pipeline with a pretrained NER
doc = nlp('Mr.Sundar Pichai, the CEO of Google Inc. was born in 1972 in India')
print([(X.text, X.label_) for X in doc.ents])

The output looks like this:

[('Sundar Pichai', 'PERSON'), ('Google Inc.', 'ORG'), ('1972', 'DATE'), ('India', 'GPE')]

An NER algorithm can highlight and extract specific entities from a given text. With displacy we can visualize them:

from spacy import displacy

displacy.render(doc, style='ent', jupyter=True)

The spaCy library allows us to train NER by updating an existing model for a specific context, or to train a new NER model from scratch. In this article, we will explore how to build a custom NER model to extract education details from resume data.

Build custom NER model

Import the necessary libraries

As with every new project, we start by importing the necessary libraries.

from __future__ import unicode_literals, print_function

import random
from pathlib import Path
import spacy
from tqdm import tqdm
from spacy.training.example import Example
import pickle

Training data

First, we need to create entity categories such as degree, school name, location, percentage, and date, and provide the NER model with relevant training data.

spaCy receives training data as tuples containing the text and a dictionary. The dictionary should contain the start and end character indexes of each named entity in the text, along with its category.

TRAIN_DATA = [("Higher School Certificate, Parramatta Marist High School, Westmead (1998)",{"entities":[(0,25,"degree"),(27,56,"school_name"),(58,66,"location"),(68,72,"date")]}),

("Bachelor of Business, University of Western Sydney (2005) ",{"entities":[(0,20,"degree"),(22,43,"school_name"),(44,50,"location"),(52,56,"date")]}),

("2007–2010 BCA (Bachelor of Computer Application) from Khalsa college for women, Amritsar (Affiliated to Guru Nanak Dev University (G.N.D.U) India ",{"entities":[(0,9,"date"),(12,50,"degree"),(54,78,"school_name"),(80,88,"location")]}),

("2010–2013 MCA (Masters in Computer Applications) from Amritsar College of Engineering, Amritsar (Affiliated to Punjab Technical University (P.T.U) India. ",{"entities":[(0,9,"date"),(10,48,"degree"),(54,85,"school_name"),(87,95,"location")]})]
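Character offsets are the most error-prone part of this format, so it is worth verifying that each (start, end) pair actually slices out the intended entity before training. A minimal check, shown here on a hypothetical sentence rather than the training set above:

```python
def check_offsets(text, entities):
    """Print each annotated span so misaligned offsets are easy to spot."""
    for start, end, label in entities:
        print(f"{label}: {text[start:end]!r}")

# hypothetical example, not taken from TRAIN_DATA above
text = "BSc Computer Science, MIT (2001)"
entities = [(0, 20, "degree"), (22, 25, "school_name"), (27, 31, "date")]
check_offsets(text, entities)
```

spaCy will warn about (and may drop) entity spans that do not align with token boundaries, so catching off-by-one offsets early saves debugging time later.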

Create model

The first step in building a custom model is to create a blank "en" model, which will serve as the basis for the NER pipeline.

model = None
output_dir=Path("ner/")
n_iter=100

#load the model

if model is not None:
    nlp = spacy.load(model)  
    print("Loaded model '%s'" % model)
else:
    nlp = spacy.blank('en')  
    print("Created blank 'en' model")

Build pipeline

The next step is to set up a pipeline containing only the NER component. In spaCy v3, add_pipe takes the component name and both creates and adds the component:

if 'ner' not in nlp.pipe_names:
    ner = nlp.add_pipe('ner', last=True)
else:
    ner = nlp.get_pipe('ner')

Training model

Before we start training the model, we must add the categories of named entities (labels) to 'ner' using the ner.add_label() method. We then disable all components except 'ner', because the other components should not be affected during training; the nlp.disable_pipes() method disables them for the duration of training.

To train the "ner" model, the model must loop over the training data for enough iterations. For this we use n_iter, which is set to 100. To ensure the model does not generalize based on the order of the examples, we randomly shuffle the training data with random.shuffle() before each iteration.

We use the tqdm() function to create a progress bar. Information about each training step is stored in an Example object, which holds two documents: one for the pipeline's predictions and one for the reference annotations. The Example.from_dict(doc, annotations) method constructs an Example object from a Doc and the reference annotations provided as a dictionary. The nlp.update() function is then used to train the recognizer.

# add the entity labels from the training data to the NER component
for _, annotations in TRAIN_DATA:
    for ent in annotations.get('entities'):
        ner.add_label(ent[2])

other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
with nlp.disable_pipes(*other_pipes):  # only train NER
    optimizer = nlp.begin_training()
    for itn in range(n_iter):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for text, annotations in tqdm(TRAIN_DATA):
            doc = nlp.make_doc(text)
            example = Example.from_dict(doc, annotations)
            nlp.update(
                [example], 
                drop=0.5,  
                sgd=optimizer,
                losses=losses)
        print(losses)

Save model

After training, the model held in nlp is saved to output_dir and also exported as a .pkl file.

if output_dir is not None:
    output_dir = Path(output_dir)
    if not output_dir.exists():
        output_dir.mkdir()
    nlp.to_disk(output_dir)
    print("Saved model to", output_dir)
pickle.dump(nlp, open( "education nlp.pkl", "wb" ))
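To confirm that a saved model round-trips cleanly, here is a self-contained sketch using a throwaway blank pipeline (not the trained model above) saved to a temporary directory:

```python
import tempfile
from pathlib import Path

import spacy

# build and initialize a minimal pipeline so its weights can be serialized
nlp_demo = spacy.blank("en")
ner = nlp_demo.add_pipe("ner")
ner.add_label("degree")
nlp_demo.initialize()

# save with to_disk and load back with spacy.load, as in the code above
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "ner"
    nlp_demo.to_disk(out)
    nlp_reloaded = spacy.load(out)

print(nlp_reloaded.pipe_names)  # ['ner']
print(nlp_reloaded.get_pipe("ner").labels)
```

For sharing a model, to_disk/spacy.load is generally preferable to pickle, since a pickled pipeline is tied to the exact spaCy version that produced it.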

Test the trained model

doc=nlp("•2015-2017, BE Chemical Engineering, Coimbatore Institute of Technology , India")
for ent in doc.ents:
    print(ent.label_+ '  ------>   ' + ent.text)

The output looks like this:

date  ------>   2015-2017
degree  ------>   BE Chemical Engineering
school_name  ------>   Coimbatore Institute of Technology
location  ------>   India

The above shows how quickly a custom model can be trained with spaCy. Its advantages are:

  1. A spaCy NER model can learn quickly from just a few lines of annotated data; the more training data, the better the model performs.
  2. There are many open-source annotation tools available for creating training data for spaCy NER models.

But there are also some disadvantages:

  1. Ambiguity and abbreviations - one of the main challenges in named entity recognition is language itself: words with multiple meanings are difficult to classify correctly.
  2. Rare or out-of-vocabulary words - uncommon person and place names, for example, can cause problems.

Summary

For extracting entities from resumes, we prefer a custom NER model to a pretrained one. A pretrained NER model only covers common categories such as PERSON, ORG, and GPE. When we build a custom NER model, we can define our own set of categories suited to the context we are dealing with, which enables applications such as the following:

  1. Extracting structure from unstructured text data - pull entities such as education and other professional information out of resumes.
  2. Recommendation systems - NER can support recommendation algorithms by extracting entities from documents and storing them in a relational database; a data science team can then build tools that recommend other documents with similar entities.
  3. Customer support - NER can classify customer complaints and route them to the departments in the organization responsible for handling them.
  4. Efficient search algorithms - NER can run over all documents, extract their entities, and store them separately; a search term then only has to match the smaller entity list of each document, which speeds up search execution.
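The search idea in point 4 can be sketched in a few lines of plain Python - a toy lookup over per-document entity lists (the entity sets below are hard-coded stand-ins for NER output):

```python
# document id -> entities that NER would have extracted (hard-coded here)
doc_entities = {
    "resume_1": {"BE Chemical Engineering", "Coimbatore Institute of Technology", "India"},
    "resume_2": {"Bachelor of Business", "University of Western Sydney"},
}

def search(term):
    """Return ids of documents whose entity set contains the search term."""
    return [doc_id for doc_id, ents in doc_entities.items() if term in ents]

print(search("India"))  # ['resume_1']
```

Matching against the small entity set of each document, rather than scanning full texts, is what makes this kind of lookup fast.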

Author: Abhishek Ravichandran


Posted on Tue, 23 Nov 2021 13:57:50 -0500 by alvinphp