Named Entity Recognition (NER) is a key task in natural language processing (NLP) where the goal is to identify and categorize key information in text documents. When working with document-based NER tasks — especially those involving Optical Character Recognition (OCR) — the challenge grows due to the complexities introduced by OCR errors, long texts, and diverse document formats.
In this blog, we will walk through the document preprocessing pipeline for NER using a BERT-based model. We'll cover key steps like handling OCR text, fuzzy matching, BIO tagging, and tokenizing documents for BERT-based token classification.
For this tutorial, we will use receipt data as our example, featuring a fictional store called "GiggleMart".
For NER tasks, the quality of the data going into the model is just as important as the model itself. If your input text is messy, unstructured, or poorly labeled, even the most powerful models will struggle to perform well. This is especially true when dealing with OCR-generated text, which often contains errors and inconsistencies.
To ensure the best results, preprocessing plays a vital role in:
- Cleaning and normalizing text.
- Handling OCR errors using fuzzy matching.
- Accurately labeling tokens with BIO tagging.
- Breaking long documents into manageable chunks for BERT.
Now, let's dive into each step in detail.
Let's assume we're working with a receipt whose OCR output looks like this:
OCR Text: "Receipt from. GiggleMartt on 09/23/2021. Total: $120.49. Thank you for shopping."
Ground Truth (GT) Labels:
- "GiggleMart" → Merchant Name
- "09/23/2021" → Purchase Date
- "$120.49" → Total Amount
Notice that the OCR mistakenly captured "GiggleMart" as "GiggleMartt". We'll use this text to demonstrate the preprocessing pipeline.
The first step is to clean up the OCR text to ensure it's in the correct format. Since this text doesn't contain non-ASCII characters, we can focus on removing extra spaces and unwanted characters.
- Input:
"Receipt from. GiggleMartt on 09/23/2021. Total: $120.49. Thank you for shopping."
- Cleaned Text:
"Receipt from GiggleMartt on 09/23/2021. Total: $120.49. Thank you for shopping."
There are no major changes needed here, but this step ensures consistency across different document types.
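As a rough sketch, the cleanup above can be done with a couple of regular expressions. Note that the stray-period rule here (dropping a period after a lowercase word that is immediately followed by another word, as in "from. GiggleMartt") is a heuristic assumption for this example, not a general OCR fix:

```python
import re

def clean_ocr_text(text: str) -> str:
    # Replace non-printable / non-ASCII characters left behind by the OCR engine.
    text = re.sub(r"[^\x20-\x7E]", " ", text)
    # Heuristic (assumption): remove a stray period after a lowercase word
    # when another word follows, e.g. "from. GiggleMartt" -> "from GiggleMartt".
    # Periods after digits ("2021.", "$120.49.") are left untouched.
    text = re.sub(r"\b([a-z]+)\.\s+(?=[A-Z0-9$])", r"\1 ", text)
    # Collapse repeated whitespace.
    return re.sub(r"\s+", " ", text).strip()

cleaned = clean_ocr_text(
    "Receipt from. GiggleMartt on 09/23/2021. Total: $120.49. Thank you for shopping."
)
```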
Next, we tokenize the cleaned text using BERT's WordPiece tokenizer. This step splits the text into smaller tokens that BERT can process.
- Tokenized Text:
["Receipt", "from", "Giggle", "##Martt", "on", "09", "/", "23", "/", "2021", ".", "Total", ":", "$", "120", ".", "49", ".", "Thank", "you", "for", "shopping", "."]
Notice how the tokenizer split "GiggleMartt" into two tokens, ["Giggle", "##Martt"]. This must be accounted for when labeling the tokens.
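To illustrate how WordPiece arrives at ["Giggle", "##Martt"], here is a minimal greedy longest-match-first sketch over a toy vocabulary. A real pipeline would use `BertTokenizer` from the HuggingFace `transformers` library; the toy vocabulary below is an assumption for demonstration only:

```python
def wordpiece_tokenize(word: str, vocab: set) -> list:
    """Greedy longest-match-first subword splitting, as in BERT's WordPiece.
    Continuation pieces are prefixed with '##'; unknown words become [UNK]."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub          # mark non-initial subwords
            if sub in vocab:
                piece = sub
                break
            end -= 1                      # shrink until a vocab entry matches
        if piece is None:
            return ["[UNK]"]
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary (assumption): "GiggleMartt" is not a single known word.
toy_vocab = {"Giggle", "##Martt", "##Mart", "##t", "Receipt", "from", "on"}
print(wordpiece_tokenize("GiggleMartt", toy_vocab))  # ['Giggle', '##Martt']
```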
In this step, we use fuzzy matching to map the ground truth labels onto the OCR text. This is necessary because of OCR errors like "GiggleMartt" instead of "GiggleMart."
Fuzzy Matching:
1. Merchant Name: "GiggleMart" → "GiggleMartt":
Using fuzzy matching, we identify that "GiggleMartt" is a close match for "GiggleMart" despite the spelling error, so we map the two together. We don't replace the incorrect OCR spelling with the ground truth spelling; instead, we map the OCR output to the ground truth entity type (such as B-Merchant or I-Merchant). The goal of fuzzy matching is not to correct OCR errors but to ensure that the model recognizes and labels the text accurately based on the ground truth, even when the OCR text is slightly wrong.
2. Purchase Date: "09/23/2021" → "09/23/2021":
Since the OCR text correctly extracted the date, we map it directly to the ground truth.
3. Total Amount: "$120.49" → "$120.49":
Similarly, the total amount was correctly extracted, so we map it directly to the ground truth.
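A minimal sketch of this matching step using Python's standard-library `difflib` (production pipelines often reach for faster libraries such as `rapidfuzz`; the `fuzzy_find` helper and its 0.85 threshold are illustrative assumptions, not part of any library):

```python
from difflib import SequenceMatcher

def fuzzy_find(gt_value: str, ocr_tokens: list, threshold: float = 0.85):
    """Return (index, score) of the OCR token most similar to the ground-truth
    value, or None if no token clears the similarity threshold."""
    best = None
    for i, tok in enumerate(ocr_tokens):
        score = SequenceMatcher(None, gt_value.lower(), tok.lower()).ratio()
        if best is None or score > best[1]:
            best = (i, score)
    return best if best and best[1] >= threshold else None

ocr_tokens = ["Receipt", "from", "GiggleMartt", "on", "09/23/2021."]
match = fuzzy_find("GiggleMart", ocr_tokens)
# "GiggleMartt" matches "GiggleMart" with a similarity above 0.9
# despite the duplicated 't', so it receives the Merchant label.
```

Note that only the label is transferred: the token text stays "GiggleMartt", exactly as the OCR produced it.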
Once the GT labels are mapped onto the OCR text, we assign BIO tags to each token.
Token BIO Tagging:
We label each token based on the ground truth labels. For example:
- Tokens:
["Receipt", "from", "Giggle", "##Martt", "on", "09", "/", "23", "/", "2021", ".", "Total", ":", "$", "120", ".", "49", ".", "Thank", "you", "for", "shopping", "."]
- BIO Labels:
  - "Giggle" → B-Merchant
  - "##Martt" → I-Merchant
  - "09" → B-Date
  - "/" → I-Date
  - "23" → I-Date
  - "/" → I-Date
  - "2021" → I-Date
  - "$" → B-Amount
  - "120" → I-Amount
  - "." → I-Amount
  - "49" → I-Amount
  - All other tokens → O
This ensures that BERT can learn which tokens are part of specific entities like the merchant name, date, or amount.
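One common way to implement this alignment is to tag at the word level first and then expand the tags to subword tokens: the first subword keeps the word's tag, and continuations of a B- tag become the matching I- tag. The `expand_bio_to_subwords` helper below is an illustrative sketch, not a library function:

```python
def expand_bio_to_subwords(word_tags: list, subwords_per_word: list) -> list:
    """Expand word-level BIO tags to subword tokens.

    word_tags:         one BIO tag per word, e.g. ["O", "B-Merchant"]
    subwords_per_word: WordPiece output per word, e.g. [["from"], ["Giggle", "##Martt"]]
    """
    tags = []
    for tag, pieces in zip(word_tags, subwords_per_word):
        for i, _ in enumerate(pieces):
            if i == 0 or tag == "O":
                tags.append(tag)                      # first piece keeps the tag
            else:
                tags.append("I-" + tag.split("-", 1)[1])  # continuation -> I-<type>
    return tags

subwords = [["Receipt"], ["from"], ["Giggle", "##Martt"]]
word_tags = ["O", "O", "B-Merchant"]
print(expand_bio_to_subwords(word_tags, subwords))
# ['O', 'O', 'B-Merchant', 'I-Merchant']
```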
If the receipt text were longer than 512 tokens, we would break it into smaller chunks using a sliding window approach. For our current example, the receipt is short, so there's no need to chunk it. However, for longer receipts or documents, this step ensures that no information is lost.
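A sliding-window sketch of that chunking step (512 mirrors BERT's sequence limit and 128 is a typical overlap; `sliding_window_chunks` is an illustrative helper, not a library function):

```python
def sliding_window_chunks(tokens: list, max_len: int = 512, stride: int = 128) -> list:
    """Split a token list into overlapping chunks of at most max_len tokens.
    Consecutive chunks overlap by `stride` tokens so entities near a chunk
    boundary still appear whole in at least one chunk."""
    chunks, step = [], max_len - stride
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break                      # last chunk already covers the tail
    return chunks

# A 1000-token document becomes three overlapping 512-token windows.
chunks = sliding_window_chunks(list(range(1000)))
print([len(c) for c in chunks])  # [512, 512, 232]
```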
Finally, we prepare the tokenized and labeled text as BERT input. This includes:
- Converting Tokens to Input IDs: Each token is mapped to its corresponding ID in the BERT vocabulary.
- Converting BIO Labels to Label IDs: Each BIO tag is converted into a numeric ID for training.
- Padding and Attention Masks: Sequences are padded to the maximum length, and attention masks are added to distinguish real tokens from padding. Padding is required because BERT and other transformer models have a maximum input sequence length (usually 512 tokens for BERT), meaning the model can only process a sequence of up to 512 tokens at a time. Padding handles input sequences shorter than this maximum by ensuring all inputs are the same size.
Final Input Format:
- Input IDs: [101, 2345, 5678, ...] (example BERT input token IDs)
- Label IDs: [0, 0, 1, 2, 0, 3, ...] (numeric IDs, e.g. O = 0, B-Merchant = 1, I-Merchant = 2, B-Date = 3)
- Attention Mask: [1, 1, 1, 1, 1, 1, 1, ...]
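Putting these three conversions together with a toy vocabulary (a real pipeline would use the tokenizer's `convert_tokens_to_ids`; the vocabulary, label map, and `encode` helper here are illustrative assumptions — padded label positions get -100, the index PyTorch's cross-entropy loss ignores by default):

```python
LABEL2ID = {"O": 0, "B-Merchant": 1, "I-Merchant": 2,
            "B-Date": 3, "I-Date": 4, "B-Amount": 5, "I-Amount": 6}

def encode(tokens, tags, vocab, max_len=8, pad_id=0):
    """Convert tokens and BIO tags to fixed-length ID sequences with a mask."""
    input_ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]
    label_ids = [LABEL2ID[t] for t in tags]
    attention_mask = [1] * len(input_ids)          # 1 = real token
    pad = max_len - len(input_ids)
    input_ids += [pad_id] * pad                    # pad token IDs
    label_ids += [-100] * pad                      # ignored by the loss
    attention_mask += [0] * pad                    # 0 = padding
    return input_ids, label_ids, attention_mask

toy_vocab = {"[UNK]": 100, "Giggle": 7, "##Martt": 8}
ids, labels, mask = encode(["Giggle", "##Martt"],
                           ["B-Merchant", "I-Merchant"], toy_vocab, max_len=4)
print(ids, labels, mask)
# [7, 8, 0, 0] [1, 2, -100, -100] [1, 1, 0, 0]
```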
With this, the receipt data is ready to be passed into a BERT-based NER model for training or inference.
By following this pipeline, we ensure that the receipt data is cleaned, tokenized, and labeled accurately. This is particularly important when dealing with noisy OCR outputs that may contain errors, as we saw with "GiggleMartt" in this example. Let's recap the steps:
- Clean and normalize OCR text to ensure it's in a usable format.
- Tokenize the text using BERT's WordPiece tokenizer to handle subwords.
- Use fuzzy matching to map ground truth labels onto noisy OCR text.
- Apply BIO tagging to label tokens for training a BERT model.
- Handle long documents with sliding windows if the document exceeds BERT's token limit.
- Prepare the final input for BERT, including token IDs, label IDs, and attention masks.