Recognition of Named Entities on Invoices for IxorDocs

Given the small dataset available, creating a robust model is a challenging task. For most NLP projects a connected/recurrent model is the logical choice, but such a model would be impractical here. An invoice does not consist of coherent lines of text; it mostly contains tables and small text blocks, and the location and structure of these blocks vary across the different invoice templates in use. We can therefore expect word location and size to be important features. This is why we decided to use decision tree classifiers with added context features instead of a recurrent model. Provided with useful features, such a model will intuitively generate rules like “the total amount can mostly be found at the bottom of the page”.

Transforming an invoice to a set of N-grams.
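As an illustration of this step, here is a minimal sketch of turning the words of one text block into word N-grams. The function name `word_ngrams`, the `max_n` limit and the sample tokens are assumptions made for the example; the article does not state how the N-grams are actually generated or which lengths are used.

```python
from typing import List, Tuple


def word_ngrams(tokens: List[str], max_n: int = 3) -> List[Tuple[str, ...]]:
    """Return all contiguous word N-grams of length 1..max_n (hypothetical helper)."""
    ngrams: List[Tuple[str, ...]] = []
    for n in range(1, max_n + 1):
        # Slide a window of n words over the token list.
        for start in range(len(tokens) - n + 1):
            ngrams.append(tuple(tokens[start:start + n]))
    return ngrams


# Made-up tokens from a single invoice text block.
line = ["Total", "amount", "121.00"]
for gram in word_ngrams(line, max_n=2):
    print(" ".join(gram))
```

Each N-gram produced this way would then be treated as one candidate to classify.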

For every N-gram we calculate a feature vector. Extracted features include word length, font size and position. We also engineer features that capture capitalisation, the number of digits and the amount of punctuation. To give the model a limited notion of context, we add the previous and following words as features by using the hashing trick. This results in a feature vector of size 164 for every N-gram, which we use to classify the N-grams into the entities of interest.
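The sketch below shows how such a feature vector could be assembled, including the hashing trick for the neighbouring words. It is not the IxorThink implementation: the function names, the choice of eight hand-crafted features, the bucket count of 16 and the use of MD5 as the hash are all assumptions, so the resulting vector has 40 entries rather than the 164 mentioned above.

```python
import hashlib
import string
from typing import List, Optional

N_BUCKETS = 16  # assumed bucket count, not taken from the article


def hashed_one_hot(word: Optional[str], n_buckets: int = N_BUCKETS) -> List[float]:
    """Hashing trick: map a word to one of n_buckets slots without a vocabulary."""
    vec = [0.0] * n_buckets
    if word:
        # A stable hash (unlike the built-in hash()) gives the same bucket on every run.
        digest = hashlib.md5(word.lower().encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % n_buckets] = 1.0
    return vec


def ngram_features(text: str, font_size: float, x: float, y: float,
                   prev_word: Optional[str], next_word: Optional[str]) -> List[float]:
    """Build a feature vector for one N-gram (hypothetical feature layout)."""
    own = [
        float(len(text)),                                   # word length
        float(font_size),                                   # font size
        x,                                                  # horizontal position
        y,                                                  # vertical position
        float(text.isupper()),                              # all capitals
        float(text[:1].isupper()),                          # starts with a capital
        float(sum(c.isdigit() for c in text)),              # number of digits
        float(sum(c in string.punctuation for c in text)),  # punctuation count
    ]
    # Context: previous and following words, each hashed into a fixed-size block.
    return own + hashed_one_hot(prev_word) + hashed_one_hot(next_word)


features = ngram_features("121.00", 10.0, 0.8, 0.9, prev_word="Total", next_word=None)
print(len(features))
```

Because the context words are hashed into a fixed number of buckets, the vector length stays constant no matter how many distinct words appear on the invoices, at the cost of occasional collisions between unrelated words.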

When looking at the results per entity, XGBoost provides the biggest F1-score increase for the VAT amount and the total, two fields for which the Random Forest model provides only mediocre results.
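For reference, a per-entity F1-score can be computed from true and predicted labels as in the sketch below. The function name and the labels are made up for illustration and are not results from the article; F1 is computed as 2·TP / (2·TP + FP + FN) for each entity.

```python
from typing import Dict, List


def f1_per_entity(y_true: List[str], y_pred: List[str]) -> Dict[str, float]:
    """Compute the F1-score separately for every label (hypothetical helper)."""
    scores: Dict[str, float] = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)  # true positives
        fp = sum(t != label and p == label for t, p in pairs)  # false positives
        fn = sum(t == label and p != label for t, p in pairs)  # false negatives
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


# Made-up labels, only to show the call.
y_true = ["total", "vat", "other", "total"]
y_pred = ["total", "other", "other", "vat"]
for entity, score in f1_per_entity(y_true, y_pred).items():
    print(entity, round(score, 2))
```

Scoring each entity separately makes it visible when a model does well overall but struggles on specific fields.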

With only a small dataset available at the moment, the IxorThink team was able to create and train a named entity recognition (NER) model that correctly analyzes new invoices. These can be invoices that follow a known template, or invoices from new customers based on an unseen template.
At this moment we are able to correctly detect the most important fields in invoices. Next, we will roll out this proof of concept further, and we aim to expand the model to detect all invoice lines.
The ultimate goal is to extract useful data from different types of documents, for example credit notes and payslips.

At IxorThink we are constantly trying to improve our methods to create state-of-the-art solutions. As a software company we can provide stable and fully developed solutions. Feel free to contact us for more information.
