AI in the real world - Form Understanding

Vishnu Nandakumar
3 min read · Nov 12, 2021

Machine learning is about making things easier: rather than hand-tuning rules for each and every parameter, the model learns to perform a task efficiently on its own. And what could be better than seeing it in real action? That is why I have planned a series of mini-articles on areas where AI could, and already has, improved performance tremendously. The first of the many to come is from the computer vision domain. CV has changed how a lot of sectors work, from something as simple as image processing to something as complex as an autonomous vehicle, and today we will look at how it can optimize form understanding by calling on its sister field, NLP, for help.

Motivation

Not long ago, Hugging Face, one of the pioneers in the AI world, released a transformers version of the LayoutLM model open-sourced by Microsoft. The model pre-trains text, layout and image together in a multi-modal framework; at a high level, it uses pre-training objectives for text-image alignment and text-image matching. Whereas the previous version only brought the visual signal in during fine-tuning, the latest v2 integrates the visual embeddings into the multi-modal framework already at the pre-training stage. The model also uses a spatial-aware self-attention mechanism, which helps optimize relation extraction between pieces of text. Now that we have seen at a high level how the model works, we can jump to how it is implemented. You can learn more about the model here.
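To make the multi-modal idea concrete, here is a minimal sketch (not code from the wrapper library discussed below) of how the three modalities are fed to LayoutLMv2 through the transformers API; the checkpoint is the public base model and the image path is a hypothetical example.

```python
from PIL import Image
from transformers import LayoutLMv2Processor, LayoutLMv2Model

# The processor runs OCR (via pytesseract), tokenizes the recognized words and
# normalizes their bounding boxes, so a single call yields all three modalities.
processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")
model = LayoutLMv2Model.from_pretrained("microsoft/layoutlmv2-base-uncased")

image = Image.open("sample_form.png").convert("RGB")  # hypothetical input document
encoding = processor(image, return_tensors="pt")

# input_ids (text), bbox (layout) and image (visual patches) go in together,
# which is exactly the multi-modal framework described above.
outputs = model(**encoding)
print(outputs.last_hidden_state.shape)  # text tokens plus the 7x7 visual tokens
```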

Implementation

If you want to get your hands dirty and configure everything yourself, you can visit this link, where Hugging Face have put the LayoutLMv2 model into practice. If you would rather use a simple wrapper around that implementation, jump straight to this small library I have created, aimed predominantly at non-developers who want to get started. Just follow the instructions in the readme and you are good to go.

Install the required libraries
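The exact commands are in the library's readme; as an assumption, a typical LayoutLMv2 setup needs PyTorch, transformers, pytesseract (the OCR engine used by the processor) and detectron2 (the visual backbone). On a Debian-style machine that could look like:

```bash
pip install torch torchvision
pip install transformers pytesseract Pillow
pip install "git+https://github.com/facebookresearch/detectron2.git"
sudo apt-get install tesseract-ocr   # OCR binary that pytesseract calls
```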

Implement the code below
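The wrapper hides most of the details, but a bare-bones inference sketch with transformers could look like the following. The FUNSD fine-tuned checkpoint name and the receipt image path are assumptions for illustration, not necessarily what the library uses internally.

```python
import torch
from PIL import Image
from transformers import LayoutLMv2Processor, LayoutLMv2ForTokenClassification

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")
model = LayoutLMv2ForTokenClassification.from_pretrained("nielsr/layoutlmv2-finetuned-funsd")

image = Image.open("atm_receipt.png").convert("RGB")  # hypothetical sample document
encoding = processor(image, return_tensors="pt", truncation=True)

with torch.no_grad():
    outputs = model(**encoding)

# Map each text token to its predicted label (e.g. B-QUESTION, B-ANSWER, B-HEADER, O)
predictions = outputs.logits.argmax(-1).squeeze().tolist()
labels = [model.config.id2label[p] for p in predictions]
tokens = processor.tokenizer.convert_ids_to_tokens(encoding["input_ids"].squeeze().tolist())
boxes = encoding["bbox"].squeeze().tolist()

for token, label, box in zip(tokens, labels, boxes):
    if token not in processor.tokenizer.all_special_tokens and label != "O":
        print(f"{label:12s} {token:15s} {box}")
```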

View the results
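For the visualization, one simple option (an assumption on my part, not necessarily how the library renders it) is to scale the predicted boxes back from LayoutLMv2's 0-1000 coordinate space and draw them onto the page with PIL, reusing `image`, `labels` and `boxes` from the previous snippet:

```python
from PIL import ImageDraw

def unnormalize(box, width, height):
    # LayoutLMv2 boxes live in a 0-1000 space; convert back to pixel coordinates.
    return [width * box[0] / 1000, height * box[1] / 1000,
            width * box[2] / 1000, height * box[3] / 1000]

draw = ImageDraw.Draw(image)
width, height = image.size
colors = {"QUESTION": "blue", "ANSWER": "green", "HEADER": "orange"}

for label, box in zip(labels, boxes):
    if label == "O" or box == [0, 0, 0, 0]:   # skip unlabeled and special tokens
        continue
    entity = label.split("-")[-1]             # strip the B-/I- prefix
    pixel_box = unnormalize(box, width, height)
    draw.rectangle(pixel_box, outline=colors.get(entity, "red"), width=2)
    draw.text((pixel_box[0], pixel_box[1] - 10), entity, fill=colors.get(entity, "red"))

image.save("annotated_form.png")
```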

Let’s try different scenarios

  • Invoice:
  • ATM receipt:
  • Medical prescription:

Finally, you can download the results as key-value pairs; for example, below is a sketch of how the result for the ATM receipt can be exported.
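As a rough illustration of that export (the library's actual schema may differ), the labeled tokens from the inference sketch can be merged back into words, grouped by entity and dumped to JSON:

```python
import json

# `tokens` and `labels` come from the inference snippet above.
special = {"[CLS]", "[SEP]", "[PAD]", "[UNK]"}
spans, current_entity, current_words = [], None, []

for token, label in zip(tokens, labels):
    if token in special:
        continue
    if token.startswith("##"):
        if current_words:
            current_words[-1] += token[2:]    # glue WordPiece sub-token onto the previous word
        continue
    entity = None if label == "O" else label.split("-")[-1]
    if entity != current_entity:
        if current_entity and current_words:
            spans.append({"entity": current_entity, "text": " ".join(current_words)})
        current_entity, current_words = entity, []
    if entity:
        current_words.append(token)

if current_entity and current_words:
    spans.append({"entity": current_entity, "text": " ".join(current_words)})

# Produces a list like [{"entity": "QUESTION", "text": "..."}, {"entity": "ANSWER", "text": "..."}]
with open("atm_receipt_result.json", "w") as f:
    json.dump(spans, f, indent=2)
```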

Double bingo!!! As we can see, we got the visualizations as well as the results in a consumable form. That's all for this post, but don't worry, I'll make sure I come up with more engaging and interesting things as we move ahead. Till then, take care, fellas!!

Wait, kindly do follow me on Medium. You can also head over to my Git repo for similar concepts, and don't ever hesitate to start a conversation on my socials about whatever you want to talk about. Thanks for the support you have shown me, as that is what drives me to do this unconditionally.

