Skew Correction in Documents Using Deep Learning

Vishnu Nandakumar
Feb 7, 2022

Most of us have stumbled upon documents that are misaligned, skewed, or even warped. Many image scanners ask us to rotate the image ourselves or to pick four corner points for a perspective correction. Credit goes to my Twitter buddy Kavin Sharath, who raised this problem of skew in documents and pushed me to look for a solution using machine learning. We also tried different approaches, such as the Hough line transform, binarizing to isolate text regions and their contours, contour detection, and so on, but most of them did not generalize well. Many of these image-processing techniques do work well in specific settings, and one of the best for estimating skew is the projection profile method.

Projection Profile

Projection profile (PP) is a technique in which we rotate a given image through a range of candidate angles and, at each angle, compute a score from the histogram of pixel counts per row, such as the difference between its peaks. When the text lines are horizontal, the histogram shows sharp peaks (text rows) and valleys (gaps between rows), so the score is highest. The angle at which we get the maximum score is the angle of rotation we use for skew correction.
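
A minimal sketch of the idea with NumPy. Two things here are assumptions rather than part of the method as described above: instead of re-rotating the whole image for every candidate angle, it projects the coordinates of the ink pixels along each candidate direction to build the row histogram, and it scores each histogram by the sum of squared differences between neighbouring rows, which is one common choice. The function name and the synthetic test page are made up for illustration.

```python
import numpy as np


def estimate_skew(binary, angles):
    """Return the candidate angle (degrees) whose projection profile is sharpest.

    binary: 2-D array, non-zero where there is ink (text), zero for background.
    angles: iterable of candidate angles in degrees.
    Returns None when the image contains no ink at all.
    """
    ys, xs = np.nonzero(binary)              # coordinates of the ink pixels
    if ys.size == 0:
        return None
    best_angle, best_score = None, -1.0
    for angle in angles:
        t = np.deg2rad(angle)
        # Row each ink pixel would fall into if text lines tilted by
        # `angle` were made horizontal.
        rows = ys * np.cos(t) - xs * np.sin(t)
        bins = np.round(rows - rows.min()).astype(int)
        profile = np.bincount(bins).astype(float)   # ink count per row
        # Aligned text lines give sharp peaks and valleys, i.e. large
        # jumps between neighbouring rows.
        score = np.sum(np.diff(profile) ** 2)
        if score > best_score:
            best_angle, best_score = float(angle), score
    return best_angle


# Synthetic page: 4-pixel-thick "text lines" every 20 pixels, tilted by 7 degrees.
true_angle = np.deg2rad(7.0)
yy, xx = np.mgrid[0:240, 0:320]
d = yy * np.cos(true_angle) - xx * np.sin(true_angle)
page = (np.floor(d) % 20 < 4).astype(np.uint8)

estimated = estimate_skew(page, np.arange(-15.0, 15.5, 0.5))
print(estimated)
```

On this synthetic page the estimate should land close to 7 degrees. In this sketch a positive angle means the lines slope downwards to the right in array coordinates; that sign convention is also an assumption, so check it against your rotation routine before applying a correction.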

Pros:

  • No model training/building is required
  • Works in most real-world scenarios where the documents are only mildly skewed
  • Can be easily integrated into any application

Cons:

  • If we widen the range of angles, processing takes much longer to complete
  • There is no single range of angles that fits every case, as each image might require a different range

Machine learning:

I have tried to build a CNN-based regression model that predicts the angle by which the skew needs to be corrected.

Data Preparation:

For a problem like this, data may already be available online, but we can also build our own dataset for the task. First, download a set of document images using bing-image-downloader, then iterate over your images using the snippet below.
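
A minimal sketch of such a loop, assuming Pillow and a folder of already downloaded images. The function name, folder arguments, number of copies per image, and file-name pattern are hypothetical, and this is not necessarily the exact code from the notebook.

```python
import random
from pathlib import Path

from PIL import Image


def make_skewed_copies(src_dir, dst_dir, copies_per_image=5, max_angle=45.0, seed=0):
    """Save randomly rotated grayscale copies of every image in src_dir.

    File names follow "<stem>_<iteration>_<angle>.png" so the label can be
    read back from the name. Returns the list of (path, angle) pairs written.
    """
    rng = random.Random(seed)
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(Path(src_dir).iterdir()):
        if src.suffix.lower() not in {".png", ".jpg", ".jpeg"}:
            continue
        with Image.open(src) as original:
            image = original.convert("L")          # grayscale
        for i in range(copies_per_image):
            angle = round(rng.uniform(-max_angle, max_angle), 2)
            # Pillow rotates counter-clockwise for positive angles;
            # fillcolor=255 paints the exposed corners white.
            rotated = image.rotate(angle, resample=Image.BICUBIC,
                                   expand=True, fillcolor=255)
            out = dst / f"{src.stem}_{i}_{angle}.png"
            rotated.save(out)
            written.append((out, angle))
    return written
```

Note that the stored value is the rotation that was applied, so the correction for a given sample is the opposite rotation. Whether to let the canvas grow (`expand=True`) or keep the original size is a design choice; the sketch grows the canvas so no page content is cut off.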

In essence, the snippet does the following:

  • We iterate through the set of images, rotating each one by randomly chosen angles.
  • The angle range is [-45, 45] degrees.
  • The areas exposed by rotation are filled with white so the images stay consistent with real-world scans.
  • Each grayscale image is saved under a name that includes the iteration and the angle value.

Model building:

I am using a simple CNN to train the regressor. Before constructing the model, we need to preprocess the data so that the CNN can learn from it efficiently. Below are the steps I implemented for preprocessing the images.
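
A minimal sketch of this kind of preprocessing, assuming Pillow and NumPy. The 224x224 input size, dividing pixels by 255, and dividing angles by 45 are assumptions chosen for illustration; the notebook may use different values.

```python
import numpy as np
from PIL import Image

MAX_ANGLE = 45.0          # matches the [-45, 45] range used for the labels
TARGET_SIZE = (224, 224)  # (width, height); an assumed input size


def preprocess_image(image, size=TARGET_SIZE):
    """Grayscale, resize, and scale pixel values to [0, 1]; shape (H, W, 1)."""
    gray = image.convert("L").resize(size)
    pixels = np.asarray(gray, dtype=np.float32) / 255.0
    return pixels[..., np.newaxis]


def scale_angle(angle_deg):
    """Map an angle in [-45, 45] degrees to [-1, 1]."""
    return angle_deg / MAX_ANGLE


def unscale_angle(value):
    """Map a scaled value in [-1, 1] back to degrees."""
    return value * MAX_ANGLE
```

Keeping targets in a small range such as [-1, 1] tends to make regression training more stable; `unscale_angle` converts a prediction back to degrees.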

I haven’t used other augmentation steps like crops or rotation, because they would defeat the purpose of this solution, so I have only scaled the angles and images as needed. Feel free to try other augmentation methods that do not disturb the orientation and spatial features.

Next is defining the model. I built a relatively shallow model on the intuition that the output is a single continuous value and we do not need to learn complicated features, such as whether the characters are slanted, whether the page has ruled lines, or whether the text is handwritten. Below is the model summary.
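
A sketch of what a shallow CNN regressor of this kind could look like in Keras. The layer counts, filter sizes, input shape, optimizer, and loss here are assumptions for illustration and are not the architecture from the notebook.

```python
from tensorflow.keras import layers, models


def build_skew_regressor(input_shape=(224, 224, 1)):
    """A small CNN that maps a grayscale page to one (scaled) angle."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(16, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        # Collapses each feature map to a single value, so the dense
        # layers that follow stay small.
        layers.GlobalMaxPooling2D(),
        layers.Dense(64, activation="relu"),
        # Linear output: the target angle can be negative.
        layers.Dense(1, activation="linear"),
    ])
    model.compile(optimizer="adam", loss="mse", metrics=["mae"])
    return model


model = build_skew_regressor()
model.summary()
```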

Make sure that your last layer doesn’t use ReLU as its activation function, since the output can also be negative. The advantage of using global max pooling is that the model trains faster and the weights are smaller in size. This model is 9 MB in size, and we can likely do much better than what I got here. That’s it: go ahead and define the other training parameters such as batch_size, epochs, TensorBoard logging, early stopping, and so on, and train the model. I have attached the results of training my model. The parameters I chose are given in the notebook attached at the end.

  • Also, you can find some of the results as follows.

As seen above, the results are quite satisfying. From here you can try training with more varied data, such as handwritten documents. You can also try a multi-task model in which the first task classifies whether to rotate clockwise or anticlockwise and the second predicts by how much. Another con of this model is that it might not perform well when the given document has no skew. Again, you can try a multi-task model with a third skew/no_skew task, or simply a single-task learner trained to output 0 for inputs with no skew. You can also have a look at the notebook provided at the end.

  • Experiment with handwritten text in a non-English language (I believe it’s Arabic; sorry if I am wrong).
  • On an image with no skew

Just for the record, I am not bluffing about the above scenario: the model gave 0 as the output, just as we wanted.

As much as the model seems to perform well in different scenarios, these could well be exceptions. To be on the safer side, it is better to build a model with varied and adequate data. I will try to validate the model with more data and provide the metrics on my repo.
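
One simple metric for this kind of validation is the mean absolute error in degrees. A minimal sketch with NumPy, assuming labels and predictions are stored as the angle divided by 45; that scaling, the function name, and the sample numbers are all assumptions.

```python
import numpy as np


def mean_absolute_error_degrees(true_scaled, pred_scaled, max_angle=45.0):
    """Mean absolute error in degrees for angles stored as angle / max_angle."""
    true_deg = np.asarray(true_scaled, dtype=float) * max_angle
    pred_deg = np.asarray(pred_scaled, dtype=float) * max_angle
    return float(np.mean(np.abs(true_deg - pred_deg)))


# Made-up values purely to show the call; these are not results from the model.
labels = [0.0, 0.5, -1.0]
predictions = [0.1, 0.4, -0.8]
print(mean_absolute_error_degrees(labels, predictions))
```

Reporting the error in degrees rather than in scaled units makes it easier to judge whether the remaining skew would still be visible on a page.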

Okay, folks, I hope this is useful for you all. I will continue to contribute as much as I can. Until then, bye and take care. Feel free to have a look at the following for more information.

Thanks to all of you who have given your valuable time to read through this. Take care, fellas. Bye until next time.

Vishnu Nandakumar

Machine Learning Engineer, Cloud Computing (AWS), Arsenal Fan. Have a look at my page: https://bit.ly/m/vishnunandakumar