Couple of weeks back, I decided to work on OCR tool. Some years back I had tried my hands on
OpenCV, and I knew that it would be a good place to start, since this library comes with lot of algorithms for real-time image processing.
When searched on
google for basic OCR to start with I found
http://blog.damiles.com/2008/11/basic-ocr-in-opencv/. The same link also has very nice explanation of how it works. This basic OCR was written to recognize hand written numbers. This program was also written with assumption that each image to be recognized will have only one character.
I took a fork out of this project and modified it. I added samples for upper case character(Update: and for lower case characters) and did necessary changes so that the program would identify the upper-case characters(Update: and lower case letters), with same assumption that image to be recognized will have only one character.
Then I wrote a simple algorithm to recognize multiple characters in an image containing paragraph of text. There are some assumptions built into this algorithm, but not very big ones if considered for recognition of regular text on documents.
Assumptions(or limitations) built into the algorithm are
- Between each row of text there would be gap of at-least one row which would be at-least one pixel thick. Shown as red(well almost :) ) line in the image below.
- Between each character on same row there would be gap of at-least one column which would be at-least one pixel thick as well. Shown as green line in the image.
For any given character there wouldn't be gap between different parts of the character, although this limitation can be overcome by small change in algorithm, since only uppercase letters are used in current example, this assumption will hold good and speeds up both learning and recognition phases, this also means that if same algorithm is used for recognizing lower case characters then it will fail for i and j. since both have gap between different parts of the character. Update: This limitation has been removed by improving the algorithm.
I will not explain everything in detail, since it can be found at damile's blog, I have explained below the algorithm I have used to find characters in an image of paragraph of characters. Other changes that I have made to actual source should be fairly simple to understand.
The algorithm I have used is very simple,
Algorithm
for each row in the image from row 0 to height of image
if sum of current row > sum of(width of image*255)
save current row number as maxY
if minY has not been found already then
save current row number as minY
else if minY has been found then /*some data has been found along row already, after finding some data as soon as some blank row is reached, it is considered as end of end of character */
for each column from column 0 to width of image
if sum of current column from minY to max Y > sum of((maxY-minY)*255)
save current column number as maxX
if minX has not been found already then
save current column number as minX
else if minX has been found then
resize minY and maxY if necessary /*necessary if one of the characters in current row is bigger in-terms of height than others like A vs Q */
match image from (minX, minY) to (maxX,maxY) to a character.
Code
The algorithm may look very big, but its very simple. :) Here is the code for this algorithm
/// <summary>
/// Given image with paragraph of characters,
/// finds bounding box, resizes it to new_width and new_height, and if printResult is 1, prints result for each character.
/// </summary>
/// <params name="imsSrc">
/// Source image which has to be processed.
/// </params>
/// <params name="new_width">
/// Width of the image to be used for processing.
/// </params>
/// <params name="new_height">
/// Height of the image to be used for processing.
/// </params>
/// <params name="printResult">
/// Indicates whether result has be printed, if its 1, result are printed after running k-neares algorithm.
/// </params>
/// <params name="resultSize">
/// Number of resulting characters identified, size of the array to which result will be pointing to.
/// </params>
/// <returns> Pointer to array of result. </returns>
float* OCR::preprocessPara(IplImage* imgSrc, int new_width, int new_height, int printResult, int* resultSize)
{
int minY, maxY;
int i;
int minYFound=0;
float result;
vector<float> resultVector;
float* resultPointer;
CvMat data;
CvScalar maxVal=cvRealScalar(imgSrc->width * 255);
CvScalar val=cvRealScalar(0);
//For each column sum, if sum < width*255 then we find the min
//then continue till end to search the max, if sum< width*255 then is new max.
for (i=0; i< imgSrc->height; i++)
{
cvGetRow(imgSrc, &data, i);
val= cvSum(&data);
if(val.val[0] < maxVal.val[0])
{ // some data is found!
maxY = i;
if(!minYFound)
{
minY = i;
minYFound = 1;
}
}
else if(minYFound == 1)
{
//some data was found previously, but current row 'i' doesn't have any data.
//So process from row 'minY' till row maxY
int j;
int minX, maxX;
int minXFound=0;
//CvMat data;
CvScalar maxValx=cvRealScalar((maxY - minY) * 255);
CvScalar valx=cvRealScalar(0);
//For each col sum, if sum < width*255 then we find the min
//then continue to end to search the max, if sum< width*255 then is new max
for (j=0; j< imgSrc->width - 1; j++)
{
valx=cvRealScalar(0);
//instead of taking sum of entire column get sum of sub part of it.
cvGetSubRect(imgSrc,&data, cvRect(j,minY,1,maxY-minY));
//cvGetCol(imgSrc, &data, i);
valx= cvSum(&data);
if(valx.val[0] < maxValx.val[0])
{ //Some data found
maxX= j;
if(!minXFound){
minX= j;
minXFound= 1;
}
}
else if(minXFound == 1)
{
int maxYp;
CvScalar maxValyS = cvRealScalar((maxX-minX)*255);
CvScalar valyS = cvRealScalar(0);
// from minx to maxx and miny to maxy
for(int k=maxY-1; k >= minY; k--)
{
cvGetSubRect(imgSrc, &data, cvRect(minX, k, maxX-minX,1));
valyS = cvSum(&data);
if(valyS.val[0] < maxValyS.val[0])
{
maxYp = k+1;
break;
}
}
//Some data was found previosly but current column 'j' doesn't have any data.
// so from minY to maxY and minX to maxX is the bounding box of character!
result = process(imgSrc, new_width, new_height, printResult, cvRect(minX, minY, maxX-minX, maxYp-minY));
resultVector.push_back(result); // after finding each result push the result to the vector.
minXFound = 0;
}
}
minYFound = 0;
}
}
//If exit from loop was because max height was reached, but minFound has been set, then process from minFound till height.
//This will not happen in the ideal examples I take :)
*resultSize = resultVector.size();
resultPointer = new float[*resultSize];
int k;
for(k = 0; k < *resultSize; k++)
{
*(resultPointer+k) = resultVector[k];
}
return resultPointer;
}
k-nearest neighbors algorithm has been used to match test image vs training images. Its simple and fast.
Source code can be found at
https://github.com/vikram-ma/OCR.
Github is really nice, If you have not tried it already you should!, so is
git
Sample of images for each character for training has been put inside separate directory and has a data.txt file having corresponding character which image represent.
Build and run
Instructions for build, go inside source-code directory, type-in following commands on console,
$mkdir build
$cd build
$cmake ../OCR
$make
running
./OCR <path to directory containing sample images> <path to sample test image>
specifically,
$./OCR ../OCR ../sampleUppercase.pbm
Feel free to leave your feedback and questions, I will try to answer them as quickly as possible.