The purpose of the evaluation is to assess how close participants’ results are to the original document content. We will compute character and word accuracy, along with their means and confidence intervals, to evaluate the participants’ results.
Technical specifications
The dataset images will be in JPG format and you must return your results in TXT format with UTF-8 encoding. There must be one result file per image, with the same base name as the image. E.g., for an image “00001.jpg” the result file must be “00001.txt”.
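As an illustration, the sketch below shows one way to derive the result file name from an image name and write it with UTF-8 encoding. The helper names are hypothetical and not part of the challenge tooling.

    # Minimal sketch (hypothetical helpers, not part of the official tooling):
    # derive the expected result file name and write it as UTF-8.
    from pathlib import Path

    def result_path_for(image_path: str) -> Path:
        """Return the expected result file path, e.g. "00001.jpg" -> "00001.txt"."""
        return Path(image_path).with_suffix(".txt")

    def write_result(image_path: str, recognized_text: str) -> None:
        """Write the recognized text next to the image, UTF-8 encoded."""
        result_path_for(image_path).write_text(recognized_text, encoding="utf-8")

    # Example: write_result("00001.jpg", "Recognized page content ...")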
Evaluation protocol
The results will be ranked according to the mean character and word accuracy, as well as their confidence intervals. Stopwords will be excluded when computing the word accuracy. As a preprocessing step, we will normalize the characters contained in the submitted results.
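To make the ranking criterion concrete, here is a minimal sketch of how a mean and a confidence interval could be computed from per-image accuracy scores. The 95% normal-approximation interval is an assumption for illustration only; it is not necessarily the estimator used in the official ranking.

    # Minimal sketch: mean accuracy and an assumed 95% normal-approximation
    # confidence interval over per-image accuracy scores.
    import math
    from typing import Sequence, Tuple

    def mean_and_ci(accuracies: Sequence[float], z: float = 1.96) -> Tuple[float, float, float]:
        """Return (mean, lower bound, upper bound) for the given accuracy scores."""
        n = len(accuracies)
        mean = sum(accuracies) / n
        variance = sum((a - mean) ** 2 for a in accuracies) / (n - 1)
        half_width = z * math.sqrt(variance / n)
        return mean, mean - half_width, mean + half_width

    # Example: mean_and_ci([0.91, 0.87, 0.95, 0.89])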
The normalization tool is available at:
https://github.com/SmartDOC-MOC/moc_normalization
The evaluation tools are the ones from UNLV/ISRI, with UTF-8 support. The acci and wordacci executables are available at:
https://github.com/SmartDOC-MOC/ocr-evaluation-tools