This dataset is publicly available, free of charge, for research purposes only. Redistribution of the dataset is not allowed. To obtain a copy of the dataset, register through this form or contact the authors directly. Any use of this dataset must cite the following reference:
Jean-Christophe Burie, Joseph Chazalon, Mickaël Coustaty, Sébastien Eskenazi, Muhammad Muzzamil Luqman, Maroua Mehri, Nibal Nayef, Jean-Marc Ogier, Sophea Prum and Marçal Rusiñol: “ICDAR2015 Competition on Smartphone Document Capture and OCR (SmartDoc)”, in 13th International Conference on Document Analysis and Recognition (ICDAR), 2015.
The dataset is available for download on Zenodo at the following address: https://zenodo.org/record/2572929
The dataset is composed of many documents (single column, printed, English). The text in the documents has multiple scales, multiple fonts, multiple font-faces and multiple colors.
The documents described above are used as the basis for capturing the dataset: a set of different captures is taken for each document. The different captures exhibit different distortions that hinder the OCR process. For each document, a few captures are taken with values randomly selected from the ranges of the different distortions (see below).
Because this process results in a huge number of captured images, the capture process is automated using a robotic arm that carries a mobile phone and precisely controls its movement. The robotic arm and the mobile phone are programmed to capture images for the specified distortion values with minimal human supervision.
The specifications of the distortions used are detailed below.
a. Variable capture parameters
The following table shows the variations of the capture conditions and the resulting number of images.
- Mobile phone used for the capture (2 phones)
- Lighting condition (3 conditions)
- Perspective distortions (longitudinal and lateral incidence angles and distance to the document)
- Out of focus blur
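The random selection of capture settings described above can be illustrated with a minimal sketch. Everything concrete in it is an assumption made for illustration: the phone and lighting labels, the numeric ranges, the units, and the use of uniform sampling are hypothetical placeholders, not the values or the procedure actually used to build the dataset.

```python
import random
from dataclasses import dataclass

# Illustrative placeholders only: the real phones, lighting conditions and
# distortion ranges are not specified in this description.
PHONES = ("phone_A", "phone_B")                  # 2 phones (hypothetical labels)
LIGHTING = ("light_1", "light_2", "light_3")     # 3 conditions (hypothetical labels)
RANGES = {
    "longitudinal_angle_deg": (0.0, 30.0),       # hypothetical range
    "lateral_angle_deg": (-15.0, 15.0),          # hypothetical range
    "distance_cm": (25.0, 45.0),                 # hypothetical range
    "defocus_level": (0.0, 1.0),                 # hypothetical range
}


@dataclass(frozen=True)
class CaptureSetting:
    phone: str
    lighting: str
    longitudinal_angle_deg: float
    lateral_angle_deg: float
    distance_cm: float
    defocus_level: float


def sample_settings(n, seed=None):
    """Draw n capture settings, each distortion value uniformly from its range."""
    rng = random.Random(seed)
    settings = []
    for _ in range(n):
        continuous = {name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()}
        settings.append(
            CaptureSetting(
                phone=rng.choice(PHONES),
                lighting=rng.choice(LIGHTING),
                **continuous,
            )
        )
    return settings
```

Calling `sample_settings(3, seed=0)` would give three reproducible settings for one document; a fixed seed is used here only so that a capture plan can be regenerated.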
b. Fixed document and capture parameters:
- Background: single-colored, clearly contrasting with the documents in order to facilitate the page segmentation process
- Fixed size documents (A4 papers)
- Fixed document orientation (no skew)
- No flash
Dataset size and division
- Ground truth: manual ground truth OCR transcription (raw text)
- Sample data and test data: the set of captured images (12100 in total) is divided into 3630 sample training images (30%) and 8470 test images (70%), such that the training images are representative of the different capture settings and different documents.
- Input: JPG images
- Output: UTF-8 text file
- For input image “7777.JPG”, the output file should be named “7777.TXT”
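The input/output convention above can be sketched in a few lines of Python. The helper names `output_path_for` and `write_transcription` and the output directory argument are hypothetical, introduced only for illustration; the OCR step that produces the text is left out.

```python
from pathlib import Path


def output_path_for(image_path, out_dir):
    """Map an input image such as '7777.JPG' to '<out_dir>/7777.TXT'."""
    return Path(out_dir) / (Path(image_path).stem + ".TXT")


def write_transcription(image_path, text, out_dir):
    """Write the recognized text for one image as a UTF-8 text file."""
    target = output_path_for(image_path, out_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text, encoding="utf-8")
    return target
```

Passing the encoding explicitly avoids relying on the platform default, which is not UTF-8 on every system.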