SmartDoc 2015 – Challenge 2 Dataset

Licence

This dataset is publicly available free of charge for research purposes only. Redistribution of the dataset is not allowed. A copy of the dataset can be obtained by registering through this form or by contacting the authors directly. Any use of this dataset must cite the following reference:

Jean-Christophe Burie, Joseph Chazalon, Mickaël Coustaty, Sébastien Eskenazi, Muhammad Muzzamil Luqman, Maroua Mehri, Nibal Nayef, Jean-Marc Ogier, Sophea Prum and Marçal Rusiñol: “ICDAR2015 Competition on Smartphone Document Capture and OCR (SmartDoc)”, In 13th International Conference on Document Analysis and Recognition (ICDAR), 2015.

Download

The dataset is available for download on Zenodo at the following address: https://zenodo.org/record/2572929

Description

The dataset is composed of many documents (single column, printed, English). The text in the documents has multiple scales, multiple fonts, multiple font-faces and multiple colors.

The documents described above serve as the basis for the dataset: a set of different captures is taken for each document. The captures exhibit different distortions that hinder OCR. For each document, a few captures are taken with values selected at random from the ranges of the different distortions (see below).

As this process results in a very large number of captured images, capture is automated using a robotic arm that carries a mobile phone and precisely controls its movement. The robotic arm and the mobile phone are programmed to capture images for the specified distortion values with minimal human supervision.

Dataset Specifications

The distortions used in the dataset are specified below.

a. Variable capture parameters

    The following list shows the variations of the capture conditions.

  • Mobile phone used for the capture (2 phones)
  • Lighting condition (3 conditions)
  • Perspective distortions (longitudinal and lateral incidence angles and distance to the document)
  • Out of focus blur

b. Fixed document and capture parameters

  • Background: a single color that contrasts clearly with the documents, to facilitate page segmentation
  • Fixed size documents (A4 papers)
  • Fixed document orientation (no skew)
  • No flash

Dataset size and division

  • Ground truth: manual ground truth OCR transcription (raw text)
  • Sample data and test data: the set of captured images (12100 in total) is divided into 3630 sample (training) images and 8470 test images, such that the training images are representative of the different capture settings and the different documents.
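
As a quick sanity check on the figures above, the following sketch confirms that the two subsets add up to the stated total and shows the share of each. The constant names are chosen here for illustration and are not part of the dataset.

```python
# Image counts as stated in the dataset description.
TOTAL_IMAGES = 12100
SAMPLE_IMAGES = 3630   # sample (training) images
TEST_IMAGES = 8470     # test images

# The two subsets should account for every captured image.
assert SAMPLE_IMAGES + TEST_IMAGES == TOTAL_IMAGES

sample_share = SAMPLE_IMAGES / TOTAL_IMAGES
test_share = TEST_IMAGES / TOTAL_IMAGES
print(f"sample: {sample_share:.0%}, test: {test_share:.0%}")
# prints "sample: 30%, test: 70%"
```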

File formats

  • Input: JPG images
  • Output: UTF-8 text file
    • for input image “7777.JPG” the output should be “7777.TXT”
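
The naming rule above can be sketched in Python as follows. This is a minimal illustration, not an official tool: the function names `output_path_for` and `write_transcription` and the output directory argument are assumptions introduced here, and the OCR step that produces the text is left out.

```python
from pathlib import Path


def output_path_for(image_path, output_dir):
    """Return the transcription path for a captured image.

    Keeps the image's base name and replaces the extension with .TXT,
    for example 7777.JPG -> 7777.TXT.
    """
    return Path(output_dir) / (Path(image_path).stem + ".TXT")


def write_transcription(image_path, output_dir, text):
    """Write the recognized text for one image as a UTF-8 text file."""
    out_path = output_path_for(image_path, output_dir)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(text, encoding="utf-8")
    return out_path
```

For example, `output_path_for("images/7777.JPG", "results")` yields a path whose file name is `7777.TXT`.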