SmartDoc QA

Download

The dataset has a size of 13GB and is now hosted on Zenodo at the following URL:

Citation

Nibal Nayef, Muhammad Muzzamil Luqman, Sophea Prum, Sebastien Eskenazi, Joseph Chazalon, Jean-Marc Ogier: “SmartDoc-QA: A Dataset for Quality Assessment of Smartphone Captured Document Images - Single and Multiple Distortions”, Proceedings of the sixth international workshop on Camera Based Document Analysis and Recognition (CBDAR), 2015.

Description

Modern smartphones have had a revolutionary impact on the way people digitize paper documents. Their wide ownership and ease of use have resulted in a massive amount of image data of digitized paper documents. The goal of digitizing paper documents is not only to archive them for sharing but also, most of the time, to process them with automated document image processing systems. Such systems extract the content of document images in order to recognize it, index it, verify it, compare it with a database, and so on. However, smartphone cameras are optimized for capturing natural scene images, so taking a simple photo of a paper document does not ensure that its content will be exploitable by automated document image processing systems. The content may be degraded by the lighting conditions, the image resolution, camera noise, perspective distortion, physical distortions of the paper (folds etc.), out-of-focus blur and/or motion blur during capture.

To ensure that the content of a captured document image is exploitable by automated systems, it is important to assess its quality automatically and in real time. Otherwise, it is often impossible to recapture the document image later, because the original document is no longer available. Quality assessment is also required when captured document images are to be transmitted for further processing.

The quality assessment step is an important part of both the acquisition and the digitization processes. Assessing document quality can aid users during capture or help improve image enhancement methods after a document has been captured. However, the current state of the art in document image quality assessment lacks databases.

In order to provide a baseline benchmark for quality assessment methods for mobile captured documents, we present a database for quality assessment that contains both single- and multiply-distorted document images.

Example 3: Magnified view of a blurry document

The proposed dataset can be used to benchmark quality assessment methods against the objective measure of OCR accuracy, and can also be used to benchmark quality enhancement methods. There are three types of documents in the dataset: modern documents, old administrative letters and receipts.
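One common way to benchmark a quality assessment method against OCR accuracy is to check how well its predicted quality scores rank the images in the same order as their OCR accuracies. The sketch below computes a Spearman rank correlation in plain Python. It is an illustration only: this page does not state which evaluation protocol the authors use, and the function and parameter names (`average_ranks`, `pearson`, `spearman`, `predicted_quality`, `ocr_accuracies`) are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the run [start, end] over all values equal to values[order[start]].
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions in the run
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(predicted_quality, ocr_accuracies):
    """Spearman rank correlation between quality scores and OCR accuracies."""
    if len(predicted_quality) != len(ocr_accuracies):
        raise ValueError("both sequences must have the same length")
    if len(predicted_quality) < 2:
        raise ValueError("at least two images are needed")
    return pearson(average_ranks(predicted_quality), average_ranks(ocr_accuracies))
```

A value close to 1 would indicate that the method ranks images in nearly the same order as the OCR accuracy does. Ties are handled with average ranks, which matters when several images share the same OCR accuracy.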

The document images of the dataset are captured under varying capture conditions (light, different types of blur and perspective angles). This causes geometric and photometric distortions that hinder the OCR process.

The ground truth of the dataset images consists of the text transcriptions of the documents, the OCR results of the captured documents and the values of the different capture parameters used for each image.
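Given a ground-truth transcription and the OCR result for a captured image, a character-level OCR accuracy can be derived from the edit distance between the two strings. The following is a minimal sketch of one common definition, not necessarily the metric used by the dataset authors; the function names `levenshtein` and `ocr_accuracy` are hypothetical, and the sketch assumes both texts have already been loaded as Python strings.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the edit distance (insertions, deletions, substitutions) between a and b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def ocr_accuracy(transcription: str, ocr_output: str) -> float:
    """Character accuracy in [0, 1]: 1 minus the edit distance per ground-truth character."""
    if not transcription:
        # Assumption: an empty ground truth is matched only by an empty OCR result.
        return 1.0 if not ocr_output else 0.0
    distance = levenshtein(transcription, ocr_output)
    # Clamp at 0 because the distance can exceed the transcription length.
    return max(0.0, 1.0 - distance / len(transcription))
```

For example, `ocr_accuracy("abcd", "abxd")` returns `0.75`, since one of the four ground-truth characters was substituted.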