Images with directories as labels for Tensorflow data

A common format for storing images and labels is a tree directory structure, where the data directory contains a set of subdirectories named by their label, each holding the samples for that label. Datasets used for transfer learning in image classification are often provided in this structure.

Update May 2018: If you would like an approach that doesn’t prepare into TFRecords, instead utilising tf.data and reading directly from disk, I have done this when making the input function for my Dogs vs Cats transfer learning classifier.

Data layout

As an example, the directory may be as so:

  • data
    • train
      • dog
        • 1.jpg, 2.jpg, …, n.jpg
      • cat
        • 1.jpg, 2.jpg, …, n.jpg
    • validation
      • dog
        • 1.jpg, 2.jpg, …, n.jpg
      • cat
        • 1.jpg, 2.jpg, …, n.jpg
    • test
      • unknown
        • 1.jpg, 2.jpg, …, n.jpg

If we want to use the TensorFlow Dataset API, one option is to use tf.contrib.data.Dataset.list_files with a glob pattern. This gives us a dataset of strings for our file paths, and we could then use tf.read_file and tf.image.decode_jpeg to map in the actual image data. The downside of this is reading the actual label: the path is a string tensor, so I found it cumbersome to do the path manipulation, extract the folder name and map it to an integer label. Following on from my last post on Convert and using the MNIST dataset as TFRecords, we will do the same with this dataset so we can use a very similar input function.
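For contrast, the same label extraction is trivial in plain Python, and it is what the conversion below will rely on; a minimal sketch:

```python
import os

def label_from_path(file_path):
    # The parent directory name is treated as the class label
    return os.path.basename(os.path.dirname(file_path))

label_from_path('~/data/DogsVsCats/train/dog/1.jpg')  # → 'dog'
```

Doing the equivalent on a string tensor inside the graph is what makes the pure tf.data route awkward.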

Label preparation

We make the assumption that directories are labels: to make this generic and easier to transfer, we can list the directories and create a name-to-integer mapping.

import os
from os import path
import pickle

data_dir = path.expanduser('~/data/DogsVsCats')

train_data_dir = path.join(data_dir, 'train')
test_data_dir = path.join(data_dir, 'test')
validation_data_dir = path.join(data_dir, 'validation')

class_names = os.listdir(train_data_dir) # Get names of classes
class_name2id = { label: index for index, label in enumerate(class_names) } # Map class names to integer labels

# Persist this mapping so it can be loaded when training for decoding
with open(path.join(data_dir, 'class_name2id.p'), 'wb') as p:
    pickle.dump(class_name2id, p, protocol=pickle.HIGHEST_PROTOCOL)
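To make the "decoding" use concrete, here is a sketch of the round trip: persisting the mapping and later inverting it so integer predictions can be turned back into class names. The mapping values and file location below are illustrative, not taken from the dataset itself:

```python
import os
import pickle
import tempfile

class_name2id = {'cat': 0, 'dog': 1}  # example mapping; real order comes from os.listdir

record_dir = tempfile.mkdtemp()
mapping_path = os.path.join(record_dir, 'class_name2id.p')

with open(mapping_path, 'wb') as p:
    pickle.dump(class_name2id, p, protocol=pickle.HIGHEST_PROTOCOL)

# Later, at training or inference time, load it back and invert it
with open(mapping_path, 'rb') as p:
    loaded = pickle.load(p)
id2class_name = {index: label for label, index in loaded.items()}

id2class_name[1]  # → 'dog'
```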

Images to TFRecords

Prior to encoding the images and labels as TFRecords, there are a few other choices we can make to simplify things. One such thing is the image dimensions. Images may not all be the same size, or it may be desirable to downscale rather than use the full resolution image. It is also easier to use constants for these when training, rather than reading the width, height and depth from tensors or resizing in the training pipeline. In our cats vs dogs example, we will ensure our images are 224 x 224 x 3. I’ll dump the function here and then go over each step afterwards:

import glob
import os
import sys
from PIL import Image
import numpy as np
import tensorflow as tf

def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def convert_to_tfrecord(dataset_name, data_directory, class_map, segments=1, directories_as_labels=True, files='**/*.jpg'):

    # Create a dataset of file path and class tuples for each file
    filenames = glob.glob(os.path.join(data_directory, files), recursive=True)
    classes = (os.path.basename(os.path.dirname(name)) for name in filenames) if directories_as_labels else [None] * len(filenames)
    dataset = list(zip(filenames, classes))

    # If sharding the dataset, find how many records per file
    num_examples = len(filenames)
    samples_per_segment = num_examples // segments

    print(f"Have {samples_per_segment} samples per record file")

    for segment_index in range(segments):
        start_index = segment_index * samples_per_segment
        end_index = (segment_index + 1) * samples_per_segment

        sub_dataset = dataset[start_index:end_index]
        record_filename = os.path.join(data_directory, f"{dataset_name}-{segment_index}.tfrecords")

        with tf.python_io.TFRecordWriter(record_filename) as writer:
            print(f"Writing {record_filename}")

            for index, sample in enumerate(sub_dataset):
                sys.stdout.write(f"\rProcessing sample {start_index+index+1} of {num_examples}")
                sys.stdout.flush()

                file_path, label = sample
                image = Image.open(file_path)
                image = image.resize((224, 224))
                image_raw = np.array(image).tostring()

                features = {
                    'label': _int64_feature(class_map[label]),
                    'text_label': _bytes_feature(label.encode()),
                    'image': _bytes_feature(image_raw)
                }
                example = tf.train.Example(features=tf.train.Features(feature=features))
                writer.write(example.SerializeToString())

  1. We create a list of (path, label) tuples for each image in the dataset
  2. Iterating over each segment, we read in the image and resize it
  3. We write out the record with the label and image data
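When the records are read back, those fixed 224 x 224 x 3 constants are what let us recover the image array from the raw bytes without storing shape information in each record. A small numpy sketch of that decode step (using tobytes(), the newer name for tostring()):

```python
import numpy as np

HEIGHT, WIDTH, DEPTH = 224, 224, 3

# Simulate one image's raw bytes as written by convert_to_tfrecord
image = np.arange(HEIGHT * WIDTH * DEPTH, dtype=np.uint8).reshape((HEIGHT, WIDTH, DEPTH))
image_raw = image.tobytes()

# With known constants, the training pipeline can recover the array shape directly
decoded = np.frombuffer(image_raw, dtype=np.uint8).reshape((HEIGHT, WIDTH, DEPTH))
```

In the actual input function the same reshape is done with tf.reshape after tf.decode_raw, using these same constants.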

Full code available at https://github.com/damienpontifex/BlogCodeSamples/blob/master/DataToTfRecords/directories-to-tfrecords.py