PhotoML: Photo culling with Machine Learning
A UTMIST Project by: Gursimar Singh, Fengyuan Liu, Yue Fung Lee, Jingmin Wang, Dev Shah, Chris Oh, Muhammad Ahsan Kaleem.
Photo culling is a tedious, time-consuming process through which photographers select the best few images from a photo shoot for editing and final delivery to clients. In this project, we develop the software stack, user-facing product, and backend ML solution for PhotoML, a startup aiming to solve this problem. We use computer vision techniques to automate the photo culling process while also accounting for the individual artistic preferences of each photographer.
We use a variety of pretrained sub-models to do this, extracting features from each one and ensembling these features to make a decision as to which category an image belongs to.
Background Information
Photographers typically take thousands of images in any given shoot, but less than ten percent of those images make it into the final product delivered to the client. This process of filtering the best few images out of thousands is known as culling, and it takes a significant amount of time. For this project, we aim to sort images into three buckets: the first consists of images the photographer will definitely keep in the final product, the second of images that will definitely be removed, and the third is a maybe bucket whose images the photographer sorts into one of the other two.
For this project we limit our scope to wedding photographers, so the dataset consists primarily of images with multiple people in the frame. Some of the features a photographer may look for while culling include:
1. Ensuring that all subjects visible in the frame have their eyes open and are focusing on the camera.
2. Ensuring that there is no motion blurring due to the movements of subjects in the frame when the picture was taken.
3. Not selecting more than 2–3 images of the same scene, since 10 or 20 images are sometimes taken of the same scene at once.
4. Ensuring that the subjects in the frame show a particular emotion. This may be an emotion like happiness or sadness, depending on the scene and photographer.
5. Selecting images with adequate lighting and good color composition.
6. Ensuring that the selected images make good use of the space available in the frame. Typically, photographers would use the rule of thirds [1], in which they would place their subjects in the left or right third of an image and allow the background scenery to occupy the rest of the image. However, for wedding photos, photographers may want their subjects to occupy the center of the image and take up most of the frame since the focus of the photoshoot is the people involved.
We use a different submodel to account for each of the aforementioned criteria when performing image culling.
Why manual feature selection and separate submodels?
There are three main reasons we decided to go for an ensembling approach with fixed, predefined features to extract rather than training a conventional CNN to automatically extract arbitrary features for classification.
- The difference between culled and selected images is minute: all it takes is one subject with their eyes closed, a hint of blur, or a small change in camera angle. For this reason, simply using a CNN for classification would not be effective, since such minute details may not be extracted as features.
- Most photographers would generally agree on which images are good, which means that individual artistic style only accounts for a small percentage of selected images. Moreover, the features that photographers look out for are finite and well-defined, therefore it is possible to have a feature extraction submodel for each.
- It is easy to add more submodels over time due to this modular approach.
Our approach
The diagram above shows our approach to image culling, wherein each submodel is used to extract features from the images. The ensembling model takes as input the combined outputs of all submodels, essentially serving as the network that learns how much importance a given photographer assigns to each of the extracted features. Based on the output of the ensembling model, we sort images into the three buckets.
It is important to note that, because we run multiple submodels, computational efficiency matters more to us than small increases in accuracy. Slight accuracy gains from state-of-the-art methods are of limited value, since with output ensembling no single model generally has a large impact on the overall prediction score. At the same time, running many submodels is compute-intensive, so we favor submodels that are relatively simple and cheap to run.
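For illustration, here is a minimal sketch of what the ensembling stage could look like: a small feed-forward network over the concatenated submodel outputs whose keep-probability is thresholded into the three buckets. The layer sizes, the sigmoid keep-probability output, and the thresholds are illustrative assumptions, not the production configuration.

```python
import torch
import torch.nn as nn

class EnsemblingModel(nn.Module):
    """Small feed-forward network over the concatenated submodel features.
    Layer sizes here are illustrative, not the production configuration."""
    def __init__(self, feature_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid(),  # keep-probability in [0, 1]
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.net(features)

def sort_into_bucket(score: float, keep_thr: float = 0.8, cull_thr: float = 0.3) -> str:
    """Map a keep-probability to one of the three buckets (thresholds are illustrative)."""
    if score >= keep_thr:
        return "keep"
    if score <= cull_thr:
        return "cull"
    return "maybe"

# Usage: concatenate the flattened outputs of all submodels for one image.
submodel_outputs = [torch.randn(1, 64), torch.randn(1, 7), torch.randn(1, 1)]  # dummy feature vectors
combined = torch.cat(submodel_outputs, dim=1)
model = EnsemblingModel(feature_dim=combined.shape[1])
print(sort_into_bucket(model(combined).item()))
```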
Duplicate detection
Duplicate detection is the process of selecting and grouping near-duplicate images. In any given shoot, the photographer takes multiple images of the same scene to ensure that there is at least one image without movement blurring or closed eyes. The duplicate detection model identifies these duplicate groups given a dataset consisting of all the photos in a shoot.
This is essentially an unsupervised learning problem akin to image clustering, except we cannot use traditional approaches like K-Means clustering: we do not know how many clusters exist ahead of time, since there could be any number of duplicate groups in the dataset.
This model is an essential preprocessing step before classifying the images in a shoot. Without duplicate detection, all the images in a duplicate group would receive similar prediction scores, since the images themselves are similar, so they would either all be culled or all be selected. That is not ideal, because photographers select only the best few images from each duplicate group. Duplicate detection therefore ensures that the output is diverse, with images drawn from different groups.
The duplicate detection approach is relatively simple and takes the following steps:
- A pretrained MobileNetV3 [2] model extracts features from each image; each feature map is flattened into a vector in a high-dimensional latent space.
- The extracted feature vectors are clustered using an approach that allows for an arbitrary number of clusters.
This approach is simpler than classical keypoint-matching methods such as SIFT or ORB and uses significantly less computational power.
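For illustration, here is a minimal sketch of this pipeline, assuming a torchvision MobileNetV3 backbone as the feature extractor and scikit-learn's agglomerative clustering with a distance threshold (which, unlike K-Means, does not require the number of clusters up front). The threshold value is a placeholder that would need tuning on real shoots.

```python
import torch
from PIL import Image
from sklearn.cluster import AgglomerativeClustering
from torchvision import models, transforms

# Pretrained MobileNetV3 used purely as a feature extractor (classifier head removed).
backbone = models.mobilenet_v3_small(weights=models.MobileNet_V3_Small_Weights.DEFAULT)
backbone.classifier = torch.nn.Identity()
backbone.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(paths):
    """Flatten each image into a single feature vector in the latent space."""
    feats = [backbone(preprocess(Image.open(p).convert("RGB")).unsqueeze(0)).squeeze(0)
             for p in paths]
    return torch.stack(feats).numpy()

def group_duplicates(paths, distance_threshold=15.0):
    """Cluster feature vectors without fixing the number of clusters up front.
    The distance threshold is a placeholder that would be tuned on real shoots."""
    labels = AgglomerativeClustering(
        n_clusters=None, distance_threshold=distance_threshold
    ).fit_predict(embed(paths))
    groups = {}
    for path, label in zip(paths, labels):
        groups.setdefault(label, []).append(path)
    return list(groups.values())
```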
Closed Eye Detection
The closed eye detection submodel allows us to identify how many subjects in the frame have their eyes closed. Typically, images with closed eyes are not desirable, but in a wedding shoot, subjects may have their eyes closed for multiple reasons (they may be crying for example), and the images may still be desirable. These nuances are expected to be learned by the ensembling model.
For this approach, we began by detecting facial landmarks using dlib [3], a library with Python bindings that can detect 68 landmarks on human faces. Each eye is marked with 6 landmarks (p1 through p6). The ratio of the distances between these landmarks can be used to identify whether the eye is closed or not.
Intuitively, when the distances between p2 and p6 and between p3 and p5 are large, the eye is open. We add these two distances and normalize by dividing by twice the distance between p1 and p4, which makes the ratio invariant to the distance of the eye from the camera: EAR = (||p2 - p6|| + ||p3 - p5||) / (2 * ||p1 - p4||). This eye aspect ratio (EAR) score is the output of this submodel and is combined with the other activations.
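In code, the EAR computation is only a few lines. The sketch below assumes dlib's standard 68-point shape predictor file is available locally and averages the EAR of both eyes per detected face; the actual submodel may aggregate the scores differently.

```python
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # standard 68-point model

def eye_aspect_ratio(eye: np.ndarray) -> float:
    """eye holds the six (x, y) landmarks p1..p6 of one eye, in dlib's ordering."""
    vertical = np.linalg.norm(eye[1] - eye[5]) + np.linalg.norm(eye[2] - eye[4])
    horizontal = np.linalg.norm(eye[0] - eye[3])
    return vertical / (2.0 * horizontal)

def eye_scores(gray_image: np.ndarray) -> list:
    """Return one EAR per detected face, averaged over both eyes."""
    scores = []
    for face in detector(gray_image):
        shape = predictor(gray_image, face)
        pts = np.array([[shape.part(i).x, shape.part(i).y] for i in range(68)])
        left, right = pts[36:42], pts[42:48]  # 68-point indices of the two eyes
        scores.append((eye_aspect_ratio(left) + eye_aspect_ratio(right)) / 2.0)
    return scores
```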
Emotion Detection
The emotion detection submodel identifies faces and their associated emotions given an image, outputting the average emotion of the image. A photographer may want to look out for this particular feature, since the emotions of subjects in a wedding are important when taking photos.
To do this, we used the facenet-pytorch library [5], which uses the Multi-Task Cascaded Convolutional Neural Network (MTCNN) [6] model to detect bounding boxes around faces. We then created a custom convolutional neural network and trained it on a dataset of faces annotated with emotions. With these two models, we first identify and crop the faces in an image and then pass the cropped faces through the emotion recognition network, which classifies each face as angry, disgusted, fearful, happy, sad, surprised, or neutral. The resulting probabilities are then stored and combined with activations from the other models.
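The custom emotion network itself is not reproduced here; the sketch below uses a small stand-in CNN over 48 x 48 grayscale crops (an assumption for illustration, not the project's architecture) to show how MTCNN detections from facenet-pytorch feed into the emotion classifier and how per-face probabilities are averaged.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from facenet_pytorch import MTCNN
from PIL import Image
from torchvision import transforms

EMOTIONS = ["angry", "disgusted", "fearful", "happy", "sad", "surprised", "neutral"]

class EmotionCNN(nn.Module):
    """Stand-in for the custom emotion classifier; the real architecture is not shown in this post."""
    def __init__(self, n_classes: int = len(EMOTIONS)):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.fc = nn.Linear(64 * 12 * 12, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc(self.conv(x).flatten(1))

mtcnn = MTCNN(keep_all=True)  # detect every face in the frame
emotion_model = EmotionCNN().eval()
to_input = transforms.Compose([
    transforms.Grayscale(), transforms.Resize((48, 48)), transforms.ToTensor()
])

@torch.no_grad()
def average_emotion(image: Image.Image) -> torch.Tensor:
    """Average the per-class emotion probabilities over all detected faces."""
    boxes, _ = mtcnn.detect(image)
    if boxes is None:
        return torch.zeros(len(EMOTIONS))
    crops = torch.stack([to_input(image.crop(tuple(map(int, box)))) for box in boxes])
    probs = F.softmax(emotion_model(crops), dim=1)
    return probs.mean(dim=0)
```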
Neural Style Extraction
The neural style extraction submodel is used to account for features like color composition and lighting, which dictate the general theme of an image, and a photographer may want to look out for these features when performing image culling.
This submodel differs from the rest in that it has no learned parameters. Instead, we adopt the same approach to obtaining the stylistic representation of an image as the original neural style transfer paper by Gatys et al. [7].
We use the Gramian matrix of an image as the representation of its style. This matrix discards all information about the spatial structure of the image and captures only how the color channels relate to each other. The steps for computing a Gramian matrix are as follows:
- Take the original image of shape (Color Channels (C) x Height (H) x Width (W)) and flatten the height and width dimensions to obtain a tensor of shape (C x (H*W)).
- Multiply the matrix obtained in the previous step by its own transpose. The result is a C x C matrix: the Gramian matrix.
So why is the Gramian matrix considered a good representation of the style of an image? In the first step, when we collapse the height and width dimensions, we discard any information about the structure of the objects in the image, since each 2-dimensional channel is flattened into a single vector. When we multiply this matrix by its transpose, every element of the output is the dot product of one color channel with another. The dot product serves as a measure of similarity between two vectors, so each element of the output matrix represents how similar one color channel is to another. At a higher, more abstract level, the way colors relate to each other is what we call the style of an image, which is why the Gramian matrix is a good representation of it.
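The two steps above translate directly into a few lines of PyTorch. Normalizing by the number of spatial positions is a common convention in style-transfer implementations rather than a requirement.

```python
import torch

def gram_matrix(image: torch.Tensor) -> torch.Tensor:
    """image: tensor of shape (C, H, W) -- raw color channels or feature maps from a pretrained CNN.
    Returns the C x C Gramian matrix of channel-to-channel similarities."""
    c, h, w = image.shape
    flat = image.view(c, h * w)   # step 1: collapse the spatial dimensions
    gram = flat @ flat.t()        # step 2: multiply by its own transpose
    return gram / (h * w)         # normalization so values are comparable across image sizes

# A 3-channel RGB image yields a 3 x 3 matrix.
print(gram_matrix(torch.rand(3, 256, 256)).shape)  # torch.Size([3, 3])
```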
That the Gramian matrix captures style can be easily verified with an example:
Although the example only shows Gramian matrices computed on raw images, the same process can be applied to the feature maps obtained by passing an image through a pretrained CNN. This yields a richer representation of style that accounts not only for color composition but also for texture and other higher-level stylistic features.
The Gramian matrices obtained from this step are also combined with activations from the other submodels.
Semantic Segmentation
The semantic segmentation model is meant to account for the adequate use of the space in the frame, segmenting the image into the subjects and the background. This allows for the detection of whether most of the space in the frame is being occupied by the subjects, which is usually desired for a wedding photograph.
For this submodel, we used the mmsegmentation library [8], using the MobileNetV3 backbone trained on the ADE20k dataset [9]. We found that this model struck an ideal trade-off between efficiency and performance. This particular dataset was chosen since it has image classes similar to the objects that would commonly be seen in wedding photos, including person, window, building, wall, etc.
The segmentation map output from this model is also combined with the activations of the previously mentioned submodels.
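As an illustration, the sketch below follows mmsegmentation's 0.x inference API; the config and checkpoint paths are placeholders for the MobileNetV3/ADE20k model from the model zoo, and the person-area fraction is just one simple summary of the segmentation map (the submodel itself passes richer information to the ensembling model).

```python
import numpy as np
from mmseg.apis import inference_segmentor, init_segmentor

# Placeholder paths: the real config and checkpoint would come from the mmsegmentation model zoo.
CONFIG = "configs/mobilenet_v3/my_ade20k_config.py"
CHECKPOINT = "checkpoints/my_ade20k_checkpoint.pth"
PERSON_CLASS = 12  # "person" index in the ADE20k label map (assumed here; verify against the config's metadata)

model = init_segmentor(CONFIG, CHECKPOINT, device="cuda:0")

def subject_area_fraction(image_path: str) -> float:
    """Fraction of the frame occupied by people, one simple proxy for how well
    the subjects fill the available space."""
    seg_map = inference_segmentor(model, image_path)[0]  # (H, W) array of class indices
    return float(np.mean(seg_map == PERSON_CLASS))
```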
Training
During training, we first passed our images through the pretrained submodels and collected the features output from each one.
We trained the ensembling model in two steps:
- We combined the outputs of the pretrained submodels to ensure uniform structure before passing these to the ensembling model.
- We passed the combined outputs from the submodels into the ensembling model, optimizing it to classify images based on the features from the submodels.
By not performing direct classification with a single end-to-end model, we ensure that the ensembling model learns features that are relevant to the task of image culling and is less likely to overfit on the small amount of training data each photographer can provide. This also speeds up convergence, since the features are predetermined and no feature extraction layers need to be optimized.
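Here is a minimal sketch of the second training step, assuming the submodel outputs have already been computed and concatenated offline, and reusing the sigmoid-output ensembling network sketched earlier; all hyperparameters are illustrative.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train_ensembling_model(model: nn.Module, features: torch.Tensor, labels: torch.Tensor,
                           epochs: int = 20, lr: float = 1e-3) -> nn.Module:
    """features: (N, D) concatenated submodel outputs computed offline;
    labels: (N,) with 1 = selected and 0 = culled."""
    loader = DataLoader(TensorDataset(features, labels.float()), batch_size=32, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.BCELoss()  # the ensembling network ends in a sigmoid, so plain BCE applies
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss = criterion(model(x).squeeze(1), y)
            loss.backward()
            optimizer.step()
    return model
```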
Conclusion
In summary, we chose to take a modular approach to image culling, using multiple submodels to simulate searching for features that a photographer would look for. This added flexibility to our approach since it was easy to add and remove submodels as needed and it also helped with classification, since we were able to specify precisely which features should be considered.
We then stacked an ensembling model on top of the submodels. This stacked structure allowed us to split training into multiple steps, using only pretrained features to perform classification.
This approach to training allowed for quicker convergence since features were predetermined.
As a future improvement, transformer-based architectures are likely worth exploring, since attention mechanisms can focus on the most relevant parts of the input images. Adding more submodels that account for additional features could also improve performance on this task.
Photo culling is a task that even the most experienced photographers find difficult. Applying deep learning techniques to something so intrinsically nuanced was an ambitious attempt, and while we did achieve fairly good results, development remains underway and the future likely holds many improvements for this project.
Citations
[1] https://www.adobe.com/ca/creativecloud/photography/discover/rule-of-thirds.html
[2] https://arxiv.org/abs/1905.02244
[3] http://dlib.net/
[5] https://github.com/timesler/facenet-pytorch
[6] https://arxiv.org/abs/1604.02878
[7] https://arxiv.org/abs/1508.06576
[8] https://github.com/open-mmlab/mmsegmentation
[9] https://groups.csail.mit.edu/vision/datasets/ADE20K/