Deep Learning on Graphs For Computer Vision — CNN, RNN, and GNN

Learning on Graphs

Graphs are a fundamental data structure in computer science, and they are also a natural way to represent real-world information.

A graph of 3 nodes and 2 edges
The adjacency matrix for the graph above. Rows indicate starting node; Columns indicate ending node.
A citation network for Philosophy, where each node is a paper and each edge is a citation (image credit: Kieran Healy)
A graph representation of animal skeletons, where each node is a body part (image credit: Wang et al. 2017)
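The adjacency-matrix convention in the caption above (rows are starting nodes, columns are ending nodes) can be written out directly. This is a minimal sketch: the two edges chosen here are an assumption, since the figure itself is not reproduced.

```python
import numpy as np

# A hypothetical directed graph with 3 nodes (0, 1, 2) and 2 edges:
# 0 -> 1 and 1 -> 2 (the actual edges in the figure are not recoverable here).
# Rows index the starting node; columns index the ending node.
A = np.zeros((3, 3), dtype=int)
A[0, 1] = 1  # edge from node 0 to node 1
A[1, 2] = 1  # edge from node 1 to node 2

out_degree = A.sum(axis=1)  # row sums: edges leaving each node
in_degree = A.sum(axis=0)   # column sums: edges entering each node
```

With this convention, reading across a row tells you everywhere a node points, and reading down a column tells you everything that points at it.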

Fully Connected Neural Network

This is arguably the “hello world” neural network. Each neuron in a layer connects to every neuron in the next layer.

A fully connected neural network with 3 hidden layers. Each neuron derives information from all neurons in the previous layer
The neural network (left) is an equivalent representation of the graph (right): we only connect the neurons that represent connected nodes in the graph
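One way to picture this equivalence is to take an ordinary fully connected layer and zero out every weight between neurons whose nodes are not connected in the graph. The sketch below is purely illustrative; the 4-node adjacency mask and the random weights are assumptions, not any paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4  # one neuron per graph node

# Adjacency of a hypothetical 4-node chain graph (1 = connected, incl. self-loops)
mask = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
])

W = rng.standard_normal((n, n))  # dense fully connected weights
W_graph = W * mask               # keep only weights between connected nodes

x = rng.standard_normal(n)       # one activation per node
h = np.tanh(W_graph @ x)         # each neuron now sees only its graph neighbors
```

A dense mask of all ones recovers the fully connected layer; a sparser mask specializes the network to the graph's structure.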


Convolutional Neural Network

In a convolutional neural network, common for computer vision tasks, a moving “observation window” (the convolution filter) scans the input, and the model learns the filter's weights. The weights are shared at every position of the filter, which gives translation equivariance: the model can recognize objects wherever they appear in the image.

A convolution filter of size 32x32x3 is scanned across a bigger image
For example, each convolution window covers 9 square pixels (nodes): a center node plus 8 special neighbor nodes (4 direct neighbors + 4 diagonal neighbors). Screenshots from the presentation by Qi et al.
An arbitrarily defined neighborhood (not the whole graph). Same source as above.
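Seen through the graph lens, a single 3x3 convolution output is just a weighted sum over a pixel and its 8 grid neighbors, with the same 9 weights reused at every position. A minimal numpy sketch (the toy image and averaging filter are assumptions):

```python
import numpy as np

def conv2d_3x3(image, kernel):
    """Slide one shared 3x3 kernel over a 2D image (valid padding)."""
    H, W = image.shape
    out = np.zeros((H - 2, W - 2))
    for i in range(H - 2):
        for j in range(W - 2):
            # The 3x3 patch = a center pixel plus its 8 grid neighbors
            patch = image[i:i + 3, j:j + 3]
            out[i, j] = np.sum(patch * kernel)  # same weights at every position
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.full((3, 3), 1.0 / 9.0)  # toy averaging filter
out = conv2d_3x3(image, kernel)      # 3x3 output map
```

The weight sharing in the inner loop is exactly what a GNN generalizes: instead of a fixed 8-neighbor grid, the neighborhood comes from an arbitrary graph.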


Recurrent Neural Network

Recurrent neural networks are particularly useful for training on sequential data. A recurrent propagation is a function of the current input and information derived from the previous output, and the same weights are used to fit every (current, previous) pair. This means the network can fit inputs of different lengths, keep a representation of the history, and produce outputs of different lengths. These properties transfer well to graphs, because graphs can be arbitrarily big and flexible.

The h’s stand for outputs, A stands for the shared neural network, and the x’s are inputs. The right side is really just the same network, unrolled through a sequence (image from Olah’s blog)
A toy example: He dreams that all dreams come true. This is just a toy example. Proper NLP should distinguish between a verb and a noun.
The information from x0 and x1 has a decaying influence on h3 after two propagations. The problem gets worse when trying to learn far-reaching dependencies in a sequence (image also from Olah’s blog)
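The recurrence described above can be sketched as h_t = tanh(W x_t + U h_{t-1}), with the same W and U applied at every step. The sizes and random weights below are placeholders, not a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 3, 5
W = rng.standard_normal((d_h, d_in))  # input-to-hidden weights (shared)
U = rng.standard_normal((d_h, d_h))   # hidden-to-hidden weights (shared)

def rnn_step(x_t, h_prev):
    """One recurrent propagation: a function of the current input
    and the state derived from the previous step."""
    return np.tanh(W @ x_t + U @ h_prev)

# Because the weights are shared, the same step handles any sequence length
h = np.zeros(d_h)
for x_t in rng.standard_normal((7, d_in)):  # a length-7 toy sequence
    h = rnn_step(x_t, h)
```

Note that x_0's contribution to the final h has passed through seven tanh-squashed matrix products, which is the decaying-influence problem illustrated in the caption.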


Graph Neural Network

Although graph neural networks can be understood through their similarities with traditional neural networks, a helpful conceptual model unique to GNNs is to think of them as message passing between nodes. Message passing consists of 3 steps:

  1. Collecting information from neighbors, applying some transformation T along the way
  2. Aggregating this information with some aggregation function G
  3. Updating self-state and weights based on the aggregated representation, with an update function U
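The three steps above can be sketched in a few lines of numpy. Here T is a shared linear transform, G is a sum over neighbors, and U is another linear transform followed by a nonlinearity; the toy graph and all weights are assumptions, not the scheme of any specific paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                  # feature size per node
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]])              # adjacency of a toy 3-node graph
H = rng.standard_normal((3, d))        # current state of each node

W_t = rng.standard_normal((d, d))      # T: transform applied to each message
W_u = rng.standard_normal((2 * d, d))  # U: update from [self, aggregated]

def message_passing_step(A, H):
    msgs = H @ W_t                     # 1. transform every node's outgoing message
    agg = A @ msgs                     # 2. aggregate (sum) messages from neighbors
    both = np.concatenate([H, agg], axis=1)
    return np.tanh(both @ W_u)         # 3. update each node's state

H_next = message_passing_step(A, H)
```

Stacking several such steps lets information travel several hops across the graph, one neighborhood per step.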

Use Cases

GNNs can be used quite flexibly. In the following examples, Polygon-RNN++ uses a GNN only to fine-tune its output, Pixel2Mesh maps a GNN directly onto a graph-oriented problem, and RGBD segmentation uses graphs to embed structural information that enriches the training data.


Polygon-RNN++

This project, by David Acuna, Huan Ling (yes, the speaker), Amlan Kar, and Sanja Fidler, automates the time-consuming image-annotation process for computer vision [Efficient Interactive Annotation of Segmentation Datasets with Polygon-RNN++]. The model predicts a roughly accurate polygon segmentation from an image and then takes human adjustments on individual vertices to produce a more accurate segmentation. The tool is open to the public.

A schema for the original Polygon-RNN. GNN is not used here.
The orange dots are RNN outputs. The blue dots are initially placed evenly and co-linearly between the orange dots; the GNN predicts how far and in which direction each blue dot should move


Pixel2Mesh

This project by Wang et al. [Pixel2Mesh: Generating 3D Mesh Models from Single RGB Images] outperforms previous state-of-the-art techniques at generating a 3D mesh representation from a 2D image. It is an elegant, full-on application of GNNs: a mesh model is just a closed 3D surface consisting of many small triangles stitched together, with nodes and edges. A graph!

3 models try to predict a 3D structure from the 2D image on the left. The first two are based on volumetric and point-cloud descriptions respectively; the third is based on a graph description (mesh), with so many vertices that it looks smooth to the eye (image from the paper)
The GNN starts from a naive guess, an ellipsoid, and in each iteration it moves a few vertices around and adds a few more vertices among them (image from the same source as above)
One iteration (image from the paper)
The model also regularizes the mesh to:
  • Prefer smooth surfaces
  • Discourage vertices from crowding on edges
  • Prevent mesh surfaces from cutting into each other
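The first of these preferences, smooth surfaces, is commonly expressed as a Laplacian-style regularizer: each vertex is penalized for straying from the mean of its neighbors. The sketch below illustrates that idea on a made-up 4-vertex mesh; the paper's actual loss terms differ in detail.

```python
import numpy as np

def laplacian_smoothness(vertices, neighbors):
    """Penalize each vertex's squared distance from the mean of its neighbors."""
    loss = 0.0
    for v, nbrs in neighbors.items():
        mean_nbr = np.mean([vertices[n] for n in nbrs], axis=0)
        loss += np.sum((vertices[v] - mean_nbr) ** 2)
    return loss

# A toy 4-vertex "mesh": vertex 3 sticks out of the plane of its 3 neighbors
verts = np.array([[0.0, 0.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.3, 0.3, 1.0]])
nbrs = {3: [0, 1, 2]}
rough = laplacian_smoothness(verts, nbrs)

verts_flat = verts.copy()
verts_flat[3, 2] = 0.0  # flatten vertex 3 back onto the neighbors' plane
smooth = laplacian_smoothness(verts_flat, nbrs)
```

The flattened mesh incurs a much smaller penalty, so gradient descent on this term pulls the surface toward smoothness.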

RGBD Semantic Segmentation

Segmenting the parts of an image is a classic computer vision problem, and traditional methods rely only on 2D RGB images. In this project by Qi et al. [3D Graph Neural Networks for RGBD Semantic Segmentation], depth information (the distance of each pixel from the camera), collected from devices such as the Microsoft Kinect and dual-camera phones, is incorporated in the form of a graph when performing segmentation. A good example is properly segmenting a mirror without getting distracted by the reflection inside it.

Unary CNN (c) dissected the mirror; RGBD segmentation (e) mostly recovers the mirror as one piece (image directly from the paper)
From a 2D perspective, both the blue dots and the green dots are close to the red dot. With depth information, however, we can “disconnect” them in the graph, which helps distinguish the background carpet from the foreground bed before any training happens
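The caption's idea, that pixels adjacent in 2D may be far apart in 3D, can be sketched by connecting points only when their 3D distance is below a threshold. The points and radius below are made-up placeholders, not the paper's exact construction.

```python
import numpy as np

def build_depth_graph(points_3d, radius):
    """Connect two points only if they are close in 3D space."""
    n = len(points_3d)
    A = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(points_3d[i] - points_3d[j]) < radius:
                A[i, j] = A[j, i] = 1
    return A

# Three pixels that are neighbors in the 2D image plane (x, y),
# but the last one sits on a distant background surface (large z)
pts = np.array([[0.0, 0.0, 1.0],   # "red": foreground point
                [0.1, 0.0, 1.0],   # "green": same surface -> connected
                [0.0, 0.1, 4.0]])  # "blue": background -> disconnected
A = build_depth_graph(pts, radius=0.5)
```

No training is involved: the foreground/background separation is baked into the graph's edges before the GNN ever sees the data.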

Future Work

GNNs let both humans and machines model and fit a machine learning task more faithfully, and we can see elements of CNNs and RNNs being transferred into the graph context. Great challenges remain in this area.

About Us

The Paper Reading Group is a series of workshops hosted by the University of Toronto Machine Intelligence Student Team (UTMIST). In these workshops, we invite AI researchers to introduce their latest publications to undergraduate students. We hope this eases the learning curve and clears the mist (hype) around machine learning. While individual researchers decide how technical to get about their papers, our goal is to facilitate connections and foster a community. For more information, you can visit our Facebook page and website.

Works Cited

David Acuna, Huan Ling, Amlan Kar, Sanja Fidler. Efficient Interactive Annotation of Segmentation Datasets with Polygon-RNN++. 2018.

Nanyang Wang, Yinda Zhang, Zhuwen Li, Yanwei Fu, Wei Liu, Yu-Gang Jiang. Pixel2Mesh: Generating 3D Mesh Models from Single RGB Images. 2018.

Xiaojuan Qi, Renjie Liao, Jiaya Jia, Sanja Fidler, Raquel Urtasun. 3D Graph Neural Networks for RGBD Semantic Segmentation. 2017.


