A Method of Comparing Across Hidden Layers using Discrete Fourier Transform and CCA
I have recently discussed a paper by researchers from GoogleBrain and Uber AI Labs in which they use Canonical Correlation beween neuron vectors to illustrate interesting attributes of learning in deep neural networks . The details of this discussion can be found in the original post. One of their contributions was the methodology of using Discrete Fourier Transform (DFT) to resize layers that have different dimensions in order to compute the projection weighted CCA (Canonical Correlation Analysis). This method was called SVCCA and is used to measure the similarity between two layers or neurons. This article will go through the DFT resizing approach in slightly more detail than in my previous article.
What is Discrete Fourier Transform (DFT)?
Fourier transform is a very useful technique in several fields and originated in the field of signal processing. Fourier transforms rely on the Fourier series which states that any periodic function can be converted into an infinite sum of cosines and sines. The Fourier transform is an extension of this to non-periodic functions. The output is what frequencies are present and in what proportions. A Discrete Fourier Transform (DFT) transforms a finite sequence into a same-length sequence of equally-spaced samples of the discrete-time Fourier transform (DTFT), which is a complex-valued function of frequency.
There are many useful guides that illustrate and explain what a fourier transorm is in more detail and how it is computed. One of my favorite visualizations is below from 3Blue1Brown, a channel that focuses on generating educational visualizations in areas of calculus, linear algebra and more. This video is not necessary to continue reading, but I recommend it if you are new to the concept.
In summary, a Discrete Fourier Transform transforms signal from its original space to a frequency domain. The resulting signal has two “properties” — magnitude and phase. Based on my investigation of the researchers’ code repository , they were interested in the magnitude and therefore require taking the absolute value, which provides the signal’s “power” in a specific frequency bin.
Tools For DFT
The researchers in  use numpy as their primary package, which will run on CPU instead of GPU. The documentation for their implementation can be found on the numpy website.
>>> np.tfft(layer_activations, 2, axis=(1,2))
array([[ 54.51875 +0.j, 98.387985+0.j, -15.375423+0.j,
46.053146+0.j,118.956764+0.j, 327.79095 +0.j,
146.41219 +0.j, 16.305088+0.j]],
PyTorch is my framework of choice and they have a set of Fourier transform functions as well, which can be found in their documentation. Using Torch allows for GPU implementation which may improve speed of the algorithm. The output of Torch’s version is slightly different than numpy because it currently does not handle complex number datatypes. If you are using torch.rfft, which takes real numbers to complex output, and if the input is (datasize, height, width, channel), then the output of Torch’s algorithm when one-sided is set to False is: (datasize, height, width, channel, 2). It outputs with 2-dimensions because the first is for real numbers and the second is for imaginary. If you are using Torch’s complex to comples, the input must end with (*, 2). Just note that one way to combine the real numbers and the respective imaginary numbers into this format is by torch.stack(reals, imaginaries). The torch rfft function does not require this because it is taking in a real number to a complex. The output however will
>>> torch.rfft(layer_activations, 2, onesided=False)
tensor([[[[[ -6.4066, 2.1299],
[ -0.1060, -2.4599],
[ 0.4378, 0.4698],
[ -1.6671, 5.4320],
[ -0.3449, -2.4629],
[ 0.6946, -0.3445]],[[ -6.7608, 1.4671],
[ -0.1413, -2.6797],
[ 0.6856, 0.4365],
[ -1.6657, 5.1920],
[ -0.3498, -2.2857],
[ 0.7519, -0.2373]],[[ -6.9790, 0.7479],
[ -0.1330, -2.9023],
[ 0.9351, 0.4522],
[ -1.6175, 4.9570],
[ -0.3891, -2.1128],
[ 0.7872, -0.1210]]]]], grad_fn=<AsStridedBackward>)
How is it being used to compare layers?
The researchers 1] represented layers as a set of neuron vectors that make up the respective layer. Neuron vectors were its set of responses over a finite set of inputs. When comparing layers between models, the two layers may not always have the same dimensions. For example, a convolutional layer (1000, 16, 16, 64) and another of (1000, 8, 8, 64). The researchers compare these layers by flattening the height, width and datasize by (datapoints x height x width, channels) for each layer. This transformes the dimensions to (datapoints, neurons) which is later transposed to (neurons, datapoints). When comparing convolutional layers, the authors emphasize comparing the similarity of two convolutional layers by comparing across channels, therefore all that is required is that the layers have the same number of channels. What if they do not have the same number of channels?
If you read the original paper  or the respective summary, you will see that one of their experiments compared different layers to the neural network’s logit output of a certain class to illustrate how similiar or dissimiliar different classes are:
This would require comparison of layers that do not share the same number of channels, and this is exactly where DFT comes into play. Assuming translation invariance of the original dataset, DFT can be used for reshaping the vector to a new size determined by the minimum height and width between the two layers, becoming the shape:
datasize, height1, width1, channel1 = activations_layer1.shape
datasize, height2, width2, channel2 = activations_layer2.shape
new_shape = (datasize, min(height1, height2), min(width1, width2), min(channel1, channel2))
The algorithm takes in an array of images, applies the 2d fourier transform and resizes them according to the new size, keeping the frequencies that overlap between the two sizes. For example, if you are comparing an activation vector to the logit output of a class, the transformed shape would become (dataset size, 1, 1, 1). This provides a lower bound on the correlation of the vector with the larger set of channels. Now that the dimensions are the same, Singular Value Decomposition (SVD) proceeded by CCA can be performed on the resulting vectors.
Why is this interesting or useful?
Why did I spend so much time on this? Although this approach was used specifically with CCA, by meaningfully resizing layers to the same dimensions, without an excess of computation, many different similiarity measures or computations can be performed to compare them. It also opens up future research on other methods that can meaningully resize the layer activations for comparison. This approach would not have been possible without changing how one thinks about neural activations. Defining neural vectors as the set of responses over a finite set of inputs changed the way comparisons could be made.
- M., Gilmer, J., Yosinski, J., & Sohl-Dickstein, J. (2017). SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability, (Nips), 1–10. https://doi.org/1706.05806
- D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor. Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16:2639–2664, 2004.
- Google. https://github.com/google/svcca.