Last week I presented my first paper, "An Empirical Verification of Wide Networks Theory", at BMVC 2022 in London. The experience was fantastic, both because of the beautiful venue (the Kia Oval, a cricket ground) and because of the stimulating discussions I had with other participants.

In this post I will briefly explain what the paper is about and link to all of its material (a video presentation and a poster).

What initially attracted my attention is that there are many theoretical papers on very wide networks (where the number of neurons in each layer scales as the cube of the number of training examples) that can give mathematical guarantees on convergence and generalization, via the relation with Kernel Methods (Archived) established by the Neural Tangent Kernel (Archived). These guarantees, however, are not useful in practice, because typically deployed models are much narrower than what these theories assume.
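
If you have never met it, the Neural Tangent Kernel has a one-line definition; the sketch below is textbook background, not something specific to our paper.

```latex
% Neural Tangent Kernel of a network f(x; \theta) (standard definition):
\Theta(x, x') \;=\; \big\langle \nabla_\theta f(x; \theta),\; \nabla_\theta f(x'; \theta) \big\rangle
% In the infinite-width limit \Theta stays essentially constant throughout
% training, so gradient descent on the network behaves like kernel
% regression with the fixed kernel \Theta: this is the bridge between
% wide networks and Kernel Methods.
```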

On the other hand, typical models are still overparameterized, i.e. they have more neurons than the number of examples they are trained on, and they can be studied using the Polyak-Lojasiewicz Condition (Archived), a generalization of strong convexity under which local minima are still global, even though no longer unique.
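
For reference, this is the standard statement of the condition (again background, not a result of the paper):

```latex
% Polyak-Lojasiewicz condition: a loss \mathcal{L} with minimum value
% \mathcal{L}^* satisfies PL with constant \mu > 0 if, for all \theta,
\tfrac{1}{2}\,\lVert \nabla \mathcal{L}(\theta) \rVert^{2}
  \;\ge\; \mu \,\big( \mathcal{L}(\theta) - \mathcal{L}^{*} \big).
% If \mathcal{L} is also L-smooth, gradient descent with step size 1/L
% converges linearly:
% \mathcal{L}(\theta_{t+1}) - \mathcal{L}^* \le (1 - \mu/L)\,(\mathcal{L}(\theta_t) - \mathcal{L}^*),
% even though the loss need not be convex and the minimum need not be unique.
```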

What we do is study the network optimization problem in the Polyak-Lojasiewicz framework and expose the relation between convergence speed and the conditioning of certain matrices built from the network's differentials. This gives us a way to measure crucial quantities during the optimization process and to verify certain parts of these theories.
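
To give an idea of what "conditioning of the network differentials" means in practice, here is a minimal PyTorch sketch (toy sizes and names of my own choosing, not the paper's code) that measures the spectrum of the tangent-kernel Gram matrix of a small MLP:

```python
import torch

torch.manual_seed(0)

# Toy sizes: n training examples, input dimension d, hidden width w.
n, d, w = 32, 10, 256
X = torch.randn(n, d)

model = torch.nn.Sequential(
    torch.nn.Linear(d, w),
    torch.nn.Tanh(),
    torch.nn.Linear(w, 1),
)

# Build the Jacobian J of the n scalar outputs w.r.t. all parameters,
# one row per training example.
params = list(model.parameters())
rows = []
for i in range(n):
    out = model(X[i:i + 1]).squeeze()
    grads = torch.autograd.grad(out, params)
    rows.append(torch.cat([g.reshape(-1) for g in grads]))
J = torch.stack(rows)            # shape: (n, number_of_parameters)

# Tangent-kernel Gram matrix: for the squared loss, lambda_min(K) controls
# the local PL constant, hence the guaranteed rate of loss decrease.
K = J @ J.T
eigs = torch.linalg.eigvalsh(K)  # eigenvalues in ascending order
print(f"lambda_min = {eigs[0]:.3e}, lambda_max = {eigs[-1]:.3e}, "
      f"condition number = {(eigs[-1] / eigs[0]):.3e}")
```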

In particular, we focus on predicting the loss decrease in the next epoch from quantities measured at the current one, and on measuring conditioning more accurately throughout the optimization process. In doing so we find surprising results that contradict the way theorems about wide networks reason, giving us hope that more can be done to understand realistic overparameterized networks.
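
And here is a drastically simplified illustration of the "predict the next loss" idea, reduced to a single gradient step on a linear model (again my own toy example, not the paper's procedure): the first-order prediction uses exactly the gradient-norm quantity that the PL inequality bounds from below.

```python
import torch

torch.manual_seed(0)

# Tiny least-squares problem so the numbers are easy to check.
n, d = 32, 10
X, y = torch.randn(n, d), torch.randn(n, 1)
model = torch.nn.Linear(d, 1)
lr = 1e-2

loss = 0.5 * ((model(X) - y) ** 2).mean()
grads = torch.autograd.grad(loss, list(model.parameters()))
grad_sq_norm = sum((g ** 2).sum() for g in grads)

# First-order prediction of the loss after one gradient step; the PL
# condition lower-bounds the decrease term lr * ||grad||^2 via the PL
# constant (itself controlled by lambda_min of the tangent kernel).
predicted = loss.item() - lr * grad_sq_norm.item()

with torch.no_grad():
    for p, g in zip(model.parameters(), grads):
        p -= lr * g
actual = 0.5 * ((model(X) - y) ** 2).mean().item()

print(f"before: {loss.item():.6f}  predicted: {predicted:.6f}  actual: {actual:.6f}")
```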

I hope you are now curious and want to have a look at the paper, the poster, and the code (Archived), or at least visit the paper website (Archived), which includes a video explanation of the paper.