Universal Equivariant Multilayer Perceptrons
Abstract
Group invariant and equivariant Multilayer Perceptrons (MLPs), also known as Equivariant Networks, have achieved remarkable success in learning on a variety of data structures, such as sequences, images, sets, and graphs. Using tools from group theory, this paper proves the universality of a broad class of equivariant MLPs with a single hidden layer. In particular, it is shown that having a hidden layer on which the group acts regularly is sufficient for universal equivariance. Next, Burnside’s table of marks is used to decompose product spaces. It is shown that the product of two $G$-sets always contains an orbit larger than the input orbits. Therefore, high-order hidden layers inevitably contain a regular orbit, leading to universality of the corresponding MLP. It is shown that with an order larger than the logarithm of the size of the stabilizer group, a high-order equivariant MLP is universal.
I Introduction
Invariance and equivariance properties constrain the output of a function under various transformations of its input. This constraint serves as a strong learning bias that has proven useful in sample-efficient learning for a wide range of structured data. In this work, we are interested in universality results for Multilayer Perceptrons (MLPs) that are constrained to be equivariant or invariant. This type of result guarantees that the model can approximate any continuous equivariant (invariant) function with arbitrary precision, in the same way an unconstrained MLP can approximate an arbitrary continuous function hornik1989multilayer; cybenko1989approximation; funahashi1989approximate.
The study of invariance in neural networks goes back to the Perceptrons book minsky2017perceptrons, where the necessity of parameter-sharing for invariance was used to prove the limitations of a single-layer Perceptron. Follow-up work showed how parameter symmetries can be used to achieve invariance to finite and infinite groups shawe1989building; wood1996representation; shawe1993symmetries; wood1996invariant. These fundamental early works went unnoticed during the resurgence of neural network research and the renewed attention to symmetry hinton2011transforming; mallat2012group; bruna2013invariant; gens2014deep; jaderberg2015spatial; dieleman2016exploiting; cohen2016group.
When equivariance constraints are imposed on feedforward layers in an MLP, the linear map in each layer is constrained to use tied parameters wood1996representation; ravanbakhsh_symmetry. This model, which we call an equivariant MLP, appears in deep learning with sets (zaheer_deepsets; qi2017pointnet), exchangeable tensors hartford2018deep, graphs maron2018invariant, and relational data (graham2019deep). Universality results exist for some of these models (zaheer_deepsets; segol2019universal; keriven2019universal). Broader results for high-order invariant MLPs appear in (maron2019universality).
A parallel line of work in equivariant deep learning studies linear actions of a group beyond permutations. The resulting equivariant linear layers can be written using convolution operations cohen2016steerable; kondor2018generalization. When limited to permutation groups, group convolution is simply another expression of parameter-sharing ravanbakhsh_symmetry; kondor2018generalization; see also Section II.3. However, in working with linear representations, one may move beyond finite groups cohen2019general; see also wood1996representation. Some applications include equivariance to isometries of the Euclidean space (weiler2019general; worrall2017harmonic) and the sphere (cohen2018spherical). An extension of this view to manifolds is proposed in cohen2019gauge. Finally, a third line of work in equivariant deep learning that involves a specialized architecture and learning procedure is that of Capsule networks (sabour2017dynamic; hinton2018matrix); see lenssen2018group for a group-theoretic generalization.
i.1 Summary of Results
This paper proves universality of equivariant MLPs for finite groups in two settings: First, Section III shows that any equivariant MLP with a single regular hidden layer is universal equivariant (invariant). Next, Section IV shows that a general universality result (that subsumes existing universality results for high-order networks) can be derived and attributed to the existence of regular orbits in product spaces. The main tool in our analysis involving the decomposition of product spaces is Burnside’s table of marks. Using the table of marks, Section V proves that the product of two $G$-sets always creates at least one orbit larger than the orbits of the input sets. Therefore, repeated products in high-order hidden layers inevitably lead to the creation of a regular orbit, which we show is sufficient for universality. A lower bound on the order of a high-order hidden layer that is sufficient for universality is $\log_2(|H|)$, where $H$ is the stabilizer group. Using the largest possible stabilizer on a set of size $n$, this leads to a bound smaller than $n \log_2(n)$ for universal equivariance to an arbitrary permutation group. This bound is an improvement over the previous bound that was shown to guarantee universal invariance maron2019universality.
Ii Preliminaries
Let $G$ be a finite group with its action defined on the finite set $\Omega$. Formally, this action is a homomorphism $\phi: G \to \mathrm{Sym}(\Omega)$ into the group of all permutations of $\Omega$. The image of this map is a permutation group $\phi(G) \leq \mathrm{Sym}(\Omega)$. We use the notation $g \cdot \omega$ to denote this action for $g \in G$ and $\omega \in \Omega$.
ii.1 Invariant and Equivariant Linear Maps
Let the real matrix $W \in \mathbb{R}^{\Omega' \times \Omega}$ denote a linear map $\mathbb{R}^{\Omega} \to \mathbb{R}^{\Omega'}$. We say this map is equivariant iff
$W P_g = P'_g W \quad \forall g \in G,$ (1)
where, similar to $P_g$, the permutation matrix $P'_g$ is defined based on the action of $G$ on the output index set $\Omega'$.
In this definition, we assume that the group action on the input is faithful – that is, $\phi$ is injective, so that $G \cong \phi(G)$. If the action on the output index set is not faithful, then the kernel of this action is a nontrivial normal subgroup of $G$, $K \trianglelefteq G$. In this case $G/K$ is a quotient group, and it is more accurate to say that the map
is invariant to $K$ and
equivariant to $G/K$. Using this convention, equivariance and invariance correspond to the extreme cases of $K = \{e\}$ and $K = G$. Moreover, composition of such invariant-equivariant functions preserves this property, motivating the design of deep networks by stacking equivariant layers.
ii.2 Orbits and Homogeneous Spaces
The action of $G$ partitions $\Omega$ into orbits $G \cdot \omega = \{g \cdot \omega \mid g \in G\}$, where $G$ is transitive on each orbit, meaning that for each pair $\omega, \omega'$ in the same orbit, there is at least one $g \in G$ such that $g \cdot \omega = \omega'$. If $\Omega$ has a single orbit, the action is transitive, and $\Omega$ is called a homogeneous space for $G$. If moreover the choice of $g$ with $g \cdot \omega = \omega'$ is unique, then the action is called regular.
Given a subgroup $H \leq G$ and $g \in G$, the right coset of $H$ in $G$, defined as $Hg = \{hg \mid h \in H\}$, is
a subset of $G$. For a fixed $H$, the set of these right cosets, $H\backslash G = \{Hg \mid g \in G\}$, forms a partition of $G$.
$G$ naturally acts on the right coset space by right multiplication, where $g' \in G$ sends one coset $Hg$ to another, $Hgg'$.
The significance of this action is that “any” transitive action is isomorphic to the action of $G$ on some right coset space.
To see why, note that in this action any $h \in H$ stabilizes the coset $He = H$, because $Hh = H$.
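As a concrete illustration of these definitions, the following sketch (an illustrative example, not from the paper; the choice of $S_3$ and the subgroup $H$ is arbitrary) enumerates the right cosets of a subgroup of $S_3$ and checks that every element of $H$ stabilizes the coset $He = H$ under right multiplication:

```python
from itertools import permutations

# All elements of S_3 as tuples (g[i] = image of i).
G = list(permutations(range(3)))

def compose(g, h):
    """(g*h)[i] = g[h[i]]: apply h first, then g."""
    return tuple(g[h[i]] for i in range(3))

# A subgroup H = {e, (0 1)} of S_3.
H = [(0, 1, 2), (1, 0, 2)]

def right_coset(H, g):
    """The right coset Hg as a frozenset."""
    return frozenset(compose(h, g) for h in H)

# The right-coset space H\G partitions G into |G|/|H| cosets.
cosets = {right_coset(H, g) for g in G}
assert len(cosets) == len(G) // len(H)

# G acts on H\G by right multiplication: g' sends Hg to Hgg'.
def act(gp, coset):
    return frozenset(compose(x, gp) for x in coset)

# Every h in H stabilizes the coset He = H itself, since Hh = H.
He = right_coset(H, (0, 1, 2))
assert all(act(h, He) == He for h in H)
```

The same brute-force enumeration works for any small finite group given as a set of permutations.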
ii.3 ParameterSharing and Group Convolution View
Consider the equivariance condition of Eq. 1. Since the equality holds for all $g \in G$, and using the fact that the inverse of a permutation matrix is its transpose, the equivariance constraint reduces to
$W = {P'_g}^\top W P_g \quad \forall g \in G.$ (2)
The equation above ties the parameters within the orbits of the $G$-action on rows and columns of $W$:
$W_{\omega', \omega} = W_{g \cdot \omega',\, g \cdot \omega} \quad \forall g \in G,\ \omega \in \Omega,\ \omega' \in \Omega',$ (3)
where $W_{\omega', \omega}$ is an element of the matrix representing the linear map. This type of group action on a Cartesian product space is sometimes called the diagonal action. In this case, the action is on the Cartesian product $\Omega' \times \Omega$ of rows and columns of $W$.
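The orbits of the diagonal action directly give the tied-weight pattern. The following sketch (an illustrative example, not from the paper; the cyclic group $\mathbb{Z}_4$ acting on indices by rotation is an arbitrary choice) builds a weight matrix with one free parameter per orbit of index pairs and checks the resulting symmetry:

```python
import numpy as np

# Z_4 acts on indices {0,...,3} by rotation: g.i = (i+g) % n.
# The diagonal action g.(i,j) = (g.i, g.j) partitions the entries of W
# into orbits; entries in the same orbit share a parameter (Eq. 3).
n = 4
orbit_id = {}
for i in range(n):
    for j in range(n):
        orbit = frozenset(((i + g) % n, (j + g) % n) for g in range(n))
        orbit_id.setdefault(orbit, len(orbit_id))

rng = np.random.default_rng(0)
params = rng.standard_normal(len(orbit_id))  # one free parameter per orbit

W = np.empty((n, n))
for i in range(n):
    for j in range(n):
        orbit = frozenset(((i + g) % n, (j + g) % n) for g in range(n))
        W[i, j] = params[orbit_id[orbit]]

# For Z_4 there are n orbits and the tied matrix is circulant.
assert len(orbit_id) == n
# Equivariance: simultaneously shifting rows and columns leaves W unchanged.
assert np.allclose(np.roll(np.roll(W, 1, axis=0), 1, axis=1), W)
```

For a general permutation group the same loop applies verbatim, with the rotation replaced by the group's action on indices.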
We saw that any homogeneous space is isomorphic to a coset space. Using $\Omega \cong H\backslash G$ and $\Omega' \cong H'\backslash G$, the parameter-sharing constraint of Eq. 2 becomes
$W_{H'k',\, Hk} = W_{H'k'g,\, Hkg} \quad \forall g, k, k' \in G$ (4)
$= W_{H'e,\, Hkk'^{-1}}.$ (5)
Since we can always multiply both sides by $g = k'^{-1}$ to have the coset $H'e = H'$ as the first argument, we can replace the matrix $W$ with the vector $w \in \mathbb{R}^{H\backslash G}$, such that $w_{Hkk'^{-1}} = W_{H'k',\, Hk}$. This rewriting also enables us to express the matrix-vector multiplication of the linear map in the form of a cross-correlation of the input and a kernel
$(Wx)_{H'k'} = \sum_{Hk \in H\backslash G} W_{H'k',\, Hk}\; x_{Hk}$ (6)
$= \sum_{Hk \in H\backslash G} w_{Hkk'^{-1}}\; x_{Hk}$ (7)
$= (w \star x)_{H'k'}.$ (8)
This relates the parameter-sharing view of equivariant maps (Eq. 4) to the convolution view (Eq. 8). Therefore, the universality results in the following extend to group convolution layers (cohen2016group; kondor2018generalization) for finite groups.
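The equivalence between the tied linear map and group cross-correlation can be checked numerically. The sketch below (an illustrative example, not from the paper) uses the cyclic group $\mathbb{Z}_5$, for which the tied matrix is circulant and the cross-correlation is the familiar circular one:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
w = rng.standard_normal(n)   # kernel: one free parameter per element of Z_5
x = rng.standard_normal(n)   # input signal

# Tied matrix from the kernel: W[i, j] = w[(j - i) % n] is circulant,
# i.e. it satisfies the parameter-sharing constraint for Z_5.
W = np.array([[w[(j - i) % n] for j in range(n)] for i in range(n)])

# Circular cross-correlation over the group Z_5.
corr = np.array([sum(w[(j - i) % n] * x[j] for j in range(n))
                 for i in range(n)])

# The tied linear map and the group cross-correlation agree ...
assert np.allclose(W @ x, corr)
# ... and the map is equivariant: shifting the input shifts the output.
assert np.allclose(W @ np.roll(x, 1), np.roll(W @ x, 1))
```

For non-Abelian permutation groups the same check goes through with the kernel indexed by cosets, as in Eq. 8.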
Equivariant Affine Maps. We may extend our definition and consider affine maps $x \mapsto Wx + b$, by allowing an “invariant” bias parameter $b \in \mathbb{R}^{\Omega'}$ satisfying
$P'_g\, b = b \quad \forall g \in G.$ (9)
This implies a parameter-sharing constraint $b_{\omega'} = b_{g \cdot \omega'}$. For homogeneous $\Omega'$, this constraint enforces a scalar bias. Beyond homogeneous spaces, the number of free parameters in $b$ grows with the number of orbits.
ii.4 Invariant and Equivariant MLPs
One may stack multiple layers of equivariant affine maps with multiple channels, followed by a nonlinearity, so as to build an equivariant MLP. One layer of this equivariant MLP, a.k.a. equivariant network, is given by:
$x^{(l+1), c'} = \sigma\Big( \sum_{c} W^{(l), c', c}\, x^{(l), c} + b^{(l), c'} \Big),$
where $c$ and $c'$ index the input and output channels respectively, and $x^{(l)}$ is the output of layer $l$, with $x^{(0)}$ denoting the original input. Here, we assume that $G$ faithfully acts on all the index sets $\Omega^{(0)}, \dots, \Omega^{(L)}$, with $\Omega^{(0)} = \Omega$ and $\Omega^{(L)} = \Omega'$. The parameter matrices $W^{(l), c', c}$ and the bias vectors $b^{(l), c'}$ are constrained by the parameter-sharing conditions of Eq. 2 and Eq. 9 respectively.
In an invariant MLP, the faithfulness condition for the action on the hidden and output layers is lifted. In practice, it is common to construct invariant networks by first constructing an equivariant network, followed by pooling over the output index set.
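The equivariant-layer-plus-pooling construction can be sketched as follows (an illustrative example, not from the paper): for $G = S_n$ acting on $n$ indices, the parameter-sharing solution of Eq. 2 is a weight matrix of the form $a I + b \mathbf{1}\mathbf{1}^\top$ with a scalar bias, and summing the equivariant output yields an invariant network. The constants below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n, a, b = 4, 0.7, -0.2

def equivariant_layer(x):
    # Tied S_n-equivariant affine map (a*I + b*ones) with scalar bias,
    # followed by a pointwise nonlinearity.
    return np.tanh(a * x + b * x.sum() + 0.1)

def invariant_net(x):
    # Equivariant layer followed by pooling over the index set.
    return equivariant_layer(x).sum()

x = rng.standard_normal(n)
perm = rng.permutation(n)
# Permuting the input permutes the hidden layer (equivariance) ...
assert np.allclose(equivariant_layer(x)[perm], equivariant_layer(x[perm]))
# ... and leaves the pooled output unchanged (invariance).
assert np.isclose(invariant_net(x), invariant_net(x[perm]))
```

This is the deep-sets-style special case; for other permutation groups the tied matrix is built from the orbits of the diagonal action as in Section II.3.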
Iii Universality Results for Regular Action
This section presents two simple new results on the universality of both invariant and equivariant networks with a single hidden layer. Formally, we say an equivariant MLP $\hat{f}$ is a universal equivariant approximator if, for any equivariant continuous function $f$, any compact set $K \subset \mathbb{R}^{\Omega}$, and any $\epsilon > 0$, there exists a choice of parameters and number of channels such that $\max_{x \in K} \| f(x) - \hat{f}(x) \| < \epsilon$.
Theorem III.1.
An invariant network
$\hat{f}(x) = \sum_{c=1}^{C} v_c\, \mathbf{1}^\top \sigma\big( W^{(c)} x \big)$ (10)
with a single hidden layer, on which $G$ acts regularly, is a universal invariant approximator. Here, $\sigma$ is a continuous nonlinearity, $v_c \in \mathbb{R}$, and $W^{(c)} \in \mathbb{R}^{G \times \Omega}$ satisfies the parameter-sharing constraint of Eq. 2.
Proof.
The first step follows the symmetrization argument of yarotsky2018universal, which in its general form is widely used in invariant theory sturmfels2008algorithms. Since the MLP is a universal approximator, for any compact set $K$ we can find $\mathrm{MLP}(x) = \sum_i v_i\, \sigma(w_i^\top x)$ such that $|f(x) - \mathrm{MLP}(x)| < \epsilon$ for any $x \in G \cdot K$. Here $G \cdot K = \{P_g x \mid g \in G, x \in K\}$ denotes the symmetrized $K$, which is again a compact subset of $\mathbb{R}^{\Omega}$ for finite $G$. Let the MLP approximate $f$ on this symmetrized compact set. It is then easy to show that for invariant $f$, the symmetrized MLP also approximates $f$:
$\Big| f(x) - \frac{1}{|G|}\sum_{g \in G} \mathrm{MLP}(P_g x) \Big| = \Big| \frac{1}{|G|}\sum_{g \in G} \big( f(P_g x) - \mathrm{MLP}(P_g x) \big) \Big|$ (11)
$\leq \frac{1}{|G|}\sum_{g \in G} \big| f(P_g x) - \mathrm{MLP}(P_g x) \big| < \epsilon \quad \forall x \in K,$ (12)
where the first equality uses the invariance $f(x) = f(P_g x)$.
Our next step is to show that the symmetrized MLP is equal to $\hat{f}$ of Eq. 10, for some parameters constrained so that $W^{(i)} P_g = R_g W^{(i)}$, where $P_g$ and $R_g$ are the permutation representations of the $G$-action on the input and the hidden layer respectively:
$\frac{1}{|G|}\sum_{g \in G} \mathrm{MLP}(P_g x) = \frac{1}{|G|}\sum_{g \in G} \sum_i v_i\, \sigma(w_i^\top P_g x)$ (13)
$= \sum_i \frac{v_i}{|G|} \sum_{g \in G} \sigma(w_i^\top P_g x)$ (14)
$= \sum_i \tilde{v}_i\, \mathbf{1}^\top \sigma(W^{(i)} x),$ (15)
where in the last step we put the summation terms $w_i^\top P_g$ into the rows of the matrix $W^{(i)} \in \mathbb{R}^{G \times \Omega}$, indexed by $g \in G$, and performed the summation over $g$ using multiplication by $\mathbf{1}^\top$. Here $\tilde{v}_i = v_i / |G|$ is the rescaled $v_i$. Now we show that the parameter matrix above satisfies the parameter-sharing constraint $W^{(i)} P_{g'} = R_{g'} W^{(i)}$:
$(R_{g'} W^{(i)})_{g} = (W^{(i)})_{g g'} = w_i^\top P_{g g'} = w_i^\top P_g P_{g'} = (W^{(i)} P_{g'})_{g},$
where the first equality follows from the fact that the row indexed by $g$ is moved to the row $g g'^{-1}$:
therefore, the current row $g$ was previously the row $g g'$. The second equality follows from the definition of the rows of $W^{(i)}$; since $g'$ is acting from the right, no further inversion is needed.
This shows that an invariant network with a single hidden layer on which $G$ acts regularly is equivalent to a symmetrized MLP, and therefore, for some number of channels, it is a universal approximator of invariant functions. Note that the number of channels corresponds to the number of hidden units in the symmetrized MLP. ∎
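The symmetrization step of the proof is easy to verify numerically. The sketch below (an illustrative example, not from the paper; the group $\mathbb{Z}_3$ acting by cyclic shift and the MLP sizes are arbitrary) averages an unconstrained MLP over the group and checks that the result is exactly invariant; it illustrates only the symmetrization, not the full equivalence to the tied-weight network:

```python
import numpy as np

rng = np.random.default_rng(1)
# An unconstrained one-hidden-layer MLP: R^3 -> R, 8 hidden units.
W1 = rng.standard_normal((8, 3))
b1 = rng.standard_normal(8)
w2 = rng.standard_normal(8)

def mlp(x):
    return w2 @ np.tanh(W1 @ x + b1)

def symmetrized(x):
    # Average the unconstrained MLP over Z_3 acting by cyclic shifts.
    return np.mean([mlp(np.roll(x, g)) for g in range(3)])

# The symmetrized MLP is exactly invariant to the group action.
x = rng.standard_normal(3)
for g in range(3):
    assert np.isclose(symmetrized(x), symmetrized(np.roll(x, g)))
```

In the proof, the $|G|$ shifted copies of each hidden unit become the rows of a tied weight matrix on a regular hidden layer, so the average is itself an invariant network of the form in Eq. 10.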
Next, we extend this to equivariant MLPs.
Theorem III.2.
An equivariant MLP
$\hat{f}(x) = \sum_{c=1}^{C} U^{(c)}\, \sigma\big( W^{(c)} x \big)$ (16)
with a single regular hidden layer is a universal equivariant approximator.
Proof.
In this setting, symmetrization, using the so-called Reynolds operator sturmfels2008algorithms, for the universal MLP is given by
$\bar{f}(x) = \frac{1}{|G|}\sum_{g \in G} Q_g^\top\, \mathrm{MLP}(P_g x) = \frac{1}{|G|}\sum_{g \in G} Q_g^\top \sum_i u_i\, \sigma(w_i^\top P_g x),$ (17)
where $w_i$ and $u_i$ are the weight vectors in the first and second layer associated with hidden unit $i$. Our objective is to show that this symmetrized MLP is equivalent to the equivariant network of Eq. 16, in which $W^{(i)}$ and $U^{(i)}$ use parameter-sharing to satisfy
$W^{(i)} P_g = R_g W^{(i)}, \qquad U^{(i)} R_g = Q_g U^{(i)} \quad \forall g \in G.$ (18)
Here, $P_g$, $Q_g$, and $R_g$ are the permutation representations of the $G$-action on the input, the output, and the hidden layer respectively.
First, rewrite the symmetrized MLP as
$\bar{f}(x) = \sum_i U^{(i)}\, \sigma\big( W^{(i)} x \big),$
where the row of $W^{(i)}$ indexed by $g$ is $w_i^\top P_g$, the column of $U^{(i)}$ indexed by $g$ is $Q_g^\top u_i$, and the factor $1/|G|$ is absorbed in one of the weights. It remains to show that the two matrices above satisfy the equivariance condition of Eq. 18.
The proof for $W^{(i)}$ is identical to the invariant network case.
For $U^{(i)}$, we use a similar approach:
$(U^{(i)} R_{g'})_{:, g} = (U^{(i)})_{:, g g'^{-1}} = Q_{g g'^{-1}}^\top u_i = Q_{g'} Q_g^\top u_i = (Q_{g'} U^{(i)})_{:, g}.$
In the first step, since $g'$ is acting on the right, it moves the column indexed by $g$ to $g g'$. This means that the column currently at $g$ is the one previously indexed by $g g'^{-1}$. The second step uses the fact that $Q$ is a homomorphism into permutation matrices, so that $Q_{g g'^{-1}}^\top = Q_{g'} Q_g^\top$.
This proves the equality of the symmetrized MLP of Eq. 17 to the equivariant MLP of Eq. 16. Moreover, an argument similar to the proof of the invariant case shows the universality of $\bar{f}$. Putting these together completes the proof of Theorem III.2. ∎
In the case where $G$ is an Abelian group, any faithful transitive action is regular, meaning that a homogeneous hidden layer in an equivariant neural network is necessarily regular. Combined with Theorem III.2, this leads to a universality result for Abelian groups.
Corollary 1.
For an Abelian group $G$, an equivariant (invariant) neural network with a single homogeneous hidden layer is a universal approximator of continuous equivariant (invariant) functions on compact subsets of $\mathbb{R}^{\Omega}$.
Iv Decomposition of Product Sets
A prerequisite to the analysis of product sets is their classification, which also leads to a classification of all equivariant maps based on their input/output sets.
iv.1 Classification of Sets and Maps
Recall that any transitive $G$-set is isomorphic to a right-coset space $H\backslash G$.
However, the right-coset spaces $H\backslash G$ and $(g^{-1} H g)\backslash G$ of conjugate subgroups
are themselves isomorphic. Therefore, transitive $G$-sets are classified, up to isomorphism, by the conjugacy classes of subgroups of $G$.
iv.2 Classification of sets
$G$ is transitive on each orbit of a $G$-set, and we can identify each orbit with its stabilizer subgroup, up to conjugacy. Therefore a list of these subgroups along with their multiplicities completely defines a $G$-set up to an isomorphism (rotman2012introduction):
$\Omega \cong \biguplus_{p=1}^{P} \alpha_p\, (H_p\backslash G),$ (19)
where $\alpha_p$ denotes the multiplicity of the right-coset space $H_p\backslash G$, and $\Omega$ has $\sum_p \alpha_p$ orbits.
To ensure a faithful action on $\Omega$, a necessary and sufficient condition is for the point stabilizers to have a trivial intersection. The point stabilizers within each orbit are conjugate to each other, and their intersection, which is the largest normal subgroup of $G$ contained in $H_p$, is called the core of the action on $H_p\backslash G$:
$\mathrm{Core}_G(H_p) = \bigcap_{g \in G} g^{-1} H_p\, g.$ (20)
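The faithfulness condition can be checked by brute force for small groups. The sketch below (an illustrative example, not from the paper) computes the intersection of the point stabilizers — the core — for the natural action of $S_3$ on three points:

```python
from itertools import permutations

# Natural action of S_3 on {0, 1, 2}.
G = list(permutations(range(3)))

def stabilizer(i):
    """Point stabilizer of i: all permutations fixing i."""
    return {g for g in G if g[i] == i}

# The core of the action: intersection of all point stabilizers.
core = set.intersection(*(stabilizer(i) for i in range(3)))

# The core is trivial (only the identity), so the action is faithful.
assert core == {(0, 1, 2)}
```

The same computation over each orbit of a non-transitive $G$-set checks the condition in Eq. 20.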
iv.3 Classification of Maps
Next, we extend the classification of $G$-sets to equivariant maps, a.k.a. $G$-maps, by jointly classifying the input and the output index sets $\Omega$ and $\Omega'$. We may consider an expression similar to Eq. 19 for the output index set, $\Omega' \cong \biguplus_q \beta_q\, (H'_q\backslash G)$. The linear map is then equivariant to $G/K$ and invariant to $K$ iff
$\bigcap_p \mathrm{Core}_G(H_p) = \{e\} \quad \text{and} \quad \bigcap_q \mathrm{Core}_G(H'_q) = K,$ (21)
where the second condition translates to invariance of the $K$-action on $\Omega'$. Note that the first condition simply ensures the faithfulness of the $G$-action on $\Omega$. This result means that the multiplicities $\{\alpha_p\}$ and $\{\beta_q\}$ completely identify a (linear) map that is equivariant to $G/K$ and invariant to $K$, up to an isomorphism.
iv.4 Cartesian Product of sets
Previously we classified all $G$-sets as disjoint unions of homogeneous spaces (Eq. 19), where $G$ acts transitively on each orbit. However, $G$ also naturally acts on the Cartesian product of homogeneous $G$-sets, $H_1\backslash G \times H_2\backslash G$,
where the action is defined diagonally: $g: (H_1 k_1, H_2 k_2) \mapsto (H_1 k_1 g,\, H_2 k_2 g)$.
High Order Spaces. A special case is when we consider the repeated self-product of the same homogeneous space $\Omega$:
$\Omega^t = \underbrace{\Omega \times \dots \times \Omega}_{t \text{ times}}.$
We call this an order-$t$ product space. Product spaces are used in building high-order layers in equivariant networks in several recent works kondor2018covariant; maron2018invariant; albooyeh2019incidence. maron2019universality show that for a sufficiently large order $t$, (22)
such MLPs with multiple hidden layers of order $t$ become universal invariant approximators. We show that better bounds for $t$, guaranteeing both universal invariance and equivariance, follow from the universality results of Theorems III.1 and III.2 and the decomposition of product spaces. This means that such high-order product spaces are universal simply because they contain a regular $G$-set.
iv.5 Burnside Ring and Decomposition of sets
Since any $G$-set can be written as a disjoint union of homogeneous spaces (Eq. 19), we expect a decomposition of the product space in the form
$H_1\backslash G \times H_2\backslash G \cong \biguplus_p c_p\, (H_p\backslash G).$ (23)
Indeed, this decomposition exists, and the multiplicities $c_p$ are called the structure coefficients of the Burnside Ring. The (commutative semi)ring structure is due to the fact that the set of non-isomorphic $G$-sets is equipped with: 1) a commutative product operation that is the Cartesian product of spaces, and; 2) a summation operation that is the disjoint union of spaces dieck2006transformation. A key to the analysis of product spaces is finding the structure coefficients in Eq. 23.
Example 1 (Product of Sets).
The symmetric group $S_n$ acts faithfully on $\Omega = [n] = \{1, \dots, n\}$, where the stabilizer of a point is isomorphic to $S_{n-1}$ – that is, the stabilizer of $i \in [n]$ is the set of all permutations of the remaining $n-1$ items. This means $\Omega \cong S_{n-1}\backslash S_n$.
The diagonal action on the product space $\Omega^t$ (for $t \leq n$) decomposes into $\mathrm{Bell}(t)$ orbits, where the Bell number $\mathrm{Bell}(t)$ is the number of different partitions of a set of $t$ labelled objects maron2018invariant. One may further refine these orbits by their type in the form of Eq. 23:
$\Omega^t \cong \biguplus_{k=1}^{t} {t \brace k}\, (S_{n-k}\backslash S_n),$ (24)
where the “structure coefficient” ${t \brace k}$ is the Stirling number of the second kind, and it counts the number of ways $[t]$ could be partitioned into $k$ nonempty sets. For example, when $t = 2$, one may think of the index set $\Omega^2$ as indexing an $n \times n$ matrix. This matrix decomposes into one diagonal ($S_{n-1}\backslash S_n$) and one set of off-diagonals ($S_{n-2}\backslash S_n$). This decomposition is presented in albooyeh2019incidence, where it is shown that these orbits correspond to “hyper-diagonals” for higher order tensors. For general groups, inferring the structure coefficients is more challenging, as we see shortly.
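The orbit counts in this example can be confirmed by enumeration. The sketch below (an illustrative example, not from the paper) counts the orbits of the diagonal $S_n$ action on the $t$-fold product $[n]^t$ and recovers the first few Bell numbers:

```python
from itertools import permutations, product

def orbit_count(n, t):
    """Number of orbits of the diagonal S_n action on the product [n]^t."""
    seen, count = set(), 0
    for tup in product(range(n), repeat=t):
        if tup in seen:
            continue
        count += 1
        # Sweep out the whole orbit of this tuple.
        for g in permutations(range(n)):
            seen.add(tuple(g[i] for i in tup))
    return count

# For t <= n, the orbits are indexed by the equality patterns of the
# tuple, i.e. set partitions of {1,...,t}: the Bell numbers 1, 2, 5.
assert [orbit_count(4, t) for t in (1, 2, 3)] == [1, 2, 5]
```

Refining each orbit by the number $k$ of distinct values in its tuples recovers the Stirling-number decomposition of Eq. 24.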
From Eq. 24 in the example above it follows that an order-$(n-1)$ product of $S_n$-sets contains a regular orbit. The following is a corollary that combines this with the universality results of Theorems III.1 and III.2.
Corollary 2.
[Universality for Product of Sets] An equivariant network with a hidden layer of order $n-1$ is a universal approximator of equivariant (invariant) functions, where the input and output layers may be of any order.
A universality result for the invariant case only, using a quadratic order, appears in maron2019universality, where the MLP is called a hypergraph network. keriven2019universal prove universality for the equivariant case, without giving a bound on the order of the hidden layer, and assuming an output of order one. In comparison, Corollary 2 uses a linear bound and applies to the much more general setting of arbitrary orders for the input and output product sets. In fact, the universality result is true for arbitrary input-output sets.
Linear Map as a Product Space. For finite groups, the linear map $W \in \mathbb{R}^{\Omega' \times \Omega}$ is indexed by the product space $\Omega' \times \Omega$. In fact, the parameter-sharing of Eq. 3 ties all the parameters that are in the same orbit of this product space. Therefore, the decomposition of Eq. 23 also identifies the parameter-sharing pattern of $W$.
Example 2 (Equivariant Maps between Set Products).
Eq. 24 gives a closed form for the decomposition of $\Omega^t$ into orbits. Assuming a similar decomposition for the output index set $\Omega'^{t'}$, the equivariant map $W$ is decomposed into linear maps corresponding to the orbits of $\Omega'^{t'} \times \Omega^t$. albooyeh2019incidence show that each orbit “type” is a form of pooling-broadcasting from/to hyper-diagonals of the corresponding tensors.
Burnside’s Table of Marks
Burnside’s table of marks simplifies working with the multiplication operation of the Burnside ring, and enables the analysis of the $G$-action on product spaces burnside1911theory; pfeiffer1997subgroups. The mark of a subgroup $K \leq G$ on a finite $G$-set $\Omega$ is defined as the number of points in $\Omega$ fixed by all $k \in K$:
$m_{\Omega}(K) = \big| \{ \omega \in \Omega \mid k \cdot \omega = \omega \ \ \forall k \in K \} \big|.$ (25)
The interesting quality of the number of fixed points is that it adds up when we take the disjoint union of two spaces, $\Omega_1 \uplus \Omega_2$. Also, when considering the product space $\Omega_1 \times \Omega_2$, any combination of points fixed in both spaces will be fixed by $K$. This means
$m_{\Omega_1 \uplus \Omega_2}(K) = m_{\Omega_1}(K) + m_{\Omega_2}(K),$ (26)
$m_{\Omega_1 \times \Omega_2}(K) = m_{\Omega_1}(K)\, m_{\Omega_2}(K).$ (27)
Now define the vector of marks as
$m_{\Omega} = \big[ m_{\Omega}(K_1), \dots, m_{\Omega}(K_D) \big],$
where $D$ is the number of conjugacy classes of subgroups of $G$, and we have assumed a fixed order on the class representatives $K_1, \dots, K_D$. Due to Eqs. 26 and 27, given $G$-sets $\Omega_1, \Omega_2$, we can perform elementwise addition and multiplication on the vectors of integers $m_{\Omega_1}, m_{\Omega_2}$ to obtain the marks of the union and product sets respectively. Moreover, the special quality of marks makes this vector an injective homomorphism: we can work backward from the resulting vector of marks and decompose the union/product space into homogeneous spaces. To facilitate the calculation of this vector for any $G$-set, one may use the table of marks.
Table 1: The table of marks $M$, with rows and columns indexed by a fixed ordering of the conjugacy classes of subgroups; under a topological ordering of the subgroups, $M$ is lower-triangular, and its last row, corresponding to $G$ itself, is all ones.
The table of marks for a group $G$ is the square matrix of marks of all subgroups (up to conjugacy) on all right-coset spaces:
$M_{H,K} = m_{H\backslash G}(K).$ (28)
The matrix $M$ has valuable information about the subgroup structure of $G$. For example,
$K$’s action on $H\backslash G$ will have a fixed point iff $K$ is conjugate to a subgroup of $H$.
Therefore, the sparsity pattern in the table of marks reflects the subgroup lattice structure of $G$, up to conjugacy.
A useful property of $M$ is that we can use it to find the marks of any $G$-set in the form of Eq. 19 using the expression $m_{\Omega} = \sum_p \alpha_p\, m_{H_p\backslash G}$. Moreover, the structure coefficients of Eq. 23 can be recovered from the table of marks:
$c = M^{-\top} \big( m_{\Omega_1} \odot m_{\Omega_2} \big),$ (29)
where $\odot$ denotes the elementwise product and $c$ is the vector of multiplicities $c_p$.
V Universality of Maps on Product Spaces
Using the tools discussed in the previous section, in this section we prove some properties of product spaces that are consequential in the design of equivariant maps. Previously we saw that product spaces decompose into orbits, identified by the multiplicities $c_p$ in Eq. 23. The following theorem states that such product spaces always have orbits that are at least as large as the largest of the input orbits, and at least one of these product orbits is strictly larger than both inputs. For simplicity, this theorem is stated in terms of the stabilizers rather than the orbits, where, by the orbit-stabilizer theorem, larger stabilizers correspond to smaller orbits. Also, while the following theorem is stated for the product of homogeneous $G$-sets, it trivially extends to products of sets with multiple orbits.
Theorem V.1.
Let $\Omega_1 \cong H_1\backslash G$ and $\Omega_2 \cong H_2\backslash G$ be transitive $G$-sets, with $|H_1| \leq |H_2|$. Their product set decomposes into orbits $H_p\backslash G$, such that:

(i) $H_p \lesssim H_1$ and $H_p \lesssim H_2$, up to conjugacy, for all the resulting orbits.

(ii) if $H_1 \neq \{e\}$ and the core of the action (Eq. 20) is trivial, then $|H_p| < |H_1| \leq |H_2|$ for at least one of the resulting orbits.
Proof.
The proof is by analysis of the table of marks $M$. The vector of marks for the product space is the elementwise product of the vectors of marks of the inputs:
$m_{\Omega_1 \times \Omega_2} = m_{H_1\backslash G} \odot m_{H_2\backslash G}.$
The same vector can be written as a linear combination of rows of $M$, with nonnegative integer coefficients:
$m_{\Omega_1 \times \Omega_2} = \sum_p c_p\, m_{H_p\backslash G}.$
For convenience we assume a topological ordering of the conjugacy classes of subgroups $K_1, \dots, K_D$ consistent with their partial order – that is, $K_i \lesssim K_j$ implies $i \leq j$. This means that $M$ is lower-triangular, with nonzero diagonal; see Table 1. Three important properties of this table are pfeiffer1997subgroups:

(1) the sparsity pattern in $M$ reflects the subgroup relation: $M_{H,K} \neq 0$ iff $K \lesssim H$;

(2) the first column is the index of $H$ in $G$: $M_{H,\{e\}} = [G : H]$;

(3) the diagonal element is the index of the normalizer: $M_{H,H} = [N_G(H) : H]$, where the normalizer of $H$ in $G$ is defined as the largest intermediate subgroup $H \trianglelefteq N_G(H) \leq G$ in which $H$ is normal.
(i) From (1) it follows that the nonzeros of the product $m_{H_1\backslash G} \odot m_{H_2\backslash G}$ correspond to subgroups $K$ with $K \lesssim H_1$ and $K \lesssim H_2$. Since the only rows of $M$ with such nonzero elements are the rows of subgroups $H_p \lesssim H_1, H_2$, all the resulting orbits have such stabilizers. This finishes the proof of the first claim.
(ii) If $H_1 \not\lesssim H_2$ and $H_2 \not\lesssim H_1$, then by (i) no resulting stabilizer $H_p$ can be conjugate to $H_1$ or $H_2$; each $H_p$ is therefore conjugate to a proper subgroup of both, and so is strictly smaller than both, which means the resulting orbits must be larger than both input orbits.
Next, w.l.o.g., assume $H_1 \lesssim H_2$. Consider a proof by contradiction: suppose the product does not have a strictly larger orbit; then from (i) it follows that every resulting stabilizer is conjugate to $H_1$, so $m_{\Omega_1 \times \Omega_2} = c\, m_{H_1\backslash G}$ for some $c \in \mathbb{N}$. Consider the first and the $H_1$ elements of the elementwise product above:
$M_{H_1,\{e\}}\, M_{H_2,\{e\}} = c\, M_{H_1,\{e\}}, \qquad M_{H_1,H_1}\, M_{H_2,H_1} = c\, M_{H_1,H_1}.$
Substituting $c = M_{H_2,\{e\}} = [G : H_2]$ from the first equation into the second equation and simplifying, we get $M_{H_2,H_1} = [G : H_2] = |\Omega_2|$. This means the action of $H_1$ on $\Omega_2$ fixes all points, and therefore $H_1$ is contained in the core of the action, as defined in Eq. 20. This contradicts the assumption of (ii). ∎
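The theorem can be checked directly on a small example (an illustrative example, not from the paper): for the natural action of $S_3$ on three points, both factors of the product have point stabilizers of size 2, and enumeration shows the product contains one orbit of the same size and one strictly larger, regular orbit:

```python
from itertools import permutations, product

# Natural (transitive) action of S_3 on {0, 1, 2}: orbit size 3,
# point stabilizers of size 2.
G = list(permutations(range(3)))

def pair_orbits():
    """Orbits of the diagonal S_3 action on the product {0,1,2}^2."""
    seen, orbits = set(), []
    for p in product(range(3), repeat=2):
        if p in seen:
            continue
        orb = {(g[p[0]], g[p[1]]) for g in G}
        orbits.append(orb)
        seen |= orb
    return orbits

sizes = sorted(len(o) for o in pair_orbits())
# (i) every product orbit is at least as large as the inputs (size 3);
# (ii) one orbit is strictly larger -- here regular, of size |G| = 6.
assert sizes == [3, 6]
```

The size-3 orbit is the diagonal (stabilizer conjugate to $H_1$), and the size-6 off-diagonal orbit is regular, which is exactly the orbit exploited by the universality results of Section III.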