
# Universal Equivariant Multilayer Perceptrons

## Abstract

Group invariant and equivariant Multilayer Perceptrons (MLPs), also known as Equivariant Networks, have achieved remarkable success in learning on a variety of data structures, such as sequences, images, sets, and graphs. Using tools from group theory, this paper proves the universality of a broad class of equivariant MLPs with a single hidden layer. In particular, it is shown that having a hidden layer on which the group acts regularly is sufficient for universal equivariance. Next, Burnside's table of marks is used to decompose product spaces. It is shown that the product of two $G$-sets always contains an orbit larger than the input orbits. Therefore, high-order hidden layers inevitably contain a regular orbit, leading to universality of the corresponding MLP. It is shown that with an order larger than the logarithm of the size of the stabilizer group, a high-order equivariant MLP is universal equivariant.


## I Introduction

Invariance and equivariance properties constrain the output of a function under various transformations of its input. This constraint serves as a strong learning bias that has proven useful for sample-efficient learning on a wide range of structured data. In this work, we are interested in universality results for Multilayer Perceptrons (MLPs) that are constrained to be equivariant or invariant. This type of result guarantees that the model can approximate any continuous equivariant (invariant) function with arbitrary precision, in the same way an unconstrained MLP can approximate an arbitrary continuous function hornik1989multilayer; cybenko1989approximation; funahashi1989approximate.

The study of invariance in neural networks goes back to the Perceptrons book minsky2017perceptrons, where the necessity of parameter-sharing for invariance was used to prove the limitation of a single-layer Perceptron. Follow-up works showed how parameter symmetries can be used to achieve invariance to finite and infinite groups shawe1989building; wood1996representation; shawe1993symmetries; wood1996invariant. These fundamental early works went unnoticed during the resurgence of neural network research and renewed attention to symmetry hinton2011transforming; mallat2012group; bruna2013invariant; gens2014deep; jaderberg2015spatial; dieleman2016exploiting; cohen2016group.

When equivariance constraints are imposed on feed-forward layers in an MLP, the linear maps in each layer are constrained to use tied parameters wood1996representation; ravanbakhsh_symmetry. This model, which we call an equivariant MLP, appears in deep learning with sets (zaheer_deepsets; qi2017pointnet), exchangeable tensors hartford2018deep, graphs maron2018invariant, and relational data (graham2019deep). Universality results for some of these models exist (zaheer_deepsets; segol2019universal; keriven2019universal). Broader results for high-order invariant MLPs appear in (maron2019universality).

A parallel line of work in equivariant deep learning studies linear actions of a group beyond permutations. The resulting equivariant linear layers can be written using convolution operations cohen2016steerable; kondor2018generalization. When limited to permutation groups, group convolution is simply another expression of parameter-sharing ravanbakhsh_symmetry; kondor2018generalization; see also Section II.3. However, in working with linear representations, one may move beyond finite groups cohen2019general; see also wood1996representation. Some applications include equivariance to isometries of the Euclidean space (weiler2019general; worrall2017harmonic) and the sphere (cohen2018spherical). An extension of this view to manifolds is proposed in cohen2019gauge. Finally, a third line of work in equivariant deep learning that involves a specialized architecture and learning procedure is that of Capsule networks (sabour2017dynamic; hinton2018matrix); see lenssen2018group for a group-theoretic generalization.

### I.1 Summary of Results

This paper proves the universality of equivariant MLPs for finite groups in two settings. First, Section III shows that any equivariant MLP with a single regular hidden layer is universal equivariant (invariant). Next, Section IV shows that a general universality result (one that subsumes existing universality results for high-order networks) can be derived and attributed to the existence of regular orbits in product spaces. The main tool in our analysis of the decomposition of product spaces is Burnside's table of marks. Using the table of marks, Section V proves that the product of two $G$-sets always creates at least one orbit larger than the orbits of the input $G$-sets. Therefore, repeated products in high-order hidden layers inevitably lead to the creation of a regular orbit, which we show is sufficient for universality. A lower bound on the order of a high-order hidden layer that is sufficient for universality is $\log_2(|H|)$, where $H$ is the stabilizer group. Since the largest possible stabilizer on a set of size $N$ has order $(N-1)!$, and $\log_2((N-1)!) < N \log_2(N)$, this leads to a bound smaller than $N \log_2(N)$ for universal equivariance to an arbitrary permutation group. This bound is an improvement over the previous bound that was shown to guarantee universal invariance maron2019universality.

## II Preliminaries

Let $G$ be a finite group with its action defined on the finite set $N$. Formally, this action is a homomorphism $G \to \mathrm{Sym}(N)$ into the group of all permutations of $N$. The image of this map is a permutation group $\mathcal{G} \leq \mathrm{Sym}(N)$. We use the notation $g \cdot n$ to denote this action for $g \in G$ and $n \in N$. Let $M$ be another $G$-set, where the corresponding permutation action is denoted $g \cdot m$. The $G$-action on $N$ naturally extends to $\mathbb{R}^N$ by permuting coordinates. We also write this action as $\mathbf{A}_g \mathbf{x}$, where $\mathbf{A}_g$ is the permutation matrix form of $g$.

### II.1 Invariant and Equivariant Linear Maps

Let the real matrix $\mathbf{W} \in \mathbb{R}^{M \times N}$ denote a linear map $\mathbb{R}^N \to \mathbb{R}^M$. We say this map is $G$-equivariant iff

$$\mathbf{B}_g \mathbf{W} \mathbf{x} = \mathbf{W} \mathbf{A}_g \mathbf{x} \qquad \forall \mathbf{x} \in \mathbb{R}^N,\; g \in G, \tag{1}$$

where, similar to $\mathbf{A}_g$, the permutation matrix $\mathbf{B}_g$ is defined based on the action of $G$ on the output index set $M$. In this definition, we assume that the group action on the input is faithful – that is, the homomorphism $G \to \mathrm{Sym}(N)$ is injective, so $\mathcal{G} \cong G$. If the action on the output index set $M$ is not faithful, then the kernel of this action is a non-trivial normal subgroup $K \trianglelefteq G$. In this case $G/K$ is a quotient group, and it is more accurate to say that $\mathbf{W}$ is invariant to $K$ and equivariant to $G/K$. Using this convention, $G$-equivariance and $G$-invariance correspond to the extreme cases of $K = \{e\}$ and $K = G$. Moreover, the composition of such invariant-equivariant functions preserves this property, motivating the design of deep networks by stacking equivariant layers.

### II.2 Orbits and Homogeneous Spaces

$G$ partitions $N$ into orbits $G \cdot n = \{g \cdot n \mid g \in G\}$, where $G$ is transitive on each orbit, meaning that for each pair $n, n'$ in the same orbit, there is at least one $g \in G$ such that $g \cdot n = n'$. If $N$ has a single orbit, the action is transitive, and $N$ is called a homogeneous space for $G$. If moreover the choice of $g$ with $g \cdot n = n'$ is unique, then the action is called regular.

Given a subgroup $H \leq G$ and $g \in G$, the right coset of $H$ in $G$, defined as $Hg = \{hg \mid h \in H\}$, is a subset of $G$. For a fixed $H$, the set of these right cosets, $H\backslash G = \{Hg \mid g \in G\}$, forms a partition of $G$. $G$ naturally acts on the right coset space, where $g' \in G$ sends one coset to another: $Hg \mapsto Hgg'$. The significance of this action is that "any" transitive $G$-action is isomorphic to the $G$-action on some right coset space. To see why, note that in this action any $h \in H$ stabilizes the coset $He = H$, because $Hh = H$. Therefore, in any transitive action the stabilizer identifies the coset space.
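The coset-space picture above can be checked concretely. Below is a small sketch (an assumed example, not from the paper: $G = S_3$ and $H = \langle(0\;1)\rangle$) enumerating the right cosets $Hg$, and verifying that the action $Hg \mapsto Hgg'$ is transitive with stabilizer of $He$ equal to $H$ itself.

```python
from itertools import permutations

def compose(p, q):                       # (p*q)(i) = p(q(i))
    return tuple(p[q[i]] for i in range(len(q)))

G = list(permutations(range(3)))         # the group S_3 as tuples
H = [(0, 1, 2), (1, 0, 2)]               # the subgroup {e, (0 1)}

# Right cosets Hg = {hg | h in H} partition G.
cosets = {frozenset(compose(h, g) for h in H) for g in G}
assert len(cosets) == len(G) // len(H)   # |H\G| = |G : H| = 3

# g' acts by Hg -> Hgg'; the action is transitive, and Stab(He) = H.
He = frozenset(H)
act = lambda C, gp: frozenset(compose(c, gp) for c in C)
assert {act(He, g) for g in G} == cosets             # transitive
assert {g for g in G if act(He, g) == He} == set(H)  # stabilizer is H
```

The last two assertions are exactly the content of the paragraph above: the orbit of the coset $He$ is the whole coset space, and its stabilizer recovers $H$.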

### II.3 Parameter-Sharing and Group Convolution View

Consider the equivariance condition of Eq. 1. Since the equality holds for all $\mathbf{x}$, and using the fact that the inverse of a permutation matrix is its transpose, the equivariance constraint reduces to

$$\mathbf{W} = \mathbf{B}_g^\top \mathbf{W} \mathbf{A}_g \qquad \forall g \in G. \tag{2}$$

The equation above ties the parameters within the orbits of the $G$-action on rows and columns of $\mathbf{W}$:

$$\mathbf{W}(m, n) = \mathbf{W}(g \cdot m, g \cdot n) \qquad \forall g \in G,\; (m, n) \in M \times N, \tag{3}$$

where $\mathbf{W}(m, n)$ is an element of the matrix form of the linear map. This type of group action on a Cartesian product space is sometimes called the diagonal action. In this case, the action is on the Cartesian product of rows and columns of $\mathbf{W}$.
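As a concrete illustration of Eq. 3, the sketch below (assumed example groups, not from the paper) counts the free parameters of $\mathbf{W}$ as orbits of the diagonal action on (row, column) pairs.

```python
# Each orbit of the diagonal G-action on (row, col) pairs is one tied parameter.
from itertools import permutations, product

def pair_orbits(group, n):
    """Orbits of g . (m, k) = (g(m), g(k)) on [n] x [n]."""
    seen, orbits = set(), []
    for m, k in product(range(n), repeat=2):
        if (m, k) in seen:
            continue
        orbit = {(g[m], g[k]) for g in group}
        seen |= orbit
        orbits.append(orbit)
    return orbits

S3 = list(permutations(range(3)))        # full symmetric group on 3 points
C4 = [tuple((i + s) % 4 for i in range(4)) for s in range(4)]  # cyclic shifts

print(len(pair_orbits(S3, 3)))  # 2: diagonal vs. off-diagonal entries of W
print(len(pair_orbits(C4, 4)))  # 4: W is circulant, one parameter per diagonal
```

For the full symmetric group only two parameters survive (the diagonal and the off-diagonal of $\mathbf{W}$), while for the cyclic group the sharing pattern is circulant; fewer symmetries leave more free parameters.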

We saw that any homogeneous $G$-space is isomorphic to a coset space. Using $M \cong K\backslash G$ and $N \cong H\backslash G$, the parameter-sharing constraint of Eq. 2 becomes

$$\mathbf{W}(Kg, Hg') = \mathbf{W}(g^{-1} \cdot Kg,\; g^{-1} \cdot Hg') \tag{4}$$
$$\phantom{\mathbf{W}(Kg, Hg')} = \mathbf{W}(K,\; Hg'g^{-1}). \tag{5}$$

Since we can always multiply both sides so that the coset $K$ appears as the first argument, we can replace the matrix $\mathbf{W}$ with a vector $w$, such that $w(Hg'g^{-1}) = \mathbf{W}(Kg, Hg')$. This rewriting also enables us to express the matrix-vector multiplication of the linear map in the form of a cross-correlation of the input and a kernel $w$:

$$[\mathbf{W}\mathbf{x}](n) = [\mathbf{W}\mathbf{x}](Kg) \tag{6}$$
$$= \sum_{Hg' \in H\backslash G} \mathbf{W}(Kg, Hg')\, \mathbf{x}(Hg') \tag{7}$$
$$= \sum_{Hg' \in H\backslash G} w(Hg'g^{-1})\, \mathbf{x}(Hg'). \tag{8}$$

This relates the parameter-sharing view of equivariant maps (Eq. 4) to the convolution view (Eq. 8). Therefore, the universality results in the following sections extend to group convolution layers (cohen2016group; kondor2018generalization) for finite groups.

Equivariant Affine Maps  We may extend our definition and consider affine $G$-maps $\mathbf{x} \mapsto \mathbf{W}\mathbf{x} + \mathbf{b}$, by allowing an "invariant" bias parameter $\mathbf{b} \in \mathbb{R}^M$ satisfying

$$\mathbf{B}_g \mathbf{b} = \mathbf{b}. \tag{9}$$

This implies the parameter-sharing constraint $\mathbf{b}(m) = \mathbf{b}(g \cdot m)$. For homogeneous $M$, this constraint enforces a scalar bias. Beyond homogeneous spaces, the number of free parameters in $\mathbf{b}$ grows with the number of orbits.

### II.4 Invariant and Equivariant MLPs

One may stack multiple layers of equivariant affine maps with multiple channels, followed by a non-linearity, so as to build an equivariant MLP. One layer of this equivariant MLP, a.k.a. equivariant network, is given by

$$\mathbf{x}^{(\ell)}_c = \sigma\Bigg(\sum_{c'=1}^{C^{(\ell-1)}} \mathbf{W}^{(\ell)}_{c,c'}\, \mathbf{x}^{(\ell-1)}_{c'} + \mathbf{b}^{(\ell)}_c\Bigg),$$

where $c'$ and $c$ index the input and output channels respectively, and $\mathbf{x}^{(\ell)}_c$ is the output of layer $\ell$, with $\mathbf{x}^{(0)}$ denoting the original input. Here, we assume that $G$ faithfully acts on every layer's index set $N^{(\ell)}$, with $N^{(0)} = N$ and the final layer indexed by $M$. The parameter matrices $\mathbf{W}^{(\ell)}_{c,c'}$ and the bias vectors $\mathbf{b}^{(\ell)}_c$ are constrained by the parameter-sharing conditions of Eq. 2 and Eq. 9 respectively.

In an invariant MLP, the faithfulness condition for the $G$-action on the hidden and output layers is lifted. In practice, it is common to construct invariant networks by first constructing an equivariant network and then pooling over the final index set.
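The construction above can be checked numerically. Here is a hedged sketch (the group, layer sizes, and random weights are illustrative stand-ins for learned parameters): two parameter-shared layers for $S_3$ acting on three points, followed by sum-pooling, which together give an invariant network.

```python
# Invariant network sketch: equivariant parameter-shared layers + pooling.
import numpy as np
from itertools import permutations, product

group = list(permutations(range(3)))     # S_3
rng = np.random.default_rng(0)

def shared_weight(n):
    """Tie entries of W within orbits of the diagonal action on (row, col)."""
    W, seen = np.zeros((n, n)), set()
    for m, k in product(range(n), repeat=2):
        if (m, k) not in seen:
            orbit = {(g[m], g[k]) for g in group}
            seen |= orbit
            W[tuple(zip(*orbit))] = rng.standard_normal()  # one tied value
    return W

W1, W2 = shared_weight(3), shared_weight(3)          # two equivariant layers
f = lambda x: np.sum(np.tanh(W2 @ np.tanh(W1 @ x)))  # pool over N at the end

x = rng.standard_normal(3)
outs = {round(float(f(x[list(g)])), 9) for g in group}
assert len(outs) == 1   # invariant: same output for every permutation of x
```

The assertion passes because each tied $\mathbf{W}$ commutes with every permutation matrix of the group (Eq. 2), the elementwise non-linearity commutes with permutations, and the final sum is permutation-invariant.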

## III Universality Results for Regular Action

This section presents two simple new results on the universality of both invariant and equivariant networks with a single hidden layer. Formally, a $G$-equivariant MLP $\hat\psi$ is a universal $G$-equivariant approximator if for any $G$-equivariant continuous function $\psi$, any compact set $\mathcal{X} \subset \mathbb{R}^N$, and any $\epsilon > 0$, there exists a choice of parameters and number of channels $C$ such that $\|\psi(\mathbf{x}) - \hat\psi(\mathbf{x})\| \leq \epsilon$ for all $\mathbf{x} \in \mathcal{X}$.


###### Theorem III.1.

A $G$-invariant network

$$\hat\psi(\mathbf{x}) = \sum_{c=1}^{C} w'_c\, \mathbf{1}^\top \sigma(\mathbf{W}_c \mathbf{x} + b_c) \tag{10}$$

with a single hidden layer, on which $G$ acts regularly, is a universal $G$-invariant approximator. Here, $\mathbf{W}_c \in \mathbb{R}^{|G| \times N}$, and $w'_c, b_c \in \mathbb{R}$.

###### Proof.

The first step follows the symmetrization argument of yarotsky2018universal, which in its general form is widely used in invariant theory sturmfels2008algorithms. Since an MLP is a universal approximator, for any compact set $\mathcal{X}$ we can find $\psi_{MLP}$ such that $|\psi(\mathbf{x}) - \psi_{MLP}(\mathbf{x})| \leq \epsilon$ for any $\mathbf{x} \in \mathcal{X}$. Let $\mathcal{X}_G = \{\mathbf{A}_g \mathbf{x} \mid g \in G, \mathbf{x} \in \mathcal{X}\}$ denote the symmetrized $\mathcal{X}$, which is again a compact subset of $\mathbb{R}^N$ for finite $G$. Let $\psi_{MLP+}$ approximate $\psi$ on the symmetrized compact set $\mathcal{X}_G$. It is then easy to show that for $G$-invariant $\psi$, the symmetrized MLP $\psi_{sym}(\mathbf{x}) = \frac{1}{|G|}\sum_{g \in G} \psi_{MLP+}(\mathbf{A}_g \mathbf{x})$ also approximates $\psi$:

$$|\psi(\mathbf{x}) - \psi_{sym}(\mathbf{x})| = \Big|\psi(\mathbf{x}) - \frac{1}{|G|}\sum_{g \in G} \psi_{MLP+}(\mathbf{A}_g \mathbf{x})\Big| \tag{11}$$
$$\leq \frac{1}{|G|}\sum_{g \in G} \big|\psi(\mathbf{A}_g \mathbf{x}) - \psi_{MLP+}(\mathbf{A}_g \mathbf{x})\big| \leq \epsilon, \tag{12}$$

where the inequality uses the invariance $\psi(\mathbf{x}) = \psi(\mathbf{A}_g \mathbf{x})$.

Our next step is to show that $\psi_{sym}$ is equal to $\hat\psi$ of Eq. 10, for some parameters constrained so that $\mathbf{H}_g \mathbf{W}_c = \mathbf{W}_c \mathbf{A}_g$, where $\mathbf{A}_g$ and $\mathbf{H}_g$ are the permutation representations of the $G$-action on the input and the hidden layer respectively:

$$\psi_{sym}(\mathbf{x}) = \frac{1}{|G|}\sum_{g \in G} \psi_{MLP+}(\mathbf{A}_g \mathbf{x}) \tag{13}$$
$$= \frac{1}{|G|}\sum_{g \in G}\sum_{c=1}^{C} w'_c\, \sigma(\mathbf{w}_c^\top \mathbf{A}_g \mathbf{x} + b_c) \tag{14}$$
$$= \sum_{c=1}^{C} \tilde w_c\, \mathbf{1}^\top \sigma\left(\begin{bmatrix} \mathbf{w}_c^\top \mathbf{A}_{g_1} \\ \vdots \\ \mathbf{w}_c^\top \mathbf{A}_{g_{|G|}} \end{bmatrix} \mathbf{x} + b_c \mathbf{1}\right), \tag{15}$$

where in the last step we put the summation terms into the rows of the matrix $\mathbf{W}_c$ and performed the summation over $g$ using multiplication by $\mathbf{1}^\top$; $\tilde w_c = w'_c / |G|$ is the rescaled $w'_c$. Now we show that the parameter matrix above satisfies the parameter-sharing constraint $\mathbf{H}_g \mathbf{W}_c = \mathbf{W}_c \mathbf{A}_g$:

$$\mathbf{H}_g \mathbf{W}_c = \begin{bmatrix} \mathbf{w}_c^\top \mathbf{A}_{g_1 g} \\ \vdots \\ \mathbf{w}_c^\top \mathbf{A}_{g_{|G|} g} \end{bmatrix} = \mathbf{W}_c \mathbf{A}_g,$$

where the first equality follows from the fact that the row indexed by $g_r$ is moved to the row $g \cdot g_r = g_r g^{-1}$:

$$\mathbf{H}_g : \mathbf{A}_{g_r} \mapsto \mathbf{A}_{g \cdot g_r} = \mathbf{A}_{g_r g^{-1}}.$$

Therefore, the row currently at $g_r$ was previously $\mathbf{w}_c^\top \mathbf{A}_{g_r g}$. The second equality follows because $\mathbf{A}_g$ is acting from the right, so no further inversion is needed.

This shows that a $G$-invariant network with a single hidden layer on which $G$ acts regularly is equivalent to a symmetrized MLP, and therefore, for some number of channels, it is a universal approximator of $G$-invariant functions. Note that the number of channels $C$ corresponds to the number of hidden units in the symmetrized MLP. ∎

Next, we extend this to equivariant MLPs.

###### Theorem III.2.

A $G$-equivariant MLP

$$\hat\psi(\mathbf{x}) = \sum_{c=1}^{C} \mathbf{W}'_c\, \sigma(\mathbf{W}_c \mathbf{x} + b_c) \tag{16}$$

with a single regular hidden layer is a universal $G$-equivariant approximator.

###### Proof.

In this setting, the symmetrization of the universal MLP, using the so-called Reynolds operator sturmfels2008algorithms, is given by

$$\psi_{sym}(\mathbf{x}) = \frac{1}{|G|}\sum_{g \in G} \mathbf{B}_{g^{-1}} \sum_{c=1}^{C} \mathbf{w}'_c\, \sigma(\mathbf{w}_c^\top \mathbf{A}_g \mathbf{x} + b_c), \tag{17}$$

where $\mathbf{w}_c$ and $\mathbf{w}'_c$ are the weight vectors in the first and second layer associated with hidden unit $c$. Our objective is to show that this symmetrized MLP is equivalent to the equivariant network of Eq. 16, in which $\mathbf{W}_c$ and $\mathbf{W}'_c$ use parameter-sharing to satisfy

$$\mathbf{H}_g \mathbf{W}_c = \mathbf{W}_c \mathbf{A}_g \quad \text{and} \quad \mathbf{B}_g \mathbf{W}'_c = \mathbf{W}'_c \mathbf{H}_g \qquad \forall g \in G. \tag{18}$$

Here, $\mathbf{A}_g$, $\mathbf{B}_g$, and $\mathbf{H}_g$ are the permutation representations of the $G$-action on the input, the output, and the hidden layer respectively.

First, rewrite the symmetrized MLP as

$$\psi_{sym}(\mathbf{x}) = \sum_{c=1}^{C} \sum_{g \in G} \mathbf{B}_{g^{-1}} \mathbf{w}'_c\, \sigma(\mathbf{w}_c^\top \mathbf{A}_g \mathbf{x} + b_c) = \sum_{c=1}^{C} \mathbf{W}'_c\, \sigma(\mathbf{W}_c \mathbf{x} + b_c),$$

where

$$\mathbf{W}'_c = \begin{bmatrix} \mathbf{B}_{g_1^{-1}} \mathbf{w}'_c & \dots & \mathbf{B}_{g_{|G|}^{-1}} \mathbf{w}'_c \end{bmatrix}, \qquad \mathbf{W}_c = \begin{bmatrix} \mathbf{w}_c^\top \mathbf{A}_{g_1} \\ \vdots \\ \mathbf{w}_c^\top \mathbf{A}_{g_{|G|}} \end{bmatrix},$$

and the factor $\frac{1}{|G|}$ is absorbed in one of the weights. It remains to show that the two matrices above satisfy the equivariance conditions

$$\mathbf{H}_g \mathbf{W}_c = \mathbf{W}_c \mathbf{A}_g, \quad \text{and} \quad \mathbf{B}_g \mathbf{W}'_c = \mathbf{W}'_c \mathbf{H}_g.$$

The proof for $\mathbf{H}_g \mathbf{W}_c = \mathbf{W}_c \mathbf{A}_g$ is identical to the invariant network case.

For $\mathbf{B}_g \mathbf{W}'_c = \mathbf{W}'_c \mathbf{H}_g$, we use a similar approach:

$$\mathbf{B}_g \mathbf{W}'_c \mathbf{H}_g^{-1} = \begin{bmatrix} \mathbf{B}_g \mathbf{B}_{g_1^{-1} g} \mathbf{w}'_c & \dots & \mathbf{B}_g \mathbf{B}_{g_{|G|}^{-1} g} \mathbf{w}'_c \end{bmatrix} = \begin{bmatrix} \mathbf{B}_{g_1^{-1}} \mathbf{w}'_c & \dots & \mathbf{B}_{g_{|G|}^{-1}} \mathbf{w}'_c \end{bmatrix} = \mathbf{W}'_c.
$$

In the first step, since $\mathbf{H}_g^{-1}$ is acting on the right, it moves the column indexed by $g_l$ to $g \cdot g_l = g_l g^{-1}$. This means that the column currently at $g_l$ is $\mathbf{B}_{g_l^{-1} g} \mathbf{w}'_c$. The second step uses the following:

$$\mathbf{B}_g \mathbf{B}_{g_l^{-1} g} = \mathbf{B}_{g \cdot (g_l^{-1} g)} = \mathbf{B}_{g_l^{-1} g g^{-1}} = \mathbf{B}_{g_l^{-1}}.$$

This proves the equality of the symmetrized MLP of Eq. 17 to the equivariant MLP of Eq. 16. Moreover, an argument similar to the proof of the invariant case shows the universality of $\psi_{sym}$. Putting these together completes the proof of Theorem III.2. ∎

In the case where $G$ is an Abelian group, any faithful transitive action is regular, meaning that a faithful transitive hidden layer in a $G$-equivariant neural network is necessarily regular. Combined with Theorem III.2, this leads to a universality result for Abelian groups.

###### Corollary 1.

For an Abelian group $G$, a $G$-equivariant (invariant) neural network with a single hidden layer is a universal approximator of continuous $G$-equivariant (invariant) functions on compact subsets of $\mathbb{R}^N$.
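For cyclic groups the corollary specializes to a familiar object: the parameter-shared layer is circulant, i.e., a circular cross-correlation in the spirit of Eq. 8. A small sketch (illustrative sizes, not from the paper) checking this, along with the resulting shift-equivariance:

```python
# Circulant layer for the cyclic group C_n: W(i, j) = w((j - i) mod n).
import numpy as np

n = 5
rng = np.random.default_rng(1)
w = rng.standard_normal(n)                        # kernel: one row of W
W = np.array([[w[(j - i) % n] for j in range(n)] for i in range(n)])
x = rng.standard_normal(n)

# W x is exactly the circular cross-correlation of x with the kernel w.
corr = np.array([sum(w[(j - i) % n] * x[j] for j in range(n)) for i in range(n)])
assert np.allclose(W @ x, corr)

# Equivariance to cyclic shifts: shifting the input shifts the output.
assert np.allclose(np.roll(W @ x, 1), W @ np.roll(x, 1))
```

Because every faithful transitive $C_n$-action is regular, this circulant structure is forced on the hidden layer, which is exactly the hypothesis of Theorem III.2.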

## IV Decomposition of Product G-Sets

A prerequisite to the analysis of product $G$-sets is their classification, which also leads to a classification of all $G$-maps based on their input/output $G$-sets.

### IV.1 Classification of G-Sets and G-Maps

Recall that any transitive $G$-set is isomorphic to a right-coset space $H\backslash G$. However, the right-coset spaces $H\backslash G$ and $(g^{-1}Hg)\backslash G$ are themselves isomorphic. This means that what we really care about is conjugacy classes of subgroups, which classify right-coset spaces up to isomorphism. We use the bracket $[H]$ to identify the conjugacy class. In this notation, for $H, H' \leq G$, we say $[H] = [H']$ iff $H' = g^{-1}Hg$ for some $g \in G$.

### IV.2 Classification of G-sets

A $G$-set $N$ is transitive on each of its orbits, and we can identify each orbit with its stabilizer subgroup. Therefore, a list of these subgroups along with their multiplicities completely defines a $G$-set up to isomorphism (rotman2012introduction):

$$N \cong \bigcup_{[H_i] \leq G} p_i\, [H_i \backslash G], \tag{19}$$

where $p_i$ denotes the multiplicity of a right-coset space, and $N$ has $\sum_i p_i$ orbits.

To ensure a faithful $G$-action on $N$, a necessary and sufficient condition is for the point-stabilizers to have a trivial intersection. The point-stabilizers within each orbit are conjugate to each other, and their intersection, which is the largest normal subgroup of $G$ contained in $H_i$, is called the core of the $G$-action on $H_i\backslash G$:

$$\mathrm{Core}_G(H_i) \doteq \bigcap_{g \in G} g^{-1} H_i g. \tag{20}$$

### IV.3 Classification of G-Maps

Next, we extend the classification of $G$-sets to $G$-equivariant maps, a.k.a. $G$-maps $\mathbf{W}: \mathbb{R}^N \to \mathbb{R}^M$, by jointly classifying the input and output index sets $N$ and $M$. We may consider an expression similar to Eq. 19 for the output index set $M$, with subgroups $K_i$ and multiplicities $q_i$. The linear $G$-map is then equivariant to $G/K$ and invariant to $K$ iff

$$\bigcap_{p_i > 0} \mathrm{Core}_G(H_i) = \{e\} \quad \text{and} \quad \bigcap_{q_i > 0} \mathrm{Core}_G(K_i) = K, \tag{21}$$

where the second condition translates to the invariance of the $K$-action on $M$. Note that the first condition simply ensures the faithfulness of the $G$-action on $N$. This result means that the multiplicities $p_i$ and $q_i$ completely identify a (linear) $G$-map that is equivariant to $G/K$ and invariant to $K$, up to isomorphism.

### IV.4 Cartesian Product of G-sets

Previously we classified all $G$-sets as disjoint unions of homogeneous spaces $[H_i\backslash G]$, where $G$ acts transitively on each orbit. However, $G$ also naturally acts on the Cartesian product of homogeneous $G$-sets:

$$N_1 \times \dots \times N_D = (G_1 \backslash G) \times \dots \times (G_D \backslash G),$$

where the action is defined diagonally: $g \cdot (n_1, \dots, n_D) = (g \cdot n_1, \dots, g \cdot n_D)$.

High Order Spaces  A special case is when we consider the repeated self-product of the same homogeneous space $H\backslash G$:

$$H^D \cong [H\backslash G]^D = \underbrace{[H\backslash G] \times \dots \times [H\backslash G]}_{D \text{ times}}.$$

We call this an order-$D$ product space. Product spaces are used in building high-order layers in $G$-equivariant networks in several recent works kondor2018covariant; maron2018invariant; albooyeh2019incidence. maron2019universality show that for

$$D \geq \frac{1}{2}|H|\,(|H| - 1), \tag{22}$$

such MLPs with multiple hidden layers of order $D$ become universal $G$-invariant approximators. We show that better bounds for $D$ that guarantee universal invariance and equivariance follow from the universality results of Theorems III.1 and III.2 and the decomposition of product spaces. This means that such high-order product spaces are universal simply because they contain a regular $G$-set.

### IV.5 Burnside Ring and Decomposition of G-sets

Since any $G$-set can be written as a disjoint union of homogeneous spaces (Eq. 19), we expect a decomposition of the product $G$-space in the form

$$[G_i \backslash G] \times [G_j \backslash G] = \bigcup_{[G_\ell] \leq G} \delta^{\ell}_{i,j}\, [G_\ell \backslash G]. \tag{23}$$

Indeed, this decomposition exists, and the multiplicities $\delta^{\ell}_{i,j}$ are called the structure coefficients of the Burnside ring. The (commutative semi)ring structure is due to the fact that the set of non-isomorphic $G$-sets is equipped with: 1) a commutative product operation that is the Cartesian product of $G$-spaces, and 2) a summation operation that is the disjoint union of $G$-spaces dieck2006transformation. A key to the analysis of product $G$-spaces is finding the structure coefficients in Eq. 23.

###### Example 1 (Product of Sets).

The symmetric group $S_N$ acts faithfully on $N = \{1, \dots, N\}$, where the stabilizer of $n \in N$ is $S_{N \setminus \{n\}}$ – that is, the stabilizer of $n$ is the set of all permutations of the remaining items $N \setminus \{n\}$. This means $N \cong S_{N \setminus \{n\}} \backslash S_N$.

The diagonal action on the product space $N^D$ (for $D \leq N$) decomposes into $\mathrm{Bell}(D)$ orbits, where the Bell number $\mathrm{Bell}(D)$ is the number of different partitions of a set of $D$ labelled objects maron2018invariant. One may further refine these orbits by their type in the form of Eq. 23:

$$[S_{N \setminus \{n\}} \backslash S_N]^D = \bigcup_{d=1}^{D} S(D, d)\, [S_{N \setminus \{n_1, \dots, n_d\}} \backslash S_N], \tag{24}$$

where the "structure coefficient" $S(D, d)$ is the Stirling number of the second kind, which counts the number of ways a set of $D$ objects can be partitioned into $d$ non-empty sets. For example, when $D = 2$, one may think of the index set $N \times N$ as indexing some matrix. This matrix decomposes into one diagonal ($d = 1$) and one set of off-diagonals ($d = 2$). This decomposition is presented in albooyeh2019incidence, where it is shown that these orbits correspond to "hyper-diagonals" for higher order tensors. For general groups, inferring the structure coefficients is more challenging, as we will see shortly.
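Eq. 24 can be verified by brute force for small $N$ and $D$. The sketch below (not from the paper's code) enumerates the orbits of the diagonal $S_N$ action on $\{1,\dots,N\}^D$ and groups them by the number $d$ of distinct entries; the counts should match the Stirling numbers $S(D, d)$ and sum to $\mathrm{Bell}(D)$.

```python
# Exhaustive over all of S_N, so keep N small.
from itertools import permutations, product

def orbit_types(N, D):
    """Count orbits of S_N on [N]^D, grouped by d = number of distinct entries."""
    group = list(permutations(range(N)))
    seen, counts = set(), {}
    for point in product(range(N), repeat=D):
        if point in seen:
            continue
        orbit = {tuple(g[i] for i in point) for g in group}
        seen |= orbit
        d = len(set(point))                  # block count of the partition type
        counts[d] = counts.get(d, 0) + 1
    return counts

# N = 4, D = 3: Stirling numbers S(3,1)=1, S(3,2)=3, S(3,3)=1; Bell(3)=5 orbits.
counts = orbit_types(4, 3)
print(counts)          # {1: 1, 2: 3, 3: 1}
assert sum(counts.values()) == 5
```

Each orbit corresponds to one equality pattern among the $D$ indices, i.e., one set partition of the positions, which is exactly the refinement stated in Eq. 24.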

From Eq. 24 in the example above it follows that an order-$(N-1)$ product of sets contains a regular orbit: for $d = N - 1$, the stabilizer $S_{N \setminus \{n_1, \dots, n_{N-1}\}}$ is trivial. The following is a corollary that combines this with the universality results of Theorems III.1 and III.2.

###### Corollary 2.

[Universality for Product of Sets] An $S_N$-equivariant network with a hidden layer of order $D \geq N - 1$ is a universal approximator of $S_N$-equivariant (invariant) functions, where the input and output layers may be of any order.

A universality result for the invariant case only, using a quadratic order, appears in maron2019universality, where the MLP is called a hyper-graph network. keriven2019universal prove universality for the equivariant case, without giving a bound on the order of the hidden layer, and assuming an output of order one. In comparison, Corollary 2 uses a linear bound and applies to the much more general setting of arbitrary orders for the input and output product sets. In fact, the universality result is true for arbitrary input-output $S_N$-sets.

Linear $G$-Map as a Product Space  For finite groups, the linear $G$-map $\mathbf{W}$ is indexed by $M \times N$, and is therefore itself a product space. In fact, the parameter-sharing of Eq. 3 ties all the parameters that are in the same orbit. Therefore, the decomposition of Eq. 23 also identifies the parameter-sharing pattern of $\mathbf{W}$.

###### Example 2 (Equivariant Maps between Set Products).

Eq. 24 gives a closed form for the decomposition of $N^D$ into orbits. Assuming a similar decomposition for $M^{D'}$, the equivariant map is decomposed into linear maps corresponding to the orbits of $M^{D'} \times N^D$. albooyeh2019incidence show that each orbit "type" is a form of pooling-broadcasting from/to hyper-diagonals of the corresponding tensors.

#### Burnside’s Table of Marks

Burnside's table of marks simplifies working with the multiplication operation of the Burnside ring, and enables the analysis of the $G$-action on product spaces burnside1911theory; pfeiffer1997subgroups. The mark of $H \leq G$ on a finite $G$-set $N$ is defined as the number of points in $N$ fixed by all $h \in H$:

$$m_N(H) \doteq |\{n \in N \mid h \cdot n = n \;\; \forall h \in H\}|. \tag{25}$$

The interesting quality of the number of fixed points is that it adds up when we take the disjoint union of two spaces $N_1 \cup N_2$. Also, when considering product spaces $N_1 \times N_2$, any combination of points fixed in both spaces will be fixed in the product. This means

$$m_{N_1 \cup N_2}(G_i) = m_{N_1}(G_i) + m_{N_2}(G_i) \tag{26}$$
$$m_{N_1 \times N_2}(G_i) = m_{N_1}(G_i)\, m_{N_2}(G_i). \tag{27}$$

Now define the vector of marks as

$$\mathbf{m}_N \doteq [m_N(G_1), \dots, m_N(G_I)],$$

where $I$ is the number of conjugacy classes of subgroups of $G$, and we have assumed a fixed order on the classes $[G_1], \dots, [G_I]$. Due to Eqs. 26 and 27, given $G$-sets $N_1$ and $N_2$, we can perform elementwise addition and multiplication on the vectors of integers $\mathbf{m}_{N_1}$ and $\mathbf{m}_{N_2}$ to obtain the marks of the union and product $G$-sets respectively. Moreover, this special quality of marks makes the vector of marks an injective homomorphism: we can work backward from the resulting vector of marks and decompose the union/product space into homogeneous spaces. To facilitate the calculation of this vector for any $G$-set $N$, one may use the table of marks.

The table of marks of a group $G$ is the square matrix of marks of all subgroups on all right-coset spaces – that is, the $(i, j)$ element of this matrix is:

$$\mathbf{M}_G(i, j) \doteq m_{G_i \backslash G}(G_j) \quad \text{or} \quad \mathbf{M}_G \doteq \begin{bmatrix} \mathbf{m}_{\{e\} \backslash G} \\ \vdots \\ \mathbf{m}_{G \backslash G} \end{bmatrix}. \tag{28}$$

The matrix $\mathbf{M}_G$ has valuable information about the subgroup structure of $G$. For example, the $G_j$-action on $G_i \backslash G$ has a fixed point iff $[G_j] \leq [G_i]$. Therefore, the sparsity pattern in the table of marks reflects the subgroup lattice structure of $G$, up to conjugacy.

A useful property of $\mathbf{M}_G$ is that we can use it to find the marks on any $G$-set $N$ in Eq. 19 using the expression $\mathbf{m}_N = \sum_i p_i\, \mathbf{m}_{G_i \backslash G}$, which follows from Eq. 26. Moreover, the structure coefficients of Eq. 23 can be recovered from the table of marks:

$$\delta^{\ell}_{ij} = \sum_{l} \mathbf{M}_G(i, l)\, \mathbf{M}_G(j, l)\, (\mathbf{M}_G^{-1})(l, \ell). \tag{29}$$
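As a worked instance of Eqs. 28 and 29, consider the table of marks of $S_3$, with the four conjugacy classes of subgroups ordered topologically: $\{e\}$, $C_2$, $C_3$, $S_3$. The entries below were computed by hand (the mark of $G_j$ on $G_i\backslash G$); the code only checks the Burnside-ring bookkeeping of Eq. 29, not the group theory itself.

```python
import numpy as np

M = np.array([[6, 0, 0, 0],     # {e}\G : the regular G-set (6 points)
              [3, 1, 0, 0],     # C2\G  : S_3 acting on 3 points
              [2, 0, 2, 0],     # C3\G  : S_3 acting on 2 points
              [1, 1, 1, 1.0]])  # G\G   : the trivial G-set

Minv = np.linalg.inv(M)

def structure_coeffs(i, j):
    """delta^l_{ij} of Eq. (23) via Eq. (29): decompose [G_i\\G] x [G_j\\G]."""
    return np.array([sum(M[i, l] * M[j, l] * Minv[l, k] for l in range(4))
                     for k in range(4)])

# [C2\G] x [C2\G]: the 3x3 index pairs decompose into one C2\G orbit (the
# diagonal) and one regular orbit -- the regular orbit behind Theorem III.2.
delta = structure_coeffs(1, 1)
assert np.allclose(delta, [1, 1, 0, 0])
```

The elementwise product of the second row with itself gives the marks $(9, 1, 0, 0)$ of the product space, and back-substitution through the lower-triangular $\mathbf{M}_G$ recovers the decomposition $[\{e\}\backslash G] \cup [C_2\backslash G]$.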

## V Universality of G-Maps on Product Spaces

Using the tools discussed in the previous section, in this section we prove some properties of product spaces that are consequential in the design of equivariant maps. Previously we saw that product spaces decompose into orbits, identified by the coefficients $\delta^{\ell}_{i,j}$ in Eq. 23. The following theorem states that such product spaces always have orbits that are at least as large as the largest of the input orbits, and at least one of these product orbits is strictly larger than both inputs. For simplicity, this theorem is stated in terms of the stabilizers rather than the orbits, where by the orbit-stabilizer theorem, larger stabilizers correspond to smaller orbits. Also, while the theorem is stated for the product of homogeneous $G$-sets, it trivially extends to products of $G$-sets with multiple orbits.


###### Theorem V.1.

Let $N_1 \cong [G_i \backslash G]$ and $N_2 \cong [G_j \backslash G]$ be transitive $G$-sets, with stabilizers $G_i$ and $G_j$. Their product $G$-set decomposes into orbits $[G_\ell \backslash G]$, such that:

• (i) $[G_\ell] \leq [G_i]$ and $[G_\ell] \leq [G_j]$ for all the resulting orbits.

• (ii) if $G_i \not\subseteq \mathrm{Core}_G(G_j)$ and $G_j \not\subseteq \mathrm{Core}_G(G_i)$, then $[G_\ell] \lneq [G_i]$ and $[G_\ell] \lneq [G_j]$ for at least one of the resulting orbits.

###### Proof.

The proof is by analysis of the table of marks $\mathbf{M}_G$. The vector of marks for the product space is the element-wise product of the vectors of marks of the inputs:

$$\mathbf{m}_{N_1 \times N_2} = \mathbf{m}_{[G_i \backslash G]} \circ \mathbf{m}_{[G_j \backslash G]}.$$

The same vector can be written as a linear combination of the rows of $\mathbf{M}_G$, with non-negative integer coefficients:

$$\mathbf{m}_{N_1 \times N_2} = \sum_{\ell} \delta^{\ell}_{ij}\, \mathbf{m}_{[G_\ell \backslash G]}.$$

For convenience, we assume a topological ordering of the conjugacy classes of subgroups consistent with their partial order – that is, $[G_i] \leq [G_j] \Rightarrow i \leq j$. This means that $\mathbf{M}_G$ is lower-triangular with nonzero diagonals; see Table 1. Three important properties of this table are pfeiffer1997subgroups:

1. the sparsity pattern in $\mathbf{M}_G$ reflects the subgroup relation: $\mathbf{M}_G(i, j) \neq 0$ iff $[G_j] \leq [G_i]$.

2. the first column is the index of $G_i$ in $G$: $\mathbf{M}_G(i, 1) = |G : G_i|$.

3. the diagonal element is the index of $G_i$ in its normalizer: $\mathbf{M}_G(i, i) = |N_G(G_i) : G_i|$, where the normalizer of $H$ in $G$ is defined as the largest intermediate subgroup in which $H$ is normal:

$$N_G(H) = \{g \in G \mid gHg^{-1} = H\}.$$

(i) From (1) it follows that the non-zero entries of the product $\mathbf{m}_{[G_i \backslash G]} \circ \mathbf{m}_{[G_j \backslash G]}$ correspond to classes $[G_l]$ with $[G_l] \leq [G_i]$ and $[G_l] \leq [G_j]$. Since the only rows of $\mathbf{M}_G$ with such non-zero elements are those of classes $[G_\ell]$ with $[G_\ell] \leq [G_i]$ and $[G_\ell] \leq [G_j]$, all the resulting orbits have such stabilizers. This finishes the proof of the first claim.

(ii) If $[G_i] \not\leq [G_j]$ and $[G_j] \not\leq [G_i]$, then $G_i \cap G_j$, which is a subgroup of both groups, is strictly smaller than both, which means one of the resulting orbits must be larger than both input orbits.

Next, w.l.o.g., assume $[G_i] \leq [G_j]$. Consider a proof by contradiction: suppose the product does not have a strictly larger orbit; then from (i) it follows that the decomposition is $\delta^{i}_{ij}\, [G_i \backslash G]$ for some $\delta^{i}_{ij}$. Consider the first and the $i$-th elements of the elementwise product above:

$$|G : G_j| \times |G : G_i| = \delta^{i}_{ij}\, |G : G_i|$$
$$\mathbf{m}_{[G_j \backslash G]}(i) \times |N_G(G_i) : G_i| = \delta^{i}_{ij}\, |N_G(G_i) : G_i|.$$

Substituting $\delta^{i}_{ij} = |G : G_j|$ from the first equation into the second equation and simplifying, we get $\mathbf{m}_{[G_j \backslash G]}(i) = |G : G_j|$. This means the action of $G_i$ on $G_j \backslash G$ fixes all points, and therefore $G_i \subseteq \mathrm{Core}_G(G_j)$ as defined in Eq. 20. This contradicts the assumption of (ii). ∎
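Theorem V.1 can be checked exhaustively for a small case. In the sketch below (an assumed example, not from the paper), $G = S_3$ acts on the product of its 3-point $G$-set (stabilizer $C_2$) and its 2-point $G$-set (stabilizer $C_3$, via the sign homomorphism); the product turns out to be a single regular orbit of size 6, strictly larger than both inputs.

```python
from itertools import permutations, product

S3 = list(permutations(range(3)))

def sign(g):
    """Parity of a permutation: action of S_3 on 2 points (cosets of C_3)."""
    inversions = sum(g[a] > g[b] for a in range(3) for b in range(a + 1, 3))
    return inversions % 2

# Diagonal action on the product of the 3-point and 2-point G-sets.
points = list(product(range(3), range(2)))
seen, orbit_sizes = set(), []
for p in points:
    if p in seen:
        continue
    orbit = {(g[p[0]], (p[1] + sign(g)) % 2) for g in S3}
    seen |= orbit
    orbit_sizes.append(len(orbit))

print(orbit_sizes)   # [6]: one regular orbit, larger than both inputs (3 and 2)
```

Here the point-stabilizer of any pair is $C_2 \cap C_3 = \{e\}$ (up to conjugacy), so the single orbit is regular, matching the prediction of part (ii).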