Testing Lipschitz Property over Product Distribution and its Applications to Statistical Data Privacy

Abstract

Analysis of statistical data privacy has emerged as an important area of research. In this work, we design algorithms to test the privacy guarantees of a given algorithm $\mathcal{A}$ executing on a data set $D$ which contains potentially sensitive information about individuals. We design an efficient algorithm $\mathcal{T}_{priv}$ which can verify whether $\mathcal{A}$ satisfies a generalized differential privacy guarantee. Generalized differential privacy [BBG11] is a relaxation of the notion of differential privacy proposed by [DMNS06]; by now, differential privacy is the most widely accepted notion of statistical data privacy.

To design $\mathcal{T}_{priv}$, we show a new connection between the differential privacy guarantee and the Lipschitz property of a given function. More specifically, we show that an efficient algorithm for testing the Lipschitz property can be transformed into $\mathcal{T}_{priv}$, which can then test for generalized differential privacy. Lipschitz property testing and its variants, first studied by [JR11], have been explored by many works [JR11, AJMR12b, AJMR12a, CS12] because of their intrinsic connection to data privacy, as highlighted by [JR11]. Developing a Lipschitz property tester with an explicit application in privacy has been an intriguing problem since the work of [JR11]. In this work, we present such a direct application of a Lipschitz tester to testing privacy. We provide concrete instantiations of Lipschitz testers (over both the hypercube and the hypergrid domains) which are used in $\mathcal{T}_{priv}$ to test for privacy of an algorithm $\mathcal{A}$ when the underlying data set is drawn from the hypercube and the hypergrid domains, respectively.

Apart from showing a direct connection between testing of privacy and Lipschitzness testing, we generalize the work of [JR11] to the setting of property testing over non-uniform distributions. We design an efficient Lipschitz testing algorithm for the case where the distribution over the domain points is not uniform. More precisely, we design an efficient Lipschitz tester for the case where the domain points are drawn from the hypercube according to some fixed product distribution. This result is of independent interest to the property testing community. It is important to note that, to the best of our knowledge, our result on Lipschitz testing over product distributions is the only positive result in the property testing literature for non-uniform distributions since [AC06].

1 Introduction

Consider a data sharing platform like BlueKai, TellApart or Criteo. These platforms extensively collect and share user data with third parties (e.g., advertisers) to enhance specific user experiences (e.g., better behavioral targeting). Third-party applications then use this data to train their machine learning algorithms for better prediction abilities. Since the shared data is extremely rich in user information, it immediately poses privacy concerns for the users [Kor10, CKN11]. One way to address the privacy concerns arising from third-party learning algorithms is to train those algorithms "in-house", i.e., within the data sharing platform itself, thus making sure that the trained machine learning model preserves the privacy of the underlying training data. In this paper, we study a theoretical abstraction of the above problem.

Let $D$ be a data set in which each record corresponds to a particular user and contains potentially sensitive information about that user (for example, the user's click history for a set of displayed advertisements). Let $\mathcal{A}$ be an algorithm that we would like to execute on the data set $D$ (possibly to obtain some global trends about the users in $D$) without compromising any individual's privacy. This challenging problem has recently received a lot of attention in the form of theoretical investigations into the privacy-utility trade-offs of various old and new algorithms. However, even if an algorithm is provably "safe", in practice the algorithm will be implemented in a programming language and may originate from an untrusted third party. This brings its own set of challenges and has primarily been addressed in the following way: transform the algorithm into a variant which provably satisfies some theoretically sound notion of data privacy (e.g., differential privacy [DMNS06]), either by syntactic manipulation (e.g., [McS09, RP10]) or within some algorithmic/systems framework (e.g., [NRS07, JR11, MTS12, RSK10]). While each approach has its own appeal, they all have shortcomings: they suffer from weak utility guarantees [NRS07, MTS12, RSK10], take prohibitively large running time [JR11], or require the use of specialized syntax [McS09, RP10], making it somewhat nontrivial for a non-privacy expert to produce an effective transformation.

In this work, we take a new approach to the above problem, which we call privacy testing. Specifically, we initiate the study of testing whether an input algorithm $\mathcal{A}$ satisfies statistical privacy guarantees. We do this by formulating the problem in the well-studied framework of property testing [RS96a, GGR98a].

Privacy testing. Before we execute an algorithm $\mathcal{A}$ which claims to satisfy a pre-approved notion of privacy, we test the validity of such a claim. To the best of our knowledge, ours is the first work to study this approach. More precisely, in this work we initiate the study of testing an algorithm for differential privacy guarantees. In the recent past, differential privacy has become a well-established notion of privacy [Dwo06, Dwo08, Dwo09]. Roughly speaking, differential privacy guarantees that the output of an algorithm does not depend "too much" on any particular record of the underlying data set $D$. We design testing algorithms to test whether $\mathcal{A}$ satisfies generalized differential privacy [BBG11] or not. Generalized differential privacy is a relaxation of differential privacy and follows the same principles as differential privacy; under a specific setting of parameters, generalized differential privacy collapses to the definition of differential privacy. For a precise definition, see Section 2.1. It seems to us (and we make this more formal later on) that it may not be possible to design a computationally efficient testing algorithm for the notion of exact differential privacy, since in some sense it is a worst-case notion of privacy (see [BBG11, BD12] for a discussion on this).

Testing the Lipschitz property under product distributions and its connection to privacy testing. The goal of testing properties of functions is to distinguish functions which satisfy a given property from functions which are "far" from satisfying the property. The notion of "far" is usually the fraction of points in the domain of the function on which the function needs to be redefined to make it satisfy the property.

To test for generalized differential privacy, we show a new connection between differential privacy and the problem of testing the Lipschitz property, which was first studied by [JR11]. A recent line of work [JR11, AJMR12b, AJMR12a] has sought to explore applications of sublinear algorithms (specifically, property testers and reconstructors) to data privacy. We continue this line of work and show the first application of property testers (which are vastly more efficient than property reconstructors) to the setting of data privacy. Indeed, prior to this work it was not clear whether property testers for the Lipschitz property can be used at all in the data privacy setting.

Let $X$ be the universe from which data sets are drawn, where each data set has the same number of records. A function $f : X \to \mathbb{R}$ is $c$-Lipschitz if for every pair of points $x, y \in X$ the following condition holds: $|f(x) - f(y)| \le c \cdot d_H(x, y)$, where $d_H(x, y)$ is the Hamming distance between $x$ and $y$ (that is, the number of entries in which $x$ and $y$ differ). To define a Lipschitz tester, we define the notion of distance between functions $f$ and $g$ defined on the same (finite) domain under a distribution $\mathcal{D}$ as follows: $dist_{\mathcal{D}}(f, g) = \Pr_{x \sim \mathcal{D}}[f(x) \neq g(x)]$. A Lipschitz tester gets oracle access to a function $f$ and a distance parameter $\epsilon'$. It accepts Lipschitz functions and, with high probability, rejects functions which are $\epsilon'$-far from the Lipschitz property, namely, functions for which $\min_g dist_{\mathcal{D}}(f, g) > \epsilon'$, where the minimum is taken over all Lipschitz functions $g$. In this work, we extend the result of [JR11] to the setting of product distributions.
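To make the definition concrete, here is a minimal Python sketch (our own illustration; the function name and table representation are assumptions, not part of the construction above) that checks the $c$-Lipschitz condition by brute force over a small hypercube under the Hamming metric.

```python
import itertools

def hamming(x, y):
    # Hamming distance: number of coordinates in which x and y differ.
    return sum(a != b for a, b in zip(x, y))

def is_c_lipschitz(f, n, c=1.0):
    # Brute-force check of |f(x) - f(y)| <= c * d_H(x, y) over all pairs.
    # Exponential in n, so this is only a sanity check for tiny domains,
    # not the sublinear tester developed later in the paper.
    pts = list(itertools.product((0, 1), repeat=n))
    return all(abs(f[x] - f[y]) <= c * hamming(x, y)
               for x in pts for y in pts)

# Example: f(x) = number of ones in x is 1-Lipschitz on {0,1}^3.
f = {x: float(sum(x)) for x in itertools.product((0, 1), repeat=3)}
assert is_c_lipschitz(f, 3, c=1.0)
```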

While $\mathcal{D}$ is usually taken to be the uniform distribution in the property testing literature, in our setting it will be important to allow $\mathcal{D}$ to be a more general distribution. Taking $\mathcal{D}$ to be something other than the uniform distribution is challenging to investigate even for the special case of product distributions. Indeed, prior to this work the only positive result known for the product distribution setting was the work of [AC06] on monotonicity testing. For the setting where $\mathcal{D}$ is an arbitrary unknown distribution, exponential lower bounds on the computational efficiency of the tester are known [HK07]; this result is stated for functions with a discrete range.

In this paper, we show that one can use a Lipschitz property testing algorithm ($\mathcal{T}_{Lip}$) as a proxy for testing generalized differential privacy. The tester should be able to efficiently sample data sets according to a given probability distribution defined over the domain of these data sets (see Definition 2.2). It has been shown that this additional requirement is sufficient to give strong privacy guarantees for the algorithm being tested. (For further details, see Section 3.) Additionally, for practical applications, this tester should run efficiently, especially over large data set domains.

With the above motivation in mind, we design such a Lipschitz tester with sub-linear time complexity (with respect to the domain size) for the hypercube domain $\{0,1\}^n$ with a product distribution defined on the data sets. (For further details, we refer the reader to Section 4.) With this construction, we can test the privacy guarantees of an algorithm in time that is poly-logarithmic in the domain size.

1.1 Related Work

In the last few years, various notions of data privacy have been proposed. Some of the most prominent are $k$-anonymity [Swe02], $\ell$-diversity [MGKV06], differential privacy [DMNS06], noiseless privacy [BBG11], natural differential privacy [BD12] and generalized differential privacy [BBG11]. With ad-hoc notions like $k$-anonymity and $\ell$-diversity having been broken [GKS08], the privacy community has largely converged to theoretically sound notions of privacy like differential privacy. In this paper, we work with the definition of generalized differential privacy (GDP), which is a generalization of differential privacy, noiseless privacy and natural differential privacy. The primary difference between GDP and the other related definitions is that it incorporates both the randomness in the underlying data set and the randomness of the algorithm $\mathcal{A}$, whereas the other notions consider either the randomness of the data or the randomness of the algorithm.

In this paper, we design algorithms ($\mathcal{T}_{priv}$) to test whether a given algorithm $\mathcal{A}$ satisfies GDP. In all our algorithms, we assume that $\mathcal{A}$ is given as a "white box", i.e., complete access to the source code of $\mathcal{A}$ is provided. All the instantiations of $\mathcal{T}_{priv}$ in this paper are probabilistic and use Lipschitz property testing algorithms as the underlying tool set. On a related note, in the field of formal verification there have been recent works [RP10] with which one can guarantee that a given algorithm satisfies differential privacy. The caveat of these static-analysis-based approaches is that they need the source code of $\mathcal{A}$ to be written in a type-safe language, which is hard for a non-expert to adapt to.

One of the primary reasons for considering Lipschitz testers that run in time sublinear in the domain size is the large domain size often encountered in the study of statistical privacy of databases. Property testers ([RS96b, GGR98b]) have been extensively studied for various approximation and decision problems. They are of particular interest because they usually have sublinear (in the input size) running time, which matters most for problems with large inputs. Some of the ideas and definitions in this paper are taken from the work on distribution testing ([HK07, GS09, AC06]). Lipschitz property testers were introduced in [JR11] (which gave an explicit tester for the hypercube domain) and have since been studied in [AJMR12b, AJMR12a] for the hypergrid domain. Recently, [CS12] proposed an optimal Lipschitz tester for the hypercube domain with the underlying distribution being uniform.

1.2 Our Contributions

  • Formulating the testing of a data privacy property as Lipschitz property testing: In this paper we initiate the study of testing privacy properties of a given candidate algorithm $\mathcal{A}$. The specific privacy property that we test is generalized differential privacy (GDP) (see Definition 2.2). In order to design a tester for the GDP property, we cast the problem of testing GDP as a problem of testing Lipschitzness. (See Theorem 3.1.) The problem of testing Lipschitzness was initially proposed by [JR11].

  • A generic transformation to convert an algorithm $\mathcal{A}$ to its GDP variant: We design a generic transformation to convert a candidate algorithm $\mathcal{A}$ into its generalized differentially private variant. (See Theorem 3.5.)

  • New results for Lipschitz property testing: In order to make our privacy tester effective for a large class of data generating distributions, we extend the existing results on Lipschitz property testing to work with product distributions. We give the first efficient tester for the Lipschitz property over the hypercube domain which works for an arbitrary product distribution. (See Theorem 4.1.) Previous works (even for other function properties) have mostly focused on the case of the uniform distribution. To the best of our knowledge, this is the only non-trivial positive result in property testing over arbitrary product distributions apart from the result of [AC06] on monotonicity testing.

  • Concrete instantiations of privacy testers based on old and new Lipschitz testers: We instantiate the privacy tester using the Lipschitz tester described in the previous item, which also yields a concrete instantiation of Item 2 above. We also instantiate privacy testers based on Lipschitz testers known in the literature. This is summarized in Section 5.

1.3 Organization of the paper

In Section 2, we introduce the notions of privacy used in this paper, namely, differential privacy and generalized differential privacy. We also introduce the concepts of general property testing and the specific instantiation of Lipschitz property testing. In Section 3, we show the formal connection between testing of generalized differential privacy (GDP) and Lipschitz property testing. In Section 4, we state our new results on Lipschitz property testing over product distributions on the hypercube domain. In Section 5, we show that Lipschitz testers over the hypergrid domain can be used to test for GDP when the data sets are drawn uniformly from the hypergrid domain. Lastly, in Section 6 we conclude with discussions and open problems.

2 Preliminaries

2.1 Differential Privacy and Generalized Differential Privacy

In the last few years, differential privacy [DMNS06] has become a well-accepted notion of statistical data privacy. At a high level, the definition implies that the output of a differentially private algorithm will look "almost" the same from an adversary's perspective irrespective of an individual's presence or absence in the underlying data set. The notion is meaningful precisely because the presence or absence of an individual in the data set does not affect the output of the algorithm "too much". This high-level intuition is formalized below.

Definition 2.1 ($(\epsilon, \delta)$-Differential Privacy [DMNS06, DKMMN06]).

A randomized algorithm $\mathcal{A}$ is $(\epsilon, \delta)$-differentially private if for any two data sets $D$ and $D'$ drawn from a domain $X$ with $|D \triangle D'| = 1$ ($\triangle$ being the symmetric difference), and for all measurable sets $O \subseteq \text{Range}(\mathcal{A})$, the following holds:

$\Pr[\mathcal{A}(D) \in O] \le e^{\epsilon} \cdot \Pr[\mathcal{A}(D') \in O] + \delta$.

In the above definition, if $\delta = 0$, we simply call it $\epsilon$-differential privacy. In this paper we intend to test whether an algorithm $\mathcal{A}$ is $\epsilon$-differentially private. In order to test this, we mould the problem into a problem of testing Lipschitzness over the probability measure induced by the algorithm $\mathcal{A}$ over a finite set $R$ (see Section 3 for more discussion on this). Since we want to test Lipschitzness efficiently with respect to the size of the domain, we will use a relaxed notion of differential privacy called generalized differential privacy (GDP) [BBG11]. The main idea behind GDP is that it allows us to incorporate the randomness of the data generating distribution. This in turn allows us to incorporate the failure probability of the Lipschitzness testing algorithm (over the randomness of the data generating distribution). The definition of GDP below is a slight modification of the definition proposed in [BBG11] and in most natural settings is stronger than that of [BBG11].

Definition 2.2 ($(\epsilon, \delta, \gamma)$-Generalized Differential Privacy).

Let $\mathcal{D}$ be the distribution over the space of all data sets drawn from a domain $X$. Let $\mathcal{S} \subseteq X$ be a set of data sets such that $\Pr_{D \sim \mathcal{D}}[D \in \mathcal{S}] \ge 1 - \gamma$. A randomized algorithm $\mathcal{A}$ is $(\epsilon, \delta, \gamma)$-generalized differentially private (GDP) if for any pair of data sets $D, D' \in \mathcal{S}$ with $|D \triangle D'| = 1$ ($\triangle$ being the symmetric difference) and for all measurable sets $O \subseteq \text{Range}(\mathcal{A})$ the following holds: $\Pr[\mathcal{A}(D) \in O] \le e^{\epsilon} \cdot \Pr[\mathcal{A}(D') \in O] + \delta$, where the probability is over the randomness of the algorithm $\mathcal{A}$.

It is worth mentioning here that the above definition generalizes the noiseless privacy definition [BBG11] and the natural differential privacy definition [BD12] in the literature. While in both the noiseless and natural differential privacy definitions the randomness is solely over the data generating distribution $\mathcal{D}$, in GDP the randomness is over both the data generating distribution and the randomness of the algorithm.

At a high level, GDP says that there may exist a set $X \setminus \mathcal{S}$ of "bad" data sets on which the $(\epsilon, \delta)$-differential privacy condition does not hold, but the probability of drawing a data set from $X \setminus \mathcal{S}$ (over the data generating distribution $\mathcal{D}$) is at most $\gamma$ (which is usually negligible in the problem parameters). In fact, if we set $\gamma = 0$, then we recover the $(\epsilon, \delta)$-differential privacy definition (see Definition 2.1) exactly. Similarly, it can be shown that under different choices of the parameters, GDP implies both noiseless privacy and natural differential privacy.

2.2 Lipschitz Property Testing

In this work we show that efficiently testing whether an algorithm $\mathcal{A}$ is $(\epsilon, \delta, \gamma)$-generalized differentially private reduces to the problem of testing (with high success probability, over the probability measure induced by the algorithm $\mathcal{A}$) whether a function derived from the output distribution of $\mathcal{A}$ is Lipschitz. (For further details, see Section 3.)

Definition 2.3.

Let $f : (X, d_X) \to (Y, d_Y)$ be a function between metric spaces, where $d_X$ and $d_Y$ denote the distance functions on the domain and the range, respectively. The function $f$ is $c$-Lipschitz if $d_Y(f(x), f(y)) \le c \cdot d_X(x, y)$ for all $x, y \in X$.

Property testing ([GGR98b], [RS96b]) is a well-studied area pertaining to randomized approximation algorithms for decision problems, usually with sublinear time and query complexity. At one end of the spectrum, most of the previous work in this area assumes a uniform distribution over the domain elements. The other end is to consider the setting where the distribution over the domain points is unknown ([HK07]).

Here, we assume that the probability measure over the domain elements is known and is not necessarily uniform. Although seemingly important, to the best of our knowledge this is the first time that such a setting is explored in Lipschitz property testing. To state our results, we will need the following notation.

Let $P$ (Lipschitzness, in our case) be the property that needs to be tested for a function $f$. We define the distance of the function $f$ from $P$ as follows.

Definition 2.4.

Let $f$, $g$ and $\mathcal{D}$ be defined as above. The distance between functions $f$ and $g$ is defined by $dist_{\mathcal{D}}(f, g) = \Pr_{x \sim \mathcal{D}}[f(x) \neq g(x)]$. The $\mathcal{D}$-distance of a function $f$ from a property $P$ is defined as $dist_{\mathcal{D}}(f, P) = \min_{g \in P} dist_{\mathcal{D}}(f, g)$. We say that $f$ is $\epsilon'$-far from a property $P$ if $dist_{\mathcal{D}}(f, P) > \epsilon'$.
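The distance $dist_{\mathcal{D}}(f, g)$ between two given functions can be estimated by sampling, which is all a tester ever needs. Below is a minimal sketch, assuming $f$ and $g$ are given as callables and `sample_from_D()` is a hypothetical sampler for $\mathcal{D}$:

```python
def estimate_dist(f, g, sample_from_D, trials=100_000):
    # Monte Carlo estimate of dist_D(f, g) = Pr_{x ~ D}[f(x) != g(x)].
    # By a Chernoff bound, the estimate is within an additive
    # O(1/sqrt(trials)) of the true distance with high probability.
    disagreements = sum(f(x) != g(x)
                        for x in (sample_from_D() for _ in range(trials)))
    return disagreements / trials
```

Note that $dist_{\mathcal{D}}(f, P)$ itself involves a minimum over all functions satisfying $P$ and is not directly computable this way; the testers below certify it only indirectly.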

To explain our results we will also need the notion of the image diameter of a function $f$, which, roughly speaking, is the difference between the maximum and minimum values taken by $f$ on its domain.

Definition 2.5 (Image diameter).

The image diameter of a function $f : X \to \mathbb{R}$, denoted by $\text{ImD}(f)$, is the difference between the maximum and the minimum values attained by $f$, i.e., $\text{ImD}(f) = \max_{x \in X} f(x) - \min_{x \in X} f(x)$.

3 Test for Generalized Differential Privacy

In this work we initiate the study of testing whether a given algorithm satisfies statistical data privacy guarantees. As a specific instantiation of the problem, we study the notion of generalized differential privacy (GDP) (see Definition 2.2). Roughly speaking, the GDP guarantee ensures that the output of an algorithm $\mathcal{A}$ executed on a data set $D$ does not depend "too much" on any one entry of $D$. The term "too much" is formalized by three parameters $\epsilon$, $\delta$ and $\gamma$, where the first two parameters ($\epsilon$ and $\delta$) depend on the randomness of the algorithm $\mathcal{A}$ and the parameter $\gamma$ depends on the randomness of the distribution generating the data. We refer to the guarantee as $(\epsilon, \delta, \gamma)$-Generalized Differential Privacy (or simply $(\epsilon, \delta, \gamma)$-GDP).

Given an algorithm $\mathcal{A}$, we design a tester $\mathcal{T}_{priv}$ with the following property: if the tester outputs YES, then the algorithm $\mathcal{A}$ is generalized differentially private, where the parameters $\delta$ and $\gamma$ can be made arbitrarily small (at the cost of increased running time). If the tester outputs NO, then the algorithm $\mathcal{A}$ is not $\epsilon$-differentially private. We state this formally below.

Theorem 3.1 ($(\epsilon, \delta, \gamma)$-Privacy testing).

Let $\mathcal{T}_{Lip}$ be a $c$-approximate Lipschitz tester (see Definition 3.2 below), let $\mathcal{D}$ be a distribution on the domain $X$ of data sets, and let $\mathcal{A}$ be an algorithm which on input $D \in X$ outputs a value in the finite set $R$. Suppose there is an oracle which for every value $s \in R$ and for every $D \in X$ allows constant-time access to the probability measure $\Pr[\mathcal{A}(D) = s]$ (where the measure is over the randomness of the algorithm $\mathcal{A}$). Then there exists a "testing" algorithm $\mathcal{T}_{priv}$ which, on input privacy parameters $\epsilon, \delta, \gamma$, failure probability parameter $\beta$, and access to $\mathcal{A}$ and $\mathcal{D}$, satisfies the following guarantee.

  • (soundness) If Algorithm $\mathcal{T}_{priv}$ outputs NO, then the candidate algorithm $\mathcal{A}$ is not $\epsilon$-differentially private.

  • (completeness) If Algorithm $\mathcal{T}_{priv}$ outputs YES, then with probability at least $1 - \beta$ the candidate algorithm $\mathcal{A}$ is $(c\epsilon, \delta, \gamma)$-generalized differentially private.

The algorithm $\mathcal{T}_{priv}$ uses $\mathcal{T}_{Lip}$ as a subroutine and runs in time $O(|R|)$ times the running time of $\mathcal{T}_{Lip}$.

To prove Theorem 3.1, we show a new connection between testing $(\epsilon, \delta, \gamma)$-GDP and the problem of testing the Lipschitz property. The study of testing the Lipschitz property was initiated by [JR11]. We present an algorithm for testing $(\epsilon, \delta, \gamma)$-GDP based on a generalization of the Lipschitz tester presented in [JR11]. We formally define the (generalized) Lipschitz tester below; the definition differs from the standard property testing definition (as used, for example, in [JR11]) in two aspects: (i) we require Lipschitz testers only to distinguish Lipschitz functions from functions which are far from being $c$-Lipschitz for some fixed $c \ge 1$, and (ii) we measure distance between functions (in particular, how "far" a function is from satisfying the property) with respect to a pre-defined probability measure on the domain.

Definition 3.2 ($c$-approximate Lipschitz tester).

A $c$-approximate Lipschitz tester $\mathcal{T}_{Lip}$ is a randomized algorithm that gets as input: (i) oracle access to a function $f$; (ii) oracle access to independent samples from a distribution $\mathcal{D}$ on the domain of $f$; and (iii) parameters $\epsilon'$ (proximity) and $\beta$ (failure probability). It outputs a YES/NO value and provides the following guarantee.

  • If $\mathcal{T}_{Lip}$ outputs NO, then with probability 1 the function $f$ is not Lipschitz.

  • If $\mathcal{T}_{Lip}$ outputs YES, then with probability at least $1 - \beta$ there exists a set $S$ such that (i) the input function $f$ is $c$-Lipschitz on the domain restricted to $S$ and (ii) $\Pr_{x \sim \mathcal{D}}[x \in S] \ge 1 - \epsilon'$.

We remark that setting $c = 1$ and $\mathcal{D}$ to be the uniform distribution recovers the standard definition of a property tester (in our case, the Lipschitz tester as defined in [JR11]).

In Section 3.2, we show that one can extend the connection between GDP and Lipschitz testing to design an algorithm which converts the candidate algorithm $\mathcal{A}$ into a generalized differentially private algorithm.

3.1 (Generalized) Differential Privacy as Lipschitz Property over a Probability Measure

Consider the domain of the data sets to be a finite set $X$ and assume that the (randomized) algorithm $\mathcal{A}$, whose privacy property is to be tested, maps a data set to an element of another finite set $R$, i.e., any output of $\mathcal{A}$ is always an element of $R$. Now let us look at the privacy guarantee of GDP (see Definition 2.2). Ignoring the parameters $\delta$ and $\gamma$, the privacy guarantee suggests that for any pair of neighboring data sets $D, D'$ (drawn from the distribution $\mathcal{D}$) and any $s \in R$, the following is true:

$\Pr[\mathcal{A}(D) = s] \le e^{\epsilon} \cdot \Pr[\mathcal{A}(D') = s]$.   (1)

The measure here is the probability induced by the randomness of the algorithm $\mathcal{A}$. Taking logarithms in (1), we get

$\ln \Pr[\mathcal{A}(D) = s] - \ln \Pr[\mathcal{A}(D') = s] \le \epsilon$.   (2)

We will use the following formulation of (2): $|\ln \Pr[\mathcal{A}(D) = s] - \ln \Pr[\mathcal{A}(D') = s]| \le \epsilon \cdot d_H(D, D')$, where $d_H$ is the Hamming metric. Now, if we view the expression as a function $f_s$ defined by setting $f_s(D) = \frac{1}{\epsilon} \ln \Pr[\mathcal{A}(D) = s]$, then we get the following condition: $|f_s(D) - f_s(D')| \le d_H(D, D')$. This condition is exactly the Lipschitzness guarantee for $f_s$ under the Hamming metric. Using this observation, we state the following meta-algorithm (Algorithm 1) to test whether a given algorithm $\mathcal{A}$ is generalized differentially private. In Algorithm 1 (Algorithm $\mathcal{T}_{priv}$), we use a black-box Lipschitz property tester $\mathcal{T}_{Lip}$. Later in the paper we instantiate $\mathcal{T}_{Lip}$ with specific testing algorithms.

0:  Algorithm $\mathcal{A}$, data generating distribution $\mathcal{D}$, data domain $X$, output range $R$, privacy parameters $\epsilon, \delta, \gamma$ and failure parameter $\beta$
1:  $cnt \leftarrow 0$
2:  Let $\mathcal{T}_{Lip}$ be a $c$-approximate Lipschitz tester as defined in Definition 3.2.
3:  for all values $s \in R$  do
4:     Define the function $f_s$ by setting $f_s(D) = \frac{1}{\epsilon} \ln \Pr[\mathcal{A}(D) = s]$.
5:     Run $\mathcal{T}_{Lip}$ on $f_s$ with proximity parameter $\gamma / |R|$ and failure probability parameter $\beta / |R|$.
6:     If $\mathcal{T}_{Lip}$ outputs YES, then $cnt \leftarrow cnt + 1$
7:  end for
8:  If $cnt = |R|$, then output YES, otherwise output NO
Algorithm 1 $\mathcal{T}_{priv}$: Generalized Differential Privacy (GDP) tester

At a high level, Algorithm $\mathcal{T}_{priv}$ does the following. For each possible output $s \in R$, it defines a function table $f_s$ (with domain $X$). It then invokes the Lipschitz testing algorithm $\mathcal{T}_{Lip}$ to test $f_s$ for the Lipschitz property. If $\mathcal{T}_{Lip}$ outputs YES for every output $s$, then $\mathcal{T}_{priv}$ answers affirmatively, and it answers negatively otherwise.
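The following Python sketch mirrors Algorithm 1 under the notation above. Here `lipschitz_tester` and `prob_oracle` are hypothetical interfaces standing in for $\mathcal{T}_{Lip}$ and the probability oracle of Theorem 3.1, and the per-call parameters follow the reconstruction in Step 5 (a sketch under those assumptions, not a definitive implementation).

```python
import math

def gdp_tester(prob_oracle, outputs, eps, gamma, beta, lipschitz_tester):
    # prob_oracle(D, s): assumed constant-time access to Pr[A(D) = s].
    # lipschitz_tester(f, prox, fail): black-box c-approximate Lipschitz
    # tester (Definition 3.2) returning True (YES) or False (NO).
    for s in outputs:  # one Lipschitz test per value s in R
        # f_s(D) = (1/eps) * ln Pr[A(D) = s]; f_s is 1-Lipschitz under the
        # Hamming metric iff the eps-DP ratio condition holds everywhere.
        f_s = lambda D, s=s: math.log(prob_oracle(D, s)) / eps
        if not lipschitz_tester(f_s,
                                prox=gamma / len(outputs),
                                fail=beta / len(outputs)):
            return False  # NO: some f_s is certifiably non-Lipschitz
    return True  # YES: every f_s passed
```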

Proof of Theorem 3.1

The claim about the running time of Algorithm $\mathcal{T}_{priv}$ stated in Theorem 3.1 follows directly from the description of $\mathcal{T}_{priv}$ (Algorithm 1). We state and prove the soundness and completeness guarantees of Theorem 3.1 separately, as Claim 3.3 and Claim 3.4 respectively, below.

Claim 3.3 (Soundness guarantee).

If Algorithm $\mathcal{T}_{priv}$ (Algorithm 1) outputs NO, then the candidate algorithm $\mathcal{A}$ is not $\epsilon$-differentially private.

Proof.

If Algorithm $\mathcal{T}_{priv}$ outputs NO, then there exists an $s \in R$ such that $\mathcal{T}_{Lip}$ outputs NO on $f_s$. By the definition of $\mathcal{T}_{Lip}$ (see Definition 3.2), we get that $f_s$ is not Lipschitz, i.e., there exists a pair of neighboring data sets $D, D'$ with $|f_s(D) - f_s(D')| > 1$. Therefore, either $\Pr[\mathcal{A}(D) = s] > e^{\epsilon} \Pr[\mathcal{A}(D') = s]$ or $\Pr[\mathcal{A}(D') = s] > e^{\epsilon} \Pr[\mathcal{A}(D) = s]$, as required. ∎

Claim 3.4 (Completeness guarantee).

If Algorithm $\mathcal{T}_{priv}$ (Algorithm 1) outputs YES, then with probability at least $1 - \beta$ (over the randomness of $\mathcal{T}_{priv}$), the candidate algorithm $\mathcal{A}$ is $(c\epsilon, \delta, \gamma)$-generalized differentially private.

Proof.

If Algorithm $\mathcal{T}_{priv}$ outputs YES, then by the union bound it follows that with probability at least $1 - \beta$, the following condition holds for every $s \in R$: there exists a set $S_s$ such that (i) $f_s$ satisfies the $c$-Lipschitz condition on $S_s$ and (ii) $\Pr_{D \sim \mathcal{D}}[D \in S_s] \ge 1 - \gamma / |R|$.

Let $S = \bigcap_{s \in R} S_s$. We show that with probability at least $1 - \beta$ (over the randomness of $\mathcal{T}_{priv}$), the following holds: the algorithm $\mathcal{A}$ satisfies the $c\epsilon$-differential privacy condition on the set $S$, and $\Pr_{D \sim \mathcal{D}}[D \in S] \ge 1 - \gamma$.

Condition (i) above implies that for every $s \in R$, $f_s$ is $c$-Lipschitz on $S$. Therefore, for every neighboring pair of data sets $D, D' \in S$ we get $\Pr[\mathcal{A}(D) = s] \le e^{c\epsilon} \Pr[\mathcal{A}(D') = s]$.

Also, using Condition (ii) and the union bound over all $s \in R$, we get $\Pr_{D \sim \mathcal{D}}[D \in S] \ge 1 - \gamma$.

Since Conditions (i) and (ii) both hold with probability at least $1 - \beta$ (over the randomness of $\mathcal{T}_{priv}$), we get the desired claim. ∎

3.2 Application of GDP tester to ensure privacy for the output of a given candidate algorithm

In this section we demonstrate how one can use Algorithm $\mathcal{T}_{priv}$ (Algorithm 1), designed in the previous section, to guarantee generalized differential privacy for the output produced by a candidate algorithm $\mathcal{A}$. The details are given in Algorithm 2. The theoretical guarantees for Algorithm 2 are given below.

Theorem 3.5 (Generalized differentially private mechanism).

Let $\mathcal{T}_{Lip}$ be a $c$-approximate Lipschitz tester (see Definition 3.2) used in the testing algorithm $\mathcal{T}_{priv}$ (Algorithm 1). Under the same assumptions as in Theorem 3.1, the following are true for Algorithm 2.

  • (privacy) Algorithm 2 is $(c\epsilon, \delta, \gamma)$-generalized differentially private (GDP).

  • (utility) If the candidate algorithm $\mathcal{A}$ is $\epsilon$-differentially private, then Algorithm 2 always produces the output $\mathcal{A}(D)$.

0:  Data set $D$, candidate algorithm $\mathcal{A}$, testing algorithm $\mathcal{T}_{priv}$, data generating distribution $\mathcal{D}$, data domain $X$, output set $R$, privacy parameters $\epsilon, \delta, \gamma$
1:  Run $\mathcal{T}_{priv}$ on $\mathcal{A}$ with privacy parameters $\epsilon, \delta, \gamma$ and failure parameter $\beta$
2:  If $\mathcal{T}_{priv}$ outputs YES, then output $\mathcal{A}(D)$; output $\perp$ otherwise
Algorithm 2: Generalized differentially private mechanism
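For illustration, Algorithm 2 is a thin wrapper around the tester sketched in Section 3.1 (again our own sketch; `None` stands in for $\perp$ and `tester_says_yes` is the verdict returned by the `gdp_tester` sketch):

```python
def private_mechanism(D, A, tester_says_yes):
    # Release A(D) only if T_priv answered YES. The YES/NO verdict depends
    # only on A, the oracle, and the tester's coins -- not on the input D --
    # so outputting None in the NO case leaks nothing about D.
    return A(D) if tester_says_yes else None
```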

Proof of Theorem 3.5

The proof of Theorem 3.5 follows from the two claims below.

Claim 3.6 (Privacy).

Algorithm 2 is $(c\epsilon, \delta, \gamma)$-generalized differentially private (GDP).

Proof.

First note that from Claim 3.4 it follows that if Algorithm $\mathcal{T}_{priv}$ (Algorithm 1) outputs YES, then with probability at least $1 - \beta$ the candidate algorithm $\mathcal{A}$ is $(c\epsilon, \delta, \gamma)$-GDP. To complete the proof, we provide the following case analysis.

  • Case 1 [Algorithm 2 outputs $\mathcal{A}(D)$]: We define the event $E$ to be the following: for every $s \in R$ there exists a set $S_s$ such that (i) $f_s$ satisfies the $c$-Lipschitz condition on $S_s$ and (ii) $\Pr_{D \sim \mathcal{D}}[D \in S_s] \ge 1 - \gamma / |R|$. As implied by the GDP guarantee, the event $E$ holds with probability at least $1 - \beta$. Hence, conditioned on $E$, the $(c\epsilon, \delta)$-differential privacy condition holds for the released output $\mathcal{A}(D)$, for all measurable sets $O \subseteq R$ and all neighboring data sets in $S = \bigcap_{s \in R} S_s$.

  • Case 2 [Algorithm 2 outputs $\perp$]: In this case, the output is trivially generalized differentially private, since the output (i.e., $\perp$) is independent of the data set $D$.

With this the proof is complete. ∎

Claim 3.7 (Utility).

If the candidate algorithm $\mathcal{A}$ is $\epsilon$-differentially private, then Algorithm 2 always produces the output $\mathcal{A}(D)$.

The proof of the above claim follows from the fact that if the candidate algorithm $\mathcal{A}$ is $\epsilon$-differentially private, then every function $f_s$ is Lipschitz, and hence (by the one-sided error guarantee of $\mathcal{T}_{Lip}$) $\mathcal{T}_{priv}$ will always output YES.

4 Lipschitz Property Testing over Hypercube domain

In this section, we present a $c$-approximate Lipschitz tester (see Definition 3.2) for functions defined on the hypercube $\{0,1\}^n$, where the notion of distance is with respect to an arbitrary product distribution. Specifically, the domain points are distributed according to the product distribution $\mathcal{D} = \prod_{i=1}^{n} \text{Bern}(p_i)$, where $\text{Bern}(p_i)$ denotes the Bernoulli distribution with parameter $p_i$: for any vertex $x \in \{0,1\}^n$, $x_i = 1$ with probability $p_i$ and $x_i = 0$ with probability $1 - p_i$. Each vertex $x$ in $\{0,1\}^n$ thus has an associated probability mass $\mu(x) = \prod_{i : x_i = 1} p_i \cdot \prod_{i : x_i = 0} (1 - p_i)$. We write $|x|$ for the Hamming weight of $x$, i.e., the number of indices of $x$ with bit-value 1.
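For concreteness, a small Python sketch (our own illustration, assuming the mass $\mu(x)$ as written above) of the vertex mass and of sampling a vertex from the product distribution:

```python
import random

def vertex_mass(x, p):
    # mu(x) = prod_{i: x_i = 1} p_i * prod_{i: x_i = 0} (1 - p_i).
    mass = 1.0
    for xi, pi in zip(x, p):
        mass *= pi if xi == 1 else 1.0 - pi
    return mass

def sample_vertex(p):
    # One draw from the product distribution: bit i is 1 with probability p_i.
    return tuple(1 if random.random() < pi else 0 for pi in p)
```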

In this section, we prove the following theorem, which gives a $c$-approximate Lipschitz tester for $\theta\mathbb{Z}$-valued functions. A function is $\theta\mathbb{Z}$-valued if it produces outputs that are integral multiples of $\theta$.

Theorem 4.1.

Let $\{0,1\}^n$ be the domain from which the data sets are drawn according to a product probability distribution $\mathcal{D}$. The Lipschitz property of $\theta\mathbb{Z}$-valued functions on these data sets can be tested non-adaptively and with one-sided error, with failure probability at most $\beta$, in time polynomial in $n$, $1/\epsilon'$, $\log(1/\beta)$ and $\text{ImD}(f)$. Here $\text{ImD}(f)$ is the image diameter defined in Definition 2.5.

The following is an easy corollary of the above, giving a $c$-approximate Lipschitz tester for real-valued functions.

Corollary 4.2 (of Theorem 4.1).

Let $\{0,1\}^n$ be the domain from which the data sets are drawn according to a product probability distribution $\mathcal{D}$. There is an algorithm that, on input parameters $\epsilon'$ and $\beta$ and oracle access to a function $f$, has the following behavior: it accepts if $f$ is Lipschitz, rejects with probability at least $1 - \beta$ if $f$ is $\epsilon'$-far (with respect to the distribution $\mathcal{D}$) from being $c$-Lipschitz, and runs in time polynomial in $n$, $1/\epsilon'$, $\log(1/\beta)$ and $\text{ImD}(f)$. Here $\text{ImD}(f)$ is the image diameter defined in Definition 2.5.

The proofs of the above theorem and corollary appear in Section 4.1. To state the proofs we need the following technical result.

We define a distribution on the edges of the hypercube where the probability mass of an edge $e = (x, y)$ is given by $\mu_E(e) = \frac{\mu(x) + \mu(y)}{n}$. Note that every vertex has exactly $n$ incident edges, so $\sum_e \mu_E(e) = \frac{1}{n} \sum_{x \in \{0,1\}^n} n \cdot \mu(x) = 1$. Thus the probability distribution on the edges defined above (we call it $\mu_E$ henceforth) is consistent. Our tester is based on detecting violated edges (that is, edges which violate the Lipschitz property) sampled from the distribution $\mu_E$. Our main technical lemma (Lemma 4.3) gives a lower bound on the probability of sampling a violated edge according to the distribution $\mu_E$ for a function that is $\epsilon'$-far from Lipschitz. (Recall that $\epsilon'$-far is measured with respect to the distribution $\mathcal{D}$.)
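Under the edge mass $\mu_E(e) = (\mu(x) + \mu(y))/n$ reconstructed above, normalization and a simple sampling procedure can be verified as follows (a worked derivation under that assumption, not taken verbatim from the paper):

```latex
% Each vertex x has exactly n incident hypercube edges, so summing the edge
% masses counts each vertex mass n times and the factor 1/n cancels:
\sum_{e=(x,y)} \mu_E(e)
  = \frac{1}{n} \sum_{e=(x,y)} \bigl(\mu(x) + \mu(y)\bigr)
  = \frac{1}{n} \sum_{x \in \{0,1\}^n} n\,\mu(x)
  = 1.
% Sampling from mu_E: draw x ~ mu, draw i uniformly from [n], and output the
% edge e = {x, x \oplus e_i}. This edge is produced with probability
% \mu(x)/n + \mu(x \oplus e_i)/n = \mu_E(e), as required.
```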

Lemma 4.3.

Let the function $f$ be $\epsilon'$-far from Lipschitz. Then the probability that an edge sampled according to $\mu_E$ is violated by $f$ is at least $\Omega\!\left(\frac{\epsilon'}{n \cdot \text{ImD}(f)}\right)$, where $\text{ImD}(f)$ is the image diameter defined in Definition 2.5.

We prove the above lemma in Section 4.2.

4.1 Lipschitz tester

In this section we prove Theorem 4.1 and Corollary 4.2. We first present the algorithm stated in Theorem 4.1.

0:  Data domain $\{0,1\}^n$, product distribution $\mathcal{D}$ on data sets, failure probability parameter $\beta$, $\mathcal{D}$-distance parameter $\epsilon'$, discretization parameter $\theta$
1:  Set the sample sizes $t_1$ and $t_2$ as dictated by Claim 4.4 and Lemma 4.3, respectively.
2:  Sample $t_1$ vertices independently from $\{0,1\}^n$ according to the distribution $\mathcal{D}$
3:  Let $r$ be the difference between the maximum and minimum values of $f$ on the sampled vertices
4:  If $r > n$, reject
5:  Sample $t_2$ edges independently from the hypercube, with each edge $e$ picked with probability $\mu_E(e)$
6:  If any of the sampled edges is violated, then reject, else accept
Algorithm 3 Lipschitz Tester
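A Python sketch of Algorithm 3 under the reconstruction above; the sample sizes `t1`, `t2` stand in for the bounds dictated by Claim 4.4 and Lemma 4.3, and the edge sampler follows the derivation after the definition of $\mu_E$.

```python
import random

def sample_vertex(p):
    # One draw from the product distribution (as in the earlier sketch).
    return tuple(1 if random.random() < pi else 0 for pi in p)

def lipschitz_tester(f, p, t1, t2):
    # One-sided tester: Lipschitz functions are always accepted; functions
    # far from (c-)Lipschitz are rejected with high probability for suitable
    # sample sizes t1 (diameter test) and t2 (edge test).
    n = len(p)
    # Steps 2-4: estimate the image diameter from t1 sampled vertices;
    # a Lipschitz function on {0,1}^n has image diameter at most n.
    vals = [f(sample_vertex(p)) for _ in range(t1)]
    if max(vals) - min(vals) > n:
        return False  # reject
    # Steps 5-6: sample t2 edges from mu_E by drawing a vertex from the
    # product distribution and flipping a uniformly random coordinate.
    for _ in range(t2):
        x = sample_vertex(p)
        i = random.randrange(n)
        y = x[:i] + (1 - x[i],) + x[i + 1:]
        if abs(f(x) - f(y)) > 1:  # violated edge
            return False  # reject
    return True  # accept
```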
Proof of Theorem 4.1.

First observe that if the input function $f$ is Lipschitz then Algorithm 3 always accepts. This is because a Lipschitz function has image diameter (see Definition 2.5) at most $n$ (and hence cannot be rejected in Step 4). Moreover, it does not have any violated edges (and hence cannot be rejected in Step 6). Next consider the case when $f$ is $\epsilon'$-far from Lipschitz. Towards this, we first extend Claim 3.1 of [JR11] about the sample diameter to our setting, where the distance (in particular, the notion of $\epsilon'$-far) is measured with respect to a product distribution.

Claim 4.4.

Steps 2 and 3 of the tester output a value $r$ such that $r \le \text{ImD}(f)$, and with probability at least $1 - \beta/2$ (failure probability at most $\beta/2$), $f$ is $(\epsilon'/2)$-close (with respect to $\mathcal{D}$) to having image diameter at most $r$.

Proof.

Sort the points of $\{0,1\}^n$ according to their function value in non-decreasing order. Let $B_1$ be the set of the first points in this order whose probability mass sums up to $\epsilon'/4$, and let $B_2$ be the set of the last points whose probability mass sums up to $\epsilon'/4$. The rest of the proof is very similar to the proof of Claim 3.1 in [JR11], so we omit the details here. ∎

Having established Claim 4.4, the rest of the proof is identical to that in [JR11], and we omit the details. ∎

Proof of Corollary 4.2.

It is identical to the proof of Corollary 1.2 in [JR11], and we omit the details. ∎

4.2 Repair Operator and Proof of Lemma 4.3

We show a transformation of an arbitrary function $f$ into a Lipschitz function by changing $f$ on certain points, whose probability mass is related to the probability mass (with respect to $\mu_E$) of the violated edges of $f$. This is achieved by repairing one dimension of $f$ at a time, as explained below. To achieve this, we define an asymmetric version of the basic operator of [JR11]. The operator redefines function values so that it reduces the gap across a violated edge asymmetrically, according to the Hamming weights (and hence the probability masses) of the endpoints of the edge. This is the main difference from previous approaches ([JR11], [AJMR12b]), which do not work if applied directly because the probability masses of the vertices vary with their Hamming weight. We first define the building block of the repair operator, called the asymmetric basic operator.

Definition 4.5 (Asymmetric basic operator).

Given $f$, for each violated edge $(x, y)$ along dimension $i$ (where, say, $x_i = 0$ and $y_i = 1$), the asymmetric basic operator $B_i$ redefines $f$ at the two endpoints so as to reduce the gap $|f(x) - f(y)|$, distributing the correction between the endpoints asymmetrically according to the parameter $p_i$:

  1. If $f(x) > f(y)$, then $B_i$ decreases $f(x)$ and increases $f(y)$.

  2. If $f(y) > f(x)$, then $B_i$ decreases $f(y)$ and increases $f(x)$.

Now we define the repair operator.

Definition 4.6 (Repair operator).

Given $f$, the repaired function $R_i[f]$ is obtained from $f$ by several applications of the asymmetric basic operator (see Definition 4.5) along dimension $i$, followed by a single application of the rounding operator. Specifically, let $f'$ be the function obtained from $f$ by applying $B_i$ repeatedly until there are no violated edges along the $i$-th dimension. Then $R_i[f]$ is defined to be the rounding of $f'$, where the rounding operator rounds the function values to the closest $\theta\mathbb{Z}$-valued function.

In effect, we have the following picture for the repair operation: $f \to f'$ (repeated applications of $B_i$) $\to R_i[f]$ (rounding).

We now define a measure called the violation score, which will be used to show the progress of the repair operation. As shown later, the violation score along any other dimension is approximately preserved when we apply the repair operator to repair the edges along dimension $i$. Note that the violation score closely resembles the violation score in [JR11], except that it depends on the probability masses of the end-points of the edge as well as on the function values.

Definition 4.7.

The violation score of an edge $e = (x, y)$ with respect to a function $f$, denoted by $vs(e)$, is $\max(|f(x) - f(y)| - 1, 0) \cdot (\mu(x) + \mu(y))$. The violation score along dimension $i$, denoted by $VS_i(f)$, is the sum of the violation scores of all edges along dimension $i$.

The violation score of an edge is positive iff the edge is violated, and the violation score of a $\theta\mathbb{Z}$-valued function along any dimension is bounded in terms of its image diameter. Let $V_i(f)$ denote the set of edges along dimension $i$ violated by $f$. Then

$VS_i(f) = \sum_{e \in V_i(f)} vs(e)$.   (3)

Lemma 4.9 shows that $R_i$ does not increase the violation score in dimensions other than $i$ by more than an additive $\theta$. The lemma makes use of the following claim.

Claim 4.8 (Rounding is safe).

Given $a, b \in \mathbb{R}$ satisfying $|a - b| \le 1$, let $a'$ (respectively, $b'$) be the value obtained by rounding $a$ (respectively, $b$) to the closest integer. Then $|a' - b'| \le 1$.

Proof.

Assume without loss of generality that $a \le b$. For $x \in \{a, b\}$, let $\lfloor x \rfloor$ be the largest integer not greater than $x$. Observe that $a' \in \{\lfloor a \rfloor, \lfloor a \rfloor + 1\}$ and $b' \in \{\lfloor b \rfloor, \lfloor b \rfloor + 1\}$. Using the fact that $b - a \le 1$, we see that $\lfloor b \rfloor \le \lfloor a \rfloor + 1$; if $\lfloor b \rfloor = \lfloor a \rfloor$, then $|a' - b'| \le 1$ always holds. Therefore, assume $\lfloor b \rfloor = \lfloor a \rfloor + 1$. The claim can fail only if $a' = \lfloor a \rfloor$ and $b' = \lfloor b \rfloor + 1$. The latter implies $a < \lfloor a \rfloor + 1/2$ and $b \ge \lfloor b \rfloor + 1/2 = \lfloor a \rfloor + 3/2$ (using the fact that values are rounded to the closest integer). That is, $b - a > 1$, contradicting $|a - b| \le 1$. In other words, $|a' - b'| \le 1$, as required. ∎

Lemma 4.9.

For all $i, j \in [n]$, where $i \ne j$, and every function $f$, the following holds.

  • (progress) Applying the repair operator $R_i$ does not introduce new violated edges in dimension $j$ if dimension $j$ is violation free, i.e., if $VS_j(f) = 0$ then $VS_j(R_i[f]) = 0$.

  • (accounting) Applying the repair operator $R_i$ does not increase the violation score in dimension $j$ by more than $\theta$, i.e., $VS_j(R_i[f]) \le VS_j(f) + \theta$.

Proof.

Let $f'$ be the function obtained from $f$ by applying the asymmetric basic operator $B_i$ repeatedly until there are no violated edges along the $i$-th dimension. We prove the following stronger claim, from which the lemma follows.

Claim 4.10.

For every dimension $j \ne i$, applying the asymmetric basic operator along dimension $i$ does not increase the violation score along dimension $j$; in particular, $VS_j(f') \le VS_j(f)$.

We prove the above claim momentarily, but first we prove the lemma using it. The function $R_i[f]$ is obtained by rounding the values of $f'$ to the closest multiples of $\theta$. Since rounding never creates new edge violations (by Claim 4.8, applied after rescaling by $1/\theta$), we immediately get the first part of the lemma. The second part follows from the observation that the rounding step modifies each function value by at most $\theta/2$. Correspondingly, the violation score of an edge $(x, y)$ along the $j$-th dimension changes by at most $2 \cdot (\theta/2) \cdot (\mu(x) + \mu(y))$, where the factor 2 appears because both endpoints of an edge may be rounded. Summing over all edges in the $j$-th dimension, we get $VS_j(R_i[f]) \le VS_j(f') + \theta \le VS_j(f) + \theta$, where the first inequality holds because the edges along the $j$-th dimension form a perfect matching and therefore their endpoint masses $\mu(x) + \mu(y)$ sum to 1.

Proof of Claim 4.10.

Following the outline of a similar proof in [JR11], we show that an application of the asymmetric basic operator in dimension $i$ does not increase the violation score in dimension $j$. Standard arguments [GGL00, DGL99, JR11, AJMR12b] show that it is enough to analyze the effect of applying $B_i$ on one fixed square formed by adjacent edges that cross dimensions $i$ and $j$. (This is because the edges along dimensions $i$ and $j$ form disjoint squares in the hypercube. So, having established Claim 4.10 for one fixed square of the hypercube, the full claim follows by summing up the inequalities over all such squares.)

Consider the restriction of $f$ to one such square, with vertices $u, v, w, z$ positioned such that the edges $(u, v)$ and $(w, z)$ lie along dimension $i$ and the edges $(u, w)$ and $(v, z)$ lie along dimension $j$, where $|u| < |v|$ and $|w| < |z|$ ($|\cdot|$ denoting the Hamming weight). Assume that the basic operator is applied along dimension $i$. We show that the violation score along dimension $j$ does not increase. Assume, towards a contradiction, that the violation score along the edge $(u, w)$ increases. First, assume that $f(u) > f(w)$. (The other case is very similar and we address it later.) Then the operator increases $f(u)$ and/or decreases $f(w)$. Assume that it increases $f(u)$. (The other case is symmetrical.) This implies that the edge $(u, v)$ is violated with $f(v) > f(u)$. Let $f_k(u)$ (resp. $f_k(v)$) denote the value at $u$ (resp. $v$) after $k$ applications of $B_i$ to the edge $(u, v)$, for an integer $k \ge 0$. If $(u, v)$ is still violated after $k$ applications of the basic operator, then another application modifies the endpoint values; otherwise they remain unchanged. We will study the effect of applying $B_i$ multiple (say $k$) times. Recall that the basic operator modifies an edge only if the edge is violated. This means that

The second inequality follows from the observation that, since the edge is being corrected in the $k$-th application, it must have been corrected in all previous applications as well. The last inequality follows from the fact that $f$ is a $\theta\mathbb{Z}$-valued function and $k$ is an integer. We subtract equal terms from both sides of the above inequality and rearrange to obtain the following.

The above inequality is crucial for the remainder of the proof of Lemma 4.3. Now consider the two cases: either the bottom edge $(w, z)$ is also violated, or it is not.

If the bottom edge $(w, z)$ is not violated, then we have $|f(w) - f(z)| \le 1$, and $f(w)$ and $f(z)$ are not modified by the basic operator. Since $f(u)$ increases, the gap across the edge $(u, w)$ grows while the gap across the violated edge $(u, v)$ shrinks. Combining the above inequalities, we get that the violation score increases along $(u, w)$ by at most the amount by which it decreases along $(u, v)$.