Biometric Systems Private by Design: Reasoning about privacy properties of biometric system architectures

This work has been partially funded by the French ANR-12-INSE-0013 project BIOPRIV and the European FP7-ICT-2013-1.5 project PRIPARE. Earlier and partial versions of this work appeared in the FM 2015 [10] and ISC 2015 [11] conferences. This work provides a global and consistent view of these preliminary publications.


Abstract

This work aims to show how the privacy by design approach can be applied to biometric systems, and the benefit of using formal methods to this end. Starting from a general framework introduced at STM in 2014, which makes it possible to define privacy architectures and to reason formally about their properties, we explain how it can be adapted to biometrics. The choice of particular techniques and the role of the components (central server, secure module, biometric terminal, smart card, etc.) in the architecture have a strong impact on the privacy guarantees provided by a biometric system. Some architectures have already been analysed in the literature. However, the existing proposals were made on a case by case basis, which makes it difficult to compare them and to provide a rationale for the choice of specific options. In this paper, we describe, on different architectures with various levels of protection, how a general framework for the definition of privacy architectures can be used to specify the design options of a biometric system and to reason about them in a formal way.

1 Introduction

Applications of biometric recognition, as the most natural tool to identify or to authenticate a person, have grown over the years. They now range from criminal investigations and identity documents to many public or private usages, like physical access control or authentication from a smartphone toward an internet service provider. Such biometric systems involve two main phases: enrolment and verification (either authentication or identification) [23]. Enrolment is the registration phase, in which the biometric traits of a person are collected and recorded within the system. In the authentication mode, a fresh biometric trait is collected and compared with the registered one by the system to check that it corresponds to the claimed identity. In the identification mode, fresh biometric data are collected and the corresponding identity is searched in a database of enrolled biometric references. During each phase, to enable efficient and accurate comparison, the collected biometric data are converted into discriminative features, leading to what is called a biometric template.
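To make the three operations concrete, the following minimal sketch models enrolment, authentication (one-to-one comparison) and identification (one-to-many search). It is an illustration only: the representation of a template as a bit vector, the Hamming distance, the threshold-based decision and all names (`BiometricSystem`, `enrol`, `authenticate`, `identify`) are assumptions made for this example, not elements of the systems discussed in this paper.

```python
from typing import Dict, List, Optional

Template = List[int]  # hypothetical representation: fixed-length bit vector


def hamming(a: Template, b: Template) -> int:
    """Number of positions at which two equal-length templates differ."""
    if len(a) != len(b):
        raise ValueError("templates must have the same length")
    return sum(x != y for x, y in zip(a, b))


class BiometricSystem:
    """Toy system storing reference templates in clear (no protection)."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self.references: Dict[str, Template] = {}

    def enrol(self, identity: str, template: Template) -> None:
        # Enrolment: record the reference template of a person.
        self.references[identity] = list(template)

    def authenticate(self, claimed: str, fresh: Template) -> bool:
        # Authentication: compare with the reference of the claimed identity.
        ref = self.references.get(claimed)
        return ref is not None and hamming(ref, fresh) <= self.threshold

    def identify(self, fresh: Template) -> Optional[str]:
        # Identification: search the closest enrolled reference within threshold.
        best: Optional[str] = None
        best_distance: Optional[int] = None
        for identity, ref in self.references.items():
            d = hamming(ref, fresh)
            if d <= self.threshold and (best_distance is None or d < best_distance):
                best, best_distance = identity, d
        return best
```

In this toy version every reference template is stored in clear by a single component, which is precisely the kind of design whose privacy properties the architectures discussed later aim to improve.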

With the increased use of biometric systems, and more recently with the development of personal data protection regulations, the issues related to the protection of the privacy of the biometric traits used have received particular attention. As leakage of biometric traits may lead to privacy risks, including tracking and identity theft, a privacy by design approach is often needed.

As a technical security challenge, it has attracted a lot of research for at least 15 years, and a wide array of well-documented primitives, such as encryption, homomorphic encryption, secure multi-party computation, hardware security, template protection, etc., are known in the literature. With these building blocks, various architectures have been proposed to take into account privacy requirements in the implementation of privacy preserving biometric systems. Some solutions involve dedicated cryptographic primitives such as secure sketches [14] and fuzzy vaults [24, 44], others rely on adaptations of existing cryptographic tools [30] or the use of secure hardware solutions [36]. The choice of particular techniques and the role of the components (central server, secure module, terminal, smart card, etc.) in the architecture have a strong impact on the privacy guarantees provided by a solution. However, existing proposals were made on a case by case basis, which makes it difficult to compare them, to provide a rationale for the choice of specific options and to capitalize on past experience.

Here, we aim to show how to use and adapt a general framework that has been introduced in [2] for the formal definition and validation of privacy architectures. The goal is to specify the various design options in a consistent and comparable way, and then to reason about them in a formal way in order to justify their design in terms of trust assumptions and achieved privacy properties.

The privacy by design approach is often praised by lawyers as well as computer scientists as an essential step towards better privacy protection. It is even becoming more and more often legally compelled, as for instance in the European Union with the General Data Protection Regulation [16] entering into force. Nevertheless, it is one thing to impose by law the adoption of privacy by design, quite another to define precisely what it is intended to mean technically and to ensure that it is put into practice by developers. The overall philosophy is that privacy should not be treated as an afterthought but rather as a first-class requirement in the design phase of systems: in other words, designers should have privacy in mind from the start when they define the features and architecture of a system. However, the practical application raises a number of challenges: first of all the privacy requirements must be defined precisely; then it must be possible to reason about potential tensions between privacy and other requirements and to explore different combinations of privacy enhancing technologies to build systems meeting all these requirements.

This work, which has been conducted in particular within the French ANR research project BioPriv [6], an interdisciplinary project involving lawyers and computer scientists, can be seen as an illustration of the feasibility of the privacy by design approach in an industrial environment. A step in this direction has been described in [2] which introduces a system for defining privacy architectures and reasoning about their properties. In Section 2, we provide an outline of this framework. Then we show how this framework can be used to apply a privacy by design approach to the implementation of biometric systems. In Sections 3 to 4.3, we describe several architectures for biometric systems, considering both existing systems and more advanced solutions, and show that they can be defined in this framework. This makes it possible to highlight their commonalities and differences especially with regard to their underlying trust assumptions.

In the second part of this paper, we address a security issue which cannot be expressed in the framework presented in Section 2. The origin of the problem is that side-channel information may leak from the execution of the system. This issue is acute for biometric systems because the result of a matching between two biometric data inherently provides some information, even if the underlying cryptographic components are correctly implemented [12, 39, 37]. To address this issue, in Section 5, we propose an extension of the formal framework, in which information leaks spanning several sessions of the system can be expressed. In Section 6, we apply the extended model to analyse biometric information leakage in several variants of biometric system architectures.

Finally, Section 7 sketches related works and Section 8 concludes the paper with suggestions of avenues for further work.

2 General approach

The work presented in [2] can be seen as a first step towards a formal and systematic approach to privacy by design. In practice, this framework makes it possible to express privacy and integrity requirements (typically the fact that an entity must obtain guarantees about the correctness of a value), to analyse their potential tensions and to make reasoned architectural choices based on explicit trust assumptions. The motivations for the approach come from the following observations:

  • First, one of the key decisions that has to be taken in the design of a privacy compliant system is the location of the data and the computations: for example, a system in which all data is collected and all results computed on a central server brings strong integrity guarantees to the operator at the price of a loss of privacy for data subjects. Decentralized solutions may provide better privacy protections but weaker guarantees for the operator. The use of privacy enhancing technologies such as homomorphic encryption or secure multi-party computation can in some cases reconcile both objectives.

  • The choice among the architectural options should be guided by the assumptions that can be placed by the actors on the other actors and on the components of the architecture. This trust itself can be justified in different ways (security protocol, secure or certified hardware, accredited third party, etc.).

As far as the formal model is concerned, the framework proposed in [2] relies on a dedicated epistemic logic. Indeed, because privacy is closely connected with the notion of knowledge, epistemic logics [17] form an ideal basis to reason about privacy properties but standard epistemic logics based on possible worlds semantics suffer from a weakness (called “logical omniscience” [22]) which makes them unsuitable in the context of privacy by design.

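We assume that the functionality of the system is expressed as the computation of a set of equations $\Omega := \{X = T\}$ over a language $Term$ of terms $T$ defined as follows, where $c$ represents constants ($c \in Const$), $X$ variables ($X \in Var$) and $F$ functions ($F \in Fun$):

$$T ::= X \mid c \mid F(T_1, \ldots, T_n)$$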

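An architecture $\mathcal{A}$ is defined by a set of components $C_i$, for $i \in [1, \ldots, n]$, and a set $A$ of relations. The relations define the capacities of the components and the trust assumptions. We use the following language to define the relations:

$$
\begin{array}{lcl}
A & ::= & \{R\} \\
R & ::= & Has_i(X) \mid Receive_{i,j}(\{St\}, \{X\}) \mid Compute_G(X = T) \mid Verify_i(\{St\}) \mid Trust_{i,j} \\
St & ::= & Pro \mid Att \\
Att & ::= & Attest_i(\{Eq\}) \\
Pro & ::= & Proof_i(\{P\}) \\
Eq & ::= & Pred(T_1, \ldots, T_m) \\
P & ::= & Att \mid Eq
\end{array}
$$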

The notation $\{Z\}$ denotes a set of terms of category $Z$. $Has_i(X)$ denotes the fact that component $C_i$ possesses (or is the origin of) the value of $X$, which may correspond to situations in which $X$ is stored on $C_i$ or $C_i$ is a sensor collecting the value of $X$. In this paper we use the set of predicates $Pred := \{=, \in\}$. $Compute_G(X = T)$ means that the set of components $G$ can compute the term $T$ and assign its value to $X$, and $Trust_{i,j}$ represents the fact that component $C_i$ trusts component $C_j$. $Receive_{i,j}(\{St\}, \{X\})$ means that $C_i$ can receive the values of variables in $\{X\}$ together with the statements in $\{St\}$ from $C_j$.

We consider two types of statements here, namely attestations: $Attest_i(\{Eq\})$ is the declaration by the component $C_i$ that the properties in $\{Eq\}$ hold; and proofs: $Proof_i(\{P\})$ is the delivery by $C_i$ of a set of proofs of properties. $Verify_i$ is the verification by component $C_i$ of the corresponding statements (proof or authenticity). In any case, the architecture level does not provide details on how a verification is done. The verification of an attestation concerns the authenticity of the statement only, not its truth, which $C_i$ may not even be able to establish itself. In practice, it could be the verification of a digital signature.

Graphical data flow representations can be derived from architectures expressed in this language. For the sake of readability, we use both notations in the next sections.

The subset of the privacy logic used in this paper is the following dedicated epistemic logic:

$Has_i(X)$ and $Has_i^{none}(X)$ denote the facts that component $C_i$ respectively can or cannot get the value of $X$. $K_i$ denotes the epistemic knowledge following the “deductive algorithmic knowledge” philosophy [17, 38] that makes it possible to avoid the logical omniscience problem. In this approach, the knowledge of a component $C_i$ is defined as the set of properties that this component can actually derive using its own information and its deductive system $\triangleright_i$.

Another relation, $Dep_i$, is used to take into account dependencies between variables. $Dep_i(Y, \mathcal{X})$ means that if $C_i$ can obtain the values of each variable in the set of variables $\mathcal{X}$, then it may be able to derive the value of $Y$. The absence of such a relation is an assumption that $C_i$ cannot derive the value of $Y$ from the values of the variables in $\mathcal{X}$. It should be noted that this dependency relation is associated with a given component: different components may have different capacities. For example, if component $C_i$ is the only component able to decrypt a variable $ev$ to get the clear text $v$, then $Dep_i(v, \{ev\})$ holds but $Dep_j(v, \{ev\})$ does not hold for any $j \neq i$.
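The role of the dependency relation can be illustrated by a small fixpoint computation. The sketch below is a simplification written for this illustration, not the axiomatics of [2]: it collects the variables a component possesses, receives or computes, and then closes this set under the component's own dependencies. Treating absence from the resulting set as "cannot get the value" is an assumption of the sketch, as are the tuple encodings and the names `accessible`, `br` and `ebr` used in the toy example.

```python
from typing import FrozenSet, Iterable, Set, Tuple

Has = Tuple[str, str]                      # (component, variable)
Receive = Tuple[str, str, FrozenSet[str]]  # (receiver, sender, variables)
Compute = Tuple[str, str]                  # (component, computed variable)
Dep = Tuple[str, str, FrozenSet[str]]      # (component, derived variable, sources)


def accessible(
    component: str,
    has: Iterable[Has],
    receive: Iterable[Receive],
    compute: Iterable[Compute],
    dep: Iterable[Dep],
) -> Set[str]:
    """Least fixpoint of the variables `component` can obtain (simplified)."""
    known: Set[str] = {x for (c, x) in has if c == component}
    known |= {x for (dst, _src, xs) in receive if dst == component for x in xs}
    known |= {x for (c, x) in compute if c == component}
    deps = [d for d in dep if d[0] == component]  # only this component's capacities
    changed = True
    while changed:
        changed = False
        for (_c, target, sources) in deps:
            if target not in known and sources <= known:
                known.add(target)
                changed = True
    return known


# Toy example: only T can decrypt the encrypted value ebr to obtain br.
HAS = [("I", "br")]
COMPUTE = [("I", "ebr")]
RECEIVE = [("S", "I", frozenset({"ebr"})), ("T", "S", frozenset({"ebr"}))]
DEP = [("T", "br", frozenset({"ebr"})), ("I", "ebr", frozenset({"br"}))]

server_view = accessible("S", HAS, RECEIVE, COMPUTE, DEP)
terminal_view = accessible("T", HAS, RECEIVE, COMPUTE, DEP)
```

Here `server_view` contains only the encrypted value, while `terminal_view` also contains the clear value, because only the terminal has a dependency from the encrypted value to the clear one.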

The semantics $S(A)$ of an architecture $A$ is defined as the set of states of the components $C_i$ of $A$ resulting from compatible execution traces [2]. A compatible execution trace contains only events that are instantiations of relations (e.g. $Receive_{i,j}$, etc.) of $A$ (as further discussed in Section 5.1). The semantics $S(\phi)$ of a property $\phi$ is defined as the set of architectures meeting $\phi$. For example, $A \in S(Has_i^{none}(X))$ if for all states $\sigma \in S(A)$, the sub-state $\sigma_i$ of component $C_i$ is such that $\sigma_i^v(X) = \bot$, which expresses the fact that the component $C_i$ cannot assign a value to the variable $X$.

To make it possible to reason about privacy properties, an axiomatics of this logic is presented and is proven sound and complete. $A \vdash \phi$ denotes that $\phi$ can be derived from $A$ thanks to the deductive rules (i.e. there exists a derivation tree such that all steps belong to the axiomatics, and such that the leaf is $A \vdash \phi$). A subset of the axioms useful for this paper is presented in Figure 1.

[Inference rules H1, H2, H3, H5, HN, K, K1, K3, K4 and K5]

Figure 1: A subset of rules from the axiomatics of [2]

3 Biometric systems architectures

Before starting the presentation of the different biometric architectures in the next sections, we introduce in this section the basic terminology used in this paper and the common features of the architectures. For the sake of readability, we use upper case sans serif letters S, T, etc. rather than indexed variables to denote components. By abuse of notation, we will use component names instead of indices and write, for example, . Type letters dec, br, etc. denote variables. The set of components of an architecture is denoted by .

The variables used in biometric system architectures are the following:

  • A biometric reference template br built during the enrolment phase, where a template corresponds to a set or vector of biometric features that are extracted from raw biometric data in order to be able to compare biometric data accurately.

  • Raw biometric data rd provided by the user during the verification phase.

  • A fresh template bs derived from rd during the verification phase.

  • A threshold thr which is used during the verification phase as a closeness criterion for the biometric templates.

  • The output dec of the verification which is the result of the matching between the fresh template bs and the enrolled templates br, considering the threshold thr.

Two components appear in all biometric architectures: a component U representing the user, and the terminal T which is equipped with a sensor used to acquire biometric traits. In addition, biometric architectures may involve an explicit issuer I, enrolling users and certifying their templates, a server S managing a database containing enrolled templates, a module M (which can be a hardware security module, denoted HSM) to perform the matching and possibly to take the decision, and a smart card C to store the enrolled templates (and in some cases to perform the matching). Figure 2 introduces some graphical representations used in the figures of this paper.

Figure 2: Graphical representations (symbols for the user, the encrypted database, the terminal, the card and the location of the comparison)

In this paper, we focus on the verification phase and assume that enrolment has already been done. Therefore the biometric reference templates are stored on a component which can be either the issuer ($Has_I(br)$) or a smart card ($Has_C(br)$). A verification process is initiated by the terminal T receiving as input raw biometric data rd from the user U. T extracts the fresh biometric template bs from rd using the function $Extract \in Fun$. All architectures $A$ therefore include $Receive_{T,U}(\{\}, \{rd\})$ and $Compute_T(bs = Extract(rd))$, and the $Dep_T$ relation is such that $(bs, \{rd\}) \in Dep_T$. In all architectures $A$, the user receives the final decision dec (which can typically be positive or negative) from the terminal: $Receive_{U,T}(\{\}, \{dec\})$. The matching itself, which can be performed by different components depending on the architecture, is expressed by the function $\mu \in Fun$ which takes as arguments two biometric templates and the threshold thr.

4 Application of the framework to several architectures for biometric systems with various protection levels

4.1 Protecting the reference templates with encryption

Let us consider first the most common architecture deployed for protecting biometric data. When a user is enrolled his reference template is stored encrypted, either in a terminal with an embedded database, or in a central database. During the identification process, the user supplies a fresh template, the reference templates are decrypted by a component (which can be typically the terminal or a dedicated hardware security module) and the comparison is done inside this component. The first part of Figure 3 shows an architecture in which reference templates are stored in a central database and the decryption of the references and the matching are done inside the terminal. The second part of the figure shows an architecture in which the decryption of the references and the matching are done on a dedicated hardware security module. Both architectures are considered in turn in the following paragraphs.

Figure 3: Classical architectures with an encrypted database. First part: encrypted database. Second part: encrypted database with a hardware security module (HSM).

Use of an encrypted database.

The first architecture is composed of a user U, a terminal T, a server S managing an encrypted database ebr and an issuer I enrolling users and generating the encrypted database ebr. The set includes the encryption and decryption functions and . When applied to an array, is assumed to encrypt each entry of the array. At this stage, for the sake of conciseness, we consider only biometric data in the context of an identification phase. The same types of architectures can be used to deal with authentication, which does not raise any specific issue. The functionality of the architecture is {, , , }, and the architecture is defined as:

The properties of the encryption scheme are captured by the dependence and deductive relations. The dependence relations are: , and {(, {}), (, {, , }), (, {}), (, {})} . Moreover the deductive algorithm relation contains: .

From the point of view of biometric data protection, the property that this architecture is meant to ensure is the fact that the server should not have access to the reference template, that is to say: , which can be proven using Rule HN (the same property holds for ):

[Derivation tree: rule HN]

It is also easy to prove, using H2 and H5, that the terminal has access to : .

As far as integrity is concerned, the terminal should be convinced that the matching is correct. The proof relies on the trust placed by the terminal in the issuer (about the correctness of ) and the computations that the terminal can perform by itself (through and the application of ):

[Derivation trees: rules K5, K and K1]

Assuming that all deductive relations include the properties (commutativity and transitivity) of the equality, K can be used to derive: . A further application of K1 with another transitivity rule for the equality allows us to obtain the desired integrity property:

[Derivation tree: rule K1 followed by rule K]

Encrypted database with a hardware security module.

The architecture presented in the previous subsection relies on the terminal to decrypt the reference template and to perform the matching operation. As a result, the clear reference template is known by the terminal and the only component that has to be trusted by the terminal is the issuer. If it does not seem sensible to entrust the terminal with this central role, another option is to delegate the decryption of the reference template and the computation of the matching to a hardware security module so that the terminal itself never stores any clear reference template. This strategy leads to the architecture pictured in the second part of Figure 3.

In addition to the user U, the issuer I, the terminal T, and the server S, the set of components contains a hardware security module M. The terminal does not perform the matching, but has to trust M. This trust can be justified in practice by the level of security provided by the HSM M (which can also be endorsed by an official security certification scheme). The architecture is described as follows in our framework:

where the set of attestations received by the terminal from the module is .

The trust relation between the terminal and the module makes it possible to apply rule K5 twice:

[Derivation trees: two applications of rule K5]

The same proof as in the previous subsection can be applied to establish the integrity of the matching. The trust relation between the terminal and the issuer and the rules K5, K make it possible to derive: . Then two successive applications of K regarding the transitivity of the equality lead to: .

As in architecture , the biometric references are never disclosed to the server. However, in contrast with , they are not disclosed either to the terminal, as shown by rule HN:

[Derivation tree: rule HN]

4.2 Enhancing protection with homomorphic encryption

In both architectures of Section 4.1, biometric templates are protected, but the component performing the matching (either the terminal or the secure module) gets access to the reference templates. In this section, we show how homomorphic encryption can be used to ensure that no component gets access to the biometric reference templates during the verification.

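Homomorphic encryption schemes [19] make it possible to compute certain functions over encrypted data. For example, if $Enc$ is a homomorphic encryption scheme for multiplication then there is an operation $\times^h$ such that:

$$c_1 = Enc(m_1),\; c_2 = Enc(m_2) \;\Rightarrow\; c_1 \times^h c_2 = Enc(m_1 \times m_2)$$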

Figure 4 presents an architecture derived from in which the server performs the whole matching computation over encrypted data. The user supplies a template that is sent encrypted to the server (denoted ). The server also owns an encrypted reference template . The comparison, i.e. the computation of the distance between the templates, is done by the server, leading to the encrypted distance , but the server does not get access to the biometric data or to the result. This is made possible through the use of a homomorphic encryption scheme. On the other hand, the module gets the result, but does not get access to the templates. Let us note that is just one of the possible ways to use homomorphic encryption in this context: the homomorphic computation of the distance could actually be made by another component (for example the terminal itself) since it does not lead to any leak of biometric data.
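The multiplicative homomorphic property mentioned at the beginning of this section can be observed on textbook RSA, where the product of two ciphertexts decrypts to the product of the plaintexts. The sketch below uses this scheme purely as an illustration: the tiny parameters and the function names `enc`, `dec` and `hom_mul` are assumptions of the example, unpadded RSA is not secure, and this is not claimed to be the scheme used in the architectures of this paper.

```python
# Textbook RSA with toy parameters (illustration only, not secure).
P, Q = 61, 53
N = P * Q                      # public modulus
PHI = (P - 1) * (Q - 1)
E = 17                         # public exponent
D = pow(E, -1, PHI)            # private exponent (requires Python 3.8+)


def enc(m: int) -> int:
    """Encrypt a plaintext 0 <= m < N."""
    return pow(m, E, N)


def dec(c: int) -> int:
    """Decrypt a ciphertext with the private exponent."""
    return pow(c, D, N)


def hom_mul(c1: int, c2: int) -> int:
    """Multiply two plaintexts by operating on their ciphertexts only."""
    return (c1 * c2) % N


# A party holding only ciphertexts combines them; only the key holder decrypts.
c = hom_mul(enc(7), enc(11))
product = dec(c)               # equals 7 * 11 as long as the product is below N
```

The party calling `hom_mul` never sees 7, 11 or their product, which mirrors the separation of roles described above: one component computes over encrypted data, another one holds the decryption capability and obtains only the result.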

Figure 4: Comparison over encrypted data with homomorphic encryption

The homomorphic property of the encryption scheme needed for this application depends on the matching algorithm. An option is to resort to a fully homomorphic encryption scheme (FHE) [19] as in the solution described in [43] which uses a variant of a FHE scheme for face-recognition. However, schemes with simpler homomorphic functionalities can also be sufficient (examples can be found in [8, 7]). Since we describe our solutions at the architecture level, we do not need to enter into details regarding the chosen homomorphic scheme. We just need to assume the existence of a homomorphic matching function with the following properties captured by the algorithmic knowledge relations:

(1)

The dependence relations include the following: , ; ; , , . Architecture is defined as follows:

where the set of attestations received by the terminal from the server is: .

In order to prove that the terminal can establish the integrity of the result , we can proceed in two steps, proving first the correctness of and then deriving the correctness of using the properties of homomorphic encryption. The first step relies on the capacities of component and the trust assumptions on components and using rules K1 and K5 respectively.

[Derivation trees: rule K1 and two applications of rule K5]

The second step can be done through the application of the deductive algorithmic knowledge regarding the homomorphic encryption property (with the left hand-side of equation (1)) :

[Derivation tree: rule K]

The desired property is obtained through the application of rules K5 and K exploiting the trust relation between and and the transitivity of equality.

[Derivation trees: rules K5 and K]

As far as privacy is concerned, the main property that is meant to ensure is that no component (except the issuer) has access to the biometric references. Rule HN makes it possible to prove that U, T, and S never get access to br, as in Section 4.1. The same rule can be applied here to prove exploiting the fact that neither nor belong to .

4.3 The Match-On-Card technology

Another solution can be considered when the purpose of the system is authentication rather than identification. In this case, it is not necessary to store a database of biometric reference templates and a (usually unique) reference template can be stored on a smart card. A smart card based privacy preserving architecture has been proposed recently which relies on the idea of using the card not only to store the reference template but also to perform the matching itself. Since the comparison is done inside the card the reference template never leaves the card. In this Match-On-Card (MOC) technology [36, 35, 20] (also called comparison-on-card), the smart card receives the fresh biometric template, carries out the comparison with its reference template, and sends the decision back (as illustrated in Figure 5).
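The division of roles in Match-On-Card can be sketched as follows. The card object keeps the reference template and the threshold internal and only exposes a decision, while the terminal forwards the fresh template and relays the answer. The class and function names, the bit-vector templates and the Hamming-distance criterion are assumptions of this illustration; in particular, Python attribute hiding is only a stand-in for the tamper resistance of a real card.

```python
from typing import Sequence


class MatchOnCard:
    """Toy model of a card that never outputs its reference template."""

    def __init__(self, reference: Sequence[int], threshold: int) -> None:
        self.__reference = tuple(reference)   # stays inside the card
        self.__threshold = threshold

    def verify(self, fresh: Sequence[int]) -> bool:
        # The comparison is carried out inside the card; only the decision leaves.
        if len(fresh) != len(self.__reference):
            return False
        distance = sum(a != b for a, b in zip(self.__reference, fresh))
        return distance <= self.__threshold


def terminal_verify(card: MatchOnCard, fresh: Sequence[int]) -> bool:
    """The terminal sends the fresh template to the card and relays the decision."""
    return card.verify(fresh)
```

The terminal code never reads the reference template: its only interaction with the card is the call to `verify`, so the decision is the sole value flowing back.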

Figure 5: Biometric verification using the Match-On-Card technology

In this architecture, the terminal is assumed to trust the smart card. This trust assumption is justified by the fact that the card is a tamper-resistant hardware element. This architecture is simpler than the previous ones but not always possible in practice (for a combination of technical and economic reasons) and may represent a shift in terms of trust if the smart card is under the control of the user.

More formally, the MOC architecture is composed of a user U, a terminal T, and a card C. The card C attests that the templates and are close (with respect to the threshold ):

Using rule HN, it is easy to show that no component apart from gets access to . The proof of the integrity property relies on the capacities of component and the trust assumption on component using rules K1 and K5 respectively.

5 Extension of the framework to information leakage

5.1 Extension of the architecture language

Motivated by the need to analyse the inherent leakage of the result of a matching between two biometric data in biometric systems (cf. [12, 39, 37]), we now propose an extension of the formal framework sketched in Section 2, in which the information leaking through several executions can be expressed.
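The following sketch illustrates, on a toy matcher, why the decisions of successive matchings can leak a reference template. It assumes bit-vector templates compared by Hamming distance against a threshold, and an observer who can submit chosen templates and see each accept or reject decision. This is one classical style of boundary-search analysis, written here as an illustration; it is not claimed to be the analysis carried out in [12, 39, 37], and the names `make_oracle` and `recover_reference` are hypothetical.

```python
from typing import Callable, List, Sequence

Oracle = Callable[[Sequence[int]], bool]


def make_oracle(reference: Sequence[int], threshold: int) -> Oracle:
    """Matcher that only reveals the decision, never the reference."""
    def oracle(candidate: Sequence[int]) -> bool:
        return sum(a != b for a, b in zip(reference, candidate)) <= threshold
    return oracle


def recover_reference(oracle: Oracle, accepted: Sequence[int]) -> List[int]:
    """Recover the reference from decisions, starting from an accepted template."""
    y = list(accepted)
    # Step 1: flip bits one by one until the first rejection, then undo the
    # last flip; y is then at distance exactly `threshold` from the reference.
    for i in range(len(y)):
        y[i] ^= 1
        if not oracle(y):
            y[i] ^= 1
            break
    else:
        raise ValueError("no rejection observed; boundary not found")
    # Step 2: from the boundary, flipping a wrong bit is accepted (distance
    # decreases) and flipping a correct bit is rejected (distance increases).
    recovered: List[int] = []
    for i in range(len(y)):
        probe = list(y)
        probe[i] ^= 1
        recovered.append(probe[i] if oracle(probe) else y[i])
    return recovered
```

For templates of n bits, this procedure issues at most 2n matching requests, which suggests why bounding the number of times a primitive can be used is relevant to the leakage analysis.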

We highlight the differences with the framework introduced in Section 2 without repeating their common part. The term language we use is now the following.

Functions may take as parameters both variables and constants. Variables can be simple variables or arrays of variables. If is an array, denotes its size.

In this extended framework, in addition to defining a set of primitives, an architecture can also provide a bound on the number of times a primitive can be used.

The superscript notation denotes that a primitive can be carried out at most times by the component(s) – where (: ). We assume that is never equal to 0. denotes the multiplicity of the primitive , if any. The primitive is used to reinitialize the whole system.

As in the initial model, consistency assumptions are made about the architectures to avoid meaningless definitions. For instance, we require that components carry out computations only on the values that they have access to (either through , , or ). We also require that all multiplicities specified by the primitives are identical in a consistent architecture. As a result, a consistent architecture is parametrized by an integer (we note when we want to make this integer explicit).

A key concept for the definition of the semantics is the notion of trace. A trace is a sequence of events and an event is an instantiation of an architectural primitive. The notion of successive sessions is captured by the addition of a dedicated event. A trace of events is said to be compatible with a consistent architecture if all events in (except the computations) can be obtained by instantiation of some architectural primitive from , and if the number of events between two events corresponding to a given primitive is less than the bound specified by the architecture. We denote by the set of traces which are compatible with an architecture .
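The compatibility condition can be sketched as a simple check over a list of events. This is a simplified reading under stated assumptions: events are plain strings naming the primitive they instantiate, a hypothetical marker `"Reset"` stands for the re-initialisation of the system, the bound is interpreted as "at most `bound` occurrences of each primitive between two resets", and the exception made for computation events is ignored.

```python
from collections import Counter
from typing import Iterable, Set

RESET = "Reset"  # hypothetical name for the re-initialisation event


def is_compatible(trace: Iterable[str], primitives: Set[str], bound: int) -> bool:
    """Check a trace against the primitives and multiplicity bound of an architecture."""
    counts: Counter = Counter()
    for event in trace:
        if event == RESET:
            counts.clear()          # a reset starts a fresh counting window
            continue
        if event not in primitives:
            return False            # event does not instantiate any primitive
        counts[event] += 1
        if counts[event] > bound:
            return False            # primitive used more often than allowed
    return True


PRIMITIVES = {"Receive_T_U", "Compute_T"}
ok = is_compatible(["Receive_T_U", "Receive_T_U", RESET, "Receive_T_U"], PRIMITIVES, 2)
too_many = is_compatible(["Receive_T_U"] * 3, PRIMITIVES, 2)
```

With a bound of 2, the first trace is accepted because the reset separates the third use from the first two, whereas the second trace is rejected.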

An event can instantiate variables with specific values . Constants always map to the same value. Let be the set of values the variables and constants can take. The set is defined as where is a specific symbol used to denote that a variable or a constant has not been assigned yet.

The semantics of an architecture follows the approach introduced in [2]. Each component is associated with a state. Each event in a trace of events affects the state of each component involved by the event. The semantics of an architecture is defined as the set of states reachable by compatible traces.

The state of a component is either the state or a pair consisting of: (i) a variable state assigning values to variables, and (ii) a property state defining what is known by a component.

The data structure over a set denotes the finite ordered lists of elements of , denotes the size of the list , and is the empty list. For a non-empty list where , denotes the element for , denotes , and denotes the list . Let denote the global state (i.e. the list of states of all components) defined over and and denote, respectively, the variable and the knowledge state of the component .

The variable state assigns values to variables and to constants (each constant is either undefined or taking a single value). (resp. ) denotes the -th entry of the variable state of (resp. ). The initial state of an architecture is denoted by