Van Wijngaarden grammars, metamorphism and K-ary malwares
Grammars are used to describe the structure of sentences by means of sets of rules, whose form depends on the grammar type. Noam Chomsky proposed a classification of grammars, which led to four well-known types. Yet there are other types of grammars which do not exactly fit in Chomsky’s classification, such as two-level grammars. As their name suggests, the main idea behind these grammars is that they are composed of two grammars.
Van Wijngaarden grammars are one such family. They are interesting for their expressive power, which, under some hypotheses, equals that of the most powerful grammars of Chomsky’s classification, i.e. Type 0 grammars. Another point of interest is their relative conciseness and readability.
Van Wijngaarden grammars can describe the static and dynamic semantics of a language. By using them as a generative engine, it is therefore possible to generate a possibly infinite set of words while ensuring that they all have the same semantics. Moreover, they can describe K-ary codes, by describing the semantics of each component of a code.
Grammars are mostly used to describe languages, such as programming languages, in order to parse them. In this paper, we are not interested in the parsing problem. On the contrary, the objective is to use a grammar for which the word parsing problem is known to be hard (for example NP-hard) or, even better, undecidable. Indeed, if one wants to achieve metamorphism through the use of a grammar, one should avoid grammars for which techniques to build practical word recognizers are known.
Van Wijngaarden grammars differ from those that fall within Chomsky’s classification. Their notation is particular and, above all, their production process is quite different from that of the grammars in Chomsky’s hierarchy. We will see that these grammars may be used as “code translators”, thanks to rules which make them very expressive.
2 Metamorphism vs. Polymorphism
The difference between polymorphism and metamorphism is often unclear, so we briefly recall it in this section.
Polymorphism first appeared to counter the detection scheme of AV companies, which was, and still largely is, based on signature matching. The aim of virus writers was to write a virus whose signature would change each time it evolves. To do so, the virus body is encrypted by an encryption function and decrypted by its decryptor at run time. The key used to encrypt each copy of the virus is changed, so that each copy has a different body (Figure 1).
Another technique is to apply a different encryption scheme to each copy of the code. Of course, such a technique alone is not enough to evade signature detection, as it only shifts the problem: the decryptor itself is a good candidate for a signature. To resolve this, the decryption routine also has to change between copies of the virus. To do so, virus writers include a mutation engine, which is also encrypted during the propagation process, and which is used to randomly generate a new decryption routine, so that it differs from copy to copy (Figure 2).
While in the first case the decryption routine can be used as a signature, this is not the case in the second one. Indeed, the decryption routine changes from mutation to mutation, thanks to the engine.
The mutation engine cannot be used as a signature either, because it is part of the body and is therefore encrypted. The propagation process can be summed up in five steps:
The decryption routine decrypts the encrypted body;
The body is executed;
The code calls the mutation engine (which is decrypted at this stage) to transform the decryption routine;
The code and the mutation engine are encrypted;
The transformed decryption routine and the new encrypted body are then appended onto a new program.
Metamorphism differs from polymorphism in that no decryption routine is used, because there is no encryption process. In other words, while a polymorphic code has to decrypt itself before it can be executed, a metamorphic one is executed directly.
Indeed, a metamorphic engine can be seen as a “semantic translator”. The idea is to rewrite a given code into another syntactically different, yet semantically equivalent one (Figure 3).
Different techniques can be used to build an efficient metamorphic engine. Among these techniques are:
Junk code insertion: junk code is code that the main code does not need in order to perform its task.
Variable renaming: the variables used differ from one version of the code to another.
Control flow modifications: some instructions are independent of each other and can therefore be swapped. Alternatively, instructions can be shuffled and linked by jumps.
In this section, we recall what formal grammars are and the link they have with languages.
3.1 What is a grammar
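Definition 1. Let $\Sigma$ be a finite set of symbols called an alphabet. A formal grammar $G$ is defined by the 4-tuple $G = (N, T, S, P)$ where:
$N$ is a finite set of non-terminal symbols, $N \subseteq \Sigma$;
$T$ is a finite set of terminal symbols, $T \subseteq \Sigma$ and $N \cap T = \emptyset$;
$S \in N$ is the starting symbol of the grammar;
$P$ is a set of production rules.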
Basically, a grammar can be seen as a set of rewriting rules over an alphabet. An alphabet is a finite set of symbols (like ‘a’ or ‘b’). We distinguish two sets of symbols: the set of non-terminal symbols and the set of terminal symbols. Non-terminal symbols are meant to be replaced by the right-hand side of a production rule. Terminal symbols, on the contrary, cannot be modified by a rule. Of course, the two sets are disjoint. A rewriting rule defines how a given sequence of symbols can be rewritten into another sequence of symbols. A special symbol, called the start symbol, specifies where the rewriting must start. This particular symbol belongs to the set of non-terminal symbols. We then write $G = (N, T, S, P)$ to denote the grammar composed of the set of non-terminal symbols $N$, the set of terminal symbols $T$, the starting symbol $S$, and the set of production (rewriting) rules $P$.
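Definition 2. Let $G = (N, T, S, P)$ be a formal grammar. The language described by $G$ is $L(G) = \{ w \in T^{*} \mid S \Rightarrow^{*} w \}$, i.e. the set of words over the terminal symbols that can be derived from the starting symbol.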
Grammars are used to describe languages. A language is a set of words, each word being a sequence of symbols. A word need not have a meaning or a structure. For instance, the grammar $G = (N, T, S, P)$ with $N = \{S\}$, $T = \{a\}$ and $P = \{S \rightarrow a,\ S \rightarrow aS\}$ describes the language $\{a^n \mid n \geq 1\}$ (i.e. the words ‘a’, ‘aa’, ‘aaa’, …). Different notations exist to represent grammars. For convenience, we will write the production rules of a grammar as follows:
S → a
S → aS
When some rules share the same left-hand side, as is the case here, we can merge the different alternatives into one rule, separated by a ‘|’:
S → a | aS
To generate a word from these rules, one proceeds as follows: start from the starting symbol and replace it by one of its alternatives. Then two cases have to be considered:
either a sequence of symbols of the produced sentential form matches the left-hand side of a rule;
or this is not the case and, if the sentential form does not contain any non-terminal symbol, it is a word of the language described by the grammar.
Whenever a sequence of symbols matches the left-hand side of a rule, it is replaced by one of the alternatives of the rule, and the process goes on until no more match is found.
As an example, take the above rule. The starting word is ‘S’. Suppose that ‘S’ produces the sentential form ‘a’. As ‘a’ does not match the left-hand side of any rule at our disposal, and as it is a terminal symbol, it is a word of the language. Now suppose that ‘S’ produces the sentential form ‘aS’. The non-terminal ‘S’ in ‘aS’ matches the left-hand side of one of the rules, so we replace it by one of its alternatives: ‘a’ or ‘aS’. We thus obtain the sentential forms ‘aa’ or ‘aaS’. Hence, the words generated by the above rule are: a, aa, aaa, aaaa, …
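As a worked illustration, take a rule of the form $S \rightarrow a \mid aS$ (an illustrative choice, with $S$ the only non-terminal and $a$ the only terminal). One possible derivation of the word ‘aaa’ then reads:

```latex
S \Rightarrow aS \Rightarrow aaS \Rightarrow aaa
```

Each step rewrites the single non-terminal still present in the sentential form; the derivation stops as soon as the alternative without a non-terminal is chosen.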
Now, if we use x86 instructions as terminal symbols, we can write rules which generate x86 instruction sequences [Fil07b, Zbi09]. From a given sequence of instructions, it is easy to write a grammar which generates it. For instance, the instruction sequence:
can be generated by the following production rules:
The instruction sequence is thus represented by the sequence of non-terminal symbols S T U V, the non-terminal S being rewritten into the sentential form “mov eax, key”, which is then rewritten into the sentential form “mov eax, key xor [ebx], eax”, etc.
Once the production rules are defined, one may want to generate an equivalent sequence of instructions. This is rather easy:
The production rules now generate 8 different sequences, all of which behave identically. In the same manner, one may want to add some junk code. This can be done by adding a new non-terminal which generates “useless” instructions (care must be taken over where these instructions are placed, as they may modify flags which are checked later by another instruction):
For this example, the addition of the new rule, which is composed of only two alternatives, increases the number of instruction sequences that can be generated to 216. This number can easily be made infinite, for example by adding alternatives which generate only junk code, like:
3.2 Classification of grammars
Chomsky provided a well-known classification of grammars [Cho56]. He defined four main types, from Type 0 to Type 3, each type defining a set of languages which is a subset of the set described by any lower-numbered type. In other words, Type 0 grammars are the most general, while Type 3 grammars are the most restrictive. Among these grammars, Type 2 grammars, also called context-free grammars, are the most popular. They describe context-free languages. Most programming languages are described by such grammars. The rules of Type 2 grammars have the following form:
$U \rightarrow V$
where $U$ is a single non-terminal symbol and $V$ belongs to $(N \cup T)^{*}$.
In other words, $U$ can be rewritten as a possibly empty sequence of terminal and non-terminal symbols. The name context-free comes from the fact that the left-hand side of a rewriting rule is a single non-terminal, so the rewriting does not depend on what may be next to it in a sentential form, unlike in Type 0 and Type 1 grammars. We have the relation Type 0 $\supset$ Type 1 $\supset$ Type 2 $\supset$ Type 3. Thus, Type 0 grammars can define all the languages that are definable by Type 1, Type 2 or Type 3 grammars.
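The rule shapes behind this hierarchy can be checked mechanically. The following is a minimal sketch, not a definitive classifier: the function name `chomsky_type` and the rule encoding (pairs of symbol tuples) are assumptions made for illustration, Type 3 is taken to mean right-linear rules, and Type 1 is approximated by the non-contracting condition (the left-hand side is never longer than the right-hand side).

```python
def chomsky_type(rules, nonterminals):
    """Return the largest type number (3, 2, 1 or 0) whose rule shape every rule satisfies.

    rules: list of (lhs, rhs) pairs, each side being a tuple of symbols.
    nonterminals: set of the non-terminal symbols; every other symbol is terminal.
    """
    def single_nonterminal(lhs):
        return len(lhs) == 1 and lhs[0] in nonterminals

    def right_linear(lhs, rhs):
        # Accepted shapes: A -> (empty), A -> a, A -> aB
        if not single_nonterminal(lhs) or len(rhs) > 2:
            return False
        if len(rhs) == 2:
            return rhs[0] not in nonterminals and rhs[1] in nonterminals
        return all(sym not in nonterminals for sym in rhs)

    if all(right_linear(lhs, rhs) for lhs, rhs in rules):
        return 3
    if all(single_nonterminal(lhs) for lhs, _ in rules):
        return 2
    # Assumption: Type 1 is tested through the non-contracting form.
    if all(len(lhs) <= len(rhs) for lhs, rhs in rules):
        return 1
    return 0


regular = [(("S",), ("a",)), (("S",), ("a", "S"))]
context_free = [(("S",), ("a", "S", "b")), (("S",), ("a", "b"))]
monotonic = [
    (("S",), ("a", "b", "c")),
    (("S",), ("a", "S", "Q")),
    (("b", "Q", "c"), ("b", "b", "c", "c")),
    (("c", "Q"), ("Q", "c")),
]
unrestricted = [(("S",), ("a", "S", "b")), (("a", "S", "b"), ("a",))]
```

The function reports the most restrictive shape that all rules share, which mirrors the inclusion chain: a grammar accepted at one level is also a grammar of every lower-numbered type.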
4 Van Wijngaarden grammars
4.1 Context-sensitivity restrictions
Context-sensitive languages are more complex than context-free languages because one part of the string may “interact” with the structure of the other parts of the string. Once a non-terminal symbol has been produced in a sentential form in a context-free grammar, its further development is independent of the rest of the sentential form, whereas a non-terminal symbol in a sentential form of a context-sensitive grammar has to look at its neighbours, on its left and on its right, to see which production rules are allowed for it. So a context-free grammar cannot express some “long-range” relations.
Yet these relations are often valuable, as they allow some fundamental properties of words to be described (such as using only variables that have been declared). Programming languages are usually context-sensitive. For example, a user is usually not allowed to use a variable that has not been created. As it is not possible to express such properties through a context-free grammar, the solution used most of the time is to describe the structure of the correct words by a context-free grammar. The properties are then checked by a separate program after the word has been recognized by the grammar (though it may not belong to the “real” language). However, this solution is not very satisfactory, as the point of using a grammar is to have a (formal) description of all the properties of the language.
One can ask why a context-sensitive grammar is not used to describe the language. Actually, this would pose some problems. Indeed, in general, context-sensitive languages cannot be parsed efficiently. Moreover, even though context-sensitive grammars have the power to express some long-range relations in a sentential form, they do not do it in a way that is easily understandable.
It would also make sense that, after having written a grammar for $a^n b^n c^n$, writing one for $a^n b^n c^n d^n$ would work the same way. But this is not the case: the grammar for $a^n b^n c^n d^n$ is more complex. The reason is that, to express a long-range relation, information has to flow through the sentential form by means of the non-terminal symbols (which look at their neighbours to rewrite one sentential form into another). This requires almost all rules to know something about almost all the other rules.
Several grammar forms which make these relations more readable and easier to construct have been created. Among them are Van Wijngaarden grammars.
4.2 VW grammar definition
Basically, a VW grammar can be seen as the composition of two context-free grammars (that is why such grammars are also called two-level grammars). The first context-free grammar is used to generate a set of terminal symbols which will act as non-terminals for the second context-free grammar.
Before going further, a few terms have to be introduced.
A protonotion is a sequence of small syntactic marks;
A metanotion is a sequence of big syntactic marks which is defined in a metarule;
A hypernotion is a possibly empty sequence of metanotions and protonotions;
A metarule defines a metanotion as a possibly empty sequence of hypernotions;
A hyperrule defines a sequence of hypernotions as another sequence of hypernotions, separated by commas. A hyperrule actually represents a possibly infinite set of production rules;
A VW grammar is defined by a set of metarules (or metaproduction rules) and a set of hyperrules ;
Whenever a metanotion appears more than once in a hyperrule, all of its occurrences have to be replaced consistently (by the same word) throughout the rule. This is called the Uniform Replacement Rule.
Definition 3 ([vWMP77]). A Van Wijngaarden grammar is a grammar $G = (M, V, N, T, R_M, R_V, S)$, whose two sets of rules $R_M$ and $R_V$ are described below.
The first set of rules are the metarules. They represent a modified grammar in which the non-terminals are replaced by metanotions, and the terminals are replaced by protonotions. The second set of rules are the hyperrules. They represent some possibly infinite set of production rules.
In order to distinguish between the metarules, the hyperrules and the production rules, the production symbol is changed. Instead of the symbol ‘→’ we use ‘::’ for the metarules and ‘:’ for the hyperrules. To separate the different alternatives of a rule, the symbol ‘;’ is used instead of ‘|’. In metarules, members are separated by a blank, and in hyperrules, by a comma. The metanotions have to be chosen wisely, so that no sequence of metanotions can also be read as a different sequence of metanotions. For instance, if we have a metanotion X and a metanotion Y, then the metanotion XY should be avoided, as it would induce some ambiguity.
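The constraint on metanotion names can be tested automatically. The sketch below uses the Sardinas–Patterson test for unique decodability, which is one possible way to check it and is not taken from the paper; the function names are hypothetical.

```python
def left_quotient(prefixes, words):
    """Return every w such that p + w is in `words` for some p in `prefixes`."""
    return {w[len(p):] for p in prefixes for w in words if w.startswith(p)}


def names_are_unambiguous(names):
    """Sardinas-Patterson test: True if no sequence of names reads as two different sequences."""
    code = set(names)
    dangling = left_quotient(code, code) - {""}
    seen = set()
    while dangling and frozenset(dangling) not in seen:
        if dangling & code or "" in dangling:
            return False  # some concatenation can be segmented in two ways
        seen.add(frozenset(dangling))
        dangling = left_quotient(code, dangling) | left_quotient(dangling, code)
    return True
```

With the names X, Y and XY the test reports an ambiguity, since the text XY can be read either as the single metanotion XY or as X followed by Y, whereas a set of names such as N and A passes.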
To make it clearer, here is a VW grammar which describes the language $a^n b^n c^n$ with $n \geq 1$ (i.e. abc, aabbcc, aaabbbccc, …):
N :: i ; i N.
A :: a ; b ; c.
S : aN, bN, cN.
Ai : A symbol.
AiN : A symbol, AN.
The first two rules are the metarules, and the last three are the hyperrules. The metanotions are N and A. The hypernotions are S, aN, bN, cN, Ai, A symbol, AiN and AN.
In the definition of a VW grammar, a member is a terminal symbol if it ends in ‘symbol’ (like ‘b symbol’ for the terminal symbol ‘b’); otherwise it is a non-terminal. So here the rule “Ai : A symbol.” produces the terminal symbols a, b and c.
The metanotion N produces an infinite set of i’s. The i’s act as a counter for the number of letters to be produced. Indeed, as we said, the hyperrules describe a possibly infinite set of production rules. For instance here, the rule “AiN : A symbol, AN.” actually produces the rules:
aii : a symbol, ai. aiii : a symbol, aii. …
bii : b symbol, bi. biii : b symbol, bii. …
cii : c symbol, ci. ciii : c symbol, cii. …
To obtain these sets, the metanotion A is replaced consistently by all the words it can generate. Here these are a, b and c. So we obtain the following three rules:
aiN : a symbol, aN.
biN : b symbol, bN.
ciN : c symbol, cN.
Then the same thing is done with the metanotion N. As it generates the infinite language $i^{+}$ (i.e. ‘i’, ‘ii’, ‘iii’, …), we obtain the above three infinite sets of production rules.
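The expansion of a hyperrule into ordinary production rules can be sketched in a few lines. This is a minimal illustration under stated assumptions: metanotions are single upper-case letters, the metanotion A is assumed to produce a, b and c, the language of N is truncated to a finite list of words (a real metanotion may produce infinitely many), and the name `expand` is hypothetical.

```python
from itertools import product


def expand(lhs, rhs, meta_words):
    """Yield the production rules obtained from one hyperrule by uniform replacement.

    lhs: left-hand side of the hyperrule, as a string.
    rhs: list of members of the right-hand side, as strings.
    meta_words: dict mapping each metanotion to a finite list of words it produces.
    """
    used = sorted({ch for part in [lhs, *rhs] for ch in part if ch in meta_words})
    for choice in product(*(meta_words[m] for m in used)):
        subst = dict(zip(used, choice))

        def substitute(text):
            # The same word replaces every occurrence of a given metanotion.
            return "".join(subst.get(ch, ch) for ch in text)

        yield substitute(lhs), [substitute(member) for member in rhs]


# Assumed, truncated metalanguage: A -> a, b, c and N -> i, ii
meta_words = {"A": ["a", "b", "c"], "N": ["i", "ii"]}
expanded = list(expand("AiN", ["A symbol", "AN"], meta_words))
```

Because one substitution is built per combination of words and then applied to the whole rule, an occurrence of A on the left can never be paired with a different letter on the right, which is exactly what the Uniform Replacement Rule demands.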
4.3 Place in Chomsky’s hierarchy
By construction, Van Wijngaarden grammars do not belong to any category of Chomsky’s classification. However, one can compare the expressive power of a Van Wijngaarden grammar with that of the different types of Chomsky’s hierarchy. In terms of expressive power, they are in fact equivalent to Type 0 grammars. In a sense, they are even more powerful than Type 0 grammars, since they can handle infinite symbol sets. For instance, as shown in Figure 4, a Van Wijngaarden grammar can produce the set:
are different symbols
A Type 0 grammar cannot generate this set since its number of (terminal) symbols is infinite.
Sintzoff [Sin67] showed that there exists a Van Wijngaarden grammar for every semi-Thue system (a semi-Thue system is a string rewriting system; it is equivalent to Chomsky’s Type 0 grammars). Van Wijngaarden [Wij74] showed that a Van Wijngaarden grammar can simulate a Turing machine. Thus, both proved that these grammars are at least as powerful as Type 0 grammars (i.e. that they are Turing complete). As a consequence, parsing these grammars is undecidable in general. As a side note, if the first set of rules, i.e. the metarules, does not generate an infinite language, then the Van Wijngaarden grammar is equivalent to a standard context-free grammar. Indeed, if the language generated by a metarule is finite, one can write as many production rules as there are words in the language, and the consistent substitution can be “emulated” by the addition of rules which produce only one sentence. For instance, the grammar:
ensures that the opening bracket matches the closing one. By increasing the number of rules of the grammar, we can express more and more context-sensitive conditions. It follows that if we have an infinite collection of context-free grammar rules, we can express any number of context-sensitive conditions, and so we can achieve full context-sensitivity. As said at the beginning of this section, this is the idea behind Van Wijngaarden grammars: a VW grammar can be seen as the composition of two context-free grammars. The first context-free grammar is used to generate a language which can in turn be described by the second context-free grammar. Nonetheless, as mentioned in the previous section, it is possible to produce every word of the languages they may describe.
4.4 VW grammars and word generation
Dick Grune [Gru84] wrote a program which can produce all the sentences of a Van Wijngaarden grammar. The program reads a grammar on its input, and then the generation of the words starts. If the input grammar describes an infinite language, then an infinite number of words will be produced. We modified some parts of this program in order to implement our mutation engine, and we have written a VW grammar based on the x86 instruction set.
It is not possible to generate the words of a Van Wijngaarden grammar in the same way as those of a context-free grammar. Indeed, to generate a terminal production for a context-free language, we start from the start symbol. Intermediate results of a production (sentential forms) are stored in a queue. To rewrite a sentential form, we first consider the first sentential form in the queue. Then, we search for a sequence of symbols which matches the left-hand side of a production rule. If such a match is found, the sentential form is replaced by all its alternatives, by making as many copies as there are alternatives, and each copy is appended at the end of the queue. If no match is found, the sentential form is a terminal production.
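The queue-based production process for a context-free grammar can be sketched as follows. This is a minimal illustration and not Grune's program: the function name `generate`, the dictionary encoding of the rules, the choice to rewrite the leftmost non-terminal and the toy rule S → a | aS are all assumptions.

```python
from collections import deque


def generate(rules, start, limit):
    """Return the first `limit` terminal productions found by a breadth-first search.

    rules: dict mapping a non-terminal to the list of its alternatives,
           each alternative being a list of symbols.
    """
    words = []
    queue = deque([(start,)])
    while queue and len(words) < limit:
        form = queue.popleft()
        for pos, symbol in enumerate(form):
            if symbol in rules:
                # One copy of the sentential form per alternative, appended to the queue.
                for alternative in rules[symbol]:
                    queue.append(form[:pos] + tuple(alternative) + form[pos + 1:])
                break
        else:
            # No left-hand side matches: the form is a terminal production.
            words.append("".join(form))
    return words


toy_rules = {"S": [["a"], ["a", "S"]]}  # assumed toy grammar: S -> a | aS
first_words = generate(toy_rules, "S", 4)
```

The breadth-first order matters: a depth-first strategy could follow the recursive alternative forever and never output a word, whereas the queue guarantees that every word is eventually reached. For an ambiguous grammar the same word may be listed more than once.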
This process cannot be applied to Van Wijngaarden grammars, as there may be an infinite number of left-hand sides resulting from a single hyperrule. It would require scanning all the possible left-hand sides of the hyperrule, so one may have to look at an infinite number of left-hand sides to know whether there is a possible match. In theory this takes an infinite amount of time, but a solution to this problem can be found. The main issue comes from the fact that a metanotion can generate an infinite language (i.e. an infinite number of words). What we want is to find the terminal productions of the metanotions occurring in the left-hand side of the hyperrule such that, after substitution, the left-hand side corresponds to the sentential form. So we parse the sentential form according to the “metagrammar”, with the left-hand side of the hyperrule as the starting form. When the parsing is done, we can deduce which terminal productions have to be used to match the sentential form. As the metagrammar is a context-free grammar, it can be parsed efficiently, so the problem can be solved. With this mechanism, a member is considered to be a terminal symbol if no match is found among the left-hand sides of the hyperrules. It is therefore no longer necessary to append the mark “symbol” to a member to make it a terminal symbol.
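The matching of a notion against the left-hand side of a hyperrule can be illustrated with a small backtracking matcher. It is a sketch under several assumptions: the metarules N :: i ; i N and A :: a ; b ; c are taken as a toy metagrammar, no metanotion produces the empty word, the metagrammar is not left-recursive, and naive backtracking stands in for the efficient context-free parser mentioned above. The names `derives` and `match` are hypothetical.

```python
from functools import lru_cache

# Assumed toy metagrammar: N :: i ; i N.   A :: a ; b ; c.
METARULES = {
    "N": (("i",), ("i", "N")),
    "A": (("a",), ("b",), ("c",)),
}


@lru_cache(maxsize=None)
def derives(symbols, text):
    """True if the sequence of metagrammar symbols can produce exactly `text`."""
    if not symbols:
        return text == ""
    head, rest = symbols[0], symbols[1:]
    if head in METARULES:
        return any(derives(alt + rest, text) for alt in METARULES[head])
    return text.startswith(head) and derives(rest, text[len(head):])


def match(pattern, text, binding=None):
    """Yield each binding (metanotion -> word) under which `pattern` becomes `text`."""
    binding = binding or {}
    if not pattern:
        if text == "":
            yield dict(binding)
        return
    head, rest = pattern[0], pattern[1:]
    if head not in METARULES:
        # Small syntactic mark: it must appear literally.
        if text.startswith(head):
            yield from match(rest, text[len(head):], binding)
    elif head in binding:
        # Uniform replacement: reuse the word chosen for the earlier occurrence.
        word = binding[head]
        if text.startswith(word):
            yield from match(rest, text[len(word):], binding)
    else:
        for end in range(1, len(text) + 1):
            word = text[:end]
            if derives((head,), word):
                yield from match(rest, text[end:], {**binding, head: word})


bindings = list(match(("A", "i", "N"), "aiii"))
```

Here the notion ‘aiii’ matches the left-hand side ‘AiN’ only with A replaced by ‘a’ and N by ‘ii’, so a single finite search replaces the scan over infinitely many expanded left-hand sides.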
Now we know how to produce words from a VW grammar. We also know that VW grammars can handle context-sensitivity. We now want to write rules which transform one sentential form into another while preserving its semantics (the information carried by its context). To do so, we slightly modified the mechanism of the grammar: the word we want to transform is used as the starting word, and we do not try to parse it. In fact, a sort of parsing is handled by the way the production process works. Moreover, we use a random generator during the production process, so that the production can randomly generate any word of the language described by the grammar. As an example, take these metarules:
One metanotion represents an address or a hexadecimal number, another represents three instructions (mov, push and pop), and so on.
The hyperrules :
modify an instruction into a readable sentence. For example the word “mov eax, 0” will be replaced by “move 0 in eax”, because of the first hyperrule.
We can add hyperrules which transform these sentences into other, equivalent sentences:
Now the sentential form obtained before (“move 0 in eax”) can be replaced either by “mov, eax, ‘,’, 0” or by “save 0, restore eax”. If the first alternative is selected, then the generation stops. Indeed, the sentential form is composed of “mov”, “eax”, “ ‘,’ ” and “0”, and none of these words matches the left-hand side of a hyperrule. On the other hand, if the second alternative is selected, then the generation continues, and both parts of the sentential form, “save 0” and “restore eax”, can be replaced independently of each other. Thus, the sentential form “save 0” can be replaced by “push, 0” (so the generation stops) or by “subtract 4 from esp, move 0 in [ esp ]”, etc.
The metarules used above can be made more sophisticated so that they generate an infinite set of instructions, and so that the hyperrules generate an infinite number of production rules. Hence we can have an (infinite) rewriting system handling an infinite number of instructions.
5 K-ary viruses
5.1 What is a K-ary virus
Definition 4 ([Fil07a]).
A K-ary virus is composed of a family of k files (some of which may not be executable), whose union constitutes a computer virus and performs an offensive action that is equivalent to that of a true virus. Such a code is said to be sequential if the k constituent parts act strictly one after another. It is said to be parallel if the k parts execute simultaneously.
The interest of combined viruses lies in the fact that the viral information is split into various parts which, taken separately, can have a non-malicious behaviour. Because of this separation of the viral information, we are out of the scope of Cohen’s model. His model supposes that a virus is made of a unique sequence of symbols, which is not the case with combined viruses.
Two main classes of K-ary viruses have been identified [Fil07a]:
Class 1 codes. These are the codes that work sequentially.
This class is composed of three subclasses:
Subclass A. Each code refers to, or contains a reference to, the others. Thus, the detection of one of these codes leads to the detection of all of the others.
Subclass B. None of the codes refers to, or contains a reference to, the others. Thus, detecting one code does not affect the other codes. The detected code can be replaced by another code.
Subclass C. The dependence between the codes is directed. Thus, detecting one code does not affect the codes which come before it in the sequential execution.
Class 2 codes. These are the codes that work in parallel. This class is composed of the same three subclasses as Class 1.
5.2 Van Wijngaarden representation
The power of a K-ary virus lies in the fact that it is split into several parts. Thus, one can see a K-ary virus as a distributed program whose global action is the same as that of a virus. From the point of view of formal grammars, it seems natural that such a program can be described by them.
Let be two files, and a virus. We define the relation by
The operator is a selection function, whose result is a set of words over its input. The idea is that it selects some parts of its inputs to extract a word from them, and if one of the results is in the language describing the virus, then its inputs form a K-ary virus.
The different parts of a K-ary virus can each be described separately by a grammar. If we put all these parts together, we have the description of the virus as a whole. Thus a Van Wijngaarden grammar can be used to define a K-ary virus. The starting symbol produces all the parts of the K-ary virus, then the different parts are recognized by some hyperrules of the grammar. The consistent substitution allows some information to be shared between the parts while they are created. As an example, for a K-ary virus with K=3, the rules would look like:
Once the combined virus is produced (that is, once we have different files that contain the elements of the virus), each part may mutate on its own. While K-ary malware has been formally defined [Fil07a] and its detection addressed, our approach makes it possible to formalize the automatic generation of K-ary malware while providing a constructive proof.
Van Wijngaarden grammars are very powerful, and can be easily understood by a human. The power of these grammars comes from the two context-free grammars that are jointly used, coupled with the uniform replacement rule, which allows context-sensitive conditions to be expressed. It is thus possible to handle undecidable problems suitable for designing undetectable malware in a far easier way than by considering Type 0 formal grammars directly.
K-ary viruses have been defined through the use of a Van Wijngaarden grammar. The main idea is that the alternatives of the starting symbol are themselves the starting symbols of grammars, each describing one part (file) that the virus is composed of. This formal definition yields a constructive method to generate those codes automatically.
The author would like to thank Eric Filiol for fruitful discussions about formal grammars and for his active support of this work, as well as all the people at the Operational Cryptology and Virology lab for the wonderful, stimulating and friendly environment they create.
- [Cho56] Noam Chomsky. Three models for the description of language. IRE Transactions on Information Theory, 2(3):113–124, 1956.
- [Fil07a] Eric Filiol. Formalisation and implementation aspects of K-ary (malicious) codes. Journal in Computer Virology, 3(2):75–86, 2007.
- [Fil07b] Eric Filiol. Metamorphism, formal grammars and undecidable code mutation. International Journal of Computer Science, 2(1):70–75, 2007.
- [Gru84] Dick Grune. How to produce all sentences from a two-level grammar. Inf. Process. Lett., 19(4):181–185, 1984.
- [Sin67] M. Sintzoff. Existence of a van Wijngaarden syntax for every recursively enumerable set. Annales de la Société Scientifique de Bruxelles, 81(II):115–118, 1967.
- [vWMP77] A. van Wijngaarden, B. J. Mailloux, J. E. L. Peck, C. H. A. Koster, M. Sintzoff, C. H. Lindsey, L. G. L. T. Meertens, and R. G. Fisker. Revised report on the algorithmic language ALGOL 68. SIGPLAN Not., 12(5):1–70, 1977.
- [Wij74] Adriaan van Wijngaarden. The generative power of two-level grammars. In Proceedings of the 2nd Colloquium on Automata, Languages and Programming, pages 9–16, London, UK, 1974. Springer-Verlag.
- [Zbi09] Pavel V. Zbitskiy. Code mutation techniques by means of formal grammars and automatons. Journal in Computer Virology, 5(3):199–207, 2009.