
1.2  Formal Languages and Grammars

      Languages
      Grammars
      Derivations
      Derivation Graphs
      Leftmost Derivations
      Hierarchy of Grammars

The universe of strings is a useful medium for the representation of information as long as there exists a function that provides the interpretation for the information carried by the strings. An interpretation is just the inverse of the mapping that a representation provides, that is, an interpretation is a function g from S* to D for some alphabet S and some set D. The string 111, for instance, can be interpreted as the number one hundred and eleven represented by a decimal string, as the number seven represented by a binary string, and as the number three represented by a unary string.
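
As a small illustration in Python (not part of the text's formalism), the three interpretations of the string 111 can be spelled out with the built-in conversions:

    s = "111"
    assert int(s, 10) == 111   # interpreted as a decimal string
    assert int(s, 2) == 7      # interpreted as a binary string
    assert len(s) == 3         # interpreted as a unary string: the count of 1's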

The parties communicating a piece of information do the representing and interpreting. The representation is provided by the sender, and the interpretation is provided by the receiver. The process is the same no matter whether the parties are human beings or programs. Consequently, from the point of view of the parties involved, a language can be just a collection of strings because the parties embed the representation and interpretation functions in themselves.

 Languages

In general, if S is an alphabet and L is a subset of S*, then L is said to be a language over S, or simply a language if S is understood. Each element of L is said to be a sentence or a word or a string of the language.

Example 1.2.1    {0, 11, 001}, {e, 10}, and {0, 1}* are subsets of {0, 1}*, and so they are languages over the alphabet {0, 1}.

The empty set Ø and the set {e} are languages over every alphabet. Ø is a language that contains no string. {e} is a language that contains just the empty string.  ***

The union of two languages L1 and L2, denoted L1  U L2, refers to the language that consists of all the strings that are either in L1 or in L2, that is, to { x | x is in L1 or x is in L2 }. The intersection of L1 and L2, denoted L1  /~\ L2, refers to the language that consists of all the strings that are both in L1 and L2, that is, to { x | x is in L1 and in L2 }. The complementation of a language L over S, or just the complementation of L when S is understood, denoted L̄, refers to the language that consists of all the strings over S that are not in L, that is, to { x | x is in S* but not in L }.

Example 1.2.2    Consider the languages L1 = {e, 0, 1} and L2 = {e, 01, 11}. The union of these languages is L1  U L2 = {e, 0, 1, 01, 11}, their intersection is L1  /~\ L2 = {e}, and the complementation of L1 is {00, 01, 10, 11, 000, 001, . . . }.

Ø  U L = L for each language L. Similarly, Ø  /~\ L = Ø for each language L. On the other hand, the complementation of Ø is S*, and the complementation of S* is Ø, for each alphabet S.  ***
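
These operations can be tried out directly when a language is represented as a finite set of Python strings, with "" standing for e. The following is a minimal sketch; the helper name strings_up_to is illustrative, and the complementation is taken only relative to a bounded portion of S*, since S* itself is infinite.

    from itertools import product

    def strings_up_to(alphabet, n):
        """All strings over the given alphabet of length at most n (a finite slice of S*)."""
        return {"".join(p) for k in range(n + 1) for p in product(alphabet, repeat=k)}

    L1 = {"", "0", "1"}                  # the languages of Example 1.2.2
    L2 = {"", "01", "11"}

    union = L1 | L2                      # {e, 0, 1, 01, 11}
    intersection = L1 & L2               # {e}
    complement_of_L1 = strings_up_to("01", 3) - L1   # {00, 01, 10, 11, 000, ...}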

The difference of L1 and L2, denoted L1 - L2, refers to the language that consists of all the strings that are in L1 but not in L2, that is, to { x | x is in L1 but not in L2 }. The cross product of L1 and L2, denoted L1 × L2, refers to the set of all the pairs (x, y) of strings such that x is in L1 and y is in L2, that is, to the relation { (x, y) | x is in L1 and y is in L2 }. The composition of L1 with L2, denoted L1L2, refers to the language { xy | x is in L1 and y is in L2 }.

Example 1.2.3    If L1 = {e, 1, 01, 11} and L2 = {1, 01, 101} then L1 - L2 = {e, 11} and L2 - L1 = {101}.

On the other hand, if L1 = {e, 0, 1} and L2 = {01, 11}, then the cross product of these languages is L1 × L2 = {(e, 01), (e, 11), (0, 01), (0, 11), (1, 01), (1, 11)}, and their composition is L1L2 = {01, 11, 001, 011, 101, 111}.

L - Ø = L, Ø - L = Ø, ØL = Ø, and {e}L = L for each language L.  ***
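
Under the same set-of-strings representation, difference, cross product, and composition are equally direct; the values below match Example 1.2.3, and the helper names are illustrative.

    def cross_product(L1, L2):
        """All pairs (x, y) with x in L1 and y in L2."""
        return {(x, y) for x in L1 for y in L2}

    def compose(L1, L2):
        """The composition L1L2: every string xy with x in L1 and y in L2."""
        return {x + y for x in L1 for y in L2}

    # Difference is plain set difference:
    assert {"", "1", "01", "11"} - {"1", "01", "101"} == {"", "11"}
    assert compose({"", "0", "1"}, {"01", "11"}) == {"01", "11", "001", "011", "101", "111"}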

Lⁱ will also be used to denote the composition of i copies of a language L, where L⁰ is defined as {e}. The set L⁰  U L¹  U L²  U L³ . . . , called the Kleene closure or just the closure of L, will be denoted by L*. The set L¹  U L²  U L³ . . . , called the positive closure of L, will be denoted by L⁺.

Lⁱ consists of those strings that can be obtained by concatenating i strings from L. L* consists of those strings that can be obtained by concatenating an arbitrary number of strings from L.

Example 1.2.4    Consider the pair of languages L1 = {e, 0, 1} and L2 = {01, 11}. For these languages L1² = {e, 0, 1, 00, 01, 10, 11}, and L2³ = {010101, 010111, 011101, 011111, 110101, 110111, 111101, 111111}. In addition, e is in L1*, in L1⁺, and in L2* but not in L2⁺.  ***
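
The power Lⁱ and a bounded portion of the closure L* can be computed by repeated composition. Again a minimal sketch with illustrative helper names; only finitely many concatenations are taken, since L* is in general infinite.

    def power(L, i):
        """L composed with itself i times; the power 0 of L is {e}."""
        result = {""}
        for _ in range(i):
            result = {x + y for x in result for y in L}
        return result

    def closure_up_to(L, k):
        """The strings of L* obtainable by concatenating at most k strings from L."""
        result = set()
        for i in range(k + 1):
            result |= power(L, i)
        return result

    # Matches Example 1.2.4:
    assert power({"", "0", "1"}, 2) == {"", "0", "1", "00", "01", "10", "11"}
    assert len(power({"01", "11"}, 3)) == 8
    assert "" in closure_up_to({"01", "11"}, 3)   # e is in L2* even though e is not in L2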

The operations above apply in a similar way to relations in S* × D*, when S and D are alphabets. Specifically, the union of the relations R1 and R2, denoted R1  U R2, is the relation { (x, y) | (x, y) is in R1 or in R2 }. The intersection of R1 and R2, denoted R1  /~\ R2, is the relation { (x, y) | (x, y) is in R1 and in R2 }. The composition of R1 with R2, denoted R1R2, is the relation { (x1x2, y1y2) | (x1, y1) is in R1 and (x2, y2) is in R2 }.

Example 1.2.5    Consider the relations R1 = {(e, 0), (10, 1)} and R2 = {(1, e), (0, 01)}. For these relations R1  U R2 = {(e, 0), (10, 1), (1, e), (0, 01)}, R1  /~\ R2 = Ø, R1R2 = {(1, 0), (0, 001), (101, 1), (100, 101)}, and R2R1 = {(1, 0), (110, 1), (0, 010), (010, 011)}.  ***

The complementation of a relation R in S* × D*, or just the complementation of R when S and D are understood, denoted R̄, is the relation { (x, y) | (x, y) is in S* × D* but not in R }. The inverse of R, denoted R⁻¹, is the relation { (y, x) | (x, y) is in R }. R⁰ = {(e, e)}. Rⁱ = Rⁱ⁻¹R for i >= 1.

Example 1.2.6    If R is the relation {(e, e), (e, 01)}, then R⁻¹ = {(e, e), (01, e)}, R⁰ = {(e, e)}, and R² = {(e, e), (e, 01), (e, 0101)}.  ***
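
The corresponding operations on relations can be sketched the same way, with a relation represented as a set of pairs of strings; the values below match Examples 1.2.5 and 1.2.6.

    def compose_relations(R1, R2):
        """R1R2: pair every element of R1 with every element of R2, concatenating componentwise."""
        return {(x1 + x2, y1 + y2) for (x1, y1) in R1 for (x2, y2) in R2}

    def inverse(R):
        """The inverse of R: swap the two components of every pair."""
        return {(y, x) for (x, y) in R}

    R1, R2 = {("", "0"), ("10", "1")}, {("1", ""), ("0", "01")}
    assert compose_relations(R1, R2) == {("1", "0"), ("0", "001"), ("101", "1"), ("100", "101")}
    assert inverse({("", ""), ("", "01")}) == {("", ""), ("01", "")}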

A language that can be defined by a formal system, that is, by a system that has a finite number of axioms and a finite number of inference rules, is said to be a formal language.

 Grammars

It is often convenient to specify languages in terms of grammars. The advantage in doing so arises mainly from the usage of a small number of rules for describing a language with a large number of sentences. For instance, the possibility that an English sentence consists of a subject phrase followed by a predicate phrase can be expressed by a grammatical rule of the form <sentence> --> <subject><predicate>. (The names in angular brackets are assumed to belong to the grammar metalanguage.) Similarly, the possibility that the subject phrase consists of a noun phrase can be expressed by a grammatical rule of the form <subject> --> <noun>. In a similar manner it can also be deduced that "Mary sang a song" is a possible sentence in the language described by the following grammatical rules.

   <sentence>     -->  <subject><predicate>
   <subject>      -->  <noun>
   <predicate>    -->  <verb><article><noun>
   <noun>         -->  <name>
   <noun>         -->  <string>
   <name>         -->  <u-character><string>
   <string>       -->  <string><character>
   <string>       -->  <character>
   <character>    -->  a
         . . .
   <character>    -->  z
   <u-character>  -->  A
         . . .
   <u-character>  -->  Z
   <verb>         -->  sang
   <article>      -->  a

The grammatical rules above also allow English sentences of the form "Mary sang a song" for names other than Mary. On the other hand, the rules allow non-English sentences like "Mary sang a Mary," and do not allow English sentences like "Mary read a song." Therefore, the set of grammatical rules above constitutes an incomplete grammatical system for specifying the English language.

For the investigation conducted here it is sufficient to consider only grammars that consist of finite sets of grammatical rules of the previous form. Such grammars are called Type 0 grammars, or phrase structure grammars, and the formal languages that they generate are called Type 0 languages.

Strictly speaking, each Type 0 grammar G is defined as a mathematical system consisting of a quadruple <N, S, P, S>, where

N    is an alphabet, whose elements are called nonterminal symbols.
S    is an alphabet disjoint from N, whose elements are called terminal symbols.
P    is a relation of finite cardinality on (N  U S)*, whose elements are called production rules. Moreover, each production rule (a, b) in P, denoted a --> b, must have at least one nonterminal symbol in a. In each such production rule, a is said to be the left-hand side of the production rule, and b is said to be the right-hand side of the production rule.
S    is a symbol in N called the start, or sentence, symbol.
Example 1.2.7    <N, S, P, S> is a Type 0 grammar if N = {S}, S = {a, b}, and P = {S --> aSb, S --> e}. By definition, the grammar has a single nonterminal symbol S, two terminal symbols a and b, and two production rules S --> aSb and S --> e. Both production rules have a left-hand side that consists only of the nonterminal symbol S. The right-hand side of the first production rule is aSb, and the right-hand side of the second production rule is e.

<N1, S1, P1, S> is not a grammar if N1 is the set of natural numbers, or S1 is empty, because N1 and S1 have to be alphabets.

If N2 = {S}, S2 = {a, b}, and P2 = {S --> aSb, S --> e, ab --> S} then <N2, S2, P2, S> is not a grammar, because ab --> S does not satisfy the requirement that each production rule must contain at least one nonterminal symbol on the left-hand side. 
***
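
As a concrete illustration, the grammar of Example 1.2.7 can be encoded directly as a quadruple and the condition on left-hand sides checked mechanically. This is only a sketch under an assumed encoding: nonterminal and terminal symbols are single-character strings, a production rule is a pair (left-hand side, right-hand side), "" stands for e, and the names T, start, and is_valid_grammar are illustrative (T plays the role of the terminal alphabet).

    N = {"S"}
    T = {"a", "b"}
    P = {("S", "aSb"), ("S", "")}
    start = "S"

    def is_valid_grammar(N, T, P, start):
        """Check the defining conditions of a Type 0 grammar."""
        if N & T or start not in N:
            return False
        # every left-hand side must contain at least one nonterminal symbol
        return all(any(symbol in N for symbol in lhs) for lhs, _ in P)

    assert is_valid_grammar(N, T, P, start)
    # The rule set P2 considered at the end of Example 1.2.7 fails the check,
    # since ab --> S has no nonterminal symbol in its left-hand side:
    assert not is_valid_grammar({"S"}, {"a", "b"}, {("S", "aSb"), ("S", ""), ("ab", "S")}, "S")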

In general, the nonterminal symbols of a Type 0 grammar are denoted by S and by the first uppercase letters in the English alphabet A, B, C, D, and E. The start symbol is denoted by S. The terminal symbols are denoted by digits and by the first lowercase letters in the English alphabet a, b, c, d, and e. Symbols of insignificant nature are denoted by X, Y, and Z. Strings of terminal symbols are denoted by the last lowercase English characters u, v, w, x, y, and z. Strings that may consist of both terminal and nonterminal symbols are denoted by the first lowercase Greek symbols a (alpha), b (beta), and g (gamma). In addition, for convenience, sequences of production rules of the form

    a  -->  b1
    a  -->  b2
       . . .
    a  -->  bn

are denoted as

    a  -->  b1
       -->  b2
       . . .
       -->  bn

Example 1.2.8    <N, S, P, S> is a Type 0 grammar if N = {S, B}, S = {a, b, c}, and P consists of the following production rules.

  S   -->  aBSc
     -->  abc
     -->  e
Ba   -->  aB
Bb   -->  bb

The nonterminal symbol S is the left-hand side of the first three production rules. Ba is the left-hand side of the fourth production rule. Bb is the left-hand side of the fifth production rule.

The right-hand side aBSc of the first production rule contains both terminal and nonterminal symbols. The right-hand side abc of the second production rule contains only terminal symbols. Except for the trivial case of the right-hand side e of the third production rule, none of the right-hand sides of the production rules consists only of nonterminal symbols, even though they are allowed to be of such a form.  ***

 Derivations

Grammars generate languages by repeatedly modifying given strings. Each modification of a string is in accordance with some production rule of the grammar in question G = <N, S, P, S>. A modification to a string g in accordance with a production rule a --> b is obtained by replacing a substring a in g with b.

In general, a string g is said to directly derive a string g' if g' can be obtained from g by a single modification. Similarly, a string g is said to derive g' if g' can be obtained from g by a sequence of an arbitrary number of direct derivations.

Formally, a string g is said to directly derive in G a string g', denoted g ==>_G g', if g' can be obtained from g by replacing a substring a with b, where a --> b is a production rule in G. That is, if g = rad and g' = rbd for some strings a, b, r, and d such that a --> b is a production rule in G.

Example 1.2.9    If G is the grammar <N, S, P, S> in Example 1.2.7, then both e and aSb are directly derivable from S. Similarly, both ab and a²Sb² are directly derivable from aSb. e is directly derivable from S, and ab is directly derivable from aSb, in accordance with the production rule S --> e. aSb is directly derivable from S, and a²Sb² is directly derivable from aSb, in accordance with the production rule S --> aSb.

On the other hand, if G is the grammar <N, S, P, S> of Example 1.2.8, then aBaBabccc ==>_G aaBBabccc and aBaBabccc ==>_G aBaaBbccc in accordance with the production rule Ba --> aB. Moreover, no other string is directly derivable from aBaBabccc in G.  ***
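
Direct derivation is mechanical enough to sketch in a few lines: for every production rule and every occurrence of its left-hand side, one replacement is performed. The function name direct_derivations is illustrative, and the encoding is the one used in the previous sketch.

    def direct_derivations(gamma, P):
        """All strings directly derivable from gamma: for every rule (lhs, rhs) in P
        and every occurrence of lhs in gamma, replace that occurrence by rhs."""
        results = set()
        for lhs, rhs in P:
            pos = 0
            while (i := gamma.find(lhs, pos)) != -1:
                results.add(gamma[:i] + rhs + gamma[i + len(lhs):])
                pos = i + 1
        return results

    # For the grammar of Example 1.2.7:
    assert direct_derivations("aSb", {("S", "aSb"), ("S", "")}) == {"aaSbb", "ab"}

Applied to aBaBabccc with the five production rules of Example 1.2.8, the same function returns exactly the two strings named above.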

g is said to derive g' in G, denoted g ==>_G* g', if g0 ==>_G . . . ==>_G gn for some g0, . . . , gn such that g0 = g and gn = g'. In such a case, the sequence g0 ==>_G . . . ==>_G gn is said to be a derivation of g' from g whose length is equal to n. g0, . . . , gn are said to be sentential forms if g0 = S. A sentential form that contains no nonterminal symbols is said to be a sentence.

Example 1.2.10    If G is the grammar of Example 1.2.7, then a⁴Sb⁴ has a derivation from S. The derivation S ==>_G* a⁴Sb⁴ has length 4, and it has the form S ==>_G aSb ==>_G a²Sb² ==>_G a³Sb³ ==>_G a⁴Sb⁴.
***

A string is assumed to be in the language that the grammar G generates if and only if it is a string of terminal symbols that is derivable from the start symbol. The language that is generated by G, denoted L(G), is the set of all the strings of terminal symbols that can be derived from the start symbol, that is, the set { w | w is in S*, and S ==>_G* w }. Each string in the language L(G) is said to be generated by G.

Example 1.2.11    Consider the grammar G of Example 1.2.7. e is in the language that G generates because of the existence of the derivation S ==>_G e. ab is in the language that G generates, because of the existence of the derivation S ==>_G aSb ==>_G ab. a²b² is in the language that G generates, because of the existence of the derivation S ==>_G aSb ==>_G a²Sb² ==>_G a²b².

The language L(G) that G generates consists of all the strings of the form a . . . ab . . . b in which the number of a's is equal to the number of b's, that is, L(G) = { aⁱbⁱ | i is a natural number }.

aSb is not in L(G) because it contains a nonterminal symbol. a²b is not in L(G) because it cannot be derived from S in G.  ***
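
L(G) can be explored by a bounded breadth-first search over sentential forms, reusing direct_derivations from the previous sketch. The length cutoff below is adequate only because the single shortening rule of this grammar erases one S; for Type 0 grammars in general no such bound exists, so this is a bounded exploration rather than a decision procedure.

    from collections import deque

    def generate(P, start, terminals, max_len):
        """Collect the strings of terminal symbols of length at most max_len that
        are derivable from the start symbol."""
        language, seen = set(), {start}
        queue = deque([start])
        while queue:
            gamma = queue.popleft()
            for g in direct_derivations(gamma, P):
                if g in seen or len(g) > max_len + 1:   # crude cutoff, adequate here only
                    continue
                seen.add(g)
                if len(g) <= max_len and all(c in terminals for c in g):
                    language.add(g)
                queue.append(g)
        return language

    # For the grammar of Example 1.2.7:
    assert generate({("S", "aSb"), ("S", "")}, "S", {"a", "b"}, 6) == {"", "ab", "aabb", "aaabbb"}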

In what follows, the notations g ==> g' and g ==>* g' are used instead of g ==>_G g' and g ==>_G* g', respectively, when G is understood. In addition, Type 0 grammars are referred to simply as grammars, and Type 0 languages are referred to simply as languages, when no confusion arises.

Example 1.2.12    If G is the grammar of Example 1.2.8, then the following is a derivation for a³b³c³. The production rule used in each step is indicated to the right of that step.

    S  ==>  aBSc         (by S --> aBSc)
       ==>  aBaBScc      (by S --> aBSc)
       ==>  aBaBabccc    (by S --> abc)
       ==>  aBaaBbccc    (by Ba --> aB)
       ==>  aBaabbccc    (by Bb --> bb)
       ==>  aaBabbccc    (by Ba --> aB)
       ==>  aaaBbbccc    (by Ba --> aB)
       ==>  aaabbbccc    (by Bb --> bb)

The language generated by the grammar G consists of all the strings of the form a . . . ab . . . bc . . . c in which there are equal numbers of a's, b's, and c's, that is, L(G) = { aⁱbⁱcⁱ | i is a natural number }.

The first two production rules in G are used for generating sentential forms that have the pattern aBaB . . . aBabc . . . c. In each such sentential form the number of a's is equal to the number of c's and is greater by 1 than the number of B's.

The production rule Ba --> aB is used for transporting the B's rightward in the sentential forms. The production rule Bb --> bb is used for replacing the B's by b's, upon reaching their appropriate positions.  ***
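
The same bounded search, reusing the sketches above, recovers the short strings of this language as well; the bound is again adequate only because the single shortening rule is S --> e.

    P = {("S", "aBSc"), ("S", "abc"), ("S", ""), ("Ba", "aB"), ("Bb", "bb")}
    assert generate(P, "S", {"a", "b", "c"}, 9) == {"", "abc", "aabbcc", "aaabbbccc"}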

 Derivation Graphs

Derivations of sentential forms in Type 0 grammars can be displayed by derivation, or parse, graphs. Each derivation graph is a rooted, ordered, acyclic, directed graph whose nodes are labeled. The label of each node is either a nonterminal symbol, a terminal symbol, or the empty string. The derivation graph that corresponds to a derivation S ==> g1 ==> . . . ==> gn is defined inductively in the following manner.

  1.  The derivation graph D0 that corresponds to S consists of a single node labeled by the start symbol S.
  2.  If a --> b is the production rule used in the direct derivation gi ==> gi+1, 0 <= i < n and g0 = S, then the derivation graph Di+1 that corresponds to g0 ==> . . . ==> gi+1 is obtained from Di by the addition of max(|b|, 1) new nodes. The new nodes are labeled by the characters of b, and are assigned as common successors to each of the nodes in Di that corresponds to a character in a. Consequently, the leaves of the derivation graph Di+1 are labeled by gi+1.
Derivation graphs are also called derivation trees or parse trees when the directed graphs are trees.
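
For the special case in which the replaced left-hand side is a single nonterminal symbol, as in the grammar of Example 1.2.7, the construction above produces a tree: step 2 simply hangs the characters of the right-hand side (or e) below the expanded node. A minimal sketch of that special case, with illustrative names Node and expand:

    class Node:
        """A node of a derivation tree: a label and an ordered list of successors."""
        def __init__(self, label):
            self.label, self.children = label, []

    def expand(node, rhs):
        """Step 2 of the construction for a rule whose left-hand side is the single
        symbol labeling the node: the characters of the right-hand side (or e, when
        the right-hand side is empty) become the node's successors."""
        node.children = [Node(c) for c in rhs] if rhs else [Node("e")]
        return node.children

    # The tree of the derivation S ==> aSb ==> ab in the grammar of Example 1.2.7:
    root = Node("S")
    _, inner_s, _ = expand(root, "aSb")   # apply S --> aSb at the root
    expand(inner_s, "")                   # apply S --> e at the inner S
    # The leaves, read from left to right, are labeled a, e, b and spell out ab.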

Example 1.2.13    Figure 1.2.1(a) provides examples of derivation trees for derivations in the grammar of Example 1.2.7. Figure 1.2.1(b) provides examples of derivation graphs for derivations in the grammar of Example 1.2.8.  ***


    
[PICT]
Figure 1.2.1 (a) Derivation trees. (b) Derivation graphs.
 
[PICT]
Figure 1.2.2 A derivation graph with ordering of the usage of production rules indicated with arrows.

 Leftmost Derivations

A derivation g0 ==> . . . ==> gn is said to be a leftmost derivation if a1 is replaced before a2 in the derivation whenever the following two conditions hold.

  1.  a1 appears to the left of a2 in gi, 0 <= i < n.
  2.  a1 and a2 are replaced during the derivation in accordance with some production rules of the form a1 --> b1 and a2 --> b2, respectively.
Example 1.2.14    The derivation graph in Figure 1.2.2 indicates the order in which the production rules are used in the derivation of a³b³c³ in Example 1.2.12. The substring a1 = Ba that is replaced in the seventh step of the derivation is in the same sentential form as the substring a2 = Bb that is replaced in the sixth step of the derivation. The derivation is not a leftmost derivation because a1 appears to the left of a2 while it is being replaced after a2.

On the other hand, the following derivation is a leftmost derivation for a³b³c³ in G. The order in which the production rules are used is similar to that indicated in Figure 1.2.2. The only difference is that the indices 6 and 7 should be interchanged.

    S  ==>  aBSc         (by S --> aBSc)
       ==>  aBaBScc      (by S --> aBSc)
       ==>  aaBBScc      (by Ba --> aB)
       ==>  aaBBabccc    (by S --> abc)
       ==>  aaBaBbccc    (by Ba --> aB)
       ==>  aaaBBbccc    (by Ba --> aB)
       ==>  aaaBbbccc    (by Bb --> bb)
       ==>  aaabbbccc    (by Bb --> bb)

 Hierarchy of Grammars

The following classes of grammars are obtained by gradually increasing the restrictions that the production rules have to obey.

A Type 1 grammar is a Type 0 grammar <N, S, P, S> that satisfies the following two conditions.

  1.  Each production rule a --> b in P satisfies |a| <= |b| if it is not of the form S --> e.
  2.  If S --> e is in P, then S does not appear in the right-hand side of any production rule.
A language is said to be a Type 1 language if there exists a Type 1 grammar that generates the language.

Example 1.2.15    The grammar of Example 1.2.8 is not a Type 1 grammar, because it does not satisfy condition 2. The grammar can be modified to be of Type 1 by replacing its production rules with the following ones. E is assumed to be a new nonterminal symbol.

  S   -->  E
     -->  e
 E   -->  aBEc
     -->  abc
Ba   -->  aB
Bb   -->  bb

An addition to the modified grammar of a production rule of the form Bb --> b will result in a non-Type 1 grammar, because of a violation of condition 1.  ***

A Type 2 grammar is a Type 1 grammar in which each production rule a --> b satisfies |a| = 1, that is, a is a nonterminal symbol. A language is said to be a Type 2 language if there exists a Type 2 grammar that generates the language.

Example 1.2.16    The grammar of Example 1.2.7 is not a Type 1 grammar, and therefore also not a Type 2 grammar. The grammar can be modified to be a Type 2 grammar, by replacing its production rules with the following ones. E is assumed to be a new nonterminal symbol.

 S   -->  e
    -->  E
E   -->  aEb
    -->  ab

An addition of a production rule of the form aE --> EaE to the grammar will result in a non-Type 2 grammar.  ***

A Type 3 grammar is a Type 2 grammar <N, S, P, S> in which each production rule a --> b that is not of the form S --> e satisfies one of the following conditions.

  1.  b is a terminal symbol.
  2.  b is a terminal symbol followed by a nonterminal symbol.
A language is said to be a Type 3 language if there exists a Type 3 grammar that generates the language.

Example 1.2.17    The grammar <N, S, P, S>, which has the following production rules, is a Type 3 grammar.

 S   -->  e
    -->  aA
    -->  bB
    -->  b
A   -->  bB
    -->  b
B   -->  aA
    -->  bB
    -->  b

An addition of a production rule of the form A --> Ba, or of the form B --> bb, to the grammar will result in a non-Type 3 grammar.  ***
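
The three definitions above can be checked mechanically for a grammar given in the encoding of the earlier sketches (rules as pairs of strings, "" for e). The function name classify is illustrative; it returns the largest type number whose conditions the grammar satisfies.

    def classify(N, T, P, start):
        """Return 3, 2, 1, or 0 according to the most restricted class the grammar belongs to."""

        def is_type1():
            for lhs, rhs in P:
                if (lhs, rhs) == (start, ""):
                    # S --> e is allowed only if S appears in no right-hand side
                    if any(start in r for _, r in P):
                        return False
                elif len(lhs) > len(rhs):
                    return False
            return True

        def is_type2():
            return is_type1() and all(len(lhs) == 1 and lhs in N for lhs, _ in P)

        def is_type3():
            def ok(rhs):
                return (len(rhs) == 1 and rhs in T) or \
                       (len(rhs) == 2 and rhs[0] in T and rhs[1] in N)
            return is_type2() and all((lhs, rhs) == (start, "") or ok(rhs) for lhs, rhs in P)

        if is_type3(): return 3
        if is_type2(): return 2
        if is_type1(): return 1
        return 0

For instance, the grammar of this example is classified as Type 3, the modified grammar of Example 1.2.16 as Type 2, the modified grammar of Example 1.2.15 as Type 1, and the grammar of Example 1.2.7 as Type 0 only.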

Figure 1.2.3 illustrates the hierarchy of the different types of grammars.


   
[PICT]
Figure 1.2.3 Hierarchy of grammars.
