
2.3 Finite-State Automata and Regular Languages

      Finite-State Automata
      Nondeterminism versus Determinism in Finite-State Automata
      Finite-State Automata and Type 3 Grammars
      Type 3 Grammars and Regular Grammars
      Regular Languages and Regular Expressions

The computations of programs are driven by their inputs. The outputs are just the results of the computations, and they have no influence on the course that the computations take. Consequently, it seems that much can be studied about finite-state transducers, or equivalently, about finite-memory programs even when their outputs are ignored. The advantage of conducting a study of such stripped-down finite-state transducers is in the simplified argumentation that they allow.

Finite-State Automata

A finite-state transducer whose output components are ignored is called a finite-state automaton. Formally, a finite-state automaton M is a tuple <Q, Σ, δ, q₀, F>, where Q, Σ, q₀, and F are defined as for finite-state transducers, and the transition table δ is a relation from Q × (Σ ∪ {ε}) to Q.

Transition diagrams similar to those used for representing finite-state transducers can also be used to represent finite-state automata. The only difference is that in the case of finite-state automata, an edge that corresponds to a transition rule (q, α, p) is labeled by the string α.

Example 2.3.1 The finite-state automaton that is induced by the finite-state transducer of Figure 2.2.2 is <Q, Σ, δ, q₀, F>, where Q = {q₀, q₁}, Σ = {a, b}, δ = {(q₀, a, q₁), (q₀, b, q₁), (q₁, b, q₁), (q₁, a, q₀)}, and F = {q₀}.

The transition diagram in Figure 2.3.1 represents the finite-state automaton.

Figure 2.3.1

A finite-state automaton that corresponds to the finite-state transducer of Figure 2.2.2.
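As a minimal sketch, the tuple of Example 2.3.1 can be written down as plain Python data. The encoding is an assumption made only for illustration: states are strings, the empty string `""` stands for ε, and `delta` and `is_well_formed` are hypothetical helper names that do not come from the text.

```python
# Example 2.3.1 as plain data (the encoding is an illustrative assumption).
EPS = ""  # stands for the empty string ε

Q = {"q0", "q1"}
SIGMA = {"a", "b"}
DELTA = {
    ("q0", "a", "q1"),
    ("q0", "b", "q1"),
    ("q1", "b", "q1"),
    ("q1", "a", "q0"),
}
Q0 = "q0"
F = {"q0"}


def delta(q, alpha, rules=DELTA):
    """Return the set δ(q, α) of states p with a transition rule (q, α, p)."""
    return {p for (s, x, p) in rules if s == q and x == alpha}


def is_well_formed(states, alphabet, rules, start, accepting):
    """Check that the table is a relation from Q × (Σ ∪ {ε}) to Q."""
    labels = alphabet | {EPS}
    return (
        start in states
        and accepting <= states
        and all(q in states and x in labels and p in states
                for (q, x, p) in rules)
    )
```

Storing the transition table as a set of triples keeps it a relation rather than a function, so the same encoding also covers nondeterministic automata.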

The finite-state automaton M is said to be deterministic if, for each state q in Q and for each input symbol a in Σ, the union δ(q, a) ∪ δ(q, ε) is a multiset that contains at most one element. The finite-state automaton is said to be nondeterministic if it is not a deterministic finite-state automaton.

A transition rule (q, α, p) of the finite-state automaton is said to be an ε transition rule if α = ε. A finite-state automaton with no ε transition rules is said to be an ε-free finite-state automaton.

Example 2.3.2 Consider the finite-state automaton M₁ = <{q₀, . . . , q₆}, {0, 1}, {(q₀, 0, q₀), (q₀, ε, q₁), (q₀, ε, q₄), (q₁, 0, q₂), (q₁, 1, q₁), (q₂, 0, q₃), (q₂, 1, q₂), (q₃, 0, q₃), (q₃, 1, q₁), (q₄, 0, q₄), (q₄, 1, q₅), (q₅, 0, q₅), (q₅, 1, q₆), (q₆, 1, q₆), (q₆, 0, q₄)}, q₀, {q₀, q₃, q₆}>. The transition diagram of M₁ is given in Figure 2.3.2.

Figure 2.3.2

A nondeterministic finite-state automaton.

M₁ is nondeterministic owing to the transition rules that originate at state q₀. One of the transition rules requires that an input value be read, whereas the other two transition rules require that no input value be read. Moreover, M₁ is also nondeterministic when the transition rule (q₀, 0, q₀) is ignored, because M₁ cannot determine locally which of the other transition rules to follow on the moves that originate at state q₀.
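The determinism condition can be checked mechanically. The following is a sketch under assumed conventions (triples for transition rules, `""` for ε, hypothetical function names): it counts, for each state q and symbol a, the rules that make up the multiset δ(q, a) ∪ δ(q, ε).

```python
EPS = ""  # stands for ε

# Transition rules of M1 from Example 2.3.2.
M1_RULES = [
    ("q0", "0", "q0"), ("q0", EPS, "q1"), ("q0", EPS, "q4"),
    ("q1", "0", "q2"), ("q1", "1", "q1"),
    ("q2", "0", "q3"), ("q2", "1", "q2"),
    ("q3", "0", "q3"), ("q3", "1", "q1"),
    ("q4", "0", "q4"), ("q4", "1", "q5"),
    ("q5", "0", "q5"), ("q5", "1", "q6"),
    ("q6", "1", "q6"), ("q6", "0", "q4"),
]
STATES = [f"q{i}" for i in range(7)]
SIGMA = ["0", "1"]


def choices(rules, q, a):
    """Size of the multiset δ(q, a) ∪ δ(q, ε): rules usable at q when a is next."""
    return sum(1 for (s, x, _) in rules if s == q and x in (a, EPS))


def is_deterministic(states, alphabet, rules):
    """True if no (state, symbol) pair offers more than one applicable rule."""
    return all(choices(rules, q, a) <= 1 for q in states for a in alphabet)


def conflicts(states, alphabet, rules):
    """List the (state, symbol) pairs with more than one applicable rule."""
    return [(q, a) for q in states for a in alphabet if choices(rules, q, a) > 1]
```

For M₁ the only conflicting pairs involve q₀, and they remain after (q₀, 0, q₀) is dropped, since the two ε transition rules still compete with each other.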

The finite-state automaton M₂ in Figure 2.3.3 is a deterministic finite-state automaton.

Figure 2.3.3

A deterministic finite-state automaton.

M₁ has two ε transition rules, and M₂ has one.

A configuration, or an instantaneous description, of the finite-state automaton is a singleton uqv, where q is a state in Q, and uv is a string in Σ*. The configuration is said to be an initial configuration if u = ε and q is the initial state. The configuration is said to be an accepting, or final, configuration if v = ε and q is an accepting state. With no loss of generality it is assumed that Q and Σ are mutually disjoint.

Other definitions, like those of ⊢_M, ⊢, ⊢_M*, ⊢*, and acceptance, recognition, and decidability of a language by a finite-state automaton, are similar to those given for finite-state transducers.
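A minimal sketch of configurations and moves follows, assuming a configuration uqv is stored as a triple `(u, q, v)` and ε as `""`. The names `moves` and `accepts` are hypothetical; acceptance is taken here to mean that some sequence of moves leads from the initial configuration to an accepting one.

```python
from collections import deque

EPS = ""  # stands for ε

# Automaton of Example 2.3.1.
RULES = {("q0", "a", "q1"), ("q0", "b", "q1"),
         ("q1", "b", "q1"), ("q1", "a", "q0")}
START, ACCEPTING = "q0", {"q0"}


def moves(config, rules):
    """Yield every configuration reachable from uqv by a single move."""
    u, q, v = config
    for (s, x, p) in rules:
        # x is either one input symbol or ε; v.startswith("") is always true.
        if s == q and v.startswith(x):
            yield (u + x, p, v[len(x):])


def accepts(w, rules, start, accepting):
    """Search the configurations reachable from the initial configuration."""
    initial = (EPS, start, w)
    seen, queue = {initial}, deque([initial])
    while queue:
        u, q, v = queue.popleft()
        if v == EPS and q in accepting:   # accepting configuration
            return True
        for nxt in moves((u, q, v), rules):
            if nxt not in seen:           # guards against ε cycles
                seen.add(nxt)
                queue.append(nxt)
    return False
```

The search terminates because an input w admits at most |Q| · (|w| + 1) distinct configurations.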

Nondeterminism versus Determinism in Finite-State Automata

By the following theorem, nondeterminism does not add to the recognition power of finite-state automata, even though it might add to their succinctness. The proof of the theorem provides an algorithm for constructing, from any given n-state finite-state automaton, an equivalent deterministic finite-state automaton of at most 2ⁿ states.

Theorem 2.3.1 If a language is accepted by a finite-state automaton, then it is also decided by a deterministic finite-state automaton that has no ε transition rules.

Proof Consider any finite-state automaton M = <Q, Σ, δ, q₀, F>. Let A_x denote the set of all the states that M can reach from its initial state q₀, by the sequences of moves that consume the string x, that is, the set { q | q₀x ⊢* xq }. Then an input w is accepted by M if and only if A_w contains an accepting state.

The proof relies on the observation that A_xa contains exactly those states that can be reached from the states in A_x, by the sequences of transition rules that consume a, that is, A_xa = { p | q is in A_x, and qa ⊢* ap }.

Specifically, if p is a state in A_xa, then by definition there is a sequence of transition rules τ₁, . . . , τ_t that takes M from the initial state q₀ to state p while consuming xa. This sequence must have a prefix τ₁, . . . , τ_i that takes M from q₀ to some state q while consuming x (see Figure 2.3.4(a)).

Figure 2.3.4

Sequences of transition rules that consume xa.

Consequently, q is in A_x and the subsequence τ_{i+1}, . . . , τ_t of transition rules takes M from state q to state p while consuming a.

On the other hand, if q is in A_x and if p is a state that is reachable from state q by a sequence τ₁, . . . , τ_s of transition rules that consumes a, then the state p is in A_xa. In such a case, if τ'₁, . . . , τ'_r is a sequence of transition rules that takes M from the initial state q₀ to state q while consuming x, then M can reach the state p from state q₀ by the sequence τ'₁, . . . , τ'_r, τ₁, . . . , τ_s of transition rules that consumes xa (see Figure 2.3.4(b)).

As a result, to determine if a₁⋯a_n is accepted by M, one needs only to follow the sequence A_ε, A_{a₁}, A_{a₁a₂}, . . . , A_{a₁⋯a_n} of sets of states, where each A_{a₁⋯a_{i+1}} is uniquely determined from A_{a₁⋯a_i} and a_{i+1}. Therefore, a deterministic finite-state automaton M' of the following form decides the language that is accepted by M.

The set of states of M' is equal to { A | A is a subset of Q, and A = A_x for some x in Σ* }. Since Q is finite, it follows that Q has only a finite number of subsets A, and consequently M' has also only a finite number of states. The initial state of M' is the subset of Q that is equal to A_ε. The accepting states of M' are those states of M' that contain at least one accepting state of M. The transition table of M' is the set { (A, a, A') | A and A' are states of M', a is in Σ, and A' is the set of states that the finite-state automaton M can reach by consuming a from those states that are in A }.

By definition, M' has no ε transition rules. Moreover, M' is deterministic because, for each x in Σ* and each a in Σ, the set A_xa is uniquely defined from the set A_x and the symbol a.
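The construction in the proof can be sketched as follows. This is an illustration under assumed conventions (transition rules as triples, `""` for ε, states of M' as frozensets), and the function names are hypothetical. Only the sets A_x that are actually reachable are generated, so the result has at most 2ⁿ states for an n-state M.

```python
EPS = ""  # stands for ε


def closure(states, rules):
    """All states reachable from `states` by ε transition rules alone."""
    result, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for (s, x, p) in rules:
            if s == q and x == EPS and p not in result:
                result.add(p)
                stack.append(p)
    return frozenset(result)


def consume(states, a, rules):
    """States reachable from `states` by move sequences that consume exactly a."""
    before = closure(states, rules)
    after = {p for (s, x, p) in rules if s in before and x == a}
    return closure(after, rules)


def determinize(alphabet, rules, start, accepting):
    """Build M': its states are the sets A_x, its table maps (A_x, a) to A_xa."""
    initial = closure({start}, rules)            # A_ε
    states, table, todo = {initial}, {}, [initial]
    while todo:
        A = todo.pop()
        for a in sorted(alphabet):
            B = consume(A, a, rules)             # A_xa from A_x and a
            table[(A, a)] = B
            if B not in states:
                states.add(B)
                todo.append(B)
    final = {A for A in states if A & set(accepting)}
    return states, table, initial, final


def run(table, initial, final, w):
    """Follow the sequence A_ε, A_{a1}, ... and test the last set."""
    A = initial
    for a in w:
        A = table[(A, a)]
    return A in final
```

Because `table` is a dictionary keyed by (state, symbol), M' is deterministic by construction. The empty set may appear as a state; it plays the role of a dead state from which no accepting state is reachable.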

Example 2.3.3 Let M be the finite-state automaton whose transition diagram is given in Figure 2.3.2. The transition diagram in Figure 2.3.5 represents an ε-free, deterministic finite-state automaton that is equivalent to M. Using the terminology of the proof of Theorem 2.3.1, A_ε = {q₀, q₁, q₄}, A₀ = {q₀, q₁, q₂, q₄}, and A₀₀ = A₀₀₀ = ⋯ = A₀⋯₀ = {q₀, q₁, q₂, q₃, q₄}.

Figure 2.3.5

A transition diagram of an ε-free, deterministic finite-state automaton that is equivalent to the finite-state automaton whose transition diagram is given in Figure 2.3.2.

A_ε is the set of all the states that M can reach without reading any input. q₀ is in A_ε because it is the initial state of M. q₁ and q₄ are in A_ε because M has ε transition rules that leave the initial state q₀ and enter states q₁ and q₄, respectively.

A₀ is the set of all the states that M can reach just by reading 0 from those states that are in A_ε. q₀ is in A₀ because q₀ is in A_ε and M has the transition rule (q₀, 0, q₀). q₁ is in A₀ because q₀ is in A_ε and M can use the pair (q₀, 0, q₀) and (q₀, ε, q₁) of transition rules to reach q₁ from q₀ just by reading 0. q₂ is in A₀ because q₀ is in A_ε and M can use the pair (q₀, ε, q₁) and (q₁, 0, q₂) of transition rules to reach q₂ from q₀ just by reading 0.
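The remaining sets of the example follow the same pattern. As a worked step, write E(S) for the set of states reachable from S by ε transition rules alone (a symbol introduced only for this sketch, not part of the text's notation). Since A₀ is already closed under ε transition rules, it suffices to read one 0 and then close again:

```latex
\begin{align*}
A_{0} &= \{q_0, q_1, q_2, q_4\} \\
\{\, p \mid (q, 0, p) \in \delta,\ q \in A_{0} \,\} &= \{q_0, q_2, q_3, q_4\} \\
A_{00} &= E(\{q_0, q_2, q_3, q_4\}) = \{q_0, q_1, q_2, q_3, q_4\} \\
\{\, p \mid (q, 0, p) \in \delta,\ q \in A_{00} \,\} &= \{q_0, q_2, q_3, q_4\} \\
A_{000} &= E(\{q_0, q_2, q_3, q_4\}) = A_{00}
\end{align*}
```

Because A₀₀₀ = A₀₀, reading further 0's cannot change the set, which is why all longer runs of 0's lead to the same state of the deterministic automaton.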

The result of the last theorem cannot be generalized to finite-state transducers, because deterministic finite-state transducers can only compute functions, whereas nondeterministic finite-state transducers can also compute relations which are not functions, for example, the relation {(a, b), (a, c)}. In fact, there are also functions that can be computed by nondeterministic finite-state transducers but that cannot be computed by deterministic finite-state transducers. R = { (x0, 0^|x|) | x is a string in {0, 1}* } ∪ { (x1, 1^|x|) | x is a string in {0, 1}* } is an example of such a function. The function cannot be computed by a deterministic finite-state transducer because each deterministic finite-state transducer M satisfies the following condition, which is not shared by the function: if x₁ is a prefix of x₂ and M accepts x₁ and x₂, then the output of M on input x₁ is a prefix of the output of M on input x₂ (Exercise 2.2.5).
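A concrete pair of inputs shows how R breaks the prefix condition. The sketch below evaluates R directly as a Python function (it is not a transducer, and the function names are hypothetical) and tests the condition for given x₁ and x₂.

```python
def R(w):
    """Value of R on a nonempty input w = x·c: the symbol c repeated |x| times."""
    if not w or set(w) - {"0", "1"}:
        raise ValueError("R is defined only on nonempty strings over {0, 1}")
    x, c = w[:-1], w[-1]
    return c * len(x)


def violates_prefix_condition(x1, x2):
    """True if x1 is a prefix of x2 but R(x1) is not a prefix of R(x2)."""
    return x2.startswith(x1) and not R(x2).startswith(R(x1))
```

For instance, 00 is a prefix of 001, yet R maps 00 to 0 and 001 to 11. The output depends on the last input symbol, which a deterministic transducer cannot know when it has to start emitting.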

Finite-State Automata and Type 3 Grammars

The following two results imply that a language is accepted by a finite-state automaton if and only if it is a Type 3 language. The proof of the first result shows how Type 3 grammars can simulate the computations of finite-state automata.

Theorem 2.3.2 Finite-state automata accept only Type 3 languages.

Proof Consider any finite-state automaton M = <Q, Σ, δ, q₀, F>. By Theorem 2.3.1 it can be assumed that M is an ε-free, finite-state automaton. With no loss of generality, it can also be assumed that no transition rule takes M to its initial state when that state is an accepting one. (If such is not the case, then one can add a new state q'₀ to Q, make the new state q'₀ both an initial and an accepting state, and add a new transition rule (q'₀, α, q) to δ for each transition rule of the form (q₀, α, q) that is in δ.)

Let G = <N, Σ, P, [q₀]> be a Type 3 grammar, where N has a nonterminal symbol [q] for each state q in Q and P has the following production rules.

A production rule of the form [q] → a[p] for each transition rule (q, a, p) in the transition table δ.
A production rule of the form [q] → a for each transition rule (q, a, p) in δ such that p is an accepting state in F.
A production rule of the form [q₀] → ε if the initial state q₀ is an accepting state in F.

The grammar G is constructed to simulate the computations of the finite-state automaton M. G records the states of M through the nonterminal symbols. In particular, G uses its start symbol [q₀] to initiate a simulation of M at state q₀. G uses a production rule of the form [q] → a[p] to simulate a move of M from state q to state p. In using such a production rule, G generates the symbol a that M reads in the corresponding move. G uses a production rule of the form [q] → a instead of the production rule of the form [q] → a[p], when it wants to terminate a simulation at an accepting state p.

By induction on n it follows that a string a₁a₂⋯a_n has a derivation in G of the form [q] ⇒ a₁[q₁] ⇒ a₁a₂[q₂] ⇒ ⋯ ⇒ a₁a₂⋯a_{n-1}[q_{n-1}] ⇒ a₁a₂⋯a_n if and only if M has a sequence of moves of the form qa₁a₂⋯a_n ⊢ a₁q₁a₂⋯a_n ⊢ a₁a₂q₂a₃⋯a_n ⊢ ⋯ ⊢ a₁⋯a_{n-1}q_{n-1}a_n ⊢ a₁a₂⋯a_nq_n for some accepting state q_n. In particular the correspondence above holds for q = q₀. Therefore L(G) = L(M).
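The construction of G can be sketched as below, assuming an ε-free automaton given as a set of triples. The function names and the encoding of a production rule as a pair `(A, rhs)` are hypothetical; for brevity a state name q doubles as the nonterminal symbol [q], and the fresh initial state is named by appending a prime, which is assumed not to clash with an existing state.

```python
EPS = ""  # stands for ε


def normalize(rules, start, accepting):
    """Add a fresh initial state if the accepting initial state has incoming rules."""
    if start in accepting and any(p == start for (_, _, p) in rules):
        fresh = start + "'"  # assumed not to be an existing state name
        rules = set(rules) | {(fresh, x, p) for (q, x, p) in rules if q == start}
        return rules, fresh, set(accepting) | {fresh}
    return set(rules), start, set(accepting)


def to_grammar(rules, start, accepting):
    """Return (P, start symbol); each rhs is (), (a,), or (a, B)."""
    if any(x == EPS for (_, x, _) in rules):
        raise ValueError("the automaton is assumed to be ε-free")
    rules, start, accepting = normalize(rules, start, accepting)
    P = set()
    for (q, a, p) in rules:
        P.add((q, (a, p)))        # [q] -> a[p]
        if p in accepting:
            P.add((q, (a,)))      # [q] -> a
    if start in accepting:
        P.add((start, ()))        # [q0] -> ε
    return P, start


def generates(P, A, w):
    """Decide whether the string w has a derivation from nonterminal A."""
    if w == "":
        return (A, ()) in P
    a, rest = w[0], w[1:]
    if rest == "" and (A, (a,)) in P:
        return True
    return any(generates(P, rhs[1], rest)
               for (lhs, rhs) in P
               if lhs == A and len(rhs) == 2 and rhs[0] == a)
```

Applied to the automaton of Example 2.3.1, whose initial state is accepting and has an incoming rule, `normalize` introduces the new start state before any production rule is emitted, so the start symbol never appears on a right-hand side.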

Example 2.3.4 The finite-state automaton M₁, whose transition diagram is given in Figure 2.3.6(b), is an ε-free, deterministic finite-state automaton. M₁ is not suitable for a direct simulation by a Type 3 grammar because its initial state q₀ is both an accepting state and a destination of a transition rule. Without modifications to M₁ the algorithm that constructs the grammar G will produce the production rule [q₀] → ε because q₀ is an accepting state, and the production rule [q₁] → b[q₀] because of the transition rule (q₁, b, q₀). Such a pair of production rules cannot coexist in a Type 3 grammar.

Figure 2.3.6

Two equivalent finite-state automata.

M₁ is equivalent to the finite-state automaton M₂, whose transition diagram is given in Figure 2.3.6(a). The Type 3 grammar G = <N, Σ, P, [q'₀]> generates the language L(M₂), if N = {[q'₀], [q₀], [q₁], [q₂]}, Σ = {a, b}, and P consists of the following production rules.

The accepting computation q'₀abaa ⊢ aq₁baa ⊢ abq₀aa ⊢ abaq₁a ⊢ abaaq₂ of M₂ on input abaa is simulated by the derivation [q'₀] ⇒ a[q₁] ⇒ ab[q₀] ⇒ aba[q₁] ⇒ abaa of the grammar.

The production rule [q₁] → a[q₂] can be eliminated from the grammar without affecting the generated language.

The next theorem shows that the converse of Theorem 2.3.2 also holds. The proof shows how finite-state automata can trace the derivations of Type 3 grammars.

Theorem 2.3.3 Each Type 3 language is accepted by a finite-state automaton.

Proof Consider any Type 3 grammar G = <N, Σ, P, S>. The finite-state automaton M = <Q, Σ, δ, q_S, F> accepts the language that G generates if Q, δ, q_S, and F are as defined below.

M has a state q_A in Q for each nonterminal symbol A in N. In addition, Q also has a distinguished state named q_f. The state q_S of M, which corresponds to the start symbol S, is designated as the initial state of M. The state q_f of M is designated to be the only accepting state of M, that is, F = {q_f}.

M has a transition rule in δ if and only if the transition rule corresponds to a production rule of G. Each transition rule of the form (q_A, a, q_B) in δ corresponds to a production rule of the form A → aB in G. Each transition rule of the form (q_A, a, q_f) in δ corresponds to a production rule of the form A → a in G. Each transition rule of the form (q_S, ε, q_f) in δ corresponds to a production rule of the form S → ε in G.

The finite-state automaton M is constructed so as to trace the derivations of the grammar G in its computations. M uses its states to keep track of the nonterminal symbols in use in the sentential forms of G. M uses its transition rules to consume the input symbols that G generates in the direct derivations that use the corresponding production rules.

By induction on n, the constructed finite-state automaton M has a sequence q_{A₀}x ⊢ u₁q_{A₁}v₁ ⊢ u₂q_{A₂}v₂ ⊢ ⋯ ⊢ u_{n-1}q_{A_{n-1}}v_{n-1} ⊢ xq_{A_n} of n moves if and only if the grammar G has a derivation of length n of the form A₀ ⇒ u₁A₁ ⇒ u₂A₂ ⇒ ⋯ ⇒ u_{n-1}A_{n-1} ⇒ x. In particular, such correspondence holds for A₀ = S. Consequently, x is in L(M) if and only if it is in L(G).
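The converse construction is just as mechanical. In the sketch below a production rule is encoded as a pair `(A, rhs)` with `rhs` equal to `"aB"`, `"a"`, or `""`, nonterminal symbols are single uppercase letters, and the function names are hypothetical. The sample grammar in the usage note is made up for illustration and is not taken from the text.

```python
EPS = ""      # stands for ε
QF = "q_f"    # the distinguished accepting state


def state(A):
    """Name of the state that corresponds to nonterminal A."""
    return "q_" + A


def to_automaton(productions, start_symbol):
    """Return (rules, initial state, accepting states) for a Type 3 grammar."""
    rules = set()
    for A, rhs in productions:
        if len(rhs) == 2:                 # A -> aB
            rules.add((state(A), rhs[0], state(rhs[1])))
        elif len(rhs) == 1:               # A -> a
            rules.add((state(A), rhs, QF))
        elif A == start_symbol:           # S -> ε
            rules.add((state(A), EPS, QF))
        else:
            raise ValueError("not a Type 3 production rule: %r" % ((A, rhs),))
    return rules, state(start_symbol), {QF}


def accepts(rules, start, accepting, w):
    """Track the set of states reachable after each consumed symbol."""
    def closure(S):
        S = set(S)
        while True:
            more = {p for (q, x, p) in rules if q in S and x == EPS} - S
            if not more:
                return S
            S |= more

    current = closure({start})
    for a in w:
        current = closure({p for (q, x, p) in rules if q in current and x == a})
    return bool(current & accepting)
```

With the hypothetical production rules S → aA, A → aA, and A → b, the call `to_automaton({("S", "aA"), ("A", "aA"), ("A", "b")}, "S")` yields the transition rules (q_S, a, q_A), (q_A, a, q_A), and (q_A, b, q_f), and the resulting automaton accepts aab but not aa.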

Example 2.3.5 Consider the Type 3 grammar G = <{S, A, B}, {a, b}, P, S>, where P consists of the following production rules.

The transition diagram in Figure 2.3.7 represents a finite-state automaton that accepts the language L(G). The derivation S ⇒* aaA ⇒ aab in G is traced by the computation q_Saab ⊢ aq_Aab ⊢ aaq_Ab ⊢ aabq_f of M.

Figure 2.3.7

A finite-state automaton that accepts L(G), where G is the grammar of Example 2.3.5.

It turns out that finite-state automata and Type 3 grammars are quite similar mathematical systems. The states in the automata play a role similar to the nonterminal symbols in the grammars, and the transition rules in the automata play a role similar to the production rules in the grammars.

Type 3 Grammars and Regular Grammars

Type 3 grammars seem to be minimal in the sense that placing further meaningful restrictions on them results in grammars that cannot generate all the Type 3 languages. On the other hand, some of the restrictions placed on Type 3 grammars can be relaxed without increasing the class of languages that they can generate.

Specifically, a grammar G = <N, Σ, P, S> is said to be a right-linear grammar if each of its production rules is either of the form A → xB or of the form A → x, where A and B are nonterminal symbols in N and x is a string of terminal symbols in Σ*.

The grammar is said to be a left-linear grammar if each of its production rules is either of the form A → Bx or of the form A → x, where A and B are nonterminal symbols in N and x is a string of terminal symbols in Σ*.

The grammar is said to be a regular grammar if it is either a right-linear grammar or a left-linear grammar. A language is said to be a regular language if it is generated by a regular grammar.

By Exercise 2.3.5 a language is a Type 3 language if and only if it is regular.
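The three definitions translate directly into checks on the shape of the production rules. The sketch below assumes that every symbol is a single character and that a production rule is a pair `(A, rhs)` with `rhs` a string; the function names are hypothetical.

```python
def is_right_linear(productions, nonterminals):
    """Each rule is A -> xB or A -> x, with x a string of terminal symbols."""
    def ok(rhs):
        x = rhs[:-1] if rhs and rhs[-1] in nonterminals else rhs
        return not any(s in nonterminals for s in x)
    return all(A in nonterminals and ok(rhs) for A, rhs in productions)


def is_left_linear(productions, nonterminals):
    """Each rule is A -> Bx or A -> x, with x a string of terminal symbols."""
    def ok(rhs):
        x = rhs[1:] if rhs and rhs[0] in nonterminals else rhs
        return not any(s in nonterminals for s in x)
    return all(A in nonterminals and ok(rhs) for A, rhs in productions)


def is_regular(productions, nonterminals):
    """A grammar is regular if it is right-linear or left-linear."""
    return (is_right_linear(productions, nonterminals)
            or is_left_linear(productions, nonterminals))
```

Note that the test is applied to the grammar as a whole: a grammar that mixes a rule such as S → aA with a rule such as A → Sb is neither right-linear nor left-linear, even though each rule on its own has one of the permitted shapes.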

Regular Languages and Regular Expressions

Regular languages can also be defined, from the empty set and from some finite number of singleton sets, by the operations of union, composition, and Kleene closure. Specifically, consider any alphabet Σ. Then a regular set over Σ is defined in the following way.

The empty set Ø, the set {ε} containing only the empty string, and the set {a} for each symbol a in Σ, are regular sets.
If L₁ and L₂ are regular sets, then so are the union L₁ ∪ L₂, the composition L₁L₂, and the Kleene closure L₁*.
No other set is regular.
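The three operations can be tried out on finite sets of strings. Since a Kleene closure is in general infinite, the sketch below truncates every result to strings of length at most n; this bound, like the function names, is an assumption made only to keep the sets finite.

```python
def union(L1, L2):
    """The union of two sets of strings."""
    return L1 | L2


def compose(L1, L2, n):
    """Strings uv with u in L1 and v in L2, of length at most n."""
    return {u + v for u in L1 for v in L2 if len(u + v) <= n}


def star(L, n):
    """Strings of the Kleene closure L* of length at most n (L finite)."""
    result, frontier = {""}, {""}
    while frontier:
        # Extend the newest strings by one more factor from L.
        frontier = compose(frontier, L, n) - result
        result |= frontier
    return result
```

Starting from the basic sets, `star(union({"a"}, {"b"}), 2)` lists every string over {a, b} of length at most 2, and `star(set(), n)` is `{""}`, reflecting that Ø* = {ε}.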

By Exercise 2.3.6 the following characterization holds.

Theorem 2.3.4 A set is a regular set if and only if it is accepted by a finite-state automaton.

Regular sets of the form Ø, {ε}, {a}, L₁ ∪ L₂, L₁L₂, and L₁* are quite often denoted by the expressions Ø, ε, a, (α) + (β), (α)(β), and (α)*, respectively. α and β are assumed to be the expressions that denote L₁ and L₂ in a similar manner, respectively. a is assumed to be a symbol from the alphabet. Expressions that denote regular sets in this manner are called regular expressions.

Some parentheses can be omitted from regular expressions, if a precedence relation between the operations of Kleene closure, composition, and union in the given order is assumed. The omission of parentheses in regular expressions is similar to that in arithmetic expressions, where closure, composition, and union in regular expressions play a role similar to exponentiation, multiplication, and addition in arithmetic expressions.
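The precedence rule can be observed with Python's standard `re` module, which writes union as `|` instead of `+` and uses the same precedence order for closure, composition, and union. The translation below is a sketch with hypothetical function names; it assumes an alphabet of letters and digits and does not handle the expressions Ø and ε.

```python
import re


def to_python(regex):
    """Translate the text's notation into `re` syntax: '+' becomes '|'."""
    return regex.replace(" ", "").replace("+", "|")


def denotes(regex, w):
    """True if the string w is in the set denoted by the regular expression."""
    return re.fullmatch(to_python(regex), w) is not None
```

For example, `a + bc*` is read as (a) + ((b)((c)*)), so `denotes("a + bc*", "bcc")` holds while `denotes("a + bc*", "ac")` does not. Adding parentheses changes the denoted set: `(a + b)c*` contains ac, and `a + (bc)*` contains bcbc.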

Example 2.3.6 The regular expression 0*(1*01*00*(11*01*00*)* + 0*10*11*(00*10*11*)*) denotes the language that is recognized by the finite-state automaton whose transition diagram is given in Figure 2.3.2. The expression indicates that each string starts with an arbitrary number of 0's. Then the string continues with a string in 1*01*00*(11*01*00*)* or with a string in 0*10*11*(00*10*11*)*. In the first case, the string continues with an arbitrary number of 1's, followed by 0, followed by an arbitrary number of 1's, followed by one or more 0's, followed by an arbitrary number of strings in 11*01*00*.

By the previous discussion, nondeterministic finite-state automata, deterministic finite-state automata, regular grammars, and regular expressions are all characterizations of the languages that finite-memory programs accept. Moreover, there are effective procedures for moving between the different characterizations. These procedures provide the foundation for many systems that produce finite-memory-based programs from characterizations of the previous nature. For instance, one of the best known systems, called LEX, gets inputs that are generalizations of regular expressions and provides outputs that are scanners. The advantage of such systems is obviously in the reduced effort they require for obtaining the desired programs.

Figure 2.3.8 illustrates the structural and functional hierarchies for some descriptive systems. The structural hierarchies are shown by the directed acyclic graphs. The functional hierarchy is shown by the Venn diagram.

Figure 2.3.8

The structural and functional relationships between some descriptive systems.
