1.1 Alphabets, Strings, and Representations

      Alphabets and Strings
      Ordering of Strings
      Representations

The ability to represent information is crucial to communicating and processing information. Human societies created spoken languages to communicate on a basic level, and developed writing to reach a more sophisticated level.

The English language, for instance, in its spoken form relies on some finite set of basic sounds as a set of primitives. The words are defined in term of finite sequences of such sounds. Sentences are derived from finite sequences of words. Conversations are achieved from finite sequences of sentences, and so forth.

Written English uses some finite set of symbols as a set of primitives. The words are defined by finite sequences of symbols. Sentences are derived from finite sequences of words. Paragraphs are obtained from finite sequences of sentences, and so forth.

Similar approaches have been developed also for representing elements of other sets. For instance, the natural number can be represented by finite sequences of decimal digits.

Computations, like natural languages, are expected to deal with information in its most general form. Consequently, computations function as manipulators of integers, graphs, programs, and many other kinds of entities. However, in reality computations only manipulate strings of symbols that represent the objects. The previous discussion necessitates the following definitions.

Alphabets and Strings

A finite, nonempty ordered set will be called an alphabet if its elements are symbols , or characters (i.e., elements with "primitive" graphical representations). A finite sequence of symbols from a given alphabet will be called a string over the alphabet. A string that consists of a sequence a₁, a₂, . . . , a_n of symbols will be denoted by the juxtaposition a₁a₂ a_n. Strings that have zero symbols, called empty strings, will be denoted by .

Example 1.1.1 ₁ = {a, . . . , z} and ₂ = {0, . . . , 9} are alphabets. abb is a string over ₁, and 123 is a string over ₂. ba12 is not a string over ₁, because it contains symbols that are not in ₁. Similarly, 314 . . . is not a string over ₂, because it is not a finite sequence. On the other hand, is a string over any alphabet.

The empty set Ø is not an alphabet because it contains no element. The set of natural numbers is not an alphabet, because it is not finite. The union ₁ ₂ is an alphabet only if an ordering is placed on its symbols.

An alphabet of cardinality 2 is called a binary alphabet, and strings over a binary alphabet are called binary strings. Similarly, an alphabet of cardinality 1 is called a unary alphabet, and strings over a unary alphabet are called unary strings.

The length of a string is denoted || and assumed to equal the number of symbols in the string.

Example 1.1.2 {0, 1} is a binary alphabet, and {1} is a unary alphabet. 11 is a binary string over the alphabet {0, 1}, and a unary string over the alphabet {1}.

11 is a string of length 2, || = 0, and |01| + |1| = 3.

The string consisting of a sequence followed by a sequence is denoted . The string is called the concatenation of and . The notation ⁱ is used for the string obtained by concatenating i copies of the string .

Example 1.1.3 The concatenation of the string 01 with the string 100 gives the string 01100. The concatenation of with any string , and the concatenation of any string with give the string . In particular, = .

If = 01, then ⁰ = , ¹ = 01, ² = 0101, and ³ = 010101.

A string is said to be a substring of a string if = for some and . A substring of a string is said to be a prefix of if = for some . The prefix is said to be a proper prefix of if . A substring of a string is said to be a suffix of if = for some . The suffix is said to be a proper suffix of if .

Example 1.1.4 , 0, 1, 01, 11, and 011 are the substrings of 011. , 0, and 01 are the proper prefixes of 011. , 1, and 11 are the proper suffixes of 011. 011 is a prefix and a suffix of 011.

If = a₁ a_n for some symbols a₁, . . . , a_n then a_n a₁ is called the reverse of , denoted ^rev. is said to be a permutation of if can be obtained from by reordering the symbols in .

Example 1.1.5 Let be the string 001. ^rev = 100. The strings 001, 010, and 100 are the permutations of .

The set of all the strings over an alphabet will be denoted by * . ⁺ will denote the set * - {}.

Ordering of Strings

Searching is probably the most commonly applied operation on information. Due to the importance of this operation, approaches for searching information and for organizing information to facilitate searching, receive special attention. Sequential search, binary search, insertion sort, quick sort, and merge sort are some examples of such approaches. These approaches rely in most cases on the existence of a relationship that defines an ordering of the entities in question.

A frequently used relationship for strings is the one that compares them alphabetically, as reflected by the ordering of names in telephone books. The relationship and ordering can be defined in the following manner.

Consider any alphabet . A string is said to be alphabetically smaller in * than a string , or equivalently, is said to be alphabetically bigger in * than if and are in * and either of the following two cases holds.

is a proper prefix of .
For some in * and some a and b in such that a precedes b in , the string a is a prefix of and the string b is a prefix of .

An ordered subset of

* is said to be alphabetically ordered, if

is not alphabetically smaller in

* than

whenever

precedes

in the subset.

Example 1.1.6 Let be the binary alphabet {0, 1}. The string 01 is alphabetically smaller in * than the string 01100, because 01 is a proper prefix of 01100. On the other hand, 01100 is alphabetically smaller than 0111, because both strings agree in their first three symbols and the fourth symbol in 01100 is smaller than the fourth symbol in 0111.

The set {, 0, 00, 000, 001, 01, 010, 011, 1, 10, 100, 101, 11, 110, 111}, of those strings that have length not greater than 3, is given in alphabetical ordering.

Alphabetical ordering is satisfactory for finite sets, because each string in such an ordered set can eventually be reached. For similar reasons, alphabetical ordering is also satisfactory for infinite sets of unary strings. However, in some other cases alphabetical ordering is not satisfactory because it can result in some strings being preceded by an unbounded number of strings. For instance, such is the case for the string 1 in the alphabetically ordered set {0, 1}*, that is, 1 is preceded by the strings 0, 00, 000, . . . This deficiency motivates the following definition of canonical ordering for strings. In canonical ordering each string is preceded by a finite number of strings.

A string is said to be canonically smaller or lexicographically smaller in * than a string , or equivalently, is said to be canonically bigger or lexicographically bigger in * than if either of the following two cases holds.

is shorter than .
and are of identical length but is alphabetically smaller than .

An ordered subset of

* is said to be canonically ordered or lexicographically ordered, if

is not canonically smaller in

* than

whenever

precedes

in the subset.

Example 1.1.7 Consider the alphabet = {0, 1}. The string 11 is canonically smaller in * than the string 000, because 11 is a shorter string than 000. On the other hand, 00 is canonically smaller than 11, because the strings are of equal length and 00 is alphabetically smaller than 11.

The set * = {, 0, 1, 00, 01, 10, 11, 000, 001, . . .} is given in its canonical ordering.

Representations

Given the preceding definitions of alphabets and strings, representations of information can be viewed as the mapping of objects into strings in accordance with some rules. That is, formally speaking, a representation or encoding over an alphabet of a set D is a function f from D to that satisfies the following condition: f(e₁) and f(e₂) are disjoint nonempty sets for each pair of distinct elements e₁ and e₂ in D.

If is a unary alphabet, then the representation is said to be a unary representation. If is a binary alphabet, then the representation is said to be a binary representation.

In what follows each element in f(e) will be referred to as a representation, or encoding, of e.

Example 1.1.8 f₁ is a binary representation over {0, 1} of the natural numbers if f₁(0) = {0, 00, 000, 0000, . . . }, f₁(1) = {1, 01, 001, 0001, . . . }, f₁(2) = {10, 010, 0010, 00010, . . . }, f₁(3) = {11, 011, 0011, 00011, . . . }, and f₁(4) = {100, 0100, 00100, 000100, . . . }, etc.

Similarly, f₂ is a binary representation over {0, 1} of the natural numbers if it assigns to the ith natural number the set consisting of the ith canonically smallest binary string. In such a case, f₂(0) = {}, f₂(1) = {0}, f₂(2) = {1}, f₂(3) = {00}, f₂(4) = {01}, f₂(5) = {10}, f₂(6) = {11}, f₂(7) = {000}, f₂(8) = {1000}, f₂(9) = {1001}, . . .

On the other hand, f₃ is a unary representation over {1} of the natural numbers if it assigns to the ith natural number the set consisting of the ith alphabetically (= canonically) smallest unary string. In such a case, f₃(0) = {}, f₃(1) = {1}, f₃(2) = {11}, f₃(3) = {111}, f₃(4) = {1111}, . . . , f₃(i) = {1ⁱ}, . . .

The three representations f₁, f₂, and f₃ are illustrated in Figure 1.1.1.

Figure 1.1.1

Representations for the natural numbers.

In the rest of the book, unless otherwise is stated, the function f₁ of Example 1.1.8 is assumed to be the binary representation of the natural numbers.

[next] [front] [up]