CSC 241 Ch. 6
When we introduced dictionaries at the start of this section, our motivation was the need for a container with user-defined indexes. We now show alternate uses for dictionaries. Suppose we would like to develop a small function, named complete(), that takes the abbreviation of a day of the week, such as 'Tu', and returns the corresponding day, which for input 'Tu' would be 'Tuesday':
>>> complete('Tu')
'Tuesday'
Instead of a seven-way if statement, the function looks up the abbreviation in a dictionary:
def complete(abbreviation):
    'returns day of the week corresponding to abbreviation'
    days = {'Mo': 'Monday', 'Tu': 'Tuesday', 'We': 'Wednesday',
            'Th': 'Thursday', 'Fr': 'Friday', 'Sa': 'Saturday',
            'Su': 'Sunday'}
    return days[abbreviation]
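For comparison, here is a sketch of the multiway if statement that the dictionary lookup replaces. The name complete_with_if() is made up for this illustration and does not appear in the text; complete() is repeated so that the example runs on its own.

```python
def complete(abbreviation):
    'returns day of the week corresponding to abbreviation'
    days = {'Mo': 'Monday', 'Tu': 'Tuesday', 'We': 'Wednesday',
            'Th': 'Thursday', 'Fr': 'Friday', 'Sa': 'Saturday',
            'Su': 'Sunday'}
    return days[abbreviation]


def complete_with_if(abbreviation):
    'hypothetical multiway-if version of complete(), for comparison only'
    if abbreviation == 'Mo':
        return 'Monday'
    elif abbreviation == 'Tu':
        return 'Tuesday'
    elif abbreviation == 'We':
        return 'Wednesday'
    elif abbreviation == 'Th':
        return 'Thursday'
    elif abbreviation == 'Fr':
        return 'Friday'
    elif abbreviation == 'Sa':
        return 'Saturday'
    elif abbreviation == 'Su':
        return 'Sunday'
    # Unlike complete(), which raises KeyError, an unknown abbreviation
    # falls through here and the function returns None.


# Both versions agree on every valid abbreviation.
abbreviations = ['Mo', 'Tu', 'We', 'Th', 'Fr', 'Sa', 'Su']
agree = all(complete(a) == complete_with_if(a) for a in abbreviations)
print(agree)
```

The dictionary version is shorter, and adding a new abbreviation means adding one (key, value) pair rather than another branch.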
A Dictionary as a Substitute for the Multiway if Statement
For many years, the standard encoding for characters in the English language was the ASCII encoding. The American Standard Code for Information Interchange (ASCII) was developed in the 1960s. It defines a numeric code for 128 characters: letters, digits, punctuation, and a few other symbols common in American English, along with some nonprintable control characters. Table 6.4 shows the decimal ASCII codes for the printable characters. Let's explain what the entries of this table mean. The decimal ASCII code for lowercase a is 97. The & sign is encoded with decimal ASCII code 38. ASCII codes 0 through 32 and 127 include nonprintable characters, such as backspace (decimal code 8), horizontal tab (decimal code 9), and line feed (decimal code 10). You can explore the ASCII encodings using the Python function ord(), which returns the decimal ASCII code of a character:
>>> ord('a')
97
The sequence of characters of a string value (such as 'dad') is encoded as a sequence of ASCII codes 100, 97, and 100. What is stored in memory is exactly this sequence of codes. Of course, each code is stored in binary. As ASCII decimal codes go from 0 to 127, they can be encoded with seven bits; because a byte (eight bits) is the smallest memory storage unit, each code is stored in one byte. For example, the decimal ASCII code for lowercase a is 97, which corresponds to binary ASCII code 1100001. So, in the ASCII encoding, character a is encoded in a single byte with the first bit being a 0 and the remaining bits being 1100001.
The resulting byte 01100001 can be described more succinctly using the two-digit hex number 0x61 (6 for the leftmost four bits, 0110, and 1 for the rightmost four bits, 0001). In fact, it is common to use hex ASCII codes (as a shorthand for ASCII binary codes).
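The decimal, binary, and hex forms described above can be checked in Python. This is a small sketch using only built-in functions; the variable names are chosen for this illustration.

```python
# Decimal, 8-bit binary, and hex forms of the ASCII code for lowercase a
code = ord('a')                  # decimal code of the character
binary = format(code, '08b')     # eight binary digits, padded with zeros
hexadecimal = hex(code)          # hex form with the 0x prefix

# chr() is the inverse of ord(): it maps a code back to its character
character = chr(code)

# The codes of the characters of 'dad', and the bytes of its ASCII encoding
codes = [ord(c) for c in 'dad']
raw = 'dad'.encode('ascii')      # a bytes object, one byte per character

print(code, binary, hexadecimal, character)
print(codes, list(raw))
```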
ASCII
This chapter starts by introducing several built-in Python container classes that complement the string and list classes we have been using so far.
The dictionary class dict is a container of (key, value) pairs. One way to view a dictionary is to see it as a container that stores values that are accessible through user-specified indexes called keys. Another is to see it as a mapping from keys to values. Dictionaries are as useful as lists in practice. A dictionary can be used, for example, as a substitute for a multiway conditional structure or as a collection of counters.
In some situations, the mutability of lists is a problem. For example, we cannot use lists as keys of a dictionary because lists are mutable. We introduce the built-in class tuple, which is essentially an immutable version of class list. We use tuple objects when we need an immutable version of a list.
The last built-in container class covered in this book is the class set that implements a mathematical set, that is, a container that supports mathematical set operations, such as union and intersection. As all elements of a set must be distinct, sets can be used to easily remove duplicates from other containers.
In this chapter, we also complete the coverage of Python's built-in string type str that we started in Chapter 2 and continued in Chapter 4. We describe the range of characters that a string object can contain. We introduce the Unicode character encoding scheme, the default in Python 3 (but not Python 2), which enables developers to work with strings that use non-American English characters.
Finally, this chapter introduces the Standard Library module random. The module supports functions that return pseudorandom numbers, which are needed in simulations and computer games. We also introduce random module functions shuffle(), choice(), and sample() that enable us to do shuffling and sampling on container objects.
Chapter Summary
String objects are used to store text, that is, a sequence of characters. The characters could be upper- and lowercase letters from the alphabet, digits, punctuation marks, and possibly symbols like the dollar sign ($). As we saw in Chapter 2, in order to create a variable whose value is the text 'An apple costs $0.99!', we just need to do:
>>> text = 'An apple costs $0.99!'
The variable text then evaluates to the text:
>>> text
'An apple costs $0.99!'
While all this may sound very clean and straightforward, strings are somewhat messy. The problem is that computers deal with bits and bytes, and string values need to be somehow encoded with bits and bytes. In other words, each character of a string value needs to be mapped to a specific bit encoding, and this encoding should map back to the character.
But why should we care about this encoding? As we saw in Chapters 2 and 4, manipulating strings is quite intuitive, and we certainly did not worry about how strings are encoded. Most of the time, we do not have to worry about it. However, in a global Internet, documents created in one location may need to be read in another. We need to know how to work with characters from other writing systems, whether they are characters from other languages, such as French, Greek, Arabic, or Chinese, or symbols from various domains, such as math, science, or engineering. Just as importantly, we need to understand how strings are represented because, as computer scientists, we do like to know what is under the hood.
Character Encodings
We start with function randrange(), which takes a pair of integers a and b and returns an integer in the range from a (inclusive) up to b (exclusive), with each integer in the range equally likely. Here is how we would use this function, after importing module random, to simulate several (six-sided) die tosses:
>>> import random
>>> random.randrange(1,7)
2
>>> random.randrange(1,7)
6
>>> random.randrange(1,7)
5
>>> random.randrange(1,7)
1
>>> random.randrange(1,7)
2
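To see the "equally likely" claim at work, the following sketch simulates many die tosses and counts how often each face comes up. The seed value and the number of tosses are arbitrary choices made for this illustration; the seed only makes the run repeatable.

```python
import random

random.seed(241)  # arbitrary seed, used only to make the run repeatable

# Simulate 1000 tosses of a six-sided die
tosses = [random.randrange(1, 7) for _ in range(1000)]

# Count how many times each face 1..6 came up
counts = {face: tosses.count(face) for face in range(1, 7)}
print(counts)
```

With a fair die, each count should be reasonably close to 1000/6, though the exact values depend on the seed.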
Choosing a Random Integer
Usage        Explanation
x in s       True if x is in set s, else False
x not in s   False if x is in set s, else True
len(s)       Returns the size of set s
s == t       True if sets s and t contain the same elements, False otherwise
s != t       True if sets s and t do not contain the same elements, False otherwise
s <= t       True if every element of set s is in set t, False otherwise
s < t        True if s <= t and s != t
s | t        Returns the union of sets s and t
s & t        Returns the intersection of sets s and t
s - t        Returns the difference between sets s and t
s ^ t        Returns the symmetric difference of sets s and t
Class set operators. Shown are the usage and explanation for commonly used set operators.
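The operators in the table can be tried on two small sets. The sets s and t below are made up for this illustration.

```python
s = {1, 2, 3}
t = {2, 3, 4}

union = s | t                  # elements in s or in t
intersection = s & t           # elements in both s and t
difference = s - t             # elements in s that are not in t
symmetric = s ^ t              # elements in exactly one of s and t

subset = s <= t                # is every element of s also in t?
proper_subset = {2, 3} < s     # a subset of s that is not equal to s

print(union, intersection, difference, symmetric)
print(subset, proper_subset)
```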
The Python dictionary type, denoted dict, is a container type, just like list and str. A dictionary contains (key, value) pairs. The general format of the expression that evaluates to a dictionary object is:
{<key 1>:<value 1>, <key 2>:<value 2>, ..., <key i>:<value i>}
This expression defines a dictionary containing i key:value pairs. The key and the value are both objects. The key is the "index" that is used to access the value. So, in our dictionary employee, '100-01-0010' is the key and ['Hans', 'Castorp'] is the value.
The (key, value) pairs in a dictionary expression are separated by commas and enclosed in curly braces (as opposed to square brackets, [], used for lists.) The key and value in each (key, value) pair are separated by a colon (:) with the key being to the left and the value to the right of the colon. Keys can be of any type as long as the type is immutable. So string and number objects can be keys, whereas objects of type list cannot. The value can be of any type.
We often say that a key maps to its value or is the index of the value. Because dictionaries can be viewed as a mapping from keys to values, they are often referred to as maps. For example, here is a dictionary mapping day abbreviations 'Mo', 'Tu', 'We', and 'Th' (the keys) to the corresponding days 'Monday', 'Tuesday', 'Wednesday', and 'Thursday' (the values):
>>> days = {'Mo':'Monday', 'Tu':'Tuesday', 'We':'Wednesday', 'Th':'Thursday'}
The variable days refers to a dictionary, illustrated in Figure 6.2, with four (key, value) pairs. The (key, value) pair 'Mo':'Monday' has key 'Mo' and value 'Monday', the (key, value) pair 'Tu':'Tuesday' has key 'Tu' and value 'Tuesday', etc.
Dictionary Class Properties
While the list and dict classes share quite a few operators, they share only a few methods; one of them is pop(). For a dictionary, this method takes a key, and if the key is in the dictionary, it removes the associated (key, value) pair from the dictionary and returns the value:
>>> days
{'Fr': 'Friday', 'Mo': 'Monday', 'Tu': 'Tuesday', 'We': 'Wednesday', 'Th': 'Thursday', 'Sa': 'Sat'}
>>> days.pop('Tu')
'Tuesday'
>>> days.pop('Fr')
'Friday'
>>> days
{'Mo': 'Monday', 'We': 'Wednesday', 'Th': 'Thursday', 'Sa': 'Sat'}
We now introduce some more dictionary methods. When dictionary d1 calls method update() with input argument dictionary d2, all the (key, value) pairs of d2 are added to d1, possibly writing over (key, value) pairs of d1. For example, suppose we have a dictionary of our favorite days of the week:
>>> favorites = {'Th':'Thursday', 'Fr':'Friday','Sa':'Saturday'}
We can add those days to our days dictionary:
>>> days.update(favorites)
>>> days
{'Fr': 'Friday', 'Mo': 'Monday', 'We': 'Wednesday', 'Th': 'Thursday', 'Sa': 'Saturday'}
The (key, value) pair 'Fr':'Friday' has been added to days and the (key, value) pair 'Sa':'Saturday' has replaced the pair 'Sa':'Sat', originally in dictionary days. Note that only one copy of (key, value) pair 'Th':'Thursday' can be in the dictionary.
Particularly useful dictionary methods are keys(), values(), and items(): They return the keys, values, and (key, value) pairs, respectively, in the dictionary. To illustrate how to use these methods, we use dictionary days defined as:
>>> days
{'Fr': 'Friday', 'Mo': 'Monday', 'We': 'Wednesday', 'Th': 'Thursday', 'Sa': 'Saturday'}
The method keys() returns the keys of the dictionary:
>>> keys = days.keys()
>>> keys
dict_keys(['Fr', 'Mo', 'We', 'Th', 'Sa'])
The container object returned by method keys() is not a list. Let's check its type:
>>> type(days.keys())
<class 'dict_keys'>
OK, it's a type we have not seen before. Do we really have to learn everything there is to know about this new type?
At this point, not necessarily. We only really need to understand its usage. So, how is the object returned by the keys() method used? It is typically used to iterate over the keys of the dictionary, for example:
>>> for key in days.keys():
...     print(key, end=' ')
...
Fr Mo We Th Sa
Thus, the dict_keys class supports iteration. In fact, when we iterate directly over a dictionary, as in:
>>> for key in days:
...     print(key, end=' ')
...
Fr Mo We Th Sa
the Python interpreter treats the statement for key in days as if it were the statement for key in days.keys().
Table 6.2 lists some of the commonly used methods that the dictionary class supports; as usual, you can learn more by looking at the online documentation or by typing
>>> help(dict)
...
in the interpreter shell. The dictionary methods values() and items() shown in Table 6.2 also return objects that we can iterate over. The method values() is typically used to iterate over the values of a dictionary:
>>> for value in days.values():
...     print(value, end=', ')
...
Friday, Monday, Wednesday, Thursday, Saturday,
Dictionary Methods
The dictionary class supports some of the same operators that the list class supports.
Operation    Explanation
k in d       True if k is a key in dictionary d, else False
k not in d   False if k is a key in dictionary d, else True
d[k]         Value corresponding to key k in dictionary d
len(d)       Number of (key, value) pairs in dictionary d
Dictionary Operators
There are operators that the list class supports but the class dict does not. For example, the indexing operator [] cannot be used to get a slice of a dictionary. This makes sense: A slice selects items by position, and the items of a dictionary are accessed by key, not by position. Also not supported are operators + and *, among others.
An important application of the dictionary type is its use in computing the number of occurrences of "things" in a larger set. A search engine, for example, may need to compute the frequency of each word in a web page in order to calculate its relevance with respect to search engine queries. On a smaller scale, suppose that we would like to count the frequency of each name in a list of student names such as:
>>> students = ['Cindy', 'John', 'Cindy', 'Adam', 'Adam', 'Jimmy', 'Joan', 'Cindy', 'Joan']
>>> frequency(students)
{'John': 1, 'Joan': 2, 'Adam': 2, 'Cindy': 3, 'Jimmy': 1}
In the dictionary returned by the call frequency(students), shown in Figure 6.4, the keys are the distinct names in the list students and the values are the corresponding frequencies: so 'John' occurs once, 'Joan' occurs twice, and so on.
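The definition of frequency() is not shown in this passage. The following is one possible implementation, a minimal sketch that uses a dictionary as a collection of counters; the parameter and variable names are chosen for this illustration.

```python
def frequency(item_list):
    'returns a dictionary mapping each item in item_list to its count'
    counters = {}                    # starts with no counters
    for item in item_list:
        if item in counters:         # seen before: increment its counter
            counters[item] += 1
        else:                        # first occurrence: create a counter
            counters[item] = 1
    return counters


students = ['Cindy', 'John', 'Cindy', 'Adam', 'Adam',
            'Jimmy', 'Joan', 'Cindy', 'Joan']
result = frequency(students)
print(result)
```

Each distinct name becomes a key the first time it is seen, and its value is incremented on every later occurrence.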
Dictionary as a Collection of Counters
Method         Explanation
d.items()      Returns a view of the (key, value) pairs in d as tuples
d.get(k)       Returns the value of key k, like d[k], but returns None instead of raising an error if k is not a key in d
d.keys()       Returns a view of the keys of d
d.pop(k)       Removes the (key, value) pair with key k from d and returns the value
d.update(d2)   Adds the (key, value) pairs of dictionary d2 to d
d.values()     Returns a view of the values of d
Methods of the dict class. Listed are some commonly used methods of the dictionary class. d refers to a dictionary.
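The difference between d.get(k) and d[k] shows up only when k is not a key. Here is a short sketch; the two-entry dictionary and the variable names are made up for this illustration.

```python
days = {'Mo': 'Monday', 'Tu': 'Tuesday'}

# For a key that is present, get() and the indexing operator agree
same = days.get('Mo') == days['Mo']

# For a missing key, get() returns None (or a default we supply) ...
missing = days.get('We')
with_default = days.get('We', 'unknown')

# ... whereas the indexing operator raises a KeyError exception
try:
    days['We']
    raised = False
except KeyError:
    raised = True

print(same, missing, with_default, raised)
```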
The set class has all the properties of a mathematical set. It is used to store an unordered collection of items, with no duplicate items allowed. The items must be immutable objects. The set type supports operators that implement the classical set operations: set membership, intersection, union, symmetric difference, and so on. It is thus useful whenever a collection of items is modeled as a mathematical set. It is also useful for duplicate removal.
A set is defined using the same notation that is used for mathematical sets: a sequence of items separated by commas and enclosed in curly braces: { }. Here is how we would assign the set of three phone numbers (as strings) to variable phonebook1:
>>> phonebook1 = {'123-45-67', '234-56-78', '345-67-89'}
We check the value and type of phonebook1:
>>> phonebook1
{'123-45-67', '234-56-78', '345-67-89'}
>>> type(phonebook1)
<class 'set'>
If we had defined a set with duplicate items, they would be ignored:
>>> phonebook1 = {'123-45-67', '234-56-78', '345-67-89', '123-45-67', '345-67-89'}
>>> phonebook1
{'123-45-67', '234-56-78', '345-67-89'}
Sets
Let's illustrate a few more functions from the random module. The function shuffle() shuffles, or permutes, the objects in a sequence not unlike how a deck of cards is shuffled prior to a card game like blackjack. Each possible permutation is equally likely. Here is how we can use this function to shuffle a list twice:
>>> lst = [1,2,3,4,5]
>>> random.shuffle(lst)
>>> lst
[3, 4, 1, 5, 2]
>>> random.shuffle(lst)
>>> lst
[1, 3, 2, 4, 5]
The function choice() allows us to choose an item from a container uniformly at random. Given list
>>> lst = ['cat', 'rat', 'bat', 'mat']
here is how we would choose a list item uniformly at random:
>>> random.choice(lst)
'mat'
>>> random.choice(lst)
'bat'
>>> random.choice(lst)
'rat'
>>> random.choice(lst)
'bat'
If, instead of needing just one item, we want to choose a sample of size k, with every sample equally likely, we would use the sample() function. It takes as input the container and the number k. Here is how we would choose random samples of list lst of size 2 or 3:
>>> random.sample(lst, 2)
['mat', 'bat']
>>> random.sample(lst, 2)
['cat', 'rat']
>>> random.sample(lst, 3)
['rat', 'mat', 'bat']
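The following sketch puts the three functions together on a small "deck" of numbered cards. The deck, the seed value, and the variable names are assumptions made for this illustration; the seed only makes the run repeatable.

```python
import random

random.seed(6)  # arbitrary seed, used only to make the run repeatable

deck = list(range(1, 11))     # cards numbered 1 through 10

random.shuffle(deck)          # permutes deck in place and returns None
pick = random.choice(deck)    # one card, chosen uniformly at random
hand = random.sample(deck, 3) # three distinct cards; deck is not modified

print(deck, pick, hand)
```

Note that shuffle() modifies the list it is given, whereas choice() and sample() leave the container unchanged.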
Shuffling, Choosing, and Sampling at Random
ASCII is an American standard. As such, it does not provide for characters not in the American English language. There is no French 'é', Greek 'Δ', or Chinese '世' in ASCII encoding. Encodings other than ASCII were developed to handle different languages or groups of languages. This raises a problem, however: With the existence of different encodings, it is likely that some encodings are not installed on a computer. In a globally interconnected world, a text document that was created on one computer will often need to be read on another, a continent away. What if the computer reading the document does not have the right encoding installed?
Unicode was developed to be the universal character-encoding scheme. It covers all characters in all written languages, modern or ancient, and includes technical symbols from science, engineering, and mathematics, punctuation, and so on. In Unicode, every character is represented by an integer code point. The code point is not necessarily the actual byte representation of the character, however; it is just the identifier for the particular character.
For example, the code point for lowercase 'k' is the integer with hex value 0x006B, which corresponds to decimal value 107. As you can see in Table 6.4, 107 is also the ASCII code for letter 'k'. Unicode conveniently uses a code point for ASCII characters that is equal to their ASCII code.
How do you incorporate Unicode characters into a string? To include character 'k', for example, you would use the Python escape sequence \u006B:
>>> '\u006B'
'k'
In the next example, the escape sequence \u0020 is used to denote the Unicode character with code point 0x0020 (in hex, corresponding to decimal 32). This is, of course, the blank space (see Table 6.4):
>>> 'Hello\u0020World !'
'Hello World !'
We now try a few examples in several different languages. Let's start with my name in Cyrillic:
>>> '\u0409\u0443\u0431\u043e\u043c\u0438\u0440'
'Љубомир'
Here is 'Hello World!' in Greek:
>>> '\u0393\u03b5\u03b9\u03b1\u0020\u03c3\u03b1\u03c2\u0020\u03ba\u03cc\u03c3\u03bc\u03bf!'
'Γεια σας κόσμο!'
Finally, let's write 'Hello World!' in Chinese:
>>> chinese = '\u4e16\u754c\u60a8\u597d!'
>>> chinese
'世界您好!'
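The distinction between a code point and its byte representation can be made concrete. The sketch below uses UTF-8, which is not discussed in this passage; it is assumed here only as one widely used way of turning code points into bytes.

```python
# The code point of a character is what ord() returns
k_point = ord('k')                      # same as the ASCII code of k
e_point = ord('\u00e9')                 # French e with acute accent
c_point = ord('\u4e16')                 # first character of the Chinese example

# The byte representation depends on the encoding (UTF-8 is assumed here)
k_bytes = 'k'.encode('utf-8')           # ASCII characters take one byte
e_bytes = '\u00e9'.encode('utf-8')      # takes two bytes in UTF-8
c_bytes = '\u4e16'.encode('utf-8')      # takes three bytes in UTF-8

print(hex(k_point), hex(e_point), hex(c_point))
print(len(k_bytes), len(e_bytes), len(c_bytes))
```

The code point identifies the character; how many bytes it occupies is a property of the chosen encoding.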
Unicode
A dictionary is a container that stores items that are accessible using "user-specified" indexes.
User-Defined Indexes as Motivation for Dictionaries
The fact that sets cannot have duplicates gives us the first great application for sets: removing duplicates from a list. Suppose we have a list with duplicates, such as this list of ages of students in a class:
>>> ages = [23, 19, 18, 21, 18, 20, 21, 23, 22, 23, 19, 20]
To remove duplicates from this list, we can convert the list to a set, using the set constructor. The set constructor will eliminate all duplicates because a set is not supposed to have them. By converting the set back to a list, we get a list with no duplicates:
>>> ages = list(set(ages))
>>> ages
[18, 19, 20, 21, 22, 23]
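Because a set is unordered, the order of the items in list(set(ages)) should not be relied on. The sketch below shows two variations that make the order explicit; both are additions for illustration and are not part of the approach described above.

```python
ages = [23, 19, 18, 21, 18, 20, 21, 23, 22, 23, 19, 20]

# Variation 1: remove duplicates, then sort the result explicitly
sorted_ages = sorted(set(ages))

# Variation 2: remove duplicates but keep the order of first occurrence.
# dict.fromkeys() creates a dictionary whose keys are the items of ages;
# dictionary keys are distinct and (in Python 3.7 and later) keep
# their insertion order.
first_seen_ages = list(dict.fromkeys(ages))

print(sorted_ages)
print(first_seen_ages)
```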
Using the set Constructor to Remove Duplicates
The method items() returns a container that contains tuple objects, one for each (key, value) pair:
>>> days.items()
dict_items([('Fr', 'Friday'), ('Mo', 'Monday'), ('We', 'Wednesday'), ('Th', 'Thursday'), ('Sa', 'Saturday')])
This method is typically used to iterate over the (key, value) pairs of the dictionary:
>>> for item in days.items():
...     print(item, end='; ')
...
('Fr', 'Friday'); ('Mo', 'Monday'); ('We', 'Wednesday'); ('Th', 'Thursday'); ('Sa', 'Saturday');
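Since each item is a two-element tuple, the loop can also assign the key and the value to separate variables. This is a small sketch; the variable names and the output format are chosen for this illustration.

```python
days = {'Fr': 'Friday', 'Mo': 'Monday', 'We': 'Wednesday',
        'Th': 'Thursday', 'Sa': 'Saturday'}

lines = []
# Each (key, value) tuple is unpacked into two loop variables
for abbreviation, day in days.items():
    lines.append(abbreviation + ' -> ' + day)

for line in lines:
    print(line)
```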
method items()
The set class supports operators that correspond to the usual mathematical set operations. Some are operators that can also be used with list, string, and dictionary types. For example, the in and not in operators are used to test set membership:
>>> '123-45-67' in phonebook1
True
>>> '456-78-90' in phonebook1
False
>>> '456-78-90' not in phonebook1
True
The len() operator returns the size of the set:
>>> len(phonebook1)
3
Comparison operators ==, !=, <, <=, >, and >= are supported as well, but their meaning is set-specific. Two sets are "equal" if and only if they have the same elements:
>>> phonebook3 = {'345-67-89','456-78-90'}
>>> phonebook1 == phonebook3
False
>>> phonebook1 != phonebook3
True
set Operators
In Practice Problem 6.2, we defined a dictionary that maps phone numbers to (the first and last name of) individuals:
>>> rphonebook = {'(123)456-78-90':['Anna','Karenina'], '(901)234-56-78':['Yu', 'Tsun'], '(321)908-76-54':['Hans', 'Castorp']}
We used this dictionary to implement a reverse phone book lookup application: Given a phone number, the app returns the individual that number is assigned to. What if, instead, we wanted to build an app that implements a standard phone book lookup: Given a person's first and last name, the app would return the phone number assigned to that individual.
For the standard lookup app, a dictionary such as rphonebook is not appropriate. What we need is a mapping from individuals to phone numbers. So let's define a new dictionary that is, effectively, the inverse of the mapping of rphonebook. The keys of this new dictionary cannot be the lists used as values in rphonebook, because lists are mutable and therefore cannot be dictionary keys.
Because tuple objects are immutable, they can be used as dictionary keys. Let's get back to our original goal of constructing a dictionary that maps (the first and last name of) individuals to phone numbers. We can now use tuple objects as keys, instead of list objects:
>>> phonebook = {('Anna','Karenina'):'(123)456-78-90', ('Yu', 'Tsun'):'(901)234-56-78', ('Hans', 'Castorp'):'(321)908-76-54'}
>>> phonebook
{('Hans', 'Castorp'): '(321)908-76-54', ('Yu', 'Tsun'): '(901)234-56-78', ('Anna', 'Karenina'): '(123)456-78-90'}
Let's check that the indexing operator works as we want:
>>> phonebook[('Hans', 'Castorp')]
'(321)908-76-54'
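The sketch below shows what happens when a list is used as a key, and one way to build the inverse dictionary from rphonebook instead of typing it by hand. The dictionary comprehension is an addition for illustration and is not the construction used in the text.

```python
rphonebook = {'(123)456-78-90': ['Anna', 'Karenina'],
              '(901)234-56-78': ['Yu', 'Tsun'],
              '(321)908-76-54': ['Hans', 'Castorp']}

# A list cannot be a dictionary key: Python raises a TypeError
try:
    bad = {['Anna', 'Karenina']: '(123)456-78-90'}
    message = None
except TypeError as error:
    message = str(error)

# Invert the mapping, converting each list of names to a tuple
phonebook = {tuple(name): number for number, name in rphonebook.items()}

print(message)
print(phonebook[('Hans', 'Castorp')])
```

This inversion works here because every individual in rphonebook has exactly one phone number; with repeated names, later entries would overwrite earlier ones.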
tuple Objects Can Be Dictionary Keys