StudySet
Work flow of Machine Learning
"Looking at the following typical workflow, we see that most of our time (in ML) will be spent with rather mundane tasks: (1) Reading the data and cleaning it (2) Exploring and understanding the input data (3) Analyzing how best to present the data to the learning algorithm (4) Choosing the right model and learning algorithm (5) Measuring the performance correctly" — Building Machine Learning Systems with Python book
Performance Tips - General
* Before optimizing, have a comprehensive but quick-to-run test suite, so you can confirm that future optimizations don't change the correctness of your program. In short:
  1. Make sure the program is running correctly, as it was intended to
  2. Test that it's right
  3. Profile if slow
  4. Optimise
  5. Repeat from step 2
ASCII
* ASCII
  - ASCII defines 128 characters which map to the numbers 0-127
  - For example the number 65 means Latin capital 'A'
        >>> ord('A')
        65
        >>> chr(65)
        'A'
  - ord(c): return an integer representing the Unicode code point of the character when the argument is a unicode object, or the value of the byte when the argument is an 8-bit string. This is the inverse of chr(i)
  - chr(i): return a string of one character whose ASCII code is the integer i. This is the inverse of ord(c). The argument must be in the range [0, 255]
  - ASCII is the first 128 code points in Unicode, 0 through 127
  - ASCII is a subset of iso-latin-1, and iso-latin-1 is the first 256 code points of Unicode
  - ASCII characters are one byte each
Performance Tips - Avoiding Dots ...
* Avoiding dots...
  - The traditional for loop has another inefficiency
  - Both newlist.append and word.upper are function references that are reevaluated each time through the loop
  - Can be replaced with
        >>> upper = str.upper
        >>> newlist = []
        >>> append = newlist.append
        >>> for word in oldlist:
        ...     append(upper(word))
  - Use this technique with caution. This gets much more difficult to maintain...
Default Arguments in Python
* Default Arguments in Python
  - When the default value for a function argument is an expression (or a mutable object, e.g. a list), the expression is evaluated only once, when the function is defined, not every time the function is called
  - Example:
        >>> def append(list=[]):
        ...     # append the length of a list to the list
        ...     list.append(len(list))
        ...     return list
        ...
        >>> append(['a','b'])
        ['a', 'b', 2]
        >>> append()  # first call with no arg uses default list value of []
        [0]
        >>> append()  # but what happens when we AGAIN call append with no arg?
        [0, 1]
        >>> append()  # successive calls keep extending the same default list!
        [0, 1, 2]
        >>> append()  # and so on, and so on, and so on...
        [0, 1, 2, 3]
  - How do we avoid this behavior? Default it to an immutable object such as None
        >>> def append(list=None):
        ...     if list is None:
        ...         list = []
        ...     # append the length of a list to the list
        ...     list.append(len(list))
        ...     return list
        ...
        >>> append()
        [0]
        >>> append()
        [0]
Default Dict from collections
* Default Dict is awesome
  - avoids if statements and try/except blocks
  - Simple dictionary that counts the frequency of each word
        >>> from collections import defaultdict
        >>> wdict = defaultdict(int)
        >>> for word in words:
        ...     wdict[word] += 1
yield
- the yield keyword tells the Python interpreter to turn whatever function it's currently in into a generator
- any function that contains a yield statement is a generator function; calling it returns a generator object
- the yield keyword saves the 'state' of the generator function between calls to next()
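A minimal sketch of the state-saving behavior described above. The function name countdown is made up for illustration; the code should run on both Python 2.7 and Python 3.

```python
def countdown(n):
    """Yield n, n-1, ..., 1. Execution pauses at each yield."""
    while n > 0:
        yield n   # hand back a value and remember where we stopped
        n -= 1    # resumes here on the next call to next()

gen = countdown(3)   # calling the function only creates the generator
first = next(gen)    # runs the body up to the first yield
rest = list(gen)     # continues from the saved state until exhausted
```

Here first holds the first yielded value and rest holds whatever was left, which shows that the local variable n survived between calls.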
Selection Sort
- this algorithm repeatedly identifies the largest remaining unsorted element and swaps it to the end of the unsorted portion, so the sorted portion grows from the back of the array (the mirror-image variant selects the smallest element and grows the sorted portion from the front)
- O(n^2)

    def selection_sort(li):
        """
        The selection sort improves on the bubble sort by making only one
        exchange for every pass through the list. O(n^2)
        """
        for num in range(len(li) - 1, 0, -1):
            index_of_max = 0
            # find the largest element in li[0..num]
            for location in range(1, num + 1):
                if li[location] > li[index_of_max]:
                    index_of_max = location
            # move it to the last unsorted position
            li[num], li[index_of_max] = li[index_of_max], li[num]
Matrix Multiplication
- O(n^3) for the straightforward algorithm that multiplies two n x n matrices
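To see where the cubic cost comes from, here is a sketch of the straightforward triple-loop algorithm on plain lists of lists. The name matmul is an assumption for this example, and the code does no shape checking.

```python
def matmul(a, b):
    """Multiply an n x m matrix by an m x p matrix (lists of rows)."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):          # each row of a
        for j in range(p):      # each column of b
            for k in range(m):  # dot product of that row and column
                c[i][j] += a[i][k] * b[k][j]
    return c

product = matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

With square n x n inputs the three nested loops each run n times, which gives the O(n^3) bound.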
Abstraction
- to distill a complicated system down to its fundamental parts - Python has a tradition of treating abstractions implicitly using a mechanism known as duck typing. Programmers assume that an object supports a set of known behaviors, with the interpreter raising a run-time error if those assumptions fail - Through Abstract Base Classes (ABC) a concrete class can inherit from it and provide further implementations
Copy of a Reference
    > list1 = [1,5,9,13]
    > list2 = list1
    > list2[0] = -1
    > print list1, list2
    [-1, 5, 9, 13] [-1, 5, 9, 13]
- modifying list2 also modifies list1!
- this is because list2 does not copy list1, instead list2 is set to reference the same data as list1
- list1 and list2 refer to the same data
- we can force python to copy list1 by using the following:
    > list1 = [1,5,9,13]
    > list2 = list1[:]
What happens when you go to google.com (2)
Browser: "Ok, so, I have a user requesting this address: www.cnn.com. I figure since there are no slashes or anything, this is a direct request of a main page. There was also no protocol or port defined, so I'll assume it's HTTP and going to port 80... oh well, first things first. Hey DNS, pal, wake up! Where is this www.cnn.com hiding at?" DNS: "Right... wait a sec, I'll ask the ISP servers. Ok, it looks like 157.166.226.25." Browser: "Ok. Internet Protocol Suite, your turn! Call 157.166.226.25, please. Send them this HTTP header. It's asking for the basic structure and content of their main page so I know what else to fetch... oh well, not that you'd care about this I guess. " TCP/IP: "What do you mean my turn? Like I wasn't just working my back off right there for the DNS? God, what does it take to get a bit of appreciation here..." Browser: ... TCP/IP: "Yeah, yeah... Connecting... I'll just ask the gateway to forward it. You know, it isn't all that easy, I'll have to divide your pretty request there into multiple parts so it reaches the end, and assemble any stuff they send back from all the thousands of packages I get... ah, right, you don't care. Figures." Meanwhile, at the CNN headquarters, a message finally ends up at the door of the Web Server. CNN Web Server: "Nzhôô! A customer! He wants news! The Front Page! How about it?" CNN Server Side Script Engine: "Right, will do! Front page, right?" CNN Database Server: "Yey! Work for me! What content do you need?" CNN Server Side Script Engine: "... um, sorry DB, I have a copy of front page right here in my cache, no need to compile anything. But hey, take this user ID and store it, I'll send it to the customer too, so we know who we're talking to later on." CNN Database Server: "Yey!" Back at the user's computer... TCP/IP: "Ooookay, here comes the reply. Oh boy, why do I have a feeling this'll be a big one..." Browser: "Uh, wow... this has all sorts of javascript code... bunch of images, couple of forms... 
Right, this'll take a while to render. Better get to it. Hey, IP system, there's a bunch more stuff you'll need to get. Let's see I need a few stylesheets from i.cdn.turner.com - via HTTP and ask for the file /cnn/.element/css/2.0/common.css. And then get some of those scripts at i.cdn.turner.com too, I'm counting six so far..." TCP/IP: "I get the picture. Just give me the server addresses and all that. And wrap that file stuff within the HTTP request, I don't want to deal with it." DNS: "Checking the i.cdn.turner.com... hey, bit of trivia, it's actually called cdn.cnn.com.c.footprint.net. IP is 4.23.41.126" Browser: "Sure, sure... wait a sec, this'll take a few nsec to process, I'm trying to understand all this script..." TCP/IP: "Hey, here's the CSS you asked for. Oh, and... yeah, those additional scripts also just came back." Browser: "Whew, there's more... some sort of video ad!" TCP/IP: "Oh boy, what fun that sounds like..." Browser: "There's all sorts of images too! And this CSS looks a bit nasty... right, so if that part goes there, and has this line at the top... how on earth would that fit anymore... no, I'll have to stretch this a bit to make it... Oh, but that other CSS file overrides that rule... Well, this one ain't going to be an easy piece to render, that's for sure!" TCP/IP: "Ok, ok, stop distracting me for a sec, there's a lot to do here still." Browser: "User, here's a small progress report for you. Sorry, this all might take a few secs, there's like 140 different elements to load, and going at 16 so far." One or two seconds later... TCP/IP: "Ok, that should be all. Hey, listen... sorry I snapped at you earlier, you managing there? This sure seems like quite the load for you too." Browser: "Phew, yeah, it's all these websites nowdays, they sure don't make it easy for you. Well, I'll manage. It's what I'm here for." TCP/IP: "I guess it's quite heavy for all of us these days... oh, stop gloating there DNS!" Browser: "Hey user! 
The website's ready - go get your news!"
Data Structures -- Pointers and Linked Structures
- Pointers are the connections that hold the pieces of linked structures together. Pointers represent the address of a location in memory
- A cell phone number can be thought of as a pointer to its owner as they move about the planet
- The relative advantages of linked lists over static arrays include:
    - Overflow on linked structures can never occur unless the memory is actually full
    - Insertions and deletions are simpler than for contiguous (array) lists
    - With large records, moving pointers is easier and faster than moving the items themselves
- The relative advantages of arrays over linked lists are:
    - Linked structures require extra space for storing the pointer field
    - Linked lists do not allow efficient random access to items
    - Arrays allow better memory locality and cache performance than random pointer jumping
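A small sketch of a singly linked list, where each node's next attribute plays the role of the pointer. The names Node, push_front and to_list are made up for this example.

```python
class Node(object):
    """One cell of a singly linked list: a value plus a pointer to the next cell."""
    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node

def push_front(head, value):
    """Insert at the front in O(1): no existing items are moved."""
    return Node(value, head)

def to_list(head):
    """Walk the pointers from head to the end, collecting values."""
    items = []
    while head is not None:
        items.append(head.value)
        head = head.next
    return items

head = None
for v in [3, 2, 1]:
    head = push_front(head, v)
values = to_list(head)
```

Insertion only rewires one pointer, but reaching the k-th item means following k pointers, which is the random-access cost noted above.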
What happens when you go to google.com (1)
In an extremely rough and simplified sketch, assuming the simplest possible HTTP request, no proxies and IPv4 (this would work similarly for IPv6-only client, but I have yet to see such workstation):
1. browser checks cache; if requested object is in cache and is fresh, skip to #9
2. browser asks OS for server's IP address
3. OS makes a DNS lookup and replies the IP address to the browser
4. browser opens a TCP connection to server (this step is much more complex with HTTPS)
5. browser sends the HTTP request through TCP connection
6. browser receives HTTP response and may close the TCP connection, or reuse it for another request
7. browser checks if the response is a redirect (3xx result status codes), authorization request (401), error (4xx and 5xx), etc.; these are handled differently from normal responses (2xx)
8. if cacheable, response is stored in cache
9. browser decodes response (e.g. if it's gzipped)
10. browser determines what to do with response (e.g. is it a HTML page, is it an image, is it a sound clip?)
11. browser renders response, or offers a download dialog for unrecognized types
Again, discussion of each of these points have filled countless pages; take this as a starting point. Also, there are many other things happening in parallel to this (processing typed-in address, adding page to browser history, displaying progress to user, notifying plugins and extensions, rendering the page while it's downloading, pipelining, connection tracking for keep-alive, etc.).
why if __name__ == '__main__':
It makes a python script both importable and executable
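A minimal sketch of the idiom. The function name main and its return value are made up for illustration.

```python
def main():
    """Work that should run only when the file is executed directly."""
    return "running as a script"

# __name__ is '__main__' when the file is run as "python thisfile.py",
# but it is the module's name when the file is imported.
if __name__ == '__main__':
    print(main())
```

Another module can import this file and call main() itself without triggering the print, while running the file directly executes the guarded block.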
Design the infrastructure for a link shortener
Question
--------
I want to create a URL shortener service where you can write a long URL into an input field and the service shortens the URL to "http://www.example.org/abcdef". Instead of "abcdef" there can be any other string with six characters containing a-z, A-Z and 0-9. That makes 56~57 billion possible strings.

My approach: I have a database table with three columns:
- id, integer, auto-increment
- long, string, the long URL the user entered
- short, string, the shortened URL (or just the six characters)

I would then insert the long URL into the table. Then I would select the auto-increment value for "id" and build a hash of it. This hash should then be inserted as "short". But what sort of hash should I build? Hash algorithms like MD5 create too long strings. I don't use these algorithms, I think. A self-built algorithm will work, too.

My idea: For "http://www.google.de/" I get the auto-increment id 239472.

SO Answer
---------
I would continue your "convert number to string" approach. However you will realize that your proposed algorithm fails if your ID is a prime and greater than 52.

Theoretical background
You need a Bijective Function f. This is necessary so that you can find an inverse function g('abc') = 123 for your f(123) = 'abc' function. This means: There must be no x1, x2 (with x1 ≠ x2) that will make f(x1) = f(x2), and for every y you must be able to find an x so that f(x) = y.

How to convert the ID to a shortened URL
Think of an alphabet we want to use. In your case that's [a-zA-Z0-9]. It contains 62 letters. Take an auto-generated, unique numerical key (the auto-incremented id of a MySQL table for example). For this example I will use 125_10 (125 with a base of 10). Now you have to convert 125_10 to X_62 (base 62).

    125_10 = 2×62^1 + 1×62^0 = [2,1]

This requires use of integer division and modulo. A pseudo-code example:

    digits = []
    while num > 0
        remainder = modulo(num, 62)
        digits.push(remainder)
        num = divide(num, 62)
    digits = digits.reverse

Now map the indices 2 and 1 to your alphabet. This is how your mapping (with an array for example) could look like:

    0 → a
    1 → b
    ...
    25 → z
    ...
    52 → 0
    61 → 9

With 2 → c and 1 → b you will receive cb_62 as the shortened URL: http://shor.ty/cb

How to resolve a shortened URL to the initial ID
The reverse is even easier. You just do a reverse lookup in your alphabet. e9a_62 will be resolved to "4th, 61st, and 0th letter in alphabet".

    e9a_62 = [4,61,0] = 4×62^2 + 61×62^1 + 0×62^0 = 19158_10

Now find your database-record with WHERE id = 19158 and do the redirect.
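The conversion in the answer can be sketched in Python as follows. The names ALPHABET, encode and decode are assumptions made for this example. The alphabet order (a-z, then A-Z, then 0-9) follows the mapping in the answer, and the special case for 0 is added here because the pseudo-code loop produces no digits for it.

```python
import string

# 62 symbols: indices 0-25 are a-z, 26-51 are A-Z, 52-61 are 0-9
ALPHABET = string.ascii_lowercase + string.ascii_uppercase + string.digits

def encode(num):
    """Convert a non-negative integer id to its base-62 string."""
    if num == 0:
        return ALPHABET[0]
    digits = []
    while num > 0:
        num, remainder = divmod(num, 62)  # integer division and modulo in one step
        digits.append(ALPHABET[remainder])
    return ''.join(reversed(digits))      # most significant digit first

def decode(short):
    """Convert a base-62 string back to the integer id."""
    num = 0
    for ch in short:
        num = num * 62 + ALPHABET.index(ch)
    return num

short_code = encode(125)
record_id = decode('e9a')
```

Because encode and decode are inverses of each other, the database only needs to store the id and the long URL; the short code can be recomputed on demand.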
insertion sort
def insertion_sort(li):
    """
    The insertion sort always maintains a sorted sublist in the lower
    positions of the list. O(n^2)
    """
    for index in range(1, len(li)):
        current_value = li[index]
        position = index
        # shift larger items one slot to the right
        while position > 0 and li[position-1] > current_value:
            li[position] = li[position-1]
            position = position - 1
        li[position] = current_value
merge sort
def merge_sort(li):
    """
    The merge sort is a divide and conquer strategy as a way to improve the
    performance of sorting algorithms. It is a recursive algorithm that
    continually splits a list in half: split the list in halves, sort them,
    then merge them back together.
    O(n log n). It continually splits the list in halves (hence the log n),
    and merging into a list of size n requires n operations.
    This can be improved by keeping indices initialized to 0 instead of
    popping each value (pop(0) is itself O(n) on a Python list)
    """
    if len(li) < 2:
        return li
    result = []
    mid = len(li) // 2
    y = merge_sort(li[:mid])
    z = merge_sort(li[mid:])
    # repeatedly take the smaller front item of the two sorted halves
    while len(y) > 0 and len(z) > 0:
        if y[0] > z[0]:
            result.append(z[0])
            z.pop(0)
        else:
            result.append(y[0])
            y.pop(0)
    # one half is empty; append whatever remains of the other
    result += y
    result += z
    return result
quick sort
def quick_sort(li):
    """
    The quick sort uses divide and conquer to gain the same advantages as
    merge sort, while not using additional storage. It entails picking a
    pivot value (which usually is just the first item in the list). The role
    of the pivot value is to assist with splitting the list.
    O(n log n) on average. It makes O(n^2) comparisons in the worst case,
    although this is rare
    """
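The card gives only the docstring, so here is one possible body as a sketch: an in-place quick sort that uses the first item of each slice as the pivot, as the docstring suggests. The helper name partition and the extra first/last parameters are assumptions of this example, not part of the original card.

```python
def quick_sort(li, first=0, last=None):
    """Sort li in place, using the first item of each slice as the pivot."""
    if last is None:
        last = len(li) - 1
    if first < last:
        split = partition(li, first, last)
        quick_sort(li, first, split - 1)   # items left of the pivot
        quick_sort(li, split + 1, last)    # items right of the pivot

def partition(li, first, last):
    """Place the pivot li[first] at its final index and return that index."""
    pivot = li[first]
    left, right = first + 1, last
    while True:
        # advance left past items that belong on the pivot's left side
        while left <= right and li[left] <= pivot:
            left += 1
        # retreat right past items that belong on the pivot's right side
        while left <= right and li[right] >= pivot:
            right -= 1
        if left > right:
            break
        li[left], li[right] = li[right], li[left]
    li[first], li[right] = li[right], li[first]
    return right

data = [54, 26, 93, 17, 77, 31, 44, 55, 20]
quick_sort(data)
```

With a first-item pivot, an already sorted input triggers the O(n^2) worst case; picking a random or median-of-three pivot is a common way to make that unlikely.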
Unique string
def unique(s):
    """
    Return True if there are no duplicate elements in sequence s
    O(n^2)
    """
    for i in range(len(s)):
        for j in range(i+1, len(s)):
            if s[i] == s[j]:
                return False
    return True

def unique2(s):
    """
    Return True if there are no duplicate elements in sequence s.
    The built-in function sorted() produces a copy of the original list
    with elements in sorted order
    O(n log n)
    """
    temp = sorted(s)
    # after sorting, duplicates are adjacent, so compare neighbors in temp
    for i in range(1, len(temp)):
        if temp[i-1] == temp[i]:
            return False
    return True
Encode/Decode Unicode/Bytes
* Encode/Decode
  - You encode a unicode string to get a byte string
  - You decode a byte string to get a unicode string
  - Byte strings and unicode strings each have a method to convert to the other type of string. Unicode strings have an .encode() method that produces bytes, and byte strings have a .decode() method that produces unicode. Each takes an argument, which is the name of the encoding to use for the operation.
        >>> my_unicode = u"Hi \u2119\u01b4\u2602\u210c\xf8\u1f24"
        >>> len(my_unicode)
        9
        >>> my_utf8 = my_unicode.encode('utf-8')
        >>> len(my_utf8)
        19
        >>> my_utf8
        'Hi \xe2\x84\x99\xc6\xb4\xe2\x98\x82\xe2\x84\x8c\xc3\xb8\xe1\xbc\xa4'
        >>> my_utf8.decode('utf-8')
        u'Hi \u2119\u01b4\u2602\u210c\xf8\u1f24'
  - When encoding or decoding, you can specify what should happen when the codec can't handle the data. The default value is 'strict', which is why exceptions get raised
        >>> my_unicode.encode("ascii", "replace")
        'Hi ??????'
        >>> my_unicode.encode("ascii", "xmlcharrefreplace")
        'Hi &#8473;&#436;&#9730;&#8460;&#248;&#7972;'
        >>> my_unicode.encode("ascii", "ignore")
        'Hi '
How do computers store text
* Fact of Life #1
  - Computers need and store bytes. When you send your friend an email you're sending him bytes, not unicode or text
  - Files are not stored as text. Files (and, more broadly speaking, any I/O) are stored as bytes
Questions
* I see that you guys have a $60MM grant from Harvard and MIT, I assume this is your major source of funding? If so, when do you approximate that you will need another round of capital/funding? What is your burn rate?
* How big is the engineering team?
* How is your attrition rate? Look at VP of fb quote
* Would you ever license these MOOCs to different universities?
* "Companies go through many transitions over time, but the most transformative is the rapid switch from build phase to growth phase. Build phase is defined as that time before shipping. The team has just come together, united by a change-the-world vision. Everyone is ready to do whatever is needed to get version 1.0 - or perhaps version 0.4.2 alpha - into the hands of real people. But once people are using the product, a company immediately hits growth phase. The clarity of "ship version 1" splinters into the competing demands of new features, quality improvements, additional product lines, new markets, and company expansion - all focused on the need to grow revenue and users."
* A manager's basic responsibility is to turn talent into performance
* edX is a learning destination. We offer free courses from the best universities and over 1.5MM students worldwide take these courses (Anant Agarwal)
* I really view online courses as a new-age textbook, as a tool we can give teachers (Anant Agarwal)
* We want to reinvent and reimagine how the classroom is built
* Education has not really changed in the past 500 years, while healthcare, businesses, etc. have taken advantage of technology and made drastic changes
* edX's mission is comprised of 3 parts:
  * Increase education access to students around the world
  * Improve campus education by bringing online technologies to campus
  * Do research around learning
* edX promotes ACTIVE learning
  * One of the main benefits of this type of learning is access
  * Another benefit is improving the overall quality of learning. Students can rewind and watch as many times as they want (this is far different from a lecture-based course, where a student can lose concentration quickly)
  * Downside is there is no social aspect like a college campus. You can't collaborate or work in groups.
Implicit Conversion Unicode/Bytes
* Implicit Conversion
  - Python 2 tries to be helpful when working with unicode and byte strings. If you try to perform a string operation that combines a unicode string with a byte string, Python 2 will automatically decode the byte string (using the ASCII codec) to produce a second unicode string, then will complete the operation with the two unicode strings
        >>> u"Hello " + "world"
        u'Hello world'
        >>> u"Hello " + ("world".decode("ascii"))
        u'Hello world'
Performance Tips - Local Variables
* Local Variables
  - The final speed up for the for loop is to use local variables wherever possible
  - Python accesses local variables much more efficiently than global variables
  - Therefore define it as a function
        >>> def func():
        ...     upper = str.upper
        ...     newlist = []
        ...     append = newlist.append
        ...     for word in oldlist:
        ...         append(upper(word))
        ...     return newlist
  - At the time I originally wrote this I was using a 100MHz Pentium running BSDI.
  - I got the following times for converting the list of words in /usr/share/dict/words (38,470 words at that time) to upper case:

        Version                     Time (seconds)
        Basic loop                  3.47
        Eliminate dots              2.45
        Local variable & no dots    1.79
        Using map function          0.54
Performance Tips - Loops
* Loops
  - If the body of your loop is simple, the interpreter overhead of the for loop itself can be a substantial part of the total run time
  - This is where the map function is handy. You can think of map as a for moved into C code
  - The only restriction with map is that the "loop body" of map must be a function call
  - 4 examples that all do the same thing
        # traditional, slowest
        >>> new_list = []
        >>> for word in oldlist:
        ...     new_list.append(word.upper())

        # use map to push the loop from the interpreter into compiled C code
        # str.upper is not called (i.e. it's not str.upper()) because map supplies the argument
        >>> newlist = map(str.upper, oldlist)

        # list comprehension, just as fast as using map. probably best implementation
        >>> newlist = [word.upper() for word in oldlist]

        # or you can use a generator expression to get a generator object which can be iterated over bit-by-bit
        >>> iter = (word.upper() for word in oldlist)
        >>> iter.next()
        'THIS'
  - Which method you use depends on the characteristics of your data
Pro Tips for Unicode
* Pro Tip #1 - Unicode Sandwich:
  - As we saw with Fact of Life #1, the data coming into and going out of your program must be bytes. But you don't need to deal with bytes on the inside of your program.
  - The best strategy is to decode incoming bytes as soon as possible, producing unicode. You use unicode throughout your program, and then when outputting data, encode it back to bytes
  - For example Django kind of does this for you, it will decode all incoming bytes and encode all outgoing unicode
* Pro Tip #2 - Know what you have
  - At any point in your program, you need to know whether you have a byte string or a unicode string.
  - This shouldn't be a matter of guessing, it should be by design
  - If you have a byte string you need to know what encoding it is.
  - For example an exotic unicode string was encoded to UTF-8. So now it is a byte string. However when you decode that byte string back to unicode you could mistakenly use the wrong encoding.
        >>> my_unicode = u"Hi \u2119\u01b4\u2602\u210c\xf8\u1f24"
        >>> my_utf8 = my_unicode.encode('utf-8')
        >>> my_utf8.decode('utf-8')
        u'Hi \u2119\u01b4\u2602\u210c\xf8\u1f24'
        >>> my_utf8.decode('iso8859-1')
        u'Hi \xe2\x84\x99\xc6\xb4\xe2\x98\x82\xe2\x84\x8c\xc3\xb8\xe1\xbc\xa4'
  - This is a good demonstration of Fact of Life #4. The same stream of bytes is decodable using a number of different encodings. The bytes themselves don't indicate what encoding they use
* Pro Tip #3. Test Unicode
Python is not C
* Python is not C
  - For example:
        >>> x * 2     # slowest
        >>> x << 1    # mid
        >>> x + x     # fastest
  - however for the same corresponding C program those execution times are all the same
  - In C on all modern computer architectures, each of the 3 arithmetic operations is translated into a single machine instruction which executes in one cycle, so it does not matter which one you choose
  - In Python however we see that there is a significant advantage to adding a number to itself instead of multiplying it by two or shifting it left by one bit
Performance Tips - Sorting Lists
* Sorting Lists
  - Starting with Python 2.4, both list.sort() and sorted() accept a key parameter that specifies a function to be called on each list element prior to making comparisons
        >>> stream = "This is a test string from Andrew"
        >>> li = stream.split()
        >>> z = sorted(li, key=str.lower)
        >>> # or if you wanted to do it in place
        >>> li.sort(key=str.lower)
  - The value of the key parameter should be a function that takes a single argument and returns a key to use for sorting purposes.
  - This technique is fast because the key function is called exactly once for each input record
  - Python lists have a built-in sort() method that modifies the list in-place, and there is a sorted() built-in function that builds a new sorted list from an iterable
  - sorted() accepts any iterable while sort() only works on lists
Performance Tips - String Concatenation
* String Concatenation
  - In each pair below, the second solution is generally much faster
  - Avoid this. It can become very slow when building large strings:
        >>> s = ""
        >>> for substring in list:
        ...     s += substring
  - Instead use this:
        >>> s = "".join(list)
  - Similarly, if you are generating bits of a string sequentially, instead of doing this:
        >>> s = ""
        >>> for x in list:
        ...     s += some_function(x)
  - Use this:
        >>> slist = [some_function(elem) for elem in list]
        >>> s = "".join(slist)
  - Avoid:
        >>> out = "<html>" + head + prologue + query + tail + "</html>"
  - Instead use this:
        >>> out = "<html>%s%s%s%s</html>" % (head, prologue, query, tail)
Questions for Diana
* The edX Open Response Assessor (ORA) looked pretty interesting. Can you explain more about what that does? https://github.com/edx/edx-ora
* The ORA appears to be an entire project that is supposed to be run in celery tasks. Is that correct? Is this submission done asynchronously and is the user supposed to check back later?
* Is the XQueue its own instance in production?
* Do you guys use a combination of assessors to grade an individual's assignment (i.e. self, peer, instructor, and AI)?
* For edX ORA, what's the deal with 'pre-requirements.txt' and having numpy in there?
* What machine learning algorithms do you guys use?
Initializing an Array
* Two approaches to initializing an array
        In [1]: x = [[1,2,3,4]] * 3
        In [2]: x
        Out[2]: [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]
        In [3]: y = [[1,2,3,4] for _ in range(3)]
        In [4]: y
        Out[4]: [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]
        In [6]: x[0][3] = 99
        In [7]: x
        Out[7]: [[1, 2, 3, 99], [1, 2, 3, 99], [1, 2, 3, 99]]  # probably not what you were looking for!
        In [8]: y[0][3] = 99
        In [9]: y
        Out[9]: [[1, 2, 3, 99], [1, 2, 3, 4], [1, 2, 3, 4]]
  - In the first option the members of the array all point to the same object!
  - In the second option, we produce an array of 3 elements, each of which is itself an independent 4-element array
Unicode vs. Bytes
* Unicode vs Bytes - unicode string: uni_str = u"Hello" - byte string: byte_str = "Hello"
Performance Tips - xrange
* Use xrange instead of range (Python 2)
  - xrange saves gobs of memory; only one yielded object will exist at a time
  - when you call range, however, it creates a list containing that many number objects. All of those objects are created at once, and all of them exist at the same time. This can be a pain when the number of numbers is large
  - xrange, on the other hand, creates no numbers immediately, only the xrange object itself. Number objects are created only when you pull values from it, e.g. by looping through it
  - To see this in real action do the following:
        >>> import sys
        >>> xrange(sys.maxint)  # works fine
        >>> range(sys.maxint)   # python will lock up, it will be too busy allocating sys.maxint number objects (about 2.1 billion on a 32-bit build)
  - I got a memory error when running the old range
Unicode General
* Version 1 of Unicode started out with 65,536 code points
* Unicode makes room for all the characters we will ever need in the world. There are 1.1M code points in Unicode and we are currently only using about 10% of them
* UTF-8 is the most popular encoding for storage and transmission of Unicode. It uses a variable number of bytes for each code point. ASCII and iso-latin-1 are other encodings, but note that they can only represent a subset of Unicode
Questions for Cale
* What services do they use to monitor production server performance?
* Can you explain the production stack in a little more detail?
* I am interested in background architecture, scalability, and machine learning. What can edX offer me that will complement these areas and help me grow as an individual?
* What are the biggest bottlenecks/obstacles in your infrastructure right now?
* Do you guys use a caching layer w/slaves? Do you use RabbitMQ?
Logarithms and their Applications
- A logarithm is simply an inverse exponential function
- Logarithms arise in any process where things are repeatedly halved
- Logarithms and Binary Search
    - Binary search is a good example of an O(log n) algorithm
    - To locate a particular person p in a telephone book that contains n names, you start by comparing p against the middle, or (n/2)nd, name, say Monroe, Marilyn
    - Regardless of whether p belongs before this middle name or after it, after only one comparison you can discard one half of all the names in the book
    - The number of times n can be halved before only one name is left is, by definition, log_2 n. O(log n) algorithms are fast enough to be used on problem instances of essentially unlimited size
- Logarithms and Trees
    - A binary tree of height 1 can have up to 2 leaf nodes, while a tree of height two can have up to 4 leaves
    - What is the height h of a rooted binary tree with n leaf nodes? Note that the number of leaves doubles every time we increase the height by one. To account for n leaves, n = 2^h which implies h = log_2 n
    - If each node has d children instead of 2, then n = d^h which implies h = log_d n
- b^x = y is equivalent to x = log_b y
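The telephone-book search above can be sketched as an iterative binary search over a sorted list. The function name binary_search and the sample names are made up for this example.

```python
def binary_search(sorted_items, target):
    """Return the index of target in sorted_items, or -1 if it is absent."""
    low, high = 0, len(sorted_items) - 1
    while low <= high:
        mid = (low + high) // 2
        if sorted_items[mid] == target:
            return mid
        if sorted_items[mid] < target:
            low = mid + 1    # discard the lower half
        else:
            high = mid - 1   # discard the upper half
    return -1

names = ["Adams", "Garland", "Monroe", "Presley", "Taylor"]
found_at = binary_search(names, "Presley")
missing = binary_search(names, "Zed")
```

Each pass through the loop halves the remaining range, so a list of n names needs at most about log_2 n + 1 comparisons.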
Generator return statement
- A return statement inside a generator means that the generator should stop executing, without yielding anything more
- In Python 2, return may only be used without arguments inside a generator (Python 3.3 and later also allow a return value, which is attached to the StopIteration exception)
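A short sketch of a bare return ending a generator early. The function name first_n is an assumption for this example; the code runs on Python 2.7 and Python 3.

```python
def first_n(items, n):
    """Yield at most the first n items, then stop."""
    count = 0
    for item in items:
        if count == n:
            return        # bare return: the generator is finished
        yield item
        count += 1

taken = list(first_n("abcdef", 2))
```

Once the return is reached, any further call to next() on the generator raises StopIteration, which is how list() knows to stop collecting.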
Data Structure -- Arrays
- Advantages of arrays include:
    - Constant time access given the index. The index of each element maps directly to a particular memory address
    - Space Efficiency. Arrays consist purely of data so no space is wasted with links or other formatting info
    - Memory Locality. Physical continuity between successive data accesses helps exploit the high-speed "cache memory" on modern computer architectures
- Disadvantage is that we cannot adjust their size in the middle of a program's execution
Inheritance
- Programmers want to avoid typing the same code more than once.
- Example: a base 'Shape' class and a derived 'Rectangle' class that inherits a lot of its methods from the 'Shape' class
Generator vs Iterator vs Iterable
- An iterator is an object that manages a single pass over a sequence
- Both iterables and generators produce an iterator, allowing us to do "for record in iterable_or_generator" without worrying about the nitty gritty of keeping track of where we are in the stream, how to get to the next item, how to stop iterating, etc.
- Difference between an iterable and a generator:
    - once you've burned through a generator once, you're done, no more data
    - on the other hand, an iterable creates a new iterator every time it's looped over
    - you can go over the data stream more than once for an iterable
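A small sketch of the difference, comparing a generator with a list (a list is one kind of iterable). The name squares_gen is made up for this example.

```python
def squares_gen(n):
    """Generator: produces the squares 0, 1, 4, ... one at a time."""
    for i in range(n):
        yield i * i

gen = squares_gen(3)
gen_first_pass = list(gen)     # consumes the generator
gen_second_pass = list(gen)    # nothing left the second time

items = [0, 1, 4]              # a list hands out a fresh iterator per loop
list_first_pass = list(items)
list_second_pass = list(items)
```

The second pass over the generator comes back empty, while the list can be traversed as many times as needed.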
Algorithm Analysis
- Assessing algorithmic performance makes use of the "big Oh" notation, which proves essential for comparing algorithms and designing more efficient ones
- The RAM model of computation assumes:
    - Each simple operation (+, *, -, =, if, call) takes exactly one time step
    - Loops are the composition of many single-step operations
    - Each memory access takes exactly one time step
- Under the RAM model, we measure run time by counting up the number of steps an algorithm takes on a given problem instance
- Take-Home Lesson: Algorithms can be understood and studied in a language and machine independent way
What is the difference between a copy and a deep copy:
- Copy (i.e. a shallow copy) constructs a new compound object and then inserts references into it to the objects found in the original
- Deep Copy constructs a new compound object and then, recursively, inserts copies into it of the objects found in the original
- So, in other words, a shallow copy shares its nested objects with the original: if you mutate a nested object through A, the change is also visible through B (adding or removing top-level items affects only one of them)
- A deep copy is an entirely new object at every level, so changing A will not change B
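A short sketch using the standard library copy module on a nested list. The variable names are chosen for this example.

```python
import copy

original = [[1, 2], [3, 4]]
shallow = copy.copy(original)      # new outer list, same inner lists
deep = copy.deepcopy(original)     # new outer list and new inner lists

original[0].append(99)             # mutate a nested object
original.append([5, 6])            # change only the outer list
```

After these two changes, shallow sees the 99 (it shares the inner list) but not the new top-level [5, 6], while deep is unaffected by both.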
Encapsulation
- Different components of a software system should not reveal the internal details of their respective implementations. - i.e. which aspects of a data structure are supposed to be public and which are to be internal details - For example you can change the internal details of the property getters/setters and the end program should not mind
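One way to picture this in Python is with properties. The `Temperature` class below is a hypothetical example (Python 3 syntax): callers only see `celsius` and `fahrenheit`, so the internal storage could change without breaking them.

```python
class Temperature(object):
    """Hypothetical class: stores Celsius internally, exposes both scales."""

    def __init__(self, celsius):
        self._celsius = float(celsius)   # internal detail, by convention not public

    @property
    def celsius(self):
        return self._celsius

    @property
    def fahrenheit(self):
        # Computed on demand; callers cannot tell it is not stored
        return self._celsius * 9 / 5 + 32

    @fahrenheit.setter
    def fahrenheit(self, value):
        self._celsius = (value - 32) * 5 / 9

t = Temperature(100)
print(t.fahrenheit)    # 212.0
t.fahrenheit = 32
print(t.celsius)       # 0.0
```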
Deques
- Double-ended queues, or deques, can be useful when you need to remove elements in the order in which they were added
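A short sketch using `collections.deque` from the standard library (Python 3 syntax; the values are arbitrary). Items can be added or removed at either end.

```python
from collections import deque

q = deque()
q.append('a')          # add at the right end
q.append('b')
q.append('c')

first = q.popleft()    # remove from the left: the oldest item, 'a'
q.appendleft('z')      # add at the left end
last = q.pop()         # remove from the right: 'c'

print(first, last, list(q))   # a c ['z', 'b']
```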
Data Structures -- Queue
- First In First Out (FIFO). Fairest way to control waiting times for services - You want the container holding the jobs to be processed in FIFO order to minimize the maximum time spent waiting - Enqueue(x,q): Insert item x at the back of the queue q - Dequeue(q): Return (and remove) the front item from queue q
Using else with Exceptions
- In some cases it can be useful to have a block of code that is executed unless something bad happens:
    while True:
        try:
            x = input('Enter the first number: ')
            y = input('Enter the second number: ')
            value = x/y
            print 'x/y is', value
        except:
            print 'Invalid input. Please try again.'
        else:
            # runs only when the try block raised no exception
            break
What are some alternatives to CPython? When and why might you use them?
- Jython - A Python implementation written in Java that utilizes the Java Virtual Machine (JVM) - Jython produces Java bytecode to run on the JVM - IronPython - written in C# and targeting the .NET stack - IronPython runs on Microsoft's Common Language Runtime (CLR) - PyPy - Speed. Thanks to its Just-In-Time (JIT) compiler, Python programs often run faster on PyPy. I also think that it adapts and continually improves the optimizations - Memory usage. Programs might take up less space in PyPy than they do in CPython - Compatibility. It is compatible with a lot of existing Python code. However, I think it breaks part of Django - Sandboxing. PyPy provides the ability to run untrusted code in a sandbox - Stackless mode. PyPy comes by default with support for stackless mode, providing microthreads and massive concurrency
Exploring Modules
- Knowing how to explore what's in a module is a valuable skill because you will encounter many useful modules in your lifetime - Being able to grok[1] them quickly and easily will make your programming much more enjoyable [1] hackerspeak meaning "to understand fully", taken from Robert A. Heinlein's novel Stranger in a Strange Land
What are lambda expressions
- Lambda expressions are a shorthand technique for creating single-line, anonymous functions - They're inline - Can lead to more readable and pythonic code >>> import math >>> square_root = lambda x: math.sqrt(x) >>> square_root(9) 3.0 >>> lamb = lambda x: x ** 3 >>> print lamb(5) 125 - lambda functions are good for simple one-off functions: a function that is going to be used only once - the body of a lambda can contain only a single expression - the lambda definition does not include a return statement -- it always contains an expression which is returned - lambda is an expression, not a statement. Because of this, a lambda can appear in places a def is not allowed. For example, places like inside a list literal, or a function call's arguments >>> death = [ ('James', 'Dean', 24), ('Jimi', 'Hendrix', 27), ('George', 'Gershwin', 38), ] >>> sorted(death, key=lambda age: age[2]) [('James', 'Dean', 24), ('Jimi', 'Hendrix', 27), ('George', 'Gershwin', 38)]
Data Structures -- Stack
- Last In First Out (LIFO). They are very efficient - Push(x,s): Insert item x at the top of the stack s - Pop(s): Return (and remove) the top item of the stack s
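A plain Python list already behaves like a stack. The helper names `push` and `pop` below simply mirror the Push(x,s) and Pop(s) operations on this card; they are illustrative, not a standard API (Python 3 syntax).

```python
def push(x, s):
    """Insert item x at the top of stack s (the end of the list)."""
    s.append(x)

def pop(s):
    """Return and remove the top item of stack s."""
    return s.pop()

stack = []
push(1, stack)
push(2, stack)
push(3, stack)

top = pop(stack)       # the last item pushed comes out first
print(top, stack)      # 3 [1, 2]
```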
Chief among the principles of the object-oriented approach are the following:
- Modularity - Abstraction - Encapsulation - Polymorphism - Inheritance
Generator
- One of the true powers of iterating over sequences (i.e. generators) is in saving memory - A generator was brought into existence for simplification of code and pythonic expression. It allows developers to not worry about a lot of boilerplate code. For example, the __iter__() and next() methods are automatically included in them and that's what makes them so compact and pythonic. You could of course write an iterable class that adopts the generator pattern and overrides __iter__() and next(), which does the exact same thing - A generator is a kind of iterator that is defined with normal function syntax - written like a normal function but uses the yield statement whenever it wants to return data. Generator functions return/create the values on the fly - instead of returning one value (like you do with return), you can yield several values one at a time. It is a lazy, on-demand generation of values. Each time a value is yielded, the function freezes; that is, it stops its execution at exactly that point and waits to be reawakened. When it is, it resumes execution at the point where it stopped - Like iterators, we can get the next value of a generator using next() ('for' gets values by calling next() implicitly) - one (small) downside with generators is that they're slightly slower to iterate over. They need to calculate each value before returning it, whereas a prebuilt list already has all the values computed
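The freeze-and-resume behaviour can be sketched as follows (Python 3 syntax; `squares` is a made-up generator function).

```python
def squares(limit):
    """Yield n*n for n = 0 .. limit-1, one value per request."""
    for n in range(limit):
        yield n * n      # execution pauses here until the next value is requested

gen = squares(4)
first = next(gen)        # runs the body up to the first yield
second = next(gen)       # resumes right after the yield
rest = list(gen)         # drains whatever is left

print(first, second, rest)   # 0 1 [4, 9]
```

Only one value exists at a time, which is where the memory saving comes from.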
Differences between Python 2.0 and 3.0
- Python 3.0 uses the concept of text and (binary) data instead of Unicode strings and 8-bit strings in Python 2.0 - Python 3.0 print statement is now a function, i.e. print() - Python 3.0 xrange() no longer exists; range() now behaves like xrange() - Python 3.0 has some API changes. Examples include zip(), map(), and filter(), which now return iterators instead of lists - Python 3.0 dict.iterkeys(), dict.iteritems(), and dict.itervalues() are no longer supported - Major differences in how unicode strings are handled
Sequence and Mapping Protocol
- Sequences and mappings are basically collections of `items` - To implement their basic behavior (protocol), you need 2 magic methods if your objects are immutable, or 4 if they are mutable: __len__(self): Should return the number of items contained in the collection. If __len__ returns 0 (and you don't implement __nonzero__, which overrides this behavior) the object is treated as False __getitem__(self, key): Should return the value corresponding to the given key __setitem__(self, key, value): Should store value in a manner associated with key so it can later be retrieved with __getitem__ __delitem__(self, key): Should delete the element associated with the key
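A minimal sketch of the immutable case (two methods), written for Python 3, where the truth-value hook is named `__bool__` rather than `__nonzero__`. The `Countdown` class is hypothetical; a mutable collection would add `__setitem__` and `__delitem__` in the same style.

```python
class Countdown(object):
    """Hypothetical immutable sequence holding start, start-1, ..., 1."""

    def __init__(self, start):
        self._start = start

    def __len__(self):
        return self._start

    def __getitem__(self, index):
        if not 0 <= index < self._start:
            raise IndexError(index)     # also tells iteration where to stop
        return self._start - index

c = Countdown(3)
print(len(c), c[0], list(c))    # 3 3 [3, 2, 1]
print(bool(Countdown(0)))       # False, because __len__ returns 0
```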
Generator Expressions
- Some simple generators can be coded succinctly as expressions using a syntax similar to list comprehensions but with parentheses instead of brackets. - Examples: > sum(i*i for i in range(10))
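For comparison, here is the same computation written both ways (Python 3 syntax; variable names are illustrative).

```python
squares_list = [i * i for i in range(10)]   # list comprehension: builds the whole list
squares_gen = (i * i for i in range(10))    # generator expression: values made on demand

total_from_list = sum(squares_list)
total_from_gen = sum(squares_gen)

# When the expression is the only argument, the extra parentheses can be dropped
total_inline = sum(i * i for i in range(10))

print(total_from_list, total_from_gen, total_inline)   # 285 285 285
```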
Big-Oh Notation
- The Big-O notation ignores the difference between multiplicative constants. - The functions f(n) = 2n and g(n) = n are identical in Big Oh analysis - Take-Home Lesson: The Big Oh notation and worst-case analysis are tools that greatly simplify our ability to compare the efficiency of algorithms
What are some drawbacks to Python
- The Global Interpreter Lock (GIL). CPython is not fully thread safe. CPython provides a global lock that must be held by the current thread before it can safely access Python objects. As a result, only one thread executes Python bytecode at a time, so CPU-bound threads do not run in parallel - PyPy on the other hand provides a stackless mode that supports micro-threads for massive concurrency - Execution speed. Python at times can be slower than compiled languages * Is Python interpreted or compiled? - In the case of CPython, the answer is kind of both - For CPython, code is first compiled then interpreted. More precisely, it is not precompiled to native machine code, but rather to bytecode. - While machine code is certainly faster, bytecode is more portable and secure. - The bytecode is then interpreted
*.pyc files
- The file with the .pyc extension is a platform-independent processed ("compiled") Python file that has been translated to a format that Python can handle more efficiently. - If you import the same module later, Python will import the .pyc file rather than the .py file, unless the .py file has changed; in that case, a new .pyc file is generated - Deleting the .pyc file does no harm -- a new one is created as needed
What is your approach to Unit testing?
- The most fundamental answer to this question centers around Python's 'unittest' testing framework. - key elements of the unittest framework are test fixtures, test cases, test suites, and test runners
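The four elements can be seen together in a small sketch (Python 3 syntax). The function under test, `add`, and the test class are made up for illustration.

```python
import unittest

def add(a, b):
    """Hypothetical function under test."""
    return a + b

class AddTest(unittest.TestCase):          # a test case
    def setUp(self):                       # a test fixture: runs before each test
        self.pairs = [(1, 2, 3), (0, 0, 0), (-1, 1, 0)]

    def test_known_sums(self):
        for a, b, expected in self.pairs:
            self.assertEqual(add(a, b), expected)

    def test_rejects_none(self):
        with self.assertRaises(TypeError):
            add(1, None)

# A test suite collects test cases; a test runner executes them and reports
suite = unittest.TestLoader().loadTestsFromTestCase(AddTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

In a test file on its own, calling `unittest.main()` is the more common way to run everything.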
Performance Tips - Doing Stuff Less Often
- The Python interpreter performs some periodic checks - In particular it decides whether or not to let another thread run and whether or not to run a signal or pending call - Most of the time there is nothing to do, so performing these checks each pass around the interpreter can slow things down - If you are not running threads and you don't expect to be catching many signals, you can raise the check interval with setcheckinterval to tell the interpreter how often this should be checked - it is found in the sys module and defaults to 100 - setting this to a larger value can improve the interpreter's performance, sometimes substantially
Python Tuple
- Think of a tuple as a list that you can't change. They are immutable > mytuple = (1,2,3,4) - You might be asking, why do we need tuples if we have lists? - The answer is that tuples are used internally in Python in a lot of places - One of the basic differences is that dictionaries cannot use a list as a key, but they can use a tuple - Tuples are great for performance because of their immutability. Python will know exactly how much memory to allocate for the data to be stored - When to use tuples: - when you need to store data that doesn't have to change (fixed data collection) - when the performance of your application is very important
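The dictionary-key difference is easy to check directly. A small sketch in Python 3 syntax; the coordinates and city names are arbitrary sample data.

```python
locations = {}
locations[(40.7, -74.0)] = "New York"    # a tuple is hashable, so it works as a key

try:
    locations[[51.5, -0.1]] = "London"   # a list is mutable and unhashable
except TypeError as exc:
    error = str(exc)

print(locations[(40.7, -74.0)])   # New York
print(error)                      # unhashable type: 'list'
```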
String pattern matching
- This is the classic find command available in any web browser or text editor - Problem: Substring Pattern Matching - Input: A text string t and a pattern string p - Output: Does t contain the pattern p as a substring, and if so where? - The naive algorithm runs in O(nm), where n is the length of t and m is the length of p
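The O(nm) bound corresponds to the straightforward approach of trying every starting position in the text. A sketch in Python 3 syntax; the function name `find_match` is chosen for illustration, and in practice `t.find(p)` does this job.

```python
def find_match(p, t):
    """Return the index of the first occurrence of pattern p in text t, or -1."""
    m, n = len(p), len(t)
    for i in range(n - m + 1):           # up to n candidate start positions
        j = 0
        while j < m and t[i + j] == p[j]:
            j += 1                       # up to m comparisons per position
        if j == m:
            return i                     # every pattern character matched
    return -1

print(find_match("aba", "cabab"))   # 1
print(find_match("x", "abc"))       # -1
```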
Thoughts on Object Oriented Design
- Use this as a guideline for building and designing classes: 1. Write down a description of your problem (what should the program do?). Underline all the nouns, verbs, and adjectives 2. Go through the nouns, looking for potential classes 3. Go through the verbs, looking for potential methods 4. Go through the adjectives, looking for potential attributes 5. Allocate methods and attributes to your classes
Differences between Java and Python
- Well, that's basically comparing apples to oranges. We could talk for days about that - Java differences - Java is restricted to static typing - A static method in Java does not translate to a Python class method - Java has method overloading. Same method name but accepts different parameters - Single and double quotes have different meanings in Java (char vs String) - Java encourages getters and setters - Classes are not optional. All Java functions need to reside in a class - Indentation does not matter - Python differences - Python is dynamically typed - Calling a class method involves additional memory allocation that calling a static method or function does not - No explicit method overloading. However, you can achieve it using optional keyword arguments - No difference between single and double quotes - It's un-Pythonic to use getters/setters. Use properties instead - Classes are optional - Indentation matters
Iterators
- You can iterate over other objects than just lists and dicts. They just need to implement the __iter__ method - The __iter__ method returns an iterator, which is any object with a method called `next`, which is callable without any arguments - If the iterator has no more values it raises a StopIteration exception - The `iterator` is the object on which next is called. The `iterable` is the stream of values - Tip: the built-in function `iter` can be used to get an iterator from an iterable object: >>> it = iter([1,2,3]) >>> it.next() 1 >>> it.next() 2 >>> it.next() 3 >>> it.next() Traceback (most recent call last): ... StopIteration
Polymorphism
- You can use the same operations on objects of different classes, and they will work as if by magic - A little confusing but basically the methods of an object do the magic. So for example a list has list.count('e') to see how many times 'e' is found in the list. However the string object also has that method. str.count('e') also returns the number of times 'e' is found in the string. Different objects but they have the same method and can be called - Another example is the + operator. 1+2=3. But also "Fish" + "Food" = "FishFood" - Polymorphism is a good thing. With polymorphism you should NOT do type checking (e.g. issubclass, isinstance, type). This is how you destroy polymorphism. Because with type checking you need to do isinstance(object, tuple). But what if down the line the object is a list. OK now you'll have two checks isinstance(object, tuple) and isinstance(object, list). OK but what if a third object gets introduced but you've already shipped your product to the client. Won't be able to change it now. - Polymorphism enables you to call the methods of an object without knowing its class (type of object)
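The `count` example above can be written as one function that never checks the type of its argument. A sketch in Python 3 syntax; `count_e` is a made-up helper.

```python
def count_e(container):
    """Works for any object that has a count() method, whatever its class."""
    return container.count('e')

print(count_e(['e', 'x', 'e']))   # 2  (list)
print(count_e('eleven'))          # 3  (str)
print(count_e(('a', 'e')))        # 1  (tuple, with no change to count_e)
```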
Using __all__
- Rather than filtering dir() with a list comprehension, you should actually use __all__ >> copy.__all__ ['Error', 'copy', 'deepcopy'] - __all__ = ["Error", "copy", "deepcopy"] # this is written in the copy.py module - It defines the public interface of the module. More specifically, it tells the interpreter what it means to import all the names from this module. - So if you did this "from copy import *" you get only the three names listed in the __all__ variable - Setting __all__ like this is a useful technique when writing modules. Expose only what the developer should use
Why use function decorators
- a decorator is essentially a callable that is used to modify or extend a function or class definition - A big plus of decorators is that one decorator can be applied to multiple functions (or classes) - It can remove a lot of boilerplate code
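A minimal sketch of one decorator reused on two functions (Python 3 syntax). The decorator `count_calls` and the decorated functions are hypothetical; `functools.wraps` is the standard helper that preserves the wrapped function's name and docstring.

```python
import functools

def count_calls(func):
    """Decorator that records how many times func has been called."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)
    wrapper.calls = 0
    return wrapper

@count_calls
def square(x):
    return x * x

@count_calls
def greet(name):
    return "Hello, " + name

square(3)
square(4)
greet("Ada")
print(square.calls, greet.calls)   # 2 1
print(square.__name__)             # square
```

The counting logic lives in one place instead of being repeated inside every function.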
Python List
- a list can contain objects of any data type - you can delete an item by index using del mylist[2] - you can sort a list using mylist.sort() (sort works in place and returns None, i.e. just call it like mylist.sort(), not z = mylist.sort()) - you can slice lists using a colon, i.e. mylist[2:4] - When to use lists - when you need a mixed collection of data all in one place - when the data needs to be ordered - when your data requires the ability to be changed or extended - when you don't require data to be indexed by a custom value (i.e. like a dict). Lists are numerically indexed and to retrieve an element, you must know its numeric position in the list - When you need a stack or queue. Lists can be easily manipulated by appending/removing elements at the end of the list (removing from the beginning also works, but is O(n))
The class namespace
- all code in the `class` statement is executed in a special namespace -- the `class namespace` - For example you are not restricted to def statements inside the class definition block: >>> class C: print 'Class C being defined...' Class C being defined... >>>
Breakdown of different powers
- all such algorithms take roughly the same time for n = 10 - Any algorithm with n! running time becomes useless for n >= 20 - Algorithms whose running time is 2^n become impractical for n >= 40 - Quadratic-time algorithms whose running time is n^2 remain usable up to about n = 10,000. They are hopeless for n > 1,000,000 - Linear time (i.e. n) and n log n algorithms remain practical on inputs of one billion items - An O(log n) algorithm hardly breaks a sweat for any imaginable value of n - Algorithms in order of decreasing running time (i.e. slowest to fastest): n! >> 2^n >> n^3 >> n^2 >> n log n >> n >> log n >> 1
Multiple Inheritance
- class TalkingCalculator(Calculator, Talker): pass - However you need to be careful with multiple inheritance. If a method is implemented differently by two or more of the superclasses (that is, you have 2 different methods with the same name) you must be careful about the order of these superclasses (in the class statement) - The methods in the earlier classes override the methods in the later ones. - So if the Calculator class in the preceding example had a method called `talk`, it would override (and make inaccessible) the `talk` method of the Talker. - If the superclasses share a common superclass then Python uses a rather complicated Method Resolution Order (MRO) to look for a given attribute - Tip: Go easy on multiple inheritance. It can make things unnecessarily complex, difficult to get right, and even hard to debug
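The override order can be checked with a small sketch (Python 3 syntax). The bodies of `Calculator` and `Talker` below are invented for illustration, since the card only names the classes.

```python
class Calculator(object):
    def add(self, a, b):
        self.value = a + b

    def talk(self):                 # clashes with Talker.talk on purpose
        return "Calculator talking"

class Talker(object):
    def talk(self):
        return "Hi, my value is %s" % self.value

class TalkingCalculator(Calculator, Talker):   # Calculator is listed first
    pass

tc = TalkingCalculator()
tc.add(1, 2)
print(tc.talk())    # Calculator talking  (the earlier superclass wins)

# The lookup order Python uses, as a list of class names
mro_names = [cls.__name__ for cls in TalkingCalculator.__mro__]
print(mro_names)    # ['TalkingCalculator', 'Calculator', 'Talker', 'object']
```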
Heaps
- import heapq - a kind of priority queue - lets you add objects in an arbitrary order, and at any time, find (and possibly remove) the smallest element - It does so much more efficiently than, say, using min on a list
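A short sketch of the basic `heapq` calls (Python 3 syntax; the numbers are arbitrary). The heap is stored in an ordinary list.

```python
import heapq

heap = []
for value in [5, 1, 8, 3]:
    heapq.heappush(heap, value)     # add items in arbitrary order

smallest = heapq.heappop(heap)      # remove and return the smallest item
peek = heap[0]                      # the smallest remaining item, without removing it

data = [9, 4, 7, 1]
heapq.heapify(data)                 # turn an existing list into a heap in place
two_smallest = heapq.nsmallest(2, data)

print(smallest, peek, two_smallest)   # 1 3 [1, 4]
```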
Python Dict
- key/value data structure - the keys can be any immutable type and must be unique - if the keys are strings, you can use the following keyword expression: > dict(a=1, e=2, i=3, o=4, u=5) {'i': 3, 'u': 5, 'e': 2, 'a': 1, 'o': 4} - When to use dicts - when you need a logical association between a key:value pair - when you need a fast lookup for your data, based on a custom key - when your data is being constantly modified
Python Set
- only stores unique items - sets are mutable - a set can contain objects of any hashable data type - a set is unordered > s = set() > s = set([1,1,2,3,4]) > print s set([1,2,3,4]) > 1 in s True > another_set = set([1,2,'hello']) > s.intersection(another_set) set([1,2]) > set1 = set([1, 2, 3]) > set2 = set([3, 4, 5]) > set1 | set2 #union set([1, 2, 3, 4, 5]) > set1 & set2 #intersection set([3]) > set1 - set2 #difference set([1, 2]) > set1 ^ set2 #symmetric difference (elements that are in either the first set or the second, but not in both) set([1, 2, 4, 5]) - the way a set detects whether a clash between non-unique elements has occurred is by indexing the data in memory, creating a hash for each element. - this means that all elements in a set must be hashable (so an item cannot be a list or dict) - When to use sets: - you need unique data - when your data constantly changes; sets, just like lists, are mutable - when you need a collection that can be manipulated mathematically: with sets it's easy to do operations like difference, union, intersection, etc.
Modularity
- refers to an organizing principle in which different components of a software system are divided into separate functional units. - Real world example: A house can be viewed as several interacting units: electrical, heating and cooling, plumbing, and structural, rather than as one giant jumble of wires, vents, and pipes - Modularity helps enable software reusability, easier testing/debugging of separate components, and isolation
Modules of the standard library
- sys - os - fileinput - sets, heapq, deque - time - random - math - re
Private Attributes
- Keep in mind that Python does not support truly private attributes. - However, using a little trickery you can make a method or attribute private (inaccessible from the outside), simply by starting its name with two underscores - Inside a class definition, all names beginning with a double underscore are "translated" by adding a single underscore and the class name to the beginning - So you can still get access to the private name through the following: >>> p = Person() >>> p._Person__inaccessible() Can you see me? - You should not rely on this, though: the name-mangling is a pretty strong signal that these attributes should not be accessed from outside!
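The card uses a `Person` class without showing it. The definition below is an assumed reconstruction for illustration only (Python 3 syntax, returning the string instead of printing it).

```python
class Person(object):
    def __inaccessible(self):            # stored on the class as _Person__inaccessible
        return "Can you see me?"

    def accessible(self):
        # Inside the class body the double-underscore name works as written
        return "The secret message is: " + self.__inaccessible()

p = Person()

try:
    p.__inaccessible()                   # no mangling happens outside the class
    blocked = False
except AttributeError:
    blocked = True

print(blocked)                           # True
print(p._Person__inaccessible())         # Can you see me?
print(p.accessible())                    # The secret message is: Can you see me?
```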
Interfaces and Introspection
>>> callable(getattr(p, 'get_name')) True - If you want to see all the values stored in an object you can examine its __dict__ attribute - And if you `really` want to find out what an object is made of, you should take a look at the `inspect` module
Getting help with help
>>> help(copy.copy) Help on function copy in module copy: copy(x) Shallow copy operation on arbitrary Python objects. See the module's __doc__ string for more info.
Using dir
>>> import copy >>> [n for n in dir(copy) if not n.startswith('_')] ['Error', 'PyStringMap', 'copy', 'deepcopy', 'dispatch_table', 'error', 'name', 't', 'weakref']
Bubble Sort
def bubble_sort(li):
    """
    The bubble sort makes multiple passes through a list. It compares
    adjacent items and exchanges those that are out of order. Each pass
    through the list places the next largest value in its proper place.
    In essence, each item "bubbles" up to the location where it belongs.

    A bubble sort is considered the most inefficient method since it must
    exchange items before the final location is known. These "wasted"
    exchange operations are very costly.

    O(n^2). Sorts the list in place.
    """
    for num in range(len(li) - 1, 0, -1):
        # after each pass, li[num] holds its final value
        for i in range(num):
            if li[i] > li[i+1]:
                li[i], li[i+1] = li[i+1], li[i]
flatten nested list using yield and recursion
def flatten(nested):
    """
    This will flatten a set of lists arbitrarily deeply

    >>> list(flatten([[[1],2],3,4,[5,[6,7]],8]))
    [1, 2, 3, 4, 5, 6, 7, 8]

    However, there is a problem with this. If nested is a string-like
    object it will be treated as an iterable and we don't want to iterate
    over it. We want to keep string-like objects as atomic, not as
    sequences that should be flattened. Also this would lead to infinite
    recursion because the first element of a string is another string of
    length one, and the first element of that string is the string itself
    """
    try:
        for sublist in nested:
            for element in flatten(sublist):
                yield element
    except TypeError:
        # not iterable, so it is a single element
        yield nested

def flatten_improved(nested):
    try:
        # Don't iterate over string-like objects
        try:
            nested + ''
        except TypeError:
            pass            # not string-like, carry on and iterate
        else:
            raise TypeError # string-like, treat as a single element
        for sublist in nested:
            for element in flatten_improved(sublist):
                yield element
    except TypeError:
        yield nested
why is ''.join(list) good?
chars = ['s','a','f','e']
name = ''.join(chars)
print name

name = ''
for c in chars:
    name += c
print name

- The join method, called on a string and passed a list of strings, takes time linear in the total length of the result - Repeatedly appending to a string using '+' can take quadratic time, because strings are immutable and each append may copy everything built so far
Example class of Iterator
class Fibs(object):
    """
    In formal terms an object that implements the __iter__ method is the
    iterable, and the object that implements next is the iterator
    """
    def __init__(self):
        self.a = 0
        self.b = 1

    def next(self):
        # this computes the actual value and returns it
        self.a, self.b = self.b, self.a + self.b
        return self.a

    def __iter__(self):
        # this returns the iterator itself
        return self
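For reference, a sketch of the same class adapted to Python 3, where the iterator method is named `__next__` instead of `next`. The alias line and the use of `itertools.islice` to stop the endless sequence are additions for illustration.

```python
from itertools import islice

class Fibs(object):
    def __init__(self):
        self.a = 0
        self.b = 1

    def __next__(self):
        # Python 3 name for the iterator method
        self.a, self.b = self.b, self.a + self.b
        return self.a

    next = __next__      # alias so the Python 2 spelling keeps working

    def __iter__(self):
        return self

# The sequence never raises StopIteration, so take a bounded slice of it
first_ten = list(islice(Fibs(), 10))
print(first_ten)   # [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
```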
Constructor super
class SongBird(Bird):
    def __init__(self):
        # run the Bird constructor first so inherited state is set up
        super(SongBird, self).__init__()
        self.sound = 'Squawk'

    def sing(self):
        print self.sound
Graph Class
graph = {
    "a" : ["c"],
    "b" : ["c", "e"],
    "c" : ["a", "b", "c", "d", "e"],
    "d" : ["c"],
    "e" : ["c", "b"],
    "f" : []
}

class Graph(object):
    def __init__(self, graph_dict=None):
        # use None rather than {} so instances never share one default dict
        if graph_dict is None:
            graph_dict = {}
        self.graph_dict = graph_dict

    def nodes(self):
        """ return all nodes of graph """
        return self.graph_dict.keys()

    def edges(self):
        """ generate all edges of graph """
        edges = []
        for node, neighbors in self.graph_dict.iteritems():
            for neighbor in neighbors:
                edges.append((node, neighbor))
        return edges

    def find_isolated_nodes(self):
        """ return all nodes that have no neighbors """
        isolated_nodes = []
        for node, neighbors in self.graph_dict.iteritems():
            if not neighbors:
                isolated_nodes.append(node)
        return isolated_nodes

    def find_path(self, start, end, path=[]):
        """ finds a path from start node to end node """
        # path + [start] builds a new list, so the default is never mutated
        path = path + [start]
        if start == end:
            return path
        if start not in self.graph_dict:
            return None
        for node in self.graph_dict[start]:
            if node not in path:
                new_path = self.find_path(node, end, path)
                if new_path:
                    return new_path
        return None

    def find_all_paths(self, start, end, path=[]):
        """ finds all paths from start node to end node """
        path = path + [start]
        if start == end:
            return [path]
        if start not in self.graph_dict:
            return []
        paths = []
        for node in self.graph_dict[start]:
            if node not in path:
                new_paths = self.find_all_paths(node, end, path)
                for new_path in new_paths:
                    paths.append(new_path)
        return paths

    def node_degree(self, node):
        """
        finds the number of edges connected to a single node
        with loops counted twice
        """
        edges = self.graph_dict[node]
        return len(edges) + edges.count(node)