MCDB170 MT

big O notation and timing

-*big O notation* (Landau's symbol): describes how the running time or memory use of an algorithm grows as the size of its input GROWS (goes to infinity) --> describes the *asymptotic behavior of the algorithm*
-O(1): order of 1; constant time no matter what the input size is
-O(N): order of N; time to finish the algorithm is directly proportional to the size of the input data (typical for 1 FOR loop)
-O(N^2): order of N^2; typical for *nested* FOR loops
-*%timeit* (an IPython/Jupyter magic command, not a regular fxn): measures the *average execution time* by running the statement multiple times (variables assigned inside the timed statement are not kept afterwards)
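A minimal sketch of the growth rates above, counting loop steps instead of relying on wall-clock time. The fxn names (`constant_step`, `linear_steps`, `quadratic_steps`) are made up for illustration, and `timeit.timeit` from the standard library stands in for the notebook-only `%timeit`.

```python
import timeit


def constant_step(data):
    """O(1): one operation regardless of len(data)."""
    return data[0]


def linear_steps(data):
    """O(N): a single FOR loop, one step per element."""
    steps = 0
    for _ in data:
        steps += 1
    return steps


def quadratic_steps(data):
    """O(N^2): nested FOR loops, N steps for each of the N elements."""
    steps = 0
    for _ in data:
        for _ in data:
            steps += 1
    return steps


small = list(range(100))
large = list(range(200))  # twice the input size

# Doubling the input doubles the O(N) work and quadruples the O(N^2) work.
print(linear_steps(small), linear_steps(large))
print(quadratic_steps(small), quadratic_steps(large))

# Plain-Python counterpart of %timeit: run the call 20 times and average.
elapsed = timeit.timeit(lambda: quadratic_steps(small), number=20)
print(f"average per call: {elapsed / 20:.6f} s")
```

The measured time depends on the machine, so only the step counts are meant to be compared exactly.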

IF statement

-colon (:) after the condition, which should be a boolean *expression*
-INDENT in front of the statement(s); python uses *indentation to indicate a block of code*
-*PASS statement*: use "pass" inside the IF block as a placeholder, so an otherwise empty block does not generate an error
single line syntaxes of IF statement:
I) IF/ELIF/ELSE
> if <bool expression>: <statement>; <statement>; ...
> elif <same as above>
> else: <statement>; <statement>; ...
II) conditional expression (cannot use ELIF!)
> <expression> if <conditional bool expression> else <expression>
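A short sketch of the three forms above. The fxn names and the GC thresholds (0.4, 0.6) are arbitrary choices for illustration, not values from the course.

```python
def classify_gc(gc_fraction):
    """IF/ELIF/ELSE: exactly one of the indented blocks runs."""
    if gc_fraction > 0.6:
        label = "GC-rich"
    elif gc_fraction < 0.4:
        label = "AT-rich"
    else:
        label = "balanced"
    return label


def parity(n):
    """Conditional expression: one line, two branches, no ELIF possible."""
    return "even" if n % 2 == 0 else "odd"


def ignore_negative(n):
    """PASS fills a block that is syntactically required but does nothing."""
    if n < 0:
        pass  # placeholder; removing this line would be a syntax error
    return n


print(classify_gc(0.7), classify_gc(0.5), classify_gc(0.2))
print(parity(4), parity(7))
```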

jupyterlab basics

-common formats: images (-/=: zoom in/out; [ ]: rotate left/right; 0: reset)
-notebooks have command mode + edit mode
-*cells keep code/data in "containers"*
-command mode shortcuts:
-a: add cell above, b: add cell below
-c: copy cell, v: paste cell, x: cut cell
-dd: delete cell, z: undo, shift-z: redo
-y: change cell format to code, m: change to markdown format
-shift-enter: run cell (output will be below)
-00 (press 0 twice): restart kernel (in command mode)

functions

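-define fxns and use them to avoid repetition (don't want to write the same code over and over again)
-*def function_name(parameter(s) separated by commas):*
-parameters = the names listed in the definition; arguments = the values passed in when the fxn is called (the two terms are often used interchangeably)
-note: if you assign (=) a value to a parameter in the definition, then that value is its default value
-note: passing arguments by the NAME of the fxn parameters (keyword arguments) *ignores the order in the definition*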
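A small example of a default value and of keyword arguments. The fxn `describe_gene` and its parameters are hypothetical, chosen only to show the syntax.

```python
def describe_gene(name, length, organism="human"):
    """organism has a default value, so it can be left out of the call."""
    return f"{name} ({organism}): {length} bp"


# Positional arguments: the order must match the definition.
by_position = describe_gene("HOXA1", 2500)

# Keyword arguments: the order in the definition no longer matters.
by_name = describe_gene(length=2500, organism="fly", name="lab")

print(by_position)
print(by_name)
```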

data structures and algorithms

-how the data is organized affects how you manipulate the data --> changing the organization changes the manipulation/way to design the flow
-*organization (DATA STRUCTURES) ALWAYS goes together w manipulation (ALGORITHMS) of data*
-*2 major categories of common patterns in programming*: data structures + algorithms
-the choice of data structure affects the IMPLEMENTATION of algorithms + their processing time
I) ARRAYS: linear data structure (elements stored side by side in memory)
-good for SPEED/finding data; easy to *search*/modify a random piece of data
-bad/slow for adding/removing components
-fast for search, slow for resizing
-ex. mostly used in biological analysis to boost speed
II) LINKED LIST: elements sit at more or less random memory locations, each one linked to the next
-SLOW to find data
-FAST for adding new data in memory
-slow for search, fast for resizing
-ex. if you want to store a lot of data + don't need to modify each entry often (game development)
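A rough sketch of the trade-off above. The `Node` and `LinkedList` classes are hypothetical teaching code, not a library API; a python list plays the role of the array here, since it supports direct access by index.

```python
class Node:
    """One element of a linked list: a value plus a reference to the next node."""

    def __init__(self, value, next_node=None):
        self.value = value
        self.next_node = next_node


class LinkedList:
    def __init__(self):
        self.head = None

    def push_front(self, value):
        """Fast: only one new node is made, nothing else moves."""
        self.head = Node(value, self.head)

    def find(self, value):
        """Slow: walks node by node; returns the number of nodes visited."""
        steps = 0
        node = self.head
        while node is not None:
            steps += 1
            if node.value == value:
                return steps
            node = node.next_node
        return None


# Array-like access: jump straight to position 2.
bases = ["A", "C", "G", "T"]
third = bases[2]

# Linked list: must follow the links from the head to reach the same element.
linked = LinkedList()
for base in reversed(bases):
    linked.push_front(base)

print(third, linked.find("G"))
```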

classes

-instructions + data combined into CLASSES, a single package: data packaged together w the relevant fxns
-*data structure* that has some data + some fxns that manipulate those data only
-why use classes? when a program becomes big, it is hard to keep track of things > must group data/relevant fxns together to simplify the entire code base --> *OBJECT ORIENTED PROGRAMMING*
-class definition: > class myClass( ):
-fxns defined in classes are *METHODS*; definitions of methods MUST include *SELF* as the 1st argument/parameter
-class *INSTANCE*: object formed in memory; ex. a = myClass( ) >> a is an INSTANCE of the class myClass
-note: once you make an INSTANCE, it has the ATTRIBUTES and methods of its class (so you access them with . syntax)
-*__INIT__*: define the method __init__ to initialize the data = CONSTRUCTOR, with self as the 1st argument
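A minimal class sketch tying these pieces together. The class name `Gene` and its methods are hypothetical examples.

```python
class Gene:
    def __init__(self, name, sequence):
        """Constructor: runs when an instance is made and stores the data."""
        self.name = name
        self.sequence = sequence

    def length(self):
        """A method: self is the instance the method was called on."""
        return len(self.sequence)

    def gc_fraction(self):
        gc = self.sequence.count("G") + self.sequence.count("C")
        return gc / len(self.sequence)


toy = Gene("toy", "ATGCGC")  # toy is an INSTANCE of the class Gene

# Attributes and methods are both reached with the dot syntax.
print(toy.name, toy.length(), toy.gc_fraction())
```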

loops

-keywords: break (stop the loop), continue (jump to next iteration), pass (fill blank space where statements are syntactically required)
I) FOR loop
-useful fxns: *range (returns #s), enumerate (returns each element and its index in a seq, ie list)*
-ex. for index, person in enumerate(people_list): >>> gives the index + the element in the list
-ex. for x in range(5,10): >>> #s from 5 to 9, NOT 10!
-note: range fxn returns a range object that produces its #s one at a time; if you use range in a FOR loop, each # is stored in the iterating variable x
-range fxn: the 3rd argument is the STEP SIZE (ie. range(5, 10, 2) gives 5, 7, 9)
II) WHILE loop: has no iterating variable, but REPEATS the block of code as long as the condition is met
-infinite loop: while 0 == 0:
-be careful to avoid an infinite loop
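The loop tools above in one runnable sketch. The list contents are placeholders.

```python
people_list = ["Ada", "Grace", "Rosalind"]

# enumerate gives the index and the element together.
indexed = []
for index, person in enumerate(people_list):
    indexed.append((index, person))

# range(5, 10) stops BEFORE 10; the 3rd argument is the step size.
five_to_nine = list(range(5, 10))
stepped = list(range(5, 10, 2))

# A WHILE loop that would run forever without break.
odd_numbers = []
count = 0
while True:
    count += 1
    if count > 6:
        break  # stop the loop
    if count % 2 == 0:
        continue  # skip even numbers, jump to the next iteration
    odd_numbers.append(count)

print(indexed)
print(five_to_nine, stepped, odd_numbers)
```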

data types

-strings, numeric types, boolean types, etc.
I) strings: sequence of characters (letters, digits, symbols)
-can concatenate 2 different strings by using ' + '
II) numeric types: int, float, complex
-complex types: *use "j" NOT "i"* (ex. 4 + 7j is class complex) --> can *extract the real + imaginary parts of a complex # with the .real + .imag attributes*
-*type conversion*: any FLOAT converted to INT is truncated toward zero, ie the fractional part is dropped (3.7 > 3; -3.7 > -3); for negative #s this is NOT the same as rounding down
-note: TYPE ERROR when you try to convert COMPLEX into FLOAT/INT
III) boolean types: True, False
-ex. tf = 5 < 3; print(tf, type(tf)) >>> False <class 'bool'>
-*boolean expressions*: ANY expression that returns True or False (ex. c = 1 + 3j; isinstance(c, complex) >>> True)
-note: *isinstance* checks if a variable is of a given type or not (ie. isinstance(c, complex) checks if c is a complex #)
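A quick check of the conversion rules above. The variable names are arbitrary; `math.floor` is included only to contrast real rounding down with what `int()` does.

```python
import math

joined = "ATG" + "CCC"  # string concatenation

z = 4 + 7j
parts = (z.real, z.imag)  # attributes, so no parentheses

# int() drops the fractional part (truncates toward zero).
truncated = (int(3.7), int(-3.7))
floored = math.floor(-3.7)  # true rounding down, for comparison

# Converting a complex number to float is not allowed.
try:
    float(z)
    complex_to_float_failed = False
except TypeError:
    complex_to_float_failed = True

tf = 5 < 3
print(joined, parts, truncated, floored, complex_to_float_failed)
print(tf, type(tf), isinstance(z, complex))
```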

bioinformatics and basics

-topics in DNA seqs: find regular or repeating patterns in a DNA seq
-many methods are built upon statistical comparison b/w real and random DNA seqs
generating random DNA seqs:
-use *random( ) fxn* to generate a random number b/w 0 and 1
-use *floor( ) and ceil( ) fxns* to round #s down or up
-use *''.join(a_list)* to make a list of strings into a single string!!!
-*generator expression*: SAVES MEMORY compared w a LIST COMPREHENSION; when it is the only argument of a fxn call, write it directly inside the fxn's parentheses --> get rid of the outer brackets
-generator technique ex: can call ''.join( ) on the expression without the brackets, so the list is not written out first; *generator saves memory, but is not necessarily faster* bc the .join fxn still has to go through the elements 1 by 1!
-note: *choice( ) fxn* picks a random element out of the given sequence, but it is SLOW
counting bases in seqs:
-*SUM( ) fxn*: treats True and False as 1 and 0 and sums them up (note: if you use 1 instead of True > save the time for converting boolean values into integers before summation)
-*string fxn .COUNT( )*: more than 20x faster; counting a letter in a string is very common (faster bc it doesn't have to go thru the python interpreter step by step, the counting is done directly in C)
*frequency table*:
-gene expression is regulated by molecules called *transcription factors* that attach to DNA --> turn nearby genes on and off
-these molecules *bind preferentially* to a few specific sequences --> these binding preferences can be represented by a table, called a frequency table
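A sketch of the three ideas above: making a random seq, counting a base two ways, and building a small frequency table. The fxn `random_dna`, the seed, and the three example binding sites are all made up for illustration; the table here is a plain per-position count, which is one simple way to represent binding preferences.

```python
import math
import random


def random_dna(length, rng):
    """Pick each base from a random number in [0, 1) scaled to an index 0..3."""
    # Generator expression passed straight to join: no outer brackets needed.
    return "".join("ACGT"[math.floor(rng.random() * 4)] for _ in range(length))


rng = random.Random(0)  # fixed seed so the run is repeatable
seq = random_dna(1000, rng)

# Two ways to count a base; .count() is the faster one.
count_with_sum = sum(1 for base in seq if base == "A")
count_with_method = seq.count("A")

# Frequency table: how often each base appears at each position of the sites.
sites = ["TATAAT", "TATGAT", "TACAAT"]  # hypothetical binding sites
freq_table = {
    base: [sum(1 for site in sites if site[i] == base) for i in range(len(sites[0]))]
    for base in "ACGT"
}

print(count_with_sum, count_with_method)
print(freq_table)
```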

python interface

-us (program code) --> interpreter of python (gives us a view of the memory) --> __main__ (name of memory space)
-when we write: a = 3, interpreter makes an OBJECT inside the memory __main__
-inside the object:
-1. name of object (NOT an attribute of the obj, just a tag outside, which is available to us!)
-2. attributes (run *dir(a)* to see all attributes): they are OBJECTS inside the object a
3 ESSENTIAL SYNTAXES:
1. . (dot): for accessing a certain attribute
2. ( ) parentheses: for USING/CALLING the object, running some code inside it, for executing code
-ex. *function_name( )* to run the fxn's code (a fxn IS an object itself, but you have to write ( ) to execute the fxn object)
3. = (equal sign): for assignment
MORE ON CLASSES: how it works in the interface:
-when you define a class MyClass, interpreter makes an object inside __main__ named MyClass
-inside, it has a lot of attributes; some are addresses to definitions of fxns (called methods)
-when you make an INSTANCE of MyClass, interpreter FIRST *makes a new instance object, THEN runs the __init__ method* (and whatever is inside the __init__ block)
-when you call a method from the INSTANCE obj, the instance object itself is used as the 1st argument to the method call (named "SELF")
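The three syntaxes and the role of self, in a short sketch. `MyClass` follows the name used in the notes; the method `who_am_i` and the attribute `created` are hypothetical.

```python
a = 3                      # = makes the object 3 and tags it with the name a
attribute_names = dir(a)   # every attribute that lives inside the object
real_part = a.real         # . reaches one attribute
bits = a.bit_length()      # ( ) calls an attribute that is a function


class MyClass:
    def __init__(self):
        # Runs right after the new instance object has been made.
        self.created = True

    def who_am_i(self):
        return self


instance = MyClass()

# Calling a method from the instance passes the instance itself as self,
# so these two calls do the same thing.
same_call = instance.who_am_i() is MyClass.who_am_i(instance)

print(real_part, bits, instance.created, same_call)
```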

biopython

-using string type in python to represent seqs is cumbersome --> eventually *the most commonly used tools were collected into a coherent package*
-many tools for biological analysis, NOT just seq analysis
-ex. 3D protein structure analysis, statistical analysis specific to bio
FASTA format VS GenBank format
-FASTA: each record starts w a description line beginning with " > "; the " > " is followed by the *unique ID of the seq*, then name of sp, chromosome #, etc.; the seq itself is on the following lines
-note: there can be multiple seqs in a SINGLE FASTA file
-GenBank: more complex, annotated w a lot more info on the particular seq
-*SeqIO module*: parses data files in various formats
-*SeqIO.parse( )*: main fxn to parse files, works like the open fxn for files --> *RETURNS an iterator of SeqRecord objects!!!* (loop over it to get one record per seq)
-*SeqRecord object*: holds a *Seq object* together w the other metadata in the data file
-note: SeqRecord.seq is a SEQUENCE (Seq) object (similar to a string), not a SeqRecord object
Seq class: immutable (original seq never changes), behaves like a string
-from Bio.Seq import Seq
-METHODS: .complement, .reverse_complement, .count(pattern), .count_overlap(pattern), .join, .lower, .upper, .transcribe, .back_transcribe, .translate
-initialize like an instance of a class, ie: *a_seq = Seq('ATG.....')*
-slicing of Seq is identical to string, can change Seq to string, concatenation is similar to string
SeqRecord class
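To make the FASTA layout concrete, here is a plain-python sketch that splits FASTA text into (ID, description, seq) records. This is NOT how Biopython is implemented and `read_fasta` is a hypothetical helper; in real work you would loop over `SeqIO.parse(filename, "fasta")` instead. The two records are invented.

```python
def read_fasta(lines):
    """Return a list of (seq_id, description, sequence) tuples."""
    records = []
    seq_id, description, chunks = None, "", []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if seq_id is not None:
                records.append((seq_id, description, "".join(chunks)))
            header = line[1:].split(maxsplit=1) or [""]
            seq_id = header[0]  # the unique ID comes right after ">"
            description = header[1] if len(header) > 1 else ""
            chunks = []
        else:
            chunks.append(line)  # sequence lines may be wrapped
    if seq_id is not None:
        records.append((seq_id, description, "".join(chunks)))
    return records


fasta_text = """>seq1 toy organism chromosome 1
ATGC
GGTA
>seq2
TTAA
"""

records = read_fasta(fasta_text.splitlines())
for seq_id, description, sequence in records:
    print(seq_id, "|", description, "|", sequence)
```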

what is programming?

-writing instructions/commands for computers to perform tasks
-CPU + memory make up the "brain" of every comp
I) *CPU*: reads each instruction + performs computation --> *executes instructions*
-uses elect. signals to execute commands
-entire CPU is composed of simple operation units, like *gates*, to perform computations
-a few transistors implement single bit memory
-transistors w/ resistors work as a gate
-a few *gates implement an addition circuit* (ie. *nor gates* can do a simple addition operation)
-*unit of info: BIT 0, 1* --> comp operations are extremely simple (bits)
II) *MEMORY*: keeps data + instructions
i) *instructions*: operators, FOR loops, if-else conditionals, fxns
ii) *data*: containers like lists, tuples, dictionaries, sets; variables can only store a single data value

types of operators

1. common operators:
- //: quotient (floor division)
- %: modulo/remainder (note: calculation of floating #s is not completely accurate)
- **: exponentiation
2. *boolean values: True, False*
-== equality and != not equal (*comparison operators*) > output either *True or False*
-also >, <, >=, <=
3. *identity operators: IS, IS NOT*
-ex. 4 is 4 >> True (only bc python reuses the same object for small ints; don't rely on IS for comparing #s)
-[1, 2] is [1, 2] >> False (a list object of 1 and 2 compared w another list of 1 and 2 are *different objects* --> *identity operators don't compare values, they compare OBJECTS*)
-note: object is the unit of data in python
-note: *== compares if the same data/info is stored; while IS compares if they ARE the same thing*
4. *logical operators*: not, or, and
-ex. not False >> True
-ex. True or False >> True; True and False >> False
5. *membership operators: in, not in*
-ex. 1 in [3, 1, 5] >> True; 7 not in [3, 1, 5] >> True
6. *assignment operators: =, +=, *=, etc.*
-note: bitwise operators, similar to assignment ops, are not used that much in biology
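Each operator group above, checked on small values. The numbers are arbitrary.

```python
quotient = 17 // 5
remainder = 17 % 5
power = 2 ** 10

# Floating point arithmetic is not exact.
float_sum_is_exact = (0.1 + 0.2 == 0.3)

# == compares the stored values; IS compares whether it is the same object.
list_a = [1, 2]
list_b = [1, 2]
alias = list_a
same_value = list_a == list_b
same_object = list_a is list_b
alias_is_same_object = alias is list_a

logic = (not False, True or False, True and False)
membership = (1 in [3, 1, 5], 7 not in [3, 1, 5])

total = 10
total += 5  # same as total = total + 5
total *= 2  # same as total = total * 2

print(quotient, remainder, power, float_sum_is_exact)
print(same_value, same_object, alias_is_same_object, logic, membership, total)
```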

reading a gene file

1. gene seq typically stored in a text file
-use 'r' in the open( ) fxn argument
-*new line "\n"*: HIDDEN character visible only to the computer --> *STRIP( ) fxn* removes \n characters (and other whitespace) from both ends of a string
-files may be in a diff location in a diff operating system, so to make code portable b/w diff systems, use the *os package* (import os > filename = os.path.join(...))
2. extract exon positions using *.tsv file format* (tab separated values)
-*.SPLIT( ) fxn* splits a line at the TAB spaces (and drops the \n character), returns the indiv components as a list
-make TUPLE using ( ) brackets
3. .REPLACE( ) fxn: replace T with U letter for mRNA seq
4. make a directory + save our seq
-use 'w' in the open( ) fxn
-use *.FIND( ) fxn* to find AUG inside a string (returns the index of the 1st match, or -1 if not found)
5. make a class Gene( ): >> make a module by making a new .py file, and copy the entire Gene class into it
-if you use the Gene.py module, MUST *import Gene* (or from Gene import Gene)
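The file-handling steps above, sketched end to end. The file names, the toy seq, and the exon positions are all invented; a temporary folder is used so the example leaves nothing behind.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # os.path.join builds a path that works on any operating system.
    seq_path = os.path.join(folder, "toy_gene.txt")
    exon_path = os.path.join(folder, "toy_exons.tsv")

    # 'w' mode: write the toy input files.
    with open(seq_path, "w") as handle:
        handle.write("CCATGAAATTT\n")
    with open(exon_path, "w") as handle:
        handle.write("2\t5\n8\t11\n")

    # 'r' mode: read the seq back and strip the hidden \n.
    with open(seq_path, "r") as handle:
        dna = handle.read().strip()

    # Each .tsv line is split into its components, stored as a tuple.
    exons = []
    with open(exon_path, "r") as handle:
        for line in handle:
            start, end = line.split()
            exons.append((int(start), int(end)))

mrna = dna.replace("T", "U")      # DNA letters to mRNA letters
start_codon_at = mrna.find("AUG") # index of the first AUG, or -1

print(dna, mrna, start_codon_at, exons)
```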

scripts, modules, packages, libraries

A) *script*: performs tasks, executes instructions
-write a script to execute something, and get some result (script = just normal/basic programming)
B) *module*: contains ONLY definitions
-does NOT execute anything by itself; however, if we import a module inside a script > can use its fxns inside our script
C) *packages*: a set of modules w a COMMON theme
D) *library*: a collection of packages; python has a standard library (essential modules like math, os, random; basic fxns like print, input, etc. are built in and need no import)
-must *import a package/module using the import keyword*
-ex. math package: import math > math.pi = pi (3.14...)
-note: to import only selected fxns w/o having to use the package name as a prefix, use 'from packagename import fxnname1, fxnname2'
-OR you can use '*' for importing ALL fxns
-*importing w alias*: import math as m
-use *dir( )* to list fxn names/attributes of an object
SCIENTIFIC PACKAGES
-Numpy: foundation of python scientific packages
-Biopython: basic tools for bio. data analysis (DNA seq, protein analysis)
-Pandas: provides framework for general data analysis (specific data structure), based on Numpy
-Scipy: scientific tools for many fields, including stats/physics
-Matplotlib: plotting fxns
-other packages: Statsmodels (stat analysis), Scikit-image (image analysis)
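The import styles listed above, using only the standard math module so the example runs anywhere.

```python
import math                   # whole module, used with the math. prefix
import math as m              # same module under an alias
from math import floor, ceil  # selected fxns, no prefix needed

circle_area = math.pi * 2 ** 2
same_module = m is math
rounded = (floor(3.7), ceil(3.2))

# dir() lists the names defined inside the module object.
math_names = dir(math)

print(circle_area, same_module, rounded, "sqrt" in math_names)
```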

copying

A) variables storing basic types (int, float, complex, string)
-v1 = 10 makes an OBJECT in a certain location in memory that stores the value 10, tagged with the variable name (v1)
-v2 = v1 gives the same value a 2nd name; bc basic types cannot be modified in place, assigning a new value to one variable just points that name at a new object --> in practice they behave as *SEPARATE ENTITIES*
-changing the value of one variable doesn't change the other!
B) LISTS: have multiple elements
-list1 = [1, 2, 3] makes an OBJECT in a certain memory location for the entire list (actual values), and the name list1 holds a REFERENCE/ADDRESS to that location
-list1 saves a reference to the data/actual values of the list
-when copying lists (list2 = list1), *ONLY the address/reference is copied, NOT the actual elements of the list* --> SO, changing one affects the other!
-to make a REAL COPY of a list, use *list.copy( )* --> makes a shallow copy of a list (copy() fxn copies the list itself)
C) NESTED SEQUENCE: list WITHIN a list
-list1 holds the address of a memory space where only the elements are found --> EACH element of the list holds the address of the memory space where the actual values of the nested sequence are found
-list2 = list1 ONLY copies the very top level reference
-list2 = list1.copy( ) makes a new top-level list that holds the SAME next-level addresses, BUT *does NOT copy the actual data sequences at the deepest level*
-*list2 = copy.deepcopy(list1) follows ALL references and makes the WHOLE copy*
-note: MUST *import copy* at the top of your program if you want to use copy.deepcopy()
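The three copying cases above, side by side. The values are arbitrary; what matters is which change shows up in the original.

```python
import copy

# A) basic types: reassigning v2 never touches v1.
v1 = 10
v2 = v1
v2 = 20

# B) lists: plain assignment copies only the reference.
list1 = [1, 2, 3]
list2 = list1
list2[0] = 99           # also visible through list1

list3 = list1.copy()    # a new list with its own slots
list3[1] = -1           # list1 is not affected

# C) nested lists: shallow copy vs deep copy.
nested = [[1, 2], [3, 4]]
shallow = nested.copy()
deep = copy.deepcopy(nested)

shallow[0][0] = "changed"    # inner list is shared, so nested changes too
deep[1][1] = "deep only"     # fully separate, nested is untouched

print(v1, v2, list1, list3)
print(nested, shallow, deep)
```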

sequence alignment

Hox genes (first studied in flies) are critical for normal development across many sp
-Hox genes are aligned in a specific *physical order* --> very important
-changing the physical position induces birth defects
-extremely well preserved genes across sp (over a very long evolutionary timescale)
-*seq alignment method* finds similarities b/w sp, to infer that the fxns of those genes are ALSO similar
2 complementary methods
1. *pairwise seq alignment*: align 2 seqs
-faster
-use it to select candidate genes/sp from a database
-BLAST from NCBI (not that accurate, but FAST)
2. *multiple seq alignment*: simultaneously align multiple seqs > gives more info
-slower
-use it AFTER you obtain the candidates
-ex. Clustal Omega, MAFFT, MUSCLE

statements and expressions

I) *expression*: piece of code that generates some value/outcome
-ex. [1, 2, 3] (output [1, 2, 3]); 7 + 5 (output 12)
II) *statement*: piece of code that makes a complete line/unit of execution, BUT there may be NO return value to use in the next line of code
-ex. print(42) (prints 42, gives back nothing useful); a = 3; b = 5 + 7 (composed of 2 instructions: first add 5 + 7 and return the result --> then assign the result to variable b)
-if we are not interested in the value a computation gives back, then the instruction is used as a statement
-if we need to get some outcome of the computation, then the instruction must be an expression
-an expression on a line by itself can be executed by the python interpreter
-note: *EVERY expression is a statement, but not every statement is an expression!* (some statements may not return values)
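One way to see the difference in practice: the built-in `eval` accepts only expressions, while `exec` runs statements. This is a sketch for illustration, not something the course necessarily used.

```python
# An expression produces a value that can be stored and reused.
value = eval("7 + 5")

# An assignment is a statement, so eval refuses it.
try:
    eval("b = 5 + 7")
    assignment_is_expression = True
except SyntaxError:
    assignment_is_expression = False

# exec runs statements; the effect shows up in the namespace, not as a value.
namespace = {}
exec_result = exec("b = 5 + 7", namespace)

# print shows 42 on screen but hands back None.
printed_result = print(42)

print(value, assignment_is_expression, namespace["b"], exec_result, printed_result)
```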

languages

I) history
-assembly langs: human readable computer instructions
-*compiler*: converts human friendly langs to ASSEMBLY CALLS --> machine calls (note: no way to see the result while writing the program; ex. fortran, C)
II) *compiled langs*: must convert the whole program to machine code + generate separate files that machines can read
-ex. C/C++, fortran, java
III) *interpreted langs*: allow you to see the result while you're writing the program; *do NOT compile the whole program ahead of time*
-translate EACH line of code > execute it right after
-SLOWER than compiled langs; but convenient!
-ex. python (in almost all science)
-other ex: matlab (most engineering, some bio, matrices); R (in statistics); perl (extensively used in human genome project); ruby
IV) why python?
-easy to learn (syntax v human friendly)
-extremely popular in every field of science
-mature scientific libraries (numpy, scipy, which are fast, stable, maintained well)
-easy to learn programming enviro (jupyterlab)
-note: challenges include TOO many ways to achieve 1 goal!

