DSC 10 WEEK 3 - WHO KNOWS
Histogram
A chart that displays the distribution of numerical values. It uses bins, with one bar for each bin. It follows the area principle: the area of each bar is the percent of individuals in the corresponding bin. AREA IS THE PERCENTAGE.
aged.bin('Age', bins=make_array(0, 5, 10, 20)): what is the second-to-last row?
10. It is the bin that starts at 10 and ends at 20. The last row, 20, only marks the right end of that bin, so its count is 0. Values over 20 are not counted at all.
What should happen to our histogram if we combine the two bins [20, 40) and [40, 60) into one large bin [20, 60)?
The [20, 60) bin is twice as wide as each original bin. The width is fixed by the bin, so the height has to adjust to give the right area. The area of the bar for [20, 60) should be the sum of the areas of the bars for [20, 40) and [40, 60), which preserves the total area. YOU HAVE TO ADD THE AREAS, not the heights. AREA IS THE PERCENTAGE.
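A minimal sketch of the area principle in plain Python. The 15% and 25% figures are made up for illustration, and `bar_height` is a hypothetical helper name, not part of any library.

```python
def bar_height(percent, left, right):
    """Height of a histogram bar: percent per unit on the horizontal axis."""
    return percent / (right - left)

# Assumed example: 15% of values fall in [20, 40) and 25% fall in [40, 60).
pct_20_40 = 15
pct_40_60 = 25

# Combining the bins adds the areas (percents), not the heights.
pct_20_60 = pct_20_40 + pct_40_60

height_20_40 = bar_height(pct_20_40, 20, 40)
height_40_60 = bar_height(pct_40_60, 40, 60)
height_20_60 = bar_height(pct_20_60, 20, 60)

print(height_20_40, height_40_60, height_20_60)
```

Because the two original bins have the same width, the height of the combined bar comes out as the average of the two original heights.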
num_rows
Compute the number of rows in a table
select
Create a copy of a table with only some of the columns
take()
Create a copy of the table with only the rows whose indices are in the given array
Height measures.....
DENSITY: how crowded the bin is (percent per unit on the horizontal axis). The denser the bin, the taller its bar. Density depends on the bin width.
Question 1. Assign us_death_rate to the total US annual death rate during this time interval (July 1, 2016 to July 1, 2017). The annual death rate for a year-long period is the number of deaths in that period as a proportion of the population at the start of the period.
us_death_rate = sum(pop.column('DEATHS')) / sum(pop.column('2016'))
us_death_rate
The denominator is the 2016 column because the rate uses the population at the start of the period (July 1, 2016).
Table
Table() Create an empty table, usually to extend with data
Table.read_table
Table.read_table("my_data.csv") Create a table from a data file
Histogram AXES
The area of a bar is a percentage of the whole: area = %. The horizontal axis is a number line.
What should happen to our histogram if we combine the two bins [20, 40) and [40, 60) into one large bin [20, 60)? What is the density of the new bin?
The new bin has about twice as many movies and is twice as wide as each original bin, so it has about the same density as each original bin. Doubling both the count and the width keeps the height about the same.
Don't use normed=False
True
def f(s): return np.round(s / sum(s) * 100, 2) 1. What does this function do?
a
Shows counts on the vertical axis: for each age range, how many things fall in that range.
aged.hist('Age', bins=np.arange(0,101,20),normed=False)
group
Aggregates all rows with the same value for a column into a single row in the result. First argument: the column to group by. Second argument: what to do with the values in the other columns.
group by color
all_cones.group('Color')
Color   count
brown   1
red     2
group by Flavor and Color
all_cones.group(['Flavor', 'Color']): to group by more than one column, pass the column labels in a list.
bar
Bar chart: the data are already grouped by category. Histogram: you decide the bins, which split the values into different buckets.
def f(s): return np.round(s / sum(s) * 100, 2) 13. What output will it give? 14. What output will it give?
13. An array of numbers. 14. An array of numbers.
HW3 Question 2. Sort the data in decreasing order by NEI, naming the sorted table by_nei. Create another table called by_nei_pter that's sorted in decreasing order by NEI-PTER instead.
by_nei = unemployment.sort('NEI', descending=True)
by_nei_pter = unemployment.sort('NEI-PTER', descending=True)
Question 5. Add pter as a column to unemployment (named "PTER") and sort the resulting table by that column in decreasing order. Call the table by_pter.
by_pter = unemployment.with_columns("PTER", pter).sort("PTER", descending=True)
by_pter
starters.group('TEAM',max)
For each team, takes the max of every other column. For a column of strings, max is the value that comes last in alphabetical order.
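A small plain-Python sketch of why max behaves this way on strings. The position codes and salary figures are made up for illustration.

```python
# Made-up values standing in for one team's rows.
positions = ["PG", "C", "SF"]
salaries = [12.5, 3.0, 7.25]

# On strings, max compares alphabetically and returns the one that sorts last.
print(max(positions))

# On numbers, max returns the largest value.
print(max(salaries))
```

Note that the largest position and the largest salary can come from different players, so a row produced this way does not describe any single player.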
scatter plot
Shows the relationship between two numerical variables.
Question 3. Make a table of the number of complaints made against each company. Call it complaints_per_company. It should have one row per company and 2 columns: "company" (the name of the company) and "number of complaints" (the number of complaints made against that company).
complaints_per_company = complaints.group('company').relabeled("count", "number of complaints")
complaints_per_company
Question 5. Make a bar chart of just the 5 companies with the most complaints.
complaints_per_company.sort("number of complaints",descending=True).take(np.arange(5)).barh("company")
Question 6. Make a bar chart like the one above, with one difference: The size of each company's bar should be the proportion (among all complaints made against any company in complaints) that were made against that company.
complaints_per_company.with_column("proportion of all complaints", complaints_per_company.column("number of complaints") / complaints.num_rows)\
    .sort("proportion of all complaints", descending=True)\
    .drop("number of complaints")\
    .take(np.arange(5))\
    .barh('company')
HW3 How many complaints were made against each kind of product? Make a table called 'complaints_per_product' with one row per product category and 2 columns: "product" (the name of the product) and "number of complaints" (the number of complaints made against that kind of product). You should be able to do this in one line of code.
complaints_per_product = complaints.group('product').relabeled('count', 'number of complaints')
complaints_per_product
def f(s): return np.round(s / sum(s) * 100, 2) 12. What kind of input does it take? Example: s = 1, 2, 3; sum(s) = 6; s / sum(s) = 1/6, 2/6, 3/6; s / sum(s) * 100 = 1/6 * 100, 2/6 * 100, 3/6 * 100
computes percents
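A plain-Python analogue of f, sketched without NumPy so it runs anywhere. The name `percents` is hypothetical, and the list comprehension stands in for NumPy's elementwise array arithmetic.

```python
def percents(values):
    """Each value as a percent of the total, rounded to 2 decimal places."""
    total = sum(values)
    return [round(v / total * 100, 2) for v in values]

print(percents([1, 2, 3]))
```

The input is a collection of numbers and the output has one percent per input value, which is why the percents add up to about 100.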
most expensive chocolate ice-cream
cones.where('Flavor', 'chocolate').column
Binning
Counting the number of numerical values that lie within ranges, called bins (putting numbers into groups based on their range). Each bin includes its left endpoint and excludes its right endpoint.
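A minimal plain-Python sketch of binning under the left-inclusive, right-exclusive convention. The function name `bin_counts` and the ages are made up, and this sketch excludes the right endpoint of every bin, including the last; a library's treatment of the final endpoint may differ.

```python
from bisect import bisect_right

def bin_counts(values, edges):
    """Count values in bins [edges[i], edges[i+1]); values outside all bins are skipped."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        if edges[0] <= v < edges[-1]:
            # bisect_right finds the first edge greater than v;
            # the bin index is one position before it.
            counts[bisect_right(edges, v) - 1] += 1
    return counts

ages = [0, 3, 5, 9, 10, 19, 20, 45]  # made-up data
print(bin_counts(ages, [0, 5, 10, 20]))
```

Here 5 lands in the bin that starts at 5, not the one that ends at 5, and 45 is not counted because it lies beyond the last edge.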
Apply
Creates an array by calling a function on every element in the input column(s): table_name.apply(function_name, 'column_label')
with_column("name",.....)
The second argument is the data that you want in that column. The new column is added at the end (right side) of the table.
def spread(values): return max(values) - min(values)
In def spread(values): spread is the name, and values is the argument name (parameter).
sort() default order is .....
Ascending; descending defaults to False.
C_to_F(y/4)
First computes y/4, then calls C_to_F with that value.
Question 4.3. What's the title of the earliest movie in the dataset? You could just look this up from the output of the previous cell. Instead, write Python code to find out.
earliest_movie_title = imdb_by_year.column('Title').item(0)
earliest_movie_title
do it for all the data at one time
For every set of parents, predict the height of their child, then compare the predicted height with the actual height.
Question 2. Assign fastest_growth to an array of the names of the five states with the fastest population growth rates in descending order of growth rate.
fastest_growth = pop.with_column('R', -pop.column(3) / pop.column(2)).sort('R').take(np.arange(5)).column(1)
fastest_growth
aged.bin('Age', bins=make_array(2, 4, 6, 8, 10))
Bins start at 2, 4, 6, ...: a fine level of detail.
add my_flower to the original table using...
flowers.with_row(my_flower)
L.7 group('Age')
For each age, how many movies there were.
nba.
For each possible pair of team and position, find the max of every other column. For a column of strings, such as player names, max picks the string that comes last in alphabetical order.
aged.hist('Age', bins=np.arange(0,101,20),unit='year')
Bins run from 0 to 100 in chunks of 20. The unit (year) is added to the x- and y-axis labels. By default the y-axis measures percent per unit, not a plain percent.
Question 3. Use take to make a table containing the data for the 8 quarters when NEI was greatest. Call that table greatest_nei.
greatest_nei = by_nei.take(np.arange(8))
greatest_nei
The upper bound of np.arange is exclusive, so np.arange(8) gives the indices 0 through 7.
cones.group('Flavor')
group by flavor
Flavor       count
chocolate    3
strawberry   2
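A plain-Python sketch of what grouping by one column computes, using `collections.Counter` as a stand-in for the table method. The list of flavors is an assumption chosen to reproduce the counts on this card.

```python
from collections import Counter

# Assumed rows: one flavor per ice cream cone.
flavors = ["strawberry", "chocolate", "chocolate", "strawberry", "chocolate"]

# One entry per distinct flavor, with the number of rows that have it.
counts = Counter(flavors)

for flavor, count in sorted(counts.items()):
    print(flavor, count)
```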
L.7 top.group('Studio')
group by studio
Crowdedness of bins is ......
height
L.7 distribute
how many people have that value
If we want just the ratings of the movies, we can get an array that contains the data in that column:
imdb.column("Rating").....returns an array
If you create a table column from a list, it will ...
automatically be converted to an array. A row, on the other hand, mixes types.
if the column has numbers,
it will sort numerically.
Question 5. Assign less_than_west_births to the number of states that had a total population in 2017 that was smaller than the number of babies born in region 4 (the Western US) during this time interval.
less_than_west_births = pop.where('2017', are.below(west_births)).num_rows
less_than_west_births
Question 3. Assign movers to the number of states for which the absolute annual rate of migration was higher than 0.5%. The annual rate of migration for a year-long period is the net number of migrations (in and out) as a proportion of the population at the start of the period. The MIGRATION column contains estimated annual net migration counts by state.
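movers = pop.with_column("test", abs(pop.column("MIGRATION") / pop.column("2016"))).where("test", are.above(.005)).num_rows
movers
abs(...) is needed because the question asks for the absolute rate, and net migration can be negative.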
The horizontal axis is a
number line
L.7 The sum is over 100, so there is ...
overlap
Area measures.....
percent
Look at similar families (mid_parent function)
Predict the result. Should be able to vary.
Compute an array containing the percentage of people who were PTER in each quarter. (The first element of the array should correspond to the first row of unemployment, and so on.)
pter = unemployment.column('NEI-PTER') - unemployment.column('NEI')
pter
Function requirements are not ......
required
def cut_off_at_100(age): 'The smaller of age and 100'. Return the age or 100, whichever is smaller. What does cut_off_at_100(104) return?
return min(age, 100), so cut_off_at_100(104) returns 100
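The card above written out as a runnable function. The two sample ages are only illustrations.

```python
def cut_off_at_100(age):
    """The smaller of age and 100."""
    return min(age, 100)

print(cut_off_at_100(104))  # an age above the cutoff
print(cut_off_at_100(37))   # an age below the cutoff is returned unchanged
```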
what if some of the columns can't be summed b/c they're strings?
Select the columns you want and then group: nba.select('POSITION', 'SALARY').group('POSITION', np.mean).sort('SALARY mean', descending=True). After grouping with np.mean, the salary column is labeled 'SALARY mean'.
histogram
Shows the distribution of numerical data. Each bar represents a group defined by a continuous, quantitative variable. Don't use it when ...
L.7 easier to compare
sort first
If the column has strings in it....
sort will sort alphabetically
Who is the best-paid starter?
start_salaries.
Which will rank the teams in order of their highest-paid starter?
starters.select('TEAM
Define a function str_len that takes a string as a parameter and returns a new string that consists of: the given string, a colon and a space, "length is", and the length of the string. Example: given 'Data Science rocks!', it returns 'Data Science rocks!: length is 19'.
def str_len(s): return s + ": length is " + str(len(s))  # str turns the length into a string
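A runnable version of str_len, checked against the example string on the card.

```python
def str_len(s):
    """Return s, then ': length is ', then the number of characters in s."""
    # len(s) is a number, so str() is needed before joining it to text.
    return s + ": length is " + str(len(s))

print(str_len("Data Science rocks!"))
```

Leaving out the `str(...)` call would raise a TypeError, because Python does not join a string and an integer with `+`.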
L.7 What proportion did not use their phone for online banking?
The percents sum to over 100, so the categories overlap.
combine tables
t=drinks.join('Cafe',discounts,"Location")
using apply with multiple arguments def midParent(mother_height,father_height)
table_name.
a common method to use with np.arange
take()
both sorted.take(np.arange(18, 30))?
take() keeps only the rows at the given indices, here rows 18 through 29 (the upper bound 30 is excluded).
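A plain-Python sketch of the same index logic. Python's built-in `range` is used as a stand-in for `np.arange` (for whole-number bounds they produce the same indices), and the list `rows` is a made-up stand-in for a sorted table.

```python
# Stand-in for the rows of a sorted table with 40 rows.
rows = ["row " + str(i) for i in range(40)]

# Like np.arange(18, 30): starts at 18 and stops before 30.
indices = list(range(18, 30))

# Like take: keep only the rows at those indices.
selected = [rows[i] for i in indices]

print(indices[0], indices[-1], len(selected))
```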
with_columns
tbl = Table().with_columns("N", np.arange(5), "2*N", np.arange(0, 10, 2)) Create a copy of a table with more columns
column
tbl.column("N") Create an array containing the elements of a column
drop
tbl.drop("2*N") Create a copy of a table without some of the columns
where
tbl.where("N", are.above(2)) Create a copy of a table with only the rows that match some predicate
cones.group('Flavor', max)
The second and third columns hold the max of each column within a group, so the values in one row can come from different rows of the original table.
def C_to_F(x): return x * 9/5 + 32
to define a function, make your own function
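A runnable sketch of the C_to_F definition, including a call whose argument is an expression. The value of `y` is an assumption picked so the result is easy to check.

```python
def C_to_F(x):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return x * 9 / 5 + 32

y = 400  # assumed value

# Python evaluates y / 4 first, then passes the result (100.0) to C_to_F.
print(C_to_F(y / 4))
```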
If the name of the table is top and the name of our function is str_len, how do we find the length of each movie title?
top.apply(str_len, "Title")
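Turn a data table into a line plot with the x and y axes labeled
unemployment.with_columns("PTER", pter, "Year", 2000 + np.arange(by_pter.num_rows) / 4).plot("Year", "PTER"), where the first argument to plot is the x-axis column and the second is the y-axis column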
bar chart
Used to compare categories. Each bar (vertical or horizontal) represents a group defined by a categorical variable.
Question 4. Assign west_births to the total number of births that occurred in region 4 (the Western US).
west_births = sum(pop.where('REGION', are.equal_to('4')).column('BIRTHS'))
west_births
add a row to a table
with_row
L.7 scatter()
x and y axis labels
minimize the cost
you should get espresso at nefeli
starters.drop('POSITION').group('TEAM', max).sort(1, descending=True)
You're sorting by column 1, which is the second column because column indices start at 0.