Coding Fundamentals with Python Final
parent or base class
the previously defined class from which the new class inherits attributes and methods
kind = 'box'
to create a box plot
sharey = True
to ensure plots share the same scale for the y-axis
.keys()
to list all the keys of a dictionary
.values()
to list all the values of a dictionary
.items()
to list both the keys and values of a dictionary
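A minimal sketch of the three dictionary methods above. The fruit_price contents are made-up sample data, used only for illustration.

```python
# Hypothetical sample data, used only for illustration.
fruit_price = {"apple": 1.25, "banana": 0.5, "cherry": 3.0}

keys = list(fruit_price.keys())      # the keys only
values = list(fruit_price.values())  # the values only
items = list(fruit_price.items())    # (key, value) tuples

# .items() is the usual way to loop over keys and values together.
for name, price in fruit_price.items():
    print(f"{name}: {price}")
```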
You are given two DataFrames, alpha and beta. You run the following piece of code to join them: omega = pd.merge(alpha, beta, on = 'theta', how = 'right'). The omega DataFrame will contain all rows in beta and only those rows in alpha that have matching key values.
true
by default, the .describe() method of a dataframe only returns summary statistics for the numeric columns in a dataframe
true
the pandas plot() method is an abstraction of some of the functions and methods of the matplotlib package
true
the while loop is very similar to a conditional statement because it is made up of a condition and a response
true
when collecting data, the absence of existing data on certain subpopulations can lead to bias in ground truth data
true
try-except (example)
try: 1/0 print(a) except NameError: print('The variable is not defined!') except: print("You can't divide by 0!")
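The try-except pattern can also be laid out as a small runnable function. This is only a sketch: safe_divide is a hypothetical helper name, and it names ZeroDivisionError explicitly instead of relying on a bare except.

```python
def safe_divide(numerator, denominator):
    """Return numerator / denominator, or None if the division fails."""
    try:
        return numerator / denominator
    except ZeroDivisionError:
        # Raised when the denominator is 0.
        print("You can't divide by 0!")
    except TypeError:
        # Raised when an argument is not a number (for example a string).
        print("Both arguments must be numbers!")
    return None


result_ok = safe_divide(10, 4)    # 2.5
result_zero = safe_divide(1, 0)   # prints a message and returns None
```

Catching specific exception types keeps unrelated errors from being silently swallowed.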
(parentheses)
tuple
if we try to create a NumPy array from a list with elements of different data types, python will convert all the elements to a single data type. what is this process called?
upcasting
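A short sketch of upcasting in NumPy. The array contents are arbitrary sample values.

```python
import numpy as np

# Integers mixed with a float: every element is upcast to float.
ints_and_floats = np.array([1, 2, 3.5])

# Numbers mixed with text: every element is upcast to a string type.
numbers_and_text = np.array([1, 2.5, "three"])

print(ints_and_floats.dtype)
print(numbers_and_text.dtype)
```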
combining strings
use addition (+) operator
extracting n-th character in a string
use square brackets ([ ])
.agg()
used to apply multiple aggregation functions to one or more columns in a dataframe or to apply different aggregations to different columns at once
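Both uses of .agg() can be sketched on a tiny stand-in table. washers_demo and its values are hypothetical; only the column names echo the washers dataset used elsewhere in this set.

```python
import pandas as pd

# Hypothetical stand-in for the washers data.
washers_demo = pd.DataFrame({
    "Configuration": ["Top Load", "Top Load", "Front Load", "Front Load"],
    "EnergyUse": [120, 100, 90, 110],
    "WaterUse": [4000, 4400, 3000, 3400],
})

# The same list of aggregations applied to several columns.
same_aggs = washers_demo.groupby("Configuration")[["EnergyUse", "WaterUse"]].agg(["min", "max"])

# A different aggregation for each column, passed as a dictionary.
per_column = washers_demo.groupby("Configuration").agg({"EnergyUse": "mean", "WaterUse": "max"})

print(same_aggs)
print(per_column)
```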
assignment operator (=)
used to create a variable
del (statement, ex. del my_variable)
used to delete a variable
.format()
used to display a message that includes information stored in a single variable or several variables
try-except statement
used to handle exceptions that occur during code execution
for loop
used to iterate over the items of a sequence or container
operators
used to perform calculations in python
.rename(columns = {"" : ""})
used to rename columns
xlim and ylim
used to zoom in to a certain part of the plot in order to get a closer look at the data
triple quotes
used when strings go across multiple lines
class attributes
used when we want to define attributes with values that are shared by all objects created from a class
grouped bar chart
useful when we want to compare values across two or more categories
def function_name(arguments): """docstring""" <code> return output
user-defined functions
python supports three main types of function: built-in functions, ___, and ___
user-defined functions, anonymous functions
which of these is not one of the five key considerations we must keep in mind when collecting data for the analytics process?
value
global variables
variables defined outside of a function, can be used inside and outside of a function
Create a scatterplot that shows the relationship between city miles per gallon (on the x-axis) and CO2 emissions (on the y-axis) for all vehicles in the vehicles dataset.
vehicles.plot(kind = 'scatter', x = 'citympg', y = 'co2emissions')
Create two overlapping histograms from the vehicles dataset. The first histogram should show the distribution of city miles per gallon, while the second should show the distribution of highway miles per gallon. Set the opacity of the histograms to 0.4 and the make the plot 10 inches wide by 6 inches high. Label the x-axis "Miles Per Gallon" and the y-axis "Number of Vehicles".
vehicles[["citympg", "highwaympg"]].plot(kind = "hist", figsize = (10, 6), alpha = 0.4) plt.xlabel("Miles Per Gallon") plt.ylabel("Number of Vehicles")
resolving duplicate columns example
vehicles_concat_col = vehicles_concat_col.loc[:, ~vehicles_concat_col.columns.duplicated()]
Create a new DataFrame called washer_config from the washers dataset that lists the minimum, median, mean and maximum Energy Usage and Water Usage values for each type of washer configuration (i.e top load or front load). Output the washer_config DataFrame.
washer_config = pd.DataFrame(washers.groupby(["Configuration"])[["EnergyUse", "WaterUse"]].agg(["min", "median", "mean", "max"])) washer_config
Create a new DataFrame called washers by importing the CSV file located at https://coding-fundamentals.s3.amazonaws.com/residentialwashers.csv. Preview the first 5 rows of the DataFrame.
washers = pd.read_csv("https://coding-fundamentals.s3.amazonaws.com/residentialwashers.csv") washers.head(5)
Create a new DataFrame called washers by importing the CSV file located at https://coding-fundamentals.s3.amazonaws.com/residentialwashers.csv. Return a concise summary of the rows and columns in the washers DataFrame. Hint: The summary must include the number of columns, number of rows, column names, data type of each column, number of non-missing values in each column and how much memory is used to store the DataFrame.
washers = pd.read_csv("https://coding-fundamentals.s3.amazonaws.com/residentialwashers.csv") washers.info()
Create a new DataFrame called washers by importing the Residential Washers CSV file located at https://coding-fundamentals.s3.amazonaws.com/residentialwashers.csv. Set the ID column as the row index (either after the import or during the import) and preview the last 10 rows of the washers DataFrame.
washers = pd.read_csv("https://coding-fundamentals.s3.amazonaws.com/residentialwashers.csv", index_col = "ID") washers.tail(10)
Sort the washers DataFrame in descending order, by DateAvailable and DateCertified. Note: Perform an in place sort.
washers.sort_values(by = ['DateAvailable', 'DateCertified'], inplace = True, ascending = False) washers
Output a count of each unique value in the BrandName column of the washers DataFrame.
washers["BrandName"].value_counts()
Based on the data in the BrandName column of the washers DataFrame, output the percentage of washers that belong to each brand.
washers["BrandName"].value_counts(normalize = True)
Convert the DateAvailable and DateCertified columns in the washers DataFrame to datetime. Use a DataFrame attribute to output the data type of both columns after the conversion is done. Hint: The display() function allows us to output more than one result at the same time.
washers['DateAvailable'] = pd.to_datetime(washers['DateAvailable']) washers['DateCertified'] = pd.to_datetime(washers['DateCertified']) display(washers["DateAvailable"].dtype, washers["DateCertified"].dtype)
Output the number of non-missing values, average, standard deviation, minimum, maximum, 25th percentile, 50th percentile and 75th percentile for the Volume, IMEF, EnergyUse, IWF and WaterUse columns in the washers DataFrame.
washers[["Volume", "IMEF", "EnergyUse", "IWF", "WaterUse"]].describe()
Output the 20th, 40th, 60th and 80th percentile values for the Volume, IMEF, EnergyUse, IWF and WaterUse columns in the washers DataFrame.
washers[["Volume", "IMEF", "EnergyUse", "IWF", "WaterUse"]].quantile([0.20, 0.40, 0.60, 0.80])
Create a new DataFrame called water_config_brand from the washers dataset that lists the maximum water usage for each configuration and brand. Output the water_config_brand DataFrame.
water_config_brand = pd.DataFrame(washers.groupby(["Configuration", "BrandName"])["WaterUse"].max()) water_config_brand
instance attributes
we specify their value whenever we create a new instance of the object
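A minimal sketch contrasting class attributes with instance attributes. The names Ann and Bob and their values are made up for illustration.

```python
class Employee:
    salaried = True  # class attribute: one value shared by every instance

    def __init__(self, first_name, work_years):
        # Instance attributes: set separately for each new object.
        self.first_name = first_name
        self.work_years = work_years


ann = Employee("Ann", 3)
bob = Employee("Bob", 7)

# Changing the class attribute on the class is visible through every
# instance that has not set its own value for it.
Employee.salaried = False
print(ann.salaried, bob.salaried)
```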
descriptive analytics
what happened, what is happening; using summary statistics and visualizations to describe historical data
predictive analytics
what is likely to happen; uses statistical models and machine learning to estimate the likelihood of future outcomes
prescriptive analytics
what should we do; least used and most complex; considers the implications of several possible decisions and makes recommendations on which actions to take in order to maximize a stated objective
integer
whole number
diagnostic analytics
why did it happen; root cause analysis, identify outliers, isolate patterns, and discover hidden relationships in data
verify_integrity = True
will raise a ValueError if it encounters duplicate indices
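A small sketch of verify_integrity in pd.concat(), shown next to ignore_index for comparison. The two DataFrames are hypothetical and deliberately share the index label 1.

```python
import pandas as pd

# Two hypothetical DataFrames whose indexes overlap at label 1.
first = pd.DataFrame({"x": [1, 2]}, index=[0, 1])
second = pd.DataFrame({"x": [3, 4]}, index=[1, 2])

try:
    pd.concat([first, second], verify_integrity=True)
    raised = False
except ValueError:
    # Duplicate index labels were detected.
    raised = True

# ignore_index=True discards the original labels and assigns new ones.
renumbered = pd.concat([first, second], ignore_index=True)
print(raised, list(renumbered.index))
```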
import pandas as pd trackwomen = pd.read_html("https://und.com/sports/track/roster/season/2021-22/")[1] trackwomen.head() trackwomen.groupby('______')['______'].value_counts().unstack().plot( kind = '______', stacked = ______, figsize = (10,6))
1) POSITION 2) Class 3) barh 4) True
import pandas as pd mbball = pd.read_html("https://und.com/sports/mbball/roster/season/2021-22/")[0] mbball[['Number','Name','POSITION','Hometown','High School','Class']] Complete the following piece of code so that I get a summary similar to the one below. mbball['_______'].________()
1) Class 2) describe
array
a fixed-type data structure that allows us to store multidimensional data in an efficient way
arguments or parameters
the data (objects) that a function can accept as input
read_csv()
a function from the pandas library used to import CSV files
given a pivot table called centuri. what type of chart will the following piece of code create? centuri.plot(kind = "bar", stacked = False)
a grouped bar chart
DataFrame
a heterogeneous two-dimensional data structure with labeled axes (rows and columns)
pandas series
a homogeneous one-dimensional array-like data structure with a labeled axis (rows)
given a pandas series object called epsilon. what type of chart will the following piece of code create? epsilon.plot(kind = "barh")
a horizontal bar chart
variable
a label or name that is assigned to an object (such as a number)
list
a mutable, heterogeneous, ordered, multi-element container
Before we can generate group-level aggregations, we first need to group data using the groupby() method of a series or data frame. In the following code snippet, what would the data type of florida_county be? florida_county = votes[votes['state']=='FL'].groupby('county')
a pandas GroupBy object
function
a reusable piece of code that performs a certain task
while loop
a type of loop that runs as long as a logical condition is true and stops running when the logical condition becomes false
5 aspects of data collection
accuracy, relevance, quantity, variability, ethics (privacy, security, informed consent, bias)
.append()
add an element to the end of a list
title =
adds a title to a plot
np.append()
adds an element to a NumPy array
descriptive statistics or summary statistics
aggregations or statistical measures are used to describe the general and specific characteristics of our data
if-elif-else statement
allows us to chain multiple conditional statements together in order to execute different blocks of code depending on which condition is met
joins
allows us to combine two or more datasets based on the values in related columns from each dataset
inheritance
allows us to define a class that inherits the attributes and methods of a previously defined class
if-else statement
allows us to execute a separate block of code if the condition in the if statement is not met
Complete the function below so that when you pass an unspecified number of numbers to it, it returns the largest. def max_number(*alpha): beta = alpha[0] for gamma in ________ : if gamma > beta: beta = gamma return beta
alpha
a package is a collection of code used to perform a specific type of task. In order to use the code provided by a package in python, we have to first import the package in the following way: import pandas as pd what do we call the pd in the line of code above?
an alias
what will the following piece of code return? mylist = ['My true love sent to me', 1, 'Partridge', 1, 'Pear Tree'] mylist.sort() mylist
an error
each cycle through a loop is called
an iteration
another name for a NumPy array is
an nd-array
string (str)
an object that holds a block of text
boolean (bool)
an object that holds the dichotomous values True or False
dictionary (dict)
a mutable data structure that stores information as key-value pairs; elements are accessed by key rather than by position (insertion order is preserved as of Python 3.7)
&
and
avg = 72 std = 4.3 tom = 75 print(tom >= (avg - std) ____ tom <= (avg + std))
and
isinstance(variable, data_type)
another way to check the data type of a variable
class attributes and instance attributes
are both mutable
required arguments
arguments that must be passed to the function when we use it in our code (ex. if function is specified with 3 arguments, you must give 3 arguments)
referencing an element in a 2-d array
arrayname[row_index, column_index]
evaluation
assess how well the chosen analytics approach works
how to create a boolean
assign True or False to a variable or assign the result of a comparison, logical, or membership operation to a variable (ex. my_boolean = 5 > 4)
Create a line plot from the vehicles DataFrame that shows the change in the average city miles per gallon by year.
avg_citympg = vehicles.groupby('year')[['citympg']].mean() avg_citympg.plot(kind = "line")
Create a line chart from the vehicles DataFrame that shows the change in both the average city and highway miles per gallon by year.
avg_mpg = vehicles.groupby('year')[['citympg', 'highwaympg']].mean() avg_mpg.plot(kind = "line")
Create two separate line plots in the same figure from the vehicles DataFrame that show the change in the average city and highway miles per gallon by year. The city miles per gallon plot should be on top of the highway miles per gallon plot.
avg_mpg = vehicles.groupby('year')[['citympg', 'highwaympg']].mean() avg_mpg.plot(kind = 'line', y = ['citympg', 'highwaympg'], subplots = True)
Create two separate line plots in the same figure from the vehicles DataFrame that show the change in the average city and highway miles per gallon by year. This time, the city miles per gallon plot should be to the left of the highway miles per gallon plot. Title the plot "Average City MPG versus Average Highway MPG" and make it 12 inches wide by 4 inches high.
avg_mpg = vehicles.groupby('year')[['citympg', 'highwaympg']].mean() avg_mpg.plot(kind = 'line', y = ['citympg', 'highwaympg'], title = 'Average City MPG versus Average Highway MPG', figsize = (12, 4), subplots = True, layout = (1, 2))
keyword (or named) arguments
be explicit in specifying which value goes with which argument when calling a function
how to create a pandas series
brics = pd.Series(["Brazil", "Russia", "India", "China", "South Africa"])
escape character
can be used to represent whitespace characters or characters that are typically not allowed in strings
lists are heterogeneous
can contain elements of different data types
optional (or default) arguments
can specify a default value for some or all of our arguments when defining a function
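Required, keyword, and optional arguments can be seen together in one short sketch. It reuses the triangle_area idea from this set; the parameter names base and height are chosen here for readability.

```python
def triangle_area(base, height=10):
    """Return the area of a triangle; height is optional and defaults to 10."""
    return 0.5 * base * height


area_default = triangle_area(4)                  # height falls back to 10
area_positional = triangle_area(4, 5)            # values matched by position
area_keyword = triangle_area(height=5, base=4)   # values matched by name
```

With keyword arguments the order in the call no longer matters, which is why the last call can list height first.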
.capitalize()
capitalizes only the first character of the string and lowercases the rest
.title()
capitalizes the first letter of each word
bins =
changes the number of bins in a histogram
np.reshape()
changes the shape of an array
in (membership operator)
checks if a substring exists within a string (ex. 'Python' in my_string)
.find()
checks if a substring exists within the string (returns the starting index position)
modeling
choosing and applying the right analytics approach that works well with the data we have and solves the problem we intend to solve
Instantiate an object from the ParttimeEmployee class named chris for an employee called "Chris Clark", who works thirty hours per week and has been at the company for ten years. Call the intro() method for chris.
chris = ParttimeEmployee('Chris Clark', 10, 30) chris.intro()
Define a Car class that has two instance attributes - color and capacity.
class Car: def __init__(self, color, capacity): self.color = color self.capacity = capacity
how to define a child class
class ChildClassName(ParentClassName): <code>
defining a class example
class Dog: pass
Define an Employee class that has one class attribute called salaried and three instance attributes called first_name, last_name, and work_years. The class attribute should have a default value of True.
class Employee: salaried = True def __init__(self, first_name, last_name, work_years): self.first_name = first_name self.last_name = last_name self.work_years = work_years
Define a child class called ParttimeEmployee from the Employee parent class. The ParttimeEmployee class should have an additional instance attribute called weekly_hours.
class Employee: def __init__(self, name, work_years): self.name = name self.work_years = work_years def intro(self): print("Hi, my name is {}. I've worked here for {} years.".format(self.name, self.work_years)) class ParttimeEmployee(Employee): def __init__(self, name, work_years, weekly_hours): self.name = name self.work_years = work_years self.weekly_hours = weekly_hours
Redefine a child class called ParttimeEmployee from the Employee parent class. The ParttimeEmployee class should have an additional instance attribute called weekly_hours. The ParttimeEmployee class should have its own intro() method, which prints a message that reads "Hi, my name is {name}. I've worked here for {work_years} years and I work {weekly_hours} hours per week.".
class ParttimeEmployee(Employee): def __init__(self, name, work_years, weekly_hours): self.name = name self.work_years = work_years self.weekly_hours = weekly_hours def intro(self): print("Hi, my name is {}. I've worked here for {} years and I work {} hours per week.".format(self.name, self.work_years, self.weekly_hours))
Redefine the ParttimeEmployee class by adding another method called health_benefit. The method should respond according to the following rules: if an employee works thirty or more hours per week, the method prints a message that reads "I am eligible for employer-provided health benefits." if the employee works less than thirty hours per week, the method prints a message that reads "I am not eligible for employer-provided health benefits."
class ParttimeEmployee(Employee): def __init__(self, name, work_years, weekly_hours): self.name = name self.work_years = work_years self.weekly_hours = weekly_hours def intro(self): print("Hi, my name is {}. I've worked here for {} years and I work {} hours per week.".format(self.name, self.work_years, self.weekly_hours)) def health_benefit(self): if (int(self.weekly_hours) >= 30): print('I am eligible for employer-provided health benefits.') else: print('I am not eligible for employer-provided health benefits.')
key
commonly used name for related columns
a conditional statement is a combination of one or more ____ and ____.
conditions, responses
.plot()
creates a plot from a pandas data structure
subplots = True
creates multiple plots within a figure
figsize =
customizes the size of a plot or figure
statically-typed programming language
data type of a variable has to be explicitly defined in advance before being assigned a value
Define a function called calculator that accepts 3 arguments called operation, x and y. The allowed values for operation are 'add', 'subtract', 'multiply', and 'floor'. If a user enters a value for operation that is not one of these, return a message that reads "Invalid Operation!". Otherwise, depending on the value of the operation argument, the function should return one of the following: x plus y; x minus y; x times y; the floor division of x by y.
def calculator(operation, x, y): if operation == 'add': result = x + y elif operation == 'subtract': result = x - y elif operation == 'multiply': result = x * y elif operation == 'floor': result = x // y else: result = 'Invalid Operation!' return result
Define a function called round_mean that accepts an unspecified number of numeric values as arguments and returns the mean of the numbers (rounded to two decimal places). Add a docstring to your function that explains what the function does, how many arguments it accepts, and what it returns.
def round_mean(*args): ''' This function accepts a variable number of numeric values and returns the mean of these numbers, rounded to two decimal places. ''' result = round(sum(args)/len(args), 2) return result
Define a function called to_fahrenheit that accepts a temperature value in celsius as an argument, and returns the temperature in fahrenheit rounded to no decimal places.
def to_fahrenheit(number1): result = round(9/5*(number1)+32, 0) return result
Modify the function you defined in Problem 2 so that the height argument becomes an optional argument with a default value of 10.
def triangle_area(number1, number2 = 10): result = (1/2 * number1 * number2) return result
Define a function called triangle_area that accepts arguments for the base and height of a triangle and returns the area of the triangle
def triangle_area(number1, number2): result = (1/2 * number1 * number2) return result
Define a function called vowel_count that returns the number of English vowels in a variable passed to it.
def vowel_count(arg): count = 0 vowel = set("aeiouy") for alphabet in arg: if alphabet in vowel: count = count + 1 return (count)
sparsity and density
degree to which data exists in a dataset
np.delete()
deletes a specific element from an array
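np.delete(), like np.append() and np.insert(), returns a new array rather than changing the original in place. A minimal sketch with arbitrary sample values:

```python
import numpy as np

base = np.array([10, 20, 30])

appended = np.append(base, 40)     # new array with 40 added at the end
inserted = np.insert(base, 1, 15)  # new array with 15 placed at index 1
deleted = np.delete(base, 0)       # new array without the element at index 0

# base itself is unchanged by all three calls.
print(base, appended, inserted, deleted)
```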
{curly brackets}
dictionary
which of these should I run if I want to get a list of all methods supported by the tuple data structure?
dir(tuple)
Create a DataFrame called dive_women by selecting the Name, Class and Hometown columns from the swim_women DataFrame for those swimmers who are members of the dive team. Output the dive_women DataFrame sorted in ascending order of Name.
dive_women = swim_women.sort_values( by=["Name", "Class", "Hometown"], ascending = [True, False, False] )[["Name", "Class", "Hometown"]] dive_women
[2:]
slices from index 2 through the last element of the list (an omitted stop means the slice runs to the end)
Create a new DataFrame called energy_config_brand from the washers dataset that lists the minimum and maximum energy usage for each configuration and brand. The min column should be called min_energy_use and the max column should be called max_energy_use. Output the energy_config_brand DataFrame.
energy_config_brand = pd.DataFrame(washers.groupby(["Configuration", "BrandName"])["EnergyUse"].agg(["min", "max"])) energy_config_brand.rename(columns = {'min':'min_energy_use', 'max':'max_energy_use'}, inplace = True) energy_config_brand
what does the following piece of code do? balance == 45
evaluates whether balance is equal to 45
rounding to even (banker's rounding)
even number is returned (ex. 1147.5 is rounded to 1148)
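Python's built-in round() follows this rule for values that sit exactly halfway between two integers. A quick sketch:

```python
# Ties go to the nearest even integer.
tie_to_even_up = round(1147.5)    # 1148 (1148 is even)
tie_to_even_down = round(1148.5)  # 1148 (1149 would be odd)
small_tie = round(2.5)            # 2

# Values that are not exact ties round to the nearest integer as usual.
not_a_tie = round(2.6)            # 3
```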
dict()
ex. university_info = dict( name = 'University of Notre Dame', mascot = 'Leprechaun', city = 'Notre Dame', state = 'Indiana')
change any elements in a list by using index notation
ex. color_list[3] = 'orange'
list()
ex. list('Python is my friend') returns a list in which each character is a separate element
how to create a dictionary
ex: university_info = { 'name' : 'University of Notre Dame', 'mascot' : 'Leprechaun', 'city' : 'Notre Dame', 'state' : 'Indiana' }
if statement
executes a block of code if one or more logical conditions are met
listname[index]
extract an individual element in a list
Given two pandas DataFrames called shake and bake with the same number of columns, rows, and index values, the following code will combine the columns of the two DataFrames: pd.concat([shake, bake])
false
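The statement is false because pd.concat() stacks rows by default (axis=0); combining columns requires axis=1. A minimal sketch with hypothetical contents for shake and bake:

```python
import pandas as pd

# Hypothetical contents; the two frames share the same index labels.
shake = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
bake = pd.DataFrame({"c": [5, 6], "d": [7, 8]})

stacked_rows = pd.concat([shake, bake])          # default axis=0: rows are appended
side_by_side = pd.concat([shake, bake], axis=1)  # axis=1: columns are combined by index label

print(stacked_rows.shape, side_by_side.shape)
```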
Python supports three types of conditional statements, the if statement, the try-if-else statement, and the try-except-finally statement.
false
a programming language in which the data type of a variable has to be explicitly defined is known as a dynamically typed language
false
one of the benefits of using dictionaries is that the keys in a dictionary are mutable
false
the actions an object can take or the functions it can perform are known as its attributes
false
the characteristics of an object are known as its methods
false
the string data type is used to represent the dichotomous values of True and False
false
when using slice notation, the stop index value is inclusive
false
Given the list nums = [10, 20, 30, 40, 50, 60, 70, 80, 90], which type of loop is most appropriate if my goal is to calculate the square of every element in the list?
for loop
Use a loop to iteratively call the calculator function using each item in op_list as the value of the operation argument for the numbers 135 (as x) and 75 (as y). Print the returned value in each iteration of the loop. For example, the first output should be "add: 210".
for operation in op_list: print(operation + ':', calculator(operation, 135, 75))
.groupby()
for single columns, pass name of column we intend to group by
for loop to iterate through values of dictionaries
for value in fruit_price.values(): discount_value = round(value * 0.8, 2) print(discount_value)
descriptive statistics
frequency distributions, measures of central location, measures of spread
class gamma: def __init__(self, x): self.x = x def calc(self): print(self.x ** 2) class theta(gamma): def __init__(self, x, y): self.x = x self.y = y def calc(self): print(self.x ** self.y ** self.x) Based on the class definitions above, we know that __ is the parent class and __ is the child class
gamma; theta
.value_counts()
gives a count of each unique value in a series (or in a single column of a DataFrame)
.count()
gives a count of the number of occurrences of a particular value
dir()
gives a list of methods supported for a particular type of object in python
.value_counts(normalize = True)
gives the proportion (relative frequency) of each unique value rather than a count; multiply by 100 for a percentage
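A small sketch of .value_counts() with and without normalize = True. The brands series is made-up sample data; the values returned with normalize = True are fractions that sum to 1, so they are multiplied by 100 here to read as percentages.

```python
import pandas as pd

# Hypothetical brand names, used only for illustration.
brands = pd.Series(["A", "B", "A", "C", "A", "B"])

counts = brands.value_counts()                # number of rows per brand
shares = brands.value_counts(normalize=True)  # fraction of rows per brand
percentages = shares * 100

print(counts)
print(percentages.round(1))
```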
.describe(include='all')
gives descriptive statistics for all columns
.index
gives information about the index or row labels of a DataFrame
.info()
gives quick overview of structure of data including number of columns, number of rows, column names, data type of each column, number of non-missing values, and how much memory is used
.mean()
gives the average of the values within a series or the column of a dataframe
docstring
gives the description of what the function does (optional)
.ndim
gives the number of dimensions of a NumPy array
.itemsize
gives the number of bytes used to store each element of a NumPy array
.nbytes
gives the number of bytes used to store the entire array
.shape
gives the number of elements in each dimension of a NumPy array
.quantile()
gives the percentiles of the values within a series or the column of a dataframe
.sum()
gives the sum of the values within a series or the column of a dataframe
.size
gives the total number of elements in a NumPy array
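The NumPy array attributes listed above can be compared on one small array. This sketch fixes the dtype to int64 so that the byte counts are predictable.

```python
import numpy as np

grid = np.array([[2, 3, 4], [4, 5, 6]], dtype=np.int64)

print(grid.ndim)      # number of dimensions: 2
print(grid.shape)     # elements per dimension: (2, 3)
print(grid.size)      # total number of elements: 6
print(grid.itemsize)  # bytes per element: 8 for int64
print(grid.nbytes)    # bytes for the whole array: 48
```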
.values
gives the values in the cells of the DataFrame
Use the list approach to create a pandas DataFrame called grades from the data presented in the following table. Set the Name column as the index (in place) and output the grades DataFrame.
grades = pd.DataFrame([["John", "Physics", 74, 82, 67, "B"], ["Carol", "Math", 76.5, 86, 82.5, "A"], ["Jim", "Economics", 71, 77.5, 62.5, "C"], ["Laura", "Engineering", 84.5, 92, 87.5, "A"], ["Tom", "Biology", 79, 80.5, 77, "B"], ["Chris", "Theology", 70.5, 73.5, 71.5, "C"]], columns = ["Name", "Major", "Exam1", "Exam2", "Midterm", "Final"]) grades.set_index("Name", inplace = True) grades
Use the dictionary approach to create a pandas DataFrame called grades from the data presented in the following table. Output the grades DataFrame.
grades_dict = {"Name": ["John", "Carol", "Jim", "Laura", "Tom", "Chris"], "Major": ["Physics", "Math", "Economics", "Engineering", "Biology", "Theology"], "Exam1": [74, 76.5, 71, 84.5, 79, 70.5], "Exam2":[82, 86, 77.5, 92, 80.5, 73.5], "Midterm": [67, 82.5, 62.5, 87.5, 77, 71.5], "Final": ["B", "A", "C", "A", "B", "C"]} grades = pd.DataFrame(grades_dict) grades
box plot
great for visualizing distribution of values for a variable (min, 1st quartile, median, 3rd quartile, max)
Create a new DataFrame called hometown by importing the 'Hometown' sheet in the Excel file located at https://coding-fundamentals.s3.amazonaws.com/students.xlsx. Combine the students (you created in the previous problem) and hometown DataFrames by using an inner join on the ID column. Call the new DataFrame students_hometown_inner and display it.
hometown = pd.read_excel("https://coding-fundamentals.s3.amazonaws.com/students.xlsx", sheet_name = "Hometown") students_hometown_inner = pd.merge(students, hometown, on = "ID", how = "inner") students_hometown_inner
measures of spread
how similar or varied the values of a feature are
data collection
identify and gather the data we need for the analytics process
actionable insight
identifying potential course of action or a series of actions based on the results of the model
if-elif-else (example)
if score >= 90: print('The grade is A.') elif score >= 80: print('The grade is B.') elif score >= 70: print('The grade is C.') elif score >= 60: print('The grade is D.') else: print('The grade is F.')
if-else (example)
if score >= 90: print('The grade is A.') else: print('The grade is not A.')
method overriding (or polymorphism)
if we define a method in a child class with the same name as a method defined in the parent class, the child method overrides the parent method
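Inheritance and method overriding can be sketched with the Employee classes used in this set. Two choices here are assumptions and not part of the original answers: the child calls super().__init__() instead of repeating the parent's assignments, and intro() returns the message instead of printing it so the result is easy to inspect.

```python
class Employee:
    def __init__(self, name, work_years):
        self.name = name
        self.work_years = work_years

    def intro(self):
        return f"Hi, my name is {self.name}. I've worked here for {self.work_years} years."


class ParttimeEmployee(Employee):
    def __init__(self, name, work_years, weekly_hours):
        super().__init__(name, work_years)  # reuse the parent's initializer
        self.weekly_hours = weekly_hours

    def intro(self):
        # Same method name as in Employee, so this version overrides it.
        return (f"Hi, my name is {self.name}. I've worked here for "
                f"{self.work_years} years and I work {self.weekly_hours} hours per week.")


chris = ParttimeEmployee("Chris Clark", 10, 30)
print(chris.intro())
```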
ignore_index = True
ignores original index values and assigns new ones
relationship visualization
illustrate correlation between two or more variables
comparison visualizations
illustrate difference between two or more items
Define a function called math_facts that only makes use of functions in the math module to return the square, square root, natural log (base e), and factorial of any whole number passed to it. Add a docstring to your function that explains what the function does, how many and what type of arguments it accepts, and what it returns.
import math as m def math_facts(arg): ''' This function accepts any whole numbers and returns the square, square root, natural log (base e), and factorial of that whole number. ''' return (m.pow(arg, 2), m.sqrt(arg), m.log(arg), m.factorial(arg))
Use a function from the math module to get the greatest common divisor between the numbers 30 and 76.
import math as m m.gcd(30, 76)
Use a function from the math module to get the result of 3^7
import math as m m.pow(3, 7)
add or override the labels for x and y axis
import matplotlib.pyplot as plt plt.xlabel() plt.ylabel()
from (module_name) import (function_name)
imports only the specified function ex. from math import factorial
f-string
include an 'f' at the beginning of the string (ex. f'{name} is Number {rank}.')
full outer join
includes all rows from both left and right datasets regardless of whether the key values match
left join
includes all the rows from the left dataset and only the rows from the right dataset with matching key values
right join
includes all the rows from the right dataset and only the rows from the left dataset with matching key values
inner join
includes only the rows from both datasets where the key values match
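The four join types can be compared side by side with pd.merge(). The alpha and beta contents below are hypothetical; only the key column name theta is borrowed from the merge question earlier in this set.

```python
import pandas as pd

# Hypothetical data: key values 2 and 3 appear in both DataFrames.
alpha = pd.DataFrame({"theta": [1, 2, 3], "left_val": ["a", "b", "c"]})
beta = pd.DataFrame({"theta": [2, 3, 4], "right_val": ["x", "y", "z"]})

inner = pd.merge(alpha, beta, on="theta", how="inner")  # keys 2, 3
left = pd.merge(alpha, beta, on="theta", how="left")    # keys 1, 2, 3
right = pd.merge(alpha, beta, on="theta", how="right")  # keys 2, 3, 4
outer = pd.merge(alpha, beta, on="theta", how="outer")  # keys 1, 2, 3, 4

print(outer)
```

Rows without a match on the other side are kept in the left, right, and outer results with NaN in the columns that come from the missing side.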
By default, the pandas concat() function combines the columns of two DataFrames by matching the ____ of both Data Frames
index labels
start
index value we start at (inclusive)
stop
index value we stop at (exclusive)
ground truth data
information that is known to be real or true
try-except-else (example)
input_number = input('Enter a number: ') try: reciprocal_number = 1/ float(input_number) except ZeroDivisionError: print('Zero does not have a reciprocal.') except: print('Invalid input.') else: print('The reciprocal of {} is {}.'.format(input_number, round(reciprocal_number, 2)))
np.insert()
inserts an element to a particular position in an array
extracting elements from dictionary
instead of indexing, use key
continue
instead of terminating the loop early, we skip the current iteration and move on to the next
given the following code snippet, what will the data type of x be: x = 55 // 6
int
int64 data type in pandas
int
int
integer variable
.describe(exclude= )
limits the types of columns included in the output by excluding the specified data types
[square brackets]
list
inplace = True
makes it so the set_index change persists
data preparation
making sure data is suitable for the analytics approach that we intend to use; resolving data quality issues and modifying/transforming structure of data to make it easier to work with
negative value for step
means we step from right to left
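Slice notation follows the pattern sequence[start:stop:step]. A short sketch on a sample list shows the inclusive start, the exclusive stop, and a negative step:

```python
letters = ["a", "b", "c", "d", "e", "f"]

first_three = letters[0:3]        # ['a', 'b', 'c']: index 3 is excluded
from_two = letters[2:]            # ['c', 'd', 'e', 'f']: omitted stop runs to the end
every_other = letters[::2]        # ['a', 'c', 'e']
reversed_letters = letters[::-1]  # the whole list, stepping right to left
backwards_part = letters[4:1:-1]  # ['e', 'd', 'c']: index 1 is excluded
```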
.update()
merge the contents of one dictionary with that of another (replaces value if key exists in both dicts)
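A minimal sketch of .update(). Both dictionaries hold made-up prices; "banana" appears in both, so its value is replaced.

```python
# Hypothetical sample data.
prices = {"apple": 1.25, "banana": 0.5}
new_prices = {"banana": 0.6, "cherry": 3.0}

# Existing keys get the new value; missing keys are added.
prices.update(new_prices)
print(prices)
```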
the process of calling several methods on an object without having to create temporary variables is known as
method chaining
For example, if we wanted to compare the SAT average scores by type of college amongst colleges in Michigan and Indiana, we group the michiana_colleges DataFrame by both state and institutional_owner then get the mean of the sat_average column:
michiana_colleges.groupby(["state", "institutional_owner"])["sat_average"].mean()
import math as m
m.sqrt(567)
modules are sometimes imported with aliases so we don't have to type their long names
rules for naming variable
must not begin with number, must not contain punctuation, must not contain a space, must not be one of python's reserved words
while loop (example)
my_list = []
while len(my_list) < 5:
    x = input('Enter anything: ')
    my_list.append(x)
    print(x)
break (example)
my_list = []
while len(my_list) < 5:
    x = input('Enter anything: ')
    if x == "!":
        break
    my_list.append(x)
    print(x)
continue (example)
my_list = []
while len(my_list) < 5:
    x = input('Enter anything: ')
    my_list.append(x)
    if x == "*":
        continue
    print(x)
for loop (example)
my_list = [45, 57.5, 231.4, -56, 99.3, 132, 89.5]
sum_value = 0
for item in my_list:
    sum_value = sum_value + item
print(sum_value)
np.object
non-numeric columns (np.object was removed in NumPy 1.24; use the built-in object instead)
two dimensional NumPy array
np.array([[2, 3, 4],[4, 5, 6]])
Create a bar chart that shows the number of vehicles in the vehicles dataset by model year. Make the plot 10 inches wide by 6 inches high.
number = vehicles.groupby(["year"])["make"].count()
number.plot(kind = 'bar', figsize = (10, 6))
Convert the bar chart from the previous problem into a stacked bar chart that shows the number of vehicles in the vehicles dataset by model year, broken out by drive type (i.e. '2-Wheel Drive', 'Rear-Wheel Drive', etc). Make the plot 10 inches wide by 6 inches high.
number_drive = vehicles.groupby(["year"])["drive"].value_counts()
number_drive = number_drive.unstack()
number_drive.plot(kind = 'bar', stacked = True, figsize = (10, 6))
exception
occur when a line of grammatically correct code fails during execution (ex. ZeroDivisionError)
syntax error
occurs when a line of code does not abide by the rules of the language
|
or
child class's init()
overrides that of its parent class; attributes set in the parent's __init__() are not created on the child object unless the child calls super().__init__()
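A sketch of what overriding __init__() does to attributes. The class bodies here are assumptions written for illustration (only the class names Employee and ParttimeEmployee appear in these cards, and HoursOnly is hypothetical).

```python
class Employee:
    def __init__(self, name, work_years):
        self.name = name
        self.work_years = work_years


class ParttimeEmployee(Employee):
    # Overrides Employee.__init__ but calls super().__init__() so that
    # name and work_years are still set on the new object.
    def __init__(self, name, work_years, weekly_hours):
        super().__init__(name, work_years)
        self.weekly_hours = weekly_hours


class HoursOnly(Employee):
    # Hypothetical child that overrides __init__ without calling the parent's.
    def __init__(self, weekly_hours):
        self.weekly_hours = weekly_hours


shelly = ParttimeEmployee("Shelly Smith", 2, 30)
temp = HoursOnly(10)

print(shelly.name, shelly.work_years, shelly.weekly_hours)
print(hasattr(temp, "name"))  # the parent's attributes were never created
```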
b, h = 12, 5
____ = ((b**2) + (h**2))**.5
print(p)
p
adding a legend
plt.legend(); the loc argument sets the location (ex. plt.legend(loc = 'upper right'))
when we define a method in a child class with the same name as a method in its parent class, the child class method overrides that of the parent. this is known as ___
polymorphism
Call the round_mean function and pass the numbers 243, 435, 563, 412, 369 and 679 to it. Print the returned values.
print(round_mean(243, 435, 563, 412, 369, 679))
Call the to_fahrenheit function, pass 17 degrees celsius to it, and print the returned value.
print(to_fahrenheit(17))
Call the modified triangle_area function, pass 12 as the base, and print the returned value.
print(triangle_area(number1 = 12))
Call the triangle_area function, pass 6 and 15 as the height and base, respectively, and print the returned value.
print(triangle_area(number1 = 15, number2 = 6))
data exploration
process of describing, visualizing, and analyzing data in order to better understand it
data analytics
process of extracting value or insight from data through series of iterative and methodical processes
how to create a string
quotes: single ('), double ("), or triple (''' or """)
.read_excel()
read an excel file into python
read_json()
reads JSON files into python
read_html()
reads an html table into python
negative index notation
refers to elements based on how far away they are from the end of the list (starts at -1 not 0)
.pop()
removes and returns the last element from a list (or the element at a given index, ex. .pop(0))
.replace()
replace a substring within a string (ex. my_new_string = my_string.replace('$', 's'))
measures of central location
represents typical value for feature
.reset_index()
resets the index
.drop_duplicates()
resolves duplicate rows in a dataframe
listname[start: stop: step]
retrieves multiple elements from a list
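The start, stop, step, and negative-index cards can be tried together on one list. A small sketch; the list contents are made up.

```python
nums = [10, 20, 30, 40, 50, 60]

first_two = nums[:2]        # start omitted, so the slice begins at index 0
middle = nums[1:4]          # start is inclusive, stop is exclusive
every_other = nums[0::2]    # step of 2 takes every 2nd element
backwards = nums[::-1]      # negative step moves from right to left
last_item = nums[-1]        # negative index counts from the end
near_end = nums[-3:-1]      # third-from-last up to, not including, the last

print(first_two, middle, every_other, backwards, last_item, near_end)
```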
Call the letter_count function, pass the word variable to it and print the returned value.
return_value = letter_count(word)
print(return_value)
Call the math_facts function, pass num to it, and assign the returned values to variables called var1, var2, var3, and var4. Print a message that reads "The square, square root, natural log, and factorial of {num} is {var1}, {var2}, {var3}, and {var4}.".
return_value = math_facts(num)
var1, var2, var3, var4 = math_facts(num)
print('The square, square root, natural log, and factorial of {} is {}, {}, {}, and {}.'.format(num, var1, var2, var3, var4))
Call the vowel_count function, pass the word variable to it and print the returned value.
return_value = vowel_count(word)
print(return_value)
.describe()
returns a statistical summary for each of the columns in a dataframe: count, mean, std, min, 25th percentile, 50th percentile, 75th percentile, and max for numeric columns; count, unique, top, and freq for non-numeric columns
.columns
returns the column labels
floor division
returns the integer portion of the division operation (whole number)
modulus
returns the remainder of a division operation
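Floor division and modulus are easiest to see on the same pair of numbers. The sketch below reuses 55 and 6 from the earlier card; the negative case is an extra illustration, since // rounds down toward negative infinity rather than simply dropping the decimals.

```python
quotient = 55 // 6    # floor division
remainder = 55 % 6    # modulus
true_division = 55 / 6

print(quotient, type(quotient))    # an int when both operands are ints
print(remainder)
print(type(true_division))         # / always produces a float

# The two results rebuild the original number
print(6 * quotient + remainder)

# With a negative operand, // rounds down, not toward zero
negative_floor = -7 // 2
print(negative_floor)
```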
how to use .sort() in descending order
set the reverse argument to True (ex. nums.sort(reverse=True))
.reverse()
reverse the order of a list
round()
rounds (even numbers are returned when fractional is exactly halfway between 2 numbers)
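A quick check of the halfway rule described on the round() card. Values that sit exactly between two integers go to the even neighbor; everything else rounds to the nearest integer as usual.

```python
# Exactly halfway: the even neighbor is returned
halfway = [round(0.5), round(1.5), round(2.5), round(3.5), round(4.5)]
print(halfway)

# Not halfway: ordinary rounding to the nearest integer
print(round(2.4), round(2.6))
```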
example of filtering dataframe
rows = brics["literacy"] >= .95
cols = ["country", "gdp", "population"]
brics[rows][cols]
Create four new DataFrames called seniors, juniors, sophomores and freshmen by importing the 'Seniors', 'Juniors', 'Sophomores' and 'Freshmen' sheets in the Excel file located at https://coding-fundamentals.s3.amazonaws.com/students.xlsx. Combine all four DataFrames vertically into a new DataFrame called students and display it. Hint: Make sure that there are no duplicate index values in the students DataFrame and that the index values go from 0 to 19 (see the previous tutorial if you need a refresher).
seniors = pd.read_excel("https://coding-fundamentals.s3.amazonaws.com/students.xlsx", sheet_name = "Seniors")
juniors = pd.read_excel("https://coding-fundamentals.s3.amazonaws.com/students.xlsx", sheet_name = "Juniors")
sophomores = pd.read_excel("https://coding-fundamentals.s3.amazonaws.com/students.xlsx", sheet_name = "Sophomores")
freshmen = pd.read_excel("https://coding-fundamentals.s3.amazonaws.com/students.xlsx", sheet_name = "Freshmen")
students = pd.concat([seniors, juniors, sophomores, freshmen]).reset_index()
students = students.drop("index", axis = "columns")
students
Create four new DataFrames called seniors, juniors, sophomores and freshmen by importing the 'Seniors', 'Juniors', 'Sophomores' and 'Freshmen' sheets in the Excel file located at https://coding-fundamentals.s3.amazonaws.com/students.xlsx. Make the ID column the index label for each DataFrame and display all four of them.
seniors = pd.read_excel("https://coding-fundamentals.s3.amazonaws.com/students.xlsx", sheet_name = "Seniors", index_col = "ID")
juniors = pd.read_excel("https://coding-fundamentals.s3.amazonaws.com/students.xlsx", sheet_name = "Juniors", index_col = "ID")
sophomores = pd.read_excel("https://coding-fundamentals.s3.amazonaws.com/students.xlsx", sheet_name = "Sophomores", index_col = "ID")
freshmen = pd.read_excel("https://coding-fundamentals.s3.amazonaws.com/students.xlsx", sheet_name = "Freshmen", index_col = "ID")
display(seniors, juniors, sophomores, freshmen)
alpha =
sets opacity of a line within a line plot (value between 0 and 1)
style =
sets style of a line within a line plot ( - solid, -- dashed, -. dash-dot, : dotted)
color =
sets the color of a line within a line plot
Instantiate an object from the ParttimeEmployee class named shelly, for an employee called "Shelly Smith", who works thirty hours per week and has been at the company for two years. Call the intro() and health_benefit() methods for shelly.
shelly = ParttimeEmployee('Shelly Smith', 2, 30)
shelly.intro()
shelly.health_benefit()
histograms
show the frequency distribution of values within a dataset
np.array()
simplest way to create a NumPy array (ex. np.array([0, 1, 2, 3, 4, 5]))
.sort()
sort a list (with elements of a single data type)
.sort_values(by = "")
sort the data by one or more columns
ascending = False
sorts a dataframe in descending order
step
specifies the increment between consecutive index values (how many positions to move each time)
else clause
specifies the block of code that should be executed if the try clause does not raise an exception
finally clause
specifies the block of code that would be executed regardless of whether an exception was raised or not
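The try, except, else, and finally clauses can be seen working together in one function. This is a sketch only: reciprocal_message is a hypothetical helper written for this illustration, and it returns its messages instead of reading from input() so that it can be run without typing anything.

```python
def reciprocal_message(text):
    """Hypothetical helper: describe the reciprocal of the number in text."""
    log = []
    try:
        value = 1 / float(text)
    except ZeroDivisionError:
        # float(text) worked, but the division failed
        message = "Zero does not have a reciprocal."
    except ValueError:
        # float(text) itself failed
        message = "Invalid input."
    else:
        # Runs only when the try clause raised no exception
        message = "The reciprocal of {} is {}.".format(text, round(value, 2))
    finally:
        # Runs whether or not an exception was raised
        log.append("Thank you!")
    return message, log


print(reciprocal_message("4"))
print(reciprocal_message("0"))
print(reciprocal_message("four"))
```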
variable-length arguments
specify a single variable name preceded by an asterisk (*) ex. def total_sum(*args)
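A minimal sketch of the total_sum(*args) signature from the card above. The function body is an assumption, since the card only shows the def line.

```python
def total_sum(*args):
    # Inside the function, args is a tuple holding every positional argument
    total = 0
    for number in args:
        total += number
    return total


print(total_sum(1, 2, 3))
print(total_sum(243, 435, 563, 412, 369, 679))
print(total_sum())  # no arguments gives an empty tuple
```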
.split()
splits a string into individual words
measures of ___ describe how similar or varied the set of observed values are for a particular feature
spread
def
stands for definition and indicates that a function definition follows
[:2]
start is omitted, so the slice begins at the first element of the list
object data type in pandas
str
Remove the duplicate columns in the students_demo_major DataFrame that you created in the previous problem. Sort the DataFrame by the FirstName and LastName columns and display it.
students_demo_major = students_demo_major.loc[:, ~students_demo_major.columns.duplicated()]
students_demo_major.sort_values(by = ["FirstName", "LastName"])
Create a new DataFrame called student_demographics by importing the 'Demographics' sheet in the Excel file located at https://coding-fundamentals.s3.amazonaws.com/students.xlsx. Make the ID column the index label. Combine the students (from problem 2) and student_demographics DataFrames horizontally. Call the new DataFrame students_demo and display it.
student_demographics = pd.read_excel("https://coding-fundamentals.s3.amazonaws.com/students.xlsx", sheet_name = "Demographics", index_col = "ID")
students_demo = pd.concat([students, student_demographics], axis = "columns")
students_demo
Create a new DataFrame called student_hometown_left by combining the students and hometown DataFrames using a left join on the ID column. Display the student_hometown_left DataFrame.
student_hometown_left = pd.merge(students, hometown, on = "ID", how = "left")
student_hometown_left
Create a new DataFrame called student_hometown_right by combining the students and hometown DataFrames using a right join on the ID column. Display the student_hometown_right DataFrame.
student_hometown_right = pd.merge(students, hometown, on = "ID", how = "right")
student_hometown_right
Create a new DataFrame called student_major by importing the 'Major' sheet in the Excel file located at https://coding-fundamentals.s3.amazonaws.com/students.xlsx. Make the ID column the index label. Combine the students_demo (from Problem 3) and student_major DataFrames horizontally. Call the new DataFrame students_demo_major and display it.
student_major = pd.read_excel("https://coding-fundamentals.s3.amazonaws.com/students.xlsx", sheet_name = "Major", index_col = "ID")
students_demo_major = pd.concat([students_demo, student_major], axis = "columns")
students_demo_major
input()
student_name = input('Enter your first name: ')
Create a new DataFrame called students by combining the seniors, juniors, sophomores and freshmen DataFrames vertically. Sort the students DataFrame by its index and display it.
students = pd.concat([seniors, juniors, sophomores, freshmen])
students.sort_index()
Set the ID column as the index for the students and hometown DataFrames. Create a new DataFrame called student_hometown_outer by combining the students and hometown DataFrames using an outer join on the index. Display the student_hometown_outer DataFrame.
students = students.set_index("ID")
hometown = hometown.set_index("ID")
student_hometown_outer = pd.merge(students, hometown, on = "ID", how = "outer")
student_hometown_outer
Create a new DataFrame called swim_women by importing the Women's roster from the Notre Dame Swimming and Diving home page located at https://und.com/sports/swim/roster/. Preview the first 10 rows of the swim_women DataFrame. Hint: Go to the webpage to identify the HTML tables on the page and what kind of data is stored in them first.
swim_women = pd.read_html("https://und.com/sports/swim/roster/")[1]
swim_women.head(10)
object_name = ClassName()
syntax used to instantiate a new object
.dtypes
tells us the data type of each column in the DataFrame
break
terminates the loop even if the conditional statement is still TRUE
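A small sketch of break that runs without keyboard input. The list of entries is made up and stands in for what a user might type; "!" is the stop signal, as in the break example card.

```python
# Made-up entries standing in for user input
entries = ["a", "b", "!", "c"]
collected = []

for item in entries:
    if item == "!":
        break  # leave the loop immediately; "c" is never reached
    collected.append(item)

print(collected)
```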
methods
the actions that an object can take
dynamically-typed language
the data type of a variable is based on the data type of the object that it holds or represents
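A short illustration of dynamic typing: the same variable name reports a different type after it is reassigned. The variable name and values are made up.

```python
value = 45
first_type = type(value)    # the name currently refers to an int

value = "forty-five"
second_type = type(value)   # the same name now refers to a str

print(first_type, second_type)
```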
zero-indexed language
the first character is [0]
child or derived class
the new class in inheritance
not equal to operator
!=
complete the following piece of code to copy the elements 44 and 55 from epsilon into a new tuple called delta:
epsilon = (11, 22, 33, 44, 55, 66)
delta = epsilon[____:-1:]
print(delta)
-3
finally (example)
...
finally:
    print('Thank you!')
def beta(a = 2):
    b = a + 4
    b = b ** 2
    return b
If I call the beta() function without passing an argument to it and assign the result to a variable called c, what will the value of c be?
36
alpha = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
given the code snippet above, in terms of dimensions, alpha is a __ by __ array
4, 3
the sequence of numbers returned by the range(6, 21, 3) function will contain ____ elements
5
which of these variable names is not allowed in python?
7_catinthehat
def beta():
    a = 2 ** 2
    b = 3 ** 2
    return b
    b += 4
If I assign the result of calling beta() to a variable, then print the variable, what would my output be?
9
equality operator
==
JSON
JavaScript Object Notation; used to store semi-structured data in human-readable form online
missing values are represented as NaN or np.nan in pandas data structures. NaN stands for ___
Not a Number
instantiation
creating an object from a class. The class is simply a blueprint; once it is defined, we can create objects based on it.
In code we sometimes have to ask questions in order to decide what to do. this sort of question is also known as
a condition
data structure
a container that holds a sequence of objects
tuple
a core data structure that is very similar to a list except they are immutable
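A sketch contrasting a mutable list with an immutable tuple, reusing the epsilon tuple from the earlier slicing card. Item assignment works on the list and raises a TypeError on the tuple, while slicing a tuple is still allowed because it builds a new tuple.

```python
epsilon = (11, 22, 33, 44, 55, 66)

# Lists are mutable: item assignment works
numbers = list(epsilon)
numbers[0] = 99

# Tuples are immutable: item assignment raises TypeError
try:
    epsilon[0] = 99
except TypeError:
    outcome = "tuple unchanged"
else:
    outcome = "tuple changed"

# Slicing returns a new tuple and leaves epsilon as it was
delta = epsilon[-3:-1]

print(numbers[0], outcome, delta)
```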
.insert()
add an element at a specific index
.lower()
all letters are lower case
.upper()
all letters are upper case
method chaining
allows us to call multiple methods on an object all at once without having to create temporary variables
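A small sketch of method chaining on a string, next to the same work done with temporary variables. The string and the choice of methods are made up for illustration.

```python
raw = "  coding fundamentals  "

# Without chaining: one temporary variable per step
stripped = raw.strip()
titled = stripped.title()
unchained = titled.replace(" ", "_")

# With chaining: each method is called on the result of the previous one
chained = raw.strip().title().replace(" ", "_")

print(chained)
```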
tuple()
also used to create a tuple
built-in functions
always available for use in order to perform different types of tasks
what does the following piece of code do? balance = 45
assigns the value 45 to a variable called balance
data type of input
automatically a string
beta = np.array([['red', 'orange'],['yellow', 'green'],['indigo', 'violet']])
which of these should I run to return 'green'?
beta[1, 1]
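The indexing answer above can be checked directly, along with the rows-then-columns order of a two-dimensional array's shape. A sketch using the same beta array:

```python
import numpy as np

beta = np.array([['red', 'orange'], ['yellow', 'green'], ['indigo', 'violet']])

# shape is (rows, columns): the first number is rows, the second is columns
print(beta.shape)

# [row, column] indexing is zero-based, so [1, 1] is second row, second column
picked = beta[1, 1]
print(picked)

# Negative indices count from the end on each axis
last_cell = beta[-1, -1]
print(last_cell)
```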
Instantiate two objects based on the Car class you defined in Problem 1. The first car should be "white" and have a seating capacity of 6. The second car should be "blue" and have a seating capacity of 6. Call the first car, car_one, and the second car, car_two. Print a message that reads "The first car is {color} with a seating capacity of {capacity}, while the second car is {color} with a seating capacity of {capacity}."
car_one = Car('white', 6)
car_two = Car('blue', 6)
print("The first car is {} with a seating capacity of {}, while the second car is {} with a seating capacity of {}.".format(car_one.color, car_one.capacity, car_two.color, car_two.capacity))
class
categorical
feature
categorical - discrete form; continuous - integer
int()
changes a decimal number to a whole number
layout = (row, column)
changes how subplots are displayed
Modify the Employee class you defined in Problem 3 to include a method called intro(). When called, the intro()method should print a message that reads "Hi, my name is {first_name} {last_name}. I've worked here for {work_years} years."
class Employee:
    salaried = True
    def __init__(self, first_name, last_name, work_years):
        self.first_name = first_name
        self.last_name = last_name
        self.work_years = work_years
    def intro(self):
        print("Hi, my name is {} {}. I've worked here for {} years.".format(self.first_name, self.last_name, self.work_years))
axis = 1, axis = 'columns'
combines data horizontally
.concat()
combines multiple series or dataframe objects vertically
CSV file
comma-separated values file; one of the most common ways to save data in tabular format
let's assume that the variable numbers is a list of whole numbers between 1 and 50. complete the code snippet below so the loop prints all of the even numbers in the numbers variable:
for n in numbers:
    if n % 2 != 0:
        _________
    print(n)
continue
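The completed loop from the card above, as a runnable sketch. A shorter list of 1 through 10 stands in for the list of numbers between 1 and 50, and the even numbers are also collected in a list (an addition for illustration) so the result is easy to inspect.

```python
numbers = list(range(1, 11))  # stand-in for the longer list on the card
evens = []

for n in numbers:
    if n % 2 != 0:
        continue  # odd number: skip the rest of this iteration
    print(n)
    evens.append(n)
```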
response
continuous
float()
convert a whole number to a decimal number
str()
converts the variable to a string
floating point
decimal number
Define a function called letter_count that returns the number of letters in a variable passed to it.
def letter_count(arg): return len(arg)
a dataset that is 80% dense is also 80% sparse
false
.head()
first five rows of a dataframe
float64 data type in pandas
float
float
floating point variable
for loop to iterate through keys of dictionaries
for key in fruit_price.keys(): print(key.capitalize())
for loop to iterate through all items of dictionaries
for key, value in fruit_price.items():
    key = key.capitalize()
    value = round(value * 0.8, 2)
    sale_fruit_price[key] = value
return
functions immediately exit when they encounter a return statement
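A sketch of the early-exit behavior of return, based on the beta() function from the earlier card. The line after return is never executed, so the function hands back 9.

```python
def beta():
    a = 2 ** 2
    b = 3 ** 2
    return b   # the function exits here
    b += 4     # unreachable: this line never runs


result = beta()
print(result)
```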
range(start, stop, step)
generates a sequence of numbers
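A few range() calls wrapped in list() so the generated numbers are visible. The first reuses range(6, 21, 3) from the earlier card; the others are extra illustrations.

```python
# start is inclusive, stop is exclusive, step is the increment
by_threes = list(range(6, 21, 3))
print(by_threes, len(by_threes))

# With one argument, start defaults to 0 and step to 1
first_five = list(range(5))
print(first_five)

# A negative step counts down
countdown = list(range(10, 0, -2))
print(countdown)
```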
.dtype
gives the data type
Write code to determine if chris is an instance of the Employee class.
isinstance(chris, Employee)
Reinstantiate the same two Employee objects as you did in Problem 4 and call the intro() method for both of them.
jack = Employee('Jack', 'Turner', 6)
kate = Employee('Kate', 'Brown', 8)
jack.intro()
kate.intro()
Instantiate two objects from the Employee class you defined in Problem 3. The first employee is Jack Turner. He has worked at the company for 6 years and is a salaried employee. The second employee is Kate Brown. She is an hourly employee and has worked at the company for 8 years. Use the first name of each employee (in lower case) as the name for the object you instantiate. Print the first_name, last_name, work_years, and salaried attributes for both Employee objects.
jack = Employee('Jack', 'Turner', 6)
kate = Employee('Kate', 'Brown', 8)
print('{} {} {} {}'.format(jack.first_name, jack.last_name, jack.work_years, jack.salaried))
kate.salaried = False
print('{} {} {} {}'.format(kate.first_name, kate.last_name, kate.work_years, kate.salaried))
.tail()
last five rows of a dataframe
Instantiate an object from the new ParttimeEmployee class named laura, for an employee called "Laura Walker", who works ten hours per week and has been at the company for eight years. Call the intro() method for laura.
laura = ParttimeEmployee('Laura Walker', 8, 10)
laura.intro()
~
not
second number refers to
number of columns
dimensionality
number of features in dataset
frequency distributions
number of occurrences of each value within a feature
first number refers to
number of rows
np.number
numeric columns
extract every 2nd element in the list
nums[0::2]
operators are special symbols that tell python to take a discrete action. These actions are known as
operations
Given the two DataFrames above called alpha and beta, respectively, which line of code would produce the following DataFrame?
pd.merge(alpha, beta, on = ["make", "model"], how = "left")
Rachel built a model that helps her predict whether a particular patient is at risk for preterm birth. she used existing and historical patient health data to build the model. what type of data analytics did she use
predictive analytics
.remove()
remove a specific value from a list
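A sketch comparing .pop() and .remove() on one made-up list: .pop() works by position and returns the element it removed, while .remove() works by value and deletes the first match.

```python
letters = ["a", "b", "c", "b"]

last = letters.pop()      # removes and returns the last element
first = letters.pop(0)    # removes and returns the element at index 0
letters.remove("c")       # removes the first element equal to "c"

print(last, first, letters)
```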
reset_index(drop = True)
resets index values
how to create a list
separate a list of values by comma surrounded by square brackets [ ]
.set_index()
sets one of the existing columns of a DataFrame as the row index
index_col
sets one of the columns in the data as the index
composition visualizations
show component make up of data
distribution visualizations
show frequency distribution of values of feature
arguments
the objects or values we pass to the function as input (optional)
attributes
the properties of an object
and
the result is FALSE if at least one expression is false
not
the result is TRUE if the expression is FALSE and vice versa
or
the result is TRUE if at least one expression is true
kind = "hist"
to create a histogram
kind = 'barh'
to create a horizontal bar plot
kind = 'line'
to create a line plot
kind = 'scatter'
to create a scatter plot
kind = 'bar' stacked = True
to create a stacked bar graph
kind = 'bar'
to create a vertical bar plot
In a conditional statement, a response is the block of code that executes in response to a question
true
len()
used to get the length of the string
lists are mutable. this means that
we can modify the contents of a list
how to create a tuple
wrap a comma-separated list of elements in parentheses (), or assign comma-separated elements to a variable without parentheses
if (example)
x = 'Statements'
if x.startswith('S'):
    print('Starts with S')
if x.find('t') != -1:
    print('Contains t')
assigning multiple variables to same value
x = y = z = 10
assigning multiple values at the same time
x, y, z = 10, 10.5, 1148.57