Python for Data Science Essential Training Part 1
What are the steps for building a plot with the object-oriented method?
(1) create a blank figure object (2) add axes to the figure object (3) generate plot(s) within the figure object (4) specify plotting and layout parameters for plots within the figure
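The four steps above can be sketched as follows. This is a minimal illustration, not the course's own code: the data values, the axis labels, and the "Agg" backend (chosen only so the sketch runs without a display) are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so no window is needed
import matplotlib.pyplot as plt

x = range(1, 10)
y = [1, 2, 3, 4, 0, 4, 3, 2, 1]

fig = plt.figure()                        # (1) create a blank figure object
ax = fig.add_axes([0.1, 0.1, 0.8, 0.8])   # (2) add axes: [left, bottom, width, height]
line, = ax.plot(x, y, ls="--", lw=2)      # (3) generate the plot within the axes
ax.set_xlabel("x values")                 # (4) set plotting and layout parameters
ax.set_ylabel("y values")
ax.set_xlim([0, 10])
ax.grid()
```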
What arguments should be passed to change the (1) line style and (2) line width of a line in a line graph?
(1) ls = ' ' (2) lw = ' '
What arguments should be passed to change the (1) marker style (2) marker size and (3) marker edge width of a marker in a graph?
(1) marker = ' ' (2) ms = ' ' (short for markersize; s = ' ' applies only to plt.scatter) (3) mew = ' '
What are the two functions used to make a scatterplot with Seaborn?
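(1) the .plot method (called on a pandas DataFrame with kind = 'scatter'; it is a pandas method, not a Seaborn one) (2) Seaborn's regplot function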
What attribute from the pandas library is assigned when generating a plot from a CSV file to name the columns of the data from the file?
.columns = ['x', 'y', 'z'] .columns - the DataFrame attribute that holds the column labels; assigning a list to it renames the columns of the read-in data x, y, z - the names we want the columns to have; the list must contain one name for every column in the DataFrame
What function is used to create a legend for a specific point on a plot and placed at a specific location on the plot?
.legend(label, loc) label - the label(s) we want to be used in the legend loc - the location on the plot where we want the legend to be located - different locations are specified with different number codes (memorize these)
What function can be used to retrieve specific values within a DataFrame object according to its index labels? Give an example of using the function.
.loc[['x'],['y']] - locates and retrieves a value in a DataFrame object according to that value's index labels x - row index label where the value resides y - column index label where the value resides example: .loc[['row 1'],['column 1']] - retrieves the value that resides at the intersection of the index labels 'row 1' and 'column 1' (returned as a 1 x 1 DataFrame because the labels are passed as lists).
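A small sketch of label-based retrieval with .loc. The DataFrame contents and the names df, sub_frame, and single_value are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical 3 x 3 DataFrame holding the integers 0..8
df = pd.DataFrame(
    np.arange(9).reshape((3, 3)),
    index=["row 1", "row 2", "row 3"],
    columns=["column 1", "column 2", "column 3"],
)

sub_frame = df.loc[["row 1"], ["column 1"]]   # list labels give a 1 x 1 DataFrame
single_value = df.loc["row 2", "column 3"]    # plain labels give the value itself
```

Passing plain labels instead of lists is the usual choice when only the single value is needed.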
What is the function which can be used to plot data in MatPlotLib via the functional method?
plt.plot(x, y) plt - the conventional alias for the matplotlib.pyplot plotting module plot( ) - the plotting function from pyplot x - values for the x-axis y - values for the y-axis
What function is used to shape the values for a DataFrame into specific dimensions?
.reshape((x,y)) - a NumPy array method, applied to the values before they are passed to DataFrame( ) x - the number of rows y - the number of columns
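A quick sketch of reshape on a NumPy array; the names flat and grid are illustrative only. The first number in the tuple is the row count and the second is the column count.

```python
import numpy as np

flat = np.arange(6)            # six values: 0, 1, 2, 3, 4, 5
grid = flat.reshape((2, 3))    # 2 rows, 3 columns
```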
What functions are used in the MatPlotLib object-oriented method to create labels for the x and y axes?
.set_xlabel('name of x axis') - for the x axis label .set_ylabel('name of y axis') - for the y axis label
What is the function used to create tick labels on an x axis?
.set_xticklabels(DataFrame_name.column_for_labels) DataFrame_name - the name of the DataFrame which the column is in .column_for_labels - the name of the column within the DataFrame which you want to use as labels for the x axis
What function is used to add a label to the x and y axes of a plot via the functional method in MatPlotLib?
plt.xlabel('name_of_your_x_label') - call the xlabel function and write a name for the x axis plt.ylabel('name_of_your_y_label') - call the ylabel function and write a name for the y axis
What is a series object?
A row or column within a DataFrame that is indexed.
What is a scalar?
A scalar is a single numerical value.
What is a DataFrame object?
A spreadsheet of rows and columns (series objects) that are indexable.
What is an array?
A 1-dimensional container for elements that are all of the same data type.
What is a matrix?
A 2-dimensional container for elements that are stored in an array.
What is a dot product?
A mathematical operation that takes 2 sequences of numbers with an equal number of elements each, multiplies the corresponding elements together, and sums the products to generate a single, scalar value.
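As a worked example, the dot product can be computed by hand and with np.dot. The two arrays here are arbitrary illustrative values.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Multiply element by element, then add up: 1*4 + 2*5 + 3*6
manual = sum(int(p) * int(q) for p, q in zip(a, b))
result = np.dot(a, b)
```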
What is matrix multiplication?
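A mathematical operation that takes 2 matrices (2-dimensional arrays), where the number of columns in the first equals the number of rows in the second, and combines them into a single matrix; each element of the result is the dot product of a row of the first matrix with a column of the second.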
What is a cross tab function?
A method which counts how often each combination of values of two or more variables occurs, used to understand the relationship between those variables.
What formula could we use to add additional columns (variables) to a plot which we have already generated from a read-in file done in pandas. Explain. Use the functional method in Pandas. Give an example for the read-in file named 'cars'.
DF = read_in_file_name[['x', 'y', 'z']] DF.plot( ) When a CSV file is read in to pandas, it becomes a DataFrame object. Therefore, we can plot additional columns by selecting the columns we want from the DataFrame and plotting the selection. example: DF = cars[['x','y','z']] DF.plot() DF.plot( ) plots ALL of the columns in DF, which here are the selected columns x, y, and z rather than every column in the read-in CSV file.
What function is used to create a DataFrame with both a row and column index?
DataFrame(x, index=['y'], columns=['z']) DataFrame( ) - the function used to create a DataFrame x - the values of the DataFrame y - values of the row indices (the default format for an index is to create rows) z - values of the column indices
What function is used to create a series?
Series(x, index = [y]) x = the values that belong to the series object y = the corresponding index value for each value in the series object; series objects are always indexed (if no index is passed, pandas assigns a default integer index 0, 1, 2, ...)
Write a formula to create an indexed series of 8 index values with corresponding values and explain each element of the formula.
Series_obj = Series(np.arange(8), index = ['row 1', 'row 2', 'row 3', 'row 4', 'row 5', 'row 6', 'row 7', 'row 8']) Series_obj - creating a variable/name for our indexed series Series(x, index = [y]) - the constructor we use to create a series object; x = the series object values; y = the index values which index the series object values np.arange(8) - returns the 8 evenly spaced integers 0 through 7 to correspond with the index values index = ['row 1', 'row 2', 'row 3', 'row 4', 'row 5', 'row 6', 'row 7', 'row 8'] - creates an index with 8 index values called row 1, row 2, row 3... row 8
What is setting?
Setting is where you select a specified label index and then set its corresponding value(s) equal to a scalar value.
What formula is used to multiply two matrices named 'aa' and 'bb' together via numpy?
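aa = np.array([[a,b,c],[d,e,f]]) bb = np.array([[g,h,i],[j,k,l]]) aa*bb aa*bb - multiplies the two arrays 'aa' and 'bb' element by element (both must have the same shape); this is not the matrix product from linear algebra, which is computed with np.dot(aa, bb) or aa @ bb and requires the number of columns in 'aa' to equal the number of rows in 'bb'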
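A short sketch contrasting the * operator with the linear-algebra matrix product in NumPy. The array values and the extra array cc are made up for illustration; note that * on NumPy arrays works element by element.

```python
import numpy as np

aa = np.array([[2, 4, 6], [1, 3, 5]])      # shape (2, 3)
bb = np.array([[0, 1], [2, 3], [4, 5]])    # shape (3, 2)
cc = np.array([[1, 1, 1], [2, 2, 2]])      # shape (2, 3), same shape as aa

elementwise = aa * cc      # element-by-element product, result has shape (2, 3)
matrix_product = aa @ bb   # matrix product, result has shape (2, 2)
same_product = np.dot(aa, bb)
```

The matrix product needs the inner dimensions to match (3 columns in aa, 3 rows in bb), while the element-wise product needs identical shapes.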
What argument should be passed to set a bar plot's bars to align at the center when using the functional method in MatPlotLib?
align = 'center'
Write a function via the MatPlotLib object-oriented method which adds tick labels to an existing x axis. The variable name of the axes is 'ax', the DataFrame that holds the labels is called 'cars', and the column you are using as labels is called 'car_names'.
ax.set_xticklabels(cars.car_names)
Write a function via the MatPlotLib object-oriented method which adds ticks to the existing x and y axes. The variable name of the axes is 'ax'.
ax.set_xticks(range(32)) ax.set_yticks(range(50)) ax - the name of the existing axes set_x/yticks( ) - the function that creates the tick marks on both axes range(32/50) - sets the ticks to this range of scalar values
What is the function used to add grid lines to an axis in a plot?
axes_variable_name.grid( )
What function is used to add limits to x and y axes of a plot in a figure?
axis_variable_name.set_xlim([a, b]) - for the x axis axis_variable_name.set_ylim([a, b]) - for the y axis a, b - the location in the plot where we want the limits to be set to
What function is used to add ticks to x and y axes of a plot in a figure?
axis_variable_name.set_xticks([a,b,c,d..]) - x axis ticks axis_variable_name.set_yticks([a,b,c,d..]) - y axis ticks a,b,c,d - the locations on the x and y axes where we want the tick marks to be
What function is used to create a histogram using the MatPlotLib object-oriented method?
variable_name.plot(kind = 'hist') variable_name/column_name - the name of the variable/column from a DataFrame that you want used as the data for a histogram
What argument should be passed to set a plot's line/bar to a certain width when using the functional method in MatPlotLib?
width = width_variable(s)
Write a formula with MatPlotLib which uses the functional method to create a line graph of 9 data points.
x = range(1, 10) y = [1,2,3,4,0,4,3,2,1] plt.plot(x, y) range - can be used to create evenly spaced integer values in the range specified (the stop value, 10, is excluded, so x runs from 1 to 9); this is not necessary for creating a plot but can be used This formula will generate a line chart of the data points specified by the x and y variables given.
Write a formula via the functional method in MatPlotLib which creates a bar chart with both x and y axes labels.
x = range(1, 10) y = [1,2,3,4,.5,4,3,2,1] plt.bar(x,y) plt.xlabel('My X Axis Label') plt.ylabel('My Y Axis Label')
Create a bar plot via MatPlotLib that sets the width of the bars to .5 and .9, sets the color to 'salmon', and aligns the bars at the center while using the functional method in MatPlotLib.
x = range(1, 10) y = [1,2,3,4,.5,4,3,2,1] wide = [.5,.5,.5,.9,.9,.9,.5,.5,.5] color = ['salmon'] plt.bar(x, y, width = wide, color = color, align = 'center') x, y - the x and y variables to be generated for the bar plot wide - the variable for width; the widths that we want to set for each individual bar in the bar chart color = ['color name'] - the color that we want the chart to be plt.bar (x, y) - we want to create a plot based off of the x and y variables (width = wide, color = color, align = 'center') - we want to create a bar plot under these conditions for width, color, and alignment
Give an example formula that we can use to generate a plot named 'fig' via the object-oriented method? Add axes to the figure in the plot.
x = range(1, 10) y = [1,2,3,4,0,4,3,2,1] fig = plt.figure() ax = fig.add_axes([.1, .1, 1, 1]) ax.plot(x, y)
What is the function we can use to generate a bar chart plotting a column (variable) named 'x' from a CSV file via the functional method in Pandas.
x.plot(kind = 'bar')
What is the function we can use to generate a horizontal bar chart plotting a column (variable) named 'x' from a CSV file via the functional method in Pandas?
x.plot(kind = 'barh')
Write a formula via the functional method in MatPlotLib which creates a pie chart with labels for each slice of the pie?
z = [1,2,3,4,.5] veh_type = ['a','b','c','d','e'] plt.pie(z, labels = veh_type) plt.show( ) The labels list must contain one label for every value in z.
Write a formula which changes a numerical series of data called 'gear' to a categorical series of data called 'group' and adds it to our DataFrame called 'cars' as a new series object.
cars['group'] = pd.Series(cars.gear, dtype = 'category') cars['group'] = - specifies the DataFrame we are modifying and the new series object we are adding to it pd.Series( ) - we are calling the Series constructor from the pandas library (capital S), which creates a new series object cars.gear - the variable which we are converting from the DataFrame dtype = 'category' - the datatype which the new series object will be - category = categorical data type
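A minimal sketch of the conversion. The tiny cars DataFrame below is a made-up stand-in, not the course data set.

```python
import pandas as pd

# Hypothetical stand-in for the 'cars' DataFrame
cars = pd.DataFrame({"gear": [4, 4, 3, 5, 3]})

# Add a categorical copy of the numeric 'gear' column as a new column
cars["group"] = pd.Series(cars.gear, dtype="category")
```

The original gear column keeps its numeric type; only the new group column is categorical.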
What argument should be passed to set a plot to a certain color or color theme when using the functional method in MatPlotLib?
color = ['name_of_the_color']
Write a formula via the pandas library for which a CSV file is read-in as a DataFrame, with the DataFrame given the name 'df', and then retrieve the head of the DataFrame. Use one of the columns of the csv to create an index.
csv_file_name = 'csv_file_address' df = pd.read_csv(csv_file_name, index_col = 'column name', encoding = 'cp1252', parse_dates = True) df.head( ) csv_file_name - the name used as the variable for the csv file address df - the name used as the variable for the read-in csv file (a DataFrame) pd.read_csv - the function which reads in the csv file index_col - the parameter which is used to generate an index for the DataFrame based on a column of the csv file 'column name' - the name of the column that we want to use as our index df.head( ) - retrieves the first rows of the DataFrame (5 by default)
What is the formula used for the pandas .plot method to create scatterplots?
DataFrame_name.plot(kind = 'scatter', x = 'variable 1', y = 'variable 2') DataFrame_name - the name of the DataFrame created from the read-in csv file .plot( ) - the function that is used to generate the plot kind = 'scatter' - specify the kind of plot that you want to create within the .plot parameters x = - column from the df used for the x variable in the graph y = - column from the df used for the y variable in the graph
What is the function used in pandas (which draws with MatPlotLib) to create a boxplot?
data_variable_name.boxplot(column = 'x', by = 'y') data_variable_name - the name of the DataFrame which the data for your boxplot is stored in .boxplot - the pandas DataFrame method which creates a boxplot from the data using MatPlotLib column = 'x', by = 'y' - column is the variable whose values are plotted and by is the variable used to group them
What function in pandas is used to generate the entire statistical description of each individual variable/column?
data_variable_name.describe( )
What function in pandas is used to group a DataFrame by its values in a particular column?
data_variable_name.groupby('variable_name') .groupby - function which groups the rows of the DataFrame by the values of the variable within the parameters of the function 'variable_name' - the variable by which we want to group our data
What function in pandas is used to find the location (index label) of the max value of each individual variable/column?
data_variable_name.idxmax( )
What function in pandas is used to find the max value of each individual variable/column?
data_variable_name.max( )
What function in pandas is used to find the mean of each individual variable/column?
data_variable_name.mean( )
What function in pandas is used to find the median of each individual variable/column?
data_variable_name.median( )
What function in pandas is used to generate the standard deviation of each individual variable/column?
data_variable_name.std( )
What function in pandas is used to sum the total of each individual variable/column?
data_variable_name.sum( )
What function in pandas is used to sum the total across each row?
data_variable_name.sum(axis = 1)
What function in pandas is used to count how many times each unique value occurs in an individual variable/column?
data_variable_name.value_counts( ) (use .nunique( ) for the number of distinct values)
What function in pandas is used to generate the variance of each individual variable/column?
data_variable_name.var( )
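The summary methods above are methods of pandas DataFrame and Series objects. A small sketch on made-up data (the DataFrame df and every value in it are illustrative assumptions):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 60]})

col_sums = df.sum()           # one total per column
row_sums = df.sum(axis=1)     # one total per row
col_means = df.mean()         # one mean per column
max_rows = df.idxmax()        # index label of the max value in each column

letters = pd.Series(["x", "y", "x"])
counts = letters.value_counts()   # how often each unique value occurs
distinct = letters.nunique()      # how many distinct values there are
```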
Write a formula which groups a DataFrame by a specific variable. The data for the formula is in the DataFrame called 'cars_cat' and the variable we want to group by is 'gear'. Then, generate descriptive statistics based on the group.
gears_group = cars_cat.groupby('gear') gears_group.describe( ) gears_group - we created a variable name for the grouped data which we are generating cars_cat - the name of the DataFrame which our data will come from .groupby('gear') - the function which groups the rows according to the gear variable in the DataFrame gears_group.describe( ) - generates descriptive statistics for each group
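A runnable sketch of grouping followed by describe. The pandas method is spelled groupby (no underscore), and the cars_cat values and the 'mpg' column below are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for 'cars_cat'
cars_cat = pd.DataFrame({
    "gear": [3, 3, 4, 4, 5],
    "mpg": [15.0, 17.0, 22.0, 24.0, 20.0],
})

gears_group = cars_cat.groupby("gear")   # group rows by the value of 'gear'
summary = gears_group.describe()         # count, mean, std, ... for each group
```

The result has one row per gear value and a two-level column index such as ('mpg', 'mean').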
What is the code for the labels when creating pie chart labels?
labels = veh_type
What function is used to generate a series of sequential numbers to create an array via numpy?
np.arange(start_#, end_#) np.arange - function which generates a series of evenly spaced, sequential numbers, given the specific parameters start_# - the number at which we want to begin our sequence end_# - the number at which we want to end our sequence; end_# itself is not included in the result
What function is used to return evenly spaced values and when is this function commonly used?
np.arange(x) x = the number of values you want; the function returns the integers from 0 up to, but not including, x This function is often used to create values within a series object to correspond to an index.
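A brief sketch of the common np.arange call forms; the variable names are illustrative. In every form the stop value is excluded.

```python
import numpy as np

first_eight = np.arange(8)      # starts at 0 and stops before 8
bounded = np.arange(2, 6)       # starts at 2 and stops before 6
stepped = np.arange(0, 10, 5)   # optional third argument is the step size
```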
What function is used to add two arrays together to create a matrix via numpy?
np.array([[a,b,c],[d,e,f]]) [[a,b,c],[d,e,f]] - the numbers within the parameters of the function which we want in our matrix; we create two lists to specify the matrix is made up of more than 1 array
What function is used to create an array via numpy?
np.array([a,b,c,d,e]) np.array - the function from numpy which creates an array based off of the numbers within its parameters [a,b,c,d,e] - the numbers within the parameters of the function which we want in our array
What formula is used to perform the dot product of two matrices named 'aa' and 'bb' together via numpy?
np.dot(aa, bb)
What function is used to create a set of random numbers and when is it commonly used?
np.random.rand(#) # - the number of random numbers desired This is commonly used when creating a DataFrame filled with values.
What function is used to generate a certain amount of random numbers to create an array via numpy?
np.random.randn(#_of_desired_#s) np.random.randn - the function generates the requested number of random numbers, negative and positive (drawn from a standard normal distribution), into an array #_of_desired_#s - parameter which specifies how many numbers we want in our array
What libraries and modules do we need to import in order to create data visualizations including line, bar, and pie charts?
numpy (plus randn from numpy.random); Series and DataFrame from pandas; matplotlib.pyplot (plus rcParams, MatPlotLib's settings dictionary)
What is the function for creating a cross-tabulation of two variables via pandas?
pd.crosstab(DataFrame_name['variable 1'], DataFrame_name['variable 2']) DataFrame_name - the DataFrame which our variables are drawn from variable 1 - the name of the first comparison variable variable 2 - the name of the second comparison variable
Write an example formula for a cross-tabulation of two variables named 'a' and 'b' both located in a DataFrame named cars.
pd.crosstab(cars['a'], cars['b'])
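A small sketch of what the cross-tabulation returns. The cars values below are made up; the point is that each cell counts the rows with that combination of 'a' and 'b'.

```python
import pandas as pd

# Hypothetical stand-in for 'cars' with two small integer columns
cars = pd.DataFrame({"a": [0, 0, 1, 1, 1], "b": [3, 4, 4, 4, 5]})

# Rows are the values of 'a', columns are the values of 'b'
table = pd.crosstab(cars["a"], cars["b"])
```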
What function from the Pandas library is used to read in a CSV file to be used for generating plots?
pd.read_csv pd - the library which the function comes from .read_csv - the function in the pandas library which reads a csv file into a DataFrame for the use of generating a plot
What is the function we can use to generate a bar chart plotting x and y variables (not generated from a CSV file) via the functional method in MatPlotLib?
plt.bar(x, y)
What function is used to create a histogram using the MatPlotLib functional method?
plt.hist(DataFrame_name['column_name']) DataFrame_name['column_name'] - the variable/column from a DataFrame that you want used as the data for a histogram (pass the data itself, not its name as a quoted string)
What is the function we can use to generate a pie chart plotting x and y variables (not generated from a CSV file) via the functional method in MatPlotLib?
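plt.pie(x) x - the values which set the size of each slice; a pie chart takes a single set of values, and the slice labels are passed with the labels argument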
What function is used to generate subplots from a plot?
plt.subplots(subplot format) plt.subplots - function which indicates that we want to create subplots subplot format - the desired formats for the subplots; usually indicative of rows and columns
What is the function used in Seaborn to create a boxplot?
sb.boxplot(x = 'variable 1', y = 'variable 2', data = data_variable_name) sb.boxplot( ) - the Seaborn function which creates a boxplot based on the data within its parameters x, y - the variables which you want to compare against one another data = data_variable_name - the data from the DataFrame which you wish to use for the boxplot
What function is used in Seaborn to create a scatterplot matrix?
sb.pairplot(DF_variable_name) sb.pairplot - commands Seaborn to call the pairplot function to create a scatterplot matrix DF_variable_name - the name of the DataFrame which contains the data you want used for the pairplot
What function is used in Seaborn to create a subset of data from a scatterplot matrix?
sb.pairplot(DF_variable_name) subset_variable_name = DF_variable_name[['x','y','z']] sb.pairplot(subset_variable_name) plt.show( ) subset_variable_name - the name you are giving to the subset of the pairplot data as a variable x,y,z, - the specific data/columns from the DataFrame which you want plotted as a subset sb.pairplot(subset_variable_name) - the command in which Seaborn calls the pairplot function on the set of variables/columns from the DF which you want plotted
What is the formula used for the Seaborn .regplot method to create scatterplots?
sb.regplot(x = 'x_variable_name', y = 'y_variable_name', data = data_variable_name, scatter = True) sb.regplot - commands Seaborn to call the regplot function to create a scatter plot x, y - the names of the columns used for the x and y variables data - the data in the form of a DataFrame that you want used for the scatterplot data_variable_name - the name of the DataFrame variable (passed without quotes) scatter = True - draws the scatter points for the observations
What is the formula for setting the label indices 'x', 'y', and 'z' in a series object to = a scalar value. Print the result. Give an example.
series_name[['x','y','z']] = scalar value series_name x, y, z - label indices for which we want to change their corresponding values to a scalar value; they are passed together as a list inside the brackets scalar value - the numerical value which we want to change x, y, and z's corresponding values to example: series_obj[['row 1','row 5','row 8']] = 8 series_obj
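A runnable sketch of setting. The series is built the same way as the earlier 8-row example, and the labels are passed as a list inside the brackets.

```python
import numpy as np
import pandas as pd

# Series holding 0..7, labeled 'row 1' .. 'row 8'
series_obj = pd.Series(np.arange(8), index=[f"row {i}" for i in range(1, 9)])

# Set the values at three labels to the scalar 8
series_obj[["row 1", "row 5", "row 8"]] = 8
```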
What is the formula for slicing a slice of values in a series object according to the object's index labels? Give an example using the formula on a series object named 'series_obj'.
series_name['x':'y'] series_name[ ] - calls the series by its name x - the index label at which you want to begin slicing values : - through y - the index label at which you want to finish slicing values (included in the result) example: series_obj['row 1':'row 3'] series_obj - the name of the series object we want to slice data from row 1 - the index label at which the returned values begin row 3 - the index label at which the returned values end
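A short sketch of label slicing compared with position slicing, using an illustrative 8-row series. Unlike ordinary Python slices, a label slice includes its end label.

```python
import numpy as np
import pandas as pd

series_obj = pd.Series(np.arange(8), index=[f"row {i}" for i in range(1, 9)])

by_label = series_obj["row 1":"row 3"]   # label slice: 'row 3' is included
by_position = series_obj.iloc[0:2]       # position slice: stop position is excluded
```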
How could we retrieve a series object value corresponding to a label index value of 'row 7' from a series object named 'series_obj'?
series_obj['row 7'] We call the series object by its name, 'series_obj', and then, within the list, the name of the label index value which we want the corresponding series object value for.
How could we retrieve series object values corresponding to integer index labels of '0' and '7' from a series object named 'series_obj'?
series_obj[[0,7]] We call the series object by its name, 'series_obj', and then call the integer index labels for which we want the series object values: 0 and 7. If the series has label indices such as 'row 1', use series_obj.iloc[[0,7]] to select by integer position instead.
What 2 commands must be input in the code before using the object-oriented method to build plots? Explain them. Give an example.
(1) %matplotlib inline (2) rcParams['figure.figsize'] = x, y %matplotlib inline - Jupyter/IPython magic command which displays plots inside the notebook rcParams - MatPlotLib's dictionary of runtime configuration settings 'figure.figsize' - the setting which holds the default size for figures x, y - the width and height (in inches) which we want our figures to be example: rcParams['figure.figsize'] = 5, 4
What are the two methods which can be used for plotting data? Define both.
(1) Functional Method - used by calling the plotting function on a variable or set of variables (2) Object Oriented Method - used by generating a blank figure object and then populating it with a plot and plot elements
What is an Index?
A list of integers or labels you use to uniquely identify rows or columns. A labeled array capable of holding data.
What library is necessary to import in order to perform math operations on arrays & matrices?
import numpy as np (from numpy.random import randn is only needed to generate random numbers)
What library must be opened to generate summary statistics?
import scipy from scipy import stats
What function is used to create an index?
index = ['x','x','x','x',1,2,3,4] Set the index argument equal to a list containing strings or integer values.
Write out a formula which creates a variable for a read-in CSV file, reads the file into pandas, and then generates a plot based on one of the columns in the file via the functional method in Pandas. Explain what each element in the formula is doing.
csv_file = 'CSV file address here' variable = pd.read_csv(csv_file) variable.columns = ['x', 'y', 'z'] x = variable['x'] x.plot( ) csv_file - the variable holding the address of the CSV file (a variable name cannot contain a space) variable - the name we are giving to the read-in CSV file pd.read_csv(csv_file) - calling in the read_csv function from the pandas library to read in the CSV file variable.columns - assigning names to the columns of the read-in data 'x', 'y', 'z' - the names given to the columns, one per column in the file x = variable['x'] - creating a new variable for the specific column which we want to be shown on the plot x.plot() - using the plot function to plot the column x
Write out an example formula to read in a CSV file called mtcars, create a variable for the read-in file, and use columns x, y, and z from the file to plot just the x column via the functional method in Pandas.
csv_file = 'address of mtcars on my computer' cars = pd.read_csv(csv_file) cars.columns = ['x', 'y', 'z'] x = cars['x'] x.plot( ) We begin by storing the address of the mtcars CSV file in a variable named 'csv_file'. We then create a variable called 'cars' for the read-in CSV file and use the pd.read_csv function to read it in. We then name the columns of the data. We then create a variable for the column we wish to plot and show where it comes from in the read-in CSV. We lastly plot the column via the plot function.
Write a formula to create a DataFrame object named 'DF_obj' that is comprised of 36 random values, formatted in a 6 x 6 frame, and has both row and column indices. Print the DataFrame once finished.
DF_obj = DataFrame(np.random.rand(36).reshape((6,6)), index = ['row 1', 'row 2', 'row 3', 'row 4', 'row 5', 'row 6'], columns = ['column 1', 'column 2', 'column 3', 'column 4', 'column 5', 'column 6']) DF_obj DF_obj - name the DataFrame 'DF_obj' DataFrame(x, index=['y'], columns=['z']) - call the DataFrame function np.random.rand(36) - generate 36 random values .reshape((6,6)) - shape the random values in a 6 x 6 format index = ['row 1', 'row 2', 'row 3', 'row 4', 'row 5', 'row 6'] - creates a vertical row index with 6 index values named row 1, row 2,... row 6 columns = ['column 1', 'column 2', 'column 3', 'column 4', 'column 5', 'column 6'] - creates a horizontal column index with 6 index values named column 1, column 2,... column 6 DF_obj - print the DataFrame object
What is Data Slicing?
Data Slicing is used to select and return a slice of values from a series object. It returns the index labels and their corresponding series values.
What is a CSV file transformed into when it is read-in to a pandas environment by functional method?
It is transformed into a DataFrame.
In what two formats can you write an index?
Label index and integer index.
What library should we use if we are to create a plot via a CSV file?
Pandas
Write an example formula for generating 2 subplots based on a figure named fig (already created). We want 1 of the subplots to plot both column and row data and the other 1 to plot just column data.
fig, (ax1, ax2) = plt.subplots(1, 2) ax1.plot(x) ax2.plot(x,y) plt.subplots( ) - function which creates the figure object and its subplots together, so a separate fig = plt.figure( ) call is not needed (it would only create an extra, empty figure) fig, (ax1, ax2) - the figure and the 2 axes, one per subplot 1, 2 - how many rows and how many columns of subplots you want ax1.plot(x) - plots the values of x alone, against their positions 0, 1, 2, ... in the ax1 subplot ax2.plot(x,y) - plots y against x in the ax2 subplot
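A minimal sketch of two side-by-side subplots. The data values and the "Agg" backend (chosen only so the sketch runs without a display) are assumptions; plt.subplots returns the figure and the axes together.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so no window is needed
import matplotlib.pyplot as plt

x = range(1, 10)
y = [1, 2, 3, 4, 0, 4, 3, 2, 1]

fig, (ax1, ax2) = plt.subplots(1, 2)   # 1 row, 2 columns of subplots
line1, = ax1.plot(y)                   # one sequence: plotted against positions 0..8
line2, = ax2.plot(x, y)                # two sequences: y plotted against x
```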
Write a function via the MatPlotLib object-oriented method which adds axes to a figure object and names the axis 'ax'.
fig = plt.figure() ax = fig.add_axes([axes numbers]) ax = the name of the axis being created fig.add_axes([]) - function which adds axes to the figure object
What is the function we can use to add axes to our figure object?
figure_variable_name.add_axes
What is the formula that we can use to generate plots via the object-oriented method in MatPlotLib? Explain.
x = [x data points] y = [y data points] figure_variable_name = plt.figure( ) axes_variable_name = figure_variable_name.add_axes([left, bottom, width, height]) axes_variable_name.plot(x, y) x - the x variable for the plot y - the y variable for the plot figure_variable_name - the name we are giving our generated figure (creating the figure) plt.figure( ) - function which creates the blank figure object add_axes( ) - function which adds axes to the figure .plot(x, y) - generates the plot within the axes