ALL business analytics Q&A
What is the shape of the following NumPy array? np.random.seed(1955) x = np.random.randn(2, 2, 2, 2) print(x.shape) x #Hint: I have not loaded the necessary packages here, but you should (import pandas, import numpy)
(2, 2, 2, 2)
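A minimal sketch of the shape question above, assuming NumPy is installed. The seed only fixes the random values; the shape comes from the arguments passed to `randn`.

```python
import numpy as np

np.random.seed(1955)             # fixes the random values, not the shape
x = np.random.randn(2, 2, 2, 2)  # four axes, each of length 2

print(x.shape)  # (2, 2, 2, 2)
print(x.ndim)   # 4
print(x.size)   # 16
```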
Now that we have the homes dataset loaded, let's explore a little bit. What are 3 of the ways we have explored a dataset in the course videos? I don't do each of these every time, but each of these you have seen me run many times.
.info() .describe() .head()
What is the output of the following code? y = -0 if y >= 0: print('0 or more') else: print('less than 0')
0 or more
First create a simple dataframe using the below code. import pandas as pd # create a list of lists data = [['A1', 2, 4, 8], ['A2', 3, 7, 17], ['A3', 1, None, 7], ['A4', 989, 186, 3698], ['A5', 0, 0 ,None]] # Create the pandas DataFrame df= pd.DataFrame(data, columns=['ID', 'Value 1', 'Value 2','Value 3']) df If you ran: df = df.dropna() df How many rows would remain in the dataframe?
3
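The row count can be checked with a short sketch that rebuilds the same dataframe. The names `cleaned` and `remaining_ids` are only for illustration.

```python
import pandas as pd

# Same data as in the question: rows A3 and A5 each contain one missing value
data = [['A1', 2, 4, 8], ['A2', 3, 7, 17], ['A3', 1, None, 7],
        ['A4', 989, 186, 3698], ['A5', 0, 0, None]]
df = pd.DataFrame(data, columns=['ID', 'Value 1', 'Value 2', 'Value 3'])

cleaned = df.dropna()                  # drops every row with at least one missing value
remaining_ids = cleaned['ID'].tolist()

print(len(cleaned))    # 3
print(remaining_ids)   # ['A1', 'A2', 'A4']
```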
What is the mean of the column "A" in this DataFrame generated below? (choose the closest value) (you may need to import additional packages to run the below code!) import numpy as np rng = np.random.default_rng(768561456987365) #create a dataframe using those random values! df = pd.DataFrame(rng.integers(0,100,size=(15, 4)), columns=list('ABCD'))
58.13
greater than or equal to, less than, not equal to, greater than, equal to, less than or equal to
>=, <, !=, >, ==, <=
What is the purpose of np.array in the below code? a = [6.1, 5.8, 5.97, 5.43, 7.34, 8.67, 6.55, 3.66, 2.31, 6.84] b = [2.5, 3.19, 2.26, 3.17, 8.17, 2.76, 5.22, 9.82, 3.95, 8.38] np_a = np.array(a) np_b = np.array(b)
Convert the lists 'a' and 'b' to a NumPy array
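One reason the conversion matters, sketched with shortened versions of the lists above: `+` on plain lists concatenates, while `+` on NumPy arrays adds element by element.

```python
import numpy as np

a = [6.1, 5.8, 5.97]   # first three values of list a from the question
b = [2.5, 3.19, 2.26]  # first three values of list b from the question

joined = a + b         # lists: + glues the two lists together

np_a = np.array(a)
np_b = np.array(b)
summed = np_a + np_b   # arrays: + adds matching elements

print(len(joined))     # 6
print(summed)
```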
Price is a variable we are interested in building a model on (later, once we've learned that stuff), which makes missing values and outliers particularly important to address. If price has an outlier value that is really, really extreme, what should we do with it? (The choices I am offering you below are very narrow. There is obviously more we could do... but given what you see in the dataset, and what I have said before about this issue, what would you do?)
Delete those rows
Matplotlib is built on top of seaborn (uses seaborn code)
FALSE
for pandas to work, data must be formatted as lists before it is imported
FALSE
Missing values can be imputed/replaced with other values. If my dataset has 1000 rows, and 200 missing values for the category age. What could I impute for age? (This question is not asking which of values you SHOULD use. Just what you COULD use)
Impute the mean Impute the most common value Impute the median
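A small sketch of the three imputation options, using a made-up `age` series (the values are illustrative only, not from the course data).

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing value
age = pd.Series([20.0, 30.0, 30.0, np.nan, 60.0], name="age")

mean_filled = age.fillna(age.mean())       # mean of the known values is 35.0
median_filled = age.fillna(age.median())   # median of the known values is 30.0
mode_filled = age.fillna(age.mode()[0])    # most common value is 30.0

print(mean_filled[3], median_filled[3], mode_filled[3])  # 35.0 30.0 30.0
```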
Outliers are common in some types of variables, an example discussed was the income variable in an online survey. Imagine you have conducted a survey on shopping habits, and receive 1,000 responses. One of your variables is a question on income. The vast majority of people respond with an income of 50k-200k per year. 5 individuals respond with an income in the billions. What should you do?
Impute/override/fix that value using a mean or median
Within a for loop, which line of code would you use to increase the number within a variable?
NONE OF THESE: count + '1' + '1' count number = count + '1' add(count, '1')
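For comparison, a sketch of a line that does work, assuming `count` holds an integer. The listed options either add the string `'1'` to a number or never store the result back in `count`.

```python
count = 0
for _ in range(3):
    count = count + 1   # add the integer 1 and store it back; count += 1 is equivalent

print(count)  # 3
```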
Sometimes when working with (struggling with!) missing values, you find that it is not missing at all! Sometimes someone has been helpful (!) and entered some placeholder value like "THISISMISSING" when that happens, what could you do (choose the best answer, it won't be the ONLY possible answer, just the best one here)
Replace "THISISMISSING" with a missing value (np.nan)
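A minimal sketch of that replacement on a made-up column. The lowercase `np.nan` spelling is used here because the `np.NaN` alias was removed in NumPy 2.0.

```python
import numpy as np
import pandas as pd

# Hypothetical column where someone typed a placeholder instead of leaving a gap
df = pd.DataFrame({"age": ["34", "THISISMISSING", "51"]})

df["age"] = df["age"].replace("THISISMISSING", np.nan)  # placeholder becomes a real missing value
df["age"] = pd.to_numeric(df["age"])                    # remaining strings become numbers

print(df["age"].isna().sum())  # 1
```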
How many outputs will this following code have? mosquito = 1 while mosquito > 0: print (mosquito) mosquito = mosquito + 1 print (mosquito)
There are an infinite number of mosquitos (the loop never ends)
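The loop never stops because `mosquito` only grows, so `mosquito > 0` stays true. A sketch with a safety stop added (the `break` and the `printed` list are additions for illustration; the quiz code has neither):

```python
mosquito = 1
printed = []

while mosquito > 0:
    printed.append(mosquito)   # stands in for print(mosquito)
    mosquito = mosquito + 1
    if mosquito > 5:           # illustrative stop; without it the loop runs forever
        break

print(printed)  # [1, 2, 3, 4, 5]
```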
Imagine we have a dataframe, df. What would be the purpose for running code like the one below? (why would we run it?) df.loc[1]
To look for, and retrieve a value from df
Datasets to be joined generally need something in common, like a customer ID. The relationship does not need to be 1 to 1. (e.g. Customer ID 75883 may occur once in the first dataset, and many times in the second dataset.)
True
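A sketch of a one-to-many join in pandas. The tables, column names, and amounts are made up; only customer ID 75883 comes from the question.

```python
import pandas as pd

# Each customer appears once here...
customers = pd.DataFrame({"customer_id": [75883, 10001],
                          "name": ["Ann", "Bo"]})
# ...but can appear many times here
orders = pd.DataFrame({"customer_id": [75883, 75883, 75883, 10001],
                       "amount": [20, 35, 15, 50]})

merged = customers.merge(orders, on="customer_id", how="inner")

print(len(merged))  # 4: one row per order, with the customer's name repeated
```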
Heatmaps can be used to quickly understand correlated variables in a dataset
True
Pandas can be used to join two data frames together
True
When creating a chart using seaborn, it is possible to make formatting changes to the chart using matplotlib code.
True
When importing data from a local drive, the relative path was defined as the path FROM where your code in your current working directory is, TO where your data is.
True
When using matplotlib, if the color is the only part of the format string, you can use any matplotlib colors spec (eg. full names like "red") or hex strings
True
In the Titanic dataset we used in the videos: I discussed the cabin fare for the Titanic, and how some values were really, really big. I mentioned that this is not necessarily a mistake; the fare could in fact be that high and be distributed this widely. This is different from seeing outliers that can't exist (like negative 100 for age). Nevertheless, if we WANTED to fix fare and remove the outlier fare, we could do one of two things, demonstrated in the video.
Use some code to replace the outlier with the mean of the values OR Use some code to replace the outlier with the median of the values
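One way this could look, as a sketch only: the fares and the cutoff of 300 are made up, and `Series.mask` is just one of several ways to do the replacement (the video may use a different approach).

```python
import pandas as pd

# Hypothetical fares with one extreme value
fare = pd.Series([7.25, 8.05, 13.0, 26.0, 512.33], name="fare")

cutoff = 300                   # hypothetical threshold for "outlier"
median_fare = fare.median()    # 13.0 (the median is barely affected by the outlier)

fixed = fare.mask(fare > cutoff, median_fare)  # replace values above the cutoff

print(fixed.tolist())  # [7.25, 8.05, 13.0, 26.0, 13.0]
```

Swapping `fare.median()` for `fare.mean()` gives the mean-based version, though the mean itself is pulled upward by the outlier.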
What would be returned by the following code? Assume this is the only code in the workbook, nothing else is loaded or present. import pandas as pd today = datetime.datetime.now() print(now)
an error
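The snippet fails because `datetime` is never imported, and it also prints `now`, a name that was never defined. A corrected sketch:

```python
import datetime

today = datetime.datetime.now()  # current date and time
print(today)                     # print the variable that was actually assigned
```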
What is the output of the following code? import numpy as np list1 = [5, 5, 5] list2 = [10, 10, 10] np_list1 = np.array(list1) np_list2 = np.array(list2) np_list1/np_list2
array([0.5, 0.5, 0.5])
Look at the below code carefully. It is not at all uncommon to see errors of omission in code chunks like this. How can you fix the below so that it produces the output 'array([50, 50, 100])' import numpy as np list1 = [5,5,5] list2 = [10,10,20] np_list1 = np.array(list1) np_list2 = np.array(list2) np_list1 is_and np_list2
change "np_list1 is_and np_list2" to "np_list1*np_list2"
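A runnable version of the corrected chunk. Both `*` and `/` work element by element on NumPy arrays of the same length.

```python
import numpy as np

np_list1 = np.array([5, 5, 5])
np_list2 = np.array([10, 10, 20])

product = np_list1 * np_list2   # element-wise multiplication
ratio = np_list1 / np_list2     # element-wise division

print(product.tolist())  # [50, 50, 100]
print(ratio.tolist())    # [0.5, 0.5, 0.25]
```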
University of Florida President Kent Fuchs wanted to define a function to count the student enrollment of the top three colleges: Liberal Arts, Engineering, and Business. What two parts are MISSING in his code? His code looks like this: MISSING college_count(liberal_arts, engineering, business): enrollment = liberal_arts + engineering + business MISSING enrollment
def and return
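With both pieces filled in, the function could look like the sketch below. The enrollment numbers in the call are made up.

```python
def college_count(liberal_arts, engineering, business):
    # def starts the function definition; return hands the result back to the caller
    enrollment = liberal_arts + engineering + business
    return enrollment

total = college_count(11000, 9000, 6000)  # hypothetical enrollment figures
print(total)  # 26000
```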
Imagine we have a pandas dataframe we have named 'df'. The dataframe consists of 2 columns. "col1" is 30 values long, and is a random mix of the letters 'a', 'b', and 'c'. "num1" is also 30 values long, and is a random set of numerical data (all integers). Which of the following would give you the mean of the numerical (num1) column, grouped by the values from column "col1"?
df.groupby('col1').mean()
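A small sketch of the grouped mean, using six made-up rows instead of thirty.

```python
import pandas as pd

# Hypothetical data: a letter column and an integer column
df = pd.DataFrame({"col1": ["a", "b", "a", "c", "b", "a"],
                   "num1": [2, 4, 4, 10, 6, 6]})

frame = df.groupby("col1").mean()          # DataFrame with one row per letter
means = df.groupby("col1")["num1"].mean()  # same numbers as a Series

print(means["a"], means["b"], means["c"])  # 4.0 5.0 10.0
```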
Assume all packages are loaded that need to be to make the code run successfully. Assume the test data ('test.csv') is loaded into your environment and named 'df'. So something like df = pd.read_csv("../data/test.csv") What would produce the below result? (First 6 rows, starting from 0-5)
df.head(6) (remember: row numbering starts at zero, so this returns rows 0-5)
Imagine we create a pandas series using the below code. What is one simple way to retrieve the value 0.5 from the series? import pandas as pd df = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd']) df
df['b']
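Label-based lookup is the simplest route; a sketch showing it next to two equivalent alternatives. The series is named `s` here rather than `df` to avoid confusion with a dataframe.

```python
import pandas as pd

s = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

by_label = s['b']        # lookup by index label
by_loc = s.loc['b']      # explicit label-based lookup
by_position = s.iloc[1]  # lookup by position (second item)

print(by_label, by_loc, by_position)  # 0.5 0.5 0.5
```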
Assume we have all packages in place we need. Assume all spaces etc are correct. Canvas sometimes shows strange gaps etc. Let's download the "homes.csv" file located in the Canvas file folder under the data tab. What code do we need to import that dataframe? Use the code we have been using from the notebooks. There are many ways to do this, but the demonstrated approach is the most popular convention. So make sure you type that in. Assume the data file is in the same folder and location as your code. In other words you do not need to create a relative path or any path for that matter. You just need the name of the file. We will not use a path variable. Just the name of the file. Note that in week 3 we did this with the "segments.csv" dataset, so you could refer to that as an example.
homes = pd.read_csv("homes.csv")
Company is a list containing 4 strings, sales is a list with 4 integers. Which of these code snippets would create a bar chart?
import matplotlib.pyplot as plt plt.bar(company ,sales, color='grey')
What package have we been using to import our data, and what is the abbreviation (as..) we have been using? This would look something like the below in a line of code. Note I am not asking what would work... but rather what has been demonstrated in the class notebooks. import PACKAGE as ABBREVIATION
import pandas as pd
What is the purpose of the code below? %matplotlib inline
make the plots show up inline
Will the following nested if statement run? If not, why? y = -0 if y > 0 if y > 5 print('higher') if y <= 5 print('lower') if y <=0 print ('0 or less')
no, syntax error
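The syntax errors are the missing colons after each `if` (plus the lost indentation). A corrected sketch, wrapped in a hypothetical `classify` function so it can be tried with several values; the nesting shown is an assumption about what the quiz intended.

```python
def classify(y):
    # Every if line ends with a colon, and nested blocks are indented
    if y > 0:
        if y > 5:
            return 'higher'
        if y <= 5:
            return 'lower'
    if y <= 0:
        return '0 or less'

print(classify(-0))  # 0 or less
print(classify(3))   # lower
print(classify(9))   # higher
```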
Coach Napier is trying to count his total wins for the 2022 season. Which for loop function will help him do so and produce the following output: Coach Napier's Record at Florida 1 Loss 2 Win 3 Win 4 Win 5 Loss Games played = 5 Wins to date = 3
print("Coach Napier's Record at Florida") games_played = 0 games_won = ["Loss", "Win", "Win", "Win", "Loss"] for N in games_won: [TAB]games_played = games_played + 1 [TAB]print(games_played, N) print("Games played =", games_played) print("Wins to date =", games_won.count("Win"))
When working with a pandas dataframe, what is one advantage seaborn has over using native matplotlib to visualize two of the columns. Note the question does not ask about using matplotlib in pandas.
seaborn can use columns from pandas. Matplotlib requires additional formatting of data.
Assuming this is a complete code chunk, and we expect to see output printed after running this, why is the below code incorrect? (choose the best answer, it may not be a great answer!) if tom brady == the goat: [TAB] print("The Bucs just won another Super Bowl")
the variables are not defined
Imagine you have a dataframe, called 'tickets', with 4 columns: ('name', 'address', 'parking_spot', 'number_of_tickets') If you wanted to subset out 2 columns, what code could you use (choose all that apply) (By subsets, I mean show just 2 of the 4 columns, not the entire dataframe
tickets.loc[:,['name', 'number_of_tickets']] OR tickets[['name', 'number_of_tickets']]
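Both forms return the same two columns. A sketch with a made-up `tickets` dataframe:

```python
import pandas as pd

# Hypothetical data with the four columns named in the question
tickets = pd.DataFrame({"name": ["Ann", "Bo"],
                        "address": ["1 Main St", "2 Oak Ave"],
                        "parking_spot": ["A1", "B7"],
                        "number_of_tickets": [3, 0]})

subset_loc = tickets.loc[:, ['name', 'number_of_tickets']]  # all rows, two named columns
subset_brackets = tickets[['name', 'number_of_tickets']]    # a list of column names in brackets

print(subset_loc.equals(subset_brackets))  # True
```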
Usually a programmer will use conventional names when importing packages. But it is not strictly necessary. numpy for example can be imported as: import numpy as humpty_dumpty
true
NumPy allows us to do more complicated math on lists and other data structures, and is used in most of the more advanced modules we will use (such as pandas)
true
pandas allows us to use multiple different data types (like objects and numbers) in a single table.
true
pandas can be imported as import pandas as pd
true
pandas has functionality to work with complicated dates.
true
Which is the correct IF statement to determine if you're accelerating, decelerating, or staying at constant velocity?
x= -0.4 if x == 0: [TAB]print( "you're cruising") if x > 0: [TAB]print("you feel the need, the need for speed!") if x < 0: [TAB]print("you're losing speed!")