S364 Review

Lakukan tugas rumah & ujian kamu dengan baik sekarang menggunakan Quizwiz!

To remove the "Name" and "Sex" columns completely from titanic_df (meaning that the dataframe itself is changed), what should be done to the arguments in the drop() method? Hints: Select ALL CORRECT This question assumes that we will type one line of code (not two separate lines)

"Name" and "Sex" must be passed in as a list axis must be set to be 1 inplace must be set to be True

Given the following classification matrix where True is the desired outcome, what would be the Accuracy under the naïve rule? Enter your answer in the space below. Predicted ActualFalseTrueFalse32080True180420

.6

Given the following classification matrix where True is the desired outcome, compute the Sensitivity. Enter your answer in the space below. Predicted ActualFalseTrueFalse32080True180420

.7

To reference (retrieve) a value by row name (Row Index value) we would use what method?

.loc()

Source: Lecture examples, slides, https://www.w3schools.com/python/pandas/pandas_getting_started.asp (Links to an external site.) data_set = {'sodas': ["Coke","Pepsi","Mt. Dew"],'passings': [10,20,30]} myvar = pd.DataFrame(data_set) The Row Index values for the Pandas Dataframe myvar would be?

0,1,2

Use a Pandas function learned to find out the number of passengers in the 2nd class cabin. How many are they? (Use the Pclass column) 1) Create a Pandas DataFrame called titanic_df by reading in the csv file

184

The lift charts shown below were produced by using the partitioning (decision tree) algorithm on the eBay auction data. The proportion of "competitive" auctions in the validation data is 56%. If 1,000 new auctions are scored with the model and the highest 30% of the scored auctions are examined, approximately how many "competitive" auctions are expected to be found? Competitive = 1Non-competitive = 0

260

What happens after groupby() is called on a DataFrame?

A DataFrameGroupBy object is created and returned

Which among the following options can be used to create a Data Frame in Pandas?

A NumPy array A dictionary A list

The following descriptive statistics were obtained on the variable X1: Mean = 22.5Standard deviation = 1.50 One observation of X1 in the data set was found to be 38. What would you recommend to the analyst regarding this observation before a predictive model was developed?

A domain expert should be consulted to see whether this observation is within the realm of possibility.

What need to be done to add a legend to a chart?

Add label keyword with the legend to be displayed in plot() function and call the legend() function

A supervised data mining model was to be created on a data set with a continuous target (dependent) variable and 10 continuous predictor variables. Which of the following chart(s) would be most appropriate for exploring the distribution of each of the continuous variables to check for outliers, skewness, and other concerns? (Check all that apply.)

Box plots Histograms

In a regression equation, if the independent (predictor) variable is measured in kilograms, then the dependent variable:

Can be any units.

You collected some survey data. All answers to questions are choices of 5 options: strongly disagree, disagree, neutral, agree, strongly agree. During data preparation, you need to code the text values into numbers. Which of the following is the best way to code the data?

Code the values strongly disagree,disagree, neutral, agree, strongly agree into 1, 2, 3, 4, 5 respectively

Consider the following variable contained in a data set of 2,500 observations: Location1 = Eastern U.S.2 = Southeastern U.S.3 = Midwest U.S.4 = Southwestern U.S.5 = Northwestern U.S. The variable Location is to be used as a predictor (or independent variable) in a multiple regression. Which of the following approaches to using Location would be correct and efficient?

Create four categorical predictors from the Location variable.

If custom row indexes and customer column titles are not provided the index and column titles will default to "blanks" or Null values.

False

Nominal (or categorical) variables such as gender, color, location (east, west, etc.) should not be used as predictors in regression models. Only continuous or ordinal numeric variables are appropriate for regression.

False

Pandas does not have it's own built in Statistical functions and we have to use other libraries or modules to perform statistical analysis.

False

The following code will print a Pandas Data Frame sorted by the column A data = np.random.normal(1, 1, (4, 4))# mean = 0, std = 1, 3 x 3 row = ["N", "S", "E", "W"]col = ['A', 'B', 'C', 'D']frame = pd.DataFrame(data, index=row, columns=col) frame.sort_values(by=['A']) print(frame)

False

Unlike a Pandas DataFrame a Pandas Series cannot use strings as the row index

False

If you have titanic_df, which of the following columns should not by used as a key to group the rows by as is?

Fare

Create a bar chart using Survived and Sex with titanic_df, based on the chart, which of the following statement is the most accurate?

Females are more likely to survive. Among all the survived, the number of females is almost twice as that of males.

Assume the following linear regression model to predict the value of a used car: Car_value = 1.45 - 0.756*Mileage - 1.59*Age The unit of the Car_value is $1. Which of the following statements is true related to the coefficient?

Increasing the age by one unit will decrease the value of the car by $1.59.

You have a dataset with customer information. One of the columns is education which has 6 distinct text values such as high school, associates, bachelor, master, professional and PhD. Which of the following is the most appropriate way to handle the data?

Map the 6 text values into 6 distinct numbers in order such as: high school: 1associates: 2bachelor: 3master: 4professional: 5PhD: 6

A data set has 10,000 records and 30 predictor variables (columns). Each variable has 5% of the values missing for that individual variable. The missing values are spread randomly and independently throughout the data set. The analyst uses a predictive model that automatically drops any row (record) that has even a single missing values on any of the variables. How many records would be dropped from the analysis?

More than 7,500

In the titanic data, which of the following column or columns should not be used as an argument to the pandas cut() or qcut() function? Select ALL CORRECT ones.

Name Pclass Survived

Assume the RMSE of a linear regression model on the training data is 52.5 and on the validation data is 41.03. Is the model overfitted?

No

Which of the following statements about the data structures are true? Select ALL CORRECT answers.

Pandas can store heterogeneous data, meaning different columns in a dataframe can have different data types Numpy arrays can store homogeneous data only. That's one of the reasons why calculations with numpy arrays are so efficient.

What is the best interpretation of r-squared in a linear regression?

R-squared indicates the proportion of variance in the target variable explained by the predictors.

In the following table based on a classification analysis, if the cutoff value for class=Owner is increased from 0.5 to 0.7, what is most likely to happen to the sensitivity and specificity?

Sensitivity will decrease and specificity will increase.

Which of the following measures of classification performance would you recommend where the goal was to predict positive cases and the cost of false positives was much greater than the cost of false negatives?

Specificity

Some data mining algorithms require that variables are standardized (sometimes called normalized) to zero mean and standard deviation of 1.0. What is the reason for this?

Standardization is useful when it is important that variables with large values do not dominate measures being used.

Collinearity occurs when two or more independent variables are ______________ with (from) each other in multiple regression.

Strongly correlated

The IRS found that using analytics improved its ability to find tax cheats. What type of model was the IRS most likely using?

The IRS is using predictive analytics.

The four V's of Big Data include: (Check all that apply.)

The challenges related to big data have been categorized as: volume, velocity, variety, and veracity.

In pandas functions or methods such as dropna(), fillna(), replace() etc. you can use a keyword argument inplace. What of the following statement about this argument is INCORRECT?

The default value is True

The following results were obtained for the training and validation data sets of a model developed to predict fraud. What do these results suggest?

The model was overfit.

Which statement best describes what information can be obtained from an ROC curve?

The tradeoff between true positives and false positives.

To get an "honest" or true estimate of the classification error in a binary prediction model, the classification matrix should be constructed using:

The validation data

Which of the following statements about 2 X 2 classification matrices is most accurate?

There is a tradeoff between sensitivity and specificity.

When using predictive analytics models, it is important to have a sufficient number of records to insure that stable results can be achieved. (The absolute minimum recommended is 10 observations (rows) per predictor (column) in the training data set.) A predictive study was planned with 20 predictor variables. The analyst planned to create training, validation, and testing sets with the following proportions: 50% training; 30% validation; and 20% testing. What is the total minimum number of records that should be available for this study?

There should be at least 400 records.

To retrieve values from a Pandas Data Frame we can reference rows and columns by name or by index value.

True

You have a dataset with employee compensation information. One column has employee department data. Employees are from 4 department: Sales, Marketing, Accounting and HR. These 4 values are associated with employees. You know that the data is clean except that department values are text. What is the most appropriate step of data cleaning related to this column?

Use pd.get_dummies() to convert this column into 4 dummy variables.

Assuming we have import the two needed libraries (NumPy and Matplotlib),the following code would produce which graph? xpoints = np.array([0, 6])ypoints = np.array([0, 250])plt.plot(xpoints, ypoints)plt.show()

a)

The last two print statements will return the same value because of what reason? data = {"calories": [420, 380, 390],"duration": [50, 40, 45]} #load data into a DataFrame object:df = pd.DataFrame(data) print(df.loc[1])print(df.iloc[1])

because a row index was not specified

Match colors to their abbreviations used in matplotlib

blue b Correct!green g Correct!red r Correct!black k Correct!white w Correct!yellow y

To retrieve the "Service" and "phd" columns of the following Pandas Data Frame we would the following code: Data Frame Name: df

filteredDF = df[['service', 'phd']]

The code s1 = pd.Series([3, 0.5, 0.7, 0.9]) Will create a series of what datatype?

float

Which of the following methods or attributes of a dataframe can tell you the data type of each column in a dataframe? Note, a method is a function associated with an object. It is called using syntax object.methodname(). For example, list1.append(1), append() is a List method. A attribute is a property of an object. It is called using the syntax object.attributename. (No parentheses) For example, arr.ndim. ndim is an attribute of a numpy array's

info()

The .plot() method of Matplotlib creates what type of graph when no additional parameters are given? i.e. data.csv Download data.csv df = pd.read_csv('data.csv')df.plot()plt.show()

line graph

To access elements or subsets of a dataframe, we learned to use index operator [], loc[] operator and iloc[] operator. Which of the following can be used to retrieve a row from a dataframe? Be careful answering this question, it is best to test it out in Jupyter notebook. Select ALL CORRECT answers.

loc[] with the row index value iloc[] with the positional integer starting with 0

To add symbols (circles, diamonds, squares, etc) to the datapoints in a Matplotlib graph we use what parameter?

marker

How is missing values represented in Pandas dataframes and series?

np.nan or nan (Sometimes printed out as NaN)

Which of the following code will create a pivot table like this (the pivot table image is created in Excel. You don't need to create a pivot table to be exactly like this. But the data and the layout should be the same as in the image)

pd.crosstab([titanic_df['Survived'], titanic_df['Pclass']], [titanic_df.Sex])

Which of the following about missing values in Pandas dataframes is not true?

pd.isnull() can be used to check whether a value is missing assuming import pandas as pd was executed already. There is no such function pd.isna()

If you have a dataframe named a like the following, which of the follow code will covert the day column into a numeric column? (Confirm your answer with the notebook) dayvisitors01d10013d25025d36731d131

pd.to_numeric(a['day'].str.replace('d', ''))

You have a dataframe df. The first 4 rows look like this below: itemPricenumber0Cap$7.99101Shorts$8.5202T-shirt$10.79153Socks$1.9930 You want to calculate the total revenue but realized that the price column is a text one. What should you do to make the calculation work? Select ALL correct answers.

pd.to_numeric(df['Price'].str[1:]) pd.to_numeric(df['Price'].str.replace("$", ""))

To create 4 charts in 2 rows and 2 columns and work on the chart on the lower right corner. What code should you write?

plt.subplot(2, 2, 4)

Which of the following code would add a x-axis label to your chart?

plt.xlabel('Age')

The following code would be used to reference a Row in a Pandas Data Frame by the name of the name of the Row data = np.random.normal(0, 1, (4, 4))# mean = 0, std = 1, 3 x 3 x = ["N", "S", "E", "W"]y = ['A', 'B', 'C', 'D']frame = pd.DataFrame(data, index=x, columns=y)

print(frame.loc["N"])

Source: Lecture examples, slides, https://www.w3schools.com/python/pandas/default.asp (Links to an external site.) If the file called data.txt was a comma separated value file (values in the file are separated by commas) the appropriate Pandas method to import the data into a DataFrame would be which? (Replace the 4 question marks with what?) df = pd.????('data.txt')

read_csv

Which Pandas function from the options given below can read a dataset from a large text file?

read_csv

Which of the following methods can be used to handle inconsistent data according to the data preparation video?

replace()

What is the datatype for the object/variable myvar below? calories = {"day1": 420, "day2": 380, "day3": 390}myvar = pd.Series(calories)

series

What is the datatype for the object/variable myvar below? a = [1, 7, 2]myvar = pd.Series(a)

series

Use titanic_df, which of the following code lines creates meaningful boxplots that shows the distribution of passenger age by their cabin class and gender? Select ALL correct ones. (Think about why)

sns.boxplot(x='Pclass', y='Age', data=titanic_df, hue='Sex', palette='Set3') sns.boxplot(x='Sex', y='Age', data=titanic_df, hue='Pclass', palette='Set3')

ls or linestyle are keyword arguments to control the line style. Match the argument value and the style represented by it.

solid line - (dash) dashed line -- (2 dashes) dotted line : colon dash-dot line -. (a dash and a dot)

Which of the following code will return only the columns of Age and Fare from the titanic_df dataframe? Select ALL correct answers. 1) Create a Pandas DataFrame called titanic_df by reading in the csv file

titanic_df[['Age', 'Fare']] titanic_df.iloc[:, [5, 9]]

A Pandas Series is a one-dimensional array which is labeled and can hold any data type.

true

In pandas functions or methods such as drop(), sort_values(), mean() etc. you can use a keyword argument axis to specify the operations to be on rows or columns, which of the following are true relating to the axis argument? Select ALL CORRECT ones. Hint: Week10_K303-Pandas_LAMBDAS_ALL_COLUMNS.ipynb or Google ;-)

axis=0 in drop() drops rows if axis is not specified, it is equivalent to axis = 0

To print the first 5 rows of the following Pandas Data Frame we use df = pd.read_csv("Salaries.csv")

df.head() df.tail() df.head(5) df.tail(5) Correct! More than one of the options would work.

The following .dropna() code will do what to the dataframe called df? df = pd.read_csv('data.csv')df = df.dropna()

drop rows that have one or more NaN (NULLS) in the row

dropna() with no arguments does which of the following?

drops all the rows with at least one missing value in that row


Set pelajaran terkait

HIST 150 Chapter 15: Reordering the World, 1750-1850

View Set

Procesos de Dirección de Proyectos

View Set

Unit 2 questions from previous test

View Set