Python for Data Science

What is the symbol that denotes Magic Functions in Jupyter?

%

Assuming the tuple below: tup1 = ('physics', 'chemistry', 1997, 2000, 2001, 1999) What is the result of tup1[1:4]?

('chemistry', 1997, 2000)
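A minimal sketch of the slicing rule behind this answer: a slice `[start:stop]` includes `start` and excludes `stop`. The names `middle` and `third` are illustrative only.

```python
tup1 = ('physics', 'chemistry', 1997, 2000, 2001, 1999)

# The slice keeps indices 1, 2 and 3; index 4 (the stop value) is excluded.
middle = tup1[1:4]

# Indexing is zero-based, so index 2 is the third element.
third = tup1[2]

print(middle)  # ('chemistry', 1997, 2000)
print(third)   # 1997
```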

Using markdown cells in Jupyter, how do you format text as bold?

**word**

Which Root Mean Square Error (RMSE) would represent a perfect prediction with no errors in regression?

0
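To see why 0 is the perfect score, here is a small sketch that computes RMSE by hand. The helper name `rmse` and the sample values are assumptions made for illustration, not part of the course material.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error of two equal-length sequences of numbers."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Every prediction matches its target, so every error term is zero.
perfect = rmse([3.0, 5.0, 7.0], [3.0, 5.0, 7.0])

# Every prediction is off by exactly 1, so the RMSE is 1.
off_by_one = rmse([3.0, 5.0, 7.0], [4.0, 6.0, 8.0])
```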

What is an example of Unix time (int64)?

1138537770
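Unix time counts seconds since 1970-01-01 00:00:00 UTC. The sketch below converts the value from this card with the standard library; the variable names are illustrative.

```python
from datetime import datetime, timezone

unix_seconds = 1138537770

# Interpret the integer as seconds since the Unix epoch, in UTC.
moment = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
iso_text = moment.isoformat()

# Converting back recovers the original integer.
round_trip = int(moment.timestamp())
```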

Assuming the code below: tup1 = ('physics', 'chemistry', 1997, 2000, 2001, 1999) What is the result of: tup1[2]

1997

If you have a dataframe titled 'dat' with 5 rows and 2 columns and you run the following line of code, how many boolean values are returned? dat.isnull().any()

2
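The count is 2 because `any()` reduces each column to a single boolean. A small sketch with made-up data (the column names `a` and `b` are assumptions):

```python
import numpy as np
import pandas as pd

# 5 rows, 2 columns; only column 'a' contains a missing value.
dat = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, 4.0, 5.0],
    'b': [10, 20, 30, 40, 50],
})

# isnull() gives a 5x2 boolean frame; any() collapses each column to one value.
has_missing = dat.isnull().any()
```

Here `has_missing` is a Series with one entry per column.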

If you try to access rows of a 3-by-3 numpy array called "arr" using the command: arr[:2,] How many rows will be returned?

2

You are given the following lines of code: arr = np.array([[1,2,3],[4,5,6],[7,8,9]]) slice = arr[:2,1:3] slice[0,0] What is the result of "slice[0,0]" in your last line of code?

2

In an RGB image, which three values specifying a pixel's color correspond to the color white?

255, 255, 255

Assuming the list below: mylist = [6, 8, 12, 13] What is the output from the following code snippet? mylist[3] % mylist[1]

5

What is the default number of rows that the function head() will return for a dataframe in pandas?

5

What does the "within-cluster sum of squared error" provide?

A mathematical measure of the variation within a cluster.

What is the definition of 'data mining'?

Activities related to finding patterns in databases and data warehouses.

For a classification problem, if you want to predict the letter grade that a student would receive, what are 2 examples of reasonable input data to consider?

Amount of time spent studying. Percentage grade these students received in the previous semester.

What is the next step in building a classification model after the model is constructed and parameters are adjusted?

Apply model to new data

How do you assign each sample in a dataset to a centroid using the k-means algorithm?

Assign the sample to the cluster with the closest centroid.

As an example, you have a dataset containing numerical values of subjects' heart rates during exercise and categorical values describing how much they smoke. You want to determine whether smoking and heart rate are related. What machine learning category would this fall under?

Association analysis

How do you determine the new centroid of a cluster?

Calculate the mean of the cluster
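The two k-means steps on these cards (assign each sample to the closest centroid, then recompute each centroid as the cluster mean) can be sketched in plain Python. This is an illustrative single iteration under simple assumptions (Euclidean distance, hypothetical helper names, tiny made-up data), not scikit-learn's implementation. It also computes the within-cluster sum of squared error.

```python
def squared_distance(p, q):
    """Squared Euclidean distance; its minimum picks the same centroid as the distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign_to_nearest(samples, centroids):
    """Return, for each sample, the index of its closest centroid."""
    labels = []
    for s in samples:
        distances = [squared_distance(s, c) for c in centroids]
        labels.append(distances.index(min(distances)))
    return labels

def update_centroids(samples, labels, centroids):
    """Move each centroid to the mean of its members (unchanged if it has none)."""
    updated = []
    for j, old in enumerate(centroids):
        members = [s for s, lab in zip(samples, labels) if lab == j]
        if not members:
            updated.append(old)
            continue
        dims = len(old)
        updated.append(tuple(sum(m[d] for m in members) / len(members)
                             for d in range(dims)))
    return updated

samples = [(1.0, 1.0), (1.0, 3.0), (9.0, 9.0), (11.0, 9.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]

labels = assign_to_nearest(samples, centroids)
new_centroids = update_centroids(samples, labels, centroids)

# Within-cluster sum of squared error for the updated centroids.
wcss = sum(squared_distance(s, new_centroids[lab])
           for s, lab in zip(samples, labels))
```

A full k-means run repeats these two steps until the assignments stop changing.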

Is age group a numeric or a categorical variable?

Categorical

For example, you want to predict the number of kids someone will have: either 0, 1, 2, or 3+. Is this an example of regression or classification?

Classification

What is the function KMeans (from sklearn) used for in the example soccer data analysis?

Clustering

Which of the following is NOT a step in the formal data science process?

Collaborate

When thinking about data visualization, what element is critical in making you value the visualization?

Context

Why are decision boundaries of a decision tree parallel to the axes formed by the variables?

Each split considers only a single variable

Given the following: Regular expression: '.*\,(.*)\&.*' String in Series: "Feuer, Eis & Dosenbier" What would be the output of the extract function with this string and regular expression?

Eis (the captured group is ' Eis ', including the spaces on either side of the word)
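The capture can be checked with Python's `re` module, which uses the same pattern syntax. Note that the group keeps the spaces around the word, so a `strip()` is needed to get the bare word. Variable names are illustrative.

```python
import re

pattern = r'.*\,(.*)\&.*'
text = "Feuer, Eis & Dosenbier"

# Group 1 is everything between the comma and the ampersand.
match = re.match(pattern, text)
captured = match.group(1)

# The raw capture still has a leading and a trailing space.
cleaned = captured.strip()
```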

You are given a data set that contains crime reports for different neighborhoods in New York City and told to analyze it. Would you consider this a declarative or explorative example?

Explorative

What should you do first after putting together the data needed for an application?

Explore the data you put together

Changing an element of an array slice in numpy will NOT change the original array.

False
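The statement is false because basic slicing returns a view onto the same memory. A short sketch (the name `view` is used instead of `slice` to avoid shadowing the Python built-in):

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Rows 0-1, columns 1-2: a view, not a copy.
view = arr[:2, 1:3]
first = int(view[0, 0])       # same element as arr[0, 1]

# Writing through the view changes the original array.
view[0, 0] = 99
changed = int(arr[0, 1])

# An explicit copy is independent of the original.
independent = arr[:2, 1:3].copy()
independent[0, 0] = -1
```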

Cluster analysis is a supervised task.

False

Code in a Jupyter code cell is restricted to one line.

False

Regression is an unsupervised task.

False

Test data is the same dataset as training data in classification models.

False

True or False: The function call train_test_split(a, b) where a and b are dataframes will always output the same result.

False
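The result varies because scikit-learn's `train_test_split` shuffles the rows by default; passing its `random_state` argument makes the split repeatable. The sketch below illustrates the idea with a hypothetical helper, `simple_split`, written with the standard library. It is not scikit-learn's implementation.

```python
import random

def simple_split(rows, test_fraction=0.25, seed=None):
    """Hypothetical helper: shuffle a copy of rows, then cut off a test portion."""
    rng = random.Random(seed)          # seed=None gives a different shuffle each run
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(20))

# The same seed reproduces the same split.
train_a, test_a = simple_split(rows, seed=0)
train_b, test_b = simple_split(rows, seed=0)
```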

When you concatenate two dataframes using pandas concat function, the number of resulting columns will be the columns that BOTH dataframes have. As an example, if one dataframe has columns titled 'cat','dog' and another dataframe has columns titled 'dog','bunny', then the resulting dataframe columns will be 'dog'.

False
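By default `pd.concat` keeps the union of the columns and fills the gaps with NaN; the intersection is only used when `join='inner'` is requested. A sketch with made-up values:

```python
import pandas as pd

left = pd.DataFrame({'cat': [1, 2], 'dog': [3, 4]})
right = pd.DataFrame({'dog': [5, 6], 'bunny': [7, 8]})

# Default join='outer': all three column labels appear.
outer = pd.concat([left, right])

# join='inner': only the column both frames share.
inner = pd.concat([left, right], join='inner')
```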

Suppose you are looking at a data set consisting of how much students liked a particular class. The rows are students. There are two columns: one asks if the student has taken the class and the other asks to rate the class on a scale of 1-10. However, you notice that some students have not taken the class and therefore do not want to include their ratings. Which data filtering technique would you use to clean this dataset?

Filter out rows

What 2 things are most important in creating elegant visualizations?

Focus on what is relevant. Remove anything which isn't adding to the figure.

Which library can you use to easily create geographic overlays?

Folium

What is the difference between supervised and unsupervised approaches?

In supervised approaches, the target is provided. In unsupervised approaches, the target is unavailable.

How do we show the histogram created with graph.hist(...)?

Include "%matplotlib inline"

What are the 3 reasons that data scientists working in Python use numpy all the time?

Its speed. Its functionality. Many packages rely on numpy.

Which layer of the 3 layer matrix of colors corresponds to the color blue when working with images in Numpy?

Layer 2 (zero-indexed: layers 0, 1 and 2 hold red, green and blue)

In a decision tree, which nodes do NOT have test conditions?

Leaf nodes

What is an example of an insight turned into action?

Marketing a new product based on past sales information

Which of the following does a boxplot NOT show you?

Mean

Which is NOT a quality of good data visualization according to Andy Kirk?

Meaningful

In the case study on cholera, people attributed poor smells to the source of the disease and thought weaker people were more susceptible to the disease. When drawing these conclusions, what mistake was made?

Mistaking correlation for causation.

If you create a DataFrame using pandas by accessing a column label that doesn't exist, what values are present in that column?

NaN

Which of the following are benefits of ndarrays over lists? Select 3.

Ndarrays are more space efficient. Ndarrays are more optimized for memory. Ndarrays often have faster computation.

What is the first step with any dataset?

Perform an initial exploration.

Which is an example of conceptually driven data visualization?

Physicists visualize the well-understood relationship between force and acceleration to teach introductory physics students.

As an example, let's consider a data set consisting of two variables, one representing how long students spend studying and the other representing their average test scores. Assuming students study effectively, how would you expect these two variables to be correlated?

Positively correlated because I would expect the test scores to go up if they spend more time studying.

What 2 statements describe classification in the context of machine learning?

Predict the category of the target given input data. Supervised task.

How would you initially handle an anomaly (apparent outlier) in cluster analysis?

Provide further analysis on the anomaly

What is the difference between regression and classification for machine learning in Python?

Regression is used to predict a numeric value while classification is used to predict a categorical value.

What is the correct word to describe an instance of an entity in your data?

Sample

What are ways to manipulate the cleaned data into the format needed for analysis?

Scaling. Transformation. Feature selection. Dimensionality reduction. Data manipulation.

Which graphing method should you use to visualize the correlation between two arrays?

Scatter plot

What are the 2 main data structures in pandas?

Series. DataFrame.

Which is NOT mentioned in the course as a common similarity measure in cluster analysis?

Sine similarity

In what 3 ways can you quickly access numpy array elements?

Slicing. Using an array of indices. Boolean indexing.

What is the first step in constructing a decision tree?

Start with all samples at a node.

In general, are classification and regression often supervised or unsupervised approaches?

Supervised

Think back to France's Russian Campaign case study. What made this visualization so special?

The figure represented many encoded types of data to depict a story.

What does the following function call return? accuracy_score(y_true = data_test, y_pred = predictions)

The fraction of correctly classified samples.
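The fraction itself is simple to compute by hand. The helper below, `accuracy`, is a hypothetical stand-in written for illustration with made-up labels; it is not scikit-learn's code.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Three of the four predictions match the true labels.
score = accuracy(['A', 'B', 'B', 'C'], ['A', 'B', 'C', 'C'])
```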

For the merge() function in pandas, how does the parameter "how" handle row indices when how='inner'?

The function takes the intersection of the row indices

How do you determine the size of a decision tree?

The number of nodes in the tree

When working with cells in Jupyter, what does "_" refer to?

The output of the last cell executed.

What does the parameter unit refer to in the to_datetime function?

The unit of the input

What requirement is needed to add two numeric numpy arrays?

They need to have the same or compatible dimensions.

What does a negative correlation score mean?

Those features in our dataset are inversely correlated

In building a machine learning model, why do we want to adjust the parameters?

To reduce the model's error

A Root Mean Square Error (RMSE) higher than our mean value would be too high. (Assume all values are positive)

True

Elements in numpy arrays must be all the same type.

True

Final clusters are sensitive to initial centroids.

True

It works out better mathematically to measure the impurity of a split in a decision tree, rather than the purity.

True
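As one possible illustration, the Gini index is a commonly used impurity measure: 1 minus the sum of squared class proportions at a node. The card does not say which measure the course uses, so treat the choice of Gini and the helper name `gini_impurity` as assumptions.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

# A node holding a single class is perfectly pure: impurity 0.
pure = gini_impurity(['yes', 'yes', 'yes', 'yes'])

# A 50/50 node is the most impure case for two classes: impurity 0.5.
mixed = gini_impurity(['yes', 'yes', 'no', 'no'])
```

A split is then judged by how much it lowers the impurity of the resulting nodes.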

Outliers can sometimes be critical to finding convincing answers when analyzing data.

True

The target variable is always categorical in classification.

True

When you import a dataset using the read_csv function in pandas, the rows of the dataset are Series.

True

ndarrays are mutable.

True

Assuming the tuple below: tup1 = ('physics', 'chemistry', 1997, 2000, 2001, 1999) What is the result of running tup1[2] = 1998 followed by print(tup1)?

TypeError: 'tuple' object does not support item assignment
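A short sketch that triggers the error and shows one way to get a modified tuple: build a new one from slices. The name `updated` is illustrative.

```python
tup1 = ('physics', 'chemistry', 1997, 2000, 2001, 1999)

# Tuples are immutable, so item assignment raises TypeError.
try:
    tup1[2] = 1998
    message = None
except TypeError as err:
    message = str(err)

# Build a new tuple instead of modifying the old one in place.
updated = tup1[:2] + (1998,) + tup1[3:]
```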

Look at the following code: b = np.array([1,2,3]) b[1] = 'one' What error prints out after you run these two lines of code?

ValueError
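The error arises because the array has an integer dtype and NumPy cannot convert the string 'one' to an integer. A minimal sketch that captures the exception type (variable names are illustrative):

```python
import numpy as np

b = np.array([1, 2, 3])       # integer dtype inferred from the values

try:
    b[1] = 'one'              # cannot be converted to an integer
    raised = None
except ValueError as err:
    raised = type(err).__name__

# The failed assignment leaves the array unchanged.
contents = b.tolist()
```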

In the parallel_plot function, what was represented on the y-axis of the resulting plot?

Values of each cluster center

Which of the following statistics does the describe() function NOT return on dataframe columns?

Variance

When is a prediction task referred to as simple linear regression?

When there is only one input variable.

When is it NOT acceptable to omit axis labels in plots made with matplotlib?

When you are presenting non-intuitive results to another person.

When would you use the machine learning technique 'regression'?

When your model has to predict a numerical value.

What is the difference between deleting a column and popping a column?

You can store a popped column.
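The difference shows up in the return value: `pop` removes the column and hands it back as a Series, while `del` removes it and returns nothing. A sketch with a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# pop removes column 'b' from df and returns it, so it can be stored.
popped = df.pop('b')

# del removes column 'c' in place; the values are gone.
del df['c']

remaining = list(df.columns)
```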

What is the result of the following lines of code? a=np.array(["cat","dog","fish"]) b=np.array(["dog","fish","rabbit"]) print(np.setdiff1d(a,b))

['cat']

What will be printed by the following code snippet? x = [1, 2, 3] y = x x[1] = 42 print (y)

[1, 42, 3]
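The result follows from `y = x` binding a second name to the same list rather than copying it. The sketch below contrasts the alias with a shallow copy; the name `z` is added for illustration.

```python
x = [1, 2, 3]
y = x          # alias: both names refer to the same list object
z = list(x)    # shallow copy: a separate list with the same items

x[1] = 42

print(y)  # [1, 42, 3]
print(z)  # [1, 2, 3]
```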

What is true about data science?

A model is generated that leads to insights and can be improved. Management of data. Seeing how everything is connected.

What command allows you to sum all of the elements in a rank-2 ndarray called "a"?

a.sum()

Select two valid ways to get the odd values of an array "arr."

arr[(arr % 2 != 0)]; arr[(arr % 2 == 1)]

You are given the following lines of code: arr = np.array([[1,2,3],[4,5,6],[7,8,9]]) slice = arr[:2,1:3] slice[0,0] What element in arr is equivalent to slice[0,0]?

arr[0,1]

Given the code below: arr = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12]]) Which of the following two commands below produce the same result?

arr[0:1,1:3]; arr[:1,1:3]

How would you change the number 5 to 7 in this matrix? arr = np.array([1,2,3,4,5])

arr[4] = 7

What is the correct way to access elements of an array "arr" that are less than 0?

arr[arr<0]

What is the result of the following line of code? import numpy as np np.unique([1,1,3,4,2,3,3])

array([1, 2, 3, 4])

What is the output of the following broadcasting call? A = np.array([[1],[2]]) B = np.array([[1,2],[3,4]]) A + B

array([[2, 3], [5, 6]])
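The broadcasting step can be made visible by stretching A to B's shape explicitly. A has shape (2, 1), so its single column is repeated across B's two columns. The name `stretched` is illustrative.

```python
import numpy as np

A = np.array([[1], [2]])            # shape (2, 1)
B = np.array([[1, 2], [3, 4]])      # shape (2, 2)

# Broadcasting repeats A's single column to match B's shape.
stretched = np.broadcast_to(A, B.shape)

# A + B gives the same values as adding the stretched array.
result = A + B
```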

Which analysis technique has the goal of finding a set of rules to capture associations between items or events?

association analysis

Which command allows you to edit the view of the axes on a matplotlib plot?

axis()

Which function call will allow you to group a dataset (titled dat) by 'time'?

dat.groupby('time')

What is the command to get the number of rows in a data set titled "data"?

data.shape[0]

What is the general syntax for calling the mode() function on a dataframe?

data_frame.mode()

What does calling the function dropna() return from a dataframe?

dataframe omitting NA values

What are the ingredients to form a data science problem?

Define what it is you're trying to tackle. Assess the situation with respect to the problem. Define your goals and objectives.

What does the parameter expand represent in the split function?

determines if the output is a dataframe

Which of the following will display the first 7 rows in a data frame object named df?

df.head(n=7); df.head(7); df[:7]

What is the function call to find cells in a dataframe df with timestamp on 2007-02-04, given the dataframe has a parsed time column labelled 'parsed_time'?

df['parsed_time'] == '2007-02-04'

Which function allows you to clean a data set by dropping NA values?

dropna()

Data science only generates actionable information for the future.

false

In machine learning, algorithms and programs directly aim to learn a given task.

false

Once you have devised an action based on an insight, there is no need to continue collecting data.

false

Raw data from sources can always be directly used to perform analysis.

false

What is NOT important in collecting data?

how the data was presented

Which algorithm to build classification models relies on the notion that samples with similar characteristics likely belong to the same class?

kNN

While lists access values using an index, dictionaries access values using a(n) _____.

key

What type of object does the function KMeans output?

kmeans

Which graph is best for showing how values in your data change over time?

line graph

Which parameter in the KMeans clustering algorithm do you have to specify for the number of clusters you want?

n_clusters

There is a syntax error in the code below. np.array( [11,12,13],[21,22,23] ) How would you fix it to create the intended 2x3 array?

np.array( [[11,12,13],[21,22,23]] )

How do you create a rank-1 array with numpy using the numbers 1, 2, 3?

np.array([1,2,3])

To use scikit-learn: DecisionTreeRegressor, train_test_split, and mean_squared_error, which of the following libraries are necessary? (Choose the best two)

pandas; scikit-learn

For which of the following scenarios would you use the analysis technique of classification?

predicting the weather

Which is NOT a way you can access data at a certain location using pandas series? As an example, let's say the series (called ser) has indices 'apple' and 'orange'.

ser.loc['apple','orange']

Given the following code has run successfully, import pandas as pd ser = pd.Series([100, 200, 300, 400, 500], index = ['tom', 'bob', 'nancy', 'dan', 'eric']) Which of the following calls have the same output?

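ser.loc[['tom','bob']]; ser[[0,1]]; ser.iloc[[0,1]]; ser[['tom','bob']] (note: ser[[0,1]] relies on treating integer keys as positions on a label-indexed Series, which recent pandas versions deprecate; ser.iloc[[0,1]] is the explicit form)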

What are the 3 main string operations?

split, contains, extract

Which of the following is true about a model?

trained by the training data set

Data science is an iterative process.

true

Take a look at the following lines of code: a = np.array([2, 3]) b1 = np.array([1]) b2 = 1 True or False: a+b1 and a+b2 result in the same ndarray.

true

True or False: In Python, you cannot insert duplicates in sets.

true

What will be the output of the following lines of code? d = {'one' : pd.Series([100.,200.], index=['apple','orange']), 'two' : pd.Series([111.,211], index=['apple','orange'])} df = pd.DataFrame(d) filter = df['one']>100 filter.any()

true

When you search an incorrectly spelled term online, suggested words is an example of machine learning.

true

What is the function call to output the name of columns of a dataframe named x?

x.columns

What is the appropriate input for the following line of code to make a linear regression prediction? y_prediction = regressor.predict(___)

x_test

You are given a dataframe labeled x where the column 'number' indicates the index of a record. Which function call would create a new dataframe y that takes more than 10 samples from x if x has 100 records?

y = x[(x['number']%5)==0]

