Final study guide

Which of the following regular expressions would match this string: "123A45"

/[0-9]+A[0-9]+/
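The answer above can be checked with Python's re module. This is a minimal sketch; the second test string "123A" is an illustrative assumption, not part of the original question.

```python
import re

# One or more digits, a literal "A", then one or more digits.
pattern = re.compile(r"[0-9]+A[0-9]+")

# fullmatch succeeds only if the entire string fits the pattern.
matches_target = pattern.fullmatch("123A45") is not None
matches_without_trailing_digits = pattern.fullmatch("123A") is not None

print(matches_target)                   # True
print(matches_without_trailing_digits)  # False
```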

Assuming you have the following document/term matrix: What is the relevance of the query Q={T2, T3} to document D2 using the TF.IDF scores?

1.5

Decimal number 101 corresponds to which binary number?

1100101
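One way to arrive at this answer is repeated division by 2, collecting the remainders. The helper name to_binary below is hypothetical and used only for illustration.

```python
def to_binary(n: int) -> str:
    """Convert a non-negative integer to a binary string by repeated division by 2."""
    if n == 0:
        return "0"
    bits = []
    while n > 0:
        bits.append(str(n % 2))  # the remainder is the next least-significant bit
        n //= 2
    return "".join(reversed(bits))

print(to_binary(101))  # 1100101
```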

Assume your data consists of a single continuous feature and a single binary output class variable (positive and negative). Your training subset consists of the following examples (class labels in parentheses): 1 (+), 2 (+), 5 (-), 6 (+), 9 (-). Your test data consists of the following examples: 2 (+), 3 (-), 8 (-). Assume you apply the nearest neighbor classification algorithm with the absolute distance metric used to find closest neighbors (d=|x-y|). What would be the accuracy of the classifier on the test data?

2/3
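The 2/3 result can be reproduced with a short 1-nearest-neighbor sketch over the training and test points from the question. The function name predict is an assumption made for this example.

```python
train = [(1, "+"), (2, "+"), (5, "-"), (6, "+"), (9, "-")]
test = [(2, "+"), (3, "-"), (8, "-")]

def predict(x):
    # Return the label of the training point with the smallest |x - y|.
    _, nearest_label = min(train, key=lambda point: abs(point[0] - x))
    return nearest_label

predictions = [predict(x) for x, _ in test]
correct = sum(pred == label for pred, (_, label) in zip(predictions, test))

print(predictions)                 # ['+', '+', '-']
print(f"{correct}/{len(test)}")    # 2/3
```

Only the test point 3 is misclassified: its nearest training point is 2, which is labeled positive.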

Given the following code:
if x%2==0:
    print('aaa')
elif x%5==0:
    print('bbb')
else:
    print('ccc')
For which value of x will the code output 'bbb'?

225
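To see why an odd multiple of 5 is required, the branch logic can be wrapped in a function that returns the string instead of printing it. The name label and the extra inputs 10 and 7 are assumptions added for illustration.

```python
def label(x):
    if x % 2 == 0:
        return "aaa"   # even numbers are caught first, including multiples of 10
    elif x % 5 == 0:
        return "bbb"   # reached only by odd multiples of 5
    else:
        return "ccc"

print(label(225))  # bbb
print(label(10))   # aaa
print(label(7))    # ccc
```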

Assume you have the following confusion matrix:
CLASS   +     -
+       160   75
-       25    140
What is the ERROR RATE?

25%
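Both the error rate and the accuracy follow from the four counts in the matrix. The sketch below assumes the diagonal cells (160 and 140) are the correctly classified instances, which is the usual layout.

```python
matrix = [
    [160, 75],   # row for class +
    [25, 140],   # row for class -
]

total = sum(sum(row) for row in matrix)      # 400 instances overall
correct = matrix[0][0] + matrix[1][1]        # diagonal entries: 300

accuracy = correct / total
error_rate = 1 - accuracy

print(accuracy)    # 0.75
print(error_rate)  # 0.25
```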

What is the value of the decimal number 67 in hexadecimal?

43
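The conversion amounts to one division by 16: 67 = 4 x 16 + 3. A small sketch, valid as written only for values below 256 (two hex digits):

```python
digits = "0123456789ABCDEF"

# The quotient is the sixteens digit, the remainder is the ones digit.
quotient, remainder = divmod(67, 16)
hex_string = digits[quotient] + digits[remainder]

print(quotient, remainder)  # 4 3
print(hex_string)           # 43
```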

Assume you have the following confusion matrix:
CLASS   +     -
+       160   75
-       25    140
What is the ACCURACY?

75%

Assume you received the following grades in class: Quiz 0: 3 out of 5 points; Quiz 1: 5 out of 5 points; Lab 1: 8 out of 10 points; MP1: 15 out of 15 points. Without taking into account future assignments and exams, what is your current estimated grade based on the grading policy outlined in the syllabus?

86.67%

111000110

910

Which vertex has the highest betweenness centrality measure in the following graph?

A

What is a data schema?

A description of a data set's attributes and their properties.

What could be a possible reason for eliminating an attribute from your data?

All of the above.

Assume you have the following data points: 1, 2, 7, 8, 9, 34. You then run the k-means algorithm with k=2 and use 1 for the centroid of cluster 1 and 2 for the centroid of cluster 2. What will the clusters contain after the algorithm finishes execution?

C1={1,2,7,8,9} C2={34}
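The result can be traced with a minimal one-dimensional k-means sketch. The function name kmeans_1d is hypothetical, and the tie-breaking rule (a point equidistant from both centroids goes to the lower-numbered cluster) is an assumption, since the question does not specify one. In this data set the point 8 hits such a tie in one iteration, but either choice leads to the same final clusters.

```python
def kmeans_1d(points, centroids):
    """Alternate assignment and centroid-update steps until the centroids stop changing."""
    while True:
        clusters = [[] for _ in centroids]
        for p in points:
            # Assign p to the nearest centroid; ties go to the lower index.
            idx = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[idx].append(p)
        # Recompute each centroid as the mean of its cluster (keep it if the cluster is empty).
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            return clusters
        centroids = new_centroids

print(kmeans_1d([1, 2, 7, 8, 9, 34], [1, 2]))  # [[1, 2, 7, 8, 9], [34]]
```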

Which dispersion statistic would you use if you wanted to compare dispersion across a set of numeric attributes?

Coefficient of variation
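The coefficient of variation is the standard deviation divided by the mean, which removes the unit of measurement and so allows attributes on different scales to be compared. The sketch below uses made-up sample values (heights_cm and incomes_usd are hypothetical) and assumes a nonzero mean.

```python
import statistics

def coefficient_of_variation(values):
    # Population standard deviation relative to the mean; unit-free.
    return statistics.pstdev(values) / statistics.mean(values)

heights_cm = [150, 160, 170, 180, 190]              # hypothetical data
incomes_usd = [20000, 30000, 40000, 50000, 60000]   # hypothetical data

cv_heights = coefficient_of_variation(heights_cm)
cv_incomes = coefficient_of_variation(incomes_usd)

# Incomes are more dispersed relative to their mean than heights are.
print(cv_heights < cv_incomes)  # True
```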

What usually differentiates a data scientist from a statistician?

Data scientists tend to use high performance distributed computing systems and work with Big Data. Statisticians generally work with smaller data samples.

A stopword is a word in a document that is used to delineate terms.

False

All outliers in the data should always be detected and eliminated before generating models.

False

Assume we want to use linear regression to build a model that predicts house prices given the number of bathrooms in the house. In that case, house price is the independent variable, and the number of bathrooms in the house is the dependent variable.

False

In a while loop, the body of the loop is executed until the condition becomes True.

False

Jupyter notebook files are saved using the .py extension.

False

There are two midterm exams in this course.

False

Web scraping means issuing a message to a web server to request data in the form of an XML document.

False

Which of the following functionalities are supported by default in Jupyter Notebooks?

Generate plots and tables within the notebook; use markdown to document parts of the workflow; write and execute code

The line below is an example of which of the following? GET / HTTP/1.1

HTTP request message

Which tool is best to use to view binary file contents?

Hex editor

In Python 3, what is the purpose of the // operator?

Integer (floor) division
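A few cases show how // behaves, including the rounding toward negative infinity that distinguishes floor division from truncation. The variable names are illustrative only.

```python
positive = 7 // 2     # 3: the fractional part is discarded
negative = -7 // 2    # -4: the result is floored, not truncated toward zero
with_float = 7.0 // 2 # 3.0: floor division on a float returns a float

print(positive, negative, with_float)  # 3 -4 3.0
```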

Assume you are conducting a survey of people's demographics and income levels, and that some income values are missing. If people with higher incomes are less likely to report their income levels, the missing values are said to be which of the following?

MNAR - Missing Not At Random

Which centrality measure is suitable for a nominal (non-ordinal) attribute?

Mode

When does the instructor have office hours?

Mondays and Fridays, 10am-12pm and 12:30pm-1pm.

Which of the following methods for filling in missing data values results in several copies of the dataset being generated, with possibly different values filled in for the missing entries?

Multiple imputation

A measurement of height of individuals in cm is generally considered which of the following types of attributes?

Numeric (continuous)

Which of the following data science tasks is used to predict continuous quantities?

Regression

Assume you have the following table called PEOPLE stored in a relational database:
ID,NAME,AGE
1,John,21
2,Mary,45
3,Freddy,18
4,Agnes,31
5,Dolores,45
6,Michael,17
7,Kevin,28
8,Marie,52
Which SQL query would return the following table:
ID,NAME
1,John
3,Freddy
6,Michael
7,Kevin

SELECT id,name FROM people WHERE age<30;
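The query can be tried against an in-memory SQLite database using Python's built-in sqlite3 module. The column types chosen below are assumptions, as the question does not state the table schema. Because the query has no ORDER BY clause, the rows are sorted in Python before printing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [
        (1, "John", 21), (2, "Mary", 45), (3, "Freddy", 18), (4, "Agnes", 31),
        (5, "Dolores", 45), (6, "Michael", 17), (7, "Kevin", 28), (8, "Marie", 52),
    ],
)

# Keep only the people younger than 30 and project the id and name columns.
rows = conn.execute("SELECT id, name FROM people WHERE age < 30").fetchall()
conn.close()

for row in sorted(rows):
    print(row)
```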

Which of the following are data structures implemented by the pandas package? (more than one may apply)

Series; DataFrame

Assume you collected the mean Grade Point Average (GPA) of a sample of students from two different schools. Students from school 1 had a mean GPA of 2.47 and those from school 2 had a mean GPA of 3.89. You want to show whether the students from school 1 do in fact have lower GPAs than those from school 2. You set the alpha level to 0.05 and performed a T-Test. The resulting p-value was 0.0234. What can we conclude from this?

The difference between the GPAs is unlikely to have happened by chance, so we can consider it statistically significant.

How is Data Science defined?

The study of the generalizable extraction of knowledge from data.

Which of the following is a textbook referenced in the syllabus?

Think Python: How to Think Like a Computer Scientist

What is the primary purpose of SQL?

To communicate with a database management system and issue queries.

What is the purpose of the describe() method in Python pandas?

To output descriptive statistics about numerical columns

What is the primary purpose of Python's pandas package?

To provide data structures and functions that allow for ease of data access and manipulation.

A complete graph of five vertices (K5) exhibits greater small-worldness than another connected graph of five vertices with four edges.

True

A graph is simple if it contains at most one edge between any pair of vertices and does not contain self-loops.

True

A model is generalizable when it provides accurate predictions for unseen data.

True

Any nominal-valued attribute can be transformed into (possibly a set of) continuous-valued attributes.

True

If we increase the number of attributes in our data and perform pairwise statistical hypothesis testing at a fixed alpha value of 0.05, we are more likely to find pairs of attributes that pass the hypothesis test, but it's less likely those results are truly significant.

True

In Python 3, the # mark begins a code comment.

True

One advantage of the JSON format for web services is that it is simpler to parse than XML.

True

What is the best method to contact the instructor?

Via email

Assume we collect data on two numeric variables: NumFlus (number of cases of influenza) and AvgOutsideTemp (average temperature outside). We then compute the Pearson correlation coefficient between them and find the value to be -0.957. What can we conclude from this?

We cannot conclude anything about causation between the two variables.

Which of the following strings of characters (could be more than one) would match this regular expression:

ab a

In k-nearest neighbors, what is the purpose of k?

It sets how many neighbors are used; it affects the chance of overfitting and the generalization of the model

The following code creates which type of python data structure? {'a': 3, 'b': 2, 'c': 7}

dictionary

A clustering task involves building predictive models with labeled data.

false

In equal-width discretization, the number of instances that are mapped to each discrete value is always the same.

false

Logistic regression is generally used for solving regression problems.

false

In the following code, what does "sqrt" correspond to?
import math
print(math.sqrt(2))

function

Which of the following tools would be best to view raw binary file contents?

hex editor

Which of the following terms is NOT synonymous with the others?

instance

The following code creates which type of python data structure? ['a', 'b', 'c']

list

Assume we generate a classification model that predicts the chance of a person taking different modes of transportation, and it produces the following confusion matrix:

The model had good accuracy; the model had poor recall of the bike class

In the following code, what does "math" correspond to?
import math
print(math.sqrt(2))

module

What are the differences between list and string objects in Python 3? (more than one answer can apply)

String methods generally return new string objects; strings are immutable, but lists are not; list methods generally modify the object in-place
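These differences can be observed directly. The sketch below uses arbitrary example values and records whether item assignment on a string succeeds.

```python
s = "abc"
upper = s.upper()        # returns a new string; s itself is unchanged

items = [3, 1, 2]
returned = items.sort()  # sorts the list in place and returns None

try:
    s[0] = "x"           # strings do not support item assignment
    string_is_mutable = True
except TypeError:
    string_is_mutable = False

items[0] = 99            # lists do support item assignment

print(s, upper)           # abc ABC
print(items, returned)    # [99, 2, 3] None
print(string_is_mutable)  # False
```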

A for loop in Python 3 can be used to iterate through a string object.

true

The following code creates which type of python data structure? ('a', 'b', 'c')

tuple

The binary sequence 01111010 corresponds to which character in the ASCII character code set?

z
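The lookup can be done by reading the bit string as a base-2 integer and mapping that code to a character. A minimal sketch:

```python
bits = "01111010"

code_point = int(bits, 2)     # 64 + 32 + 16 + 8 + 2 = 122
character = chr(code_point)   # ASCII code 122

print(code_point, character)  # 122 z
```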

What is the bag of words representation of the following document, assuming "is" and "for" are stopwords: "data is useful for data science"

{"data": 2, "useful": 1, "science": 1}

