Final study guide
Which of the following regular expressions would match this string: "123A45"
/[0-9]+A[0-9]+/
Assuming you have the following document/term matrix: What is the relevance of the query Q={T2, T3} to document D2 using the TF.IDF scores?
1.5
decimal number 101 corresponds to which binary number
1100101
Assume your data consists of a single continuous feature and a single binary output class variable (positive and negative). Your training subset consists of the following examples (class labels in parenthesis):1 (+), 2 (+), 5(-), 6(+), 9(-)Your test data consists of the following examples:2 (+), 3(-), 8(-)Assume you apply the nearest neighbor classification algorithm with the absolute distance metric used to find closest neighbors (d=|x-y|), what would be the accuracy of the classifier on the test data?
2/3
Given the following code: if x%2==0:print('aaa')elif x%5==0:print('bbb')else:print('ccc') For which value of x will the code output 'bbb' ?
225
Assume you have the following confusion matrix: CLASS+-+16075-25140 What is the ERROR RATE?
25%
value of the decimal number 67 as a hexacdecimal
43
Copy of Assume you have the following confusion matrix: CLASS+-+16075-25140 What is the ACCURACY?
75%
Assume you received the following grades in class:Quiz 0: 3 out of 5 pointsQuiz 1: 5 out of 5 pointsLab 1: 8 out of 10 pointsMP1: 15 out of 15 pointsWithout taking into account future assignments and exams, what is your current estimated grade based on the grading policy outlined in the syllabus?
86.67%
111000110
910
Which vertex has the highest betweenness centrality measure in the following graph?
A
What is a data schema?
A description of a data set's attributes and their properties.
What could be a possible reason for eliminating an attribute from your data?
All of the above.
Assume you have the following data points:1, 2, 7, 8, 9, 34You then run the k-means algorithm with k=2 and use 1 for the centroid of cluster 1 and 2 for the centroid of cluster 2. What will the clusters contain after the algorithm finishes execution?
C1={1,2,7,8,9} C2={34}
Which dispersion statistics would you use if you wanted to compare dispersion across a set of numeric attributes?
Coefficient of variation
What usually differentiates a data scientist from a statistician?
Data scientists tend to use high performance distributed computing systems and work with Big Data. Statisticians generally work with smaller data samples.
A stopword is a word in a document that is used to delineate terms.
False
All outliers in the data should always be detected and eliminated before generating models.
False
Assume we want to use linear regression to build a model that predicts house prices given the number of bathrooms in the house. In that case, house price is the independent variable, and the number of bathrooms in the house is the dependent variable.
False
In a while loop, the body of the loop is executed until the condition becomes True.
False
Jupyter notebook files are saved using the .py extension.
False
There are two midterm exams in this course.
False
Web scarping means issuing a message to a web server to request data in a form of an XML document.
False
Which of the following functionalities are supported by default in Jupyter Notebooks?
Generate plots and tables within the notebook Use markdown to document parts of the workflow Write and execute code
The line below is an example of which of the following?GET / HTML/1.1
HTTP request message
Which tool is best to use to view binary file contents?
Hex editor
In Python 3, what is the purpose of the // operator?
Integer (floor) division
Assume you are conducting a survey of people's demographics and income levels, and that some income values are missing. If people with higher incomes are less likely to report their income levels, the missing values are said to be which of the following?
MNAR - Missing Not At Random
Which centrality measure is suitable for a nominal (non-ordinal) attribute?
Mode
When does the instructor have office hours?
Mondays and Fridays, 10am-12pm and 12:30pm-1pm.
Which of the following methods for filling in missing data values results in several copies of the dataset being generated with possibility different values of missing values filled in?
Multiple imputation
A measurement of height of individuals in cm is generally considered which of the following types of attributes?
Numeric (continuous)
Which of the following data science tasks is used to predict continuous quantities?
Regression
Assume you have the following table called PEOPLE stored in a relational database: ID,NAME,AGE1,John,212,Mary,453,Freddy,184,Agnes,315,Dolores,456,Michael,177,Kevin,288,Marie,52 Which SQL query would return the following table: ID,NAME1,John3,Freddy6,Michael7,Kevin
SELECT id,name FROM people WHERE age<30;
Which of the following are data structures implemented by the pandas package? (more than one may apply)
Series DataFrame
Assume you collected the mean Grade Point Average (GPA) of a sample of students from two different schools. Students from school 1 had a mean GPA of 2.47 and those from school 2 had a mean GPA of 3.89. You want to show whether the students from school 1 do in fact have lower GPAs than those from school 2. You set the alpha level to 0.05 and performed a T-Test. The resulting p-value was 0.0234. What can we conclude from this?
The difference between the GPAs is unlikely to have happened by chance, so we can consider it statistically significant.
How is Data Science defined?
The study of the generalizable extraction of knowledge from data.
Which of the following is a textbook referenced in the syllabus?
Think Python: How to Think Like a Computer Scientist
What is the primary purpose of SQL?
To communicate with a database management system and issue queries.
What is the purpose of the describe() method in Python pandas?
To output descriptive statistics about numerical columns
What is the primary purpose of Python's pandas package?
To provide data structures and functions that allow for ease of data access and manipulation.
A complete graph of five vertices (K5) exhibits greater small-worldness than another connected graph of five vertices with four edges.
True
A graph is simple if it contains at most one edge between any pair of vertices and does not contain self-loops.
True
A model is generalizable when it provides accurate predictions for unseen data.
True
Any nominal-valued attribute can be transformed into (possibly a set) of continuous-valued attributes.
True
If we increase the number of attributes in our data and perform pairwise statistical hypothesis testing at a fixed alpha value of 0.05, we are more likely to find pairs of attributes that pass the hypothesis test, but it's less likely those results are truly significant.
True
In Python 3, the # mark begins a code comment.
True
One advantage of the JSON format for web services is that it is simpler to parse than XML.
True
What is the best method to contact the instructor?
Via email
Assume we collect data on two numeric variables: NumFlus (number of cases of influenza) and AvgOutsideTemp (average temperature outside). We then compute the Pearson correlation coefficient between them and find the value to be -0.957. What can we conclude from this?
We cannot conclude anything about causation between the two variables.
Which of the following strings of characters (could be more than 1) would match on the this regular expression:
ab a
k'th nearest neighbot what is the purpose of K
chnace of overfitting how many neighbors generalization of the model
The following code creates which type of python data structure? {'a': 3, 'b': 2, 'c': 7}
dictionary
Clustering task involves making predictive models with labeled data
false
in equal-widht discretization the number of instances that are mapped ot each discrete value is always the same
false
logisting regression is generally used for solving regression problems
false
In the following code, what does "sqrt" correspond to? import mathprint(math.sqrt(2))
function
Which of the following tools would be best to view raw binary file contents?
hex editor
Which of the following terms is NOT synonymous with the others?
instance
The following code creates which type of python data structure? ['a', 'b', 'c']
list
assume we generate classifcation model that predicts the chance ofa person taking different modes of transportation to to the following confusion matrix
model was accurate had good accuracy had poor recall of the bike class
In the following code, what does "math" correspond to? import mathprint(math.sqrt(2))
module
What are the differences between list and string objects in Python 3? (more than one answer can apply)
string methods generally return new string objects strings are immutable, but lists are not list methods generally modify the object in-place
A for loop in Python 3 can be used to iterate through a string object.
true
The following code creates which type of python data structure? ('a', 'b', 'c')
tuple
The binary sequence 01111010 corresponds to which character in the ASCII character code set?
z
What is the bag of words representation of the following document, assuming "is" and "for" are stopwords: "data is useful for data science"
{"data": 2, "useful": 1, "science": 1}