Data Science Exam 1, Data Science Exam II
grade is a Pandas DataFrame. Which string in the following expression is the column name? grade.ix["Bob"]["CS310"] Answers: "grade" Not enough data to tell. "CS310" "Bob"
"CS310"
Let price be a data series, possibly with some missing values. Alice decided to replace the missing values with the average of all non-missing values. Which operator should Alice use instead of X in the following statement? price[price.isnull()] = price[X price.isnull()].mean() Answers: ! ~ - not
-
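The recorded answer reflects older pandas releases, where unary minus inverted a boolean mask. A minimal sketch of the same mean-fill idiom using `~`, the operator pandas documents for negating boolean masks; the sample prices are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical prices with two missing values.
price = pd.Series([10.0, np.nan, 14.0, np.nan, 12.0])

missing = price.isnull()
# ~missing selects the non-missing entries; their mean fills the gaps.
price[missing] = price[~missing].mean()
```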
Array texmex contains the numbers of people crossing the border between Texas and Mexico in both directions, by day. The number is positive if more people enter Texas and negative if more people enter Mexico. Assuming that on day 0 the population of Mexico was 0 and there is no way to enter Mexico other than from Texas, which expression calculates the array that contains the total population of Mexico, by day? Answers: texmex.sum() texmex.mean() -texmex.cumsum() texmex.cumsum()
-texmex.cumsum()
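A small sketch of why the sign flip is needed, assuming a made-up five-day array of net crossings (positive means more people entered Texas):

```python
import numpy as np

# Hypothetical net daily flow: positive = net entry into Texas.
texmex = np.array([-5, 3, -2, -4, 1])

# The net flow into Mexico is the negative of texmex;
# its running total is the population of Mexico, by day.
mexico_population = -texmex.cumsum()
```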
Set the NumPy random seed to 10. What is the mean of a pseudorandom sequence of 10000 numbers normally distributed with s=1 and m=0? (Rounded to 3 decimal positions.) Hint: Use np.random.normal. Answers: -0.033 0.005 0.000 0.017
0.005
Consider the GraphML description of the network of FIFA Soccer World Cup participants. An edge from A to B means that A won the game over B. How many national teams have lower eigenvector centrality than the WC winner? Hint: use pandas and the rank() method of a Series. Answers: 198 13 201 188
13
How many characters does the following string have? r"\\\n" Answers: 5 4 3 2
4
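In a raw string, backslashes are not escape characters, so each one counts as a character of its own. A quick check:

```python
s = r"\\\n"

# Three literal backslashes followed by the letter n.
chars = list(s)
length = len(s)
```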
Consider the network of FIFA Soccer World Cup participants. A node represents a national team. A directed edge from A to B means that A won over B, the thickness of the edge being the score. How long is the shortest directed non-weighted path from Ghana to the Solomon Islands? Answers: The path does not exist. 4 2 5
5
Consider the network of FIFA Soccer World Cup participants. What are the unweighted in- and out-degrees of Uruguay? Answers: 6, 7 6, 8 5, 8 6, 5
5, 8
What is the value of the following expression? (Do not run this program: Python will run out of memory!) numpy.eye(50000).trace() Answers: 50000 2500000000 1 0
50000
Consider the GraphML description of the network of FIFA Soccer World Cup participants. An edge from A to B means that A won the game over B. How many teams have more won games than lost games? Hint: Load the file as a graph using nx.read_graphml(), convert the lists of indegrees and outdegrees to pandas Series, compare the Series, and count the Trues. Alternatively, get yourself a looking glass and count the nodes by hand - yikes! Answers: 39 68 202 95
68
How many maximal cliques of the size of 2 or more are in the following network? Answers: 3 7 145 2
7
Which of the following strings are HTML tags? (Check all that apply.) Answers: html : { "tag" } <a href="http://suffolk.edu/"> <b> An <b>HTML</b> tag
<a href="http://suffolk.edu/"> <b>
Which symbol shall be inserted instead of the # sign in the following regular expression so that it correctly matches a human name with an optional generational title? (Such as Jr or Sr). r'[a-zA-Z]+\s[a-zA-Z]+(\s+(Jr\.|Sr\.|II|III))#' Answers: + ? $ *
?
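A hedged sketch of the pattern with `?` substituted for the `#` sign. The sample names and the use of `fullmatch` (to require that the whole string is consumed) are assumptions made for illustration:

```python
import re

# '?' makes the whole generational-title group optional (zero or one occurrence).
name = re.compile(r'[a-zA-Z]+\s[a-zA-Z]+(\s+(Jr\.|Sr\.|II|III))?')

with_title = name.fullmatch("John Smith Jr.")
without_title = name.fullmatch("John Smith")
repeated_title = name.fullmatch("John Smith Jr. Sr.")  # two titles: rejected
```

With `+` or `*` in place of `?`, the last string would be accepted as well, and with `+` the name without a title would be rejected.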
What does the following regular expression describe? r"#.+$" Answers: A string that contains no more than one pound sign "#" A string that ends with a dollar sign "$" A string that begins with a pound sign "#" A Python comment
A Python comment
Which of the following objects are raw data? (As opposed to processed data.) Answers: A Web page downloaded using urllib.request.urlopen() or saved from a Web browser. A table that contains the frequencies of each unique word in Moby Dick. A TIFF image file obtained by scanning a book page. The text of Moby Dick extracted from a scanned image using optical character recognition (OCR).
A Web page downloaded using urllib.request.urlopen() or saved from a Web browser. A TIFF image file obtained by scanning a book page.
s is a Python string. What does the function s.upper() return? Answers: A copy of s with each alphabetic character converted to the upper case. A reference to s where each alphabetic character has been converted to the upper case and all other characters removed. A copy of s with each alphabetic character converted to the upper case and all other characters removed. A reference to s where each alphabetic character has been converted to the upper case.
A copy of s with each alphabetic character converted to the upper case.
Which graph type is the best match for each of the following networks? Question A network of biological siblings. A multimodal transportation network covering roads, railroads, and airways. A network of suppliers and customers, by city (a city may be its own supplier). A network of computers with both wired (Ethernet) and wireless (WiFi) connections. Answers: A. A simple graph. B. A digraph with loops. C. A directed multigraph with no loops. D. An undirected multigraph
A network of biological siblings. -A simple graph. A multimodal transportation network covering roads, railroads, and airways. -A directed multigraph with no loops. A network of suppliers and customers, by city (a city may be its own supplier). -A digraph with loops. A network of computers with both wired (Ethernet) and wireless (WiFi) connections. -An undirected multigraph
Chuck created a Pandas series paychecks. The series index has three duplicate entries 2016. What is the value of the expression paychecks[2016]? Answers: A series of all values labeled 2016. A randomly selected value labeled 2016. The expression fails to evaluate, an exception is raised. The first value labeled 2016.
A series of all values labeled 2016.
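A minimal sketch with made-up paycheck amounts, showing that a duplicated label selects every matching entry:

```python
import pandas as pd

# Hypothetical paychecks; the label 2016 appears three times.
paychecks = pd.Series([1000, 1100, 1200, 900],
                      index=[2016, 2016, 2016, 2015])

selected = paychecks[2016]   # label-based lookup on an integer index
```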
What is model overfitting? Answers: A situation when the model "memorizes" each item used for its training. A situation when the model has too many features. A situation when k-mean clustering algorithm produces too many clusters. A situation when the Ordinary Least Square regression is fitted with very high R2.
A situation when the model "memorizes" each item used for its training.
Let corpus be a custom-made word corpus. What does the method corpus.raw() return? Answers: A string consisting of all the words in the corpus separated by white spaces. A non-human-readable (binary) representation of all the words in the corpus. A list of all the words in the corpus. A string consisting of all the words in the corpus separated by line breaks.
A string consisting of all the words in the corpus separated by line breaks.
Which variable is the worst predictor? Answers: A binary variable that is True in 95% of cases. A variable that is always True. A categorical variable with 10 equally probable values. A binary variable that is either True or False with the probability of 50%.
A variable that is always True.
What is the best external storage format for each of the following Python objects? Rich text with emphasized words, sections, paragraphs, etc., intended for human users. A two-dimensional list of unordered floating point numbers, intended for human users. A set of dictionaries, intended for another Python program. All Answer Choices A. HTML B. Pickle C. CSV
A. HTML C. CSV B. Pickle
What is the second most frequent part of speech, as reported by the NLTK POS tagger, among the first 10,000 common English words from the words corpus? Hint: Use a Counter! Answers: Proper noun. Adverb. Common noun. Adjective.
Adjective
Bob applied Pandas function get_dummies() to a series of 75 integer numbers between 0 and 10, inclusive. Which fact is true about the resulting data frame? Answers: Each column has at most 10 ones. All row sums are in the range between 0 and 10, inclusive. All row sums are equal to 1. All column sums are equal to 75.
All row sums are equal to 1.
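A short sketch on a made-up series (six values rather than 75): each row of the indicator frame has exactly one marked column, the one that corresponds to that row's value.

```python
import pandas as pd

# Hypothetical integers between 0 and 10.
values = pd.Series([0, 3, 3, 10, 7, 0])

dummies = pd.get_dummies(values)   # one indicator column per distinct value
row_sums = dummies.sum(axis=1)
```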
What data source can be used to "cook" a BeautifulSoup object? (Check all that apply.) Answers: A Python dictionary. A list of numbers. An open file. An open URL.
An open file. An open URL.
What symbols can be used as delimiters in a CSV file? Answers: Any ASCII symbol, as long as it is not used in any data item stored in the file. Only commas. Any ASCII character. Commas, tabs, colons, and vertical bars
Any ASCII character.
What is the major difference between NumPy arrays and Python lists? Answers: Function sum() works on lists but not on arrays. Arrays must have an index, lists cannot have indexes. Arrays are homogeneous and lists are heterogeneous.
Arrays are homogeneous and lists are heterogeneous.
A is a two-dimensional array of integer numbers 50000x50000; B is a one-dimensional array of 30000 floating-point numbers; C is a floating-point number. Which of the following operations are valid? (Check all that apply.) Answers: A+C C*A.T B*C A/B
B*C A+C C*A.T
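The same broadcasting rules can be checked on scaled-down arrays. The shapes (5x5 and 3) are assumptions chosen so that the sketch fits in memory; only the shape mismatch matters:

```python
import numpy as np

A = np.ones((5, 5), dtype=int)   # stand-in for the 50000x50000 integer array
B = np.ones(3)                   # stand-in for the 30000-element float array
C = 2.5

# A scalar broadcasts against an array of any shape.
scalar_results = (A + C, C * A.T, B * C)

# Shapes (5, 5) and (3,) cannot be broadcast together.
try:
    A / B
    shapes_compatible = True
except ValueError:
    shapes_compatible = False
```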
Why do we often convert English words to the lower case during text analysis? Answers: Because word stemmers do not recognize upper case letters. For no particular reason. It's just a tradition. Because words in lower case letters require less storage space, which may be crucial for large texts. Because stopwords and some other corpora store words in lower case letters.
Because stopwords and some other corpora store words in lower case letters.
How is the header row (the row that contains column headers) designated in a CSV file? Answers: All words in the header row are spelled in all capital letters. The header row begins with a '#'. CSV does not have any special designation for the header row. The first row of a CSV file is the header row.
CSV does not have any special designation for the header row.
Which facts are true about CSV files? (Check all that apply.) Answers: The field delimiter in a CSV file must be a comma. CSV files are human readable. A CSV file must have a header row. CSV files can be imported into Microsoft Excel.
CSV files can be imported into Microsoft Excel. CSV files are human readable.
What is sentiment analysis? Answers: Classifying a text as being positive or negative, based on the frequencies of positive-charged and negative-charged words and expressions. Splitting a text into words and identifying which of them are nouns. Locating and classifying elements in a text into pre-defined categories such as the names of persons, organizations, etc.
Classifying a text as being positive or negative, based on the frequencies of positive-charged and negative-charged words and expressions.
What is the purpose of compiling a regular expression before using it for matching? Answers: Compiled regular expressions are faster to match. A regular expression cannot be matched unless it is first compiled. There is no real need to compile regular expressions, but compiling them makes the program easier to understand. A compiled regular expression is more robust and less likely to match a wrong string.
Compiled regular expressions are faster to match.
The following HTML fragment has been converted into a BeautifulSoup and stored in the variable soup: <a href='http://moon.ss' > Take me to the moon! </a> Match the Python expressions related to the soup and their values. All Answer Choices A. "http://moon.ss" B. " Take me to the moon! " C. "Take me to the moon!" D. "a"
D. "a" B. " Take me to the moon! " A. "http://moon.ss" C. "Take me to the moon!"
Which topic is not a part of Data Science? Answers: Text data analysis. Data structures and algorithms. Numeric data analysis. Complex network analysis.
Data structures and algorithms.
Which data analysis technique is a machine learning technique? Answers: Information retrieval. Linear regressions. Decision trees. Signal processing.
Decision trees.
Bob submitted a project report that states that the collected data set has 15,345 observations of two variables, foo and bar. Of them, 678 observations are incomplete and have missing data. The mean values of foo and bar are 3.76 and -6.12, respectively. All observed values of foo are positive and all observed values of bar are negative. What kind of data analysis did Bob perform? Answers: Predictive Exploratory Inferential Descriptive
Descriptive
Alice obtained a list of all elevators in NYC for the purpose of checking which part of the city has most high-capacity elevators. She imported the list as a Pandas Frame and learned that it has the following columns: Capacity; Date of last inspection; Elevation; Building ZIP code; Building latitude; Building longitude; Building street address. About 80% of values in the "Building street address" and "Building ZIP code" columns are missing. What is Alice's best policy for handling the missing values? Answers: Eliminate all rows with missing values. Eliminate all columns with missing values. Infer the street address from the longitude and latitude. Replace all missing street addresses with some default value (say, "11 Wall Street").
Infer the street address from the longitude and latitude.
At least what level of data analysis is needed to make the following statement: "Based on the observation of 20 students taking CMPSC-310, we conclude that 20% of the world population are female." Answers: Descriptive. Inferential. Predictive. Exploratory.
Inferential.
Which join/merge operation never creates new missing values? Answers: Outer join. Left join. Outer join. Inner join.
Inner join.
Which frame merge operation never produces new missing values? Answers: Outer merge Left merge Inner merge Right merge
Inner merge
Which facts are true about JSON? (Check all that apply.) Answers: It supports Python dictionaries and lists. It is a dialect of HTML. It works only with JavaScript. It is human readable
It supports Python dictionaries and lists. It is human readable
Which statements are true about JSON? (Check all that apply.) Answers: JSON supports boolean values. A JSON file produced by a Python program cannot be imported into a Java program. JSON is human-readable. JSON supports all Python data types. Any JSON file can be imported as a valid Python object. JSON is supported by many programming languages.
JSON is human-readable. Any JSON file can be imported as a valid Python object. JSON supports boolean values. JSON is supported by many programming languages.
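A minimal round trip through the standard `json` module. The record is made up; it shows that booleans survive, that a tuple comes back as a list, and that a set is rejected because JSON does not cover every Python type:

```python
import json

# Hypothetical record mixing several Python types.
record = {"name": "Alice", "active": True, "scores": [1, 2.5], "tags": ("a", "b")}

text = json.dumps(record)      # human-readable text
restored = json.loads(text)    # always a valid Python object

# Sets have no JSON counterpart.
try:
    json.dumps({1, 2, 3})
    set_supported = True
except TypeError:
    set_supported = False
```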
Is it a good idea to use interactive GUI tools for reproducible data analysis (as opposed to command-line programmed tools), and why? Answers: No, it is not. Interactive GUI tools do not record the history of operations, which makes the analysis not reproducible. No, it is not. Interactive GUI tools are usually oversimplified, stripped down versions of powerful programmable data processing environments. Yes, it is. Interactive GUI tools simplify data analysis and make it more accessible. Yes, it is. Interactive GUI tools, as a rule, provide more powerful data analysis techniques.
No, it is not. Interactive GUI tools do not record the history of operations, which makes the analysis not reproducible.
What operations must be avoided in the course of reproducible data processing? (Check all that apply.) Answers: Use of commercial software. Operations requiring manual steps. Operations involving GUI (interactive) tools. Operations requiring Internet access.
Operations requiring manual steps. Operations involving GUI (interactive) tools.
Pandas frame Patients has columns "Age","Gender", and "Weight", and index "PatientID" with integer numerical labels between 1 and 50, inclusive. Which expressions evaluate to the weight of the 11th patient? (Check all that apply.) Answers: Patient["Weight"][11] Patient[11]["Weight"] Patient.ix[11]["Weight"] Patient["PatientID"==11]["Weight"]
Patient.ix[11]["Weight"] Patient["Weight"][11]
Which of the following words is probably an entity? Answers: vandalize actually strongly Philadelphia
Philadelphia
What is the motivation for precompiling regular expressions? Answers: Precompiled regular expressions match faster. There are no advantages, it's a purely aesthetic choice. Precompiled regular expressions are more accurate.
Precompiled regular expressions match faster.
Which centrality measures control the following network properties? Question Ability to reach immediate network neighbors. Ability to reach all network members. Ability to control information flows. Relative importance of a node with respect to its immediate network neighbors. All Answer Choices A. Closeness centrality. B. Eigenvector centrality. C. Degree centrality. D. Betweenness centrality.
Ability to reach immediate network neighbors. -Degree centrality. Ability to reach all network members. -Closeness centrality. Ability to control information flows. -Betweenness centrality. Relative importance of a node with respect to its immediate network neighbors. -Eigenvector centrality.
Alice works for United Airlines (UA). She was asked to predict the overbooking of a flight (True or False), based on the probability of a UA crew member needing to fly to the same destination for another flight (one continuous variable in the range 0 through 1). Which models shall Alice use to make the prediction? (Check all that apply.) Answers: Linear regression Ridge regression Random forest classifier Logistic regression
Random forest classifier Logistic regression
What is the way to read a CSV file without the header row using the standard CSV reader? Answers: Read the first line from the file using readline() and then construct a CSV reader that reads from the file. Pass the option nrows=1 to the reader. Pass the option header=False to the reader. Read the whole file as a list of strings, remove the first string, and construct a CSV reader that reads from the list of the remaining strings.
Read the first line from the file using readline() and then construct a CSV reader that reads from the file.
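A sketch of this approach with the standard `csv` module. An in-memory `io.StringIO` object with made-up contents stands in for the open file:

```python
import csv
import io

# In-memory stand-in for an open CSV file (contents are made up).
infile = io.StringIO("name,score\nAlice,90\nBob,85\n")

infile.readline()                 # consume the header line
rows = list(csv.reader(infile))   # the reader starts at the first data row
```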
Let s be a string. What does the following statement do? " ".join(s.split()) Answers: Removes duplicate spaces from s. Inserts spaces between every two consecutive characters of s. Nothing. (The result is the same as s.) Removes all spaces from s.
Removes duplicate spaces from s.
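A quick illustration on a made-up string. Note that `split()` without arguments splits on any run of whitespace, so leading and trailing whitespace, tabs, and line breaks are normalized too:

```python
s = "  to   be  or\tnot  to be \n"

# split() drops all whitespace runs; join() puts single spaces back.
normalized = " ".join(s.split())
```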
What is data alignment with respect to Pandas Frames and Series? Answers: Selection of values with matching indices (rather than in matching positions) for binary arithmetic operations. Replacement of the results of some binary operations with missing values (NAs). Elimination of missing values (NAs). Transformation of all values in a Frame/Series to the range [0...1].
Selection of values with matching indices (rather than in matching positions) for binary arithmetic operations.
The data on the US population by age (0-99 years), gender (M/F), and census year (1900-2014) are used as input for the k-means clustering algorithm. Which statement is true about the clustering result? Answers: The number of clusters will be determined automatically, based on the data size. The age of 999 is impossible. The clustering algorithm will fail. All clusters will have the same number of data points. Sex will be mostly ignored by the algorithm.
Sex will be mostly ignored by the algorithm.
Why is it necessary to eliminate stopwords from a text before analyzing it? Answers: Stopwords considerably slow down text analysis tasks. Stopwords make certain text processing functions crash. Stopwords make texts look more similar than they actually are. Stopwords are different in different languages.
Stopwords make texts look more similar than they actually are.
numpy.inf is a special symbol representing a mathematical infinity. Which arithmetic operations on two numpy.inf operands do not produce a numpy.inf? Answers: Subtraction and division. All four operations produce a numpy.inf. Multiplication and division. Addition and subtraction.
Subtraction and division.
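The four operations can be checked directly. Subtraction and division of two infinities are indeterminate and yield `nan`:

```python
import numpy as np

inf = np.inf

results = {
    "add": inf + inf,   # inf
    "sub": inf - inf,   # nan (indeterminate)
    "mul": inf * inf,   # inf
    "div": inf / inf,   # nan (indeterminate)
}
```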
What is the difference between the Porter and Lancaster stemmers? Answers: The Lancaster stemmer is more aggressive than the Porter stemmer. (It produces shorter stems.) Both stemmers are very similar, there are no major differences. The Porter stemmer, unlike the Lancaster stemmer, removes prefixes as well as endings. The Porter stemmer, unlike the Lancaster stemmer, looks up stems on Wordnet.
The Lancaster stemmer is more aggressive than the Porter stemmer. (It produces shorter stems.)
Which missing data handling strategy is the best? Answers: The choice depends on the nature of the missing data. Replacing missing values with a constant. Removing the rows/columns with missing data. Replacing missing values with an average of non-missing values.
The choice depends on the nature of the missing data.
A NumPy array a contains both numbers and nans. How does the function numpy.sort(a) with the default parameters handle the nans? Answers: The nans are moved to the end of the array. The nans are moved to the beginning of the array. The nans are discarded from the array. The function raises an exception.
The nans are moved to the end of the array.
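A minimal check on a made-up array:

```python
import numpy as np

a = np.array([3.0, np.nan, 1.0, np.nan, 2.0])

# The numbers are sorted in ascending order; the nans are kept and placed last.
sorted_a = np.sort(a)
```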
What is data serialization? Answers: The process of assigning a serial number to an object. The process of converting an object into a stream of bytes in order to store the object to a file. The process of reading a stream of bytes from a file in order to convert them into an object. The process of storing a stream of bytes to a file.
The process of converting an object into a stream of bytes in order to store the object to a file.
What happens when the alpha parameter in Ridge regression increases? Answers: The regression coefficients corresponding to the collinear predictors increase. The intercept decreases. The intercept decreases. The regression coefficients corresponding to the collinear predictors decrease.
The regression coefficients corresponding to the collinear predictors decrease.
Bob attempts to download a data file using the following statement: data = urllib.request.urlopen("http://foo.bar/foobar.tgz") However, the URL http://foo.bar/foobar.tgz is no longer valid. What will be the value of the variable data after the execution of the statement? Answers: The statement raises an exception. The variable data does not exist. The value of data is None. The value of data is unknown. The value of data is an empty string.
The statement raises an exception. The variable data does not exist.
In a certain predictive study, a trained predictive model has the score of 0.50 when applied to the training dataset and the score of 0.98 when applied to the testing dataset. Which fact is true about the study? Answers: The study results are very unrealistic. There must have been a mistake in the study setup. The results are typical, there is nothing specific about this study or the model. The model is performing better than a randomly flipped coin, but still is not perfect. The model is overfitted and shall be trained on a different dataset or with different features.
The study results are very unrealistic. There must have been a mistake in the study setup.
An aircraft circles around an airport, waiting for a permission to land, staying at a constant distance of 10 miles from the landing strip. What is the relationship between the coordinates X and Y (or latitude and longitude) of the aircraft? Answers: They are correlated, but the correlation is not linear. They are not correlated at all. They are linearly correlated. Not possible to tell.
They are correlated, but the correlation is not linear.
When Pandas merges two frames that have columns with the same name, what happens to these columns in the result frame? Answers: They are renamed by adding suffixes _l and _r. They are renamed by adding suffixes _l and _r or any other suffixes of our choice. Only one of the columns (from the left frame) becomes a part of the result. The merge operation fails.
They are renamed by adding suffixes _l and _r or any other suffixes of our choice.
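A small sketch of the renaming on two made-up frames. Note that the defaults in `pandas.merge` are `_x` and `_y`; suffixes such as `_l` and `_r` have to be requested through the `suffixes` parameter:

```python
import pandas as pd

# Two hypothetical frames sharing the column name "value".
left = pd.DataFrame({"key": [1, 2], "value": [10, 20]})
right = pd.DataFrame({"key": [1, 2], "value": [30, 40]})

default_names = list(pd.merge(left, right, on="key").columns)
custom_names = list(
    pd.merge(left, right, on="key", suffixes=("_l", "_r")).columns
)
```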
What data structure is the most appropriate for storing an ordered immutable collection of items? Answers: Dictionary Set Tuple List
Tuple
Which feature is NOT provided by Pandas? Answers: Data aggregation. Data alignment. Generation of dummy variables. (Indicators.) Word tokenization.
Word tokenization.
Alice noticed that the scatter plot of variable Y vs variable X that she plotted as a part of the exploratory data analysis, looks almost like a straight diagonal line. What statement is Alice entitled to make, based on this observation? Answers: A change in X causes a change in Y. X and Y are correlated. A change in Y causes a change in X. A change in X causes a change in Y and a change in Y causes a change in X.
X and Y are correlated.
The correlation of the variables X and Y is 0.75, and the p-value of the correlation is .00001. When X increases, what happens to Y? Answers: Y is not likely to change in any predictable direction. Y most likely increases. Y does not change. Y definitely decreases.
Y most likely increases.
Which regular expression matches any positive decimal integer number? Answers: \d*[1-9]\d* [0-9]+ \d+ \d*
\d*[1-9]\d*
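A hedged check of the chosen pattern against a few made-up strings, using `fullmatch` (an assumption: the whole string must be the number). The pattern requires at least one non-zero digit, so strings of zeros are rejected, which `\d+` and `[0-9]+` would accept:

```python
import re

positive_int = re.compile(r"\d*[1-9]\d*")

samples = ["7", "042", "100", "0", "000", ""]
matches = {s: bool(positive_int.fullmatch(s)) for s in samples}
```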
Which Python operator is used to find the words which are in one set but not in another? Answers: - & ^ |
^
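The question's wording can be read in two ways, so this sketch shows both set operators on made-up word sets: `-` keeps the words of the first set that are absent from the second, while `^` (the recorded answer) keeps the words that belong to exactly one of the two sets.

```python
a = {"data", "science", "python"}
b = {"python", "pandas"}

difference = a - b   # in a but not in b
symmetric = a ^ b    # in exactly one of the two sets
```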
Function numpy.unique(A) returns all distinct (unique) values in the array A. What is the value of the following expression? (Do not execute this code: Your Python will run out of memory.) numpy.unique(numpy.eye(500000) - 2) Answers: 500000 250000000000 array([0,2]) array([-2,-1])
array([-2,-1])
What is the value of the following expression? (Do not execute this code: Your Python will run out of memory.) numpy.unique(numpy.eye(50000) - 2) Answers: array([0,2]) 50000 array([-2,-1]) 2500000000
array([-2,-1])
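The result does not depend on the matrix size, so it can be verified on a small identity matrix (5x5 here, chosen only to fit in memory):

```python
import numpy as np

# eye() holds 1 on the diagonal and 0 elsewhere; subtracting 2 gives -1 and -2.
values = np.unique(np.eye(5) - 2)
```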
Which statement extracts all non-zero elements from the array data? Answers: data.nonzero() data[data == 0] data[data.nonzero()] numpy.nonzero(data)
data[data.nonzero()]
Which statement extracts all non-zero elements from the array data? Answers: data.nonzero() numpy.nonzero(data) data[data == 0] data[data.nonzero()]
data[data.nonzero()]
Which statement removes all rows from the frame df that have at least one missing value? Answers: df.dropna(how="any", axis=0) df.dropna(how="all", axis=0) df.dropna(how="all", axis=1) df.dropna(how="any", axis=1)
df.dropna(how="any", axis=0)
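A minimal sketch on a made-up frame, contrasting `axis=0` (drop rows) with `axis=1` (drop columns):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: rows 1 and 2 each contain one missing value.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [4.0, 5.0, np.nan],
    "c": [7.0, 8.0, 9.0],
})

complete_rows = df.dropna(how="any", axis=0)   # keeps only row 0
complete_cols = df.dropna(how="any", axis=1)   # keeps only column "c"
```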
Bob uses BeautifulSoup to process a well-formed HTML document. The soup is stored in the variable doc. Which Python expression calculates the first top-level header of the document, represented as a human-readable string? Note: Incidentally, this question has two correct answers. Select any correct answer to get full credit. Answers: doc.body.h1() doc.body.h1 doc.find_all("h1") doc.h1
doc.body.h1
The data frame edges describes the edges of a directional network. The data frame has three columns: start (the label of the start node), end (the label of the end node), and weight (the weight of the edge). Which expression calculates the outdegree of each node? Answers: edges.groupby(['start']).count() edges.groupby(['end']).count() edges.groupby(['end']).sum() edges.groupby(['weight']).sum()
edges.groupby(['start']).count()
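A short sketch on a made-up edge list. Counting rows per start node gives the outdegree; nodes that never appear as a start node (outdegree 0) are simply absent from the result:

```python
import pandas as pd

# Hypothetical directed edges.
edges = pd.DataFrame({
    "start": ["A", "A", "B", "C", "A"],
    "end": ["B", "C", "C", "A", "D"],
    "weight": [1, 2, 1, 3, 1],
})

counts = edges.groupby(["start"]).count()   # one row per start node
outdegree = counts["end"]                   # any remaining column holds the count
```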
Alice uses BeautifulSoup to process an HTML document. She expects that the tag <div>, that defines a certain division of the document (<div>...</div>), has an attribute class, and would like to extract the class name. The division tag is in the variable fragment. Which expression correctly obtains the class name and stores it in the variable c? Answers: c=fragment.get("class") c=div["class"] if fragment.has_attr("class"): c=fragment["class"] fragment.find(div="class")
if fragment.has_attr("class"): c=fragment["class"]
Alice uses BeautifulSoup to process an HTML document. She expects that the tag <p>, that defines a paragraph within the document (<p>...</p>), has an attribute class, and would like to extract the class name. The paragraph tag is in the variable fragment. Which expression correctly obtains the class name and stores it in the variable c? Answers: c=fragment["class"] fragment.find(p="class") if fragment.has_attr("class"): c=fragment["class"] c=p["class"]
if fragment.has_attr("class"): c=fragment["class"]
infile is an open file of an unknown size. Which functions can be safely used to read from this file? (Check all that apply.) Answers: infile.read(100) infile.readline() infile.readlines() infile.read()
infile.read(100) infile.readline()
infile is an open text file of an unknown size, but with reasonably short lines (say, a poem). Which statements can be safely used to read from this file? (Check all that apply.) Answers: infile.read() infile.readline() infile.read(512) infile.readlines()
infile.readline() infile.read(512)
Which statement converts a string s into a list of unique vowels? Answer: list(set(c for c in s if c in "aouieAOUIE")) [c for c in s if c in "aouieAOUIE"] set(c for c in s if c.isvowel()) list([c for c in s if c in "aouieAOUIE"])
list(set(c for c in s if c in "aouieAOUIE"))
Which statement splits an English text text, represented as a string, into a list of words and punctuation? Answers: nltk.word_tokenize(text) nltk.sent_tokenize(text) nltk.PorterStemmer(text) nltk.pos_tag(text)
nltk.word_tokenize(text)
Which NumPy expression calculates the sum of factorials of the first ten thousand positive numbers? Answers: sum(np.factorial(np.arange(10000))) np.arange(1, 10001).cumprod().sum() np.arange(10000).cumsum().prod() np.arange (10000).cumprod().sum()
np.arange(1, 10001).cumprod().sum()
Which NumPy expressions evaluate to a 5000x5000 identity matrix? (Check all that apply.) Answers: np.array(([1]+5000*[0]) * 4999 + [1]).reshape(5000, 5000) np.zeros((5000, 5000)) + np.ones((5000, 5000)) np.arange(0, 1, 5000).reshape(5000, 5000) np.eye(5000)
np.eye(5000) np.array(([1]+5000*[0]) * 4999 + [1]).reshape(5000, 5000)
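The list-based construction can be verified at a smaller size. Here `n = 5` is an assumption replacing 5000: a 1 followed by n zeros, repeated n - 1 times and closed with a final 1, has n*n elements with the ones exactly n + 1 positions apart, which is the diagonal.

```python
import numpy as np

n = 5  # scaled-down stand-in for 5000

flat = np.array(([1] + n * [0]) * (n - 1) + [1])
candidate = flat.reshape(n, n)

is_identity = np.array_equal(candidate, np.eye(n))
```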
Which expression creates an array of the first 25 positive odd numbers? Answers: numpy.arange(25) numpy.arange(1,51,2) range(25) numpy.arange(1,26)
numpy.arange(1,51,2)
What is used by NumPy to represent a missing value? Answers: An empty string. numpy.na numpy.nan None
numpy.nan
What data structure represents a DataFrame row or column? Answers: list dictionary pandas.Series numpy.array
pandas.Series
Which regular expression(s) correctly describe(s) North American ZIP codes? (Check all that apply.) Answers: r'[0-5]{9}' r'[0-9]+' r'[0-9]{5}' r'\d{5}'
r'[0-9]{5}' r'\d{5}'
Which Python statements correctly count the number of occurrences of the characters 'a' and 'A' in the string s? (Check all that apply.) Answers: len([char for char in s if char in 'aA']) count = 1 for char in s: if char == 'a' or char == 'A': count += 1 s.count('aA') s.lower().count('a')
s.lower().count('a'), len([char for char in s if char in 'aA'])
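A quick comparison on a made-up string. `s.count('aA')` counts occurrences of the two-character substring "aA", not of the individual characters:

```python
s = "Abracadabra, Alabama"

by_lower = s.lower().count('a')
by_comprehension = len([char for char in s if char in 'aA'])
by_substring = s.count('aA')   # looks for the substring "aA"
```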
Let x=[0, .1, .2, ...., .9, 1.0] and y=[0,0,0,0,0,1,1,1,1,1,1]. Fit the ordinary least square linear model (OLS), the Ridge model (RIDGE) with alpha=0.5, and the logistic regression model (LR) into these data and calculate the fitting score for each of the models. What is the relationships between the models' scores? Answers: score(LR) < score(OLS) < score(RIDGE) score(LR) < score(RIDGE) < score(OLS) score(OLS) < score(RIDGE) < score(LR) score(RIDGE) < score(OLS) < score(LR)
score(RIDGE) < score(OLS) < score(LR)
The Frame towns has the following columns: "STATE", "NAME", and "POP" (population). Which expression calculates the population of the smallest town in each state? Answers: towns.sort("POP", ascending = True).groupby(["STATE"]).min() towns.groupby(["STATE"]).min() towns.sort("POP", ascending = True).groupby(["STATE"]).first() towns.sort("POP", ascending = False).groupby(["STATE"]).min()
towns.sort("POP", ascending = True).groupby(["STATE"]).first()
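`DataFrame.sort` has been removed from pandas; `sort_values` is its replacement. A sketch of the same idea with `sort_values`, on made-up towns and populations. Sorting first and taking the first row of each group keeps the whole row of the smallest town, whereas `min()` would take the minimum of every column independently:

```python
import pandas as pd

# Hypothetical towns and populations.
towns = pd.DataFrame({
    "STATE": ["MA", "MA", "NH", "NH"],
    "NAME": ["Bigtown", "Smallville", "Anytown", "Tinyburg"],
    "POP": [650000, 75, 110000, 40],
})

smallest = towns.sort_values("POP", ascending=True).groupby(["STATE"]).first()
```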
How many English names listed in the names corpus are also common English words listed in the words corpus? Hint: Treat the corpora as sets of words. Answers: ~6000 ~400 ~800 ~1800
~1800