E20-007 Data Science Associate Exam
A. It is too processed
A Data Scientist is assigned to build a model from a reporting data warehouse. The warehouse contains data collected from many sources and transformed through a complex, multi-stage ETL process. What is a concern the data scientist should have about the data? A. It is too processed B. It is not structured C. It is not normalized D. It is too centralized
B. Goal 2 and 4
A call center for a large electronics company handles an average of 35,000 support calls a day. The head of the call center would like to optimize the staffing of the call center during the rollout of a new product due to recent customer complaints of long wait times. You have been asked to create a model to optimize call center costs and customer wait times. The goals for this project include: 1. Relative to the release of a product, how does the call volume change over time? 2. How to best optimize staffing based on the call volume for the newly released product, relative to old products. 3. Historically, what time of day does the call center need to be most heavily staffed? 4. Determine the frequency of calls by both product type and customer language. Which goals are suitable to be completed with MapReduce? A. Goal 1 and 3 B. Goal 2 and 4 C. Goals 1, 2, 3, 4 D. Goals 2, 3, 4
D. K Means Clustering
A data scientist is asked to implement an article recommendation feature for an on-line magazine. The magazine does not want to use client tracking technologies such as cookies or reading history. Therefore, only the style and subject matter of the current article is available for making recommendations. All of the magazine's articles are stored in a database in a format suitable for analytics. Which method should the data scientist try first? A. Association Rules B. Naive Bayesian C. Logistic Regression D. K Means Clustering
A. empdata[empdata$Age > 40,c("Salary","Occupation")]
A data scientist is given an R data frame, "empdata", with the columns Age, Salary, Occupation, Education, and Gender. The data scientist would like to examine only the Salary and Occupation columns for ages greater than 40. Which command extracts the appropriate rows and columns from the data frame? A. empdata[empdata$Age > 40,c("Salary","Occupation")] B. empdata[c("Salary","Occupation"),empdata$Age > 40] C. empdata[Age > 40,("Salary","Occupation")] D. empdata[,c("Salary","Occupation")]$Age > 40
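As a quick sanity check, option A can be verified in R on a small mock data frame (the values below are invented for illustration; only the column names come from the question):

    # Hypothetical example data; column names match the question, values are made up
    empdata <- data.frame(
      Age        = c(35, 52, 47, 29),
      Salary     = c(50000, 72000, 68000, 41000),
      Occupation = c("Analyst", "Manager", "Engineer", "Clerk"),
      Education  = c("BS", "MS", "PhD", "BS"),
      Gender     = c("F", "M", "F", "M")
    )
    # Option A: rows where Age > 40, keeping only the Salary and Occupation columns
    empdata[empdata$Age > 40, c("Salary", "Occupation")]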
A. Naïve Bayesian classifier
A data scientist plans to classify the sentiment polarity of 10,000 product reviews collected from the Internet. Assuming labeled training data is available, what is the most appropriate model to use? A. Naïve Bayesian classifier B. Linear regression C. Logistic regression D. K-means clustering
C. The manufacturing process should be inspected for problems.
A disk drive manufacturer has a defect rate of less than 1.0% with 98% confidence. A quality assurance team samples 1000 disk drives and finds 14 defective units. Which action should the team recommend? A. A smaller sample size should be taken to determine if the plant is functioning properly B. A larger sample size should be taken to determine if the plant is functioning properly C. The manufacturing process should be inspected for problems. D. The manufacturing process is functioning properly and no further action is required.
B. The manufacturing process is functioning properly and no further action is required
A disk drive manufacturer has a defect rate of less than 1.5% with 98% confidence. A quality assurance team samples 1000 disk drives and finds 14 defective units. Which action should the team recommend? A. A larger sample size should be taken to determine if the plant is operating correctly B. The manufacturing process is functioning properly and no further action is required C. A smaller sample size should be taken to determine if the plant is operating correctly D. There is a flaw in the quality assurance process and the sample should be repeated
B. summary
Assume that you have a data frame in R. Which function would you use to display descriptive statistics for each of its variables? A. str B. summary C. attributes D. levels
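For example, on the built-in mtcars data frame, summary() reports the minimum, quartiles, mean, and maximum for each numeric column, which is the kind of descriptive output the question is after; str() is shown only for contrast:

    summary(mtcars)   # per-column descriptive statistics: Min., quartiles, Mean, Max.
    str(mtcars)       # for contrast: compact display of the object's structure only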
B. There appears to be a constant variance around a constant mean.
Before you build an ARMA model, how can you tell if your time series is weakly stationary? A. The mean of the series is close to 0. B. There appears to be a constant variance around a constant mean. C. The series is normally distributed. D. There appears to be no apparent trend component.
A. {bread,milk} => {cheese}
Consider a database with 4 transactions: Transaction 1: {cheese, bread, milk} Transaction 2: {soda, bread, milk} Transaction 3: {cheese, bread} Transaction 4: {cheese, soda, juice} The minimum support is 25%. Which rule has a confidence equal to 50%? A. {bread,milk} => {cheese} B. {bread} => {milk} C. {juice} => {soda} D. {bread} => {cheese}
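The confidence values here follow directly from confidence(X => Y) = support(X and Y) / support(X). A minimal R sketch over the four listed transactions confirms that option A is the rule with 50% confidence:

    # The four transactions from the question
    txns <- list(
      c("cheese", "bread", "milk"),
      c("soda", "bread", "milk"),
      c("cheese", "bread"),
      c("cheese", "soda", "juice")
    )
    # Fraction of transactions that contain every item in 'items'
    support <- function(items) mean(sapply(txns, function(t) all(items %in% t)))
    # confidence(X => Y) = support(X and Y) / support(X)
    confidence <- function(lhs, rhs) support(c(lhs, rhs)) / support(lhs)
    confidence(c("bread", "milk"), "cheese")   # 0.50 -> option A
    confidence("bread", "cheese")              # 2/3, so option D exceeds 50%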
B. {cheese} => {bread}
Consider a database with 4 transactions: Transaction 1: {cheese, bread, milk} Transaction 2: {soda, bread, milk} Transaction 3: {cheese, bread} Transaction 4: {cheese, soda, juice} You decide to run the association rules algorithm where minimum support is 50%. Which rule has a confidence at least 50%? A. {juice} => {cheese} B. {cheese} => {bread} C. {milk} => {soda} D. {soda} => {milk}
B. Ordinal
Consider a scale that has five (5) values that range from "not important" to "very important". Which data classification best describes this data? A. Nominal B. Ordinal C. Real D. Ratio
D. ELT
Consider the example of an analysis for fraud detection on credit card usage. You will need to ensure higher-risk transactions that may indicate fraudulent credit card activity are retained in your data for analysis, and not dropped as outliers during pre-processing. What will be your approach for loading data into the analytical sandbox for this analysis? A. OLTP B. ETL C. EDW D. ELT
A. 75%
Consider these itemsets: (hat, scarf, coat) (hat, scarf, coat, gloves) (hat, scarf, gloves) (hat, gloves) (scarf, coat, gloves) What is the confidence of the rule (gloves -> hat)? A. 75% B. 60% C. 66% D. 80%
A. 66%
Consider these itemsets: (hat, scarf, coat) (hat, scarf, coat, gloves) (hat, scarf, gloves) (hat, gloves) (scarf, coat, gloves) What is the confidence of the rule (hat, scarf) -> gloves? A. 66% B. 40% C. 50% D. 60%
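Both itemset questions use the same calculation: count the baskets containing the left-hand side, then take the fraction of those that also contain the right-hand side. A short R check over the five itemsets shared by the two questions above:

    # The five itemsets from the two questions above
    baskets <- list(
      c("hat", "scarf", "coat"),
      c("hat", "scarf", "coat", "gloves"),
      c("hat", "scarf", "gloves"),
      c("hat", "gloves"),
      c("scarf", "coat", "gloves")
    )
    count <- function(items) sum(sapply(baskets, function(b) all(items %in% b)))
    # gloves -> hat: 3 of the 4 baskets containing gloves also contain hat
    count(c("gloves", "hat")) / count("gloves")                     # 0.75
    # (hat, scarf) -> gloves: 2 of the 3 baskets with hat and scarf contain gloves
    count(c("hat", "scarf", "gloves")) / count(c("hat", "scarf"))   # 0.667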
D. Data exploration
Data visualization is used in the final presentation of an analytics project. For what else is this technique commonly used? A. Model selection B. Descriptive statistics C. ETLT D. Data exploration
D. Embarrassingly parallel
For which class of problem is MapReduce most suitable? A. Non-overlapping queries B. Minimal result data C. Simple marginalization tasks D. Embarrassingly parallel
A. Rows retain their separate identities and the window function can access more than the current row.
How are window functions different from regular aggregate functions? A. Rows retain their separate identities and the window function can access more than the current row. B. Rows are grouped into an output row and the window function can access more than the current row. C. Rows retain their separate identities and the window function can only access the current row. D. Rows are grouped into an output row and the window function can only access the current row.
A. Pig's schema is optional
How does Pig's use of a schema differ from that of a traditional RDBMS? A. Pig's schema is optional B. Pig's schema requires that the data is physically present when the schema is defined C. Pig's schema is required for ETL D. Pig's schema supports a single data type
D. Line chart
If your intention is to show trends over time, which chart type is the most appropriate way to depict the data? A. Histogram B. Bar chart C. Stacked bar chart D. Line chart
A. Communication skill
Imagine you are trying to hire a Data Scientist for your team. In addition to technical ability and quantitative background, which additional essential trait would you look for in people applying for this position? A. Communication skill B. Scientific background C. Domain expertise D. Well Organized
D. Magnetic, Agile, Deep
In MADlib what does MAD stand for? A. Modular, Accurate, Dependable B. Machine Learning, Algorithms for Databases C. Mathematical Algorithms for Databases D. Magnetic, Agile, Deep
A. generic functions
In R, functions like plot() and hist() are known as what? A. generic functions B. virtual methods C. virtual functions D. generic methods
A. it is the area under the appropriate tails of the Student's distribution
In a Student's t-test, what is the meaning of the p-value? A. it is the area under the appropriate tails of the Student's distribution B. it is the "power" of the Student's t-test C. it is the mean of the distribution for the null hypothesis D. it is the mean of the distribution for the alternate hypothesis
D. quicker time to insight
In addition to less data movement and the ability to use larger datasets in calculations, what is a benefit of analytical calculations in a database? A. full use of data aggregation functionality B. more efficient handling of categorical values C. improved connections between disparate data sources D. quicker time to insight
C. Emphasis colors
In data visualization, what is used to focus the audience on a key part of a chart? A. Pastel colors B. Detailed text C. Emphasis colors D. A data table
B. Histogram
In data visualization, which type of chart is recommended to represent frequency data? A. Line chart B. Histogram C. Q-Q chart D. Scatterplot
C. Apply a transformation to a variable
In linear regression modeling, which action can be taken to improve the linearity of the relationship between the dependent and independent variables? A. Calculate the R-Squared value B. Use a different statistical package C. Apply a transformation to a variable D. Change the units of measurement on the independent variable
C. A small p-value
In linear regression, what indicates that an estimated coefficient is significantly different than zero? A. R-squared near 0 B. R-squared near 1 C. A small p-value D. The estimated coefficient is greater than 3
A. The use of 3 dimensions.
In the Exhibit. For effective visualization, what is the chart's primary flaw? http://cdn.aiotestking.com/wp-content/uploads/e20-007-v1/3.jpg A. The use of 3 dimensions. B. The slanting of axis labels. C. The location of the legend. D. The order of the columns.
C. It processes the input and generates key-value pairs
In the MapReduce framework, what is the purpose of the Map Function? A. It sorts the results of the Reduce function B. It collects the output of the Reduce function C. It processes the input and generates key-value pairs D. It breaks the input into smaller components and distributes to other nodes in the cluster
A. It aggregates the results of the Map function and generates processed output
In the MapReduce framework, what is the purpose of the Reduce function? A. It aggregates the results of the Map function and generates processed output B. It distributes the input to multiple nodes for processing C. It writes the output of the Map function to storage D. It breaks the input into smaller components and distributes to other nodes in the cluster
A. Model planning
In which lifecycle stage are appropriate analytical techniques determined? A. Model planning B. Model building C. Data preparation D. Discovery
D. Discovery
In which lifecycle stage are initial hypotheses formed? A. Data preparation B. Model planning C. Model building D. Discovery
D. Model building
In which lifecycle stage are test and training data sets created? A. Data preparation B. Model planning C. Discovery D. Model building
C. Communicate Results
In which phase of the analytic lifecycle would you expect to spend most of the project time? A. Discovery B. Data preparation C. Communicate Results D. Operationalize
B. Data Preparation
In which phase of the data analytics lifecycle do Data Scientists spend the most time in a project? A. Discovery B. Data Preparation C. Model Building D. Communicate Results
B. (y_3 - y_2) - (y_2 - y_1) = ... = (y_n - y_{n-1}) - (y_{n-1} - y_{n-2})
On analyzing your time series data you suspect that the data, represented as y_1, y_2, y_3, ..., y_{n-1}, y_n, may have a trend component that is quadratic in nature. Which pattern of data will indicate that the trend in the time series data is quadratic in nature? A. (y_2 - y_1) = (y_3 - y_2) = ... = (y_n - y_{n-1}) B. (y_3 - y_2) - (y_2 - y_1) = ... = (y_n - y_{n-1}) - (y_{n-1} - y_{n-2}) C. ((y_2 - y_1)/y_1) * 100% = ... = ((y_n - y_{n-1})/y_{n-1}) * 100% D. (y_4 - y_2) - (y_3 - y_1) = ... = (y_n - y_{n-2}) - (y_{n-1} - y_{n-3})
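Option B says the second differences of the series are constant, which is exactly the signature of a quadratic trend y_t = a + bt + ct^2. A quick R illustration on a noise-free quadratic series:

    # A purely quadratic series: y_t = 2 + 3t + 0.5t^2
    t <- 1:10
    y <- 2 + 3 * t + 0.5 * t^2
    diff(y)                    # first differences grow linearly
    diff(y, differences = 2)   # second differences are constant (all equal to 1 here)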
B. Create clusters based on the data and use them as model inputs
Refer to the exhibit. You are asked to write a report on how specific variables impact your client's sales using a data set provided to you by the client. The data includes 15 variables that the client views as directly related to sales, and you are restricted to these variables only. After a preliminary analysis of the data, the following findings were made: 1. Multicollinearity is not an issue among the variables 2. Only three variables—A, B, and C—have significant correlation with sales You build a linear regression model on the dependent variable of sales with the independent variables of A, B, and C. The results of the regression are seen in the exhibit. You cannot request additional data. What is a way that you could try to increase the R2 of the model without artificially inflating it? http://cdn.aiotestking.com/wp-content/uploads/e20-007-v1/1.jpg A. Force all 15 variables into the model as independent variables B. Create clusters based on the data and use them as model inputs C. Create interaction variables based only on variables A, B, and C D. Break variables A, B, and C into their own univariate models
B. Tree B
Refer to the Exhibit. In the Exhibit, the table shows the values for the input Boolean attributes "A", "B", and "C". It also shows the values for the output attribute "class". Which decision tree is valid for the data? http://cdn.aiotestking.com/wp-content/uploads/e20-007-v1/10.jpg A. Tree A B. Tree B C. Tree C D. Tree D
B. Tree B
Refer to the Exhibit. In the Exhibit, the table shows the values for the input Boolean attributes "A", "B", and "C". It also shows the values for the output attribute "class". Which decision tree is valid for the data? http://cdn.aiotestking.com/wp-content/uploads/e20-007-v1/9.jpg A. Tree A B. Tree B C. Tree C D. Tree D
D. GROUPING
Refer to the Exhibit. You are working on creating an OLAP query that outputs summary rows of subtotals and grand totals in addition to regular rows that may contain NULL, as shown in the exhibit. Which function can you use in your query to distinguish a subtotal row from a regular row? http://cdn.aiotestking.com/wp-content/uploads/e20-007-v1/12.jpg A. ROLLUP B. RANK C. GROUP_ID D. GROUPING
B. Variable B is interacting with another variable due to correlated inputs
Refer to the exhibit. After analyzing a dataset, you report findings to your team: 1. Variables A and C are significantly and positively impacting the dependent variable. 2. Variable B is significantly and negatively impacting the dependent variable. 3. Variable D is not significantly impacting the dependent variable. After seeing your findings, the majority of your team agreed that variable B should be positively impacting the dependent variable. What is a possible reason the coefficient for variable B was negative and not positive? http://cdn.aiotestking.com/wp-content/uploads/e20-007-v1/13.jpg A. Variable B needs a quadratic transformation due to its relationship to the dependent variable B. Variable B is interacting with another variable due to correlated inputs C. The information gain from variable B is already provided by another variable D. Variable B needs a logarithmic transformation due to its relationship to the dependent variable
C. Document C
Refer to the exhibit. Click on the calculator icon in the upper left corner. An analyst is searching a corpus of documents for the topic "solid state disk". In the Exhibit, Table A provides the inverse document frequency for each term across the corpus. Table B provides each term's frequency in four documents selected from the corpus. Which of the four documents is most relevant to the analyst's search? http://cdn.aiotestking.com/wp-content/uploads/e20-007-v1/21.jpg A. Document A B. Document B C. Document C D. Document D
B. Document B
Refer to the exhibit. Click on the calculator icon in the upper left corner. An analyst is searching a corpus of documents for the topic "solid state disk". In the Exhibit, Table A provides the inverse document frequency for each term across the corpus. Table B provides each term's frequency in four documents selected from the corpus. Which of the four documents is most relevant to the analyst's search? http://cdn.aiotestking.com/wp-content/uploads/e20-007-v1/33.jpg A. Document A B. Document B C. Document C D. Document D
B. Rules B and D
Refer to the exhibit. Click on the calculator icon in the upper left corner. You are given a list of pre-defined association rules: A) RENTER => BAD CREDIT B) RENTER => GOOD CREDIT C) HOME OWNER => BAD CREDIT D) HOME OWNER => GOOD CREDIT E) FREE HOUSING => BAD CREDIT F) FREE HOUSING => GOOD CREDIT For your next analysis, you must limit your dataset based on rules with confidence greater than 60%. Which of the rules will be kept in the analysis? http://cdn.aiotestking.com/wp-content/uploads/e20-007-v1/16.jpg A. Rules A and F B. Rules B and D C. Rules C and E D. Rules D and E
C. 63%
Refer to the exhibit. Click on the calculator icon in the upper left corner. You are going into a meeting where you know your manager will have a question on your dataset — specifically relating to customers that are classified as renters with good credit status. In order to prepare for the meeting, you create a rule: RENTER => GOOD CREDIT. What is the confidence of the rule? http://cdn.aiotestking.com/wp-content/uploads/e20-007-v1/34.jpg A. 18% B. 41% C. 63% D. 73%
A. Classification Y = 0, Probability = 4/54
Refer to the exhibit. Consider the training data set shown in the exhibit. What are the classification (Y = 0 or 1) and the probability of the classification for the tuple X(1, 0, 0) using a Naive Bayesian classifier? http://cdn.aiotestking.com/wp-content/uploads/e20-007-v1/26.jpg A. Classification Y = 0, Probability = 4/54 B. Classification Y = 1, Probability = 4/54 C. Classification Y = 0, Probability = 1/54 D. Classification Y = 1, Probability = 1/54
C. Classification Y = 1, Probability = 4/54
Refer to the exhibit. Consider the training data set shown in the exhibit. What are the classification (Y = 0 or 1) and the probability of the classification for the tuple X(0, 0, 1) using a Naive Bayesian classifier? http://cdn.aiotestking.com/wp-content/uploads/e20-007-v1/18.jpg A. Classification Y = 1, Probability = 1/54 B. Classification Y = 0, Probability = 1/54 C. Classification Y = 1, Probability = 4/54 D. Classification Y = 0, Probability = 4/54
B. There appears to be no structure left to model in the data
Refer to the exhibit. In the exhibit, a correlogram is provided based on an autocorrelation analysis of a sample dataset. What can you conclude based only on this exhibit? http://cdn.aiotestking.com/wp-content/uploads/e20-007-v1/5.jpg A. There appears to be a seasonal component in the data B. There appears to be no structure left to model in the data C. Lag 1 has a significant autocorrelation D. There appears to be a cyclical component in the data
A. There is significant autocorrelation through lag 3
Refer to the exhibit. In the exhibit, a correlogram is provided based on an autocorrelation analysis of a sample dataset. What can you conclude from only this exhibit? http://cdn.aiotestking.com/wp-content/uploads/e20-007-v1/19.jpg A. There is significant autocorrelation through lag 3 B. There is no structure left to model in the data C. Lag 7 has a significant negative autocorrelation D. Differencing is required before proceeding with any analysis
C. Logistic Regression
Refer to the exhibit. In the exhibit, the x-axis represents the derived probability of a borrower defaulting on a loan. Also in the exhibit, the pink represents borrowers that are known to have not defaulted on their loan, and the blue represents borrowers that are known to have defaulted on their loan. Which analytical method could produce the probabilities needed to build this exhibit? http://cdn.aiotestking.com/wp-content/uploads/e20-007-v1/6.jpg A. Discriminant Analysis B. Linear Regression C. Logistic Regression D. Association Rules
A. Fig-A
Refer to the exhibit. The exhibit shows four graphs labeled as Fig A through Fig D. Which figure represents the entropy function relative to a Boolean classification and is represented by the formula shown in the Exhibit? http://cdn.aiotestking.com/wp-content/uploads/e20-007-v1/23.jpg A. Fig-A B. Fig-B C. Fig-C D. Fig-D
D. S
Refer to the exhibit. The graph represents an ROC space with four classifiers labelled A through D. Which point in the graph represents a perfect classification? http://cdn.aiotestking.com/wp-content/uploads/e20-007-v1/25.jpg A. P B. Q C. R D. S
D. 0.83
Refer to the exhibit, which provides the decision tree for predicting whether someone is a good or bad credit risk. What would be the assigned probability, p(good), of a single male with no known savings? http://cdn.aiotestking.com/wp-content/uploads/e20-007-v1/22.jpg A. 0 B. 0.498 C. 0.6 D. 0.83
B. "Saturated" data,indicating potential issues with data definitions
Refer to the exhibit. Which type of data issue would you suspect based on the exhibit? http://cdn.aiotestking.com/wp-content/uploads/e20-007-v1/20.jpg A. Incomplete data,indicating potential issues with data transmission B. "Saturated" data,indicating potential issues with data definitions C. Mis-scaled data,indicating potential issues with data entry D. The exhibit does not raise any obvious concerns with the data.
C. Variables A, B, and C are significantly impacting sales, but are not effectively estimating sales
Refer to the exhibit. You are asked to write a report on how specific variables impact your client's sales using a data set provided to you by the client. The data includes 15 variables that the client views as directly related to sales, and you are restricted to these variables only. After a preliminary analysis of the data, the following findings were made: 1. Multicollinearity is not an issue among the variables 2. Only three variables—A, B, and C—have significant correlation with sales You build a linear regression model on the dependent variable of sales with the independent variables of A, B, and C. The results of the regression are seen in the exhibit. Which interpretation is supported by the analysis? http://cdn.aiotestking.com/wp-content/uploads/e20-007-v1/2.jpg A. Due to the R2 of 0.10, the model is not valid - the linear regression should be re-run with all 15 variables forced into the model to increase the R2 B. Variables A, B, and C are significantly impacting sales and are effectively estimating sales C. Variables A, B, and C are significantly impacting sales, but are not effectively estimating sales D. Due to the R2 of 0.10, the model is not valid - a different analytical model should be attempted
A. Total Sales to Date
Refer to the exhibit. You are assigned to do an end of the year sales analysis of 1,000 different products, based on the transaction table. Which column in the end of year report requires the use of a window function? http://cdn.aiotestking.com/wp-content/uploads/e20-007-v1/11.jpg A. Total Sales to Date B. Daily Sales C. Average Daily Price D. Maximum Price
A. Credit Score
Refer to the exhibit. You are building a decision tree. In this exhibit, four variables are listed with their respective values of info-gain. Based on this information, on which attribute would you expect the next split to be in the decision tree? http://cdn.aiotestking.com/wp-content/uploads/e20-007-v1/8.jpg A. Credit Score B. Age C. Income D. Gender
A. 2
Refer to the exhibit. You are using K-means clustering to classify customer behavior for a large retailer. You need to determine the optimum number of customer groups. You plot the within-sum-of-squares (wss) data as shown in the exhibit. How many customer groups should you specify? http://cdn.aiotestking.com/wp-content/uploads/e20-007-v1/15.jpg A. 2 B. 3 C. 4 D. 8
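For reference, the curve in such an exhibit is typically produced with an elbow computation like the sketch below (customer_data is a placeholder for the retailer's scaled numeric features, not data from the exhibit); the chosen k is where the within-cluster sum of squares stops dropping sharply:

    # Sketch of the usual "elbow" computation; customer_data is stand-in data
    set.seed(42)
    customer_data <- matrix(rnorm(200 * 4), ncol = 4)
    wss <- sapply(1:8, function(k) kmeans(customer_data, centers = k, nstart = 25)$tot.withinss)
    plot(1:8, wss, type = "b",
         xlab = "Number of clusters k", ylab = "Within-cluster sum of squares")
    # Pick k at the elbow, where adding clusters no longer reduces wss appreciably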
C. Recreate the density plot using a log normal distribution of the purchase amount data
Refer to the exhibit. You have created a density plot of purchase amounts from a retail website as shown. What should you do next? http://cdn.aiotestking.com/wp-content/uploads/e20-007-v1/7.jpg A. Recreate the plot using the barplot() function B. Use the rug() function to add elements to the plot C. Recreate the density plot using a log normal distribution of the purchase amount data D. Reduce the sample size of the purchase amount data used to create the plot
D. The data is extremely skewed. Replot the data on a logarithmic scale to get a better sense of it.
Refer to the exhibit. You have plotted the distribution of savings account sizes for your bank. How would you proceed, based on this distribution? http://cdn.aiotestking.com/wp-content/uploads/e20-007-v1/4.jpg A. The data is extremely skewed. Split your analysis into two cohorts: accounts less than 2,500, and accounts greater than 2,500 B. The data is extremely skewed, but looks bimodal; replot the data in the range 2,500-10,000 to be sure. C. The accounts of size greater than 2,500 are rare, and probably outliers. Eliminate them from your future analysis. D. The data is extremely skewed. Replot the data on a logarithmic scale to get a better sense of it.
A. The R-squared may be biased upwards by the extreme-valued outcomes. Remove them and refit to get a better idea of the model's quality over typical data.
Refer to the exhibit. You have run a linear regression model against your data, and have plotted true outcome versus predicted outcome. The R-squared of your model is 0.75. What is your assessment of the model? http://cdn.aiotestking.com/wp-content/uploads/e20-007-v1/14.jpg A. The R-squared may be biased upwards by the extreme-valued outcomes. Remove them and refit to get a better idea of the model's quality over typical data. B. The R-squared is good. The model should perform well. C. The extreme-valued outliers may negatively affect the model's performance. Remove them to see if the R-squared improves over typical data. D. The observations seem to come from two different populations, but this model fits them both equally well.
C. Precision = 262/277 Recall = 262/288
Refer to the exhibit. You have scored your Naive Bayesian classifier model on hold-out test data for cross-validation and tabulated how the samples scored, as shown in the exhibit. What are the Precision and Recall rates of the model? http://cdn.aiotestking.com/wp-content/uploads/e20-007-v1/27.jpg A. Precision = 277/262 Recall = 288/262 B. Precision = 262/288 Recall = 262/277 C. Precision = 262/277 Recall = 262/288 D. Precision = 288/262 Recall = 277/262
B. FPR = 15/262 FNR = 26/288
Refer to the exhibit. You have scored your Naive Bayesian classifier model on hold-out test data for cross-validation and tabulated how the samples scored, as shown in the exhibit. What are the False Positive Rate (FPR) and the False Negative Rate (FNR) of the model? http://cdn.aiotestking.com/wp-content/uploads/e20-007-v1/32.jpg A. FPR = 26/288 FNR = 15/262 B. FPR = 15/262 FNR = 26/288 C. FPR = 262/15 FNR = 288/26 D. FPR = 288/26 FNR = 262/15
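The two confusion-matrix questions above rely only on the standard definitions; since the exhibit counts are not reproduced here, the sketch below just encodes those definitions for whatever TP, FP, FN, and TN values the exhibit provides:

    # Standard binary-classification rates from confusion-matrix counts
    classifier_rates <- function(TP, FP, FN, TN) {
      c(precision = TP / (TP + FP),   # of all samples predicted positive, fraction correct
        recall    = TP / (TP + FN),   # of all actual positives, fraction found (TPR)
        FPR       = FP / (FP + TN),   # actual negatives incorrectly flagged positive
        FNR       = FN / (FN + TP))   # actual positives missed, i.e. 1 - recall
    }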
B. In this model, Variable D is not significantly interacting with the dependent variable
Refer to the exhibit. You ran a linear regression, and the final output is seen in the exhibit. Based only on the information in the exhibit and an acceptable confidence level of 95%, how would you interpret the interaction of variable D with the dependent variable? http://cdn.aiotestking.com/wp-content/uploads/e20-007-v1/24.jpg A. For every 1 unit increase in variable D, holding all other variables constant, we can expect the dependent variable to increase by 10.23 units B. In this model, Variable D is not significantly interacting with the dependent variable C. For every 1 unit increase in variable D, holding all other variables constant, we can expect the dependent variable to be multiplied by 10.23 units D. Variable D is more significant than variables A, B, and C.
D. ( (pn,vn),(pn),(vn),( ) )
Review the following code: SELECT pn, vn, sum(prc*qty) FROM sale GROUP BY CUBE(pn, vn) ORDER BY 1, 2, 3; Which combination of subtotals do you expect to be returned by the query? A. (pn,vn) B. ( (pn,vn),(pn) ) C. ( (pn,vn),(pn),(vn) ) D. ( (pn,vn),(pn),(vn),( ) )
C. nominal
Since R factors are categorical variables, they are most closely related to which data classification level? A. interval B. ordinal C. nominal D. ratio
B. Convert the extracted text into a suitable document representation and index into a review corpus
The Marketing department of your company wishes to track opinion on a new product that was recently introduced. Marketing would like to know how many positive and negative reviews are appearing over a given period and potentially retrieve each review for more in-depth insight. They have identified several popular product review blogs that historically have published thousands of user reviews of your company's products. You have been asked to provide the desired analysis. You examine the RSS feeds for each blog and determine which fields are relevant. You then craft a regular expression to match your new product's name and extract the relevant text from each matching review. What is the next step you should take? A. Use the extracted text and your regular expression to perform a sentiment analysis based on mentions of the new product B. Convert the extracted text into a suitable document representation and index into a review corpus C. Read the extracted text for each review and manually tabulate the results D. Group the reviews using Naïve Bayesian classification
C. The change in purchase size is not practically important, and the good p-value of the second study is probably a result of the large study size.
The average purchase size from your online sales site is $17,200. The customer experience team believes a certain adjustment of the website will increase sales. A pilot study on a few hundred customers showed an increase in average purchase size of $1.47, with a significance level of p=0.1. The team runs a larger study, of a few thousand customers. The second study shows an increased average purchase size of $0.74, with a significance level of 0.03. What is your assessment of this study? A. The difference in the change in purchase size between the two studies is troubling; the team should run another, larger study. B. The change in purchase size is small, but may aggregate up to a large increase in profits over the entire customer base. C. The change in purchase size is not practically important, and the good p-value of the second study is probably a result of the large study size. D. The p-value of the second study shows a statistically significant change in purchase size. The new website is an improvement.
A. Chukwa
The web analytics team uses Hadoop to process access logs. They now want to correlate this data with structured user data residing in a production single-instance JDBC database. They collaborate with the production team to import the data into Hadoop. Which tool should they use? A. Chukwa B. Pig C. Sqoop D. Scribe
C. Sqoop
The web analytics team uses Hadoop to process access logs. They now want to correlate this data with structured user data residing in their massively parallel database. Which tool should they use to export the structured data from Hadoop? A. Chukwa B. Pig C. Sqoop D. Scribe
C. Irregular
Trend, seasonal, and cyclical are components of a time series. What is another component? A. Quadratic B. Linear C. Irregular D. Exponential
D. There is not enough data to create a test set.
Under which circumstance do you need to implement N-fold cross-validation after creating a regression model? A. There are categorical variables in the model. B. The data is unformatted. C. There are missing values in the data. D. There is not enough data to create a test set.
B. Data volume, processing complexity, and data structure variety.
What are the characteristics of Big Data? A. Data volume, business importance, and data structure variety. B. Data volume, processing complexity, and data structure variety. C. Data type, processing complexity, and data structure variety. D. Data volume, processing complexity, and business importance.
B. It does not handle missing values well.
What describes a true limitation of Logistic Regression method? A. It does not handle redundant variables well. B. It does not handle missing values well. C. It does not handle correlated variables well. D. It does not have explanatory values.
B. It is robust with redundant variables and correlated variables.
What describes a true property of Logistic Regression method? A. It handles missing values well. B. It is robust with redundant variables and correlated variables. C. It works well with discrete variables that have many distinct values. D. It works well with variables that affect the outcome in a discontinuous way.
B. Operates on queries and potentially increases the number of rows
What describes the use of UNION clause in a SQL statement? A. Operates on queries and potentially decreases the number of rows B. Operates on queries and potentially increases the number of rows C. Operates on tables and potentially decreases the number of columns D. Operates on both tables and queries and potentially increases both the number of rows and columns
C. Selects the values in vector v that are less than 1000 and assigns them to the vector nv
What does the R code nv <- v[v < 1000] do? A. Removes elements of vector v less than 1000 and assigns the elements >= 1000 to nv B. Sets nv to TRUE or FALSE depending on whether all elements of vector v are less than 1000 C. Selects the values in vector v that are less than 1000 and assigns them to the vector nv D. Selects values of vector v less than 1000, modifies v, and makes a copy to nv
B. Assigns the first 10 rows of f to the vector z
What does the R code z <- f[1:10, ] do? A. Assigns the 1st 10 columns of the 1st row of f to z B. Assigns the first 10 rows of f to the vector z C. Assigns a sequence of values from 1 to 10 to z D. Assigns the 1st 10 columns to z
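Both R-indexing questions above can be confirmed interactively; a minimal sketch with made-up values:

    v <- c(250, 1500, 999, 3000, 7)
    nv <- v[v < 1000]   # logical indexing keeps only the values below 1000
    nv                  # 250 999 7; v itself is left unchanged

    f <- data.frame(id = 1:20, score = rnorm(20))
    z <- f[1:10, ]      # an empty column index keeps all columns,
    nrow(z)             # so z holds the first 10 rows of f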
D. Java classes for HDFS types and MapReduce job management and HDFS
What is Hadoop? A. MapReduce paradigm and massive unstructured data storage on commodity hardware B. Java classes for HDFS types and MapReduce job management and the MapReduce paradigm C. MapReduce paradigm and HDFS D. Java classes for HDFS types and MapReduce job management and HDFS
A. It fits a smoothed curve to scatterplot data, to give a general sense of the data's behavior.
What is LOESS used for? A. It fits a smoothed curve to scatterplot data, to give a general sense of the data's behavior. B. It is a significance test for the correlation between two variables. C. It plots a continuous variable versus a discrete variable, to compare distributions across classes. D. It is run after a one-way ANOVA, to determine which population has the highest mean value.
C. A presentation for project sponsors
What is a core deliverable at the end of the analytic project? A. An implemented database design B. A whitepaper describing the project and the implementation C. A presentation for project sponsors D. The training materials
D. They can be used to calculate moving averages over various intervals.
What is a property of window functions in SQL commands? A. They don't require ordering of data within a window. B. They group rows into a single output row. C. They can be used between the keywords FROM and WHERE in a SELECT command. D. They can be used to calculate moving averages over various intervals.
C. Bar chart
What is an appropriate data visualization to use in a presentation for a project sponsor? A. Box and Whisker plot B. Pie chart C. Bar chart D. Density plot
D. ROC curve
What is an appropriate data visualization to use in a presentation for an analyst audience? A. Pie chart B. Area chart C. Stacked bar chart D. ROC curve
A. that a newly created model does not provide better predictions than the currently existing model
What is an example of a null hypothesis? A. that a newly created model does not provide better predictions than the currently existing model B. that a newly created model provides a prediction of a null sample mean C. that a newly created model provides a prediction of a null population mean D. that a newly created model provides a prediction that will be well fit to the null distribution
C. a subset of the provided data set selected at random and used to validate the model
What is holdout data? A. a subset of the provided data set that is removed by the data scientist because it contains data errors B. a subset of the provided data set selected at random and used to initially construct the model C. a subset of the provided data set selected at random and used to validate the model D. a subset of the provided data set that is removed by the data scientist because it contains outliers
B. Linear regression
What is one modeling or descriptive statistical function in MADlib that is typically not provided in a standard relational database? A. Expected value B. Linear regression C. Variance D. Quantiles
B. Operational process changes
What is required in a presentation for business analysts? A. Budgetary considerations and requests B. Operational process changes C. Detailed statistical explanation of the applicable modeling theory D. The presentation author's credentials
D. The "Big Picture" takeaways for executive level stakeholders
What is required in a presentation for project sponsors? A. Detailed statistical basis for the modeling approach used in the project B. Data warehouse design changes C. Line by line review of the developed code D. The "Big Picture" takeaways for executive level stakeholders
D. Key-value pairs
What is the format of the output from the Map function of MapReduce? A. Unique key record and separate records of all possible values B. Binary representation of keys concatenated with structured data C. Compressed index D. Key-value pairs
A. OVER
What is the mandatory clause that must be included when using window functions? A. OVER B. RANK C. PARTITION BY D. RANK BY
C. The availability of tagged training data.
What is the primary bottleneck in text classification? A. The high dimensionality of text data. B. The ability to parse unstructured text data. C. The availability of tagged training data. D. The fact that text corpora are dynamic.
D. imposes a structure on the unstructured/semi-structured text for downstream analysis
What is the purpose of the process step "parsing" in text analysis? A. computes the TF-IDF values for all keywords and indices B. performs the search and/or retrieval in finding a specific topic or an entity in a document C. executes the clustering and classification to organize the contents D. imposes a structure on the unstructured/semi-structured text for downstream analysis
B. Daily Log files from a web server that receives 100,000 hits per minute
What would be considered "Big Data"? A. An OLAP Cube containing customer demographic information about 100,000,000 customers B. Daily Log files from a web server that receives 100,000 hits per minute C. Aggregated statistical data stored in a relational database table D. Spreadsheets containing monthly sales data for a Global 100 corporation
B. Show how you met the project goals
When creating a presentation for a technical audience, what is the main objective? A. Show that you met the project goals B. Show how you met the project goals C. Show if the model will meet the SLA D. Show the technique to be used in the production environment
B. Show that you met the project goals
When creating a project sponsor presentation, what is the main objective? A. Show how you met the project goals B. Show that you met the project goals C. Show how well the model will meet the SLA (service level agreement) D. Clearly describe the methods and techniques used
A. When you are using several categorical input variables with over 1000 possible values each.
When would you prefer a Naive Bayes model to a logistic regression model for classification? A. When you are using several categorical input variables with over 1000 possible values each. B. When you need to estimate the probability of an outcome, not just which class it is in. C. When all the input variables are numerical. D. When some of the input variables might be correlated.
B. where all subtotals and grand totals are to be included in the output
When would you use GROUP BY ROLLUP clause in your OLAP query? A. where only the subtotals are to be included in the output B. where all subtotals and grand totals are to be included in the output C. where only the grand totals are to be included in the output D. where only specific subtotals and grand totals for a combination of variables are to be included in the output
D. When you cannot make an assumption about the distribution of the populations
When would you use a Wilcoxon Rank Sum test? A. When the data cannot easily be sorted B. When the data can easily be sorted C. When the populations represent the sums of other values D. When you cannot make an assumption about the distribution of the populations
B. List
Which R data structure allows elements to have different data types? A. Vector B. List C. Matrix D. Array
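A short R illustration of why the answer is a list: a list keeps each element's type, while a vector silently coerces everything to a single type:

    x <- list(name = "Ana", age = 42L, scores = c(3.5, 4.0))   # mixed types preserved
    str(x)
    y <- c("Ana", 42L, 3.5)   # a vector coerces all elements to character
    class(y)                  # "character"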
D. CUBE
Which SQL OLAP extension provides all possible grouping combinations? A. CROSS JOIN B. ROLLUP C. UNION ALL D. CUBE
A. Define the process to maintain the model
Which activity is performed in the Operationalize phase of the Data Analytics Lifecycle? A. Define the process to maintain the model B. Try different analytical techniques C. Try different variables D. Transform existing variables
A. Run a pilot
Which activity might be performed in the Operationalize phase of the Data Analytics Lifecycle? A. Run a pilot B. Try different analytical techniques C. Try different variables D. Transform existing variables
B. K-means clustering
Which analytical method is considered unsupervised? A. Naïve Bayesian classifier B. K-means clustering C. Decision tree D. Linear regression
B. Advanced analytical methods
Which characteristic applies mainly to Data Science as opposed to Business Intelligence? A. Robust reporting B. Advanced analytical methods C. Focus on structured data D. Data dashboards
B. Uses only structured data
Which characteristic applies only to Business Intelligence as opposed to Data Science? A. Supports solving "what if" scenarios B. Uses only structured data C. Uses large data sets D. Uses predictive modeling techniques
C. Webserver log
Which data asset is an example of quasi-structured data? A. Database table B. XML data file C. Webserver log D. News article
C. XML data file
Which data asset is an example of semi-structured data? A. Webserver log B. Database table C. XML data file D. News article
C. A binary value
Which data type value is used for the observed response variable in a logistic regression model? A. Any positive real number B. Any integer C. A binary value D. Any real number
A. text pattern matching
Which functionality do regular expressions provide? A. text pattern matching B. underflow prevention C. increased numerical precision D. decreased processing complexity
B. box and whisker plot
Which graphical representation shows the distribution and multiple summary statistics of a continuous variable for each value of a corresponding discrete variable? A. dotplot B. box and whisker plot C. scatterplot D. binplot
A. Business User
Which key role for a successful analytic project can consult and advise the project team on the value of end results and how these will be used on a day-to-day basis? A. Business User B. Project Manager C. Data Scientist D. Business Intelligence Analyst
A. Business Intelligence Analyst
Which key role for a successful analytic project can provide business domain expertise with a deep understanding of the data and key performance indicators? A. Business Intelligence Analyst B. Project Manager C. Project Sponsor D. Business User
A. Ordinary Least Squares
Which method is used to solve for coefficients b_0, b_1, ..., b_n in your linear regression model Y = b_0 + b_1x_1 + b_2x_2 + ... + b_nx_n? A. Ordinary Least Squares B. Apriori Algorithm C. Ridge and Lasso D. Integer programming
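In R, lm() estimates these coefficients by ordinary least squares, which is equivalent to solving the normal equations b = (X'X)^(-1) X'y. A minimal sketch on simulated data (the coefficient values below are invented for illustration):

    # Simulated data with known coefficients: y = 1 + 2*x1 - 3*x2 + noise
    set.seed(1)
    n  <- 100
    x1 <- rnorm(n); x2 <- rnorm(n)
    y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n, sd = 0.5)
    coef(lm(y ~ x1 + x2))            # OLS estimates via lm()
    X <- cbind(1, x1, x2)            # the same estimates from the normal equations
    solve(t(X) %*% X, t(X) %*% y)    # (X'X)^(-1) X'y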
D. Clickstream data
Which of the following is an example of quasi-structured data? A. OLAP B. OLTP C. Customer record table D. Clickstream data
A. Stemming
Which process in text analysis can be used to reduce dimensionality? A. Stemming B. Parsing C. Digitizing D. Sorting
A. Probability
Which type of numeric value does a logistic regression model estimate? A. Probability B. A p-value C. Any integer D. Any real number
A. Data Warehouse
Which word or phrase completes the statement? A spreadsheet is to a data island as a centralized database for reporting is to a ________? A. Data Warehouse B. Data Repository C. Analytic Sandbox D. Data Mart
C. Optimization and Predictive Modeling
Which word or phrase completes the statement? Business Intelligence is to ad-hoc reporting and dashboards as Data Science is to ______________ . A. Structured Data and Data Sources B. Alerts and Queries C. Optimization and Predictive Modeling D. Sales and profit reporting
A. PostgreSQL
Which word or phrase completes the statement? Mahout is to Hadoop as MADlib is to ____________ . A. PostgreSQL B. R C. Excel D. SAS
A. Clickstream data
Which word or phrase completes the statement? Structured data is to OLAP data as quasi-structured data is to ____ A. Clickstream data B. XML data C. Text documents D. Image files
C. "Communicative and Collaborative"
Which word or phrase completes the statement? Theater actor is to "Artistic and Expressive" as Data Scientist is to ________________ A. "Logical and Steadfast" B. "Introverted and Technical" C. "Communicative and Collaborative" D. "Independent and Intelligent"
A. Data frame
Which word or phrase completes the statement? A Data Scientist would consider that a RDBMS is to a Table as R is to a ______________ . A. Data frame B. List C. Matrix D. Array
A. Confusion matrix is to classifier
Which word or phrase completes the statement? Data-ink ratio is to data visualization as __________ . A. Confusion matrix is to classifier B. Data scientist is to big data C. Seasonality is to ARIMA D. K-means is to Naive Bayes
C. Collection of data assets for modeling
Which word or phrase completes the statement? A data warehouse is to a centralized database for reporting as an analytic sandbox is to a _______? A. Centralized database of KPIs B. Collection of low-volume databases C. Collection of data assets for modeling D. Collection of data assets for ETL
D. Predicting
Which word or phrase completes the statement? Business Intelligence is to monitoring trends as Data Science is to ________ trends. A. Optimizing B. Discarding C. Driving D. Predicting
B. Main message is to context
Which word or phrase completes the statement? Emphasis color is to standard color as _______ . A. Main message is to key findings B. Main message is to context C. Frequent item set is to item D. Pie chart is to proportions
D. Pig
Which word or phrase completes the statement? Unix is to bash as Hadoop is to: A. NameNode B. HDFS C. Sqoop D. Pig
D. Mahout
While having a discussion with your colleague, this person mentions that they want to perform K-means clustering on text file data stored in HDFS. Which tool would you recommend to this colleague? A. Sqoop B. HBase C. Scribe D. Mahout
A. ACF as an indication of stationarity, and PACF for the correlation between X_t and X_{t-k} not explained by their mutual correlation with X_1 through X_{k-1}.
You are analyzing a time series and want to determine its stationarity. You also want to determine the order of autoregressive models. How are the autocorrelation functions used? A. ACF as an indication of stationarity, and PACF for the correlation between X_t and X_{t-k} not explained by their mutual correlation with X_1 through X_{k-1}. B. PACF as an indication of stationarity, and ACF for the correlation between X_t and X_{t-k} not explained by their mutual correlation with X_1 through X_{k-1}. C. ACF as an indication of stationarity, and PACF to determine the correlation of X_1 through X_{k-1}. D. PACF as an indication of stationarity, and ACF to determine the correlation of X_1 through X_{k-1}.
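In R the two correlograms come from acf() and pacf(); a slowly decaying ACF is a warning sign of non-stationarity, while the lag at which the PACF cuts off suggests the AR order. A small simulated check:

    # Simulate a stationary AR(2) series and inspect its correlograms
    set.seed(7)
    x <- arima.sim(model = list(ar = c(0.6, 0.3)), n = 500)
    acf(x)    # should decay fairly quickly for this stationary series
    pacf(x)   # should cut off after lag 2, consistent with an AR(2) process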
D. Decision Trees
You are analyzing data in order to build a classifier model. You discover non-linear data and discontinuities that will affect the model. Which analytical method would you recommend? A. Linear Regression B. Logistic Regression C. ARIMA D. Decision Trees
C. Linear regression
You are asked to create a model to predict the total number of monthly subscribers for a specific magazine. You are provided with 1 year's worth of subscription and payment data, user demographic data, and 10 years' worth of content of the magazine (articles and pictures). Which algorithm is the most appropriate for building a predictive model for subscribers? A. Decision trees B. Logistic regression C. Linear regression D. TF-IDF
A. SQRT((2-8)^2+(4-10)^2) or 8.49
You are attempting to find the Euclidean distance between two centroids: Centroid A's coordinates: (X = 2, Y = 4) Centroid B's coordinates: (X = 8, Y = 10) Which formula finds the correct Euclidean distance? A. SQRT((2-8)^2+(4-10)^2) or 8.49 B. SQRT(((2-8) x 2) + ((4-10) x 2)) or 12.17 C. ((2-8)^2+(4-10)^2) or 72 D. ((2-8) x 2 + (4-10) x 2) or 148
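Option A is simply the two-dimensional Euclidean distance sqrt((2-8)^2 + (4-10)^2) = sqrt(72) ≈ 8.49, which is easy to confirm in R:

    a <- c(2, 4); b <- c(8, 10)   # the two centroids
    sqrt(sum((a - b)^2))          # 8.485281
    dist(rbind(a, b))             # same result from the built-in distance function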
A. 42.0
You are building a logistic regression model to predict whether a tax filer will be audited within the next two years. Your training set population is 1000 filers. The audit rate in your training data is 4.2%. What is the sum of the probabilities that the model assigns to all the filers in your training set that have been audited? A. 42.0 B. 4.2 C. 0.42 D. 0.042
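A known property of logistic regression fitted by maximum likelihood with an intercept is that the fitted probabilities over the whole training set sum to the observed number of positives, here 4.2% of 1,000 = 42, which is the figure option A gives. A simulated sketch of that property (the data below is invented; only the rare-event setup mirrors the question):

    # With an intercept, the MLE score equations force sum(fitted) == sum(y)
    set.seed(3)
    n <- 1000
    x <- rnorm(n)
    y <- rbinom(n, 1, plogis(-3 + 0.5 * x))   # a rare outcome, a few percent of cases
    fit <- glm(y ~ x, family = binomial)
    sum(fitted(fit))   # equals the number of positives in the training data
    sum(y)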
C. Run MapReduce to transform the data, and find relevant key-value pairs.
You are given 10,000,000 user profile pages of an online dating site in XML files, and they are stored in HDFS. You are assigned to divide the users into groups based on the content of their profiles. You have been instructed to try K-means clustering on this data. How should you proceed? A. Run a Naive Bayes classification as a pre-processing step in HDFS. B. Divide the data into sets of 1,000 user profiles, and run K-means clustering in RHadoop iteratively. C. Run MapReduce to transform the data, and find relevant key-value pairs. D. Partition the data by XML file size, and run K-means clustering in each partition.
B. Lift
You are performing a market basket analysis using the Apriori algorithm. Which measure is a ratio describing how many more times two items are present together than would be expected if those two items were statistically independent? A. Leverage B. Lift C. Support D. Confidence
B. Visualize the data to further explore the characteristics of each data set
You are provided four different datasets. Initial analysis of these datasets shows that they have identical mean, variance, and correlation values. What should your next step in the analysis be? A. Select one of the four datasets and begin planning and building a model B. Visualize the data to further explore the characteristics of each data set C. Combine the data from all four of the datasets and begin planning and building a model D. Recalculate the descriptive statistics since they are unlikely to be identical for each dataset
C. K-means clustering
You are studying the behavior of a population, and you are provided with multidimensional data at the individual level. You have identified four specific individuals who are valuable to your study, and would like to find all users who are most similar to each individual. Which algorithm is the most appropriate for this study? A. Association rules B. Linear regression C. K-means clustering D. Decision trees
C. Either Formula A or Formula B is effective at promoting weight gain.
You are testing two new weight-gain formulas for puppies. The test gives the results: Control group: 1% weight gain; Formula A: 3% weight gain; Formula B: 4% weight gain. A one-way ANOVA returns a p-value = 0.027. What can you conclude? A. Formula A and Formula B are both effective at promoting weight gain. B. Formula B is more effective at promoting weight gain than Formula A. C. Either Formula A or Formula B is effective at promoting weight gain. D. Formula A and Formula B are about equally effective at promoting weight gain.
A. Goodness of fit
You are using MADlib for Linear Regression analysis. Which value does the statement return? SELECT (linregr(depvar, indepvar)).r2 FROM zeta1; A. Goodness of fit B. Coefficients C. Standard error D. P-value
C. Decrease the number of clusters
You are using k-means clustering to classify heart patients for a hospital. You have chosen Patient Sex, Height, Weight, Age and Income as measures and have used 3 clusters. When you create a pair-wise plot of the clusters, you notice that there is significant overlap between the clusters. What should you do? A. Identify additional measures to add to the analysis B. Remove one of the measures C. Decrease the number of clusters D. Increase the number of clusters
C. The rule is coincidental
You are using the Apriori algorithm to determine the likelihood that a person who owns a home has a good credit score. You have determined that the confidence for the rules used in the algorithm is > 75%. You calculate lift = 1.011 for the rule, "People with good credit are homeowners". What can you determine from the lift calculation? A. Support for the association is low B. Leverage of the rules is low C. The rule is coincidental D. The rule is true
A. There is a 3% chance that you have identified a difference between the populations when in reality there is none.
You do a Student's t-test to compare the average test scores of sample groups from populations A and B. Group A averaged 10 points higher than group B. You find that this difference is significant, with a p-value of 0.03. What does that mean? A. There is a 3% chance that you have identified a difference between the populations when in reality there is none. B. The difference in scores between a sample from population A and a sample from population B will tend to be within 3% of 10 points. C. There is a 3% chance that a sample group from population A will score 10 points higher than a sample group from population B. D. There is a 97% chance that a sample group from population A will score 10 points higher than a sample group from population B.
A. Report back to the business owner that the current data model does not support the business question.
You have been assigned to do a study of the daily revenue effect of a pricing model of online transactions. All the data currently available to you has been loaded into your analytics database: revenue data, pricing data, and online transaction data. You find that all the data comes in different levels of granularity. The transaction data has timestamps (day, hour, minutes, seconds), pricing is stored at the daily level, and revenue data is only reported monthly. What is your next step? A. Interpolate a daily model for revenue from the monthly revenue data. B. Report back to the business owner that the current data model does not support the business question. C. Aggregate all data to the monthly level in order to create a monthly revenue model. D. Disregard revenue as a driver in the pricing model, and create a daily model based on pricing and transactions only.
B. You have written documentation, and the code has been handed off to the Database Administrator and business operations.
You have been assigned to do a study of the daily revenue effect of a pricing model of online transactions. When have you completed the analytics lifecycle? A. You have a completely developed model, and the results have shown statistically acceptable results. B. You have written documentation, and the code has been handed off to the Database Administrator and business operations. C. You have presented the results of the model to both the internal analytics team and the business owner of the project. D. You have a completely developed model based on both a sample of the data and the entire set of data available.
C. Report that the results are insignificant, and reevaluate the original business question.
You have been assigned to do a study of the daily revenue effect of a pricing model of online transactions. You have tested all the theoretical models in the previous model planning stage, and all tests have yielded statistically insignificant results. What is your next step? A. Move forward on the model with the highest significance scores relative to the others. B. Run all the models again against a larger sample, leveraging more historical data. C. Report that the results are insignificant, and reevaluate the original business question. D. Modify samples used by the models and iterate until a significant result occurs.
D. MADlib
You have been assigned to run a linear regression model for each of 5,000 distinct districts, and all the data is currently stored in a PostgreSQL database. Which tool/library would you use to produce these models with the least effort? A. HBase B. Mahout C. R D. MADlib
A. MADlib
You have been assigned to run a logistic regression model for each of 100 countries, and all the data is currently stored in a PostgreSQL database. Which tool/library would you use to produce these models with the least effort? A. MADlib B. Mahout C. RStudio D. HBase
B. The production team needs to understand how your model will interact with the processes they already support. Give them documentation on expected model inputs and outputs, and guidance on error-handling.
You have completed your model and are handing it off to be deployed in production. What should you deliver to the production team, along with your commented code? A. The production team are technical, and they need to understand how the processes that they support work, so give them the same presentation that you prepared for the analysts. B. The production team needs to understand how your model will interact with the processes they already support. Give them documentation on expected model inputs and outputs, and guidance on error-handling. C. The production team supports the processes that run the organization, and they need context to understand how your model interacts with the processes they already support. Give them the same presentation that you prepared for the project sponsor. D. The production team supports the processes that run the organization, and they need context to understand how your model interacts with the processes they already support. Give them the executive summary.
C. The tree is probably overfit. Try fitting shallower trees and using an ensemble method.
You have fit a decision tree classifier using 12 input variables. The resulting tree used 7 of the 12 variables, and is 5 levels deep. Some of the nodes contain only 3 data points. The AUC of the model is 0.85. What is your evaluation of this model? A. The tree did not split on all the input variables. You need a larger data set to get a more accurate model. B. The AUC is high, and the small nodes are all very pure. This is an accurate model. C. The tree is probably overfit. Try fitting shallower trees and using an ensemble method. D. The AUC is high, so the overall model is accurate. It is not well-calibrated, because the small nodes will give poor estimates of probability.
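A hedged sketch of the suggested remedy in R, using rpart to cap tree depth and randomForest as the ensemble (the iris data and the tuning values stand in for the real 12-variable data set, and the randomForest package is assumed to be installed):
    library(rpart)
    library(randomForest)
    shallow_tree <- rpart(Species ~ ., data = iris,
                          control = rpart.control(maxdepth = 3, minbucket = 10))  # shallower tree, larger leaves
    forest <- randomForest(Species ~ ., data = iris, ntree = 200)  # ensemble of trees grown on bootstrap samples
    forest  # the out-of-bag error gives a less optimistic accuracy estimate than a single deep tree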
C. {grape, apple, orange} must be a frequent itemset.
You have run the association rules algorithm on your data set, and the two rules {banana, apple} => {grape} and {apple, orange} => {grape} have been found to be relevant. What else must be true? A. {grape} => {banana, apple} must be a relevant rule. B. {banana, apple, grape, orange} must be a frequent itemset. C. {grape, apple, orange} must be a frequent itemset. D. {banana, apple} => {orange} must be a relevant rule.
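The reasoning: a rule X => Y can only be reported when the combined itemset of X and Y meets the minimum support threshold, so the itemset behind {apple, orange} => {grape}, namely {grape, apple, orange}, must be frequent; none of the other statements is forced to hold. A minimal sketch with the arules package on invented baskets (transactions and thresholds are assumptions):
    library(arules)
    baskets <- list(c("banana", "apple", "grape"),
                    c("apple", "orange", "grape"),
                    c("banana", "apple", "grape", "orange"),
                    c("apple", "orange", "grape"))
    trans <- as(baskets, "transactions")
    rules <- apriori(trans, parameter = list(support = 0.5, confidence = 0.75))
    inspect(rules)  # every reported rule's combined itemset already satisfies the support threshold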
A. Full outer join
You have two tables of customers in your database. Customers in cust_table_1 were sent an email promotion last year, and customers in cust_table_2 received a newsletter last year. Customers can only be entered once per table. You want to create a table that includes all customers, and any of the communications they received last year. Which type of join would you use for this table? A. Full outer join B. Inner join C. Left outer join D. Cross join
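In SQL this corresponds to a FULL OUTER JOIN on the customer key; in R, merge() with all = TRUE behaves the same way. A minimal sketch with invented customer IDs and column names:
    # Stand-ins for cust_table_1 (email promotion) and cust_table_2 (newsletter).
    cust_table_1 <- data.frame(cust_id = c(1, 2, 3), promo_email = TRUE)
    cust_table_2 <- data.frame(cust_id = c(2, 3, 4), newsletter  = TRUE)
    merge(cust_table_1, cust_table_2, by = "cust_id", all = TRUE)  # full outer join: keeps customers 1 and 4 with NA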
D. Decrease the number of clusters
You have used k-means clustering to classify behavior of 100,000 customers for a retail store. You decide to use household income, age, gender and yearly purchase amount as measures. You have chosen to use 8 clusters and notice that 2 clusters only have 3 customers assigned. What should you do? A. Identify additional measures to add to the analysis B. Increase the number of clusters C. Decrease the number of measures used D. Decrease the number of clusters
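In R the problem shows up directly in the fitted object's cluster sizes, and refitting with fewer centers is the usual first adjustment. A minimal sketch on simulated data (a 10,000-row stand-in for the 100,000 customers; on the real data, very small entries in fit8$size would flag the two 3-customer clusters):
    set.seed(7)
    customers <- matrix(rnorm(10000 * 4), ncol = 4)   # stand-in for income, age, gender, purchase amount
    fit8 <- kmeans(scale(customers), centers = 8, nstart = 10)
    fit8$size                                         # inspect cluster sizes; tiny counts signal too many clusters
    fit4 <- kmeans(scale(customers), centers = 4, nstart = 10)
    fit4$size                                         # fewer, better-populated clusters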
A. Ensure that the TaskTracker is running.
You submit a MapReduce job to a Hadoop cluster and notice that although the job was successfully submitted, it is not completing. What should you do? A. Ensure that the TaskTracker is running. B. Ensure that the JobTracker is running. C. Ensure that the NameNode is running. D. Ensure that a DataNode is running.
A. Pig
Your colleague, who is new to Hadoop, approaches you with a question. They want to know how best to access their data. This colleague has a strong background in data flow languages and programming. Which query interface would you recommend? A. Pig B. Hive C. Howl D. HBase
D. Hive
Your colleague, who is new to Hadoop, approaches you with a question. They want to know how best to access their data. This colleague has previously worked extensively with SQL and databases. Which query interface would you recommend? A. HBase B. Pig C. Howl D. Hive
A. One-way ANOVA
Your company has 3 different sales teams. Each team's sales manager has developed incentive offers to increase the size of each sales transaction. Any sales manager whose incentive program can be shown to increase the size of the average sales transaction will receive a bonus. Data are available for the number and average sale amount for transactions offering one of the incentives as well as transactions offering no incentive. The VP of Sales has asked you to determine analytically if any of the incentive programs has resulted in a demonstrable increase in the average sale amount. Which analytical technique would be appropriate in this situation? A. One-way ANOVA B. Multi-way ANOVA C. Student's t-test D. Wilcoxon Rank Sum Test
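A minimal R sketch of the one-way ANOVA, with simulated sale amounts and incentive labels standing in for the company's data:
    set.seed(3)
    sales <- data.frame(incentive = factor(rep(c("none", "team1", "team2", "team3"), each = 50)),
                        amount    = c(rnorm(50, 100, 20), rnorm(50, 110, 20),
                                      rnorm(50, 105, 20), rnorm(50, 100, 20)))
    fit <- aov(amount ~ incentive, data = sales)
    summary(fit)    # a small p-value suggests at least one incentive shifts the average sale
    TukeyHSD(fit)   # pairwise comparisons indicate which incentive(s) differ from the no-incentive group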
A. K-means clustering
Your customer provided you with 2,000 unlabeled records and asked you to separate them into three groups. What is the correct analytical method to use? A. K-means clustering B. Linear regression C. Naive Bayesian classification D. Logistic regression
D. One-way ANOVA
Your organization has a website where visitors randomly receive one of two coupons. It is also possible that visitors to the website will not receive a coupon. You have been asked to determine if offering a coupon to visitors to your website has any impact on their purchase decision. Which analysis method should you use? A. K-means clustering B. Association rules C. Student's t-test D. One-way ANOVA
A. Logistic regression
A data scientist wants to predict the probability of death from heart disease based on three risk factors: age, gender, and blood cholesterol level. What is the most appropriate method for this project? A. Logistic regression B. Linear regression C. K-means clustering D. Apriori algorithm
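A hedged R sketch of the fit, with simulated patients and made-up coefficients standing in for the real risk-factor data:
    set.seed(5)
    heart <- data.frame(age         = rnorm(200, 60, 10),
                        gender      = factor(sample(c("F", "M"), 200, replace = TRUE)),
                        cholesterol = rnorm(200, 220, 40))
    heart$death <- rbinom(200, 1, plogis(-10 + 0.08 * heart$age + 0.02 * heart$cholesterol))
    model <- glm(death ~ age + gender + cholesterol, data = heart, family = binomial)
    predict(model, type = "response")[1:5]  # predicted probabilities of death for the first five patients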