IBM Data Science


Descriptive tables share which of the following characteristics? a. Measures of Central Tendency b. Measures of Dispersion c. Measures of Distribution d. All of the above

d. All of the above

Consider the following: one red fish detected, five blue fish and four red fish undetected. Given that red fish is relevant data (signal) and blue fish is irrelevant data (noise), what is the precision of this system? a. 0% b. 100% c. 50% d. Cannot determine with given data

b. 100%

Variance measures how far a set of (random) numbers is spread out from its average value. In a certain data set, you have calculated that the variance of a certain measure is 16. What would be the standard deviation of that measure? a. 4 b. 256 c. 32 d. 8

a. 4
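
The relationship in this card can be checked with a short sketch using Python's standard `statistics` module. The sample values are made up for illustration (they are not from the card) and were chosen so that the population variance comes out to 16.

```python
import math
import statistics

# Illustrative values: the mean is 6 and each point lies 4 away from it.
values = [2, 10]

variance = statistics.pvariance(values)  # population variance
std_dev = math.sqrt(variance)            # standard deviation is its square root

print(variance, std_dev)
```

The square root is what turns a variance of 16 into a standard deviation of 4; squaring (256) or doubling (32) goes in the wrong direction.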

A particular machine learning model has detected 80 true positive signals plus 20 false positive signals (it flagged them as relevant data, but they are not). What is the precision of the system? a. 80% b. 20% c. 40% d. 100%

a. 80%
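
As a minimal sketch of the arithmetic behind this answer, precision is the share of flagged items that are truly relevant, TP / (TP + FP). The function name `precision` is a hypothetical helper, not part of any library.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Share of flagged items that are actually relevant: TP / (TP + FP)."""
    flagged = true_positives + false_positives
    if flagged == 0:
        raise ValueError("precision is undefined when nothing was flagged")
    return true_positives / flagged


# Counts taken from the card: 80 true positives, 20 false positives.
model_precision = precision(true_positives=80, false_positives=20)
print(model_precision)
```

The same helper covers the earlier fish card: one red fish detected and no blue fish detected gives `precision(1, 0)`, which is 1.0.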

Let's say you want to predict how much salary one would earn based on level of education. Your y axis is salary and your x axis is educational buckets (high school, Bachelor's, Master's, and so forth). Which of the following models is best suited to help you predict, given a certain salary, what the education level of the individual might be? a. Logistic regression b. Linear regression c. Sigmoid operation d. Classification

b. Linear regression

Which of the following best describes a Decision Tree Classifier? a. Analyzes a data set in which there are one or more independent variables that determine one of two outcomes b. Maps observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves). c. Constructs multiple decision trees to produce the label that is a mode of each decision tree. d. Produces a classification prediction model in the form of an ensemble of decision trees.

b. Maps observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves).

After the data exploration stage, the data scientist begins the work of data representation and transformation, where they gather descriptive statistics to further analyze the data. Which of the following activities depicts working with descriptive statistics? Select all that apply. a. Measure of p-value b. Measure of central tendency c. Measure of dispersion d. Measure of distribution

b. Measure of central tendency c. Measure of dispersion d. Measure of distribution
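
The three families of descriptive statistics named in this answer can be sketched with Python's standard library. The sample `scores` is invented for illustration, and using a frequency count to stand in for "distribution" is an assumption about what the card means.

```python
import statistics
from collections import Counter

scores = [1, 2, 2, 3, 7]  # made-up sample

# Central tendency: where the data is centered.
central_tendency = {
    "mean": statistics.mean(scores),
    "median": statistics.median(scores),
    "mode": statistics.mode(scores),
}

# Dispersion: how spread out the data is.
dispersion = {
    "range": max(scores) - min(scores),
    "std_dev": statistics.pstdev(scores),
}

# Distribution: how often each value occurs.
distribution = Counter(scores)

print(central_tendency, dispersion, distribution)
```

A p-value, by contrast, comes from a hypothesis test and belongs to inferential statistics, which is why option a is excluded.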

Linear regression tries to fit a line while ___________ the distance to each point. Fill in the blank. a. Maximizing b. Minimizing c. Optimizing d. Squaring

b. Minimizing
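
A minimal sketch of what "minimizing the distance" usually means in practice: ordinary least squares picks the slope and intercept that minimize the sum of squared vertical distances from the points to the line. The data and the helper names `fit_line` and `squared_error` are hypothetical.

```python
def fit_line(xs, ys):
    """Closed-form ordinary least squares for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def squared_error(xs, ys, slope, intercept):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]  # made-up observations

slope, intercept = fit_line(xs, ys)
best = squared_error(xs, ys, slope, intercept)
worse = squared_error(xs, ys, slope + 0.5, intercept)  # any other line does worse

print(slope, intercept, best, worse)
```

Nudging the slope away from the fitted value increases the squared error, which is the sense in which the fitted line is the minimizer.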

Logistic regression looks like an S curve. Which of the following (activation functions) describes the S curve in a logistic regression? a. Parabolic tangent operation b. Sigmoid operation c. Rectified Linear Units (ReLU) d. Gaussian

b. Sigmoid operation
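
For reference, the S curve in this card is the sigmoid (logistic) function, 1 / (1 + e^(-z)). The sketch below evaluates it at a few points; the helper name `sigmoid` and the chosen inputs are illustrative only.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function: maps any real z into the open interval (0, 1)."""
    # Kept simple on purpose; very large negative z would overflow math.exp.
    return 1.0 / (1.0 + math.exp(-z))


# The curve rises from near 0, passes through 0.5 at z = 0, and levels off near 1.
curve = [(z, sigmoid(z)) for z in (-6, -2, 0, 2, 6)]
for z, value in curve:
    print(z, round(value, 4))
```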

The Watson Jeopardy! game used _____________ machine learning. Fill in the blank. a. Unsupervised b. Supervised c. Reinforcement d. Semi-supervised

b. Supervised

One of the fundamentals of data visualization lies in the human psychology of how it is perceived, through principles such as similarity, proximity, and enclosure. Which of the following best describes the notion of proximity? a. Elements are perceived as groups depending on the visual characteristics they share, such as color or value. b. The human eye perceives elements to be related based on how close they are to one another. c. Introduced by Stephen E. Palmer in 1992, the common regions principle shows how enclosing elements in other elements helps people see individual items as distinct groups. d. Collaboration, with social interaction and multiple interpretations, is fundamental to the analysis and visualization process.

b. The human eye perceives elements to be related based on how close they are to one another.

If you are building a deep learning ecosystem, which of the following concerns should be your starting point? a. Purchase the appropriate hardware and software for deep learning. b. Ensure that I have Python running plus all the necessary packages and libraries. c. Ensure that I have access to a robust platform as a service plus access to deep learning frameworks. d. Move all my data to a cloud platform with robust deep learning algorithms and access to vast related databases.

c. Ensure that I have access to a robust platform as a service plus access to deep learning frameworks.

Data visualization comes in two broad categories. Which of the below depict this distinction: a. Visualization of structured versus unstructured data b. Visualization using maps versus Brunel c. Exploratory versus explanatory visualization d. Visualization from software products such as SAS or SPSS, versus visualization from open source packages and libraries such as PixieDust, Brunel and matplotlib.

c. Exploratory versus explanatory visualization

When working with Data Refinery in Watson Studio, you are presented with three tabs: Data, Profile, and Visualization. What is the purpose of the Profile view? a. In the Profile view, the user can access their profile metrics, such as the number of services provisioned, account status, and other user profile features. b. In the Profile view, the user can build detailed graphs to better view the raw data. c. In the Profile view, the user can validate the data to see if any features may need further Data Refinery. d. None of the above.

c. In the Profile view, the user can validate the data to see if any features may need further Data Refinery.

In Watson Studio, when you upload your csv file, you are presented with two data frame constructs that you can apply to your raw data. Which of the following depicts those data frames? a. Python and R b. Brunel and Bokeh c. Pandas and SparkSession d. NumPy and SciKit

c. Pandas and SparkSession

In October 2015, AlphaGo, an AI-powered system, beat Mr. Fan Hui, the reigning 3-times European Champion of the complex board game Go, by 5 points to 0. Which machine learning method did it use? a. Unsupervised b. Supervised c. Reinforcement d. Semi-supervised

c. Reinforcement

When using Jupyter Notebooks, inevitably, you will need to import libraries such as NumPy and SciPy. Which of the following integration layers best describes this kind of an activity? a. Data munging libraries and tools b. Visualization and plotting tools c. Scientific computing and statistics packages d. Deep learning frameworks

c. Scientific computing and statistics packages

If you are looking for a tool that is easy to learn and very flexible with what you want to render, which of the following is the best fit for your needs? a. Matplotlib b. Seaborn c. Tableau d. Google Sheets

c. Tableau

The Brunel project defines a highly succinct and novel language that defines interactive data visualizations. Which of the following statements is true? a. Brunel visualization is based on tabular data. b. Brunel Visualization Language is a high-level language developed by IBM and open-sourced in 2015. c. Brunel describes visualizations in terms of composable actions and drives a visualization engine (D3) that performs the actual rendering and interactivity. d. All of the above.

d. All of the above.

If you had to choose one overarching difference between the methodology examples (KDD, SEMMA, and CRISP-DM), which of the following would best depict that difference in approach? a. Unlike KDD and SEMMA, CRISP-DM considers business understanding. b. SEMMA, unlike the other two methodologies, employs data modeling in its approach. c. Data Transformation is done only with a KDD approach. d. All of the three methodologies consider the same approaches to data analytics and there is no overarching difference between them.

d. All of the three methodologies consider the same approaches to data analytics and there is no overarching difference between them.

Data representations such as univariate and bivariate analysis are used to immediately understand trends, distributions, and differences amongst groups. Which of the following is an example of univariate representation? a. Bar charts and box plots b. Line charts c. Scatterplots d. Histograms

d. Histograms

What makes a deep learning network "deep"? a. It has been trained many times and its accuracy has improved over time. b. The system had access to "deep" knowledge as its corpus c. The system has many neurons d. It is a multi-perceptron with many 'hidden' layers

d. It is a multi-perceptron with many 'hidden' layers

With ____________ data, you have categorical variables that can be described by groups rather than numbers. Fill in the blank. a. Messy b. Normalized c. Unstructured d. Structured

d. Structured

Which of the following algorithms is used for supervised learning? a. Clustering b. Gaussian mixture c. Hidden Markov model d. Support Vector Machines

d. Support Vector Machines

A network graph displays nodes that are connected and positioned depending on their mutual relationship. What type of data is best suited for network graphs? a. geographic distribution b. categorized data c. time-based data d. multi-dimension data

d. multi-dimension data

Which of the following are examples of unstructured data? Select all that apply. a. CSV files b. Facebook images c. Records in IBM DB2 database d. Twitter feeds

b. Facebook images d. Twitter feeds

A spam collection engine has quarantined messages that were not spam, were not unsolicited, and were important for the user. How would you characterize those important yet automatically removed messages? a. False negative b. False positive c. Low precision d. Low recall

b. False positive

Supervised learning has many advantages. Which of the following may be shortcomings of supervised learning? a. It requires vast amounts of data. b. Labeling the data is arduous and expensive. c. They are not used much as of late. d. Clustering is difficult in supervised learning.

b. Labeling the data is arduous and expensive.

Consider the following: one blue fish and three red fish detected, four blue fish and two red fish undetected. Given that red fish is relevant data (signal) and blue fish is irrelevant data (noise), what is the precision of this system? a. 0.75 b. 0.60 c. 0.40 d. 0.25

a. 0.75
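
The following sketch tallies the fish in this card to show where 0.75 comes from, and why 0.60 is a tempting distractor: it is the recall, not the precision. The tuple layout `(is_red, was_detected)` is an assumption made for illustration.

```python
# Each fish is (is_red, was_detected); counts are taken from the card.
fish = (
    [(True, True)] * 3      # red fish detected   -> true positives
    + [(False, True)] * 1   # blue fish detected  -> false positives
    + [(False, False)] * 4  # blue fish undetected -> true negatives
    + [(True, False)] * 2   # red fish undetected -> false negatives
)

tp = sum(1 for is_red, detected in fish if is_red and detected)
fp = sum(1 for is_red, detected in fish if not is_red and detected)
fn = sum(1 for is_red, detected in fish if is_red and not detected)

fish_precision = tp / (tp + fp)  # of everything detected, how much was red
fish_recall = tp / (tp + fn)     # of all red fish, how many were detected

print(fish_precision, fish_recall)
```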

As a data journalist, which of the following tasks are most germane to your role? a. Communication skills b. Database and data storage c. Scripting language d. Cloud Infrastructure

a. Communication skills

What is meant by 'pure subset' when working with decision trees? Select all that apply. a. All attributes of a leaf had yes for answer. b. All attributes of a leaf had no for answer. c. Half of the answers were yes and the other half, no. d. The leaf cannot be divided any further.

a. All attributes of a leaf had yes for answer. b. All attributes of a leaf had no for answer. d. The leaf cannot be divided any further.

Which of the following activities highlights the merits of data normalization? Select one or more. a. Allows your model to update its weights on a relatively stable range of values b. It speeds up training time (common for neural nets to perform normalization for each layer). c. Allows you to see outlier more clearly d. This is an essential step in calculating coefficient percent change in target variable.

a. Allows your model to update its weights on a relatively stable range of values b. It speeds up training time (common for neural nets to perform normalization for each layer).
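
To illustrate the "stable range of values" point, here is a minimal sketch of min-max scaling, which is only one of several normalization schemes; the card does not say which one it has in mind. The feature values and the helper name `min_max_scale` are hypothetical.

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no spread; map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Two made-up features on very different scales.
ages = [20, 30, 40, 60]
salaries = [30000, 45000, 60000, 90000]

scaled_ages = min_max_scale(ages)
scaled_salaries = min_max_scale(salaries)

print(scaled_ages)
print(scaled_salaries)
```

After scaling, both features occupy the same 0 to 1 range, so neither dominates weight updates purely because of its units.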

Hadley Wickham is known for saying "Tidy datasets are all alike, but every messy dataset is messy in its own way." Which of the following statements supports this assertion? Select all that apply. a. Avoid redundancy, logical errors, or issues with updates. b. Complement programming languages' ability to perform vectorized operations. c. Ensure Boolean values are encoded appropriately. d. Ensure to deploy the correct machine learning models.

a. Avoid redundancy, logical errors, or issues with updates. b. Complement programming languages' ability to perform vectorized operations. c. Ensure Boolean values are encoded appropriately.

Consider the following scenario: you are interested in discovering why certain employees leave and others stay. You have access to a CSV file that contains columns (features) regarding metrics such as distance from home, age, and other categorical info such as male, female, level of education, marital status, and so forth. If you were to choose a model to study the problem of employee attrition, which of the following would be the best fit? a. Binary classification b. Multiclass classification c. Convolutional networks d. Linear regression

a. Binary classification

How is isotonic regression different from a linear regression? a. By fitting a free-form line to the observations; and the fitted free-form line must be non-decreasing everywhere. b. It supports both binary and multiclass labels, as well as both continuous and categorical features. c. It only supports binary labels, as well as both continuous and categorical features. d. It tries to fit the best line on a regression plot of data points

a. By fitting a free-form line to the observations; and the fitted free-form line must be non-decreasing everywhere.
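
A rough sketch of the "free-form, non-decreasing" fit described in this answer, using the pool-adjacent-violators idea: whenever a value dips below its predecessor, the two are pooled and replaced by their average. This is a simplified, unweighted version written for illustration; the function name `isotonic_fit` is hypothetical and not a library API.

```python
def isotonic_fit(ys):
    """Return the non-decreasing sequence closest to ys in squared error."""
    blocks = []  # each block is [sum_of_values, count]
    for y in ys:
        blocks.append([y, 1])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted


observations = [1, 3, 2, 4]  # made-up data with one dip
fitted = isotonic_fit(observations)
print(fitted)
```

Unlike a linear fit, the result is a step-like curve with no single slope; the only constraint is that it never decreases.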

The Profile view, under the Refinery tab of Watson Studio is designed to present you with which of the following pieces of information? a. Frequency and statistics b. Variance and Standard deviation c. Accuracy and recall d. Anomalies and outliers

a. Frequency and statistics

The Venn diagram that depicts the intersection of Science, Technology and Data has highlighted a cross section known as the 'danger zone.' Which of the following is an accurate depiction of this overlap in the Venn diagram? a. Has technology and data experience but no science (analytics) background. b. Is expert in technology and science but has no domain expertise on the data collected. c. Is expert in science and data, but not well versed with technology and programming. d. It is called danger zone because the individual is a "unicorn," one who is an expert in all concerns of data science.

a. Has technology and data experience but no science (analytics) background.

Sometimes we do not have access to the entire data set (population) and we have to infer our conclusions using sample data. Which of the following approaches addresses working with sample data to conclude about the population? a. Inferential statistics b. Descriptive statistics c. Measure of central tendency and measure of spread d. Variance and standard deviation measures

a. Inferential statistics

Business understanding is the first part of your analytics journey. Which of the following come to mind when you are planning your business approach? Select one or more: a. Perform demand planning and supply chain optimization for your offerings across different segments b. Reduce costs c. Decide which deep learning model will best suit your needs d. Gather more data

a. Perform demand planning and supply chain optimization for your offerings across different segments b. Reduce costs

If you had to describe a Naïve Bayes theorem, which of the following would apply? Select all that apply. a. Prior probabilities are based on previous experience. b. The classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. c. It is particularly suited when the dimensionality of the inputs is high. d. Models the linear relationship between a scalar-dependent variable y and one or more explanatory variables (or independent variables) x.

a. Prior probabilities are based on previous experience. b. The classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. c. It is particularly suited when the dimensionality of the inputs is high.

Select all that apply to the characteristics of data: a. Volume b. Variety c. Vertical d. Vibrant e. Verbose

a. Volume b. Variety

There are many ideas as to why some data scientists prefer Python over RStudio. Which of the following seems to be the prevailing argument that favors Python over R? a. Python is a more generalized language versus R which is more statistics focused. b. Python is much easier to learn because you can use Jupyter Notebooks with Python but not with R c. All data scientists use Python d. Python uses a graphical interface, but RStudio uses command line statements

a. Python is a more generalized language versus R which is more statistics focused.

You can flag missing observations using a machine learning (ML) model. Not all models address missing data equally. Which of the following statements is true regarding using ML models to flag missing data? Select one or more. a. Regression models handle summary statistics better. b. Tree based models handle outliers better. c. Neural networks, such as Convolutional Neural Networks, can detect missing data that may cause bias.

a. Regression models handle summary statistics better. b. Tree based models handle outliers better.

When would you use a histogram? a. To understand the distribution of a variable b. To help an analyst compare groups c. To understand trends over time d. When working with historical data

a. To understand the distribution of a variable

The biggest risk of overfitting data is that the model will work well on training data but perform poorly on new data. What should be done to mitigate that problem? Select all that apply. a. Use hold out data to evaluate the performance of the model on new data. b. Do not use hold out data to select model. c. You must collect much more data. d. Your model needs to be a neural network not linear regression.

a. Use hold out data to evaluate the performance of the model on new data. b. Do not use hold out data to select model.

Standard deviation (σ) and variance (σ²) are both derived from the mean of the data set. However, standard deviation is the square root of the variance. Why is that? a. Standard deviation measures the average degree to which each point differs from the mean. The greater the variance, the larger the overall data range. b. The calculation of standard deviation uses squares because it weighs outliers more heavily than data very near the mean. c. Because of squaring, the variance is no longer in the same unit of measurement as the original data. Taking the root of the variance means the standard deviation is restored to the original unit of measure d. All of the above are true.

c. Because of squaring, the variance is no longer in the same unit of measurement as the original data. Taking the root of the variance means the standard deviation is restored to the original unit of measure

Decision trees, support vector machines, and naive Bayes are different techniques to solve a _____________ problem. Fill in the blank. a. Regression b. Clustering c. Classification d. Reinforcement

c. Classification

The eight data science methodology approaches can be viewed as two larger groupings, the second grouping comprises: train, validate, deploy models and the feedback environment. How is this second grouping different in overall approach from the first grouping (business understanding, exploration, transformation and visualization of data)? a. The second grouping uses algorithms to uncover insights whereas the first grouping does not. b. The second grouping is an iterative process whereas the first grouping is done only once at the beginning of the process. c. The second grouping addresses predictive and prescriptive analytics, whereas the first grouping addresses descriptive analytics. d. The second grouping is actual data science, whereas the first grouping is merely analytics.

c. The second grouping addresses predictive and prescriptive analytics, whereas the first grouping addresses descriptive analytics.

The data science methodology includes the following stages: (fill in the missing stage) business understanding, data exploration and preparation, data representation and transformation, ________________, validate data models, ______________, and environment feedback. a. Visualize data models, select appropriate models b. Transform unstructured data into structured data, normalize data c. Train data models, deploy data models d. Decide if it is a classification problem or a regression problem, deploy the models

c. Train data models, deploy data models

Is there any risk or danger of relying solely on summary statistics? a. Since summary statistics is all about presenting the mean, median and variance of data, inherently there are no built-in dangers in summarizing results. b. All summary statistics must be accompanied by inferential statistics to best ascertain the validity of the hypothesis. c. Yes, there are risks. Summary statistics may depict similar statistical properties, such as mean, median and variance, yet ignore the overall distribution. d. Bayesian inference is often used to mitigate the risk of summary statistics.

c. Yes, there are risks. Summary statistics may depict similar statistical properties, such as mean, median and variance, yet ignore the overall distribution.
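
The risk named in this answer can be made concrete with two invented data sets that share the same mean, median, and population variance yet are shaped quite differently. The values below are illustrative and not drawn from the course.

```python
import statistics
from collections import Counter

# Made-up data: one set piles up at zero with two extremes, the other is spread out.
clumped = [-5, 0, 0, 0, 5]
spread = [-4, -3, 0, 3, 4]


def summary(values):
    """Mean, median, and population variance of a list of numbers."""
    return (
        statistics.mean(values),
        statistics.median(values),
        statistics.pvariance(values),
    )


clumped_summary = summary(clumped)
spread_summary = summary(spread)

print(clumped_summary, spread_summary)          # identical summaries
print(Counter(clumped), Counter(spread))        # different distributions
```

The summaries match exactly, so only a look at the distribution itself (for example with a histogram) reveals that the two sets differ.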

A network graph is a graph where nodes are connected and positioned depending on their mutual relationship. Which of the following are accurate characteristics of network graphs? a. Used to identify clusters in large and complex relationship data sets. b. Used to show relationships. c. Used when you have multi-dimensional data. d. All of the above

d. All of the above

Should you choose a multiclass classification tree in Watson Studio, which of the following estimators (algorithms) are available to you? a. Decision tree classifier b. Random forest classifier c. Naive Bayes d. All of the above

d. All of the above

The Communities tab of Watson Studio provides which of the following artifacts? a. Tutorials b. Data Sets c. Articles d. All of the above

d. All of the above

When transforming messy data to tidy data, which of the following is a good practice? a. Multiple variables are stored in one column. b. Variables are stored in both rows and columns. c. Multiple types of observational units are stored in the same table. d. All of the above

d. All of the above

When would you use a bar chart? a. When I want to explore in time b. When I have categorized data c. When I want to show correlations d. All of the above

d. All of the above

Which of the following is a true statement? a. Data scientists transform data into knowledge to solve business problems. b. Data journalists capture domain knowledge for successful business alignment. c. Data engineers architect how data is organized and ensure operability. d. All of the above

d. All of the above

When training models, you would typically place your data into three buckets: train, test and hold out. What is the purpose of having hold out data? a. A holdout sample is a part of the data you leave out of the model building so it can be used to evaluate the model afterward b. A holdout sample helps you compare models and ensures that you can generalize results to data that the model has not yet seen. c. Working with a holdout sample helps you pick the best-performing model d. All of the above are true.

d. All of the above are true.
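
A minimal sketch of the three buckets mentioned in this card, assuming a 60/20/20 split and a fixed random seed; both choices, as well as the helper name `three_way_split`, are illustrative assumptions rather than anything the card prescribes.

```python
import random


def three_way_split(rows, train_frac=0.6, test_frac=0.2, seed=0):
    """Shuffle rows and cut them into train, test, and holdout lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = int(len(shuffled) * train_frac)
    n_test = int(len(shuffled) * test_frac)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    holdout = shuffled[n_train + n_test:]  # never touched during model building
    return train, test, holdout


rows = list(range(10))  # stand-in for ten records
train, test, holdout = three_way_split(rows)

print(len(train), len(test), len(holdout))
```

Because the holdout rows play no part in fitting, a score computed on them gives a fairer picture of how the model behaves on unseen data.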

Data scientists and data engineers often access RDBMS databases to retrieve data. Which of the following specific tasks is an example of such tasks? a. Data scientists access the data via SQL or language-specific libraries. b. Data engineers perform a task called ETL (Extract, Transform, Load) where they take data from one source and move it to another. c. Use of NoSQL, since it is best for high latency and JSON based storage d. All of the above.

d. All of the above.

