IBM Data Science Quiz Questions
As a data journalist, which of the following tasks are most germane to your role?
Communication skills
Sometimes we do not have access to the entire data set (population) and we have to infer our conclusions using sample data. Which of the following approaches addresses working with sample data to conclude about the population?
Inferential statistics
Linear regression tries to fit a line while ___________ the distance to each point. Fill in the blank.
Minimizing
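As an illustration of "minimizing the distance", here is a minimal sketch of ordinary least squares, which minimizes the sum of squared vertical distances between the line and the points. The function name fit_line and the sample data are assumptions made for this example, not part of the quiz.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (closed-form OLS solution).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Illustrative data lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```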
With ____________ data, you have categorical variables that can be described by groups rather than numbers.
Structured
Which of the following algorithms is used for supervised learning?
Support Vector Machines
The Communities tab of Watson Studio provides which of the following artifacts?
Tutorials; Data Sets; Articles; All of the above are correct.
When would you use a bar chart?
When I want to explore in time; When I have categorized data; When I want to show correlations; All of the above are correct.
The Venn diagram that depicts the intersection of Science, Technology and Data has highlighted a cross section known as the 'danger zone.' Which of the following is an accurate depiction of this overlap in the Venn diagram?
Has technology and data experience but no science (analytics) background.
The Watson Jeopardy! game used _____________ machine learning. Fill in the blank.
Supervised
Data representations such as univariate and bivariate analysis are used to immediately understand trends, distributions, and differences amongst groups. Which of the following is an example of univariate representation?
Histograms
When would you use a histogram?
To understand the distribution of a variable
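To make "distribution of a variable" concrete, the sketch below bins a single variable into fixed-width intervals and prints a text histogram. The helper name histogram, the bin width, and the ages list are illustrative assumptions.

```python
def histogram(values, bin_width):
    """Count how many values fall into each fixed-width bin."""
    counts = {}
    for v in values:
        start = (v // bin_width) * bin_width  # lower edge of the bin
        counts[start] = counts.get(start, 0) + 1
    return dict(sorted(counts.items()))


# Hypothetical sample of one variable (ages).
ages = [21, 23, 25, 31, 34, 35, 38, 42, 47, 58]
bins = histogram(ages, 10)
for start, count in bins.items():
    print(f"{start}-{start + 9}: {'#' * count}")
```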
A network graph displays nodes that are connected and positioned depending on their mutual relationship. What type of data is best suited for network graphs?
Multi-dimensional data
If you are building a deep learning ecosystem, which of the following two concerns should be your starting points?
Ensure that I have access to a robust platform as a service plus access to deep learning frameworks.
Data visualization comes in two broad categories. Which of the below depict this distinction:
Exploratory versus explanatory visualization
Which of the following are examples of unstructured data? Select all that apply.
Facebook images; Twitter feeds
In October 2015, AlphaGo, an AI-powered system, beat Mr. Fan Hui, the reigning three-time European Champion of the complex board game Go, by 5 games to 0. Which machine learning method did it use?
Reinforcement
How is isotonic regression different from a linear regression?
By fitting a free-form line to the observations; and the fitted free-form line must be non-decreasing everywhere.
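The non-decreasing, free-form fit can be sketched with the pool-adjacent-violators approach, a common way to compute a least-squares isotonic fit. This is a minimal illustration under that assumption; the name isotonic_fit and the sample values are hypothetical.

```python
def isotonic_fit(ys):
    """Least-squares non-decreasing fit via pool adjacent violators."""
    blocks = []  # each block is [sum_of_values, count]
    for y in ys:
        blocks.append([y, 1])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)  # every point in a block gets the block mean
    return fitted


# Observations that dip twice; the fit flattens each dip into a step.
fitted = isotonic_fit([1, 3, 2, 4, 3, 5])
print(fitted)
```

Unlike a linear fit, the result is a step-like curve with no fixed slope; the only constraint is that it never decreases.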
The Profile view, under the Refinery tab of Watson Studio is designed to present you with which of the following pieces of information?
Frequency and statistics
What makes a deep learning network "deep"?
It is a multi-perceptron with many 'hidden' layers
In Watson Studio, when you upload your csv file, you are presented with two data frame constructs that you can apply to your raw data. Which of the following depicts those data frames?
Pandas and SparkSession
Data scientists and data engineers often access RDBMS databases to retrieve data. Which of the following is an example of such a task?
Data scientists access the data via SQL or language-specific libraries. Data engineers perform a task called ETL (Extract, Transform, Load) where they take data from one source and move it to another. Use of NoSQL, since it is best for high latency and JSON-based storage. All of the above are correct.
The data science methodology includes the following stages: (fill in the missing stage) business understanding, data exploration and preparation, data representation and transformation, ________________, validate data models, ______________, and environment feedback.
Train data models, deploy data models
Data Modeling Quiz
...
Data Visualization and Transformation Quiz
...
Machine Learning Algorithms Quiz
...
Represent and Transform Data Quiz
...
Consider the following diagram: Given that red fish is relevant data (signal) and blue fish is irrelevant data (noise), what is the precision of this system?
0.75
When training models, you would typically place your data into three buckets: train, test and hold out. What is the purpose of having hold out data?
A holdout sample is a part of the data you leave out of the model building so it can be used to evaluate the model afterward. A holdout sample helps you compare models and ensures that you can generalize results to data that the model has not yet seen. Working with a holdout sample helps you pick the best-performing model. All of the above are true.
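A minimal sketch of the three buckets, assuming a simple random split. The function name three_way_split, the 60/20/20 proportions, and the fixed seed are illustrative choices, not values taken from the quiz.

```python
import random


def three_way_split(rows, test_frac=0.2, holdout_frac=0.2, seed=0):
    """Shuffle rows and split them into train, test, and holdout lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # seeded for a reproducible split
    n_holdout = int(len(rows) * holdout_frac)
    n_test = int(len(rows) * test_frac)
    holdout = rows[:n_holdout]
    test = rows[n_holdout:n_holdout + n_test]
    train = rows[n_holdout + n_test:]
    return train, test, holdout


train, test, holdout = three_way_split(range(100))
print(len(train), len(test), len(holdout))
```

The holdout rows are set aside before any model is built, so they can stand in for data the model has not yet seen.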
The biggest risk of overfitting data is that the model will work well on training data but perform poorly on new data. What should be done to mitigate that problem? Select all that apply.
Use hold out data to evaluate the performance of the model on new data. Do not use hold out data to select a model.
Business understanding is the first part of your analytics journey. Which of the following come to mind when you are planning your business approach? Select one or more.
Perform demand planning and supply chain optimization for your offerings across different segments; Reduce costs
There are many ideas as to why some data scientists prefer Python over RStudio. Which of the following seems to be the prevailing argument that favors Python over R?
Python is a more generalized language versus R which is more statistics focused.
You can flag missing observations using a machine learning (ML) model. Not all models address missing data equally. Which of the following statements is true regarding using ML models to flag missing data?
Regression models handle summary statistics better. Tree based models handle outliers better.
One of the fundamentals of visualization of data lies in the human psychology of how it is perceived such as: similarity, proximity and enclosure. Which of the following best describes the notion of proximity:
The human eye perceives elements to be related based on how close they are to one another.
A network graph is a graph where nodes are connected and positioned depending on their mutual relationship. Which of the following are accurate characteristics of network graphs?
Used to identify clusters in large and complex relationship data sets. Used to show relationships. Used when you have multi-dimensional data. All of the above are correct.
If you are looking for a tool that is easy to learn and very flexible with what you want to render, which of the following is the best fit for your needs?
Tableau
What is meant by 'pure subset' when working with decision trees? Select all that apply.
All attributes of a leaf had yes for answer. All attributes of a leaf had no for answer. The leaf cannot be divided any further.
Which of the following activities highlights the merits of data normalization?
Allows your model to update its weights on a relatively stable range of values. It speeds up training time (common for neural nets to perform normalization for each layer).
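One common form of normalization is min-max scaling, which maps a feature onto the range 0 to 1 so that values stay in a stable range. The sketch below assumes that variant; the name min_max_scale and the salary figures are hypothetical.

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no spread; map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical salary feature on a large raw scale.
scaled = min_max_scale([50_000, 75_000, 100_000])
print(scaled)
```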
Hadley Wickham is known for saying "Tidy datasets are all alike, but every messy dataset is messy in its own way." Which of the following statements supports this assertion? Select all that apply.
Avoid redundancy, logical errors, or issues with updates. Complement programming languages' ability to perform vectorized operations. Ensure Boolean values are encoded appropriately.
The Brunel project defines a highly succinct and novel language that defines interactive data visualizations. Which of the following statements is true?
Brunel visualization is based on tabular data. Brunel Visualization Language is a high-level language developed by IBM and open-sourced in 2015. Brunel describes visualizations in terms of composable actions and drives a visualization engine (D3) that performs the actual rendering and interactivity. All of the above are correct.
Which of the following is one of the most fundamental characteristics of a data scientist?
Having a sense of curiosity about all things
Which of the following best describes a Decision Tree Classifier?
Maps observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves).
Logistic regression looks like an S curve. Which of the following (activation functions) describes the S curve in a logistic regression?
Sigmoid operation
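The S curve is the sigmoid (logistic) function, 1 / (1 + e^(-z)), which squashes any real input into the interval between 0 and 1. A minimal sketch follows; it does not guard against overflow for very large negative inputs.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# The curve passes through 0.5 at z = 0 and flattens toward 0 and 1.
for z in (-5, 0, 5):
    print(z, sigmoid(z))
```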
Consider the following diagram: Given that red fish is relevant data (signal) and blue fish is irrelevant data (noise), what is the precision of this system?
100%
Variance measures how far a set of (random) numbers are spread out from their average value. In a certain data set you have calculated that the variance of a certain data point is 16. What would be the standard deviation of that measure?
4
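The arithmetic can be checked directly: the standard deviation is the square root of the variance, so a variance of 16 gives 4. The two-point data set below is an assumption chosen so that its population variance is exactly 16.

```python
import math
import statistics

# Hypothetical data: mean is 6, each point deviates by 4, so variance is 16.
data = [2, 10]
variance = statistics.pvariance(data)  # population variance
std_dev = math.sqrt(variance)          # back in the original unit of measure
print(variance, std_dev)
```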
A particular machine learning model has detected 80 true positive signals plus 20 false positive signals (included them as relevant data, but they are not). What is the precision of the system?
80%
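Precision is the share of items flagged as relevant that really are relevant: true positives divided by true positives plus false positives. A short check of the figures in the question, with the function name precision chosen for illustration:

```python
def precision(true_positives, false_positives):
    """Fraction of positive predictions that are actually correct."""
    return true_positives / (true_positives + false_positives)


p = precision(true_positives=80, false_positives=20)
print(f"{p:.0%}")
```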
Descriptive tables share which of the following characteristics?
Measures of Central Tendency; Measures of Dispersion; Measures of Distribution; All of the above answers are correct.
Standard deviation (σ) and variance (σ²) are both derived from the mean of the data set. However, the standard deviation is the square root of the variance. Why is that?
Because of squaring, the variance is no longer in the same unit of measurement as the original data. Taking the root of the variance means the standard deviation is restored to the original unit of measure
When transforming messy data to tidy data, which of the following is a good practice?
Multiple variables are stored in one column. Variables are stored in both rows and columns. Multiple types of observational units are stored in the same table. All of the above are correct.
Consider the following scenario: you are interested to discover why certain employees leave and others stay. You have access to a CSV file that contains columns (features) regarding metrics such as distance from home, age and other categorical info such as male, female, level of education marital status and so forth. If you were to choose a model to study the problem of employee attrition which of the following would be the best fit?
Binary classification
Decision trees, support vector machines, and naive Bayes are different techniques to solve a _____________ problem. Fill in the blank.
Classification
Which of the following is a true statement?
Data scientists transform data into knowledge to solve business problems. Data journalists capture domain knowledge for successful business alignment. Data engineers architect how data is organized and ensure operability. All of the above are true.
Should you choose a multiclass classification tree in Watson Studio, which of the following estimators (algorithms) are available to you?
Decision tree classifier; Random forest classifier; Naive Bayes; All of the above are correct.
Which of the following is an example of open source visualization and plotting tool or tools?
Matplotlib; Pixiedust; OpenCV; All of the above are correct.
A spam collection engine has quarantined messages that were not spam, were not unsolicited and that they were important for the user. How would you characterize those important yet automatically removed messages?
False positive
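To place the false positive alongside the other three outcomes, the sketch below tallies a small, made-up set of spam-filter decisions. The labels "spam" and "ham", the function name confusion_counts, and the sample lists are all illustrative assumptions.

```python
def confusion_counts(actual, predicted, positive="spam"):
    """Tally true/false positives and negatives for a binary classifier."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            # Flagged as spam: correct (tp) or a wrongly quarantined message (fp).
            counts["tp" if a == positive else "fp"] += 1
        else:
            # Let through: missed spam (fn) or correctly delivered mail (tn).
            counts["fn" if a == positive else "tn"] += 1
    return counts


actual    = ["spam", "ham",  "ham", "spam", "ham"]
predicted = ["spam", "spam", "ham", "ham",  "ham"]
counts = confusion_counts(actual, predicted)
print(counts)
```

The second message is the case described in the question: important mail that the filter removed anyway.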
When working with Data Refinery in Watson Studio, you are presented with three tabs: Data, Profile and Visualization. What is the purpose of the Profile view?
In the Profile view, the user can validate the data to see if any features may need further Data Refinery.
Which of the following best describes what summary statistics calculates?
Mean; Median; Mode; All of the above are correct.
After the data exploration stage, the data scientist begins the work of data representation and transformation where they gather descriptive statistics to further analyze the data. Which of the following activities depicts working with descriptive statistics? Select all that apply.
Measure of central tendency; Measure of dispersion; Measure of distribution
Supervised learning has many advantages, which of the following may be shortcomings of supervised learning?
Labeling the data is arduous and expensive.
Let's say you want to predict how much salary one would earn based on level of education. Your y axis is salary and your x axis is educational buckets (high school, Bachelor's, Master's and so forth). Which of the following models is best suited to help you predict, given a certain salary, what the education level of the individual might be?
Linear regression
When using Jupyter Notebooks, inevitably, you will need to import libraries such as NumPy and SciPy. Which of the following integration layers best describes this kind of an activity?
Scientific computing and statistics packages
The eight data science methodology approaches can be viewed as two larger groupings, the second grouping comprises: train, validate, deploy models and the feedback environment. How is this second grouping different in overall approach from the first grouping (business understanding, exploration, transformation and visualization of data)?
The second grouping addresses predictive and prescriptive analytics, whereas the first grouping addresses descriptive analytics.
If you had to choose one overarching difference between these methodologies in Question 19, which of the following would best depict that difference in approach?
Unlike KDD and SEMMA, CRISP-DM considers business understanding.
Is there any risk or danger of relying solely on summary statistics?
Yes, there are risks. Summary statistics may depict similar statistical properties, such as mean, median and variance, yet ignore the overall distribution.
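A small illustration of that risk: the two made-up samples below share the same mean, median, and variance, yet one is tightly peaked with two outliers and the other is split into two clusters. The data values are assumptions constructed for this example.

```python
import statistics

# Peaked sample: most values at 5, with one low and one high outlier.
peaked = [1, 5, 5, 5, 5, 5, 5, 9]
# Bimodal sample: two clusters and nothing in the middle.
bimodal = [3, 3, 3, 3, 7, 7, 7, 7]

for name, data in (("peaked", peaked), ("bimodal", bimodal)):
    print(
        name,
        statistics.mean(data),
        statistics.median(data),
        statistics.pvariance(data),
    )
```

Only a look at the distribution itself, for example with a histogram, reveals how different the two samples are.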
If you had to describe a Naïve Bayes classifier, which of the following would apply? Select all that apply.
Prior probabilities are based on previous experience. The classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. It is particularly suited when the dimensionality of the inputs is high.