Data II and III
Select all of the following that are true about Data Warehouses
1. Data will not be modified by the end user. 2. Data may be integrated and cleaned from many large sources.
Which of the following are methods of dimension reduction?
1. Feature selection 2. Feature extraction 3. Forward selection and backward selection 4. Attribute relevance analysis (e.g. information gain)
Which of the following are true about Forward Selection?
1. Forward selection is a feature selection method, keeping a subset of the original variables to make a reduced-complexity model. 2. Forward selection is a greedy algorithm that runs a classification algorithm over and over as part of evaluating subsets of features. 3. Using forward selection can result in a model that generalizes better, i.e. is less subject to overfitting.
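The greedy loop described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the scoring function (leave-one-out accuracy of a 1-nearest-neighbour classifier), the names `loo_accuracy` and `forward_select`, and the toy data are all assumptions made for the example.

```python
def loo_accuracy(rows, labels, subset):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier
    that only looks at the feature indices in `subset`."""
    correct = 0
    for i, row in enumerate(rows):
        best_dist, best_label = None, None
        for j, other in enumerate(rows):
            if i == j:
                continue  # hold out the point being classified
            dist = sum((row[k] - other[k]) ** 2 for k in subset)
            if best_dist is None or dist < best_dist:
                best_dist, best_label = dist, labels[j]
        correct += best_label == labels[i]
    return correct / len(rows)


def forward_select(rows, labels, n_features):
    """Greedily add the feature that most improves the score; stop when none does."""
    selected, best_score = [], 0.0
    remaining = list(range(n_features))
    while remaining:
        # Re-run the classifier once per candidate feature.
        scored = [(loo_accuracy(rows, labels, selected + [f]), f) for f in remaining]
        score, feature = max(scored, key=lambda t: t[0])
        if score <= best_score:
            break  # no candidate helps, so keep the current subset
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score


# Toy data (hypothetical): feature 0 separates the classes, feature 1 is noise.
rows = [(0.0, 5.0), (0.1, 1.0), (0.2, 9.0), (1.0, 4.8), (1.1, 1.2), (1.2, 9.1)]
labels = [0, 0, 0, 1, 1, 1]
selected, score = forward_select(rows, labels, n_features=2)
print(selected, score)
```

On this toy data the noisy feature is never added, which is the sense in which the reduced model can generalize better.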
Which of the following statements are true?
1. If correlation is equal to -1 then two features are perfectly negatively correlated. 2. Correlation between two features ranges between [-1, 1]. 3. If correlation is equal to 1 then two features are perfectly positively correlated. 4. If correlation is equal to 0 then two features have no linear correlation.
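As a quick check of the endpoints, here is a small sketch of the Pearson correlation coefficient. The function name `pearson` and the sample lists are made up for illustration, and the function assumes neither input is constant (otherwise the denominator is zero).

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


xs = [1, 2, 3, 4, 5]
up = [2 * x + 1 for x in xs]   # increasing linear function of xs
down = [-3 * x for x in xs]    # decreasing linear function of xs

print(pearson(xs, up))    # close to 1
print(pearson(xs, down))  # close to -1
```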
Which of the following is true about data normalization?
1. Normalization scales the range of the data into some (generally smaller) specified range. 2. Z-score normalization is useful for finding outliers because each point is represented by how far from the mean it is. 3. When subtracting an offset and dividing by a range we change the mean and standard deviation of data without actually changing the shape of its distribution (as seen in a histogram).
Which of the following are issues in data integration?
1. Two different databases may have different column names for the same actual information (e.g. customerID vs cust-id). 2. An attribute named 'weight' may be in different units in different databases. 3. There may be discrepancies between entries in two different databases for the same actual real-life entity (e.g. for an employee).
Which of the following are issues in data integration? (which would actually cause conflicts)
1. Two different databases may have different column names for the same actual information (e.g. customerID vs cust-id). 2. An attribute named 'weight' may be in different units in different databases. 3. There may be discrepancies between entries in two different databases for the same actual real-life entity (e.g. for an employee).
Which of the following are ways to deal with missing data values?
1. Use a special value like "unknown" to capture that there is meaning to the fact that value is missing. 2. Replace with the average value of the attribute among data points with the same class. 3. Predict missing value with a model based on the data you do have (i.e. classification or regression).
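The second option (class-conditional mean) might look like the sketch below. The function name `impute_class_mean`, the use of `None` as the missing marker, and the sample values are assumptions for illustration; the sketch also assumes every class has at least one observed value.

```python
from collections import defaultdict


def impute_class_mean(values, classes):
    """Replace None with the mean of the observed values in the same class."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, cls in zip(values, classes):
        if value is not None:
            sums[cls] += value
            counts[cls] += 1
    return [
        sums[cls] / counts[cls] if value is None else value
        for value, cls in zip(values, classes)
    ]


values = [10.0, None, 14.0, 3.0, None, 5.0]
classes = ["a", "a", "a", "b", "b", "b"]
filled = impute_class_mean(values, classes)
print(filled)  # [10.0, 12.0, 14.0, 3.0, 4.0, 5.0]
```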
Binning numerical data into chunks (bins) can be useful for
1. dealing with noisy data by smoothing out lots of variation into chunks with reasonable ranges 2. drawing a histogram
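One common way to smooth with bins is to sort the values, split them into equal-size bins, and replace each value by its bin mean. The sketch below assumes that approach; the function name `smooth_by_bin_means` and the sample numbers are illustrative only.

```python
def smooth_by_bin_means(values, bin_size):
    """Equal-frequency binning: each value is replaced by the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        chunk = ordered[start:start + bin_size]
        mean = sum(chunk) / len(chunk)
        smoothed.extend([mean] * len(chunk))
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed = smooth_by_bin_means(prices, bin_size=3)
print(smoothed)  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```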
Scatter Plot
Can handle multiple Y values per X value
The two major types of data reduction are:
Dimensionality reduction and numerosity reduction (the number of variables and the number of points)
If all available data cleaning algorithms are run in sequence, there is no need to include human judgement in the process.
False
If the covariance between two variables x and y is equal to 0 then we can say for certain that x and y are independent
False
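A small counterexample shows why: below, y is completely determined by x, yet the covariance is zero because the relationship is not linear. The function name `covariance` and the data are assumptions chosen for illustration.

```python
def covariance(xs, ys):
    """Population covariance of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]  # y depends entirely on x

print(covariance(xs, ys))  # 0.0, even though y is a function of x
```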
Scatter plot is not an effective graphical method to look for correlation between two numerical variables
False
Bar Chart
Good for categorical X values and cases where the Y value is ratio scaled.
Line Graph
Implies some importance of the connection between the data points.
We discussed one method of Feature Extraction, Principal Component Analysis (PCA). Which of the following describes PCA?
PCA creates new features from the original attributes which can efficiently account for most of the variance of the data with fewer dimensions.
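To make this concrete, here is a sketch restricted to exactly two features, where the direction of maximum variance has a closed form. General PCA would use an eigendecomposition or SVD of the covariance matrix instead. The function name `first_principal_component` and the sample points are assumptions for illustration.

```python
import math


def first_principal_component(points):
    """Return the max-variance direction for 2D points, the projected
    scores, and the fraction of total variance that direction explains."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the 2x2 sample covariance matrix.
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)

    # Angle of the leading eigenvector of [[sxx, sxy], [sxy, syy]].
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    direction = (math.cos(theta), math.sin(theta))

    # The new single feature: each point's coordinate along that direction.
    scores = [x * direction[0] + y * direction[1] for x, y in centered]
    explained = (sum(s * s for s in scores) / (n - 1)) / (sxx + syy)
    return direction, scores, explained


points = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8), (5, 5.0)]
direction, scores, explained = first_principal_component(points)
print(direction, explained)
```

Because these points lie close to the line y = x, one new feature captures nearly all of the variance of the original two.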
The main criterion optimized in methods for projecting high-dimensional data to 2D (like MDS) is that
pairwise distances between points in the new 2D space are as close as possible to the corresponding distances in the high-dimensional space.
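One common way to write this criterion is a "raw stress" value: the sum of squared differences between each pair's original distance and its distance in the 2D layout. The sketch below only evaluates that quantity for a given layout; it does not perform the optimization. The function name `raw_stress` and the sample points are assumptions for illustration.

```python
import math
from itertools import combinations


def raw_stress(high, low):
    """Sum of squared differences between pairwise distances in the
    original space (`high`) and in the low-dimensional layout (`low`)."""
    return sum(
        (math.dist(high[i], high[j]) - math.dist(low[i], low[j])) ** 2
        for i, j in combinations(range(len(high)), 2)
    )


# Four 3D points that happen to lie in the plane z = 0.
high = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0)]
good = [(0, 0), (1, 0), (0, 1), (1, 1)]  # preserves every distance
bad = [(0, 0), (1, 0), (0, 0), (1, 0)]   # collapses the y direction

print(raw_stress(high, good))  # essentially zero
print(raw_stress(high, bad))   # clearly positive
```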
We've discussed several uses of clustering. Which of the following are included?
Smoothing noise, numerosity reduction, and finding outliers.
Data discretization is part of data reduction
True
Looking at clusters when smoothing data helps you see outliers with respect to multiple variables at a time, as opposed to outliers in just one dimension.
True
Which of these are true of using clustering for smoothing?
We replace data points by an average or representatives of points in their cluster.
In data science, visualization is used
at the beginning of the process to understand data, in the middle to debug results and at the end to communicate.
What are other names for features?
attributes, predictors and explanatory variables
Text data can be stored in a matrix with a "bag-of-words" model. This means:
each row represents a unit of text (e.g. document) and each column represents a word.
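A minimal sketch of building such a matrix is shown below. It assumes whitespace tokenization and lowercasing, which real pipelines usually refine; the function name `bag_of_words` and the sample documents are made up for illustration.

```python
def bag_of_words(documents):
    """Return (vocabulary, matrix): one row per document, one column per
    word, each cell holding how often that word occurs in that document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for doc in tokenized for word in doc})
    matrix = [[doc.count(word) for word in vocabulary] for doc in tokenized]
    return vocabulary, matrix


documents = ["the cat sat", "the dog sat on the cat"]
vocabulary, matrix = bag_of_words(documents)
print(vocabulary)  # ['cat', 'dog', 'on', 'sat', 'the']
print(matrix)      # [[1, 0, 0, 1, 1], [1, 1, 1, 1, 2]]
```

Word order is discarded: only the counts survive, which is where the "bag" in the name comes from.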
decimal scaling
every value is divided by the same power of 10, so the result is guaranteed to be between -1 and 1 and original zeros stay zero
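As a sketch, decimal scaling divides every value by the smallest power of 10 that brings the largest absolute value below 1. The function name `decimal_scale` and the sample values are assumptions for illustration.

```python
def decimal_scale(values):
    """Divide by 10**j, with j the smallest integer making max(|v|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


values = [-986, 0, 217, 917]
scaled, j = decimal_scale(values)
print(scaled, j)  # [-0.986, 0.0, 0.217, 0.917] 3
```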
Line graphs show a different type of pattern from bar graphs because
the lines imply there is a connection between the plotted points, often helping showcase a trend.
z-score normalization (standardization)
the new values tell how many standard deviations the sample is from the mean of the original data.
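A short sketch of the computation follows. It uses the population standard deviation, which is an assumption (the sample standard deviation is an equally common choice); the function name `z_scores` and the data are illustrative.

```python
import statistics


def z_scores(values):
    """Express each value as a number of standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]


values = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, population std 2
z = z_scores(values)
print(z)  # the last value, 9, sits 2 standard deviations above the mean
```

Points with a large absolute z-score are the natural outlier candidates.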
min-max normalization
the values are linearly scaled from one interval into another; the middle value means nothing special.
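The linear rescaling can be sketched as below. The function name `min_max`, its default target interval of [0, 1], and the sample values are assumptions for illustration; the sketch also assumes the input is not constant, since that would make the original range zero.

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly map [min(values), max(values)] onto [new_min, new_max]."""
    low, high = min(values), max(values)
    return [
        (v - low) / (high - low) * (new_max - new_min) + new_min
        for v in values
    ]


incomes = [12000, 73600, 98000]
scaled = min_max(incomes)
print(scaled)  # first value maps to 0.0, last to 1.0, the middle to about 0.716
```

Unlike a z-score of 0, which marks the mean, the midpoint of the new interval is just the midpoint of the observed range.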
