Data Science

Describe the 3 steps in Agglomerative Clustering

1. Assign each of the N data records to its own cluster, forming N clusters; 2. Merge the two clusters with the minimal Euclidean distance between them into a single cluster; 3. Repeat step 2 until only one hierarchical cluster remains.
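
The three steps above can be sketched in a few lines of Python. This is only an illustration: the function name `agglomerative`, the sample records, and the use of single linkage (distance between the closest members of two clusters) are assumptions, since the card does not name a linkage rule.

```python
import math

def agglomerative(points):
    """Merge the two closest clusters until one remains.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples.
    """
    # Step 1: every record starts in its own cluster.
    clusters = [[p] for p in points]
    merges = []

    def distance(a, b):
        # Single linkage (an assumption): closest pair of members.
        return min(math.dist(p, q) for p in a for q in b)

    # Step 3: repeat until a single cluster is left.
    while len(clusters) > 1:
        # Step 2: find the closest pair of clusters and merge them.
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: distance(clusters[ij[0]], clusters[ij[1]]))
        merges.append((clusters[i], clusters[j], distance(clusters[i], clusters[j])))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

records = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5)]
for a, b, d in agglomerative(records):
    print(a, "+", b, "at distance", round(d, 3))
```

The merge history is what a dendrogram draws: each entry is one join, ordered from the closest pair to the final merge.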

Describe the 5 steps of K-Means Clustering

1. Decide on the number of clusters (k); 2. Select random data records to represent the centroid of each cluster; 3. Calculate the distance between each record and each centroid, then assign each record to its closest cluster based on Euclidean distance; 4. Recalculate a new centroid for each cluster using its newly assigned records; 5. Repeat steps 3 and 4 until there are no further changes in the centroids, at which point the final clusters are formed.
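
A minimal sketch of the five steps, using only the standard library. The function name `k_means`, the `seed` and `max_iter` parameters, and the sample records are assumptions made for illustration.

```python
import math
import random

def k_means(records, k, seed=0, max_iter=100):
    """Return (centroids, clusters) for a list of numeric tuples."""
    rng = random.Random(seed)
    # Steps 1-2: k is given; pick k random records as starting centroids.
    centroids = rng.sample(records, k)
    clusters = []
    for _ in range(max_iter):
        # Step 3: assign each record to its closest centroid (Euclidean).
        clusters = [[] for _ in range(k)]
        for r in records:
            nearest = min(range(k), key=lambda c: math.dist(r, centroids[c]))
            clusters[nearest].append(r)
        # Step 4: recompute each centroid as the mean of its members
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(col) / len(members) for col in zip(*members))
            if members else centroids[c]
            for c, members in enumerate(clusters)
        ]
        # Step 5: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

records = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (9.0, 10.0)]
centroids, clusters = k_means(records, k=2)
print(centroids)
print(clusters)
```

Because the starting centroids are random, different seeds can give different results on less cleanly separated data.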

1. The classifier correctly labels a "Yes" data record as "Yes": _____. 2. The classifier mistakenly labels a "No" data record as "Yes": _____. 3. The classifier mistakenly labels a "Yes" data record as "No": _____. 4. The classifier correctly labels a "No" data record as "No": _____.

1. True positive. 2. False positive. 3. False negative. 4. True negative.
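
The four outcomes can be counted directly from paired labels. The sketch below assumes "Yes" is the positive class; the function name `confusion_counts` and the sample labels are illustrative.

```python
def confusion_counts(actual, predicted, positive="Yes"):
    """Count true/false positives and negatives for a binary classifier."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            # Predicted "Yes": correct if the record really is "Yes".
            counts["TP" if a == positive else "FP"] += 1
        else:
            # Predicted "No": a miss if the record really is "Yes".
            counts["FN" if a == positive else "TN"] += 1
    return counts

actual = ["Yes", "No", "Yes", "No", "Yes", "No"]
predicted = ["Yes", "Yes", "No", "No", "Yes", "No"]
print(confusion_counts(actual, predicted))
```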

By increasing the area under the ROC curve we get...

A better performance by the developed classification model

State four metrics to evaluate a regression model's performance.

Absolute error, relative error, squared error, mean squared error, root mean squared error.
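
As a rough illustration, these metrics can be computed from paired actual and predicted values. The function name `regression_errors` and the sample numbers are assumptions; relative error is taken here as the absolute error divided by the actual value, which requires nonzero actual values.

```python
import math

def regression_errors(actual, predicted):
    """Per-record and aggregate error metrics for a regression model."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    # Assumes every actual value is nonzero.
    relative = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    mse = sum(squared) / len(squared)
    return {
        "absolute": absolute,
        "relative": relative,
        "squared": squared,
        "mse": mse,
        "rmse": math.sqrt(mse),
    }

errors = regression_errors([2.0, 4.0, 5.0], [1.0, 4.0, 7.0])
print(errors["mse"], errors["rmse"])
```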

The objective of applying reinforcement learning is to allow the _____ to notice which _____ yields the maximum _____.

Agent, action, reward

What are the types of electronic data processing?

Batch, online, real-time, distributed, time-sharing. (Bad Orange Rabbits Die Tomorrow)

The data transmission rate is expressed in terms of _____, and it normally takes _____ bits to transmit an individual digit.

Bits per second (bps), 8 - 10.

Name three forms of processed data.

CSV, SQL, XLS

The classification approaches are implemented if the outputs are _____, and the regression approaches are implemented if the outputs are _____.

Categorized into classes, continuous (real-valued) numbers.

The support vector machine is a linear _____ approach where the rule is to develop a _____ function of the _____ dataset variables.

Classification, linear, input.

What are the stages of data processing?

Collection, Preparation, Input, Analysis, Interpretation, Storage. (Cold Potatoes If Aaron Insists Saving)

Principal component analysis is applied to transform _____ variables into _____ variables called principal components. It sorts the produced variables according to their _____ along the data records.

Correlated, uncorrelated, level of changeability.
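
For two variables, the idea can be sketched without any libraries, because the eigenvalues of a symmetric 2x2 covariance matrix have a closed form. The function name `pca_2d` and the sample data are assumptions; here "level of changeability" is read as variance, which is what the eigenvalues measure.

```python
import math

def pca_2d(xs, ys):
    """Return [(variance, direction), ...] sorted by variance, largest first."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Sample covariance matrix [[a, b], [b, c]] of the centered variables.
    a = sum((x - mx) ** 2 for x in xs) / (n - 1)
    c = sum((y - my) ** 2 for y in ys) / (n - 1)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    if b == 0:
        # Already uncorrelated: the original axes are the components.
        return sorted([(a, (1.0, 0.0)), (c, (0.0, 1.0))], reverse=True)
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    mid = (a + c) / 2
    half = math.hypot((a - c) / 2, b)
    components = []
    for lam in (mid + half, mid - half):
        # (b, lam - a) satisfies (A - lam*I) v = 0 when b != 0.
        vx, vy = b, lam - a
        norm = math.hypot(vx, vy)
        components.append((lam, (vx / norm, vy / norm)))
    return components

for variance, direction in pca_2d([1, 2, 3, 4, 5], [2, 4, 5, 4, 5]):
    print(round(variance, 4), direction)
```

Projecting the centered records onto the two directions gives the uncorrelated principal components, ordered by how much of the variance each one carries.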

What are three dimensions of data science activities?

Data flow, data curation, data analytics

Variable scaling, decomposition, and aggregation are the three main methods of _____.

Data transformation

One important difference between time-series forecasting and basic regression analysis is that all data records are _____ and the list of observations is necessarily ordered with respect to _____.

Dependent, time instances

_____ analysis is performed on a data sample to provide a clear understanding of the data. Meanwhile, _____ deduces what the full dataset is like from the analysis of the data sample.

Descriptive, Bayesian statistics

Business intelligence is mainly used for _____ analytics, whereas data science is implemented for _____ analytics.

Descriptive, predictive

Cognitive and motivational biases can greatly _____ the model inputs and seriously _____ the model quality and the dependent _____. As a result, _____ may be achieved.

Distort the inputs, degrade the model quality and dependent analysis, inaccurate decision making.

T/F: A confusion matrix is a metric to measure the performance of a predicted regression model.

FALSE

T/F: During correlation analysis, redundant records and data with missing values are dealt with.

FALSE

T/F: Fuzzy logic is almost synonymous with the theory of fuzzy sets, a theory which relates to two classes such as {+1, -1}.

FALSE

T/F: In asynchronous serial data transmission, any gaps between the bit streams are filled with idle streams of bits of 0 or 1.

FALSE

T/F: Mutually exclusive events can occur at the same time and do not have an impact on each other.

FALSE

T/F: The autoregressive model predicts a future observation as a function of the errors in previous forecasts.

FALSE

T/F: The backpropagation algorithm is applied in ANN to predict the network's inputs given its outputs.

FALSE

T/F: The bar chart is used to show proportions of a whole, where the total of the variable values are 100 percent.

FALSE

T/F: The classification's precision measures how often the model produced true positives in all correct predictions.

FALSE

T/F: The coefficients of a regression model are estimated by applying a correlation analysis.

FALSE

T/F: The dataset variables are divided into two groups: a training set and a testing set.

FALSE

T/F: The differencing concept is utilized to convert time-series data into linear varied data.

FALSE

T/F: The eigenvectors are sorted according to their degree of correlation.

FALSE

T/F: The existence of duplicated data records degrades the accuracy of the data analysis.

FALSE

T/F: The kernel trick is utilized in SVM to maximize the separating margin.

FALSE

T/F: The power law transformation combines two variables and transforms them into two variables, one that represents a radius and one that represents an angle.

FALSE

T/F: The prediction model is evaluated through a defined list of a business's KPIs.

FALSE

T/F: The prediction step in machine learning removes misleading information from the dataset.

FALSE

T/F: The proximity matrix in agglomerative clustering is formed from the correlation coefficients between each pair of data variables.

FALSE

T/F: An e-mail that includes text and an image is considered structured data.

FALSE

A _____ represents the variable's values as a color scale to show their densities on a selected geographical area.

Heat map

Most KPIs focus on the measurement of improving _____, reducing _____, increasing _____, and/or enhancing _____.

Improving revenue, reducing costs, increasing efficiency, enhancing customer satisfaction.

A positive covariance indicates that both variables _____. If the covariance is zero, that indicates that the variables are _____.

Increase/decrease together, uncorrelated.
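
A small worked example of the sign of the covariance. The helper name `covariance` and the sample values are illustrative; the sample (n - 1) form is assumed.

```python
def covariance(xs, ys):
    """Sample covariance of two equally long lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

# Both variables rise together: positive covariance.
print(covariance([1, 2, 3, 4], [2, 4, 6, 8]))
# No linear trend between the variables: covariance of zero.
print(covariance([-1, 0, 1], [1, 0, 1]))
```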

The analysis of an organization's use case should therefore result in increasing _____, reducing _____, and/or decreasing _____.

Increasing gain, reducing risk, decreasing effort

The value of the _____ variable is the output of a regression model.

Dependent (target)

Data processing is the extraction of _____ from _____.

Information from data.

The objective of applying transformations on a dataset is to improve its _____, and to transform its variables to a new space where more _____ can be easily extracted.

Interpretability, variables

There are two main approaches in clustering, namely the ______ and the _____.

K-means, agglomerative clustering.

The results of identifying the right use case for a business are to understand the _____, address the _____, and improve the _____.

Key insights, business challenges, achievable gains.

An area chart is based on a _____ chart where the area between the axis and the line displays _____.

Line chart, quantitative variables.

The _____ transformation is usually applied in linear regression problems, while the _____ transformation largely changes the shape of a variable by replacing each value with its inverse.

Logarithm, reciprocal
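
A quick illustration of both transformations on made-up values. Base 10 is an arbitrary choice for the logarithm; both transformations assume strictly positive inputs.

```python
import math

values = [1.0, 10.0, 100.0, 1000.0]

# Logarithm: compresses a multiplicative spread into evenly spaced steps.
log_values = [math.log10(v) for v in values]

# Reciprocal: replaces each value with its inverse, reversing the order.
reciprocal_values = [1.0 / v for v in values]

print(log_values)
print(reciprocal_values)
```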

What are the methods of data processing?

Manual, mechanical, electronic.

In the _____ data processing method, data is processed by using various devices such as calculators or typewriters. This method is _____ than the _____ data processing method, but still forms the _____ of data processing.

Mechanical, faster than the manual method, forms the primitive levels.

During data collection, the variable's values which are not observed are called _____, and the variable's values which are wrongly observed are called _____.

Missing values, outliers.

ANN is composed of many layers of _____, where the _____ layer is for input values of the dataset variables, and the _____ layer is the one producing the value of the _____ variable. The intermediate layers are called the _____.

Neurons, input, output, target, hidden

The quality issues of raw data are the existence of _____, _____, _____, and _____.

Noisy variables, irrelevant variables, outliers, and missing values.

What are the three main probability distributions?

Normal, binomial, Poisson

Data analytics is applied to uncover the _____ in the data and transform the data into _____ to support _____.

Patterns in the data, data into insights, to support decision making.

The operation of sorting data variables according to their level of changeability along data records is part of...

Principal component analysis

A memory cell is a concept which exists in...

Recurrent networks

Correlation analysis is applied to handle...

Redundant variables

If there are duplicated records in the dataset, they are _____ before moving forward in the data analysis to reduce the _____.

Removed, to reduce the computational time.

The true positive rate achieved by a developed machine learning model is defined as...

Sensitivity
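
Sensitivity, together with the precision and accuracy mentioned on other cards, follows directly from the four confusion-matrix counts. The function name `rates` and the counts passed to it are hypothetical.

```python
def rates(tp, fp, fn, tn):
    """Common classification rates from confusion-matrix counts."""
    return {
        # True positive rate: share of actual positives that were found.
        "sensitivity": tp / (tp + fn),
        # Share of positive predictions that were actually positive.
        "precision": tp / (tp + fp),
        # Share of all predictions that were correct.
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

print(rates(tp=40, fp=10, fn=20, tn=30))
```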

In _____ transmission, the digital data are sent bit by bit over one channel, while in _____ transmission, there is more than one channel to deliver multiple data bits at a time.

Serial, parallel

Each line inserted in a CSV file denotes a _____ of the data, with values separated by _____, to specify the value of each _____.

Single record, comma, feature.
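
A short example of reading such a file with Python's standard `csv` module. The column names and values are made up, and an in-memory string stands in for a file on disk.

```python
import csv
import io

# Each line is one record; commas separate the feature values.
text = "age,income,label\n34,52000,Yes\n29,48000,No\n"

# DictReader uses the first line as the feature names.
rows = list(csv.DictReader(io.StringIO(text)))

for row in rows:
    print(row["age"], row["income"], row["label"])
```

Note that every value is read as a string; numeric features need an explicit conversion such as `int(row["age"])`.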

Examples in business that require time-series analyses are _____ and _____.

Stock prices and sales.

The data records which lie on each side of the SVM separating channel are called _____.

Support vectors

The bit stream is combined into long frames, and there is a constant period between transmissions in...

Synchronous data transmission

In _____ serial data transmission, the bit stream is combined into longer frames, with a constant period between transmissions, while in _____ serial data transmission, the bit stream has start and stop bits, with a variable period between transmissions.

Synchronous, asynchronous.
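
To make the start and stop bits concrete, here is a sketch of how one byte might be framed for asynchronous serial transmission. It assumes a common UART-style convention (start bit 0, eight data bits sent least significant bit first, stop bit 1, no parity bit); the function name `frame_async` is hypothetical.

```python
def frame_async(byte):
    """Wrap one byte in a start bit and a stop bit (10 bits in total)."""
    # Eight data bits, least significant bit first (assumed convention).
    data_bits = [(byte >> i) & 1 for i in range(8)]
    # Start bit 0 marks the beginning, stop bit 1 marks the end.
    return [0] + data_bits + [1]

frame = frame_async(0x41)  # the ASCII code for "A"
print(frame, len(frame))
```

The two extra bits per byte are the overhead that synchronous transmission avoids by sending longer frames at a constant period.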

T/F: Serial transmission is more reliable than parallel transmission.

TRUE

T/F: Synchronous serial data transmission is fast with no additional overhead for the start and stop data bits.

TRUE

T/F: A neuron has mostly nonlinear activation functions which allow the network to learn the nonlinear relationships between the variables.

TRUE

T/F: Agglomerative clustering merges the two nearest clusters into a bigger cluster.

TRUE

T/F: Building a regression model is an iterative process.

TRUE

T/F: In ARIMA(2, 0, 0), there are two (AR) terms.

TRUE

T/F: One of the value propositions in customer-related use cases is to recognize which parameters influence the purchasing of a product.

TRUE

T/F: The Naïve Bayes approach assumes that the independent variables are random variables.

TRUE

T/F: The autoregressive method assumes the expected output is a linear function of some past outputs.

TRUE
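
A one-step autoregressive forecast can be sketched as a weighted sum of the most recent observations. With two coefficients this corresponds to the two AR terms of ARIMA(2, 0, 0). The function name `ar_forecast` and the coefficient values are illustrative only; in practice the coefficients are estimated from the data.

```python
def ar_forecast(history, coefficients, intercept=0.0):
    """Forecast the next value as intercept + sum(phi_i * y[t - i])."""
    # Take the last len(coefficients) observations, most recent first.
    lags = history[-len(coefficients):][::-1]
    return intercept + sum(phi * y for phi, y in zip(coefficients, lags))

# Two AR terms: 0.5 * y[t-1] + 0.25 * y[t-2]
print(ar_forecast([1.0, 2.0, 3.0], coefficients=[0.5, 0.25]))
```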

T/F: The classification's accuracy is the ratio of correct predictions over total predictions.

TRUE

T/F: The combo chart is used to highlight different types of information, and particularly when the variables vary widely.

TRUE

T/F: The data science model defines the relationship between the relevant features of a dataset and its outputs.

TRUE

T/F: The eigenvalues are found by transforming the covariance matrix to a diagonal matrix.

TRUE

T/F: The logarithm transformation is usually applied in linear regression problems.

TRUE

T/F: The objective of correlation analysis is to handle redundant data variables.

TRUE

The objective of a regression model is to predict the value of a _____ in a new situation, given the remainder of the variables' values at _____.

Target variable, previous situations.

The objective of a prediction model is to produce reasonably high accuracy with respect to the...

Testing set

What is meant by data science?

The practice of arrangement, analysis, interpretation, and visualization of data in order to gain useful insights and extract information and meaning.

In linear regression, the main assumption is _____.

There are relationships between the data variables.

A bubble chart is used to visualize a dataset with _____ variables, where the first two variables are displayed as _____, while the third and the fourth variables are displayed as _____, respectively.

Two to four variables, axes values, color and size.

The agglomerative clustering is mainly applied to the data generated from a process defined by an _____.

Underlying hierarchy

What are the five V's of data?

Value, Variety, Validity, Velocity, Volume (Va Va Va Ve Vo)

The process of removing the variable's average and dividing by the variable's standard deviation is called...

Variable scaling

The variable scaling is made to ensure all the variables are _____.

Weighted equally.
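
The scaling described on these two cards (subtract the average, divide by the standard deviation) can be sketched as follows. The function name `standardize` and the sample values are assumptions, and the population standard deviation is used.

```python
import statistics

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]

print(standardize([2, 4, 4, 4, 5, 5, 7, 9]))
```

After scaling, variables measured in very different units end up on a comparable scale, so none of them dominates a distance calculation.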

The neuron sums up its _____ coming from its preceding links and applies a predefined _____ on the result to produce its output.

Weighted inputs, transfer function
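
A single neuron can be sketched in a few lines. The function name `neuron_output`, the optional `bias` term, and the default `tanh` transfer function are assumptions added for illustration; the card itself only mentions weighted inputs and a transfer function.

```python
import math

def neuron_output(inputs, weights, bias=0.0, transfer=math.tanh):
    """Sum the weighted inputs, then apply the transfer function."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return transfer(total)

# A nonlinear transfer function (tanh) applied to the weighted sum.
print(neuron_output([1.0, 2.0], weights=[0.5, 0.25]))
# An identity transfer function returns the weighted sum unchanged.
print(neuron_output([1.0, 2.0], weights=[3.0, 4.0], transfer=lambda s: s))
```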

An Excel file includes _____ and stores the data in tables of _____ (for the records) and _____ (for the variables), with support for developing _____ on the recorded data.

Worksheets, rows, columns.

What are the main motivations for dimensionality reduction (3)? What are the main drawbacks (4)?

Motivations: (1) to speed up a training algorithm, (2) to visualize the data, (3) to save space (compression). Drawbacks: (1) information loss that could degrade the performance of subsequent training algorithms, (2) can be computationally expensive, (3) adds some complexity to your machine learning pipelines, (4) transformed features are hard to interpret.

If the correlation coefficient (ρ) between two variables is +1, the two variables are _____, and if (ρ = 0) the two variables are _____. Negative correlation coefficients imply that the variables are _____.

ρ = +1 -> fully correlated, ρ = 0 -> uncorrelated (no linear relationship), ρ < 0 -> inversely correlated.
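
The three cases can be checked numerically with a small Pearson correlation helper. The function name `correlation` and the sample data are illustrative.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# y rises in step with x: coefficient close to +1.
print(correlation([1, 2, 3, 4], [3, 5, 7, 9]))
# y falls as x rises: coefficient close to -1.
print(correlation([1, 2, 3, 4], [-1, -2, -3, -4]))
# y = x squared on symmetric x: no linear relationship, coefficient 0.
print(correlation([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4]))
```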

