Midterm


Select the best fit answer: · Semi-supervised ML represents a mixture between supervised and unsupervised learning. · Many machine-learning researchers found that the joint use of unlabeled data with a small amount of labeled data can improve the learning accuracy. · Semi-supervised learning is closer to human learning due to our capability to handle fuzziness. · All above answers are correct.

All the above

Select the best fit answer: · Bayesian methods are based on statistical decision theory and are often applied in pattern recognition, feature extraction and regression applications. · Bayesian networks offer a directed acyclic graph (DAG) model represented by a set of statistically independent random variables. · With Bayesian methods, both prior and posterior probabilities are applied in making predictions. · All above answers are correct.

All the above

Select the best fit answer: · Machine Learning (ML) algorithms are operated by building a decision-making model from sample data inputs · To implement the ML task, we need to explore or construct computer algorithms to learn from data and make predictions on data based on their specific features, similarity or correlations. · Machine learning (ML) is an actionable discipline extended from the study of pattern recognition and computational learning theory in artificial intelligence (AI). · All above answers are correct

All the above

Select the correct answer: · Today Oracle, IBM and Microsoft control almost 90% of the RDBMS market · MariaDB, PostgreSQL and SQLite are open source relational databases. · The Structured Query Language (SQL) was first created at IBM. · All above answers are correct

All the above

Select the correct answer: · The use of cognitive systems will be a major disruptor within the next 5 years · Gartner predicted that in 2017 the use of applied "analytics will go viral" · 54% of companies use analytics for competitive advantage · All above are correct answers

All the above

Select the correct answers: · The Internet of Things is a major contributor to Big Data content · Healthcare is a substantial area of applicability for analytics computing · Social networks generate a very substantial amount of unstructured data · All above answers are correct

All the above

Which of the following are critical issues for Big Data operation within the Cloud? Provide the best applicable answer. · Preprocessing of unstructured data · Data analytics software tools · Machine learning and cloud analytics algorithms · Data governance and security · All above.

All the above

Select the correct answer: · Unlike classification, clustering-based division is uncertain. · Cluster analysis assigns a set of observations and labels a sample data space. · Clusters are separated by similar features or properties

All the above OR 3

List three or more cloud infrastructure and services provider companies.

Amazon Web Services, IBM, Google Cloud, Microsoft Azure

Name at least three engines, Google Cloud Platform (GCP) offers to customers, part of its Compute Portfolio of services. Explain what type of virtualization each runs.

App Engine: PaaS
Compute Engine: VMs on Google's infrastructure
Container Engine: runs containers on GCP

Select the correct answer: · Unsupervised machine learning methods don't require data mining skills. · Data mining methods always process labeled data. · Association Analysis methods focus on finding frequent patterns, associations, correlation or causal structures

Association Analysis methods focus on finding frequent patterns, associations, correlation or causal structures
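
A minimal sketch of how "frequent patterns" and "associations" are usually measured, assuming the standard support and confidence definitions (the card does not name them). The basket data and function names are made up for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent): support(A and C) / support(A)."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical shopping baskets, invented for this example.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

# How often bread and milk appear together.
bread_milk_support = support(baskets, {"bread", "milk"})
# How often a basket with bread also contains milk (rule: bread -> milk).
bread_to_milk_conf = confidence(baskets, {"bread"}, {"milk"})
```

A rule such as bread -> milk is typically kept only when both numbers clear user-chosen thresholds.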

Select the false answer: · Artificial Neural Networks (ANN) are cognitive models inspired by the structure and function of biological neurons. · Deep Learning methods extend ANNs by building much deeper and complex neural networks; they are built of multiple layers of interconnected artificial neurons. · Association rules-based methods represent supervised machine learning algorithms.

Association rules-based methods represent supervised machine learning algorithms.

Explain in your own words what body area sensor networks (BANs) are. Provide example(s) of sensors placed on the human body.

BANs are networks of sensors attached to a human body that connect with relay points near the body in order to upload medical or other data to specific databases. Examples: blood pressure sensor, motion sensor.

EPIC: Describe what divisive hierarchical clustering is.

Begin with a single cluster containing all the points (i.e. the collection of all data). At each step, split a cluster into the two sub-clusters that are farthest from each other, and repeat until no cluster can be divided further (namely, only single-point clusters are left).

Name at least three types of physiological data that sensors can collect from human body or in relationship of the movement of the human body.

Blood pressure, blood glucose, brain waves

Choose the best answer from below: · Both Structured and Unstructured Data is Common in IoT Environments · IoT collects and uses only unstructured data · IoT collects and uses only structured data

Both structured and unstructured

Name at least three Machine Learning API type services that Google Cloud Platform (GCP) offers to customers as part of its Machine Learning portfolio of services.

Cloud Speech: speech to text
Cloud Machine Learning: machine learning on any data
Cloud Vision: image recognition

Select the correct answer below: · OpenStack is a cloud operating system that controls large pools of compute, storage, and networking resources throughout a datacenter. · OpenStack is a Big Data NoSQL database solution · OpenStack is the newest open source Software as a Service package cloud providers offer as part of their portfolio of cloud services

Cloud operating system that controls large pools of compute

Select the correct answer: · The concept of cloud computing has evolved from cluster, grid and utility computing · Cloud computing evolved from open source Linux platform · Cloud computing evolved as result of consolidation of multiple large data centers.

Cluster grid and utility

List at least three (3) unsupervised machine learning algorithms

Clustering: k-means
Dimensionality reduction: principal component analysis
Artificial neural networks: back propagation

Describe in your own words what neuroinformatics means.

Combines informatics and brain modelling

Select the false answer: · With decision tree algorithm, the model is based on observation of the data's target values along various feature nodes in a tree-structured decision process. · With decision tree algorithm, various decision paths fork in the tree structure until a prediction decision is made at the leaf node, hierarchically. · Decision tree methods cannot be used to solve classification and regression problems.

Decision tree methods cannot be used to solve classification and regression problems.

How many satellites (or wireless communication towers) are needed to identify a device's or human's position? Select the best answer: · Three or more satellite devices or telecommunication towers are needed · Two satellite devices or telecommunication towers are sufficient as long as the person is located close enough to the straight line connecting them. · One satellite device or telecommunication tower is sufficient, as long as the object or person is located very close to it.

Three or more, but four or more is preferable.
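
Why three: a rough 2D sketch of trilateration, assuming each tower reports an exact distance to the device. Two distance circles intersect in two points, and a third circle picks one of them. The function name and tower coordinates are hypothetical. Real GPS receivers work in three dimensions and also solve for a clock offset, which is one reason four or more satellites are preferable.

```python
import math


def trilaterate(p1, r1, p2, r2, p3, r3):
    """Locate a point in 2D from three anchor positions and distances.

    Subtracting the first circle equation from the other two leaves a
    2x2 linear system in (x, y), solved here with Cramer's rule.
    """
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    a1, b1 = 2 * (x2 - x1), 2 * (y2 - y1)
    c1 = r1**2 - r2**2 + x2**2 - x1**2 + y2**2 - y1**2
    a2, b2 = 2 * (x3 - x1), 2 * (y3 - y1)
    c2 = r1**2 - r3**2 + x3**2 - x1**2 + y3**2 - y1**2
    det = a1 * b2 - a2 * b1
    if abs(det) < 1e-12:
        raise ValueError("anchors are collinear; position is ambiguous")
    return (c1 * b2 - c2 * b1) / det, (a1 * c2 - a2 * c1) / det


# Hypothetical towers and a device whose true position is (3, 4).
towers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
true_position = (3.0, 4.0)
distances = [math.dist(t, true_position) for t in towers]

estimate = trilaterate(towers[0], distances[0],
                       towers[1], distances[1],
                       towers[2], distances[2])
```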

EPIC: Describe within your own words what Representation Machine Learning Method is.

Unsupervised attempts to preserve the crucial information in input data

EPIC: Describe within your own words what Reinforcement Machine Learning Method is.

Unsupervised; demands a policy that links the states of the prediction model to the levels of reinforcement actions to be taken

Select the correct answer: · With supervised learning, all input data are not labeled with a known result. · With supervised learning, the input data is called training data with a known label or result. · With semi-supervised learning, the input data is a mixture of labeled and unlabeled examples.

With supervised learning, the input data is called training data with a known label or result. AND With semi-supervised learning, the input data is a mixture of labeled and unlabeled examples.

Select the correct answer: · 75% of world's data is stored within relational databases · 80% of the world's data today is unstructured · All the data in the world is somewhat structured.

80% of the worlds data is unstructured

Select the correct answer: · 90% of the world's data was created in the last two years · 70% of the world's data was created within the last 10 years. · Every year an additional 20% of new data is created (in terms of Big Data)

90% in the last two years

Describe in your own words what a Data Warehouse and Data Lake are. Describe common features and well as where they differ.

A data lake is a pool of raw/unstructured data, with data "flows" coming in from multiple archives. A data warehouse holds structured data that has been prepared or used for a specific purpose.

Name two or more sensors, commonly available within modern smart phones.

Accelerometer, GPS, Bluetooth

Describe how global positioning systems can be used to track the position of transportation vehicles, farm equipment, as well as humans or animals.

Active GPS transmits data from the GPS to a satellite that then transports the data to a GPS center. These GPS systems can be attached to humans, animals, or vehicles.

Select the correct answer below: There are multiple approaches (called "common chains") to determine proximity between clusters, including single chain, whole chain, group average and others. · Within single chain, the proximity of clusters is defined by MAX as the distance between two closest points in the clusters. · Within whole chain, the proximity of clusters is defined by MIN as the distance between two closest points in the clusters. · Agglomerative hierarchical clustering requires a constant merger of the two most adjacent clusters. It needs to determine the proximity of each cluster, so a specific criteria must be given for this measurement.

Agglomerative hierarchical clustering requires a constant merger of the two most adjacent clusters. There are multiple approaches (called "common chains") to determine proximity between clusters, including single chain, whole chain, group average and others.
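
A short sketch of the three proximity measures, assuming "single chain" and "whole chain" correspond to what is usually called single linkage (MIN, closest pair) and complete linkage (MAX, farthest pair). That mapping is an assumption, and the function names and sample points are invented.

```python
import math
from itertools import product


def pair_distances(cluster_a, cluster_b):
    """Euclidean distance for every cross-cluster pair of points."""
    return [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]


def single_link(cluster_a, cluster_b):
    """MIN: distance between the two closest points of the clusters."""
    return min(pair_distances(cluster_a, cluster_b))


def complete_link(cluster_a, cluster_b):
    """MAX: distance between the two farthest points of the clusters."""
    return max(pair_distances(cluster_a, cluster_b))


def group_average(cluster_a, cluster_b):
    """Mean distance over all cross-cluster pairs."""
    distances = pair_distances(cluster_a, cluster_b)
    return sum(distances) / len(distances)


# Two made-up clusters on a line; cross distances are 3, 5, 2 and 4.
left = [(0.0, 0.0), (1.0, 0.0)]
right = [(3.0, 0.0), (5.0, 0.0)]
```

The choice of measure changes which pair of clusters counts as "most adjacent", so the same data can merge in a different order under each one.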

Find the best fit correct answer: · Logistic regression is a linear regression analysis model in a broad sense and may be used for prediction and classification. · Logistic regression is commonly used in fields such as data mining, automatic diagnosis for diseases and economical prediction. · Logistic regression combines multiple input features into one feature and uses maximum likelihood estimation to transform it into an optimization problem. · All above answers are correct

All above answers are correct
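
A rough sketch of the third statement: the features are combined into one linear score, squashed by a sigmoid, and the weights are found by maximizing the likelihood. Plain gradient ascent is used here as one possible optimizer. The data, learning rate and epoch count are arbitrary assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.3, epochs=10000):
    """Maximize the average log-likelihood by batch gradient ascent."""
    w = [0.0] * len(xs[0])
    b = 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in zip(xs, ys):
            # Combine all input features into one score, then squash it.
            p = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))
            err = y - p  # gradient of the log-likelihood w.r.t. the score
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi + lr * g / n for wi, g in zip(w, grad_w)]
        b += lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Classify as 1 when the estimated probability is at least 0.5."""
    return 1 if sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) >= 0.5 else 0


# Hypothetical data: hours studied -> passed (1) or failed (0).
xs = [[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]]
ys = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(xs, ys)
```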

Find the best fit correct answer: · Naive Bayesian networks assume that all attributes are statistically independent. This assumption is too strict in some cases. · The Bayesian Belief Network is a graphical representation of the relationship among attributes. · Through the aggregation of multiple classifiers, classification accuracy is improved, and we call this technique an ensemble model or a combination of classifiers. · All above answers are correct

All above answers are correct

Describe two or more skills data analysts typically have; in the same way list two or more skills data scientists have.

Data analysts excel at examining the data for patterns and use SQL/scripting languages to create reports. Data scientists excel at machine learning and data mining using Python and R.

Describe in your own words what Data Science is and how it differs from Business Intelligence and/or Predictive Analytics. Explain the difference for at least one of these two.

Data science uses machine learning and artificial intelligence to connect the dots in data. Business intelligence looks at historical data. Predictive analytics gains insights into the future.

Provide two or more examples for supervised classification methods

Decision Trees or KNN

Find the false answer: · A decision tree offers a predictive model in both data mining and machine learning. · In a classification tree, leaves represent class labels and branches represent conjunctions of attributes that lead to the class labels. · Decision tree algorithm cannot use continuous values for its variables, such as real numbers. It only uses logistic values or 1 and 0.

Decision tree algorithm cannot use continuous values for its variables, such as real numbers. It only uses logistic values or 1 and 0.

List and describe at least two levels of Data Analytics

Descriptive analytics is the lowest level of data analytics. It describes what has happened. Predictive analytics is the third level of data analytics. It describes what will happen.

Name at least two types of VM (virtual machine) type virtualization (hint: think of different types of hypervisors).

Full virtualization: complete simulation of host hardware to a virtual CPU
Partial virtualization: some of the selected resources are virtualized and some are not

List four or more technologies that positively enabled the evolution of Internet of Things (IoT).

GPS, sensor networks, RFID tracking, biometrics

Describe at least two deficiencies of classic MapReduce, which were compensated by Apache Spark. Use your own words and common-sense scenario if you don't remember.

Forces data processing into map and reduce steps, so it can't use any other SQL/database functions.
Only supported by Java.

List three or more cloud data center characteristics.

Data centers generally contain thousands or millions of servers.
The servers are usually of the same type.
Low-cost hosting software.

Name at least three services, Google Cloud Platform (GCP) offers to customers, part of its Big Data Portfolio of services.

Genomics, BigQuery, Cloud Dataflow

Choose the correct answer: · Hadoop is good for storing large files · Hadoop is good for low latency applications · Hadoop is not good for processing and storing streamed data

Good for large files

List and provide short description for at least three Big Data Application category.

Health care: for holding medical records and genomics Government: national agencies can use it to track crime Scientific: large hadron collider and NASA

Name at least two smart devices people wear when performing sports activities. Describe how these sensors collect data and what type of data.

Heart rate watch, step counter

Select the false answer: Association analysis (association mining) finds frequent · Patterns · Associations · Correlation · Causal structures · Hidden labels

Hidden Labels

List three or more cloud as a service ("...aaS") platforms and provide a simple example for each.

Infrastructure: puts together servers, storage, and networks as demanded by users. Example: AWS.
Platform: provides a platform for users to develop applications, with middleware and database tools. Example: Azure.
Software: browser-based software that serves a specific purpose. Example: Google Docs.

List three or more organizations who used and /or contributed to Big Data in very early years.

International Association for Statistical Computing (IASC)
Knowledge Discovery in Databases
International Federation of Classification Societies (IFCS)

Find the false answer · Machine learning gradually loses relevance with the rise of data science and the big data industry. · Machine learning algorithms by nature of training can be categorized as supervised, unsupervised and semi-supervised algorithms. · Random forest algorithm can be used to implement decision tree classification.

Machine learning gradually loses relevance with the rise of data science and the big data industry.

List three or more cloud computing models. Describe them using at least one sentence for each model.

Public clouds: clouds that are based on the internet and can be used by anyone. Services such as AWS or Azure are public clouds.
Private clouds: built within a network of computers and used by a single user or a small number of users. Better for specific tasks than public clouds.
Community clouds: collaborative infrastructure shared by multiple organizations with some common social or business interest. Built over multiple datacenters.

Choose the best correct answer from below: · Node-red sensor programming package uses Java and Eclipse · Node-RED is a programming tool for wiring together sensors, hardware devices, APIs and online services · Node-red is used to generate and transmit critical alerts and signals between IoT devices and cell-phones

Node-RED is a programming tool for wiring together sensors, hardware devices, APIs and online services

Describe in your own words, why is Node Red so popular? Optionally name which cloud service includes Node-Red application as part of Node-Red starter or IoT platform starter.

Node-RED is popular because it is very accessible. It is browser-based and built on JavaScript (Node.js), so it is easy to run. It is lightweight and can run on low-cost hardware like a Raspberry Pi for robotics.

Describe in your own words what a "neural computer" is.

Neural computers are computers that are designed to mimic the human brain. They include neural networks that can learn, memorize, and handle more advanced questions than normal computers.

Select the correct answer: · PCA is designed to transfer multiple indicators (variable in the regression) to several aggregate indicators (principal component) with dimensionality reduction · Principal components are not reflecting information about the original variables. · PCA algorithm never loses any details and all information is kept following reduction.

PCA is designed to transfer multiple indicators (variable in the regression) to several aggregate indicators (principal component) with dimensionality reduction
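
A minimal sketch of the idea for the smallest possible case, reducing two correlated variables to one principal component. It assumes the usual covariance-eigenvector formulation of PCA. The function name and data are hypothetical, and the input must contain at least two distinct points.

```python
import math


def pca_2d_to_1d(points):
    """Project 2D points onto their first principal component.

    Returns (direction, scores, explained), where explained is the share
    of total variance kept by the single component.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the 2x2 sample covariance matrix.
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)

    # Largest eigenvalue of [[sxx, sxy], [sxy, syy]] in closed form.
    trace = sxx + syy
    det = sxx * syy - sxy * sxy
    lam = trace / 2 + math.sqrt(max(trace * trace / 4 - det, 0.0))

    # Matching eigenvector; fall back to an axis if there is no correlation.
    if abs(sxy) > 1e-12:
        vx, vy = lam - syy, sxy
    elif sxx >= syy:
        vx, vy = 1.0, 0.0
    else:
        vx, vy = 0.0, 1.0
    norm = math.hypot(vx, vy)
    direction = (vx / norm, vy / norm)

    scores = [x * direction[0] + y * direction[1] for x, y in centered]
    return direction, scores, lam / trace


# Made-up data lying exactly on the line y = 2x.
direction, scores, explained = pca_2d_to_1d([(1, 2), (2, 4), (3, 6), (4, 8)])
```

Here one component keeps all the variance because the two variables are perfectly correlated. With noisier data the explained share drops below 1, which is the information PCA gives up in exchange for fewer dimensions.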

Name at least three database types Amazon Web Services (AWS) offers as a service to customers.

RDS: relational database service (e.g. MySQL)
DynamoDB: NoSQL data store
Redshift: data warehouse service

Name the three essential components of an RFID application (or RFID system).

RFID tag, RFID reader, backend system

Find the false answer: · Regression analysis methods apply mathematical statistics to establish dependent variables and independent variables in a machine learning process. · The independent variables are the inputs of the regression process, also known as the predictors. The dependent variable is the output of the process. · Regression uses only independent variables; output is randomly generated

Regression uses only independent variables; output is randomly generated

List at least three (3) supervised machine learning algorithms

Regression: linear regression
Classification: KNN
Decision trees: random forests

Name two or more modules of a typical sensor node architecture

Sensor, radio, memory

Name at least three "smart" technologies that catalyzed (or will catalyze) the evolution of IoT, starting around 2015 and continuing through 2020 and beyond.

Smart antennas, energy harvesting, ubiquitous positioning

List three or more Big Data (Hadoop or other) providers, which can be platform and/or distribution providers.

Spark, GFS, MapReduce

Explain the difference between standalone versus cloud-enabled (cloud centric) IoT application; explain these concepts using your own words

Standalone applications focus on improving quality of life. They are objects with primitive intelligence and without communication capabilities. Cloud applications are gradually maturing and support smart applications involving a large amount of data.

Describe three or more technologies which form the SMACT technologies grouping term.

SMACT stands for Social, Mobile, Analytics, Cloud computing, and Internet of Things.
Mobile: telecommunications such as 4G, smartphones, and iOS.
Social: services such as YouTube, Twitter, and Facebook.
Big data analytics: data mining and machine

Explain in your own words what structured and unstructured data are and provide examples for each.

Structured data is well organized and cleaned. An example of structured data would be a dataset that has been cleaned of empty cells. Unstructured data is data that is not cleaned or organized in any way and is data you found in its "wild" form. An example of this would be a dataset recorded long ago with missing cells.

Select the correct answer: · Dimensionality reduction represents a supervised machine learning algorithm · Support Vector Machines (SVM): are often used in supervised learning methods for regression and classification. · Ensemble methods include models composed of multiple strong models, trained in dependency with each other

Support Vector Machines (SVM): are often used in supervised learning methods for regression and classification.

Name at least three OpenStack services and describe them via at least one sentence each.

Swift: scalable storage system spread over large datacenter servers
Sahara: Hadoop cluster module
Nova: compute module

List three layers within a generic cloud computing reference architecture.

Bottom layer: physical servers and hardware that the cloud is built on top of.
Middle layer: virtualization and resource management.
Top layer: cloud applications for user services.

Select the false answer: · Regression algorithm offers a supervised approach using statistical learning to model the relationship between input data characteristics. · The regression process is iteratively refined using an error criterion to make better predictions · This regression algorithm minimizes the error between predicted value and actual experience in input data.

The regression process is iteratively refined using an error criterion to make better predictions

Name at least 4 items from the five "V"-s of Big Data and provide examples for each

Volume: the amount of data, e.g. tables and datasets
Velocity: streams
Value: correlations
Variety: structured vs. unstructured

Find the false answer: · We cannot extract rules from a decision tree algorithm. · The Nearest Neighbor Classifier can use active and passive learning methods · Support vector machines classifier can be used to classify a multi-dimensional dataset

We cannot extract rules from a decision tree algorithm.

Describe in your own words what Hadoop is.

A software framework that stores large amounts of data and can be used to process large data sets across clusters of machines.

Describe the term "Big Data" in your own words.

Big data is a large amount of data that can be structured or unstructured. Generally, it is not feasible to work with such large amounts of data on one system, so cloud computing is used.

Typically, analytics computing platforms have multiple layers and they are composed as a stack of underpinning elements (hardware, software, processing, etc.). Describe at least two layers within the analytics computing platform stack.

Bottom layer: cloud infrastructure
Middle layer: indexes to visualize and access data
Top layer: report and display results (visualization)

Name at least five objects that can play the role of "things" as they can interact with IoT infrastructure and software. (Hint: think of different types of sensors, boards and smart devices).

Cars, laptops, smart phones, subway turnstiles, refrigerators

Describe in your own words some of the characteristics of the "cloud analytics" concept.

cloud analytics combines analytics computing with a large amount of data. Working to do analytics at a massive scale

Describe in your own words three (3) benefits of cloud computing; use full sentences.

Cloud computing reduces costs and lets users demand more resources at peak workload.

Describe in your own words what data mining is. Provide at least three data mining methods and/or procedures that are typically used to perform data mining. If you don't remember, use common sense and past experience with data.

Data mining is taking information from a dataset and transforming it into a readable and understandable structure. Methods: classification, clustering, association rules.

Name at least one Google paper (publication), that contributed to the development of the Hadoop ecosystem general. Don't need the exact paper name, just a short description of what was about.

The Google File System (GFS) paper

Select the correct answer: · The purpose of PCA is to recombine the related variables to a group of new unrelated comprehensive variables to replace the original variables · There are no relations between principal components and original variables · Principal components are related to each other

The purpose of PCA is to recombine the related variables to a group of new unrelated comprehensive variables to replace the original variables

List and describe at least two MapReduce operations.

Mapping: indexes and sorts data into the desired structure.
Shuffle: redistributes the mappers' outputs across nodes depending on the key.
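
The operations above can be sketched on one machine with the classic word-count example. This is only an in-memory illustration, not Hadoop's API; the function names and documents are invented, and a reduce step is added to show where the shuffled groups end up.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (key, value) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(mapped_pairs):
    """Shuffle: group all values by key, as if routing them to reducers."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups):
    """Reduce: collapse each key's list of values into a single result."""
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical input split across two "mappers".
documents = ["big data on the cloud", "cloud analytics for big data"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle_phase(mapped))
```

In a real cluster each phase runs in parallel on different nodes, and the shuffle is the step that moves data across the network.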

EPIC: Describe within your own words what density-based clustering is.

The points in the data space can be classified into the following three data density types according to the intensive degree:
Core point: the point within the dense areas. Its neighborhood is determined by the distance function (Euclidean distance is commonly used), the distance parameter specified by the user and the threshold value of the number of internal points. If a point is a core point, the number of points in its neighborhood surpasses the given threshold value.
Frontier point: the point on the edge of the dense areas. The number of points within the neighborhood of this point is less than the threshold value of the number of internal points specified by the user, but this point is located in the neighborhood interior of a certain core point.
Noise point: the point in the sparse areas. The number of points within the neighborhood of this point is less than the threshold value of the number of internal points specified by the user, and this point is not located in the neighborhood interior of any core point.
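
A small sketch of the three point types, assuming Euclidean distance and a DBSCAN-style convention in which a point counts itself as a neighbour and is a core point when it has at least min_pts neighbours. The parameter names eps and min_pts and the sample points are assumptions, not taken from the card.

```python
import math


def classify_points(points, eps, min_pts):
    """Label every point as 'core', 'frontier' or 'noise'."""
    # Neighbourhood of i: indexes of all points within eps (including i).
    neighbours = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    core = {i for i, ns in enumerate(neighbours) if len(ns) >= min_pts}

    labels = []
    for i, ns in enumerate(neighbours):
        if i in core:
            labels.append("core")
        elif any(j in core for j in ns):
            labels.append("frontier")  # sparse itself, but near a core point
        else:
            labels.append("noise")
    return labels


# Made-up data: a dense square, one point on its edge, one far outlier.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2.2, 1), (8, 8)]
labels = classify_points(points, eps=1.5, min_pts=4)
```

A full density-based clustering algorithm would then connect core points that lie in each other's neighbourhoods into clusters and attach the frontier points to them.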

EPIC: Describe within your own words what Dimensionality Reduction Algorithm is. Mention and describe at least one benefit of this algorithm.

Dimensionality reduction refers to the transfer of the points in high-dimensional space to low-dimensional space through a mapping function, to relieve the "curse of dimensionality". Dimensionality reduction may not only reduce the correlation of data, but also accelerate the operation speed of the algorithm (decrease of data volume).

Describe in your own words what a rule-based classification algorithm is.

A rule-based classifier is a technique that uses a set of "if ... then ..." rules to classify records, usually representing the model rules in disjunctive normal form, R = (r1 ∨ r2 ∨ ··· ∨ rk), where R is the rule set and each ri is a classification rule or disjunct.
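
A minimal sketch of such a rule set, assuming an ordered list in which the first matching rule wins and a default label is returned when no rule fires. The attributes, labels and rules are invented for illustration.

```python
def classify(record, rules, default="unknown"):
    """Return the label of the first rule whose condition matches the record."""
    for condition, label in rules:
        if condition(record):
            return label
    return default


# Hypothetical rule set R = (r1 v r2 v r3); each rule is "if condition then label".
rules = [
    (lambda r: r["warm_blooded"] and r["gives_birth"], "mammal"),
    (lambda r: r["warm_blooded"] and not r["gives_birth"], "bird"),
    (lambda r: not r["warm_blooded"] and r["lives_in_water"], "fish"),
]

bat = {"warm_blooded": True, "gives_birth": True, "lives_in_water": False}
salmon = {"warm_blooded": False, "gives_birth": False, "lives_in_water": True}
lizard = {"warm_blooded": False, "gives_birth": False, "lives_in_water": False}
```

Here `classify(lizard, rules)` falls through to the default because no rule covers it, which is the usual reason rule sets carry a default class.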

EPIC: Describe what agglomerative hierarchical clustering is.

Start with each individual object as its own cluster, merge the two nearest objects or clusters, and repeat until all objects are in one cluster (namely, the collection of all data).
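
A rough bottom-up sketch of this procedure, assuming Euclidean distance and closest-pair (single-link) proximity between clusters. The target_clusters parameter is an added convenience for stopping early and is not part of the description above; the sample points are made up.

```python
import math


def single_link(cluster_a, cluster_b):
    """Proximity of two clusters: distance between their closest points."""
    return min(math.dist(p, q) for p in cluster_a for q in cluster_b)


def agglomerate(points, target_clusters=1):
    """Merge the two nearest clusters until target_clusters remain."""
    clusters = [[p] for p in points]  # start: every object is its own cluster
    while len(clusters) > target_clusters:
        # Find the pair of clusters with the smallest proximity.
        i, j = min(
            ((a, b) for a in range(len(clusters))
             for b in range(a + 1, len(clusters))),
            key=lambda ab: single_link(clusters[ab[0]], clusters[ab[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


# Made-up points: two tight pairs and one isolated point.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
three_clusters = agglomerate(points, target_clusters=3)
one_cluster = agglomerate(points)
```

Stopping at different values of target_clusters corresponds to cutting the merge hierarchy at different heights.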

