BCOR 2205: Info Management Midterm

timeline of AI, ML, and deep learning

AI - 1950s; ML - 1980s; deep learning - 2010s

relationship between AI, ML, and deep learning

AI is the overarching topic; ML is a subset of AI; deep learning is a subset of ML

structured database

ER db, MS access, SAP, CRM

C

ERP stands for A - engaged research and planning B - enterprise reasoned plan C - enterprise resource planning D - effective resource planning E - electronic resource plan

competitive advantage of big data

1/3 business leaders make frequent decisions with a lack of trust in the information; 1/2 say they don't have access to the info they need to do their jobs; 83% see BI/Analytics as part of their road map; 60% see a need to do a better job capturing and understanding information rapidly in order to make swift business decisions

cloud based tools

Amazon Web Services, Microsoft Azure SQL Database, Cloud SQL by Google, Oracle Database as a service, Hadoop

database/data base management systems

a collection of records, stores both end user data (raw facts) and metadata (data about data) designed as means of collecting all the records, software system designed to allow the definition, update, retrieval and administration of databases

relational database

a database that represents data as a collection of tables in which all data relationships are represented by common values in related tables

entity relationship model

a graphical approach to database design, graphically represents the logical relationship of entities (objects)

data warehouse

a logical collection of information - gathered from many different operational databases - that supports business analysis activities and decision-making tasks

ordinal measurement

a measure for which the scores represent ordered categories that are not necessarily equidistant from each other; ex: poor, average, good, excellent; ordered, but not measured or equidistant; no meaningful zero

interval measurement

a rank with meaningful equidistant intervals between scale points; ex: temperature, dates; zero is arbitrary, so ratios are meaningless (multiplication/division not possible), but addition and subtraction can be performed; measured, ordered, equidistant, but no meaningful zero

attribute instance

a single value of an attribute for an instance of an entity ex: 'Jane Hathaway' and '3 March 1989' are instances of the attributes name and hire date

deep learning

a subset of AI and ML; based on neural networks, inspired by the human brain; with ML we have to define features that separate whereas DL finds these on its own

machine learning

a subset of AI; a field of study that gives computers the ability to learn without being explicitly programmed; the practice of using algorithms to parse data, learn from it, and then make a determination or prediction about something in the world

B

according to your textbook, features are your A - responses B - predictors

A

according to your textbook, which is faster? A - training a subject matter expert in autoML B - training a data scientist to understand subject matter

the 8 automated machine learning criteria

accuracy, productivity, ease of use, understanding and learning, resource availability, process transparency, generalizability across contexts, recommended actions

holdout set

aka "testing data"; a subsection of a dataset used to provide a final estimate of the machine learning model's performance after it has been trained and validated; should never be used to make decisions about which algorithms to use or for improving or tuning algorithms

categorical data

aka qualitative, described by words rather than numbers ex: freshmen, sophomores, juniors, seniors

numerical data

aka quantitative, arise from counting, measuring, or some kind of mathematical operation ex: 20 of you visited the class website since last class

relationships

an association among entities

foreign key

an attribute borrowed from another related table in order to make the relationship between the two tables

primary key

an attribute that uniquely identifies a specific instance of an entity, every entity in the data model must have one whose values uniquely identify instances of the entity to qualify, must be a non-null value and must be unique
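
A minimal sketch of how a primary key and a foreign key tie two tables together, using Python's built-in sqlite3 module. The table names, column names, and the second customer are hypothetical and chosen only for illustration; 'Jane Hathaway' is borrowed from the attribute instance card.

```python
import sqlite3


def build_db():
    """Create a tiny in-memory database with two related tables (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    # customer_id is the primary key: unique and non-null for every customer.
    conn.execute(
        "CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
    )
    # purchase.customer_id is a foreign key borrowed from the customer table.
    conn.execute(
        """CREATE TABLE purchase (
               purchase_id INTEGER PRIMARY KEY,
               customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
               amount REAL NOT NULL)"""
    )
    conn.executemany(
        "INSERT INTO customer VALUES (?, ?)",
        [(1, "Jane Hathaway"), (2, "Sam Lee")],
    )
    conn.executemany(
        "INSERT INTO purchase VALUES (?, ?, ?)",
        [(10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.5)],
    )
    return conn


def totals_per_customer(conn):
    """Join the tables on the common key value and total each customer's purchases."""
    return conn.execute(
        """SELECT c.name, SUM(p.amount)
           FROM customer c JOIN purchase p ON p.customer_id = c.customer_id
           GROUP BY c.customer_id
           ORDER BY c.customer_id"""
    ).fetchall()


def try_orphan_purchase(conn):
    """Return False if the database rejects a purchase for a non-existent customer."""
    try:
        conn.execute("INSERT INTO purchase VALUES (13, 99, 5.0)")
        return True
    except sqlite3.IntegrityError:
        return False
```

The join works only because the same customer_id values appear in both tables, which is the "common values in related tables" idea from the relational database card.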

ease of use

analysts should find the system easy to use; system should minimize the machine learning knowledge necessary to be immediately effective, especially if the analyst does not know what needs to be done to the data for a given algorithm to work optimally; operationalizing the model into production should be easy

random forest

build many trees and average the predictions from each one; uses many splits, without relying on just a few data points for the final prediction; significantly more accurate than linear regression or single decision trees in most circumstances

regression

continuous outcome; used to predict the values of a dependent variable based on the values of at least one independent variable; ex: how much money is in someone's pocket

transform data

converts the observations from the original units of measurement to a standardized scale, using Excel, SQL, Alteryx, Python, or RStudio
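
One common way to put observations on a standardized scale is the z-score (subtract the mean, divide by the standard deviation). The card does not say which transformation the course uses, so treat this as an assumed example; the function name is hypothetical.

```python
from statistics import mean, pstdev


def standardize(values):
    """Convert raw values to z-scores: (value - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Every value is identical, so there is no spread to scale by.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]
```

After the transformation the values have mean 0 and standard deviation 1, so columns originally measured in different units can be compared directly.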

discrete

countable (integers 1, 2, 3, 4, etc)

cloud database

created and maintained using cloud data services that provide defined performance measures for the database, runs on a cloud computing platform, network of remote servers hosted on the Internet to store, manage, and process data ex: Hadoop, mongoDB

CRM

customer relationship management

C

customer: purchase amount, name. The entity(ies) is/are ________ and the attribute(s) is/are ________ A - purchase amount, customer B - purchase amount, customer and name C - customer, purchase amount and name D - customer, name E - customer, purchase amount

features

different independent 'variables', columns of data, predictors

examples of ML

driverless cars (Tesla, Volvo), medicine (find cancer), fraud detection (PayPal, Amazon, Home Depot), Mt. Everest weather, finding missing children through facial recognition, Alexa/Cortana/Siri, Facebook ads, Zillow home prices

dirty data

duplicate, incorrect, and/or missing data; biggest problem today; you must examine data: not knowing that data is not clean = bad decisions

C

ease of use refers to: A - a system being easy to use for the analyst B - a model that is easy to put into operation/production C - both A and B

querying tools

easy to use software allowing users to get specific information from a data base; released the power of the data warehouse to the masses

unsupervised machine learning

machine learning that does not need input for the algorithms and does not need to be trained, machine determines what to learn/what is important

ERP

enterprise resource planning

ERP

enterprise resource planning system

big data in business (health)

epidemic/disease outbreak prediction; individual disease predictions; increased access to databases on symptoms and new treatments; real time monitoring of patients

decide whether to continue (6 in define project objectives)

estimate resources required, understand alternatives to creating model, consider technical risks, estimate models' business values

model diagnosis

evaluation/ranking of the models; models can be run simultaneously; combinations of models can be run (Blenders)

supervised classification

event/no event (binary target) ex: 0/1, yes/no

what is 'Big Data'?

everything we do is leaving a trace (data), which is analyzable; 'the Internet of Things'

B

excel data would be considered A - useful B - structured C - unstructured D - it depends

spreadsheets

Excel, Google Sheets, Numbers; useful for small amounts of data

what is ETL?

extract: pull data out of wherever it resides in the world; transform: convert data into a homogeneous format for joining, do stuff to the data, create new "features" (variables); load: put clean data into its final location
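
The three ETL steps can be sketched with the standard library alone. The CSV contents, column names, and cleaning rules below are invented for illustration and are not from the course material.

```python
import csv
import io
import sqlite3

# Hypothetical raw export: inconsistent name formatting and one missing amount.
RAW_CSV = "name,signup_date,amount\n jane ,2024-01-05,10.50\nSAM,2024-02-11,\n"


def extract(text):
    """Extract: pull the rows out of wherever they live (here, a CSV string)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop incomplete rows, homogenize formats, create a new feature."""
    clean = []
    for row in rows:
        if not row["amount"].strip():
            continue  # skip rows with a missing amount
        clean.append(
            {
                "name": row["name"].strip().title(),
                "signup_month": int(row["signup_date"][5:7]),  # new feature
                "amount": float(row["amount"]),
            }
        )
    return clean


def load(rows, conn):
    """Load: put the clean rows into their final location and return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (name TEXT, signup_month INTEGER, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO sales VALUES (:name, :signup_month, :amount)", rows
    )
    return conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
```

Usage: `load(transform(extract(RAW_CSV)), sqlite3.connect(":memory:"))` runs the whole pipeline against an in-memory database.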

the Big Data opportunity

extracting insight from an immense volume, variety, and velocity of data, in context, beyond what was previously possible; variety: manage the complexity of multiple relational and non-relational data types and schemas; velocity: streaming data and large volume data movement; volume: scale from terabytes to zettabytes

feature engineering

fifth and last step in acquire and explore data

implement, document, and maintain

fifth and last step in the machine learning life cycle

consider risks and success criteria

fifth step in defining project objectives

acquire and explore data (2)

find appropriate data, merge data into single table, conduct exploratory data analysis, find and remove any target leakage, feature engineering

using data to secure data

fingerprints, voice recognition (banking)

find appropriate data

first step in acquire and explore data

specify business problem

first step in defining project objectives

set up batch or API prediction system

first step in implement, document, and maintain

interpret model

first step in interpret and communicate

variable selection

first step in model data

define project objectives

first step in the machine learning life cycle

find and remove any target leakage

fourth step in acquire and explore data

prioritize modeling criteria

fourth step in defining project objectives

interpret and communicate

fourth step in the machine learning life cycle

image data

graphics, shapes, and pictures

B

heights are discrete data A - true B - false C - sometimes

overfitting features

high variance, low bias

supervised machine learning

machine learning that requires humans to provide input and desired output as well as feedback about prediction accuracy during the beginnings of the system, data scientist (DS) select what they want the machine to learn

artificial intelligence

machines that can perform tasks that are characteristic of human intelligence (the earliest AI was chess and training machines to play games)

what is data?

numbers, characters, images, etc; organized facts that need to be processed; everything quantifiable; meaningless without human or machine interpretation

alphanumeric data

numbers, letters, and other characters

examples of classification task

predicting tumor cells as benign or malignant; classifying credit card transactions as legitimate or fraudulent; classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil; categorizing news stories as finance, weather, entertainment, sports, etc

prioritize model criteria (4 in define project objectives)

predictive performance, familiarity, prediction speed, speed to build model, interpretability?

entities

real world object distinguishable from other objects, described using a set of attributes

auto ML - feature engineering

recognizes categorical vs. numerical despite initial appearances, recognizes reference ID, dealing with nulls, combining features (age into categories 20-30 etc), expanding features (dates days/months/years etc)
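
Two of these steps, combining a numeric feature into categories and expanding a date into parts, can be sketched as follows. The function names and the 10-year bucket width are assumptions; the buckets here are labeled 20-29, 30-39, and so on, so that they do not overlap.

```python
from datetime import date


def age_bucket(age, width=10):
    """Combine a numeric age into a categorical range label such as '20-29'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def expand_date(d):
    """Expand one date column into separate year, month, and day features."""
    return {"year": d.year, "month": d.month, "day": d.day}
```

For example, the hire date '3 March 1989' from the attribute instance card would expand into three numeric columns that a model can use separately.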

examine data

remove columns that are clearly not necessary (or duplicate); be careful: if in doubt, don't throw it out; check descriptive stats; verify column types match expectation

B

salesforce is an example of A - ERP B - CRM C - ABD D - a flat file E - B and D

communicate model insights

second and last step in interpret and communicate

merge data into single table

second step in acquire and explore data

acquire subject matter expertise

second step in defining project objectives

document modeling process for reproducibility

second step in implement, document, and maintain

build candidate models

second step in model data

acquire and explore data

second step in the machine learning life cycle

big data in business (sports)

sensor technology in sports equipment; tracking outside practice; predictive insights into fan preferences; coaching decisions/drafting insights

text data

sentences and paragraphs used in written communication

implement, document, and maintain (5)

set up batch or API prediction system, document modeling process for reproducibility, create model monitoring and maintenance plan

ER database structure

table with primary key, attributes, foreign key

narrow AI (aka weak AI)

technologies that are able to perform specific tasks as well as (or better-faster, more accurate, better endurance) than humans can, learns from thousands of examples, reflexive tasks with no understanding, knowledge does not transfer to other tasks, today's AI

define unit of analysis (3 in define project objectives)

the 'who' or 'what' we are studying; what are we trying to predict (prediction target)? the unit of analysis must be at this level

prediction target

the behavior of a "thing" (ex. person, stock, organization, etc.) in the past that we will need to predict in the future without one, there is no (easy) way for humans or machines to learn what associations drive an outcome

D

what is meant by "word birth"? A - the first time a word is searched for online B - the first existence of a word in writing C - when an older word takes on a new form, gives 'birth' to a subword D - the first time a baby/or toddler says a new word

C

what is the first stage in the machine learning life cycle? A - acquire and explore data B - acquire executive support C - define project objectives D - define project goal E - none of the above

A

what is the first step in define project objectives? A - specify business problem B - acquire subject matter expertise C - define prediction target D - define unit of analysis

A

what type of machine learning will we learn in this class? A - supervised B - unsupervised C - neither

D

when can we use excel for data exploration/manipulation? A - always B - never C - when the data set contains only #'s D - when our data set is small enough E - when our data set is large enough

C

which came first? A - deep learning B - machine learning C - artificial intelligence

A

which does the textbook identify as the most important criterion of autoML? A - accuracy B - process transparency C - precision D - understanding and learning E - generalization

C

which is NOT one of the eight criteria of autoML? A - accuracy B - process transparency C - precision D - understanding and learning E - generalization

C

which machine learning example was NOT mentioned in your textbook or in the podcast? A - predicting poverty through nighttime light intensity B - Google earth C - translating between English and Dolphin D - Chef Watson, IBM's recipe development tool E - self-driving cars

D

which of the following is correct? A - an attribute is defined using a set of entities B - entities and attributes are synonymous C - an entity is defined using a set of relationships D - an entity is defined using a set of attributes

D

which of the following is/are criteria that should be used to evaluate any proposed project? A - is the project statement presented in the language of business? B - does the project statement specify actions that should result from the project? C - how could solving this problem impact the bottom line? D - all of the above

success criteria (5 in define project objectives)

who uses the model? how much value can the model drive? what modeling criteria will help you get there?

C

you are trying to predict whether a customer will fill out a credit card application. using "total time on bank webpage" would be an example of A - carefully selecting features B - needing to define a common unit (seconds, minutes) C - target leakage D - a good data decision

text (strings)

you specify a number of characters, which is either exact or a maximum depending on the data type; varchar, variable length string, text varying, VSTR; you almost always have to convert or extract text to make something useful out of it

C

your textbook refers to binary coding as A - dummy variables B - if-then coding C - one-hot-encoding D - a bad idea

create model monitoring and maintenance plan

third and last step in implement, document, and maintain

model validation and selection

third and last step in model data

conduct exploratory data analysis

third step in acquire and explore data

define unit of analysis and prediction target

third step in defining project objectives

model data

third step in the machine learning life cycle

binary variables

1 if true, 0 if false (Boolean)
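
Binary (0/1) variables are also the building block of one-hot encoding, where a categorical column becomes one binary column per category. A minimal sketch with a hypothetical helper name and the color example from the class label card:

```python
def one_hot(values):
    """Turn a list of category labels into one 0/1 column per distinct category."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]
```

Each row ends up with exactly one column set to 1 and the rest set to 0.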

excel

CSV (comma-separated values)

difference between ML and deep learning

ML is being able to get the machine to learn, using examples deep learning is more specific, being able to learn on its own without being told anything

autoML tools

context-specific tools: Salesforce Einstein, GE Wise.io; general platforms: open source - Spearmint, Auto-sklearn, Hyperopt; Python or R based; commercial: Google Prediction API, Amazon ML; GUI

examples of ML algorithms

linear Regression, logistic Regression, decision trees, nearest neighbor, naïve Bayes, support vector machines (SVM), neural networks

acquire subject matter expertise (2 in define project objectives)

Why? indicates obstacles and opportunities, suggests data collection and modeling ideas, finds data quality problems and improvement opportunities, sets expectations for model performance, clarifies alternatives to building model How? talk to colleagues or subject matter experts, read (trade journals, Google, etc)

big data

any data that cannot reside in a hard disk or in a single system

what is a business problem?

anything a company would want to know in order to increase sales or reduce costs: customers likely to buy a product, customers likely to return a product, why customers do not purchase a product, why customers do purchase a product, why customers are dissatisfied, why customers do not renew their contracts, who will be a bad customer

Information Management

application of management techniques to collect information, communicate it within and outside the organization, and process it to enable managers to make quicker and better decisions.

AI

artificial intelligence; a collection of machine learning algorithms with a central unit deciding which of the ML algorithms need to kick in at what times

clean data

authentic: data that has been validated, a single source of truth, standardized data; reliable: easier for users to find the right records, users can focus on their job, fewer customer mistakes; complete: reduces user frustrations and objections, users know what to input, helps to close deals and close cases faster

A

autoML is automated machine learning A - true B - false

continuous

can take on any value ex: physical measurements, distance, weight, time, speed, financial measurements, sales, assets

types of data

categorical, numerical, discrete, and continuous

examples of risk scenarios

change in underlying patterns so the past is no longer predictive; business loses interest in the outcome being modeled; model insufficiently predictive; model too predictive (uses variables that can't work in the real world because you don't have access to them in a real modeling situation)

CSV

comma-separated values

big data in business

companies acquire data from within (internal data) as well as purchase data (external data); what data to look for comes from project objectives; customers can be directly asked for data or indirectly tracked: social media/website activity; how a customer has interacted with the company in the past; geographic location

D

compressing a salary column into ranges would be an example of A - target leakage B - target engineering C - feature leakage D - feature engineering

B

data and information are the same A - yes B - no

unstructured data

data does not exist in a fixed location and can include text documents, PDFs, voice messages, emails

attributes

data items that describe an entity ex: 'name' and 'hire date' are attributes of the entity 'employee'

structured data

data that (1) are typically numeric or categorical; (2) can be organized and formatted in a way that is easy for computers to read, organize, and understand; and (3) can be inserted into a database in a seamless fashion.

small data

data that can reside in RAM or memory

medium data

data that can reside on a hard disk

external data

data that is not collected by your organization (can be public or bought); ex: Twitter handles, Google data, online blogs, social site postings, likes, credit, etc.

internal data

data that is procured and consolidated from different branches within your organization; ex: purchase orders from sales teams, transactions from accounting, reorders from inventory, customer demographics, internet of things records, click-through data, and many more

flat file database

database which consists of just one table, stored as an unstructured file until it is read by a program, must upload it from a flat file ex: CSV, zip file

B

deep neural networks would be considered A - general AI B - narrow AI

the machine learning life cycle

define project objectives -> acquire and explore data -> model data -> interpret and communicate -> implement, document, and maintain

audio data

human voice and other sounds

opportunity

if it's not a problem it is a(n)...

why is cross validation important?

if the original validation partition is not representative of the overall population, then the resulting model may appear to have a high accuracy when in reality it just happens to fit the unusual validation set well

big data in business (retail)

improving customer experience/refining marketing strategy; location specific advertising (Internet connected devices IP address or the zip code you gave them); modifying goods and services to better suit current market places; consumer trends; stocking goods; customer churn

C

in five fold cross validation A - the whole dataset is split into 5 sections B - the data set less the validation set and the holdout is split into 5 folds C - the dataset less the holdout is split into five folds

A

in the podcast Facebook's facial recognition tool was used as an example of A - neural networks B - general AI

B

in the podcast the "email spam filter" was used to illustrate A - deep learning B - machine learning C - the need for human intelligence

process transparency

interacts with the earlier criterion of understanding and learning in that, without transparency, learning is limited; focuses on improving knowledge of the machine learning process, whereas the understanding and learning criterion focuses more on learning about the business context; should enable evaluation of models against each other beyond just their overall performance: which features are the most important in the neural network vs. the logistic regression, and in which cases each excels

interpret and communicate (4)

interpret model, communicate model insights

information

interpreted data; has context, meaning, is processed

quantitative data

interval and ratio

how to know whether to proceed with a business problem?

is solving this problem the best use of time and resources? do we have access to the data needed to solve the problem? internally or for purchase

true potential of autoML

it enables the democratization of data science, makes it available and understandable to most

productivity

a large part of the productivity improvements of AutoML will come from the same processes listed under accuracy; hunting for accuracy means perpetually living with a voice at the back of your head whispering that if only you could find a better algorithm or tune an existing algorithm, the results would improve; graceful handling of algorithm-specific needs; assuming good-quality data and existing subject matter expertise, the analyst should be able to conduct the analysis and be ready to present results to top management within two hours

general AI (aka strong AI)

machines that have all the senses (maybe more), all the reason, can have complex nuanced conversations that can pass for human, can solve new problems on the spot, can interpret accents it has never heard before, understand vocabulary through context and can create sentences it has never had to express before; DOES NOT EXIST TODAY, it's more specific now and used for specific tasks/goals

E

mathematical operations can be performed on which of the following data types? A - nominal and interval B - ordinal and ratio C - interval only D - ratio only E - interval and ratio

creation of initial file/table

modeling requires data to be in one table; access data from different sources and create one or more tables from each source; to join tables, look for an "identity field" or primary key; results should match the desired unit of analysis

accuracy

most important criterion; stems from the system selecting which features to use and creating new ones automatically, as well as comparing and selecting a variety of relevant models and tuning those models automatically; system should also automatically set up validation procedures, including cross validation and holdout, and should rank the candidate models by performance, blending the best models for potential improvements

recommended actions

mostly for context-specific AutoML; system should be able to transfer a probability into action; we must become the subject matter experts ourselves to take the findings from the platform and recommend actions

class label

multi class problem ex: red, green, yellow

D

multiplication/division can be performed on which of the following data types? A - nominal and interval B - ordinal and ratio C - interval only D - ratio only E - interval and ratio

A

narrow AI exists today? A - true B - false

qualitative data

nominal and ordinal

Hadoop

non-structured cloud based database

target leakage

occurs when you train your algorithm on a dataset that includes info that would not be available at the time of prediction, when you apply your model to data you collect in the future; also occurs if one of the predictors is very correlated with the response (but not always); "too good to be true" performance is a dead giveaway; very common because historical data is frequently used

reporting tools

often bundled with query tools; present query output in meaningful, understandable formats

a zettabyte

one billion laptop hard drives

Turing test

one method of determining the strength of artificial intelligence, in which a human tries to decide if the intelligence at the other end of a text chat is human

target

outcome/what you are looking for; response

over training

over fitting, poor generalization, model simply memorizes the training examples and is not able to give correct outputs for patterns that were not in the training dataset, we won't have the precision we are expecting

B

overtraining can be classified as A - low prediction B - poor generalization C - good generalization D - high prediction

k-fold cross validation

partition the data set (less the holdout set) into k equal subsets; each subset is called a fold; k training/testing experiments are conducted: first keep fold f1 as the validation set and keep all the remaining k-1 folds in the training set; train your machine learning model using the training set and calculate the accuracy of your model by validating the predicted results against the validation set; repeat with each fold in turn as the validation set; estimate the accuracy of your machine learning model by averaging the accuracies derived in all k cases of cross validation
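
The procedure can be sketched in plain Python. This is an illustrative sketch, not any particular library's implementation: the function names are hypothetical, the rows are assumed to be (features, label) pairs with the holdout set already removed, and the "model" is a simple majority-class baseline standing in for a real algorithm.

```python
def k_fold_indices(n_rows, k):
    """Partition row indices 0..n_rows-1 into k folds of (nearly) equal size."""
    sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(rows, k, train_and_score):
    """Run k experiments, each with a different fold as the validation set,
    and return the average of the k scores."""
    folds = k_fold_indices(len(rows), k)
    scores = []
    for i, validation_idx in enumerate(folds):
        # Training set: every row from the other k-1 folds.
        train = [rows[j] for f, fold in enumerate(folds) if f != i for j in fold]
        validation = [rows[j] for j in validation_idx]
        scores.append(train_and_score(train, validation))
    return sum(scores) / k


def majority_baseline(train, validation):
    """Stand-in 'model': always predict the most common training label,
    then report accuracy on the validation fold."""
    labels = [label for _, label in train]
    guess = max(set(labels), key=labels.count)
    return sum(1 for _, label in validation if label == guess) / len(validation)
```

In practice the rows would usually be shuffled before being split into folds; that step is left out here to keep the fold boundaries easy to read.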

examples finding and removing target leakage

predicting rainy days from a feature that records the days it rained; predicting whether someone passed away after leaving the hospital when one of your columns is where they were discharged: if hospice is included, this would be leakage

resource availability

should be compatible with existing business systems and easily integrated with other tools in the business intelligence ecosystem; should be able to connect to existing databases and file formats when ingesting data; should also allow easy use of the resulting model, either through an application programming interface (API) or through code that can be placed easily into the organizational workflow; should address memory issues, storage space, and processing capabilities in a flexible manner; support for the system should be available, either through online forums or customer service, allowing the analyst access to support from experienced machine learning experts

understanding and learning

should improve an analyst's understanding of the problem context; system should visualize the interactions between features and the target and should "explain" so that the analyst can stand her ground when presenting findings to management and difficult questions rain down from above; system should support thinking around business decisions; a good AutoML system will allow the analyst to uncover potentially useful findings, but only subject matter expertise will allow confident evaluation of those findings

generalizable across contexts

should work for all target types, data sizes, and different time perspectives; can predict targets that are numerical (regression problems), as well as targets that contain categorical values (classification problems); system should be capable of handling small, medium, and big data; should be able to handle both cross-sectional data (data collected at one time, or treated as such) and longitudinal data (data where time order matters)

decide whether to continue

sixth and last step in defining project objectives

underfitting features

small variance, high bias

define project objectives (1)

specify business problem, acquire subject matter expertise, define unit of analysis and prediction target, prioritize modeling criteria, consider risks and success criteria, decide whether to continue

where does data come from?

spreadsheets, CSV, databases, ERP, CRM

specify the business problem (1 in define project objectives)

state the problem in the language of business (not the language of modeling); specify actions that might result from this modeling project; include specifics (number of customers affected, costs, etc.); explain impact to the bottom line

ratio measurement

strongest measurement; meaningful zero; meaningful ratios (multiplication and division can be performed); ex: sales, distance, weight, time it takes to perform a task; measured, ordered, equidistant, has meaningful zero

training set

subsection of a dataset from which the machine learning algorithm uncovers or "learns" relationships between the features and the target variable

validation set

subsection of a dataset to which we apply the machine learning algorithm to see how accurately it identifies relationships between the known outcomes for the target variable and the dataset's other features
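
A minimal sketch of carving one dataset into training, validation, and holdout subsections. The 60/20/20 proportions, the fixed seed, and the function name are assumptions made for illustration only.

```python
import random


def split_dataset(rows, holdout_frac=0.2, validation_frac=0.2, seed=0):
    """Shuffle a copy of the rows, then slice off holdout, validation, and training sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_holdout = int(len(shuffled) * holdout_frac)
    n_validation = int(len(shuffled) * validation_frac)
    holdout = shuffled[:n_holdout]
    validation = shuffled[n_holdout:n_holdout + n_validation]
    training = shuffled[n_holdout + n_validation:]
    return training, validation, holdout
```

The holdout rows are set aside first and are touched only once, for the final performance estimate.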

types of targets

supervised classification: event/no event (binary target), 0/1, yes/no; class label: multi class problem, red, green, yellow; regression: continuous outcome, how much money is in someone's pocket

why split data?

the goal of machine learning is to build computational models with high prediction and generalization capabilities; at the end of the training process, the final model should predict correct outputs for the input samples, but it should also be able to generalize to unforeseen data; high prediction and generalization capabilities are conflicting goals; improper splitting can lead to excessively high variance in model performance

analysis gap

the large gap between data businesses collect and the information that decision makers require

raw data

the original data as it was collected; unprocessed

automated machine learning

the process of automating machine learning; makes ML possible without extensive math/stat/programming

Machine Learning

the science of getting computers to act and learn without being explicitly programmed

big data in business (data brokers)

turning data into cash flow

A

up to how much faster can Google predict the flu than the CDC? A - two weeks B - one year C - 5 days D - two months

validation data set

used to fine-tune a model, a lot less variance

feature engineering (5 in acquire and explore data)

using personal and business insight to change the features; ex: sorting employee positions into correct levels; cleaning data, combining features, handling missing values; DataRobot will do most of this for us

date and time

usually stored internally as an unsigned integer; leads to conversion nightmares because of how many formats there are

model data (3)

variable selection, build candidate models, model validation and selection

the three V's

volume: data quantity, size velocity: data speed, speed of change variety: data types, sources

nominal measurement

weakest level, a qualitative measurement; ex: did you change your state of residence in the last year? yes/no; is not measured, ordered, or equidistant; no meaningful zero

B

what is meant by "digital exhaust"? A - exhaust generated by electric cars B - the trail of data that each person leaves online C - when problems surpass the limitations of current computing systems

