BCOR 2205: Info Management Midterm
timeline of AI, ML, and deep learning
AI: 1950s; ML: 1980s; deep learning: 2010s
relationship between AI, ML, and deep learning
AI is the overarching field; ML is a subset of AI; deep learning is a subset of ML
structured database
ER databases, MS Access, SAP, CRM
C
ERP stands for A - engaged research and planning B - enterprise reasoned plan C - enterprise resource planning D - effective resource planning E - electronic resource plan
competitive advantage of big data
1/3 of business leaders frequently make decisions without trusting the information; 1/2 say they don't have access to the information they need to do their jobs; 83% see BI/analytics as part of their road map; 60% see a need to do a better job of capturing and understanding information rapidly in order to make swift business decisions
cloud based tools
Amazon Web Services, Microsoft Azure SQL Database, Cloud SQL by Google, Oracle Database as a service, Hadoop
database/data base management systems
a collection of records that stores both end-user data (raw facts) and metadata (data about data), designed as a means of collecting all the records; a database management system is a software system designed to allow the definition, update, retrieval, and administration of databases
relational database
a database that represents data as a collection of tables in which all data relationships are represented by common values in related tables
entity relationship model
a graphical approach to database design, graphically represents the logical relationship of entities (objects)
ordinal measurement
a measure for which the scores represent ordered categories that are not necessarily equidistant from each other; ex: poor, average, good, excellent; ordered, but not measured or equidistant; no meaningful zero
interval measurement
a rank with meaningful, equidistant intervals between scale points; ex: temperature, dates; zero is arbitrary, so ratios are meaningless (multiplication/division not possible), but addition and subtraction can be performed; measured, ordered, equidistant, but no meaningful zero
attribute instance
a single value of an attribute for an instance of an entity ex: 'Jane Hathaway' and '3 March 1989' are instances of the attributes name and hire date
deep learning
a subset of AI and ML; based on neural networks, inspired by the human brain; with ML we have to define the distinguishing features, whereas DL finds them on its own
machine learning
a subset of AI; a field of study that gives computers the ability to learn without being explicitly programmed; the practice of using algorithms to parse data, learn from it, and then make a determination or prediction about something in the world
holdout set
aka "testing data"; a subsection of a dataset used to provide a final estimate of the machine learning model's performance after it has been trained and validated; should never be used to make decisions about which algorithms to use or how to tune them
categorical data
aka qualitative, described by words rather than numbers ex: freshmen, sophomores, juniors, seniors
numerical data
aka quantitative, arise from counting, measuring, or some kind of mathematical operation ex: 20 of you visited the class website since last class
relationships
an association among entities
foreign key
an attribute borrowed from another related table in order to make the relationship between the two tables
primary key
an attribute that uniquely identifies a specific instance of an entity; every entity in the data model must have one whose values uniquely identify instances of the entity; to qualify, values must be non-null and unique
ease of use
analysts should find the system easy to use; the system should minimize the machine learning knowledge necessary to be immediately effective, especially if the analyst does not know what needs to be done to the data for a given algorithm to work optimally; operationalizing the model into production should be easy
random forest
builds many trees and averages the predictions from each one; many splits, without relying on just a few data points for the final prediction; significantly more accurate than linear regression or a single decision tree in most circumstances
regression
continuous outcome; used for predicting the value of a dependent variable based on the values of at least one independent variable; ex: how much money is in someone's pocket
transform data
converts observations from their original units of measurement to a standardized scale, using Excel, SQL, Alteryx, Python, or RStudio
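The standardization this card describes can be sketched in plain Python (a minimal z-score sketch; the salary figures are invented for illustration):

```python
# Minimal z-score standardization sketch: convert observations from their
# original units of measurement to a standardized scale (mean 0, std dev 1).
from statistics import mean, pstdev

def standardize(values):
    """Return z-scores for a list of numeric observations."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values]

salaries = [40_000, 50_000, 60_000]  # illustrative values
print(standardize(salaries))  # result is centered on 0
```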
discrete
countable (integers 1, 2, 3, 4, etc)
cloud database
created and maintained using cloud data services that provide defined performance measures for the database; runs on a cloud computing platform, a network of remote servers hosted on the Internet to store, manage, and process data; ex: Hadoop, MongoDB
C
customer: purchase amount, name the entity(ies) is/are ________ and the attribute(s) is/are ________ A - purchase amount, customer B - purchase amount, customer and name C - customer, purchase amount and name D - customer, name E - customer, purchase amount
features
different independent 'variables', columns of data, predictors
unsupervised machine learning
machine learning that does not need input for the algorithms and does not need to be trained, machine determines what to learn/what is important
supervised classification
event/no event (binary target) ex: 0/1, yes/no
spreadsheets
Excel, Google Sheets, Numbers; useful for small amounts of data
feature engineering
fifth and last step in acquire and explore data
implement, document, and maintain
fifth and last step in the machine learning life cycle
consider risks and success criteria
fifth step in defining project objectives
find appropriate data
first step in acquire and explore data
specify business problem
first step in defining project objectives
set up batch or API prediction system
first step in implement, document, and maintain
interpret model
first step in interpret and communicate
variable selection
first step in model data
define project objectives
first step in the machine learning life cycle
find and remove any target leakage
fourth step in acquire and explore data
prioritize modeling criteria
fourth step in defining project objectives
interpret and communicate
fourth step in the machine learning life cycle
B
heights are discrete data A - true B - false C - sometimes
overfitting features
high variance, low bias
artificial intelligence
machines that can perform tasks that are characteristic of human intelligence (the earliest AI involved chess and training machines how to play games)
examples of classification task
predicting tumor cells as benign or malignant; classifying credit card transactions as legitimate or fraudulent; classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil; categorizing news stories as finance, weather, entertainment, sports, etc.
entities
a real-world object distinguishable from other objects; described using a set of attributes
auto ML - feature engineering
recognizes categorical vs. numerical data despite initial appearances; recognizes reference IDs; deals with nulls; combines features (e.g., age into categories such as 20-30); expands features (e.g., dates into days/months/years)
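The "combining features" idea (binning age into categories) can be sketched in plain Python; the bin edges here are illustrative assumptions, not from the course material:

```python
# Feature engineering sketch: bin a numeric age into a categorical range.
# The bin edges are illustrative assumptions.
def age_bucket(age):
    if age < 20:
        return "under 20"
    if age < 30:
        return "20-29"
    if age < 40:
        return "30-39"
    return "40+"

ages = [18, 25, 34, 52]
print([age_bucket(a) for a in ages])
```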
examine data
remove columns that are clearly not necessary (or duplicated); be careful: if in doubt, don't throw it out; check descriptive stats; verify column types match expectations
B
salesforce is an example of A - ERP B - CRM C - ABD D - a flat file E - B and D
communicate model insights
second and last step in interpret and communicate
merge data into single table
second step in acquire and explore data
acquire subject matter expertise
second step in defining project objectives
document modeling process for reproducibility
second step in implement, document, and maintain
build candidate models
second step in model data
acquire and explore data
second step in the machine learning life cycle
ER database structure
table with primary key, attributes, foreign key
prediction target
the behavior of a "thing" (ex: person, stock, organization) in the past that we will need to predict in the future; without one, there is no (easy) way for humans or machines to learn what associations drive an outcome
D
what is meant by "word birth"? A - the first time a word is searched for online B - the first existence of a word in writing C - when an older word takes on a new form, gives 'birth' to a subword D - the first time a baby or toddler says a new word
C
what is the first stage in the machine learning life cycle? A - acquire and explore data B - acquire executive support C - define project objectives D - define project goal E - none of the above
A
what is the first step in define project objectives? A - specify business problem B - acquire subject matter expertise C - define prediction target D - define unit of analysis
A
what type of machine learning will we learn in this class? A - supervised B - unsupervised C - neither
D
when can we use excel for data exploration/manipulation? A - always B - never C - when the data set contains only #'s D - when our data set is small enough E - when our data set is large enough
C
which came first? A - deep learning B - machine learning C - artificial intelligence
A
which does the textbook identify as the most important criterion of autoML? A - accuracy B - process transparency C - precision D - understanding and learning E - generalization
D
which of the following is correct? A - an attribute is defined using a set of entities B - entities and attributes are synonymous C - an entity is defined using a set of relationships D - an entity is defined using a set of attributes
D
which of the following is/are criteria that should be used to evaluate any proposed project? A - is the project statement presented in the language of business? B - does the project statement specify actions that should result from the project? C - how could solving this problem impact the bottom line? D - all of the above
success criteria (5 in define project objectives)
who uses the model? how much value can the model drive? what modeling criteria will help you get there?
C
you are trying to predict whether a customer will fill out a credit card application. using "total time on bank webpage" would be an example of A - carefully selecting features B - needing to define a common unit (seconds, minutes) C - target leakage D - a good data decision
text (strings)
you specify a number of characters, which is either exact or a maximum depending on the data type (varchar, variable-length string, text varying, VSTR); you almost always have to convert or extract text to make something useful out of it
C
your textbook refers to binary coding as A - dummy variables B - if-then coding C - one-hot-encoding D - a bad idea
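One-hot encoding (the textbook's name for binary coding, per the card above) can be sketched in plain Python; the color values are invented for illustration:

```python
# One-hot encoding sketch: turn one categorical column into 0/1 indicator
# columns, one per category.
def one_hot(values):
    categories = sorted(set(values))
    return [{c: int(v == c) for c in categories} for v in values]

colors = ["red", "green", "red"]  # illustrative data
print(one_hot(colors))  # each row becomes a dict of 0/1 indicators
```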
create model monitoring and maintenance plan
third and last step in implement, document, and maintain
model validation and selection
third and last step in model data
conduct exploratory data analysis
third step in acquire and explore data
define unit of analysis and prediction target
third step in defining project objectives
model data
third step in the machine learning life cycle
binary variables
1 if true, 0 if false (Boolean)
excel
CSV (comma separated values)
continuous
can take on any value; ex: physical measurements (distance, weight, time, speed), financial measurements (sales, assets)
types of data
categorical, numerical, discrete, and continuous
D
compressing a salary column into ranges would be an example of A - target leakage B - target engineering C - feature leakage D - feature engineering
attributes
data items that describe an entity ex: 'name' and 'hire date' are attributes of the entity 'employee'
flat file database
a database which consists of just one table; stored as an unstructured file until it is read by a program; must be uploaded from a flat file; ex: CSV, zip file
why is cross validation important?
if the original validation partition is not representative of the overall population, then the resulting model may appear to have a high accuracy when in reality it just happens to fit the unusual validation set well
C
in five fold cross validation A - the whole dataset is split into 5 sections B - the data set less the validation set and the holdout is split into 5 folds C - the dataset less the holdout is split into five folds
process transparency
interacts with the earlier criterion of understanding and learning: without transparency, learning is limited; focuses on improving the analyst's knowledge of the machine learning process, whereas the understanding and learning criterion focuses more on learning about the business context; should enable evaluation of models against each other beyond just their overall performance (e.g., which features are most important in the neural network vs. the logistic regression, and in which cases each excels)
interpret and communicate (4)
interpret model, communicate model insights
information
interpreted data; has context, meaning, is processed
quantitative data
interval and ratio
how to know whether to proceed with a business problem?
is solving this problem the best use of time and resources? do we have access to the data needed to solve the problem (internally or for purchase)?
true potential of autoML
it enables the democratization of data science, makes it available and understandable to most
productivity
a large part of AutoML's productivity improvements will come from the same processes listed under accuracy; without automation, an analyst perpetually lives with a voice at the back of their head whispering that if only they could find a better algorithm, or tune an existing one, the results would improve; graceful handling of algorithm-specific needs; assuming good-quality data and existing subject matter expertise, the analyst should be able to conduct the analysis and be ready to present results to top management within two hours
general AI (aka strong AI)
machines that have all the senses (maybe more) and all the reason; can hold complex, nuanced conversations that pass for human, solve new problems on the spot, interpret accents they have never heard before, understand vocabulary through context, and create sentences they have never had to express before; DOES NOT EXIST TODAY; AI is more specific now and used for specific tasks/goals
E
mathematical operations can be performed on which of the following data types? A - nominal and interval B - ordinal and ratio C - interval only D - ratio only E - interval and ratio
creation of initial file/table
modeling requires data to be in one table; access data from different sources and create one or more tables from each source; to join tables, look for an "identity field" or primary key; the result should match the desired unit of analysis
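The join on an "identity field" described above can be sketched in plain Python (the tables and field names are invented for illustration):

```python
# Sketch of joining two source tables on a shared identity/primary key
# to build the single table that modeling requires.
customers = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bo"}]   # source 1
orders = [{"id": 1, "amount": 30}, {"id": 2, "amount": 45}]       # source 2

def inner_join(left, right, key):
    """Return rows from `left` merged with the matching row in `right`."""
    index = {row[key]: row for row in right}
    return [{**l, **index[l[key]]} for l in left if l[key] in index]

print(inner_join(customers, orders, "id"))
```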
accuracy
the most important criterion; stems from the system selecting which features to use and creating new ones automatically, as well as comparing, selecting, and tuning a variety of relevant models automatically; the system should also automatically set up validation procedures, including cross validation and holdout, and should rank the candidate models by performance, blending the best models for potential improvements
recommended actions
mostly for context-specific AutoML; the system should be able to turn a probability into an action; we must become the subject matter experts ourselves to take the findings from the platform and recommend actions
class label
multi class problem ex: red, green, yellow
D
multiplication/division can be performed on which of the following data types? A - nominal and interval B - ordinal and ratio C - interval only D - ratio only E - interval and ratio
A
narrow AI exists today? A - true B - false
qualitative data
nominal and ordinal
target leakage
occurs when you train your algorithm on a dataset that includes information that would not be available at the time of prediction when you apply your model to data collected in the future; also occurs if one of the predictors is very correlated with the response (but not always); "too good to be true" performance is a dead giveaway; very common because historical data is frequently used
over training
overfitting; poor generalization; the model simply memorizes the training examples and cannot give correct outputs for patterns that were not in the training dataset; we won't have the precision we expect
B
overtraining can be classified as A - low prediction B - poor generalization C - good generalization D - high prediction
k-fold cross validation
partition the dataset (less the holdout set) into k equal subsets; each subset is called a fold; k training/validation experiments are conducted: keep fold f1 as the validation set and keep all the remaining k-1 folds in the training set; train your machine learning model on the training set and calculate its accuracy by validating the predicted results against the validation set; estimate the accuracy of your machine learning model by averaging the accuracies derived in all k cases of cross validation
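The k-fold partitioning above can be sketched in plain Python (the model-training step is left abstract):

```python
# k-fold cross validation sketch: split the data (less the holdout) into
# k folds; each fold serves once as the validation set while the
# remaining k-1 folds form the training set.
def k_fold_splits(data, k):
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, validation

data = list(range(10))  # stand-in for real rows
for training, validation in k_fold_splits(data, 5):
    print(len(training), len(validation))  # 8 2 each time
```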
examples finding and removing target leakage
predicting rainy days using days it rained; did someone pass away after leaving the hospital? one of your columns is where they were discharged; if hospice was included, this would be leakage
resource availability
should be compatible with existing business systems and easily integrated with other tools in the business intelligence ecosystem; should be able to connect to existing databases and file formats when ingesting data; should also allow easy use of the resulting model, either through an application programming interface (API) or through code that can be placed easily into the organizational workflow; should address memory issues, storage space, and processing capabilities in a flexible manner; support for the system should be available, either through online forums or customer service, allowing the analyst access to experienced machine learning experts
understanding and learning
should improve the analyst's understanding of the problem context; the system should visualize the interactions between features and the target and should "explain" them, so that the analyst can stand her ground when presenting findings to management and difficult questions rain down from above; the system should support thinking around business decisions; a good AutoML system will allow the analyst to uncover potentially useful findings, but only subject matter expertise allows confident evaluation of such findings
generalizable across contexts
should work for all target types, data sizes, and time perspectives; can predict targets that are numerical (regression problems) as well as targets that contain categorical values (classification problems); should be capable of handling small, medium, and big data; should be able to handle both cross-sectional data (collected at one time, or treated as such) and longitudinal data (where time order matters)
decide whether to continue
sixth and last step in defining project objectives
underfitting features
small variance, high bias
where does data come from?
spreadsheets, CSV, databases, ERP, CRM
ratio measurement
the strongest measurement; meaningful zero; meaningful ratios (multiplication and division can be performed); ex: sales, distance, weight, time it takes to perform a task; measured, ordered, equidistant, has a meaningful zero
training set
subsection of a dataset from which the machine learning algorithm uncovers or "learns" relationships between the features and the target variable
validation set
subsection of a dataset to which we apply the machine learning algorithm to see how accurately it identifies relationships between the known outcomes for the target variable and the dataset's other features
types of targets
supervised classification: event/no event (binary target), 0/1, yes/no; class label: multi-class problem, red, green, yellow; regression: continuous outcome, how much money is in someone's pocket
why split data?
the goal of machine learning is to build computational models with high prediction and generalization capabilities; at the end of the training process, the final model should predict correct outputs for the input samples, but it should also be able to generalize to unforeseen data; high prediction and generalization capabilities are conflicting; improper splitting can lead to excessively high variance in model performance
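A concrete split can be sketched in plain Python; the 60/20/20 training/validation/holdout proportions here are an illustrative assumption, not from the notes:

```python
# Sketch: shuffle the data, then split it into training, validation,
# and holdout sets. The 60/20/20 proportions are an assumption.
import random

def split_data(data, seed=0):
    rows = data[:]
    random.Random(seed).shuffle(rows)  # fixed seed for reproducibility
    n = len(rows)
    train_end = int(n * 0.6)
    valid_end = int(n * 0.8)
    return rows[:train_end], rows[train_end:valid_end], rows[valid_end:]

train, valid, holdout = split_data(list(range(100)))
print(len(train), len(valid), len(holdout))  # 60 20 20
```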
analysis gap
the large gap between data businesses collect and the information that decision makers require
raw data
the original data as it was collected; unprocessed
automated machine learning
the process of automating machine learning; makes ML possible without extensive math/stats/programming
Machine Learning
the science of getting computers to act and learn without being explicitly programmed
big data in business (data brokers)
turning data into cash flow
A
up to how much faster can Google predict the flu than the CDC? A - two weeks B - one year C - 5 days D - two months
validation data set
used to fine-tune a model, a lot less variance
feature engineering (5 in acquire and explore data)
using personal and business insight to change the features; ex: sorting employee positions into correct levels; cleaning data, combining features, handling missing values; DataRobot will do most of this for us
date and time
usually stored internally as an unsigned integer; can lead to conversion nightmares because of how many formats there are
model data (3)
variable selection, build candidate models, model validation and selection
the three V's
volume: data quantity, size; velocity: data speed, speed of change; variety: data types, sources
nominal measurement
the weakest level; a qualitative measurement; ex: did you change your state of residence in the last year? yes/no; not measured, ordered, or equidistant; no meaningful zero
big data in business (sports)
sensor technology in sports equipment; tracking outside practice; predictive insights into fan preferences; coaching decisions/drafting insights
text data
sentences and paragraphs used in written communication
implement, document, and maintain (5)
set up batch or API prediction system, document modeling process for reproducibility, create model monitoring and maintenance plan
define project objectives (1)
specify business problem, acquire subject matter expertise, define unit of analysis and prediction target, prioritize modeling criteria, consider risks and success criteria, decide whether to continue
structured data
data that (1) are typically numeric or categorical; (2) can be organized and formatted in a way that is easy for computers to read, organize, and understand; and (3) can be inserted into a database in a seamless fashion.
small data
data that can reside in RAM or memory
medium data
data that can reside on a hard disk
external data
data that is not collected by your organization (can be public or bought); twitter handles, google data, online blogs, social site postings, likes, credit, etc.
acquire and explore data (2)
find appropriate data, merge data into single table, conduct exploratory data analysis, find and remove any target leakage, feature engineering
difference between ML and deep learning
ML is getting the machine to learn using examples; deep learning is more specific: the machine learns on its own without being told what matters
using data to secure data
fingerprints, voice recognition (banking)
acquire subject matter expertise (2 in define project objectives)
Why? indicates obstacles and opportunities; suggests data collection and modeling ideas; finds data quality problems and improvement opportunities; sets expectations for model performance; clarifies alternatives to building a model. How? talk to colleagues or subject matter experts; read (trade journals, Google, etc.)
data wharehouse
a logical collection of information - gathered from many different operational databases - that supports business analysis activities and decision-making tasks
B
according to your textbook, features are your A - responses B - predictors
A
according to your textbook, which is faster? A - training a subject matter expert in autoML B - training a data scientist to understand subject matter
the 8 automated machine learning criteria
accuracy, productivity, ease of use, understanding and learning, resource availability, process transparency, generalizability across contexts, recommended actions
big data
any data that cannot reside on a hard disk or in a single system
what is a business problem?
anything a company would want to know in order to increase sales or reduce costs: customers likely to buy a product, customers likely to return a product, why customers do not purchase a product, why customers do purchase a product, why customers are dissatisfied, why customers do not renew their contracts, who will be a bad customer
Information Management
application of management techniques to collect information, communicate it within and outside the organization, and process it to enable managers to make quicker and better decisions.
AI
artificial intelligence; a collection of machine learning algorithms with a central unit deciding which of the ML algorithms need to kick in at what times
clean data
authentic: data that has been validated, a single source of truth, standardized data; reliable: easier for users to find the right records, users can focus on their job, fewer customer mistakes; complete: reduces user frustrations and objections, users know what to input, helps to close deals and close cases faster
A
autoML is automated machine learning A - true B - false
examples of risk scenarios
change in underlying patterns so the past is no longer predictive; the business loses interest in the outcome being modeled; the model is insufficiently predictive; the model is too predictive (uses variables that can't work in the real world because you won't have access to them in a real modeling situation)
CSV
comma separated values
big data in business
companies acquire data from within (internal data) as well as purchase data (external data); what data to look for comes from the project objectives; customers can be directly asked for data or indirectly tracked: social media/website activity, how a customer has interacted with the company in the past, geographic location
autoML tools
context-specific tools: Salesforce Einstein, GE Wise.io; general platforms: open source (Spearmint, Auto-sklearn, Hyperopt; Python- or R-based); commercial: Google Prediction API, Amazon ML; GUI
CRM
customer relationship management
B
data and information are the same A - yes B - no
unstructured data
data that does not exist in a fixed location or format and can include text documents, PDFs, voice messages, emails
internal data
data that is procured and consolidated from different branches within your organization; purchase orders from sales teams, transactions from accounting, reorders from inventory, customer demographics, Internet of Things records, click-through data, and many more
B
deep neural networks would be considered A - general AI B - narrow AI
the machine learning life cycle
define project objectives -> acquire and explore data -> model data -> interpret and communicate -> implement, document, and maintain
examples of ML
driverless cars (Tesla, Volvo), medicine (find cancer), detect fraud (PayPal, Amazon, Home Depot), MT Everest Weather, Missing Children based on Facial Recognition, Alexa/Cortana/Siri, Facebook ads, Zillow home prices
dirty data
duplicate, incorrect, and/or missing data; the biggest problem today; you must examine data: not knowing your data is unclean leads to bad decisions
C
ease of use refers to: A - a system being easy to use for the analyst B - a model that is easy to put into operation/production C - both A and B
querying tools
easy-to-use software allowing users to get specific information from a database; released the power of the data warehouse to the masses
ERP
enterprise resource planning
big data in business (health)
epidemic/disease outbreak prediction; individual disease predictions; increased access to databases on symptoms and new treatments; real time monitoring of patients
decide whether to continue (6 in define project objectives)
estimate resources required; understand alternatives to creating the model; consider technical risks; estimate the model's business value
model diagnosis
evaluation/ranking of the models; models can be run simultaneously; combinations of models can be run (Blenders)
what is 'Big Data'?
everything we do leaves a trace (data), which is analyzable; "the Internet of Things"
B
excel data would be considered A - useful B - structured C - unstructured D - it depends
what is ETL?
extract: pull data out of wherever it resides; transform: convert data into a homogeneous format for joining, do stuff to the data, create new "features" (variables); load: put clean data into its final location
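A toy ETL pass matching the three steps can be sketched in Python (the rows and field names are invented for illustration):

```python
# Toy ETL sketch: extract raw rows, transform units into a homogeneous
# format and derive a new feature, then load into a final structure.
raw = [{"price": "19.99", "qty": "2"}, {"price": "5.00", "qty": "3"}]  # extract

def transform(row):
    price, qty = float(row["price"]), int(row["qty"])       # homogenize types
    return {"price": price, "qty": qty, "revenue": price * qty}  # new feature

warehouse = [transform(r) for r in raw]  # load into the final location
print(warehouse)
```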
the Big Data opportunity
extracting insight from an immense volume, variety, and velocity of data, in context, beyond what was previously possible; variety: manage the complexity of multiple relational and non-relational data types and schemas; velocity: streaming data and large-volume data movement; volume: scale from terabytes to zettabytes
image data
graphics, shapes, and pictures
audio data
human voice and other sounds
opportunity
if it's not a problem it is a(n)...
big data in business (retail)
improving customer experience/refining marketing strategy; location specific advertising (Internet connected devices IP address or the zip code you gave them); modifying goods and services to better suit current market places; consumer trends; stocking goods; customer churn
A
in the podcast Facebook's facial recognition tool was used as an example of A - neural networks B - general AI
B
in the podcast the "email spam filter" was used to illustrate A - deep learning B - machine learning C - the need for human intelligence
examples of ML algorithms
linear regression, logistic regression, decision trees, nearest neighbor, naïve Bayes, support vector machines (SVM), neural networks
supervised machine learning
machine learning that requires humans to provide input and desired output, as well as feedback about prediction accuracy, during the early stages of the system; data scientists (DS) select what they want the machine to learn
Hadoop
non-structured cloud based database
what is data?
numbers, characters, images, etc; organized facts that need to be processed; everything quantifiable; meaningless without human or machine interpretation
alphanumeric data
numbers, letters, and other characters
reporting tools
often bundled with query tools; present query output in meaningful, understandable formats
a zettabyte
one billion laptop hard drives
Turing test
one method of determining the strength of artificial intelligence, in which a human tries to decide if the intelligence at the other end of a text chat is human
target
outcome/what you are looking for; response
prioritize modeling criteria (4 in define project objectives)
predictive performance, familiarity, prediction speed, speed to build model, interpretability
specify the business problem (1 in define project objectives)
state the problem in the language of business (not the language of modeling); specify what actions might result from this modeling project; include specifics (number of customers affected, costs, etc.); explain the impact on the bottom line
narrow AI (aka weak AI)
technologies that are able to perform specific tasks as well as (or better than: faster, more accurate, better endurance) humans can; learns from thousands of examples; reflexive tasks with no understanding; knowledge does not transfer to other tasks; today's AI
define unit of analysis (3 in define project objectives)
the "who" or "what" we are studying; what are we trying to predict (the prediction target)? the unit of analysis must be at this level
B
what is meant by "digital exhaust"? A - exhaust generated by electric cars B - the trail of data that each of us leaves online C - when problems surpass the limitations of current computing systems
C
which is NOT one of the eight criteria of autoML? A - accuracy B - process transparency C - precision D - understanding and learning E - generalization
C
which machine learning example was NOT mentioned in your textbook or in the podcast? A - predicting poverty through nighttime light intensity B - Google earth C - translating between English and Dolphin D - Chef Watson, IBM's recipe development tool E - self-driving cars