Info MGMT Midterm 1
Primary Key Qualifications
-Non-null value -Must be unique
Model Criteria
-Predictive performance -Familiarity -Prediction speed -Speed to build model - Interpretability?
One-Hot Encoding
Binary coding of a categorical feature: each category gets its own column of 0/1's (True/False, Yes/No)
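A minimal sketch of one-hot encoding in plain Python. The function name `one_hot` and the sample color values are assumptions made for illustration; they are not from the course material.

```python
def one_hot(values):
    """Return (categories, rows); each row has a 1 in its own category's column."""
    categories = sorted(set(values))  # one output column per unique category
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Hypothetical categorical feature.
colors = ["Red", "Green", "Yellow", "Green"]
categories, encoded = one_hot(colors)
```

Each encoded row contains exactly one 1, which marks the category of the original value.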
Max # of cells for excel
200,000
Database
A database is an organized collection of data. A relational database, more restrictively, is a collection of schemas, tables, queries, reports, views, and other elements
Entity Relationship Model
A graphical approach to database design. •Graphically represents the logical relationship of entities (objects)
Machine Learning
A subset of AI •"A field of study that gives computers the ability to learn without being explicitly programmed" -McClendon & Meghanathan •The practice of using algorithms to parse data, learn from it, and then make a determination or prediction about something in the world •Examples: Driverless Cars, Detecting Fraud, Finding Cancer
Deep Learning
A subset of AI and ML •Based on neural networks •Inspired by the human brain •With ML we have to define the separating features ourselves; DL finds these on its own
APIs (Application Programming Interfaces)
APIs are communication protocols that allow you to connect to an external provider of data and, by authenticating with a provided username and password, upload parameters for the kinds of data you want to download
Creation of Initial File/Table
Access data from different sources and create one or more tables from each source
Most Important Auto ML Criterion
Accuracy
Automated Machine Learning Criteria (8)
Accuracy •Productivity •Ease of use •Understanding and learning •Resource availability •Process transparency •Generalizability across contexts •Recommended actions
Large Data
Any data that cannot reside on a Hard Disk or in a single system
Automated Machine Learning
Any machine learning system that automates the repetitive tasks required for effective ML
Business Problem
Anything a company would want to know in order to increase sales or reduce costs •State the problem in the language of business (not the language of modeling) •Specify the actions that might result from this modeling project •Include specifics (number of customers affected, costs, etc.) •Explain the impact to the bottom line
Continuous vs. Discrete
Continuous is a decimal [ Length in minutes of a country song] Discrete is a whole number [Points scored by a soccer player in a season]
Data Splitting
Data accessed and manipulated by multiple threads must be divided to run on separate cores.
Structured Data
Data that can be stored in a table or database
Small Data
Data that can reside in RAM or memory
Medium Data
Data that can reside on a Hard Disk
Unstructured Data
Data that does not reside in fixed locations
External Data
Data that is not collected by your organization (can be public or bought) •Twitter handles (from name, address, or SSN), Google data, Online Blogs, Social Site postings, Likes, Credit
Internal Data
Data that is procured and consolidated from different branches within your organization •Purchase orders from sales teams, Transactions from accounting, Re-orders from inventory, Customer demographics, Internet of Things records, Click-through data, and many more
Data Warehouses
Data warehouses are collections of data from different information systems. Transforming the data into a consistent format is a key part of data warehouse formation
Machine Learning Life Model
1. Define project objectives 2. Acquire/explore data 3. Model data 4. Interpret/communicate 5. Implement/document/maintain
Features
Different independent 'variables', columns of data, predictors
Trajectory Mining
Direction something is heading
Dirty Data
Duplicate, Incorrect, Missing
Model Diagnostics
Evaluation of top models and different probability cutoffs
Model Diagnosis
Evaluation/ Ranking of the top models Which Model predicts the best results
Data Sources
Excel, Comma Separated Values (CSV) •Structured Databases •Cloud Based Data •ERP (Enterprise Resource Planning) •Hadoop (Non-structured cloud based DB) •Salesforce CRM (Customer Relationship Management)
ETL Summary
Extract: Pull data out of wherever it resides in the world •Transform: Convert data into a homogeneous format for joining; manipulate the data; create new "features" (variables) •Load: Put clean data into its final location
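A minimal ETL sketch using only the Python standard library. The sample sources (`SALES_CSV`, `CRM_ROWS`), the column names, and the new `total_spend` feature are all hypothetical, and a plain list stands in for the final location.

```python
import csv
import io

# Hypothetical raw inputs standing in for two different source systems.
SALES_CSV = "customer_id,amount\n1, 20.50\n2,5\n1,4.50\n"
CRM_ROWS = [{"CustomerID": "1", "Name": "ann"}, {"CustomerID": "2", "Name": "BOB"}]


def extract(csv_text):
    """Extract: pull rows out of wherever they reside (here, CSV text)."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(sales_rows, crm_rows):
    """Transform: convert to one consistent format and create a new feature."""
    names = {int(r["CustomerID"]): r["Name"].title() for r in crm_rows}
    totals = {}
    for row in sales_rows:
        cid = int(row["customer_id"])
        totals[cid] = totals.get(cid, 0.0) + float(row["amount"])
    # New feature (assumed for illustration): total_spend per customer.
    return [
        {"customer_id": cid, "name": names[cid], "total_spend": total}
        for cid, total in sorted(totals.items())
    ]


def load(rows, destination):
    """Load: put the clean data into its final location (here, a list)."""
    destination.extend(rows)
    return destination


warehouse = load(transform(extract(SALES_CSV), CRM_ROWS), [])
```

In practice the destination would more likely be a database or data warehouse table; the three-function split only mirrors the Extract/Transform/Load steps.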
Big Data Opportunity
Extracting insight from an immense volume, variety and velocity of data, in context, beyond what was previously possible
General Platforms: Commercial
For a price
Union
For example, if a company has multiple customers, and we store their records in various databases, a union will create one table containing all customers for that company.
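A minimal sketch of a union over tables represented as lists of dictionaries. The `union` helper and the sample customer tables are assumptions for illustration. It stacks rows without removing duplicates, which corresponds to SQL's `UNION ALL` rather than `UNION`.

```python
def union(*tables):
    """Stack tables that share the same columns into one table."""
    columns = set(tables[0][0].keys())  # columns of the first row define the schema
    combined = []
    for table in tables:
        for row in table:
            if set(row.keys()) != columns:
                raise ValueError("union requires identical columns in every table")
            combined.append(row)
    return combined


# Hypothetical customer tables stored in two different databases.
east = [{"CustomerID": 1, "Name": "Ann"}]
west = [{"CustomerID": 2, "Name": "Bob"}, {"CustomerID": 3, "Name": "Cy"}]
all_customers = union(east, west)
```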
Image Data
Graphic images and pictures
Audio
Human voice and other sounds
Context Specific Tools
Implemented with another system or for a specific purpose
Types of Machine Learning
Linear Regression •Logistic Regression •Decision Trees •Nearest Neighbor •Naive Bayes •Support Vector Machines (SVM) •Neural Networks
artificial intelligence
Machines that can perform tasks that are characteristic of human intelligence
Zettabyte
One billion laptop hard drives
Turing Test
One method of determining the strength of artificial intelligence, in which a human tries to decide if the intelligence at the other end of a text chat is human.
Target
Outcome/What you are looking for, response
Over training can be classified as
Poor Generalization
Over fitting
Poor generalization can be classified as over training •The model simply memorizes the training examples and is not able to give correct outputs for patterns that were not in the training dataset
Reporting Tools
Present query output in meaningful, understandable formats
3 criteria used to judge projects.
1. Is the project statement presented in the language of business? 2. Does the project statement have specific actions that should result from the project? 3. How could solving the problem impact the bottom line?
Two Types of Targets in Machine Learning
Regression and Classification
Complete Match Removal
Removal of full rows based on identical content in all columns
Partial Match Removal
Removal of full rows based on the identical content of a few columns
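A minimal sketch covering both complete and partial match removal. The helper `remove_matches`, the sample rows, and the choice to keep the first occurrence of each match are assumptions for illustration.

```python
def remove_matches(rows, key_columns=None):
    """Drop rows that repeat an earlier row.

    key_columns=None compares all columns (complete match removal);
    a list of column names compares only those columns (partial match removal).
    """
    seen = set()
    kept = []
    for row in rows:
        columns = key_columns if key_columns is not None else sorted(row)
        key = tuple(row[c] for c in columns)
        if key not in seen:  # keep only the first row with this key
            seen.add(key)
            kept.append(row)
    return kept


# Hypothetical data: row 2 fully duplicates row 1; row 3 shares only the email.
rows = [
    {"CustomerID": 1, "Email": "a@x.com", "City": "Reno"},
    {"CustomerID": 1, "Email": "a@x.com", "City": "Reno"},
    {"CustomerID": 2, "Email": "a@x.com", "City": "Elko"},
]
complete = remove_matches(rows)
partial = remove_matches(rows, ["Email"])
```

Partial match removal discards rows that still differ in other columns, so which row to keep is a judgment call that this sketch does not address.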
Time Series Motifs
Repeated segments in long time series data
Attribute Instance
Represents a single cell within a column
algorithm selection
Selecting the algorithm
Model
Set of weighted relationships between the features and the target +Strength
Validation Set
Subsection of a dataset to which we apply the machine learning algorithm to see how accurately it identifies relationships between the known outcomes for the target variable and the dataset's other features
Summarize
Summarize is a feature available in most programming tools and is also known under the names group and group by. When summarizing, one or more columns by which to group data is selected, essentially creating a virtual "bucket" for each unique group. For example, if you are interested in employees, you would summarize by EmployeeID, and in the given case, the nine unique employees in the dataset would each be assigned a bucket. Every row belonging to that employee would then be placed inside the bucket. Once all the relevant rows for that employee are in the bucket, the data within are available to be summarized.
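A minimal sketch of the "bucket" idea in plain Python. The `summarize` helper, the `Sales` column, and the sample rows are hypothetical; only `EmployeeID` comes from the description above.

```python
from collections import defaultdict


def summarize(rows, group_by, value, aggregate=sum):
    """Place each row's value in the bucket for its group, then aggregate each bucket."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[group_by]].append(row[value])
    return {group: aggregate(values) for group, values in buckets.items()}


# Hypothetical order rows.
orders = [
    {"EmployeeID": 1, "Sales": 100},
    {"EmployeeID": 2, "Sales": 50},
    {"EmployeeID": 1, "Sales": 25},
]
total_by_employee = summarize(orders, "EmployeeID", "Sales")
count_by_employee = summarize(orders, "EmployeeID", "Sales", aggregate=len)
```

Passing a different `aggregate` function (such as `len`, `max`, or `min`) changes how each bucket is summarized without changing how rows are grouped.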
Types of Targets
Supervised •Classification: Event/no event (binary target): 0/1, yes/no; Class label (multiclass problem): Red, Green, Yellow •Regression: Continuous outcome, e.g., how much money is in someone's pocket
Narrow AI
Technologies that are able to perform specific tasks as well as, or better than, we humans can
Project
Temporary group activity designed to produce a unique product, service, or result
Word Birth
The first time a baby/toddler says a new word
Feature engineering
The process of cleaning data, combining features, splitting features, and handling missing values •Using personal and business insight to change the features •E.g., less work is done on the Sunday after the Super Bowl
data exhaust
The trail of data generated as a byproduct of people's online actions and choices.
Digital exhaust is ...
The trail that each of us leaves online
Filtering
Tool for splitting up a set of data into two separate tables based on characteristics of that data
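A minimal sketch of filtering as a two-way split. The `filter_rows` helper, the sample orders, and the threshold of 100 are assumptions for illustration.

```python
def filter_rows(rows, condition):
    """Split rows into (rows that meet the condition, rows that do not)."""
    matched, unmatched = [], []
    for row in rows:
        (matched if condition(row) else unmatched).append(row)
    return matched, unmatched


# Hypothetical orders, split on an assumed threshold of 100.
orders = [
    {"OrderID": 1, "Amount": 250},
    {"OrderID": 2, "Amount": 40},
    {"OrderID": 3, "Amount": 100},
]
large_orders, small_orders = filter_rows(orders, lambda row: row["Amount"] >= 100)
```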
k-fold cross validation
Split the data into k folds. In each of k rounds, train your machine learning model using the training set and calculate the accuracy of your model by validating the predicted results against the validation set (the fold held out in that round). Estimate the accuracy of your machine learning model by averaging the accuracies derived in all k cases of cross validation
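A minimal sketch of k-fold cross validation in plain Python. To keep it self-contained, the "model" is a stand-in that always predicts the majority class of the training labels. The function names and the sample labels are assumptions for illustration.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_indices, validation_indices) once for each of the k folds."""
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, validation


def majority_class(labels):
    """Stand-in 'model': always predict the most common training label."""
    return max(set(labels), key=labels.count)


def cross_validate(labels, k):
    """Return (mean accuracy, per-fold accuracies) for the stand-in model."""
    accuracies = []
    for train, validation in k_fold_indices(len(labels), k):
        prediction = majority_class([labels[j] for j in train])
        correct = sum(1 for j in validation if labels[j] == prediction)
        accuracies.append(correct / len(validation))
    return sum(accuracies) / k, accuracies


# Hypothetical binary target values.
labels = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
mean_accuracy, fold_accuracies = cross_validate(labels, k=5)
```

Every row is used for validation exactly once and for training in the other k-1 rounds, which is why the averaged accuracy is a steadier estimate than a single split.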
Data
Unorganized facts that need to be processed
Regression
Used for predicting the values of a dependent variable based on values of at least one independent variable
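A minimal sketch of regression with one independent variable, fitted by ordinary least squares. The notes do not name a fitting method, so least squares is an assumption here, as are the `fit_line` helper and the sample data.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data that lies exactly on the line y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
predicted = slope * 5 + intercept  # prediction for a new value x = 5
```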
The three V's
Variety: Manage the complexity of multiple relational and non-relational data types and schemas •Velocity: Streaming data and large volume data movement •Volume: Scale from terabytes to zettabytes
Target Leak
Occurs when you train your algorithm on a dataset that includes information that would not be available at the time of prediction when you apply your model to the data you collect in the future •Arises when a dataset is collected over an extended period, so that it contains data that would be unrealistic to have at the time of prediction •Can also occur if one of the predictors is very highly correlated with the response (not always) •"Too good to be true" performance is a dead giveaway •Very common because historical data is frequently used
Model Success Criteria
Who uses the model? How much value can the model drive? What modeling criteria will help get you there? 1. Who will use the model? 2. Is management on board with the project? 3. Can the model drivers be visualized? 4. How much value can the model produce
Better to train a Subject matter expert on Auto ML than a data scientist
Yes
Text String
You specify a number of characters •This is either exact or a maximum depending on the data type
Holdout Set
(aka "testing" data) a subsection of a dataset used to provide a final estimate of the machine learning model's performance after it has been trained and validated. Holdout sets should never be used to make decisions about which algorithms to use or for improving or tuning algorithms.
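A minimal sketch of splitting a dataset into training, validation, and holdout sets. The 60/20/20 proportions, the fixed seed, and the `split_dataset` helper are assumptions for illustration, not values from the course.

```python
import random


def split_dataset(rows, valid_fraction=0.2, holdout_fraction=0.2, seed=0):
    """Shuffle the rows, then cut them into training, validation, and holdout sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_holdout = int(len(shuffled) * holdout_fraction)
    n_valid = int(len(shuffled) * valid_fraction)
    holdout = shuffled[:n_holdout]  # set aside; only for the final estimate
    validation = shuffled[n_holdout:n_holdout + n_valid]
    training = shuffled[n_holdout + n_valid:]
    return training, validation, holdout


training, validation, holdout = split_dataset(range(10))
```

The holdout rows are separated before any modeling so they cannot influence algorithm choice or tuning.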
Primary Key
an attribute that uniquely identifies a specific instance of an entity. Every entity in the data model must have a primary key whose values uniquely identify instances of the entity.
Foreign Key
attribute borrowed from another related table in order to make the relationship between the two tables.
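A minimal sketch of primary and foreign keys using Python's built-in `sqlite3` module with an in-memory database. The `customer` and `purchase` tables and their columns are hypothetical. SQLite enforces foreign keys only when `PRAGMA foreign_keys = ON` is set for the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
conn.execute(
    "CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)
conn.execute(
    "CREATE TABLE purchase ("
    " purchase_id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customer(customer_id),"  # foreign key
    " product TEXT)"
)
conn.execute("INSERT INTO customer VALUES (1, 'Ann')")
conn.execute("INSERT INTO purchase VALUES (10, 1, 'Lamp')")


def try_insert(sql):
    """Return True if the insert succeeds, False if a key constraint rejects it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False


# A second customer with primary key 1 violates uniqueness.
duplicate_key_ok = try_insert("INSERT INTO customer VALUES (1, 'Bob')")
# A purchase pointing at a customer that does not exist violates the foreign key.
orphan_ok = try_insert("INSERT INTO purchase VALUES (11, 99, 'Desk')")
```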
Join
combines two datasets (or tables) with a shared identity value, such as a customer identifier. For example, you might "join" the row containing your customer record (CustomerID, name, address, etc.; A in Figure 9.1) with your login information on a website (CustomerID, visit time, products purchased, etc.; B in Figure 9.1)
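A minimal sketch of a join on a shared CustomerID, using lists of dictionaries as tables. The notes do not say which kind of join is meant, so this assumes an inner join (customers with no matching visits are dropped). The `inner_join` helper and the sample rows are hypothetical.

```python
def inner_join(left, right, key):
    """Combine rows from two tables wherever their values for `key` match."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)  # look up right rows by key
    joined = []
    for left_row in left:
        for right_row in index.get(left_row[key], []):
            joined.append({**left_row, **right_row})
    return joined


# Hypothetical customer records (A) and website visits (B).
customers = [{"CustomerID": 1, "Name": "Ann"}, {"CustomerID": 2, "Name": "Bob"}]
visits = [
    {"CustomerID": 1, "VisitTime": "09:00", "Product": "Lamp"},
    {"CustomerID": 1, "VisitTime": "17:30", "Product": "Desk"},
]
joined = inner_join(customers, visits, "CustomerID")
```

One customer row can match several visit rows, so the joined table may have more rows than the customer table.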
data brokers
companies that collect and sell personal information about consumers
Attributes
data items that describe an entity.
Nominal Data
data of categories only. Data cannot be arranged in an ordering scheme. (Gender, Race, Religion)
General Platforms: Open Source
designed by/for computer scientists
Querying Tools
easy-to-use software allowing users to get specific information from a database
Exploratory data analysis
examining descriptive statistics for all features as well as their relationship with the target
Information
knowledge communicated or received concerning a particular fact or circumstance, created from data
Unsupervised Machine Learning
machine learning that does not need labeled outputs (a target) for the algorithms; the algorithm finds structure in the data without being shown the desired answers
Supervised Machine Learning
machine learning that requires humans to provide input and desired output as well as feedback about prediction accuracy during the beginnings of the system
General AI
machines that have all our senses (maybe even more) •Can have complex, nuanced conversations that can pass for human •Can solve new problems on the spot •Can interpret accents they have never heard before •Understand vocabulary through context and can create sentences they have never had to express before
alphanumeric
numbers and letters
Entities
real world object distinguishable from other objects. An entity is described using a set of attributes.
Crosstab
converts data from row form (skinny form, the typical form for data to take) to column form
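A minimal sketch of a crosstab that turns skinny (one value per row) data into column form. The `crosstab` helper, the column names, and the sample rows are assumptions for illustration. If two rows share the same index and column, the later value overwrites the earlier one.

```python
def crosstab(rows, index, column, value):
    """Pivot skinny rows into {index value: {column value: value}}."""
    table = {}
    for row in rows:
        table.setdefault(row[index], {})[row[column]] = row[value]
    return table


# Hypothetical skinny data: one row per employee per quarter.
long_rows = [
    {"Employee": "Ann", "Quarter": "Q1", "Sales": 10},
    {"Employee": "Ann", "Quarter": "Q2", "Sales": 12},
    {"Employee": "Bob", "Quarter": "Q1", "Sales": 7},
]
wide = crosstab(long_rows, "Employee", "Quarter", "Sales")
```

Each employee now has one entry with a separate column per quarter; combinations missing from the input (Bob in Q2) are simply absent.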
Text Data
sentences and paragraphs used in written communication
Training Set
subsection of a dataset from which the machine learning algorithm uncovers or "learns" relationships between the features and the target variable.
Five Fold Validation
the data set less the validation set and the holdout is split into five folds
Analysis Gap
the large gap between data businesses collect and the information that decision makers require
Triple Constraint
time, cost, scope
Regression (Target of Machine Learning)
to predict the target's numeric value. With this second kind of target, we might, for example, target how many years of blissful union a couple has ahead of them.
Classification (Target of Machine Learning)
which predicts the category to which a new case belongs. For example, we might build a model of divorce during the first ten years of marriage (our target is divorced: TRUE or FALSE). When working through a dataset for classification, we carefully examine the possible states in which a variable can exist and consider options for simplification.
unit of analysis
who or what is being studied
Churn
will this customer ever stop being a customer