Info MGMT Midterm 1

Primary Key Qualifications

-Non-null value -Must be unique

Model Criteria

-Predictive performance -Familiarity -Prediction speed -Speed to build model - Interpretability?

One-Hot Encoding

0/1's (True/False, Yes/No) Binary Coding
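A minimal sketch of one-hot encoding in plain Python, assuming a single categorical column. The function name `one_hot` and the sample colors are illustrative, not from the course material.

```python
def one_hot(values):
    """Return (categories, rows): one 0/1 column per distinct category."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


colors = ["Red", "Green", "Yellow", "Green"]
categories, encoded = one_hot(colors)

# Each encoded row contains exactly one 1, in the column of its category.
for color, row in zip(colors, encoded):
    print(color, dict(zip(categories, row)))
```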

Max # of cells for excel

200,000

Database

A database is an organized collection of data. A relational database, more restrictively, is a collection of schemas, tables, queries, reports, views, and other elements

Entity Relationship Model

A graphical approach to database design. •Graphically represents the logical relationship of entities (objects)

Machine Learning

A subset of AI •"A field of study that gives computers the ability to learn without being explicitly programmed"-McClendon & Meghanathan •The practice of using algorithms to parse data, learn from it, and then make a determination or prediction about something in the world •Examples: driverless cars, detecting fraud, finding cancer

Deep Learning

A subset of AI and ML •Based on neural networks •Inspired by the human brain •With ML we have to define the features that separate the classes; DL finds these on its own

APIs (Application Programming Interfaces)

APIs are communication protocols that allow you to connect to an external provider of data and, by authenticating with a provided username and password, upload parameters for the kinds of data you want to download

Creation of Initial File/Table

Access data from different sources and create one or more tables from each source

Most Important AutoML Criterion

Accuracy

Automated Machine Learning Criteria (8)

Accuracy •Productivity •Ease of use •Understanding and learning •Resource availability •Process transparency •Generalizability across contexts •Recommended actions

Large Data

Any data that cannot reside in a Hard Disk or in a single system

Automated Machine Learning

Any machine learning system that automates the repetitive tasks required for effective ML

Business Problem

Anything a company would want to know in order to increase sales or reduce costs •State the problem in the language of business (not the language of modeling) •Specify actions that might result from this modeling project •Include specifics (number of customers affected, costs, etc.) •Explain the impact on the bottom line

Continuous vs. Discrete

Continuous is a decimal [ Length in minutes of a country song] Discrete is a whole number [Points scored by a soccer player in a season]

Data Splitting

Data accessed and manipulated by multiple threads must be divided to run on separate cores.

Structured Data

Data that can be stored in a table or database

Small Data

Data that can reside in RAM or memory

Medium Data

Data that can reside on a Hard Disk

Unstructured Data

Data that does not reside in fixed locations

External Data

Data that is not collected by your organization (can be public or bought) •Twitter handles (from name, address, or SSN), Google data, Online Blogs, Social Site postings, Likes, Credit

Internal Data

Data that is procured and consolidated from different branches within your organization •Purchase orders from sales teams, transactions from accounting, re-orders from inventory, customer demographics, Internet of Things records, click-through data, and many more

Data Warehouses

Data warehouses are collections of data from different information systems. Transforming the data into a consistent format is a key part of data warehouse formation.

Machine Learning Life Model

Define project objectives •Acquire/explore data •Model data •Interpret/communicate •Implement/document/maintain

Features

Different independent 'variables', columns of data, predictors

Trajectory Mining

Direction something is heading

Dirty Data

Duplicate, Incorrect, Missing

Model Diagnostics

Evaluation of top models and different probability cutoffs

Model Diagnosis

Evaluation/ Ranking of the top models Which Model predicts the best results

Data Sources

Excel, Comma Separated Values (CSV) •Structured databases •Cloud-based data •ERP (Enterprise Resource Planning) •Hadoop (non-structured cloud-based DB) •Salesforce CRM (Customer Relationship Management)

ETL Summary

Extract: Pull data out of wherever it resides in the world •Transform: Convert data into a homogeneous format for joining; do stuff to the data; create new "features" (variables) •Load: Put clean data into its final location
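The three ETL steps can be sketched with Python's standard library. This is an illustration under assumptions: the CSV text, the `purchases` table, and the `is_large` feature are all hypothetical, and an in-memory SQLite database stands in for the "final location".

```python
import csv
import io
import sqlite3

# Hypothetical raw source; in practice this might be a file or an API response.
RAW_CSV = "customer_id,amount\n1, 19.99\n2,5\n1,0.01\n"


def extract(text):
    """Extract: pull rows out of the raw source as dictionaries of strings."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: convert types, tidy values, and create a new feature."""
    cleaned = []
    for row in rows:
        amount = float(row["amount"].strip())
        cleaned.append({
            "customer_id": int(row["customer_id"]),
            "amount": amount,
            "is_large": int(amount >= 10),  # engineered feature (assumed rule)
        })
    return cleaned


def load(rows, conn):
    """Load: put the clean rows into their final table."""
    conn.execute(
        "CREATE TABLE purchases (customer_id INTEGER, amount REAL, is_large INTEGER)"
    )
    conn.executemany(
        "INSERT INTO purchases VALUES (:customer_id, :amount, :is_large)", rows
    )


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
row_count = conn.execute("SELECT COUNT(*) FROM purchases").fetchone()[0]
```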

Big Data Opportunity

Extracting insight from an immense volume, variety and velocity of data, in context, beyond what was previously possible

General Platforms: Commercial

For a price

Union

For example, if a company has multiple customers, and we store their records in various databases, a union will create one table containing all customers for that company.
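A small sketch of a union using SQLite from Python's standard library. For simplicity it assumes the customer records live in two tables of one in-memory database rather than in separate databases; the table names and rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers_east (customer_id INTEGER, name TEXT);
CREATE TABLE customers_west (customer_id INTEGER, name TEXT);
INSERT INTO customers_east VALUES (1, 'Ana'), (2, 'Ben');
INSERT INTO customers_west VALUES (3, 'Cy'), (2, 'Ben');
""")

# UNION stacks rows from both tables and drops exact duplicates;
# UNION ALL would keep the repeated (2, 'Ben') row.
all_customers = conn.execute("""
SELECT customer_id, name FROM customers_east
UNION
SELECT customer_id, name FROM customers_west
ORDER BY customer_id
""").fetchall()
```

Both tables must have the same columns in the same order for the union to line up.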

Image Data

Graphic images and pictures

Audio

Human voice and other sounds

Context Specific Tools

Implemented with another system or for a specific purpose

Types of Machine Learning

Linear Regression •Logistic Regression •Decision Trees •Nearest Neighbor •Naive Bayes •Support Vector Machines (SVM) •Neural Networks

artificial intelligence

Machines that can perform tasks that are characteristic of human intelligence

Zettabyte

One billion laptop hard drives

Turing Test

One method of determining the strength of artificial intelligence, in which a human tries to decide if the intelligence at the other end of a text chat is human.

Target

Outcome/What you are looking for, response

Over training can be classified as

Poor Generalization

Over fitting

Poor generalization can be classified as over training •The model simply memorizes the training examples and is not able to give correct outputs for patterns that were not in the training dataset

Reporting Tools

Present query output in meaningful, understandable formats

3 criteria used to judge projects.

Project statements should be presented in the language of business. Does the project statement have specific actions that should result from the project? How could solving the problem statement impact the bottom line?

Two Types of Targets in Machine Learning

Regression and Classification

Complete Match Removal

Removal of full rows based on identical content in all columns

Partial Match Removal

Removal of full rows based on the identical content of a few columns
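Complete and partial match removal can be contrasted in a short Python sketch. The helper names and sample rows are hypothetical, and keeping the first occurrence of each duplicate is an assumption; a real project may prefer the most recent or most complete row.

```python
def remove_complete_matches(rows):
    """Drop rows whose content is identical in all columns (keep the first)."""
    seen, kept = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


def remove_partial_matches(rows, columns):
    """Drop rows that repeat the values of the chosen columns (keep the first)."""
    seen, kept = set(), []
    for row in rows:
        key = tuple(row[column] for column in columns)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


rows = [
    {"customer_id": 1, "email": "a@x.com", "city": "Austin"},
    {"customer_id": 1, "email": "a@x.com", "city": "Austin"},  # complete match
    {"customer_id": 2, "email": "a@x.com", "city": "Dallas"},  # matches on email only
]
after_complete = remove_complete_matches(rows)
after_partial = remove_partial_matches(rows, ["email"])
```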

Time Series Motifs

Repeated segments in long time series data

Attribute Instance

Represents a single cell within a column

algorithm selection

Selecting the algorithm

Model

Set of weighted relationships between the features and the target +Strength

Validation Set

Subsection of a dataset to which we apply the machine learning algorithm to see how accurately it identifies relationships between the known outcomes for the target variable and the dataset's other features

Summarize

Summarize is a feature available in most programming tools and is also known under the names group and group by. When summarizing, one or more columns by which to group data is selected, essentially creating a virtual "bucket" for each unique group. For example, if you are interested in employees, you would summarize by EmployeeID, and in the given case, the nine unique employees in the dataset would each be assigned a bucket. Every row belonging to that employee would then be placed inside the bucket. Once all the relevant rows for that employee are in the bucket, the data within are available to be summarized.
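The "bucket" idea can be sketched in plain Python. The sample rows and the `amount` column are invented for illustration; only `EmployeeID` comes from the description above.

```python
from collections import defaultdict

sales = [
    {"EmployeeID": 1, "amount": 100.0},
    {"EmployeeID": 2, "amount": 40.0},
    {"EmployeeID": 1, "amount": 60.0},
]

# One bucket per unique EmployeeID; every row for that employee goes inside.
buckets = defaultdict(list)
for row in sales:
    buckets[row["EmployeeID"]].append(row["amount"])

# Once the buckets are filled, the data within each one can be summarized.
summary = {
    employee: {"count": len(amounts), "total": sum(amounts)}
    for employee, amounts in buckets.items()
}
```

In SQL the same operation would be written with `GROUP BY EmployeeID` plus aggregate functions such as `COUNT` and `SUM`.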

Types of Targets

Supervised •Classification: event/no event (binary target) -0/1 -yes/no; class label (multiclass problem) -Red -Green -Yellow •Regression: continuous outcome -How much money is in someone's pocket

Narrow AI

Technologies that are able to perform specific tasks as well as, or better than, we humans can

Project

Temporary group activity designed to produce a unique product service or result

Word Birth

The first time a baby/toddler says a new word

Feature engineering

The process of cleaning data, combining features, splitting features, and handling missing values •Using personal and business insight to change the features •e.g., less work is done on the Sunday after the Super Bowl

data exhaust

The trail of data generated as a byproduct of people's online actions and choices.

Digital exhaust is ...

The trail that each of us leaves online

Filtering

Tool for splitting up a set of data into two separate tables based on characteristics of that data
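A minimal sketch of filtering as a two-way split, assuming rows are dictionaries. The function name `filter_split`, the sample customers, and the Texas rule are hypothetical.

```python
def filter_split(rows, predicate):
    """Split rows into (matching, non_matching) tables using one condition."""
    matching = [row for row in rows if predicate(row)]
    non_matching = [row for row in rows if not predicate(row)]
    return matching, non_matching


customers = [
    {"name": "Ana", "state": "TX"},
    {"name": "Ben", "state": "CA"},
    {"name": "Cy", "state": "TX"},
]
texas, elsewhere = filter_split(customers, lambda row: row["state"] == "TX")
```

Every input row lands in exactly one of the two output tables, so no data is lost by the split.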

k-fold cross validation

Train your machine learning model using the training set and calculate the accuracy of your model by validating the predicted results against the validation set •Estimate the accuracy of your machine learning model by averaging the accuracies derived in all k cases across cross validation
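The procedure can be sketched in plain Python. To keep the example self-contained, the "model" here is a stand-in that always predicts the most common label in its training rows; that choice, the interleaved fold assignment, and the sample labels are assumptions for illustration only.

```python
from collections import Counter


def k_fold_indices(n_rows, k):
    """Yield (train_idx, validation_idx) pairs; each row is validated once."""
    folds = [list(range(start, n_rows, k)) for start in range(k)]
    for i, validation_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, validation_idx


def majority_class_accuracy(labels, train_idx, validation_idx):
    """Train a stand-in model (majority label) and score it on one fold."""
    majority = Counter(labels[j] for j in train_idx).most_common(1)[0][0]
    hits = sum(labels[j] == majority for j in validation_idx)
    return hits / len(validation_idx)


labels = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]
k = 5
fold_accuracies = [
    majority_class_accuracy(labels, train_idx, validation_idx)
    for train_idx, validation_idx in k_fold_indices(len(labels), k)
]

# The cross-validated estimate is the average of the k fold accuracies.
mean_accuracy = sum(fold_accuracies) / k
```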

Data

Unorganized facts that need to be processed

Regression

Used for predicting the values of a dependent variable based on values of at least one independent variable
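As an illustration, here is a sketch of the simplest case: one independent variable fitted by ordinary least squares. The function name `fit_line` and the data points are made up.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [1, 2, 3, 4]   # independent variable (feature)
ys = [3, 5, 7, 9]   # dependent variable (target)
slope, intercept = fit_line(xs, ys)

# Predict the target for a value of x that was not in the data.
prediction = slope * 5 + intercept
```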

The three V's

Variety: Manage the complexity of multiple relational and non-relational data types and schemas Velocity: Streaming data and large volume data movement Volume: Scale from terabytes to zettabytes

Target Leak

When a dataset is collected over an extended period such that you have unrealistic data available at the time of prediction. Occurs when you train your algorithm on a dataset that includes information that would not be available at the time of prediction when you apply your model to data you collect in the future. •Also occurs if one of the predictors is very correlated with the response (not always) •"Too good to be true" performance is a dead giveaway •Very common because historical data is frequently used

Model Success Criteria

Who uses the model? How much value can the model drive? What modeling criteria will help get you there? 1. Who will use the model? 2. Is management on board with the project? 3. Can the model drivers be visualized? 4. How much value can the model produce?

Better to train a Subject matter expert on Auto ML than a data scientist

Yes

Text String

You specify a number of characters •This is either exact or a maximum depending on the data type

Holdout Set

(aka "testing" data) A subsection of a dataset used to provide a final estimate of the machine learning model's performance after it has been trained and validated. Holdout sets should never be used to make decisions about which algorithms to use or for improving or tuning algorithms.
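One way to carve a dataset into training, validation, and holdout sets is sketched below. The 60/20/20 proportions, the fixed seed, and the function name `split_dataset` are assumptions, not values from the course.

```python
import random


def split_dataset(rows, validation_frac=0.2, holdout_frac=0.2, seed=0):
    """Shuffle once, then slice into training, validation, and holdout sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_frac)
    n_validation = int(len(shuffled) * validation_frac)
    holdout = shuffled[:n_holdout]
    validation = shuffled[n_holdout:n_holdout + n_validation]
    training = shuffled[n_holdout + n_validation:]
    return training, validation, holdout


training, validation, holdout = split_dataset(list(range(100)))

# The holdout rows are set aside here and scored only once, at the very end.
```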

Primary Key

an attribute that uniquely identifies a specific instance of an entity. Every entity in the data model must have a primary key whose values uniquely identify instances of the entity.

Foreign Key

attribute borrowed from another related table in order to make the relationship between the two tables.
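Primary and foreign keys can be demonstrated with SQLite from Python's standard library. The `customers` and `orders` tables are hypothetical; note that SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON` is set on the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,   -- primary key: unique per customer
    name TEXT NOT NULL
);
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    -- foreign key: borrowed from customers to relate the two tables
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id)
);
INSERT INTO customers VALUES (1, 'Ana');
INSERT INTO orders VALUES (10, 1);
""")


def try_insert(sql):
    """Return 'ok' if the row is accepted, 'rejected' if a key rule blocks it."""
    try:
        conn.execute(sql)
        return "ok"
    except sqlite3.IntegrityError:
        return "rejected"


duplicate_key = try_insert("INSERT INTO customers VALUES (1, 'Ben')")  # repeats key 1
orphan_order = try_insert("INSERT INTO orders VALUES (11, 99)")        # no customer 99
valid_order = try_insert("INSERT INTO orders VALUES (12, 1)")
```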

Join

combines two datasets (or tables) with a shared identity value, such as a customer identifier. For example, you might "join" the row containing your customer record (CustomerID, name, address, etc.; A in Figure 9.1) with your login information on a website (CustomerID, visit time, products purchased, etc.; B in Figure 9.1).
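A join on a shared CustomerID might look like the following SQLite sketch. The table contents are invented, and an inner join is assumed, so customers with no matching visits drop out of the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (CustomerID INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE visits (CustomerID INTEGER, visit_time TEXT);
INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben');
INSERT INTO visits VALUES (1, '2024-01-05 09:00'), (1, '2024-01-06 17:30');
""")

# Each visit row is matched to the customer row with the same CustomerID.
joined = conn.execute("""
SELECT c.CustomerID, c.name, v.visit_time
FROM customers AS c
JOIN visits AS v ON v.CustomerID = c.CustomerID
ORDER BY v.visit_time
""").fetchall()
```

A `LEFT JOIN` would instead keep customers without visits, filling `visit_time` with NULL.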

data brokers

companies that collect and sell personal information about consumers

Attributes

data items that describe an entity.

Nominal Data

data of categories only. Data cannot be arranged in an ordering scheme. (Gender, Race, Religion)

General Platforms: Open Source

designed by/for computer scientists

Querying Tools

easy-to-use software allowing users to get specific information from a database

Exploratory data analysis

examining descriptive statistics for all features as well as their relationship with the target

Information

knowledge communicated or received concerning a particular fact or circumstance, created from data

Unsupervised Machine Learning

machine learning that does not need input for the algorithms and does not need to be trained

Supervised Machine Learning

machine learning that requires humans to provide input and desired output as well as feedback about prediction accuracy during the beginnings of the system

General AI

machines that have all our senses (maybe even more) Can have complex nuanced conversations that can pass for human •Can solve new problems on the spot •Can interpret accents it has never heard before •That understand vocabulary through context and can create sentences it has never had to express before

alphanumeric

numbers and letters

Entities

real world object distinguishable from other objects. An entity is described using a set of attributes.

Crosstab

converts data from row form (skinny form, the typical form for data to take) to column form
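A small sketch of a crosstab in plain Python, turning skinny rows into one wide row per person. The names, quarters, and values are illustrative, and filling missing combinations with `None` is an assumption.

```python
# Skinny form: one row per (name, quarter) observation.
long_rows = [
    ("Ana", "Q1", 10),
    ("Ana", "Q2", 15),
    ("Ben", "Q1", 7),
]

# The distinct quarter values become the new column headers.
columns = sorted({quarter for _, quarter, _ in long_rows})

# Column form: one entry per name, one column per quarter.
wide = {}
for name, quarter, value in long_rows:
    wide.setdefault(name, {column: None for column in columns})[quarter] = value
```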

Text Data

sentences and paragraphs used in written communication

Training Set

subsection of a dataset from which the machine learning algorithm uncovers or "learns" relationships between the features and the target variable.

Five Fold Validation

the data set less the validation set and the holdout is split into five folds

Analysis Gap

the large gap between data businesses collect and the information that decision makers require

Triple Constraint

time, cost, scope

Regression (Target of Machine Learning)

to predict the target's numeric value. With this second kind of target, we might, for example, target how many years of blissful union a couple has ahead of them.

Classification (Target of Machine Learning)

which predicts the category to which a new case belongs. For example, we might build a model of divorce during the first ten years of marriage (our target is divorced: TRUE or FALSE). When working through a dataset for classification, we carefully examine the possible states in which a variable can exist and consider options for simplification.

unit of analysis

who or what is being studied

Churn

Will this customer ever stop being a customer?

