Data Mining

Stacking

---A way to ensemble multiple classifiers in which a new model is trained to combine the predictions of two or more already-trained models ---Also known as blending, because the predictions of the existing models are blended together by the new model
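
A minimal sketch of the idea, using only the Python standard library. The toy data, the two threshold rules standing in for already-trained base models, and the lookup-table meta-model are all assumptions made for illustration; real stacking usually trains the combining model on held-out predictions.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (feature_a, feature_b) -> label
train = [((1, 5), 0), ((2, 6), 0), ((6, 1), 0),
         ((7, 7), 1), ((8, 6), 1), ((9, 9), 1)]

# Two "already trained" base models, here just hand-set threshold rules.
def model_a(x):
    return 1 if x[0] > 5 else 0

def model_b(x):
    return 1 if x[1] > 4 else 0

base_models = [model_a, model_b]

def fit_meta(rows):
    """Train the combining model: for each pattern of base predictions,
    remember the label that was most common in the training rows."""
    votes = defaultdict(Counter)
    for x, y in rows:
        key = tuple(m(x) for m in base_models)
        votes[key][y] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in votes.items()}

meta = fit_meta(train)

def stacked_predict(x, default=0):
    """Blend the base predictions through the meta-model.
    Unseen prediction patterns fall back to `default`."""
    return meta.get(tuple(m(x) for m in base_models), default)
```

Here the meta-model learns that a record is class 1 only when both base models agree on 1, which neither base model expresses on its own.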

Six Sigma DMAIC

A systematic approach for eliminating errors. Uses statistical methods to improve quality by minimizing variability in business processes

Data Mining Techniques

* Neural networks * Online Analytical Processing (OLAP) * Text mining * Decision trees (Chi-squared Automatic Interaction Detection, CHAID) * Algorithms * Nearest neighbors * Brushing * Rule induction

3Vs of Big Data

Volume, Variety, Velocity

Boosting

--- Constructs a sequence of datasets and models such that a record is weighted more heavily in the next dataset when the previous model misclassified it --- Improves classification accuracy --- Merges the models in the sequence into a final prediction
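
The reweighting step can be sketched as follows. This uses the AdaBoost update rule as one concrete choice; the card does not name a specific boosting algorithm, so treat the formula, the function name, and the toy labels as assumptions.

```python
import math

def adaboost_reweight(weights, y_true, y_pred):
    """One AdaBoost-style round: up-weight misclassified records,
    down-weight correctly classified ones, then renormalize."""
    missed = [t != p for t, p in zip(y_true, y_pred)]
    err = sum(w for w, m in zip(weights, missed) if m) / sum(weights)
    if not 0 < err < 1:
        raise ValueError("weighted error must be strictly between 0 and 1")
    # alpha is also the weight this model gets when the models are merged
    alpha = 0.5 * math.log((1 - err) / err)
    raw = [w * math.exp(alpha if m else -alpha) for w, m in zip(weights, missed)]
    total = sum(raw)
    return [w / total for w in raw], alpha

weights = [0.25, 0.25, 0.25, 0.25]
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]  # the second record is misclassified
new_weights, alpha = adaboost_reweight(weights, y_true, y_pred)
```

After this round the misclassified record carries half of the total weight, so the next model in the sequence is pushed to get it right.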

Bagging

---Bootstrap aggregating ---Creates bootstrap replicates of the dataset and fits a model to each one ---Combines the models' predictions by voting ---Stabilizes predictions ---Combines the predicted classifications from multiple models, or from the same type of model trained on different data
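
A small standard-library sketch of bootstrap replicates plus voting. The one-feature threshold model (`fit_stump`), the toy data, and the choice of 25 replicates are assumptions for illustration only.

```python
import random
from collections import Counter

def fit_stump(rows):
    """Fit a one-feature threshold model: the cut is the midpoint
    between the mean of class 0 and the mean of class 1."""
    zeros = [x for x, y in rows if y == 0]
    ones = [x for x, y in rows if y == 1]
    if not zeros or not ones:  # replicate happened to contain one class only
        only = 0 if zeros else 1
        return lambda x: only
    cut = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    return lambda x: 1 if x > cut else 0

def bagging_fit(rows, n_models=25, seed=0):
    """Draw bootstrap replicates (sampling with replacement) and fit one model to each."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(rows) for _ in rows]) for _ in range(n_models)]

def bagging_predict(models, x):
    """Majority vote over the individual models' predictions."""
    return Counter(m(x) for m in models).most_common(1)[0][0]

data = [(1.0, 0), (2.0, 0), (3.0, 0), (7.0, 1), (8.0, 1), (9.0, 1)]
models = bagging_fit(data)
```

Each replicate yields a slightly different cut point; voting averages out that variation, which is the stabilizing effect the card refers to.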

Machine Learning

---Focuses on designing algorithms that can learn from data and make predictions on it without being explicitly programmed to perform the task ---Learns from "training data"

Exploratory Data Analysis (EDA)

---Also known as model building or pattern identification ---"Data investigation" that is not limited by assumptions and biases ---Pattern discovery is a complex phase of data mining ---Aims to yield a highly predictive, consistent pattern-identifying model ---Often uses visual methods ---Used to see what the data can tell us beyond the formal hypothesis

drill down

exploring deeper layers of multidimensional data moving from one level of detail to the next
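
As a rough illustration, drilling down can be viewed as re-aggregating the same records at a finer level of a dimension hierarchy. The admission counts and the year/quarter/month hierarchy below are made up for the example.

```python
from collections import defaultdict

# Hypothetical records: (year, quarter, month, admissions)
records = [
    (2023, "Q1", "Jan", 120),
    (2023, "Q1", "Feb", 95),
    (2023, "Q2", "Apr", 110),
    (2024, "Q1", "Jan", 130),
]

def summarize(rows, depth):
    """Total the last column, grouped by the first `depth` dimensions.
    depth=1 groups by year, depth=2 by year and quarter, and so on."""
    totals = defaultdict(int)
    for *dims, count in rows:
        totals[tuple(dims[:depth])] += count
    return dict(totals)

by_year = summarize(records, 1)     # top level
by_quarter = summarize(records, 2)  # one level of detail deeper
```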

Ethics of data mining

--- Depends on the use of PHI (protected health information) --- Ensure data is de-identified and confidentiality is maintained --- Follow HIPAA guidelines

Feature Selection

- One of the initial stages of data mining - The selection of the attributes in your data that are most relevant to the predictive modeling problem
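
One simple way to do this is a filter approach that ranks attributes by the absolute value of their correlation with the target. The sketch below assumes numeric attributes and made-up column names; it is one of several possible selection strategies, not the only one.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def top_features(columns, target, k):
    """Return the k attribute names most strongly correlated with the target."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical attributes and a binary outcome
columns = {
    "age": [30, 40, 50, 60, 70],
    "shoe_size": [9, 7, 10, 8, 9],
    "systolic_bp": [118, 121, 135, 140, 152],
}
target = [0, 0, 1, 1, 1]

selected = top_features(columns, target, k=2)
```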

3 data mining methods

1. CRISP-DM 2. Six Sigma DMAIC 3. SEMMA

primary analysis

The analysis of original research data by the researchers who collected them

De-identification

The process of removing identifying information from data sets in order to assure the anonymity of individuals.
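
A minimal sketch of the mechanical part of this process: dropping fields that directly identify a person. The field names are illustrative assumptions and not a complete identifier list; removing direct identifiers alone does not guarantee anonymity, since combinations of the remaining fields can still identify someone.

```python
# Illustrative field names only; not a complete list of identifiers.
DIRECT_IDENTIFIERS = {"name", "ssn", "phone", "email", "address", "mrn"}

def deidentify(record):
    """Return a copy of the record without the fields listed as direct identifiers."""
    return {field: value for field, value in record.items()
            if field not in DIRECT_IDENTIFIERS}

# Hypothetical patient record
patient = {"name": "Jane Doe", "mrn": "000123", "age": 54, "diagnosis": "I10"}
clean = deidentify(patient)
```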

Knowledge discovery (KDD)

Which model consists of the following steps? 1. Generating/understanding the application domain 2. Forming the target data set 3. Data cleaning/pre-processing 4. Data reduction/transformation 5. Selecting the purpose of the data 6. Choosing the data mining algorithm 7. Implementing data mining 8. Interpreting patterns 9. Utilizing the discovered knowledge

Meta-learning

combines the predictions from several data mining models with the goal of synthesizing these predicted classifications to generate a final best predicted classification

data reduction

combining information from large datasets into smaller, manageable information nuggets

____________ facilitates data exploration

data mining

Results from knowledge discovery foster proactive, knowledge-driven _______________

decision making

secondary analysis

the analysis of original research data that have been collected by other researchers

data mining

the extraction of large amounts of data to identify meaningful patterns and relationships among data for classification and prediction using algorithms to solve problems

Predictive Data Mining

the most common type of data mining; usually applied to data mining projects whose goal is to identify a statistical model or set of models that can be used to predict some response of interest

data mining

* Iterative process * Explores and models big data * Identifies patterns * Provides meaningful insight * Considered secondary analysis because data miners use data created by others

What are the steps of CRISP-DM?

1. Business Understanding 2. Data Understanding 3. Data Preparation 4. Modeling 5. Evaluation 6. Deployment

Knowledge Deployment

Applying the discovered model to new data to generate predictions

data mining

the process of finding correlations or patterns among data

What are 2 types of secondary analysis?

1. Data mining 2. Meta analysis

Name 2 benefits of KDD

1. Enhances business aspects 2. Improves patient care

Data mining focuses on producing solutions that generate useful forecasts through a 4-phase process:

1. Problem identification 2. Data exploration 3. Pattern discovery 4. Knowledge deployment

Classification and Regression Trees (CART)

A data mining method used for analyzing outcomes and service use; used for sorting or classifying a data set. It produces a set of rules that can be applied to a new data set that has not been classified; the rules are designed to predict which records will have a specified outcome.
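
To make the idea of a learned rule concrete, the sketch below finds the single best threshold split on one numeric attribute by minimizing weighted Gini impurity, the criterion commonly used for CART classification trees. A full tree repeats this search recursively on each side of the split; the length-of-stay data and the binary 0/1 outcome are assumptions for illustration.

```python
def gini(labels):
    """Gini impurity for binary 0/1 labels: 2 * p * (1 - p)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(xs, ys):
    """Try every midpoint between adjacent distinct values and return
    (cut, weighted_gini) for the split with the lowest impurity."""
    best = (None, float("inf"))
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        cut = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= cut]
        right = [y for x, y in zip(xs, ys) if x > cut]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[1]:
            best = (cut, score)
    return best

# Hypothetical data: length of stay in days, and whether the patient was readmitted
length_of_stay = [1, 2, 3, 6, 7, 8]
readmitted = [0, 0, 0, 1, 1, 1]

cut, impurity = best_split(length_of_stay, readmitted)

def rule(days):
    """The learned rule, applicable to records that have not been classified yet."""
    return 1 if days > cut else 0
```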

Cross-Industry Standard Process for Data Mining (CRISP-DM)

An industry-proven way to guide data mining efforts and projects. Provides an overview of the data mining life cycle. An accepted standard for the steps in any data mining process used to provide business solutions. Includes descriptions of the typical phases of a project, the tasks involved in each phase, and an explanation of the relationships between these phases. Many tasks can be performed out of order and repeated.

meta-analysis

An integrative analysis of findings from many studies that examine the same question; useful when research outcomes are inconsistent or contradictory

SEMMA

An alternative process for data mining projects proposed by the SAS Institute. Uses large amounts of data to uncover previously unknown patterns that can be used as a business advantage.

Data Mining Categories

Classification: assigns records to one of a predefined set of classes. Estimation: determines the value of an unknown continuous variable, such as a behavior or an estimated future value. Affinity grouping: determines which things go together. Clustering: segments a heterogeneous population of records into a number of more homogeneous subgroups.
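
Affinity grouping is the easiest of these to show in a few lines: count how often pairs of items appear together. The shopping baskets below are invented for the example, and counting pairs is only the first step of a full association-rule analysis.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "jam"},
    {"bread", "milk", "jam"},
]

pair_counts = Counter()
for basket in baskets:
    # sorted() gives each pair a consistent order, so (a, b) and (b, a) are counted together
    pair_counts.update(combinations(sorted(basket), 2))

most_common_pair, times_seen = pair_counts.most_common(1)[0]
```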

Six Sigma DMAIC

Define - the problems and issues; Measure - how to collect data; Analyze - what is causing the problem; Improve - create solutions; Control - revise systems to incorporate improvements and ensure the process is properly managed and monitored

* Looks at data from different vantage points, aspects, and perspectives * Brings new insights * Finds hidden, valuable information in large databases * A 9-step, iterative process

Knowledge discovery (KDD)

Big Data in Healthcare

Refers to the abundant health data amassed from numerous sources, including EHRs, digital medical imaging, wearables, medical devices, smartphones, etc.

SEMMA stands for

Sample, Explore, Modify, Model, and Assess

De-identifying

Under HIPAA, ________ data is no longer considered PHI and may be used for statistics-based research. However, the process of ___________ personal data is considered a "use" of the data and thus falls under HIPAA guidelines and protections

