Data Mining
Stacking
--- A way to ensemble multiple classifiers in which a new model is trained to combine the predictions from 2 or more already-trained models --- Also known as blending, because the predictions of the existing models are blended together using a new model
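A minimal sketch of the idea on this card, using only the Python standard library. The two base models are hypothetical threshold rules, the data is made up, and the "new model" is just a lookup table that learns the majority true label for each pattern of base predictions. In practice the combiner is usually trained on held-out predictions, which this sketch skips.

```python
from collections import Counter, defaultdict

# Toy data: (feature_a, feature_b) -> label. Values are made up for illustration.
train = [((1, 9), 0), ((2, 8), 0), ((3, 3), 0), ((6, 2), 1),
         ((7, 7), 1), ((8, 1), 1), ((4, 6), 0), ((5, 1), 1)]

# Two "already trained" base models (hypothetical threshold rules).
def model_a(x):
    return 1 if x[0] >= 5 else 0

def model_b(x):
    return 1 if x[1] <= 4 else 0

def fit_meta(rows, base_models):
    """Train the combiner: for each pattern of base predictions,
    remember the majority true label seen in the training rows."""
    votes = defaultdict(Counter)
    for x, y in rows:
        key = tuple(m(x) for m in base_models)
        votes[key][y] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in votes.items()}

def stacked_predict(x, base_models, meta, default=0):
    """Run the base models, then let the combiner decide."""
    key = tuple(m(x) for m in base_models)
    return meta.get(key, default)

base = [model_a, model_b]
meta = fit_meta(train, base)
```

For the point (3, 3) the two base models disagree, and the combiner resolves the disagreement using what it learned from the training rows.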
Six Sigma DMAIC
A systematic approach for eliminating errors. Uses statistical methods to improve quality by minimizing variability in business processes
Data Mining Techniques
* Neural networks * Online Analytical Processing (OLAP) * Text mining * Decision trees (Chi-squared Automatic Interaction Detection, CHAID) * Algorithm * Nearest neighbors * Brushing * Rule induction
3Vs of Big Data
Volume, Variety, Velocity
Boosting
--- Constructing a sequence of datasets & models in such a way that an observation is weighted more heavily in the next dataset when the previous model has misclassified it --- Improves classification accuracy --- Merges the models in the sequence
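One way to picture the reweighting on this card is an AdaBoost-style loop over one-feature threshold rules ("stumps"). This is a sketch under assumptions: AdaBoost is only one boosting scheme, the data is made up, labels are coded as -1/+1, and all function names are illustrative.

```python
import math

def stump_predict(x, threshold, polarity):
    """A one-rule model: predict `polarity` at or above the threshold."""
    return polarity if x >= threshold else -polarity

def fit_stump(xs, ys, weights):
    """Pick the threshold and polarity with the lowest weighted error."""
    best = None
    for t in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for w, x, y in zip(weights, xs, ys)
                      if stump_predict(x, t, polarity) != y)
            if best is None or err < best[0]:
                best = (err, t, polarity)
    return best

def adaboost(xs, ys, rounds=3):
    n = len(xs)
    weights = [1.0 / n] * n          # start with equal weights
    ensemble = []
    for _ in range(rounds):
        err, t, polarity = fit_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)   # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # this model's say in the vote
        ensemble.append((alpha, t, polarity))
        # Misclassified rows get heavier, correctly classified rows get lighter.
        weights = [w * math.exp(-alpha * y * stump_predict(x, t, polarity))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble, weights

def predict(ensemble, x):
    """Merge the sequence of models with a weighted vote."""
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1

xs = [1, 2, 3, 4, 5, 6]
ys = [-1, -1, 1, -1, 1, 1]   # no single threshold separates these labels
ensemble, final_weights = adaboost(xs, ys, rounds=3)
```

No single stump fits this toy data, so each round concentrates weight on whichever rows the previous stump got wrong before the final weighted vote is taken.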
Bagging
--- Create bootstrap replicates of the dataset and fit a model to each one --- Short for bootstrap aggregating --- Combines the models by voting on the predictions of each model --- Stabilizes predictions --- Combines the predicted classifications from multiple models, or from the same type of model trained on different data
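The steps on this card (replicate, fit, vote) can be sketched with the standard library alone. The data is made up, and the "model" is a deliberately simple nearest-class-mean rule chosen for brevity; any classifier could take its place.

```python
import random
from collections import Counter

# Toy 1-D data as (value, label); the numbers are made up for illustration.
data = [(1, 0), (2, 0), (3, 0), (4, 0), (5, 0),
        (11, 1), (12, 1), (13, 1), (14, 1), (15, 1)]

def bootstrap_replicate(rows, rng):
    """Sample len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]

def fit_centroids(rows):
    """A deliberately simple model: the mean value of each class."""
    grouped = {}
    for value, label in rows:
        grouped.setdefault(label, []).append(value)
    return {label: sum(vals) / len(vals) for label, vals in grouped.items()}

def predict_one(centroids, x):
    """Predict the class whose mean is closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))

def bagged_predict(models, x):
    """Majority vote across the models fit on each replicate."""
    votes = Counter(predict_one(m, x) for m in models)
    return votes.most_common(1)[0][0]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
models = [fit_centroids(bootstrap_replicate(data, rng)) for _ in range(25)]
```

Calling `bagged_predict(models, 2)` then returns whichever class most of the 25 models vote for.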
Machine Learning
--- Focuses on designing algorithms that can learn from data and make predictions on it without being explicitly programmed to perform the task --- Learns from "training data"
Exploratory Data Analysis (EDA)
--- Known as model building or pattern identification --- "Data investigation" that is not limited by assumptions and biases --- Pattern discovery is a complex phase of data mining --- Aims to yield a highly predictive, consistent pattern-identifying model --- Often uses visual methods --- Used to see what the data can tell us beyond the formal hypothesis
drill down
exploring deeper layers of multidimensional data moving from one level of detail to the next
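A small sketch of drilling down, assuming made-up visit counts stored with a year / quarter / month hierarchy. The `summarize` helper is hypothetical; increasing `depth` moves from one level of detail to the next.

```python
from collections import defaultdict

# Hypothetical rows: (year, quarter, month, visit_count)
records = [
    (2023, "Q1", "Jan", 120),
    (2023, "Q1", "Feb", 95),
    (2023, "Q2", "Apr", 110),
    (2024, "Q1", "Jan", 130),
]

def summarize(rows, depth):
    """Total the counts using only the first `depth` dimensions.
    depth=1 is by year, depth=2 by year and quarter, and so on."""
    totals = defaultdict(int)
    for *dims, value in rows:
        totals[tuple(dims[:depth])] += value
    return dict(totals)

by_year = summarize(records, 1)
by_quarter = summarize(records, 2)   # one level deeper
by_month = summarize(records, 3)     # deepest level in this toy data
```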
Ethics of data mining
--- Depends on the use of PHI (protected health information) --- Ensures data is de-identified and confidentiality is maintained --- Follow HIPAA guidelines
Feature Selection
--- One of the initial stages of data mining --- The selection of the attributes in your data that are most relevant to the predictive modeling problem
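One simple way to score relevance is to rank attributes by the absolute value of their correlation with the outcome. This is only one of many selection approaches, and the feature names and values below are invented for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

def rank_features(features, target):
    """Sort feature names from most to least correlated with the target."""
    scores = {name: abs(pearson(vals, target)) for name, vals in features.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical attributes and outcome.
features = {
    "dose": [2, 4, 6, 8, 10],
    "bed_number": [3, 1, 4, 1, 5],
    "site_code": [7, 7, 7, 7, 7],   # constant, so it carries no signal
}
target = [1, 2, 3, 4, 5]
ranking = rank_features(features, target)
```

Correlation only captures linear, one-at-a-time relationships, so a ranking like this is a starting point, not a final answer.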
3 data mining methods
1. CRISP-DM 2. Six Sigma DMAIC 3. SEMMA
primary analysis
The analysis of original research data by the researchers who collected them
De-identification
The process of removing identifying information from data sets in order to assure the anonymity of individuals.
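A toy sketch of the removal step, assuming each record is a dictionary. The field names in `IDENTIFIER_FIELDS` are hypothetical and far from a complete list; real de-identification has to follow the applicable regulatory guidance and involves more than dropping a few fields.

```python
# Assumed set of direct identifiers for this example only (not a complete list).
IDENTIFIER_FIELDS = {"name", "ssn", "phone", "address"}

def de_identify(record):
    """Return a copy of the record without the assumed identifier fields."""
    return {key: value for key, value in record.items()
            if key not in IDENTIFIER_FIELDS}

# Made-up record for illustration.
patient = {
    "name": "Jane Doe",
    "ssn": "000-00-0000",
    "phone": "555-0100",
    "diagnosis_code": "E11.9",
    "length_of_stay": 3,
}
clean = de_identify(patient)
```

The original record is left unchanged; `clean` keeps only the fields that are not in the assumed identifier set.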
knowledge discovery KDD
Which model consists of the following steps? 1. Generating/understanding the application domain 2. Forming the target data set 3. Data cleaning/pre-processing 4. Data reduction/transformation 5. Select the purpose of the data mining 6. Choose the data mining algorithm 7. Implementation of data mining 8. Interpret patterns 9. Utilize the knowledge discovered
Meta-learning
combines the predictions from several data mining models with the goal of synthesizing these predicted classifications to generate a final best predicted classification
data reduction
combining information from large datasets into manageable, smaller information nuggets
____________ facilitates data exploration
data mining
Results from knowledge discovery foster proactive, knowledge-driven _______________
decision making
secondary analysis
the analysis of original research data that have been collected by other researchers
data mining
the extraction of large amounts of data to identify meaningful patterns and relationships among data for classification and prediction using algorithms to solve problems
Predictive Data Mining
the most common type of data mining; the term usually refers to data mining projects whose goal is to identify a statistical model or set of models that can be used to predict some response of interest
data mining
* iterative process * explores and models big data * identifies patterns * provides meaningful insight * considered secondary because data miners use data created by others
What are the steps of CRISP-DM?
1. Business Understanding 2. Data Understanding 3. Data Preparation 4. Modeling 5. Evaluation 6. Deployment
Knowledge Deployment
Application of the model to new data to generate predictions
data mining
the process of finding correlations or patterns among data
What are 2 types of secondary analysis?
1. Data mining 2. Meta analysis
Name 2 benefits of KDD
1. Enhances business aspects 2. Improves patient care
Data mining focuses on producing solutions that generate useful forecasts through a 4-phase process:
1. Problem identification 2. Explore the data 3. Pattern discovery 4. Knowledge deployment
Classification and Regression Trees (CART)
A data mining method used for analyzing outcomes and service use; used for sorting or classifying a data set; A set of rules that can be applied to a new data set that has not been classified; the set of rules is designed to predict which records will have a specified outcome.
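The core step in growing such a tree is choosing the split that leaves each side as pure as possible. The sketch below searches one numeric attribute for the threshold with the lowest weighted Gini impurity. The data is made up, and a full CART implementation would repeat this recursively over every attribute and then prune.

```python
def gini(labels):
    """Gini impurity: 0.0 when every label in the group is the same."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(xs, ys):
    """Try each candidate threshold and keep the one with the lowest
    size-weighted impurity across the two resulting groups."""
    best = None
    for t in sorted(set(xs))[1:]:          # skip the minimum: it leaves one side empty
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if best is None or score < best[1]:
            best = (t, score)
    return best

# Hypothetical records: patient age and whether the patient was readmitted.
ages = [20, 30, 40, 50, 60, 70]
readmitted = ["no", "no", "no", "yes", "yes", "yes"]
threshold, impurity = best_split(ages, readmitted)

def classify(age):
    """The learned rule, applied to a record that has not been classified."""
    return "yes" if age >= threshold else "no"
```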
Cross-Industry Standard Process for Data Mining (CRISP-DM)
An industry-proven way to guide data mining efforts and projects. Provides an overview of the data mining lifecycle. An accepted standard for the steps in any data mining process used to provide business solutions. Includes descriptions of the typical phases of a project, the tasks involved with each phase, and an explanation of the relationships between these phases. Many tasks can be performed out of order and repeated
meta-analysis
An integrative analysis of findings from many studies that examine the same question; useful when research outcomes are inconsistent or contradictory
SEMMA
An alternative process for data mining projects proposed by the SAS Institute. Uses large amounts of data to uncover previously unknown patterns that can be used as a business advantage
Data Mining Categories
* Classification: assigns records to one of a predefined set of classes * Estimation: determines a value for an unknown continuous variable, such as a behavior or an estimated future value * Affinity Grouping: determines which things go together * Clustering: segments a heterogeneous population of records into a number of more homogeneous subgroups
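As an illustration of affinity grouping, the sketch below counts how often pairs of items appear together. The supply orders are made up, and simple pair counting stands in for fuller association-rule methods.

```python
from collections import Counter
from itertools import combinations

# Hypothetical supply orders; each set is one order.
orders = [
    {"gauze", "tape"},
    {"gauze", "tape", "gloves"},
    {"gloves", "mask"},
    {"gauze", "tape", "mask"},
]

def pair_counts(baskets):
    """Count how many baskets contain each pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives each pair one canonical (alphabetical) form.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts

counts = pair_counts(orders)
top_pair, top_count = counts.most_common(1)[0]
```

Here `top_pair` is the pair of items that "go together" most often in the toy orders.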
Six Sigma DMAIC
* Define - the problems and issues * Measure - how to collect data * Analyze - what's causing the problem * Improve - create solutions * Control - revise systems to incorporate improvements & ensure the process is properly managed & monitored
* When data mining looks at data from different vantage points, aspects, and perspectives * Brings new insights * Finds hidden, valuable information in large databases * A 9-step, repetitive process
Knowledge discovery KDD
Big Data in Healthcare
Refers to the abundant health data amassed from numerous sources including EHRs, digital medical imaging, wearables, medical devices, smartphones, etc
SEMMA stands for
Sample, Explore, Modify, Model, and Assess
De-identifying
Under HIPAA, ________ data is no longer considered PHI and may be used for statistics-based research. But the process of ___________ personal data is considered a "use" of the data and thus falls under HIPAA guidelines & protections