big data exam 2
K-NN algorithm
1. Choose the number of neighbors, K. 2. Take the K nearest neighbors of the new data point, according to the Euclidean distance. 3. Among these K neighbors, count the number of data points in each category. 4. Assign the new data point to the category where you counted the most neighbors.
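A minimal sketch of these four steps in plain Python (standard library only); the sample points, labels, and K value are made up for illustration:

```python
import math
from collections import Counter

def knn_classify(new_point, labeled_points, k):
    """Classify new_point by majority vote of its k nearest neighbors."""
    # Step 2: rank all labeled points by Euclidean distance to the new point
    by_distance = sorted(
        labeled_points,
        key=lambda p: math.dist(new_point, p[0]),  # p = (coords, label)
    )
    # Step 3: count the categories among the K nearest neighbors
    votes = Counter(label for _, label in by_distance[:k])
    # Step 4: assign the category with the most neighbors
    return votes.most_common(1)[0][0]

# Hypothetical 2-D data points: (coordinates, category)
data = [((1, 1), "A"), ((2, 1), "A"), ((8, 9), "B"), ((9, 8), "B")]
print(knn_classify((2, 2), data, k=3))  # -> 'A'
```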
pre-processing
Organize and Integrate.
statistics
and data mining both look for relationships within data •In statistics, we first make a hypothesis and collect sample data to test that hypothesis •In data mining, we have a loosely defined discovery statement and use all the data available to find novel patterns
what are the different types of data quality errors?
Typographical & transcription, floating data, implicit and explicit nullness, format conformance, transformation errors, and overloaded attributes
classification
a class of supervised learning algorithms used for predicting categorical variables
regression
a class of supervised learning algorithms used for predicting continuous variables
outlier
a data point that is distant from the other data points. Plotting will help check for errors or rare events in the data
data cleansing
a definite but flexible process •Not all organizations view quality in the same way, so not all organizations clean in the same way •All processes include first finding errors and then correcting them •General definition: the assessment of data to determine quality failures (inaccuracy, incompleteness, etc.), followed by improving the quality by correcting, as far as possible, any errors found
big data project
a multi-disciplinary craft that combines people teaming up around an application-specific purpose, a process through which that purpose can be achieved, and big data computing platforms, to create a product.
classification
a supervised predictive model that segments data by assigning them to groups that are already defined •examines already classified data and develops a predictive pattern (rule)
business intelligence
a technology-driven process for analyzing data and presenting actionable information to help executives, managers and other corporate end users make informed business decisions.
data cleansing processes
eliminate duplicate records, parsing, standardization, abbreviation expansion, correction, updating missing fields
artificial intelligence (AI)
enabling machines to become "smart"
Capture
includes anything involved in retrieving data: finding, accessing, acquiring, and moving it.
goals of integrate
integrate all data that is essential for our problem •clean the data to address data quality issues •transform the raw data to make it suitable for analysis •includes scaling, transformation, feature selection, dimensionality reduction, and data manipulation
Integrate
integration of multiple data sources, cleaning data, filtering data, creating datasets which programs can read and understand, such as packaging raw data using a specific data format.
organize
involves looking at the data to understand its nature, what it means, its quality, and its format. Aims at some preliminary exploration in order to gain a better understanding of the specific characteristics of the data
categorical data
labels for multiple classes that divide variables into groups
where does data come from?
many places, local and remote, in many varieties, structured and un-structured, and with different velocities.
ratio data
measurement variables in physical science, engineering, math. •Examples: Length, time, distance •These variables can be meaningfully added, subtracted, multiplied, and divided
Data Mining: Intersection of Many Disciplines
statistics, mathematical modeling, artificial intelligence, machine learning, data mining
summary statistics
such as mode, mean, median, and standard deviation provide numerical values to describe your data
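These four summary statistics can be computed directly with Python's standard statistics module; the sample values below are made up:

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample

print(statistics.mode(values))    # 4    (most frequent value)
print(statistics.mean(values))    # 5.0
print(statistics.median(values))  # 4.5
print(statistics.stdev(values))   # ~2.14 (sample standard deviation)
```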
numerical data
the numeric value of a specific variable
interval data
variables measured on interval scales, therefore we know the order and difference between values. Examples: •Temperature intervals, 60-65-70-75 F •Time intervals, 5-10-15-20 minutes
graphing the general trends of data
will show you whether there is a consistent direction in which the values of these variables are moving, like sales prices going up or down
4 components of a BI system
•Data warehouse for storing and querying data •Business analytics for manipulating, mining, and analyzing data •Business performance management (BPM) for monitoring and analyzing performance •User interface (UI) for controlling the system and visualizing data
data in data mining
•Data: a collection of facts obtained as the result of experiences, observations, or experiments. •Data: consist of numbers, letters, words, images, voice recordings... •Data: structured, unstructured, semi-structured •Structured data → data mining algorithms •Unstructured/semi-structured → text mining, web mining •Data → Information → Knowledge
common data cleansing framework
•Define and determine error types •Search and identify error instances •Correct errors •Document error instances and error types •Modify data entry procedures to reduce future errors
general process of knowledge discovery
•Develop an understanding of the business problem •Determine what data are relevant for study •Identify missing data fields, data noise, etc. •Develop a mathematical model to search for patterns of interest (data mining) •Review results to refine model •Use refined model to predict output for set of inputs where output is not yet known •Take action on the discovered patterns
parsing
•Divide according to tokens, groups of characters that have meaning •Look for patterns •e.g., find two spaces in the name field •Therefore, divide the field into first, middle, and last names
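A small sketch of that two-space pattern; the splitting rule below is just the one from this card, not a general name parser:

```python
def parse_name(name_field):
    """Tokenize a name field and split it on the two-space pattern."""
    tokens = name_field.split()  # tokens: groups of characters with meaning
    if len(tokens) == 3:         # two spaces found -> first, middle, last
        first, middle, last = tokens
    elif len(tokens) == 2:       # one space -> no middle name
        first, last = tokens
        middle = ""
    else:
        raise ValueError(f"Unrecognized name pattern: {name_field!r}")
    return first, middle, last

print(parse_name("John Quincy Adams"))  # ('John', 'Quincy', 'Adams')
```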
accuracy
•Does the data correctly reflect what is true? •Should agree with an identified source •May be difficult to detect because of data errors (e.g., a misspelled name, transposed digits in a phone number) •Best controlled when data is entered as close to the source as possible
education
•Everyone in the process should be responsible for ensuring data quality •A good understanding of why data quality is important, and ways to manage quality, is critical •Everyone in the process must be proactive not only with the data of which they are in charge, but anything unusual they may see
format conformance
•Expected format •e.g., date formats differ between countries •Month/day/year •Day/month/year
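Python's datetime module makes the ambiguity concrete: the same string parses to two different dates under the two conventions named above.

```python
from datetime import datetime

raw = "04/05/2024"

us = datetime.strptime(raw, "%m/%d/%Y")  # Month/day/year -> April 5
eu = datetime.strptime(raw, "%d/%m/%Y")  # Day/month/year -> May 4

print(us.date(), eu.date())  # 2024-04-05 2024-05-04
```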
updating missing fields
•Fill fields that are missing data if reasonable •May be caused by errors in original data
conformity
•For instances of similar/same data: •Same data type •Same format •Same size •e.g., date of graduation is MM/DD/YYYY •Best controlled at the time of data structure creation •Secondary control during the ETL process
data preparation
•Gather relevant data and perform integration processes •Clean data to the extent possible (more later) •Check data for quality (more later) •Transform data into consistent formats, ranges, and aggregations as necessary •Remove unnecessary or redundant data
current limitations & challenges to data mining
•Identification of missing information •Data noise and missing values •Large databases and high dimensionality
Implicit and Explicit nullness
•Is absence of a value allowed? •Implicit nulls: missing allowed •Explicit nulls: What value is to be used if data is missing? •e.g., for a telephone field use (000) 000-0000
prevention
•It can be very difficult to change data so quality data collection methods are necessary •Includes both application and data structure design, e.g., •dropdown boxes to minimize text entries •ranges •data types
data understanding
•Know what data is relevant •Know what data is available or acquirable •Understand the data types (determines analytic technique)
Floating data
•Lack of clarity as to what types of data go into specific fields •Consider Address1 & Address2 fields: which is for the street address, or are both for the street address? •Data placed in the wrong field
prediction
•used to understand the possibility of future values based on past patterns •Many times prediction is used after an explanatory technique has discovered a pattern •Two major types of prediction are classification and regression
steps to start a big data project
1. Define the problem 2. Assess the situation 3. Define the purpose
machine learning
"is a current application of AI based around the idea that we should really just be able to give machines access to data and let them learn for themselves"
descriptive analytics
(alternately reporting analytics or exploratory analytics) refers to knowing what is happening in an organization and understanding some underlying trends and causes of such occurrences •First involves consolidation of data sources •Visualization is key to this exploratory analysis step
data mining
(knowledge discovery from data) •Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data •aka knowledge extraction, pattern analysis, knowledge discovery, information harvesting, pattern searching
ordinal data
codes assigned to objects/events as labels representing the rank order among them. Examples: •Credit score: low/medium/high •Age groups: child/young/middle- aged/elderly
decision support systems
-In the early 1970s, the term Decision Support Systems (DSS) was defined as: interactive computer-based systems which help decision makers utilize data and models to solve unstructured problems. -Another classic DSS definition: DSS couple the intellectual resources of individuals with the capabilities of the computer to improve the quality of decisions. It is a computer-based support system for management decision makers. -The term DSS in the field of IT is a content-free expression (it means different things to different people). -In practice DSS was used as an umbrella term to describe any computerized system that supports decision making in an organization. •A knowledge management system to guide all personnel •Separate support systems for marketing, finance, supply chain
business intelligence (BI)
-a technology-driven process for analyzing data and presenting actionable information to help executives, managers and other corporate end users make informed business decisions. •Computerized decision support for operations -Organizations use BI technologies to make sense of data and to make better decisions. -used as an umbrella term that includes databases, applications, methodologies and other tools used for executive and managerial decision making.
predictive analytics
-aims to determine what is likely to happen in the future. It is based on statistical techniques as well as other more recently developed techniques that fall under the categories of data mining or machine learning. •Can be used to predict risk, customer behavior, customer preferences, future earnings, marketing campaign revenue, product recommendation, etc. •Can be broken down into regression and classification methods. •Use supervised learning to train predictive models. •Techniques such as decision trees, regression, logistic regression, neural networks, text mining, etc.
business understanding
-know what the analysis is for -specific goals tied to potential action are critical -these allow development of a project plan
prescriptive analytics
-seeks to recognize what is going on as well as the likely forecast, and to make decisions to achieve the best performance possible. •Historically studied as operations research or management science, and has generally been aimed at optimizing the performance of a system. •Examples: assigning locations to hundreds of classes with conflicting schedules; finding shortest paths for UPS drivers to deliver packages; optimizing the schedule of thousands of daily flights at the Atlanta airport
CRISP-DM
1. Business understanding 2. Data understanding 3. Data preparation 4. Model building 5. Testing & evaluation 6. Deployment
big data management cycle
1. Capture 2. Organize 3. Integrate 4. Analyze 5. Act
K-Means algorithm
1. Choose the number of clusters, K. 2. Select K points at random as the initial centroids (not necessarily from your data set). 3. Assign each data point to the closest centroid; that forms K clusters. 4. Compute and place the new centroid of each cluster. 5. Reassign each data point to the new closest centroid. If any reassignment took place, go to step 4; otherwise the algorithm is finished.
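A minimal one-dimensional sketch of this loop in plain Python; the points, K, and random seed are made up, and real implementations (e.g., scikit-learn's KMeans) add smarter initialization:

```python
import random

def k_means(points, k, seed=0):
    random.seed(seed)
    # Step 2: pick K random initial centroids (here, from the data itself)
    centroids = random.sample(points, k)
    while True:
        # Step 3: assign each data point to its closest centroid
        clusters = [[] for _ in range(k)]
        for x in points:
            nearest = min(range(k), key=lambda i: abs(x - centroids[i]))
            clusters[nearest].append(x)
        # Step 4: recompute each centroid as the mean of its cluster
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        # Step 5: stop once no centroid (and hence no assignment) changes
        if new_centroids == centroids:
            return centroids, clusters
        centroids = new_centroids

points = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
print(k_means(points, k=2))  # two clusters around ~1.5 and ~9.5
```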
Capture Steps
1. identify data sources: Machine, human, organizational data 2. identify data types: structured, unstructured, semi-structured 3. Transport data from sources to storage platforms
data quality improvement cycle
1. import data 2. merge data sets 3. rebuild missing data 4. standardise data 5. normalise data 6. de-duplicate 7. verify & enrich 8. export data
clustering
an unsupervised learning way to segment data into groups that are NOT previously defined
Understanding the data scale is important because
•it defines appropriate analytic techniques •it prevents inappropriate interpretation of analyses
separation of duties
Data should be collected at a minimum number of places in the organization •Student data should be entered once by the registrar and then used by other areas
big data quality
Data is inherently unclean (typically, the more it is unstructured, the less clean it is). Data can speak volumes but have little to say (noise).
mathematical modeling
and data mining both look to make decisions that achieve the best possible performance of the system •Mathematical modeling can use data mining results as an input to provide optimal actions based on the restrictions and limitations of the system
what are the 6 data quality dimensions?
accuracy, completeness, conformity, consistency, integrity, timeliness
correlation graphs
can be used to explore the dependencies between different variables in the data
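A quick sketch with pandas, assuming the data already sits in a DataFrame; the column names and values are hypothetical:

```python
import pandas as pd

# Hypothetical dataset: advertising spend vs. sales revenue
df = pd.DataFrame({
    "advertising": [10, 20, 30, 40, 50],
    "sales":       [390, 740, 1080, 1450, 1790],
})

# Pairwise (Pearson) correlations between all numeric variables;
# a scatter matrix or heatmap of this table is the usual correlation graph
print(df.corr())
```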
predictive modeling
predict Y (outcome) given X (a set of conditions) •Y is real-valued => regression •sales revenue (y) = (35.202 × advertising (x)) + 21,792 (on average, sales revenue should be at least $21,792 even with no advertising; each dollar spent on advertising increases sales by $35.202) •Y is categorical => classification •Apply to Grad School (yes, no) = .50*GPA + .15*age + .10*work experience
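The regression line above translates directly into code; the coefficients are the ones given on this card:

```python
def predicted_sales_revenue(advertising_dollars):
    """Card's fitted line: revenue = 35.202 * advertising + 21,792."""
    return 35.202 * advertising_dollars + 21_792

print(predicted_sales_revenue(0))     # 21792.0 -> baseline with no advertising
print(predicted_sales_revenue(1000))  # 56994.0
```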
visualization techniques
provide a quick and effective way to look at data •give you an idea of where the hotspots are •what the distribution of the data is •show correlations between variables
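A minimal matplotlib sketch of two of these looks, a distribution and a correlation; the data is randomly generated for illustration:

```python
import random
import matplotlib.pyplot as plt

random.seed(0)
x = [random.gauss(50, 10) for _ in range(200)]   # hypothetical variable
y = [xi * 0.8 + random.gauss(0, 5) for xi in x]  # variable correlated with x

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(x, bins=20)     # distribution of the data
ax1.set_title("Distribution")
ax2.scatter(x, y, s=8)   # correlation between two variables
ax2.set_title("Correlation")
plt.show()
```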
business analytics
refers to the application of models directly to business data. It involves the use of DSS tools, especially models, in assisting decision makers.
data mining
serves as the foundation for AI and machine learning
nominal data
simple codes assigned to objects as labels which are not measurements. Examples: •Binomial values: yes/no, true/false •Multinomial: single/married/divorced
example of decision support system (DSS)
•"Runoff Risk Decision Support is a real-time forecasting guidance that gives farmers information about when to apply fertilizers to their fields. Fertilizer application generally occurs during the winter and spring, the riskiest times of year for runoff from rain and snowmelt." •It is not just a simple data visualization! •It runs some mathematical models and algorithms in the background to make accurate forecasting.
what is data mining?
•(knowledge discovery from data) •Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data •Alternative names: knowledge extraction, pattern analysis, knowledge discovery, information harvesting, pattern searching... •Examples: •Certain names are more prevalent in certain US locations (O'Brien, O'Rourke, O'Reilly... in the Boston area) •Group together similar documents returned by a search engine according to their context (e.g., Amazon rainforest, Amazon.com)
documentation
•A data dictionary should be available to anyone in the organization who collects or works with data •Data field names, types, ranges, constraints, defaults •Who is responsible for which data •Where is the data collected •What is the data collection process
how does data mining work?
•A model is a mathematical representation of a pattern •Profit = Revenue - Cost (predictive) •Determine historical revenue and cost averages for a given timeframe •Subtract average cost from average revenue to determine expected profit for a future timeframe •What age group buys the most song downloads? (explanatory) •Classify the number of songs purchased by age group •Compare the results of the classification and choose the highest number •The result is the age group that corresponds to the highest number
deployment
•ACTION!! •May be further testing, refinement, or new business policy/process, etc.
timeliness
•As up to date as possible •Time should be reflected in the data and/or report (e.g., date gathered, time frame of report) •The more timely the data, the more costly and difficult it is to produce •Very context oriented
most common standard processes
•CRISP-DM (Cross-Industry Standard Process for Data Mining) •SEMMA (Sample, Explore, Modify, Model, and Assess) •KDD (Knowledge Discovery in Databases)
correction
•Change data values that are not recognized •Misspellings (auto-correct) •Default values to correct data type errors
abbreviation expansion
•Clearly define •INC for incorporated •ST for street •USA for United States of America
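A clearly defined expansion table is essentially a lookup; the sketch below uses only this card's three examples:

```python
# Expansion table: one clearly defined full form per abbreviation
EXPANSIONS = {
    "INC": "Incorporated",
    "ST": "Street",
    "USA": "United States of America",
}

def expand(text):
    """Replace each known abbreviation with its full form, keeping punctuation."""
    out = []
    for tok in text.split():
        core = tok.rstrip(".,")          # separate trailing punctuation
        tail = tok[len(core):]
        out.append(EXPANSIONS.get(core.upper(), core) + tail)
    return " ".join(out)

print(expand("123 Main ST, Acme INC, USA"))
# -> '123 Main Street, Acme Incorporated, United States of America'
```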
completeness
•Contains all the required information •Nothing is missing •All data is usable (no errors, all data is "understood") •If it doesn't exist at the time of the analysis, recreation is rarely successful •Best controlled when planning process for data collection occurs
examples of data mining application
•Customer Relationship Management •Maximize return on marketing campaigns •Banking and Other Financial •Detecting fraudulent transactions •Retailing and Logistics •Optimize inventory levels at different locations •Healthcare •Outbreak analysis •Disease prediction
organization
•Data can be examined for errors more quickly if it is reasonably organized •Sorted by entity (student, store, vendor) •Sorted by date •Sorted by location •Differences stand out among similar data
why data mining? scientific viewpoint
•Data collected and stored at enormous speeds •remote sensors on a satellite •NASA EOSDIS archives petabytes of earth science data per year •telescopes scanning the skies •sky survey data •high-throughput biological data •scientific simulations •terabytes of data generated in a few hours •Data mining helps scientists in the automated analysis of massive datasets
Typographical & transcription
•Data entry errors •Misspelling & abbreviation errors •Miskeyed letters
planning
•Data management is a process that must be guided from start to end •Understanding the organization's needs and the data that supports those needs is the place to start •Data structures and data collection should be controlled to best facilitate the needs of the organization
knowledge discovery process
•Data mining plays an essential role in the knowledge discovery process
how does data mining work?
•Data mining uses data to build models that identify patterns or other relationships •Patterns can either explain a relationship: •Most customers between the ages of 18 and 24 prefer downloadable music over CDs •or help predict a future value: •Given the past sales data, we can expect to sell 6000 individual song downloads in the next month to the 18 to 24 age demographic
why data mining? commercial viewpoint
•Lots of data is being collected and warehoused •Web data •Yahoo has petabytes of web data •Facebook has billions of active users •purchases at department/grocery stores, e-commerce •Amazon handles millions of visits/day •Bank/credit card transactions •Computers have become cheaper and more powerful •Competitive pressure is strong •Provide better, customized services for an edge (e.g., in Customer Relationship Management)
evolution of dss into business intelligence (BI)
•Managers were using DSS tools for some supportive analysis •With technology advances, they became more comfortable with computing and accepted that technology can directly help make intelligent business decisions faster •The concept of BI began to emerge in the 1990s with the rise of the internet and new tools for computer-aided decision making •Over the next decade the term became used more widely, and by 2006 major commercial products and services were being marketed for BI
testing & evaluation
•Often uses a portion of the dataset •Evaluate outcome for reasonableness •Refine the model as necessary •Retest
data cleansing guidelines
•Planning •Education •Organization •Separation of Duties •Prevention •Documentation
big data quality
•Quality software is available that helps with quality issues, particularly consistency. •Data profiling software and data quality dashboards are available to help the user understand data structure, relationships, and content and helps identify inconsistencies. •Provides statistics and informative summaries about data from an information source •Quality is a moving target - less important when doing exploratory data analysis; critical when using data for decision support.
transformation errors
•Reducing a data field's size may truncate existing data •"Jones" becomes "Jon" •Changing a data field's type may change existing data •A date becomes a number
model building
•Select an appropriate technique based on need and data types
eliminate duplicate records
•Sorting method to find duplicates •Problem with finding non-exact duplicates
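A sketch of the sorting method: sorting puts exact duplicates next to each other so a single pass can drop them; the non-exact duplicates the card mentions survive this pass and need fuzzy matching instead:

```python
def dedupe(records):
    """Sort records so duplicates are adjacent, then keep the first of each run."""
    unique = []
    for rec in sorted(records):
        if not unique or rec != unique[-1]:
            unique.append(rec)
    return unique

rows = [("Jones", "Boston"), ("Smith", "NYC"), ("Jones", "Boston")]
print(dedupe(rows))  # [('Jones', 'Boston'), ('Smith', 'NYC')]
# A non-exact duplicate like ("Jonse", "Boston") would survive this pass;
# detecting it needs fuzzy matching, e.g., edit distance.
```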
data mining characteristics/objectives
•Source of data for DM is often a consolidated data warehouse •DM environment is usually a client-server or a Web-based information systems architecture •The miner is often an end user •Data mining tools' capabilities and ease of use are essential (Web, parallel processing, etc.)
classification and regression
•The goal is to discover rules that define whether an item belongs to a particular subset of data, either a real value or a category •Which insurance claims are most likely fraudulent? •What is the level of a student with a certain number of earned credit hours? •What are the demographics of a purchaser? •What is the average income of our customers?
prescriptive analytics
•The goal of these systems is to provide a decision or recommendation for a specific action. •Results may be presented to a decision maker in a report or integrated in an automated system (e.g. airline pricing systems). •Expert systems apply reasoning methodologies (if-then rules) to knowledge in a specific domain to render expert level advice or recommendations.
large scale data is everywhere
•There has been enormous data growth in both commercial and scientific databases due to advances in data generation and collection technologies •New mantra •Gather whatever data you can whenever and wherever possible. •Expectations •Gathered data will have value either for the purpose collected or for a purpose not envisioned.
integrity
•This primarily applies to relationships within the data structure •All data in the structure should be retrievable regardless of where it is •This is why keys are so important in the data structure •Best controlled at the time of data structure development
overloaded attributes
•Too much information in one field •"Joe Smith IRA" •"John Smith in Trust for Mary Smith" •"John and Mary Smith" •Implies they are married
standardization
•Use legal name instead of nickname •e.g., use "Robert" for Bob, Rob, Bobby, and Robbie
consistency
•Values are the same/agree between data sets •patient is male in one data set, but listed as pregnant in another data set •customer charges only $10/month for the past five years, but shows $10,000 for the current month •Best controlled with application and database level constraints
unstructured data in data mining
•The vast majority of business data is stored in text documents •Text mining is the process of extracting patterns from large amounts of unstructured data sources such as Word, PDF, TXT, XML, and HTML •Sentiment analysis: product reviews on the Amazon website •Speech analysis: the iPhone's Siri
clustering
•an unsupervised learning technique that attempts to create partitions in the data according to some distance metric •By examining the characteristics of each cluster, it may be possible to establish rules for classification •Divides data into different groups •Finds groups that are different from each other AND whose members are similar •Attributes to be clustered are data (not theory) driven •Clusters should be interpreted by someone knowledgeable in the organization •Use clusters to classify new data