big data exam 2

K-NN algorithm

1. Choose the number K of neighbors 2. Take the K nearest neighbors of the new data point, according to the Euclidean distance 3. Among these K neighbors, count the number of data points in each category 4. Assign the new data point to the category with the most neighbors (see the sketch below)
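
A minimal sketch of these steps in plain NumPy; the toy data, labels, and value of k below are invented for illustration, not part of the card.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k):
    # Step 2: Euclidean distance from the new point to every training point
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Steps 2-3: take the k nearest neighbors and count each category
    nearest = y_train[np.argsort(dists)[:k]]
    labels, counts = np.unique(nearest, return_counts=True)
    # Step 4: assign the category with the most neighbors
    return labels[np.argmax(counts)]

# Invented toy data: two 2-D clusters labeled 0 and 1
X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X, y, np.array([7, 8]), k=3))  # -> 1
```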

pre-processing

Organize and Integrate.

statistics

and data mining both look for relationships within data •In statistics, we first make a hypothesis and then collect sample data to test that hypothesis •In data mining, we have a loosely defined discovery statement and use all the data available to find novel patterns

what are the different types of data quality errors?

Typographical & transcription, floating data, implicit and explicit nullness, format conformance, transformation errors, and overloaded attributes

classification

a class of supervised learning algorithms used for predicting categorical variables

regression

a class of supervised learning algorithms used for predicting continuous variables

outlier

a data point that is distant from other data points. Plotting helps check whether it is an error or a rare event in the data (see the sketch below)
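
One hedged way to flag such points numerically before plotting, using the common 1.5×IQR rule of thumb (the sample values are invented):

```python
import numpy as np

values = np.array([10, 12, 11, 13, 12, 11, 98])  # invented sample with one suspect point
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
# Flag points beyond 1.5 * IQR from the quartiles; plotting (e.g., a boxplot)
# then helps decide whether each flagged point is an error or a rare event
print(values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)])  # -> [98]
```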

data cleansing

a definite process, but that process is flexible •Not all organizations view quality in the same way, so not all organizations clean in the same way •All processes involve first finding errors and then correcting them •General definition: the assessment of data to determine quality failures (inaccuracy, incompleteness, etc.), followed by improving the quality by correcting, where possible, any errors found

big data project

a multi-disciplinary craft that combines people teaming up around an application-specific purpose that can be achieved through a process, plus big data computing platforms, to create a product.

classification

a supervised predictive model that segments data by assigning them to groups that are already defined •examines already classified data and develops a predictive pattern (rule)

business intelligence

a technology-driven process for analyzing data and presenting actionable information to help executives, managers and other corporate end users make informed business decisions.

data cleansing processes

eliminate duplicate records, parsing, standardization, abbreviation expansion, correction, updating missing fields

artificial intelligence (AI)

enabling machines to become "smart"

Capture

includes anything that involves retrieving data: finding, accessing, acquiring, and moving it.

goals of integrate

integrate all data that is essential for our problem •clean the data to address data quality issues •transform the raw data to make it suitable for analysis •includes scaling, transformation, feature selection, dimensionality reduction, and data manipulation
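
As one small illustration of the scaling step, a min-max rescale in plain NumPy (the feature values are invented; libraries such as scikit-learn provide the same via a MinMaxScaler):

```python
import numpy as np

# Two invented features on very different ranges
X = np.array([[150.0, 0.2],
              [200.0, 0.8],
              [400.0, 0.5]])

# Min-max scaling: map each column onto [0, 1] so no feature dominates
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(X_scaled)
```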

Integrate

integration of multiple data sources, cleaning data, filtering data, creating datasets which programs can read and understand, such as packaging raw data using a specific data format.

organize

involves looking at the data to understand its nature, what it means, and its quality and format. Aims for some preliminary exploration in order to gain a better understanding of the specific characteristics of the data

categorical data

labels representing multiple classes that divide variables into groups

where does data come from?

many places, local and remote, in many varieties, structured and un-structured, and with different velocities.

ratio data

measurement variables in physical science, engineering, math. •Examples: Length, time, distance •These variables can be meaningfully added, subtracted, multiplied, and divided

Data Mining: Intersection of Many Disciplines

statistics, mathematical modeling, artificial intelligence, and machine learning

summary statistics

such as mode, mean, median, and standard deviation provide numerical values to describe your data
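
For example, Python's standard statistics module computes all four directly (the sample is invented):

```python
import statistics

data = [2, 3, 3, 5, 7, 10]      # invented sample
print(statistics.mode(data))    # 3
print(statistics.mean(data))    # 5
print(statistics.median(data))  # 4.0
print(statistics.stdev(data))   # sample standard deviation, ~3.03
```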

numerical data

the numeric value of a specific variable

interval data

variables measured on interval scales, therefore we know the order and difference between values. Example: •Temperature intervals, 60-65-70-75 F •Time intervals, 5-10-15-20 minutes

graphing the general trends of data

will show you whether there is a consistent direction in which the values of these variables are moving, such as sales prices going up or down

4 components to a BI system

•Data warehouse for storing and querying data •Business analytics for manipulating, mining and analyzing data •Business performance management (BPM) for monitoring and analyzing performance •User interface (UI) for controlling the system and visualizing data

data in data mining

•Data: a collection of facts obtained as the result of experiences, observations, or experiments. •Data: consist of numbers, letters, words, images, voice recordings... •Data: structured, unstructured, semi-structured •Structured data → data mining algorithms •Unstructured/semi-structured → text mining, web mining •Data → Information → Knowledge

common data cleansing framework

•Define and determine error types •Search and identify error instances •Correct errors •Document error instances and error types •Modify data entry procedures to reduce future errors

general process of knowledge discovery

•Develop an understanding of the business problem •Determine what data are relevant for study •Identify missing data fields, data noise, etc. •Develop a mathematical model to search for patterns of interest (data mining) •Review results to refine model •Use refined model to predict output for set of inputs where output is not yet known •Take action on the discovered patterns

parsing

•Divide according to tokens, groups of characters that have meaning •Look for patterns •e.g., find two spaces in the name field •Therefore, divide the field into first, middle, and last names (see the sketch below)
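
A minimal sketch of that two-space pattern in Python (the field layout is the one described above; anything else is kept unparsed):

```python
def parse_name(name_field):
    parts = name_field.split(" ")
    if len(parts) == 3:            # two spaces -> first, middle, last
        return dict(zip(("first", "middle", "last"), parts))
    if len(parts) == 2:            # one space -> first and last only
        return {"first": parts[0], "middle": None, "last": parts[1]}
    return {"raw": name_field}     # no recognizable pattern; keep as-is

print(parse_name("Mary Ann Smith"))
# -> {'first': 'Mary', 'middle': 'Ann', 'last': 'Smith'}
```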

accuracy

•Does the data correctly reflect what is true? •Should agree with an identified source •May be difficult to detect because of data errors (e.g., misspelled name, transposed numbers in a phone number) •Best controlled when data is entered as close to the source as possible

education

•Everyone in the process should be responsible for ensuring data quality •A good understanding of why data quality is important, and ways to manage quality, is critical •Everyone in the process must be proactive not only with the data of which they are in charge, but anything unusual they may see

format conformance

•Expected format •e.g., date formats differ between countries •Month/day/year •Day/month/year

updating missing fields

•Fill fields that are missing data if reasonable •May be caused by errors in original data

conformity

•For instances of similar/same data: •Same data type •Same format •Same size •e.g., date of graduation is MM/DD/YYYY •Best controlled at the time of data structure creation •Secondary control during the ETL process

data preparation

•Gather relevant data and perform integration processes •Clean data to the extent possible (more later) •Check data for quality (more later) •Transform data into consistent formats, ranges, and aggregations as necessary •Remove unnecessary or redundant data

current limitations & challenges to data mining

•Identification of missing information •Data noise and missing values •Large databases and high dimensionality

Implicit and Explicit nullness

•Is absence of a value allowed? •Implicit nulls: missing allowed •Explicit nulls: What value is to be used if data is missing? •e.g., for a telephone field use (000) 000-0000

prevention

•It can be very difficult to change data so quality data collection methods are necessary •Includes both application and data structure design, e.g., •dropdown boxes to minimize text entries •ranges •data types

data understanding

•Know what data is relevant •Know what data is available or acquirable •Understand the data types (determines analytic technique)

Floating data

•Lack of clarity as to what types of data go into specific fields •Consider Address1 & Address2 fields: which is for the street address, or are both for the street address? •Data placed in the wrong field

prediction

•used to understand the possibility of future values based on past patterns •Many times prediction is used after an explanatory technique has discovered a pattern •Two major types of prediction are classification and regression

steps to start a big data project

1. Define the problem 2. Assess the situation 3. Define the purpose

machine learning

"is a current application of AI based around the idea that we should really just be able to give machines access to data and let them learn for themselves"

descriptive analytics

(alternately reporting analytics or exploratory analytics) refers to knowing what is happening in an organization and understanding some underlying trends and causes of such occurrences •First involves consolidation of data sources •Visualization is key to this exploratory analysis step

data mining

(knowledge discovery from data) •Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data •aka knowledge extraction, pattern analysis, knowledge discovery, information harvesting, pattern searching

ordinal data

codes assigned to objects/events as labels representing the rank order among them. Examples: •Credit score: low/medium/high •Age groups: child/young/middle- aged/elderly

decision support systems

-In the early 1970s, the term Decision Support Systems (DSS) was defined as: Interactive computer-based systems which help decision makers utilize data and models to solve unstructured problems. -Another classic DSS definition: DSS couple the intellectual resources of individuals with the capabilities of the computer to improve the quality of decisions. It is a computer-based support system for management decision makers. -The term DSS in the field of IT is a content-free expression (it means different things to different people) -In practice DSS was used as an umbrella term to describe any computerized system that supports decision making in an organization •A knowledge management system to guide all personnel •Separate support systems for marketing, finance, supply chain

business intelligence (BI)

-a technology-driven process for analyzing data and presenting actionable information to help executives, managers and other corporate end users make informed business decisions. •Computerized decision support for operations -Organizations use BI technologies to make sense of data and to make better decisions. -used as an umbrella term that includes databases, applications, methodologies and other tools used for executive and managerial decision making.

predictive analytics

-aims to determine what is likely to happen in the future. It is based on statistical techniques as well as other more recently developed techniques that fall under the categories of data mining or machine learning. •Can be used to predict risk, customer behavior, customer preferences, future earnings, marketing campaign revenue, product recommendation, etc. •Can be broken down into regression and classification methods. •Use supervised learning to train predictive models. •Techniques such as decision trees, regression, logistic regression, neural networks, text mining, etc.

business understanding

-know what the analysis is for -specific goals tied to potential action are critical -these allow development of a project plan

prescriptive analytics

-seeks to recognize what is going on as well as the likely forecast, and to make decisions to achieve the best performance possible •Historically studied as operations research or management science and has generally been aimed at optimizing the performance of a system •Examples: assigning locations to hundreds of classes with conflicting schedules; finding the shortest paths for UPS drivers to deliver packages; optimizing the thousands of flights scheduled daily at the Atlanta airport

CRISP-DM

1. Business understanding 2. Data understanding 3. Data preparation 4. Model building 5. Testing & evaluation 6. Deployment

big data management cycle

1. Capture 2. Organize 3. Integrate 4. Analyze 5. Act

K-Means algorithm

1. Choose the number K of clusters 2. Select K points at random as the initial centroids (not necessarily from your data set) 3. Assign each data point to the closest centroid; that forms K clusters 4. Compute and place the new centroid of each cluster 5. Reassign each data point to the new closest centroid. If any reassignment took place, go to step 4; otherwise, the clustering is finished (see the sketch below)
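
A minimal NumPy sketch of these steps; for simplicity the initial centroids are drawn from the data itself (the card allows arbitrary points), and it assumes no cluster empties out mid-run:

```python
import numpy as np

def k_means(X, k, rng=np.random.default_rng(0)):
    # Step 2: pick K random points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    while True:
        # Step 3: assign each data point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: compute and place the new centroid of each cluster
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # Step 5: if no centroid moved, no reassignment took place -> finished
        if np.allclose(new_centroids, centroids):
            return labels, centroids
        centroids = new_centroids

X = np.array([[1.0, 1.0], [1.0, 2.0], [8.0, 8.0], [9.0, 8.0]])
print(k_means(X, k=2)[0])  # cluster label for each point
```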

Capture Steps

1. identify data sources: Machine, human, organizational data 2. identify data types: structured, unstructured, semi-structured 3. Transport data from sources to storage platforms

data quality improvement cycle

1. import data 2. merge data sets 3. rebuild missing data 4. standardise data 5. normalise data 6. de-duplicate 7. verify & enrich 8. export data

clustering

an unsupervised learning way to segment data into groups that are NOT previously defined

Understanding the data scale is important because

•it defines appropriate analytic techniques •it prevents inappropriate interpretation of analyses

separation of duties

Data should be collected at a minimum number of places in the organization •Student data should be entered once by the registrar and then used by other areas

big data quality

Data is inherently unclean (typically, the more it is unstructured, the less clean it is). Data can speak volumes but have little to say (noise).

mathematical modeling

and data mining both look to make decisions that achieve the best possible performance of the system •Mathematical modeling can use data mining results as an input to provide optimal actions based on the restrictions and limitations of the system

what are the 6 data quality dimensions?

accuracy, completeness, conformity, consistency, integrity, timeliness

correlation graphs

can be used to explore the dependencies between different variables in the data
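
A quick hedged example with pandas (the data is invented; the scatter call needs matplotlib installed):

```python
import pandas as pd

# Invented data: advertising spend vs. sales revenue
df = pd.DataFrame({"advertising": [10, 20, 30, 40, 50],
                   "sales":       [380, 720, 1080, 1400, 1790]})

print(df.corr())                             # pairwise correlation matrix
df.plot.scatter(x="advertising", y="sales")  # visual check of the dependency
```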

predictive modeling

predict Y (outcome) given X (a set of conditions) •Y is real-valued => regression •sales revenue (y) = (35.202 × advertising (x)) + 21,792 (on average, sales revenue should be at least $21,792 even with no advertising; each dollar spent on advertising increases sales revenue by about $35.20) •Y is categorical => classification •Apply to Grad School (yes, no) = .50*GPA + .15*age + .10*work experience (a worked example of the regression rule follows)
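
Applying the card's regression rule to a hypothetical advertising budget:

```python
# sales revenue (y) = 35.202 * advertising (x) + 21,792, from the card above
def predicted_sales(advertising):
    return 35.202 * advertising + 21_792

print(predicted_sales(0))     # 21792.0 -> baseline revenue with no advertising
print(predicted_sales(1000))  # 56994.0 -> each ad dollar adds ~$35.20
```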

visualization techniques

provide a quick and effective way to look at data •give you an idea of where the hotspots are •what the distribution of the data is •show correlations between variables

business analytics

refers to the application of models directly to business data. It involves the use of DSS tools, especially models, in assisting decision makers.

data mining

serves as the foundation for AI and machine learning

nominal data

simple codes assigned to objects as labels which are not measurements. Examples: •Binomial values: yes/no, true/false •Multinomial: single/married/divorced

example of decision support system (DSS)

•"Runoff Risk Decision Support is a real-time forecasting guidance that gives farmers information about when to apply fertilizers to their fields. Fertilizer application generally occurs during the winter and spring, the riskiest times of year for runoff from rain and snowmelt." •It is not just a simple data visualization! •It runs some mathematical models and algorithms in the background to make accurate forecasting.

what is data mining?

•(knowledge discovery from data) •Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data •Alternative names: knowledge extraction, pattern analysis, knowledge discovery, information harvesting, pattern searching... •Examples: •Certain names are more prevalent in certain US locations (O'Brien, O'Rourke, O'Reilly... in the Boston area) •Group together similar documents returned by a search engine according to their context (e.g., Amazon rainforest, Amazon.com)

documentation

•A data dictionary should be available to anyone in the organization who collects or works with data •Data field names, types, ranges, constraints, defaults •Who is responsible for which data •Where is the data collected •What is the data collection process

how does data mining work?

•A model is a mathematical representation of a pattern •Profit = Revenue - Cost (predictive) •Determine historical revenue and cost averages for a given timeframe •Subtract average cost from average revenue to determine expected profit for a future timeframe •What age group buys the most song downloads? (explanatory) •Classify the number of songs purchased by age group •Compare the results of the classification and choose the highest number •The result is the age group that corresponds to the highest number

deployment

•ACTION!! •May be further testing, refinement, or new business policy/process, etc.

timeliness

•As up to date as possible •Time should be reflected in the data and/or report (e.g., date gathered, time frame of report) •The more timely the data, the more costly and difficult it is to produce •Very context oriented

most common standard processes

•CRISP-DM (Cross-Industry Standard Process for Data Mining) •SEMMA (Sample, Explore, Modify, Model, and Assess) •KDD (Knowledge Discovery in Databases)

correction

•Change data values that are not recognized •Misspellings (auto correct) •Default values to correct data type errors

abbreviation expansion

•Clearly define •INC for incorporated •ST for street •USA for United States of America
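
A minimal lookup-table sketch of the expansions this card defines (real cleansing needs context, e.g. "ST" can also mean "Saint"):

```python
EXPANSIONS = {"INC": "Incorporated", "ST": "Street", "USA": "United States of America"}

def expand(text):
    # Replace each whole token that matches a defined abbreviation
    return " ".join(EXPANSIONS.get(token, token) for token in text.split())

print(expand("123 Main ST USA"))
# -> 123 Main Street United States of America
```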

completeness

•Contains all the required information •Nothing is missing •All data is usable (no errors, all data is "understood") •If data doesn't exist at the time of the analysis, recreation is rarely successful •Best controlled when the planning process for data collection occurs

examples of data mining application

•Customer Relationship Management •Maximize return on marketing campaigns •Banking and Other Financial •Detecting fraudulent transactions •Retailing and Logistics •Optimize inventory levels at different locations •Healthcare •Outbreak analysis •Disease prediction

organization

•Data can be examined for errors more quickly if it is reasonably organized •Sorted by entity (student, store, vendor) •Sorted by date •Sorted by location •Differences stand out among similar data

why data mining? scientific viewpoint

•Data collected and stored at enormous speeds •Remote sensors on a satellite •NASA EOSDIS archives petabytes of earth science data per year •Telescopes scanning the skies •Sky survey data •High-throughput biological data •Scientific simulations •Terabytes of data generated in a few hours •Data mining helps scientists in the automated analysis of massive datasets

Typographical & transcription

•Data entry errors •Misspelling & abbreviation errors •Miskeyed letters

planning

•Data management is a process that must be guided from start to end •Understanding the organization's needs and the data that supports those needs is the place to start •Data structures and data collection should be controlled to best facilitate the needs of the organization

knowledge discovery process

•Data mining plays an essential role in the knowledge discovery process

how does data mining work?

•Data mining uses data to build models that identify patterns or other relationships •Patterns can either explain a relationship: •Most customers between the ages of 18 and 24 prefer downloadable music over CDs •or help predict a future value: •Given the past sales data, we can expect to sell 6000 individual song downloads in the next month to the 18 to 24 age demographic

why data mining? commercial viewpoint

•Lots of data is being collected and warehoused •Web data •Yahoo has petabytes of web data •Facebook has billions of active users •Purchases at department/grocery stores, e-commerce •Amazon handles millions of visits/day •Bank/credit card transactions •Computers have become cheaper and more powerful •Competitive pressure is strong •Provide better, customized services for an edge (e.g., in Customer Relationship Management)

evolution of dss into business intelligence (BI)

•Managers were using DSS tools for some supportive analysis •With technology advances, they became more comfortable with computing and accepted that technology can directly help make intelligent business decisions faster •The concept of BI began to emerge in the 1990s with the rise of the internet and new tools for computer-aided decision making •Over the next decade the term became used more widely, and by 2006 major commercial products and services were being marketed for BI

testing & evaluation

•Often uses a portion of the dataset •Evaluate outcome for reasonableness •Refine the model as necessary •Retest

data cleansing guidelines

•Planning •Education •Organization •Separation of Duties •Prevention •Documentation

big data quality

•Quality software is available that helps with quality issues, particularly consistency. •Data profiling software and data quality dashboards are available to help the user understand data structure, relationships, and content and helps identify inconsistencies. •Provides statistics and informative summaries about data from an information source •Quality is a moving target - less important when doing exploratory data analysis; critical when using data for decision support.

transformation errors

•Reducing a data field's size may truncate existing data •"Jones" becomes "Jon" •Changing a data field's type may change existing data •A date becomes a number

model building

•Select an appropriate technique based on need and data types

eliminate duplicate records

•Sorting method to find duplicates •Problem with finding non-exact duplicates
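
A sketch of the sorting method: after a sort, exact duplicates become adjacent and a single scan removes them. Non-exact duplicates ("Jon Smith" vs. "John Smith") survive this comparison, which is the harder problem the card notes; fuzzy matching such as edit distance is needed there.

```python
records = [("smith", "123 main st"), ("jones", "9 oak ave"), ("smith", "123 main st")]
records.sort()
# Keep the first record, then every record that differs from its predecessor
deduped = [records[0]] + [r for prev, r in zip(records, records[1:]) if r != prev]
print(deduped)  # the duplicate ("smith", "123 main st") appears only once
```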

data mining characteristics/objectives

•Source of data for DM is often a consolidated data warehouse •DM environment is usually a client-server or a Web-based information systems architecture •The miner is often an end user •Data mining tools' capabilities and ease of use are essential (Web, parallel processing, etc.)

classification and regression

•The goal is to discover rules that define whether an item belongs to a particular subset of data, either a real value or a category •Which insurance claims are most likely fraudulent? •What is the level of a student with a certain number of earned credit hours? •What are the demographics of a purchaser? •What is the average income of our customers?

prescriptive analytics

•The goal of these systems is to provide a decision or recommendation for a specific action. •Results may be presented to a decision maker in a report or integrated in an automated system (e.g. airline pricing systems). •Expert systems apply reasoning methodologies (if-then rules) to knowledge in a specific domain to render expert level advice or recommendations.

large scale data is everywhere

•There has been enormous data growth in both commercial and scientific databases due to advances in data generation and collection technologies •New mantra •Gather whatever data you can whenever and wherever possible. •Expectations •Gathered data will have value either for the purpose collected or for a purpose not envisioned.

integrity

•This primarily applies to relationships within the data structure •All data in the structure should be retrievable regardless of where it is •This is why keys are so important in the data structure •Best controlled at the time of data structure development

overloaded attributes

•Too much information in one field •"Joe Smith IRA" •"John Smith in Trust for Mary Smith" •"John and Mary Smith" •Implies they are married

standardization

•Use the legal name instead of a nickname •e.g., use "Robert" for Bob, Rob, Bobby, and Robbie

consistency

•Values are the same/agree between data sets •A patient is male in one data set but listed as pregnant in another data set •A customer's charges are only $10/month for the past five years but show $10,000 for the current month •Best controlled with application- and database-level constraints

unstructured data in data mining

•The vast majority of business data is stored in text documents •Text mining is a process of extracting patterns from large amounts of unstructured data sources such as Word, PDF, TXT, XML and HTML •Sentiment analysis: product reviews on the Amazon website •Speech analysis: the iPhone's Siri

clustering

•an unsupervised learning technique that attempts to create partitions in the data according to some distance metric •By examining the characteristics of each cluster, it may be possible to establish rules for classification •Divides data into different groups •Finds groups that are different from each other AND whose members are similar •Attributes to be clustered are data (not theory) driven •Clusters should be interpreted by someone knowledgeable in the organization •Use clusters to classify new data

