Big Data Final


Tokenization example: "The first chapter is good, but the rest is terrible and confusing. It's a bad idea to buy this book." What is the score?

"Good": +2 "terrible": -3 "confusing": -2 "bad":-2 Total= -5 We miss the positive comment b/c it is negated by the other comments

CC for Big Data?

-Access to large infrastructure that is hard to operate yourself -Bursty workloads benefiting from pay-as-you-go Ex: Amazon S3 and EC2; Google Compute Engine Vertical apps: Salesforce, Splunk, Tableau

Clouds & Big Data: Benefits

-Access to reliable distributed storage (hard to do alone) -Elasticity for large computations (100 nodes for 1 hour) -Data sharing across tenants (e.g. public data sets)

CC Service Examples

-Amazon EC2: virtual machine hosting, IaaS, Public, Standard -Rackspace: virtual machine hosting, IaaS, Private, Standard -Amazon RDS: hosted MySQL, Oracle; PaaS, Standard -Amazon DynamoDB: key-value store, PaaS, Proprietary -Tableau Online: visualization etc., SaaS, Public, Proprietary

Detailed use of Sentiment Analysis (2)

-Applied to social media reviews to help mktg and customer service teams ID customers' feelings -Measure impact of a new product, ad campaign, or news on social media -Used to determine "level of urgency" in customer service requests: proactive ID of frustrated users

Steps to improve Data Quality Mgmt: Improving Data capture

-Automate data entry as much as possible -manual data entry should be selected from preset options (no free fields) -use trained operators if possible -follow good user interface design principles -immediate data validations for entered data

Steps to improve Data Quality Mgmt: Busn Buy In

-C-suite sponsorship - building a business case (ROI analysis) - avoidance of cost - avoidance of opportunity loss

R "Atomic" Classes

-Character -Numeric (real #s) -Integer -Complex -Logical (True/False)
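
A minimal base-R sketch (not part of the original card) showing the class of each atomic type:

```r
class("a")      # "character"
class(2.5)      # "numeric" (real numbers)
class(2L)       # "integer" (the L suffix marks an integer literal)
class(1 + 2i)   # "complex"
class(TRUE)     # "logical"
```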

CC Types: Software as a Service (SaaS)

-Complete user-facing apps -ex: Splunk Storm, Tableau Online

CC Types: Platform as a Service (PaaS)

-Developer facing services & abstractions that are higher level than raw machines -ex: hosted DBs (Amazon RDS), MapReduce

CC Benefits for Providers

-Economies of Scale: share expertise & resources across many customers; lower costs per user due to scale -Fast Deployment: compared to traditional software sales lifecycle, new features reach users directly -Optimization across users

Causes of Poor Data Quality

-External Data sources: lack of control over data quality -Redundant data storage and inconsistent metadata: proliferation of databases with uncontrolled redundancy & metadata -Data Entry: Poor data capture controls -Lack of Organizational Commitment: not recognizing poor data quality as an organizational issue

CC Benefits for Users

-Fast Development: can start up in minutes -Outsourced Mgmt-provider handles admin, reliability, security etc -Lower Costs: benefits from Economies of Scale of providers; only pay for resources while in use -Elasticity: easy to acquire lots of infrastructure for a short period

Summary of Sentiment Analysis use

-Flame detection (bad rants) -New product perception -Brand perception -Reputation Mgmt

Purpose of Data quality

-Minimize IT project risk -make timely business decisions -ensure regulatory compliance

Steps for Data Visualization

-Prep Data: Manage data, load, & join data and create calculated columns -Explore Data: Perform ad hoc, interactive data exploration to discover new insights -Design your visuals: Create reports and graphs that visually convey your discoveries -Deliver visualizations: easily share visuals w/ others

Primary File Types

-R Script (.R) -R Markdown Script (.Rmd)

Clouds & Big Data: Challenges

-Security and privacy guarantees -data import and export -lock-in

CC Economics: Economies of Scale

-Small firm: 1 sysadmin for 100 servers = $100K/year → $1K/server/year -Amazon: 1 sysadmin for 10K servers = $100K/year → $10/server/year. Amazon can also buy hardware, power, security, etc. @ scale @ lower prices

Data Governance Reqs

-Sponsorship from Sr. Mgmt & busn units -Data steward mgr to support, train, & coordinate data stewards -Data stewards for diff busn units, subjects/systems -Governance committee to provide data mgmt guidelines and standards

Steps to improve Data Quality Mgmt: Data Quality Audit

-Statistically profile all data files -Doc the set of values for all fields -Analyze data patterns (distributions, outliers, frequency) -Verify whether controls and busn rules are enforced -Use specialized data profiling tools

Key Features of R

-Syntax is similar to S -Runs on almost all standard platforms (even PS3) -Frequent releases (annual fixes) -Active user community and development (Stack Overflow) -Quite lean; modular packages provide many functionalities -Sophisticated graphics capabilities -Interactive tool as well as programming language -It's FREE

Characteristics of Quality Data

-Uniqueness -Conformance -Timeliness -Completeness -Accuracy -Referential Integrity -Currency -Consistency

More Tokenization challenges

-how do we treat new words -domain specific terms

Public Cloud

-shared by multiple tenants from the general public

Steps to improve Data Quality Mgmt: apply data principles and tech

-Use software tools for analyzing and correcting data quality problems: pattern matching, fuzzy logic, expert systems -Sound data modeling and database design

Does SA work?

-The wider the range of emotions, the less accurate it will be (best w/ binary: yes/no, +/-, etc.) -Works best for simple responses like thumbs up/down, Facebook's "Like" reactions, star ratings, etc.

ETL (extract, transform, load) Process

1. Capture/Extract 2. Scrub or data cleansing 3. Transform 4. Load and Index Done during initial load of Enterprise DW & any subsequent periodic updates to EDW

Process of Tokenization

1. Create a sentiment reference dictionary (e.g. "like" +1, "good" +2, "bad" -2, "sucks" -3) 2. Break the string into words 3. Analyze the words 4. Compute the score. Examples: "This is a GOOD book" → +2 → positive; "This is a GOOD book! I LIKE it" → +3 → more positive; "This is a BAD book!" → -2 → negative; "The first chapter is GOOD, but the rest SUCKS" → -1 → negative
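
A minimal R sketch of this dictionary-based scoring (the word weights are the illustrative values above, not a standard lexicon):

```r
sentiment_dict <- c(like = 1, good = 2, bad = -2, sucks = -3,
                    terrible = -3, confusing = -2)

score_sentence <- function(text) {
  tokens <- unlist(strsplit(tolower(text), "[^a-z']+"))  # break string into words
  sum(sentiment_dict[tokens], na.rm = TRUE)              # unknown words count as 0
}

score_sentence("This is a GOOD book")                            # 2
score_sentence("The first chapter is GOOD, but the rest SUCKS")  # -1
```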

Things Data Scientists Do

1. Define question 2. Define "ideal" data set 3. Determine what data can be accessed 4. Obtain the data 5. Clean the data 6. Exploratory data analysis 7. Statistical prediction/modeling 8. Interpret results 9. Challenge results 10. Synthesize/write up results 11. Create reproducible code 12. Distribute results to others

Steps to improve Data Quality Mgmt

1. Get business & mgmt buy in 2. perform Data quality audit 3. establish data stewardship program 4. improve data capture process 5. apply modern data mgmt principles and tech 6. Apply TQM (total quality mgmt) practices

Challenges of Sentiment Analysis

1. How does a machine subjectively define sentiment? 2. How does....define/analyze polarity (+/-)? 3. How....deal w/ subjective word senses? 4. How...assign an opinion rating? 5. How....define sentiment intensity? Essentially: hard to differentiate between fact and opinion

R-Data Frame

A table structure in memory, created when an external file (e.g. a CSV) is read, so R programs can access the data via the in-memory table directly without having to go back to the hard disk
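
A minimal sketch of the idea; "reviews.csv" and the "rating" column are hypothetical names:

```r
reviews <- read.csv("reviews.csv")   # one disk read builds the in-memory table
class(reviews)                       # "data.frame"
reviews$rating                       # later access uses memory, not the hard disk
```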

CC Economics: Cost Associativity

Associativity: A × B = B × A. For cloud: 100 servers for 1 hour costs the same as 1 server for 100 hours. Result: for parallel workloads, you can get the answer faster; same CPU cycles/$, but more productivity/$

S-Language

Basis of R; a statistical programming language developed by John Chambers at Bell Labs in 1976

Data Visualization Perspective Risk

Be aware of false perspective from charts. How not to lie: -show the entire scale -show data in context -consistent linear scale (log scale for log data) -use visual variables appropriate to the data: don't falsely imply ordinal/quantitative relationships -avoid size encoding: use height or width (not both); avoid area/volume encoding

Techniques for Data Integration: Consolidation (ETL)

Consolidating all data into a centralized DB (like a data warehouse) ETL= Extract, Transform, Load

Data Integration (DI)

Data integration creates a unified view of business data Other possibilities include: -application integration -business process integration -user interaction integration Any approach requires changed data capture (CDC) -indicates which data have changed since previous data integration activity

Reconciled Data Layer (after ETL)

Data should be: -detailed; not summarized yet -historic-periodic -normalized: 3rd normal form or higher -comprehensive- enterprise-wide perspective -timely: data should be current enough to assist decision making -quality controlled: accurate w/ full integrity

Visualization

Data + Visualization = Value. "A picture is worth 1000 words." Vision has far higher bandwidth than thought: -react to an image much faster than by thinking -perceive and remember details w/o thinking. Leverages pre-attentive processing: information we process w/o thinking

Data Visualization Tools

-Datameer: native Hadoop analytics platform (Cloudera, MapR, IBM, Amazon, Apache) -Tableau -SAS Visual Analytics -TIBCO Data Connector

CC Economics-Statistical Multiplexing

Different variable workloads peak at different times, making the sum more predictable and therefore making usage more efficient -Providers can use "idle resources" for their own needs so capacity isn't "wasted"

Master Data Mgmt (MDM)

Disciplines, technologies, & methods to ensure the currency, meaning, and quality of reference data within and across various subject areas

R System Breakdown

Divided into 2 conceptual parts 1. "Base" R system (downloadable from CRAN) 2. Everything else

Data Governance

High-level organizational groups and processes overseeing data stewardship across the org

Importance of Data Quality

If data is bad, business fails. Period. -GIGO: garbage in, garbage out -SOX: compliance by law sets data & metadata quality standards

R Created by...

In 1991 by Ross Ihaka and Robert Gentleman

(SA) Ambiguous Language

It is hard to decipher intent without context: -Non-obvious negativity: "As much use as a trapdoor on a lifeboat" -Comparisons: "Canon > Fisher Price" -Slang: "imo ice cream is luuurvvly" (hard to process informal communication)

What do we miss in SA

May miss granularity in comments due to one side cancelling out the other

R Objects

Most basic objects: -Vector: can only contain values of the same class -List: can contain vectors and values of different classes -Data frame: a table structure in memory created when an external file is read
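
A short sketch contrasting the three (values chosen arbitrarily):

```r
v  <- c(1, 2, 3)                # vector: all elements share one class (numeric)
c(1, "a", TRUE)                 # mixing classes in a vector coerces everything to character
l  <- list(1, "a", TRUE)        # list: each element keeps its own class
df <- data.frame(x = 1:3,       # data frame: in-memory table with typed columns
                 y = c("a", "b", "c"))
```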

Sentiment Analysis is used for

Mostly Binary Decisions -For/Against -Like/Dislike -Good/Bad

Access Interfaces

Open: standard across vendors and even on-premises systems. Proprietary: specific to one vendor

Sentiment Analysis AKA

Opinion Mining-includes IDing customer responses, ie: -attitudes -emotions -opinions of company's product, brand, or service

DI (+/-): Federation (EII)

PROs -Data is always current -simple for the calling application -works well for read only apps b/c only requested data needs to be retrieved -ideal when copies of source data are not allowed CONs -heavy workloads are possible for each request due to performing all integration tasks for each request -write access to data sources may not be supported

DI (+/-): Propagation (EAI/ERD)

PROs -data available in real time -possible to work with ETL for real-time DW-ing -Transparent access available to the data source CONs -there is considerable (but background) overhead associated with synchronizing duplicate data

DI (+/-): Consolidation (ETL)

PROs -users are isolated from conflicting workloads on source systems, especially updates -possible to retain history, not just current values -data store designed for specific reqs can be accessed quickly -works well when scope of data needs are known in advance -data transformations can be batched for greater efficiency CONs -network, storage, & data maintenance costs can be high -performance can degrade when the DW becomes too large

R-Programming

Programming language and software environment for statistical computing and graphics. Focuses mostly on statistics and data mining, used to develop statistical software and perform data analysis; competes w/ Python

Downloadable from

R software and packages: https://www.r-project.org/ R-Studio: https://www.rstudio.com/products/rstudio/download/

CC Types: Infrastructure as a Service (IaaS)

Raw computing resources; ex virtual machines, disks

Steps to improve Data Quality Mgmt: Stewardship prog

Roles: -oversight of data stewardship program -manage data subject area -oversee data definitions -oversee production of data -oversee use of data Report to: business unit vs IT org?

Steps to improve Data Quality Mgmt: TQM principles

TQM: -Defect Prevention -continuous improvement -use of enterprise data solutions -strong foundation of measurement

Cloud Computing

The practice of using a network of remote servers hosted on the Internet to store, manage, and process data; on demand resources; pay as you go

SA Method: Machine Learning ex

Training data (labeled) → model: "This is a good book" → positive; "This is an awesome book" → positive; "this is a bad book" → negative; "this is a terrible book" → negative. New data → trained model → prediction: "This is a good article" → positive; "This is an awesome time" → positive; "This is a bad article" → negative

Single Field Transformation

Translates data from old form to new -Algorithmic: transformation uses a formula or logical expression -Table Lookup: uses a separate table keyed by source record code
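A hedged R sketch of both variants on made-up data:

```r
old <- data.frame(price_cents = c(199, 2500), country_code = c("US", "DE"))

# Algorithmic: new field computed by a formula
old$price_dollars <- old$price_cents / 100

# Table lookup: separate table keyed by the source record code
lookup <- data.frame(country_code = c("US", "DE"),
                     country_name = c("United States", "Germany"))
merge(old, lookup, by = "country_code")
```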

Mineable Web Assets?

Twitter Facebook LinkedIn Google+ Web pages Blogs

Reconciled Data Layer (before ETL)

Typical operational data is: -transient-not historical -not normalized (perhaps due to denormalization for performance) -restricted in scope; not comprehensive -sometimes poor quality-inconsistencies and errors

SA Method: Machine Learning

Use training data with labels-> feed into model so it understands how to read the data->feed real data into trained model-> prediction
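A toy R sketch of this train-then-predict workflow; word overlap with the labeled sentences stands in for a real learner (e.g. Naive Bayes or SVM), and the sentences are the ones from the example card above:

```r
train <- data.frame(
  text  = c("This is a good book", "This is an awesome book",
            "this is a bad book",  "this is a terrible book"),
  label = c("positive", "positive", "negative", "negative"),
  stringsAsFactors = FALSE)

words <- function(s) unlist(strsplit(tolower(s), "[^a-z]+"))

predict_label <- function(new_text) {
  # label of the training sentence that shares the most words with the new text
  overlap <- sapply(train$text,
                    function(t) length(intersect(words(t), words(new_text))))
  train$label[which.max(overlap)]
}

predict_label("This is a good article")   # "positive"
predict_label("this is a bad article")    # "negative"
```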

Detailed Use of Sentiment Analysis

Used to gather: -Why consumers buy a product -Opinion of customer service (+/-) -Did support meet expectations?

4th V of Big Data

Value = Data Visualization. Volume/Variety/Velocity → data visualization tools → Value

R-Dialect

A variation (dialect) of the S language; it is an open source language, commonly used through an Integrated Development Environment (IDE) such as RStudio

CC Economics-Variable Utilization

With on-site hosting you must provision for the highest possible usage (and still run the risk of over- or under-provisioning); clouds charge @ a much finer granularity, so you can adjust capacity on a pay-as-you-go basis

Recommended add'l packages

boot, class, cluster, codetools, foreign, KernSmooth, lattice, mgcv, nlme, rpart, survival

MDM Architectures: Persistent

central "golden record" maintained; all applications have access; requires applications to push data.

ETL Process: Transform

convert data from format of operational system to format of DW -Record level: Selection-data partitioning, Joining-data combining, Aggregation-data summarization -Field Level: single-field-from one field to one field, multi-field-from many fields to one, or one field to many
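A small R sketch of the record-level transformations on made-up data:

```r
sales <- data.frame(store = c("A", "A", "B"), amount = c(10, 20, 5))

subset(sales, store == "A")                    # Selection: partition by a criterion
aggregate(amount ~ store, data = sales, sum)   # Aggregation: detail rows to summary rows

stores <- data.frame(store = c("A", "B"), region = c("East", "West"))
merge(sales, stores, by = "store")             # Joining: combine sources into a single view
```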

MDM Architectures: Integration Hub

data changes broadcast through central service to subscribing DBs

Sentiment Classification Techniques: support vector machine (SVM)

data points are represented as vectors in an n-dimensional space and separated into 2 classes (most accurate classifier @ ~80%)

Techniques for Data Integration: Data Propagation (EAI/ERD)

duplicate data across DBs, with near real time delay

Sentiment Classification Techniques: Pang et al.

found the SVM to be the most accurate classifier

Pre-attentive Variables

length, width, colors, closure, line ends, contrast, tilt, curvature & size

MDM Architectures: Identity Registry

master data remains in source systems; the registry provides applications with its location

ETL Process: Capture/Extract

obtaining a snapshot of a chosen subset of the source data for loading into the DW -Static extract: capturing a snapshot of the source data at a point in time -Incremental extract: capturing changes that have occurred since the last static extract

Data Steward

person responsible for ensuring that organizational applications properly support the organization's data quality goals

ETL Process: Load/Index

place transformed data into the DW & create indexes -Refresh mode: bulk rewriting of target data at periodic intervals -Update mode: only changes in source data are written to DW

Tokenization example: Count positive and negative

positive: "good" +2 Negative: "terrible" (-3); "confusing" (-2); "bad" (-2): -7 Total= -5

Sentiment Classification Techniques: Naive Bayes

probabilistic classifier using Bayes' Theorem

SA Method: Tokenization

process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens

Record Level Transformation Functions: Normalization

process of decomposing relations with anomalies to produce smaller, well-structured relations

Record Level Transformation Functions: Selection

process of partitioning the data according to predefined criteria

pre-attentive processing

processing of sensory information that occurs before the conscious mind starts to pay attention to any specific objects in its vicinity; happens quickly, effortlessly, and in parallel, w/o any attention being focused on the display

Techniques for Data Integration: Data Federation (EII)

provides a virtual view of data without actually creating one centralized DB

"Base" R system contains

the base package required to run R, containing the fundamental functions; includes: utils, stats, datasets, graphics, grDevices, grid, methods, tools, parallel, compiler, splines, etc.

Record Level Transformation Functions: Joining

the process of combining data from various sources into a single table or view

Record Level Transformation Functions: Aggregation

the process of transforming data from detailed to summary level

Sentiment Analysis (SA)

Use of natural language processing, stats, and text analysis to extract and ID the sentiment of text as positive, negative, or neutral. AKA Opinion Mining

Private Cloud

Used by a single organization for internal workloads; may be hosted either on or off premises

R Data "Head"

useful tool for exploring R data at the beginning of a set

R Data "Tail"

useful tool for exploring R data at the end of a set
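
A quick illustration of both cards using R's built-in mtcars data set:

```r
head(mtcars)      # first 6 rows by default
head(mtcars, 3)   # first 3 rows
tail(mtcars)      # last 6 rows
```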

ETL Process: Scrub/Cleanse

uses pattern recognition & AI techniques to upgrade data quality -Fixing Errors: misspellings, erroneous dates, missing data, duplicate data etc. -Also: Decoding, reformatting, time stamping, conversion, locating missing data

Sentiment Classification Techniques: Maximum Entropy

uses probability distributions on the basis of partial knowledge

Mining Social Web

wealth of unstructured data -who knows whom & commonalities in networks -frequency of communication -which social networks create most value for niche -effects of geography on social connections -most influential/popular people in a social network -social trends-what's trending

