OIM 350 Exam 3

Edward R. Tufte: when quantitative data is displayed in printed form, some of the ink shows the data (data-ink) and some shows non-data visual content

solutions to MapReduce shortcomings

Pig and HIVE

tools/components in the Hadoop ecosystem

Pig, HIVE, Mahout, Sqoop, Flume, YARN, Zookeeper

horizontal scaling

scale-out, adding more computers with fewer processors and less RAM as data servers, much cheaper than vertical scaling, in principle no fixed upper limit on scaling


all operations or none
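
The all-or-nothing rule above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the accounts table, the account names, and the transfer function are hypothetical and chosen for the example.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` between two accounts as one atomic transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            row = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if row[0] < 0:
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
        return True
    except ValueError:
        return False

def balances(conn):
    """Return {account name: balance} for every account."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))

# Hypothetical in-memory database used only for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 0)])
conn.commit()

ok = transfer(conn, "a", "b", 30)       # both updates are applied
failed = transfer(conn, "a", "b", 500)  # the first update is undone by rollback
```

After the failed transfer the balances are unchanged, so the debit never appears without its matching credit.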

Gestalt principles of visual design

connection, continuity, closure, enclosure, figure & ground, proximity, symmetry, similarity


coordination layer for the Hadoop ecosystem - A distributed configuration service, a synchronization service and a naming registry for distributed systems


creating a schema before inserting data into a system, SQL (DDL), explicit standards/governance, relational data, fast retrieval
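
A minimal sketch of the schema-before-insert idea above, using Python's built-in sqlite3 module. The orders table, its columns, and insert_order are hypothetical names made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The schema (DDL) is declared first; every later insert must satisfy it.
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " customer TEXT NOT NULL,"
    " total REAL NOT NULL CHECK (total >= 0))"
)

def insert_order(customer, total):
    """Try to insert one row; return False if it violates the schema."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO orders (customer, total) VALUES (?, ?)",
                (customer, total),
            )
        return True
    except sqlite3.IntegrityError:
        return False

accepted = insert_order("ann", 19.5)   # fits the schema
rejected = insert_order(None, 5.0)     # customer is NOT NULL, so it is refused
row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

Bad records are rejected at write time, which is what keeps later retrieval simple and fast.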


data at rest, volume is increasing exponentially

soft state

data doesn't have to be consistent all the time

relational analysis

scatterplot, trend line, two measures, ex) does the relationship between profit and sales differ by customer segment


select condition on dimension(s), sub-cube, selects and filters data, diced if one or more dimensions are sliced


selection of one dimension of cube, selects and filters data
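
The two selection operations described above (fixing one dimension versus selecting a sub-cube on several dimensions) can be sketched over a list of fact rows. The rows, dimension names, and function names are hypothetical.

```python
# Hypothetical fact rows: region, product, and year are dimensions; sales is the measure.
facts = [
    {"region": "East", "product": "Chairs", "year": 2013, "sales": 10},
    {"region": "East", "product": "Tables", "year": 2014, "sales": 20},
    {"region": "West", "product": "Chairs", "year": 2013, "sales": 30},
    {"region": "West", "product": "Tables", "year": 2014, "sales": 40},
]

def slice_cube(rows, dimension, value):
    """Slice: fix a single dimension to one value."""
    return [r for r in rows if r[dimension] == value]

def dice_cube(rows, conditions):
    """Dice: keep rows whose dimensions fall in the allowed sets (a sub-cube)."""
    return [
        r for r in rows
        if all(r[dim] in allowed for dim, allowed in conditions.items())
    ]

east = slice_cube(facts, "region", "East")
sub_cube = dice_cube(facts, {"region": {"West"}, "year": {2013, 2014}})
```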

Hadoop cluster

set of computers with HDFS and MapReduce

relational data relationships


situational awareness

small, concise, direct, clear, display media customized for specific context

relative comparison

stacked bar, pie, ex) percentage of total net profit break down across customer segments

relational data schema

static (on-write)


streaming framework for transferring large amounts of event-based data to HDFS

relational data type/format



structured tool designed for transferring bulk data between HDFS and structured data stores such as relational databases

big data type/format

structured, semi-structured, unstructured


summarize data, drill hierarchy, further aggregation, coarser granularity, ex. cities to countries
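
A small sketch of the cities-to-countries aggregation mentioned above. The city names, sales figures, and the roll_up helper are made up for illustration.

```python
from collections import defaultdict

# Hypothetical hierarchy: each city belongs to one country.
city_to_country = {"Boston": "USA", "Chicago": "USA", "Lyon": "France"}
sales_by_city = {"Boston": 5, "Chicago": 7, "Lyon": 4}

def roll_up(measures, hierarchy):
    """Aggregate a measure from a finer level to its parent level."""
    totals = defaultdict(int)
    for member, value in measures.items():
        totals[hierarchy[member]] += value
    return dict(totals)

sales_by_country = roll_up(sales_by_city, city_to_country)
```

The result has fewer, coarser cells: one total per country instead of one per city.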


symmetrical elements are seen as a group

example of Japan's 7/11 stores

the Seven-Eleven Japan approach to generating big value from little data relies on providing transparent information to decision makers and setting clear expectations for how they will use it

visual perception

the end product of vision, the way the brain interprets what the eyes see, can be altered by previous experiences, can affect the way you see a situation

design considerations

titles, labels, legends, captions, reference lines, sorted data by dimensions, no distractive marks, consistency


transactions persist even after system crash

how can the New Deal address concerns with big data and privacy

transparency: see what's being collected on you and opt in or opt out, be in control, builds trust with consumers, companies don't have to take on so much security risk with hacks, personal data is a new internet currency so users should be able to share their info at their preference

the New Deal

a set of principles and practices to define the ownership of data and control its flow, companies don't own the data they collect, rebalancing of the ownership of data to the individual whose data is being collected


a visualization is more effective if the information is more readily perceived


a workspace used to assemble a collection of worksheets based on specific analytical objectives

OLAP cube cells

aggregate measures

distribution system hierarchy

aggregation switch, rack switches, nodes

Hadoop business area examples

risk modeling, customer churn analysis, recommendation engine, ad targeting


rotates orientation of data for reporting purposes, moves dimensions from one axis to another
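
Moving a dimension from one axis to the other, as described above, can be sketched on a nested-dict table. The report contents and the pivot function name are hypothetical.

```python
# Hypothetical report: rows are regions, columns are years.
report = {
    "East": {2013: 10, 2014: 20},
    "West": {2013: 30, 2014: 40},
}

def pivot(table):
    """Swap the row and column dimensions of a nested-dict table."""
    rotated = {}
    for row_key, columns in table.items():
        for col_key, value in columns.items():
            rotated.setdefault(col_key, {})[row_key] = value
    return rotated

by_year = pivot(report)  # rows are now years, columns are regions
```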

distributed system

2-level architecture, a system of communication between computers and networks (where nodes are commodity PCs)

Miller's magic number of 7 (plus/minus 2)

7 is the number of chunks of information that a person can hold in working memory at the same time

establish one undisputed source of performance data

Aetna created a single common information system for data across divisions, allowing it to direct that data toward more effective change, if the initial data is inaccurate the company will then want to improve its means of data capture, using the single data source allows for analysis of business processes

visual encoding

Bertin (1967, published by Gauthier-Villars), marks, positions, and retinal variables

basic availability

DB system appears to work most of the time

3 data platforms at comScore

Greenplum database, Greenplum enterprise data warehouse, MapR's Hadoop


Hadoop distributed file system, when data is loaded onto HDFS, it is divided into small blocks


Not only SQL, name first used by Carlo Strozzi (1998) for his file-based DB, systems which allow large quantities of unstructured/semi-structured data to be stored/managed

fault-tolerance: data processing

MapReduce restarts failed tasks: once a failure is detected, the job tracker reassigns the work to a task tracker on a different node, speculative execution

figure & ground

a figure is the element in focus that rests on the ground (element in the background)

context filters

always applied to the data view first, used when you need to improve view performance or create a dependent numerical or top N filter, enables interactivity

filtering data

analysis on narrower data sub-set, drills into detail, can be at data source or at data view


atomicity, consistency, isolation, durability

OLAP cube axes

attributes; discrete-valued, categorical

absolute comparison

bar, tree map, ex) most profitable product category in each month


basic availability, soft state, eventual consistency

big data-processing method


reduce phase

boils all outputs down into a single result set, intermediate results are aggregated under control of the job tracker, which sends the final results back to the client application

why data scientists are important

bring structure to large quantities of formless data and make analysis possible, identify rich data sources, join them with other, potentially incomplete data sources, and show the resulting set, help decision makers shift from ad hoc analysis to an ongoing conversation with data in a competitive world

benchmark comparison


consciously articulate their business rules and regularly update them in response to facts

business rules align the actions of operational decision makers with the strategic objectives of the company, employees must understand the rules and management regularly adjusts them in response to new information, making specific changes to rules allows managers to clearly analyze small deviations from the rules with no need for big data, it is effective to embed more complex rules in software systems for speed and clarity

fault-tolerance: data storage

by default, HDFS maintains 3 copies of each file and these copies are scattered across different computers, system can detect failure via heartbeats, when a node fails, the system keeps on running and data is available from different nodes


cannot be set to null or default

relational data storage environment


pre-attentive processing

certain info can be processed in parallel by the low-level visual system, some visual elements stand out more than others

general framework for a data visualization project

clarify business question, choose analysis/chart types, prepare your data, create your data view(s)

critical design practices

co-locate items that belong together, support comparisons, include supplementary info, aesthetic, no bright colors except for emphasis, varied font sizes, real time monitoring

provide high-quality coaching to employees who make decisions on a regular basis

coaches and counselors provide constant support in following business rules, and a constant reminder to check the data to see whether a change is necessary or whether a complaint is justified

column-oriented DB

collection of column families, semi-structured, high scalability, good for versioning, can't query blob content, not optimized for joins


color hue, size, shape, color value, orientation, texture


combines data fields at an aggregate level, worksheet specific


combines data sheets/tables at row level, data-file-specific

big data relationships



concurrent transactions do not interfere

how HDFS works

data files are split into uniform sized blocks, blocks are split and stored across many computers at once, blocks are replicated across multiple computers, one computer keeps track of which blocks make up a file and where they are stored
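
The steps above can be sketched in a few lines. This is a toy model, not HDFS itself: the function names are hypothetical, the block size is tiny, and the round-robin placement is an assumption made for the example (real HDFS placement also considers racks).

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Returns the bookkeeping a single master node would hold:
    block index -> list of nodes storing a copy.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        b: [nodes[(b + k) % len(nodes)] for k in range(replication)]
        for b in range(num_blocks)
    }

blocks = split_into_blocks(b"abcdefghij", block_size=4)
block_map = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"])
```

Here block_map plays the role of the one computer that tracks which blocks make up the file and where each copy lives.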


data in doubt, quality is uncertain due to inconsistency, incompleteness, ambiguity, deception, latency, model approximations


data in many formats; structured, semi-structured, unstructured


data in motion, generated fast and needs to be processed fast to maximize data's business value

Greenplum enterprise data warehouse

data is historical and aggregated, purpose: long term trends, time span: 1 year

why comScore needs all 3 platforms

event-level data is stored for a short period of time in the Greenplum database, which contains all the data available for the past 50 days, it is then summarized/aggregated/moved into the Greenplum enterprise data warehouse, which is used for analytical functions when looking at trends, data is summarized even further and placed into the Hadoop data warehouse, where comScore can perform analysis over multiple years

document-oriented DB

data stored in nested hierarchies, logical data stored together as a unit, collection of documents, ideal for search, complex to implement, incompatible with SQL


data warehousing package created to facilitate easy data summarization, supports ad-hoc queries and analyses, report dashboards, mechanism to bring structure to unstructured data, popular among data analysts

eventual consistency

data will become consistent at some later time
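
A toy model of the idea above: a write reaches one replica immediately and the others only later. The class names and the explicit sync step are assumptions for illustration, not how any particular NoSQL system is implemented.

```python
class Replica:
    """One copy of the data."""
    def __init__(self):
        self.data = {}

class EventuallyConsistentStore:
    """Writes land on replica 0 and reach the others only on sync()."""

    def __init__(self, n_replicas=3):
        self.replicas = [Replica() for _ in range(n_replicas)]
        self.pending = []  # updates not yet propagated

    def write(self, key, value):
        self.replicas[0].data[key] = value
        self.pending.append((key, value))

    def read(self, key, replica=0):
        return self.replicas[replica].data.get(key)

    def sync(self):
        """Propagate pending updates in order; afterwards all replicas agree."""
        for key, value in self.pending:
            for r in self.replicas[1:]:
                r.data[key] = value
        self.pending.clear()

store = EventuallyConsistentStore()
store.write("cart", ["book"])
stale = store.read("cart", replica=2)   # update has not arrived at replica 2 yet
store.sync()
fresh = store.read("cart", replica=2)   # replica 2 now matches replica 0
```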

data-ink ratio

data-ink / total ink (or pixels) used in the graphic
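
As a worked example of the ratio above, here is a small helper. The function name and the ink amounts are hypothetical; in practice "ink" would be estimated by some unit such as pixels.

```python
def data_ink_ratio(data_ink, total_ink):
    """Share of a graphic's ink (or pixels) that presents data."""
    if total_ink <= 0:
        raise ValueError("total_ink must be positive")
    if not 0 <= data_ink <= total_ink:
        raise ValueError("data_ink must lie between 0 and total_ink")
    return data_ink / total_ink

before = data_ink_ratio(30, 120)  # chart with heavy gridlines and borders
after = data_ink_ratio(30, 40)    # same data with non-data ink trimmed
```

Removing non-data ink leaves the numerator unchanged and shrinks the denominator, so the ratio rises.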

NoSQL design

designed to handle high level of reads/writes while scaling horizontally

big data storage environment



distributed and scalable library with common machine learning algorithms on Hadoop

big data schema

dynamic (on read)


elements arranged on a line or curve are related


elements close together are seen as a group


elements with the same visual characteristics are seen as a group


elements within the same region are seen as a group

four practices a company can follow for evidence-based decision making

establish one undisputed source of performance data, give decision makers at all levels near-real-time feedback, consciously articulate their business rules and regularly update them in response to facts, provide high-quality coaching to employees who make decisions on a regular basis

Mackinlay design criteria

expressiveness and effectiveness


a set of facts is expressible in a visual language if the visualization expresses all of the facts in the data set, and only those facts

characteristics of a Hadoop cluster

fault-tolerance, parallelism, data locality, horizontal scaling

why is it challenging to find professionals that can effectively work with big data

for important decisions, the people relied on are typically high up in the organization, or they're expensive outsiders brought in because of their expertise and track records (highest-paid person's opinion), when it comes to knowing which problems to tackle, domain expertise remains critical, more important are skills in cleaning and organizing large data sets and in visualization tools and techniques, expertise in the design of experiments can help cross the gap between correlation and causation, the best data scientists are also comfortable speaking the language of business and helping leaders reformulate their challenges in ways that big data can tackle, the Hadoop skill set is new to many IT departments

panel data

from 2 million internet users who gave permission for comScore to use passive measurement, full online browsing, transaction behavior, behavior of computer: location, who used it, when it entered or left website, actual number of ads delivered to each computer and how much of online purchasing was immediate or delayed

perceptual data

from panel data, survey results

how comScore made big data more consumable/less overwhelming for customers

gave clients 4-5 trends/insights and helped them digest those, supported self-service with software tools with graphics/dashboards/wizards, Campaign Essentials: real-time/in-flight information to customers

relational data volume


DataNodes- slaves

handles block integrity, stores/provides direct access to HDFS blocks, sends heartbeats to NameNode every 3 seconds
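
Heartbeat-based failure detection, as mentioned above, can be sketched as follows. The 30-second timeout, the node names, and both function names are hypothetical values chosen for the example; only the 3-second heartbeat interval comes from the notes.

```python
DEAD_AFTER = 30  # hypothetical timeout in seconds, chosen only for this sketch

def dead_nodes(last_heartbeat, now, timeout=DEAD_AFTER):
    """Return the nodes whose last heartbeat is older than `timeout` seconds."""
    return sorted(n for n, t in last_heartbeat.items() if now - t > timeout)

def under_replicated(block_map, dead, replication=3):
    """List blocks that have fewer than `replication` live copies."""
    return [
        block for block, nodes in block_map.items()
        if len([n for n in nodes if n not in dead]) < replication
    ]

# Time of the most recent heartbeat received from each DataNode.
last_seen = {"n1": 100, "n2": 99, "n3": 60, "n4": 98}
dead = dead_nodes(last_seen, now=101)
needs_copy = under_replicated(
    {"b0": ["n1", "n2", "n3"], "b1": ["n1", "n2", "n4"]}, dead
)
```

Blocks listed in needs_copy are the ones a master would re-replicate onto healthy nodes.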

key-value DB

handles massive load and reads (no joins), keys access opaque data blobs, values can contain any type of data, scalable, simple API, can only query based on keys
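
A minimal in-memory sketch of the access pattern above: values are opaque blobs and the key is the only lookup path. The class and method names are hypothetical and do not mirror any specific product's API.

```python
class KeyValueStore:
    """Toy key-value store: put, get, and delete by key, nothing else."""

    def __init__(self):
        self._data = {}

    def put(self, key, value: bytes):
        # The store never inspects the value; it is just bytes.
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)

kv = KeyValueStore()
kv.put("user:1", b'{"name": "ann"}')
found = kv.get("user:1")
missing = kv.get("user:2")
```

There is no way to ask for "all users named ann" without reading every value, which is the trade-off behind the simple API.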

big data

high volume, velocity and variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making

data scientist

high-ranking professional with the training and curiosity to make discoveries in the world of big data

data-ink goal

highest possible data-ink ratio, data looks better naked, ink should change as data changes

distribution analysis

histogram, one measure and frequency of that measure, ex) how many orders a company processes each day in each region

big data scaling method

horizontal (scale-out)

speculative execution

if a task tracker is running slowly, job tracker can redundantly execute another instance of the same task
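
The redundant-execution idea above can be illustrated with a deterministic simulation. The durations, the slow_after threshold, and the function name are hypothetical; a real job tracker decides this from progress reports rather than known durations.

```python
def run_with_speculation(primary_duration, backup_duration, slow_after):
    """Simulate one task that starts at time 0.

    If the primary copy is still running at time `slow_after`, a backup copy
    is launched and whichever copy finishes first wins.
    Returns (winner, completion_time).
    """
    primary_done = primary_duration
    if primary_done <= slow_after:
        return "primary", primary_done       # fast enough, no backup launched
    backup_done = slow_after + backup_duration
    if backup_done < primary_done:
        return "backup", backup_done
    return "primary", primary_done

slow_case = run_with_speculation(100, 20, slow_after=30)
fast_case = run_with_speculation(25, 20, slow_after=30)
```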


in a complex arrangement of elements, we look for a single, recognizable pattern

data scientist skills

identify/join/clean rich data sources, communicate findings/make recommendations, hybrid of a data hacker, analyst, communicator, and trusted advisor, ability to write code, intense curiosity, strong social skills, solid foundation in math, statistics, probability, and computer science, feel for business issues and empathy for customers, creative in displaying information visually through patterns, turn unstructured data into structured


inserting data without applying any schema, programming or non-SQL query language, big data, high flexibility, fast loading
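
A minimal sketch of loading without a schema and applying one only at query time, as described above. The raw_store list, the record fields, and the function names are hypothetical.

```python
import json

raw_store = []  # hypothetical landing area: records are kept exactly as received

def load(record_text):
    """Append without validation, so loading is fast and flexible."""
    raw_store.append(record_text)

def read_as(fields):
    """Apply a schema only when reading; missing fields become None."""
    rows = []
    for text in raw_store:
        record = json.loads(text)
        rows.append({f: record.get(f) for f in fields})
    return rows

load('{"user": "ann", "clicks": 3}')
load('{"user": "bob", "device": "phone"}')  # different shape, still accepted
rows = read_as(["user", "clicks"])
```

Records with different shapes load without complaint; the cost is that each reader has to decide how to interpret them.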

relational data-processing method


map phase

job tracker breaks the processing task into small pieces of computation and sends each piece to a task tracker, task trackers process their individual pieces in parallel under global control of the job tracker

components of MapReduce

job tracker on master node and task tracker on slave node

types of NoSQL DBs

key-value, graph, document-oriented, column-oriented

trend analysis

line, ex) do all product categories have similar year-to-year changes in sales performance

strategic data

loyalty card/in-store purchase data: ties online advertising campaigns to offline in-store purchases, comes from strategic partners

NameNode- master

manages metadata; list of files, list of blocks for each file, list of DataNodes for each block, file attributes, monitors DataNode health, replicates blocks on DataNode failure, balances space/access speed

MapReduce shortcomings

many queries/computations need multiple programs, programmers are not used to the API, common query operations are coded line-by-line

phases of MapReduce

map and reduce

mapping data to chart

marks are geometric primitives, channels control the appearance of marks, ex. line and texture

how to arrange disparate data in a way that makes sense and support its efficient perception

maximize data-ink ratio, follow Gestalt principles

graph DB

modeling data structure as a series of nodes/relationships/properties, ideal when relationships between data are key, fast network search, automate joins with public data, poor scalability, specialized query language

dashboard display

most important info, single screen, monitored at a glance, situational awareness

data locality

move computation to where data is stored, a job tracker divides up tasks based on location of data and tries to map tasks on same machine as physical data block, or at least same rack
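
The placement preference above (same machine first, then same rack) can be sketched as a small scheduling function. The node and rack names and the pick_node function are hypothetical; this is an illustration, not Hadoop's scheduler.

```python
def pick_node(block_locations, free_nodes, rack_of):
    """Choose where to run a map task for one block.

    Preference order: a free node that holds the block, then a free node in
    the same rack as a replica, then any free node.
    """
    if not free_nodes:
        raise ValueError("no free nodes available")
    for node in free_nodes:
        if node in block_locations:
            return node, "node-local"
    replica_racks = {rack_of[n] for n in block_locations}
    for node in free_nodes:
        if rack_of[node] in replica_racks:
            return node, "rack-local"
    return free_nodes[0], "off-rack"

# Hypothetical cluster layout: two racks with two nodes each.
rack_of = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2"}

local = pick_node(["n1"], ["n3", "n1"], rack_of)
same_rack = pick_node(["n1"], ["n3", "n2"], rack_of)
remote = pick_node(["n1"], ["n3", "n4"], rack_of)
```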


open source software framework designed for storage and processing of large-scale data (big data) on clusters of commodity hardware

4 sources of comScore's online data

panel data, census data, perceptual data, strategic data

Greenplum database

parallel processing database for event level analysis, purpose: full detail analysis, structure: massively parallel processing, storage: 200 servers, 2 trillion rows, time span: event level data for 50 days, loaded hourly

big data volume


census data

place sensors on major websites to track behavior, mobile phones


platform created to analyze data without writing MapReduce programs, good for ETL, not good for ad-hoc querying, preferred by data scientists and programmers


points, lines, areas

how to make the most important data stand out from the rest

pre-attentive processing


programming model that pairs with HDFS, data-processing task divided into two phases, divide and conquer processing model for parallel computation
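
The two-phase, divide-and-conquer model above is often illustrated with a word count. The sketch below runs sequentially in one process purely to show the data flow; the function names are hypothetical, and on a real cluster the map and reduce calls would run in parallel on different nodes.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Group intermediate values by key before reducing."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)

def word_count(chunks):
    """Run map over every chunk, group by key, then reduce each key."""
    intermediate = [pair for chunk in chunks for pair in map_phase(chunk)]
    return dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())

counts = word_count(["big data big", "data lake"])
```

Each chunk is mapped independently of the others and each key is reduced independently, which is what makes both phases parallelizable.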

MapR's Hadoop

purpose: longer history and richer data types, data storage: 4.4 petabytes, data structure: 230-node cluster, time span: years

why practices a company can follow for evidence-based decision-making are important

rather than investing in high end analytic systems, businesses need to know how to properly analyze the data they already have, evidence based decision making can improve business performance/profits, empower employees

relational data updates

read and write many times, updates allowed

give decision makers at all levels near-real-time feedback

regular scorecards clarify individual accountability and provide consistent feedback so that individuals know how they are doing, scorecard focuses on information that individuals can actually control and improve on, not summations of company wide profits

relational data integrity

relatively high (ACID)

big data integrity

relatively low (BASE)


values are processed independently, map functions run in parallel, creating different intermediate values from different input data sets, reduce functions can also run in parallel, each working on a different output key

relational data scaling method

vertical (scale-up)


vertical, horizontal, spatial

attributes that can be used to influence pre-attentive processing

visual encoding, design considerations, critical design practices, Mackinlay design criteria


visually connected elements are related

law of Pragnanz (simplicity)

we tend to perceive and interpret ambiguous or complex images in the simplest form possible

big data updates

write once read many times (WORM), updates not allowed


yet another resource negotiator, resource management layer for the Hadoop ecosystem, a central platform to deliver consistent operations, security, and data governance tools across Hadoop clusters

