OIM 350 Exam 3
data-ink
Edward R. Tufte: when quantitative data is displayed in printed form, some of the ink shows the data itself and some shows non-data visual content; data-ink is the ink devoted to the data
solutions to MapReduce shortcomings
Pig and HIVE
tools/components in the Hadoop ecosystem
Pig, HIVE, Mahout, Sqoop, Flume, YARN, Zookeeper
horizontal scaling
scale-out; adding more computers (each with fewer processors and less RAM) as data servers; much cheaper than vertical scaling and can scale almost without limit
atomicity
all operations or none
Gestalt principles of visual design
connection, continuity, closure, enclosure, figure & ground, proximity, symmetry, similarity
Zookeeper
coordination layer for the Hadoop ecosystem - A distributed configuration service, a synchronization service and a naming registry for distributed systems
schema-on-write
creating a schema before inserting data into a system, SQL (DDL), explicit standards/governance, relational data, fast retrieval
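A minimal schema-on-write sketch using Python's built-in sqlite3 module (the orders table and its columns are illustrative, not from the course): the schema is declared with DDL before any data is loaded, and writes that violate it are rejected.
    import sqlite3

    # Schema-on-write: declare the structure (DDL) before loading any data.
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE orders (
            order_id INTEGER PRIMARY KEY,
            customer TEXT NOT NULL,
            amount   REAL NOT NULL
        )
    """)

    # Inserts must conform to the declared schema.
    conn.execute("INSERT INTO orders VALUES (1, 'Acme Corp', 250.0)")

    # A row that violates the schema (NULL customer) is rejected at write time.
    try:
        conn.execute("INSERT INTO orders VALUES (2, None, 99.0)")
    except sqlite3.IntegrityError as e:
        print("rejected at write time:", e)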
volume
data at rest, volume is increasing exponentially
soft state
data doesn't have to be consistent all the time
relational analysis
scatterplot, trend line, two measures, ex) does the relationship between profit and sales differ by customer segment
dice
selection condition on two or more dimensions, producing a sub-cube; selects and filters data; a cube is diced when more than one dimension is sliced at once
slice
selection of one dimension of cube, selects and filters data
Hadoop cluster
set of computers with HDFS and MapReduce
relational data relationships
simple/known
situational awareness
small, concise, direct, clear, display media customized for specific context
relative comparison
stacked bar, pie, ex) percentage of total net profit break down across customer segments
relational data schema
static (on-write)
Flume
streaming framework for transferring large amounts of event-based data to HDFS
relational data type/format
structured
Sqoop
tool ("SQL-to-Hadoop") designed for transferring bulk data between HDFS and structured data stores such as relational databases
big data type/format
structured, semi-structured, unstructured
roll-up
summarizes data by climbing up a dimension hierarchy; further aggregation, coarser granularity; ex. cities to countries
symmetry
symmetrical elements are seen as a group
example of Japan's 7/11 stores
the Seven-Eleven Japan approach to generating big value from little data relies on providing transparent information to decision makers and setting clear expectations for how they will use it
visual perception
the end product of vision, the way the brain interprets what the eyes see, can be altered by previous experiences, can affect the way you see a situation
design considerations
titles, labels, legends, captions, reference lines, data sorted by dimensions, no distracting marks, consistency
durability
transactions persist even after system crash
how can the New Deal address concerns with big data and privacy
transparency: users can see what is being collected about them and opt in or opt out, staying in control; this builds trust with consumers; companies no longer have to carry as much security risk from hacks; personal data is a new internet currency, so users should be able to share their information at their own preference
the New Deal
a set of principles and practices to define the ownership of data and control its flow; companies do not own the data they collect about individuals; a rebalancing of data ownership toward the individual whose data is being collected
effectiveness
a visualization is more effective if the information is more readily perceived
dashboard
a workspace used to assemble a collection of worksheets based on specific analytical objectives
OLAP cube cells
aggregate measures
distribution system hierarchy
aggregation switch, rack switches, nodes
Hadoop business area examples
risk modeling, customer churn analysis, recommendation engine, ad targeting
pivot
rotates orientation of data for reporting purposes, moves dimensions from one axis to another
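A small pandas sketch of the cube operations defined above (slice, dice, roll-up, pivot); the sales table and its dimensions are made up for illustration.
    import pandas as pd

    # Toy fact table: dimensions (year, country, city, category) and one measure (sales).
    df = pd.DataFrame({
        "year":     [2023, 2023, 2024, 2024],
        "country":  ["US", "US", "DE", "DE"],
        "city":     ["Boston", "NYC", "Berlin", "Munich"],
        "category": ["Office", "Tech", "Office", "Tech"],
        "sales":    [100, 250, 80, 120],
    })

    sliced  = df[df["year"] == 2023]                                  # slice: filter on one dimension
    diced   = df[(df["year"] == 2024) & (df["category"] == "Tech")]   # dice: filter on two or more dimensions
    rollup  = df.groupby("country")["sales"].sum()                    # roll-up: aggregate cities up to countries
    pivoted = df.pivot_table(values="sales", index="country",
                             columns="category", aggfunc="sum")       # pivot: move dimensions between axes
    print(pivoted)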
distributed system
two-level architecture; a system of computers that communicate over a network, where the nodes are commodity PCs
Miller's magic number of 7 (plus/minus 2)
7 is the number of chunks of information that a person can hold in working memory at the same time
establish one undisputed source of performance data
Aetna created a single, common information system for data across divisions, allowing it to target the data toward more effective change; companies will then want to improve their means of data capture if the initial data is inaccurate; a single undisputed source allows analysis of business processes
visual encoding
Jacques Bertin, Sémiologie graphique (Gauthier-Villars, 1967): marks, positions, and retinal variables
basic availability
DB system appears to work most of the time
3 data platforms at comScore
Greenplum database, Greenplum enterprise data warehouse, MapR's Hadoop
HDFS
Hadoop distributed file system, when data is loaded onto HDFS, it is divided into small blocks
NoSQL DB
Not only SQL; coined by Carlo Strozzi (1998) to name his file-based DB; systems that allow large quantities of unstructured/semi-structured data to be stored and managed
fault-tolerance: data processing
MapReduce restarts failed tasks; once failures are detected, the job tracker reassigns the work to a task tracker on a different node; speculative execution
figure & ground
a figure is the element in focus that rests on the ground (element in the background)
context filters
applied to the data view before other filters; used to improve view performance or to create dependent numerical or top-N filters; enables interactivity
filtering data
analysis on narrower data sub-set, drills into detail, can be at data source or at data view
ACID
atomicity, consistency, isolation, durability
OLAP cube axes
attributes; discrete-valued, categorical
absolute comparison
bar, tree map, ex) most profitable product category in each month
BASE
basic availability, soft state, eventual consistency
big data-processing method
batch / near-real-time
reduce phase
boils all outputs down into a single result set; intermediate results are aggregated under the control of the job tracker, which sends the final results back to the client application
why data scientists are important
bring structure to large quantities of formless data and make analysis possible; identify rich data sources, join them with other, potentially incomplete data sources, and clean the resulting set; help decision makers shift from ad hoc analysis to an ongoing conversation with data in a competitive world
benchmark comparison
bullet graph, ex) actual sales compared against a target or benchmark
consciously articulate their business rules and regularly update them in response to facts
business rules align the actions of operational decision makers with the strategic objectives of the company; decision makers must understand the rules, and management must regularly adjust them in response to new information; specific rule changes let managers clearly analyze small deviations without needing big data; embedding more complex rules in software systems is effective for speed and clarity
fault-tolerance: data storage
by default, HDFS maintains 3 copies of each file, and these copies are scattered across different computers; the system detects failures via heartbeats; when a node fails, the system keeps running and the data remains available from other nodes
consistency
every transaction moves the database from one valid state to another; integrity constraints (e.g., fields that cannot be null) are never violated
relational data storage environment
centralized
pre-attentive processing
certain info can be processed in parallel by the low-level visual system, some visual elements stand out more than others
general framework for a data visualization project
clarify business question, choose analysis/chart types, prepare your data, create your data view(s)
critical design practices
co-locate items that belong together, support comparisons, include supplementary info, keep the design aesthetically pleasing, no bright colors except for emphasis, varied font sizes, real-time monitoring
provide high-quality coaching to employees who make decisions on a regular basis
coaching and counselors provide constant support in following business rules and a constant reminder to check the data on whether a change is necessary or a complaint is justified
column-oriented DB
collection of column families, semi-structured, high scalability, good for versioning, can't query blob content, not optimized for joins
retinal
color hue, size, shape, color value, orientation, texture
blending
combines data fields at an aggregate level, worksheet specific
joining
combines data sheets/tables at row level, data-file-specific
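A hedged pandas sketch contrasting joining (row-level combination) with blending (combining at an aggregate level); the orders and targets tables are illustrative assumptions.
    import pandas as pd

    orders  = pd.DataFrame({"region": ["East", "East", "West"], "sales":  [100, 150, 200]})
    targets = pd.DataFrame({"region": ["East", "West"],         "target": [300, 180]})

    # Joining: combine tables at the row level (every order row picks up its region's target).
    joined = orders.merge(targets, on="region", how="left")

    # Blending: aggregate each source first, then combine the aggregates on the common field.
    sales_by_region = orders.groupby("region", as_index=False)["sales"].sum()
    blended = sales_by_region.merge(targets, on="region", how="left")
    print(blended)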
big data relationships
complex/unknown
isolation
concurrent transactions do not interfere
how HDFS works
data files are split into uniform sized blocks, blocks are split and stored across many computers at once, blocks are replicated across multiple computers, one computer keeps track of which blocks make up a file and where they are stored
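A simplified Python sketch of that idea: split a file into uniform-sized blocks and replicate each block across several nodes (the block size, node names, and round-robin placement are illustrative; real HDFS uses large blocks and rack-aware placement).
    import itertools

    BLOCK_SIZE  = 4          # bytes per block (tiny, for illustration only)
    REPLICATION = 3          # HDFS keeps 3 copies of each block by default
    NODES       = ["node1", "node2", "node3", "node4", "node5"]

    def split_into_blocks(data: bytes, size: int):
        """Split a file's bytes into uniform-sized blocks."""
        return [data[i:i + size] for i in range(0, len(data), size)]

    def place_replicas(blocks, nodes, replication):
        """Assign each block to `replication` nodes in round-robin order."""
        node_cycle = itertools.cycle(nodes)
        return {block_id: [next(node_cycle) for _ in range(replication)]
                for block_id, _ in enumerate(blocks)}

    blocks = split_into_blocks(b"hello hadoop world!", BLOCK_SIZE)
    print(place_replicas(blocks, NODES, REPLICATION))   # the kind of metadata a NameNode tracks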
veracity
data in doubt, quality is uncertain due to inconsistency, incompleteness, ambiguity, deception, latency, model approximations
variety
data in many formats; structured, semi-structured, unstructured
velocity
data in motion, generated fast and needs to be processed fast to maximize data's business value
Greenplum enterprise data warehouse
data is historical and aggregated, purpose: long term trends, time span: 1 year
why comScore needs all 3 platforms
full-detail data for the past 50 days is stored in the Greenplum database for a short period; it is then summarized/aggregated and moved into the Greenplum enterprise data warehouse, which is used for analytical functions when looking at trends; data is summarized even further and placed into the MapR Hadoop platform, where comScore can perform analysis over multiple years
document-oriented DB
data stored in nested hierarchies, logical data stored together as a unit, collection of documents, ideal for search, complex to implement, incompatible with SQL
HIVE
data warehousing package created to facilitate easy data summarization, supports ad-hoc queries and analyses, report dashboards, mechanism to bring structure to unstructured data, popular among data analysts
eventual consistency
data will become consistent at some later time
data-ink ratio
data-ink divided by the total ink (or pixels) used to print the graphic
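Written out as a formula:
    \[
    \text{data-ink ratio} = \frac{\text{data-ink}}{\text{total ink used to print the graphic}}
    \]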
NoSQL design
designed to handle high level of reads/writes while scaling horizontally
big data storage environment
distributed
Mahout
distributed and scalable library with common machine learning algorithms on Hadoop
big data schema
dynamic (on-read)
continuity
elements arranged on a line or curve are related
proximity
elements close together are seen as a group
similarity
elements with the same visual characteristics are seen as a group
enclosure
elements within the same region are seen as a group
four practices a company can follow for evidence-based decision making
establish one undisputed source of performance data, give decision makers at all levels near-real-time feedback, consciously articulate their business rules and regularly update them in response to facts, provide high-quality coaching to employees who make decisions on a regular basis
Mackinlay design criteria
expressiveness and effectiveness
expressiveness
a set of facts is expressible in a visual language if the visualization expresses all of the facts in the data set, and only those facts
characteristics of a Hadoop cluster
fault-tolerance, parallelism, data locality, horizontal scaling
why is it challenging to find professionals that can effectively work with big data
for important decisions, the people relied on are typically high up in the organization or expensive outsiders brought in for their expertise and track records (the highest-paid person's opinion); domain expertise remains critical for knowing which problems to tackle, but even more important are skills in cleaning and organizing large data sets and in visualization tools and techniques; expertise in the design of experiments helps cross the gap between correlation and causation; the best data scientists are also comfortable speaking the language of business and helping leaders reformulate their challenges in ways big data can tackle; the Hadoop skill set is new to many IT departments
panel data
from 2 million internet users who gave comScore permission to passively measure their full online browsing and transaction behavior; captures the computer's behavior (location, who used it, when it entered or left a website), the actual number of ads delivered to each computer, and how much online purchasing was immediate vs. delayed
perceptual data
from panel data, survey results
how comScore made big data more consumable/less overwhelming for customers
gave clients 4-5 trends/insights and helped them digest those; supported self-service with software tools offering graphics/dashboards/wizards; Campaign Essentials provided real-time/in-flight information to customers
relational data volume
gigabytes/terabytes
DataNodes- slaves
handles block integrity, stores/provides direct access to HDFS blocks, sends heartbeats to NameNode every 3 seconds
key-value DB
handles massive load and reads (no joins), keys access opaque data blobs, values can contain any type of data, scalable, simple API, can only query based on keys
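A minimal Python sketch of the key-value idea: a dictionary-like store where values are opaque blobs and lookups happen only by key (the class and method names are invented for illustration).
    class KeyValueStore:
        """Toy key-value store: values are opaque blobs, queries are by key only."""
        def __init__(self):
            self._data = {}

        def put(self, key, value):
            self._data[key] = value        # value can be any blob: JSON, image bytes, etc.

        def get(self, key):
            return self._data.get(key)     # fast lookup, but no joins and no querying by value

    store = KeyValueStore()
    store.put("user:42", b'{"name": "Ada", "cart": ["sku-1", "sku-7"]}')
    print(store.get("user:42"))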
big data
high volume, velocity and variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making
data scientist
high-ranking professional with the training and curiosity to make discoveries in the world of big data
data-ink goal
highest possible data-ink ratio, data looks better naked, ink should change as data changes
distribution analysis
histogram, one measure and frequency of that measure, ex) how many orders a company processes each day in each region
big data scaling method
horizontal (scale-out)
speculative execution
if a task tracker is running slowly, job tracker can redundantly execute another instance of the same task
closure
in a complex arrangement of elements, we look for a single, recognizable pattern
data scientist skills
identify/join/clean rich data sources; communicate findings and make recommendations; a hybrid of data hacker, analyst, communicator, and trusted advisor; ability to write code; intense curiosity; strong social skills; solid foundation in math, statistics, probability, and computer science; a feel for business issues and empathy for customers; creative in displaying information visually through patterns; turns unstructured data into structured data
schema-on-read
inserting data without applying any schema, programming or non-SQL query language, big data, high flexibility, fast loading
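A minimal schema-on-read sketch in Python: raw semi-structured records are loaded as-is, and structure is imposed only when the data is read (the field names are made up).
    import json

    # Raw, semi-structured records land in storage without any declared schema.
    raw_lines = [
        '{"user": "ada", "page": "/home", "ms": 120}',
        '{"user": "bob", "page": "/cart"}',                 # missing field is fine at load time
        '{"user": "eve", "page": "/home", "ms": "95"}',     # inconsistent type is fine at load time
    ]

    # The schema is applied only when the data is read/analyzed.
    def read_with_schema(line):
        rec = json.loads(line)
        return {
            "user": str(rec["user"]),
            "page": str(rec["page"]),
            "ms":   int(rec.get("ms", 0)),   # coerce types and fill defaults at read time
        }

    print([read_with_schema(line) for line in raw_lines])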
relational data-processing method
interactive
map phase
job tracker breaks processing task into small pieces of computation and sends each piece to a task tracker who processes individual pieces in parallel under global control of job tracker
components of MapReduce
job tracker on master node and task tracker on slave node
types of NoSQL DBs
key-value, graph, document-oriented, column-oriented
trend analysis
line, ex) do all product categories have similar year-to-year changes in sales performance
strategic data
loyalty card / in-store purchase data from strategic partners; ties online advertising campaigns to offline in-store purchases
NameNode- master
manages metadata (list of files, list of blocks for each file, list of DataNodes for each block, file attributes); monitors DataNode health; replicates blocks on DataNode failure; balances space/access speed
MapReduce shortcomings
many queries/computations need multiple MapReduce programs; the API is unfamiliar to many programmers; common query operations must be coded line by line
phases of MapReduce
map and reduce
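A single-machine word-count sketch of the two phases in plain Python (no Hadoop involved; it only shows the shape of map, shuffle, and reduce).
    from collections import defaultdict

    documents = ["big data big value", "big clusters process data"]

    # Map phase: each input record is turned into intermediate (key, value) pairs independently.
    def map_phase(doc):
        return [(word, 1) for word in doc.split()]

    intermediate = [pair for doc in documents for pair in map_phase(doc)]

    # Shuffle: group intermediate values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: aggregate each key's values into the final result set.
    def reduce_phase(key, values):
        return key, sum(values)

    print(dict(reduce_phase(k, v) for k, v in groups.items()))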
mapping data to chart
marks are geometric primitives, channels control the appearance of marks, ex. line and texture
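A small matplotlib sketch of marks and channels: point marks whose position, size, and color hue each encode a different field (the data values are made up).
    import matplotlib.pyplot as plt

    sales   = [120, 250, 90, 310]              # horizontal position channel
    profit  = [15, 60, 5, 80]                  # vertical position channel
    orders  = [40, 120, 25, 150]               # size channel (retinal variable)
    segment = ["red", "blue", "red", "blue"]   # color-hue channel (retinal variable)

    # Marks are geometric primitives (points here); channels control their appearance.
    plt.scatter(sales, profit, s=orders, c=segment)
    plt.xlabel("Sales")
    plt.ylabel("Profit")
    plt.title("Point marks with position, size, and color channels")
    plt.show()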
how to arrange disparate data in a way that makes sense and support its efficient perception
maximize data-ink ratio, follow Gestalt principles
graph DB
models data as a series of nodes, relationships, and properties; ideal when relationships between data are key; fast network search; can automate joins with public data; poor scalability; specialized query language
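A small sketch of the graph-DB idea using the networkx library (the nodes and relationships are invented): data is modeled as nodes, relationships, and properties, and network searches such as shortest paths are natural.
    import networkx as nx

    G = nx.Graph()
    # Nodes with properties, connected by relationships.
    G.add_node("Ada",  kind="person")
    G.add_node("Bob",  kind="person")
    G.add_node("Acme", kind="company")
    G.add_edge("Ada", "Bob",  rel="friends_with")
    G.add_edge("Bob", "Acme", rel="works_at")

    # Network search: how is Ada connected to Acme?
    print(nx.shortest_path(G, "Ada", "Acme"))   # ['Ada', 'Bob', 'Acme']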
dashboard display
most important info, single screen, monitored at a glance, situational awareness
data locality
move computation to where the data is stored; the job tracker divides up tasks based on the location of the data and tries to schedule each task on the same machine as its physical data block, or at least the same rack
Hadoop
open source software framework designed for storage and processing of large-scale data (big data) on clusters of commodity hardware
4 sources of comScore's online data
panel data, census data, perceptual data, strategic data
Greenplum database
parallel-processing database for event-level analysis; purpose: full-detail analysis; structure: massively parallel processing; storage: 200 servers, 2 trillion rows; time span: event-level data for the past 50 days, loaded hourly
big data volume
petabytes+
census data
sensors placed on major websites and mobile phones to track behavior
Pig
platform created to analyze data without writing MapReduce programs, good for ETL, not good for ad-hoc querying, preferred by data scientists and programmers
marks
points, lines, areas
how to make the most important data stand out from the rest
pre-attentive processing
MapReduce
programming model that pairs with HDFS, data-processing task divided into two phases, divide and conquer processing model for parallel computation
MapR's Hadoop
purpose: longer history and richer data types; data storage: 4.4 petabytes; data structure: 230-node cluster; time span: years
why the practices a company can follow for evidence-based decision making are important
rather than investing in high-end analytic systems, businesses need to know how to properly analyze the data they already have; evidence-based decision making can improve business performance/profits and empower employees
relational data updates
read and write many times, updates allowed
give decision makers at all levels near-real-time feedback
regular scorecards clarify individual accountability and provide consistent feedback so that individuals know how they are doing; the scorecard focuses on information that individuals can actually control and improve, not summations of company-wide profits
relational data integrity
relatively high (ACID)
big data integrity
relatively low (BASE)
parallelism
values are processed independently; map functions run in parallel, creating different intermediate values from different input data sets; parts of the reduce functions also run in parallel, each working on a different output key
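A sketch of the same idea on one machine using Python's multiprocessing.Pool: independent map calls run in parallel across worker processes, and their outputs are then aggregated (the chunks are made up).
    from multiprocessing import Pool

    def map_task(chunk):
        """Each map task counts words in its own chunk, independently of the others."""
        return len(chunk.split())

    if __name__ == "__main__":
        chunks = ["big data big value", "clusters of commodity hardware", "map tasks run in parallel"]
        with Pool(processes=3) as pool:
            partial_counts = pool.map(map_task, chunks)   # map functions run in parallel
        print(sum(partial_counts))                        # a reduce-style aggregation of the outputs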
relational data scaling method
vertical (scale-up)
positions
vertical, horizontal, spatial
attributes that can be used to influence pre-attentive processing
visual encoding, design considerations, critical design practices, Mackinlay design criteria
connection
visually connected elements are related
law of Pragnanz (simplicity)
we tend to perceive and interpret ambiguous or complex images in the simplest form possible
big data updates
write once read many times (WORM), updates not allowed
YARN
Yet Another Resource Negotiator; the resource management layer for the Hadoop ecosystem; a central platform to deliver consistent operations, security, and data governance tools across Hadoop clusters