OIM 350 Exam 3
Which forms of preattentive processing are best for presenting quantitative information?
-2-D locations (ex. location of data points in a scatter plot)
-Line length (ex. length of a bar in a bar graph)
HBR: Big Data Big Responsibility - What is the New Deal and how does it address concerns with Big Data and privacy?
-New Deal = a set of principles and practices that define the ownership of data and control its flow
-with sensors built into more and more products, people feel as if their privacy is being invaded
-how it addresses this: it tilts the ownership of data in favor of the individual whose data is collected (individuals can opt in or out of the data being collected about them, and they can see the data) -> transparency
Success factors for big data analytics
-a clear business need
-strong, committed sponsorship
-alignment between the business and IT strategy
-a fact-based decision-making culture
-a strong data infrastructure
-the right analytics tools
-personnel with advanced analytical skills
Data Warehouse/Mart Front End (BI) Applications
-can retrieve data from the data warehouse itself or from a dependent data mart that contains a subset of the data warehouse's data
-can retrieve data from independent data marts
*front end (BI) applications = a collection of pre-developed queries organized for simpler access and use by end users
Corporate Use of Big Data
-common big data mistake = treating big data as a completely separate issue
-big data techniques increase the ability to analyze the data that the organization owns or to which it has access
-big data methods do not replace database and data warehousing approaches; instead, they allow an organization to analyze and get insight from the kinds of data not suited for regular database and data warehouse technologies
Executive Dashboards
-contains an organized, easy-to-read display of a number of important queries describing the performance of the organization
-visually displays graphs, charts, and scorecards of key performance indicators and related information on a real-time basis
-a computer reporting tool that presents data about an organization's performance in a graphical or visual manner
MIT: comScore - How did comScore make their big data more consumable, and less overwhelming, for customers?
-gave clients a few insights, then encouraged them to answer key questions themselves
-designed an experience group to understand what the user flow needs to be -> provided users with easier-to-digest insights by creating visualization techniques and dashboards
-developed Campaign Essentials, which provided real-time insights
MIT: What did comScore do to hire and develop in-house workers with data scientist abilities?
-hired analytical people from business and math schools and gave them additional training
-established groups with varying and overlapping skill levels
-used matrix organizational structures to bring together people with diverse skill sets
How has Hadoop enhanced our ability to leverage (effectively use) Big Data?
-an open-source framework/technical infrastructure used for handling big data: capture data, store it, query it, and preserve it in a safe way
-runs on inexpensive commodity hardware, so projects can be done inexpensively
Process (a toy simulation of step 1 follows):
1. splits data into multiple blocks, makes copies of them, and stores them across multiple systems so data isn't lost
2. facilitator and job tracker nodes coordinate the work
3. jobs are distributed to clients; when complete, the results are collected and aggregated using MapReduce
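A toy pure-Python simulation of step 1, splitting data into blocks and replicating each block across nodes so no single failure loses data (not Hadoop's API; node names and block size are made up, though 3 is HDFS's default replication factor):

```python
# Toy simulation of HDFS-style block splitting and replication (not Hadoop's API).
data = "some very large file contents ... " * 10
BLOCK_SIZE = 64           # bytes here; ~128 MB in real HDFS
REPLICATION_FACTOR = 3    # HDFS default: each block stored on 3 nodes
NODES = ["node1", "node2", "node3", "node4"]

# 1. Split the data into fixed-size blocks.
blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

# 2. Assign each block to REPLICATION_FACTOR distinct nodes (round-robin here).
placement = {}
for b in range(len(blocks)):
    placement[b] = [NODES[(b + r) % len(NODES)] for r in range(REPLICATION_FACTOR)]

# Losing any single node still leaves 2 copies of every block.
print(placement)
```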
Big Data differences compared to operational databases/ data warehouses
-less structured, with little to no metadata
-exhibit larger volume, velocity, and variety
-higher likelihood of data quality (veracity) issues
-higher possibility of different interpretations (variability)
-need a more explorative and experimental approach to yield value
-more likely to benefit from innovative visualization
Hadoop
-one of the big data technologies
-captures the data, stores it, queries it, and preserves it in a safe way
-enables distributed parallel processing of huge amounts of data across many inexpensive computers
MapReduce
-one of the big data technologies; has two phases:
1. Map Phase = a map program maps every record in the data set to zero or more key-value pairs
2. Reduce Phase = gathers all the records with the same key and generates a single output record for every key
-goal: achieve high performance with many simple computers (see the sketch below)
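A minimal pure-Python word-count sketch of the two phases (this illustrates the MapReduce idea, not Hadoop's actual API; the function names are made up):

```python
from collections import defaultdict

def map_phase(record):
    # Map: emit zero or more (key, value) pairs per input record.
    for word in record.split():
        yield (word.lower(), 1)

def reduce_phase(key, values):
    # Reduce: collapse all values sharing a key into one output record.
    return (key, sum(values))

records = ["the quick brown fox", "the lazy dog"]

# Shuffle step: group intermediate pairs by key.
grouped = defaultdict(list)
for record in records:
    for key, value in map_phase(record):
        grouped[key].append(value)

results = [reduce_phase(k, vs) for k, vs in grouped.items()]
print(results)  # [('the', 2), ('quick', 1), ...]
```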
HBR: Big Data - Why is it challenging to find professionals that can effectively work with Big Data?
-people rely too much on experience and intuition and not enough on the actual data in front of them; they rely on the highest paid person's opinion instead of the data itself
5 Management Challenges:
1. Leadership: need a leadership team that sets clear goals, defines what success looks like, and asks the right questions
2. Talent Management: as data becomes cheaper, the complements to data (data scientists, visualization tools and techniques) become more valuable, but people with these skills are hard to find
3. Technology: technologies such as Hadoop require a skill set that is new to IT departments
4. Decision Making: an effective organization places information and the relevant decision rights in the same location; need to put the right people together
5. Company Culture: companies need to ask "what do we know?" and stop pretending to be more data driven than they actually are
Monitoring and Maintaining Database System (database administration component)
-recognizes when maintenance activities are needed
-observes usage of tables
a) view materialization = saving a view as an actual physical table to improve the performance of queries on frequently used views (see the sketch below)
-manages and upgrades database software and hardware resources
-data dictionary = allows access to the metadata in a database (names of tables, names/types of columns, etc.)
-catalog = the data dictionary created by the DBMS
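A minimal sketch of view materialization using Python's built-in sqlite3 module; SQLite has no CREATE MATERIALIZED VIEW statement, so the view is materialized by hand with CREATE TABLE AS (the table and column names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("East", 100.0), ("East", 50.0), ("West", 75.0)])

# A regular view: recomputed every time it is queried.
conn.execute("""CREATE VIEW sales_by_region AS
                SELECT region, SUM(amount) AS total FROM sales GROUP BY region""")

# 'Materializing' the view: save the result as a physical table, so
# frequent queries read precomputed rows instead of re-aggregating.
conn.execute("""CREATE TABLE sales_by_region_mat AS
                SELECT region, SUM(amount) AS total FROM sales GROUP BY region""")

print(conn.execute("SELECT * FROM sales_by_region_mat").fetchall())
```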
Securing the database against unauthorized access (database administration component)
-requires controlling access to the database
1. authentication of the user = login procedure using a user ID/password
2. access privileges = assigned to the database user account; determine the user's privileges on database columns, relations, and views
-include SELECT, UPDATE, ALTER, DELETE, INSERT
-implemented by an authorization matrix -> composed of subjects and objects
-DCL commands GRANT and REVOKE (see the sketch below)
-role-based access control = users are assigned to roles that have predefined privileges
3. encryption = changing information using a scrambling algorithm
-encryption key = makes the information unreadable; decryption key = reverts the information to its original state
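A minimal sketch of the DCL commands, assuming a PostgreSQL server reachable through the psycopg2 driver; the connection string, the analyst role, and the sales table are all hypothetical (SQLite, used in the other sketches, has no GRANT/REVOKE):

```python
import psycopg2  # assumes a PostgreSQL server and the psycopg2 driver

conn = psycopg2.connect("dbname=salesdb user=dba")  # hypothetical DSN
cur = conn.cursor()

# GRANT: give the 'analyst' role read-only access to one table.
cur.execute("GRANT SELECT ON sales TO analyst")

# REVOKE: take a previously granted privilege away.
cur.execute("REVOKE UPDATE ON sales FROM analyst")

conn.commit()
```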
What are the seven types of analysis (w/ chart types and data types)?
1. Absolute comparison -> Chart: bar -> Data: at least one dimension and one measure
2. Relative comparison -> Chart: stacked bar, pie -> Data: at least one dimension and one measure
3. Benchmark comparison -> Chart: bullet -> Data: at least one measure
4. Distribution analysis -> Chart: histogram -> Data: at least one measure
5. Trend analysis -> Chart: line -> Data: at least one measure and one date/time dimension
6. Relational analysis -> Chart: scatterplot -> Data: at least two measures
7. Geographic analysis -> Chart: map -> Data: geographic data
MIT: Describe the three metrics that comScore used to measure long-term success.
1. Client Retention: tracked contract renewals
2. Product and Service Usage: tracked usage of products at its client companies and encouraged clients to use the products/services they purchased
3. Up-Selling: tracked up-selling
DBMS Components (4 of them)
1. Data Definition Component:
-used to create the components of the database (ex. database tables, referential integrity constraints connecting the created tables)
-uses DDL (data definition language) SQL commands
2. Data Manipulation Component:
-allows end users to insert, read, update, and delete information (can be used directly or indirectly by end users)
-uses DML (data manipulation language) SQL commands
-single-user system = data manipulation component used by 1 user at a time; multi-user system = used by multiple users at the same time
3. Data Administration Component:
-used for technical, administrative, and maintenance tasks of database systems (ex. ensuring security, providing backup and recovery)
-uses DCL (data control language) and TCL (transaction control language) SQL commands
4. Application Development Component:
-used to develop front end applications
(see the sketch below for a representative command from each language family)
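A minimal sqlite3 sketch pairing the component language families with a representative command; SQLite has no DCL, so GRANT appears only as a comment (the table name is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL (data definition): create a table.
conn.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT)")

# DML (data manipulation): insert and read rows.
conn.execute("INSERT INTO product VALUES (1, 'Widget')")
rows = conn.execute("SELECT * FROM product").fetchall()

# TCL (transaction control): make the updates permanent.
conn.commit()

# DCL (data control) would look like this on a server DBMS:
#   GRANT SELECT ON product TO some_user;
print(rows)
```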
Compare operational database, data warehouse and big data infrastructure: Describe how Operational Databases, Data Warehouses, and Big Data infrastructure are similar and different based on the characteristics listed below.
1. Data Models:
a) operational DB: ERD (relational schema)
b) data warehouse: star schema (dimensional schema)
c) big data: none
2. Use of SQL (this counts on the data being structured):
a) operational DB: yes (ex. SELECT)
b) data warehouse: yes (ex. CREATE)
c) big data: NoSQL (because there is no structure to the data and it comes from many different places)
3. Data Redundancy:
a) operational DB: absent (relational tables are created and then normalized to reduce repetition)
b) data warehouse: little redundancy
c) big data: lots of it (data is stored all over the place, in many different places)
4. Normalization:
a) operational DB: needed (want to get the data into 3NF)
b) data warehouse: uses some
c) big data: does not use it
5. Time Horizon:
a) operational DB: current (day-to-day data gets loaded every second)
b) data warehouse: historical/time-variant (3 months to years)
c) big data: no time constraint, just a constant stream of unstructured data
6. Sources of Data:
a) operational DB: everything a company does becomes a source of data (ex. transactions, operations)
b) data warehouse: operational sources
c) big data: a huge variety of data from all different sources; it is everywhere
7. Volume of Data Stored:
a) operational DB: little, detailed data
b) data warehouse: medium amount, summarized data
c) big data: huge quantities (petabytes)
8. Data Update Frequency:
a) operational DB: seconds, very frequent
b) data warehouse: every 3-6 months, not as often
c) big data: real time
9. How Is New Data Added?
a) operational DB: operations
b) data warehouse: ETL
c) big data: copy (everything gets copied in)
10. Other Differences - Data Quality Issues:
a) operational DB: none
b) data warehouse: not as often
c) big data: subject to them
ETL Process (Extract, Transform, and Load)
1. Extraction: retrieve analytically useful data from the operational data sources; what to extract is determined in the requirements and modeling stages (the data model provides a blueprint)
2. Transformation: transform the extracted data to fit the structure of the target data warehouse model; includes data quality control and improvement (data cleansing) and may standardize different versions of the same data from different sources
-2 kinds of transformations: active and passive
3. Load: load the extracted and transformed data into the target data warehouse
-batch processes insert the data into the data warehouse tables automatically, without user involvement
-first load = initial load populating the empty data warehouse
-refresh load = every subsequent load; refresh cycle = the frequency with which the data warehouse is reloaded with new data
(a minimal end-to-end sketch follows)
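A minimal end-to-end ETL sketch using Python's sqlite3 module as a stand-in for both the operational source and the warehouse; all table names and the standardization map are made up:

```python
import sqlite3

source = sqlite3.connect(":memory:")   # stand-in for an operational source
target = sqlite3.connect(":memory:")   # stand-in for the data warehouse
source.executescript("""
    CREATE TABLE orders (customer TEXT, state TEXT, amount REAL);
    INSERT INTO orders VALUES ('Ann', 'MA', 20.0), ('Bob', 'Mass.', -5.0),
                              ('Cy', 'NY', 12.5);
""")
target.execute("CREATE TABLE fact_sales (customer TEXT, state TEXT, amount REAL)")

# 1. Extract: pull the analytically useful rows from the source.
rows = source.execute("SELECT customer, state, amount FROM orders").fetchall()

# 2. Transform: cleanse bad rows (active) and standardize state codes (passive).
state_codes = {"Mass.": "MA"}  # made-up standardization map
clean = [(c, state_codes.get(s, s), a) for c, s, a in rows if a >= 0]

# 3. Load: batch-insert the transformed rows into the warehouse table.
target.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", clean)
target.commit()
print(target.execute("SELECT * FROM fact_sales").fetchall())
```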
MIT: What are the three large data platforms at comScore and what purpose do they serve? In answering this question, be sure to address why comScore needs all three platforms.
1. Greenplum (data processing) -> parallel processing database for event-level analysis
-tells us what is going on with the company; full detailed analysis
2. Greenplum enterprise (enterprise data warehouse) -> historical and aggregated data
-helps us understand long-term trends
3. Hadoop (big data infrastructure) -> 2.3 node, 4.4 petabyte
-provides a larger history and richer data types
-comScore needs all three because each serves a different purpose: detailed event-level analysis, long-term aggregated trends, and large-scale storage of richer data types
Types of Transformations
1. active transformation = produces a different number of rows in the output data compared to the incoming data extracted from the sources; reason = quality issues
2. passive transformation = does not affect the number of rows; incoming and outgoing row counts are the same
(see the sketch below)
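A minimal pure-Python sketch, assuming a made-up list of (name, amount) rows, showing that an active transformation changes the row count while a passive one does not:

```python
rows = [("Ann", 20.0), ("Bob", -5.0), ("Cy", 12.5)]

# Active: filtering out bad-quality rows changes the row count (3 -> 2).
active_out = [r for r in rows if r[1] >= 0]

# Passive: reformatting a column keeps the row count the same (3 -> 3).
passive_out = [(name.upper(), amount) for name, amount in rows]

print(len(rows), len(active_out), len(passive_out))  # 3 2 3
```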
OLAP/BI Tools Purpose
1. ad hoc direct analysis of dimensionally modeled data 2. creation of front-end applications for indirect access of dimensionally modeled data
MIT: comScore - Describe the four sources of comScore's online data.
1. panel data = captures the behavior of each computer in its panel (such as online browsing/transaction behavior); used passive measurement
2. census data = based on sensors placed on some websites (90% of the top 100 sites) that recorded what people were doing on them; any time a user went to such a site, it notified comScore
3. perceptual data = surveys of panel members
4. data obtained from strategic partners = this data was processed and integrated into a data factory
7 Vs of Big Data (textbook)
1. volume = very large amounts of data
2. variety = abundance of different types of data sources: structured/unstructured/semi-structured
3. velocity = high speed of incoming data
4. veracity = data quality issues in big data
5. variability = the data can be interpreted in many different ways
6. value = usefulness and actionability of the information extracted from the data sets; need to be able to recognize valuable big data instances
7. visualization = necessity of illustrating and richly visualizing big data sets in order to grasp their meaning
Draw and describe a bullet graph and a sparkline. Why are these graphics better than other widgets such as gauges and other graphics?
Bullet graph: displays a single instance of a measure along with a comparative measure and qualitative ranges
Sparkline: provides historical (trend) context for a measure
-these work better because they take up less dashboard space without losing clarity (a matplotlib sketch of both follows)
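A minimal matplotlib sketch of both widgets; the ranges, measure, target, and history values are all hypothetical:

```python
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(5, 2.5))

# Bullet graph: qualitative ranges as gray background bands,
# the measure as a thin dark bar, a comparative target as a vertical tick.
for bound, shade in [(100, "0.9"), (70, "0.75"), (40, "0.6")]:
    ax1.barh(0, bound, height=0.8, color=shade)
ax1.barh(0, 65, height=0.25, color="black")             # the measure
ax1.axvline(80, ymin=0.2, ymax=0.8, color="red", lw=2)  # comparative target
ax1.set_yticks([])
ax1.set_title("Bullet graph")

# Sparkline: a bare trend line with all chart junk removed.
values = [3, 4, 2, 5, 6, 5, 7, 8, 6, 9]  # hypothetical history
ax2.plot(values, color="black", lw=1)
ax2.axis("off")
ax2.set_title("Sparkline")

plt.tight_layout()
plt.show()
```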
Enablers of Big Data Analytics
In-memory analytics: storing and processing the complete data set in RAM
In-database analytics: placing analytic procedures close to where the data is stored
Grid computing & MPP: use of many machines and processors in parallel (MPP = massively parallel processing)
Appliances: combining hardware, software, and storage in a single unit for performance and scalability
HBR: Big Data - What are the three Vs and how do they define Big Data? Give examples of each that differentiate Big Data from non-Big Data.
The 3 Vs differentiate big data from conventional analytics and are seen as big data's competitive advantages.
1. Volume: how much information is coming in
-Big Data: very large volumes of data (petabytes), requiring specialized software to manage
-non-Big Data: not as large; can be stored on only a few computers
2. Velocity: the speed of the information
-Big Data: rapid data creation and rapid insights; data is generated at high speed and collected all the time (ex. using GPS tracking on cellphones to see how many customers were at a store in order to give sales estimates)
-non-Big Data: tied to business functions
3. Variety: whether the data is structured, unstructured, or semi-structured
-Big Data: many different sources -> huge amounts of information from social media, GPS signals from cellphones, etc.
-non-Big Data: only text or numbers
HBR: Big Data - What is meant by the statement - "Each of us is now a walking data generator." What type of data do each of us generate on a daily basis? Considering the three Vs, how does the data that we generate qualify as big data?
We generate data on our cellphones from things such as our GPS location, online shopping, electronic communication, etc. These all produce large amounts of digital information, mostly unstructured, that is collected into data platforms. This data qualifies as big data on all three Vs: high volume (huge amounts across all users), high velocity (generated continuously, in real time), and high variety (location signals, purchases, and messages, mostly unstructured).
Database Administration (6)
activities needed for the proper functioning of a database system:
-monitoring and maintaining the database system
-data access control
-data backup and recovery
-data integrity assurance
-optimizing database performance
-developing and implementing database policies and standards
developing/implementing policies and standards (database administration component)
aims to reflect and support business processes and business logic
-policies and standards for database development (ex. naming conventions)
-policies and standards for database use (ex. business rules)
-policies and standards for database management and administration (ex. policy for assigning administration tasks)
Data Warehouse Deployment
allowing end users access to the data warehouse and its front end applications
-the process:
1. alpha release = the data warehouse/front end applications are deployed internally to members of the development team for initial testing of functionality
2. beta release = a subsequent release in which the data warehousing system is deployed to a selected group of users to test the usability of the system
3. production release = the actual deployment of the functioning data warehousing system
Big Data Technologies/ When you need it
approaches to deal with/manage big data:
-MapReduce
-NoSQL (does not impose a relational structure on the managed data)
-Hadoop
when you need it:
1. you cannot process the amount and variety of data you want because of your platform's limitations
2. you need to integrate data very quickly to keep your analysis current
3. the data is arriving so fast that your traditional analysis platform cannot handle it
What are the general design guidelines for dashboards?
dashboard = a visual display of the most important information needed to achieve 1 or more objectives
-arranged on a single screen so the information can be monitored quickly
guidelines:
a) small, concise, direct, clear display media
b) customized for the specific context
c) provides situation awareness
OLAP/ BI Tools
designed for analysis of dimensionally modeled data
-the 3 most basic OLAP/BI features:
1. Slice and Dice:
-a way of segmenting, viewing, and comprehending data in a database; breaks information down into smaller parts
-slice = one filter; dice = 2 or more slices
2. Pivot (Rotate):
-doesn't change the values displayed in the original query, just reorganizes them
3. Drill Down/Up:
-drill down = makes the granularity of the data in the query finer; allows users to drill through hierarchies within dimensions
-drill up = makes the granularity coarser
-drill hierarchies allow the user to expand a value at one level to show the details below it (drill down) or collapse the details to show only the higher-level values (drill up)
*does not filter the overall data; the overall total does not change
(a pandas sketch of these operations follows)
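A minimal pandas sketch of slice, dice, pivot, and drill operations on a made-up sales table:

```python
import pandas as pd

# Hypothetical dimensionally modeled data (dimensions: year, region, product).
df = pd.DataFrame({
    "year":    [2023, 2023, 2023, 2024, 2024, 2024],
    "region":  ["East", "West", "East", "East", "West", "West"],
    "product": ["A", "A", "B", "B", "A", "B"],
    "sales":   [10, 20, 15, 25, 30, 5],
})

# Slice: filter on a single dimension value.
east = df[df["region"] == "East"]

# Dice: two or more slices at once.
east_2024 = df[(df["region"] == "East") & (df["year"] == 2024)]

# Pivot: same values, reorganized (regions as rows, years as columns).
pivoted = df.pivot_table(values="sales", index="region", columns="year",
                         aggfunc="sum")

# Drill up: coarser granularity (totals per year only);
# drill down: finer granularity (group by more dimensions).
per_year = df.groupby("year")["sales"].sum()
per_year_region = df.groupby(["year", "region"])["sales"].sum()

print(pivoted, per_year, per_year_region, sep="\n\n")
```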
data cleansing (scrubbing)
during the transformation part of the ETL process; enables detection and correction of low quality data
Data Lake
a large data pool in which the schema and data requirements are not defined until the data is queried (a storage repository that holds a vast amount of raw data in its native format until it is needed)
-if a company wants a cheaper and more flexible, but at the same time less organized and less powerful, approach, the data lake is the right choice
active data warehouse
loads occur in micro batches that happen continuously, so the data is updated in real time
big data
massive volumes of diverse and rapidly growing data sets that are NOT formally modeled for querying and retrieval and are not accompanied by detailed metadata
optimizing database performance (database administration component)
minimizes the response time for queries that retrieve data from the database; involves indexing, denormalization, view materialization, and query optimization
-query optimization = examining multiple ways of executing the same query and choosing the fastest option
(see the indexing sketch below)
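A minimal sqlite3 sketch (made-up customer table) showing how adding an index changes the execution plan the query optimizer chooses:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER, city TEXT)")
conn.executemany("INSERT INTO customer VALUES (?, ?)",
                 [(i, "Boston" if i % 2 else "Albany") for i in range(1000)])

query = "SELECT * FROM customer WHERE city = 'Boston'"

# Without an index the optimizer can only scan the whole table.
print(conn.execute("EXPLAIN QUERY PLAN " + query).fetchall())

# Indexing: a classic performance-tuning step for frequent lookups.
conn.execute("CREATE INDEX idx_city ON customer (city)")

# The optimizer now compares plans and picks the index search.
print(conn.execute("EXPLAIN QUERY PLAN " + query).fetchall())
```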
providing backup and recovery (database administration component)
needed to ensure no data is lost
-backup = saving additional copies of the data
-recovery = recovering the content of the database after a failure
-updates are recorded in a recovery log = even if an update is lost before being written to disk, the log still has the information about the update
-checkpoint = part of the recovery log; indicates a point when updates were written to disk
-in the event of failure: roll back to the checkpoint state, then redo the updates in the recovery log since the last checkpoint
-TCL commands used: COMMIT (causes all updates on the database to be permanently recorded on disk), ROLLBACK (rolls back all updates since the last COMMIT) (see the sketch below)
-complete mirrored backup = insures against complete database destruction; copies of the database are kept in different locations
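A minimal sqlite3 sketch of COMMIT and ROLLBACK, with a made-up account table and a simulated failure:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT, balance REAL)")
conn.execute("INSERT INTO account VALUES ('Ann', 100.0)")
conn.commit()  # COMMIT: the insert is now permanent

try:
    conn.execute("UPDATE account SET balance = balance - 500 "
                 "WHERE name = 'Ann'")
    raise RuntimeError("simulated failure mid-transaction")
except RuntimeError:
    conn.rollback()  # ROLLBACK: undo all updates since the last COMMIT

print(conn.execute("SELECT balance FROM account").fetchall())  # [(100.0,)]
```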
MIT: comScore - What does passive measurement mean and how did comScore use it?
passive measurement is the act of tracking user behavior in real time without the need for user intervention or input. comScore used passive measurement to capture panel members' online browsing and transaction behavior. This allowed comScore to track the number of ads delivered to each computer and thus how immediate an individual's online purchasing was.
HBR: Data Scientist - What are the essential skills for a data scientist?
people with the skill sets to put big data breakthroughs to use: data hacker, analyst, communicator, and trusted advisor
-curious and creative
-communicative, with strong interpersonal skills
-understands social networking technologies
-understands programming, scripting, and hacking
-understands data access and management
-understands domain expertise, problem definition, and decision modeling
-able to write code
What is preattentive processing and how can a dashboard provide it? List several forms of preattentive processing and provide examples of how these forms can be used in a chart.
preattentive processing: in psychological terms, the subconscious gathering of information from an environment; an object's basic visual attributes are perceived without any conscious effort
-with a dashboard: use it to pinpoint certain data in graphs/charts -> direct users' eyes to the data you want them to see
examples:
-FORM (line length, width, shape)
-COLOR (hue, intensity)
-SPATIAL POSITION (2-D location)
ensure database integrity (database administration component)
prevents unauthorized or accidental insertion, modification, or deletion that would result in invalid, corrupt, or low quality data
-can be compromised by unauthorized malicious data updates, update failures, or accidental misuse
Online Analytical Processing (OLAP)
querying and presenting data from data warehouses/data marts for analytical purposes; cannot update the information, as it is read only
-users can quickly read and interpret the data and make fact-based decisions from it
Database Management System (DBMS)
software used to create, manipulate, and maintain databases, and to create front end applications.
What is the data-ink ratio in the context of dashboard design?
the portion of the ink on the dashboard that presents actual data information
-data-ink ratio = data ink / total ink (or pixels) = 1.0 - the proportion of the graphic that can be erased without loss of data information
(a worked example follows)
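A hypothetical worked example: if a dashboard chart is drawn with 50 units of ink and 40 of them encode actual data (the other 10 are gridlines, borders, and decoration), the data-ink ratio is 40/50 = 0.8, meaning 20% of the graphic could be erased without any loss of data information.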
HBR: You May Not Need Big Data - What are some of the principles of evidence-based decision-making? (4)
the preconditions a company needs to meet before implementing big data: evidence-based decision making
1. Agree on a single source of truth:
-use performance data from one authorized source
2. Use scorecards:
-provide employees with data about their own performance -> provides accountability and feedback so they know how they are doing (focuses on results individuals can control)
3. Explicitly manage your business rules:
-align the actions of operational decision makers with the strategic objectives of the company
-continually assess and improve the rules
4. Use coaching to improve performance:
-help people shift from basing their decisions on instinct to basing them on data
-help employees realize the importance of their behavior
Online Transaction Processing (OLTP)
updating, querying, and presenting data from databases for operational purposes; the everyday update transactions performed on operational database systems