Data Science Final
How can you use Excel's VLOOKUP() function to transform data from one value to another?
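-=VLOOKUP(lookup_value, table_array, col_index_num, [range_lookup]) -Searches the first column of table_array for lookup_value and returns the value from column col_index_num of the matching row -Returns #N/A if the value is not found (set range_lookup to FALSE to require an exact match)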
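A rough Python analogue of an exact-match VLOOKUP, as a sketch only. The table contents and the helper name vlookup are made up for illustration; they are not from the course material.

```python
# Hypothetical lookup table: the first column is the key that gets searched.
LOOKUP_TABLE = [
    ("PA", "Pennsylvania", "Northeast"),
    ("NJ", "New Jersey", "Northeast"),
    ("CA", "California", "West"),
]


def vlookup(value, table, col_index):
    """Search the first column for value; return the col_index-th column (1-based)."""
    for row in table:
        if row[0] == value:
            return row[col_index - 1]
    return "#N/A"  # stands in for Excel's error when nothing matches


full_name = vlookup("PA", LOOKUP_TABLE, 2)  # transforms "PA" into its full name
missing = vlookup("TX", LOOKUP_TABLE, 2)    # "TX" is not in the table
```

Transforming a whole column is then a matter of applying vlookup to each cell, the same way the formula is filled down in Excel.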
What is the "Quantified Self" movement and how does it relate to data? To KPIs?
-Apps and devices that help us learn more about ourselves -They quantify your movement, time spent on activities, amount of sleep, etc. -ex. Fitbit -Insurance companies can change your premiums based on these numbers
What is a Database?
-Bytes are grouped into columns (aka fields) -Columns are grouped into rows (aka records) -A group of similar records is a table (aka file) -Database (DB): a collection of tables, plus relationships among rows in those tables, plus special data called metadata, which describes the DB structure
What is data science? What is data (vs. information)?
-Data - raw facts, raw numbers, don't mean anything unless you perform action -Information - putting data in meaningful context -Knowledge - information in action
What is data integration? Why is it necessary?
-Data integration involves combining data residing in different sources and providing users with a unified view of them. -It is necessary so that separate data sources can be connected and merged for analysis
What to do to narrow confidence interval?
-Decrease the confidence level (e.g., from 99% to 95%)
How do you clean dirty data?
-Fix how you collect data (fix the structure) instead of going back and fixing the data itself -Correct spellings, redo calculations, look to the raw data.
Explain the "tyranny of success" and how a reliance on KPIs can exacerbate the problem. What can you do in those situations?
-Grey area -not everything is a yes or no -You can provide more options when collecting data to ensure all bases are covered
Tableau Functions
-IF milk/soda > 1, then set the value of MostExpensive to "Milk" -ELSE IF milk/soda <= 1, then set the value of MostExpensive to "Soda" -IIF(test, then, else, [unknown])
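The IF / IIF logic on this card, sketched in Python. This is not Tableau syntax; iif and most_expensive are hypothetical helper names, and None stands in for Tableau's NULL (the case the optional "unknown" argument handles).

```python
def iif(test, then_value, else_value, unknown_value=None):
    """Mimic IIF(test, then, else, [unknown]); test may be True, False, or None."""
    if test is None:
        return unknown_value  # the comparison could not be evaluated
    return then_value if test else else_value


def most_expensive(milk, soda):
    # A missing price makes the comparison unknown, like NULL in Tableau.
    test = None if milk is None or soda is None else (milk / soda > 1)
    return iif(test, "Milk", "Soda", "Unknown")
```

For example, most_expensive(3.0, 1.5) gives "Milk", most_expensive(1.0, 2.0) gives "Soda", and most_expensive(None, 2.0) gives "Unknown". A soda price of 0 is not handled in this sketch.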
What to do to widen the confidence interval?
-Increase the confidence level (e.g., from 95% to 99%)
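The narrow/widen cards can be checked numerically. A minimal sketch, assuming a normal approximation with a known standard deviation; ci_half_width is a hypothetical helper, and the inputs (standard deviation 10, sample size 100) are made up.

```python
from math import sqrt
from statistics import NormalDist


def ci_half_width(std_dev, n, confidence):
    """Half-width (margin of error) of a normal-based interval for a mean."""
    # Two-sided interval: put (1 - confidence) / 2 in each tail.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * std_dev / sqrt(n)


w95 = ci_half_width(10.0, 100, 0.95)  # about 1.96
w99 = ci_half_width(10.0, 100, 0.99)  # about 2.58, so the 99% interval is wider
w95_more_data = ci_half_width(10.0, 400, 0.95)  # a larger sample also narrows it
```

Going from 95% to 99% widens the interval, and going from 99% to 95% narrows it, with the data held fixed.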
What is the difference between infographics and data visualization? What are the similarities?
-Infographics may contain data visualizations -Infographics tell stories -Data visualizations explain data
How can you use Excel's MATCH() function to find invalid values?
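-=MATCH(E2, Lookup!$A$3:$A$54, 0) = row # -Searches the array for the value; if there is a match, it outputs the position (row number) within the array -If there is no match it returns #N/A, which flags the value as invalid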
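The same validity check sketched in Python. The list of valid names and the entries are invented for illustration, and match is a hypothetical helper that imitates MATCH with match type 0 (exact match).

```python
# Hypothetical list of valid values, playing the role of the Lookup range.
VALID_NAMES = ["Alice", "Bob", "Carol"]


def match(value, lookup_range):
    """Return the 1-based position of value in lookup_range, or "#N/A"."""
    try:
        return lookup_range.index(value) + 1
    except ValueError:
        return "#N/A"


entries = ["Bob", "Carl", "Alice"]
# Any entry that cannot be matched against the valid list is flagged as invalid.
invalid = [e for e in entries if match(e, VALID_NAMES) == "#N/A"]
```

Here "Carl" is flagged because it does not appear in the valid list, which is the same signal a #N/A gives in the spreadsheet.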
What is open data? The three "V"s of big data?
-Open data- free and accessible to everyone -Velocity, Volume, and Variety
Explain why each of the "SMART" criteria are important
-Specific purpose for the business (clearly define goal) -Measurable (results should be tangible) -Achievable by the organization (Goal should stretch you slightly enough to challenge you) -Relevant to success (goals should measure outcomes, not activities in the process) -Time-phased (Time frame must be incorporated)
Differences between statistics and machine learning
-Statistics focuses on the assumptions behind the model and its validity; you always need to justify the model -ML focuses on predicting outcomes; a model is validated by its predictive accuracy
What is the difference between structured and unstructured data?
-Structured data is easily searchable with basic algorithms. It is formatted so that related values sit in separate columns, each with the same type and format -Unstructured data is closer to human language and doesn't fit well in a relational database: images, sound files, etc. It does not follow any structure, and there is not one single type, so it is difficult to manage.
What are some examples of structured and unstructured data?
-Structured: numbers, dates, strings -Unstructured: emails, videos, pictures, presentations, audio files
What is sentiment analysis? How does it work?
-The process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer's attitude towards a particular topic, product, etc. is positive, negative, or neutral. -Individual words within the piece of text are evaluated as positive or negative
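A toy word-level scorer to make the "individual words are evaluated" idea concrete. This is a sketch under strong assumptions: the two word lists are tiny, made-up lexicons, and real tools are far more elaborate.

```python
# Hypothetical mini-lexicons; real sentiment lexicons contain thousands of words.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "poor"}


def sentiment(text):
    """Label text by counting positive words minus negative words."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Because each word is scored in isolation, a sarcastic sentence such as "Oh great, it broke again" is labeled positive, which is one way this kind of analysis can be inaccurate.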
Best practices of dealing with dirty data
-focus on getting new data right -limit time fixing old data -data producers should communicate with data consumers -check work
Characteristics of open data
-it can be redistributed to others -it can come from any source -it can be available for free
How does data get dirty
-missing values -inconsistent data -nonintegrated data -wrong granularity (too fine or too coarse) -too much data
4 Types of Sentiment analysis
-sentence-level -document level -aspect-based -comparative-level
What does it mean for data to be dirty
-spelling errors -punctuation errors -incorrect data in a specified field -duplicate data -nonintegrated data
According to Redman, which is a cause of bad data
-users of data create "work arounds" instead of addressing root causes -people creating the data don't understand how others will use it -Creators and users of data have poor communication
What are the steps for communicating an analysis (and what do they mean)?
1)Understand the problem 2)How will I measure? 3)What data is available? 4)Initial solution hypothesis 5)Solution 6)Impact of solution
According to Hayes, what percentage of business leaders do not trust the information they use to make decisions
33%
the dangers of big data analytics
33% of Managers don't trust data
People spend ____ of the time on data cleaning, the rest is on _____. They are:
50%; data analysis. The cleaning tasks are: searching for data, correcting errors, verifying correctness
Stein was able to eventually predict the gender of a caller
80% of the time
What is a (good) hypothesis? What are the criteria?
A hypothesis is a testable prediction; it does not have to turn out to be true. Criteria: it needs to be testable, falsifiable, and grounded in rationale.
What is the difference in definition and purpose between a scorecard and a dashboard?
A scorecard only shows you binary responses - a check or an x. A dashboard can be interactive and has a scale - showing you exact instances of what is being measured.
How do tables of data become associated in a relational database?
By creating relationships between them (think of how we connect tables in Tableau). Each table should also have a primary key (EmpNo, for example) that other tables can reference to connect the records
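How a shared key associates two tables, sketched in Python. The tables, the DeptNo key, and the join helper are all hypothetical examples; only EmpNo comes from the card itself.

```python
# Two hypothetical tables. DeptNo is the primary key of DEPARTMENTS and is
# repeated in EMPLOYEES to link each employee to a department.
EMPLOYEES = [
    {"EmpNo": 1, "Name": "Ana", "DeptNo": 10},
    {"EmpNo": 2, "Name": "Ben", "DeptNo": 20},
]
DEPARTMENTS = [
    {"DeptNo": 10, "DeptName": "Sales"},
    {"DeptNo": 20, "DeptName": "IT"},
]


def join(left, right, key):
    """Inner join: attach the matching right-hand row to each left-hand row."""
    index = {row[key]: row for row in right}
    return [{**l, **index[l[key]]} for l in left if l[key] in index]


flat = join(EMPLOYEES, DEPARTMENTS, "DeptNo")
```

The result is one merged row per employee, which is also the flat layout that Pivot Table analysis expects.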
What are the benefits of sentiment analysis?
Can help determine how people feel towards a brand, product, etc.
Gandel cites the case of Barclays' effort to purchase Lehman Brothers, where the following Excel error resulted in the accidental purchase of 179 toxic assets
cells were hidden instead of deleted
What is the best data visualization to find outliers? To create scorecards?
The best data visualization for finding outliers is a scatter plot.
According to Unwin, one issue with map-based graphical visualizations is that
Distance is not directly related to similarity
According to Hoven, which of the following is NOT one of Few's 8 core principles of data visualizations?
Explain (principles that ARE included: attend, simplify, be skeptical)
According to Acohido, Microsoft uses all of the following data to combat cybercrime EXCEPT
FBI watchlist (data that ARE included: threat reports, malicious files, early warning reports)
How do you resolve conflicts in data (i.e., PA versus Penna. Versus Pennsylvania)?
Find which name was recorded in the Lookups tab and change any differing names to match it
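The "match everything to the Lookups name" step, sketched in Python. The mapping below is an assumption that "Pennsylvania" is the canonical spelling; in practice the canonical form is whichever one the Lookups tab records.

```python
# Hypothetical variant-to-canonical mapping, keyed by lowercase spelling.
CANONICAL = {
    "pa": "Pennsylvania",
    "penna": "Pennsylvania",
    "penna.": "Pennsylvania",
    "pennsylvania": "Pennsylvania",
}


def standardize(value):
    """Replace a known variant with its canonical name; leave other values alone."""
    return CANONICAL.get(value.strip().lower(), value)


cleaned = [standardize(v) for v in ["PA", "Penna.", "Pennsylvania", "Ohio"]]
```

Values that are not in the mapping pass through unchanged, so unknown spellings are left for manual review rather than silently altered.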
What are the uses of forecasting?
Forecasting future values such as sales amounts (predictive). Moving from a 95% to a 99% confidence level widens the forecast range: the forecast is not more precise, but the range is more likely to contain the actual value. Increasing the number of data points improves accuracy.
According to Farmer, the purpose of creating an information scent is to
Give the user hints where they should explore further
What is the difference between Hadoop and MapReduce?
Hadoop allows you to store big data across servers, while MapReduce is software that allows you to perform tasks on the slices of data across those servers.
According to Peck, the use of analytics to determine workers' potential is most widely used in
Hourly work, where the jobs are standardized
How do you create a KPI for a scorecard using a calculated field in Tableau?
IIF
When should you not bother resolving conflicts or even fixing your data?
If it's too damaged
What are the uses of association mining?
Items that are usually purchased together (eg. diapers and beer)
What is a KPI
Key Performance Indicator: a quantifiable measure used to evaluate the success of an organization, employee, etc., in meeting objectives for performance
According to Crawford, a key problem of Boston's StreetBump app is that
Low-income residents have less access to smartphones
Hadoop is often paired with another piece of software called
MapReduce
How do they facilitate the analysis of Big Data?
MapReduce operates on slices of data (the stations are the workers). Data is processed by each slice with the main finding reported back to the main machine (the queen). It allows big data to be analyzed more efficiently
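The slice-then-combine idea in miniature, as a single-machine Python sketch. It only imitates the pattern: a real Hadoop cluster runs the map step on many servers in parallel, and the word-count task and sample slices here are made up.

```python
from collections import Counter
from functools import reduce


def map_slice(lines):
    """Map step: one worker counts the words in its own slice of the data."""
    return Counter(word for line in lines for word in line.lower().split())


def combine(total, partial):
    """Reduce step: merge one worker's partial counts into the running total."""
    return total + partial


# Three hypothetical slices, as if stored on three different servers.
slices = [["big data big"], ["data science"], ["big science"]]
partials = [map_slice(s) for s in slices]       # each worker reports its findings
totals = reduce(combine, partials, Counter())   # the main machine combines them
```

Only the small partial counts travel back to the main machine, not the raw slices, which is why the approach scales to large data sets.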
What does Crawford propose was the reason for Google's overestimation of flu outbreaks
Media coverage of the flu season
What is the difference between metadata (data dictionary) and data?
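Metadata: self-describing data that describes the structure of the database (a data dictionary). Data: the actual values stored in the records
The Tableau article suggests the following when creating dashboards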
Minimize distracting text and formatting
In the article by Weisberg, Eli Pariser argues that the Filter Bubble is caused by
Personalization of web content
MapReduce
Programming framework that allows us to perform distributed and parallel processing on large data sets in a distributed environment
According to Strickland, which of the following is a type of data integration approach
common user interface, common data storage, manual integration
Which of the following is the most appropriate technique to analyze the statement "Coke tastes better than Pepsi"
comparative sentiment analysis
What is the difference between relational databases and data formatted for Pivot Table analysis?
Relational database: separate themes are kept in different tables (files). Flat file: everything is merged into one big file so you can make pivot tables and analyze the data as a whole. Same structure (column/row), same number of records, etc.
According to Feldman, the most common application of sentiment analysis is
Reviews of consumer products and services
Which of the following are part of the SMART criteria for KPIs
Specific, Measurable, Attainable (Achievable), Relevant
According to Bialik, a key issue with counting steps as a measure of fitness is
Steps are not the only kind of exercise; all steps do not burn the same calories; tools like Fitbit don't accurately count steps
How is data arranged in a relational database? For Pivot Table analysis?
Stored in a group of similar records (table (file))
What is the difference between Tableau and Excel?
Tableau: data visualization tool, organizes data Excel: Spreadsheet tool
The Ashley Madison hack is different from previous hacks in that
The Ashley Madison hack resulted in more personal damages to users
According to Krum, the fact that people remember messages with images more often than ones with just text is called
The Picture Superiority Effect
According to Peck, "people analytics" is
The application of predictive analytics to people's careers
According to Di Justo's article, "telephony metadata" includes
The call's duration
In what situations can sentiment analysis be inaccurate?
The meaning of the words can change in respect to how they are used
Data transformation for association mining
The two conditions we need to apply: -order IDs should be the same -product names should differ
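The two conditions amount to pairing up different products that share an order ID. A minimal Python sketch with invented order lines; pair_counts is a hypothetical helper.

```python
from collections import Counter
from itertools import combinations

# Hypothetical order lines: (order ID, product name).
ORDER_LINES = [
    (1, "diapers"), (1, "beer"), (1, "chips"),
    (2, "diapers"), (2, "beer"),
    (3, "chips"),
]


def pair_counts(order_lines):
    """Count how often two different products appear in the same order."""
    by_order = {}
    for order_id, product in order_lines:
        by_order.setdefault(order_id, set()).add(product)  # same order ID
    counts = Counter()
    for products in by_order.values():
        # combinations() never pairs a product with itself: names must differ.
        for pair in combinations(sorted(products), 2):
            counts[pair] += 1
    return counts


counts = pair_counts(ORDER_LINES)
```

With this sample, ("beer", "diapers") appears in two orders, making it the strongest candidate for an "usually purchased together" rule.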
According to Olson, insurance companies are seeking real-time customer data so that
They can build better customer risk profiles
FiveThirtyEight's search for America's best burrito began with data from
Yelp
How should you deal with outliers?
You hide them or you fix the incorrect formula
What is a scorecard
a statistical record used to measure achievement or progress toward a particular goal.
A basic rule of the Pivot Table data structure is that
all values of the same type need to be in one column
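Why the one-column-per-type rule matters, sketched in Python. The rows are made up: each row is one record, with every region in the first column, every quarter in the second, and every sales amount in the third, so a pivot is just a group-and-sum.

```python
from collections import defaultdict

# Hypothetical flat data: one row per record, one column per type of value.
ROWS = [
    ("East", "Q1", 100),
    ("East", "Q1", 50),
    ("East", "Q2", 150),
    ("West", "Q1", 80),
    ("West", "Q2", 120),
]


def pivot_sum(rows):
    """Sum sales with regions as pivot rows and quarters as pivot columns."""
    table = defaultdict(dict)
    for region, quarter, sales in rows:
        table[region][quarter] = table[region].get(quarter, 0) + sales
    return dict(table)


pivot = pivot_sum(ROWS)
```

If the quarters were spread across separate columns (a Q1 column, a Q2 column, and so on), the same grouping would need special handling for each column.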
According to Silver's article "What the Fox Knows," the "explanation" step involves
answering the questions "why" and "how"
The most detailed type of sentiment analysis is
aspect-based sentiment analysis
Explain the impact of bad data on future decision-making processes
could damage the creator and consumer relationship
Too many attributes (in dirty data)
curse of dimensionality
According to Olson, real-time customer data could result in the ability to adjust premiums
daily
According to Taber, what is a common way Excel corrupts imported data
data are often converted into integers (a single number)
According to Bertolucci, the simplest type of analytics is
descriptive
What is a theory? What is the "direction of causality"?
direction of causality - A correlation between two variables does not indicate which variable is causing which.
Predictive analysis
ex. stock exchange prices (because it is unknown you need to try to predict it based on trends)
According to Wohlsen, Facebook's recent study of its users revealed
exposure to fewer positive messages led to fewer positive posts
Hadoop
framework that allows us to process and store huge data sets. The data is stored on multiple servers
Stein developed a model that could determine the gender of a caller using
his phone records from Google Voice
According to Unwin, a scale is 'really nice' if it
includes 0
According to Krum, the relationship between infographics and data visualizations is best described as
infographics can include visualizations within them
According to Hayes, a benefit of large samples is that
it minimizes sampling error
Which of the following is NOT an example of unstructured data
stock prices
Gandel cites the case of Reinhart and Rogoff, where an excel spreadsheet error:
led them to accidentally exclude five countries from their analysis
According to Gallagher, the RNC wants to customize the experience for visitors to GOP.com by
looking at past interactions with the GOP.com site
According to Microsoft, before cleaning your data you should
make a backup copy of the data in a separate workbook
According to Krum, good infographics should
make sure the relative size of chart elements are proportional to the data values
Hadoop is a platform that
makes big data easier to manage
Bertolucci claims an ongoing problem with Hadoop for companies is that
managers and executives don't really understand what it does
Prescriptive analysis
one step ahead of predictive, gives options on what to do
In Matlin's article, Whong states that his NYC Taxi Cab visualization is part of a larger movement for
open data and transparency
According to Paine, people have been tracking soccer data for
over 60 years
According to Davenport, what is an example of something that should not be included while storytelling with data
sequence of activities used in the analysis
Too many data points (in dirty data)
statistical sampling
According to Davenport, the "essence" of analytical communication includes
the data, the model, the relationships among the variables
According to Schambra, the problem with scoring non-profit outcomes as "success" or "failure" is
the distinction between success and failure is not always clear
Relational databases follow a set of practices called
the rules of normalization
According to Paine, an analysis found that a team's probability of scoring increases as
they string together more successful passes
According to Gallagher, what is a key goal of the RNC's new data platform?
to create a single data platform for all Republican candidates
According to Unwin, the reason for using graphic displays is:
to present or explore data
According to Hurwitz, Hadoop is capable of handling _______ data.
unstructured
"The Agency Problem"
when the data creator is usually not the data consumer (creator and consumer must have a connection)