Data Science Final

How can you use Excel's VLOOKUP() function to transform data from one value to another?

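-=VLOOKUP(lookup value, table array, column index, [range lookup]) -Result is the value from the chosen column of the matching row OR #N/A, depending on whether it was able to find the lookup value -Use a lookup table that lists each old value next to its new value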
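
The exact-match behavior described above can be sketched in Python. This is only an illustration of the idea, not how Excel implements it; the `LOOKUP_TABLE` contents and the `vlookup` helper are hypothetical names made up for this example.

```python
# Hypothetical lookup table standing in for an Excel range:
# first column is the old value, second column is the new value.
LOOKUP_TABLE = [
    ("PA", "Pennsylvania"),
    ("NJ", "New Jersey"),
    ("NY", "New York"),
]


def vlookup(value, table, col_index):
    """Exact-match lookup: return column `col_index` (1-based) of the first
    row whose first column equals `value`, or "#N/A" when nothing matches."""
    for row in table:
        if row[0] == value:
            return row[col_index - 1]
    return "#N/A"


# Transform a column of abbreviations into full names.
codes = ["PA", "NY", "TX"]
full_names = [vlookup(code, LOOKUP_TABLE, 2) for code in codes]
print(full_names)
```

Any value missing from the lookup table comes back as "#N/A", which is also a quick way to spot entries that need attention.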

What is the "Quantified Self" movement and how does it relate to data? To KPIs?

-Apps that help us learn more about ourselves -Quantifies all your moves/seconds/amount you sleep -ex. Fitbit -Insurance companies can change your premiums based on these numbers

What is a Database?

-Bytes grouped into columns (aka fields) -Columns grouped into rows (aka records) -Group of similar records, table or file -DB- collection of tables plus relationships among rows in tables, plus special data called metadata, which describes DB structure

What is data science? What is data (vs. information)?

-Data - raw facts, raw numbers, don't mean anything unless you perform action -Information - putting data in meaningful context -Knowledge - information in action

What is data integration? Why is it necessary?

-Data integration involves combining data residing in different sources and providing users with a unified view of them. -It is necessary in order to connect data sources together so they can be merged and analyzed as a whole

What to do to narrow confidence interval?

-Decrease the confidence level -e.g., 99% to 95%

How do you clean it?

-Fix how you collect data (fix the structure) instead of going back and fixing the data itself -Correct spellings, redo calculations, look to the raw data.

Explain the "tyranny of success" and how a reliance on KPIs can exacerbate the problem. What can you do in those situations?

-Grey area -not everything is a yes or no -You can provide more options when collecting data to ensure all bases are covered

Tableau Functions

-IF milk/soda > 1 then set the value of MostExpensive to "Milk" -Else if milk/soda <= 1 then set the value of MostExpensive to "Soda" -IIF(test, then, else, [unknown])
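
The same decision logic can be sketched in Python to make the branches explicit. This is an analogy, not Tableau code: the `iif` helper mimics the IIF(test, then, else, [unknown]) shape, and the price arguments and the "Unknown" label are assumptions made up for this example. It also assumes the soda price is never zero.

```python
def iif(test, then, else_, unknown=None):
    """Return `then` when the test is true, `else_` when it is false,
    and `unknown` when the test could not be evaluated (None)."""
    if test is None:
        return unknown
    return then if test else else_


def most_expensive(milk_price, soda_price):
    """Label which product costs more, based on the milk/soda ratio."""
    if milk_price is None or soda_price is None:
        ratio_test = None  # a missing price makes the comparison unknown
    else:
        ratio_test = milk_price / soda_price > 1
    return iif(ratio_test, "Milk", "Soda", "Unknown")


print(most_expensive(3.0, 1.5))
print(most_expensive(1.0, 2.0))
print(most_expensive(None, 2.0))
```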

What to do to widen the confidence interval?

-Increase the confidence level -e.g., 95% to 99%
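
A small numeric check shows how the confidence level drives the width of the interval. This is a minimal sketch that assumes a normal approximation for the mean with a known standard deviation; the standard deviation of 10 and the sample size of 100 are made-up numbers.

```python
from math import sqrt
from statistics import NormalDist


def ci_half_width(std_dev, n, confidence):
    """Half-width of a normal-approximation confidence interval for a mean."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    return z * std_dev / sqrt(n)


w95 = ci_half_width(std_dev=10.0, n=100, confidence=0.95)
w99 = ci_half_width(std_dev=10.0, n=100, confidence=0.99)
print(f"95% interval: mean +/- {w95:.2f}")
print(f"99% interval: mean +/- {w99:.2f}")

# A larger sample also narrows the interval at the same confidence level.
w95_big_sample = ci_half_width(std_dev=10.0, n=400, confidence=0.95)
print(f"95% interval with n=400: mean +/- {w95_big_sample:.2f}")
```

The 99% interval is wider than the 95% one on the same data, so raising the level widens the interval and lowering it narrows it.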

What is the difference between infographics and data visualization? What are the similarities?

-Infographics may contain data visualizations -Infographics tell stories -Data visualizations explain data

How can you use Excel's MATCH() function to find invalid values?

-MATCH(E2, Lookup!Names$A$3:$A$54, 0) = row # -Searches the array for the value; if there is a match it outputs the position (row number) of the match within the array, otherwise #N/A, which flags the value as invalid
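
The validation idea can be sketched in Python: look each entry up in a list of valid names and flag the ones that fail to match. The `VALID_NAMES` list, the sample entries, and the `match` helper are hypothetical and only imitate the exact-match mode (the trailing 0) of the spreadsheet function.

```python
# Hypothetical list of valid values, standing in for the lookup range.
VALID_NAMES = ["Pennsylvania", "New Jersey", "New York"]


def match(value, lookup_range):
    """Return the 1-based position of `value` in `lookup_range`,
    or "#N/A" when there is no exact match."""
    try:
        return lookup_range.index(value) + 1
    except ValueError:
        return "#N/A"


entries = ["New York", "Penna.", "Pennsylvania"]
invalid = [e for e in entries if match(e, VALID_NAMES) == "#N/A"]
print(invalid)
```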

What is open data? The three "V"s of big data?

-Open data- free and accessible to everyone -Velocity, Volume, and Variety

Explain why each of the "SMART" criteria are important

-Specific purpose for the business (clearly define goal) -Measurable (results should be tangible) -Achievable by the organization (Goal should stretch you slightly enough to challenge you) -Relevant to success (goals should measure outcomes, not activities in the process) -Time-phased (Time frame must be incorporated)

Differences between statistics and machine learning

-Statistics deals with all assumptions of the validity of the model. You always need to justify the model in statistics -ML deals with predicting the outcome of a model (model validation is accuracy)

What is the difference between structured and unstructured data?

-Structured is easily searchable with basic algorithms. Data is formatted so that all related values are under separate columns with the same type and format -Unstructured is closer to human language and doesn't fit well in a relational database. Images, sound files, etc. It does not follow any structure; there is not one single type. It is difficult to manage unstructured data.

What are some examples of structured and unstructured data?

-Structured: numbers, dates, strings -Unstructured: emails, videos, pictures, presentations, audio files

What is sentiment analysis? How does it work?

-The process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer's attitude towards a particular topic, product, etc. is positive, negative, or neutral. Individual words within the piece of text are evaluated as positive or negative
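
A word-level scorer makes the "individual words are evaluated" step concrete. This is a deliberately tiny sketch; the `POSITIVE` and `NEGATIVE` word lists are made up for illustration, and real tools use much larger lexicons or trained models.

```python
# Hypothetical mini-lexicons; real sentiment lexicons are far larger.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "poor"}


def sentiment(text):
    """Classify text by counting positive and negative words."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(sentiment("I love this great phone!"))
print(sentiment("The battery is terrible."))
print(sentiment("It arrived on Tuesday."))
print(sentiment("This is not good"))
```

The last call comes back as "positive" because the scorer sees "good" and ignores "not", which shows how word-by-word scoring can misread a sentence.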

Best practices of dealing with dirty data

-focus on getting new data right -limit time fixing old data -data producers should communicate with data consumers -check work

Characteristics of open data

-it can be redistributed to others -it can come from any source -it can be available for free

How does data get dirty

-missing values -inconsistent data -nonintegrated data -wrong granularity (too fine or too coarse) -too much data

4 Types of Sentiment analysis

-sentence-level -document level -aspect-based -comparative-level

What does it mean for data to be dirty

-spelling errors -punctuation errors -incorrect data in specified field -duplication data -non integrated data

According to Redman, which is cause of bad data

-users of data create "work arounds" instead of addressing root causes -people creating the data don't understand how others will use it -Creators and users of data have poor communication

What are the steps for communicating an analysis (and what do they mean)?

1) Understand the problem 2) How will I measure? 3) What data is available? 4) Initial solution hypothesis 5) Solution 6) Impact of solution

According to Hayes, what percentage of business leaders do not trust the information they use to make decisions

33%

the dangers of big data analytics

33% of Managers don't trust data

People spend ____ of the time on data cleaning, the rest is on _____. They are:

50%; data analysis. The cleaning tasks are: -Searching for data -correcting errors -verifying correctness

Stein was able to eventually predict the gender of a caller

80% of the time

What is a (good) hypothesis? What are the criteria?

A hypothesis is a testable prediction; it does not always have to be true. It needs to be testable, falsifiable, and grounded in rationale.

What is the difference in definition and purpose between a scorecard and a dashboard?

A scorecard only shows you binary responses - a check or an x. A dashboard can be interactive and has a scale - showing you exact instances of what is being measured.

How do tables of data become associated in a relational database?

By creating relationships between them (in Tableau, this is what we do when we connect tables). There should also be a primary key (EmpNo, for example) that the other tables refer to, so that all tables can be connected
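
The key-based relationship can be tried out with Python's built-in `sqlite3` module. This is a minimal sketch: only the `EmpNo` key comes from the card above, while the `Employee` and `Timesheet` tables, their other columns, and the sample rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- EmpNo is the primary key of Employee ...
    CREATE TABLE Employee (EmpNo INTEGER PRIMARY KEY, Name TEXT);
    -- ... and Timesheet refers back to it, which relates the two tables.
    CREATE TABLE Timesheet (
        EntryId INTEGER PRIMARY KEY,
        EmpNo INTEGER REFERENCES Employee(EmpNo),
        Hours REAL
    );
    INSERT INTO Employee VALUES (1, 'Ana'), (2, 'Ben');
    INSERT INTO Timesheet VALUES (10, 1, 8.0), (11, 1, 7.5), (12, 2, 6.0);
    """
)

# Join the tables on the shared key to get one combined, flat result.
rows = conn.execute(
    """
    SELECT e.Name, SUM(t.Hours)
    FROM Employee AS e
    JOIN Timesheet AS t ON t.EmpNo = e.EmpNo
    GROUP BY e.EmpNo
    ORDER BY e.EmpNo
    """
).fetchall()
print(rows)
```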

What are the benefits of sentiment analysis?

Can help determine how people feel towards a brand, product, etc.

Gandel cites the case of Barclays effort to purchase Lehman Brothers, where the following Excel error resulted in the accidental purchase of 179 toxic assets

cells were hidden instead of deleted

What is the best data visualization to find outliers? To create scorecards?

The best data visualization for finding outliers is a scatter plot.

According to Unwin, one issue with map-based graphical visualizations is that

Distance is not directly related to similarity

According to Hoven, which of the following is NOT one of Few's 8 core principles of data visualizations?

Explain (some that are included: attend, simplify, be skeptical)

According to Acohido, Microsoft uses all of the following data to combat cybercrime EXCEPT

FBI watchlist (some that are included are threat reports, malicious files, early warning reports)

How do you resolve conflicts in data (i.e., PA versus Penna. Versus Pennsylvania)?

Find which name was recorded in the Lookups tab and change any differing names to match it
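
The same cleanup can be sketched in Python with a mapping from every known variant to the one spelling kept in the lookup list. The `CANONICAL` mapping and the sample column are assumptions made up for this example.

```python
# Hypothetical mapping: each known variant (lowercased) -> the agreed spelling.
CANONICAL = {
    "pa": "Pennsylvania",
    "penna.": "Pennsylvania",
    "pennsylvania": "Pennsylvania",
}


def standardize(value):
    """Return the agreed spelling for a known variant; leave others unchanged."""
    return CANONICAL.get(value.strip().lower(), value)


states = ["PA", "Penna.", "Pennsylvania", "Ohio"]
cleaned = [standardize(s) for s in states]
print(cleaned)
```

Values that are not in the mapping pass through unchanged, so they can be reviewed by hand instead of being silently rewritten.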

What are the uses of forecasting?

Forecasting sales amounts for the future (predictive). Widening the forecast range means more confidence that the actual value falls inside it (moving from 95% to 99% confidence); increasing the number of data points gives better accuracy

According to Farmer, the purpose of creating an information scent is to

Give the user hints where they should explore further

What is the difference between Hadoop and MapReduce?

Hadoop allows you to store big data, while MapReduce is software that allows you to perform tasks on slices of the data across servers.

According to Peck, the use of analytics to determine workers' potential is most widely used in

Hourly work, where the jobs are standardized

How do you create a KPI for a scorecard using a calculated field in Tableau?

IIF

When should you not bother resolving conflicts or even fixing your data?

If it's too damaged

What are the uses of association mining?

Items that are usually purchased together (eg. diapers and beer)

What is a KPI

Key Performance Indicator: a quantifiable measure used to evaluate the success of an organization, employee, etc., in meeting objectives for performance

According to Crawford, a key problem of Boston's StreetBump app is that

Low-income residents have less access to smartphones

Hadoop is often paired with another piece of software called

MapReduce

How do they facilitate the analysis of Big Data?

MapReduce operates on slices of data (the stations are the workers). Data is processed by each slice with the main finding reported back to the main machine (the queen). It allows big data to be analyzed more efficiently
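
The slice-then-combine idea can be imitated on one machine with a word count. This is a simplified sketch and not Hadoop code: the three slices are processed one after another here, whereas a real cluster would hand each slice to a separate worker, and the sample text is made up.

```python
from collections import Counter
from functools import reduce


def map_slice(lines):
    """Map step: count the words found in one slice of the data."""
    return Counter(word.lower() for line in lines for word in line.split())


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


# Hypothetical data set, already split into three slices.
slices = [["big data big"], ["data science"], ["big science"]]

partials = [map_slice(s) for s in slices]  # each worker reports its own finding
total = reduce(reduce_counts, partials, Counter())  # main machine combines them
print(total)
```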

What does Crawford propose was the reason for Google's overestimation of flu outbreaks

Media coverage of the flu season

What is the difference between metadata (data dictionary) and data?

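Metadata: data that describes the structure of the data, which makes a database self-describing. Data: the actual values stored in the tables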

The Tableau article suggests the following when creating dashboards

Minimize distracting text and formatting

In the article by Weisberg, Eli Pariser argues that the Filter Bubble is caused by

Personalization of web content

MapReduce

Programming framework that allows us to perform distributed and parallel processing on large data sets in a distributed environment

According to Strickland, which of the following is a type of data integration approach

-common user interface -common data storage -manual integration

Which of the following is the most appropriate technique to analyze the statement "Coke tastes better than Pepsi"

comparative sentiment analysis

What is the difference between relational databases and data formatted for Pivot Table analysis?

Relational database: separate themes are kept in different tables (files). Flat file: everything is merged into one big file so you can make pivot tables and analyze the data as a whole. Same structure (column/row), same number of records, etc.

According to Feldman, the most common application of sentiment analysis is

Reviews of consumer products and services

Which of the following are part of the SMART criteria for KPIs

-Specific -Measurable -Relevant -Attainable

According to Bialik, a key issue with counting steps as a measure of fitness is

-Steps are not the only kind of exercise -All steps do not burn the same calories -Tools like Fitbit don't accurately count steps

How is data arranged in a relational database? For Pivot Table analysis?

Stored in a group of similar records (table (file))

What is the difference between Tableau and Excel?

Tableau: data visualization tool, organizes data Excel: Spreadsheet tool

The Ashley Madison hack is different from previous hacks in that

The Ashley Madison hack resulted in more personal damages to users

According to Krum, the fact that people remember messages with images more often than ones with just text is called

The Picture Superiority Effect

According to Peck, "people analytics" is

The application of predictive analytics to people's careers

According to Di Justo's article, "telephony metadata" includes

The call's duration

In what situations can sentiment analysis be inaccurate?

The meaning of the words can change in respect to how they are used

Data transformation for association mining

The two conditions we need to apply: -order IDs should be the same -product names should differ
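
Those two conditions amount to pairing up different products that share an order ID. A minimal Python sketch of that transformation follows; the order lines are invented sample data, and each pair is stored once in alphabetical order so the counts are not doubled.

```python
from collections import Counter
from itertools import combinations

# Hypothetical order lines: (order ID, product name).
order_lines = [
    (1, "Diapers"), (1, "Beer"), (1, "Chips"),
    (2, "Diapers"), (2, "Beer"),
    (3, "Chips"),
]


def co_purchase_counts(lines):
    """Count how often two different products appear on the same order."""
    by_order = {}
    for order_id, product in lines:
        by_order.setdefault(order_id, set()).add(product)
    pairs = Counter()
    for products in by_order.values():
        # Same order ID, differing product names.
        for first, second in combinations(sorted(products), 2):
            pairs[(first, second)] += 1
    return pairs


pair_counts = co_purchase_counts(order_lines)
print(pair_counts.most_common(1))
```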

According to Olson, insurance companies are seeking real-time customer data so that

They can build better customer risk profiles

FiveThirtyEight's search for America's best burrito began with data from

Yelp

How should you deal with outliers?

You hide them or you fix the incorrect formula

What is a scorecard

a statistical record used to measure achievement or progress toward a particular goal.

A basic rule of the Pivot Table data structure is that

all values of the same type need to be in one column

According to Silver's article "What the Fox Knows," the "explanation" step involves

answering the questions "why" and "how"

The most detailed type of sentiment analysis is

aspect-based sentiment analysis

Explain the impact of bad data on future decision-making processes

could damage the creator and consumer relationship

Too many attributes (in dirty data)

curse of dimensionality

According to Olson, real-time customer data could result in the ability to adjust premiums

daily

According to Taber, what is a common way Excel corrupts imported data

data are often converted into integers (a single number)

According to Bertolucci, the simplest type of analytics is

descriptive

What is a theory? What is the "direction of causality"?

direction of causality - A correlation between two variables does not indicate which variable is causing which.

Predictive analysis

ex. stock exchange prices (because it is unknown you need to try to predict it based on trends)

According to Wohlsen, Facebook's recent study of its users revealed

exposure to fewer positive messages led to fewer positive posts

Hadoop

framework that allows us to process and store huge data sets. The data is stored on multiple servers

Stein developed a model that could determine the gender of a caller using

his phone records from Google Voice

According to Unwin, a scale is 'really nice' if it

includes 0

According to Krum, the relationship between infographics and data visualizations is best described as

infographics can include visualizations within them

According to Hayes, a benefit of large samples is that

it minimizes sampling error

Which of the following is NOT an example of unstructured data

stock prices

Gandel cites the case of Reinhart and Rogoff, where an excel spreadsheet error:

led them to accidentally exclude five countries from their analysis

According to Gallagher, the RNC wants to customize the experience for visitors to GOP.com by

looking at past interactions with the GOP.com site

According to Microsoft, before cleaning your data you should

make a backup copy of the data in a separate workbook

According to Krum, good infographics should

make sure the relative size of chart elements are proportional to the data values

Hadoop is a platform that

makes big data easier to manage

Bertolucci claims an ongoing problem with Hadoop for companies is that

managers and executives don't really understand what it does

Prescriptive analysis

one step ahead of predictive, gives options on what to do

In Matlin's article, Whong states that his NYC Taxi Cab visualization is part of a larger movement for

open data and transparency

According to Paine, people have been tracking soccer data for

over 60 years

According to Davenport, what is an example of something that should not be included while storytelling with data

sequence of activities used in the analysis

Too many data points (in dirty data)

statistical sampling

According to Davenport, the "essence" of analytical communication includes

-the data -the model -the relationships among the variables

According to Schambra, the problem with scoring non-profit outcomes as "success" or "failure" is

the distinction between success and failure is not always clear

Relational databases follow a set of practices called

the rules of normalization

According to Paine, an analysis found that a team's probability of scoring increases as

they string together more successful passes

According to Gallagher, what is a key goal of the RNC's new data platform?

to create a single data platform for all Republican candidates

According to Unwin, the reason for using graphic displays is:

to present or explore data

According to Hurwitz, Hadoop is capable of handling _______ data.

unstructured

"The Agency Problem"

when the data creator is usually not the data consumer (creator and consumer must have a connection)

