D204: The Data Analytics Journey


The ______________ section would contain the ideas of predictive modeling as well as data mining/machine learning.

"Modeling"

Describe predictive modeling (question and an example)

"What will happen in the future?" Example: churn analysis

Data Acquisition (Step 5), Data Cleaning (Step 6), and Data Exploration (Step 7) in this framework all fall under the "____________" domain.

"Wrangling" domain.

Describe prescriptive modeling (question, what it is and an example)

- "What should we do?" - Make a recommendation - Example: What types of donuts should we sell to maximize profits?

Describe descriptive modeling (question, what it is and an example)

- "What's going on here?" - Describes the data that's present (Mean, Median, Mode, counting things, etc) - Example: In what month did we sell the most donuts?

What happens during the reporting phase?

1. Create a story to report data in a meaningful, in-context way 2. Remove irrelevant variables 3. Use visuals to communicate the story 4. Present it to stakeholders 5. Revisit the model to make improvements and adjustments as needed 6. Archive assets

What is involved in the planning phase?

1. Defining goals 2. Organizing resources 3. Coordinate people 4. Schedule project

What happens during the Data Mining phase? (3)

1. Finding patterns and insights 2. Test hypotheses 3. Refine

What happens during data exploration phase? (3)

1. Get familiar with the dataset 2. Analyst gets insight into the structure of the data set Example: An analyst applies a statistical formula to obtain the average temperature for a city over the last 50 years. 3. Verification through visualization

What happens during the data acquisition phase? (3)

1. Get the data from various sources 2. Data cleaning (most time consuming part of process) 3. Detecting missing values - defining data columns that could contain null values, etc.

What happens during the discovery phase? (6)

1. Meet with stakeholders 2. Identify business needs 3. Define the question 4. Do we have the data we need to answer the question? 5. Organize resources 6. Coordinate people

What happens during the modeling phase?

1. Use the data to answer business questions 2. Descriptive Modeling 3. Prescriptive Modeling 4. Predictive Modeling "What will happen in the future?" Example: churn analysis

What is involved in the Applying phase?

13. Present model 14. Deploy model 15. Revisit model 16. Archive assets

What is involved in the wrangling phase?

5. Get data 6. Clean data 7. Explore data 8. Refine data

What is the 68-95-99.7 rule?

68% of the data are within one standard deviation, above or below the mean. 95% of the observations are within two standard deviations above or below the mean. And 99.7% of the observations are within three standard deviations above or below the mean.
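The percentages in this rule can be checked numerically. The sketch below assumes the data follow a normal distribution and uses Python's standard-library `statistics.NormalDist`; the helper name `share_within` is made up for illustration.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def share_within(k: float) -> float:
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.4f}")
```

The three values come out at roughly 0.6827, 0.9545, and 0.9973, which is where the rounded 68, 95, and 99.7 figures come from.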

What is involved in the Modeling phase?

9. Create model 10. Validate model 11. Evaluate model 12. Refine model

A restaurant owner wants to sponsor a data analytics project to provide insights regarding hamburger sales before developing a strategy for increasing sales. Which question is framed appropriately for the data analytics project? A. What are the characteristics of customers who buy hamburgers? B. What does the supply and demand curve look like for hamburgers? C. Which discount coupons should we send to neighborhood residents? D. Which varieties of hamburgers are featured by competitors?

A

Which activity does an analyst perform in the discovery phase of the data analytics life cycle? A. Identifying business needs B. Identifying outliers C. Cleaning data D. Collecting data

A

Which feature is commonly found in collaboration tools like Jira, Slack, Teams, and PivotalTracker? A. Real-time messaging B. Multivariate analysis C. Equation editor D. Source code management

A

Which party has the primary vision for a data analytics project and brings resources to complete it? A. Project sponsors B. Project managers C. Customers D. Data analysts

A

Which characteristics are used to group data together in a cluster analysis? Choose 2 answers. A. Distance B. Similarity C. Shape D. Size

A & B

What are two purposes of the reporting phase of the data analytics life cycle? A. Provide the conclusions from the analysis in an engaging manner B. Provide a tool for decision-makers to import and analyze more data C. Provide actionable insights that can inform decision-making D. Provide an automated way for decision-makers to test their own models

A & C

An oil company uses robots and sensors to detect how pipeline corrosion changes over time. The collected data is then used in a predictive model that estimates when a pipe should be replaced. How does the predictive model serve this oil company? A. To minimize interruptions from maintenance shutdowns B. To minimize the need for workforce safety training C. To improve compliance with pipeline construction standards D. To improve compliance with pipeline disposal standards

A.

What is an example of random sampling of college students? A. Surveying students chosen arbitrarily from around the entire college campus B. Surveying every student in the college library C. Surveying students chosen arbitrarily in the library of the university D. Surveying every student on campus

A.

Which organizational objective could be accomplished with a descriptive data analytics project using website request logs as a data source? A. Explain why web data transfer has increased 25% B. Estimate the traffic increase for a new product launch C. Improve the speed of server request processing D. Recommend a strategy to increase network capacity

A.

Which technique can be used to determine the likelihood that a positive diagnostic test result indicates whether the disease is actually present? A. Bayes' theorem B. Central limit theorem C. Regression D. Optimization

A.

Which tool has libraries that expand its visualization capabilities? A. Python B. Tableau C. Adobe Infographics D. D3.js

A.

_____________________ is where you prepare two different versions of a website, or a landing page, and you see how many people click on things as a result of those two different versions.

AB Testing
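One common way to compare the two versions is a two-proportion z-test on the click rates. This is only a sketch of that idea, not something the course prescribes; the visitor and click counts are invented, and `two_proportion_z` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(clicks_a, visitors_a, clicks_b, visitors_b):
    """Compare click rates of version A and version B with a pooled two-proportion z-test."""
    rate_a = clicks_a / visitors_a
    rate_b = clicks_b / visitors_b
    # Pooled click rate under the assumption that both versions perform the same.
    pooled = (clicks_a + clicks_b) / (visitors_a + visitors_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return rate_a, rate_b, z, p_value


# Invented counts: 1,000 visitors saw each version of the page.
rate_a, rate_b, z, p_value = two_proportion_z(120, 1000, 150, 1000)
print(f"A: {rate_a:.1%}  B: {rate_b:.1%}  z = {z:.2f}  p = {p_value:.3f}")
```

A small p-value suggests the difference in click rates is unlikely to be chance alone; the threshold to use is a judgment call for the project.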

A ______________________________ isn't a source of data but rather it's a way of sharing data, it can take data from one application to another or from a server to your computer.

API or Application Programming Interface

_____________ focuses on web analytics, SQL, visualizations

Analyst

These are people who do the day-to-day data tasks that are necessary for any business to run efficiently. Those include things like web analytics, and S-Q-L, that's SQL or Structured Query Language, data visualizations, and the reports that go into business intelligence. These allow people to make decisions.

Analysts

_______________ is a beautiful time chart. It's similar to a line chart, but it's full of color, instead of being a simple stroke.

Area chart

_________________________ means algorithms that learn from data.

Artificial intelligence

A data analytics project manager has been asked to complete a project on a very short timeline. Which action is likely to yield positive results? A. Outsource the skilled work to an unproven vendor B. Expand the team with experienced staff C. Require current team to work overtime D. Accept lowered quality standards

B

A data analytics project team is preparing to develop a predictive model that will be included within a business intelligence tool for upper management. Which step should be considered for inclusion when creating the project schedule? A. Model testing and validation for users B. Business intelligence tool interface training C. Model training and testing for stakeholders D. Business intelligence tool data transformation training

B

In which phase of the data analytics life cycle does an analyst build a histogram? A. Data acquisition B. Data exploration C. Discovery D. Predictive modeling

B

What do open-source software tools and widely available analysis tools, such as spreadsheets, help accomplish? A. Data schemas B. Data democratization C. Data security D. Data compliance

B

What is a characteristic of active listening? A. Actively working on a task while listening to the speaker B. Seeking to understand the speaker's emotions and intent C. Focusing intently on the content of the message D. Waiting patiently to share one's own thoughts

B

Which action can the project manager take to keep the team engaged in the analytics project? A. At the end of the project, the team publishes an extensive research report and includes it in an email to project stakeholders. B. Throughout the project, the project manager communicates insights from the data analytics team and provides ideas of ways to act on those insights. C. At the end of the project, the project manager sends an email with the predictive model to the stakeholders so they can use it. D. Throughout the project, the project manager holds regular meetings so the entire data analytics team can showcase their work to different departments.

B

Which outcome should be expected when working with data aggregated from multiple sources? Select two answers. A. Consistently named fields B. Inconsistently named fields C. Data needs cleaning D. Data does not need cleaning

B & C

What is a feature of SQL? Choose 2 answers. A. It is an object-oriented programming language. B. The basic language is the same across database servers. C. It has built-in chart and graph creation. D. It is used with structured data and unstructured data.

B & D

A neural network algorithm in machine learning endeavors to recognize underlying relationships in a set of data. What does this process mimic? A. The way a computer processes data B. The way the human brain operates C. The way architects establish functionality D. The way that social media builds networks

B.

An analyst applies a statistical formula to obtain the average temperature for a city over the last 50 years. Which phase of the data analytics life cycle is represented by this activity? A. Data acquisition B. Exploratory data analysis C. Predictive modeling D. Data reporting

B.

Which concept should be considered when choosing variables for inclusion in a linear regression model? A. Feasibility of merging the variables B. Feasibility of controlling the variables C. Feasibility of testing the variables D. Feasibility of classifying the variables

B.

Which type of project management problem occurs when a data mining task has started but a data acquisition task has not been completed? A. Scope B. Schedule C. Procedure D. Cost

B.

Why might a data analyst resample a data set with replacement in a data mining project? A. Skewed data resulting from outliers B. Too little data for training and testing data sets C. Wrong variables chosen for analysis D. Misidentification of causation due to correlation

B.
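Resampling with replacement (often called bootstrapping) can be sketched with the standard library. The observations below are invented and `bootstrap_sample` is a hypothetical helper; the point is only that each resample draws from the same small data set and may repeat values.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement, so values can repeat."""
    return rng.choices(data, k=len(data))


rng = random.Random(42)  # fixed seed so the sketch is repeatable
observations = [12, 15, 9, 20, 17]  # invented, deliberately tiny data set

# Build many resamples and look at how the mean varies across them.
resamples = [bootstrap_sample(observations, rng) for _ in range(1000)]
boot_means = [sum(sample) / len(sample) for sample in resamples]
print(f"lowest resample mean: {min(boot_means):.1f}, highest: {max(boot_means):.1f}")
```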

___________________ are mainly used when comparing different items.

Bar charts

What ________________ does is it gives you the posterior or after-the-data probability of a hypothesis as a function of the likelihood of the data given the hypothesis, the prior probability of the hypothesis and the probability of getting the data you found.

Bayes' Theorem
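The description above maps directly onto a small calculation. The sketch below applies it to the diagnostic-test case; the prevalence, sensitivity, and false-positive rate are invented numbers, and `posterior_probability` is a hypothetical helper name.

```python
def posterior_probability(prior, likelihood, false_positive_rate):
    """P(hypothesis | data) for a yes/no hypothesis, using Bayes' theorem.

    prior               -- P(hypothesis) before seeing the data
    likelihood          -- P(data | hypothesis)
    false_positive_rate -- P(data | hypothesis is false)
    """
    # P(data): probability of getting this data whether or not the hypothesis is true.
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# Invented example: 1% of people have the disease, the test catches 95% of
# true cases, and 5% of healthy people test positive anyway.
p_disease_given_positive = posterior_probability(
    prior=0.01, likelihood=0.95, false_positive_rate=0.05
)
print(f"P(disease | positive test) = {p_disease_given_positive:.3f}")
```

With these assumed numbers the posterior is only about 0.16, because the disease is rare and false positives outnumber true positives.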

______________ is a simple mathematical formula used for calculating conditional probabilities.

Bayes' Theorem

__________________ is a statistical paradigm that answers research questions about unknown parameters using probability statements. For example, what is the probability that the average male height is between 70 and 80 inches or that the average female height is between 60 and 70 inches?

Bayesian analysis

____________ is data that is characterized by any or all of three characteristics. Unusual volume, unusual velocity, and unusual variety.

Big data

___________________ can create interactive charts and graphs and also it allows you to create tools for the user so they can access and explore your visualizations in more detail.

Bokeh

____________________ is a competency about understanding business concepts, being business savvy, and understanding how businesses work.

Business acumen

__________________________ views help us see and execute on things like the business strategy, business capabilities, business knowledge, value streams, and organizational views.

Business architecture

__________________ is all about getting the insight to do something better in your business.

Business intelligence

___________________ really shows to the best extent how data science can be used to make practical decisions that make organizations function more effectively and more efficiently.

Business intelligence

__________________________ can use the research plan in selling the project to the organization.

Business leaders

Who are the end users?

Business managers, executives and even customers who are using the outputs of the research.

This phase is also known as the discovery phase. During this phase, an analyst defines the major questions of interest that need to be answered, seeks to understand the needs of the stakeholders, and assesses the resource constraints of the project.

Business understanding

A consumer sues an entertainment streaming company for leaking personal information regarding her viewing habits. Which ASA ethical standard did the streaming company violate? A. Conflict of interest B. Biases C. Privacy D. Unfair discrimination

C

A data analyst notices that the data selected for an analytics project is slightly misaligned with the research question. How can the data analyst resolve this situation? A. Halt the data analytics project to pursue a new research question B. Dive deeper into the data to identify data quality issues C. Adjust the research question to reframe the analysis D. Transform the data to a new metric

C

How can an organization improve interprofessional communication among team members? A. By setting work priorities for team members B. By requiring weekly updates on project deadlines C. By using tools that provide a team-based collaboration space D. By ensuring employees can recite the desired outcomes

C

Numerical measurements of the amount of a toxic chemical substance are recorded in a large database. Which hypothesis can the data analyst answer through exploratory data analytic methods? A. The chemical will not cause harm to the habitat's native species. B. The chemical contamination is a result of human activity. C. The statistical distribution of the chemical measurements is normal. D. The best analytic approach for analyzing the data is linear regression.

C

What is an example of an external stakeholder for a data analytics project? A. President/CEO B. Project manager C. Regulatory body D. Data analyst's supervisor

C

What is an example of unstructured data? A. Names, dates, and addresses B. Credit card numbers that include a credit score C. Text messages that include video D. Height, weight, and gender

C

Which circumstance could cause a data analyst to have difficulty developing a model to answer a business question? A. Project scope creep B. Poor project budgeting C. Lack of relevant data sources D. Lack of stakeholder support

C

Which statistical technique should be used to draw conclusions about an entire population based on a representative sample? A. Correlation B. Bayes theorem C. Hypothesis testing D. Measures of central tendency

C

Which task would an analyst consider first during the discovery phase of the data analytics lifecycle? A. Seek out necessary data sources. B. Formulate a project plan. C. Identify project goals. D. Develop key metrics.

C

Which technique can a project manager use to foster the identification of quality data analytics questions? A. Organized project planning B. Rigorous data cleaning C. Frequent collaboration with the team D. Acquisition of abundant project resources

C

Which tool should a researcher use to conduct a univariate analysis on complex statistical data? A. Tableau B. Power BI C. R D. SQL

C

Which type of analysis would be used to predict a binary outcome based on a set of independent variables? A. Hypothesis testing B. Descriptive statistics C. Regression D. Time Series

C

Which type of data representation should a data analyst use to display expense categories as a percentage of total business expenses? A. Map visualization B. Line chart C. Pie chart D. Scatter plot

C

Which tools can be used for performing statistics and creating interactive data visualization for large datasets from various sources? Choose 2 answers. A. Gantt Chart B. SQL C. Tableau D. R

C & D

___________ are general-purpose languages that are used for the back end, the foundational elements of data science, and they provide maximum speed.

C, C++, and Java

A data analyst has identified combinations of sales transactions that frequently occur together in data over the past 5 years. Which phase of the data analytics life cycle is represented by this analysis? A. Data acquisition B. Representation and reporting C. Data mining D. Predictive modeling

C.

A specific drug is manufactured for the treatment of depression. The company decides to ignore research results on an alternative, less expensive, drug treatment in order to make higher profits. Which ASA ethical standard has the company violated? A. Unfair discrimination B. Reproducible results C. Conflict of interest D. Transparent assumptions

C.

An analyst has been tasked with defining data columns that could contain null values. Which activity of the data acquisition phase is represented? A. Collecting data B. Disqualifying data sources C. Detecting missing values D. Transforming improperly formatted text

C.

An analyst is looking at data that includes the customer's address, date of purchase, and age. Which question could be answered from this data? A. Which customer has spent the highest dollar amount? B. Which customer is most likely to respond favorably to the next marketing campaign? C. Which state has the highest total customers? D. Which product has sold the most in a certain state?

C.

An analyst realizes that the data set has been reduced significantly, resulting in sample sizes that are too small. In which phase of the data analytics life cycle did this likely occur? A. Data exploration B. Data modeling C. Data mining D. Data discovery

C.

What does the critical path represent in data analytics project management? A. Minimum time to complete independent tasks B. Maximum time to complete independent tasks C. Minimum time to complete dependent tasks D. Maximum time to complete dependent tasks

C.
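The critical path is the longest chain of dependent tasks, which sets the minimum time the project can take. A minimal sketch, assuming an invented task list with durations in days; the task names and `earliest_finish` are hypothetical.

```python
from functools import lru_cache

# Invented project plan: task -> (duration in days, tasks it depends on).
TASKS = {
    "discovery": (2, []),
    "acquire_data": (3, ["discovery"]),
    "clean_data": (5, ["acquire_data"]),
    "explore_data": (2, ["clean_data"]),
    "plan_report": (1, ["discovery"]),
    "build_model": (4, ["explore_data"]),
    "report": (2, ["build_model", "plan_report"]),
}


@lru_cache(maxsize=None)
def earliest_finish(task: str) -> int:
    """Earliest day a task can finish: its duration plus the latest-finishing dependency."""
    duration, dependencies = TASKS[task]
    return duration + max((earliest_finish(dep) for dep in dependencies), default=0)


# The critical path length is the earliest finish of the last task to complete.
project_length = max(earliest_finish(task) for task in TASKS)
print(f"minimum project duration: {project_length} days")
```

In this made-up plan, `plan_report` can slip without delaying anything, while any delay along the discovery-to-report chain pushes out the whole project.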

What is a common duty of a database administrator? A. Set project timelines, milestones, and goals B. Acquire funding for data analytics projects C. Maintain data on the IT infrastructure D. Define business needs at the onset of a project

C.

What is an effective method for a data analyst to prepare for a one-on-one meeting with a manager? A. Make a written list of all source code comments B. Ask other inside employees about the manager's reputation C. Bring a set of questions to draw on to keep the conversation going D. Create an essay summarizing steps in the source code

C.

What strategy will contribute to effective data representation and reporting? A. Creating a new training data set B. Selecting data for a prediction model C. Excluding unrelated data D. Extracting data from source repositories

C.

What will be a consequence of poor attention to detail during the data exploration phase? A. Not enough variables will be considered in the analysis. B. The outcome of the analysis will be misaligned to business needs. C. The analyst will lack insight into the structure of the data set. D. The model will be built using the wrong data set.

C.

Which activity in the data analytics life cycle occurs during the data acquisition phase and requires the most time and effort from the data analyst? A. Selecting the data sources B. Importing data into a database C. Cleaning data D. Defining goals

C.

Which aspect of data exploration occurs when an analyst writes code to compile a bar graph of dog food sales per month? A. Performance of a correlation analysis B. Analysis of data anomalies C. Verification through visualization D. Determination of variabilities

C.

Which type of data analysis is appropriate if the goal is to minimize the cost of a diet, using a data set consisting of the following variables: protein content, fat content, and cost per unit? A. Decision trees B. Calculus C. Optimization D. Bayes' theorem

C.
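The diet question can be sketched as a tiny optimization problem. This example uses exhaustive search over whole units rather than a real linear-programming solver, and every number in it (foods, nutrient values, prices, limits) is invented for illustration.

```python
# Invented foods: protein and fat in grams per unit, cost in cents per unit.
FOODS = {
    "beans": {"protein": 7, "fat": 1, "cost_cents": 50},
    "chicken": {"protein": 25, "fat": 3, "cost_cents": 220},
}
MIN_PROTEIN = 50  # assumed daily protein requirement (grams)
MAX_FAT = 7       # assumed daily fat limit (grams)
MAX_UNITS = 10    # search 0..10 units of each food


def cheapest_diet():
    """Try every whole-unit combination and keep the cheapest one that meets the limits."""
    best = None
    for beans in range(MAX_UNITS + 1):
        for chicken in range(MAX_UNITS + 1):
            protein = beans * FOODS["beans"]["protein"] + chicken * FOODS["chicken"]["protein"]
            fat = beans * FOODS["beans"]["fat"] + chicken * FOODS["chicken"]["fat"]
            if protein < MIN_PROTEIN or fat > MAX_FAT:
                continue  # violates a constraint
            cost = beans * FOODS["beans"]["cost_cents"] + chicken * FOODS["chicken"]["cost_cents"]
            if best is None or cost < best[0]:
                best = (cost, beans, chicken)
    return best


best_cost, best_beans, best_chicken = cheapest_diet()
print(f"{best_beans} beans + {best_chicken} chicken for {best_cost} cents")
```

Brute force only works because the problem is tiny; larger problems of this shape are usually handed to a dedicated solver.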

______________________ is the data that can simply be grouped or placed into a category, like the name suggests. A quick example could be a ghost, a dino, a bridge.

Categorical data

____________ may be, at least in theory, impossible. But ____________________ can get you close enough for any practical purposes and help put you and your organization on the right path to maximizing the outcomes that are most important to you.

Causality, prescriptive analytics

The ____________ states that the sampling distribution of the sample means approaches a normal distribution as the sample size gets larger (if you were to take 50 people out of that population and get the mean, then take another 50 random people and get their mean age, and so forth, all of those means would follow the normal distribution (bell curve)).

Central Limit Theorem
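The theorem can be illustrated with a small simulation. The sketch below assumes a deliberately skewed population (an exponential distribution with mean 1 and standard deviation 1), draws repeated samples of 50, and looks at the spread of their means; the sample counts and seed are arbitrary choices.

```python
import random
from statistics import mean, stdev

rng = random.Random(0)  # fixed seed so the sketch is repeatable
SAMPLE_SIZE = 50        # people per sample, as in the card's example
NUM_SAMPLES = 2000      # how many samples to draw


def sample_mean() -> float:
    """Mean of one random sample drawn from a skewed (exponential) population."""
    return mean([rng.expovariate(1.0) for _ in range(SAMPLE_SIZE)])


sample_means = [sample_mean() for _ in range(NUM_SAMPLES)]

# The sample means cluster around the population mean (1.0) with a spread
# close to population_sd / sqrt(SAMPLE_SIZE), even though the population is skewed.
print(f"mean of sample means:  {mean(sample_means):.3f}")
print(f"stdev of sample means: {stdev(sample_means):.3f}")
print(f"theoretical stdev:     {1 / SAMPLE_SIZE ** 0.5:.3f}")
```

Plotting `sample_means` as a histogram would show the bell shape the card describes.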

Even if you have a wide spread of a variable, let's say, age in a population, and you take lots of sample groups, the mean age of those sample groups would tend to have a normal distribution.

Central Limit theorem

__________________________________________ is focused on the obligation of others, and not a lot about what I as a data practitioner am going to do.

Certified Analytics Professional's Code of Conduct

_________________ is law created by courts. In the U.S. and other former British colonies, judges have authority to actually create rules of law that determine individual and organizational rights and responsibilities.

Common Law

The three principal sources of U.S. law are: ________________________, ________________________, __________________________

Common law (or judge-made law), statutory law, and constitutional law

When the researching organization consciously ignores data that calls their results into question or only presents one side of the results that puts them in a positive light.

Conflict of interest

___________________________ gives the government the authority to act and restricts that authority to ensure that the branches don't overstep their bounds or infringe unnecessarily on individual rights such as rights to fairness and equality.

Constitutional law

____________________ is examining the relationship between 2 or more numerical variables.

Correlation
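One standard way to put a number on that relationship is the Pearson correlation coefficient, which ranges from -1 (perfect negative) to +1 (perfect positive). The sketch below computes it by hand with the standard library; the data are invented and `pearson_r` is a hypothetical helper name.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and the two sums of squared deviations.
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_deviation / (spread_x * spread_y)


# Invented data: ad spend versus sales (moves together) and versus returns (moves opposite).
ad_spend = [1, 2, 3, 4, 5]
sales = [2, 4, 6, 8, 10]
returns = [9, 7, 6, 4, 1]

print(f"ad spend vs sales:   {pearson_r(ad_spend, sales):+.2f}")
print(f"ad spend vs returns: {pearson_r(ad_spend, returns):+.2f}")
```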

________________ will be either positive, negative, or null (no relationship). There's ___________, which could be positive and/or negative, and you've got exponential correlations. For this we'll just simply use a ____________.

Correlation, linear, scatter plot

Predictive or data mining models could be considered in the "_________________________" grouping.

Create a model

A data analyst needs to contact a specific member of the database administration team. Which method should be used to discover the person's email address? A. Ask the project's customers B. Ask the project's sponsors C. Send an email to project stakeholders D. Send an email to the team member's manager

D

Which mistake is commonly made during the predictive analytics phase? A. The data are separated into different sets. B. The variables are separated into response and independent variables. C. The data are prepared before the model is developed. D. The model is developed before the research question is known.

D.

What might be developed by data analysts when acquiring data from a data warehouse? A. The procedures for extracting files from the data warehouse B. The procedures for updating tables in the data warehouse C. The relational structure of tables D. The SQL queries of data within the tables

D. The SQL queries of data within the tables

______________ in itself is not a language, it's actually just a library within the language JavaScript

D3

_______________ is an awesome JavaScript library to visualize data.

D3

This is the phase of collecting data. Frequently, data will be retrieved from a database, perhaps a component of a data warehouse, by using a language like SQL.

Data Acquisition

In this phase, the analyst begins to understand the basic nature of data and the relationships within it. This phase often relies on the use of data visualization tools and numerical summaries, such as measures of central tendency and variability.

Data Exploration

This is the phase of collecting data.

Data acquisition

___________________ came from computer science. They learned to extract meaning from relational and noSQL databases. They focus on presenting and discovering interesting bits of data that support decision making.

Data analysts

Sometimes data might not be available and the analyst will use tools such as web scraping or surveys to acquire it during which phase?

Data acquisition

These are the developers, and the system architects, the people who focus on the hardware and the software that make data science possible

Data engineers

______________ is about analyzing the flow of data through a system and the user's experience with your product.

Data flow

What phase is this an example of: A data analyst has identified combinations of sales transactions that frequently occur together in data over the past 5 years

Data mining

What phase is this an example of: An analyst realizes that the data set has been reduced significantly, resulting in sample sizes that are too small

Data mining

During which phase of the data analytics life cycle does an analyst create a story to report data?

Data reporting

___________________ is what makes business intelligence possible.

Data science

________________ are seen as multi-disciplinary. They're data analysts, but they can also create software, work with mathematics, know the business and ask interesting questions.

Data scientists

A travel website tabulated the results of their latest marketing campaign to understand the relationship of clicks-to-sales conversions. Which area of analytics does this activity represent?

Descriptive

_______________ analytics answers what has happened in the past.

Descriptive

___________________________ describes the data that is present. Mean, Median, Mode, counting things. How many of each size and color of shirt were sold in the last month? Do we sell more shirts in the summer vs winter?

Descriptive analysis
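The shirt example can be sketched with the standard library. The sales records below are invented; the code only counts and summarizes what is already in the data, which is what makes it descriptive.

```python
from collections import Counter
from statistics import mean, median, mode

# Invented records for last month's shirt sales.
shirt_sizes_sold = ["M", "L", "M", "S", "M", "L"]
order_totals = [20, 35, 20, 15, 50, 40]  # dollars per order

# Counting things: how many of each size were sold?
size_counts = Counter(shirt_sizes_sold)
print(size_counts.most_common())

# Mean, median, and mode of the order totals.
print(f"mean: {mean(order_totals)}, median: {median(order_totals)}, mode: {mode(order_totals)}")
```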

______________________ include mean, median, max, and min.

Descriptive methods

_____________________ are best answered by looking at historical data. They are usually displayed in dashboards and things of that sort, and are used to inform about trends and observations that have happened in the past.

Descriptive questions

What are the 3 types of analysis?

Descriptive, Predictive, Prescriptive

__________________ explores how data points relate to each other, while seeing if a data point differs from the mean.

Deviation

Which phase? - Working with stakeholders to help them ask better questions so that both they and you understand the outcome.

Discovery

________________ is a type of visualization that shows data distribution, often surrounding a central value. There are a lot of methods, but in short, it's simply measuring frequency or the rate that something occurs.

Distribution

____________________ is someone who develops and architects; focuses on the hardware and software that enable data science

Engineer

They often need all of the skills, including the business acumen, to make the business run well. They also need some great creativity in planning their projects and in the execution that gets them towards their entrepreneurial goals.

Entrepreneurs

What does GDPR stand for?

European Union's General Data Protection Regulation

___________________: Here your goal is to find the underlying common factor that gives rise to multiple indicators.

Factor analysis

A European Union law under which individuals must give informed consent and be able to request or delete the data you collect about them.

GDPR

__________'s purpose is to create professional-looking plots quickly with minimal code. It doesn't pride itself by being the most robust with options and all the bells and whistles, but its ability to show you what you need, when you need it, with ease is its greatest feature.

GGplot

_____________ is basically just a bar chart showing all the tasks in your plan drawn along a timeline. They are great because everything's drawn out to scale.

Gantt chart

________________ is a Python toolbox for geographic visualizations. Its prime usage is for maps; therefore, it is quite powerful.

Geoplotlib

_________________ show the rate of your data, from high density to low density. They can range in color using opacity and/or multiple colors.

Heat maps

_______________ shows the distribution of scores in a quantitative variable. That's also sometimes called a __________________.

Histogram, continuous variable

________________ decision making - Many algorithmic decisions are made automatically, and even implemented automatically. But they're designed such that humans can at least understand what happened in them. Such as, for instance, with an online mortgage application.

Human-Accessible

__________________________ decision making is where advanced algorithms can make and even implement their own decisions, as with self-driving cars.

Human-in-the-Loop

________________________ is our legal analysis tool to understand how to move from identifying a legal issue to reaching a conclusion and a decision about how to take action

IRAC

_______________________ are super fun because they can communicate your story effectively, as well as artistically. I love using _____________________________ for this.

Infographics, Adobe Illustrator

What can be identified using a box plot?

Interquartile range
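The interquartile range is the distance between the first and third quartiles, which is the box in a box plot. A minimal sketch with invented values follows; the 1.5 × IQR cutoff for flagging outliers is a common box plot convention rather than something stated on the card, and quartile methods differ slightly between tools.

```python
from statistics import quantiles

# Invented daily sales figures, with one unusually large day.
daily_sales = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 40]

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = quantiles(daily_sales, n=4)
iqr = q3 - q1

# Common convention: points beyond 1.5 * IQR from the box are drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [value for value in daily_sales if value < lower_fence or value > upper_fence]

print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}, IQR = {iqr}")
print(f"outliers: {outliers}")
```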

It is the traditional method for analyzing legal problems that arise in any context. The legal analysis framework we will be using in this course has four parts: ______________, ______________, ______________, ______________.

Issue, Rule, Application and Conclusion. (or IRAC)

____________________ is a common and powerful technique for combining many variables in an equation to predict a single outcome, the same way that many different streams can all combine into a single river.

Linear regression
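The card describes combining many variables; the sketch below shows only the simplest case, one predictor fitted by ordinary least squares, to make the idea concrete. The data are invented and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how much y changes, on average, per unit change in x.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented data: weeks of advertising versus units sold.
weeks = [1, 2, 3, 4, 5]
units_sold = [3, 5, 7, 9, 11]

slope, intercept = fit_line(weeks, units_sold)
predicted_week_6 = intercept + slope * 6
print(f"units = {intercept:.1f} + {slope:.1f} * weeks; week 6 forecast: {predicted_week_6:.1f}")
```

With several predictors the same idea applies, but the fit is normally done with a statistics library rather than by hand.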

____________________ is someone who is the pro in CS and math; deep learning, artificial intelligence

Machine Learning Specialist

These are people who have extensive work in computer science and in mathematics. They work in deep learning. They work in artificial intelligence. And they're the ones who have the intimate understanding of the algorithms and understand exactly how they're working with the data to produce the results that you're looking for.

Machine learning specialists

______________________ decision making is when machines are talking to other machines. And the best example of this is the internet of things. And that can include things like Wearables. My smart watch talks to my phone, which talks to the internet, which talks to my car in sharing and processing data at each point.

Machine-Centric

They don't necessarily need to know how to do a neural network, they don't need to make the data visualization, but they need to speak data so they can understand how the data relates to the question they're trying to answer, and they can help take the information that the other people are getting and putting it together into a cohesive whole.

Managers

They need to frame the business-relevant questions and solutions. Then, they need to keep people on track and moving towards it.

Managers

This is where you actually create the statistical model and you do the linear regression. You do the decision tree. You do the deep learning neural network.

Modeling

These are tools that help us visualize and analyze.

Modeling and diagramming tools

____________________ examines the distances between each point and the closest point to it, and then compares these to expected values for a random sample of points from a CSR (complete spatial randomness) pattern.

Nearest Neighbor
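
The comparison described above can be sketched as follows. The points and the 10 × 10 study area are hypothetical; the expected value uses the standard result that, under complete spatial randomness, the mean nearest-neighbor distance is 1 / (2 × sqrt(density)). Edge corrections are ignored in this sketch.

```python
import math


def mean_nearest_neighbor_distance(points):
    """Average distance from each point to its closest other point."""
    total = 0.0
    for i, p in enumerate(points):
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total / len(points)


def expected_csr_distance(n_points, area):
    """Expected mean nearest-neighbor distance under CSR (no edge correction)."""
    density = n_points / area
    return 1 / (2 * math.sqrt(density))


# Hypothetical points, tightly grouped inside a 10 x 10 study area.
points = [(1, 1), (1, 2), (2, 1), (2, 2)]

observed = mean_nearest_neighbor_distance(points)
expected = expected_csr_distance(len(points), area=100)

# A ratio below 1 suggests clustering; above 1 suggests dispersion.
ratio = observed / expected
print("observed:", observed, "expected:", expected, "ratio:", ratio)
```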

______________________ which is about our responsibility to be careful in everything we do from driving a car to hiring employees to managing data is sourced in ___________________.

Negligence law, common law

________________________ are sets of algorithms intended to recognize patterns and interpret data through clustering or labeling

Neural networks

________________ simply means name: red, green, blue, cars, trucks, boats.

Nominal

__________________ is simply comparing values of categories or subcategories. We have hot dogs, we have hamburgers, we have grilled chicken. We can easily compare these values, easy. There's 40 hot dogs, 25 hamburgers and 5 pieces of grilled chicken.

Nominal comparison

Data is made up of a set of ______________, the individual units being measured.

Observations

________________ simply put is categories with an order: small, medium, large, first, second, third, good, better, best.

Ordinal

Built on top of Matplotlib are some wrappers called _____________ and _____________. _______________ is great for data structures and as a data analysis tool, which is used a lot with other libraries as well. _______________, well, it makes things sexy. It's awesome because it has built-in themes and color palettes, which can make your life really easy down the road.

Pandas, Seaborn, Pandas, Seaborn

____________________________ is simply how a smaller subset compares within that larger subset. One of the easiest ways to see this is with a quick pie chart or a stacked bar chart.

Part-to-whole

________________ are the best when showing part to whole comparison.

Pie charts

12-step process for projects

Planning 1. Define what project is 2. List all tasks 3. Get tasks into order 4. Add a safety margin 5. What you can do if the total time runs out 6. Gantt chart 7. Look at resources. 8. Think about what might go wrong 9. Monitor the progress during your project 10. Monitoring the cost 11. Readjust plan if necessary 12. Project review

What are the 4 parts of data analytics cycle?

Planning, Wrangling, Modeling and Applying

________________ flow has users reentering the same information multiple times or not having the information present when they need it.

Poor data

________ analytics answers what might happen in the future

Predictive

__________________ enables an analyst to move beyond describing the data to creating models that enable predicting outcomes of interest.

Predictive Modeling

During which phase in the data analytics life cycle would a churn analysis be performed?

Predictive analysis

____________________ attempts to determine which future events are the most likely.

Predictive analytics

____________________ makes predictions about future state of business. Forecasting volumes for example. Based on last summer and winter, what will we sell next year?

Predictive analytics

Exploring the data could be seen either in "________________" or "_____________"

Prepare the data, Create a model

___________ analytics attempts to answer the toughest question of all, what should we do going forward?

Prescriptive

______________ analytics is about causation

Prescriptive

_______________________ analysis with an end goal of making a recommendation. What colors and sizes of shirts should we sell to maximize profits?

Prescriptive analytics

__________________ is responsible for making sure things get done on time and within budget and removes roadblocks.

Project manager

____________________ will work with marketing to put the most profitable advertising on the articles shared by the readers. They are a key part of making sure that the organization gets value from the team's discoveries.

Project manager

______________________ makes sure things get done on time and within budget; removes roadblocks

Project manager

____________________ are very good at protecting the data science team and keeping them from getting off track. They can do this by representing the team at meetings.

Project managers

_______________________ are responsible for trying to convince the department to give the data science team access to data. They also work to distribute any results. They'll go to the meetings and present the team's results.

Project managers

_______________________ champions the vision of the project; has the authority to allocate resources

Project sponsor

___________ creates charts that are interactive SVGs with very minimal lines of code.

Pygal

___________ manages memory and large data sets better in its default setup but neither of those is fixed

Python

____________ is one of the easiest languages to learn out there. It's super easy to read, clean, parse, and transform your data. Lastly, you can quickly analyze and visualize your data with many of its libraries.

Python

_______________ is a multipurpose programming language that has libraries that extend its capabilities to do statistical analysis.

Python

Tools such as _______________ play an important role in automating the training and using of models.

Python and R

____________________ are two of the most popular languages for exploring and displaying your data.

Python and R

____________________ are programming languages that are very frequently used for data manipulation and modeling.

Python or R

Your ______________ plays the part of both adversary and assistant. They develop the test cases that push the software to its limits, and at the same time, provide the test data and other context that is needed for developers to perform their daily tasks.

QA team

________________ data includes things such as: summaries of written comments on customer cards collected from suggestion boxes at stores, results from interviews of store managers by an outside consultant, a paragraph taken from an employee's self-evaluation on a performance review.

Qualitative

________________ data is information that is gathered in non-numerical form that is typically ___________ and may be recoded to try and quantify its meaning.

Qualitative, descriptive

_________________ data is data that can be measured in numerical form.

Quantitative

_____________________ is data that can be quantified, verified, and measured. All the values are numerical. There are two categories that come from it. First, there's __________________. This data is based on counts. Then there's ___________________. This data simply falls onto a continuum. When on a graph, we will not only see the points, but the connections in between. Examples, things that can be measured: Time, weight, height, and so on.

Quantitative data, discrete data, continuous data

_______________ will be the most important drivers for your team's insights. The key part of science, in data science, is _____________________.

Questions, finding the right question.

______________ works natively with vectorized operations and non-standard evaluation

R

____________________ is a programming language that is specific to statistics. It also has capabilities to visualize data.

R

We have open-source programming languages like _________________ that make more rigorous data analysis inexpensive and relatively easy as well.

R and Python

The gold standard for establishing cause and effect is what's called a(n) _________________________________.

Randomized controlled trial (RCT)

_______________________ is where the algorithm processes your data and makes a recommendation, or suggestion to you and you can either take it or leave it.

Recommendations

_______________________ is a very common term. It means an algorithm that is designed to reach a particular outcome like for instance running through the levels of a game.

Reinforcement learning

In this phase, an analyst tells the story of the data and uses graphs or interactive dashboards to inform others of the findings from the analyses.

Reporting and Visualization

Two tools which help to promote effective communication during the testing phase of software development. 1. Creating an issue template for _______________. 2. A _____________. This is where you and your team have an agreed upon set of categories for where bugs fall.

Reporting bugs, bug priority matrix

A ___________________ has three main areas of responsibility. They find assumptions, drive questions, and know the business.

Research Lead

Why are we doing things this way? Does this make sense? Those are the questions you almost never hear inside organizations. One of the best names for this role is ______________________.

Research Lead

_______________ is someone who is the subject matter expert, high statistical expertise

Researcher

__________________ is a way of accessing data stored in databases, usually relational databases, where you select the data, specify the criteria you want, and combine and reformat it in the ways that work best.

SQL

____________________ is a language for working with relational databases to do queries and data manipulation.

SQL
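
A small sketch of selecting data by criteria and combining it, using Python's built-in `sqlite3` module with an in-memory database. The `sales` table, its columns, and the donut figures are hypothetical.

```python
import sqlite3

# In-memory relational database; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (month TEXT, product TEXT, units INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("Jan", "glazed", 120),
        ("Jan", "jelly", 80),
        ("Feb", "glazed", 150),
        ("Feb", "jelly", 60),
    ],
)

# Select the data, specify the criteria (WHERE), then combine and reorder it.
rows = conn.execute(
    "SELECT month, SUM(units) AS total_units FROM sales "
    "WHERE product = ? GROUP BY month ORDER BY total_units DESC",
    ("glazed",),
).fetchall()
conn.close()

print(rows)
```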

What's fun about D3 is that it uses HTML and CSS standards to create web-based data visualizations in the web browser, and its capability to create _________________.

SVG

_____________________ are a great shotgun spread of data. They work great with the more data you collect. Think of large data sets here, and it's better when you're comparing two data points or two variables, one on the x and one on the y-axis.

Scatter plots

___________ is when new requirements are added to the project that increases the time/resources needed to complete it.

Scope creep

___________________ by contrast is law created by representative bodies such as the U.S. Congress.

Statutory law

Interactive dashboards tools, such as _____________, allow even the novice user the ability to interact with the data and spot trends and patterns.

Tableau

______________________ are platforms that specialize in visualization. This is where you can make graphs and charts for presentations and data storytelling to executive leaders.

Tableau and Power BI

apps for visualization like ______________, both the desktop and the public and server version, and __________. What these do is they facilitate data integration, that's one of their great things. They bring in data from lots of different sources and formats, and put it together in a pretty seamless way. Their purpose is interactive data exploration.

Tableau, Qlik

_______________________ are instant messaging platforms that facilitate communication in a faster, but less formal, way than email.

Teams, Slack

______________ makes it easy to build deep learning neural networks, and you can use it in Python or in R.

TensorFlow

___________ is someone who does everything. Very rare and very expensive; dangerous to rely on one person

The "Unicorn"

__________________________ is often used as a way to put standard deviations into perspective.

The 68, 95, 99.7 rule
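
As a quick check of the numbers behind this rule, the sketch below computes the share of a normal distribution that lies within one, two, and three standard deviations of the mean. It assumes normally distributed data; the helper name `share_within` is hypothetical.

```python
import math


def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


# Shares for 1, 2, and 3 standard deviations, rounded to four decimals.
shares = {k: round(share_within(k), 4) for k in (1, 2, 3)}
print(shares)  # {1: 0.6827, 2: 0.9545, 3: 0.9973}
```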

Congress passed _______________________________ to regulate how employers and insurance companies can, if at all, access our genetic information for purposes of making decisions about whether we're employable or whether we're insurable.

The Genetic Information Nondiscrimination Act, otherwise known as GINA

____________________ is the person who champions the vision of the project and has the authority to allocate resources.

The project sponsor

___________________ is tracking a data metric over time. In this example, we will see sales per quarter.

Time series

______________________ is just looking at any variable over time

Time series analysis

They focus on domain-specific research like, for instance, physics and genetics are common, so is astrophysics, so is medicine, so is psychology, and these kinds of researchers, while they connect with data science, they are usually better versed in the design of research within their particular field and doing common statistical analyses, that's where their expertise lies, but they connect with data science in that they're trying to find the answers to some of these big-picture questions that data scientists can also contribute to.

Topical researchers

True or False: you can create a "unicorn" by building a team whose members together have all the necessary skills.

True

True or false: Data science can be done without machine learning.

True

Most __________________________ is concerned with advancing a field of study, and there may not be much of a discussion of budget or budgetary constraints in the research.

academic research

Once you've determined what data is at your disposal, then you __________________________________________.

ask actionable and detailed questions

The fact that the _________ is not close to the ________ or the __________ tells us the distribution of scores is skewed. The scores are not evenly distributed around the ________.

average, median, mode, mean
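
To see this effect on a small scale, the sketch below compares the three measures for a hypothetical set of scores in which one low value pulls the average away from the median and the mode.

```python
import statistics

# Hypothetical test scores with one unusually low value.
scores = [40, 85, 88, 90, 90, 92, 95]

mean = statistics.mean(scores)
median = statistics.median(scores)
mode = statistics.mode(scores)

# The low score drags the mean below the median and mode,
# which signals a skewed distribution.
print("mean:", round(mean, 2), "median:", median, "mode:", mode)
```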

You can show ______________ in multiple ways. Vertical horizontal, and stacked. They are great for comparing values, showing a composition, showing distribution, and looking at trends.

bar charts

You can get the analytics and see how well is this performing, who's watching it and when. That's a ______________________________ of a form.

business intelligence dashboard

It's the communication between the ____________________ and the _________________ that helps you to establish rapport with your stakeholders and will make it much easier for them to feel connected to the project and get you the answers you need to be successful.

business stakeholders, project team

It all comes down to decomposing bigger things into smaller things, like story slicing, and functional decomposition techniques. Sometimes this is called _____________.

chunking

When you're planning your data project, try to _____________________ that will allow you to leverage the data you currently have.

construct questions

the output of research and business environment must not only drive value but be simple enough and easy enough to __________________________.

consume by the end users

If the total time for your project runs too long to be acceptable, you shorten the schedule, which is known as _______________________________.

crashing your project plan

"Collect the data" is synonymous with ____________________

data acquisition

A _____________________ has three main areas of responsibility. They prepare the data, select the tools and present the results.

data analyst

A ____________________ is responsible for obtaining and scrubbing the data. Then, they'll display the data in simple reports. They should work with the __________________ to see if anything jumps out of the reports. They'll also recommend statistical methods or create data visualizations.

data analyst, research lead

The _____________________ create the reports and develop the applications that find the topic.

data analysts

The goal is to find ways to best reveal insights that aid in answering the questions or objectives that are posed in the research plan. Typically, these questions fall within three different buckets. Those are: ______________________, __________________________, __________________________

descriptive question, a predictive question or prescriptive question.

The ______________ data would say, I have 8 dogs. The _________________ data would be, their weight ranges from 35 to 55 pounds.

discrete, continuous

______________________, which is the actual things that you end up with; _______________________, how the decisions are made; _______________________, how the decision is communicated.

distributive justice, procedural justice, interactional justice

we can convey key performance indicators of our business to ____________, ______________, _____________ using dashboards.

executives, management, and employees

In _____________________, this hidden factor comes first and gives rise to the individual variables.

factor analysis

two of the most important things you can do in business intelligence are _________________, to predict what's likely to happen next, and to ___________________.

find trends, flag anomalies

If we have a large standard deviation, that means that a great deal of data lies ___________ away from the ________.

further, mean

The goal is to create a "unicorn" by ______________________

having the right team

You can do ___________________ or even do neural networks as a way of finding how well the data fits these known patterns, and if it doesn't, you may have an anomaly.

hierarchical clustering

The standard deviation measures:____________________________________________________________________________________________

how close the overall sample is to the mean.

we can convey complex information about our business to a wider audience using _____________ that allow users to rapidly consume and digest data

infographics

The __________________ is king when looking at continuous changes over time.

line chart

Another method that's frequently used in data science is what's called a ____________________. This is a whole series, a sequence of binary decisions, based on your data, that can combine to predict an outcome.

decision tree
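
The sequence of binary decisions described above, commonly called a decision tree, can be sketched as nested yes/no checks. The feature names and thresholds below are hypothetical and hand-written; in practice they would be learned from data.

```python
def predict_churn(customer):
    """Walk a tiny hand-written tree of binary decisions to reach a prediction."""
    # First binary decision: is this a new customer?
    if customer["months_as_customer"] < 6:
        # Second binary decision: have they had many support problems?
        if customer["support_tickets"] > 2:
            return "likely to churn"
        return "at risk"
    # Second binary decision on the other branch: is their spending low?
    if customer["monthly_spend"] < 20:
        return "at risk"
    return "likely to stay"


new_unhappy = {"months_as_customer": 3, "support_tickets": 4, "monthly_spend": 50}
long_loyal = {"months_as_customer": 24, "support_tickets": 0, "monthly_spend": 45}

print(predict_churn(new_unhappy))
print(predict_churn(long_loyal))
```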

The _______ and the _________ inform a user of the central points and skew of the data.

mean, median

The ____________ is the middle number in a series that is arranged from smallest to largest.

median

The ________________ is the most commonly occurring number in the dataset.

mode

If I want to know the rule of _________________, I will research and review court decisions.

negligence

If the mean and the median are fairly close to each other, we likely have some type of ________________________

normal distribution

These tools help us communicate with one another, as well as store and retrieve information. They're tools like word processors, spreadsheets, file systems and the like.

office productivity tools

Data scientists are able to find ______, _________, and _____ in unstructured data.

order, meaning, and value

Electronic Communications Privacy Act, or the ECPA. This law was passed by Congress in 1986, during the age of ____________________.

police wiretapping

The _____________________ are those questions where we're seeking some sort of prediction or forecast.

predictive questions

You can do a very good _______________ without needing everything that goes into data science.

prescriptive analysis

In ___________________________ , the variables come first and the component results from it.

principal component analysis

In ___________________________________, the idea is that you take your multiple correlated variables and combine them into a single component score.

principal component analysis
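
A minimal sketch of combining two correlated variables into one component score, assuming just two variables so the covariance matrix's leading eigenvector can be found in closed form without any libraries. The function name and the data are hypothetical; real analyses typically use a numerical library and more variables.

```python
import math


def first_principal_component(xs, ys):
    """Return (component scores, share of variance explained) for two variables."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    # Sample covariance matrix [[a, b], [b, c]].
    a = sum(d * d for d in dx) / (n - 1)
    c = sum(d * d for d in dy) / (n - 1)
    b = sum(p * q for p, q in zip(dx, dy)) / (n - 1)
    # Largest eigenvalue of a symmetric 2x2 matrix (closed form).
    largest = (a + c) / 2 + math.hypot((a - c) / 2, b)
    # Matching eigenvector (b, largest - a), normalized; assumes b != 0.
    norm = math.hypot(b, largest - a)
    wx, wy = b / norm, (largest - a) / norm
    # Each observation's single component score is its projection on that direction.
    scores = [wx * p + wy * q for p, q in zip(dx, dy)]
    explained = largest / (a + c)
    return scores, explained


# Hypothetical, strongly correlated measurements.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

scores, explained = first_principal_component(xs, ys)
print("component scores:", [round(s, 2) for s in scores])
print("share of variance explained:", round(explained, 4))
```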

Privacy is sourced in common law in what we call _______________. It's sourced in statutory law such as GINA and other legislation that protects financial, health and student privacy. And it is sourced in the U.S. Constitution's amendments.

privacy torts

On a data science team, a __________________ is responsible for three main areas. They democratize the data, share the results, and enforce organizational learning.

project manager

The _____________________ takes new insights and makes it actionable. They may communicate the results to management to improve the organization.

project manager

The _________________________ works to get the data out into the organization. They also clear the way when there are organizational hurdles.

project manager

The final goal for ___________________ is to enforce learning.

project manager

Your ______________ is a key stakeholder who provides resources for your project and advocates for the project from initiation to closure.

project sponsor

Typically, your key stakeholders would be your _______________, _________________, __________________ and ___________________who will provide team members to perform project work. Other stakeholders might include ____________________, _______________ and _________________.

project sponsor, project team, subject matter experts and functional managers; consultants, regulators and community groups.

You need _________________________ who are highly invested in the project and have considerable stakes in the project's success.

project stakeholders

___________________ are a whole host of research designs that let you use correlational data to try to estimate the size of the causal relationship between the two variables

quasi-experiments

The ______________________ knows the most about the business. They might know enough about the readers to guess at certain topics. They also might be the best resource for coming up with keywords.

research lead

The team should start with their _________________. This is someone from the business side who pushes the team to ask interesting questions. They should start by coming up with questions or identifying key problems. They can put them on a question wall or organize them onto sticky notes.

research lead

The __________________ and ________________ work hand-in-hand by building key insights. The _______________ focuses on the best question to ask. The ________________ will try to provide the best reports and visualizations.

research lead, data analyst, research lead, data analyst

If a person feels that they have been harmed by a decision made by a neural network, such as it refused a loan application, they can sue the organization.

right to explanation

Avoid ________________ which is when new requirements are added to the project that increases the time/resources needed to complete it

scope creep

A _______________________________ means that the data are close together around the mean.

small standard deviation
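
To make the contrast concrete, the sketch below compares two hypothetical data sets that share the same mean but differ in spread, using the population standard deviation from Python's `statistics` module.

```python
import statistics

# Two hypothetical data sets with the same mean (50) but different spread.
tight = [48, 49, 50, 51, 52]
wide = [10, 30, 50, 70, 90]

tight_sd = statistics.pstdev(tight)
wide_sd = statistics.pstdev(wide)

# A small standard deviation means values sit close to the mean;
# a large one means they lie further away from it.
print("tight:", round(tight_sd, 2), "wide:", round(wide_sd, 2))
```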

You can use ______________________ for scraping consistently structured data or you can use packages and programming languages like _____________________.

specialized apps, Python and R

The ____________________ are the ones who have the ultimate control over the success for your project.

stakeholders

Your questions must be aligned with ______________________, which vary depending on your target audience.

strategic goals

What does SQL stand for?

structured query language

Data sources usually occur within some combination of four different types. The types are ____________________________________________________________________________________________________________

structured, unstructured, internal and external

If you want to make your data into something that you can see, there is no better place to start than looking at tooling, like ___________

tableau

Who attends demo day?

the entire agile team, product owners and essential stakeholders, should all be in attendance

Keenly observing standard deviations for a dataset can give us insight on __________________ and ____________________________________________________________.

the spread of the data, an idea of how we should interpret the statistics

This is a full-stack data scientist who can do it all, and do it at absolute peak performance.

unicorn, also known as the rock star, or the ninja.

Once you get some visualizations, you can look for one number that might be able to represent the entire collection. That's a ________________. The most common of these is going to be the mode.

univariate descriptive.

Frame the question as what question would I like to have an answer to, not _____________________________________?

what is the answer I'm looking for

