IBM + Data Analytics

Machine Learning

is a branch of artificial intelligence (AI) and computer science that focuses on using data and algorithms to imitate the way that humans learn, gradually improving its accuracy. Machine-learning models understand only numbers, not text or images, so data scientists might need to transform unstructured data into a "0" or "1".
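
Machine-learning models need numbers, so a first step is often encoding text labels as 0s and 1s. A minimal Python sketch of that idea (the labels below are invented for illustration):

    # Made-up text labels that a model cannot use directly.
    labels = ["spam", "not spam", "spam", "not spam"]

    # Encode each label as 1 or 0 so a model can consume it.
    encoded = [1 if label == "spam" else 0 for label in labels]
    print(encoded)  # [1, 0, 1, 0]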

Methodology

is a general strategy that guides activities within a process. A methodology doesn't depend on technologies or tools, and it's not a set of techniques or recipes. Rather, a methodology provides data scientists with a framework for how to proceed with whatever methods and processes they will use to obtain answers or results.

Design Thinking

is a problem-solving methodology that focuses on the user, having empathy for the user, and determining the best user experience. The data science project team examines the business problem. They can do this using design thinking.

Time

is a series of data points that are listed or sequenced in time order, such as the daily times of high tide and low tide at a beach. Example: Column, Line

Science

is a system or method reconciling practical ends with scientific laws.

Database

is an organized collection of structured data in a computer system. Think of a database as a container or repository with many columns and rows. A database makes data highly organized, so the data is easily accessible by queries and computation software.

Data

is raw information. Data might be facts, statistics, opinions, or any kind of content that is recorded in some format. This could include voices, photos, names, and even dance moves!

Big Data

refers to the fact that everything we do is increasingly leaving a digital trace (or data) which we (and others) can use and analyze to become smarter. The driving forces in this brave new world are access to ever-increasing volumes of data and our ever-increasing technological capability to mine that data for commercial insights.

Relative Portion

is the amount or quantity of a subset present in the population of all data points. Example: Pie, Column

Data Visualization

is the culmination of a data science team's efforts to view the insights that their data transformation efforts have produced. Visualizations help data scientists test their hypotheses and check assumptions. The data needs to tell a story and address the business problem or question that a project is trying to answer!

Frequency

is the number of times a certain event occurs. Example: Column, Line

Data Analysis

is the process of collecting, cleaning, and transforming data to obtain insights to help make better and informed decisions

Data Storytelling

is the process of converting data analyses into a simple, understandable story to influence a business decision. With the rise of digital business and data-driven decision making, data storytelling is an important skill. The idea is to "connect the dots" between the results and decision makers, who must be able to interpret the data. Data storytelling involves a combination of data, visualizations, and narrative. When narrative is coupled with data, it explains to the audience what is happening in the data and why an insight is important. When visualizations are applied to data, they enlighten an audience with insights that they might not obtain without charts or graphs. Patterns and trends emerge from all the rows and columns in a database, with the help of data visualizations. When narrative and visualizations come together, they can create a data story that can influence, drive change, and engage an audience.

Ranking

is the relationship between a set of items, in which one item is ranked higher, lower, or the same compared to a second item. Example: Bar

Correlation

is the relationship between two random variables, which are typically related in a linear way. Example: Bar, Scatter
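
As a rough illustration, the correlation between two variables can be computed with Python's built-in statistics module (requires Python 3.10+; the sample values are made up):

    from statistics import correlation  # Python 3.10+

    # Invented data: hours studied vs. exam score.
    hours = [1, 2, 3, 4, 5]
    score = [52, 58, 65, 71, 80]

    # Pearson correlation coefficient, between -1 (negative) and 1 (positive).
    print(round(correlation(hours, score), 3))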

Descriptive analytics

is the simplest and most common type of data analytics. Descriptive analytics answers the question, "What is happening?" It provides a snapshot of business trends and patterns and uses historical and current data. Descriptive analytics manipulates raw data from multiple sources to give a data analyst valuable insights into the past and a view of key metrics within a business. Examples of data used for descriptive analytics:
The number of subscribers for a service
Customer satisfaction survey data
Monthly revenue reports
Daily stock reports
Demographic data about a business' customer population, for instance, data that indicates that 30% of customers are self-employed

Collect

it is important to collect the right data from existing and new sources that is needed to answer the business question and securely store it according to proper business practices.

Clean

it takes data analysts a lot of time (approximately 70-80% of their time) to detect and correct missing or inaccurate records in a data set.
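
A hedged sketch of what this cleaning step can look like, assuming the pandas library (not named in the course) and an invented data set:

    import pandas as pd

    # Invented records with a duplicate row and missing values.
    df = pd.DataFrame({
        "customer": ["Ana", "Ben", "Ana", None],
        "monthly_spend": [120.0, None, 120.0, 85.0],
    })

    df = df.drop_duplicates()               # remove exact duplicate rows
    df = df.dropna(subset=["customer"])     # drop rows with no customer name
    df["monthly_spend"] = df["monthly_spend"].fillna(df["monthly_spend"].mean())
    print(df)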

Visualize

it's time to make conclusions and pick the right data visualization to graphically show people the data results and insights to conclude actions to take.

Nominal Data

labels variables without any quantitative (or numerical) value. Nominal data can be grouped into categories, but it doesn't have a meaningful order or hierarchy. For example, types of nominal data include:
Hair color, such as blonde, black, or brunette
Religion, such as Christian or Buddhist
Marital status, such as single, married, or widowed
Ethnicity, such as Asian or Hispanic

Dichotomous Data

is data in which each cell has only two possible values, such as Yes or No.

The final data presentation must be:

meaningful, compelling, and, most importantly, easy to interpret for the business sponsor. The team must consider the:
Purpose: What problem are you trying to address and why will data visualization help to solve it?
Audience: Who is viewing the presentation and how can it be valuable to them?
Data: Is the data represented in the best way and will the visualization need to be updated in the future?
Context: Where will the visualization reside (for example, in software, on a website, or in a business report)?

Diagnostic Analytics

"What is happening?", the next step is to dive deeper and ask "why?" Diagnostic analytics takes the insights found from descriptive analytics and drills down to find the causes of specific problems. Businesses use of diagnostic analytics because it creates more connections between data and identifies patterns of behavior. Examples of diagnostic analytics: A freight company investigates the cause of slow shipments in a certain region. A healthcare company examines diagnoses and prescribed medications to identify the influence of medications. An IT company analyzes server ticket data to identify a small number of servers causing the bulk of an organization's service outages.

Data Analyst & Data Scientist

1. Analytical Thinking
2. Critical Thinking
3. Data Storytelling
4. Programming Languages: Python

CRISP-DM consists of six phases:

1. Business understanding: This phase focuses on understanding the project objectives and requirements from a business perspective, and defining the data problem to solve.
2. Data understanding
3. Data preparation
4. Modeling
5. Evaluation
6. Deployment
CRISP-DM is iterative, meaning that the phases can be repeated to incrementally improve the result. The results of some stages might require the project cycle to go back to earlier stages.

SEMMA stands for its five steps:

1. Sample
2. Explore
3. Modify
4. Model
5. Assess
SEMMA is a data science methodology that helps convert data into knowledge. SEMMA can help solve a range of business problems, such as fraud identification, customer retention and turnover, database marketing, customer loyalty, market segmentation, and risk analysis. SEMMA is also an iterative process, in which answering one set of questions often leads to more interesting and more specific questions.

What is a model?

A data model identifies the data, data attributes, and relationships or associations with other data. A data model provides a generalized view of data that represents the real business scenario and data.

Why build a model?

A data scientist can develop a more systematic approach to address an identified business problem by building a model. The main goal of building a model is to make better predictions for the business and gain a better understanding of the system being modeled.

Fill in the blank. The first phase of the ______ methodology is to understand the project objective from a business perspective and define the data problem to solve.

CRISP-DM

Which data science methodology is set apart by the ability to be iterative?

CRISP-DM, KDD, and SEMMA

Which of the following statements is correct about CRISP-DM, KDD, and SEMMA?

CRISP-DM, KDD, and SEMMA use data mining methods that are best suited for structured data.

What is the purpose of a data visualization? Select all that apply.

Explore and interpret data during analysis to identify patterns or trends. Communicate results and help people understand the insights to make decisions.

Data insights

Improve operations
Better understand end users or customers
Drive efficiency
Reduce costs
Increase profits
Find new innovations

Qualitative Data

Is also called categorical data. Represents the characteristics, attributes, properties, and qualities of things. Describes data using language (rather than numbers), such as smell, location, color, texture, marital status, and so on. It is something you can categorize rather than count. Qualitative data can be nominal or ordinal.

Quantitative Data

Is also called numerical data. Represents things that can be measured and assigned values. Can be counted and measured, such as height, weight, length, blood pressure, the temperature outside, and so on. Something you can count (or measure). Quantitative data can be discrete or continuous.

Which type of analytics makes forecasts about the future?

Predictive

Which type of analytics recommends actions to take to eliminate a future problem or take advantage of a promising trend?

Prescriptive

Which of the following areas are a part of data science?

Programming
Math and statistics
Artificial intelligence (AI)

Common Charts for Visualizing

Quantitative: pie, bar, column, line, scatter
Conceptual: flow, structure, interrelationship, action plan, map

Data representation and transformation

Regardless of the format of the source data, a data scientist structures and organizes data in a format that supports the most efficient machine-learning model possible. Data scientists also use data transformation tools and languages, like Python, to conduct more in-depth analysis and create more complex visualizations.

SQL

SQL is the most common language for extracting, organizing, and managing data in a relational database to then perform various operations on the data.
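
A small, self-contained sketch using Python's built-in sqlite3 module (the table and column names are invented for illustration):

    import sqlite3

    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    conn.execute("CREATE TABLE customers (name TEXT, city TEXT)")
    conn.executemany("INSERT INTO customers VALUES (?, ?)",
                     [("Ana", "Lima"), ("Ben", "Quito")])

    # SQL query: extract and organize rows from the relational table.
    for row in conn.execute("SELECT name, city FROM customers ORDER BY name"):
        print(row)
    conn.close()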

Sources to Collect Data

Structured & Unstructured
1. Static files, such as spreadsheets
2. Databases
3. The Internet

Structured Data

Structured data is information that can be organized in rows and columns. If you can organize information within data into groups, based on specific characteristics, then those groups are structured data. Examples of structured data include names, dates, addresses, credit card numbers, stock information, and geolocation.

Which of the following statements about data visualizations is correct?

The goal is to have a visual that's effective, attractive, and impactive.

Data Exploration

The initial exploration of a data set is important because it helps data scientists find patterns and relationships and discover initial insights from the data. Here are a few questions a data scientist might think about during the initial exploration of data: Which data characteristics seem promising for further analysis? Has exploring revealed new characteristics about the data? Has exploring changed the initial hypothesis?

Considering the CRISP-DM, KDD, and SEMMA methodologies, what's one observation you can make about what they have in common?

The three methodologies are all iterative! This means that the phases or steps can be repeated. Knowledge acquired can be cycled back into the process to gain more or different insights.

Which of the following statements is correct about data analytics and data science?

They both work with data and share the same goal, which is to translate data analysis into business intelligence.

Data scientists must train a model. How?

They use machine learning. Essentially, machine learning is teaching a computer to solve problems. Machine learning allows a machine to learn from data without programming it with rules. The machine can learn from the data it's given. A machine learning algorithm "ingests" data so it can improve its accuracy. Data scientists use machine learning techniques, like unsupervised learning, during the train data models step.

Business Understanding

This methodology begins with a team determining and understanding the current business problem and the audience.

big data and its characteristics, called the 5 V's

Volume, Variety, Velocity, Veracity, and Value

Predictive Analytics

What is likely to happen in the future? Predictive analytics is about forecasting. This type of analytics uses historical data to make predictions about the future. Whether it's the likelihood of a future event, forecasting a quantifiable amount, or estimating a point in time at which something might happen, these are all done through predictive models. Examples of predictive analytics:
A software company uses customer segmentation to determine sales leads.
An automotive manufacturer forecasts the failure rate of a specific vehicle part to determine recommended service actions.
A weather forecaster analyzes current weather conditions in one part of the world to determine future weather conditions in other parts of the world.

Science, technology, and data

are areas that are linked and have given rise to data science. Keep this key concept in mind as you learn about the field of data science.
Science: linear algebra, statistics, probability
Technology: Business Intelligence, Predictive Analytics, Data Mining, Machine Learning, Big Data
Data: Structured (datasets, databases), Unstructured (videos, images, blogs, music)

Pie Charts

are circular, statistical graphics that are divided into slices to illustrate numerical proportions. The total sum of all "slices" must equal 100%. Pie charts are useful for showing the relative proportion for a small number of items. They can very easily show which category is the largest or has the most impact.

Relational Databases

are collections of multiple data sets or tables that link together. For example, one table might list names and addresses, while another might list properties and their owners. If some of the owners also appear in the name-and-address table, the two tables can be linked, creating a relational database.

Bar Charts

are useful for ranking a large number of categories, showing correlation, and performing before-after analysis. Use bar charts for comparison and ranking. Bar charts help illustrate change over time.

Continuous Data

can be divided into finer levels and take any value. (It's the opposite of discrete data.) Continuous data can be divided into many decimal places. It can be measured on a scale or continuum. There is an infinite number of possible values for continuous data. For example: The weight of a car can be calculated to many decimal places. Height, weight, and length are all forms of continuous data. Some continuous data can change over time, such as the speed of an airplane or the temperature in a room.

Line Charts

can display data over time and frequency; however, they typically display this data over a continuum. Line charts can track changes over short and long periods of time. Because of this, line charts are useful for indicating small changes. They are effective in showing trends.

Column Charts

can show relative proportion between items and show data over time and frequency. Column charts display information vertically. They are also useful for showing negative data.

Data analysts

collect and examine large data sets to identify trends, create forecasts and data visualizations, and tell a compelling story through actionable insights. These insights help businesses make informed decisions about business needs.
1. Data collection
2. Data Cleaning
3. Data Visualization
4. Use tools like: Excel & Power BI

Four steps to follow in the data analysis process

collect, clean, analyze, and visualize.

Data Science

combines the scientific method, math and statistics, specialized programming, advanced analytics, artificial intelligence (AI), and even storytelling to uncover and explain the business insights buried in data. Data science is a multidisciplinary approach to extracting actionable insights from the large and ever-increasing volumes of data collected and created by today's businesses.

4 types of data analytics

descriptive analytics, diagnostic analytics, predictive analytics, and prescriptive analytics.

Data scientists

design and create new processes for data modeling. They use algorithms, predictive analytics, and statistical analysis. Data scientists have technical skills to arrange unstructured data and build their own methodologies to make predictions based on data trends.
1. Predictive Analysis
2. Machine Learning
3. Model training & deployment
4. Use tools like: Apache Hadoop

Scatter Plot Charts

display plotted points to show a relationship between two sets of data. They display a large amount of data and make outliers stand out. Scatter plot charts can also help you identify patterns.

Exploratory Visualizations

help make complex data more accessible and revealing. Data scientists use initial visualizations, like charts, graphs, and maps, to uncover distributions, find patterns, and understand trends.

Discrete Data

includes integers or whole numbers that can't be divided, such as the number 1 or 9. For example, the number of rooms in a house or the number of people in a movie theater is discrete data because you can only count whole individuals. You can't count 1.7 people.

Types of Visualization:

pie chart: combined portions must equal 100%
bar chart: useful for ranking a large number of categories and for showing correlation or before-after analysis; bar charts show change over time, compare different categories, or compare parts of a whole
column charts: show relative proportion between items and data over time and frequency; useful for showing negative data
line charts: display data over time and frequency, typically over a continuum; they track changes over short and long periods of time and are useful for indicating small changes and trends
scatter plot: shows a relationship between two sets of data; scatter plots display a large amount of data, make outliers stand out, and reveal patterns
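
A quick sketch of three of these chart types, assuming the matplotlib library (not mentioned in the course) and made-up numbers:

    import matplotlib.pyplot as plt

    months = ["Jan", "Feb", "Mar", "Apr"]
    revenue = [10, 12, 9, 15]  # invented values

    fig, axes = plt.subplots(1, 3, figsize=(12, 3))
    axes[0].pie(revenue, labels=months)  # pie: relative proportion of a whole
    axes[1].bar(months, revenue)         # bar/column: comparison and ranking
    axes[2].plot(months, revenue)        # line: change over time
    plt.show()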

Business Sponsor

plays a critical role. Business sponsors are in a leadership position. They initiate the project because they have a "pain point", so they bring the business problem to the data science project team and then support the project.

Descriptive Statistics

quantitatively summarizes a data set. It can answer the question, "What is happening?" Data scientists can build a table to describe a large, complex data set and make quick observations about:
Number (N): What is the total number of observations?
Mean: What is the average of a set of two or more numbers?
Median: What is the middle number or "center" in a sorted list of numbers?
Mode: What is the most observed value in a data set?
Minimum: What is the minimum extreme of a data set?
Maximum: What is the maximum extreme of a data set?
Standard deviation: How spread out is the data in relation to the mean?
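
A minimal sketch of these summary measures using Python's built-in statistics module (the observations are made up):

    import statistics as st

    data = [4, 8, 6, 5, 3, 8, 9]  # invented observations

    print("N:", len(data))
    print("Mean:", st.mean(data))
    print("Median:", st.median(data))
    print("Mode:", st.mode(data))
    print("Min:", min(data), "Max:", max(data))
    print("Standard deviation:", round(st.stdev(data), 2))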

Unstructured Data

refers to "everything else". There is no predefined format. Unstructured data lacks any built-in organization, or structure. It's a conglomeration of varied types of data that are stored in their original formats. Examples of unstructured data include images, texts, social media posts, like tweets, customer comments, medical records, and even song lyrics.

Value

refers to the ability to turn data into value. The main reason people invest time in understanding big data is to derive value from it. Companies must make a case and have a clear understanding of the value they want to obtain from collecting and using big data. They must filter out the "noisy" data to find what they are looking for.

Variety

refers to the different types of data in use, whether structured or unstructured.

Velocity

refers to the incredible speed at which new data is generated and the speed at which it moves around. Companies need to know how fast data is moving, from the time a data sample is taken to the time it is used.

Ordinal

refers to the order of variables. Ordinal data is placed into an order by its position on a scale. For example: School letter grades (such as A, B, and C), sizes (such as small, medium, and large), and customer satisfaction survey levels (such as satisfied, neutral, and dissatisfied) are ordinal scales. In data collection and research, ordinal scales are commonly used to measure perceptions and opinions (such as, "How likely are you to recommend...").

Veracity

refers to the quality and trustworthiness of data. Companies need budget and methods to ensure that data is clean and trustworthy. This is an area of focus for data scientists.

Volume

refers to the vast amounts of data being generated. Example: exabytes, zettabytes, and yottabytes of data.

KDD

stands for Knowledge Discovery in Database. KDD represents the overall process of collecting data and methodically refining it. KDD typically has five steps:
1. Selection
2. Preprocessing
3. Transformation
4. Data Mining
5. Interpretation/Evaluation
The KDD methodology can help businesses stay current with customer needs and behaviors and predict future purchasing trends to stay competitive. But the process doesn't address many of the modern realities of data science projects, such as the setup of big data architecture, considerations of ethics, or the various roles in a data science team. KDD is iterative, meaning new data can be integrated and transformed to get different and more appropriate results. The knowledge acquired can be cycled back into the process, enhancing its effectiveness.

Data Exploration

this methodology includes a series of steps: gather, transform, and visualize data. The team uses descriptive analytics to answer "What is happening?" and diagnostic analytics to answer "Why is it happening?" The result is a proposed solution to a company's data problem.

Future Proof Solution Implementation

this methodology includes training and deploying data models and using artificial intelligence (AI) to predict or classify the insights gained to predict future problems. The team does this by using predictive analytics to answer "What is likely to happen in the future?" and prescriptive analytics to answer "What should happen?"

ETL

is a term used in computer-based work environments in relation to data, data warehousing, and analytics. ETL is an acronym for extract, transform, and load. ETL is a data integration process that combines data from multiple data sources into a single, consistent data store that is loaded into a data warehouse or other target system. ETL provides the foundation for data analytics and machine learning workstreams. Organizations often use ETL to:
Extract data from legacy systems
Cleanse the data to improve data quality and establish consistency
Load data into a target database
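
A toy end-to-end sketch of the three ETL steps using only Python's standard library; the file name, column layout, and target table are all invented for illustration:

    import csv
    import sqlite3

    # Extract: read raw rows from a hypothetical legacy export.
    with open("legacy_orders.csv", newline="") as f:
        rows = list(csv.DictReader(f))

    # Transform: cleanse and standardize the records.
    cleaned = [(r["customer"].strip().title(), float(r["amount"])) for r in rows]

    # Load: write the consistent records into a target database table.
    conn = sqlite3.connect("warehouse.db")
    conn.execute("CREATE TABLE IF NOT EXISTS orders (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", cleaned)
    conn.commit()
    conn.close()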

Analyze

it's the step where the data analyst uses problem-solving skills to analyze the data, identify trends, and find root causes in order to answer the business question.

Three Methods for Machine Learning

1. Supervised learning: In supervised learning, a machine ingests many questions and their answers—essentially a set of pre-structured information. The information might, for example, be drawings and photos of animals, some of which are dogs and are labeled "dog". The machine attempts to identify patterns so that when it sees a new photo of a dog and is asked, "What is this?", it can respond, "dog", with high accuracy. Supervised learning trains machines on data to build general rules that can be applied to future problems. The better the training set of data, the better the output.
2. Unsupervised learning: In unsupervised learning, a machine ingests an enormous amount of information, is asked a question, and is then allowed to determine how to answer the question by itself. For example, a machine might receive many photos and articles about dogs. The machine ingests and classifies the information within all of the photos and articles. When the machine is shown a new photo of a dog, it should be able to identify it as a dog, with reasonable accuracy. Unsupervised learning trains machines on a huge volume of unlabeled or unstructured data.
3. Reinforcement learning: Humans and machines can learn through reinforcement learning. Reinforcement learning is a feedback-based, machine-learning technique. Through reinforcement learning, a machine determines how to behave in an environment by performing actions and observing the results of those actions. For each "good" action, the machine receives positive feedback (a reward). For each "bad" action, the machine receives negative feedback (a penalty). As a result, the machine learns automatically, through its experience and feedback. Reinforcement learning is widely used in self-driving cars, drones, and other robotics applications.
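
A rough supervised-learning sketch, assuming the scikit-learn library (which the course does not name) and a toy labeled data set:

    from sklearn.tree import DecisionTreeClassifier

    # Toy labeled examples: [weight_kg, barks] -> "dog" or "cat".
    X = [[30, 1], [25, 1], [4, 0], [5, 0]]
    y = ["dog", "dog", "cat", "cat"]

    model = DecisionTreeClassifier()
    model.fit(X, y)                   # learn general rules from labeled data

    print(model.predict([[28, 1]]))   # a heavy animal that barks -> likely "dog"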

Data Representation

1. Understanding the data
2. Assessing data quality
3. Discovering initial insights about the data

CRISP-DM, KDD, and SEMMA

1. Use data mining methods
2. Are best suited for structured data
3. Are useful for using descriptive and predictive analytics
4. Share some common activities, such as data gathering, data transformation, data modeling, and model evaluation
Note: These methodologies are not useful on projects that work with unstructured data, such as images and text.

The goal is to have a visualization that's:

effective, attractive, and impactive

CSV File

A comma separated values (CSV) file allows data to be saved in a tabular format.
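
A brief sketch of reading such a file with Python's built-in csv module (the file name and columns are hypothetical):

    import csv

    # Each row of the hypothetical file becomes a dict keyed by the header row.
    with open("sales.csv", newline="") as f:
        for row in csv.DictReader(f):
            print(row)  # e.g. {'date': '2021-01-05', 'amount': '120.50'}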

Which type of visualization is best suited for ranking a large number of categories, showing correlation, and performing before-after analysis?

Bar

Visualizations

Examples: charts, graphs, and maps. Data scientists use visualizations for two reasons:
To explore and interpret data during analysis to identify patterns or trends
To communicate results and help people understand the insights to make decisions

3 classic and widely adopted data science methodologies:

Cross-Industry Standard Process for Data Mining (CRISP-DM)
Knowledge Discovery in Database (KDD)
Sample, Explore, Modify, Model, Assess (SEMMA)

Prepare the Data

Data preparation is very important and the most time-consuming step in a data science project. It involves constructing the data set that will be used in the modeling step. Data preparation also includes cleaning the data, combining data from multiple sources, and making sure the data doesn't have any gaps. Additionally, data preparation includes cleaning or "wrangling" the data so it's ready to transform. Data scientists can't assume that data is ready to use, even if it's structured data. Real-world data usually needs some work because it might be:
1. Incomplete or have incorrect values
2. Corrupted with broken lines or have fields in the wrong place
3. Too random
4. Irrelevant
5. An outlier, which is a value that lies far away from other values and will skew the data
6. A missing value in some fields
Manually verifying large volumes of stored data can be challenging for data scientists. So, data scientists use automated processes and tools to prepare data quickly and accurately.

Which of the following statements is correct about data science?

Data science is a multidisciplinary approach to extract insights from large volumes of data.

What is data science?

Data science is the understanding of the world through the scientific analysis of digital data.

A company uses data analytics to get a "snapshot" of its business and to identify issues. The company's data sources are past and present revenue reports and stock prices. Which type of analytics uses this type of data?

Descriptive analytics

Which type of analytics drills down the trends and patterns to identify the causes of problems?

Diagnostic analytics

Prescriptive Analytics

What should happen? Prescriptive analytics combines the insight from all previous data analyses to determine a course of action to take to address a problem or make a decision. The purpose of prescriptive analytics is to prescribe what action to take to eliminate a future problem or take full advantage of a promising trend. Prescriptive analytics is typically used for a host of actions, versus an individual action. Prescriptive analytics uses advanced tools and technologies, like machine learning, business rules, and algorithms. This makes prescriptive analytics sophisticated to implement and manage. Examples of prescriptive analytics:
A traffic application that helps people choose the best route home and considers the distance of each route, the speed at which one can travel on each road and, crucially, the current traffic constraints
An exam timetable that checks if students have conflicting schedules
Artificial intelligence (AI) systems from data-driven companies like Facebook, TikTok, and Netflix

Fill in the blank. An important characteristic of a data scientist is to be curious and begin by asking questions that start with __________.

Why

Tokens

a data scientist can break up text into words, phrases, symbols, or other meaningful elements called tokens

Attrition

a gradual reduction or weakening; a rubbing away

Tokenization

a list of tokens becomes input for further processing into numbers. This technique is called tokenization, which is one of many data transformation techniques.
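
A minimal tokenization sketch in plain Python (the sentence is made up, and real pipelines use more robust tokenizers):

    text = "Data science turns raw data into insights"

    # Tokenize: split the text into word tokens.
    tokens = text.lower().split()

    # Map each unique token to a number so a model can consume it.
    vocab = {token: i for i, token in enumerate(sorted(set(tokens)))}
    numbers = [vocab[t] for t in tokens]
    print(tokens)
    print(numbers)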

Design Thinking Workshop

apply techniques to:
Define the problem
Determine the project objectives
Develop personas or fictional characters that represent typical end users
Document solution requirements from a business perspective

