Data Analytics

Ace your homework & exams now with Quizwiz!

Data transformation usually involves:

- Adding, copying, or replicating data - Deleting fields or records - Standardizing the names of variables - Renaming, moving or combining columns in a database - Joining one set of data with another - Saving a file in a different format. For example, saving a spreadsheet as a comma separated values (CSV) file.

What's the Difference Between Data Wrangling and Data Cleaning

Data cleaning focuses on removing inaccurate data from your data set whereas data wrangling focuses on transforming the data's format, typically by converting "raw" data into another format more suitable for use. Data cleaning enhances the data's accuracy and integrity while wrangling prepares the data structurally for modeling

Ordinal categorical (qualitative) data

Ordinal data has categories, but we can order them from smallest to largest, least to most, worst to best, or in other ordered ways.

Discrete measures

are treated as finite values, since there is a countable amount of a given measure. The discrete measures, which are labeled in blue, add headers to the worksheet. An example of a discrete measure is the total number of rides.

Conceptual data modeling

gives a high-level view of the data structure, such as how data interacts across an organization. For example, a conceptual data model may be used to define the business requirements for a new database. A conceptual data model doesn't contain technical details

Easy code for a calculated field in Tableau: convert the numbers in the Gender dimension to unknown, male, or female

if [Gender] = '0' then 'Unknown' elseif [Gender] = '1' then 'Male' elseif [Gender] = '2' then 'Female' End

Data

is a collection of recorded facts / observations An important feature of data is that it is recorded and gathered from observations and/or measurements. If something hasn't been recorded, it isn't data.

Data Dictionary

is a list of data elements (entity/table and attribute/column) with their attributes and descriptions. It has a form of a set of tables.

Quantitative Data

is also called numerical because we can count it, measure it, and represent it with numbers. Examples of quantitative data include weight, salary, pieces of pie eaten, years of education, and number of orders. All these examples are things that we can count or measure, and we must represent their values as numbers.

Long data

is data where each row contains a single data point for a particular item. Google: a dataset in which each row is one time point per subject, so each subject has data in multiple rows

Wide data

is data where each row contains multiple data points for the particular items identified in the columns. Google: A dataset in which every data subject has a single row with multiple columns to hold the values of various attributes of the subject Wide data subjects can have data in multiple columns. Long data subjects can have multiple rows that hold the values of subject attributes.

Descriptive metadata

is metadata that describes a piece of data and can be used to identify it at a later point in time. For instance, the descriptive metadata of a book in a library would include the code you see on its spine, known as a unique International Standard Book Number, also called the ISBN.

Continuous numerical (quantitative) data

is numerical data that can take on any value between two whole numbers. In other words, it can have decimals and fractions.

Data transformation

is the process of changing the data's format, structure, or values

Data modeling

is the process of creating diagrams that visually represent how data is organized and structured

Data Cleaning

is the process of finding and correcting inaccurate data from a particular data set or data source. The primary goal is to identify and remove inconsistencies without deleting the necessary data to produce insights. It's important to remove these inconsistencies in order to increase the validity of the data set

Qualitative Data

or categorical. This means that we cannot count or measure the data; therefore, it isn't numerical. For example, hair color is qualitative. Possible values of hair color include black, brown, red, blonde, gray, white (and maybe more).

A/B testing (or split testing)

tests two variations of the same web page to determine which page is more successful in attracting user traffic and generating revenue. User traffic that gets monetized is known as the conversion rate. A/B testing allows marketers to test emails, ads, and landing pages to find the data behind what is working and what isn't working. Marketers use the confidence interval (determined by the conversion rate and the margin of error) to understand the results. For example, suppose you are conducting an A/B test to compare the effectiveness of two different email subject lines to entice people to open the email. You find that subject line A: "Special offer just for you" resulted in a 5% open rate compared to subject line B: "Don't miss this opportunity" at 3%. Does that mean subject line A is better than subject line B? It depends on your margin of error. If the margin of error was 2%, then subject line A's actual open rate or confidence interval is somewhere between 3% and 7%. Since the lower end of the interval overlaps with subject line B's results at 3%, you can't conclude that there is a statistically significant difference between subject line A and B. Examining the margin of error is important when making conclusions based on your test results.

Median

#1 Definition The median is a numeric value that separates the higher half of a set from the lower half. the middle score in a distribution; half the scores are above it and half are below it. #2 When is it applicable? The median is a numeric value that separates the higher half of a set from the lower half. The median is generally used to return the central tendency for skewed number distributions. #3 How is it calculated? The median can be calculated by listing all numbers in ascending order and then locating the number in the centre of that distribution. If the data you are comparing is mostly uniform then you can safely use the average (AVG) aggregator. However, if your number set has some outliers then you need to consider using median (MED) to filter out the values that are skewing the results.

When working with tabular data, a variable is referred to...

... as a field or a column.

Knowledge comes from ....

....accumulated information. Understanding and recognizing patterns is a type of knowledge. Sometimes knowledge is formalized into axioms, theories, concepts, and definitions.

3 common types of metadata

1) Descriptive; 2) Structural; 3) Administrative.

When discussing structured databases, data analysts refer to the data contained in a row as a record. How do they refer to the data contained in a column?

Data analysts refer to the data contained in a column as a field

Nominal categorical (qualitative) data

he categories of the data can be named or listed, but there is no inherent order to them.

For 2 tables to have a relationship....

one or more the same field must exist inside both tables

Composite Key

A primary key may be constructed using multiple columns of a table.

Query

A request for data or information from a database

Postgres

Database system for holding data

pgAdmin

Interface that talks to Postgres

How is ordinal data different from nominal data?

Ordinal data has an inherent order, while nominal data does not.

p-value

The probability level which forms basis for deciding if results are statistically significant (not due to chance). The lower the p-value, the greater the statistical significance of the observed difference. A p-value of 0.05 or lower is generally considered statistically significant.

Physical data modeling

depicts how a database operates. A physical data model defines all entities and attributes used; for example, it includes table names, column names, and data types for the database.

Two types of numerical (quantitative) data

discrete and continuous

Metadata repository

is a database specifically created to store metadata.

Administrative metadata

is metadata that indicates the technical source of a digital asset. When we looked at the metadata inside the photo, that was administrative metadata.

Structural metadata

is metadata that indicates how a piece of data is organized and whether it's part of one or more than one data collection An example of structural data would be how the pages of a book are put together to create different chapters.

Descriptive analysis

summarizes (or aggregates) data to form a picture of what happened in the past. This adds additional context that enables us to understand information more clearly. Summarizing data tells us the average value for a variable, how frequently a value appears in data, and other useful pieces of knowledge.

Mean

the arithmetic average of a distribution, obtained by adding the scores and then dividing by the number of scores. The mean is used for normal number distributions, which have a low amount of outliers.

Mode

the most frequently occurring score(s) in a distribution

Benefits of Data Cleaning

· Elimination of errors · Reduced costs associated with errors · Improves the integrity of data · Ensures the highest quality of information for decision making

Benefits of Data Wrangling

· Enhances ease of access to data · Faster time to insights · Improved efficiency when it comes to data-driven decision making

information is data...

....that has context

Structured Query Language

A computer programming language used to communicate with a database

Pivot table

A data summarization tool used to sort, reorganize, group, count, total, or average data

Data type

A data type is a specific kind of data attribute that tells what kind of value the data is. In other words, a data type tells you what kind of data you are working with.

Normalized database

A database in which only related data is stored in each table

Relational Database

A database that contains a series of related tables that can be connected via their relationships

Foreigh key

A field within a table that is a primary key in another table. In other words, a foreign key is how one table can be connected to another.

Datum

A single datum or fact can be a number, word, measurement, observation (e.g., visual, auditory, etc.), or description of something.

Report

A static collection of data periodically given to stakeholders

Primary key

An identifier that references a column in which each value is unique Some tables don't require a primary key.

Three types of data modeling

Conceptual, Logical, Physical

DIKW Pyramid

DIKW stands for Data, Information, Knowledge, and Wisdom. The main idea of this framework is that data (D) is converted first into information (I). Then, information (I) is converted into knowledge (K). Finally, knowledge (K) is converted into wisdom (W).

What is Data Analytics?

Data analytics is the collection, transformation, and organization of data in order to draw conclusions, make predictions, and drive informed decision making.

Logical data modeling

Focuses on the technical details of a database such as relationships, attributes, and entities. For example, a logical data model defines how individual records are uniquely identified in a database. But it doesn't spell out actual names of database tables. That's the job of a physical data model.

Database Normalization

Normalization is a process of organizing data in a relational database. For example, creating tables and establishing relationships between those tables. It is applied to eliminate data redundancy, increase data integrity, and reduce complexity in a database.

Unstructured data

Not organized in any easy-to-identify way

Structured data

Organized in a certain format, such as rows and columns

Stakeholders

People who invest time and resources into a project and are interested in its outcome

Nominal data

Qualitative data that is categorized without a set order

Ordinal data

Qualitative data with a set order or scale

Data analysis

The collection, transformation, and organization of data in order to draw conclusions, make predictions, and drive informed decision-making

Discrete numerical (quantitative) data

We count discrete data using whole numbers. Whole numbers don't allow any fractions or decimals.

Do you prefer to ask small questions and let them lead you to the big question? Or you tend to come up with a big question first?

When I work with data, I start with one basic question, but I end up asking so many more questions as the data shows more information. To me, having a starting point is my critical step, then as you work, whether in research or troubleshooting, the data you find will give you answers, not always to the initial question, but to questions you didn't know you had.

Redundancy

When the same piece of data is stored in two or more places

Entity Relationship Diagram (ERD)

are a visual way to understand the relationship between entities in the data model

Continuous measures

are essentially measures that are infinite and add axes to the worksheet. Continuous measures are labeled in green. Time is one example of a continuous measure, since there is no defined end.

Measures (Tableau)

are located in the lower part of the Data Pane section. They are the quantitative values in your dataset, which are usually numerically measured. You can apply calculations to measures and they can be aggregated. When you drag a measure into Columns, Rows, Marks, or Filters, Tableau automatically applies an aggregation to that measure.

Dimensions (Tableau)

are located in the upper part of the Data Pane section. They are the qualitative values in your dataset, such as names, dates, and geographical data. You can use dimensions to categorize, segment, and reveal the details in your data.

Schema

A way of describing how something, such as data, is organized

Sorting

The process of arranging data into a meaningful order to make it easier to understand, analyze, and visualize

Data wrangling

The process of cleaning, unifying, and preparing unorganized and scattered data sets for easy access and analysis.

Turnover rate

The rate at which employees voluntarily leave a company

Data analytics

The science of data

SELECT

The section of a query that indicates the subset of a dataset

WHERE

The section of a query that specifies criteria that the requested data must meet

Data analysis process

The six phases of ask, prepare, process, analyze, share, and act whose purpose is to gain insights that drive informed decision-making

Revenue

The total amount of income generated by the sale of goods or services


Related study sets

Chapter 2: The Recording Process

View Set

Elsevier/Evolve - Applying Elastic Stockings

View Set

Chapter 25 - Respiratory System, Chapter 26 - Upper Respiratory Problems, Chapter 27 - Lower Respiratory Problems (Pneumonia & TB)

View Set