Data Analytics
Data transformation usually involves:
- Adding, copying, or replicating data - Deleting fields or records - Standardizing the names of variables - Renaming, moving or combining columns in a database - Joining one set of data with another - Saving a file in a different format. For example, saving a spreadsheet as a comma separated values (CSV) file.
What's the Difference Between Data Wrangling and Data Cleaning
Data cleaning focuses on removing inaccurate data from your data set whereas data wrangling focuses on transforming the data's format, typically by converting "raw" data into another format more suitable for use. Data cleaning enhances the data's accuracy and integrity while wrangling prepares the data structurally for modeling
Ordinal categorical (qualitative) data
Ordinal data has categories, but we can order them from smallest to largest, least to most, worst to best, or in other ordered ways.
Discrete measures
are treated as finite values, since there is a countable amount of a given measure. The discrete measures, which are labeled in blue, add headers to the worksheet. An example of a discrete measure is the total number of rides.
Conceptual data modeling
gives a high-level view of the data structure, such as how data interacts across an organization. For example, a conceptual data model may be used to define the business requirements for a new database. A conceptual data model doesn't contain technical details
Easy code for a calculated field in Tableau: convert the numbers in the Gender dimension to unknown, male, or female
if [Gender] = '0' then 'Unknown' elseif [Gender] = '1' then 'Male' elseif [Gender] = '2' then 'Female' End
Data
is a collection of recorded facts / observations An important feature of data is that it is recorded and gathered from observations and/or measurements. If something hasn't been recorded, it isn't data.
Data Dictionary
is a list of data elements (entity/table and attribute/column) with their attributes and descriptions. It has a form of a set of tables.
Quantitative Data
is also called numerical because we can count it, measure it, and represent it with numbers. Examples of quantitative data include weight, salary, pieces of pie eaten, years of education, and number of orders. All these examples are things that we can count or measure, and we must represent their values as numbers.
Long data
is data where each row contains a single data point for a particular item. Google: a dataset in which each row is one time point per subject, so each subject has data in multiple rows
Wide data
is data where each row contains multiple data points for the particular items identified in the columns. Google: A dataset in which every data subject has a single row with multiple columns to hold the values of various attributes of the subject Wide data subjects can have data in multiple columns. Long data subjects can have multiple rows that hold the values of subject attributes.
Descriptive metadata
is metadata that describes a piece of data and can be used to identify it at a later point in time. For instance, the descriptive metadata of a book in a library would include the code you see on its spine, known as a unique International Standard Book Number, also called the ISBN.
Continuous numerical (quantitative) data
is numerical data that can take on any value between two whole numbers. In other words, it can have decimals and fractions.
Data transformation
is the process of changing the data's format, structure, or values
Data modeling
is the process of creating diagrams that visually represent how data is organized and structured
Data Cleaning
is the process of finding and correcting inaccurate data from a particular data set or data source. The primary goal is to identify and remove inconsistencies without deleting the necessary data to produce insights. It's important to remove these inconsistencies in order to increase the validity of the data set
Qualitative Data
or categorical. This means that we cannot count or measure the data; therefore, it isn't numerical. For example, hair color is qualitative. Possible values of hair color include black, brown, red, blonde, gray, white (and maybe more).
A/B testing (or split testing)
tests two variations of the same web page to determine which page is more successful in attracting user traffic and generating revenue. User traffic that gets monetized is known as the conversion rate. A/B testing allows marketers to test emails, ads, and landing pages to find the data behind what is working and what isn't working. Marketers use the confidence interval (determined by the conversion rate and the margin of error) to understand the results. For example, suppose you are conducting an A/B test to compare the effectiveness of two different email subject lines to entice people to open the email. You find that subject line A: "Special offer just for you" resulted in a 5% open rate compared to subject line B: "Don't miss this opportunity" at 3%. Does that mean subject line A is better than subject line B? It depends on your margin of error. If the margin of error was 2%, then subject line A's actual open rate or confidence interval is somewhere between 3% and 7%. Since the lower end of the interval overlaps with subject line B's results at 3%, you can't conclude that there is a statistically significant difference between subject line A and B. Examining the margin of error is important when making conclusions based on your test results.
Median
#1 Definition The median is a numeric value that separates the higher half of a set from the lower half. the middle score in a distribution; half the scores are above it and half are below it. #2 When is it applicable? The median is a numeric value that separates the higher half of a set from the lower half. The median is generally used to return the central tendency for skewed number distributions. #3 How is it calculated? The median can be calculated by listing all numbers in ascending order and then locating the number in the centre of that distribution. If the data you are comparing is mostly uniform then you can safely use the average (AVG) aggregator. However, if your number set has some outliers then you need to consider using median (MED) to filter out the values that are skewing the results.
When working with tabular data, a variable is referred to...
... as a field or a column.
Knowledge comes from ....
....accumulated information. Understanding and recognizing patterns is a type of knowledge. Sometimes knowledge is formalized into axioms, theories, concepts, and definitions.
3 common types of metadata
1) Descriptive; 2) Structural; 3) Administrative.
When discussing structured databases, data analysts refer to the data contained in a row as a record. How do they refer to the data contained in a column?
Data analysts refer to the data contained in a column as a field
Nominal categorical (qualitative) data
he categories of the data can be named or listed, but there is no inherent order to them.
For 2 tables to have a relationship....
one or more the same field must exist inside both tables
Composite Key
A primary key may be constructed using multiple columns of a table.
Query
A request for data or information from a database
Postgres
Database system for holding data
pgAdmin
Interface that talks to Postgres
How is ordinal data different from nominal data?
Ordinal data has an inherent order, while nominal data does not.
p-value
The probability level which forms basis for deciding if results are statistically significant (not due to chance). The lower the p-value, the greater the statistical significance of the observed difference. A p-value of 0.05 or lower is generally considered statistically significant.
Physical data modeling
depicts how a database operates. A physical data model defines all entities and attributes used; for example, it includes table names, column names, and data types for the database.
Two types of numerical (quantitative) data
discrete and continuous
Metadata repository
is a database specifically created to store metadata.
Administrative metadata
is metadata that indicates the technical source of a digital asset. When we looked at the metadata inside the photo, that was administrative metadata.
Structural metadata
is metadata that indicates how a piece of data is organized and whether it's part of one or more than one data collection An example of structural data would be how the pages of a book are put together to create different chapters.
Descriptive analysis
summarizes (or aggregates) data to form a picture of what happened in the past. This adds additional context that enables us to understand information more clearly. Summarizing data tells us the average value for a variable, how frequently a value appears in data, and other useful pieces of knowledge.
Mean
the arithmetic average of a distribution, obtained by adding the scores and then dividing by the number of scores. The mean is used for normal number distributions, which have a low amount of outliers.
Mode
the most frequently occurring score(s) in a distribution
Benefits of Data Cleaning
· Elimination of errors · Reduced costs associated with errors · Improves the integrity of data · Ensures the highest quality of information for decision making
Benefits of Data Wrangling
· Enhances ease of access to data · Faster time to insights · Improved efficiency when it comes to data-driven decision making
information is data...
....that has context
Structured Query Language
A computer programming language used to communicate with a database
Pivot table
A data summarization tool used to sort, reorganize, group, count, total, or average data
Data type
A data type is a specific kind of data attribute that tells what kind of value the data is. In other words, a data type tells you what kind of data you are working with.
Normalized database
A database in which only related data is stored in each table
Relational Database
A database that contains a series of related tables that can be connected via their relationships
Foreigh key
A field within a table that is a primary key in another table. In other words, a foreign key is how one table can be connected to another.
Datum
A single datum or fact can be a number, word, measurement, observation (e.g., visual, auditory, etc.), or description of something.
Report
A static collection of data periodically given to stakeholders
Primary key
An identifier that references a column in which each value is unique Some tables don't require a primary key.
Three types of data modeling
Conceptual, Logical, Physical
DIKW Pyramid
DIKW stands for Data, Information, Knowledge, and Wisdom. The main idea of this framework is that data (D) is converted first into information (I). Then, information (I) is converted into knowledge (K). Finally, knowledge (K) is converted into wisdom (W).
What is Data Analytics?
Data analytics is the collection, transformation, and organization of data in order to draw conclusions, make predictions, and drive informed decision making.
Logical data modeling
Focuses on the technical details of a database such as relationships, attributes, and entities. For example, a logical data model defines how individual records are uniquely identified in a database. But it doesn't spell out actual names of database tables. That's the job of a physical data model.
Database Normalization
Normalization is a process of organizing data in a relational database. For example, creating tables and establishing relationships between those tables. It is applied to eliminate data redundancy, increase data integrity, and reduce complexity in a database.
Unstructured data
Not organized in any easy-to-identify way
Structured data
Organized in a certain format, such as rows and columns
Stakeholders
People who invest time and resources into a project and are interested in its outcome
Nominal data
Qualitative data that is categorized without a set order
Ordinal data
Qualitative data with a set order or scale
Data analysis
The collection, transformation, and organization of data in order to draw conclusions, make predictions, and drive informed decision-making
Discrete numerical (quantitative) data
We count discrete data using whole numbers. Whole numbers don't allow any fractions or decimals.
Do you prefer to ask small questions and let them lead you to the big question? Or you tend to come up with a big question first?
When I work with data, I start with one basic question, but I end up asking so many more questions as the data shows more information. To me, having a starting point is my critical step, then as you work, whether in research or troubleshooting, the data you find will give you answers, not always to the initial question, but to questions you didn't know you had.
Redundancy
When the same piece of data is stored in two or more places
Entity Relationship Diagram (ERD)
are a visual way to understand the relationship between entities in the data model
Continuous measures
are essentially measures that are infinite and add axes to the worksheet. Continuous measures are labeled in green. Time is one example of a continuous measure, since there is no defined end.
Measures (Tableau)
are located in the lower part of the Data Pane section. They are the quantitative values in your dataset, which are usually numerically measured. You can apply calculations to measures and they can be aggregated. When you drag a measure into Columns, Rows, Marks, or Filters, Tableau automatically applies an aggregation to that measure.
Dimensions (Tableau)
are located in the upper part of the Data Pane section. They are the qualitative values in your dataset, such as names, dates, and geographical data. You can use dimensions to categorize, segment, and reveal the details in your data.
Schema
A way of describing how something, such as data, is organized
Sorting
The process of arranging data into a meaningful order to make it easier to understand, analyze, and visualize
Data wrangling
The process of cleaning, unifying, and preparing unorganized and scattered data sets for easy access and analysis.
Turnover rate
The rate at which employees voluntarily leave a company
Data analytics
The science of data
SELECT
The section of a query that indicates the subset of a dataset
WHERE
The section of a query that specifies criteria that the requested data must meet
Data analysis process
The six phases of ask, prepare, process, analyze, share, and act whose purpose is to gain insights that drive informed decision-making
Revenue
The total amount of income generated by the sale of goods or services