Data Analyst Course 5 Week 3
Aggregation
Collecting or gathering many separate pieces into a whole
COUNT in spreadsheets
Count can be used to count the total number of numerical values within a specific range in spreadsheets
Common use for VLOOKUP
One of the most common things data analysts do when using VLOOKUP is populating data in one spreadsheet from another For example, let's say we're working with data that exists in two different spreadsheets, but we need information from both in order to answer our business question. VLOOKUP can connect two sheets together on a matching column to populate one single sheet
Common data-cleaning tasks
Different data types (data may have cells formatted in a different way than you need or there is no consistency among the formatting) Can use the VALUE function to convert a text string that represents a number to a numerical value Another common data-cleaning task is removing extra spaces When data is copied from one source to another, sometimes a few leading or trailing spaces tag along. These can cause problems when using VLOOKUP. Use the TRIM function during the data cleaning process TRIM automatically delete any extra spaces added to the cell. Another typical mistakes during VLOOKUP are duplicates If there are duplicate rows in the search, it will return only the first match it finds. REMOVE Duplicates: a tool that automatically searches for and eliminates duplicate entries from a spreadsheet.
RIGHT JOIN
Does the opposite of LEFT JOIN. It will return all of the records from the right table and only the matching records from the left You may see this as RIGHT OUTER JOIN or RIGHT JOIN. RIGHT JOIN returns all records from the right table and the corresponding records from the left table. Practically speaking, RIGHT JOIN is rarely used. Most people simply switch the tables and stick with LEFT JOIN. But using the previous example for LEFT JOIN, the query using RIGHT JOIN would look like the following:
VLOOKUP syntax
Example syntax: =VLOOKUP(103, A2:B26, 2, FALSE) FALSE = Exact Match, TRUE = Approximate match Example =VLOOKUP(10003, A2:B26, 2, FALSE) SYNTAX =VLOOKUP(search_key, range, index, [is sorted]) Search_key = value you search for Range: range to consider for search. The first column in the range is searched to locate data matching the value specified by search_key Index: The column index of the value to be returned, where the first column in range is numbered 1. If index is not between 1 and the number of columns in range, #VALUE! is returned IS Sorted: Whether you want to return TRUE(approximate) or FALSE(exact) match
VLOOKUP limitations
Only returns first match it finds even if there is lot of possible matches VLOOKUP can only return value from data to the right. It can't look left. To get around this limitation, can copy column you want to lookup to the left of all your data. (insert column and paste lookup data at the beginning) Make sure to lock your table array. This makes sure that when you copy formula down, it continues to lookup the right data This is an absolut reference. An absolute reference is a reference that is locked so that rows and columns won't change when copied Dollar signs ensure the corresponding reference will not change Another thing that can cause errors to VLOOKUP formula is version control issues. An example if formula working at first but you updated the spreadsheet which caused an error in your original formula. For example, maybe a user inserted a column. This now throws off your entire formula. When using VLOOKUP, you are likely to get different results depending on if you use the word TRUE or FALSE at the end TRUE: Approximate matches It's important to know that VLOOKUP starts at the top of a specified range and searches downward vertically in each cell to find the right value. It stops searching when it finds any value that's greater than or equal to the lookup value. That's why data analysts typically use FALSE. FALSE: Exact matches
Two most common uses for VLOOKUP
Populating data in a spreadsheet Merging data from one spreadsheet with data in another
The WHERE function can't be used with aggregate functions
For example, you can use WHERE on a statement and follow it with GROUP BY. But when you want to use GROUP BY first and then use WHERE on that output, you'll need a different function.
CASE Clause
Returns records with your conditions by allowing you to include if/then statements in your query
INNER JOIN
Function that returns records with matching values in both tables Records are overlapping between both tables. For the records to appear in the results table, they will have to be key values in both tables Records will only merge if there are matches in both tables JOIN in SQL defaults to inner join. Many analysts just use JOIN instead of typing out INNER JOIN INNER is optional in this SQL query because it is the default as well as the most commonly used JOIN operation. You may see this as JOIN only. INNER JOIN returns records if the data lives in both tables. For example, if you use INNER JOIN for the 'customers' and 'orders' tables and match the data using the customer_id key, you would combine the data for each customer_id that exists in both tables. If a customer_id exists in the customers table but not the orders table, data for that customer_id isn't joined or returned by the query.
Troubleshooting questions you should ask yourself
How should i prioritize these issues? Trying to solve a lot of problems at once can be overwhelming. Try to solve on problem at a time In a single sentence, what is the issue I am facing? Helps clarify what is really going on. You don't get bogged down with extra information Always best to start with your own clear understanding of the situation and then let the data tell you if you are on the right track or not What resources can help me solve the problem? The internet is one of the best resources out there. Look for answers to your problem using the internet (quick google search) People are resources too so don't be afraid to ask questions. It a great way to learn and build relationships How can I stop this problem from happening in the future? New procedure or guideline that can be put in place?
Data aggregation helps data analysts:
Identify trends Make comparisons Gain insights
Alternate syntax for aliases
If using AS results in an error when running a query because the SQL database you are working with doesn't support it, you can leave it out. In the previous examples, the alternate syntax for aliasing a table or column would be: FROM table_name_alias_name SELECT column_name_alias_name The key takeaway is that queries can run with or without using AS for aliasing, but using AS has the benefit of making queries more readable. It helps to make aliases stand out more clearly.
There are 4 common joins data analysts use
Inner Left Right Outer
Example of aliasing
Let's say that you are working with two tables: one of them has employee data and the other one has department data. The FROM statement to alias those tables could be: FROM work_day.employees AS employees These aliases still let you know exactly what is in these tables, but now you don't have to manually input those long table names. Aliases can be really helpful for long, complicated queries. It is easier to read and write your queries when you have aliases that tell you what is included within your tables.
EX of COUNT DISTINCT Query
SELECT COUNT (DISTINCT Warehouse.state) AS num_states FROM project3-326803.warehouse_orders.Orders AS Orders JOIN project3-326803.warehouse_orders.Warehouse AS Warehouse ON Orders.warehouse_id = Warehouse.warehouse_id
Example of OUTER JOIN
SELECT employees.name AS employee_name, employees.role AS employee_role, departments.name AS departments_name FROM project3-326803.employee_data.employees FULL OUTER JOIN project3-326803.employee_data.departments ON employees.department_id = departments.department_id In example of above, need to use FULL OUTER JOIN, not just OUTER JOIN There will be NULLS as you are pulling all the data and there most likely will not be corresponding data for everything in your tables. Not every piece of data will have a matching value
Example of Inner Join
SELECT employees.name AS employee_name, employees.role AS employee_role, departments.name AS departments_name FROM project3-326803.employee_data.employees INNER JOIN project3-326803.employee_data.departments ON employees.department_id = departments.department_id In the example of above, ON specifies which column in each table will contain the matching JOIN key. Now we've got a list of employee names and department IDs for the employees that have those IDs.
Subquery example using a WHERE statement. The bike-sharing company has two kinds of users: subscribers and one-time customers. Let's say we wanted a list of stations subscribers used.
SELECT station_id, name FROM bigquery-public-data.new_york_citibike.citibike_stations WHERE station_id IN ( SELECT start_station_id FROM bigquery-public-data.new_york_citibike.citibike_trips WHERE usertype = 'Subscriber' )
It is really common to see subqueries to be nested within FROM and WHERE statements. Example where we calculate the number of rides that have started at each station over time.
SELECT station_id, name, number_of_rides AS number_of_rides_starting_at_station FROM ( SELECT start_station_id, COUNT(*) number_of_rides FROM bigquery-public-data.new_york_citibike.citibike_trips GROUP BY start_station_id ) AS staion_num_trips INNER JOIN bigquery-public-data.new_york_citibike.citibike_stations ON station_id = start_station_id ORDER BY number_of_rides DESC
Example comparing # of bikes available vs average # of bikes available at a station. Subquery of a SELECT statement
SELECT station_id, num_bikes_available, (SELECT AVG(num_bikes_available) FROM bigquery-public-data.new_york_citibike.citibike_stations) AS avg_num_bikes_available FROM bigquery-public-data.new_york_citibike.citibike_stations
Syntax for aliasing a column
SELECT column_name AS alias_name, FROM table_name
Syntax for aliasing a table
SELECT column_name(s) FROM table_name AS alias_name
The two most common kinds of JOIN statements are INNER JOINs and OUTER LEFT JOINs (also known simply as LEFT JOINs). As a review INNER JOIN: Returns only the rows where the target appears in both tables. LEFT JOIN: Returns every row from the left table, as well as any rows from the right table with matching keys found in the left table
The takeaway here is that the sort of JOIN you use matters. When writing a query, you can draw out a Venn diagram like the example graphic above to help you decide which sort of JOIN you need. Recall that you can set aliases for tables by specifying the alias for the table after the table's name in FROM and/or JOIN statements
Example of SQL functions used in data aggregation
SQL HAVING SQL CASE SQL IF SQL COUNT
SQL functions are tools built into SQL to make it possible to perform calculations. A subquery (also called an inner or nested query) is a query within another query.
SQL functions are what help make data aggregation possible. (As a reminder, data aggregation is the process of gathering data from multiple sources in order to combine it into a single, summarized collection. Examples below
There are many ways to make use of a subquery. Below are the subquery rules. A subquery must follow:
Subqueries must be enclosed within parentheses A subquery can have only one column specified in the SELECT clause. But if you want a subquery to compare multiple columns, those columns must be selected in the main query Subqueries that return more than one row can only be used with multiple value operators, such as the IN operator which allows you to specify multiple values in a WHERE clause. A subquery can't be nested in a SET command. The SET command is used with UPDATE to specify which columns (and values) are to be updated in a table.
VLOOKUP reminders
TRUE means an approximate match, FALSE means an exact match on the search key. If the data used for the search key is sorted, TRUE can be used. You want the column that matches the search key in a VLOOKUP formula to be on the left side of the data. VLOOKUP only looks at data to the right after a match is found. In other words, the index for VLOOKUP indicates columns to the right only. This may require you to move columns around before you use VLOOKUP. After you have populated data with the VLOOKUP formula, you may copy and paste the data as values only to remove the formulas so you can manipulate the data again.
IN operator SQL
The IN operator allows you to specify multiple values in a WHERE clause. The IN operator is a shorthand for multiple OR conditions.
What if you get #N/A when using VLOOKUP
#N/A indicates that a matching value can't be returned as a result of the VLOOKUP. The error doesn't mean that anything is actually wrong with the data, but people might have questions if they see the error in a report. You can use the IFNA function to replace the #N/A error with something more descriptive, like "Does not exist." Syntax: =IFNA(value, value_if_na) Value = the function checks if cell value matches the value; #N/A if it doesn't Value_if_na = the function returns this value if the cell value matches the value in the first argument; it returns this value when the cell value is #N/A
Additional SQL subquery tips/facts
The innermost query executes first and the parent query executes last. This is the case so the parent query can use the results of the inner queries A Subquery can only have one column specified in the SELECT statement Subqueries that return more than one row rely on multiple value operators such as IN
Data aggregation
The process of gathering data from multiple sources in order to combine it into a single summarized collection In data analytics, a summarized collection, or summary, describes identifying the data you need and gathering it all together in one place Think of it like multiple puzzles falling on the floor. Need to clean up and group the puzzle pieces to original set to work with the puzzles. Functions help make data aggregation possible Data aggregation can give you all kinds of information about the data you are looking at. For example, in marketing, you can aggregate data from an ad campaign to see how it performed over time and for particular customers. One thing all companies using data aggregation all have in common is that they can use V-Lookup to achieve their insights
JOIN
A SQL clause that is used to combine rows from two or more tables based on a related column JOIN is like a SQL version of VLOOKUP JOIN helps you find matching or related columns from different tables (two or more) These are the primary and foreign keys of databases Primary keys reference columns in which each value is unique to that table but that table can have multiple foreign keys Foreign keys are primary keys in other tables Using JOINs can make working with multiple data sources a lot easier and it can make relationships between tables more clear. JOINs are super useful when you need to work with data from multiple related tables. They give you a lot of flexibility with how you combine and view that data. If you ever have trouble remembering what INNER, RIGHT, LEFT, or OUTER JOIN do, just think back to our Venn diagram. A JOIN combines tables by using a primary or foreign key to align the information coming from both tables in the combination process. JOINs use these keys to identify relationships and corresponding values across tables. The JOIN statement is part of the FROM clause of the query. JOIN in SQL indicates that you are going to combine data from two tables. ON in SQL identifies how the tables are to be matched for the correct information to be combined from both.
Subquery:
A SQL query that is nested inside a larger query Think of it like a Russian nesting doll. A large query can have a smaller query in it and that smaller query can have another query in it When you stack them all together, they make on query Can combine different pieces of logic together. Logic of outer query relies on logic of inner query. You can get more done with one query. All of the logic of the query is one place The statement containing the subquery can be called the "outer query" or "outer select" Makes the subquery the "inner query" or "inner select" Subqueries can also be nested within other queries. Usually subqueries are nested within SELECT, FROM and/or WHERE clauses
You can use comparison operators like IN, ANY, or ALL in subqueries
A comparison (or relational) operator is a mathematical symbol which is used to compare two values. Comparison operators are used in conditions that compares one expression with another. The result of a comparison can be TRUE, FALSE, or UNKNOWN (an operator that has one or two NULL expressions returns UNKNOWN).
OUTER JOIN
A function that combines RIGHT and LEFT JOIN to return all matching records in both tables Will return all records from both tables If there are records in one table without a match, it'll create a record with no values for the other table. You may sometimes see this as FULL JOIN. FULL OUTER JOIN returns all records from the specified tables. You can combine tables this way, but remember that this can potentially be a large data pull as a result. FULL OUTER JOIN returns all records from both tables even if data isn't populated in one of the tables. For example, in the query below, you will get all customers and their products' shipping dates. Because you are using a FULL OUTER JOIN, you may get customers returned without corresponding shipping dates or shipping dates without corresponding customers. A NULL value is returned if corresponding data doesn't exist in either table.
VLookup (Vertical Lookup)
A function that searches for a certain value in a column to return a corresponding pieces of information Before using VLOOKUP, we need to make sure our data is properly prepared Clean data is much more likely to give you better results It's always good to remember that clean data is the foundation that everything else is built on. VLOOKUP can be a very useful data-cleaning tool.
LEFT JOIN
A function that will return all the records from the left table and only the matching records from the right table Here's how you can figure out which table is left or right. In English and SQL we read from left to right. The table mentioned first is left and the table mentioned second is right. You can also think of left as a table name to the left of the JOIN statement and right as a table name to the right of the JOIN statement. Left table and data it shares with the right table are being selected Each row in the left table appears in the results even if there are no matches in the right table. You may see this as LEFT OUTER JOIN, but most users prefer LEFT JOIN. Both are correct syntax. LEFT JOIN returns all the records from the left table and only the matching records from the right table. Use LEFT JOIN whenever you need the data from the entire first table and values from the second table, if they exist. For example, in the query below, LEFT JOIN will return customer_name with the corresponding sales_rep, if it is available. If there is a customer who did not interact with a sales representative, that customer would still show up in the query results but with a NULL value for sales_rep.
Query
A query is a request for information from a database.
COUNT DISTINCT
A query that only returns the distinct values in a specified range Doesn't count repeating values You'll use COUNT and COUNT DISTINCT any time you want to answer questions about "how many"
COUNT in SQL
A query that returns the number of rows in a specified range
Subquery
A subquery, also called an inner or nested query, is a query within another query.
Cannot use WHERE clause in aggregate functions (COUNT, SUM, AVG, MIN)
Aggregate function: Basically a function that aggregates data into a single number
Aliases in SQL
Aliases are used in SQL queries to create temporary names for a column or table. Aliases make referencing tables and columns in your SQL queries much simpler when you have table or column names that are too long or complex to make use of in queries. Imagine a table name like special_projects_customer_negotiation_mileages. That would be difficult to retype every time you use that table. With an alias, you can create a meaningful nickname that you can use for your analysis. In this case "special_projects_customer_negotiation_mileages" can be aliased to simply "mileage."
HAVING function (clause)
Allows you to add a filter to your query instead of the underlying table that can only be used with aggregate functions This way, your query only returns records that meets your specific conditions
Data can also be aggregated over a given time period to provide statistics, such as:
Averages Minimums Maximums Sums
Think of a query as a cake. A cake can have multiple layers contained within it and even layers within those layers. Each of these layers are our subqueries, and when you put all of the layers together, you get a cake (query).
Usually, you will find subqueries nested in the SELECT, FROM, and/or WHERE clauses. There is no general syntax for subqueries, but the syntax for a basic subquery is as follows:
How to lock a spreadsheet for VLOOKUP purposes
You can lock the spreadsheet to ensure this does not happen (to do this in sheets, select data > protected sheets and ranges > next choose what you want to protect > can either select to show a warning or restrict who can edit, choose only you if only you want to update spreadsheet) If you can't lock the spreadsheet, can use MATCH instead. (MATCH: a function used to locate the position of a specific lookup value)
COUNT and COUNT DISTINCT
You will use them throughout the data analysis process at different stages For example, you might need them while you're cleaning data to check how many rows are left in your dataset. Or you might use COUNT and COUNT DISTINCT during the actual analysis to answer a "how many" question. COUNT and COUNT DISTINCT are really useful to know You'll find yourself using COUNT and COUNT DISTINCT during every stage of the data analysis process. Understanding what these queries are and how they are different is key.