SQL
SQL
Structured Query Language, used to communicate with databases, founded on the relational database model
outer query
the main query in a subquery
ON
this clause can be used when the common columns in 2 tables use the same column name (instead of USING)
key
this is something that uniquely identifies each individual record (can be one or more attributes, must work in all cases to avoid database anomalies)
*
called a splat; achieves same results as listing all column names
ALL
can be used after set operators to keep duplicates
SELECT
can be used on its own as print
HAVING
clause can use with GROUP BY to filter grouped results.
PostGres
compiles a query by processing anything before a semi-colon
modern data science team
data engineer, data analyst, data scientist
data analysis
1. Identify what you will be looking for in the database What is the desired output of your analysis? What is the base set of data/tables required? Will you need to filter out some records? Will you need to aggregate or summarize data? Use that information to build your database queries. • SELECT ... FROM ... WHERE ... GROUP BY ... Be curious!
entity-relationship model (ER)
A top-down approach to database designBegins big picture (entities and relationships)Then details are added (attributes and constraints)
PostGres keywords syntax
ALL CAPITAL LETTERS
data summarization/visualization
Be sure to choose the correct delivery method for your results. Allow the key takeaway from your analysis to dictate the type of visualization that is chosen. Relationship: Show correlations or connections between factors Comparison: Show how variables perform against the same metric over a period of time Distribution: Show how a series of data points are spread out Composition: Show how individual parts make up a whole Visualization isn't necessary if the answer is a simple number or a list of items
Postgre SQL
Besides aggregate functions this also supports Date/Time functions, String functions, Math functions
SELECT title, release_year
FROM films LIMIT 3; Query the title and release year for the first 3 rows of the films table
SELECT id, title
FROM films OFFSET 55; Query for all the ids and titles from the 56th row to the end.
SELECT *
FROM films OFFSET 665 LIMIT 1; Query for all information in the 666th row
SELECT *
FROM films ORDER BY RANDOM() LIMIT 5; Query 5 rows randomly.
SELECT country, release_year, title
FROM films ORDER BY country, release_year desc, title; Query for country, release year and title from films table, sort by country a -> z, release year largest -> smallest and title a -> z.
SELECT title, duration
FROM films ORDER BY duration Query for title and duration, sort duration from the shortest to the longest.
SELECT id.title
FROM films ORDER BY title ; Query the film table for titles and ids sorted by titles a -> z.
SELECT title, certification
FROM films WHERE certification != 'R'; Query for titles and certifications of films which are not R rated movies
SELECT title, country, release_year
FROM films WHERE country='USA' (release_year>2000 OR release_year<1980); Query for the title, country and year of movies made in the US or the UK, which were released after 2000 or before 1980.
SELECT title, duration
FROM films WHERE duration<60; Query for titles and duration of films shorter than 1 hour.
SELECT title, country
FROM films WHERE language IN ('German', 'Spanish', 'French'); Query for titles and countries of films in German, Spanish and French.
SELECT title, release_year
FROM films WHERE release_year BETWEEN 1960 AND 1969; Query for titles and release year of films in 1960s.
SELECT title
FROM films WHERE title LIKE '%Alexander%Horrible%Bad Day%; I cannot remember the exact title of a movie, which contains the words 'Alexander...Horrible..... Bad Day', write a query to find this movie.
SELECT DISTINCT language
FROM films; Query for all the different languages from the films table
SELECT *
FROM films; Query for everything from the films table
SELECT title, duration
FROM films; Query the title and duration columns from the films table
types of joins
INNER JOIN, OUTER JOIN (LEFT JOIN, RIGHT JOIN, FULL JOIN), CROSS JOIN
1:M relationship
Ideal for relational modeling (what we are trying to do) Should be the norm in any relational database design
OFFSET
If a statement contains both OFFSET and LIMIT, which will be conducted first regardless of their order
data cleaning
If the data set is new or not yet standardized, you'll have to clean it Data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data. Data cleaning may be performed interactively with data wrangling tools, or as batch processing through scripting.
SELECT DISTINCT certification FROM films;
Query for all the unique certifications from the films table
PostGres' version of print() statement
SELECT 'sing quotes only'
FULL JOIN syntax
SELECT columns FROM table1 FULL JOIN table2 ON table1.column = table2.column;
LEFT JOIN syntax
SELECT columns FROM table1 LEFT JOIN table2 ON table1.column = table2.column;
RIGHT JOIN syntax
SELECT columns FROM table1 RIGHT JOIN table2 ON table1.column = table2.column;
BETWEEN...AND.../NOT BETWEEN...AND...
To return all rows, at the specified SELECT columns, where a value in a column falls with in a specific range of values
=, !=, ,>=,<=
To return all rows, at the specified SELECT columns, where a value in a column is =, !=, ,>= or <= a specified value
IN/NOT IN (a1, a2...)
To return all rows, at the specified SELECT columns, where a value in a column is included/excluded from a set of values
IS NULL/NOT IS NULL
To return all rows, at the specified SELECT columns, where a value in a column is or is not NULL
subquery
a query inside the main query; also known as an inner query
Database related syntax
all lowercase letters
M:N relationship
cannot be implemented as such in the relational model; M:N relationships can be changed to 1:M relationships
AS
column name can be modified using
foreign key
created when a primary key from one table is carried over to another table
ON table1.id_x
is the same as USING(id_x)
joining tables
is used to gather related data from distinct tables in one place (a flat table)
LIKE
is used to search for a specified pattern in a column. And the operator uses wildcard characters to define text patterns.
(ASC)
or descending (DESC) order , separated by ' , '"
set caluse syntax
query1 as q1 set operator (ALL) query2 as q2
entities
represent objects or things; Physical things: like students, lecturers, employees, products; Abstract things:like modules, orders, courses, projects they have: Instances: samples or itemsAttributes: characteristics or descriptions
COUNT
returns number of values in specified column
SELECT * from table_x LIMIT m OFFSET n= SELECT * from table_x OFFSET n LIMIT m r
returns rows from the (n+1)th row to the (n+m)th row
MIN
returns smallest value in specified column
STDDEV
returns standard deviation of values in specified
SUM
returns sum of values in specified column
VARIANCE
returns variance of values in specified column
1:1 relationship
should be rare in any relational database design
LIMIT n
similar to the usage of data.head(n) in Python pandas package
GROUP BY
statement is often used with aggregate functions to group the result-set by distinct values in one or more columns.
SELECT...FROM...
statements to retrieve and display data from one or more database tables
general join syntax
table1 join_type table2 [join_condition]
primary key
uniquely defines a single table
Different databases
use different SQL dialects
DISTINCT
use this in front of column names to check all different values, missing value will show as null
OFFSET n
use this when you want n (where n must be a non-negative integer) at the end of the query to ignore the first n rows and return the rest.
JOINS
used to retrieve data from multiple tables
LIMIT n
using this n (where n must be a non-negative integer) at the end of the query to return the first n rows of all results
WHERE or HAVING clause FROM clause SELECT clause
where a subquery can reside
ON or USING
which clauses specify the join condition
ORDER BY col_name
ASC is default ordering option for ORDER BY so ORDER BY col_name ASC is the same as
set clauses
are used to combine or reorganize the results from multiple SELECT statements into a single result
ORDER BY
"is to used to sort results by ASC can be followed by a list of columns with ascending
[^charlist] or[!charlist]
"which wildcard Defines sets and ranges of characters NOT to match"
[charlist]
"which wildcard Defines sets and ranges of characters to match"
_
"which wildcard represents a single character"
%
"which wildcard represents zero, one, or multiple characters"
data analysis task framework
1. requirements gathering 2. data collection 3. data cleaning 4. data analysis 5. data summarization or visualization
data collection
After identifying necessary data points to answer the question, you must check to see if they are readily available. If the data are not available, you may need to have engineers log then extract, transform, and load (ETL) the data into your data warehouse Example Data Sources Transactional Data Log Files 3rd-Party Websites or Services
SELECT FROM
SELECT [what] FROM [where]
CROSS JOIN syntax
SELECT columns FROM table1 CROSS JOIN table2
LIKE/NOT LIKE
To return all rows, at the specified SELECT columns, where a string value in a column matches a specified pattern
set operators
UNION(ALL), INTERSECT(ALL), EXCEPT(ALL)
requirements gathering
Understanding the "why" behind the problem Understand the underlying business process Define the decisions or actions that are being taken Understand the business problem that you are trying to solve Define the specific information that is required to make the decisions Don't just answer the immediate question being asked to you. Be sure to get the full context behind any incoming data requests
Throw an error
if something is after the semi-colon what will happen
candidate key
is an attribute or set of attributes that could be used to uniquely identify each observation in a table
Semi-colon
is not required in Postgres, but is in other dialects, thus it is good practice to use it
INNER JOIN
most common type of join and is the default type of join; key word inner can be used optionally
Postgres dialect
not case sensitive and therefore as long as logic is correct, a query will compile
relationship types
one to many (1:M); many to many (M:N or M:M), one to one (1:1)
attribute
property of an entity
AVG
returns average of values in specified column
MAX
returns largest value in specified column