Data

Pataasin ang iyong marka sa homework at exams ngayon gamit ang Quizwiz!

What are the six constraints in a relational database model?

1. Entity Integrity Constraint: The entity integrity constraint states that no attribute participating in the primary key of a relation is allowed to accept null values and no duplicate rows are allowed. The value null indicates that the value is unknown. In the entity integrity constraint, the primary key cannot have an unknown value. 2. Referential Integrity Constraint: Defines relationships between tables and ensures that these relationships remain valid. The validity of the data is enforced using a combination of primary keys and foreign keys. 3. Semantic Integrity Constraint: Refers to the correctness of the meaning of the data. The values for each attribute must make sense. 4. Domain Constraint: Specifies the permissible values for a given attribute. 5. Null Constraint: Specifies that attribute values cannot be null. 6. Check Constraint: Enforces domain integrity by limiting the values that are accepted by an attribute. For example, a check for a date attribute may be the data must be before the current date.

How do data scientists make sense of big data?

1. Formulate the business use case. Identify problem, objective, KPIs. 2. Collect and clean data. Determine which data is relevant and how to structure that data. 3. Explore the data. Investigate the data using visualizations and statistics. 4. Analyze data. Build, fit, and validate models. 5. Storytelling. Tell a story with the data. Requires visualization, slides, and communication. 6. Take action. Make decisions with the insights.

What does the JOIN operator do in SQL? What are the two types of table joins?

A JOIN combines the rows from two or more tables based on a relationship between certain columns in these tables. There are two types of table joins: inner join and outer join.

What is a primary key? What is a foreign key? How do these keys relate to parent and dependent tables? What are the rules for a primary key?

A Primary Key is a unique identifier for each row in a table, and using Primary Keys can help speed up your queries significantly. A foreign key is a column referring to a primary key of another table. A table containing a primary key that is related to at least one foreign key is called a parent table. A table containing one or more foreign keys is called a dependent table. If an entity needs to access information in another entity, that entity must reference the other entity. Rules for primary key are: The value of the Primary Key must be unique for each instance of the entity. There can be no missing values (i.e. Not Null) for Primary Keys. If the Primary Key is composed of multiple attributes, each of those attributes must have a value for each instance. The Primary Key is immutable, that is, once created the value of the Primary Key cannot be changed. If the Primary Key consists of multiple attributes, none of these values can be updated.

What is a SQL schema and how do you create one?

A SQL schema is identified by a schema name, and includes a authorization identifier to indicate the user or account who owns the schema. Schema elements include tables, constraints, views, domains and other constructs that describe the schema. A schema is created using the CREATE SCHEMA statement. For example, we can create a schema called LIBRARY for this course: CREATE SCHEMA LIBRARY AUTHORIZATION 'Robert'

What is a cloud database? What are some advantages of a cloud database?

A cloud database is a database service built and accessed through a cloud platform. It serves many of the same functions as traditional databases with the added flexibility of cloud computing. Users install software on a cloud infrastructure to implement the database. Advantages of using cloud databases are ease of use: Users can access cloud databases from virtually anywhere using a vendors API or web interface. Scalability: Cloud databases can expand their storage capacities during runtime to accommodate changing needs, organizations only pay for what they use. Disaster recovery: In the event of a natural disaster equipment failure or power outage data is kept secure through backups on remote servers. A few examples of cloud SQL databases are IBM Db2 on Cloud, Compose for PostgreSQL, Oracle Database Cloud, Microsoft Azure Cloud - SQL Database, Amazon Relational Database Services, which can run in the cloud, either in a virtual machine or as a service depending on the vendor.

What are nodes, racks, and clusters in the Hadoop framework?

A node is a computer. A rack is a collection of nodes, typically 30 or 40, that are physically stored close together and are all connected to the same network switch. The Hadoop cluster is a collection of racks.

What is a relational database? How is it managed?

A relational database stores data in a tabular format - i.e. in rows and columns. A table is a collection of related things like a list of employees or a list of book authors. In a relational database, you can form relationships between tables. For relational databases, it's called a relational database management system or RDBMS. RDBMS is a set of software tools that controls the data such as access, organization, and storage. And RDBMS serves as the backbone of applications in many industries including banking, transportation, health, and so on. Examples of relational database management systems are mySQL, Oracle Database, DB2 Warehouse on Cloud, and DB2 Express C. For the majority of people using a database, there are five simple commands to create a table, insert data to populate the table, select data from the table, update data in the table, delete data from the table. So those are the building blocks for SQL for data science.

What is a DBMS (database management system)?

A set of software tools for the data in the database is called a database management system or DBMS for short.

What is an aggregate SQL function? What are some common aggregate functions?

An aggregate function takes a collection of light values, such as all of the values in a column, as input, and returns a single value or null. Examples of aggregate functions include: sum, minimum, maximum, and average. Use a WHERE clause to perform a function on a subset of the data. You can also perform mathematical operations on the columns within the aggregate functions. SELECT AVG(col_1 / col_2) FROM table WHERE col_1 = x

What is an API? How does a SQL API work?

An application programming interface is a set of functions that you can call to get access to some type of service. The SQL API consists of library function calls as an application programming interface, API, for the DBMS. To pass SQL statements to the DBMS, an application program calls functions in the API, and it calls other functions to retrieve query results and status information from the DBMS. The application program begins its database access with one or more API calls that connect the program to the DBMS. To send the SQL statement to the DBMS, the program builds the statement as a text string in a buffer and then makes an API call to pass the buffer contents to the DBMS. The application program makes API calls to check the status of its DBMS request and to handle errors. The application program ends its database access with an API call that disconnects it from the database.

What is an enhanced 360-degree view of the customer?

An enhanced 360 degree view of the customer is a holistic approach, that takes into account all available and meaningful information about the customer to drive better engagement, revenue, and long-term loyalty. This is the basis for modern customer relationship management, or CRM systems. Let's look at an example in detail. By taking an enhanced 360 degree view of the customer, and taking available and meaningful information such as spending habits, shopping behavior and preferences, grocery stores are able to plan, prepare, and provide better services to customers.

What is an entity-relationship model?

An entity-relationship data model or ER data model, is a type of a relational data model and is the foundation of many relational databases. An Entity Relationship Diagram or ERD, represents entities called tables and their relationships. An entity-relationship model proposes thinking of a database as a collection of entities. The ER model is used as a tool to design relational databases. In the ER model, entities are objects that exist independently of any other entities in the database. It is simple to convert an ER diagram into a collection of tables. The building blocks of an ER diagram are entities, attributes, and relationships. Entities have attributes, which are the data elements that characterize the entity. Attributes tell us more about the entity. In an ER diagram, an entity is drawn as a rectangle, attributes are drawn as ovals, lines connect associated entities, and relationships are drawn as diamonds. Entity connections are often represented using crows foot notation (< | >). Attributes are connected to exactly one entity. The entity book becomes a table in the database, and the attributes become the columns in a table.

What is the difference between a data model and an information model?

An information model is at the conceptual level and defines relationships between objects. Data models are defined in a more concrete level, are specific, and include details. A data model is the blueprint of any database system. There are several types of information models. The most familiar is the hierarchical, typically used to show organization charts. The relational model is the most used data model for databases.

How does OUTER JOIN work in SQL?

And Outer Join is a specialized form of Join. And there are three types of Outer Join, Left Outer Join, Right Outer Join, and Full Outer Join. A left join matches the results from two tables and displays all the rows from the table to the left of LEFT JOIN, and combines the information with rows from the table right of LEFT JOIN that match the criteria specified in the query. So we see all rows from the left table, but only the rows from the right table that satisfy the condition. If the right table does not have a corresponding value, a null value is returned. A right join matches the results from two tables and displays all the rows from the table to the right of RIGHT JOIN, and combines the information with rows from the table left of RIGHT JOIN that match the criteria specified in the query. So we see all rows from the right table, but only the rows from the left table that satisfy the condition. If the left table does not have a corresponding value, a null value is returned. The full join keyword returns all rows from both tables. That is all rows from the left table and all rows from the right table. So, the full join could return a very large result set. The terms left and right refer to the table on the left-hand side and the right-hand side of the FULL JOIN keyword in the FROM clause. The rows from the left and right tables will match if they satisfy the condition after the ON keyword.

What is Hadoop? What is a Hadoop distributed file system?

Apache Hadoop is a highly scalable storage platform designed to process very large data sets across hundreds to thousands of computing nodes that operate in parallel. It provides a cost effective storage solution for large data volumes with no format requirements. MapReduce, the programming paradigm that allows for this massive scalability, is the heart of Hadoop. A Hadoop distributed file system, or HDFS, stores data for many different locations, creating a centralized place to store and process the data. Many large companies make use of Hadoop in their technologies.

What does COUNT() do in SQL?

COUNT() retrieves the number of rows that match the query criteria.

What does DISTINCT do in SQL?

DISTINCT is used to remove duplicate values form a result set.

What is a data lake?

Data lakes are a method of storing data that keep vast amounts of raw data in their native format and more horizontally to support the analysis of originally disparate sources of data.

How would you go about getting information about table names and attributes in a DBMS?

Database systems typically contain system or catalog tables, from where you can query the list of tables and get their properties. In IBM's DB2 Warehouse on Cloud, this catalog is called syscat tables. In SQL Server, the catalogs are called information schema tables. These tables will contain information for all tables in the database system, so it is important to query table information specifically for your username or project.

What is IBM Cloud?

IBM Cloud is IBM's innovative cloud computing platform which combines platform as a service (PaaS) with infrastructure as a service (IaaS). It includes a rich catalog of cloud services that can be easily integrated with PaaS and IaaS to build business applications rapidly. It lets you quickly create, deploy, and manage your applications in the cloud.

When can you use sub-queries in SQL?

In the WHERE clause when using functions. As a column expression in the SELECT clause. SELECT col, (SELECT AVG(col) FROM table) as average FROM table; As part of the FROM clause as a table expression, where it serves as a data source. SELECT * FROM (SELECT col FROM table) as t;

What is a data warehouse? What is data warehouse modernization?

It is a platform that can deliver deep insight with advanced in-database analytics. Data Warehouse Modernization, formerly known as Data Warehouse Augmentation, is about building on an existing Data Warehouse infrastructure leveraging Big Data technologies to augment its capabilities, essentially in upgrade. Given a set of data, there are three key types to Data Warehouse Modernization's. Pre-Processing, using Big Data as a landing zone before determining what data should be moved to the Data Warehouse. It could be categorized as Irrelevant Data or Relevant Data, which would go to the Data Warehouse. Offloading, moving the irrelevant accessed data from Data Warehouses into enterprise grade Hadoop. Exploration, using big data capabilities to explore and discover new high value data from massive amounts of raw data and free up the Data Warehouse for more structured deep analytics.

What is Jupyter Notebook's SQL Magic extension? How do you load it? How do you specify a line as a SQL statement? How do you specify an entire code cell as a SQL statement? How do you reference Python variables in a SQL statement? How do you set query results to a Python variable? How do you create a table in the cloud database using a pandas DataFrame? How do you split queries into multiple lines?

Jupyter Notebooks have a SQL "magic" extension that use SQLAlchemy to communicate with SQL databases. To load SQL magic, execute a cell with the following: %load_ext sql To specify a line as SQL, use %sql at the beginning: %sql INSERT INTO table ... To specify an entire code cell as SQL, do this at the top: %%sql To reference a Python variable, precede it with a colon, %sql select * from table where country = :country To set query results to a variable, use %sql after =, results = %sql SELECT col FROM table To create a table named 'df' in the DB using a DataFrame, do this: %sql PERSIST df To split queries into multiple lines, use a backslash: %sql SELECT col FROM table \ WHERE col=x

What are some popular SQL-based database management systems and their corresponding Python API?

MySQL - MySQL C API PostgreSQL - psycopg2 IBM DB2 Warehouse on Cloud - ibm_db SQL Server - dblib API MongoDB - PyMongo Database access for Microsoft Windows OS - ODBC Oracle - OCI Java - JDBC

What is SQL?

SQL is a language used for relational databases to query or get data out of a database. SQL is also referred to as SQL and is short for its original name Structured English Query Language. So SQL is a language used for a database to query data.

What are scalar functions and string functions in SQL?

Scalar functions perform operations on individual values. For example, round up or down every value in the column SELECT ROUND(col) from table There is a class of scalar functions called string functions, that can be used for operations on strings (CHAR and VARCHAR values) Example, retrieve the length of each value in the column, SELECT LENGTH(col) from table Uppercase and lowercase functions can be used to return uppercase or lowercase values of strings. SELECT UCASE (col) from table Scalar functions can be used in the where clause. SELECT * FROM table WHERE LCASE(col) = 'x' This type of statement is useful for matching values in the where clause, if you're not sure whether the values are stored in upper, lower or mixed case in the table. You can also have one function operate on the output of another function. SELECT DISTINCT (UCASE(col)) from table

How do you query mixed-case columns in SQL?

Some database parsers assume upper case column names by default, so a query for Id will be interpreted as "ID". To get around this, use double quotes in the query like so: SELECT "Id" FROM table If making a query in Python, use single quotes for the query: query = 'SELECT "Id" FROM table' If single quotes are in the query statement, use a backslash \ before it.

How do you query column names with special characters or spaces in SQL?

Special characters and spaces are mapped to underscores in the database, so replace spaces and special characters with an underscore, _

How do you query data based on a column containing certain categories?

The IN operator allows us to specify a set of values in a WHERE clause. This operator takes a list of expressions to compare against, and only returns data if the category is in the list. WHERE col IN (x, y, z, ...)

Describe the basic structure of the SELECT statement in SQL. How can you use SELECT to execute IF-THEN-ELSE statements?

The basic structure of the SELECT statement is formed from three clauses: SELECT, FROM and WHERE. <attribute list> is a list of attribute names whose values are to be retrieved by the query <table list> is a list of the relation names required to process the query <condition> is a conditional(Boolean) expression that identifies the tuples to be retrieved by the query In situations where you might want to use multiple IF-THEN-ELSE statements, you can often use a single SELECT statement instead. The SELECT statement allows a CLIST to select actions from a list of possible actions. An action consists of one or more statements or commands. The SELECT statement has the following syntax, ending with the END statement. You can use the SELECT statement with or without the initial test expression. SELECT [test expression] WHEN [expression1] (action) WHEN [expression2] (action) [OTHERWISE] (action) END

Explain the concept of a relational model. What do the terms degree and cardinality refer to?

The building blocks of the relational model are relation and sets. The relational model of data is based on the concept of relation. A relation is a mathematical concept based on the idea of sets. A set is an unordered collection of distinct elements. It is a collection of items of the same type. It would have no order and no duplicates. A relational database is a set of relations. A relation is also the mathematical term for a table. A table is a combination of rows and columns. A relation is made up of two parts, relational schema and relational instance. A relational schema specifies the name of a relation and the attributes. A relational instance is a table made up of the attributes or columns and the tuples or rows. Degree refers to the number of attributes or columns in a relation. Cardinality refers to the number of tuples or rows in a relation.

What is a database service instance?

The database service instance is the target of the connection request from applications. The application we will be using is Python. When a connection has been completed, your Python code sends SQL statements across the connection to the instance of the database. The database instance then resolves the SQL statements into operations against the data and objects in the database. Any data retrieved is returned to the application

What is a LIKE predicate in SQL?

The like predicate is used in a WHERE clause to search for a pattern in a column. The percent sign is used to define missing letters. The percent sign can be placed before the pattern, after the pattern, or both before and after the pattern. The percent sign is called a wildcard character. A wildcard character is used to substitute other characters.

How does INNER JOIN work in SQL?

The most common type of join is the inner join. An inner join matches the results from two tables and displays only the result set that matches the criteria specified in the query. It is of the format: SELECT * FROM table_1 t1 INNER JOIN table_2 t2 ON t1.id = t2.t1_id You can inner join more than two tables too: SELECT * FROM t1 INNER JOIN t2 ON t1.id=t2.t1_id INNER JOIN t3 ON t2.id=t3.t2_id

What are the benefits of a relational model for a database?

The relational model is the most used data model for databases, because this model allows for data independence. Data is stored in simple data structure tables. This provides logical data independence, physical data independence, and physical storage independence.

What are the two main components of a Python Database API? What are they used for? What are some common methods?

The two main concepts in the Python DB-API are connection objects and query objects. You use connection objects to connect to a database and manage your transactions. Cursor objects are used to run queries. You open a cursor object and then run queries. Here are the methods used with connection objects: The cursor() method returns a new cursor object using the connection. The commit() method is used to commit any pending transaction to the database. The rollback() method causes the database to roll back to the start of any pending transaction. The close() method is used to close a database connection. Cursors created from the same connection are not isolated i.e. any changes done to the database by a cursor are immediately visible by the other cursors. Cursors created from different connections can or cannot be isolated depending on how the transaction support is implemented. A database cursor is a control structure that enables traversal over the records in a database. Some common methods of a cursor object are execute() for executing queries, fetchone() for fetching one row from the results, fetchall() for fetching all results, and close() for closing the cursor.

What are ways of representing relationships that exist between entities in an ER model? Use books as an example.

There are three primary relationships. Let's use example of books, where book is an entity and author is an entity, the relationship is authored by. One-to-one (denoted with || in crows foot) means only one author writes one book and only one book is written by one author. One-to-many (denoted || and <) means one book is written by many authors or one author writes many books. Many-to-many (denoted > and <) means many books can be written by many authors and many authors can write many books.

What is Big Data?

There is not one definition for this term, but what you consistently see in definitions is something called the five V's of Big Data: Velocity, Volume, Variety, Veracity, Value. Velocity is the speed of the data, or the speed at which data accumulates and is transformed into insight. Volume is the scale of the data, or the increase in the amount of data stored. Everyday we create about 2.5 quintillion bytes of data, which is 10 million blue ray DVDs everyday. Variety is the diversity of the data. We have structured data and unstructured data, and data comes from many different sources. There is text, sound, video, image, numerical data. Veracity is the quality of data, and has to do with data consistency, completeness, ambiguity. 80% of data is considered to be unstructured, and so we have to find ways to produce reliable insights from that unstructured data. Value refers to why people are investing time and money in big data, because there is a lot of value in it. So we need value from big data, to do something with it.

How do you query data based on a column's values being greater than or equal to x and less than or equal to y?

Use the BETWEEN predicate. WHERE col BETWEEN x and y

What is the CHECK clause in SQL?

Use the CHECK clause to designate conditions that must be met before data can be assigned to a column during an INSERT or UPDATE statement. During an insert or update, if the check constraint of a row evaluates to false, the database server returns an error. The database server does not return an error if a row evaluates to NULL for a check constraint. In some cases, you might want to use both a check constraint and a NOT NULL constraint.

How do you sort a result set from a SQL query? How would you obtain the top 10 items? How do you handle null values?

Use the ORDER BY clause. Use ASC for ascending or DESC for descending. The default for ORDER BY is ascending. SELECT col FROM table ORDER BY col ASC If you want a certain number of rows, use the LIMIT keyword or the FETCH FIRST n ROWS ONLY keyword. If there are null values, they will automatically be placed first. To get around this, specify nulls last (this functionality may not exist depending on the database system). SELECT col FROM table ORDER BY col DESC nulls last LIMIT 10 or SELECT col FROM table ORDER BY col DESC nulls last FETCH FIRST 10 ROWS ONLY

How do you add a condition to a GROUP BY clause in SQL?

Use the keyword HAVING. The WHERE clause is for the entire result set, but the HAVING clause works only with the GROUP BY clause. SELECT col, COUNT(col) as count FROM table GROUP BY col HAVING COUNT(col) > 4

What is an implicit join in SQL?

We can access multiple tables by specifying them in the FROM clause of the query. Consider the example, SELECT * FROM table_1, table_2 Here we specify two tables in the FROM clause. This results in a table join, but note we are not explicitly using the join operator. The resulting join in this example is called a full join or Cartesian join, because every row in the first table is joined with every row in the second table. If you examine the results set, you will see more rows than in both tables individually.

How do you return the frequency of each category in a column using SQL?

You add the "group by" clause to the select statement. The "group by" clause groups a result into subsets that has matching values for one or more columns. In addition to the column, you add COUNT(col) to count the number in each subset. SELECT col, COUNT(col) as count FROM table GROUP BY col

How would you query information in one table based on a condition that involves information in another table?

You could use a sub-query in the WHERE clause of the main query. For example, SELECT * FROM table_1 WHERE t_id IN (SELECT id FROM table_2 WHERE col = x); You could also use an implicit join and match id's. For example, SELECT * FROM (SELECT * FROM table_1 t1, table_2 t2 WHERE t1.t_id = t2.id) WHERE col = x;

How can you use aggregate functions in a WHERE clause?

You use a sub-query. Sub-queries or sub selects are like regular queries but placed within parentheses and nested inside another query. One of the limitations of built in aggregate functions, like the average function, is that they cannot always be evaluated in the WHERE clause. So to evaluate a function like average in the WHERE clause, we can make use of a sub-select expression. SELECT col FROM table WHERE col > (SELECT AVG(col) from table);


Kaugnay na mga set ng pag-aaral

chapter one: introduction to psychology.

View Set

Barrons AP human geography unit one vocabulary

View Set

Sociology Spring 2019: Chapter 21

View Set

positieve eigenschappen (adjectieven) E - F

View Set

Unit 2: Legislative Branch AP GOV test

View Set