Basics/Chapter 1: Design and implement a data warehouse

Réussis tes devoirs et examens dès maintenant avec Quizwiz!

How is uniqueness implemented in a fact table when a surrogate key is used as a Primary Key?

The fact table will have a unique constraint on one or more of the foreign key relations.

What are the characteristics of a fact table foreign key?

The foreign key is used to reference the dimensions of a fact by linkined fact columns to dimension columns. These usually have a 1:M relationship in that a member of the dimension table can be referenced multiple times in the fact table such as a customer in a customer table being referenced multiples times in a sales fact table.

What is IOPS?

The input/output read and write operations that can be completed in a second.

What is the grain of a fact table?

The level of specification or detail that the fact table presents the business process.

What is a drawback of using nonclustered indexes?

The require update by the server when the underlying table data is modified.

How are partition schema's re-used and what are the rules?

You simply declare a new table on the same schema. The tables must have the same data type as the partition column used in the partition function.

How do you implement a sliding partition views startegy?

You simply remove/add the table(s) from the UNION ALL in the view definition.

What is throughput?

How much data volume can be written or read in a given time frame, it doesn't mean how fast that data is read or written.

What is a partition function?

A function that defines the boundaries for a table partition.

How is a many to many relationship between two tables implemented in a data warehouse?

A intermediary table is used to act as a matching/intermediary between the two tables.

What is a sliding window partition?

A partition strategy that maintains the same number of partitions in the active table by removing old partitions to an archive table and adding new partitions for new future date.

What is a distributed partitioned view?

A partitioned view that brings together tables from different databases on different servers.

What are the characteristics of a fact table primary key?

A primary key is used to uniquely identify the fact table within the data warehouse. Often, the primary key is a composite key of the different foreign key columns. However, when a fact table can contain duplicates, a surrogate key can be used.

What is meant by a highly select query?

A query whose predicate evaluate to true for a very narrow set of rows.

What is the syntax for adding a new partition?

ALTER PARTITION FUNCTION {Partition function name}() SPLIT RANGE ({New boundary value that should create a new range})

What is the syntax for merging partitions?

ALTER PARTITION FUNCTION {Partition function name}() MERGE RANGE ({boundary to merge partitions up to})

What is the syntax for switching out to archive data using the ALTER TABLE SWITCH statement?

ALTER TABLE {source table} PARTITION {#} SWITCH TO {target table}

What's the difference between OLAP processing and relational processing?

OLAP processing is reporting that is from data in a data warehouse that has been processing into OLAP cubes whereas relational processing is querying data from the data warehouse directly using SQL queries.

What is the replicate approach to scaling out data? What are its drawbacks and advantages? What kind of system is this best for?

Data is replicated across servers in a peer-to-peer transaction or a merge transaction. A drawback is storage because data is replicated and stored in all servers. Also, copying data across servers implies performance costs. The advantage is having data on all servers. This is good for moderately updated databases that do both read and write operations.

What are hard disk drives?

Data storage devices that store data magnetically on a disk that is then mechanically read when data is accessed.

What table do you place a numeric attribute?

Generally, they are places in fact tables if they are additive. However numeric values that are not additive in nature (the size of a product sold) will be stored as a dimension attribute.

What level of granularity should you have in your time dimension table?

The lowest level of granularity that any fact table in the DW needs.

What is required to create a partitioned table for a table with a clustered index, such as a primary key?

The partition column must also be included in the unique clustered constraint.

What must be true for a partition column if the table has a unique index?

The partition column must be explicitly added to the clustered key column list.

What is a partition schema?

The placement of created partitions in different filegroups.

What's the point of creating a secondary file when making a partitioned table?

The primary table will store all the data but the secondary table will be used for querying purposes for better performance.

What is data tiering?

The process of assigning different categories of data to different physical storage resources.

What is ETL?

The process of extracting data from sources, transforming the data to the needed format and removing aspects of the data not needed, and loading the data to a OLAP system.

What is partition switching?

The process of moving a partition from one partitioned table to another.

What is partition elimination?

The process of skipping entire partitions in a partitioned table during a query to speed up performance.

What is scaling-out data?

The process of spreading data operations over as many servers as needed to improve querying and data loading performance.

What is switching out to archive data?

The process of switching out data from a partition in partitioned to a non-partitioned table

What are the benefits and drawbacks of having a phsyically segregated pool of disks exclusively for each server?

The server benefits from having an exclusive pool of disks for it's workload without interference from other servers. On the other hand, the I/O potential is decreased because operations and storage are spread across a smaller number of disks.

What are the inputs and outputs of a OLAP system?

An OLAP system will usually take in inputs that are dimensions and return the intersection of those dimensions in a measurement, or fact, output.

What is an alternative to physically separating logical disks units? What are the advantages of this approach?

An alternative is to not store the disks physically and instead assign servers logical disk units. This allows servers to have seperated workloads but also use other server disks when performance requirements temporarily change. Additionally, if more long-term increased performance is needed, disks can be logically reconfigured or expanded physically.

What are the characteristics of clustered rowstore indexes?

Clustered indexes are structured tables, as stored in the system, whose values are ordered for look up based on a clustering column. When utilized, the leaf nodes is the actual data that is trying to be retrieved.

What type of indexes are better for non-selective queries and why?

Columnstore indexes. They allow you to look up large sets of columns, while still filtering based on grouping rows, in-memory, which offers faster performance.

What are the components of a table partition?

- Partition column - Partition function - Partition schema

In what situations is using columnstore better than rowstore ?

- A column store performs better when you are interested in accessing some columns of a table and not all, generally in OLAP systems. -

Why are alternate keys not also used a surrogate keys in dimension tables?

- Alternate keys can sometimes by composite keys that will be inefficient when it comes time to reading the dimension tables. - Alternate keys can results in collisions when using data from other sources in the data warehouse. - To track changes in a dimension member's attribute over time, you can't use the alternate key to look it up because you would need to look up the dimension member, with the same LOB application primary key, in different instances in the data warehouse.

In what situations is using a rowstore index storage better than columnstore index storage?

- Rowstore offers better insert capabilities so if the goal is more inserts, as in OLTP systems, you will likely want to go with a rowstore structure. - When queries involve entire tables, and not using subsets of the table columns, then a rowstore will make more sense and be more efficient. - When the table will be involve in highly selective queries, then you will want to use a rowstore index over a columnstore index.

What is the limit for the number of partitions that can be made for a table or index?

15,000

What is RAID 10 and what are the advantages and disadvantages compared to RAID 5?

A RAID configuration that has good read performance, a fault tolerance of 2 disks, and better write performance than RAID 5, but at the cost of only having half the total raw capacity available for storage.

What is RAID 5 and what are the advantages and disadvantages compared to RAID 10?

A RAID configuration that has more usable storage, of 2/3 the total capacity, fault tolerance of one disk, and better read performance at a cost of lower write operations.

Starting with SQL Server 2019 15.x, what happens to Delta Rowgroups that have been OPEN for a long time (set by some threshold policy)?

A background task will automatically compress those delta rowgroups into Columnstore and remove the orginial delta rowgroup.

What's the difference between a clustered columnstore and a nonclustured columnstore index?

A clustered columnstore index is the actual physical storage of a table. A nonclustered columnstore index is an index made on an underlying rowstore table where selected columns are stored from it, including an option to filter out some rows.

What is a Filegroup?

A collection of files.

What is a rowgroup? How many rows can be stored in a single row group, how many are commonly stored together, and why?

A set of rows that compressed into columnstore format together. There is a maximum of 1,048,576 rows in a given rowstore. You usually store the max rows per rowgroup that allows you to benefit from faster compression from larger rowgroups but allows the data to benefit from in-memory operations.

What is a data mart?

A specialized subsection of a data warehouse that offers access to relevant data to conduct their analysis. The general idea is that not all people need access to all data.

How does a SAN storage system work?

A storage area network is a shared pool of storage blocks that servers can access via SAN switches. One or more SAN switches control the flow of data from servers to the SAN and allow for switching between switches in case of failure of one, thus ensuring data can always move between the servers and the SAN.

What are metadata repositories?

A storage of metadata on data that has been moved, transformed, and/or loaded within the system. The metadata can include things like source and target systems, data types, data mappings, filters, business rules, transformations, and aggregations applied to the data in the different data flows.

What are slowly changing dimensions?

Attributes in a LOB application that are expected to change over time for a given entity instance. Because the LOB application will store up to date information of the entity instance, it's up to the data warehouse to store the historical values in a dimensions that allow for changes over time to the relationship between the entity instance but can still be analyzed in a manner that accurately reflects the entities historic value when appropriate.

What is the syntax for creating a partitioned view?

CREATE VIEW {View name} AS SELET {Column List} FROM {Table 1} UNION ALL SELET {Column List} FROM {Table 2} . . . SELET {Column List} FROM {Table n}

What is meant by cardinality in terms of relationships between dimension and fact tables in a data warehouse?

Cardinality refers to number of references in a given table for another table to the number of references of the given table in the other table.

Why are column store indexes better for fact tables than row store indexes?

Fact tables have lots of data that is usually interacted with as a whole and involves less seeks and more aggregation, manipulation, without filtering out data. Columnstore indexes is best for this.

What is scaling-up data?

Increasing the resources of servers to process data operations.

What is a stretch approach to scaling out data?

It's the process of creating tables that dynamically store data to the Azure database using a filter or rule. This way, data can be separated on where it is stored but be able to be queried across all storage locations.

What are auditing interactions?

Stored information on the changes of data that has been moved, transformed, and/or loaded within the system. Information can include data flow batch start and end timestamps, errors and warnings, variable and parameter values, row counts, and other execution and performance metrics.

What are the common data types used in dimension table columns?

String Boolean Date

What are the types of columns in a dimension table?

Surrogate Key Alternate Key Attributes Metadata

How should a mixed index solution be implemented in a data warehouse?

The columns of dimension and fact tables that are highly selective should use a rowstore index, while others use a columnstore index.

Why are partitioned views considered more flexible than partitioned tables?

To change a partitioned view, you add or remove a table from the view definition. To change a partitioned table, you have to redefine the partition function, and then re-make a table on the new partition schema.

What are fully additive values?

Values that can be aggregated across all dimensions.

How do you create a partitioned table using a partition schema? (All steps)

- Create the needed filegroups. All have to be in the same database. - Create a partition function - Create a partition schema using the filegroups and allocate partitions among them. - Create a table with a primary key constraint on the partition column on the partition schema

How do you change a rowstore index to a columnstore index?

- Drop a primary key constraint if it exists - Drop the rowstore index - Add the columnstore index

What are the factors to consider for implementing a scale-out of data?

- How often data is added/updated. If the data is constantly updated and/or added, it will have a higher cost associated with replicating the data across different servers. - Will users and applications be able to access the data or will changes need to be across the servers and data definitions? - The storing of data in different servers should be in line with a querying pattern that creates performance benefits. - The interdependence of the data. Data that uses constraint that depend on other tables cannot be split out among different servers and databases.

What are the four ways to use the ALTER TABLE SWITCH statement?

- Move data from one non-partitioned table to another - Move data from one non-partitioned table to a partition in a partitioned table - Move data from a partition in a partitioned table to a non-partitioned table - Move data from one partition in a partitioned table to another partition in a partitioned table.

What are the steps to moving partitions in a sliding window partition strategy?

- Move the old partitioned data out of the partition with an ALTER TABLE SWITCH statement - Remove the empty partition by merging the empty partition with the automatically created one using the MERGE RANGE statement. - Add in a new partition by editing the partition function using the SPLIT RANGE statement.

What are the four types of columns in a fact table?

- Primary key - Foreign key - Measures - Metadata

What are the requirements to do a partition switch?

- The source and target tables (or partitions) must have identical columns, indexes and use the same partition column - The source and target tables (or partitions) must exist on the same filegroup - The target table (or partition) must be empty

What data types cannot be in a column that you want to store in columnstore format?

- ntext, text, and image - nvarchar(max), varchar(max), and varbinary(max) (Applies to SQL Server 2016 (13.x) and prior versions, and nonclustered columnstore indexes) - rowversion (and timestamp) - sql_variant - CLR types (hierarchyid and spatial types) - xml

What is required in every table of a partitioned view and why?

A CHECK constraint on the partition column is required so that the Server can use it during queries to look at tables that match the predicates.

Why is a clustered lookup slower than a nonclustered one?

A clustered index is not really index, rather an ordering of the underlying table. As a result, an index scan is done that results in scan the entire table. With a nonclustered index scan however, the index is first scanned and then the system goes directly to the specific location in the underlying table.

What is a column segment?

A column of data from within a rowgroup.

How does a columnstore reduce I/O compared to a rowstore database?

A columnstore stores data in individual columns. Thus when data is queried, it only needs to read the selected columns from the disk. On the other hand, data in rowstore is stored based on rows. When data is read from a rowstore based database, all the columns that pertain to a given row are selected from the disk even the ones that aren't needed in the final result.

What is an enterprise business matrix?

A common way of understanding the requirements of fact and dimension tables in a business. It works by listing out business processes on one axis and possible dimension on a second. Then, a binary yes or no is used to denote that there is a relationship between the process and the dimension.

What is a deltastore?

All the delta rowgroups for the difference columnstore indexes, collectively referenced.

What is OLAP?

A database system that is created for the intent of analyzing data from different sources, across a variety of businesses themes, for the purpose of supporting decision making.

What is an OLTP?

A database system that is meant for processing the operational data of a business and storing it in databases that are meant for storing the data and small, simple queries.

What is a type 2 SCD?

A dimension whose attribute is updated by adding a new record with the new information. The old and new records are indicated with valid time columns and/or a current flag.

What is a type 1 SCD?

A dimension whose attribute value is updated by overwriting the original value in the dimension table. This information is commonly not used in historical reporting.

What is a LOB application?

A line of business application is a computer application that is used to directly conduct the main operations of a business.

What is a LUN?

A logical unit number An identification for the pool of disks that comprise a single logical disk.

How is a many to many binary relationship implemented?

A middle table must be created that links, or assigns, the many entity instances in one table to the many entity instances in another table.

What is the mapping rule between partitions and filegroups?

A partition can only have one filegroup. A filegroup can have more than one partitions.

What is a local partitioned view?

A partitioned view that brings together tables from different databases on the same server.

What is a many to many binary relationship?

A relationship between more than one entity instance in one entity and more than one entity instance in another entity.

What is a one to many binary relationship?

A relationship between only one entity instance in one entity and more than one entity instances in another entity.

What is a one to one binary relationship?

A relationship between only one entity instance in one entity and only one entity instance in another entity.

What is a unary relationship?

A relationship of one entity with itself.

What is a binary relationship?

A relationship that exists between two entities.

What is a ternary relationship?

A relationship that exists with three entities. The relationship between the entities describes an entity/event.

What is a M:M/many-to-many relationship?

A relationship where multiple members of one table can have multiple occurrences [values] in another table.

What is RAID 0? What are its limitations? What should you keep in mind about storage space?

A straightforward disk configuration where data is striped across two or more disks. Unlike other RAID configurations, RAID 0 does not have fault tolerance or redundancy so a failure in one disk will result in loss of data across all disks. The amount of storage that each disk can add to the array is the size of the smallest disk in the aray.

What type of index is created when you make a unique constraint vs a primary key constraint?

A unique constraint create a nonclustered index where as a primary key constraint creates a clustered index.

What is a row locator?

A value that identifies the location of a row in a heap or clustered table.

What is the syntax for moving data from one partition in a partitioned table to another partition in a partitioned table using the ALTER TABLE SWITCH statement?

ALTER TABLE {source table} PARTITION {#} SWITCH TO {target table} PARTITION {#}

What's the ALTER TABLE SWITCH syntax to move data from one non-partitioned table to another non-partitioned table?

ALTER TABLE {source table} SWITCH TO {target table}

What is the syntax for using ALTER TABLE SWITCH to switch in to load data?

ALTER TABLE {source table} SWITCH TO {target table} PARTITION {#}

Where in the execution plan does it indicate how many partitions were actually read?

Actual Partition Count

What must be true of nonclustered index to create a partitioning table?

All nonpartitioned indexes need to include the partition column as part of the key columns or included columns.

What is a partition aligned index?

An index who partitioning is the same as the underlying table's. This can happen one of two ways: - An index is created on a partitioned table. By default, the index will have the same partitioning as the table. - An index is created using a different named partition function but the data type and boundary values are the same, making it, for all intensive purposes, aligned with the table it indexes.

What's the difference between an operational database and a dimensional database?

An operational database is used to store normalized current, transactional data, that is updated by users. A dimensional database is used to store unstructured data that is both current and historical that is updated systematically and not by users (ETL systems/packages).

How do partition aligned indexes allow for partition switching?

Because the index is aligned with the partitioned tables partitioning, when you switch partitioning, the index is still functional.

What is the syntax for creating a clustered columnstore index to an existing table?

CREATE CLUSTERED COLUMNSTORE INDEX {index_name} ON {table_name}

What is the syntax for creating a clustered rowstore index to an existing table?

CREATE CLUSTERED INDEX {index_name} ON {table_name} ({column1_name},{...})

What is the syntax for creating a nonclustered columnstore index to an existing table?

CREATE INDEX {index_name} NONCLUSTERED COLUMNSTORE ({column1_name},{...})

What is the syntax for creating a filtered nonclustered columnstore index in an already created table?

CREATE INDEX {index_name} NONCLUSTERED COLUMNSTORE ({column1_name},{...}) WHERE ({<column1_name}{<predicate} AND {,,,})

What is the syntax for creating a nonclustered rowstore index to an existing table?

CREATE NONCLUSTERED INDEX {index_name} ON {table_name} ({column1_name},{,,,})

What is the syntax for creating a covering nonclustered rowstore index in an already created table?

CREATE NONCLUSTERED INDEX {index_name} ON {table_name} ({column1_name},{,,,}) INCLUDE ({column1_name},{,,,})

What is the syntax for creating a filtered nonclustered rowstore index in an already created table?

CREATE NONCLUSTERED INDEX {index_name} ON {table_name} ({column1_name},{,,,}) WHERE ({<column1_name}{<predicate} AND {,,,})

What is the syntax for creating a partition function?

CREATE PARTITION FUNCTION {partition function name} ({data type}) AS RANGE {RIGHT/LEFT} FOR VALUES({List of bounary values});

What is the syntax for creating a filegroup?

CREATE PARTITION SCHEME {Partition schema name} PARTITION {Partition function being used on} [ALL] TO({List of filegroups to which partitions are added})

What's the syntax for creating a table on a partition schema?

CREATE TABLE ( {partition column definition} {Primary key constraint on partitioned column} ) ON {Partition schema name}({partition column})

Why are row store indexes better for dimension tables than column store indexes?

Dimension tables are used as a way to give context to fact table columns. This usually involves some sort of seeking through the table to find matching values. Rowstore indexes are usually best for this.

What are shared dimensions tables?

Dimension tables whose members are used for more than one fact table.

What are the two types of physical storage subsystem usually used by companies for their servers?

Direct Attached Storage Storage Area Network

What is DAS? What's the difference between a shared DAS and a dedicated DAS?

Direct attached storage. Refers to one or more disks, within a server enclosure, that acts as storage for one or more servers directly connected to it vis some wiring. When a DAS is attached to and only to one server, it's considered a dedicated DAS. When a DAS has more than one server utilizing it, it's considered a shared DAS.

What is the syntax for creating a clustered rowstore index on a table that doesn't have a primary key in a CREATE TABLE statement?

INDEX (<index name>) CLUSTERED (<column name>)

What is the syntax for creating a clustered columnstore index on a table that doesn't have a primary key in a CREATE TABLE statement?

INDEX (<index name>) CLUSTERED COLUMNSTORE

What is the syntax for creating a nonclustered rowstore index on a table that doesn't have a primary key in a CREATE TABLE statement?

INDEX {index_name} NONCLUSTERED ({column1_name},{...})

What is the syntax for creating a nonclustered columnstore index on a table that doesn't have a primary key in a CREATE TABLE statement?

INDEX {index_name} NONCLUSTERED COLUMNSTORE {{column1_name},{...})

What's the main difference between a dimensional star model and snowflake model?

In a star model, the only relationships are from fact tables to dimensional tables. In a snowflake model, there can be relationships between facts and dimension tables as well as between dimension tables.

What is a fact table?

In a star schema database, it is a table that contains the actual business event. It is made up of dimensions that describe the event as well the information about the business event.

Why is partition switching considered fast?

It changes the metadata of the table's location but doesn't actually move it physically in storage so it's instantaneous.

What does a filtered index do and when should it be used?

It creates an index on a column for specific values based on a predicate. When you query on that column, and your predicates in the query are false for all values that the filtered index commits, the optimizer utilizes the filtered column for a more efficient index seek.

What are the drawbacks of a structured storage database compared to a relational one?

It does not meet the higher conformity standards that a relational databases must have and thus can't guarantee ACID properties that a normalized relational database might.

What is data lineage?

It involves maintaining metadata repositories and auditing interactions as data flows from its origin to its destination.

What is a structured storage database as it applies to distributed databases?

It is a NoSQL database that uses stores data in a variety of formats that allows them to interact with one another.

What are the characteristics of a surrogate key columns in a dimension table?

It is a column used to uniquely identify the dimension table's members within the data warehouse. It is used to link the dimension table to fact tables and other dimension tables as needed. It is commonly defined a column with an IDENTITY property or some other auto generating default value such as a sequence, that increments with every new row. Think of them as the primary key for the dimension tables, within the context of the data warehouse. They are usually defined with a Primary Key constraint.

What is a covered index?

It is an index that can satisfy a query just by its index keys without having needed to touch the data pages. It means that when a query is fired, SQL Server doesn't need to go to the table to retrieve the rows (Key lookup), but can produce the results directly from the index as the index covers all the columns used in query.

What is an I/O bottleneck?

It is the latency associated with the length of time it takes for memory to be written or fetched from disk into memory.

What is MapReduce?

It is the process by which an input problem is broken up by a "master node" into sub-problems that are then allocated to be solved by one or more "worker nodes". The results of the worker nodes solutions are returned to the master node that pieces them together as needed to answer the input problem and the final result is returned.

What is NoSQL?

It stands for "Not only SQL". It's a movement towards organizing and managing big data sets that don't use a relational data model for storage. It does however still involve using a query like system to retrieve information that is needed.

How does a rowstore index store data?

It stores all the rows of a table, or copies them in the case of nonclustered indexes, in both the physical and logical order of the index structure it uses. Each leaf node is a

What is a partitioned view?

It's a collection of smaller tables joined together that can be queried as one by the server.

What is a partition column?

It's a column that holds values that are used as boundary determining values. They are used by the server to identify which groups belong in what partitions as well as being used for partition elimination for queries.

What is horizontal table partitioning and what is it used for?

It's a database architecture that splits up a table to groups of rows. It improves query performance for tables that are large and are resulting in query timeouts and slow performance.

What is a best effort query performance design?

It's a design of indexes based on the initial understanding of tables, business logic, and relationships. It does not include user access information yet.

What is a conformed dimension and how is it implemnted?

It's a dimension that has the seem meaning in all the fact table it relates to. It can implemented as a single dimensional table with many relations or identical copies of a table.

What is a middleware scale out data solution?

It's a solution that stores data in different servers based on some categorization. When data operations are initialized, they pass through a middleware software that routes the requests to the server(s) with the data needed.

What is data mining?

It's a sophisticated method of processing large sets of data and applying analysis techniques to uncover insights. It involves the employment of techniques that are more complex than those deployed on dimensional tables.

What is a delat rowgroup?

It's a sort of holding place for data that is columnstored. It holds data in b-index row format until a threshold number of rows is met to make a rowgroup. From there, the status of the rowgroup goes from open to close and the tuple-move stores it as a compressed rowgroup in columnstore. The original delta rowgroup is set to TOMBSTONE state and removed when it is no longer being referenced.

What is a distributed partition view scale out approach? What kind of system is this best for?

It's an approach where data is stored in tables in different databases and defined in a single partitioned view. To make use of the partitioning, the query has to use the view and the where clause as it aligns with the different partitions. This is good for systems that archive data and only use subsets of data for read and write oeprations.

What is best practice for data volumes and different categories of data?

It's good practice to separate different types of data into separate data volumes according to: - User database files (involve random read and write operations) - Transaction log files (involve sequential operations) - Database backup files (can require a large amount of performance suddenly) - TempDb (A database that is needed for the system to operate)

What is middleware in terms of data architecture?

It's software that routes data services requests from the client to the appropriate server.

What is multipathing?

It's the criss-cross architecture from multiple servers being connected to multiple SANs, offering more than one path to the SAN storage unit.

What is the process of "switching in to load data"?

Moving data from a non-partitioned table to a partition in a partitioned table.

What are the characteristics of unclustered rowstore indexes?

Nonclustered indexes are copies of row data based whose leaf nodes identify the location of a particular value within a heap or clustered table.

What are operational databases optimized for vs dimensional databases?

Operational databases are optimized for data storage speed which result in slower read speeds. Dimensional databases are optimized for read speed which result in slower storage speeds.

What are user heirachies?

Predefined dimensions that are arranged in a hierarchy according to a neutral or logical order. This lets users drill down on dimensions based on the hierarchy. Examples are date dimensions (year, quarter, month, etc.) and geography dimensions (country, state, county, city, etc.)

What two options are there for creating a partition function?

Range Left: Partitions data by using the listed values as the right bound of the boundary. This creates a catch-all partition of the unbounded values to the right of the last partition. Range Right: Partitions data by using the listed values as the left boundary values. This creates a catch all for values to the left of the first partition.

What is RAID?

Redundant Array of Independent Disks When more than one one disk are combined to act as one logical disk. This is done to spread out data storage and operations across multiple disks and increase fault tolerance.

What is the general structure of an apache cassandra database?

Related entities are stored in "keyspaces" that hold "column families" made up of "columns" whose values can be identified from a unique "row key". Columns are made up of a value name, value, and timestamp. Rows in a keyspace can have different numbers of column families and rows with columns can store different types of data.

What is a root node, intermediate node, and leaf node in a rowstore b-tree index?

Root node: The initial logical step that divides the index into two paths. Intermediate node: The subsequent logical steps that continue to split an index. Lead node: The final step of a b-tree index that identifies the location of data in a table, for nonclustered indexes, or is the data, for clustered indexes.

What type of indexes are better for highly selective queries and why?

Rowset functions because you can use rowstore indexes to select the rows you need, using an index seek.

What are the two main types of indexes used for data warehousing and what are their characteristics?

Rowstore indexes and columnstore indexes. Rowstore indexes store data in rows based on the row data of a table. Columnstore indexes store data in columns based on column values of a table.

What is a data retention policy?

Rules an organization enacts around when data is no longer useful and can be deleted.

What's the difference between scaling out and scaling up?

Scaling up involves increasing the capacity of a given server or servers to process data. This generally doesn't involve changing the data architecture. Scaling out involves adding new servers and/or changing the architecture of the data to optimize for load and query performance.

What is SSD and what are it's benefits over Hard Disk drives?

Solid state Drive: Data storage device that uses integrated circuits to store data. Unlike Hard Drives, data can be accessed from any location in the circuit, increasing read and write times.

What is the syntax for truncating the data in a partition?

TRUNCATE TABLE {Table Name} WITH (PARTITION ({Partition #}) )

Why can't you scale out a partitioned table?

Table partitions for a partition schema have be stored in the same database.

How is data stored when utilizing a columnstore structure?

The database creates tables for each individual column. This allows queries that retrieve only the desired columns, apply modifications, and ultimately return the desired accompanying rows.

How is data usually loaded to a shared database scale-out solution?

The database is usually loaded during maintenance windows or when no reports will be needed. During this time, the database is take off read-only, loaded, and then returned to read-only. If data needs to read during this load period, the database is copied for read purposes and another one is loaded. When done, it is switched in.

When moving data from a non-partitioned table to a partition in a partitioned table, what must be true about the definition of the source table and why?

The source table must have a CHECK constraint on it that matches the target tables so that the Server can know it it can add the data into the specified partition.

What's a linked server data scale out solution?

The storing of different categories of data in different databases but linking them so that they can all be accessed as if they were local database from any of the linked servers.

What is latency?

The time it takes for read and write operations to complete.

What are the characteristics of a alternate key columns in a dimension table?

They are columns that are used to uniquely identify a dimensions members in the LOB application. They are often the primary key that is used in the source tables from the LOB applications.

What are the characteristics of metadata columns in a dimension table?

They are columns used to track how rows in a dimension table have been changed over time and how changes in an ETL process have occured. Common ones are batches that show the ID of a batch that added the row to a table, timestamps for when rows were added, or valid time dates for slowly changing dimension data.

What is a dimension table?

They are groupings of similar attributes that give you information on "what", "when", "where", and "how" of a business process/fact table instance.

What is a limiting quality of nonclustered column store indexes?

They are read-only. The table cannot be updated as long as the index remains. To change the data in the table, you have to remove the index first.

What are the characteristics of a fact table measure columns?

They are the columns in the fact table that represent the additive and semi-additive attributes of a business process.

What are the characteristics of attribute columns in a dimension table?

They are the values in the dimension table that give business event information. They have data about the different types of context that would be used in a fact table such as transaction dates, size of products, weight of products.

How are date dimensions usually created?

They are usually created with a day for every year spanning decades in the past and future as their surrogate key. From there, each date is given a year, quarter, and month attribute.

How do rowstore and columnstore tables differ logically?

They don't. Logically, they are both tables with rows and columns. They differ physically in how they are stored arising in differences in performance.

How do indexes reduce I/O?

They minimize the amount of data that needs to pulled from disk into memory. Without indexes, entire tables are pulled into memory to be scanned. With indexes, you specify what parts of the table to pull out.

What is one mistake that people make with snowflake models?

They use it to normalize data in a way that is against the optimal read goal for a data warehouse.

What are semi-additive values?

Values that can be aggregated across some dimensions but not all. Usually, they can't be aggregated across a time dimension such as account balance. You can aggregate it across accounts but it wouldn't make sense to do so across time.

What are non-additive values?

Values that can not be summed across any dimensions.

What's the difference between vertical and horizontal table partitioning?

Vertical partitioning involves splitting up a table into different tables based on the columns who aren't accessed. Horizontal querying is the splitting up of tables into row groups while keeping all the columns of the row are kept.

What is 1:1/one-to-one relationship?

When a member in a one table can only have a reference to one occurrence [value] in another table.

What is a 1:M/one-to-many relationship?

When a member in one table can reference more than one member occurrence [value] in another table.

What is a 0:1/zero-to-one relationship?

When a member in one table does not have a relationship with occurrence in another table. However if a relationship existed, it would be in one direction.

Without a covering index, why/how does a nonclustered index use a key lookup?

When a query needs to return rows whose filtering columns are some which have a nonclustered index and some that don't, it will use the nonclustered index to identify the rows of interest. From there however, it will use a key lookup for each row that was identified to get the other column values for each row.

What is stripping, in terms of data storage?

When data is separated out and stored across multiple disks.

What is a shared database scale-out solution? What are its drawbacks and advantages? What kind of system is this best for?

When data is stored in a single SAN and processing power is spread out among the different servers through a read only copy of the data. This is good for read only solutions but does not allow for updates while the database is in read-only mode so it does not work for databases that need to make updates constantly or draw reports on new data. It's good for uses that only do reporting and data reads.

How do you implement partition elimination in your query?

When you write the query, ensure that the filtering perfectly aligns with the partitions as defined in the partition function.

When do key lookups become a less efficient for queries and why?

With queries that have really large result sets, the optimizer might decide that it's better to simply do a table scan than do an index seek then a key look up for each identified row.

Why are dimensional tables considered de-normalized?

Within fact tables, different business entities exist within the same table. That is, unlike normalization that splits up business themes into separate tables, fact tables combine different business themes, via dimensions, to give context to their attribute values.

Can a primary key constraint be used on a table that has a clustered columnstore index? What benefit is there?

Yes, as long as the primary constraint is unclustered. You enforce unique-ness while getting the performance of a columnstore structure.

What are the general steps for determining dimensions for a business process?

You begin with a specific business process/event. First, propose a set of attributes that DESCRIBE the business process. Next, group similar attributes, those that describe different parts of the same attribute of a business, into groups. Those are your dimensions.

What can be assumed about the automatically created partitions when running a sliding window partition strategy?

You can assume that they are empty if previous iterations of sliding window have ran.


Ensembles d'études connexes