Data Warehouse and ETL
DW Level2 - What is DataMart?
A data mart is a focused subset of the data warehouse that contains a snapshot of operational data and helps business users make decisions based on past experience and trends. A data mart emphasizes easy access to information from fact tables for a particular subject area or department.
DW Level2 - What is a factless fact table?
A factless fact table contains no numeric fact columns; it records only the keys of the related dimensions, typically to capture events or coverage (for example, student attendance).
Level 2 - Which transformations support sorted input?
The Aggregator, Joiner and Lookup transformations support sorted input. Supplying pre-sorted data to these transformations improves session performance.
What is an ER (Entity Relationship) Diagram?
An Entity-Relationship diagram illustrates the relationship among entities in a database. This diagram shows the structure of different tables and links between tables.
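The table structures and links an ER diagram depicts map directly to tables and foreign keys in a database. A minimal sketch using SQLite (the `Customer`/`Order` entities and their one-to-many relationship are hypothetical, chosen only for illustration):

```python
import sqlite3

# Hypothetical Customer and Order entities; the foreign key below is the
# "relationship" line an ER diagram would draw between them (1:N).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE "Order" (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES Customer(customer_id),
        amount      REAL
    );
""")
# One customer placing two orders exercises the 1:N relationship.
conn.execute("INSERT INTO Customer VALUES (1, 'Acme')")
conn.executemany('INSERT INTO "Order" VALUES (?, ?, ?)',
                 [(10, 1, 99.5), (11, 1, 15.0)])
rows = conn.execute("""
    SELECT c.name, COUNT(*) FROM "Order" o
    JOIN Customer c ON c.customer_id = o.customer_id
    GROUP BY c.name
""").fetchall()
print(rows)  # [('Acme', 2)]
```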
DW Level5,6 - What is the main difference between Inmon and Kimball philosophies in data warehousing?
Both philosophies differ in how the data warehouse is built. Kimball is bottom-up: data marts are created first and then integrated into the data warehouse. Inmon is top-down: a normalized enterprise data warehouse is created first, and data marts are then derived from it.
Explain checkpoint and breakpoint in SSIS
Checkpoints let us restart a package from the point of failure. Breakpoints pause a package to debug the data. A breakpoint is used to analyze the values of variables before and after execution.
What is Data Mining, explain your experience in data mining.
Data mining is a process for analyzing data from different perspectives or dimensions and summarizing it into meaningful content. The resulting data can then be retrieved or queried from the database in a usable format.
SSIS level4 - What is the data profiling task?
Data profiling is the process of analyzing the source data to better understand what condition the data is in, in terms of cleanliness, patterns, numbers of nulls, and so on. The data profiling task is usually used at the beginning of the development cycle to support the design of the destination database schema. Note that this task is not typically used in the normal recurring ETL packages.
DW Level3,6 - What is the difference between ER Modeling and Dimensional Modeling?
ER modeling produces both logical and physical models, whereas dimensional modeling typically produces only a physical model. ER modeling is used to normalize the OLTP database design, whereas dimensional modeling is used to de-normalize the design for ROLAP and MOLAP.
DW Level4 - What is a Level of Granularity of a Fact Table?
The level of granularity is the level of detail stored in the fact table of a data warehouse; in other words, how much detail we are willing to record for each transactional fact.
SSIS Level 2 - How can we do logging in SSIS?
Logging in SSIS is done by logging various events, such as OnError and OnWarning, to one or more destinations such as a flat file, an XML file, a SQL Server table, or SQL Server Profiler.
DW Level5 - Which one is faster, Multidimensional OLAP or Relational OLAP?
Multidimensional OLAP (MOLAP) is faster than Relational OLAP (ROLAP). In MOLAP, data is stored in a multidimensional cube rather than in a relational database, whereas ROLAP products access a relational database using SQL.
Have you ever had bad experience with SSIS as an ETL tool, is there any disadvantage of using SSIS
SSIS has some disadvantages:
1. It is not always efficient with JSON; we receive errors when moving JSON data to the database.
2. Excel and flat file connections are limited by version and type.
3. You have to compile or edit the SSIS package in the same version of the tooling (SSMS) you are using.
SSIS Level 2 - What is the role of event handling tab in SSIS?
On the Event Handlers tab, workflows can be configured to respond to package events. For instance, we can configure a workflow for when any task stops, fails, or starts. For example, we can use the OnError event handler to send an email when an error occurs.
SSIS Level 2 - How would you monitor performance in SSIS?
SSIS includes performance counters to measure packages executed, BLOB reads/writes, buffer information, rows read/written, and more.
Explain the differences between ETL and ELT, which one you prefer ?
ETL stands for Extract, Transform and Load, while ELT stands for Extract, Load, and Transform. In ETL, data moves from the data source to a staging area and then into the data warehouse, whereas ELT loads data directly into the target system and transforms it there. ETL is mainly used for smaller amounts of data, whereas ELT suits large amounts of data. I prefer ETL because it can perform sophisticated data transformations and can be more cost-effective than ELT. ETL can also help with data privacy and compliance by cleaning sensitive data before loading it into the data warehouse.
SSIS Level 2,3 - What is the SSIS Catalogue?
The SSIS catalog is a database that stores all deployed packages. It is widely used to store, secure, and manage the deployed packages.
Level3 - Explain what is grain of Fact?
The grain of a fact table represents the most atomic level by which the facts may be defined. For example, an order fact table might have a grain of order, with one row per order, or have a grain of order line, with a row for every line on each order (meaning more than one line for some orders). The classic example for a supermarket checkout is 'one row for every beep/scan'.
SSIS level4 - What is manifest file in SSIS?
The manifest file is a utility that can be used to deploy a package, via the deployment wizard, to the file system or to a SQL Server database.
DW Level3 - What is Execution Plan? What are the approaches used by Optimizer during execution plan?
The query optimizer determines the most efficient way to execute a SQL statement after considering many factors related to the objects referenced and the conditions specified in the query. The output from the optimizer is an execution plan that describes an optimum method of execution. There are two approaches used by Optimizer during execution plan: 1. Rule Based 2. Cost Based
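As an illustration of inspecting an optimizer's chosen plan, the sketch below uses SQLite (whose planner is cost-based); the table and index names are invented for the example, and the exact plan text varies by database engine and version:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")
conn.execute("CREATE INDEX ix_val ON t(val)")

# Ask the optimizer how it would execute the query; with the index in
# place, a cost-based optimizer chooses an index search over a full scan.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM t WHERE val = 'x'").fetchall()
for row in plan:
    print(row[-1])  # the detail column mentions the index, e.g. ix_val
```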
Level 2 - Mention what are the different types of data warehousing and what is difference between data warehousing and data mining?
There are three types of data warehouse:
1. Enterprise Data Warehouse (EDW)
2. Operational Data Store (ODS)
3. Data Mart
Data mining is the process of extracting data from large data sets, whereas a data warehouse is the process of pooling all the relevant data together. Data mining analyzes data to find unknown patterns, whereas a data warehouse is a technique for collecting and managing data.
DW Level5 - What are the different types of SCDs used in data warehousing?
Three types of SCDs are used in Data Warehousing:
SCD1: The original record is overwritten with the new data; the current data replaces the old and no history is kept.
SCD2: A new record is added to the dimension table for each change, so the table holds both the current data and the full history of previous values.
SCD3: Limited history is kept in the same record by adding a column (for example, a "previous value" column) that stores the prior data alongside the current data.
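The three SCD behaviors can be sketched against a toy in-memory dimension table. Plain Python dicts stand in for dimension rows; the column names (`city`, `prev_city`, `valid_from`, `current`) are illustrative, not from any specific tool:

```python
from datetime import date

# A customer dimension row; "city" is the attribute that changes.
row = {"cust_id": 1, "city": "Oslo", "prev_city": None,
       "valid_from": date(2020, 1, 1), "valid_to": None, "current": True}

def scd1(table, cust_id, new_city):
    """Type 1: overwrite in place -- history is lost."""
    for r in table:
        if r["cust_id"] == cust_id:
            r["city"] = new_city

def scd2(table, cust_id, new_city, change_date):
    """Type 2: expire the current row and insert a new versioned row."""
    for r in table:
        if r["cust_id"] == cust_id and r["current"]:
            r["valid_to"], r["current"] = change_date, False
            table.append({**r, "city": new_city, "valid_from": change_date,
                          "valid_to": None, "current": True})
            break

def scd3(table, cust_id, new_city):
    """Type 3: keep limited history in a 'previous value' column."""
    for r in table:
        if r["cust_id"] == cust_id:
            r["prev_city"], r["city"] = r["city"], new_city

dim2 = [dict(row)]
scd2(dim2, 1, "Bergen", date(2021, 6, 1))
print(len(dim2))  # 2 rows: the expired version and the current one

dim3 = [dict(row)]
scd3(dim3, 1, "Bergen")
print(dim3[0]["prev_city"])  # Oslo -- one row, prior value preserved
```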
DW Level5 - How do you load the time dimension?
Time dimensions are usually loaded by a program that loops through all possible dates that may appear in the data. 100 years may be represented in a time dimension, with one row per day.
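Such a loader can be sketched in a few lines; the attribute set here (year, quarter, day of week, weekend flag) is a minimal illustrative subset of what a real time dimension would carry:

```python
from datetime import date, timedelta

def build_time_dimension(start, end):
    """Loop over every calendar day and emit one dimension row per day."""
    rows, day = [], start
    while day <= end:
        rows.append({
            "date_key": int(day.strftime("%Y%m%d")),  # e.g. 20240101
            "date": day,
            "year": day.year,
            "quarter": (day.month - 1) // 3 + 1,
            "month": day.month,
            "day_of_week": day.strftime("%A"),
            "is_weekend": day.weekday() >= 5,
        })
        day += timedelta(days=1)
    return rows

dim = build_time_dimension(date(2024, 1, 1), date(2024, 12, 31))
print(len(dim))  # 366 -- 2024 is a leap year
```

Extending the range to 100 years simply means widening the start and end dates; the row count stays modest (about 36,500 rows).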
SSIS Level 3 - What is Variable in SSIS and what types are Variable in SSIS?
Variable allows us to dynamically control the package at runtime. Variables store values that an SSIS package and its containers, tasks, and event handlers can use at run time. The scripts in the Script task and the Script component can also use variables. The precedence constraints that sequence tasks and containers into a workflow can use variables when their constraint definitions include expressions. Integration Services supports two types of variables. 1. User-defined variables 2. System variables
DW Level2 - What is VLDB?
VLDB stands for Very Large Database: a database whose size is defined as more than one terabyte. These are decision support systems used to serve a large number of users.
Level3 - Explain what is tracing levels and what are the types?
- The tracing level determines the amount of data stored in the log files
- The tracing levels are NORMAL and VERBOSE
- NORMAL logs initialization, error, and summary information
- VERBOSE logs details for each and every row processed
Explain the benefit of using SSIS as an ETL tool
1. Graphical interfaces make it fairly easy to visualize the flow of data.
2. Connects to many different sources.
3. Can consume data from sources that can't be accessed directly using T-SQL, e.g. FTP, HTTP, MSMQ, Analysis Services and SMO.
4. We can use C# or VB.NET to extend its functionality and reuse scripts across multiple projects.
5. Transforms data without the need for a staging area.
6. Provides transform functionality not available in T-SQL, like fuzzy logic, data profiling and direct insert to SSAS.
7. "Bad" data can be redirected to a different data source for further examination or processing.
8. Provides a more sophisticated structured error handling system, as well as multiple options for logging and auditing.
9. Allows parallel execution of data flows.
10. Better for complex transformations, multi-step operations, and aggregating data from different data sources or types.
11. The SQL Server Destination can be used instead of OLE DB, which allows us to load data into SQL Server faster.
12. Debugging is simplified because package logic is all in one place.
What is a workflow? What do you mean by a control flow and a data flow? what are the differences between control flow and data flow?
A control flow defines a workflow of tasks to be executed, including the operations, the order of operations, and the relationships between them. The operations are Data Flow tasks, containers, Execute SQL tasks, or any other control flow component, and precedence constraints define the order of operations and the relationships between them. A data flow is the flow of data from the corresponding sources to the target destinations, with transformations to clean and manipulate the data along the way. The control flow designs the overall flow of the package, while the data flow performs the ETL work; a data flow is a subset of the control flow and cannot run without one. A package has exactly one control flow, while multiple data flows can exist.
DW Level5 - What is Junk Dimension?
A Junk Dimension is a dimension table consisting of attributes that do not belong in the fact table or in any of the existing dimension tables. The nature of the data of junk dimension is usually Boolean or flag values.
DW Level3,6 - What is defined as Partial Backup?
A partial backup is any backup short of a full backup; it can be taken while the database is open or shut down.
What is a container? How many types of containers are there in SSIS?
A container is a logical grouping of tasks that allows us to manage the scope of the tasks together. There are four types of containers in SSIS: the Foreach Loop Container, the For Loop Container, the Sequence Container, and the Task Host Container.
Foreach Loop Container - Used for enumerating each object in a collection, for example a list of files.
For Loop Container - Used when we want a repeating flow in a package.
Sequence Container - Used for grouping logically related tasks together.
Task Host Container - The default container for every task.
What is Data Warehouse?
A data warehouse is a large centralized repository of data that contains information from many sources within an organization for analytical and reporting purposes. A data warehouse helps executives to organize, understand, and use their data to take strategic decisions.
SSIS Level2 - What is the property expression in SSIS?
A property expression is an expression that is assigned to a property to enable dynamic update of the property at run time. EXAMPLE: I use user defined variables for file path and file name to create a dynamic connection to sources (or destinations) that connect to a specific folder. This makes it easier to reconfigure the connection IF the file location changes. The property expression allows for the dynamic update of this variable during run time.
What Is Real Time Data warehousing ?
A real-time data warehouse captures the business data as soon as it occurs. When a business activity gets completed, data will become available for instant use.
DW Level4 - What is a snapshot with reference to Data Warehouse?
A snapshot is a complete visualization of the data at the time of extraction. It occupies less space and can be used to back up and restore data quickly. It records the activities performed and is stored in report format from a specific catalog; the report is generated soon after the catalog is disconnected.
Explain Virtual Data Warehousing and Active data warehousing ?
A virtual data warehouse is a set of separate databases, which can be queried together, so a user can effectively access all the data as if it was stored in one data warehouse. It is one of the best ways for translating raw data and presenting it in the form that can be used by decision-makers. Active Data Warehousing is the technical ability to capture transactions when they change, and integrate them into the warehouse along with scheduled cycle refreshes. Active data warehouse can be utilized in finding trends and patterns that can be used in future decision making. Active data warehouse offers the possibility of automating routine tasks and decisions.
DW Level2 - What is active data warehouse?
Active Data Warehousing is the technical ability to capture transactions when they change, and integrate them into the warehouse, along with maintaining batch or scheduled cycle refreshes. An active data warehouse offers the possibility of automating routine tasks and decisions. The active data warehouse exports decisions automatically to the On-Line Transaction Processing (OLTP) systems.
DW Level4 - What is the difference between agglomerative and divisive hierarchical clustering?
Agglomerative hierarchical clustering method allows clusters to be read from bottom to top so that the program always reads from the sub-component first then moves to the parent; whereas, divisive hierarchical clustering uses top to bottom approach in which the parent is visited first then the child.
Level3 - What do you understand by active and passive transformation?
An active transformation can change the number of rows as output after a transformation, while a passive transformation does not change the number of rows and passes through the same number of rows that was given to it as input.
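The distinction can be sketched with plain Python functions over a small row set (the tax rate and the 100-unit threshold are arbitrary illustrative values):

```python
rows = [{"id": 1, "amount": 50},
        {"id": 2, "amount": 150},
        {"id": 3, "amount": 200}]

# Passive: same number of rows out as in -- only derived values are added,
# like a Derived Column / Expression transformation.
def add_tax(rows):
    return [{**r, "amount_with_tax": r["amount"] * 1.25} for r in rows]

# Active: the row count can change -- rows may be filtered out,
# like a Filter / Conditional Split transformation.
def filter_large(rows):
    return [r for r in rows if r["amount"] > 100]

print(len(add_tax(rows)))       # 3 -- row count unchanged (passive)
print(len(filter_large(rows)))  # 2 -- row count changed (active)
```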
DW Level2 - What are aggregate tables?
An aggregate table contains summaries of existing warehouse data, grouped to certain levels of dimensions. It is easier to retrieve data from an aggregate table than from the original table with millions of records. Aggregate tables reduce the load on the database server, improve query performance, and return results quickly.
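Building an aggregate table is typically a `GROUP BY` over the detail table, materialized once so queries read a handful of summary rows instead of millions of detail rows. A minimal SQLite sketch (table and column names are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("North", 10), ("North", 20), ("South", 5)])

# Pre-aggregate to region level; queries now hit 2 rows, not the detail.
conn.execute("""
    CREATE TABLE sales_by_region AS
    SELECT region, SUM(amount) AS total, COUNT(*) AS n
    FROM sales GROUP BY region
""")
totals = dict(conn.execute(
    "SELECT region, total FROM sales_by_region").fetchall())
print(totals)  # {'North': 30.0, 'South': 5.0}
```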
DW Level4 - Explain the chameleon method used in Data Warehousing.
Chameleon is a hierarchical clustering algorithm that overcomes the limitations of the existing models and methods present in Data Warehousing. This method operates on the sparse graph having nodes that represent data items and edges which represent the weights of the data items. This representation allows large datasets to be created and operated successfully. The method finds the clusters that are used in the dataset using the two-phase algorithm. The first phase consists of the graph partitioning that allows the clustering of the data items into a large number of sub-clusters. The second phase uses an agglomerative hierarchical clustering algorithm to search for the clusters that are genuine and can be combined together with the sub-clusters that are produced.
DW Level4 - What is the purpose of cluster analysis in Data Warehousing?
Cluster analysis is used to define objects without giving them a class label. It analyzes all the data present in the data warehouse and compares each cluster with the clusters that already exist. Requirements of cluster analysis:
• Scalability
• Ability to deal with different kinds of attributes
• Discovery of clusters with arbitrary shape
• Handling high dimensionality
• Ability to deal with noise
• Interpretability
SSIS Level 3 - What is the use of config file in SSIS?
A config file in SSIS supplies values for connection managers and other package properties dynamically at runtime. With a config file, users only need to change the configuration rather than editing the package itself, which is useful when deploying a package to multiple servers or locations. Configuration values can be stored in several ways:
XML configuration file - store the configuration as an XML file.
Environment variable - store the configuration in an environment variable.
Registry entry - store the configuration in the registry.
Parent package variable - store the configuration as a variable in the package that contains the tasks.
SQL Server - store the configuration in a table in SQL Server.
What are conformed and non-conformed dimensions?
Conformed dimension: a dimension that has exactly the same meaning and content when referred to from different fact tables. It can be shared by multiple fact tables in multiple data marts within the same organization.
Non-conformed dimension: a dimension table targeted at a single fact table, used when dimensions have different definitions for different business units.
DW Level3 - What is a core dimension?
A core dimension is a dimension table dedicated to a single fact table or data mart.
What is data modeling? Have you used any tools to assist in building the data model ?
Data modeling is the process of creating a data model for the data to be stored in a database. The data model is a conceptual representation of data objects, the associations between them, and the rules. There are three main types of data models: conceptual, logical, and physical.
A conceptual data model defines WHAT the system contains; its purpose is to organize scope and define business concepts and rules.
A logical data model defines HOW the system should be implemented regardless of the DBMS; its purpose is to develop a technical map of rules and data structures.
A physical data model describes HOW the system will be implemented using a specific DBMS; its purpose is the actual implementation of the database.
I have used Lucidchart and Microsoft Visio.
level1,3 - What is the purpose of data profiling in ETL process? What are the different kinds of data profiling?
Data profiling is the process of reviewing source data to understand its structure, content, and interrelationships. It can uncover data quality issues and anomalies in data sources, and what needs to be corrected in ETL. For example, a state column might use a combination of two-letter codes and fully spelled out state names; data profiling would uncover this inconsistency and inform a standardization rule that makes them all consistent two-letter codes.
Different kinds of data profiling:
Structure discovery, also known as structure analysis, validates that the data is consistent and formatted correctly.
Content discovery looks more closely at the individual elements of the database to check data quality, for example finding null, incorrect, or ambiguous values.
Relationship discovery identifies what data is in use and the connections between data sets, which helps cut down on problems that arise in the data warehouse.
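The state-column example above can be profiled in a few lines: count the nulls and bucket the non-null values by pattern. The sample values are hypothetical, chosen to show the exact inconsistency described:

```python
import re
from collections import Counter

# Hypothetical source column mixing two-letter codes and full state names,
# plus a null -- the inconsistency profiling should surface.
states = ["NY", "TX", "New York", "CA", None, "Texas"]

nulls = sum(1 for v in states if v is None)
patterns = Counter(
    "2-letter code" if re.fullmatch(r"[A-Z]{2}", v) else "full name"
    for v in states if v is not None)

print(nulls)     # 1
print(patterns)  # Counter({'2-letter code': 3, 'full name': 2})
```

A mixed result like this is the cue to add a standardization step in the ETL before the data reaches the warehouse.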
Explain about data warehouse architecture
Data warehouses and their architectures vary depending upon the specifics of an organization's situation. Three common architectures are: 1. Basic data warehouse, 2. Data warehouse with staging area, 3. Data warehouse with staging area and data marts. In a basic data warehouse, end users directly access data derived from several source systems through the data warehouse. In a data warehouse with a staging area, the staging area simplifies data cleansing and consolidation for operational data coming from multiple source systems. In a data warehouse with a staging area and data marts, we customize the warehouse's architecture for different groups within the organization by adding data marts, which are systems designed for a particular line of business.
SSIS level4 - How do you create deployment utility?
Deployment is the process of converting packages from development mode into executable mode. To deploy an SSIS package, we can right-click the Integration Services project and build it; this saves the package .dtsx file in the project's bin folder. We can also create a deployment utility with which the package can be deployed to SQL Server or as a file to any location. To create a deployment utility:
· Right-click the project and click Properties.
· Set the CreateDeploymentUtility option to True. We can also set the deployment path.
· Close the window after making the changes and build the project by right-clicking on it.
· A deployment folder is created inside the bin folder of the main project location.
· Inside the deployment folder is a .manifest file; double-clicking it gives options to deploy the package to SQL Server.
· Log in to SQL Server and check MSDB under Integration Services.
What is Dimensional Modelling? Explain different types of dimensional modeling.
Dimensional modeling is a data structure technique optimized for data storage in a data warehouse. Its purpose is to optimize the database for fast retrieval of data. There are two types of dimensional models: the star schema and the snowflake schema.
Star schema: the fact table joins directly to at least one related dimension table. There is a single table for each dimension, and each table holds all attributes for that dimension. Because it is a de-normalized solution, results can be retrieved from the database quickly in a data warehouse environment.
Snowflake schema: dimension tables do not have to be directly connected to the fact table; a dimension table can be connected to another dimension table. It typically contains multiple tables per dimension, each holding a dimension key, values, and the foreign key of the parent dimension. It is a more normalized solution.
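A star schema's "single join per dimension" property is what makes it read fast. A minimal SQLite sketch (the `dim_product`/`fact_sales` names and data are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Star schema: one de-normalized dimension joined straight to the fact.
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY,
                              name TEXT, category TEXT);
    CREATE TABLE fact_sales (product_key INTEGER REFERENCES dim_product,
                             qty INTEGER);
""")
conn.execute("INSERT INTO dim_product VALUES (1, 'Pen', 'Office')")
conn.execute("INSERT INTO dim_product VALUES (2, 'Desk', 'Furniture')")
conn.executemany("INSERT INTO fact_sales VALUES (?, ?)",
                 [(1, 3), (1, 2), (2, 1)])

# A single join per dimension answers the analytical question directly.
result = conn.execute("""
    SELECT p.category, SUM(f.qty) FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY p.category ORDER BY p.category
""").fetchall()
print(result)  # [('Furniture', 1), ('Office', 5)]
```

In a snowflake schema, `category` would live in its own table keyed from `dim_product`, adding one more join to the same query.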
What is ETL ? Explain the importance of ETL in data warehousing. level1-What is an ETL process?
ETL stands for Extract, Transform, and Load. The ETL process covers extracting data, mostly from different types of systems, transforming it into a structure more appropriate for reporting and analysis, and finally loading it into the database, cube, or data warehouse. ETL is the first essential step in the data warehousing process and ultimately enables making more informed decisions in less time. ETL increases query and system performance, and improves quality and consistency in data warehousing.
What is ETL testing? Why do we need ETL testing? have you ever done ETL testing ?
ETL testing refers to the process of validating, verifying, and qualifying data while preventing duplicate records and data loss. Transporting data from extraction to loading can introduce human or system errors; ETL testing ensures such errors do not occur and eliminates the bugs. We need ETL testing to keep track of the efficiency and speed of the process, and to become familiar with the ETL process before implementing it in business and production. The ETL test is performed in five stages:
1. Identification of data sources and requirements
2. Acquisition of data
3. Implementing business logic and dimensional modeling
4. Building and publishing data
5. Building reports
DW Level3 - What are the different types of data warehousing?
Enterprise Data warehousing: It is a centralized place where all business information from different sources and applications are made available. Operational Data Store: This has a broad enterprise wide scope, but unlike the real enterprise data warehouse, data is refreshed in near real time and used for routine business activity. Data Mart: Data mart is a subset of data warehouse and it supports a particular region, business unit or business function.
SSIS level4 - Difference between Execute TSQL Task and Execute SQL Task
Execute T-SQL Statement Task:
Pros: takes less memory, faster performance.
Cons: output into a variable is not supported; only supports an ADO.NET connection.
Execute SQL Task:
Pros: supports output into variables, multiple connection types, and parameterized queries.
Cons: takes more memory and is slower compared to the T-SQL task.
What are the different types of Fact tables in Data Warehousing ?
Fact tables can be grouped into three types: transaction grain, periodic snapshot grain, and accumulating snapshot grain.
1. Transaction fact tables record facts about a specific event.
2. Periodic snapshot fact tables record facts at a regular, predictable point in time.
3. Accumulating snapshot fact tables record facts across the lifetime of a process (for example, order fulfillment), with the row updated as each milestone is reached.
SSIS level4 - What is file system deployment?
File system deployment means to save package files on local or network drive. Then we can use the SQL Agent job to schedule when the packages will run.
Mention the errors in SSIS and explain how error handling works
Two kinds of errors may occur in an SSIS package: errors that occur in an output, and errors that occur during runtime. Many data flow components used in the ETL process support error outputs to handle errors. An error output provides three SSIS error handling options: Fail Component, Ignore Failure, and Redirect Row. For capturing errors that occur during runtime, we can use OnError event handlers and create a flow to handle them. We can also add a Send Mail task to notify on any failure.
level1 - How many types of transformation are there?
In SSIS, transformations fall into two main categories: synchronous and asynchronous.
Synchronous (non-blocking) transformations offer the highest performance. They do not change the shape of the data and are either stream-based or row-based. Streaming transformations are calculated in memory and do not require data from outside resources; these are the fastest transformations. Row-based transformations run a little slower because they call a service or look up data from another source to calculate their values.
Asynchronous transformations are either semi-blocking or fully blocking. Semi-blocking asynchronous transformations require a subset of the data to be collected before it can be sent to the destination(s), and the shape of the data can change. Fully blocking asynchronous transformations are the slowest: they require all the data to be pulled from the source(s) before anything can be sent to the destination(s), so all source data must be loaded into memory first.
During ETL design it is recommended to use synchronous transformation components wherever possible and, for highest performance, as few semi-blocking and fully blocking transformations as possible.
In conclusion, transformations can be listed in four types:
1. Non-blocking synchronous streaming transformations (fastest), like Conditional Split, Data Conversion, Derived Column, Multicast, Row Count, and Lookup (if configured with "Full cache").
2. Non-blocking synchronous row-based transformations (fast), like OLE DB Command and Lookup (if configured with "Partial cache" or "No cache").
3. Semi-blocking asynchronous transformations (medium), like Merge, Merge Join, Pivot, Unpivot, and Union All.
4. Fully blocking asynchronous transformations (slowest), like Sort and Fuzzy Lookup.
How do you deploy your SSIS solution in production environment
In Visual Studio, right-click the project and select Deploy; this starts the SSIS deployment wizard and deploys the entire project with all packages included. To deploy an individual package, we can right-click the package itself and choose Deploy (since SSIS 2016). In the first step of the wizard, we choose the destination by entering the server name, making sure the SSIS catalog has already been created on that server (several steps are skipped because the wizard was started from Visual Studio). The next step shows an overview of the actions the wizard will take; we click Deploy to start. The deployment goes through a couple of steps from loading to deploying, and the project is deployed to the server, where we can find it under Integration Services Catalogs.
DW Level3 - What are loops in Data warehousing?
In data warehousing, loops may exist between tables. If there is a loop between tables, query generation takes more time because more than one join path is available, and it also creates ambiguity. Loops can be avoided by creating aliases of a table or by using contexts, and it is advised to avoid loops between tables. Example: four tables (Customer, Product, Time, Cost) forming a closed loop; create an alias for Cost to break the loop.
What is initial load and full load in data warehousing ?
Initial load is the process of populating all the data warehouse tables for the very first time; after that, you run incremental loads (bringing in only modified and new records). Full load is the process of completely deleting the existing data and reloading it from scratch: delete the destination data, read the data from the source, and load it into the destination. A full load is generally an initial load, where you load an empty target.
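The contrast between the two strategies can be sketched with dicts standing in for destination and source tables (keys as surrogate IDs, values as row content; the change-detection mechanism that yields `changed_keys` is assumed, not shown):

```python
target = {1: "old-A", 2: "old-B"}   # existing destination rows
source = {1: "A", 2: "B", 3: "C"}   # current source rows

def full_load(source):
    """Discard everything in the destination, then reload from scratch."""
    return dict(source)

def incremental_load(target, source, changed_keys):
    """Apply only new and modified rows on top of the existing target."""
    out = dict(target)
    for k in changed_keys:
        out[k] = source[k]
    return out

print(full_load(source))                      # everything replaced
print(incremental_load(target, source, {3}))  # only key 3 touched
```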
What is the difference between data dictionary and metadata ?
Metadata is 'data about data': it carries information about how, when, and by whom certain data was collected, and about the data format. A data dictionary is a file containing the basic definitions of a database: the list of files available in the database, the number of records in each file, and information about the fields. In short, a data dictionary is a repository of information about the database, and it can be seen as one form of metadata describing the database.
Level 2 - What is operational data store?
ODS stands for Operational Data Store. It is essentially a repository of real-time operational data: a database designed to integrate data from multiple sources for additional operations on the data. Unlike a master data store, the data is not sent back to operational systems, though it may be passed on for further operations and to the data warehouse for reporting. In an ODS, data can be scrubbed, resolved for redundancy, and checked for compliance with the corresponding business rules. This data store can be used to integrate disparate data from multiple sources so that business operations, analysis, and reporting can be carried out. It is where most of the data used in current operations is housed before being transferred to the data warehouse for longer-term storage or archiving. An ODS is designed for relatively simple queries on small amounts of data (such as finding the status of a customer order), rather than the complex queries on large amounts of data typical of the data warehouse. An ODS is like short-term memory, storing only very recent information; the data warehouse is more like long-term memory, storing relatively permanent information.
Explain SSIS package configurations, how many different ways you can configure a package
Package configurations let us set run-time properties and variables from outside the development environment. SSIS provides several options for storing package configuration values: environment variables, XML files, registry settings, a parent package variable, and a table in a SQL Server database. In each case we store and maintain configuration values outside the SSIS package and use those stored values when executing the package. I prefer the SQL Server configuration table over the other options because the configuration data can be changed with simple T-SQL commands such as INSERT, UPDATE, and DELETE, and I can also include the data in the standard SQL Server backup and restore plan.
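As a sketch of why the SQL Server option is convenient: the configuration wizard creates a table (by default dbo.[SSIS Configurations]) and changing a value is a plain UPDATE. The filter and path values below are hypothetical examples:

```sql
-- Point the package's staging connection at a different server
-- by editing the stored configuration row.
UPDATE dbo.[SSIS Configurations]
SET ConfiguredValue = N'Data Source=PROD01;Initial Catalog=Staging;Provider=SQLNCLI11;Integrated Security=SSPI;'
WHERE ConfigurationFilter = N'StagingConnection'
  AND PackagePath = N'\Package.Connections[Staging].Properties[ConnectionString]';
```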
SSIS level4 - What is precedence constraint in SSIS?
A Precedence Constraint is a connector that links two control flow tasks and defines the logical order in which they execute. We can also specify a condition to be evaluated before the next task in the flow runs; the condition can be a constraint, an expression, or both. There are three kinds of precedence constraint: Success (the next task executes only if the previous task completed successfully), Failure (the next task executes only if the previous task failed), and Completion (the next task executes whether the previous task succeeded or failed). Precedence constraints are very useful for error handling in an SSIS package.
Explain the architecture of SSIS
SSIS architecture comprises two major components: the SSIS Runtime Engine and the SSIS Data Flow Engine (pipeline). The SSIS Runtime Engine manages the workflow of the package during execution; its role is to run the tasks in the defined sequence. The SSIS Data Flow Engine extracts data from one or more sources, applies the required transformations, and delivers it to one or more destinations.
SSIS Level 3 - What is SSIS Control Flow Integration?
SSIS Control Flow lets us program graphically how tasks run, using logical connectors between tasks. There are three basic logical connectors: success, failure, and completion. Furthermore, we can use the fx (expression) option to handle more complex conditions in the control flow. The control flow is responsible for directing which tasks execute and in what order. There are three types of objects in the SSIS control flow: tasks, containers, and precedence constraints.
SSIS Level 3 - Explain the term dataflow buffer?
SSIS operates using buffers, which are in-memory virtual tables that hold data as it moves through the data flow. We can set the maximum rows (DefaultBufferMaxRows) and size (DefaultBufferSize) of the buffer. It is important to fit as many rows as possible into a single buffer: the smaller the total row length, the more rows fit in one buffer. SQL Server 2016 introduced a new property, AutoAdjustBufferSize; when it is set to True, SSIS automatically adjusts the size of the buffer so that the DefaultBufferMaxRows setting is met. Unfortunately, this property is set to False by default.
SSIS Level 3 - What is SSIS package and what does it do?
An SSIS package bundles the components needed to execute a specific ETL task: connection managers, tasks, control flow, data flow, parameters, event handlers, and variables. SSIS packages can be used to merge data from heterogeneous data sources into SQL Server, to populate data warehouses, to clean and standardize data, and to automate administrative tasks. They can also automate maintenance of SQL Server databases, update multidimensional cube data, and perform other functions.
Explain different types of connection supported in SSIS
SSIS provides several types of built-in connection managers. Connection types that work within SSIS include:
• ODBC - connects to databases using the Open Database Connectivity (ODBC) specification.
• OLE DB - connects to a data source through an OLE DB provider; for example, an OLE DB connection manager that connects to SQL Server can use the Microsoft OLE DB Provider for SQL Server.
• .NET SqlClient - connects to SQL Server through the ADO.NET provider.
• Flat File - connects to a single flat file to access data.
• Excel - connects to an existing Excel worksheet to read or write data.
• XML - connects to an XML file to extract data from nodes.
• FILE - connects to a file or folder.
• FTP - connects to an FTP server.
• HTTP - connects to a web server.
• SMTP - connects to an SMTP mail server.
DW Level2 - What are the different stages of data warehousing?
The stages of a data warehouse help us understand how the data in the warehouse changes.
Offline Operational Databases: the initial stage of data warehousing, in which the database of an operational system is developed into an offline server simply by copying the databases.
Offline Data Warehouse: the data warehouse is updated from the operational systems on a regular time cycle, and the data is persisted in a reporting-oriented data structure.
Real-time Data Warehouse: the data warehouse is updated on a transaction or event basis, every time the operational system performs a transaction.
Integrated Data Warehouse: the warehouse generates activity or transactions that are passed back into the operational systems for use in the daily activity of the organization.
Level3 - What is the 3 tier system in ETL?
The staging layer, the data integration layer, and the access layer are the three layers that are involved in an ETL cycle. Staging layer: It is used to store the data extracted from various data structures of the source. Data integration layer: Data from the staging layer is transformed and transferred to the database using the integration layer. The data is arranged into hierarchical groups (often referred to as dimensions), facts, and aggregates. In a DW system, the combination of facts and dimensions tables is called a schema. Access layer: For analytical reporting, end-users use the access layer to retrieve the data.
If you have a package that runs fine in visual studio (SSDT) but fails when running from a SQL Agent Job what would be your first guess on what the problem is?
The most likely cause is that the account SQL Server Agent uses to run the package differs from the original package author and does not have the required permissions. To resolve the issue, create a SQL Server Agent proxy account. The proxy must use a credential that lets SQL Server Agent run the job as the account that created the package, or as an account that has the required permissions.
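The proxy setup can be sketched in T-SQL roughly as follows (credential name, login, and password are placeholders; subsystem id 11 is the SSIS package execution subsystem on SQL Server 2008 and later):

```sql
-- 1. Create a credential mapped to a Windows account with the needed permissions.
USE master;
CREATE CREDENTIAL ETLCredential
    WITH IDENTITY = N'DOMAIN\etl_service', SECRET = N'StrongPassword!1';

-- 2. Create a SQL Server Agent proxy that uses the credential.
USE msdb;
EXEC dbo.sp_add_proxy
    @proxy_name = N'SSIS_Proxy',
    @credential_name = N'ETLCredential',
    @enabled = 1;

-- 3. Allow the proxy to run SSIS package execution job steps.
EXEC dbo.sp_grant_proxy_to_subsystem
    @proxy_name = N'SSIS_Proxy',
    @subsystem_id = 11;  -- SSIS subsystem
```

The job step that runs the package is then configured to "Run as" the SSIS_Proxy account instead of the SQL Server Agent service account.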
Explain the steps and process to build incremental data load process
There are different approaches to building an incremental data load process.
We can use a MERGE statement in SQL, with SCD type 1 or 2 logic, to implement incremental loading; MERGE can work out which rows need to be inserted and which only need updating.
In an SSIS package, we can use a Lookup transformation in the data flow to compare the ID column between source and destination, and set the rows-with-no-matching-entry option to "Redirect rows to no match output". The no-match output contains the rows the destination table does not yet have; we connect it to an OLE DB Destination to insert the new rows. The lookup match output is used to update existing rows: a second Lookup transformation compares all columns between source and destination, again redirecting unmatched rows to the no match output. For small data volumes we can connect this output to an OLE DB Command to update existing rows; for better performance with large volumes, we instead load these rows into a staging table through another OLE DB Destination, and in the control flow use an Execute SQL Task to update the destination records (an UPDATE query joining the destination to the staging table).
I prefer to wrap the MERGE statement in a stored procedure and execute it from an Execute SQL Task in the SSIS package.
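A minimal sketch of the MERGE approach for an SCD type 1 load (table and column names are hypothetical):

```sql
-- Insert new customers, update changed ones (SCD type 1: overwrite in place).
MERGE dbo.DimCustomer AS tgt
USING stg.Customer    AS src
    ON tgt.CustomerID = src.CustomerID
WHEN MATCHED AND (tgt.CustomerName <> src.CustomerName
               OR tgt.City         <> src.City) THEN
    UPDATE SET tgt.CustomerName = src.CustomerName,
               tgt.City         = src.City
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerID, CustomerName, City)
    VALUES (src.CustomerID, src.CustomerName, src.City);
```

For SCD type 2, the WHEN MATCHED branch would instead expire the current row and insert a new version, so that history is preserved.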
SSIS level 5 - What is a transaction in SSIS and how to implement it?
Transactions in SSIS allow us to commit, or roll back, a group of data flow tasks as a single logical unit of work. Transactions are supported by every executable (package level, task level, container level) and are managed through the TransactionOption property, which has three settings: Required, Supported, and NotSupported.
Required indicates that the container starts a transaction unless one has already been started by its parent container, in which case it joins the existing transaction. For example, if a package that is not configured to support transactions includes a Sequence container that uses the Required option, the Sequence container starts its own transaction; if the package were configured to use the Required option, the Sequence container would join the package transaction.
Supported indicates that the container does not start a transaction, but joins any transaction started by its parent container. For example, if a package with four Execute SQL Tasks starts a transaction and all four tasks use the Supported option, the database updates performed by the tasks are rolled back if any task fails. If the package does not start a transaction, the four tasks are not bound by a transaction, and no database updates are rolled back except the ones performed by the failed task itself.
NotSupported indicates that the container neither starts a transaction nor joins an existing one. A transaction started by a parent container does not affect child containers configured not to support transactions: if a package is configured to start a transaction and a For Loop container in the package uses the NotSupported option, none of the tasks in the For Loop roll back if they fail.
Explain what is transformation and why we need to use transformations in SSIS, what purpose they serve?
Transformations are the components in the data flow of a package that aggregate, merge, distribute, and modify data. Transformations can also perform lookup operations and generate sample datasets. We need transformations in order to bring data into the desired format; properly formatted and validated data improves data quality. For example, when pulling data from a source we may want to ensure only distinct records are written to the destination, so duplicates are removed.
Explain your understanding on transforming and transporting of data
Transportation is moving data from one place to another - in ETL, from the source system to the staging area, the data warehouse, or a data mart. Transformation is changing the data structure so that it meets data warehouse needs: for instance star schema design, de-normalization, aggregation, calculations, and data cleansing. Both happen during ETL, and the "T" in "ETL" stands for Transformation.
DW Level5 - What is the difference between View and Materialized View?
A view is a logical structure: it occupies no storage space, and changes in the underlying tables are immediately reflected when the view is queried. A materialized view contains pre-calculated data and occupies physical storage; changes in the underlying tables are not reflected until the materialized view is refreshed.
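In SQL Server the closest equivalent to a materialized view is an indexed view: creating a unique clustered index on a schema-bound view materializes its result set on disk. A sketch, assuming a hypothetical dbo.FactSales table with a non-nullable Amount column:

```sql
-- The view must be schema-bound and include COUNT_BIG(*) when it uses GROUP BY.
CREATE VIEW dbo.vSalesByRegion
WITH SCHEMABINDING
AS
SELECT RegionID,
       SUM(Amount)  AS TotalAmount,
       COUNT_BIG(*) AS RowCnt
FROM dbo.FactSales
GROUP BY RegionID;
GO
-- The unique clustered index is what physically materializes the view.
CREATE UNIQUE CLUSTERED INDEX IX_vSalesByRegion
    ON dbo.vSalesByRegion (RegionID);
```

Unlike a plain view, SQL Server maintains this stored result set automatically as the base table changes; other platforms (e.g. Oracle) instead refresh materialized views on demand or on a schedule.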
What is the importance of time dimension in data warehousing ? how do you load time dimension ? Explain the load process and importance of time dimension in Data Warehousing
Virtually every data warehouse is a time series. The date dimension is a key dimension in data warehousing because it allows us to analyze data by different aspects of the date. It has exactly one record per day, no more, no less. Benefits include:
• no gaps in date ranges, since every single date is in the table;
• no wrong ideas or disagreements about weeks, quarters, or other periods;
• no issues when grouping dates in reports;
• easy date hierarchy set-up in SSAS cubes;
• consistent date formats across the entire company;
• the ability to slice and dice by many date attributes, with consistency in analysis.
Time dimensions are usually loaded by a program that loops through all possible dates appearing in the data. It is not unusual for 100 years to be represented in a time dimension, with one row per day.
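The loading loop can be sketched in T-SQL; the DimDate column names here are hypothetical:

```sql
-- Populate one row per calendar day over the chosen range.
DECLARE @d date = '2000-01-01';
WHILE @d <= '2030-12-31'
BEGIN
    INSERT INTO dbo.DimDate
        (DateKey, FullDate, CalendarYear, CalendarQuarter, MonthNumber, DayNameOfWeek)
    VALUES
        (CONVERT(int, CONVERT(char(8), @d, 112)),  -- surrogate key like 20000101
         @d,
         YEAR(@d),
         DATEPART(QUARTER, @d),
         MONTH(@d),
         DATENAME(WEEKDAY, @d));
    SET @d = DATEADD(DAY, 1, @d);
END;
```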
How to load JSON files using SSIS
We can load JSON files in SSIS by using a Script Component as a source in a data flow task.
1. Plan to load the JSON file into the SQL database and extract the file name, converting it to a date, for better tracking of which file has been processed (assuming files are named by the day they were delivered from the data provider).
2. If the column names in the file contain spaces or special characters, modify them to remove the spaces/special characters.
3. Create a package and add a Data Flow Task in the control flow.
4. Inside the data flow, add a Script Component, click on "Inputs and Outputs", and add output columns with appropriate data types based on what we want to store in the database table. I usually use the string data type and later convert to the target data type with a Data Conversion transformation. The column names must exactly match the column names in the JSON file, because column names in the script are case sensitive.
5. Go to the script and edit it; first add the System.Web.Extensions reference to the reference library so we can read the JSON data.
6. The quickest method of converting between JSON text and a .NET object is the JavaScriptSerializer, which converts .NET objects to their JSON equivalent and back by mapping the .NET property names to the JSON property names and copying the values.
7. Add a C# class with the same structure and properties as the JSON content. Provide a namespace name and add the contents; these field names need to match the output columns created at the beginning. Since the serializer cannot handle null INT values unless we write helper methods in C#, an easy solution is to read the data as strings and convert in SSIS if needed.
8. Go to the main.cs window, add the namespace provided earlier to get access to the class created above, and write the required code, using a foreach loop to get row-by-row data from the file.
9. Click OK; the Script Component is configured. Next, extract the file name to populate the FileDate column.
10. In the control flow, add a Script Task and create two string variables (JSONFileName, JSONFilePath).
11. Configure the Script Task and write the code.
12. Format FileDate using a Derived Column transformation inside the data flow: connect the Script Component to the Derived Column and use an expression to format the FileDate column.
13. Convert the data if needed using a Data Conversion transformation.
14. Connect to the target table and do the mapping for the data load, using an OLE DB connection to the table created at the beginning.
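As an alternative to the Script Component, SQL Server 2016 and later can parse JSON directly with OPENJSON, which can simplify the load when the file is accessible to the server. A sketch, assuming the file holds a root-level JSON array and using hypothetical paths and column names:

```sql
-- Read the whole file into a variable, then shred the array into rows.
DECLARE @json nvarchar(max);
SELECT @json = BulkColumn
FROM OPENROWSET(BULK 'C:\data\orders.json', SINGLE_CLOB) AS f;

INSERT INTO dbo.StgOrders (OrderID, CustomerName, OrderDate)
SELECT OrderID, CustomerName, OrderDate
FROM OPENJSON(@json)
WITH (
    OrderID      int           '$.orderId',
    CustomerName nvarchar(100) '$.customerName',
    OrderDate    date          '$.orderDate'
);
```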
DW Level4 - What is XMLA?
XMLA stands for XML for Analysis, an industry standard for accessing data in analytical systems such as OLAP and data mining data sources, including over the Internet. It is based on XML, SOAP, and HTTP. XMLA uses two methods, Discover and Execute: Discover fetches metadata and information from the data source, while Execute allows applications to run commands (such as MDX queries) against the data source. XMLA specifies MDXML as its query language; in the XMLA 1.1 version, the only construct in MDXML is an MDX statement enclosed in the Statement tag.
SSIS Level 3 - How can I manually fail a package in SSIS?
A common approach is to add a Script Task and set Dts.TaskResult = ScriptResults.Failure (or raise an error with Dts.Events.FireError), which causes the task, and by default the package, to report failure. See the discussion at https://stackoverflow.com/questions/435651/how-can-i-manually-fail-a-package-in-integration-services