SAS DataFlux
File
Within the Data Mgmt Studio Repository, the ___________ storage of a repository can contain the following: data jobs, process jobs, match reports, entity resolution files, queries, entity definitions, and other files. A. Data B. File
Data explorations can be used for the following: to identify data redundancies, to extract and organize metadata from multiple sources, to identify relationships between metadata, and to catalog data by specified business data types and processes
A data exploration reads data from databases and assigns the fields in the selected tables to categories that are predefined in the Quality Knowledge Base (QKB). Data explorations perform this categorization by matching column names. You also have the option of sampling the data in a table to determine whether the data belongs to one of the specific categories in the QKB. A. Repository B. Data Collection C. Data Exploration
Token
A __________ is an "atomically semantic" component of a data value. In other words, _____________ represent the smallest pieces of a data value that have some distinct meaning. A. Token B. Data Value C. Data Object
Collection
A _______________ is a set of fields that are selected from tables that are accessed from different data connections. A _______________ provides a convenient way for users to build a dataset using those fields. A __________________ can be used as an input source for a profile in Data Management Studio. A. Collection B. Data Connection C. Master Data Foundation
Standardization A standardization definition has the following attributes: it is more complex than a standardization scheme, it involves one or more standardization schemes, and it can also parse data and apply regular expression libraries and casing.
A ________________________ scheme is a simple find-and-replace table that specifies how data values will be standardized. A. Data Search B. Standardization
Standardization A standardization scheme can be built from the profile report. When a scheme is applied, if the input data is equal to the value in the Data column, then the data is changed to the value in the Standard column. The standard value DataFlux was selected by the Scheme Builder because it was the permutation with the most occurrences in the profile report.
A _________________scheme takes various spellings or representations of a data value and lists a standard way to consistently write this value. A. Build B. Standardization
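For illustration only, a minimal scheme built from such a profile report might contain entries like the following (the permutations shown here are hypothetical; only the standard value DataFlux comes from the example above):
    Data             Standard
    DATAFLUX         DataFlux
    Dataflux Corp    DataFlux
    Data Flux        DataFlux
When this scheme is applied, any input value that equals an entry in the Data column is replaced with DataFlux.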
Preview Previewing does not create the output. The output is physically created only when the job is executed.
A _______________of a Data Output node does not show field name changes or deletions. This provides the flexibility to continue your data flow after a Data Output node. In addition, previewing a Data Output node does not create the output. You must run the data job to create the output. A. Export B. Import C. Preview
Reference Reference source locations are registered on the Administration riser bar in DataFlux Data Management Studio. One reference source location of each type should be designated as the default.
A ______________object is typically a database used by DataFlux Data Management Studio to compare user data to a reference source (for example, USPS Address Data). You cannot directly access or modify references. A. Data Source B. Reference
Business Rule Business rules are defined within a repository using the Business Rules Manager.
A formula, validation, or comparison that can be applied to a given set of data. Data must either pass or fail the business rule. A. Exception B. Business rule
Plan: Discover
A quick inspection of your corporate data would probably find that it resides in many different databases, managed by many different systems, with many different formats and representations of the same data. This step of the methodology enables you to explore metadata to verify that the right data sources are included in the data management program. You can also create detailed data profiles of identified data sources so that you can understand their strengths and weaknesses. A. Plan: Define B. Plan: Discover C. Act: Design D. Act: Execute E. Monitor: Evaluate F. Monitor: Control
Data Collection A data collection has the following features: it provides a convenient way to build a data source using desired fields, and it can be used as an input source for profiles.
A set of data fields from different tables in different data connections. A. Repository B. Data Collection
Act: Execute
After business users establish how the data and rules should be defined, the IT staff can install them within the IT infrastructure and determine the integration method (real time, batch, or virtual). These business rules can be reused and redeployed across applications, which helps increase data consistency in the enterprise. A. Plan: Define B. Plan: Discover C. Act: Design D. Act: Execute E. Monitor: Evaluate F. Monitor: Control
Act: Design
After you complete the first two steps, this phase enables you to take the different structures, formats, data sources, and data feeds and create an environment that accommodates the needs of your business. At this step, business and IT users build workflows to enforce business rules for data quality and data integration. They also create data models to house data in consolidated or master data sources. A. Plan: Define B. Plan: Discover C. Act: Design D. Act: Execute E. Monitor: Evaluate F. Monitor: Control
Act
Analyzing and exploring the data sources can lead to the discovery of data quality issues. This phase of the methodology is designed to create data jobs that cleanse, or correct, the data. It involves the following: standardizing, parsing, and/or casing the data; correctly identifying types of data (identification analysis); and performing methods to remove duplicates from data sources or to join tables with no common key. A. Plan B. Act C. Monitor
Row-based rule
Business Rule that evaluates every row in a table? A. Row-based rule B. Set-based rule C. Group-based rule
Group-based rule
Business Rule that evaluates groups of data (for example, if data is grouped by product code, then the rules are evaluated for each product code)? A. Row-based rule B. Set-based rule C. Group-based rule
Set-based rule
Business Rule that evaluates the table as a whole? A. Row-based rule B. Set-based rule C. Group-based rule
SAS QKB for Product Data (PD)
Contains extraction, parsing, standardization, and pattern analysis definitions to handle the following attributes in generic product data: • brands/manufacturers • colors • dimensions • sizes • part numbers • materials • packaging terms and units of measurement A. SAS QKB for Contact Information (CI) B. SAS QKB for Product Data (PD)
Network
Possible values of Diff type include the following: A record belongs to a set of records that are involved in one or more different multirecord clusters in the left and right tables. A. Combine B. Divide C. Network
Extraction
Extracts parts of the text string and assigns them to corresponding tokens for the specified data type. A. Case B. Extraction C. Gender Analysis D. Identification Analysis E. Language Guess F. Locale Guess G. Match H. Parse I. Pattern Analysis J. Standardization
Data Type
In the context of the QKB, a _______________ is an object that represents the semantic nature of some data value. A _____________ serves as a placeholder (or grouping) for metadata used to define data cleansing and data integration algorithms (called definitions). DataFlux provides many data types in the QKB, but you can also create your own. A. Data Object B. Data Type
Data Job Node The referenced data job (the one that is embedded using the Data Job (reference) node) must have an External Data Provider node as the input. Data is passed from the parent job to the referenced data job, processed, and returned to the flow in the parent job. The Data Job (reference) node is found in the Data Job grouping of nodes.
Is used to embed a data job within a data job. A. Data Node B. Data Job Node
Combine
Possible values of Diff type include the following: A record belongs to a set of records from one or more clusters in the left table that are combined into a larger cluster in the right table. A. Combine B. Divide C. Network
Divide
Possible values of Diff type include the following: A record belongs to a set of records in a cluster in the left table that is divided into two or more clusters in the right table. A. Combine B. Divide C. Network
Metadata
Profiles are not stored as files, but as ____________. To run a profile via the command line, the Batch Run ID for the profile must be specified. A. Metadata B. Tokens
External Data Provider Node The External Data Provider node has the following characteristics: it accepts source data from another job or from user input that is specified at run time, it can be used as the first node in a data job that is called from another job, and it can be used as the first node in a data job that is deployed as a real-time service.
Provides a landing point for source data that is external to the current job. A. External Data Provider Node B. External Data Job
Data profiles provide the following benefits: they improve understanding of existing databases; they aid in identifying issues early in the data management process, when they are easier and less expensive to manage; they help determine which steps need to be taken to address data problems; and they enable you to make better business decisions about your data.
Provides the ability to inspect data for errors, inconsistencies, redundancies, and incomplete information. A. Data Profile B. Data Collection
Extensible
Rules are no longer limited to well-known contact data. With the customization feature in Data Management Studio, you can create data-cleansing rules for any type of data. A. Fully Customizable B. Extensible C. Modifiable D. Efficient E. Flexible
Modifiable
Rules can be modified to appropriately address the needs of the enterprise and can be implemented across Data Management Studio modules. A. Fully Customizable B. Extensible C. Modifiable D. Efficient E. Flexible
Master Data Foundation
The __________________ feature in Data Management Studio uses master data projects and entity definitions to develop the best possible record for a specific resource, such as a customer or a product, from all of the source systems that might contain a reference to that resource. A. Collection B. Data Connection C. Master Data Foundation
Cluster Diff
The __________________ node is used to compare two sets of clustered records by reading in data from a left and a right table. From each table, the ______________________ node takes two inputs: a numeric record ID field and a cluster number field. A. Cluster Group B. Cluster Diff
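A hypothetical sketch of how the Diff types are assigned (the record IDs and cluster numbers are invented for illustration):
    Record ID    Left cluster    Right cluster    Diff type
    101          1               1                Combine (left clusters 1 and 2 merge into right cluster 1)
    102          1               1                Combine
    103          2               1                Combine
    201          5               5                Divide (left cluster 5 splits into right clusters 5 and 6)
    202          5               6                Divide
A record would be flagged as Network when its set of records is spread across several different multirecord clusters in both the left and right tables.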
SAS QKB for Contact Information (CI)
Supports management of commonly used contact information for individuals and organizations, such as names, addresses, company names, and phone numbers. A. SAS QKB for Contact Information (CI) B. SAS QKB for Product Data (PD)
Multiple
The Allow generation of _____________ matchcodes per definition option requires the creation of a special match definition in the QKB. A. Single B. Multiple
Validation The Data Validation node is in the Utilities grouping of nodes.
The Data _________________ node is used to filter or flag rows according to the specified condition(s). A. Import B. Validation C. Output
NULL
The Generate null match codes for blank field values option generates a ____________ match code if the field is blank. If this option is not selected, then a match code of all $ symbols is generated for the field. When you match records, a field with NULL does not equal another field with NULL, but a field with all $ symbols equals another field with all $ symbols. A. Preview B. Numeric C. NULL
Collection
The SAS Quality Knowledge Base (QKB) is a _______________ of files that store data and logic that define data management operations. A. Collection B. Repository
Surviving Record Identification
The Surviving Record Identification (SRI) node examines clustered data and determines a surviving record for each cluster. A. Entity Resolution B. Surviving Record Identification
Match
The ____________ Report node produces a report listing the duplicate records identified by the match criteria. ______________ reports are displayed with a special report viewer. A. Match B. Clustering
Table
The ______________ Match report displays a list of database tables that contain matching fields for a selected table or field. A. Field B. Identification C. Table
Clustering
The ________________ node enables the specification of an output ______________ ID field and the specification of _____________ conditions. A. Match B. Clustering
Quality Knowledge Base (QKB)
The _________________ is a collection of files and configuration settings that contain all the DataFlux Data Management algorithms. A. Collections Repository B. Quality Knowledge Base (QKB)
Field Match
The _________________ report displays a list of the fields in metadata that match a selected field's name. A. Field Name B. Field Relationship C. Field Match
Entity Resolution
The __________________ File enables you to manually review the merged records and make adjustments as necessary. This can involve the following tasks: examining clusters, reviewing the Cluster Analysis section, reviewing related clusters, processing cluster records, and editing fields for surviving records. A. Entity Resolution B. Surviving Record Identification
Identification
The _____________________Analysis report displays a list of fields in metadata that match categories in the identification analysis definitions specified for field name and sample data analysis. A. Field B. Identification C. Table
Field Relationship
The ______________________ map provides a visual presentation of the field relationships between all of the databases, tables, and fields that are included in the data exploration. A. Field Name B. Field Relationship C. Field Match
Execute Business Rule The Execute Business Rule Properties window allows for the specification of a Return status field, which flags records as either passing (True) or failing (False) the business rule. If the Return status field is not selected, only records that pass the business rule are passed to the next node.
The _________________________ node applies an existing, row-based business rule to the rows of data as they flow through a data job. Records either pass or fail the selected rule. A. Execute Business Rule B. Business Rules
Clustering
The ______________ node provides the ability to match records based on multiple conditions. Create conditions that support your business needs. A. Match B. Clustering
Outliers
The ______________ tab lists the minimum and maximum value outliers. The number of minimum and maximum values that are listed is specified when the data profiling metrics are set. A. Frequency Distribution B. Frequency Pattern C. Outliers
-j
The dmpexec command can be used to execute profiles and data jobs from the command line. Which option executes the job in the specified file? A. -j <file> B. -l <file> C. -c <file> D. -i <file> E. -b <file> F. -o <file>
-o
The dmpexec command can be used to execute profiles and data jobs from the command line. Which option overrides settings in configuration files? A. -j <file> B. -l <file> C. -c <file> D. -i <file> E. -b <file> F. -o <file>
-c
The dmpexec command can be used to execute profiles and data jobs from the command line. Which option reads the configuration from the specified file? A. -j <file> B. -l <file> C. -c <file> D. -i <file> E. -b <file> F. -o <file>
-i
The dmpexec command can be used to execute profiles and data jobs from the command line. Which option specifies job input variables? A. -j <file> B. -l <file> C. -c <file> D. -i <file> E. -b <file> F. -o <file>
-b
The dmpexec command can be used to execute profiles and data jobs from the command line. Which option specifies job options for the job being run? A. -j <file> B. -l <file> C. -c <file> D. -i <file> E. -b <file> F. -o <file>
-l
The dmpexec command can be used to execute profiles and data jobs from the command line. Which option writes the log to the specified file? A. -j <file> B. -l <file> C. -c <file> D. -i <file> E. -b <file> F. -o <file>
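As a sketch only (the job path, log path, configuration file, and input variable below are hypothetical), several of these switches can be combined in a single invocation:
    dmpexec -j "D:\jobs\standardize_products.ddf" -l "C:\Temp\standardize_products.log" -c "C:\dataflux\etc\batch.cfg" -i "INPUT_TABLE=PRODUCTS"
This runs the specified job, writes the log to the named file, reads additional configuration from batch.cfg, and passes one job input variable.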
Monitor: Control
The final stage in a data management project involves examining any trends to validate the extended use and retention of the data. Data that is no longer useful is retired. The project's success can then be shared throughout the organization. The next steps are communicated to the data management team to lay the groundwork for future data management efforts. A. Plan: Define B. Plan: Discover C. Act: Design D. Act: Execute E. Monitor: Evaluate F. Monitor: Control
Administration
The locations of the Quality Knowledge Base files are registered on the _________________ riser bar in DataFlux Data Management Studio. There can only be one active QKB at a time. A. Collections B. Folders C. Administration
Data Job
The main way to process data in DataFlux Data Management Studio. Each ____________ specifies a set of data-processing operations that flow from source to target. A. Command B. Routine C. Data Job
Physical The command line to execute the data job could be similar to the following:
call dmpexec -j "D:\Workshop\dqdmp1\Demos\files\batch_jobs\Ch4D2_Products_Misc.ddf" -l "C:\Temp\log1.txt"
The _______________ path and filename of data jobs must be specified with the -j switch. A. Logical B. Physical
Plan: Define
The planning stage of any data management project starts with this essential first step. This is where the people, processes, technologies, and data sources are defined. Roadmaps that include articulating the acceptable outcomes are built. Finally, the cross-functional teams across business units and between business and IT communities are created to define the data management business rules. A. Plan: Define B. Plan: Discover C. Act: Design D. Act: Execute E. Monitor: Evaluate F. Monitor: Control
Monitor: Evaluate
This step of the methodology enables users to define and enforce business rules to measure the consistency, accuracy, and reliability of new data as it enters the enterprise. Reports and dashboards on critical data metrics are created for business and IT staff members. The information that is gained from data monitoring reports is used to refine and adjust the business rules. A. Plan: Define B. Plan: Discover C. Act: Design D. Act: Execute E. Monitor: Evaluate F. Monitor: Control
Case
Transforms a text string by changing the case of its characters to uppercase, lowercase, or proper case. A. Case B. Extraction C. Gender Analysis D. Identification Analysis E. Language Guess F. Locale Guess G. Match H. Parse I. Pattern Analysis J. Standardization
True
True or False: If a single value in a group of items needs to be changed, then select Edit > Modify Standards Manually > Single Instance. A single value can then be modified manually. To toggle back to the ability to change all instances in a group, select Edit > Modify Standards Manually > All Instances.
True
True or False? Data standardization does not perform a validation of the data (for example, Address Verification). Address verification is a separate component of the DataFlux Data Management Studio application and is discussed in another section.
True
True or False? If you standardize a data value using both a definition and a scheme, the definition is applied first and then the scheme is applied.
True
True or False? Monitoring tasks are created by pairing a defined business rule with one or more events. Some available events include the following: call a realtime service, execute a program, launch a data flow job on a Management server, log error to repository, log error to text file, raise an event on the process job (if hosted), run a local job, run a local profile, send email message, set a data flow key or value, and write a row to a table.
True
True or False? Record-level rules select which record from a cluster should survive. If there is ambiguity about which record is the survivor, the first remaining record in the cluster is selected.
True Jobs and profiles developed with Data Management Studio can be uploaded to the Data Management Server. Jobs and profiles can be executed on this server, which is intended to be a more powerful processing system. Data Management Server needs access to a copy of the QKB and data packs that are used in the data jobs and profiles.
True or False? The DataFlux Data Management Server is an application server that supports web service requests through a service-oriented architecture (SOA) and executes profiles, data jobs, process jobs, and services on Windows, UNIX, or Linux servers.
True
True or False? The match code generation process consists of the following steps: 1. Data is parsed into tokens (for example, Given Name and Family Name). 2. Ambiguities and noise words are removed (for example, the). 3. Transformations are made (for example, Jonathon > Jon). 4. Phonetics are applied (for example, PH > F). 5. Based on the sensitivity selection, the following occurs: Relevant components are determined. A certain number of characters of the transformed, relevant components are used.
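A purely hypothetical walk-through of these steps (the name, the transformations, and the outcome are invented for illustration and are not actual QKB output):
    Input value:     Jonathon Smith
    1. Parse:        Given Name = Jonathon, Family Name = Smith
    2. Remove noise: nothing to remove in this value
    3. Transform:    Jonathon > Jon
    4. Phonetics:    phonetic rules are applied where they occur
    5. Sensitivity:  the selected sensitivity determines which components, and how many of their characters, contribute to the code
Under such rules, Jonathon Smith and Jon Smith would typically generate the same match code.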
Target
When choosing Output Field settings, which of the options sends all fields available to target nodes to the target? A. Target B. Source and Target C. All
All
When choosing Output Field settings, which of the options specifies that all available fields are passed through source nodes, target nodes, and all intermediate nodes? A. Target B. Source and Target C. All
Source and Target
When choosing Output Field settings, which of the options specifies that all fields available to a source node are passed to the next node and all fields available to target nodes are passed to the target? A. Target B. Source and Target C. All
dmserver.cfg
When configuring options for the Data Management Server...which config file describes the settings below? DMSERVER/SOAP/LISTEN_PORT= PORT specifies the TCP port number where the server will listen for SOAP connections. DMSERVER/LOGCONFIG_PATH= PATH specifies the path to the logging configuration file. A. app.cfg B. dmserver.cfg
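A minimal sketch of these settings in dmserver.cfg (the port number and path shown are placeholder values, not documented defaults):
    DMSERVER/SOAP/LISTEN_PORT = 21036
    DMSERVER/LOGCONFIG_PATH = C:\dataflux\dmserver\etc\dmserver.log.xml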
app.cfg
When configuring options for the Data Management Server...which config file describes the settings below? QKB/PATH = PATH specifies the location of the active Quality Knowledge Base. VERIFY/USPS = PATH specifies the location of USPS reference source. VERIFY/GEO = PATH specifies the location of Geo/Phone reference source. A. app.cfg B. dmserver.cfg
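A minimal sketch of these settings in app.cfg (the directory paths shown are placeholders):
    QKB/PATH = C:\dataflux\qkb\CI
    VERIFY/USPS = C:\dataflux\refdata\usps
    VERIFY/GEO = C:\dataflux\refdata\geophone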
Lower
When creating folders, it is best practice to set folder names in _____________ with no spaces. A. Lower B. Upper
Batch Jobs
When importing to a Data Management Server, each defined Data Management Server has a series of predefined folders. Selecting ____________ (for example) enables the Import tool in the navigation area, as well as in the main information area. A. Data Jobs B. Batch Jobs
ABANDONED
When parsing, which term best describes the description below? A resource limit was reached. Increase your resource limit and try again. A. OK B. NO SOLUTION C. NULL D. ABANDONED
NULL
When parsing, which term best describes the description below? The parse operation was not attempted. This result occurs only when a null value was in the field to be parsed and the Preserve null values option was enabled. A. OK B. NO SOLUTION C. NULL D. ABANDONED
OK
When parsing, which term best describes the description below? The parse operation was successful. A. OK B. NO SOLUTION C. NULL D. ABANDONED
NO SOLUTION
When parsing, which term best describes the description below? The parse operation was unsuccessful; no solution was found. A. OK B. NO SOLUTION C. NULL D. ABANDONED
Preserve
When standardizing, selecting _________________ null values ensures that if a field is null when it enters the node, then the field is null after being output from the node. It is recommended that this option be selected if the output is written to a database table. A. Import B. Preserve C. Archive
SQL
Which querying method is described by the statements below? The data generated for the __________ query and for the filter have the same results. The filter pulled all records and was processed on the machine where the profile was run. The database does the filtering for the ________ query. A. Filtering B. SQL
Data
Within the Data Mgmt Studio Repository, the ___________ storage of a repository can contain the following: explorations and reports, profiles and reports, business rules, monitoring results, custom metrics, business data information, and master data information. A. Data B. File
%
Within the Standardization Scheme, which of these commands provides an indicator specifying that the matched word or phrase is not updated? A. //Remove B. %
//Remove
Within the Standardization Scheme, which of these commands removes the matched word or phrase from the input string? A. //Remove B. %
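A hypothetical scheme fragment using both commands (the words chosen and the two-column layout are assumptions made for illustration):
    Data    Standard
    INC     //Remove    (INC is dropped from the input string)
    LLC     %           (LLC is matched but left unchanged)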
Define
Within the _______________ methodology, there are four main functions that can be used: Connect to Data, Explore Data, Define Business Rules, and Build Schemes. A. Define B. Discover
Flexible
You can customize rules to conform to the ever-changing business environment regardless of your data needs. A. Fully Customizable B. Extensible C. Modifiable D. Efficient E. Flexible
Efficient
You can dramatically reduce manual data manipulation time by simply updating cleansing rules. It is much easier to manipulate reusable data-cleansing rules than to manually manipulate the data itself. A. Fully Customizable B. Extensible C. Modifiable D. Efficient E. Flexible
Data Collection
You can use _____________ to group data fields from different tables, database connections, or both. These collections can be used as input data sources for profiles. A. Repository B. Data Collection C. Data Exploration
Fully Customizable
You have full control of data-cleansing rules across the enterprise and through time. A. Fully Customizable B. Extensible C. Modifiable D. Efficient E. Flexible
Parse
_____________ definitions define rules to place the words from a text string into the appropriate tokens. A. Parse B. Text C. Case
Field Name
______________ analysis analyzes the names of each field from the selected data sources to determine which identity to assign to the field. A. Identification B. Field Name C. Sample Data
Case
______________ definitions are algorithms that can be used to convert a text string to uppercase, lowercase, or proper case. A. Parse B. Text C. Case
Address Verification
________________ identifies, corrects, and enhances address information. A. Address Validation B. Address Verification
Sample Data
_________________ analysis analyzes a sample of data in each field to determine which identity to assign to the field. A. Identification B. Field Name C. Sample Data
Data Connection
__________________ are used to access data in jobs, profiles, data explorations and data collections. A. Collection B. Data Connection C. Master Data Foundation
Entity Resolution
__________________ is the process of merging duplicate records in a single file or multiple files so that records referring to the same physical object are treated as a single record. Records are matched based on the information that they have in common. The records that are merged might appear to be different, but can actually refer to the same person or item. A. Entity Match B. Entity Resolution C. Match Entity
Data Exploration
______________________ have the following types of analysis methods: field name matching, field name analysis, and sample data analysis. A. Repository B. Data Collection C. Data Exploration
Geocoding Geocoding latitude and longitude information can be used to map locations and plan efficient delivery routes. Geocoding can be licensed to return this information for the centroid of the postal code or at the roof-top level. Currently, there are only geocoding data files for the United States and Canada. Also, roof-top level geocoding is currently available only for the United States.
_______________enhances address information with latitude and longitude values. A. Geo Validation B. Geocoding
Identification, Right
______________ analysis and ___________ fielding use the same definitions from the QKB, but in different ways. ______________ analysis identifies the type of data in a field, and __________ fielding moves the data into separate fields based on its identification. Both the ___________ analysis and _________ fielding examples above use the Contact Info identification analysis definition. A. Identification, Right B. Right, Identification