DS 320 Midterm

what is the goal of data integration

1) offer a uniform access to a set of autonomous and heterogeneous data sources 2) tie together different sources controlled by many people under one schema

what is data integration

a set of techniques that enable building systems geared for flexible sharing and integration of data across multiple autonomous data providers

challenges in data matching and how to address them

accuracy: variations in formatting, abbreviations, and naming conventions; omissions; nicknames; errors in the data
scaling up: high computational cost for large data sets

why is string matching challenging in data integration

accuracy: matching strings may appear differently
scalability: applying a similarity measure to every pair is impractical

what is data matching

finds structured data items that refer back to the same real-world entity

difference between generative and discriminative models

generative: model how data is placed throughout the space
discriminative: draw boundaries in the data space

extract feature vectors using TF/IDF

convert each string into a document d with terms t
term frequency = # of times t appears in the document
inverse document frequency = # of documents / # of documents that contain t
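
A minimal sketch of the definition above, assuming whitespace tokenization and lowercasing (neither is specified in the card). The function name `tfidf_vectors` and the sample strings are made up for illustration. It uses the plain ratio form of IDF given above; many formulations take the logarithm of that ratio instead.

```python
from collections import Counter

def tfidf_vectors(strings):
    """Return one {term: tf * idf} dict per input string."""
    # Assumption: each string becomes a document of lowercase whitespace-separated terms.
    docs = [s.lower().split() for s in strings]
    n_docs = len(docs)
    # df[t] = number of documents that contain term t
    df = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # tf[t] = number of times t appears in this document
        # idf as the plain ratio from the card: # documents / # documents containing t
        vectors.append({t: tf[t] * (n_docs / df[t]) for t in tf})
    return vectors

vectors = tfidf_vectors(["apple corp", "apple inc", "ibm corp"])
print(vectors[1])
```

Terms that appear in few documents (here "inc" and "ibm") get a higher weight than terms shared across documents ("apple", "corp").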

why is it difficult to reconcile the semantic heterogeneity between schemas

could be different names for similar concepts
multiple attributes in one schema relate to one attribute in the other
different tabular organizations
different coverage and level of detail

what is the difference between data matching and string matching

data matching involves tuples of strings, string matching looks at the strings themselves

what are the major differences between a data warehouse and a virtual data integration system

data warehouse: integrates the data by bringing it into a single physical warehouse
virtual data integration system: leaves the data at the sources and accesses them through the global representation at query time

local as view mapping

describes the data sources as views over the mediated schema
expresses incomplete information
easier to add sources and specify their constraints

multi-strategy learning to reuse previous matches

employ a set of learners
use a meta learner to learn a weight for each element of the mediated schema and each learner
learners act as matchers and the meta learner acts as a combiner
outputs a similarity matrix

techniques for scaling up rule based matching

hashing: hash tuples into buckets, match only tuples within each bucket
sorting: use a key to sort the tuples, scan the sorted list and match each tuple with only the previous w-1 tuples, where w is a pre-specified value
indexing: index the tuples such that, given any tuple a, the index can be used to locate a relatively small set of tuples that are likely to match
canopies: use a computationally cheap similarity measure to quickly group tuples into overlapping clusters, then use a different similarity measure to match tuples within each canopy
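
The sorting technique can be sketched as follows. This is an illustration only: the function name `sorted_neighborhood_pairs`, the sample tuples, and the choice of the name field as the sort key are all assumptions, not part of the card.

```python
def sorted_neighborhood_pairs(tuples, key, w):
    """Yield candidate pairs: each tuple is paired only with the previous w-1 tuples in sorted order."""
    ordered = sorted(tuples, key=key)
    for i, current in enumerate(ordered):
        # Look back at most w-1 positions instead of comparing against every tuple.
        for j in range(max(0, i - (w - 1)), i):
            yield ordered[j], current

# Made-up sample data: (name, city)
people = [
    ("Dave Smith", "Madison"),
    ("Joe Wilson", "Seattle"),
    ("David Smith", "Madison"),
    ("Dan Smith", "Chicago"),
]

candidates = list(sorted_neighborhood_pairs(people, key=lambda t: t[0], w=2))
for a, b in candidates:
    print(a, "<->", b)
```

With four tuples there are six possible pairs; a window of w = 2 reduces this to three candidate pairs, which would then be passed to the actual match rules. The saving grows with the size of the data set, at the risk of missing matches whose keys sort far apart.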

components of a typical schema matching system

match selector: constraint enforcer: combiner:

integrity constraints

mechanism for limiting the possible states of the database, ex: repeated rows, key constraints, foreign key constraints, functional dependencies

global as view mapping

mediated schema defined as a set of views over the data sources
reformulation is easier
forces everything into the perspective of the mediated schema

matchers examples

name based, instance based

why is it preferred to create semantic matches, then elaborate matches into mappings when creating the descriptions of the data sources

reduces complexity:
1) semantic matches are often easier to elicit from designers
2) it breaks the long process in the middle
3) it allows the designer to verify and correct the matches
necessary: matches often specify functional relationships but cannot be used to obtain data instances, so SQL queries are needed

approaches for developing a data matcher

rule based: developer writes rules that specify when two tuples match
supervised: learns a matching model M from training data, then applies M to match new tuple pairs
unsupervised: clustering (AHC, k-means, ...)
collective: decide whether two tuples match while simultaneously considering other tuples

similarity/distance measure equation

s(x, y) = 1 - ( d(x, y) / max(length(x), length(y)) )
where d(x, y) is the edit distance
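
A short sketch of this measure, assuming d(x, y) is the Levenshtein distance with unit costs for insertion, deletion, and substitution (the card only says "edit distance"). The function names are illustrative.

```python
def edit_distance(x, y):
    """Levenshtein distance: minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i, cx in enumerate(x, start=1):
        curr = [i]
        for j, cy in enumerate(y, start=1):
            cost = 0 if cx == cy else 1
            curr.append(min(prev[j] + 1,          # delete cx
                            curr[j - 1] + 1,      # insert cy
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

def similarity(x, y):
    """s(x, y) = 1 - d(x, y) / max(length(x), length(y))"""
    longest = max(len(x), len(y))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1 - edit_distance(x, y) / longest

print(similarity("David Smith", "Davd Smith"))
```

Dividing by the longer length normalizes the score into the range 0 to 1, so identical strings score 1 and strings with nothing in common score 0.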

how is semantic heterogeneity addressed in the data integration systems

schema mappings

why do we need to employ more than one matcher in matching elements between two schemas

we need to select different matchers based on the types of information, ex: names and number values

how to extract training data and features to train a supervised data matching model

set of tuples of elements from S and T such that training data = {(x1, y1, l1), ... (xi, yi, li)} where li = yes if xi matches yi and no if not
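
One way this could look in practice is sketched below. The card does not say which features to use, so the feature function here is an assumption: one string-similarity score per attribute, computed with the standard library's `difflib.SequenceMatcher`. The tuples and labels are made up.

```python
from difflib import SequenceMatcher

def features(x, y):
    """Hypothetical feature vector: one similarity score in [0, 1] per attribute pair."""
    return [SequenceMatcher(None, a.lower(), b.lower()).ratio() for a, b in zip(x, y)]

# Training data as (x_i, y_i, l_i) triples, with l_i = "yes" if x_i matches y_i.
training_data = [
    (("Dave Smith", "Madison"), ("David Smith", "Madison"), "yes"),
    (("Dave Smith", "Madison"), ("Joe Wilson", "Seattle"), "no"),
]

# Convert each labeled pair into a feature vector and a numeric label.
X = [features(x, y) for x, y, _ in training_data]
labels = [1 if l == "yes" else 0 for _, _, l in training_data]

for row, label in zip(X, labels):
    print(row, label)
```

The resulting `X` and `labels` are in the shape most supervised learners expect, so a model M could be trained on them and then applied to the feature vectors of new tuple pairs.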

combiners

simple combiners: avg, min, max
complex combiners: hand-crafted scripts, weighted sum, learners
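
The simple combiners and the weighted sum can be sketched as below, assuming each matcher has already produced a similarity score for the same candidate pair. The function name `combine`, the sample scores, and the weights are illustrative only.

```python
def combine(scores, how="avg", weights=None):
    """Combine the scores that several matchers gave to one candidate pair."""
    if how == "avg":
        return sum(scores) / len(scores)
    if how == "min":
        return min(scores)
    if how == "max":
        return max(scores)
    if how == "weighted":
        # Weighted sum: weights would be hand-set or learned (made up here).
        return sum(w * s for w, s in zip(weights, scores))
    raise ValueError(f"unknown combiner: {how}")

# Made-up scores from a name-based matcher and an instance-based matcher.
scores = [0.5, 1.0]

print(combine(scores, "avg"))
print(combine(scores, "min"))
print(combine(scores, "max"))
print(combine(scores, "weighted", weights=[0.25, 0.75]))
```

Applying the same function to every cell of the matchers' similarity matrices yields one combined similarity matrix.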

why is data integration hard

system level: managing different platforms, SQL across multiple systems is complex, distributed query processing
logical: schema and data heterogeneity
social: locating and capturing relevant data, convincing people to share their data

how many matchers do you need for two schemas with multiple tables and attributes

the minimum number of attributes across both tables

architecture of the data integration system

a series of data sources is accessed through wrappers/loaders that request and parse the data; the source descriptions/transformations then convert the data from the source schemas/values into the global representation, which is the mediated schema/warehouse that abstracts all source data; the user poses queries over this

semantic heterogeneity

there will be differences in schemas if they are designed by different people, even if they model the same domain

match selection strategies

thresholding: stable marriage:

