DS 320 Midterm
what is the goal of data integration
1) offer uniform access to a set of autonomous and heterogeneous data sources 2) tie together different sources controlled by many people under one schema
what is data integration
a set of techniques that enable building systems geared for flexible sharing and integration of data across multiple autonomous data providers
challenges in data matching and how to address them
accuracy: variations in formatting/abbreviations and naming conventions, omissions, nicknames, and errors in the data; scaling up: high computational cost for large data sets
why is string matching challenging in data integration
accuracy: matching strings may appear in different forms; scalability: applying a similarity measure to every pair of strings is impractical
what is data matching
finds structured data items that refer to the same real-world entity
difference between generative and discriminative models
generative: model how the data is distributed throughout the space; discriminative: draw decision boundaries in the data space
extract feature vectors using TF/IDF
convert each string into a document d made up of terms t; term frequency TF(t, d) = # of times t appears in d; inverse document frequency IDF(t) = # of documents / # of documents that contain t; each term's weight in the feature vector is TF(t, d) * IDF(t)
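A minimal sketch (not from the notes) of extracting TF/IDF vectors per the definitions above; it uses the raw N/df(t) form of IDF rather than the more common log form:

```python
# Sketch only: each string becomes a "document" of whitespace-separated terms.
from collections import Counter

def tfidf_vectors(strings):
    docs = [s.lower().split() for s in strings]
    n_docs = len(docs)

    # document frequency: number of documents containing each term
    df = Counter()
    for terms in docs:
        df.update(set(terms))

    vectors = []
    for terms in docs:
        tf = Counter(terms)                                          # term frequency in this document
        vectors.append({t: tf[t] * (n_docs / df[t]) for t in tf})    # weight = TF * IDF
    return vectors

print(tfidf_vectors(["david a smith", "dave smith"]))
```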
why is it difficult to reconcile the semantic heterogeneity between schemas
different names for similar concepts; multiple attributes in one schema correspond to one attribute in the other; different tabular organizations; different coverage and level of detail
what is the difference between data matching and string matching
data matching compares tuples of strings (structured records), while string matching compares the strings themselves
what are the major differences between a data warehouse and a virtual data integration system
a data warehouse integrates the data by bringing it into a single physical warehouse; a virtual data integration system leaves the data at the sources and accesses them through the global representation at query time
local as view mapping
describes the data sources as views over the mediated schema; expresses incomplete information; easier to add sources and specify their constraints
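A hedged LAV example, using a hypothetical mediated relation Movie(title, year, director) and a source S1 that lists only titles and years; the containment (rather than equality) captures that S1 may be incomplete:

```latex
% LAV: the source relation is described as a view over the mediated schema.
S1(t, y) \;\subseteq\; \{\, (t, y) \mid \exists d .\ \mathrm{Movie}(t, y, d) \,\}
```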
multi-strategy learning to reuse previous matches
employ a set of learners and use a meta-learner to learn a weight for each pair of mediated-schema element and learner; the learners act as matchers and the meta-learner acts as a combiner; the output is a similarity matrix
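A hypothetical sketch of the combination step; the learner names, schema elements, and weights below are made up for illustration:

```python
# Sketch only: the meta-learner's weights (one per mediated element per learner)
# combine the learners' scores into a single similarity matrix.
def combine_scores(score_matrices, weights):
    # score_matrices: {learner: {(mediated_elem, source_elem): score}}
    # weights: {(mediated_elem, learner): weight produced by the meta-learner}
    combined = {}
    for learner, scores in score_matrices.items():
        for pair, s in scores.items():
            w = weights.get((pair[0], learner), 0.0)
            combined[pair] = combined.get(pair, 0.0) + w * s
    return combined  # final similarity matrix

scores = {
    "name_matcher": {("address", "location"): 0.4, ("price", "amount"): 0.1},
    "naive_bayes":  {("address", "location"): 0.8, ("price", "amount"): 0.7},
}
weights = {("address", "name_matcher"): 0.3, ("address", "naive_bayes"): 0.7,
           ("price", "name_matcher"): 0.5, ("price", "naive_bayes"): 0.5}
print(combine_scores(scores, weights))
```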
techniques for scaling up rule based matching
hashing: hash tuples into buckets and match only tuples within each bucket (hashing and sorting are sketched in the code below)
sorting: use a key to sort the tuples, then scan the sorted list and match each tuple with only the previous w-1 tuples, where w is a pre-specified window size
indexing: index the tuples so that, given any tuple a, the index can be used to locate a relatively small set of tuples that are likely to match a
canopies: use a computationally cheap similarity measure to quickly group tuples into overlapping clusters (canopies), then use a different similarity measure to match tuples within each canopy
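A minimal sketch of the hashing and sorting strategies; the key functions and the match() predicate are cheap placeholders, not real matchers:

```python
# Sketch only: blocking reduces the number of tuple pairs that must be compared.
from collections import defaultdict

def hash_blocking(tuples, key, match):
    # hash tuples into buckets; compare only tuples that share a bucket
    buckets = defaultdict(list)
    for t in tuples:
        buckets[key(t)].append(t)
    return [(a, b) for bucket in buckets.values()
                   for i, a in enumerate(bucket)
                   for b in bucket[i + 1:] if match(a, b)]

def sorted_neighborhood(tuples, key, match, w=3):
    # sort by the key, then compare each tuple with only the previous w-1 tuples
    s = sorted(tuples, key=key)
    return [(s[j], s[i]) for i in range(len(s))
                         for j in range(max(0, i - (w - 1)), i)
                         if match(s[j], s[i])]

people = [("Dave Smith", "Seattle"), ("David Smith", "Seattle"), ("D. Smith", "Miami")]
same_city = lambda a, b: a[1] == b[1]          # toy stand-in for a real matcher
print(hash_blocking(people, key=lambda t: t[1], match=same_city))
print(sorted_neighborhood(people, key=lambda t: t[0], match=same_city))
```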
components of a typical schema matching system
matchers: compute a similarity score between every pair of elements of the two schemas
combiner: merges the matchers' similarity matrices into a single similarity matrix
constraint enforcer: applies integrity constraints and domain knowledge to adjust or prune candidate matches
match selector: selects the final matches from the resulting similarity matrix (e.g., by thresholding)
integrity constraints
mechanism for limiting the possible states of the database, e.g., disallowing repeated rows, key constraints, foreign key constraints, functional dependencies
global as view mapping
the mediated schema is defined as a set of views over the data sources; reformulation is easier; forces everything into the perspective of the mediated schema
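A hedged GAV example with the same hypothetical relations: the mediated relation Movie(title, year, director) is defined as a query (here a join) over sources S1(title, year) and S2(title, director):

```latex
% GAV: the mediated relation is defined as a query over the data sources,
% so reformulation is simple view unfolding.
\mathrm{Movie}(t, y, d) \;=\; \{\, (t, y, d) \mid S1(t, y) \wedge S2(t, d) \,\}
```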
matchers examples
name-based, instance-based
why is it preferred to create semantic matches, then elaborate matches into mappings when creating the descriptions of the data sources
reduces complexity: 1) semantic matches are often easier to elicit from designers, 2) it breaks the long process in the middle, 3) it allows the designer to verify and correct the matches
necessary: matches often specify functional relationships but cannot be used to obtain data instances, so SQL queries are needed
approaches for developing a data matcher
rule based: developer writes rules that specify when two tuples match (see the sketch below)
supervised: learns a matching model M from training data, then applies M to match new tuple pairs
unsupervised: clustering (AHC, k-means, ...)
collective: decide whether two tuples match while simultaneously considering the match decisions for other tuples
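A hypothetical rule-based matcher; the attributes, similarity function, and threshold are illustrative only:

```python
# Sketch only: hand-written rules declare when two tuples refer to the same entity.
import difflib

def name_sim(a, b):
    # cheap string similarity between two names
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

def rule_match(t1, t2):
    # Rule 1: similar names and identical zip codes -> match
    if name_sim(t1["name"], t2["name"]) > 0.8 and t1["zip"] == t2["zip"]:
        return True
    # Rule 2: identical phone numbers -> match regardless of name spelling
    if t1["phone"] == t2["phone"]:
        return True
    return False

a = {"name": "Dave Smith",  "zip": "98105", "phone": "206-555-0100"}
b = {"name": "David Smith", "zip": "98105", "phone": "206-555-0199"}
print(rule_match(a, b))   # True via Rule 1
```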
similarity/distance measure equation
s(x, y) = 1 - d(x, y) / max(length(x), length(y)), where d(x, y) is the edit distance between x and y
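A minimal sketch of this measure, with d(x, y) computed as Levenshtein edit distance by dynamic programming:

```python
# Sketch only: edit distance, then the normalized similarity defined above.
def edit_distance(x, y):
    m, n = len(x), len(y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

def similarity(x, y):
    return 1 - edit_distance(x, y) / max(len(x), len(y))

print(similarity("Smith", "Smyth"))   # 1 - 1/5 = 0.8
```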
how is semantic heterogeneity addressed in data integration systems
schema mappings
why do we need to employ more than one matcher in matching elements between two schemas
different matchers exploit different types of information, so we need to select matchers based on what is being compared, e.g., attribute names vs. numeric data values
how to extract training data and features to train a supervised data matching model
a set of pairs of elements from S and T with labels: training data = {(x1, y1, l1), ..., (xn, yn, ln)}, where li = yes if xi matches yi and no otherwise; each pair (xi, yi) is then converted into a feature vector of similarity scores (one feature per attribute comparison) for the learner
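A hypothetical end-to-end sketch (assumes scikit-learn is available): labeled pairs are turned into feature vectors of per-attribute similarities, a model M is trained, and M is then applied to a new pair; the attributes, records, and labels are made up:

```python
# Sketch only: supervised data matching with toy features and training pairs.
import difflib
from sklearn.linear_model import LogisticRegression

def features(x, y):
    # feature vector of per-attribute similarities for the pair (x, y)
    name_sim = difflib.SequenceMatcher(None, x["name"].lower(), y["name"].lower()).ratio()
    same_zip = 1.0 if x["zip"] == y["zip"] else 0.0
    return [name_sim, same_zip]

training_pairs = [                                   # (x_i, y_i, l_i) with l_i = 1 for a match
    ({"name": "Dave Smith",  "zip": "98105"}, {"name": "David Smith", "zip": "98105"}, 1),
    ({"name": "Dave Smith",  "zip": "98105"}, {"name": "Ann Lee",     "zip": "02139"}, 0),
    ({"name": "Ann Lee",     "zip": "02139"}, {"name": "A. Lee",      "zip": "02139"}, 1),
    ({"name": "David Smith", "zip": "98105"}, {"name": "Ann Li",      "zip": "98105"}, 0),
]
X = [features(x, y) for x, y, _ in training_pairs]
labels = [l for _, _, l in training_pairs]

M = LogisticRegression().fit(X, labels)              # learn the matching model M
new_pair = ({"name": "D. Smith", "zip": "98105"}, {"name": "Dave Smith", "zip": "98105"})
print(M.predict([features(*new_pair)]))              # apply M to a new tuple pair
```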
combiners
simple combiners: avg, min, max; complex combiners: hand-crafted scripts, weighted sum, learners
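A minimal sketch of simple and weighted-sum combiners over per-matcher similarity matrices; the matcher names and scores are made up:

```python
# Sketch only: combine one similarity matrix per matcher into a single matrix.
def combine(matrices, how="avg", weights=None):
    keys = set().union(*matrices)                    # all (source elem, target elem) pairs
    out = {}
    for k in keys:
        scores = [m.get(k, 0.0) for m in matrices]
        if how == "avg":
            out[k] = sum(scores) / len(scores)
        elif how == "min":
            out[k] = min(scores)
        elif how == "max":
            out[k] = max(scores)
        elif how == "weighted":
            out[k] = sum(w * s for w, s in zip(weights, scores))
    return out

name_matcher     = {("price", "amount"): 0.2, ("addr", "location"): 0.5}
instance_matcher = {("price", "amount"): 0.9, ("addr", "location"): 0.7}
print(combine([name_matcher, instance_matcher], how="max"))
print(combine([name_matcher, instance_matcher], how="weighted", weights=[0.3, 0.7]))
```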
why is data integration hard
system level: managing different platforms, SQL querying across multiple systems is complex, distributed query processing
logical: schema and data heterogeneity
social: locating and capturing relevant data, convincing people to share their data
how many matchers do you need for two schemas with multiple tables and attributes
the minimum number of attributes across both tables
architecture of the data integration system
there is a series of data sources from which wrappers/loaders request and parse data; the source descriptions/transformations then convert the data from the source schemas/values into the global representation, i.e., the mediated schema or warehouse that abstracts all source data; the user poses queries over this global representation
semantic heterogeneity
there will be differences in schemas if they are designed by different people, even if they model the same domain
match selection strategies
thresholding: select all candidate matches whose similarity score exceeds a given threshold
stable marriage: treat selection as a stable-matching problem, choosing a one-to-one assignment in which no two elements both prefer each other over their assigned matches