Semistructured Data and XML
XQuery
• XQuery • is a general-purpose query language for XML data • XQuery FLWOR syntax • for ... let ... where ... order by ... return ... ‒ for <-> SQL from ‒ where <-> SQL where ‒ order by <-> SQL order by ‒ return <-> SQL select ‒ let allows temporary variables, and has no equivalent in SQL
XML documents
• are not required to have an associated schema • However, schemas are very important for XML data exchange • Otherwise, a site cannot automatically interpret data received from another site
Joins
• are specified in a manner very similar to SQL
• Example
    for $c in /university/course,
        $i in /university/instructor,
        $t in /university/teaches
    where $c/course_id = $t/course_id and $t/IID = $i/IID
    return <course_instructor> { $c, $i } </course_instructor>
• The same query can be expressed with the selections specified as XPath selections:
    for $c in /university/course,
        $i in /university/instructor,
        $t in /university/teaches[$c/course_id = $t/course_id and $t/IID = $i/IID]
    return <course_instructor> { $c, $i } </course_instructor>
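The join can also be mimicked outside XQuery with nested loops. The following is a minimal Python sketch using the standard-library xml.etree.ElementTree module; the sample document and its values are invented for illustration, and only the element names follow the query above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; element names follow the query above.
doc = """
<university>
  <course><course_id>CS-101</course_id><title>Intro to CS</title></course>
  <course><course_id>BIO-301</course_id><title>Genetics</title></course>
  <instructor><IID>10101</IID><name>Srinivasan</name></instructor>
  <instructor><IID>76766</IID><name>Crick</name></instructor>
  <teaches><IID>10101</IID><course_id>CS-101</course_id></teaches>
  <teaches><IID>76766</IID><course_id>BIO-301</course_id></teaches>
</university>
"""
root = ET.fromstring(doc)

pairs = []
for c in root.findall("course"):            # for $c in /university/course
    for i in root.findall("instructor"):    # $i in /university/instructor
        for t in root.findall("teaches"):   # $t in /university/teaches
            # where $c/course_id = $t/course_id and $t/IID = $i/IID
            if (c.findtext("course_id") == t.findtext("course_id")
                    and t.findtext("IID") == i.findtext("IID")):
                pairs.append((c.findtext("title"), i.findtext("name")))

print(pairs)
```

The triple loop is the naive evaluation of the three-variable for clause; a query processor would be free to pick a better join strategy.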
Tags
• <tag>...</tag> • Delimit the beginning and the end of the portion of the document to which the tag refers • Example ‒ <th>Name</th>
Nesting: Advantages and Disadvantages
• Advantage • Example ‒ Course elements nested within department element ‒ Easy to find all courses offered by a department • Disadvantages • Redundant storage of data • Example ‒ Details of courses taught by an instructor are stored within the instructor elements ‒ If a course is taught by more than one instructor, course information would be stored redundantly.
Two mechanisms for specifying XML schema
• Document Type Definition (DTD) ‒ Widely used • XML Schema ‒ Newer, increasing use
Querying and Transforming XML Data
• The following two are closely related, and handled by the same tools • Translation of information from one XML schema to another • Querying on XML data • Standard XML querying/translation languages • XPath: Simple language consisting of path expressions • XQuery: An XML query language with a rich set of features • XSLT: Simple language designed for translation from XML to XML and XML to HTML • Query and transformation languages are based on a tree model of XML data
Subelements can be specified as
• names of elements, or
• #PCDATA (parsed character data), i.e., character strings
• EMPTY (no subelements) or ANY (anything can be a subelement)
• Example
    <!ELEMENT department (dept_name, building, budget)>
    <!ELEMENT dept_name (#PCDATA)>
    <!ELEMENT budget (#PCDATA)>
Markup
• refers to anything in a document that is not intended to be part of the printed output
Root element
Every document must have a single root element that encompasses all other elements in the document
Tree Model of XML
• An XML document is modeled as a tree, with nodes corresponding to elements and attributes • Element nodes have child nodes, which can be attributes or subelements • Text in an element is modeled as a text node child of the element • Children of a node are ordered according to their order in the XML document • Element and attribute nodes (except for the root node) have a single parent, which is an element node • The root node has a single child, which is the root element of the document
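The tree model can be explored with a small traversal. This is a sketch in Python using xml.etree.ElementTree, with an invented sample document. Note one difference from the model described above: ElementTree keeps attributes in a mapping and text in a .text field rather than as separate child nodes, so the walk below reports them explicitly.

```python
import xml.etree.ElementTree as ET

# Hypothetical document used only for illustration.
doc = ('<course course_id="CS-101">'
       '<title>Intro to CS</title><credits>4</credits>'
       '</course>')
root = ET.fromstring(doc)   # the root element of the document

def walk(elem, depth=0):
    """Yield (depth, description) for each node, in document order."""
    yield depth, f"element {elem.tag}"
    for name, value in elem.attrib.items():
        yield depth + 1, f"attribute {name}={value}"
    if elem.text and elem.text.strip():
        yield depth + 1, f"text {elem.text.strip()}"
    for child in elem:       # children keep their document order
        yield from walk(child, depth + 1)

for depth, label in walk(root):
    print("  " * depth + label)
```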
XPath 3
• Attributes are accessed using "@" • Example ‒ /university-3/course[credits >= 4]/@course_id ˃ returns the course identifiers of courses with credits >= 4 • XPath provides several functions • The function count() at the end of a path counts the number of elements in the set generated by the path • Example ‒ /university-2/instructor[count(./teaches/course) > 2] ˃ returns instructors teaching more than 2 courses (on the university-2 schema) • The Boolean connectives "and" and "or" and the function not() can be used in predicates
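Python's xml.etree.ElementTree implements only a subset of XPath, without numeric comparisons or count(), so the two example paths are approximated here by filtering in Python. This is a sketch on invented sample data; the document shapes are assumptions modeled on the paths above.

```python
import xml.etree.ElementTree as ET

# Hypothetical data in the style of the university-3 schema.
u3 = ET.fromstring("""
<university-3>
  <course course_id="CS-101"><credits>4</credits></course>
  <course course_id="BIO-101"><credits>3</credits></course>
</university-3>
""")
# /university-3/course[credits >= 4]/@course_id
ids = [c.get("course_id") for c in u3.findall("course")
       if int(c.findtext("credits")) >= 4]

# Hypothetical data in the style of the university-2 schema.
u2 = ET.fromstring("""
<university-2>
  <instructor><name>Katz</name>
    <teaches><course/><course/><course/></teaches>
  </instructor>
  <instructor><name>Crick</name>
    <teaches><course/></teaches>
  </instructor>
</university-2>
""")
# /university-2/instructor[count(./teaches/course) > 2]
busy = [i.findtext("name") for i in u2.findall("instructor")
        if len(i.findall("./teaches/course")) > 2]

print(ids, busy)
```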
Document Type Definition
• DTD is an optional part of an XML document • DTD constrains structure of XML data • What elements can occur • What attributes can/must an element have • What subelements can/must occur inside each element, and how many times. • DTD does not constrain data types • All values represented as strings in XML
XML Motivation
• Data interchange is critical in today's networked world • Examples: ‒ Banking: funds transfer ‒ Order processing (especially inter-company orders) • Paper flow of information between organizations is being replaced by electronic flow of information • Each application area has its own set of standards for representing information • Chemistry: ChemML (Chemical Markup Language), ... • Genetics: BSML (Bio-Sequence Markup Language), ... • XML has become the basis for all new generation data interchange formats
XML: Extensible Markup Language
• Derived from SGML (Standard Generalized Markup Language), but simpler to use than SGML • Documents have tags giving extra information about sections of the document, e.g. <title>XML</title><slide>Introduction...</slide> • Extensible, unlike HTML • Users can add new tags, and separately specify how the tag should be handled for display • The ability to specify new tags, and to create nested tag structures, makes XML a great way to exchange data, not just documents. • Much of the use of XML has been in data exchange applications, not as a replacement for HTML
Attributes vs. Subelements
• Distinction between subelement and attribute • In the context of documents, attributes are part of markup, while subelement contents are part of the basic document contents • In the context of data representation, the difference is unclear and may be confusing ‒ Same information can be represented in two ways • Suggestion: use attributes for identifiers of elements, and use subelements for contents
DTD syntax
• Element / subelement ‒ <!ELEMENT element (subelements-specification) > • Attribute ‒ <!ATTLIST element (attributes) >
Attribute
• An element can have attributes • The attributes of an element appear as name=value pairs before the closing ">" of a start tag, e.g. <tagname attributename="value"> • An element may have several attributes, but each attribute name can only occur once
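Both points can be observed with a parser. In this short Python sketch (xml.etree.ElementTree; the element and attribute values are invented), attributes come back as a name-to-value mapping, and repeating an attribute name is rejected as ill-formed.

```python
import xml.etree.ElementTree as ET

course = ET.fromstring('<course course_id="CS-101" dept_name="Comp. Sci."/>')
print(course.attrib)   # attributes as a name -> value mapping

# Repeating an attribute name within one tag is a well-formedness error.
try:
    ET.fromstring('<course course_id="CS-101" course_id="CS-190"/>')
    duplicate_accepted = True
except ET.ParseError:
    duplicate_accepted = False

print(duplicate_accepted)
```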
IDs and IDREFs
• ID ‒ An element can have at most one attribute of type ID ‒ The ID attribute value of each element in an XML document must be distinct • IDREF ‒ An attribute of type IDREF must contain the ID value of an element in the same document ‒ An attribute of type IDREFS contains a set of (0 or more) ID values. Each of these values must be the ID value of an element in the same document
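The two constraints (distinct IDs, resolvable references) can be checked by hand. The Python sketch below is illustrative only: the standard-library parser does not validate against a DTD, so which attributes are of type ID or IDREFS is hard-coded as an assumption, and the sample document is invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical document. Assumed DTD typing: IID and course_id are of
# type ID, instructors is of type IDREFS.
root = ET.fromstring("""
<university>
  <instructor IID="10101"/>
  <instructor IID="83821"/>
  <course course_id="CS-101" instructors="10101 83821"/>
  <course course_id="CS-190" instructors="99999"/>
</university>
""")

ID_ATTRS = {"IID", "course_id"}
IDREFS_ATTRS = {"instructors"}

# Collect every ID value in the document and look for repeats.
ids = [v for e in root.iter() for k, v in e.attrib.items() if k in ID_ATTRS]
duplicates = {v for v in ids if ids.count(v) > 1}

# An IDREFS value is a whitespace-separated list of ID values.
dangling = [ref
            for e in root.iter()
            for k, v in e.attrib.items() if k in IDREFS_ATTRS
            for ref in v.split()
            if ref not in ids]

print(duplicates, dangling)
```

Note that the check only asks whether a reference resolves to some ID in the document. Nothing stops instructors from pointing at a course_id value, since IDs and IDREFs carry no type.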
Comparison of the relational database and the XML document
• Inefficient: tags, which in effect represent schema information, are repeated • Better than relational tuples as a data-exchange format ‒ Unlike relational tuples, XML data is self-documenting due to presence of tags ‒ Non-rigid format: tags can be added ‒ Allows nested structures ‒ Wide acceptance, not only in database systems, but also in browsers, tools, and applications
Attribute specification: for each attribute
• Name • Type of attribute ‒ CDATA ‒ ID (identifier) or IDREF (ID reference) or IDREFS (multiple IDREFs) • Whether ‒ mandatory (#REQUIRED) ‒ has a default value (value), ‒ or neither (#IMPLIED)
Namespaces Solution
• Namespace mechanism ‒ allows organizations to specify globally unique names to be used as element tags in documents.
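A parser makes the "globally unique name" concrete: a prefixed tag is expanded to the namespace URI plus the local name. This is a small Python sketch using xml.etree.ElementTree; the namespace URI and the prefixes are made up for illustration.

```python
import xml.etree.ElementTree as ET

# The namespace URI below is a made-up example, not a real resource.
doc = """
<university xmlns:u="http://www.example.org/university">
  <u:course><u:title>Intro to CS</u:title></u:course>
</university>
"""
root = ET.fromstring(doc)
course = root[0]
print(course.tag)   # the tag is stored as {namespace-URI}local-name

# The prefix used for searching is local to this code; only the URI matters.
ns = {"x": "http://www.example.org/university"}
title = root.findtext("x:course/x:title", namespaces=ns)
print(title)
```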
Nesting
• Nested elements
Limitations of DTDs
• No typing of text elements and attributes ‒ All values are strings, no integers, reals, etc. • Difficult to specify unordered sets of subelements ‒ (A | B)* allows specification of an unordered set, but ˃ Cannot ensure that each of A and B occurs only once • IDs and IDREFs are untyped ‒ The instructors attribute of a course may contain a reference to another course, which is meaningless ˃ instructors attribute should ideally be constrained to refer to instructor elements
Elements must be properly nested
• Proper nesting ‒ <course>...<title> ... </title>...</course> • Improper nesting ‒ <course>...<title>...</course>...</title> • Formally: every start tag must have a unique matching end tag that is in the context of the same parent element.
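An XML parser enforces proper nesting and refuses documents that violate it. The Python sketch below uses xml.etree.ElementTree; the helper name is_well_formed is introduced here for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

proper = "<course><title>Intro to CS</title></course>"
improper = "<course><title>Intro to CS</course></title>"

print(is_well_formed(proper), is_well_formed(improper))
```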
Structure of XML Data
• Tag • label for a section of data • Element • section of data beginning with <tagname> and ending with matching </tagname>
XPath 2
• The initial "/" denotes root of the document (above the top-level tag) • Path expressions are evaluated left to right • Each step operates on the set of instances produced by the previous step • Selection predicates may follow any step in a path, in [ ]
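Left-to-right evaluation can be sketched as a function that applies one step to the node set produced by the previous step. This is an illustrative Python sketch, not how any particular XPath engine is implemented; the helper step, the sample data, and the example path are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data for illustration.
root = ET.fromstring("""
<university>
  <instructor><name>Einstein</name><dept_name>Physics</dept_name></instructor>
  <instructor><name>Katz</name><dept_name>Comp. Sci.</dept_name></instructor>
</university>
""")

def step(nodes, tag, predicate=None):
    """Apply one path step to every node produced by the previous step."""
    selected = []
    for node in nodes:
        for child in node.findall(tag):
            if predicate is None or predicate(child):
                selected.append(child)
    return selected

# /university/instructor[dept_name = 'Physics']/name, evaluated left to right.
# `root` is already the <university> element, so the first step is implicit.
instructors = step([root], "instructor",
                   lambda e: e.findtext("dept_name") == "Physics")
names = [n.text for n in step(instructors, "name")]
print(names)
```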
Sorting in XQuery
• The order by clause can be used at the end of any expression.
• Example ‒ To return instructors sorted by name
    for $i in /university/instructor
    order by $i/name
    return <instructor> { $i/* } </instructor>
  ‒ Use order by $i/name descending to sort in descending order
• Can sort at multiple levels of nesting (sort departments by dept_name, with courses sorted by course_id within each department)
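For comparison, the same ordering can be expressed in Python with sorted() over the parsed elements. This is a sketch on invented data, using xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data for illustration.
root = ET.fromstring("""
<university>
  <instructor><name>Wu</name></instructor>
  <instructor><name>Einstein</name></instructor>
  <instructor><name>Katz</name></instructor>
</university>
""")

def by_name(instructor):
    return instructor.findtext("name")

# order by $i/name
ascending = [by_name(i) for i in sorted(root.findall("instructor"), key=by_name)]
# order by $i/name descending
descending = [by_name(i) for i in sorted(root.findall("instructor"),
                                         key=by_name, reverse=True)]
print(ascending, descending)
```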
XML Schema Overview
• XML Schema supports • Typing of values ‒ Such as integer, string, etc ‒ Also, constraints on min/max values • User-defined, complex types • Many more features, including ‒ uniqueness and foreign key constraints, inheritance • XML Schema is integrated with namespaces • BUT: XML Schema is significantly more complicated than DTDs.
Namespaces - Motivation
• XML data has to be exchanged between organizations • Same tag name may have different meaning in different organizations, causing confusion on exchanged documents • Specifying a unique string as an element name avoids confusion
XPath 1
• XPath • is used to address (select) parts of documents using path expressions • A path expression • is a sequence of steps separated by "/" • Example ‒ /company/employee/name ‒ /company/employee/address/city • Result of path expression: • set of values that along with their containing elements/attributes match the specified path
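The two example paths can be tried with Python's xml.etree.ElementTree, which supports simple path expressions of this kind. The sample data is invented. ElementTree evaluates paths relative to the root element, so /company/employee/name is written as ./employee/name.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data for illustration.
root = ET.fromstring("""
<company>
  <employee><name>Ada</name><address><city>London</city></address></employee>
  <employee><name>Grace</name><address><city>New York</city></address></employee>
</company>
""")

# /company/employee/name
names = [e.text for e in root.findall("./employee/name")]
# /company/employee/address/city
cities = [e.text for e in root.findall("./employee/address/city")]
print(names, cities)
```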
Database schemas
• constrain what information can be stored, and the data types of stored values
Markup language
• is a formal description of ‒ what part of the document is content, ‒ what part is markup, and ‒ what the markup means
For
• uses XPath expressions; the variable in the for clause ranges over the values in the set returned by the XPath expression
• Example ‒ Find all courses with credits > 3, with each result enclosed in a <course_id> .. </course_id> tag
    for $x in /university-3/course[credits > 3]
    return <course_id> { $x/@course_id } </course_id>
• Items in the return clause are XML text unless enclosed in {}, in which case they are evaluated
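The for/return pattern maps naturally onto a loop that builds one result element per binding. This is a rough Python analogue using xml.etree.ElementTree. It assumes course_id is an attribute of course, as in the university-3 examples, and the data is invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical data in the style of the university-3 schema.
u3 = ET.fromstring("""
<university-3>
  <course course_id="CS-101"><credits>4</credits></course>
  <course course_id="BIO-101"><credits>3</credits></course>
</university-3>
""")

results = []
for x in u3.findall("course"):                # the variable ranges over courses
    if int(x.findtext("credits")) > 3:        # the [credits > 3] selection
        out = ET.Element("course_id")         # literal tag from the return clause
        out.text = x.get("course_id")         # the evaluated {...} part
        results.append(ET.tostring(out, encoding="unicode"))

print(results)
```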
DTD Symbols
• |: or, alternatives • +: 1 or more occurrences • *: 0 or more occurrences • Example ‒ <!ELEMENT university ( (department|course|instructor|teaches)+)>
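The DTD operators behave like their regular-expression counterparts, so a content model can be approximated by a regex over the sequence of child tag names. This is only an illustrative Python sketch of that correspondence, not a DTD validator; the helper name children_match is an assumption.

```python
import re
import xml.etree.ElementTree as ET

# Content model from the example: ( (department|course|instructor|teaches)+ )
# Each child tag is followed by a space so that tag names cannot run together.
CONTENT_MODEL = re.compile(r"(?:(?:department|course|instructor|teaches) )+")

def children_match(elem):
    """Check the sequence of child tags against the content model."""
    tags = "".join(child.tag + " " for child in elem)
    return CONTENT_MODEL.fullmatch(tags) is not None

ok = ET.fromstring("<university><department/><course/><course/></university>")
empty = ET.fromstring("<university/>")              # "+" needs at least one child
bad = ET.fromstring("<university><student/></university>")

print(children_match(ok), children_match(empty), children_match(bad))
```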