Lesson 4: Comparing and Contrasting Different Data Structures, Formats, and Markup Languages
An analyst exported data into a new TAB delimited file for analysis in another system. Which of the following file extensions would the export process NOT assign to the new file? .TAB .CSV .TXT .TSV
.TAB
Would data that is stored in a relational database be considered structured or unstructured data, and why?
A relational database stores structured data; we know this because structured data fits into columns and rows.
Able to query data Allows you to perform actions on data T-SQL, which stands for Transact - SQL, is used in software developed by Microsoft. Oracle also uses SQL, but it uses PL/SQL, or Procedural Language/SQL. When searching for syntax examples, it is helpful to look for the version of SQL that your particular software uses.
Attributes of SQL
As there is far more unstructured data in the world than structured data, a company hired a data analyst specifically to analyze their unstructured and semi-structured data. What systems might the analyst work with? (Select all that apply.) BLOB storage Data lakes Relational databases NoSQL databases
BLOB storage Data lakes NoSQL databases
What is a major drawback to working with CSV files?
CSV files are not connected to the live data, so the file will not update when the data updates. A new file must be exported to see the changes.
What are some common delimiters you will likely encounter in text-based files?
Comma, pipe, and tab are common delimiters.
Hypertext Markup Language (HTML)
Created by Tim Berners Lee in the early 1990s, a language dedicated to presenting data in a browser-based environment. Derived from Standard Generalized Markup Language (SGML), which is considered to be the parent of all markup languages. SGML provides the standard that defines all markup languages and is also widely used for data structures. HTML is considered a markup and tag language. It allows you to mark up any type of text for presentation on a web browser, which makes HTML suitable as the language of the web. The browser reads the HTML and produces a web page or website following the structure that is laid out in the code. HTML has a set of predefined tags based on the HTML language (e.g., HTML, TITLE) so that when the browser sees those lines of code, it knows how to present that information and where. For example, you can create a text file with HTML tags and then open that text document in the browser to see the results. Involves the use of tags in the form of angle brackets to mark up a document intended to be displayed on a web browser. These tags typically consist of an opening tag (< >) indicating the start of an element and a closing tag (/ < >) indicating the end of the element, with the text content of the element enclosed between the two tags.
Define the following file extensions File Extensions .CSV .CSV (comma separated value) with UTF .TAB (tab delimited) .TXT (tab delimited) .TSV (tab delimited)
File Extensions .CSV (comma separated value) --> Comma separates the values .CSV (comma separated value) with UTF - Encoding --> Comma separates the values; this extension is preferred when dealing with web-based coding .TAB (tab delimited) --> Tab separates the values .TXT (tab delimited) --> When saving from Excel as a tab delimited file, Excel will generate the delimiter and line breaks as needed but use an extension of .TXT .TSV (tab delimited) --> Tab separates the values
A data analyst is preparing to present the findings of a recent analysis to executive leadership. To ensure there are no technical issues with the presentation, the analyst is using a tool that wraps the data in tags that will present the data on any web browser. What language is the analyst using to present the data?
HTML
Which markup language is dedicated to the presentation of information on the web?
HTML
Which markup language supports the ability to use an array?
JSON
A data analyst queried a database to retrieve information on an organization's last month's sales and then used the same language to append that data to a separate flat-file database. What language did the analyst use to perform those tasks?
SQL
What language is used to write queries in a structured database environment?
SQL
A data analyst received a delimited file and knew that which of the following characters would most likely separate the data? (Select all that apply.) TABs Pipes Commas Apostrophes
TABs Pipes Commas
Is most data in the world structured or unstructured, and why?
The majority of data in the world is unstructured. A process must be performed to structure data into tables and fields.
.CSV
The most common type of delimited file is comma separated values (CSV). In this file format, each field of data is separated by a comma.
Video, audio, and images are all examples of what type of data?
Unstructured data, because these items do not fit neatly into structured data tables with field names and require additional action, like watching a video, to gain context.
A data analyst is using a tool to generate custom tags to wrap around customer data fields before transferring the data to another system. What language is the analyst generating?
XML
Which markup language supports the transfer of data between systems and is a child of SGML?
XML
Semi-structured data
a mix of both structured and non-structured data. Emails, JSON files, XML files, zipped files, and even web pages (HTML) are examples of semi-structured data. Semi-structured data does not meet the rigid standards of structured data, so it cannot be stored in a relational database. However, semi-structured data does contain some structure through the use of tags and attributes that group the data and describe its storage methods. These tags allow us to search for and view the data, and also help us to transform unstructured data into more structured data sets.
Unstructured data
data that is not organized in a predefined manner to meet the standards for structured data. Unstructured data does not fit neatly into tables, but instead has undefined fields. Stored in non-relational database, can't be stored in traditional May use data lake or blob storage Ex. Image data, video data, likes, log files
Flat files
delimited files that are exported out of a system Not connected to the database; when the database updates, the flat file does not update. The flat file will continue to display the data as it was when you exported it. In order to see the updated data, you will need to export the file again.
Delimited Files
files in which some form of character separates each field of data from the other data fields. If you have ever opened a text file and found it was full of commas, or even weird-looking spaces, this file was likely downloaded from a data system. The commas and spaces are characters intended to support the conversion of the data into a structured file format.
A data analyst has exported the files out of a database to perform calculations. Once the data in the database has updated, the analyst will have to export the data again since the exported data will not automatically update. What type of database file is the analyst working with?
flat file
Structured data
has structure and fits nicely in tables, schemas. Ex. field data, such as numeric, alphanumeric, and date data types
Extensible Markup Language (XML)
is also a text-based markup language that uses tags derived from SGML. Unlike HTML, the primary purpose of XML is to transfer data, not display it. XML also uses opening and closing tags, but one of the key differences of XML is that the "tags" are not defined by the language itself. Rather, the author can invent the tags and the structure. This ability allows the developer to tag the fields of data with something more meaningful. As an example, consider the example of the product-pricing table from our HTML discussion. Rather than being limited to just using <TR> and <TD> tags, we can use more relevant tags with XML, as shown in the following example.
JavaScript Object Notation (JSON)
is an object-oriented, event-driven programming language that allows us to interact with websites (as opposed to HTML, which simply displays information). JSON is a child of JavaScript and is based on the JavaScript language. With the advent of web applications, JavaScript has become one of the core languages of the World Wide Web. JSON is used to transfer information and interact with programming languages like JavaScript. JSON and XML are alike in several ways. Both utilize tags or keys that are created by the developer and thus more relevant to the data. JSON and XML can both be read, parsed, and used by many programming languages. However, JSON has a more simplified coding syntax than XML, with fewer requirements, and it is not considered a markup language because it is object oriented. JSON uses curly brackets for its syntax, and it does not require the opening and closing tags seen in XML and HTML. JSON is written using key value pairs. Let's return to our product example. When we write the product list in JSON, the Product Name is an example of a key, and the associated value is "Adjustable Race." The key value is always a text string. The value in the "value" section of the key value pair must be one of the following data types: a string, an object, a number, an array, Boolean, or null. Has ability to use array - An array is noted using square brackets and contains comma-separated elements. The ability to use an array helps make JSON more streamlined than XML
A company requires users to tag all the files, images, and videos that they upload to the company portal with keywords and descriptions so that they can later find the data through the Search feature. What kind of data is the company using?
semi-structured
A user is registering for a very basic account on an online dating site, but does not upload a profile picture. What kind of data does this represent?
structured
When a student has their picture taken for their student ID card, the school also links the photo with the student's school record. What kind of data is the student's record?
structured