Data Analysis - Week 2 DSMG


XML (Extensible Markup Language)

XML files contain data values that are identified or marked up using tags. While data in flat files is "flat," or maps to a single table, XML files can support more complex data structures, such as hierarchical ones. Some common uses of XML include data from online surveys, bank statements, and other unstructured data sets.

Structured data storage

You can typically store structured data in relational or SQL databases. You can also easily examine structured data with standard data analysis methods and tools.

DA ecosystem

infrastructure, software, tools, frameworks, and processes used to gather, clean, analyze, mine, and visualize data

Five Vs of Big Data

velocity, volume, variety, veracity, and value

Data warehouse

A data warehouse works as a central repository that merges information coming from disparate sources and consolidates it through the extract, transform, and load process, also known as the ETL process, into one comprehensive database for analytics and business intelligence. Data warehouses serve as the single source of truth—storing current and historical data that has been cleansed, conformed, and categorized. A data warehouse is a multi-purpose enabler of operational and performance analytics.

Database

A database is a collection of data, or information, designed for the input, storage, search and retrieval, and modification of data

Semi-structured data

A mix of data that has consistent characteristics and data that doesn't conform to a rigid structure. Emails are an example: an email contains structured data, such as the names of the sender and recipient, but the body of the email is unstructured data.

DA role responsibilities

Acquiring data from primary and secondary data sources, Creating queries to extract required data from databases and other data collection systems, Filtering, cleaning, standardizing, and reorganizing data in preparation for data analysis, Using statistical tools to interpret data sets, Using statistical techniques to identify patterns and correlations in data, Analyzing patterns in complex data sets and interpreting trends, Preparing reports and charts that effectively communicate trends and patterns, Creating appropriate documentation to define and demonstrate the steps of the data analysis process.

Flat files

Flat files store data in plain text format, with one record or row per line and each value separated by a delimiter such as a comma, semicolon, or tab. Data in a flat file maps to a single table, unlike relational databases, which contain multiple tables. One of the most common flat file formats is CSV, in which values are separated by commas.

Example of DBMS use

For example, if you want to find customers who have been inactive for six months or more, you can use the query function, and the database management system will retrieve the data of all customers in the database who have been inactive for six months or more. Even though a database and a DBMS mean different things, the terms are often used interchangeably.
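
As a rough illustration, such a query could be issued through a DBMS from code. This is a minimal sketch using Python's built-in sqlite3 module; the database file, the customers table, and its last_active column are all hypothetical.

    import sqlite3
    from datetime import date, timedelta

    conn = sqlite3.connect("company.db")                       # hypothetical database file
    cutoff = (date.today() - timedelta(days=182)).isoformat()  # roughly six months ago

    # Ask the DBMS to retrieve every customer inactive since the cutoff date.
    rows = conn.execute(
        "SELECT customer_id, name FROM customers WHERE last_active < ?",  # hypothetical schema
        (cutoff,),
    ).fetchall()

    for customer_id, name in rows:
        print(customer_id, name)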

Four types of NoSQL databases

Key-value store, document-based, column-based, and graph-based

Spreadsheet files

Spreadsheet files are a special type of flat file that also organizes data in a tabular format of rows and columns. But a spreadsheet can contain multiple worksheets, and each worksheet can map to a different table. Although data in spreadsheets is in plain text, the files can be stored in custom formats and include additional information such as formatting and formulas. Microsoft Excel, which stores data in .XLS or .XLSX format, is probably the most common spreadsheet application. Others include Google Sheets, Apple Numbers, and LibreOffice.

Data stream uses

Stock and market tickers for financial trading, retail transaction streams for predicting demand and supply chain management, surveillance and video feeds for threat detection, social media feeds for sentiment analysis, sensor data feeds for monitoring industrial or farming machinery, web click feeds for monitoring web performance and improving design, and real-time flight events for rebooking and rescheduling.

ETL step 2: Stream processing

Stream processing means source data is pulled in real time and transformed while it is in transit, before it is loaded into the data repository. Tools for stream processing include Apache Samza, Apache Storm, and Apache Kafka.
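
To make the idea concrete, here is a minimal sketch of transforming records while they are in transit, using the kafka-python client; the topic name, broker address, and record fields are assumptions, and in practice tools such as Samza or Storm handle this kind of work at scale.

    import json
    from kafka import KafkaConsumer  # kafka-python package

    # Subscribe to a hypothetical topic of raw events.
    consumer = KafkaConsumer("raw-events", bootstrap_servers="localhost:9092")

    for message in consumer:
        record = json.loads(message.value)                       # each message assumed to be JSON
        record["amount_usd"] = round(record["amount"] * 1.1, 2)  # transform while in transit
        # ...load the transformed record into the target data repository here...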

Structured data long definition

Structured data has a well-defined structure or adheres to a specified data model, can be stored in well-defined schemas such as databases, and in many cases can be represented in a tabular manner with rows and columns. Structured data is objective facts and numbers that can be collected, exported, stored, and organized in typical databases.

ETL step 5: Load verification

Load verification, which includes checking for missing or null values, monitoring server performance, and monitoring load failures, is an important part of this process step. It is vital to keep an eye on load failures and ensure the right recovery mechanisms are in place.
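
As a small illustration of this kind of check, the sketch below counts missing values in a freshly loaded table using pandas; the database file and table name are hypothetical.

    import sqlite3
    import pandas as pd

    conn = sqlite3.connect("warehouse.db")              # hypothetical target repository
    loaded = pd.read_sql("SELECT * FROM sales", conn)   # hypothetical loaded table

    # Count null values per column; non-zero counts flag records to investigate or reload.
    print(loaded.isnull().sum())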

ETL step 1: Extract

Extract is the step where data is collected from source locations for transformation. Data extraction can happen through batch processing, meaning source data is moved in large chunks from the source to the target system at scheduled intervals. Tools for batch processing include Stitch and Blendo.

Data Lake

A data lake is a storage repository that can store large amounts of structured, semi-structured, and unstructured data in its native format, classified and tagged with metadata. So, while a data warehouse stores data processed for a specific need, a data lake is a pool of raw data where each data element is given a unique identifier and is tagged with metatags for further use. You would opt for a data lake if you generate, or have access to, large volumes of data on an ongoing basis but don't want to be restricted to specific or pre-defined use cases. Unlike data warehouses, a data lake retains all source data, without any exclusions, and the data can come from all types of sources and be of all types. Data lakes are sometimes also used as a staging area for a data warehouse. The most important role of a data lake is in predictive and advanced analytics.

Database Management System (DBMS)

A Database Management System, or DBMS, is a set of programs that creates and maintains the database. It allows you to store, modify, and extract information from the database using a function called querying.

Unix/Linux shell

A Unix/Linux shell script is a computer program written for the UNIX shell: a series of UNIX commands saved in a plain text file to accomplish a specific task.

data mart

A data mart is a sub-section of the data warehouse, built specifically for a particular business function, purpose, or community of users. The idea is to provide stakeholders data that is most relevant to them, when they need it. For example, the sales or finance teams accessing data for their quarterly reporting and projections. Since a data mart offers analytical capabilities for a restricted area of the data warehouse, it offers isolated security and isolated performance. The most important role of a data mart is business-specific reporting and analytics.

Data repository

A data repository is a general term used to refer to data that has been collected, organized, and isolated so that it can be used for business operations or mined for reporting and data analysis. It can be a small or large database infrastructure with one or more databases that collect, manage, and store data sets. Overall, data repositories help to isolate data and make reporting and analytics more efficient and credible while also serving as a data archive.

Data stream programs

Apache Kafka, Apache Spark Streaming, and Apache Storm

ETL

At a very high-level, the ETL process helps you to extract data from different data sources, transform the data into a clean and usable state, and load the data into the enterprise's data repository. ETL is how raw data is converted into analysis-ready data. It is an automated process in which you gather raw data from identified sources, extract the information that aligns with your reporting and analysis needs, clean, standardize, and transform that data into a format that is usable in the context of your organization; and load it into a data repository. While ETL is a generic process, the actual job can be very different in usage, utility, and complexity.
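
A toy end-to-end version of the process might look like the sketch below, using pandas and SQLite; the file, table, and column names are hypothetical.

    import sqlite3
    import pandas as pd

    # Extract: gather raw data from an identified source (a hypothetical CSV export).
    raw = pd.read_csv("orders_export.csv")

    # Transform: a single cleaning rule for illustration (drop rows with no amount).
    clean = raw.dropna(subset=["amount"])

    # Load: write the analysis-ready data into the enterprise data repository.
    conn = sqlite3.connect("warehouse.db")
    clean.to_sql("orders", conn, if_exists="replace", index=False)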

Web scraping tools

BeautifulSoup, Scrapy, Pandas, and Selenium

Examples of web scraper uses

Collecting product details from retailers, manufacturers, and eCommerce websites to provide price comparisons, generating sales leads through public data sources, extracting data from posts and authors on various forums and communities, and collecting training and testing datasets for machine learning models.

NoSQL:Column-based

Column-based: Column-based models store data in cells grouped as columns of data instead of rows. A logical grouping of columns, that is, columns that are usually accessed together, is called a column family. For example, a customer's name and profile information will most likely be accessed together but not their purchase history. So, customer name and profile information data can be grouped into a column family. Since column databases store all cells corresponding to a column as a continuous disk entry, accessing and searching the data becomes very fast. Column databases can be great for systems that require heavy write requests, storing time-series data, weather data, and IoT data. But if you need to use complex queries or change your querying patterns frequently, this may not be the best option for you. The most popular column databases are Cassandra and HBase.

CSVs and TSVs

Comma-separated values (CSVs) and tab-separated values (TSVs) are the most commonly used file types in this category. In CSVs the delimiter is a comma, while in TSVs the delimiter is a tab. When literal commas are present in text data and therefore cannot be used as delimiters, TSVs serve as an alternative to the CSV format, since tab stops are infrequent in running text.

soft skills

Data analysis is both a science and an art. You can ace the technical and functional skills, but one of the key differentiators for your success is going to be soft skills. These include your ability to work collaboratively with business and cross-functional teams; communicate effectively to report and present your findings; tell a compelling and convincing story; and gather support and buy-in for your work. Above all, being curious is at the heart of data analysis. In the course of your work, you will stumble upon patterns, phenomena, and anomalies that may show you a different path. The ability to allow new questions to surface and challenge your assumptions and hypotheses makes for a great analyst. You will also hear data analysis practitioners talk about intuition as a must-have quality. It's essential to note that intuition, in this context, is the ability to have a sense of the future based on pattern recognition and past experiences.

Extensible Markup Language, or XML

Extensible Markup Language, or XML, is a markup language with set rules for encoding data. The XML file format is readable by both humans and machines. It is a self-descriptive language designed for sending information over the internet. XML is similar to HTML in some respects but also has differences; for example, an XML file does not use predefined tags the way HTML does. XML is platform independent and programming language independent and therefore simplifies data sharing between various systems.
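
For instance, a small XML document can be parsed with Python's built-in xml.etree.ElementTree module; the tags below are made up for illustration, which is precisely the point: XML lets you define your own.

    import xml.etree.ElementTree as ET

    # A tiny, self-descriptive XML document with user-defined (not predefined) tags.
    doc = """
    <customers>
      <customer id="1"><name>Asha</name><city>Lisbon</city></customer>
      <customer id="2"><name>Ben</name><city>Toronto</city></customer>
    </customers>
    """

    root = ET.fromstring(doc)
    for customer in root.findall("customer"):
        print(customer.get("id"), customer.findtext("name"), customer.findtext("city"))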

Data pipelines

Data pipelines can be architected for batch processing, for streaming data, and a combination of batch and streaming data. In the case of streaming data, data processing or transformation, happens in a continuous flow. This is particularly useful for data that needs constant updating, such as data from a sensor monitoring traffic. A data pipeline is a high performing system that supports both long-running batch queries and smaller interactive queries. The destination for a data pipeline is typically a data lake, although the data may also be loaded to different target destinations, such as another application or a visualization tool. There are a number of data pipeline solutions available, most popular among them being Apache Beam and DataFlow.

Types of data selling organisations

Data stored in databases and data warehouses can be used as a source for analysis. For example, data from a retail transactions system can be used to analyse sales in different regions, and data from a customer relationship management system can be used for making sales projections. External to the organisation, there are other publicly and privately available data sets. For example, government organisations release demographic and economic data sets on an ongoing basis. There are also companies that sell specific data, for example Point-of-Sale data, financial data, or weather data, which businesses can use to define strategy, predict demand, and make decisions related to distribution or marketing promotions, among other things. Such data sets are typically made available as flat files, spreadsheet files, or XML documents.

Data streams

Data streams are another widely used source for aggregating constant streams of data flowing from sources such as instruments, IoT devices and applications, GPS data from cars, computer programs, websites, and social media posts. This data is generally timestamped and also geo-tagged for geographical identification.

Structured data

Data that follows a rigid format and can be organized neatly into rows and columns, seen typically in databases and spreadsheets, for example

Unstructured data

Data that is complex, and mostly qualitative information that is impossible to reduce to rows and columns. For example, photos, videos, text files, PDFs, and social media content.

Delimited text files

Delimited text files are text files used to store data as text in which each line, or row, has values separated by a delimiter; where a delimiter is a sequence of one or more characters for specifying the boundary between independent entities or values. Any character can be used to separate the values, but most common delimiters are the comma, tab, colon, vertical bar, and space. Each row, or horizontal line, in the text file has a set of values separated by the delimiter, and represents a record. The first row works as a column header, where each column can have a different type of data. For example, a column can be of date type, while another can be a string or integer type data. Delimited files allow field values of any length and are considered a standard format for providing straightforward information schema. They can be processed by almost all existing applications. Delimiters also represent one of various means to specify boundaries in a data stream.
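
A minimal sketch of reading such a file with Python's built-in csv module; the file name and its columns are hypothetical, and only the delimiter argument changes between comma-, tab-, or pipe-delimited data.

    import csv

    with open("employees.txt", newline="") as f:   # hypothetical delimited file
        reader = csv.reader(f, delimiter="|")      # use "," for CSV or "\t" for TSV
        header = next(reader)                      # first row holds the column names
        for row in reader:
            record = dict(zip(header, row))        # each remaining line is one record
            print(record)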

NoSQL: Document-based

Document databases store each record and its associated data within a single document. They enable flexible indexing, powerful ad hoc queries, and analytics over collections of documents. Document databases are preferable for eCommerce platforms, medical records storage, CRM platforms, and analytics platforms. However, if you're looking to run complex search queries and multi-operation transactions, a document-based database may not be the best option for you. MongoDB, DocumentDB, CouchDB, and Cloudant are some of the popular document-based databases.
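
As a rough sketch of the document model, the example below uses the pymongo driver for MongoDB; the connection string, database, collection, and fields are all hypothetical.

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")   # hypothetical local MongoDB
    orders = client["shop"]["orders"]                   # database "shop", collection "orders"

    # Each record, with all of its associated data, lives in a single document.
    orders.insert_one({"order_id": 1001, "customer": "Asha", "items": ["mug", "pen"]})

    # A flexible ad hoc query against a document field.
    for doc in orders.find({"customer": "Asha"}):
        print(doc)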

ETL uses/ data pipelines

ETL has historically been used for batch workloads on a large scale. However, with the emergence of streaming ETL tools, they are increasingly being used for real-time streaming event data as well. It's common to see the terms ETL and data pipelines used interchangeably. And although both move data from source to destination, data pipeline is a broader term that encompasses the entire journey of moving data from one system to another, of which ETL is a subset.

Ernst and Young Big Data definition

Ernst and Young offers the following definition: big data refers to the dynamic, large, and disparate volumes of data being created by people, tools, and machines. It requires new, innovative and scalable technology to collect, host, and analytically process the vast amount of data gathered in order to drive real-time business insights that relate to consumers, risk, profit, performance, productivity management, and enhanced shareholder value.

technical skills

Expertise in using spreadsheets such as Microsoft Excel or Google Sheets; proficiency in statistical analysis and visualization tools and software such as IBM Cognos, IBM SPSS, Oracle Visual Analyzer, Microsoft Power BI, SAS, and Tableau; proficiency in at least one programming language such as R, Python, or, in some cases, C++, Java, or MATLAB; good knowledge of SQL and the ability to work with data in relational and NoSQL databases; the ability to access and extract data from data repositories such as data marts, data warehouses, data lakes, and data pipelines; and familiarity with Big Data processing tools such as Hadoop, Hive, and Spark. We will cover the features and use cases of some of these programming languages, databases, data repositories, and big data processing tools further along in the course.

NoSQL: Graph-based

Graph-based databases use a graphical model to represent and store data. They are particularly useful for visualizing, analyzing, and finding connections between different pieces of data. In the graph model, nodes contain the data, and the edges between them represent relationships. Graph databases are an excellent choice for working with connected data, which is data that contains lots of interconnected relationships. They are great for social networks, real-time product recommendations, network diagrams, fraud detection, and access management. But if you want to process high volumes of transactions, a graph database may not be the best choice, because graph databases are not optimized for large-volume analytics queries. Neo4j and CosmosDB are some of the more popular graph databases.

HDFS benefits

HDFS also replicates smaller pieces onto two additional servers by default, ensuring availability when a server fails. In addition to higher availability, this offers multiple benefits. It allows the Hadoop cluster to break up work into smaller chunks and run those jobs on all servers in the cluster for better scalability. Finally, you gain the benefit of data locality, which is the process of moving the computation closer to the node on which the data resides. This is critical when working with large data sets because it minimizes network congestion and increases throughput. Some of the other benefits of HDFS include: fast recovery from hardware failures, because HDFS is built to detect faults and automatically recover; access to streaming data, because HDFS supports high data throughput rates; accommodation of large data sets, because HDFS can scale to hundreds of nodes, or computers, in a single cluster; and portability, because HDFS is portable across multiple hardware platforms and compatible with a variety of underlying operating systems.

Hadoop

Hadoop is a collection of tools that provides distributed storage and processing of big data. It is a Java-based open-source framework that allows distributed storage and processing of large datasets across clusters of computers. In a Hadoop distributed system, a node is a single computer, and a collection of nodes forms a cluster. Hadoop can scale up from a single node to any number of nodes, each offering local storage and computation. Hadoop provides a reliable, scalable, and cost-effective solution for storing data with no format requirements.

Hive

Hive is a data warehouse for data query and analysis built on top of Hadoop. Hive is an open-source data warehouse software for reading, writing, and managing large data set files that are stored directly in either HDFS or other data storage systems such as Apache HBase. Hadoop is intended for long sequential scans and, because Hive is based on Hadoop, queries have very high latency—which means Hive is less appropriate for applications that need very fast response times. Also, Hive is read-based, and therefore not suitable for transaction processing that typically involves a high percentage of write operations. Hive is better suited for data warehousing tasks such as ETL, reporting, and data analysis and includes tools that enable easy access to data via SQL.

Java uses

Java is used in a number of processes all through data analytics, including cleaning data, importing and exporting data, statistical analysis, and data visualization. In fact, most of the popular frameworks and tools used for big data are typically written in Java, such as Hadoop, Hive, and Spark. It is perfectly suited for speed-critical projects.

JavaScript Object Notation (JSON)

JavaScript Object Notation, or JSON, is a text-based open standard designed for transmitting structured data over the web. The file format is a language-independent data format that can be read in any programming language. JSON is easy to use, is compatible with a wide range of browsers, and is considered one of the best tools for sharing data of any size and type, even audio and video. That is one reason many APIs and web services return data as JSON.
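
A small sketch with Python's built-in json module; the payload is made up, but it mirrors the kind of structured data an API might return.

    import json

    payload = '{"city": "Toronto", "temperature_c": 21.5, "sensors": ["t1", "t2"]}'

    data = json.loads(payload)                 # parse JSON text into native Python objects
    print(data["city"], data["temperature_c"])

    print(json.dumps(data, indent=2))          # serialize back to JSON text for transmission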

Advantages of Python

It is easy to learn - With Python, you have the advantage of using fewer lines of code to accomplish tasks compared to other languages. It is open-source — Python is free and uses a community-based model for development. It runs on Windows and Linux environments and can be ported to multiple platforms. It has widespread community support with plenty of useful analytics libraries available. It has several open-source libraries for data manipulation, data visualization, statistics, and mathematics, to name just a few.

Applications within Python

Its vast array of libraries and functionalities includes: Pandas for data cleaning and analysis; NumPy and SciPy for statistical analysis; BeautifulSoup and Scrapy for web scraping; Matplotlib and Seaborn to visually represent data in the form of bar graphs, histograms, and pie charts; and OpenCV for image processing.
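
A minimal sketch combining a couple of these libraries; the data set is made up.

    import pandas as pd
    import matplotlib.pyplot as plt

    # Pandas for cleaning and analysis of a small, made-up data set.
    df = pd.DataFrame({"region": ["East", "West", "East", "West"],
                       "sales": [120, 95, None, 150]})
    df["sales"] = df["sales"].fillna(df["sales"].mean())   # simple cleaning step
    totals = df.groupby("region")["sales"].sum()

    # Matplotlib to represent the result as a bar graph.
    totals.plot(kind="bar")
    plt.ylabel("Sales")
    plt.show()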

Java

Java is an object-oriented, class-based, and platform-independent programming language originally developed by Sun Microsystems. It is among the top-ranked programming languages used today.

NoSQL: key-value store

In a key-value database, data is stored as a collection of key-value pairs. The key represents an attribute of the data and is a unique identifier. Both keys and values can be anything from simple integers or strings to complex JSON documents. Key-value stores are great for storing user session data and user preferences, making real-time recommendations and targeted advertising, and in-memory data caching. However, if you want to query the data on a specific data value, need relationships between data values, or need multiple unique keys, a key-value store may not be the best fit. Redis, Memcached, and DynamoDB are some well-known examples in this category.
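
A minimal sketch of the key-value model using the redis client library for Python; the key names and the local server address are assumptions.

    import redis

    r = redis.Redis(host="localhost", port=6379)    # hypothetical local Redis server

    # The key is a unique identifier; the value here is a serialized JSON string.
    r.set("session:42", '{"user": "asha", "theme": "dark"}')
    r.expire("session:42", 1800)                    # session data often has a time-to-live

    print(r.get("session:42"))                      # look the value up directly by its key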

RDBMS example

Let's take the example of a customer table that maintains data about each customer in a company. The columns, or attributes, in the customer table are the Company ID, Company Name, Company Address, and Company Primary Phone, and each row is a customer record. Now let's understand what we mean by tables being linked, or related, based on data common to each. Along with the customer table, the company also maintains transaction tables that contain data describing multiple individual transactions pertaining to each customer. The columns for the transaction table might include the Transaction Date, Customer ID, Transaction Amount, and Payment Method. The customer table and the transaction tables can be related based on the common Customer ID field. You can query the customer table to produce reports such as a customer statement that consolidates all transactions in a given period. This capability of relating tables based on common data enables you to retrieve an entirely new table from data in one or more tables with a single query. It also allows you to understand the relationships among all available data and gain new insights for making better decisions.
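
A sketch of such a consolidating query in SQL, run here through Python's built-in sqlite3 module; the table and column names follow the example above, but the exact schema is hypothetical.

    import sqlite3

    conn = sqlite3.connect("company.db")   # hypothetical database

    # Relate the two tables on the common Customer ID field and
    # consolidate each customer's transactions for a given period.
    rows = conn.execute("""
        SELECT c.company_name, SUM(t.transaction_amount) AS total
        FROM customer AS c
        JOIN transactions AS t ON t.customer_id = c.company_id
        WHERE t.transaction_date >= '2024-01-01'
        GROUP BY c.company_name
    """).fetchall()

    for company_name, total in rows:
        print(company_name, total)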

HDFS example

Let's understand this through an example. Consider a file that includes phone numbers for everyone in the United States; the numbers for people with last name starting with A might be stored on server 1, B on server 2, and so on. With Hadoop, pieces of this phonebook would be stored across the cluster. To reconstruct the entire phonebook, your program would need the blocks from every server in the cluster.

ETL step 4: Load

Load is the step where processed data is transported to a destination system or data repository. It could be: Initial loading, that is, populating all the data in the repository, Incremental loading, that is, applying ongoing updates and modifications as needed periodically; or Full refresh, that is, erasing contents of one or more tables and reloading with fresh data.

APIs (Application Programming Interfaces)

Many data providers and websites provide APIs, or Application Programming Interfaces, and web services, which multiple users or applications can interact with to obtain data for processing or analysis. APIs and web services typically listen for incoming requests, which can be in the form of web requests from users or network requests from applications, and return data in plain text, XML, HTML, JSON, or media files.
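
A minimal sketch of calling a web API with the requests library; the URL, parameters, and the shape of the response are hypothetical.

    import requests

    # Send a web request to a hypothetical endpoint and ask for the result as JSON.
    response = requests.get(
        "https://api.example.com/v1/quotes",   # hypothetical API endpoint
        params={"symbol": "IBM"},
        timeout=10,
    )
    response.raise_for_status()

    data = response.json()   # many APIs and web services return JSON
    print(data)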

Microsoft Excel Open XML Spreadsheet (XLSX)

Microsoft Excel Open XML Spreadsheet, or XLSX, is an XML-based spreadsheet file format created by Microsoft. An .XLSX file, also known as a workbook, can contain multiple worksheets. Each worksheet is organized into rows and columns, at the intersection of which is a cell, and each cell contains data. XLSX uses the open file format, which means it is generally accessible to most other applications. It can use and save all functions available in Excel and is also known as one of the more secure file formats because it cannot save malicious code.

NoSQL advantages

NoSQL was created in response to the limitations of traditional relational database technology. The primary advantage of NoSQL is its ability to handle large volumes of structured, semi-structured, and unstructured data. Some of its other advantages include: The ability to run as distributed systems scaled across multiple data centers, which enables them to take advantage of cloud computing infrastructure; An efficient and cost-effective scale-out architecture that provides additional capacity and performance with the addition of new nodes; and Simpler design, better control over availability, and improved scalability that enables you to be more agile, more flexible, and to iterate more quickly.

NoSQL database definition

NoSQL, which stands for "not only SQL," or sometimes "non SQL," is a non-relational database design that provides flexible schemas for the storage and retrieval of data. NoSQL databases have existed for many years but have only recently become more popular in the era of cloud, big data, and high-volume web and mobile applications. They are chosen today for their attributes around scale, performance, and ease of use. It's important to emphasize that the "No" in "NoSQL" is an abbreviation for "not only" and not the actual word "No." NoSQL databases are built for specific data models and have flexible schemas that allow programmers to create and manage modern applications. They do not use a traditional row/column/table database design with fixed schemas and typically do not use the structured query language (or SQL) to query data, although some may support SQL or SQL-like interfaces. NoSQL allows data to be stored in a schema-less or free-form fashion. Any data, be it structured, semi-structured, or unstructured, can be stored in any record.

Hadoop: HDFS (Hadoop Distributed File System)

One of the four main components of Hadoop is the Hadoop Distributed File System, or HDFS, which is a storage system for big data that runs on multiple commodity hardware servers connected through a network. HDFS provides scalable and reliable big data storage by partitioning files over multiple nodes. It splits large files across multiple computers, allowing parallel access to them. Computations can, therefore, run in parallel on each node where data is stored. It also replicates file blocks on different nodes to prevent data loss, making it fault-tolerant.

Advantages of RDBMS

One of the most significant advantages of the relational database approach is its ability to create meaningful information by joining tables. Some of its other advantages include: Flexibility: Using SQL, you can add new columns, add new tables, rename relations, and make other changes while the database is running and queries are happening. Reduced redundancy: Relational databases minimize data redundancy. For example, the information of a customer appears in a single entry in the customer table, and the transaction table pertaining to the customer stores a link to the customer table. Ease of backup and disaster recovery: Relational databases offer easy export and import options, making backup and restore easy. Exports can happen while the database is running, making restore on failure easy. Cloud-based relational databases do continuous mirroring, which means the loss of data on restore can be measured in seconds or less. ACID-compliance: ACID stands for Atomicity, Consistency, Isolation, and Durability. And ACID compliance implies that the data in the database remains accurate and consistent despite failures, and database transactions are processed reliably.

Relational database case study

Online Transaction Processing: OLTP applications are focused on transaction-oriented tasks that run at high rates. Relational databases are well suited for OLTP applications because they can accommodate a large number of users; they support the ability to insert, update, or delete small amounts of data; and they also support frequent queries and updates as well as fast response times. Data warehouses: In a data warehousing environment, relational databases can be optimized for online analytical processing (or OLAP), where historical data is analyzed for business intelligence. IoT solutions: Internet of Things (IoT) solutions require speed as well as the ability to collect and process data from edge devices, which need a lightweight database solution.

PDF

Portable Document Format, or PDF, is a file format developed by Adobe to present documents independent of application software, hardware, and operating systems, which means it can be viewed the same way on any device. This format is frequently used in legal and financial documents and can also be used to fill in data such as for forms.

PowerShell

PowerShell is a cross-platform automation tool and configuration framework by Microsoft that is optimized for working with structured data formats, such as JSON, CSV, XML, and REST APIs, websites, and office applications. It consists of a command-line shell and scripting language. PowerShell is object-based, which makes it possible to filter, sort, measure, group, compare, and perform many other actions on objects as they pass through a data pipeline. It is also a good tool for data mining, building GUIs, and creating charts, dashboards, and interactive reports.

functional skills

Proficiency in Statistics to help you analyze your data, validate your analysis, and identify fallacies and logical errors. Analytical skills that help you research and interpret data, theorize, and make forecasts. Problem-solving skills, because ultimately, the end-goal of all data analysis is to solve problems. Probing skills that are essential for the discovery process, that is, for understanding a problem from the perspective of varied stakeholders and users—because the data analysis process really begins with a clear articulation of the problem statement and desired outcome. Data Visualization skills that help you decide on the techniques and tools that present your findings effectively based on your audience, type of data, context, and end-goal of your analysis. Project Management skills to manage the process, people, dependencies, and timelines of the initiative.

Programming languages

Programming languages are designed for developing applications and controlling application behaviour; for example, Python, R, and Java

Python

Python is a widely used open-source, general-purpose, high-level programming language. Its syntax allows programmers to express their concepts in fewer lines of code compared to some of the older languages. Python is perceived as one of the easiest languages to learn and has a large developer community. Because of its focus on simplicity and readability, and its low learning curve, it's an ideal tool for beginning programmers. It is great for performing computationally heavy tasks on vast amounts of data, which can otherwise be extremely time-consuming and cumbersome. Python provides libraries like NumPy and Pandas, which ease such tasks through parallel processing. It has built-in functions for almost all of the frequently used concepts. Python supports multiple programming paradigms, such as object-oriented, imperative, functional, and procedural, making it suitable for a wide variety of use cases.

Query languages

Query languages are designed for accessing and manipulating data in a database; for example, SQL

R

R is an open-source programming language and environment for data analysis, data visualization, machine learning, and statistics.

Limitations of RDBMS

RDBMS does not work well with semi-structured and unstructured data and is, therefore, not suitable for extensive analytics on such data. For migration between two RDBMSs, schemas and type of data need to be identical between the source and destination tables. Relational databases have a limit on the length of data fields, which means if you try to enter more information into a field than it can accommodate, the information will not be stored. Despite the limitations and the evolution of data in these times of big data, cloud computing, IoT devices, and social media, RDBMS continues to be the predominant technology for working with structured data.

Key differences between RDBMS and NoSQL

RDBMS schemas rigidly define how all data inserted into the database must be typed and composed, whereas NoSQL databases can be schema-agnostic, allowing unstructured and semi-structured data to be stored and manipulated. Maintaining high-end, commercial relational database management systems is expensive whereas NoSQL databases are specifically designed for low-cost commodity hardware. Relational databases, unlike most NoSQL, support ACID-compliance, which ensures reliability of transactions and crash recovery. RDBMS is a mature and well-documented technology, which means the risks are more or less perceivable as compared to NoSQL, which is a relatively newer technology. Nonetheless, NoSQL databases are here to stay, and are increasingly being used for mission critical applications.

RDBMS formats

Relational databases range from small desktop systems to massive cloud-based systems. They can be either: open-source and internally supported, open-source with commercial support, or commercial closed-source systems. IBM DB2, Microsoft SQL Server, MySQL, Oracle Database, and PostgreSQL are some of the popular relational databases. Cloud-based relational databases, also referred to as Database-as-a-Service, are gaining wide use as they have access to the limitless compute and storage capabilities offered by the cloud. Some of the popular cloud relational databases include Amazon Relational Database Service (RDS), Google Cloud SQL, IBM DB2 on Cloud, Oracle Cloud, and SQL Azure. RDBMS is a mature and well-documented technology, making it easy to learn and find qualified talent.

relational database (RDBMS)

Relational databases build on the organizational principles of flat files, with data organized into a tabular format with rows and columns following a well-defined structure and schema. A relational database is a collection of data organized into a table structure, where the tables can be linked, or related, based on data common to each. Tables are made of rows and columns, where the rows are the "records" and the columns the "attributes". However, unlike flat files, RDBMSes are optimized for data operations and querying involving many tables and much larger data volumes. Structured Query Language, or SQL, is the standard querying language for relational databases. Relational databases use SQL for querying data, which gives you the advantage of processing millions of records and retrieving large amounts of data in a matter of seconds. Moreover, the security architecture of relational databases provides controlled access to data and also ensures that the standards and policies for governing data can be enforced.

Structured data sources

SQL Databases and Online Transaction Processing (or OLTP) Systems that focus on business transactions, Spreadsheets such as Excel and Google Spreadsheets, Online forms, Sensors such as Global Positioning Systems (or GPS) and Radio Frequency Identification (or RFID) tags; and Network and Web server logs.

Relational database examples

SQL Server, Oracle, MySQL, and IBM DB2

Unix/Linux shell uses

Writing a shell script is fast and easy. It is most useful for repetitive tasks that may be time-consuming to execute by typing one line at a time. Typical operations performed by shell scripts include: file manipulation, program execution, system administration tasks such as disk backups and evaluating system logs, installation scripts for complex programs, executing routine backups, running batches.

Advantages of using SQL

SQL is portable and can be used independently of the platform. It can be used for querying data in a wide variety of databases and data repositories, although each vendor may have some variations and special extensions. It has a simple syntax that is similar to the English language, and its syntax allows developers to write programs with fewer lines than some other programming languages, using basic keywords such as SELECT, INSERT, INTO, and UPDATE. It can retrieve large amounts of data quickly and efficiently, and it runs on an interpreter system, which means code can be executed as soon as it is written, making prototyping quick and easy. SQL is one of the most popular querying languages. Due to its large user community and the sheer volume of documentation accumulated over the years, it continues to provide a uniform platform, worldwide, to all its users.

SQL (Structured Query Language)

SQL, or Structured Query Language, is a querying language designed for accessing and manipulating information from, mostly, though not exclusively, relational databases.

Semi structured data long definition

Semi-structured data is data that has some organizational properties but lacks a fixed or rigid schema. Semi-structured data cannot be stored in the form of rows and columns as in databases. It contains tags and elements, or metadata, which is used to group data and organize it in a hierarchy. XML and JSON allow users to define tags and attributes to store data in a hierarchical form and are used widely to store and exchange semi-structured data.

Factors for choosing a database

Several factors influence the choice of database, such as the data type and structure, querying mechanisms, latency requirements, transaction speeds, and intended use of the data.

Shell and Scripting languages

Shell and Scripting languages, such as Unix/Linux Shell, and PowerShell, are ideal for repetitive and time-consuming operational tasks.

Advantages of R

Some of the key benefits of R include the following: It is an open-source platform-independent programming language, It can be paired with many programming languages, including Python, It is highly extensible, which means developers can continue to add functionalities by defining new functions, It facilitates the handling of structured as well as unstructured data which means it has a more comprehensive data capability, It has libraries such as Ggplot2 and Plotly that offer aesthetic graphical plots to its users, You can make reports with the data and scripts embedded in them; also, interactive web apps that allow users to play with the results and the data, It is dominant among other programming languages for developing statistical tools.

Semi structured data sources

Some of the sources of semi-structured data could include: E-mails, XML, and other markup languages, Binary executables, TCP/IP packets, Zipped files, Integration of data from different sources.

Spark

Spark is a distributed data analytics framework designed to perform complex data analytics in real time. It is a general-purpose data processing engine designed to extract and process large volumes of data for a wide range of applications, including interactive analytics, stream processing, machine learning, data integration, and ETL. It takes advantage of in-memory processing to significantly increase the speed of computations, spilling to disk only when memory is constrained. Spark has interfaces for major programming languages, including Java, Scala, Python, R, and SQL. It can run using its standalone clustering technology as well as on top of other infrastructures such as Hadoop, and it can access data in a large variety of data sources, including HDFS and Hive, making it highly versatile. The ability to process streaming data fast and perform complex analytics in real time is the key use case for Apache Spark.
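
A minimal PySpark sketch, assuming a local Spark installation and a made-up CSV file with hypothetical column names.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sales-demo").getOrCreate()

    # Read a (hypothetical) CSV file into a distributed DataFrame.
    df = spark.read.csv("sales.csv", header=True, inferSchema=True)

    # A simple aggregation; Spark plans and runs this across the cluster,
    # keeping data in memory wherever possible.
    df.groupBy("region").sum("amount").show()

    spark.stop()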

DA skill types

Technical, functional, soft

Big Data tools

The scale of the data being collected means that it's not feasible to use conventional data analysis tools; however, alternative tools that leverage distributed computing power can overcome this problem. Tools such as Apache Spark, Hadoop, and its ecosystem provide ways to extract, load, analyze, and process the data across distributed compute resources, providing new insights and knowledge.

Examples of APIs being used for analysis

The use of Twitter and Facebook APIs to source data from tweets and posts for performing tasks such as opinion mining or sentiment analysis, which is to summarize the amount of appreciation and criticism on a given subject, such as policies of a government, a product, a service, or customer satisfaction in general. Stock market APIs are used for pulling data such as share and commodity prices, earnings per share, and historical prices, for trading and analysis. Data lookup and validation APIs can be very useful for data analysts for cleaning and preparing data, as well as for correlating data—for example, to check which city or state a postal or zip code belongs to. APIs are also used for pulling data from database sources, within and external to the organization.

RSS (Really Simple Syndication) feeds

These are typically used for capturing updated data from online forums and news sites where data is refreshed on an ongoing basis. Using a feed reader, which is an interface that converts RSS text files into a stream of updated data, updates are streamed to user devices.
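
As an illustration, the feedparser library can play the role of a simple feed reader; the feed URL is hypothetical.

    import feedparser

    # Parse an RSS feed and stream the latest entries to the user.
    feed = feedparser.parse("https://example.com/news/rss.xml")   # hypothetical feed URL

    for entry in feed.entries[:5]:
        print(entry.title, "-", entry.link)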

ETL step 3: Transform

Transform involves the execution of rules and functions that convert raw data into data that can be used for analysis. Examples include making date formats and units of measurement consistent across all sourced data, removing duplicate data, filtering out data that you do not need, enriching data (for example, splitting a full name into first, middle, and last names), establishing key relationships across tables, and applying business rules and data validations.
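
A few of these rules sketched with pandas; the source file and column names are hypothetical.

    import pandas as pd

    raw = pd.read_csv("customers_raw.csv")                    # hypothetical source extract

    raw = raw.drop_duplicates()                               # remove duplicate data
    raw["signup_date"] = pd.to_datetime(raw["signup_date"])   # make date formats consistent

    # Enrich: split a full name into first and last names (a simple two-part case).
    raw[["first_name", "last_name"]] = raw["full_name"].str.split(" ", n=1, expand=True)

    # Business rule / data validation: keep only rows with a positive balance.
    clean = raw[raw["balance"] > 0]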

Unstructured data storage

Unstructured data can be stored in files and documents (such as a Word doc) for manual analysis or in NoSQL databases that have their own analysis tools for examining this type of data.

Unstructured data long definition

Unstructured data is data that does not have an easily identifiable structure and, therefore, cannot be organized in a mainstream relational database in the form of rows and columns. It does not follow any particular format, sequence, semantics, or rules. Unstructured data can deal with the heterogeneity of sources and has a variety of business intelligence and analytics applications.

Hadoop capabilities

Using Hadoop, you can: Incorporate emerging data formats, such as streaming audio, video, social media sentiment, and clickstream data, along with structured, semi-structured, and unstructured data not traditionally used in a data warehouse. Provide real-time, self-service access for all stakeholders. Optimize and streamline costs in your enterprise data warehouse by consolidating data across the organization and moving "cold" data, that is, data that is not in frequent use, to a Hadoop-based system.

Uses of SQL

Using SQL, we can write a set of instructions to perform operations such as Insert, update, and delete records in a database; Create new databases, tables, and views; and Write stored procedures—which means you can write a set of instructions and call them for later use

Big Data: Value

Value is our ability and need to turn data into value. Value isn't just profit. It may have medical or social benefits, as well as customer, employee or personal satisfaction. The main reason that people invest time to understand big data is to derive value from it.

Big Data: Variety

Variety is the diversity of the data. Structured data fits neatly into rows and columns in relational databases, while unstructured data is not organized in a predefined way, like tweets, blog posts, pictures, numbers, and video. Variety also reflects that data comes from different sources: machines, people, and processes, both internal and external to organizations. Drivers are mobile technologies, social media, wearable technologies, geo technologies, video, and many, many more. Let's think about the different types of data: text, pictures, film, sound, health data from wearable devices, and many different types of data from devices connected to the Internet of Things.

Big Data: Velocity

Velocity is the speed at which data accumulates. Data is being generated extremely fast in a process that never stops. Near or real-time streaming, local, and cloud-based technologies can process information very quickly. Every 60 seconds, hours of footage are uploaded to YouTube, which is generating data. Think about how quickly data accumulates over hours, days, and years.

Big Data: Veracity

Veracity is the quality and origin of data and its conformity to facts and accuracy. Attributes include consistency, completeness, integrity, and ambiguity. Drivers include cost and the need for traceability. With the large amount of data available, the debate rages on about the accuracy of data in the digital age. Is the information real or is it false? Eighty percent of data is considered to be unstructured and we must devise ways to produce reliable and accurate insights. The data must be categorized, analyzed, and visualized.

Big Data: Volume

Volume is the scale of the data or the increase in the amount of data stored. Drivers of volume are the increase in data sources, higher resolution sensors, and scalable infrastructure. The world population is approximately 7 billion people and the vast majority are now using digital devices. Mobile phones, desktop and laptop computers, wearable devices, and so on. These devices all generate, capture, and store data approximately 2.5 quintillion bytes every day. That's the equivalent of 10 million blu-ray DVDs.

Unstructured data sources

Web pages, social media feeds, images in varied file formats (such as JPEG, GIF, and PNG), video and audio files, documents and PDF files, PowerPoint presentations, media logs, and surveys.

web scraping

Web scraping is used to extract relevant data from unstructured sources. Also known as screen scraping, web harvesting, and web data extraction, web scraping makes it possible to download specific data from web pages based on defined parameters. Web scrapers can, among other things, extract text, contact information, images, videos, product items, and much more from a website.
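
A minimal scraping sketch with the requests and BeautifulSoup libraries; the URL and the CSS selectors are hypothetical and would need to match the page actually being scraped.

    import requests
    from bs4 import BeautifulSoup

    page = requests.get("https://example.com/products", timeout=10)   # hypothetical page
    soup = BeautifulSoup(page.text, "html.parser")

    # Download specific data based on defined parameters: here, every
    # product name and price on the page (the selectors are made up).
    for item in soup.select("div.product"):
        name = item.select_one("h2").get_text(strip=True)
        price = item.select_one("span.price").get_text(strip=True)
        print(name, price)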

R uses

Widely used for developing statistical software and performing data analytics, it is especially known for its ability to create compelling visualizations, giving it an edge over some of the other languages in this space.

