Data Analytics C756 / CIW Data Analyst
KNIME
KNIME stands for KoNstanz Information MinEr and is pronounced [naim] (that is, with a silent "k", just as in "knife"). It is developed by KNIME AG, located in Zurich, together with the group of Michael Berthold at the University of Konstanz (Chair for Bioinformatics and Information Mining). KNIME promotes the ability to provide end-to-end analytics, and both data blending and tool blending are provided through the various tools available on the platform. (For KNIME's own description of metanodes, see the "metanode" entry below.)
metanode
KNIME's website promotes what it calls metanodes in the following way: "KNIME LocalSpace Repository sits on your local machine and gives you a repository for checking in metanodes. After you have checked in a metanode, you can drag and drop or copy and paste that metanode into your workflows as many times as you like. KNIME keeps a link of that copy. Now, whenever the repository metanode is updated or a new version is checked in, every instantiation of that copy can be automatically updated based on your setting! A fantastic time saver for you, the KNIME user."
Gartner's Magic Quadrant
Magic Quadrant (MQ) refers to a series of market research reports published by the IT consulting firm Gartner that provides a wide-angle view of the relative positions of competitors in a market. It enables you to create a view that reflects your own business goals, needs and priorities, which helps the DPO evaluate a vendor before a specific technology product, service or solution is purchased. It is a 2D matrix representation technique that maps the strengths of and differences between companies into four distinct quadrants/sections, based on completeness of vision (the x-axis) and ability to execute (the y-axis), as shown in the attached figure:
*Leaders execute well against their current vision and are well positioned for tomorrow.
*Visionaries understand where the market is going, or have a vision for changing market rules, but do not yet execute well.
*Niche Players focus successfully on a small segment, or are unfocused and do not out-innovate or outperform others.
*Challengers execute well today, or may dominate a large segment, but do not demonstrate an understanding of market direction.
MapReduce
MapReduce is the feature in Hadoop that enables parallel processing of large data sets; it manages communication and data transfer among the different parts of the system. A MapReduce program is a composite that consists of a Map procedure, which performs filtering and sorting, and a Reduce method, which performs a summary operation. It is an open-source application programming interface (API) that provides fast data analytics services, and one of the main Big Data technologies that allow organizations to process massive data stores.
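To make the Map and Reduce roles concrete, here is a minimal single-machine sketch in Python of a MapReduce-style word count. This illustrates the concept only, not Hadoop's actual Java API, and the sample documents are invented:

```python
# Minimal sketch of the MapReduce idea: map emits (key, value) pairs,
# the pairs are grouped by key (the "shuffle"), and reduce summarizes
# each group. Real Hadoop runs this distributed across a cluster.
from itertools import groupby
from operator import itemgetter

def map_phase(document):
    # Map: emit (word, 1) for each word in the document.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle/sort: group pairs by key; Reduce: sum each group's counts.
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

docs = ["big data big insights", "big clusters"]       # invented input
pairs = [pair for doc in docs for pair in map_phase(doc)]
print(dict(reduce_phase(pairs)))
# {'big': 3, 'clusters': 1, 'data': 1, 'insights': 1}
```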
Tableau Public - notes
Once data is collected, Tableau Public can connect that information to Google Analytics, connect it with other data, schedule automatic refresh tasks, remove unique formatting and unneeded information, and even connect to data sources that have been published to Tableau Server. Tableau Public allows a user to take raw data and do the following:
*Manage and review data in a visual display. Tableau Public offers users the ability to enhance data using various displays and visuals, and the data can then be automatically refreshed using Tableau Public's online sync application. This data can be drawn from Google Analytics, and additionally from Salesforce.com, Force.com and database.com. If a user has data coming from three different sources, the user may want to look at all of this data in one simplified visual: Tableau Public extracts the data and, based on the goals for the data, produces a visual display. The user can then adjust the "controls" to perform bulk management tasks, pivot the data from one format to another, and blend in data from other sources.
*Perform visual analysis. Forecast, categorize and use a variety of filters to break down the data. This capability allows users to compare budgets to actual goals using bullet graphs, and also allows the user to display values in drop-down lists. In addition to the connections one can make through Tableau Public, adding bullet graphs, showing trends and visually grouping data together can help in the data's presentation.
*Create calculations. Create unique calculations of different variables, and compute totals and subtotals. The calculations option in Tableau Public allows a user to create dynamic variables to use in calculated fields, add calculations that apply to an entire table, and compute grand totals and subtotals. With Tableau Public's calculations ability, a user can create calculated fields from a multi-dimensional table and take advantage of various drag-and-drop support functions.
*Create dashboards from the data. With the dashboard tool, the user can select items in a view to dynamically filter the specifics of the calculations; if only specific data should be shown, the dashboard tool in Tableau Public can do that. In addition, the user can tell a story with Story Points in Tableau Public.
Hadoop - notes
Open-source software framework that enables distributed parallel processing of huge amounts of data across many inexpensive computers. Hadoop takes data, processes it using the configured input format, and then calls the getSplits() function to compute splits for each file; each InputSplit defines the unit of work that comprises a single map task. Hadoop requires some tradeoffs in exchange for this power, in that it is slower than traditional relational databases when working with small datasets: a company with a dataset that has merely outgrown a spreadsheet, something in the realm of 500 megabytes, would be better served by a relational database such as MySQL, MS SQL Server or Postgres. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Hadoop's common components:
*HDFS (Hadoop Distributed File System): a file system designed for managing a large number of potentially very large files in a highly distributed environment
*MapReduce: the feature in Hadoop for parallel processing of large data sets; it manages communication and data transfer among different parts of the system
*YARN (Yet Another Resource Negotiator): a cluster management technology responsible for managing resources
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Hadoop-common refers to the commonly used utilities and libraries that support the Hadoop modules. Hadoop-core is the same as Hadoop-common; it was renamed to Hadoop-common in July 2009, per https://hadoop.apache.org/. Hadoop-client refers to the client libraries used to communicate with Hadoop's common components (HDFS, MapReduce, YARN), including, for example, logging and codecs. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Facebook has been using Hadoop since 2010, and according to various research, Hadoop has been used by more than half of the Fortune 500 companies.
Hadoop
Open-source software framework that enables distributed parallel processing of huge amounts of data across many inexpensive computers. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thus delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.
OpenRefine
OpenRefine is a tool that can take disorganized data and transform it from one format to another; it can also extend that data with Web services and external data. The key areas of OpenRefine include:
*Importing data in various formats
*Exploring datasets in a matter of seconds
*Applying basic and advanced cell transformations
*Dealing with cells that contain multiple values
*Creating instantaneous links between data sets
*Filtering data with regular expressions
*Performing advanced data operations with the General Refine Expression Language
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ In OpenRefine, a facet helps you both get an overview of big data and bring more consistency to the data; it is a way to isolate certain records that share features. Types of facets include:
*Numeric facets
*Timeline facets (for dates)
*Custom facets
*Scatterplot facets
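As a rough illustration of what a facet does (OpenRefine itself is driven through its GUI and GREL, not Python), here is a pandas analogy; the column name and values are invented for the example:

```python
# Analogy to a facet, using pandas rather than OpenRefine itself:
# a facet tallies the distinct values in a column so you can spot
# inconsistencies (e.g., "NYC" vs "New York") and isolate the
# records that share a value.
import pandas as pd

df = pd.DataFrame({"city": ["NYC", "New York", "NYC", "Boston"]})
print(df["city"].value_counts())   # the "facet": each value and its count
subset = df[df["city"] == "NYC"]   # isolating records that share a value
print(subset)
```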
PaaS
Platform as a Service. Provides cloud customers with an easy-to-configure operating system and on-demand computing capabilities. Compare to IaaS and SaaS.
PostgreSQL
PostgreSQL (pronounced "post-gress-Q-L"), often simply Postgres, is an open-source relational DBMS developed by a worldwide team of volunteers. It supports transactions, triggers, views, foreign-key referential integrity, sub-selects and sophisticated locking. It runs on numerous platforms, including Linux, most flavors of UNIX, Mac OS X, Solaris, Tru64 and Windows. It supports text, images, sounds and videos, and includes programming interfaces for C/C++, Java, Perl, Python, Ruby, Tcl and Open Database Connectivity (ODBC). PostgreSQL has developed strong features to boost performance, including:
*Asynchronous commit
*A cost-based optimizer
*Asynchronous as well as synchronous replication
*Several indexing features, including functional and partial indexes, index-only scans and multiple-index counting
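As a sketch of how an analyst might query Postgres programmatically, the following uses the psycopg2 driver; the connection parameters and the sales table are hypothetical placeholders, not anything defined in these notes:

```python
# Minimal sketch of querying PostgreSQL from Python with psycopg2
# (pip install psycopg2-binary). Substitute your own database,
# credentials and schema -- the ones below are made up.
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="analytics",
                        user="analyst", password="secret")
with conn, conn.cursor() as cur:   # "with conn" commits the transaction
    cur.execute("SELECT region, SUM(amount) FROM sales GROUP BY region;")
    for region, total in cur.fetchall():
        print(region, total)
conn.close()
```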
R / R Project
R is a programming language and free software environment for statistical computing and graphics, supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians and data miners for developing statistical software and performing data analysis. The R Project serves the purpose of data analytics software but is, at its core, a programming language and software environment for running statistical analysis and graphics on data sets. Being a programming language, R's user interface looks much more like that of other languages such as Python, Java, HTML or C++ than like Microsoft Excel, which utilizes spreadsheets, or SPSS, which uses more point-and-click features. As such, R's great asset is its flexibility as a fully comprehensive language that can run any type of required data analysis in a few lines of code. R has a fantastic mechanism for creating data structures: if you are doing data analysis, you want to be able to put your data into its natural form, rather than warp it into whatever structure happens to be available. Also, real data have missing values, and these are an integral part of the R language; many functions have arguments that control how missing values are handled.
RStudio
RStudio is an integrated development environment (IDE) for R, a programming language for statistical computing and graphics. The RStudio IDE is developed by RStudio, Inc., a commercial enterprise founded by JJ Allaire, creator of the programming language ColdFusion. RStudio, Inc. has no formal connection to the R Foundation, a not-for-profit organization located in Vienna, Austria, which is responsible for overseeing development of the R environment for statistical computing.
click-stream data
Sequence or 'flow' of mouse clicks or keystrokes which a user makes in navigating through webpages or websites. Web advertisers use the clickstream record to determine the user's interests and preferences in order to custom-tailor their advertisements. Click-stream data is a good example of structured business data. This data can be analyzed to determine customer behavior and buying patterns.
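A toy illustration of what click-stream records might look like and how they can be tallied; the field names and values here are invented for the example, not a standard schema:

```python
# Invented click-stream records and a simple pages-per-session tally,
# the kind of aggregation used to study navigation behavior.
from collections import Counter

clicks = [
    {"session": "s1", "page": "/home"},
    {"session": "s1", "page": "/product/42"},
    {"session": "s2", "page": "/home"},
]
pages_per_session = Counter(c["session"] for c in clicks)
print(pages_per_session)  # Counter({'s1': 2, 's2': 1})
```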
Service Cloud 2
Service Cloud 2 is the software that has been developed by Salesforce for extracting useful CRM data, specifically from social media sites. For example, Service Cloud 2 automatically initiates a customer service case every time someone posts a tweet that matches a set of pre-specified key words.
Tableau software
Simple drag-and-drop tools for visualizing data from spreadsheets and other databases.
SaaS
Software as a Service; a subscription service under which you purchase software licenses that expire at a certain date. The vendor hosts the software online, and the user accesses and uses the software over the Internet.
three steps to obtaining good data to make more optimal business decisions:
Step 1: Capturing data
Step 2: Analyzing data
Step 3: Using data
Supply chain management (SCM)
Supply chain management (SCM) refers to the design, planning, execution and monitoring of products, information and finances as they flow from the initial supplier of raw materials all the way to the final consumer. SCM seeks to integrate and harmonize the flow of these components, both within and across companies, with the objective of optimizing inventory management and creating net value.
Public Key Infrastructure (PKI)
System for creating public and private keys using a certificate authority (CA) and digital certificates for authentication. Public key infrastructure supports the distribution and identification of public encryption keys, enabling users and computers both to exchange data securely over networks such as the Internet and to verify the identity of the other party.
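The sketch below shows the public/private-key primitive that PKI builds on, using Python's third-party cryptography package. A real PKI additionally wraps the public key in a CA-signed digital certificate, which is omitted here; the message is invented:

```python
# Key-pair generation plus sign/verify with the "cryptography" package
# (pip install cryptography). In a PKI, a CA would vouch for the public
# key via a certificate; this sketch shows only the raw primitive.
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes

private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

message = b"order #1001: ship 5 units"
signature = private_key.sign(message, padding.PKCS1v15(), hashes.SHA256())

# verify() raises InvalidSignature if the message or key does not match.
public_key.verify(signature, message, padding.PKCS1v15(), hashes.SHA256())
print("signature verified")
```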
DSO (Days Sales Outstanding)
DSO (Days Sales Outstanding) is a measure of the average number of days that a company takes to collect revenue after a sale has been made.
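DSO is commonly computed as follows (this is the standard formulation, added here rather than quoted from the course text):

\[
\text{DSO} = \frac{\text{Accounts Receivable}}{\text{Total Credit Sales}} \times \text{Number of Days in Period}
\]

For example, $150,000 in receivables against $600,000 of credit sales over a 90-day period gives a DSO of (150,000 / 600,000) x 90 = 22.5 days.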
GDPR
The General Data Protection Regulation (GDPR) (EU) 2016/679 is a regulation in EU law on data protection and privacy for all individuals within the European Union (EU) and the European Economic Area (EEA). GDPR came into effect across the EU on May 25, 2018. Every user must be informed about their privacy rights under GDPR, with a data protection policy that includes:
*Right to process the data at any time
*Right to view and access their personal data
*Right to get a copy of the stored data
*Right to have the stored data removed under certain circumstances
*Right to file complaints about data loss
As data analysts, we should remember that GDPR is a privacy regulation, not a data security regulation.
IaaS
The Infrastructure as a Service (IaaS) model provides access to computing resources in a virtualized environment across a public connection. In the case of IaaS, the computing resource provided is virtualized hardware, i.e. computing infrastructure such as network connections, virtual server space and load balancers. IaaS is a cloud computing technology useful for heavily utilized systems and networks: organizations can limit their hardware footprint and personnel costs by renting access to hardware such as servers. Compare to PaaS and SaaS.
Sarbanes-Oxley Act of 2002
The Sarbanes-Oxley Act was passed by the US Congress in 2002 to protect investors from the possibility of fraudulent accounting activities by corporations. It established requirements for proper financial record keeping for public companies, with penalties of as much as 25 years in prison for noncompliance.
funnel graph (funnel chart)
The funnel chart is used to visualize the progressive reduction of data as it passes from one phase to another. A funnel chart is most suited to representing stages in the sales process and projecting revenue at each stage. This graph can also help in recognizing an organization's problems in its sales areas.
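A minimal sketch of a funnel chart using Plotly Express in Python; the stage names and counts are invented, and Tableau or any other charting tool could draw the same figure:

```python
# Funnel chart sketch with Plotly Express (pip install plotly):
# each stage shows how many prospects remain as the sale progresses.
import plotly.express as px

fig = px.funnel(
    x=[1000, 400, 150, 60],                     # invented counts per stage
    y=["Visits", "Leads", "Quotes", "Closed"],  # invented sales stages
)
fig.show()
```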
three types of vendor lock-in
The three types of vendor lock-in are: platform lock-in, data lock-in and tools lock-in. In platform lock-in, the cloud services are built on a particular virtualization platform, such as VMware or Xen. With data lock-in, the standards of data ownership may not be clearly documented. In tools lock-in, proprietary tools will work only in a particular vendor's cloud environment.
examples of webcast tools
Tools such as vsee.com, Adobe Connect, or gotomeeting.com are used to present important and valuable information to an audience that logs into the webcast from anywhere. Webcasts take place in a distance-oriented environment.
decision tree
a graph of decisions and their possible consequences; it is used to create a plan to reach a goal. Decision trees are used to forecast the consequences of decisions, which may include chance-event outcomes, resource costs and utilities. They help identify the strategy that is most likely to reach the goal and forecast the outcomes of the decisions taken.
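A small example with scikit-learn's DecisionTreeClassifier, using made-up loan-approval data, shows how a fitted tree encodes decisions and forecasts the outcome for a new case:

```python
# Decision tree sketch with scikit-learn (pip install scikit-learn).
# The features, labels and threshold values are invented for illustration.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[60, 10], [20, 15], [45, 30], [80, 5]]   # [income, debt] in $1000s
y = [1, 0, 0, 1]                              # 1 = approve, 0 = decline

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["income", "debt"]))  # the learned rules
print(tree.predict([[50, 8]]))                # forecast for a new case
```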
Google Analytics
a service offered by Google that evaluates the effectiveness of websites and profiles their users by collecting and analyzing data; a freemium web analytics service offered by Google that tracks and reports website traffic
cgroups
cgroups (abbreviated from control groups) is a Linux kernel feature that limits, accounts for, and isolates the resource usage (CPU, memory, disk I/O, network, etc.) of a collection of processes. Engineers at Google (primarily Paul Menage and Rohit Seth) started the work on this feature in 2006 under the name "process containers". In late 2007, the nomenclature changed to "control groups" to avoid confusion caused by multiple meanings of the term "container" in the Linux kernel context, and the control groups functionality was merged into the Linux kernel mainline in kernel version 2.6.24, which was released in January 2008. Since then, developers have added many new features and controllers, such as support for kernfs in 2014, firewalling, and unified hierarchy.
mashup
combining of data from more than one source to create something new.
story points
a feature of Tableau; Story Points gives the author the ability to present a narrative. As part of that narrative, the author can highlight certain insights and provide additional context. The narrative structure also gives the author a chance to break the story into pieces that build on each other so the reader can consume them more easily.
transaction data
generated and captured by operational systems, transaction data describes the business's activities, or transactions; it reveals useful information about the company's current state of health.
wireframe diagram
is used to prototype the design of screens of a system; shows the structure of a web page using only outlines for each content type and widget; a graphic design method where boxes show the position of a Web page's objects; an image or set of images which displays the functional elements of a website or page, typically used for planning a site's structure and functionality
PCI DSS
Payment Card Industry Data Security Standard; a set of security requirements for organizations that handle credit card data, intended to protect cardholder information and prevent identity theft.
cloud computing
the practice of using a network of remote servers hosted on the Internet to store, manage, and process data, rather than a local server or a personal computer; a distributed model of Internet-based computing, where computers and other devices use shared resources, data and information.
three V's of big data
volume, velocity, variety
blockchain
- A blockchain is a shared ledger where transactions are permanently recorded by appending blocks.
- The blockchain serves as a historical record of all transactions that ever occurred, from the genesis block to the latest block, hence the name blockchain.
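A toy Python sketch (invented transactions, no networking or consensus) of how blocks are linked by hashes, which is what makes the appended record effectively permanent:

```python
# Toy chain of blocks linked by SHA-256 hashes: changing any block
# changes its hash and therefore breaks every later "prev" link.
import hashlib, json

def block_hash(block):
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

chain = [{"index": 0, "prev": None, "data": "genesis"}]        # genesis block
for i, data in enumerate(["tx: A->B 5", "tx: B->C 2"], start=1):
    chain.append({"index": i, "prev": block_hash(chain[-1]), "data": data})

# Verify: each block must reference the hash of its predecessor.
ok = all(chain[i]["prev"] == block_hash(chain[i - 1]) for i in range(1, len(chain)))
print("chain valid:", ok)  # True
```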
bullet graph
A bullet graph is a variation of a bar graph developed by Stephen Few. Seemingly inspired by the traditional thermometer charts and progress bars found in many dashboards, the bullet graph serves as a replacement for dashboard gauges and meters.
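A rough matplotlib sketch of a bullet graph, with all numbers invented: shaded bands for the qualitative ranges, a narrow bar for the measure, and a line for the target:

```python
# Bullet graph sketch in matplotlib: draw the widest range first so the
# narrower (darker) ranges overlay it, then the measure bar and target line.
import matplotlib.pyplot as plt

ranges, measure, target = [50, 75, 100], 68, 80   # poor/ok/good, actual, goal

fig, ax = plt.subplots(figsize=(6, 1.5))
for bound, shade in zip(sorted(ranges, reverse=True), ["0.85", "0.75", "0.65"]):
    ax.barh(0, bound, height=0.8, color=shade)    # background ranges
ax.barh(0, measure, height=0.3, color="black")    # the measure
ax.axvline(target, ymin=0.15, ymax=0.85, color="red", linewidth=2)  # target
ax.set_yticks([])
plt.show()
```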
general ledger
A ledger that contains all accounts needed to prepare financial statements. Once the company decides the appropriate account for posting an entry, the entry is put into the general ledger.
RapidMiner
A popular, open-source, free-of-charge data mining software suite that employs a graphically enhanced user interface, a rather large number of algorithms, and a variety of data visualization features. RapidMiner promotes predictive analytics: the product provides the user with powerful Hadoop analytics, Tableau extensions and streamlined usability, and its site says it "empowers analysts to effortlessly design predictive analytics, from mashup to modeling to deployment."
Google Transparency Reports
A site (https://transparencyreport.google.com/) that provides a breakdown of data and of how policies and laws affect Internet users and the flow of data. It explains and displays how requests have come in from all over the country and throughout the world to remove certain information from Google and the Internet. Google provides data in statistical and visual form that shows how this is occurring and how it affects the Internet. The attached figure displays the different kinds of reports that are available for viewing or download; when users click on a report, they are brought to options that capture specific data and analyze it.
general journal
An all-purpose journal in which all the transactions of a business may be recorded. The general journal is part of the accounting system in which a record is kept; journals are books of original entry in which transactions are first recorded.
container
An isolated guest in a shared server, quarantined to prevent unwanted data access by others; a container is an isolated guest, using a single kernel, that is prevented from gaining access to another container's files. Containerization represents a recent innovation in cloud-based infrastructure.
CRM software
Automates and integrates the functions of sales, marketing, and service in an organization.
B2C CRM software:
*Consistency of long-term sales through marketing is a primary objective
*Capability to manage a large number of contacts and leads
*Executing mass campaigns and updating large amounts of data is emphasized
B2B CRM software:
*Long-term management of a potential customer is a critical goal
*Automation of sales processes is stressed
*Predicting customer behavior based on past buying patterns and business profitability
B2C CRM software
B2C CRM software mainly focuses on long-term sales and on satisfying customers with services, so that customers stay involved for a long time and sales revenue improves. Automation is one of the most essential features of a CRM system: by allowing salespersons to automate regular tasks, such as sending out emails, generating reports and performing other repetitive processes, sales effectiveness is greatly enhanced.
Capsa Free
Capsa Free is a portable network analyzer application for both LANs and WLANs that performs 24x7 network monitoring, advanced protocol analysis, in-depth packet decoding and automatic expert diagnosis, and has real-time packet-capturing capability. Its user-friendly interface and powerful data packet capture and analysis engine make it a useful tool for network monitoring.
container image
Containers use modern Linux kernel primitives such as control groups (cgroups) and namespaces to provide resource allocation and isolation features similar to those provided by virtual machines, with much less overhead and much greater portability. Application developers now need to become comfortable with container images to take full advantage of the features of modern cloud infrastructure.
event monitoring software
Event monitoring software helps to detect errors, connectivity problems, security threats and other areas that need attention. This software can provide immediate detection of problems in physical facilities by reporting leaks, fires, electrical failures and other issues.
FISMA
Federal Information Security Management Act; a US law that requires federal agencies to create, document and implement a security program; legislation that defines a comprehensive framework to protect U.S. government information, operations and assets against natural or man-made threats.
B2B framework
For managing the complexity of B2B operations, a suitable framework needs to be in place, and data handling is an essential part of that framework. Data may be obtained through B2B frameworks by means of interviewing, user surveys, social media and KPIs (Key Performance Indicators), whereby the enterprise defines and monitors certain performance indices directly related to the product and service, and uses the numbers to make inferences about customer satisfaction.
Google Fusion Tables
Google Fusion Tables is a robust system that can accomplish the following:
*Filter and summarize across thousands of rows of data
*Embed or share the data through charts, maps, network graphs and custom layouts
*Collaborate with others on the data by sharing it through Google Drive
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ It works on the same concept as a centralized database; centralized data is data that is housed, stored and managed from a single location. Google Fusion Tables creates a single visualization that combines all the information, so users get access to updated data in a single place. Using Google Fusion Tables, you can share only what you want to share with specific people by setting user permissions; using these permissions, rows or columns of data can be shared on a timed basis as well.
Graph-R
Graph-R is an application used to create three-dimensional contours, contour lines, wire frames, and scatter diagrams from numeric data files (CSV files); Graph-R and Graph-R Plus together can create 3D contour charts, wireframes, scatter charts and vector diagrams. We can easily set up a graph and change the viewpoint, and once a graph is created, it can be saved as a .PNG, .JPEG, .BMP or .GIF file, or copied to the clipboard for pasting into other programs.
HDFS
HDFS, or Hadoop Distributed File System, is a file system designed for managing a large number of potentially very large files in a highly distributed environment.
Hadoop Core
Hadoop Common, also known as Hadoop Core, refers to the collection of common utilities that support the other Hadoop modules.
YARN
Hadoop YARN (Yet Another Resource Negotiator) is a cluster management technology responsible for managing resources in the cluster.
HappyFox
HappyFox helps to manage large-scale emergency issues and small-scale issues by creating tickets, which are customer service requests, for each issue and allowing users to work on the same interface. Furthermore, HappyFox can convert email, phone, chat and Web requests into tickets to keep them organized.
five stages of the data lifecycle
Here are the five stages of the data lifecycle:
*Data extraction
*Data aggregation
*Data normalization
*Data analysis
*Data reporting/submission