Data Science Chapter 1 to 2
SAS Enterprise Miner
Allows users to run predictive and descriptive models based on large volumes of data across the enterprise
Alpine Miner
provides a graphical user interface for creating analytic workflows, including data manipulations and a series of analytic events such as staged data-mining techniques
Matlab
provides a high-level language for performing a variety of data analytics, algorithms, and data exploration
scientific method
provides a solid framework for thinking about and deconstructing problems into their principal parts.
business intelligence analyst
provides business domain expertise based on a deep understanding of the data, key performance indicators, key metrics, and business intelligence from a reporting perspective. They generally create dashboards and reports and have knowledge of the data feeds and sources
Business intelligence analyst
provides business domain expertise based on a deeper understanding of the data
SAS/ACCESS
provides integration between SAS and the analytics sandbox via multiple data connectors
data scientist
provides subject matter expertise for analytical techniques, data modeling, and applying valid analytical techniques to given business problems. Designs and executes analytical methods and approaches with the data available to the project.
CRISP-DM
provides useful input on ways to frame analytics problems and is a popular approach for data mining
database administrator
provisions and configures the database environment to support the analytics needs of the working team. These responsibilities may include providing access to key databases or tables and ensuring the appropriate security levels are in place related to the data repositories
data conditioning
refers to the process of cleaning data, normalizing datasets, and performing transformations on the data
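As a rough illustration of the three steps named above, the sketch below conditions a small, made-up list of records using only the Python standard library. The field names (`region`, `sales`), the sample values, and the choice of min-max scaling and a log transform are assumptions for the example, not something the source prescribes.

```python
import math

# Hypothetical raw records; field names and values are invented for illustration.
raw_records = [
    {"region": " East ", "sales": "100"},
    {"region": "west", "sales": "300"},
    {"region": "WEST", "sales": None},  # missing value
    {"region": "north", "sales": "200"},
]

def clean(records):
    """Cleaning: drop rows with missing sales, tidy text, cast strings to numbers."""
    cleaned = []
    for row in records:
        if row["sales"] is None:
            continue
        cleaned.append({"region": row["region"].strip().lower(),
                        "sales": float(row["sales"])})
    return cleaned

def normalize(records, field):
    """Normalizing: add a copy of a numeric field rescaled to [0, 1] (min-max)."""
    values = [row[field] for row in records]
    low, high = min(values), max(values)
    span = (high - low) or 1.0  # avoid dividing by zero when all values match
    return [{**row, field + "_scaled": (row[field] - low) / span} for row in records]

def transform(records, field):
    """Transforming: add a log-transformed copy of a positive numeric field."""
    return [{**row, field + "_log": math.log(row[field])} for row in records]

conditioned = transform(normalize(clean(raw_records), "sales"), "sales")
```

Each step returns new records rather than editing the input, so the raw data stays available for comparison.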
project sponsor
responsible for the genesis of the project. provides the impetus and requirements for the project and defines the core business problem. this person sets the priorities for the project and clarifies the desired outputs
business user
someone who understands the domain area and usually benefits from the result. this person can consult and advise the project team on the context of the project
reframe business challenges as analytics challenges
specifically, this is a skill to diagnose business problems, consider the core of a given problem, and determine which kinds of candidate analytical methods can be applied to solve it.
medical information
such as genomic sequencing and diagnostic imaging
quantitative skill
such as mathematics or statistics
semi-structured
textual data files with a discernible pattern that enables parsing, such as Extensible Markup Language (XML) files
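To make "a discernible pattern that enables parsing" concrete, here is a minimal sketch that reads a small XML snippet with Python's standard `xml.etree.ElementTree` module. The tag and attribute names (`calls`, `call`, `type`, `minutes`) are hypothetical and chosen only for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML snippet; tag and attribute names are made up for the example.
xml_text = """
<calls>
  <call id="1"><type>billing</type><minutes>12</minutes></call>
  <call id="2"><type>outage</type><minutes>30</minutes></call>
</calls>
"""

root = ET.fromstring(xml_text.strip())

# The self-describing tags let us pull fields out by name.
calls = [
    {
        "id": int(call.get("id")),
        "type": call.findtext("type"),
        "minutes": int(call.findtext("minutes")),
    }
    for call in root.findall("call")
]
```

Because the tags label each field, the parser needs no custom rules about where one value ends and the next begins.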
data deluge
the result of the prevalence of automatic data collection, electronic instrumentation, and online transactional processing (OLTP). Drivers include mobile sensors, social media, video surveillance, medical imaging, smart grids, and gene sequencing
data users and buyers
these groups directly benefit from the data collected and aggregated by others within the data value chain
technology and data enabler
this group represents people providing technical expertise to support analytical projects, such as provisioning and administering analytical sandboxes, and managing large-scale architectures that enable widespread analytics within companies and other organizations.
design, implement, and deploy statistical models and data mining techniques on Big Data
this set of activities is mainly what people think about when they consider the role of the Data Scientist.
RDBMS (Relational Database Management System)
this stores characteristics of the support calls as typical structured data.
true
true or false: for each gigabyte of new data created, an additional petabyte of data is created about the data.
smart devices
which provide sensor-based collection of information from smart electric grids, smart buildings, and many other public and industry infrastructures.
develop insights that lead to actionable recommendations
it is critical to note that applying advanced methods to data problems does not necessarily drive new business value.
OpenRefine
it is a free, open-source, powerful tool for working with messy data
skeptical mind-set and critical thinking
it is important that data scientists can examine their work critically rather than in a one-sided way
Alpine Miner
provides a GUI front end for users to develop analytic workflows and interact with Big Data tools and platforms on the back end
SQL
provides an alternative to in-memory desktop analytical tools
data engineer
leverages deep technical skills to assist with tuning SQL queries for data management and data extraction, and provides support for data ingestion into the analytic sandbox.
data aggregators
make sense of the data collected from the various entities from the "SensorNet" or the "Internet of Things."
(BI vs Data Science)
-BI tends to provide reports, dashboards, and queries on business questions for the current period or in the past. It makes it easy to answer questions related to quarter-to-date revenue, progress toward quarterly targets, and how much of a given product was sold in a prior quarter or year. -Data science tends to use disaggregated data in a more forward-looking, exploratory way, focusing on analyzing the present and enabling informed decisions about the future.
communicative and collaborative
must be able to articulate the business value in a clear way and collaboratively work with other groups, including project sponsors and key stakeholders.
data warehouse
Centralized database that stores data from several databases so they can be easily analyzed.
technical aptitude
namely, software engineering, machine learning, and programming skills.
applied information economics
provides a framework for measuring intangibles and provides guidance on developing decision models, calibrating expert estimates, and deriving the expected value of information
analytic sandbox
attempts to resolve the conflict that analysts and data scientists have with the EDW and more formally managed corporate data.
business user
benefits from the result
unstructured
no inherent structure
DELTA framework
offers an approach for data analytics projects, including the context of the organization's skills, datasets, and leadership engagement
Quasi-structured
Textual data with erratic formats that can be formatted with effort and software tools
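A common example of this kind of data is web clickstream data, where URLs carry useful fields but in inconsistent shapes. The sketch below, using Python's standard `urllib.parse` module, shows the sort of "effort and software tools" involved. The URLs and the `parse_click` helper are hypothetical and exist only for illustration.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical clickstream entries; the URLs are invented for illustration.
clicks = [
    "https://www.example.com/search?q=data+science&page=2",
    "https://www.example.com/products/item42",
    "not a url at all",
]

def parse_click(url):
    """Return host, path, and query fields, or None if the entry is unusable."""
    parts = urlparse(url)
    if not parts.netloc:
        return None  # erratic entry: no recognizable host
    # parse_qs returns a list per key; keep the first value for simplicity.
    query = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return {"host": parts.netloc, "path": parts.path, "query": query}

parsed = [parse_click(url) for url in clicks]
```

Some entries yield clean fields, some have no query at all, and some cannot be parsed, which is what separates this kind of data from semi-structured formats like XML.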
MAD skills
offers input for several of the techniques mentioned that focus on model planning, execution, and key findings
SPSS Modeler
offers methods to explore and analyze data through a GUI
phase 6
operationalize
discovery
phase 1
data preparation
phase 2
model planning
phase 3
model building
phase 4
framing
process of stating the analytics problem to be solved
Octave
a free software programming language for computational modeling that has some of the functionality of Matlab
Enterprise Data Warehouse
is critical for reporting and BI tasks and solves many of the problems that proliferating spreadsheets introduce, such as which of multiple versions of a spreadsheet is correct.
big data
can come in multiple forms, including structured and non-structured data such as financial data, etc.
SQL Analysis Services
can perform in-database analytics of common data mining functions, involved aggregations, and basic predictive models
Hadoop
can perform massively parallel ingest and custom analysis for web traffic parsing, GPS location analytics, genomic analysis, etc.
data warehouse
centralized data containers in a purpose-built space. Supports BI and reporting, but restricts robust analysis.
ETLT
combination of ETL and ELT for extracting, transforming, and loading data into the sandbox
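One way to picture the difference in ordering is the sketch below, which uses an in-memory SQLite database as a stand-in for the sandbox. The table names, column names, and sample rows are all hypothetical. It first loads raw data and transforms it inside the database (the ELT order), then transforms in application code before loading (the ETL order).

```python
import sqlite3

# Hypothetical source rows, as if extracted from an operational system.
source_rows = [
    ("2024-01-05", "east", "100.50"),
    ("2024-01-05", "west", "200.00"),
    ("2024-01-06", "east", "50.25"),
]

# An in-memory SQLite database stands in for the analytic sandbox.
sandbox = sqlite3.connect(":memory:")

# ELT order: load the raw data unchanged...
sandbox.execute("CREATE TABLE raw_sales (sale_date TEXT, region TEXT, amount TEXT)")
sandbox.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", source_rows)

# ...then transform inside the database, keeping the raw table intact.
sandbox.execute("""
    CREATE TABLE sales_by_region AS
    SELECT region, SUM(CAST(amount AS REAL)) AS total
    FROM raw_sales
    GROUP BY region
""")

# ETL order: transform in application code first, then load the result.
etl_rows = [(day, region.upper(), float(amount)) for day, region, amount in source_rows]
sandbox.execute("CREATE TABLE clean_sales (sale_date TEXT, region TEXT, amount REAL)")
sandbox.executemany("INSERT INTO clean_sales VALUES (?, ?, ?)", etl_rows)

totals = sandbox.execute(
    "SELECT region, total FROM sales_by_region ORDER BY region"
).fetchall()
raw_count = sandbox.execute("SELECT COUNT(*) FROM raw_sales").fetchone()[0]
clean_regions = [row[0] for row in sandbox.execute(
    "SELECT DISTINCT region FROM clean_sales ORDER BY region"
)]
```

Keeping `raw_sales` alongside the transformed tables means the team can still inspect the untouched data in the sandbox.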
phase 5
communicate results
structured data
data containing a defined data type, format, and structure
curious and creative
data scientists are passionate about data and finding creative ways to solve problems and portray information
Unstructured data
data that has no inherent structure, which may include text documents, PDFs, images, and video. A common phenomenon that bears closer scrutiny
Analytic Sandbox
enables high-performance computing using in-database processing.
workspaces
enable teams to explore many datasets in a controlled fashion and are not typically used for enterprise-level financial reporting and sales dashboards
project manager
ensures that key milestones and objectives are met on time and at the expected quality
quasi-structured
erratic data formats
semi-structured
files with a discernible pattern that enables parsing.
WEKA
free data mining software package with an analytic workbench
data devices
gather data from multiple locations and continuously generate new data about this data
analytic sandbox
gathered from multiple sources
R
has a complete set of modelling capabilities and provides a good environment for building interpretative models with high-quality code. Has the ability to interface with databases via an ODBC connection and execute statistical tests...
Data Savvy professional
has less technical depth but has basic knowledge of statistics or machine learning and can define key questions that can be answered using advanced analytics.
data collectors
includes sample entities that collect data from the device and users.
Data Wrangler
interactive tool for data cleaning and transformation; developed at Stanford University and can be used to perform many transformations on a given dataset
Python
is a programming language that provides toolkits for machine learning and analysis, such as scikit-learn, NumPy, SciPy, pandas, etc.
deep analytical talent
is technically savvy, with strong analytical skills. Members possess a combination of skills to handle raw, unstructured data and to apply complex analytical techniques at massive scales.