data science terms
Model Steps to solve problem
-Iterative process: prepare data; build model (from scratch or from an existing resource); train model; deploy model; use model
Supervised Learning
-data is labeled and model trained to make correct predictions; part of ML; ex: regression, classification
Random Forest
An algorithm used for regression or classification that builds a collection of decision trees; the trees "vote" on the prediction (majority class for classification, average for regression)
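A minimal sketch of the voting step only. The three "trees" here are hypothetical hand-written rules (`tree_1`, `tree_2`, `tree_3`), and the feature names `links` and `caps_ratio` are made up for illustration; a real random forest learns its trees from random samples of the training data.

```python
from collections import Counter

# Hypothetical stand-ins for learned decision trees.
def tree_1(x):
    return "spam" if x["links"] > 3 else "ham"

def tree_2(x):
    return "spam" if x["caps_ratio"] > 0.5 else "ham"

def tree_3(x):
    return "spam" if x["links"] > 1 and x["caps_ratio"] > 0.2 else "ham"

def forest_predict(x, trees):
    # Each tree casts one vote; the most common label wins.
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

message = {"links": 2, "caps_ratio": 0.6}
prediction = forest_predict(message, [tree_1, tree_2, tree_3])
print(prediction)
```

Here two of the three trees vote "spam", so that is the forest's prediction.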
DAX
Data Asset eXchange created by IBM; for finding open data sets for enterprise applications (images, video, text, audio, etc.); normally easier to adapt and license; even has videos on how to do everything
DDL/DML for SQL
Data Definition Language (DDL) statements: define, change, or drop database objects such as tables; Data Manipulation Language (DML) statements: read and modify data
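A small sketch of the DDL/DML split using Python's built-in `sqlite3` module. The `patients` table and its rows are hypothetical, chosen only to show which statements fall in which group.

```python
import sqlite3

# In-memory database so the sketch leaves nothing on disk.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: define a table.
cur.execute("CREATE TABLE patients (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")

# DML: insert, update, and read rows.
cur.executemany(
    "INSERT INTO patients (name, age) VALUES (?, ?)",
    [("Ana", 34), ("Ben", 51)],
)
cur.execute("UPDATE patients SET age = age + 1 WHERE name = ?", ("Ana",))
rows = cur.execute("SELECT name, age FROM patients ORDER BY id").fetchall()
print(rows)

# DDL again: drop the table.
cur.execute("DROP TABLE patients")
conn.close()
```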
DBMS
Database Management System; set of software tools for managing the data in the database; also RDBMS (relational DBMS), the kind of database SQL is used with
KNN
K Nearest Neighbors algorithm; method for classifying cases based on their similarity to other cases; cases near each other are "neighbors"
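A minimal pure-Python sketch of the idea, assuming Euclidean distance and a simple majority vote among the k closest training cases. The function name `knn_predict` and the toy data are illustrative, not from any library.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train is a list of (features, label) pairs; query is a feature tuple."""
    # Sort training cases by distance to the query and keep the k closest.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # The most common label among the neighbors is the prediction.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B"),
]
label_near_a = knn_predict(train, (1.1, 1.0))
label_near_b = knn_predict(train, (5.0, 5.1))
print(label_near_a, label_near_b)
```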
Development environments
IDEs; help to implement, execute, test and deploy work
Clustering
Machine Learning technique that involves the grouping of data points
MAX
Model Asset eXchange from IBM; free resource for DL models: ready-to-use, customizable DL microservices; model-serving microservices expose a standardized REST API with prediction endpoints
SPSS
Statistical Package for the Social Sciences; build predictive models, perform statistical analysis of data, etc
Application Programming Interface (API)
application programming interface was originally understood to be an application specific computing interface exposed by a particular software program or operating system to allow third parties to extend the functionality of that software application beyond its capabilities as they existed out of the box
API
application programming interface; intermediary that lets one program communicate with other software
TCO
Total cost of ownership; factor considered when purchasing new products and services; identify the cost of a product or service over its lifetime.
commercial software
aka payware (a seldom-used term); produced for sale or serves commercial purposes
ANOVA
analysis of variance; statistical comparison of groups; returns 1) F-test score: variation between sample group means divided by variation within the sample groups; 2) p-value: confidence degree
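A rough sketch of the F-test score for a one-way ANOVA, computed by hand as between-group mean square divided by within-group mean square. The helper name `one_way_f` and the sample groups are made up for illustration; the p-value is left out because it needs the F distribution, which the standard library does not provide.

```python
from statistics import mean

def one_way_f(groups):
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k = len(groups)        # number of groups
    n = len(all_values)    # total number of observations
    # Variation between the group means (weighted by group size).
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of each observation around its own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_score = one_way_f(groups)
print(f_score)
```

A large F-test score suggests the group means differ by more than the spread inside each group would explain.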
natural language processing
branch of AI; interaction between computers and humans using natural language
machine learning
branch of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed; development of computer programs that can access data and use it to learn for themselves; identifies patterns in data
target attribute
the variable a model is trained to predict; in classification, a categorical variable with discrete values
Spark
cluster computing framework enabling processing data by computing clusters; computing in parallel
Libraries
collection of functions and methods that enable you to perform a wide variety of actions without writing the code yourself
Table
collection of related things; columns are properties/attributes; basic SQL statements: CREATE, INSERT, SELECT, UPDATE, DELETE
SQL
Structured Query Language; language for communicating with relational databases to query data
open source software
computer software released under a license in which the copyright holder grants users the rights to study, change, and distribute the software to anyone and for any purpose
Docker
container platform making it easier to build applications and deploy; can be on Github
Model Building
creating a machine or deep learning model using an appropriate algorithm w a lot of data
Unsupervised learning
data not labeled; model tries to identify patterns without external help; learning problems: clustering, anomaly detection (identifying outliers); reinforcement learning is a separate type of ML, where the model learns the best actions by trial and error
array
data structure consisting of a collection of elements; variables assigned to array
Pandas
data structures and tools for data cleaning, manipulation, and analysis; table w columns and rows; scientific computing
Seaborn
data visualization library; used especially for heat maps, time series plots, etc
Keras
deep learning neural networks
TensorFlow
deep learning; production and deployment
PyTorch
deep learning; regression, classification
ETL
extract, transform, load; task of data integration and transformation in the classic data warehousing world
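A toy sketch of the three ETL stages using only the standard library. The CSV text, the column names `name` and `temp_f`, and the `temps` table are hypothetical; a real pipeline would read from source systems and load into a data warehouse.

```python
import csv
import io
import sqlite3

raw_csv = "name,temp_f\nOslo,50\nCairo,95\n"

# Extract: read the raw records from the source.
records = list(csv.DictReader(io.StringIO(raw_csv)))

# Transform: convert types and units (Fahrenheit to Celsius).
cleaned = [
    (r["name"], round((float(r["temp_f"]) - 32) * 5 / 9, 1))
    for r in records
]

# Load: write the cleaned rows into the target database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE temps (city TEXT, temp_c REAL)")
conn.executemany("INSERT INTO temps VALUES (?, ?)", cleaned)
loaded = conn.execute("SELECT city, temp_c FROM temps ORDER BY city").fetchall()
conn.close()
print(loaded)
```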
Deep Learning
field of machine learning where computers learn to make intelligent decisions on their own; normally involves deeper level of automation vs other algorithms; frameworks: TensorFlow, PyTorch, Keras
derive
finding rate of change of variable
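A short numerical sketch, assuming a central-difference approximation with a small step `h`. The helper name `derivative` is illustrative; the example estimates the rate of change of x squared at x = 3, which should be close to 6.

```python
def derivative(f, x, h=1e-6):
    # Central difference: slope between two points just either side of x.
    return (f(x + h) - f(x - h)) / (2 * h)

slope = derivative(lambda x: x ** 2, 3.0)
print(slope)
```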
Framework
focused solutions meant for data scientists who don't know much coding etc
Aggregate Function (SQL)
function where the values of multiple rows are grouped together to form a single summary value
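A minimal sketch with `sqlite3`, assuming a made-up `sales` table. `COUNT`, `SUM`, and `AVG` each collapse the rows of a group into one summary value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10.0), ("east", 30.0), ("west", 5.0)],
)

# One output row per region, each with three aggregate values.
summary = conn.execute(
    "SELECT region, COUNT(*), SUM(amount), AVG(amount) "
    "FROM sales GROUP BY region ORDER BY region"
).fetchall()
conn.close()
print(summary)
```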
Hardware Vs software
hold in your hand vs telling the computer how to work
Multilinear regression
identify the strength of the effect the independent variables have on the dependent variable; predict how changes in the independent variables change the dependent variable
Area plot
in matplotlib; area chart or area graph displays graphically quant data; think of line graph with horizontal blocks through it to show quant; stacked by default
Information vs Data Model
info- conceptual level defining relationships; data- concrete w details, blueprint of any database system
float vs integer
integer- no decimal, float- can have; python
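A quick illustration in Python; the variable names are arbitrary.

```python
count = 7          # int: no decimal part
ratio = 7 / 2      # "/" always gives a float
whole = 7 // 2     # "//" floor-divides; int when both operands are int

as_int = int(3.9)      # converting float to int drops the decimal part
as_float = float(3)    # converting int to float adds one

print(type(count).__name__, type(ratio).__name__, type(whole).__name__)
print(as_int, as_float)
```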
Data Pipeline
is a system that captures, organizes, and routes data so that it can be used to gain insights. Raw data contains too many data points that may not be relevant. Data pipeline architecture organizes data events to make reporting, analysis, and using data easier
Classification
learn the relationship between a set of feature variables and a target variable of interest; the target is a category
NumPy
library based on N-dimensional arrays; enabling you to perform mathematical functions on these arrays; Pandas is built on top of NumPy
Matplotlib
library known for data visualization- graphs and plots; Seaborn based on this- for plots
Model deployment
makes the built model available to third-party applications
statistical modeling
mathematical model that embodies a set of statistical assumptions concerning the generation of sample data.
Kmeans Clustering
method of vector quantization; partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean
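A bare-bones sketch of the assign-then-update loop, assuming Euclidean distance, fixed starting centroids, and a fixed number of iterations. The function name `kmeans` and the sample points are illustrative; library implementations add smarter initialization and convergence checks.

```python
import math

def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```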
Regression
modeling technique to analyze relationships between variables: how they contribute and relate to producing a particular outcome together; predicts a continuous variable
Cloud computing
on demand availability of computer system resources, especially data storage and computing power, without the direct active management by the user; good for scalability, access anywhere, disaster recovery
Data Visualization
part of initial data exploration process and can be used as final deliverable
Jupyter Notebooks
perform data cleaning, pre-processing, and exploratory analysis
Community Data License Agreement
permission to use and modify data
Microservice
components: pre-trained DL model; code that preprocesses the input before it is analyzed by the model; code that post-processes the model output; standardized public API that makes the model available to applications; model-serving microservices expose a standardized REST API
sequence mining
predicting the next event ex: click-stream in websites
Data Manipulation
process of changing data to make it easier to read or be more organized
Data Modeling
process of creating a data model for the data to be stored in a Database. This data model is a conceptual representation of Data objects, the associations between different data objects and the rules
Database
repository for data including the modification, addition, and querying; relational database forms relationships between tables; application (ex python)-> sql-> database instance
REST API
representational state transfer application programming interface; medium between client and resource; flow: file -> web service -> client
querying
request for data or information from a database table or combination of tables. This data may be generated as results returned by Structured Query Language (SQL) or as pictorials, graphs or complex results, e.g., trend analyses from data-mining tools
Scikit-Learn
statistical modeling including regression, classification, clustering; built on NumPy, SciPy, and matplotlib; machine learning
Regression Analysis
in statistical modeling, regression analysis is a set of statistical processes for estimating the relationships between a dependent variable and one or more independent variables
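A small sketch for the simplest case, one independent variable, assuming an ordinary least-squares fit of a straight line. The helper name `fit_line` and the data are made up; with several independent variables a library such as Scikit-Learn would normally be used instead.

```python
from statistics import mean

def fit_line(xs, ys):
    x_bar, y_bar = mean(xs), mean(ys)
    # Slope: how much y changes, on average, per unit change in x.
    slope = (
        sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
        / sum((x - x_bar) ** 2 for x in xs)
    )
    # Intercept: predicted y when x is 0.
    intercept = y_bar - slope * x_bar
    return slope, intercept

xs = [1, 2, 3, 4]   # independent variable
ys = [3, 5, 7, 9]   # dependent variable
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```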
3 Types of Machine Learning Models
supervised learning, unsupervised learning, reinforcement learning (regression and classification are kinds of supervised learning)
pivot table
table of stats that summarizes the data of a more extensive table
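A pure-Python sketch of the idea, assuming a made-up list of sales rows. It summarizes total units by region (rows) and product (columns); in practice Pandas is the usual tool for this.

```python
from collections import defaultdict

rows = [
    {"region": "east", "product": "pen", "units": 3},
    {"region": "east", "product": "ink", "units": 2},
    {"region": "west", "product": "pen", "units": 4},
    {"region": "east", "product": "pen", "units": 1},
]

# Nested totals: region -> product -> summed units.
totals = defaultdict(lambda: defaultdict(int))
for row in rows:
    totals[row["region"]][row["product"]] += row["units"]

# Convert to plain dicts for printing and comparison.
pivot = {region: dict(products) for region, products in totals.items()}
print(pivot)
```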
artificial intelligence (AI)
the creation of a machine to mimic cognitive human intelligence
Execution Environments
tools where data processing, model training, and deployment take place