Data Science

Bias

A particular preference or point of view that is personal, rather than scientific.

Perl

An older scripting language with roots in pre-Linux UNIX systems. Perl has always been popular for text processing, especially data cleanup and enhancement tasks.

strata, stratified sampling

Divide the population units into homogeneous groups (strata) and draw a simple random sample from each group.
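
The two steps (group the units, then sample within each group) can be sketched in a few lines. This is only an illustration: the function name `stratified_sample`, the `per_stratum` parameter, and the age-group population are assumptions made up for the example.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, per_stratum, seed=0):
    """Draw a simple random sample of size per_stratum from each stratum."""
    rng = random.Random(seed)
    # Step 1: divide the population units into strata.
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    # Step 2: draw a simple random sample inside each stratum.
    sample = {}
    for name, members in strata.items():
        k = min(per_stratum, len(members))  # never ask for more than exists
        sample[name] = rng.sample(members, k)
    return sample

# Hypothetical population: (person_id, age_group) pairs.
population = [(i, "under_30" if i % 3 else "30_plus") for i in range(30)]
picked = stratified_sample(population, lambda u: u[1], per_stratum=4)
```

Here `picked` maps each age group to four randomly chosen people from that group only.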

NumPy

Fundamental package for scientific computing with Python

Structured Query Language(SQL)

SQL is a language used to perform tasks such as updating or retrieving data in a database.

Summary Statistics

Summary statistics are the measures we use to communicate insights about our data in a simple way. Examples of summary statistics are the mean, median and standard deviation.

Predictive Analytics

The analysis of data to predict future events, typically to aid in business planning. This incorporates predictive modeling and other techniques. Machine learning might be considered a set of algorithms to help implement predictive analytics. The more business-oriented spin of "predictive analytics" makes it a popular buzz phrase in marketing literature.

Pandas

A Python library for data manipulation popular with data scientists

Vector Space

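A collection of vectors that is closed under vector addition and multiplication by scalars: adding two vectors in the collection, or scaling one by a number, always yields another vector in the collection.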

Statistical Significance

A result is statistically significant when we judge that it probably didn't happen due to chance. The concept is widely used in surveys and statistical studies, though it is not always an indication of practical value. The mathematical details of statistical significance are beyond the scope of this post.

Plotly

A technical computing company headquartered in Montreal, Quebec, that develops online data analytics and visualization tools.

k-nearest neighbors

Also, kNN. A machine learning algorithm that classifies things based on their similarity to nearby neighbors. You tune the algorithm's execution by picking how many neighbors to examine (k) as well as some notion of "distance" to indicate how near the neighbors are. For example, in a social network, a friend of your friend could be considered twice the distance away from you as your friend. "Similarity" would be comparison of feature values in the neighbors being compared.
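
A rough sketch of the idea, assuming Euclidean distance as the notion of "distance" and a simple majority vote among the k nearest labeled points. The function name `knn_classify` and the toy training data are made up for illustration.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """train is a list of (features, label) pairs; query is a feature tuple."""
    # Sort the labeled points by Euclidean distance to the query.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    # Majority vote among the labels of the k closest points.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training data: two well-separated groups.
train = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
         ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b")]
label = knn_classify(train, (1.1, 1.0), k=3)
```

Changing `k` or swapping in a different distance function are the two tuning knobs the definition describes.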

Amazon Web Services (AWS)

Amazon Web Services is a subsidiary of Amazon that provides on-demand cloud computing platforms and APIs to individuals, companies, and governments, on a metered pay-as-you-go basis.

discrete variable

Consists of separate, indivisible categories. No values can exist between two neighboring categories.

latent variable

In statistics, latent variables (from Latin: present participle of lateo, 'lie hidden'), as opposed to observable variables, are variables that are not directly observed but are rather inferred (through a mathematical model) from other variables that are observed (directly measured). Mathematical models that aim to explain observed variables in terms of latent variables are called latent variable models.

Unstructured Information Management Architecture (UIMA)

The "Unstructured Information Management Architecture" was developed at IBM as a framework to analyze unstructured information, especially natural language. OASIS UIMA is a specification that standardizes this framework and Apache UIMA is an open-source implementation of it. The framework lets you pipeline other tools designed to be plugged into it.

Data Wrangling

The process of taking data in its original form and "taming" it until it works better in a broader workflow or project. Taming means making values consistent with a larger data set, replacing or removing values that might affect analysis or performance later, etc. Wrangling and munging are used interchangeably.

predictive modeling

The development of statistical models to predict future events.

Mode

The value that occurs most frequently in a given data set.

F1 Score

The harmonic mean of precision and recall. An F1 score reaches its best value at 1 (perfect precision and recall) and its worst at 0.
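
As a worked sketch, the score can be computed from counts of true positives, false positives, and false negatives. The function name and the label lists below are made up for illustration.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when nothing was predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

With these labels precision and recall are both 3/4, so their harmonic mean is also 3/4.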

Colaboratory

Colaboratory, or "Colab" for short, is a product from Google Research. Colab allows anybody to write and execute arbitrary Python code through the browser, and is especially well suited to machine learning, data analysis and education.

Web Scraping

Web scraping is the process of pulling data from a website's source code. It generally involves writing a script that will identify the information a user wants and pull it into a new file for later analysis.

Flask

Flask is a micro web framework written in Python. It is classified as a microframework because it does not require particular tools or libraries. It has no database abstraction layer, form validation, or any other components where pre-existing third-party libraries provide common functions.

scripting

Generally, the use of a computer language where your program, or script, can be run directly with no need to first compile it to binary code as with languages such as Java and C. Scripting languages often have simpler syntax than compiled languages, so the process of writing, running, and tweaking scripts can go faster.

Density-based spatial clustering of applications with noise (DBSCAN)

It is a density-based clustering non-parametric algorithm: given a set of points in some space, it groups together points that are closely packed together (points with many nearby neighbors), marking as outliers points that lie alone in low-density regions (whose nearest neighbors are too far away). DBSCAN is one of the most common clustering algorithms and also most cited in scientific literature.

Microsoft Azure

Microsoft Azure is a cloud computing service created by Microsoft for building, testing, deploying, and managing applications and services through Microsoft-managed data centers.

Precision vs. Recall

Precision can be seen as a measure of exactness or quality, whereas recall is a measure of completeness or quantity. In simple terms, high precision means that an algorithm returned substantially more relevant results than irrelevant ones, while high recall means that an algorithm returned most of the relevant results.

Perceptron

Pretty much the simplest neural network is the perceptron, which approximates a single neuron with n binary inputs. It computes a weighted sum of its inputs and 'fires' if that weighted sum is zero or greater.
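
A minimal sketch of that rule. The `bias` term and the AND-gate weights are assumptions added for the example; the definition above only mentions the weighted sum.

```python
def perceptron_fires(inputs, weights, bias=0.0):
    """Return 1 if the weighted sum (plus bias) is zero or greater, else 0."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if weighted_sum >= 0 else 0

# Hypothetical weights that make the perceptron behave like a logical AND.
and_weights, and_bias = [2.0, 2.0], -3.0
outputs = [perceptron_fires([a, b], and_weights, and_bias)
           for a in (0, 1) for b in (0, 1)]
```

Only the input pair (1, 1) pushes the weighted sum to zero or above, so only that case fires.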

Mean Absolute Error (MAE)

The average of the absolute errors of all predicted values when compared with observed values.

Mean Squared Error (MSE)

The average of the squares of all the errors found when comparing predicted values with observed values. Squaring them makes the bigger errors count for more, making Mean Squared Error more popular than Mean Absolute Error when quantifying the success of a set of predictions.
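
A small sketch comparing the two measures on the same predictions. The numbers are made up, with one deliberately large miss to show how squaring makes it dominate.

```python
def mean_absolute_error(observed, predicted):
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)

def mean_squared_error(observed, predicted):
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)

# Hypothetical data: errors are 1, 0, -1 and -4.
observed = [3.0, 5.0, 8.0, 10.0]
predicted = [2.0, 5.0, 9.0, 14.0]
mae = mean_absolute_error(observed, predicted)
mse = mean_squared_error(observed, predicted)
```

The single error of 4 contributes 16 of the 18 units of squared error, but only 4 of the 6 units of absolute error.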

Front End

The front end is everything a client or user gets to see and interact with directly. This includes data dashboards, web pages, and forms.

Overfitting

The process of fitting a model too closely to the training data for the model to be effective on other data.

SAS

A commercial statistical software suite that includes a programming language also known as SAS.

Cost function

A cost function represents a value to be minimized, like the sum of squared errors over a training set. Gradient descent is a method for finding the minimum of a function of multiple variables. So you can use gradient descent to minimize your cost function.
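
To make this concrete, here is a sketch that fits a one-parameter model y = w * x by running gradient descent on the sum of squared errors. The data, learning rate, and step count are assumptions picked for the example, and a single variable is used for brevity even though the method extends to many.

```python
def sse(w, xs, ys):
    """Cost function: sum of squared errors of the model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys))

def sse_gradient(w, xs, ys):
    """Derivative of the cost with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys))

def gradient_descent(xs, ys, w=0.0, learning_rate=0.01, steps=200):
    for _ in range(steps):
        # Step against the gradient to reduce the cost.
        w -= learning_rate * sse_gradient(w, xs, ys)
    return w

# Hypothetical training set generated from y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w_fit = gradient_descent(xs, ys)
```

After the loop, `w_fit` sits very close to 2 and `sse(w_fit, xs, ys)` is close to zero. A learning rate that is too large would make the updates overshoot and diverge instead.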

k-means clustering

A data mining algorithm to cluster, classify, or group your N objects based on their attributes or features into K number of groups (so-called clusters).
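
One common way to carry this out alternates two steps: assign each object to its nearest cluster center, then move each center to the mean of its members. The sketch below assumes Euclidean distance, caller-supplied starting centers, and a fixed number of iterations; the points are made up.

```python
import math

def k_means(points, centroids, iterations=10):
    """Plain k-means; points and centroids are lists of equal-length tuples."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else c  # keep an empty cluster's centroid in place
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical 2-D points forming two obvious groups, with K = 2.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

The result depends on the starting centers, so practical implementations usually try several initializations.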

Data Warehouse

A data warehouse is a system used to do quick analysis of business trends using data from many sources. They're designed to make it easy for people to answer important statistical questions without a Ph.D. in database architecture.

Ruby

A scripting language that first appeared in 1995. Ruby is popular in the data science community, but not as popular as Python, which has more specialized libraries available for data science tasks.

Time Series

A time series is a set of data that's ordered by when each data point occurred. Think of stock market prices over the course of a month, or the temperature throughout a day.

D3

"Data-Driven Documents." A JavaScript library that eases the creation of interactive visualizations embedded in web pages. D3 is popular with data scientists as a way to present the results of their analysis.

Gradient Boosting

"Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function."

Feature engineering

"To obtain a good model, however, often requires more effort and iteration and a process called feature engineering. Features are the model's inputs. They can involve basic raw data that you have collected, such as order amount, simple derived variables, such as 'Is order date on a weekend? Yes/No,' as well as more complex abstract features, such as the 'similarity score' between two movies. Thinking up features is as much an art as a science and can rely on domain knowledge."

t-distribution

A variation on normal distribution that accounts for the fact that you're only using a sampling of all the possible values instead of all of them.

Linear Algebra

A branch of mathematics dealing with vector spaces and operations on them such as addition and multiplication. "Linear algebra is designed to represent systems of linear equations. Linear equations are designed to represent linear relationships, where one entity is written to be a sum of multiples of other entities. In the shorthand of linear algebra, a linear relationship is represented as a linear operator—a matrix."

reinforcement learning

A class of machine learning algorithms in which the process is not given specific goals to meet but, as it makes decisions, is instead given indications of whether it's doing well or not. For example, an algorithm for learning to play a video game knows that if its score just went up, it must have done something right.

Naive Bayes Classifier

A collection of classification algorithms based on Bayes' theorem. It is not a single algorithm but a family of algorithms that all share a common principle: every feature being classified is independent of the value of any other feature. So for example, a fruit may be considered to be an apple if it is red, round, and about 3" in diameter. A Naive Bayes classifier considers each of these 'features' (red, round, 3" in diameter) to contribute independently to the probability that the fruit is an apple, regardless of any correlations between features. Features, however, aren't always independent, which is often seen as a shortcoming of the Naive Bayes algorithm, and this is why it's labeled 'naive'.

MATLAB

A commercial computer language and environment popular for visualization and algorithm development.

Tableau

A commercial data visualization package often used in data science projects.

Stata

A commercial statistical software package, not to be confused with strata.

SPSS

A commercial statistical software package, or according to the product home page, "predictive analytics software." The product has always been popular in the social sciences. The company, founded in 1968, was acquired by IBM in 2009.

NoSQL

A database management system that uses any of several alternatives to the relational, table-oriented model used by SQL databases. While this term originally meant "not SQL," it has come to mean something closer to "not only SQL" because the specialized nature of NoSQL database management systems often has them playing specific roles in a larger system that may also include SQL and additional NoSQL systems.

Poisson distribution

A distribution of independent events, usually over a period of time or space, used to help predict the probability of an event. Like the binomial distribution, this is a discrete distribution. Named for early 19th century French mathematician Siméon Denis Poisson.
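
For reference, the probability of seeing exactly k events when the average count per interval is λ is λ^k · e^(−λ) / k!. A short sketch of that formula follows; the help-desk rate of 3 calls per hour is a made-up example.

```python
import math

def poisson_pmf(k, rate):
    """Probability of exactly k events when the average count is rate."""
    return rate ** k * math.exp(-rate) / math.factorial(k)

# Hypothetical example: a help desk that averages 3 calls per hour.
p_no_calls = poisson_pmf(0, 3.0)
# Summing over enough values of k should account for nearly all probability.
total = sum(poisson_pmf(k, 3.0) for k in range(60))
```

Because the distribution is discrete, `k` takes only whole-number values and the probabilities over all `k` sum to 1.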

TensorFlow

A free and open-source software library for dataflow and differentiable programming across a range of tasks. It is a symbolic math library, and is also used for machine learning applications such as neural networks.

Scikit-Learn

A free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.

Histogram

A graph of vertical bars representing the frequency distribution of a set of data.

Logistic Regression

A model similar to linear regression but where the potential results are a specific set of categories instead of being continuous.

FastAPI

A modern, fast (high-performance), web framework for building APIs with Python 3.6+ based on standard Python type hints.

probability distribution

A probability distribution for a discrete random variable is a listing of all possible distinct outcomes and their probabilities of occurring. Because all possible outcomes are listed, the sum of the probabilities must add to 1.0.

normal distribution

A probability distribution which, when graphed, is a symmetrical bell curve with the mean value at the center. The standard deviation value affects the height and width of the graph.

Logarithm

A quantity representing the power to which a fixed number (the base) must be raised to produce a given number. If y = 10^x, then log(y) = x, taking the logarithm in base 10. Working with the log of one or more of a model's variables, instead of their original values, can make it easier to model relationships with linear functions instead of non-linear ones. Linear functions are typically easier to use in data analysis.

P value

Also, p-value. "The probability, under the assumption of no effect or no difference (the null hypothesis), of obtaining a result equal to or more extreme than what was actually observed." "It's a measure of how surprised you should be if there is no actual difference between the groups, but you got data suggesting there is. A bigger difference, or one backed up by more data, suggests more surprise and a smaller p value...The p value is a measure of surprise, not a measure of the size of the effect." A lower p value means that your results are more statistically significant.

random forest

An algorithm used for regression or classification that uses a collection of tree data structures. "To classify a new object from an input vector, put the input vector down each of the trees in the forest. Each tree gives a classification, and we say the tree 'votes' for that class. The forest chooses the classification having the most votes (over all the trees in the forest)."

Backpropagation

Back prop is just gradient descent on individual errors. You compare the predictions of the neural network with the desired output and then compute the gradient of the errors with respect to the weights of the neural network. This gives you a direction in the parameter weight space in which the error would become smaller. --Mikio L. Braun.

Blockchain

Blockchain is essentially a decentralized distributed database. Data scientists with access to blockchain data are able to build models and make predictions with cleaner, more reliable historical data. This is because the linked structure of blockchain makes it possible to trace the origin (as well as ownership changes) of any digital asset. This ability can provide key evidence in support of the authenticity of an object, asset, or record. This could lead to large amounts of highly structured, anonymized, and authenticated data assets with a transparent chronology of ownership.

Plotly-Dash

Dash is a productive Python framework for building web applications. Written on top of Flask, Plotly.js, and React.js, Dash is ideal for building data visualization apps with highly custom user interfaces in pure Python. It's particularly suited for anyone who works with data in Python.

null hypothesis

If your proposed model for a data set says that the value of x is affecting the value of y, then the null hypothesis (the model you're comparing your proposed model with to check whether x really is affecting y) says that the observations are all based on chance and that there is no effect. "The smaller the P-value computed from the sample data, the stronger the evidence is against the null hypothesis."

S curve

Imagine a graph showing, for each month since smartphones originally became available, how many people in the US bought their first one. The line would rise slowly at first, when only the early adopters got them, then quickly as these phones became more popular, and then level off again once nearly everyone had one. This graph's line would form a stretched-out "S" shape. The "S curve" applies to many other phenomena and is often mentioned when someone predicts that a rising value will eventually level off.

Keras

Keras is an open-source neural-network library written in Python. It is capable of running on top of TensorFlow, Microsoft Cognitive Toolkit, R, Theano, or PlaidML. Designed to enable fast experimentation with deep neural networks, it focuses on being user-friendly, modular, and extensible.

Pivot Table

Pivot tables quickly summarize long lists of data, without requiring you to write a single formula or copy a single cell. But the most notable feature of pivot tables is that you can arrange them dynamically. Say you create a pivot table summary using raw census data. With the drag of a mouse, you can easily rearrange the pivot table so that it summarizes the data based on gender or age groupings or geographic location. The process of rearranging your table is known as pivoting your data: you're turning the same information around to examine it from different angles.

Root Mean Squared Error (RMSE)

The square root of the Mean Squared Error. This is more popular than Mean Squared Error because taking the square root of a figure built from the squares of the observation value errors gives a number that's easier to understand in the units used to measure the original observations.

principal component analysis

This algorithm simply looks at the direction with the most variance and then determines that as the first principal component. This is very similar to how regression works in that it determines the best direction to map data to.
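
For two-dimensional data, the direction of greatest variance can be found in closed form from the covariance matrix. The sketch below is limited to that 2-D case and uses made-up points; general-purpose implementations rely on an eigendecomposition or SVD instead.

```python
import math

def first_principal_component(points):
    """Unit vector along the direction of greatest variance for 2-D points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Entries of the 2x2 sample covariance matrix.
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    # Angle of the leading eigenvector of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return math.cos(theta), math.sin(theta)

# Hypothetical points lying exactly on the line y = x.
points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
direction = first_principal_component(points)
```

For these points the returned direction is the diagonal, with equal x and y components.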

spatiotemporal data

Time series data that also includes geographic identifiers such as latitude-longitude pairs.

Vector

Webster's first mathematical definition is "a mathematical expression denoting a combination of magnitude and direction," which you may remember from geometry class, but their third definition is closer to how data scientists use the term: "an ordered set of real numbers, each denoting a distance on a coordinate axis."

quantile, quartile

When you divide a set of sorted values into groups that each have the same number of values (for example, if you divide the values into two groups at the median), each group is known as a quantile. If there are four groups, we call them quartiles, which is a common way to divide values for discussion and analysis purposes; if there are five, we call them quintiles, and so forth.

shell

When you use a computer's operating system from the command line, you're using its shell. Along with scripting languages such as Perl and Python, Linux-based shell tools (which are either included with or easily available for Mac and Windows machines) such as grep, diff, split, comm, head, and tail are popular for data wrangling. A series of shell commands stored in a file that lets you execute the series by entering the file's name is known as a shell script.

confidence interval

A statistical range, with a given probability, that takes random error into account.

Convolutional Neural Network (CNN)

A CNN is a common method used with deep learning, and is typically associated with computer vision and image recognition. CNNs employ the mathematical concept of convolution to simulate the neural connectivity lattice of the visual cortex in biological systems. Convolution can be viewed as a sliding window over top a matrix representation of an image. This allows for the simulation of the overlapping tiling of the visual field.

Long Short Term Memory (LSTM)

An LSTM network is a special kind of recurrent neural network (RNN) which is optimized for learning from and acting upon time-related data which may have undefined or unknown lengths of time between relevant events. LSTMs work very well on a wide range of problems and are now widely used. They were introduced in 1997 by Hochreiter & Schmidhuber, and were refined and popularized by many subsequent researchers.

Data Pipelines

A collection of scripts or functions that pass data along in a series. The output of the first method becomes the input of the second. This continues until the data is appropriately cleaned and transformed for whatever task a team is working on.
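
A toy sketch of the idea, where each step is a function and the output of one becomes the input of the next. The step names and the sample rows are invented for illustration.

```python
from functools import reduce

def run_pipeline(data, steps):
    """Feed data through each step in order; each output is the next input."""
    return reduce(lambda value, step: step(value), steps, data)

# Hypothetical cleaning steps for a list of raw strings.
def strip_whitespace(rows):
    return [r.strip() for r in rows]

def drop_empty(rows):
    return [r for r in rows if r]

def to_float(rows):
    return [float(r) for r in rows]

cleaned = run_pipeline([" 3.5", "", "7 ", "  "],
                       [strip_whitespace, drop_empty, to_float])
```

Keeping each step as a small function makes it easy to reorder, test, or replace one stage without touching the others.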

Data Warehouse

A data warehouse is a repository where all the data collected by an organization is stored and used as a guide to make management decisions.

Artificial Intelligence (AI)

A discipline involving research and development of machines that are aware of their surroundings. Most work in A.I. centers on using machine awareness to solve problems or accomplish some task. In case you didn't know, A.I. is already here: think self-driving cars, robot surgeons, and the bad guys in your favorite video game.

Greedy Algorithms

A greedy algorithm will break a problem down into a series of steps. It will then look for the best possible solution at each step, aiming to find the best overall solution available. A good example is Dijkstra's algorithm, which looks for the shortest possible path in a graph.
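
A compact sketch of Dijkstra's algorithm using a priority queue. The greedy choice is to always settle the unvisited node with the smallest known distance next. The graph and its edge weights are made up, and the weights are assumed to be non-negative.

```python
import heapq

def dijkstra(graph, source):
    """graph maps node -> {neighbor: non-negative edge weight}."""
    dist = {source: 0}
    queue = [(0, source)]
    while queue:
        # Greedy step: take the closest node not yet settled.
        d, node = heapq.heappop(queue)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry, a shorter path was already found
        for neighbor, weight in graph.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return dist

# Hypothetical weighted graph.
graph = {"A": {"B": 1, "C": 4}, "B": {"C": 2, "D": 5}, "C": {"D": 1}, "D": {}}
shortest = dijkstra(graph, "A")
```

Here the direct edge from A to C costs 4, but the greedy search finds the cheaper route through B.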

Covariance

A measure of linear association between two variables. Positive values indicate a positive relationship; negative values indicate a negative relationship.
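
A short sketch of the sample covariance, which divides by n - 1. The study-hours data is invented to show one positive and one negative relationship.

```python
def covariance(xs, ys):
    """Sample covariance of two equal-length lists (divides by n - 1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical observations for five students.
hours_studied = [1, 2, 3, 4, 5]
exam_score = [52, 60, 61, 70, 77]   # tends to rise with hours studied
hours_of_tv = [9, 7, 6, 4, 1]       # tends to fall with hours studied

positive = covariance(hours_studied, exam_score)
negative = covariance(hours_studied, hours_of_tv)
```

The sign shows the direction of the relationship; the magnitude depends on the units of both variables, which is why correlation is often reported instead.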

Cross validation

A method to validate the stability, or accuracy, of your machine-learning model. Although there are several types of cross validation, the most basic one involves splitting your training set in two and training the algorithm on one subset before applying it to the second subset. Because you know what output you should receive, you can assess a model's validity.
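
The basic two-way split can be sketched as follows. The data and the one-parameter threshold "model" are hypothetical stand-ins for a real training procedure.

```python
import random

def holdout_split(rows, test_fraction=0.5, seed=0):
    """Shuffle a copy of rows and split it into training and held-out parts."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def fit_threshold(train):
    """Hypothetical model: a cut point between the two classes."""
    zeros = [x for x, y in train if y == 0]
    ones = [x for x, y in train if y == 1]
    if not zeros or not ones:
        return 0.0  # degenerate split; fall back to an arbitrary threshold
    return (max(zeros) + min(ones)) / 2

def accuracy(threshold, rows):
    return sum(int(x >= threshold) == y for x, y in rows) / len(rows)

# Hypothetical labeled data: y is 1 when x is at least 5.
rows = [(x, int(x >= 5)) for x in range(10)]
train, held_out = holdout_split(rows)
score = accuracy(fit_threshold(train), held_out)
```

Because the labels of the held-out rows are known, `score` measures how well the fitted model does on data it never saw during training.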

Coefficient

A numerical or constant quantity placed before and multiplying the variable in an algebraic expression (e.g. 4 in 4x^y).

Fuzzy Algorithms

Algorithms that use fuzzy logic to decrease the runtime of a script. Fuzzy algorithms tend to be less precise than those that use Boolean logic. They also tend to be faster, and computational speed sometimes outweighs the loss in precision.

Big Data

Big data is a term that suffers from being too broad to be useful. It's more helpful to read it as, "so much data that you need to take careful steps to avoid week-long script runtimes." Big data is more about strategies and tools that help computers do complex analysis of very large (read: 1+ TB) data sets. The problems we must address with big data are categorized by the 4 V's: volume, variety, veracity, and velocity.

Clustering

Clustering techniques attempt to collect and categorize sets of points into groups that are "sufficiently similar," or "close" to one another. "Close" varies depending on how you choose to measure distance. Complexity increases as more features are added to a problem space.

Data Engineering

Data engineering is all about the back end. These are the people that build systems to make it easy for data scientists to do their analysis. In smaller teams, a data scientist may also be a data engineer. In larger groups, engineers are able to focus solely on speeding up analysis and keeping data well organized and easy to access.

Exploratory Data Analysis (EDA)

EDA is often the first step when analyzing datasets. With EDA techniques, data scientists can summarize a dataset's main characteristics and inform the development of more complex models or logical next steps.

Data Science

Given the rapid expansion of the field, the definition of data science can be hard to nail down. Basically, it's the discipline of using data and advanced statistics to make predictions. Data science is also focused on creating understanding among messy and disparate data. The "what" a scientist is tackling will differ greatly by employer.

Gradient descent

Gradient descent is an optimization algorithm that's used while training a machine learning model. The algorithm adjusts the model's parameters iteratively, stepping in the direction of the negative gradient, to drive a given function toward a local minimum. When the function is convex, that local minimum is also the global minimum.

Hypothesis Testing

Hypothesis testing is the use of statistics to determine the probability that a given hypothesis is true. It's frequently used in clinical research.

Linear Regression

Linear regression models the relationship between two variables by fitting a linear equation to the observed data. By doing so, you can predict an unknown variable based on its related known variable. A simple example is the relationship between an individual's height and weight.

Overfitting

Overfitting happens when a model considers too much information. It's like asking a person to read a sentence while looking at a page through a microscope. The patterns that enable understanding get lost in the noise.

Tidyverse

The tidyverse is a very well thought-out collection of R packages for data manipulation, exploratory data analysis, and visualization that share a common design philosophy. The tidyverse was primarily developed by data science luminary Hadley Wickham, but is now being expanded by several other contributors. The goal for the tidyverse is to make data scientists more productive by providing a path through workflows that facilitate concise communication, and results in reproducible work products.

Econometrics

The use of statistical techniques to describe the relationships between economic variables.

Variance

The variance of a set of values measures how spread out those values are. Mathematically, it is the average of the squared differences between individual values and the mean for the set of values. The square root of the variance for a set gives us the standard deviation, which is more intuitively useful.
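
A minimal sketch of the calculation, taking variance as the mean of the squared differences from the mean (the population form, dividing by n; the sample form divides by n - 1). The values are made up.

```python
import math

def variance(values):
    """Population variance: the mean squared deviation from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

# Hypothetical measurements with a mean of 5.
values = [2, 4, 4, 4, 5, 5, 7, 9]
var = variance(values)
std_dev = math.sqrt(var)
```

The standard deviation comes out in the same units as the original values, while the variance is in squared units.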

Decision Trees

This machine learning method uses a line of branching questions or observations about a given data set to predict a target value. They tend to over-fit models as data sets grow large. Random forests are a type of decision tree algorithm designed to reduce over-fitting.

ETL (Extract, Transform, Load)

This process is key to data warehouses. It describes the three stages of bringing data from numerous places in a raw form to a screen, ready for analysis. ETL systems are generally gifted to us by data engineers and run behind the scenes.

A/B Testing

Used to collect data and compare performance among two options studied (A and B).

Mean (Average, Expected Value)

A calculation that gives us a sense of a "typical" value for a group of numbers. The mean is the sum of a list of values divided by the number of values in that list. It can be deceiving used on its own, and in practice we use the mean with other statistical values to gain intuition about our data.

Algorithms

An algorithm is a set of instructions we give a computer so it can take values and manipulate them into a usable form. This can be as easy as finding and removing every comma in a paragraph, or as complex as building an equation that predicts how many home runs a baseball player will hit in 2018.

Deep Learning

Deep learning models use very large neural networks — called deep nets — to solve complex problems, such as facial recognition. The layers in a model start with identifying very simple patterns and then build in complexity. By the end the net (hopefully) has a nuanced understanding that can accurately classify or predict values.

GPU acceleration

GPU-acceleration refers to the use of a graphics processing unit (GPU) along with a central processing unit (CPU) in order to facilitate compute-intensive AI operations such as deep learning. A related term is GPU database, which is a database, relational or non-relational, that uses a GPU to accelerate certain database operations.

GitHub

GitHub is a code-sharing and publishing service, as well as a community for developers. It provides access control and several collaboration features, such as bug tracking, feature requests, task management and wikis for every project. GitHub offers both private repositories and free accounts, which are commonly used to host open-source software projects.

Dimension Reduction

Process of reducing the number of variables to consider in a data-mining approach.

R

R is a programming language and software environment for statistical computing. The R language is widely used among statisticians and data miners for developing statistical software and data analysis.

Regression

Regression is another supervised machine learning problem. It focuses on how a target value changes as other values within a data set change. Regression problems generally deal with continuous variables, like how square footage and location affect the price of a house.

Jupyter Notebooks

The Jupyter (an acronym using the names of several popular languages used by data scientists: Julia, Python, and R) Notebook is the tool of choice for many data scientists. It is an open-source web application that allows you to create and share documents that contain code, equations, visualizations, and narrative text. Jupyter Notebooks help data scientists streamline their work and enable increased productivity and provide the means for collaboration.

Data Visualization

The art of communicating meaningful data visually. This can involve infographics, traditional plots, or even full data dashboards.

Back End

The back end is all of the code and technology that works behind the scenes to populate the front end with useful information. This includes databases, servers, authentication procedures, and much more. You can think of the back end as the frame, the plumbing, and the wiring of an apartment.

Data Storytelling

The last step of the data science process involves communicating potentially complex machine learning results to project stakeholders who are non-experts with data science. Data storytelling is an important skill set for all data scientists.

Feature

The machine learning expression for a piece of measurable information about something. If you store the age, annual income, and weight of a set of people, you're storing three features about them. In other areas of the IT world, people may use the terms property, attribute, or field instead of "feature."

Data Exploration

The part of the data science process where a scientist will ask basic questions that help her understand the context of a data set. What you learn during the exploration phase will guide more in-depth analysis later. Further, it helps you recognize when a result might be surprising and warrant further investigation.

Feature Selection

The process of identifying what traits of a data set are going to be the most valuable when building a model. It's especially helpful with large data sets, as using fewer features will decrease the amount of time and complexity involved in training and testing a model. The process begins with measuring how relevant each feature in a data set is for predicting your target variable. You then choose a subset of features that will lead to a high-performance model.

Data Mining

The process of pulling actionable insight out of a set of data and putting it to good use. This includes everything from cleaning and organizing the data; to analyzing it to find meaningful patterns and connections; to communicating those connections in a way that helps decision-makers improve their product or organization.

Feature Engineering

The process of taking knowledge we have as humans and translating it into a quantitative value that a computer can understand. For example, we can translate our visual understanding of the image of a mug into a representation of pixel intensities.

Residual (Error)

The residual is a measure of how much a real value differs from some statistical value we calculated based on the set of data. So given a prediction that it will be 20 degrees Fahrenheit at noon tomorrow, when noon hits and it's only 18 degrees, we have an error of 2 degrees. This is often used interchangeably with the term "error," even though, technically, error is a purely theoretical value.

Sample

The sample is the collection of data points we have access to. We use the sample to make inferences about a larger population. For instance, a political poll takes a sample of 1,000 Greek citizens to infer the opinions of all of Greece.

Standard Deviation

The standard deviation of a set of values helps us understand how spread out those values are. This statistic is more useful than the variance because it's expressed in the same units as the values themselves. Mathematically, the standard deviation is the square root of the variance of a set. It's often represented by the Greek symbol sigma, σ.
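
The relationship between variance and standard deviation can be checked with a short sketch using Python's standard library. The list `values` is made-up illustration data, and the population (not sample) formulas are assumed here.

```python
import math
import statistics

# Made-up illustration data (an assumption, not from the glossary).
values = [2, 4, 4, 4, 5, 5, 7, 9]

# Population variance: the mean of the squared deviations from the mean.
variance = statistics.pvariance(values)

# The standard deviation is the square root of the variance,
# so it is expressed in the same units as the values themselves.
std_dev = math.sqrt(variance)

print(variance, std_dev)
```

`statistics.pstdev(values)` computes the same quantity in one call.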

Data Analysis

This discipline is the little brother of data science. Data analysis is focused more on answering questions about the present and the past. It uses less complex statistics and generally tries to identify patterns that can improve an organization.

Quantitative Analysis

This field is highly focused on using algorithms to gain an edge in the financial sector. These algorithms either recommend or make trading decisions based on a huge amount of data, often on the order of picoseconds. Quantitative analysts are often called "quants."

Training and Testing

This is part of the machine learning workflow. When making a predictive model, you first offer it a set of training data so it can build understanding. Then you pass the model a test set, where it applies its understanding and tries to predict a target value.
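
A minimal sketch of how a data set might be divided before training. The helper name `split_rows`, the 25% test fraction, and the fixed seed are all assumptions chosen for illustration; libraries such as scikit-learn provide their own utilities for this.

```python
import random

def split_rows(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows and split them into (training set, test set)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# 20 hypothetical rows: 15 go to training, 5 are held back for testing.
train, test = split_rows(range(20))
print(len(train), len(test))
```

The model only ever sees `train` while it is being fitted; `test` is kept aside so its predictions can be checked against data it has not seen.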

Transfer Learning

Transfer learning is a deep learning technique where a model developed for one task is repurposed as the starting point for a model on another task. Transfer learning is a popular method where pre-trained deep learning models are used as the starting point for computer vision and natural language problems. This saves considerable computing resources required to develop deep neural networks for these problem domains.

Underfitting

Underfitting happens when you don't offer a model enough information. An example of underfitting would be asking someone to graph the change in temperature over a day and only giving them the high and low. Instead of the smooth curve one might expect, you only have enough information to draw a straight line.

Cross-validation

Verifying the results obtained from a validation study by administering a test or test battery to a different sample (drawn from the same population).

Supervised Machine Learning

With supervised learning techniques, the data scientist gives the computer a well-defined set of data. All of the columns are labelled and the computer knows exactly what it's looking for. It's similar to a professor handing you a syllabus and telling you what to expect on the final.

Reinforcement learning

Without specific goals, reinforcement learning algorithms deal with the problem of finding suitable actions to take in a given situation in order to maximize a reward, learning which actions are optimal by trial and error. When I first learned about reinforcement learning, I reflected back to the old Pac-Man video game. With reinforcement learning, using trial and error, the algorithm would find that certain uses of the button and movements of the joystick would improve the player's score; moreover, the process of trial and error would tend toward an optimal state of the game.

Binomial Distribution

A distribution of outcomes of independent events with two mutually exclusive possible outcomes, a fixed number of trials, and a constant probability of success. This is a discrete probability distribution, as opposed to continuous—for example, instead of graphing it with a line, you would use a histogram, because the potential outcomes are a discrete set of values. As the number of trials represented by a binomial distribution goes up, if the probability of success remains constant, the histogram bars will get thinner, and it will look more and more like a graph of normal distribution.
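
As a small sketch, the probability of each outcome can be computed from the standard binomial formula. The function name `binomial_pmf` and the choice of 10 trials with a success probability of 0.5 are assumptions for illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# One bar of the histogram for each possible number of successes (0..10).
n, p = 10, 0.5
probs = [binomial_pmf(k, n, p) for k in range(n + 1)]

for k, prob in enumerate(probs):
    print(k, round(prob, 4))
```

The probabilities cover a discrete set of outcomes and sum to 1, which is why a histogram rather than a continuous line is the natural plot.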

Neural Networks

A machine learning method that's very loosely based on neural connections in the brain. Neural networks are a system of connected nodes that are segmented into layers — input, output, and hidden layers. The hidden layers (there can be many) are the heavy lifters used to make predictions. Values from one layer are filtered by the connections to the next layer, until the final set of outputs is given and a prediction is made.

Machine Learning

A process where a computer uses an algorithm to gain understanding about a set of data, then makes predictions based on its understanding. There are many types of machine learning techniques; most are classified as either supervised or unsupervised techniques.

Python

A general-purpose programming language, and one of the languages used to manipulate and store data. Many highly trafficked websites, such as YouTube, are created using Python.

JavaScript

A scripting language (no relation to Java) originally designed in the mid-1990s for embedding logic in web pages, but which later evolved into a more general-purpose development language. JavaScript continues to be very popular for embedding logic in web pages, with many libraries available to enhance the operation and visual presentation of these pages.

Normalize

A set of data is said to be normalized when all of the values have been adjusted to fall within a common range. We normalize data sets to make comparisons easier and more meaningful. For instance, taking movie ratings from a bunch of different websites and adjusting them so they all fall on a scale of 0 to 100.
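
The movie-rating example could be sketched as a linear rescaling. The function name `rescale` and the two rating scales (1 to 5 stars and 0 to 10 points) are assumptions made up for illustration.

```python
def rescale(values, old_min, old_max, new_min=0.0, new_max=100.0):
    """Linearly map values from [old_min, old_max] onto [new_min, new_max]."""
    span = old_max - old_min
    return [new_min + (v - old_min) / span * (new_max - new_min) for v in values]

# Hypothetical ratings of the same films from two sites with different scales.
five_star_site = [1.0, 3.0, 5.0]    # rated on a 1-5 scale
ten_point_site = [2.0, 7.5, 10.0]   # rated on a 0-10 scale

stars_0_100 = rescale(five_star_site, 1, 5)
points_0_100 = rescale(ten_point_site, 0, 10)
print(stars_0_100, points_0_100)
```

Once both lists share the 0 to 100 range, ratings from the two sites can be compared directly.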

Recurrent Neural Network (RNN)

An RNN represents a type of neural network that works with sequences of data, and where the output from the previous step is fed to the current step as input. In traditional neural networks, all the inputs and outputs are independent of one another. However, in some cases, such as predicting the next word of a sentence, there is a need to remember the previous words. RNNs solve this need with the help of a hidden layer. The distinguishing characteristic of an RNN is the hidden state, which recalls information about a sequence. RNNs have, in a sense, some memory about what happened earlier in the sequence of data.
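
The idea of a hidden state carried from step to step can be sketched with a toy, single-number recurrence. This is an illustration only: the function name `rnn_forward` and the weights `w_x`, `w_h`, and `b` are arbitrary assumptions, and real RNNs use weight matrices learned from data.

```python
import math

def rnn_forward(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Run a scalar recurrent cell over a sequence and return every hidden state."""
    h = 0.0  # hidden state before any input has been seen
    states = []
    for x in inputs:
        # The new state mixes the current input with the previous state.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

# Only the first input is non-zero, yet later states are still non-zero:
# the hidden state "remembers" what happened earlier in the sequence.
states = rnn_forward([1.0, 0.0, 0.0])
print(states)
```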

Fuzzy Logic

An abstraction of Boolean logic that substitutes a range of values between 0 and 1 for the usual True and False. That is, fuzzy logic allows statements like "a little true" or "mostly false."
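
One common convention for combining fuzzy truth values (assumed here; other conventions exist) takes the minimum for AND, the maximum for OR, and one minus the value for NOT. The function names and the example degrees of truth are made up for illustration.

```python
def fuzzy_not(a):
    """Degree to which 'not a' is true."""
    return 1.0 - a

def fuzzy_and(a, b):
    """Degree to which both statements are true (minimum convention)."""
    return min(a, b)

def fuzzy_or(a, b):
    """Degree to which at least one statement is true (maximum convention)."""
    return max(a, b)

mostly_true = 0.8
a_little_true = 0.2

print(fuzzy_and(mostly_true, a_little_true))
print(fuzzy_or(mostly_true, a_little_true))
print(fuzzy_not(mostly_true))
```

With inputs restricted to exactly 0 and 1, these functions behave like ordinary Boolean AND, OR, and NOT.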

Bayes' Theorem

An equation for calculating the probability that something is true if something potentially related to it is true. If P(A) means "the probability that A is true" and P(A|B) means "the probability that A is true if B is true," then Bayes' Theorem tells us that P(A|B) = (P(B|A)P(A)) / P(B). This is useful for working with false positives—for example, if x% of people have a disease, the test for it is correct y% of the time, and you test positive, Bayes' Theorem helps calculate the odds that you actually have the disease. The theorem also makes it easier to update a probability based on new data, which makes it valuable in the many applications where data continues to accumulate. Named for eighteenth-century English statistician and Presbyterian minister Thomas Bayes.
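
The disease-test example can be worked through numerically. The numbers below (1% of people have the disease, the test catches 95% of true cases, and it wrongly flags 5% of healthy people) and the function name `posterior` are assumptions chosen for illustration.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(disease | positive test) via Bayes' Theorem.

    prior               = P(disease)
    sensitivity         = P(positive | disease)
    false_positive_rate = P(positive | no disease)
    """
    # P(positive) from the law of total probability.
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_positive

p = posterior(prior=0.01, sensitivity=0.95, false_positive_rate=0.05)
print(round(p, 3))
```

Under these assumed numbers a positive result still leaves the chance of having the disease at roughly 16%, because false positives among the many healthy people outnumber true positives among the few sick ones.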

AngularJS

An open-source JavaScript library maintained by Google and the AngularJS community that lets developers create what are known as Single [web] Page Applications. AngularJS is popular with data scientists as a way to show the results of their analysis.

Outlier

An outlier is a data point that is considered extremely far from other points. Outliers are generally the result of exceptional cases or errors in measurement, and should always be investigated early in a data analysis workflow.

Database

As simply as possible, this is a storage space for data. We mostly use databases with a Database Management System (DBMS), like PostgreSQL or MySQL. These are computer applications that allow us to interact with a database to collect and analyze the information inside.

Chi-square test

Chi (pronounced like "pie" but beginning with a "k") is a Greek letter, and chi-square is "a statistical method used to test whether the classification of data can be ascribed to chance or to some underlying law." The chi-square test "is an analysis technique used to estimate whether two variables in a cross tabulation are correlated." A chi-square distribution varies from normal distribution based on the "degrees of freedom" used to calculate it.

Classification

Classification is a supervised machine learning problem. It deals with categorizing a data point based on its similarity to other data points. You take a set of data where every item already has a category and look at common traits between each item. You then use those common traits as a guide for what category the new item might have.

Correlation

Correlation is the measure of how much one set of values depends on another. If values increase together, they are positively correlated. If values from one set increase as the other decreases, they are negatively correlated. There is no correlation when a change in one set has nothing to do with a change in the other.
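
One widely used way to put a number on this is the Pearson correlation coefficient, which ranges from -1 to 1. The sketch below assumes that measure; the function name `pearson` and the sample lists are made up for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

hours = [1, 2, 3, 4]
rising = [2, 4, 6, 8]    # increases together with hours
falling = [8, 6, 4, 2]   # decreases as hours increase

r_positive = pearson(hours, rising)
r_negative = pearson(hours, falling)
print(r_positive, r_negative)
```

Values near 1 indicate positive correlation, values near -1 indicate negative correlation, and values near 0 indicate little or no linear relationship.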

H2O.ai

H2O is a leading open source data science and machine learning platform for R and Python. H2O's Driverless AI tool is an AI platform that automates some of the most difficult data science and machine learning workflows such as feature engineering, model validation, model tuning, model selection, and model deployment. It aims to achieve the highest predictive accuracy in the shortest amount of time, while minimizing the amount of data scientist resources. H2O's mission is to democratize AI for all.

Docker containers

In a nutshell, a Docker container is a small, user-level virtualization that helps data scientists build, install, and run code. In other words, a container is a light-weight virtual machine (VM) that is built from a script that can be version controlled, resulting in the ability to version control a data science environment.

Median

In a set of values listed in order, the median is whatever value is in the middle. We often use the median along with the mean to judge if there are values that are unusually high or low in the set. This is an early hint to explore outliers.
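
A quick sketch of comparing the median with the mean, using Python's `statistics` module. The list `ages` is made-up data with one deliberately extreme value.

```python
import statistics

# Made-up data: four ordinary values and one that is unusually high.
ages = [30, 32, 35, 38, 400]

middle = statistics.median(ages)   # 35, the middle value once sorted
average = statistics.mean(ages)    # 107, pulled upward by the 400

print(middle, average)
```

A mean far from the median, as here, hints that the set contains unusually high or low values worth investigating.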

Activation Function

In neural networks, linear and non-linear activation functions produce output decision boundaries by combining the network's weighted inputs. The ReLU (Rectified Linear Unit) activation function is the most commonly used activation function right now, although the Tanh or hyperbolic tangent, and Sigmoid or logistic activation functions are also used.
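
The three activation functions named above can be written out directly from their standard definitions. The function names `relu` and `sigmoid` are chosen here for illustration; the hyperbolic tangent is available as `math.tanh`.

```python
import math

def relu(x):
    """Rectified Linear Unit: passes positive inputs through, clips negatives to 0."""
    return max(0.0, x)

def sigmoid(x):
    """Logistic function: squashes any input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# math.tanh squashes any input into the range (-1, 1).
for x in (-2.0, 0.0, 2.0):
    print(x, relu(x), round(sigmoid(x), 3), round(math.tanh(x), 3))
```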

Unsupervised Machine Learning

In unsupervised learning techniques, the computer builds its own understanding of a set of unlabeled data. Unsupervised ML techniques look for patterns within data, and often deal with classifying items based on shared traits.

MXNet

MXNet is a popular and scalable deep learning framework. As an open source library, MXNet helps data scientists build, train, and run deep learning models.

Natural Language Processing (NLP)

NLP is a branch of AI that provides a vehicle for computers to understand, interpret, and manipulate natural (human) language. NLP is composed of elements from a number of fields including computer science and computational linguistics in order to bridge the separation between human communication and computer understanding.

Gradient Descent

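An iterative optimization algorithm for finding the input to a function that produces the optimal (typically the minimum) value. At each step it moves the input a small distance in the direction opposite to the function's gradient.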
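
A minimal sketch of the iteration, assuming a one-variable function whose derivative is known. The example function f(x) = (x - 3)^2, the learning rate, the step count, and the name `gradient_descent` are all assumptions chosen for illustration.

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(steps):
        x -= learning_rate * grad(x)  # move downhill by a small amount
    return x

# f(x) = (x - 3)^2 has derivative 2 * (x - 3) and its minimum at x = 3.
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(x_min)
```

Each pass shrinks the distance to the minimum by a constant factor here, so the result approaches 3 without ever needing to solve for it directly.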

Business Intelligence (BI)

Similar to data analysis, but more narrowly focused on business metrics. The technical side of BI involves learning how to effectively use software to generate reports and find important trends. It's descriptive, rather than predictive.

Standard Error

Standard error is the measure of the statistical accuracy of an estimate. A larger sample size decreases the standard error.
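
For the specific case of a sample mean (an assumption; other estimates have their own standard-error formulas), the standard error is the sample standard deviation divided by the square root of the sample size. The function name and data below are made up for illustration.

```python
import math
import statistics

def standard_error_of_mean(sample):
    """Sample standard deviation divided by the square root of the sample size."""
    return statistics.stdev(sample) / math.sqrt(len(sample))

small_sample = [1, 2, 3, 4]
large_sample = small_sample * 4   # same spread of values, four times as many

se_small = standard_error_of_mean(small_sample)
se_large = standard_error_of_mean(large_sample)
print(se_small, se_large)
```

The larger sample yields the smaller standard error, matching the statement that a larger sample size decreases the standard error.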

Statistical Power

Statistical power is the probability of making the correct decision to reject the null hypothesis when the null hypothesis is false. In other words, it's the likelihood a study will detect an effect when there is an effect to be detected. A high statistical power means a lower likelihood of concluding incorrectly that a variable has no effect.

Statistic vs. Statistics

Statistics (plural) is the entire set of tools and methods used to analyze a set of data. A statistic (singular) is a value that we calculate or infer from data. We get the median (a statistic) of a set of numbers by using techniques from the field of statistics.

