Data Science Basics Glossary


Mean (Average, Expected Value)

A calculation that gives us a sense of a "typical" value for a group of numbers. The mean is the sum of a list of values divided by the number of values in that list. It can be deceiving when used on its own, and in practice we use the mean alongside other statistical values to gain intuition about our data.
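
As a minimal sketch in Python (with made-up numbers), note how a single extreme value can drag the mean away from what feels "typical":

```python
# Computing the mean of a small, hypothetical data set (no libraries needed).
values = [2, 4, 6, 100]  # 100 is an outlier
mean = sum(values) / len(values)
print(mean)  # 28.0, pulled far above the three "typical" values
```

This is why the definition warns against using the mean on its own; pairing it with the median or standard deviation reveals the skew.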

Database

A collection of data organized in a manner that allows access, retrieval, and use of that data

Data Pipelines

A collection of scripts or functions that pass data along in a series. The output of the first method becomes the input of the second. This continues until the data is appropriately cleaned and transformed for whatever task a team is working on.
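
A toy pipeline, with hypothetical step names, might look like this, where each function's output feeds the next function's input:

```python
# A toy data pipeline: parse -> clean -> transform, chained in sequence.
def load(raw):
    # Step 1: parse a raw comma-separated string into numbers
    return [float(x) for x in raw.split(",")]

def clean(values):
    # Step 2: drop values that are clearly invalid (negative here)
    return [v for v in values if v >= 0]

def transform(values):
    # Step 3: rescale everything onto a 0-1 range
    hi = max(values)
    return [v / hi for v in values]

def pipeline(raw):
    return transform(clean(load(raw)))

result = pipeline("4,-1,2,8")
print(result)  # [0.5, 0.25, 1.0]
```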

Algorithms

very specific, step-by-step procedures for solving certain types of problems

Underfitting

when a model is too simple, both training and test errors are large

Artificial Intelligence (AI) Field

A discipline involving research and development of machines that are aware of their surroundings. Most work in A.I. centers on using machine awareness to solve problems or accomplish some task. In case you didn't know, A.I. is already here: think self-driving cars, robot surgeons, and the bad guys in your favorite video game.

Greedy Algorithms

A greedy algorithm will break a problem down into a series of steps. It will then look for the best possible solution at each step, aiming to find the best overall solution available. A good example is Dijkstra's algorithm, which looks for the shortest possible path in a graph.
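
A smaller illustration than Dijkstra's algorithm is greedy coin change: at each step, take the largest coin that still fits. (This greedy choice happens to be optimal for the US coin set used below, though not for every coin system.)

```python
# Greedy coin change: always take the locally best (largest) coin.
def make_change(amount, coins=(25, 10, 5, 1)):
    used = []
    for coin in coins:            # coins considered from largest to smallest
        while amount >= coin:     # greedily take this coin while it fits
            used.append(coin)
            amount -= coin
    return used

print(make_change(67))  # [25, 25, 10, 5, 1, 1]
```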

Machine Learning

A process where a computer uses an algorithm to gain understanding about a set of data, then makes predictions based on its understanding. There are many types of machine learning techniques; most are classified as either supervised or unsupervised techniques.

Statistical Significance

A result is statistically significant when we judge that it probably didn't happen due to chance. Statistical significance is widely used in surveys and statistical studies, though it is not always an indication of practical value.

Normalize

A set of data is said to be normalized when all of the values have been adjusted to fall within a common range. We normalize data sets to make comparisons easier and more meaningful. For instance, taking movie ratings from a bunch of different websites and adjusting them so they all fall on a scale of 0 to 100.
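
The rating example can be sketched with min-max normalization; the function and data below are illustrative:

```python
# Min-max normalization: rescale values onto a common range (here 0-100).
def normalize(values, new_min=0, new_max=100):
    lo, hi = min(values), max(values)
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]

# Hypothetical movie ratings on a 1-5 star scale:
stars = [1, 3, 5]
print(normalize(stars))  # [0.0, 50.0, 100.0]
```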

Time Series

A time series is a set of data that's ordered by when each data point occurred. Think of stock market prices over the course of a month, or the temperature throughout a day.

Fuzzy Algorithms

Algorithms that use fuzzy logic to decrease the runtime of a script. Fuzzy algorithms tend to be less precise than those that use Boolean logic. They also tend to be faster, and computational speed sometimes outweighs the loss in precision.

Fuzzy Logic

An abstraction of Boolean logic that substitutes a range of values between 0 and 1 for the usual True and False. That is, fuzzy logic allows statements like "a little true" or "mostly false."
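
One common interpretation (Zadeh's min/max operators, an assumption here since fuzzy logic admits several) defines AND, OR, and NOT on these graded truth values:

```python
# Fuzzy AND/OR/NOT on truth values in [0, 1] (Zadeh min/max operators).
def fuzzy_and(a, b):
    return min(a, b)   # a statement is only as true as its weakest part

def fuzzy_or(a, b):
    return max(a, b)   # a disjunction is as true as its strongest part

def fuzzy_not(a):
    return 1 - a       # "mostly true" negates to "a little true"

mostly_true, a_little_true = 0.8, 0.2
print(fuzzy_and(mostly_true, a_little_true))  # 0.2
print(fuzzy_or(mostly_true, a_little_true))   # 0.8
```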

Data Science Field

An interdisciplinary field involving the design and use of techniques to process very large amounts of data from a variety of sources and to provide knowledge based on the data.

Outlier

An outlier is a data point that is considered extremely far from other points. They are generally the result of exceptional cases or errors in measurement, and should always be investigated early in a data analysis workflow.
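
One common screen (a sketch, not the only method) flags points more than 1.5 interquartile ranges outside the quartiles:

```python
# Flag outliers with the 1.5 * IQR rule, using the stdlib statistics module.
import statistics

def iqr_outliers(values):
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

data = [10, 12, 11, 13, 12, 11, 95]  # 95 looks suspicious
print(iqr_outliers(data))            # [95]
```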

Big Data

Big data is about strategies and tools that help computers do complex analysis of very large (read: 1+ TB) data sets. The problems we must address with big data are characterized by the four V's: volume, variety, veracity, and velocity.

Clustering

Clustering techniques attempt to collect and categorize sets of points into groups that are "sufficiently similar," or "close" to one another. "Close" varies depending on how you choose to measure distance. Complexity increases as more features are added to a problem space.
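
The "closeness" idea can be sketched by assigning each point to its nearest cluster center under Euclidean distance (the most common distance choice; the centers here are assumed, not computed):

```python
# Assign points to the nearest of two fixed cluster centers (Euclidean distance).
import math

def nearest_center(point, centers):
    return min(centers, key=lambda c: math.dist(point, c))

centers = [(0, 0), (10, 10)]
points = [(1, 1), (9, 8), (0, 2)]
labels = [nearest_center(p, centers) for p in points]
print(labels)  # [(0, 0), (10, 10), (0, 0)]
```

Full clustering algorithms such as k-means repeat this assignment step while also updating the centers.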

Correlation

Correlation is the measure of how much one set of values depends on another. If values increase together, they are positively correlated. If values in one set increase as those in the other decrease, they are negatively correlated. There is no correlation when a change in one set has nothing to do with a change in the other.
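
As a sketch with made-up data, the Pearson correlation coefficient computes this from first principles; it ranges from -1 (perfect negative) through 0 (none) to +1 (perfect positive):

```python
# Pearson correlation coefficient, computed from its definition.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

hours = [1, 2, 3, 4]
score = [10, 20, 30, 40]     # increases exactly in step with hours
print(pearson(hours, score)) # ~1.0, a perfect positive correlation
```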

Data Journalism Field

Data as new "material" for journalists

Data Engineering Field

Data engineering is all about the back end. These are the people who build systems to make it easy for data scientists to do their analysis. In smaller teams, a data scientist may also be a data engineer. In larger groups, engineers are able to focus solely on speeding up analysis and keeping data well organized and easy to access.

Deep Learning

Deep learning models use very large neural networks — called deep nets — to solve complex problems, such as facial recognition. The layers in a model start with identifying very simple patterns and then build in complexity. By the end the net (hopefully) has a nuanced understanding that can accurately classify or predict values.

Feature Selection

Feature selection returns a subset of the original feature set; it does not extract new features.

Benefits:
• Features retain their original meaning
• Once the features are determined, the selection process is fast

Disadvantages:
• Cannot extract new features that have a stronger correlation with the target variable

Statistical Significance of a Correlation

In a correlational study, the correlation in the sample is large enough that it is very unlikely to have been produced by random variation, but rather represents a real relationship in the population.

Overfitting

Overfitting happens when a model considers too much information. It's like asking a person to read a sentence while looking at a page through a microscope. The patterns that enable understanding get lost in the noise.

Regression

Regression is another supervised machine learning problem. It focuses on how a target value changes as other values within a data set change. Regression problems generally deal with continuous variables, like how square footage and location affect the price of a house.
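
The simplest case, fitting a straight line by ordinary least squares, can be sketched with the closed-form solution (the housing numbers below are invented):

```python
# Ordinary least-squares fit of a line y = a + b*x (closed-form solution).
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx   # intercept makes the line pass through the means
    return a, b

# Hypothetical data: square footage (100s of sq ft) vs. price ($1000s)
sqft  = [10, 15, 20, 25]
price = [200, 250, 300, 350]
a, b = fit_line(sqft, price)
print(a, b)  # predicted price = 100 + 10 * sqft
```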

Business Intelligence (BI) Field

Similar to data analysis, but more narrowly focused on business metrics. The technical side of BI involves learning how to effectively use software to generate reports and find important trends. It's descriptive, rather than predictive.

Statistic vs. Statistics

Statistics (plural) is the entire set of tools and methods used to analyze a set of data. A statistic (singular) is a value that we calculate or infer from data. We get the median (a statistic) of a set of numbers by using techniques from the field of statistics.

Summary Statistics

Summary statistics are the measures we use to communicate insights about our data in a simple way. Examples of summary statistics are the mean, median and standard deviation.

Data Visualization Field

The art of communicating meaningful data visually. This can involve infographics, traditional plots, or even full data dashboards. Nicholas Felton is a pioneer in this field, and Edward Tufte literally wrote the book.

Back End

The back end is all of the code and technology that works behind the scenes to populate the front end with useful information. This includes databases, servers, authentication procedures, and much more. You can think of the back end as the frame, the plumbing, and the wiring of an apartment.

Front End

The front end is everything a client or user gets to see and interact with directly. This includes data dashboards, web pages, and forms.

Data Exploration

The part of the data science process where a scientist will ask basic questions that help her understand the context of a data set. What you learn during the exploration phase will guide more in-depth analysis later. Further, it helps you recognize when a result might be surprising and warrant further investigation.

Data Wrangling (Munging)

The process of cleaning, unifying, and preparing unorganized and scattered data sets for easy access and analysis.

Classification

The process of grouping things based on their similarities

Data Mining

The process of pulling actionable insight out of a set of data and putting it to good use. This includes everything from cleaning and organizing the data; to analyzing it to find meaningful patterns and connections; to communicating those connections in a way that helps decision-makers improve their product or organization.

Feature Engineering

The process of taking knowledge we have as humans and translating it into a quantitative value that a computer can understand. For example, we can translate our visual understanding of the image of a mug into a representation of pixel intensities.

Residual (Error)

The residual is a measure of how much a real value differs from some statistical value we calculated based on the set of data. So given a prediction that it will be 20 degrees Fahrenheit at noon tomorrow, when noon hits and it's only 18 degrees, we have an error of 2 degrees. This term is often used interchangeably with "error," even though, technically, error is a purely theoretical value.
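
The temperature example reduces to one subtraction:

```python
# Residual = actual value minus predicted value.
predicted = 20   # forecast: 20 degrees Fahrenheit at noon
actual = 18      # observed: 18 degrees at noon
residual = actual - predicted
print(residual)  # -2: the forecast overshot by 2 degrees
```

The sign convention (actual minus predicted) matters: negative residuals mean the model predicted too high.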

Sample

The sample is the collection of data points we have access to. We use the sample to make inferences about a larger population. For instance, a political poll takes a sample of 1,000 Greek citizens to infer the opinions of all of Greece.

Standard Deviation

The standard deviation of a set of values helps us understand how spread out those values are. This statistic is more useful than the variance because it's expressed in the same units as the values themselves. Mathematically, the standard deviation is the square root of the variance of a set. It's often represented by the Greek letter sigma, σ.

Variance

The variance of a set of values measures how spread out those values are. Mathematically, it is the average of the squared differences between individual values and the mean for the set of values. The square root of the variance for a set gives us the standard deviation, which is more intuitively useful.
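
Both definitions, worked step by step on a small made-up set:

```python
# Population variance and standard deviation from first principles.
import math

values = [2, 4, 4, 4, 5, 5, 7, 9]
mean = sum(values) / len(values)                               # 5.0
variance = sum((v - mean) ** 2 for v in values) / len(values)  # 4.0
std_dev = math.sqrt(variance)                                  # 2.0
print(mean, variance, std_dev)
```

Note this is the population variance (divide by n); sample variance divides by n - 1 instead.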

Data Analysis Field

This discipline is the little brother of data science. Data analysis is focused more on answering questions about the present and the past. It uses less complex statistics and generally tries to identify patterns that can improve an organization.

Quantitative Analysis Field

This field is highly focused on using algorithms to gain an edge in the financial sector. These algorithms either recommend or make trading decisions based on huge amounts of data, often within tiny fractions of a second. Quantitative analysts are often called "quants."

Training and Testing

This is part of the machine learning workflow. When making a predictive model, you first offer it a set of training data so it can build understanding. Then you pass the model a test set, where it applies its understanding and tries to predict a target value.
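
The split itself can be sketched with the standard library; the 75/25 proportion and fixed seed below are illustrative choices:

```python
# Hold out part of the data so the model is tested on points it never saw.
import random

def train_test_split(data, test_fraction=0.25, seed=42):
    rows = list(data)
    random.Random(seed).shuffle(rows)   # shuffle so the split is unbiased
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]

data = list(range(8))
train, test = train_test_split(data)
print(len(train), len(test))  # 6 2
```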

Decision Trees

This machine learning method uses a line of branching questions or observations about a given data set to predict a target value. They tend to over-fit models as data sets grow large. Random forests are a type of decision tree algorithm designed to reduce over-fitting.

ETL (Extract, Transform, Load)

This process is key to data warehouses. It describes the three stages of bringing raw data from numerous sources into one organized store, ready for analysis. ETL systems are generally gifted to us by data engineers and run behind the scenes.

Web Scraping

Web scraping is the process of pulling data from a website's source code. It generally involves writing a script that will identify the information a user wants and pull it into a new file for later analysis.
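
A tiny sketch using only the standard library's HTML parser; real scrapers fetch pages over HTTP first, and the page snippet and tag choice here are invented for illustration:

```python
# Pull the text of every <h2> tag out of an HTML document.
from html.parser import HTMLParser

class TitleGrabber(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":            # suppose the data we want lives in <h2> tags
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:          # keep only text inside the target tags
            self.titles.append(data)

page = "<html><h2>First</h2><p>text</p><h2>Second</h2></html>"
grabber = TitleGrabber()
grabber.feed(page)
print(grabber.titles)  # ['First', 'Second']
```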

Parts of a Workflow

While every workflow is different, these are some of the general processes that data professionals use to derive insights from data. The parts include:

Data Exploration
Data Mining
Data Wrangling
ETL (Extract, Transform, Load)
Web Scraping

Data Warehouse

a central repository where data from multiple sources is stored and organized so that it is available for analysis when needed

Median

the middle score in a distribution; half the scores are above it and half are below it
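
As a quick sketch (with invented income figures), the median shrugs off an outlier that drags the mean far away:

```python
# Median vs. mean on a set with one extreme value.
import statistics

incomes = [30, 35, 40, 45, 1000]   # hypothetical incomes; one extreme value
print(statistics.median(incomes))  # 40: the middle score
print(statistics.mean(incomes))    # 230: pulled up by the outlier
```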

Unsupervised Machine Learning

machine learning that works on unlabeled data; the algorithm searches for structure (such as clusters) on its own, without being trained on examples with known answers

Supervised Machine Learning

machine learning that requires humans to provide labeled input and desired output, as well as feedback about prediction accuracy, during the early stages of training the system

Neural Networks

computational models loosely inspired by networks of nerve cells in the brain; layers of interconnected nodes pass signals forward, transforming inputs into predictions or classifications

