305 quiz

Bubble Plot

-Adds more variables to a scatter plot. Size can show another continuous variable, and color can show yet another variable. -Trivariate (size is the additional variable) -Quadrivariate (color adds a fourth variable)

Histogram

-Encodes data using the height/length of bars and shows a distribution. -A histogram is a Univariate chart, as it shows the distribution of one variable. A histogram is used to display Continuous (Numerical) data.
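
A minimal sketch of the binning behind a histogram, using only the standard library. The function name, the sample ages, and the bin width of 10 are illustrative assumptions, not part of the course material.

```python
def histogram_counts(values, bin_width, start):
    """Count how many values fall in each bin [left, left + bin_width)."""
    counts = {}
    for v in values:
        k = int((v - start) // bin_width)   # index of the bin holding v
        left = start + k * bin_width        # left edge of that bin
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))


ages = [21, 22, 25, 31, 34, 35, 38, 41, 47, 52]  # hypothetical continuous data
bins = histogram_counts(ages, bin_width=10, start=20)

# Text rendering: bar length encodes the count in each bin.
for left, n in bins.items():
    print(f"{left}-{left + 10}: {'#' * n}")
```

Each bar covers a range of a single continuous variable, which is why the chart is univariate.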

Bar Chart

-Bar Chart is a Univariate chart, as it shows the distribution of one variable. Bar Chart is used to display Categorical data (nominal or ordinal). A bar chart is great for showing precise quantitative comparisons encoding data with the height or length of the bar from a common baseline. -encodes data using height/length of bar and shows categorical comparisons

What is data mining

-Business analytics is often referred to as Data Mining. Data mining is a set of statistical and machine learning methods that inform decision making, often in an automated fashion. -To the broad public, Data Mining may mean "digging through vast stores of data in search of something interesting". Other terms used instead of Data Mining are: - Predictive Analytics - Predictive Modeling - Machine Learning - Knowledge Discovery in Databases (KDD) -Data Mining is statistics at scale and speed

Ubiquity of Data Opportunities: What contributed to the opportunities to perform data analytics?

-Computers became VERY powerful -Data networking is VERY fast -Computer storage (memory) is VERY cheap -Algorithms were developed to connect and process multiple datasets Overall, the convergence of these factors resulted in a lot of applications to business scenarios

Small Multiples

-Invented by Edward Tufte: "Illustrations of postage-stamp size are indexed by category or a label, sequenced over time like the frames of a movie, or ordered by a quantitative variable not used in the single image itself."

Symbol Map (Dot Map)

-encodes data using position to show data geographically and can also use size to show quantitative data. -Dot map is Trivariate: it can use dot size to represent a numerical variable (such as sales) and dot color to represent another numerical variable (such as profit), by a geographical variable (state).

Dot Plot

-encodes data using position to show the comparisons. -Dot plot is a Univariate plot for Continuous data.

TreeMap

-encodes data using size and color and is useful for hierarchical data or when there are a very large number of categories to compare. -Treemap is Multivariate. It can use large segments to represent, say, sales in geographical regions. Size of segment is proportional to region's sales. Segments can be split into sub-segments (such as countries), with size proportional to sales. Color can be used to represent third variable (such as profit).

Packed Bubbles Chart

-encodes data using the size of a circle to show comparisons, which makes precise quantitative comparisons difficult. -A packed bubble chart is typically Univariate, as it uses the size of the bubble to represent a numerical value of a variable. Color can be used to represent a second variable. -This type of chart is also not recommended.

Word cloud

-encodes data using the size of a word to show comparisons, which makes precise quantitative comparisons difficult. -This type of chart is also not recommended.

Pie charts contd.

General rules:
1. Don't use pie charts.
2. If you must break Rule #1, then:
-Make sure it adds up to 100%
-Use only a few categories
-Start at noon and move clockwise
-Order slices from largest to smallest values
-Add labels for %
-Avoid 3D
-Keep it simple
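
Some of these rules can be checked mechanically before a pie chart is drawn. The sketch below is one possible check. The function name and the limit of five slices are assumptions, since the rules only say "a few categories".

```python
def prepare_pie_slices(shares, max_slices=5):
    """Validate percentage shares and return them ordered largest to smallest.

    max_slices is an assumed cutoff for "only a few categories".
    """
    total = sum(shares.values())
    if abs(total - 100) > 1e-9:
        raise ValueError(f"shares add up to {total}, not 100%")
    if len(shares) > max_slices:
        raise ValueError("too many categories for a pie chart")
    # Largest slice first, so it can start at noon and run clockwise.
    return sorted(shares.items(), key=lambda kv: kv[1], reverse=True)


slices = prepare_pie_slices({"East": 20, "West": 50, "South": 30})
for label, pct in slices:
    print(f"{label}: {pct}%")  # labels carry the percentage
```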

Contingency tables and Mosaic Plot

-A contingency table represents two categorical variables (therefore, it is Bivariate). -A mosaic plot is a chart that graphically represents the content of the contingency table.
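
As a small illustration, a contingency table can be built by counting pairs of category values. The survey records and the function name below are made up for the example.

```python
from collections import Counter


def contingency_table(pairs):
    """Return {row_value: {column_value: count}} for two categorical variables."""
    counts = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    return {r: {c: counts[(r, c)] for c in cols} for r in rows}


# Hypothetical records: (region, purchased?)
records = [
    ("East", "Yes"), ("East", "No"), ("East", "Yes"),
    ("West", "No"), ("West", "No"), ("West", "Yes"),
]
table = contingency_table(records)
for region, row in table.items():
    print(region, row)
```

A mosaic plot would draw one rectangle per cell of this table, with area proportional to the count.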

Heat Map

-Encodes a data table using color to highlight the differences in the table, without showing the numbers. -A heat map is close to a highlight table: it encodes numerical data with color, but the cells do not display the values.

Gantt Chart

-Encodes data using length and position to show the amount of work completed in segments of time. -A Gantt chart shows the sequence of activities: when one activity ends and another begins.

Sparkline/SparkBar

-Encodes data using position (line) or height/length (bar) in a small, word-sized graphic. -Invented by Edward Tufte. -Small, high-resolution graphics embedded in a context of words, numbers, and images. Sparklines are data-intense, design-simple, word-sized graphics. -Sparklines can be used in tweets.
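
Because a sparkline is word-sized, it can even be approximated with text characters. The sketch below maps values onto Unicode block characters. This is a common text trick rather than Tufte's own method, and the function name is an assumption.

```python
BLOCKS = "▁▂▃▄▅▆▇█"  # eight bar heights, lowest to highest


def sparkline(values):
    """Render a sequence of numbers as a one-line text SparkBar."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return BLOCKS[0] * len(values)  # flat series: all bars at the baseline
    top = len(BLOCKS) - 1
    # Scale each value to an index 0..7; height encodes the value.
    return "".join(BLOCKS[int((v - lo) * top / span)] for v in values)


print("Weekly sales", sparkline([3, 5, 4, 8, 6, 9, 7]), "last week: 7")
```

The result is plain text, so it fits inside a sentence, a table cell, or a tweet.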

Slopegraph

encodes data using position to show quantitative comparison or rank, typically between two time periods.

Scatter Plot

encodes data using position to show the relationship between two variables

4V's of big data: structured vs. unstructured data contd.

-Most traditional data analysis performed in companies involves structured data -It mostly involves data stored in databases -80-90% of future data growth comes from unstructured data types -Example: at CNU, IT issues are reported through a "ticket" system. A ticket includes structured fields (name of the person who created the ticket, email, phone, room number) and unstructured text (the description of the IT/software issue)

What are the types of data

-Numerical -Nominal -Ordinal

4V's of big data: Veracity

-Refers to the quality and trustworthiness of data: lack of bias, noise, and abnormalities -Garbage in, garbage out (e.g. non-response bias, emails labeled as "spam") -Traditional statistics often collects data in a controlled way: for example, a matched pairs experiment -Data mining deals with data that accumulates through an "organic" process rather than being pre-cooked, biased, or collected in a controlled way

What are the methods of business analytics

-Supervised Learning: Discover patterns in the data that relate data attributes (variables) with a target attribute. These patterns are then utilized to predict the values of the target attribute in future data instances. -Unsupervised Learning : The data have no target attribute. We want to explore the data to find some intrinsic structures or similarity in them.

What is business analytics

-The study of data through statistical and operations analysis, the formation of predictive models, application of optimization techniques and the communication of these results to customers, business partners and colleague executives -"the extensive use of data, statistical and quantitative analysis, explanatory and predictive models, and fact-based management to drive decisions and actions."

4V's of big data: Velocity

-Velocity refers to the increasing speed at which data is created and the speed at which it can be processed, stored, and analyzed.
-More data means more complex relationships.
-Business process: database data (1x)
-Human: enterprise content and external sources (10x): web logs, email, text documents, social media
-Machine: complex data (100x): satellite imaging, sensors, videos, recordings, M2M log files, bioinformatics

Highlight Table

-encodes a data table using color to highlight the differences in the table numbers. -Highlight Table is Multivariate. It can demonstrate a numerical variable (e.g. sales) in table cells, by columns (e.g. geographical regions), each split into rows (e.g. product categories). Color is simply used to highlight high/medium/low values in the table.
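
A rough sketch of how cell values might be assigned to high/medium/low highlight bands. Splitting the value range into three equal parts is an assumption made for illustration. Charting tools may use other cutoffs.

```python
def highlight_levels(values):
    """Label each value low/medium/high by equal thirds of the value range."""
    lo, hi = min(values), max(values)
    third = (hi - lo) / 3

    def level(v):
        if v >= lo + 2 * third:
            return "high"
        if v >= lo + third:
            return "medium"
        return "low"

    return [level(v) for v in values]


sales = [10, 40, 70, 100]  # hypothetical cell values
for value, band in zip(sales, highlight_levels(sales)):
    print(value, band)  # the number stays visible; the band picks the color
```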

Pie Chart

-encodes data using angle, area, and arc to show a part-to-whole comparison, but it is problematic for many reasons -Pie Chart is a Univariate chart. Pie Chart can be used to display Categorical data (nominal or ordinal) or Continuous data. -This type of chart is usually not recommended.

Donut Chart

-encodes data using arc and area to show a part-to-whole comparison, but it is problematic for many reasons. -Caution: this chart type is not recommended.

Concentric circle

-Concentric circles encode data using arc and area to show comparisons, but they are problematic for many reasons. It is difficult to make precise quantitative comparisons using arc and area, which can distort the comparison of the data. -This chart type is also not recommended.

Choropleth Map (Shaded Map)

-encodes data using color and position to show data geographically -Shaded (filled) map is Bivariate: it uses color to represent a numerical variable (such as sales) by geographical variable (state). Can be done on a larger scale (e.g. country) or a smaller scale (e.g. zip code)

Waterfall Chart

-encodes data using height and often color to show increase and decrease between time periods or categories.
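
The bars of a waterfall chart float: each one starts where the previous one ended. A minimal sketch of that running-total calculation, with made-up quarterly figures and an assumed function name:

```python
def waterfall_steps(start, changes):
    """Return (label, bar_start, bar_end) for each change, plus the final total."""
    steps = []
    running = start
    for label, delta in changes:
        steps.append((label, running, running + delta))
        running += delta
    return steps, running


# Hypothetical profit changes by quarter, starting from 100.
steps, final = waterfall_steps(100, [("Q1", 20), ("Q2", -30), ("Q3", 15)])
for label, bar_start, bar_end in steps:
    direction = "increase" if bar_end >= bar_start else "decrease"  # often shown by color
    print(f"{label}: {bar_start} -> {bar_end} ({direction})")
print("Final:", final)
```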

Stacked Bar Chart

-encodes data using height or length of bar and color by segment, and shows categorical and part-to-whole comparisons. -Stacked Bar Chart is a Multivariate chart, as it can show two or more variables simultaneously. For example, the "axis" variable can be time (years), bars can represent sales, and segments can represent product categories. -Caution: be careful not to slice stacked charts into too many segments.

Lollipop Chart

-encodes data using height or length of bar and shows categorical comparisons -A lollipop chart is a variation of a bar chart, using height or length from a common baseline to allow for a precise quantitative comparison.

Side by Side Bar

-encodes data using height/length of bar and uses color to show categorical comparisons. -Side-by-Side Bar Chart is Multivariate. For example, a group of columns may represent a region, different colors may represent product categories, and bar height may represent sales.

Diverging bar

-encodes data using height/length of bar diverging from a midpoint to show categorical comparisons. -Diverging Chart is a Multivariate chart, as it can show two or more variables simultaneously. For example, the "axis" variable can be time (years), or categories, top bars can represent profits, and bottom bars can represent sales.

Bullet Graph

-encodes data using length/height, position, and color to show actual compared to target and performance bands. -Bullet graphs are an excellent way to show an actual value compared to a target value. They use height or length from a common baseline for the actual values, and position to make the comparison to the target line. Color can be used to show performance bands, so the actual value can be shown in the context of the desired performance levels. -Shows just actual versus target, which makes it easier to understand -Invented by Stephen Few

Box Plot aka Box and whisker Plot

-encodes data using position and height/length to show the distribution of the data -Box-and-Whiskers is a Univariate plot. In a Box-and-Whiskers plot, the "box" shows the first, second (median), and third quartiles. The length of the whiskers represents how widely the data are spread. Box plot is used only for Continuous data.
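
The numbers behind a box plot can be computed with Python's `statistics` module. This sketch assumes whiskers that run to the minimum and maximum. Many charting tools instead stop the whiskers at 1.5 times the interquartile range and plot points beyond that as outliers. The function name and data are illustrative.

```python
import statistics


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) for a list of continuous values."""
    # method="inclusive" treats the data as the full population of interest.
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, q2, q3, max(data)


values = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical measurements
low, q1, median, q3, high = five_number_summary(values)
print("whisker:", low, "to", q1)
print("box:    ", q1, "to", q3, "with median", median)
print("whisker:", q3, "to", high)
```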

Line Chart

-encodes data using position and often shows a trend over time. -It is best to keep the time series on the x-axis (not rotated onto the y-axis), with the oldest time period on the left and the newest on the right. Technically, a line chart is Bivariate, as it shows a numerical variable (y-axis) against a horizontal variable (time, for example).

What are the four different types of data analytics

1. Descriptive analytics
-Hindsight data
-Answers: what happened?
-Not very difficult, and low value
2. Diagnostic analytics
-Insight data
-Answers: why did it happen?
-Somewhat difficult and valuable
3. Predictive analytics
-Insight and foresight data
-Answers: what will happen?
-Valuable and difficult
4. Prescriptive analytics
-Foresight data
-Answers: how can we make it happen?
-Very valuable and difficult

What are the four V's of big data

1. Volume 2. Variety 3. Velocity 4. Veracity

4V's of big data: Variety

-Structured Data: data containing a defined data type, format, and structure (e.g. transaction data, spreadsheets)
-Unstructured Data: data that has no inherent structure, which may include text documents (tweets, Facebook posts, blog entries), sensor data, images, audio, video, log files

Supervised vs unsupervised learning and their examples

Supervised Learning
-Classification: learns a method for predicting the instance class from pre-labeled (classified) instances
-Regression: an attempt to predict a continuous attribute
Unsupervised Learning
-Clustering: finds natural groupings of instances given unlabeled data
-Association Rules: a method for discovering interesting relations between variables in large databases
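
To make the contrast concrete, the sketch below pairs one supervised task (regression by ordinary least squares, which needs a target attribute) with one unsupervised task (the confidence of an association rule, which needs no target). The data, function names, and the choice of these two particular techniques are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Supervised: least-squares line y = slope * x + intercept from labeled pairs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rule_confidence(baskets, antecedent, consequent):
    """Unsupervised: share of baskets containing antecedent that also contain consequent."""
    with_a = [b for b in baskets if antecedent in b]
    if not with_a:
        return 0.0
    return sum(1 for b in with_a if consequent in b) / len(with_a)


# Supervised: ad spend (x) with a known target, sales (y).
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print("predicted sales at x=5:", slope * 5 + intercept)

# Unsupervised: no target attribute, only co-occurrence in transactions.
baskets = [{"bread", "butter"}, {"bread", "butter", "milk"}, {"bread"}, {"milk"}]
print("confidence(bread -> butter):", rule_confidence(baskets, "bread", "butter"))
```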

4V's of big data : Volume and comparative scale of bytes

There are several byte units used to describe data storage. Examples:
-Byte (B): the basic unit of measurement
-Kilobyte (KB) = 1000 bytes. 30 KB: one page of text
-Megabyte (MB) = 1000 KB. 5 MB: a piece of music
-Gigabyte (GB) = 1000 MB. 1 GB: a two-hour film
-Terabyte (TB) = 1000 GB. 1 TB: 6 million books
-Petabyte (PB) = 1000 TB. 1 PB: a stack of DVDs as tall as a 55-story building
-Exabyte (EB) = 1000 PB. 5 EB: all the information generated up to 2003
-Zettabyte (ZB) = 1000 EB. 1.8 ZB: all the data recorded in 2011
-Yottabyte (YB) = 1000 ZB. 1 YB: storage capacity of the NSA datacenter
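
The scale above steps by a factor of 1000 at each unit. A short sketch of converting a raw byte count into the largest fitting unit follows. The function name is an assumption, and it uses the decimal (1000-based) steps from this card rather than 1024-based binary units.

```python
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def humanize_bytes(n):
    """Express a byte count in the largest unit that keeps the value below 1000."""
    value = float(n)
    for unit in UNITS:
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000  # move up one unit: 1000 of this unit make one of the next


print(humanize_bytes(30_000))         # one page of text
print(humanize_bytes(5 * 1000**2))    # a piece of music
print(humanize_bytes(1000**4))        # 6 million books
```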

