BUSS6002


Benefits of MapReduce

- Allows users to quickly write and test code - Works in all environments - Major efficiency gains (parallelisation) - Flat scalability curve (reuse code)

Trade-offs ML models

- Prediction accuracy vs interpretability - Good fit vs over- or underfit - Black box vs parsimony

Zero-one loss function algorithm

0 for correct, unit loss for misclassification

Threshold parameter for binary decision rule

0.5

Problem definition - market research

1. Describe the situation & identify the options management needs to choose from 2. Define the marketing research problem

4 Phases of MapReduce

1. Input splits - divide the input into fixed-size chunks 2. Mapping - splits are passed to a mapping function to produce intermediate outputs 3. Shuffling - consolidates relevant records from the intermediate outputs 4. Reducing - outputs are aggregated to produce the final output
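
The four phases above can be sketched as a single-process word count (an illustrative stand-in for a real distributed framework; the chunk data and functions are hypothetical):

```python
from collections import defaultdict

def mapper(chunk):
    # Mapping: emit an intermediate (key, value) pair per word
    return [(word, 1) for word in chunk.split()]

def reducer(key, values):
    # Reducing: aggregate all values for one key
    return key, sum(values)

# 1. Input splits: divide the input into fixed-size chunks
chunks = ["big data big", "data tools big"]

# 2. Mapping
intermediate = [pair for chunk in chunks for pair in mapper(chunk)]

# 3. Shuffling: group intermediate values by key
groups = defaultdict(list)
for key, value in intermediate:
    groups[key].append(value)

# 4. Reducing
result = dict(reducer(k, v) for k, v in groups.items())
print(result)  # {'big': 3, 'data': 2, 'tools': 1}
```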

K-fold algorithm

1. Randomly split the data into K folds of about equal size 2. For each fold k, estimate the model on the other folds combined and use fold k as the validation set 3. The cross-validation error is the average error across the K validation sets
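
The three steps above can be sketched with indices only (a minimal illustration; the toy error function is a placeholder, not a real model):

```python
import random

def k_fold_indices(n, k, seed=0):
    idx = list(range(n))
    random.Random(seed).shuffle(idx)      # 1. random split
    return [idx[i::k] for i in range(k)]  # K folds of about equal size

def cross_validate(n, k, error_fn):
    folds = k_fold_indices(n, k)
    errors = []
    for i, val in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        errors.append(error_fn(train, val))  # 2. fit on K-1 folds, validate on fold i
    return sum(errors) / k                   # 3. average validation error

# Usage with a placeholder error function:
cv_error = cross_validate(n=10, k=5, error_fn=lambda tr, va: len(va) / 10)
```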

Map Reduce

A programming model that divides a data set into smaller subsets (chunks) that are loaded into RAM and analysed separately, with results then combined. The entire dataset remains physically stored on the hard drive.

Confusion matrix

Actual on y-axis, Predicted on x-axis

Bayes loss algorithm

Assign x to the class with the highest probability

sentiment analysis

Assigns positive/negative points to each word, then adds them up (e.g. VADER)

NLP

Automated handling of natural human language, text or speech

Distribution used in logistic regression

Bernoulli

Estimation error

Bias^2 + variance

Shuffling process

Bringing together intermediate outputs into reducers

Ridge Regression

Cannot zero out a specific coefficient; good when coefficients have similar size

Iceberg Metaphor of market research

Decision maker: loss of sales, low traffic, unhappy customers. Researcher: marginal performance of sales, low-quality products, poor image, inappropriate system

Bias

Difference between the average estimate and the true value

Challenge of NLP

Disambiguating the complexity of human language, e.g. lexical, semantic, syntactic, pragmatic

Divide & Conquer

Divide data into smaller blocks, analyse each block (in parallel), combine the solutions from each block to form the final solution - solution = full-data solution for linear models, not logistic (compute for each block, take the average) - K = sqrt(n)/log p
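
For a linear statistic such as the mean, the block solutions combine exactly; a minimal sketch with toy data (the split and weighting scheme here are illustrative):

```python
# Divide & conquer for the mean: split into K blocks, solve per block, combine.
data = list(range(1, 101))           # toy data, full-data mean = 50.5
K = 4
blocks = [data[i::K] for i in range(K)]

block_means = [sum(b) / len(b) for b in blocks]
block_sizes = [len(b) for b in blocks]

# Weighted combination recovers the full-data solution exactly (linear case)
combined = sum(m * s for m, s in zip(block_means, block_sizes)) / sum(block_sizes)
print(combined)  # 50.5
```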

Counter overfitting

Drop features, regularization

Training data purpose

EDA, model building, estimation

Goal of subsampling

Each subset S could provide a good estimate; aggregate simply with an average, or update the solution iteratively

Variance

Expected squared deviation of predictions around their mean

Problem with wide and tall Big Data

Exploit sparsity (variable selection, regularisation) - subsample - divide & conquer - MapReduce

Bag-of-words

Extract features from text, vocabulary of known words
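
A minimal bag-of-words sketch (toy documents; a real pipeline would also lowercase, strip punctuation, etc.):

```python
# Bag-of-words: build a vocabulary of known words, then count per document.
docs = ["the cat sat", "the dog sat down"]
vocab = sorted({w for d in docs for w in d.split()})

def bow_vector(doc, vocab):
    words = doc.split()
    return [words.count(term) for term in vocab]

print(vocab)                      # ['cat', 'dog', 'down', 'sat', 'the']
print(bow_vector(docs[0], vocab))  # [1, 0, 0, 1, 1]
```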

MLE

Find the theta that maximizes p(D|theta)

TF-IDF

Frequency of an n-gram within a document, scaled down by how common the n-gram is across the corpus; translates into how important an n-gram is to a document
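
One common TF-IDF variant, sketched on toy documents (exact weighting differs between libraries):

```python
import math

docs = [["cat", "sat", "mat"], ["cat", "cat", "dog"]]

def tf_idf(term, doc, docs):
    tf = doc.count(term) / len(doc)    # term frequency within the document
    df = sum(term in d for d in docs)  # number of documents containing the term
    idf = math.log(len(docs) / df)     # rarer across the corpus -> larger weight
    return tf * idf

# "cat" appears in every document, so its weight is zero; "dog" is distinctive.
print(tf_idf("cat", docs[1], docs))  # 0.0
print(tf_idf("dog", docs[1], docs))
```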

Precision-Recall curve

Good for imbalanced classes; average precision approximates the area under the curve

F1

Harmonic average of precision and recall: 2 * precision * recall / (precision + recall)
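
The formula in code, with the parentheses in the denominator made explicit:

```python
# F1 as the harmonic mean of precision and recall.
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(f1(0.5, 1.0))  # 2*0.5*1.0 / 1.5 = 0.666...
```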

Rule-based biggest weaknesses

Increasingly convoluted logic, discovery of language change over time (must adapt), enormous amounts of data, cannot harness all info

Alpha (gradient descent)

Learning rate

RAM memory

Loaded when a computational task runs; wiped when power is turned off. Limited space (typically 8-16 GB); compare with the brain

Higher complexity consequences

Lower bias, higher variance -> overfitting

Lower complexity consequences

Lower variance, higher bias -> underfitting

Goal of NLP

Make language accessible for computers. Unstructured data -> Structured data

Expected loss goal

Minimize E[L(Y, Ypred(x))]

Test dataset purpose

Model evaluation

Cross validation

Multiple random training/validation set splits. Use when training data is not large enough

Gradient Descent

First-order iterative optimization algorithm - negative direction

Spark

Newer; better for iterative methods

Parts of Speech

Nouns, verbs, adverbs, different uses and meanings

Hadoop

Open-source software that implements MapReduce. Distributed, scalable, fault tolerant. Works with BI, ETL, DB, OS/Cloud, HW. Allows use of the CPU and RAM of individual computers on a cluster

Impala

Parallel computing on Hadoop, SQL, uses Dremel

Regex

Represent highly complex deterministic queries Difficult to learn/interpret
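
A small sketch of such a deterministic query (the email-like pattern is illustrative only; real email validation is far more involved):

```python
import re

# A regex matching simple email-like strings: word chars, "@", domain, TLD.
pattern = re.compile(r"[\w.]+@[\w]+\.[a-z]{2,}")

print(bool(pattern.fullmatch("ada@example.com")))  # True
print(bool(pattern.fullmatch("not an email")))     # False
```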

Word embedding

Represents word as vectors in multidimensional space. Similar words should be close

Hive

SQL based

Subsampling

Select random subsets S and compute a solution; repeat R times; average the R solutions to form the final solution - solution -> full-data solution as R increases, no info lost
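
A toy sketch of the repeat-and-average idea, estimating a mean from R random subsets (the data and subset sizes are assumptions for illustration):

```python
import random

rng = random.Random(0)
data = [float(i) for i in range(1000)]  # true mean 499.5

def subsample_estimate(data, size, repeats, rng):
    # Each subset gives one estimate; average the R estimates
    estimates = [sum(rng.sample(data, size)) / size for _ in range(repeats)]
    return sum(estimates) / repeats

estimate = subsample_estimate(data, size=50, repeats=200, rng=rng)
# approaches the full-data solution (499.5) as the number of repeats R grows
```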

N-gram

Sequence of n tokens. Keep some structure and "meaning" of text
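
Extracting n-grams is a sliding window over the token sequence; a minimal sketch on a toy sentence:

```python
# Slide a window of n tokens across the sequence.
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()
print(ngrams(tokens, 2))  # bigrams keep some local word order
```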

Lasso

Shrinks coefficients towards zero; better interpretability. Good when a few predictors have large coefficients; performs variable selection

How to reduce complexity NLP

Stemming, lemmatization, stop-words (filter out words with little meaning), TF-IDF

Partitions

Subsets

Reducers

Tasks that process the (intermediate) outputs from mappers

Mappers

Tasks that process chunks in isolation

Applications of NLP

Text as input to supervised learning, chatbots, sentiment analysis

Specificity

True negative rate (true negatives/actual negatives)

Recall

True positive rate (true positives/actual positives)

Receiver Operating characteristics (ROC)

Plots the true positive rate against the false positive rate over a range of thresholds tau. Area under the curve (AUC), max = 1

Precision

True positives / Positive classification
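
The three rates above (recall, specificity, precision) from confusion-matrix counts; the counts here are made-up example values:

```python
# Binary-classifier confusion-matrix counts (hypothetical example).
tp, fp, tn, fn = 40, 10, 45, 5

recall = tp / (tp + fn)       # true positive rate: true positives / actual positives
specificity = tn / (tn + fp)  # true negative rate: true negatives / actual negatives
precision = tp / (tp + fp)    # true positives / positive classifications

print(recall, specificity, precision)
```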

Latent Dirichlet Allocation

Unsupervised ML; define k = number of topics (a document can contain multiple topics)

Overfitting

Values change a lot but are less off; overreacts; poor predictive performance on unseen data

Underfitting

Values don't change much but are systematically off

Classification elements

Y in {1,...,K} categorical; X1,...,Xp input variables

Validation dataset purpose

appropriate model selection

Elastic Net

Combines the ridge + lasso penalties; performs well where ridge regression performs well

Phrase structure rules

eg noun followed by verb

Gradient Ascent

First-order iterative optimization algorithm - positive direction

challenge with N-grams

high dimensionality, highly unbalanced

Parse tree

Identify parts of speech that form phrase structures, which form sentences; later steps help resolve ambiguity from earlier steps. Good enough for simple text and interactions

Tokens

instance of a sequence of characters

Structure of data elements in MapReduce

Key-value pairs; all values with the same key are passed to a single reducer

Problem with tall Big Data

too many samples, simple (linear) models don't suffice

Problem with wide Big Data

Too many variables leads to overfitting; need to remove variables or regularize

Goal of Divide & conquer

Work out analytically how the parameter estimates from each split should be combined to reproduce the estimate you would have obtained from the full data

