BUSS6002
Benefits of MapReduce
- Allows users to quickly write and test code - Works in all environments - Major efficiency gains (parallelisation) - Flat scalability curve (reuse code)
Trade-offs in ML models
- Prediction accuracy vs interpretability - Good fit vs over- or underfit - Black box vs parsimony
Zero-one loss function algorithm
0 for correct, unit loss for misclassification
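A minimal Python sketch of the zero-one loss (the function name is illustrative, not from the unit):

```python
# Zero-one loss: 0 for a correct classification, unit loss (1)
# for a misclassification.
def zero_one_loss(y_true, y_pred):
    return [0 if t == p else 1 for t, p in zip(y_true, y_pred)]

print(zero_one_loss([1, 0, 1], [1, 1, 1]))  # [0, 1, 0]
```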
Threshold parameter for binary decision rule
0.5
Problem definition - market research
1. Describe the situation & identify options management needs to choose from 2. Define marketing research problems
4 Phases of MapReduce
1. Input splits - divide the data into fixed-size chunks 2. Mapping - splits are passed to the mapping function to produce intermediate outputs 3. Shuffling - consolidates the relevant records from the intermediate outputs 4. Reducing - outputs are aggregated to produce the final output
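A toy single-machine sketch of the four phases as a word count (real MapReduce distributes the work across a cluster; all names here are illustrative):

```python
from collections import defaultdict

docs = ["big data big models", "big data"]          # 1. input splits
mapped = [(w, 1) for d in docs for w in d.split()]  # 2. mapping -> (key, value) pairs

shuffled = defaultdict(list)                        # 3. shuffling: group values by key
for key, value in mapped:
    shuffled[key].append(value)

reduced = {k: sum(v) for k, v in shuffled.items()}  # 4. reducing: aggregate per key
print(reduced)  # {'big': 3, 'data': 2, 'models': 1}
```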
K-fold algorithm
1. Randomly split the data into K folds of about equal size 2. For each fold k, estimate the model on the other folds combined and use fold k as the validation set 3. The cross-validation error is the average error across the K validation sets
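A sketch of the algorithm, assuming a user-supplied `fit` callable that returns a model with a `predict` method, and a `loss` error function (both placeholders):

```python
import numpy as np

def kfold_cv_error(X, y, fit, loss, K=5, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), K)  # 1. random split into K folds
    errors = []
    for k in range(K):                                  # 2. hold out fold k
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        model = fit(X[train], y[train])
        errors.append(loss(y[val], model.predict(X[val])))
    return np.mean(errors)                              # 3. average across K folds
```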
MapReduce
A programming model that divides a dataset into smaller subsets (chunks) that are loaded into RAM and analyzed separately, with the results then combined. The entire dataset is still physically stored on the hard drive
Confusion matrix
Actual on y-axis, Predicted on x-axis
Bayes loss algorithm
Assign x to the class with the highest probability
sentiment analysis
Assigns positive/negative points to each word, then adds them up (e.g. VADER)
NLP
Automated handling of natural human language, text or speech
Distribution used in logistic regression
Bernoulli
Estimation error
Bias^2 + variance
Shuffling process
Bringing together intermediate outputs into reducers
Ridge Regression
Cannot zero out a specific coefficient; good when the coefficients have similar size
Iceberg Metaphor of market research
Decision maker: loss of sales, low traffic, unhappy customers Researcher: Marginal performance of sales, low-quality products, poor image, inappropriate system
Bias
Difference between the average estimate and the true value
Challenge of NLP
Disambiguating the complexity of human language, e.g. lexical, semantic, syntactic, pragmatic ambiguity
Divide & Conquer
Divide the data into smaller blocks, analyze each block separately (in parallel), combine the solutions from each block to form the final solution - combined solution = full-data solution for linear regression, not logistic (compute for each block, take the average) - K = sqrt(n)/log p
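A sketch for the linear case, where averaging the per-block OLS estimates approximates the full-data solution (block scheme and names are illustrative):

```python
import numpy as np

def divide_and_conquer_ols(X, y, K):
    blocks = np.array_split(np.arange(len(y)), K)  # divide into K blocks
    betas = [np.linalg.lstsq(X[b], y[b], rcond=None)[0]
             for b in blocks]                      # solve each block (parallelisable)
    return np.mean(betas, axis=0)                  # combine by averaging
```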
Counter overfitting
Drop features, regularization
Training data purpose
EDA, model building, estimation
Goal of subsampling
Each subset S could provide a good estimate; aggregate simply with an average, or update the solution iteratively
Variance
Expected squared deviation of predictions around their mean
Handling wide and tall Big Data
Exploit sparsity (variable selection, regularisation) - subsample - divide & conquer - MapReduce
Bag-of-words
Extracts features from text using a vocabulary of known words
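A minimal sketch of building a bag-of-words representation by hand (toy corpus; in practice a library vectoriser would be used):

```python
docs = ["the cat sat", "the cat ran"]
vocab = sorted({w for d in docs for w in d.split()})          # vocabulary of known words
counts = [[d.split().count(w) for w in vocab] for d in docs]  # word counts per document
print(vocab)   # ['cat', 'ran', 'sat', 'the']
print(counts)  # [[1, 0, 1, 1], [1, 1, 0, 1]]
```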
MLE
Find the theta that maximizes the likelihood p(D|theta)
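A toy illustration with Bernoulli data: a grid search over theta shows the likelihood p(D|theta) is maximised at the sample mean.

```python
import numpy as np

data = np.array([1, 0, 1, 1, 0, 1])
thetas = np.linspace(0.01, 0.99, 99)
# Bernoulli log-likelihood: sum(y) log(theta) + sum(1 - y) log(1 - theta)
loglik = data.sum() * np.log(thetas) + (len(data) - data.sum()) * np.log(1 - thetas)
print(thetas[np.argmax(loglik)], data.mean())  # both ~0.67
```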
TF-IDF
Term frequency (n-gram count within a document) weighted by how common the n-gram is across the corpus; translates counts into how important an n-gram is
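A sketch of one common TF-IDF variant (tf × log(N / document frequency)); the exact weighting used in the unit may differ:

```python
import math

def tfidf(term, doc, corpus):
    tf = doc.split().count(term)                 # frequency within the document
    df = sum(term in d.split() for d in corpus)  # how many documents contain it
    return tf * math.log(len(corpus) / df) if df else 0.0

corpus = ["big data", "big models", "big data tools"]
print(tfidf("data", corpus[0], corpus))  # rarer term -> positive weight
print(tfidf("big", corpus[0], corpus))   # appears in every doc -> weight 0
```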
Precision-Recall curve
Good for imbalanced classes; average precision approximates the area under the curve
F1
Harmonic average of precision and recall: 2 × precision × recall / (precision + recall)
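Computing F1 directly from confusion-matrix counts (illustrative numbers):

```python
def f1_score(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(f1_score(tp=8, fp=2, fn=4))  # precision 0.8, recall ~0.67 -> F1 ~0.73
```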
Rule-based biggest weaknesses
Increasingly convoluted logic; language changes over time (rules must adapt); enormous amounts of data; cannot harness all the information
Alpha (gradient descent)
Learning rate
RAM
Data is loaded when needed for a computational task and deleted when the computer is turned off. Limited space (typically 8-16 GB); compare with the brain's working memory
Higher complexity consequences
Lower bias, higher variance -> overfitting
Lower complexity consequences
Lower variance, higher bias -> underfitting
Goal of NLP
Make language accessible to computers. Unstructured data -> structured data
Expected loss goal
Minimize E[L(Y, Ypred(x))]
Test dataset purpose
Model evaluation
Cross validation
Multiple random training/validation set splits; use when the training data is not large enough
Gradient Descent
First-order iterative optimization algorithm - steps in the negative gradient direction
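A minimal sketch of gradient descent with learning rate alpha, minimising f(x) = (x - 3)^2 (function and starting point are illustrative):

```python
def gradient_descent(grad, x0, alpha=0.1, iters=100):
    x = x0
    for _ in range(iters):
        x = x - alpha * grad(x)  # step in the negative gradient direction
    return x

# f(x) = (x - 3)^2 has gradient 2(x - 3), so the minimum is at x = 3.
print(gradient_descent(grad=lambda x: 2 * (x - 3), x0=0.0))  # ~3.0
```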
Spark
Newer than Hadoop MapReduce; better suited to iterative methods
Parts of Speech
Nouns, verbs, adverbs, different uses and meanings
Hadoop
Open-source software that implements MapReduce. Distributed, scalable, fault tolerant. Works with BI, ETL, DB, OS/Cloud, HW; allows use of the CPU and RAM of individual computers in a cluster
Impala
Parallel computing on Hadoop with SQL; uses Dremel
Regex
Represents highly complex deterministic queries; difficult to learn/interpret
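A small illustration of a regex as a powerful but hard-to-read deterministic query (a simplified email pattern, not a complete one):

```python
import re

pattern = re.compile(r"[\w.]+@[\w-]+\.[a-z]{2,}")  # crude email-like matcher
print(pattern.findall("Contact ana@uni.edu or bob@mail-host.com."))
# ['ana@uni.edu', 'bob@mail-host.com']
```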
Word embedding
Represents words as vectors in a multidimensional space; similar words should be close together
Hive
SQL-based querying on Hadoop
Subsampling
Select a random subset S, compute the solution; repeat R times; average the R solutions to form the final solution - the solution approaches the full-data solution as R increases; no information is lost
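A sketch of subsampling with the sample mean standing in for an arbitrary estimator (subset size s and repeats R are illustrative):

```python
import numpy as np

def subsample_estimate(x, s, R, seed=0):
    rng = np.random.default_rng(seed)
    ests = [x[rng.choice(len(x), size=s, replace=False)].mean()  # solve on subset S
            for _ in range(R)]                                   # repeat R times
    return np.mean(ests)                                         # average the R solutions

x = np.random.default_rng(1).normal(loc=5, size=10_000)
print(subsample_estimate(x, s=500, R=20))  # close to the full-data mean of ~5
```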
N-gram
Sequence of n tokens; keeps some structure and "meaning" of the text
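A one-liner sketch for extracting bigrams (n = 2) from a token list:

```python
tokens = "the cat sat on the mat".split()
bigrams = list(zip(tokens, tokens[1:]))  # consecutive pairs of tokens
print(bigrams)  # [('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]
```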
Lasso
Shrinks coefficients towards zero and can zero them out (variable selection); better interpretability; good when a few predictors have large coefficients
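A sketch contrasting lasso and ridge with scikit-learn (synthetic data; the penalty strength alpha=0.5 is arbitrary):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] + rng.normal(size=100)   # only the first predictor matters

print(Lasso(alpha=0.5).fit(X, y).coef_)  # irrelevant coefficients zeroed out
print(Ridge(alpha=0.5).fit(X, y).coef_)  # shrunk, but all non-zero
```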
How to reduce complexity in NLP
Stemming, lemmatization, stop words (filter out words with little meaning), TF-IDF
Partitions
Subsets
Reducers
Tasks that process the (intermediate) output from the mappers
Mappers
Tasks that process a chunk in isolation
Applications of NLP
Text as input to supervised learning, chatbots, sentiment analysis
Specificity
True negative rate (true negatives/actual negatives)
Recall
True positive rate (true positives/actual positives)
Receiver Operating Characteristic (ROC)
Plots the true positive rate against the false positive rate over a range of thresholds tau; area under the curve (AUC), max = 1
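A sketch of tracing a ROC curve by sweeping the threshold tau over predicted probabilities (toy inputs):

```python
import numpy as np

def roc_points(y_true, p_hat, taus):
    y_true, p_hat = np.asarray(y_true), np.asarray(p_hat)
    points = []
    for tau in taus:
        y_pred = (p_hat >= tau).astype(int)
        tpr = ((y_pred == 1) & (y_true == 1)).sum() / (y_true == 1).sum()  # recall
        fpr = ((y_pred == 1) & (y_true == 0)).sum() / (y_true == 0).sum()
        points.append((fpr, tpr))
    return points

print(roc_points([0, 0, 1, 1], [0.1, 0.6, 0.4, 0.9], taus=[0.25, 0.5, 0.75]))
# [(0.5, 1.0), (0.5, 0.5), (0.0, 0.5)]
```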
Precision
True positives / Positive classification
Latent Dirichlet Allocation
Unsupervised ML; define k = the number of topics (a document can contain multiple topics)
Overfitting
Estimates change a lot (high variance) but are less biased; the model overreacts to noise and has poor predictive performance on unseen data
Underfitting
Estimates don't change much (low variance) but are systematically off (high bias)
Classification elements
Y in {1, ..., K}: categorical response. X1, ..., Xp: input variables
Validation dataset purpose
Appropriate model selection
Elastic Net
Combines the ridge and lasso penalties; handles correlated predictors well (like ridge) while still performing variable selection (like lasso)
Phrase structure rules
e.g. a noun followed by a verb
Gradient Ascent
First-order iterative optimization algorithm - steps in the positive gradient direction
Challenge with N-grams
High dimensionality; highly unbalanced frequencies
Parse tree
Identify parts of speech that form phrase structures, which in turn form sentences. Use a later step to resolve the ambiguity of an earlier step. Good enough for simple text and interactions
Tokens
An instance of a sequence of characters, treated as a unit
Structure of data elements in MapReduce
Key-value pairs; all values with the same key are passed to a single reducer
Problem with tall Big Data
Too many samples; simple (linear) models don't suffice
Problem with wide Big Data
Too many variables -> overfitting; need to remove variables or regularize
Goal of Divide & conquer
Work out analytically how the parameter estimates from each split should be combined to reproduce the estimate you would have obtained from the full data