PLS, Validation, Outliers, Comparing PCA and PLS
From PCA to PLS - what is the difference?
In PLS the loadings are adjusted (they change relative to PCA) and the associated scores change with them. PLS uses the same concept as PCA, except that PLS is built for prediction.
What are the scores in PLS and are we interested in the calibration scores?
The scores are sample specific and we are not interested in the calibration scores
How does PLS work (how many scales)? And what is the difference between PLS1 and PLS2?
There is an X scale and a Y scale (X-score plotted vs. Y-score). PLS1 is based on one variable in Y; PLS2 has two or more variables in Y.
What is the goal of PLS?
To find the loadings that best describe the Y data, given a data matrix (X) and a response vector (y).
There are two types of outliers - which, and how do you detect them (PCA)?
Two types: extreme, but IN the model; and different, OUTSIDE the model. How to detect: scores and residuals. - Extreme outlier --> extreme (high) score. - Weird/different sample --> high residual.
What is the name of the total error, talking about errors in regression models?
Total error: Root-mean-squared-error (RMSE)
Which two errors make up the total error, and what is the difference between the two?
Total error: systematic error (bias) + random error (standard error of prediction, SEP). Systematic errors: the errors form a pattern but are not on the line. - Fit for companies. Random errors: the errors form no pattern (random); they are spread around the line. - Fit for academic.
What are the most common mistakes in chemometrics?
Trusting poorly validated results too much, and removing non-outliers. Predictions can never be better than the reference values!
In PLS there is an inner relation, U=T - what does this mean?
U = T: the Y scores (U) are related to the X scores (T); it expresses the correlation between the X and Y scores.
In validation the focus is on the predictions of Y, not X. When looking at RMSECV/RMSEP (depending on whether CV or a test set is used), is a component predictive or not if it yields a lower RMSECV/RMSEP?
If a component yields a lower RMSECV/RMSEP, it is predictive.
What does validation include in PCA/PLS models (two things)?
- Selecting the number of components. - Checking for problematic samples (outliers).
What is the general formula for regression?
y = a + x*b, where b = slope and a = offset (intercept).
How do you check the linearity in a PLS plot? And which parameter do you use to assess it?
Plot T (X-score) vs. U (Y-score). Parameter: the coefficient of determination (explained variance), R^2.
What is the limiting factor for the columns? For the rows? (In multiple linear regression, MLR.)
Columns: the limiting factor is the number of variables. Rows: the limiting factor is the number of samples.
What statistical parameter is used to validate; validation in statistics?
Confidence interval
Which assessment factor is used when talking about error in regression models?
Squared correlation coefficient: r^2
What is the goal of PLS and what is the correlation between X and Y?
The covariance of T and U should be as high as possible. The correlation should be high and the numbers should be big: big things in X that are correlated with big things in Y.
The PLS model consists of three models - which?
The model of X, the model of Y, and the inner relation between their scores (U = T). Each model can be used for outlier detection. The models of X and Y are examined exactly as in PCA (influence, scores, residuals, Hotelling's T^2).
Which two kinds of empirical validation do you distinguish between in chemometrics? What is the difference between the two?
Empirical 1: knowledge-based validation; what do you know about the system beforehand? Empirical 2: fit/prediction of samples; estimates of the behaviour of future samples: RMSEP (P = prediction), r^2_pred, Bias_pred.
What does PLS do?
Find big things in X that correlate with big things in Y.
What is essential to look for in terms of the scores in outlier control (PCA)?
First and foremost, outliers are found by looking in score plots. This is much more powerful than testing because peculiar behavior is easily spotted. Look for unusual grouping or other structure which is not in accordance with the knowledge of the data.
How can a model always be made to fit perfectly (considering validation)?
By fitting the data with a polynomial of increasing degree: plotting the degree of the polynomial vs. the error shows the fit error falling as the degree goes up.
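A small sketch of why a perfect fit proves nothing. The data below are made up for illustration (a straight line plus noise, 8 points); the function name `fit_rmse` is hypothetical. A polynomial of degree 7 passes through all 8 points, so the fit error drops to zero even though the extra terms only describe noise.

```python
import numpy as np
from numpy.polynomial import polynomial as npoly

# Made-up calibration data: a linear relation plus random noise.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 8)
y = 2.0 * x + 0.5 + rng.normal(scale=0.2, size=x.size)

def fit_rmse(degree):
    """Fit error (RMSE on the same points used for fitting)."""
    coeffs = npoly.polyfit(x, y, degree)
    y_hat = npoly.polyval(x, coeffs)
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

# Degree 1 up to degree 7 (= number of points - 1).
fit_errors = [fit_rmse(d) for d in range(1, x.size)]
for degree, err in enumerate(fit_errors, start=1):
    print(f"degree {degree}: fit RMSE = {err:.2e}")
```

The fit error never increases with the degree, which is why the number of components (or the degree) has to be chosen from validation errors rather than from the fit.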
What should a PCA/PLS model do?
Generalize, i.e. be descriptive of new objects. Describe systematic variation and leave out noise.
What is the difference between Hotelling's T^2 (score outliers) and residual outliers?
Hotelling's T^2, score outliers: same pattern as described by the model, but extreme. Residual outliers: a different pattern from the one described by the model.
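A sketch of the two outlier types on made-up data, assuming a one-component PCA model computed by SVD. Sample 0 is constructed to be extreme along the model direction (high T^2) and sample 1 to deviate in a direction the model does not describe (high residual). The T^2 here is the plain sum of squared scores divided by the score variance; software packages may scale it differently.

```python
import numpy as np

rng = np.random.default_rng(5)
n_samples, n_vars = 20, 5
direction = np.zeros(n_vars); direction[0] = 1.0   # pattern the model will describe
off_model = np.zeros(n_vars); off_model[1] = 1.0   # pattern outside the model

t_true = rng.normal(size=n_samples)
t_true[0] = 8.0   # sample 0: same pattern, but extreme
t_true[1] = 0.0
X = np.outer(t_true, direction) + rng.normal(scale=0.05, size=(n_samples, n_vars))
X[1] += 3.0 * off_model   # sample 1: different pattern

# One-component PCA on mean-centered data: Xc = T P' + E
Xc = X - X.mean(axis=0)
U_svd, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 1
T = U_svd[:, :k] * s[:k]
P = Vt[:k].T
E = Xc - T @ P.T

t2 = np.sum(T**2 / T.var(axis=0, ddof=1), axis=1)  # score distance per sample
q_res = np.sum(E**2, axis=1)                       # residual sum of squares per sample

print("highest T^2:", int(np.argmax(t2)))
print("highest residual:", int(np.argmax(q_res)))
```

Plotting `q_res` against `t2` gives the influence plot mentioned elsewhere in these notes.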
What are the (three) alternatives within multivariate regression?
Multiple Linear Regression - MLR. Principal Component Regression - PCR. Partial Least Squares (Regression) - PLS(R).
What is essential in outlier control (PCA)?
Outliers should be understood rather than simply removed; only bad outliers must be removed.
What is the difference between PCA and PLS (in terms of multivariate regression)?
PCA (Principal Component Analysis) is for the analysis of one data matrix (X); one data structure, X. PLS (Partial Least Squares Regression)/multivariate regression is for correlating the information in one data matrix (X) with the information in one or more response vectors (y/Y); two data structures, X and Y.
At what number of components does PLS extract the relevant information? How does PLS differ from PCA in terms of components?
PLS extracts the relevant information within components 2-3. PLS tends to need fewer components, but it is not always the first components you want to look at.
What do you use cross-validation for?
Estimating how well the model predicts samples, by looking at the prediction errors.
What do you do when predicting new samples in PLS?
Prediction of new samples: multiply the new data by the regression vector; the regression vector does the predicting. The regression vector has the same form as a spectrum (one coefficient per variable).
What is the full name for PCR and PLS/-R and the two equations?
Principal component regression (PCR): b = (T^T*T)^-1*T^T*Y. Partial least squares regression (PLS/-R): b = (T_PLS^T*T_PLS)^-1*T_PLS^T*Y.
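A minimal PCR sketch of b = (T^T*T)^-1*T^T*Y on made-up, mean-centered data. The function name `pcr_coefficients` is hypothetical, and the scores T come from an SVD-based PCA. For PLS the same regression step applies, only with PLS scores in place of the PCA scores.

```python
import numpy as np

def pcr_coefficients(X, y, n_components):
    """PCR on mean-centered X and y (assumption: centering done by the caller)."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    P = Vt[:n_components].T                       # loadings
    T = X @ P                                     # scores
    b_scores = np.linalg.solve(T.T @ T, T.T @ y)  # b = (T^T T)^-1 T^T y
    b_x = P @ b_scores                            # same model, expressed per X variable
    return b_scores, b_x

# Made-up data: 15 samples, 4 variables.
rng = np.random.default_rng(3)
X = rng.normal(size=(15, 4))
X = X - X.mean(axis=0)
y = X @ np.array([1.0, 0.5, -1.0, 2.0]) + rng.normal(scale=0.05, size=15)
y = y - y.mean()

b_scores_2, b_x_2 = pcr_coefficients(X, y, 2)   # 2-component PCR
_, b_x_full = pcr_coefficients(X, y, 4)         # all components
b_mlr, *_ = np.linalg.lstsq(X, y, rcond=None)   # MLR for comparison
print(b_x_2, b_x_full, b_mlr)
```

With all components included, PCR gives the same regression vector as MLR; the point of PCR is to stop earlier and leave the noisy directions out.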
Regression error measures - Define RMSEC, RMSECV and RMSEP
RMSEC: on the calibration samples. - The more components, the lower it gets. - A low RMSEC does not tell you much. RMSECV: cross-validation (trustworthy). - All samples are used, in turn, both for calibration and for testing. RMSEP: test set/prediction (trustworthy). - If you have enough samples, a test set is good.
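A sketch of RMSEC vs. RMSECV on made-up data. To keep it short, plain MLR with an intercept stands in for the PLS model, and the cross-validation is leave-one-out; both choices are assumptions for illustration, and the helper names are hypothetical.

```python
import numpy as np

def rmse(residuals):
    return float(np.sqrt(np.mean(np.square(residuals))))

def fit_ols(X, y):
    """Least-squares fit with an intercept column."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_ols(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

# Made-up data: 12 samples, 2 variables.
rng = np.random.default_rng(4)
X = rng.normal(size=(12, 2))
y = 1.0 + X @ np.array([0.8, -0.3]) + rng.normal(scale=0.1, size=12)

# RMSEC: one model, evaluated on the samples it was built from.
rmsec = rmse(y - predict_ols(fit_ols(X, y), X))

# RMSECV: each sample is left out once and predicted by a model built on the rest.
cv_residuals = []
for i in range(len(y)):
    keep = np.arange(len(y)) != i
    coef_i = fit_ols(X[keep], y[keep])
    cv_residuals.append(y[i] - predict_ols(coef_i, X[i:i + 1])[0])
rmsecv = rmse(cv_residuals)

print(f"RMSEC = {rmsec:.3f}, RMSECV = {rmsecv:.3f}")
```

For a least-squares model the leave-one-out error is never smaller than the calibration error, which is one way to see why RMSEC alone is too optimistic.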
In validation, you distinguish between RMSECV and RMSEC; what is the difference?
RMSECV (cross-validation): swapping around which samples are part of the calibration and which are part of the validation. - Based on 3 individual models (so the uncertainty is higher than for RMSEC). RMSEC: based on calibration. - Based on 1 model.
What does RMSEP consist of?
RMSEP: total error. - SEP: random error. - Bias: systematic error (bias is always 0 in calibration models). RMSEP^2 = SEP^2 + Bias^2
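A quick numerical check of RMSEP^2 = SEP^2 + Bias^2 with made-up reference and predicted values. The identity is exact when SEP is computed with n in the denominator, as below; with the n-1 denominator used in some texts it holds only approximately.

```python
import numpy as np

# Made-up test-set values (illustration only).
y_ref = np.array([10.0, 12.0, 11.5, 13.0, 9.5])
y_pred = np.array([10.4, 12.1, 12.0, 13.6, 9.8])

errors = y_pred - y_ref
bias = errors.mean()                            # systematic error
sep = np.sqrt(np.mean((errors - bias) ** 2))    # random error around the bias (n in denominator)
rmsep = np.sqrt(np.mean(errors ** 2))           # total error

print(f"bias = {bias:.3f}, SEP = {sep:.3f}, RMSEP = {rmsep:.3f}")
print("RMSEP^2 == SEP^2 + bias^2:", bool(np.isclose(rmsep**2, sep**2 + bias**2)))
```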
When looking at outliers in an influence plot (sample residual variation vs. T^2), does the number of components matter?
The influence plot shows sample residual variation vs. T^2 for a specific number of components, but the exact number is not important: outliers will still flag out, to one side or the other.
What do residuals tell us, when talking about outlier control (PCA)? What do we know about residuals?
Samples (or variables) that are not consistent with the model (a different pattern from the one described by the model) get high residuals. For samples that fit the model, residuals are: - Small - Approximately random - Of similar size
What is validation used for?
See/prove a model - does it fit? Estimate model complexity and estimate model accuracy
Summarize the steps in "how to PLS1". And what is the difference from PLS2?
PLS1 skips the 7th step, and the model is y = X*b + e (a single response vector y) rather than Y = X*B + E as in PLS2.
What is unique for multiple linear regression (MLR) in terms of the slope, b, and what does the equation look like?
Slopes, b: one unique slope per variable. b = (X^T*X)^-1*X^T*Y. In order to solve the equation, you need more samples than variables.
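A sketch of b = (X^T*X)^-1*X^T*Y on made-up data, plus the reason MLR needs more samples than variables. The data and the "true" slopes are invented for illustration.

```python
import numpy as np

# Made-up data: 20 samples, 3 variables (more samples than variables).
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
b_true = np.array([1.5, -2.0, 0.5])
y = X @ b_true + rng.normal(scale=0.01, size=20)

# b = (X^T X)^-1 X^T y, solved without forming the inverse explicitly.
b_normal = np.linalg.solve(X.T @ X, X.T @ y)
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)   # same answer via least squares
print("slopes (one per variable):", b_normal)

# Fewer samples (3) than variables (5): X has rank 3 at most,
# so the 5x5 matrix X^T X cannot be inverted.
X_wide = rng.normal(size=(3, 5))
rank_wide = np.linalg.matrix_rank(X_wide)
print("rank of the wide X:", rank_wide, "but variables:", X_wide.shape[1])
```

This rank problem is what PCR and PLS get around by regressing on a few scores instead of on all the original variables.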
What are the steps in predicting new samples (4 steps)?
Step 1: build the model (sample-specific parts: T and U; calibration-model-specific parts: P' and Q'). Step 2: find the X scores from the new, known X data (X_new). Step 3: estimate the Y scores from the X scores. Step 4: estimate the Y data from the Y scores.
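The steps can be sketched for PLS1 as follows. This is a minimal NIPALS-style implementation written for these notes, not the course's reference algorithm; the function names `pls1_fit` and `pls1_predict` and the data are hypothetical. In PLS1 there is only one y, so steps 3 and 4 collapse into multiplying the new X scores by the y loadings q.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Step 1: build the model (weights W, X loadings P, y loadings q)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean          # centered data, deflated per component
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xr.T @ yr
        w = w / np.linalg.norm(w)            # direction in X with maximal covariance with y
        t = Xr @ w                           # X scores for this component
        tt = t @ t
        p = Xr.T @ t / tt                    # X loadings
        q_a = (yr @ t) / tt                  # y loading
        Xr = Xr - np.outer(t, p)             # remove what this component explains
        yr = yr - q_a * t
        W.append(w); P.append(p); q.append(q_a)
    return {"W": np.column_stack(W), "P": np.column_stack(P),
            "q": np.array(q), "x_mean": x_mean, "y_mean": y_mean}

def pls1_predict(model, X_new):
    Xc = X_new - model["x_mean"]
    # Step 2: X scores of the new samples, T_new = Xc W (P'W)^-1
    T_new = Xc @ model["W"] @ np.linalg.inv(model["P"].T @ model["W"])
    # Steps 3-4: from X scores to predicted y
    return T_new @ model["q"] + model["y_mean"]

# Made-up data: 12 samples, 3 variables.
rng = np.random.default_rng(6)
X = rng.normal(size=(12, 3))
y = X @ np.array([1.0, -0.5, 2.0]) + rng.normal(scale=0.1, size=12)

model_1 = pls1_fit(X, y, 1)
model_full = pls1_fit(X, y, 3)
y_hat_1 = pls1_predict(model_1, X)
y_hat_full = pls1_predict(model_full, X)
print(y_hat_1[:3], y_hat_full[:3])
```

As a sanity check, a PLS1 model that uses as many components as there are (independent) variables reproduces the MLR fit.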
How many models can be used for outlier detection in PLS?
The PLS model consists of three models; each model can be used for outlier detection
What is the goal of regression analysis in general?
We would like to find a relation between X and Y.
What is PLS about/what do you use it for?
When we would like to relate the two data structures and find out which variables in X correlate with the Y variables, as well as the internal correlations in X and Y (in contrast to PCA). A PLS model can be used in the future, e.g. protein content in wheat kernels can be predicted from NIT measurements → a slow method is replaced (but some reference measurements are still necessary).
Refresh the equation for Principal Component Analysis (PCA)
X = TP' + E
What is the formula for PCA (Principal Component Analysis)? What does each variable symbolize?
X = TP' + E. TP': the model (T: scores, P: loadings). E: residuals, obtained as X - TP' = E.
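A sketch of X = TP' + E on made-up data, assuming the PCA is computed by SVD on mean-centered X with two components kept.

```python
import numpy as np

# Made-up data: 10 samples, 4 variables.
rng = np.random.default_rng(2)
X_raw = rng.normal(size=(10, 4))
X = X_raw - X_raw.mean(axis=0)       # mean-center before PCA

U_svd, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2                                # number of components kept
T = U_svd[:, :k] * s[:k]             # scores (one row per sample)
P = Vt[:k].T                         # loadings (one row per variable)
E = X - T @ P.T                      # residuals: what the k components leave out

print("X == TP' + E:", bool(np.allclose(X, T @ P.T + E)))
print("residual sum of squares:", float(np.sum(E**2)))
```

The residual sum of squares equals the sum of the squared singular values of the components that were left out.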
What does a PLS model consist of and what do you want to be able to do?
X-matrix and Y-matrix. Be able to predict Y-scores from X-scores.
What is the general formula for multivariate regression?
Y = X*B + E
How does PLS differ from PCA in terms of the R-value?
You get a higher explained variance with a PLS model than with a PCA model (higher r-value).
What do you want to be able to predict, and what do you want to avoid, in PLS?
You want to be able to predict Y. You want to avoid noise in X.