COMP20008 - Exam Revision
What is the problem with having too big a bin size in a histogram?
When bins are too big, outliers fall into frequent bins alongside normal objects, giving false negatives in outlier detection
In the context of regex, what does '[ ]' do?
Matches any single character from the set inside the brackets (e.g. [abc] matches a, b or c)
In the context of regex, what does '$' do?
matches end of string
In the context of regex, what does '|' do?
the or operator
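A quick sketch of the three operators above using Python's built-in re module. The sample strings are made up for illustration.

```python
import re

# [ ] matches any single character from the set inside the brackets.
vowels = re.findall(r"[aeiou]", "regex")

# $ anchors the match at the end of the string.
ends_with_csv = bool(re.search(r"\.csv$", "grades.csv"))
not_at_end = bool(re.search(r"\.csv$", "grades.csv.bak"))

# | is the or operator: either alternative may match.
formats = re.findall(r"xml|json", "xml, html and json")

print(vowels, ends_with_csv, not_at_end, formats)
```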
What are the ways to deal with missing data? (3)
- Delete all instances with a missing value - Manually correct - Imputation
What is the difference between xml and html?
XML tags are not predefined; you create tags according to your needs, whereas HTML uses a fixed set of tags. XML is case sensitive, HTML is not. XML is extensible, whereas HTML is not.
In the context of regex, what does '*' do?
zero or more repetitions
Advantages of item based methods (recommender systems)?
- All computations can be done offline - Item-item similarity is more stable than user-user similarity (no need for frequent updates)
What are some examples of where finding outliers would be useful?
- Credit card fraud detection - Medical analysis (unusual test results) - Sports (identifying exceptional talent)
advantages/disadvantages of 'deleting all instances with a missing value' when dealing with missing data.
- Easy to analyse the new (complete) data - May produce a bias on analysis if the new sample size is small or structure exists in the missing data
What are the possible causes of missing data?
- Malfunction of equipment - Not recorded due to misunderstanding - May not be considered important at the time of entry - Deliberate
Disadvantages of user based methods (recommender systems)?
- No recommendation for new users - User preference is dynamic, so offline calculations need a high update frequency
advantages/disadvantages of imputation when dealing with missing data.
- Obtain values to work with, so data size is preserved - Reduces variance of the data if using the mean
advantages/disadvantages of doing imputation for missing values by filling in with zeros
- Simple, won't break application programs - Limited utility for analysis - Lose other data
advantages/disadvantages of doing imputation for missing values by filling in category mean
- Take categories/clusters and compute the mean within each (e.g. average age of females instead of average age of everyone) - More accurate and maintains relationships - Still skews the distribution, as with overall mean imputation
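A minimal sketch of category mean imputation in plain Python. The records, the use of None to mark a missing age, and the grouping column are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean

# (category, age) pairs; None marks a missing age. Made-up data.
people = [("F", 30), ("F", 34), ("F", None), ("M", 50), ("M", None), ("M", 54)]

# Collect the observed ages per category.
by_group = defaultdict(list)
for sex, age in people:
    if age is not None:
        by_group[sex].append(age)

# Mean of the observed values within each category.
group_mean = {group: mean(ages) for group, ages in by_group.items()}

# Fill each gap with the mean of its own category, not the overall mean.
imputed = [(sex, group_mean[sex] if age is None else age) for sex, age in people]
print(imputed)
```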
advantages/disadvantages of doing imputation for missing values by filling in with mean value
- Can be good for supervised classification - Reduces the variance of the feature - Gives an incorrect view of the distribution of that attribute - Relationships to other features change - Use mode (most frequent value) imputation for categorical features
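The variance point can be checked on a toy example. This sketch assumes None marks a missing value and uses the population variance from the standard library; the ages are made up.

```python
from statistics import mean, pvariance

ages = [20, 30, None, 40, None, 50]  # None marks a missing value (made-up data)

observed = [a for a in ages if a is not None]
fill = mean(observed)

# Mean imputation: every gap gets the same value, the mean of the observed data.
imputed = [fill if a is None else a for a in ages]

var_observed = pvariance(observed)
var_imputed = pvariance(imputed)
print(var_observed, var_imputed)
```

The imputed values sit exactly on the mean, so they add nothing to the spread while increasing the count, which is why the variance drops.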
What is the difference between a csv file and a xls file?
A csv file is a plain text format of comma-separated values, whereas an xls file (Excel spreadsheet) is Excel's main binary file format.
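Because csv is plain text, it can be read with nothing more than the standard library. A small sketch, with made-up file contents held in a string rather than a real file:

```python
import csv
import io

# The contents of a hypothetical csv file: a header row, then one record per line.
text = "name,age\nAlice,30\nBob,25\n"

# DictReader uses the first row as the field names.
rows = list(csv.DictReader(io.StringIO(text)))
print(rows)
```

Every value comes back as a string, since a csv file carries no type information. Reading an xls file needs a library that understands the binary format.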
What is an outlier?
A data object that deviates significantly from the normal objects, as if it were generated by a different mechanism.
advantages/disadvantages of manually correcting when dealing with missing data.
A human eyeballs the missing value and fills it in using their expert knowledge - Highly accurate - Very costly (time consuming)
What dimensions of data quality does pre-processing deal with?
Accuracy, completeness, consistency, timeliness, believability, interpretability
Why is a Tukey boxplot useful?
Clearly shows number and extremity of outliers
What is the basic structure of JSON?
Data is in name/value pairs.
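A small sketch of name/value pairs in practice, using Python's json module. The record is made up; values can themselves be arrays or nested objects.

```python
import json

# A JSON object: names are strings, values can be strings, numbers, arrays or objects.
text = '{"name": "Alice", "age": 30, "subjects": ["COMP20008"], "address": {"city": "Melbourne"}}'

record = json.loads(text)           # parse JSON text into Python dicts and lists
names = list(record)                # the names of the top-level pairs
city = record["address"]["city"]    # a nested name/value pair
round_trip = json.loads(json.dumps(record))

print(names, city)
```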
What is the difference between data missing completely at random and data not missing completely at random?
Data missing completely at random are missing independently of both observed and unobserved data, whereas data not missing completely at random are missing in part because of the information observed (e.g. not answering a question about how you feel)
What is the goal of user based methods in recommender systems?
Identify like-minded users
What is the difference between JSON and XML?
JSON is simpler and more compact/lightweight than XML. JSON is easier to parse. JSON offers more speed and efficiency. A common JSON application is to read and display data from a web server using JavaScript. XML comes with a large family of other standards for querying and transforming.
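To see the difference in compactness and parsing effort, here is the same made-up record in both formats, parsed with the Python standard library. This is only a sketch; the element and key names are assumptions.

```python
import json
import xml.etree.ElementTree as ET

json_text = '{"student": {"name": "Alice", "age": 30}}'
xml_text = "<student><name>Alice</name><age>30</age></student>"

# JSON maps straight onto dicts, and the number keeps its type.
from_json = json.loads(json_text)["student"]

# XML gives a tree of elements; the text of each element is always a string.
root = ET.fromstring(xml_text)
from_xml = {child.tag: child.text for child in root}

print(from_json, from_xml)
```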
Why is JSON-LD used to represent linked data?
JSON-LD provides mechanisms for specifying unambiguous meaning in JSON data. It adds extra keys prefixed with the "@" sign (e.g. @context, @id, @type)
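A minimal sketch of what a JSON-LD document looks like. The example.org identifier and the person are made up; the plain json module used here only reads the keys and does not interpret the linked-data meaning.

```python
import json

doc = json.loads("""
{
  "@context": "https://schema.org",
  "@id": "https://example.org/people/alice",
  "@type": "Person",
  "name": "Alice"
}
""")

# The keys starting with "@" carry the linked-data information:
# @context says which vocabulary the names come from,
# @id gives the thing a globally unique identifier,
# @type says what kind of thing it is.
ld_keys = sorted(key for key in doc if key.startswith("@"))
plain_keys = [key for key in doc if not key.startswith("@")]
print(ld_keys, plain_keys)
```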
What is linked data?
Linked data is a method of publishing structured data so that it can be interlinked and become more useful through semantic queries.
What is an interesting application of JSON?
Linked data using JSON-LD to represent it
What is collaborative filtering?
Making predictions about a user's missing data according to the behaviour of many other users
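A rough sketch of user-based collaborative filtering in plain Python. The ratings, the choice of cosine similarity over shared items, and the similarity-weighted average are all assumptions for illustration; other similarity measures and weighting schemes are common.

```python
from math import sqrt

# user -> {item: rating}; made-up data. "ann" has not rated item "D".
ratings = {
    "ann": {"A": 5, "B": 4, "C": 1},
    "bob": {"A": 4, "B": 5, "C": 2, "D": 5},
    "cat": {"A": 1, "B": 2, "C": 5, "D": 1},
}

def cosine(u, v):
    """Cosine similarity over the items both users have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)

def predict(user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    num = den = 0.0
    for other, theirs in ratings.items():
        if other == user or item not in theirs:
            continue
        sim = cosine(ratings[user], theirs)
        num += sim * theirs[item]
        den += sim
    return num / den if den else None

prediction = predict("ann", "D")
print(prediction)
```

Here bob's ratings look much more like ann's than cat's do, so bob's rating of "D" dominates the prediction.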
In the context of regex, what does '.' do?
Matches any single character (except a newline, by default)
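A short check of '.' with Python's re module. The strings are made up; re.DOTALL is the standard flag that lets '.' match a newline as well.

```python
import re

# "." stands for exactly one character, whatever it is.
same_line = re.findall(r"c.t", "cat c-t c\nt")

# By default "." does not match a newline; re.DOTALL changes that.
with_dotall = re.findall(r"c.t", "c\nt", flags=re.DOTALL)

print(same_line, with_dotall)
```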
What is the purpose of XML namespaces?
Namespace declarations are used to qualify names with uniform resource identifiers (URIs), so that elements with the same name from different vocabularies do not clash.
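A sketch of why this matters: two elements both called table, kept apart by their namespaces. The example.org URIs and the prefixes h and f are made up for illustration.

```python
import xml.etree.ElementTree as ET

xml_text = """
<root xmlns:h="http://example.org/html" xmlns:f="http://example.org/furniture">
  <h:table>rows and cells</h:table>
  <f:table>wooden desk</f:table>
</root>
"""

root = ET.fromstring(xml_text.strip())

# Map prefixes to URIs so find() can resolve the qualified names.
ns = {"h": "http://example.org/html", "f": "http://example.org/furniture"}
html_table = root.find("h:table", ns).text
furniture_table = root.find("f:table", ns).text

# Internally each tag is stored with its full URI, not the prefix.
tags = [child.tag for child in root]
print(html_table, furniture_table, tags)
```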
What does data-cleaning deal with?
Noisy data, inconsistent data, intentionally disguised data and incomplete (missing) data.
What is the benefit of a histogram?
Often makes fewer assumptions about the data and thus can be applicable in more scenarios
What is the importance of finding outliers?
Outliers are interesting because they violate the mechanism that generates the normal data. Unlike noise, which does not reflect reality, outliers can reveal genuine exceptions.
What is the goal of recommender systems?
Seeks to predict the rating or preference that a user would give to an item
What is the process of model (matrix) based methods (recommender systems)?
Solve an optimization problem and identify latent factors
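One way to picture this is a tiny matrix factorisation trained by stochastic gradient descent. This is only a sketch under assumptions: the ratings are made up, and the number of latent factors, learning rate, regularisation and number of passes are arbitrary choices, not values from the subject.

```python
import random

# (user index, item index, rating) triples; made-up data.
observed = [(0, 0, 5), (0, 1, 4), (1, 0, 4), (1, 1, 5), (1, 2, 1), (2, 1, 1), (2, 2, 5)]
n_users, n_items, k = 3, 3, 2  # k latent factors (assumed)

rng = random.Random(0)
# P holds one latent vector per user, Q one per item.
P = [[rng.uniform(0.0, 0.1) for _ in range(k)] for _ in range(n_users)]
Q = [[rng.uniform(0.0, 0.1) for _ in range(k)] for _ in range(n_items)]

def predict(u, i):
    """Predicted rating: dot product of the user and item latent vectors."""
    return sum(P[u][f] * Q[i][f] for f in range(k))

def squared_error():
    """The quantity being minimised, summed over the observed ratings."""
    return sum((r - predict(u, i)) ** 2 for u, i, r in observed)

error_before = squared_error()
lr, reg = 0.01, 0.02  # learning rate and regularisation strength (assumed)
for _ in range(2000):
    for u, i, r in observed:
        err = r - predict(u, i)
        for f in range(k):
            pu, qi = P[u][f], Q[i][f]
            P[u][f] += lr * (err * qi - reg * pu)
            Q[i][f] += lr * (err * pu - reg * qi)
error_after = squared_error()

# User 0 never rated item 2; the learned factors still give a prediction.
missing = predict(0, 2)
print(error_before, error_after, missing)
```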
Where are the whiskers in a Tukey boxplot?
They show the highest point still within 1.5*IQR of the upper quartile and the lowest point still within 1.5*IQR of the lower quartile.
What values are considered suspected outliers in a Tukey boxplot?
any value >1.5*IQR above third quartile or >1.5*IQR below 1st quartile
What values are considered outliers in a Tukey boxplot?
any value >3*IQR above 3rd quartile or >3*IQR below 1st quartile
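The whisker and outlier rules from the Tukey boxplot cards can be sketched with the standard library. The data is made up, and the "inclusive" quartile method is an assumption; other quartile conventions move the fences slightly.

```python
from statistics import quantiles

def tukey_summary(values):
    """Return the whiskers and a label per value using the 1.5*IQR and 3*IQR rules."""
    # Assumption: "inclusive" quartiles; other conventions give slightly different fences.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_low, outer_high = q1 - 3 * iqr, q3 + 3 * iqr

    labels = {}
    for v in values:
        if v < outer_low or v > outer_high:
            labels[v] = "outlier"
        elif v < inner_low or v > inner_high:
            labels[v] = "suspected outlier"
        else:
            labels[v] = "normal"

    # Whiskers: the most extreme points still within 1.5*IQR of the quartiles.
    inside = [v for v in values if inner_low <= v <= inner_high]
    return {"whiskers": (min(inside), max(inside)), "labels": labels}

data = [10, 11, 12, 13, 14, 15, 16, 25, 40]  # made-up sample
summary = tukey_summary(data)
print(summary)
```

With this sample the quartiles are 12 and 16, so IQR is 4: 25 lies more than 1.5*IQR but less than 3*IQR above the third quartile, while 40 lies more than 3*IQR above it.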
In the context of regex, what does '^' do?
matches start of string
What is the problem of having too small bins in a histogram?
normal objects fall in empty or rare bins, giving a false positive in outlier detection
What is a disadvantage of a histogram in outlier detection?
often hard to choose appropriate bin size
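The effect of bin size can be sketched with a toy histogram detector. The rule used here (flag any value whose bin holds fewer than min_count points), the data and the bin widths are all assumptions for illustration.

```python
from collections import Counter

def rare_bin_values(values, bin_width, min_count=2):
    """Flag values whose histogram bin holds fewer than min_count points."""
    counts = Counter(int(v // bin_width) for v in values)
    return [v for v in values if counts[int(v // bin_width)] < min_count]

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 19]  # 19 is the intended outlier (made-up data)

too_big = rare_bin_values(data, bin_width=20)    # one huge bin hides the outlier
too_small = rare_bin_values(data, bin_width=1)   # every value sits alone in its bin
reasonable = rare_bin_values(data, bin_width=5)  # only the isolated value is flagged

print(too_big, too_small, reasonable)
```

The first call misses the outlier (false negative), the second flags every normal value (false positives), and only the middle bin width isolates 19.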
In the context of regex, what does '+' do?
one or more repetitions
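The difference between '*' and '+', and the effect of '^', can be seen on one made-up string with Python's re module:

```python
import re

# * allows zero repetitions, so "ab*" also matches a bare "a".
star = re.findall(r"ab*", "a ab abb")

# + needs at least one repetition, so the bare "a" is skipped.
plus = re.findall(r"ab+", "a ab abb")

# ^ only allows a match at the start of the string.
at_start = re.findall(r"^ab+", "abb ab")

print(star, plus, at_start)
```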