COMP20008 - Exam Revision

What is the problem with bins that are too big in a histogram?

Outliers fall into frequent bins alongside normal objects, giving false negatives in outlier detection

In the context of regex, what does '[ ]' do?

defines a set of characters; matches any one character from the set

In the context of regex, what does '$' do?

matches end of string

In the context of regex, what does '|' do?

the or operator
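
The three operators above can be tried out with Python's standard `re` module. This is a small sketch; the sample strings are made up for illustration.

```python
import re

# '[ ]' defines a character set: [aeiou] matches any single vowel.
vowels = re.findall(r"[aeiou]", "data")

# '$' anchors the match at the end of the string.
ends_with_csv = bool(re.search(r"csv$", "marks.csv"))
csv_not_at_end = bool(re.search(r"csv$", "csv.marks"))

# '|' is the or operator: match the pattern on either side.
formats = re.findall(r"xml|json", "xml, html, json")

print(vowels, ends_with_csv, csv_not_at_end, formats)
```

Note that "html" is not picked up by `xml|json`, because neither alternative occurs inside it.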

What are the ways to deal with missing data? (3)

- Delete all instances with a missing value - Manually correct - Imputation

What is the difference between xml and html?

XML tags are not predefined; you create tags according to your needs. XML is case sensitive, whereas HTML is not. XML is extensible, whereas HTML is not.

In the context of regex, what does '*' do?

zero or more repetitions

Advantages of item based methods (recommender systems)?

- All computations can be done offline - Item-item similarity is more stable than user-user similarity (no need for frequent updates)

What are some examples of where finding outlier would be useful?

- Credit card fraud detection - Medical analysis (unusual test results) - Sports (identifying exceptional talent)

Advantages/disadvantages of 'deleting all instances with a missing value' when dealing with missing data.

- Easy to analyse the new (complete) data - May produce a bias on analysis if the new sample size is small or structure exists in the missing data
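
A minimal sketch of the bias risk, using made-up records in which `income` is missing only for the older people (so the missing data has structure). The field names and values are hypothetical.

```python
from statistics import mean

# Hypothetical records: income is missing only for the two oldest people.
rows = [
    {"age": 25, "income": 40},
    {"age": 30, "income": 50},
    {"age": 60, "income": None},
    {"age": 65, "income": None},
]

# Delete every instance that has at least one missing value.
complete = [r for r in rows if None not in r.values()]

mean_age_all = mean(r["age"] for r in rows)
mean_age_complete = mean(r["age"] for r in complete)

print(len(complete), mean_age_all, mean_age_complete)
```

The complete data is easy to analyse, but only half the sample remains and its mean age is much lower than in the full sample.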

What are the possible causes of missing data?

- Malfunction of equipment - Not recorded due to misunderstanding - May not be considered important at the time of entry - Deliberate

Disadvantages of user based methods (recommender systems)?

- No recommendations for new users - User preferences are dynamic, so the offline calculations need frequent updates

Advantages/disadvantages of imputation when dealing with missing data.

- Obtains values to work with, so preserves data size - Reduces the variance of the data if using the mean

advantages/disadvantages of doing imputation for missing values by filling in with zeros

- Simple, won't break application programs - Limited utility for analysis - Lose other data

Advantages/disadvantages of doing imputation for missing values by filling in category mean

- Take categories/clusters and compute the mean of each (e.g. the mean age of females instead of the overall mean age) - More accurate and maintains relationships - Still skews the distribution, as with filling in the overall mean
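
A small sketch of category mean imputation on made-up data. The `sex` and `age` fields and their values are hypothetical; each missing age is filled with the mean age of that record's own category rather than the overall mean.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records with two missing ages.
records = [
    {"sex": "F", "age": 20},
    {"sex": "F", "age": 24},
    {"sex": "F", "age": None},
    {"sex": "M", "age": 50},
    {"sex": "M", "age": 54},
    {"sex": "M", "age": None},
]

# Collect the observed ages per category.
observed = defaultdict(list)
for r in records:
    if r["age"] is not None:
        observed[r["sex"]].append(r["age"])

category_mean = {sex: mean(ages) for sex, ages in observed.items()}
overall_mean = mean(a for ages in observed.values() for a in ages)

# Fill each gap with the mean of the record's own category.
filled = [
    r["age"] if r["age"] is not None else category_mean[r["sex"]]
    for r in records
]

print(category_mean, overall_mean, filled)
```

Here the overall mean age would place both missing values far from every observed value in either category, while the category means keep the relationship between `sex` and `age`.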

Advantages/disadvantages of doing imputation for missing values by filling in with mean value

- Can be good for supervised classification - Reduces the variance of the feature - Gives an incorrect view of the distribution of that attribute - Relationships to other features change - Use mode (most frequent value) imputation for categorical features
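
The variance reduction can be checked on a tiny made-up feature. This sketch uses the population variance from Python's `statistics` module; the values are hypothetical.

```python
from statistics import mean, pvariance

# Hypothetical feature with two missing values.
ages = [20, None, 30, None, 40]

observed = [a for a in ages if a is not None]
fill_value = mean(observed)

# Replace every missing value with the mean of the observed values.
imputed = [fill_value if a is None else a for a in ages]

var_observed = pvariance(observed)
var_imputed = pvariance(imputed)

print(imputed, var_observed, var_imputed)
```

Every filled-in value sits exactly on the mean, so it adds nothing to the spread while increasing the count, and the variance of the imputed feature is smaller.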

What is the difference between a csv file and a xls file?

A csv file is a plain text format of comma-separated values, whereas an xls file (Excel spreadsheet) is a binary file format.

What is an outlier?

A data object that deviates significantly from the normal objects, as if it were generated by a different mechanism.

Advantages/disadvantages of manually correcting when dealing with missing data.

A human eyeballs the missing value and fills it in using their expert knowledge - Highly accurate - Very costly (time consuming)

What aspects of data quality does pre-processing deal with?

Accuracy, completeness, consistency, timeliness, believability, interpretability

Why is a Tukey boxplot useful?

Clearly shows number and extremity of outliers

What is the basic structure of JSON?

Data is in name/value pairs.
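
As a quick illustration, Python's standard `json` module turns a JSON object into a dictionary of name/value pairs. The record below is made up.

```python
import json

# A JSON object is a set of name/value pairs; values can be strings,
# numbers, booleans, null, arrays, or nested objects.
text = '{"name": "Ada", "age": 36, "languages": ["Python", "SQL"]}'

record = json.loads(text)      # parse JSON text into a Python dict
names = sorted(record)         # the names (keys) in the object
round_trip = json.loads(json.dumps(record))

print(names, record["languages"])
```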

What is the difference between data missing completely at random and data not missing completely at random?

Data missing completely at random are missing independently of both observed and unobserved data, whereas data not missing completely at random are missing in part because of observed information (e.g. a respondent not answering a question about how they feel)

What is the goal of user based methods in recommender systems?

Identify like-minded users

What is the difference between JSON and XML?

JSON is simpler and more compact/lightweight than XML. JSON is easier to parse. JSON offers more speed and efficiency. A common JSON application is reading and displaying data from a web server using JavaScript. XML comes with a large family of other standards for querying and transforming.

Why is JSON-LD used to represent linked data?

JSON-LD provides mechanisms for specifying unambiguous meaning in JSON data. It adds extra keys prefixed with the "@" sign (e.g. @context, @id, @type).
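
A minimal sketch of what such a document looks like. `@context`, `@type` and `@id` are JSON-LD keywords; the URLs and values here are illustrative only. Python's standard `json` module reads it as ordinary JSON and does not interpret the context, which would need a dedicated JSON-LD processor.

```python
import json

# An illustrative JSON-LD document: the "@" keys say what vocabulary the
# plain names come from and give the described thing a global identifier.
doc = json.loads("""
{
  "@context": "https://schema.org",
  "@type": "Person",
  "@id": "https://example.org/people/ada",
  "name": "Ada"
}
""")

ld_keywords = sorted(k for k in doc if k.startswith("@"))
plain_keys = sorted(k for k in doc if not k.startswith("@"))

print(ld_keywords, plain_keys)
```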

What is linked data?

Linked data is a method of publishing structured data so that it can be interlinked and become more useful through semantic queries.

What is an interesting application of JSON?

Linked data using JSON-LD to represent it

What is collaborative filtering?

Making predictions about a user's missing data according to the behaviour of many other users
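
A toy sketch of the user-based flavour of this idea. The users, items and ratings are made up, and cosine similarity over co-rated items is one common choice of similarity measure, assumed here rather than taken from these notes.

```python
from math import sqrt

# Hypothetical ratings; alice has not rated item "D".
ratings = {
    "alice": {"A": 5, "B": 4, "C": 1},
    "bob":   {"A": 5, "B": 5, "C": 1, "D": 5},
    "carol": {"A": 1, "B": 2, "C": 5, "D": 1},
}

def cosine(u, v):
    """Cosine similarity over the items both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(u[i] ** 2 for i in common))
    norm_v = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

def predict(user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    num = den = 0.0
    for other, theirs in ratings.items():
        if other == user or item not in theirs:
            continue
        sim = cosine(ratings[user], theirs)
        num += sim * theirs[item]
        den += sim
    return num / den if den else None

sim_bob = cosine(ratings["alice"], ratings["bob"])
sim_carol = cosine(ratings["alice"], ratings["carol"])
prediction = predict("alice", "D")
print(sim_bob, sim_carol, prediction)
```

Because alice's ratings look much more like bob's than carol's, the predicted rating for "D" is pulled towards bob's 5 rather than carol's 1.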

In the context of regex, what does '.' do?

Matches any character

What is the purpose of XML namespaces?

Namespace declarations are used to qualify names with uniform resource identifiers (URIs), so that elements with the same name from different vocabularies do not clash.
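
A short sketch using Python's standard `xml.etree.ElementTree`. The document and the two `example.org` namespace URIs are made up; they show two `title` elements kept apart by their namespaces.

```python
import xml.etree.ElementTree as ET

# Two vocabularies both define <title>; the prefixes bind each one to a URI.
xml_text = """
<catalog xmlns:bk="http://example.org/books"
         xmlns:mv="http://example.org/movies">
  <bk:title>Data Wrangling</bk:title>
  <mv:title>Data: The Movie</mv:title>
</catalog>
""".strip()

root = ET.fromstring(xml_text)

# Map prefixes to URIs for lookups; only the URIs have to match the document.
ns = {"bk": "http://example.org/books", "mv": "http://example.org/movies"}
book_title = root.find("bk:title", ns).text
movie_title = root.find("mv:title", ns).text

# Internally the element name is qualified by the URI, not the prefix.
qualified_name = root[0].tag

print(book_title, movie_title, qualified_name)
```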

What does data-cleaning deal with?

Noisy data, inconsistent data, intentionally disguised data and incomplete (missing) data.

What is the benefit of a histogram?

Often makes fewer assumptions about the data and thus can be applicable in more scenarios

What is the importance of finding outliers?

Outliers are interesting because they violate the mechanism that generates the normal data. Unlike noise, which does not reflect reality, outliers can reveal genuine exceptions.

What is the goal of recommender systems?

Seeks to predict the rating or preference that a user would give to an item

What is the process of model (matrix) based methods (recommender systems)?

Solve an optimization problem and identify latent factors
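
One way to picture this is a tiny matrix factorisation fitted by stochastic gradient descent. This is a sketch under assumptions, not the method from the subject: the ratings matrix, the number of latent factors `k`, the learning rate, the regularisation weight and the function name `factorise` are all illustrative choices.

```python
import random

def factorise(ratings, k=2, steps=2000, lr=0.01, reg=0.02, seed=0):
    """Fit ratings[u][i] ~ dot(P[u], Q[i]) on the known entries only."""
    rng = random.Random(seed)
    n_users, n_items = len(ratings), len(ratings[0])
    P = [[rng.random() for _ in range(k)] for _ in range(n_users)]  # user factors
    Q = [[rng.random() for _ in range(k)] for _ in range(n_items)]  # item factors
    known = [(u, i, r) for u, row in enumerate(ratings)
             for i, r in enumerate(row) if r is not None]

    def predict(u, i):
        return sum(P[u][f] * Q[i][f] for f in range(k))

    def squared_error():
        return sum((r - predict(u, i)) ** 2 for u, i, r in known)

    start_error = squared_error()
    for _ in range(steps):
        for u, i, r in known:
            err = r - predict(u, i)
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on the regularised squared error.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return predict, start_error, squared_error()

# Hypothetical user x item ratings; None marks a missing rating.
R = [
    [5, 3, None, 1],
    [4, None, None, 1],
    [1, 1, None, 5],
    [1, None, None, 4],
    [None, 1, 5, 4],
]
predict, start_error, end_error = factorise(R)
print(start_error, end_error, predict(0, 2))
```

The optimisation problem is minimising the squared error on the known ratings; the rows of `P` and `Q` are the latent factors, and their dot product fills in the missing entries.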

Where are the whiskers in a Tukey boxplot?

They extend to the highest point still within 1.5*IQR of the upper quartile and the lowest point still within 1.5*IQR of the lower quartile.

What values are considered suspected outliers in a Tukey boxplot?

any value >1.5*IQR above third quartile or >1.5*IQR below 1st quartile

What values are considered outliers in a Tukey boxplot?

any value >3*IQR above 3rd quartile or >3*IQR below 1st quartile
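
The whisker and outlier rules above can be sketched as follows. The data is made up, and the quartiles come from `statistics.quantiles` with the "inclusive" method, which is only one of several quartile conventions; other conventions can move the fences slightly. In this sketch, "suspected" means beyond 1.5*IQR but not beyond 3*IQR.

```python
from statistics import quantiles

def tukey_summary(values):
    """Whisker ends, suspected outliers and outliers by the IQR rules."""
    data = sorted(values)
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_low, outer_high = q1 - 3 * iqr, q3 + 3 * iqr
    inside = [x for x in data if inner_low <= x <= inner_high]
    return {
        # Whiskers end at the most extreme points still within 1.5*IQR.
        "whiskers": (min(inside), max(inside)),
        "suspected": [x for x in data
                      if outer_low <= x < inner_low or inner_high < x <= outer_high],
        "outliers": [x for x in data if x < outer_low or x > outer_high],
    }

summary = tukey_summary([12, 1, 30, 10, 14, 22, 11, 13, 15])
print(summary)
```

For this data Q1 = 11 and Q3 = 15, so IQR = 4; the 1.5*IQR fences are 5 and 21, and the 3*IQR fences are -1 and 27. That makes 1 and 22 suspected outliers and 30 an outlier, with whiskers at 10 and 15.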

In the context of regex, what does '^' do?

matches start of string

What is the problem of having too small bins in a histogram?

normal objects fall in empty or rare bins, giving a false positive in outlier detection

What is a disadvantage of a histogram in outlier detection?

often hard to choose appropriate bin size
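
A small sketch of why bin size matters, using a simple rule that is assumed here for illustration: a point is flagged when its bin holds fewer than `min_count` points. The data and the function name `rare_bin_points` are made up.

```python
from collections import Counter

def rare_bin_points(values, bin_width, min_count=2):
    """Flag points that fall in bins holding fewer than min_count points."""
    bin_of = lambda v: int(v // bin_width)
    counts = Counter(bin_of(v) for v in values)
    return [v for v in values if counts[bin_of(v)] < min_count]

# Ten normal values and one obvious outlier (38).
values = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 38]

reasonable = rare_bin_points(values, bin_width=10)
too_small = rare_bin_points(values, bin_width=1)
too_big = rare_bin_points(values, bin_width=40)

print(reasonable, too_small, too_big)
```

With a width of 10 only 38 is flagged. With a width of 1 every value sits alone in its bin, so all the normal values are flagged too (false positives). With a width of 40 everything shares one bin and 38 is missed (a false negative).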

In the context of regex, what does '+' do?

one or more repetitions
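
The remaining regex symbols from these cards ('.', '*', '+' and '^') can be checked the same way with Python's standard `re` module. The sample strings are made up for illustration.

```python
import re

# '.' matches any single character (except a newline by default).
dot_matches = re.findall(r"c.t", "cat cot c-t ct")

# '*' allows zero or more repetitions of the preceding item.
star_zero = re.fullmatch(r"ab*", "a") is not None

# '+' requires one or more repetitions of the preceding item.
plus_zero = re.fullmatch(r"ab+", "a") is not None
plus_many = re.fullmatch(r"ab+", "abbb") is not None

# '^' anchors the match at the start of the string.
at_start = bool(re.search(r"^COMP", "COMP20008"))
not_at_start = bool(re.search(r"^20008", "COMP20008"))

print(dot_matches, star_zero, plus_zero, plus_many, at_start, not_at_start)
```

The string "a" matches `ab*` but not `ab+`, which is the whole difference between zero or more and one or more.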

