INFS 348 Exam 1 Script


# mean age by cluster - aggregate age by cluster and get the mean for each cluster.

aggregate(data = teens, age ~ cluster, mean)

# proportion of females by cluster - aggregate female column by cluster and then calculate the mean for each cluster.

aggregate(data = teens, female ~ cluster, mean)

# mean number of friends by cluster - aggregate friends by cluster and calculate the mean for each cluster.

aggregate(data = teens, friends ~ cluster, mean)

default settings result in zero rules learned

apriori(groceries)

How do you show how many rows and columns there are?

dim(nlab1)

How would you format factors and lists?

f <- factor(ratings, levels)
fl <- list(ratings = ratings, critics = critics, movies = movies, attendance = attendance)

## Let's transform it into vertical format, where one row corresponds to one item in a given transaction.
## Put this into a view, and append your account name to the view name, e.g. grocery_vertical_dataminer1.
## We need a transaction identifier; that is what row_number() over () provides.

CREATE VIEW grocery_vertical AS SELECT tid, regexp_split_to_table(line, ',') FROM (SELECT row_number() over () AS tid,line FROM grocery) as food;
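
For intuition, the same horizontal-to-vertical reshaping can be sketched in base R. This is only an illustration: the three-row `grocery` data frame is made up, and the column name `items` is an assumption borrowed from the later renaming step.

```r
# Hypothetical staging data: one comma-separated basket per row.
grocery <- data.frame(line = c("milk,bread", "eggs", "bread,butter,jam"),
                      stringsAsFactors = FALSE)

# Split each basket into its individual items.
items <- strsplit(grocery$line, ",", fixed = TRUE)

# One row per (transaction id, item): the row position plays the role of
# row_number(), and strsplit() plays the role of regexp_split_to_table().
grocery_vertical <- data.frame(
  tid   = rep(seq_along(items), lengths(items)),
  items = unlist(items),
  stringsAsFactors = FALSE
)
print(grocery_vertical)
```

Each basket contributes as many rows as it has items, all sharing the same tid.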

How do you tell R how the columns are separated?

The delimiter is the character that separates the columns. Here the delimiter is '|', called "pipe".
lab1 <- read.table("lab1_01.txt", sep="|", header=TRUE)
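
A minimal sketch of the same call on inline text, so it can be tried without the lab file. The three-column content below is invented and merely stands in for lab1_01.txt.

```r
# Hypothetical pipe-delimited content (not the real lab file).
raw <- "id|name|score\n1|Ann|90\n2|Bob|85"

# text= lets read.table() parse a string instead of a file on disk.
demo <- read.table(text = raw, sep = "|", header = TRUE, stringsAsFactors = FALSE)

names(demo)  # column names taken from the header line
dim(demo)    # rows, then columns
```

If sep were left at its default (whitespace), each line would be read as a single column.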

## Now that the data has been whipped into shape, let's do kmeans clustering by training a kmeans model on the data.
## For this we need K (how many clusters) and the features. If you run str(teens),
## you will see that a good number of our numeric features are contained in columns 5 to 40.

features <- teens[5:40]

## We are done with the gender variable's missing values. Can we use the same approach for age? The answer is "NO".
## So what do we do with the 5K missing age values? We will use a technique called "imputation", which means filling in the age
## using mathematical tricks such as the mean, median, mode, regression and so on.
## Let's start with the overall mean age. Since there are NAs, we need to remove them with the parameter na.rm = TRUE.

mean(teens$age, na.rm = TRUE)

How do you make a matrix?

mdat <- matrix(c(1,2,3, 11,12,13), nrow = 2, ncol=3, byrow=TRUE, dimnames = list(c("row1", "row2"), c("C1", "C2", "C3")))

How would you format vectors of strings and numbers?

movies <- c("The Undefeated", "Snakes on a Plane", "Encino Man", "Casablanca") attendance <- c(15, 350,175, 400)

query the data and put it in a variable.

mytable=dbGetQuery(conn, "select * from test")

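## Only if you are very interested: create a function that tries 1 to x clusters on a dataframe df.

optimalk <- function(df, x) {
  y <- vector()
  for (i in 1:x) {
    set.seed(42)                        # seed the RNG so every run is reproducible
    k_clusters <- kmeans(df, i)
    sumwss <- sum(k_clusters$withinss)  # total within-cluster sum of squares for this K
    y <- rbind(y, c(i, sumwss))
  }
  dft <- as.data.frame(y)
  colnames(dft) <- c("cluster", "WithinSS")
  print(dft)                            # print() returns dft invisibly, so the result can be assigned
}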

## Plot the number of clusters against the within sum of squares to find the optimal K.

plot(x$cluster,x$WithinSS,type="b")

How do you show what type of element is in a variable?

typeof(nlab1)

writing the rules to a CSV file

write(groceryrules, file = "groceryrules.csv", sep = ",", quote = TRUE, row.names = FALSE)

write the cluster centers to a file in order to visualize them in Tableau

write.csv(teen_clusters$centers, "center.txt")

## Write the rules to a file for further digging. row.names is set to FALSE so that we do not get a leading column of row names.
## quote is set to FALSE so that the file comes out neat. sep is set to pipe. The file location is d:/grule.txt

write.table(groceryrules_df,file ="d:/grule.txt",sep="|",quote = FALSE,row.names=FALSE)

## Call the function and assign the output dataframe to a variable. We will see how WithinSS decays with K and whether we can find an elbow.

x=optimalk(features_norm,15)
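
The elbow idea can be tried on data that ships with R. The sketch below is not the lab's code: it substitutes the built-in iris measurements for features_norm, and the names `demo`, `ks`, `wss` and `elbow` are made up for illustration.

```r
# Stand-in data: the four numeric iris columns, z-scored.
demo <- scale(iris[, 1:4])

set.seed(42)
ks <- 1:6
# tot.withinss is the total within-cluster sum of squares that kmeans() reports.
wss <- sapply(ks, function(k) kmeans(demo, centers = k, nstart = 10)$tot.withinss)

elbow <- data.frame(cluster = ks, WithinSS = wss)
print(elbow)

# The "elbow" is the K after which the curve flattens out.
plot(elbow$cluster, elbow$WithinSS, type = "b")
```

With K = 1 the within sum of squares equals the total sum of squares of the data, so the curve always starts at its highest point.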

## For the unknown dummy variable, we create a variable no_gender in dataframe teens; it can have the values 1 or 0.
## We use the ifelse function to populate this variable. In the ifelse function, if the gender variable is NA (is.na(teens$gender)),
## then the value of the no_gender dummy variable is 1, else 0.

teens$no_gender <- ifelse(is.na(teens$gender), 1, 0)

## Doesn't age depend on grad year? So I should calculate the average age by grad year and use that
## for the corresponding grad year subset of the data.
## In plain English, this is what the next statement is doing: aggregate age by gradyear and then apply
## mean to each gradyear. Also, remove the NAs (na.rm = TRUE) because otherwise the mean will be NA. But use ave instead of aggregate.
## ave computes the same group means as aggregate but returns a vector of length equal to the number of rows in the teens dataframe (30000).
## So the next statement creates a vector with the average for each gradyear, repeated for each person.

ave_age <- ave(teens$age, teens$gradyear, FUN = function(x) mean(x, na.rm = TRUE))
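
A small sketch of the difference between aggregate and ave, using a made-up five-row data frame `toy` in place of teens. The column `age_imputed` is also just an illustrative name.

```r
# Hypothetical stand-in for teens: two grad years, some missing ages.
toy <- data.frame(gradyear = c(2006, 2006, 2007, 2007, 2007),
                  age      = c(18, NA, 17, 16, NA))

# aggregate() returns one row per gradyear (the formula interface drops NA rows by default).
by_year <- aggregate(age ~ gradyear, data = toy, FUN = mean)
print(by_year)

# ave() returns one value per ROW: the group mean repeated for every member of the group.
ave_age <- ave(toy$age, toy$gradyear, FUN = function(x) mean(x, na.rm = TRUE))
print(ave_age)

# Because ave_age lines up row by row with toy$age, it can fill the gaps directly.
toy$age_imputed <- ifelse(is.na(toy$age), ave_age, toy$age)
print(toy)
```

The row-aligned output is the reason ave is the convenient choice for imputation here.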

finding subsets of rules containing any berry items

berryrules <- subset(groceryrules, items %in% "berries")
inspect(berryrules)

See the type of variable

class(mytable)
str(mytable)

change the names of columns

colnames(gr)=c("tid","items")
install.packages("arules")
library(arules)

establish the connection

conn<-dbConnect(dbDriver("PostgreSQL"),dbname="infs494",host="147.126.64.66", port=8432, user="dataminer1", password="fall2015")

Copy the data into this table.

copy grocery from 'D:\data\LAB02\groceries.csv'

How do you show correlation?

cor(nlab1)

The command to put the file in a table is as follows. ## Create the empty staging table first.

create table grocery(line text);

## Remember, kmeans is very sensitive to outliers and scale, so we need to rescale our features.
## As a way to normalize the data, we will calculate the z-score for each variable.
## The lapply function will apply scale (which computes the z-score) to each list element.
## Remember, I have been saying for a while that a data.frame is a list. So think of our features dataframe
## as a list which looks like the following: features = list(basketball=c(....), soccer=c(.....))
## lapply will normalize all values for basketball, then soccer, and so on.

features_norm <- as.data.frame(lapply(features, scale))
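
To see that scale really is the z-score, here is a minimal sketch on a made-up two-column data frame `toy` (the values are invented; only the column names echo the example in the comment above).

```r
# Hypothetical features on very different scales.
toy <- data.frame(basketball = c(0, 1, 2, 5),
                  soccer     = c(10, 20, 30, 40))

# Same pattern as the lab: scale every column, then rebuild a data frame.
toy_norm <- as.data.frame(lapply(toy, scale))

# z-score by hand for one column: (value - mean) / standard deviation.
manual <- (toy$basketball - mean(toy$basketball)) / sd(toy$basketball)

all.equal(as.numeric(toy_norm$basketball), manual)

# After scaling, every column has mean 0 and standard deviation 1.
colMeans(toy_norm)
sapply(toy_norm, sd)
```

Once every column has the same spread, no single feature dominates the distance calculation in kmeans.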

for each TID split the dataset and then convert it into class "transactions"

groceries=as(split(gr$items,gr$tid),"transactions")
summary(groceries)
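
The split step can be inspected on its own in base R before arules gets involved. The five-row `gr` below is a made-up stand-in for the real vertical table, and `baskets` is an illustrative name.

```r
# Hypothetical vertical data: one row per (tid, item).
gr <- data.frame(tid   = c(1, 1, 2, 3, 3),
                 items = c("milk", "bread", "eggs", "bread", "jam"),
                 stringsAsFactors = FALSE)

# split() groups the items by tid and returns a named list, one element per transaction.
baskets <- split(gr$items, gr$tid)
str(baskets)
```

This list of item vectors is the shape that gets coerced to class "transactions" in the statement above.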

set better support and confidence levels to learn more rules

groceryrules <- apriori(groceries, parameter = list(support = 0.006, confidence = 0.25, minlen = 2))
groceryrules

converting the rule set to a data frame

groceryrules_df <- as(groceryrules, "data.frame")
str(groceryrules_df)

a visualization of the sparse matrix for the first five transactions

image(groceries[1:5])

visualization of a random sample of 100 transactions

image(sample(groceries, 100))

look at the first five transactions

inspect(groceries[1:5])

sorting grocery rules by lift

inspect(sort(groceryrules, by = "lift")[1:5])

How do you connect to a PostgreSQL database? Install the package first.

install.packages("RPostgreSQL")

How to install packages?

install.packages("rattle")

examine the frequency of items

itemFrequency(groceries[, 1:3])

plot the frequency of items

itemFrequencyPlot(groceries, support = 0.1)
itemFrequencyPlot(groceries, topN = 20)
itemFrequencyPlot(groceries, support = 0.1,topN=20)

just to check the length of this vector do following and it will return 30000

length(ave_age)

Call the library

library(RPostgreSQL)

Load the library to make its functions available

library(rattle)

Call the function

rattle()

What if you have 100 objects in workspace, how would you delete all the objects?

rm(list=ls())

## Since kmeans clustering begins from random starting points, let's set some number as the seed.
## This way, I can reproduce the same clusters again and again. Otherwise, each time I run the model,
## it may produce a different result.

set.seed(2345)
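
A quick sketch of what the seed buys you, using the built-in iris data as a stand-in for the teens features (the names `demo`, `a` and `b` are illustrative only).

```r
# Stand-in data: the four numeric iris columns, z-scored.
demo <- scale(iris[, 1:4])

# Same seed before each run, so kmeans picks the same random starting centers.
set.seed(2345)
a <- kmeans(demo, 3)

set.seed(2345)
b <- kmeans(demo, 3)

identical(a$cluster, b$cluster)  # the two runs assign every row to the same cluster
```

The seed has to be set immediately before each kmeans call; setting it once does not make a second, later call repeat the first.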

set your working directory.

setwd("yourdirectory/LAB02")

Defining a function

std <- function(x) sd(x)  # defining a function
v <- c(1:100)             # create a test vector
std(v)

tellme <- function(x) {
  p1 <- paste("Type of", x, " is", typeof(x), sep=" ")
  print(p1)
  p2 <- paste("Class of", x, "is", class(x), sep=" ")
  print(p2)
  p3 <- paste("String rep of ", x, " is", str(x), sep=" ")
  print(p3)
  p4 <- paste("Names for ", x, "is", names(x), sep=" ")
  print(p4)
  invisible()
}
tellme(t)

How do you look in more detail?

summary(lab1)

How do you make a table?

t <- table(ratings, reviewers)

But I do not see NA values in there. What is the problem? Supply the parameter useNA and set it to "ifany" or "always".

table(teens$gender, useNA = "ifany")
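
A tiny sketch of the difference, on a made-up vector `gender` rather than the real teens column.

```r
# Hypothetical gender values with two missing entries.
gender <- c("F", "M", NA, "F", NA)

# By default table() silently drops the NAs.
counts_default <- table(gender)
print(counts_default)

# useNA = "ifany" adds an NA column whenever missing values are present.
counts_with_na <- table(gender, useNA = "ifany")
print(counts_with_na)
```

With "always" the NA column is shown even when its count is zero.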

check our recoding work

table(teens$gender, useNA = "ifany")
table(teens$female, useNA = "ifany")
table(teens$no_gender, useNA = "ifany")

## Run the model/algorithm with K=5 and put the output in a variable called teen_clusters

teen_clusters <- kmeans(features_norm, 5)

# look at the cluster centers

teen_clusters$centers

look at the size of the clusters

teen_clusters$size

## After this, the step is similar to the previous approach: we use an ifelse statement to choose ave_age where teens$age is NA,
## otherwise the teen's own age, and put this imputed value in teens$age.

teens$age <- ifelse(is.na(teens$age), ave_age, teens$age)

## So what do we do about the outliers in the age variable? Eliminate the age outliers and substitute them with NA.
## We will deal with the NAs later in one go. For this, we will use the ifelse function.
## It means: if age is greater than or equal to 13 and age is less than 20, then the value is the age from the data,
## otherwise age is NA. "&" stands for AND.

teens$age <- ifelse(teens$age >= 13 & teens$age < 20, teens$age, NA)
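
The recoding can be checked on a handful of made-up ages (the vector `age` and the name `clean` are illustrative, not part of the lab data).

```r
# Hypothetical ages, including values just inside and just outside the range.
age <- c(12.5, 13, 17.2, 19.99, 20, 106, NA)

# Keep ages in [13, 20); everything else becomes NA.
clean <- ifelse(age >= 13 & age < 20, age, NA)
print(clean)

# An age that was already NA stays NA, because the condition itself evaluates to NA.
is.na(clean)
```

Note the boundaries: 13 is kept, 20 is not.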

# apply the cluster IDs to the original data frame

teens$cluster <- teen_clusters$cluster

## For categorical variables, we treat the missing values as a separate value called "unknown".
## In other words, we will have three genders (male, female, unknown).
## We create two dummy variables for the three distinct values of gender: unknown and female,
## each of which has the value 1 (means yes) or 0 (means no).
## If both female and unknown are 0 (means no), that implies the person is male, so there is no point in creating a dummy
## variable for male.
## In plain English, we create a variable female in dataframe teens and assign the output of the ifelse function to it.
## In the ifelse function, if the gender variable is equal to "F" (shown by '==') and (shown by &) gender is not NA (!is.na(teens$gender)),
## then the value of the female dummy variable is 1, else 0.

teens$female <- ifelse(teens$gender == "F" & !is.na(teens$gender), 1, 0)
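
A minimal sketch of both dummy variables on a made-up four-element vector `gender`, to show how each gender value ends up encoded.

```r
# Hypothetical gender values: female, male, missing, female.
gender <- c("F", "M", NA, "F")

# 1 only when gender is known AND equal to "F"; the NA row becomes 0 rather than NA.
female <- ifelse(gender == "F" & !is.na(gender), 1, 0)

# 1 only when gender is missing.
no_gender <- ifelse(is.na(gender), 1, 0)

# Male is the row where both dummies are 0.
print(data.frame(gender, female, no_gender))
```

Without the !is.na() guard, the comparison gender == "F" would be NA for the missing row, and so would the dummy.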

dbGetQuery does not convert strings to factors, so let's do it explicitly with the as.factor command

teens$gender=as.factor(teens$gender)

# look at the first five records. Remember RC (Rows Columns), So give me 1 to 5 rows and cluster, gender, age, friends as columns

teens[1:5, c("cluster", "gender", "age", "friends")]

