INFS 348 Exam 1 Script
# mean age by cluster - aggregate age by cluster and get the mean for each cluster.
aggregate(data = teens, age ~ cluster, mean)
# proportion of females by cluster - aggregate female column by cluster and then calculate the mean for each cluster.
aggregate(data = teens, female ~ cluster, mean)
# mean number of friends by cluster - aggregate friends by cluster and calculate the mean for each cluster.
aggregate(data = teens, friends ~ cluster, mean)
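To see what the three aggregate() calls above return, here is a minimal sketch on a made-up data frame. The name toy and its values are assumptions for illustration only; they are not the real teens data.

```r
# Hypothetical toy data standing in for the teens data frame.
toy <- data.frame(
  cluster = c(1, 1, 2, 2, 2),
  age     = c(15, 17, 16, 18, 17),
  female  = c(1, 0, 1, 1, 0)
)

# Mean age per cluster: one output row per distinct cluster value.
age_by_cluster <- aggregate(age ~ cluster, data = toy, FUN = mean)

# The mean of a 0/1 column is the proportion of 1s,
# so this gives the share of females in each cluster.
female_by_cluster <- aggregate(female ~ cluster, data = toy, FUN = mean)

print(age_by_cluster)
print(female_by_cluster)
```

The formula reads as "summarize the left side, grouped by the right side", which is why the same pattern works for age, female and friends.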
default settings result in zero rules learned
apriori(groceries)
How do you show how many rows and columns there are?
dim(nlab1)
How would you format factors and lists?
f <- factor(ratings, levels)
fl <- list(ratings=ratings, critics=critics, movies=movies, attendance=attendance)
So let's transform it into vertical format, where one row corresponds to one item in a given transaction. Let's put this into a view; put your account name after vertical in the view name, e.g. grocery_vertical_dataminer1. We need a transaction identifier, which is what row_number() over () provides.
CREATE VIEW grocery_vertical AS SELECT tid, regexp_split_to_table(line, ',') FROM (SELECT row_number() over () AS tid,line FROM grocery) as food;
How do you tell the columns apart?
The delimiter is the character that separates the columns. Here the delimiter is '|', called "pipe".
lab1 <- read.table("lab1_01.txt", sep="|", header=TRUE)
Now that the data has been whipped into shape, let's do k-means clustering by training a kmeans model on the data. For this we need K (how many clusters) and the feature columns. If you run str(teens), you will see that most of our numeric features are in columns 5 to 40.
features <- teens[5:40]
## We are done with the gender variable's missing values. Can we use the same approach for age? The answer is "NO".
## So what to do with the roughly 5,000 missing age values? We will use a technique called "imputation", which means
## filling in the age with a mathematical estimate such as the mean, median, mode, a regression and so on.
## So let's try finding the mean age. Since there are NAs, we need to remove them with the parameter na.rm=TRUE.
mean(teens$age, na.rm = TRUE)
How do you make a matrix?
mdat <- matrix(c(1,2,3, 11,12,13), nrow = 2, ncol=3, byrow=TRUE, dimnames = list(c("row1", "row2"), c("C1", "C2", "C3")))
How would you format vectors of strings and numbers?
movies <- c("The Undefeated", "Snakes on a Plane", "Encino Man", "Casablanca")
attendance <- c(15, 350, 175, 400)
query the data and put it in a variable.
mytable=dbGetQuery(conn, "select * from test")
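## Only if you are very interested - create a function that runs kmeans for 1 to x clusters on a data frame df.
optimalk=function(df,x) {
  y=vector()
  for (i in 1:x) {
    set.seed(42)  # was `set.seed=42`, which only created a variable and never seeded the generator
    k_clusters=kmeans(df,i)
    sumwss=sum(k_clusters$withinss)  # total within-cluster sum of squares for i clusters
    y=rbind(y,c(i, sumwss))
  }
  dft=as.data.frame(y)
  colnames(dft)=c("cluster","WithinSS")
  print(dft)  # print() also returns dft, so the caller can assign the result
}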
## Plot the number of clusters against the within sum of squares to find the optimal K.
plot(x$cluster,x$WithinSS,type="b")
How do you show what type of element is in a variable?
typeof(nlab1)
writing the rules to a CSV file
write(groceryrules, file = "groceryrules.csv", sep = ",", quote = TRUE, row.names = FALSE)
write the clusters in a file in order to visualize in tableau
write.csv(teen_clusters$centers, "center.txt")
write the rules to a file for further digging. row.names is set to FALSE so that we do not get the extra first column, quote is set to FALSE so that the file comes out neat, and sep is set to pipe. The file location is d:/grule.txt
write.table(groceryrules_df,file ="d:/grule.txt",sep="|",quote = FALSE,row.names=FALSE)
## call the function and assign the output dataframe to a variable. We will see how withinSS decays with K and if we could find an elbow
x=optimalk(features_norm,15)
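For a quick feel of how the within sum of squares decays with K, here is a minimal sketch of the same idea on R's built-in iris data. Using iris, the name feats and the range 1 to 6 are assumptions chosen so the example runs on its own; the lab itself uses features_norm.

```r
# Assumption: built-in iris measurements stand in for the lab's normalized features.
feats <- as.data.frame(lapply(iris[, 1:4], scale))

# Total within-cluster sum of squares for K = 1..6.
wss <- sapply(1:6, function(k) {
  set.seed(42)                      # same starting point for every K
  kmeans(feats, centers = k)$tot.withinss
})

elbow <- data.frame(cluster = 1:6, WithinSS = wss)
print(elbow)
```

Plotting elbow$cluster against elbow$WithinSS with type="b" gives the elbow chart; the K where the curve stops dropping sharply is a reasonable candidate.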
## For the unknown dummy variable, we create a variable no_gender in the data frame teens, and it can have the values 1 or 0.
## We use the ifelse function to populate this variable: if the gender variable is missing (is.na(teens$gender)),
## then the value of the no_gender dummy variable is 1, else 0.
teens$no_gender <- ifelse(is.na(teens$gender), 1, 0)
## Doesn't age depend on grad year? So I should calculate the average age by grad year and use that
## for the corresponding grad year subset of the data.
## In plain English, this is what the next statement does: aggregate age by gradyear and then apply
## mean to each gradyear. Also, remove the NAs (na.rm=TRUE), because otherwise the mean will be NA.
## But use ave instead of aggregate. ave computes the same group means as aggregate but returns a vector
## whose length equals the number of rows in the teens data frame (30000).
## So the next statement creates a vector with the average for each gradyear, repeated for each person.
ave_age <- ave(teens$age, teens$gradyear, FUN = function(x) mean(x, na.rm = TRUE))
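The difference between aggregate and ave is easier to see on a small case. This is a minimal sketch on made-up data; the data frame toy and its values are assumptions, not the real teens data.

```r
# Hypothetical toy data: two grad years, some ages missing.
toy <- data.frame(
  gradyear = c(2006, 2006, 2007, 2007, 2007),
  age      = c(18, NA, 17, 16, NA)
)

# aggregate: one row per gradyear (rows with NA age are dropped by the formula interface).
group_means <- aggregate(age ~ gradyear, data = toy, FUN = mean)

# ave: one value per ROW, holding that row's group mean.
per_row <- ave(toy$age, toy$gradyear, FUN = function(x) mean(x, na.rm = TRUE))

# Because per_row lines up with the rows, it can fill the gaps directly.
imputed <- ifelse(is.na(toy$age), per_row, toy$age)

print(group_means)
print(per_row)
print(imputed)
```

Here group_means has 2 rows while per_row has 5 elements, which is why ave is the convenient choice when the result has to be written back into the data frame.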
finding subsets of rules containing any berry items
berryrules <- subset(groceryrules, items %in% "berries")
inspect(berryrules)
See the type of variable
class(mytable)
str(mytable)
change the names of columns
colnames(gr)=c("tid","items")
install.packages("arules")
library(arules)
establish the connection
conn<-dbConnect("PostgreSQL",dbname="infs494",host="147.126.64.66", port=8432, user="dataminer1", password="fall2015")
Copy the data in this table.
copy grocery from 'D:\data\LAB02\groceries.csv'
How do you show correlation?
cor(nlab1)
The commands to load the file into a table are as follows. Create the empty staging table first:
create table grocery(line text);
## Remember, kmeans is very sensitive to outliers and scale, so we need to rescale our features.
## As a way to normalize the data, we will calculate the z-score for each variable.
## The lapply function will apply scale (which computes the z-score) to each list element.
## Remember I have been saying for a while that a data.frame is a list. So think of our features data frame
## as a list that looks like this: features = list(basketball=c(....), soccer=c(.....))
## lapply will normalize all values for basketball, then soccer, and so on.
features_norm <- as.data.frame(lapply(features, scale))
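A minimal sketch of what this normalization does, using a made-up two-column data frame. The name toy and its values are assumptions for illustration.

```r
# Hypothetical counts for two interest columns.
toy <- data.frame(basketball = c(0, 2, 4), soccer = c(1, 1, 4))

# scale() subtracts the column mean and divides by the column standard deviation.
toy_norm <- as.data.frame(lapply(toy, scale))

# The same z-score computed by hand for one column.
by_hand <- (toy$basketball - mean(toy$basketball)) / sd(toy$basketball)

print(toy_norm)
print(by_hand)
```

After scaling, every column has mean 0 and standard deviation 1, so no single feature dominates the distance calculation in kmeans.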
for each TID split the dataset and then convert it into class "transactions"
groceries=as(split(gr$items,gr$tid),"transactions")
summary(groceries)
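The split() step is plain base R and can be checked without arules. A minimal sketch, assuming a made-up vertical table with the same two column names (tid, items):

```r
# Hypothetical vertical-format data: one row per item per transaction.
gr_toy <- data.frame(
  tid   = c(1, 1, 2),
  items = c("milk", "bread", "milk"),
  stringsAsFactors = FALSE
)

# split() groups the items by transaction id and returns a named list,
# one character vector (basket) per tid.
baskets <- split(gr_toy$items, gr_toy$tid)

print(baskets)
```

This list of baskets is the shape that as(..., "transactions") from arules converts into a transactions object; that conversion is not run here because it needs the package installed.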
set better support and confidence levels to learn more rules
groceryrules <- apriori(groceries, parameter = list(support = 0.006, confidence = 0.25, minlen = 2))
groceryrules
converting the rule set to a data frame
groceryrules_df <- as(groceryrules, "data.frame")
str(groceryrules_df)
a visualization of the sparse matrix for the first five transactions
image(groceries[1:5])
visualization of a random sample of 100 transactions
image(sample(groceries, 100))
look at the first five transactions
inspect(groceries[1:5])
sorting grocery rules by lift
inspect(sort(groceryrules, by = "lift")[1:5])
How to connect to PostgreSQL database. Install package first
install.packages("RPostgreSQL")
How to install packages?
install.packages("rattle")
examine the frequency of items
itemFrequency(groceries[, 1:3])
plot the frequency of items
itemFrequencyPlot(groceries, support = 0.1)
itemFrequencyPlot(groceries, topN = 20)
itemFrequencyPlot(groceries, support = 0.1, topN = 20)
just to check the length of this vector, do the following; it will return 30000
length(ave_age)
Call the library
library(RPostgreSQL)
Load the library that provides the function
library(rattle)
Call the function
rattle()
What if you have 100 objects in workspace, how would you delete all the objects?
rm(list=ls())
## Since kmeans clustering begins with random seed, lets set some number as seed. ## This way, I can reproduce the same clusters again and again. Otherwise, each time I run the model, ## it will produce a different result.
set.seed(2345)
set your working directory.
setwd("yourdirectory/LAB02")
Defining a function
std <- function(x) sd(x)  # defining a function
v <- c(1:100)             # create a test vector
std(v)
tellme <- function(x) {
  p1 <- paste("Type of", x, " is", typeof(x), sep=" ")
  print(p1)
  p2 <- paste("Class of", x, "is", class(x), sep=" ")
  print(p2)
  # str() prints its result and returns NULL, so capture the printed text instead
  p3 <- paste("String rep of ", x, " is", paste(capture.output(str(x)), collapse=" "), sep=" ")
  print(p3)
  p4 <- paste("Names for ", x, "is", names(x), sep=" ")
  print(p4)
  invisible()
}
tellme(t)
How do you look in more detail?
summary(lab1)
How do you make a table?
t <- table(ratings, reviewers)
But I do not see NA values in there. What is the problem? table() leaves out NA by default; supply the parameter useNA and set it to "ifany" or "always"
table(teens$gender, useNA = "ifany")
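A minimal sketch of the difference, on a made-up gender vector (the vector g is an assumption for illustration):

```r
# Hypothetical gender values with one missing entry.
g <- c("F", "M", NA, "F")

# Default: NA is silently left out of the counts.
default_counts <- table(g)

# useNA = "ifany" adds an <NA> column whenever missing values exist.
full_counts <- table(g, useNA = "ifany")

print(default_counts)
print(full_counts)
```

The default counts add up to 3 while the useNA counts add up to 4, which is a quick way to spot that rows are being dropped.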
check our recoding work
table(teens$gender, useNA = "ifany")
table(teens$female, useNA = "ifany")
table(teens$no_gender, useNA = "ifany")
## Run the model/algorithm with K=5 and put the output in a variable called teen_clusters
teen_clusters <- kmeans(features_norm, 5)
# look at the cluster centers
teen_clusters$centers
look at the size of the clusters
teen_clusters$size
## After this, the step is similar to the previous approach: we use ifelse to take ave_age where teens$age is NA
## and the teen's own age otherwise, and put this imputed value in teens$age.
teens$age <- ifelse(is.na(teens$age), ave_age, teens$age)
So what to do about the outliers in the age variable? Eliminate the age outliers and substitute NA for them. We will deal with the NAs later in one go. For this, we will use the ifelse function. It means: if age is greater than or equal to 13 and age is less than 20, then the value is the age from the data; otherwise age is NA. "&" stands for AND.
teens$age <- ifelse(teens$age >= 13 & teens$age < 20, teens$age, NA)
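A minimal sketch of this filter on a made-up age vector (the values are assumptions chosen to hit each case):

```r
# Hypothetical ages: too young, valid, valid, exactly 20, absurd, already missing.
ages <- c(12.5, 15, 19.9, 20, 106, NA)

# Keep ages in [13, 20); everything else becomes NA.
# An age that is already NA stays NA because the comparison itself yields NA.
cleaned <- ifelse(ages >= 13 & ages < 20, ages, NA)

print(cleaned)
```

Note that 20 itself is treated as an outlier because the upper bound is a strict "less than".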
# apply the cluster IDs to the original data frame
teens$cluster <- teen_clusters$cluster
For categorical variables, we treat the missing values as a separate value called "unknown". In other words, we will have three genders (male, female, unknown). We create dummy variables for two of the three distinct values of gender: female and unknown, each with the value 1 (means yes) or 0 (means no). If both female and unknown are 0, that implies the person is male, so there is no point in creating a dummy variable for male.
In plain English: we create a variable female in the data frame teens and assign the output of the ifelse function to it. In ifelse, if the gender variable is equal to "F" (shown by '==') and (shown by &) gender is not NA (!is.na(teens$gender)), then the value of the female dummy variable is 1, else 0.
teens$female <- ifelse(teens$gender == "F" & !is.na(teens$gender), 1, 0)
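A minimal sketch of why the !is.na() check matters, on a made-up gender vector (gender, naive, female and no_gender here are standalone illustration variables, not columns of teens):

```r
# Hypothetical gender values with one missing entry.
gender <- c("F", "M", NA, "F")

# Without the NA check, the missing entry propagates as NA into the dummy.
naive <- ifelse(gender == "F", 1, 0)

# With the check, NA & FALSE evaluates to FALSE, so the dummy is a clean 0.
female <- ifelse(gender == "F" & !is.na(gender), 1, 0)

# The "unknown" dummy flags exactly the missing entries.
no_gender <- ifelse(is.na(gender), 1, 0)

print(naive)
print(female)
print(no_gender)
```

A row with female = 0 and no_gender = 0 is male, which is why two dummies are enough for three categories.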
dbGetQuery does not convert strings to factors, so let's do it explicitly with the as.factor command
teens$gender=as.factor(teens$gender)
# look at the first five records. Remember RC (Rows Columns), So give me 1 to 5 rows and cluster, gender, age, friends as columns
teens[1:5, c("cluster", "gender", "age", "friends")]