library("RPostgreSQL")
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv,
                 host = "dbhost",
                 user = "postgres",
                 password = "postgres",
                 port = 5432,
                 dbname = "dbname")
# Run the query and pull every row into memory (n = -1 means "all rows")
rs <- dbSendQuery(con, "select * from schema.my_view")
results <- fetch(rs, n = -1)
dbClearResult(rs)
After a slight lag you'll have all the data loaded into memory (hopefully you're working with "small" data 🙂). If not, there are other R libraries that claim to help; I'm in the ~100k row range.
One of the nice things about association rules is that they are designed to work with categorical data, so if you bring down numeric or date attributes, you need to bin them. This is a nice contrast to decision trees in Python, where you have to force-fit all categorical attributes into boolean features (e.g. a feature might be "country_is_USA", "country_is_Canada", etc., with a numeric value of 0 or 1).
You can see here that we need to do very little to transform the data - while the arules paper has binning examples, I don't have any attributes that require that treatment.
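If a numeric column did need binning, base R's cut() is one minimal way to do it - the vector and break points below are invented purely for illustration:

```r
ages <- c(5, 23, 41, 67)

# Bin a numeric vector into labelled intervals; cut() returns a factor,
# which is exactly what the transactions coercion wants
age_bins <- cut(ages,
                breaks = c(0, 18, 40, 65, Inf),
                labels = c("minor", "young", "middle", "senior"))
age_bins   # minor young middle senior
```

arules also ships its own discretize() helper with similar behavior, if you'd rather stay inside that library.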
# Convert every column to a factor so each distinct value becomes an item
for (column in names(results)) {
  results[column] <- as.factor(results[[column]])
}
lapply(results, class)
results_matrix <- as(results, "transactions")
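The factor-conversion loop can be sanity-checked on a toy data frame (the column names here are made up for illustration):

```r
toy <- data.frame(country = c("US", "CA", "US"),
                  status  = c("new", "new", "closed"),
                  stringsAsFactors = FALSE)

# Same loop as above: every column becomes a factor
for (column in names(toy)) {
  toy[column] <- as.factor(toy[[column]])
}

sapply(toy, class)   # both columns now report "factor"
```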
Here is the key - we merely call the arules library with a bunch of arguments. The two most important ones are "minlen" and "maxlen", which control how many conditions you'd like in your rules. If you pick "big" numbers (the default maximum is 10) this will create astronomical numbers of rules, as each additional length increases the rule count by orders of magnitude. Not only will this cause you to run out of RAM, it counter-intuitively fails only after it claims to have generated all the rules, while writing the results to some unnamed file.
library("arules")
rules <- apriori(results_matrix,
parameter = list(support = 0.1, confidence = 0.6, maxlen = 5, minlen=2))
If you want to actually see these rules, I found I was forced to write them out to a file, because a) it's nearly impossible to generate fewer than tens or hundreds of thousands of rules, and b) nearly every operation on them runs out of memory before completing.
One example of this problem is the built-in "save to file" methods, which, oddly, convert the entire structure to a string before writing it to a file. I find it baffling that the R community can get these complex algorithms to work, but then fail to handle "large" file I/O in a graceful way. After messing around with various API methods from both arules and the R core, I ended up writing this to file myself:
# Write the rules out in fixed-size batches to avoid building one giant string
batchSize <- 10000
maxSize <- length(rules)
i <- 1
while (i <= maxSize) {
  print(i)
  # Inclusive end of this batch; the original i+batchSize wrote the
  # boundary rule twice
  j <- min(i + batchSize - 1, maxSize)
  write(rules[i:j],
        file = "D:\\projects\\tree\\apriori-large.csv",
        quote = FALSE,
        sep = ",",
        append = (i > 1),   # overwrite on the first batch, append afterwards
        col.names = FALSE)
  i <- i + batchSize
}
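The same batch-and-append pattern, sketched on a plain character vector so it's easy to verify the index arithmetic (the helper name, path, and batch size are invented for illustration):

```r
# Write a vector to disk in fixed-size batches, appending after the first one
write_in_batches <- function(x, path, batch_size = 3) {
  i <- 1
  n <- length(x)
  while (i <= n) {
    j <- min(i + batch_size - 1, n)              # inclusive end of this batch
    con <- file(path, if (i > 1) "a" else "w")   # "w" truncates, "a" appends
    writeLines(x[i:j], con)
    close(con)
    i <- i + batch_size
  }
}

out <- tempfile(fileext = ".csv")
write_in_batches(sprintf("rule_%d", 1:10), out)
length(readLines(out))   # 10 - each element written exactly once
```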
What I found is that in addition to giving me rules which predict certain behaviors, this also uncovered hidden business rules in my data which I hadn't appreciated. (There are also some obvious ones: attributes that are filled in by a common subsystem are filled in together - the more interesting finds here are the exception cases.)
As one final consideration, it is relatively easy to make a dataframe that masks the original values with booleans, i.e. "this row has a value or doesn't have a value", which in some cases is more interesting than patterns in the actual data (especially when the original data is unique IDs). These also train super-fast. Note here that I'm using several different representations of null, as these all occur in my data.
targets <- c(0, " ", "", NA, "-1")
# TRUE wherever a cell holds one of the "null-ish" representations
mask <- apply(results, 2, "%in%", targets)
rules <- apriori(mask,
parameter = list(support = 0.2,
confidence = 0.7,
maxlen = 5,
minlen=2))
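A minimal sketch of that masking step on a made-up data frame (the column names and values are invented for illustration):

```r
df <- data.frame(id      = c("a1", "", "c3"),
                 country = c("US", "CA", NA),
                 stringsAsFactors = FALSE)

targets <- c(0, " ", "", NA, "-1")

# apply() coerces the data frame to a character matrix, then each cell
# becomes TRUE when its value is one of the "null-ish" markers
mask <- apply(df, 2, "%in%", targets)
mask
```

Note that NA %in% targets is TRUE here because targets itself contains NA, which is why the NA country cell gets masked along with the empty string.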
The nice thing about association rules is that in addition to predicting outcomes, we can use them to explore concepts in the data, although without much depth, and they don't force the data into a hierarchy. One problem remains, though, which is filtering the rules in an appropriate way - since these rules resemble Prolog facts, it might be possible to use Prolog to build concept trees from the data. Unfortunately I've yet to get SWI-Prolog to load my ruleset without running out of RAM - there is also a probabilistic Prolog, which looks promising.
It's worth noting that while you can filter these rules based on confidence, that alone isn't actually helpful: your application may force two attributes to be filled out at the same time, which would give them a high correlation, but not a useful one. Association rules will also give you strange variations on this: if attributes A1 and A2 are filled in at the same time, you get a lot of rules like {A1} => {A2} and {A2} => {A1} (expected), but then also {A1, B} => {A2}, {A2, B} => {A1}, and so on.
The arules library has a fair number of options, which I will explore in future posts. I also intend to approach this from a different angle, to find outlier data (sort of the opposite of what we're doing here).