I'm using R for data analytics. I connected it to Elasticsearch and retrieved a dataset of Shakespeare's Complete Works.
library("elastic")
connect()
maxi <- count(index = 'shakespeare')
s <- Search(index = 'shakespeare',size=maxi)
dat <- s$hits$hits[[1]]$`_source`$text_entry
for (i in 2:maxi) {
dat <- c(dat , s$hits$hits[[i]]$`_source`$text_entry)
}
rm(s)
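(As an aside, I assume the same vector could be built in one pass with sapply instead of growing dat in the loop above; a sketch is below, assuming every hit actually carries a text_entry field.)

# one-pass alternative to the loop above (should give the same character vector)
dat <- sapply(s$hits$hits, function(hit) hit$`_source`$text_entry)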
After that, I want to build a tf-idf matrix, but apparently I can't, since it uses too much memory (I have 4 GB of RAM). Here is my code:
library("tm")
myCorpus <- Corpus(VectorSource(dat))
myCorpus <- tm_map(myCorpus, content_transformer(tolower),lazy = TRUE)
myCorpus <- tm_map(myCorpus, content_transformer(removeNumbers),lazy = TRUE)
myCorpus <- tm_map(myCorpus, content_transformer(removePunctuation),lazy = TRUE)
myCorpus <- tm_map(myCorpus, content_transformer(removeWords), stopwords("en"),lazy = TRUE)
myTdm <- TermDocumentMatrix(myCorpus,control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE)))
myCorpus is around 400 MB.
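For scale, my understanding is that the dense matrix would need one 8-byte double per cell of the terms-by-documents grid, so its size can be estimated straight from the TDM (rough check only; the as.numeric cast is there to keep the product from overflowing R's integers):

dim(myTdm)                                       # terms x documents
as.numeric(myTdm$nrow) * myTdm$ncol * 8 / 2^30   # approximate dense size in GiB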
But then I do:
> m <- as.matrix(myTdm)
Error in vector(typeof(x$v), nr * nc) : vector size cannot be NA
In addition: Warning message:
In nr * nc : NAs produced by integer overflow
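If I read the warning correctly, as.matrix tries to allocate a vector of nr * nc cells and the product overflows R's 32-bit integers, hence the NA. The workaround I'm considering is to skip the dense conversion and move the triplet representation into a sparse Matrix object instead (a sketch, assuming the usual slam fields i, j and v of the TDM):

library(Matrix)
# keep the data sparse instead of allocating the full terms x documents matrix
m_sparse <- sparseMatrix(i = myTdm$i,
                         j = myTdm$j,
                         x = myTdm$v,
                         dims = c(myTdm$nrow, myTdm$ncol),
                         dimnames = myTdm$dimnames)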