I'm using scikit-learn to cluster text documents, specifically with the CountVectorizer, TfidfTransformer and MiniBatchKMeans classes. New text documents are added to the system all the time, which means I need to keep using the already-fitted objects to transform each new document and predict a cluster for it.
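Roughly, my current setup looks like the sketch below (the sample documents and the number of clusters are just placeholders):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.cluster import MiniBatchKMeans

# Fit everything on the initial corpus
documents = ["first text document", "second text document", "another document entirely"]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(documents)

transformer = TfidfTransformer()
tfidf = transformer.fit_transform(counts)

kmeans = MiniBatchKMeans(n_clusters=2, random_state=0)
kmeans.fit(tfidf)

# When a new document arrives, reuse the fitted objects
new_doc = ["yet another new document"]
new_tfidf = transformer.transform(vectorizer.transform(new_doc))
cluster = kmeans.predict(new_tfidf)
```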
How should I store this on disk? Should I simply pickle the vectorizer, transformer and kmeans objects? Or should I just save the underlying data? If so, how would I load it back into the vectorizer, transformer and kmeans objects?
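For reference, this is what I mean by pickling the objects (the file name is just a placeholder):

```python
import pickle

# Persist the fitted objects after the initial fit
with open("text_clustering.pkl", "wb") as f:
    pickle.dump({"vectorizer": vectorizer,
                 "transformer": transformer,
                 "kmeans": kmeans}, f)

# Later, in another process, load them back and handle a new document
with open("text_clustering.pkl", "rb") as f:
    objs = pickle.load(f)

new_tfidf = objs["transformer"].transform(
    objs["vectorizer"].transform(["a brand new document"]))
cluster = objs["kmeans"].predict(new_tfidf)
```

Is this the recommended approach, or is there a better way to persist and reload this kind of pipeline?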