I'm working with structured data (one value per field, the same fields for every row) that I need to store in a NoSQL environment on Hadoop, with Spark as the analysis tool. However, I'm not sure which file format to use; I was thinking about JSON or CSV. What would you recommend, and why? I don't have enough experience in this field to decide properly.
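One property that seems relevant to the choice: since every row has the same fields, JSON repeats the field names in every record, while CSV stores them once in the header. A quick stdlib sketch (with made-up rows, just for illustration) showing the size difference:

```python
import csv
import io
import json

# Hypothetical rows with a fixed schema (same fields in every record).
rows = [{"id": i, "name": f"user{i}", "score": i * 10} for i in range(1000)]

# JSON Lines: field names are repeated in every record.
json_bytes = "\n".join(json.dumps(r) for r in rows).encode()

# CSV: field names appear once, in the header row.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "name", "score"])
writer.writeheader()
writer.writerows(rows)
csv_bytes = buf.getvalue().encode()

print(len(json_bytes) > len(csv_bytes))  # True: CSV is more compact here
```

On the other hand, JSON carries its schema with the data, which can make ingestion more self-describing; I don't know how much this matters in practice on HDFS.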
Second question: I have to analyze these data (stored in HDFS). As far as I know, I have two options for querying them (before the analysis):
- Direct reading and filtering. This can be done with Spark, for example:

      data = sqlCtxt.read.json(path_data)

- Using HBase/Hive to run a proper query first and then process the result.
So I don't know what the standard way of doing all this is and, above all, which will be the fastest.