I am working on structured data (one value per field, the same fields for each row) that I have to put in a NoSql environment with Spark (as analysing tool) and Hadoop. Though, I am wondering what format to use. i was thinking about json or csv but I'm not sure. What do you think and why? I don't have enough experience in this field to properly decide.
2nd question : I have to analyse these data (stored in an HDFS). So, as far as I know I have two possibilities to query them (before the analysis):
-
direct reading and filtering. I mean that it can be done with Spark, for example:
data = sqlCtxt.read.json(path_data)
-
Use Hbase/Hive to properly make a query and then process the data.
So, I don't know what is the standard way of doing all this and above all, what will be the fastest.