Which operating system is preferable for a data node?

Question

I am using the Cloudera CDH4 VM in pseudo-distributed mode. In an actual HDFS cluster, do we need to install Hadoop on each datanode? And can we see how the data is split on a datanode's drive by logging in to that datanode?

Neha · Answer 1 · Sep 4, 2018

In a real installation (one active namenode, many datanodes), Hadoop must be installed on every node. Cloudera, like most other vendors, provides software to help with the distributed installation.

You can view file metadata (and browse HDFS in general) through WebHDFS: set the property dfs.webhdfs.enabled to true in hdfs-site.xml, restart HDFS, point your browser at localhost:50070, and browse to the file of interest.
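Once WebHDFS is enabled, the same metadata can also be fetched over plain HTTP, since WebHDFS exposes a REST interface under /webhdfs/v1/<path>?op=<OPERATION>. Below is a minimal sketch using only the JDK. The host and port are the localhost:50070 from above; the class name, the helper names, and the path /user/example/data.txt are made up for illustration, and it assumes an unsecured cluster that needs no authentication.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.charset.StandardCharsets;

public class WebHdfsStatus {

    // Build a WebHDFS REST URL such as
    // http://localhost:50070/webhdfs/v1/user/example/data.txt?op=GETFILESTATUS
    public static String buildUrl(String host, int port, String hdfsPath, String op) {
        try {
            // The multi-argument URI constructor percent-encodes illegal characters in the path.
            return new URI("http", null, host, port,
                    "/webhdfs/v1" + hdfsPath, "op=" + op, null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot build WebHDFS URL for " + hdfsPath, e);
        }
    }

    // Send a GET request and return the response body (JSON for metadata operations).
    public static String fetch(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            StringBuilder body = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
            return body.toString();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical default path; pass a real HDFS path as the first argument.
        String path = args.length > 0 ? args[0] : "/user/example/data.txt";
        String url = buildUrl("localhost", 50070, path, "GETFILESTATUS");
        System.out.println(url);
        System.out.println(fetch(url));
    }
}
```

GETFILESTATUS returns the status of a single file or directory; using LISTSTATUS as the operation instead lists the contents of a directory.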

File metadata can also be retrieved programmatically in Java. For file splits, the Hadoop FileInputFormat API provides getSplits(), which returns the input splits of the file of interest; each split reports its locations. A more straightforward solution may be the FileSystem API, specifically FileSystem.listFiles(), which returns file statuses that include block location information. The latter may only be included in later Hadoop 2.x versions, though; I'm not sure.
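A rough sketch of the FileSystem.listFiles() route might look like the following. It assumes a Hadoop version that ships listFiles(), the Hadoop client jars on the classpath, and a core-site.xml on the classpath that points at your cluster (otherwise the local file system is used). The class name, the describeBlocks helper, and the default path /user/example are made up for illustration.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class BlockLocationLister {

    // Collect one line per block: file path, offset, length, and the hosts holding a replica.
    public static List<String> describeBlocks(FileSystem fs, Path root) throws IOException {
        List<String> lines = new ArrayList<>();
        // The second argument 'true' makes the listing recursive.
        RemoteIterator<LocatedFileStatus> files = fs.listFiles(root, true);
        while (files.hasNext()) {
            LocatedFileStatus file = files.next();
            for (BlockLocation block : file.getBlockLocations()) {
                lines.add(file.getPath()
                        + " offset=" + block.getOffset()
                        + " length=" + block.getLength()
                        + " hosts=" + Arrays.toString(block.getHosts()));
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Reads fs.defaultFS from the configuration files found on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Hypothetical default directory; pass a real HDFS path as the first argument.
        Path root = new Path(args.length > 0 ? args[0] : "/user/example");
        for (String line : describeBlocks(fs, root)) {
            System.out.println(line);
        }
    }
}
```

Each printed line corresponds to one block of a file, so a file larger than the configured block size shows up as several lines with increasing offsets.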