I have a file in HDFS that is too huge to fit in memory. Is there a way to read it line by line, like any normal file, without pulling the whole thing into memory?
I tried the obvious thing first:
for line in open("myfile", "r"):
    # do some processing
I am looking for an easy way to do this properly without external libraries. I could probably make it work with libpyhdfs or python-hdfs, but I'd like, if possible, to avoid introducing new dependencies and untested libraries into the system, especially since both of these seem lightly maintained and state that they shouldn't be used in production.
Would using the standard Hadoop command-line tools via Python's subprocess module make a difference?
Is there a way to apply Python functions as the right-hand side of the pipe using the subprocess module? Or, even better, to open the stream like a file, as a generator, so I could process each line easily?
import subprocess

cat = subprocess.Popen(["hadoop", "fs", "-cat", "/path/to/myfile"], stdout=subprocess.PIPE)
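To make the question concrete, this is roughly the wrapper I'm picturing around that Popen call. It's an untested sketch, hdfs_lines is just a name I made up, and I haven't verified the error handling:

import subprocess

def hdfs_lines(path):
    # Sketch: stream an HDFS file through `hadoop fs -cat` and yield
    # one line at a time, so the whole file is never held in memory.
    cat = subprocess.Popen(["hadoop", "fs", "-cat", path],
                           stdout=subprocess.PIPE)
    try:
        # The pipe's stdout is a file-like object, so it can be
        # iterated line by line like a regular file.
        for line in cat.stdout:
            yield line
    finally:
        cat.stdout.close()
        # Surface failures (bad path, permissions) instead of
        # silently yielding a truncated stream.
        if cat.wait() != 0:
            raise IOError("hadoop fs -cat failed for %s" % path)

for line in hdfs_lines("/path/to/myfile"):
    pass  # do some processing

Is something along those lines reasonable, or is there a gotcha with iterating over Popen's stdout directly?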
If there is another way to achieve what I described above without an external library, I'm open to that too.
Help me out on this one.