I have a file in HDFS that is too huge to fit in memory. Is there a way to read it line by line, like any normal file, without pulling the whole thing into memory?
I tried the obvious thing first:
for line in open("myfile", "r"):
    # do some processing
I am looking for an easy way to do this properly without external libraries. I could probably make it work with libpyhdfs or python-hdfs, but I'd like, if possible, to avoid introducing new dependencies and untested libraries into the system, especially since both of these seem lightly maintained and state that they shouldn't be used in production.
Would using the standard Hadoop command-line tools via Python's subprocess module make a difference?
Is there a way to apply Python functions as the right-hand side of the pipe using the subprocess module? Or, even better, to open the stream like a file, as a generator, so I could process each line easily?
import subprocess

cat = subprocess.Popen(["hadoop", "fs", "-cat", "/path/to/myfile"], stdout=subprocess.PIPE)
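To make the question concrete, this is roughly the wrapper I'm picturing around that Popen call. It's an untested sketch, hdfs_lines is just a name I made up, and I haven't verified the error handling:

import subprocess

def hdfs_lines(path):
    # Sketch: stream an HDFS file through `hadoop fs -cat` and yield
    # one line at a time, so the whole file is never held in memory.
    cat = subprocess.Popen(["hadoop", "fs", "-cat", path],
                           stdout=subprocess.PIPE)
    try:
        # The pipe's stdout is a file-like object, so it can be
        # iterated line by line like a regular file.
        for line in cat.stdout:
            yield line
    finally:
        cat.stdout.close()
        # Surface failures (bad path, permissions) instead of
        # silently yielding a truncated stream.
        if cat.wait() != 0:
            raise IOError("hadoop fs -cat failed for %s" % path)

for line in hdfs_lines("/path/to/myfile"):
    pass  # do some processing

Is something along those lines reasonable, or is there a gotcha with iterating over Popen's stdout directly?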
If there is another way to achieve what I described above without an external library, I'm open to that too.
Help me out on this one.