Memory management for large datasets in Python

  • Saad Bin Ahmed
    New Member
    • Jul 2011
    • 25

    Memory management for large datasets in Python

    Hi Everyone,

    I am working on a Latin dataset and I am supposed to read all the data from ~30,000 files. What I did was open and read each file, write its contents into one separate file (say Master.nc), and then close the individual file. But Master.nc cannot be closed until the last file has been read.

    At the end I need a single Master.nc file containing the contents of all ~30,000 files.

    My program runs fine for a small dataset (up to about 900 files). Once the dataset grows beyond 900 files, my program gets stuck and does not finish the processing.

    Managing a large file like Master.nc is difficult in this situation. Please guide me on how to handle it; I need a single Master.nc file at the end because I have to use it for training.

    Please help me with this scenario.
    Thanks a lot.
  • dwblas
    Recognized Expert Contributor
    • May 2008
    • 626

    #2
    You should open the output file, then open, read, write, and close each of the input files before moving on to the next one. I don't understand what is meant by
    But Master.nc cannot be closed until the last file has been read.
    since you close the output file only after all of the input files have been processed.
    Code:
    # Open the combined output file once.
    output = open(combined_file, "w")
    
    # Read each input file and append its records to the output.
    for fname in list_of_30000:
        fp = open(fname, "r")
        for rec in fp:
            output.write(rec)
        fp.close()
    
    output.close()
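
    If memory use is a concern, the same loop can be written with "with" blocks and chunked copying, so no whole file is ever held in memory. This is a minimal sketch reusing the placeholder names combined_file and list_of_30000 from above; note that .nc files are binary, so they are opened in binary mode here:
    Code:
    import shutil
    
    # "with" guarantees each file is closed even if an error occurs.
    # copyfileobj copies in fixed-size chunks, so memory use stays
    # constant no matter how large the input files are.
    with open(combined_file, "wb") as output:
        for fname in list_of_30000:
            with open(fname, "rb") as fp:
                shutil.copyfileobj(fp, output, 1024 * 1024)  # 1 MB chunks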


    • Saad Bin Ahmed
      New Member
      • Jul 2011
      • 25

      #3
      Actually I have to read each file and save its contents in a single NetCDF file named Master.nc. That means the contents of all the files will be written to one file, i.e., Master.nc. At the end I will have one file containing the contents of all ~30,000 files.
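
      Note that simply concatenating the raw bytes of NetCDF files does not produce a valid NetCDF file, so merging usually means copying the data along a shared dimension. Below is a minimal sketch using the netCDF4 package, assuming each input stores a variable named "data" along an unlimited "sample" dimension; the variable, dimension, and glob pattern names are placeholder assumptions, not taken from this thread:
      Code:
      import glob
      from netCDF4 import Dataset
      
      input_files = sorted(glob.glob("input_*.nc"))  # placeholder pattern
      
      out = Dataset("Master.nc", "w")
      out.createDimension("sample", None)   # unlimited: grows as we append
      out.createDimension("feature", 10)    # placeholder fixed width
      var = out.createVariable("data", "f4", ("sample", "feature"))
      
      idx = 0
      for fname in input_files:
          src = Dataset(fname, "r")
          block = src.variables["data"][:]          # this file's records
          var[idx:idx + block.shape[0], :] = block  # append along "sample"
          idx += block.shape[0]
          src.close()                               # close each input right away
      
      out.close()  # Master.nc is closed once, at the very end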


      • Saad Bin Ahmed
        New Member
        • Jul 2011
        • 25

        #4
        I currently read, write, and close every input file, but Master.nc remains open until all 30,000 files have been read, written, and closed.


        • dwblas
          Recognized Expert Contributor
          • May 2008
          • 626

          #5
          That is correct. Also, are you sure that you are not running out of disk space? The copy may require twice the disk space of the original 30,000 files. You will have to post your code for any more detailed assistance.
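
          A quick way to check is to compare free space with the total size of the inputs before copying. A minimal sketch (shutil.disk_usage needs Python 3.3+, and input_files is a placeholder list of the input paths):
          Code:
          import os
          import shutil
          
          free = shutil.disk_usage(".").free  # free bytes on the target drive
          needed = sum(os.path.getsize(f) for f in input_files)
          print("free: %d bytes, needed: %d bytes" % (free, needed))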


          • Saad Bin Ahmed
            New Member
            • Jul 2011
            • 25

            #6
            Yes, it seems that I am running out of disk space. Whenever the number of files grows to 900 or more, the program hangs and no further processing happens. It does not show any error message, but it also does not proceed. I have already tried the garbage collector function gc.collect(), but that did not solve the problem either.
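
            One way to narrow down whether the problem is memory or disk is to log progress and peak memory inside the copy loop. A minimal sketch (the resource module is Unix-only, and input_files is a placeholder list of the input paths):
            Code:
            import resource
            import shutil
            
            with open("Master.nc", "wb") as output:
                for i, fname in enumerate(input_files):
                    with open(fname, "rb") as fp:
                        shutil.copyfileobj(fp, output)
                    if i % 100 == 0:
                        # Peak resident memory so far (kilobytes on Linux).
                        peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
                        print("copied %d files, peak memory %d kB" % (i, peak))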

