Fastest way to subtract elements of datasets of HDF5 file?

  • RockRoll
    New Member
    • Jul 2020
    • 10

    Hey Everyone:

    Here is one interesting problem.

    Input: Two arrays (N×4, each sorted by column 2) stored as dataset_1 and dataset_2 in an HDF5 file (input.h5). N is huge (the data originally came from a file of about 10 GB, which is why it is stored in HDF5).

    Output: For each column-2 element of dataset_1, subtract every column-2 element of dataset_2 for which the difference (delta) lies within ±4000, then save the results in a dataset of a new HDF5 file. I need to refer to this new file back and forth, hence HDF5 rather than a text file.

    Concern: I initially used the .append method, but that crashed for the 10 GB input. So I am now using the dset.resize method (and would prefer to stick with it). I am also using binary search, as suggested in one of my earlier posts. The script now seems to work for large (10 GB) datasets, but it is quite slow! The subtraction (for/while) loop is possibly the culprit. Any suggestions on how I can make this fast? I am aiming for the fastest approach (and ideally the simplest, since I am a beginner).

    Code:
    f_r = h5py.File('input.h5', 'r+')
    dset1 = f_r.get('dataset_1')
    dset2 = f_r.get('dataset_2')
    r1,c1 = dset1.shape
    r2,c2 = dset2.shape
    left, right, count = 0,0,0; W = 4000  # Window half-width ;n = 1
    f_w = h5py.File('data.h5', 'w')
    d1 = np.zeros(shape=(0, 4))
    dset = f_w.create_dataset('dataset_1', data=d1, maxshape=(None, None), chunks=True)
    
    for j in range(r1):
        e1 = dset1[j,1]
    
        # move left pointer so that it is within -W of e1
        while left < r2 and dset2[left,1] - e1 <= -W:
            left += 1
        # move right pointer so that it is just outside +W of e1
        while right < r2 and dset2[right,1] - e1 <= W:
            right += 1
    
        for i in range(left, right):
            delta = e1 - dset2[i,1]
            dset.resize(dset.shape[0] + n, axis=0)
            dset[count, 0:4] = [count, dset1[j,1], dset2[i,1], delta]
            count += 1
    
    print("\nFinal shape of dataset created: " + str(dset.shape))
    
    f_w.close()
  • SioSio
    Contributor
    • Dec 2019
    • 272

    #2
    f_r is not closed.
    If the file is not closed, the code may slow down: the memory it uses is not freed, which affects performance.


    • RockRoll
      New Member
      • Jul 2020
      • 10

      #3
      I closed it using "f_r.close()" at the end, and it didn't change anything. Any other suggestions?


      • SioSio
        Contributor
        • Dec 2019
        • 272

        #4
        How about closing f_r immediately after you finish using it?
        f_r is only used in the first three lines.


        • RockRoll
          New Member
          • Jul 2020
          • 10

          #5
          I tried that too. It gives the error "ValueError : Not a dataset (not a dataset)" at line 12, where e1 is read from dset1.

          I can't load dataset_1 and dataset_2 directly into a list/numpy array because they are really large.

          Any other thoughts?


          • SioSio
            Contributor
            • Dec 2019
            • 272

            #6
            Line 9 writes to f_w. Is this all right?
            It needs f_w.flush() before f_w.close().


            • RockRoll
              New Member
              • Jul 2020
              • 10

              #7
              Yeah, line 9 seems to be fine. f_w is a file object. Basically, I am creating a new "data.h5" and saving the results into its dset at line 24.

              I can also change line 9 to: dset = f_r.create_dataset('dataset_3', data=d1, maxshape=(None, None), chunks=True)

              This creates a new dataset_3 in input.h5 instead of creating a new HDF5 file, but the computation time is unaffected.

              My suspicion is that the way line 24 (and the loop around it) saves the data can be improved, but I am not sure, as I am not an expert in programming.


              • SioSio
                Contributor
                • Dec 2019
                • 272

                #8
                Where is the value of n set?
                It looks like "n = 1" is commented out on line 6.
                Code:
                left, right, count = 0,0,0; W = 4000  # Window half-width ;n = 1
                Please tell me how the number of rows in dset changes as your code runs.
                In some cases, you may be able to move line 23 outside the outer for loop.


                • RockRoll
                  New Member
                  • Jul 2020
                  • 10

                  #9
                  So here is what is happening:
                  1. I create a resizable dset on line 9; resizable because I am dealing with large arrays whose size varies with the input file
                  2. I fill in the values of interest at line 24
                  3. At line 23, I expand the current size of dset by n (=1) rows. The added row is filled with the values I create at line 24.

                  Simply put, I generate some numbers (line 22) and store them in dset, growing it by one row each time.

                  Can you please elaborate on what you mean by "line 23 outside the outer for loop"?

                  One quick thing I checked: even when lines 23 and 24 are commented out (meaning I only compute the values in line 22 and do not store them in dset), the computation is still very slow. So moving line 23 out of the loop may not change the execution speed.
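
The observation above suggests the reads, not the writes, dominate: each `dset1[j,1]` and `dset2[i,1]` is a separate small read through h5py. A minimal sketch of one alternative follows, assuming the key column (index 1) of each dataset fits in memory even though the full N×4 arrays do not. The helper names `window_bounds` and `pair_rows` and the `count_start` parameter are hypothetical, not part of h5py or NumPy.

```python
import numpy as np

def window_bounds(keys1, keys2, half_width):
    """For each keys1[j], return the half-open range [lo, hi) into the
    sorted array keys2 with -half_width < keys2[i] - keys1[j] <= half_width
    (the same bounds as the two while loops in the posted code)."""
    keys1 = np.asarray(keys1, dtype=np.float64)
    keys2 = np.asarray(keys2, dtype=np.float64)
    lo = np.searchsorted(keys2, keys1 - half_width, side="right")
    hi = np.searchsorted(keys2, keys1 + half_width, side="right")
    return lo, hi

def pair_rows(keys1, keys2, half_width, count_start=0):
    """Return an (M, 4) array of [count, e1, e2, e1 - e2] rows for every
    pair inside the window, without a Python-level inner loop."""
    keys1 = np.asarray(keys1, dtype=np.float64)
    keys2 = np.asarray(keys2, dtype=np.float64)
    lo, hi = window_bounds(keys1, keys2, half_width)
    counts = hi - lo
    total = int(counts.sum())
    # Repeat each e1 once per matching element of keys2.
    e1 = np.repeat(keys1, counts)
    # Index into keys2: for each j this is lo[j], lo[j] + 1, ..., hi[j] - 1.
    first_slot = np.cumsum(counts) - counts
    offsets = np.arange(total) - np.repeat(first_slot, counts)
    e2 = keys2[np.repeat(lo, counts) + offsets]
    count = np.arange(count_start, count_start + total)
    return np.column_stack([count, e1, e2, e1 - e2])

rows = pair_rows([0, 10], [-5, -4, 0, 4, 5, 14], half_width=4)
print(rows)
```

With h5py, the key columns could be read once with `keys1 = dset1[:, 1]` and `keys2 = dset2[:, 1]`. Only `keys2` has to be sorted for `np.searchsorted`. If the number of matching pairs is too large for memory, `keys1` can be passed in slices, with `count_start` carrying the running row count, and each returned block written out separately.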


                  • SioSio
                    Contributor
                    • Dec 2019
                    • 272

                    #10
                    Let me say it once again:
                    On line 6, n = 1 is disabled.
                    Only the following part is executed:
                    left, right, count = 0, 0, 0; W = 4000
                    Everything from here on is a comment:
                    # Window half-width ;n = 1
                    Therefore, line 23 does not resize as intended, because n is never defined.
                    On the contrary, referencing an undefined variable is an error (Python raises a NameError).
                    If the code shown is only part of the whole, and n = 1 is set somewhere outside the part shown, then line 23 adds at most (right-left+1) rows for each pass of the outer loop.
                    If the dset you write to the file only needs to be one row larger than its initial size, you need to resize it only once, outside the outer for loop.
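
One way to read the suggestion above is to resize less often: collect rows in memory and grow the dataset one block at a time instead of one row at a time. The sketch below is an assumption-laden illustration, not the poster's code. `BufferedAppender` and `block_rows` are hypothetical names, and the class only assumes that `dset` offers `.shape`, `.resize(size, axis=0)` and slice assignment, which a resizable h5py dataset does.

```python
import numpy as np

class BufferedAppender:
    """Collect rows in memory and append them to a resizable dataset
    in blocks, so that resize is called once per block, not once per row."""

    def __init__(self, dset, block_rows=100_000):
        self.dset = dset
        # In-memory staging area with the same number of columns as dset.
        self.buf = np.empty((block_rows, dset.shape[1]), dtype=np.float64)
        self.used = 0

    def add(self, row):
        """Stage one row; write the block out when the buffer is full."""
        self.buf[self.used] = row
        self.used += 1
        if self.used == len(self.buf):
            self.flush()

    def flush(self):
        """Grow the dataset by the number of staged rows and write them."""
        if self.used == 0:
            return
        start = self.dset.shape[0]
        self.dset.resize(start + self.used, axis=0)
        self.dset[start:start + self.used, :] = self.buf[:self.used]
        self.used = 0
```

In the posted loop, lines 23 and 24 would then become a single `appender.add([count, dset1[j,1], dset2[i,1], delta])`, followed by one `appender.flush()` after the outer loop so the last partial block is written. The best value for `block_rows` depends on available memory and would need measuring.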


                    • RockRoll
                      New Member
                      • Jul 2020
                      • 10

                      #11
                      Hey SioSio,

                      Thanks for your assistance. That "n" was just a typo here. However, I have solved the problem: it turns out that PyTables' "append" method was much faster than resizing the HDF5 dataset. I just wanted to mention it here in case anyone stops by in the future!

                      Thanks again for your time!
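
The final code was not posted, so the following is only a guess at what a PyTables version of the write step might look like. It is a minimal sketch assuming an extendable array (`EArray`) of float64 values with four columns; the function name `append_blocks` and the temporary output path are hypothetical, while `tables.open_file`, `create_earray` and `EArray.append` are PyTables calls.

```python
import os
import tempfile

import numpy as np
import tables

def append_blocks(path, blocks, ncols=4):
    """Write an iterable of (k, ncols) blocks to an extendable array
    named 'dataset_1' and return the total number of rows written."""
    with tables.open_file(path, mode="w") as h5:
        # The 0 in shape marks the dimension that grows on append.
        out = h5.create_earray(h5.root, "dataset_1",
                               atom=tables.Float64Atom(),
                               shape=(0, ncols))
        for block in blocks:
            out.append(np.asarray(block, dtype=np.float64).reshape(-1, ncols))
        return int(out.nrows)

# Small demonstration with made-up data in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "data.h5")
n_rows = append_blocks(path, [np.ones((3, 4)), np.zeros((2, 4))])
print(n_rows)
```

Appending whole blocks rather than single rows keeps the number of calls into the HDF5 library small. Whether this is faster than `dset.resize` for a given workload is something to measure rather than assume.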
