Fastest way to subtract elements of datasets of HDF5 file?

**SioSio** · Jul 31 '20, 01:45 AM

f_r is never closed.
If you don't close the file, its memory isn't released, and that can slow the code down.

**RockRoll** · Jul 31 '20, 04:27 AM

I closed it using "f_r.close()" at the end and it didn't change anything. Any other suggestions?

**SioSio** · Jul 31 '20, 04:50 AM

How about closing f_r as soon as you are done with it?
f_r is only used in the first three lines.

**RockRoll** · Jul 31 '20, 05:01 AM

I tried that too. It raises "ValueError: Not a dataset (not a dataset)" at line 12, where e1 reads from dset1.

I can't copy dataset_1 and dataset_2 directly into a list/numpy array because both are really large.

Any other thoughts?
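For readers hitting the same error: a dataset handle is only usable while its file is open, so the work has to happen before f_r is closed. Below is a minimal sketch of block-wise processing inside a `with` block, which closes both files automatically. It assumes a plain element-wise difference of dataset_1 and dataset_2 with equal shapes, which may not match the windowed logic in the original code; `subtract_in_blocks`, `block_rows`, and the demo data are made up for illustration.

```python
import os
import tempfile

import h5py
import numpy as np


def subtract_in_blocks(in_path, out_path, block_rows=1024):
    """Write dataset_1 - dataset_2 to out_path, one block of rows at a time."""
    with h5py.File(in_path, "r") as f_r, h5py.File(out_path, "w") as f_w:
        d1 = f_r["dataset_1"]
        d2 = f_r["dataset_2"]
        out = f_w.create_dataset("dataset_3", shape=d1.shape, dtype=d1.dtype)
        for start in range(0, d1.shape[0], block_rows):
            stop = min(start + block_rows, d1.shape[0])
            # Slicing reads only this block into memory as a NumPy array.
            out[start:stop] = d1[start:stop] - d2[start:stop]
    # Both files are closed here, even if an exception was raised.


# Small demonstration with made-up data in a temporary directory.
tmp_dir = tempfile.mkdtemp()
in_path = os.path.join(tmp_dir, "input.h5")
out_path = os.path.join(tmp_dir, "data.h5")

with h5py.File(in_path, "w") as f:
    f.create_dataset("dataset_1", data=np.arange(20.0).reshape(10, 2))
    f.create_dataset("dataset_2", data=np.ones((10, 2)))

subtract_in_blocks(in_path, out_path, block_rows=4)

with h5py.File(out_path, "r") as f:
    result = f["dataset_3"][:]
```

Only one block from each dataset is held in memory at a time, so the full arrays never need to fit in RAM.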

**SioSio** · Jul 31 '20, 05:09 AM

Line 9 writes to f_w. Is that intended?
Also, try calling f_w.flush() before f_w.close().

**RockRoll** · Jul 31 '20, 05:32 AM

Yeah, line 9 seems fine. f_w is a file object. Basically I am creating a new "data.h5" and saving into its dset at line 24.

I can also change line 9 to: dset = f_r.create_dataset('dataset_3', data=d1, maxshape=(None, None), chunks=True)

This creates a new dataset_3 in input.h5 instead of a new HDF5 file, but the computation time is unchanged.

My suspicion is that the way the loop saves the data at line 24 can be improved, but I'm not sure, as I am not an expert in programming.

**SioSio** · Jul 31 '20, 05:56 AM

Where is the value of n set?
It looks like "n = 1" on line 6 is commented out.

Code:

left, right, count = 0,0,0; W = 4000  # Window half-width ;n = 1
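A quick way to check this is to execute the line in an empty namespace and see which names it defines. The sketch below does that for the line as posted and for one possible correction; the corrected line is only an assumption about what was intended.

```python
# The line as posted: everything after '#' is a comment, including ";n = 1".
posted = "left, right, count = 0,0,0; W = 4000  # Window half-width ;n = 1"
# A possible correction (an assumption): assign n before the comment starts.
fixed = "left, right, count = 0, 0, 0; W = 4000; n = 1  # Window half-width"

posted_names = {}
exec(posted, {}, posted_names)  # assignments land in posted_names
fixed_names = {}
exec(fixed, {}, fixed_names)

print("n" in posted_names)  # False
print("n" in fixed_names)   # True
```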

Please tell me how the number of rows in dset changes while the code runs.
In some cases, line 23 can be moved outside the outer for loop.

**RockRoll** · Jul 31 '20, 04:23 PM

So here is what is happening:
1. I choose a flexible shape for dset on line 9; flexible because I am dealing with large arrays whose size varies with the input file size
2. I fill in some values of interest at line 24
3. At line 23, I am basically expanding the current size of dset by n (=1). The added row is filled in with values I create at line 24.

Simply put, I am generating some numbers (line 22) and filling in dset by appending one row each time.

Can you please elaborate on what you mean by "line 23 outside the outer for loop"?

One quick thing I checked: even when lines 23 and 24 are commented out (meaning I am just creating values in line 22, not storing them in dset), the computation is still very slow. So moving line 23 out of the loop may not change the execution speed.

**SioSio** · Aug 1 '20, 10:54 AM

Let me say it again.
On line 6, n = 1 has no effect.
Only the following part is executed:
left, right, count = 0, 0, 0; W = 4000
Everything from here on is a comment:
# Window half-width ;n = 1
So n is never set there, and line 23 does not resize the dataset as intended.
Worse, relying on an undefined variable can make execution unpredictable.
If the code shown is only part of the whole and n = 1 is set somewhere other than line 6, then line 23 grows the dataset to at most (right-left+1) rows.
If the dataset you write to the file ends up one row larger than its initial size, you only need to resize it once, outside the outer for loop.
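A minimal sketch of that idea with h5py, assuming the final number of rows is known before the loop starts. The sizes `n_rows` and `n_cols`, the chunk shape, and the row values are placeholders, not taken from the original code.

```python
import os
import tempfile

import h5py
import numpy as np

n_rows, n_cols = 1000, 3  # hypothetical sizes for the demo
path = os.path.join(tempfile.mkdtemp(), "data.h5")

with h5py.File(path, "w") as f_w:
    dset = f_w.create_dataset(
        "dataset_3",
        shape=(0, n_cols),
        maxshape=(None, n_cols),  # rows can grow, columns are fixed
        chunks=(256, n_cols),
        dtype="f8",
    )
    # One resize before the loop instead of one per appended row.
    dset.resize(n_rows, axis=0)
    for i in range(n_rows):
        # Stand-in for the values computed on line 22.
        dset[i, :] = np.full(n_cols, float(i))

with h5py.File(path, "r") as f_r:
    final_shape = f_r["dataset_3"].shape
    last_row = f_r["dataset_3"][n_rows - 1]
```

If the final size is not known in advance, growing the dataset in large steps and trimming it with one last resize is a common variation.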

**RockRoll** · Aug 14 '20, 01:26 AM

Hey SioSio,

Thanks for your assistance. That "n" was just a typo in my post. I did solve the problem, though. It turns out that PyTables' "append" method was much faster than resizing the HDF5 dataset. Just wanted to mention it here in case anyone stops by in the future!

Thanks again for your time!
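For anyone who wants to try the PyTables route mentioned above, here is a minimal sketch using an extendable array (EArray). The final code was not posted, so the file name, the node name "dataset_3", the row width `n_cols`, and the row values are all assumptions for illustration.

```python
import os
import tempfile

import numpy as np
import tables

n_cols = 3  # hypothetical row width
path = os.path.join(tempfile.mkdtemp(), "data.h5")

with tables.open_file(path, mode="w") as h5:
    # An EArray can grow along the dimension whose length is 0 in `shape`.
    earr = h5.create_earray(
        h5.root, "dataset_3", atom=tables.Float64Atom(), shape=(0, n_cols)
    )
    for i in range(100):
        # Stand-in for the computed values; append expects a 2-D block of rows.
        row = np.full((1, n_cols), float(i))
        earr.append(row)

with tables.open_file(path, mode="r") as h5:
    stored = h5.root.dataset_3.read()
```

Appending several rows per call (a block of shape (k, n_cols)) should reduce per-call overhead further.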
