Large lists in python

  • fekioh
    New Member
    • Aug 2010
    • 11

    Hello,

    I need to store data in large lists (~1e7 elements), and I often get a MemoryError in code that looks like this:

    Code:
    list1, list2 = [], []
    f = open('data.txt','r')
    for line in f:
        fields = line.split(',')   # split each line only once
        list1.append(fields[1])
        list2.append(fields[2])
        # etc.
    f.close()

    I get the error when reading-in the data, but I don't really need all elements to be stored in RAM all the time. I work with chunks of that data.

    So, more specifically, I have to read-in ~ 10,000,000 entries (strings and numeric) from 15 different columns in a text file, store them in list-like objects, do some element-wise calculations and get summary statistics (means, stdevs etc.) for blocks of say 500,000. Fast access for these blocks would be needed!

    I also need to read everything in at once (so no f.seek() etc. to read the data a block at a time).

    Any advice on how to achieve this? Platform: Windows XP

    Cheers!
    Last edited by bvdet; Aug 14 '10, 03:22 PM. Reason: Add code tags
  • bvdet
    Recognized Expert Specialist
    • Oct 2006
    • 2851

    #2
    My thought would be to read in a range of lines at a time, process those lines and move onto the next range, storing the results in a file as needed.

    This function reads in a range of lines:
    Code:
    def fileLineRange(fn, start, end):
        # Return lines start..end (1-based, inclusive) of file fn, stripped.
        f = open(fn)
        # Skip the lines that come before the requested range.
        for i in xrange(start-1):
            try:
                f.next()
            except StopIteration, e:
                f.close()
                return "Start line %s is beyond end of file." % (start)

        outputList = []
        for line in xrange(start, end+1):
            outputList.append(f.next().strip())
        f.close()
        return outputList
    fileLineRange(fn, 700, 720) would read in lines 700 through 720.
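
Note that each call to fileLineRange skips forward from the top of the file again. When the blocks are consecutive, a single pass can hand out one block after another instead. Here is a minimal sketch of that idea, written for Python 3 (the code in this thread is Python 2). The names `iter_blocks` and `block_stats` and the column index are made up for illustration, and it assumes comma-separated lines, no blank lines, and a numeric value in the chosen column:

```python
import itertools
import statistics


def iter_blocks(path, block_size):
    """Yield successive lists of stripped lines, block_size lines at a time."""
    with open(path) as f:
        while True:
            # islice pulls at most block_size lines from the open file iterator.
            block = [line.strip() for line in itertools.islice(f, block_size)]
            if not block:
                return
            yield block


def block_stats(path, column, block_size):
    """Return a list of (mean, stdev) for one numeric column, one entry per block."""
    results = []
    for block in iter_blocks(path, block_size):
        values = [float(line.split(',')[column]) for line in block]
        mean = statistics.mean(values)
        # Sample stdev needs at least two points; report 0.0 for a one-row block.
        stdev = statistics.stdev(values) if len(values) > 1 else 0.0
        results.append((mean, stdev))
    return results
```

Only one block of lines is held in memory at a time, so the block size (500,000 in the original question) bounds the memory use rather than the file size.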


    • fekioh
      New Member
      • Aug 2010
      • 11

      #3
      Yes, I thought of that. Problem is: (i) I need to be able to calculate statistics for different block sizes without having to read the file over and over again and (ii) I need to know some info from the very last line (files have a time-column, started the same time but are not equally long).

      Is there any way to store the whole thing in some kind of data structure (e.g. to create a class "extending" list or something?) Sorry for the java terminology :)


      • bvdet
        Recognized Expert Specialist
        • Oct 2006
        • 2851

        #4
        You are only truly reading a group of lines at a time, but I understand that it might not be the most efficient way. You should consider storing all the data in a MySQL database for efficient access. MySQLdb is the Python interface.

        An afterthought to the code I posted. In case the end line number is greater than the number of lines, I added a try/except block:
        Code:
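        def fileLineRange(fn, start, end):
            # Return lines start..end (1-based, inclusive) of file fn, stripped.
            f = open(fn)
            # Skip the lines that come before the requested range.
            for i in xrange(start-1):
                try:
                    f.next()
                except StopIteration, e:
                    f.close()
                    return "Start line %s is beyond end of file." % (start)

            outputList = []
            for i in xrange(start, end+1):
                try:
                    outputList.append(f.next().strip())
                except StopIteration, e:
                    # Ran out of lines before reaching end; return what was read.
                    print "The last line in the file is line number %s." % (i-1)
                    break
            f.close()
            return outputList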


        • fekioh
          New Member
          • Aug 2010
          • 11

          #5
          Hmm, sorry I wasn't very clear. What I meant is:

          (i) the files contain roughly month-long measurements, and once I've read a file in I'd like to be able to get e.g. per-day or per-week means, or, for a specific file, to focus on the first hours. That's what I meant by not wanting to read the whole thing over and over again...

          (ii) as for the last line, I guess it's not a big issue. I just need to know the duration of all measurements from the start to do some of the calculations. But I guess I should just read the last line in the beginning and then go back to the start of the file.
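
Reading the last line first does not require scanning the file: it can be found by seeking to the end and reading backwards in small chunks. A rough sketch, written for Python 3; `read_last_line` and `chunk_size` are names chosen here for illustration, and the default decoding (UTF-8) is an assumption about the data files:

```python
import os


def read_last_line(path, chunk_size=4096):
    """Return the last non-empty line of a text file without reading it all."""
    with open(path, 'rb') as f:
        f.seek(0, os.SEEK_END)
        position = f.tell()
        buffer = b''
        while position > 0:
            # Step backwards by one chunk and prepend it to what we have so far.
            step = min(chunk_size, position)
            position -= step
            f.seek(position)
            buffer = f.read(step) + buffer
            # Ignore trailing newlines, then look for the newline that
            # precedes the last line.
            stripped = buffer.rstrip(b'\r\n')
            newline_at = stripped.rfind(b'\n')
            if newline_at != -1:
                return stripped[newline_at + 1:].decode()
        # The file has a single line (or is empty).
        return buffer.rstrip(b'\r\n').decode()
```

The file is opened separately inside the function, so the main read loop can still start from the first line afterwards.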


          • fekioh
            New Member
            • Aug 2010
            • 11

            #6
            Also, I'm not very familiar with MySQL. Is there no alternative "large list implementation" that, say, stores the data on disk and loads one chunk ("page") into RAM at a time?
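
One way to get something like this with only the standard library is to convert each numeric column once into a flat binary file, then load any block by seeking to its offset. The following is a sketch under assumptions, not a finished implementation: it is written for Python 3, `write_column` and `read_block` are hypothetical helper names, and it assumes comma-separated lines with no blank lines and a numeric value in the chosen column:

```python
from array import array

ITEM_SIZE = array('d').itemsize  # bytes per stored C double


def write_column(text_path, bin_path, column):
    """Convert one numeric CSV column to a flat binary file; return the row count."""
    count = 0
    buffer = array('d')
    with open(text_path) as src, open(bin_path, 'wb') as dst:
        for line in src:
            buffer.append(float(line.split(',')[column]))
            count += 1
            if len(buffer) >= 100000:
                # Flush periodically so memory use stays bounded.
                buffer.tofile(dst)
                buffer = array('d')
        buffer.tofile(dst)
    return count


def read_block(bin_path, start, length):
    """Load `length` values starting at 0-based row index `start`.

    Raises EOFError if the file holds fewer than start + length values.
    """
    block = array('d')
    with open(bin_path, 'rb') as f:
        f.seek(start * ITEM_SIZE)
        block.fromfile(f, length)
    return block
```

Because every value has the same size on disk, block boundaries can be changed freely (500,000 rows, one day, one week) without re-parsing the text file. An `array('d')` also stores each value far more compactly than a list of Python strings.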


            • dwblas
              Recognized Expert Contributor
              • May 2008
              • 626

              #7
              You may want to use SQL, but since you do not say what specifically you want to access or how you want to do it, it is difficult to tell whether using a list is the best way. Most of us have code generators for quick and dirty apps, so below is a generated SQL example of what you might want to do, using SQLite which comes with Python (code comments are sparse though). I don't want to waste time on something that may not be used, so post back if you want more info.
              Code:
              import random
              import sqlite3 as sqlite
              
              class SQLTest:
                 def __init__( self ) :
                    self.SQL_filename = './SQLtest.SQL'
                    self.open_files()
              
                 ##----------------------------------------------------------------------
                 def add_rec( self, val_tuple) :
                    self.cur.execute('INSERT INTO example_dbf values (?,?,?,?,?)', val_tuple)
                    self.con.commit()
              
                 ##----------------------------------------------------------------------
                 def list_all_recs( self ) :
                    self.cur.execute("select * from example_dbf")
                    recs_list = self.cur.fetchall()
                    for rec in recs_list:
                       print rec
              
                 ##----------------------------------------------------------------------
                 def lookup_date( self, date_in ) :
                    self.cur.execute("select * from example_dbf where st_date==:dic_lookup", 
                            {"dic_lookup":date_in})
                    recs_list = self.cur.fetchall()
                    print
                    print "lookup_date" 
                    for rec in recs_list:
                       print "%3d %9s %10.6f %3d  %s" % (rec[0], rec[1], rec[2], rec[3], rec[4])
              
                 ##----------------------------------------------------------------------
                 def lookup_2_fields( self, lookup_dic ) :
                    self.cur.execute("select * from example_dbf where st_date==:dic_field_1 and st_int==:dic_field_2", lookup_dic)
              
                    recs_list = self.cur.fetchall()
                    print
                    print "lookup_2_fields" 
                    if len(recs_list):
                       for rec in recs_list:
                          print rec
                    else:
                       print "no recs found"
              
                 ##----------------------------------------------------------------------
                 def open_files( self ) :
                       ##  a connection to the database file
                       self.con = sqlite.connect(self.SQL_filename)
              
                       # Get a Cursor object that operates in the context of Connection con
                       self.cur = self.con.cursor()
              
                       ##--- CREATE FILE ONLY IF IT DOESN'T EXIST
                       self.cur.execute("CREATE TABLE IF NOT EXISTS example_dbf(st_rec_num int, st_date varchar, st_float, st_int int, st_lit varchar)")
              
              ##===================================================================
              if __name__ == "__main__":
                 ST = SQLTest()
              
                 """ add some records with the format
                     record_number  date  float  int  string
                 """
                 rec_num = 0
                 ccyy = 2010
                 for x in range(1, 11):
                    rec_num += 1
                    mm = x + 1
                    dd = x + 2
                    date = "%d%02d%02d" % (ccyy, mm, dd)
                    add_fl = random.random() * 1000
                    add_int = random.randint(1, 21)
                    lit = "test lit # %d" % (x)
                    ST.add_rec( (rec_num, date, add_fl, add_int, lit) )
              
                 ## add duplicate dates for testing
                 for x in range(1, 3):
                    for y in range(2):
                       rec_num += 1
                       mm = x + 1
                       dd = x + 2
                       date = "%d%02d%02d" % (ccyy, mm, dd)
                       add_fl = random.random() * 1000
                       add_int = random.randint(1, 21)
                       lit = "test lit # %d" % (x)
                       ST.add_rec( (rec_num, date, add_fl, add_int, lit) )
                
                 ST.list_all_recs()
                 ST.lookup_date("20100203")
              
                 lookup_dict = {"dic_field_1":"20100203",
                                "dic_field_2":10}
                 ST.lookup_2_fields(lookup_dict)
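
For the per-day means asked about earlier in the thread, SQLite can do the grouping itself. A small sketch, written for Python 3; the table name `measurements`, the column `st_value`, and the helper names are invented for this example and are not part of the generated class above. It also uses `executemany` with a single commit, since committing after every row (as `add_rec` does) is likely to be slow for millions of rows:

```python
import sqlite3


def load_rows(con, rows):
    """Bulk-insert (date, value) pairs and commit once at the end."""
    con.execute(
        "CREATE TABLE IF NOT EXISTS measurements (st_date TEXT, st_value REAL)")
    con.executemany("INSERT INTO measurements VALUES (?, ?)", rows)
    con.commit()


def daily_means(con):
    """Return a list of (date, mean, row_count) tuples ordered by date."""
    cur = con.execute(
        "SELECT st_date, AVG(st_value), COUNT(*) FROM measurements "
        "GROUP BY st_date ORDER BY st_date")
    return cur.fetchall()


if __name__ == "__main__":
    con = sqlite3.connect(":memory:")  # use a filename to keep the data on disk
    load_rows(con, [("20100203", 1.0), ("20100203", 3.0), ("20100304", 10.0)])
    for row in daily_means(con):
        print(row)
    con.close()
```

`rows` can be any iterable, including a generator that parses the text file line by line, so the full data set never has to sit in a Python list.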


              • fekioh
                New Member
                • Aug 2010
                • 11

                #8
                Thank you, I will look into this tomorrow and I'll post back if I run into trouble.

                Cheers!
