very large dictionary

  • Simon Strobl

    very large dictionary

    Hello,

    I tried to load a 6.8G large dictionary on a server that has 128G of
    memory. I got a memory error. I used Python 2.5.2. How can I load my
    data?

    Simon
  • Marc 'BlackJack' Rintsch

    #2
    Re: very large dictionary

    On Fri, 01 Aug 2008 00:46:09 -0700, Simon Strobl wrote:
    I tried to load a 6.8G large dictionary on a server that has 128G of
    memory. I got a memory error. I used Python 2.5.2. How can I load my
    data?
    What does "load a dictionary" mean? Was it saved with the `pickle`
    module?

    How about using a database instead of a dictionary?
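
    One way Marc's database suggestion could look, using the sqlite3 module
    that ships with Python 2.5 (the table layout and file name are
    illustrative, not from the thread):

```python
import sqlite3

# ':memory:' keeps the sketch self-contained; on disk it would be
# sqlite3.connect('bigrams.db') (file name made up for illustration).
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE bigrams (key TEXT PRIMARY KEY, n INTEGER)')
conn.executemany('INSERT INTO bigrams VALUES (?, ?)',
                 [(', djy', 75), (', dk', 28893)])

# Look up one bigram at a time instead of holding 6.8G of data in RAM.
(n,) = conn.execute('SELECT n FROM bigrams WHERE key = ?',
                    (', dk',)).fetchone()
conn.close()
```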

    Ciao,
    Marc 'BlackJack' Rintsch


    • Simon Strobl

      #3
      Re: very large dictionary

      What does "load a dictionary" mean?

      I had a file bigrams.py with a content like below:

      bigrams = {
      ", djy" : 75 ,
      ", djz" : 57 ,
      ", djzoom" : 165 ,
      ", dk" : 28893 ,
      ", dk.au" : 854 ,
      ", dk.b." : 3668 ,
      ....

      }

      In another file I said:

      from bigrams import bigrams
      How about using a database instead of a dictionary?
      If there is no other way to do it, I will have to learn how to use
      databases in Python. I would prefer to be able to use the same type of
      scripts with data of all sizes, though.


      • bearophileHUGS@lycos.com

        #4
        Re: very large dictionary

        Simon Strobl:
        I had a file bigrams.py with a content like below:
        bigrams = {
        ", djy" : 75 ,
        ", djz" : 57 ,
        ", djzoom" : 165 ,
        ", dk" : 28893 ,
        ", dk.au" : 854 ,
        ", dk.b." : 3668 ,
        ...
        }
        In another file I said:
        from bigrams import bigrams
        Probably there's a limit on module size here. You can try to
        change your data format on disk, creating a text file like this:
        ", djy" 75
        ", djz" 57
        ", djzoom" 165
        ....
        Then in a module you can create an empty dict, read the lines of the
        data with:
        for line in somefile:
            part, n = line.rsplit(" ", 1)
            somedict[part.strip('"')] = int(n)

        Otherwise you may have to use a BigTable, a DB, etc.
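
        Put together, bearophile's loader might look like this (the file
        name bigrams.txt is assumed for illustration):

```python
somedict = {}

# Write a few lines in the proposed format so the sketch is self-contained.
out = open('bigrams.txt', 'w')
out.write('", djy" 75\n", dk" 28893\n')
out.close()

somefile = open('bigrams.txt')
for line in somefile:
    part, n = line.rsplit(" ", 1)       # split on the last space only
    somedict[part.strip('"')] = int(n)  # int() tolerates the trailing newline
somefile.close()
```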

        If there is no other way to do it, I will have to learn how to use
        databases in Python. I would prefer to be able to use the same type of
        scripts with data of all sizes, though.
        I understand, I don't know if there are documented limits for the
        dicts of the 64-bit Python.

        Bye,
        bearophile


        • Sion Arrowsmith

          #5
          Re: very large dictionary

          Simon Strobl <Simon.Strobl@gmail.com> wrote:
          >I tried to load a 6.8G large dictionary on a server that has 128G of
          >memory. I got a memory error. I used Python 2.5.2. How can I load my
          >data?
          Let's just eliminate one thing here: this server is running a
          64-bit OS, isn't it? Because if it's a 32-bit OS, the blunt
          answer is "You can't, no matter how much physical memory you
          have" and you're going to have to go down the database route
          (or some approach which stores the mapping on disk and only
          loads items into memory on demand).

          --
          \S -- siona@chiark.greenend.org.uk -- http://www.chaos.org.uk/~sion/
          "Frankly I have no feelings towards penguins one way or the other"
          -- Arthur C. Clarke
          her nu becomeþ se bera eadward ofdun hlæddre heafdes bæce bump bump bump


          • Raja Baz

            #6
            Re: very large dictionary

            On Fri, 01 Aug 2008 14:47:17 +0100, Sion Arrowsmith wrote:
            Simon Strobl <Simon.Strobl@gmail.com> wrote:
            >>I tried to load a 6.8G large dictionary on a server that has 128G of
            >>memory. I got a memory error. I used Python 2.5.2. How can I load my
            >>data?
            >
            Let's just eliminate one thing here: this server is running a 64-bit OS,
            isn't it? Because if it's a 32-bit OS, the blunt answer is "You can't,
            no matter how much physical memory you have" and you're going to have to
            go down the database route (or some approach which stores the mapping on
            disk and only loads items into memory on demand).
            I very highly doubt he has 128GB of main memory and is running a 32bit OS.



              • Sean

                #8
                Re: very large dictionary

                Simon Strobl wrote:
                Hello,
                >
                I tried to load a 6.8G large dictionary on a server that has 128G of
                memory. I got a memory error. I used Python 2.5.2. How can I load my
                data?
                >
                SImon
                Take a look at the Python bsddb module. Using btree tables is fast, and
                it has the benefit that once the table is open, the programming interface
                is identical to a normal dictionary.
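
                A minimal sketch of that dict-like interface. Shown here with the
                stdlib dbm module, which exposes the same usage; under Python 2.5
                the open call would be bsddb.btopen('bigrams.db', 'c') instead
                (file names illustrative):

```python
import dbm

db = dbm.open('bigrams_db', 'c')   # 'c': create the file if it doesn't exist
db[b', dk'] = b'28893'             # keys and values are byte strings on disk
db.close()

db = dbm.open('bigrams_db', 'r')   # reopen: lookups hit the disk file, not RAM
count = int(db[b', dk'])
db.close()
```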



                Sean


                • Steven D'Aprano

                  #9
                  Re: very large dictionary

                  On Fri, 01 Aug 2008 00:46:09 -0700, Simon Strobl wrote:
                  Hello,
                  >
                  I tried to load a 6.8G large dictionary on a server that has 128G of
                  memory. I got a memory error. I used Python 2.5.2. How can I load my
                  data?
                  How do you know the dictionary takes 6.8G?

                  I'm going to guess an answer to my own question. In a later post, Simon
                  wrote:

                  [quote]
                  I had a file bigrams.py with a content like below:

                  bigrams = {
                  ", djy" : 75 ,
                  ", djz" : 57 ,
                  ", djzoom" : 165 ,
                  ", dk" : 28893 ,
                  ", dk.au" : 854 ,
                  ", dk.b." : 3668 ,
                  ....

                  }
                  [end quote]


                  I'm guessing that the file is 6.8G of *text*. How much memory will it
                  take to import that? I don't know, but probably a lot more than 6.8G. The
                  compiler has to read the whole file in one giant piece, analyze it,
                  create all the string and int objects, and only then can it create the
                  dict. By my back-of-the-envelope calculations, the pointers alone will
                  require about 5GB, nevermind the objects they point to.
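
                  That envelope can be redone explicitly. The entry count below is
                  an assumption (6.8 GB at a guessed ~30 bytes per line), not a
                  figure from the thread:

```python
# Assumed, not measured: ~30 bytes per '", dk" : 28893 ,' line of the 6.8 GB file.
entries = int(6.8e9 / 30)       # ~227 million key/value pairs

# A 64-bit CPython dict entry is three words: cached hash, key pointer,
# value pointer.  The hash table is kept no more than ~2/3 full, so it
# holds at least 1.5x as many slots as entries.
table_bytes = entries * 3 * 8 * 1.5

print(table_bytes / 1e9)        # several GB for the table alone, before
                                # counting the string and int objects
```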

                  I suggest trying to store your data as data, not as Python code. Create a
                  text file "bigrams.txt" with one key/value per line, like this:

                  djy : 75
                  djz : 57
                  djzoom : 165
                  dk : 28893
                  ....

                  Then import it like such:

                  bigrams = {}
                  for line in open('bigrams.txt', 'r'):
                      key, value = line.split(':')
                      bigrams[key.strip()] = int(value.strip())


                  This will be slower, but because it only needs to read the data one line
                  at a time, it might succeed where trying to slurp all 6.8G in one piece
                  will fail.



                  --
                  Steven


                  • Jorgen Grahn

                    #10
                    Re: very large dictionary

                    On Fri, 1 Aug 2008 01:05:07 -0700 (PDT), Simon Strobl <Simon.Strobl@gmail.com> wrote:
                    >What does "load a dictionary" mean?
                    >
                    I had a file bigrams.py with a content like below:
                    >
                    bigrams = {
                    ", djy" : 75 ,
                    ", djz" : 57 ,
                    ", djzoom" : 165 ,
                    ", dk" : 28893 ,
                    ", dk.au" : 854 ,
                    ", dk.b." : 3668 ,
                    ...
                    >
                    }
                    >
                    In another file I said:
                    >
                    from bigrams import bigrams
                    >
                    >How about using a database instead of a dictionary?
                    >
                    If there is no other way to do it, I will have to learn how to use
                    databases in Python.
                    If you use Berkeley DB ("import bsddb"), you don't have to learn much.
                    These databases look very much like dictionaries string:string, only
                    they are disk-backed.

                    (I assume here that Berkeley DB supports 7GB data sets.)

                    /Jorgen

                    --
                    // Jorgen Grahn <grahn@ Ph'nglui mglw'nafh Cthulhu
                    \X/ snipabacken.se R'lyeh wgah'nagl fhtagn!


                    • Jorgen Grahn

                      #11
                      Re: very large dictionary

                      On 3 Aug 2008 20:36:33 GMT, Jorgen Grahn <grahn+nntp@snipabacken.se> wrote:
                      On Fri, 1 Aug 2008 01:05:07 -0700 (PDT), Simon Strobl <Simon.Strobl@gmail.com> wrote:
                      ....
                      >If there is no other way to do it, I will have to learn how to use
                      >databases in Python.
                      >
                      If you use Berkeley DB ("import bsddb"), you don't have to learn much.
                      These databases look very much like dictionaries string:string, only
                      they are disk-backed.
                      .... all of which Sean pointed out elsewhere in the thread.

                      Oh well. I guess pointing it out twice doesn't hurt. bsddb has been
                      very pleasant to work with for me. I normally avoid database
                      programming like the plague.

                      /Jorgen

                      --
                      // Jorgen Grahn <grahn@ Ph'nglui mglw'nafh Cthulhu
                      \X/ snipabacken.se R'lyeh wgah'nagl fhtagn!


                      • member thudfoo

                        #12
                        Re: very large dictionary

                        On 3 Aug 2008 20:40:02 GMT, Jorgen Grahn <grahn+nntp@snipabacken.se> wrote:
                        On 3 Aug 2008 20:36:33 GMT, Jorgen Grahn <grahn+nntp@snipabacken.se> wrote:
                        On Fri, 1 Aug 2008 01:05:07 -0700 (PDT), Simon Strobl <Simon.Strobl@gmail.com> wrote:
                        >
                        ...
                        >
                        If there is no other way to do it, I will have to learn how to use
                        >databases in Python.
                        >
                        If you use Berkeley DB ("import bsddb"), you don't have to learn much.
                        These databases look very much like dictionaries string:string, only
                        they are disk-backed.
                        >
                        >
                        ... all of which Sean pointed out elsewhere in the thread.
                        >
                        Oh well. I guess pointing it out twice doesn't hurt. bsddb has been
                        very pleasant to work with for me. I normally avoid database
                        programming like the plague.
                        >
                        >
                        13.4 shelve -- Python object persistence

                        A ``shelf'' is a persistent, dictionary-like object. The difference
                        with ``dbm'' databases is that the values (not the keys!) in a shelf
                        can be essentially arbitrary Python objects -- anything that the
                        pickle module can handle. This includes most class instances,
                        recursive data types, and objects containing lots of shared
                        sub-objects. The keys are ordinary strings....

                        [...]
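
                        A minimal sketch of that shelf interface (file name
                        illustrative):

```python
import shelve

shelf = shelve.open('bigrams_shelf')   # backed by a dbm file on disk
shelf[', dk'] = 28893                  # plain string keys, any picklable value
shelf.close()

shelf = shelve.open('bigrams_shelf')   # reopen: the data comes back from disk
count = shelf[', dk']
shelf.close()
```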


                        • Avinash Vora

                          #13
                          Re: very large dictionary


                          On Aug 4, 2008, at 4:12 AM, Jörgen Grahn wrote:
                          (You might want to post this to comp.lang.python rather than to me --
                          I am just another c.l.p reader. If you already have done so, please
                          disregard this.)
                          Yeah, I hit "reply" by mistake and didn't realize it. My bad.
                          >>(I assume here that Berkeley DB supports 7GB data sets.)
                          >>
                          >If I remember correctly, BerkeleyDB is limited to a single file size
                          >of 2GB.
                          >
                          Sounds likely. But with some luck maybe they have increased this in
                          later releases? There seem to be many competing Berkeley releases.
                          It's worth investigating, but that leads me to:
                          >I haven't caught the earlier parts of this thread, but do I
                          >understand correctly that someone wants to load a 7GB dataset into
                          >the
                          >form of a dictionary?
                          >
                          Yes, he claimed the dictionary was 6.8 GB. How he measured that, I
                          don't know.

                          To the OP: how did you measure this?

                          --
                          Avi


                          • Simon Strobl

                            #14
                            Re: very large dictionary

                            On 4 Aug., 00:51, Avinash Vora <avinashv...@gmail.com> wrote:
                            On Aug 4, 2008, at 4:12 AM, Jörgen Grahn wrote:
                            >
                            (You might want to post this to comp.lang.python rather than to me --
                            I am just another c.l.p reader. If you already have done so, please
                            disregard this.)
                            >
                            Yeah, I hit "reply" by mistake and didn't realize it. My bad.
                            >
                            >(I assume here that Berkeley DB supports 7GB data sets.)
                            >
                            If I remember correctly, BerkeleyDB is limited to a single file size
                            of 2GB.
                            >
                            Sounds likely. But with some luck maybe they have increased this in
                            later releases? There seem to be many competing Berkeley releases.
                            >
                            It's worth investigating, but that leads me to:
                            >
                            I haven't caught the earlier parts of this thread, but do I
                            understand correctly that someone wants to load a 7GB dataset into
                            the
                            form of a dictionary?
                            >
                            Yes, he claimed the dictionary was 6.8 GB. How he measured that, I
                            don't know.
                            >
                            To the OP: how did you measure this?
                            I created a Python file that contained the dictionary. The size of
                            this file was 6.8GB. I thought it would be practical not to create the
                            dictionary from a text file each time I needed it. I.e. I thought
                            loading the .pyc file should be faster. Yet, Python failed to create
                            a .pyc file.

                            Simon


                            • Steven D'Aprano

                              #15
                              Re: very large dictionary

                              On Mon, 04 Aug 2008 07:02:16 -0700, Simon Strobl wrote:
                              I created a python file that contained the dictionary. The size of this
                              file was 6.8GB.
                              Ah, that's what I thought you had done. That's not a dictionary. That's a
                              text file containing the Python code to create a dictionary.

                              My guess is that a 7GB text file will require significantly more memory
                              once converted to an actual dictionary: in my earlier post, I estimated
                              about 5GB for pointers. Total size of the dictionary is impossible to
                              estimate accurately without more information, but I'd guess that 10GB or
                              20GB wouldn't be unreasonable.

                              Have you considered that the operating system imposes per-process limits
                              on memory usage? You say that your server has 128 GB of memory, but that
                              doesn't mean the OS will make anything like that available.
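
                              On Unix, one way to check the per-process cap is the
                              stdlib resource module:

```python
import resource

# RLIMIT_AS caps a process's total address space, in bytes.
soft, hard = resource.getrlimit(resource.RLIMIT_AS)

# resource.RLIM_INFINITY means the OS imposes no cap of its own.
capped = soft != resource.RLIM_INFINITY
```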

                              And I don't know how to even start estimating how much temporary memory
                              is required to parse and build such an enormous Python program. Not only
                              is it a 7GB program, but it is 7GB in one statement.

                              I thought it would be practical not to create the
                              dictionary from a text file each time I needed it. I.e. I thought
                              loading the .pyc-file should be faster. Yet, Python failed to create a
                              .pyc-file
                              Probably a good example of premature optimization. Out of curiosity, how
                              long does it take to create it from a text file?



                              --
                              Steven
