Database specialized in storing directed graphs?

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • Carl Banks

    Database specialized in storing directed graphs?

    I was wondering if anyone had any advice on this.

    This is not to study graph theory; I'm using the graph to represent a
    problem domain. The graphs could be arbitrarily large, and could
    easily have millions of nodes, and most nodes have a substantial
    amount of data associated with them. Obviously I don't want a whole
    such graph in memory at once, so libraries the just deal with in-
    memory graphs are out.

    I know I could implement this with a relational DB, and I'd be fine
    with a library that was built on top of one. But I'm hoping for
    specialzed behavior useful for graphs.

    For example, suppose I have a set of nodes called A. It would be
    useful to know if any of these nodes are connected by paths that
    include no nodes in A. I could, of course, do that by reading from
    the database and following the paths, but I clearly don't want to do
    that. I would want instead to somehow cache the outside connectedness
    relationship between different nodes of A. But then what happens if
    something changes elsewhere. How is the cache for A notified, and can
    it be updated efficiently using graph theorems, rather than
    regenerated?

    It's very tricky; that's why I hope someone else has done it.

    I'm guessing no.


    Carl Banks
  • Chris Rebert

    #2
    Re: Database specialized in storing directed graphs?

    On Mon, Oct 27, 2008 at 5:32 PM, Carl Banks <pavlovevidence @gmail.comwrote :
    I was wondering if anyone had any advice on this.
    >
    This is not to study graph theory; I'm using the graph to represent a
    problem domain. The graphs could be arbitrarily large, and could
    easily have millions of nodes, and most nodes have a substantial
    amount of data associated with them. Obviously I don't want a whole
    such graph in memory at once, so libraries the just deal with in-
    memory graphs are out.
    >
    I know I could implement this with a relational DB, and I'd be fine
    with a library that was built on top of one. But I'm hoping for
    specialzed behavior useful for graphs.
    >
    For example, suppose I have a set of nodes called A. It would be
    useful to know if any of these nodes are connected by paths that
    include no nodes in A. I could, of course, do that by reading from
    the database and following the paths, but I clearly don't want to do
    that. I would want instead to somehow cache the outside connectedness
    relationship between different nodes of A. But then what happens if
    something changes elsewhere. How is the cache for A notified, and can
    it be updated efficiently using graph theorems, rather than
    regenerated?
    >
    It's very tricky; that's why I hope someone else has done it.
    >
    I'm guessing no.
    By sacrificing a goat at the altar of the Almighty Google, I was able
    to locate a project I came upon a long while ago but couldn't remember
    the name of that's vaguely like what you want, in that it's a "graph
    database": Neo4j - http://neo4j.org/ (and yes, it's in Java; sigh)
    Not sure it's exactly what you're looking for, but anyway....

    Cheers,
    Chris
    --
    Follow the path of the Iguana...

    Comment

    • M.-A. Lemburg

      #3
      Re: Database specialized in storing directed graphs?

      On 2008-10-28 01:32, Carl Banks wrote:
      I was wondering if anyone had any advice on this.
      >
      This is not to study graph theory; I'm using the graph to represent a
      problem domain. The graphs could be arbitrarily large, and could
      easily have millions of nodes, and most nodes have a substantial
      amount of data associated with them. Obviously I don't want a whole
      such graph in memory at once, so libraries the just deal with in-
      memory graphs are out.
      >
      I know I could implement this with a relational DB, and I'd be fine
      with a library that was built on top of one. But I'm hoping for
      specialzed behavior useful for graphs.
      >
      For example, suppose I have a set of nodes called A. It would be
      useful to know if any of these nodes are connected by paths that
      include no nodes in A. I could, of course, do that by reading from
      the database and following the paths, but I clearly don't want to do
      that. I would want instead to somehow cache the outside connectedness
      relationship between different nodes of A. But then what happens if
      something changes elsewhere. How is the cache for A notified, and can
      it be updated efficiently using graph theorems, rather than
      regenerated?
      >
      It's very tricky; that's why I hope someone else has done it.
      >
      I'm guessing no.
      Aaron Watters is an expert on this and has implemented kjbuckets
      for doing this in memory:



      Gadfly uses the library to implement relational queries (and works
      on disk):



      The package is now maintained by Richard Jones.

      You might be able to reuse some parts of Gadfly for your
      purposes.

      Also have a look at Pygr:



      which is a Python library to build a graph interface on top of
      a relational database.

      --
      Marc-Andre Lemburg
      eGenix.com

      Professional Python Services directly from the Source (#1, Oct 28 2008)
      >>Python/Zope Consulting and Support ... http://www.egenix.com/
      >>mxODBC.Zope.D atabase.Adapter ... http://zope.egenix.com/
      >>mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
      _______________ _______________ _______________ _______________ ____________

      :::: Try mxODBC.Zope.DA for Windows,Linux,S olaris,MacOSX for free ! ::::


      eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48
      D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
      Registered at Amtsgericht Duesseldorf: HRB 46611

      Comment

      • George Sakkis

        #4
        Re: Database specialized in storing directed graphs?

        On Oct 27, 8:32 pm, Carl Banks <pavlovevide... @gmail.comwrote :
        I was wondering if anyone had any advice on this.
        >
        This is not to study graph theory; I'm using the graph to represent a
        problem domain.  The graphs could be arbitrarily large, and could
        easily have millions of nodes, and most nodes have a substantial
        amount of data associated with them.  Obviously I don't want a whole
        such graph in memory at once, so libraries the just deal with in-
        memory graphs are out.
        >
        I know I could implement this with a relational DB, and I'd be fine
        with a library that was built on top of one.  But I'm hoping for
        specialzed behavior useful for graphs.
        If you're looking for FOSS, the Boost graph library [1] or its
        parallel extension [2] is probably your best bet; it also comes with
        Python bindings but they are not maintained any more. For commercial
        solutions, Star-P [3] seems an interesting platform, with bindings to
        Matlab and Python. Freebase [4] is apparently built on a special graph
        database but unfortunately only the stored data are available, not the
        DB source code.

        George

        [1] http://www.boost.org/doc/libs/1_36_0...doc/index.html
        [2] http://www.osl.iu.edu/research/pbgl/
        [3] http://www.interactivesupercomputing...sematrices.php
        [4] http://www.freebase.com/help/faq#q7

        Comment

        • bearophileHUGS@lycos.com

          #5
          Re: Database specialized in storing directed graphs?

          Sorry Carl Banks for the answering delay, there are problems in Google
          Groups.
          This is not to study graph theory; I'm using the graph to represent a
          problem domain. The graphs could be arbitrarily large, and could
          easily have millions of nodes, and most nodes have a substantial
          amount of data associated with them. Obviously I don't want a whole
          such graph in memory at once, so libraries the just deal with in-
          memory graphs are out.
          It doesn't sound a problem for modern PCs.
          I think you can keep the whole graph topology in RAM, and the node
          data on disk (for example in a file or in DB). The topology is
          represented by the arcs (you have said nothing regarding arc data, so
          I assume it's absent) and the nodes (in RAM you just need a 32 bit
          unsigned integer to represent the index of the node that is stored on
          disk. If memory becomes tight you can use just 3 bytes (2 ^ 24 = 16
          millions different nodes) for the nodes, but generally the memory
          required for the nodes is very little compared to the memory necessary
          to store the arcs).

          You haven't said how many arcs there are, total or average for node,
          and if such arcs are directed or undirected.

          Anyway, using my Graph class (that stores each arc twice), this takes
          about 1 minute and 1.3 GB of RAM (1 million nodes, 10 arcs for node):

          from graph import Graph
          from random import randrange
          g = Graph()
          N = 1000000
          g.addNodes(xran ge(N))
          for i in xrange(N * 10):
          g.addArc(randra nge(N), randrange(N))


          You have said "could easily have millions of nodes", and every arc may
          have tens of arcs or more.
          ("arbitraril y large" is an unsolvable problem because there are always
          limits in your program, there's no program that's able to work on an
          "arbitraril y large" dataset), so Python data structures become too
          much costly for the RAM. With a lower level language like D/C/C++ you
          can manage a bigger graph. You can use Boost Graph, but a homemade
          graph structure may suffice too for your purposes.

          For the problem you have explained I think a very simple graph
          representation may suffice: an array of integer pairs (starting_index ,
          len) (starting_index can be a pointer too), where len is the number of
          outbound arcs of the node n, plus an array of indexes/pointers that
          list the outbound arcs. If memory gets tight you can split this second
          array in two, and use an array of bytes for the lengths (if you have
          more than 256 outbound arcs you may need a short). Note that if you
          use indexes then a Python array.array (or Numpy) suffices.

          In this situation if nnodes = 10_000_000 and narcs/node = 40 (total
          nodes = 40 * 10_000_000):
          nnodes * narcs * 4 + nnodes * (4 + 4) = 1_680_000_000 bytes, that is
          often available on modern PCs.

          On a 64 bit machine the indexes take the same memory, but pointers
          twice that.

          In "memory saving" mode:
          nnodes * narcs * 3 + nnodes * (3 + 1) = 1_240_000_000 bytes.

          A more handy compromise is:
          nnodes * narcs * 3 + nnodes * (4 + 4) = 1_280_000_000 bytes.
          But then what happens if something changes elsewhere.
          Regarding the data structure, if you use arrays like I have explained,
          if your updates aren't much frequent then you can allocate small extra
          arrays to store more arcs coming out a node (but to do this you
          probably may enjoy using pointers instead of indexes). When you have a
          lot of updated you can save all to disk, and then regenerate the whole
          data structure.

          Bye,
          bearophile

          Comment

          • flyingfrog

            #6
            Re: Database specialized in storing directed graphs?

            Don't really know if this will be useful but i'd try pytables:

            it deals very well with every kind of hierarchical data sets, doesn't
            matter the size.
            It will let you load only significant data, and you'll be able to
            query your data.
            It's built on top of HDF5 libraries but exposes a very friendly
            pythonic interface.
            Surely you'll still have to implement all of the graph logic yourself
            but this could be a good starting point.
            Hope it helps

            Marco

            Comment

            • Aaron Brady

              #7
              Re: Database specialized in storing directed graphs?

              On Oct 27, 7:32 pm, Carl Banks <pavlovevide... @gmail.comwrote :
              I was wondering if anyone had any advice on this.
              >
              This is not to study graph theory; I'm using the graph to represent a
              problem domain.  The graphs could be arbitrarily large, and could
              easily have millions of nodes, and most nodes have a substantial
              amount of data associated with them.  Obviously I don't want a whole
              such graph in memory at once, so libraries the just deal with in-
              memory graphs are out.
              >
              I know I could implement this with a relational DB, and I'd be fine
              with a library that was built on top of one.  But I'm hoping for
              specialzed behavior useful for graphs.
              >
              For example, suppose I have a set of nodes called A.  It would be
              useful to know if any of these nodes are connected by paths that
              include no nodes in A.  I could, of course, do that by reading from
              the database and following the paths, but I clearly don't want to do
              that.  I would want instead to somehow cache the outside connectedness
              relationship between different nodes of A.  But then what happens if
              something changes elsewhere.  How is the cache for A notified, and can
              it be updated efficiently using graph theorems, rather than
              regenerated?
              >
              It's very tricky; that's why I hope someone else has done it.
              >
              I'm guessing no.
              >
              Carl Banks
              Are you just looking for a persistent graph? The usual options are
              'shelve', 'sqllite', 'mmap', and 'ctypes.Structu re'. Or did you need
              a special structure for the 'outside connectedness' properties?

              Storing that is the usual time-space tradeoff. (Actually, I think
              that term refers to something slightly distinct. This would be more
              properly termed read-time/write-time trade off, roughly.)

              Assuming you pick an RDB, you'd have a table of vertices and a table
              of edges. Did you want 'set A' to be a table and persist between runs
              of the program? If not, just keep a set of 2-tuples of nodes that
              satisfy that property. I think the set S you're after is: all x,y
              such that x is a vertex in A, y is a vertex in A, and there exists a P
              such that P is a path, P starts at x, P ends at y, and vertex v in P
              implies that v is x, v is y, or v is not in A. I don't know if you
              can cache any information about A that gives you any shortcuts in
              calculating S, or what the running time or running space of the
              fastest/smallest algorithm is. At worst, it's O( |V| * |E| ) on each
              change to V or E. Unverified.

              You might be able to store the shortest path in a mapping from 2-
              tuples to paths. Or better yet, map each node in A to the set of all
              nodes reachable from it only entering A once. Then, you might save
              time on either adding a node/vertex to A or G, or removing one from A
              or G, or both. Maybe mapping each node in G to that relative set.
              You didn't say whether the graph was directed and/or cyclic.

              A good place to start would be, if you add a node to G, can S
              decrease? No, because the original path P still exists. If you
              remove one from A, still in G, S can stay the same size, might
              decrease. If you remove a node from G, not in A, same, etc.

              Comment

              Working...