very large graph

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • chrispoliquin@gmail.com

    very large graph

    I need to represent the hyperlinks between a large number of HTML
    files as a graph. My non-directed graph will have about 63,000 nodes
    and and probably close to 500,000 edges.

    I have looked into igraph (http://cneurocvs.rmki.kfki.hu/igraph/doc/
    python/index.html) and networkX (https://networkx.lanl.gov/wiki) for
    generating a file to store the graph, and I have also looked into
    Graphviz for visualization. I'm just not sure which modules are
    best. I need to be able to do the following:

    1) The names of my nodes are not known ahead of time, so I will
    extract the title from all the HTML files to name the nodes prior to
    parsing the files for hyperlinks (edges).

    2) Every file will be parsed for links and nondirectional connections
    will be drawn between the two nodes.

    3) The files might link to each other so the graph package needs to
    be able to check to see if an edge between two nodes already exists,
    or at least not double draw connections between the two nodes when
    adding edges.

    I'm relatively new to graph theory so I would greatly appreciate any
    suggestions for filetypes. I imagine doing this as a python
    dictionary with a list for the edges and a node:list paring is out of
    the question for such a large graph?
  • bearophileHUGS@lycos.com

    #2
    Re: very large graph

    chrispoliq...@g mail.com:
    My non-directed graph will have about 63,000 nodes
    and and probably close to 500,000 edges.
    That's large, but today not very large anymore. Today very large
    graphs probably have more than millions of nodes...
    You have to try, but I think any Python graph lib may be fit for your
    purpose.
    But I think a list of pairs won't do.

    For larger graphs you can use Boost graph too, but it may be overkill
    for your purposes.
    If your graphs are even larger, you can implement a graph with
    DataDraw in C and use the compiled module from Python, this is
    probably among the faster possible near-prebuilt data structures.
    For even larger graphs you may need the "external" STXXL with C++, and
    then custom code.

    The list of the operations you have to perform on the graph seems very
    simple, so any graph lib may be enough, mine too:
    Download pygraph for free. Python library for a fast and flexible graph data structure.


    For 63,000 nodes and 500,000 edges Graphviz may be too much slow, so
    you may need to find a faster visualization tool. Two (or more) of
    them are free too.

    Ask if you need more information.

    Bye,
    bearophile

    Comment

    • MRAB

      #3
      Re: very large graph

      On Jun 24, 1:26 am, chrispoliq...@g mail.com wrote:
      I need to represent the hyperlinks between a large number of HTML
      files as a graph.  My non-directed graph will have about 63,000 nodes
      and and probably close to 500,000 edges.
      >
      I have looked into igraph (http://cneurocvs.rmki.kfki.hu/igraph/doc/
      python/index.html) and networkX (https://networkx.lanl.gov/wiki) for
      generating a file to store the graph, and I have also looked into
      Graphviz for visualization.  I'm just not sure which modules are
      best.  I need to be able to do the following:
      >
      1)  The names of my nodes are not known ahead of time, so I will
      extract the title from all the HTML files to name the nodes prior to
      parsing the files for hyperlinks (edges).
      >
      2) Every file will be parsed for links and nondirectional connections
      will be drawn between the two nodes.
      >
      3)  The files might link to each other so the graph package needs to
      be able to check to see if an edge between two nodes already exists,
      or at least not double draw connections between the two nodes when
      adding edges.
      >
      I'm relatively new to graph theory so I would greatly appreciate any
      suggestions for filetypes.  I imagine doing this as a python
      dictionary with a list for the edges and a node:list paring is out of
      the question for such a large graph?
      Perhaps a dictionary where the key is a node and the value is a set of
      destination nodes?

      Comment

      • A.T.Hofkamp

        #4
        Re: very large graph

        On 2008-06-24, MRAB <google@mrabarn ett.plus.comwro te:
        On Jun 24, 1:26 am, chrispoliq...@g mail.com wrote:
        >I need to represent the hyperlinks between a large number of HTML
        >files as a graph.  My non-directed graph will have about 63,000 nodes
        >and and probably close to 500,000 edges.
        >>
        >I have looked into igraph (http://cneurocvs.rmki.kfki.hu/igraph/doc/
        >python/index.html) and networkX (https://networkx.lanl.gov/wiki) for
        >generating a file to store the graph, and I have also looked into
        >Graphviz for visualization.  I'm just not sure which modules are
        >best.  I need to be able to do the following:
        Afaik Graphviz is not good at abstracting the graph, which you may need here.
        A page with 63,000 circles on it, and 500,000 edges will probably come out of
        the printer as a black sheet of paper.
        (8"x11" paper, 1 circle is 1/5", then you have only 2200 circles at one sheet.
        You need a factor 28 more circles which leads to a circle of about 0.007".)

        Even when actual paper format would not be a problem, you will need some
        abstraction and/or tools, as you will not find anything manually in an ocean of
        63,000 elements.

        One area you may want to start looking for tools is in state graphs, where the
        set of possible states of an entire system is unfolded. These things go up to
        over a million states, so you only have a 'small' problem there...
        >1)  The names of my nodes are not known ahead of time, so I will
        >extract the title from all the HTML files to name the nodes prior to
        >parsing the files for hyperlinks (edges).
        >>
        >2) Every file will be parsed for links and nondirectional connections
        >will be drawn between the two nodes.
        >>
        >3)  The files might link to each other so the graph package needs to
        >be able to check to see if an edge between two nodes already exists,
        >or at least not double draw connections between the two nodes when
        >adding edges.
        >>
        >I'm relatively new to graph theory so I would greatly appreciate any
        >suggestions for filetypes.  I imagine doing this as a python
        >dictionary with a list for the edges and a node:list paring is out of
        >the question for such a large graph?
        >
        Perhaps a dictionary where the key is a node and the value is a set of
        destination nodes?
        For undirected edges, you could make an Edge class and have a set of Edge's
        (where two Edge objects are equal when they connect the same nodes).
        I don't expect 500,000 elements in a set to be a problem.

        Sincerely,
        Albert

        Comment

        • castironpi

          #5
          Re: very large graph

          On Jun 24, 5:58 am, "A.T.Hofkam p" <h...@se-162.se.wtb.tue. nlwrote:
          On 2008-06-24, MRAB <goo...@mrabarn ett.plus.comwro te:
          >
          On Jun 24, 1:26 am, chrispoliq...@g mail.com wrote:
          I need to represent the hyperlinks between a large number of HTML
          files as a graph.  My non-directed graph will have about 63,000 nodes
          and and probably close to 500,000 edges.
          >
          I have looked into igraph (http://cneurocvs.rmki.kfki.hu/igraph/doc/
          python/index.html) and networkX (https://networkx.lanl.gov/wiki) for
          generating a file to store the graph, and I have also looked into
          Graphviz for visualization.  I'm just not sure which modules are
          best.  I need to be able to do the following:
          >
          Afaik Graphviz is not good at abstracting the graph, which you may need here.
          A page with 63,000 circles on it, and 500,000 edges will probably come out of
          the printer as a black sheet of paper.
          (8"x11" paper, 1 circle is 1/5", then you have only 2200 circles at one sheet.
          You need a factor 28 more circles which leads to a circle of about 0.007"..)
          >
          Even when actual paper format would not be a problem, you will need some
          abstraction and/or tools, as you will not find anything manually in an ocean of
          63,000 elements.
          >
          One area you may want to start looking for tools is in state graphs, where the
          set of possible states of an entire system is unfolded. These things go up to
          over a million states, so you only have a 'small' problem there...
          >
          >
          >
          >
          >
          1)  The names of my nodes are not known ahead of time, so I will
          extract the title from all the HTML files to name the nodes prior to
          parsing the files for hyperlinks (edges).
          >
          2) Every file will be parsed for links and nondirectional connections
          will be drawn between the two nodes.
          >
          3)  The files might link to each other so the graph package needs to
          be able to check to see if an edge between two nodes already exists,
          or at least not double draw connections between the two nodes when
          adding edges.
          >
          I'm relatively new to graph theory so I would greatly appreciate any
          suggestions for filetypes.  I imagine doing this as a python
          dictionary with a list for the edges and a node:list paring is out of
          the question for such a large graph?
          >
          Perhaps a dictionary where the key is a node and the value is a set of
          destination nodes?
          >
          For undirected edges, you could make an Edge class and have a set of Edge's
          (where two Edge objects are equal when they connect the same nodes).
          I don't expect 500,000 elements in a set to be a problem.
          >
          Sincerely,
          Albert- Hide quoted text -
          >
          - Show quoted text -
          Readability counts: Do you need to be able to examine the datafile by
          hand? If not, investigate the 'shelve' module. You can go two ways.
          One is to map each node to a list of nodes. It's probably more
          intuitive, but it needs repeated code:

          shelf['pageA'].add( pageB )
          shelf['pageB']= pageB
          shelf['pageB'].add( pageA )

          Which is incorrect as is anyway-- updates to shelved objects aren't
          committed automatically.

          a= shelf['pageA']
          a.add( pageB )
          shelf['pageA']= a
          b= shelf['pageB']
          b.add( pageA )
          shelf['pageB']= b

          is closer. Otherwise, two is to store a 2-tuple of node keys in a
          secondary file.

          shelfEdges[ frozenset(
          shelfVerts[ 'pageA' ], shelfVerts[ 'pageB' ]
          ) ]= True

          Comment

          • boggom@comcast.net

            #6
            Re: very large graph

            Drawing a large graph like this is not
            very insightful by itself, and doing this well
            is still an art form.
            Many cool visualizations, and all
            very domain and question dependent,
            can be found at
            Choose the layout that fits the mechanism: tree for hierarchy, matrix for dense ties, flow for process, radial for cycles.


            You can also search on flickr for network
            and graph drawing.

            Much of it is eyecandy though. What do you
            really want to know?

            Your graph is not overly huge for python, provided
            you do not want to compute nonlocal statistics
            such as betweenness metrics. (If you have
            a laptop running windows and with
            less than 1GB RAM, then you have a really large graph.)

            Given that you do not know much about graph theory,
            I would suggest that networkx applied to a
            subset of your large website --- one of the
            representative branches --- will allow
            for the fastest way to figure out what you
            want to do. Python provides access to many other
            libraries for html mangling or even webpage
            analysis. Imagine writing C code to ask questions
            like: how many pdfs and images? how big are they?
            when were these pages last updated? etc.
            You can come up with 10 questions faster
            than you can write C code, and this where
            python has its great advantage.

            You can learn graph theory while using these libraries,
            on one of the smaller branches of your website tree.
            Even interactively by using ipython.
            This provides at least a feeling of great power while
            stumbling in the dark.

            Many of your edges are probably repetitions
            associated with navigation menus to provide
            a standard look-and-feel. See how much of that
            you can strip out and find the cross-links that took
            effort to insert an manage. (I suggest
            doing the analyis on a digraph rather than a graph,
            even if you want to draw it as graph.)


            For visualization, the new ubigraph is quite
            fast compared to graphviz.
            See the cool networkx + ubigraph
            video at http://ubietylab.net/blog/


            Ask on the networkx mailinglist when stuck.

            Comment

            Working...