very large graph

**bearophileHUGS@lycos.com** · Jun 27 '08, 04:30 PM

Re: very large graph

chrispoliq...@g mail.com:

My non-directed graph will have about 63,000 nodes
and and probably close to 500,000 edges.

That's large, but today not very large anymore. Today very large
graphs probably have more than millions of nodes...
You have to try, but I think any Python graph lib may be fit for your
purpose.
But I think a list of pairs won't do.

For larger graphs you can use Boost graph too, but it may be overkill
for your purposes.
If your graphs are even larger, you can implement a graph with
DataDraw in C and use the compiled module from Python, this is
probably among the faster possible near-prebuilt data structures.
For even larger graphs you may need the "external" STXXL with C++, and
then custom code.

The list of the operations you have to perform on the graph seems very
simple, so any graph lib may be enough, mine too:

pygraph

http://sourceforge.net/projects/pynetwork/

Download pygraph for free. Python library for a fast and flexible graph data structure.

For 63,000 nodes and 500,000 edges Graphviz may be too much slow, so
you may need to find a faster visualization tool. Two (or more) of
them are free too.

Ask if you need more information.

Bye,
bearophile

**MRAB** · Jun 27 '08, 04:30 PM

Re: very large graph

On Jun 24, 1:26 am, chrispoliq...@g mail.com wrote:

I need to represent the hyperlinks between a large number of HTML
files as a graph. My non-directed graph will have about 63,000 nodes
and and probably close to 500,000 edges.
>
I have looked into igraph (http://cneurocvs.rmki.kfki.hu/igraph/doc/
python/index.html) and networkX (https://networkx.lanl.gov/wiki) for
generating a file to store the graph, and I have also looked into
Graphviz for visualization. I'm just not sure which modules are
best. I need to be able to do the following:
>
1) The names of my nodes are not known ahead of time, so I will
extract the title from all the HTML files to name the nodes prior to
parsing the files for hyperlinks (edges).
>
2) Every file will be parsed for links and nondirectional connections
will be drawn between the two nodes.
>
3) The files might link to each other so the graph package needs to
be able to check to see if an edge between two nodes already exists,
or at least not double draw connections between the two nodes when
adding edges.
>
I'm relatively new to graph theory so I would greatly appreciate any
suggestions for filetypes. I imagine doing this as a python
dictionary with a list for the edges and a node:list paring is out of
the question for such a large graph?

Perhaps a dictionary where the key is a node and the value is a set of
destination nodes?

**A.T.Hofkamp** · Jun 27 '08, 04:30 PM

Re: very large graph

On 2008-06-24, MRAB <google@mrabarn ett.plus.comwro te:

On Jun 24, 1:26 am, chrispoliq...@g mail.com wrote:

>I need to represent the hyperlinks between a large number of HTML
>files as a graph. My non-directed graph will have about 63,000 nodes
>and and probably close to 500,000 edges.
>>
>I have looked into igraph (http://cneurocvs.rmki.kfki.hu/igraph/doc/
>python/index.html) and networkX (https://networkx.lanl.gov/wiki) for
>generating a file to store the graph, and I have also looked into
>Graphviz for visualization. I'm just not sure which modules are
>best. I need to be able to do the following:

Afaik Graphviz is not good at abstracting the graph, which you may need here.
A page with 63,000 circles on it, and 500,000 edges will probably come out of
the printer as a black sheet of paper.
(8"x11" paper, 1 circle is 1/5", then you have only 2200 circles at one sheet.
You need a factor 28 more circles which leads to a circle of about 0.007".)

Even when actual paper format would not be a problem, you will need some
abstraction and/or tools, as you will not find anything manually in an ocean of
63,000 elements.

One area you may want to start looking for tools is in state graphs, where the
set of possible states of an entire system is unfolded. These things go up to
over a million states, so you only have a 'small' problem there...

>1) The names of my nodes are not known ahead of time, so I will
>extract the title from all the HTML files to name the nodes prior to
>parsing the files for hyperlinks (edges).
>>
>2) Every file will be parsed for links and nondirectional connections
>will be drawn between the two nodes.
>>
>3) The files might link to each other so the graph package needs to
>be able to check to see if an edge between two nodes already exists,
>or at least not double draw connections between the two nodes when
>adding edges.
>>
>I'm relatively new to graph theory so I would greatly appreciate any
>suggestions for filetypes. I imagine doing this as a python
>dictionary with a list for the edges and a node:list paring is out of
>the question for such a large graph?

>
Perhaps a dictionary where the key is a node and the value is a set of
destination nodes?

For undirected edges, you could make an Edge class and have a set of Edge's
(where two Edge objects are equal when they connect the same nodes).
I don't expect 500,000 elements in a set to be a problem.

Sincerely,
Albert

**castironpi** · Jun 27 '08, 04:30 PM

Re: very large graph

On Jun 24, 5:58 am, "A.T.Hofkam p" <h...@se-162.se.wtb.tue. nlwrote:

On 2008-06-24, MRAB <goo...@mrabarn ett.plus.comwro te:
>

On Jun 24, 1:26 am, chrispoliq...@g mail.com wrote:

I need to represent the hyperlinks between a large number of HTML
files as a graph. My non-directed graph will have about 63,000 nodes
and and probably close to 500,000 edges.

>

I have looked into igraph (http://cneurocvs.rmki.kfki.hu/igraph/doc/
python/index.html) and networkX (https://networkx.lanl.gov/wiki) for
generating a file to store the graph, and I have also looked into
Graphviz for visualization. I'm just not sure which modules are
best. I need to be able to do the following:

>
Afaik Graphviz is not good at abstracting the graph, which you may need here.
A page with 63,000 circles on it, and 500,000 edges will probably come out of
the printer as a black sheet of paper.
(8"x11" paper, 1 circle is 1/5", then you have only 2200 circles at one sheet.
You need a factor 28 more circles which leads to a circle of about 0.007"..)
>
Even when actual paper format would not be a problem, you will need some
abstraction and/or tools, as you will not find anything manually in an ocean of
63,000 elements.
>
One area you may want to start looking for tools is in state graphs, where the
set of possible states of an entire system is unfolded. These things go up to
over a million states, so you only have a 'small' problem there...
>
>
>
>
>

1) The names of my nodes are not known ahead of time, so I will
extract the title from all the HTML files to name the nodes prior to
parsing the files for hyperlinks (edges).

>

2) Every file will be parsed for links and nondirectional connections
will be drawn between the two nodes.

>

3) The files might link to each other so the graph package needs to
be able to check to see if an edge between two nodes already exists,
or at least not double draw connections between the two nodes when
adding edges.

>

I'm relatively new to graph theory so I would greatly appreciate any
suggestions for filetypes. I imagine doing this as a python
dictionary with a list for the edges and a node:list paring is out of
the question for such a large graph?

>

Perhaps a dictionary where the key is a node and the value is a set of
destination nodes?

>
For undirected edges, you could make an Edge class and have a set of Edge's
(where two Edge objects are equal when they connect the same nodes).
I don't expect 500,000 elements in a set to be a problem.
>
Sincerely,
Albert- Hide quoted text -
>
- Show quoted text -

Readability counts: Do you need to be able to examine the datafile by
hand? If not, investigate the 'shelve' module. You can go two ways.
One is to map each node to a list of nodes. It's probably more
intuitive, but it needs repeated code:

shelf['pageA'].add( pageB )
shelf['pageB']= pageB
shelf['pageB'].add( pageA )

Which is incorrect as is anyway-- updates to shelved objects aren't
committed automatically.

a= shelf['pageA']
a.add( pageB )
shelf['pageA']= a
b= shelf['pageB']
b.add( pageA )
shelf['pageB']= b

is closer. Otherwise, two is to store a 2-tuple of node keys in a
secondary file.

shelfEdges[ frozenset(
shelfVerts[ 'pageA' ], shelfVerts[ 'pageB' ]
) ]= True

**boggom@comcast.net** · Jun 27 '08, 04:30 PM

Re: very large graph

Drawing a large graph like this is not
very insightful by itself, and doing this well
is still an art form.
Many cool visualizations, and all
very domain and question dependent,
can be found at

VisualComplexity - VISUAL COMPLEXITY

http://www.visualcomplexity.com/vc/

Choose the layout that fits the mechanism: tree for hierarchy, matrix for dense ties, flow for process, radial for cycles.

You can also search on flickr for network
and graph drawing.

Much of it is eyecandy though. What do you
really want to know?

Your graph is not overly huge for python, provided
you do not want to compute nonlocal statistics
such as betweenness metrics. (If you have
a laptop running windows and with
less than 1GB RAM, then you have a really large graph.)

Given that you do not know much about graph theory,
I would suggest that networkx applied to a
subset of your large website --- one of the
representative branches --- will allow
for the fastest way to figure out what you
want to do. Python provides access to many other
libraries for html mangling or even webpage
analysis. Imagine writing C code to ask questions
like: how many pdfs and images? how big are they?
when were these pages last updated? etc.
You can come up with 10 questions faster
than you can write C code, and this where
python has its great advantage.

You can learn graph theory while using these libraries,
on one of the smaller branches of your website tree.
Even interactively by using ipython.
This provides at least a feeling of great power while
stumbling in the dark.

Many of your edges are probably repetitions
associated with navigation menus to provide
a standard look-and-feel. See how much of that
you can strip out and find the cross-links that took
effort to insert an manage. (I suggest
doing the analyis on a digraph rather than a graph,
even if you want to draw it as graph.)

For visualization, the new ubigraph is quite
fast compared to graphviz.
See the cool networkx + ubigraph
video at http://ubietylab.net/blog/

Ask on the networkx mailinglist when stuck.

very large graph

very large graph

Comment

Comment

Comment

Comment

Comment