This is an archived cached-text copy of the developerWorks article. Please consider viewing the original article at: IBM developerWorks




Explore relationships among Web pages visually

Use HTML::SimpleLinkExtor, khtml2png, feh, and Graphviz to create new ways of visualizing any Web page's links



Level: Intermediate

Nathan Harrington (harrington.nathan@gmail.com), Programmer, IBM

15 May 2007

The Graphviz program from AT&T Research and others is a fantastic tool for automating the visualization of complicated link sets. This article shows how to combine the Graphviz tool set with Web-page thumbnail generators to create new ways of visualizing any Web page's link structures. You can use these techniques and descriptions to refine your display logic, and create directed and undirected Graphviz charts to enhance your understanding of organizational, software, and other complex linked data sets.

Requirements

Hardware

Any PC manufactured after 2000 should provide plenty of horsepower for compiling and running the code. If you intend to map a large number of nodes (more than 50), or to render graphs with large Web-page thumbnails, make sure you have a gigahertz processor and multiple gigabytes of RAM. Systems with lesser specifications will need long processing times whenever they have to page to disk.

Software

You'll need Graphviz, of course, and the HTML::SimpleLinkExtor module from CPAN. If you're running on Linux®, the khtml2png tool from Simon MacMullen can be used to generate Web-page thumbnails automatically. If you're on Windows®, there are many options for generating Web-page thumbnails; choosing among the free and no-cost tools available for Windows is left to the reader. You'll also do well to find a simple and fast image viewer; I recommend feh by Tom Gilbert. See Resources for these tools.





Installation

If you are running on Windows, download and run the Graphviz installation package. Note the location of the installed binaries because you'll need to use these later during the rendering process. On Linux, use your favorite package manager or get the source and binaries directly from Graphviz.org.

You'll also need the HTML::SimpleLinkExtor module by brian d foy. Assuming you already have a functional Perl environment on Windows, you can download and install the package automatically with the command ppm install HTML::SimpleLinkExtor. For Linux, simply run cpan -i HTML::SimpleLinkExtor.

On Linux, you can also use the khtml2png program to quickly generate Web-page thumbnails. On this development system, which is running Fedora Core 3, khtml2png V1.0.2 performs as expected. You may find you have to use a different version of khtml2png, depending on the Qt and KDE libraries on your system. Check Resources for a direct link to the khtml2png program. Windows users can check Resources for various programs that create Web-page thumbnails on Windows.
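
Before moving on, it can help to confirm that the pieces described above are actually in place. The following is a minimal sketch, not part of the article's sample download: the script name and the have_module and find_program helpers are hypothetical, and the check assumes the Graphviz binaries are on your PATH. Listing 3 later runs khtml2png from the current directory, so a "not on PATH" result for that tool is not necessarily a problem.

```perl
#!/usr/bin/perl
# checkPrereqs.pl - hypothetical helper, not part of the article's download
use strict;
use warnings;
use File::Spec;

# return 1 if the named Perl module can be loaded, 0 otherwise
sub have_module {
    my ($module) = @_;
    ( my $file = "$module.pm" ) =~ s{::}{/}g;
    return eval { require $file; 1 } ? 1 : 0;
}

# return the full path of a program found on PATH, or undef
sub find_program {
    my ($name) = @_;
    my @exts = $^O eq 'MSWin32' ? ( '', '.exe' ) : ('');
    for my $dir ( File::Spec->path ) {
        for my $ext (@exts) {
            my $candidate = File::Spec->catfile( $dir, $name . $ext );
            return $candidate if -f $candidate && -x _;
        }
    }
    return;
}

for my $module (qw(HTML::SimpleLinkExtor)) {
    printf "%-24s %s\n", $module, have_module($module) ? "ok" : "MISSING";
}
for my $program (qw(fdp khtml2png feh)) {
    my $path = find_program($program);
    printf "%-24s %s\n", $program, defined $path ? $path : "not on PATH";
}
```

On Windows, where you noted the location of the Graphviz binaries during installation, add that directory to PATH before running the check.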





General strategy, desired visualization

What do we want to see?

One of the most difficult parts of developing a visualization is deciding what exactly you'd like to see. After trying various strategies, with some trial and error, we settled on the following visualization: an undirected graph of Web-page thumbnails with the arrows between nodes having a thickness and color directly related to the number of links between pages. Sounds simple enough. Here's how it's done.

How do we represent it?

The first step in building our roughly defined graph is to acquire the number of links between interrelated Web pages. One of the easiest ways to accomplish this is with the HTML::SimpleLinkExtor module. Listing 1 shows the topLinks.pl program used to extract the top N links from the specified HTML file.


Listing 1. topLinks.pl Print frequency count of links from an HTML page
                
#!/usr/bin/perl -w
# topLinks.pl - print the top N links from an html file using SimpleLinkExtor
use strict;
use HTML::SimpleLinkExtor;

die "usage: topLinks.pl <html_file> <number>\n" unless @ARGV == 2;

my $extor = HTML::SimpleLinkExtor->new();
$extor->parse_file("$ARGV[0]");

my $maxLinks = $ARGV[1];
my %linkHash = ();
my @a_hrefs  = $extor->a;

for my $link ( @a_hrefs )
{
  next unless $link =~ m{^https?://}i;  # only process absolute http(s) links

  # remove the scheme, including the extra slash of a triple-slash prefix
  $link =~ s{^https?:/+}{}i;

  # remove everything after the first slash
  $link = substr($link,0,index($link,'/')) if $link =~ m{/};

  # drop the leading label (www, for example) unless already domain.tld
  $link = substr($link,index($link,".")+1) unless ($link =~ tr/.//) == 1;

  $linkHash{$link}++;

}#for each link

my $linkCount = 0;
for my $key( sort {$linkHash{$b} <=>$linkHash{$a}} keys %linkHash )
{
  print "$key $linkHash{$key}\n";
  last unless $linkCount < $maxLinks-1;
  $linkCount++;
}

After setting up the variables and the usage information, the specified HTML file is processed by the SimpleLinkExtor module. The for loop then processes only the HTTP links and reduces each one to the form domain.tld. If the link is http://www.ibm.com/index.html, for example, the http:// prefix, everything after the slash, and everything up to and including the first period are trimmed off, leaving ibm.com. This gives us a frequency count of all domains linked to from the input HTML file. With this hash of domains and their frequency counts, we sort the list by count in descending order and print out the first N entries. Download the source HTML of your desired starting page as startPage.html and run the program with the command perl topLinks.pl startPage.html 10 > top10Links.
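
The link-to-domain reduction described above can be tried on its own, without an HTML file. The sketch below is an illustration rather than the article's code: the link_to_domain name is hypothetical, and it uses regular expressions where Listing 1 uses substr and index. Like Listing 1, it strips only one leading label, so a host such as a.b.example.com becomes b.example.com rather than example.com.

```perl
use strict;
use warnings;

# Reduce an absolute http(s) link to its host name, minus one leading
# label such as "www". Returns undef for anything else (relative links,
# mailto:, and so on).
sub link_to_domain {
    my ($link) = @_;
    return unless $link =~ m{^https?:/+([^/]+)}i;   # capture the host name
    my $host = $1;
    # drop the leading label only when more than one dot is present
    $host =~ s/^[^.]+\.// if ( $host =~ tr/.// ) > 1;
    return $host;
}

# count domains for a few sample links (sample data, not real page content)
my %linkHash;
for my $link (
    'http://www.ibm.com/index.html',
    'http://ibm.com/',
    'https://rss.slashdot.org/Slashdot',
    'about.html',
  )
{
    my $domain = link_to_domain($link);
    next unless defined $domain;
    $linkHash{$domain}++;
}

# highest count first; ties broken alphabetically for repeatable output
for my $key ( sort { $linkHash{$b} <=> $linkHash{$a} || $a cmp $b } keys %linkHash ) {
    print "$key $linkHash{$key}\n";
}
```

Breaking ties alphabetically is a design choice of this sketch; sorting on the count alone leaves the order of equally frequent domains unspecified.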

For this example, we used slashdot.org/index.html as the HTML file, and the contents of the top10Links file are shown in Listing 2. Note that your values will be different due to updates to the Slashdot.org page. We'll use this data as the input to our Graphviz dot language-generation program.


Listing 2. Frequency count of Slashdot.org main page links
                
slashdot.org 15
ostg.com 6
bfast.com 3
blogs.com 3
jhuapl.edu 3
arstechnica.com 2
doubleclick.net 2
wikipedia.org 2
wsj.com 2
thinkgeek.com 2





Web-page thumbnail generation

If you are running on Linux, one of the easiest ways to generate Web-page thumbnails is the khtml2png program. Using it on the command line is straightforward for creating thumbnails even on pages that include Flash content and other embellishments. For our purposes, we want to give the Flash content enough time to load, so we will use the --flash-delay option. Consider the following one-liner to build Web-page thumbnails for each domain mentioned in the top10Links file.


Listing 3. Web-page thumbnail generation one-liner
                
cat top10Links  | \
  perl -lane '$r=`./khtml2png --flash-delay 3 \
  --scaled-width 200 --scaled-height 200 http://$F[0] thumbnails/$F[0].png`'

Note that the \ characters are present for this article's formatting and need to be removed for this one-liner to work. Also note that the thumbnails are placed in the thumbnails directory and have the filename domain_name.png (thumbnails/ostg.com.png, for example). This is important for ease of processing in the dot syntax-generation step described below.
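
If the one-liner is hard to read or adapt, the same step can be written as a short script. This is a sketch under assumptions: the thumbnail_command and domains_from names are hypothetical, the khtml2png options are only those used in Listing 3, and the input is assumed to have the domain in the first column, as in Listing 2. As written, it only prints the commands it would run.

```perl
use strict;
use warnings;

# Build the khtml2png argument list for one domain (options from Listing 3).
sub thumbnail_command {
    my ( $domain, $dir ) = @_;
    return (
        './khtml2png', '--flash-delay', 3,
        '--scaled-width', 200, '--scaled-height', 200,
        "http://$domain", "$dir/$domain.png",
    );
}

# Read "domain count" lines from a filehandle and return the domains in order.
sub domains_from {
    my ($fh) = @_;
    my @domains;
    while ( my $line = <$fh> ) {
        my ($domain) = split ' ', $line;
        push @domains, $domain if defined $domain;   # skips blank lines
    }
    return @domains;
}

# sample input held in memory; open top10Links instead for real use
my $sample = "slashdot.org 15\nostg.com 6\n";
open my $fh, '<', \$sample or die "cannot open sample data: $!";

for my $domain ( domains_from($fh) ) {
    my @cmd = thumbnail_command( $domain, 'thumbnails' );
    print "@cmd\n";
    # to actually generate the thumbnail, replace the print with:
    # system(@cmd) == 0 or warn "khtml2png failed for $domain\n";
}
close $fh;
```

Passing the arguments to system as a list, rather than as one string, keeps the shell from interpreting any unusual characters in a domain name.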

If you are using Windows, there are many options for creating Web-page snapshots. Use one of the tools for Windows listed in Resources.





Undirected graph dot language-syntax generation

Building the dot file

Now that the Web-page thumbnails are in place, the generation of the graph description language can begin.

The basic Graphviz language syntax focuses on the concept of nodes and edges. We'll also make use of the less frequently used concepts of clustering and shapefiles. For more detail about the Graphviz dot language and the amazing graph-generation capabilities of Graphviz, see Resources. Listing 4 shows an example of the dot-language syntax necessary for creating the type of graph we want to produce:


Listing 4. Example dot syntax file
                
subgraph "cluster_jhuapl.edu" { label="jhuapl.edu"; labelloc="t"; "jhuapl.edu_icon"};

"jhuapl.edu_icon" [label="", shape=box, 
style=invis, shapefile="snapsSlash/jhuapl.edu.png"];

edge [ color="#93ca36", arrowtail="normal", arrowsize="3", 
arrowhead="none", style="setlinewidth(9)" ];

"cluster_jhuapl.edu" -- "cluster_slashdot.org";

Each thumbnail node needs to be placed together with its label in a coherent manner. One of the easiest ways to do this is to create a subgraph and place both the image and the label inside it. The first statement of Listing 4 defines a subgraph called cluster_jhuapl.edu. This cluster has a label at the top and contains the node known as jhuapl.edu_icon. The second statement defines the jhuapl.edu_icon node with no label, an invisible bounding box, and the appropriate Web-page thumbnail file as its image. The third statement defines the attributes of the edge between the home node and the current node: a light green color, a large arrow pointing at the linked-to node, and an increased thickness to emphasize the linkage. The final statement specifies the linkage between the home node and the linked-to node.

The dot syntax language-building program shown below will process the various edge attributes automatically to emphasize the linkages and create the graph we defined loosely. Listing 5 shows the buildDot.pl program for generating undirected graphs.


Listing 5. buildDot.pl Generate dot-language undirected graph
                
#!/usr/bin/perl -w
# buildDot.pl - generate dot language undirected graph
use strict;

die "usage: buildDot.pl <home_node> <thumbnail_dir>" unless @ARGV == 2;
my( $homeNode, $thumbsDir ) = @ARGV;

print "graph g {\n";

while(<STDIN>)
{
  # input lines look like "domain count", as in Listing 2
  my( $nodeName, $lineWidth ) = split;

  # specify clusters
  print qq(subgraph "cluster_$nodeName" );
  print qq({ label="$nodeName"; labelloc="t"; "${nodeName}_icon"};\n);
  print qq("${nodeName}_icon" [label="", shape=box, style=invis, );
  print qq(shapefile="$thumbsDir/$nodeName.png"];\n);

  # prevent recursive link
  if( $nodeName eq $homeNode ){ print qq(\n); next }

  # edge color classes, width emphasis
  my $lineColor = "#000000";
  if( $lineWidth > 8 ){ $lineColor = "#ff9900" }
  elsif( $lineWidth > 5 ){ $lineColor = "#ca7836" }
  elsif( $lineWidth > 3 ){ $lineColor = "#cabb36" }
  elsif( $lineWidth > 2 ){ $lineColor = "#93ca36" }
  $lineWidth *= 3;

  # specify edge
  print qq(edge [ color="$lineColor", arrowtail="normal", arrowsize="3", );
  print qq(style="setlinewidth($lineWidth)" ];\n);

  # specify linkage
  print qq("cluster_$nodeName" -- "cluster_$homeNode";\n\n);

}#while stdin

print qq(}\n);

With the option check and usage information in place, the program specifies that this is an undirected graph with the graph g{ line. We will use the link frequency count and domain names from Listing 2 as our lineWidth and node names, respectively. The first section inside the read-from-stdin loop prints out the cluster description, along with the node to display inside the cluster. The Web-page thumbnail directory specified on the command line is used to give each node the correct shapefile image.

In the next section, if the cluster and shapefile node that has just been printed is the home node, the program skips to the next input line without writing edge or linkage information. This prevents self-referencing in the graph description. While the Graphviz parser for creating undirected graphs is smart enough to handle some of these self-references, we'll eliminate the possibility to ensure an easier transition to other graphing algorithms if desired.
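
One detail of the home-node check is worth a closer look: a pattern match treats the home node name as a regular expression, so the period matches any character and the name can match in the middle of a longer one. The sketch below compares a pattern match with an exact string comparison; the is_home_node name and the sample domain names are made up for illustration.

```perl
use strict;
use warnings;

# exact comparison: true only when the two names are identical
sub is_home_node {
    my ( $nodeName, $homeNode ) = @_;
    return $nodeName eq $homeNode ? 1 : 0;
}

for my $nodeName (qw(slashdot.org notslashdot.org slashdotXorg.com)) {
    my $pattern = $nodeName =~ /slashdot.org/                 ? 'yes' : 'no';
    my $exact   = is_home_node( $nodeName, 'slashdot.org' )   ? 'yes' : 'no';
    print "$nodeName: pattern match=$pattern, exact match=$exact\n";
}
```

All three names satisfy the pattern match, but only slashdot.org passes the exact comparison, so the exact form never drops the edge for a different domain that happens to contain the home node's name.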

The final section creates a simple graduated coloring system using five intervals. If the link frequency count is small, the line is black; if it is large, the line is bright orange. Counts in between produce shades of green, yellow, and brownish orange, and every edge is three times as thick as its link frequency count. After printing the full edge descriptor, the program prints the linkage between the home node cluster and the current cluster.
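
To see which color and width a given link count produces, the five intervals can be pulled out into a small lookup. This is a sketch, not the article's code: the edge_style name and the table layout are hypothetical, while the thresholds, colors, and the factor of three are copied from Listing 5.

```perl
use strict;
use warnings;

# thresholds and colors copied from Listing 5, highest threshold first
my @COLOR_CLASSES = (
    [ 8, '#ff9900' ],    # bright orange
    [ 5, '#ca7836' ],    # brownish orange
    [ 3, '#cabb36' ],    # yellow
    [ 2, '#93ca36' ],    # light green
);

# return (color, line width) for a link frequency count
sub edge_style {
    my ($count) = @_;
    my $color = '#000000';    # default: black
    for my $class (@COLOR_CLASSES) {
        my ( $threshold, $classColor ) = @$class;
        if ( $count > $threshold ) { $color = $classColor; last }
    }
    return ( $color, $count * 3 );
}

for my $count ( 15, 6, 3, 2 ) {
    my ( $color, $width ) = edge_style($count);
    print "$count links -> color $color, setlinewidth($width)\n";
}
```

With the Listing 2 data, jhuapl.edu (3 links) gets the light green #93ca36 and a width of 9, which matches the edge shown in Listing 4. Keeping the thresholds in one table makes it easy to add intervals or change colors without touching the chain of comparisons.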

Building the graph

On Linux, generate the dot syntax graph file with the command cat top10Links |perl buildDot.pl slashdot.org thumbnails > dotFile.slash. On Windows, the equivalent command is type top10Links |perl buildDot.pl slashdot.org thumbnails > dotFile.slash.

Please note: Figure 1

The arrow colors and line widths in this graphic may not correspond to the data presented in the listings above. Certain restrictions on graphic content are imposed at developerWorks, and it is hoped that the spirit if not the precision of the intended graphic is conveyed. Please check the downloads section for per-pixel correct examples.

You'll now be able to use the Graphviz program to do the actual graphing, and your choice of rendering algorithm will greatly affect the output. Of the Graphviz filters, neato and fdp are the ones designed for drawing undirected graphs; we use fdp here, although the others will process the dotFile.slash syntax without error. Create your bitmapped output with the command fdp dotFile.slash -Tpng -o graph.png. This command is the same on Linux and Windows.
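
As a recap, the scripted steps can be collected in one place. The sketch below only assembles and prints the command lines used in this article; the pipeline_commands name and its option names are hypothetical, and the thumbnail step is left out because it depends on your platform. It uses input redirection (<) in place of cat or type, which both Linux shells and the Windows command prompt accept.

```perl
use strict;
use warnings;

# Return the shell commands for the three scripted steps, in order.
sub pipeline_commands {
    my (%opt) = @_;
    return (
        "perl topLinks.pl $opt{html} $opt{top} > $opt{links}",
        "perl buildDot.pl $opt{home} $opt{thumbs} < $opt{links} > $opt{dot}",
        "fdp $opt{dot} -Tpng -o $opt{png}",
    );
}

my @commands = pipeline_commands(
    html   => 'startPage.html',
    top    => 10,
    links  => 'top10Links',
    home   => 'slashdot.org',
    thumbs => 'thumbnails',
    dot    => 'dotFile.slash',
    png    => 'graph.png',
);
print "$_\n" for @commands;

# To run the steps instead of printing them:
# for my $command (@commands) {
#     system($command) == 0 or die "failed: $command\n";
# }
```

Generate the thumbnails after the first command and before the second, so that the shapefile paths written by buildDot.pl point at existing images.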

Open the graph.png file in your favorite viewer, and you will see something like the graph shown in Figure 1.


Figure 1. Undirected graph output





Conclusion, further examples

The logical steps and programs presented here give you a good start on creating Web-page linkage visualizations. Make sure to check out the links to the full Graphviz documentation in Resources, where you can read about creating clickable image maps for the nodes and the edges. Custom edge shapes, colors, and many other modifications to the graph syntax-generation steps are suitable for enhancing the visualizations. Consider modifying your script to show every page on your Web site, with bright red edges to represent the various 404 paths your visitors may follow. Graphviz and its directed and undirected filters, along with the text-processing power of Perl, are a great combination for creating visualizations.






Download

Description     Name                  Size    Download method
Sample files    webPageViz_0.2.zip    48KB    HTTP


Resources

Learn
  • Browse all the open source content on developerWorks.

  • To listen to interesting interviews and discussions for software developers, check out developerWorks podcasts.

  • Stay current with developerWorks' Technical events and webcasts.

  • Check out upcoming conferences, trade shows, webcasts, and other Events around the world that are of interest to IBM open source developers.

  • Visit the developerWorks Open source zone for extensive how-to information, tools, and project updates to help you develop with open source technologies and use them with IBM's products.

  • Watch and learn about IBM and open source technologies and product functions with the no-cost developerWorks On demand demos.


Get products and technologies
  • AT&T Research created the Graphviz graph visualization software.

  • SourceForge hosts the khtml2png Web-page thumbnail creator.

  • You can find Perl for Windows at ActiveState.

  • TopShareware.com hosts the Web Page Thumbnail Generator 1.0.6 for Windows.

  • You can generate Web-page thumbnails online using WebShots Pro.

  • Another free Web-page thumbnail creator for Windows is Thumbshots, a provider of Web preview technology.

  • Innovate your next open source development project with IBM trial software, available for download or on DVD.

  • Download IBM product evaluation versions, and get your hands on application development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.



About the author

Nathan Harrington is a programmer at IBM currently working with Linux and resource-locating technologies.



