Level: Introductory
Nathan Harrington (harrington.nathan@gmail.com), Programmer, IBM
12 Aug 2008
The Find command in Firefox locates the user-specified text in the body of a Web
page. The command is an easy-to-use tool that works well enough for most users most of
the time. Sometimes, however, a more powerful Find-like tool would make locating text
easier. This article shows how to build a tool that isolates relevant text in Web pages
faster by detecting the presence and absence of nearby words.
Native text-search capabilities in Firefox provide useful highlighting of contiguous
search terms and phrases. Additional Firefox extensions are available to incorporate
regular-expression searches and other text-highlighting capabilities. This article
presents tools and code needed to add your own text-searching interface to Firefox.
With a Greasemonkey user script and some custom algorithms, you'll be able to add
grep -v functionality to text searches; that is, highlighting a first search term where
a second one is not located nearby.
Requirements
Hardware
Text searches on typical Web pages with older (pre-2002) hardware are nearly
instantaneous. However, the code presented here is not designed for speed and may
require faster hardware to perform at a user-friendly speed on large Web pages.
Software
The code was developed for use with Firefox V2.0 and Greasemonkey V0.7. Newer versions
of both will require testing and possibly modifications to ensure their functionality.
As a Greasemonkey script, the code presented here should work on any operating system
that supports Firefox and Greasemonkey. We tested on Microsoft® Windows®
and Linux® Ubuntu V7.10 releases.
Greasemonkey and Firefox extensions
User modification to Web pages is the role Greasemonkey fulfills, and the code
presented here uses the Greasemonkey framework to search for and highlight the relevant
text. See Resources for the Greasemonkey Firefox extension.
Examples of what this Greasemonkey script is designed to do
Those familiar with the UNIX grep command and its common -v option know how
indispensable grep is for extracting relevant lines of text from a file. Text
files conforming to the UNIX tradition of simplicity generally store their text in a
line-by-line format that makes it easy to find words close together. The -v option
prints lines where the specified text is not found.
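As a rough illustration of the two behaviors just described, the following sketch mimics grep and grep -v over an array of lines. The function names grepLines and grepInvert and the sample data are hypothetical; none of this is part of greppishFind.user.js.

```javascript
// Hypothetical helpers for illustration only; not part of the user script.

// Keep the lines that contain the word, like `grep word`.
function grepLines(lines, word) {
  return lines.filter(function (line) { return line.indexOf(word) !== -1; });
}

// Keep the lines that do NOT contain the word, like `grep -v word`.
function grepInvert(lines, word) {
  return lines.filter(function (line) { return line.indexOf(word) === -1; });
}

// Piping one into the other: lines with "2008" but without "PM".
var sample = [
  "12 Aug 2008 09:15 AM",
  "12 Aug 2008 04:30 PM",
  "03 Jan 2007 10:00 AM"
];
var morning = grepInvert(grepLines(sample, "2008"), "PM");
```

Chaining the two filters is the line-oriented equivalent of the behavior the user script approximates on rendered Web page text.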
Unlike text files, Web pages generally divide text with tags and other markers
rendered into lines by the browser. A wide variety of browser window sizes makes it
difficult to isolate nearby text based on expected line positions. Tables, links, and
other text markup also make it difficult to isolate text that is in the same "line."
Algorithms in this article are designed to address some of these difficulties by
providing a simple grep-like functionality piped to a function that works like grep's -v
option. This allows the user to find a certain word of text, then only highlight
entries where a different word is not nearby. Figure 1 shows what this can look like.
Figure 1. Example of DOM and DOM hierarchy searches
In the top portion of the image, the search text of "DOM" is highlighted by the script.
In the bottom portion, notice how only the first three "DOM" entries are highlighted
because the second search text of "hierarchy" is found in close proximity to the third
"DOM."
Consider Figure 2.
Figure 2. Example of 2008 and 2008 PM searches
The first portion of the image shows all the 2008 entries, while the second portion
only shows the before-noon entries due to the -v keyword of
PM. Read on for full details and further examples of how to implement this functionality.
greppishFind.user.js Greasemonkey user script
An introduction to the unique aspects of the Greasemonkey programming environment is
beyond the scope of this article. Familiarity with Greasemonkey, including how to
install, modify, and debug scripts, is assumed. Consult the Resources for more
information about Greasemonkey and how to get started programming your own user scripts.
Generally speaking, the greppishFind.user.js user script is started on a page load,
provides a text area after a specific key combination is entered, and performs
highlighting searches based on user-entered text. Listing 1 shows the beginning of the greppishFind.user.js user script.
Listing 1. greppishFind.user.js program heading
// ==UserScript==
// @name greppishFind
// @namespace IBM developerWorks
// @description grep and grep -v function-ish for one or two word searches
// ==/UserScript==
var boxAdded = false; // user interface for search active
var dist = 10; // proximity distance between words
var highStart = '<high>'; // begin and end highlight tags
var highEnd = '</high>';
var lastSearch = null; // previous highlight text
window.addEventListener('load', addHighlightStyle, true);
window.addEventListener('keyup', globalKeyPress, true);
After defining the required metadata that describes the user script and its function,
global variables, and highlighting tags, the load and keyup event listeners are added
to process user-generated events. Listing 2 details the addHighlightStyle function called by the load event listener.
Listing 2. addHighlightStyle function
function addHighlightStyle(css)
{
var head = document.getElementsByTagName('head')[0];
if( !head ) { return; }
var style = document.createElement('style');
var cssStr = "high {color: black; background-color: yellow; }";
style.type = 'text/css';
style.innerHTML = cssStr;
head.appendChild(style);
}//addHighlightStyle
The function creates a new node in the current DOM hierarchy with the appropriate
highlighting information. In this case, it's a simple black-on-yellow text attribute.
Listing 3 shows the code of the other event listener, globalKeyPress, as well as the boxKeyPress function.
Listing 3. globalKeyPress and boxKeyPress functions
function globalKeyPress(e)
{
// add the user interface text area and button, set focus and event listener
if( boxAdded == false && e.altKey && e.keyCode == 61 )
{
boxAdded = true;
var boxHtml = "<textarea wrap='virtual' id='sBoxArea' " +
"style='width:300px;height:20px'></textarea>" +
"<input name='btnHighlight' id='tboxButton' " +
"value='Highlight' type='submit'>";
var tArea = document.createElement("div");
tArea.innerHTML = boxHtml;
document.body.insertBefore(tArea, document.body.firstChild);
tArea = document.getElementById("sBoxArea");
tArea.focus();
tArea.addEventListener('keyup', boxKeyPress, true );
var btn = document.getElementById("tboxButton");
btn.addEventListener('mouseup', processSearch, true );
}//if alt = pressed
}//globalKeyPress
function boxKeyPress(e)
{
if( e.keyCode != 13 ){ return; }
var textarea = document.getElementById("sBoxArea");
textarea.value = textarea.value.substring(0,textarea.value.length-1);
processSearch();
}//boxKeyPress
Catching each keystroke and listening for a specific combination is the purpose of
globalKeyPress. When the Alt+= keys are pressed (that is, hold Alt and press the = key),
the user interface for the search box is added to the current DOM. This interface
consists of a text area for entering the keywords and a submit button labeled
Highlight. After the new items are added, the text area needs to be selected by the
getElementById function to set the focus correctly. Event listeners are then added to
process the keystrokes in the text area, as well as to execute the search when the
Highlight button is clicked.
The second function in Listing 3 processes each keystroke in the text area. If the
Enter key is pressed, the trailing newline is removed from the text area's value and the
processSearch function is executed. Listing 4 details the processSearch function.
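The two key checks described above can be isolated into small functions that are easy to try outside the browser. This is only a sketch: isActivationKey and stripTrailingNewline are hypothetical names, and the event objects are plain stand-ins for real DOM key events. The key code 61 is the value Listing 3 tests for the = key.

```javascript
// Hypothetical helpers; plain objects stand in for DOM key events here.
var ACTIVATION_KEY_CODE = 61; // value Listing 3 compares against for the = key

// True when Alt is held and the = key is released.
function isActivationKey(e) {
  return e.altKey === true && e.keyCode === ACTIVATION_KEY_CODE;
}

// Remove a single trailing newline, but only if one is actually present.
// Listing 3 instead drops the last character unconditionally.
function stripTrailingNewline(value) {
  return value.replace(/\r?\n$/, "");
}
```

Checking for the newline before removing it is a small design variation: it avoids trimming a real character if the text area's value ever arrives without the newline.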
Listing 4. processSearch function
function processSearch()
{
// remove any existing highlights
if( lastSearch != null )
{
var splitResult = lastSearch.split( ' ' );
removeIndicators( splitResult[0] );
}//if last search exists
var textarea = document.getElementById("sBoxArea");
if( textarea.value.length > 0 )
{
var splitResult = textarea.value.split( ' ' );
if( splitResult.length == 1 )
{
oneWordSearch( splitResult[0] );
}else if( splitResult.length == 2 )
{
twoWordSearch( splitResult[0], splitResult[1] );
}else
{
textarea.value = "Only two words supported";
}//if number of words
}//if longer than required
lastSearch = textarea.value;
}//processSearch
Each search is stored in the lastSearch variable so that its highlighting can be
removed the next time processSearch is called. After the removal, the search query is
highlighted using oneWordSearch if there is only one query word, or twoWordSearch
if the grep -v functionality is desired. Listing 5 shows the details of the
removeIndicators function.
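The dispatch logic in processSearch can be sketched as a pure function that classifies the query text. The name parseQuery and the returned object shape are assumptions made for this example. Unlike Listing 4, which splits on a single space, this variant trims the input and splits on runs of whitespace.

```javascript
// Hypothetical helper: classify the query the way processSearch dispatches it.
function parseQuery(raw) {
  // Trim and split on any run of whitespace, then drop empty pieces.
  var words = raw.trim().split(/\s+/).filter(function (w) { return w.length > 0; });
  if (words.length === 1) {
    return { kind: "one", wordOne: words[0] };
  }
  if (words.length === 2) {
    return { kind: "two", wordOne: words[0], wordTwo: words[1] };
  }
  // Zero words, or more than the two words the script supports.
  return { kind: words.length === 0 ? "empty" : "unsupported" };
}
```

A caller could then invoke oneWordSearch for kind "one", twoWordSearch for kind "two", and show the "Only two words supported" message for kind "unsupported".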
Listing 5. removeIndicators function
function removeIndicators( textIn )
{
// use XPath to quickly extract all of the rendered text
var textNodes = document.evaluate( '//text()', document, null,
XPathResult.UNORDERED_NODE_SNAPSHOT_TYPE,
null );
for (var i = 0; i < textNodes.snapshotLength; i++)
{
var textNode = textNodes.snapshotItem(i);
if( textNode.data.indexOf( textIn ) != -1 )
{
// find the appropriate parent node with the innerHTML to be removed
var getNode = getHtml( textNode );
if( getNode != null )
{
var temp = getNode.parentNode.innerHTML;
var reg = new RegExp( highStart, "g");
temp = temp.replace( reg, "" );
reg = new RegExp( highEnd, "g");
temp = temp.replace( reg, "" );
getNode.parentNode.innerHTML = temp;
}//if correct parent found
}//if word found
}//for each text node
}//removeIndicators
Instead of traversing the DOM tree manually, removeIndicators uses XPath to extract the
text nodes in the document quickly. If any of the text nodes contains the lastSearch
text (the most recently highlighted word), getHtml finds the appropriate parent node,
and the highlighting is removed. Note that combining the extraction of innerHTML and
the assignment of innerHTML into one step will cause various issues, so temporarily
assigning the innerHTML to an external variable is required. Listing 6 is the getHtml
function that shows in detail how to find the appropriate parent node.
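The read, modify, write-back pattern used by removeIndicators can be tried on a plain string. In this sketch, stripHighlight is a hypothetical helper and fakeNode is a plain object standing in for a DOM element. It uses split and join rather than the regular expressions in Listing 5, which is a substitution made for the example.

```javascript
// Same tag strings as the globals in Listing 1.
var highStart = "<high>";
var highEnd = "</high>";

// Hypothetical helper: remove every opening and closing highlight tag.
function stripHighlight(html) {
  return html.split(highStart).join("").split(highEnd).join("");
}

// Plain object standing in for a DOM element with an innerHTML property.
var fakeNode = { innerHTML: "The <high>DOM</high> tree and the <high>DOM</high> API" };

// Read into a temporary variable, transform, then assign back.
var temp = fakeNode.innerHTML;
temp = stripHighlight(temp);
fakeNode.innerHTML = temp;
```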
Listing 6. getHtml function
function getHtml( tempNode )
{
// walk up the tree to find the appropriate node
var stop = 0;
while( stop == 0 )
{
if( tempNode.parentNode != null &&
tempNode.parentNode.innerHTML != null )
{
// make sure it contains the tags to be removed
if( tempNode.parentNode.innerHTML.indexOf( highStart ) != -1 )
{
// make sure it's not the title or greppishFind UI node
if( tempNode.parentNode.innerHTML.indexOf( "<title>" ) == -1 &&
tempNode.parentNode.innerHTML.indexOf("btnHighlight") == -1)
{
return( tempNode );
}else{ return(null); }
// the highlight tags were not found, so go up the tree
}else{ tempNode = tempNode.parentNode; }
// stop the processing when the top of the tree is reached
}else{ stop = 1; }
}//while
return( null );
}//getHtml
While walking up the DOM tree in search of the innerHTML
with the highlighting tags inserted, it is important to disregard two specific nodes.
The nodes containing title and btnHighlight should not be updated, as changes in
these nodes cause the document to display incorrectly. When the correct node is found,
regardless of the number of parents up the DOM tree it is, the node is returned and the
highlighting removed. Listing 7 is the first of the functions that adds highlighting to the document.
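The upward walk can be exercised without a browser by using plain objects that carry parentNode and innerHTML properties. This is a sketch under that assumption; findHighlightedAncestor is a hypothetical name, and the mock tree below is invented for the example.

```javascript
// Hypothetical re-statement of the walk in getHtml, usable on mock nodes.
// Returns the node whose parent's markup contains `marker`, or null when the
// top is reached or the parent's markup contains any of the `skipMarkers`.
function findHighlightedAncestor(node, marker, skipMarkers) {
  var current = node;
  while (current.parentNode != null && current.parentNode.innerHTML != null) {
    var html = current.parentNode.innerHTML;
    if (html.indexOf(marker) !== -1) {
      var blocked = skipMarkers.some(function (s) { return html.indexOf(s) !== -1; });
      return blocked ? null : current;
    }
    current = current.parentNode; // marker not found yet, go up one level
  }
  return null;
}

// Mock tree: body > p > high > text
var body = { parentNode: null, innerHTML: "<p><high>DOM</high> tree</p>" };
var para = { parentNode: body, innerHTML: "<high>DOM</high> tree" };
var high = { parentNode: para, innerHTML: "DOM" };
var text = { parentNode: high, data: "DOM" };

var found = findHighlightedAncestor(text, "<high>", ["<title>", "btnHighlight"]);
```

Here found is the high node, because its parent (the paragraph) is the first ancestor whose markup contains the highlight tag.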
Listing 7. oneWordSearch function
function oneWordSearch( textIn )
{
// use XPath to quickly extract all of the rendered text
var textNodes = document.evaluate( '//text()', document, null,
XPathResult.UNORDERED_NODE_SNAPSHOT_TYPE,
null );
for (var i = 0; i < textNodes.snapshotLength; i++)
{
var textNode = textNodes.snapshotItem(i);
if( textNode.data.indexOf( textIn ) != -1 )
{
highlightAll( textNode, textIn );
}//if word found
}//for each text node
}//oneWordSearch
Again using XPath, oneWordSearch processes each text node to
find the query. When found, the highlightAll function is
called, as shown in Listing 8.
Listing 8. highlightAll and highlightOne functions
function highlightAll( nodeOne, textIn )
{
if( nodeOne.parentNode != null )
{
var full = nodeOne.parentNode.innerHTML;
var reg = new RegExp( textIn, "g");
full = full.replace( reg, highStart + textIn + highEnd );
nodeOne.parentNode.innerHTML = full;
}//if the parent node exists
}//highlightAll
function highlightOne( nodeOne, wordOne, wordTwo )
{
var oneIdx = nodeOne.data.indexOf( wordOne );
var tempStr = nodeOne.data.substring( oneIdx + wordOne.length );
var twoIdx = tempStr.indexOf( wordTwo );
// only create the highlight if it's not too close
if( twoIdx > dist )
{
var reg = new RegExp( wordOne );
var start = nodeOne.parentNode.innerHTML.replace(
reg, highStart + wordOne + highEnd
);
nodeOne.parentNode.innerHTML = start;
}//if the distance threshold exceeded
}//highlightOne
Similar to the removeIndicators function, highlightAll uses a regular expression to
replace the text to be highlighted with markup, including the highlighting tags and the
original text. Function highlightOne, used later in the twoWordSearch function, checks
that the first word is sufficiently far away from the second word, then performs the
same replacement on the first occurrence only. Word distance checks need to take place
in the rendered text as returned from the XPath statement; otherwise, various markup,
such as <b>, will affect the distance calculations. Listing 9 shows the twoWordSearch
function in detail.
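Because the query text is passed straight to the RegExp constructor, a query such as C++ or a.b would be interpreted as a pattern rather than as literal text. One possible refinement, sketched here with the hypothetical names escapeRegExp and wrapMatches, escapes the metacharacters first. Like the listing, it works on a markup string, so it would also match text inside tags and attributes.

```javascript
// Hypothetical helper: escape characters that have special meaning in a RegExp.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: wrap every literal occurrence of `word` in the tags.
function wrapMatches(html, word, open, close) {
  var reg = new RegExp(escapeRegExp(word), "g");
  // A replacer function avoids "$" sequences in the word being interpreted.
  return html.replace(reg, function (match) { return open + match + close; });
}

var wrapped = wrapMatches("a.b and axb", "a.b", "<high>", "</high>");
```

Without the escaping step, the pattern a.b would also match axb, because the dot matches any character.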
Listing 9. twoWordSearch function
function twoWordSearch( wordOne, wordTwo )
{
// use XPath to quickly extract all of the rendered text
var textNodes = document.evaluate( '//text()', document, null,
XPathResult.UNORDERED_NODE_SNAPSHOT_TYPE,
null );
var nodeOne;
var foundSingleNode = 0;
for (var i = 0; i < textNodes.snapshotLength; i++)
{
var textNode = textNodes.snapshotItem(i);
// if both words in the same node, highlight if not too close
if( textNode.data.indexOf( wordOne ) != -1 &&
textNode.data.indexOf( wordTwo ) != -1 )
{
highlightOne( textNode, wordOne, wordTwo );
foundSingleNode = 0;
nodeOne = null;
}else
{
if( textNode.data.indexOf( wordOne ) != -1 )
{
// if the first word is already found, highlight the entry
if( foundSingleNode == 1 &&
nodeOne.parentNode != null &&
nodeOne.parentNode.innerHTML.indexOf( wordTwo ) == -1 )
{
highlightAll( nodeOne, wordOne );
}//if second word is in the same parent node
// record current node found
nodeOne = textNode;
foundSingleNode = 1;
}//if text match
if( textNode.data.indexOf( wordTwo ) != -1 ){ foundSingleNode = 0; }
}//if both words in single node
}//for each text node
// no second word nearby, highlight all entries
if( foundSingleNode == 1 ){ highlightAll( nodeOne, wordOne ); }
}//twoWordSearch
Walking through each text node as retrieved from the XPath call is done the same way as
in the oneWordSearch function. If both words are found within the current text node,
the highlightOne function is called to highlight the instances of wordOne where it is
sufficiently distant from wordTwo.
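The distance test that highlightOne applies can be written as a stand-alone predicate on a string. The name isFarEnough is hypothetical; the logic follows Listing 8, including the detail that the second word is only looked for after the first occurrence of the first word.

```javascript
// Hypothetical predicate mirroring the check in highlightOne (Listing 8).
// `dist` is the number of characters that must separate the end of wordOne
// from the start of the next wordTwo.
function isFarEnough(text, wordOne, wordTwo, dist) {
  var oneIdx = text.indexOf(wordOne);
  if (oneIdx === -1) { return false; } // guard added for standalone use
  var rest = text.substring(oneIdx + wordOne.length);
  // indexOf returns -1 when wordTwo does not follow wordOne, so this is
  // false in that case too, just as in the listing.
  return rest.indexOf(wordTwo) > dist;
}
```

With the default dist of 10, the text "DOM hierarchy" fails the check because only one character separates the words, so that occurrence of DOM would not be highlighted.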
If both words are not in the same node, the foundSingleNode variable is set when the
first word is matched. If the first word is matched again before the second word is
seen, the highlightAll function is called for the earlier node. This ensures that each
instance of the first word is highlighted, even those that do not have the second word
nearby. Upon exiting the loop, a final check is made to run highlightAll if the last
wordOne match was isolated and still needs to be highlighted.
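The bookkeeping across nodes can be followed more easily on an array of strings, one per text node. The sketch below is a simplification, not the script's code: selectNodesToHighlight is a hypothetical name, it returns indices instead of modifying a DOM, it skips nodes containing both words rather than calling highlightOne, and it omits the check of the parent node's innerHTML.

```javascript
// Hypothetical, simplified model of the loop in Listing 9.
// Returns the indices of wordOne-only entries that would get highlightAll.
function selectNodesToHighlight(texts, wordOne, wordTwo) {
  var selected = [];
  var pending = -1; // index of the last wordOne-only entry awaiting a decision
  texts.forEach(function (text, i) {
    var hasOne = text.indexOf(wordOne) !== -1;
    var hasTwo = text.indexOf(wordTwo) !== -1;
    if (hasOne && hasTwo) {
      // Listing 9 hands this case to highlightOne and resets its state.
      pending = -1;
      return;
    }
    if (hasOne) {
      // A second wordOne arrived first, so the earlier one is isolated.
      if (pending !== -1) { selected.push(pending); }
      pending = i;
    }
    if (hasTwo) { pending = -1; } // wordTwo is nearby: drop the pending entry
  });
  // Final check after the loop for a trailing isolated match.
  if (pending !== -1) { selected.push(pending); }
  return selected;
}

var picks = selectNodesToHighlight(
  ["DOM intro", "plain text", "DOM again", "hierarchy notes", "DOM last"],
  "DOM", "hierarchy");
```

In this run, the entry at index 2 is dropped because "hierarchy" appears in the next entry, while the entries at indices 0 and 4 are selected.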
Save the file created with the above code as greppishFind.user.js and read on for installation and usage details.
Installing the greppishFind.user.js script
Open your Firefox browser with the Greasemonkey V0.7 extension installed and enter
the URL to the directory where greppishFind.user.js is located. Click the
greppishFind.user.js file and you should see the standard Greasemonkey install pop-up.
Select Install, then reload the page to activate the script.
Usage examples
Once the greppishFind.user.js script is installed into Greasemonkey, you can mimic the
examples shown in Figure 1 by entering dom inspector as a search query at
www.google.com. When the results page appears, press Alt+= to activate the user
interface. Type the query DOM (case-sensitive) and press Enter to see all entries of
DOM highlighted. Change the query to DOM hierarchy, and you'll see how only the
first three entries of DOM are highlighted, as shown in Figure 1.
Choose a directory listing such as file:///home/ or file:///c:/ to show entries like
those listed in Figure 2. You may want to experiment with changes to the distance
parameter or highlighting style to achieve results tailored to your searches.
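As a starting point for such experiments, the values to change are the dist global from Listing 1 and the cssStr string from Listing 2. The numbers and colors below are arbitrary examples chosen for this sketch, not recommendations from the original script.

```javascript
// Example values only; the originals are dist = 10 and black text on yellow.
var dist = 25; // require at least 25 characters between the two words

// Replacement for the cssStr string inside addHighlightStyle.
var cssStr = "high {color: white; background-color: darkgreen; }";
```

A larger dist makes the second word suppress highlights over a wider span of text, while a smaller one restricts suppression to words that are almost adjacent.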
Conclusion, further additions
With the code above and your completed greppishFind.user.js program, you now have a
baseline for implementing your own text-search capabilities in Firefox. Although this
program focuses on specific cases of certain words appearing in close proximity to
others, it provides a framework for further text-searching options.
Consider adding color changes for highlighted words based on how close the secondary
terms are. Expand the number of grep -v words to eliminate
entries gradually. Use the code here and your own ideas to create new Greasemonkey
user scripts that further enhance users' abilities to find text.
Download
Description | Name | Size | Download method
Sample code | os-customserach-firefox-greppishFind_0.1.zip | 3KB | HTTP
Resources
Learn
- Learn more about Greasemonkey at Greasespot.net.
- Read about JavaScript from the source at Mozilla.org.
- To listen to interesting interviews and discussions for software developers, check out developerWorks podcasts.
- Stay current with developerWorks' Technical events and webcasts.
- Check out upcoming conferences, trade shows, webcasts, and other Events around the world that are of interest to IBM open source developers.
- Visit the developerWorks Open source zone for extensive how-to information, tools, and project updates to help you develop with open source technologies and use them with IBM's products.
- Watch and learn about IBM and open source technologies and product functions with the no-cost developerWorks On demand demos.
Get products and technologies
- Grab the Greasemonkey Firefox add-on (extension) from Mozilla.org.
- Innovate your next open source development project with IBM trial software, available for download or on DVD.
- Download IBM product evaluation versions, and get your hands on application development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.
About the author
Nathan Harrington is a programmer at IBM currently working with Linux and resource-locating technologies.