Source: http://drnicwilliams.com/2009/06/07/tdd-for-greasemonkey-scripts-and-introducing-ninja-search-js/

“this article shows how I used test-driven development tools and processes on a Greasemonkey script.” Though it also includes free ninjas.

1. Long drop downs hate humans

When I do online banking I need to select from a large list of other people’s bank accounts to which I might like to transfer money. It is the massive drop down list that I must scroll through that I take issue with today. The problem of having to give other people money is probably a different discussion.

And take those time-zone selector drop down lists, for example: the massively long list rendered by Rails’ time_zone_select helper. Granted, I am thankful for you letting me choose my timezone in your web app. But those of us not living in the USA must hunt for our closest city in the list: dozens of locations, ordered by time zone and not by the name of the city. Unfortunately you can’t easily type a few letters of your current city to find it. Rather, you have to scroll. And if you live in the GMT+1000 time zone group (Eastern Australia), you have to scroll all the way to the bottom.

2. Choose from a small list

So I got to thinking I’d like a Greasemonkey (for Firefox) or GreaseKit (for Safari) script that automatically converted all ridiculously long HTML drop down lists into a sexy, autocompleting text field. You could then type in “bris” and be presented with “(GMT+1000) Brisbane”, or, in the less amusing banking scenario, I could type “ATO” and get the bank account details for the Australian Tax Office.

I mean, how hard could it be?

This article is two things: an introduction to Ninja Search JS, which attaches a friendly ninja to every drop down field to solve the above problem; and, mostly, a walkthrough of how I used test-driven development tools and processes on a Greasemonkey script.

Introducing Ninja Search JS

Ninja Search JS banner

Click the banner to learn about and install the awesome Ninja Search JS. It includes free ninjas.

Currently it is a script for Greasemonkey (Firefox) or GreaseKit (Safari). It could be dynamically installed as necessary via a bookmarklet; I just haven’t done that yet. It could also be a Firefox extension so that it didn’t have to fetch remote CSS and JS assets each time.

Ninja Search JS uses the liquidmetal and jquery-flexselect projects created by Ryan McGeary.

Most importantly of all, I think, is that I wrote it all using TDD. That is, tests first. I don’t think this is an erroneous statement given the relatively ridiculous, and unimportant nature of Ninja Search JS itself.

TDD for Greasemonkey scripts

I love the simple idea of Greasemonkey scripts: run a script on a subset of all websites you visit. You can’t easily do this with desktop apps, which is why web apps are so awesome – it’s just HTML inside your browser, and with Greasemonkey or browser extensions you can hook into that HTML, add your own DOM, remove DOM, add events, etc.

But what stops me writing more of them is that once you cobble together a script, you push it out into the wild and then bug reports start coming back. Or feature requests, preferably. I’d now have a code base without any test coverage, so each new change is likely to break something else. It’s also difficult to isolate bugs across different browsers, or in different environments (running Ninja Search JS in a page that used prototypejs originally failed), without a test suite.

And the best way to get yourself a test suite is to write it before you write the code itself. I believe this to be true because I know it sucks writing tests after I’ve written the code.

I mostly focused on unit testing this script rather than integration testing. With integration testing I’d need to install the script into Greasemonkey, then display some HTML, then run the tests. I’ve no idea how I’d do that.

Tests running

But I do know how to unit test JavaScript, and if I can get good coverage of the core libraries, then I should be able to slap the Greasemonkey specific code on top and do manual QA testing after that. The Greasemonkey specific code shouldn’t ever change much (it just loads up CSS and more JS code dynamically) so I feel ok about this approach.

For this project I used Screw.Unit for the first time (via a modified version of the blue-ridge rails plugin) and it was pretty sweet. Especially being able to run single tests or groups of tests in isolation.

Project structure

summary of project structure

All the JavaScript source – including dependent libraries such as jquery and jquery-flexselect – was put into the public folder. This is because I needed to be able to load the files into the browser without using the file:// protocol (which was failing for me). So I moved the entire project into my Sites folder and added the project as a Passenger web app. I’m getting ahead of myself, but there is a reason I went with public for the JavaScript + assets folder.

In vendor/plugins, the blue-ridge Rails plugin is a composite of several JavaScript libraries, including the test framework Screw.Unit and a headless rake task to run all the tests without browser windows popping up everywhere. In my code base blue-ridge is slightly modified since my project doesn’t look like a Rails app.

Our tests go in spec. In a Rails app using blue-ridge, they’d go in spec/javascripts, but since JavaScript is all we have in this project I’ve flattened the spec folder structure.

The website folder houses the github pages website (a git submodule to the gh-pages branch) and also the greasemonkey script and its runtime JavaScript, CSS, and ninja image assets.

A simple first test

For the Ninja Search JS I wanted to add the little ninja icon next to every <select> element on every page I ever visited. When the icon is clicked, it would convert the corresponding <select> element into a text field with fantastical autocompletion support.

For Screw.Unit, the first thing we need is a spec/ninja_search_spec.js file for the tests, and an HTML fixture file that will be loaded into the browser. The HTML file’s name must match the corresponding test name, so it must be spec/fixtures/ninja_search.html.

For our first test we want the cute ninja icon to appear next to <select> drop downs.

require("spec_helper.js");
require("../public/ninja_search.js"); // relative to spec folder

Screw.Unit(function(){
  describe("inline activation button", function(){
    it("should display NinjaSearch image button", function(){
      var button = $('a.ninja_search_activation');
      expect(button.size()).to(be_gte, 1);
    });
  });
});

The Blue Ridge TextMate bundle makes it really easy to create the describe (des) and it (it) blocks, and ex expands into a useful expect(...).to(matcher, ...) snippet.

The two ellipses are values that are compared by a matcher. Matchers are available via global names such as equals, be_gte (greater than or equal) etc. See the matchers.js file for the default available matchers.

The HTML fixture file is important in that it includes the sample HTML upon which the tests are executed.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">

<head>
  <title>Ninja Search | JavaScript Testing Results</title>
  <link rel="stylesheet" href="screw.css" type="text/css" charset="utf-8" />
  <script src="../../vendor/plugins/blue-ridge/lib/blue-ridge.js"></script>
</head>

<body>
  <div>
    <label for="person_user_time_zone_id">Main drop down for tests</label>
    <select name="person[user][time_zone_id]" id="person_user_time_zone_id" style="display: inline;">
      <option value="Hawaii">(GMT-10:00) Hawaii</option>
      <option value="Alaska">(GMT-09:00) Alaska</option>
      <option value="Pacific Time (US & Canada)">(GMT-08:00) Pacific Time (US & Canada)</option>
      <option value="Arizona">(GMT-07:00) Arizona</option>
      <option value="Mountain Time (US & Canada)">(GMT-07:00) Mountain Time (US & Canada)</option>
      <option value="Central Time (US & Canada)">(GMT-06:00) Central Time (US & Canada)</option>
      <option value="Eastern Time (US & Canada)">(GMT-05:00) Eastern Time (US & Canada)</option>
    </select>
  </div>
</body>
</html>

In its header it loads the blue-ridge JavaScript library, which in turn loads Screw.Unit and ultimately our spec.js test file (based on corresponding file name), so ninja_search.html will cause a file spec/ninja_search_spec.js to be loaded.

To run our first test just load up the spec/fixtures/ninja_search.html file into your browser.

Your first test will fail. But that’s ok, that’s the point of TDD. Red, green, refactor.

Simple passing code

So now we need some code to make the test pass.

Create a file public/ninja_search.js and something like the following should work:

(function($){
  $(function() {
    $('select').each(function(index) {
      var id = $(this).attr('id');

      // create the Ninja Search button, with rel attribute referencing the corresponding <select id="...">
      $('<a class="ninja_search_activation" rel="' + id + '">ninja search</a>')
      .insertAfter($(this));
    });
  });
})(jQuery);

Reload your test fixtures HTML file and the test should pass.

Now rinse and repeat. The final suite of tests and fixture files for Ninja Search JS are on github.

Building a Greasemonkey script

Greasemonkey scripts are typically all-inclusive affairs: one JavaScript file, named my_script.user.js, does the trick.

I decided I wanted a thin Greasemonkey script that would dynamically load my ninja-search.js, and any stylesheets and dependent libraries. This would allow people to install the thin Greasemonkey script once, and I can deploy new versions of the actual code base over time without them having to re-install anything.

Ultimately, in production, the stylesheets, images, and JavaScript code would be hosted on the intertubes somewhere. During development, though, it would be long-winded and painful to push the code to a remote host just to run tests.

So I have three Greasemonkey scripts:

  • public/ninja_search.dev.user.js – loads each dependent library and asset from the local file system
  • public/ninja_search.local.user.js – loads the compressed library and assets from the local file system
  • public/ninja_search.user.js – loads the compressed library and assets from a remote server

Let’s ignore the optimisation of compressing dependent JavaScript libraries for the moment and just look at the dev.user.js and user.js files.

The two scripts differ in the target host from which they load assets and libraries. ninja_search.dev.user.js loads them from the local machine and ninja_search.user.js loads them from a remote server.

For example ninja_search.dev.user.js loads local dependencies like this:

require("http://ninja-search-js.local/jquery.js");
require("http://ninja-search-js.local/ninja_search.js");

And ninja_search.user.js loads remote dependencies like this:

require("http://drnic.github.com/ninja-search-js/dist/jquery.js");
require("http://drnic.github.com/ninja-search-js/dist/ninja_search.js");

In the final version of ninja_search.user.js we load a simple, compressed library containing jquery, our code, and other dependencies, called ninja_search_complete.js.

Using Passenger to serve local libraries

The problem with loading local JavaScript libraries using the file:// protocol, alluded to earlier, is that it doesn’t work. So if I can’t load libraries using file://, I must use the http:// protocol. That means I must route the requests through Apache/Nginx.

Fortunately there is a very simple solution: use Phusion Passenger, which serves a “web app’s” public folder automatically. That’s why all the JavaScript, CSS and image assets were placed in a folder named public instead of src or lib or javascript.

On my OS X machine, I moved the repository folder into my Sites folder and wired up the folder as a Passenger web app using PassengerPane. It took 2 minutes and now I had http://ninja-search.local as a valid base URL to serve my JavaScript libraries to my Greasemonkey script.

Testing the Greasemonkey scripts

I can only have one of the three Greasemonkey scripts installed at a time, so I install the ninja-search.dev.user.js file to check that everything is basically working inside a browser on interesting, foreign sites (outside of the unit test HTML pages).

Once I’ve deployed the JavaScript files and assets to the remote server I can then install the ninja-search.user.js file (so can you) and double check that I haven’t screwed anything up.

Deploying via GitHub Pages

The normal, community place to upload and share Greasemonkey scripts is userscripts.org. This is great for one-file scripts, though if your script includes CSS and image assets, let alone additional JavaScript libraries, then I don’t think it’s as helpful, which is a pity.

So I decided to deploy the ninja-search-js files into the project’s own GitHub Pages site.

After creating the GitHub Pages site using the Pages Generator, I pulled down the gh-pages branch and then linked (via submodules) that branch into my master branch as the website folder.

Something like:


git checkout origin/gh-pages -b gh-pages
git checkout master
git submodule add -b gh-pages git@github.com:drnic/ninja-search-js.git website

Now I can access the gh-pages branch from my master branch (where the code is).

Then to deploy our Greasemonkey script we just copy over all the public files into website/dist, and then commit and push the changes to the gh-pages branch.


mkdir -p website/dist
cp -R public/* website/dist/
cd website
git commit -a -m "latest script release"
git push origin gh-pages
cd ..

Then you wait very patiently for GitHub to deploy your latest website, which now contains your Greasemonkey script (dist/ninja-search.user.js) and all the libraries (our actual code), stylesheets and images.

Summary

Greasemonkey scripts might seem like small little chunks of code. But all code starts small and grows. At some stage you’ll wish you had some test coverage. And later you’ll hate yourself for ever having released the bloody thing in the first place.

I wrote all this up to summarise how I’d done TDD for the Ninja Search JS project, which is slightly different from how I added test cases to _why’s “the octocat’s pajamas” Greasemonkey script when I first started experimenting with unit testing Greasemonkey scripts. The next one will probably be slightly different again.

I feel good about the current project structure, I liked Screw.Unit and blue-ridge, and I’m amused by my use of GitHub Pages to deploy the application itself.

If anyone has any ideas on how this could be improved, or done radically differently, I’d love to hear them!

Source: http://blog.saush.com/2009/03/write-an-internet-search-engine-with-200-lines-of-ruby-code/

Most people who know me realise at some point that I go through cycles where I dive into things that catch my imagination and obsess over them for days, weeks or even months. Some things, like origami and programming, never really go away, while others, like bread-making, follow cometary orbits and come around once every few years. In the distant past this was fueled mainly through reading copious amounts of books and magazines, but since the arrival of the Internet, my obsessive cycles have reached new peaks. It allowed me to reach out to like-minded people and provided a massive amount of data that I can never hope to conquer.

Fortunately, with search engines, I don’t need to conquer them by myself. Search engines are almost ubiquitous on the Internet, and the most famous one has since become a verb: in June 2006, the verb ‘to google’ was officially added to the Oxford English Dictionary. The barrier to entry for obsessing over something is now nothing more than an Internet connection away. So it’s only natural that search engines have now become my new obsessive interest. This has of course been enhanced by the fact that I work for Yahoo!, which owns the second most popular web search engine in the world.

The topics around search engines are pretty vast. Much of the discussion is centered on search engine optimization (SEO), which, contrary to its name, is not about optimizing search engines but about optimizing web sites to make them rise to the top of a search engine results list. However this is not what I’m interested in (at this moment). What I’m interested in is how search engines work, what makes them tick – basically the innards of a search engine. As part of my research, I wrote a search engine using Ruby, called SaushEngine, which I will describe in full here.

Surprisingly, as I delved deeper into the anatomy of a search engine, I found that most search engines are pretty much the same. Sure, the implementations can be vastly different, but the basic ideas are pretty much universal in all the search engines I looked at, including web search engines such as Yahoo! and Google as well as standalone ones such as Sphinx, ht://Dig, Lucene and Nutch. In essence, any search engine has these 3 parts:

  1. Crawl – the part of the search engine that goes around extracting data from various data sources that are being searched
  2. Index – the part of the search engine that processes the data that has been extracted during the crawl and stores it
  3. Query – the part of the search engine that allows a user to query the search engine and returns ranked results

This article will describe how each part works and how my Ruby search engine implements it. SaushEngine (pun intended) is a simple web search engine in around 200 lines of Ruby code. However I do cheat a bit and use various libraries extensively, including Hpricot, DataMapper and the Porter Stemmer. SaushEngine is a web search engine, which means it goes out to the Internet and harvests data on Internet web sites. Having said that, it’s definitely not production grade and lacks many of the features of a proper search engine (including high performance) in exchange for a simpler implementation.

Crawl

Let’s start with the crawl phase of SaushEngine. SaushEngine has a web crawler called Spider (a spider crawls the web – get it? I know, it’s an old joke), in 2 files named spider.rb and index.rb. This crawler doesn’t just extract the information, though; it partly processes the data as well. It is designed this way to simplify the overall design. In this section we’ll concentrate on the crawling parts first.

There are a number of things to consider when designing a crawler, the most important amongst which are:

  • How do we crawl?
  • What kind of documents do we want to get?
  • Which sites are we allowed to visit?
  • How do we re-crawl and how often?
  • How fast can we crawl?

Many search engine designs approach crawling modularly by decoupling the crawler from the index and query design. For example, in Lucene, the crawler is not even part of the search engine and the implementer is expected to provide the data source. On the other hand, some crawlers are not built for search engines at all, but for research into the Internet in general. In fact, the first crawler, the World Wide Web Wanderer, was deployed in 1993 to measure the size of the Internet. As mentioned before, to simplify the design, SaushEngine’s crawler is an integral part of the search engine and the design is not modular. The subject of web crawlers is a topic of intensive research, with tons of existing literature dating from the late 1990s till today. Without going extensively into it, let’s see how SaushEngine is designed to answer some of those questions.

How do we crawl?
This is the central question. Google claimed in a July 2008 entry on the official Google blog that it had found 1 trillion unique URLs. That’s a whole lot of web pages. As a result, there is always the question of the value of visiting a particular page, and prioritizing which sites to crawl is important in order to reach the strategic sites in an evenly distributed way.

Most crawlers work by starting with a bunch of ‘seed’ URLs and following all links in those seeds, so choosing the correct seeds is critical to having a good index. SaushEngine uses a breadth-first strategy in crawling sites from its seed URLs.

While we’re discussing a generic Internet crawler here, there also exists a more focused type of crawler that only crawls specific sites or topics. For example, I could crawl only food-related sites or anime-related sites. The specifics of the implementation will be different but the core mechanism will remain the same.

What kind of documents do we want to get?
Obviously, for an Internet web crawler, what we are looking for are the web pages on the various web sites on the Internet. Some more advanced search engines include Word documents, Acrobat documents, Powerpoint presentations and so on, but predominantly the main data source is HTML-based web pages referenced by URLs. The SaushEngine design only crawls for HTML documents and ignores all other types of documents. This has an impact when we get to the indexing phase. Besides blocking off URLs that end with certain types such as .doc, .pdf, .avi etc., SaushEngine also rejects URLs containing ‘?’, ‘&’, ‘/cgi-bin/’ and so on, as the likelihood of generated content or of it being a web application is high.

Which sites are we allowed to visit?
A major concern many web sites have with web crawlers is that they consume bandwidth and incur costs for the owners of the sites they crawl. As a response to this concern, the Robots Exclusion Protocol was created, where a file named robots.txt is placed at the root level of a domain and tells compliant crawlers which parts of the site are allowed or disallowed. SaushEngine parses the robots.txt file for each site to extract the permissions and complies with the permissions in the file.
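As a rough sketch of what that compliance looks like in code (this mirrors how spider.rb uses the robots library later in this article; the URL is just an example), checking a single URL goes something like this:

require 'rubygems'
require 'robots'

# identify ourselves with the same user agent string the crawler uses
robot = Robots.new('saush-spider')

# the library fetches and parses the site's robots.txt behind the scenes,
# then answers whether this crawler may fetch the given URL
puts robot.allowed?('http://example.com/some/page.html')   # => true or false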

How do we re-crawl and how often?
Web pages are dynamic. Content in web pages can change regularly, irregularly or even never. A good search engine should be able to balance the freshness of the page (which makes it more relevant) against the frequency of the visits (which consumes resources). SaushEngine makes the simple assumption that any page should be revisited if it is older than 1 week (a controllable parameter).

How fast can we crawl?
Performance of the crawler is critical to the search engine. It determines the number of pages in its index, which in turn determines how useful and relevant the search engine is. The SaushEngine crawler is rather slow because it’s not designed for speed but for ease of understanding of crawler concepts.

Let’s go deeper into the design. This is the full source code for index.rb:

require 'rubygems'
require 'dm-core'
require 'dm-more'
require 'stemmer'
require 'robots'
require 'open-uri'
require 'hpricot'

DataMapper.setup(:default, 'mysql://root:root@localhost/saush')
FRESHNESS_POLICY = 60 * 60 * 24 * 7 # 7 days, in seconds

class Page
  include DataMapper::Resource
  property :id,         Serial
  property :url,        String, :length => 255
  property :title,      String, :length => 255
  has n, :locations
  has n, :words, :through => :locations
  property :created_at, DateTime
  property :updated_at, DateTime

  # find an existing page by URL, or build a new one
  def self.find(url)
    page = first(:url => url)
    page = new(:url => url) if page.nil?
    return page
  end

  # touch the page so that it counts as freshly crawled
  def refresh
    update_attributes({:updated_at => DateTime.parse(Time.now.to_s)})
  end

  # age of the page, in seconds, since it was last updated
  def age
    Time.now - updated_at.to_time
  end

  # a page is fresh if it is younger than the freshness policy
  def fresh?
    age > FRESHNESS_POLICY ? false : true
  end
end

class Word
  include DataMapper::Resource
  property :id,   Serial
  property :stem, String
  has n, :locations
  has n, :pages, :through => :locations

  # find an existing word by stem, or build a new one
  def self.find(word)
    wrd = first(:stem => word)
    wrd = new(:stem => word) if wrd.nil?
    return wrd
  end
end

class Location
  include DataMapper::Resource
  property :id,       Serial
  property :position, Integer
  belongs_to :word
  belongs_to :page
end

class String
  # split the string into lowercase Porter stems, skipping common and very long words
  def words
    words = self.gsub(/[^\w\s]/, "").split
    d = []
    words.each { |word| d << word.downcase.stem unless (COMMON_WORDS.include?(word) or word.size > 50) }
    return d
  end

  COMMON_WORDS = ['a', 'able', 'about', 'above', 'abroad', 'according', …, 'zero'] # truncated
end

DataMapper.auto_migrate! if ARGV[0] == 'reset'

Note the various libraries I used for SaushEngine:

  1. DataMapper (dm-core and dm-more)
  2. Stemmer
  3. Robots
  4. Hpricot

I also use the MySQL relational database as the index (which we’ll get to later). In the Page class, note the fresh?, age and refresh methods. The fresh? method checks if the page is fresh or not, and the freshness of the page is determined by its age since it was last updated. Also note that I extended the String class by adding a words method that returns the stems of all the words in the string, excluding common words and really long words. I create the array of word stems using the Porter stemmer. We’ll see how this is used in a while.
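For instance, once index.rb is loaded, the words method behaves roughly like this (the exact output depends on the Porter stemmer and on which words appear in the truncated COMMON_WORDS list):

require 'index'   # loads the String#words extension defined above

"Searching, searches... searched!".words
# => ["search", "search", "search"]
# punctuation is stripped, each word is downcased and reduced to its Porter stem,
# and anything in COMMON_WORDS (or longer than 50 characters) is skipped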

Now take a look at spider.rb, which is probably the largest file in SaushEngine, around 100 lines of code:

require 'rubygems'
require 'yaml'
require 'index'

LAST_CRAWLED_PAGES = 'seed.yml'
DO_NOT_CRAWL_TYPES = %w(.pdf .doc .xls .ppt .mp3 .m4v .avi .mpg .rss .xml .json .txt .git .zip .md5 .asc .jpg .gif .png)
USER_AGENT = 'saush-spider'

class Spider
  # start the spider from the saved seed URLs
  def start
    Hpricot.buffer_size = 204800
    process(YAML.load_file(LAST_CRAWLED_PAGES))
  end

  # process the loaded pages, breadth-first
  def process(pages)
    robot = Robots.new USER_AGENT
    until pages.nil? or pages.empty?
      newfound_pages = []
      pages.each { |page|
        begin
          if add_to_index?(page) then
            uri = URI.parse(page)
            host = "#{uri.scheme}://#{uri.host}"
            open(page, "User-Agent" => USER_AGENT) { |s|
              (Hpricot(s)/"a").each { |a|
                url = scrub(a.attributes['href'], host)
                newfound_pages << url unless url.nil? or !robot.allowed? url or newfound_pages.include? url
              }
            }
          end
        rescue Timeout::Error => e
          print "\n** Timeout encountered - #{page} - #{e.to_s}"
        rescue => e
          print "\n** Error encountered crawling - #{page} - #{e.to_s}"
        end
      }
      pages = newfound_pages
      File.open(LAST_CRAWLED_PAGES, 'w') { |out| YAML.dump(newfound_pages, out) }
    end
  end

  # add the page to the index; returns false if it is already indexed and fresh
  def add_to_index?(url)
    page = Page.find(scrub(url))
    # if the page is not in the index, then index it
    if page.new_record? then
      index(url) { |doc_words, title|
        doc_words.each_with_index { |w, l|
          loc = Location.new(:position => l)
          loc.word, loc.page, page.title = Word.find(w), page, title
          loc.save
        }
      }
    # if it is, but it is not fresh, then update it
    elsif not page.fresh? then
      index(url) { |doc_words, title|
        page.locations.destroy!
        doc_words.each_with_index { |w, l|
          loc = Location.new(:position => l)
          loc.word, loc.page, page.title = Word.find(w), page, title
          loc.save
        }
      }
      page.refresh
    # otherwise just ignore it
    else
      return false
    end
    return true
  end

  # scrub the given link: reject unwanted URLs and normalize the rest
  def scrub(link, host=nil)
    unless link.nil? then
      return nil if DO_NOT_CRAWL_TYPES.include? link[(link.size-4)..link.size] or link.include? '?' or link.include? '/cgi-bin/' or link.include? '&' or link[0..9] == 'javascript' or link[0..5] == 'mailto'
      return nil if link.index('#') == 0
      link = link[0..link.index('#')-1] if link.include? '#'
      if link[0..3] == 'http'
        url = URI.join(URI.escape(link))
      else
        url = URI.join(host, URI.escape(link))
      end
      return url.normalize.to_s
    end
  end

  # do the common indexing work: fetch the page, strip non-text elements,
  # and yield the page's word stems and title to the given block
  def index(url)
    open(url, "User-Agent" => USER_AGENT) { |doc|
      h = Hpricot(doc)
      title, body = h.search('title').text.strip, h.search('body')
      %w(style noscript script form img).each { |tag| body.search(tag).remove }
      array = []
      body.first.traverse_element { |element| array << element.to_s.strip.gsub(/[^a-zA-Z ]/, '') if element.text? }
      array.delete("")
      yield(array.join(" ").words, title)
    }
  end
end

$stdout.sync = true
spider = Spider.new
spider.start

The Spider class is a full-fledged Internet crawler. It loads its seed URLs from a YAML file named seed.yml and processes each URL in that list. To play nice and comply with the Robots Exclusion Protocol, I use a slightly modified Robots library based on Kyle Maxwell’s robots library, and set the name of the crawler as ’saush-spider’. As the crawler goes through each URL, it tries to index each one. If it successfully indexes the page, it goes through each link in the page and adds it to the newly-found pages bucket if it is allowed (checking robots.txt) and if it is not already in the bucket. At the end of the original seed list, it takes this bucket and replaces the seed URL list in the YAML file with it. This is effectively a breadth-first search that goes through each URL at every level before going down to the next level. Please note that I made some minor modifications to the robots library, and the link here is to my fork on GitHub, not Kyle’s.

The flow of adding to the index goes like this. First, we scrub the URL, dropping URLs that are suspected to be web applications (like Google Calendar, Yahoo Mail, etc.). Then we check if the URL is already in the index; if it’s not, we proceed to index it. If it is, we check the page’s freshness and re-index it if it’s stale. Otherwise, we just leave it alone.

That’s it! The crawling part is rather simple but slow. To use this effectively we will need to add in capabilities to run a whole bunch of spiders at the same time but this will do for now. Let’s get to the indexing part next.

Index

Indexing involves parsing and storing the data that has been collected by the crawler. The main objective here is to build an index that can be used to query and retrieve the stored data quickly. But what exactly is an index?

An index in our context is a data structure that helps us look up a data store quickly. Take any data store with N records. If we want to find a document in this data store we would need to go through each record to check if it’s the one we’re looking for before proceeding to the next. If it takes 0.01s to check each document and we have 100,000 records, we’ll take 0.01 seconds in the best case, 1,000 seconds in the worst case, and 500 seconds on average to search through the data store. Performance is therefore linear in the number of records, or we say it takes O(N) time. That’s way too slow for any search engine to effectively handle millions or billions of records.

To improve the search performance, we insert an intermediate data structure between the query and the actual data store. This data structure, which is the index, allows us to return results in much improved times.

There are many different types of data structures that can be used as an index to the data store, but the most popular one used by search engines is the inverted index. A normal way of mapping documents to words (called the forward index) is to have each document map to a list of words. For example:
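The original post illustrates this with a figure that isn’t reproduced here; as a stand-in, here is a small hypothetical forward index for four documents (kept consistent with the ‘fox’ discussion below):

# hypothetical forward index: each document maps to the (stemmed) words it contains
forward_index = {
  "doc1" => ["quick", "rabbit", "jump"],
  "doc2" => ["brown", "fox", "jump", "dog"],
  "doc3" => ["lazi", "fox", "sleep"],
  "doc4" => ["quick", "dog", "run"]
}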

An inverted index however maps words to the documents where they are found.
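Again, the original figure is not reproduced here; inverting the hypothetical forward index above gives:

# the same hypothetical data as an inverted index: each word maps to the documents containing it
inverted_index = {
  "quick"  => ["doc1", "doc4"],
  "rabbit" => ["doc1"],
  "jump"   => ["doc1", "doc2"],
  "brown"  => ["doc2"],
  "fox"    => ["doc2", "doc3"],
  "dog"    => ["doc2", "doc4"],
  "lazi"   => ["doc3"],
  "sleep"  => ["doc3"],
  "run"    => ["doc4"]
}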

A quick look at why inverted indices are much faster. Say we want to find all documents in the data store that have the word ‘fox’ in them. To do this on a normal forward index, we need to go through each record to check for the word ‘fox’, i.e. in our example above, we need to check the forward index 4 times to find that it exists in documents 2 and 3. In the case of the inverted index, we just go to the ‘fox’ record and find out that it exists in documents 2 and 3. Imagine a data store of millions of records and you can see why an inverted index is so much more effective.

In addition, a full inverted index also includes the position of the word within the document:
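Continuing the hypothetical example, each entry then also records where the word occurs in the document:

# hypothetical full inverted index: word => { document => [positions] }
full_inverted_index = {
  "fox" => { "doc2" => [1], "doc3" => [1] },
  "dog" => { "doc2" => [3], "doc4" => [1] }
  # ... and so on for the rest of the words
}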

This is useful for us later during the query phase.

SaushEngine implements an inverted index in a relational database. Let’s look at how this is implemented. I used MySQL as it is the most convenient but with some slight modifications you can probably use any relational database.

The Pages table stores the list of web pages that have been crawled, along with their titles and URLs. The updated_at field determines if the page is stale and should be refreshed. The Words table stores a list of all words found in all documents. The words in this table, however, are stemmed using the Porter stemmer prior to storing, in order to reduce the number of words stored. An improved strategy could also include alternate but non-standard synonyms (e.g. a database can be referred to as a DB), but to keep things simple, only stemming is used. By this time you should recognize that the Locations table is the inverted index, which maps words and their positions to the documents. To access and interact with these tables through Ruby, I used DataMapper, an excellent Ruby object-relational mapping library. However, a point to note is that DataMapper doesn’t allow you to create indices on the relationship foreign keys, so you’ll have to do it yourself. I found that putting 3 indices – one on word_id, one on page_id and one on word_id and page_id together – improves the query performance tremendously. Also please note that the search engine requires you to have both dm-core and dm-more, so remember to install those 2 gems.
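The post doesn’t show the DDL for those indices; as a rough sketch (assuming the default DataMapper-generated locations table with word_id and page_id columns, and that the DataObjects MySQL adapter exposes execute), they could be created like this:

# add_indices.rb -- illustrative only; table and column names assume DataMapper's defaults
require 'rubygems'
require 'index'   # sets up the DataMapper connection and models

[ "CREATE INDEX index_locations_word_id ON locations (word_id)",
  "CREATE INDEX index_locations_page_id ON locations (page_id)",
  "CREATE INDEX index_locations_word_id_page_id ON locations (word_id, page_id)"
].each { |ddl| repository(:default).adapter.execute(ddl) }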

Now that we have the index, let’s go back and look at how the information retrieved by the spider is parsed. The bulk of the work is in the index and add_to_index? methods in the Spider class. The index method is called from the add_to_index? method with the scrubbed URL and a code block that uses DataMapper to add the words and their locations to the index.

Using Open URI, the web page is fetched and Hpricot is used to parse the HTML. I take only the body of the document, and remove non-text portions of the document like inline scripts and styles, forms and so on. Going through each remaining element, I take only the explicitly text elements and drop the rest. Next, I tokenize the text. My strategy is pretty simple, which is to delimit the text by white space only. As a side note, while this works pretty well for English text, it doesn’t work as well for other languages, especially Asian languages such as Chinese or Korean (where sometimes there is no white space delineation at all). The resulting tokens, which are words, are run through the Porter stemmer to produce word stems. These word stems are the ones that are added to the index.

Another side note – if you’re not a Rubyist or are a beginner Rubyist, you might be interested to see that I actually pass an entire code block into the index method. This block differs depending on whether it’s a new web page or a refreshed web page. Doing this reduces the amount of duplicate code I had to write, which makes the code easier to maintain and to read.

Also if you’re wondering what happened to keyword meta-tags, you’re right, I didn’t use them. There is no big philosophical reasoning why I skipped them — I just wanted to focus on the document text and not give any special consideration to keywords in the meta tag.

Finally, let’s look at the query phase of the search engine, where all that we’ve talked about before comes together.

Query

The query phase of the search engine allows a user to express his intent and get relevant results. The search query input is most often limited to a single text box. As a result of this constraint, there are a number of problems. Firstly, because the user can only express his question in a single line of text, the true intention of the user can vary widely. For example, if I type in ‘apple’ as the search input, do I mean the computer company or the fruit? There is no way the search engine can know for sure. Of course, a more seasoned search user would put in more clues; for example, ‘apple mac’ or ‘apple ipod’ will convey the intention more accurately.

Also, all intentions must be expressed only in text. For example, if I want to find a document that was written after a certain date, or a document written by at least 3 collaborators, or a photo with a particular person in it, expressing it in a text query is pretty challenging. This is the problem posed in parsing and identifying the search query, the first part of the query phase. For SaushEngine, I made broad assumptions, so parsing the search query becomes only a matter of tokenizing the search input into words and stemming those words.

The second part of the query phase is to return relevant results. This is usually implementation specific and depends on the index that has been generated. The challenge doesn’t really lie in retrieving the results, though; it lies in sorting them by relevance. Although this sounds simple, it can be a truly profound problem with an amazing variety of solutions. Choosing the correct sorting solutions (also called ranking algorithms), generating correctly sorted relevance and responding with reasonable performance is the balance a search engine designer must strike.

There are a large number of ranking algorithms that can be used for a search engine. The most famous one is probably PageRank, the ranking algorithm created by the Google founders, Larry Page and Sergey Brin. PageRank assigns a score to every page depending on the number and importance of pages that link to it, and uses that to rank the page. Other algorithms use different methods, from getting feedback from the search engine users to rank the search results (the more times a user clicks on a search result, the higher it is ranked) to inspecting the contents of the page itself for clues on its relevance.

For SaushEngine, I chose 3 very simple ranking algorithms that are based on the content of the page itself:

  1. Frequency of the search words
  2. Location of the search words
  3. Distance of the search words

Frequency

The frequency ranking algorithm is also quite simple. The page that has more of the search words is assumed to be more relevant.

Location

The location ranking algorithm is very simple. The assumption made here is that if the search word is near to the top of the document, the page is more relevant.

Distance

The distance ranking algorithm inspects the distance between the search words on every page. The closer the words are to each other on a page, the higher that page will be ranked. For example, if I search for ‘brown fox’ in these 2 documents:

1 – The quick brown fox jumped over the lazy dog
2 – The brown dog chased after the fox.

They will both turn up in the search results, but document 1 will be more relevant, as the distance between ‘brown’ and ‘fox’ in document 1 is 0 while in document 2 it is 4.

Let’s see how SaushEngine implements these 3 ranking algorithms. SaushEngine’s ranking algorithms are in a class named Digger, in a file named digger.rb. Here’s the full source code of digger.rb:

require 'index'

class Digger
  SEARCH_LIMIT = 19

  # search for the given text and return the top ranked pages
  def search(for_text)
    @search_params = for_text.words
    wrds = []
    @search_params.each { |param| wrds << "stem = '#{param}'" }
    word_sql = "select * from words where #{wrds.join(' or ')}"
    @search_words = repository(:default).adapter.query(word_sql)
    tables, joins, ids = [], [], []
    @search_words.each_with_index { |w, index|
      tables << "locations loc#{index}"
      joins << "loc#{index}.page_id = loc#{index+1}.page_id"
      ids << "loc#{index}.word_id = #{w.id}"
    }
    joins.pop
    @common_select = "from #{tables.join(', ')} where #{(joins + ids).join(' and ')} group by loc0.page_id"
    rank[0..SEARCH_LIMIT]
  end

  # merge the rankings from each algorithm into a single sorted list
  def rank
    merge_rankings(frequency_ranking, location_ranking, distance_ranking)
  end

  # rank pages by how often the search words appear in them
  def frequency_ranking
    freq_sql = "select loc0.page_id, count(loc0.page_id) as count #{@common_select} order by count desc"
    list = repository(:default).adapter.query(freq_sql)
    rank = {}
    list.size.times { |i| rank[list[i].page_id] = list[i].count.to_f/list[0].count.to_f }
    return rank
  end

  # rank pages by how close the search words are to the top of the page
  def location_ranking
    total = []
    @search_words.each_with_index { |w, index| total << "loc#{index}.position + 1" }
    loc_sql = "select loc0.page_id, (#{total.join(' + ')}) as total #{@common_select} order by total asc"
    list = repository(:default).adapter.query(loc_sql)
    rank = {}
    list.size.times { |i| rank[list[i].page_id] = list[0].total.to_f/list[i].total.to_f }
    return rank
  end

  # rank pages by how close the search words are to each other
  def distance_ranking
    return {} if @search_words.size == 1
    dist, total = [], []
    @search_words.each_with_index { |w, index| total << "loc#{index}.position" }
    total.size.times { |index| dist << "abs(#{total[index]} - #{total[index + 1]})" unless index == total.size - 1 }
    dist_sql = "select loc0.page_id, (#{dist.join(' + ')}) as dist #{@common_select} order by dist asc"
    list = repository(:default).adapter.query(dist_sql)
    rank = Hash.new
    list.size.times { |i| rank[list[i].page_id] = list[0].dist.to_f/list[i].dist.to_f }
    return rank
  end

  # add the rank numbers from the different algorithms and sort by total rank
  def merge_rankings(*rankings)
    r = {}
    rankings.each { |ranking| r.merge!(ranking) { |key, oldval, newval| oldval + newval } }
    r.sort { |a, b| b[1] <=> a[1] }
  end
end

The implementation is mostly done in SQL, as can be expected. The basic mechanism is to generate the SQL queries (one for each ranking algorithm) from the code and send them to MySQL using a DataMapper pass-through method. The results from the queries are then processed as ranks and are sorted accordingly by a rank-merging method.

Let’s look at each method in greater detail. The search method is the main one, which takes a text search query as input. This search query is broken down into word stems, and we look for these words in our index to get their row IDs for easier and faster manipulation. Next, we create the common part of the search query, something that is needed for each subsequent query, and call the rank method.

The rank method is an aggregator that calls a number of ranking methods in sequence and merges the returned rank lists together. A rank list is nothing more than an array of 2-item arrays, with the first item being the page ID and the second item being its rank by that algorithm:

[ [1123, 0.452],
[557 , 0.314],
[3263, 0.124] ]

The above means that there are 3 pages in the rank list, the first being the page with ID 1123, which is ranked with the number 0.452; the second page is 557, and so on. Merging the returned rank lists just means that if the same page appears in different rank lists with different ranking numbers, we add the rank numbers together.
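A tiny worked example of that merge, reusing the page IDs from the rank list above (the second rank hash is made up): page 1123 appears in both lists, so its two rank numbers are summed, while the other pages keep their single values:

# two hypothetical rank hashes, keyed by page ID, as the ranking methods would return them
frequency = { 1123 => 0.452, 557 => 0.314 }
location  = { 1123 => 0.200, 3263 => 0.124 }

merged = {}
[frequency, location].each do |ranking|
  # when a page ID is present in both hashes, add the rank numbers together
  merged.merge!(ranking) { |page_id, old_rank, new_rank| old_rank + new_rank }
end
merged.sort { |a, b| b[1] <=> a[1] }
# => [[1123, 0.652], [557, 0.314], [3263, 0.124]]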

Let’s look at the ranking algorithms. The easiest is the frequency algorithm. In this algorithm, we rank the page according to the number of times the searched words appear in that page. This is the SQL query that is normally generated from a search with 2 effective terms:

SELECT loc0.page_id, count(loc0.page_id) as count FROM locations loc0, locations loc1
WHERE loc0.page_id = loc1.page_id AND
      loc0.word_id = 1296 AND
      loc1.word_id = 8839
GROUP BY loc0.page_id ORDER BY count DESC

This returns a resultset like this:

which obviously tells us which page has the highest count of the given words. To normalize the ranking, just divide all the word counts by the largest count (in this case it is 1575). The highest ranked page then has a rank of 1, while the rest of the pages are ranked at numbers smaller than 1.
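As a small worked example with hypothetical counts (the original resultset screenshot is not reproduced here), the normalization is just a division by the top count:

counts = [1575, 620, 311, 42]   # hypothetical per-page word counts, highest first
ranks  = counts.map { |c| c.to_f / counts.first }
# => [1.0, 0.3936..., 0.1974..., 0.0266...]  -- only the top page scores exactly 1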

The location algorithm is about the same as the frequency algorithm. Here, we rank the page according to how close the words are to the top of the page. With multi-word searches, this becomes an exercise in adding up the locations of each word. This is the SQL query from the same search:

SELECT loc0.page_id, (loc0.position + 1 + loc1.position + 1) as total FROM locations loc0, locations loc1
WHERE loc0.page_id = loc1.page_id AND
      loc0.word_id = 1296 AND
      loc1.word_id = 8839
GROUP BY loc0.page_id ORDER BY total ASC

This returns a resultset like this:

which again obviously tells us which page has those words closest to the top of the page. To normalize the ranking, however, we can’t use the same strategy as before (dividing each value by the largest one), as the lower the number is, the higher the page should be ranked. Instead, I inverted the results, dividing the smallest total by each page’s total. For the smallest total this again produces a rank of 1, and the rest are ranked at numbers smaller than 1.

The previous 2 algorithms were relatively simple; the word distance one is slightly tricky. We want to find the distance between all the search terms. For just 1 search term, this is a non-event as there is no distance. For 2 search terms, this is also pretty easy as it is only the distance between 2 words. It becomes tricky for 3 or more search terms. Should I find the distance between words 1 and 2, then words 1 and 3, then words 2 and 3? What if there are 4 or more terms? The number of pairs grows quadratically. I opted for a simpler approach, which is, for 3 terms, to find the distance between words 1 and 2 and then words 2 and 3. For 4 terms, it will be the distance between words 1 and 2, then words 2 and 3 and finally words 3 and 4. The growth in processing then becomes a more manageable linear growth.
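A quick sketch of the difference (the four-term list here is hypothetical, using the loc aliases the generated SQL uses for the search terms): taking only consecutive pairs keeps the number of distance terms at n - 1 instead of n * (n - 1) / 2 for all pairs:

words = ["loc0", "loc1", "loc2", "loc3"]   # four search terms, as aliased in the generated SQL

all_pairs   = words.combination(2).to_a    # 6 pairs for four terms -- quadratic growth
consecutive = words.each_cons(2).to_a      # 3 pairs -- linear growth, what distance_ranking uses
# consecutive => [["loc0", "loc1"], ["loc1", "loc2"], ["loc2", "loc3"]]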

This is the SQL generated by this approach for 3 terms (adding one more term to the example above):

SELECT loc0.page_id, (abs(loc0.position - loc1.position) + abs(loc1.position - loc2.position)) as dist
FROM locations loc0, locations loc1, locations loc2
WHERE loc0.page_id = loc1.page_id AND
      loc1.page_id = loc2.page_id AND
      loc0.word_id = 1296 AND
      loc1.word_id = 8839 AND
      loc2.word_id = 8870
GROUP BY loc0.page_id ORDER BY dist ASC

I use the abs function in MySQL to get the distance between 2 word locations in the same page. This returns a resultset like this:

Like the location algorithm, the smaller the distance, the more relevant the page is, so the same strategy is used in creating the page rank.

The rankings that are returned by the different ranking algorithms are added together with equal weight to give the final ranking of each page. While I put equal emphasis on each ranking algorithm, this doesn’t necessarily need to be so. I could easily have put in parameters that weigh the algorithms differently.
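As a sketch of that idea (the weights here are made up and not part of the original code), merge_rankings could scale each rank list before summing:

# hypothetical weighted variant of Digger#merge_rankings;
# the weights are illustrative only, in the order frequency, location, distance
def merge_rankings(*rankings)
  weights = [1.0, 0.5, 2.0]
  r = {}
  rankings.each_with_index do |ranking, i|
    ranking.each { |page_id, rank| r[page_id] = (r[page_id] || 0) + rank * weights[i] }
  end
  r.sort { |a, b| b[1] <=> a[1] }
end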

We’ve finished the search engine proper. However, search engines need an interface for users to interact with, so we need to wrap a nice web application around SaushEngine.

User Interface

I chose Sinatra, a minimal web application framework, to build a single-page web application that wraps around SaushEngine. I chose it because of its style and simplicity, and because it let me come up with a pretty good interface in just a few lines of code.

For this part of SaushEngine, you’ll need to install the Sinatra gem. We have 2 files; this is the Sinatra file, ui.rb:

require 'rubygems'
require 'digger'
require 'sinatra'

get '/' do
  erb :search
end

post '/search' do
  digger = Digger.new
  t0 = Time.now
  @results = digger.search(params[:q])
  t1 = Time.now
  @time_taken = "#{"%6.2f" % (t1 - t0)} secs"
  erb :search
end

And this is the only view file, which you need to put into a views folder, search.erb:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<title>SaushEngine - UI</title>
<style>
body {
  font-family: "lucida grande", tahoma, arial, sans-serif;
  font-size: 11px;
  margin: 5px;
  padding: 5px;
}
</style>
</head>
<body>
<h2>SaushEngine</h2>
<form method="post" action="/search">
  <input name="q" value="<%= params[:q] %>" size="30"/> <input value="search" type="submit"/>
</form>
<div><%= "time taken - #{@time_taken}" unless @results.nil? %></div>
<div id="results">
  <ol>
    <% @results.each { |result| %>
      <li><a href="<%= Page.get(result[0]).url %>"><%= Page.get(result[0]).title %></a><br/>
      <%= Page.get(result[0]).url %> (score: <%= "%4.3f" % result[1] %>)</li>
    <% } unless @results.nil? %>
  </ol>
</div>
</body>
</html>

I use ERB for templating because it’s the most familiar to me. This is how it turns out:

This is obviously not an industrial strength or commercially viable search engine. I wrote this as a tool to describe how search engines are written. There are many things that could be improved, especially the crawler, which is way too slow to be effective (unless it is a focused crawler), the index (stored in a few MySQL tables, which would have problems with a large index) and the query (the ranking algorithms are too simplistic).

There is obviously a lot of literature out there that I have referenced, much of it now hyperlinked from this article. However, a book that has influenced me greatly, especially in designing the index in MySQL, is Toby Segaran’s Programming Collective Intelligence.

If you’re interested in extending it, please feel free to take this code, but I would appreciate it if you drop me a note here. If you’re looking for an actual deployment, please take a look at http://saushengine.saush.net. Note that this link might not be up at all times. If you’re looking for the code, you can find it at git://github.com/sausheong/saushengine.git.