Needle in a world wide haystack

For a twenty year old, the World Wide Web’s pretty grown-up – at a conservative guess, there are 182 million websites accessible today. With this much information around, search engines are vital, and no company has done more for search engine technology than Google. Now a global power handling about 235 million searches per day in mid-2008, Google’s success comes from its search algorithm – and the secret behind that is a mathematical one.

Users want search engines to be fast, accurate, reliable, easy-to-use and comprehensive – but this misses one crucial element. With so much data out there, a search engine must make a good guess at which webpages will be most useful to the reader and present those first. If your search for “plumber Coventry” gets 18,000 hits, finding the ones that are likely to be useful to you is going to take some time. This is where search engines try and help you out with their page rank algorithms.

Who is number one?

Google networkIf we call the ranking of pages by importance and relevance the ‘charts’, how does a search engine decide who’s number one? Early search engines used to count the number of incoming links there were to that site. The more links the higher the position of a website in the internet charts. This is a very simple idea: each incoming link can be thought of as a recommendation for that website. If another site links to yours that means they must have found something there worth sending visitors to.

Take the example shown in the diagram which shows a (much!) reduced model of the internet.

Website

Number of incoming links

Chart position

1

1

=2

2

2

1

3

1

=2

Each node represents a website and the arrows represent hyperlinks. So the diagram shows a ‘world wide web’ with just three websites and four hyperlinks. The number of incoming links to each website,and how these convert into chart positions is shown in the table.

This makes some sense because sites 1 and 3 both recommend site 2, whereas they only receive one recommendation each themselves.

Systematic inequality

Google corrected a problem in this system: all recommendations were assumed to be equal, whereas in real life some are much more equal than others. If a website receives a link from a ‘big player’ website like bbc.co.uk this should do more to improve its chart position than a link from someone’s myspace! The question is how to express this idea in maths, so it can be implemented by computers? The answer turned out to be a classic piece of mathematics that virtually everyone learns at school – simultaneous equations!

To see this let’s go back to our simplified world wide web diagram.

Google networkThe idea is to calculate a ‘rating’ for each website to decide how high up the search engine charts each site should appear. To begin with these will be variables; x is the rating of site 1; y is the rating of site 2 and z is the rating of site 3.

Think of each hyperlink as a ‘vote’ from one site to another. Each site has its own ‘voting power’. It has to distribute its vote equally between all the sites that it votes for – so if a site links to two others, it will split its vote in two, giving half to each.

Look at the links coming in to site 2. The one from site 1 is worth x to site 2 and the one from site 3 is worth 0.5z to site 1. Make sure you can see why.

Counting the number of incoming votes in terms of x, y and z gives the table below.

Website

Number of incoming  votes

Voting power

1

z / 2

x

2

x + z / 2

y

3

y

z

But the really clever idea is that each site’s voting power is now defined to be equal to its total received, or incoming, vote. This gives rise to the following set of simultaneous equations:

x = z / 2

y = x + z / 2

z = y

Solving them will give each site’s rating and then their relative chart positions.

y = 2x = z

Site

Ranking

y

=1

z

=1

x

2

(There’s a mathematical theory that guarantees there is a solution which isn’t x =0, y=0, z=0. In fact, it says that there are infinitely many solutions but, conveniently, they all give the same chart positions).

Interrelated rating

So the rating of a website is determined by the ratings of those that links to it (and their rating is determined by the rating of those that link to them). Now do this for a network with 182 million nodes and many more connections, and you have yourself the basis for the most popular search engine in the world!

This is an example of how simultaneous equations, which are part of linear algebra, are used in internet development – but many other areas of mathematics are also used, particularly decision mathematics. Web technologies are a growing and rapidly developing area to work in. Not only do they provide lots of interesting challenges to crack with problem solving skills, the solutions that are found make millions of people’s lives a little bit easier, and may end up as household names.