  #4  
01-24-2007, 06:04 PM
danson
I bet there is some way to make this even cleverer -

Can you think of some kind of data structure that lets you index not only which words occur in which documents but also each word's offset from the beginning of the document?

I suppose the current index looks like:

WORD DOCUMENT-ID
================
wordA: 2 5 9 1 3
wordB: 2 12 99 293

You could update the index to show not just which documents the word occurs in but also its position:

wordA: 2(4) 5(29)...
wordB: 2(5) 12(23)...

So wordA occurs in document 2, offset 4 and document 5, offset 29.
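
A rough Python sketch of how an index like that might be built. The function name, the whitespace tokenising and the sample documents are all made up for illustration, and offsets are counted in words starting from 0:

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each word to {doc_id: [word offsets]} (hypothetical helper)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Naive tokeniser (an assumption): lowercase, split on whitespace.
        for offset, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(offset)
    return index

# Made-up sample documents keyed by document id.
docs = {2: "the quick brown fox", 5: "brown the fox"}
index = build_positional_index(docs)
```

Each offset list comes out in ascending order because the words are visited in document order.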

Then searching for the phrase "wordA wordB" would simply be a case of finding the documents that contain both words and keeping those where wordB's offset is exactly 1 greater than wordA's (or within some tolerance, if you want looser matches).
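
As a minimal sketch of that lookup in Python, assuming the index is held as a dict of word -> {document id: [offsets]}. The name `phrase_match`, the `tolerance` parameter and the sample offsets for wordB are hypothetical:

```python
def phrase_match(index, first, second, tolerance=0):
    """Return ids of documents where `second` starts 1..1+tolerance words after `first`."""
    postings_a = index.get(first, {})
    postings_b = index.get(second, {})
    hits = []
    # Only documents containing both words can match.
    for doc_id in sorted(postings_a.keys() & postings_b.keys()):
        offsets_b = set(postings_b[doc_id])
        if any(a + gap in offsets_b
               for a in postings_a[doc_id]
               for gap in range(1, 2 + tolerance)):
            hits.append(doc_id)
    return hits

# Made-up postings: word -> {document id: [offsets]}.
index = {
    "wordA": {2: [4], 5: [29]},
    "wordB": {2: [5], 5: [40]},
}
exact = phrase_match(index, "wordA", "wordB")
loose = phrase_match(index, "wordA", "wordB", tolerance=10)
```

With `tolerance=0` only document 2 qualifies (offsets 4 and 5). Document 5 only matches once the tolerance is wide enough to cover the gap between 29 and 40.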

That final comparison can probably also be optimised with the right algorithm.
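
One possible way to do that, offered only as a sketch: if each word's offsets within a document are kept sorted, the two lists can be walked in step, much like merging two sorted lists, so the check takes time proportional to their combined length. The name `has_adjacent` and the `max_gap` parameter are hypothetical:

```python
def has_adjacent(offsets_a, offsets_b, max_gap=1):
    """True if some b in offsets_b has 0 < b - a <= max_gap for some a in offsets_a.

    Both lists are assumed to be sorted in ascending order.
    """
    i = j = 0
    while i < len(offsets_a) and j < len(offsets_b):
        gap = offsets_b[j] - offsets_a[i]
        if gap <= 0:
            j += 1   # b is at or before a, so it cannot follow any later a either
        elif gap <= max_gap:
            return True
        else:
            i += 1   # b is too far ahead of a, and every later b is further still
    return False
```

For example, `has_adjacent([4, 29], [5, 40])` finds the pair 4 and 5, while `has_adjacent([4, 29], [10, 40])` finds nothing unless `max_gap` is raised.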

Perhaps though you do something much more clever already...

Daniel