Comparing Substrings

by David Richardson, March 12, 2008 10:47

Hi... long time no post!

I've been working with some text pulled in from various sources and am using MySQL full-text indexing to search it. However, the sources often contain repeated text, which throws off the search results.

For example, one of the sources has a piece of text on every page that lists the departments in the company: "Marketing, Sales, Accounts, HelpDesk" appears on every page (in that order), although only a few of the pages actually have content that directly relates to those departments.

Now, I know I can model the page before import to make sure that the repeated text is omitted. However, I'm acutely aware that I'm going to have a load more of these documents (thousands!) thrown at me soon, so I was wondering if there's any way in ODBScript to compare a few fields, look for repeated substrings, and then automatically delete them. Even a list of candidate substrings for deletion would be a big step forward.
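
To make the "list of candidate substrings" idea concrete, here is a rough sketch in Python rather than ODBScript (whether ODBScript offers an equivalent is exactly the open question). It assumes the field contents have already been read into a list of strings, and it treats any run of n consecutive words that shows up in most of the pages as a candidate. The function name candidate_boilerplate, the run length of 4 words, and the 0.8 threshold are all illustrative assumptions, not anything from a real library.

```python
from collections import Counter


def candidate_boilerplate(texts, n=4, min_fraction=0.8):
    """Return word n-grams that occur in at least min_fraction of the texts.

    Hypothetical helper for illustration only; n and min_fraction are
    arbitrary starting values that would need tuning on real data.
    """
    doc_counts = Counter()
    for text in texts:
        words = text.split()
        # Build a set so each n-gram is counted at most once per document.
        grams = {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
        doc_counts.update(grams)
    threshold = min_fraction * len(texts)
    return sorted(g for g, count in doc_counts.items() if count >= threshold)


# Made-up sample pages that share the department list from the example above.
pages = [
    "Marketing, Sales, Accounts, HelpDesk Quarterly results for the sales team",
    "Marketing, Sales, Accounts, HelpDesk How to reset your password",
    "Marketing, Sales, Accounts, HelpDesk Office closed on Friday",
]
print(candidate_boilerplate(pages))
# prints ['Marketing, Sales, Accounts, HelpDesk']
```

The output is only a list of candidates to review by eye; actually deleting them from the fields would be a separate step, and longer repeated passages would show up as several overlapping runs.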

Many thanks in advance for any thoughts: I know it's a toughie!


