You've answered the next question, but not the one I asked...

by David Richardson, March 12, 2008 23:34

That's a very useful bit of code, but it's not actually what I'm after (if I've read it correctly!).

It looks to me as though your code deletes a known substring wherever it occurs in a string. I've not tried it with a very long string, but I'm guessing that you found:

$repl($very_long_string$,"text you want rid of","") 
didn't work for you with very long pieces of text.

HOWEVER, my problem is that I don't actually know in advance what the text I want to delete will be. Let me give you an example, in a kind of "pseudo" code fashion:

Text01="Here's some 0123EDf4 text"
Text02="And Here's some more 0123EDf4 text"
Text03="Oops, looks like some more 0123EDf4 words"
Text04="0123EDf4This is just filler text"

RESULTS (where repeated_substring > 3 characters):

0123EDf4 = 4 occurrences
text = 3 occurrences
some = 3 occurrences
more = 2 occurrences
Here's = 2 occurrences
Here's some = 2 occurrences

You and I can see that 0123EDf4 is repeated at some point in all four pieces of text, but I want to be able to identify it automatically. In the text I'm looking at, the repetitions are almost always longer than 100 characters, and there are other criteria I could use for deletion. But my problem is identifying them in the first place, without my having to trawl through lots and lots of text.
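To make the idea concrete, here is a rough sketch in Python (not the language of the original code, and only one possible approach). It slides a fixed-length window over each piece of text and counts how many texts contain each window. The function name shared_substrings and its parameters length and min_texts are made up for this illustration.

```python
from collections import Counter


def shared_substrings(texts, length, min_texts=2):
    """Map each substring of exactly `length` characters to the number of
    texts it appears in, keeping only those found in at least `min_texts`."""
    counts = Counter()
    for text in texts:
        # Use a set so a substring repeated inside one text counts once.
        windows = {text[i:i + length] for i in range(len(text) - length + 1)}
        counts.update(windows)
    return {s: n for s, n in counts.items() if n >= min_texts}


texts = [
    "Here's some 0123EDf4 text",
    "And Here's some more 0123EDf4 text",
    "Oops, looks like some more 0123EDf4 words",
    "0123EDf4This is just filler text",
]

repeated = shared_substrings(texts, length=8)
print(repeated["0123EDf4"])  # 4
```

For the real data, length would presumably be set to something like 100, since the repetitions are almost always longer than that. A repeat longer than the window would then show up as a run of overlapping windows that all have the same count, which could be merged afterwards. Whether this is fast enough on fields of around 20,000 characters is something I would have to test rather than assume.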

I'm wondering if regular expressions could do it, but with each field being around 20,000 characters long, it might be too much for regex.

Forgive me if I've misinterpreted your code and it actually has the answer I'm looking for... in which case, please give me some pointers!


