Textpattern CMS support forum
Re: smd_fuzzy_find needs beta testers/coders
Els wrote:
One thing I noticed: when two words are entered in the search field, the output doesn’t make much sense
D’oh, I’ll have to review the way it splits the words up. Thought I’d got it doing that but maybe I forgot or didn’t check. I’m not sure it’s intelligent enough to take the words and apply “proximity” searching yet like Google does, nor does it handle quoted strings (I don’t think – not tried).
But you’re right, it should find the two words ‘separately’ on a page rather than just find the 2nd, or none at all if one of them is too short. I’ll have to fix that. Thanks for the report.
And it would be nice to have an excerpt in the search results
Yeah that’s on the todo list. I’m intending to add a results_form attribute for that so it can be overridden, but I have to find a way of making it use the inbuilt form first :-)
The smd plugin menagerie — for when you need one more gribble of power from Textpattern. Bleeding-edge code available on GitHub.
Hire Txp Builders – finely-crafted code, design and Txp
#32 2007-06-11 14:44:16
- els
- Moderator

- From: The Netherlands
- Registered: 2004-06-06
- Posts: 7,458
Re: smd_fuzzy_find needs beta testers/coders
Don’t hurry! I think it’s great you’re working on this :)
Re: smd_fuzzy_find needs beta testers/coders
OK, after a small lag of several millennia, the next iteration is here. smd_fuzzy_find v0.03 beta (and the * cough * bug-fixed smd_lib) adds category support so you can refine your searches to either the current category and/or specific categories, or you can negate categories with the !myCategory syntax. Takes a comma-separated list (like section does), or ?c for the current category, or !c for NOT the current category.
subcats="1" will search in sub-categories of the given list of categories.
section accepts ?s and !s as well as negation. Can’t remember if it did before, but it does now.
status limits the search to articles of the given status(es) or statii or whatever the plural is. Choose from the usual crew of draft, hidden, pending, live, or sticky. Default is “live, sticky”.
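For anyone puzzling over the list syntax described above, here is a minimal PHP sketch of how a comma-separated list with ! negation and the ?c / !c shortcuts could be split into "include" and "exclude" names. It is not the plugin's code: parse_filter_list() and its parameters are hypothetical names made up for illustration, and it assumes ?c and !c are simply swapped for the current category.

```php
<?php
// Hypothetical helper for illustration only; NOT taken from smd_fuzzy_find.
// $spec        e.g. "news, !archive, ?c"
// $current     the current category (or section) name
// $placeholder 'c' for categories, 's' for sections
function parse_filter_list(string $spec, string $current, string $placeholder = 'c'): array
{
    $include = [];
    $exclude = [];
    foreach (explode(',', $spec) as $item) {
        $item = trim($item);
        if ($item === '') {
            continue; // ignore empty entries such as a trailing comma
        }
        if ($item === '?' . $placeholder) {
            $include[] = $current;          // ?c means "the current one"
        } elseif ($item === '!' . $placeholder) {
            $exclude[] = $current;          // !c means "NOT the current one"
        } elseif ($item[0] === '!') {
            $exclude[] = substr($item, 1);  // !name negates a specific name
        } else {
            $include[] = $item;
        }
    }
    return ['include' => $include, 'exclude' => $exclude];
}

$result = parse_filter_list('news, !archive, ?c', 'reviews');
print_r($result);
```

One consequence of this simple approach is that a category literally named "c" could not be negated with !c, since that spelling is taken by the shortcut.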
Els will also be pleased to hear I’ve squashed the two major bugs he mentioned. At least I think I have. If searching for more than one word, the plugin will search for them both in proximity to one another. In other words, they act like you’ve put quotes round the whole phrase. I have no idea how to separate them so they are both found somewhere on the page (separately), without understanding how the algorithm works.
So if you searched for barbee and len it would only find documents that mentioned barbie and ken in the same sentence, exactly like that… or perhaps with a few other choices thrown in depending on how fuzzy you want it.
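As a rough idea of what finding the words "separately" could look like, here is a small PHP sketch that checks each search term on its own against every word in a piece of text. It uses PHP's built-in levenshtein() as a stand-in for the plugin's fuzzy algorithm (which is a different function), and the names fuzzy_word_match() and all_terms_match() plus the tolerance of 2 are assumptions for illustration.

```php
<?php
// Illustration only: levenshtein() stands in for the plugin's real algorithm.

// True if $term is within $tolerance edits of any word in $words.
function fuzzy_word_match(string $term, array $words, int $tolerance): bool
{
    foreach ($words as $word) {
        if (levenshtein($term, $word) <= $tolerance) {
            return true;
        }
    }
    return false;
}

// True only if EVERY search term fuzzily matches some word in the text,
// regardless of where in the text the matching words appear.
function all_terms_match(string $query, string $text, int $tolerance = 2): bool
{
    $words = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $terms = preg_split('/\s+/', strtolower(trim($query)), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($terms as $term) {
        if (!fuzzy_word_match($term, $words, $tolerance)) {
            return false;
        }
    }
    return true;
}

$text = 'Barbie went shopping. Much later, Ken arrived.';
var_dump(all_terms_match('barbee len', $text));   // bool(true)
var_dump(all_terms_match('barbee zzzzz', $text)); // bool(false)
```

Note that very short terms match almost anything at a tolerance of 2 ("len" is also within two edits of "went"), which is probably why short words need special handling.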
Also, the search terms are now honoured in the search_results form. Specify form="name of form" to override the default of search_results. The <txp:search_results_excerpt /> tag now correctly highlights the terms it thought you meant in the article excerpts. It’s a kludge, but it seems to work. Note, however, if it finds a match in a field other than the body (e.g. the keywords), it won’t display the excerpt because TXP’s built-in search doesn’t handle them either.
So take it away and break it, reporting all quantum strangeness here. Sorry it took so long and doesn’t have comment searching yet. That’s more involved than I hoped, so I rolled back to this version after two weeks of head-scratching. Maybe I’ll manage it next time.
Enjoy.
#34 2007-08-06 21:30:16
- els
Re: smd_fuzzy_find needs beta testers/coders
Bloke wrote:
Els will also be pleased to hear I’ve squashed the two major bugs he mentioned.
She, if you don’t mind ;)
Thanks for the update, already installed and running on my site.
A quick first try: searching for ‘bridg and targt’ does indeed suggest ‘bridge and target’.
‘bridg an targt’: nothing
‘bridg an target’: nothing
‘bridge an targt’: bridge and target
So I’m guessing that at least one word has to be correct, and that shouldn’t be the last one in the row?? (I’m not complaining, just observing!)
The excerpt display is a good improvement. I’ll do some more testing in the next days and report back.
Re: smd_fuzzy_find needs beta testers/coders
Els wrote:
She, if you don’t mind ;)
Ooops, apologies. I knew that, just being brain dead. Trying to remix a tune and update code and a forum post in the background at the same time is a recipe for disaster with my tiny mind.
searching for ‘bridg and targt’ does indeed suggest ‘bridge and target’.
phew!
‘bridg an targt’: nothing. ‘bridg an target’: nothing
Uh-oh :-(
So I’m guessing that at least one word has to be correct, and that shouldn’t be the last one in the row??
Not necessarily, it really depends on what you’re searching for (which is very annoying… hence the reason this is beta and I really want to find someone who understands how the search algorithm works!)
Take this bizarre situation for example on this test site:
Spelling ‘software’ as sotfwar = 2 hits (correct)
Spelling ‘software’ as softwar = 1 hit (not so correct, considering it’s a closer spelling)
Spelling ‘performance software’ as preformance softwar = 1 hit (correct)
Spelling ‘performance software’ as preformanc softwar = 0 hits (not so correct)
Spelling ‘performance software’ as reformance softwar = 1 hit (correct)
So it seems to be because it takes the string as a whole entity: if you chop out bits in the middle, it gives up.
I suppose from a linguistic point of view (looking at them as a collection of letters and ignoring the fact they actually sort of spell words):
bridgeantargt is pretty close to bridgeandtarget
whereas bridgantargt probably isn’t because removal of the ‘e’ turns the ‘dg’ sound (like the ‘j’ in ‘jam’) into the ‘dg’ of ‘edgar’. But that doesn’t easily explain why bridg and targt matches, unless the inclusion of the ‘d’ gives it more to work with?
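To put some numbers on the "whole entity" idea, here is a small PHP sketch comparing the edit distance of a whole phrase with the per-word distances. It uses PHP's built-in levenshtein(), which is not the algorithm the plugin uses, and phrase_distance() and word_distances() are hypothetical helper names, so this only illustrates how errors add up across a phrase.

```php
<?php
// Illustration only: plain Levenshtein distance, not the plugin's algorithm.

// Distance between the two phrases treated as single strings.
function phrase_distance(string $typed, string $intended): int
{
    return levenshtein($typed, $intended);
}

// Distance for each word pair; assumes both phrases have the same word count.
function word_distances(string $typed, string $intended): array
{
    $typedWords = explode(' ', $typed);
    $intendedWords = explode(' ', $intended);
    $distances = [];
    foreach ($typedWords as $i => $word) {
        $distances[] = levenshtein($word, $intendedWords[$i] ?? '');
    }
    return $distances;
}

$target = 'bridge and target';
echo phrase_distance('bridg and targt', $target), "\n"; // 2
echo phrase_distance('bridg an targt', $target), "\n";  // 3
print_r(word_distances('bridg an targt', $target));     // each word is 1 edit out
```

With a tolerance of 2 applied to the whole phrase, the first query would pass and the second would not, even though no single word is more than one edit out. This does not reproduce every result reported above (‘bridg an target’ is also only two edits away), so treat it as one possible contributing factor rather than the explanation.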
I’m not making excuses for my sloppy coding, it may well be a bug! If you switch on debug= and put in a number from 1 to 3 you’ll get varying hunks of output as it tries and fails to match things. Feel free to post pertinent bits of that (or mail it to me with what you were trying to search for) and I’ll see if it’s something I’ve broken.
Maybe it’s simply a limitation of the search algorithm I stole that it only really works reliably for single words; and the fact that it sometimes suggests matches for multiple words is just good luck :-)
But for an algorithm that is supposed to intelligently suggest options for bad typing, it’s acting rather incoherently.
(sigh)
Last edited by Bloke (2007-08-06 22:32:49)
#36 2007-08-07 16:04:38
- els
Re: smd_fuzzy_find needs beta testers/coders
Bloke wrote:
I’m not making excuses for my sloppy coding, it may well be a bug! If you switch on debug= and put in a number from 1 to 3 you’ll get varying hunks of output as it tries and fails to match things. Feel free to post pertinent bits of that (or mail it to me with what you were trying to search for) and I’ll see if it’s something I’ve broken.
I’ll do some more testing later this week, using the frequently used search queries from my logs, especially in Dutch. But I must say that to me this is not such a big problem, it’s already much better than returning no results at all even when only one letter is missing or misspelled :)
#37 2007-11-21 21:58:14
- els
Re: smd_fuzzy_find needs beta testers/coders
Hey Stef,
I finally found time to do some more testing, though not extensively. I found that setting tolerance to ‘1’ at least in Dutch gives better results than the default ‘2’. And the limit attribute, when used with ‘words’, is 1 off: for example limit="words:4" gives 5 words. It counts correctly with ‘articles’.
And there is something I don’t understand: have a look at the suggested words here for instance. Search term is ‘clikker’, while it should be spelled ‘clicker’. Now there are lots of words starting with ‘clicker’, like ‘clickeraar’, ‘clickeren’, etcetera. Why doesn’t it suggest those whole words, but instead ‘clickera’, ‘clickere’? At first I thought it was because of the word length, but in other cases that doesn’t seem to matter that much. Or is the difference here the search term being at the end or the beginning of the word? (This is not something I’m asking you to fix, I just don’t get how it works.)
I still like the plugin a lot :)
Re: smd_fuzzy_find needs beta testers/coders
Hey Els, thanks for testing this some more.
I found that setting tolerance to ‘1’ at least in Dutch gives better results than the default ‘2’.
Weird, but cool. Nice to know.
And the limit attribute, when used with ‘words’, is 1 off
D’oh, well spotted. A rogue missing “=” sign on line 383 of the plugin is the culprit. I’m hopefully going to be making this plugin as official as I can over the next week or so to take advantage of the better-written stuff in smd_lib v0.3 (which was essentially a rewrite of the “unofficial” version of the beta smd_lib 0.23 that fuzzy_find uses). In the meantime, if you want to patch the plugin, go down near the bottom and find:
if (array_key_exists("words", $limitBy) && $ctr > $limitBy["words"]) {
and change it to:
if (array_key_exists("words", $limitBy) && $ctr >= $limitBy["words"]) {
That makes it count better :-)
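For the curious, here is a stripped-down PHP sketch of why > lets one extra word through. The loop around the check is an assumption (the real plugin code around that line is not shown here), and collect_words() and its $patched flag are hypothetical names used only to compare the two conditions side by side.

```php
<?php
// Illustration only: the surrounding loop is assumed, not copied from the plugin.
function collect_words(array $candidates, array $limitBy, bool $patched): array
{
    $out = [];
    $ctr = 0;
    foreach ($candidates as $word) {
        $out[] = $word;
        $ctr++; // $ctr now equals the number of words collected so far
        if (array_key_exists('words', $limitBy)) {
            // Original check uses >, patched check uses >=
            $limitReached = $patched
                ? $ctr >= $limitBy['words']
                : $ctr > $limitBy['words'];
            if ($limitReached) {
                break;
            }
        }
    }
    return $out;
}

$candidates = ['one', 'two', 'three', 'four', 'five', 'six'];
echo count(collect_words($candidates, ['words' => 4], false)), "\n"; // 5
echo count(collect_words($candidates, ['words' => 4], true)), "\n";  // 4
```

Because the counter already includes the word just added, testing with > only stops once the limit has been exceeded, which matches the report of limit="words:4" giving 5 words.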
Search term is ‘clikker’, while it should be spelled ‘clicker’. Now there are lots of words starting with ‘clicker’, like ‘clickeraar’, ‘clickeren’, etcetera. Why doesn’t it suggest those whole words, but instead ‘clickera’, ‘clickere’? … is the difference here the search term being at the end or the beginning of the word?
It is odd, isn’t it? I don’t know if the beginning/end of the word thing has any bearing on the results, but it may well be a factor (I wish I understood this guy’s function so I could have a stab at an intelligent answer!). If it’s any consolation, it’s the same in English. In that last example I have no idea what’s going on with the results, especially the last article! Totally bizarre: for some reason the algorithm thinks that ‘ca’ is a suitable replacement for ‘arestia’ (a very badly-spelled portion from the middle of the word ‘marestail’). The mind boggles.
In the process of moving this plugin towards officialdom I’m going to review some of the stuff anyway and I’ll do some extensive testing to see if this sort of thing is a result of the fuzzy find algorithm throwing weird stuff out, or if it’s a(nother) bug in my “get the nearest word” function. If I find anything I’ll post here first.
Many thanks for giving the code a good seeing to, and for spotting that naughty bug. If I had Best Beta Tester badges available, you’d get one :0)
Last edited by Bloke (2007-11-22 09:22:02)
#39 2007-11-22 19:27:53
- els
Re: smd_fuzzy_find needs beta testers/coders
Bloke wrote:
If I had Best Beta Tester badges available, you’d get one :0)
Thank you, but to be honest, where is the competition? ;-)