Textpattern CMS support forum
#16 2007-03-31 12:10:58
- els
- Moderator

- From: The Netherlands
- Registered: 2004-06-06
- Posts: 7,458
Re: smd_fuzzy_find needs beta testers/coders
It’s PHP 4.3.11 and MySQL 4.1.22.
In the meantime I’ll try and copy this site to another account with another host, and see what it does there. I’ve got other sites hosted elsewhere, but with much less content, and the two sites that could really use a feature like this are unfortunately both hosted by the same host…
#17 2007-03-31 18:24:29
- els
- Moderator

- From: The Netherlands
- Registered: 2004-06-06
- Posts: 7,458
Re: smd_fuzzy_find needs beta testers/coders
Okay, I copied the site elsewhere, now using PHP 4.4.4 and MySQL 4.1.21-standard. And it works!!! Page is now here. (I took out the debug="3".)
I apologize, Stef, for making you go through all this.
Sometimes it finds really good matches (‘belonning’, which should be ‘beloning’) and sometimes it shows an incredible imagination (‘positiefe’, which should be ‘positieve’) :)
What kind of testing that I can do would be helpful for you?
Edit: I did notice that tags, html code, urls in articles are also being searched. Don’t know if that can be turned off?
Edit again: the TXP search appears to do that as well, I just never noticed that before…
Last edited by els (2007-03-31 18:30:50)
Re: smd_fuzzy_find needs beta testers/coders
Els wrote:
It’s PHP 4.3.11 and MySQL 4.1.22.
Bingo. I know what it is then: Unicode regular expressions were introduced in 4.4.0 and I thought I’d be clever and use them. But since only half the smd_getWord() function uses them and full Unicode support isn’t going to be in PHP until v6 is out and well tested (so, around 2042 then, based on the current rate of adoption), it’s no hardship to go back and use standard character-by-character patterns. I’ll fix that so you can test it on your existing server. Sorry to cause you so much hassle.
What kind of testing that I can do would be helpful for you?
Just use it and see if anything jumps out as being really stupid or doesn’t work as you expect, then let me know what it is. I’ve had some odd matches come up (the one earlier was a prime example: ‘trianing’ shouldn’t match ‘infuriating’ first, then ‘training’ second in my mind, but maybe the algorithm thinks so… I’ll check it out and see if stuff like that can be tweaked). But, like you say, sometimes it’s quite creepy how close it can get to what you intended. If you experiment with min_word_length you can get it to only look for bigger words; useful for very technical sites.
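To make the ‘trianing’ example concrete, here is a minimal sketch of ranking candidate words by plain Levenshtein edit distance. It is not the algorithm the plugin uses (which weighs things differently, hence the odd ordering), and rank_by_edit_distance() is a hypothetical name invented for the illustration.

```php
<?php
// Hypothetical helper, not part of smd_fuzzy_find: rank candidate words
// by Levenshtein edit distance to the (possibly misspelled) search term.
function rank_by_edit_distance($term, array $candidates)
{
    $scored = array();
    foreach ($candidates as $word) {
        // levenshtein() counts single-character inserts, deletes and replacements.
        $scored[$word] = levenshtein(strtolower($term), strtolower($word));
    }
    asort($scored); // smallest distance first; the words stay as array keys
    return $scored;
}

$ranked = rank_by_edit_distance('trianing', array('infuriating', 'training'));
print_r($ranked);
```

With plain edit distance, ‘training’ is two single-character changes away from ‘trianing’ and so sorts ahead of ‘infuriating’. A phonetic or weighted algorithm can legitimately disagree.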
Remember also that you can exclude sections using the section="!section_name, ..." syntax which is useful to exclude test areas from being searched. Haven’t rigorously tested using multiple section names yet and combinations of included/excluded sections. Putting debug="1" on will show you part of the query that MySQL sees so you can check that what you ask for is what you’re likely to get.
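For anyone curious how a section list like that could turn into a query fragment, here is a rough sketch. It is an assumption about the general approach, not the plugin's actual code: parse_section_filter() and section_filter_sql() are hypothetical names, the Section column name is assumed, and the closure syntax needs a much newer PHP than the versions discussed in this thread.

```php
<?php
// Hypothetical parser for a section list such as "articles, !test, !sandbox".
// Names prefixed with "!" are excluded; all others are included.
function parse_section_filter($attr)
{
    $filter = array('include' => array(), 'exclude' => array());
    foreach (explode(',', $attr) as $name) {
        $name = trim($name);
        if ($name === '' || $name === '!') {
            continue; // skip empty entries such as a trailing comma
        }
        if ($name[0] === '!') {
            $filter['exclude'][] = substr($name, 1);
        } else {
            $filter['include'][] = $name;
        }
    }
    return $filter;
}

// Build a WHERE fragment. addslashes() is used purely for illustration;
// real code should use the database layer's own escaping.
function section_filter_sql(array $filter)
{
    $quote = function ($s) { return "'" . addslashes($s) . "'"; };
    $parts = array();
    if ($filter['include']) {
        $parts[] = 'Section IN (' . implode(', ', array_map($quote, $filter['include'])) . ')';
    }
    if ($filter['exclude']) {
        $parts[] = 'Section NOT IN (' . implode(', ', array_map($quote, $filter['exclude'])) . ')';
    }
    return $parts ? implode(' AND ', $parts) : '1=1';
}

echo section_filter_sql(parse_section_filter('articles, !test, !sandbox')), "\n";
```

Comparing a fragment like this with what debug="1" prints is a quick way to confirm that mixed include/exclude lists behave as intended.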
And if you don’t like the English “sorry” messages and don’t have MLP installed, you can still customise them in the local language using the no_match_label, suggest_label and too_short_label attributes.
The road to search enlightenment
If and when all’s well, the slightly bumpy roadmap looks something like this:
- category/subcategory searching to help with the speed on large sites
- perhaps allow some way of passing in reasonable tolerance values to the plugin. You could then offer users some matches and if it doesn’t come up with anything good, could offer to loosen the algorithm for them further (or maybe allow them to specify a ‘slider of fuzziness’ up-front). That’ll require some experimentation.
- extend it to image searching. And perhaps comments, if I can keep the speed up to a reasonable level. Speed is an important issue as you found out and I want to see if the algorithm is fast enough to cope with large sites. It’s a massive algorithm (of which I have pretty much no clue how it works its magic) and I’d like to maybe offer a “cheap, quick n dirty” search first, working its way up to the longer one if nothing good is found. I’d base that on number of articles found, with some (overrideable) cutoff point where it’s more efficient (in terms of time/processing) to do it in two hops. I’m sure there are faster/leaner algorithms out there, so if you know any programmery types who like a challenge, point them this way :-)
- Anything else? If you think the above are useful or superfluous, or can think of anything else, speak up
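The “cheap first, expensive second” idea from the roadmap could look roughly like the sketch below. Everything in it is an assumption made for illustration: two_hop_search() is a hypothetical function, the articles are a plain array rather than database rows, and Levenshtein distance stands in for the plugin's real (much larger) fuzzy algorithm.

```php
<?php
// Hypothetical two-hop search: try a cheap exact substring match first and
// only fall back to the slower fuzzy pass when too few articles are found.
function two_hop_search($term, array $articles, $cutoff = 1, $maxDistance = 2)
{
    $hits = array();
    foreach ($articles as $id => $body) {
        if (stripos($body, $term) !== false) {
            $hits[] = $id; // hop 1: cheap, case-insensitive exact match
        }
    }
    if (count($hits) >= $cutoff) {
        return array('mode' => 'exact', 'ids' => $hits);
    }

    $hits = array();
    foreach ($articles as $id => $body) {
        // hop 2: compare the term with every word in the article
        foreach (str_word_count(strtolower($body), 1) as $word) {
            if (levenshtein(strtolower($term), $word) <= $maxDistance) {
                $hits[] = $id;
                break; // one close word is enough to count the article
            }
        }
    }
    return array('mode' => 'fuzzy', 'ids' => $hits);
}

$articles = array(1 => 'Dog training tips', 2 => 'Infuriating habits');
print_r(two_hop_search('trianing', $articles));
```

The $cutoff parameter plays the role of the overrideable cutoff point: raise it and the fuzzy pass runs more often, at the cost of speed.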
Edit: I did notice that tags, html code, urls in articles are also being searched. Don’t know if that can be turned off?… the TXP search appears to do that as well
Yeah, bit of a pain. I did try and get rid of that but it was ugly. Couldn’t fathom why I was getting a hit back on a page for the word “random” that blatantly wasn’t there. Then I realised I had an smd_random_banner tag in the article. D’oh! I even tried searching against Body_html in the hope that was pre-rendered without tags but it isn’t as far as I can tell :-( Guess we live with it unless someone can come up with a neat way of avoiding it…
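One possible way round it, sketched under the assumption that the filtering happens in PHP on the fetched article bodies rather than inside the MySQL query: strip the markup before comparing. searchable_text() is a hypothetical name, and this has not been tried against the plugin itself.

```php
<?php
// Hypothetical pre-filter: reduce an article body to its visible text before
// it is compared with the search term.
function searchable_text($body)
{
    // strip_tags() drops HTML and <txp:... /> tags alike, attributes included.
    $text = strip_tags($body);
    // Drop bare URLs that were written out in the text itself.
    $text = preg_replace('#\bhttps?://\S+#i', ' ', $text);
    // Collapse the whitespace left behind by the removals.
    return trim(preg_replace('/\s+/', ' ', $text));
}

$body = '<p>Welcome! <txp:smd_random_banner /> Read <a href="http://example.com/random">more</a>.</p>';
echo searchable_text($body), "\n";
```

The trade-off is speed: every candidate body has to be pulled back and cleaned in PHP, which works against the goal of keeping large sites fast.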
Incidentally, if you generate a 404 on the Ablogment it runs another bit of code to create a ‘subcategory cloud’. I know Wilshire’s made one of those as part of another of his awesome plugins (I’m not sure how close it is to this one… does it give number of articles too?)
If there seems to be a market for one of these clouds and nobody’s done one already, I was considering pluginising smd_cloud with, of course, customisable content (category, section, article title, keywords, yahde yahde) and any number of classes, so you can style the words in as many different ways (e.g. font sizes) as you care to make CSS rules for. Let me know your thoughts on that as well. I can fork a new thread for discussion if you think it has legs, or zap you with the MIB-style memory eraser if not.
Many thanks for taking the time to test drive this rather shaky code. I’ll post the pre-PHP 4.4.0 fix as soon as I get it working.
Last edited by Bloke (2007-03-31 20:17:49)
The smd plugin menagerie — for when you need one more gribble of power from Textpattern. Bleeding-edge code available on GitHub.
Hire Txp Builders – finely-crafted code, design and Txp
#19 2007-03-31 20:24:39
- els
- Moderator

- From: The Netherlands
- Registered: 2004-06-06
- Posts: 7,458
Re: smd_fuzzy_find needs beta testers/coders
Bloke wrote:
Sorry to cause you so much hassle.
Don’t be, I enjoy experimenting :)
I will play with it, and let you know if strange things happen. I really like this plugin, looking forward to seeing it further developed!
Re: smd_fuzzy_find needs beta testers/coders
I’ve fixed the problem with the unicode expressions. I’d post a new download but I’ve uncovered a bugette regarding apostrophes and can’t find a way round it right now (more in a mo).
In the meantime, go to smd_lib v0.23 near the bottom and edit this line from:
function smd_getWord($haystack,$offset=0,$chrs='#[\p{L}\p{N}]#u') {
to
function smd_getWord($haystack,$offset=0,$chrs='#[[:alnum:]\-\']#') {
That should fix it for PHP < 4.4.0 (and indeed work for all versions above too).
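An alternative to hard-coding one pattern would be to probe at run time whether the Unicode version compiles, and fall back if not. This is only a sketch: word_char_pattern() is a hypothetical helper, not something in smd_lib. Note too that the two patterns are not equivalent, since the fallback also accepts hyphens and apostrophes and only knows the letters of the current locale.

```php
<?php
// Hypothetical helper: choose a word-character pattern that the running PCRE
// build can compile. preg_match() returns false when compilation fails, and
// the @ hides the warning that older builds raise for \p{...}.
function word_char_pattern()
{
    $unicode = '#[\p{L}\p{N}]#u';
    if (@preg_match($unicode, 'a') === 1) {
        return $unicode;        // Unicode character properties are available
    }
    return '#[[:alnum:]\-\']#'; // POSIX class fallback, as in the edited line
}

$pattern = word_char_pattern();
var_dump(preg_match($pattern, 'a')); // a letter is a word character
var_dump(preg_match($pattern, ' ')); // a space is not
```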
The apostrophe
Search for a word that contains an apostrophe in it. The URL shows up as:
http://www.domain.com/?q=anybody%2527s
And on-screen the plugin is sorry but it can’t find a match for “anybody%27s”. So it’s removing the %25 but ignoring the ‘27’. I’ve tried all manner of calls to functions like html_entity_decode(), htmlentities(), htmlspecialchars(), urldecode(), rawurldecode(), etc — with and without ENT_QUOTES. So far it’s resisting all attempts to change back into a regular apostrophe.
Maybe I need to use the multibyte mb_* calls instead to translate it? Or maybe I’m just being stupid and missing something obvious. If anyone has any ideas, please put me out of my misery. Ta.
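For what it's worth, %2527 looks like an apostrophe that has been percent-encoded twice: %25 is an encoded “%”, so one round of decoding (which PHP already applies to query-string values) leaves %27, and a second round gives the apostrophe. Where the double encoding comes from is not established here. A minimal sketch of decoding until the value stops changing, with decode_repeatedly() as a hypothetical helper:

```php
<?php
// Hypothetical helper: undo repeated percent-encoding by decoding until the
// string stops changing, with a cap so odd input cannot loop for long.
function decode_repeatedly($value, $maxPasses = 3)
{
    for ($i = 0; $i < $maxPasses; $i++) {
        $decoded = rawurldecode($value);
        if ($decoded === $value) {
            break; // nothing left to decode
        }
        $value = $decoded;
    }
    return $value;
}

echo decode_repeatedly('anybody%2527s'), "\n"; // anybody's
echo decode_repeatedly('anybody%27s'), "\n";   // anybody's
```

The catch is that a visitor who genuinely searches for the literal text “%27” would also have it turned into an apostrophe, so finding and removing the extra encoding step is the cleaner fix.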
#21 2007-03-31 22:33:45
- els
- Moderator

- From: The Netherlands
- Registered: 2004-06-06
- Posts: 7,458
Re: smd_fuzzy_find needs beta testers/coders
This is very strange: when I replace that line in smd_lib, and hit ‘Save’, I get a 500 internal server error.
EDIT: Oh sorry, I just needed to de-activate it first.
Another EDIT (I really should take the time to properly check what’s happening…): I was able to save the modified plugin, but when trying to view the site:
Fatal error: Cannot instantiate non-existent class: smd_mlp in /home/httpd/vhosts/doggiez.nl/httpdocs/textpattern/lib/txplib_misc.php(512) : eval()’d code on line 15
Last edited by els (2007-03-31 22:38:23)
Re: smd_fuzzy_find needs beta testers/coders
Els wrote:
EDIT: Oh sorry, I just needed to de-activate it first.
Fatal error: Cannot instantiate non-existent class: smd_mlp in /home/httpd/vhosts/doggiez.nl/httpdocs/textpattern/lib/txplib_misc.php(512) : eval()’d code on line 15
Did you re-activate it? ;-)
#23 2007-03-31 22:45:44
- els
- Moderator

- From: The Netherlands
- Registered: 2004-06-06
- Posts: 7,458
Re: smd_fuzzy_find needs beta testers/coders
Yes I did :)
Re: smd_fuzzy_find needs beta testers/coders
Els wrote:
Yes I did :)
Just checking!
Then it’ll probably be a syntax error somewhere. Buggered if I can see it though. Try the full library download instead.
P.S. I’ve had the 500 internal server error thing before too. Editing articles and/or plugin code can sometimes tip my hoster’s counterspam measures and they deliver a 500 status code. Friendly.
Last edited by Bloke (2007-03-31 22:56:49)
#25 2007-03-31 22:55:48
- els
- Moderator

- From: The Netherlands
- Registered: 2004-06-06
- Posts: 7,458
Re: smd_fuzzy_find needs beta testers/coders
Thank you!!! Error is gone and it’s working :)
#26 2007-03-31 23:01:48
- els
- Moderator

- From: The Netherlands
- Registered: 2004-06-06
- Posts: 7,458
Re: smd_fuzzy_find needs beta testers/coders
…which is really strange because I copied the line of code from your post, and it’s exactly the same as the modified line in the plugin I just reinstalled…
Just noticed your edit :)
Last edited by els (2007-03-31 23:02:58)
#27 2007-06-09 20:36:07
- Logoleptic
- Plugin Author

- From: Kansas, USA
- Registered: 2004-02-29
- Posts: 482
Re: smd_fuzzy_find needs beta testers/coders
I admit I haven’t tested this myself, but while reading through the thread I thought of something that might increase the plugin’s performance. I just don’t know for sure if what I’m suggesting is actually possible.
Some plugins, like rss_unlimited_categories, store article information in a special database table when the article is saved. What if you moved all the filtering and soundex/metaphone processing to occur when an article is saved, storing the resulting unique words and their metaphone and soundex keys in a new smd_searchkeys table? You’d have a primary key of article ID, and fields containing each unique word and its pronunciation info. When a search was run, it would only need to query this table instead of doing all that processing every time.
I’m a newbie to plugin-writing, so I’m not sure this is possible. Thought I’d toss it out there as a suggestion, though.
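To show roughly what the save-time step might produce, here is a sketch. It is untested against Textpattern: build_search_keys() is a hypothetical function, the row layout is only a guess at what an smd_searchkeys table could hold, and str_word_count() only recognises plain ASCII words.

```php
<?php
// Hypothetical indexing step, run once when an article is saved: reduce the
// body to unique words and precompute their phonetic keys.
function build_search_keys($articleId, $body, $minWordLength = 4)
{
    $words = str_word_count(strtolower(strip_tags($body)), 1);
    $rows = array();
    foreach (array_unique($words) as $word) {
        if (strlen($word) < $minWordLength) {
            continue; // ignore short words, as min_word_length does at search time
        }
        $rows[] = array(
            'article_id' => $articleId,
            'word'       => $word,
            'soundex'    => soundex($word),
            'metaphone'  => metaphone($word),
        );
    }
    return $rows; // one row per unique word, ready to insert into the table
}

print_r(build_search_keys(7, 'Training the dog. The dog likes training!'));
```

Since each article yields many rows, the article ID alone could not be the primary key. A composite key of article ID plus word, with indexes on the phonetic columns, would be one option.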
Re: smd_fuzzy_find needs beta testers/coders
Logoleptic wrote:
What if you moved all the filtering and soundex/metaphone processing to occur when an article is saved,
Hey, that’s not a bad idea. I’ve no idea how to implement that or add hooks into save_article, but in theory it’d work. As you say, the good news is it only has to run the computationally expensive stuff at article save on a (comparatively) small data set and ferret it away in a table.
Of course, that table / table cluster would have to be designed such that it can maintain performance with large quantities of data or we’d just be substituting the “looking through a hunk of text” with “looking through a load of table indices”; which may not offer that much improvement (at least, with my database normalisation skills anyway :-)
It’s certainly worth bearing in mind as an approach if speed turns out to be an issue. Have you tested the plugin in its current form on a large site, btw? My limit’s something like 40 or 50 articles and it’s still working pretty well considering the thousands of words it has to trawl through. I really would love to understand that fuzzy algorithm and optimize it. Maybe one day…
Speaking of which, I must also schedule a release of the update which is nearing completion. Couple of new useful features to add, when I get round to tidying the code up.
Many thanks for the feedback.
Last edited by Bloke (2007-06-09 20:57:57)
#29 2007-06-09 22:13:11
- Logoleptic
- Plugin Author

- From: Kansas, USA
- Registered: 2004-02-29
- Posts: 482
Re: smd_fuzzy_find needs beta testers/coders
Hi Stef,
I don’t have access to a really large site that it would be safe to test this on. I’m nearing the end of a client project that involves about 70 articles, but I have a feeling that he’d frown on running experimental software. ;-)
I’m hoping to have the time to test this at some point in the future, but right now I’m working on finishing up my own plugin (a port of Typogrify to Txp). If I can get back to smd_fuzzy_find, I’ll be sure to let you know how things work out. Meanwhile, I’ll be keeping my eye on this thread. I’ve been hoping for something like this for quite a while now!
#30 2007-06-09 23:14:14
- els
- Moderator

- From: The Netherlands
- Registered: 2004-06-06
- Posts: 7,458
Re: smd_fuzzy_find needs beta testers/coders
Hi Stef,
I recently moved the fuzzy find from my test page to ‘live’, and it seems to work rather well on my site. Even in Dutch the ‘closest matches’ are, most of the time, very acceptable. At the moment the section that is being searched has 155 short articles.
One thing I noticed: when two words are entered in the search field, the output doesn’t make much sense (or at least it doesn’t help my visitors). I am aware that also in the regular search it would only give results if the two words occur in that exact order, but the way fuzzy find is handling this is not perfect. For instance on my site both words ‘hond’ and ‘clicker’ occur countless times, also both in one article.
- Searching for ‘clicker hond’ just results in ‘Sorry, no results matched “clicker hond” exactly.’, without any suggestions.
- Searching for ‘hond clicker’ does give some suggestions, but only for the second word.
- If I enter two words, of which the first one has occurrences, and the second doesn’t, it only gives suggestions for the second word.
I assume it’s only looking at the second word in this situation. And when the second word is too short, it just doesn’t find (look for?) close matches.
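One way the multi-word case might be handled is to split the query and look for a close match per word, so the word order stops mattering. The sketch below is a guess at an approach, not a description of what the plugin does: suggest_per_term() is a hypothetical function, the vocabulary is a hand-made array, and Levenshtein distance stands in for the real matching.

```php
<?php
// Hypothetical handling of a multi-word query: split it into terms and find
// the closest known word for each term separately.
function suggest_per_term($query, array $vocabulary, $minWordLength = 3)
{
    $suggestions = array();
    foreach (preg_split('/\s+/', trim($query), -1, PREG_SPLIT_NO_EMPTY) as $term) {
        if (strlen($term) < $minWordLength) {
            continue; // too short to match sensibly
        }
        $best = null;
        $bestDistance = PHP_INT_MAX;
        foreach ($vocabulary as $word) {
            $d = levenshtein(strtolower($term), strtolower($word));
            if ($d < $bestDistance) {
                $best = $word;
                $bestDistance = $d;
            }
        }
        $suggestions[$term] = $best; // closest vocabulary word for this term
    }
    return $suggestions;
}

print_r(suggest_per_term('cliker hont', array('hond', 'clicker', 'beloning')));
```

Each term gets its own suggestion, so ‘cliker hont’ and ‘hont cliker’ lead to the same pair of words.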
And it would be nice to have an excerpt in the search results ;) (or rather: to have it use the entire search_results form).
But apart from these minor annoyances, it’s working nicely!
Last edited by els (2007-06-09 23:19:35)