Textpattern CMS support forum

#13 2007-03-30 23:30:24

Bloke
Developer
From: Leeds, UK
Registered: 2006-01-29
Posts: 11,270

Re: smd_fuzzy_find needs beta testers/coders

Els wrote:

I installed v0.02, added refine="". See for yourself

Right, the fuzzy logic is working so it’s likely a bug in smd_getWord(). Typical!

OK, install v0.02_debug then and in your tag, add debug="3". That’ll throw me pretty much everything I need to try and work out why smd_getWord() is crying. Thanks.

Last edited by Bloke (2007-03-30 23:49:04)


The smd plugin menagerie — for when you need one more gribble of power from Textpattern. Bleeding-edge code available on GitHub.

Txp Builders – finely-crafted code, design and Txp

#14 2007-03-31 10:32:33

els
Moderator
From: The Netherlands
Registered: 2004-06-06
Posts: 7,458

Re: smd_fuzzy_find needs beta testers/coders

Output. Hope it helps :)

#15 2007-03-31 11:28:10

Bloke
Developer
From: Leeds, UK
Registered: 2006-01-29
Posts: 11,270

Re: smd_fuzzy_find needs beta testers/coders

Els wrote:

Output. Hope it helps :)

Yes and no.

I took one of your lists of words containing the word ‘training’ and made a new article out of it in the Ablogment. Mine found the mis-spelled word. (Incidentally, quite why it chose ‘infuriating’ over ‘training’ in the list is curious and I’ll have to investigate; it’s supposed to order the list by most likely match first.)

Clearly, yours is not working whereas mine is and they’re using the same version of the code and virtually the same tags. I even changed my xmlns to match your language in case it was that: it wasn’t.

So that leaves a server issue and my money’s on me using a function in smd_getWord() that’s either not implemented in your version of PHP or that has changed in a later version. Which version of PHP (and MySQL?) are you running? I’m on PHP 4.4.4 with MySQL 4.1.21-standard.

Last edited by Bloke (2007-03-31 11:32:23)


#16 2007-03-31 12:10:58

els
Moderator
From: The Netherlands
Registered: 2004-06-06
Posts: 7,458

Re: smd_fuzzy_find needs beta testers/coders

It’s PHP 4.3.11 and MySQL 4.1.22.

In the meantime I’ll try and copy this site to another account with another host, and see what it does there. I’ve got other sites hosted elsewhere, but with much less content, and the two sites that could really use a feature like this are unfortunately both hosted by the same host…

#17 2007-03-31 18:24:29

els
Moderator
From: The Netherlands
Registered: 2004-06-06
Posts: 7,458

Re: smd_fuzzy_find needs beta testers/coders

Okay, I copied the site elsewhere, now using PHP 4.4.4 and MySQL 4.1.21-standard. And it works!!! Page is now here. (I took out the debug="3".)

I apologize Stef for making you go through all this.

Sometimes it is finding really good matches (belonning, should be ‘beloning’) and sometimes it shows an incredible imagination (positiefe, should be ‘positieve’) :)

What kind of testing that I can do would be helpful for you?

Edit: I did notice that tags, html code, urls in articles are also being searched. Don’t know if that can be turned off?
Edit again: the TXP search appears to do that as well, I just never noticed that before…

Last edited by els (2007-03-31 18:30:50)

#18 2007-03-31 20:06:52

Bloke
Developer
From: Leeds, UK
Registered: 2006-01-29
Posts: 11,270

Re: smd_fuzzy_find needs beta testers/coders

Els wrote:

It’s PHP 4.3.11 and MySQL 4.1.22.

Bingo. I know what it is then: Unicode regular expressions were introduced in 4.4.0 and I thought I’d be clever and use them. But since only half the smd_getWord() function uses them and full Unicode support isn’t going to be in PHP until v6 is out and well tested (so, around 2042 then, based on the current rate of adoption), it’s no hardship to go back and use standard character-by-character patterns. I’ll fix that so you can test it on your existing server. Sorry to cause you so much hassle.

What kind of testing that I can do would be helpful for you?

Just use it and see if anything jumps out as being really stupid or doesn’t work as you expect, then let me know what it is. I’ve had some odd matches come up (the one earlier was a prime example: ‘trianing’ shouldn’t match ‘infuriating’ first, then ‘training’ second in my mind, but maybe the algorithm thinks so… I’ll check it out and see if stuff like that can be tweaked). But, like you say, sometimes it’s quite creepy how close it can get to what you intended. If you experiment with min_word_length you can get it to only look for bigger words; useful for very technical sites.
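For comparison only: the plugin's own matching algorithm isn't shown in this thread, but plain edit distance gives a baseline for what "most likely match first" could look like. The sketch below uses PHP's built-in levenshtein(); the function name demo_rank_by_distance is made up for illustration and is not part of smd_fuzzy_find.

```php
<?php
// Hypothetical baseline, not the plugin's algorithm: order candidate words
// by Levenshtein edit distance from the search term, closest first.
function demo_rank_by_distance($term, array $candidates) {
    $scored = array();
    foreach ($candidates as $word) {
        // levenshtein() is a PHP built-in: the number of single-character
        // inserts, deletes and substitutions needed to turn one into the other.
        $scored[$word] = levenshtein($term, $word);
    }
    asort($scored);              // smallest distance first, keys preserved
    return array_keys($scored);
}

// 'trianing' is 2 edits from 'training' and at least 3 from 'infuriating'
// (the lengths alone differ by 3), so 'training' sorts first here.
$ranked = demo_rank_by_distance('trianing', array('infuriating', 'training'));
echo implode(', ', $ranked), "\n";
```

If the plugin's ordering disagrees with a baseline like this, that may be a useful thing to report when testing.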

Remember also that you can exclude sections using the section="!section_name, ..." syntax which is useful to exclude test areas from being searched. Haven’t rigorously tested using multiple section names yet and combinations of included/excluded sections. Putting debug="1" on will show you part of the query that MySQL sees so you can check that what you ask for is what you’re likely to get.
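As a rough illustration of the include/exclude idea, here is one way a value like "!test, articles" could be turned into part of a query. This is a guess at the shape, not the plugin's code: the function name demo_section_clause and the Section column name are assumptions, and the real output is whatever debug="1" prints.

```php
<?php
// Hypothetical sketch, not smd_fuzzy_find's code: turn a value such as
// "!test, articles" into an SQL condition on an assumed Section column.
function demo_section_clause($attr) {
    $include = array();
    $exclude = array();
    foreach (explode(',', $attr) as $item) {
        $item = trim($item);
        $negated = (substr($item, 0, 1) === '!');
        $name = $negated ? trim(substr($item, 1)) : $item;
        // Accept only simple names so nothing needs SQL escaping here;
        // anything else (including empty entries) is skipped.
        if (!preg_match('/^[A-Za-z0-9_-]+$/', $name)) {
            continue;
        }
        if ($negated) {
            $exclude[] = "'" . $name . "'";
        } else {
            $include[] = "'" . $name . "'";
        }
    }
    $parts = array();
    if ($include) {
        $parts[] = 'Section IN (' . implode(',', $include) . ')';
    }
    if ($exclude) {
        $parts[] = 'Section NOT IN (' . implode(',', $exclude) . ')';
    }
    // No usable names at all: return a condition that filters nothing.
    return $parts ? implode(' AND ', $parts) : '1=1';
}

echo demo_section_clause('!test, articles'), "\n";
```

Mixing included and excluded names is exactly the combination that hasn't been tested rigorously, so comparing the debug output against what you expected is the thing to check.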

And if you don’t like the English “sorry” messages and don’t have MLP installed, you can still customise them in the local language using the no_match_label, suggest_label and too_short_label attributes.

The road to search enlightenment

If and when all’s well, the slightly bumpy roadmap looks something like this:

  1. Category/subcategory searching to help with the speed on large sites.
  2. Perhaps allow some way of passing in reasonable tolerance values to the plugin. You could then offer users some matches and, if it doesn’t come up with anything good, offer to loosen the algorithm for them further (or maybe allow them to specify a ‘slider of fuzziness’ up-front). That’ll require some experimentation.
  3. Extend it to image searching. And perhaps comments, if I can keep the speed up to a reasonable level. Speed is an important issue, as you found out, and I want to see if the algorithm is fast enough to cope with large sites. It’s a massive algorithm (and I have pretty much no clue how it works its magic) and I’d like to maybe offer a “cheap, quick n dirty” search first, working its way up to the longer one if nothing good is found. I’d base that on number of articles found, with some (overrideable) cutoff point where it’s more efficient (in terms of time/processing) to do it in two hops. I’m sure there are faster/leaner algorithms out there, so if you know any programmery types who like a challenge, point them this way :-)
  4. Anything else? If you think the above are useful or superfluous, or can think of anything else, speak up.
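The "two hops" idea in item 3 can be sketched roughly as follows. Everything here is a stand-in: demo_two_hop_search is a made-up name, the cheap hop is a plain case-insensitive substring test, the expensive hop is PHP's built-in levenshtein(), and the sketch works on a flat word list and cuts off on hit count, where the roadmap would base the cutoff on the number of articles found.

```php
<?php
// Hypothetical sketch, not the plugin's code: run a cheap pass first and
// only fall back to the costlier fuzzy pass when the cheap one finds too
// little. $cutoff is the minimum number of cheap hits that counts as enough.
function demo_two_hop_search($term, array $words, $cutoff = 1, $max_distance = 2) {
    // Hop 1: cheap, case-insensitive substring test.
    $hits = array();
    foreach ($words as $word) {
        if (stripos($word, $term) !== false) {
            $hits[] = $word;
        }
    }
    if (count($hits) >= $cutoff) {
        return $hits;
    }

    // Hop 2: edit-distance comparison against every word.
    $hits = array();
    foreach ($words as $word) {
        if (levenshtein($term, $word) <= $max_distance) {
            $hits[] = $word;
        }
    }
    return $hits;
}

$words = array('infuriating', 'training');
echo implode(', ', demo_two_hop_search('train', $words)), "\n";     // hop 1 is enough
echo implode(', ', demo_two_hop_search('trianing', $words)), "\n";  // needs hop 2
```

Whether two hops actually save time depends on how often the cheap pass succeeds, which is why an overrideable cutoff makes sense.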

Edit: I did notice that tags, html code, urls in articles are also being searched. Don’t know if that can be turned off?… the TXP search appears to do that as well

Yeah, bit of a pain. I did try and get rid of that but it was ugly. Couldn’t fathom why I was getting a hit back on a page for the word “random” that blatantly wasn’t there. Then I realised I had an smd_random_banner tag in the article. D’oh! I even tried searching against Body_html in the hope that was pre-rendered without tags but it isn’t as far as I can tell :-( Guess we live with it unless someone can come up with a neat way of avoiding it…

Incidentally, if you generate a 404 on the Ablogment it runs another bit of code to create a ‘subcategory cloud’. I know Wilshire’s made one of those as part of another of his awesome plugins (I’m not sure how close it is to this one… does it give number of articles too?)

If there seems to be a market for one of these clouds and nobody’s done one already, I was considering pluginising smd_cloud with, of course, customisable content (category, section, article title, keywords, yahde yahde) and any number of classes so you can style the words in as many different ways (e.g. font sizes) as you care to make CSS rules for. Let me know your thoughts on that as well. I can fork a new thread for discussion if you think it has legs, or zap you with the MIB-style memory eraser if not.

Many thanks for taking the time to test drive this rather shaky code. I’ll post the pre-PHP 4.4.0 fix as soon as I get it working.

Last edited by Bloke (2007-03-31 20:17:49)


#19 2007-03-31 20:24:39

els
Moderator
From: The Netherlands
Registered: 2004-06-06
Posts: 7,458

Re: smd_fuzzy_find needs beta testers/coders

Bloke wrote:

Sorry to cause you so much hassle.

Don’t be, I enjoy experimenting :)

I will play with it, and let you know if strange things happen. I really like this plugin, looking forward to seeing it further developed!

#20 2007-03-31 22:27:24

Bloke
Developer
From: Leeds, UK
Registered: 2006-01-29
Posts: 11,270

Re: smd_fuzzy_find needs beta testers/coders

I’ve fixed the problem with the Unicode expressions. I’d post a new download but I’ve uncovered a bugette regarding apostrophes and can’t find a way round it right now (more in a mo).

In the meantime, go to near the bottom of smd_lib v0.23 and change this line from:

function smd_getWord($haystack,$offset=0,$chrs='#[\p{L}\p{N}]#u') {

to

function smd_getWord($haystack,$offset=0,$chrs='#[[:alnum:]\-\']#') {

That should fix it for PHP < 4.4.0 (and indeed work for all versions above too).
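An alternative to hard-coding one pattern would be to test at runtime whether the Unicode version compiles and fall back if it doesn't. This is a sketch only: demo_word_chars_pattern and demo_get_word are made-up names, and demo_get_word is a guess at what a "word at this offset" helper might do, not the real smd_getWord(). The two patterns are the ones quoted above.

```php
<?php
// Hypothetical helper, not part of smd_lib: return the Unicode pattern when
// this PCRE build can compile it, otherwise the POSIX-class fallback.
function demo_word_chars_pattern() {
    // Without Unicode property support preg_match() fails and returns
    // false; @ hides the compile warning.
    $unicode_ok = (@preg_match('#\p{L}#u', 'a') === 1);
    return $unicode_ok ? '#[\p{L}\p{N}]#u' : '#[[:alnum:]\-\']#';
}

// Hypothetical stand-in for a "get the word at this offset" routine:
// collect consecutive characters matching $chrs, starting at $offset.
// Assumes $haystack is valid UTF-8 when the /u pattern is in use.
function demo_get_word($haystack, $offset, $chrs) {
    $utf8 = (substr($chrs, -1) === 'u');
    // Split on UTF-8 character boundaries for /u, on bytes otherwise.
    $chars = $utf8
        ? preg_split('//u', $haystack, -1, PREG_SPLIT_NO_EMPTY)
        : str_split($haystack);
    $word = '';
    for ($i = $offset; $i < count($chars); $i++) {
        if (preg_match($chrs, $chars[$i]) !== 1) {
            break;               // first non-word character ends the word
        }
        $word .= $chars[$i];
    }
    return $word;
}

$pattern = demo_word_chars_pattern();
echo demo_get_word('training courses', 0, $pattern), "\n";  // training
```

Note the two patterns are not equivalent: the fallback also accepts hyphens and apostrophes, while the Unicode one accepts accented letters, so results can differ between servers.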

The apostrophe

Search for a word that contains an apostrophe in it. The URL shows up as:

http://www.domain.com/?q=anybody%2527s

And on-screen the plugin is sorry but it can’t find a match for “anybody%27s”. So it’s removing the %25 but ignoring the ‘27’. I’ve tried all manner of calls to functions like html_entity_decode(), htmlentities(), htmlspecialchars(), urldecode(), rawurldecode(), etc — with and without ENT_QUOTES. So far it’s resisting all attempts to change back into a regular apostrophe.

Maybe I need to use the multibyte mb_* calls instead to translate it? Or maybe I’m just being stupid and missing something obvious. If anyone has any ideas, please put me out of my misery. Ta.
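One possible reading of that URL: %2527 is what you get when an apostrophe is percent-encoded twice (' becomes %27, then its % becomes %25). PHP decodes $_GET values once on its own, which would leave %27 behind. The sketch below only demonstrates that round trip with PHP's built-in rawurlencode() and rawurldecode(); demo_fully_urldecode is a made-up helper, and where the second encoding actually happens on the site is not something this thread establishes.

```php
<?php
// Hypothetical helper, not part of smd_fuzzy_find: undo repeated
// percent-encoding by decoding until the string stops changing.
function demo_fully_urldecode($value, $max_passes = 5) {
    for ($i = 0; $i < $max_passes; $i++) {
        $decoded = rawurldecode($value);
        if ($decoded === $value) {
            break;               // nothing left to decode
        }
        $value = $decoded;
    }
    return $value;
}

$once  = rawurlencode("anybody's");  // anybody%27s
$twice = rawurlencode($once);        // anybody%2527s, as seen in the URL

echo rawurldecode($twice), "\n";          // anybody%27s  (one pass)
echo demo_fully_urldecode($twice), "\n";  // anybody's    (two passes)
```

Decoding until stable is a blunt instrument, since it would also mangle a search for a literal "%27". If the double encoding can be traced to one place (a redirect or a form that encodes an already encoded value), fixing it there would be cleaner.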


#21 2007-03-31 22:33:45

els
Moderator
From: The Netherlands
Registered: 2004-06-06
Posts: 7,458

Re: smd_fuzzy_find needs beta testers/coders

This is very strange: when I replace that line in smd_lib, and hit ‘Save’, I get a 500 internal server error.

EDIT: Oh sorry, I just needed to de-activate it first.

Another EDIT (I really should take the time to properly check what’s happening…): I was able to save the modified plugin, but when trying to view the site:

Fatal error: Cannot instantiate non-existent class: smd_mlp in /home/httpd/vhosts/doggiez.nl/httpdocs/textpattern/lib/txplib_misc.php(512) : eval()’d code on line 15

Last edited by els (2007-03-31 22:38:23)

#22 2007-03-31 22:43:55

Bloke
Developer
From: Leeds, UK
Registered: 2006-01-29
Posts: 11,270

Re: smd_fuzzy_find needs beta testers/coders

Els wrote:

EDIT: Oh sorry, I just needed to de-activate it first.

Fatal error: Cannot instantiate non-existent class: smd_mlp in /home/httpd/vhosts/doggiez.nl/httpdocs/textpattern/lib/txplib_misc.php(512) : eval()’d code on line 15

Did you re-activate it? ;-)


#23 2007-03-31 22:45:44

els
Moderator
From: The Netherlands
Registered: 2004-06-06
Posts: 7,458

Re: smd_fuzzy_find needs beta testers/coders

Yes I did :)

#24 2007-03-31 22:51:59

Bloke
Developer
From: Leeds, UK
Registered: 2006-01-29
Posts: 11,270

Re: smd_fuzzy_find needs beta testers/coders

Els wrote:

Yes I did :)

Just checking!

Then it’ll probably be a syntax error somewhere. Buggered if I can see it though. Try the full library download instead.

P.S. I’ve had the 500 internal server error thing before too. Editing articles and/or plugin code can sometimes trip my hoster’s counterspam measures and they deliver a 500 status code. Friendly.

Last edited by Bloke (2007-03-31 22:56:49)

