Textpattern CMS support forum

#1 2007-03-26 14:19:05

Bloke
Developer
From: Leeds, UK
Registered: 2006-01-29
Posts: 11,250
Website GitHub

smd_fuzzy_find needs beta testers/coders

[ EDIT: NOTE THAT THIS PLUGIN IS NOW LIVE ]

For those of you that don’t like (* shock *) rss_live_search or prefer a less javascripty way of offering search facilities, I’ve been playing with the concept of making the TXP search a little less accurate (albeit intelligently so). I like the idea of offering people a “closest match” if they mis-type a word, or the search term isn’t found.

I found an implementation of a ‘nearest match’ routine which is totally beyond me at the moment: need to sit down with my debugging hat on to work out how it does what it does, as I want to mod it a bit. In the meantime, it does work reasonably well so I slung it in smd_lib, added a couple of niceties, and made a plugin out of it. Now it’s your turn to test it to destruction and offer ideas before I officially release it.

For fairly content-specific sites, because the available pool of words to check against is (by definition) your site’s content, it’s automatically context-sensitive, and can be made more so by specifying the section attribute. As it says in the docs, if you run a zoo website and someone searches for “lino” they don’t get an article on flooring, but suggestions for articles on lions instead.

The plugin’s classed as “very very beta” so if it collapses, please let me know under what circumstances and I’ll see what I can do. Some of the options haven’t been rigorously tested yet so I’ve not caught all the corner cases. Also, if it doesn’t do what it says on the tin, or you can think of a better way of doing it (more options, fewer options, more intuitive options/defaults, etc.), speak up now.

I think long-term, once the fuzzy logic is modded to natively support metaphones and soundex for better results (though, in the current version of PHP, this is only of use for the English language) I would like to offer this library out to others, e.g. to anyone who wants to resurrect ob1_advanced_search, maybe?

I’m especially interested in its performance/results on non-English sites. I suspect it won’t work as well because I’ve added metaphone/soundex capabilities in the plugin itself (rather than in the fuzzy logic behind the scenes) which are quite specific to English. I may offer an option in future to turn that off if it causes problems.

Test it

To try it out, you’ll need the latest version of smd_lib v0.31 installed. You’ll also need the last beta version, which is smd_fuzzy_find 0.03. It’s MLP compatible so if you have that installed you can customize your strings.

The plugin help explains (most of) what’s going on but as a quick ‘n’ dirty test, grab yourself a copy of chh_if_data and do something like this on your default page (or wherever your <txp:if_search> appears):

<txp:if_search>
  <txp:chh_if_data>
    <txp:article limit="8" listform="excerpts" />
  <txp:else />
    <txp:smd_fuzzy_find form="excerpts" />
  </txp:chh_if_data>
</txp:if_search>

Assuming you have a form called “excerpts”, that’ll work fine.

It’s not foolproof and I’m sure it’ll come back with some pretty funny ‘closest’ matches, but that’s the idea of offering it out now. If anybody has any faster, better (preferably shorter!) algorithms for determining closest matches in a block of text or has any interest in helping develop the library part of it further, please let me know. I just robbed the fuzzy logic class off the web and I’d love to understand it or extend it to make it more accurate.

‘scuse the code if you take a peek: it’s in a bit of a mess because I’m halfway through adding extra functionality (e.g. image name/alt/caption search capabilities), but it works well enough now to demo.

Play and enjoy.

Last edited by Bloke (2010-09-07 11:21:51)


The smd plugin menagerie — for when you need one more gribble of power from Textpattern. Bleeding-edge code available on GitHub.

Txp Builders – finely-crafted code, design and Txp


#2 2007-03-26 15:58:44

mrdale
Member
From: Walla Walla
Registered: 2004-11-19
Posts: 2,215
Website

Re: smd_fuzzy_find needs beta testers/coders

Well, that’s monstrously useful. I wish I had the time to test this. Soon…


#3 2007-03-26 17:19:39

squaredeye
Member
From: Greenville, SC
Registered: 2005-07-31
Posts: 1,495
Website

Re: smd_fuzzy_find needs beta testers/coders

Can someone please buy me some time? I’d love to help with this. Good luck, Bloke.

Last edited by ma_smith (2007-03-26 17:21:33)



#4 2007-03-26 17:54:20

Bloke
Developer
From: Leeds, UK
Registered: 2006-01-29
Posts: 11,250
Website GitHub

Re: smd_fuzzy_find needs beta testers/coders

ma_smith and mrdale wrote:

Can someone please buy me some time?
I wish I had the time to test this.

I’m fresh out of time capsules I’m afraid.

But there’s no immense hurry: I whipped the plugin up yesterday morning, tidied it today and just threw it out in its 80% tested state to see where it sticks. The fact you’ve registered interest means at least I know there’s a potential audience for it!

Anything helps. Even if you get two minutes over lunch, just hammering a couple of search terms into the Ablogment to check the reasonableness of the returned strings and/or trying to make the plugin fall over would be of value. Ta.



#5 2007-03-29 22:46:48

els
Moderator
From: The Netherlands
Registered: 2004-06-06
Posts: 7,458

Re: smd_fuzzy_find needs beta testers/coders

Stef, I like this idea very much! I definitely want to use this if I can get it to work. On your site it seems to be working rather well.
I have installed the plugin and have this code on my page:

<txp:chh_if_data>
<txp:article sort="posted desc" limit="50" /><br />
<txp:else />
<txp:smd_fuzzy_find form="search_results" tolerance="10" min_word_length="4" match_with="article:body;excerpt" />
</txp:chh_if_data>

As you can see I went up to tolerance="10", but it doesn’t help to show any matches, not even ones that aren’t close at all. Whatever I type in the search field, this is what I get:

Sorry, no results matched “whatever” exactly. The text you are searching for is probably too short. Try a longer word.

The site is in Dutch, but searching for – misspelled – English words (that definitely exist in the content) doesn’t make a difference.
Try for yourself if you wish: doggiez.nl. Some of the English words you can search for: training, obedience, agility, reinforcement, domestication.

Edit: okay, when I search for supercalifragilistic it doesn’t say that I should try a longer word ;) But it does for supercalifr, so up to 11 characters is ‘too short’…

Edit again: the search also takes ages, much longer than on your site. So I must be doing something wrong :)

Last edited by els (2007-03-29 23:09:52)


#6 2007-03-30 00:02:10

marios
Archived Plugin Author
Registered: 2005-03-12
Posts: 1,253

Re: smd_fuzzy_find needs beta testers/coders

@Bloke, exciting.

Very useful, indeed. Seems to work really well. I’m using it in the same configuration with the MLP Pack using default forms and options (so far).

Thanks, for putting this together.

regards, marios



#7 2007-03-30 19:25:51

Bloke
Developer
From: Leeds, UK
Registered: 2006-01-29
Posts: 11,250
Website GitHub

Re: smd_fuzzy_find needs beta testers/coders

Els wrote:

As you can see I went up to tolerance="10", but it doesn’t help to show any matches, not even ones that aren’t close at all. […] When I search for supercalifragilistic it doesn’t say that I should try a longer word ;) But it does for supercalifr, so up to 11 characters is ‘too short’…

Thanks for testing this. It’s probably the first of many bugs: I never tried it any higher than 4.

I think I misunderstood the use of ‘tolerance’ in the fuzzy logic function anyway. It seems to be checking whether (length of search string) / (tolerance + 2) is greater than 1. So if you give it a small 7-character word and your tolerance is 10, then 7 / (10 + 2) < 1, so it’ll throw the ‘too short’ error. Maybe I’ll have to rename the tolerance attribute! Best off sticking with 2 or 3, I think: I’ll change the docs.
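For anyone following along, the length check described above can be sketched like this. It is written in Python rather than the plugin's PHP, it is based only on the description in this post (not on the fuzzy class itself), and the function name `is_too_short` is made up for illustration.

```python
def is_too_short(term: str, tolerance: int) -> bool:
    """Assumed rule from the post: a term is only accepted when
    len(term) / (tolerance + 2) is greater than 1."""
    return len(term) / (tolerance + 2) <= 1


# Els's case, tolerance="10": the divisor is 12.
print(is_too_short("supercalifr", 10))           # 11 / 12 is below 1
print(is_too_short("supercalifragilistic", 10))  # 20 / 12 is above 1
# With tolerance="2" the divisor drops to 4.
print(is_too_short("trianing", 2))               # 8 / 4 is above 1
```

If the rule really is as assumed, it would explain why an 11-character term was rejected at tolerance="10" while the 20-character one got through.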

the search also takes ages, much longer than on your site. So I must be doing something wrong :)

Not necessarily. Your tag usage is fine; the trouble, I suspect, is that the algorithm gets ridiculously slow as the number of articles increases. That’s why I want people to test it with large numbers of articles. And other languages; both of which you are doing! The Ablogment only has about 20 or 30 articles to search through so it will be quite quick.

For info, so you can get an idea of what the plugin is doing:

  1. It takes your search term and tolerance, throws them at the fuzzy find class. That does some stuff I haven’t figured out yet :-\
  2. It then grabs aaallll the articles from the database (optionally from the given/current section)
  3. For each article it looks in match_with and grabs all the info you asked (body, excerpt, keywords etc) and makes a massive list of (unique) words out of it all
  4. It throws this list of words at the fuzzy logic code and it does… something :-) We get back a list of possible matches
  5. I do some checking and filtering, passing each match through metaphone() and soundex() — v0.02 won’t do that unless you want it to — then decide which are the best words from all that lot
  6. The results get sent to the page
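As a rough illustration of steps 2 to 6, here is a sketch in Python (the plugin itself is PHP). Plain Levenshtein edit distance stands in for the fuzzy find class, whose internals are not described in this thread, and `tolerance` here simply means "maximum number of edits", which is an assumption and not how the plugin's attribute behaves. The names `unique_words`, `edit_distance` and `closest_words` are hypothetical, so treat this as a picture of the overall flow only.

```python
import re


def unique_words(articles, min_word_length=4):
    """Steps 2-3: pool every distinct word from the article texts."""
    words = set()
    for text in articles:
        for word in re.findall(r"[^\W\d_]+", text.lower()):
            if len(word) >= min_word_length:
                words.add(word)
    return words


def edit_distance(a, b):
    """Levenshtein distance using the classic two-row dynamic programme."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def closest_words(term, articles, tolerance=2):
    """Steps 4-6: keep pooled words within `tolerance` edits, nearest first."""
    term = term.lower()
    scored = [(edit_distance(term, word), word) for word in unique_words(articles)]
    return [word for distance, word in sorted(scored) if distance <= tolerance]


articles = ["Obedience training for puppies", "Agility training and reinforcement"]
print(closest_words("trianing", articles))
```

Every pooled word is compared against the search term, so the cost grows with the size of the word list, which fits the slowdown on sites with many articles.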

So it’s doing quite a lot of checking and filtering which may well take some time. And I don’t know how quick the smd_getAtts() function is at forming long lists of words. I might try and speed that up or use a built-in PHP function, as all the extra functionality’s not really needed here.

You can limit the search by using section="" but that won’t help much if all your articles are in one main section of your site. What I think I will do is offer the option to filter by category as well, so if someone is looking in a particular category, you have the option of only searching those (sub)categories. That makes quite a bit of sense both from the user perspective (if they haven’t quite found what they’re looking for) and it’ll also speed the plugin up.

Try this:
  1. set tolerance back to 2 or 3
  2. go to the plugin code and find the line (around two-thirds of the way down) that says $werds = implode(" ",$werds[0]);. Just after that, add dmp($werds); . That’ll show you all the words being looked at in each article; should be a huuuuge list.

If that looks reasonable enough and there are tonnes of words displayed on the screen, get rid of that line and a few lines later, after $term = smd_getWord($werds,$idx); add dmp($term); That shows every word the plugin is considering as “close”. Again, this should display some (quite a few?) words. If it doesn’t, let me know.

I think that’ll do for now. After that it refines the words with metaphone and soundex searches so we’ll cross that bridge if the plugin is actually working up to that point.

Again, thanks for testing this out guys. This is why I didn’t want to release it just yet as it’s really pretty flaky right now.

Last edited by Bloke (2007-07-30 09:04:44)



#8 2007-03-30 23:02:43

els
Moderator
From: The Netherlands
Registered: 2004-06-06
Posts: 7,458

Re: smd_fuzzy_find needs beta testers/coders

Stef, thanks for your reply. I enjoy testing this!

First, I moved this testing to a separate section, so as not to confuse my visitors ;) Try it out on page testsearch.

The code I have now is:

<txp:chh_if_data>
<txp:article sort="posted desc" limit="50" /><br />
<txp:else />
<txp:smd_fuzzy_find form="search_results" tolerance="2" section="artikelen" min_word_length="4" match_with="article:body;excerpt" />
</txp:chh_if_data>

(Section ‘artikelen’ contains 155 short articles.)

go to the plugin code and find the line (around two-thirds of the way down) that says $werds = implode(" ",$werds[0]);. Just after that, add dmp($werds); . That’ll show you all the words being looked at in each article; should be a huuuuge list.

It is :)

If that looks reasonable enough and there are tonnes of words displayed on the screen, get rid of that line and a few lines later, after $term = smd_getWord($werds,$idx); add dmp($term); That shows every word the plugin is considering as “close”. Again, this should display some (quite a few?) words. If it doesn’t, let me know.

It doesn’t. What it does is output lots of empty <pre></pre> tags at the top of my page. See for instance doggiez.nl/testsearch/?q=trianing.

At least it loads much quicker now, even when I leave out the section attribute.


#9 2007-03-30 23:15:28

els
Moderator
From: The Netherlands
Registered: 2004-06-06
Posts: 7,458

Re: smd_fuzzy_find needs beta testers/coders

Bloke wrote:

It’s MLP compatible so if you have that installed you can customize your strings.

I don’t need MLP to make this work, right? Because I don’t have it installed.

Last edited by els (2007-03-30 23:16:37)


#10 2007-03-30 23:17:21

Bloke
Developer
From: Leeds, UK
Registered: 2006-01-29
Posts: 11,250
Website GitHub

Re: smd_fuzzy_find needs beta testers/coders

Els wrote:

What it does is output lots of empty <pre></pre> tags at the top of my page.

Right, so that explains the lack of output… ok, let’s back up a bit. A few lines before that is $matches = $finder->search($werds); Add dmp($matches); just after that to see if the fuzzy code is matching anything in your articles at all.

Before you do that, you can try installing v0.02 instead. Just fixed a few things (case sensitivity was broken), and improved the performance of the word split. Plus it now splits on more punctuation characters instead of just space and comma like it did before, so the word list is bigger.
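The splitting change might look something like the following, sketched in Python for brevity (the plugin is PHP). The exact punctuation set used in v0.02 is not given here, so the character class below is an assumption, as is the function name `split_words`. Lower-casing everything first is one simple way to take case out of the comparison.

```python
import re

# Assumed delimiter set: whitespace plus common punctuation.
DELIMITERS = re.compile(r"[\s,.;:!?()\[\]'\"]+")


def split_words(text):
    """Split on the delimiters and return unique lower-cased words, in order."""
    seen = {}
    for word in DELIMITERS.split(text.lower()):
        if word:
            seen.setdefault(word, None)  # a dict keeps first-seen order
    return list(seen)


print(split_words("Training, training; obedience (and agility)!"))
# ['training', 'obedience', 'and', 'agility']
```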

I also added the refine option. By default this is set to refine="metaphone, soundex" which employs both additional processes. You can select either, or none, of those to enable or disable additional filtering. Best practice is to set refine="" for predominantly non-English searches.

metaphone and soundex add very little to the plugin – it still finds most search terms without them – but occasionally, maybe 1 in 10 very badly typed things, it’ll offer a more intelligent suggestion (Briteneny Spares, anyone?)
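To give a feel for what the soundex half of refine is for, here is a small hand-rolled American Soundex in Python (PHP has soundex() and metaphone() built in). How the plugin actually combines these codes with the fuzzy matches is not spelled out in this thread, so `refine_by_soundex` is a hypothetical helper that just keeps candidates which sound like the search term. Metaphone is left out to keep it short.

```python
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(word):
    """Four-character American Soundex code; non-ASCII letters are ignored."""
    letters = [c for c in word.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "hw":
            continue  # h and w do not break a run of equal codes
        code = CODES.get(letter, "")
        if code and code != previous:
            result += code
        previous = code  # vowels reset the run
    return (result + "000")[:4]


def refine_by_soundex(term, candidates):
    """Hypothetical refinement: keep candidates with the same code as the term."""
    target = soundex(term)
    return [word for word in candidates if soundex(word) == target]


print(soundex("lino"), soundex("lion"))  # L500 L500
print(refine_by_soundex("lino", ["lion", "link", "lizard"]))  # ['lion']
```

The codes only know about unaccented English consonant groups, which is the reason for suggesting refine="" on predominantly non-English sites.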

I don’t need MLP to make this work, right?

Nope. Works fine without it. The Ablogment hasn’t got it (in fact I haven’t even tested it on a site with it installed yet… but my other plugins work and it’s more of the same… he says, hoping)

Last edited by Bloke (2007-03-30 23:51:19)



#11 2007-03-30 23:23:30

els
Moderator
From: The Netherlands
Registered: 2004-06-06
Posts: 7,458

Re: smd_fuzzy_find needs beta testers/coders

I installed v0.02, added refine="". See for yourself


#12 2007-03-30 23:30:17

els
Moderator
From: The Netherlands
Registered: 2004-06-06
Posts: 7,458

Re: smd_fuzzy_find needs beta testers/coders

Stef, I have to get some sleep ;) need to rise early tomorrow. I’ll happily continue the testing tomorrow.
