Textpattern CMS support forum
#25 2007-03-31 22:55:48
- els
- Moderator
- From: The Netherlands
- Registered: 2004-06-06
- Posts: 7,458
Re: smd_fuzzy_find needs beta testers/coders
Thank you!!! Error is gone and it’s working :)
#26 2007-03-31 23:01:48
- els
- Moderator
- From: The Netherlands
- Registered: 2004-06-06
- Posts: 7,458
Re: smd_fuzzy_find needs beta testers/coders
…which is really strange because I copied the line of code from your post, and it’s exactly the same as the modified line in the plugin I just reinstalled…
Just noticed your edit :)
Last edited by els (2007-03-31 23:02:58)
#27 2007-06-09 20:36:07
- Logoleptic
- Plugin Author
- From: Kansas, USA
- Registered: 2004-02-29
- Posts: 482
Re: smd_fuzzy_find needs beta testers/coders
I admit I haven’t tested this myself, but while reading through the thread I thought of something that might increase the plugin’s performance. I just don’t know for sure if what I’m suggesting is actually possible.
Some plugins, like rss_unlimited_categories, store article information in a special database table when the article is saved. What if you moved all the filtering and soundex/metaphone processing to occur when an article is saved, storing the resulting unique words and their metaphone and soundex keys in a new smd_searchkeys table? You’d have a primary key of article ID, and fields containing each unique word and its pronunciation info. When a search was run, it would only need to query this table instead of doing all that processing every time.
I’m a newbie to plugin-writing, so I’m not sure this is possible. Thought I’d toss it out there as a suggestion, though.
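A minimal sketch of the save-time indexing idea, written in Python rather than the plugin's PHP and not taken from smd_fuzzy_find itself. The table name smd_searchkeys comes from the post above; the column names, the composite (article_id, word) key, the helper names create_schema, index_article and find_articles, and the use of SQLite are all assumptions made for illustration. Only a Soundex key is stored here; a metaphone key could go in an extra column in the same way.

```python
import re
import sqlite3

# Standard American Soundex letter-to-digit mapping.
_CODES = {}
for _digit, _letters in enumerate(("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), start=1):
    for _ch in _letters:
        _CODES[_ch] = str(_digit)


def soundex(word):
    """Return the 4-character Soundex key for a word (ASCII letters only)."""
    word = re.sub(r"[^a-z]", "", word.lower())
    if not word:
        return ""
    key = word[0].upper()
    prev = _CODES.get(word[0], "")
    for ch in word[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            key += code
        if ch not in "hw":  # h and w do not separate letters with the same code
            prev = code
    return (key + "000")[:4]


def create_schema(conn):
    """Create the (hypothetical) lookup table and an index on the sound key."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS smd_searchkeys ("
        " article_id INTEGER NOT NULL,"
        " word TEXT NOT NULL,"
        " sound_key TEXT NOT NULL,"
        " PRIMARY KEY (article_id, word))"
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS smd_searchkeys_sound"
        " ON smd_searchkeys (sound_key)"
    )


def index_article(conn, article_id, text):
    """Run the expensive word processing once, when the article is saved."""
    create_schema(conn)
    words = sorted(set(re.findall(r"[a-z]+", text.lower())))
    with conn:  # one transaction: drop the old rows, then store the new ones
        conn.execute("DELETE FROM smd_searchkeys WHERE article_id = ?", (article_id,))
        conn.executemany(
            "INSERT INTO smd_searchkeys (article_id, word, sound_key) VALUES (?, ?, ?)",
            [(article_id, w, soundex(w)) for w in words],
        )


def find_articles(conn, term):
    """Search time only needs an indexed lookup on the precomputed key."""
    rows = conn.execute(
        "SELECT DISTINCT article_id FROM smd_searchkeys"
        " WHERE sound_key = ? ORDER BY article_id",
        (soundex(term),),
    )
    return [row[0] for row in rows]
```

With one row per unique word, the article ID on its own cannot be the primary key, which is why the sketch pairs it with the word. The [a-z]+ tokenizer also drops accented letters, so a real version would need something better for non-English content.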
Re: smd_fuzzy_find needs beta testers/coders
Logoleptic wrote:
What if you moved all the filtering and soundex/metaphone processing to occur when an article is saved,
Hey, that’s not a bad idea. I’ve no idea how to implement that or add hooks into save_article, but in theory it’d work. As you say, the good news is it only has to run the computationally expensive stuff at article save on a (comparatively) small data set and ferret it away in a table.
Of course, that table / table cluster would have to be designed such that it can maintain performance with large quantities of data, or we’d just be substituting the “looking through a hunk of text” with “looking through a load of table indices”, which may not offer that much improvement (at least, with my database normalisation skills anyway :-)
It’s certainly worth bearing in mind as an approach if speed turns out to be an issue. Have you tested the plugin in its current form on a large site, btw? My limit’s something like 40 or 50 articles and it’s still working pretty well considering the thousands of words it has to trawl through. I really would love to understand that fuzzy algorithm and optimize it. Maybe one day…
Speaking of which, I must also schedule a release of the update which is nearing completion. Couple of new useful features to add, when I get round to tidying the code up.
Many thanks for the feedback.
Last edited by Bloke (2007-06-09 20:57:57)
The smd plugin menagerie — for when you need one more gribble of power from Textpattern. Bleeding-edge code available on GitHub.
Txp Builders – finely-crafted code, design and Txp
#29 2007-06-09 22:13:11
- Logoleptic
- Plugin Author
- From: Kansas, USA
- Registered: 2004-02-29
- Posts: 482
Re: smd_fuzzy_find needs beta testers/coders
Hi Stef,
I don’t have access to a really large site that it would be safe to test this on. I’m nearing the end of a client project that involves about 70 articles, but I have a feeling that he’d frown on running experimental software. ;-)
I’m hoping to have the time to test this at some point in the future, but right now I’m working on finishing up my own plugin (a port of Typogrify to Txp). If I can get back to smd_fuzzy_find, I’ll be sure to let you know how things work out. Meanwhile, I’ll be keeping my eye on this thread. I’ve been hoping for something like this for quite a while now!
#30 2007-06-09 23:14:14
- els
- Moderator
- From: The Netherlands
- Registered: 2004-06-06
- Posts: 7,458
Re: smd_fuzzy_find needs beta testers/coders
Hi Stef,
I recently moved the fuzzy find from my test page to ‘live’, and it seems to work rather well on my site. Even in Dutch the ‘closest matches’ are, most of the time, very acceptable. At the moment the section that is being searched has 155 short articles.
One thing I noticed: when two words are entered in the search field, the output doesn’t make much sense (or at least it doesn’t help my visitors). I am aware that also in the regular search it would only give results if the two words occur in that exact order, but the way fuzzy find is handling this is not perfect. For instance on my site both words ‘hond’ and ‘clicker’ occur countless times, also both in one article.
- Searching for ‘clicker hond’ just results in ‘Sorry, no results matched “clicker hond” exactly.’, without any suggestions.
- Searching for ‘hond clicker’ does give some suggestions, but only for the second word.
- If I enter two words, of which the first one has occurrences, and the second doesn’t, it only gives suggestions for the second word.
I assume it’s only looking at the second word in this situation. And when the second word is too short, it just doesn’t find (look for?) close matches.
And it would be nice to have an excerpt in the search results ;) (or rather: to have it use the entire search_results form).
But apart from these minor annoyances, it’s working nicely!
Last edited by els (2007-06-09 23:19:35)
Re: smd_fuzzy_find needs beta testers/coders
Els wrote:
One thing I noticed: when two words are entered in the search field, the output doesn’t make much sense
D’oh, I’ll have to review the way it splits the words up. Thought I’d got it doing that but maybe I forgot or didn’t check. I’m not sure it’s intelligent enough to take the words and apply “proximity” searching yet like Google does, nor does it handle quoted strings (I don’t think – not tried).
But you’re right, it should find the two words ‘separately’ on a page rather than just find the 2nd, or none at all if one of them is too short. I’ll have to fix that. Thanks for the report.
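One way the "find each word separately" behaviour could look, sketched in Python. This is not the plugin's code: it swaps in difflib from the Python standard library as the fuzzy matcher, and the function name suggest_per_word, the min_length and cutoff parameters, and the sample vocabulary are all hypothetical.

```python
import difflib


def suggest_per_word(query, vocabulary, min_length=3, cutoff=0.7):
    """Return suggestions for each word of the query independently.

    Exact hits map to themselves, words shorter than min_length get an
    empty list instead of aborting the whole search, and everything else
    is matched fuzzily against the vocabulary.
    """
    suggestions = {}
    for word in query.lower().split():
        if word in vocabulary:
            suggestions[word] = [word]
        elif len(word) < min_length:
            suggestions[word] = []
        else:
            suggestions[word] = difflib.get_close_matches(
                word, vocabulary, n=3, cutoff=cutoff
            )
    return suggestions


if __name__ == "__main__":
    # Hypothetical vocabulary built from the words on a site.
    words = ["hond", "clicker", "training"]
    print(suggest_per_word("cliker hnd", words))
```

Because each word is handled on its own, word order no longer matters and a too-short or unknown word does not stop the other word from getting suggestions. Combining the per-word results into a ranked list of articles is left out here.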
And it would be nice to have an excerpt in the search results
Yeah that’s on the todo list. I’m intending to add a results_form attribute for that so it can be overridden, but I have to find a way of making it use the inbuilt form first :-)
#32 2007-06-11 14:44:16
- els
- Moderator
- From: The Netherlands
- Registered: 2004-06-06
- Posts: 7,458
Re: smd_fuzzy_find needs beta testers/coders
Don’t hurry! I think it’s great you’re working on this :)
Re: smd_fuzzy_find needs beta testers/coders
OK, after a small lag of several millennia, the next iteration is here. smd_fuzzy_find v0.03 beta (and the *cough* bug-fixed smd_lib) adds category support so you can refine your searches to either the current category and/or specific categories, or you can negate categories with the !myCategory syntax. Takes a comma-separated list (like section does), or ?c for the current category, or !c for NOT the current category.
subcats="1" will search in sub-categories of the given list of categories.
section accepts ?s and !s as well as negation. Can’t remember if it did before, but it does now.
status limits the search to articles of the given status(es) or statii or whatever the plural is. Choose from the usual crew of draft, hidden, pending, live, or sticky. Default is “live, sticky”.
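To make the list syntax concrete, here is a rough Python sketch of how a comma-separated value with ! negation and the ?c / !c (or ?s / !s) shortcuts could be split into include and exclude lists. The plugin's real parsing is not shown in this thread, so the function name parse_filter and its parameters are hypothetical.

```python
def parse_filter(value, current, marker="c"):
    """Split a comma-separated attribute value into (include, exclude) lists.

    "?c" stands for the current category and "!c" for NOT the current
    category; any other item starting with "!" is a negated name.
    Pass marker="s" to treat "?s" and "!s" the same way for sections.
    """
    include, exclude = [], []
    for item in (part.strip() for part in value.split(",")):
        if not item:
            continue  # ignore empty entries such as a trailing comma
        if item == "?" + marker:
            include.append(current)
        elif item == "!" + marker:
            exclude.append(current)
        elif item.startswith("!"):
            exclude.append(item[1:])
        else:
            include.append(item)
    return include, exclude


if __name__ == "__main__":
    print(parse_filter("news, !archive, ?c", current="dogs"))
```

One consequence of this scheme is that a category literally named c cannot be negated by name, because !c is always read as "not the current category".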
Els will also be pleased to hear I’ve squashed the two major bugs he mentioned. At least I think I have. If searching for more than one word, the plugin will search for them both in proximity to one another. In other words, they act like you’ve put quotes round the whole phrase. I have no idea how to separate them so they are both found somewhere on the page (separately), without understanding how the algorithm works.
So if you searched for barbee and len it would only find documents that mentioned barbie and ken in the same sentence, exactly like that… or perhaps with a few other choices thrown in depending on how fuzzy you want it.
Also, the search terms are now honoured in the search_results form. Specify form="name of form" to override the default of search_results. The <txp:search_results_excerpt /> tag now correctly highlights the terms it thought you meant in the article excerpts. It’s a kludge, but it seems to work. Note, however, if it finds a match in a field other than the body (e.g. the keywords), it won’t display the excerpt because TXP’s built-in search doesn’t handle them either.
So take it away and break it, reporting all quantum strangeness here. Sorry it took so long and doesn’t have comment searching yet. That’s more involved than I hoped, so I rolled back to this version after two weeks of head-scratching. Maybe I’ll manage it next time.
Enjoy.
#34 2007-08-06 21:30:16
- els
- Moderator
- From: The Netherlands
- Registered: 2004-06-06
- Posts: 7,458
Re: smd_fuzzy_find needs beta testers/coders
Bloke wrote:
Els will also be pleased to hear I’ve squashed the two major bugs he mentioned.
She, if you don’t mind ;)
Thanks for the update, already installed and running on my site.
A quick first try: searching for ‘bridg and targt’ does indeed suggest ‘bridge and target’.
‘bridg an targt’: nothing
‘bridg an target’: nothing
‘bridge an targt’: bridge and target
So I’m guessing that at least one word has to be correct, and that shouldn’t be the last one in the row?? (I’m not complaining, just observing!)
The excerpt display is a good improvement. I’ll do some more testing in the next days and report back.
Re: smd_fuzzy_find needs beta testers/coders
Els wrote:
She, if you don’t mind ;)
Ooops, apologies. I knew that, just being brain dead. Trying to remix a tune and update code and a forum post in the background at the same time is a recipe for disaster with my tiny mind.
searching for ‘bridg and targt’ does indeed suggest ‘bridge and target’.
phew!
‘bridg an targt’: nothing. ‘bridg an target’: nothing
Uh-oh :-(
So I’m guessing that at least one word has to be correct, and that shouldn’t be the last one in the row??
Not necessarily, it really depends on what you’re searching for (which is very annoying… hence the reason this is beta and I really want to find someone who understands how the search algorithm works!)
Take this bizarre situation for example on this test site:
Spelling ‘software’ as sotfwar = 2 hits (correct)
Spelling ‘software’ as softwar = 1 hit (not so correct, considering it’s a closer spelling)
Spelling ‘performance software’ as preformance softwar = 1 hit (correct)
Spelling ‘performance software’ as preformanc softwar = 0 hits (not so correct)
Spelling ‘performance software’ as reformance softwar = 1 hit (correct)
So it seems to be because it takes the string as a whole entity: if you chop out bits in the middle, it gives up.
I suppose from a linguistic point of view (looking at them as a collection of letters and ignoring the fact they actually sort of spell words):
bridgeantargt is pretty close to bridgeandtarget
whereas bridgantargt probably isn’t because removal of the ‘e’ turns the ‘dg’ sound (like the ‘j’ in ‘jam’) into the ‘dg’ of ‘edgar’. But that doesn’t easily explain why bridg and targt matches, unless the inclusion of the ‘d’ gives it more to work with?
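To put rough numbers on how close these strings are when treated as whole entities, here is a plain Levenshtein edit distance in Python. It is a stand-in measure chosen for illustration, not the algorithm the plugin uses (the thread never spells that out), and the function name is hypothetical.

```python
def levenshtein(a, b):
    """Count the single-character inserts, deletes and substitutions
    needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ca
                    current[j - 1] + 1,      # insert cb
                    previous[j - 1] + cost,  # substitute, or keep if equal
                )
            )
        previous = current
    return previous[-1]


if __name__ == "__main__":
    target = "bridge and target"
    for query in ("bridg and targt", "bridg an targt",
                  "bridg an target", "bridge an targt"):
        print(query, "->", levenshtein(query, target))
```

By this measure, three of the four queries Els tried are two edits away from ‘bridge and target’ and one is three edits away, yet two of them matched and two did not. So raw edit distance on the whole phrase does not seem to explain the behaviour on its own.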
I’m not making excuses for my sloppy coding, it may well be a bug! If you switch on debug= and put in a number from 1 to 3 you’ll get varying hunks of output as it tries and fails to match things. Feel free to post pertinent bits of that (or mail it to me with what you were trying to search for) and I’ll see if it’s something I’ve broken.
Maybe it’s simply a limitation of the search algorithm I stole that it only really works reliably for single words; and the fact that it sometimes suggests matches for multiple words is just good luck :-)
But for an algorithm that is supposed to intelligently suggest options for bad typing, it’s acting rather incoherently.
(sigh)
Last edited by Bloke (2007-08-06 22:32:49)
#36 2007-08-07 16:04:38
- els
- Moderator
- From: The Netherlands
- Registered: 2004-06-06
- Posts: 7,458
Re: smd_fuzzy_find needs beta testers/coders
Bloke wrote:
I’m not making excuses for my sloppy coding, it may well be a bug! If you switch on debug= and put in a number from 1 to 3 you’ll get varying hunks of output as it tries and fails to match things. Feel free to post pertinent bits of that (or mail it to me with what you were trying to search for) and I’ll see if it’s something I’ve broken.
I’ll do some more testing later this week, using the frequently used search queries from my logs, especially in Dutch. But I must say that to me this is not such a big problem, it’s already much better than returning no results at all even when only one letter is missing or misspelled :)