Textpattern CMS support forum

#1 2008-12-29 12:21:26

thebombsite
Archived Plugin Author
From: Exmouth, England
Registered: 2004-08-24
Posts: 3,251
Website

Search excerpt tag

Is <txp:search_result_excerpt /> one of those tags where we can choose whether or not to escape HTML? If not, could it be made so?


Stuart

In a Time of Universal Deceit
Telling the Truth is Revolutionary.


#2 2008-12-29 13:48:11

ruud
Developer Emeritus
From: a galaxy far far away
Registered: 2006-06-04
Posts: 5,068
Website

Re: Search excerpt tag

The excerpt should be escaped properly (= doesn’t cause validation problems). If it is not, then I’d love to see an example (I’d consider that to be a bug).


#3 2008-12-29 19:27:03

thebombsite
Archived Plugin Author
From: Exmouth, England
Registered: 2004-08-24
Posts: 3,251
Website

Re: Search excerpt tag

Well actually it is causing a validation error. If you check this page (there’s a link in the footer for validation) you will see it throws 2 errors. The problem seems to be where the excerpt starts or finishes or both, and coincides with encoded punctuation.

If the punctuation appears in the middle of the excerpt, it gets decoded into the relevant symbols. But if it occurs at the beginning or end of the excerpt, TXP can truncate the encoded entity (I presume because it is counting characters), so the symbols are not shown and you get the errors you see on that page. It also makes the excerpt a little incoherent unless you happen to know what you are looking at.

What I’m wondering is whether it is possible to force TXP to not break an encoded entity? Give it a rule for length but make it a little flexible?
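The truncation described above can be reproduced outside TXP. The following is a minimal sketch, assuming the excerpt is built with a pattern of the form `\b.{1,N}` + search term + `.{1,N}\b`. The helper name `naive_excerpt`, the sample text, and the shortened window of 6 characters (instead of 50) are assumptions made purely for illustration.

```php
<?php
// Hypothetical helper (not part of Textpattern): takes up to $window
// characters on each side of $q, starting and ending on a word boundary.
function naive_excerpt(string $text, string $q, int $window): ?string
{
    $pattern = '/\b.{1,' . $window . '}' . preg_quote($q, '/')
             . '.{1,' . $window . '}\b/iu';

    return preg_match($pattern, $text, $m) ? $m[0] : null;
}

// Sample text with an encoded quote on each side of the search term.
$text    = '&ldquo;big red bus&rdquo;';
$excerpt = naive_excerpt($text, 'red', 6);

echo $excerpt, "\n"; // prints ";big red bus&"
```

Both entities are cut here: only the trailing `;` of the first and the leading `&` of the second survive, and a bare `&` is the kind of thing a validator complains about.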



#4 2008-12-29 19:34:43

ruud
Developer Emeritus
From: a galaxy far far away
Registered: 2006-06-04
Posts: 5,068
Website

Re: Search excerpt tag

In /textpattern/publish/taghandlers.php there’s a function search_result_excerpt.
Try replacing this line:

preg_match_all("/\b.{1,50}".preg_quote($q).".{1,50}\b/iu", $result, $concat);

with:

preg_match_all("/\b(?:[^&]|&[^;]+;){1,50}".preg_quote($q)."(?:[^&]|&[^;]+;){1,50}\b/iu", $result, $concat);
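As a rough illustration of what the replacement changes, the sketch below runs both pattern shapes against a short string. It is not Textpattern code: the helper name `excerpt_with`, the sample text, and the window of 6 (instead of 50) are assumptions chosen so the effect shows up on one line.

```php
<?php
// Hypothetical helper: builds an excerpt pattern in which $unit is the
// sub-pattern that counts as "one character", then trims the first match.
function excerpt_with(string $unit, string $text, string $q, int $window): ?string
{
    $side    = '(?:' . $unit . '){1,' . $window . '}';
    $pattern = '/\b' . $side . preg_quote($q, '/') . $side . '\b/iu';

    return preg_match($pattern, $text, $m) ? trim($m[0]) : null;
}

$text = 'a big red bus&rdquo; ok';

// Original shape: every character counts as one unit.
$original = excerpt_with('.', $text, 'red', 6);

// Replacement shape: a whole entity (&...;) counts as one unit.
$entityAware = excerpt_with('[^&]|&[^;]+;', $text, 'red', 6);

echo $original, "\n";    // prints "a big red bus&"
echo $entityAware, "\n"; // prints "a big red bus&rdquo;"
```

In this sample the trailing entity is no longer split, because the alternation consumes `&rdquo;` in a single step.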


#5 2008-12-29 20:06:04

thebombsite
Archived Plugin Author
From: Exmouth, England
Registered: 2004-08-24
Posts: 3,251
Website

Re: Search excerpt tag

Excellent! That seems to be working as it should. You can check the link again if you wish. :)

Will this get into the development code so I’m not running with a hack?



#6 2008-12-29 20:15:52

ruud
Developer Emeritus
From: a galaxy far far away
Registered: 2006-06-04
Posts: 5,068
Website

Re: Search excerpt tag

I’ll do a bit more testing myself and commit it. It should make it into 4.0.8.


#7 2008-12-29 20:30:51

thebombsite
Archived Plugin Author
From: Exmouth, England
Registered: 2004-08-24
Posts: 3,251
Website

Re: Search excerpt tag

Great. :)



#8 2008-12-29 21:11:08

ruud
Developer Emeritus
From: a galaxy far far away
Registered: 2006-06-04
Posts: 5,068
Website

Re: Search excerpt tag

Actually, it’s not so great. That fix doesn’t always work. It breaks in very interesting ways ;)
I’m looking for an alternative approach.


#9 2008-12-29 21:28:31

thebombsite
Archived Plugin Author
From: Exmouth, England
Registered: 2004-08-24
Posts: 3,251
Website

Re: Search excerpt tag

OK. As long as you know about the problem. :)



#10 2008-12-29 23:18:31

ruud
Developer Emeritus
From: a galaxy far far away
Registered: 2006-06-04
Posts: 5,068
Website

Re: Search excerpt tag

What I tried to do was: match entities and/or normal characters (leading) + match search string + match either entities or normal characters (trailing).

That works fine if there’s no search string and you’re not matching on word boundaries. But since we are searching, there is typically a search string, and when that search string contains either the beginning or end of an entity, my earlier solution breaks (read: same problem as you had to begin with; it doesn’t crash or anything dramatic like that, although it could skip certain matches in the excerpt). Matching on word boundaries causes the same problem. You don’t notice this until you start playing with excerpts that have an entity on the very edge of the excerpt limit or that have lots of trailing non-word characters.
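The word-boundary case can be seen in a small sketch. It assumes the entity-aware pattern shape from earlier in the thread, with the window shortened to 6 and a made-up sample string. It covers only the word-boundary failure, not the case where the search string itself overlaps an entity.

```php
<?php
// One "unit" is either a single non-& character or a whole entity.
$unit    = '(?:[^&]|&[^;]+;)';
$pattern = '/\b' . $unit . '{1,6}' . preg_quote('red', '/')
         . $unit . '{1,6}\b/iu';

// The leading entity is more than 6 units away from the search term.
preg_match($pattern, '&ldquo;big red bus&rdquo;', $m);
$excerpt = $m[0];

echo $excerpt, "\n"; // prints ";big red bus"
```

The `\b` anchor is satisfied between the `o` and the `;` of `&ldquo;`, so the match is free to start in the middle of the entity even though the repeated group itself never splits one.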

This appears to be working better:

 		for ($i = 0, $r = array(); $i < min($limit, count($concat[0])); $i++)
 		{
-			$r[] = trim($concat[0][$i]);
+			$r[] = preg_replace('/^\w{0,10};\s*/', '', preg_replace('/\s*&[^;]*$/', '', trim($concat[0][$i])));
 		}
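To show what the two nested `preg_replace` calls do on their own, here is a small sketch that wraps them in a function. The function name `clean_excerpt_edges` and the sample strings are assumptions for illustration; the two patterns are the ones from the change above.

```php
<?php
// Hypothetical wrapper around the two clean-up patterns.
function clean_excerpt_edges(string $fragment): string
{
    $fragment = trim($fragment);

    // Drop an unterminated entity (and preceding whitespace) at the end.
    $fragment = preg_replace('/\s*&[^;]*$/', '', $fragment);

    // Drop the tail of a cut-off entity (up to 10 word characters
    // followed by ";") at the start, plus any whitespace after it.
    return preg_replace('/^\w{0,10};\s*/', '', $fragment);
}

$cut   = clean_excerpt_edges('amp; fish and chips &quo');
$whole = clean_excerpt_edges('fish &amp; chips');
$plain = clean_excerpt_edges('done; next step');

echo $cut, "\n";   // prints "fish and chips"
echo $whole, "\n"; // prints "fish &amp; chips"
echo $plain, "\n"; // prints "next step"
```

The last case suggests one trade-off: a fragment that legitimately starts with a short word followed by a semicolon loses that word too, since the pattern cannot tell it apart from the tail of an entity.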


#11 2008-12-30 11:45:53

thebombsite
Archived Plugin Author
From: Exmouth, England
Registered: 2004-08-24
Posts: 3,251
Website

Re: Search excerpt tag

Do you want me to try that out? Is it with the original “preg_match_all” line above or the modified version?

EDIT – I’ve changed the “preg_match_all” line back to the original code and swapped in the new code.

Again it seems to be working fine. I’m trying out a few different search terms (though I have a limited number of articles to search on this site), particularly ones I know will pull out excerpts with encoded punctuation in them, such as this page, which works OK.

Actually, and I’m not sure if this is a “proof of concept”, that search result page has the following in it:-

…the latest parser. It also uses the new tags <txp:variable />, <txp:if_variable>, <txp: … if_keywords> and <txp:modified />. With the exception of <txp:

which shows that TXP has managed to split one of the tags in the original article, <txp:if_keywords>, into 2 separate excerpts without validation problems.

Last edited by thebombsite (2008-12-30 12:20:23)



#12 2008-12-30 12:48:42

ruud
Developer Emeritus
From: a galaxy far far away
Registered: 2006-06-04
Posts: 5,068
Website

Re: Search excerpt tag

The excerpting is done by first matching the search term and taking up to 50 characters on each side of the match, starting and ending on a word boundary. A word boundary is a place between characters where there is a word character (alphanumeric or underscore) on one side and a non-word character on the other side. So in the example you give, it first matches the ‘txp’ in txp:variable, and around 50 characters to the right there’s a word boundary just before the ‘:’ in txp:if_keywords (because p is a word character, while : is not).
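A quick way to see where the boundaries fall is to split a string on `\b`. This is only a sketch using the tag name from the example; note that in PCRE the underscore counts as a word character, so `if_keywords` stays in one piece.

```php
<?php
// Split at every word boundary and discard the empty pieces
// produced by the boundaries at the very start and end.
$parts = preg_split('/\b/', 'txp:if_keywords', -1, PREG_SPLIT_NO_EMPTY);

print_r($parts); // three pieces: "txp", ":" and "if_keywords"
```

The boundaries on either side of the `:` are the places where an excerpt is allowed to start or stop, which is how a tag can end up split across two excerpts.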

This appears to make sense when you’re matching pure text, but since we do have entities in the data we’re trying to excerpt, we could opt to let an excerpt string begin/end with a space instead of a word boundary, like this:

 		$q = $pretext['q'];

 		$result = preg_replace('/\s+/', ' ', strip_tags(str_replace('><', '> <', $thisarticle['body'])));
-		preg_match_all("/\b.{1,50}".preg_quote($q).".{1,50}\b/iu", $result, $concat);
+		preg_match_all("/(^|\s|\G).{1,50}".preg_quote($q).".{1,50}(\s|$)/iu", $result, $concat);

 		for ($i = 0, $r = array(); $i < min($limit, count($concat[0])); $i++)
 		{
-			$r[] = trim($concat[0][$i]);
+			$r[] = preg_replace('/^\w{0,10};\s*/', '', preg_replace('/\s*&[^;]*$/', '', trim($concat[0][$i])));
 		}

 		$concat = join($break.n, $r);
 		$concat = preg_replace('/^[^>]+>/U', '', $concat);

Applying that to your example, the ... now appears just before the <txp:if_keywords> tag.
However, it doesn’t prevent the second part of the excerpt from starting at the same place where the first part of the excerpt stops. Nor would it prevent the ... from appearing in the middle of one of the txp tags in that article which have attributes (because attributes are surrounded by spaces).
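For comparison, the sketch below runs the word-boundary pattern and the whitespace-delimited pattern side by side. The helper name `first_excerpt`, the sample text, and the window of 11 (instead of 50) are assumptions for illustration. Only the first match is taken, and the clean-up step in the loop is left out.

```php
<?php
// Hypothetical helper: returns the first trimmed match, or null.
function first_excerpt(string $pattern, string $text): ?string
{
    return preg_match($pattern, $text, $m) ? trim($m[0]) : null;
}

$q    = preg_quote('red', '/');
$text = 'a &ldquo;big red bus&rdquo; b';

// Excerpt delimited by word boundaries.
$wordBoundary = first_excerpt('/\b.{1,11}' . $q . '.{1,11}\b/iu', $text);

// Excerpt delimited by whitespace or the start/end of the string.
$whitespace = first_excerpt('/(^|\s|\G).{1,11}' . $q . '.{1,11}(\s|$)/iu', $text);

echo $wordBoundary, "\n"; // prints "ldquo;big red bus&rdquo"
echo $whitespace, "\n";   // prints "&ldquo;big red bus&rdquo;"
```

Because entities contain no whitespace, an excerpt that has to begin and end at a space keeps both entities whole in this sample.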
