Go to main content

Textpattern CMS support forum

You are not logged in. Register | Login | Help

#37 2014-12-09 09:02:08

etc
Developer
Registered: 2010-11-11
Posts: 5,123
Website GitHub

Re: Making plugins first-class citizens

ruud wrote #286340:

It’s not that common in the simple if/else constructs that this optimisation targets.

Ruud, I really appreciate your argumentation, and you are certainly right re EvalElse() statistics. But look, I write about parse() optimization too. About 95% of article bodies contain "link":http://... (transformed into <a href="http://..."> by Textile), or some <image src="http://..." />, and no <txp:tag />. For these, optimizing parse() would make <txp:body /> processing much faster.

Offline

#38 2014-12-09 20:01:31

ruud
Developer Emeritus
From: a galaxy far far away
Registered: 2006-06-04
Posts: 5,068
Website

Re: Making plugins first-class citizens

You’re right. Patching a patch:

- static $prefix = null;
+ static $istag, $prefix = null,  

  if ($prefix === null) {
-   $prefix = get_pref('allow_short_tags', '1') ? '[a-z]{3}' : 'txp';
+   if (get_pref('allow_short_tags', '1')) {
+     $prefix = '[a-z]{3}';
+     $istag = ':';
+   } else {
+     $prefix = 'txp';
+     $istag = '<txp:';
+   }
  }
+
+ if (strpos($thing, $istag) === false) return $thing;

Adding a parse attribute <txp:body parse="0" /> is another way to solve this, although you lose part of the speed increase by having to parse the attribute ;)

Offline

#39 2014-12-10 11:59:52

etc
Developer
Registered: 2010-11-11
Posts: 5,123
Website GitHub

Re: Making plugins first-class citizens

Great! I would only make short tags opt-in, since it’s a new feature: if (get_pref('allow_short_tags', '0')) ...

ruud wrote #286383:

Adding a parse attribute <txp:body parse="0" /> is another way to solve this, although you lose part of the speed increase by having to parse the attribute ;)

No, we don’t want to lose a dime. Let’s win even more: why re-parse what is already parsed? When you process, say

<txp:article limit="9999">
	<txp:title />
</txp:article>

you call preg_split(... '<txp:title />' ...) 9999 times. Why not to do it once, and cache the result? How could this $thing inside <txp:article /> change anyway?

My current parse() running here looks like this (sorry for flooding):

function parse($thing='') {

	static $stack = array();
	if(strpos($thing, '<txp:') === false) return $thing;

//	don't parse $thing twice!!!

	$hash = sha1($thing);
	if(isset($stack[$hash])) $tags = $stack[$hash];
	else
	{
		$tags = array();
		$level  = 0;
		$inside = '';
		$istag  = FALSE;

		$f = '@(</?txp:\w+(?:\s+\w+\s*=\s*(?:"(?:[^"]|"")*"|\'(?:[^\']|\'\')*\'|[^\s\'"/>]+))*\s*/?'.chr(62).')@s';
		$t = '@:(\w+)(.*?)/?.$@s';

		$parsed = preg_split($f, $thing, -1, PREG_SPLIT_DELIM_CAPTURE);

		foreach ($parsed as $chunk)
		{
			if ($istag)
			{
				if ($level === 0)
				{
					preg_match($t, $chunk, $tag);

					if (substr($chunk, -2, 1) === '/')
						$tags[] = array('tag' => $tag[1], 'atts' => $tag[2], 'thing' => null);
					else
					{ # opening
						$level++;
					}
				}
				else
				{
					if (substr($chunk, 1, 1) === '/')
					{ # closing
						if (--$level === 0)
						{
							$tags[] = array('tag' => $tag[1], 'atts' => $tag[2], 'thing' => $inside);
							$inside = '';
						}
						else
						{
							$inside .= $chunk;
						}
					}
					elseif (substr($chunk, -2, 1) !== '/')
					{ # opening inside open
						++$level;
						$inside .= $chunk;
					}
					else
					{
						$inside .= $chunk;
					}
				}
			}
			else
			{
				if ($level) $inside .= $chunk;
				else $tags[] = $chunk;
			}
			$istag = !$istag;
		}
		$stack[$hash] = $tags;
	}

	$out = '';
	foreach($tags as $i => $tag) $out .= $i&1 ? processTags($tag['tag'], $tag['atts'], $tag['thing']) : $tag;

	return $out;
}

Same idea for EvalElse() and splat(). There is a risk of SHA1 collision, but none (unlike MD5) is known for the moment. The speedup on long loop structures is measurable, though preg_split() is not the slowest part of it. No particular memory consumption detected atm.

What you devs think of it?

Offline

#40 2014-12-10 20:41:01

ruud
Developer Emeritus
From: a galaxy far far away
Registered: 2006-06-04
Posts: 5,068
Website

Re: Making plugins first-class citizens

Some benchmarks, each row with a different $thing string:

$thing = 'some very long text that does not contain an else tag';
$thing = '<txp:tag />';
$thing = '<txp:tag><txp:tag><txp:tag><txp:tag><txp:tag></txp:tag></txp:tag></txp:tag></txp:tag></txp:tag>';
$thing = '<txp:tag /><txp:tag /><txp:tag /><txp:tag /><txp:tag /><txp:tag /><txp:tag /><txp:tag /><txp:tag /><txp:tag />';

	0.6327	0.5520	0.6313	0.9997	1.1809
	2.2664	1.3447	1.6456	2.8423	3.8831
	5.2651	0.9088	1.4774	3.7682	6.3109
	16.1752	8.4240	9.6645	14.2784	25.8554

column1: current parser (250.000 times)
column2: etc parser (250.000 loop)
column3: etc parser (10 loop, 25.000 times)
column4: etc parser (2 loop, 125.000 times)
column5: etc parser (no loop, 250.000 times)

I should add that I removed the first optimisation from the etc parser to be able to test just the effect of sha1 hashing.

Summarised. You gain quite a bit in loops, but you also lose a lot in non looping series of tags.

Offline

#41 2014-12-10 21:29:03

Bloke
Developer
From: Leeds, UK
Registered: 2006-01-29
Posts: 11,391
Website GitHub

Re: Making plugins first-class citizens

ruud wrote #286407:

Summarised. You gain quite a bit in loops, but you also lose a lot in non looping series of tags.

Woah, there’s some great stuff in the etc parser, but that last case is a shame.

Open question then: is it possible to figure out through testing/benchmarking, the threshold at which the etc parser begins to degrade? Perhaps that would allow us to make an informed decision up-front via a simple test (even a tag count) on whether it’s beneficial to use the etc parser or the ruud-enhanced parser? Or a bit of both?

Just throwing it out there. I’ve no idea if that’s even feasible, but I’ve been looking at phpGolf today and I’m inspired by the lengths to which people will go, in order to find the most optimised (albeit in these cases unreadable) solution to a problem!


The smd plugin menagerie — for when you need one more gribble of power from Textpattern. Bleeding-edge code available on GitHub.

Txp Builders – finely-crafted code, design and Txp

Offline

#42 2014-12-10 22:25:10

etc
Developer
Registered: 2010-11-11
Posts: 5,123
Website GitHub

Re: Making plugins first-class citizens

ruud wrote #286407:

you also lose a lot in non looping series of tags.

Thanks for benchmarking! That’s not exactly what I get, but it could be server-dependent: for a string containing 1000 <txp:tag /> (more is very unlikely) parsed once I get ~0.012s for both. The most representative imho is column 3. I should add that processTags() is slightly optimized too, by replacing $out = $tag(splat($atts), $thing); inside if (maybe_tag($tag)) with

$out = $tag(ltrim($atts) === '' ? array() : splat($atts), $thing);

though this optimization shouldn’t intervene here, <txp:tag /> being fake.

Offline

#43 2014-12-11 07:56:07

ruud
Developer Emeritus
From: a galaxy far far away
Registered: 2006-06-04
Posts: 5,068
Website

Re: Making plugins first-class citizens

etc wrote #286411:

Thanks for benchmarking! That’s not exactly what I get,

I’ve replaced processTags with a function that either calls parse() if there is a $thing or otherwise just returns. My aim was to test only the parse() function for this benchmark. I’m not sure why that last $thing test string is so slow, because the only difference should be the one sha1() call. I’ll try to find out tonight.

Bloke wrote #286408:

Open question then: is it possible to figure out through testing/benchmarking, the threshold at which the etc parser begins to degrade?

So far the threshold seems to be: are we looping yes or no? If there is a cheap test to see if we’re in a loop, that could work. @etc has given me some inspiration for further optimisation ;)

Offline

#44 2014-12-12 17:17:44

etc
Developer
Registered: 2010-11-11
Posts: 5,123
Website GitHub

Re: Making plugins first-class citizens

ruud wrote #286422:

I’ve replaced processTags with a function that either calls parse() if there is a $thing or otherwise just returns.

That’s what I thought. Then the new parser will be slower, indeed, not because of sha1, but array assignments. But with the real processTags() the slowdown is of maybe 10%, and we gain ~40% in 10-loops (typical <txp:container /> limit).

@etc has given me some inspiration for further optimisation ;)

Great! I’m here if you need a pair of eyes. I’ve got a version that caches all tags (inside too) in one go, but the speed is roughly the same.

Last edited by etc (2014-12-12 17:27:58)

Offline

#45 2014-12-12 19:54:16

ruud
Developer Emeritus
From: a galaxy far far away
Registered: 2006-06-04
Posts: 5,068
Website

Re: Making plugins first-class citizens

etc wrote #286474:

That’s what I thought. Then the new parser will be slower, indeed, not because of sha1, but array assignments.
But with the real processTags() the slowdown is of maybe 10%

The real processTags executes the corresponding taghandler function, which calls parse() when there is a $thing. I’m just skipping the taghandler function.

Great! I’m here if you need a pair of eyes. I’ve got a version that caches all tags (inside too) in one go, but the speed is roughly the same.

That’s exactly what I was testing yesterday. In my tests it’s up to 1.35 times faster (that’s without looping; 5.5 times with infinite looping) than the current TXP parser for complex tag structures [1] and a bit slower than the one you posted for simple situations. I haven’t tested this, but I wonder how this effects memory usage, because parsing all tags at once results in a hash table that contains a multiple of the original template size, due to tag nesting. On the other hand, most templates aren’t that big.

Have you run tests on an actual TXP install, compared to the original parser? How does this effect runtime in testing (or debug) mode?

[1] 5 levels of tag nesting with 5 self closed tags on the deepest level

Offline

#46 2014-12-12 22:14:31

etc
Developer
Registered: 2010-11-11
Posts: 5,123
Website GitHub

Re: Making plugins first-class citizens

ruud wrote #286479:

In my tests it’s up to 1.35 times faster (that’s without looping; 5.5 times with infinite looping) than the current TXP parser for complex tag structures

These are nice figures! Together with

parsing all tags at once results in a hash table that contains a multiple of the original template size, due to tag nesting.

it makes me think you have done it a different way than me. Mind posting the code?

Have you run tests on an actual TXP install, compared to the original parser? How does this effect runtime in testing (or debug) mode?

Yes, though I don’t remember the exact figures. There is no real change (~10%) in runtime/memory consumption for an “average” page. But trying to help in this case (long heavy loops) I have replaced <txp:tags /> with etc_query {tokens}, that are parsed only once. It has reduced the runtime (~3s) to ~1s, which was noticeable. I guess the new parse() would give the same result in “extreme” cases, though you can not beat <txp:php /> at these.

Offline

#47 2014-12-12 23:48:45

ruud
Developer Emeritus
From: a galaxy far far away
Registered: 2006-06-04
Posts: 5,068
Website

Re: Making plugins first-class citizens

Okay, here it goes. If you don’t use forms, the first parse() call puts everything you need in the hash table. The parseElse() function is equal to calling parse(EvalElse()).

One thing that may be interesting to try is to cache the $stack contents in a database. That way you don’t just benefit on loops, but also on repeated requests of the same page.

function parse($thing) {
  global $stack;

  $hash = sha1($thing);

  if(isset($stack[$hash])) {
    $tags[0] = $stack[$hash];
  } else {
    $tags[0] = array();
    $tag    = array(); 
    $level  = 0;
    $inside = array();
    $istag  = FALSE;  

    $f = '@(</?txp:\w+(?:\s+\w+\s*=\s*(?:"(?:[^"]|"")*"|\'(?:[^\']|\'\')*\'|[^\s\'"/>]+))*\s*/?'.chr(62).')@s';
    $t = '@:(\w+)(.*?)/?.$@s';

    $parsed = preg_split($f, $thing, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($parsed as $chunk) {
      if ($istag) {
        preg_match($t, $chunk, $tag[$level]);

        if (substr($chunk, -2, 1) === '/') {
          # self closed
          $tags[$level][] = array($tag[$level][1], $tag[$level][2], null);
          if ($level) $inside[$level] .= $chunk; 
        } elseif (substr($chunk, 1, 1) !== '/') {
          # opening
          if ($level) $inside[$level] .= $chunk;
          $level++;
          $inside[$level] = '';
          $tags[$level] = array();
        } else {
          # closing
          $sha = sha1($inside[$level]);
          $stack[$sha] = $tags[$level];
          $level--;
          $tags[$level][] = array($tag[$level][1], $tag[$level][2], $inside[$level+1]);
          if ($level) $inside[$level] .= $inside[$level+1] . $chunk;
        }
      } else {
        $tags[$level][] = $chunk;
        if ($level) $inside[$level] .= $chunk;
      }
      $istag = !$istag;
    }
    $stack[$hash] = $tags[0];
  }

  $out = '';
  foreach($tags[0] as $i => $tag) $out .= $i&1 ? processTags($tag[0], $tag[1], $tag[2]) : $tag;
  return $out;
}


function parseElse($thing, $condition)
{
  global $stack;

  if (strpos($thing, ':else') === false) {
    return $condition ? parse($thing) : '';
  }

  $tags = $stack[sha1($thing)];
  $nr   = 1;
  $tot  = count($tags);

  while ($nr < $tot and $tags[$nr][0] !== 'else') $nr += 2;

  if ($condition) {
    $out = $tags[0];
    $min = 1;
    $max = $nr - 1;
  } elseif ($nr < $tot) {
    $out = $tags[$nr + 1];
    $min = $nr + 2;
    $max = $tot;   
  } else {
    return '';
  }

  for ($i = $min; $i < $max; $i += 2) {
    $out .= processTags($tags[$i][0], $tags[$i][1], $tags[$i][2]) . $tags[$i + 1];
  }

  return $out;
} 

Benchmarks

$thing1 = 'some very long text that does not contain an else tag';
$thing2 = '<txp:tag />';
$thing3 = '<txp:if>something<txp:else></txp:if>';
$thing4 = '<txp:if><txp:tag>something<txp:else></txp:if>';
$thing5 = '<txp:if><txp:tag/><txp:tag/><txp:tag/><txp:tag/><txp:tag/><txp:else /><txp:tag/><txp:tag/><txp:tag/><txp:tag/><txp:tag/></txp:if>';
$thing6 = '<txp:if><txp:tag><txp:tag><txp:tag><txp:tag><txp:tag><txp:if><txp:else /></txp:if></txp:tag></txp:tag></txp:tag></txp:tag></txp:tag></txp:if>';
$thing7 = '<txp:if><txp:if><txp:if><txp:if><txp:if><txp:else></txp:if></txp:if></txp:if></txp:if></txp:if></txp:if>';
$thing8 = '<txp:tag><txp:tag><txp:tag><txp:tag><txp:tag></txp:tag></txp:tag></txp:tag></txp:tag></txp:tag>';
$thing9 = '<txp:tag><txp:tag><txp:tag><txp:tag><txp:tag><txp:x /><txp:x /><txp:x /><txp:x /><txp:x /></txp:tag></txp:tag></txp:tag></txp:tag></txp:tag>';
$thing10 = '<txp:tag /><txp:tag /><txp:tag /><txp:tag /><txp:tag /><txp:tag /><txp:tag /><txp:tag /><txp:tag /><txp:tag />';
--- 250000 runs
1	0.6190	0.5795	0.5208	1.3569	1.2031
2	1.7544	0.9276	0.8718	2.9528	2.7302
3	2.3013	0.5777	0.5196	5.2591	2.9899
4	2.7781	0.5764	0.5138	6.3647	3.4664
5	17.1107	3.2209	7.8146	18.0441	20.3693
6	41.0843	7.0340	13.1653	27.3075	49.0829
7	38.6543	4.9259	20.8092	21.5019	44.6545
8	18.0888	4.4227	4.1398	18.2603	23.1512
9	34.7269	6.7369	6.4449	25.9961	42.0318
10	10.7585	3.7477	3.7159	14.4655	14.1418

1	0.6322	0.5906	0.5217	1.3820	1.2133
2	1.7328	0.9324	0.8587	2.9548	2.7509
3	2.2970	0.5853	0.5220	5.2234	2.9572
4	2.7618	0.5822	0.5181	6.3478	3.4632
5	17.2101	3.2082	7.9638	18.0067	20.5029
6	14.1054	1.7007	7.9149	21.8310	15.9788
7	11.6276	1.5889	6.5863	17.9265	13.4327
8	18.0982	4.3862	4.1270	18.2606	23.1700
9	34.8114	6.7226	6.3946	25.9918	41.9020
10	10.6841	3.7376	3.7256	14.4354	14.2409

Column 1 = original parser
Column 2 = my parser + parseElse, infinite loop
Column 3 = etc parser + EvalElse, infinite loop
Column 4 = my parser + parseElse, no loop
Column 5 = etc parser + EvalElse, no loop

First set of results with condition true for if tags.
Second set of results with condition false for if tags.

These results are measured in seconds, but for a single parse you should divide them by 250000. So even slowest result takes only 0.2ms in reality for a single parse. Of course these are relatively short test strings, but it does put things in perspective.

Offline

#48 2014-12-13 00:35:13

ruud
Developer Emeritus
From: a galaxy far far away
Registered: 2006-06-04
Posts: 5,068
Website

Re: Making plugins first-class citizens

Couldn’t resist. Same benchmarks.
Column 1 is still the original parser.
Column 2 is my parser + parseElse, infinite loop
Column 3 is my parser + parseElse, no loop

The difference with the previous results is that this time I’m simulating what would happen if you unserialize a serialized $stack after fetching it from a database. This doesn’t include the overhead of the additionally required MySQL call, but does include the effect of the unserialize() call, which is why Column 2 is faster, because it only unserializes once there.

--- 250000 runs
	0.6307	0.5672	1.0113
	1.7773	0.9071	1.6515
	2.3299	0.5715	1.2029
	2.8387	0.5705	1.2139
	17.3982	3.2207	7.4642
	42.0089	7.0624	11.6038
	39.4484	5.1024	8.7357
	18.4378	4.4360	7.5659
	35.3087	6.8439	11.4250
	11.1749	3.7019	7.1695

	0.6406	0.5751	1.0067
	1.7908	0.9112	1.6654
	2.3452	0.5795	1.2157
	2.8173	0.5747	1.2043
	17.7839	3.2729	7.5011
	14.3136	1.7049	6.1402
	11.8892	1.5756	5.1973
	18.5314	4.4291	7.5564
	35.4523	6.7621	11.4024
	11.0429	3.7263	7.1274

I think adding such a caching mechanism would only require a few lines of code, but we need some real-world parser input to test if doing so makes sense.

--- 1000 runs (default TXP page template, excluding forms):
	0.5308	0.1109	0.2300

	0.8086	0.1530	0.2716
--- 1000 runs (this time with the template + forms for an individual article page combined):
	1.1444	0.2789	0.5225

	1.4097	0.2960	0.5448

In these examples, the difference is less than 1 millisecond. Which makes me wonder if this is worth doing at all (although it’s certainly fun trying).

Offline

Board footer

Powered by FluxBB