Textpattern CMS support forum
You are not logged in. Register | Login | Help
- Topics: Active | Unanswered
Re: Making plugins first-class citizens
ruud wrote #286340:
It’s not that common in the simple if/else constructs that this optimisation targets.
Ruud, I really appreciate your argumentation, and you are certainly right re EvalElse()
statistics. But look, I write about parse()
optimization too. About 95% of article bodies contain "link":http://...
(transformed into <a href="http://...">
by Textile), or some <image src="http://..." />
, and no <txp:tag />
. For these, optimizing parse()
would make <txp:body />
processing much faster.
Offline
Re: Making plugins first-class citizens
You’re right. Patching a patch:
- static $prefix = null;
+ static $istag, $prefix = null,
if ($prefix === null) {
- $prefix = get_pref('allow_short_tags', '1') ? '[a-z]{3}' : 'txp';
+ if (get_pref('allow_short_tags', '1')) {
+ $prefix = '[a-z]{3}';
+ $istag = ':';
+ } else {
+ $prefix = 'txp';
+ $istag = '<txp:';
+ }
}
+
+ if (strpos($thing, $istag) === false) return $thing;
Adding a parse attribute <txp:body parse="0" />
is another way to solve this, although you lose part of the speed increase by having to parse the attribute ;)
Offline
Re: Making plugins first-class citizens
Great! I would only make short tags opt-in, since it’s a new feature: if (get_pref('allow_short_tags', '0')) ...
ruud wrote #286383:
Adding a parse attribute
<txp:body parse="0" />
is another way to solve this, although you lose part of the speed increase by having to parse the attribute ;)
No, we don’t want to lose a dime. Let’s win even more: why re-parse what is already parsed? When you process, say
<txp:article limit="9999">
<txp:title />
</txp:article>
you call preg_split(... '<txp:title />' ...)
9999 times. Why not to do it once, and cache the result? How could this $thing
inside <txp:article />
change anyway?
My current parse()
running here looks like this (sorry for flooding):
function parse($thing='') {
static $stack = array();
if(strpos($thing, '<txp:') === false) return $thing;
// don't parse $thing twice!!!
$hash = sha1($thing);
if(isset($stack[$hash])) $tags = $stack[$hash];
else
{
$tags = array();
$level = 0;
$inside = '';
$istag = FALSE;
$f = '@(</?txp:\w+(?:\s+\w+\s*=\s*(?:"(?:[^"]|"")*"|\'(?:[^\']|\'\')*\'|[^\s\'"/>]+))*\s*/?'.chr(62).')@s';
$t = '@:(\w+)(.*?)/?.$@s';
$parsed = preg_split($f, $thing, -1, PREG_SPLIT_DELIM_CAPTURE);
foreach ($parsed as $chunk)
{
if ($istag)
{
if ($level === 0)
{
preg_match($t, $chunk, $tag);
if (substr($chunk, -2, 1) === '/')
$tags[] = array('tag' => $tag[1], 'atts' => $tag[2], 'thing' => null);
else
{ # opening
$level++;
}
}
else
{
if (substr($chunk, 1, 1) === '/')
{ # closing
if (--$level === 0)
{
$tags[] = array('tag' => $tag[1], 'atts' => $tag[2], 'thing' => $inside);
$inside = '';
}
else
{
$inside .= $chunk;
}
}
elseif (substr($chunk, -2, 1) !== '/')
{ # opening inside open
++$level;
$inside .= $chunk;
}
else
{
$inside .= $chunk;
}
}
}
else
{
if ($level) $inside .= $chunk;
else $tags[] = $chunk;
}
$istag = !$istag;
}
$stack[$hash] = $tags;
}
$out = '';
foreach($tags as $i => $tag) $out .= $i&1 ? processTags($tag['tag'], $tag['atts'], $tag['thing']) : $tag;
return $out;
}
Same idea for EvalElse()
and splat()
. There is a risk of SHA1 collision, but none (unlike MD5) is known for the moment. The speedup on long loop structures is measurable, though preg_split()
is not the slowest part of it. No particular memory consumption detected atm.
What you devs think of it?
Offline
Re: Making plugins first-class citizens
Some benchmarks, each row with a different $thing string:
$thing = 'some very long text that does not contain an else tag';
$thing = '<txp:tag />';
$thing = '<txp:tag><txp:tag><txp:tag><txp:tag><txp:tag></txp:tag></txp:tag></txp:tag></txp:tag></txp:tag>';
$thing = '<txp:tag /><txp:tag /><txp:tag /><txp:tag /><txp:tag /><txp:tag /><txp:tag /><txp:tag /><txp:tag /><txp:tag />';
0.6327 0.5520 0.6313 0.9997 1.1809
2.2664 1.3447 1.6456 2.8423 3.8831
5.2651 0.9088 1.4774 3.7682 6.3109
16.1752 8.4240 9.6645 14.2784 25.8554
column1: current parser (250.000 times)
column2: etc parser (250.000 loop)
column3: etc parser (10 loop, 25.000 times)
column4: etc parser (2 loop, 125.000 times)
column5: etc parser (no loop, 250.000 times)
I should add that I removed the first optimisation from the etc parser to be able to test just the effect of sha1 hashing.
Summarised. You gain quite a bit in loops, but you also lose a lot in non looping series of tags.
Offline
Re: Making plugins first-class citizens
ruud wrote #286407:
Summarised. You gain quite a bit in loops, but you also lose a lot in non looping series of tags.
Woah, there’s some great stuff in the etc parser, but that last case is a shame.
Open question then: is it possible to figure out through testing/benchmarking, the threshold at which the etc parser begins to degrade? Perhaps that would allow us to make an informed decision up-front via a simple test (even a tag count) on whether it’s beneficial to use the etc parser or the ruud-enhanced parser? Or a bit of both?
Just throwing it out there. I’ve no idea if that’s even feasible, but I’ve been looking at phpGolf today and I’m inspired by the lengths to which people will go, in order to find the most optimised (albeit in these cases unreadable) solution to a problem!
The smd plugin menagerie — for when you need one more gribble of power from Textpattern. Bleeding-edge code available on GitHub.
Txp Builders – finely-crafted code, design and Txp
Online
Re: Making plugins first-class citizens
ruud wrote #286407:
you also lose a lot in non looping series of tags.
Thanks for benchmarking! That’s not exactly what I get, but it could be server-dependent: for a string containing 1000 <txp:tag />
(more is very unlikely) parsed once I get ~0.012s for both. The most representative imho is column 3. I should add that processTags()
is slightly optimized too, by replacing $out = $tag(splat($atts), $thing);
inside if (maybe_tag($tag))
with
$out = $tag(ltrim($atts) === '' ? array() : splat($atts), $thing);
though this optimization shouldn’t intervene here, <txp:tag />
being fake.
Offline
Re: Making plugins first-class citizens
etc wrote #286411:
Thanks for benchmarking! That’s not exactly what I get,
I’ve replaced processTags with a function that either calls parse() if there is a $thing or otherwise just returns. My aim was to test only the parse() function for this benchmark. I’m not sure why that last $thing test string is so slow, because the only difference should be the one sha1() call. I’ll try to find out tonight.
Bloke wrote #286408:
Open question then: is it possible to figure out through testing/benchmarking, the threshold at which the etc parser begins to degrade?
So far the threshold seems to be: are we looping yes or no? If there is a cheap test to see if we’re in a loop, that could work. @etc has given me some inspiration for further optimisation ;)
Offline
Re: Making plugins first-class citizens
ruud wrote #286422:
I’ve replaced processTags with a function that either calls parse() if there is a $thing or otherwise just returns.
That’s what I thought. Then the new parser will be slower, indeed, not because of sha1, but array assignments. But with the real processTags()
the slowdown is of maybe 10%, and we gain ~40% in 10-loops (typical <txp:container />
limit).
@etc has given me some inspiration for further optimisation ;)
Great! I’m here if you need a pair of eyes. I’ve got a version that caches all tags (inside too) in one go, but the speed is roughly the same.
Last edited by etc (2014-12-12 17:27:58)
Offline
Re: Making plugins first-class citizens
etc wrote #286474:
That’s what I thought. Then the new parser will be slower, indeed, not because of sha1, but array assignments.
But with the realprocessTags()
the slowdown is of maybe 10%
The real processTags executes the corresponding taghandler function, which calls parse() when there is a $thing. I’m just skipping the taghandler function.
Great! I’m here if you need a pair of eyes. I’ve got a version that caches all tags (inside too) in one go, but the speed is roughly the same.
That’s exactly what I was testing yesterday. In my tests it’s up to 1.35 times faster (that’s without looping; 5.5 times with infinite looping) than the current TXP parser for complex tag structures [1] and a bit slower than the one you posted for simple situations. I haven’t tested this, but I wonder how this effects memory usage, because parsing all tags at once results in a hash table that contains a multiple of the original template size, due to tag nesting. On the other hand, most templates aren’t that big.
Have you run tests on an actual TXP install, compared to the original parser? How does this effect runtime in testing (or debug) mode?
[1] 5 levels of tag nesting with 5 self closed tags on the deepest level
Offline
Re: Making plugins first-class citizens
ruud wrote #286479:
In my tests it’s up to 1.35 times faster (that’s without looping; 5.5 times with infinite looping) than the current TXP parser for complex tag structures
These are nice figures! Together with
parsing all tags at once results in a hash table that contains a multiple of the original template size, due to tag nesting.
it makes me think you have done it a different way than me. Mind posting the code?
Have you run tests on an actual TXP install, compared to the original parser? How does this effect runtime in testing (or debug) mode?
Yes, though I don’t remember the exact figures. There is no real change (~10%) in runtime/memory consumption for an “average” page. But trying to help in this case (long heavy loops) I have replaced <txp:tags />
with etc_query {tokens}
, that are parsed only once. It has reduced the runtime (~3s) to ~1s, which was noticeable. I guess the new parse()
would give the same result in “extreme” cases, though you can not beat <txp:php />
at these.
Offline
Re: Making plugins first-class citizens
Okay, here it goes. If you don’t use forms, the first parse()
call puts everything you need in the hash table. The parseElse()
function is equal to calling parse(EvalElse())
.
One thing that may be interesting to try is to cache the $stack contents in a database. That way you don’t just benefit on loops, but also on repeated requests of the same page.
function parse($thing) {
global $stack;
$hash = sha1($thing);
if(isset($stack[$hash])) {
$tags[0] = $stack[$hash];
} else {
$tags[0] = array();
$tag = array();
$level = 0;
$inside = array();
$istag = FALSE;
$f = '@(</?txp:\w+(?:\s+\w+\s*=\s*(?:"(?:[^"]|"")*"|\'(?:[^\']|\'\')*\'|[^\s\'"/>]+))*\s*/?'.chr(62).')@s';
$t = '@:(\w+)(.*?)/?.$@s';
$parsed = preg_split($f, $thing, -1, PREG_SPLIT_DELIM_CAPTURE);
foreach ($parsed as $chunk) {
if ($istag) {
preg_match($t, $chunk, $tag[$level]);
if (substr($chunk, -2, 1) === '/') {
# self closed
$tags[$level][] = array($tag[$level][1], $tag[$level][2], null);
if ($level) $inside[$level] .= $chunk;
} elseif (substr($chunk, 1, 1) !== '/') {
# opening
if ($level) $inside[$level] .= $chunk;
$level++;
$inside[$level] = '';
$tags[$level] = array();
} else {
# closing
$sha = sha1($inside[$level]);
$stack[$sha] = $tags[$level];
$level--;
$tags[$level][] = array($tag[$level][1], $tag[$level][2], $inside[$level+1]);
if ($level) $inside[$level] .= $inside[$level+1] . $chunk;
}
} else {
$tags[$level][] = $chunk;
if ($level) $inside[$level] .= $chunk;
}
$istag = !$istag;
}
$stack[$hash] = $tags[0];
}
$out = '';
foreach($tags[0] as $i => $tag) $out .= $i&1 ? processTags($tag[0], $tag[1], $tag[2]) : $tag;
return $out;
}
function parseElse($thing, $condition)
{
global $stack;
if (strpos($thing, ':else') === false) {
return $condition ? parse($thing) : '';
}
$tags = $stack[sha1($thing)];
$nr = 1;
$tot = count($tags);
while ($nr < $tot and $tags[$nr][0] !== 'else') $nr += 2;
if ($condition) {
$out = $tags[0];
$min = 1;
$max = $nr - 1;
} elseif ($nr < $tot) {
$out = $tags[$nr + 1];
$min = $nr + 2;
$max = $tot;
} else {
return '';
}
for ($i = $min; $i < $max; $i += 2) {
$out .= processTags($tags[$i][0], $tags[$i][1], $tags[$i][2]) . $tags[$i + 1];
}
return $out;
}
Benchmarks
$thing1 = 'some very long text that does not contain an else tag';
$thing2 = '<txp:tag />';
$thing3 = '<txp:if>something<txp:else></txp:if>';
$thing4 = '<txp:if><txp:tag>something<txp:else></txp:if>';
$thing5 = '<txp:if><txp:tag/><txp:tag/><txp:tag/><txp:tag/><txp:tag/><txp:else /><txp:tag/><txp:tag/><txp:tag/><txp:tag/><txp:tag/></txp:if>';
$thing6 = '<txp:if><txp:tag><txp:tag><txp:tag><txp:tag><txp:tag><txp:if><txp:else /></txp:if></txp:tag></txp:tag></txp:tag></txp:tag></txp:tag></txp:if>';
$thing7 = '<txp:if><txp:if><txp:if><txp:if><txp:if><txp:else></txp:if></txp:if></txp:if></txp:if></txp:if></txp:if>';
$thing8 = '<txp:tag><txp:tag><txp:tag><txp:tag><txp:tag></txp:tag></txp:tag></txp:tag></txp:tag></txp:tag>';
$thing9 = '<txp:tag><txp:tag><txp:tag><txp:tag><txp:tag><txp:x /><txp:x /><txp:x /><txp:x /><txp:x /></txp:tag></txp:tag></txp:tag></txp:tag></txp:tag>';
$thing10 = '<txp:tag /><txp:tag /><txp:tag /><txp:tag /><txp:tag /><txp:tag /><txp:tag /><txp:tag /><txp:tag /><txp:tag />';
--- 250000 runs
1 0.6190 0.5795 0.5208 1.3569 1.2031
2 1.7544 0.9276 0.8718 2.9528 2.7302
3 2.3013 0.5777 0.5196 5.2591 2.9899
4 2.7781 0.5764 0.5138 6.3647 3.4664
5 17.1107 3.2209 7.8146 18.0441 20.3693
6 41.0843 7.0340 13.1653 27.3075 49.0829
7 38.6543 4.9259 20.8092 21.5019 44.6545
8 18.0888 4.4227 4.1398 18.2603 23.1512
9 34.7269 6.7369 6.4449 25.9961 42.0318
10 10.7585 3.7477 3.7159 14.4655 14.1418
1 0.6322 0.5906 0.5217 1.3820 1.2133
2 1.7328 0.9324 0.8587 2.9548 2.7509
3 2.2970 0.5853 0.5220 5.2234 2.9572
4 2.7618 0.5822 0.5181 6.3478 3.4632
5 17.2101 3.2082 7.9638 18.0067 20.5029
6 14.1054 1.7007 7.9149 21.8310 15.9788
7 11.6276 1.5889 6.5863 17.9265 13.4327
8 18.0982 4.3862 4.1270 18.2606 23.1700
9 34.8114 6.7226 6.3946 25.9918 41.9020
10 10.6841 3.7376 3.7256 14.4354 14.2409
Column 1 = original parser
Column 2 = my parser + parseElse, infinite loop
Column 3 = etc parser + EvalElse, infinite loop
Column 4 = my parser + parseElse, no loop
Column 5 = etc parser + EvalElse, no loop
First set of results with condition true for if tags.
Second set of results with condition false for if tags.
These results are measured in seconds, but for a single parse you should divide them by 250000. So even slowest result takes only 0.2ms in reality for a single parse. Of course these are relatively short test strings, but it does put things in perspective.
Offline
Re: Making plugins first-class citizens
Couldn’t resist. Same benchmarks.
Column 1 is still the original parser.
Column 2 is my parser + parseElse, infinite loop
Column 3 is my parser + parseElse, no loop
The difference with the previous results is that this time I’m simulating what would happen if you unserialize a serialized $stack
after fetching it from a database. This doesn’t include the overhead of the additionally required MySQL call, but does include the effect of the unserialize()
call, which is why Column 2 is faster, because it only unserializes once there.
--- 250000 runs
0.6307 0.5672 1.0113
1.7773 0.9071 1.6515
2.3299 0.5715 1.2029
2.8387 0.5705 1.2139
17.3982 3.2207 7.4642
42.0089 7.0624 11.6038
39.4484 5.1024 8.7357
18.4378 4.4360 7.5659
35.3087 6.8439 11.4250
11.1749 3.7019 7.1695
0.6406 0.5751 1.0067
1.7908 0.9112 1.6654
2.3452 0.5795 1.2157
2.8173 0.5747 1.2043
17.7839 3.2729 7.5011
14.3136 1.7049 6.1402
11.8892 1.5756 5.1973
18.5314 4.4291 7.5564
35.4523 6.7621 11.4024
11.0429 3.7263 7.1274
I think adding such a caching mechanism would only require a few lines of code, but we need some real-world parser input to test if doing so makes sense.
--- 1000 runs (default TXP page template, excluding forms):
0.5308 0.1109 0.2300
0.8086 0.1530 0.2716
--- 1000 runs (this time with the template + forms for an individual article page combined):
1.1444 0.2789 0.5225
1.4097 0.2960 0.5448
In these examples, the difference is less than 1 millisecond. Which makes me wonder if this is worth doing at all (although it’s certainly fun trying).
Offline