Textpattern CMS support forum
#46 2004-11-02 04:59:17
- zem
- Developer Emeritus

- From: Melbourne, Australia
- Registered: 2004-04-08
- Posts: 2,579
Re: Plugin idea: Word Count?
> return preg_match_all('/\b\w+\b/', $words, $m);
> That’s not super accurate for all cases either. It gets pretty confused by HTML elements and counts things like doesn’t as 3 words. The explode method doesn’t have that problem, but you are right, it does have issues with not counting “foo,bar” and counting 3 spaces as 2 words.
strip_tags() was already done. html_entity_decode() is also probably a requirement to get it right, whatever method you use. Which means PHP 4.3.0 is required, which means you might as well use str_word_count().
> $result = preg_split("/[\s,]+/", $clean);
That’ll count “ . . . . . “ (space dot space dot) as words.
There’s more to it than just commas. Read the regexp manual on ‘\b’. Someone’s already solved this problem before.
> If you have preceding or trailing white space, it will be given its own element in $result by preg_split. So I had to trim() that off.
preg_match_all achieves the same thing as split/count, without passing arrays around as much, and without zero count problems.
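A rough sketch of the zero-count point, for anyone following along. It assumes $clean has already been through strip_tags(); the function names count_by_match() and count_by_split() are made up for illustration and are not the plugin's code.

```php
<?php
// Hypothetical helpers for comparison only.

// Count words by matching runs of word characters.
function count_by_match($clean) {
    // preg_match_all() returns the number of full pattern matches.
    return preg_match_all('/\b\w+\b/', $clean, $m);
}

// Count words by splitting on whitespace and commas.
function count_by_split($clean) {
    return count(preg_split('/[\s,]+/', trim($clean)));
}

echo count_by_match(''), "\n";         // 0
echo count_by_split(''), "\n";         // 1: preg_split() returns one empty element
echo count_by_match(' . . . '), "\n";  // 0
echo count_by_split(' . . . '), "\n";  // 3: each dot is counted as a word
```

The split version needs a special case for empty input and still counts stray punctuation; the match version needs neither.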
Alex
#47 2004-11-02 05:01:12
- zem
- Developer Emeritus

- From: Melbourne, Australia
- Registered: 2004-04-08
- Posts: 2,579
Re: Plugin idea: Word Count?
> I was thinking of writing some sort of state machine that would count a word as anything separated by white space.
/\b\w+\b/
Don’t need a state machine when there’s only one state and one transition.
Alex
Re: Plugin idea: Word Count?
> zem wrote:
> > return preg_match_all('/\b\w+\b/', $words, $m);
> > That’s not super accurate for all cases either. It gets pretty confused by HTML elements
> > and counts things like doesn’t as 3 words. The explode method doesn’t have that
> > problem, but you are right, it does have issues with not counting “foo,bar” and counting
> > 3 spaces as 2 words.
> strip_tags() was already done. html_entity_decode() is also probably a requirement to get
> it right, whatever method you use. Which means PHP 4.3.0 is required, which means you
> might as well use str_word_count().
I’m confused, where was strip_tags() already done?
$thisarticle['body'] contains the HTML for the article. So strip_tags() needs to be done to clean that up.
The problem with html_entity_decode() is that it doesn’t do anything with HTML character entities like &#8217;, so they still end up getting counted as multiple words.
See this code:
<pre>
<?php
// Word containing a numeric character entity for the apostrophe.
$a = "haven&#8217;t";
echo html_entity_decode($a);
echo "\n";
?>
</pre>
It looks like the Textile code doesn’t use htmlentities(); it has its own bits to handle these cases.
> > $result = preg_split("/[\s,]+/", $clean);
> That’ll count “ . . . . . “ (space dot space dot) as words.
Ahh, good point. And str_word_count() or preg_match_all("/\b\w+\b/", $string, $m) correctly handle that case. But they don’t correctly handle any of the HTML entities that are neither stripped by strip_tags() nor decoded by html_entity_decode(). And those problems come up in nearly every article on my site. I would rarely have something like “. . . . .”
> There’s more to it than just commas. Read the regexp manual on ‘\b’. Someone’s
> already solved this problem before.
If simply ‘\b’ worked I would use it. It’s the other html character entities that mess up the works. We could have more regex to strip them or miss a few edge cases.
> > If you have preceding or trailing white space, it will be given its own element in $result
> > by preg_split. So I had to trim() that off.
> preg_match_all achieves the same thing as split/count, without passing arrays around as
> much, and without zero count problems.
Except it counts haven’t as 3 words.
Try running this code to see for yourself.
<pre>
<?php
// Prints the number of matches for the entity-encoded word.
echo preg_match_all('/\b\w+\b/', "haven&#8217;t", $m);
echo "\n";
?>
</pre>
If you have any ideas about how to strip out those other HTML entities without a huge mess of regex, I would love to hear them.
Or perhaps there is yet another PHP function that I’m not familiar with.
Another idea I’ve thought about is seeing whether there is access to the Textile source and whether that’s any easier to count. Could always get at it with SQL, but that seems even uglier.
Last edited by kelp (2004-11-02 06:36:26)
Re: Plugin idea: Word Count?
Wait wait wait. All this time there has been a php function called str_word_count?
Re: Plugin idea: Word Count?
Yep :)
And it might do a pretty good job if the input is run through strip_tags() and then everything that looks like an HTML entity is stripped out with something like preg_replace().
I was just too lazy, and not clued-up enough on regex, to strip out all the HTML entities.
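A minimal sketch of that idea, in case it helps. The helper name rough_word_count() and the entity pattern are assumptions for illustration, not anything from the plugin.

```php
<?php
// Hypothetical helper; the name and the entity pattern are assumptions.
function rough_word_count($html) {
    // Remove markup first so tag names are not counted.
    $text = strip_tags($html);
    // Drop anything shaped like an entity: &amp; &#8217; &#x2019;
    $text = preg_replace('/&(#\d+|#x[0-9a-f]+|[a-z][a-z0-9]*);/i', '', $text);
    // str_word_count() treats letters, apostrophes and hyphens as word characters.
    return str_word_count($text);
}

echo rough_word_count('<p>I haven&#8217;t seen <em>foo</em>, bar.</p>'), "\n"; // 5
```

Deleting the entity outright turns haven&#8217;t into one word, which is the goal here, but it would also glue together two words joined only by something like &nbsp;. Replacing whitespace-like entities with a space first would be one way around that.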
#51 2004-11-05 03:19:14
- Andrew
- Plugin Author

- Registered: 2004-02-23
- Posts: 730
Re: Plugin idea: Word Count?
You know, I had noticed that yesterday as well while browsing string functions, but had just assumed there was a logical reason everyone had passed it up. D.Oh.
Re: Plugin idea: Word Count?
Oooer! Who knew I could cause so much trouble!
I’ve noticed that I tend not to use many tags, so ramanan’s plugin is pretty accurate. It’s only been a few words out for each post so far (I checked in a couple of regular word processors).
And thanks again for writing it, ramanan. It’s really come in handy.
Carla
I Elaborate
http://www.ielaborate.info
Re: Plugin idea: Word Count?
> ramanan wrote:
> Wait wait wait. All this time there has been a php function called str_word_count?
hehe, that function is mentioned in both the 3rd and 4th posts in this thread :)
#54 2006-08-02 02:29:05
- datumax
- Member
- Registered: 2005-01-12
- Posts: 16
Re: Plugin idea: Word Count?
Sweet! But has anyone thought to use the output of this plugin to dictate behavior, or is that even possible? Like…
If wordcount >= 200
then truncate post
Can we build around this?
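Something along these lines might be a starting point. It is only a sketch: truncate_words() is a hypothetical helper, not part of the plugin, and it assumes plain text that has already had its tags stripped, since cutting HTML mid-element would break the markup.

```php
<?php
// Hypothetical helper: keep the first $limit whitespace-separated words.
function truncate_words($text, $limit = 200) {
    $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    // Short enough already: return the text untouched.
    if (count($words) <= $limit) {
        return $text;
    }
    // Otherwise keep the first $limit words and mark the cut.
    return implode(' ', array_slice($words, 0, $limit)) . '...';
}

echo truncate_words('one two three four five', 3), "\n"; // one two three...
```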