Textpattern CMS support forum
Re: Plugin idea: Word Count?
I was thinking of making a little hack to update stats as you go. Attaching a word count to the textpattern table, and then updating this value on insert and save. That would involve hacking the install though, which I think isn’t the best way to go. Compooter, I think you would return an array, so you would need to iterate through it and do that count. I’m not saying it wouldn’t work, I just wonder how slow it might be.
Re: Plugin idea: Word Count?
I’ve hacked on rsx_word_count code a bit to make it a little more accurate.
Here are my results:
<pre>function rsx_word_count($atts) {
    global $thisarticle;
    if ( ! isset($thisarticle['body'])) {
        return 0;
    } else {
        $words = strip_tags($thisarticle['body']);
        if (preg_match('/\S/', $words) == 0) {
            return 0;
        } else {
            return count(explode(" ", $words));
        }
    }
}</pre>
Probably some faster ways to do that, but I’m very new to PHP.
It will now properly count articles with no text, like those with only a link to an image. It will also not count html tags as words.
Hopefully this is useful to some people.
Last edited by kelp (2004-11-01 07:17:06)
#39 2004-11-01 02:54:18
- zem
- Developer Emeritus
- From: Melbourne, Australia
- Registered: 2004-04-08
- Posts: 2,579
Re: Plugin idea: Word Count?
> return count(explode(" ", $words));
That’s not very accurate. It’ll count three spaces as two words, and “foo,bar” as one word.
Try this:
return preg_match_all('/\b\w+\b/', $words, $m);
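To make the difference concrete, here is a minimal sketch comparing the two approaches. The helper names and sample strings are made up for illustration and are not part of rsx_word_count.

```php
<?php
// Hypothetical helpers for comparison only.

// Splitting on a single space: every extra space produces an empty
// element, and each empty element is counted as a word.
function count_by_explode($text) {
    return count(explode(" ", $text));
}

// Counting runs of word characters: the number or kind of separators
// between words does not matter.
function count_by_regex($text) {
    return preg_match_all('/\b\w+\b/', $text, $m);
}

echo count_by_explode("one   two"), "\n"; // 4 (two real words plus two empty strings)
echo count_by_regex("one   two"), "\n";   // 2
echo count_by_explode("foo,bar"), "\n";   // 1
echo count_by_regex("foo,bar"), "\n";     // 2
```

The input here is assumed to be plain text that has already been through strip_tags().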
Alex
Re: Plugin idea: Word Count?
Cool. Thanks for the suggestions. I think I’ll take this week to go through and update my plugins.
Re: Plugin idea: Word Count?
> zem wrote:
>
> > return count(explode(" ", $words));
>
> That’s not very accurate. It’ll count three spaces as two words, and “foo,bar” as one word.
>
> Try this:
>
> return preg_match_all('/\b\w+\b/', $words, $m);
That’s not super accurate for all cases either. It gets pretty confused by html elements and counts things like doesn&#8217;t as 3 words. The explode method doesn’t have that problem, but you are right, it does miscount “foo,bar” as one word and 3 spaces as 2 words.
So I’ve played around with things quite a bit and finally got to this which works pretty well for me:
<pre>
function rsx_word_count($atts) {
    global $thisarticle;
    if ( ! isset($thisarticle['body'])) {
        return 0;
    } else {
        $words = strip_tags($thisarticle['body']);
        $clean = trim($words);
        if (preg_match('/\S/', $clean) == 0) {
            return 0;
        } else {
            $result = preg_split("/[\s,]+/", $clean);
            return count($result);
        }
    }
}
</pre>
I have to check whether $clean contains nothing but white space because, even if body contains only html tags that we then strip, we still get one element in $result. I would rather return zero, since there are no real words.
If you have preceding or trailing white space, it will be given its own element in $result by preg_split. So I had to trim() that off.
Then we just split on white space or “,” and all is good.
I also looked at using str_word_count, which would have been the obvious choice, but it also has issues with doesn&#8217;t, counting it as 2 words.
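As a possible simplification, preg_split() can discard the empty pieces itself via its PREG_SPLIT_NO_EMPTY flag, which would make both the trim() and the separate white-space check unnecessary. A minimal sketch, assuming the text is passed in directly rather than read from $thisarticle (the function name is hypothetical):

```php
<?php
// Hypothetical helper, not the plugin itself: the same split on white
// space or commas, with empty pieces dropped by preg_split().
function split_word_count($html) {
    $text = strip_tags($html);
    $pieces = preg_split('/[\s,]+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    return count($pieces);
}

echo split_word_count("  foo, bar  "), "\n";          // 2
echo split_word_count('<img src="x.png" />'), "\n";   // 0
echo split_word_count(" . . . "), "\n";               // 3 (lone dots still count)
```

The last line shows that this only tidies up the empty elements; anything that is not white space or a comma is still treated as a word.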
Last edited by kelp (2004-11-01 07:33:28)
Re: Plugin idea: Word Count?
I was thinking of writing some sort of state machine that would count a word as anything separated by white space. I might look into how the unix command wc does things.
Re: Plugin idea: Word Count?
> ramanan wrote:
> I was thinking of writing some sort of state machine that would count a word as anything separated by white space. I might look into how the unix command wc does things.
from wc(1):
<pre>
a word is defined as a string of characters delimited by white space characters.
White space characters are the set of characters for which the isspace(3) function
returns true.
</pre>
And from isspace(3):
<pre>
The isspace() function tests for the standard white-space characters….
In the ASCII character set, this includes the following characters
(with their numeric values shown in octal):

    011 ht    012 nl    013 vt    014 np    015 cr    040 sp
</pre>
So wc considers tab, newline, vertical tab, formfeed, carriage return, and space all as word boundary characters.
My code actually does exactly what wc does, except that it also counts “foo,bar” as 2 words.
Although I’m sure there are much faster ways to do this. And maybe better.
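For reference, the wc definition quoted above can be sketched in one pattern: count every run of characters that are not in the isspace() set. The function name is hypothetical, and the character class simply spells out the six ASCII values from the man page.

```php
<?php
// Hypothetical helper mirroring the wc(1) definition of a word.
function wc_word_count($text) {
    // \x09-\x0D covers ht, nl, vt, np (form feed) and cr; \x20 is space.
    // A word is any run of characters outside that set.
    return preg_match_all('/[^\x09-\x0D\x20]+/', $text, $m);
}

echo wc_word_count("one\ttwo\nthree  four"), "\n"; // 4
echo wc_word_count("foo,bar"), "\n";               // 1 (wc does not split on commas)
echo wc_word_count(" . . . "), "\n";               // 3
echo wc_word_count(""), "\n";                      // 0
```

Because the count comes from matches rather than from split pieces, leading or trailing white space and empty input need no special handling.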
Last edited by kelp (2004-11-02 03:53:26)
Re: Plugin idea: Word Count?
If you want, you can just post your version of the word count plugin here. There is no reason I need to be the authority on such a thing. I think I’ll probably try to write up something that counts words similar to the way wc does, but for the time being, other people may want to use your version of this plugin. Though I guess they can copy and paste in your code just as easily.
Last edited by ramanan (2004-11-02 04:26:43)
Re: Plugin idea: Word Count?
> ramanan wrote:
> If you want, you can just post your version of the word count plugin here. There is no reason I need to be the authority on such a thing. I think I’ll probably try to write up something that counts words similar to the way wc does, but for the time being, other people may want to use your version of this plugin. Though I guess they can copy and paste in your code just as easily.
Sure :)
I would like to see your new version too!
#46 2004-11-02 04:59:17
- zem
- Developer Emeritus
- From: Melbourne, Australia
- Registered: 2004-04-08
- Posts: 2,579
Re: Plugin idea: Word Count?
> return preg_match_all('/\b\w+\b/', $words, $m);
> That’s not super accurate for all cases either. It gets pretty confused by html elements and counts things like doesn&#8217;t as 3 words. The explode method doesn’t have that problem, but you are right, it does have issues with not counting “foo,bar” and counting 3 spaces as 2 words.
strip_tags() was already done. html_entity_decode() is also probably a requirement to get it right, whatever method you use. Which means PHP 4.3.0 is required, which means you might as well use str_word_count().
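A rough sketch of that pipeline follows. It assumes UTF-8 input and a PHP version whose html_entity_decode() decodes numeric entities when given a UTF-8 charset. The function name is hypothetical, and the apostrophe normalisation step is an addition of this sketch, not something stated above.

```php
<?php
// Hypothetical helper: strip tags, decode entities, then count.
function decoded_word_count($html) {
    $text = strip_tags($html);
    // Decode named and numeric entities; &#8217; becomes U+2019.
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    // str_word_count() allows the ASCII apostrophe inside a word, so map
    // the typographic one (bytes E2 80 99 in UTF-8) onto it first.
    $text = str_replace("\xE2\x80\x99", "'", $text);
    return str_word_count($text);
}

echo decoded_word_count('<p>I haven&#8217;t seen it.</p>'), "\n"; // 4
echo decoded_word_count('foo,bar'), "\n";                         // 2
```

str_word_count() is locale dependent and works on bytes, so text containing other non-ASCII characters may be counted differently from what this sketch suggests.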
> $result = preg_split("/[\s,]+/", $clean);
That’ll count “ . . . . . “ (space dot space dot) as words.
There’s more to it than just commas. Read the regexp manual on ‘\b’. Someone’s already solved this problem before.
> If you have preceding or trailing white space, it will be given its own element in $result by preg_split. So I had to trim() that off.
preg_match_all achieves the same thing as split/count, without passing arrays around as much, and without zero count problems.
Alex
#47 2004-11-02 05:01:12
- zem
- Developer Emeritus
- From: Melbourne, Australia
- Registered: 2004-04-08
- Posts: 2,579
Re: Plugin idea: Word Count?
> I was thinking of writing some sort of state machine that would count a word as anything separated by white space.
/\b\w+\b/
Don’t need a state machine when there’s only one state and one transition.
Alex
Re: Plugin idea: Word Count?
> zem wrote:
> > return preg_match_all('/\b\w+\b/', $words, $m);
> > That’s not super accurate for all cases either. It gets pretty confused by html elements
> > and counts things like doesn&#8217;t as 3 words. The explode method doesn’t have that
> > problem, but you are right, it does have issues with not counting “foo,bar” and counting
> > 3 spaces as 2 words.
> strip_tags() was already done. html_entity_decode() is also probably a requirement to get
> it right, whatever method you use. Which means PHP 4.3.0 is required, which means you
> might as well use str_word_count().
I’m confused, where was strip_tags() already done?
$thisarticle[‘body’] contains the html for the article. So strip_tags() needs to be done to clean that up.
The problem with html_entity_decode() is that it doesn’t do anything with html character entities like &#8217; so they still end up getting counted as multiple words.
See this code:
<pre>
<?php
$a = "haven&#8217;t";
echo html_entity_decode($a);
echo "\n";
?>
</pre>
It looks like the textile code doesn’t use htmlentities(); it has its own bits to handle these cases.
> > $result = preg_split("/[\s,]+/", $clean);
> That’ll count “ . . . . . “ (space dot space dot) as words.
Ahh, good point. And str_word_count() or preg_match_all("/\b\w+\b/", $string, $m) correctly handle that case. But they don’t correctly handle any of the html entities not stripped by strip_tags() or decoded by html_entity_decode(). And those problems come up in nearly every article on my site. I would rarely have something like “. . . . .”
> There’s more to it than just commas. Read the regexp manual on ‘\b’. Someone’s
> already solved this problem before.
If simply ‘\b’ worked I would use it. It’s the other html character entities that mess up the works. We could have more regex to strip them or miss a few edge cases.
> > If you have preceding or trailing white space, it will be given its own element in $result
> > by preg_split. So I had to trim() that off.
> preg_match_all achieves the same thing as split/count, without passing arrays around as
> much, and without zero count problems.
Except it counts haven&#8217;t as 3 words.
Try running this code to see for yourself.
<pre>
<?php
echo preg_match_all("/\b\w+\b/", "haven&#8217;t", $m);
echo "\n";
?>
</pre>
If you have any ideas about how to strip out those other html entities without a huge mess of regex, I would love to hear them.
Or perhaps there is yet another php function that I’m not familiar with.
Another idea I’ve thought about is seeing if there is access to the textile source and if that’s any easier to count. Could always get at it with sql, but that seems even more ugly.
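One possible shape for the entity stripping, offered as a sketch rather than a tested plugin: a single pattern that removes named, decimal and hex entities before counting. The function name is hypothetical, and the input is assumed to be the article html passed in directly.

```php
<?php
// Hypothetical helper: drop html entities with one regex, then count
// runs of word characters.
function entity_safe_word_count($html) {
    $text = strip_tags($html);
    // Matches named (&amp;), decimal (&#8217;) and hex (&#x2019;) entities.
    $text = preg_replace('/&(?:#\d+|#x[0-9a-f]+|[a-z][a-z0-9]*);/i', '', $text);
    return preg_match_all('/\b\w+\b/', $text, $m);
}

// Without stripping, the entity splits the word into haven / 8217 / t.
echo preg_match_all('/\b\w+\b/', 'haven&#8217;t', $m), "\n";          // 3
echo entity_safe_word_count('haven&#8217;t'), "\n";                   // 1
echo entity_safe_word_count('<p>I haven&#8217;t seen it.</p>'), "\n"; // 4
echo entity_safe_word_count(' . . . . . '), "\n";                     // 0
```

The trade-off is that removing an entity joins whatever sat on either side of it, so something like foo&nbsp;bar would be counted as one word. Replacing space-like entities with a real space first would be one way around that.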
Last edited by kelp (2004-11-02 06:36:26)