Plugin idea: Word Count?

ramanan · 2004-09-27 00:46:50

I can’t think of a fast way to do it, like a simple SQL statement or something. You could select out all the entries and then cycle through them, but that seems slow. I’ll think about it more when I get the chance. Anyone else know how to tackle the problem?

kennethlove666 · 2004-10-23 00:28:14

any update on the total word count mode? i’d love this for the same reason (nanowrimo).

ramanan · 2004-10-23 14:02:07

Sorry guys. Now that I’ve started work I’m much slower at getting stuff done. I think it would be slow to process this on the fly. (In my head, the only way I see to do this dynamically would be to select out all your articles, and iterate through them all summing up the word count.) If there was a way to do this in SQL it would be ideal.
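One SQL-only approximation, offered as an untested sketch: count the single spaces in each body and add one per article. The helper names are hypothetical, and the `textpattern` table and `Body` column are assumptions about the schema. The second function repeats the same arithmetic in PHP so the result can be checked without a database.

```php
<?php
// Hypothetical helper: builds a query that approximates the total word count
// by counting single spaces per article. Table and column names are assumed.
function approx_word_count_sql($table = 'textpattern', $column = 'Body')
{
    return "SELECT SUM(LENGTH($column) - LENGTH(REPLACE($column, ' ', '')) + 1) FROM $table";
}

// The same arithmetic in PHP, to show what the SQL expression computes per row.
function approx_word_count($text)
{
    return strlen($text) - strlen(str_replace(' ', '', $text)) + 1;
}
```

This is only a rough figure: runs of spaces and HTML tags inflate it, and an empty body still counts as one word.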

kennethlove666 · 2004-10-23 16:22:36

well, since all the articles would be posted to one section, couldn’t it do the count whenever an article is published to that section, and just update the # in the db or a flatfile?
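A minimal sketch of that idea, assuming a flat file holds the running total. The function names and the file path are hypothetical, and hooking this into the publish step is left out.

```php
<?php
// Hypothetical sketch: keep a running total in a flat file and adjust it
// whenever an article is saved. Nothing here is part of Textpattern itself.
function count_words_simple($text)
{
    return count(preg_split('/\s+/', strip_tags($text), -1, PREG_SPLIT_NO_EMPTY));
}

function update_running_total($file, $old_body, $new_body)
{
    $total = is_file($file) ? (int) file_get_contents($file) : 0;
    // Add the new count and subtract the old one so edits are not double counted.
    $total += count_words_simple($new_body) - count_words_simple($old_body);
    file_put_contents($file, (string) $total, LOCK_EX);
    return $total;
}
```

For a brand new article the old body is just an empty string, so the full count is added.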

Andrew · 2004-10-23 17:40:25

I know that similar versions of this have been made for Wordpress and Movable Type. Perhaps looking at some of those would give some inspiration.

Last edited by compooter (2004-10-23 17:55:07)

Andrew · 2004-10-23 18:04:30

I’m looking at the current code,

<pre>
function rsx_total_word_count($atts) {
    return getThing("select sum(length(body)) from textpattern");
}
</pre>

Length() totals all the characters in the string, which is not what we want, and MySQL queries are just going to complicate things. Why not just use the same technique that’s in rsx_word_count() but with all of body in textpattern?

<pre>
function rsx_total_word_count($atts) {
    $all_content = getThing("select body from textpattern");
    return count(explode(" ", $all_content));
}
</pre>

NOTE: this hasn’t been tested. Just blue-skying here.

Also as a sidenote: this could be extended with an optional ‘from’ and ‘to’ attribute, using strtotime, so that you could limit the selection to a certain timespan or from a certain starting date. Anyone see any flaws in this approach?
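The 'from' and 'to' idea could look roughly like this. It is an untested sketch: the helper name is hypothetical, and the `Posted` column is an assumption about the textpattern table.

```php
<?php
// Hypothetical helper: turns optional 'from' and 'to' attributes into a SQL
// condition that could be appended after WHERE.
function date_range_condition($atts)
{
    $parts = array();
    foreach (array('from' => '>=', 'to' => '<=') as $key => $op) {
        if (empty($atts[$key])) {
            continue; // attribute not supplied
        }
        $ts = strtotime($atts[$key]);
        if ($ts === false) {
            continue; // ignore dates strtotime cannot parse
        }
        $parts[] = "Posted $op '" . date('Y-m-d H:i:s', $ts) . "'";
    }
    // With no usable attributes, return a condition that matches everything.
    return $parts ? implode(' AND ', $parts) : '1';
}
```

Because the dates pass through strtotime and date before reaching the query, only a formatted timestamp ends up in the SQL string.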

Last edited by compooter (2004-10-23 18:09:46)

ramanan · 2004-10-23 20:28:27

I was thinking of making a little hack to update stats as you go. Attaching a word count to the textpattern table, and then updating this value on insert and save. That would involve hacking the install though, which I think isn’t the best way to go. Compooter, I think you would return an array, so you would need to iterate through it and do that count. I’m not saying it wouldn’t work, I just wonder how slow it might be.
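For reference, the iterate-and-sum version might look like the sketch below. It takes a plain array of bodies; fetching the rows from the database is deliberately left out, and the function name is hypothetical.

```php
<?php
// Sketch: sum word counts across many article bodies.
function total_word_count(array $bodies)
{
    $total = 0;
    foreach ($bodies as $body) {
        // Strip markup, split on runs of white space, and skip empty pieces.
        $words = preg_split('/\s+/', strip_tags($body), -1, PREG_SPLIT_NO_EMPTY);
        $total += count($words);
    }
    return $total;
}
```

The cost grows with the total amount of text, which is the speed concern raised above.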

kelp · 2004-11-01 01:19:51

I’ve hacked on rsx_word_count code a bit to make it a little more accurate.

Here are my results:

<pre>function rsx_word_count($atts) {
    global $thisarticle;
    if ( ! isset($thisarticle['body'])) {
        return 0;
    } else {
        $words = strip_tags($thisarticle['body']);
        if (preg_match('/\S/', $words) == 0) {
            return 0;
        } else {
            return count(explode(" ", $words));
        }
    }
}</pre>

Probably some faster ways to do that, but I’m very new to PHP.

It will now properly count articles with no text, like those with only a link to an image. It will also not count html tags as words.

Hopefully this is useful to some people.

Last edited by kelp (2004-11-01 07:17:06)

zem · 2004-11-01 02:54:18

> return count(explode(" ", $words));

That’s not very accurate. It’ll count three spaces as two words, and “foo,bar” as one word.

Try this:

return preg_match_all('/\b\w+\b/', $words, $m);
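To make the difference concrete, here are the two approaches side by side as a small sketch. The wrapper function names are hypothetical.

```php
<?php
// Count by splitting on a single space: every extra space adds an empty piece.
function count_by_explode($words)
{
    return count(explode(' ', $words));
}

// Count runs of word characters between word boundaries.
function count_by_regex($words)
{
    return preg_match_all('/\b\w+\b/', $words, $m);
}
```

On `a   b` (three spaces) the explode version returns 4, so the extra spaces contribute two phantom words, while the regex version returns 2. On `foo,bar` explode returns 1 and the regex returns 2. Note that the regex also splits on a straight apostrophe.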

ramanan · 2004-11-01 03:19:57

Cool. Thanks for the suggestions. I think I’ll take this week to go through and update my plugins.

kelp · 2004-11-01 07:30:41

> zem wrote:
>
> > return count(explode(" ", $words));
>
> That’s not very accurate. It’ll count three spaces as two words, and “foo,bar” as one word.
>
> Try this:
>
> return preg_match_all('/\b\w+\b/', $words, $m);

That’s not super accurate for all cases either. It gets pretty confused by html elements and counts things like doesn’t as 3 words. The explode method doesn’t have that problem, but you are right, it does have issues with not counting “foo,bar” and counting 3 spaces as 2 words.

So I’ve played around with things quite a bit and finally got to this which works pretty well for me:

<pre>
function rsx_word_count($atts) {
    global $thisarticle;
    if ( ! isset($thisarticle['body'])) {
        return 0;
    } else {
        $words = strip_tags($thisarticle['body']);
        $clean = trim($words);
        if (preg_match('/\S/', $clean) == 0) {
            return 0;
        } else {
            $result = preg_split("/[\s,]+/", $clean);
            return count($result);
        }
    }
}
</pre>

I have to check if $clean contains nothing but white space because even if body only contains html tags that we then strip, we still have one element in $result. I would rather return zero, since there are no real words.

If you have preceding or trailing white space, it will be given its own element in $result by preg_split. So I had to trim() that off.

Then we just split on white space or “,” and all is good.

I also looked at using str_word_count which would have been the obvious choice, but it also has issues with doesn’t, counting it as 2 words.
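A possible simplification, as a sketch only: preg_split accepts the PREG_SPLIT_NO_EMPTY flag, which drops empty pieces, so the trim() and the white-space-only check may not be needed. The function name is hypothetical and it takes the body as a parameter instead of reading $thisarticle.

```php
<?php
// Variant of the preg_split approach: PREG_SPLIT_NO_EMPTY discards empty
// pieces, so leading/trailing white space and tag-only bodies give no
// spurious elements.
function count_words_no_empty($body)
{
    $text = strip_tags($body);
    return count(preg_split('/[\s,]+/', $text, -1, PREG_SPLIT_NO_EMPTY));
}
```

A body that contains only an image tag comes out as 0, and surrounding white space is ignored.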

Last edited by kelp (2004-11-01 07:33:28)

ramanan · 2004-11-02 03:10:24

I was thinking of writing some sort of state machine that would count a word as anything separated by white space. I might look into how the unix command wc does things.

kelp · 2004-11-02 03:51:53

> ramanan wrote:

> I was thinking of writing some sort of state machine that would count a word as anything separated by white space. I might look into how the unix command wc does things.

from wc(1):

<pre> a word is defined as a string of characters delimited by white space characters. White space characters are the set of characters for which the isspace(3) function returns true.
</pre>

And from isspace(3):

<pre> The isspace() function tests for the standard white-space characters….

In the ASCII character set, this includes the following characters (with their numeric values shown in octal):

011 ht    012 nl    013 vt    014 np    015 cr    040 sp
</pre>

So wc considers tab, newline, vertical tab, formfeed, carriage return, and space all as word boundary characters.

My code actually does exactly what wc does, except also counts “foo,bar” as 2 words.

Although I’m sure there are much faster ways to do this. And maybe better.
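For comparison, the state machine idea could be sketched like this: walk the string one character at a time and count each transition from white space into a word. This is an illustration of the wc definition quoted above, not wc's actual source, and the function name is hypothetical.

```php
<?php
// Sketch of a wc-style counter: a word starts at each transition from
// white space to non-white space.
function wc_word_count($text)
{
    $count = 0;
    $in_word = false;
    $len = strlen($text);
    for ($i = 0; $i < $len; $i++) {
        // The six ASCII characters listed for isspace(3): sp ht nl vt np cr.
        $is_space = strpos(" \t\n\v\f\r", $text[$i]) !== false;
        if ($is_space) {
            $in_word = false;
        } elseif (!$in_word) {
            $in_word = true;
            $count++;
        }
    }
    return $count;
}
```

Unlike the preg_split version, this treats `foo,bar` as a single word, which matches the wc definition.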

Last edited by kelp (2004-11-02 03:53:26)

ramanan · 2004-11-02 04:23:56

If you want you can just post up your version of the word count plugin here. There is no reason I need be the authority on such a thing. I think I’ll probably try and write up something that counts words similar to the way wc does, but for the time being, other people may want to use your version of this plugin. Though I guess they can copy and paste in your code just as easily.

Last edited by ramanan (2004-11-02 04:26:43)

kelp · 2004-11-02 04:32:46

> ramanan wrote:

> If you want you can just post up your version of the word count plugin here. There is no reason I need be the authority on such a thing. I think i’ll probably try and write up something that counts words similar to the way wc does, but for the time being, other people may want to use your version of this plugin. Though I guess they can copy and paste in your code just as easily.

Sure :)

I would like to see your new version too!

Textpattern CMS

Textpattern CMS support forum
