Plugin idea: Word Count?

zem · 2004-11-02 04:59:17

> return preg_match_all('/\b\w+\b/', $words, $m);

> That’s not super accurate for all cases either. It gets pretty confused by html elements and counts things like doesn’t as 3 words. The explode method doesn’t have that problem, but you are right, it does have issues with not counting “foo,bar” and counting 3 spaces as 2 words.

strip_tags() was already done. html_entity_decode() is probably also needed to get it right, whatever method you use. That means PHP 4.3.0 is required, and at that point you might as well use str_word_count(), which arrived in the same release.
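A rough sketch of that pipeline (strip tags, decode entities, then count), in case it helps. The helper name count_words_plain is made up for illustration, and passing 'UTF-8' as the charset is an assumption about the site's encoding, not something from the plugin.

```php
<?php
// Hypothetical helper, not part of any plugin in this thread.
function count_words_plain($html)
{
    // 1. Drop the markup.
    $text = strip_tags($html);
    // 2. Turn entities such as &amp; back into plain characters.
    //    The 'UTF-8' charset is an assumption about the site's encoding.
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    // 3. Let PHP count the words.
    return str_word_count($text);
}

echo count_words_plain('<p>Hello <em>big</em> world &amp; friends</p>'), "\n"; // 4
```

The lone "&" left after decoding is not counted, since str_word_count() only counts runs of letters (plus embedded apostrophes and hyphens).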

> $result = preg_split("/[\s,]+/", $clean);

That’ll count “ . . . . . ” (space dot space dot) as words.
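To make that concrete, here is a small sketch comparing the two approaches on a dots-only string. The variable names are just for illustration, and the trim() mirrors the one mentioned in the quoted post.

```php
<?php
$clean = ' . . . . . ';

// Splitting on whitespace/commas treats every lone dot as a "word".
$parts = preg_split('/[\s,]+/', trim($clean));
$split_count = count($parts);

// \w only matches letters, digits and underscore, so nothing matches here.
$match_count = preg_match_all('/\b\w+\b/', $clean, $m);

printf("split: %d, match: %d\n", $split_count, $match_count); // split: 5, match: 0
```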

There’s more to it than just commas. Read the regexp manual on ‘\b’. Someone’s already solved this problem before.

> If you have preceding or trailing white space, it will be given its own element in $result by preg_split. So I had to trim() that off.

preg_match_all achieves the same thing as split/count, without passing arrays around as much, and without zero count problems.
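A quick sketch of the zero-count case, assuming a whitespace-only input such as an empty article body. The variable names are hypothetical.

```php
<?php
// Whitespace-only input, e.g. an empty article body after cleanup.
$clean = "  \n ";

// Split approach: trim() leaves '', and preg_split() on '' still
// returns one (empty) element, so count() reports 1 word.
$split_count = count(preg_split('/[\s,]+/', trim($clean)));

// The PREG_SPLIT_NO_EMPTY flag drops empty pieces and avoids that.
$split_fixed = count(preg_split('/[\s,]+/', $clean, -1, PREG_SPLIT_NO_EMPTY));

// Match approach: the return value is already the number of matches.
$match_count = preg_match_all('/\b\w+\b/', $clean, $m);

printf("%d %d %d\n", $split_count, $split_fixed, $match_count); // 1 0 0
```

So the split version can be rescued with PREG_SPLIT_NO_EMPTY, but the match version needs no special handling at all.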

zem · 2004-11-02 05:01:12

> I was thinking of writing some sort of state machine that would count a word as anything separated by white space.

/\b\w+\b/

Don’t need a state machine when there’s only one state and one transition.

kelp · 2004-11-02 06:34:22

> zem wrote:

> > return preg_match_all('/\b\w+\b/', $words, $m);

> > That’s not super accurate for all cases either. It gets pretty confused by html elements
> > and counts things like doesn’t as 3 words. The explode method doesn’t have that
> > problem, but you are right, it does have issues with not counting “foo,bar” and counting
> > 3 spaces as 2 words.

> strip_tags() was already done. html_entity_decode() is also probably a requirement to get
> it right, whatever method you use. Which means PHP 4.3.0 is required, which means you
> might as well use str_word_count().

I’m confused, where was strip_tags() already done?

$thisarticle['body'] contains the html for the article. So strip_tags() needs to be done to clean that up.

The problem with html_entity_decode() is it doesn’t do anything with html character entities like &#8217; (the curly apostrophe ’), so they still end up getting counted as multiple words.

See this code:

It looks like the Textile code doesn’t use htmlentities(); it has its own bits to handle these cases.

> > $result = preg_split("/[\s,]+/", $clean);

> That’ll count “ . . . . . ” (space dot space dot) as words.

Ahh, good point. And str_word_count() or preg_match_all("/\b\w+\b/", $string, $m) correctly handle that case. But they don’t correctly handle any of the html entities not stripped by strip_tags() or decoded by html_entity_decode(). And those problems come up in nearly every article on my site. I would rarely have something like “. . . . .”.

> There’s more to it than just commas. Read the regexp manual on ‘\b’. Someone’s
> already solved this problem before.

If simply ‘\b’ worked I would use it. It’s the other html character entities that mess up the works. We could have more regex to strip them or miss a few edge cases.

> > If you have preceding or trailing white space, it will be given its own element in $result
> > by preg_split. So I had to trim() that off.

> preg_match_all achieves the same thing as split/count, without passing arrays around as
> much, and without zero count problems.

Except it counts haven’t as 3 words.

Try running this code to see for yourself.
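A minimal sketch of the kind of check meant here (not the original snippet), assuming the article body holds the apostrophe as the numeric entity &#8217;:

```php
<?php
// Assumed input: the apostrophe in "haven't" stored as a numeric entity.
$text = 'haven&#8217;t';

// Count runs of word characters between word boundaries.
$count = preg_match_all('/\b\w+\b/', $text, $m);

echo $count, "\n";   // 3
print_r($m[0]);      // haven, 8217, t
```

The digits of the entity are word characters too, so "8217" gets counted as a word of its own between "haven" and "t".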

If you have any ideas about how to strip out those other html entities without a huge mess of regex, I would love to hear them.

Or perhaps there is yet another php function that I’m not familiar with.

Another idea I’ve thought about is seeing whether there is access to the raw Textile source and whether that’s any easier to count. I could always get at it with SQL, but that seems even uglier.

Last edited by kelp (2004-11-02 06:36:26)

ramanan · 2004-11-05 02:50:57

Wait wait wait. All this time there has been a php function called str_word_count?

kelp · 2004-11-05 03:03:02

Yep :)

And it might do a pretty good job if the input is run through strip_tags() and then everything that looks like an html entity is stripped out with something like preg_replace().
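Something along these lines, as a rough sketch only. The function name count_words_stripped and both patterns are my own guesses at what "looks like an html entity", not tested against real Textile output.

```php
<?php
// Hypothetical helper; the entity patterns are assumptions, not a complete list.
function count_words_stripped($html)
{
    $text = strip_tags($html);

    // Apostrophe-like entities become a plain apostrophe so that
    // "haven&#8217;t" stays one word.
    $text = preg_replace('/&(?:#8217|#0?39|rsquo|apos);/i', "'", $text);

    // Anything else shaped like &name; &#123; or &#x1F; becomes a space,
    // so "foo&nbsp;bar" is still two words.
    $text = preg_replace('/&(?:#\d+|#x[0-9a-f]+|[a-z][a-z0-9]*);/i', ' ', $text);

    return str_word_count($text);
}

echo count_words_stripped('<p>I haven&#8217;t seen &quot;it&quot; yet</p>'), "\n"; // 5
```

Replacing with a space rather than nothing is a deliberate choice: deleting &nbsp; outright would glue the neighbouring words together.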

I was just too lazy and non regex clued to strip out all the html entities.

Andrew · 2004-11-05 03:19:14

You know, I had noticed that yesterday as well while browsing string functions, but had just assumed there was a logical reason everyone had passed it up. D.Oh.

fantasylit · 2004-11-10 10:54:11

Oooer! Who knew I could cause so much trouble!

I’ve noticed that I tend not to use many tags so ramanan’s plugin is pretty accurate. It’s only been a few words out for each post so far (I checked in a couple of regular word processors).

And thanks again for writing it, ramanan – it’s really come in handy.

Carla

nishark · 2004-11-10 11:12:06

> ramanan wrote:

> Wait wait wait. All this time there has been a php function called str_word_count?

hehe, that function is mentioned in both the 3rd and 4th posts in this thread :)

datumax · 2006-08-02 02:29:05

Sweet! But has anyone thought about using the output of this plugin to dictate behavior, or is that even possible? Like…

If wordcount >= 200
then truncate post

Can we build around this?
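Roughly what I have in mind, as a plain PHP sketch. The function truncate_words is hypothetical and not part of the plugin or of Textpattern; it assumes the post has already been reduced to plain text.

```php
<?php
// Hypothetical helper: cut plain text down to at most $limit words.
function truncate_words($text, $limit = 200)
{
    // Split on runs of whitespace, dropping empty pieces.
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    // Short enough already: return the text untouched.
    if (count($words) <= $limit) {
        return $text;
    }

    // Keep the first $limit words and mark the cut.
    return implode(' ', array_slice($words, 0, $limit)) . '...';
}

echo truncate_words('one two three four', 2), "\n"; // one two...
```

Running this on the html body directly could cut a tag in half, so it would need strip_tags() first or something smarter.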

Textpattern CMS

Textpattern CMS support forum
