Textpattern Forum

#151 2012-02-16 03:19:14

maniqui
Moderator
From: Buenos Aires, Argentina
Registered: 2004-10-10
Posts: 2,989
Website

Re: smd_xml : extract data from XML feeds

Quick (incomplete) report: set_empty="1" doesn’t seem to be working for me (on v0.3; I haven’t tested v0.4beta yet). The {replacement} is still being printed. Not sure how to troubleshoot.

Edit: nardo reported a similar (if not the same) issue.

Last edited by maniqui (2012-02-16 03:22:42)


La música ideas portará y siempre continuará

TXP Builders – finely-crafted code, design and txp

#152 2012-02-17 16:09:09

jelle
Member
Registered: 2006-06-07
Posts: 165

Re: smd_xml : extract data from XML feeds

Hi Stef,

I’m using smd_xml to load some RSS feeds. One of the sites it fetches went offline a few moments ago. smd_xml kept trying to load the (non-existent) data… crashing the browser and eventually locking up my session :)

PHP output as follows:

Tag error <txp:smd_xml data="http://www.zwolle.nl/RSS-feed-nieuwsoverzicht.htm" format="pubDate|date|%d-%m" record="item" fields="title, link, pubDate" form="feed" limit="6" /> -> Warning: fsockopen() [function.fsockopen]: php_network_getaddresses: getaddrinfo failed: Name or service not known on line 125

Tag error <txp:smd_xml data="http://www.zwolle.nl/RSS-feed-nieuwsoverzicht.htm" format="pubDate|date|%d-%m" record="item" fields="title, link, pubDate" form="feed" limit="6" /> -> Warning: fsockopen() [function.fsockopen]: unable to connect to www.zwolle.nl:80 (php_network_getaddresses: getaddrinfo failed: Name or service not known) on line 125

Tag error <txp:smd_xml data="http://www.zwolle.nl/RSS-feed-nieuwsoverzicht.htm" format="pubDate|date|%d-%m" record="item" fields="title, link, pubDate" form="feed" limit="6" /> -> Warning: fputs() expects parameter 1 to be resource, boolean given on line 132

Tag error <txp:smd_xml data="http://www.zwolle.nl/RSS-feed-nieuwsoverzicht.htm" format="pubDate|date|%d-%m" record="item" fields="title, link, pubDate" form="feed" limit="6" /> -> Warning: stream_set_timeout() expects parameter 1 to be resource, boolean given on line 133

Tag error <txp:smd_xml data="http://www.zwolle.nl/RSS-feed-nieuwsoverzicht.htm" format="pubDate|date|%d-%m" record="item" fields="title, link, pubDate" form="feed" limit="6" /> -> Warning: stream_get_meta_data() expects parameter 1 to be resource, boolean given on line 134

Tag error <txp:smd_xml data="http://www.zwolle.nl/RSS-feed-nieuwsoverzicht.htm" format="pubDate|date|%d-%m" record="item" fields="title, link, pubDate" form="feed" limit="6" /> -> Warning: feof() expects parameter 1 to be resource, boolean given on line 137

Tag error <txp:smd_xml data="http://www.zwolle.nl/RSS-feed-nieuwsoverzicht.htm" format="pubDate|date|%d-%m" record="item" fields="title, link, pubDate" form="feed" limit="6" /> -> Warning: fgets() expects parameter 1 to be resource, boolean given on line 138

Those last lines went on and on and on…

#153 2012-02-17 17:20:44

Gocom
Developer
Registered: 2006-07-14
Posts: 4,476
Website

Re: smd_xml : extract data from XML feeds

Hi Stef,

Some suggestions and findings, “inspired” by jelle’s post above. I’m quoting source code from v0.30 release.

gps('force') […] $cache_time

The force_read GET option should really be optional. As it stands, it gives any nasty troll a great opportunity to get the server banned from the services being fetched.

On that note, a cache time should be set by default: 15 minutes, an hour, or something similar. Constantly fetching the feeds is a great way to cause problems for third-party services and get the user’s server banned, which is something end users may not know or think about.
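To illustrate the default I have in mind, here is a minimal sketch of a freshness check. The function name and signature are made up for this post and are not taken from smd_xml; the one-hour default is just one of the values suggested above.

```php
<?php
// Hypothetical helper (not smd_xml code): decides whether a cached copy
// of a feed is still fresh enough to reuse instead of fetching again.
function sketch_cache_is_fresh(int $fetchedAt, int $now, int $cacheTime = 3600): bool
{
    // A cache time of zero or less means "never reuse the cache".
    if ($cacheTime <= 0) {
        return false;
    }
    // Fresh while the copy is younger than the cache time (in seconds).
    return ($now - $fetchedAt) < $cacheTime;
}

$now = time();
var_dump(sketch_cache_is_fresh($now - 600, $now));   // fetched 10 minutes ago
var_dump(sketch_cache_is_fresh($now - 7200, $now));  // fetched 2 hours ago
```

With a non-zero default, a page that is hammered by visitors still only hits the remote service once per cache period.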

// Cached document is gzipped and then (yuk!) base64’d if zlib is compiled in.
// Would prefer to store binary data directly but trying to insert it into a txp_prefs
// text field always gave problems on insertion and/or retrieval

I’m not sure what you mean, unless you are trying to use statements that are not escaped. You are not, right? There shouldn’t be any issue other than the statement’s size. What you are getting is just XML, which is just text.

If you want to store the actual presentation of the data, use serialize instead of that wonky base64 encoding and compression thing.

case 'curl': […] case 'fsock':

I would advise ditching the others and concentrating on one of them, preferably cURL. You require PHP5, so there is no need for fallbacks. cURL is there. There are a couple of things you may want to check to prevent Jelle’s issues: the HTTP status and the received content.

If the HTTP status code is anything other than 200, or the content is empty or doesn’t fulfill XML’s basic requirements (a container tag, for instance), end processing of the feed. E.g.

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $data);
curl_setopt($ch, CURLOPT_FAILONERROR, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
$src = curl_exec($ch);
$http = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

if(!$src || strpos($src, '<') === false || $http != 200) {
	trigger_error('Can not fetch feed.');
	return;
}

Of course you may want to serve the old data if it exists, rather than just ending there, but you get the point.
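A rough sketch of that fallback, assuming the fetch result, the status code and any previously cached copy are already at hand. The helper name is hypothetical and nothing here comes from the plugin source.

```php
<?php
// Hypothetical helper: pick what to parse after a fetch attempt.
// $fresh  - fetched body, or false/'' when the request failed
// $http   - HTTP status code reported by the transport
// $cached - previously stored copy, or null if nothing is stored
function sketch_pick_feed($fresh, int $http, ?string $cached): ?string
{
    // Same three checks as above: got a body, status 200, looks like markup.
    $looksOk = is_string($fresh)
        && $http === 200
        && strpos($fresh, '<') !== false;

    if ($looksOk) {
        return $fresh;
    }
    // Fetch failed: fall back to the old copy (null when there is none).
    return $cached;
}
```

The caller only has to treat null as "nothing to show"; a dead remote host then degrades to stale content instead of an endless stream of warnings.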

constructor […] smd_xml_parser()

__construct(). The plugin requires PHP5, I would use its features and not some PHP4 b/c.

preg_replace("/>"."[[:space:]]+"."</i"

Case-insensitive space? Space? I’m in spaceeeeeeeeeee.

$target_enc

Not sure about its purpose. Textpattern pages are always UTF-8. Space. I love space.

{

And lastly there is my favorite thing, which you are going to hate me for. Curly-tags. All fetched feeds have full execution rights on the server. You are aware of that, right? So let’s say Last.fm went down or got compromised. The attacker could collect the passwords of every Textpattern installation that uses smd_xml by replacing feeds with <txp:php> /* bad code here */ </txp:php>. That’s not exactly a good thing.

In general, smd_xml would benefit from having a way to filter the contents: setting which tags are allowed (XSS injections are not fun) and stripping out harmful bytes.
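Something along these lines is what I mean by filtering. This is only a sketch: the function name and the default list of allowed tags are invented for the example, and the plugin would need to decide where in its pipeline such a filter runs.

```php
<?php
// Hypothetical filter (not part of smd_xml): cleans one field value taken
// from a feed before it is substituted into a form.
function sketch_filter_value(string $value, string $allowedTags = '<a><em><strong>'): string
{
    // Drop control characters; tab, LF and CR are kept.
    $value = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/', '', $value);
    // Remove every tag that is not explicitly allowed, <txp:...> included.
    return strip_tags($value, $allowedTags);
}

echo sketch_filter_value('<txp:php>echo 1;</txp:php> <em>fine</em>');
```

Note that strip_tags() leaves the attributes of allowed tags untouched, so an allowed <a> can still carry an onclick. A whitelist like this removes the <txp:php> problem but is not a complete XSS defence on its own.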

Last edited by Gocom (2012-02-17 17:26:13)


Rah-plugins | What? I’m a little confused… again :-) <txp:is_god />

#154 2012-02-18 00:02:45

Bloke
Developer
From: Leeds, UK
Registered: 2006-01-29
Posts: 5,909
Website

Re: smd_xml : extract data from XML feeds

Gocom

Thanks for the rundown. v0.40 is a bit hardier than v0.30 but the main reason I haven’t officially released it is because I’ve been holding back while I investigate ways to make the curly tags thing better.

Anyway, just briefly to answer your excellent points. In v0.40:

  • force_read: removed
  • cache_time: default = 1 hour
  • gzip/base64: the comment is misleading but basically I was trying to save space in the prefs. Some XML documents are colossal so gzip seemed the logical way to compress it, but you can’t store such binary data in a MySQL TEXT field without problems, so I base64_encode() it for safety. I seem to remember trying serialize(), btw, but that only worked if the input was pure text and didn’t have any spurious (blob or malicious?) embedded data. For example, if the URL resolved to an error scenario and squirted HTML back in a foreign language, or someone supplied, I dunno, a URL to a jpg, the possibly binary data presented to the plugin broke things. And since the caching occurs before any checks on the data itself take place and before the XML parser has a chance to reject it for malformedness, I left it that way. I’ll see if there’s a better method.
  • cURL vs fsock: I didn’t realise it was a pre-installed component of PHP5. I thought people could still opt to install it (at least that’s how I interpret the docs). I might be wrong. In any case, I’ve swapped the test around in the code so if you don’t supply a transport, cURL is checked and used first if compiled in.
  • check http status: I’d put a lot more defensive coding into v0.40 to trap any malformed or non-existent feeds (which should have prevented the issue Jelle encountered). But I’ll add the explicit HTTP status check and perform some simple tests for XMLness as you suggest. Can’t hurt to be more cautious.
  • __construct(): good call. Done.
  • case-insensitive space: errr, yeah. I think I copied that from a PHP.net comment somewhere and didn’t spot the /i on the end.
  • $target_enc: just a feature of PHP’s XML parser (XML_OPTION_TARGET_ENCODING) so you can (for whatever reason) translate the incoming feed into something other than UTF-8 (or convert it to UTF-8 if it’s in some other strange encoding). I’ve never used it, but since it was a stock option of the constructor I employed it.
  • curly quote syntax. Yeah, fix is in the works. I’ve been gradually going through all my plugins and sorting things out to either:
    • remove it (if it adds little value)
    • replace it with a <txp:smd_proper_tag>
    • only use {replacements} for values that are generated by the plugin (e.g. counters), not stuff that comes from potentially tainted user input
    • internally make {replacement} use <txp:smd_blah_info name="replacement"> if the curly quote syntax is much cleaner than a tag, or would cause real b/c problems
    • make <txp:smd_blah_info> tags more compelling by expanding their capabilities to allow lists of names with wraptag/break/class. That means {replacements} are often less elegant than a tag, which makes people rely on them less
    • rewrite the code so they are no longer necessary
    • stop using strtr() because it’s dog slow, and use str_replace() instead
    • employ filtering options to sanitize data
    • some or all of the above in tandem

The curly tag thing is just taking a little longer to do than I’d hoped. I’ve tentatively done smd_if and smd_query (and a few of my as-yet unreleased plugins), with a few other oldies in various states of completion (smd_bio, smd_xml, smd_gallery, etc). I’ll get there eventually.
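For reference, the gzip/base64 point in the list above boils down to a round trip like the one below. It is a sketch, not the plugin’s code: the function names are made up, and it assumes the zlib extension is available.

```php
<?php
// Hypothetical helpers: pack a fetched document for storage in a text
// column and unpack it again. Requires the zlib extension.
function sketch_pack(string $doc): string
{
    // Compress first, then make the binary result text-safe.
    return base64_encode(gzcompress($doc, 9));
}

function sketch_unpack(string $stored): ?string
{
    $binary = base64_decode($stored, true);  // strict mode: false on bad input
    if ($binary === false) {
        return null;
    }
    $doc = @gzuncompress($binary);           // false if not valid zlib data
    return $doc === false ? null : $doc;
}

$packed = sketch_pack("<rss>\x00\xFF binary-ish payload</rss>");
var_dump(sketch_unpack($packed) === "<rss>\x00\xFF binary-ish payload</rss>");
```

Base64 adds roughly a third to the size of the compressed data, so the saving depends on how well the document compresses in the first place.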

Last edited by Bloke (2012-02-18 00:04:13)


The smd plugin menagerie — for when you need one more gribble of power from Textpattern.

Txp Builders – finely-crafted code, design and Txp

#155 2012-02-18 00:53:59

Gocom
Developer
Registered: 2006-07-14
Posts: 4,476
Website

Re: smd_xml : extract data from XML feeds

Bloke wrote:

cURL vs fsock: I didn’t realise it was a pre-installed component of PHP5

Heh, it’s not. But it really, really should be installed on any server, and for the most part, it is. If it isn’t, then that server or hosting provider isn’t worth using. cURL is the thing that should be used for these types of tasks.

I copied that from a PHP.net comment

Not the best place to get your resources, I would say. The actual docs are fine, but the comments vary greatly in quality. I wouldn’t take anything from the comments section if in doubt.

PHP’s XML parser

Speaking of parsers, don’t you think it might be a good idea to use SimpleXML instead of that ancient… thingy? It has been enabled by default since PHP 5.1.2 and is pretty fly. It would save you that whole smd_xml_parser class. All you would need is something like:

try {
	@$r = new SimpleXMLElement($xml, LIBXML_NOCDATA);
}
catch(Exception $e){
	trigger_error('Invalid XML document, or bitch* is just stupid. *SimpleXML');
	return;
}

You get a neatly organized resource. Returning values is a breeze. Or what about unlimited nodes? Unlimited power? No level cap, you bet!

/**
 * <txp:smd_xml_data name="node->that->is->in-deep->mess->or->just->title" />
 */
function smd_xml_data($atts) {

	global $smd_xml;

	extract(lAtts(array(
		'name' => NULL,
		'escape' => 1,
	), $atts));

	$r = $smd_xml;

	foreach(explode('->', $name) as $n) {	
		if(!isset($r->{$n})) {
			return;
		}
		$r = $r->{$n};
	}

	$r = (string) $r;	
	return $escape ? htmlspecialchars($r) : $r;
}
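To try the traversal idea outside Textpattern, here is a standalone sketch. The sample document and the function name are invented for the example; it does not use lAtts() or any global, so it runs on its own.

```php
<?php
// Sample document invented for this example.
$xml = '<rss><channel><title>Demo</title>'
     . '<item><title>First &amp; last</title></item></channel></rss>';

// Hypothetical standalone version of the lookup: walks "a->b->c" through
// a SimpleXMLElement and returns the text, or null if a node is missing.
function sketch_xml_path(SimpleXMLElement $root, string $path): ?string
{
    $node = $root;
    foreach (explode('->', $path) as $name) {
        if (!isset($node->{$name})) {
            return null;
        }
        $node = $node->{$name};
    }
    return (string) $node;
}

$root = new SimpleXMLElement($xml);
// The root element (<rss>) is the starting point, so the path begins below it.
echo htmlspecialchars(sketch_xml_path($root, 'channel->item->title'));
```

Because the node name goes through the {$name} syntax, element names containing a hyphen work too, which plain property access would not allow.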

Last edited by Gocom (2012-02-18 00:55:27)


#156 2012-03-30 16:22:16

MattD
Plugin Author
From: Monterey, California
Registered: 2008-03-21
Posts: 1,188
Website

Re: smd_xml : extract data from XML feeds

I need to convert a Twitter feed’s created_at date to YYYY,MM,DD format. I’m using format="created_at|date|%Y,%m,%d" but I just get the year. The raw value in the Twitter feed is Tue Mar 27 19:06:47 +0000 2012.

What am I doing wrong?

My guess is it’s the commas.

edit: for now I’m using rah_replace to get around this.

Last edited by MattD (2012-03-30 16:28:39)


My Plugins

Piwik Dashboard, Minibar, Article Image Colorpicker, Admin Datepicker, Admin Google Map, Admin Colorpicker

#157 2012-03-30 18:32:26

Bloke
Developer
From: Leeds, UK
Registered: 2006-01-29
Posts: 5,909
Website

Re: smd_xml : extract data from XML feeds

MattD wrote:

My guess is it’s the commas.

Yep, you need to override the delim attribute.
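Roughly what is going on, as a sketch in plain PHP rather than the plugin’s actual code. That the attribute is split on the delim value (a comma) before anything else happens is an assumption made for the illustration; the date conversion itself is standard PHP.

```php
<?php
// The format attribute from the post, split on a comma.
// ASSUMPTION: the plugin splits on its delim value first.
$format = 'created_at|date|%Y,%m,%d';
$pieces = explode(',', $format);
// $pieces[0] is 'created_at|date|%Y', so only the year pattern survives.
echo $pieces[0], "\n";

// Converting the raw Twitter value directly, outside the plugin:
$raw  = 'Tue Mar 27 19:06:47 +0000 2012';
$date = DateTimeImmutable::createFromFormat('D M d H:i:s O Y', $raw);
$formatted = $date ? $date->format('Y,m,d') : null;
echo $formatted, "\n";
```

Picking a delim character that does not occur in the date pattern keeps the three-part format value in one piece.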


#158 2012-04-03 05:20:31

MattD
Plugin Author
From: Monterey, California
Registered: 2008-03-21
Posts: 1,188
Website

Re: smd_xml : extract data from XML feeds

OK, now I am using a feed where I need to get an attribute of an XML tag, photo|taken, but I also need to format it like the one above: format="created_at|date|%Y.%m.%d".

How do I replace created_at with photo|taken? It seems both format and the xml values use param_delim.


#159 2012-04-03 08:18:21

Bloke
Developer
From: Leeds, UK
Registered: 2006-01-29
Posts: 5,909
Website

Re: smd_xml : extract data from XML feeds

MattD wrote:

It seems both format and the xml values use param_delim.

Yes they do. Well spotted. Try the latest beta which introduces tag_delim (default = |) for dealing specifically with the XML stream tags. param_delim retains its usage solely for plugin attributes.

Please let me know if it works; I haven’t actually tested it: I’d like to test it on the feed you’re working on since it saves me hunting around for a feed that matches the criteria. If you have any problems, please would you let me know the feed URL and I can take a look directly (PM me if necessary). Thanks.

Last edited by Bloke (2012-04-03 08:18:44)


#160 2012-04-03 15:48:54

MattD
Plugin Author
From: Monterey, California
Registered: 2008-03-21
Posts: 1,188
Website

Re: smd_xml : extract data from XML feeds

Bloke wrote:

Please let me know if it works; I haven’t actually tested it: I’d like to test it on the feed you’re working on since it saves me hunting around for a feed that matches the criteria.

Seems to work. I’m using the flickr.photos.search method of the Flickr API and requesting the following extra fields: date_taken, title, url_l and tags.

