Textpattern CMS support forum

#145 2012-02-08 09:43:28

Bloke
Developer
From: Leeds, UK
Registered: 2006-01-29
Posts: 11,250
Website GitHub

Re: smd_xml : extract data from XML feeds

nardo wrote:

is there a way to debug output?

For the time being (until I get round to doing it properly) you can add debug="3" to the tag. You’ll see the feed contents dumped to the screen if it’s working right, then a load of gubbins after that detailing each matched record it finds and what the replacements are.

without the cache, it’s reading every time, right … ?

Yep.


The smd plugin menagerie — for when you need one more gribble of power from Textpattern. Bleeding-edge code available on GitHub.

Txp Builders – finely-crafted code, design and Txp

Offline

#146 2012-02-08 10:03:52

nardo
Member
From: tuvalahiti
Registered: 2004-04-22
Posts: 743

Re: smd_xml : extract data from XML feeds

I get no output with debug="3"

looking at the feed again, could the quotes be causing an issue?

<title>"The theatre is a source of heartaches and immeasurable joys."</title><description>“The theatre is a source of heartaches and immeasurable joys.”<br/><br/> - <em><em>Konstantin Stanislavsky</em></em></description>

Offline

#147 2012-02-08 10:12:40

Bloke
Developer
From: Leeds, UK
Registered: 2006-01-29
Posts: 11,250
Website GitHub

Re: smd_xml : extract data from XML feeds

nardo wrote:

I get no output with debug="3"

Odd. It sounds like the plugin’s not being called. Any errors if you put the site in testing/development mode? No dangling tags or anything in the markup anywhere?

could the quotes be causing an issue?

Maybe. Do you have a feed link you can post here (or send me), along with your smd_xml tag/form/container? I can try it out myself then and see if I can spot the issue.


Offline

#148 2012-02-09 02:55:36

aslsw66
Member
From: Canberra, Australia
Registered: 2004-08-04
Posts: 342
Website

Re: smd_xml : extract data from XML feeds

I didn’t notice the debugging option before. Here’s what happens when I use it:

1. Production status is set to “Debugging”.

2. My coding is as follows:

<txp:smd_xml data="http://rss.weather.com.au/act/canberra" record="item" fields="description,pubDate,link" limit="1" cache_time="900" format="pubDate|date|%d/%m/%Y %H:%I" debug="3">
<p>{description}</p>
<p class="posted">{pubDate}<br /><a href="{link}">weather.com.au</a></p>
</txp:smd_xml>

3. The browser displays the following:

++ READING CACHE smd_xml_data_bd2986 ++

++ FILTERED SOURCE DATA ++

4. My tag trace is:

<!-- txp tag trace:
[SQL (0.00014305114746094): select name, data from txp_lang where lang='en-gb' AND ( event='public' OR event='common')]
[SQL (0.0026988983154297): select name, code, version from txp_plugin where status = 1 AND type IN (0,1) order by load_order]
[SQL (0.00054311752319336): select name,code,version from txp_plugin where status = 1 AND name='mem_form']
[SQL (0.00010895729064941): select name,code,version from txp_plugin where status = 1 AND name='glz_custom_fields']
[SQL (0.00018095970153809): select name from txp_section where `name` like 'weather' limit 1]
[SQL (5.793571472168E-5): select page, css from txp_section where name = 'weather' limit 1]
[SQL (0.00041007995605469): select host from txp_log where ip='203.217.150.69' limit 1]
[SQL (0.00029516220092773): insert into txp_log set `time`=now(),page='/weather',ip='203.217.150.69',host='203.217.150.69',refer='',status='200',method='GET']
[SQL (5.8174133300781E-5): select user_html from txp_page where name='weather']
[Page: weather]
<txp:feed_link flavor="atom" format="link" label="Atom" />
<txp:feed_link flavor="rss" format="link" label="RSS" />
<txp:css format="link" />
<txp:rsd />
<txp:smd_xml data="http://rss.weather.com.au/act/canberra" record="item" fields="description,pubDate,link" limit="1" cache_time="900" format="pubDate|date|%d/%m/%Y %H:%I" debug="3">
</txp:smd_xml>
[ ~~~ secondpass ~~~ ]

Finally, if I change the production status to “Live”, I get a ‘500 Internal Server Error’; the server log shows a timeout error. This makes me wonder whether it is a problem with my host, or with my Textpattern installation, or with the plugin.

Offline

#149 2012-02-09 11:40:51

Bloke
Developer
From: Leeds, UK
Registered: 2006-01-29
Posts: 11,250
Website GitHub

Re: smd_xml : extract data from XML feeds

aslsw66 wrote:

whether it is a problem with my host, or with my Textpattern installation, or with the plugin.

For the record, I tried your exact code on my server and it worked a treat in both Live and Debugging production status.

There is a subtle bug in the latest beta if you’re using the ‘escape’ format on servers below PHP 5.4.0 but that doesn’t apply in your case since you’re not using it. I’ve fixed it on my dev server but haven’t pushed it live yet. Will do that later.

Out of curiosity, what PHP/MySQL are you using?

Possibly related: the problem with nardo’s code is down to the content of the feed itself, because I tried it with a different feed from the same server and it worked fine. I have not yet tracked down the issue: each record on its own is fine, and I can serve a clone of the feed from my own server, or even put it directly in the data attribute, and it all works. But try to get that exact same feed from the real server and it just does nothing. Weird beyond belief.

Last edited by Bloke (2012-02-10 09:27:15)


Offline

#150 2012-02-10 05:00:45

aslsw66
Member
From: Canberra, Australia
Registered: 2004-08-04
Posts: 342
Website

Re: smd_xml : extract data from XML feeds

Thanks for checking this out – I knew it wouldn’t be a problem with the plugin, but this confirms that something weird is happening with my hosting. I’m about ready to move.

By the way, this is using PHP 5.2.4 and MySQL 5.0.51a.

Offline

#151 2012-02-16 03:19:14

maniqui
Member
From: Buenos Aires, Argentina
Registered: 2004-10-10
Posts: 3,070
Website

Re: smd_xml : extract data from XML feeds

Quick (incomplete) report: set_empty="1" does not seem to be working for me (on v0.3; I haven’t tested v0.4beta yet). The {replacement} is still being printed. Not sure how to troubleshoot.

Edit: nardo reported a similar (if not the same) issue.

Last edited by maniqui (2012-02-16 03:22:42)


La música ideas portará y siempre continuará

TXP Builders – finely-crafted code, design and txp

Offline

#152 2012-02-17 16:09:09

jelle
Member
Registered: 2006-06-07
Posts: 165

Re: smd_xml : extract data from XML feeds

Hi Stef,

I’m using smd_xml to load some RSS streams. One of the sites it fetches went offline a few moments ago. smd_xml kept trying to load the (non-existent) data, crashing the browser and eventually locking up my session :)

PHP output as follows:
Tag error <txp:smd_xml data="http://www.zwolle.nl/RSS-feed-nieuwsoverzicht.htm" format="pubDate|date|%d-%m" record="item" fields="title, link, pubDate" form="feed" limit="6" /> -> Warning: fsockopen() [function.fsockopen]: php_network_getaddresses: getaddrinfo failed: Name or service not known on line 125

Tag error <txp:smd_xml data="http://www.zwolle.nl/RSS-feed-nieuwsoverzicht.htm" format="pubDate|date|%d-%m" record="item" fields="title, link, pubDate" form="feed" limit="6" /> -> Warning: fsockopen() [function.fsockopen]: unable to connect to www.zwolle.nl:80 (php_network_getaddresses: getaddrinfo failed: Name or service not known) on line 125

Tag error <txp:smd_xml data="http://www.zwolle.nl/RSS-feed-nieuwsoverzicht.htm" format="pubDate|date|%d-%m" record="item" fields="title, link, pubDate" form="feed" limit="6" /> -> Warning: fputs() expects parameter 1 to be resource, boolean given on line 132

Tag error <txp:smd_xml data="http://www.zwolle.nl/RSS-feed-nieuwsoverzicht.htm" format="pubDate|date|%d-%m" record="item" fields="title, link, pubDate" form="feed" limit="6" /> -> Warning: stream_set_timeout() expects parameter 1 to be resource, boolean given on line 133

Tag error <txp:smd_xml data="http://www.zwolle.nl/RSS-feed-nieuwsoverzicht.htm" format="pubDate|date|%d-%m" record="item" fields="title, link, pubDate" form="feed" limit="6" /> -> Warning: stream_get_meta_data() expects parameter 1 to be resource, boolean given on line 134

Tag error <txp:smd_xml data="http://www.zwolle.nl/RSS-feed-nieuwsoverzicht.htm" format="pubDate|date|%d-%m" record="item" fields="title, link, pubDate" form="feed" limit="6" /> -> Warning: feof() expects parameter 1 to be resource, boolean given on line 137

Tag error <txp:smd_xml data="http://www.zwolle.nl/RSS-feed-nieuwsoverzicht.htm" format="pubDate|date|%d-%m" record="item" fields="title, link, pubDate" form="feed" limit="6" /> -> Warning: fgets() expects parameter 1 to be resource, boolean given on line 138

Those last two lines went on and on and on…
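For anyone hitting the same thing: the warnings above suggest the read loop carried on with a handle that was false. A minimal sketch of a guarded read loop follows. The function name demo_read_all and the line cap are made up for illustration and are not the plugin's code; an in-memory stream stands in for the socket.

```php
<?php
// Hypothetical helper, not the plugin's code: reads all lines from a stream,
// but refuses to loop when the handle is not a valid resource.
function demo_read_all($fp, $maxLines = 10000)
{
    // fsockopen() returns false on failure. In the log above, feof() only
    // warned about the false handle, so a bare while (!feof($fp)) kept going.
    if (!is_resource($fp)) {
        return null;
    }

    $out = '';
    $lines = 0;
    // The line cap is an extra safety net against a stream that never reports EOF.
    while (!feof($fp) && $lines < $maxLines) {
        $line = fgets($fp);
        if ($line === false) {
            break; // read error, or EOF reached during the read
        }
        $out .= $line;
        $lines++;
    }

    return $out;
}

// Usage with an in-memory stream standing in for a socket.
$fp = fopen('php://memory', 'r+');
fwrite($fp, "line one\nline two\n");
rewind($fp);
$body = demo_read_all($fp);
fclose($fp);
```

A caller would treat a null return as "feed unavailable" and stop processing that tag.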

Offline

#153 2012-02-17 17:20:44

Gocom
Developer Emeritus
From: Helsinki, Finland
Registered: 2006-07-14
Posts: 4,533
Website

Re: smd_xml : extract data from XML feeds

Hi Stef,

Some suggestions and findings, “inspired” by jelle’s post above. I’m quoting source code from v0.30 release.

gps('force') […] $cache_time

The force_read GET option should really be optional. As it stands, it gives a nasty troll a great opportunity to get the server banned from the services it fetches.

On that note, a cache time should be set by default: 15 minutes, an hour, or something similar. Constantly fetching the feeds is a great way to cause issues for third-party services and get the user’s server banned, which is something end users may not know or think about.
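To make the default-cache idea concrete, here is a rough sketch of a time-based freshness check. The constant and function names are hypothetical, the one-hour default is just one of the values floated above, and none of this is taken from the plugin source.

```php
<?php
// Hypothetical names throughout; a sketch of a time-based cache check.
const DEMO_DEFAULT_CACHE_TIME = 3600; // one hour, in seconds

function demo_cache_is_fresh($storedAt, $now, $cacheTime = DEMO_DEFAULT_CACHE_TIME)
{
    // A non-positive cache time means "always refetch".
    if ($cacheTime <= 0) {
        return false;
    }
    // Fresh while the stored copy is younger than the cache time.
    return ($now - $storedAt) < $cacheTime;
}
```

With a default like this, a tag that omits cache_time would still only hit the remote server once per hour.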

// Cached document is gzipped and then (yuk!) base64’d if zlib is compiled in.
// Would prefer to store binary data directly but trying to insert it into a txp_prefs
// text field always gave problems on insertion and/or retrieval

I’m not sure what you mean unless you are trying to use statements that are not escaped. You are not, right? There shouldn’t be any other issue than statement’s size. What you are getting is just XML, which is just text.

If you want to store the actual presentation of the data, use serialize instead of that wonky base64 encoding and compression thing.
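For reference, the compress-then-encode approach that the quoted comment describes can be sketched roughly like this. demo_pack and demo_unpack are invented names, and the "gz:" / "raw:" prefixes are an assumption added here so the reader knows which path was taken; the plugin may well do it differently.

```php
<?php
// Hypothetical helpers: pack a document so it is safe for a TEXT column.
function demo_pack($doc)
{
    // Compress only when zlib is available, and record which path was taken.
    if (function_exists('gzcompress')) {
        return 'gz:' . base64_encode(gzcompress($doc));
    }
    return 'raw:' . base64_encode($doc);
}

function demo_unpack($packed)
{
    // The prefix written by demo_pack() says whether to decompress.
    if (strpos($packed, 'gz:') === 0) {
        return gzuncompress(base64_decode(substr($packed, 3)));
    }
    return base64_decode(substr($packed, 4));
}
```

The base64 step costs roughly a third more space than the raw compressed bytes, which is the trade-off being argued about here.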

case 'curl': […] case 'fsock':

I would advise ditching the others and concentrating on one of them, preferably cURL. You require PHP 5, so there is no need for fallbacks. cURL is there. There are a couple of things you may want to check to prevent Jelle’s issues: HTTP status and received content.

If the HTTP status code is anything other than 200, or the content is empty or doesn’t fulfill XML’s basic requirements (a container tag, for instance), end processing of the feed. E.g.

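$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $data);
curl_setopt($ch, CURLOPT_FAILONERROR, 1);     // treat HTTP status >= 400 as a failure
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);  // return the body instead of printing it
curl_setopt($ch, CURLOPT_TIMEOUT, 10);        // give up after 10 seconds in total
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10); // or after 10 seconds trying to connect
$src = curl_exec($ch);
$http = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

// Bail out on an empty body, a body with no tag at all, or a non-200 status.
if(!$src || strpos($src, '<') === false || $http != 200) {
	trigger_error('Cannot fetch feed.');
	return;
}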

Of course, you may want to serve the old data if it exists rather than just stopping there, but you get the point.
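The "serve the old data" variant could look something like the sketch below. demo_choose_feed is a made-up name, and $stale stands for whatever older copy the cache still holds; it uses the same loose "contains a <" test as the snippet above rather than real XML validation.

```php
<?php
// Hypothetical helper: decide what to use after a fetch attempt.
// $body and $httpStatus would come from curl_exec() / curl_getinfo();
// $stale is whatever older copy the cache still holds (or null).
function demo_choose_feed($body, $httpStatus, $stale = null)
{
    $looksLikeXml = is_string($body)
        && $body !== ''
        && strpos($body, '<') !== false;

    if ($httpStatus == 200 && $looksLikeXml) {
        return $body;
    }
    // Fetch failed: fall back to the stale copy if there is one.
    return $stale;
}
```

A null result means there is nothing usable at all, so the tag should output nothing rather than carry on parsing.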

constructor […] smd_xml_parser()

__construct(). The plugin requires PHP 5, so I would use its features rather than PHP 4 backwards compatibility.

preg_replace("/>"."[[:space:]]+"."</i"

Case-insensitive space? Space? I’m in spaceeeeeeeeeee.
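For what it's worth, the same whitespace collapse works without the /i flag, since the pattern contains no letters for case to matter on. A small sketch, using \s in place of [[:space:]] and a made-up sample string:

```php
<?php
// Collapse whitespace between adjacent tags; no /i flag is needed because
// the pattern contains no letters.
$xml = "<item>\n    <title>Hi there</title>\n</item>";
$compact = preg_replace('/>\s+</', '><', $xml);
```

Whitespace inside text content (such as the space in "Hi there") is left alone, because it is not directly between a ">" and a "<".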

$target_enc

Not sure about its purpose. Textpattern pages are always UTF-8. Space. I love space.

{

And lastly there is my favorite thing, which you’re going to hate me for. Curly tags. All fetched feeds have full execution rights on the server. You are aware of that, right? So let’s say Last.fm went down or got compromised. The attacker could collect the passwords of all Textpattern installations that use smd_xml by replacing feeds with <txp:php> /* bad code here */ </txp:php>. That’s not exactly a good thing.

In general, smd_xml would benefit from having a way to filter the contents: setting which tags are allowed (XSS injections are not fun) and stripping out harmful bytes.
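As a rough idea of what such a filter might look like, here is a sketch built on PHP's strip_tags() with an allowlist. demo_filter_value and the default allowlist are invented for illustration; this is not a complete defence (strip_tags() does not touch attributes on the tags it lets through) and it is not something smd_xml is known to ship.

```php
<?php
// Hypothetical filter, not part of smd_xml: keep a small allowlist of inline
// tags and drop everything else before the value reaches a template.
function demo_filter_value($value, $allowed = '<em><strong><br>')
{
    // Remove NUL bytes, which have no place in text output.
    $value = str_replace("\0", '', $value);
    // strip_tags() removes any tag not named in $allowed, including <txp:...>.
    // Note: attributes on allowed tags are NOT filtered by strip_tags().
    return strip_tags($value, $allowed);
}

$dirty = 'Quote<txp:php>evil();</txp:php> by <em>someone</em>';
$clean = demo_filter_value($dirty);
```

The text between the stripped tags survives as plain text, but without the surrounding <txp:php> tag there is nothing left for the Textpattern parser to execute.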

Last edited by Gocom (2012-02-17 17:26:13)

Offline

#154 2012-02-18 00:02:45

Bloke
Developer
From: Leeds, UK
Registered: 2006-01-29
Posts: 11,250
Website GitHub

Re: smd_xml : extract data from XML feeds

Gocom

Thanks for the rundown. v0.40 is a bit hardier than v0.30 but the main reason I haven’t officially released it is because I’ve been holding back while I investigate ways to make the curly tags thing better.

Anyway, just briefly to answer your excellent points. In v0.40:

  • force_read: removed
  • cache_time: default = 1 hour
  • gzip/base64: the comment is misleading but basically I was trying to save space in the prefs. Some XML documents are colossal so gzip seemed the logical way to compress it, but you can’t store such binary data in a MySQL TEXT field without problems, so I base64_encode() it for safety. I seem to remember trying serialize(), btw, but that only worked if the input was pure text and didn’t have any spurious (blob or malicious?) embedded data. For example, if the URL resolved to an error scenario and squirted HTML back in a foreign language, or someone supplied, I dunno, a URL to a jpg, the possibly binary data presented to the plugin broke things. And since the caching occurs before any checks on the data itself take place and before the XML parser has a chance to reject it for malformedness, I left it that way. I’ll see if there’s a better method.
  • cURL vs fsock: I didn’t realise it was a pre-installed component of PHP5. I thought people could still opt to install it (at least that’s how I interpret the docs). I might be wrong. In any case, I’ve swapped the test around in the code so if you don’t supply a transport, cURL is checked and used first if compiled in.
  • check http status: I’d put a lot more defensive coding into v0.40 to trap any malformed or non-existent feeds (which should have prevented the issue Jelle encountered). But I’ll add the explicit HTTP status check and perform some simple tests for XMLness as you suggest. Can’t hurt to be more cautious.
  • __construct(): good call. Done.
  • case-insensitive space: errr, yeah. I think I copied that from a PHP.net comment somewhere and didn’t spot the /i on the end.
  • $target_enc: just a feature of PHP’s XML parser (XML_OPTION_TARGET_ENCODING) so you can (for whatever reason) translate the incoming feed into something other than UTF-8 (or convert it to UTF-8 if it’s in some other strange encoding). I’ve never used it, but since it was a stock option of the constructor I employed it.
  • curly quote syntax. Yeah, fix is in the works. I’ve been gradually going through all my plugins and sorting things out to either:
    • remove it (if it adds little value)
    • replace it with a <txp:smd_proper_tag>
    • only use {replacements} for values that are generated by the plugin (e.g. counters), not stuff that comes from potentially tainted user input
    • internally make {replacement} use <txp:smd_blah_info name="replacement"> if the curly tag syntax is much cleaner than a tag, or would cause real b/c problems
    • make <txp:smd_blah_info> tags more compelling by expanding their capabilities to allow lists of names with wraptag/break/class. That means {replacements} are often less elegant than a tag, which makes people rely on them less
    • rewrite the code so they are no longer necessary
    • stop using strtr() because it’s dog slow, and use str_replace() instead
    • employ filtering options to sanitize data
    • some or all of the above in tandem

The curly tag thing is just taking a little longer to do than I’d hoped. I’ve tentatively done smd_if and smd_query (and a few of my as-yet unreleased plugins), with a few other oldies in various states of completion (smd_bio, smd_xml, smd_gallery, etc). I’ll get there eventually.
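A minimal sketch of two of the ideas in that list combined (str_replace() for the substitution, plus sanitizing values before they are dropped into a template). demo_fill is a hypothetical name and this is not how any of the smd plugins are confirmed to work. One wrinkle worth noting: str_replace() with arrays works through the pairs in order, so a value that itself contains a later {placeholder} would get substituted again unless the braces are neutralized first.

```php
<?php
// Hypothetical helper: fill {name} placeholders, escaping each value first.
function demo_fill($template, array $values)
{
    $search = array();
    $replace = array();
    foreach ($values as $name => $value) {
        // Escape markup so feed content cannot introduce tags.
        $safe = htmlspecialchars((string) $value, ENT_QUOTES, 'UTF-8');
        // Neutralize braces so a value cannot smuggle in another placeholder.
        $safe = str_replace(array('{', '}'), array('&#123;', '&#125;'), $safe);
        $search[] = '{' . $name . '}';
        $replace[] = $safe;
    }
    return str_replace($search, $replace, $template);
}

$html = demo_fill('<p>{title}</p>', array('title' => '<txp:php>x</txp:php>'));
$chained = demo_fill('{title} {link}', array('title' => '{link}', 'link' => 'x'));
```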

Last edited by Bloke (2012-02-18 00:04:13)


Offline

#155 2012-02-18 00:53:59

Gocom
Developer Emeritus
From: Helsinki, Finland
Registered: 2006-07-14
Posts: 4,533
Website

Re: smd_xml : extract data from XML feeds

Bloke wrote:

cURL vs fsock: I didn’t realise it was a pre-installed component of PHP5

Heh, it’s not. But it really, really should be installed on any server, and for the most part, it is. If it isn’t, then that server or hosting provider isn’t worth using. cURL is the thing that should be used for these types of tasks.

I copied that from a PHP.net comment

Not the best place to get your resources, I would say. The actual docs are fine, but the comments vary greatly in quality. I wouldn’t take anything from the comments section if in doubt.

PHP’s XML parser

Speaking of parsers, don’t you think it might be a good idea to use SimpleXML instead of that ancient… thingy? It has been enabled by default since PHP 5.1.2 and is pretty fly. It would save you that whole smd_xml_parser class. All you would need is something like:

try {
	@$r = new SimpleXMLElement($xml, LIBXML_NOCDATA);
}
catch(Exception $e){
	trigger_error('Invalid XML document.');
	return;
}

You get a neatly organized resource. Returning values is a breeze. Or what about unlimited nodes? Unlimited power? No level cap, you bet!

/**
 * <txp:smd_xml_data name="node->that->is->in-deep->mess->or->just->title" />
 */
function smd_xml_data($atts) {

	global $smd_xml;

	extract(lAtts(array(
		'name' => NULL,
		'escape' => 1,
	), $atts));

	$r = $smd_xml;

	foreach(explode('->', $name) as $n) {	
		if(!isset($r->{$n})) {
			return;
		}
		$r = $r->{$n};
	}

	$r = (string) $r;	
	return $escape ? htmlspecialchars($r) : $r;
}
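To try the path walk outside Textpattern, here is a standalone sketch of the same idea. demo_xml_value is a hypothetical name, the sample feed is made up, and libxml_use_internal_errors() is used so that a broken document yields false quietly instead of a pile of warnings.

```php
<?php
// Hypothetical standalone version of the path walk: no Textpattern globals.
function demo_xml_value(SimpleXMLElement $root, $path)
{
    $node = $root;
    foreach (explode('->', $path) as $name) {
        if (!isset($node->{$name})) {
            return null; // path does not exist in this document
        }
        $node = $node->{$name};
    }
    return (string) $node;
}

libxml_use_internal_errors(true); // collect parse errors instead of warnings
$feed = simplexml_load_string(
    '<rss><channel><title>News &amp; more</title></channel></rss>',
    'SimpleXMLElement',
    LIBXML_NOCDATA
);
$title = $feed === false ? null : demo_xml_value($feed, 'channel->title');
$missing = $feed === false ? null : demo_xml_value($feed, 'channel->nope');
```

The value comes back unescaped here, so it would still need htmlspecialchars() (as in the tag above) before being printed into a page.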

Last edited by Gocom (2012-02-18 00:55:27)

Offline

#156 2012-03-30 16:22:16

MattD
Plugin Author
From: Monterey, California
Registered: 2008-03-21
Posts: 1,254
Website

Re: smd_xml : extract data from XML feeds

I need to convert a Twitter feed’s created_at date to YYYY,MM,DD format. I’m using format="created_at|date|%Y,%m,%d" but I just get the year. The raw value in the Twitter feed is Tue Mar 27 19:06:47 +0000 2012.

What am I doing wrong?

My guess is it’s the commas.

edit: for now I’m using rah_replace to get around this.
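For comparison, outside the plugin the same conversion in plain PHP might look like the sketch below. It says nothing about how smd_xml's format attribute handles commas; it only shows that the raw value itself parses cleanly. DateTime::createFromFormat() needs PHP 5.3 or later.

```php
<?php
// Parse Twitter's created_at layout explicitly, then emit YYYY,MM,DD.
$raw = 'Tue Mar 27 19:06:47 +0000 2012';
$dt = DateTime::createFromFormat('D M d H:i:s O Y', $raw);
// createFromFormat() returns false when the string does not match the layout.
$formatted = $dt === false ? null : $dt->format('Y,m,d');
```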

Last edited by MattD (2012-03-30 16:28:39)


My Plugins

Piwik Dashboard, Google Analytics Dashboard, Minibar, Article Image Colorpicker, Admin Datepicker, Admin Google Map, Admin Colorpicker

Offline
