smd_xml : extract data from XML feeds

Bloke · 2014-10-02 21:01:02

johnno wrote #284458:

Advice on how to hard-code this would help enormously!

Until I can get round to adding the option to the plugin, you can hack it in yourself. Edit the plugin code and find function smd_xml_curl. In there you’ll see all the curl_setopt calls. Add this somewhere among that list:

curl_setopt($c, CURLOPT_USERAGENT, "bbpress-verify");

Job done.
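
For anyone following along outside PHP, the same idea (send an explicit User-Agent with the feed request) can be sketched in Python. This is only an illustration, not plugin code: build_feed_request and the example.com URL are made up, and the agent string is the one from the hack above.

```python
import urllib.request

def build_feed_request(url, user_agent="bbpress-verify"):
    """Return a request object that carries an explicit User-Agent header."""
    # Hypothetical helper: it mirrors the effect of CURLOPT_USERAGENT in the PHP hack.
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

# example.com is a placeholder; nothing is fetched until urlopen() is called.
req = build_feed_request("http://example.com/feed.xml")
print(req.get_header("User-agent"))  # urllib stores the header name as "User-agent"
```

Passing req to urllib.request.urlopen() would then send the request with that header attached.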

johnno · 2014-10-03 16:52:12

Job—as you rightly observe—done!
Thanks Stef.

jayrope · 2015-05-05 09:50:12

Hi Stef, would it be possible to add a user agent string to smd_xml accessing a feed? Some sites block access to their feeds without a user agent entry.
Thanx much in advance.

Bloke · 2015-05-05 10:32:57

jayrope wrote #290465:

would it be possible to add a user agent string to smd_xml accessing a feed?

If you want to give the next version a go then that has this ability and could do with some beta testers. It’s not yet packaged into an installable plugin, but you can copy and paste just the code portion over your old version (or use ied_plugin_composer to install the full thing).

Configure it using the transport_config attribute, separating entries with delim (comma) and putting param_delim (pipe) between each key and its data, e.g.:

transport_config="useragent|Mozilla/5.0 (Android; Tablet; rv:26.0) Gecko/26.0 Firefox/26.0"

It defaults to Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:25.0) Gecko/20100101 Firefox/25.0.

Note that the config keys are not the same as the actual strings the browser sends (useragent vs User-Agent). This is so they can be whitelisted for now. Support is in there for a bunch of other keys to do with SSL / certificate verification for sites that support it, carriage-returns, ports, binary feeds and so on. The full list is in the plugin help.
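
To make the delim / param_delim / whitelist idea concrete, here is a rough Python sketch of how such a string could be split into options. It is an assumption about the general approach, not the plugin's actual code: parse_transport_config is a made-up name and ALLOWED_KEYS is a placeholder whitelist (the real key list is in the plugin help).

```python
# Hypothetical whitelist for illustration; the real list is in the plugin help.
ALLOWED_KEYS = {"useragent"}

def parse_transport_config(config, delim=",", param_delim="|"):
    """Split 'key|value,key|value' into a dict, dropping non-whitelisted keys."""
    options = {}
    for entry in config.split(delim):
        if param_delim not in entry:
            continue  # skip empty or malformed entries
        # Split only on the first param_delim so the value may contain pipes.
        key, value = entry.split(param_delim, 1)
        key = key.strip().lower()
        if key in ALLOWED_KEYS:
            options[key] = value.strip()
    return options

config = "useragent|Mozilla/5.0 (Android; Tablet; rv:26.0) Gecko/26.0 Firefox/26.0"
print(parse_transport_config(config))
```

One consequence of a scheme like this is that a value containing the entry delimiter (a comma) would be cut short, so in this sketch the delimiters would need changing for such values.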

Please let me know how successful (or otherwise) it is.

jayrope · 2015-05-05 10:47:37

Thanx Stef, will definitely beta test this now. Awesome! You rock!

photonomad · 2015-10-07 15:31:57

Hi Stef, I recently had a page crash with the following error and traced it back to a feed I was reading with smd_xml:

Fatal error: Maximum execution time of 30 seconds exceeded in /public_html/sitedomain/textpattern/lib/txplib_misc.php(812) : eval()’d code on line 137

The source was an old Yahoo Pipe feed that I’d forgotten about. Pipes is now gone (as of 9/30/15). I just tested smd_xml with an intentionally bad source and get the same error.

Just curious if there is any way to detect a bad source and let smd_xml fail without breaking the page with a fatal error?

Bloke · 2015-10-07 15:54:03

photonomad wrote #295468:

Just curious if there is any way to detect a bad source and let smd_xml fail without breaking the page with a fatal error?

Hmmm, in theory the timeout attribute (which defaults to 10 seconds) ought to bomb out before PHP does. Maybe the Pipes feed is still “there” (active from an HTTP standpoint) but not actually returning data? Not sure if there’s anything I can do, but if you could throw me a URL that triggers the error I’ll see if I can trap it.
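
As a general illustration of "give up on a bad source before the page dies", here is a minimal Python sketch. It says nothing about how smd_xml does it internally: fetch_feed and its opener parameter are hypothetical, and the 10-second default simply echoes the timeout mentioned above.

```python
import socket
import urllib.error
import urllib.request

def fetch_feed(url, timeout=10, opener=urllib.request.urlopen):
    """Return the feed body as bytes, or None if the source is unreachable or slow."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.read()
    except (urllib.error.URLError, socket.timeout, ValueError):
        # Dead host, timeout or malformed URL: report failure instead of crashing.
        return None
```

The caller checks for None and renders a fallback instead of the feed. Note that a timeout like this applies to each blocking network operation rather than to the request as a whole, so a server that stays connected but trickles data can still take longer than the nominal limit.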

photonomad · 2015-10-07 16:08:30

Thanks for your quick reply. I was just about to send you the links/info and then checked the version of smd_xml – was using 0.3… oops!

Just upgraded to 0.4 and the feed fails gracefully! So sorry to have troubled you!

jakob · 2016-03-23 23:22:54

Can I use smd_xml to only retrieve certain records where the content of a particular field within the record matches a certain value? Is that what the match attribute is for?

Or does this need looping through all the records and doing some kind of retrieve to a variable + if_variable has no content then skip test? Or an smd_if equivalent?

The use case is a WordPress WXR file (extended RSS in XML format) that contains a large number of entries of a custom post type. Unfortunately I don’t have access to the SQL file, and importing the WXR back into a fresh WordPress installation only brings in the standard page, post and attachment types. I wanted to loop through the XML file (9 MB!) and process only those <item> records where wp:post_type equals a certain type. If the type matches, I’d then extract the remaining relevant items from that record.

Bloke · 2016-03-24 09:27:50

jakob wrote #298401:

Can I use smd_xml to only retrieve certain records where the content of a particular field within the record matches a certain value? Is that what the match attribute is for?

In theory, yes, that’s what it’s for. Is it not playing ball? Is the feed actually XML? What does it say if you add debug="3" to the tag? Does it indicate that it’s read it?

If you want to drop me a link to the WXR doc in question I could have a play with it and see if smd_xml can process it.

phuture303 · 2016-07-23 09:21:27

jakob wrote #298401:

Can I use smd_xml to only retrieve certain records where the content of a particular field within the record matches a certain value? Is that what the match attribute is for?

Hey, I’ve the same question, but have to step back a little: I don’t have a clue how to use the match attribute.

My XML-Data has entries like:

<spielzeiten>
	<vorstellung>
		<film_id>2</film_id>
		<zeit>1607211445</zeit>
		<vorst_nr>478</vorst_nr>
		<saal_nr>2</saal_nr>
	</vorstellung>
	<vorstellung>
		<film_id>7</film_id>
		<zeit>1607211445</zeit>
		<vorst_nr>558</vorst_nr>
		<saal_nr>1</saal_nr>
	</vorstellung>
	<vorstellung>
		<film_id>8</film_id>
		<zeit>1607211445</zeit>
		<vorst_nr>471</vorst_nr>
		<saal_nr>3</saal_nr>
	</vorstellung>
	<vorstellung>
		<film_id>2</film_id>
		<zeit>1607211445</zeit>
		<vorst_nr>457</vorst_nr>
		<saal_nr>4</saal_nr>
	</vorstellung>
</spielzeiten>

My output should only be the vorst_nr nodes when film_id is “2”.

Something like:

<txp:smd_xml data="data.xml" record="vorstellung" fields="film_id,vorst_nr" wraptag="ul" limit="9">
<li>{film_id}: {vorst_nr}</li>
</txp:smd_xml>

Output:

<ul>
<li>2: 478</li>
<li>2: 457</li>
</ul>

… ignoring the film_ids 7 and 8.

What is the right syntax for using match?

Hopefully this question is not too stupid… :-)

Thanks!
David

Last edited by phuture303 (2016-07-23 09:22:05)
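
To pin down the filtering being asked for, here is the same logic sketched in Python against the sample data from the post. It is not smd_xml syntax and makes no claim about how the match attribute is written; showings_for_film is a made-up helper, and it only shows which records should survive the filter.

```python
import xml.etree.ElementTree as ET

# Sample data copied from the post above.
SAMPLE = """<spielzeiten>
  <vorstellung><film_id>2</film_id><zeit>1607211445</zeit><vorst_nr>478</vorst_nr><saal_nr>2</saal_nr></vorstellung>
  <vorstellung><film_id>7</film_id><zeit>1607211445</zeit><vorst_nr>558</vorst_nr><saal_nr>1</saal_nr></vorstellung>
  <vorstellung><film_id>8</film_id><zeit>1607211445</zeit><vorst_nr>471</vorst_nr><saal_nr>3</saal_nr></vorstellung>
  <vorstellung><film_id>2</film_id><zeit>1607211445</zeit><vorst_nr>457</vorst_nr><saal_nr>4</saal_nr></vorstellung>
</spielzeiten>"""

def showings_for_film(xml_text, film_id):
    """Return the vorst_nr of every vorstellung record whose film_id matches."""
    root = ET.fromstring(xml_text)
    return [
        record.findtext("vorst_nr")
        for record in root.iter("vorstellung")
        if record.findtext("film_id") == film_id
    ]

print(showings_for_film(SAMPLE, "2"))  # ['478', '457']
```

The comparison is an exact string match on the field's text, so "2" does not also pick up film_ids such as 12 or 20, which is worth bearing in mind if the filter is expressed as a pattern instead.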

mario.paolucci · 2016-11-03 10:57:46

OK, I’ve been trying several ways to use the plugin with absolutely no luck. I have installed and activated it, but nothing shows up in either case (a) or case (b) below. No errors are shown when I go into debug mode. What am I doing wrong? Thanks for any pointers!

a) put the following code in a form:

<txp:smd_xml
     data="http://feeds.feedburner.com/welovetxp"
     record="item" fields="title,description, link, pubDate"
     wraptag="ul" limit="3" 
     cache_time="86400"
     debug="3" >
   <li>
      <a href="{link}">
         {title}
      </a><span class="published">{pubDate}</span>
      <br />{description}
   </li>
</txp:smd_xml>

and then an article with:

<txp:output_form form="test-2" />

b) I’ve put the following code in an article:

<txp:smd_xml
     data="http://feeds.feedburner.com/welovetxp"
     record="item" fields="title,description, link, pubDate"
     wraptag="ul" limit="3" pageform="single"
     cache_time="86400"
     debug="3" >
   <li>
      <a href="{link}">
         {title}
      </a><span class="published">{pubDate}</span>
      <br />{description}
   </li>
</txp:smd_xml>

This is my configuration:

Textpattern version: 4.6.0 (86d82f868a753eb919f2250d82f4dcae)
Last update: 2016-10-14 16:13:50/2016-10-14 16:22:20
Article URL pattern: section_title
upload_tmp_dir: /tmp
Temporary directory path: /tmp
Site URL: labss.istc.cnr.it
PHP version: 5.3.3
GD Graphics Library: bundled (2.0.34 compatible); supported formats: GIF, JPG, PNG.
Server TZ: Europe/Rome
Server local time: 2016-11-03 11:51:19
Daylight Saving Time enabled?: 0
Automatically adjust Daylight Saving Time setting?: 1
Time zone: Europe/Rome (3600)
MySQL: 5.1.73
Database server time: 2016-11-03 11:51:19
Database server time offset: 0 s
Database server timezone: SYSTEM
Database session timezone: SYSTEM
Locale: C
Server: Apache
Apache version: Apache
PHP server API: apache2handler
RFC 2616 headers: 
Server OS: Linux 2.6.32-573.22.1.el6.x86_64
Active plugins: smd_xml-0.40m, adi_menu-1.3.1, kuo_ace-0.4, wet_quicklink-0.8.2, wet_peex-1.0
Admin-side theme: hive 4.6.0

Bloke · 2016-11-03 11:36:17

mario.paolucci wrote #302607:

No errors are shown when I go in debug mode.

What’s in your pageform? If it’s empty or somehow mangled I think the plugin hangs. A silly bug in the code I expect. Without that attribute, this works for me:

<txp:smd_xml
     data="http://feeds.feedburner.com/welovetxp"
     record="item" fields="title, link, pubDate, content:encoded"
     wraptag="ul" limit="3"
     cache_time="86400"
     debug="1">
   <li>
      <a href="{link}">
         {title}
      </a><span class="published">{pubDate}</span>
      <br />{content:encoded}
   </li>
</txp:smd_xml>

Presumably you have allow_url_fopen enabled? Otherwise your site won’t be able to fetch directly from URLs.

mario.paolucci · 2016-11-03 15:25:30

Thank you very much for the prompt answer, Bloke!

About the form, I have tried to remove it, or to put the default form, still no luck.

I have checked my PHP configuration: allow_url_fopen is On, although allow_url_include is Off.

Core
PHP Version 	5.3.3
Directive	Local Value	Master Value
allow_call_time_pass_reference	Off	Off
allow_url_fopen	On	On
allow_url_include	Off	Off
always_populate_raw_post_data	Off	Off
arg_separator.input	&	&

It’s really confusing.. Maybe I have some strange configuration I’m not aware of. Just to show that the plugin is executing, I noticed that I had the problem reported in http://forum.textpattern.com/viewtopic.php?id=46538 (where it says it’s only a problem in debug mode, not in live mode), and I made it go away with

Txp::get('\Textpattern\Tag\Registry')
   ->register('smd_xml');

So the plugin is somehow executed, only it returns nothing. Can I feed it some local data? What should I test next?

Thanks again for your attention…

Bloke · 2016-11-03 15:52:01

mario.paolucci wrote #302615:

About the form, I have tried to remove it, or to put the default form, still no luck.

Well don’t use the default form or it’ll get mighty confused! pageform is solely for pagination links (next/prev, etc) so if it encounters any other article-related stuff it’ll likely blow up or run out of memory or something. When I had that attribute specified, my results were the same as yours: the plugin did “nothing” after a loooong period of trying hard and returned a blank screen with a few warnings on it.

Can I feed it some local data?

Absolutely. You could load the feed from your destination URL into your browser, do View Source and copy the lot (or a valid XML subset) directly into your smd_xml data attribute.
