smd_xml : extract data from XML feeds

tye · 2011-10-07 03:52:23

OK – stef I’ve been testing extracting hierarchical nodes

I’ve thrown all kinds of feeds at this: Facebook, YouTube, Vimeo, XML-formatted files, RSS blog feeds – and it worked perfectly with all of them – except one (from txp’s arch-enemy)

http://en.blog.wordpress.com/feed/

This one has an enclosed media:content field – I can get the media:title fine… but I can’t get the url from the media:content

I was using

{media:content|url} & {media:content}

<media:content url="http://1.gravatar.com/avatar/767fc9c115a1b989744c755db47feb60?s=96&d=retro" medium="image">
	<media:title type="html">Matt Mullenweg</media:title>
</media:content>

Bloke · 2011-10-07 10:37:31

tye wrote:

it worked perfectly with all of them – except one

Woah! Thanks for the report. Just fixed a howler of a bug that was in the plugin since, like, forever.

New beta uploaded to the same place so please re-download it and let me know how it performs on this feed, and please verify it’s not broken anything with the other ones you’ve tested.

Sorry about that. ^{hangs head in shame}

Last edited by Bloke (2011-10-07 10:37:51)

tye · 2011-10-08 00:54:02

Bloke wrote:

Sorry about that. ^{hangs head in shame}

What for? You are a txp legend :)

Quick weekend test confirms that now works perfectly – thanks

I’ll give it proper test again next week, have a great weekend :)

Mats · 2011-10-08 07:32:23

Thanks for a great plugin!

Is there a way to extract attribute values from the record? I’m trying to use time both as the record and for its values:

http://www.yr.no/place/Sweden/Stockholm/Stockholm/forecast_hour_by_hour.xml

(If anyone needs free weather data, Yr has a lot of places: http://fil.nrk.no/yr/viktigestader/verda.txt)

Bloke · 2011-10-08 13:27:46

Mats wrote:

Is there a way to extract attribute values from the record?

There is now, thanks for the nudge :-) Re-download the beta to see it in action.

<txp:smd_xml
     data="http://www.yr.no/place/Sweden/Stockholm/Stockholm/forecast_hour_by_hour.xml"
     record="time"
     fields="temperature">
Weather between {time|from} and {time|to}:
{temperature|value} {temperature|unit}
</txp:smd_xml>

I’ve relaxed the rules a bit now too so the fields attribute is no longer mandatory, as it’s valid to just pull out a record and use its attributes without any further fields being wanted.
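As a rough cross-check outside Textpattern, the same attribute lookups can be sketched in Python. This is only an illustration of what the {time|from} and {temperature|value} replacements refer to, not how smd_xml works internally. The sample XML is a made-up fragment shaped like the yr.no feed (the values are invented), and the function name hourly_lines is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like the yr.no hourly forecast; values are invented.
SAMPLE = """
<weatherdata>
  <forecast>
    <tabular>
      <time from="2011-10-08T15:00:00" to="2011-10-08T16:00:00">
        <temperature unit="celsius" value="9" />
      </time>
      <time from="2011-10-08T16:00:00" to="2011-10-08T17:00:00">
        <temperature unit="celsius" value="8" />
      </time>
    </tabular>
  </forecast>
</weatherdata>
"""


def hourly_lines(xml_text):
    """Return one line per <time> record, built from attributes only."""
    root = ET.fromstring(xml_text)
    lines = []
    for time in root.iter("time"):        # plays the role of record="time"
        temp = time.find("temperature")   # plays the role of fields="temperature"
        lines.append(
            "Weather between {} and {}: {} {}".format(
                time.get("from"), time.get("to"),     # {time|from}, {time|to}
                temp.get("value"), temp.get("unit"),  # {temperature|value}, {temperature|unit}
            )
        )
    return lines


for line in hourly_lines(SAMPLE):
    print(line)
```

Both the record node and its child field carry their data in attributes here, which is why no element text is read at all.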

Note that you could have used record="tabular" fields="time, temperature" to achieve something similar, but the results would have been concatenated, so you would probably have needed ontagend to pluck the values out on the fly: ontagend="xmltag|time" concat="0". That would pass control to the xmltag Form whenever a time node ended.

Hope the new version meets your needs, and thanks again.

Mats · 2011-10-08 14:17:56

Thank you, Stef!

tye · 2011-10-17 11:40:15

Hey Stef – working on this again… going back to this feed

http://en.blog.wordpress.com/feed/

I’m trying to isolate all the <media:content url="http://1.gravatar.com/avatar/d96117a2e8c038359d85de6d6c8da605?s=96&d=retro" medium="image"><media:title type="html">ericarenaejohnson</media:title></media:content> elements so I can use each of them in my form.

I looked and tried your method in this thread – but can’t get it to work, even your example :(

I think I understand what you are trying to do… but what would I use to output a specific image (say I only wanted to use the second media:content image)…

In your example, should the last piece output all the values? I just get the variable value :(

<!-- Variables are now all built up; just display them for now -->
<ul>
<li><txp:variable name="Record01" /></li>
<li><txp:variable name="Record02" /></li>
<li><txp:variable name="Record03" /></li>
<li><txp:variable name="Record08" /></li>
</ul>

Bloke · 2011-10-17 12:23:19

tye wrote:

I’m trying to isolate all the <media:content ... media:title elements so I can use each of them in my form.

The info in Lazlo’s thread is out of date now, sorry. Since then, smd_xml has gained the ontag attributes, which greatly simplify this kind of feed. Here’s an example:

<txp:smd_xml limit="5"
     data="http://en.blog.wordpress.com/feed/"
     record="item" concat="0"
     fields="media:content, media:title"
     ontagend="tag_tye|media:content" />

A few things to notice:

No container. Since we’re using ontagend to decode the feed on the fly there’s no need to worry about the ‘for each record, do this’ portion.
It uses concat="0" for the same reason (no need to concatenate the results if we’re not using them at the end of each record).
The form tag_tye prints out stuff whenever the end tag is hit. So you can put the following code inside it to print out each item as it is encountered:

TITLE : {media:title}<br />
URL : {media:content|url}<br />

If you wanted to extract only the 2nd <media:content> element from each <item> you’d have to employ some kind of counter (for this I heartily recommend the indispensable adi_calc).

Unfortunately the smd_xml version you currently have contains a small oversight in the code that makes things trickier than necessary, so please re-download the beta, which I’ve just fixed, and then try this:

<txp:smd_xml limit="5"
     data="http://en.blog.wordpress.com/feed/"
     record="item" fields="media:content, media:title"
     ontagend="tag_tye|media:content" concat="0">
   <txp:variable name="counter" value="0" />
</txp:smd_xml>

and in form tag_tye:

<txp:adi_calc name="counter" add="1" />
<txp:if_variable name="counter" value="2">
TITLE : {media:title}<br />
URL : {media:content|url}<br />
</txp:if_variable>

So at each new record we reset the counter to 0 in the smd_xml plugin container. Then each time we access the tag_tye Form we increment that counter and immediately test it to see if it matches 2. If it does, we output the item.
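The reset-increment-test pattern described above can be sketched in Python to make the control flow explicit. This is an illustration under stated assumptions, not the plugin's code: the feed fragment, the example.com URLs and the function name second_media are all made up, and the counter is reset at the start of each item for simplicity.

```python
import xml.etree.ElementTree as ET

# Media RSS namespace; the feed fragment below is a made-up sample.
NS = {"media": "http://search.yahoo.com/mrss/"}

SAMPLE = """
<rss xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <item>
      <media:content url="http://example.com/alice.png" medium="image">
        <media:title type="html">alice</media:title>
      </media:content>
      <media:content url="http://example.com/bob.png" medium="image">
        <media:title type="html">bob</media:title>
      </media:content>
    </item>
    <item>
      <media:content url="http://example.com/carol.png" medium="image">
        <media:title type="html">carol</media:title>
      </media:content>
    </item>
    <item>
      <media:content url="http://example.com/dave.png" medium="image">
        <media:title type="html">dave</media:title>
      </media:content>
      <media:content url="http://example.com/erin.png" medium="image">
        <media:title type="html">erin</media:title>
      </media:content>
      <media:content url="http://example.com/frank.png" medium="image">
        <media:title type="html">frank</media:title>
      </media:content>
    </item>
  </channel>
</rss>
"""


def second_media(xml_text, wanted=2):
    """Return (title, url) for the wanted-th <media:content> of each <item>."""
    root = ET.fromstring(xml_text)
    picked = []
    for item in root.iter("item"):
        counter = 0                        # reset the counter for each record
        for content in item.findall("media:content", NS):
            counter += 1                   # like <txp:adi_calc name="counter" add="1" />
            if counter == wanted:          # like <txp:if_variable name="counter" value="2">
                title = content.find("media:title", NS)
                picked.append((title.text, content.get("url")))
    return picked


for title, url in second_media(SAMPLE):
    print("TITLE :", title)
    print("URL :", url)
```

An item with fewer than two media:content elements contributes nothing, which matches what the counter test would do in the Form.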

See how you get on with that.

Last edited by Bloke (2011-10-17 12:43:22)

tye · 2011-10-18 00:56:38

Thanks Stef – on first use I messed it up, but now I realise that the ‘tag_tye’ form is just for the media content… I was trying to put all my fields in there :)

I’m pretty sure I understand now and will report back with superb results :)

Bloke · 2011-10-18 08:27:57

tye wrote:

I realise that the ‘tag_tye’ form is just for the media content… I was trying to put all my fields in there :)

Yeah, under normal operation the plugin’s Form/container is only executed after each record ends. Thus it contains {replacements} for all fields you have elected to extract, and each field is concatenated into one long string if the same field appears more than once inside that record.

In contrast, ontag runs as soon as the nominated tag(s) are encountered: either when the opening tag is first seen (ontagstart) or when its closing tag is detected (ontagend). This allows you to “break out” of the record and take some action when certain tags are hit. You could choose to output the data right there, you could collect the data into another variable, you could insert it directly into the database: pretty much anything you can dream up. And if that’s not enough you can use match to pull out a field only if its content (data between its start and end tags) matches some regular expression.

The ontag attributes allow you to pick and choose which Forms are executed on which tags — you can send multiple tags to the same Form by separating each tag with param_delim, or send different tags to separate Forms depending on your application (separate each Form|tag|tag|tag|... chain by delim).
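The difference between acting when a tag opens and acting when it closes can be sketched with Python's event-driven parser. This is a loose analogy for ontagstart and ontagend, not a description of smd_xml's internals; the sample feed, the handler tables and the function name run_handlers are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# Made-up feed fragment used only for illustration.
SAMPLE = """
<feed>
  <item>
    <title>One</title>
    <link href="http://example.com/1" />
  </item>
  <item>
    <title>Two</title>
    <link href="http://example.com/2" />
  </item>
</feed>
"""


def run_handlers(xml_text, on_start, on_end):
    """Call a handler as soon as a nominated tag opens or closes."""
    out = []
    source = io.BytesIO(xml_text.strip().encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        handlers = on_start if event == "start" else on_end
        handler = handlers.get(elem.tag)   # one table per event, keyed by tag name
        if handler is not None:
            out.append(handler(elem))
    return out


# At "start" only the attributes are reliably available;
# at "end" the element's text and children are complete.
on_start = {"link": lambda e: "start link " + e.get("href")}
on_end = {"title": lambda e: "end title " + e.text}

for line in run_handlers(SAMPLE, on_start, on_end):
    print(line)
```

Mapping several tag names to the same function in one table is the rough equivalent of sending multiple tags to the same Form.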

Looking forward to seeing what you can do with the plugin. I’ve got some SOAP functionality to finalise yet (I’m beginning to loathe SOAP!), but it’s getting closer to an official release.

aslsw66 · 2011-10-21 19:58:21

Stef, I’ve been experimenting with this new version but I must confess without much luck – I can use the basic functionality but don’t really understand the granularity you have added.

The feed I’m interested in is the Australian Bureau of Meteorology feed for Canberra.

The specific item I’m after is the fire danger rating which sits within <text type="fire_danger">Low-Moderate</text>. In fact, the whole feed is like this.

Is it possible to only extract a specific element like this? I see from an earlier response that you can pull out data that sits within a single tag, but I want data that is between tags with a specific attribute. I assume I can assign this to a txp:variable so that I can display an image (the danger rating sign) instead of the text.

Also, this might be harder, but depending on the time of day the fire danger rating might be in either forecast-period index="0" (the rest of today) or forecast-period index="1" (tomorrow) – it seems to update in the evening. I really would like to know, so that I can tell people whether the fire danger rating is for today or tomorrow. In other words, having found a matching item I would like to know something about the parent item.

Thanks. Sorry, I don’t mean for you to do all of the work for me, but at least a pointer in the right direction.

Bloke · 2011-10-22 20:31:43

aslsw66 wrote:

The specific item I’m after is the fire danger rating which sits within <text type="fire_danger">Low-Moderate</text>

Ah, a subtle bug might have prevented you from doing that. Fixed in the latest beta download. But try this:

<txp:smd_xml
     data="ftp://ftp2.bom.gov.au/anon/gen/fwo/IDN10035.xml"
     record="area" fields="forecast-period, text"
     transport="curl" ontagend="4cast|text"
     concat="0" cache_time="600" />

That farms out each <text> node to the 4cast Form. In that form you probably need to resort to smd_if since the node is being used multiple times in the same forecast-period parent node. I might see if I can fix that so this plugin can do it all (not hopeful but I’ll try), so for now:

<txp:smd_if field="{forecast-period|index}, {text|type}"
     operator="isnum, eq" value=", fire_danger">
{area|description}
{forecast-period|index}
{text|type} : {text}<br />
</txp:smd_if>

That’ll print out the area, the forecast-period index attribute and the matching text node for you. Style / HTMLise to taste.
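The filtering that smd_if does in the 4cast Form (numeric index, type equal to fire_danger) can be sketched in Python as follows. The XML is a made-up fragment loosely shaped like the feed described in this thread, not the real BoM data, and the function name fire_danger is hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up fragment; element and attribute names follow the ones quoted in the thread.
SAMPLE = """
<product>
  <forecast>
    <area description="Canberra">
      <forecast-period index="0">
        <text type="forecast">Sunny.</text>
      </forecast-period>
      <forecast-period index="1">
        <text type="forecast">Cloudy.</text>
        <text type="fire_danger">Low-Moderate</text>
      </forecast-period>
    </area>
  </forecast>
</product>
"""


def fire_danger(xml_text):
    """Return (area, period index, rating) for each fire_danger text node."""
    root = ET.fromstring(xml_text)
    found = []
    for area in root.iter("area"):
        for period in area.findall("forecast-period"):
            index = period.get("index", "")
            if not index.isdigit():             # like operator="isnum" on the index
                continue
            for text in period.findall("text"):
                if text.get("type") == "fire_danger":   # like operator="eq"
                    found.append((area.get("description"), int(index), text.text))
    return found


for area, index, rating in fire_danger(SAMPLE):
    print(area, index, rating)
```

Because the loop walks down from area to forecast-period to text, the parent's index attribute is still in hand when a matching text node is found.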

aslsw66 · 2011-10-24 10:36:17

Thanks, this works perfectly. I’ve played around a bit with the code to see exactly how the plugin does what it does – I had guessed it was something to do with the ontag attributes but in the absence of an example in the help I didn’t really understand how to use it. So it now looks like we can pull out an attribute of a node (<node attribute="value">) and the contents inside a node (<node>contents</node>).

I’m just waiting for the time to tick over in Australia to see what happens to the forecast index.

aslsw66 · 2011-10-25 15:42:18

Is it correct to say that the format attribute doesn’t work once processing is passed off to a form through ontag?

I assume this because the plugin doesn’t know what fields are available until the form starts processing, but there is no capacity to format the results in the form.

I would like to grab the date and format it, instead of assuming that index=0 is today or index=1 is tomorrow (why not use the date when it is delivered in the feed?). I have thought about using something like rah_function but this would require two passes – first to convert the string to a proper date and then to get the date as a string in the correct format. Although it’s pretty safe to assume that I’m missing something here …

Bloke · 2011-10-25 20:44:35

aslsw66 wrote:

Is it correct to say that the format attribute doesn’t work once processing is passed off to a form through ontag?

Correct. Kind of an oversight, but pretty difficult to realise without some major code refactoring. And since I have another plugin (spawned from the idea behind the format attribute) already written which can do exactly what you describe in one pass, refactoring smd_xml is not something I’m looking to undertake right now.

Regardless, I’ll make this clear in the documentation, thanks for pointing out the oversight. And if you’d like to try the new plugin, just holler.

Textpattern CMS

Textpattern CMS support forum
