smd_xml : extract data from XML feeds

Bloke · 2010-01-19 09:26:04

nardo wrote:

what would happen if you have a data feed that has updated with additional content for existing items as well as new items … and you INSERT into Txp database …

Trouble is what you’d get :-)

I’ve just had this exact scenario for a project I’m working on. There are two approaches I can think of:

1. MySQL’s INSERT … ON DUPLICATE KEY UPDATE, which does an update if the row already exists
2. Check first to see if the row you are using exists and do an UPDATE if it does; else do an INSERT

I use method 2 like this:

<txp:smd_xml fields="title, link, description, pubDate" blah blahblah>

   <!-- Assign the body tag (though we could use any field) to the
       rec_exists variable if there's a record with the current link in it -->
   <txp:variable name="rec_exists"><txp:smd_query query="SELECT * FROM textpattern WHERE custom_3='{link}'">{Body}</txp:smd_query></txp:variable>

   <txp:if_variable name="rec_exists" value="">
      <!-- Doesn't exist, so INSERT -->
      <txp:smd_query query="INSERT INTO textpattern SET thingy='whatnot', etc" />

   <txp:else />
      <!-- Record exists so UPDATE it -->
      <txp:smd_query query="UPDATE textpattern SET thingy='whatnot', etc WHERE custom_3='{link}'" />

   </txp:if_variable>

</txp:smd_xml>

That should ensure nothing is duplicated. As long as your initial query that assigns stuff to rec_exists can always only return one row — and the WHERE clause you use matches the WHERE clause of the UPDATE statement — you’re good to go.
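
For anyone who wants to try the check-then-write logic outside Textpattern, here is a minimal sketch in Python using an in-memory SQLite table. The `articles` table, its columns and the `upsert_by_link` helper are all made up for illustration; they are not the Textpattern schema or anything smd_query does internally.

```python
import sqlite3


def upsert_by_link(conn, link, title, body):
    """Update the row whose link matches, or insert a new row if there is none."""
    row = conn.execute(
        "SELECT id FROM articles WHERE link = ?", (link,)
    ).fetchone()
    if row is None:
        # No record with this link yet, so INSERT.
        conn.execute(
            "INSERT INTO articles (link, title, body) VALUES (?, ?, ?)",
            (link, title, body),
        )
        return "inserted"
    # Record exists, so UPDATE it using the same WHERE clause as the check.
    conn.execute(
        "UPDATE articles SET title = ?, body = ? WHERE link = ?",
        (title, body, link),
    )
    return "updated"


# Hypothetical table standing in for the real one.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE articles (id INTEGER PRIMARY KEY, link TEXT, title TEXT, body TEXT)"
)
first = upsert_by_link(conn, "http://example.com/a", "First", "original text")
second = upsert_by_link(conn, "http://example.com/a", "First", "updated text")
print(first, second)
```

Running the same link through twice leaves a single row holding the newer body, which is the behaviour the two-query approach is after.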

Having said that, method 1 is probably cleaner. If you do that, please post your code here so I can learn how to do it :-)

Last edited by Bloke (2010-01-19 09:27:43)

Destry · 2010-01-24 15:52:48

Thanks for bringing my attention to this one, Stef. Really wonderful plugin! Maybe one of the “top 10 of all time.” :)

This addresses some other things that were coming up, or that I had pushed on the back burner. Time to do some fiddling.

Destry · 2010-01-25 15:38:37

Bloke, is there a way to be extra selective about which data within a given field you choose to pull?

For example, I’m experimenting with pulling Ning blog posts, and one of the fields included is <author> (actually, it’s structured like this: <author><name>Jim Dean</name></author>). However, we don’t want to pull posts from every author of the entire community, only the leaders of the community.

So two questions in this case:

Do I need to pull both author and name, or will just author or just name grab “Jim Dean”? (I’m guessing just name would be the more accurate, but not sure with the nesting.)
How might I notate to just pull a defined ~~number~~ set of names from the given field?

Last edited by Destry (2010-01-25 19:14:21)

Bloke · 2010-01-25 23:34:47

Destry

Thanks for the kind words about the plugin. Hope it can scratch all your XML itches.

is there a way to be extra selective about which data within a given field you choose to pull?

At the moment no, but I did consider adding some kind of filtering system to say, for example, “get me the ‘name’ field, but only if it matches Jim*”. In the end I opted out of this, thinking that it would be easier to delegate such meddling to a nested smd_if. However, that was before I added paging, which gets mashed if you do that kind of thing, so I guess I ought to build in filtering of data too. That’ll take some brain power: I’d best take a look with a fresh head on.

In the meantime, yes name is usually sufficient to grab the name; adding the containing author field is superfluous. But you may run into problems if name is used elsewhere in the same record, as the skip attribute probably won’t help in this situation. One way I’m considering to get round this is by allowing you to specify a hierarchy, for example fields="author->name". Trouble is, the XML parser built into PHP that this plugin hijacks is kind of a ‘fire and forget’ parser and has no intrinsic history built in; it’s just “found a tag; here it is. Found another one; here it is. Found an end tag…” so I’d have to put some logic in there to track the document structure. That would also require brain power because I’d have to keep a note of the hierarchy to see if the current tag is inside another (and potentially inside another, and another…) so it gets pretty hairy very quickly.
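
To make the “keep a note of the hierarchy” idea concrete, here is a rough sketch in Python using `xml.parsers.expat`, an event-driven parser of the same general style. It is not the plugin’s code: the `extract_paths` function, the `entry` record name and the sample feed are assumptions for illustration only. The parser itself remembers nothing, so the handlers keep a stack of open tags and build a path like `author->name` from it.

```python
import xml.parsers.expat


def extract_paths(xml_text, record, wanted):
    """Return {path: [text, ...]} for paths given relative to each record tag."""
    stack = []  # names of the elements currently open, outermost first
    text = []   # character data seen since the last start or end tag
    found = {path: [] for path in wanted}

    def start(name, attrs):
        stack.append(name)
        text.clear()

    def chars(data):
        text.append(data)

    def end(name):
        value = "".join(text).strip()
        text.clear()
        if record in stack:
            # Path of the closing element relative to its enclosing record.
            inside = stack[stack.index(record) + 1:]
            path = "->".join(inside)
            if path in found and value:
                found[path].append(value)
        stack.pop()

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_text, True)
    return found


# Made-up feed in which "name" appears at two levels of the same record.
feed = (
    "<feed><entry><name>Post title</name>"
    "<author><name>Jim Dean</name></author></entry></feed>"
)
result = extract_paths(feed, "entry", ["name", "author->name"])
print(result)
```

With the stack in place, the two `name` tags end up under different keys, so a reused tag name no longer collides.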

If you can’t wait and do happen to have some say over the structure of the XML stream, a quick cheat is to add an attribute to the name tag on certain people. For example <name type="leader">Jim Dean</name>; the plugin will then allow you to pluck {name|type} directly. Won’t help with paging though :-(

Leave it with me. Will dedicate some thought mojo to this one.

Destry · 2010-01-26 00:41:49

Thanks, man! There’s no rush on this so don’t bang that head too hard. If you get filtering to work, cool. If not, no worries. Still a fantabulous and fun plugin and I’ll have plenty of uses for it.

pieman · 2010-02-20 09:23:20

Hi Stef

With a delicious RSS import successfully up and running I inevitably got greedy and tried to import my entire set of (1322) delicious bookmarks.

The only export available from delicious seems to be in Netscape Bookmark File Format:

<DT><A HREF="http://www.guardian.co.uk/online/howto/story/0,15824,1433861,00.html" ADD_DATE="1114022333" PRIVATE="0" TAGS="bookmarks">Cream of the crop:</A>
<DD>Guardian's 100 most useful websites

I cleaned it up with find & replace to make it valid XML, but it brought up a few brain teasers. Mainly because I need to convert the timestamp into a valid format, and initially I struggled with attributes. Can you apply formatting to attributes?

I couldn’t figure it, so to simplify things I transformed them all into nodes

    <item>
      <link>http://www.guardian.co.uk/online/howto/story/0,15824,1433861,00.html</link>
      <pubDate>1114022333</pubDate>
      <tags>bookmarks,tag2,tag3</tags>
      <title>Cream of the crop:</title>
      <description>Guardian&amp;#039;s 100 most useful websites</description>
    </item>
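
As an alternative to find & replace, the same conversion could be scripted. The following is only a sketch in Python, assuming every bookmark sits on one `<DT><A …>` line with the attributes in the order shown above and an optional `<DD>` line after it; the `bookmarks_to_items` function and the `items` wrapper tag are invented for the example.

```python
import re
import xml.etree.ElementTree as ET

# Matches one bookmark line; assumes HREF, ADD_DATE and TAGS appear in this order.
LINK = re.compile(
    r'<DT><A HREF="([^"]*)" ADD_DATE="(\d+)"[^>]*?TAGS="([^"]*)"[^>]*>(.*?)</A>',
    re.IGNORECASE,
)


def bookmarks_to_items(text):
    """Turn Netscape-style bookmark lines into <item> elements under one root."""
    root = ET.Element("items")
    item = None
    for line in text.splitlines():
        line = line.strip()
        match = LINK.match(line)
        if match:
            item = ET.SubElement(root, "item")
            for tag, value in zip(("link", "pubDate", "tags", "title"), match.groups()):
                ET.SubElement(item, tag).text = value
        elif line.upper().startswith("<DD>") and item is not None:
            # The description belongs to the most recent bookmark.
            ET.SubElement(item, "description").text = line[4:]
    return root


sample = '''<DT><A HREF="http://www.guardian.co.uk/online/howto/story/0,15824,1433861,00.html" ADD_DATE="1114022333" PRIVATE="0" TAGS="bookmarks">Cream of the crop:</A>
<DD>Guardian's 100 most useful websites'''

items = bookmarks_to_items(sample)
print(ET.tostring(items, encoding="unicode"))
```

Because ElementTree does the serialising, characters such as the apostrophe and ampersand are escaped where XML requires it, which avoids some of the manual clean-up.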

Here’s my smd tag magic

<txp:smd_xml data='<txp:variable name="linkfeed" />' 
  record="item" 
  fields="link, pubDate, tags, title|uTitle, description" 
  concat_delim="," 
  convert="&amp;#039;|'" 
  format="pubDate|date|%Y-%m-%d %H:%M:%S, uTitle|sanitize|url_title, title|escape, description|escape" 
  set_empty="1"
  debug="0"
  wraptag="ul"
>
  <li>Pubdate: {pubDate}</li>
  <li>url_title: {uTitle}</li>
  <li>Title: {title}</li>
  <li>Link: {link}</li>
  <li>Description: {description}</li>
  <li>Tags: {tags}</li>
</txp:smd_xml>

And the output. All is well except for the pubDate value.

<ul>
  <li>Pubdate: 1114022333</li>
  <li>url_title: cream-of-the-crop</li>
  <li>Title: Cream of the crop:</li>

  <li>Link: http://www.guardian.co.uk/online/howto/story/0,15824,1433861,00.html</li>
  <li>Description: Guardian\'s 100 most useful websites</li>
  <li>Tags: bookmarks,tag2,tag3</li>
</ul>

I’m not sure whether smd_xml can reformat from that kind of timestamp – is it possible?

One last thing… describing attributes in the Replacement Tags bit of smd_xml help, it says {name:id} : wile_e_coyote, but should it say {name|id} : wile_e_coyote?

Bloke · 2010-02-22 09:30:34

pieman wrote:

I need to convert the timestamp into a valid format

Dammit I missed that, thanks for spotting it. I was thinking of offering a timestamp format attribute but it actually makes sense to modify the date format to handle timestamps. Try this for now until I get round to rolling it into the next release. Around line 491 where the date case is handled you’ll see this line:

$nd = strtotime($this->xmldata['{'.$sfield.'}']);

Replace it with this:

if (is_numeric($this->xmldata['{'.$sfield.'}'])) {
	$nd = $this->xmldata['{'.$sfield.'}'];
} else {
	$nd = strtotime($this->xmldata['{'.$sfield.'}']);
}

EDIT: or more succinctly use this one line:

$nd = (is_numeric($this->xmldata['{'.$sfield.'}'])) ? $this->xmldata['{'.$sfield.'}'] : strtotime($this->xmldata['{'.$sfield.'}']);

That should get you where you wanna go.
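
The same “numeric means it is already a timestamp” idea, sketched in Python for anyone who wants to test the logic away from the plugin. This is not the plugin’s code: `to_epoch` and `format_date` are hypothetical helpers, `str.isdigit` is stricter than PHP’s `is_numeric`, and `email.utils.parsedate_to_datetime` only understands RFC 2822 style dates (the kind RSS uses for pubDate), whereas `strtotime` accepts far more.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def to_epoch(value):
    """Return Unix seconds from either a raw timestamp or an RFC 2822 date string."""
    value = value.strip()
    if value.isdigit():
        # Already a Unix timestamp, so use it as-is.
        return int(value)
    # Otherwise parse it as a date string.
    return int(parsedate_to_datetime(value).timestamp())


def format_date(value, fmt="%Y-%m-%d %H:%M:%S"):
    """Format either kind of input as a UTC date string."""
    return datetime.fromtimestamp(to_epoch(value), tz=timezone.utc).strftime(fmt)


print(format_date("1114022333"))
print(format_date("Wed, 20 Apr 2005 18:38:53 +0000"))
```

Both calls describe the same moment, so they should print the same formatted string.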

should it say {name|id} : wile_e_coyote?

Oops, typo. Will be fixed.

Last edited by Bloke (2010-02-22 09:33:08)

pieman · 2010-02-22 23:01:29

Bloke wrote:

EDIT: or more succinctly use this one line:
$nd = (is_numeric($this->xmldata['{'.$sfield.'}'])) ? $this->xmldata['{'.$sfield.'}'] : strtotime($this->xmldata['{'.$sfield.'}']);

It only works, dunnit! Thanks Bloke.

lazlo · 2010-03-01 20:38:29

Hi Stef (or others)

I have multiple non-unique fields in my XML feed and currently, when I grab them, they all get merged into one field.
All the individual <MeasureTypeCode>Foo</MeasureTypeCode> tags are being joined into custom field MeasurementType { 01 02 03 08 }
All the individual <Measurement>Foo</Measurement> tags are being joined into custom field Measurement { 8.5 5.5 .607 8.5 } and so on

instead of

<MeasureTypeCode>01</MeasureTypeCode> <Measurement>8.5</Measurement> <MeasureUnitCode>in</MeasureUnitCode> being joined into Height { 8.5 in }.
<MeasureTypeCode>02</MeasureTypeCode> <Measurement>5.5</Measurement> <MeasureUnitCode>in</MeasureUnitCode> being joined into Width { 5.5 in }.
<MeasureTypeCode>03</MeasureTypeCode> <Measurement>.607</Measurement> <MeasureUnitCode>in</MeasureUnitCode> being joined into Depth { .607 in }.
<MeasureTypeCode>08</MeasureTypeCode> <Measurement>382</Measurement> <MeasureUnitCode>gr</MeasureUnitCode> being joined into Weight { 382 gr }.

Sample xml

<Product>
    <RecordReference>9780889221482</RecordReference>
    <Measure>
        <MeasureTypeCode>01</MeasureTypeCode>
        <Measurement>8.5</Measurement>
        <MeasureUnitCode>in</MeasureUnitCode>
    </Measure>
     <Measure>
         <MeasureTypeCode>02</MeasureTypeCode>
         <Measurement>5.5</Measurement>
         <MeasureUnitCode>in</MeasureUnitCode>
      </Measure>
      <Measure>
         <MeasureTypeCode>03</MeasureTypeCode>
         <Measurement>0.607</Measurement>
         <MeasureUnitCode>in</MeasureUnitCode>
      </Measure>
      <Measure>
          <MeasureTypeCode>08</MeasureTypeCode>
          <Measurement>382</Measurement>
          <MeasureUnitCode>gr</MeasureUnitCode>
       </Measure>
  </Product>

My current code takes no account of multiple <Measure> tags because I am not sure how to handle them all under the same <RecordReference>9780889221482</RecordReference>.
I can correctly join one <Measure> group under one record but not multiple <Measure> groups under one record.

Any enlightenment would help.

AND if you know of a way of using the XML doctype to label custom fields, that would be really helpful as well.
<!DOCTYPE ONIXMessage SYSTEM "http://www.editeur.org/onix/2.1/02/reference/onix-international.dtd">
<MeasureTypeCode>01</MeasureTypeCode> = Height is there a way just to look this up?

regards
Les Smith

Last edited by lazlo (2010-03-01 20:50:31)

Bloke · 2010-03-03 23:20:20

lazlo wrote:

I have multiple non-unique fields in my XML feed and currently, when I grab them, they all get merged into one field.

Right, this looks like it’s going to take some trickery. Assuming that your sample XML is one Record (i.e. there are multiple ‘Product’ records in your data feed) you can proceed using some txp:variable magic, and some smd_each / smd_if goodness to stitch it all together.

What I’ve done is make up two fake ‘records’ to show how it works with a data feed. I’ve assigned the records to a txp:variable for now; you may employ the direct feed in your smd_xml’s data attribute instead, though notice I’ve added the ‘fake’ container tag around the whole XML feed? For some reason the plugin didn’t like it when ‘Product’ was the top level and it only showed the first record — that might be a bug in the plugin or I might have been a bit dim when I tried it; I’ll have to track this down and see which it is!

So here’s my sample XML, which is a lot like yours but with two records in it:

<txp:variable name="the_xml">
<Fake>
  <Product>
    <RecordReference>9780889221482</RecordReference>
    <Measure>
        <MeasureTypeCode>01</MeasureTypeCode>
        <Measurement>8.5</Measurement>
        <MeasureUnitCode>in</MeasureUnitCode>
    </Measure>
     <Measure>
         <MeasureTypeCode>02</MeasureTypeCode>
         <Measurement>5.5</Measurement>
         <MeasureUnitCode>in</MeasureUnitCode>
      </Measure>
      <Measure>
         <MeasureTypeCode>03</MeasureTypeCode>
         <Measurement>0.607</Measurement>
         <MeasureUnitCode>in</MeasureUnitCode>
      </Measure>
      <Measure>
          <MeasureTypeCode>08</MeasureTypeCode>
          <Measurement>382</Measurement>
          <MeasureUnitCode>gr</MeasureUnitCode>
       </Measure>
  </Product>
  <Product>
    <RecordReference>9221482978088</RecordReference>
    <Measure>
        <MeasureTypeCode>01</MeasureTypeCode>
        <Measurement>35</Measurement>
        <MeasureUnitCode>cm</MeasureUnitCode>
    </Measure>
     <Measure>
         <MeasureTypeCode>02</MeasureTypeCode>
         <Measurement>809</Measurement>
         <MeasureUnitCode>mm</MeasureUnitCode>
      </Measure>
      <Measure>
         <MeasureTypeCode>03</MeasureTypeCode>
         <Measurement>6</Measurement>
         <MeasureUnitCode>cm</MeasureUnitCode>
      </Measure>
      <Measure>
          <MeasureTypeCode>08</MeasureTypeCode>
          <Measurement>38</Measurement>
          <MeasureUnitCode>kg</MeasureUnitCode>
       </Measure>
  </Product>
</Fake>
</txp:variable>

Before we begin this journey, we need to initialise 4 txp:variables — one for each ‘type’ with their relevant starting strings, e.g. Height { and Width {, etc. Note that the Record numbers match the {MeasureTypeCode} numbers. What we’re going to do is concatenate the relevant entries from the {Measurement} and {MeasureUnitCode} strings onto the correct txp:variable to build up a final set of variables containing the full strings.

Now comes the awkward bit. We’re going to let smd_xml pull each record out and concatenate the contents, but we’re going to use concat_delim to delimit them with a pipe. Then, inside the smd_xml container, use smd_each to iterate over the {MeasureTypeCode} list. So for each record, this will walk over the type codes in turn: 01, then 02, then 03, then 08. These 4 values have an automatic counter assigned to them in the smd_each plugin so we know where we are in the list; we’ll test this counter later with smd_if.

Inside the main smd_each container are two further smd_each tags, each with a different var_prefix so we can distinguish its values from those of the outer smd_each. Essentially, they loop over each Measurement, then each MeasureUnitCode and, when they find the entry that matches the current counter, they tack the value onto the current txp:variable.

Thus, when we’re looking at the first MeasureTypeCode, the smd_if fires when we are looking at the first Measurement and then fires when we reach the first MeasureUnitCode. On the next iteration (2nd MeasureTypeCode) the smd_if fires when we reach the 2nd Measurement and the 2nd MeasureUnitCode, and so on.

The variables are built up piece by piece until, when all the smd_each tags are done, we have a complete string. These are then output before the end of the smd_xml tag so you can see them, but of course you can do what you like with them at that point.

Here’s the code anyway… hope it makes some kind of sense:

<txp:smd_xml data='<txp:variable name="the_xml" />' record="Product"
     fields="RecordReference, MeasureTypeCode, Measurement, MeasureUnitCode"
     concat_delim="|" wraptag="ul" break="li">
{RecordReference}
<br />

<!-- Define the initial state of the txp:variables; one for each MeasureTypeCode -->
<txp:variable name="Record01">Height {</txp:variable>
<txp:variable name="Record02">Width {</txp:variable>
<txp:variable name="Record03">Depth {</txp:variable>
<txp:variable name="Record08">Weight {</txp:variable>

<txp:smd_each type="fixed" paramdelim="|" include="MeasureTypeCode|{MeasureTypeCode}" subset="2">

   <txp:smd_each type="fixed" paramdelim="|" include="Measurement|{Measurement}" subset="2" var_prefix="meas_">

      <!-- Only interested in the {smd_var_counter}th entry of the Measurement -->
      <txp:smd_if field="{smd_var_counter}" value="{meas_var_counter}">

         <!-- Concatenate the Measurement value. Record{smd_var_value} is the txp:variable name of the current MeasureTypeCode -->
         <txp:variable name='Record{smd_var_value}'><txp:variable name='Record{smd_var_value}' /> {meas_var_value}</txp:variable>

      </txp:smd_if>

   </txp:smd_each>

   <txp:smd_each type="fixed" paramdelim="|" include="MeasureUnitCode|{MeasureUnitCode}" subset="2" var_prefix="unit_">

      <!-- Only interested in the {smd_var_counter}th entry of the MeasureUnitCode -->
      <txp:smd_if field="{smd_var_counter}" value="{unit_var_counter}">

         <!-- Concatenate the MeasureUnitCode value. Record{smd_var_value} is still the txp:variable name of the current MeasureTypeCode -->
         <txp:variable name='Record{smd_var_value}'><txp:variable name='Record{smd_var_value}' /> {unit_var_value} }</txp:variable>

      </txp:smd_if>

   </txp:smd_each>

</txp:smd_each>

<!-- Variables are now all built up; just display them for now -->
<ul>
<li><txp:variable name="Record01" /></li>
<li><txp:variable name="Record02" /></li>
<li><txp:variable name="Record03" /></li>
<li><txp:variable name="Record08" /></li>
</ul>

</txp:smd_xml>
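
Stripped of the tag syntax, the nested smd_each loops pair up the Nth entry of each pipe-delimited list. A compact way to see that pairing is this Python sketch; the `stitch` function and the `LABELS` lookup are illustrative names only, and the code-to-label mapping is simply the one used for the four txp:variables above.

```python
# Labels taken from the four txp:variables initialised in the Textpattern code.
LABELS = {"01": "Height", "02": "Width", "03": "Depth", "08": "Weight"}


def stitch(type_codes, measurements, unit_codes, delim="|"):
    """Pair the Nth type code with the Nth measurement and the Nth unit."""
    codes = type_codes.split(delim)
    values = measurements.split(delim)
    units = unit_codes.split(delim)
    return [
        # Doubled braces in the f-string produce literal { and } characters.
        f"{LABELS.get(code, code)} {{ {value} {unit} }}"
        for code, value, unit in zip(codes, values, units)
    ]


# The three concatenated fields as smd_xml would hand them over for the first Product.
lines = stitch("01|02|03|08", "8.5|5.5|0.607|382", "in|in|in|gr")
for line in lines:
    print(line)
```

The `zip` call plays the role of the counter comparison in smd_if: it only ever combines entries that sit at the same position in their lists.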

Last edited by Bloke (2010-03-03 23:20:44)

mapu · 2010-03-12 20:41:54

How can one make a list of one’s top Last.fm albums when the XML response looks like this:

<topalbums user="RJ" type="overall">
  <album rank="1">
    <name>Images and Words</name>
    <playcount>174</playcount>
    <mbid>f20971f2-c8ad-4d26-91ab-730f6dedafb2</mbid>  
    <url>
      http://www.last.fm/music/Dream+Theater/Images+and+Words
    </url>
    <artist>
      <name>Dream Theater</name>
      <mbid>28503ab7-8bf2-4666-a7bd-2644bfc7cb1d</mbid>
      <url>http://www.last.fm/music/Dream+Theater</url>
    </artist>
    <image size="small">...</image>
    <image size="medium">...</image>
    <image size="large">...</image>
  </album>
</topalbums>

Everything works fine until I try to parse the nested fields and the <image size="..."> fields. Could someone give me a hint how to accomplish this, please? please?

Otherwise, another great plugin, Stef! Makes me feel again like a programmer! ;-)

Bloke · 2010-03-12 21:08:58

mapu wrote:

Everything works fine until I try to parse the nested fields and the <image size="..."> fields.

I think you may need to wait until I’ve finished the nested rules portion of the code, sorry. I have a tentative version about 60% done, just been sidetracked the last few days.

Since lastfm are reusing things like name, mbid, and url you’ll need a version of the plugin that allows you to specify that you want fields="name, artist->name, artist->mbid" and so on. The plugin should be able to keep those separate so you can grab them separately from the feed. I’m also thinking about a way of allowing a shorthand so you don’t have to specify each and every sub-tag if you happen to want them all. Not sure if I can figure that out, but I’ll try.

I also need to be smarter with concatenation of like-named nodes. At the moment it doesn’t take attributes into account, but it should. Bad plugin *spank spank* no gruel for you…

Otherwise, another great plugin, Stef! Makes me feel again like a programmer! ;-)

Thanks, uhhh, I think ;-)

Last edited by Bloke (2010-03-12 21:09:23)

mapu · 2010-03-13 20:03:33

Then I will wait patiently for the new version! ;-)

nardo · 2010-03-30 00:22:07

Using Flickr API to get info back (due to limit on photos via RSS) … and having some issues with feed below

<rsp stat="ok">
<photos page="1" pages="1" perpage="500" total="109">
<photo id="444" owner="444" secret="444" server="2804" farm="3" title="Photo title" ispublic="1" isfriend="0" isfamily="0" ownername="Photo Owner Name" dateadded="1269854014" />
<photo id="555" owner="555" secret="444" server="2804" farm="3" title="Photo title" ispublic="1" isfriend="0" isfamily="0" ownername="Photo Owner Name" dateadded="1269854014" />

… etc …

</photos>
</rsp>

If I set attribute record as “photos” – I see one result (and replacement tags do nothing – i.e. don’t replace)
If I set attribute record as “photo” – I see 109 results (and replacement tags do nothing – i.e. don’t replace)

Is this data format not compatible with smd_xml due to the self-closing tags?

nardo · 2010-03-30 00:48:11

UPDATE – by requesting “extras” from the Flickr API, I now get the following:

<photo id="666" owner="666" secret="666" server="2698" farm="3" title="summer" ispublic="1" isfriend="0" isfamily="0" ownername="Owner Name" dateadded="1269796453" license="0" dateupload="1269751696" datetaken="2010-03-21 12:24:50" datetakengranularity="0" iconserver="2761" iconfarm="3" lastupdate="1269774620" latitude="0" longitude="0" accuracy="0" tags="summer" machine_tags="" views="1">
<description>summer</description>
</photo>

setting attribute record to “photo”, I can extract {description} … but not the other metadata within the “photo” tag …

Textpattern CMS

Textpattern CMS support forum
