Textpattern CMS support forum

#25 2010-01-12 23:32:35

nardo
Member
From: tuvalahiti
Registered: 2004-04-22
Posts: 743

Re: smd_xml : extract data from XML feeds

I’m trying to parse this feed from a Google Spreadsheet

I can get 12 names output – but no more – whether calling the file from Google or from a local version of the file saved to hard drive

stumped

Offline

#26 2010-01-13 10:05:13

Bloke
Developer
From: Leeds, UK
Registered: 2006-01-29
Posts: 11,250
Website GitHub

Re: smd_xml : extract data from XML feeds

nardo wrote:

I can get 12 names output – but no more – whether calling the file from Google or from a local version of the file saved to hard drive

Eeeek! There are no line breaks in that feed and the plugin had a max line length of 8192 characters (which is fine for 99% of feeds). I’ve added a line_length attribute now so you can set it to something huge, but that might not help you because I think PHP enforces some internal limit that you’d have to raise by mucking around with php.ini.

If you possibly can, it’s far better to switch to transport="curl" because it doesn’t have any line length restrictions.

Here’s v0.22 anyway. Hope that helps, and thanks for the report.
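For anyone curious why a feed with no line breaks trips up a line-based reader, here is a rough sketch in Python (not the plugin's PHP, and not how smd_xml is actually implemented). It feeds the document to an event-based parser in fixed-size chunks, so the result no longer depends on where the line breaks fall. The function name collect_names, the 8192 chunk size and the made-up feed are assumptions for illustration only.

```python
import xml.parsers.expat


def collect_names(xml_text, chunk_size=8192):
    """Parse xml_text in fixed-size chunks and return the text of every <name>."""
    names = []
    buf = []
    inside = {"name": False}

    def start(tag, attrs):
        if tag == "name":
            inside["name"] = True
            buf.clear()

    def chars(data):
        # Character data may arrive in several pieces, so accumulate it
        if inside["name"]:
            buf.append(data)

    def end(tag):
        if tag == "name":
            names.append("".join(buf))
            inside["name"] = False

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.CharacterDataHandler = chars
    parser.EndElementHandler = end

    data = xml_text.encode("utf-8")
    for offset in range(0, len(data), chunk_size):
        # False = more data to come; the parser keeps its state between chunks
        parser.Parse(data[offset:offset + chunk_size], False)
    parser.Parse(b"", True)  # signal end of document
    return names


# A single very long line (well over 8192 characters) with no line breaks
feed = "<feed>" + "".join(
    f"<entry><name>person{i}</name></entry>" for i in range(2000)
) + "</feed>"
names = collect_names(feed)
```

Every name comes back even though the whole feed is one line, because the chunk boundaries are handled by the parser rather than by the reader.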


The smd plugin menagerie — for when you need one more gribble of power from Textpattern. Bleeding-edge code available on GitHub.

Txp Builders – finely-crafted code, design and Txp

Offline

#27 2010-01-14 04:23:15

nardo
Member
From: tuvalahiti
Registered: 2004-04-22
Posts: 743

Re: smd_xml : extract data from XML feeds

good stuff bloke – the fact that there were no line-breaks was annoying me (given the cruft) but didn’t clue me in to why the feed was truncated … when I move off XAMPP to live server I’ll give curl a go … for the moment the line_length attribute is working neatly

another question … I’m pulling in a list of names … sorted & displayed alphabetically … I’d like to provide named anchors to jump down the list to “T” or “W” … if_different makes that easy … but how to extract the first character (e.g. “a” from “ALBERTA”) from a field’s data? … not suggesting smd_xml should be able to do it tho’ …
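The logic being asked for here (take the first character of each name, then start a new anchor whenever it changes) can be sketched outside Txp like this. It is plain Python rather than Textpattern tags, and group_by_initial and the sample names are made up for the illustration; it says nothing about what smd_xml itself can do.

```python
from itertools import groupby


def group_by_initial(names):
    """Sort names case-insensitively and group them by upper-cased first letter."""
    ordered = sorted(names, key=str.casefold)
    return [
        (letter, list(group))
        for letter, group in groupby(ordered, key=lambda n: n[:1].upper())
    ]


names = ["ALBERTA", "Tonga", "alice", "Wallis", "Tuvalu"]
groups = group_by_initial(names)

# One jump link per letter, pointing at a named anchor further down the list
jump_links = " ".join(f'<a href="#{letter}">{letter}</a>' for letter, _ in groups)
```

The groupby step plays the same role as if_different: it only starts a new group when the first letter differs from the previous one, which is why the list has to be sorted first.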

Offline

#28 2010-01-14 11:36:29

pieman
Member
From: Bristol, UK
Registered: 2005-09-22
Posts: 491
Website

Re: smd_xml : extract data from XML feeds

Only just got around to trying this one out. Shame on me…

Needless to say it’s another Bloke classic. This is getting boring ;-)

Offline

#29 2010-01-17 12:17:34

Bloke
Developer
From: Leeds, UK
Registered: 2006-01-29
Posts: 11,250
Website GitHub

Re: smd_xml : extract data from XML feeds

v0.3 steps up to the plate and smacks xml documents out of the park. Features:

  • format attribute which allows you to further manipulate each field in a variety of ways prior to seeing it in the form/container. Examples are changing the case of strings, sanitizing the data, reformatting dates/times, escaping data ready for import. If anyone can think of any other transformations I’ve missed, please yell
  • URL params can now be passed in the data attribute (ahem: minor oversight on my part)
  • linkify is deprecated: use format="field_name|link" instead now. You will receive a warning if you use linkify any more
  • the link creation regex is improved so it catches more URLs. If anyone has any problems with it, please let me know which URLs it chokes on
  • IMPORTANT : param_delim default is now the pipe symbol (|) instead of the colon (:). Please update any existing smd_xml tag attributes accordingly or add param_delim=":" to preserve the existing functionality. The colon proved to be used too often in too many streams and meant you had to pretty much use param_delim in every smd_xml tag to make it useful. Plus, you often want to use colons in date/time strings so it made sense to alter it

See how you get on. This version is much better at being able to embed smd_query tags inside XML streams to insert data into your TXP database from feeds. See example 6 in the help for a concrete implementation of this. As always, report good / bad / ugly stuff here and I’ll send the mermaids out to your oil rig to fix everything.



Offline

#30 2010-01-19 04:15:09

nardo
Member
From: tuvalahiti
Registered: 2004-04-22
Posts: 743

Re: smd_xml : extract data from XML feeds

Bloke, what would happen if you have a data feed that has updated with additional content for existing items as well as new items … and you INSERT into Txp database … would it append ALL as new articles? or would it append new info to the relevant Txp fields in existing articles and make new articles where there are new items in the feed?

Offline

#31 2010-01-19 09:26:04

Bloke
Developer
From: Leeds, UK
Registered: 2006-01-29
Posts: 11,250
Website GitHub

Re: smd_xml : extract data from XML feeds

nardo wrote:

what would happen if you have a data feed that has updated with additional content for existing items as well as new items … and you INSERT into Txp database …

Trouble is what you’d get :-)

I’ve just had this exact scenario for a project I’m working on. There are two approaches I can think of:

  1. MySQL’s INSERT … ON DUPLICATE KEY UPDATE, which does an update if the row already exists
  2. Check first to see if the row you are using exists and do an UPDATE if it does; else do an INSERT

I use method 2 like this:

<txp:smd_xml fields="title, link, description, pubDate" blah blahblah>

   <!-- Assign the body tag (though we could use any field) to the
       rec_exists variable if there's a record with the current link in it -->
   <txp:variable name="rec_exists"><txp:smd_query query="SELECT * FROM textpattern WHERE custom_3='{link}'">{Body}</txp:smd_query></txp:variable>

   <txp:if_variable name="rec_exists" value="">
      <!-- Doesn't exist, so INSERT -->
      <txp:smd_query query="INSERT INTO textpattern SET thingy='whatnot', etc" />

   <txp:else />
      <!-- Record exists so UPDATE it -->
      <txp:smd_query query="UPDATE textpattern SET thingy='whatnot', etc WHERE custom_3='{link}'" />

   </txp:if_variable>

</txp:smd_xml>

That should ensure nothing is duplicated. As long as your initial query that assigns stuff to rec_exists can always only return one row — and the WHERE clause you use matches the WHERE clause of the UPDATE statement — you’re good to go.

Having said that, method 1 is probably cleaner. If you do that, please post your code here so I can learn how to do it :-)
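For reference, method 1 might look something like the sketch below. To keep it runnable anywhere it uses Python's built-in sqlite3 module and SQLite's ON CONFLICT … DO UPDATE clause (needs SQLite 3.24 or later) as a stand-in; on MySQL the equivalent statement would be INSERT … ON DUPLICATE KEY UPDATE, and either way it only works if the matching column has a unique index. The articles table and its columns are hypothetical, not the real textpattern table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: the feed link is the unique key used to spot duplicates
conn.execute("CREATE TABLE articles (link TEXT PRIMARY KEY, title TEXT, body TEXT)")

# SQLite spelling of an upsert. The MySQL counterpart would be roughly:
#   INSERT INTO articles (link, title, body) VALUES (...)
#   ON DUPLICATE KEY UPDATE title = VALUES(title), body = VALUES(body)
UPSERT = """
INSERT INTO articles (link, title, body) VALUES (?, ?, ?)
ON CONFLICT(link) DO UPDATE SET title = excluded.title, body = excluded.body
"""


def import_items(items):
    """Insert new items and update existing ones in a single statement each."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(UPSERT, items)


# First import creates the row; the second updates it and adds a new one
import_items([("http://example.com/a", "First", "old text")])
import_items([
    ("http://example.com/a", "First", "new text"),
    ("http://example.com/b", "Second", "text"),
])
rows = conn.execute("SELECT link, body FROM articles ORDER BY link").fetchall()
```

Binding the values as parameters (the ? placeholders) also avoids having to escape quotes in the feed data by hand.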

Last edited by Bloke (2010-01-19 09:27:43)



Offline

#32 2010-01-24 15:52:48

Destry
Member
From: Haut-Rhin
Registered: 2004-08-04
Posts: 4,909
Website

Re: smd_xml : extract data from XML feeds

Thanks for bringing my attention to this one, Stef. Really wonderful plugin! Maybe one of the “top 10 of all time.” :)

This addresses some other things that were coming up, or that I had pushed on the back burner. Time to do some fiddling.

Offline

#33 2010-01-25 15:38:37

Destry
Member
From: Haut-Rhin
Registered: 2004-08-04
Posts: 4,909
Website

Re: smd_xml : extract data from XML feeds

Bloke, is there a way to be extra selective about which data within a given field you choose to pull?

For example, I’m experimenting with pulling Ning blog posts, and one of the fields included is <author> (actually, it’s structured like this: <author><name>Jim Dean</name></author>). However, we don’t want to pull posts from every author of the entire community, only the leaders of the community.

So two questions in this case:

  1. Do I need to pull both author and name, or will just author or just name grab “Jim Dean”? (I’m guessing just name would be the more accurate, but not sure with the nesting.)
  2. How might I notate to just pull a defined set of names from the given field?

Last edited by Destry (2010-01-25 19:14:21)

Offline

#34 2010-01-25 23:34:47

Bloke
Developer
From: Leeds, UK
Registered: 2006-01-29
Posts: 11,250
Website GitHub

Re: smd_xml : extract data from XML feeds

Destry

Thanks for the kind words about the plugin. Hope it can scratch all your XML itches.

is there a way to be extra selective about which data within a given field you choose to pull?

At the moment no, but I did consider adding some kind of filtering system to say, for example, “get me the ‘name’ field, but only if it matches Jim*”. In the end I opted out of this, thinking that it would be easier to delegate such meddling to a nested smd_if. However, that was before I added paging, which gets mashed if you do that kind of thing, so I guess I ought to build in filtering of data too. That’ll take some brain power: I’d best take a look with a fresh head on.

In the meantime, yes name is usually sufficient to grab the name; adding the containing author field is superfluous. But you may run into problems if name is used elsewhere in the same record, as the skip attribute probably won’t help in this situation. One way I’m considering to get round this is by allowing you to specify a hierarchy, for example fields="author->name". Trouble is, the XML parser built into PHP that this plugin hijacks is kind of a ‘fire and forget’ parser and has no intrinsic history built in; it’s just “found a tag; here it is. Found another one; here it is. Found an end tag…” so I’d have to put some logic in there to track the document structure. That would also require brain power because I’d have to keep a note of the hierarchy to see if the current tag is inside another (and potentially inside another, and another…) so it gets pretty hairy very quickly.
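The hierarchy tracking described above can be sketched with a simple stack: push on every start tag, pop on every end tag, and compare the top of the stack against the wanted path. This is a Python illustration using the standard expat bindings, not the plugin's code; the "author->name" notation is borrowed from the idea floated above and the extract function is hypothetical.

```python
import xml.parsers.expat


def extract(xml_text, hierarchy):
    """Return the text of every element whose ancestry ends with the given path."""
    wanted = hierarchy.split("->")
    stack = []   # currently open tags, outermost first
    buf = []     # text pieces of the element being captured
    values = []

    def matches():
        return stack[-len(wanted):] == wanted

    def start(tag, attrs):
        stack.append(tag)
        if matches():
            buf.clear()

    def chars(data):
        if matches():
            buf.append(data)

    def end(tag):
        if matches():
            values.append("".join(buf))
        stack.pop()

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.CharacterDataHandler = chars
    parser.EndElementHandler = end
    parser.Parse(xml_text, True)
    return values


feed = (
    "<feed><entry><title>Post</title>"
    "<author><name>Jim Dean</name></author>"
    "<category><name>News</name></category>"
    "</entry></feed>"
)
only_authors = extract(feed, "author->name")
all_names = extract(feed, "name")
```

With the plain "name" path both the author and the category name are returned, which is exactly the clash mentioned above; the stack is what lets "author->name" tell them apart.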

If you can’t wait and do happen to have some say over the structure of the XML stream, a quick cheat is to add an attribute to the name tag on certain people. For example <name type="leader">Jim Dean</name>; the plugin will then allow you to pluck {name|type} directly. Won’t help with paging though :-(

Leave it with me. Will dedicate some thought mojo to this one.



Offline

#35 2010-01-26 00:41:49

Destry
Member
From: Haut-Rhin
Registered: 2004-08-04
Posts: 4,909
Website

Re: smd_xml : extract data from XML feeds

Thanks, man! There’s no rush on this so don’t bang that head too hard. If you get filtering to work, cool. If not, no worries. Still a fantabulous and fun plugin and I’ll have plenty of uses for it.

Offline

#36 2010-02-20 09:23:20

pieman
Member
From: Bristol, UK
Registered: 2005-09-22
Posts: 491
Website

Re: smd_xml : extract data from XML feeds

Hi Stef

With a delicious RSS import successfully up and running I inevitably got greedy and tried to import my entire set of (1322) delicious bookmarks.

The only export available from delicious seems to be in Netscape Bookmark File Format:

<DT><A HREF="http://www.guardian.co.uk/online/howto/story/0,15824,1433861,00.html" ADD_DATE="1114022333" PRIVATE="0" TAGS="bookmarks">Cream of the crop:</A>
<DD>Guardian's 100 most useful websites

I cleaned it up with find & replace to make it valid XML, but it brought up a few brain teasers. Mainly because I need to convert the timestamp into a valid format, and initially I struggled with attributes. Can you apply formatting to attributes?

I couldn’t figure it, so to simplify things I transformed them all into nodes

    <item>
      <link>http://www.guardian.co.uk/online/howto/story/0,15824,1433861,00.html</link>
      <pubDate>1114022333</pubDate>
      <tags>bookmarks,tag2,tag3</tags>
      <title>Cream of the crop:</title>
      <description>Guardian&amp;#039;s 100 most useful websites</description>
    </item>

Here’s my smd tag magic

<txp:smd_xml data='<txp:variable name="linkfeed" />' 
  record="item" 
  fields="link, pubDate, tags, title|uTitle, description" 
  concat_delim="," 
  convert="&amp;#039;|'" 
  format="pubDate|date|%Y-%m-%d %H:%M:%S, uTitle|sanitize|url_title, title|escape, description|escape" 
  set_empty="1"
  debug="0"
  wraptag="ul"
>
  <li>Pubdate: {pubDate}</li>
  <li>url_title: {uTitle}</li>
  <li>Title: {title}</li>
  <li>Link: {link}</li>
  <li>Description: {description}</li>
  <li>Tags: {tags}</li>
</txp:smd_xml>

And the output. All is well except for the pubDate value.

<ul>
  <li>Pubdate: 1114022333</li>
  <li>url_title: cream-of-the-crop</li>
  <li>Title: Cream of the crop:</li>

  <li>Link: http://www.guardian.co.uk/online/howto/story/0,15824,1433861,00.html</li>
  <li>Description: Guardian\'s 100 most useful websites</li>
  <li>Tags: bookmarks,tag2,tag3</li>
</ul>

I’m not sure whether smd_xml can reformat from that kind of timestamp – is it possible?
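For what it's worth, the ADD_DATE value is a Unix timestamp (seconds since 1970-01-01 UTC), so as a fallback the conversion could be done while cleaning up the export, before the data ever reaches the plugin. A minimal Python sketch, assuming UTC is acceptable; format_epoch is a made-up helper name. Note that in strftime-style formats %M is minutes, whereas %I is the hour on a 12-hour clock.

```python
from datetime import datetime, timezone


def format_epoch(value, fmt="%Y-%m-%d %H:%M:%S"):
    """Convert a Unix timestamp (string or int) to a formatted UTC date string."""
    return datetime.fromtimestamp(int(value), tz=timezone.utc).strftime(fmt)


formatted = format_epoch("1114022333")
print(formatted)  # 2005-04-20 18:38:53
```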

One last thing… describing attributes in the Replacement Tags bit of smd_xml help, it says {name:id} : wile_e_coyote, but should it say {name|id} : wile_e_coyote?

Offline
