Multiple XPath lookups in YQL

To understand what we’re doing here, you may want to read my introduction to using YQL and Feed43 to create custom RSS feeds. I wanted to pull multiple unconnected elements from a webpage to make an RSS feed, but every example I’d seen recently only used XPath to get one element from the page. The solution is quite simple, but it evaded me for quite a while.

Our example this time is creating an RSS feed for the webcomic Girls With Slingshots that includes the comic image. I also wanted to grab the page URL so that clicking the image takes you to the correct webpage. So, we put the comic’s RSS feed into the YQL console as we’re used to doing:

select * from html where url in (select link from feed where url='http://feeds.feedburner.com/Girls_With_Slingshots')

The most important part of the page is definitely the comic image. I used the Chrome extension xpath on click to find that it’s located at //div[@id='comic']/img, so we can change the YQL query to extract that specific image from each page:

select * from html where url in (select link from feed where url='http://feeds.feedburner.com/Girls_With_Slingshots') and xpath="//div[@id='comic']/img"

But now I also want the page URL! After poking around in the source for a comic page, I noticed that it’s nice enough to have a canonical link tag in its head with the URL for each page, which is exactly what I’m looking for. It turns out that if you want multiple elements from a page, you simply have to separate the XPath expressions with a pipe, |, which is XPath’s union operator, and the query will return all of them. Here’s our final query:

select * from html where url in (select link from feed where url='http://feeds.feedburner.com/Girls_With_Slingshots') and xpath="//div[@id='comic']/img|//head/link[@rel='canonical']"
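If you end up doing this for more than one comic, the pipe trick is easy to wrap in a few lines of code. Here is a minimal Python sketch of that idea; the function name build_yql_query is made up for illustration and is not part of YQL or Feed43. It only assembles the query text and does not contact any service.

```python
def build_yql_query(feed_url, xpaths):
    """Return a YQL query that fetches every page linked from feed_url
    and extracts the union of the given XPath expressions.

    build_yql_query is a hypothetical helper, not part of YQL itself.
    """
    # The pipe is XPath's union operator, so joining the expressions
    # with "|" asks for all of the matching nodes in a single lookup.
    combined = "|".join(xpaths)
    return (
        "select * from html where url in "
        f"(select link from feed where url='{feed_url}') "
        f'and xpath="{combined}"'
    )


query = build_yql_query(
    "http://feeds.feedburner.com/Girls_With_Slingshots",
    ["//div[@id='comic']/img", "//head/link[@rel='canonical']"],
)
print(query)
```

Adding a third element to the feed is then just a matter of appending another XPath expression to the list.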

I then took the REST URL for this query and put it into Feed43 to make my custom RSS feed from this information. Here’s the finished feed: Girls With Slingshots RSS.
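The REST URL is essentially the query, URL-encoded and passed as a parameter. The Python sketch below shows roughly how such a URL could be put together by hand. The endpoint address and the q and format parameter names are my assumptions about the public YQL endpoint, and yql_rest_url is a made-up helper name, so treat the URL shown by the YQL console as the authoritative one. Nothing here makes a network request.

```python
from urllib.parse import urlencode

# Assumed address of the public YQL endpoint; copy the real one
# from the YQL console rather than trusting this constant.
YQL_ENDPOINT = "https://query.yahooapis.com/v1/public/yql"

QUERY = (
    "select * from html where url in (select link from feed where "
    "url='http://feeds.feedburner.com/Girls_With_Slingshots') and "
    "xpath=\"//div[@id='comic']/img|//head/link[@rel='canonical']\""
)


def yql_rest_url(query, fmt="xml"):
    """Build a REST URL for a YQL query (hypothetical helper).

    urlencode percent-encodes the quotes, brackets, and pipe in the
    query so that it survives being embedded in a URL.
    """
    return YQL_ENDPOINT + "?" + urlencode({"q": query, "format": fmt})


rest_url = yql_rest_url(QUERY)
print(rest_url)
```

The resulting string is what would go into Feed43 as the source address.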

Andrew Guyton
http://www.disavian.net/
