Howto MagpieRSS: Implementing Web Developer News

Intro

By popular demand, I have authored this howto and tutorial to help loadaverageZero visitors understand how the dnews application was built using the MagpieRSS feed parsing library, PHP and MySQL. The instructions for installing MagpieRSS are clear and simple, so I will not reiterate them here.

Note: There is a lot of code and data to look at here, and I’m too busy to write an article that includes every gory detail (such as installing the Magpie library). I write pretty clean code, with adequate commenting. So if you are comfortable with PHP and MySQL, then you should have no problem grasping the logic from reading the source code. If not, there are a ton of resources available using the drx application to help get you started.

Data

First, the data. Luckily, I have already built another application for browsing both the schema (table structure) and contents (data) of my databases. And it is called dbrowse. What a shock!

There are only two simple MySQL tables that make up the backend of the dnews application: news, from which the “channel selector” menu is built, and feed, which holds some details about each RSS channel. The two are bound together by the foreign key nid.

Note: You may get sidetracked in dbrowse. If you need help understanding how that application works, I recommend you rewind all the way back to the beginning and start from there.

Code

All of my applications are derived from a base PHP class called application. Another shocker! In order to understand what is going on long before this kicks in, I recommend reading as much of the PHP Labs series as you can stomach. This builds everything that surrounds the content area, which is the container that this document is sitting in. After that, anything goes.

So, to build dnews, I first create a node, or instance of the application class.



Example initialization code:

<?
/*
   lib/dnews/node.php -- dnews application bootstrap code

   (c) Copyright 2004-2006, Douglas W. Clifton, all rights reserved.
   for more copyright information visit the following URI:
   http://loadaveragezero.com/info/copyright.php
*/

$_s ='select name, title from file where fid=' . $fid;

if (!$_r = mysql_query($_s, $_db)) return null;
list($name, $title) = mysql_fetch_row($_r);
mysql_free_result($_r);

include('app.php');

$app = new application($name, $title, $did);
$app->db(MY_DB);

$cachedir = '/none/of/your/beeswax/lib/magpie/cache';

define('MAGPIE_OUTPUT_ENCODING', $charset); // UTF-8, set in lib/preamble.php
define('MAGPIE_AGENT_URL', 'http://' . $laz['domain'] . $app->URI);
define('MAGPIE_CACHE_DIR', $cachedir);

include('magpie/rss_fetch.inc');

$path_info = explode($app->x, trim($_SERVER['PATH_INFO'], $app->x));

if ($path = $path_info[0])
   $feed = feed($path); // possibly null

function feed($path) {
   // return feed info from path info
   global $_db, $app;

   $_s = 'select news.site, news.img, feed.title, feed.URI';
   $_s .= ' from news, feed';
   $_s .= ' where news.nid=feed.nid';
   $_s .= ' and feed.title=' . like(str_replace('_', ' ', $path));

   if ($_r = mysql_query($_s, $_db)) {
      $feed = mysql_fetch_object($_r);

      if (strpos($feed->title, ':'))
         $feed->title = str_replace(':', '/', $feed->title);
      mysql_free_result($_r);
      return $feed;
   }
   $app->error(mysql_error());
   return null;
} // feed()

// lib/dnews/node.php
?>

Node

The feed() function is only called after the news channel menu is rendered and a user selects an entry, at which point the script calls itself with a PATH_INFO argument indicating which feed was selected. The feed() function returns an object which contains the site label, the image basename, and the feed’s title and URI. A #news fragment identifier is also used in the URI to move the focus of the page down past the introduction, so the user can begin scanning headlines, or select another channel from the menu.
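To make the path handling concrete, here is a minimal sketch of the string round trip between a feed title and its PATH_INFO segment. The helper names (title_to_path, path_to_title, first_segment) are hypothetical and not part of dnews, the '/' separator is an assumption about what $app->x holds, and the code targets a current PHP release rather than the PHP 4/5 the original was written for.

```php
<?php
// Hypothetical helpers (not part of dnews): they mirror the string
// handling seen in feeds() and feed().

// Title -> path segment, as feeds() does when it builds each menu link.
function title_to_path(string $title): string
{
    return str_replace(' ', '_', strtolower($title));
}

// Path segment -> search string, as feed() does before querying.
// Case is not restored; this assumes the database comparison is
// case-insensitive, which is a guess about the like() helper.
function path_to_title(string $path): string
{
    return str_replace('_', ' ', $path);
}

// First PATH_INFO segment, or null when no channel was selected.
function first_segment(string $pathInfo, string $sep = '/'): ?string
{
    $parts = explode($sep, trim($pathInfo, $sep));
    return $parts[0] !== '' ? $parts[0] : null;
}

$path = title_to_path('Web Standards');
echo $path, "\n";
echo path_to_title($path), "\n";
```

One limitation of this scheme is worth knowing: a title that already contains an underscore would not survive the round trip, since every underscore in the path is turned back into a space.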

Armed with this information, dnews can now proceed to render the feed’s banner, link and headlines. To understand how all of this is assembled, take a moment to read the dnews source code. For the insanely curious, you can also view the PHP source code to this document.

Below is the source code to the API functions that are used by the application. To render the channel selector menu, dnews calls the feeds() function, which returns the menu pre-built as an XHTML ordered list. All presentation details are separated from the markup by employing several CSS stylesheets. In addition to the base classes in root.css, dnews also uses classes shared by drx and several other applications and documents on loadaverageZero, including resource.css, hreview.css and lists.css. Further information on the CSS stylesheets used on this site can be found at the Sitemap. Good luck with all of that.

Fetch

Okay, we’re on the home stretch now. All that needs to be done is to fetch the feed, and render each item’s headline, link, description (aka blurb or teaser) and URI.

To do this, dnews calls the fetch() wrapper function with the feed object. fetch() returns the banner image as a link to the source site, and an array of <item> objects which hold the details of each headline as described above. These are by default in reverse chronological order, in other words newest first, as is the convention with RSS news feeds. This is where Magpie really kicks in, since the items and the $rss->items returned by the fetch_rss() function are one and the same. With some slight modifications.

Cache

If you don’t appreciate the importance of caching the requests made to the services provided by your news sources, then you probably don’t produce your own feeds, or worse, you don’t pay attention to your log files. Even after many hours of hard work on my part, and extended contacts with the aggregator folks, the requests made to this host by said providers outnumber any other category of requests, including search engine indexing bots like those from Google and Yahoo. Certain places, such as Slashdot, will even ban you if you keep hammering them with requests. Not only does caching keep you in good graces with your sources, it also dramatically improves the response time for your users by storing local (serialized) copies of the feed data until they go stale. Hell, the bandwidth savings alone are worth the effort.
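The idea of keeping serialized copies until they go stale can be sketched in a few lines. This is a generic illustration and not Magpie's own cache code: the cached() function, its parameters and the example URL are all assumptions made for the example, and the closure merely stands in for a real feed request.

```php
<?php
// Generic sketch, NOT Magpie's implementation: store any expensive
// result as a serialized file and reuse it until it is older than $ttl.
function cached(string $dir, string $key, int $ttl, callable $produce)
{
    $file = $dir . DIRECTORY_SEPARATOR . md5($key) . '.cache';

    // Fresh copy on disk: skip the remote request entirely.
    if (is_file($file) && (time() - filemtime($file)) < $ttl) {
        $data = @unserialize(file_get_contents($file));
        if ($data !== false) {
            return $data;
        }
    }

    // Stale or missing: do the real work and store the result.
    $data = $produce();
    file_put_contents($file, serialize($data), LOCK_EX);
    return $data;
}

// Usage: count how often the "remote" source is actually contacted.
$calls = 0;
$fetch = function () use (&$calls) {
    $calls++;
    return ['title' => 'Example item'];
};

$dir = sys_get_temp_dir();
$key = 'http://example.com/feed.xml#' . uniqid('', true);

$first  = cached($dir, $key, 3600, $fetch);
$second = cached($dir, $key, 3600, $fetch);   // served from disk
echo $calls, "\n";
```

The second call never reaches the closure, which is exactly the request your news source is spared.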

Luckily, MagpieRSS makes this very simple. All that is necessary to trigger the caching is to define a constant MAGPIE_CACHE_DIR, which is just that, a path to your cache directory. See the code in the first snip box above for an example. Okay, if you’re thinking this isn’t strictly necessary, you’re right. Magpie will default to using the working directory to store cache files if you don’t specify one in your program. Trust me, you do not want to do this.
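As a rough sketch of that setup, the constants are defined before the library is included, because Magpie only fills in defaults for settings that are still undefined. The cache location used here is a hypothetical stand-in (a real installation would more likely use a private directory outside the web root), and the include line is left commented out so the snippet runs without the library installed.

```php
<?php
// Hypothetical cache location for this example only.
$cachedir = sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'magpie_cache';

// The directory must exist and be writable by the web server user.
if (!is_dir($cachedir)) {
    mkdir($cachedir, 0700, true);
}

// Define the settings BEFORE including rss_fetch.inc.
define('MAGPIE_CACHE_DIR', $cachedir);
define('MAGPIE_CACHE_AGE', 60 * 60);   // seconds before a cached feed is stale

// With the library installed, the include would follow here:
// include('magpie/rss_fetch.inc');
```

MAGPIE_CACHE_AGE is optional; if it is left undefined Magpie falls back to its own default.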

Items

Almost done, I promise. Now that dnews has an array of <item> objects, all it needs to do is loop through this list and display each one. To do this, dnews calls the item() function, which just returns the data formatted into an XHTML ordered list element. Since each feed source is a little different, this function tries its best to present a unified format for dates, and it makes a weak effort at cleaning up markup and character entities that may be present in the description. Occasionally, some feeds (such as Technorati) will not function correctly due to invalid (not well-formed) markup. This is not necessarily their fault: their sources are countless blogs, and you have no control over the quality of these sources. So, if you are using XHTML for markup like I am, it is very important to pick your feeds carefully, and test them thoroughly.
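The cleanup steps described above can be sketched as three small functions. These are hypothetical helpers for a current PHP release, not the dnews originals; the item keys (date_timestamp, pubdate, dc/date) follow the ones used in the dnews source, and everything else is an assumption made for illustration.

```php
<?php
// Hypothetical helpers, not the dnews originals.

// Strip markup, then escape for XHTML without double-encoding entities
// the feed has already escaped (the fourth argument disables that).
function clean_text(string $text): string
{
    $text = str_replace('&nbsp;', ' ', strip_tags(trim($text)));
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8', false);
}

// Shorten long titles; uses mbstring when it is available so that
// multibyte characters are not cut in half.
function shorten(string $text, int $max, string $ell = '...'): string
{
    if (function_exists('mb_strlen')) {
        return mb_strlen($text, 'UTF-8') > $max
            ? mb_substr($text, 0, $max, 'UTF-8') . $ell
            : $text;
    }
    return strlen($text) > $max ? substr($text, 0, $max) . $ell : $text;
}

// Pick the best available date field and format it as RFC 1123 in GMT.
function item_date(array $item): string
{
    if (!empty($item['date_timestamp'])) {
        $ts = (int) $item['date_timestamp'];
    } elseif (!empty($item['pubdate'])
        && ($t = strtotime($item['pubdate'])) !== false) {
        $ts = $t;
    } elseif (!empty($item['dc']['date'])
        && ($t = strtotime($item['dc']['date'])) !== false) {
        $ts = $t;
    } else {
        $ts = time();   // no usable date: fall back to "now"
    }
    return gmdate('D, d M Y H:i:s \G\M\T', $ts);
}

$item = [
    'title'          => 'Tom &amp; Jerry & <b>friends</b>',
    'date_timestamp' => 86400,
];
echo clean_text($item['title']), "\n";
echo item_date($item), "\n";
```

Asking htmlspecialchars() not to double-encode sidesteps the guesswork of checking whether a string already contains an escaped ampersand, which is easy to get wrong when the entity sits at the very start of the string and strpos() returns 0.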

If there is an error, dnews catches it, does its best to display what went wrong, and even provides a link to an RSS Validator so you can determine why it failed. Okay, you got me, so I can determine what went wrong. You probably couldn’t care less and will move on to another feed or another site.

That’s it! Sorry, I do not have a commenting system yet. If you would like to provide feedback, you can send me an email via the Contact page and I will be happy to respond, and perhaps even include comments here, or at least add clarifications and of course fix any errors or omissions.

Notes

Fetching secure feeds.
If you want to fetch feeds from SSL (secure) servers, use the https protocol (aka scheme) and make sure you have cURL installed. I found that in order to fetch some feeds, from mozilla.org for instance, I needed to compile and install cURL, and edit the Snoopy package used by MagpieRSS. This was necessary because the default location for cURL is /usr/bin/curl, and I put mine in /usr/local/bin. See the Snoopy.class.inc file in the MagpieRSS extlib install directory for details.

Open source at work.
I personally like to see who is accessing my resources. Since MagpieRSS is a popular feed parser and I produce several feeds myself, I see a lot of requests for mine using this software. I felt it was important to identify myself when doing the same for other feeds. Kellan provides a hook to set the user-agent string in the rss_fetch.inc file by defining the constant MAGPIE_USER_AGENT before you include the file. Because I only wanted to modify the URL portion of the agent string (much like you will see when friendly robots are indexing your site), and leave the rest alone (the agent identifying string and version number), I created a new constant MAGPIE_AGENT_URL and modified rss_fetch.inc accordingly. Note that you can still override the entire user-agent string by defining the constant as described above before including the library. Below is a context diff you can apply to the rss_fetch.inc using the patch program if you want this same functionality. Or you can download the file as rss_fetch.diff. To apply the patch, simply upload it into the same directory as the original and issue a:

shell> patch < rss_fetch.diff

After making these modifications, my new user-agent string when requesting feeds via dnews and MagpieRSS is:

MagpieRSS/0.7 (+http://loadaveragezero.com/app/dnews)
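The effect of the modification can be restated as a small standalone function. The function name and its parameters are hypothetical; only the layout of the string and the fallback URL are taken from the patched rss_fetch.inc, where the real code reads the MAGPIE_VERSION, MAGPIE_CACHE_ON and MAGPIE_AGENT_URL constants instead of taking arguments.

```php
<?php
// Standalone restatement of the patched logic (hypothetical function).
function magpie_user_agent(string $version, bool $cacheOn, ?string $agentUrl = null): string
{
    // Fallback used when no custom agent URL is supplied.
    $authorUrl = 'http://magpierss.sourceforge.net/';

    $ua  = 'MagpieRSS/' . $version . ' (+';
    $ua .= $agentUrl !== null ? $agentUrl : $authorUrl;

    // Only advertise the cache state when caching is switched off.
    if (!$cacheOn) {
        $ua .= '; No cache';
    }
    return $ua . ')';
}

echo magpie_user_agent('0.7', true, 'http://loadaveragezero.com/app/dnews'), "\n";
```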

Alternative approach.
Using MagpieRSS in this manner is PHP version agnostic (although I recommend using at least 4.3.x). If you’re interested in another approach that leverages PHP5, its built-in SimpleXML parsing library, the DOM XML extension and the APC from PECL, then have a look at Rasmus’ simple_rss.php. You’ll find more great code like this at the Yahoo! Developer Network.

Good luck with your own RSS News feed pages using MagpieRSS, PHP and MySQL (should you choose to implement something similar to dnews).

—Douglas Clifton

*** /usr/local/src/magpierss-0.71.1/rss_fetch.inc Wed Feb 9 14:59:01 2005
--- rss_fetch.inc Mon Jul 18 11:19:38 2005
***************
*** 371,384 ****
      }

      if ( !defined('MAGPIE_USER_AGENT') ) {
-         $ua = 'MagpieRSS/'. MAGPIE_VERSION . ' (+http://magpierss.sf.net';
!         if ( MAGPIE_CACHE_ON ) {
!             $ua = $ua . ')';
!         }
!         else {
!             $ua = $ua . '; No cache)';
!         }
          define('MAGPIE_USER_AGENT', $ua);
      }
--- 371,383 ----
      }

      if ( !defined('MAGPIE_USER_AGENT') ) {
!         $ua = 'MagpieRSS/' . MAGPIE_VERSION . ' (+';
!         $author_URL = 'http://magpierss.sourceforge.net/';
!
!         $ua .= (defined('MAGPIE_AGENT_URL')) ? MAGPIE_AGENT_URL : $author_URL;
!         if (!MAGPIE_CACHE_ON) $ua .= '; No cache';
!         $ua .= ')';
          define('MAGPIE_USER_AGENT', $ua);
      }
<?
/*
   dnews/api.php -- dnews application interface module

   (c) Copyright 2004-2006, Douglas W. Clifton, all rights reserved.
   for more copyright information visit the following URI:
   http://loadaveragezero.com/info/copyright.php
*/

function feeds() {
   // return a list of channels to select from
   global $_db, $app;

   $_s = 'select news.site, feed.title, news.img';
   $_s .= ' from news, feed';
   $_s .= ' where news.nid=feed.nid';
   $_s .= ' order by news.site desc, feed.title desc';

   if ($_r = mysql_query($_s, $_db)) {
      $feeds = '<ol class="recent">' . "\n";
      $path = 'fav/drx/';
      $ext = '.gif';
      $class = 'icon';
      $fid = '#news';
      $in = ' ';

      while (list($site, $title, $img) = mysql_fetch_row($_r)) {
         $icon = image($path . $img . $ext, $img, $title, $class);
         $URI = $app->URI . $app->x . str_replace(' ', '_', strtolower($title)) . $fid;

         if (strpos($title, ':'))
            $title = str_replace(':', '/', $title);

         $link = $site . ': ' . $title;
         $feed = anchor($URI, $link, $link, null, 'next');
         $feeds .= $in . ' <li>' . $icon . $feed . '</li>' . "\n";
      }
      mysql_free_result($_r);
      $feeds .= $in . '</ol>';
      return $feeds;
   }
   $app->error(mysql_error());
   return null;
} // feeds()

function fetch(&$feed, $scheme = 'http') {
   /* return channel image and heading, and the items object for this
      feed object:

      $feed->site -- site label
      $feed->title -- feed label
      $feed->img -- image basename
      $feed->URI -- feed URI less the scheme
   */
   $scheme .= '://';

   if ($rss = @fetch_rss($scheme . $feed->URI)) {
      $path = 'dnews/';
      $ext = '.gif';
      $title = $feed->site . ': RSS News Feed';

      $channel->image = anchor($rss->channel['link'],
         image($path . $feed->img . $ext, $feed->img, null), $title);
      $channel->heading = heading(3, $feed->img,
         $feed->site . ': ' . $feed->title, $feed->title, $class);

      // skip moreover ads
      if ($feed->img == 'moreover') array_shift($rss->items);

      return array($channel, $rss->items);
   }
   $feed->URI = $scheme . $feed->URI; // caller can still have a complete URI
   $feed->error = magpie_error(); // and also know what went wrong
   return null;
} // fetch()

function item(&$item, $i) {
   // return an item formatted as a simplified hreview record
   global $format;

   $max->title = 58;
   $max->URI = 70;
   $ell = '...';
   $fid = '#news';

   if (strlen($item['title']) > $max->title)
      $htitle = amp(substr($item['title'], 0, $max->title)) . $ell;
   else
      $htitle = amp($item['title']);

   $title = htmlspecialchars($item['title']);
   $description = htmlspecialchars(amp(strip_tags(trim($item['description']))));
   $tlink = amp($item['link']);

   if (strlen($item['link']) > $max->URI)
      $link = amp(substr($item['link'], 0, $max->URI)) . $ell;
   else
      $link = $tlink;

   if ($item['date_timestamp'])
      $pubdate = gmdate($format['rfc1123'], $item['date_timestamp']);
   elseif ($item['dc']['date'])
      $pubdate = $item['dc']['date'];
   elseif ($item['pubdate'])
      $pubdate = $item['pubdate'];
   else
      $pubdate = gmdate($format['rfc1123'], time());

   $heading = $i . '. ' . anchor($tlink, $htitle, $tlink);
   $pub = em('Published:', 'bold') . ' <abbr class="dtreviewed">' . $pubdate . '</abbr>';
   $link = em('URI:', 'bold') . anchor($tlink, $link, $title, 'uri', 'bookmark');
   $top = anchor($fid, $ent['nbsp'], 'Select another Web Developer News Channel', 'end', 'contents');

   return <<<_L
<li class="hreview">
   <h4 class="summary">$heading</h4>
   <p class="description">$description</p>
   $pub<br />
   $link<br />
   $top
</li>
_L;
} // item()

function amp($str) {
   // clean-up ampersand problem in strings
   if (strpos($str, '&nbsp;'))
      $str = str_replace('&nbsp;', ' ', $str);
   return (strpos($str, '&amp;')) ? $str : str_replace('&', '&amp;', $str);
} // amp()

function unhtmlentities($str) {
   // get HTML entities table and flip key/values
   $table = array_flip(get_html_translation_table(HTML_ENTITIES, ENT_QUOTES));
   // add &apos; entity
   $table += array('&apos;' => '\'');
   // replace entities with values
   return strtr($str, $table);
} // unhtmlentities()

function parse_date(&$item) {
   // oh that RSS where standardized and developers
   // followed the rules for those formats that are
   global $format;

   if (!$timestamp = $item['date_timestamp']) {
      if ($item['pubdate'])
         $timestamp = strtotime($item['pubdate']);
      elseif ($item['dc']['date'])
         $timestamp = parse_w3cdtf($item['dc']['date']);
      elseif ($item['issued'])
         $timestamp = parse_w3cdtf($item['issued']);
      else
         $timestamp = time();
   }
   return gmdate($format['rfc1123'], $timestamp);
} // parse_date()

// dnews/api.php
?>
Last updated: Sunday, November 30th, 2008 @ 1:01 AM EST [2008-11-30T06:01:28Z]

(c) 2008-2010, Douglas W. Clifton, loadaveragezero.com, all rights reserved.