Parsing XML using SimpleXML
Posted on 3/5/07 by Tim Koschützki
Parsing XML Data with PHP's SimpleXML
Introduction
Extensible Markup Language (XML) has become the number one format for disparate systems to communicate. Its most common applications are probably the Really Simple Syndication (RSS) Feeds embraced by the blogging community - including http://php-coding-practices.com. :)
One of the most significant changes made in PHP5 is the way it handles XML data. A set of XML parsing tools has been integrated directly into the language itself. The old days when we poor programmers had to use external tools and libraries are finally over! The purpose of this article is to take a closer look at one of the cool new XML libraries - SimpleXML.
Short XML-Roundup
If you have ever worked with XHTML (Extensible Hypertext Markup Language), then you are familiar with an application of XML, since XHTML is a reformulation of HTML4 as XML. I assume you are familiar with XML already. If not, head over to the W3Schools site and learn about it.
Important things in an XML Document
The most important things in an XML document are the following:
- Entity: An entity is a named unit of storage. Entities can work as "variables" in an XML document. They can also be used to embed angle brackets or other characters that cannot normally be part of an XML document. Entities can be defined directly in the document or included from an external source.
- Element: A data object that can contain other elements or raw textual data. Elements can also feature one or more attributes.
- Document Type Declaration: A set of instructions that describes the accepted structure of the XML file. It can be embedded or externally defined.
XML documents should be valid. That means they are well-formed (all tags are closed and nested correctly) and they conform to a Document Type Declaration (DTD). A DTD is not a requirement and in fact, you will see many documents without one. You should stick to it, though. This is not a PHP coding best practice, but an XML one. Think about it. ;)
An Example of valid XML Documents
The above document is only well-formed, but it is not valid. This is because it contains no DTD. Let's fix that:
<!DOCTYPE message SYSTEM "message.dtd">
Now that is a valid XML document! It is well-formed, all tags are nested correctly and it contains a DTD.
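To make the difference between well-formed and valid a bit more tangible, here is a minimal sketch that checks both from PHP. The message document and its inline DTD are hypothetical stand-ins (not the sample from this article), and the check uses DOMDocument::validate() from the DOM extension rather than SimpleXML:

```php
<?php
// Hypothetical sample: well-formed, but it carries no DTD.
$wellFormedOnly = '<message><to>Tim</to><body>Hello</body></message>';

// The same document with an inline DTD (assumed for illustration only).
$withDtd = <<<XML
<?xml version="1.0"?>
<!DOCTYPE message [
  <!ELEMENT message (to, body)>
  <!ELEMENT to (#PCDATA)>
  <!ELEMENT body (#PCDATA)>
]>
<message><to>Tim</to><body>Hello</body></message>
XML;

// Collect parser and validation errors quietly instead of printing warnings.
libxml_use_internal_errors(true);

// Returns true only if the XML is well-formed AND validates against its DTD.
function isValidXml($xml) {
    $doc = new DOMDocument();
    if (!$doc->loadXML($xml)) {
        return false; // not even well-formed
    }
    return $doc->validate();
}

$firstIsValid = isValidXml($wellFormedOnly);
$secondIsValid = isValidXml($withDtd);
var_dump($firstIsValid, $secondIsValid);
```

The first document fails the check because there is no DTD to validate against, while the second one passes.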
Introduction to SimpleXML
Over are the difficult days of PHP4 when external libraries had to be used to parse and change XML files. With PHP5 came a number of integrated XML libraries - one of which is SimpleXML.
True to its name, it provides an easy way to work with XML documents. SimpleXML, however, is geared toward parsing and reading XML files and is rather inferior when it comes to altering documents. Yes, you can alter XML documents with SimpleXML, but the DOM library, among others, is far superior in this field. The good news is that you can pass parsed XML objects back and forth between the new built-in libraries, which makes the overall task pretty easy.
Creating an XML Document
In order to learn how to parse XML files with PHP SimpleXML, we will need a document first. For that, we simply use the current sitemap.xml file for http://php-coding-practices.com. You can view or download it from http://php-coding-practices.com/sitemap.xml.
Here is an excerpt:
<!-- generator="wordpress/2.1.1" -->
<!-- sitemap-generator-url="http://www.arnebrachhold.de" sitemap-generator-version="2.7.1" -->
<!-- Debug: Total comment count: 8 -->
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
<url>
<loc>http://php-coding-practices.com/</loc>
<lastmod>2007-05-02T21:51:04+00:00</lastmod>
<changefreq>daily</changefreq>
<priority>1</priority>
</url>
<!-- Debug: Start Postings -->
<!-- Debug: Priority report of postID 55: Comments: 0 of 8 = 0 points -->
<url>
<loc>http://php-coding-practices.com/beautifying-your-code/php-code-beautifier-tool/</loc>
<lastmod>2007-05-02T22:51:04+00:00</lastmod>
<changefreq>weekly</changefreq>
<priority>0.1</priority>
</url>
<!-- Debug: Priority report of postID 54: Comments: 2 of 8 = 0.3 points -->
<url>
<loc>http://php-coding-practices.com/refactoring/refactoring-a-first-example/</loc>
<lastmod>2007-05-02T16:16:22+00:00</lastmod>
<changefreq>weekly</changefreq>
<priority>0.3</priority>
</url>
The document should be pretty straightforward if you are familiar with XML. It provides a number of URLs that each have a location, a last modification date, a change frequency and a priority. It is used with the Google Webmaster Tools to make it easier for Google to index all pages on http://php-coding-practices.com.
Loading an XML File
Let's get a head start with SimpleXML on our sitemap.xml. Create a new simplexml.php file within the same directory where you placed the sitemap.xml file. Make sure both files are in your htdocs directory somewhere so you can access the php file on your local php-enabled system. Put the following source code into the simplexml.php file:
<?php
// path to the sitemap that sits next to this script
$source = 'sitemap.xml';
// load as string
$xmlstr = file_get_contents($source);
$sitemap1 = simplexml_load_string($xmlstr);
// load as file
$sitemap2 = simplexml_load_file($source);
The code is pretty straightforward. First, we use SimpleXML's simplexml_load_string() function to parse XML that was previously read from a file into a string. Secondly, we parse the XML directly from the file using simplexml_load_file(), which saves the intermediate step and makes more sense when the data lives in a file anyway. Both functions return false if the XML cannot be parsed.
The source could also be a URL pointing to a remote XML file, depending on your allow_url_fopen php.ini setting. Note that both $sitemap1 and $sitemap2 are instances of the SimpleXMLElement class.
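Since both loaders return false on broken input, it pays to check the result. Here is a small sketch of how that could look. The broken sample string is made up for illustration; libxml_use_internal_errors(), libxml_get_errors() and libxml_clear_errors() are the standard libxml helpers:

```php
<?php
// Keep libxml from emitting PHP warnings; we inspect the errors ourselves.
libxml_use_internal_errors(true);

// Made-up sample with a missing </url> closing tag.
$broken = '<urlset><url><loc>http://example.com/</loc></urlset>';

$sitemap = simplexml_load_string($broken);
$messages = array();

if ($sitemap === false) {
    // Gather the parser messages together with their line numbers.
    foreach (libxml_get_errors() as $error) {
        $messages[] = trim($error->message) . ' (line ' . $error->line . ')';
    }
    libxml_clear_errors();
}

echo implode("\n", $messages), "\n";
```

The same check works for simplexml_load_file(), which also returns false when the file cannot be parsed.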
SimpleXML also has an OOP-centric approach, where you can create those SimpleXMLElement objects on the fly:
// load as string
$xmlstr = file_get_contents($source);
$sitemap = new SimpleXMLElement($xmlstr);
// load as file
$sitemap = new SimpleXMLElement($source,null,true);
Not much need of explanation here, except that, as you see, the constructor of the SimpleXMLElement class accepts optional parameters after the data itself. The second parameter can hold additional information on how the data should be parsed, whereas the third one informs the class that the first parameter is a path to a file instead of a string of XML.
We left the second parameter at null at this point, because we do not need it for our journey (recent PHP versions expect an integer here, so 0 is the safer value). If you are eager to learn what you can do with it, check out the predefined libxml constants, which you can combine with the bitwise OR operator (|) and pass as the second parameter.
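To give you an idea of what the second parameter does, here is a short sketch that combines two of the predefined libxml constants. The CDATA sample string is made up for this example. LIBXML_NOCDATA merges CDATA sections into plain text nodes and LIBXML_NOBLANKS drops blank nodes:

```php
<?php
// Made-up sample: the location is wrapped in a CDATA section.
$xmlstr = '<urlset><url><loc><![CDATA[http://example.com/?a=1&b=2]]></loc></url></urlset>';

// Default parsing keeps the CDATA section as it is.
$plain = new SimpleXMLElement($xmlstr);

// Options are integer flags, combined with the bitwise OR operator.
$merged = new SimpleXMLElement($xmlstr, LIBXML_NOCDATA | LIBXML_NOBLANKS);

echo $plain->asXML(), "\n";
echo $merged->asXML(), "\n";

// The text value is the same either way.
$location = (string) $merged->url->loc;
echo $location, "\n";
```

Comparing the two asXML() outputs shows how the parser treated the CDATA section in each case.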
Accessing Children
SimpleXML is so cool and easy, because when you parse a document as we have done now, all children are stored as nodes of the SimpleXMLElement object - allowing us to access them easily. Let's look at this now:
$sitemap = new SimpleXMLElement($source,null,true);
foreach($sitemap as $url) {
echo "{$url->loc} - {$url->lastmod} - {$url->changefreq} - {$url->priority}\r\n";
}
The result is a nice list of all URLs and their sub-nodes. The drawback here is that we need to know the names of all the nodes. If the XML document changes, we would need to change our client code, too. Let's take care of that:
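$sitemap = new SimpleXMLElement($source,null,true);
// iterate over all direct children of the root node
foreach($sitemap->children() as $child) {
    echo $child->getName().":\n";
    foreach($child->children() as $subchild) {
        echo "--->".$subchild->getName().": ".$subchild."\n";
    }
}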
Coolness! What we have done here is simply use the children() method of the SimpleXMLElement class, which lets us iterate over all children of a node. Your output should be something like this:
url:
--->loc: http://php-coding-practices.com/
--->lastmod: 2007-05-02T21:51:04+00:00
--->changefreq: daily
--->priority: 1
url:
--->loc: http://php-coding-practices.com/beautifying-your-code/php-code-beautifier-tool/
--->lastmod: 2007-05-02T22:51:04+00:00
--->changefreq: weekly
--->priority: 0.1
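If you only need one specific node, you do not have to loop at all. Here is a small sketch with a made-up two-entry sitemap that shows index access, counting and casting. Keep in mind that every node is a SimpleXMLElement object, so cast it when you need a plain string or number:

```php
<?php
// Made-up sample data for illustration.
$xmlstr = '<urlset>'
    . '<url><loc>http://example.com/</loc><priority>1</priority></url>'
    . '<url><loc>http://example.com/about/</loc><priority>0.3</priority></url>'
    . '</urlset>';
$sitemap = new SimpleXMLElement($xmlstr);

// Same-named children behave like a zero-based list.
$total = count($sitemap->url);
$secondLoc = (string) $sitemap->url[1]->loc;

// Cast explicitly before doing arithmetic or comparisons.
$priority = (float) $sitemap->url[1]->priority;

echo "{$total} urls, second: {$secondLoc} ({$priority})\n";
```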
Now what if you simply want to dump all XML data recursively with all children? You would not want to create 20 foreach loops, right? SimpleXML itself does not provide an easy recursive function that does that. However, we can easily do it on our own:
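function displayChildrenRecursive($xmlObj,$depth=0) {
    foreach($xmlObj->children() as $child) {
        // print the current child; trim() drops whitespace-only text of container nodes
        echo str_repeat('-',$depth).">".$child->getName().": ".trim((string)$child)."\n";
        // descend into the children of the current child
        displayChildrenRecursive($child,$depth+1);
    }
}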
The recursive function is provided with a SimpleXMLElement object and a recursion depth. It then dumps all of the object's children one by one and calls itself to process all sub-children of the current child.
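If you would rather collect the lines than echo them right away, the same idea can be sketched like this. The function name collectChildrenRecursive() and the tiny sample document are made up for this example:

```php
<?php
// Variant of the recursive dump that returns the lines instead of printing them.
function collectChildrenRecursive($xmlObj, $depth = 0) {
    $lines = array();
    foreach ($xmlObj->children() as $child) {
        // One line per node: depth marker, node name, trimmed text content.
        $lines[] = str_repeat('-', $depth) . '>' . $child->getName() . ': ' . trim((string) $child);
        // Append everything found below the current child.
        $lines = array_merge($lines, collectChildrenRecursive($child, $depth + 1));
    }
    return $lines;
}

$sitemap = new SimpleXMLElement(
    '<urlset><url><loc>http://example.com/</loc><priority>1</priority></url></urlset>'
);
$lines = collectChildrenRecursive($sitemap);
echo implode("\n", $lines), "\n";
```

Returning an array keeps the function easy to test and leaves the output format to the caller.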
Accessing Attributes
If our xml document contained attributes - for example if the urls had an id or number - we could access them as well. XML Example:
<!-- generator="wordpress/2.1.1" -->
<!-- sitemap-generator-url="http://www.arnebrachhold.de" sitemap-generator-version="2.7.1" -->
<!-- Debug: Total comment count: 8 -->
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
<url num="1">
<loc>http://php-coding-practices.com/</loc>
<lastmod>2007-05-02T21:51:04+00:00</lastmod>
<changefreq>daily</changefreq>
<priority>1</priority>
</url>
<!-- Debug: Start Postings -->
<!-- Debug: Priority report of postID 55: Comments: 0 of 8 = 0 points -->
<url num="2">
<loc>http://php-coding-practices.com/beautifying-your-code/php-code-beautifier-tool/</loc>
<lastmod>2007-05-02T22:51:04+00:00</lastmod>
<changefreq>weekly</changefreq>
<priority>0.1</priority>
</url>
<!-- Debug: Priority report of postID 54: Comments: 2 of 8 = 0.3 points -->
<url num="3">
<loc>http://php-coding-practices.com/refactoring/refactoring-a-first-example/</loc>
<lastmod>2007-05-02T16:16:22+00:00</lastmod>
<changefreq>weekly</changefreq>
<priority>0.3</priority>
</url>
Here is how we would parse them with the first method:
$sitemap = new SimpleXMLElement($source,null,true);
foreach($sitemap as $url) {
echo "Number: {$url['num']}: {$url->loc} - {$url->lastmod} - {$url->changefreq} - {$url->priority}\r\n";
}
Look at that array-like approach for attributes. Isn't that cool? Here is the implementation using the attributes() method of the SimpleXMLElement object:
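$sitemap = new SimpleXMLElement($source,null,true);
foreach($sitemap->children() as $child) {
    echo $child->getName().":\n";
    // list all attributes of the current node
    foreach($child->attributes() as $attr) {
        echo "->".$attr->getName().": ".$attr."\n";
    }
    foreach($child->children() as $subchild) {
        echo "--->".$subchild->getName().": ".$subchild."\n";
    }
}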
Simple, isn't it?
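One thing to watch out for: an attribute value is a SimpleXMLElement object as well, and an attribute may be missing altogether. Here is a small sketch with made-up data that checks for the attribute with isset() and casts it to an integer:

```php
<?php
// Made-up sample: the second <url> has no num attribute.
$sitemap = new SimpleXMLElement(
    '<urlset>'
    . '<url num="2"><loc>http://example.com/</loc></url>'
    . '<url><loc>http://example.com/about/</loc></url>'
    . '</urlset>'
);

$numbers = array();
foreach ($sitemap->url as $url) {
    // isset() tells us whether the attribute exists before we cast it.
    $numbers[] = isset($url['num']) ? (int) $url['num'] : null;
}

var_dump($numbers);
```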
XPath Queries
The XML Path Language (XPath) is a W3C standardized language that is used to access and search XML documents. It is used extensively in Extensible Stylesheet Language Transformation (XSLT) and forms the basis for XML Query (XQuery) and XML Pointer (XPointer). It is a query language to access specific nodes deep in the XML tree in a comfortable way.
SimpleXMLElement comes with an xpath() method that does the bulk of the work for us. Keep in mind that a relative XPath expression is evaluated from the node on which xpath() is called.
If you use a relative expression on the root SimpleXMLElement, it starts at the root - if you use it on a child, it searches only within that child and so on. An absolute expression such as //loc still searches the entire document. xpath() returns an array of SimpleXMLElement objects - even if only a single element matches.
$xml = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<urlset>
<url>
<loc>http://php-coding-practices.com/</loc>
<lastmod>2007-05-02T21:51:04+00:00</lastmod>
<changefreq>daily</changefreq>
<priority>1</priority>
</url>
<url>
<loc>http://php-coding-practices.com/beautifying-your-code/php-code-beautifier-tool/</loc>
<lastmod>2007-05-02T22:51:04+00:00</lastmod>
<changefreq>weekly</changefreq>
<priority>0.1</priority>
</url>
</urlset>
XML;
$sitemap = new SimpleXMLElement($xml);
$results = $sitemap->xpath('url/loc');
print_r($results);
foreach($results as $location) {
echo $location.'
';
}
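XPath really starts to shine once you add conditions. The following sketch uses a made-up sitemap without a namespace. It filters by a sibling value with a predicate and shows the difference between a relative and an absolute expression when xpath() is called on a child node:

```php
<?php
$xml = <<<XML
<urlset>
  <url>
    <loc>http://example.com/</loc>
    <changefreq>daily</changefreq>
  </url>
  <url>
    <loc>http://example.com/about/</loc>
    <changefreq>weekly</changefreq>
  </url>
</urlset>
XML;
$sitemap = new SimpleXMLElement($xml);

// Predicate: only <loc> nodes whose sibling <changefreq> equals "weekly".
$weekly = array_map('strval', $sitemap->xpath('url[changefreq="weekly"]/loc'));

// Relative expression: evaluated from the first <url> node only.
$relative = $sitemap->url[0]->xpath('loc');

// Absolute expression: searches the whole document, even from a child.
$absolute = $sitemap->url[0]->xpath('//loc');

print_r($weekly);
echo count($relative), ' vs ', count($absolute), "\n";
```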
Important Note: The sitemap.xml file that we use doesn't seem to be liked by xpath(). The comments in the file are harmless; the culprit is the namespace declared on the urlset node:
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
<url>
<loc>http://php-coding-practices.com/</loc>
<lastmod>2007-05-02T21:51:04+00:00</lastmod>
<changefreq>daily</changefreq>
<priority>1</priority>
</url>
...
</urlset>
If we do not register the namespace with xpath, it will not work. For now, let's remove the namespace (xmlns="http://www.google.com/schemas/sitemap/0.84"). A way we can make XPath work alongside namespaces will be discussed later.
Modifying XML Documents with SimpleXML
Adding elements and attributes
Prior to PHP 5.1.3, SimpleXML had no means to change the structure of an XML document, meaning it could not add or remove elements or attributes. Yes, it could change their values, but the only way to add or remove elements or attributes was to export the SimpleXMLElement object to the DOM library. However, with PHP 5.1.3 the methods addChild() and addAttribute() were introduced to the SimpleXMLElement object.
Let's look at the addChild() method first:
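$sitemap = new SimpleXMLElement($source,null,true);
// append a new, empty url node to the root element
$url = $sitemap->addChild('url');
$url->addChild('loc','http://php-design-patterns.com');
$url->addChild('lastmod','2007-05-02T21:51:04+00:00');
$url->addChild('changefreq','daily');
$url->addChild('priority','0.5');
header('Content-type: text/xml');
echo $sitemap->asXML();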
The addChild() method returns a SimpleXMLElement itself, to which you can add children again. It accepts three parameters - the node's name, an optional value and an optional namespace. We will come to namespaces in a minute.
Via the asXML() method of the SimpleXMLElement you can also output the entire document again, which comes in handy with the header() function to tell the browser that your script's output has to be treated as XML content. The asXML() method also accepts a file path parameter to which it can save the document. In this case it returns a boolean value indicating whether the save operation was successful or not.
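Here is a quick sketch of the file variant of asXML(). The temporary file from tempnam() and the example URL are assumptions for this demo; in a real script you would pass the path of your sitemap instead:

```php
<?php
// Build a tiny document from scratch.
$sitemap = new SimpleXMLElement('<urlset/>');
$url = $sitemap->addChild('url');
$url->addChild('loc', 'http://example.com/');

// Write to a temporary file (assumed location, for illustration only).
$path = tempnam(sys_get_temp_dir(), 'sitemap');
$saved = $sitemap->asXML($path); // boolean result when a path is given

// Read the file back to confirm what was written.
$reloaded = simplexml_load_file($path);
$loc = (string) $reloaded->url->loc;
unlink($path);

var_dump($saved);
echo $loc, "\n";
```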
The addAttribute() method is quite similar:
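$sitemap = new SimpleXMLElement($source,null,true);
// append a new url node and mark it with an attribute
$url = $sitemap->addChild('url');
$url->addAttribute('featured','true');
$url->addChild('loc','http://php-design-patterns.com');
$url->addChild('lastmod','2007-05-02T21:51:04+00:00');
$url->addChild('changefreq','daily');
$url->addChild('priority','0.5');
header('Content-type: text/xml');
echo $sitemap->asXML();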
We have now added an attribute "featured" with the value "true" to our url node, as we can see in the script's output:
<url featured="true">
<loc>http://php-design-patterns.com</loc>
<lastmod>2007-05-02T21:51:04+00:00</lastmod>
<changefreq>daily</changefreq>
<priority>0.5</priority></url>
The addAttribute() method can also receive an optional namespace.
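To show the optional namespace parameter of both methods in action, here is a minimal sketch. The meta prefix and its URI http://example.com/meta/ are placeholders picked for this example, as is the example URL:

```php
<?php
// Placeholder namespace URI, declared on the root with the prefix "meta".
$metaNs = 'http://example.com/meta/';
$sitemap = new SimpleXMLElement('<urlset xmlns:meta="http://example.com/meta/"/>');

$url = $sitemap->addChild('url');
$url->addChild('loc', 'http://example.com/');

// The third argument puts the new node or attribute into the given namespace.
$url->addChild('meta:priority', '0.5', $metaNs);
$url->addAttribute('meta:featured', 'true', $metaNs);

$output = $sitemap->asXML();
echo $output;

// Namespaced children are read back through children() with the namespace URI.
$priority = (string) $url->children($metaNs)->priority;
echo $priority, "\n";
```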
Removing elements and attributes
While SimpleXML provides the functionality for adding children and attributes, it does not provide a means to remove them - at least not directly via its API. However, you can remove an element with:
This will not remove attributes from the element at the url level. You could set the attribute value to null as well, but that would not actually remove it. The attribute will only become empty. To really remove attributes and elements, you have to export your SimpleXMLElement objects to the DOM library (explained in a later article).
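In practice, current PHP versions also accept unset() on a child or an attribute, and that does remove the node. The sketch below shows this next to the DOM route mentioned above, using dom_import_simplexml(). The three-entry document is made up for illustration:

```php
<?php
$sitemap = new SimpleXMLElement(
    '<urlset>'
    . '<url num="1"><loc>http://example.com/</loc></url>'
    . '<url num="2"><loc>http://example.com/about/</loc></url>'
    . '<url num="3"><loc>http://example.com/contact/</loc></url>'
    . '</urlset>'
);

// 1) unset() on an attribute removes it entirely.
$first = $sitemap->url[0];
unset($first['num']);

// 2) unset() on an indexed child removes that element.
unset($sitemap->url[1]);

// 3) The DOM route: import the node and detach it from its parent.
//    After step 2, index 1 points at the former third url node.
$last = dom_import_simplexml($sitemap->url[1]);
$last->parentNode->removeChild($last);

$output = $sitemap->asXML();
echo $output;
```

Both routes work on the same underlying document, so the change made through DOM is visible in the SimpleXMLElement right away.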
Working with Namespaces
The use of namespaces allows you to associate certain element and attribute names with namespaces identified by URIs. This has the benefit of avoiding naming conflicts when two elements of the same name exist, but contain different data.
Our sitemap contains a namespace already - check for the string xmlns="http://www.google.com/schemas/sitemap/0.84" in the urlset node. Let's add a few more:
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84"
xmlns:meta="http://example.com/meta/"
xmlns:foo="http://example.com/foo/">
<url>
<loc>http://php-coding-practices.com/</loc>
<lastmod>2007-05-02T21:51:04+00:00</lastmod>
<changefreq>daily</changefreq>
<priority>1</priority>
</url>
<url>
<loc>http://php-coding-practices.com/beautifying-your-code/php-code-beautifier-tool/</loc>
<lastmod>2007-05-02T22:51:04+00:00</lastmod>
<changefreq>weekly</changefreq>
<priority>0.1</priority>
</url>
...
Since PHP 5.1.3, SimpleXML has had the ability to return all namespaces declared in a document (getDocNamespaces()), return all namespaces used in a document (getNamespaces()) and register a namespace prefix used in making an XPath query (registerXPathNamespace()). Here is an example for getDocNamespaces():
// fetch all namespaces declared on the root node
$namespaces = $sitemap->getDocNamespaces();
foreach($namespaces as $key => $value) {
    echo "{$key} => {$value}\n";
}
This will output
=> http://www.google.com/schemas/sitemap/0.84
meta => http://example.com/meta/
foo => http://example.com/foo/
Fair enough, our initial namespace didn't have a name, so that first line looks a bit weird.
A call to getNamespaces() will only return the default namespace of the root node, since we do not use any of the prefixed ones yet. If we used them within our document, by typing something like
<url>
<loc>http://php-coding-practices.com/beautifying-your-code/php-code-beautifier-tool/</loc>
<lastmod>2007-05-02T22:51:04+00:00</lastmod>
<changefreq>weekly</changefreq>
<meta:priority>0.1</meta:priority>
</url>
a recursive call, getNamespaces(true), would return an array of all namespaces used in the document, including meta.
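Here is a small sketch that puts declared and used namespaces side by side. The document is a shortened, made-up version of the sitemap with one meta:priority node. Passing true to getNamespaces() makes it look at all descendants instead of only the root node:

```php
<?php
$xmlstr = <<<XML
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84"
        xmlns:meta="http://example.com/meta/"
        xmlns:foo="http://example.com/foo/">
  <url>
    <loc>http://example.com/</loc>
    <meta:priority>0.1</meta:priority>
  </url>
</urlset>
XML;
$sitemap = new SimpleXMLElement($xmlstr);

// Everything declared on the root node, whether it is used or not.
$declared = $sitemap->getDocNamespaces();

// Only the namespaces that nodes in the document actually use.
$used = $sitemap->getNamespaces(true);

// Prefixed children are reached through children(); true means "this is a prefix".
$priority = (string) $sitemap->url->children('meta', true)->priority;

print_r(array_keys($declared));
print_r(array_keys($used));
echo $priority, "\n";
```

The foo namespace shows up in the declared list only, because no node in the document uses it.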
The tricky thing is to use namespaces and XPath with registerXPathNamespace(). The function creates a prefix/ns context for the next XPath query. In particular, this is helpful if the provider of the given XML document alters the namespace prefixes. registerXPathNamespace() will create a prefix for the associated namespace, allowing one to access nodes in that namespace without the need to change code to allow for the new prefixes dictated by the provider.
Example:
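$sitemap = new SimpleXMLElement($source,null,true);
// map the prefix "c" to the sitemap's default namespace for the next query
$sitemap->registerXPathNamespace('c', 'http://www.google.com/schemas/sitemap/0.84');
$result = $sitemap->xpath('//c:loc');
print_r($result);
foreach($result as $value) {
    echo $value."\n";
}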
Voila, our XPath query works now and lists all url locations. :]
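For a self-contained comparison, the following sketch runs the same kind of query with and without a registered prefix. The prefix sm and the two example URLs are arbitrary choices for this demo; the namespace URI is the one from our sitemap:

```php
<?php
$xmlstr = '<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">'
    . '<url><loc>http://example.com/</loc></url>'
    . '<url><loc>http://example.com/about/</loc></url>'
    . '</urlset>';
$sitemap = new SimpleXMLElement($xmlstr);

// Without a registered prefix the query finds nothing, because the
// loc nodes live in the default namespace of the document.
$without = $sitemap->xpath('//loc');

// Register any prefix you like for that namespace URI, then use it in the query.
$sitemap->registerXPathNamespace('sm', 'http://www.google.com/schemas/sitemap/0.84');
$with = $sitemap->xpath('//sm:loc');

echo count($with), " locations found\n";
```

The prefix only has to match inside your query. It does not need to appear in the document itself.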
Conclusion
We have come to the end of our little SimpleXML journey. As you see, SimpleXML is a very lightweight and easy-to-use XML parser that provides simple yet effective solutions to the most common XML needs.
If you need to make heavy changes to an XML document, then SimpleXML is not the way to go. We will have a look at a more suitable library for this, namely the DOM library, in a later article.
Thanks for reading! Have a good one. :)