Scala 2.11.7 and XML XSD Validation

I've posted before about some of the features I like in Scala's XML library, and even showed a simple CLI example of validating XML generated by Play. But if you're writing scripts that parse XML and convert it to other forms of data, it's always a good idea to check that the XML is valid before you do so.

As you know, XML has a mechanism for this in the form of schemas, and there are tools for checking a document against one, like the handy xmllint. But asking someone to read a man page, or even a README, can sometimes be like pulling teeth. So to protect yourself from your own users, I'd like to show you how to load XML in Scala and validate it on load. This lets you stop a program gracefully if the XML being handled is invalid, which is better than a NoSuchElementException from trying to get an item in a list that isn't there!

Luckily for the internet at large, there's already a decent blog post about what I'm covering here. I say decent because it's well written, has a lot of information, and was useful to me. However, the code it posts doesn't work, which is what inspired this post, because I believe that code examples that work are a beautiful thing. That's why you can find this code on GitHub, with tests and the ability to run it yourself if you've got sbt installed.

For our example we'll be loading the XML that I used in my previous post. The sample XML looks like this:

	<?xml version="1.0" encoding="UTF-8"?>
	<TestInfoList 
			xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
			xsi:noNamespaceSchemaLocation="http://localhost:9000/testInfo.xsd"
	>
			<TestInfo>
					<Id>1</Id>
					<Name>Name</Name>
					<Days>
							<Day>Monday</Day>
					</Days>
			</TestInfo>
	</TestInfoList>

This follows our schema and will validate, so loading it won't be a problem. For example, in Scala we could do the following:

	$ sbt
	> console
	[info] Starting scala interpreter...
	[info] 
	Welcome to Scala version 2.11.7 (OpenJDK 64-Bit Server VM, Java 1.7.0_55).
	Type in expressions to have them evaluated.
	Type :help for more information.

	scala> import scala.xml._
	import scala.xml._

	scala> val myXML = XML.loadFile("src/test/resources/sample.xml")
	myXML: scala.xml.Elem =
	<TestInfoList xsi:noNamespaceSchemaLocation="http://localhost:9000/testInfo.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
			<TestInfo>
					<Id>1</Id>
					<Name>Name</Name>
					<Days>
							<Day>Monday</Day>
					</Days>
			</TestInfo>
	</TestInfoList>

	scala> 

This is great: Scala's built-in XML parsing hands us an Elem whose data we can use right away. Say we had some function to retrieve the Day elements in the Days element of our XML:

	scala> def getDays(x: scala.xml.Elem) = { (x \ "TestInfo" \ "Days" \\ "Day").map(_.text).toList }
	getDays: (x: scala.xml.Elem)List[String]

	scala> getDays(myXML)
	res3: List[String] = List(Monday)

This works, but there's one problem. Take, for example, what happens when we load the invalid XML:

	scala> val badXml = XML.loadFile("src/test/resources/invalid-sample.xml")
	badXml: scala.xml.Elem =
	<TestInfoList xsi:noNamespaceSchemaLocation="http://localhost:9000/testInfo.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
			<TestInfoIncorrectElement>
					<Id>1</Id>
					<Name>Name</Name>
					<Days>
							<Day>Monday</Day>
					</Days>
			</TestInfoIncorrectElement>
	</TestInfoList>

	scala> getDays(badXml)
	res5: List[String] = List()

Now this is correct behavior as far as the function is concerned, but it's definitely not what we'd like: the mistake in the document is silently turned into an empty list. The main issue is that when we deal with XML, we lose all the type safety we could expect if we had passed a case class or a well-defined model to our function instead. However, if we replace the default scala.xml.XML.loadFile with our own implementation that validates the XML against a schema, we can ensure that a function written to handle a specific type of XML only ever receives that type of XML:

	import java.io._
	import javax.xml.XMLConstants
	import javax.xml.transform.stream.StreamSource
	import javax.xml.validation.SchemaFactory

	import org.xml.sax.InputSource

	import scala.xml.XML

	object LoadXmlWithSchema {
			def apply(filePath: String, schemaResource: String) = {
					val sf = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
					val s = sf.newSchema(new StreamSource(getClass().getClassLoader().getResourceAsStream(schemaResource)))
					val in = new FileInputStream(new File(filePath));
					val is = new InputSource(new InputStreamReader(in));
					new SchemaAwareFactoryAdapter(s).loadXML(is)
			}
	}

The SchemaAwareFactoryAdapter is the main piece of code taken from the blog post I mentioned, but the code above is what was necessary to get it off the ground. While the other post mentions the creation of the SchemaFactory, it must be using a different version of Java, as InputSource does not take a File as depicted. The above code handles this by creating an InputStreamReader from the FileInputStream and passing that to the InputSource.
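As an aside, InputSource also has a constructor that takes a raw InputStream. Here's a minimal sketch of that variant. It is not what the code above does, and it uses the JDK's DocumentBuilder rather than the SchemaAwareFactoryAdapter purely to stay self-contained; ByteStreamSource and rootText are names I made up for illustration. The idea is that when the parser is handed bytes instead of a Reader, it can pick the encoding up from the XML declaration itself:

```scala
import java.io.ByteArrayInputStream
import javax.xml.parsers.DocumentBuilderFactory
import org.xml.sax.InputSource

// Hypothetical helper, for illustration only.
object ByteStreamSource {
  // Parses raw bytes and returns the text content of the root element.
  // The InputSource wraps a byte stream (not a Reader), so the parser
  // determines the character encoding from the XML declaration.
  def rootText(bytes: Array[Byte]): String = {
    val source = new InputSource(new ByteArrayInputStream(bytes))
    val doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(source)
    doc.getDocumentElement.getTextContent
  }
}
```

With an InputStreamReader created without an explicit charset, the platform default encoding is used instead, so which constructor you pick may matter for files that aren't in that encoding.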

Since we're dealing with files and stream readers, you might think that you need to call .close on the opened streams, namely the FileInputStream and the InputStreamReader. However, the JavaDocs for InputSource say:

An InputSource object belongs to the application: the SAX parser shall never modify it in any way (it may modify a copy if necessary). However, standard processing of both byte and character streams is to close them on as part of end-of-parse cleanup, so applications should not attempt to re-use such streams after they have been handed to a parser.

But hey, better safe than sorry:

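	object LoadXmlWithSchema {
			def apply(filePath: String, schemaResource: String) = {
					val sf = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
					val s = sf.newSchema(new StreamSource(getClass().getClassLoader().getResourceAsStream(schemaResource)))
					val in = new FileInputStream(new File(filePath))
					val isr = new InputStreamReader(in)
					val is = new InputSource(isr)
					try {
							new SchemaAwareFactoryAdapter(s).loadXML(is)
					} finally {
							// Closing an already-closed stream has no effect, so this is
							// safe even if the parser cleaned up after itself.
							isr.close()
							in.close()
					}
			}
	}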

This nice trick is fairly common for resource management. A try/finally without a catch can be used to ensure that some code is always executed, even after a return statement or when an exception is thrown. Loading this up, we can try out our examples again:
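If you find yourself repeating that try/finally, it can be pulled out into a small helper. This is only a sketch: Loan.using and TrackedResource are hypothetical names of mine, not part of the project or of the Scala 2.11 standard library.

```scala
import java.io.Closeable

// Hypothetical helper, for illustration only.
object Loan {
  // Runs `body` with the resource and always closes the resource afterwards,
  // whether `body` returns normally or throws.
  def using[R <: Closeable, A](resource: R)(body: R => A): A =
    try body(resource) finally resource.close()
}

// Demo-only resource that records whether close() was called.
final class TrackedResource extends Closeable {
  var closed = false
  override def close(): Unit = { closed = true }
}
```

With something like this, the body of LoadXmlWithSchema could wrap the InputStreamReader in Loan.using and drop the explicit finally block.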

	> console
	[info] Starting scala interpreter...
	[info] 
	Welcome to Scala version 2.11.7 (OpenJDK 64-Bit Server VM, Java 1.7.0_55).
	Type in expressions to have them evaluated.
	Type :help for more information.

	scala> def getDays(x: scala.xml.Elem) = { (x \ "TestInfo" \ "Days" \\ "Day").map(_.text).toList }
	getDays: (x: scala.xml.Elem)List[String]

	scala> val myXml = com.github.edgecaseberg.LoadXmlWithSchema("src/test/resources/sample.xml","sample.xsd")
	myXml: scala.xml.Elem = <TestInfoList xsi:noNamespaceSchemaLocation="http://localhost:9000/testInfo.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><TestInfo><Id>1</Id><Name>Name</Name><Days><Day>Monday</Day></Days></TestInfo></TestInfoList>

All's good so far. Now what if we try to load the invalid XML?

	scala> val myXml = com.github.edgecaseberg.LoadXmlWithSchema("src/test/resources/invalid-sample.xml","sample.xsd")
	org.xml.sax.SAXParseException: cvc-complex-type.2.4.a: Invalid content was found starting with element 'TestInfoIncorrectElement'. One of '{TestInfo}' is expected.
		at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:198)
		at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.error(ErrorHandlerWrapper.java:134)
		...

Perfect! Now we can catch the exception, or wrap the call in Try from scala.util, to easily add error handling around improper XML!
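To give a feel for the Try approach, here's a minimal, self-contained sketch. It swaps in the JDK's own javax.xml.validation.Validator instead of LoadXmlWithSchema so that it runs without the project on the classpath, and the inline schema is a trimmed-down stand-in I wrote for this example, not the real sample.xsd. TryValidationSketch, validate and describe are hypothetical names.

```scala
import java.io.StringReader
import javax.xml.XMLConstants
import javax.xml.transform.stream.StreamSource
import javax.xml.validation.SchemaFactory
import scala.util.{Failure, Success, Try}

// Hypothetical object, for illustration only.
object TryValidationSketch {
  // Assumed, simplified schema: a TestInfoList holding zero or more TestInfo strings.
  val xsd =
    """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      |  <xs:element name="TestInfoList">
      |    <xs:complexType>
      |      <xs:sequence>
      |        <xs:element name="TestInfo" type="xs:string" minOccurs="0" maxOccurs="unbounded"/>
      |      </xs:sequence>
      |    </xs:complexType>
      |  </xs:element>
      |</xs:schema>""".stripMargin

  private val schema = SchemaFactory
    .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
    .newSchema(new StreamSource(new StringReader(xsd)))

  // Success(xml) when the document validates, Failure(exception) otherwise.
  def validate(xml: String): Try[String] = Try {
    schema.newValidator().validate(new StreamSource(new StringReader(xml)))
    xml
  }

  // Turns the outcome into a message instead of letting the exception escape.
  def describe(xml: String): String = validate(xml) match {
    case Success(_) => "valid"
    case Failure(e) => "invalid: " + e.getMessage
  }
}
```

The same shape should work around LoadXmlWithSchema: wrap the call in Try and pattern match on Success and Failure, or use getOrElse or recover if you want a fallback.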

Note:

If you want to run the above console session, you'll need to copy the xsd file from the test resources directory to the resources folder in the main directory. Otherwise:

	scala> val myXml = com.github.edgecaseberg.LoadXmlWithSchema("src/test/resources/sample.xml","src/main/resources/sample.xsd")
	org.xml.sax.SAXParseException: schema_reference.4: Failed to read schema document 'null', because 1) could not find the document; 2) the document could not be read; 3) the root element of the document is not <xsd:schema>.
		at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:198)
		at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.error(ErrorHandlerWrapper.java:134)
		at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:437)
		at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:347)
		at com.sun.org.apache.xerces.internal.impl.xs.traversers.XSDHandler.reportSchemaErr(XSDHandler.java:4166)
		at com.sun.org.apache.xerces.internal.impl.xs.traversers.XSDHandler.reportSchemaError(XSDHandler.java:4149)
		at com.sun.org.apache.xerces.internal.impl.xs.traversers.XSDHandler.getSchemaDocument1(XSDHandler.java:2484)
		at com.sun.org.apache.xerces.internal.impl.xs.traversers.XSDHandler.getSchemaDocument(XSDHandler.java:2187)
		at com.sun.org.apache.xerces.internal.impl.xs.traversers.XSDHandler.parseSchema(XSDHandler.java:573)
		at com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaLoader.loadSchema(XMLSchemaLoader.java:616)
		at com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaLoader.loadGrammar(XMLSchemaLoader.java:574)
		at com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaLoader.loadGrammar(XMLSchemaLoader.java:540)
		at com.sun.org.apache.xerces.internal.jaxp.validation.XMLSchemaFactory.newSchema(XMLSchemaFactory.java:252)
		at javax.xml.validation.SchemaFactory.newSchema(SchemaFactory.java:627)
		at com.github.edgecaseberg.LoadXmlWithSchema$.apply(LoadXmlWithSchema.scala:16)
		... 43 elided

This error occurs because the test directory isn't on the classpath unless you're running tests, so the classloader can't find the schema: getResourceAsStream returns null for a resource it can't locate, which is why the message mentions the schema document 'null'. You could adapt LoadXmlWithSchema to use a regular File or path, but I prefer the classloader, since the XSD files that pertain to XML used by the application should be kept in the project's own resources.
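If the 'null' in that message bothers you, one option is to check the classloader lookup yourself and fail with something more readable. A minimal sketch, with SchemaResource.open being a hypothetical helper rather than anything in the project:

```scala
import java.io.{FileNotFoundException, InputStream}

// Hypothetical helper, for illustration only.
object SchemaResource {
  // getResourceAsStream returns null when the resource is missing, so wrap the
  // result in Option and fail fast with a message that names the resource.
  def open(name: String): InputStream =
    Option(getClass.getClassLoader.getResourceAsStream(name)).getOrElse(
      throw new FileNotFoundException("Schema resource '" + name + "' is not on the classpath")
    )
}
```

The stream it returns could then be handed to new StreamSource(...) in place of the direct getResourceAsStream call.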

Taking it further:

So we can load XML, but it's doubtful we want to specify a schema every time we load XML of a certain type. Rather, we should be specifying the schema along with our own models. So let's do that:

	case class Box(id: Int, name: String, items: List[String]) 

	object BoxXmlLoader {
			def apply(filePath: String) = LoadXmlWithSchema(filePath, schemaResource = "box.xsd")   
	}

Then in the REPL we can use this to load up a file:

	scala> com.github.edgecaseberg.BoxXmlLoader("src/test/resources/box-sample.xml")
	res0: scala.xml.Elem = <Boxes xsi:noNamespaceSchemaLocation="https://raw.githubusercontent.com/EdgeCaseBerg/scala-xsd-validation/master/src/main/resources/box.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><Box><Id>0</Id><Name>First Box</Name><Contains><BoxedItem>Thing 1</BoxedItem><BoxedItem>Thing 2</BoxedItem></Contains></Box><Box><Id>2</Id><Name>Second Box</Name><Contains><BoxedItem>Second Thing 1</BoxedItem><BoxedItem>Second Thing 2</BoxedItem></Contains></Box></Boxes>

And as expected, it will fail on non-Box XML files:

	scala> com.github.edgecaseberg.BoxXmlLoader("src/test/resources/sample.xml")
	org.xml.sax.SAXParseException: cvc-elt.1: Cannot find the declaration of element 'TestInfoList'.
		at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:198)
		at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.error(ErrorHandlerWrapper.java:134)
		at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:437)
		at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:368)
		at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:325)
		at com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaValidator.handleStartElement(XMLSchemaValidator.java:1906)
		at com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaValidator.startElement(XMLSchemaValidator.java:746)
		at com.sun.org.apache.xerces.internal.jaxp.validation.ValidatorHandlerImpl.startElement(ValidatorHandlerImpl.java:570)
		at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.startElement(AbstractSAXParser.java:509)
		at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.scanStartElement(XMLNSDocumentScannerImpl.java:378)
		at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl$NSContentDriver.scanRootElementHook(XMLNSDocumentScannerImpl.java:604)
		at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:3122)
		at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:880)
		at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
		at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:117)
		at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
		at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
		at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
		at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
		at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
		at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:649)
		at com.github.edgecaseberg.SchemaAwareFactoryAdapter.loadXML(SchemaAwareFactoryAdapter.scala:37)
		at com.github.edgecaseberg.LoadXmlWithSchema$.apply(LoadXmlWithSchema.scala:21)
		at com.github.edgecaseberg.BoxXmlLoader$.apply(Models.scala:7)
		... 43 elided

But doing this isn't that helpful as far as getting more type safety goes. It'd be better to convert the XML file to models that the rest of our code can then use:

	object BoxXmlLoader {
			def apply(filePath: String) = {
					val loadedXml = LoadXmlWithSchema(filePath, "box.xsd")
					(loadedXml \\ "Box").map { boxNode =>
							Box(
									(boxNode \ "Id").text.toInt,
									(boxNode \ "Name").text,
									(boxNode \ "Contains" \\ "BoxedItem").map(_.text).toList
							)
					}
			}
	}

Because we've validated the XML with our schema, we can be sure that our toInt call won't fail (assuming the schema declares Id as an integer type that fits in an Int, such as xs:int) and that we'll be able to transform the XML into a model without any odd things happening:
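If the schema is looser than that, say Id is declared as xs:string or as the unbounded xs:integer, toInt can still throw. Here's a small defensive sketch for that situation; SafeId.parseId is a hypothetical helper, not something in the project:

```scala
import scala.util.Try

// Hypothetical helper, for illustration only.
object SafeId {
  // Returns Some(n) for text that parses as an Int, and None for anything else,
  // including numbers too large to fit in an Int.
  def parseId(text: String): Option[Int] = Try(text.trim.toInt).toOption
}
```

Whether None should become an error or a default is a decision for the application; with a strict schema you may never need this at all.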

	scala> com.github.edgecaseberg.BoxXmlLoader("src/test/resources/box-sample.xml")
	res0: scala.collection.immutable.Seq[com.github.edgecaseberg.Box] = List(Box(0,First Box,List(Thing 1, Thing 2)), Box(2,Second Box,List(Second Thing 1, Second Thing 2)))

Since every child of the Box element was required, the above has no example of dealing with elements that might not exist or that need default values. Let's look at an example of that for completeness. Take the following schema:

	<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>
		<xs:element name="Forest">
			<xs:complexType>
				<xs:sequence>
					<xs:element ref="Tree" minOccurs='0' maxOccurs='unbounded'/>
				</xs:sequence>
			</xs:complexType>
		</xs:element>

		<xs:element name="Tree">
			<xs:complexType>
				<xs:sequence>
					<xs:element ref="State" minOccurs='0' maxOccurs='1'/>
					<xs:element ref="Leaves" minOccurs='1' maxOccurs='1'/>
				</xs:sequence>
			</xs:complexType>
		</xs:element>

		<xs:simpleType name="TreeState" final="restriction" >
			<xs:restriction base="xs:string">
				<xs:enumeration value="ALIVE" />
				<xs:enumeration value="DEAD" />
			</xs:restriction>
		</xs:simpleType>

		<xs:element name="State" type='TreeState'/>
		<xs:element name="Leaves">
			<xs:complexType>
				<xs:sequence>
					<xs:element ref="Leaf" minOccurs='0' maxOccurs='unbounded'/>
				</xs:sequence>
			</xs:complexType>
		</xs:element>

		<xs:element name="Leaf">
			<xs:complexType>
				<xs:attribute name="color" type="xs:string"/>
			</xs:complexType>
		</xs:element>
	</xs:schema>

The key point here is that the State element has a minOccurs value of 0 and therefore is an optional element when it's inside a Tree. It's up to our application to make the decision of what the default should be. So converting our XSD file into a model:

	object TreeState extends Enumeration {
		type TreeState = Value
		val ALIVE = Value("ALIVE")
		val DEAD = Value("DEAD")
	}

	case class Tree(state: TreeState.Value, leaves: List[String])

Note that Value("ALIVE") explicitly sets the name that the enumeration value prints with and that withName looks up; without it, Scala would derive the name from the val's identifier by reflection. Since the model does not have the state as an Option, we need to be sure that our parser sets the default. Let's say that the default is for dead trees (being pessimistic here), so our parsing code could look like this:

	object TreeXmlLoader {
		def apply(filePath: String) = {
			val loadedXml = LoadXmlWithSchema(filePath, "tree.xsd")
			(loadedXml \\ "Tree").map { treeNode =>
				Tree(
					// State is optional (minOccurs='0'), so fall back to DEAD when it is absent.
					// The schema's enumeration guarantees that withName will find a match.
					(treeNode \ "State").headOption
						.map(stateNode => TreeState.withName(stateNode.text))
						.getOrElse(TreeState.DEAD),
					(treeNode \ "Leaves" \\ "Leaf" \\ "@color").map(_.text).toList
				)
			}
		}
	}

Easy enough, right? The only other thing to note about the code above is that we're pulling the leaf color from the attribute of the node via @color in our XPath-like query.
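The defaulting logic can also be looked at on its own, without any XML involved. The sketch below repeats the TreeState enumeration so that it is self-contained; StateDefault.parseState is a hypothetical helper that takes the optional text of a State element and applies the same "dead unless told otherwise" rule:

```scala
// Same enumeration as in the model above, repeated so the sketch stands alone.
object TreeState extends Enumeration {
  type TreeState = Value
  val ALIVE = Value("ALIVE")
  val DEAD = Value("DEAD")
}

// Hypothetical helper, for illustration only.
object StateDefault {
  // None models a missing State element; Some(text) models a present one.
  // withName throws NoSuchElementException for names the enumeration doesn't
  // know, which is why validating against the schema first matters.
  def parseState(raw: Option[String]): TreeState.Value =
    raw.map(text => TreeState.withName(text)).getOrElse(TreeState.DEAD)
}
```

Keeping the default in one small function like this makes it easy to test, and easy to change if the application later decides that trees should be alive by default.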

Hopefully this helps anyone dealing with XML out there to get a handle on their models and be a bit safer in parsing their XML into data that they can be sure of. If you're not validating, you'll never know when a runtime error might blow up on you and bring down part of your application. Better safe than sorry!