Basic Rules for using XML tags
- Tags are case sensitive. <TITLE> and <Title> are different tags.
- No whitespace is allowed at the beginning of a tag, so < TITLE> is invalid. Whitespace at the end of a tag is allowed, as in <TITLE >.
- The tag name must start with either a letter or an underscore.
- The tag name may contain any of the following: letters, numerals, hyphens (-), periods (.), or underscores (_).
- Each tag must be closed. For example, <TITLE></TITLE>.
- Each element must be properly nested. If a tag is opened inside an element, it must also be closed inside that element. In HTML, overlapping tags such as the following will work:
<B><FONT>text</B></FONT>
- In XML, this is not valid because the font tag is opened within the bold tag and is closed after the end of the bold tag. The correct form is:
<B><FONT>text</FONT></B>
- Any data that needs to be displayed should be stored as an element.
- Any data meant to modify the way an element displays should be stored as an attribute.
- An attribute consists of a property name, an equals sign, and the property value in quotation marks (e.g. status="paid").
- The property name is case sensitive. An attribute named Status is different from status.
- There can never be two properties of the same name in any one tag.
- There can be more than one attribute per tag.
- There must be quotation marks around the value of an attribute. Either single quotes or double quotes may be used.
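The tag and attribute rules above can be checked mechanically. The following is a minimal sketch that assumes Python's standard-library xml.etree.ElementTree parser as the checker; the helper name is_well_formed and the sample strings are illustrative and not part of the original text.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: str) -> bool:
    """Return True if the standard-library parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# Hypothetical samples, one per rule in the list above.
samples = [
    "<TITLE></TITLE>",                       # matching open and close tags
    "<TITLE></Title>",                       # case differs, so the tag is never closed
    "< TITLE></TITLE>",                      # whitespace at the beginning of a tag
    "<TITLE ></TITLE >",                     # whitespace at the end of a tag
    "<B><FONT>text</B></FONT>",              # overlapping (badly nested) tags
    "<B><FONT>text</FONT></B>",              # properly nested tags
    "<DATE status=paid/>",                   # attribute value without quotes
    "<DATE status='paid'/>",                 # single quotes are fine
    '<DATE status="paid" status="due"/>',    # same attribute name twice
    '<DATE status="paid" Status="due"/>',    # names differ by case, so both are allowed
]

for sample in samples:
    verdict = "well-formed" if is_well_formed(sample) else "rejected"
    print(f"{verdict:12} {sample}")
```

The parser only reports whether a document is well-formed; it does not say which rule was broken beyond the message carried by the ParseError.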
<?xml version="1.0"?>
<?xml version="1.0" encoding="UTF-8" ?>
<?xml version="1.0" standalone="yes" ?>
- version – sets the version of the XML specification being used by the XML document.
- encoding – defines the character encoding. The default is UTF-8.
- standalone – declares whether or not the XML document depends on external markup declarations that must be processed, such as an external document type definition (DTD).
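To see how a parser reports these three fields, here is a small sketch that assumes Python's standard-library xml.parsers.expat module. The function name read_declaration is made up for illustration; XmlDeclHandler is the expat callback that receives the declaration.

```python
from xml.parsers import expat


def read_declaration(document: bytes) -> dict:
    """Collect the fields of the XML declaration, if the document has one."""
    found = {}

    def on_declaration(version, encoding, standalone):
        # expat passes standalone as 1 (yes), 0 (no) or -1 (not given).
        found["version"] = version
        found["encoding"] = encoding
        found["standalone"] = {1: "yes", 0: "no"}.get(standalone)

    parser = expat.ParserCreate()
    parser.XmlDeclHandler = on_declaration
    parser.Parse(document, True)
    return found


print(read_declaration(b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?><SHOWS/>'))
print(read_declaration(b'<?xml version="1.0"?><SHOWS/>'))
```

Fields that the declaration leaves out come back as None in this sketch, which is how the omitted encoding and standalone parts of the first sample declaration would appear.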
Document Type Definition (DTD)
A DTD describes the structural requirements of an XML document. This means that a DTD can define the following:
- The elements and attributes that can appear in a document
- Which elements are child elements and what number, order, and placement they must have
- The default values for attributes
Example of DTD file
<?xml version="1.0" encoding="UTF-8"?>
<!ELEMENT SHOWS (PERFORMANCE*)>
<!ELEMENT PERFORMANCE (TITLE?, AUTHOR?, DESCRIPTION?, DATE?)+ >
<!ELEMENT TITLE (#PCDATA)>
<!ELEMENT AUTHOR (#PCDATA)>
<!ELEMENT DESCRIPTION (#PCDATA)>
<!ELEMENT DATE (#PCDATA)>
<!ATTLIST DATE status (canceled) #IMPLIED>
- Each element declaration begins with the name of the element it’s defining, followed by the allowed contents. For example, the element TITLE may contain #PCDATA: <!ELEMENT TITLE (#PCDATA)>
- Possible values for contents include
- A list of other elements
- The keyword EMPTY (no contents)
- The keyword ANY (anything possible)
- The keyword #PCDATA (parsed character data only)
- A mix of #PCDATA and other elements (mixed content), such as (#PCDATA | TITLE)*
- The elements can be combined using the following operators:
- Comma (,) is used as an and operator that also fixes the order. Example, (TITLE, AUTHOR). The element must have one TITLE element followed by one AUTHOR element as children.
- Pipe (|) used as an or operator. Example, (TITLE | AUTHOR). The element must have either a TITLE or an AUTHOR child element.
- The question mark (?) means that the element is optional. Example, (AUTHOR, TITLE?). The element must have a child AUTHOR element and may also have a child TITLE element.
- The plus sign (+) is used to signify one or more. Example, (TITLE+). The element must have at least one TITLE child element.
- The asterisk (*) is used to signify zero or more. Example, (TITLE*). The element can have any number of child elements named TITLE, including none.
- Parentheses are used to group expressions. Example, (A | (B, C)) means that the element must have either an A child element or both B and C child elements, in that order.
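The operators above behave much like regular-expression operators applied to the sequence of child element names. The sketch below makes that analogy concrete by translating a content model into a Python regular expression. It is a toy illustration, not how a validating parser is implemented, and it handles only element names and the operators listed above (no #PCDATA, EMPTY or ANY). The function names are made up for this example.

```python
import re

# An element name: starts with a letter or underscore, then letters,
# numerals, hyphens, periods or underscores.
NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")


def content_model_to_regex(model: str) -> re.Pattern:
    """Translate a DTD content model such as "(A | (B, C))" into a regex."""
    pattern = re.sub(r"\s+", "", model)       # whitespace is not significant
    pattern = pattern.replace("(", "(?:")     # DTD groups become regex groups
    # Each name becomes a group that matches the name plus a separator space.
    pattern = NAME.sub(lambda m: "(?:" + re.escape(m.group(0)) + " )", pattern)
    pattern = pattern.replace(",", "")        # sequence is plain concatenation
    return re.compile(pattern)


def children_match(model: str, children: list) -> bool:
    """Check a list of child element names against a content model."""
    names = "".join(child + " " for child in children)
    return content_model_to_regex(model).fullmatch(names) is not None


print(children_match("(TITLE, AUTHOR)", ["TITLE", "AUTHOR"]))
print(children_match("(TITLE, AUTHOR)", ["AUTHOR", "TITLE"]))
print(children_match("(A | (B, C))", ["B", "C"]))
```

The ?, + and * characters need no translation because they mean the same thing in both notations, and | is the or operator in both.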
<!ATTLIST DATE status (canceled) #IMPLIED>
The declaration begins with the name of the element whose attributes are being described, followed by the name of the attribute and then either its data type or a list of literal values that it can have. Last is the behaviour of the attribute. Possible data types and values used to describe attributes are:
- An enumerated list of the values that are allowed in the name/value pair. Example, (canceled | onschedule).
- CDATA – governed by the same rules regarding content as text data found within elements.
- ID – gives an element a label guaranteed to be unique in the document.
Behaviour of the attribute can be any of the following:
- When a string in quotes is given, it becomes the default value. If the user doesn’t include the attribute, it will be created with the default value in the document structure when it is parsed.
- #IMPLIED – optional attribute.
- #REQUIRED – required attribute and no default value.
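The default-value behaviour can be observed with a parser that reads attribute declarations. The sketch below assumes Python's standard-library xml.parsers.expat module and places the ATTLIST declaration in an internal DTD subset so the example stays in one string; the DTD shown earlier is an external file instead. The function name and the onschedule default are illustrative.

```python
from xml.parsers import expat

# The behaviour part of the ATTLIST declaration is filled in per call.
TEMPLATE = """<!DOCTYPE DATE [
  <!ELEMENT DATE (#PCDATA)>
  <!ATTLIST DATE status (canceled | onschedule) {behaviour}>
]>
{element}"""


def attributes_seen(behaviour: str, element: str) -> dict:
    """Return the attributes the parser reports for the DATE element."""
    seen = {}

    def on_start(name, attributes):
        seen.update(attributes)

    parser = expat.ParserCreate()
    parser.StartElementHandler = on_start
    parser.Parse(TEMPLATE.format(behaviour=behaviour, element=element), True)
    return seen


# A quoted string is a default: the attribute appears even though it was omitted.
print(attributes_seen('"onschedule"', "<DATE>09/11/2001</DATE>"))
# An explicit value overrides the default.
print(attributes_seen('"onschedule"', '<DATE status="canceled">09/11/2001</DATE>'))
# #IMPLIED is optional with no default, so nothing is added.
print(attributes_seen("#IMPLIED", "<DATE>09/11/2001</DATE>"))
```

Because expat is a non-validating parser, a missing #REQUIRED attribute would not be reported as an error here; only a validating parser enforces it.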
DTD Validation in XML
<?xml version="1.0"?>
<!DOCTYPE SHOWS SYSTEM "shows.dtd">
<SHOWS>
  <PERFORMANCE>
    <TITLE>Fairy Princess</TITLE>
    <AUTHOR/>
    <DESCRIPTION> Scratch sound with emphasis on color, texture. </DESCRIPTION>
    <DATE status="canceled">09/11/2001</DATE>
  </PERFORMANCE>
</SHOWS>
- The DOCTYPE declaration begins with the name of the root element to which the DTD applies; in this case it is SHOWS.
- SYSTEM keyword – the DTD is unpublished, and the string that follows gives the location of the DTD for this XML document.
- PUBLIC keyword – the DTD is available to the public for validating documents. Usually used when XML documents are being passed between companies.
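The parts of the DOCTYPE declaration can be read back with a parser. This is a small sketch assuming Python's standard-library xml.parsers.expat module; the function name read_doctype and the public identifier in the second sample are hypothetical. The parser here only reports the declaration and does not fetch or apply shows.dtd.

```python
from xml.parsers import expat


def read_doctype(document: str) -> tuple:
    """Return (root element name, system identifier, public identifier)."""
    found = []

    def on_doctype(name, system_id, public_id, has_internal_subset):
        found.append((name, system_id, public_id))

    parser = expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(document, True)
    return found[0]


system_doc = '<?xml version="1.0"?><!DOCTYPE SHOWS SYSTEM "shows.dtd"><SHOWS/>'
# Hypothetical public identifier, used only to show the PUBLIC form.
public_doc = '<!DOCTYPE SHOWS PUBLIC "-//Example//DTD Shows//EN" "shows.dtd"><SHOWS/>'

print(read_doctype(system_doc))
print(read_doctype(public_doc))
```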
Schemas are intended to replace DTDs and are written in XML. Thus, they are extensible for future additions. Advantages of schemas over DTDs:
- Schemas are XML documents themselves; they can be validated and programmatically extended.
- Schemas have the ability to describe the data type of element text data.
- A schema describes elements and attributes within a namespace, which means that adding elements to the validated XML document won’t break the validation, provided they come from a different namespace.
Some characters may not appear as literal text in an XML document because they are delimiters to the XML parser. For example, the less-than character (<) will cause the parser to report an error. Such a character needs to be replaced with its character entity. Of the two lines below, the first is invalid and the second is correct:
<TITLE> less than : < </TITLE>
<TITLE> less than : &lt; </TITLE>
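When XML is generated by a program, the replacement is usually done by a library call rather than by hand. The sketch below assumes Python's standard library, using xml.sax.saxutils.escape for the replacement and xml.etree.ElementTree to confirm the round trip; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw_text = " less than : < "

# escape() replaces &, < and > with their character entities.
escaped = escape(raw_text)
document = f"<TITLE>{escaped}</TITLE>"
print(document)

# Parsing turns the entity back into the original character.
parsed = ET.fromstring(document).text
print(parsed == raw_text)

# The unescaped form is rejected by the parser.
try:
    ET.fromstring(f"<TITLE>{raw_text}</TITLE>")
except ET.ParseError as error:
    print("rejected:", error)
```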
- CDATA sections are areas in which the parser doesn’t process the XML data.
- The parser knows that this part of the document contains no markup, just text.
- The parser can handle characters that would normally delimit markup because it’s not looking for any markup.
<![CDATA[ your data ]]>
- All characters inside the innermost square brackets are treated as text with no markup.
- The sequence of characters ]]> cannot be part of the text in a CDATA section because it ends the section. Outside a CDATA section it is represented by ]]&gt;.
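A short sketch of how a CDATA section reaches the application, assuming Python's standard-library xml.etree.ElementTree parser. The sample text is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Delimiter characters inside the CDATA section are passed through as text.
with_cdata = "<TITLE><![CDATA[ 1 < 2 & 3 ]]></TITLE>"
cdata_text = ET.fromstring(with_cdata).text
print(repr(cdata_text))

# Outside a CDATA section, the closing sequence is written with an entity.
with_entity = "<TITLE>a ]]&gt; b</TITLE>"
entity_text = ET.fromstring(with_entity).text
print(repr(entity_text))
```

The application sees plain text in both cases; the CDATA markers themselves are not part of the element's content.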
Rules for XML Comments
<!-- comment goes here -->
- Can be placed anywhere in the document, except before the XML declaration or inside a tag.
- <!-- starts a comment, and --> ends a comment.
- Comments may not be nested.
- Double dashes (--) cannot be used within a comment because this is the delimiter that tells the processor that the comment is finished. Example:
<!-- --extra double dashes causes an error-- -->
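The comment rules can be tried out against a parser. This is a minimal sketch assuming Python's standard-library xml.etree.ElementTree; the helper name parses and the sample documents are illustrative.

```python
import xml.etree.ElementTree as ET


def parses(document: str) -> bool:
    """Return True if the standard-library parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


samples = [
    '<?xml version="1.0"?><!-- ok --><SHOWS/>',           # after the declaration
    '<!-- too early --><?xml version="1.0"?><SHOWS/>',    # before the declaration
    "<SHOWS><!-- inside content --></SHOWS>",             # between tags
    "<SHOWS <!-- inside a tag --> />",                    # inside a tag
    "<SHOWS><!-- --extra double dashes causes an error-- --></SHOWS>",
    "<SHOWS><!-- outer <!-- inner --> --></SHOWS>",       # nested comment
]

for sample in samples:
    print("accepted" if parses(sample) else "rejected", sample)
```

The nested case fails for the same reason as the double-dash case: the inner <!-- already contains two consecutive dashes.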
Well-Formed & Validated Documents
- A single root element
- Properly nested tags
- Properly closed tags
- Attribute values within quotation marks
- Only one value per attribute
- No unescaped delimiter characters, such as < or &, in text
- The XML file is parsed immediately upon being loaded. If any part of the file is in violation of a well-formed document rule, an error will be displayed.
- If the XML parser is a validating parser, it will read the DTD or schema associated with the XML document to determine whether the document conforms to it. If it conforms, processing continues; otherwise an error is displayed and, depending upon the parser, processing may be discontinued.
- If the parser is non-validating, it may still read the DTD or schema, but it does not check that the XML document conforms to it.
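The difference between the two kinds of parser can be seen with Python's standard-library xml.etree.ElementTree, which is non-validating. In the sketch below the document breaks its own DTD (an UNDECLARED child where only PERFORMANCE is allowed) and is still accepted, while a document that is not well-formed is rejected. The DTD is placed in an internal subset for brevity, and the element name UNDECLARED is made up for the example.

```python
import xml.etree.ElementTree as ET

# Well-formed, but not valid against the DTD it declares.
INVALID_BUT_WELL_FORMED = """<?xml version="1.0"?>
<!DOCTYPE SHOWS [
  <!ELEMENT SHOWS (PERFORMANCE*)>
  <!ELEMENT PERFORMANCE (#PCDATA)>
]>
<SHOWS><UNDECLARED/></SHOWS>"""

# A non-validating parser reads the DTD but does not enforce it.
root = ET.fromstring(INVALID_BUT_WELL_FORMED)
child_tags = [child.tag for child in root]
print(child_tags)

# Well-formedness is always enforced: the closing tag is missing here.
NOT_WELL_FORMED = "<SHOWS><PERFORMANCE></SHOWS>"
try:
    ET.fromstring(NOT_WELL_FORMED)
    well_formed = True
except ET.ParseError as error:
    well_formed = False
    print("rejected:", error)
```

Checking the document against its DTD or schema would need a validating parser, which the Python standard library does not include.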