A Document Type Definition (DTD) defines the legal building blocks of an XML document. It specifies the allowed elements and attributes, their order and nesting, and, to a limited extent, the kind of data they hold. In effect, a DTD acts as a grammar for a class of XML documents, which keeps them consistent and makes validation possible.
Key Concepts
- Validation: A DTD allows you to validate an XML document. Validation checks whether the XML document conforms to the rules defined in the DTD. If the document follows the rules, it's considered valid with respect to that DTD. If not, it's invalid.
- Structure and Content Rules: The DTD defines:
  - Which elements are allowed.
  - Which attributes each element can have.
  - The allowed order and nesting of elements (the hierarchy).
  - The kind of content each element holds and the type of each attribute value (although DTD data typing is limited).
- External or Internal: A DTD can be:
  - External DTD: A separate file (usually with a .dtd extension) that is referenced from the XML document. This is the most common approach, because one DTD can be shared by many documents.
  - Internal DTD: Defined within the XML document itself, between the square brackets of the <!DOCTYPE> declaration (the internal subset). Used for smaller, document-specific rules.
- Not Required: An XML document does not have to have a DTD. If it doesn't, it can still be well-formed (following the basic XML syntax rules), but it cannot be validated against a DTD.
Basic DTD Syntax and Declarations
A DTD consists of a set of declarations that define the elements, attributes, entities, and notations. Here are the key declaration types:
- <!ELEMENT ...>: Defines an element.
  - Syntax: <!ELEMENT elementName contentModel>
  - elementName: The name of the element.
  - contentModel: Specifies what the element can contain. This is where the structure is defined.
    - EMPTY: The element must be empty (e.g., <br />). Written without parentheses: <!ELEMENT br EMPTY>.
    - ANY: The element can contain text and any declared elements, in any order. Also written without parentheses. Generally discouraged.
    - (#PCDATA): The element can contain only parsed character data (text).
    - (elementName): The element must contain exactly one child element of that name.
    - (element1, element2, ...): A sequence of elements, in the specified order.
    - (element1 | element2 | ...): A choice of elements (exactly one of the listed elements must appear).
    - elementName*: Zero or more occurrences of the element.
    - elementName+: One or more occurrences of the element.
    - elementName?: Zero or one occurrence of the element (optional).
    - You can combine these to create complex content models. The occurrence indicators *, +, and ? also apply to a parenthesized group, as in (author | editor)+.
- <!ATTLIST ...>: Defines the attributes for an element.
  - Syntax: <!ATTLIST elementName attributeName attributeType attributeDefault>
  - elementName: The name of the element to which the attributes apply.
  - attributeName: The name of the attribute.
  - attributeType: The type of data allowed for the attribute value. Common types:
    - CDATA: Character data (text). This is the most common type.
    - ID: A unique identifier. The value must be a valid XML name and unique within the document, and an element can have at most one ID attribute.
    - IDREF: A reference to the ID value of an element; it must match an ID that exists in the document.
    - IDREFS: A whitespace-separated list of IDREF values.
    - (value1 | value2 | ...): An enumerated list of allowed values.
    - NMTOKEN: A name token, made up of one or more name characters: letters, digits, hyphens (-), underscores (_), colons (:), or full stops (.). Unlike an XML name, it may start with any of these characters, including a digit.
    - NMTOKENS: A whitespace-separated list of NMTOKEN values.
  - attributeDefault: Specifies the default behavior of the attribute.
    - #REQUIRED: The attribute must be present.
    - #IMPLIED: The attribute is optional and has no default value.
    - #FIXED "value": The attribute always has the specified value. It may be omitted, in which case the parser supplies the value; if it is present, it must match exactly.
    - "defaultValue": A default value that will be used if the attribute is not specified.
- <!ENTITY ...>: Defines an entity (a reusable piece of text or markup).
  - Syntax:
    - <!ENTITY entityName "entityValue"> (Internal entity)
    - <!ENTITY entityName SYSTEM "URI"> (External entity, which refers to an external file)
  - Purpose: A way of giving a name to a fragment of text, XML, or even an external resource. The document refers to the entity as &entityName;.
  - Example: <!ENTITY author "J.R.R. Tolkien"> lets the document write &author; wherever the name should appear.
- <!NOTATION ...>: Defines a notation (used for specifying the format of non-XML data). This is rarely used directly in modern XML.
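The declaration types above can be combined in a single internal DTD. The following is a minimal sketch rather than a reference design: the element names (library, magazine, editor, summary), the attribute names (id, follows, format, lang), and the publisher entity with its value are all hypothetical, chosen only to exercise a sequence, a choice, occurrence indicators, several attribute types and defaults, and an internal entity.

```xml
<?xml version="1.0"?>
<!DOCTYPE library [
  <!-- Sequence with occurrence indicators: at least one book, then any number of magazines -->
  <!ELEMENT library (book+, magazine*)>
  <!-- A title, then one or more authors or editors, then an optional summary -->
  <!ELEMENT book (title, (author | editor)+, summary?)>
  <!ELEMENT magazine (title)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT author (#PCDATA)>
  <!ELEMENT editor (#PCDATA)>
  <!ELEMENT summary (#PCDATA)>

  <!-- Several attributes declared for one element (all names are hypothetical) -->
  <!ATTLIST book
    id      ID                      #REQUIRED
    follows IDREF                   #IMPLIED
    format  (hardcover | paperback) "paperback"
    lang    NMTOKEN                 #FIXED "en">

  <!-- Internal entity, referenced below as &publisher; -->
  <!ENTITY publisher "Example Press">
]>
<library>
  <book id="b1" format="hardcover">
    <title>First Volume</title>
    <author>A. Writer</author>
    <summary>Published by &publisher;.</summary>
  </book>
  <book id="b2" follows="b1">
    <title>Second Volume</title>
    <editor>An Editor</editor>
  </book>
</library>
```

In this sketch the second book omits format and lang, so a parser that reads the DTD should report format as "paperback" (the declared default) and lang as "en" (the fixed value). The follows attribute is only valid because an element with the ID b1 exists in the same document.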
Example: DTD for a Bookstore
<!ELEMENT bookstore (book+)>
<!ELEMENT book (title, author, year, price)>
<!ATTLIST book category CDATA #REQUIRED>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT year (#PCDATA)>
<!ELEMENT price (#PCDATA)>
Explanation:
- <!ELEMENT bookstore (book+)>: The bookstore element must contain one or more book elements.
- <!ELEMENT book (title, author, year, price)>: A book element must contain a title, author, year, and price element, in that order.
- <!ATTLIST book category CDATA #REQUIRED>: Every book element must have a category attribute, whose value is of type CDATA (character data).
- <!ELEMENT title (#PCDATA)>, etc.: The title, author, year, and price elements can contain parsed character data (text).
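To see what these rules reject, consider a document that is well-formed but not valid against the DTD above. This is an illustrative sketch: it assumes the DTD has been saved as bookstore.dtd next to the document, and the book data, including the price, is made up for the example.

```xml
<?xml version="1.0"?>
<!DOCTYPE bookstore SYSTEM "bookstore.dtd">
<bookstore>
  <!-- Invalid: the required category attribute is missing -->
  <book>
    <title>The Hobbit</title>
    <!-- Invalid: year appears before author, but the DTD fixes the order -->
    <year>1937</year>
    <author>J.R.R. Tolkien</author>
    <price>9.99</price>
  </book>
</bookstore>
```

Every tag is properly closed and nested, so a non-validating parser would accept this document. A validating parser should report both problems; adding a category attribute to book and moving author ahead of year would make the document valid.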
Linking a DTD to an XML Document
- Internal DTD:
<?xml version="1.0"?>
<!DOCTYPE bookstore [
<!ELEMENT bookstore (book+)>
<!ELEMENT book (title, author, year, price)>
<!ATTLIST book category CDATA #REQUIRED>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT year (#PCDATA)>
<!ELEMENT price (#PCDATA)>
]>
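<bookstore>
  <!-- Sample entry with illustrative values; the DTD requires at least one book -->
  <book category="fantasy">
    <title>The Hobbit</title>
    <author>J.R.R. Tolkien</author>
    <year>1937</year>
    <price>9.99</price>
  </book>
</bookstore>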
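- External DTD (Public Identifier - Rarely Used): The public identifier names the DTD, and the system identifier that follows it tells the parser where to find the file.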
<?xml version="1.0"?>
<!DOCTYPE bookstore PUBLIC "-//Example//DTD Bookstore 1.0//EN"
"http://www.example.com/bookstore.dtd">
<bookstore>
  <!-- one or more book elements go here -->
</bookstore>
- External DTD (System Identifier - Most Common):
<?xml version="1.0"?>
<!DOCTYPE bookstore SYSTEM "bookstore.dtd">
<bookstore>
  <!-- one or more book elements go here -->
</bookstore>
Advantages of Using DTDs
- Validation: Ensures that XML documents conform to a predefined structure.
- Consistency: Maintains consistency across multiple XML documents.
- Data Integrity: Helps prevent errors by enforcing rules on the data.
- Interoperability: Facilitates data exchange between different systems by providing a shared understanding of the data structure.
- Documentation: Can serve as documentation for the expected XML structure.
Limitations of DTDs
- Limited Data Typing: DTDs have limited support for data types (mostly just text). You can't specify that an element must contain a number, date, or other specific data type in a strong way.
- Complex Syntax: The DTD syntax can be complex and less intuitive than other schema languages.
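- No Namespace Support: DTDs are not namespace-aware. A DTD treats a prefixed name such as xhtml:p as a literal element name, so a document that binds the same namespace to a different prefix fails validation. This can be a problem in complex XML applications.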
- Not XML-Based: DTDs have their own syntax, which is not XML. This can be seen as inconsistent.
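The data typing limitation is easy to demonstrate. In the hedged sketch below, which uses a cut-down, hypothetical book structure, the year and price elements are declared as (#PCDATA), the closest a DTD gets to a type for element content, and the document remains valid even though neither value is a number.

```xml
<?xml version="1.0"?>
<!DOCTYPE book [
  <!ELEMENT book (title, year, price)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT year (#PCDATA)>
  <!ELEMENT price (#PCDATA)>
]>
<!-- Valid against the DTD above, although year and price are not numbers -->
<book>
  <title>Untitled</title>
  <year>sometime last century</year>
  <price>ask at the counter</price>
</book>
```

Catching values like these requires either application code or a schema language with real data types.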
Alternatives to DTDs
Because of the limitations of DTDs, other XML schema languages have become more popular:
- XML Schema (XSD): Much more powerful and flexible than DTDs. XSDs are written in XML, support namespaces, and provide rich data typing. This is the most common and recommended schema language for most applications.
- RELAX NG: Another XML-based schema language that is often considered simpler and more readable than XSD.
- Schematron: A rule-based validation language that uses XPath expressions to define constraints. It's often used in conjunction with other schema languages.
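For comparison, here is a rough sketch of how the bookstore structure from the DTD example might look in XSD. It is one possible translation, not a canonical one: the choice of xs:gYear for year and xs:decimal for price is an assumption about what those elements should hold, since the DTD itself only says they contain text.

```xml
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="bookstore">
    <xs:complexType>
      <xs:sequence>
        <!-- minOccurs defaults to 1, so this corresponds to (book+) in the DTD -->
        <xs:element name="book" maxOccurs="unbounded">
          <xs:complexType>
            <!-- Same fixed order as (title, author, year, price) -->
            <xs:sequence>
              <xs:element name="title" type="xs:string"/>
              <xs:element name="author" type="xs:string"/>
              <!-- Assumed types: a DTD cannot express these constraints -->
              <xs:element name="year" type="xs:gYear"/>
              <xs:element name="price" type="xs:decimal"/>
            </xs:sequence>
            <!-- Corresponds to category CDATA #REQUIRED -->
            <xs:attribute name="category" type="xs:string" use="required"/>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

The schema is considerably longer than the seven-line DTD, but it is itself an XML document, and a validator using it should reject a year or price that is not a well-formed value of the declared type.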
In Summary
DTDs provide a way to define the structure and content rules for XML documents, enabling validation and ensuring consistency. While they have limitations compared to more modern schema languages like XSD, DTDs are still used in some contexts, especially for simpler XML formats or legacy systems. Understanding DTDs is valuable for working with existing XML systems and for grasping the fundamental concepts of XML validation. However, for new projects, XSD is generally the preferred choice.