# XML

eXtensible Markup Language (XML) is used to store, transmit, and reconstruct data. It's not a programming language. It's a markup language used to represent data using tags. W3C's XML specifications and other related recommendations help in implementing XML in an interoperable manner.

By design, XML focuses on generality and usability across the Internet. It supports Unicode, which allows information to be communicated in any written human language. It can represent common computer science data structures such as records, lists, and trees. It allows validation using schema languages such as XSD and Schematron. Several schema systems facilitate the definition of XML-based languages.

Verbosity, complexity and redundancy are the major disadvantages of XML.

## Discussion

• What are the essential XML terms?

Some key terms used in XML programming are:

• XML Declaration: XML documents mostly begin with a declaration that gives information about them. An example is <?xml version="1.0" encoding="UTF-8"?>.
• Tag: Tags are the basic markup constructs of XML. They define the structure of the document. Each tag begins with the character < and ends with the character >. There are three types of tags: start tag, end tag and empty-element tag.
• Attribute: It's a name-value pair that exists within a start tag or empty-element tag. For example, in boy="rahul", boy is the name and rahul is the value.
• Element: Everything from a start tag to its matching end tag, including both tags, or a single empty-element tag. Everything between the start tag and the end tag is called element content.
• Markup & Content: An XML document is made up of markup and content. Strings that begin with < and end with >, or begin with & and end with ;, are markup. Non-markup strings are content.
• Processor: It analyzes markup and passes structured information to an application.
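
To make these terms concrete, here is a minimal sketch that uses Python's standard xml.etree.ElementTree module as the processor. The student document and its element names are invented for illustration; only the boy="rahul" attribute comes from the list above.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: an XML declaration, start and end tags,
# one attribute, element content, and an empty-element tag.
SOURCE = b"""<?xml version="1.0" encoding="UTF-8"?>
<student boy="rahul">
  <name>Rahul</name>
  <absent/>
</student>"""

# The processor analyzes the markup and returns structured information.
root = ET.fromstring(SOURCE)

print(root.tag)                  # element name taken from the start tag
print(root.attrib)               # attributes as name-value pairs
print(root.find("name").text)    # content between <name> and </name>
print(root.find("absent").text)  # an empty-element tag has no content
```

The application never sees angle brackets or quotes. It only sees names, attributes and content that the processor has already separated.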

• What are some applications of XML?

XML is widely used for interchange of data over the internet. It's also used for formatting documents, web searching and storing configuration data. Document formats such as SVG, RSS, Atom and XHTML have been developed using XML. It serves as the base language for communication protocols such as SOAP and XMPP. It gave its name to the Asynchronous JavaScript and XML (AJAX) programming technique, where it originally served as the message exchange format.

Microsoft Office 2007 introduced XML-based file formats for documents, spreadsheets, presentations, their templates, and more. XML led to smaller files and easier data recovery. Open XML format allowed other tools to interoperate for tasks such as comparing files or removing sensitive information.

XML is also common in cloud services. Google Cloud Storage exposes an XML API. Azure has transformation policies to convert API requests and responses between JSON and XML. Application builds and deployments can be configured in XML. Oracle XML DB (on Amazon RDS) natively supports XML.

XML is the base for many industry data standards such as Health Level 7, OpenTravel Alliance, FpML, MISMO, and National Information Exchange Model.

• What's a valid well-formed XML document?

XML documents that conform to the syntactic rules defined in the W3C Recommendation are said to be well-formed. For a document to be valid, it should also conform to a Document Type Definition (DTD) or XML Schema. A valid document is necessarily well-formed but the reverse is not true. A well-formed document may be invalid, or may have no DTD or schema at all.

Since the W3C Recommendation defines the syntactic rules, no additional inputs are necessary to check for well-formedness. But to check for validity, the XML document needs to reference a DTD, for example with <!DOCTYPE advert SYSTEM "http://www.foo.org/ad.dtd">, where advert is the root element. Alternatively, an XML Schema may be referenced within the document or passed to the parser separately.
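
The difference between the two checks can be seen with a non-validating parser. The sketch below uses Python's standard xml.dom.minidom. The DOCTYPE line is the one quoted above, while the element content is made up. The parser confirms well-formedness and records the DTD reference, but it neither fetches the DTD nor checks the document against it.

```python
from xml.dom import minidom

# The DOCTYPE line comes from the text above; the content is hypothetical.
SOURCE = """<?xml version="1.0"?>
<!DOCTYPE advert SYSTEM "http://www.foo.org/ad.dtd">
<advert><headline>Bicycle for sale</headline></advert>"""

# minidom only checks well-formedness. It does not download or apply the DTD.
doc = minidom.parseString(SOURCE)

doctype_name = doc.doctype.name           # root element named in the DOCTYPE
dtd_location = doc.doctype.systemId       # where a validating parser would look
root_name = doc.documentElement.tagName   # actual root element of the document

print(doctype_name, dtd_location, root_name)
```

Checking validity would require a validating parser, which Python's standard library does not provide, so that step is left out here.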

• What are some important syntactic rules in XML?

Each XML element must have a closing tag. Even for an empty-element tag (such as <author name="J Smith" />), the characters /> signify the closing part of the element. In HTML, the comparable elements for images, breaks, horizontal lines, inputs, metadata, etc. can be written without any closing part.

XML tags are case-sensitive. For example, <Body> and <body> are two different tags. Hence </body> can't be the end tag for <Body>. Keywords such as DOCTYPE and ENTITY must be in uppercase.

All XML elements must be properly nested and not overlap. For example, <b><i>Hello world!</b></i> is incorrect since </i> must come before </b>. Every XML document must have exactly one root element. This implies that elements can't be siblings at the top level. They must be enclosed within a parent root element. Every element other than the root has exactly one parent.

Attribute values must always be quoted. For example, <note data=05/05/05> should be corrected as <note data="05/05/05">. Whitespace in element content is significant: the parser passes it on to the application. Tag names must not contain spaces. They must not start with a digit or with punctuation such as a hyphen or a period, but can start with a letter or an underscore.
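
The rules above can be tried out with any XML parser. The following sketch, which assumes Python's standard xml.etree.ElementTree as the parser, feeds it one small document per rule. The helper name is_well_formed and the sample labels are invented for this illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# One sample per rule; only the last one follows all of them.
SAMPLES = {
    "missing end tag": "<note><to>Ann</note>",
    "case mismatch": "<Body>Hello</body>",
    "overlapping elements": "<p><b><i>Hello world!</b></i></p>",
    "unquoted attribute": "<note data=05/05/05>Hi</note>",
    "two root elements": "<a/><b/>",
    "name starts with a digit": "<1st>Hi</1st>",
    "corrected document": '<note data="05/05/05"><b><i>Hello world!</i></b></note>',
}

results = {label: is_well_formed(text) for label, text in SAMPLES.items()}

for label, ok in results.items():
    print(f"{label}: {'well-formed' if ok else 'rejected'}")
```

Only the corrected document is accepted. Every other sample is rejected outright rather than repaired.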

• What are some other XML constructs worth knowing?

Comments in XML start with <!-- and end with -->. The string -- must not occur inside a comment. Markup inside a comment is not interpreted. For example, <!-- declarations for <head> --> is a well-formed comment and <head> is not interpreted as a start tag.

Processing Instructions (PIs) pass useful information to applications. They start with <? and end with ?>. For example, <?xml-stylesheet href="person.css" type="text/css"?> tells a web browser to apply the specified stylesheet.

Character Data (CDATA) is a section of element content marked for the parser to interpret as character data. In other words, CDATA content is not interpreted as markup. Hence, characters such as < and & can be included in CDATA content without escaping. The only sequence that can't appear is ]]> since it ends the section. CDATA can be used to embed an XML document within another. It's one way to keep a document well-formed when its content contains characters that would otherwise be read as markup. A CDATA section starts with the string <![CDATA[ and ends with the string ]]>.
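
As a rough illustration of how a processor reports these three constructs, the sketch below parses a small document with Python's standard xml.dom.minidom. The processing instruction and the comment are the examples quoted above. The snippet element and its CDATA content are hypothetical.

```python
from xml.dom import minidom, Node

SOURCE = """<?xml version="1.0"?>
<?xml-stylesheet href="person.css" type="text/css"?>
<!-- declarations for <head> -->
<snippet><![CDATA[if (a < b && b > c) { run(); }]]></snippet>"""

doc = minidom.parseString(SOURCE)

# Processing instructions and comments appear as their own node types.
pis = [n for n in doc.childNodes
       if n.nodeType == Node.PROCESSING_INSTRUCTION_NODE]
comments = [n for n in doc.childNodes if n.nodeType == Node.COMMENT_NODE]

pi_target = pis[0].target        # name of the instruction
pi_data = pis[0].data            # everything after the target
comment_text = comments[0].data  # <head> stays plain text here

# CDATA content reaches the application as characters, not as markup.
cdata_text = "".join(n.data for n in doc.documentElement.childNodes)

print(pi_target, "|", pi_data)
print(comment_text)
print(cdata_text)
```

Without the CDATA section, the same content would have to be written with &lt; and &amp; in place of < and &.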

• What's a typical XML processing pipeline?

The processor takes an XML pipeline document as input and executes the pipeline processes. An XML pipeline document specifies, in a declarative manner, the processes to be executed along with their inputs and outputs. A multistage pipeline can process XML components sequentially or in parallel, with the output of one stage becoming the input of the next. The figure shows a possible pipeline sequence.

A typical XML processing pipeline parses the input XML documents, then validates those documents and then transforms the documents.
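
The parse, validate and transform sequence can be sketched as three chained functions. This is a simplified stand-in and not an XML pipeline document: the stages are plain Python functions, the validation step is a hand-written check rather than a DTD or schema, and the people and person element names are invented.

```python
import xml.etree.ElementTree as ET


def parse(text):
    """Stage 1: turn text into an element tree (fails if not well-formed)."""
    return ET.fromstring(text)


def validate(root):
    """Stage 2: hand-written rule standing in for schema validation."""
    if root.tag != "people":
        raise ValueError("root element must be <people>")
    for person in root:
        if person.tag != "person" or person.find("name") is None:
            raise ValueError("each child must be a <person> with a <name>")
    return root


def transform(root):
    """Stage 3: convert the data into an HTML list."""
    html_list = ET.Element("ul")
    for person in root:
        ET.SubElement(html_list, "li").text = person.findtext("name")
    return ET.tostring(html_list, encoding="unicode")


def run_pipeline(text, stages=(parse, validate, transform)):
    """Feed the output of each stage into the next one."""
    result = text
    for stage in stages:
        result = stage(result)
    return result


SOURCE = ("<people><person><name>Ada</name></person>"
          "<person><name>Alan</name></person></people>")
output = run_pipeline(SOURCE)
print(output)  # <ul><li>Ada</li><li>Alan</li></ul>
```

A real pipeline would typically replace the validate stage with a schema validator and the transform stage with an XSLT processor.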

• What are the differences between XML and HTML?

HyperText Markup Language (HTML) is a static markup language used to create web pages and web applications. Like XML, HTML has a hierarchy of nested elements and attributes. However, HTML's main concern is how documents are displayed on a web browser. XML's main concern is how information is organized.

XML has syntax but no semantics. HTML has both. For example, <h1> tells web browsers to render the content as a heading, perhaps in a larger font size and bolder weight. XML has no such predefined semantics. Semantics are imposed by applications and not by XML parsers.

HTML tags are predefined and meant for the web. XML can be extended and used in any application. Data architects and developers can create their own tag names. Businesses can use XML to integrate information flows across systems.

A web browser parsing HTML is more tolerant of errors. It may ignore overlapping tags, case mismatch of tag names, missing end tags or missing quotes around attribute values. XML parsers are generally not tolerant of errors. They expect well-formed documents.
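
This difference in tolerance can be approximated with Python's standard library. In the sketch below, html.parser stands in for a browser, although it is only a tokenizer and not a full browser engine. The sloppy markup and the TagCollector class are invented for the example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Mixed-case tag names, an unquoted attribute and a missing </b> end tag.
SLOPPY = "<P class=note>Hello <b>world</P>"


class TagCollector(HTMLParser):
    """Record every start tag the HTML tokenizer reports."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append((tag, dict(attrs)))


# The HTML tokenizer accepts the input and normalizes tag names to lowercase.
collector = TagCollector()
collector.feed(SLOPPY)
collector.close()
print(collector.start_tags)

# The XML parser rejects the same input.
try:
    ET.fromstring(SLOPPY)
    xml_error = None
except ET.ParseError as exc:
    xml_error = exc
print("XML parser error:", xml_error)
```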

• What are the shortcomings of XML?

XML is a verbose language. Developers need to type opening and closing tags while writing XML documents. It has no built-in array data type, so lists have to be expressed as repeated elements. Repetition is a major disadvantage. Every element/attribute name has to be explicitly written for every element/attribute instance. Many web developers choose JSON over XML because JSON is terser.
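
A quick way to see the repetition is to serialize the same records both ways. The records below are made up, and the comparison is only a sketch: actual sizes depend on the names chosen and on how the data is modelled.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical records used only for this comparison.
records = [
    {"name": "Ada", "year": 1815},
    {"name": "Alan", "year": 1912},
    {"name": "Grace", "year": 1906},
]

# XML: every field name appears twice per record, in the start and end tag.
root = ET.Element("people")
for record in records:
    person = ET.SubElement(root, "person")
    for key, value in record.items():
        ET.SubElement(person, key).text = str(value)
as_xml = ET.tostring(root, encoding="unicode")

# JSON: every field name appears once per record.
as_json = json.dumps(records)

print(len(as_xml), "characters as XML")
print(len(as_json), "characters as JSON")
```

JSON repeats the field names in every record too, but only once each, and the list itself needs no wrapper element per item.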

Due to its verbosity, XML impacts performance: more disk storage, more bandwidth to transfer XML documents, and more memory consumption and longer time to process. The choice of character encoding can increase the size. For example, integer value 15383 needs only 2 bytes in binary but might need 5-20 bytes in text form. Unicode UCS-4 encoding requires 4 bytes for each character.
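
The byte counts in the example above can be verified directly. In this sketch, Python's utf-32-be codec stands in for UCS-4, since both use four bytes per character.

```python
value = 15383

# Binary: the value is below 65536, so it fits in two bytes.
binary_size = len(value.to_bytes(2, "big"))

# Text: five decimal digits, one character each.
utf8_size = len(str(value).encode("utf-8"))       # 1 byte per digit
ucs4_size = len(str(value).encode("utf-32-be"))   # 4 bytes per digit

print(binary_size, utf8_size, ucs4_size)  # 2 5 20
```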

Since XML documents can refer to external files, for example through external entities or DTDs, this presents a security risk apart from the performance impact due to slow network connections.

XML has no semantics. Each XML-based language has to define what its elements and attributes mean. XML doesn't predefine anything for them. One needs to write application code (Java, Python, etc.) and its metadata (in *.xml) separately. This separation can diminish the readability of application code.

• What are other W3C technologies closely related to XML?

W3C technologies closely related to XML include:

• XML Schema Definition (XSD): It specifies how to formally define the elements in an XML document.
• Extensible Stylesheet Language Transformations (XSLT): It's used to convert XML into other document formats such as HTML for web pages, plain text or XSL-FO, which further may be converted into other formats such as PDF, PostScript and PNG.
• XML Path Language (XPath): It's used to address parts of an XML document. It can be used by both XSLT and XPointer.
• XML Query (XQuery): It's a standardized language for querying and combining data from XML documents, databases and web pages. It can replace complex Java or C++ programs with a few lines of code.
• XML Pointer (XPointer): It's a W3C Recommendation that allows links to point to specific parts of an XML document. It uses XPath expressions to navigate.
• XSL Formatting Objects (XSL-FO): It's used to describe the formatting of XML data for output. It's most often used to generate PDF files.
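
Of these technologies, XPath is the easiest to try out, since Python's standard xml.etree.ElementTree supports a limited subset of it. The sketch below addresses parts of a hypothetical library document. Full XPath, XSLT and XQuery need other tools.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; titles and years are invented.
SOURCE = """<library>
  <book lang="en"><title>XML Basics</title><year>2004</year></book>
  <book lang="fr"><title>Le XML</title><year>2006</year></book>
  <book lang="en"><title>Markup Matters</title><year>2010</year></book>
</library>"""

root = ET.fromstring(SOURCE)

# Select <title> children of every <book> whose lang attribute is "en".
english_titles = [t.text for t in root.findall("./book[@lang='en']/title")]

# Select the <title> of the second <book> (XPath positions start at 1).
second_title = root.findtext("./book[2]/title")

print(english_titles)  # ['XML Basics', 'Markup Matters']
print(second_title)    # Le XML
```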

## Milestones

1995

Some SGML (Standard Generalized Markup Language) practitioners experienced with the new World Wide Web (WWW) believe that SGML can offer solutions to some of the upcoming problems on the Web. Dan Connolly joins W3C and adds SGML to the list of W3C's activities.

1998

XML 1.0 becomes a World Wide Web Consortium (W3C) recommendation. XML is a profile of SGML (ISO 8879). It adopts many aspects of SGML: separation of logical and physical structures, separation of data and metadata, separation of representation from processing instructions, mixed content, DTD-based validation, etc.

Jan 2000

Extensible HyperText Markup Language (XHTML) 1.0 becomes a World Wide Web Consortium (W3C) Recommendation. Essentially, XHTML reformulates HTML as XML so that standard XML parsers can parse web content. This enables web developers and content providers to define new elements and attributes. Content can be transformed to suit its context. User agents would become more interoperable. In reality, later years show that XHTML doesn't achieve wide adoption.

2006

The second edition of XML 1.1 becomes a W3C Recommendation, the first edition having been published in 2004. Compared to XML 1.0, XML 1.1 standardizes improvements such as support for the Unicode line separator character #x2028 and a set of constraints on XML documents called "full normalization".

## Sample Code

Source: https://developer.mozilla.org/en-US/docs/Web/Guide/HTML/XHTML (accessed 2022-05-25)

HTML, served with Content-Type: text/html

```html
<!DOCTYPE html>
<html lang=en>
<meta charset=utf-8>
<title>HTML</title>
<body>
<p>I am a HTML document</p>
</body>
</html>
```

XHTML, served with Content-Type: application/xhtml+xml

```xml
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<title>XHTML</title>
<body>
<p>I am a XHTML document</p>
</body>
</html>
```


## Cite As

Devopedia. 2022. "XML." Version 12, May 26. Accessed 2023-02-11. https://devopedia.org/xml

Contributed by 2 authors. Last updated on 2022-05-26 02:05:31.