It is my intention to keep this document in sync with the current state of the code. Thus, this document does not yet describe the full feature set planned for version 1.0. This document is current with version 0.7.8 of the code.


AbiWord Document Format

Version 1.0

Copyright (C) 1999-2000 AbiSource, Inc., All Rights Reserved.

Jeff Hostetler

jeff@abisource.com

AbiSource, Inc.


1. Introduction

This document describes the AbiWord file format used to represent AbiWord native documents. This document describes file format version 1.0.

AbiWord uses XML[1] to represent a document. This does not imply that AbiWord is an XML editor; rather, AbiWord is a word processor that happens to use XML as a convenient syntax for representing documents. AbiWord contains a very strict and unforgiving importer; it requires well-formed XML in strict adherence to the format specified by the code. This code is primarily located in ie_exp_AbiWord_1.cpp and ie_imp_AbiWord_1.cpp. AbiWord has a DTD[2], but it should not be taken as definitive. Our primary goal is to support documents written by AbiWord rather than hand-written XML.

AbiWord also uses some of the syntax and conventions from CSS2[3] to represent certain concepts, such as character formatting. CSS2 was designed as a style mechanism for WWW documents and not as a style mechanism for page-oriented documents. We used CSS2 as a guideline, taking parts that were of use and inventing our own mechanism as necessary.

2. Document Structure

The AbiWord file format is a 7-bit-clean ASCII XML file. Non-US-ASCII characters are represented using standard XML numeric entities (e.g., "&#255;" or the equivalent hexadecimal form "&#xFF;", both of which denote the character "ÿ").
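
As a hedged illustration of the encoding rule above, the following Python sketch turns a Unicode string into 7-bit ASCII by replacing each non-ASCII character with a decimal numeric entity. It is not taken from the AbiWord source; the function name to_abiword_text is hypothetical, and escaping of the XML markup characters is included on the assumption that an exporter must do so to keep the output well-formed.

```python
from xml.sax.saxutils import escape


def to_abiword_text(text: str) -> str:
    """Return text as 7-bit ASCII, using XML numeric entities for non-ASCII.

    Hypothetical helper for illustration; not part of AbiWord.
    """
    # Escape the XML markup characters &, < and > first.
    escaped = escape(text)
    # Replace every non-ASCII character with a decimal reference such as &#255;.
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    print(to_abiword_text("na\u00efve \u00ff"))  # prints: na&#239;ve &#255;
```

An XML parser treats the decimal and hexadecimal forms of a numeric entity identically, so a writer may emit either one.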

The following illustrates the basic form of an AbiWord file:

<?xml version="1.0"?>

<abiword version="0.7.8">
<section>
<p props="text-align:center">Hello World.</p>
<p>This is a test paragraph.</p>
<p>This word is
<c props="font-weight:bold">bold</c>.</p>
</section>
<section props="column-gap:0.25in; columns:2">
<p>This section <image dataid="foo"/>has two columns.</p>
</section>
<data>
<d name="foo">
XXXXXXXXXX...
</d>
</data>
</abiword>
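
The basic form shown above can be checked with any XML parser. The following Python sketch is a rough illustration only: it uses the standard xml.etree.ElementTree module, and the function name summarize and the shortened SAMPLE document are assumptions made for this example rather than anything defined by AbiWord. It parses a document, verifies the root element, and counts sections, paragraphs, and data items.

```python
import xml.etree.ElementTree as ET

# A shortened sample document in the basic form (illustrative only).
SAMPLE = """<?xml version="1.0"?>
<abiword version="0.7.8">
<section>
<p props="text-align:center">Hello World.</p>
<p>This word is <c props="font-weight:bold">bold</c>.</p>
</section>
<data>
<d name="foo">
XXXXXXXXXX
</d>
</data>
</abiword>
"""


def summarize(xml_text: str) -> dict:
    """Parse an AbiWord document and report its top-level structure."""
    # ET.fromstring raises ET.ParseError if the text is not well-formed XML.
    root = ET.fromstring(xml_text)
    if root.tag != "abiword":
        raise ValueError("root element must be <abiword>")
    sections = root.findall("section")
    return {
        "version": root.get("version"),
        "sections": len(sections),
        "paragraphs": sum(len(s.findall("p")) for s in sections),
        "data_items": [d.get("name") for d in root.findall("data/d")],
    }


if __name__ == "__main__":
    print(summarize(SAMPLE))
```

Because the importer requires well-formed XML, an empty element such as an image reference must be written in self-closing form; the parser used here rejects an unclosed one with a ParseError.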

2.1. <abiword>...</abiword>

The entire content of the document is contained within this pair of tags. Within these tags are a series of sections, an optional data block (see 2.3), and an optional styles block (see 2.4).

The only attribute of the <abiword> tag is:

version = number | "unnumbered"

2.2. <section>...</section>

These tags delimit a section. A section is a portion of a document that has some common characteristic, such as its column layout. A section does not necessarily correspond to anything in the actual content, such as a chapter. Every document must contain at least one section.

The following are section-level properties and may appear in the value of the props attribute of the <section> tag:

columns: integer-number-of-columns ;

column-gap: dimensioned-distance ;

section-space-after: ...;

page-margin-top: ...;

page-margin-left: ...;

page-margin-right: ...;

page-margin-bottom: ...;

Other attributes of the <section> tag are:

type: footer | ...

id: unique-section-id

footer: section-id
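
The value of a props attribute, such as "column-gap:0.25in; columns:2" in the sample document, is a semicolon-separated list of name:value pairs in the style of CSS2. A minimal Python sketch of splitting such a value into a dictionary follows. The function name parse_props is hypothetical, and the handling of whitespace and of a trailing semicolon is an assumption; the AbiWord importer may be stricter.

```python
def parse_props(props: str) -> dict:
    """Split a props value such as 'columns:2; column-gap:0.25in' into a dict.

    Hypothetical helper for illustration; values are kept as strings.
    """
    result = {}
    for item in props.split(";"):
        if not item.strip():
            # Skip empty pieces, e.g. after a trailing semicolon (assumption).
            continue
        name, sep, value = item.partition(":")
        if not sep:
            raise ValueError("missing ':' in property: %r" % item)
        result[name.strip()] = value.strip()
    return result


if __name__ == "__main__":
    print(parse_props("column-gap:0.25in; columns:2"))
    # prints: {'column-gap': '0.25in', 'columns': '2'}
```

The same splitting applies to the props attribute of the other tags described in this document.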

2.2.1. <p>...</p>

These tags represent a block (or paragraph). All document content, including text, must appear within a block. Blocks may not be nested (at the current time). All paragraph formatting options appear as attributes of this tag. A section must contain at least one paragraph.

The following are block-level properties and may appear in the value of the props attribute:

text-align: left | center | right | justify ;

line-height: dimensioned-distance;

orphans: ...;

widows: ...;

keep-together: ...;

keep-with-next: ...;

margin-top: ...;

margin-left: ...;

margin-right: ...;

margin-bottom: ...;

text-indent: ...;

tabstops: ...;

default-tab-interval: ...;

The following are other attributes of the p tag:

level: number

style: style-name

id: unique-p-id

2.2.2. <c>...</c>

These tags delimit an in-line span and are used to apply a span-level format change within a block, for example, making a word italic within a paragraph.

Spans may be nested, but this should be thought of as a convenience for document translators; AbiWord will flatten these during import and will write them out flattened on export.

A pair of start and end c tags delimits a span of document text. These tags are necessary only to change formatting from the settings inherited from the block; text that uses the block's settings need not be enclosed in c tags.
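
To illustrate what flattening nested spans might look like, the following Python sketch walks one p element and produces a flat list of (properties, text) runs. It is a sketch under stated assumptions, not the AbiWord importer: the names parse_props and flatten_spans are hypothetical, and the rule that an inner span's properties override those of the enclosing span is an assumption borrowed from CSS2-style inheritance.

```python
import xml.etree.ElementTree as ET


def parse_props(props: str) -> dict:
    """Split 'name:value; name:value' into a dict (hypothetical helper)."""
    pairs = (item.partition(":") for item in props.split(";") if item.strip())
    return {name.strip(): value.strip() for name, _, value in pairs}


def flatten_spans(p_xml: str) -> list:
    """Return a flat list of (props, text) runs for one <p> element."""
    runs = []

    def walk(elem, inherited):
        props = dict(inherited)
        if elem.tag == "c":
            # Assumption: inner span properties override the enclosing ones.
            props.update(parse_props(elem.get("props", "")))
        if elem.text:
            runs.append((dict(props), elem.text))
        for child in elem:
            walk(child, props)
            if child.tail:
                # Text after a child belongs to the enclosing element.
                runs.append((dict(props), child.tail))

    walk(ET.fromstring(p_xml), {})
    return runs


if __name__ == "__main__":
    nested = ('<p>a<c props="font-weight:bold">b'
              '<c props="font-style:italic">c</c>d</c>e</p>')
    for props, text in flatten_spans(nested):
        print(text, props)
```

Each run in the result could then be written out as a single, non-nested c element, or as bare text when its property set is empty.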

The following are span-level properties and may appear in the value of the props attribute:

color: ...;

font-family: font-name;

font-style: ...;

font-variant: ...;

text-decoration: none | underline | line-through | overline;

font-weight: bold | normal;

font-stretch: ...;

font-size: floating-point;

The following are additional attributes of the c tag:

type: list-label

style: style-name

2.2.3. <image/>

This tag defines an in-line image reference.

The <image/> tag has the following attributes:

dataid = data-item-name

props = formatting specification

The dataid is a reference to a named Data Item in the Data section of the file.

The <image/> tag has the following properties that may appear in the value of the props attribute:

width: ...;

height: ...;

2.2.4. <br/>

This tag defines a forced line-break. This tag has no attributes or properties.

2.2.5. <cbr/>

This tag defines a forced column-break. This tag has no attributes or properties.

2.2.6. <pbr/>

This tag defines a forced page-break. This tag has no attributes or properties.

2.2.7. <field/>

This tag defines an in-line computed field reference.

This tag has the following attribute:

type: list-label | page-number | page-count | time

Currently, we are in the process of updating our entire field model, so this will likely change in the near future.

2.3. <data>...</data>

These tags delimit a series of one or more data items.

2.3.1. <d>...</d>

These tags define a Data Item -- an opaque blob of Base64-encoded data. Encoded content may be broken across multiple lines, as in MIME. For images, these are Base64-encoded PNG objects. Other types of objects may be defined in the future.

The <d> tag has the following attributes:

name=unique-data-item-name

The name attribute provides a target for the dataid attribute of the <image/> tag and other tags.
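
The relationship between data items and image references can be sketched in Python as follows. This is an illustration only, assuming a document already parsed with xml.etree.ElementTree; the function names load_data_items and resolve_images are hypothetical. Line breaks inside a d element are removed before Base64 decoding, and the PNG file signature is used to show how decoded image data might be sanity-checked.

```python
import base64
import xml.etree.ElementTree as ET

# The fixed 8-byte signature that begins every PNG file.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def load_data_items(root) -> dict:
    """Map each <d> name to its decoded bytes (hypothetical helper)."""
    items = {}
    for d in root.findall("data/d"):
        # Encoded content may span several lines; drop all whitespace.
        encoded = "".join((d.text or "").split())
        items[d.get("name")] = base64.b64decode(encoded)
    return items


def resolve_images(root, items: dict) -> list:
    """Return (dataid, bytes) for every <image/>; KeyError if unresolved."""
    return [(img.get("dataid"), items[img.get("dataid")])
            for img in root.iter("image")]


if __name__ == "__main__":
    payload = PNG_SIGNATURE + b"demo"  # not a real image, signature only
    enc = base64.b64encode(payload).decode("ascii")
    doc = ('<abiword version="0.7.8"><section><p><image dataid="foo"/></p>'
           '</section><data><d name="foo">\n' + enc[:8] + "\n" + enc[8:] +
           "\n</d></data></abiword>")
    root = ET.fromstring(doc)
    items = load_data_items(root)
    print(items["foo"].startswith(PNG_SIGNATURE))  # prints: True
```

A dataid that names no data item raises KeyError in this sketch; how AbiWord itself treats a dangling reference is not specified here.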

2.4. <styles>...</styles>

These tags delimit a series of one or more styles.

2.4.1. <s>...</s>

These tags define a style -- a set of formatting commands that can be applied to any p or c element in any <section>. When applied, these specify formatting for that element.

The <s> tag has the following attributes:

basedon = style-name

name = unique-style-name

type = ...

props = any properties of a p or a c tag
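
The basedon attribute suggests that one style can build on another. The following Python sketch shows one plausible way to compute the effective properties of a style by following the basedon chain. It rests on assumptions not confirmed by this document: that a style inherits every property of its base style, and that its own properties take precedence. The function name resolve_style and the style names used are hypothetical.

```python
def resolve_style(styles: dict, name: str) -> dict:
    """Merge props along the basedon chain, base style first.

    styles maps a style name to (basedon_or_None, props_dict).
    Hypothetical helper; the inheritance rule is an assumption.
    """
    chain = []
    seen = set()
    while name is not None:
        if name in seen:
            raise ValueError("cycle in basedon chain at %r" % name)
        seen.add(name)
        basedon, props = styles[name]
        chain.append(props)
        name = basedon
    resolved = {}
    # Apply the most basic style first so derived styles override it.
    for props in reversed(chain):
        resolved.update(props)
    return resolved


if __name__ == "__main__":
    styles = {
        "Base": (None, {"font-family": "Times", "font-size": "12"}),
        "Title": ("Base", {"font-weight": "bold", "font-size": "16"}),
    }
    print(resolve_style(styles, "Title"))
```

Under these assumptions, the derived style in the example keeps the font family of its base while overriding the font size.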

3. Example Document

The source for this document can be used as an example to study from. Additionally, the source tree contains numerous simple example documents (in abi/docs and abi/src/wp/samples). The XML source for these documents can be viewed with any text editor.

4. References

[1] XML - http://www.w3.org/TR/REC-xml

[2] AbiWord DTD - http://www.abisource.com/awml.dtd

[3] CSS2 - http://www.w3.org/TR/REC-CSS2

[4] The AbiWord source. (See http://www.abisource.com/lxr/ for a fully cross-referenced view of the AbiWord source code.)

5. TODO

[] Discuss white-space handling within content and around tags.

[] Discuss mapping of CR, LF, VT, HT, FF into various tags and vice versa.