News, How-tos, and assorted Views on Accordance Bible Software.

Monday, February 26, 2007  

How Accordance Imports HTML

Last Friday, we imported an HTML file we downloaded from Project Gutenberg into Accordance as a User Tool. Today, I want to look at some of the ways Accordance interprets the HTML tags so that you can better understand how the HTML import works. Some people have been waiting with bated breath for this kind of documentation. I hope I don't disappoint.

The <title> tag. The first thing Accordance looks for is the <title> tag. This is the tag that specifies the name which appears in the Title bar of your web browser. Accordance automatically converts the Title into the first title of your User Tool. Since many e-texts will have the title within the body of the web-page as well, this conversion of the Title tag can lead to redundancy. For example, in the Companion to the Bible tool we created last Friday, you'll see that the first title in the browser is "The Project Gutenberg eBook of Companion To The Bible, by E. P. Barrows." Just below it is the title "Companion to the Bible" which was taken from the actual body of the HTML document. The first thing I would do with this User Tool is open the Edit window and delete the first title and all the Project Gutenberg license information. That will eliminate the redundancy in the browser and get rid of all the legalese.

The irony of Accordance's conversion of the <title> tag is that in well-formatted e-texts it will create a redundancy that you'll almost always want to remove. Still, including the <title> tag ensures that nearly all e-texts start out with at least one appropriate title.
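If you'd like to see the redundancy coming before you import, a short script can list the <title> text next to the headings in the body. This is only a sketch using Python's standard html.parser module; the names TitlePreview and preview_titles are my own, and the script knows nothing about Accordance itself. It simply reports what is in the HTML file.

```python
from html.parser import HTMLParser


class TitlePreview(HTMLParser):
    """Collects the <title> text and the text of each <h1>..<h9> heading."""

    HEADING_TAGS = {f"h{n}" for n in range(1, 10)}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headings = []      # list of (tag, text) in document order
        self._open_tag = None   # tag whose text is currently being collected
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <H1> arrives as "h1".
        if tag == "title" or tag in self.HEADING_TAGS:
            self._open_tag = tag
            self._parts = []

    def handle_data(self, data):
        if self._open_tag is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag != self._open_tag:
            return
        text = " ".join("".join(self._parts).split())  # collapse whitespace
        if tag == "title":
            self.title = text
        else:
            self.headings.append((tag, text))
        self._open_tag = None


def preview_titles(html_text):
    """Return (title, headings) for a quick look before importing."""
    parser = TitlePreview()
    parser.feed(html_text)
    parser.close()
    return parser.title, parser.headings
```

If the title and the first heading say much the same thing, as they do in the Companion to the Bible file, you can expect to delete one of them after the import.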

The Header (<H1>, <H2>, etc.) tags. In HTML, header tags are used for merely cosmetic purposes, to create boldfaced headlines at various sizes. When Accordance imports header tags, it interprets them both cosmetically and structurally. In other words, it uses the header tags to identify the text which will appear in the User Tool's Titles field and to assign a browser hierarchy level. Thus, any text tagged with <H1> is assigned to the Titles field and placed at the top level of the browser hierarchy, just as if you had placed the red T in the margin of the tool's Edit window. Accordingly, <H2> is placed at the second level of the browser hierarchy (as if you had placed a red "1" in the margin of the Edit window); <H3> is placed at the third level of the browser hierarchy (as if you had placed a red "2" in the margin); etc.
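To picture the hierarchy a file will produce, here is a rough sketch that turns a list of heading tags into an indented outline, with the margin marker described above in front of each line. The function name browser_outline is hypothetical, and the markers for levels below <H3> are my assumption that the T, 1, 2 pattern simply continues.

```python
def browser_outline(headings):
    """Turn (tag, text) pairs into outline lines, one indent step per level.

    Each line starts with the margin marker described in the post:
    "T" for <H1>, "1" for <H2>, "2" for <H3>. Deeper levels assume the
    same pattern continues (an assumption, not confirmed behavior).
    """
    lines = []
    for tag, text in headings:
        level = int(tag[1:])                       # "h1" or "H1" -> 1
        marker = "T" if level == 1 else str(level - 1)
        indent = "  " * (level - 1)
        lines.append(f"{marker} {indent}{text}")
    return lines


if __name__ == "__main__":
    sample = [("H1", "Companion to the Bible"), ("H2", "Preface"), ("H3", "Part I")]
    print("\n".join(browser_outline(sample)))
```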

When it came to how to display the various titles, we departed from the way HTML headings are usually formatted. Instead, we used a standardized system of descending formats.

<H1>    18 point bold, centered
<H2>    14 point bold
<H3>    12 point bold
<H4>    12 point italic

Header tags <H5> through <H9> are also formatted as 12 point italic, but are placed at successively lower levels in the browser hierarchy. Any header tags beyond <H9> are ignored.
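The formatting rules above boil down to a small lookup. The sketch below restates them in Python; title_format is a made-up name, and the function only mirrors the table in this post rather than anything inside Accordance.

```python
def title_format(tag):
    """Return the display format for a heading tag, or None if it is ignored."""
    level = int(tag[1:])                 # "H4" or "h4" -> 4
    if level == 1:
        return "18 point bold, centered"
    if level == 2:
        return "14 point bold"
    if level == 3:
        return "12 point bold"
    if 4 <= level <= 9:
        return "12 point italic"         # <H4> through <H9> share one format
    return None                          # anything beyond <H9> is ignored


if __name__ == "__main__":
    for tag in ("H1", "H2", "H3", "H4", "H9", "H10"):
        print(tag, "->", title_format(tag))
```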

Since HTML Header tags are used primarily just for looks, Accordance's reliance on these tags to create the browser hierarchy can lead to hierarchy levels where none should exist. For example, it is common for e-texts to do something like this:

<H1>Chapter One</H1>
<H2>Introductory Remarks</H2>

In this case "Introductory Remarks" is really the title of "Chapter One", rather than the beginning of a new subarticle. You'll probably want to fix things like this either before the import (using a text editor to edit the actual HTML file) or after the import (in the Edit window of the User Tool).
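If you go the text editor route and the file has many chapters, a small script can do the merging for you. The sketch below is one possible fix, not the only one: it folds an <H2> that immediately follows an <H1> into a single <H1> title. The name merge_chapter_titles and the ": " separator are my own choices, and the pattern only handles headings that contain plain text with no nested tags, so check the result before importing.

```python
import re

# An <h1> followed (allowing only whitespace in between) by an <h2>.
# [^<]* restricts the match to headings that contain plain text only.
H1_THEN_H2 = re.compile(
    r"<(h1)(\s[^>]*)?>([^<]*)</h1>\s*<h2(?:\s[^>]*)?>([^<]*)</h2>",
    re.IGNORECASE,
)


def _merge(match):
    tag = match.group(1)                 # keeps the original case: H1 or h1
    attrs = match.group(2) or ""         # any attributes on the <h1>
    chapter = match.group(3).strip()
    subtitle = match.group(4).strip()
    return f"<{tag}{attrs}>{chapter}: {subtitle}</{tag}>"


def merge_chapter_titles(html_text):
    """Fold each <h1>/<h2> pair into one <h1> so no extra level is created."""
    return H1_THEN_H2.sub(_merge, html_text)


if __name__ == "__main__":
    print(merge_chapter_titles("<H1>Chapter One</H1>\n<H2>Introductory Remarks</H2>"))
```

An <H2> that is separated from its <H1> by a paragraph or any other tag is left alone, since that is more likely to be a genuine subarticle.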

I hope this is helpful to some of you. We'll look at how other HTML tags get imported into Accordance later this week.

That was a great post! Very helpful. Once you're done with this series, it would be nice to make it one file we could download as a help to user tools. Thanks!

Very interesting. I have experienced one annoying problem, though: sometimes large chunks of text imported from an HTML file are indented, and there is nothing I can do in edit mode to get rid of the indent. I would really appreciate it if you could explain what is going on there and how to fix it.

Just wondering if you could check out my question on the message boards regarding user tools. It deals specifically with the title tag that you talk about here, but there's a chasm between what you've explained and what I have experienced in importing. Do you have any thoughts on what I explained?


Is there an RSS feed for the News page that will notify me when it is updated?

