HTML for HTML User Tool Import


Mike Doyle

Hi All

 

Sorry for the double post - but thought it better to keep things separate.

 

I'm writing a Perl script to clean up HTML files so they import into Accordance better. I'm sure people have noticed that Accordance sometimes has problems importing HTML files that are dirty, use complex HTML, or are big.

 

Is there any more documentation regarding exactly what html Accordance will recognise? I've read this article but it only covers the very basics.

 

For example - will Accordance recognise the HTML Footnote tag? What about stylesheets that define headings up the top?

 

So - are there any more complete docs?

 

thanks

Mike


Hi Mike,

How can I get this script when you are done with it? I am very interested.


It's a fairly rough and ready script - but I'm more than happy to pass it on to anyone.

 

Unfortunately, even importing basic HTML into Accordance is proving very problematic, and I can't work out why various things are happening, so I can't work around the issues. So I'm hoping to get more of a response on these forums regarding issues with HTML import. Hint hint, nudge nudge.

 

My eventual aim is to import the entire CCEL library (which are all out-of-copyright works, so it's OK).

 

Mike


Mike,

 

Sorry for the delay in responding. As I understand it, these are the tags which are supported and the way Accordance interprets them:

 

Headings:

<title> and <h1>: marked with T, centered, bold, 18 pt
<h2>: marked with 1, centered, bold, 14 pt
<h3>: marked with 2, centered, bold
<h4> to <h9>: marked with 3-9, centered, italic
<h10> and up: centered

Paragraphs:

<br> and <p>: carriage return (carriage returns and shift-returns are lost)

Styles:

<size 18>: 18 pt
<em>, <strong>, and <b>: bold
<i>: italic
<u>: underline

Lists:

<ul> and <ol> followed by <li>: indent

Spaces:

tabs, spaces, and command-space: space (except at beginning of a line)

Alignment:

<div align=left>, or right, or center: aligned

Superscript:

<sup>: raised

 

As you can see, the number of supported tags is fairly limited. We may expand this slightly, but footnote tags and stylesheets are not likely to be supported. We built the HTML import feature to allow the import of fairly simple user tools. The entire CCEL library with its THML is beyond the scope of our html import feature.
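
As a rough illustration of cutting a file down to the tags listed above, here is a minimal sketch in Python (not the Perl script discussed in this thread). The names `SUPPORTED`, `DROP_CONTENT`, `TagFilter` and `keep_supported_tags` are hypothetical, and dropping the contents of `<style>` and `<script>` blocks is an assumption; the list above does not say how Accordance treats them. The non-standard `<size 18>` marker is left out.

```python
import html
from html.parser import HTMLParser

# Tags from the list above. Unsupported tags are dropped, their text is kept.
SUPPORTED = {"title", "br", "p", "em", "strong", "b", "i", "u",
             "ul", "ol", "li", "div", "sup"} | {f"h{n}" for n in range(1, 10)}
# Assumption: the contents of these elements are discarded entirely.
DROP_CONTENT = {"style", "script"}


class TagFilter(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <style> or <script>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
        elif tag in SUPPORTED and not self.skip_depth:
            if tag == "div":
                # Keep only the align attribute, and only its three listed values.
                align = (dict(attrs).get("align") or "").lower()
                if align in ("left", "right", "center"):
                    self.out.append(f"<div align={align}>")
                else:
                    self.out.append("<div>")
            else:
                self.out.append(f"<{tag}>")  # all other attributes are dropped

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in SUPPORTED and tag != "br" and not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            # Entities were decoded by the parser, so escape them again.
            self.out.append(html.escape(data, quote=False))


def keep_supported_tags(source: str) -> str:
    parser = TagFilter()
    parser.feed(source)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)
```

For example, `keep_supported_tags('<h2 class="x">Intro</h2><span>Hi</span>')` keeps the heading without its attribute and keeps the text `Hi` without the `<span>` wrapper.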

 

Hope this helps.


Mike,

If you are able to get things to work so that you can import things from CCEL please let us all know on the forum and then at that time I would love to get the script.


Thanks Dave. I'll clean up some of the code and create a file with just those tags. If it still doesn't import right, I'll get back and see if we can work out what the problem is.

 

thanks

Mike


Hi Dave.

 

Time to push back a little, because it's not working the way I expect. I'm hoping you can shed some light on *why*.

 

Background: I have coded a Perl script to strip HTML out of files, except for the tags Accordance likes. I have coded around one bug I found (scriptural references in headings, which I haven't received a reply about yet *cough*). I haven't coded around another bug (nested headings), as I'm not sure where Accordance is up to on that.

 

Now, with my new HTML file, as far as I can tell, I'm only allowing "safe" tags. However, when it is parsed into Accordance, it will put in blank "subheadings" at strange points, as if the parser has come across a bad character or something (I'm also removing non-normal characters, just in case). When I copy and paste the section of the HTML into a smaller file, sometimes the errors will repeat, sometimes not, but most often they will repeat in a different area. This sounds confusing. What I mean is: the blank subheading will occur at, say, line 453 in the big file and line 342 in the small file, just at random characters.

 

Bottom line: If I can find out what is causing the errors, I can strip out the offending HTML. What else do I need to do? What do I need to strip? Any other bugs, characters etc. I need to get rid of?

 

So, attached is a test file with a decent length of HTML (a subset of the entire file, about 10%). As far as I can see, the HTML is valid and only contains "safe" tags. Could you take a look, see if you can see the errors, and see if you can indicate to me why the errors are happening? A hard ask, but it would be very helpful!

 

In the test file, the error occurs in chapter 1, in the subheadings. In Accordance, a blank subheading is put in at footnote 23, just before the commentary on verse 9. In the test.html code, it's at line 277.

 

In the larger file (not attached), a similar error occurs (in fact, in every chapter), but at different spots. This suggests to me that the Accordance HTML parser is tripping up over some tag, or has some sort of buffer-overflow, line count problem.

 

Looking forward to hearing from you.

 

thanks

 

Mike

 

PS I'm happy to take this discussion offline if you wish, my email address is michaelrdoyle AT gmail.com thanks.

PPS I had to attach a "png" to the end of the file, as the board wasn't allowing me to upload a .html or a .zip file. Just rename it.

test.html.zip.png


Mike,

 

The blank titles are not being inserted because of anything in the HTML. Rather, Accordance inserts them to break up long sections of text without titles. This is to avoid memory problems caused by long articles, but I have to confess that I am sometimes bewildered by the points at which blank titles are inserted. You can edit the finished tool to strip some of those blank titles out.
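
One way to see where long untitled stretches fall is to measure the text between headings before importing. This is a minimal Python sketch under stated assumptions: the function name `untitled_runs` is hypothetical, and the 5000-character `limit` is a made-up placeholder, since the point at which Accordance actually inserts a blank title is not given in this thread.

```python
import re

# Matches a whole heading element such as <h2>...</h2> or <title>...</title>.
HEADING = re.compile(r"<(?:title|h\d+)\b[^>]*>.*?</(?:title|h\d+)>", re.I | re.S)
TAG = re.compile(r"<[^>]+>")


def untitled_runs(source, limit=5000):
    """Return (heading_text, char_count) for each section whose text exceeds limit.

    `limit` is a placeholder value, not a documented Accordance threshold.
    """
    results = []
    current = "(before first heading)"
    pos = 0
    # Walk heading to heading, plus one final stretch to the end of the file.
    boundaries = [(m.start(), m.end(), m.group(0)) for m in HEADING.finditer(source)]
    boundaries.append((len(source), len(source), ""))
    for start, end, heading in boundaries:
        body = TAG.sub("", source[pos:start])  # text only, tags removed
        if len(body) > limit:
            results.append((current, len(body)))
        current = TAG.sub("", heading).strip()
        pos = end
    return results
```

Running it over a file lists the headings that are followed by the longest runs of text, which are the places where adding your own titles is most likely to help.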

 

As far as stripping out unsupported tags, there's no need to do that in your script. Accordance does that in the import process.

 

You can have a Scripture reference in a title, but if it is formatted as a Scripture hypertext link, it will not appear in the browser. Neither will any text after the Scripture link. Essentially, Accordance is understanding the Scripture link as marking the end of the title. You can de-link the Scripture references in titles to get them to appear in the browser.

 

I'm not sure what you're referring to as a bug in nested headings.

 

I hope this helps.


Rather, Accordance inserts them to break up long sections of text without titles. This is to avoid memory problems caused by long articles,

 

You can have a Scripture reference in a title, but if it is formatted as a Scripture hypertext link, it will not appear in the browser. Neither will any text after the Scripture link.

 

Thanks Dave. A couple of questions/points.

1) How many lines before Accordance inserts its own blank heading?

2) Check out the code at this forum post, most certainly a <h2>Joshua 2:1-24</h2> causes a blank chapter (second level) to be inserted, and all chapters to be placed INSIDE this first chapter. Give the code a go.

3) The nested title bug is referred to in the second post here. I think I have confirmed it individually, but it's the least of my worries at the moment.

4) Accordance has a problem with a variety of complex tags that are legitimate for HTML 4.0 (stylesheets etc.) and appear in the files at CCEL. Not a problem though, since I'm removing them. I haven't documented them - just removed them as I came across import problems.

 

Mike

 



1) How many lines before Accordance inserts its own blank heading?

 

I'm afraid I have no idea. I would imagine however that it's not just a simple matter of counting lines. The programmers will have to answer that one.

 

2) Check out the code at this forum post, most certainly a <h2>Joshua 2:1-24</h2> causes a blank chapter (second level) to be inserted, and all chapters to be placed INSIDE this first chapter. Give the code a go.

 

You need to understand that it is the Title field, and only the Title field, which will appear in the browser. In the case above, you have the header codes which create the sublevel in the hierarchical browser, but the text of that Title has been placed in the Scripture field, and therefore does not show in the browser. Your choices are to (a) de-link each of those Scripture references so that they appear in the Title field and in the browser, or (b) remove the browser level for each of those titles.

 

4) Accordance has a problem with a variety of complex tags that are legitimate for HTML 4.0 (stylesheets etc.) and appear in the files at CCEL. Not a problem though, since I'm removing them. I haven't documented them - just removed them as I came across import problems.

 

It would be helpful if you pass them along to us (via e-mail) as you run across them. We'll never support every tag, but we do want to make sure there are not tags which are messing up the import in some way.

 

I hope this helps.

thanks Dave

 

Is there something I can do in the html to get this to work? Or is it all post-import?

 

Mike


Since the scripture references are defined when the text is imported, you could move the verse reference to the next line with a <br> or <p> after the close header. If you want to keep them but unlink them, it would have to be done post-import.
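
The first option can be sketched in Python as a pre-import pass over the HTML. Everything here is an assumption made for illustration: the `REFERENCE` pattern is only a rough guess at what a reference such as "Joshua 2:1-24" looks like (it is not Accordance's own recognition rule), the function name is hypothetical, the `placeholder` title is invented so that a heading consisting only of a reference still has some text, and heading tags are assumed to be lowercase with no attributes.

```python
import re

# Rough pattern for references like "Joshua 2:1-24" or "1 Kings 3:5".
# An illustrative guess, not Accordance's actual rule.
REFERENCE = r"(?:[1-3]\s)?[A-Z][a-z]+\s\d+:\d+(?:-\d+)?"

# A heading whose text ends with a reference, e.g. <h2>Joshua 2:1-24</h2>.
HEADING_WITH_REF = re.compile(
    rf"<(h\d+)>(?P<lead>[^<]*?)\s*(?P<ref>{REFERENCE})\s*</\1>"
)


def move_reference_below_heading(source, placeholder="Commentary"):
    """Move a trailing reference out of each heading onto the next line."""
    def fix(match):
        tag = match.group(1)
        # Use the placeholder when the heading held nothing but the reference.
        title = match.group("lead").strip() or placeholder
        return f"<{tag}>{title}</{tag}><br>{match.group('ref')}"

    return HEADING_WITH_REF.sub(fix, source)
```

With this sketch, `<h2>Joshua 2:1-24</h2>` becomes `<h2>Commentary</h2><br>Joshua 2:1-24`, so the heading keeps plain text for the browser and the reference follows on its own line. Headings without a reference are left unchanged.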


The blank titles are not being inserted because of anything in the HTML. Rather, Accordance inserts them to break up long sections of text without titles.

 

Hi Dave

 

I've broken up the file, and I'm still getting this problem. It's within 4 lines of a previous heading <h2>. Any other suggestions on what it may be?

 

I'm trying to isolate it - but haven't been able to. It's just been happening.

 

<EDIT> Just added the file (once again, remove the .png and unzip it). Note how at line 223, just at footnote 18 (<sup>), the font size changes and a blank chapter is inserted (only two lines from the <h2> at line 220). Also note in the file how the headings are nested under each other. Chapter 2 is the last chapter INSIDE chapter 1, chapter 3 is INSIDE chapter 2 etc.

 

any pointers on what is going wrong, and what I need to change IN THE HTML so it imports nicely?

 

thanks again for your time.

 

Mike

<edit> Just updated attachment, wasn't backing out of <h3>, though this didn't change the Accordance results.

 

JoshOUT.html.zip.png


As David requested, please send these kinds of technical questions and files directly to him or to me by email. There is really no point in posting them on the Forum.

 

Please also remember that the import feature is not intended for entire works, but rather for articles, sermons, your own notes, etc. It takes very sophisticated markup to create proper tools of major works, which is why our Accordance modules take weeks or months to produce but end up much better than any user tool. The markup of user tools is intentionally limited to keep it simple and user-proof.


Hi Helen

 

Do you have an email address for Dave or yourself I can use? I'm not sure it's publicly viewable on the forum (feel free to email me - michaelrdoyle at gmail dot com).

 

I also don't think you guys have understood the issues - all I'm attempting to do is get Accordance to import the simplest html, in a format you guys say it should handle. It's just it doesn't seem to work!

 

I'll follow up via email. BTW - I have 3 bug reports, with sample code I'm ready to submit, would you like that in the bug forum, or over email?

 

thanks

Mike


I prefer email, but I have sent you our addresses via a personal message.


  • 3 weeks later...

Mike, I never did receive an e-mail from you, but the file you submitted did help us to identify a number of issues which have been fixed in the just released version 8.0.5.

 

By the way, I hope you're just using Calvin's commentary on Joshua as a test case, and not seriously intending to import all of Calvin's Commentaries as user tools. You can get all of his commentaries as a fully formatted Reference tool for just $30! ;)


Hi Dave & Helen

 

First: THANKS! I've just been playing around with the new release and found it fixed a whole bunch of problems, in particular the most annoying ones that made it (in my opinion) unusable. So I'm very, very pleased with the fixes - thanks.

 

Sorry for not getting back to you - it's on my to do list. Unfortunately I've been sick for several weeks - so it just kept slipping further down the list. I even had a whole bunch of test files showing various bugs! Good to see they're not needed.

 

I noticed the following bugs fixed in 8.0.5:

1) Heading Hierarchy bug: FIXED (this was a real killer)

2) SuperScript Bug: FIXED

3) Headings appearing in random spots: FIXED (another real killer)

4) Scripture Ref in heading bug: NOT FIXED (debatable how to give this a proper fix)

5) heading tag and first <h1> tag conflict bug: NOT FIXED (minor)

 

 

As I've mentioned, I've been writing a Perl script to cut down the HTML of CCEL files to make them easy to import into Accordance. The HTML of the CCEL files has been created by MS Word and uses stylesheets and XML - it's lucky if a normal browser can read them, let alone Accordance.

 

The good news is my script (which basically strips non-Accordance html from the files) now seems to make them workable.

 

anyway - thanks for the hard work, and I'll keep people updated on where we get to (there's a group of us) on the CCEL stuff. Pretty snowed under at the moment - so be patient.

 

And of course - if people want the good, formatted nicely stuff - do buy the modules off the Accordance site. As David mentioned - you can get all of Calvin's commentaries for only $30 - and they're not a kludge.

 

Mike


  • 1 month later...

I am trying to import a 6 page paper I wrote into a User Tool. I have a few hierarchy levels that I want to import as well. I have followed the format David mentioned above: <h1> for the first heading, <h2> for the second heading, and so on. But every time I import my tool, the marker is just imported with it. For example, if I have a title of <h1>My User Tool, it imports <h1>My User Tool, or if I try <h1>My User Tool<h1>, it imports <h1>My User Tool<h1>. So my first question is, which format is correct (I would guess the latter)? And is there something I am doing wrong in the import?

 

I have attached a screen shot of the actual text. It was created in Word 08 and converted to HTML then imported into Accordance.

 

Thanks.


You simply need to use the correct tagging (you're not using the correct end tag):

 

<h1>YOUR TITLE</h1>
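
To find headings that are opened but never closed with the matching end tag, a small check along these lines may help. It is a minimal Python sketch; `HeadingChecker` and `check_headings` are hypothetical names, and it only looks at the pairing of heading tags, not at anything Accordance itself does during import.

```python
import re
from html.parser import HTMLParser


class HeadingChecker(HTMLParser):
    """Collects heading tags (<h1>, <h2>, ...) that are not properly paired."""

    def __init__(self):
        super().__init__()
        self.open_headings = []  # stack of (tag, line number)
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if re.fullmatch(r"h\d+", tag):
            self.open_headings.append((tag, self.getpos()[0]))

    def handle_endtag(self, tag):
        if re.fullmatch(r"h\d+", tag):
            if self.open_headings and self.open_headings[-1][0] == tag:
                self.open_headings.pop()
            else:
                line = self.getpos()[0]
                self.problems.append(f"line {line}: </{tag}> has no matching open tag")


def check_headings(source):
    checker = HeadingChecker()
    checker.feed(source)
    checker.close()
    # Anything still on the stack was opened but never closed.
    for tag, line in checker.open_headings:
        checker.problems.append(f"line {line}: <{tag}> is never closed")
    return checker.problems
```

`check_headings("<h1>YOUR TITLE</h1>")` returns an empty list, while `check_headings("<h1>My User Tool<h1>")` reports both `<h1>` tags as never closed.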


Ok, I have made the changes, but it is still importing the markers and no hierarchy for the browser???


There has to be a reason. If you attach to a post your document (or a portion of it), I'll see if I can figure out why it isn't importing properly.

 

I've imported thousands of pages of books into User Tools, and I think I've encountered nearly every possible problem importing.


Here is a file I just created, it is doing the same thing. The actual document is an internal document that I am not able to pass around freely. Anyway, let me know what you think the problem is. Thanks.

 

 

BTW - I am also using <sup>1</sup> to reference the footnotes at the bottom of the document and it is doing the exact same thing. It's like it just doesn't recognize the tags.


Justin, I don't see a document attached to your last post. You could also email it to me: sean.alan.reed (at) gmail.com.


Archived

This topic is now archived and is closed to further replies.
