Only 0.7% of all HTML documents on the Internet are valid
The “How to cope with HTML” thesis
The results presented in this weblog are part of my master's thesis, which is available in PostScript and online with each page as an image. To read PostScript files, use GSview. The page http://elsewhat.com/thesis contains some more information about the thesis and references to it on the Internet.
Standard for HTML, isn’t that Internet Explorer?
The Hypertext Markup Language (HTML) is actually an application of the Standard Generalized Markup Language (SGML), which has been an ISO standard since 1986. SGML is a meta-language, meaning that it is a language that describes how other languages are defined. The major construct SGML does this through is the Document Type Definition (DTD). Some of you might have come across DTDs when using XML; it is the same construct, since XML is in essence a simplified version of SGML. The DTD defines which markup element types exist and how these element types relate to each other.
HTML defines three different DTDs for its latest version (4.01): the loose, the strict and the frameset. The major difference between these DTDs is that the strict one does not contain certain deprecated elements and attributes. The image below shows the relationships between the elements in HTML 4.01 strict; for a readable and zoomable version, try the PDF version.
How the validation was performed
The test data consisted of the entire contents of the Open Directory Project. They describe themselves best: “The Open Directory Project is the largest, most comprehensive human-edited directory of the Web. It is constructed and maintained by a vast, global community of volunteer editors.” Since this is an open source project, you can download the entire directory and do whatever you like with it. The data is available in the Resource Description Framework (RDF) XML format.
At the time I did the validation, the RDF file was about 1 GB unzipped, and from it I ended up with about 2.5 million URLs.
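Pulling millions of URLs out of a 1 GB XML file calls for a streaming parser rather than loading everything into memory. Below is a minimal sketch of the idea in Python; the `ExternalPage`/`about` element and attribute names are simplified assumptions about the dump's structure, and the sample data is made up.

```python
# Sketch: streaming URL extraction from an ODP-style RDF dump.
# The real dump is ~1 GB, so iterparse avoids loading it all into memory.
# Element and attribute names here are simplifying assumptions.
import io
import xml.etree.ElementTree as ET

SAMPLE = b"""<RDF>
  <ExternalPage about="http://example.com/a">
    <Topic>Arts</Topic>
  </ExternalPage>
  <ExternalPage about="http://example.com/b">
    <Topic>Science</Topic>
  </ExternalPage>
</RDF>"""

def extract_urls(stream):
    urls = []
    for event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "ExternalPage":
            urls.append(elem.get("about"))
            elem.clear()  # free memory for already-processed elements
    return urls

urls = extract_urls(io.BytesIO(SAMPLE))
print(urls)  # ['http://example.com/a', 'http://example.com/b']
```

Calling `elem.clear()` after each page is what keeps the memory footprint flat, which matters at this file size.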
The tools used for the testing were:
- A custom-modified version of GNU wget (used to download the HTML pages)
- lq-nsgmls [tar.gz], the actual SGML parser which performs the validation
- A custom-modified version of WDG's validation Perl script
Since a sequential run would have taken over 80 days, the validation job ran in parallel on over 30 computers at night. I managed to do between 300 000 and 400 000 validations per night. The validation script was very simple: it downloaded each page with wget, ran it through the SGML parser, and recorded the result.
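The parallel speed-up roughly checks out, as this back-of-the-envelope calculation (using only the figures from the text) shows:

```python
# Back-of-the-envelope check of the parallelisation gain (numbers from the text).
urls = 2_500_000
sequential_days = 80
per_day_sequential = urls / sequential_days    # ~31 250 validations/day on one run
per_night_parallel = 350_000                   # midpoint of 300 000-400 000
nights_needed = urls / per_night_parallel      # ~7 nights for the whole data set
print(round(per_day_sequential), round(nights_needed, 1))  # 31250 7.1
```

So the 30-machine setup compressed a multi-month job into about a week of nightly runs.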
The program's output for each URL was a binary string indicating which error types, if any, were present. Afterwards, the results were analysed with a separate program.
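The binary-string idea can be sketched as one bit per error type in a fixed order. This is an illustrative reconstruction, not the thesis code; the category names below are hypothetical, and the real catalogue had 20 entries.

```python
# Sketch: encoding which of the 20 error types occurred as a binary string.
# One bit per error type, in a fixed order; the names are illustrative,
# not the exact categories used in the thesis.
ERROR_TYPES = [
    "no_dtd_declared",
    "required_attribute_missing",
    "non_standard_attribute",
    # ...the remaining categories would follow in a fixed order
]

def encode_errors(found, catalogue):
    """Return a string of '0'/'1' flags, one per known error type."""
    return "".join("1" if e in found else "0" for e in catalogue)

flags = encode_errors({"no_dtd_declared", "non_standard_attribute"}, ERROR_TYPES)
print(flags)  # '101'
```

A fixed-width flag string like this makes the later aggregation step trivial: counting documents with a given error is just counting '1's at one position.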
The results acquired
0.7% of all HTML documents are valid!
As you can see in the first two columns, 14 563 HTML documents were valid HTML according to the standard, while 2 034 788 documents were invalid (documents which the program was unable to validate due to an unknown DTD, and those I could not download, were not included in the rest of the statistics). It is very odd to have a standard that only one in 140 adheres to. I believe this must be a world record for any standard 🙂
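The headline figures are easy to verify from the two counts above:

```python
# Checking the headline figures: valid vs. invalid documents.
valid = 14_563
invalid = 2_034_788
total = valid + invalid
share = valid / total
print(f"{share:.1%}")   # 0.7%
print(total // valid)   # one valid document per ~140
```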
The distribution among error types
The chart below shows which errors occur the most frequently in the non-valid html pages.
- No DTD declared
- Required attribute not specified
- Non-standard attribute specified
- Value not a member of a group specified for any attribute
- An attribute was specified twice
- An attribute value was not one of the choices available
- Invalid attribute value
- Misquoted attribute
- Omitted end-tag
- End-tag for element not open
- Break of content model
- Inline element containing block element
- Start-tag omitted
- Unknown entity
- Missing a required sub-element
- Invalid comment
- Non-standard element
- Start-tag omitted
- Premature end-tag
- Text placed where it is not allowed to be
It is interesting to see that 81.2% of all HTML documents do not specify which DTD they follow; it is therefore impossible to know which version of HTML they are trying to be compliant with. In these cases, the validation process assumed the HTML 4.01 transitional (loose) DTD.
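The fallback logic can be sketched as follows. The three public identifiers are the official ones from the HTML 4.01 specification; the detection function itself is a simplified illustration, not the actual Perl script used in the validation.

```python
# Sketch: guessing which DTD to validate against from the DOCTYPE declaration,
# falling back to HTML 4.01 Transitional when none is declared (as the
# validation run did). The public identifiers are the official HTML 4.01 ones.
import re

PUBLIC_IDS = {
    "-//W3C//DTD HTML 4.01//EN": "strict",
    "-//W3C//DTD HTML 4.01 Transitional//EN": "loose",
    "-//W3C//DTD HTML 4.01 Frameset//EN": "frameset",
}

def pick_dtd(html):
    match = re.search(r'<!DOCTYPE\s+HTML\s+PUBLIC\s+"([^"]+)"', html, re.IGNORECASE)
    if match and match.group(1) in PUBLIC_IDS:
        return PUBLIC_IDS[match.group(1)]
    return "loose"  # the default: 81.2% of documents declared no DTD at all

print(pick_dtd('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"><html>'))  # strict
print(pick_dtd("<html><body>no doctype</body></html>"))                      # loose
```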
Number of errors
The chart below shows how many of the different error types the HTML documents contained. The median is 5 different error types per non-valid HTML document. Four documents contained 19 out of 20 error types, which must be considered quite a feat (I still regret that I didn't send their webmasters a prize).
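Given per-document binary error strings like the ones described earlier, this statistic is a one-liner: count the set bits per document and take the median. The data below is made up for illustration.

```python
# Sketch: computing the median number of distinct error types per non-valid
# document from the binary error strings (sample data is made up).
from statistics import median

error_strings = [
    "10100000000000000000",  # 2 distinct error types
    "11111000000000000000",  # 5
    "11111110000000000000",  # 7
]

counts = [s.count("1") for s in error_strings]
print(median(counts))  # 5
```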
How could a standard evolve into this?
Well, you need to look a bit at the history to understand this. Browsers (primarily Mosaic, Netscape and Internet Explorer) added their own proprietary elements and attributes instead of going through the long standards process. Especially during the browser war between Netscape and IE, a lot of new non-standard elements were introduced, along with new ways of error-correcting HTML code. IE was generally more relaxed about the syntax of HTML documents (not requiring many end-tags, for example), and very few people validate their code; instead they only check that it is displayed as they would like in a browser.
As a very wise man (Homer Simpson) once said:
“It takes two people to lie,
one to lie
and one to listen”
If the browsers hadn't accepted incorrect HTML, we wouldn't have this problem. This is one of the reasons why the XML specification is very strict on the point that no error correction is to be done by parsers. But would the web have evolved like it has if the HTML standard had been implemented more strictly in browsers? In one way no, but on the other hand it would probably have led to much better tool support which anyone could use.
The low standard adherence makes it very difficult to release a new browser, as there are so many old, incorrect webpages you need to display correctly (which, unfortunately, is often defined as the way IE displays them). This is the main reason that new browsers such as Firefox and Opera still have problems on some webpages (and GBrowser will also have these problems, should it ever be realized).
What can I do?
There are a few actions you can take to improve the quality of your HTML code. I myself very rarely validate my HTML code, and when I do I usually have a few minor errors which I do not always correct. But I believe that, as a minimum, web developers should know which elements can be placed where, and remember to close open elements where needed. Testing in a few different browsers is not a dumb idea either.
If you would like to try an online HTML validator, I would recommend the “official” W3C validator at http://validator.w3.org/ or WDG's version, which is available from http://www.htmlhelp.com/tools/validator/.
And if you are really in the mood, try reading the W3C HTML 4.01 specification.
I was planning to do some validation on SAP EP and see if the support for alternative browsers is affected by HTML errors, but unfortunately I haven't had the time to do it. Maybe I'll write another blog on the subject later. In the meantime, you can check the login page of SDN for validity.
I have not covered XHTML in this blog either, as it was not available at the time I did the validation. XHTML is basically an XML version of HTML (so it's an application of XML instead of SGML).
Hopefully you have learned a bit more about HTML from this blog, and will be more conscious of your HTML code in the future.