Hacker News | Foolproof HTML (pumpula.net)
186 points by voctor on Feb 4, 2017 | 105 comments


> If you have a good strategy for validating your template files, I'd love to hear it!

Use S-expression syntax instead of SGML syntax, i.e. instead of:

<tag attr=value ...>content</tag>

write

((tag attr value ...) content)

and use Lisp to process it. It's actually quite straightforward. You can apply it to XML as well. Everything actually ends up looking a lot prettier this way.

See http://weitz.de/cl-who/ for an example of an implemented system that works this way. I've been using it in production for years. It works like a charm.
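For readers who don't use Lisp, the core idea is easy to sketch in Python: represent each element as a nested tuple whose head is (tag, attr, value, ...) and whose tail is the content, then walk it recursively. This is a toy analogue of the approach, not CL-WHO itself:

```python
from html import escape

def render(node):
    """Render a nested-tuple markup node to an HTML string.

    A node is either a string (text content, escaped on output) or a
    tuple whose head is (tag, attr, value, ...) and whose remaining
    items are child nodes, mirroring ((tag attr value ...) content).
    """
    if isinstance(node, str):
        return escape(node)
    head, *children = node
    tag, *attrs = head
    pairs = zip(attrs[0::2], attrs[1::2])
    attr_str = "".join(f' {k}="{escape(str(v), quote=True)}"' for k, v in pairs)
    inner = "".join(render(c) for c in children)
    return f"<{tag}{attr_str}>{inner}</{tag}>"

# ((p class "note") "Hello " ((b) "world")) in tuple form:
doc = (("p", "class", "note"), "Hello ", (("b",), "world"))
print(render(doc))  # <p class="note">Hello <b>world</b></p>
```

Because the structure is just ordinary data, a mismatched "tag" is a syntax error your editor catches immediately, which is the validation strategy being proposed.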


I've also done this for years using similar tools in Perl - https://metacpan.org/pod/HTML::AsSubs | https://metacpan.org/pod/Markapl | https://metacpan.org/pod/Builder

Here's an example of something I'm currently using/building in Rebol:

    [<tag> attr: value "content"]
Of course nothing new here, because there have been other Rebol HTML dialects (e.g. http://www.hmkdesign.dk/project.rsp?id=html-dialect)

And there are plenty of CL-WHO related tools out there too - http://stackoverflow.com/questions/671572/cl-who-like-html-t...


Yes! This is what I like about JSX/React - you're actually writing the markup as function calls/data structures, so invalid syntax is immediately obvious, like a missing closing parenthesis on a function call.


I'm not sure how React/JSX highlights markup issues. The clarity that Jade/YAML syntax provides is hard to achieve with the clumsy-looking React/JSX spaghetti code that Facebook has sold to frontend devs.


Interesting, but you would still have to account for the inconsistencies of html, like:

- Standalone, non-closed tags <!doctype>, <hr>, <br>, <link>, <meta>, <img> etc.

- Proper encoding or quoting of [< > " ' &] in html attributes

- <script> tags and CDATA

Probably lots more as well. Not rocket science of course, but you would want a tool where you have some confidence they've covered all of these things.
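A sketch of how a tool might cover the first two bullets in Python: the void-element names below follow the HTML spec, but `element` itself is a hypothetical helper, not any real library's API:

```python
from html import escape

# Void elements never take a closing tag (names per the HTML spec).
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}

def element(tag, attrs=None, content=""):
    """Emit one element, quoting attribute values and escaping text content."""
    attr_str = "".join(
        f' {k}="{escape(str(v), quote=True)}"' for k, v in (attrs or {}).items()
    )
    if tag in VOID:
        return f"<{tag}{attr_str}>"  # standalone tag: no closing tag, ever
    return f"<{tag}{attr_str}>{escape(content)}</{tag}>"

print(element("br"))                       # <br>
print(element("img", {"src": 'x"y.png'}))  # <img src="x&quot;y.png">
print(element("p", content='a < b'))       # <p>a &lt; b</p>
```

The point stands that `<script>` content and CDATA need further special-casing, which is exactly why you want a tool that has already enumerated these cases.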


For anyone interested, things are close, they just need to be glued:

In guile, (htmlprag) addresses the import[1]. It handles "weird html". The SXML modules[2] can handle encoding and rendering. CDATA is the only thing (that seems to be) missing. I'll have to look into that.

Someone also made a module to render html[3].

[1]http://www.nongnu.org/guile-lib/doc/ref/htmlprag/

[2]https://www.gnu.org/software/guile/manual/guile.html#SXML

[3]https://dthompson.us/rendering-html-with-sxml-and-gnu-guile....


> Not rocket science of course

And it depends on whether you're parsing or rendering. Rendering is much easier than parsing. But parsing is nonetheless a solved problem :-)


Last time I looked <textarea> and <pre> also worked kind of like <script>.


<pre> doesn’t at all. At a stretch, maybe you’re thinking of <xmp>?


While I also like Lisp syntax, it's nowhere close to the power of SGML as a text format. SGML gives you regular type checking and inference of omitted tags, injection-free content transclusion, user-defined Wiki syntax, parametric template expansion, and pipelined, automaton-based processing/styling. I've just published my slides/paper on this topic at http://sgmljs.net/blog.html .


There is a one-to-one correspondence between (correct) SGML and S-expressions so your claim that S-expressions are "nowhere close to the power of SGML" cannot possibly be true. It might be true that the tools available for processing S-expressions as markup are not as powerful as the tools for processing SGML, but that is not a limitation of the syntax.

BTW, when you say "inference of omitted tags" did you mean "inference of omitted close tags"? Because if so, this is not a feature. It's a patch to cover up a design flaw in SGML, namely, that close tags are required to match and so it is possible to make the mistake of omitting or mismatching them. This is one of the reasons S-expressions are superior to SGML: S-expressions are DRY. SGML isn't.


In fact, inferring missing close tags is criminally stupid; when such a situation is detected in any of the *ML languages, it should be loudly diagnosed.

Early in the web history, browsers tried to out-do each other in guessing what broken HTML means and render it. That led to a nasty situation of everyone having to emulate everyone else's bugs and hacks.


Like it or not, it's enshrined in the specification of HTML5.


It's actually been a standard part of HTML from the very start. It's got nothing to do with browser guessing and it's not new to HTML 5.

To give a concrete example for tannhaeuser's point, consider this document:

    <!DOCTYPE html>
    <title>…</title>
    <p>…
This is a completely correct, valid HTML document. The first thing to notice is that it's not a tree made up of elements. That first line is not a tag, and isn't part of the DOM tree.

Then we get to the DOM tree. It's got several implied start and end tags, so the tree itself looks like this:

    html
        head
            title
                …
        body
            p
                …

Again, this is a completely correct, valid document and this is the correct way to parse it. While you may be able to represent the parsed DOM as an S-expression, you can't represent all valid HTML documents as S-expressions, at least not in the convenient way people assume, and "broken HTML" is not to blame.
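You can observe the first point (the doctype is not a tag) with Python's stdlib parser. Note that `html.parser` does not perform the implied-tag insertion a browser does, so this shows only the token stream, not the full DOM tree above:

```python
from html.parser import HTMLParser

class Events(HTMLParser):
    """Record parse events to show that the doctype arrives as a
    declaration, not a start tag, and that <p> has no end tag here."""
    def __init__(self):
        super().__init__()
        self.events = []
    def handle_decl(self, decl):
        self.events.append(("decl", decl))
    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))
    def handle_endtag(self, tag):
        self.events.append(("end", tag))

p = Events()
p.feed("<!DOCTYPE html>\n<title>…</title>\n<p>…")
print(p.events)
# [('decl', 'DOCTYPE html'), ('start', 'title'), ('end', 'title'), ('start', 'p')]
```

No `html`, `head`, or `body` events appear; it is the HTML5 tree-construction stage (as in a browser) that infers those elements.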


> you can't represent all valid HTML documents as S-expressions, at least not in the convenient way people assume

Of course you can. Here is how to express your example as an s-expression:

((:!doctype html) (:title "This is the title") (:p "..."))

Here it is being rendered by CL-WHO:

    ? (princ (html ((:!doctype "html")) (:title "This is the title") (:p "...")))
    
    <!doctype "html">
    <title>This is the title
    </title>
    <p>...
    </p>


No, you're confusing HTML documents with their parsed DOM. The example you gave is different HTML to the example I gave.


Although the HTML he (lisper) gave is different in that it has the closing </p> tag, I'm failing to see any case where this is bad or would result in different behavior from your example. To be clear, he seems to be talking about rendering HTML only, not parsing it. Wouldn't both your example and his have the same parsed DOM? Am I missing something?


You sometimes need more control over the actual HTML document than that; for instance to work around browser bugs or for efficiency. But if you are only interested in the semantics, then it's still not an adequate representation of the document. How would you, for instance, add an attribute to the body element? If you're dealing with a semantic representation like a DOM library would give you, then this would be trivial, because the body element would be part of the model you are working with. But the body element doesn't exist in that S-expression. You'll have to manually insert it, which involves further domain-specific knowledge embedded in your code.

Basically, it's stuck in-between two states doing neither correctly. It doesn't represent the actual HTML document, and it doesn't represent the parsed document structure. It's an alternative model of the HTML document that serialises to something that would be parsed in an equivalent way. I'm sure that's useful in a whole bunch of different situations, but it's not as simple as "S-expressions can do everything HTML can, in a convenient way".

S-expressions are great, and very useful. But they aren't the right tool for every situation. HTML is an odd markup language that only appears simple superficially, with all kinds of irregular corner cases creeping in when you dig into the details. S-expressions would be a great fit if HTML were as simple as it appears on the surface, but it's not.


I agree, and I'd also like to add that I find general discussions about s-expr vs markup (as well as JSON vs XML years ago) pointless.

Markup is meant as a text format for content authors that can be parsed into a hierarchical structure, rather than as general-purpose data representation syntax, even though XML is being frequently (ab-)used for this purpose.

The original use case for markup is that you can take a piece of plain text and then mark it up with tags, unlike s-expr and/or JSON, which arise out of the syntax of a programming language and need e.g. verbatim text to be written as string constants/with quotation characters.


> The original use case for markup is that you can take a piece of plain text and then mark it up with tags

Yes, that was the original use case, but in actual practice HTML has not been used that way for a long time. Nowadays HTML is de facto used as a programming language for the visual representation layer of a browser. No one actually uses HTML to mark up documents by hand any more, for two reasons: first, no one writes plain text documents to use as source material for markup. They write Word documents, or TeX documents, but plain text source is all but unheard of nowadays. And second, HTML syntax is too clumsy and places too many demands on the user. So when ordinary people want to produce HTML they use WYSIWYG editors. When geeks want to produce HTML (and remember I'm talking about documents here) they use markdown. The only time anyone writes HTML nowadays is when they want to make a browser do something fancy.


You are confusing syntax and semantics. HTML and the DOM are two different things. HTML is a string of characters (syntax). The DOM is a data structure (semantics). Normally a DOM is produced by parsing HTML, but it can be produced in other ways (by running Javascript code, for example).

S-expressions are a data structure, different from the DOM, but S-expression syntax is a syntax. Normally S-expression syntax is parsed to produce S-expressions, but can also be parsed to produce other things. S-expression syntax can be parsed to produce a DOM. The easiest way to do this is to parse S-expression syntax into S-expressions, render those S-expressions into HTML code, and then use an off-the-shelf HTML parser to parse the HTML. But you could also write a parser that parsed S-expression syntax directly into a DOM if you wanted to. You could also write a transformation program that compiled S-expressions directly into a DOM without going through the intermediate HTML.

The answer to your question of how to add an attribute to an implied element is that it is not possible to do that in HTML. It is only possible to add an attribute to an implicit element of the DOM produced by parsing an HTML document that omits that element (because at that point the element is no longer implicit). The exact same thing is possible using S-expressions. For example, here's how you write tables in my library:

(:table (header header ...) (data data ...) (data data ...))

This string of characters is parsed by the Lisp reader to produce an S-expression that has a one-to-one correspondence with the string you see above. But then there is an extra processing step that transforms that into a different S-expression whose printed representation is:

(:table (:tr (:th header) (:th header) ...) (:tr (:td data) (:td data) ...) ...)

At that point you can manipulate that S-expression in the same way that you manipulate the DOM (because they are both just data structures). Once you're done, you convert the S-expression to a DOM. At the moment that is done by rendering to HTML, but as I noted above that is just an implementational convenience to take advantage of the fact that HTML->DOM parsers are available off the shelf. You don't have to do it that way (and indeed the world would be a better place if it were not done that way).

All of this is trivial when dealing with S-expressions precisely because of the strict 1-to-1 correspondence between data structure and visual representation that does not exist in SGML-derived languages. That is why writing code for SGML-derived languages using S-expression syntax is so advantageous. (Actually, this is true for any language, not just SGML-derived languages. It's just a little more obvious for SGML-derived languages because SGML syntax already kinda sorta looks like a data structure representation so it's a little easier to grasp what is going on.)
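The table-shorthand transformation described above is a plain tree-to-tree rewrite precisely because both forms are ordinary data. A toy Python analogue of that extra processing step (using nested lists with ":tag" heads to mimic the S-expressions; this is not the Lisp library's actual code):

```python
def expand_table(node):
    """Rewrite [:table [h1 h2 ...] [d1 d2 ...] ...] shorthand into the
    explicit [:table [:tr [:th h] ...] [:tr [:td d] ...] ...] form,
    mirroring the transformation step described above."""
    tag, header, *rows = node
    assert tag == ":table"
    out = [":table", [":tr"] + [[":th", h] for h in header]]
    for row in rows:
        out.append([":tr"] + [[":td", d] for d in row])
    return out

print(expand_table([":table", ["Name", "Age"], ["Ada", 36], ["Alan", 41]]))
```

The result is still just a nested list, so it can be inspected and manipulated with ordinary list operations before being rendered to HTML or compiled to a DOM.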


> HTML is a string of characters (syntax). The DOM is a data structure (semantics). [...] S-expressions are a data structure, different from the DOM, but S-expression syntax is a syntax.

I believe this is where the confusion is coming from. When you parse HTML syntax, you get a data structure; this is the same as when you read sexpr syntax, you also get a data structure. Both these data structures are different from the DOM tree.

Try this example:

    <pre>
      <span>one
      </span>
      <br>
      <span>two</span>
      <br />
    </pre>
Can CL-WHO generate HTML that matches that? (i.e. feed both into a tool like BeautifulSoup and produce the same data structure?)

Outside of CL-WHO and Hiccup-type libraries, you can of course use S-exprs to represent the same data structure. Here's a hypothetical S-expr syntax that might produce the same data structure:

    ((pre)
      "\n  " (span) "one\n  " (/span)
      "\n  " (br)
      "\n  " (span) "two" (/span)
      "\n  " (br/) "\n"
     (/pre))
Which is what I believe JimDabell meant by:

> you can't represent all valid HTML documents as S-expressions, at least not in the convenient way people assume


> Both these data structures are different from the DOM tree.

In the case of S-expressions that is true. In the case of HTML it may or may not be true. It depends on how the HTML parser is implemented. There is a "natural" mapping of HTML onto a parse tree that is different from the DOM, but that is not part of the standard (AFAIK).

> Can CL-WHO generate HTML that matches that?

Yes, though native Common Lisp does not provide c-like string escapes so putting in newlines is a little awkward. You could, of course, bring in a string interpolation library, but here's how you can do it without that:

    ? (defun nl () (who (fmt "~%")))     ; NL = NewLine
    NL
    ? (defun nli () (who (fmt "~%  ")))  ; NLI = NewLine + Indent
    NLI
    ? (princ (html (:pre (nli) (:span "one" (nli)) (nli) (:br (nli) (:span "two") (nl)))))
    
     <pre>
       <span>one
       </span>
       <br>
       <span>two</span>
     </br></pre>
Or you could do this:

    (html (:pre "
      <span>one
      </span>
      <br>
      <span>two</span>
      <br />
    "))
which looks like cheating but is actually closer to the spirit of the original.

The PRE tag is really weird because it actually changes the way things inside it are parsed. You can actually implement that in Lisp too via reader macros. CL-WHO doesn't support that out of the box, but it's not hard.

I can't imagine anyone actually wanting to do that, though. The PRE tag is for presenting pre-formatted text without changing its appearance, so embedding other tags inside it is kinda perverse. [EDIT: I was wrong about this. See below.]


There are uses for pre with tags embedded.

pre preserves line breaks and usually gives a monospaced font. However, tags are available to do whatever else.

A major example is that the Vim editor uses pre for formatting syntax colored code to HTML (when you do that with :TOhtml).

The output is a pre block containing various span elements which are styled with CSS.

BTW where in the HTML spec does it say that the interior of pre is parsed differently?

If we are parsing HTML (to Lisp objects or whatever), we should preserve the exact whitespace. The reverse generation should regurgitate the original whitespace.

If we take the license to eliminate newlines, then we ruin pre. The fix is simply not to do that.


> where in the HTML spec does it say that the interior of pre is parsed differently?

I was wrong about that. I had a vague memory of putting HTML inside a PRE tag once and having it come out as if it were escaped, but apparently I hallucinated that.

> A major example is that the Vim editor uses pre for formatting syntax colored code to HTML (when you do that with :TOhtml).

OK, I stand corrected on that too.

> If we are parsing HTML (to Lisp objects or whatever), we should preserve the exact whitespace. The reverse generation should regurgitate the original whitespace.

> If we take the license to eliminate newlines, then we ruin pre. The fix is simply not to do that.

Right.

Actually, I just realized that I mis-read the example. I saw <br /> and thought it was </br>. (Maybe the OP edited it?) In any case, the example now reads:

    <pre>
      <span>one
      </span>
      <br>
      <span>two</span>
      <br />
    </pre>
And you can render that in sexpr syntax as:

    (:pre "
      " (:span "one
      ") "
      " (:br) "
      " (:span "two") "
      " (:br) "
    ")
This is a particularly bad example to demonstrate here because the whitespace in the code plays badly with the whitespace in the HN markup. But I tried running this code and it does work. Here is the output copied-and-pasted verbatim from my listener:

    <pre>
      <span>one
      </span>
      <br />
      <span>two</span>
      <br />
    </pre>
Note that both BR tags are rendered as <br />.


It was <br> and <br /> in my example (</br> isn't a valid tag). The point I was getting at was that <br> and the self-closing <br /> are represented differently (<tag>, <tag />, and <tag></tag> are all different) in a parsed SGML data structure (though they are all equivalent in the HTML DOM tree in the browser).

This is why you would need separate forms to emit them properly with an S-expr syntax: (tag), (tag/), and (tag)(/tag) in my example.


You can do this:

    <tag> ==> (:tag)
    <tag/> ==> (:tag nil)
    <tag></tag> ==> (:tag "")
Using (:tag/) is a bad idea because that would screw up attributes.

CL-WHO doesn't support this, but that would be easy to change if it ever actually mattered to anyone.


> In HTML, <tag/> and <tag></tag> are equivalent

In HTML, <script></script> is valid. <script /> is invalid. <br /> is valid. <br></br> is invalid. So they are represented differently.

> Using (:tag/) is a bad idea because that would screw up attributes.

For my example?

    ((:tag/ :attr "value"))               => <tag attr="value" />
    ((:tag  :attr "value") "..." (:/tag)) => <tag attr="value">...</tag>
> You actually can distinguish between those if you really want to. It's just a matter of picking a convention.

That sounds like it could work. So a leading `nil' would be treated as a special case (not a child node):

    (:pre "
      " (:span "one
      ") "
      " (:br) "
      " (:span "two") "
      " (:br nil) "
    ")


> <script /> is invalid. <br /> is valid. <br></br> is invalid.

OK, then the best way to handle that is to let the HTML-renderer know that different tags need to be rendered differently if they're empty. Are there any cases where you would ever want to distinguish between the various kinds of empty tags?

    ((:tag/ :attr "value"))               => <tag attr="value" />
    ((:tag  :attr "value") "..." (:/tag)) => <tag attr="value">...</tag>
No, that's not what you want. Let's start with this general form:

((:tag attr value ...) content ...) => <tag attr=value ...> content ... </tag>

Let's assume we have no attributes so I don't have to keep typing those. Then we have:

((:tag) content ...) => <tag> content ... </tag>

In this case (no attributes) we can unambiguously remove the parens around (:tag) and get:

(:tag content ...) => <tag> content ... </tag>

Now if we have no content we get:

(:tag) => <tag></tag>

All this is still completely regular, no special cases. But now if we write (:br) we get <br></br> which is not what we want. So we need to tell the renderer that some empty tags get rendered one way, and other empty tags get rendered another way. CL-WHO does this.

Notice that we have not actually typed any / characters. This is important. The role played by / in HTML is played by the close-paren in sexpr syntax. If we re-introduce the / into our new syntax we will have a hopeless mess.

> So a leading `nil' would be treated as a special case

That is exactly right. If (and this is a big if) we want to be able to write something equivalent to both <tag/> and <tag></tag> in the same document we have to be able to distinguish between those two things in the markup somehow. I just looked this up and the distinction that HTML makes between <tag /> and <tag></tag> is that the former content is EMPTY while the latter content is "" (i.e. the empty string). So really the Right Thing would be:

(:tag) => <tag />

(:tag "") => <tag></tag>

That will work, but now we have to remember to add an empty string in some situations, e.g.:

((:script :src "...") "")

Personally I would find this annoying, so I would choose to go with the lookup table.
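The lookup-table approach can be sketched like this: the renderer distinguishes no content at all from empty-string content, and consults a void-element table for the truly empty case (element names per the HTML spec; the renderer itself is a hypothetical sketch):

```python
# Abbreviated void-element lookup table (names per the HTML spec).
VOID = {"br", "hr", "img", "meta", "link", "input"}

def render_empty(tag, content=None):
    """Render an element with no children.

    content=None  -> truly empty: void elements get <tag>, others <tag></tag>
    content=""    -> explicit empty string: always <tag></tag>
    """
    if content is None and tag in VOID:
        return f"<{tag}>"
    return f"<{tag}>{content or ''}</{tag}>"

print(render_empty("br"))        # <br>
print(render_empty("script"))    # <script></script>
print(render_empty("div", ""))   # <div></div>
```

With the table in place, `<script>` never needs the awkward trailing empty string, which is the convenience being argued for.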


Yawn. Hey look,

  #!/usr/bin/lisp
  (defun foo () a b c)
has an "inferred PROGN" around a b c and the first line isn't part of the tree.

What you're not getting here is that the above broken HTML has a canonical HTML representation. That canonical HTML can go to S-exp.

If we are doing HTML-in-Sexp, we can throw out some of the non-canonical aspects, keeping the ones we like. We can certainly infer element wrapping if someone using our HTML-in-Sexp finds that useful.


> Yawn.

Is that really necessary?

> the above broken HTML

As I very clearly stated, it's not broken. It's completely correct, valid HTML. Stick it in a validator if you don't believe me. Yes, I know a lot of people assume otherwise. That's because HTML is only superficially simple but has unexpected irregularities once you dig deeper.

This just reinforces my point that HTML isn't the nice neat package that fits well with S-expressions you think it is. The canonical HTML representation of that "broken" HTML is simply the HTML I provided, unaltered, which is not conveniently representable as an S-expression.

Please, before trying to reinvent HTML-as-S-expressions, take the time to learn what is and isn't correct HTML. You seem to be assuming the language is simpler than it is and that any irregularities are because the sample HTML provided is "broken". This isn't the case.


You don't seem to understand what "canonical" means; it's a certain preferred alternative from among correct alternative forms. Often, the canonical form provides some base definition and the other forms can be understood in terms of equivalence to the canonical form. Do we not understand that a body element is added to the document if it is missing? Is there not a body element in the resulting DOM?  If so, then the source syntax which has that body can be considered canonical.


Re tag inference: no I mean both start- and end-tag inference, like HTML does. It's explained in my linked paper, and it's not for "covering up design flaws". I think you're making quite strong conclusions here considering your lack of knowledge of SGML.


Start and end tag inference really means element inference. LIke, oh, here is a <p>...</p> but it's not in a <body> element; let's wrap it in one to canonicalize it. That can be done in the abstract syntax tree, rather than by literally inserting tag tokens.

(I hope for the sake of SGML and HTML that you're the one confusing character level syntax with tree manipulation.)


We're talking about syntax here. Start-tag inference is not syntactic. There is no way to tell if there's a missing start tag in:

    <a><c></c></a>
End-tag inference is semi-syntactic. You can tell that the following might have a missing end tag:

    <a><c></a>
But the previous example definitely does not.


SGML/HTML tag inference is guided by content model declarations, eg.

     <!ELEMENT html O O (head,body)>
tells SGML that the html element should contain a head element, followed by a body element. "O O" (capital letter O for omission) are tag omission indicators (in this case meaning that both the start- and the end-element tags for html can be omitted).

This is covered in depth in the linked paper/slides (in fact, covering JimDabell's example exactly).


You really don't seem to understand the difference between syntax and semantics. Syntax has to do with the rules that govern what strings of characters constitute legal programs. Semantics has to do with what strings of characters which are legal programs mean. For example, the following two documents are both syntactically correct:

    <!document ...>
    <a><c></c></a>

    <!document ...>
    <!element a O O (b,c)>
    <a><c></c></a>
They just have different semantics. If parsed as HTML, the first produces a DOM with two nodes and the second produces a DOM with three nodes.

By way of contrast, this is syntactically incorrect:

    <table><tr><td>data</img>
So is this:

    <img src=x<y.html>
There is no DOM that corresponds to those two examples.

One of the (many) problems with SGML is that it muddies the distinction between syntax and semantics. That is one of the (many) reasons that using S-expression syntax to write SGML-like languages is advantageous.


> You really don't seem to understand the difference between syntax and semantics.

Are you talking to me? I've just pointed out how SGML works and didn't say anything about syntax/semantics.


Yes, I'm talking to you. You're right, you didn't say anything about syntax and semantics. I did. Go back to the beginning of the thread:

> > If you have a good strategy for validating your template files, I'd love to hear it!

> Use S-expression syntax instead of SGML syntax [emphasis added]

Note the use of the word syntax. I'm talking about syntax. All of your responses have been at best irrelevant or at worst wrong because you either don't understand what syntax is, or you chose to ignore it. It's damned annoying, particularly when you start making demonstrably false claims like, "it's nowhere close to the power of SGML as a text format" (https://news.ycombinator.com/item?id=13569991).


If reality is so annoying to you, please refrain from (trash-)talking to me.


Reality is not annoying. Straw-man arguments are.

https://en.wikipedia.org/wiki/Straw_man


The fact that you can do HTML in Lisp is because there is code behind it implementing the semantics. What makes you think that those SGML requirements couldn't be done? Sounds to me like about one week's worth of evening hacking.


You're off by several orders of magnitude. Implementing SGML is a multi-year effort. You don't have to take my word for it; James Clark has said the same (he has implemented SGML and XML, and also DSSSL, the Scheme-based precursor of CSS and XSLT).

[1]: http://drdobbs.com/a-triumph-of-simplicity-james-clark-on-m/...


Admittedly, I'm assuming that we can use some typesetting back end; i.e., for instance, we don't have the requirement to generate a typesetter-ready image (or PDF document) without any third-party code; we don't have to do our own font rendering and kerning, etc. Also, I'm assuming we don't have to burn cycles Greenspunning up half of Lisp in some dumb language.


Very interesting! Thanks for the ref. I wrote about my own efforts along these lines; using Lisp lists is certainly a natural, logical starting point. There are a number of parallel systems in Scheme which I used in prior applications.

I fully agree sexpr syntax is both beautiful and efficient. The fly in the ointment is that doing anything substantial quickly becomes complex and requires dealing with a complex programming model to accomplish the task. That can be a daunting hurdle even for experienced users to surmount. It speaks to the appeal of simplified interfaces, which may help, but I'm pretty sure won't eliminate the problem altogether.

Of course every bit helps, having more tools can be a very good thing if we know how to apply the right tool for the task at hand.


> The fly in the ointment is that doing anything substantial quickly becomes complex and requires dealing with a complex programming model to accomplish the task.

Some problems are inherently complex, but whatever it is you want to do it will almost certainly be simpler in Lisp than any other language.

What sort of "substantial" thing did you have in mind?


Can confirm: Hiccup is really great.


Yes! Check out a site I made to show the world: http://hiccup.space

Also, if you're interested in something more interactive (i.e. that can process data into HTML), check out http://cljsfiddle.com


Wow, that's a neat library and demonstration. Thank you.


It was a nice read, and I like the way the idea was presented. But one thing that stood out to me as kind of odd was the distinction between typing lowercase and uppercase characters. Automatically assuming lowercase input to be a keyword does not seem like the right idea to me. Most sentences on the web might start with a capital, but not all of them do. (For example, here on HN the nav bar has "new | threads | comments | ..", none of which start with a capital.)


This irritated me too. I suppose a shortcut could be added to say "treat the next thing entered as a text element", along with the uppercase shortcut. So the user could type e.g. "!new" and the editor would make a text node with "new" as its content.


I agree, that many assumptions in the keyboard handling could get irritating quickly. Keyboard combos for actions would be the obvious choice, but I'd like to have something more automatic too.


I found my foolproof HTML in slim-lang.

It produces standards-compliant HTML and prevents me from writing code that is not well-formed.

The above is a nice side effect of its incredibly clean and terse syntax.

Now I feel cheated any time I need to write regular HTML.

https://github.com/slim-template/slim/blob/master/README.md

I used emmet for a while but slim improves on writing and reading code.


Slim becomes painful when using Ruby function calls (i.e. Rails helpers). Sometimes these functions have long lists of parameters, and wrapping to the next line is not always convenient. Moving, copying and pasting indented blocks is also quite painful, and it's much harder to see the nested structure than in, for example, Python code. It also has complicated syntax for attributes and text content, especially text content; I have to look up its docs again and again. It reminds me of YAML, which has the same problems (but much worse; YAML's documentation is the size of a book).

However, I'm not a fan of ERB or Django/Jinja template markups either. Indentation-based syntax has its advantages; it works well for Python and ML-like languages. It's just that Slim could be improved further.


Would you consider pasting a sample? I do work with rails and can't think of a time when I have experienced this issue.

Maybe when passing multiple vars to a partial?


Seems similar to HTML generation libs in Lisp, except with indentation instead of parentheses.


Yes, and here at least, I think indentation is a BIG win over parentheses.


I disagree. When you want to move a block of code around, it makes it too easy to make mistakes. Or if you want to add e.g. a div around a bunch of other elements, you need to be very careful with the indentation.

I've worked with both: s-expression based syntax and whitespace-sensitive syntax. I'll take s-expression based syntax any time. Have you used both?


Yes, extensively.

In any half-way modern editor indentation-aware copy and paste/block indent/dedent is super easy.


> In any half-way modern editor indentation-aware copy and paste/block indent/dedent is super easy.

Sure. But I've made mistakes in Python because I didn't select e.g. the last line of a region.

If you've used both extensively and prefer whitespace to sexps, that's totally valid. De gustibus non est disputandum.


Regarding the 'code without syntax' part: I wrote something that basically does this for any language that you have an EBNF grammar for. It turns the grammar into a graph; wherever your cursor is in the document at a given moment corresponds to some node in the graph; the edges going out of that node are the syntactically valid things you can insert from that point.

Unfortunately there is no UI for it atm—though there is UI for the editing portion (which almost exactly matches the author's .gif at the end of the document): https://www.youtube.com/watch?v=tztmgCcZaM4&feature=youtu.be...
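The "edges going out of that node" idea can be sketched with a toy grammar in Python (all names and the grammar itself are illustrative, not taken from the project described above):

```python
# Model a tiny grammar as a graph: from any nonterminal, compute the set
# of terminals that may legally be inserted at that point (its FIRST set).

GRAMMAR = {
    "stmt":    [["if_stmt"], ["assign"]],
    "if_stmt": [["'if'", "expr", "':'", "stmt"]],
    "assign":  [["name", "'='", "expr"]],
    "expr":    [["name"], ["number"]],
}

TERMINALS = {"'if'", "':'", "'='", "name", "number"}

def valid_insertions(symbol):
    """Return the terminals that may legally start `symbol`."""
    if symbol in TERMINALS:
        return {symbol}
    firsts = set()
    for production in GRAMMAR[symbol]:
        firsts |= valid_insertions(production[0])
    return firsts

print(sorted(valid_insertions("stmt")))
```

A real implementation would handle empty productions and left recursion; this only shows the shape of the lookup.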

It's a concept that a lot of people have explored. My understanding is that some academics were interested in it a while ago but never produced anything that worked well and kind of wrote it off. Now some people are revisiting it (e.g. Unison[1] and Jetbrains MPS[2]).

I think the core idea involved is a shift away from using text as a model for representing programs; instead, interact with more abstract representations of code, and render those abstractions as text. This allows your editor to have better understanding of the language you're using, so syntax becomes a property of a language's visualization rather than something totally central to it (and now you don't have to memorize it!).

[1] http://unisonweb.org/2015-05-07/about.html [2] https://www.jetbrains.com/mps/


The problem with these environments is that humans routinely want to write syntactically incorrect code as they transition from one valid program to another. An AST-based editor like this enforces that every intermediate step must parse as a valid program. This restriction makes making changes a massive pain in the ass, because you can no longer take the shortest route from where you are to where you want to be.

For example, imagine you're restructuring an if-else sequence into a switch statement. You're probably going to change that first "if" to a "switch", then go down changing all the "else"s into "case"s. This is eminently sensible, but an AST-based editor will stop you dead if you try to do this, because the intermediate states are not valid ASTs.

The only system slightly like this I've seen in widespread use is Paredit, which prevents you from typing anything that isn't a valid s-expression. This only works because s-expression syntax is so minimal, and most of Lisp's actual syntax is layered on top with macros. Even if you're using Paredit, your spit/slurp/kill operations will produce invalid intermediate code that would fail to compile.


That's an interesting point. I've heard other people bring up this issue in a vague way, but your example makes it clear why people would have a concern about this.

At the same time, however, I'm not convinced it's a serious issue—it's more important to save work and avoid re-typing the shell of a switch when you have to type it one character at a time. If you're mapping language constructs to single keys (this was the way I went), you can crank out a new multi-block switch statement in a few keystrokes; then (in my editor anyway), you can just drag the code blocks from the if statements into the new structure.

Even better, since the editor has easy access to info about language structure, it could include automatic transformations to transition between structurally similar language constructs.

I think it seems unnatural at first because we're so used to typing characters out one at a time, and it seems like we're throwing that skill out the window when considering an alternate form of editor. I still don't really see an intrinsic limit though.

I'd appreciate any other counter-example you may have (a lot of why I don't continue working on my own project at the moment is concern that these are lurking and I'll only discover them after lots more wasted effort—so if I could definitively rule out that this project is a good idea, that'd be great.)


I'd say the opposite - in your shoes, I'd just try it, doomed or not! (But then again building an IDE is My Thing atm - check my profile.) Building one of these systems is going to be really fun.

If you don't have the time/inclination to just do it cause it's a cool project, one option is to "paper prototype" the feasibility of the transforms you'd need.

Next time you're writing any code in the first language you want to support in that AST editor, take a screen recording of at least half an hour's coding. Watch it back later, and keep a count of how many times you:

* transition from one valid program to another in a simple way (eg just writing a line of code from scratch that's valid first time)

* transition in a way that would require abandoning your keyboard and reaching for your mouse in your editor (NB this is a seriously slow operation, usually taking a couple of seconds or more).

* transition in a complex way your editor would support if it had that kind of transform built in (eg if/else-> switch), and work out exactly how clever that transform would have to be (would it work if the if/elses weren't simple equality checks? If they weren't, how would you make the transform if you weren't allowed any non-compiling code?) Then enumerate the distinct such transforms you'd need to cover that half-hour of typing. (You will probably discover a Pareto-type distribution - the question is how tight the head/long the tail is. My guess is you'd need a huge number of special transforms to cover 95% of your edits, but data beats guessing)

* Jump around between statements (eg half-write something, leave it in an utterly broken state, then go actually define the variable/function you're using, then jump back and finish your thought). Your editor would have to permit this somehow or it will be really frustrating to use.


Thanks for the reply meredydd. That does seem like a good approach.

Actually, I spent 1.5 years building an editor in this style (see link in my original post here) while working at a grocery store :)

Unfortunately, while I can see now that I should have first been super focused on validating the concept, I instead just kind of ran with it, assuming it was going to work, and built this massive, probably over-engineered framework for generating editors for given grammars (my starting point being: not even knowing what a grammar was, thinking I'd have to invent some kind of 'linguistic constraint description' format ;) ).

As it stands, the editing portion works well enough for a demo, but the program never reached the point where I could write code with it, so a lot of these questions are still unanswered for me. I think doing the paper prototype on these edit actions would be a good pre-coding validation step.

I did check out Anvil briefly. My two second, potentially incorrect summary would be VB for web apps. Is that close? Are you guys doing anything special for working with text itself?


"VB for web apps" is a fair summary. We made a conscious decision against anything AST-based (for the reasons I outlined above), and we're about to deploy an Intellisense-style autocompleter instead. Our general philosophy is that coding is fine - it's the web/Javascript ecosystem that's the problem. We fix that, then get out of the way and let you write code :)

As for validating one's side projects before starting work - I personally think the Lean Startup thing can be taken a bit far. An interesting leisure-time technical project doesn't need the same level of prior justification as a big commercial project (even if it might one day evolve into one), as long as you're having fun. We do this stuff because we enjoy it!


> it's the web/Javascript ecosystem that's the problem.

I agree—that's definitely the bigger issue. I think differences between editors and languages tend to be overblown in general.

Unfortunately the project for me was not quite a fun thing: I ran into an issue with mouse/keyboard overuse, so I was trying to build an editor that could work efficiently with motion sensors. In the meantime, coding was painful :/ I'm still looking for an alternative for that reason—but VR and mobile also have needs for efficient editors that can be operated with fewer unique symbols.

Also, another solution to the AST editor issue: just convert any non-parsing nodes into plain text nodes until they're fixed.


Your editor looks insanely slick! I had no idea there was prior work in this area. It sounds like your editor could trivially support html.

That last paragraph is exactly what I was thinking with foolproof html, except applied to markup/data languages.

Is your editor available to try somewhere?


Thanks!

Unfortunately it's not available to try. I spent a lot of time building it in my free time and eventually got burnt out :/ It's still not to a point where it's usable (it's edit only). I'd love to revisit the project in some form, but time/money/other projects are obstacles at the moment.

I still think the concept has merit, but the implementation is more difficult than it seems (including lower-level design decisions, e.g., like the discussion in the other comment here). Good luck to you if you do give it a try, though. If nothing else, you'll probably learn a lot ;)

There's a little more info on the project here, btw, if you're curious: http://westoncb.com/projects/tiledtext


My story is kind of similar, including the wrist pain part. I've been brewing this thing for a bit over ten years. I got burnt out for other work and the editor some years ago and had to take a long break. Last year I started about the tenth new prototype and finally got it on a path that I think could lead to something that actually works.

This time I'm trying to give away the code, ideas and everything, so maybe someday someone would continue the work so I won't have to. I just want to see it get made somehow and use the damn thing.


Hehe, I know the feeling.


I made a Scalatags library that lets you write your HTML templates in-line in your Scala code (or Scala.js), similar to React's JSX but without the XML syntax:

    div(
      h1(id:="title", "This is a title"),
      p("This is a big paragraph of text")
    )
http://www.lihaoyi.com/scalatags/#ScalaTags

Although this ties it to the Scala language, it does not tie you to a particular platform: the templates can run on the JVM, in the browser with Scala.js, and even in the new scala-native LLVM backend.

The fact that the chosen language is statically typed means you get "basic" level validation right off the bat, thanks to the compiler:

http://www.lihaoyi.com/scalatags/#WhyScalatags

Basic typo-detection, enforcing that things are properly nested, anti-XSS enforcement, along with all the other "standard" IDE features: jump-to-definition for your sub-templates (which are just functions), use of arbitrary code within your templates (not unlike JSX), etc. Also, the fact that your templates are compiled into bytecode/JS at build time, rather than interpreted while re-implementing scope management and variable bindings in user-land code at runtime, means the templates are really very fast.

All this comes "for free", since your templates are code like any of the other code you write, and are handled by the IDE and optimized by the compiler in the same way.

The downside, of course, is that templates are code and thus cannot be provided safely by third parties, similar to React's JSX.


The problems I've mostly run into regarding HTML arise because we've started shoving random JavaScript and JavaScript-esque things into it, often needlessly overcomplicating things. The author says it himself, "That's fine for plain HTML", referring to validation.

This is why back in 2014 I decided to focus on making as many of my webpages pure HTML5/CSS3 as possible, keeping JavaScript completely out of the picture; now, if I do break that rule, I make it LibreJS compatible.

For an editor, I am an avid emacs fan, but for templates I have of late become enamoured with AsciiDoc and Asciidoctor. Having originally used them for sysadmin documentation, I am beginning to realize my method of writing web pages (SSHed into a box using emacs, updating with cron jobs calling asciidoc and bash scripts) might actually be good for normal web stuff too.

Of course, there are some downsides, but I never really was a WYSIWYG sort of person anyway, because every time I've gotten my hopes up about some WYSIWYG editor, it failed me in countless and unfathomable ways.

I feel like people have veered too far away from the core purposes of HTML and CSS: HTML for content and structure, CSS for design and display control.

We've gotten to the point where I can visit any random Alexa top-500 site and anywhere from 10-50 scripts try to load. It's ridiculous, and I think a return to simplicity will be key for most websites, because let's face it: you probably don't need CRUD for $project.


    Code ↔ WYSIWYG
    Could there be something unexplored in the middle?
WYSIWYM - What You See Is What You Mean.

Most famously used by the LaTeX powered document processor, LyX.

https://en.wikipedia.org/wiki/WYSIWYM


"Could there be something unexplored in the middle?"

I believe so.

At the moment I am working on the next version of my blocknote.net editor - an editor for "Web writers", people who create textual content for the Web.

Check http://sciter.com/new-blocknote-net-application-is-getting-i...

In particular, see the first screenshot. It has a so-called block outline bar that shows the structure of the HTML underneath the WYSIWYG text.

IMO: for most people, WYSIWYG is still the preferable way of editing text. Markdown, and especially HTML source code editing, is far from humane.


One of the things I like about JSX is that it always makes sure you write valid syntax.

The problem with things like Slim is that you cannot copy-paste directly from HTML.


Interesting. Despite the many existing approaches to writing HTML, more are invented all the time. Currently I'm working on yet another one. My idea follows the Lisp tradition, based on its hierarchical list structures. It's surprisingly easy to parse such simple lists and generate HTML from them.

    '(html
      (head
        head stuff ...)
      (body
        (h1 "My HTML")
        (div (@ class "main-div" ...)
          "main div stuff")))
Of course it gets a lot more complicated when generating complete web pages and apps, but using paste operators, procedures and variables, it can produce anything necessary.
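The same nested-list idea can be sketched in Python too (a toy renderer for illustration, not any of the libraries mentioned in this thread):

```python
# HTML as plain nested lists: the document can't be malformed,
# because every element closes by construction.
from html import escape

VOID_TAGS = {"br", "hr", "img", "link", "meta", "input"}

def render(node):
    """Recursively serialize a nested-list node to an HTML string."""
    if isinstance(node, str):
        return escape(node)                      # text content gets escaped
    tag, *rest = node
    attrs = ""
    if rest and isinstance(rest[0], dict):       # optional attribute dict
        attrs = "".join(f' {k}="{escape(v, quote=True)}"'
                        for k, v in rest[0].items())
        rest = rest[1:]
    if tag in VOID_TAGS:
        return f"<{tag}{attrs}>"                 # void tags never close
    return f"<{tag}{attrs}>" + "".join(render(c) for c in rest) + f"</{tag}>"

doc = ["body",
       ["h1", "My HTML"],
       ["div", {"class": "main-div"}, "main div stuff"]]

print(render(doc))
# → <body><h1>My HTML</h1><div class="main-div">main div stuff</div></body>
```

Escaping and void tags come along almost for free, which is exactly the kind of inconsistency the thread above worries about.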

A very similar approach is possible with Tcl, which I am using now for several reasons including integration with legacy code. In Tcl the above looks like:

    {html
      {head
        head stuff ...}
      {body
        {h1 {! My HTML}}
        {div {@ class "main-div" ...}
          {! main div stuff}}}
In Tcl variables and procedures can be interpolated into the output by wrapping the list in a "subst" command:

    [subst {html .... [my-proc $my_var] ....}]
The hierarchical nature of Lisp-like languages naturally resembles the GUI outline form the author presents in the article. Using Tcl it's occurred to me that it should be possible to write a "front end" to the parser/generator code using Tk (via the text or canvas widgets) which could resemble the author's illustrations.

Ultimately I think the difficulty lies in the need for constructing customized templates to enable modular, easy-to-use GUIs for particular tasks. Real HTML can quickly become very complex which is the reason for the thousands of frameworks, generators and templating systems in existence. While I think the author's approach or my own can potentially be helpful, neither is going to magically eliminate the burden on the programmer.


There's also Blaze which is a nice haskell library to write templates or just plain HTML using combinators: https://jaspervdj.be/blaze/tutorial.html


I don't think this is all that helpful. If you use a moderately-decent text editor, it probably has a closing feature, and an autoindent feature.

If you're writing a new tag in emacs, you just need to write the opening tag, then press "C-c /", and it will close it for you. If you have a syntax error of this magnitude, the autoindenter will also help you realize. Just select the region (or the whole file) and press tab, see where the indentation stops making sense, and follow back to the opening tag by column.

There are heavy-handed approaches to this, but I find that default-configured emacs with these two tricks can reduce your error rate to effectively zero, without requiring you to learn a new tool just for [X][H]TML. I say "just" for HTML having spent about half of my working hours in HTML for two years full time. I think this is enough.


Hi! Article author here.

Emmet is a great plugin that has tag matching and all kinds of navigation/editing shortcuts too. Like those emacs shortcuts, it does reduce error rates and speed up editing significantly, at least for me, but I'm interested in exactly a heavy-handed approach: I'd like an app/plugin/whatever that makes it impossible to make many types of mistakes, even if it restricts what I can do.

I code html every day (and I do love it), but I'm a terrible and lazy typist. I make typos and mistakes constantly, so I'll take all the help I can get. So I'm trying to make something that will solve my specific woes in html/css development and I'm hoping someone else finds it useful too.


As you mention, a decent text editor solves large parts of this.

I have had occasions where I needed something more. When that's the case, tidy-html5[1] + running a diff afterwards has been sufficient.

[1]http://www.html-tidy.org/documentation/


Is hand-editing lots of HTML a common issue for web devs? I'd think more work is done elsewhere.


I've never encountered this problem to the extent that I needed a build step for HTML. I wonder if frontenders who work only with HTML/CSS find it helpful, since they don't have the concept of breaking code up into templates or partials with a server-side language.

No disrespect to the author - what you've done is cool and your presentation is awesome. My biggest concerns are:

1. Onboarding & ramp-up time that's now added to my team just to write HTML.

2. Turning away potential hires who don't want to work with an over-engineered tech stack. Much opinion can be formed just from adding "we use a transpiled HTML processor" to your job requirements.

3. Technical debt & tech lock-in. Will this technology be a detriment in 5 years? Do we get enough from it right now that it's worth future legacy costs?


It is, things like Slim are really helpful.

Over the years, editors have gotten better at showing you when you've made a mistake, but it is still somewhat cumbersome to write big HTML files.

Things like Bootstrap really tend to add complexity, because you also style with divs.


I don't think so. Certainly from time to time, but not on a daily basis; in my experience, it's writers/editors that need a decent HTML editing tool.


Syntax looks similar to Pug (previously called Jade, which was the default templating language in Express).

https://pugjs.org/language/tags.html


About the gif at the bottom of the page, where a div is swapped in place with an html sister node...

That's 100% possible to do in Emacs, while editing plain HTML, using smartparens-strict-mode. Basically it treats the HTML like Lisp code, so by "autobalancing parens" the editor lets you swap sexps, which performs the same transformation as the drag and drop in the editor, but much quicker and easier (as long as you can invest in learning the keys)!

Of course with Hiccup[0] or Reagent[1] code in Clojure/script, it's perfectly possible too. The same keybindings even.

[0] http://hiccup.space [1] http://cljsfiddle.com

(I'm the author of those little projects)


If the focus is on something close to source code editing, then why not simplify HTML itself a bit?

Sciter's HTML parser supports shortcut constructs like:

   <button|checkbox(first)> woo! </button>
which is

   <button type="checkbox" name="first"> woo! </button>
And

    <div#foo>  === <div id="foo">
    <div#bar.solid.main>  === <div id="bar" class="solid main">
It's not much, but it makes life a bit easier while keeping all other HTML features intact.
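The #id/.class rewrite is simple enough to sketch in a few lines of Python (Sciter's actual parser is C++; this just illustrates the expansion, and the regex is my guess at the accepted grammar):

```python
import re

def expand(shorthand):
    """Turn e.g. 'div#bar.solid.main' into a full opening tag."""
    m = re.fullmatch(r"(\w+)(#[\w-]+)?((?:\.[\w-]+)*)", shorthand)
    tag, id_part, classes = m.groups()
    out = f"<{tag}"
    if id_part:                                   # '#bar' → id="bar"
        out += f' id="{id_part[1:]}"'
    if classes:                                   # '.solid.main' → class="solid main"
        out += ' class="{}"'.format(" ".join(classes[1:].split(".")))
    return out + ">"

print(expand("div#bar.solid.main"))  # → <div id="bar" class="solid main">
```

Emmet abbreviations and Pug use essentially the same shorthand, so this is a well-trodden rewrite.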


The author might find MPS interesting. It is another attempt at providing an AST editor for coding.

https://www.jetbrains.com/mps/


Use Hiccup - a Clojure library. You'll get good syntax highlighting, see only the important stuff (e.g. no closing tags, because it's Lisp) and have much less room to make stupid errors. Many people complain about the plethora of parens in Lisp code, but whatever the syntax is, you'll get used to it in <20 hrs.


Exactly!

I already mentioned it in this thread but, check out http://hiccup.space to show people how great it works.

For more in depth examples see http://cljsfiddle.com


We had XHTML, which didn't allow anything that wasn't properly formed XML. It went nowhere.


Interesting effort. By the way: does somebody know a good linter which works on pure HTML, Vue templates (the HTML part) and Jinja templates (the HTML part)? Searching for this, ideally for VSCode.


The article describes Markdown[1], basically. In my opinion, content should be written once and transformed to any format[2], including ConTeXt[3], docx, HTML, EPUB, or plain text. But here's a puzzler.

Every software developer defines and uses variables. If variables are so powerful, why do WYSIWYG word processors lack the ability to quickly and easily insert variable definitions?[4] In the screenshot, the left-hand side provides an editable variable definition hierarchy, the middle pane provides a Markdown editor, and the right-hand side provides a real-time HTML preview of the content with variables interpolated and substituted.

After variables, programming language integration--such as R[5]--comes naturally. Consider the following syntax (that could be hidden through a UI):

    `r#csv2md('data.csv',totals=TRUE)`
The function imports data from an external data source and converts it to Markdown. The HTML version can be styled, or piped through pandoc to create beautiful PDF output[6]. In the screenshot, the left-hand side shows the csv2md command (totals are calculated automatically for numeric columns), the middle pane shows an HTML preview (real-time), and the right-hand side shows a PDF produced from the same source document using pandoc and ConTeXt.

The CSV file could also be a JSON request over HTTP, or database query, meaning that the document is always up-to-date with the most recent data: a living document. This could be accomplished without having to change the source document, as well:

    `r#import(v$data$source,totals=TRUE)`
Here, v$data$source is a variable that defines a data source. In this case, the value might be 'data.csv', but could be 'protocol://host/app/api/json'. The variables themselves could be sourced from a YAML file, JSON data stream, or remote database.
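The v$… substitution could be sketched like this (a toy in Python; the real tool's syntax and resolution rules may well differ):

```python
# Substitute `v$path$to$value` references in a document from a nested
# definition tree. The variable tree and reference syntax are assumptions
# modeled on the examples above.
import re

variables = {"data": {"source": "data.csv"}, "title": "Q3 Report"}

def lookup(path):
    """Walk the nested variable tree along a '$'-separated path."""
    node = variables
    for key in path.split("$"):
        node = node[key]
    return node

def interpolate(text):
    return re.sub(r"v\$([\w$]+)", lambda m: str(lookup(m.group(1))), text)

doc = "# v$title\n\nImported from v$data$source."
print(interpolate(doc))
```

Sourcing `variables` from YAML or JSON instead of a literal dict is then a one-line change.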

This is what I've been working towards for the last several months.[7] There's more that this editor can do, but it's still in early beta stages, should anyone care to try it.

[1]: https://github.com/jgm/pandoc/issues/168

[2]: http://pandoc.org/

[3]: http://wiki.contextgarden.net/

[4]: https://raw.githubusercontent.com/DaveJarvis/scrivenvar/mast...

[5]: https://github.com/bedatadriven/renjin/

[6]: http://i.imgur.com/Qe6mTpx.png

[7]: https://github.com/DaveJarvis/scrivenvar


Hm... I think this post is trying to solve a different problem.

Markdown solves the problem of having content (paragraphs, lists) syntax free.

This post:

> What if you had to write a language where you can make mistakes, but there are no errors? Where your parser just silently accepts anything you give it. Where you'll have to carefully compare your intentions to the parsed output to figure out what went wrong.

> Do you write HTML?

> What I just described was how browsers handle HTML. They will happily accept anything you give them and try their best to make sense of it. They could swap out element places, change attributes, or do whatever if there's a typo in your code.

Looks like they're trying to make an easy-to-write HTML for all HTML, not just content.


> Looks like they're trying to make an easy-to-write HTML for all HTML, not just content.

Yes. I know it's a long thread, but pandoc issue 168 and the proposed "foolproof" syntax are, in essence, addressing the same issue. They both strive to shroud the nuances of HTML syntax in simpler clothing.

Computationally, there's insignificant difference between this (proposed syntax):

    ; --- div {.foo}
    ; Ipsum dolor sit amet
And this (foolproof syntax):

    div class:foo
      Ipsum dolor sit amet
What Markdown doesn't have, as you allude to, are data bindings necessary to drive an interactive experience. Although, with a little ingenuity, a processor could provide data binding with a Markdown-compatible syntax, such as:

    ; --- input {.email}
    ; Email

    ; --- shuttle {.from}
    ; Available List
    ; --- shuttle {.to}
    ; Selected List
From a high enough perch, they both look the same.
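To make the comparison concrete, here is a toy interpreter for the indentation-based ("foolproof") side in Python. The `class:foo` attribute syntax is taken from the example above; everything else (the tag set, the text-vs-element rule) is guessed for illustration:

```python
# Assumed tag set; a real tool would know all of HTML and be less ambiguous
# about which lines are elements and which are text.
TAGS = {"div", "p", "span", "input", "shuttle", "h1"}

def to_html(source):
    out, stack = [], []          # stack holds (indent, tag) for open elements
    for line in source.splitlines():
        if not line.strip():
            continue
        indent = len(line) - len(line.lstrip())
        while stack and stack[-1][0] >= indent:   # close elements we've dedented past
            out.append(f"</{stack.pop()[1]}>")
        tokens = line.split()
        if tokens[0] in TAGS and all(":" in t for t in tokens[1:]):
            attrs = " ".join(f'{k}="{v}"'
                             for k, v in (t.split(":", 1) for t in tokens[1:]))
            out.append(f"<{tokens[0]} {attrs}>" if attrs else f"<{tokens[0]}>")
            stack.append((indent, tokens[0]))
        else:                                     # anything else is text content
            out.append(line.strip())
    while stack:
        out.append(f"</{stack.pop()[1]}>")
    return "".join(out)

print(to_html("div class:foo\n  Ipsum dolor sit amet"))
# → <div class="foo">Ipsum dolor sit amet</div>
```

Closing tags fall out of the indentation, which is the whole pitch: the author can't forget one.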


Westoncb said it quite elegantly: "I think the core idea involved is a shift away from using text as a model for representing programs; instead, interact with more abstract representations of code, and render those abstractions as text. This allows your editor to have better understanding of the language you're using, so syntax becomes a property of a language's visualization rather than something totally central to it (and now you don't have to memorize it!)."

Foolproof HTML of course tries to do that only for HTML, but I do hope I can expand it to CSS, JSON and maybe generalize it later on.

So I'm not advocating for any specific syntax, but rather a way to edit html so you don't need to care about the syntax. The HTML syntax would still be there in the .html (or .erb, .php or whatever your template language is) the same as it has always been, but you could visualize it in a way that's most comfortable to you and edit it as a structure, so it would be impossible to produce syntax errors.

This could end up being a text editor plugin or a stand alone app, some sort of library or kind of anything, but I'm prototyping interactions and ideas as a standalone web app because that's what I'm good at.


Variables are pretty easy to work with in Word, though they've been a little harder to get to since 2007.


I remember XHTML with the application/xhtml+xml mime type. Errors were pretty visible. The world backed out of it.


Isn't this haml without templating?


This. I literally saw HAML. I love HAML. JSHAML would be awesome.


So when are we ever going to get a replacement for this deeply flawed technology called HTML?


There are some replacements for different parts of it. Like PDF, and whatever the iPhone's UI is called, for instance.


That FrontPage screen shot gave me some flashbacks :-)


This is one of the best write-ups of all time.

In a way, Python is foolproof C. (I am thinking about the syntactic stuff around indentation, braces and semicolons, which is what I mean by it. I think if you can code in both syntaxes, read the article carefully, and generously try to follow what I am talking about, you know what I mean.)



