Thursday, July 14, 2011

Pygments to the rescue

I've finally realized that Pygments is the Answer. I can simply run my code through it and get the output I'm looking for with only a little manual tweaking (see below); I'm sure this can be integrated into Pygments with a bit of hacking. Then, a simple copy and paste retrieves the original source code! Some additional tweaks:
  1. I'd like to hide the comment character (# in Python, // in C) to make it look nicer, by giving it a very small font. I suspect I could edit the output of the lexer, or perhaps play with the formatter, to do this.
  2. I need to disable HTML escaping for comments (this is done in the formatter). Perhaps a quick check for non-HTML < characters and escaping only these would help.
  3. I rely on the code editor to support line wrapping, which isn't always present. I don't see an obvious work-around for this. 
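
For tweak 2, here is a minimal sketch of the "escape only the non-HTML < characters" idea, written as a standalone function rather than as a change to the Pygments formatter. The names `ALLOWED_TAGS` and `escape_stray_angles` are my own placeholders, the whitelist is an assumption, and ampersands are deliberately left alone so that existing entities survive. It is a heuristic: text such as `a <b and c> d` would still be mistaken for a tag.

```python
import re

# Hypothetical whitelist: tags a comment is allowed to contain.
ALLOWED_TAGS = {"p", "ul", "ol", "li", "h1", "h2", "h3",
                "b", "i", "em", "code", "a", "img"}

# "<", an optional "/", a tag name, optional attributes, then ">".
_TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)\b[^<>]*>")


def _escape_text(text: str) -> str:
    """Escape angle brackets only; ampersands are left as they are."""
    return text.replace("<", "&lt;").replace(">", "&gt;")


def escape_stray_angles(comment: str) -> str:
    """Escape < and > in comment text, except in whitelisted tags."""
    out = []
    pos = 0
    for match in _TAG_RE.finditer(comment):
        if match.group(2).lower() not in ALLOWED_TAGS:
            continue  # not a known tag, so it is escaped as plain text
        out.append(_escape_text(comment[pos:match.start()]))
        out.append(match.group(0))  # keep the tag untouched
        pos = match.end()
    out.append(_escape_text(comment[pos:]))
    return "".join(out)


print(escape_stray_angles("<p>if a < b and c > d</p>"))
print(escape_stray_angles("vector<int> v"))
```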


My simple tweak to the HTML style from Pygments to make comments look nice:
body .c { color: #408080; font-family: Sans-serif; white-space: normal; font-size: 90% } /* Comment */

Documentation and more

Documentation

I continue to work on my documentation idea. So far, I've run into two problems. First, I'll need my code to HTML converter to recognize strings so that it doesn't mistake comment characters in a string for a true comment: printf("// not a comment"). Second, the problem of initial spaces: how can I properly translate them? If a source file indents a comment by 4 spaces then the following line of code by 4 spaces, it looks fine. However, in HTML the code font and comment font will be different, so those 4 spaces cause things to look ugly either in HTML or (if the spacing works in HTML) in the code.

I can think of a couple ideas:
  1. Auto-space -- in code, indent a comment line to match the spacing of the next code line. This would work most of the time. In HTML, do the same; make the space characters the same as the code font to ensure alignment.
  2. In HTML, always insert the same number of spaces as the code, in the code font. How would I detect these spaces and automatically remove them when going the other direction? Perhaps tagging the initial spaces as .
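
A rough sketch of idea 2, assuming the leading spaces are wrapped in a span that the style sheet renders in the code font. The class name `indent` and both function names are placeholders of mine, not an existing convention. The point is only that a tagged run of spaces can be detected and turned back into plain spaces when going the other direction.

```python
import re

INDENT_CLASS = "indent"  # hypothetical class, styled with the code font in CSS

_INDENT_RE = re.compile(
    r'^<span class="%s">((?:&nbsp;)*)</span>' % INDENT_CLASS)


def indent_to_html(comment_line: str) -> str:
    """Wrap a comment line's leading spaces in a tagged span."""
    text = comment_line.lstrip(" ")
    count = len(comment_line) - len(text)
    return '<span class="%s">%s</span>%s' % (
        INDENT_CLASS, "&nbsp;" * count, text)


def html_to_indent(html_line: str) -> str:
    """Turn the tagged span back into ordinary leading spaces."""
    match = _INDENT_RE.match(html_line)
    if match is None:
        return html_line  # no tagged indentation: leave the line alone
    count = match.group(1).count("&nbsp;")
    return " " * count + html_line[match.end():]


line = "    # Check that it worked."
assert html_to_indent(indent_to_html(line)) == line
```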
For correctly recognizing strings, I'll need some sort of lexer. Yuck. One option is to start with the Python tokenizer; its source code (see link on that page) contains all the necessary regular expressions. Another is to use Pygments, which I'll want for syntax highlighting anyway. I may hack around this for now just to get some working code, then return to fix it. A related but simpler problem is dealing with C /* */ comments.
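
As a quick illustration of the tokenizer option, the standard-library `tokenize` module already tells a comment apart from a comment character inside a string. This sketch handles Python source only (not the C printf case above), and `find_comments` is a name I made up for it.

```python
import io
import tokenize


def find_comments(source: str):
    """Return (line, column, text) for each true comment in Python source."""
    comments = []
    readline = io.StringIO(source).readline
    for tok in tokenize.generate_tokens(readline):
        if tok.type == tokenize.COMMENT:
            comments.append((tok.start[0], tok.start[1], tok.string))
    return comments


example = 'print("# not a comment")  # a real comment\n'
for line, column, text in find_comments(example):
    print(line, column, text)
```

The "#" inside the string literal arrives as part of a STRING token, so only the trailing comment is reported.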

In working with the idea, I'm continually surprised by how much just writing about the problem has helped me to solve it. I believe that this will be a big help for me in all future projects, if I can actually find the time to implement it.

Other things
Do robots take people's jobs? No, they empower people and create jobs, as the arguments in this article show. It included some nice historical perspective and facts (did word processors eliminate secretaries?).

I found a list of the top 25 most dangerous software errors (from a security perspective). Interesting.

Friday, July 8, 2011

After playing with my documentation idea a bit, I discovered a serious problem: what I created looked great in Word, but all the comments existed only in Word, leaving me a bit lost when I looked at the source code itself. This is a problem; for many, the source code will be the first thing they see and the only thing they see. In many applications (fixing compiler errors, debugging) I'll be working with the source code. The moral of the story: the source code matters!

In particular, Knuth composed in a Web file, then produced both a .tex and a .pas (Pascal source file, whatever the extension was). However, neither produced file was editable or even very human-friendly. Instead, I now see that both the "pretty" format (in Word or whatever) and the "plain" format (raw source code) should be nicely formatted and easily readable.

That is, I'm building a bridge between a beautiful representation of the code (probably HTML) and a functional representation of the code (as plain text). The beautiful form makes it easier to edit documentation and comments, insert diagrams, videos, etc., while the functional form provides a tight coupling with the compiler / debugger.

That changes everything in terms of my design.

Before I get too carried away with it, let me test-drive this idea by providing some example code. This is the beautiful form, taken from a unit-testing section of the document.


Testing
I don’t have a unit testing framework. So, I’ll develop what’s necessary as I go. There’s a framework for Excel, but it’s very tied to that application. Problems so far with the home-brew approach:
1. There’s no automatic test discovery; I have to manually add all tests.
2. There’s no setup()/teardown() facility
3. There’s no “clean the environment” comment. For example, strLastError can be polluted by earlier tests.

Source file split testing
A documentation file with no extension should produce an error.
Sub Test_SourceWithNoExtension()

First, create a dummy document to test with.
    Dim docSource As Document
    Set docSource = Documents.Add
    Dim strFileName As String
    strFileName = "Word documentation idea test."
    docSource.SaveAs fileName:=strFileName

Now, do our testing.
    OpenDocFile

Clean up by closing and erasing this old doc. If the test breaks, the developer must close it manually. Time to look for a try/catch in VBA (On Error statement).
    docSource.Close
    Kill strFileName

Check that it worked. Errors are reported as strings, so check for the correct error text.
    Assert (strLastError Like "*Documentation file has no extension:*")
End Sub


Now, here's how I'd like to see this in the functional form (as source code). Since this is VBA, the comment character is the single quote '.
' <h1>Testing</h1>
'
' <p>I don’t have a unit testing framework. So, I’ll
' develop what’s necessary as I go. There’s a framework
' for Excel, but it’s very tied to that application.
' Problems so far with the home-brew approach:</p>
' <ul>
'   <li>There’s no automatic test discovery; I have to
'     manually add all tests.</li>
'   <li>There’s no setup()/teardown() facility</li>
'   <li>There’s no “clean the environment” comment.
'     For example, strLastError can be polluted by
'     earlier tests.</li>
' </ul>
'
' <h2>Source file split testing</h2>
' <p>A documentation file with no extension should
' produce an error.</p>
Sub Test_SourceWithNoExtension()

    ' <p>First, create a dummy document to test with.</p>
    Dim docSource As Document
    Set docSource = Documents.Add
    Dim strFileName As String
    strFileName = "Word documentation idea test."
    docSource.SaveAs fileName:=strFileName

    ' <p>Now, do our testing.</p>
    OpenDocFile

    ' <p>Clean up by closing and erasing this old doc.
    ' If the test breaks, the developer must close it
    ' manually. Time to look for a try/catch in VBA
    ' (On Error statement)</p>
    docSource.Close
    Kill strFileName

    ' <p>Check that it worked. Errors are reported as strings, so check for the correct error text.</p>
    Assert (strLastError Like "*Documentation file has no extension:*")
End Sub

It's interesting that, to me, reading the first is much easier than reading the second. Not because of the HTML markup, but because a simple difference in font provides visual cues to divide the code nicely. It feels good to me to read the first! This is certainly what I'm striving for.
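
To make the bridge concrete, here is a minimal sketch of the first step in going from the functional form to the beautiful form: grouping consecutive lines into comment blocks and code blocks. It assumes whole-line comments only and does no string handling, so a comment character inside a string or after code on the same line is not detected. `COMMENT_CHAR`, `split_blocks`, and the helper names are mine.

```python
from itertools import groupby

COMMENT_CHAR = "'"  # VBA; would be "#" for Python or "//" for C


def is_comment(line: str) -> bool:
    """True when the line holds nothing but a comment."""
    return line.lstrip().startswith(COMMENT_CHAR)


def _strip_comment(line: str) -> str:
    """Drop the indentation, the comment character, and one space after it."""
    body = line.lstrip()[len(COMMENT_CHAR):]
    return body[1:] if body.startswith(" ") else body


def split_blocks(source: str):
    """Group consecutive lines into ('comment', text) and ('code', text)."""
    blocks = []
    for comment, lines in groupby(source.splitlines(), key=is_comment):
        lines = list(lines)
        if comment:
            text = "\n".join(_strip_comment(line) for line in lines)
            blocks.append(("comment", text))
        else:
            blocks.append(("code", "\n".join(lines)))
    return blocks


sample = (
    "    ' <p>Now, do our testing.</p>\n"
    "    OpenDocFile\n"
)
for kind, text in split_blocks(sample):
    print(kind, repr(text))
```

Each comment block could then be emitted as HTML in the comment font and each code block in the code font; blank lines are treated as code here.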

I haven't found a reasonably-featured word processor that reads and writes HTML, though. Word includes lots of goop, but has all the features I want. I need to try OpenOffice and also Composer to see if they're reasonable. While I like several of the browser-based editors (Google Sites / Docs is great), the "allow no access to local files" paradigm seems to prevent their use in editing local files.

Friday, July 1, 2011

Documentation ideas

The ideal for documentation creation

I find that the type of documentation I want to write isn't well supported by the tools I've found. In particular, I typically like to write documentation at three levels. First, there should be a high-level overview, defining the overall purpose and ideas behind a module of code. This should be followed by a detailed description of every element in a source file. Finally, I'd like to provide a line-by-line commentary of each function, commenting on the particulars of its implementation. This documentation should, as necessary, include equations, diagrams, images, flow charts, videos, etc. Because code changes frequently, all code snippets or references to names within the code should be easily refreshed by applying a tool.

I see variants of this approach in use for several types of documentation. I typically write code for other programmers, rather than end users. Therefore, my "users" will be fellow programmers desiring to make use of a module I've created. For these users, a high-level explanation of a given header file, followed by a description of each element of the header, provides all the information they need to make use of the module. For fellow developers, I'd like to present the same high-level overview, this time focusing on the algorithms used to implement elements declared in the header. This information naturally belongs with the source file the header accompanies. Next, a per-element detailed description of the source file might also be accompanied by a line-by-line analysis of some of the subtle portions of the code. (On a side note, I'd like to use this to develop an updated version of the PIC24 book I co-authored).

My dream implementation would be seamless: a fully-featured word-processing program in which I can type code or in-line documentation, including snippets of code in explanatory sections as necessary. All documentation would be transparently encoded in the raw source file. No such tool exists, to the best of my knowledge.

Existing tools

There's nothing new under the sun, including this idea. Its best-known formulation, Literate Programming by none other than Donald Knuth, "regards a program as a communication to human beings rather than as a set of instructions to a computer. Your program is also viewed as a hypertext document, rather like the World Wide Web." (from an associated site). While WEB (Knuth's tool) operates on Pascal to produce TeX documents, a more modern version (CWEB) applies the same process to C. The literate programming site provides additional information on these ideas; there are several other notable implementations (FunnelWeb, Noweb). The practical result (here's a sample of some code) is that a program is written in CWEB syntax (mixed C and TeX), then transformed to either C or TeX, making it painful (IMHO) to either write documentation or develop code!

An opposite approach is to embed documentation into the source code, simplifying the build process but still requiring a translation step to produce documentation. Doxygen (along with variants such as JavaDoc), my favorite documentation tool which I've used for several years, excels at extracting documentation from code and producing a polished, nicely cross-referenced result -- the middle level (describing each element) of my documentation hierarchy. However, it contains several major flaws, IMHO:
  1. There's no way to directly edit the resulting documentation. I often find a typo or other small correction when browsing through the documentation, which then requires that I dig up the corresponding source, edit it, recompile the docs, and check. This discourages quick fixes.
  2. Writing high-level documentation is painful; editing text then compiling reminds me of all the evils of LaTeX without any of the helpfulness of word wrapping, TexWorks docs-to-source synchronization, or quick compilation.
  3. There's no way to write line-by-line commentary for a detailed look at an algorithm.
  4. Including non-textual media is painful.
  5. Trying to fix syntax errors in the source code documentation tags is painful.
Recently, Python adopted Sphinx and reStructuredText to produce its documentation, which is very impressive. It seems a step back from Doxygen, since there's no automatic linking to source code, while suffering from all its liabilities. The same is true of other alternatives I've found, such as antiweb.

Proposed solution

So, I'd like to create yet another documentation tool, in the (most likely vain) hope it will have some impact. My ideas:

  1. I'd like to be able to open some source code in a modern, fully-featured word processor, add documentation (images, diagrams, etc.), then save the result (including any changes I made to the code) back to both the source file and its accompanying documentation file.
  2. The program should support documenting only selected portions of the code; for example, I'd typically omit a copyright notice appearing at the top of every file. It should allow adding comments to arbitrary snippets of code, rather than just at the API level (Doxygen's strength), and placing multiple copies of these snippets in arbitrary order within the code.
  3. All snippets should be auto-refreshable by reflecting any changes made to the source code. They should follow any source code changes such as moving code around, changing names, etc.
After pondering how I can implement this in as simple a fashion as possible, I've converged on the following design:
  1. Label the start of a snippet with a tag marked by rarely-used delimiters, such as &|tag|&.
  2. Auto-generate these tags when the documentation file is edited then saved.
  3. Auto-refresh all snippets when the documentation file is opened by matching source code tagged snippets with their tagged snippets in the documentation file.
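
The tagging scheme can be sketched in a few lines. This is an illustration in Python, not the VBA code I'm writing for Word, and it assumes details the design above doesn't settle: each tag sits on a line of its own (typically inside a comment), and a snippet runs until the next tag or the end of the file. `split_snippets` and `refresh` are placeholder names.

```python
import re

# A tag looks like &|name|& ; the delimiters come from the design above.
TAG_RE = re.compile(r"&\|(.+?)\|&")


def split_snippets(source: str) -> dict:
    """Map each tag name to the source lines that follow its tag line."""
    snippets = {}
    current = None
    for line in source.splitlines():
        match = TAG_RE.search(line)
        if match:
            current = match.group(1)
            snippets[current] = []
        elif current is not None:
            snippets[current].append(line)
    return {name: "\n".join(lines) for name, lines in snippets.items()}


def refresh(doc_snippets: dict, source: str) -> dict:
    """Replace each documented snippet with its current text from the source.

    Snippets whose tag no longer exists in the source keep their old text.
    """
    fresh = split_snippets(source)
    return {name: fresh.get(name, old) for name, old in doc_snippets.items()}


source = (
    "' &|setup|&\n"
    "Dim x As Integer\n"
    "x = 1\n"
    "' &|work|&\n"
    "x = x + 1\n"
)
print(split_snippets(source))
```

Keying snippets by tag name rather than by position is what lets them follow code that moves around in the file.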
I've chosen Microsoft Word as a word processor and begun writing code in VBA (Visual Basic for Applications), Word's macro language. I've found little good documentation on the language; the built-in help is poor, MSDN is lacking in many areas, and even searching the web produces mediocre results. I may purchase a book to help. I haven't found a good unit-testing framework for Word; a framework for Excel seems fairly tied to that platform.

So far, I've written code that divides a source file into named snippets; not bad progress, but there's much more to do. I should probably next:
  1. Write unit tests, which I should have done first.
  2. Create a good, high-level document to describe all this in more detail.

Visual Basic for Applications

The ideal for documentation creation

I find that the type of documentation I want to write isn't well supported by the tools I've found. In particular, I typically like to write documentation at three levels. First, there should be a high-level overview, defining the overall purpose and ideas behind a module of code. This should be followed by a detailed description of every element in a source file. Finally, I'd like to provide a line-by-line commentary of each function, commenting on the particulars of its implementation. This documentation should, as necessary, include equations, diagrams, images, flow charts, videos, etc. Because code changes frequently, all code snippets or references to names within the code should be easily refreshed by applying a tool.

I see variants of this approach in use for several types of documentation. I typically write code for other programmers, rather than end users. Therefore, my "users" will be fellow programmers desiring to make use of a module I've created. For these users, a high-level explanation of a given header file, followed by a description of each element of the header, provides all the information they need to make use of the module. For fellow developers, I'd like to present the same high-level overview, this time focusing on the algorithms used to implement elements declared in the header. This information naturally belongs with the source file the header accompanies. Next, a per-element detailed description of the source file might also be accompanied by a line-by-line analysis of some of the subtle portions of the code. (On a side note, I'd like to use this to develop an updated version of the PIC24 book I co-authored).

My dream implementation would be seamless: a fully-featured word-processing program in which I can type code or in-line documentation, including snippets of code in explanatory sections as necessary. All documentation would be transparently encoded in the raw source file. No such tool exists, to the best of my knowledge.

Existing tools

There's nothing new under the sun, including this idea. Its best-known formulation, Literate Programming by none other than Donald Knuth, "regards a program as a communication to human beings rather than as a set of instructions to a computer. Your program is also viewed as a hypertext document, rather like the World Wide Web." (from an associated site). While WEB (Knuth's tool) operates on Pascal to produce TeX documents, a more modern version (CWEB) applies the same process to C. The literate programming site provides additional information on these ideas; there are several other notable implementations (FunnelWeb, Noweb). The practical result (here's a sample of some code) is that a program is written in CWEB syntax (mixed C and TeX), then transformed to either C or TeX, making it painful (IMHO) to either write documentation or develop code! While other tools (



My favorite documentation tool, which I've used for several years, is Doxygen. It excels at extracting documentation from code and producing a polished, nicely cross-referenced result -- the middle level (describing each element) of my documentation hierarchy. However, it contains several major flaws, IMHO:

  1. There's no way to directly edit the resulting documentation. I often find a typo or other small correction when browsing through the documentation, which then requires that I dig up the corresponding source, edit it, recompile the docs, and check. This discourages quick fixes.
  2. Writing high-level documentation is painful; editing text then compiling reminds me of all the evils of LaTeX without any of the helpfulness of word wrapping, TexWorks docs-to-source synchronization, or quick compilation.
  3. There's no way to write line-by-line commentary for a detailed look at an algorithm.
  4. Including non-textual media is painful.