Shawn Burke from the Windows Forms team posted a few days ago about the possibility of Microsoft deciding to ship the Windows Forms 2.0 source code (thanks for the link Joseph). I’m really excited by the possibility of being able to look at the code, both to support debugging and from an educational point of view. I’m most excited about getting the opportunity to look at some of the designer code that is baked into the framework. I wonder if the released source code would include snippets from System.Drawing.dll, System.Drawing.Design.dll and System.Design.dll as well.

But (there is always a but), Shawn needs to get the idea past LCA, and that means a discussion not only about the code that ends up being compiled into the framework but also about the comments which exist in the source code.

When developing software we sometimes get frustrated with ourselves and others, and this might lead us to leave a few choice comments in the source code. If this code is indexable via Google or freely downloadable, you might want to clean it up a little bit.

In general there are three approaches for cleaning up comments:

  • Strip all the comments automatically.
  • Strip inappropriate comments manually.
  • Strip inappropriate comments automatically with a quick manual check.

Stripping out all non-executable comments is fairly trivial, especially when comments consume the entire line inside the source file. If they don’t, you need to do some fairly basic source parsing to figure out whether a section of text is a comment or not.
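To give a feel for how little is involved in the easy case, here is a minimal sketch in Python. It assumes C#-style `//` comments and only handles lines that are nothing but a comment; the function name `strip_whole_line_comments` is made up for illustration.

```python
def strip_whole_line_comments(source: str) -> str:
    """Drop every line whose only content is a // comment.

    Trailing comments (code followed by //) are left alone, since
    recognising them safely needs real parsing.
    """
    kept = []
    for line in source.splitlines(keepends=True):
        if line.lstrip().startswith("//"):
            continue  # the whole line is a comment, so discard it
        kept.append(line)
    return "".join(kept)


example = "// hack\nint x = 1; // keep\n  // gone\nreturn x;\n"
cleaned = strip_whole_line_comments(example)
```

This is deliberately naive: a line inside a block comment or a multi-line string that happens to start with `//` would be misjudged, which is exactly where the basic source parsing comes in.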

The manual approach really isn’t going to work. It would take forever to scrub the 500,000 lines of code, especially when you extend the search beyond basic profanity to off-colour jokes and customer references. You also need to consider the names of variables, private fields and private methods, which may not be appropriate.

The way I would do it is to come up with a quick tool (“codecop”, anybody?) which reads in each source file in its entirety and does a rule-based search across code and comments for anything off colour.

A dumb version of codecop could do only two things: warn by inclusion or warn by exclusion. That is, it would either look for things that are in some kind of list, or look for things that aren’t in its list. If it were possible to get a parse tree of the source file, a more intelligent search could be done.
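The two dumb modes could be sketched roughly like this in Python. Everything here is hypothetical (the function names, the word lists, the idea of treating any run of letters as a word); it is only meant to show how the two checks mirror each other.

```python
import re

# Treat any run of letters as a "word"; digits and punctuation are ignored.
WORD = re.compile(r"[A-Za-z]+")


def warn_by_inclusion(text, blocklist):
    """Return (line_number, word) for every word that IS in the blocklist."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for word in WORD.findall(line):
            if word.lower() in blocklist:
                hits.append((number, word))
    return hits


def warn_by_exclusion(text, allowlist):
    """Return (line_number, word) for every word that is NOT in the allowlist."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for word in WORD.findall(line):
            if word.lower() not in allowlist:
                hits.append((number, word))
    return hits


sample = "int total = 0; // stupid hack\n"
included = warn_by_inclusion(sample, {"stupid"})
excluded = warn_by_exclusion(sample, {"int", "total", "hack"})
```

Warning by exclusion is much noisier, since every identifier the list has never seen gets flagged, but it can catch things nobody thought to put on a blocklist. Neither mode splits identifiers such as `myStupidCounter` into words; that is the sort of thing a parse tree (or at least a smarter tokenizer) would help with.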

In the case of non-executable comments, the search tool could automatically remove them if they aren’t appropriate. Instead of just deleting them, it would need to replace them with whitespace (the whole region of comments, not just the offending word). This would allow the code sweep to be done after the compile and still keep the PDB source file references in line, so executed code could be highlighted in the debugger.
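A rough sketch of the whitespace replacement, again in Python. It assumes C#-style `//` and `/* */` comments and ordinary quoted literals; verbatim `@"..."` strings are not handled, and `blank_comments` and the `is_inappropriate` callback are names invented for the example.

```python
import re

# Literals are matched first so that comment markers inside them are skipped.
TOKEN = re.compile(
    r'"(?:\\.|[^"\\\n])*"'     # regular string literal
    r"|'(?:\\.|[^'\\\n])*'"    # character literal
    r"|//[^\r\n]*"             # line comment
    r"|/\*.*?\*/",             # block comment (may span lines)
    re.DOTALL,
)


def blank_comments(source, is_inappropriate):
    """Overwrite each flagged comment with spaces, keeping line breaks.

    Every character keeps its original line and column, so debugger
    line information built from the original file should still match.
    """
    def replace(match):
        text = match.group(0)
        if not text.startswith(("//", "/*")):
            return text  # a string or char literal, leave untouched
        if not is_inappropriate(text):
            return text  # an acceptable comment, leave untouched
        return re.sub(r"[^\r\n]", " ", text)

    return TOKEN.sub(replace, source)


src = (
    'string s = "// not a comment"; // this is dumb\n'
    "int x = 1; /* dumb\n"
    "idea */ int y = 2;\n"
)
out = blank_comments(src, lambda comment: "dumb" in comment)
```

The callback receives the whole comment, so the entire region is blanked rather than just the offending word, and the output is the same length as the input with the same number of lines.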

Executable code that couldn’t be cleaned automatically could be reported back, either to the bug tracking database at Microsoft (the notification system could be configured to log directly into it) or to some unfortunate contractor who is given the job of vetting the code manually. Given that this could be an ongoing process, you’d probably want to patch the code up so you didn’t have to keep correcting the same problems.

It’s a great idea that presents some interesting problems, and I doubt that Microsoft is the first to try to tackle them.