Remove whitespace from HTML

Why

A typical web page contains a lot of whitespace. We add it in the form of tabs and line breaks to make the document easier to read and maintain. The price is a heavier document in kilobytes and a longer render time in the browser. This is especially true for IE6, which renders whitespace much more slowly than other browsers.

By removing the extra whitespace from the HTML you can save up to ~25% of the total page weight in kilobytes and make rendering in the browser faster. It's a double win.

How

There are several ways to remove whitespace from HTML. You can override the Render method in the Page or MasterPage class, or you can use an HttpModule. The HttpModule is preferable because it decouples the whitespace removal from the rest of your website logic. Either way, these are the steps needed to remove whitespace:

  1. Retrieve the HTML before sending it to the browser
  2. Remove the whitespace
  3. Send the new whitespace-free HTML to the browser

1. Retrieve the HTML

As stated earlier, this can be done either in the Render method of a Page or MasterPage or in an HttpModule. If you are using ASP.NET MVC, there is no Render method to override, so you need to remove the whitespace in an HttpModule.

The Render method:

protected override void Render(HtmlTextWriter writer)
{
  using (HtmlTextWriter htmlwriter = new HtmlTextWriter(new System.IO.StringWriter()))
  {
    base.Render(htmlwriter);
    string html = htmlwriter.InnerWriter.ToString();

    // Trim the whitespace from the 'html' variable

    writer.Write(html);
  }
}

The HttpModule:

To retrieve the HTML from the response stream you need to add a filter to the response. This is a bit more complex and requires more code than overriding the Render method, but it is the recommended approach. The best way to understand how it works is to look at the RemoveWhitespaceModule and RemoveWhitespaceFilterStream classes in the WebOptimizer source.
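As a rough illustration of the filter idea, here is a minimal sketch of a stream that buffers everything written to it and rewrites the content once when it is closed. It is not the WebOptimizer's RemoveWhitespaceFilterStream; the class name BufferingFilterStream and its constructor parameters are hypothetical, and the sketch assumes the host closes the filter once the response is complete.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical sketch, not the WebOptimizer implementation.
// Buffers all response bytes and transforms them once, on Close().
public sealed class BufferingFilterStream : Stream
{
    private readonly Stream _inner;                    // the stream the response originally wrote to
    private readonly Encoding _encoding;               // encoding of the response body
    private readonly Func<string, string> _transform;  // e.g. a whitespace remover
    private readonly MemoryStream _buffer = new MemoryStream();
    private bool _closed;

    public BufferingFilterStream(Stream inner, Encoding encoding, Func<string, string> transform)
    {
        _inner = inner;
        _encoding = encoding;
        _transform = transform;
    }

    // Collect the bytes instead of passing them straight through.
    public override void Write(byte[] buffer, int offset, int count)
    {
        _buffer.Write(buffer, offset, count);
    }

    // Nothing is forwarded until the whole document is available,
    // so intermediate flushes are deliberately ignored.
    public override void Flush() { }

    public override void Close()
    {
        if (!_closed)
        {
            _closed = true;
            // Bytes -> string -> transformed string -> bytes.
            string html = _encoding.GetString(_buffer.ToArray());
            byte[] output = _encoding.GetBytes(_transform(html));
            _inner.Write(output, 0, output.Length);
            _inner.Flush();
            _inner.Close();
        }
        base.Close();
    }

    // This is a write-only stream; the remaining members are not supported.
    public override bool CanRead { get { return false; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return true; } }
    public override long Length { get { throw new NotSupportedException(); } }
    public override long Position
    {
        get { throw new NotSupportedException(); }
        set { throw new NotSupportedException(); }
    }
    public override int Read(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
}
```

In an HttpModule, a stream like this would be attached by assigning it to the Filter property of the HttpResponse, passing in the previous Filter value and the response's ContentEncoding, ideally only for responses whose content type is text/html. Buffering the whole document keeps the regular expressions from seeing a tag split across two writes, at the cost of holding the page in memory.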

2. Remove the whitespace

Here we need to use two regular expressions to identify the areas of whitespace that are safe to remove.

private static readonly Regex RegexBetweenTags = new Regex(@">(?! )\s+", RegexOptions.Compiled);

private static readonly Regex RegexLineBreaks = new Regex(@"([\n\s])+?(?<= {2,})<", RegexOptions.Compiled);

This method applies the two regular expressions to a string of HTML and removes the whitespace they match.

public static string RemoveWhitespaceFromHtml(string html)
{
  html = RegexBetweenTags.Replace(html, ">");
  html = RegexLineBreaks.Replace(html, "<");

  return html.Trim();
}

This method is designed to be safe for removing whitespace from any HTML fragment or document. It only targets whitespace around tags; it does not minify the contents of inline styles or scripts. For trimming inline styles and scripts, please look at Minify stylesheets and Minify JavaScript.
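To see what the two expressions actually match, the following small console program runs the method on three sample strings. The regular expressions and the method are copied from the text; the wrapper class HtmlWhitespace, the Main method and the sample inputs are only there for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlWhitespace
{
    // The two expressions from the text, unchanged.
    private static readonly Regex RegexBetweenTags = new Regex(@">(?! )\s+", RegexOptions.Compiled);
    private static readonly Regex RegexLineBreaks = new Regex(@"([\n\s])+?(?<= {2,})<", RegexOptions.Compiled);

    public static string RemoveWhitespaceFromHtml(string html)
    {
        html = RegexBetweenTags.Replace(html, ">");
        html = RegexLineBreaks.Replace(html, "<");
        return html.Trim();
    }

    public static void Main()
    {
        // Line breaks and indentation after a closing '>' are removed.
        Console.WriteLine(RemoveWhitespaceFromHtml("<ul>\n  <li>One</li>\n  <li>Two</li>\n</ul>\n"));
        // <ul><li>One</li><li>Two</li></ul>

        // A single space between inline elements is left alone.
        Console.WriteLine(RemoveWhitespaceFromHtml("<b>bold</b> <i>italic</i>"));
        // <b>bold</b> <i>italic</i>

        // Whitespace ending in two or more spaces before a '<' is removed.
        Console.WriteLine(RemoveWhitespaceFromHtml("Hello\n    <b>world</b>"));
        // Hello<b>world</b>
    }
}
```

The second case shows why the first expression refuses to match when a space directly follows the tag: that space is often significant between inline elements.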

3. Send whitespace-free HTML to the browser

If you chose to override the Render method, you can see from the code above that this is already taken care of. All you need to do is pass the html variable to the RemoveWhitespaceFromHtml method. For the HttpModule, you need to convert the HTML back to a byte array and write it to the output stream filter. Again, see the code for the RemoveWhitespaceFilterStream for a demonstration of how it is done.

Things to notice

Intercepting the HTML output and manipulating it as a string means more work for the CPU. This technique, like most other performance techniques, is a tradeoff between bandwidth and rendering speed on one side and CPU load on the other. The larger the HTML document, the more processing the string manipulation requires. The WebOptimizer does not process documents larger than ~80 KB because of the way the .NET garbage collector works (large buffers are allocated on the large object heap, which is more expensive to collect). Most HTML documents are a lot smaller, but if some pages on a site are very large, the WebOptimizer simply ignores them so that garbage collection is not negatively affected.

In the case of removing whitespace, the extra CPU load is extremely low.
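A size limit of this kind can be sketched as a simple pass-through check. This is not the WebOptimizer's code; the SizeGuard class, its Process method and the exact threshold are assumptions derived from the ~80 KB figure mentioned above.

```csharp
using System;
using System.Text;

public static class SizeGuard
{
    // Assumed threshold, taken from the ~80 KB figure in the text.
    public const int MaxBytes = 80 * 1024;

    // Returns the body unchanged when it is larger than maxBytes;
    // otherwise decodes it, applies the transform and re-encodes it.
    public static byte[] Process(byte[] body, Encoding encoding,
                                 Func<string, string> transform, int maxBytes = MaxBytes)
    {
        if (body.Length > maxBytes)
        {
            return body; // too large: pass through untouched
        }

        string html = encoding.GetString(body);
        return encoding.GetBytes(transform(html));
    }
}
```

Checking the byte count before decoding avoids creating a large string for the very documents the limit is meant to skip.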

The WebOptimizer

By using the WebOptimizer you get the RemoveWhitespaceModule, which performs the whitespace removal automatically for any ASP.NET application. All you need to do is add WebOptimizer.dll to your Bin folder and register the module in the web.config like so:

<add name="RemoveWhitespaceModule" type="WebOptimizer.Modules.RemoveWhitespaceModule, WebOptimizer" />

It works on both IIS 6 and 7 in either Classic Mode or Integrated Pipeline Mode.
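The text does not show where in web.config the registration line belongs. Assuming the standard ASP.NET configuration layout, a sketch covering both pipeline modes could look like this; the surrounding elements are not taken from the WebOptimizer documentation.

```xml
<configuration>
  <!-- IIS 6, and IIS 7 in Classic Mode -->
  <system.web>
    <httpModules>
      <add name="RemoveWhitespaceModule" type="WebOptimizer.Modules.RemoveWhitespaceModule, WebOptimizer" />
    </httpModules>
  </system.web>

  <!-- IIS 7 in Integrated Pipeline Mode -->
  <system.webServer>
    <!-- Lets both registrations coexist without a configuration validation error -->
    <validation validateIntegratedModeConfiguration="false" />
    <modules>
      <add name="RemoveWhitespaceModule" type="WebOptimizer.Modules.RemoveWhitespaceModule, WebOptimizer" />
    </modules>
  </system.webServer>
</configuration>
```

If the site only ever runs in one mode, the section for the other mode can be left out.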