django_auxilium.utils.html module

class django_auxilium.utils.html.TextExtractorHTMLParser

Bases: HTMLParser.HTMLParser

Custom HTML parser which extracts only the text while parsing HTML.

Once the parser has parsed the HTML, get_text() can be called to return the extracted text.

get_text()

Get extracted text after HTML is parsed.

Returns: Extracted text from the HTML document
Return type: str

handle_charref(number)

Handler for processing character references.

This method handles both decimal (e.g. '&#62;' == '>') and hexadecimal (e.g. '&#x3E;' == '>') references. It does so by converting the reference number to an integer with the appropriate base and then converting that integer to a character.
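The conversion described above can be sketched as a standalone function. This is a minimal Python 3 illustration, not the library's actual code; the name charref_to_char is hypothetical, and it assumes the argument is the text between '&#' and ';' (e.g. '62' or 'x3E'), which is what the standard library parser passes to handle_charref.

```python
def charref_to_char(number):
    """Convert the body of a numeric character reference to a character.

    Hypothetical helper for illustration only. ``number`` is the text
    between '&#' and ';', for example '62' (decimal) or 'x3E' (hex).
    """
    if number[:1] in ('x', 'X'):
        # Hexadecimal reference: drop the leading 'x' and parse base 16.
        codepoint = int(number[1:], 16)
    else:
        # Decimal reference: parse base 10.
        codepoint = int(number, 10)
    return chr(codepoint)


print(charref_to_char('62'))   # decimal form of '>'
print(charref_to_char('x3E'))  # hexadecimal form of '>'
```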

handle_data(d)

Handler for data/text in HTML.

This simply appends the data to the list of extracted text that this class maintains.

handle_entityref(name)

Handler for processing entity references.

This method handles processing of HTML entities (e.g. '&gt;'). It first maps the entity name to a codepoint, which is a Unicode character number, and then converts that number to a Unicode character.
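The name-to-codepoint lookup can be sketched with the standard library's entity table. This is a minimal Python 3 illustration under assumptions: the helper name entityref_to_char is hypothetical, and the table used here is html.entities.name2codepoint (in Python 2 the same table lives in the htmlentitydefs module). Whether the library uses exactly this table is not confirmed by the text above.

```python
from html.entities import name2codepoint


def entityref_to_char(name):
    """Convert an entity name such as 'gt' to its character.

    Hypothetical helper for illustration only. ``name`` is the text
    between '&' and ';'. Unknown names raise KeyError in this sketch.
    """
    codepoint = name2codepoint[name]  # e.g. 'gt' maps to 62
    return chr(codepoint)


print(entityref_to_char('gt'))
print(entityref_to_char('amp'))
```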

django_auxilium.utils.html.html_to_text(html)

Function to convert HTML text to plain text by stripping all HTML.

Implementation is based on http://stackoverflow.com/questions/753052/strip-html-from-strings-in-python.
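For readers on Python 3, a comparable text extractor can be sketched on top of html.parser.HTMLParser. This is an assumption-laden illustration rather than the library's implementation: the names SketchTextExtractor and sketch_html_to_text are hypothetical, and the sketch relies on the Python 3 default convert_charrefs=True, which makes the parser decode character and entity references itself and deliver them through handle_data.

```python
from html.parser import HTMLParser


class SketchTextExtractor(HTMLParser):
    """Hypothetical text-only parser, for illustration."""

    def __init__(self):
        # convert_charrefs=True is the Python 3 default; it is spelled out
        # here because the sketch depends on references being decoded
        # before they reach handle_data.
        super().__init__(convert_charrefs=True)
        self.result = []

    def handle_data(self, d):
        # Collect every piece of text found between tags.
        self.result.append(d)

    def get_text(self):
        return ''.join(self.result)


def sketch_html_to_text(html):
    parser = SketchTextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.get_text()


print(sketch_html_to_text('<p>Hello <b>world</b> &amp; 1 &gt; 0</p>'))
```

Tags are dropped because only handle_data records anything; the tag handlers inherited from HTMLParser do nothing.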

django_auxilium.utils.html.simple_minify(string)

Minify HTML with a very simple algorithm.

This function tries to minify HTML by stripping most whitespace between HTML tags (e.g. </div>    <div> -> </div> <div>). Note that not all whitespace is removed, since doing so can sometimes alter the rendered HTML (e.g. <strong>Hello</strong> <i></i>). In addition, this function replaces all redundant whitespace (more than two consecutive whitespace characters or a newline) with a single space character, except inside excluded tags such as pre or textarea.

Thought process:

Minifying everything except the content of excluded tags in one step requires a very complex regular expression. The disadvantage is that such a regular expression involves look-behinds and look-aheads. Those operations make the regex much more resource-hungry, which eats precious server resources. In addition, complex regexes are hard to understand and can be hard to maintain. That is why this function splits the task into multiple steps.

  1. A regular expression which matches all excluded tags within the HTML is used to split the HTML into components. Since the regular expression is wrapped inside a group, the content of the excluded tags is also included in the resulting split list. Because of that, it is guaranteed that every odd-indexed element (if there are any) is an excluded tag together with its content.
  2. All the split components are looped over and processed in order to construct the final minified HTML.
  3. Odd-indexed elements are not processed and are simply appended to the final HTML since, as explained above, they are guaranteed to be the content of excluded tags and hence do not require minification.
  4. Even-indexed elements are minified by stripping whitespace between tags, and any remaining redundant whitespace is collapsed via a simple regex.
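The four steps above can be sketched as follows. This is a minimal illustration, not the library's actual code: the name sketch_minify, the EXCLUDED_TAGS tuple, and the exact regular expressions are all assumptions chosen to match the description.

```python
import re

# Assumption: the real list of excluded tags may differ.
EXCLUDED_TAGS = ('pre', 'textarea')

# The whole pattern is one capturing group, so re.split() keeps each match
# in the result. Inner groups are non-capturing, which is what guarantees
# that matches land on odd indexes. The sketch does not check that the
# opening and closing tag names agree.
EXCLUDED_RE = re.compile(
    r'(<(?:{tags})\b[^>]*>.*?</(?:{tags})>)'.format(tags='|'.join(EXCLUDED_TAGS)),
    re.DOTALL | re.IGNORECASE,
)


def sketch_minify(html):
    # Step 1: split into [text, excluded, text, excluded, ...].
    components = EXCLUDED_RE.split(html)
    result = []
    # Step 2: loop over the components in order.
    for index, component in enumerate(components):
        if index % 2 == 1:
            # Step 3: excluded tag with its content, appended untouched.
            result.append(component)
        else:
            # Step 4: collapse whitespace between tags to one space,
            # then collapse remaining whitespace runs and newlines.
            component = re.sub(r'>\s+<', '> <', component)
            component = re.sub(r'\s{2,}|\n', ' ', component)
            result.append(component)
    return ''.join(result)


print(sketch_minify('</div>    <div>'))
print(sketch_minify('<p>a   b</p>\n<pre>  x\n  y  </pre>'))
```

Because re.split() always returns a non-excluded component first (possibly an empty string), the even/odd distinction holds even when the input starts with an excluded tag.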

Note that the process does not involve parsing the HTML, since parsing usually adds some overhead (e.g. when using Beautiful Soup). By using two regex passes, this function achieves a very similar result while performing much better.