Tuesday, August 31, 2004
A little rant about Microsoft Internet Explorer's color parsing
A warning to all readers: things get strange and very geeky from here on in. Enjoy.
As my profile says, I am one of POPFile's developers. The POPFile team is committed to finding new spammer "tricks" and making sure POPFile is capable of decoding them. John Graham-Cumming, author and lead developer of POPFile, also maintains the spammer's compendium, a catalogue of such spammer tricks.
The newest trick in the compendium is "Flex Hex", which was reported to John in July. POPFile's CVS code learned how to handle this trick shortly thereafter.
The essence of Flex Hex is that IE is very flexible in how it will interpret hexadecimal RGB values in any HTML attribute (I'm not sure about CSS) that expects color data. John sums it up well in the spammer's compendium:
Missing digits are treated as 0[...]. An incorrect digit is simply interpreted as 0. For example the values #F0F0F0, F0F0F0, F0F0F, #FxFxFx and FxFxFx are all the same.
Though the above generalization would have been good enough in 99% of cases, we found some cases where IE deviated from the fairly simple approach of zero-padding the field and zeroing invalid hex characters.
When color strings longer than 8 characters or shorter than 4 characters are used, things start to get strange. Things getting strange, particularly where undocumented, is always to a spammer's advantage. I will lay out here what IE does with unusual, unpredictable, or invalid color data.
If email filtering software isn't aware of how common HTML-enabled email readers will display HTML, malformed or otherwise, it becomes much easier for spammers to hide text within emails in a way that may fool statistical filters or otherwise evade filters.
As an interesting note, IE does this unusual parsing regardless of the doctype declaration, ignoring "standards mode". Mozilla performs similar parsing, differing only in how long strings are handled. However, in standards mode, invalid color notation is completely ignored by Mozilla and the default or parent color is allowed to set the color of the element.
The iframe below contains a slight variation on the DHTML page I used while determining how IE parses colors. I have gone out of my way to make it cross-platform, so other browsers can be tested with it. The two fields can be used to set the foreground and background colors of some text, and then the DOM of the page is sniffed to display the colors, as interpreted by the browser.
Throughout this explanation I will use a notation similar to CSS's RGB( RR, GG, BB) syntax to show how a value is split into red, green, and blue components. This isn't correct CSS RGB() syntax, but I am using it for clarity.
IE's non-CSS color parsing algorithm appears to behave as follows, in order to get to a 6 digit hexadecimal value from any string:
These steps may not be performed in the same order or using exactly the same criteria as IE, but the end result is identical as far as I can tell.
First, remove any hash-marks, then replace every character that is not a hexadecimal digit (0-9, a-f) with a 0.
Eg: #zqbttv becomes 00b000.
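As a minimal sketch of this first step (the function name sanitize is made up for illustration, and it assumes hash-marks are dropped wherever they appear in the string and that upper- and lower-case hex digits are both accepted):

```python
import string

HEX_DIGITS = set(string.hexdigits)  # 0-9, a-f and A-F


def sanitize(color: str) -> str:
    """Drop hash-marks, then replace every non-hex character with '0'."""
    stripped = color.replace("#", "")
    return "".join(ch if ch in HEX_DIGITS else "0" for ch in stripped)


print(sanitize("#zqbttv"))  # 00b000
print(sanitize("FxFxFx"))   # F0F0F0
```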
For lengths 1-2, right pad to 3 characters with 0's.
Eg: "0F" becomes "0F0", "F" becomes "F00".
For length 3, take each digit as a value for red, green, or blue, and prepend a 0 to that value.
Eg: "0F0" becomes RGB( 0, F, 0), which becomes RGB( 00, 0F, 00) or 000F00.
Any value shorter than 4 digits long is done at this point.
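The short-string rules above could be sketched like this. The helper name expand_short is hypothetical, and it assumes the input has already had hash-marks removed and invalid characters zeroed:

```python
def expand_short(digits: str) -> str:
    """Expand a cleaned string of 1 to 3 hex digits to 6 digits."""
    if not 1 <= len(digits) <= 3:
        raise ValueError("expected 1 to 3 hex digits")
    padded = digits.ljust(3, "0")               # "F" -> "F00", "0F" -> "0F0"
    return "".join("0" + ch for ch in padded)   # one digit per channel, 0 in front


print(expand_short("0F0"))  # 000F00
print(expand_short("F"))    # 0F0000
```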
For lengths 4 and longer, the field is right-padded with 0's to the next full multiple of 3. This step is important for longer fields.
Eg: "0F0F" becomes "0F0F00", "0F0F0F0" becomes "0F0F0F000" and "00FF00FF00FF00FF" becomes "00FF00FF00FF00FF00"
Next, the string is broken into three even parts, representing red, green and blue, from left to right.
"0F0F00" behaves as expected, becoming RGB(0F, 0F, 00). Any string of 6 characters is done at this point.
Longer strings, such as "1234567890ABCDE" become RGB(12345, 67890, ABCDE). Extremely long strings are split similarly. "1234567890ABCDE1234567890ABCDE" becomes RGB( 1234567890, ABCDE12345, 67890ABCDE).
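The padding and splitting steps for strings of 4 or more digits might look like this in Python. This is only a sketch of the behavior described above; pad_and_split is a name invented here, and the input is assumed to be already cleaned of hash-marks and invalid characters:

```python
def pad_and_split(digits: str) -> tuple:
    """Right-pad with zeros to a multiple of 3, then cut into three equal parts."""
    padded = digits + "0" * (-len(digits) % 3)
    size = len(padded) // 3
    return padded[:size], padded[size:2 * size], padded[2 * size:]


print(pad_and_split("0F0F"))             # ('0F', '0F', '00')
print(pad_and_split("1234567890ABCDE"))  # ('12345', '67890', 'ABCDE')
```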
At this point, the RGB values are truncated individually.
If the individual RGB values are over 8 characters long, they are truncated to 8 characters by removing characters from the left. This, in particular, was unexpected.
RGB( 1234567890, ABCDE12345, 67890ABCDE) becomes RGB( 34567890, CDE12345, 890ABCDE), and so forth.
Once the individual RGB values are 8 characters long or shorter, they are truncated to 2 characters by removing characters from the right.
RGB( 34567890, CDE12345, 890ABCDE) becomes RGB( 34, CD, 89) or #34CD89, in more traditional notation.
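The two truncations can be expressed as a pair of slices. A small sketch, with truncate_component as an illustrative name rather than anything from IE:

```python
def truncate_component(component: str) -> str:
    """Keep the rightmost 8 characters, then the leftmost 2 of those."""
    last_eight = component[-8:]  # characters beyond 8 are dropped from the left
    return last_eight[:2]        # the rest is dropped from the right


parts = ("1234567890", "ABCDE12345", "67890ABCDE")
print("".join(truncate_component(p) for p in parts))  # 34CD89
```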
Any string should be transformed into a 6-digit hexadecimal color by the above steps.
For instance, <font color="6db6ec49efd278cd0bc92d1e5e072d68"> (yes, that is random hexadecimal data) will result in IE displaying text in the color "6ecde0", a rather pleasant light blue. This isn't at all what I would have expected before studying IE's behavior. I might have expected a truncation to "6db6ec" or to "072d68" (both also blues, coincidentally). However, if you look closely inside the random hexadecimal string, the components that make up the final RGB value are present, and in sequential order: "6e" at characters 4-5, "cd" at characters 15-16, and "e0" at characters 26-27 of "6db6ec49efd278cd0bc92d1e5e072d68".
To decode this value, it first needs to be padded to 33 characters, the next multiple of 3:
6db6ec49efd278cd0bc92d1e5e072d680
Then split into three even parts:
RGB( 6db6ec49efd, 278cd0bc92d, 1e5e072d680)
Then those parts are left-trimmed to 8 digits:
RGB( 6ec49efd, cd0bc92d, e072d680)
Then right-trimmed to the 2 most significant digits:
RGB( 6e, cd, e0)
And there you have it, the same color that IE will display if you enter 6db6ec49efd278cd0bc92d1e5e072d68 into one of the fields in the test applet above.
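Putting the steps together, here is a rough Python sketch of the whole procedure as laid out in this post. It is a reconstruction from observed behavior, not IE's actual code; the name parse_legacy_color is made up, named colors such as "red" are not handled, and the empty-string result is an assumption since that case is not covered above.

```python
import string

HEX_DIGITS = set(string.hexdigits)  # 0-9, a-f and A-F


def parse_legacy_color(value: str) -> str:
    """Reduce an arbitrary color attribute string to 6 hex digits."""
    # Step 1: drop hash-marks and zero out anything that is not a hex digit.
    digits = "".join(
        ch if ch in HEX_DIGITS else "0" for ch in value.replace("#", "")
    )
    # Lengths 1-3: pad to 3 digits, then put a 0 in front of each digit.
    # An empty string also lands here and yields 000000 (an assumption).
    if len(digits) <= 3:
        return "".join("0" + ch for ch in digits.ljust(3, "0"))
    # Lengths 4 and up: pad to a multiple of 3 and split into three parts.
    digits += "0" * (-len(digits) % 3)
    size = len(digits) // 3
    parts = [digits[i:i + size] for i in range(0, len(digits), size)]
    # Keep the rightmost 8 characters of each part, then the leftmost 2.
    return "".join(part[-8:][:2] for part in parts)


print(parse_legacy_color("6db6ec49efd278cd0bc92d1e5e072d68"))  # 6ecde0
print(parse_legacy_color("#zqbttv"))                           # 00b000
```

A filter could run every color attribute through a function like this before comparing foreground and background colors.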
FWIW, this isn't IE's fault. It was inheriting color handling behavior from Netscape. If you throw the same values at a copy of NS4, you'll find it handles them the same way.
However, there seem to be some exceptions to the rule. For example, the color "radioactive" becomes 0ad00ac000e0, which should be reduced to 0ad0 0ac0 00e0 and then 0a 0a 00... but it's really reduced to ad ac 0e. I can't figure out for the life of me why this is. Other invalid colors with similar lengths get the by-the-books treatment.
Funnily enough, while IE7 still special-cases it, Mozilla (and thus Firefox) processes it by the book.
I've special-cased radioactive in my current project, as it's a popular color for the poor victims of my experiments, but it makes me wonder why it's so special.
First, the "radioactive" quirk appears also with other colors starting with "ra". Weird.
Second, Netscape 4 and Mozilla/Firefox have a slight modification to the rules.
After right-padding the string with zeros to the next multiple of three, it is truncated to its leftmost 12 characters. For the random hexadecimal example above, that leaves:
6db6 ec49 efd2
which is then reduced to:
6d ec ef
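A sketch of that Netscape 4 / Mozilla variant, assuming it differs from the IE steps only in the 12-character clip. The function name clipped_variant_color is invented for illustration, and the input is assumed to be an already-cleaned string of 4 or more hex digits:

```python
def clipped_variant_color(digits: str) -> str:
    """Pad to a multiple of 3, keep the leftmost 12 characters, then split."""
    if len(digits) < 4:
        raise ValueError("this sketch only covers strings of 4 or more digits")
    padded = digits + "0" * (-len(digits) % 3)
    clipped = padded[:12]      # length is now 6, 9 or 12
    size = len(clipped) // 3
    blocks = [clipped[i:i + size] for i in range(0, len(clipped), size)]
    return "".join(block[:2] for block in blocks)


print(clipped_variant_color("6db6ec49efd278cd0bc92d1e5e072d68"))  # 6decef
```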
It seems that after a string has been split into three equal-sized blocks, and after those blocks have (if necessary) been truncated from the left down to 8 characters, the blocks are then truncated from the left until one of them begins with something other than a 0. This happens regardless of whether they were originally 9 characters or 4.
So, for instance, "00e000e000e0" splits as '00e0 00e0 00e0'; this is then reduced to 'e0 e0 e0' by chopping off the front of each block.
In the event that the blocks are cut down to a single letter each (eg, 'rules' becomes 00 0e 00, which becomes simply 0 e 0 to bring the e to the front of a block), Sam's rule for 3-character strings takes over, and it becomes 00 0e 00 again.
All of this means that no string will ever be interpreted as pure black if it contains any hex values at all. It also means that most strings of any decent length will appear as a bright colour, by running at least one hex value up to the front; my guess is that that's why it happens: to make 'invalid colours' more interesting to look at.
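The zero-stripping rule described above can be sketched as follows. This is based purely on the behavior reported in these comments; reduce_blocks is a hypothetical name, and the three blocks are assumed to be equal in length and already cleaned of invalid characters:

```python
def reduce_blocks(blocks: list) -> str:
    """Trim to 8 characters, strip shared leading zeros, keep 2 digits each."""
    blocks = [b[-8:] for b in blocks]  # left-truncate long blocks first
    # Drop one leading character from every block while all three start with 0.
    while len(blocks[0]) > 1 and all(b.startswith("0") for b in blocks):
        blocks = [b[1:] for b in blocks]
    if len(blocks[0]) == 1:
        # Single digits fall back to the 3-character rule: put a 0 in front.
        return "".join("0" + b for b in blocks)
    return "".join(b[:2] for b in blocks)


print(reduce_blocks(["0ad0", "0ac0", "00e0"]))  # adac0e ("radioactive")
print(reduce_blocks(["00e0", "00e0", "00e0"]))  # e0e0e0
print(reduce_blocks(["00", "0e", "00"]))        # 000e00 ("rules")
```

Under this reading, "radioactive" needs no special case: its blocks share exactly one leading zero, which is why it comes out as ad ac 0e.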
(Information gathered under Chrome 41.0.2272.118 m, and confirmed under IE 8.something. But purely from empirical evidence - no magical programming work or anything!)