Fonts For Forensics
By: Dr. Frederick B. Cohen, Ph.D.
Tel: (925) 454-0171
Abstract
Like other latent evidence that cannot be directly perceived by people, bit sequences have to be presented through tools. Presentations of digital forensic evidence often involve the presentation of text versions of bit sequences representing traces of events that took place within digital systems. This paper is about creating fonts for the examination and presentation of particular classes of bit sequences presented in particular ways in legal situations. Unlike fonts used for other purposes, fonts for forensics are less about the beauty of the presentation and more about the tradeoff between readability and being definitive about what is present. In other words, what you see is what you get, rather than what you see is what looks nice.
The presentation of trace evidence for legal purposes has substantially different requirements from presentation for other purposes, for several reasons. These include, without limitation: (1) legal mandates may restrict page formats (e.g., require the use of pleading paper for certain submissions); (2) subtle differences in presentation may be important for bringing clarity to the information presented (e.g., the difference between several spaces and a tab character may be vital to the issues at hand); (3) challenges may be brought based on what is unclear (e.g., how can we tell from what is on this page that what you are claiming about this text is in fact true?); and (4) what is visible in many fonts may not properly reveal what is in fact present in the digital forensic data, leading to errors and omissions.
Fonts in non-legal uses have evolved over the last 60 years. In the early days of computing, there were two main technologies for presenting digital data in readable form: dots and lines. The line technology used continuous media, such as deflections of a cathode ray in a cathode ray tube (CRT) or movement and up/down motion of a mechanical pen in two dimensions (the pen plotter). The dot technology consisted largely of light emitting diodes, lamps of various sorts, displays with fixed shape elements that were on or off at any given moment, and eventually the cathode ray tube with fixed scan patterns.
The fonts for plotters and line drawing CRTs originally consisted of sequences of line segments drawn one after another, with pen up and down movements to break line continuity. Font designers created an array of different fonts and used coding schemes for representation, such as the American Standard Code for Information Interchange (ASCII) and Extended Binary Coded Decimal Interchange Code (EBCDIC). The dot matrix font designers started using fixed height and width font elements with dots on or off within the matrix of a single symbol location. Each ASCII, EBCDIC, or other coded symbol was assigned a font element for display purposes, with the exception of some special characters, such as those used for location control within a line and those used for movement within the page and movement from page to page. Other byte values were often unused.
Over time, the dot technology largely won out, although pen plotters remain in use today. As display and printer technology improved, fonts became far more complex, involving more than on/off values for each location, variable height and width, and a wide array of different symbol sets placed within font families. Boldface, underlines, and similar features were added to reflect the printer methods of using carriage return or backspace and printing over the same location again and again to produce similar effects. Fonts were developed for multiple languages, and ultimately Unicode, initially a 2-byte coding scheme, came about to help handle the explosion in the number of symbols desired within a font. While there are many other codings for bits in widespread use, the present discussion will be largely restricted to 8-bit fonts representing the ASCII character set. This is for convenience of space, but most of the results presented apply equally to other coding schemes and can be extended, with minimal difficulty, to larger and smaller symbol sets and other similar schema.
There have been a number of different software packages over the years that have provided different representations of what normally appears on the screen. For example, and without limitation: the "vi" editor displays control characters (e.g., control-A) as two or more characters next to each other; Wordstar, and later Microsoft Word and many other document editors, have had presentation modes that show many non-printing characters; the program "hexdump" and other similar programs provide representations of hex, octal, binary, or other coding details, in some cases alongside a display of the printable representations of many of the byte values; and many display systems, such as packet analyzers, show and allow the expansion and contraction of records, fields within records, and various representations of content in different windows. However, to date, we have found none of these that meet the requirements of a font for forensics, in that they all fail in one way or another to precisely and accurately present what is present for all byte values, with the basis details combined in the same symbol as the presentation. Many of these packages also tend to have severe limits on the sequences they can sensibly present (e.g., protocol analyzers make assumptions about interpretation that suit the content they were designed to present but not other content, and document formatters tend not to deal well with arbitrary binary files), don't allow flexibility in the alignment of content, and make underlying assumptions about the coding scheme that lead to interpretation difficulties for the examiner seeking to understand what is present.
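As an illustration of the kind of failure described above, the following minimal Python sketch mimics the printable column of a typical hex dump, in which every byte outside the printable ASCII range is shown as a period. The function names and sample byte sequences are hypothetical and are not taken from any particular tool. Two different byte sequences yield the same printable column, so that column alone cannot be traced back to the bytes that produced it.

```python
def printable_column(data: bytes) -> str:
    """Mimic the text column of a typical hex dump:
    bytes outside printable ASCII (0x20-0x7E) are shown as '.'."""
    return "".join(chr(b) if 0x20 <= b <= 0x7E else "." for b in data)


def hex_column(data: bytes) -> str:
    """Two-digit lowercase hex code for each byte, separated by spaces."""
    return " ".join(f"{b:02x}" for b in data)


# Two hypothetical traces that differ only in their non-printing bytes.
sample_one = b"id\t42\n"    # tab and line feed
sample_two = b"id\x0042\r"  # NUL and carriage return

for sample in (sample_one, sample_two):
    print(hex_column(sample), "|", printable_column(sample))
# prints:
# 69 64 09 34 32 0a | id.42.
# 69 64 00 34 32 0d | id.42.
```

The hex column distinguishes the two sequences, but it sits apart from the readable column, so the reader has to correlate the two by position.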
Requirements of a font for forensics
As a general rule, it is highly desirable that displayed symbols from a defined symbol set used for legal purposes be precise, accurate, and unique. Precision and accuracy of representation are well understood in the legal community and, for the presentation of scientific and technical evidence, have been highly supported by legal rulings. The uniqueness property is highly desirable to avoid confusion and allow definitive answers to be given to specific questions that may arise. As a first attempt to characterize a set of rules, and the basis for those rules, when devising fonts for use in forensic examination and presentation, the following criteria are identified and the rationale explained:
- Each symbol should be clearly different from all other symbols
- This allows for clarity about which one is identified. If this is not true, then there will be confusion both for the examiner and for those who review the results, including the lawyers, judges, juries, clerks, and public. Legal documents are often printed, scanned, reprinted, and put through other similar machinations. While it is impossible to always preserve all of the characteristics of what was originally present, it is important to provide enough of a difference between symbols so that these differences are likely to survive multi-generation copying, scanning, and a wide range of displays.
- Each symbol should be of the same width and height
- This allows symbols to be compared to other symbols around them for location. While this may destroy the appearance of tab characters and other similar presentation values, it provides clarity around issues like spaces and columns, helps with fixed-width fields, such as those in databases, and allows the column and row to be clearly seen and specified verbally, which is vital for providing accurate testimony in legal matters.
- Each symbol should be familiar, with minimal added interpretation, so that it looks similar to what might appear on a display of the same symbol on a screen or printer.
- The word "help" should still be readable as such by someone who could read it in the normal display mode, or the font will create more confusion than it removes. Thus the font for EBCDIC will have to reflect EBCDIC coding, the one for ASCII will have to reflect ASCII coding, etc. There are limits to this today since existing fonts do not do this very well. For example, there is a wide range of different representations for the upper 128 symbols in the byte values of the ASCII character set because, when defined, it was a 7-bit coding. The variations include a range of different accented characters, symbols used to make boxes and other graphical components, mathematical symbols, etc. This proliferation of fonts and enormous variety of presentations is ultimately problematic, and can only be solved to a limited extent by the production of any particular font for forensics. For that reason, a forensic version of each font may ultimately be developed to allow the other properties to be met while producing a closer approximation to this one.
- Each symbol must be printable so that "non-printable" characters can be clearly seen on the printed page.
- This is necessary because, in many cases the issue in dispute is the non-printable characters, and even when they are not in dispute, it makes interpretation far easier when the non-printing characters are clearly revealed rather than being hidden. While many fonts provide presentation forms for most of the first 128 symbols within the symbol set, most of the ones observed in this study have shown a large portion of the codes from 128-255 as non-printing, essentially unassigned. There are many different options for the presentation of non-printing characters, and the presentation is a function of the utility for the examiner.
- Each symbol should self-indicate the underlying bit pattern that produced it so that it can be traced back to its original value.
- This provides a trace back to the origin of the data that produced the symbol, and allows the individual examining it to definitively know the basis for the display provided. For example, the byte code 03 may mean different things in different contexts. By providing the code with the display, the interpretation for a different context can be made, while without it, there would be no direct way for the examiner to know the basis for the interpretation applied. This also serves to remind the viewer that these are traces of latent evidence.
- Another challenge is that the upper half (high-order bit set to 1) of the ASCII and many other character sets has a wide range of different presentations. The font designer must identify a way to make these meaningful across a broad context of uses.
- For ease of understanding, except in cases where there is a clearly differentiable symbol for the upper half of the code space, the decision was taken to represent these symbols with emphasis marks (underlines) or coding indicators (e.g., FE). This may sacrifice a great deal in meaning because it does not display what actually may have "appeared" on a display or printer, but this can be resolved by the use of images of displays for cases where this is probative.
A side effect of these criteria is that the font will take up more space on a page than the normal font would take up for the same level of readability, and it will have some differences from the fonts commonly used for other purposes, such as a more distinct difference between 0 and O, 1 and l, I and |, and so forth.
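The criteria above can be approximated in plain text with a minimal Python sketch. It is not the font described in this paper: the cell layout, the `CELL_WIDTH` value, the use of standard ASCII control-code mnemonics as labels, and the function names are all assumptions made for illustration. Each byte becomes a fixed-width cell holding a familiar label above its hex code, so that every byte value is printable, unique, and traceable to its bit pattern, and so that any byte can be cited by row and column.

```python
# Standard ASCII mnemonics for the control codes 0x00-0x1F.
CONTROL_NAMES = [
    "NUL", "SOH", "STX", "ETX", "EOT", "ENQ", "ACK", "BEL",
    "BS", "HT", "LF", "VT", "FF", "CR", "SO", "SI",
    "DLE", "DC1", "DC2", "DC3", "DC4", "NAK", "SYN", "ETB",
    "CAN", "EM", "SUB", "ESC", "FS", "GS", "RS", "US",
]
CELL_WIDTH = 3  # assumed width: wide enough for the longest label


def label(byte: int) -> str:
    """Familiar label for a byte value (illustrative choices only)."""
    if byte < 0x20:
        return CONTROL_NAMES[byte]
    if byte == 0x20:
        return "SP"
    if byte == 0x7F:
        return "DEL"
    if byte < 0x7F:
        return chr(byte)
    return "__"  # upper half: emphasis mark; the hex code carries identity


def cell(byte: int) -> tuple[str, str]:
    """Fixed-width (label, hex code) pair for one byte."""
    return label(byte).ljust(CELL_WIDTH), f"{byte:02X}".ljust(CELL_WIDTH)


def render(data: bytes, columns: int = 16) -> str:
    """Render bytes as rows of cells: a label line above a hex-code line."""
    lines = []
    for start in range(0, len(data), columns):
        cells = [cell(b) for b in data[start:start + columns]]
        lines.append(" ".join(top for top, _ in cells).rstrip())
        lines.append(" ".join(bottom for _, bottom in cells).rstrip())
    return "\n".join(lines)


def locate(offset: int, columns: int = 16) -> tuple[int, int]:
    """1-based (row, column) of the byte at a given 0-based offset."""
    row, col = divmod(offset, columns)
    return row + 1, col + 1


print(render(b"a\tb"))
# prints:
# a   HT  b
# 61  09  62
```

In this sketch a tab appears as a visible `HT` cell rather than as a run of blank space, and `locate(17)` returns `(2, 2)`, the second cell of the second row. The cost, as noted above, is that the rendering takes considerably more space than a conventional font.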
A sample font for forensics
As a starting point, a forensic font for ASCII was developed, and a tool was implemented to convert any input byte sequence into a display using this font. A table depicting this font is included in Figure 1.