LWN.net Weekly Edition for August 27, 2015
In what was perhaps one of the shortest keynotes on record (ten minutes),
Keith Packard outlined the hardware architecture of "The
Machine"—HP's ambitious new computing system. That keynote took place
at LinuxCon
North America in Seattle and was thankfully followed by an hour-long
technical talk by Packard the following day (August 18), which looked at
both the hardware and software for The Machine. It is, in many ways, a
complete rethinking of the future of computers and computing, but there is
a fairly long way to go between here and there.
The hardware
The basic idea of the hardware is straightforward. Many of the "usual
computation units" (i.e. CPUs or systems on chip—SoCs) are connected to a
"massive memory pool" using
photonics for fast
interconnects. That leads to something of an equation, he said:
electrons (in CPUs) + photons (for communication) +
ions (for memory
storage) = computing. Today's computers transfer a lot of data
and do so over "tiny little pipes". The Machine, instead, can address all
of its "amazingly huge pile of memory" from each of its many compute
elements. One of the underlying principles is to stop moving memory around
to use it in computations—simply have it all available to any computer that
needs it.
Some of the ideas for The Machine came from HP's DragonHawk systems, which
were traditional symmetric multiprocessing systems, but packed a "lot of
compute in a small space". DragonHawk systems would have 12TB of RAM in an
18U enclosure, while
the nodes being built for The Machine will have 32TB of memory in 5U. It
is, he
said, a lot of memory and it will scale out linearly. All of the
nodes will be connected at the memory level so that "every single processor
can do a load or store instruction to access memory on any system".
Nodes in this giant cluster do not have to be homogeneous, as long as they
are all hooked to the same memory interconnect.
The first nodes that HP is building will be homogeneous, just for pragmatic
reasons. There are two circuit boards on each node, one for storage and
one for the computer. Connecting the two will be the "next generation
memory interconnect" (NGMI), which will also connect both parts of the node
to the rest of the system using photonics.
The compute part of the node will have a 64-bit ARM SoC with 256GB of purely
local RAM along with a field-programmable gate array (FPGA) to implement
the NGMI protocol. The storage
part will have four banks of memory (each with 1TB), each with its own NGMI
FPGA.
A given SoC can access memory elsewhere without involving the SoC on the
node where the memory resides—the NGMI bridge FPGAs will talk to their
counterpart on the other node via the photonic interface.
Those
FPGAs will eventually be replaced by application-specific integrated circuits
(ASICs) once the bugs are worked out.
ARM was chosen because it was easy to get those vendors to talk with the
project, Packard said. There is no "religion" about the instruction set
architecture (ISA), so others may be used down the road.
Eight of these nodes can be collected up into a 5U enclosure, which gives
eight processors and 32TB of memory. Ten of those enclosures can then be
placed into a rack (80 processors, 320TB) and multiple racks can all be
connected on the same "fabric" to allow addressing up to 32 zettabytes (ZB) from
each processor in the system.
The storage and compute portions of each node are powered separately. The
compute piece has two 25Gb network interfaces that are capable of remote
DMA.
The storage piece will eventually use some kind of non-volatile/persistent
storage (perhaps
even the fabled memristor), but is using
regular DRAM today, since it is available and can be used to prove the
other parts of the design before switching.
SoCs in the system may be running more than one operating system (OS) and for
more than one tenant, so there are some hardware protection mechanisms
built into the system. In addition, the memory-controller FPGAs will
encrypt the data at rest so that pulling a board will not give access to the
contents of the memory even if it is cooled (à la cold boot) or
when some kind of persistent storage is used.
At one time, someone said that 640KB of memory should be enough, Packard
said, but now he is wrestling with the limits of the 48-bit addresses used
by the 64-bit ARM and Intel CPUs. That only allows addressing up to 256TB,
so memory will be accessed in 8GB "books" (or, sometimes, 64KB
"booklettes"). Beyond the
SoC, the NGMI bridge FPGA (which is also called the "Z bridge") deals with
two different kinds of addresses: 53-bit logical Z addresses and 75-bit Z
addresses. Those allow addressing 8PB and 32ZB respectively.
The logical Z
addresses are used by the NGMI firewall to determine the access rights to
that memory for the local node. Those access controls are managed outside
of whatever OS is running on the SoC. So the mapping of
memory is handled by the OS, while the access controls for the memory are
part of the management of The Machine system as a whole.
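As a quick check of the arithmetic behind those address widths (byte-addressed, binary units):

$$2^{48}\,\text{bytes} = 256\,\text{TB},\qquad 2^{53}\,\text{bytes} = 8\,\text{PB},\qquad 2^{75}\,\text{bytes} = 32\,\text{ZB}.$$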
NGMI is not intended to be a proprietary fabric protocol, Packard said, and
the project is trying to see if others are interested. A memory
transaction on the fabric looks much like a cache access. The Z address is
presented and 64 bytes are transferred.
The software
Packard's group is working on GPL operating systems for the system, but
others can certainly be supported. If some "proprietary Washington
company" wanted to port its OS to The Machine, it certainly could.
Meanwhile, though, other groups are working on other free systems, but his
group is made up of "GPL bigots" that are working on Linux for the system.
There will not be a single OS (or even distribution or kernel) running on a
given instance of The Machine—it is intended to support multiple
different environments.
Probably the biggest hurdle for the software is that there is no cache
coherence within the enormous memory pool. Each SoC has its own local
memory (256GB) that is cache coherent, but
accesses to the "fabric-attached memory" (FAM) between two processors are
completely uncoordinated by hardware. That has implications for
applications and the OS that are using that memory, so OS data structures
should be restricted to the local, cache-coherent memory as much as possible.
For the FAM, there is a two-level allocation scheme that is arbitrated by a
"librarian". It allocates books (8GB) and collects them into "shelves".
The hardware protections provided by the NGMI firewall are done on book
boundaries. A shelf could be a collection of books that are scattered all
over the FAM in a single load-store domain (LSD—not Packard's
favorite acronym, he noted), which is defined by the firewall access rules.
That shelf could then be handed to the OS to
be used for a filesystem, for example. That might be ext4, some other Linux
filesystem, or the new library filesystem (LFS) that the project is working on.
Talking to the memory in a shelf uses the POSIX API. A process does an
open() on a
shelf and then uses
mmap() to map the memory into the process. Underneath, it uses
the direct access (DAX) support to access the memory. For the
first revision, LFS will not support sparse files. Also, locking will not be
global throughout an LSD, but will be local to an OS running on a node.
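A minimal sketch of that access pattern is shown below; the shelf path and mapping size are illustrative assumptions (the article does not say how LFS names shelves), but the open()/mmap()/DAX flow is as described.

    /* Hypothetical example: map a shelf exposed by LFS and store to it
     * directly.  The path and length are made up for illustration. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/lfs/my_shelf", O_RDWR);   /* shelf presented as a file */
        if (fd < 0) {
            perror("open");
            return 1;
        }

        size_t len = (size_t)8 << 30;             /* one 8GB "book" */
        uint64_t *fam = mmap(NULL, len, PROT_READ | PROT_WRITE,
                             MAP_SHARED, fd, 0);
        if (fam == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        /* With DAX, loads and stores go straight to fabric-attached
         * memory; there is no page-cache copy in between. */
        fam[0] = 42;

        munmap(fam, len);
        close(fd);
        return 0;
    }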
For management of the FAM, each rack will have a "top of rack" management
server, which is where the librarian will run. That is a fairly simple
piece of code that just does bookkeeping and keeps track of the allocations
in a SQLite database.
The SoCs are the only parts of the system that can talk to the firewall
controller, so other components communicate with a firewall proxy that runs
in user space, which
relays queries and updates. There are a "whole bunch of potential
adventures" in getting the memory firewall pieces all working correctly,
Packard said.
The lack of cache coherence makes atomic operations on the FAM problematic,
as traditional atomics rely on that feature. So the project has added
some hardware to the bridges to do atomic operations at that level. There
is a fam_atomic library to access the operations (fetch and
add, swap, compare and store, and read), which means that
each operation is done at the cost of a system call. Once again, this is
just the first implementation; other mechanisms may be added later. One
important caveat is that the FAM atomic operations do not interact with the
SoC cache, so applications will need to flush those caches as needed to
ensure consistency.
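To give a flavor of what that looks like in practice, here is a sketch; the function names and prototypes below are guesses based on the operation list given in the talk, not the library's documented interface.

    #include <stdint.h>

    /* Hypothetical prototypes standing in for the fam_atomic library. */
    int64_t fam_atomic_64_fetch_and_add(int64_t *fam_addr, int64_t delta);
    int64_t fam_atomic_64_read(int64_t *fam_addr);

    /* Atomically bump a counter that lives in fabric-attached memory. */
    void bump_shared_counter(int64_t *counter_in_fam)
    {
        /* Each call costs a system call; the operation is carried out by
         * the NGMI bridge, not the SoC, so it bypasses the SoC cache.
         * Any cached copy of *counter_in_fam must be flushed before a
         * plain load can be trusted. */
        fam_atomic_64_fetch_and_add(counter_in_fam, 1);
    }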
Physical addresses at the SoC level can change, so there needs to be
support for remapping those addresses. But the SoC caches and DAX both
assume static physical mappings. A subset of the physical address
space will be used as an aperture into the full address space of the system
and books can be mapped into that aperture.
Flushing the SoC cache line by line would
"take forever", so a way to flush the entire cache when the physical
address mappings change has been added. In order to do that, two new
functions have been added to the Intel persistent memory library
(libpmem): one to check for the presence of non-coherent
persistent memory (pmem_needs_invalidate()) and another to
invalidate the CPU cache (pmem_invalidate()).
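The function names come from the talk; the prototypes and usage below are assumptions about how they might be called.

    #include <stdbool.h>
    #include <stddef.h>

    /* Assumed prototypes for the proposed libpmem additions. */
    bool pmem_needs_invalidate(const void *addr, size_t len);
    void pmem_invalidate(const void *addr, size_t len);

    /* Called after management software maps different books behind an
     * aperture: stale cache lines for the old mapping must be dropped. */
    void aperture_remapped(const void *aperture, size_t len)
    {
        if (pmem_needs_invalidate(aperture, len))
            pmem_invalidate(aperture, len);
    }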
In a system of this size, with the huge amounts of memory involved, there
needs to be well-defined support for memory errors, Packard said. Read is
easy—errors are simply signaled synchronously—but writes are trickier
because the actual write is asynchronous. Applications need to know about
the errors, though, so SIGBUS is used to signal an error. The
pmem_drain() call will act as a barrier, such that errors in
writes before
that call will signal at or before the call. Any errors after the barrier
will be signaled post-barrier.
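A sketch of that error model follows; pmem_drain() is the existing libpmem call, while the handler and its recovery policy are placeholders.

    #include <libpmem.h>
    #include <signal.h>
    #include <stdint.h>
    #include <unistd.h>

    static void bus_handler(int sig)
    {
        /* A failed (asynchronous) write to fabric-attached memory is
         * reported here via SIGBUS. */
        (void)sig;
        write(STDERR_FILENO, "FAM write failed\n", 17);
        _exit(1);
    }

    void store_record(uint64_t *fam, uint64_t value)
    {
        struct sigaction sa = { .sa_handler = bus_handler };
        sigaction(SIGBUS, &sa, NULL);

        *fam = value;    /* the store may complete (and fail) asynchronously */
        pmem_drain();    /* errors from the store above signal by this point */
    }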
There are various areas where the team is working on free software, he
said, including persistent memory and DAX. There is also ongoing work on
concurrent/distributed filesystems and non-coherent cache management.
Finally, reliability, availability, and serviceability (RAS) are quite
important to the project, so free software work is proceeding in that area
as well.
Even with two separate sessions, it was a bit of a whirlwind tour of
The Machine. As he noted, it is an environment that is far removed from
the desktop
world Packard had previously worked in. By the sound of it, there are plenty of
challenges to overcome before The Machine becomes a working computing
device—it will be an interesting process to watch.
[I would like to thank the Linux Foundation for travel assistance to
Seattle for LinuxCon North America.]
By Nathan Willis
August 26, 2015
TypeCon
Data visualization is often thought of in terms of pixels;
considerable work goes into shaping large data sets into a form where
spatial relationships are made clear and where colors, shapes,
intensity, and point placement encode various quantities for rapid
understanding. At TypeCon 2015 in Denver, though, researcher Richard
Brath presented a different approach: taking advantage of readers'
familiarity with the written word to encode more information into text
itself.
Brath is a PhD candidate at London South Bank University where, he
said, "I turn data into shapes and color and so on." Historically
speaking, though, typography has not been a part of that equation. He
showed a few examples of standard data visualizations, such as
"heatmap" diagrams. Even when there are multiple variables under
consideration, the only typography involved is plain text labels. "Word
clouds" are perhaps the only well-known example of visualizations that
involve altering text based on data, but even that is trivial: the
most-frequent words or tags are simply bigger. More can certainly be
done.
Indeed, more has been done—at least on rare
occasion; Brath has cataloged and analyzed instances where other type
attributes have been exploited to encode additional information in
visualizations. An oft-overlooked example, he said, is cartography:
subtle changes in
spacing, capitalization, and font weight are used to indicate many
distinct levels of map features. The reader may not consciously
recognize it, but the variations give cues as to which neighboring
text labels correspond to which map features. Some maps even
incorporate multiple types of underline and reverse italics in
addition to regular italics (two features that are quite uncommon
elsewhere).
Brath also showed several historical charts and diagrams (some
dating back to the 18th Century) that used typographic features to
encode information. Royal family trees, for example, would sometimes
vary the weight, slant, and style of names to indicate the pedigree
and status of various family members. A far more modern example of
signifying information with font attributes, he
said, can be seen in code editors, where developers take it for
granted that syntax highlighting will distinguish between symbols,
operators, and structures—hopefully without adversely impacting
readability.
On the whole, though, usage of these techniques is limited to
specific niches. Brath set out to catalog the typographic features
that were employed, then to try to apply them to entirely new
data-visualization scenarios. The set of features available for encoding
information included standard properties like weight and slant,
plus capitalization, x-height, width (i.e., condensed through
extended), spacing, serif type, stroke contrast, underline, and the choice of
typeface itself. Naturally, some of those attributes map well to
quantitative data (such as weight, which can be varied continuously
throughout a range), while others would only be useful for encoding
categorical information (such as whether letters are slanted or
upright).
He then began creating and testing a variety of visualizations in
which he would encode information by varying some of the font
attributes. Perhaps the most straightforward example was the
"text-skimming" technique: a preprocessor varies the weight of
individual words in a document based on their overall frequency in the
language used. Unusual words are bolder, common words are lighter, with
several gradations incorporated. Articles and pronouns can even be
put into italics to further differentiate them from the more critical
parts of the text. The result is a paragraph that, in user tests,
readers can skim through at significantly higher speed; it is somewhat
akin to overlaying a word-frequency cloud on top of the text itself.
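As a rough illustration of the idea (not Brath's actual parameters or tooling), a preprocessor of this kind only needs a word-frequency table and a mapping from frequency to weight:

    #include <stdio.h>

    /* Map a word's corpus frequency (occurrences per million words) to a
     * CSS font-weight: rare words heavy, common words light.  The bands
     * here are invented for illustration. */
    static int weight_for_frequency(double per_million)
    {
        if (per_million > 1000.0) return 200;   /* very common: light   */
        if (per_million > 100.0)  return 400;   /* common: regular      */
        if (per_million > 10.0)   return 600;   /* uncommon: semibold   */
        return 800;                             /* rare: bold           */
    }

    static void emit_word(const char *word, double per_million)
    {
        printf("<span style=\"font-weight:%d\">%s</span> ",
               weight_for_frequency(per_million), word);
    }

    int main(void)
    {
        /* Frequencies are illustrative, not from a real corpus. */
        emit_word("the", 50000);
        emit_word("glyph", 6);
        emit_word("of", 30000);
        emit_word("typography", 3);
        putchar('\n');
        return 0;
    }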
A bit further afield was Brath's example of encoding numeric data
linearly in a line of text. He took movie reviews from the Rotten
Tomatoes web site and used each reviewer's numeric rating as the
percentage of the text rendered in bold. The result, when all of the
reviews for a particular film are placed together, effectively maps a
histogram of the reviews onto the reviews themselves. In tests, he
said, participants typically found it easier to extract information
from this form than from Rotten Tomatoes's default layout, which
places small numbers next to review quotes in a grid, intermingled with
various images.
He also showed examples of visualization techniques that varied
multiple font attributes to encode more than one variable. The first
was a response to limitations of choropleth maps—maps where
countries or other regions are colored (or shaded) to indicate a particular score
on some numeric scale. While choropleths work fine for single
variables, it is hard to successfully encode multiple variables using
the technique, and looking back and forth between multiple
single-variable choropleth maps makes it difficult for the reader to
notice any correlations between them.
Brath's technique encoded three variables (health expenditure as a
percentage of GDP, life expectancy, and prevalence of HIV) into three
font attributes (weight, case, and slant), using the
three-letter ISO country codes as the text for each nation on the
map. The result makes it easier to zero in on particular combinations
of the variables (for example, countries with high health expenditures
and short life expectancies) or, at least, easier than flipping back
and forth between three maps.
His final example of multi-variable encoding used x-height and font
width to encode musical notation into text. The use case presented
was how to differentiate singing from prose within a book. Typically,
the only typographic technique employed in a book is to offset the
sung portion of the text and set it in italics. Brath, instead,
tested varying the heights of the letters to indicate note pitch and
the widths to indicate note duration.
The reaction to this technique from the audience at TypeCon was, to
say the least, mixed. While it is clear that the song text encodes
some form of rhythmic changes and variable intensity, it does not map
easily to notes, and the rendered text is not exactly easy to look
at. Brath called it a work in progress; his research is far from
over.
He ended the session by encouraging audience members to visit
his research blog
and take the latest survey to test the impact of some of the
visualization techniques firsthand. He also posed several
questions to the crowd, such as why many font families come with a variety
of different weights, but essentially none offer multiple x-height options
or italics with multiple angles of slant.
Brath's blog makes for interesting reading for anyone concerned
with data visualizations or text. He often explores
practical issues—for example, how
overuse of color can negatively impact text legibility, which could
have implications for code markup tools, or the difficulties
to overcome when trying to slant text at multiple angles.
Programmers, who spend much of their time staring at text, are no
doubt already familiar with many ways in which typographic features
can encode supplementary information (in this day and age, who
does not associate a hyperlink closely with an underline, after
all?). But there are certainly still many places where the attributes
of text might be used to make data easier to find or understand.
A persistent theme throughout the LLVM
microconference at the 2015 Linux Plumbers
Conference was that of "breaking the monopoly" of GCC, the GNU C
library (glibc), and other tools that are relied upon for building Linux
systems. One could quibble with the "monopoly" term, since it is
self-imposed and not being forced from the outside, but the general idea is
clear: using multiple tools to build our software will help us in numerous ways.
Kernel and Clang
Most of the microconference was presentation-oriented, with relatively
little discussion.
Jan-Simon Möller kicked things off with a status report on the efforts to
build a Linux kernel using LLVM's Clang C compiler. The number of
patches needed for building the kernel has dropped from around 50 to 22
"small patches", he said. Most of those are in the kernel build system or
are for little quirks in driver code. Of those, roughly two-thirds can
likely be merged upstream, while the others are "ugly hacks" that will
probably stay in the LLVM-Linux tree.
There are currently five patches needed in order to build a
kernel for the x86 architecture. Two of those are for problems building
the crypto
code (the AES_NI assembly code will not build with the LLVM
integrated assembler and there are longstanding problems with
variable-length
arrays in structures). The integrated assembler also has
difficulty handling some "assembly" code that is used by the kernel build
system to calculate offsets; GCC sees it as a string, but the integrated
assembler tries to actually assemble it.
The goal of building an "allyesconfig" kernel has not yet been realized, but a
default configuration (defconfig) can be built using the most recent Git
versions of
LLVM and Clang. It currently requires disabling the integrated assembler
for the entire build, but the goal is to disable it just for the files that
need it.
Other architectures (including ARM for the Raspberry Pi 2) can be
built using roughly half-a-dozen patches per architecture, Möller said.
James Bottomley was concerned about the "Balkanization" of kernel builds
once Linus Torvalds and others start using Clang for their builds; obsolete
architectures and those not supported by LLVM may stop building altogether,
he said. But microconference lead Behan Webster thought that to be an
unlikely outcome. Red Hat and others will always build their kernels using
GCC, he said, so that will be supported for quite a long time, if not forever.
Using multiple compilers
Kostya Serebryany is a member of the "dynamic testing tools" team at Google,
which has the goal of providing tools for the C++ developers at the company
to find bugs without any help from the team. He was also one of the
proponents of the "monopoly" term for GCC, since it is used to build the
kernel, glibc, and all of the distribution binaries. But, he said,
making all of that code buildable using other compilers will allow various other
tools to also be run on the code.
For example, the AddressSanitizer
(ASan) can be used to detect memory errors such as stack overflow, use
after free, using stack memory after a function has returned, and so on.
Likewise, ThreadSanitizer
(TSan), MemorySanitizer
(MSan), and UndefinedBehaviorSanitizer
(UBSan) can find various kinds of problems in C and C++ code. But all are
based on Clang and LLVM, so only code that can be built with that compiler
suite can be sanitized using these tools.
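For example, a one-line use-after-free is enough to see ASan in action; this is a generic illustration rather than an example from the talk:

    /* Build with:  clang -g -fsanitize=address uaf.c
     * (GCC's -fsanitize=address works the same way.) */
    #include <stdlib.h>

    int main(void)
    {
        int *p = malloc(sizeof(*p));
        free(p);
        return *p;    /* ASan reports a heap-use-after-free here */
    }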
GCC already has some similar tools and the Linux kernel has added some as
well (the kernel address sanitizer, for example), which have found various
bugs, including quite a few security bugs. GCC's support has largely come
about because of the competition with LLVM and still falls short in some
areas, he said.
Beyond those "best effort" tools, though, there are other techniques.
Fuzzing and hardening, for example, can be used either to find more bugs or
to eliminate certain classes of bugs. He stated that coverage-guided
fuzzing can be used to home in on problem areas in the code. LLVM's
LibFuzzer can be used to
perform that kind of fuzzing. He noted that the Heartbleed bug can be "found" using LibFuzzer
in roughly five seconds on his laptop.
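The shape of a LibFuzzer target is simple: the fuzzer repeatedly calls one entry point with coverage-guided inputs. The toy bug below is invented for illustration; recent Clang builds this with -fsanitize=fuzzer,address, while in 2015 libFuzzer was linked in by hand.

    #include <stddef.h>
    #include <stdint.h>

    /* Stand-in for the code under test: crashes on one specific input,
     * which coverage-guided fuzzing finds quickly. */
    static void check_input(const uint8_t *data, size_t size)
    {
        if (size >= 3 && data[0] == 'b' && data[1] == 'u' && data[2] == 'g')
            __builtin_trap();
    }

    /* LibFuzzer calls this repeatedly with generated inputs. */
    int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
    {
        check_input(data, size);
        return 0;
    }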
Two useful hardening techniques are also available with LLVM: control flow
integrity (CFI) and SafeStack. CFI will
abort the program when it detects certain kinds of undesired behavior—for
example, that a virtual function table has been altered. SafeStack protects
against stack overflows by
placing local variables on a separate stack. That way, the return address
and any variables are not contiguous in memory.
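As a sketch of what SafeStack changes, consider a classic overflow; built with clang -fsanitize=safe-stack, the buffer lives on a separate stack from the return address. (CFI is a separate option, roughly clang -flto -fsanitize=cfi, and mainly guards indirect calls.)

    #include <string.h>

    /* Deliberately unsafe copy, for illustration only.  Under SafeStack
     * the buffer is placed on the "unsafe" stack, so overflowing it
     * cannot overwrite the return address on the regular stack. */
    void copy_name(const char *untrusted)
    {
        char buf[16];
        strcpy(buf, untrusted);
    }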
Serebryany said that it was up to the community to break the monopoly.
He was not suggesting simply switching to using LLVM exclusively, but to
ensuring that the kernel,
glibc, and distributions all could be built with it. Furthermore, he said
that continuous integration should be set up so that all of these pieces
can always be built with both compilers. When other compilers arrive, they
should
also be added into the mix.
To that end, Webster asked if Google could help get the kernel patches
needed to build with Clang upstream. Serebryany said that he thought that,
by showing some of the advantages of being able to build with Clang (such
as the fuzzing support), Google might be able to help get those
patches merged.
BPF and LLVM
The "Berkeley Packet Filter" (BPF) language has expanded its role greatly
over the years, moving from simply being used for packet filtering to now
providing the in-kernel virtual machine for security (seccomp), tracing,
and more. Alexei Starovoitov has been the driving force behind extending
the BPF language (into eBPF) as well as expanding its scope in the kernel.
LLVM can be used to compile eBPF programs for use by the kernel, so Starovoitov
presented about the language and its uses at the microconference.
He began by noting wryly that he "works for Linus Torvalds" (in the same
sense that all kernel developers do). He merged his first patches into GCC some
fifteen years ago, but he has "gone over to Clang" in recent years.
The eBPF language is supported by both GCC and LLVM using backends that he
wrote. He noted that the GCC backend is half the size of the LLVM version,
but that the latter took much less time to write. "My vote goes to LLVM
for the simplicity of the compiler", he said. The LLVM-BPF backend has been
used to demonstrate how to write a backend for the compiler. It is now
part of LLVM stable and will be released as part of LLVM 3.7.
GCC is built for a single backend, so you have to specifically create a BPF
version, but LLVM has all of its backends available using command-line
arguments (--target bpf). LLVM also has an integrated
assembler that can take the C code describing the BPF and turn it into
in-memory BPF bytecode that can be loaded into the kernel.
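For a flavor of what that looks like, the restricted-C program below compiles to BPF bytecode with something like clang -O2 --target bpf -c filter.c; the section name and the trivial filter are illustrative, since loaders differ in what they expect.

    /* Minimal eBPF socket filter in restricted C.  Loaders typically use
     * the ELF section name to find the program; the name here is only an
     * example. */
    __attribute__((section("socketfilter"), used))
    int drop_all(void *ctx)
    {
        (void)ctx;
        return 0;   /* a socket filter's return value is the number of
                       bytes to keep; 0 drops the packet */
    }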
BPF for tracing is currently a hot area, Starovoitov said. It is a better
alternative to SystemTap and runs two
to three times faster than Oracle's
DTrace. Part of that
speed comes from LLVM's optimizations plus the kernel's internal just-in-time
compiler for BPF bytecode.
Another interesting tool is the BPF Compiler Collection (BCC).
It makes it easy to write and run BPF programs by embedding them into
Python (either directly as strings in the Python program or by loading them
from a C file). Underneath the Python "bpf" module is LLVM, which compiles
the program before the Python code loads it into the kernel. A simple
printk() can easily be added into the kernel without recompiling
it (or rebooting). He noted that Brendan Gregg has added a bunch of example tools
to show how to use the C+Python framework.
Under the covers, the framework uses libbpfprog, which compiles a C
source file into BPF bytecode using Clang/LLVM. It can also load the
bytecode and any BPF maps into the kernel using the bpf() system
call and attach the program(s) to various types of hooks (e.g. kprobes, tc
classifiers/actions). The Python bpf module simply provides bindings for
the library.
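For a sense of scale, the C half of the canonical BCC example is just a few lines; the Python side simply hands this text to the module, which compiles it with LLVM, loads it with bpf(), attaches the kprobe, and reads the trace output. Treat the details as illustrative.

    /* The C half of a minimal BCC tracing tool: BCC's kprobe__ naming
     * convention attaches this to sys_clone(), and bpf_trace_printk()
     * writes to the kernel trace pipe, in effect a printk() added
     * without recompiling or rebooting. */
    int kprobe__sys_clone(void *ctx)
    {
        bpf_trace_printk("sys_clone() called\n");
        return 0;
    }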
The presentation was replete with examples, which are available in the slides
[PDF] as well.
Alternatives for the core
There was a fair amount of overlap between the last two sessions I was able
to sit in on. Both Bernhard Rosenkraenzer and Khem Raj were interested in
replacing more than just the compiler in building a Linux system.
Traditionally, building a Linux system starts with GCC, glibc, and
binutils, but there are
now alternatives to those. How much of a Linux
system can be built using those alternatives?
Some parts of binutils are still needed, Rosenkraenzer said. The binutils gold linker can
be used instead of the traditional ld. (Other linker options were
presented in Mark Charlebois's final session of the microconference, which
I unfortunately had to miss.) The gas assembler from binutils can
be replaced with Clang's integrated assembler for the most part, but there
are still non-standard assembly constructs that require gas.
Tools like nm, ar, ranlib, and others will need
to be made to understand three different formats: regular object files,
LLVM bitcode, and the GCC intermediate representation. Rosenkraenzer
showed a shell-script wrapper that could be used to add this support to
various utilities.
For the most part, GCC can be replaced by Clang. OpenMandriva switched to
Clang as its primary compiler in 2014. The soon-to-be-released
OpenMandriva 3 is almost all built with Clang 3.7. Some packages
are still built with gcc or g++, however. OpenMandriva
still needed to build GCC, though, to get libraries that were needed such
as libgcc, libatomic, and others (including, possibly,
libstdc++).
The GCC compatibility claimed by Clang is too conservative, Rosenkraenzer
said. The __GNUC__ macro definition in Clang is set to 4.2.1, but
switching that to 4.9 produces better code. There were a couple of thoughts
on why Clang chose 4.2.1, and the two are related: 4.2.1 was the last
GPLv2 release of GCC, so some people may not be allowed to look at later
versions; in addition, it was the last GCC version used to build
the BSD portions of OS X.
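The effect is easy to see in version-gated code like the generic fragment below: under stock Clang the newer-feature branch is never taken, even when Clang supports the construct. Clang's __has_builtin()/__has_attribute() checks are usually the more robust alternative.

    /* Stock Clang reports __GNUC__ == 4 and __GNUC_MINOR__ == 2, so this
     * gate takes the conservative path even though the compiler may well
     * support the newer feature. */
    #if __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 9)
    #  define HAVE_GCC49_FEATURES 1
    #else
    #  define HAVE_GCC49_FEATURES 0   /* taken under unmodified Clang */
    #endif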
There is a whole list of GCC-isms that should be avoided for
compatibility with Clang. Rosenkraenzer's slides
[PDF] list many of them. He noted that there have been a number of
bugs found via Clang warnings or errors when building various programs—GCC
did not complain about those problems.
Another "monopoly component" that one might want to replace would be
glibc. The musl libc alternative
is viable, but only if binary compatibility with other distributions is not
required. But musl cannot be built with Clang, at least yet.
Replacing GCC's libstdc++ with LLVM's libc++ is possible
but, again, binary compatibility is sacrificed. That is a bigger problem
than it is for musl, though, Rosenkraenzer said. Using both is possible,
but there are problems when a library (e.g. Qt) is linked against, say,
libc++ while a binary-only program using it was built against libstdc++,
which leads to crashes.
libc++ is roughly half the size of libstdc++, however, so
environments like Android (which never used libstdc++) are making
the switch.
Cross-compiling under LLVM/Clang is easier since all of the backends are
present and compilers for each new target do not need to be built. There
is still a need to build the cross-toolchains, though, for binutils,
libatomic, and so on. Rosenkraenzer has been working on a tool
to do automated bootstrapping of the toolchain and core system.
Conclusion
It seems clear that use of LLVM within Linux is growing and that growth is
having a positive effect. The competition with GCC is
helping both to become better compilers, while building our tools with both
is finding
bugs in critical components like the kernel. Whether it is called
"breaking the monopoly" or "diversifying the build choices", this trend is
making beneficial changes to our ecosystem.
[I would like to thank the Linux Plumbers Conference organizing committee
for travel assistance to
Seattle for LPC.]
By Nathan Willis
August 26, 2015
TypeCon
At the 2015 edition of TypeCon in Denver, Adobe's Frank Grießhammer presented his
work reviving the famous Hershey fonts
from the Mid-Century era of computing. The original fonts were
tailor-made for early vector-based output devices but, although they
have retained a loyal following (often as a historical curiosity),
they have never before been produced as an installable digital font.
Grießhammer started his talk by acknowledging his growing
reputation for obscure topics—in 2013, he presented a tool for
rapid generation of the Unicode box-drawing
characters—but argued that the Hershey fonts were overdue for
proper recognition. He first became interested in the fonts and their peculiar
history in 2014, when he was surprised to find a well-designed
commercial font that used only straight line segments for its
outlines. The references indicated that this choice was inspired by
the Hershey fonts, which led Grießhammer to dig into the topic further.
The fonts are named for their creator, Allen V. Hershey
(1910–2004), a physicist working at the
US Naval Weapons Laboratory in the 1960s. At that time, the laboratory
used one of the era's most advanced computers, the IBM Naval
Ordnance Research Calculator (NORC), a vacuum-tube and
magnetic-tape based machine. NORC's output was provided by the General
Dynamics S-C
4020, which could either plot on a CRT display or directly onto
microfilm. It was groundbreaking for the time, since the S-C 4020
could plot diagrams and charts directly, rather than simply outputting
tables from which charts had to be hand-drawn by draftsmen after the fact.
By default, the S-C 4020 would output text by projecting light
through a set of letter stencils, but Hershey evidently saw untapped
potential in the S-C 4020's plotting capabilities. Using the plotting
functions, he designed a set of high-quality Latin fonts (both upright
and italics),
followed by Greek, a full set of mathematical and technical symbols,
blackletter and Lombardic letterforms, and an extensive set of
Japanese glyphs—around 2,300 characters in total. Befitting the S-C
4020's plotting capabilities, the letters were formed entirely by
straight line segments.
The format used to
store the coordinates of the curves is, to say the least, unusual.
Each coordinate point is stored as a pair of ASCII
letters, where the numeric value of each letter is found by taking its
offset from the letter R. That is, "S" has a value of +1, while "L"
has a value of -6. The points are plotted with the origin in the
center of the drawing area, with x increasing to the right and y
increasing downward.
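Decoding a pair is a one-liner; the sketch below shows just the coordinate encoding (the surrounding record structure, i.e. glyph number, vertex count, and pen-up markers, is omitted):

    #include <stdio.h>

    /* Each axis value is the ASCII offset from 'R': "S" is +1, "L" is -6. */
    static void decode_pair(char cx, char cy, int *x, int *y)
    {
        *x = cx - 'R';   /* x increases to the right */
        *y = cy - 'R';   /* y increases downward     */
    }

    int main(void)
    {
        int x, y;
        decode_pair('S', 'L', &x, &y);
        printf("(%d, %d)\n", x, y);   /* prints (1, -6) */
        return 0;
    }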
Typographically, Hershey's designs were commendable; he
drew his characters based on historical samples, implemented his own
ligatures, and even created multiple optical sizes.
Hershey then proceeded to develop four separate styles that each used
different numbers of strokes (named "simplex," "duplex," "complex,"
and "triplex").
The project probably makes Hershey the inventor of "desktop
publishing", if not "digital type" itself, Grießhammer said, but
Hershey himself is all but forgotten. There is scant information
about him online, Grießhammer said; he has still not even been able
to locate a photograph (although, he added, Hershey may be one of the unnamed
individuals seen in group shots of the NORC room, which can be found online).
Hershey's vector font set has lived on as a subject for computing
enthusiasts, however. The source files are in the public domain (a
copy of the surviving documents is available from the Ghostscript project,
for example) and
there are a number of software projects online that can read their
peculiar format and reproduce the shapes. At his GitHub page,
Grießhammer has links to several of them, such as Kamal Mostafa's libhersheyfont. Inkscape
users may also be familiar with the Hershey
Text extension, which can generate SVG paths based on a subset of
the Hershey fonts. In that form, the paths are suitable for use with
various plotters, laser-cutters, or CNC mills; the extension was
developed by Evil Mad Scientist Laboratories for use with such devices.
Nevertheless, there has never been an implementation of the designs
in PostScript, TrueType, or OpenType format, so they cannot be used to
render text in standard widgets or elements. Consequently, Grießhammer set out
to create his own. He wrote a script to convert the original vector
instructions into Bézier paths in UFO format, then had to
associate the resulting shapes with the correct Unicode
codepoints—Hershey's work having predated Unicode by decades.
The result is not quite ready for release, he said. Hershey's designs
are zero-width paths, which makes sense for drawing with a CRT, but
is not how modern outline fonts work. To be usable in TrueType or
OpenType form, each line segment needs to be traced in outline form to
make a thin rectangle. That can be done, he reported, but he is
still working out what outlining options create the most useful final
product. The UFO files, though, can be used to create either TrueType
or OpenType fonts.
When finished, Grießhammer said, he plans to release the project
under an open source license at github.com/adobe-fonts/hershey.
He hopes that it will not only be useful, but will also bring some
more attention to Hershey himself and his contribution to modern
digital publishing.
Page editor: Jonathan Corbet
Inside this week's LWN.net Weekly Edition
- Security: Nested NMIs lead to CVE-2015-3290; New vulnerabilities in firefox, openshift, openssh, owncloud, ...
- Kernel: Bcachefs; Power-aware scheduling; Porting to a new architecture; 4.2 development stats.
- Distributions: Copyright assignment and license enforcement for Debian; Debian and binary firmware blobs; The State of Fedora, ...
- Development: New widgets in GTK+; Glibc wrappers for system calls; Go 1.5; The future of Firefox add-ons; ...
- Announcements: The Open Mainframe Project, GUADEC videos, FSF 30th birthday, ...