SLOC | Directory | SLOC-by-Language (Sorted) [License] |
2437470 | kernel-2.4.2 | ansic=2285657,asm=144411,sh=3035,perl=2022,yacc=1147,tcl=576,lex=302,awk=248,sed=72 [GPL] |
2065224 | mozilla | cpp=1279902,ansic=739470,perl=21220,sh=13717,asm=5212,java=3107,yacc=1831,lex=470,csh=271,sed=24 [MPL] |
1837608 | XFree86-4.0.3 | ansic=1750460,asm=35397,cpp=20725,sh=14666,tcl=9182,yacc=3360,perl=1675,lex=1608,awk=393,csh=85,sed=57 [MIT] |
984076 | gcc-2.96-20000731 | ansic=789901,cpp=126738,yacc=19272,sh=17993,asm=14559,lisp=7161,fortran=3814,exp=3705,objc=479,sed=310,perl=144 [GPL] |
967263 | gdb+dejagnu-20010316 | ansic=871288,exp=58422,sh=12054,cpp=8252,yacc=5906,asm=5031,tcl=4477,lisp=1403,sed=248,awk=170,java=7,fortran=5 [GPL] |
690983 | binutils-2.10.91.0.2 | ansic=489993,asm=161236,exp=13234,sh=12835,yacc=5665,cpp=4777,lex=1488,perl=776,sed=561,lisp=394,awk=24 [GPL] |
646692 | glibc-2.2.2 | ansic=548722,asm=88413,sh=6036,perl=2120,awk=1037,yacc=315,sed=49 [LGPL] |
627626 | emacs-20.7 | lisp=453898,ansic=169956,sh=2622,perl=884,asm=253,csh=9,sed=4 [GPL] |
474829 | LAPACK | fortran=473590,ansic=1239 [Freely distributable] |
455980 | gimp-1.2.1 | ansic=427967,perl=17482,lisp=9648,yacc=502,sh=381 [GPL, LGPL] |
402799 | mysql-3.23.36 | ansic=249350,cpp=84068,perl=25088,tcl=18980,sh=18323,asm=3987,awk=1436,java=1149,sed=418 [LGPL] |
395194 | tcltk-8.3.1 | ansic=291457,tcl=84322,sh=12259,exp=5742,yacc=876,awk=273,perl=265 [BSD] |
345949 | kdebase-2.1.1 | cpp=181210,ansic=158682,sh=4880,perl=1155,python=22 [GPL] |
323730 | Mesa-3.4 | ansic=286437,cpp=18189,asm=10002,sh=7611,objc=1184,python=307 [GPL/MIT] |
321123 | perl-5.6.0 | perl=146755,ansic=118233,sh=49377,lisp=5739,yacc=996,java=23 [Artistic or GPL] |
318430 | libgcj | ansic=191432,cpp=56843,java=41716,sh=15581,asm=11262,exp=841,perl=731,awk=24 [GPL] |
304819 | teTeX-1.0 | ansic=223491,perl=49789,sh=17634,cpp=9407,pascal=1546,yacc=1507,awk=622,lex=323,sed=314,asm=139,csh=47 [Distributable] |
298742 | qt-2.3.0 | cpp=259310,ansic=34578,yacc=2444,sh=1493,lex=480,perl=422,lisp=15 [GPL] |
286113 | postgresql-7.0.3 | ansic=237184,java=17540,yacc=9740,sh=8975,tcl=7751,lex=1810,perl=1276,python=959,cpp=801,asm=70,csh=5,sed=2 [BSD] |
283785 | kdelibs-2.1.1 | cpp=261334,ansic=17578,sh=1887,java=1538,perl=731,yacc=607,lex=110 [LGPL] |
277502 | xemacs-21.1.14 | ansic=199927,lisp=73366,sh=2948,perl=930,asm=247,csh=62,sed=22 [GPL] |
264528 | gs5.50 | ansic=259471,cpp=2266,asm=968,sh=823,lisp=405,perl=336,yacc=201,lex=58 [GPL] |
227354 | krb5-1.2.2 | ansic=197886,exp=19124,sh=5140,yacc=2474,perl=1529,awk=393,python=348,lex=190,csh=147,sed=123 [MIT] |
215473 | vnc_unixsrc | ansic=212766,cpp=848,asm=780,perl=648,sh=431 [GPL] |
213818 | koffice-2.0.1 | cpp=197637,sh=7296,yacc=3791,ansic=3213,perl=1801,lex=80 [GPL] |
202842 | openssl-0.9.6 | ansic=131874,cpp=25744,perl=14737,asm=12428,python=10171,yacc=3297,sh=2641,tcl=1583,lisp=224,objc=143 [BSD-like] |
200908 | Python-1.5.2 | python=101017,ansic=96521,lisp=2353,sh=673,perl=342,sed=2 [Distributable] |
194799 | bind-9.1.0 | ansic=173830,sh=12101,yacc=6025,perl=2830,tcl=13 [BSD-like] |
192394 | xpdf-0.92 | cpp=167135,ansic=21621,sh=3638 [GPL] |
191379 | php-4.0.4pl1 | ansic=173334,cpp=7033,sh=6591,lex=1867,yacc=1569,java=437,awk=367,perl=181 [PHP] |
190950 | pine4.33 | ansic=190020,sh=838,csh=62,perl=30 [Freely distributable] |
173492 | abi | cpp=159595,ansic=12605,perl=725,sh=550,python=17 [GPL] |
167663 | kdemultimedia-2.1.1 | cpp=140731,ansic=23844,tcl=1004,sh=800,asm=598,lex=578,perl=106,awk=2 [GPL] |
163449 | 4Suite-0.10.1 | python=91445,ansic=72004 [Apache-like] |
159301 | linuxconf-1.24r2 | cpp=142970,perl=6738,sh=3821,java=3074,ansic=2613,python=85 [GPL] |

Note that the operating system kernel (Linux) is the largest single component, at over 2.4 million lines of code (mostly in C); that compares to 1.5 million lines of code in Red Hat 6.2. See section 3.2 for a more detailed discussion about the Linux kernel.

The next largest component is Mozilla; this is large because it's really a suite of applications including a web browser, email reader, news reader, HTML editor, and so on.
Mozilla is the basis for Netscape Navigator 6.0. Mozilla was not included at all in Red Hat Linux 6.2.

The next largest component is the X Window system, a critical part of the graphical user interface (GUI). Given the importance of GUIs, the long history of this program (giving it time to gain functionality and size), and the many incompatible video displays it must support, this is perhaps not surprising.

Next is the gcc compilation system, including the C and C++ compilers, the symbolic debugger, a set of utilities for binary files, and the C library (which is actually used by most other language libraries as well). Emacs is next largest, which should not be a real surprise; some users use nothing but emacs (e.g., reading their email via emacs), using emacs as a kind of virtual operating system.

Note that language implementations tend to be written in themselves, particularly for their libraries. Perl's implementation is written mostly in Perl, and Python is written mostly in Python. Intriguingly, this is not true for Tcl.

In many senses, what is the ``largest'' component is an artifact of packaging. GNOME and KDE are actually huge, but both are packaged as a set of components instead of being delivered as a single large component. The amount of C code in ``kdebase'' seemed suspiciously high to one KDE developer, but it turns out that Red Hat includes ``lesstiflite'' (a Motif clone) in kdebase to support Netscape plug-ins. My thanks to Waldo Bastian (of KDE) for pointing out this unusual situation in kdebase and determining its cause, and to Bernhard Rosenkraenzer (of Red Hat) for confirming the reason for this kdebase addition.

Éric Bischoff noted that KDE includes all the files beginning with kde, plus koffice (and in the latest versions of KDE, arts); using this definition and totalling the numbers in the summary table, KDE (with common applications) has 1,693,235 SLOC.
In a similar manner, by totalling all the packages with ``gnome'' in them, as well as gnumeric, abi (abiword), and the Gimp, GNOME (with common applications) has 1,195,992 SLOC. One trouble here is determining exactly what to count; for example, do you include common applications, and if so, which ones? What about the graphical toolkit? For these reasons, I've simply let the component list stand. For a complete list of all components and their SLOC counts, see http://www.dwheeler.com/sloc/redhat71-v1/summary.
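The per-component breakdown strings in the table above are regular enough to check mechanically. The following is a minimal sketch, not part of the original analysis: the function name `parse_breakdown` is hypothetical, and the two input strings are copied from the kernel and emacs rows. It parses a ``lang=count'' list, confirms that the per-language counts add up to the component's SLOC figure, and reports the dominant language.

```python
def parse_breakdown(field):
    """Parse 'ansic=10,asm=2' into {'ansic': 10, 'asm': 2}.

    Spaces after commas (left over from line wrapping) are tolerated.
    """
    counts = {}
    for item in field.split(","):
        lang, _, value = item.strip().partition("=")
        counts[lang] = int(value)
    return counts


# Breakdown strings copied from the kernel-2.4.2 and emacs-20.7 rows.
kernel = parse_breakdown(
    "ansic=2285657,asm=144411,sh=3035,perl=2022,yacc=1147, "
    "tcl=576,lex=302,awk=248,sed=72"
)
emacs = parse_breakdown(
    "lisp=453898,ansic=169956,sh=2622,perl=884,asm=253,csh=9,sed=4"
)

kernel_total = sum(kernel.values())
emacs_total = sum(emacs.values())
kernel_dominant = max(kernel, key=kernel.get)
emacs_dominant = max(emacs, key=emacs.get)

print("kernel:", kernel_total, "SLOC, mostly", kernel_dominant)
print("emacs: ", emacs_total, "SLOC, mostly", emacs_dominant)
```

The same parser could be applied to every row to total a family of packages (for example, all directories beginning with ``kde''), although the table above lists only the largest components, so such a total would not reproduce the full KDE or GNOME figures.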
Language | SLOC (%) |
C | 21461450 (71.18%) |
C++ | 4575907 (15.18%) |
Shell (Bourne-like) | 793238 (2.63%) |
Lisp | 722430 (2.40%) |
Assembly | 565536 (1.88%) |
Perl | 562900 (1.87%) |
Fortran | 493297 (1.64%) |
Python | 285050 (0.95%) |
Tcl | 213014 (0.71%) |
Java | 147285 (0.49%) |
yacc/bison | 122325 (0.41%) |
Expect | 103701 (0.34%) |
lex/flex | 41967 (0.14%) |
awk/gawk | 17431 (0.06%) |
Objective-C | 14645 (0.05%) |
Ada | 13200 (0.04%) |
C shell | 10753 (0.04%) |
Pascal | 4045 (0.01%) |
sed | 3940 (0.01%) |
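As a consistency check on the language table, the percentages can be recomputed from the raw counts. This is only an illustrative sketch (the variable names are hypothetical); the counts are copied from the table above.

```python
# SLOC per language, copied from the table above.
language_sloc = {
    "C": 21461450, "C++": 4575907, "Shell (Bourne-like)": 793238,
    "Lisp": 722430, "Assembly": 565536, "Perl": 562900,
    "Fortran": 493297, "Python": 285050, "Tcl": 213014,
    "Java": 147285, "yacc/bison": 122325, "Expect": 103701,
    "lex/flex": 41967, "awk/gawk": 17431, "Objective-C": 14645,
    "Ada": 13200, "C shell": 10753, "Pascal": 4045, "sed": 3940,
}

total_sloc = sum(language_sloc.values())

# Share of each language as a percentage of the total, to two decimals.
shares = {lang: round(100.0 * n / total_sloc, 2)
          for lang, n in language_sloc.items()}

print("total:", total_sloc)
for lang, pct in shares.items():
    print(f"{lang:20s} {language_sloc[lang]:9d} ({pct:.2f}%)")
```

The sum of the per-language counts equals the total physical SLOC reported for the whole distribution (30152114).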
15185987 (50.36%) GPL
2498084 (8.28%) MIT
2305001 (7.64%) LGPL
2065224 (6.85%) MPL
1826601 (6.06%) Distributable
1315348 (4.36%) BSD
907867 (3.01%) BSD-like
766859 (2.54%) Freely distributable
692561 (2.30%) Free
455980 (1.51%) GPL, LGPL
323730 (1.07%) GPL/MIT
321123 (1.07%) Artistic or GPL
191379 (0.63%) PHP
173161 (0.57%) Apache-like
161451 (0.54%) OpenLDAP
146647 (0.49%) LGPL/GPL
103439 (0.34%) GPL (programs), relaxed LGPL (libraries), and public domain (docs)
103291 (0.34%) Apache
73650 (0.24%) W3C
73356 (0.24%) IBM Public License
66554 (0.22%) University of Washington's Free-Fork License
59354 (0.20%) Public domain
39828 (0.13%) GPL and Artistic
31019 (0.10%) GPL or BSD
25944 (0.09%) GPL/BSD
20740 (0.07%) Not listed
20722 (0.07%) MIT-like
18353 (0.06%) GPL/LGPL
12987 (0.04%) Distributable - most of it GPL
8031 (0.03%) Python
6234 (0.02%) GPL/distributable
4894 (0.02%) Freely redistributable
1977 (0.01%) Artistic
1941 (0.01%) GPL (not Firmware)
606 (0.00%) Proprietary

These can be grouped by totalling up SLOC for licenses containing certain key phrases:
16673212 (55.30%) GPL
3029420 (10.05%) LGPL
2842536 (9.43%) MIT
2612681 (8.67%) distributable
2280178 (7.56%) BSD
2065224 (6.85%) MPL
162793 (0.54%) public domain

From these numbers, you can determine that:

(1) The GPL is far and away the most common license (by lines of code) of any single license. In fact, the category ``GPL'' (packages with only this one license) all by itself accounts for 50.36% of the SLOC. By totalling the SLOC for all packages that include "GPL" in the license text, the total rises to 55%. No matter how you look at it, the GPL is the dominant single license in this distribution.

(2) The next most common licenses were the LGPL, MIT, BSD, and MPL licenses (in order). This is in line with expectations: the most well-known and well-used open source licenses are the GPL, LGPL, MIT, and BSD licenses. Although the MPL does well in terms of SLOC, there is only one program in this distribution that uses it - Mozilla. There is some use of the ``Artistic'' license, but its use is far less; note that papers such as Perens [1999] specifically recommend against using the Artistic license due to its legal ambiguities.

(3) Very little software is released as public domain software (``no copyright''). In this distribution, only 0.2% of the software is in packages labelled as ``public domain'' (note that the 0.54% figure above includes the ``sane'' package which has documentation in the public domain). There may be several factors that account for this. First, if a developer wishes to get credit for their work, this is a poor ``license;'' by law anyone can claim ownership of ``public domain'' software. Second, there may be a fear of litigation; both the MIT and BSD licenses permit essentially arbitrary use but forbid lawsuits. While licenses such as MIT's and BSD's are not proof against a lawsuit, they at least provide some legal protection, while releasing software to the public domain provides absolutely no protection. Finally, any software released into the public domain can be modified and re-licensed under any other license, so there's nothing that keeps updated public domain software in the public domain.

(4) There is a tiny amount of proprietary code, which is entirely in one component - Netscape Communicator / Navigator. This component uses the Motif toolkit (which is not open source) and has proprietary code mixed into it. As a result, almost none of the code for this package is included on the CD-ROM - only a small amount of ``placeholder'' code is there. In the future it is expected that this component will be replaced by Mozilla.

(5) The packages which are clearly MIT-like/BSD-like licenses (totalling the MIT, BSD, MIT-like, BSD-like, and none/public domain entries) total 4,742,021 SLOC (15.92%). It's worth noting that 1,837,608 of these lines (39%) are accounted for by the XFree86 X server, an infrastructure component used for GNU/Linux's graphical user interface (GUI).

(6) If the license types "distributable", "freely distributable", "MPL", "Free", "Artistic", "Apache", "Apache-like", and "IBM Public license" software were also considered MIT-like/BSD-like, the total SLOC would be 7,954,474 (26%, down from 36%). Unfortunately, the information to determine which of these other packages are simply BSD-like/MIT-like licenses is not included in the specification files.

(7) The packages which include copylefting licenses (GPL or LGPL) total 63%. Limiting to only those that are GPL, LGPL, or both yields 60%, the same percentage as in Red Hat Linux 6.2 and a clear majority. It is quite clear that in this distribution the GPL is the dominant license and that copylefting licenses (the GPL and LGPL) significantly outnumber the BSD/MIT-style licenses. This is a simple quantitative explanation why several visible projects (Mozilla, Troll Tech's Qt, and Python) have changed their licenses so that they're compatible with the GPL.
When there is so much GPL software, GPL compatibility is critically important to the survival of many open source projects. See the Free Software Foundation's information on Various Licenses and Comments about Them [FSF 2001a] for information on GPL compatibility, and the GPL FAQ [FSF 2001b] for more information on the GPL in general.

The most common open source licenses in this distribution (by SLOC) are the GPL, MIT, LGPL, and BSD licenses (as well as the MPL, but note that it's only used by one project). Note that this is consistent with Perens [1999], who pleads that developers use an existing license instead of developing a new license where possible.

As of this writing, the GPL has received the most attention of these licenses, because Microsoft has specifically been attacking the GPL license. The GPL license permits commercial use of the program, but requires that distributors of modified versions of the program must also release the source code to their changes under the same terms. Therefore, software released under the GPL resists Microsoft's usual ``embrace and extend'' approach to destroy competitors - Microsoft can use and change GPL'ed code, but it cannot make its changes to that code proprietary. As a counter-example, Kerberos (a security component released using an MIT license instead of the GPL) was recently incorporated by Microsoft into their products, and then extended in an incompatible way to prevent users from fully interoperating between products [Schneier 2000]. Had Kerberos been released under a GPL or LGPL license, this would have been much more difficult. The presence of so many GPL and LGPL components should make GNU/Linux distributions more resistant to being ``embraced, extended, and extinguished.''
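The key-phrase grouping of licenses described earlier in this section can be sketched as follows. The exact matching rule used for the paper is not stated; the sketch assumes a case-insensitive whole-word match (so that ``LGPL'' does not count as ``GPL'', and ``redistributable'' does not count as ``distributable''), which is an assumption on my part. The names `LICENSE_SLOC`, `parse_rows`, and `group_by_phrase` are hypothetical; the data is the per-license list given above.

```python
import re

# Per-license SLOC, copied from the license list in this section.
LICENSE_SLOC = """\
15185987 GPL
2498084 MIT
2305001 LGPL
2065224 MPL
1826601 Distributable
1315348 BSD
907867 BSD-like
766859 Freely distributable
692561 Free
455980 GPL, LGPL
323730 GPL/MIT
321123 Artistic or GPL
191379 PHP
173161 Apache-like
161451 OpenLDAP
146647 LGPL/GPL
103439 GPL (programs), relaxed LGPL (libraries), and public domain (docs)
103291 Apache
73650 W3C
73356 IBM Public License
66554 University of Washington's Free-Fork License
59354 Public domain
39828 GPL and Artistic
31019 GPL or BSD
25944 GPL/BSD
20740 Not listed
20722 MIT-like
18353 GPL/LGPL
12987 Distributable - most of it GPL
8031 Python
6234 GPL/distributable
4894 Freely redistributable
1977 Artistic
1941 GPL (not Firmware)
606 Proprietary
"""

KEY_PHRASES = ["GPL", "LGPL", "MIT", "distributable", "BSD", "MPL",
               "public domain"]


def parse_rows(text):
    """Turn 'SLOC license name' lines into (int, str) pairs."""
    rows = []
    for line in text.splitlines():
        sloc, name = line.split(" ", 1)
        rows.append((int(sloc), name))
    return rows


def group_by_phrase(rows, phrases):
    """Total SLOC over licenses containing each phrase as a whole word.

    Assumption: case-insensitive, word-boundary matching. A license that
    names several phrases (e.g. 'GPL/MIT') is counted in each group.
    """
    totals = {}
    for phrase in phrases:
        pattern = re.compile(r"\b" + re.escape(phrase) + r"\b",
                             re.IGNORECASE)
        totals[phrase] = sum(s for s, name in rows if pattern.search(name))
    return totals


totals = group_by_phrase(parse_rows(LICENSE_SLOC), KEY_PHRASES)
for phrase, sloc in totals.items():
    print(f"{sloc:9d} {phrase}")
```

Under this matching rule the sketch yields the same group totals as the grouped list above. Because multi-license packages are counted once per matching phrase, the groups overlap and are not meant to sum to the distribution total.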
Product | SLOC |
NASA Space Shuttle flight control | 420K (shuttle) + 1.4 million (ground) |
Sun Solaris (1998-2000) | 7-8 million |
Microsoft Windows 3.1 (1992) | 3 million |
Microsoft Windows 95 | 15 million |
Microsoft Windows 98 | 18 million |
Microsoft Windows NT (1992) | 4 million |
Microsoft Windows NT 5.0 (as of 1998) | 20 million |
Red Hat Linux 6.2 (2000) | 17 million |
Total Physical Source Lines of Code (SLOC) = 30152114
Estimated Development Effort in Person-Years (Person-Months) = 7955.75 (95469)
 (Basic COCOMO model, Person-Months = 2.4 * (KSLOC**1.05))
Estimated Schedule in Years (Months) = 6.53 (78.31)
 (Basic COCOMO model, Months = 2.5 * (person-months**0.38))
Total Estimated Cost to Develop = $ 1074713481
 (average salary = $56286/year, overhead = 2.4).

See appendix A for more data on how these effort values were calculated; you can retrieve more information from http://www.dwheeler.com/sloc.
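The Basic COCOMO formulas quoted above are easy to evaluate directly. The sketch below is illustrative only and is not the code that produced the figures (appendix A describes how the effort values were actually calculated). The function names are hypothetical. It applies the effort and schedule formulas to a single component, using the kernel's 2,437,470 SLOC from the component table as the example input, and applies the stated salary and overhead to the reported effort of 7955.75 person-years.

```python
def effort_person_months(ksloc):
    """Basic COCOMO effort: Person-Months = 2.4 * (KSLOC ** 1.05)."""
    return 2.4 * ksloc ** 1.05


def schedule_months(person_months):
    """Basic COCOMO schedule: Months = 2.5 * (person-months ** 0.38)."""
    return 2.5 * person_months ** 0.38


def cost_dollars(person_years, salary=56286, overhead=2.4):
    """Cost = effort in person-years * average salary * overhead factor."""
    return person_years * salary * overhead


# Example input: the largest single component (kernel-2.4.2), in KSLOC.
kernel_effort = effort_person_months(2437.470)
kernel_schedule = schedule_months(kernel_effort)

# Cost from the reported total effort (taken as given, not re-derived).
total_cost = cost_dollars(7955.75)

print(f"kernel effort:   {kernel_effort:.0f} person-months")
print(f"kernel schedule: {kernel_schedule:.2f} months")
print(f"total cost:      ${total_cost:,.0f}")
```

The schedule computed for the largest component comes out at roughly 78.3 months, in line with the reported schedule estimate, and the cost agrees with the reported figure to within the rounding of the person-year value.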
compat-egcs.spec
compat-glibc.spec
compat-libs.spec
kde1-compat.spec
gtk+10.spec
libxml10.spec
x86-compat-libs.spec
qt1x.spec

I also removed any ``beta'' software which had a non-beta version available (beta software was identified by searching for ``beta'' in the package or specification file name). This removed:

glib-gtkbeta.spec
gtk+-gtkbeta.spec
pango-gtkbeta.spec

I also removed "mysqlclient9.spec". This specification contained the older MySQL client library version 3.23.22, as shipped with Red Hat Linux 7, for use with applications linked against it. I did include "mysql.spec", which had the code for the newer version 3.23.36 of MySQL (a relational database package).

Note that unlike Red Hat Linux 6.2, Red Hat Linux 7.1 didn't have two versions of bash or ncurses, so I didn't have to remove old versions of them. I left db1, db2, and db3 in, because it can be argued that none of these three necessarily replaces the other two.

One complication was in handling the graphical subsystem "XFree86". Version 4 of XFree86 was used for all client-side applications, but Red Hat uses both version 3 and version 4 to implement various X servers. I looked at the XFree86 source package for version 4, and it turned out that server code was included in the package. Rather than have XFree86 counted essentially twice (once as version 3, and another as version 4), I only counted the code in version 4 of XFree86. This could be argued both ways; I understand that version 4 is a massive rewrite of much of the version 3 server, so counting it twice is actually not irrational. And unintentionally, I ended up counting a small amount of version 3 code through reuse by another program. It turns out that vnc_unixsrc includes (through reuse) portions of the X Window system version 3 code; in their words, ``a cut-down version of the standard XFree86 distribution (``server only'' distribution) without many of the later X extensions or hardware-specific code.'' VNC won't work without that code, and clearly there was effort to build version 3 and to rebuild version 4, so I let these counts stand.

I then unpacked the source code by running code that in essence did this:
    cd /usr/src/redhat/SPECS
    rpm -bp *.spec

This uncompresses the source code and applies all patches used by the actual system. Since I wanted to count the amount of code actually included in the system, it was important to include the patches. The actual code to unpack the source code was more complex, because it also marked every unpacked directory (in the BUILD directory) to identify the spec file it came from and the license of the program. The license was determined by (1) looking at the "Copyright" and "License" fields of the spec file, and if that didn't work, (2) looking at various files in the build directory, such as "LICENSE", "COPYING*", and "Artistic". Unfortunately, MIT-like and BSD-like licenses can be harder to detect (because their text can be varied), but many licenses (such as the GPL and LGPL) can be detected with great confidence. I used the "spec" file as the primary source, because this was placed by a human (who could better understand legal technicalities than a machine).

I actually had to repeat the unpacking more than once; the RPM system would notice a missing dependency for building the software and protest. This required installation of the missing component (in some cases I didn't have to install the program and could have forced installation, but I did not want to risk corrupting the results by failing to install a package). A surprising development was that the packages "imap" and "samba" reported errors in unpacking. For imap, patch #5 (imap-4.7c2-flock.patch) and for samba, patch #21 (samba-ia64.patch of source/passdb/pass_check.c) would cause unpacking to halt. I unpacked the software and simply counted what was there; this appears to be what the original developers did.

I examined the reported license values, in particular for all code more than 100,000 source lines of code (as the largest components, wrong values for these components would be more likely to cause significant error).
I found that Perl had been assigned "GPL" in its spec file, but this isn't the whole story; as documented in its README file, Perl can be used under either the GPL or Artistic license, so its license entry was changed to "GPL or Artistic". Mozilla's licensing situation is more complex; some portions of it are actually under a separate dual licensing scheme (licensed under both the GPL and Netscape Public License, i.e., NPL). However, labelling it as "MPL, NPL, and GPL" would probably overstate the amount of code licensed under the GPL, so I left its entry as the MPL license.

Note that the unpacked source files (including source code, fonts, documentation, and so on) totalled more than 4.4 Gigabytes. I ran the analysis code as a normal user, so I first had to set the permissions for users to read the code. I then reverted to a normal user account, and used sloccount version 1.9 to measure the source code, using the following bash command:
    sloccount --multiproject /usr/src/redhat/BUILD > sloc-actions 2>&1 &

Note that I did _not_ use the "--follow" option of sloccount. Some programs, notably pine, include a symbolic link to other directories such as /usr/lib. Thus, using --follow would have included files outside of the intended directory in the analysis.

I looked over various error reports and determined that none would fundamentally invalidate the results. For example, there were several errors in the XFree86 source code involving improperly formatted strings. It appears that these are syntax errors in the code that are preprocessed away (and thus not noticed by the compiler). I intend to report these problems to the XFree86 project. One program was a bash shell script that began with "#! /usr/bin/env bash", which sloccount's heuristics could not handle at the time. I then modified sloccount to correctly determine its type (it's a bash shell script).

Note that sloccount creates a large number of small files. This isn't fundamentally a problem, but because of the large scale of the system I found that I ran out of inodes if I tried to store multiple copies of results. Those who try to duplicate this activity may want to specially format their filesystems to include more inodes.

For a complete list of all components and their SLOC counts, see http://www.dwheeler.com/sloc/redhat71-v1/summary.
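The "#! /usr/bin/env bash" case mentioned above illustrates why a script's first line needs some care when guessing its language. The following is a minimal sketch of one way to handle it; it is not sloccount's actual heuristic, and the function name `script_interpreter` is hypothetical. It tolerates a space after "#!" and, when the named program is env, looks past it to the real interpreter.

```python
import os


def script_interpreter(first_line):
    """Return the interpreter named on a '#!' line, or None if absent.

    Handles an optional space after '#!' and the 'env' indirection,
    e.g. '#! /usr/bin/env bash' yields 'bash'.
    """
    if not first_line.startswith("#!"):
        return None
    words = first_line[2:].split()
    if not words:
        return None
    name = os.path.basename(words[0])
    if name != "env":
        return name
    # For env, skip its options and VAR=value assignments and take the
    # first remaining word as the interpreter.
    for word in words[1:]:
        if word.startswith("-") or "=" in word:
            continue
        return os.path.basename(word)
    return None


for line in ("#! /usr/bin/env bash", "#!/bin/sh", "#!/usr/bin/perl -w",
             "int main(void) {"):
    print(repr(line), "->", script_interpreter(line))
```

A real classifier would also need to map interpreter names to language categories (for example, treating both sh and bash as Bourne-like shells) and fall back on file extensions when no "#!" line is present; those steps are omitted here.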