Linux Filesystems LOC

The XFS filesystem has taken a beating for being a big, complicated, foreign filesystem since its introduction, and there is no doubt that there is a fair bit of code in there. But an interesting thing happened on the way to Linux kernel v3.0.0: XFS developers have steadily reduced its lines of code, while other up-and-coming filesystems such as Ext4 and BTRFS are steadily growing in LOC and complexity. And XFS has been under constant improvement the whole time.

Some of this is to be expected when comparing a mature product to newer developments, but I still find it interesting.

Notes on the above graph:

  • Comments & whitespace were stripped with CLOC for LOC counts
  • EXT4 LOC includes jbd2 as well.

XFS is actually more heavily commented than EXT4 or BTRFS; XFS is about 39% comments, while EXT4 is about 33% and BTRFS is about 17%.
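
For a rough idea of what "stripping comments and whitespace" means for a count like this, here is a minimal Python sketch. It is not how CLOC works internally; it is a simplified line classifier for C-style `/* */` and `//` comments that ignores edge cases such as comment markers inside string literals. The function name `count_c_lines` and the sample input are made up for illustration, and the comment share at the end is just one possible way to define that ratio.

```python
def count_c_lines(source):
    """Classify each line of C source as code, comment, or blank.

    Simplification: comment markers inside string literals are not
    recognised, so this only approximates what a real counter reports.
    """
    counts = {"code": 0, "comment": 0, "blank": 0}
    in_block = False  # True while inside a /* ... */ comment
    for line in source.splitlines():
        text = line.strip()
        if not text and not in_block:
            counts["blank"] += 1
            continue
        has_code = False
        i = 0
        while i < len(text):
            if in_block:
                end = text.find("*/", i)
                if end == -1:
                    break              # comment continues on the next line
                in_block = False
                i = end + 2
            elif text.startswith("/*", i):
                in_block = True
                i += 2
            elif text.startswith("//", i):
                break                  # rest of the line is a comment
            else:
                if not text[i].isspace():
                    has_code = True
                i += 1
        # A line with any code on it counts as code; everything else that
        # reaches this point is comment text (or padding inside a comment).
        counts["code" if has_code else "comment"] += 1
    return counts


# Made-up sample input, not taken from any of the filesystems discussed.
sample = """\
/* header
 * comment */
#include <stdio.h>

int main(void) {   // entry point
    return 0;      /* done */
}
"""

counts = count_c_lines(sample)
comment_share = counts["comment"] / (counts["comment"] + counts["code"])
print(counts, round(comment_share, 2))
```

Lines that mix code and a trailing comment are counted as code here, which is a design choice; a different convention would shift the comment percentage.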

Another interesting metric is to use Simian to see how much duplicated code there might be:

  • xfs: Found 4806 duplicate lines in 561 blocks in 55 files
  • ext4+jbd2: Found 917 duplicate lines in 116 blocks in 23 files
  • btrfs: Found 2252 duplicate lines in 272 blocks in 31 files

Those high-level numbers aren’t terribly useful, but digging into them sometimes reveals a surprising amount of cut+paste in the course of development.

Other duplicate finders such as Duplo and CPD are useful, too; both have free licenses. They all behave a little bit differently…

(edit: Many of the xfs dups are actually a result of the many explicit #include directives in each C file).
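
To see how a repeated run of `#include` lines can register as duplication, here is a minimal Python sketch of window-based duplicate detection. It is not the algorithm used by Simian, Duplo, or CPD; it simply reports any run of `min_lines` consecutive non-blank lines that appears more than once. The function name `find_duplicate_blocks`, the `min_lines` parameter, and the file and header names in the sample are all hypothetical.

```python
from collections import defaultdict


def find_duplicate_blocks(files, min_lines=3):
    """Return windows of `min_lines` consecutive lines seen more than once.

    `files` maps a file name to its text. Lines are stripped of leading and
    trailing whitespace, and blank lines are dropped, before comparison.
    The result maps each duplicated window (a tuple of lines) to a list of
    (file name, starting line number) locations.
    """
    seen = defaultdict(list)
    for name, text in files.items():
        lines = [(number, raw.strip())
                 for number, raw in enumerate(text.splitlines(), 1)
                 if raw.strip()]
        for i in range(len(lines) - min_lines + 1):
            window = tuple(content for _, content in lines[i:i + min_lines])
            seen[window].append((name, lines[i][0]))
    return {window: locations
            for window, locations in seen.items()
            if len(locations) > 1}


# Hypothetical files that share nothing except their include block.
files = {
    "a.c": '#include "one.h"\n#include "two.h"\n#include "three.h"\n'
           'int a(void) { return 1; }\n',
    "b.c": '#include "one.h"\n#include "two.h"\n#include "three.h"\n'
           'int b(void) { return 2; }\n',
}

for window, locations in find_duplicate_blocks(files).items():
    print(locations, window)
```

In this sample the only duplicate reported is the shared include block, even though the two functions differ.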

20 thoughts on “Linux Filesystems LOC”

  1. While your interpretation of this data is interesting, I find it more interesting that xfs still has more SLOC, and that it won’t fall below that of your two comparison filesystems for quite some time.

    I took your graph, extended out the right side, and drew some lines out to extrapolate the current trend for each. (Nothing fancy like linear regression, just eyeballing the slope.) The xfs and btrfs lines won’t cross for another 9 Linux kernel releases, if my simple model holds true. At 3 months between releases, that means the btrfs code base will remain smaller than xfs for over 2 years.

    ext4’s line doesn’t cross that of xfs for another 41 releases, or about 10 years at the same three-month cadence. By that time, we’ll probably be on ext5. :)
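
    Eyeballing where two trend lines cross amounts to solving for the intersection of two straight lines. A minimal Python sketch follows; the sizes and slopes in the example are made up (the real values would have to be read off the graph), and the function names are hypothetical.

```python
import math


def releases_until_crossing(size_a, slope_a, size_b, slope_b):
    """Whole releases until linear trend A (currently larger) meets trend B.

    Sizes are current line counts; slopes are lines gained per release
    (negative if the code base is shrinking). Returns None if the two
    trends never meet.
    """
    if size_a <= size_b:
        return 0
    closing_rate = slope_b - slope_a  # how much the gap shrinks per release
    if closing_rate <= 0:
        return None
    return math.ceil((size_a - size_b) / closing_rate)


def releases_to_years(releases, months_per_release=3):
    """Convert a release count to years at a fixed release cadence."""
    return releases * months_per_release / 12


# Hypothetical numbers: A shrinks by 1000 lines per release while B grows
# by 1000 lines per release, starting 18000 lines apart.
releases = releases_until_crossing(70000, -1000, 52000, 1000)
print(releases, releases_to_years(releases))
```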

    Rather than compare against SLOC, it would be still more interesting to compare against SLOC per unit of functionality. I don’t much care whether a filesystem has 80 kSLOC or 800 kSLOC. What I care about is that it have the fewest SLOC per unit of function.

    • Agreed, LOC isn’t necessarily the greatest measure. Heck, XFS has 1500+ lines of #includes alone… But defining units of functionality is tough, too. If you have a metric to suggest, I’ll graph it ;)

      I didn’t really mean to imply that XFS is better in this regard; it certainly still is a lot of code (to go along with plenty of units of functionality…) I just find it interesting that it continues to drop, under active development, while the others continue to grow. I wonder if that’s typical of mature code, or is it unique to a set of maintainers, or something else?

      • defining units of functionality is tough, too

        Very much so. The only methods I’m aware of are essentially manual processes, which require judgement on the part of the estimator. (e.g. function points)

        In the spirit of “the simplest thing that could possibly work”, you could start with this breakdown: http://en.wikipedia.org/wiki/Comparison_of_file_systems

        Weight each item according to its desirability (e.g. journaling is far more important than optional case insensitivity), then plot that against how these filesystems have fared on this comparison over time. It may be that you don’t have to extract history manually, but can use the WP history function to step back through prior versions of this article. It won’t be 100% accurate, but it may be enough to discern trend lines.
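
        The weighting idea can be sketched in a few lines of Python. Everything here is an assumption for illustration: the feature names, the weights, the filesystem's feature set, and its line count are invented, and choosing real weights is exactly the judgement call described above.

```python
def feature_score(supported, weights):
    """Sum the weights of the features a filesystem supports."""
    return sum(weight for feature, weight in weights.items()
               if feature in supported)


def sloc_per_feature_point(sloc, supported, weights):
    """Lines of code per weighted unit of functionality (lower is leaner)."""
    score = feature_score(supported, weights)
    if score == 0:
        raise ValueError("no weighted features supported")
    return sloc / score


# Hypothetical weights: journaling matters far more than optional
# case insensitivity.
weights = {"journaling": 10, "snapshots": 5, "case_insensitivity": 1}

# Hypothetical filesystem with 60000 lines of code.
supported = {"journaling", "snapshots"}
print(sloc_per_feature_point(60000, supported, weights))
```

        Plotting this value per kernel release, instead of raw SLOC, would give the "SLOC per unit of functionality" trend line.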

      • AIUI, SGI devs from ~1997 onwards were targeting workloads on ever-growing Origin 2000/3000 MIPS ccNUMA systems as they pushed into the supercompute space. This was prompted by cheap x86 machines w/nVidia graphics eating up SGI’s workstation profit center. CPU counts in the Origin machines ranged from 8 to 512, with 16GB to 1TB of RAM and very large FC storage farms. When one has those kinds of resources in the box it’s easy to think and write “big” because you’re not constrained by hardware, and thus not concerned with writing “small”. “Native” Linux filesystem devs have been historically constrained by hardware and thus have attempted to write small and optimize for the small system.

        When SGI dumped MIPS/IRIX and moved to Itanium/Linux, the “small” mindset of Linux development filtered into XFS development. This has resulted in XFS devs attempting to streamline the existing code base, and write smaller new code.

        This is my opinion, analysis from the outside. Old hands at SGI, if there are any, who worked on IRIX XFS and now Linux XFS may have a better/different perspective. Maybe Dave Chinner has some insight here, but I don’t see him participating on your blog.

          • I was unaware that you, too, were at SGI, Eric. Cool. So you weren’t there during the MIPS/IRIX days but at the start of the Linux push. Were you hired as part of the Linux porting effort? Were you involved in the LSE project? I deleted my LSE list archive long ago or I’d not have to ask…

            There is massive bloat in some open source projects, Firefox being a good example, so open source development in and of itself has no direct bearing here, I’d think. The point I was making is that Linus originally targeted Linux to a constrained x86 PC platform, and for a very long time, if not still today, that was the target platform of most/all devs. Because it was limited, devs wrote tight, tidy code to get the smallest binary and lowest memory consumption: the most bang from the box. I believe this is where/why the “tight/tidy” culture of Linux programming evolved, just as the big iron at SGI was likely a factor in code bloat within IRIX and XFS. This was actually discussed a bit on the LSE mailing list long ago. IIRC (it’s been a few years), big iron UNIX devs at SGI, IBM, HP, Bull, Fujitsu, etc. often submitted rather large/unwieldy code, or design ideas that would lead to such, only to be shot down by seasoned Linux kernel devs with the “tight/tidy” Linux mindset. That thankfully led to tight, performant, low-memory NUMA, hot-plug CPU, cpumemsets, etc. code.

  2. Great article :) It’s nice to see someone watching how the major filesystems for Linux evolve… It would be nice if you did this from time to time (or from release to release) so we can see the development over longer periods of time…
    Oh, and adding JFS would be nice :)
    As for ReiserFS, Reiser4, and ZFS? I don’t even care about them :)
    ZFS is owned by Oracle, so it can go away at any time…
    Reiser… it’s not really POSIX, so who cares :D

  3. I found the duplicate numbers interesting but on following the link it turns out Simian is not FLOSS. I take it you’ve not found any FLOSS tools that can do the same analysis?

    • Simian is not, sadly, but it has a fairly permissive license for free use on open source projects.

      However, see also the other 2 projects linked above: Duplo is GPLv2 and CPD/PMD is licensed under a “BSD-style” license.

