13.11.12

A lightning tour of the thingy that runs your code

What happens when you press "go"?


We think in abstractions... we code in mega-abstractions... but...

I recently read about Amdahl's law in this awesome article about caching for high-performance web sites: http://architects.dzone.com/articles/role-caching-large-scale.  The emphasis was on detecting the true bottleneck in a system.  The Linux operating system's transparent and universally accepted organization lends itself quite well to recursive bottleneck analysis.  Nevertheless, it's never easy, when given a slow or stalled program, to tell exactly what's wrong.
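
For reference, Amdahl's law is the formula behind that emphasis on finding the real bottleneck: if a fraction p of your workload is sped up by a factor s, the overall speedup is only 1 / ((1 - p) + p/s).  Speeding up anything that isn't the true bottleneck (small p) barely moves the needle, no matter how large s gets.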

Why is this hard?

There is no right answer to this question - but part of the problem is that as developers we deal in abstractions which do not map directly to the work done by a system.  This is (kind of) a good thing - it frees us to focus on business logic, modularity, and horizontal scalability.  Language makers have done an amazing job of separating programmers from machines, and we should applaud them wholeheartedly.  But when software runs into scalability or performance problems, abstraction can really come back to bite you.  Disks might be slow, resources could be scarce, networking performance may be dismal, or simple libraries may be corrupted or missing.  In any case, you've got to be able to find out why - and your core programming language chops won't always help.

But what will help... is an understanding of your operating system. So the following is a quick tour of the Linux innards I always find useful, followed by several links which shed some much-needed light on the internals of how programming languages ultimately connect to the kernel.

1) First, the directory structure in Linux.  Very basic (a couple of commands for exploring it follow the list):

/bin: These are binaries that anybody can run. 
/sbin: These are binaries that only root runs (ifup, ifconfig, ...) 
/home: This is where all your data goes. 
/tmp: This is where all your temporary files go. It's cleared automatically by the OS, periodically. 
/usr: This is where user-created/built programs typically go. For example, if you compile your own version of a particular C library because it didn't work out of the box on your machine, you would put it in /usr/local. 
/opt: Applications and programs go here, in particular the ones that you get from external sources, in one piece. For example, an executable version of Google Chrome would go in /opt. 
/var: System-related files created by programs go here: log files (/var/log), mail-related stuff, and cached data (/var/cache) from programs that run.
/etc: Configuration files for stuff like networking, disks, etc... Linux runs several services: when you change something here, you need to restart the corresponding service... Otherwise you will see no results. 
/dev: The files in this directory represent devices (audio, keyboards, disk drives, etc.)
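
If you want to poke around this layout yourself, two quick commands will do it; the hier man page (present on most distributions) is the canonical description of the hierarchy above:

ls /
man hier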


2) Some quick notes about networking (a short sketch of poking at these files follows this list):
  • the dhclient command: This command will renew your IP address (if it's dynamic).  Importantly, this command also tells your DHCP server what your computer's name is.  
  • /etc/init.d/networking restart: This command restarts the networking service and reloads its configuration files.
  • (file) /etc/network/interfaces: This has the incantations that define your network cards and their corresponding attributes - for example, what IP address to assign to each one, and what IP address should be used to reach the outside world (the gateway), etc...  There is, however, *no DNS information* in this file.  DNS info is in another file called ... 
  • (file) /etc/resolv.conf: This is where your DNS configuration resides (it lists the nameservers used to translate domain names into IP addresses).
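
As a quick sketch of how these pieces fit together (the interface name eth0 here is only an assumption - yours may differ):

cat /etc/network/interfaces           # which cards exist, their IPs and the gateway
cat /etc/resolv.conf                  # which DNS servers get queried
sudo dhclient eth0                    # renew a dynamic IP (and announce your hostname to the DHCP server)
sudo /etc/init.d/networking restart   # reload the configuration above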
3) Disks ~ like any OS, Linux mounts and partitions disks for you.  It is these disks, partitions, and ultimately the filesystems on those partitions, which contain the data you see when you type "ls".  Some of the basic commands available on Linux to play with this are (a quick sketch follows the list):
  • The fdisk command: This will show you how many disks you have on your machine, including partitions (for example, one partition for the root filesystem, another for swap space, etc...).
  • The df and du commands for looking at disk/directory usage: df tells you the free and used space on each mounted filesystem, while du tells you how much space a given directory tree consumes.  
  • The rsync command for synchronization: This command synchronizes directories.  Notably, you can use the -e ssh option to sync files over SSH between machines.
  • (file) /etc/fstab: This file defines the filesystem mounting that occurs when your OS loads.
  • A revealing and accessible comparison of the Linux file system with others: http://www.howtogeek.com/115229/htg-explains-why-linux-doesnt-need-defragmenting/ 
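
To make the list above concrete, here is a rough sketch of those commands in action (the paths and the host name "otherhost" are placeholders, not anything real):

sudo fdisk -l                                             # list disks and their partitions
df -h                                                     # free/used space per mounted filesystem
du -sh /var/log                                           # space consumed by one directory tree
rsync -av -e ssh /home/me/data/ otherhost:/backup/data/   # sync a directory to another machine over ssh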
Do disks in the cloud work the same? 
Yes:  They are still mounted as devices, and the device abstraction is sufficiently robust to support extremely diverse disk implementations.  It's instructive to consider how, in the cloud, the "/dev" directory maps to disks.  As an experiment, we can launch an AMI instance with no external storage, and run the following command:

ls /dev/ 

Subsequently, we can attach an EBS (Elastic Block Store - a fancy, new-fangled cloud storage offering) volume to that same instance using the AWS console.  Shortly thereafter, when running the same command (ls /dev), we will see a new entry:  This entry represents the raw disk device.  By putting a filesystem on it and mounting it, we can then write to that network-attached storage device.  
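
A rough sketch of that flow - the device name /dev/xvdf is only an assumption, so use whatever new entry actually appears under /dev on your instance:

ls /dev/                         # the attached EBS volume shows up as a new raw block device, e.g. /dev/xvdf
sudo mkfs -t ext4 /dev/xvdf      # lay a filesystem down on the raw device
sudo mkdir /mnt/ebs
sudo mount /dev/xvdf /mnt/ebs    # mount it like any local disk
df -h /mnt/ebs                   # ...and it now behaves like one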

4) How your code works in the Linux environment:
  • When you run your programs, they ultimately call system functions.  
  • Those system functions either request operations from the kernel (via system calls), or they simply execute as ordinary instructions on the CPU.  
  • Kernel operations access devices or do other privileged tasks: accessing a disk, connecting to a socket, etc... 
  • Library operations don't require access to the kernel: simple calculations can be done without ever crossing into it.  Finally, you should know that the Linux operating system is written in C.  So when a higher-level language requests a service (e.g. opening a file), it ultimately does it through a C binding of one sort or another (the sketch below makes this hand-off visible).  
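
If you want to watch that hand-off happen, strace prints the system calls a process makes as it runs.  A minimal sketch (the exact call names - open vs. openat, and so on - vary with your libc and kernel):

strace -e trace=open,openat,read,write,close cat /etc/hostname

Every line of that output is your program asking the kernel to do something on its behalf; wrap the same thing around a Python or Java one-liner and you will see the same calls coming out of the bottom of the language runtime.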
5) Finally - a list of articles that will help you to understand your code, in order of increasing abstraction.

If you're really adventurous, you can see how the whole thing fits together in this clickable visualization of the entire dependency hierarchy of a Linux OS.  Very useful for getting a picture of what's going on under the hood: http://www.makelinux.net/kernel_map/.

2 comments:

  1. Ok, there are some good points here but I would like to be a bit more clear as to what you mean when you say

    “Java is ultimately optimized by the JVM, which is capable of byte-code at run time by using tricks like inlining (combining separated functions into a single method). "

    I will try and break this down a bit

    I am sure you mean the JIT compiler and the whole bunch of optimizations that may or may not proceed. OpenJDK (and probably the other VMs too) has pretty restrictive limits on how far it will go to optimize code. A large number of these limits are based on the byte code size of the target methods. Methods that get too big won’t inline, and large complex loops can prevent all kinds of crucial optimizations: loop unrolling, range-check elimination, all kinds of prefetching and alias analyses. So while it is true in the most generic sense, it is useful to be aware that these optimizations are not guaranteed. The JIT is a wonderful thing but you still need good code to make it happen.

    “which is capable of byte-code at runtime“ - here I think it’s worth pointing out that, while the JIT is doing its best to optimize the code, this is only part of the story; we can do all sorts of other cool stuff, like grabbing and transforming the byte code on the fly, so to speak. There are a lot of rock-solid APIs for this: http://www.jboss.org/javassist/

  2. ah typo ! Fixed :) cat blog | sed -i "s/byte-code at/byte-code optimization at/g"
