This video was sponsored by Let's Get Rusty. Today, we're going to cover some
low-level concepts that you probably never have to think about unless you're
working at the systems level. One of the most frequent video
suggestions I receive is to explain why some projects seem to involve multiple
programming languages in their development. Explaining this can be
either extremely easy or extremely difficult depending on the type of
project. Take a full-stack framework like Django, for example. Python is used
to handle the backend, which runs on the server, while HTML, CSS, and
JavaScript build the user interface displayed on the client side.
This is a multi-language project. But in this case, it's easy to understand
how everything works in production because we're essentially developing two
separate processes that communicate remotely at runtime using some form of
interprocess communication. But there are other types of projects
where components written in different programming languages are meant to run
together as a single process. So how are these kinds of projects even possible?
Hi friends, my name is George and this is Core Dumped.
If you follow this channel, I'm pretty sure you love low-level systems. And
that's exactly why I'm excited to say that this video is sponsored by Let's
Get Rusty. Let's be honest, Rust is no longer the language of the future. It's
the language of the present. Don't take my word for it. Just look at the
industry. Big companies are betting big on Rust for building critical systems.
Google, Microsoft, and even the Linux kernel itself are now integrating Rust into
their core systems. That's not hype. That's happening. And if you're thinking
about leveling up your Rust skills, whether for personal growth or to land a
job working on real systems, Let's Get Rusty is the go-to place for Rust
training. Created by a fellow YouTuber and one of
the most beloved names in the Rust online community, Let's Get Rusty has
helped thousands of developers, myself included, by the way, master the
language and break into systems programming. They're running a new
cohort very soon. And since spots are limited, now's a great time to check it
out. Visit letsgetrusty.com/startwithjorge
or just click the link in the pinned comment below. Big thanks to Let's Get
Rusty for supporting the channel. And now, let's get into today's video.
For simplicity, let's start by considering only programming languages
that compile down to machine code. Generally, each programming language has
its own dedicated compiler. So, we can't just take, say, a Rust file and compile
it using the Go compiler. That's where things start to get
interesting and a bit confusing. If most programming languages have separate
compilers, runtimes, and memory models, how can they possibly live inside the
same binary? What often confuses people is the common
oversimplification that compilers are just tools that turn source code
directly into executable files. Now, don't get me wrong, compilers do produce
executable files, but that's only the final result of a much more complex
multi-step process that we usually don't see.
To illustrate this, let's look at a simple C program. If you are not a C
developer, don't worry. This program simply prints a message, but the message
it prints changes depending on the operating system you're running it on.
On most GNU/Linux systems, the go-to compiler for C is GCC. It used to stand
for GNU C compiler, but that's no longer true. And in a moment, you'll understand
why. To compile and run our C program, we
usually just call GCC and pass it the file or files we want to compile. Then
an executable is generated. From our perspective, it's just two
simple steps. One to compile the program and another one to run it. But under the
hood, the compiler is doing a whole lot more. Internally, GCC goes through four
main phases to turn a C file into a working executable. Now, GCC deserves
its own deep dive, but I'll break it down quickly here.
The first step is pre-processing. This prepares the source code by doing things
like removing comments, expanding macros, resolving conditional
compilation, and, crucially, resolving includes. When you use #include, the C
pre-processor replaces that line with the contents of the header file and all
the headers it includes, effectively inserting that code into our file before
compilation begins. So, the output is still C code, but pre-processed for the
next step. Next comes compilation, but not directly
into machine code. Instead, the pre-processed code is translated into
assembly language, which consists of the instructions the computer will
execute, but written in a human-readable form.
So, here's our first myth busted. A compiler doesn't always convert source
code into machine code. In fact, many compilers convert source code into an
intermediate representation like assembly or even into another
programming language. The third step involves the assembler,
which is technically another compiler: it takes the human-readable assembly
code from the previous phase and translates it into machine code, the
ones and zeros your CPU understands. The result is called an object file.
But here's the catch. This object file isn't runnable yet. GCC still needs to
resolve where each function will be placed within the final binary.
In our simple example, we're just printing text to the console. But
remember, the actual implementation of the printf function lives in the C
standard library. So that library also needs to go through the same compilation
steps we just described. This brings us to the final step,
linking. At this stage, we may have multiple object files, some from our
code, others from external libraries we included during development. The
linker's job is to combine all these object files into a single
self-contained executable. There are two ways to do this. The easiest is to take
the machine code of each required function from the library and copy it
into the final executable. This is called static linking. All the
library functions our program needs are embedded directly into the output file.
Everything is self-contained and hence ready to run whenever we want.
But another option is dynamic linking. Think about how many programs on your
system use the printf function from the standard library. If every one of those
programs statically included its own copy of that function, you'd end up with
thousands of identical copies stored across your disk.
With dynamic linking, libraries are precompiled into a special type of file
called a dynamic shared library. On Unix-like systems, these libraries
have the .so file extension, while on Windows they are identified by the .dll
extension. These dynamic shared libraries are
similar to executable files in that they contain executable code for the
functions provided by the library. The key difference is that they don't
contain an entry point to start execution, which makes sense, as
libraries typically don't have a main function to start the
program from. When our program is compiled with dynamic linking, the linker won't
copy the functions from the library directly into the executable. Instead,
it will simply insert a reference to the library that contains the machine
instructions for that function. At runtime, if the program needs a
function from that dynamic library, the operating system will load the required
function into the program's address space so the program can use it as if it
were part of the executable. While this may sound a bit strange at
first, it's actually incredibly efficient. Instead of storing multiple
copies of the same function across different programs, the system only
stores the library once. Each program that needs it simply references the
shared library and loads it on demand at runtime. This
saves both disk space and memory. A huge advantage, especially on systems with
lots of programs that depend on common libraries. It's also more flexible since
you can update or patch a library without having to recompile every
program that uses it. Linking, both static and dynamic, is a deep topic that
honestly deserves its own video. If you're interested in learning more, let
me know in the comments, and I'll dedicate a full episode to explaining
how linking works, including the ins and outs of dynamic libraries.
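As a quick taste of what this looks like in practice on a GNU/Linux system with GCC, the build commands below (shown as comments) are real, while the file names are hypothetical:

```c
/* Building and using a dynamic shared library with GCC:
 *
 *   gcc -fPIC -shared mylib.c -o libmylib.so   // build the .so
 *   gcc main.c -L. -lmylib -o main             // link against it
 *   LD_LIBRARY_PATH=. ./main                   // OS loads the .so at runtime
 *
 * Static linking, by contrast, copies the code in at link time:
 *
 *   gcc -static main.c -o main
 *
 * A tiny function that could live in such a library: */
int mylib_double(int x) {
    return 2 * x;
}
```
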
Now, back to the compilation steps. You might be wondering, what's the point of
all this modularization? Why break the process into so many
phases if the compiler could just go directly from source to executable?
Well, the reason we don't normally see all these intermediate steps is because
compilers like GCC are configured by default to hide them. They just show you
the final result, the executable. But with the right flags, we can expose
all those phases. For example, using GCC, if you compile a
program and add the -save-temps flag, you'll get not just the final
executable, but also all the intermediate files.
We can even stop the process at a specific stage. For example, the -S flag
makes the process stop after generating assembly.
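The other stages have stopping points too. This sketch lists them as comments; the flags are standard GCC options, while the file names are just examples:

```c
/* Exposing GCC's pipeline stages from the command line:
 *
 *   gcc -E main.c -o main.i    // stop after pre-processing (still C code)
 *   gcc -S main.c              // stop after compilation (main.s, assembly)
 *   gcc -c main.c              // stop after assembling (main.o, object file)
 *   gcc main.c -o main         // run the full pipeline, including linking
 *   gcc -save-temps main.c     // run everything, but keep all intermediates
 */
enum { PREPROCESS = 1, COMPILE, ASSEMBLE, LINK }; /* the four phases, in order */
```
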
This is incredibly useful in educational settings where you might want to see how
high-level C code translates to assembly or machine code. In professional
environments, this is also used to inspect performance-critical code. You
can look at the generated assembly to verify whether the compiler is producing
efficient instructions. Even more interesting, we can start from
any phase in the pipeline. We can pass GCC the assembly file and simply tell it
to assemble and link it. This is huge because it means we can
write part of our code in assembly, pass it to the compiler at different stages,
and then the linker will take care of mixing them together into an executable
file. This already starts to answer our
original question. Let's walk through an example. Suppose we need to write a
program that calculates how many prime numbers exist between zero and a given
number. And we want it to be as fast as possible. We could write the whole thing
in C. But let's say we don't trust the compiler optimizations. So we decide to
write the heavy calculation function directly in assembly and just call it
from C. Then we pass both files to GCC which
will compile and assemble the C code, assemble the assembly code and link both
object files into a single executable file.
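To make this concrete, here's a sketch of the C side of such a project. The function name and the trial-division algorithm are my own illustration; in the setup just described, count_primes would live in a separate .s file, but here it's written in plain C so the sketch compiles on its own:

```c
#include <stdbool.h>

/* In the multi-language setup, this function would be implemented in
 * an assembly file (count_primes.s) and declared in the C code as:
 *
 *   extern int count_primes(int limit);
 *
 * Both files would then go through GCC together:
 *
 *   gcc main.c count_primes.s -o primes
 *
 * Plain-C stand-in using trial division: */
static bool is_prime(int n) {
    if (n < 2) return false;
    for (int d = 2; d * d <= n; d++)
        if (n % d == 0) return false;
    return true;
}

/* Counts how many primes exist between zero and limit, inclusive. */
int count_primes(int limit) {
    int count = 0;
    for (int n = 2; n <= limit; n++)
        if (is_prime(n)) count++;
    return count;
}
```
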
And voila, we've just compiled a multi-language project.
This technique is used by real-world systems like the Linux kernel, FFmpeg,
OpenSSL, and many embedded projects. They often contain C for most of the
logic, but fall back to assembly when performance really matters.
Now, here's a fact that a lot of you might have already concluded, but I'm
still going to mention anyway. What we casually call the C compiler, like GCC,
isn't just one compiler. It's actually a tool chain, a pipeline of tools that are
executed in sequence. Each stage consumes the output of the previous one.
And each of these tools is pluggable. We can replace parts of the tool chain or
feed in our own files at various points. This is why GCC doesn't just support C.
It also supports C++, Objective-C, Fortran, Ada, D, and even Go, depending on
how it's configured. Originally, GCC stood for GNU C Compiler, but over time
it evolved into a compiler suite that supports many programming languages
beyond C. Because of this expansion, the name GNU C Compiler became misleading,
so the acronym GCC was redefined to mean GNU Compiler Collection.
I think this is really important to understand. Every time someone casually
calls it the GNU C compiler, it can unintentionally reinforce the idea that
this whole system is a single black box that transforms C code into a runnable
file. But the truth is, it hasn't been just that for many years now.
Okay, but assembly isn't for everyone. And to be fair, since assembly is
already part of the compilation pipeline, using it feels a bit like
cheating. So what about mixing high-level languages instead? For example, instead
of writing a function in assembly, what if we implement part of our project in
Fortran? Well, this is totally possible and
actually more common than it might seem. In this case, however, we usually need
multiple steps: one to compile and assemble the Fortran file, another one to
compile and assemble the C file, and a third to link both object files into
a single executable. Unlike assembly, which is already
embedded in the C compilation pipeline, Fortran has its own pipeline, its own
compiler, and sometimes even its own runtime dependencies. And by now it
should be super clear that the answer to our original question, how can different
languages live inside a single executable, comes down to the linker. You
see, the different languages involved don't even need to come from the same
compiler suite as GCC. Take Rust for example. It has a completely different
tool chain from C. Different compiler, different build system, and a different
philosophy altogether. I could spend hours talking about the insane
engineering behind its compiler. But what we care about is that when it comes time
to produce the final binary, guess what? Rust, too, relies on a linker.
So if we want to call a Rust function from C, here's how we do it. We
implement the function in Rust. We compile the Rust code into a static or
dynamic library. We declare and use the function in the C code. Then we compile
the C code and link it with the Rust compiled library.
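Sketched in code, it could look like this. The Rust side is shown in comments; the function name is illustrative, and the exact link flags vary by platform:

```c
/* Rust side (lib.rs), compiled separately into a static library:
 *
 *     rustc --crate-type=staticlib lib.rs     // produces liblib.a
 *
 *     #[no_mangle]
 *     pub extern "C" fn rust_add(a: i32, b: i32) -> i32 {
 *         a + b
 *     }
 *
 * C side: declare the function, then link against the Rust library
 * (extra system libraries may be needed depending on the platform):
 *
 *     gcc main.c liblib.a -o main
 */
extern int rust_add(int a, int b);

/* Stand-in definition so this sketch compiles without the Rust library;
 * in the real setup, the linker would find the symbol in liblib.a. */
int rust_add(int a, int b) { return a + b; }
```
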
And of course, it works the other way, too. We can call C functions from Rust.
It all depends on what we're trying to achieve. In fact, it's more common to
call C code from Rust than the other way around. C is older and many mature
libraries and system APIs are written in C. Rust developers often need to hook
into that existing ecosystem, especially in areas like graphics, cryptography, or
operating system APIs. There are several reasons why you might want to mix
multiple programming languages in a single project. But another one that
comes to mind is performance. In many projects, the entire system
doesn't need to be blazing fast, just certain parts. So what a lot of
developers do is write most of the project in a high-level language for
convenience and development speed, and then implement only the
performance-critical components in a lower-level language like C.
Before we wrap up, there's one more really important point to understand.
Let's say we have two high-level languages, language A and language B. Just because
both of them have a final linking phase doesn't automatically mean they can be
correctly linked together into one executable. Here's a very simple
example. We've implemented a function in language B and we're calling it from
language A. Even if both compilers emit assembly for the same architecture, they
might make different assumptions about how data is passed between functions.
For example, the compiler for language A might pass the two function parameters
in registers zero and one, but the compiler for language B might expect the
parameters in registers one and two. Both are producing valid machine code,
but since their calling conventions differ, the result will be undefined
behavior at runtime. Language A will place arguments in the wrong place, and
language B will perform operations using incorrect data.
And it doesn't stop there. In this example, there's another problem. After
computing the result, the function writes it to register one and returns.
But language A expects the result to be in register zero. So, not only does
language B compute the wrong result, but language A doesn't even see or use that
result at all. The same example, but this time with two
languages X and Y. Here, both languages use registers zero and one to pass
and receive parameters. But let's say language X uses pass by reference for
all function arguments. It puts the addresses of variables in the registers
which is not the same thing as putting the values of those variables directly
in the registers. Meanwhile, language Y passes by value.
So, it expects the actual values in the registers. That's why it
immediately adds the contents of those registers as soon as it's called at
runtime. This mismatch causes language Y to interpret memory addresses as actual
values, adding those addresses instead of the values stored at those addresses,
leading to completely wrong behavior or even a crash.
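This mismatch can be simulated entirely in plain C. Here, add_values plays the role of language Y (expecting values) and the callers play language X; all the names are my own illustration:

```c
#include <stdint.h>

/* "Language Y": expects the actual values in its parameters. */
int add_values(int a, int b) {
    return a + b;
}

/* "Language X" caller, passing by reference: it hands over the
 * addresses of x and y, squeezed into integers. add_values then
 * blindly adds two memory addresses instead of 2 + 3. */
int mismatched_call(void) {
    int x = 2, y = 3;
    return add_values((int)(intptr_t)&x, (int)(intptr_t)&y); /* garbage */
}

/* The fix described in the video: dereference first, so the actual
 * values land in the parameters. */
int fixed_call(void) {
    int x = 2, y = 3;
    int *pa = &x, *pb = &y;
    return add_values(*pa, *pb); /* 2 + 3 */
}
```
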
So even though both compilers produce valid and executable assembly, the final
linked binary is inconsistent unless both sides agree on how to talk to each
other. These kinds of low-level rules are
defined by what's known as the application binary interface or ABI.
Just as an API defines functions at the application level, an ABI defines how
different components of binary code interact with each other through the
hardware. So when we're mixing two different
languages, it's not enough that they both produce object files. At least one
of them, or specifically the part of it that interacts with the other language,
must conform to the other's ABI expectations.
In our language X and language Y example, one way to make this work is by
modifying language Y to dereference the addresses it receives. That way, the function will first
fetch the data from the memory addresses provided, loading the actual values into
the registers before performing the addition. In this case, language Y is
being made to conform to the ABI expectations of language X.
But we could also take the opposite approach. Make language X conform to the
ABI expectations of language Y by simply loading the argument values directly
into the registers instead of their addresses. This way, the function in
language Y can immediately add the values when it's called. As I mentioned
at the beginning of this video, these are low-level details that we usually
don't have to think about unless working at the system level. The good news is
language designers know this. Modern languages provide tools, keywords, and
compiler flags to make this process easier. In C, you might declare an
external function using extern. In Rust, you'd use the extern keyword and the
#[no_mangle] attribute. In Fortran, you can use the bind(c)
attribute. In Go, you can use a special block of
comments placed directly above the import "C" line to include C header files. And
it even lets you write inline C code directly in your Go source files. Every
language has its own way of doing this. But at compile time, these declarations
all serve the same purpose. They tell the compiler, hey, this function will
interact with code written in another language. Please make sure the generated
assembly follows the expected ABI. And let's wrap things up for now. In the
next part, we will cover how to mix compiled languages with interpreted
languages. So, make sure to subscribe because you won't want to miss it.
Don't forget to check out Let's Get Rusty, linked in the pinned comment below. And
if you liked this video or learned something new, please hit the like
button. It's free and that would help me a lot. See you in the next one.