The Servo Browser Engine -- A Learner's Notebook

Script

The componenets/script folder contains most of the Rust implementation of the DOM and JavaScript interface.

JavaScript, Rust and WebIDL

Servo does not has its own JavaScript engine in Rust. It use SpiderMonkey as it JavaScript engine (written in C++), the same as desktop Firefox.

Web developers use JavaScript to interact with the DOM, each DOM object and API must have a corresponding Rust implementation. The main interface between a Rust reflector and its JavaScript counterpart is defined using WebIDL, which is a standard interface definition language commonly used in HTML5 specs. A python code generator (in components/script/dom/bindings/codegen) will take the WebIDL and generate a JavaScript wrapper/reflector (located in components/script/dom/bindings/codegen/Bindings) for the Rust code. The servo developers then fill in the missing Rust code. You can find those implementations in components/script/dom. There are common DOM elements like <body/> (htmlbodyelement.rs), <a/>(htmlanchorelement.rs), just to name a few.

Garbaged Collected DOM

In old browser engines like Gecko, the JavaScript objects are GC'd (garbage collected) by the JavaScript engine, while the corresponding C++ objects are managed by reference counting. However, a C++ may be pointed by a JavaScript object, and vice-versa, a pointer cycle is created, so the elements in the cycle can never be freed by reference counting. Gecko solve this problem by implementing a complex cycle collector. A cycle collector is not only hard to maintain, but also introduce performance penalties because the whole system has to be paused for the cycle collector to do a sweeping.

Servo address this issue by using the JavaScript engine's GC only. All the Rust objects' lifetime are controlled by the JavaScript engine. Once an DOM object is GC'd by the JavaScript engine, it's corresponding Rust object is freed.

SpiderMonkey's Garbage Collectior Design [1]

SpiderMonkey used to use a conservative mark-and-sweep garbage collector, but it migrated to a exact rooting garbage collector.

The basic concept behind the mark-and-sweep GC (garbage collection) is easy: The variables or in-memory data structures that are no longer accessable are considered garbage, and needs to be reclaimed. But first, these garbages are allow to accumulate. Until a certain criteria is met, for example: a new object allocation is requested but there is no enough memory, or a certain timeout has reached, the main program is paused for the garbage collector to do its job.

In the first phase, the garbage collector walks through (or trace out, thus the name tracing GC) the memory to mark objects that are directly or indirectly reachable as live. Directly reachable objects (root in GC speaking) includes objects that are referenced by static variables, stack variab les or even CPU register. Indirectly reachable objects are those referenced by the directly reachable objects or each other. In the second phase, the objects on the heap that are not marked as live are freed, or sweeped.

However, depending on the timing of the GC, there might be unreachable objects that are actually needed afterwords. For example, consider a local variable that was created to be inserted into an array. But before the insertion is executed, the GC kicks in. At that point, the local variable is actually live but unreachable from any root.

There are at least two ways to solve this problem. The first is called conservative GC. What is does is that during the marking phase, it looks into the stack and CPU registers and find anything that looks like a pointer pointing into the heap. Although there may be false positives, we still treat the object at the pointed address as live objects. Another way is called exact rooting. Programmers has to manually mark newly created variables or objects as root, so they will not be recycled.

At first glance, exact rooting puts a lot of burden on programmers, and can be error-prone if the programmer forget to mark anything. Forgetting to mark an object as live may cause memory leak, create security loopholes or used-after-free problems. Although conservative GC has some false positives, but it looks easy and clean. However, the conservative approach has a fatal flaw: the GC managed objects, or GC thing, can't be moved. This prevents many advanced GC optimization like compacting and generational GC, which will be discussed later.

When we want to move a GC thing, we need to update all the pointers that points to it. However, because conservative GC are not exactly sure if a value in the stack or register is an actual pointer or not, it might accidentally change the wrong thing. For example, suppose we have a value X somewhere in the memory, but the garbage collector wants to move the GC thing at heap address X to Y. The value might be treated as a pointer to X and being changed to Y. This will corrupt the value and affect the behavior of the program. With exact rooting, since all live objects are manually marked, the garbage collector can confidently move a GC thing because it knows what is a pointer and what is not.

Then why would a garbage collector wants to move GC things around? One use case is an optimization method call compacting. Since after a few runs of GC, the live objects may be scattered sparsely in the memory. In order to eliminate the fragmentation of the free space for more efficient usage of memory, the garbage collector may want to shift the live GC things together in memory.

Another use case, which have been implemented in SpiderMonkey, is called Generational GC. In generational GC, the heap is split into two sections: the nursery area and the tenured area. Newly created objects are allocated in the nursery area. If the object lives long enough, it will be moved into the tenured area. Since many objects, like local variable, usually don't live very long, the move will not be very often. Since many objects die early in the nursery, usually a fast minor GC in the nursery only will reclaim enough memory for the program to continue running. We only need to do a full-scale major GC on the whole heap occasionally, reducing the overall GC pause time. Also the fragmentation will reduce because the long-lived objects will be pretty condensed in the tenured area.

Servo's smart pointers[2]

In order to collaborate with the garbage collector, Servo introduced a few smart pointers, which are located in components/script/dom/bindings/js.rs. We'll introduce why these smart pointers are necessary, and when to use them.

Auto-generated trace hooks

Safe rooting

Misc

What is this Atom type used in the html element attribute?

Atom is from the string-cache crate. It uses a technique called string interning to save only one copy of duplicated strings, thus reduce the memory usage.

Interned strings are stored in a hashmap. If the a string already exists in the hashmap, the Atom version of it will only store the location in the hashmap.

What is the unit Au?

You might encounter the length unit Au when doing layout queries. For example, when you use layout query to get the upper-left corner position of a <img/>, the origin.x and origin.y are both in Au. An Au is an "App Unit" and represents 1/60th of a CSS pixel. It was originally proposed in 2002 [3] as a standard unit of measure in Gecko. The definition of Au is in components/util/geometry.rs.

TODO:

  • JS GC and Rooting
  • How to add an WebIDL
  • JS controlled memeory mgmt
  • layout query
  • Tip: webidls won't compile
  • activation example
  • blob example (add new web api):
    • 14:24    slyu-home    I have one more question, in the dom implementations, there are new_inherited(), new(), and Constructor(). What are the differences between them? Thanks
      14:24    larsberg    I'm not entirely sure. There are any number of things that can cause a failure to happen and merges to not happen. But I suspect that adding the linux2 builder had the good side effect of fixing that android bit.
      14:24    Ms2ger    slyu, Constructor is the thing called from JS
      14:24    * larsberg does not like all the rube goldbergian semi-related moving parts in these systems
      14:25    Ms2ger    slyu-home, new() is the thing you call when you create the object
      14:25    Ms2ger    slyu-home, new_inherited() is the thing you call from Class::new() and Subclass::new()
      14:26    slyu-home    Ms2ger, I didn't get the new_inherited()
      14:26    Ms2ger    slyu-home, so we have to fake class inheritance because rust doesn't support that
      14:27    slyu-home    Oh I see
      14:27    Ms2ger    slyu-home, Foo::new returns an object with a reflector of type Foo, which you don't want for Bar that inherits from Foo
      14:29    slyu-home    A side question: why do we call it a "reflector"
      14:29    slyu-home    As far as I understand, it's something created in the JS VM so it can access the underlying Rust
      14:29    Ms2ger    It reflects DOM APIs across the JS<->Rust boundary
      14:30    slyu-home    I see
      14:30    Ms2ger    The alternative is "wrapper", which can mean any of a dozen things :)
      14:30    slyu-home    Because "reflect" makes me think of bouncing back
      14:30    Ms2ger    Heh
      14:30    slyu-home    Now I understand
      
---
16:07    shinglyu    Hi
16:08    Manishearth    o/
16:08    shinglyu    I am reading this article: https://blog.mozilla.org/research/2014/08/26/javascript-servos-only-garbage-collector/
16:08    shinglyu    And this confuses me: "As in C++, the fields of Node are included in-line with the fields of Document, without any pointer indirection"
16:08    Manishearth    oh
16:08    Manishearth    so
16:08    Manishearth    struct Node {something: u8}
16:09    Manishearth    struct Document {parent: Node, somethingelse: u8}
16:09    shinglyu    So is this a automatic optimization done by the compiler?
16:09    Manishearth    will be stored as struct Document {parent: {something: u8}, somethingelse: u8}
16:09    Manishearth    which in memory is just two u8s
16:09    Manishearth    shinglyu: not even an optimization
16:09    shinglyu    Ah!
16:10    Manishearth    shinglyu: nothing is a pointer by default
16:10    shinglyu    I see
16:10    Manishearth    unlike Java and all
16:10    Manishearth    so nesting structs is a flat operation
16:10    shinglyu    so they are placed in the memory next to each other
16:10    Manishearth    note that when I say `parent` there, I mean the parent type (eg an HTMLElement is an Element is a Node, so each one is a child of the other in that sense)
16:11    Manishearth    for storing a pointer to a parent node we will use a JS<node>
16:11    Manishearth    shinglyu: yep
16:11    shinglyu    So if we have a nested struct, it will be flattened
16:11    shinglyu    kind of like serialization (?)
16:11        *** Temur joined #servo
16:14    Manishearth    serialization?
16:15    Manishearth    but yes, it's flattened
16:15    Manishearth    how else would it be stored?
16:15    Manishearth    if you're not allowed to insert random pointers, that's the only way to store it
16:16    shinglyu    My first intuition is that we store a pointer to a Node struct created somewhere else
16:16    Manishearth    nop
16:16    Manishearth    *nope
16:16    Manishearth    Rust and C++ do not put random pointers around
16:16    Manishearth    Node is Node
16:17    Manishearth    Box<Node>, JS<Node> JSRef<Node> are all pointers
16:17    shinglyu    I see
16:18        *** Ms2ger joined #servo
16:19    shinglyu    Does it make sense to use a pointer instead (perhaps in other kind of language?) in any case?, or is it simply a stupid idea?
16:21    Manishearth    not a stupid idea
16:21    Ms2ger    Well, this is simulating inheritance
16:21    Manishearth    in a language like Java, you GC everything anyway, so a pointer is fine
16:21    Manishearth    Ms2ger: I think he's asking about the language "feature" of flattening, not our usage of it
16:21    Ms2ger    Oh
16:22    Manishearth    shinglyu: but note that indirection can get costly
16:22    Manishearth    in lower level languages like C++/Rust you want to use pointers when you have stuff that you can't, or don't want to, put on the stack
16:22    Ms2ger    Right, then Java would store a pointer, but that's more because all object types are really Gc<Option<T>>
16:22    Manishearth    yep
16:23    Ms2ger    +RefCell
16:23    shinglyu    I see
16:23        *** canhtak joined #servo
16:23    Manishearth    Ms2ger: wait, why refcell?
16:23    Ms2ger    Note that we can't use a pointer for the Node fields, because then we couldn't cast a pointer to a Node to a pointer to a Document
16:23    Manishearth    oh right
16:24    Manishearth    mutability
16:24    Ms2ger    Manishearth, what else do you use RefCell for?
16:24    Manishearth    no I was thinking in the Java frame of mind and was like "why do we need refcounted interior mutability here?"
16:24    Manishearth    and then I realized that that's exactly what you were saying
16:24    Manishearth    well, perhaps it would be Cell though
16:25    Ms2ger    Mm, probably
16:25    Manishearth    shinglyu: note that since Node is the first field to the Document, and since memory layout is flat, a variable which is a Document can be treated like a Node without any issues
16:26    Manishearth    which is why all our casts are simple transmute (where transmute just is a way of telling the compiler to treat a variable of type A as type B)
16:26    Manishearth    well, all of our up casts
16:26    Manishearth    downcasts are checked
16:26        *** dawehner quit (Ping timeout: 121 seconds)
16:26    shinglyu    Manishearth: but we have to keep Node as the first field of Document, right?
16:27    Manishearth    yes
16:27    shinglyu    And what happen to the rest of the Document's field when we cast it to Node?
16:27    Manishearth    nothing
16:27    Manishearth    they stay in memory
16:27    Manishearth    it's just not accessible anymore
16:27    shinglyu    ok
16:27    Manishearth    until you cast it back, that is
16:27        *** canhtak quit (Ping timeout: 121 seconds)
16:28    shinglyu    But they are in the stack, so it has nothing to do with memory leak, am I right?
16:28        *** Rym joined #servo
16:28    Manishearth    shinglyu: well, in the case of Servo, all of these are on the heap
16:28        *** canhtak joined #servo
16:29    Manishearth    it's JSRef<T>, remember
16:29    Manishearth    so, when we cast it, the fields of Document which were on the heap aren't accessible through the new Node pointer
16:29    Manishearth    but they are accessible through the old Document pointer
16:29    Manishearth    (JSRef<Node> and JSRef<Document>, that is)
16:29    shinglyu    Oh ok
16:29    Manishearth    and they are still known to the SM GC
16:30    Manishearth    it's just that locally we lose access
16:30    Manishearth    leaks are handled by the GC
16:30    shinglyu    So even if we are done with the JSRef<Node>, thie JSRef<Document> will still be live
16:31    shinglyu    When JSRef<Document> is also dead, it will be GC'd (?)
16:31        *** psd joined #servo
16:32    Manishearth    so even if the JSRef<Document> wasn't alive, it still wouldn't be leaked
16:32        *** Rym quit (Ping timeout: 121 seconds)
16:32    Manishearth    JSRef<T> is a local rooted reference
16:32    Manishearth    it
16:32    Manishearth    it
16:32    Manishearth    it's possible that there are other pointers to the document within the DOM
16:33    Manishearth    eg a JS<T> in some dom struct elsewhere
16:33    Manishearth    JSRef<T> is just a way for us to tell the GC "hold on, even if this loses all references don't delete this, I'm holding on to it for now"
16:34    Manishearth    and when we relinquish a JSRef, we tell the GC that it's okay to start GCing this object now
16:36        *** dawehner joined #servo
16:39    shinglyu    Thanks! that's very helpful explanation
16:39    shinglyu    Manishearth: I also have a question about cycle collection
16:40    shinglyu    From the first paragraph of "Auto-generating field traversals"
16:40    shinglyu    it seems like we still have cycle collection?
16:41    shinglyu    But I think the previous section "Memory management for the DOM" is saying that cycle collection is bad, and we want a better way to do this?
16:42        *** nattokirai quit (Client exited)
16:49        *** dinfuehr quit (Connection closed)
16:54    Manishearth    shinglyu: I believe the SM GC is cycle collecting ( Ms2ger ?)
16:55    Manishearth    I think the blog post is saying that we don't create an auxiliary GC to manage the DOM
16:55    Ms2ger    "cycle collection" is a term that's typically used for a particular kind of system based on refcounting
16:55    Manishearth    not sure
16:56    Ms2ger    The GC doesn't really collect cycles; it just collects everything that isn't rooted, so it doesn't even notice there are cycles in the dead objects
16:58    SimonSapin    SpiderMonkey’s garbage collector is a "tracing" garbage collector
16:58    SimonSapin    while CPython for examples uses reference counting with a cycle collector.
16:58    SimonSapin    these are two kinds of garbage collectors
16:58    Manishearth    ah okay
16:59    Manishearth    so SM just goes around marking rooted stuff
16:59    Manishearth    and cleanung up non-rooted stuff
16:59    Ms2ger    Well, two kinds of automatic memory management
16:59    Manishearth    what does cycle collecting do?
16:59    Ms2ger    "garbage collection" is often reserved for the tracing kind
17:00    Ms2ger    So if you have a dead cycle with refcounting, none of the objects reach a refcount zero
17:00    Ms2ger    CC detects that case, and force-deletes those objects
17:00    SimonSapin    Manishearth: I don’t know how CC works in CPython
17:00    Ms2ger    Manishearth, also, r? #5741
17:00    crowbot1    Issue #5741: Use get_attr_for_layout more. - https://github.com/servo/servo/pull/5741
17:03    Manishearth    later, studying
17:05    shinglyu    I read the whole blog post again, now I believe it's trying to say "Instead of ref counting + cycle collector, we use the hooks in Encodeable trait to let GC know all the pointers between DOM objects"
17:05        *** Nekit1234007 joined #servo
17:07    Manishearth    note: we don't use Encodable anymore
17:07    shinglyu    Oh, what's the new mechanism then?
17:10    SimonSapin    a special-purpose Traceable trait with a compiler plugin to implement in automatically
17:11    shinglyu    SimonSapin: components/script/dom/bindings/trace.rs?
17:12    Manishearth    yep
17:13    shinglyu    Hmm, so the mechanism looks similar
17:14    Manishearth    it's the same, just not so hacky
17:14        *** acgtyrant quit (Ping timeout: 121 seconds)
17:14    shinglyu    Manishearth: Great, thankss