Source code reading | phase I: name resolution

This issue starts with three kinds of confusion that developers run into while programming, as described in the first chapter of The Programmer's Brain:

  1. Lack of knowledge (corresponding to long-term memory, LTM): the developer lacks basic knowledge of the programming language and cannot use or understand its basic syntax.
  2. Lack of information (corresponding to short-term memory, STM): the developer does not understand the problem domain that the program deals with.
  3. Lack of processing power (corresponding to working memory, WM): the developer cannot keep track of the program's whole execution process.

These three kinds of problems trouble developers not only when writing new programs but also when reading existing code.

Therefore, before reading source code written by others, we should first make up for whatever we are missing in these three areas.

My reading habits

I read source code the same way I read a book: from the overall structure down to the details.

First, make sure you understand the context in which the rustc_resolve library operates; that is, fill the second of the three gaps described above. The first and third kinds of problem should not affect anyone who is not a Rust novice. In general, the most common problem when reading the Rust source code is the second kind: not understanding the problem domain that the program handles.

Official recommended reading method

The slides for phase I of the official Rustc source code reading recommend a three-stage reading method: broad, deep, broad.

Specifically:

  1. Broad: understand the module as a whole.
  2. Deep: focus on a function or small area (something you are interested in or have questions about).
  3. Broad: return to the whole module.

Repeat these three stages for several rounds, and you will have read the whole code base.

Rustc compiler architecture

The overall architecture of the Rust compiler (Rustc) is introduced in the Rustc Dev Guide.

The Rustc compiler architecture differs from the traditional one. A traditional compiler architecture is pass-based, while the Rust compiler architecture is demand-driven.

Pass-based compiler architecture

A pass is one scan over the code or AST that processes it.

Early compilers were generally single-pass. Multi-pass compilers appeared later and were split into a front end and a back end: the front end is responsible for generating the AST, while the back end generates machine code. Abstracting each step of the compilation process as a pass was first adopted by LLVM and then spread to the whole compiler field.

Passes fall into two categories:

  • An analysis pass collects information for other passes to use, or helps with debugging or visualizing the program.
  • A transform pass changes the data flow or control flow of the program, for example for optimization.

These two kinds of pass also correspond to two stages of the compiler: the analysis stage and the synthesis stage. The former creates an intermediate representation from the given source text, and the latter creates an equivalent target program from that intermediate representation.

The compiler front end generally corresponds to the analysis phase, and the compiler back end corresponds to the synthesis phase.
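
To make the two kinds of pass concrete, here is a minimal sketch in Rust. The Expr type and the functions count_nodes and fold_constants are hypothetical names invented for illustration; they are not taken from rustc or LLVM. The analysis pass only gathers information, while the transform pass rewrites the tree.

```rust
// A tiny expression AST. Hypothetical, and unrelated to rustc's real AST.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Num(i64),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

// Analysis pass: only collects information (here, the number of nodes).
fn count_nodes(e: &Expr) -> usize {
    match e {
        Expr::Num(_) => 1,
        Expr::Add(l, r) | Expr::Mul(l, r) => 1 + count_nodes(l) + count_nodes(r),
    }
}

// Transform pass: rewrites the tree (here, constant folding).
fn fold_constants(e: Expr) -> Expr {
    match e {
        Expr::Num(n) => Expr::Num(n),
        Expr::Add(l, r) => match (fold_constants(*l), fold_constants(*r)) {
            (Expr::Num(a), Expr::Num(b)) => Expr::Num(a + b),
            (l, r) => Expr::Add(Box::new(l), Box::new(r)),
        },
        Expr::Mul(l, r) => match (fold_constants(*l), fold_constants(*r)) {
            (Expr::Num(a), Expr::Num(b)) => Expr::Num(a * b),
            (l, r) => Expr::Mul(Box::new(l), Box::new(r)),
        },
    }
}

// Builds the expression (1 + 2) * 4.
fn sample() -> Expr {
    Expr::Mul(
        Box::new(Expr::Add(Box::new(Expr::Num(1)), Box::new(Expr::Num(2)))),
        Box::new(Expr::Num(4)),
    )
}

fn main() {
    let ast = sample();
    println!("nodes: {}", count_nodes(&ast));
    println!("folded: {:?}", fold_constants(ast));
}
```

In a pass-based compiler, passes like these run one after another over the whole program, whether or not anything downstream needs every result.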

The compiler front end includes the following parts:

  1. Lexical analyzer
  2. Parser
  3. Semantic analyzer
  4. Intermediate Code Generator
  5. Code optimizer

The generation of object code is completed by the back end.

During lexical, syntax, and semantic analysis, the compiler creates and maintains an important data structure called the symbol table, which tracks the semantics of variables by storing their related information and name bindings. It is used during intermediate code generation and object code generation.

That is roughly what a traditional pass-based compiler architecture looks like.

Demand-driven compiler architecture

Rust compiler execution process:

  • The rustc command starts the compilation.
  • rustc_driver parses the command-line arguments and records the relevant compilation configuration in rustc_interface::Config.
  • rustc_lexer performs lexical analysis, turning the source text into a token stream.
  • rustc_parse prepares for the next phase of compilation. Part of it is still lexical analysis: the built-in StringReader struct validates the text and turns strings into symbols. This symbolization uses a technique called string interning, which stores only one immutable copy of each distinct string value.
  • The other part of rustc_parse is syntax parsing, which uses a recursive descent (top-down) method to convert the token stream into an abstract syntax tree (AST). The entry points are the Parser::parse_crate_mod() and Parser::parse_mod() associated methods of the rustc_parse::parser::Parser struct. The external module parsing entry point is rustc_expand::module::parse_external_mod. The macro parser entry point is Parser::parse_nonterminal().
  • Macro expansion, AST validation, name resolution, and early lints happen during the lexical and syntactic analysis stages of compilation.
  • After that, the AST is converted to HIR, and HIR is used for type inference (https://rustc-dev-guide.rust-lang.org/type-inference.html, the process of automatically detecting the type of an expression), trait solving (the process of pairing each reference to a trait with an impl), and type checking (the process of converting the types the user wrote into the compiler's internal representation).
  • Subsequently, HIR is lowered to the mid-level intermediate representation (MIR). Along the way THIR (Typed HIR) is also constructed; it is an even more desugared HIR, used for pattern and exhaustiveness checking, and it is easier to convert to MIR than HIR is.
  • MIR is used for borrow checking. It is basically a control-flow graph (CFG). MIR is also used for optimization, incremental compilation, Unsafe Rust UB checking, and so on.
  • Finally, code generation is performed: MIR is converted to LLVM IR, which is passed to LLVM to generate the target machine code.

Another thing to note is that many values in the compiler are interned. This is a performance and memory optimization. The values are allocated in a special allocator called an arena.
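
As an illustration of interning, here is a minimal sketch of a string interner. The Symbol and Interner types below are simplified stand-ins written for this article, not the real rustc_span types, and the sketch keeps each string in both a map and a vector for simplicity, whereas a real interner would store the text once in an arena.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for an interned symbol: a cheap, copyable index.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct Symbol(u32);

#[derive(Default)]
struct Interner {
    // Maps each distinct string to the symbol it was assigned.
    names: HashMap<String, Symbol>,
    // Maps a symbol's index back to its text.
    strings: Vec<String>,
}

impl Interner {
    // Returns the existing symbol for `s`, or assigns a new one.
    fn intern(&mut self, s: &str) -> Symbol {
        if let Some(&sym) = self.names.get(s) {
            return sym;
        }
        let sym = Symbol(self.strings.len() as u32);
        self.strings.push(s.to_owned());
        self.names.insert(s.to_owned(), sym);
        sym
    }

    // Looks up the text behind a symbol.
    fn as_str(&self, sym: Symbol) -> &str {
        &self.strings[sym.0 as usize]
    }
}

fn main() {
    let mut interner = Interner::default();
    let a = interner.intern("resolve");
    let b = interner.intern("resolve");
    // Comparing two interned strings is now a single integer comparison.
    println!("same symbol: {}", a == b);
    println!("text: {}", interner.as_str(a));
}
```

The payoff is that identifiers can be compared, hashed, and copied as small integers instead of as strings.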

In the Rust compiler, the main steps of the process above are organized into a set of queries that call each other.

The Rust compiler uses a query system rather than the pass-based architecture that most compiler textbooks describe. Rust uses the query system to implement incremental compilation, that is, compilation on demand.

The Rust compiler was not originally built on the query system, so the compiler is still in the process of migrating to it, and eventually the whole compilation process above will be query-based. As of November 2021, however, only the steps from HIR to LLVM IR are query-based.

Compiler source code structure

The Rust language project itself consists of three main directories:

  • compiler/, containing the source code of rustc. It consists of many crates that together make up the compiler.
  • library/, containing the standard libraries (core, alloc, std, proc_macro, test) and the Rust runtime (backtrace, rtstartup, lang_start).
  • src/, containing the source code of rustdoc, clippy, cargo, the build system, the language documentation, and so on.

The names of all crates in compiler/ start with rustc_*. Together they form a collection of about 50 interdependent crates of varying sizes. The rustc crate is the actual binary (i.e., it contains the main function); it does nothing except call the rustc_driver crate.

The Rust compiler is split into so many crates mainly for the following two reasons:

  1. Easier code organization. The compiler is a huge code base, and dividing it into multiple crates makes it easier to organize.
  2. Shorter compilation time. Multiple crates enable incremental and parallel compilation.

However, the query system is defined in rustc_middle, which is very large and which many other crates depend on, so it takes a long time to compile. Splitting it up is not a simple task.

At the top of the compiler's dependency tree are the rustc_interface and rustc_driver crates. rustc_interface is an unstable wrapper around the query system that helps drive all stages of compilation.

Query: demand-driven compilation

What is a query? For example, there is a query called type_of(def_id). Given the DefId of an item (the index value of an identifier definition: rustc_middle/src/hir/def_id.rs), it returns the type of that item. Query results are cached, which is also what makes incremental compilation possible.

let ty = tcx.type_of(some_def_id);

However, if the query is not in the cache, the compiler will try to find the appropriate provider. A provider is a function defined and linked somewhere in the compiler that contains code to calculate the results of a query.
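
The cache-then-provider flow can be sketched as follows. This is a toy model under stated assumptions, not rustc's implementation: QueryCtxt, the Provider alias, the provider_calls counter, and the String result type are all invented for illustration, and DefId is reduced to a plain integer.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for rustc's DefId.
type DefId = u32;
// A provider is just a function that computes the result of a query.
type Provider = fn(&mut QueryCtxt, DefId) -> String;

struct QueryCtxt {
    type_of_provider: Provider,
    type_of_cache: HashMap<DefId, String>,
    // Counts provider invocations so the effect of caching is visible.
    provider_calls: usize,
}

impl QueryCtxt {
    fn new(provider: Provider) -> Self {
        QueryCtxt {
            type_of_provider: provider,
            type_of_cache: HashMap::new(),
            provider_calls: 0,
        }
    }

    // The query: answer from the cache if possible, otherwise run the provider.
    fn type_of(&mut self, id: DefId) -> String {
        if let Some(ty) = self.type_of_cache.get(&id) {
            return ty.clone();
        }
        self.provider_calls += 1;
        let provider = self.type_of_provider;
        let ty = provider(self, id);
        self.type_of_cache.insert(id, ty.clone());
        ty
    }
}

// A toy provider. Item 1 is defined in terms of item 0, so computing it
// issues another query on demand.
fn type_of_provider(cx: &mut QueryCtxt, id: DefId) -> String {
    match id {
        0 => "u32".to_string(),
        1 => format!("Vec<{}>", cx.type_of(0)),
        _ => "unknown".to_string(),
    }
}

fn main() {
    let mut cx = QueryCtxt::new(type_of_provider);
    println!("{}", cx.type_of(1));
    println!("{}", cx.type_of(1)); // served from the cache
    println!("provider calls: {}", cx.provider_calls);
}
```

Nothing is computed until some query asks for it, and a result that was already computed is never recomputed. The real query system additionally tracks dependencies between queries so that cached results can be reused across compilations.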

The query system of the Rust compiler also gave rise to Salsa, a general framework for on-demand incremental computation. You can learn more about how the query system works from the Salsa book.

Source code reading: name resolution component rustc_resolve

The first phase of source code reading focuses on the rustc_resolve library, which is responsible for name resolution.

From the background on the Rust compiler architecture above, we know that the name resolution in rustc_resolve happens during the parsing phase and helps produce the final abstract syntax tree. Therefore, this library does not use the query system.

This is also why this library was chosen for the first phase of source code reading: it does not involve the relatively complex query system.

The module structure of a crate is built here, and macro paths, module imports, expressions, types, patterns, labels, and lifetimes are resolved here.

Type-dependent name resolution (methods, fields, associated items) happens in rustc_typeck.

Name resolution in Rust

Looking into the material on name resolution, I learned that RFC 1560 was introduced for the Rust compiler in 2016 to improve how name resolution is handled.

Before that, name resolution was handled early in the compiler, after the AST was lowered to HIR. The AST was traversed three times: the first traversal built a reduced graph, the second resolved names, and the third checked for unused names. The reduced graph is a record of all definitions and imports in the program.

RFC 1560 splits name resolution into two stages. The first stage happens together with macro expansion and resolves imports, defining a name-to-definition mapping for each scope. The second stage looks up definitions by name in that whole mapping. The purpose is decoupling.

RFC 1560 has now been implemented. During macro expansion, full name resolution is not performed; only imports and macros are resolved. Once the entire AST has been built, full name resolution is performed to resolve all names in the crate.

Let's take an example:

#![allow(unused)]
fn main() {
    type x = u32;
    let x: x = 1;
    let y: x = 2;
}

The code above compiles. Here x is both the name of a type and the name of a variable. How does Rust resolve names so that two identifiers with the same name can coexist?

Because Rust has different namespaces, and symbols of different kinds live in different namespaces. For example, types and variables do not conflict. Each namespace has its own independent stack of ribs (a rib is the compiler's abstraction of a scope; a let binding, a block delimited by curly braces, a macro definition scope, and so on are each a rib).
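
The idea of one rib stack per namespace can be sketched like this. The Ribs struct and its methods are hypothetical and greatly simplified; rustc's real rib types carry much more information, and only two namespaces are modeled here.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy)]
enum Namespace {
    Type,
    Value,
}

// One rib is one scope: a map from a name to a description of its definition.
type Rib = HashMap<String, String>;

// Each namespace owns an independent stack of ribs.
#[derive(Default)]
struct Ribs {
    type_ribs: Vec<Rib>,
    value_ribs: Vec<Rib>,
}

impl Ribs {
    fn stack(&mut self, ns: Namespace) -> &mut Vec<Rib> {
        match ns {
            Namespace::Type => &mut self.type_ribs,
            Namespace::Value => &mut self.value_ribs,
        }
    }

    // Entering a new scope pushes a fresh rib.
    fn push_rib(&mut self, ns: Namespace) {
        self.stack(ns).push(Rib::new());
    }

    // Defines `name` in the innermost rib of the given namespace.
    fn define(&mut self, ns: Namespace, name: &str, what: &str) {
        self.stack(ns)
            .last_mut()
            .expect("push a rib before defining names")
            .insert(name.to_owned(), what.to_owned());
    }

    // Searches from the innermost rib outward, in one namespace only.
    fn resolve(&self, ns: Namespace, name: &str) -> Option<&str> {
        let stack = match ns {
            Namespace::Type => &self.type_ribs,
            Namespace::Value => &self.value_ribs,
        };
        stack
            .iter()
            .rev()
            .find_map(|rib| rib.get(name))
            .map(|s| s.as_str())
    }
}

fn main() {
    let mut ribs = Ribs::default();
    ribs.push_rib(Namespace::Type);
    ribs.push_rib(Namespace::Value);
    // Mirrors `type x = u32; let x: x = 1;` from the example above.
    ribs.define(Namespace::Type, "x", "type alias for u32");
    ribs.define(Namespace::Value, "x", "local variable");
    println!("{:?}", ribs.resolve(Namespace::Type, "x"));
    println!("{:?}", ribs.resolve(Namespace::Value, "x"));
}
```

Because a lookup only ever searches the stack of the namespace it was asked about, the type x and the variable x never see each other.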

Next, let's use the official three-stage reading method to read the source code of the library.

Overall module structure of rustc_resolve

When reading the rustc_resolve library, I start with its documentation. A crate's documentation can clearly show the overall structure of the crate.

https://doc.rust-lang.org/stable/nightly-rustc/rustc_resolve/index.html

Modules

  • build_reduced_graph: after an AST fragment is obtained from macro expansion, the code in this module helps integrate the fragment into the partially built module structure.
  • check_unused, as the name suggests, detects unused imports.
  • def_collector creates a DefId (definition ID) for AST nodes.
  • diagnostics, the diagnostic information emitted on failure.
  • imports, a collection of methods and structures related to resolving imports.
  • late: "late resolution" is the process of resolving most names, everything except imports and macros. It runs once the crate is fully expanded and the module structure is fully built, so it simply walks the crate and resolves all expressions, types, and so on. There is no corresponding early module because early resolution is scattered across build_reduced_graph.rs, macros.rs, and imports.rs.
  • macros, a collection of methods and structures related to resolving macros.

Structs

Error types

  • AmbiguityError, ambiguity error
  • BindingError, binding error
  • PrivacyError, visibility error
  • UseError, use error

Data types

  • DeriveData, data related to derive
  • ExpandHasher, a hasher related to expansion
  • ModuleData, the data of one node in the module tree
  • ExternPreludeEntry, related to the extern prelude
  • NameBinding, which records a value, type, or module definition that may be private
  • UsePlacementFinder, related to use placement

Namespace and scope

  • PerNS, a helper type that holds a separate value for each namespace
  • ParentScope, which records the starting point of the scope visitor
  • Segment, a minimal representation of a path segment

Resolver related

  • Resolver, the main resolver type
  • ResolverArenas, which provides memory for the other parts of the crate, following the arena pattern

Enums

The enum types, which fall into categories similar to those of the structs, are not listed here.

Traits

  • ToNameBinding, which uses an arenas reference to convert a value into a NameBinding reference

Functions

Some auxiliary functions.

Type aliases

Some type aliases are defined.

Dependent crates

In the Cargo.toml of rustc_resolve, you can see its dependency crates:

  • rustc_ast, which defines the AST data structure used internally by Rust
  • rustc_arena, the compiler's internal global memory pool, used to allocate memory. The lifetime of the allocated memory is 'tcx
  • rustc_middle, the main library of the Rust compiler, contains common type definitions used in other libraries
  • rustc_attr, which is related to compiler built-in attributes
  • rustc_data_structures defines many data structures used internally by compilers, including thread safe data structures required for parallel compilation
  • rustc_errors, which defines the utilities commonly used by the compiler to report errors
  • rustc_expand for macro expansion.
  • rustc_feature, which defines the feature gates in the compiler
  • rustc_hir, defines HIR related data types
  • rustc_index, a newtype wrapper around usize, which is used for the compiler's internal indices
  • rustc_metadata, some link meta information related to Rust static library and dynamic library
  • rustc_query_system, Rust query system
  • rustc_session, related to error handling during compilation and to built-in lints
  • rustc_span, which defines the data types related to source code locations and also includes information related to macro hygiene.

The above is just a list of the main dependencies. As of today (November 13, 2021), we can see that the name resolution library now also depends on the query system (rustc_query_system).

Next, let's take a look at what is defined in lib.rs.

It can be seen that the structures or enumeration types defined in lib.rs are basically those used in the name resolution process shown in the above documents.

Here are several types that are easy to understand:

Scope enumeration type:

// The specific scope used to find the name can only be used in the early resolution process, such as import and macro, but not in late resolution.
/// A specific scope in which a name can be looked up.
/// This enum is currently used only for early resolution (imports and macros),
/// but not for late resolution yet.
#[derive(Clone, Copy)]
enum Scope<'a> {
    DeriveHelpers(LocalExpnId),
    DeriveHelpersCompat,
    MacroRules(MacroRulesScopeRef<'a>),
    CrateRoot,
    // The node ID is for reporting the `PROC_MACRO_DERIVE_RESOLUTION_FALLBACK`
    // lint if it should be reported.
    Module(Module<'a>, Option<NodeId>),
    RegisteredAttrs,
    MacroUsePrelude,
    BuiltinAttrs,
    ExternPrelude,
    ToolPrelude,
    StdLibPrelude,
    BuiltinTypes,
}

Segment structure:

// A minimal representation of a path segment:
// For example, std::sync::Arc is a path whose segments are separated by `::`
/// A minimal representation of a path segment. We use this in resolve because we synthesize 'path
/// segments' which don't have the rest of an AST or HIR `PathSegment`.
#[derive(Clone, Copy, Debug)]
pub struct Segment {
    ident: Ident,
    id: Option<NodeId>,
    /// Signals whether this `PathSegment` has generic arguments. Used to avoid providing
    /// nonsensical suggestions.
    has_generic_args: bool,
}

LexicalScopeBinding enumeration:

// Item, visible throughout the whole block
// Res, visible only from the place where it is defined onward
/// An intermediate resolution result.
///
/// This refers to the thing referred by a name. The difference between `Res` and `Item` is that
/// items are visible in their whole block, while `Res`es only from the place they are defined
/// forward.
#[derive(Debug)]
enum LexicalScopeBinding<'a> {
    Item(&'a NameBinding<'a>),
    Res(Res),
}
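
The difference between the two kinds of binding can be observed in ordinary Rust code. The following small program is an illustration written for this article (the names demo and helper are arbitrary): an item can be used before its definition within the same block, while a let binding cannot.

```rust
fn demo() -> u32 {
    // `helper` is an item: it is visible in the whole block,
    // so it can be called before its definition appears.
    let a = helper();

    fn helper() -> u32 {
        1
    }

    // `b` is a local binding: it is visible only from its `let` onward.
    // Uncommenting the next line fails to compile, because `b` is not
    // yet in scope at that point.
    // let early = b;
    let b = a + 1;
    b
}

fn main() {
    println!("{}", demo());
}
```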

ModuleKind enumeration

#[derive(Debug)]
enum ModuleKind {
    // Interestingly, we found that there is also an anonymous module in the classification of internal modules. A block is an anonymous module
    /// An anonymous module; e.g., just a block.
    ///
    /// ```
    /// fn main() {
    ///     fn f() {} // (1)
    ///     { // This is an anonymous module
    ///         f(); // This resolves to (2) as we are inside the block.
    ///         fn f() {} // (2)
    ///     }
    ///     f(); // Resolves to (1)
    /// }
    /// ```
    Block(NodeId),
    /// Any module with a name.
    ///
    /// This could be:
    ///
    /// * A normal module – either `mod from_file;` or `mod from_block { }` –
    ///   or the crate root (which is conceptually a top-level module).
    ///   Note that the crate root's [name][Self::name] will be [`kw::Empty`].
    /// * A trait or an enum (it implicitly contains associated types, methods and variant
    ///   constructors).
    Def(DefKind, DefId, Symbol),
}

AmbiguityKind enumeration

// Kinds of ambiguity
#[derive(Clone, Copy, PartialEq, Debug)]
enum AmbiguityKind {
    Import,                  // Multiple import sources
    BuiltinAttr,             // Naming conflict with a built-in attribute
    DeriveHelper,            // Naming conflict with a derive helper
    MacroRulesVsModularized, // Conflict between a macro_rules name and a non-macro_rules name
    GlobVsOuter,
    GlobVsGlob,
    GlobVsExpanded,
    MoreExpandedVsOuter,
}

Resolver<'a> structure

// This is a structure mainly used for resolution. It is a large structure that contains the data information required for the name resolution process
/// The main resolver class.
///
/// This is the visitor that walks the whole crate.
pub struct Resolver<'a> {
    session: &'a Session,

    definitions: Definitions,

    graph_root: Module<'a>,

    prelude: Option<Module<'a>>,
    extern_prelude: FxHashMap<Ident, ExternPreludeEntry<'a>>,
    // ...
}

// Used for memory allocation inside the resolver
pub struct ResolverArenas<'a> {
    modules: TypedArena<ModuleData<'a>>,
    local_modules: RefCell<Vec<Module<'a>>>,
    imports: TypedArena<Import<'a>>,
    name_resolutions: TypedArena<RefCell<NameResolution<'a>>>,
    ast_paths: TypedArena<ast::Path>,
    dropless: DroplessArena,
}

Next are some functions, including report_errors / report_conflict / add_suggestion_for_rename_of_use and other functions for compiler diagnostic information.

Focus on Problems

We now have a sufficient and systematic understanding of the background related to the name resolution function. Let's look at some code details.

Following the official advice on reading source code, this step should go deep, focusing on some functions of interest or doubt.

I'm interested in how rustc checks for unused variables, so let's focus on the related functions in the check_unused.rs module.

The module's documentation notes that checking for unused imports is split into three steps:

Step 1: UnusedImportCheckVisitor walks the AST to find all unused imports inside UseTrees, recording their NodeIds and the use item they belong to.

Unused trait imports cannot be checked at this point; that check is done in rustc_typeck/check_unused.rs.

As the background above explains, check_unused happens in the third AST traversal. The UseTrees have already been built by the previous two traversals, so all that remains is to check for unused NodeIds:

struct UnusedImport<'a> {
    use_tree: &'a ast::UseTree,
    use_tree_id: ast::NodeId,
    item_span: Span,
    unused: FxHashSet<ast::NodeId>,  // FxHashSet, the compiler's fast hash set, stores the unused NodeIds
}

impl<'a> UnusedImport<'a> {
    fn add(&mut self, id: ast::NodeId) {
        self.unused.insert(id);
    }
}

struct UnusedImportCheckVisitor<'a, 'b> {
    r: &'a mut Resolver<'b>,
    /// All the (so far) unused imports, grouped path list
    unused_imports: NodeMap<UnusedImport<'a>>,
    base_use_tree: Option<&'a ast::UseTree>,
    base_id: ast::NodeId,
    item_span: Span,
}

impl<'a, 'b> UnusedImportCheckVisitor<'a, 'b> {
    // We have information about whether `use` (import) items are actually
    // used now. If an import is not used at all, we signal a lint error.
    fn check_import(&mut self, id: ast::NodeId) {
        /* do something */
    }
    
}

// Implement the Visitor trait defined in rustc_ast, which is an application of the visitor pattern in the Rust compiler
// The Visitor trait defines hook methods for visiting AST nodes, so that a concrete visitor can implement the specific methods it needs
// The concrete visitor here is UnusedImportCheckVisitor
impl<'a, 'b> Visitor<'a> for UnusedImportCheckVisitor<'a, 'b> {
      fn visit_item(&mut self, item: &'a ast::Item) { /* do something */ }
      fn visit_use_tree(&mut self, use_tree: &'a ast::UseTree, id: ast::NodeId, nested: bool) { /* do something */ }
}
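
The visitor pattern described in the comments above can be sketched with a toy AST. Every type here (UseItem, Item, UnusedUseCollector) is hypothetical and far simpler than the real rustc_ast types, and the used flag is an assumption made so that the example is self-contained; in rustc that information comes from the resolver.

```rust
// Hypothetical mini AST; the real types live in rustc_ast and are far richer.
struct UseItem {
    id: u32,
    used: bool,
}

enum Item {
    Use(UseItem),
    Mod(Vec<Item>),
}

// The trait supplies a default traversal; a concrete visitor overrides
// only the hooks it cares about.
trait Visitor: Sized {
    fn visit_item(&mut self, item: &Item) {
        walk_item(self, item);
    }
    fn visit_use(&mut self, _use_item: &UseItem) {}
}

// The shared walking logic: dispatches each node to the matching hook.
fn walk_item<V: Visitor>(visitor: &mut V, item: &Item) {
    match item {
        Item::Use(use_item) => visitor.visit_use(use_item),
        Item::Mod(items) => {
            for child in items {
                visitor.visit_item(child);
            }
        }
    }
}

// A concrete visitor that records the ids of unused `use` items.
struct UnusedUseCollector {
    unused: Vec<u32>,
}

impl Visitor for UnusedUseCollector {
    fn visit_use(&mut self, use_item: &UseItem) {
        if !use_item.used {
            self.unused.push(use_item.id);
        }
    }
}

fn collect_unused(root: &Item) -> Vec<u32> {
    let mut visitor = UnusedUseCollector { unused: Vec::new() };
    visitor.visit_item(root);
    visitor.unused
}

fn sample() -> Item {
    Item::Mod(vec![
        Item::Use(UseItem { id: 1, used: true }),
        Item::Use(UseItem { id: 2, used: false }),
        Item::Mod(vec![Item::Use(UseItem { id: 3, used: false })]),
    ])
}

fn main() {
    println!("unused use items: {:?}", collect_unused(&sample()));
}
```

The traversal logic lives in one place (walk_item), so each new analysis only has to implement the hooks it needs.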

Step 2: calc_unused_spans walks the use items marked in the previous step and collects the Spans associated with their NodeIds.

fn calc_unused_spans(
    unused_import: &UnusedImport<'_>,
    use_tree: &ast::UseTree,
    use_tree_id: ast::NodeId,
) -> UnusedSpanResult {
    /* do something */
    match use_tree.kind {
        ast::UseTreeKind::Simple(..) | ast::UseTreeKind::Glob => { /* do something */ }
        ast::UseTreeKind::Nested(ref nested) => {/* do something */}
    }
    /* do something */
}
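
To give a feel for the recursion over nested use trees, here is a simplified sketch. It is not the real calc_unused_spans: it works on ids instead of Spans, and the UseTree, Removal, leaf_ids, and calc_removal names are all invented for illustration. The idea shown is that a nested group whose members are all unused can be removed as a whole, while a partially used group only loses its unused members.

```rust
use std::collections::HashSet;

// Hypothetical, simplified use tree: `use a::{b, c::{d, e}};`
enum UseTree {
    Simple(u32),          // a single imported name, identified by an id
    Nested(Vec<UseTree>), // a `{...}` group
}

#[derive(Debug, PartialEq)]
enum Removal {
    Keep,            // everything in this tree is used
    Whole,           // the entire tree is unused and can be removed
    Parts(Vec<u32>), // only these leaves are unused
}

// Collects the ids of all leaves below a tree.
fn leaf_ids(tree: &UseTree) -> Vec<u32> {
    match tree {
        UseTree::Simple(id) => vec![*id],
        UseTree::Nested(children) => children.iter().flat_map(leaf_ids).collect(),
    }
}

fn calc_removal(tree: &UseTree, unused: &HashSet<u32>) -> Removal {
    match tree {
        UseTree::Simple(id) => {
            if unused.contains(id) {
                Removal::Whole
            } else {
                Removal::Keep
            }
        }
        UseTree::Nested(children) => {
            let mut parts = Vec::new();
            let mut all_unused = true;
            for child in children {
                match calc_removal(child, unused) {
                    Removal::Keep => all_unused = false,
                    Removal::Whole => parts.extend(leaf_ids(child)),
                    Removal::Parts(ids) => {
                        all_unused = false;
                        parts.extend(ids);
                    }
                }
            }
            if all_unused {
                Removal::Whole
            } else if parts.is_empty() {
                Removal::Keep
            } else {
                Removal::Parts(parts)
            }
        }
    }
}

// Models `use a::{b, c::{d, e}};` with b = 1, d = 2, e = 3.
fn sample() -> UseTree {
    UseTree::Nested(vec![
        UseTree::Simple(1),
        UseTree::Nested(vec![UseTree::Simple(2), UseTree::Simple(3)]),
    ])
}

fn main() {
    let unused: HashSet<u32> = [2, 3].into_iter().collect();
    println!("{:?}", calc_removal(&sample(), &unused));
}
```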

Step 3: check_crate emits the diagnostics based on the data generated in the previous steps.

impl Resolver<'_> {
    // Implement the check_unused method for Resolver
    crate fn check_unused(&mut self, krate: &ast::Crate) {
        /* do something */
        // Check import source
        for import in self.potentially_unused_imports.iter() {
            match import.kind {
                ImportKind::MacroUse => { /* do something */ }
                ImportKind::ExternCrate { .. } =>  { /* do something */ }
            }
        }
        let mut visitor = UnusedImportCheckVisitor {
            r: self,
            unused_imports: Default::default(),
            base_use_tree: None,
            base_id: ast::DUMMY_NODE_ID,
            item_span: DUMMY_SP,
        };
        visit::walk_crate(&mut visitor, krate);
        for unused in visitor.unused_imports.values() {
             let mut fixes = Vec::new(); // Record for cargo fix
             /* do something */
             // Calculate unused location information
             let mut spans = match calc_unused_spans(unused, unused.use_tree, unused.use_tree_id) {
              /* do something */
             }
             /* do something */
             // Send diagnostic messages
             visitor.r.lint_buffer.buffer_lint_with_diagnostic(
                UNUSED_IMPORTS,
                unused.use_tree_id,
                ms,
                &msg,
                BuiltinLintDiagnostics::UnusedImports(fix_msg.into(), fixes),
            );
        }
    }
}

By reading this part of the code, we get a general picture of how the rustc_resolve library is organized:

  • The main Resolver-related types and methods are defined in lib.rs.
  • The specific resolution methods are implemented in separate Resolver function modules, such as check_unused.

Back to the whole module

Then we return to the overall module to understand the other parts of the code.

We know that the first AST traversal will build a reduced graph, so this process must correspond to build_reduced_graph.rs module.

We can see that the module imports rustc_ast, rustc_expand, rustc_data_structures::sync::Lrc (equivalent to Arc), rustc_hir::def_id, and other related components, which suggests that it is tied to macro expansion and also supports parallel compilation.

impl<'a> Resolver<'a> {
    crate fn define<T>(&mut self, parent: Module<'a>, ident: Ident, ns: Namespace, def: T) where
        T: ToNameBinding<'a>,
    {
        let binding = def.to_name_binding(self.arenas);
        let key = self.new_key(ident, ns);
        // https://github.com/rust-lang/rust/blob/master/compiler/rustc_resolve/src/imports.rs#L490
        // try_define is defined in the imports module and is used to check the binding name when parsing the import
        if let Err(old_binding) = self.try_define(parent, key, binding) {
            // If there is a naming conflict, report_conflict is called here to issue an error report
            self.report_conflict(parent, ident, ns, old_binding, &binding);
        }
    }
    fn get_nearest_non_block_module(&mut self, mut def_id: DefId) -> Module<'a>  {/* do something */}
    crate fn get_module(&mut self, def_id: DefId) -> Option<Module<'a>>  {/* do something */}
    crate fn expn_def_scope(&mut self, expn_id: ExpnId) -> Module<'a>  {/* do something */}
    crate fn build_reduced_graph(
        &mut self,
        fragment: &AstFragment,
        parent_scope: ParentScope<'a>,
    ) -> MacroRulesScopeRef<'a>  {/* do something */}
    
}

These are the Resolver methods needed to build the reduced graph. We won't look at the specific details; understanding the overall process is enough.
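
The define / try_define flow shown above can be modeled in a few lines. This is a minimal sketch under stated assumptions: the Module struct, its string-based bindings, and the errors list are hypothetical simplifications, whereas the real code works with NameBinding values allocated in arenas and reports conflicts through the diagnostics machinery.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Namespace {
    Type,
    Value,
}

// Hypothetical module: bindings are keyed by (name, namespace).
#[derive(Default)]
struct Module {
    bindings: HashMap<(String, Namespace), String>,
    errors: Vec<String>,
}

impl Module {
    // Records the binding, or returns the old one if the key is taken.
    fn try_define(&mut self, name: &str, ns: Namespace, def: &str) -> Result<(), String> {
        match self.bindings.entry((name.to_owned(), ns)) {
            Entry::Occupied(old) => Err(old.get().clone()),
            Entry::Vacant(slot) => {
                slot.insert(def.to_owned());
                Ok(())
            }
        }
    }

    // Like try_define, but turns a conflict into a recorded error.
    fn define(&mut self, name: &str, ns: Namespace, def: &str) {
        if let Err(old) = self.try_define(name, ns, def) {
            self.errors.push(format!(
                "`{}` is defined multiple times (first as {}, then as {})",
                name, old, def
            ));
        }
    }
}

fn main() {
    let mut module = Module::default();
    module.define("x", Namespace::Type, "type alias");
    module.define("x", Namespace::Value, "local variable"); // no conflict
    module.define("x", Namespace::Value, "function");       // conflict
    println!("{:?}", module.errors);
}
```

Because the key includes the namespace, the same name can be defined once per namespace, and only a second definition in the same namespace is reported.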

Reference link

  • https://github.com/rust-lang/rustc-reading-club
  • https://www.manning.com/books/the-programmers-brain
  • https://courses.cs.washington.edu/courses/cse401/07au/CSE401-07sem.pdf
  • https://github.com/rust-lang/rfcs/blob/master/text/1560-name-resolution.md

Posted on Wed, 17 Nov 2021 05:13:11 -0500 by csousley