Syntax Reference
Copyright © 2020 Synopsys, Inc. All rights reserved worldwide.
CodeXM is all of the following things:
In CodeXM, as in Coverity Analysis, the central element is a checker: a portion of code that describes what you’re looking for, and what to do when you find it. A CodeXM file can define any number of these checkers, along with supporting functionality. Each checker can examine code in order to look for different kinds of patterns.
A number of predefined utilities are provided to support CodeXM checkers. These include both language-specific predefined patterns and functions, and general-purpose functions for use with any supported language.
As mentioned above, CodeXM is a functional programming language. A functional language is slightly different from most programming languages commonly employed in production environments today: These languages are variously described as being procedural or imperative languages.
In those other languages, you typically define a set of steps: first do this; when that is complete, do the next thing; and so on. In CodeXM, as in other functional languages, the logic is expressed in expressions, not statements. Using a functional language is more like defining the formulas in a spreadsheet, than it is like programming a procedural language such as C or Java.
In CodeXM you specify where you’re looking, and what you’re looking for. You also describe how to present the results when a CodeXM checker finds the pattern you’ve defined
Though it uses expressions rather than statements, CodeXM should still be easy to learn if you have programmed before. Most of the inner workings of code analysis are left unexposed, and what does show through is revealed in ways that ought to be familiar to developers who use typical programming languages. If you’re looking for a goto statement or an if statement, you will write a CodeXM checker that use a gotoStatement or an ifStatement pattern to find—or, in CodeXM parlance, to match—these statements in the target code.
If you have not already done so, we strongly recommend that you read the QuickStart Tutorial section of the Learning to Write CodeXM Checkers guide. This lesson gives you a taste of actually writing a CodeXM checker.
This syntax reference only describes the grammar of CodeXM: the rules of the CodeXM language. When you write a checker, as the “QuickStart” shows, you will also need to work with the patterns that are provided by the libraries for the various supported languages: Each such library has a reference manual of its own.
Describing CodeXM is a bit different from describing other languages. Most programming books talk about one language only. To describe CodeXM, we have to talk about CodeXM code itself, and the code of the language it is inspecting—the target language.
This reference shows all code (except pseudocode) in monospace
type.
CodeXM code appears in shades of green.
Target code (whatever the language) appears in shades of orange.
The type of certain pattern properties is said to be nullable. This means that the property might return a value, or it might not. If it does not, its value matches the CodeXM keyword null. In the CodeXM documents, a nullable type is indicated by the name of the type the property might return, followed by a question-mark; for example, int?.
In checker code, a nullable type requires some special handling in order to avoid the error of referencing the null. The “Handling Null Values” section of Learning to Write CodeXM Checkers describes how to do this.
This reference, like much programming-language documentation, uses both railroad diagrams and a version of EBNF to describe the syntax of CodeXM.
Syntax diagrams (often called “railroad diagrams”) have long been used to visually illustrate the syntax of various programming languages. Because they are easy to read, we use them in this document, too.
In case you’ve never seen these before and are unfamiliar with how to read them, the notion is that you follow the path from left to right. If an item lies on the path, it must be part of the code. If the path branches, you can choose one of the options that are shown. The path can even loop back to indicate a repeating element, but you cannot reverse direction to go against the overall movement from left to right.
When you reach the right end of the diagram, you have constructed an example of the portion of syntax that the diagram describes.
Syntax diagrams use a few visual conventions to help you see what the diagram is describing. In addition to the “railroad-track” path, the diagrams can include the following three elements:
The following text and graphics show some examples of syntax diagrams.
The syntax requires keyword to introduce another portion of syntax:
The syntax requires one of two choices. You can pick one or the other, but you have to make a choice:
The syntax allows an option. You can include the option, or omit it:
The syntax allows the something element to repeat an indefinite number of times:
The syntax allows the something element to repeat an indefinite number of times, but each additional something must be preceded by a plus sign:
As you see, syntax diagrams are easy to follow, but they are versatile enough to fully describe the syntax of any context-free grammar.
This reference also describes CodeXM syntax by using Extended Backus-Naur Form (EBNF). EBNF is a notation that uses text rather than graphics to describe programming-language syntax.
Attention: The syntax diagrams and EBNF notation both describe exactly the same syntax. However, they do so in slightly different ways. Do not be alarmed if there is not always a one-to-one correspondence between the symbols of the two notations.
Many variations of EBNF are in use. The variation we use, and describe in this section, is based on that used by the W3C. This version is comparatively simple, clean, and free of clutter.
The grammatical elements described in the previous list can optionally be qualified by one of the following regular-expression conventions.
Reminder: When a qualifier follows a parenthesized group, the qualifier applies to the entire group as if the group were a single entity.
For example, the following EBNF specifies a lowercase letter followed by zero or more digits:
[a-z][0-9]*
Careful! No parentheses group the range of letters [a-z] with the range of digits, so the asterisk applies only to the range of digits. This character sequence begins with a lowercase letter, but there can be only one of these.
For example, the following EBNF specifies a sequence of one or more digits—in other words, it can represent any positive integer:
[0-9]+
For example, the following EBNF specifies an optional semicolon:
';'?
Characters in a CodeXM source file are limited to the original 95 ASCII characters that can be typed or printed: This includes the uppercase and lowercase letters from A to Z, the ten digits, and 33 special characters in common use. Extended characters such as letters with accents are not supported.
As with most software languages, white space in CodeXM is largely discretionary (but see our recommendations in “A CodeXM Style Guide”). The exception is that you need white space to separate adjacent keywords and identifiers.
For example, function myFunction parses correctly, but functionmyFunction does not: You need a space to separate the keyword function from the identifier myFunction.
You can’t put a space within an identifier.
Spaces within a literal character sequence, such as a string value or a member of an enum, are allowed and are significant.
In CodeXM, you can place a comment anywhere that white space is allowed and is not significant. The format of comments in CodeXM will be familiar if you have used C, C++, or related programming languages. Specifically, there are two kinds of comments:
Here are examples:
To fully answer the question posed by the title of this chapter would require a lot of detail. This chapter isn’t a thorough description, just a birds-eye view of what Coverity Analysis does. If you understand the methods described here, you will have an easier time writing CodeXM checkers.
For a compiled, procedural language, one of the first steps the compiler does is to parse the source code and construct a syntax tree. You might already be familiar with how these look.
For example, the source code x - y*2 + 1 can be expressed in tree form, as the following illustration shows:
Constructing a syntax tree is also one of the first steps taken when you run cov-analyze
.
A code compiler proceeds to traverse the syntax tree and use it to generate object code.
The cov-analyze
program also traverses the syntax tree, but in this case it does not generate code.
Instead it searches for the patterns specified by the checkers it is running.
Those checkers can be provided off-the-shelf, as components of Coverity, or they can be custom checkers written using either
CodeXM or the Extend SDK. When a checker finds a target pattern, it generates a report—an event.
A standard compiler generates a tree that accurately represents the source code. Coverity, on the other hand, applies some optimizations and simplifications, to facilitate the work of static analysis. (It would be possible to analyze literal source, but that would be much, much slower.) Because of this, the structure built by Coverity is known as an abstract syntax tree, or AST.
To use Coverity, you don’t need to know details about the AST structure or how analysis works. If you are writing checkers of your own, it can help to bear in mind that the checker is not analyzing the original source code, but a representation of it that is built of node objects organized into a tree.
In terms of CodeXM code, each of the nodes shown in the tree shown above is an astnode object. There are various types of these. Some nodes represent binary operators such as assignment, addition, and multiplication. Other nodes represent variables; still others represent literals such as integers or strings.
Note: There are no astnode objects for white space or comments.
A pattern you specify in CodeXM, for example, can look for instances of addition—this would locate the y*2 + 1 expression in the tree illustrated above, or any other occurrence of the + operator). A pattern can look for multiplication of a variable and a constant—as in y*2. It can also look for binomial expressions such as instances of addition (or subtraction) used with multiplication—such a pattern would match y*2+1 or z*3-4, but would not match 1+2.
A pattern you write can be as specific and as thorough as you need it to be.
Every CodeXM language library has an astnode object, and each astnode has a specific set of properties. In addition, there are various types of astnode objects: Each type has specific properties of its own, as well as the ones it inherits from the parent astnode object. See any of the language library references for the details.
Once it has constructed the abstract syntax tree, cov-analyze
inspects all possible execution paths.
This is one of the advantages of static analysis: It is fast enough to comprehensively look at what the code will do—unlike
dynamic analysis, which has to execute the target program, and so to be efficient has to rely on a sampling of behaviors.
Not that we recommend you use only static analysis. A robust testing strategy should employ both techniques, as appropriate.
CodeXM source files have a .cxm
filename extension.
This chapter describes the high-level syntax elements that can be present in such a file.
In your CodeXM source, begin with an include declaration that names the target; for example, include `C/C++`;.
Specifying the target language includes the library for that language. This makes language-specific patterns and functions available within the CodeXM file.
The language specification also causes the checkers defined in your CodeXM file to be applied only to appropriate target source code in your code base.
Identifiers name elements of CodeXM syntax: variables, record properties, and so on. All identifiers use the same syntax.
An identifier begins with either an underscore ( _ ) or a letter. This first character can then be followed by zero or more letters, underscores, or digits. Letters can be either upper case or lower case, but the two are not interchangeable. CodeXM is case sensitive, so the identifier variable is a different entity than Variable, and both are different from VARIABLE.
The characters in an identifier must be in the set of allowable CodeXM characters.
A CodeXM file consists of checker definitions and the declarations that support them.
The source file contains any number of declarations. Each declaration is terminated by a semicolon ( ; ). Declarations can appear in any order, except that a definition or declaration needs to appear before the first time a reference to it appears; in other words, the rule is Declare it before you use it.
Each of these definitions is described in greater detail below.
Most of a typical CodeXM file consists of checker-definition elements. A checker element defines what the checker is searching for. It also specifies the report to generate when the examined target code is found to match the pattern described by this definition. Other CodeXM elements work to support the logic embodied by checkers.
A checker is defined using the keyword checker followed by a record-expression.
When used to define a checker (as opposed to simply structuring data), a record-expression must contain a specific set of elements, structured in a specific way. These requirements are described in the “Details” section that follows.
At the top level, a checker must contain a record that contains two elements. The names of these elements are as follows:
The reports record contains the following components:
Finally, in its turn, a record in the events list contains the following components:
.cxm
file that contains multiple checkers.
The sample code in this section shows a simple but working checker that contains all the components that Coverity requires to run the checker successfully.
Sets the value of a checker’s optional reports > issueType component. The issueType describes the purpose of the checker.
Here is an example of a call to defineIssueType():
Name | Type | Description |
---|---|---|
type | symbol | A type identifier for the checker |
subtype | symbol? | (Optional) A subtype identifier for the checker; null if not specified |
name | string | The name of the new checker |
description | string | A description of the checker |
localEffect | string | The consequence that the detected issue has, or might have |
impact | string | The severity of the detected issue |
The type and subtype values are
identifiers, not strings. They must conform to customary rules for identifiers:
have a leading alphabetical character, followed by letters or digits or the underscore
( _ ). These rules are summarized
by the regular expression ^[a-zA-Z_][a-zA-Z0-9_]*$
. In addition, these identifiers must be
no longer than 64 characters.
A constant is a named value.
A constant declaration begins with the keyword let, followed by an identifer that names the constant, then an equals sign ( = ) followed by an expression that evaluates to the desired value.
You can optionally specify a type for the constant. If you do not, the type is inferred from the expression.
If a constant is declared at the file level, its scope is global.
Expressions can use a let binding to declare a named value. The usage is similar, but when let is used within an expression, the scope of the name is limited to the portion of the expression that follows the binding.
All identifiers must be declared before they are used. With a forward-declaration, you can declare an identifier before you provide its detailed definition. This can be convenient to do, especially when you declare two or more entities whose definitions are interdependent.
The declaration is introduced by the keyword declare, the identifier to declare, a colon ( : ), and then the type the identifier will have when the declaration is completed.
CodeXM allows you to define functions that can evaluate specific expressions, allowing you to give a descriptive name to what otherwise might be a complicated expression. This helps readability and allows you to use that expression in more than one place.
A function declaration is introduced by the keyword function and an identifier that names the function. This is optionally followed by generic type IDs, and then by at least one formal parameter that has a name and an explicit type. The return type of the function can optionally be specified, and then an arrow delimiter ( -> ) introduces the expression that defines the body of the function.
Each type-parameter-identifier is a generic placeholder for an actual type (see below). This identifier can be used within the parameter list as if it were a known type. It can also be used as the function’s return type, or in any portion of the expression that the function evaluates. When present, these identfiers appear immediately after the function name. Generic type identifiers are enclosed by angle brackets ( < and > ) and separated by commas ( , ).
Each parameter-identifier is an identifier for a formal, explicitly typed parameter. Each parameter name must be unique to the function. Named parameters are enclosed by parentheses ( ( and ) ) and separated by commas ( , ). Each named parameter has an associated type.
Each parameter’s type must be one of the following:
.cxm
) file
.cxm
file
Because CodeXM is a functional (rather than procedural) language, a function cannot generate side effects: For example, a function cannot set variables that persist beyond the scope of the function call itself.
A function can be recursive: The function’s name and type are accessible within the body of the function, enabling it to call itself.
For an example of recursive function use, see “Defining CodeXM Functions” in Learning to Write CodeXM Checkers.
By specifying generic parameters, as mentioned above, you are defining a parameter type (or return type) that can change according to the context in which the function is evaluated. For example, consider the following declaration:
... here T is the type-parameter-identifier. At run time, T might be passed as an int, a string, a bool, or any other type.
A function property is, in effect, a subordinate checker that provides information about functions in the source code. It allows the checker to match characteristics found in functions called by top-level functions.
An example might be, “Does this function, or any function it calls, call system()?”. Function properties permit a lightweight form of interprocedural analysis. See Function Properties in Learning to Write CodeXM Checkers.
Function properties are defined using the keyword phrase function property, followed by an identifier that names the function property, followed by a record that has a single property named models, specified by a pattern.
The include-directive performs two functions.
First, it incorporates the contents of the specified CodeXM file, as if it were present in the current file.
This permits you to write common support functionality and make it available to many different checkers,
or to aggregate many checkers (separated for reasons of maintainability) into a single file that you can specify
on the cov-analyze
command line.
Second, if an enum-literal-expression is present, and it indicates a language library, the expression ensures that the contents of this CodeXM file are applied only to code of the same language. That is to say, if your file contains the following line:
... it causes the checkers declared in your file to be applied only to C or C++ code.
The directive is introduced by the include keyword, followed by either of the following:
A pattern-definition lets you define code patterns that can be more complex than the predefined patterns provided by a language library.
Custom patterns can be built using other patterns that have already been defined (either library patterns or other custom patterns in your own CodeXM project).
A pattern definition is introduced by the pattern keyword, This is followed by an identifier to name the pattern, followed in turn by a match-list enclosed by curly braces ( { and } ).
A pattern can be recursive, but to implement this, you need to define a function; also see “Defining CodeXM Functions” in Learning to Write CodeXM Checkers.
With a type-definition you can define a name for a new type based on one of the standard CodeXM types. You might want to do this to define a special-purpose type; for example, a specialized enum or record type.
The type definition is introduced by the typedef keyword. This is followed by an identifier that gives the new type its name. Following that is an equals sign ( = ), and then the name of the base type.
One of the implications of CodeXM being a functional programming language is that, loosely speaking, “everything is an expression”. More precisely, CodeXM has no statements that are executed sequentially, one after another; rather, the logic in CodeXM takes the form of expressions that are evaluated. Each expression produces a result that might in turn be used to further evaluate some other expression.
After the file-level elements, expressions are the main building blocks of a CodeXM program.
Within an expression, operators have a precedence that determines the order of their evaluation in the absence of parentheses. This is a convention originated by algebra, and used by nearly all programming languages. CodeXM uses such a convention as well, though some of its operations are not commonly used in other contexts.
Operators also have an associativity, also applied in the absence of parentheses. This determines the order of evaluation when operators appear in a sequence; for example, the order in which to execute x + y + z would be ambiguous without associativity. Addition/subtraction and multiplication/division associate to the left, so the order of executing the previous example is ( x + y ) + z. (Parentheses have top priority, and override the default precedence of operations.)
The following table shows the order of precedence for the CodeXM operators. The top row has the highest precedence, the bottom row the lowest.
Operator | Associativity |
---|---|
() {} | not associative |
. | left |
[] | not associative |
as | not associative |
% | left |
new | not associative |
default | not associative |
matches | left |
!, unary - | right |
* / | left |
+ - ++ ?? | left |
<=> | left |
> < >= <= | left |
== != | left |
&& | left |
|| | left |
>=> | left |
?: | right |
in | not associative |
where | not associative |
Besides introducing the default clause of a switch-expression, you can use the keyword default as an operator that matches any value. The most common way to use default as an operator is in a matches-expression; for example:
... In other words, this expression sets the value of x to equal the value of the expiry property, regardless of what that value might be.
An addition-expression adds ( + ) or subtracts ( - ) two integer values.
This expression has two specialized uses as well:
String concatenation: You can also use the plus-sign operator ( + ) to concatenate strings. However, the result of string concatenation is not of type string, but of type eventstring: so this usage is a special case that is not a true addition-expression.
The set-expression evaluates to a set or a list.
The nullable-expression evaluates to a nullable type.
The null-coalescing operator ?? is not strictly speaking an addition operation, but it does have the same precedence as the other additive operators.
This operator evaluates its left-hand operand. If this is not null, this is the value that ?? returns; otherwise, it returns its right-hand operand. The point is to return a result that is usable in expression evaluations, and will not cause a run-time failure.
The following expression returns the same result as ??, but does so in a lengthier fashion:
For more information, see Handling Null Values in Learning to Write CodeXM Checkers.
A binary-expression has a single operand infixed between two other expressions. Binary operators include standard operations for arithmetic, comparison, and logic; a few others have a more special-purpose use in the context of CodeXM
Calls a function, using the specified arguments.
The list of arguments is enclosed by parentheses ( ( ) ), and arguments are separated by commas ( , ).
The number of arguments in the call must equal the number of arguments in the function-definition. The arguments passed can be either named or unnamed. If the name is specified, it is followed by a colon ( : ) and then by the value of that argument.
The function-producing-expression specifies a function that has already been declared (unless you use forward-declaration).
If an argument is named, that name must be the same as the corresponding formal parameter in that function’s definition. It is an error to use an argument name that doesn't appear in the definition. It is also an error to change the order of the arguments, whether or not you specify argument names.
For example, if a function is declared like this:
... it can be invoked like this (that is, not naming the arguments):
... or like this (naming the arguments, in the right order):
... or even this (naming only one argument):
... but the following is an error because reordering arguments is not permitted:
The comparison-expression uses common operators to perform tests for equality or inequality, and relations such as greater-than or less-than.
The operator <=> is the three-way comparison: Rather than a Boolean value, its result is –1 for less-than, 0 for equals, and +1 for greater-than.
Careful: The character-combination <> is not a valid operator, as it is in some languages. To compare for inequality, use the != operator.
In general, you can compare expressions of the following types:
Two nodes are equal only if they are exactly the same node in the syntax tree. Equality and inequality are the only valid comparisons.
Two symbols are equal if they represent the same entity; for example, the same variable. Equality and inequality are the only valid comparisons.
Two types are equal only if they represent the same type. Equality and inequality are the only valid comparisons.
Type comparison does acknowledge typedef aliasing; for example, after the statement typedef int Integer the int type will equal the Integer type.
However, a qualifier such as const changes the fundamental type (in this case, from a variable to a constant). So the pointer type const int * does not equal the pointer type int *.
All the comparisons are available, including relational comparisons such as > or >=.
A string comparison depends on the target language and its file format. You can use relational comparisons on strings. If the string contains Roman letters, the relation is based on alphabetical order; for example, "l" > "c". If the string uses non-Roman characters such as Unicode, the result of a comparison is not well defined.
It is an error to compare two enum values if the types to which they belong have no values in common. Otherwise, normal comparisons as per strings apply.
You can compare nullable types. The value null is equal to itself and different from—specifically, less than—any non-null value.
Note: If the two values being compared are not of the same type, CodeXM uses unification to find a substitution that makes the two types comparable, if that is possible. Unification is an operation that is similar to casting (also known as type conversion).
The ?: (“ternary”) operator evaluates one of two expressions, depending on whether the specified condition is true or false.
A condition is simply an expression that evaluates to either true or false.
With the decomposing-pattern-expression, your checker can match not only a particular pattern, but the value of specified properties of that pattern. This expression is a way to narrow a search.
Pattern decomposition is expressed by a list enclosed by curly braces ( { and } ). Each item in the list consists of a property prefixed by a dot ( . ), an equals ( == ) or not-equals ( != ) assertion, and an expression that specifies the value of the property to match or to not match.
Each property-name is an identifier that names a property of the pattern.
Properties can nest, so the property-name can be a path to a property. In other words, .calledFunction.identifier == something is legal syntax. Properties with a nullable type along the path are handled correctly.
Each comparand-expression should evaluate to one of the following:
Suppose your checker uses a pattern, intLiteral, that matches integer literals in the target code. The intLiteral pattern has a property named value. The value property represents the numeric value of the integer literal.
If you use the pattern with no qualifications, as in the following code:
... then your checker matches any integer literal it finds in the code it is inspecting.
By comparison, if you use a property to qualify the pattern, as the following example shows:
... then your checker matches only instances of integer literals whose value equals 42.
This expression allows you to access individual elements within a list.
The notation is probably familiar to you from working with arrays in other programming languages: It is simply the name of the list followed by square brackets ( [ ] ) that enclose an index value that specifies a particular list entry.
The list-producing-expression is an expression that has already created a list.
The index-expression is a valid index to the list. The first element in the list is element zero (0), so the index of the last entry is one less than the list’s length property.
This expression is available to lists but not to sets or enums. Sets and enums are considered to be unordered. You can convert a list into a set, but the reverse is not true.
The exists-expression answers the question, “Is there at least one value in a set that meets a specified condition?” Its result, therefore, is simply either true or false.
The expression is introduced by the keyword exists, followed by an identifier that is used to traverse the set.
The keyword in introduces the set itself. (You can also specify a list to search.) It is followed optionally by the keyword globalset (which is necessary if the set in question is a globalset type) and then by an expression that evaluates to the set.
The keyword where introduces a condition-expression, presumably expressed in terms of the identifier, that describes the object of the search.
The exists-expression shown in the following code:
... can be considered shorthand for the following lengthier (and less readable) CodeXM pattern:
If the keyword globalset is present, the condition-expression cannot reference local variables that were defined outside of the loop. See “The globalset-type” for more information.
The for-accumulate-expression traverses a set or a list. For each member that it visits, it performs an operation. It accumulates the results of these operations, then returns the accumulated value as its result.
(If you are familiar with functional programming, you might recognize this as a fold operation over a set of elements.)
The syntax of the for-accumulate-expression is very similar to the syntax of the for-loop-expression. The main difference is the additional accumulate clause.
The keyword for introduces the loop variable. The loop variable will be assigned, in turn, each of the values in the set being examined. The set itself is introduced by the keyword in, followed by the variable or expression that represents the set.
The keyword accumulate follows the set-expression. It introduces the accumulator-identifier, which names the accumulator variable. This is followed by an equals sign ( = ) followed by an initial-expression.
(You have the option of also specifying an explicit type for the accumulator.)
To filter the set to only those members that meet a specific criterion, you can use the where keyword followed by a condition-expression that describes the criterion. (If you omit the filter condition, the loop visits every member of the set: It is as if the condition-expression were simply the value true.)
Finally, a colon ( : ) introduces the yielding-expression.
The variable named by identifier is within scope, and can be referred to, in both the condition-expression and the yielding-expression. The accumulator-identifier, and any variable that is defined in the where clause, is within the scope of the yielding-expression.
The set-expression must result in a set or a list.
The initial-expression sets the initial value of the accumulator variable.
The condition-expression describes a condition (criterion). Members of the set that do not meet this condition are not evaluated.
For each member that is evaluated, the result of the yielding-expression is assigned to the accumulator variable. When the loop has completed, it returns the final value of the accumulator.
The value of initial-expression is typically the identity value for the accumulator’s type:
The following pseudocode shows the very basic logic of a for-accumulate-expression, expressed procedurally:
Thus, the CodeXM expression that follows:
... might be expressed by the following pseudocode:
In a similar way, but using more complex expressions, the CodeXM code that follows finds the largest odd value (9, in this case) among a list of positive integers:
... and this might be expressed procedurally by the following pseudocode:
A for-loop-expression traverses a set or a list. Frequently a for loop is at the heart of a CodeXM checker, since analyzing code involves traversing the abstract syntax tree (AST) nodes of the target code, or a calling tree of functions in that code.
The keyword for introduces the loop variable. The loop variable will be assigned, in turn, each of the values in the set being examined. The set itself is introduced by the keyword in, followed by the variable or expression that represents the set.
If the set to examine is of the type globalset, then the for-loop-expression must include the keyword globalset, immediately following the keyword in.
To filter the set to only those members that meet a specific criterion, you can use the where keyword followed by a condition-expression that describes the criterion. (If you omit the filter condition, the loop visits every member of the set: It is as if the condition-expression were simply the value true.)
Finally, a colon ( : ) introduces the result-expression.
The variable named by identifier is within scope, and can be referred to, in both the condition-expression and the result-expression.
The set-producing-expression must result in a set or a list.
The condition-expression describes a condition (criterion) and the result-expression is evaluated for each member of the set that meets that condition.
The result of a for loop is a set that contains each result-expression that was found by the loop and whose value was not null.
For example, the following sample code shows a simple for-loop-expression (we presume that the function odd() has been coded previously, perhaps in a library):
The preceding code produces a set that contains twice the value of each odd number in the set-producing-expression. The following snippet shows this result:
If your for loop is the main loop of a checker, then its result-expression must have a specific form; namely, it must be an expression that contains a record named events. In its turn, events must define two properties: one named description and the other named location.
Both these properties contribute to the issue report: the description specifies a string to display, and the location specifies the code location where the issue was found.
For example, the following code shows the result-expression for a checker named UNINIT_VAR:
A fine point: In this example, code is the loop variable. When you traverse the globalset, the loop variable has a .location property that contains the source-line number where the loop criterion was met.
The following pseudocode might help you visualize how a for loop works, especially if you are mainly used to procedural languages:
That is to say, the result of a for loop begins as the empty set. When a member of the set-expression matches the specified condition AND that value of that result is not null, then the member is added to the resultset. When the loop completes, the resultset contains all the non-null matches from the traversal of set-expression.
The happens-before-expression takes two operands, and performs a Boolean test on the sequence in which those operations are invoked in the target code.
Each path-pattern-expression is a CodeXM pattern to match a statement or other executable code in the target language.
Be aware: This expression is valid only when it is used as an argument to the sequence() function. It is not allowed in other locations in the CodeXM source.
The two patterns do not have to be consecutive: There can be other executable code between them.
You can think of the sequence() function as a special kind of pattern that understands paths. For example, sequence(pattern_1 >=> pattern_2) returns true if pattern_1 executes in the target code before pattern_2 does, false otherwise.
The sequence() function reports on whether a particular execution sequence occurs (or does not occur) in the code being analyzed.
The argument to sequence() is a happens-before-expression. The function returns true if the happens-before-expression expression can be matched, false if it cannot.
Typically, sequence() is called from within a matches-expression.
For more information about using the sequence() function and the >=> operator, see “Path-Sensitive Analysis” in Learning to Write CodeXM Checkers.
The if-exists-expression finds the first member of a set that satisfies some criterion. It returns a result based on that member.
(The if-exists-expression differs from the exists-expression in the kind of information it provides. An exists expression reports only true or false about the existence of a member. An if exists expression actually returns the value of the member being searched for, or some computation based upon that value.)
The keywords if exists introduce an identifier that is used to traverse the set.
The keyword in introduces the set itself. (You can also specify a list to search.)
The keyword where can introduce an optional condition-expression.
Finally, a colon ( : ) introduces a yields-expression. If the member being sought is present in the set or list, and if it satisfies the condition (if one is specified), then the yields-expression is evaluated and becomes the result of the if exists search.
The result of an if-exists-expression is null if the set-producing-expression is empty, or if the set-producing-expression does not contain any member that meets the criterion expressed by the condition-expression.
While CodeXM is not procedural, this expression comes close. You can visualize its operation as executing the following pseudocode:
The if expression is a more versatile—and much more readable—construct than the conditional-expression. An if keyword introduces a conditional expression, followed by a then clause and an optional else clause. To handle additional conditions, before the else clause you can include any number of elsif clauses.
A condition is simply an expression that evaluates to either true or false.
Once the conditional-expression evaluates an if-true-expression, it does not evaluate any subsequent expressions in the if.
If a condition defines a variable (for example, via the matches-expression as keyword), that variable remains in scope in the corresponding if-true-expression— but it is not in scope or available in the if-false-expression.
Tip: To choose among multiple conditions, the switch-expression can be an alternative to coding an if statement that specifies multiple elsif clauses.
A lambda-expression is an unnamed function literal, frequently called an anonymous function.
In general, the criteria for using a pattern-expression also apply to a lambda-expression: Use these constructs in situations where the expression is only used once, and is simple enough to convey its intent without reducing the readability of the expression in which it appears.
Syntactically, a lambda is similar to a function-definition. But no name appears after the function keyword, and the return type is inferred from context rather than explicitly declared.
Each type-parameter-identifier is a generic placeholder for an actual type. This identifier can be used within the parameter list as if it were a known type, the return type, or any portion of the expression the function evaluates.
Each parameter-identifier is an identifier for an explicitly typed parameter. Each parameter name must be unique to the function.
Each parameter’s type must be one of the following:
.cxm
) file
.cxm
file
With a let-binding-expression, you can define a variable that becomes available for use in subsequent expressions.
The let-binding-expression is similar to const-definition. Each of these elements defines a variable whose value is constant for the duration of its scope. The difference is that const-definition defines a variable with global scope, which is evaluated (at most) only once during the life of the checker. The let-binding-expression defines a variable with local scope, whose use is limited to the expression specified by the let-binding-expression, and which is evaluated each time the let-binding-expression is evaluated.
The let keyword introduces identifier for the local variable, followed by an optional type, then an equals sign ( = ) followed by the expression to evaluate and assign to the variable. Finally, the in keyword introduces the expression during which the identifier will be usable.
The value-expression can be any expression. Its result is assigned to the identifier.
The scope-expression can be any expression, and defines the scope within which identifier is available (“in scope”).
Because CodeXM is a functional programming language (not a procedural one), the value of a “variable” does not change while the variable is in scope. You cannot assign a new and different value to an existing variable.
The let-binding-expression defines a variable, computes its value, then makes that variable available to a subsequent expression during which the variable is accessed. The following code scheme shows this construction:
Not really exceptions: With a for-loop-expression or a for-accumulate-expression, the loop variable does change over time. In both these cases, even though the variable changes while the loop is running, the value the loop returns cannot change any further, so loops are not an exception to the general rule.
A list-expression is a list of values considered to be an ordered group.
The list is enclosed by square brackets ( [ ] ). The items in the list are separated by commas ( , ).
Each expression in the list is evaluated when the list is created.
The order of items in the list is the order in which they are specified in the source code.
Elements do not have to be unique: More than one element can have the same value.
To access elements in a list, use either a for-loop-expression or an element-access-expression. You can also use the set-filter-expression to retrieve members of a list.
CodeXM provides several literal types.
(The int-literal and string-literal expressions correspond to the int and string builtin-types.)
Literals can take a variety of forms. A literal can be a string-literal-expression, an enum-literal-expression, or an int-literal-expression: Each of these has a syntactic structure of its own. The remaining literals are simply keywords, as shown in the syntax descriptors here, and described in the “Details” section that follows:
An enum literal is similar to a string literal, but it is enclosed by backticks ( ` ) rather than double quotes.
As with string literals, any embedded backticks need to be escaped by a preceding backslash ( \` ) in order to not be parsed as an end-of-enum marker.
The following literal uses correct syntax to include backticks:
`Including the \`backtick\` mark`
An integer literal is a positive 64-bit value. You can express integers in decimal, hexadecimal, or binary notation.
A positive integer value can range from 0 to 263–1.
You might recognize that this range implies that integers use a two’s complement representation. In other words, there is no “sign bit” in the CodeXM representation of integers. CodeXM does allow you to work with negative values, but the minus sign ( - ) is always treated as a unary operator, and not as part of the literal itself.
A string is a sequence of characters enclosed by double quotation marks ( " ).
A string can contain any allowable CodeXM character, but certain situations must be handled specially, as follows:
The following string uses correct syntax to include double quotation marks:
"This is a string that contains \"quoted\" text"
For example, if the source contains the following lines:
... then the variable str is assigned the string value "This is a multi-line string".
Careful: You cannot concatenate string variables in this way.
"The first line\nAnd the second line"
.cxm
.
A logical-expression expresses logical (as opposed to bitwise) AND or OR. The result of a logical operation is either true or false.
Two expressions are conjoined using either the logical AND ( && ) or the logical OR ( || ) operator.
Both of these operators are short-circuiting: In other words, if the overall truth of the operation can be known by evaluating the left-hand expression (that is, if it is true for || or false for &&) then the right-hand expression is not evaluated.
For logical AND ( && ) expressions, if the left-hand expression defines a variable via the matches-expression as keyword, that variable remains in scope for the right-hand expression as well.
Much of the power of CodeXM relies on pattern matching: comparing code (or other things) to a pattern to find which match that pattern.
The matches-expression is a predicate in the logical sense; that is, a statement that is true or false, depending on whether the condition it describes has been satisfied or not. In the context of CodeXM, the matches-expression compares an expression to a pattern. If the two do match, the matches-expression is successful. It can optionally define an appropriately typed variable that describes the results of that match.
The expression to match is followed by the keyword matches, and then by an expression to represent the pattern.
The pattern-producing-expression is simply an expression that produces a pattern. Typically this will be an identifier that names a pattern you have already created.
You have the option of using the keyword as to define a variable whose value is defined by the initial match and whose scope is the remainder of the matches-expression.
Most often you would define such a variable to further refine the match, as in the following commented code snippet:
Be careful: You cannot use the logical OR ( || ) in an expression of this sort, because OR implies that the expression to its right will be evaluated even when the match has failed. This is not logical, and indeed if the match fails, the value of the new variable v is undefined.
A multiplication-expression multiplies ( * ) or divides ( / ) two integer values.
Each operand must evaluate to an integer value.
Either operand can be a complex subexpression, as well as an integer variable or an integer literal.
If the divisor does not evenly divide the dividend, the result is a truncated value: The remainder is discarded.
Dividing by zero causes a run-time error.
As in algebraic notation and nearly all programming languages, in an expression multiplication has precedence over division, and a unary operation has precedence over multiplication.
A pattern-expression allows you to define a pattern inline—that is, within another expression. For patterns, it is the analog to a lambda (nameless) expression for functions.
In general, the criteria for using a lambda function also apply to a pattern-expression: Use these constructs in situations where the expression is only used once, and is simple enough to convey its intent without reducing the readability of the expression in which it appears.
A pattern definition is introduced by the pattern keyword, followed by the name (an identifier) by which the pattern will be known. This is followed by a match-list enclosed by curly braces ( { and } ).
A match-list is a sequence of pattern expressions. A checker evaluates the patterns in order: When a match is found, one or more expressions are evaluated, and the remaining pattern expressions are skipped.
Patterns in the list are separated by a vertical bar ( | ). A bar before the first pattern in the list is optional but recommended.
The patterns are tested in the order that they appear. If a match is found, then the yields-expression is evaluated. This becomes the value returned by the match-list, and no further patterns are tested.
If no yields-expression is specified, then the upper-level expression becomes the value returned by the match-list.
If an as clause specifies a variable, that variable is available (that is, “in scope”) for use in expressions until the match-list has finished evaluating.
Allows you to access individual proprties within a record.
An expression that refers to a collection of properties is followed by a dot ( . ), and then by the name of the property to access.
The property-producing-expression can simply be the name of an existing record. It can itself be a property-access-expression: This lets you write code that accesses the properties of a property.
The property-producing-expression can also be a function, a pattern, an astnode, or any other CodeXM type that has properties.
A record is a collection of property-value pairs.
Remember: In CodeXM, a checker takes the form of a record-expression that contains records nested in a particular way, with specific property-name values, some of them required.
A record-expression consists of a list of one or more properties, separated by semicolons ( ; ). Each property consists of a name, an equals sign ( = ), and an expression that represents the value of that property. The entire record-expression is enclosed by curly braces ( { and } ).
Each property-name is an identifier.
Each record has its own particular record-type.
A record-update-expression permits the contents of the indicated record expression to be updated with new values.
This expression can update the value of existing properties, but it cannot add new properties to a record structure.
The syntax of a record update is similar to that of a record-expression. Property specifications are also enclosed by matching curly braces ( { and } ), but the update expression begins by specifying an existing record, followed by the keyword with.
The final semicolon ( ; ) is optional.
The record-producing-expression is an expression that refers to a record that has already been created.
Each property-name in this expression must correspond to a property that was declared for the record-producing-expression.
The result of this expression is a record that has the same type as the original record-producing-expression, but with the specified properties assigned new values.
A set-expression is created by specifying a list of values, but a set is not considered to be ordered.
Each expression in the set is evaluated when the set is created.
The items in the set are not considered to have an order.
Elements do not have to be unique: More than one element can have the same value. This is comparable to a CodeXM list-expression, and differs from the usual mathematical meaning of a “set”.
To access an element in a set, you can use a for-loop-expression. You cannot use an element-access-expression, because the set elements are unordered. On the other hand, you can use a set-filter-expression to retrieve members of a set that meet a particular criterion.
The set-filter-expression filters members of a set or a list. It returns the members that meet the filtering criterion.
If you pass it a set, the set-filter-expression returns a set. If you pass it a list, it returns a list.
Before Coverity Analysis 2019.09: In previous releases, set-filter-expression returned a set regardless of whether it was passed a set or a list. This was a source of confusion. Legacy code need not change: The update is backward compatible because it has always been possible to treat a list as a set.
The set-producing-expression specifies a set or a list that has already been declared.
The pattern-producing-expression is simply an expression that produces a pattern. Typically this will be an identifier that names a pattern you have already created.
Set filtering is essentially a syntactic shorthand for the longer notation shown here:
Be aware: If you use the same variable name before and after the matches clause—that is, x matches something as x—then the original x and the x that is the result are two different variables that happen to have the same name. This is known as shadowing: The first value is shadowed by the second one, and the original x value is no longer accessible.
Mnemonics:
A switch-expression chooses among two or more alternatives, based on the outcome of a controlling expression. A switch is a convenient way to represent a chain of conditions, when each condition is a pattern of the same expression.
The switch-expression is introduced by the switch keyword, followed by the controlling expression enclosed by parentheses. This is followed by a match-list enclosed by curly braces ( { and } ).
The match-list must be followed by a required default alternative. This alternative is also introduced by a vertical bar ( | ), followed by the keyword default. Then an arrow operator ( -> ) introduces an expression that is evaluated only if no prior matching expression has been found.
The syntax and behavior of a switch-expression are similar to a pattern, but they differ in certain ways:
The unary-expression takes a single operand. A unary operation negates its operand—either numerically or logically.
A numerical expression is negated by prefixing a minus sign ( - ). A logical expression is negated by prefixing the logical NOT operator, represented by an exclamation point ( ! ).
In CodeXM, a variable is an entity that has been given an identifier to name it.
Examples include a global constant defined by const-definition, a local constants defined by a parent let-binding-expression, and a named value defined by a matches-expression that uses the keyword as.
Remember: The CodeXM documentation often uses the term “variable” by analogy with other languages, but because CodeXM is a functional language, even named values are constant. They are available, with the same value, a long as they remain in scope.
(You can, on the other hand, create a record that is based on an existing record but with some of the property values changed: See record-update-expression.)
Once it has been created, you simply refer to a variable by its identifier.
All objects, including expressions, in a CodeXM program have their own particular type. This chapter describes the possible types.
The type is itself the root of the type system.
Declaring an object can be as simple as assigning a value to an identifier; for example, assigning the identifier a literal numeric value or a string literal. Other type declarations require more elaborate syntax; for example, defining an enum or a function-type.
CodeXM has several intrinsic built-in types.
(The int and string built-in types correspond to the int-literal and string-literal literal-expressions.)
Built-in types are identified by their reserved names.
A value that equals either true or false.
The type for an int-literal-expression or an integer variable.
The type for a string-literal-expression or a string variable.
The type of a string value that has been formatted for output.
The following operations create an eventstring:
These are the properties you can use to format a string:
Concatenation is the only operation allowed on an eventstring object. You cannot compare eventstrings in a Boolean expression.
To format a string, simply append the attribute to the string value. Here is an example:
CodeXM creates the eventstring by adding Markdown to the original string.
Typically you will use formatting properties with the expression value output by a checker. For example, if your checker includes the following code:
... then Coverity sends the following string to the output stream:
... which in turn, on many systems will appear in the following form:
Be careful: The formatting properties belong to objects of the type string and not to objects of the type eventstring. So if you assign x, for example, the result of concatenating two strings, the code let y = x.formattedAsEmphasized generates an error because x, the result of a concatenation, is an eventstring.
A collection-type object can be either a set or a list. All members of a set or a list share the same base type.
The non-nullable-type is the base type of the collection. As the EBNF name implies, it cannot be nullable: See “The nullable-type”.
The members of a set or a set do not need to be unique (for sets, this is different from the usual mathematical stipulation).
Both sets and lists have the following properties:
The main difference between a set and a list is that the members of a list have a particular order, while the members of a set do not. This difference does have implications for the way these two types of collections behave, as the following points explain:
The type_of_empty_list is the type of the empty-list expression ( [] ).
You can initialize a set or a list to an empty state by using the empty list ( [] ). This can be useful when you work with the for-accumulate-expression.
The empty list belongs to all list types.
The enum-type is similar to a set, except that each member is an enum-literal-expression; that is, a character sequence enclosed by backticks ( ` ) rather than quotation marks.
Note: The values in an enum are not considered to be ordered.
Unlike some other languages, where the members of an enumeration are merely identifiers, in CodeXM the value of an enum member can be nearly any string-like value. For example, an enum describing operators might define its contents by using familiar symbols such as `+`, `-`, `*`, `/`, and so on.
The value of an enum is not required to be unique among all the enum types in your CodeXM source file. In fact, an enum literal that is not unique is considered to belong to any enum type that contains the same literal. For example, the literal `X` is assumed to be of the type enum{ `X` }, which is shared among all enums in the source file that include this same member.
For example, suppose your checker declares the following two enumerations:
... then a variable of either type can be compared with a variable of the other.
When you define a function the function has an implicit type, which has three components:
You can also define a particular function-type. You can specify a function type as the type of another function’s argument: doing so enables you to pass functions as arguments.
A function type declaration is similar to a function-definition, but it doesn’t specify a function name, and instead of an expression to evaluate it simply declares the name of the new type.
Each type-parameter-identifier is a generic placeholder for an actual type. This identifier can be used within the parameter list as if it were a known type, the return type, or any portion of the expression the function evaluates.
Each parameter-identifier is an identifier for an explicitly typed parameter. Each parameter name must be unique to the function.
Each parameter’s type must be one of the following:
.cxm
) file.cxm
fileA function that accepts a single integer parameter and returns a Boolean value can be written as follows:
Once declared, the function type can be used in other functions; for example, to specify the type of parameter that the function accepts, as in the following code sample:
In the following example, a filter function from a CodeXM library takes a function and a set of values, then applies that function to the values in the set and returns a filtered list:
For every target-language source file, a globalset object contains all the nodes that correspond to the source file’s code. A globalset can be used with exists and for loop expressions. Many checkers search the entire globalset.
Most language libraries include named globalset sets that contain specific kinds of objects; for example, allFunctionCode. You can use these named globalset sets to make the scope of a search more focused, and possibly quicker to run.
A few restrictions apply to globalset sets, as follows:
The type declaration: In such a declaration, the type is simply the identifier of the new set, and each globalset-domain is one of the following values:
The domains are used for internal bookkeeping and optimization. There is rarely need for a user-written checker to declare a new globalset-type.
The opening of the following pattern tells CodeXM to search only code within functions:
The named-type is any previously defined type. You can use a type-definition to define a type that is particular to your own checker code.
When you specify a complex type, you can use parentheses to impose an evaluation order that might otherwise be ambiguous or contain undesired precedence.
The target code that CodeXM analyzes is represented by a variety of Abstract Syntax Tree (AST) node types. An astnode is the root of this class hierarchy.
Tip: The various language libraries provide meaningful subtypes that are particular to the target language. As such, you are seldom likely to deal with these more abstract node types. Instead you might deal with patterns with names such as—for example—ifStatement or binaryExpression. The abstract node types can sometimes be useful; for example, as parameter types to certain CodeXM functions, or when you must specify a return type and no more specific type is exposed by the language library.
Each of the node types is represented by its corresponding keyword.
When a property is not specified, that property is null. If there is a possibility of this occurring, the property’s type is said to be nullable.
CodeXM code indicates a nullable type by appending a question mark ( ? ) to the name of the underlying non-nullable type.
For example, a function in the target language might return an integer value or it might not. The pattern for that function includes a property named .returnValue. The type for .returnValue would be int?.
If a value is null, you cannot use it in an operation and a pattern cannot match it. Your checker code needs to detect occurrences of null and avoid referencing them. There are two ways to do so: by using the keyword NonNull, or by using the null-coalescing operator ( ?? ).
The keyword NonNull matches only non-null values. You can use it with matches to avoid illegally referencing null, as shown in the following sample code:
The null-coalescing operator works a bit differently: It replaces null with a default non-nullable value that you can specify. The following is an example of code that uses ??:
An expression, such as a function, that accepts a nullable type will also accept an object that is null.
For further information, see “Handling Null Values” in Learning to Write CodeXM Checkers.
This is the type of the object null; in other words, it is the type of a property that has not been specified.
An expression, such as a function, that accepts a nullable type will also accept an object that is null.
Patterns differ in the type of code element the pattern can be applied to (that is, what the pattern matches) and in what the pattern returns when it successfully finds a match.
The keyword pattern introduces the pattern-type. This is followed by the name of the type that the pattern can be applied to, then followed by an arrow ( -> ) introducing the name of the type the pattern returns on finding a match.
The applicable-type must not be nullable, though it can be the nullable type’s non-nullable counterpart. You can convert a nullable type to non-nullable; for example, by using NonNull. See “The nullable-type”.
A record assembles various named properties, which can be of various types, into a single structure.
Each property-name is an identifier.
When a question mark ( ? ) follows a property-name, the usage is similar to the character’s use with the name of a type, but there is a difference. When a type is nullable, the question mark indicates that the property-name must be present, but that its value might be null. In a record definition, on the other hand, when the question mark follows property-name, it indicates that the property itself can be omitted.
accumulate | endif | include | record |
as | enum | initializer | set |
astnode | eventstring | int | statement |
bool | exists | let | string |
checker | expression | list | switch |
ctorinit | false | matches | then |
declaration | for | models | true |
declare | function | NonNull | typedef |
default | globalset | null | where |
else | if | pattern | with |
elsif | in | property |