Grammars, BNF, EBNF, and Parse Trees

When we write code, we're engaging with two fundamental aspects of a programming language: syntax and semantics. The syntax determines how the code is written—the structure, the symbols, and the rules that dictate how a program should look. But how do we define and formalize these rules? This is where grammars and formal language theory come into play.

In this post, we’ll explore how programming language syntax is defined through formal grammars, the notations used to write those grammars down (BNF and EBNF), and how they help shape the way we write code.

What Is Program Syntax?

In programming, syntax is the form and structure of a program: it dictates how the code must be written. For example, in many languages, such as C and Java, statements end with a semicolon (;) and sub-expressions can be grouped with parentheses. While syntax governs the "look," semantics refers to the behavior of the program: what the code does when it is run.

Understanding syntax is essential for both writing and compiling code. A compiler or interpreter rejects a program that violates the syntax rules, so a syntax error usually prevents the program from running at all.

Grammar and Parse Trees

To formalize programming language syntax, we use grammars—rules that describe how strings of code are formed. Think of a grammar as a blueprint for constructing valid sentences in a programming language.

A parse tree shows how a string (like a piece of code) is derived from a grammar. The root is the grammar's start symbol, each interior node is a non-terminal whose children are the symbols of the rule used to expand it, and the leaves, read from left to right, spell out the original string.

Example: Simple Grammar for English

Let’s consider an example from English grammar. A basic sentence can be broken down as:

<S> ::= <NP> <V> <NP>
<NP> ::= <A> <N>
<V> ::= loves | hates | eats
<A> ::= a | the
<N> ::= dog | cat | rat

For the sentence "the dog loves the cat," a parse tree might look like this:

<S>
 ├── <NP> 
 │   ├── <A> -> the
 │   └── <N> -> dog
 ├── <V> -> loves
 └── <NP>
     ├── <A> -> the
     └── <N> -> cat

This structure shows how the sentence is formed from the grammatical rules.
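To make the link between the rules and the tree concrete, here is a minimal Python sketch that stores the grammar above as a dictionary and builds the parse tree as nested tuples. The dictionary layout and the names `GRAMMAR`, `parse`, and `parse_sentence` are illustrative choices, not part of any standard tool, and the matcher simply takes the first alternative that fits, which is enough for this small grammar but not for grammars in general.

```python
# The English grammar from above; each non-terminal maps to its alternatives.
GRAMMAR = {
    "<S>": [["<NP>", "<V>", "<NP>"]],
    "<NP>": [["<A>", "<N>"]],
    "<V>": [["loves"], ["hates"], ["eats"]],
    "<A>": [["a"], ["the"]],
    "<N>": [["dog"], ["cat"], ["rat"]],
}


def parse(symbol, words, pos=0):
    """Try to match `symbol` at words[pos:]; return (tree, next_pos) or None."""
    if symbol not in GRAMMAR:  # a terminal must equal the next word
        if pos < len(words) and words[pos] == symbol:
            return symbol, pos + 1
        return None
    for production in GRAMMAR[symbol]:
        children, cursor = [], pos
        for part in production:
            result = parse(part, words, cursor)
            if result is None:
                break  # this alternative does not fit; try the next one
            child, cursor = result
            children.append(child)
        else:  # every part of this alternative matched
            return (symbol, children), cursor
    return None


def parse_sentence(text):
    """Return the parse tree for `text`, or None if it is not a valid sentence."""
    words = text.split()
    result = parse("<S>", words)
    if result is None or result[1] != len(words):
        return None
    return result[0]
```

Calling `parse_sentence("the dog loves the cat")` returns a nested tuple with the same shape as the tree drawn above, while a word order the grammar does not allow, such as `"dog the loves"`, returns `None`.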

Defining Programming Languages with Grammar

In programming languages, grammars work similarly. For example, here’s a simple grammar to describe expressions in a hypothetical language:

<exp> ::= <exp> + <exp> | <exp> * <exp> | ( <exp> ) | a | b | c

This grammar defines that an expression (<exp>) can be:

The sum or product of two expressions,
A parenthesized expression, or
A variable (a, b, or c).

Example Parse Tree

For the expression (a + b) * c, the parse tree would look like this:

<exp>
 ├── <exp>
 │    ├── (
 │    ├── <exp>
 │    │    ├── <exp> -> a
 │    │    ├── +
 │    │    └── <exp> -> b
 │    └── )
 ├── *
 └── <exp> -> c

This parse tree shows how the expression is structured according to the grammar rules.
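A program can recover this structure too. The sketch below is a small hand-written parser in Python for the expression grammar. It rests on assumptions that the grammar itself does not state: that * binds tighter than +, and that repeated operators group to the left. The function names and the nested-tuple output (with a `"group"` tag for parentheses) are illustrative choices, and the result is a simplified tree rather than a node-for-node copy of the parse tree.

```python
def parse_expression(text):
    """Parse text such as "(a + b) * c" into nested tuples."""
    # Every token in this grammar is a single character, so scanning is trivial.
    tokens = [ch for ch in text if not ch.isspace()]
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def parse_sum():
        # Assumption: + has the lowest precedence and groups to the left.
        node = parse_product()
        while peek() == "+":
            advance()
            node = ("+", node, parse_product())
        return node

    def parse_product():
        # Assumption: * binds tighter than + and groups to the left.
        node = parse_atom()
        while peek() == "*":
            advance()
            node = ("*", node, parse_atom())
        return node

    def parse_atom():
        token = peek()
        if token == "(":
            advance()
            node = parse_sum()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return ("group", node)
        if token in ("a", "b", "c"):
            return advance()
        raise SyntaxError(f"unexpected token: {token!r}")

    tree = parse_sum()
    if peek() is not None:
        raise SyntaxError(f"unexpected token: {peek()!r}")
    return tree
```

For `"(a + b) * c"` this returns `("*", ("group", ("+", "a", "b")), "c")`, which mirrors the tree above: a product whose left operand is a parenthesized sum. Input that the grammar does not allow, such as `"a +"`, raises `SyntaxError`.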

BNF and EBNF: Tools for Defining Syntax

Backus-Naur Form (BNF)

BNF (Backus-Naur Form) is a notation used to define grammars. A BNF grammar consists of:

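Tokens: The smallest units of syntax, such as keywords, constants, or operators. Tokens are also called terminals because they cannot be expanded any further.
Non-terminals: Symbols written in angle brackets, such as <exp>, that stand for larger constructs and can be replaced by sequences of tokens and non-terminals.
Start symbol: The non-terminal that forms the root of every parse tree.
Productions: Rules of the form <non-terminal> ::= ..., where each alternative on the right-hand side, separated by |, is a sequence of tokens and non-terminals.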
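One way to see that these four parts fully describe a grammar is to write them down as data. The following Python sketch does that for the expression grammar from earlier. The `Grammar` class and its `problems` method are hypothetical names made up for this illustration. The method only checks that the four parts are consistent with each other.

```python
from dataclasses import dataclass


@dataclass
class Grammar:
    tokens: set
    non_terminals: set
    start: str
    productions: dict  # non-terminal -> list of alternatives (lists of symbols)

    def problems(self):
        """Return a list of consistency problems; an empty list means none found."""
        found = []
        if self.start not in self.non_terminals:
            found.append(f"start symbol {self.start} is not a non-terminal")
        for head, alternatives in self.productions.items():
            if head not in self.non_terminals:
                found.append(f"{head} has productions but is not a non-terminal")
            for alternative in alternatives:
                for symbol in alternative:
                    if symbol not in self.tokens and symbol not in self.non_terminals:
                        found.append(f"unknown symbol {symbol} in a rule for {head}")
        for name in sorted(self.non_terminals - set(self.productions)):
            found.append(f"{name} has no productions")
        return found


# The expression grammar from earlier, split into its four components.
expression_grammar = Grammar(
    tokens={"+", "*", "(", ")", "a", "b", "c"},
    non_terminals={"<exp>"},
    start="<exp>",
    productions={
        "<exp>": [
            ["<exp>", "+", "<exp>"],
            ["<exp>", "*", "<exp>"],
            ["(", "<exp>", ")"],
            ["a"],
            ["b"],
            ["c"],
        ]
    },
)
```

Here `expression_grammar.problems()` returns an empty list, because every symbol used in a production is either a declared token or a declared non-terminal.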

Extended BNF (EBNF)

EBNF (Extended BNF) adds shorthand to BNF that makes grammars shorter to write without changing which languages they can describe:

{x} means zero or more repetitions of x.
[x] makes x optional.
( ) groups symbols, so that | can separate alternatives within just part of a rule.

EBNF Example:

A simple grammar for an if-statement:

<if-stmt> ::= if <expr> then <stmt> [else <stmt>]

This says that an if-statement consists of the keyword if, an expression, the keyword then, and a statement, optionally followed by the keyword else and another statement.
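EBNF notation maps quite directly onto code in a hand-written parser: an optional [x] typically becomes an if, and a repeated {x} typically becomes a while loop. Here is a hedged Python sketch of the if-statement rule. Since the rule does not define <expr> or <stmt>, the sketch assumes that a single bare name stands in for an expression and for a simple statement. The function names and the tuple output are illustrative.

```python
KEYWORDS = {"if", "then", "else"}


def parse_if_statement(tokens, pos=0):
    """Parse tokens[pos:] as an if-statement; return (tree, next_pos)."""

    def expect(word, at):
        if at >= len(tokens) or tokens[at] != word:
            raise SyntaxError(f"expected {word!r} at token {at}")
        return at + 1

    def parse_simple(at, what):
        # Assumption: a bare name stands in for <expr> and for a simple <stmt>.
        if at >= len(tokens) or tokens[at] in KEYWORDS:
            raise SyntaxError(f"expected {what} at token {at}")
        return tokens[at], at + 1

    def parse_statement(at):
        # A statement is either a nested if-statement or a simple statement.
        if at < len(tokens) and tokens[at] == "if":
            return parse_if_statement(tokens, at)
        return parse_simple(at, "statement")

    pos = expect("if", pos)
    condition, pos = parse_simple(pos, "expression")
    pos = expect("then", pos)
    then_branch, pos = parse_statement(pos)
    else_branch = None
    # [else <stmt>]: the optional part becomes a plain `if` in the parser.
    if pos < len(tokens) and tokens[pos] == "else":
        else_branch, pos = parse_statement(pos + 1)
    return ("if", condition, then_branch, else_branch), pos
```

With the token list `"if x then y else z".split()`, the function returns the tree `("if", "x", "y", "z")` together with the position just past the last token. Without the else-clause, the fourth element of the tree is `None`.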

Phrase Structure vs. Lexical Structure

When defining the syntax of programming languages, we often differentiate between phrase structure and lexical structure:

Phrase structure: How tokens are assembled into larger constructs like expressions and statements.
Lexical structure: How characters in the source file are divided into tokens.

A scanner (also called a lexer) handles the lexical structure: it reads the source code and breaks it down into tokens (such as if, 123, or +). A parser handles the phrase structure: it builds a parse tree from these tokens.
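As a rough illustration of the scanning step, here is a small Python scanner built on the standard `re` module. The token kinds (NUMBER, NAME, OP, KEYWORD), the keyword set, and the choice of operators are assumptions made for this sketch and do not describe any particular language.

```python
import re

# One named group per token kind; whitespace is matched so it can be skipped,
# and ERROR catches any character that no other group accepts.
TOKEN_PATTERN = re.compile(
    r"""
    (?P<NUMBER>\d+)
  | (?P<NAME>[A-Za-z_]\w*)
  | (?P<OP>[+*()])
  | (?P<SPACE>\s+)
  | (?P<ERROR>.)
    """,
    re.VERBOSE,
)
KEYWORDS = {"if", "then", "else"}


def scan(source):
    """Split source text into (kind, text) tokens, skipping whitespace."""
    tokens = []
    for match in TOKEN_PATTERN.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind == "SPACE":
            continue
        if kind == "ERROR":
            raise SyntaxError(f"unexpected character {text!r}")
        if kind == "NAME" and text in KEYWORDS:
            kind = "KEYWORD"  # keywords look like names but are reserved
        tokens.append((kind, text))
    return tokens
```

For example, `scan("if x1 then 123 + y")` yields six tokens: the keywords if and then, the names x1 and y, the number 123, and the operator +. A parser would then work on this list instead of on raw characters.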

Syntax Diagrams (Railroad Diagrams)

Syntax diagrams, also known as railroad diagrams, provide a visual representation of grammar rules. They’re often easier for humans to read than textual grammars, but they take more space and are less convenient to feed to tools such as parser generators, which generally expect a textual grammar.

In a typical syntax diagram:

Boxes represent non-terminals (constructs that can be expanded).
Ovals represent terminals (literal tokens).
Loops and branches represent optional or repeated elements.

Conclusion

Defining the syntax of a programming language is crucial for both programmers and machines. Grammars give us a formal way to specify what counts as syntactically valid code, which tells programmers how to write programs and tells compilers and interpreters which programs to accept. Whether you use BNF, EBNF, or syntax diagrams, these tools let us see clearly how programming languages are structured.

Grammars play a foundational role in compilers, interpreters, and even code analysis tools. By mastering the grammar of a language, you gain a deeper understanding of its structure and rules, making you a better programmer.
