A full reimplementation of IBM Pascal 2.0 as a compiler that targets LLVM IR, with semantic analysis in a dedicated type-checking phase. It handles the vintage Pascal-1981 dialect, including its systems-programming extensions (adr, sizeof, adrmem, word, extern): the features that made Pascal suitable for low-level operating system and firmware work in the early 1980s.
Compile a Pascal program to a native executable:
# Pascal source -> LLVM IR (parse + type-check + codegen)
python3 compile_to_llvm.py myprogram.pas myprogram.ll
# LLVM IR -> native executable (requires clang)
clang myprogram.ll -o myprogram
# Run it
./myprogram
Add -v / --verbose for detailed output and full Python tracebacks if compilation fails:
python3 compile_to_llvm.py -v myprogram.pas myprogram.ll
If no output file is specified, LLVM IR is written to stdout:
python3 compile_to_llvm.py myprogram.pas | clang -x ir - -o myprogram
A clean, layered pipeline with clear separation of concerns:
Pascal Source -> Lexer -> Parser -> Type Checker -> Codegen -> LLVM IR -> clang -> Executable
Each phase is independent and focused:
- Front end (lexer, parser, type checker) is pure Python with no LLVM dependency
- Errors stop the pipeline early: type errors are reported before any IR is generated
- No surprise failures: a program that passes type checking is expected to produce IR that links and runs
- Lexer (lexer.py): tokenizes Pascal source: keywords, identifiers, numbers, operators, strings.
- Parser (parser.py): builds an Abstract Syntax Tree (AST) from tokens. Implements the full IBM Pascal 2.0 grammar. Entry point: parse_file(path).
- Type Checker (type_system.py, symbol_table.py, type_checker.py): semantic analysis that validates types, scopes, control flow, and module semantics before code generation. All type violations stop the pipeline with clear error messages.
- Codegen (codegen_llvm.py): walks the AST and emits LLVM IR using llvmlite. Built-in I/O (WRITELN, READLN) is wired to the C runtime (printf, scanf).
- Linking: clang lowers LLVM IR to native code and links any required runtime objects.
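The fail-fast ordering of the phases can be pictured with a small stand-in. This is a toy sketch, not the project's driver: the stage functions, the CompileError class, and the one-statement "language" (a single integer assignment) are all hypothetical and exist only to show that a type error prevents codegen from running.

```python
import re


class CompileError(Exception):
    """Hypothetical error type; the real project raises its own exceptions."""


def lex(source):
    # Split into identifiers, integers, ':=' and any other single character.
    return re.findall(r"[A-Za-z_]\w*|\d+|:=|\S", source)


def parse(tokens):
    # Accept only the toy form: <name> := <value>
    if len(tokens) != 3 or tokens[1] != ":=":
        raise CompileError("syntax error: expected 'name := value'")
    return {"target": tokens[0], "value": tokens[2]}


def type_check(ast, declared):
    # Reject undeclared targets and non-integer values before codegen runs.
    if ast["target"] not in declared:
        raise CompileError(f"undeclared variable {ast['target']!r}")
    if not ast["value"].isdigit():
        raise CompileError("type error: expected an integer literal")
    return ast


def codegen(ast):
    # Emit one LLVM-flavoured store instruction as plain text.
    return f"store i32 {ast['value']}, i32* @{ast['target']}"


def compile_source(source, declared):
    """Run the stages in order; return (ir, error, stages_run)."""
    stages_run = []
    try:
        tokens = lex(source)
        stages_run.append("lex")
        ast = parse(tokens)
        stages_run.append("parse")
        ast = type_check(ast, declared)
        stages_run.append("type_check")
        ir = codegen(ast)
        stages_run.append("codegen")
        return ir, None, stages_run
    except CompileError as exc:
        # The first failing stage ends the pipeline; later stages never run.
        return None, str(exc), stages_run
```

Calling compile_source("x := y", {"x"}) in this sketch returns no IR and an error message, and the list of completed stages stops at "parse", which mirrors the rule that nothing is generated for a program that fails type checking.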
The grammar this dialect implements is formally specified in docs/ebnf_grammar.md. The parser test suite is graded against this grammar as the source of truth.
This compiler aims to implement the full IBM Pascal 2.0 language, including its semantic rules and dialectal extensions. The checklist of completed features and remaining gaps is tracked in docs/Grand_Unified_Checklist.md.
- INTEGER (32-bit signed)
- BOOLEAN (one byte; stored as i8 so address-of / sizeof / fills are byte-consistent)
- REAL (64-bit float; limited codegen support)
- WORD (16-bit unsigned)
- CHAR (8-bit)
- ARRAY[low..high] OF type: bounds may be constant expressions, including named CONSTs
- RECORD ... END
- SET OF type
- Pointers, plus the adrmem (generic address) parameter type
- VAR x, y: INTEGER
- CONST size = 8190 (constant values are folded and usable in array bounds, sizeof, and expressions)
- PROCEDURE name(params); ... END
- FUNCTION name(params): type; ... END
- TYPE name = type
- EXTERN / FORWARD / EXTERNAL procedures (link against external/C objects)
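Constant folding of the kind used for array bounds can be sketched as a small recursive evaluator. This is an illustration under assumptions, not the project's code: the tuple-based expression encoding and the fold function are hypothetical, only +, -, * and DIV are handled, and DIV is assumed to truncate toward zero.

```python
def fold(expr, consts):
    """Fold a constant expression to an int.

    expr is an int literal, a constant name (str), or a tuple
    (op, left, right). consts maps constant names to int values.
    """
    if isinstance(expr, int):
        return expr
    if isinstance(expr, str):
        if expr not in consts:
            raise KeyError(f"unknown constant {expr!r}")
        return consts[expr]
    op, left, right = expr
    a, b = fold(left, consts), fold(right, consts)
    if op == "+":
        return a + b
    if op == "-":
        return a - b
    if op == "*":
        return a * b
    if op == "DIV":
        # Assumption: DIV truncates toward zero (Python's // floors instead).
        q = abs(a) // abs(b)
        return q if (a < 0) == (b < 0) else -q
    raise ValueError(f"unsupported operator {op!r}")


# CONST size = 8190; an ARRAY[0..size] then has high - low + 1 elements.
consts = {"size": 8190}
low, high = fold(0, consts), fold("size", consts)
element_count = high - low + 1                        # 8191
half = fold(("DIV", ("+", "size", 2), 2), consts)     # 4096
```

Folding before layout is what lets a named constant such as size appear anywhere a literal bound could.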
- IF cond THEN stmt ELSE stmt
- WHILE cond DO stmt
- REPEAT stmt UNTIL cond
- FOR var := start TO/DOWNTO end DO stmt
- CASE expr OF cases END
- BEGIN stmt; stmt; ... END
- procedure / function calls
- Arithmetic: +, -, *, /, DIV, MOD
- Logic: AND, OR, XOR, NOT
- Comparison: =, <>, <, <=, >, >=
- Calls: func(args)
- Systems-programming operators: adr x (address-of), sizeof(x) / sizeof(type)
- Built-ins: CHR, ORD
- WRITELN(...): accepts a mix of integers, characters, booleans, and string literals (mapped to printf)
- READLN(var): reads an integer (mapped to scanf)
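One way to picture the WRITELN-to-printf mapping is as a format-string builder driven by argument kinds. The sketch below is a guess at the shape of that mapping, not the project's implementation: the writeln_format function and the kind names are hypothetical, and printing booleans with %d is an assumption, since the source does not say how booleans are rendered.

```python
def writeln_format(arg_kinds):
    """Build a printf format string for one WRITELN call.

    arg_kinds is a list of kind names, one per argument, in call order.
    """
    conversions = {
        "integer": "%d",
        "char": "%c",
        "boolean": "%d",  # assumption: booleans printed as 0/1
        "string": "%s",
    }
    parts = []
    for kind in arg_kinds:
        if kind not in conversions:
            raise TypeError(f"WRITELN cannot print a value of kind {kind!r}")
        parts.append(conversions[kind])
    # WRITELN always ends the line, even with no arguments.
    return "".join(parts) + "\n"
```

For a call such as WRITELN('count = ', n) with an integer n, this sketch yields the format string "%s%d\n".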
These are the features that made Pascal suitable for writing operating systems, firmware, and device drivers. They allow direct memory manipulation while maintaining Pascal's type safety where possible:
- adr x: yields the address of a variable. Lowered to the variable's LLVM pointer, enabling low-level code.
- sizeof(x) / sizeof(T): compile-time byte size, computed from the actual array bounds (constants are resolved) and element sizes; returns a WORD. Essential for buffer and layout calculations.
- adrmem: a generic address/pointer parameter type (i8* in LLVM). Pointer arguments are automatically bitcast to the parameter's type at the call site, enabling polymorphic low-level functions. Example: adr flags (an array pointer) can be passed where an adrmem is expected.
- extern procedures: declared without a body and resolved at link time. Enables linking Pascal code against C runtimes and external libraries.
- word type: 16-bit unsigned integer for hardware register operations.
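The sizeof rule (element count from resolved bounds, times element size) can be sketched as follows. The scalar sizes come from the type list in this README; everything else is hypothetical: the tuple encoding of array types, the helper names, and the decision to return a plain Python int rather than model the 16-bit WORD result. Records are left out because their layout is not described here.

```python
# Byte sizes of the scalar types as documented in this README.
SCALAR_SIZES = {"INTEGER": 4, "BOOLEAN": 1, "REAL": 8, "WORD": 2, "CHAR": 1}


def resolve(bound, consts):
    """Return an int bound, looking up named constants when needed."""
    return consts[bound] if isinstance(bound, str) else bound


def sizeof(type_desc, consts=None):
    """Compute the byte size of a scalar or (possibly nested) array type.

    Scalars are type names; arrays are ("ARRAY", low, high, element_type).
    """
    consts = consts or {}
    if isinstance(type_desc, str):
        return SCALAR_SIZES[type_desc]
    kind, low, high, elem = type_desc
    if kind != "ARRAY":
        raise ValueError(f"unsupported type kind {kind!r}")
    count = resolve(high, consts) - resolve(low, consts) + 1
    return count * sizeof(elem, consts)


# CONST size = 8190; VAR flags: ARRAY[0..size] OF BOOLEAN
consts = {"size": 8190}
flags_type = ("ARRAY", 0, "size", "BOOLEAN")
flags_bytes = sizeof(flags_type, consts)                 # 8191
table_bytes = sizeof(("ARRAY", 1, 10, "INTEGER"))        # 40
```

Because BOOLEAN occupies one byte, the size of flags equals its element count, which is the byte-consistency property the type list mentions.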
This is a full reimplementation of IBM Pascal 2.0. The goal is not a subset or tutorial language, but complete dialect coverage as specified in the original IBM Pascal 2.0 manual.
Reference: The original compiler manual is here — this is the source of truth for dialect semantics and feature completeness.
Progress toward full coverage is tracked in docs/Grand_Unified_Checklist.md, which lists:
- ✅ Completed features with test evidence
- 🚧 In-progress and planned work
- 📋 Known gaps with effort estimates
Features are prioritized by impact (correctness traps first, then missing grammar, then semantic edge cases) and effort. The test suite is organized to run independently at each layer, so development can proceed without the full LLVM toolchain.
pascal-1981/
├─ Core Compiler
│ ├── lexer.py # Tokenizer (keywords, identifiers, numbers, strings, operators)
│ ├── parser.py # Syntax analysis; builds AST via recursive descent
│ ├── ast_nodes.py # AST node definitions (typed dataclasses)
│ ├── type_system.py # Type hierarchy and compatibility rules
│ ├── symbol_table.py # Scope management and symbol lookup
│ ├── type_checker.py # Semantic analysis (types, scopes, control flow)
│ ├── codegen_llvm.py # LLVM IR generation from AST
│ └── compile_to_llvm.py # Driver (parse → type-check → codegen)
│
├─ Tests (organized by pipeline layer)
│ ├── tests/
│ │ ├── __init__.py
│ │ ├── support.py # Test helpers and dependency probes
│ │ ├── test_parser.py # Parser accept/reject corpus (pure Python)
│ │ ├── test_typecheck.py # Type rules and semantics (pure Python)
│ │ ├── test_codegen.py # IR generation and build/run (requires llvmlite + clang)
│ │ ├── test_integration.py # Legacy integration corpus (removed)
│ │ └── fixtures/parser/
│ │     ├── should_pass/ # Programs that MUST parse
│ │     ├── should_fail/ # Programs that MUST be rejected
│ │     └── judgment_calls/ # Edge cases per dialect spec
│
├─ Documentation
│ ├── docs/
│ │ ├── ebnf_grammar.md # Formal grammar specification (reference document)
│ │ └── Grand_Unified_Checklist.md # Feature completeness tracker (priorities, effort, gaps)
│
├─ Runtime & Build
│ ├── runtime/
│ │ └── fillc.c # C runtime for Pascal I/O (printf/scanf bridge)
│ ├── scripts/
│ │ └── beautify.sh # Code formatter (isort + yapf)
│ ├── .gitignore
│ ├── .style.yapf # Code style config
│ └── README.md # This file
One unified test suite built on Python's stdlib unittest, with automatic detection of optional dependencies. Tests are organized by pipeline layer, so you can run the subset relevant to your changes without requiring the full LLVM toolchain.
# All tests; codegen tests auto-skip if llvmlite/clang are unavailable
python3 -m unittest discover -s tests -v
# Parser accept/reject corpus + type rules (no llvmlite needed)
python3 -m unittest tests.test_parser tests.test_typecheck
# Codegen only (requires llvmlite + clang)
python3 -m unittest tests.test_codegen
- tests/test_parser.py: Parser accept/reject verdicts over a fixture corpus:
  - should_pass/: programs that conform to the grammar and MUST parse
  - should_fail/: programs that violate the grammar and MUST be rejected
  - judgment_calls/: edge cases where the dialect spec allows discretion

  No subprocess or stdout grepping; verdicts come from catching (ParserError, LexerError). Each fixture runs in a subTest for isolated failure reporting.
- tests/test_typecheck.py: Type rules, scope, compatibility, control flow, and module semantics. Organized by topic into TestCase classes (TestVariableScope, TestTypeCompatibility, TestModuleSemantics, etc.). In-process; no subprocess or llvmlite dependency.
- tests/test_codegen.py: LLVM IR generation and native build/run tests. Decorated with @requires_llvm (IR tests) and @requires_exe (build/run tests). Automatically skipped if the toolchain is unavailable; the suite still exits 0.
- tests/test_integration.py: Legacy integration corpus (currently removed from the supported test suite).
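The accept/reject pattern used by the parser tests can be sketched with stdlib unittest alone. This is a toy under assumptions: the ParserError class and parse_source function below are stand-ins for the project's own, the grammar check is deliberately trivial, and the fixtures live in dictionaries rather than in the should_pass/ and should_fail/ directories.

```python
import unittest


class ParserError(Exception):
    """Stand-in for the project's parser exception."""


def parse_source(text):
    # Toy grammar check: a program starts with PROGRAM and ends with 'END.'
    stripped = text.strip().upper()
    if not stripped.startswith("PROGRAM"):
        raise ParserError("expected PROGRAM header")
    if not stripped.endswith("END."):
        raise ParserError("expected final 'END.'")
    return stripped


# In-memory stand-ins for the fixture directories.
SHOULD_PASS = {"minimal.pas": "PROGRAM p; BEGIN END."}
SHOULD_FAIL = {
    "no_header.pas": "BEGIN END.",
    "no_end.pas": "PROGRAM p; BEGIN",
}


class TestParserCorpus(unittest.TestCase):
    def test_should_pass(self):
        for name, text in SHOULD_PASS.items():
            # Each fixture reports independently if it fails.
            with self.subTest(fixture=name):
                parse_source(text)

    def test_should_fail(self):
        for name, text in SHOULD_FAIL.items():
            with self.subTest(fixture=name):
                with self.assertRaises(ParserError):
                    parse_source(text)


def run_corpus():
    """Run the corpus in-process and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestParserCorpus)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

The verdict comes from the exception alone, so no subprocess is started and nothing is parsed out of stdout.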
The front end (lexer, parser, type checker) is pure Python with no llvmlite dependency. This means:
- test_parser.py and test_typecheck.py run on any Python 3.8+ system
- test_codegen.py requires llvmlite and clang and is the only test module that needs them
- If codegen dependencies are missing, the suite auto-skips those tests without failure
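Dependency probes of this kind can be built from the standard library. The sketch below shows one plausible shape for the requires_llvm and requires_exe decorators; it is an assumption about how tests/support.py might work, not a copy of it, and the test class is purely illustrative.

```python
import importlib.util
import shutil
import unittest

# Probe without importing: find_spec returns None when the package is absent.
HAVE_LLVMLITE = importlib.util.find_spec("llvmlite") is not None
# shutil.which returns None when clang is not on PATH.
HAVE_CLANG = shutil.which("clang") is not None

requires_llvm = unittest.skipUnless(HAVE_LLVMLITE, "llvmlite is not installed")
requires_exe = unittest.skipUnless(
    HAVE_LLVMLITE and HAVE_CLANG, "llvmlite and clang are both required"
)


class TestCodegenSketch(unittest.TestCase):
    @requires_llvm
    def test_module_name(self):
        # Imported inside the test so this file loads without llvmlite.
        from llvmlite import ir

        module = ir.Module(name="sketch")
        self.assertEqual(module.name, "sketch")


def run_sketch():
    """Run the sketch test in-process and return the unittest result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestCodegenSketch)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

A skipped test counts as neither a failure nor an error, which is why a run without the toolchain can still finish successfully.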
- AST: typed dataclasses defined in ast_nodes.py, one per language construct. The parser builds the tree bottom-up using recursive descent. Array, record, and pointer access use selector nodes for uniform representation.
- Type System: modular type hierarchy with base scalar types (INTEGER, REAL, BOOLEAN, CHAR, WORD), composite types (ARRAY, RECORD, SET, POINTER), and callable types (PROCEDURE, FUNCTION). Implements Pascal's strict assignment rules with explicit type compatibility checks.
- Symbol Table: scope stack with parent chain for lexical scoping. Symbols are tagged by kind (var, const, function, procedure, parameter, type) to support scope-aware lookups and proper shadowing rules.
- Codegen: direct LLVM IR emission using llvmlite. There is no intermediate representation; the code generator walks the AST and emits LLVM instructions directly. Globals receive proper zero initializers; named constants are folded at compile time; function arguments are coerced (pointer bitcasts, integer width adjustments) to match callee signatures.
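A scope stack with a parent chain can be sketched in a few lines. The Scope class, its method names, and the tuple stored per symbol are hypothetical and only illustrate lexical lookup and shadowing; the sketch also assumes identifiers are case-insensitive, as is conventional for Pascal.

```python
class Scope:
    """One lexical scope; lookups fall back to the enclosing scope."""

    def __init__(self, parent=None):
        self.parent = parent
        self.symbols = {}

    def define(self, name, kind, type_name):
        # Assumption: identifiers are case-insensitive, so normalize the key.
        key = name.upper()
        if key in self.symbols:
            raise KeyError(f"duplicate declaration of {name!r} in this scope")
        self.symbols[key] = (kind, type_name)

    def lookup(self, name):
        # Walk outward through the parent chain; the innermost match wins.
        key = name.upper()
        scope = self
        while scope is not None:
            if key in scope.symbols:
                return scope.symbols[key]
            scope = scope.parent
        return None


# Global scope, then a procedure scope whose parameter shadows a global.
global_scope = Scope()
global_scope.define("x", "var", "INTEGER")
global_scope.define("size", "const", "INTEGER")

proc_scope = Scope(parent=global_scope)
proc_scope.define("x", "parameter", "CHAR")
```

Here proc_scope.lookup("x") finds the CHAR parameter while global_scope.lookup("x") still finds the INTEGER variable, and proc_scope.lookup("size") reaches the global constant through the parent link.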
- Type checking before codegen: all type errors are caught and reported before any IR is generated, so a program that passes type checking is intended to always yield compilable output.
- Minimal operator overloading: each operator works on specific types with explicit type rules, avoiding the ambiguity that makes heavily overloaded operators harder to reason about.
- Array bounds at compile time: constant expressions in array declarations allow sizeof and layout calculations to be resolved at compile time, which is essential for systems programming.
For parsing and type checking:
- Python 3.8+
- No external dependencies (pure Python implementation)
For code generation (LLVM IR → native executable):
- Python 3.8+
- llvmlite (for LLVM IR generation via Python)
- clang (recent versions; needed for native compilation and linking)
- A harmless target-triple override warning from LLVM is expected and safe to ignore
Note: If llvmlite or clang are unavailable, the parser and type checker still work fully; only codegen tests are skipped.