Use of graph grammars/rewriting systems in compilers?

A(n imperative) program – in a higher-level language and, more importantly, in assembly language or an intermediate representation like LLVM IR – can be formalized as a directed "port graph", in which vertices are instructions and ports correspond to input/output operands/arguments. An optimization applied to a program's code therefore corresponds to a rewrite applied to its port digraph.
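To make the idea concrete, here is a minimal C++ sketch (invented for this question, not taken from any real compiler) of what I mean: instructions as port-graph vertices, operand edges as ordered ports, and one optimization (multiply-by-two rewritten to a left shift) expressed as a local rewrite rule.

#include <string>
#include <vector>

// One vertex of the port graph: an instruction with ordered input ports.
struct Node {
    std::string opcode;          // e.g. "mul", "shl", "const", "arg"
    std::vector<Node*> inputs;   // input ports, in operand order
    long long imm;               // payload for "const" nodes
};

// Rewrite rule:  mul(x, const 2)  ==>  shl(x, const 1)
// Rewrites the node in place and returns true on a match.
// (A real rewriter would also have to check that the constant node is not shared.)
bool rewriteMulByTwo(Node& n) {
    if (n.opcode != "mul" || n.inputs.size() != 2) return false;
    Node* rhs = n.inputs[1];
    if (rhs->opcode != "const" || rhs->imm != 2) return false;
    n.opcode = "shl";
    rhs->imm = 1;                // the constant port now holds the shift amount
    return true;
}

int main() {
    Node two{"const", {}, 2};
    Node x{"arg", {}, 0};
    Node mul{"mul", {&x, &two}, 0};
    rewriteMulByTwo(mul);        // mul is now shl(x, const 1)
    return mul.opcode == "shl" ? 0 : 1;
}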

Now, graph rewriting is a small but somewhat-active area in itself. What I'm wondering is whether these kinds of systems have actually been put to explicit use in the context of a compiler. That is, representing the optimization phase as a rewriting process, or as a derivation using a port-graph grammar.

I’m not much of a "compilers guy" – I took a basic course on them in my undergraduate degree, and I know about LLVM and its IR – but it seems to me that graph rewriting systems are the "obvious" formalism to use; and at the same time, I see almost no FOSS projects involving such systems, nor do the papers about them discuss their use in compilers.

Note: I’m more interested in practical, popular-language compilers than in academia-only systems, but I’ll take what I can get.

Are compilers safe?

I was wondering how compilers like GCC compile themselves after each release, but that got me thinking:
Are compilers safe?

Correct me if I’m wrong, but if at some step along the way a version of the compiler were compiled with a piece of malicious code put inside, wouldn’t every release after that be infected as well? This would cause a chain reaction and could potentially infect millions of devices.

Has this happened already? If so, how did we find out? Is there any group or organization making sure this doesn’t happen?

Do compilers of high-level programming languages always compile them directly to machine code?

As an amateur Bash/JavaScript scripter who has never written a single line of assembly, I ask:

Do compilers of high-level programming languages always compile them directly to machine code, or are there cases where a compiler of some high-level language compiles it to assembly (and an assembler then assembles that into machine code)?
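For concreteness, here is a trivial C++ program; the comments show the standard GCC commands that expose the intermediate assembly (shown as comments because this is source code, not a shell session). As far as I know, GCC really does generate assembly text and then run the assembler on it, while some other compilers emit object code directly; treat this as a sketch of what I am asking about, not an authoritative answer.

// Trivial program, just to have something to compile.
//
//   g++ -S hello.cpp          stops after compilation proper and writes hello.s (assembly text)
//   g++ -save-temps hello.cpp keeps the preprocessed, assembly, and object intermediates
//
#include <cstdio>

int main() {
    std::puts("hello");
    return 0;
}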

High Level Assemblers vs Compilers?

I’m reading the “Art of Assembly”, 2nd Edition, by Randall Hyde. In the book the author uses his own language called High Level Assembly (HLA). Coming from a C background, I question how useful this is for learning real assembly (x86/NASM, for use on Linux). But looking it up, there is a term “High Level Assembler” (Wikipedia) that gives some credibility to calling HLA “Assembly”.

HLA provides ENUMs; bounds checking; multi-dimensional arrays; constant, static, and readonly variables; operators; a standard library; some error-catching; alignment and a lot of other stuff.

Where is the line drawn between an assembler and a compiler? From the book, it seems there are even some addressing modes not supported by HLA:

Actually, the 80x86 does support addressing modes involving certain 16-bit registers, as mentioned earlier. However, HLA does not support these modes and they are not useful under 32-bit operating systems.

If it is a compiler, then does it “compile” the Assembly, or “assemble” the Assembly? Or is “High Level Assembly” just a procedural programming language that has little to do with Assembly?

Detecting unicorn and dinosaur compilers

Rumor is that the next version of C will disallow sign-magnitude and ones’-complement signed-integer encodings. True or not, it seems efficient not to have to code and test for those rare encodings.

Yet if code might not handle cases such as non-2’s-complement encodings, it is prudent to detect such implementations and fail the compilation today.

Rather than just look for that one kind of dinosaur1, below is C code that looks for various unicorns2 and dinosaurs. Certainly some tests are more useful than others.

Review goal:

  • Please report any dinosaur1 or unicorn2 compilers found by this code.

  • Review how well this code would flag truly passé compilers without reporting new, innovative ones (e.g., a 128-bit intmax_t).

  • Suggest any additional or refined tests.

  • Pre-C11 compilers that lack static_assert may well need a better #define static_assert ... than the one in this code. Better alternatives are appreciated, but they are not a main goal of this post.

Note: I am not trying to rate strict adherence to IEEE 754 and the like.


/*
 * unicorn.h
 * Various tests to detect old and strange compilers.
 *
 *  Created on: Mar 8, 2019
 *      Author: chux
 */

#ifndef UNICORN_H_
#define UNICORN_H_

#include <assert.h>
#ifndef static_assert
  #define static_assert( e, m ) typedef char _brevit_static_assert[!!(e)]
#endif

#include <float.h>
#include <limits.h>
#include <stdint.h>

/*
 *  Insure 2's complement
 *  Could also check various int_leastN_t, int_fastN_t
 */
static_assert(SCHAR_MIN < -SCHAR_MAX && SHRT_MIN < -SHRT_MAX &&
    INT_MIN < -INT_MAX && LONG_MIN < -LONG_MAX &&
    LLONG_MIN < -LLONG_MAX && INTMAX_MIN < -INTMAX_MAX &&
    INTPTR_MIN < -INTPTR_MAX && PTRDIFF_MIN < -PTRDIFF_MAX
    , "Dinosaur: Non-2's complement.");

/*
 *  Insure the range of unsigned is 2x that of positive signed
 *  Only ever seen this once, with the widest unsigned and signed types having the same max
 */
static_assert(SCHAR_MAX == UCHAR_MAX/2 && SHRT_MAX == USHRT_MAX/2 &&
    INT_MAX == UINT_MAX/2 && LONG_MAX == ULONG_MAX/2 &&
    LLONG_MAX == ULLONG_MAX/2 && INTMAX_MAX == UINTMAX_MAX/2,
    "Dinosaur: narrowed unsigned.");

/*
 *  Insure char is sub-range of int
 *  When char values exceed int, makes for tough code using fgetc()
 */
static_assert(CHAR_MAX <= INT_MAX, "Dinosaur: wide char");

/*
 *  Insure char is a power-2-octet
 *  I suspect many folks would prefer just CHAR_BIT == 8
 */
static_assert((CHAR_BIT & (CHAR_BIT - 1)) == 0, "Dinosaur: Uncommon byte width.");

/*
 *  Only binary FP
 */
static_assert(FLT_RADIX == 2, "Dinosaur: Non binary FP");

/*
 *  Some light checking for pass-able FP types
 *  Certainly this is not a full IEEE check
 *  Tolerate float as double
 */
static_assert(sizeof(float)*CHAR_BIT == 32 || sizeof(float)*CHAR_BIT == 64,
    "Dinosaur: Unusual float");
static_assert(sizeof(double)*CHAR_BIT == 64, "Dinosaur: Unusual double");

/*
 *  Heavier IEEE checking
 */
static_assert(DBL_MAX_10_EXP == 308 && DBL_MAX_EXP == 1024 &&
    DBL_MIN_10_EXP == -307 && DBL_MIN_EXP == -1021 &&
    DBL_DIG == 15 && DBL_DECIMAL_DIG == 17 && DBL_MANT_DIG == 53,
    "Dinosaur: Unusual double");

/*
 *  Insure uxxx_t range <= int
 *  Strange when unsigned helper types promote to int
 */
static_assert(INT_MAX < UINTPTR_MAX, "Unicorn: narrow uintptr_t");
static_assert(INT_MAX < SIZE_MAX, "Unicorn: narrow size_t");

/*
 *  Insure xxx_t range >= int
 *  Also expect signed helper types at least int range
 */
static_assert(INT_MAX <= PTRDIFF_MAX, "Unicorn: narrow ptrdiff_t");
static_assert(INT_MAX <= INTPTR_MAX, "Unicorn: narrow intptr_t");

/*
 *  Insure all integers are within `float` finite range
 */
// Works OK when uintmax_t lacks padding
static_assert(FLT_RADIX == 2 && sizeof(uintmax_t)*CHAR_BIT < FLT_MAX_EXP,
    "Unicorn: wide integer range");
// Better method
#define UNICODE_BW1(x) ((x) > 0x1u ? 2 : 1)
#define UNICODE_BW2(x) ((x) > 0x3u ? UNICODE_BW1((x)/0x4)+2 : UNICODE_BW1(x))
#define UNICODE_BW3(x) ((x) > 0xFu ? UNICODE_BW2((x)/0x10)+4 : UNICODE_BW2(x))
#define UNICODE_BW4(x) ((x) > 0xFFu ? UNICODE_BW3((x)/0x100)+8 : UNICODE_BW3(x))
#define UNICODE_BW5(x) ((x) > 0xFFFFu ? UNICODE_BW4((x)/0x10000)+16 : UNICODE_BW4(x))
#define UNICODE_BW6(x) ((x) > 0xFFFFFFFFu ? \
    UNICODE_BW5((x)/0x100000000)+32 : UNICODE_BW5(x))
#define UNICODE_BW(x) ((x) > 0xFFFFFFFFFFFFFFFFu ? \
    UNICODE_BW6((x)/0x100000000/0x100000000)+64 : UNICODE_BW6(x))
static_assert(FLT_RADIX == 2 && UNICODE_BW(UINTMAX_MAX) < FLT_MAX_EXP,
    "Unicorn: wide integer range");

/*
 *  Insure size_t range > int
 *  Strange code when a `size_t` object promotes to an `int`.
 */
static_assert(INT_MAX < SIZE_MAX, "Unicorn: narrow size_t");

/*
 *  Recommended practice 7.19 4
 */
static_assert(PTRDIFF_MAX <= LONG_MAX, "Unicorn: ptrdiff_t wider than long");
static_assert(SIZE_MAX <= ULONG_MAX, "Unicorn: size_t wider than unsigned long");

/*
 *  Insure range of integers within float
 */
static_assert(FLT_RADIX == 2 && sizeof(uintmax_t)*CHAR_BIT < FLT_MAX_EXP,
    "Unicorn: wide integer range");

// Additional code could #undef the various UNICODE_BWn

#endif /* UNICORN_H_ */

Test driver

#include "unicorn.h" #include <stdio.h>  int main(void) {   printf("Hello World!\n");   return 0; } 

1 C is very flexible, yet some of its features cater to compilers that simply have not been in use for over 10 years. Compilers that relied on such out-of-favor features (non-2’s-complement integers, non-power-of-2 bit-width “bytes”, non-binary floating point, etc.) I’ll call dinosaurs.

2 C is very flexible for new platforms/compilers too. Some of these potential and theoretical compilers could employ very unusual features. I’ll call these compilers unicorns. Should one appear, I would rather have my code fail to compile than compile into code that functions incorrectly.

How do C++ compilers store tokens?

Let’s say a token is designed this way:

struct Token {
    string content;
    string filename; // I know it is inefficient
    int linenumber;
    int position;
};

And let’s say the program is stored this way:

struct Application {
    vector<Token> tokens;
};

This is not enough. Applications are divided into classes, functions, assignments, function calls, etc.

These are non-uniform types.

How do GCC and Clang manage them? Do they use OO to achieve this?

Is there any diagram of their overall data structure?
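For illustration, here is a minimal sketch of one common approach (a class hierarchy over a base node type); it is invented for this question and is not Clang’s or GCC’s actual data structures:

#include <memory>
#include <string>
#include <vector>

// Common base for every construct the parser can produce.
struct AstNode {
    virtual ~AstNode() = default;
};

struct Expr : AstNode {};

// A function call: one concrete, non-uniform node type.
struct Call : Expr {
    std::string callee;
    std::vector<std::unique_ptr<Expr>> args;
};

// An assignment: another node type with a different shape.
struct Assignment : AstNode {
    std::string target;
    std::unique_ptr<Expr> value;
};

// A function definition owning a list of arbitrary child nodes.
struct Function : AstNode {
    std::string name;
    std::vector<std::unique_ptr<AstNode>> body;
};

int main() {
    Function f;
    f.name = "main";
    auto call = std::make_unique<Call>();
    call->callee = "printf";
    f.body.push_back(std::move(call));   // heterogeneous children via the base class
    return 0;
}

From what I have read, Clang does use a large C++ class hierarchy along these lines (Decl, Stmt, Expr, Type and their many subclasses), while GCC’s front ends historically represented everything as tagged “tree” nodes rather than classes; I would appreciate confirmation and pointers to diagrams.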

Why are there so few C compilers?

C is one of the most widely used languages in the world. It accounts for a huge proportion of existing code and continues to be used for a vast amount of new code. It’s beloved by its users, it’s so widely ported that being able to run C is, to many, the informal definition of a platform, and it’s praised by its fans for being a “small” language with a relatively clean set of features.

So where are all the compilers?

On the desktop, there are (realistically) two: GCC and Clang. Thinking about it for a few seconds, you’ll probably remember that Intel exists as well. There are a handful of others, far too obscure for the average person to name and almost universally not bothering to support a recent language version (or often even a well-defined language subset, just “a subset”). Half of the members of this list are historical footnotes; most of the rest are very specialized and still don’t actually implement the full language. Very few actually seem to be open source.

Scheme and Forth – other small languages that are beloved by their fans for the same reason – probably have more compilers than actual users. Even something like SML has more “serious” implementations to choose between than C. Meanwhile, the announcement of a new (unfinished) C compiler aimed at verification actually draws some pretty negative responses, and veteran implementations struggle to get enough contributors to even catch up to C99.

Why? Is implementing C so hard? It isn’t C++. Do users simply have a very skewed idea of what complexity group it falls in (i.e., that it is actually closer to C++ than to Scheme)?

How are regular languages, context-free languages, and context-sensitive grammars used in compilers to shape a language?

I know that regular languages can be used for pattern matching, context-free languages for syntax matching, and context-sensitive grammars for the semantics or meaning of a sentence. But I have found it hard to visualize how the whole process takes place. For example, say we want to validate a statement of some language: how should I start to form the full sequence of grammars that correctly describes the language?

What I think is as follows. Say we want to match the first characters of the statement: we feed the input into the many automata operating in the background, and if any of them reaches its final state we have a match and issue a token of that class (data type, identifier, etc.). At the same time we check whether another automaton matches the same pattern (for ‘int’, for example, two automata reach a final state, one for keywords and one for identifiers), and we then check priority. After assigning each sequence of characters a class, we are done with pattern matching and move on to syntax. Now we use a CFG, but from here I do not know how to proceed, so please tell me how the process continues. Also, if you find anything wrong here, please tell me; it is much appreciated.
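For example, here is a minimal C++ sketch of what I imagine the next step looks like for a toy grammar Expr -> Term { '+' Term }, Term -> NUMBER | '(' Expr ')'. Each grammar rule becomes one function (what I understand recursive descent to be), and single characters stand in for the tokens the lexer would normally supply; the grammar, names, and simplifications are all mine, so please correct anything that is not how real compilers proceed.

#include <cctype>
#include <cstdio>
#include <string>

static std::string input;   // stands in for the lexer's token stream
static size_t pos = 0;

static char peek() { return pos < input.size() ? input[pos] : '\0'; }
static bool eat(char c) { if (peek() == c) { ++pos; return true; } return false; }

static bool parseExpr();    // forward declaration: the grammar is recursive

// Term -> NUMBER | '(' Expr ')'
static bool parseTerm() {
    if (std::isdigit(static_cast<unsigned char>(peek()))) {
        while (std::isdigit(static_cast<unsigned char>(peek()))) ++pos;
        return true;
    }
    return eat('(') && parseExpr() && eat(')');
}

// Expr -> Term { '+' Term }
static bool parseExpr() {
    if (!parseTerm()) return false;
    while (eat('+')) {
        if (!parseTerm()) return false;
    }
    return true;
}

int main() {
    input = "(1+2)+3";
    bool ok = parseExpr() && pos == input.size();
    std::puts(ok ? "syntactically valid" : "syntax error");
    return 0;
}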

How to organize a project as it grows more complex and depends on multiple languages and compilers across multiple operating systems?

I wrote a Music Player and Library in Java for GNU/Linux and Windows.

My build process is currently in ANT, but I intend to migrate away from that to something more modern after the next release. Probably Gradle.

The project was originally in pure Java, but in order to get access to certain features, I’ve had to start special-casing different native features and linking them into my program using JNI.

Here’s an outline currently:

  • Hypnos Base – written in Java
  • JXGrabKey – An existing open-source hotkey library, modified by my project to support JavaFX.
  • GNU/Linux Native Launcher – written in C++, compiled with g++.
  • Windows Native Launcher – written in C++, compiled with cl.exe.
  • GNU/Linux GTK Tray Icon – written in C, compiled with gcc.

My src tree is currently laid out the standard way for Java:

  • src
    • net
      • joshuad
        • hypnos
          • [all the Hypnos Base code and packages]
    • jxgrabkey
  • packaging
    • NSIS script for the Windows installer
    • AppImage stuff for the standalone Linux runtime

For the various natively compiled code, I was compiling it by hand as needed and then providing it to ANT in binary form to do my packaging.

As I add more and more native code, I’m starting to want to be able to recompile the entire project on demand, rather than having to build each module manually before starting ANT. It feels irresponsible to have an open-source project whose build process is arcane.

My primary development system is Ubuntu 18.04. Prior to today, all of the building could be done on that system. Now I’ve added the native launcher for Windows, which is currently compiled with cl.exe and targets 64-bit Windows. It’s relatively simple code (~60 lines): a single function, WinMain, some basic library-linking work, and then handing things over to jvm.dll.
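For reference, the launchers do roughly the following (a simplified C++ sketch, not the project’s actual code: the class and jar names are made up, error handling is omitted, and on Windows this logic would live inside WinMain). The JNI invocation calls themselves (JNI_CreateJavaVM and friends) are the standard ones from jni.h.

#include <jni.h>

int launchHypnos() {
    // Hypothetical classpath; the real launcher points at the packaged jar.
    JavaVMOption options[1];
    options[0].optionString = const_cast<char*>("-Djava.class.path=hypnos.jar");

    JavaVMInitArgs vm_args;
    vm_args.version = JNI_VERSION_1_8;
    vm_args.nOptions = 1;
    vm_args.options = options;
    vm_args.ignoreUnrecognized = JNI_FALSE;

    JavaVM* jvm = nullptr;
    JNIEnv* env = nullptr;
    if (JNI_CreateJavaVM(&jvm, reinterpret_cast<void**>(&env), &vm_args) != JNI_OK)
        return 1;

    // Hypothetical main class; the real one lives somewhere under net.joshuad.hypnos.
    jclass cls = env->FindClass("net/joshuad/hypnos/Main");
    jmethodID mainMethod = env->GetStaticMethodID(cls, "main", "([Ljava/lang/String;)V");
    jobjectArray noArgs =
        env->NewObjectArray(0, env->FindClass("java/lang/String"), nullptr);
    env->CallStaticVoidMethod(cls, mainMethod, noArgs);

    jvm->DestroyJavaVM();
    return 0;
}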

This post may be too open-ended for this site, but I figured I’d give it a shot. What are some paradigms or resources I can use to try to make my build process a bit more sane?