Understanding lexical grammar for multi-line comments in JavaCC

I’m struggling to understand why this piece of lexical grammar works for multi-line comments in JavaCC (posted here):

 "/*" (~["*"])* "*" (~["*","/"] (~["*"])* "*" | "*")* "/" 

As I read it, the lexer scans the input "/*", then zero or more characters other than "*", followed by a "*"; after that, zero or more repetitions of either (a character that is neither "*" nor "/", then zero or more non-"*" characters, then a "*") or a single "*"; ended by a "/". Specifically, this part boggles my mind:

 (~["*","/"] (~["*"])* "*" | "*")* 

I’d appreciate some help understanding this.
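One way to sanity-check a reading of the production is to translate it into an ordinary regex and try it on sample comments. Below is my sketch of such a translation (not from the JavaCC docs): the character classes [^*] and [^*/] play the role of ~["*"] and ~["*","/"].

```python
import re

# Hypothetical regex translation of the JavaCC production:
#   "/*"  (~["*"])*  "*"  ( ~["*","/"] (~["*"])* "*"  |  "*" )*  "/"
COMMENT = re.compile(r'/\*[^*]*\*(?:[^*/][^*]*\*|\*)*/')

assert COMMENT.fullmatch('/**/')
assert COMMENT.fullmatch('/* plain comment */')
assert COMMENT.fullmatch('/* stars * inside ** here */')
assert COMMENT.fullmatch('/***/')
# The pattern stops at the first "*/", so this is NOT a single comment:
assert COMMENT.fullmatch('/* a */ b */') is None
```

The last assertion is the point of the odd-looking middle group: it guarantees that every "*" inside the comment body is followed by something other than "/", so the match cannot run past the first closing "*/".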

Java lexical analyser

I created a lexical analyser in Java recently, but I don’t think the performance is very good.

The code works, but when I debugged the program, it takes around 100 milliseconds for only two tokens…

Can you read my code and give me tips about performance?

Lexer.java:

package me.minkizz.minlang;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Stream;

public class Lexer {

    private StringBuilder input = new StringBuilder();
    private Token token;
    private String lexema;
    private boolean exhausted;
    private String errorMessage = "";
    private static Set<Character> blankChars = new HashSet<Character>();

    static {
        blankChars.add('\r');
        blankChars.add('\n');
        blankChars.add((char) 8);
        blankChars.add((char) 9);
        blankChars.add((char) 11);
        blankChars.add((char) 12);
        blankChars.add((char) 32);
    }

    public Lexer(String filePath) {
        try (Stream<String> st = Files.lines(Paths.get(filePath))) {
            st.forEach(input::append);
        } catch (IOException ex) {
            exhausted = true;
            errorMessage = "Could not read file: " + filePath;
            return;
        }
        moveAhead();
    }

    public void moveAhead() {
        if (exhausted) {
            return;
        }
        if (input.length() == 0) {
            exhausted = true;
            return;
        }
        ignoreWhiteSpaces();
        if (findNextToken()) {
            return;
        }
        exhausted = true;
        if (input.length() > 0) {
            errorMessage = "Unexpected symbol: '" + input.charAt(0) + "'";
        }
    }

    private void ignoreWhiteSpaces() {
        int charsToDelete = 0;
        while (blankChars.contains(input.charAt(charsToDelete))) {
            charsToDelete++;
        }
        if (charsToDelete > 0) {
            input.delete(0, charsToDelete);
        }
    }

    private boolean findNextToken() {
        for (Token t : Token.values()) {
            int end = t.endOfMatch(input.toString());
            if (end != -1) {
                token = t;
                lexema = input.substring(0, end);
                input.delete(0, end);
                return true;
            }
        }
        return false;
    }

    public Token currentToken() {
        return token;
    }

    public String currentLexema() {
        return lexema;
    }

    public boolean isSuccessful() {
        return errorMessage.isEmpty();
    }

    public String errorMessage() {
        return errorMessage;
    }

    public boolean isExhausted() {
        return exhausted;
    }
}

Token.java:

package me.minkizz.minlang;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public enum Token {

    PRINT_KEYWORD("print\\b"), PRINTLN_KEYWORD("println\\b"), OPEN_PARENTHESIS("\\("), CLOSE_PARENTHESIS("\\)"),
    STRING("\"[^\"]+\""), NUMBER("\\d+(\\.\\d+)?");

    private final Pattern pattern;

    Token(String regex) {
        pattern = Pattern.compile("^" + regex);
    }

    int endOfMatch(String s) {
        Matcher m = pattern.matcher(s);
        if (m.find()) {
            return m.end();
        }
        return -1;
    }
}

Main.java:

package me.minkizz.minlang;

public class Main {

    public static void main(String[] args) {
        new Main();
    }

    public Main() {
        long start = System.nanoTime();
        Interpreter.execute("C:\\Users\\leodu\\OneDrive\\Bureau\\minlang.txt");
        long end = System.nanoTime();
        System.out
                .println("Program executed in " + (end - start) + "ns (" + Math.round((end - start) / 1000000) + "ms)");
    }
}

Interpreter.java:

package me.minkizz.minlang;

public class Interpreter {

    private static Token previousToken;

    public static void execute(String fileName) {
        Lexer lexer = new Lexer(fileName);

        while (!lexer.isExhausted()) {
            Token token = lexer.currentToken();
            String lexema = lexer.currentLexema();

            if (previousToken != null) {
                if (token == Token.STRING || token == Token.NUMBER) {
                    if (previousToken == Token.PRINT_KEYWORD) {
                        System.out.print(lexema);
                    } else if (previousToken == Token.PRINTLN_KEYWORD) {
                        System.out.println(lexema);
                    }
                }
            }

            previousToken = token;
            lexer.moveAhead();
        }
    }
}
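One direction worth measuring, sketched here in Python for brevity since the idea is language-independent: compile all token patterns into a single anchored alternation with named groups, so each step does one match at the current offset instead of looping over every Token and copying the whole remaining input with input.toString(). The names below are mine, not from the original code.

```python
import re

# Sketch of a combined-alternation lexer (assumed names, not the original Java).
# match(text, pos) anchors at pos, so the remaining input is never copied.
TOKEN_SPEC = [
    ('PRINTLN_KEYWORD', r'println\b'),
    ('PRINT_KEYWORD',   r'print\b'),
    ('OPEN_PAREN',      r'\('),
    ('CLOSE_PAREN',     r'\)'),
    ('STRING',          r'"[^"]+"'),
    ('NUMBER',          r'\d+(?:\.\d+)?'),
    ('WS',              r'\s+'),          # whitespace handled as a skipped token
]
MASTER = re.compile('|'.join(f'(?P<{name}>{rx})' for name, rx in TOKEN_SPEC))

def tokens(text):
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)       # one attempt tries every token type
        if not m:
            raise SyntaxError(f"Unexpected symbol: {text[pos]!r}")
        if m.lastgroup != 'WS':
            yield m.lastgroup, m.group()
        pos = m.end()

print(list(tokens('println "hi" 3.14')))
# [('PRINTLN_KEYWORD', 'println'), ('STRING', '"hi"'), ('NUMBER', '3.14')]
```

Note also that for a run this short, most of the ~100 ms is likely JVM startup and file I/O rather than the lexing loop itself, so it is worth timing the loop separately from the constructor.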

Lexical analysis: split chain of method calls into parts in Python

If I have a chain of method calls in Python, how do I extract the top level calls in a neat way? I have working code below.

TL;DR: the function should behave as follows:

>>> _parse_commands("df.hi()['fi'](__call__).NI(ni='NI!')")
['df', '.hi()', "['fi']", '(__call__)', ".NI(ni='NI!')"]

I have this monstrosity, which kinda* works, but I suspect it is rather ugly compared to what is possible in Python:

def _fix_square_brackets(commands):
    new_commands = []
    for command in commands:
        paren_level = 0
        bracket_level = 0
        starts = []
        for i, char in enumerate(command):
            if char == "(":
                paren_level += 1
            elif char == ")":
                paren_level -= 1
            elif paren_level == 0 and char == "[":  # start
                if bracket_level == 0:
                    starts.append(i)
                bracket_level += 1
            elif char == "]" and bracket_level != 0:
                bracket_level -= 1
                if bracket_level == 0 and paren_level == 0:
                    starts.append(i + 1)

        last_i = None
        for i in starts:
            new_command = command[last_i:i]
            if new_command:
                new_commands.append(new_command)
            last_i = i
        new_commands.append(command[last_i:])

    return new_commands


def _parse_commands(code):
    # parsing top-level commands
    level = 0
    starts = []
    dot_without_paren = False
    open_call = False
    lastchar = None
    for i, char in enumerate(code, 0):
        if char in (" ", "\n", "\t"):
            continue

        if char == "(":
            if lastchar == ")" and level == 0:
                starts.append((i, char))
                open_call = True
            dot_without_paren = False
            level += 1
        elif char == ")":
            if open_call and level == 0:
                open_call = False
            level -= 1
        elif level == 0 and char == "." and not dot_without_paren:
            dot_without_paren = True
            starts.append((i, char))
        lastchar = char

    commands = []
    last_i = None
    for (i, _) in starts:
        commands.append(code[last_i:i])
        last_i = i
    commands.append(code[last_i:])

    return _fix_square_brackets(commands)

I said kinda because it does not support chains split over multiple lines, nor does it strip comments.
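For comparison, here is a one-pass sketch of the same idea, under the same assumptions your code makes (no bracket characters inside string literals, no floating-point literals): a new top-level part starts at a "." when nesting depth is zero, or at "(" / "[" when the previous non-space character closed the preceding link. It is a shape to compare against, not a drop-in replacement, and the name is mine:

```python
def parse_commands(code):
    # One pass: both bracket kinds share a single nesting depth.
    # A new top-level part starts at '.', or at '(' / '[' following ')' / ']'.
    parts, start, depth = [], 0, 0
    prev = ''
    for i, ch in enumerate(code):
        if depth == 0 and (ch == '.' or (ch in '([' and prev in ')]')):
            if i > start:
                parts.append(code[start:i])
            start = i
        if ch in '([':
            depth += 1
        elif ch in ')]':
            depth -= 1
        if not ch.isspace():
            prev = ch
    parts.append(code[start:])
    return parts

print(parse_commands("df.hi()['fi'](__call__).NI(ni='NI!')"))
# ['df', '.hi()', "['fi']", '(__call__)', ".NI(ni='NI!')"]
```

If you do need multi-line chains and comment handling, the stdlib tokenize module is worth a look: it hands you real tokens with comments and line continuations already dealt with, so the depth counting can run over tokens instead of characters.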

Product of Lexical Specification

I have a problem that asks me to consider the string abbbaacc. I’m supposed to figure out which of the following lexical specifications produces the tokenization ab/bb/a/acc.

The options are:

A.  a(b+c)*
    b+

B.  ab
    b+
    ac*

C.  c*
    b+
    ab
    ac*

D.  b+
    ab*
    ac*

I just learned about regexes, and I’m not sure about this, but to solve this problem would I just be trying to see which options can make ab/bb/a/acc?

If that’s correct, then would the answer be all four of them?

Since all four of them can match ab/bb/a/acc:

A.  a(b+c)* -> a, ab, acc
    b+ -> bb

B.  ab -> ab
    b+ -> bb
    ac* -> a, acc

C.  c* ->
    b+ -> bb
    ab -> ab
    ac* -> a, acc

D.  b+ -> bb
    ab* -> a, ab
    ac* -> a, acc

I’m not sure if I’m doing this correctly.
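One thing worth checking beyond "can the rules produce these substrings": a lexer generator applies maximal munch, taking the longest possible match at each position, with earlier rules breaking ties. Some options whose rules can match all the pieces will still not produce this tokenization. A small sketch to test an option mechanically (the union operator "+" from the options is written as a character class here):

```python
import re

def tokenize(rules, s):
    """Maximal munch: longest match at each position; ties go to the
    earlier rule in the list."""
    out, pos = [], 0
    while pos < len(s):
        # max() returns the first maximal element, so rule order breaks ties.
        best = max((m.group() for r in rules
                    if (m := re.compile(r).match(s, pos)) and m.group()),
                   key=len, default=None)
        if best is None:
            raise ValueError(f"stuck at position {pos}")
        out.append(best)
        pos += len(best)
    return out

# Option A: a(b+c)* means a followed by any mix of b's and c's.
print(tokenize(['a[bc]*', 'b+'], 'abbbaacc'))     # ['abbb', 'a', 'acc']
# Option B:
print(tokenize(['ab', 'b+', 'ac*'], 'abbbaacc'))  # ['ab', 'bb', 'a', 'acc']
```

So option A, for instance, is ruled out even though its rules can match all four pieces: maximal munch takes abbb in one bite. Running the other options through the same check should settle the question.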