loading...

A Haskell AST parser and formatter!

yujiri8 profile image Ryan Westlund ・2 min read

As you might know if you follow me, I dabble in Haskell, and intend to use it to for a machine translator from my conlang to English. I was working on it a bit today, and reading a tutorial showing how to make a calculator, but I couldn't figure out how to apply his designs to the translator, so I decided to try something more similar - an AST parser and formatter. Took me a few hours but I got the basic functionality working! The program takes in source code and spits out a formatted version (go fmt-style).

Current features:

  • Only assignment statements are supported.
  • Variable name rules are standard.
  • Literals can only be numbers.
  • A variable can be set to a literal or another variable.
  • Backslashes can break lines, Python-style.

I used the regex-pcre library for tokenizing.

Code:

import Text.Regex.PCRE
import Data.List

data Token =
 TokName String |
 TokLiteral String |
 TokEquals |
 TokSpace |
 TokBinaryOp String | -- TODO enum
 TokStmtBreak
 deriving (Eq, Show, Read)

-- data BinaryOp = Plus | Minus
 -- deriving (Eq)
-- instance Show BinaryOp where
 -- show Plus = "+"
 -- show Minus = "-"
-- instance Read BinaryOp where
 -- read "+" = Plus
 -- read "-" = Minus

tokenize :: String -> [Token] -> [Token]
tokenize "" tokens = tokens
tokenize code tokens =
 let (token, length) = getNextToken code
 in tokenize (drop length code) (tokens ++ [token])

main = do
 code <- getContents
 let tokens = tokenize code []
 -- now parse.
 let ast = parseAST (filter (\t -> t /= TokSpace) tokens) []
 -- output.
 putStr $ fmtAST ast

getNextToken :: String -> (Token, Int)
getNextToken code
  | not $ null spaceMatch = (TokSpace, length spaceMatch)
  | not $ null stmtBreakMatch = (TokStmtBreak, length stmtBreakMatch)
  | not $ null binaryOpMatch = (TokBinaryOp binaryOpMatch, length binaryOpMatch)
  | not $ null equalsMatch = (TokEquals, 1)
  | not $ null nameMatch = (TokName nameMatch, length nameMatch)
  | not $ null literalMatch = (TokLiteral literalMatch, length literalMatch)
  | otherwise = error $ "invalid token at " ++ code
  where spaceMatch = code =~ "\\A([ \t]|\\\\\n)+" :: String
        stmtBreakMatch = code =~ "\\A[[:space:]]+" :: String
        binaryOpMatch = code =~ "\\A(\\+|-|\\*|/|\\*\\*|%|&|\\||\\^|and|or|==|<|>|<=|>=|!=)" :: String
        equalsMatch = code =~ "\\A=" :: String
        nameMatch = code =~ "\\A[a-zA-Z_][a-zA-Z_0-9]*" :: String
        literalMatch = code =~ "\\A([0-9]+(\\.[0-9]+)?)" :: String

type AST = [Statement]
type Statement = Assignment
data Assignment = Assignment String Expr
data Expr = SimpleExpr Term | ComplexExpr Term String Expr -- TODO enum
data Term = LiteralTerm String | NameTerm String
instance Show Term where
 show (LiteralTerm term) = term
 show (NameTerm term) = term

parseAST :: [Token] -> AST -> AST
parseAST [] ast = ast
parseAST tokens ast =
 let stmtBreakIndex = findIndex (\t -> t == TokStmtBreak) tokens
     stmtTokens =
       case stmtBreakIndex of
         Just i -> i
         Nothing -> length tokens
     nextStmt = parseStmt (take stmtTokens tokens)
     remainingTokens = (drop (stmtTokens + 1) tokens)
 in
   if stmtTokens == 0 then parseAST remainingTokens ast
   else parseAST (drop (stmtTokens + 1) tokens) (ast ++ [nextStmt])

parseStmt :: [Token] -> Statement
parseStmt tokens =
  case tokens !! 0 of
    TokName name ->
      case tokens !! 1 of
         TokEquals ->
           Assignment name (parseExpr $ drop 2 tokens)
         token -> error $ "unparseable token:" ++ show token
    token -> error $ "unparseable token:" ++ show token

parseExpr :: [Token] -> Expr
parseExpr (TokName name : tokens) =
 if null tokens
   then SimpleExpr (NameTerm name)
   else error "Not yet suported"
parseExpr (TokLiteral literal : tokens) =
 if null tokens
   then SimpleExpr (LiteralTerm literal)
   else error "Not yet suported"

fmtAST :: AST -> String
fmtAST ast = unlines $ fmtStmt <$> ast

fmtStmt :: Statement -> String
fmtStmt (Assignment var val) =
 var ++ " = " ++ (fmtExpr val)

fmtExpr (SimpleExpr term) = show term

Example input:

q=3

w =q

Output:

q = 3
w = q

I'm proud :)

Posted on Jun 15 by:

yujiri8 profile

Ryan Westlund

@yujiri8

I'm a programmer, writer, and philosopher. My Github account is yujiri8; all my content besides code is at yujiri.xyz.

Discussion

markdown guide