Build your own Json parser in Golang

Table of contents

Before we start, if you find anything hard to understand in this or theres something can be improved let me know in the comments or you can reach me out on x.com/Yash_Dhadve4

First of all what even is a parser?

Well parser is simply a application/function that takes in the data/string and does appropriate actions based on the data. In case of json parser a string will be taken as input and we will have to extract data from it, check if the structure of given string matches the appropriate structure we defined and map data to variables based on input string.

From below example we can see how standard json decoder works -

package main

import (
    "encoding/json"
    "fmt"
)

type student struct {
    Name string
}

func main() {
    // we get data in byte format
    data := []byte(`{
        "Name":"yash"
    }`)

    // initilizing a datastructure in which we want to capture data
    var aki student

    // parsing data and extrating data into aki
    json.Unmarshal(data, &aki)
    fmt.Println(aki)
}

So thats exactly what we will be trying to make. Just a functions one that converts string or byte data into a structured json data format.

So to convert string in to a data we need to first be able to understand that string. Like If we have string something like { “name”:”yash is my name” }. We should be able to separate above string in appropriate parts like { name : yash is my name } . This process is called tokenisation.

And then we parse those tokens, validate them ( if they are in valid json format ) and then extract information out of it.

So now lets start by tokenization.

Tokenization

File structure looks something like this -

We first need to define the type of tokens we need. so starting with token.go

package token

type TokenType string

type Token struct {
    Type    TokenType // This tells the type of token
    Literal string // this is the actual data string we extracted 
}

const (
    ILLEGAL = "ILLEGAL" // This token tells if we have encountered an illigal token
    EOF     = "EOF" // This tells us if we have reached the end of the file

    COMMA = ","
    COLON = ":"

    LBRACE = "{"
    RBRACE = "}"
    LBRACK = "["
    RBRACK = "]"

    INT    = "INT"
    BOOL   = "BOOL"
    STRING = "STRING"
)

Now the hard part, lexer.go -

package lexer

import "jping/token"

type Lexer struct {
    input        string // Input string is the string which we want to convert to json
    position     int // Position is the pointer that keeps track of what character we are reading
    readPosition int // This pointer is one character of position pointer
    ch           byte // This holds the current character of readPosition
}

// This creates a new object of lexer
func New(input string) *Lexer {
    l := &Lexer{input: input}
    l.readChar()
    return l
}

// This function increments the position of position and readPointer by one,
// and replaces the current character with next character
func (l *Lexer) readChar() {
    if l.readPosition >= len(l.input) {
        l.ch = 0
    } else {
        l.ch = l.input[l.readPosition]
    }
    l.position = l.readPosition
    l.readPosition += 1
}

// This the the actual function where we convert our string into tokens
func (l *Lexer) NextToken() token.Token {
    var tok token.Token

    switch l.ch {

    case '{':
        tok = token.Token{Type: token.LBRACE, Literal: string(l.ch)}
    case '}':
        tok = token.Token{Type: token.RBRACE, Literal: string(l.ch)}
    case ':':
        tok = token.Token{Type: token.COLON, Literal: string(l.ch)}
    case '"':
        l.readChar()
        tok.Literal = l.readString() // This reads characters until " is encountered
        tok.Type = token.STRING
    case ',':
        tok = token.Token{Type: token.COMMA, Literal: string(l.ch)}
    case 0:
        tok.Literal = ""
        tok.Type = token.EOF
    }
    l.readChar()
    return tok
}

// reads a string till " is encountered and returns that string
func (l *Lexer) readString() string {
    var str string
    for l.ch != '"' {
        str += string(l.ch)
        l.readChar()
    }
    return str
}

main.go -

package main

import (
    "fmt"
    "jping/lexer"
    "jping/token"
)

type lak struct {
    Name string
}

func main() {
    data := `{"Name":"cats are cute"}`
    lex := lexer.New(data) // initilizing lexer

    tok := lex.NextToken() // reads the first token
    // we read till end of file is encountered
    for tok.Type != token.EOF {
        fmt.Println(tok.Type, "  ", tok.Literal)
        tok = lex.NextToken()
    }
}

This should be able to tokenize our string data, after you run this you should probably get following out put

{    {
STRING    Name
:    :
STRING    cats are cute
}    }

now lets try with this example -

data := `{
        "Name":"cats are cute",
        "age": "21"
        }

    `
{    {


STRING    Name
:    :
STRING    cats are cute
,    ,


STRING    age
:    :
STRING    21


}    }

Well we got correct response but theres white space getting counted as token, so we need to fix that. Add this in the start of nextToken() function -


func (l *Lexer) NextToken() token.Token {
    // this for loops checks in the current character is breakline(\n) or space or tab
    // if it is then skips that character and loops till theres something
    // other than that
    for l.ch == '\n' || l.ch == ' ' || l.ch == '\t' {
        l.readChar()
    }

    // ... rest of the function 
}

And that fixes our problem. So now lets make new problems by adding more datatypes like integer and booleans.

token.go -

const (
    // ... rest of the code
    TRUE  = "TRUE"
    FALSE = "FALSE"
)
// this keywords will help us identify inbuild keywords like true, false
var keywords = map[string]TokenType{
    "true":  BOOL,
    "false": BOOL,
}

lexer.go -

func (l *Lexer) NextToken() token.Token {
    // ... rest of the code 
    switch l.ch {
    // ... rest of the code 
    default: // default case is at the end which handles identifing keyword or illegal token
        if isLetter(l.ch) { 
            // we check if the character is letter
            // if it is letter then we read that whole word
            // and check if we it is a keyword
            tok.Literal = l.readIdentifier()
            tok.Type = token.Keywords[tok.Literal]
            return tok
        } else if isDigit(l.ch) {
            // similarly we check for number and create an integer token
            tok.Type = token.INT
            tok.Literal = l.realNumber()
            return tok
        } else {
            // if none of above cases match then that means
            // we have gotten a illigal character like # / etc ...
            tok = token.Token{Type: token.ILLEGAL, Literal: string(l.ch)}
            fmt.Println("Got illigal token")
        }
    // ... rest of the code
}

func (l *Lexer) readIdentifier() string {
    position := l.position
    for isLetter(l.ch) || isDigit(l.ch) {
        l.readChar()
    }
    return l.input[position:l.position]
}

func (l *Lexer) realNumber() string {
    position := l.position
    for isDigit(l.ch) {
        l.readChar()
    }
    // if we encounter something like 12abc then it is illegle number
    if isLetter(l.ch) {
        fmt.Println("A number shouldn't have letters")
    }
    return l.input[position:l.position]
}

func isDigit(ch byte) bool {
    return '0' <= ch && ch <= '9'
}

func isLetter(ch byte) bool {
    return 'a' <= ch && ch <= 'z' || 'A' <= ch && ch <= 'Z' || ch == '_'
}

and now lets test it. main.go -

func main() {
    // data := `{"Name":"yash is my name"}`
    data := `{
        "Name":"cats are cute",
        "age": 123423,
        "isAlive": true
        }

    `
    // ... rest of the code
}

output -

{    {
STRING    Name
:    :
STRING    cats are cute
,    ,
STRING    age
:    :
INT    123423
,    ,
STRING    isAlive
:    :
BOOL    true
}    }

This concludes our first part, that is tokenization. Now onto parsing and making sense of these gibberish stream of tokens we have.

Parser

Before we get into coding parser lets first understand what we want it to do.

what we want from our parser is it should be able to validate the structure of json data and extract information out of it. Lets understand it from below example

// this for json structure is correct because
// it starts with { and ends with } and the key has value assigned to it.
{
    "name" : "yash"
}

// wrong json structure would be - 
{ {
    "name" : "yash"
}
// or 
{
    "name" : 
}
// or 
{
    "name" "yash" 
}
// or 
{
    "name" : "yash",
}

So we can see that it has a proper structure that we need to follow -

  1. It starts with { so it should have } to end that structure.

  2. every key should have a value to it

  3. both key and value should be in double quotes

  4. A : colon should separate key and value

  5. If there are more key value pairs then they should be separated by , comma

Heres a better representation of json - www.json.org/json-en.html

create file parser/parser.go -

package parser

import (
    "fmt"
    "jping/lexer"
    "jping/token"
)

// this structure will hold the tokens we get from our lexer
type Parser struct {
    l *lexer.Lexer
    curToken  token.Token
    peekToken token.Token
}

// This function is used to start the parsing of our data
// it takes in stream of tokens that we create using our lexer
// and converts it into key value structure and return it
// The map is of type [string]interface{} because key will always be of type string
// and the value may be of type int, bool, string, array or other struct
func (p *Parser) ParseJson() map[string]interface{} {
    // This structure will collect all the key values form our data
    values := make(map[string]interface{})

    // We check for initial { and then skip it
    if p.curToken.Type != token.LBRACE {
        return nil
    }
    p.nextToken()

    // we parse all the data in it and store it in values
    p.parseKeyVal(&values)

    return values
}

// creates a new parser
func New(l *lexer.Lexer) *Parser {
    p := &Parser{
        l: l,
    }
    // Read two tokens, so curToken and peekToken are both set
    p.nextToken()
    p.nextToken()
    return p
}

func (p *Parser) nextToken() {
    p.curToken = p.peekToken
    p.peekToken = p.l.NextToken()
}

// This funcion parses all the tokens
func (p *Parser) parseKeyVal(values *map[string]interface{}) {
    switch p.curToken.Type {
    // when we encounter comma that means theres another key value pair
    // so we skip comma and do recursive parsekeyVal() to get next value
    case token.COMMA:
        p.nextToken()
        p.parseKeyVal(values)
    // when we encounter a string then it will always be key
    // that is because the first string encountered in key value will ofcourse be key
    // ex. "name" : "yash"
    //       ^        ^ 
    //      key     value
    // And when we encounte that key  we will check for 
    // next colon : and value in this funcion itself
    case token.STRING:
        key := p.curToken.Literal
        // checking if next token is colon
        if p.peekToken.Type != token.COLON { 
            fmt.Println("Error while parsing! Expected COLON got ", p.peekToken)
            return
        }
        p.nextToken() // now we have colon as curtoken
        p.nextToken() // now we must be having value as currtoken
        (*values)[key] = p.curToken.Literal // adding the key value pair

        // we check if next token is } which means we came to end of the json structure
        // hence our function should end
        // if it is not } then we continue the recursive call of parseKeyVal to read more values
        if p.peekToken.Type != token.RBRACE {
            p.nextToken()
            p.parseKeyVal(values)
        }
    }
    return
}

main.go -

package main

import (
    "fmt"
    "jping/lexer"
    "jping/parser"
)

type lak struct {
    Name string
    Age  int
}

func main() {
    // we are using strings for int and bool values because we haven't implemented them yet
    data := `{
        "Name":"cats are cute",
        "Age": "123423",
        "IsAlive": "true"
        }

     `
    lex := lexer.New(data)
    par := parser.New(lex)

    vals := par.ParseJson()

    for k, v := range vals {
        fmt.Println(k, " : ", v)
    }
}

And with this our basic parser is ready and it should give output something like this -

Name  :  cats are cute
Age  :  123423
IsAlive  :  true

Understanding this may be hard for first time so lets try to visualise it

Now lets add other data types, that is integer and booleans.

in parser/parser.go - we will modify string token case

case token.STRING:
        key := p.curToken.Literal
        if p.peekToken.Type != token.COLON {
            fmt.Println("Error while parsing! Expected COLON got ", p.peekToken)
            return
        }
        p.nextToken() // now we have colon as curtoken
        p.nextToken() // now we must be having value as currtoken

        // Modified part ->
        if p.curToken.Type == token.BOOL {
            if p.curToken.Literal == "true" {
                (*values)[key] = true
            } else {
                (*values)[key] = false
            }
        } else if p.curToken.Type == token.INT {
            i, e := strconv.Atoi(p.curToken.Literal)
            if e == nil {
                (*values)[key] = i
            }
        } else {
            (*values)[key] = p.curToken.Literal
        }
        if p.peekToken.Type != token.RBRACE {
            p.nextToken()
            p.parseKeyVal(values)
        }
    }

Now we have support for int, string and boolean. but what arrays and structs I hear you say ! well lets start with arrays.

lexer/lexer.go -

func (l *Lexer) NextToken() token.Token {
    // ... rest of the code
    switch l.ch {
    case '[':
        tok = token.Token{Type: token.LBRACK, Literal: string(l.ch)}
    case ']':
        tok = token.Token{Type: token.RBRACK, Literal: string(l.ch)}
    // ... rest of the code
    }
    // ... rest of the code
}

parser/parser.go -

func (p *Parser) parseKeyVal(values *map[string]interface{}) {
    switch p.curToken.Type {
    // ... rest of the code
    case token.STRING:
        // ... rest of the code
        } else if p.curToken.Type == token.LBRACK {
            // when we encounter [ that means the object is of type array
            p.nextToken() // skipping the left bracket [
            // checking what type of data we have in the array 
            // and creating array with that datatype
            if p.curToken.Type == token.INT {
                var arr []int 
                // we loop through all values in array object until we encounter ] i.e the end of array
                for p.curToken.Type != token.RBRACK {
                    // converting string literal to int and storing it in arr 
                    i, e := strconv.Atoi(p.curToken.Literal)
                    if e == nil {
                        arr = append(arr, i)
                    }
                    p.nextToken()
                    // here we check if theres a comma
                    // if there is a comma the means there are more values to collect
                    // so we check if next value is of correct type or not 
                    // and then we skip the comma to get the next value
                    if p.curToken.Type == token.COMMA {
                        if p.peekToken.Type != token.INT {
                            fmt.Println("error while parsing array. Expected Integer token got ", p.peekToken)
                            return
                        }
                        p.nextToken()
                    }
                }
                (*values)[key] = arr 

            } else if p.curToken.Type == token.BOOL {
                // similarly we do for boolean and string.
                var arr []bool
                for p.curToken.Type != token.RBRACK {
                    if p.curToken.Literal == "true" {
                        arr = append(arr, true)
                    } else {
                        arr = append(arr, false)
                    }
                    p.nextToken()
                    if p.curToken.Type == token.COMMA {
                        if p.peekToken.Type != token.BOOL {
                            fmt.Println("error while parsing array. Expected Integer token got ", p.peekToken)
                            return
                        }
                        p.nextToken()
                    }
                }
                (*values)[key] = arr

            } else if p.curToken.Type == token.STRING {
                var arr []string
                for p.curToken.Type != token.RBRACK {
                    arr = append(arr, p.curToken.Literal)
                    p.nextToken()
                    if p.curToken.Type == token.COMMA {
                        if p.peekToken.Type != token.STRING {
                            fmt.Println("error while parsing array. Expected Integer token got ", p.peekToken)
                            return
                        }
                        p.nextToken()
                    }
                }
                (*values)[key] = arr

            }
        } else {
            (*values)[key] = p.curToken.Literal
        }
        // ... rest of the code
    }
    return 
}

main.go -


func main() {
    data := `{
        "Name":"cats are cute",
        "Age": 123423,
        "IsAlive": true,
        "ScoreInt" : [1,2,3],
        "ScoreBool" : [false, false, true],
        "Scorestr" : ["ma", "sco", "sec"],
        }

     `
    lex := lexer.New(data)
    par := parser.New(lex)

    vals := par.ParseJson()

    for k, v := range vals {
        fmt.Printf("%v : %v\n", k, v)
    }
}

output -

Name : cats are cute
Age : 123423
IsAlive : true
ScoreInt : [1 2 3]
ScoreBool : [false false true]
Scorestr : [ma sco sec]

This is getting too much convoluted so lets divide it into small functions -

func (p *Parser) parseKeyVal(values *map[string]interface{}) {
    switch p.curToken.Type {
    // ... rest of the code
    case token.STRING:
        // ... rest of the code
        } else if p.curToken.Type == token.LBRACK {
            p.nextToken() // skipping the left bracket [
                if p.curToken.Type == token.INT {
                    (*values)[key] = p.createIntArray()
                } else if p.curToken.Type == token.BOOL {
                    (*values)[key] = p.createBoolArray()
                } else if p.curToken.Type == token.STRING {
                    (*values)[key] = p.createStrArray()
                }
        } else {
            (*values)[key] = p.curToken.Literal
        }
        // ... rest of the code
    }
    return 
}


func (p *Parser) createIntArray() []int {
    var arr []int
    for p.curToken.Type != token.RBRACK {
        i, e := strconv.Atoi(p.curToken.Literal)
        if e == nil {
            arr = append(arr, i)
        }
        p.nextToken()
        if p.curToken.Type == token.COMMA {
            if p.peekToken.Type != token.INT {
                fmt.Println("error while parsing array. Expected Integer token got ", p.peekToken)
                return nil
            }
            p.nextToken()
        }
    }
    return arr
}

func (p *Parser) createBoolArray() []bool {
    var arr []bool
    for p.curToken.Type != token.RBRACK {
        if p.curToken.Literal == "true" {
            arr = append(arr, true)
        } else {
            arr = append(arr, false)
        }
        p.nextToken()
        if p.curToken.Type == token.COMMA {
            if p.peekToken.Type != token.BOOL {
                fmt.Println("error while parsing array. Expected Integer token got ", p.peekToken)
                return nil
            }
            p.nextToken()
        }
    }
    return arr
}

func (p *Parser) createStrArray() []string {
    var arr []string
    for p.curToken.Type != token.RBRACK {
        arr = append(arr, p.curToken.Literal)
        p.nextToken()
        if p.curToken.Type == token.COMMA {
            if p.peekToken.Type != token.STRING {
                fmt.Println("error while parsing array. Expected Integer token got ", p.peekToken)
                return nil
            }
            p.nextToken()
        }
        // ... rest of the code
    }
    return arr
}

Much better !!

An array could also have array in it.ex [[1,2], [3,4]] or [ [[1,2,3], [1]], [[23]] ]. From the example we can see that there is this repeating pattern of arrays in arrays and so on until we reach some type like bool, int or string. As you could have guessed by this we are gonna solve this problem using recursion.

func (p *Parser) parseKeyVal(values *map[string]interface{}) {
    switch p.curToken.Type {
    // ... rest of the code
    case token.STRING:
        // ... rest of the code
        } else if p.curToken.Type == token.LBRACK {
           p.nextToken() // skipping the left bracket [
            // what we have changed here from above is
            // we have put the if else statement in a funcion
            // and the reason we have done that is becasue 
            // we are gonna use this function recursively
            var getArray func() interface{}
            getArray = func() interface{} {
                if p.curToken.Type == token.INT {
                    return p.createIntArray()
                } else if p.curToken.Type == token.BOOL {
                    return p.createBoolArray()
                } else if p.curToken.Type == token.STRING {
                    return p.createStrArray()
                } else if p.curToken.Type == token.LBRACK {
                    // this is the new part we have added
                    // if we are encountering a [ that means there is nested array
                    // to handle that we create a new array and then recursively
                    p.nextToken()
                    var ark []interface{}
                    arr := getArray()
                    p.nextToken() // skipping ] bracket
                    ark = append(ark, arr) // we recursively read first array

                    // this loops till there is no value left in array
                    for p.curToken.Type == token.COMMA {
                        p.nextToken()
                        if p.curToken.Type == token.LBRACK {
                            p.nextToken()
                        } else {
                            fmt.Println("Expected [ got ", p.curToken.Literal)
                        }
                        // we recursively get then next array
                        arr := getArray()
                        p.nextToken()
                        ark = append(ark, arr)
                    }
                    return ark
                } else if p.curToken.Type == token.RBRACK {
                    p.nextToken()
                }
                return nil
            }
            (*values)[key] = getArray() 
        } else {
            (*values)[key] = p.curToken.Literal
        }
    return arr
}

Now only data type to be implemented is struct.

func (p *Parser) parseKeyVal(values *map[string]interface{}) {
    switch p.curToken.Type {
    // ... rest of the code
    case token.STRING:
        // ... rest of the code
        } else if p.curToken.Type == token.LBRACE {
            // A { represents that theres a struct
            // so we simply make a map to store values of that nexted struct
            // and recursively pass in parseKeyVal function
            nextedValues := make(map[string]interface{})
            p.nextToken()
            p.parseKeyVal(&nextedValues)
            (*values)[key] = nextedValues
        } else {
            (*values)[key] = p.curToken.Literal
        }
    return arr
}

main.go -


func main() {
    data := `{
        "Name":"cats are cute",
        "Age": 123423,
        "IsAlive": true,
        "ScoreInt" : [1,2,3],
        "ScoreBool" : [false, false, true],
        "Scorelist" : [[1,2,3], [23,3]],
        "Scorestr" : ["ma", "sco", "sec"],
        "address" : {
                "city" : "Mumbai",
                "state" : "Maharashtra"
            }
        }

     `
    lex := lexer.New(data)
    par := parser.New(lex)

    vals := par.ParseJson()

    for k, v := range vals {
        fmt.Printf("%v : %v\n", k, v)
    }
}

output -

Age : 123423
IsAlive : true
ScoreInt : [1 2 3]
ScoreBool : [false false true]
Scorelist : [[1 2 3] [23 3]]
Scorestr : [ma sco sec]
address : map[city:Mumbai state:Maharashtra]
Name : cats are cute

And we are done !!

So you are here atlast. All this work you did! And made your own json parser! Pat yourself on the back for getting this far. And watch this video of prime as treat, you will surely find it fun after all that work Is JSON Blazingly Fast or...?

Get complete code at - github.com/Exar04/jping

Hope you had fun making this parser.