Before we start: if you find anything hard to understand here, or something that can be improved, let me know in the comments, or you can reach me at x.com/Yash_Dhadve4
First of all, what even is a parser?
Well, a parser is simply an application/function that takes in data (here, a string) and performs appropriate actions based on it. In the case of a JSON parser,
a string is taken as input and we have to extract data from it, check whether the structure of the given string matches the structure we defined, and map the data to variables based on the input string.
The example below shows how the standard JSON decoder works -
package main

import (
	"encoding/json"
	"fmt"
)

type student struct {
	Name string
}

func main() {
	// we get data in byte format
	data := []byte(`{
		"Name":"yash"
	}`)
	// initializing a data structure in which we want to capture the data
	var aki student
	// parsing the data and extracting it into aki
	if err := json.Unmarshal(data, &aki); err != nil {
		fmt.Println("failed to parse:", err)
		return
	}
	fmt.Println(aki)
}
So that's exactly what we will be trying to make: a function that converts string or byte data into a structured JSON data format.
So to convert a string into data, we first need to be able to understand that string. If we have a string like { "name":"yash is my name" },
we should be able to separate it into its appropriate parts, like
{
name
:
yash is my name
}
This process is called tokenization.
And then we parse those tokens, validate them (check that they are in valid JSON format), and extract information out of them.
So now let's start with tokenization.
Tokenization
The project is a Go module named jping, and the file structure looks like this: token/token.go, lexer/lexer.go, parser/parser.go, and a main.go at the root.
We first need to define the types of tokens we need, so let's start with token.go -
package token

type TokenType string

type Token struct {
	Type    TokenType // the type of the token
	Literal string    // the actual data string we extracted
}

const (
	ILLEGAL = "ILLEGAL" // tells us we have encountered an illegal token
	EOF     = "EOF"     // tells us we have reached the end of the input

	COMMA = ","
	COLON = ":"

	LBRACE = "{"
	RBRACE = "}"
	LBRACK = "["
	RBRACK = "]"

	INT    = "INT"
	BOOL   = "BOOL"
	STRING = "STRING"
)
Now the hard part, lexer.go -

package lexer

import "jping/token"

type Lexer struct {
	input        string // the input string we want to tokenize
	position     int    // points to the character we are currently reading
	readPosition int    // always one character ahead of position
	ch           byte   // holds the current character (the one at position)
}

// New creates a new Lexer object
func New(input string) *Lexer {
	l := &Lexer{input: input}
	l.readChar()
	return l
}

// readChar increments position and readPosition by one
// and replaces the current character with the next character
func (l *Lexer) readChar() {
	if l.readPosition >= len(l.input) {
		l.ch = 0 // 0 marks the end of the input
	} else {
		l.ch = l.input[l.readPosition]
	}
	l.position = l.readPosition
	l.readPosition += 1
}
// NextToken is the actual function where we convert our string into tokens
func (l *Lexer) NextToken() token.Token {
	var tok token.Token

	switch l.ch {
	case '{':
		tok = token.Token{Type: token.LBRACE, Literal: string(l.ch)}
	case '}':
		tok = token.Token{Type: token.RBRACE, Literal: string(l.ch)}
	case ':':
		tok = token.Token{Type: token.COLON, Literal: string(l.ch)}
	case '"':
		l.readChar()
		tok.Literal = l.readString() // reads characters until " is encountered
		tok.Type = token.STRING
	case ',':
		tok = token.Token{Type: token.COMMA, Literal: string(l.ch)}
	case 0:
		tok.Literal = ""
		tok.Type = token.EOF
	}

	l.readChar()
	return tok
}
// readString reads until the closing " is encountered and returns that string
func (l *Lexer) readString() string {
	var str string
	for l.ch != '"' && l.ch != 0 { // the != 0 check stops us from looping forever on an unterminated string
		str += string(l.ch)
		l.readChar()
	}
	return str
}
main.go -

package main

import (
	"fmt"
	"jping/lexer"
	"jping/token"
)

type lak struct {
	Name string
}

func main() {
	data := `{"Name":"cats are cute"}`
	lex := lexer.New(data) // initializing the lexer
	tok := lex.NextToken() // reads the first token
	// we read until end of file is encountered
	for tok.Type != token.EOF {
		fmt.Println(tok.Type, " ", tok.Literal)
		tok = lex.NextToken()
	}
}
This should be able to tokenize our string data; after you run it you should get the following output -
{ {
STRING Name
: :
STRING cats are cute
} }
Now let's try with this example -
data := `{
"Name":"cats are cute",
"age": "21"
}
`
{ {
STRING Name
: :
STRING cats are cute
, ,
STRING age
: :
STRING 21
} }
Well, we got the correct response, but whitespace is getting counted as tokens, so we need to fix that. Add this at the start of the NextToken() function -

func (l *Lexer) NextToken() token.Token {
	// this loop checks if the current character is a newline (\n), space, or tab
	// (\r is included too, for Windows line endings);
	// if it is, we skip that character and keep looping until there's something else
	for l.ch == '\n' || l.ch == ' ' || l.ch == '\t' || l.ch == '\r' {
		l.readChar()
	}
	// ... rest of the function
}
And that fixes our problem. So now let's make new problems by adding more data types, like integers and booleans.
token.go -

const (
	// ... rest of the code
	TRUE  = "TRUE"
	FALSE = "FALSE"
)

// Keywords helps us identify built-in keywords like true and false.
// Note: it must be exported (capital K) because the lexer accesses it as token.Keywords
var Keywords = map[string]TokenType{
	"true":  BOOL,
	"false": BOOL,
}
lexer.go -

func (l *Lexer) NextToken() token.Token {
	// ... rest of the code
	switch l.ch {
	// ... rest of the code
	default: // the default case at the end handles identifying keywords or illegal tokens
		if isLetter(l.ch) {
			// we check if the character is a letter;
			// if it is, we read the whole word
			// and check whether it is a keyword
			tok.Literal = l.readIdentifier()
			tok.Type = token.Keywords[tok.Literal]
			return tok
		} else if isDigit(l.ch) {
			// similarly we check for a number and create an integer token
			tok.Type = token.INT
			tok.Literal = l.readNumber()
			return tok
		} else {
			// if none of the above cases match, that means we have
			// gotten an illegal character like # / etc ...
			tok = token.Token{Type: token.ILLEGAL, Literal: string(l.ch)}
			fmt.Println("Got illegal token") // this needs "fmt" added to the imports
		}
	}
	// ... rest of the code
}
func (l *Lexer) readIdentifier() string {
	position := l.position
	for isLetter(l.ch) || isDigit(l.ch) {
		l.readChar()
	}
	return l.input[position:l.position]
}

func (l *Lexer) readNumber() string {
	position := l.position
	for isDigit(l.ch) {
		l.readChar()
	}
	// if we encounter something like 12abc then it is an illegal number
	if isLetter(l.ch) {
		fmt.Println("A number shouldn't have letters")
	}
	return l.input[position:l.position]
}

func isDigit(ch byte) bool {
	return '0' <= ch && ch <= '9'
}

func isLetter(ch byte) bool {
	return 'a' <= ch && ch <= 'z' || 'A' <= ch && ch <= 'Z' || ch == '_'
}
And now let's test it. main.go -

func main() {
	// data := `{"Name":"yash is my name"}`
	data := `{
		"Name":"cats are cute",
		"age": 123423,
		"isAlive": true
	}
	`
	// ... rest of the code
}
output -
{ {
STRING Name
: :
STRING cats are cute
, ,
STRING age
: :
INT 123423
, ,
STRING isAlive
: :
BOOL true
} }
This concludes our first part, tokenization. Now onto parsing and making sense of this gibberish stream of tokens we have.
Parser
Before we get into coding the parser, let's first understand what we want it to do.
What we want from our parser is that it should be able to validate the structure of the JSON data and extract information out of it. Let's understand this from the example below -
// this json structure is correct because
// it starts with { and ends with }, and the key has a value assigned to it.
{
"name" : "yash"
}
// wrong json structure would be -
{ {
"name" : "yash"
}
// or
{
"name" :
}
// or
{
"name" "yash"
}
// or
{
"name" : "yash",
}
So we can see that it has a proper structure that we need to follow -
It starts with { so it should have } to end that structure.
Every key should have a value assigned to it.
Both key and value should be in " double quotes.
A : colon should separate key and value.
If there are more key-value pairs, they should be separated by a , comma.
Here's a better representation of JSON - www.json.org/json-en.html
Create the file parser/parser.go -
package parser

import (
	"fmt"
	"jping/lexer"
	"jping/token"
)

// Parser holds the lexer and the tokens we get from it
type Parser struct {
	l         *lexer.Lexer
	curToken  token.Token
	peekToken token.Token
}

// ParseJson starts the parsing of our data.
// It takes the stream of tokens that we create using our lexer,
// converts it into a key-value structure, and returns it.
// The map is of type map[string]interface{} because a key will always be a string,
// while the value may be an int, bool, string, array, or another struct.
func (p *Parser) ParseJson() map[string]interface{} {
	// this map will collect all the key-value pairs from our data
	values := make(map[string]interface{})
	// we check for the initial { and then skip it
	if p.curToken.Type != token.LBRACE {
		return nil
	}
	p.nextToken()
	// we parse all the data and store it in values
	p.parseKeyVal(&values)
	return values
}

// New creates a new parser
func New(l *lexer.Lexer) *Parser {
	p := &Parser{
		l: l,
	}
	// read two tokens, so curToken and peekToken are both set
	p.nextToken()
	p.nextToken()
	return p
}

func (p *Parser) nextToken() {
	p.curToken = p.peekToken
	p.peekToken = p.l.NextToken()
}
// parseKeyVal parses all the tokens
func (p *Parser) parseKeyVal(values *map[string]interface{}) {
	switch p.curToken.Type {
	// when we encounter a comma, that means there's another key-value pair,
	// so we skip the comma and call parseKeyVal recursively to get the next value
	case token.COMMA:
		p.nextToken()
		p.parseKeyVal(values)
	// when we encounter a string here it will always be a key,
	// because the first string in a key-value pair is of course the key
	// ex. "name" : "yash"
	//       ^        ^
	//      key     value
	// and when we encounter that key we check for the
	// colon : and the value in this function itself
	case token.STRING:
		key := p.curToken.Literal
		// checking if the next token is a colon
		if p.peekToken.Type != token.COLON {
			fmt.Println("Error while parsing! Expected COLON got ", p.peekToken)
			return
		}
		p.nextToken()                       // now we have the colon as curToken
		p.nextToken()                       // now the value must be curToken
		(*values)[key] = p.curToken.Literal // adding the key-value pair
		// we check if the next token is }, which means we have reached the end
		// of the json structure, so our function should end;
		// if it is not } we continue the recursive parseKeyVal calls to read more values
		if p.peekToken.Type != token.RBRACE {
			p.nextToken()
			p.parseKeyVal(values)
		}
	}
}
main.go -

package main

import (
	"fmt"
	"jping/lexer"
	"jping/parser"
)

type lak struct {
	Name string
	Age  int
}

func main() {
	// we are using strings for the int and bool values because we haven't implemented them yet
	data := `{
		"Name":"cats are cute",
		"Age": "123423",
		"IsAlive": "true"
	}
	`
	lex := lexer.New(data)
	par := parser.New(lex)
	vals := par.ParseJson()
	for k, v := range vals {
		fmt.Println(k, " : ", v)
	}
}
And with this, our basic parser is ready. It should give output something like this -
Name : cats are cute
Age : 123423
IsAlive : true
Understanding this on a first read may be hard, so try tracing the stream of tokens through parseKeyVal by hand.
Now let's add the other data types, that is, integers and booleans.
In parser/parser.go -
we will modify the string token case (remember to add "strconv" to the imports) -
case token.STRING:
	key := p.curToken.Literal
	if p.peekToken.Type != token.COLON {
		fmt.Println("Error while parsing! Expected COLON got ", p.peekToken)
		return
	}
	p.nextToken() // now we have the colon as curToken
	p.nextToken() // now the value must be curToken
	// Modified part ->
	if p.curToken.Type == token.BOOL {
		if p.curToken.Literal == "true" {
			(*values)[key] = true
		} else {
			(*values)[key] = false
		}
	} else if p.curToken.Type == token.INT {
		i, e := strconv.Atoi(p.curToken.Literal)
		if e == nil {
			(*values)[key] = i
		}
	} else {
		(*values)[key] = p.curToken.Literal
	}
	if p.peekToken.Type != token.RBRACE {
		p.nextToken()
		p.parseKeyVal(values)
	}
}
Now we have support for int, string, and boolean. But what about arrays and structs, I hear you say! Well, let's start with arrays.
lexer/lexer.go -

func (l *Lexer) NextToken() token.Token {
	// ... rest of the code
	switch l.ch {
	case '[':
		tok = token.Token{Type: token.LBRACK, Literal: string(l.ch)}
	case ']':
		tok = token.Token{Type: token.RBRACK, Literal: string(l.ch)}
	// ... rest of the code
	}
	// ... rest of the code
}
parser/parser.go -

func (p *Parser) parseKeyVal(values *map[string]interface{}) {
	switch p.curToken.Type {
	// ... rest of the code
	case token.STRING:
		// ... rest of the code
		} else if p.curToken.Type == token.LBRACK {
			// when we encounter [ that means the value is an array
			p.nextToken() // skipping the left bracket [
			// checking what type of data we have in the array
			// and creating an array of that type
			if p.curToken.Type == token.INT {
				var arr []int
				// we loop through all values until we encounter ], i.e. the end of the array
				for p.curToken.Type != token.RBRACK {
					// converting the string literal to int and storing it in arr
					i, e := strconv.Atoi(p.curToken.Literal)
					if e == nil {
						arr = append(arr, i)
					}
					p.nextToken()
					// here we check if there's a comma;
					// if there is, there are more values to collect,
					// so we check that the next value has the correct type
					// and then skip the comma to get the next value
					if p.curToken.Type == token.COMMA {
						if p.peekToken.Type != token.INT {
							fmt.Println("error while parsing array. Expected Integer token got ", p.peekToken)
							return
						}
						p.nextToken()
					}
				}
				(*values)[key] = arr
			} else if p.curToken.Type == token.BOOL {
				// similarly we do for boolean and string
				var arr []bool
				for p.curToken.Type != token.RBRACK {
					if p.curToken.Literal == "true" {
						arr = append(arr, true)
					} else {
						arr = append(arr, false)
					}
					p.nextToken()
					if p.curToken.Type == token.COMMA {
						if p.peekToken.Type != token.BOOL {
							fmt.Println("error while parsing array. Expected Boolean token got ", p.peekToken)
							return
						}
						p.nextToken()
					}
				}
				(*values)[key] = arr
			} else if p.curToken.Type == token.STRING {
				var arr []string
				for p.curToken.Type != token.RBRACK {
					arr = append(arr, p.curToken.Literal)
					p.nextToken()
					if p.curToken.Type == token.COMMA {
						if p.peekToken.Type != token.STRING {
							fmt.Println("error while parsing array. Expected String token got ", p.peekToken)
							return
						}
						p.nextToken()
					}
				}
				(*values)[key] = arr
			}
		} else {
			(*values)[key] = p.curToken.Literal
		}
		// ... rest of the code
	}
}
main.go -

func main() {
	data := `{
		"Name":"cats are cute",
		"Age": 123423,
		"IsAlive": true,
		"ScoreInt" : [1,2,3],
		"ScoreBool" : [false, false, true],
		"Scorestr" : ["ma", "sco", "sec"]
	}
	`
	lex := lexer.New(data)
	par := parser.New(lex)
	vals := par.ParseJson()
	for k, v := range vals {
		fmt.Printf("%v : %v\n", k, v)
	}
}
output -
Name : cats are cute
Age : 123423
IsAlive : true
ScoreInt : [1 2 3]
ScoreBool : [false false true]
Scorestr : [ma sco sec]
This is getting too convoluted, so let's divide it into smaller functions -
func (p *Parser) parseKeyVal(values *map[string]interface{}) {
	switch p.curToken.Type {
	// ... rest of the code
	case token.STRING:
		// ... rest of the code
		} else if p.curToken.Type == token.LBRACK {
			p.nextToken() // skipping the left bracket [
			if p.curToken.Type == token.INT {
				(*values)[key] = p.createIntArray()
			} else if p.curToken.Type == token.BOOL {
				(*values)[key] = p.createBoolArray()
			} else if p.curToken.Type == token.STRING {
				(*values)[key] = p.createStrArray()
			}
		} else {
			(*values)[key] = p.curToken.Literal
		}
		// ... rest of the code
	}
}

func (p *Parser) createIntArray() []int {
	var arr []int
	for p.curToken.Type != token.RBRACK {
		i, e := strconv.Atoi(p.curToken.Literal)
		if e == nil {
			arr = append(arr, i)
		}
		p.nextToken()
		if p.curToken.Type == token.COMMA {
			if p.peekToken.Type != token.INT {
				fmt.Println("error while parsing array. Expected Integer token got ", p.peekToken)
				return nil
			}
			p.nextToken()
		}
	}
	return arr
}

func (p *Parser) createBoolArray() []bool {
	var arr []bool
	for p.curToken.Type != token.RBRACK {
		if p.curToken.Literal == "true" {
			arr = append(arr, true)
		} else {
			arr = append(arr, false)
		}
		p.nextToken()
		if p.curToken.Type == token.COMMA {
			if p.peekToken.Type != token.BOOL {
				fmt.Println("error while parsing array. Expected Boolean token got ", p.peekToken)
				return nil
			}
			p.nextToken()
		}
	}
	return arr
}

func (p *Parser) createStrArray() []string {
	var arr []string
	for p.curToken.Type != token.RBRACK {
		arr = append(arr, p.curToken.Literal)
		p.nextToken()
		if p.curToken.Type == token.COMMA {
			if p.peekToken.Type != token.STRING {
				fmt.Println("error while parsing array. Expected String token got ", p.peekToken)
				return nil
			}
			p.nextToken()
		}
	}
	return arr
}
Much better !!
An array could also have arrays in it, e.g. [[1,2], [3,4]] or [ [[1,2,3], [1]], [[23]] ]. From the example we can see a repeating pattern: arrays inside arrays, and so on, until we reach a base type like bool, int, or string. As you might have guessed, we are going to solve this problem using recursion.
func (p *Parser) parseKeyVal(values *map[string]interface{}) {
	switch p.curToken.Type {
	// ... rest of the code
	case token.STRING:
		// ... rest of the code
		} else if p.curToken.Type == token.LBRACK {
			p.nextToken() // skipping the left bracket [
			// what we have changed here from above is that we have put the
			// if-else statement in a function, because we are going
			// to use this function recursively
			var getArray func() interface{}
			getArray = func() interface{} {
				if p.curToken.Type == token.INT {
					return p.createIntArray()
				} else if p.curToken.Type == token.BOOL {
					return p.createBoolArray()
				} else if p.curToken.Type == token.STRING {
					return p.createStrArray()
				} else if p.curToken.Type == token.LBRACK {
					// this is the new part we have added:
					// encountering a [ here means there is a nested array;
					// to handle it we create a new slice and read recursively
					p.nextToken()
					var ark []interface{}
					arr := getArray()
					p.nextToken()          // skipping the ] bracket
					ark = append(ark, arr) // we recursively read the first array
					// this loops until there are no values left in the array
					for p.curToken.Type == token.COMMA {
						p.nextToken()
						if p.curToken.Type == token.LBRACK {
							p.nextToken()
						} else {
							fmt.Println("Expected [ got ", p.curToken.Literal)
						}
						// we recursively get the next array
						arr := getArray()
						p.nextToken()
						ark = append(ark, arr)
					}
					return ark
				} else if p.curToken.Type == token.RBRACK {
					p.nextToken()
				}
				return nil
			}
			(*values)[key] = getArray()
		} else {
			(*values)[key] = p.curToken.Literal
		}
		// ... rest of the code
	}
}
Now the only data type left to implement is the struct.
func (p *Parser) parseKeyVal(values *map[string]interface{}) {
	switch p.curToken.Type {
	// ... rest of the code
	case token.STRING:
		// ... rest of the code
		} else if p.curToken.Type == token.LBRACE {
			// a { represents that there's a nested struct,
			// so we simply make a map to store the values of that nested struct
			// and recursively pass it to the parseKeyVal function
			nestedValues := make(map[string]interface{})
			p.nextToken()
			p.parseKeyVal(&nestedValues)
			(*values)[key] = nestedValues
		} else {
			(*values)[key] = p.curToken.Literal
		}
		// ... rest of the code
	}
}
main.go -

func main() {
	data := `{
		"Name":"cats are cute",
		"Age": 123423,
		"IsAlive": true,
		"ScoreInt" : [1,2,3],
		"ScoreBool" : [false, false, true],
		"Scorelist" : [[1,2,3], [23,3]],
		"Scorestr" : ["ma", "sco", "sec"],
		"address" : {
			"city" : "Mumbai",
			"state" : "Maharashtra"
		}
	}
	`
	lex := lexer.New(data)
	par := parser.New(lex)
	vals := par.ParseJson()
	for k, v := range vals {
		fmt.Printf("%v : %v\n", k, v)
	}
}
output -
Age : 123423
IsAlive : true
ScoreInt : [1 2 3]
ScoreBool : [false false true]
Scorelist : [[1 2 3] [23 3]]
Scorestr : [ma sco sec]
address : map[city:Mumbai state:Maharashtra]
Name : cats are cute
And we are done !!
So you are here at last. All this work you did, and you made your own JSON parser! Pat yourself on the back for getting this far. And watch this video from Prime as a treat; you will surely find it fun after all that work: Is JSON Blazingly Fast or...?
Get complete code at - github.com/Exar04/jping
Hope you had fun making this parser.