We explore HTML parsing methods, Go functions and methods, and lexical scope.

You can check out the GitHub source code for the solution here

Problem Statement

This project was built as part of Exercise 4: HTML Link Parser. You can read more about the problem statement here

Let’s break the problem down into steps

  1. How can I parse the HTML? Given a URL, how can I get the DOM tree for that page?
  2. Now that I have the DOM nodes, how can I extract the links (anchor tags) and store them? (Think about the structure or format in which the data can be stored)
  3. Another point to consider is that we need to keep track of nested links as well. How can I achieve that? (Think about an algorithm that could be helpful in this case)
  4. Finally, output the data. This step is the easiest!
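For step 2, one possible shape for the stored data is a small struct per link, collected in a slice. This is only a sketch: the Link type, its Href and Text fields, and the format helper are names I am assuming here for illustration, not something taken from the solution's source.

```go
package main

import "fmt"

// Link is a hypothetical record for one anchor tag:
// the href attribute plus the visible text inside the tag.
type Link struct {
	Href string
	Text string
}

// format renders a link as a single line, which covers step 4 (output).
func format(l Link) string {
	return fmt.Sprintf("%s -> %q", l.Href, l.Text)
}

func main() {
	// A slice keeps the links in the order they appear in the document.
	links := []Link{
		{Href: "/other-page", Text: "A link to another page"},
	}
	for _, l := range links {
		fmt.Println(format(l))
	}
}
```

A slice of structs keeps each href paired with its text, which makes the final output step a simple loop.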

Parsing HTML

Let’s start with a static HTML string for simplicity

doc := `<html>
<body>
  <h1>Hello!</h1>
  <a href="/other-page">A link to another page</a>
</body>
</html>`

Now we need to convert this HTML into a DOM tree. For this we use the package golang.org/x/net/html. In this package we focus on the function Parse(), which returns a *Node and an error. The returned Node is the root of the DOM tree (a document node that contains the html element). As per the documentation, each Node also gives us access to its children through fields such as FirstChild and NextSibling.

The package can be used this way. Parse requires an io.Reader as its input parameter

package main

import (
	"strings"

	"golang.org/x/net/html"
)

func main() {
	// doc is initialized here
	// ...

	// create a new io.Reader over the HTML string
	r := strings.NewReader(doc)
	// parse the HTML, which returns the root node and an error
	nodes, err := html.Parse(r)
	if err != nil {
		panic(err)
	}

	// ... (use nodes here)
}

  • The Attr field contains the HTML attributes, such as href and src
  • The Type of a node can be ElementNode (such as a, div, p), TextNode (the text inside element nodes) or CommentNode (the comments in your HTML)

Functions and Methods

Functions and methods may be used interchangeably in other languages, but in Go they have different meanings and use cases
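To make the difference concrete, here is a minimal sketch that computes the same thing once as a function and once as a method. The Rect type and both names are hypothetical and exist only for this illustration.

```go
package main

import "fmt"

// Rect is a hypothetical type used only for this illustration.
type Rect struct {
	Width, Height int
}

// area is a plain function: the value is passed as an ordinary argument.
func area(r Rect) int {
	return r.Width * r.Height
}

// Area is a method: it is declared with a receiver (r Rect)
// and is called on a value using dot syntax.
func (r Rect) Area() int {
	return r.Width * r.Height
}

func main() {
	r := Rect{Width: 3, Height: 4}
	fmt.Println(area(r))  // function call
	fmt.Println(r.Area()) // method call
}
```

The only syntactic difference is the receiver between func and the name, but it changes how the code is called and which type it belongs to.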

Definition

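  • Functions: standalone blocks of code, declared with the func keyword and called by name
  • Methods: functions declared with a receiver, so they belong to a type and are called on a value of that type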

Functions Cheatsheet

| Description | Syntax |
| --- | --- |
| Input parameters: the data that is passed as input to the function, written as a name followed by its type. When several consecutive parameters have the same type, they can share a single type declaration. | func name(abc string) |
| Return type: the type of data that is returned from the function. The return type is defined after the input parameters. Multiple return values are listed in parentheses, for example (string, int, <type>). | func name(abc string) string { ... } |
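The cheatsheet can be tried out with a short sketch like the one below. The function names fullName and divide are made up for this example; the first shows parameters that share a type, the second shows multiple return values.

```go
package main

import "fmt"

// fullName takes two parameters of the same type, so they share one
// type declaration, and returns a single string.
func fullName(first, last string) string {
	return first + " " + last
}

// divide returns two values: the quotient and the remainder.
func divide(a, b int) (int, int) {
	return a / b, a % b
}

func main() {
	fmt.Println(fullName("Ada", "Lovelace"))

	// Multiple return values are received with multiple variables.
	q, r := divide(7, 2)
	fmt.Println(q, r)
}
```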

Since I have not used methods much, I will not dive deep into them in this blog. Maybe a separate blog in the future

Various string methods

For this exercise, we need to use various string functions for comparison, checking prefixes and so on. Coming from a JavaScript background, I will try to list the equivalent names in both languages for reference

Package: strings

| JavaScript method | Go function |
| --- | --- |
| <originalString>.startsWith(<subString>) | strings.HasPrefix(<originalString>, <subString>) |
| <originalString>.endsWith(<subString>) | strings.HasSuffix(<originalString>, <subString>) |
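Here is a small sketch of prefix and suffix checks using the strings package. Note that in Go these are package-level functions that take the string as their first argument. The helpers isRelative and hasExtension are hypothetical wrappers I made up for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// isRelative reports whether href starts with "/" (hypothetical helper).
func isRelative(href string) bool {
	return strings.HasPrefix(href, "/")
}

// hasExtension reports whether href ends with ext (hypothetical helper).
func hasExtension(href, ext string) bool {
	return strings.HasSuffix(href, ext)
}

func main() {
	fmt.Println(isRelative("/other-page"))                 // true
	fmt.Println(isRelative("https://example.com"))         // false
	fmt.Println(hasExtension("/files/report.pdf", ".pdf")) // true
}
```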

String ‘ vs “”

Exported vs UnExported

Algorithms in Go

Coming back to the problem, we have the DOM tree, right? Now we need to get all the a links and all the text nested within those links. We want to ignore comments, the tags of nested element nodes and nested a tags. We just need the text within each anchor tag, joined together
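One way to approach this is a depth-first traversal of the tree. The sketch below is not the solution's actual code: to keep it runnable without external dependencies it uses a simplified stand-in Node type (with a Children slice and an Attr map) instead of html.Node from golang.org/x/net/html, and the names Link, text, links and sample are assumptions made for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// NodeType and Node are simplified stand-ins for the types in
// golang.org/x/net/html. The real html.Node links children through
// FirstChild/NextSibling and stores attributes in a slice.
type NodeType int

const (
	TextNode NodeType = iota
	ElementNode
	CommentNode
)

type Node struct {
	Type     NodeType
	Data     string            // tag name for elements, content for text and comments
	Attr     map[string]string // simplification of the real attribute slice
	Children []*Node           // simplification of FirstChild/NextSibling
}

// Link holds the href of an anchor tag and its joined text.
type Link struct {
	Href string
	Text string
}

// text collects the text under n depth-first. Comments and nested
// a tags are skipped; other nested elements contribute their text.
func text(n *Node) string {
	var parts []string
	for _, c := range n.Children {
		switch {
		case c.Type == TextNode:
			parts = append(parts, c.Data)
		case c.Type == ElementNode && c.Data != "a":
			parts = append(parts, text(c))
		}
	}
	// Collapse runs of whitespace into single spaces.
	return strings.Join(strings.Fields(strings.Join(parts, " ")), " ")
}

// links walks the tree depth-first and returns one Link per a tag.
// It does not descend into an a tag once it has been recorded.
func links(n *Node) []Link {
	if n.Type == ElementNode && n.Data == "a" {
		return []Link{{Href: n.Attr["href"], Text: text(n)}}
	}
	var out []Link
	for _, c := range n.Children {
		out = append(out, links(c)...)
	}
	return out
}

// sample builds, by hand, the tree for:
// <body><a href="/dog">Dog <strong>photos</strong><!-- note --></a></body>
func sample() *Node {
	strong := &Node{Type: ElementNode, Data: "strong", Children: []*Node{
		{Type: TextNode, Data: "photos"},
	}}
	a := &Node{Type: ElementNode, Data: "a", Attr: map[string]string{"href": "/dog"}, Children: []*Node{
		{Type: TextNode, Data: "Dog "},
		strong,
		{Type: CommentNode, Data: " note "},
	}}
	return &Node{Type: ElementNode, Data: "body", Children: []*Node{a}}
}

func main() {
	for _, l := range links(sample()) {
		fmt.Printf("%+v\n", l)
	}
}
```

With the real package, the same idea should carry over by looping from n.FirstChild through c.NextSibling and scanning n.Attr for the href key.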

Constants in Go

Just like const in JavaScript, Go has the const keyword. Once constants are defined, they cannot be modified
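A minimal sketch of how constants might look in this project. The names anchorTag, hrefAttr, maxDepth and isAnchor are hypothetical and chosen only for this example.

```go
package main

import "fmt"

// anchorTag is a hypothetical constant for the tag name we look for.
const anchorTag = "a"

// Related constants can be grouped in a single block.
const (
	hrefAttr = "href"
	maxDepth = 10
)

// isAnchor compares a tag name against the constant.
func isAnchor(tag string) bool {
	return tag == anchorTag
}

func main() {
	fmt.Println(anchorTag, hrefAttr, maxDepth)
	fmt.Println(isAnchor("a"), isAnchor("div"))

	// anchorTag = "div" // does not compile: constants cannot be reassigned
}
```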

Read more about it here