We explore HTML parsing methods, Go functions and methods, and lexical scope
- The idea is to build a package that extracts all links from the provided HTML page
You can check out the GitHub source code for the solution here
Problem Statement
This project was built as part of Exercise 4: HTML Link Parser. You can read more about the problem statement here
Let’s break down the problem into steps
- How can I parse the HTML? I have a URL; how can I get the DOM tree from it?
- Now that I have the DOM nodes, how can I extract the links (anchor tags) and store them? (Think of the structure/format in which the data can be stored)
- We also need to keep track of nested links. How can I achieve that? (Think of an algorithm that could be helpful in this case)
- Finally, output the data. This step is the easiest!!
Parsing HTML
Let's start with static HTML as a string, for simplicity
doc := `<html>
<body>
<h1>Hello!</h1>
<a href="/other-page">A link to another page</a>
</body>
</html>`
Now we need to convert this HTML into a DOM tree. For this we use the package html (golang.org/x/net/html), and in particular its function Parse(), which returns a *Node and an error. The returned Node is the root of the DOM tree: a DocumentNode whose child is the html ElementNode. As per the documentation, each Node also links to its children (FirstChild, LastChild) and its siblings (NextSibling, PrevSibling).
The package can be used this way. Parse takes an io.Reader as its input parameter
package main

import (
	"strings"

	"golang.org/x/net/html"
)

func main() {
	// doc is initialized here
	//....
	// create a new io.Reader over the HTML string
	r := strings.NewReader(doc)
	// parse the HTML, which returns the root node and an error
	nodes, err := html.Parse(r)
	if err != nil {
		panic(err)
	}
	//... (use nodes here)
}
- The Attr field contains the HTML attributes such as href, src, etc.
- The Type of the node can be ElementNode (like a, div, p, etc.), TextNode (which contains the text within element nodes) or CommentNode (which contains the comments in your HTML)
Functions and Methods
Functions and methods may be used interchangeably in other languages, but in Go they have different meanings and use cases
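As a rough illustration of the difference, the sketch below defines the same behaviour once as a function and once as a method. The Link type and both names are hypothetical, chosen only for this example.

```go
package main

import "fmt"

// Link is a hypothetical type used only for this illustration.
type Link struct {
	Href string
	Text string
}

// describe is a function: it stands alone and receives the Link as a parameter.
func describe(l Link) string {
	return l.Text + " -> " + l.Href
}

// Describe is a method: it is declared with a receiver (l Link)
// and is called on a value of that type.
func (l Link) Describe() string {
	return l.Text + " -> " + l.Href
}

func main() {
	l := Link{Href: "/other-page", Text: "A link to another page"}
	fmt.Println(describe(l))  // function call
	fmt.Println(l.Describe()) // method call
}
```

Both calls print the same line; the only difference is where the Link value appears in the call.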
Definition
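- Functions: declared with func and a name, and called on their own
- Methods: functions declared with a receiver, so they are called on a value of that type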
Functions Cheatsheet
Description | Syntax |
---|---|
Input Parameters - The data that is passed as input to the function. | func name(abc string) |
Return Type - The type of data that is returned from the function. The return type is defined after the input parameters. Multiple return values are written in parentheses: (string, int, <type>) | func name(abc string) string { ... } |
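The two rows of the cheatsheet can be sketched as follows. The function names shout and firstAndCount are made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// shout has one input parameter (s string) and a single return type (string).
func shout(s string) string {
	return strings.ToUpper(s) + "!"
}

// firstAndCount returns multiple values, written in parentheses: (string, int).
func firstAndCount(words []string) (string, int) {
	if len(words) == 0 {
		return "", 0
	}
	return words[0], len(words)
}

func main() {
	fmt.Println(shout("hello"))

	// Multiple return values are received with a comma-separated assignment.
	first, n := firstAndCount([]string{"a", "div", "p"})
	fmt.Println(first, n)
}
```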
Not having used methods much, I will not dive deep into them in this blog. Maybe a separate blog in the future
Various string methods
For this exercise, we need to use various string functions for comparison, checking prefixes and so on. Coming from a JavaScript background, I will try to list the equivalent names in both languages for reference
Package : strings
JavaScript method | Go Function |
---|---|
<originalString>.startsWith(<subString>) | strings.HasPrefix(<originalString>, <subString>) |
<originalString>.endsWith(<subString>) | strings.HasSuffix(<originalString>, <subString>) |
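In Go these are package-level functions in strings rather than methods on the string, so the string is passed as the first argument. A small sketch, where isInternal and isHTMLPage are hypothetical helpers and not part of the exercise:

```go
package main

import (
	"fmt"
	"strings"
)

// isInternal is a hypothetical helper: it treats hrefs that start with "/" as same-site links.
func isInternal(href string) bool {
	return strings.HasPrefix(href, "/")
}

// isHTMLPage is a hypothetical helper: it checks for a ".html" ending.
func isHTMLPage(href string) bool {
	return strings.HasSuffix(href, ".html")
}

func main() {
	fmt.Println(isInternal("/other-page"))         // true
	fmt.Println(isInternal("https://example.com")) // false
	fmt.Println(isHTMLPage("/about.html"))         // true
}
```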
String ‘ vs “”
Exported vs UnExported
Algorithms in Go
Coming back to the problem: we have the DOM tree, right? Now we need to get all the a links and the nested text within those links. We want to ignore commented code, the nested ElementNodes themselves and nested a tags. We just need the text within those anchor tags, joined together
Constants in Go
Just like const in JavaScript, Go has the const keyword. Once constants are defined, they cannot be modified
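For this exercise, constants are a natural fit for fixed strings such as the tag and attribute names. A minimal sketch, with hypothetical names anchorTag, hrefKey and isAnchor:

```go
package main

import "fmt"

// anchorTag and hrefKey are hypothetical constant names chosen for this example.
const (
	anchorTag = "a"
	hrefKey   = "href"
)

// isAnchor reports whether a tag name is the anchor tag.
func isAnchor(tag string) bool {
	return tag == anchorTag
}

func main() {
	fmt.Println(isAnchor("a"), isAnchor("div"), hrefKey)
	// anchorTag = "div" // would not compile: constants cannot be reassigned
}
```

One difference from JavaScript worth noting: a Go constant must be a value known at compile time (a number, string, boolean or rune), so it cannot hold a slice, a map or the result of a regular function call.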
Read more about it here