Parsing Text (HTML Parsing)

by Jerry Muelver


Parsing Overview


This article gives the rationale and procedure for parsing HTML to extract references to images.

Suppose you've got a nice text string, and you know there's something (or several somethings) in the string that you want to extract and process. If you were doing it by visual inspection of printed text, you would scan or read through the text to find the pattern (symbols or letters or words) that marks or identifies the target. When the text is in a computer file or string, the process of scanning the text is called parsing, which means "to take apart and analyze according to structure".

For instance, if you want to find a URL buried in HTML source, you'd look for "<a href=..." in the text. For an image, you'd look for "<img src=..." in the text. Your brain happily adjusts for allowable variations on the pattern, like "<a Href", "<A HREF", "<IMG src", "<img SRC", and the like. Then you grab the next chunk of text after the "=", jot it down, and you've got the information you wanted.

The problem is how to give the computer instructions that allow it to perform a task you could do personally without even thinking. In other words, how can we convert thoughtless action into computer code? While thoughtless action can easily be recorded and displayed (just see the evening news on TV for the latest examples from Washington, D.C.), computer code requires serious effort.

URL Patterns for Images


The pattern for HTML image tags is "<img src=...", followed by either the full URL of the image file (if it is located at a different address than the calling page) or just the file name itself, possibly with a directory or small directory tree (pathname) prepended. The name may be in upper case, lower case, or a mixture of both, which may or may not be significant, depending on the server.

Now that we're clear on the problem, let's look at how to arrive at a solution.

Method of Attack


Since the source could be in any mix of upper and lower case letters, we'll convert everything to uppercase to simplify the parsing. But since case is significant when retrieving files from some servers, we have to capture filenames in their original form. To have this cake and eat it, too, we need to work on an upper-case copy of the source to parse for links, and go to the mixed-case original to extract the URLs.

HTML tags all start with a left-angle bracket, "<", followed immediately by an identifying name of one or more letters or symbols (possibly with a leading forward slash "/"), followed by either a closing right-angle bracket or a space and a list of parameters. To find a tag, search for a "<". To extract the tag from the source, search for a closing ">" and grab everything from the "<" to the next available ">".

The tag for images is "<IMG". In the tag, the filename of the image follows the parameter name "SRC=", going up to the tag's closing ">". The name might be in quotes, or it might not. If the name begins with "HTTP:", the file is external and we need the whole thing from "HTTP:" to the end. Otherwise, to be able to retrieve the file directly on a call to its own URL, we need to add the "HTTP:" part of the URL to the name ourselves, cleverly cutting it off the source file's URL for our own use.

Some tags include other information. An image tag might include "border=0". Here's an example: <IMG SRC="myimage.gif" border=0>. We can remove any extra information by using INSTR to check the extracted tag for blank spaces, and truncate the tag just to the left of the first blank space by using LEFT$.

The procedure, in pseudocode comments, is:


  1. read the source into a string html$
  2. copy the file into string par$ for parsing
  3. convert par$ to uppercase
  4. use INSTR to find position tagpos of "<IMG SRC="
  5. use INSTR to find position endpos of the next ">"
  6. copy string tag$ from html$ using tagpos+9 and endpos
  7. truncate tag$ to the left of the first blank space, if one exists
  8. cut off enclosing quotes, if needed
  9. check for "HTTP:"
  10. if "HTTP:" is missing, add urlPath to the front of tag$
  11. add tag$ to a list of found links

DEMO


Suppose we've got an HTML file downloaded and saved as "file.html". Suppose further that we know the file came from "http://www.msn.com", and that we'd like to get a list of graphics referenced on that page. This is one way to do it:

' HTML parsing demo
' setup
DIM filelist$(300)
count = 1
'replace with the URL for the webpage you are using for the demo
myURL$ = "http://www.msn.com/"
 
' read the source into a string html$
open "file.html" for input as #f
html$ = input$(#f, LOF(#f))
close #f
 
' copy the file into string par$ for parsing
' convert par$ to uppercase
par$ = UPPER$(html$)
 
' use INSTR to find position tagpos of "<IMG SRC="
' use INSTR to find position endpos of next ">"
' copy string tag$ from html$ using tagpos+9 and endpos
' check for blank space and truncate with LEFT$, if needed
' cut off enclosing quotes, if needed
tagpos = INSTR(par$,"<IMG SRC=")
while tagpos > 0
   endpos = INSTR(par$,">",tagpos)
   tag$ = MID$(html$,tagpos+9,endpos-tagpos-9)
   blank = INSTR(tag$," ")
   if blank > 0 then tag$ = left$(tag$,blank-1)
 
   ' cut off enclosing quotes, if needed
   if LEFT$(tag$,1) = chr$(34) then
      tag$ = MID$(tag$,2)
   end if
   if RIGHT$(tag$,1) = chr$(34) then
      tag$ = LEFT$(tag$,LEN(tag$)-1)
   end if
 
   ' check for "HTTP:"
   ' add urlPath to front of tagstring to make imgUrl
   if LEFT$(UPPER$(tag$),5) <> "HTTP:" then
      tag$ = myURL$ + tag$
   end if
 
   ' add tag$ to a list of found links
   filelist$(count) = tag$
   count = count + 1
   tagpos = INSTR(par$,"<IMG SRC=",endpos)
wend
 
for x = 1 to count - 1
   print filelist$(x)
next

To test it out, save a web page from your browser with File > Save As > Web Page, rename the saved file to file.html, and run the demo.

You can also download a file in your program, using the method in Downloading A File.

Where to Go from Here


The core notion in parsing text is to use the syntactical patterns in the text by nailing a starting pattern, marking up to an ending pattern, and grabbing everything between the two. INSTR is your friend, and MID$ is your workhorse. To expand on the idea, you could:

  • parse strings individually, as they are read in from a file, instead of parsing the whole file at once
  • rig the parser to grab text-only from the page (hint: start$ = ">", end$ = "<")
  • build a generic extraction function, so you could parse for different tags:
    • tag$ = extract$(par$,"<IMG SRC=", ">")
    • tag$ = extract$(par$,"<A HREF=", ">")
    • tag$ = extract$(par$,"<script","</script>")

  • wrap your parser in a simple GUI to allow the user to select the extraction parameters on-the-fly
  • add tag-conversion rules, to change formatting or contents (a process called filtering)
  • write a simple tagged-text format database program that grabs the whole database file and shows single records on demand
  • write an XML parser
  • write an interpreted language of your own
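
As a starting point for the generic extract$ function suggested above, here is one possible sketch, in the same style as the demo. It is only a sketch: the case-insensitive matching and the decision to strip nothing (quotes and trailing parameters are left to the caller, as in the demo) are choices, not requirements.

' generic extraction: returns the text between startPat$ and endPat$,
' or an empty string if either pattern is missing.
' matching is case-insensitive; the returned text keeps its original case.
function extract$(src$, startPat$, endPat$)
    extract$ = ""
    p1 = instr(upper$(src$), upper$(startPat$))
    if p1 = 0 then exit function
    p1 = p1 + len(startPat$)
    p2 = instr(upper$(src$), upper$(endPat$), p1)
    if p2 = 0 then exit function
    extract$ = mid$(src$, p1, p2 - p1)
end function

Called as tag$ = extract$(html$, "<IMG SRC=", ">"), it hands back everything between the two patterns; quote-stripping and truncating at the first blank space would still be handled by the caller, just as in the demo loop.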