Class HTMLTokenizer
In: lib/openid/yadis/htmltokenizer.rb
Parent: Object

A class to tokenize HTML.


  page = "<HTML>
  <TITLE>This is the title</TITLE>
   <!-- Here comes the <a href=\"\">blah</a>
   comment body -->
     <H1>This is the header</H1>
       This is the paragraph, it contains
       <a href=\"link.html\">links</a>,
       <img src=\"blah.gif\" optional alt='images
       really cool'>.  Ok, here is some more text and
       <A href=\"\" target=\"_blank\">another link</A>.
   "
   toke = HTMLTokenizer.new(page)

   assert("<h1>" == toke.getTag("h1", "h2", "h3").to_s.downcase)
   assert("<a href=\"link.html\">" == toke.getTag("IMG", "A").to_s)
   assert("links" == toke.getTrimmedText)
   assert(toke.getTag("IMG", "A").attr_hash['optional'])
   assert("_blank" == toke.getTag("IMG", "A").attr_hash['target'])



Attributes

page  [R]

Public Class methods

Create a new tokenizer for the given content, which is treated as a string.

Public Instance methods

Get the next token and advance past it. Returns a token object.

Get a tag from the specified set of desired tags. For example:

  foo = toke.getTag("h1", "h2", "h3")

will return the next header tag encountered.
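
The lookup described above can be approximated in a few lines of plain Ruby. This is only a sketch, not the library's implementation: find_first_tag is a hypothetical helper, its regular expression is a simplification that ignores comments and ">" characters inside quoted attribute values, and the case-insensitive matching of tag names is inferred from the example at the top of this page.

```ruby
# Hypothetical helper (not part of HTMLTokenizer): return the first opening
# tag in +html+ whose name matches one of +names+, ignoring case.
def find_first_tag(html, *names)
  wanted = names.map(&:downcase)
  # Simplified tag pattern: "<", a tag name, then anything up to the next ">".
  html.scan(/<([a-zA-Z][a-zA-Z0-9]*)[^>]*>/m) do
    name = Regexp.last_match(1)
    return Regexp.last_match(0) if wanted.include?(name.downcase)
  end
  nil # no matching tag found
end

html = "<TITLE>t</TITLE><H1>Header</H1><a href=\"link.html\">links</a>"
first_header = find_first_tag(html, "h1", "h2", "h3")
first_link   = find_first_tag(html, "IMG", "A")
```

Unlike this stateless helper, the tokenizer keeps track of its current position, so repeated calls to getTag walk forward through the page.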

Get all the text between the current position and the next tag or, if one is specified, a specific later tag.

Like getText, but normalizes whitespace: leading and trailing whitespace is removed, and multiple spaces are squeezed into a single space.
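
The whitespace handling described above can be sketched in plain Ruby. The helper name squeeze_whitespace is hypothetical, and treating any run of whitespace (including newlines and tabs) as a single space is an assumption; the library's own code may differ in detail.

```ruby
# Hypothetical helper: strip leading and trailing whitespace, then collapse
# every internal run of whitespace into one space.
def squeeze_whitespace(text)
  text.strip.gsub(/\s+/, " ")
end

raw = "  images\n       really   cool \t"
trimmed = squeeze_whitespace(raw)
```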

Look at the next token, but don't actually grab it.

Reset the parser, setting the current position back to the start.
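
How fetching a token, peeking at one, and resetting relate to each other can be illustrated with a toy class. MiniTokenizer and its method names are hypothetical stand-ins written for this page, not the HTMLTokenizer source, and the splitting rule is deliberately naive (no comment or attribute handling).

```ruby
# Toy stand-in for a tokenizer (hypothetical; not the HTMLTokenizer source).
class MiniTokenizer
  def initialize(content)
    # Split into tags ("<...>") and the runs of text between them.
    @tokens = content.to_s.scan(/<[^>]*>|[^<]+/)
    @pos = 0
  end

  # Return the next token and advance; returns nil at the end of input.
  def next_token
    token = @tokens[@pos]
    @pos += 1 if token
    token
  end

  # Return the next token without advancing.
  def peek_token
    @tokens[@pos]
  end

  # Move the current position back to the first token.
  def reset
    @pos = 0
  end
end

toke = MiniTokenizer.new("<H1>Header</H1>text")
peeked      = toke.peek_token  # does not consume the token
first       = toke.next_token  # consumes the token that was peeked
second      = toke.next_token
toke.reset
after_reset = toke.next_token  # same token as +first+
```

Peeking leaves the position untouched, so the token it returns is also the one returned by the following fetch; after a reset, the sequence starts over from the first token.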