Tonight I've released another alpha version of Cognition, my semantic web parser. Changelog includes:

  • Microformats:
    • Add option (disabled by default) to require <head profile> for microformat support. Microformat profiles are treated as opaque strings! Supports the following profiles:
      • http://purl.org/uF/2008/03/
      • http://www.w3.org/2006/03/hcard or http://purl.org/uF/hCard/1.0/
      • http://dannyayers.com/microformats/hcalendar-profile or http://purl.org/uF/hCalendar/1.0/
      • http://purl.org/uF/hAtom/0.1/
      • http://purl.org/uF/rel-tag/1.0/
      • http://purl.org/uF/rel-license/1.0/
      • No profiles required for rel-enclosure, adr or geo (yet).
    • Support for hAtom and WebSlices.
      • In addition to hAtom 0.1, rel-enclosure is supported within hEntries.
    • Improve include-pattern support to prevent some infinite loops.
  • GRDDL:
    • Add option (disabled by default) to require <head profile> for GRDDL.
    • Add option to check profile URLs for profileTransformation links.
  • Export:
    • Atom output. (Supports RDF/RSS and hAtom as input.)
    • iCalendar export option.
      • hCalendar 1.1 events.
      • hCalendar 1.1 todo items.
      • hCalendar 1.1 freebusy info.
      • hCalendar 1.1 alarms.
      • hAtom entries (as VJOURNAL).
      • W3C's iCal RDF vocab (but see note in Cognition/Export/Calendar.pm).
      • RSS Event Module.
  • Added a "--nofollow" option to prevent secondary fetching from particular hosts. (Secondary fetching = requesting <head profile>, <link rel="meta">, <link rel="transformation">.)
  • Support <rdf:RDF> elements found directly in (X)HTML.
  • Much improved HTML to text conversion. Should be able to handle nested elements like //ul/li/ol/li/dl/dd/blockquote/img[@alt]. Won't be completely foolproof, but should be an improvement over what was there before! Namely:
    • Word wrapping.
    • Line breaks added after block elements.
    • Quote marks around <q> elements.
    • Bullet points and numbers before <li> elements in unordered and ordered lists.
    • Brackets around superscript text, parentheses around subscripts.
    • Tab characters between table cells.
    • Usenet-style quoting for <blockquote>.
    • Alt text from <img> and <input type="image">, values from other <input> tags.
  • Fix so that the entire page is not given an rdf:type of ical:vcalendar unless it contains some bona fide vevent/vtodo/valarm/vfreebusy nodes.
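
To illustrate what "treated as opaque strings" means for the optional <head profile> check in the Microformats section, here is a minimal sketch in Python. It is not Cognition's code (Cognition is written in Perl), and the names PROFILES, NO_PROFILE_NEEDED and format_enabled are made up for this example. It also assumes that http://purl.org/uF/2008/03/ acts as a combined profile covering every format, which the list above suggests but does not state.

```python
# Hypothetical sketch of an opaque-string profile check; not Cognition's actual API.

# Assumption: the 2008/03 profile is a catch-all for every microformat.
COMBINED = "http://purl.org/uF/2008/03/"

# Profile URIs taken from the changelog, grouped by the format they enable.
PROFILES = {
    "hcard": {COMBINED,
              "http://www.w3.org/2006/03/hcard",
              "http://purl.org/uF/hCard/1.0/"},
    "hcalendar": {COMBINED,
                  "http://dannyayers.com/microformats/hcalendar-profile",
                  "http://purl.org/uF/hCalendar/1.0/"},
    "hatom": {COMBINED, "http://purl.org/uF/hAtom/0.1/"},
    "rel-tag": {COMBINED, "http://purl.org/uF/rel-tag/1.0/"},
    "rel-license": {COMBINED, "http://purl.org/uF/rel-license/1.0/"},
}

# Formats that never need a profile (yet).
NO_PROFILE_NEEDED = {"rel-enclosure", "adr", "geo"}


def format_enabled(fmt, head_profile, require_profile=False):
    """Return True if microformat `fmt` should be parsed on this page.

    `head_profile` is the raw value of <head profile="...">, or None.
    """
    if not require_profile or fmt in NO_PROFILE_NEEDED:
        return True
    # The attribute holds a whitespace-separated list of URIs.
    declared = set((head_profile or "").split())
    # Opaque comparison: exact string equality, no URL normalisation.
    return bool(declared & PROFILES.get(fmt, set()))
```

With the option switched on, a page declaring http://purl.org/uF/hCard/1.0 (no trailing slash) would not enable hCard in this sketch, because the string differs from the listed profile by one character.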