Tuesday, November 16, 2004

Short Tutorial on XPath

This article aims at a brief overview of XPath, which stands for XML Path Language. I am presenting this in a FAQ format - I always feel that FAQs sink in much more easily than an essay, and FAQs help folks get started on any technology right away.

1. What is XPath? Why is it required?

XPath is a language devised for addressing [locating/referring to] portions of XML Documents. XML Documents can get really big and complex - XPath makes the job of the User a whole lot easier by providing rich semantics and syntactical constructs to identify and pick nodes within a document of arbitrary complexity.

2. Give me a few examples of simple XPaths.

XPath uses a compact, non-XML, URI-like syntax, so you can address nodes in little space even when the schema is huge.

For instance, consider the following XML Document.

As I mentioned earlier, you can point to any node in this XML Document through an XPath.
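
As a minimal sketch, assume a small, made-up bookstore document (a hypothetical stand-in, not the document referred to above) and run a few simple XPaths against it with Python's standard xml.etree.ElementTree module. ElementTree supports only a limited subset of XPath and evaluates paths relative to an element, so the full XPath /bookstore/book/title is written here as ./book/title from the root element. All element names, attribute names and values are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented purely for illustration.
XML = """
<bookstore>
  <book category="web">
    <title>Intro to XML</title>
    <author>A. Writer</author>
    <price>30.00</price>
  </book>
  <book category="food">
    <title>Cooking Basics</title>
    <author>B. Cook</author>
    <price>12.50</price>
  </book>
</bookstore>
"""

root = ET.fromstring(XML)  # root is the <bookstore> element

# ./book/title : every <title> child of a <book> child of the root
# (the full XPath form would be /bookstore/book/title)
titles = [t.text for t in root.findall("./book/title")]

# .//author : every <author> anywhere below the root
authors = [a.text for a in root.findall(".//author")]

# ./book[@category='web']/title : filter books with an attribute predicate
web_titles = [t.text for t in root.findall("./book[@category='web']/title")]

# ./book[2]/price : positional predicate (positions are 1-based)
second_price = root.find("./book[2]/price").text

print(titles)
print(authors)
print(web_titles)
print(second_price)
```

Each path reads like a file-system path: steps separated by slashes walk down the tree, and the bracketed predicates narrow the selection.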


Into the Groove

Today, I'd definitely try to get my first Groovy Program up and Running.....
list = [1, 2, 'hello', new java.util.Date()]
assert list.size() == 4
assert list.get(2) == 'hello'

Wednesday, November 10, 2004

Semantic Web - Answers from My Contemporaries

1) Is all the Excitement Real? - Basically, I don't see any "excitement" about the semantic web outside the W3C and a few academic and government communities. Many of the academics seem to be cynical, actually - subtly repackaging their research as a "Semantic Web" effort to get funding. There is a lot of *concern* about the problems that the semantic web addresses (especially in the US government) and *interest* in this approach to solving them, but not a self-amplifying boom like we saw with the Web a decade ago.
2) Semantic Web has few Closed Issues.
3) XML's value proposition for the semantic web is pretty much what it is everywhere else -- its sheer popularity outweighs its numerous flaws.
4) I suspect that something even smaller than OWL-Lite will cover most of the functionality that real people put in real ontologies for the foreseeable future.
5) The problem IMHO is that ontology building is an extremely difficult activity; think of how long it took the medical community to come up with their formal vocabulary *and* the biological theories on which it is based.
6) Adoption Curve -

First, Ontologies are unlikely to get much traction, IMHO, as long as they are called "ontologies", and one is forced to be conversant with formal semantics, formal logics, etc. in order to use them in an enterprise IT project. Somehow this approach has to be repackaged in a way that rests cleanly on the foundation of semantic web theory/technology, but exposes only those concepts (and terminology) that are accessible to ordinary mortals.

Second, somehow the process of building usefully large but consistent enterprise ontologies must be made more feasible. I'm not sure how possible this is in principle (Gödel had something to say along these lines?) but presumably most enterprises have enough structured information in their glossaries, data dictionaries, business processes, and existing IT operations that can be captured and usefully reasoned about .... given time. Technology can only automate the tedium of humans doing, not take garbage in and spit consistent ontologies out. Will technologies that effectively support what humans need to do to make this happen come onto the market? I see some hopeful signs, but I don't think Protege even comes close to being useful to the kinds of people who will have to do this in the mainstream world.

Third, if ontology-building is a top-down approach to supporting semantics, there's a question of whether the bottom-up approach of making sense of things by induction will actually work better. (The eternal induction vs deduction debate ...). See, for example http://www.eetimes.com/article/showArticle.jhtml?articleID=51201131 'Sony Computer Science Laboratory is positioning its "emergent semantics" as a self-organizing alternative to the W3C's Semantic Web'. The bottom-up approaches (e.g. the Google approach to web search, the SpamBayes approach to spam filtering) seem to be awfully good at hitting 80:20 points in the real world while the top-down approaches are still research projects. I am personally convinced that the bottom-up approach will continue to rule in massively unstructured domains such as the web and email; I'm not so sure that the top-down ontology building might not be more efficient in situations where there is a lot of semantic structure (e.g. enterprise IT shops) but it just has to be captured and exploited. Nevertheless, both seem quite viable at the moment, and different companies are betting on one, the other, or both. It may be that the rate of adoption of ontologies will be stifled by early successes from the bottom-up approach in real enterprise applications.

>> I agree with Michael's comments, in particular the bit about the bottom-up approach being applicable. The other thing I'll note is that the Government has a vested interest in driving much of this. One can think of an Ontology infrastructure as being the Web equivalent of the federal interstate system. It's a problem that needs to be solved on the national scale and there is economic benefit as well as all the indirect social (security, medical, you name it) benefit. However, the idea of the government being in charge of how universal knowledge discovery is done scares the heck out of me; some form of public oversight with real clout is needed.

>> Even if you have the technology to be shared for creating ontologies, the inherently local nature of meaning indicates that bottom up approaches are likely to dominate. XML is successful precisely because it only constrains what is usefully sharable (mainly, syntax), and then utility drops off proportional to the size of the community of interest.

As to the syntax, XMLers like/tolerate XML and benefit from the sharable tools. Otherwise, there is a near universal loathing of its verbosity particularly in some AI and ontology circles. John Sowa lets fly about once a week on that topic.

>> One problem I see, considering how long people have been talking about the Semantic Web, is that there's still surprisingly little data to form into a web. (I'm just talking in terms of publicly available non-transient RDF.) I wonder how far XML would have gotten if we'd all spent the first few years writing DTDs and only occasionally created little document instances to demonstrate how our DTDs might be used. That didn't happen because people coming out of the SGML world already had plenty of real-world applications in which to use DTDs and documents, and the dot com boom gave people lots more ideas, but the amount of practical, usable RDF data still seems remarkably small. I've been compiling a list at rdfdata.org, and it's getting harder and harder to find new entries.

One could argue that we don't need RDF to build a viable semantic web, but RDF does address problems that need to be addressed, so if you pull it out of the equation something else needs to be plugged back in.

>> Actually, in the MISMO (Mortgage Industry Standards Maintenance Organization, which is the agreed upon standards body in the United States for Mortgage Technology) working groups this has been coming up quite a bit. In its standards process MISMO maintains a data dictionary of terms that work across the industry, as well as a variety of structures (grouped in process areas and transactions) where these terms are used.

It seems like a perfect candidate for a top down approach of semantic description, possibly via OWL. To be honest on a macro level the problem seems tenable-- much like the examples floating around the web of the Wineries and wines, it seems like it would be pretty simple to develop a strategy for describing the data-points, and ultimately the way in which they can/should be used (even on a process/transaction basis). Maybe that is because I mentally skipped some things that were important to understand...

But as Michael said, there is a lot of resistance to terminology-- ontology, description logics, KR, etc.-- and we don't have enough experts from that domain (i.e., I am not an expert in that domain).

There is also an ingrained need for ROI. Unfortunately, predicting ROI in this space is difficult because of a lack of visible successes. It would help if the media stopped focusing on what-if and started focusing on what-happened.

But ultimately it strikes me that the solution is somewhere in between the top-down and bottom-up approach. It would be really great if industry organizations such as MISMO created ontologies for their space and people could interact with them using their own local definitions and mapping them together using equivalence classes. Especially in the mortgage industry, if interfacing with a business partner was simply a matter of identifying like terms, and structure was invisible, then I think we will have made incredible progress. If you can eliminate the need for a programmer who understands the esoteric terms of the industry and enable the business experts to identify terms you will greatly reduce the time and money spent interfacing.

Perhaps this is a limited or wrong view of the Semantic Web. But it is a small step.

>> BTW to the original poster: a better place to ask this question would be one where the ontology experts hang out. One such list is the Conceptual Graphs list. cg@cs.uah.edu

Code lists are a productive place to start. This seems easy, but it isn't although it is the easiest of the problems once one gets past the syntax and terms of the semantic web app itself. Industry lists have been around for years. Getting those into formats that are readily processable is a step in the right direction. Then 'to do what?'

Local doesn't always mean 'in our shop'. An industry is a locale of sorts. The mushiness is domain overlap. For instance, we sell systems with jail commissaries. Some of the terminology is local to the 'jail business' but the items sold are items obtainable in most commercial stores. Then there are some items which one would only see in a detention or corrections facility but are nonetheless, items one obtains at the jail. This sorting of the domains if done well can provide good code lists, but then one implements say a dropdown that has members from multiple codelists. Domain overlap (a domain subsuming multiple domains with some common members and slightly different definitions) and domain leakage (a member that is adopted from one domain into another with not so small differences in definition but the assumption of equivalence) are a part of the semantic drift problem.

If the semantic web has one very large hurdle, it is the very dynamic nature of meaning with regards to changing intent. Do the best you can but no one can make time or meaning stand still. YMMV.

>> Sure they can, in the form of contracts. Essentially that is what OWL is for right-- a contract about the nature/meaning of a particular piece of information? Sure, those considerations will change over time but that is what versioning is for?

Semantic drift is to be expected, and I'll grant that it is a problem but that doesn't mean it makes the whole process useless. I know that the fidelity of an MP3 recorded from a CD and an old cassette are two wildly different things. I know that converting the MP3 to another format and back will likely involve some loss-- but it doesn't mean that the information is useless, I just have to approach soberly.

Code lists are great, shared code lists are more great-- but for each level you go out you have to keep in mind that there will be some lossiness. Fine. Still, sign me up-- if I have a program that can auto map 1800 out of 2000 fields reliably, I'll use it.

>> The problem is it isn't contract, but contracts. RFP by RFP. It is great if they can all reference one ontology, but for that to work, that ontology has to be the sum of their requirements; whaddaygit? Another bloated specification. Just whining, here.

It isn't that the ontology drifts: it is that meaning drifts. Will I accept a noise ratio of 5 to 1? Sure. Sobriety rules. One can't count on a large non-local community being sober all the time in all of the places where they make their decisions. So not just sober choice, but well-considered application. That is as good as it gets and why many said that frictionless computing was/is nonsense, so YMMV.

Don't get me wrong. We're very happy to get standards for the codelists we use. Stuff them into an enumeration and let us suck them via an XMLReader right into the database, then to the dropdown. Very happy indeed. But the real trick is to in near real time detect that a user in a particular context chose the wrong value from that list. This is when the semantic stuff starts to have more value.

>> Kendall,

We discussed it before because I had said (a bit facetiously) that the current Semantic Web is mostly FOAF files, tools, and talk. I certainly wouldn't deny that FOAF files are part of the Semantic Web; without them, there'd be little left!

As I've mentioned on the rdf-interest list, I still haven't heard a use case that demonstrates what value RSS 1.0 files of transient data can play in a semantic web. If it was current practice to archive them (like monkeyfist does) and I was reading an article by someone and wanted to see more by that person, semantic web technology crawling RSS 1.0 archives would make it easy to turn up more articles by that person. Maybe not everything he ever wrote, because in some bylines he may use his middle initial or called himself "James" instead of "Jim", but I would have found something.

It's not that I'm against transient data having any role period. Movie timetables are transient data, so if someone made those available as RDF files (haven't found anyone who does yet), I could obviously see why those would be useful. I'm just wondering how people can apply semantic web technology to take advantage of transient RSS 1.0 files to do things that they can't do with RSS .9, 2.0, etc. files. In other words, what makes them part of the semantic web? The mere fact that they're in RDF?

The SemWeb life sciences conference is a great example of how a specific domain, especially one currently suffering from data overload, is fertile ground for proving the value of semantic web technology, and publicly available data is appearing (http://www.rdfdata.org/data.html#bio). I was just telling a biomedical research professor about it over the weekend, and he was anxious to hear more.

Bob

Bob,

We've talked about this before, but every FOAF and RSS 1.0 resource is an RDF file. I don't know why you discount that data as non-transient. That people don't archive all of their RSS 1.0 events seems a matter of a best practice. It doesn't change the fact that there are *lots* of RSS 1.0 (which are RDF) resources on the Web. (And there are good social reasons for why people might not want to maintain all their FOAF versions.)

It seems to me that we're maybe in the "intranet" phase of the Semantic Web, that is, lots of non-public RDF inside enterprise and institutional walls, while the amount of RDF on the public Web continues to grow (even if not exponentially).

Lots of folks using RDF and OWL in the life sciences world, or so I learned at the W3C's workshop about SemWeb in LifeSci in Boston a few weeks ago, and the great majority of that isn't on the public Web.

My two cents, anyway. :>

Kendall Clark
Managing Editor, XML.com

>> I believe the notion behind the semantic web is many fairly small, intersecting ontologies. As described in TBL underground map: http://www.w3.org/2003/Talks/0922-rsoc-tbl/slide23-0.html. Each colored line in this diagram corresponds to an ontology. No single line visits all the stations; but several stations are visited by more than one line. Information is shared within one ontology to interoperate between, say, the address book and events. Another ontology interoperates between events and photos. The result is interoperation of addresses and photos. This is done without requiring all stakeholders to agree upon a single interlingua that covers all information silos at once.

I can't really see how one ontology could be practical even in a much smaller environment than Sem Web - such as a single company or a single department within a company. Often, even a single application will require multiple modular ontologies.

In theory, the modularity of ontology models should provide the flexibility needed to accommodate different contexts. One could also only reference/use part of an ontology - parts one can "agree with" - without committing to the entire ontology. In practice, we are still figuring out how this will all work.

> The problem is it isn't contract, but contracts.
> RFP by RFP. It is great if they can all reference
> one ontology, but for that to work, that ontology
> has to be the sum of their requirements;

I was going to say something similar, but from the enterprise integration context: It's great if you can get an ontology that describes the implicit semantics in a bunch of applications and databases by relating them back to the actual business functions they serve. BUT it is highly unlikely, in my experience anyway, that the ontology will remain the master "contract". Instead, the apps and DBs and business processes will evolve, as they always do, and IF &deity; smiles on us the ontology will be kept in synch.

&deity; is, however, a capricious god :-) and seldom smiles on the geeks trying to make life difficult for the people who are doing what they have to do to make the numbers this quarter or whatever.

>> Quite. No one expects a single interlingua, not before TBL or afterwards. These are the well-known problems of ontologies. The better authorities than TBL are people such as John Sowa, Pat Hayes, etc.

Until you map a working ontology to a working database, the practical aspects of size and modularity aren't apparent. Only a novice builds a database with one giant very wide table. On the other hand, ensuring that one has used all of the terminology correctly to name tables and columns, keeping these semantically consistent, and avoiding full normalization that can create performance problems is quite an art. So the single upper level ontology that would span cultures, users and space-time is a pipedream. So no disagreement here.

XML works because it knows nothing of meaning. Networks are predicated on the notion that the choices are meaningless to the network (See the first page of Shannon and Weaver's work.) Notion one is reproducibility, not interpretability. A meaningful network is almost an oxymoron. A network of users dynamically negotiating and validating the meaning of messages isn't.


>>
> In theory, the modularity of ontology models should provide the flexibility
> needed to accommodate different contexts. One could also only reference/use
> part of an ontology - parts one can "agree with" - without committing to the
> entire ontology. In practice, we are still figuring out how this will all
> work.

Forgive me if this is something I should have learned in SemWeb 101, but doesn't any inferencing mechanism based on logic assume that the ontologies are consistent? How does one ensure that the parts of multiple ontologies that one "agrees with" are consistent with one another? And if they're not, an inferencer could come to any conclusion whatsoever (e.g. the possibly apocryphal story of Bertrand Russell proving that he is the Pope from the premise that 2+2=5) ... or what am I missing here?

In practice, what DOES one do, other than work with simple and unitary ontologies that don't imply anything remotely interesting, but let software agents automate the grunt work of generating queries, transformations, etc. that are just too tedious for humans to do quickly and accurately? That's a use case for the semantic web technologies that I can both grok and see an application for, FWIW.

>>>
Even then, looser can be better, at least until the number crunchers get into the act. The act of measurement is the surest expression of a semantic, or something like that. Otherwise, from the geek perspective, looser lasts longer. We can spend enormous amounts of time identifying all of the individually meaningful items, or we can implement two text boxes labeled Request and Response and get on with business.

This of course, negates traceability. So when building an enterprise app, it can be useful to have a 50k foot view of the end-to-end lifecycle of all of the documents and the items they control. Really precise data items make it harder than it has to be if the systems act mainly as transport/storage, not an interpreter. If the human is interpreting and taking all of the critical actions, labeled textboxes do just as well. The fear and loathing starts down in the queries and particularly any place the system is performing hidden calculations.

Caching and naming never get easier.

>> Yes, this is exactly right. Semantic Web is all about working with simple unitary ontologies and having software agents go at them.

I don't think you are missing anything. One of the motivations for common "upper" ontologies is that you support the interoperability of your ontologies by making them all consistent with the UO. So this could be a solution, but I have difficulty believing in the feasibility of making this happen, although there are people who swear by it. I know of some work on reasoners that manage contexts, so that you don't have to import all of your foreign ontology to do reasoning, but this still has the issue of how one knows it is consistent when you do.


>> One approach to the upper ontology, or any ontology really, is to accept that it is, like law, an artifice. It works as well as it works when it works and that is as well as it will work. Like your car, it gets a job done and when it doesn't, you or someone else can fix it.

The question of the semantic web is the golem problem: how much power and authority will you give the artifice over your choices? Otherwise, don't mistake a tool for the truth of the results of using the tool. A computer doesn't know how to add 2 + 2. It can be used to simulate that operation and give a repeatable result. If 2 + 2 = 4 for an acceptable number of uses, it is a useful tool. If you hit the one context in which that isn't true, it fails. So understand in advance what you are committing to and what the bet is.

An interesting question might be, when is an ontology expressing something non-trivial? Where there are doubts about the value of the semantic web, they are related to that question. The cost of an expert system proved to be very high for the utility it provided over a deliberately limited domain. The assumption seems to be that some of the scaling magic of the WWW will be obtained for the Semantic Web, but again, networks scale precisely because they are NOT meaningful. So this bet may not be a good one.

Treat ontologies like law: to be useful, law must be testable or enforceable. Thus the notion of commitment to rule by law and to an ontology (see Thomas Gruber). In one view one might say, an ontology is a computable means for expressing a precedent. Expressing and applying a precedent is a matter of judgement, not truth. It is also useful to inquire of how often you will find a system useful based on the frequency with which it halts and asks you a clarifying question, and the value in terms of work when it does that? Interrupts are expensive.

>> I have begun to see the value of this within the US federal government space. The federal government is extremely data-rich, and agencies want to share information to save money, earn efficiencies, and potentially increase overall data quality (through elimination of redundant, but possibly inconsistent, data sources). But in order to determine what information can be shared, it is first necessary to identify what information (or types of information) are available *to* share. Ontologies and taxonomies are, I believe, wonderful mechanisms by which to accomplish this.

In addition to identifying opportunities for information sharing, these artifacts can also identify opportunities for federated queries (perhaps using Enterprise Information Integration - EII). Consider a hypothetical situation in which 2 agencies have arrest information for an individual - but one has it on a domestic basis, and another on an international basis. A federated query between these two data sources - which can be determined by comparing their ontologies and taxonomies - can yield an arrest record for a given individual on an international basis.

I have found that in educating unfamiliar folks on these artifacts, it works best to use examples within their own domain. Familiarity with their own data and concepts greatly eases the mental transition.


Tuesday, November 09, 2004

What Do I think of the Semantic Web - Somebody asked.

I hope we are soon in an era where I can find the best answers to your questions on the Semantic Web.
I feel strongly about the entire initiative because it has brought in thoughts from a wide range of industries and domains: a community effort to solve the common problem of disparities in metadata and understanding.

The reach of the Semantic Web has turned out to be so wide that the term “World Wide Web” itself might require rephrasing [unless the effort of migration proves too humongous]. Wonder why? Because it would not even matter whether we are in the same world. [Hope that’s not too much to ask for.]

Finally, we have come to the point where the Metadata has traversed up to the Root of the hierarchy, where entities have become self-descriptive and no longer require external metadata to be defined.


I feel that the transition from human readable semantics, where the User knows that the content within a given tag should be an author, to machine readable semantics, where the Document Processor also knows this as a fact, is a success in the following respects.

1) The Definition of a Consumer has broadened: it does not matter who the Consumer is. The Data is available for consumption by any End System, as long as it treats it well.
2) Common View of the Data for Consumers irrespective of the Syntax.
3) Transparent and Adaptive Context Based Processing and Interpretation of Data for any Consumer of a Document.
4) Automatic Expansion of Data on discovery of matching Data Semantics.

Just as Programming languages shifted from a Procedural Paradigm to an Object Oriented Paradigm, so have Applications moved from a Processing Paradigm to a Document Centric Paradigm. The realization that Processing is just about making decisions based on Data or enabling data flow, together with the relevance of a consistent and generic Data Model, has been the most significant shift in thinking over the past decade. Both these systems have felt the significance of Data Sharing.

One common misconception that I have noticed around the Semantic Web is the myth that
a) Semantic Web requires a Browser and a Search Engine
b) Semantic Web is all about, and only about Semantic matches for data. Of course, discoverability is one of the principal concepts that Semantic Web holds, but it is unfair to restrict its definition to such a small subset.

But I definitely foresee the Web evolving into a virtual database with infinite tables, rows and columns, and in spite of this, providing easy access to the relevant information that Applications and People really want. In any case, semantically, a Document Query closely resembles a database query in many respects.
a) Both are aimed at extracting relevant information from a data source
b) Both use a linguistic query syntax for retrieving data
c) Both access data that exists in a predefined structural format.
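
To make the resemblance concrete, here is a minimal sketch that asks the same question of a database table and of a document. The book data, the table name and the element names are all hypothetical, and Python's sqlite3 and xml.etree.ElementTree modules are used only because they ship with the language; this is an illustration, not a claim about how any real system is built.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical data, invented purely for illustration.
ROWS = [
    ("Intro to XML", "web", 30.0),
    ("Cooking Basics", "food", 12.5),
    ("XPath Notes", "web", 18.0),
]

# Database side: a table with a predefined structure, queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE book (title TEXT, category TEXT, price REAL)")
conn.executemany("INSERT INTO book VALUES (?, ?, ?)", ROWS)
sql_titles = [
    row[0]
    for row in conn.execute(
        "SELECT title FROM book WHERE category = 'web' ORDER BY rowid"
    )
]
conn.close()

# Document side: the same data as an XML tree, queried with a path expression.
root = ET.Element("bookstore")
for title, category, price in ROWS:
    book = ET.SubElement(root, "book", category=category)
    ET.SubElement(book, "title").text = title
    ET.SubElement(book, "price").text = str(price)
doc_titles = [t.text for t in root.findall("./book[@category='web']/title")]

print(sql_titles)
print(doc_titles)
```

The WHERE clause and the [@category='web'] predicate play the same role: both pick out the relevant records from a source whose structure is known in advance.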

It’s interesting to note the similarities in the notion of semantic equivalence in Documents and terminologies, and Objects in programming languages.
Object Oriented Programming had crude notions of semantic associations between Objects in the form of “is a” and “has a” relationships, domains and ranges. These enable an Object to store only the data that it considers correct and valid, and help build networks of Objects that can work together.

This eventually led to the natural evolution of Web Services that allowed loosely coupled interaction between Objects [well, at the rock bottom, it boils down to Objects, right?].
There’s plenty of scope, and plenty of research already happening, in the area of Semantic discovery of Services based on their publicly visible behaviour, assisting discovery and selection of Partners based on Semantic Annotations on the Service Definitions through a variety of mechanisms.

Well, the industry soon realized that Web Services alone are not going to solve all the business problems in the world: they needed to talk to the Organizations’ internal Business Processes to create a Business Process Orchestration that would police the interaction between the Services. That brought Semantic Integration to the forefront. Communities started mail threads on Process Ontologies and on automatically orienting the Organizational direction to changing Business Needs, capturing metrics of partners and processes in ontologies and optimizing the Business Processes based on the metadata.

I feel I should stop here, for this answer could run into pages, weeks and months, and by the time I finish it, my PC might have an inbuilt intelligent inference engine giving me instructions on optimized work-time productivity and suggesting times when I should continue writing this article, based on my Project Deadline metadata, my personal metadata and information from external data sources.

Well, in Summary, I feel that the Semantic Web Initiative holds a bright future by
1) Allowing people and Applications to understand that they are talking about the same thing or a related fact.
2) Providing space for Resilient Knowledge Management by moving forward from structural and linguistic constraints on definition of data, to a space that allows for change in the Organization’s data representation without impacting Business and avoiding downtimes.
3) Spreading awareness among organizations that a generic and extensible data model will solve problems that would otherwise impact the Business, if not at the immediate present.
4) Drawing on the involvement and endorsement of a huge intellectual community under the umbrella of people like TBL, who can dream of the impossible and make it possible.