job.setOutputFormatClass(AvroKeyOutputFormat.class);
AvroKeyOutputFormat.setOutputPath(job, new Path(args[1]));
Setting mapred.child.ulimit=unlimited (mapred-site.xml) solves the problem for me.
The aspirational goal of our Data Cleaning project at Microsoft Research [6] (started in the year 2000) is to
design and develop a domain-independent horizontal platform for data cleaning, so that scalable vertical solutions
such as address cleansing and product de-duplication can be developed over the platform with little programming
effort. The basis for this goal is the observation that some fundamental concepts, such as textual similarity and
the need to extract structured data from text, are part of many data cleaning solutions.