Software cross-sorts gene data

By Kimberly Patch, Technology Research News

As anyone who uses the Internet knows, there's a vast difference between simply gathering information and organizing it so that you can actually find anything. Analyzing large amounts of genetics or financial data presents an even larger problem: teasing patterns from reams upon reams of data.

One popular method that allows for quicker analysis of large groups of data is clustering, which organizes data into groups with similar traits.

Researchers from the Weizmann Institute have taken the process a step further with coupled two-way clustering (CTWC), an approach that identifies two types of subsets in the data, then uses one to cluster the other.

CTWC works with any clustering algorithm. For their demonstrations, the researchers used it in conjunction with their Super Paramagnetic Clustering algorithm, which clusters data based on the way grains of magnetic materials organize themselves into magnetic clusters as they cool.

The CTWC method allows the clustering algorithm to zero in on relevant data and spend less time chunking away on data that has less in common. "We were looking for ways to reduce the problem and break it up into its constituent parts," said Eytan Domany, a physics professor at the Weizmann Institute.

In one demonstration that involved genetic data, for example, "we were able to zero in on small groups of genes whose expression levels are able to distinguish between particular subgroups of tissues," Domany said.

The method also allows for clusters based on similar traits, even when the traits were not identified beforehand. "We want to find also new partitions -- say of two different kinds of tumor -- of whose existence we were unaware before the experiment," said Domany.

For example, in a demonstration that analyzed gene participation in colon cancer, the method was used to analyze expression levels of thousands of genes for 40 tumor samples and 22 normal tissue samples.

This type of data can be clustered in two basic ways, said Domany. The first is to group the tissue samples according to their expression profiles across the genes, yielding partitions like tumor versus healthy, or different kinds of tumor. The second is to group the genes whose expression levels are strongly correlated across the samples. The second type of cluster is important because such genes could belong to the same biological mechanism, one that may have caused the disease.

To identify new clusters the researchers "first use all [tissue] samples to divide the genes into clusters, and all genes to partition the samples into groups. Now we take [the clusters] of genes, and use only its members... to partition every group of samples into subgroups," said Domany.
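The two-stage procedure Domany describes can be sketched roughly as follows. This is a minimal illustration, not the researchers' code: the threshold-linkage clusterer is a simple stand-in for Super Paramagnetic Clustering, and the function names, the root-mean-square distance, the cutoff value and the toy expression matrix are all assumptions made for the example. It also covers only one direction of the coupling (gene clusters used to re-partition sample groups).

```python
from itertools import combinations

def rms_distance(a, b):
    """Root-mean-square difference between two equal-length profiles."""
    return (sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)) ** 0.5

def threshold_clusters(profiles, cutoff):
    """Stand-in clusterer (not SPC): link profiles closer than cutoff."""
    parent = list(range(len(profiles)))

    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i

    for i, j in combinations(range(len(profiles)), 2):
        if rms_distance(profiles[i], profiles[j]) < cutoff:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(profiles)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

def cluster_genes(matrix, genes, samples, cutoff):
    """Partition `genes` using only their levels in `samples`."""
    profiles = [[matrix[g][s] for s in samples] for g in genes]
    return [[genes[i] for i in grp] for grp in threshold_clusters(profiles, cutoff)]

def cluster_samples(matrix, genes, samples, cutoff):
    """Partition `samples` using only the levels of `genes`."""
    profiles = [[matrix[g][s] for g in genes] for s in samples]
    return [[samples[i] for i in grp] for grp in threshold_clusters(profiles, cutoff)]

def coupled_two_way(matrix, cutoff):
    genes = list(range(len(matrix)))
    samples = list(range(len(matrix[0])))
    # Stage 1: all samples divide the genes; all genes divide the samples.
    gene_clusters = cluster_genes(matrix, genes, samples, cutoff)
    sample_groups = cluster_samples(matrix, genes, samples, cutoff)
    # Stage 2: each gene cluster alone re-partitions every sample group.
    refined = {}
    for gc in gene_clusters:
        for sg in sample_groups:
            refined[(tuple(gc), tuple(sg))] = cluster_samples(matrix, gc, sg, cutoff)
    return gene_clusters, sample_groups, refined

# Hypothetical toy matrix: rows are genes, columns are tissue samples.
toy = [
    [0, 0, 0, 0, 9, 9, 9, 9],  # genes 0 and 1 separate samples 0-3 from 4-7
    [0, 0, 0, 0, 9, 9, 9, 9],
    [0, 0, 2, 2, 0, 0, 2, 2],  # genes 2 and 3 carry a weaker, hidden split
    [0, 0, 2, 2, 0, 0, 2, 2],
]
gene_clusters, sample_groups, refined = coupled_two_way(toy, cutoff=1.5)
print(gene_clusters)
print(sample_groups)
print(refined[((2, 3), (0, 1, 2, 3))])
```

In this toy case, clustering the samples with all four genes yields only the two obvious groups, because the two uninformative genes dilute the weaker signal. Restricting the comparison to the gene cluster (2, 3) splits each group further, which is the kind of previously unknown partition the method is meant to expose.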

"The procedure gives us numerous partitions of the genes and of the samples. We look at them and check whether they are... statistically significant and... biologically meaningful," he said.

In one experiment, the researchers first clustered the genes using both tumor and normal tissue samples, then using the tumor tissues only. The two resulting clusterings each had two similar gene clusters, but only in the tumor tissue sample were the expression levels of the two clusters strongly correlated.
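One way to picture this comparison is to correlate the average expression of two gene clusters, first across all samples and then across the tumor samples only. The sketch below does that with a plain Pearson correlation. The data, the cluster memberships and the helper names are invented for illustration and do not come from the colon cancer data set.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def cluster_mean(matrix, genes, samples):
    """Average expression of a gene cluster in each chosen sample."""
    return [sum(matrix[g][s] for g in genes) / len(genes) for s in samples]

# Hypothetical toy data: rows are genes, columns are samples.
# Samples 0-3 stand in for tumor tissue, samples 4-7 for normal tissue.
expression = [
    [1, 2, 3, 4, 1, 2, 3, 4],  # gene 0, cluster A
    [1, 2, 3, 4, 1, 2, 3, 4],  # gene 1, cluster A
    [2, 4, 6, 8, 8, 6, 4, 2],  # gene 2, cluster B
    [2, 4, 6, 8, 8, 6, 4, 2],  # gene 3, cluster B
]
cluster_a, cluster_b = [0, 1], [2, 3]
all_samples = list(range(8))
tumor_samples = [0, 1, 2, 3]

r_all = pearson(cluster_mean(expression, cluster_a, all_samples),
                cluster_mean(expression, cluster_b, all_samples))
r_tumor = pearson(cluster_mean(expression, cluster_a, tumor_samples),
                  cluster_mean(expression, cluster_b, tumor_samples))
print(f"all samples: r = {r_all:.2f}, tumor only: r = {r_tumor:.2f}")
```

With these made-up numbers the two clusters are uncorrelated across all samples but perfectly correlated within the tumor subset, mirroring the pattern described above in exaggerated form.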

This showed that colon cancer was more likely when people had both types of genes, which is known to be the case.

Using the same data, the researchers found a cluster of genes which parsed the 62 tissue samples into two groups, each containing both tumor and normal tissues. The two groups were distinct in a different way, however. "We discovered that at a certain date the experimental protocol had been changed," said Domany. "Nearly all the tissues in one of our groups have been measured before this date, and those of the other group after this date. Hence we discovered a set of genes whose expression levels are sensitive to the changing measurements induced by the change of protocol," he said.

The algorithm is still being tuned, but the researchers plan to have a version available to download from their web site within two months, according to Domany.

The next step is to apply the algorithm to two more types of data, said Domany. "We are applying the algorithm for document classification [and also] plan to look at financial data," he said. In addition, the researchers are planning to use the existing algorithm on more gene expression data. They are also taking steps towards commercializing the algorithm, Domany said.

The current emphasis in clustering research is on improving the algorithms' performance on very large data sets, said Johannes Gehrke, an assistant professor in the computer science department at Cornell University.

"Algorithms for very large data sets have to be scalable, which means that their running time has to increase about linearly with the number of records in the input data set. Many existing algorithms... just take too much time to run on today's data sets, even with the speed of computers increasing according to Moore's Law," he said.

Researchers are also beginning to turn their attention to making clustering algorithms work on the fly.

"One challenge is the mining of high-speed data streams," said Gehrke. "As an example, Yahoo had about 450 million page views per day in December 1999 and about 680 million page views per day last month. This amounts to 6 gigabytes of clickstream data per hour. Given this enormous flood of data, we need to develop stream data mining algorithms that can digest these rivers of data just by looking at each record exactly once, in the order the records arrive," he said.

Domany's research colleagues were doctoral students Gad Getz and Erel Levine. They published their research in the Oct. 17 issue of the Proceedings of the National Academy of Sciences (PNAS).

The research was funded by the U.S.-Israel Binational Science Foundation, the Germany-Israel Science Foundation, the Ministry of Science and the Minerva Foundation.

Timeline:   <2 months
Funding:   Government
TRN Categories:  Data Structures and Algorithms
Story Type:   News
Related Elements:   Technical paper, "Coupled Two-way Clustering Analysis of Gene MicroArray Data," Proceedings of the National Academy of Sciences (PNAS), Oct. 17, 2000; Site where algorithm will be available in early 2001:


November 22, 2000


© Copyright Technology Research News, LLC 2000-2006. All rights reserved.