Business Analytics: June 2007

Thursday, June 21, 2007

Web Analytics: Future Applications in Predictive Modeling

A lot of time and effort is being channeled into the area of web analytics. This term refers to:

“[t]he measurement of data as it relates to an Internet site, including the behavior of visitors, the amount of traffic, the conversion rates, web server performance, user experience, and other information in order to understand and proof of results and continually improve the results of a site towards a set of objectives.”

Since web analytics is another area of predictive modeling, we must ask whether the methodologies, analytics software, and visualization tools developed in web analytics could have an impact on other industries that use predictive modeling, such as healthcare, banking, insurance, retail, and manufacturing. I think that the processes and software developed for web analytics will ultimately be used in many other industries because the intersection of the Internet and other industries is already a reality.

Predictive modeling and web analytics have the same objective: to provide a measurement (or baseline) and to predict future behavior. One of the key contributions of web analytics has been software that can withstand the rigors of commercial use. The scalability components of web analytics are crucial for other industries in which large databases have become the norm.

Another significant contribution of web analytics to the area of predictive modeling is the ability to come together and provide a series of metrics and benchmarks for the industry. Although some may disagree with this assessment, if we look at the history of the healthcare industry it is apparent that the inability to agree upon benchmarks and metrics has negatively impacted the cost of healthcare in the United States. Moreover, those involved in web analytics could give industries like banking, insurance, and retail an innovative new look at what needs to be measured.

A third contribution of web analytics to predictive analytics is the healthy, spirited, and robust exchange in the area of privacy. The Internet has raised serious and pertinent questions regarding privacy, and other industries could benefit from that discussion.

A fourth contribution of web analytics to predictive modeling is the development of new return on investment (ROI) models in business. As companies adopt these ROI models for their advertising, new media, and marketing strategies, they may find that these models are also applicable to other lines of business.

Last but not least, web analytics has contributed a new set of visualization tools that summarize previously hidden nuggets of gold in a way that can be easily understood and acted upon.

Tuesday, June 19, 2007

Geovisual Analytics and Crisis Management

A good article about geovisual analytics. Good utilization of I2 and GIS technologies.

NIH-NSF Visualization Research Challenges Report

This article is for all the developers and scientists working on visualization tools in analytics and data mining. Enjoy!

Monday, June 18, 2007

BioGRID version 2.0.29 release ( maintenance update )

The latest release of BioGRID.

Friday, June 15, 2007

What Data Mining Can and Can't Do

I include this article because Peter Fader is an expert in behavioral predictive modeling. The caveat is that there is a difference between predicting behavior and predicting patterns that are not necessarily related to behavior. For example, in molecular genetics we try to predict how a gene or a chemical substance affects the physiology of a person. These reactions are mostly physical rather than behavior oriented. Nevertheless, I agree with Peter that executives' expectations of data mining are out of proportion to the investment in predictive modeling. Predictive modeling using Excel spreadsheets might work for behavioral analysis, but I do not think it will work in the health care and biotechnology industries. When I was practicing law I used to refer to this as the "what is at stake syndrome". In a civil case it is money, but in a criminal case it is a person's freedom. Experience has taught me that generalizations make a great sound bite, but could be dangerous in the real world.

Wednesday, June 13, 2007

Evaluation of noise reduction techniques in the splice junction recognition problem

The authors have done a good job of evaluating noise reduction techniques that use pre-processing algorithms on large genetic databases, which are characterized by noisy data that can affect the performance of data mining processes.

A review of symbolic analysis of experimental data

This article suggests time-series analysis as a way to reduce noise in large databases. My first impression was, "You must be kidding; time series have nothing to do with noise reduction." Then I did an experiment using my 4.7 terabytes of data and found that a time-series analysis could detect the cause of noise in my sample data (or training set). When I re-read the article after the experiment I found that this methodology is for processes that are non-linear and possibly chaotic. I am using healthcare data that is non-linear and chaotic. I found that the time-series analysis was a good methodology to identify noise in the training set for the data tags=1. I still need to do a lot of reverse engineering to understand the why, but in the meantime I thought this was worth passing on.
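
For readers who want a feel for what "symbolic analysis" of a time series means, here is a minimal Python sketch. It is not the procedure from the article and not the experiment described above; the toy series, the two-symbol partition around the mean, and the word length of 3 are all assumptions chosen for illustration. The idea is to replace each value with a coarse symbol, count short symbol "words", and look at the rare words as candidates for noise or anomalies.

```python
from collections import Counter
from statistics import mean


def symbolize(series):
    """Map each value to 1 if it is above the series mean, otherwise 0."""
    cut = mean(series)
    return [1 if x > cut else 0 for x in series]


def word_counts(symbols, length=3):
    """Count every overlapping run of `length` consecutive symbols."""
    words = (tuple(symbols[i:i + length])
             for i in range(len(symbols) - length + 1))
    return Counter(words)


# Hypothetical toy series: a regular low/high pattern with a glitch
# around index 6, where a low reading is replaced by a high one.
series = [1.0, 5.0, 1.2, 5.1, 0.9, 4.8, 5.0, 5.2, 1.1, 4.9, 1.0, 5.3]

symbols = symbolize(series)
counts = word_counts(symbols, length=3)

# Words seen only once are the ones that break the dominant pattern.
rare = [word for word, n in counts.items() if n == 1]

print(symbols)
print(counts.most_common())
print(rare)
```

In this toy case the alternating words dominate and the only rare words are the ones that overlap the glitch, which is the sense in which a symbolic view of the series can point at where the noise sits.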

Enhancing Data Analysis with Noise Removal

I am working on noise reduction for an enterprise data mining model, and I thought that this was a good overall article on the different applicable techniques.

Tuesday, June 12, 2007

Incremental Mining of Sequential Patterns in Large Databases

The fundamentals of this algorithm could be used in large databases.

The problem: "As databases evolve the problem of maintaining sequential patterns over a significantly long period of time becomes essential, since a large number of new records may be added to a database. To reflect the current state of the database where previous sequential patterns would become irrelevant and new sequential patterns might appear, there is a need for efficient algorithms to update, maintain and manage the information discovered [12]. Several efficient algorithms for maintaining association rules have been developed [12–15]. Nevertheless, the problem of maintaining sequential patterns is much more complicated than maintaining association rules, since transaction cutting and sequence permutation have to be taken into account [16]."

The proposed solution: "This method is based on the discovery of frequent sequences by only considering frequent sequences obtained by an earlier mining step. By proposing an iterative approach based only on such frequent sequences we are able to handle large databases without having to maintain negative border information, which was proved to be very memory consuming [16]. Maintaining such a border is well adapted to incremental association mining [26,19], where association rules are only intended to discover intra-transaction patterns (itemsets). Nevertheless, in sequence mining, we also have to discover inter-transaction patterns (sequences) and the set of all frequent sequences is an unbounded superset of the set of frequent itemsets (bounded) [16]. The main consequence is that such approaches are very limited by the negative border size."
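
To show what is being maintained here, the following Python sketch computes the support of a sequential pattern before and after new records arrive. It is deliberately naive: it recounts everything from scratch, which is exactly the cost the paper's incremental method is meant to avoid, so it should not be read as the authors' algorithm. The customer IDs, items, and the 0.7 minimum support threshold are hypothetical.

```python
def contains(sequence, pattern):
    """True if `pattern` occurs in `sequence` as an ordered subsequence.

    Both arguments are lists of itemsets (Python sets). Each itemset of the
    pattern must be a subset of a later and later transaction in the sequence.
    """
    pos = 0
    for itemset in sequence:
        if pos < len(pattern) and pattern[pos] <= itemset:
            pos += 1
    return pos == len(pattern)


def support(database, pattern):
    """Fraction of customers whose sequence contains the pattern."""
    hits = sum(1 for seq in database.values() if contains(seq, pattern))
    return hits / len(database)


def apply_increment(database, increment):
    """Return a new database with new transactions and new customers added."""
    updated = {cid: list(seq) for cid, seq in database.items()}
    for cid, new_transactions in increment.items():
        updated.setdefault(cid, []).extend(new_transactions)
    return updated


# Hypothetical sequence database: one ordered list of itemsets per customer.
database = {
    "c1": [{"a"}, {"b"}, {"c"}],
    "c2": [{"a"}, {"c"}],
    "c3": [{"a", "b"}, {"c"}],
    "c4": [{"b"}, {"a"}],
}
# Hypothetical increment: new transactions for c2 and c4, plus a new customer.
increment = {"c2": [{"b"}], "c4": [{"b"}], "c5": [{"a"}, {"b"}]}

a_then_c = [{"a"}, {"c"}]
a_then_b = [{"a"}, {"b"}]
min_support = 0.7  # assumed threshold

before = {name: support(database, p)
          for name, p in (("a->c", a_then_c), ("a->b", a_then_b))}
updated = apply_increment(database, increment)
after = {name: support(updated, p)
         for name, p in (("a->c", a_then_c), ("a->b", a_then_b))}

print(before)
print(after)
```

With these toy records, "a then c" is frequent before the update and drops below the threshold afterwards, while "a then b" goes the other way. Note that customer c3 bought a and b in the same transaction, which counts as an itemset and not as the sequence "a then b".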

Friday, June 08, 2007

Molecular Staging for Survival Prediction of Colorectal Cancer Patients

This article shows the potential of data mining for disease prognosis using microarray (SAM) data.

Go Stanford! That's where one of my daughters graduated from, and SAM is a product of Stanford University.

The treatment of missing values and its effect in the classifier accuracy

Good paper on the effects of missing values on the accuracy of your model. The organization of this paper would improve if the authors had included their recommendation as part of the Summary.

Nevertheless, this is the crucial recommendation (p.8): "We recommend that we can deal with datasets having up to 20 % of missing values. For the CD (Complete Deletion) method we have up to 60 % of instances containing missing values and still have a reasonable performance."
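
To see what these percentages refer to, here is a small Python sketch that measures both the share of missing cells and the share of instances containing a missing value, then applies Complete Deletion and, for contrast, mean imputation. Mean imputation is my choice for the comparison and the toy rows are hypothetical; this is not a reproduction of the paper's experiments.

```python
from statistics import mean


def missing_cell_rate(rows):
    """Fraction of all cells that are missing (None)."""
    cells = [value for row in rows for value in row]
    return sum(value is None for value in cells) / len(cells)


def missing_instance_rate(rows):
    """Fraction of rows (instances) with at least one missing value."""
    return sum(None in row for row in rows) / len(rows)


def complete_deletion(rows):
    """Complete Deletion (CD): drop every row that has a missing value."""
    return [row for row in rows if None not in row]


def mean_imputation(rows):
    """Replace each missing value with the mean of its column's observed values."""
    columns = list(zip(*rows))
    fill = [mean(v for v in col if v is not None) for col in columns]
    return [[fill[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]


# Hypothetical numeric dataset; None marks a missing value.
rows = [
    [1.0, 10.0],
    [2.0, None],
    [None, 30.0],
    [4.0, 40.0],
    [5.0, 50.0],
]

print(missing_cell_rate(rows))      # share of cells missing
print(missing_instance_rate(rows))  # share of instances affected
print(complete_deletion(rows))
print(mean_imputation(rows))
```

The two rates can differ a lot: here 20% of the cells are missing but 40% of the instances are affected, so Complete Deletion throws away twice as much data as the cell rate suggests.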

For healthcare, pharma, and biotech data this paper is important because of the complexity and diversity of this data.

An Assessment of Accuracy, Error, and Conflict with Support Values from

This article is for experienced biostatisticians. Nevertheless, this is the interpretation for the layman:
When molecular biology theories are tested with real data, we need to be cautious in reading bootstrap values if we are assuming that they underestimate the actual support. For example (my example is not in this article), when comparing a decision tree with a Bayesian logistic regression model, be cautious in how you assess the accuracy of your model, since the decision tree tends to underestimate and Bayesian models tend to overestimate.

I have found that this type of distinction (non-parametric bootstrap values vs. Bayesian probabilities) is fundamental to increasing a classifier's accuracy.
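
As a rough illustration of the non-parametric bootstrap in the classifier setting (not the phylogenetic support values the article studies), the Python sketch below resamples a set of predictions with replacement and reports a percentile interval for accuracy. The labels, predictions, number of resamples, and seed are all made up for the example.

```python
import random


def bootstrap_accuracy(y_true, y_pred, n_resamples=1000, seed=0):
    """Return (accuracy, lower, upper) using a 95% percentile bootstrap."""
    rng = random.Random(seed)
    n = len(y_true)
    # 1 where the prediction was right, 0 where it was wrong.
    correct = [int(t == p) for t, p in zip(y_true, y_pred)]
    estimates = []
    for _ in range(n_resamples):
        # Draw n cases with replacement and recompute accuracy on the resample.
        sample = [correct[rng.randrange(n)] for _ in range(n)]
        estimates.append(sum(sample) / n)
    estimates.sort()
    lower = estimates[int(0.025 * n_resamples)]
    upper = estimates[int(0.975 * n_resamples) - 1]
    return sum(correct) / n, lower, upper


# Hypothetical test-set labels and predictions (4 errors out of 20 cases).
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1]

accuracy, lower, upper = bootstrap_accuracy(y_true, y_pred)
print(accuracy, lower, upper)
```

The point of the interval is the same caution as above: a single accuracy figure from a small test set hides a wide range of plausible values, whichever kind of model produced it.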

Phase II Studies: Which is Worse, False Positive or False Negative

A short but powerful article that helps explain the effects of Type I and Type II errors in clinical trials.
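
To make the trade-off concrete, here is a Python sketch of a simple single-arm design in which a treatment "passes" if at least r of n patients respond. The design and the numbers (25 patients, a 10% response rate for an ineffective treatment, 30% for an effective one) are my own assumptions for illustration and are not taken from the article.

```python
from math import comb


def binom_tail(n, p, r):
    """P(X >= r) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(r, n + 1))


def error_rates(n, r, p0, p1):
    """Type I and Type II error rates for the rule 'pass if responses >= r'.

    p0: response rate of an ineffective treatment (null hypothesis)
    p1: response rate of an effective treatment (alternative hypothesis)
    """
    alpha = binom_tail(n, p0, r)     # false positive: ineffective but passes
    beta = 1 - binom_tail(n, p1, r)  # false negative: effective but fails
    return alpha, beta


# Hypothetical design parameters.
n, p0, p1 = 25, 0.10, 0.30

for r in (5, 6, 7):
    alpha, beta = error_rates(n, r, p0, p1)
    print(f"r={r}: false positive rate={alpha:.3f}, false negative rate={beta:.3f}")
```

Raising the cutoff r lowers the false positive rate and raises the false negative rate, so which error is "worse" has to be decided by what is at stake, not by the arithmetic.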
