Saturday, September 22, 2007

Mobile Business Intelligence - the Next Big Step

Cognos has taken the lead in the area of mobile business intelligence. This is a huge step!

Wednesday, September 19, 2007

Duke Plots Course Beyond the Smart Grid

This is one of the most forward-thinking business intelligence projects in the world. Duke Energy is taking steps to create a smart power grid. The intelligent real-time applications behind this concept will revolutionize the world.

Friday, September 14, 2007

VP, Decision Support Systems

I was contacted about a position as VP, Decision Support Systems, in New York with a prestigious financial institution. Although I am not interested, some of you may be. If you are, contact Heinz Bartesh at

Wednesday, September 12, 2007

Market Forecasting and Modeling for the Power System of the Future

This paper addresses the utilization of predictive modeling and forecasting in the power supply industry. The issues herein were identified a couple of years ago, but the implementation is occurring at this time. The challenge of a forecasting system in power supply is the many "what if" scenarios that different models will need to consider. These "what if" scenarios need to take into consideration:
1. physical assets
2. contract prices
3. economic forecasting

A model of this size and complexity could require a combination of most of the tools and methodologies in the current data mining and predictive modeling market, plus the development of some new tools.

Predictive Planning for Supply Chain Management

This paper shows one methodology of using predictive modeling in planning and scheduling decisions in supply chain management. It is important to remember that the variables will be different depending on the industry and client-specific requirements.

Tuesday, September 11, 2007

F.B.I. Data Mining Reached Beyond Initial Targets

It seems that the definition of a "community of interest" association will be the cluster results from I2 Notebook. I have used this tool many times and the results are impressive. If you are using this tool, consider additional analysis (logistic regression and decision trees) to further refine your results.

Monday, September 03, 2007

Frequent Doesn’t Mean Loyal: Using Segmentation Marketing to Build Shopper Loyalty

This is a classic article on the theory of how to translate customer loyalty into "profitable differentiation".

Data Mining Analysis and Modeling for Marketing Based on Attributes of Customer Relationship

This article on data mining in CRM for the retail industry shows the use of cluster analysis, association rules, and linear regression in determining Attributes of Customer Relationship (ACR).


This 2000 paper from the Bank for International Settlements gives a good overall picture of the statistical models used to analyze and assess risk in the banking industry. The mortgage industry and associated lenders and market leaders should consider implementing these early warning system models to prevent the current mortgage crisis from repeating itself in other areas.

Data Mining Applications in Higher Education

Good article about data mining applications in higher education.

Data Mining Technologies and Decision Support Systems for Business and Scientific Applications

The issue is no longer whether you have the information or data, since many companies and organizations have large amounts of data. "The challenge is to be able to utilize the available information, to gain a better understanding of the past, and predict or influence the future through better decision-making."

Integrating Customer Value Considerations into Predictive Modeling

This is a good article about how to measure "success" in applied predictive modeling. The example is in the telecommunications industry, but the "valuable customer" approach can be used in any industry.

Thursday, August 30, 2007

Predictive Analytics and Data Mining

Excellent article by David on the use of data mining and predictive modeling concepts. I believe that expectations and corporate strategy are not properly aligned in this area. Data mining, predictive modeling, and business intelligence give an enterprise the opportunity to build a decision support system that marries the best that technology and science have to offer. It does not replace intuition, but augments it. The best way that I can describe this enterprise system is:
1. A robust back end to handle large amounts of diverse and complex data;
2. Creation of client, industry, and business problem variables that can assist in determining patterns in the data;
3. Utilization of multiple data mining or predictive modeling algorithms to classify the data; and
4. Utilization of statistical techniques to help forecast, partition, or determine areas with common patterns.

On the Advantages and Disadvantages of BI Search

Stephen has written an easy-to-read article on the challenges for next-generation BI. Let me add that text-mining technologies are improving constantly. We have seen it with Yahoo and Google and their association algorithms when you start typing in the search bar. As individual PCs, laptops, and portable handheld devices become embedded with intelligent agents, we will start seeing the future unfold right before our eyes. At the same time, you will see servers with the capacity to analyze the information from the intelligent agents. This is exciting!

Tuesday, August 21, 2007

Paper Kills: Transforming Health and Healthcare with Information Technology

If you have a role in healthcare strategy or data mining, this book is a must-read. It features thought leaders like Dr. Brandon Savage at GE Healthcare. Once medical records are transformed into digital form, data mining and healthcare analytics will be at the core of the vision for the future of healthcare in the US. Hence, what we are working on today will be one of the building blocks of this vision.

Monday, August 20, 2007

Donald Farmer on Data Mining

Donald and his team at Microsoft are first-class professionals in data mining. If you have not visited Donald's blog, I recommend that you do.

Donald's blog:

Look at his data visualization music video link!

Thursday, August 16, 2007

Technology: Is Data Mining Misguided?

When I read this article, I see clear confusion about what to expect from data mining technologies and how they should interact with statistical methodologies. The purpose of data mining should be to create a classification (think of a list of items in a particular order: 1, 2, 3, 4, 5...). This classification is based on a value that is expressed as a probability. Once you have a good measurement tool (this is what data mining should do for you), you apply statistical techniques (distribution, cluster, cause-and-effect analysis, correlation) to determine the areas that should "group" together (using relevant discrete and numerical variables, including but not limited to the data mining value obtained). Once you have determined the areas you want to study, you use the data mining value (and other variables) and statistical methods to make your recommendations. Again, the process is: 1. variables, 2. data mining models, 3. determination of areas of classification, 4. statistical methods, and 5. recommendations.
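
A minimal sketch of steps 2 through 4 of that process, under loud assumptions: the probability scores are made up, the record ids and the function names (`rank_and_band`, `summarize`) are hypothetical, and equal-size bands stand in for whatever grouping technique you would actually choose. It only shows the shape of the idea: score, rank, group, then apply a statistic per group.

```python
from statistics import mean

def rank_and_band(scores, n_bands=3):
    """Rank records by data mining score and split the ranking into bands.

    scores: dict mapping a record id to a probability in [0, 1]
            (hypothetical output of a data mining model).
    Returns a list of bands, highest scores first; each band is a list
    of (record_id, score) pairs.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    # Ceiling division so every record lands in a band; at least 1 per band.
    size = max(1, -(-len(ranked) // n_bands))
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]

def summarize(bands):
    """A simple statistic per band: the mean score, rounded to 3 places."""
    return [round(mean(score for _, score in band), 3) for band in bands]

# Hypothetical scores for six records.
scores = {"a": 0.91, "b": 0.15, "c": 0.67, "d": 0.42, "e": 0.88, "f": 0.09}
bands = rank_and_band(scores)
band_means = summarize(bands)
```

In practice the banding step would be replaced by a proper statistical grouping (cluster or distribution analysis) that uses other variables in addition to the score.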

The change management challenge is to get users of data mining to understand that it is a process and that, for it to work, you need to invest resources (mostly time and technology).

Wednesday, August 15, 2007

Google, Microsoft and the glacial healthcare revolution

Good article on ZDNet that explains how Microsoft and Google are competing in their strategic initiatives in the healthcare industry. I believe that the main issue is how to effectively aggregate and find value in the vast amount of healthcare data. I think that the solution is going to be a combination of predictive modeling, data mining, powerful servers, and artificial intelligence tools that are connected through the Internet. I am honored to be a participant in this effort.

Thursday, August 02, 2007

Korean stem cell fraud masked a true advance

The stem cell fraud case in Korea shows how scientific fraud can actually hold back progress. If Dr. Hwang had been careful in his methodology and reporting, he would still be considered a reputable scientist. Lesson to be learned: be careful in your methodology and even more careful in reporting your findings.

Monday, July 30, 2007

Genetic breakthrough in multiple sclerosis -- biggest for decades

This is what data mining and predictive modeling are all about: a tool for subject matter experts to identify "new suspects". Once predictive modeling helps identify new suspects, the subject matter experts apply their knowledge to determine whether this has value in their field.

Monday, July 23, 2007

New processors present problems, payoff

This article presents the new challenge and opportunity in designing microprocessors. New operating systems will be needed to optimize the utilization of these microprocessors. In my opinion, the combination of data mining technologies that allow "automatic data mining (or predictive modeling) factories", intelligent agents, and parallel computing is going to be the fundamental building block in addressing this challenge. Those technologies, combined with gaming, simulation, and other visualization technologies, will be part of the ingredients needed for this leap into the future.

Conceptually I think that it will be like this:
  1. Data mining technologies will provide the fundamentals of pattern and error detection. Due to the complexity and diversity of the rich data environment that we currently face we will need the ability to have part of this technology embedded into any program, and we will probably need multiple and different data mining models analyzing data simultaneously so as to customize the needs of the end users;
  2. Intelligent mobile agent technologies would be fundamental to access and process data from servers, mainframes, and handheld devices like cellphones;
  3. Web based technologies will be fundamental in finding patterns and in improving remote communications;
  4. Parallel computing technologies will be needed to optimize the processing of large quantities of data; and
  5. Visualization technologies that make complex patterns easily understood, while simultaneously adhering to established laws of nature (e.g., medicine or physics) or previous experience (business rules), would also be a keystone in this endeavor.

Our biggest challenge is going to be to reach out across multiple disciplines and technologies to integrate all these technologies into a great schema. In this sense we are all pioneers. We bring different skill sets that, combined, will mark the path for others to follow. It will not be easy, but it will be worthwhile!

Wednesday, July 11, 2007

Understanding Molecular Imaging

GE Healthcare is correct in its assessment that an enterprise that can track molecular changes in cells and link them to disease progression will be demonstrating "the power of molecular imaging". I believe that web analytics algorithms and software are what will make this step possible. The reason is that web analytics algorithms allow us to predict a variable (e.g., disease) given a series of inputs (medical procedures and other diagnoses) over a sequence.

Web Analytics and Healthcare: Disease Progression

We are starting to develop a healthcare model for disease progression prediction using the Microsoft Sequence Clustering algorithm in SQL Server 2005 Analysis Services. It seems to work well, but I would like to make a comparison with other algorithms. I was wondering if anyone in the community knows how we can obtain Google's permission to use (or adapt) their Web Analytics algorithm for disease progression prediction, or if anyone has any other suggestion for Web Analytics software that we could try. We have the largest private payer healthcare database in the U.S., so we need robust algorithms.
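
To make "predicting over a sequence" concrete, here is a minimal sketch of a first-order Markov chain over event sequences. This is not the Microsoft Sequence Clustering implementation (which, per Microsoft's documentation, combines Markov chain analysis with clustering); it only illustrates the transition-probability part. The event names and patient histories are made up, and `transition_probabilities` and `most_likely_next` are hypothetical helper names.

```python
from collections import Counter, defaultdict

def transition_probabilities(sequences):
    """Estimate first-order transition probabilities P(next | current)."""
    counts = defaultdict(Counter)
    for seq in sequences:
        # Pair each event with the event that immediately follows it.
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    return {
        state: {nxt: n / sum(c.values()) for nxt, n in c.items()}
        for state, c in counts.items()
    }

def most_likely_next(probs, state):
    """Return the most probable next event after `state`, or None if unseen."""
    options = probs.get(state)
    if not options:
        return None
    return max(options, key=options.get)

# Made-up event sequences (one per patient), oldest event first.
histories = [
    ["checkup", "lab_test", "diagnosis_A", "treatment_X"],
    ["checkup", "lab_test", "diagnosis_A", "treatment_Y"],
    ["checkup", "diagnosis_B"],
    ["lab_test", "diagnosis_A", "treatment_X"],
]
probs = transition_probabilities(histories)
prediction = most_likely_next(probs, "diagnosis_A")
```

A first-order chain only looks one step back; comparing algorithms, as the post suggests, would include models that use longer histories.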

Monday, July 09, 2007

Moving Closer To Solving Lou Gehrig's Disease Mystery

This is an area where I hope predictive modeling and data mining can make a difference. If we can model linear disease progression at the cellular level, we might be able to diagnose and prevent ALS before its onset.

Thursday, June 21, 2007

Web Analytics: Future Applications in Predictive Modeling

A lot of time and effort is being channeled into the area of web analytics. This term refers to:
“[t]he measurement of data as it relates to an Internet site, including the behavior of visitors, the amount of traffic, the conversion rates, web server performance, user experience, and other information in order to understand and proof of results and continually improve the results of a site towards a set of objectives.”

Since web analytics is another area of predictive modeling, we must ask whether the methodologies, analytics software, and visualization tools developed in web analytics could have an impact in other industries that use predictive modeling, like healthcare, banking, insurance, retail, and manufacturing. I think that the processes and software developed for web analytics will ultimately be used in many other industries because the intersection of the Internet and other industries is already a reality.

Predictive modeling and web analytics have the same objective: to provide a measurement (or baseline) and to predict future behavior. One of the key contributions of web analytics has been software that can withstand the rigors of commercial use. The scalability components of web analytics are crucial for other industries in which large databases have become the norm.

Another significant contribution of web analytics to predictive modeling is the ability to come together and provide a series of metrics and benchmarks for the industry. Although some may disagree with this assessment, if we look at the history of the healthcare industry it is apparent that the inability to agree upon benchmarks and metrics has negatively impacted the cost of healthcare in the United States. Moreover, those involved in web analytics could give industries like banking, insurance, and retail an innovative new look at what needs to be measured.

A third contribution of web analytics to predictive analytics is the healthy, spirited, and robust exchange in the area of privacy. The Internet has raised serious, relevant, and pertinent questions regarding privacy that other industries could find beneficial.

A fourth contribution of web analytics to predictive modeling is the development of new return on investment (ROI) models in business. As companies adopt these ROI models for their advertising, new media, and marketing strategies, they may find that these models are also applicable to other lines of business.

Last but not least, web analytics has contributed a new set of visualization tools that summarize previously hidden nuggets of gold in a way that can be easily understood and acted upon.

Tuesday, June 19, 2007

Geovisual Analytics and Crisis Management

Good article about geovisual analytics. Good use of I2 and GIS technologies.

NIH-NSF Visualization Research Challenges Report

This article is for all the developers and scientists working on visualization tools in analytics and data mining. Enjoy!

Monday, June 18, 2007

Friday, June 15, 2007

What Data Mining Can and Can't Do

I include this article because Peter Fader is an expert in behavioral predictive modeling. The caveat is that there is a difference between predicting behavior and predicting patterns that are not necessarily related to behavior. For example, in molecular genetics we try to predict how a gene or a chemical substance impacts the physiology of a person. These reactions are mostly physical instead of behavior oriented. Nevertheless, I agree with Peter that executives' expectations of data mining are out of proportion to the investment in predictive modeling. Predictive modeling using Excel spreadsheets might work for behavioral analysis, but I do not think it will work in the health care and biotechnology industries. When I was practicing law I used to refer to this as the "what is at stake syndrome". In a civil case it is money, but in a criminal case it is a person's freedom. Experience has taught me that generalizations make a great sound bite, but could be dangerous in the real world.

Wednesday, June 13, 2007

Evaluation of noise reduction techniques in the splice junction recognition problem

The authors have done a good job in evaluating noise reduction techniques using pre-processing algorithms in large genetic databases which are characterized by the presence of noisy data which can affect the performance of data mining processes.

A review of symbolic analysis of experimental data

This article suggests time-series analysis as a way to reduce noise in large databases when doing analysis. My first impression was, "You must be kidding; time series have nothing to do with noise reduction." Then I did an experiment using my 4.7 terabytes of data and found that a time-series analysis could detect the cause of noise in my sample data (or training set). When I re-read the article after the experiment, I found that this methodology is for processes that are non-linear and possibly chaotic. I am using healthcare data that is non-linear and chaotic. I found that the time-series analysis was a good methodology to identify noise in the training set for the data tags=1. I still need to do a lot of reverse engineering to understand the why, but in the meantime I thought this was worth passing on.
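
For readers new to symbolic analysis, here is a minimal sketch of the general symbolization idea: turn a numeric series into symbols, slide a window to form short "words", and look at which words are rare. Everything specific here is an assumption made for illustration and is not the experiment described above: the binary partition at the median, the word length of 3, the made-up series, and the hypothetical function names.

```python
from collections import Counter
from statistics import median

def symbolize(series):
    """Map each value to 1 if it is above the series median, else 0."""
    cut = median(series)
    return [1 if x > cut else 0 for x in series]

def word_counts(symbols, length=3):
    """Count overlapping words (tuples of `length` consecutive symbols)."""
    words = [tuple(symbols[i:i + length])
             for i in range(len(symbols) - length + 1)]
    return Counter(words)

def rare_word_positions(symbols, length=3, max_count=1):
    """Start indices of words that occur at most `max_count` times."""
    counts = word_counts(symbols, length)
    return [
        i for i in range(len(symbols) - length + 1)
        if counts[tuple(symbols[i:i + length])] <= max_count
    ]

# Made-up series: a regular low/high alternation with one disturbed stretch.
series = [1, 9, 1, 9, 1, 1, 9, 9, 1, 9, 1, 9]
symbols = symbolize(series)
suspects = rare_word_positions(symbols)
```

The rare words cluster around the disturbed stretch, which is the kind of location one would then inspect for noise in a training set.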

Enhancing Data Analysis with Noise Removal

I am working on noise reduction for an enterprise data mining model, and I thought that this was a good overall article on the different applicable techniques.

Tuesday, June 12, 2007

Incremental Mining of Sequential Patterns in Large Databases

The fundamentals of this algorithm could be used in large databases.

The problem: "As databases evolve the problem of maintaining sequential patterns over a significantly long period of time becomes essential, since a large number of new records may be added to a database. To reflect the current state of the database where previous sequential patterns would become irrelevant and new sequential patterns might appear, there is a need for efficient algorithms to update, maintain and manage the information discovered [12]. Several efficient algorithms for maintaining association rules have been developed [12–15]. Nevertheless, the problem of maintaining sequential patterns is much more complicated than maintaining association rules, since transaction cutting and sequence permutation have to be taken into account [16]."

The proposed solution: "This method is based on the discovery of frequent sequences by only considering frequent sequences obtained by an earlier mining step. By proposing an iterative approach based only on such frequent sequences we are able to handle large databases without having to maintain negative border information, which was proved to be very memory consuming [16]. Maintaining such a border is well adapted to incremental association mining [26,19], where association rules are only intended to discover intra-transaction patterns (itemsets). Nevertheless, in sequence mining, we also have to discover inter-transaction patterns (sequences) and the set of all frequent sequences is an unbounded superset of the set of frequent itemsets (bounded) [16]. The main consequence is that such approaches are very limited by the negative border size."
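
A minimal sketch of why sequential patterns need maintenance as a database grows. This is a deliberately simplified stand-in and not the algorithm from the paper: each sequence is a plain ordered list of single items (no itemsets), and the code only re-checks previously frequent patterns against the enlarged database, without discovering new ones. The data and function names are made up for illustration.

```python
def is_subsequence(pattern, sequence):
    """True if the items of `pattern` appear in `sequence` in the same order."""
    it = iter(sequence)
    # `item in it` consumes the iterator, so order is enforced.
    return all(item in it for item in pattern)

def support(pattern, database):
    """Fraction of sequences in `database` that contain `pattern`."""
    return sum(is_subsequence(pattern, s) for s in database) / len(database)

def still_frequent(previous_frequent, database, min_support):
    """Re-check only previously frequent patterns against the updated database."""
    return [p for p in previous_frequent if support(p, database) >= min_support]

# Made-up data: each sequence is one customer's ordered list of items.
old_db = [["a", "b", "c"], ["a", "c"], ["b", "c"], ["a", "b"]]
increment = [["c", "a"], ["c", "b"], ["c"], ["b", "a"]]

# Patterns with support >= 0.5 in old_db.
previous_frequent = [("a",), ("b",), ("c",), ("a", "b"), ("a", "c"), ("b", "c")]

kept = still_frequent(previous_frequent, old_db + increment, min_support=0.5)
```

After the increment, every two-item pattern drops below the threshold while the single items survive, which is the "previous sequential patterns would become irrelevant" situation the quoted problem statement describes.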

Friday, June 08, 2007

Molecular Staging for Survival Prediction of Colorectal Cancer Patients

This article shows the potential of data mining for disease prognosis using microarray (SAM) data.

Go Stanford! That's where one of my daughters graduated from, and SAM is a product of Stanford University.

The treatment of missing values and its effect in the classifier accuracy

Good paper on the effects of missing values on the accuracy of your model. The organization of this paper would be improved if the authors had included their recommendation as part of the Summary.

Nevertheless, this is the crucial recommendation (p. 8): "We recommend that we can deal with datasets having up to 20 % of missing values. For the CD (Complete Deletion) method we have up to 60 % of instances containing missing values and still have a reasonable performance."
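
A minimal sketch of how you might check a dataset against those two figures before modeling. The dataset is made up, `None` is assumed to mark a missing value, and the function names are hypothetical; the thresholds themselves come from the quoted recommendation, not from this code.

```python
def missing_rate(rows):
    """Fraction of all cells that are missing (None)."""
    cells = [value for row in rows for value in row]
    return sum(value is None for value in cells) / len(cells)

def rows_with_missing_rate(rows):
    """Fraction of rows (instances) containing at least one missing value."""
    return sum(any(v is None for v in row) for row in rows) / len(rows)

def complete_deletion(rows):
    """Complete Deletion (CD): keep only rows with no missing values."""
    return [row for row in rows if all(v is not None for v in row)]

# Made-up dataset: 5 instances, 4 attributes, None marks a missing value.
data = [
    [5.1, 3.5, 1.4, 0.2],
    [4.9, None, 1.4, 0.2],
    [6.2, 2.9, None, 1.3],
    [5.9, 3.0, 5.1, 1.8],
    [None, 3.1, 4.7, 1.5],
]
cell_rate = missing_rate(data)            # share of missing cells
row_rate = rows_with_missing_rate(data)   # share of instances affected
complete = complete_deletion(data)        # what CD would leave for training
```

Note how a modest share of missing cells can still touch a large share of instances, which is why the two percentages in the quote are stated separately.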

For healthcare, pharma, and biotech data this paper is important because of the complexity and diversity of this data.

An Assessment of Accuracy, Error, and Conflict with Support Values from

This article is for experienced biostatisticians. Nevertheless, this is the interpretation for the layman:
When molecular biology theories are tested with real data, we need to be cautious in reading bootstrap values if we are assuming an underestimation of the actual support. For example (my example is not in this article), if you are using a decision tree vs. a logistic regression Bayesian model, be cautious in how you assess the accuracy of your model, since the decision tree tends to underestimate and Bayesian models tend to overestimate.

I have found that to increase a classifier's accuracy for a model, this type of distinction (non-parametric bootstrap values vs. Bayesian probabilities) is fundamental.

Phase II Studies: Which is Worse, False Positive or False Negative

A short but powerful article that helps in understanding the effects of Type I and Type II errors in clinical trials.
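
As a quick reminder of the two error types, here is a minimal sketch that estimates them from a confusion table. The counts are made up and `error_rates` is a hypothetical helper; a false positive corresponds to a Type I error and a false negative to a Type II error.

```python
def error_rates(tp, fp, tn, fn):
    """Return (false positive rate, false negative rate) from confusion counts.

    false positive rate = FP / (FP + TN)  -> estimate of the Type I error rate
    false negative rate = FN / (FN + TP)  -> estimate of the Type II error rate
    """
    return fp / (fp + tn), fn / (fn + tp)

# Made-up counts for illustration only.
fpr, fnr = error_rates(tp=40, fp=5, tn=45, fn=10)

# Statistical power is the complement of the Type II error rate.
power = 1 - fnr
```

Which of the two errors is worse depends on what is at stake in each case, which is the question the article title asks.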

Monday, May 21, 2007

SPSS Launches Enhanced Predictive Analytics Platform

I have not tried this product yet, but SPSS tends to have good products in predictive modeling.

The Advantages of Smart Data Mining

Good general article about data mining in the retail and POS industry. Free registration required.

Data-mining moves into the mainstream, in search of profit

Good general article about how data mining is moving into different fields.

Wednesday, May 16, 2007

General Healthcare Data Mining Model

We went into production on May 7 with our General Healthcare Data Mining Model. Our metrics comparison (without giving intellectual property away) is as follows:

Metric 1 - 5.15% (old) with new model 20.3%

Metric 2 - 6.06% (old) with new model 53.4%
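
For readers who want the size of those changes spelled out, here is the arithmetic as a short sketch. It uses only the four percentages above; what the metrics measure is not disclosed, so `compare` (a hypothetical helper) just reports the percentage-point gain and the new-to-old ratio.

```python
def compare(old_pct, new_pct):
    """Return (percentage-point gain, ratio of new to old)."""
    return new_pct - old_pct, new_pct / old_pct

# Metric 1: 5.15% (old) vs. 20.3% (new model)
gain_1, ratio_1 = compare(5.15, 20.3)

# Metric 2: 6.06% (old) vs. 53.4% (new model)
gain_2, ratio_2 = compare(6.06, 53.4)
```

That works out to roughly 15 points (about 3.9 times the old value) for Metric 1 and roughly 47 points (about 8.8 times) for Metric 2.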

Right now we are fine-tuning the model and putting our findings in writing, since we must make sure that we have good documentation. We processed over 396 million claims in a 4.7 TB environment (SQL Server 2005). We can refresh every month right now, and the goal is to refresh once a week in the next couple of months.

The model can be used in healthcare, pharmaceutical, and biotech industries.

Tuesday, May 01, 2007

Doctors test gene therapy to treat blindness

This is the type of therapy that, once there is a single success, is going to revolutionize the way that we look at data mining in the health care field.

Tuesday, April 24, 2007

Father and me

Father and me

He was a giant when on one knee he would listen to me
He was wise when over the years I learned to listen
He was a leader when he showed me by example
He defined courage in the most difficult struggle

I will be a giant like him when on bended knee
I will be wise by becoming a listener
I will lead by example
I will have the courage to be his son

Monday, April 23, 2007

Studies back Parkinson’s and pesticides link

This article touched a very personal issue that explains why I have invested the last seven years of my professional life to perfect data mining in healthcare and in the biotech industry. My father was a pure research scientist in the area of pesticides research for the last 10+ years of his professional career. About 5 years ago he died of Parkinson's disease. As I tried to understand his disease I discovered that all the scientists and lab workers that worked for him in the Pesticides Laboratories died of Parkinson's disease too! I decided that my experience in data mining, mathematics, business and law could be used to help create a health care data mining model that could have multiple uses: from finding hidden patterns for outcomes research, or molecular biology research, or healthcare insurance claims.

So, friends and colleagues, that is my goal: to create, and make sure that as many people as possible have access to, a healthcare data mining model that has multiple uses. I thought that if I created this tool, I could help scientists and companies find some type of relief for some terrible diseases. Eighteen months after Dad died, my mother died of ALS. Now you know what drives me.

Thursday, March 29, 2007

Predicting breast cancer survivability:comparison 3 models

I like that this article uses good methodology for the comparison of the three different models. On the other hand, my experience tells me that it is the combination of the three models that increases the predictability in any enterprise model.
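
A minimal sketch of one common way to combine three models: a majority vote per record. The labels and predictions are made up and deliberately constructed so that no two models are wrong on the same record, which is the best case for voting; with correlated errors the gain shrinks or disappears. The function names are hypothetical.

```python
from collections import Counter

def majority_vote(*model_predictions):
    """Combine per-record predictions from several models by majority vote.

    With three models and two classes there are no ties.
    """
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*model_predictions)]

def accuracy(predicted, actual):
    """Fraction of records where the prediction matches the actual label."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

# Made-up labels (1 = survived, 0 = did not) and made-up model outputs.
actual  = [1, 0, 1, 1, 0, 1, 0, 0]
model_a = [1, 0, 1, 0, 0, 1, 1, 0]
model_b = [1, 1, 1, 1, 0, 0, 0, 0]
model_c = [0, 0, 1, 1, 0, 1, 0, 1]

ensemble = majority_vote(model_a, model_b, model_c)
individual = [accuracy(m, actual) for m in (model_a, model_b, model_c)]
combined = accuracy(ensemble, actual)
```

Here each model alone gets 6 of 8 records right, while the vote gets all 8, because each model's mistakes are outvoted by the other two.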

Intel details new chip technology

The first part of the puzzle of intelligent agents and data mining is already taking place.

Wednesday, March 28, 2007

Successful Data Mining Applications

This is for those who are interested in which industries use data mining successfully. There is obviously a lot of growth potential.

Mining the Genome

The basic article about bioinformatics. It includes the issues and challenges.

Mining biotech's data mother lode

A good article that shows the utilization of data mining in the biotechnology (biotech) industry.

Pellucid Agent Architecture for Administration Based Processes

Another application of data mining and intelligent agents.

Application of Data Mining and Intelligent Agent Technologies to Concurrent Engineering

This is a good article about a potential application of data mining and intelligent agents in the manufacturing industry.

Tuesday, March 27, 2007

Intelligent Agents

And the Future of Data Mining

An intelligent agent is: (1) a software agent if it is a piece of software that acts in a relationship of agency for a user or other program; or (2) an intelligent actor if it interacts with its environment. The first definition refers mostly to data mining, while the latter refers to a robot-like machine.

Although in the science and technology communities we tend to separate both definitions of intelligent agent, the advances in computer processors are bringing both environments closer to one another. I imagine the integration of an Electronic Medical Record (EMR) device like GE Centricity and healthcare-specific data mining algorithms using Microsoft SQL 2005 Analysis Services into machine learning hardware that will assist physicians and other healthcare providers in improving and measuring clinical outcomes in real time.

This type of technology could also be applicable to PDAs and trading in the financial markets, to the purchasing of goods and services (brick and mortar or through the Internet), or to the decision-making process of what food to buy or movie to watch. The technological challenges will be matched by the advances of companies like Intel and Motorola in designing smaller and faster devices with greater storage capacity. Other challenges involve data network security and privacy issues which affect consumers. These challenges are great, but without any doubt the framework to integrate intelligent mobile agents and data mining is already in place.

Strategic alliances in the technology industry are no longer limited to the industrialized countries, but are a worldwide phenomenon. They are not limited to the large technology companies either. I will not predict who, when, or what industries and what companies will benefit from the merger of both technologies. I do predict that we should see the first fruits of the merger of both technologies in the next eighteen to twenty-four months.

SQL 2005 Analysis Services Project: Training Set

The main reason why a SQL 2005 Analysis Services project fails is a lack of understanding of the purpose and importance of the training set in data mining. The Training Set takes the place of the scientific theory in data mining. The scientific theory refers to facts known to be true or false. The key is specificity. For example, if you are trying to find out what cancer drugs have the best chemical compounds to fight off cancer, you must have the specific chemical compounds and their associated values for each drug. These are called inputs in Analysis Services Data Mining Structures (DMS). The second step is to decide what you want to predict. Do you want to predict a discrete state (yes or no)? Do you want to predict a numerical continuous value (e.g., the price of a particular item)? The third step is to determine your key column, or the unique identifier for a particular row.
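
The three decisions above (inputs, what to predict, key column) can be written down and checked before any modeling starts. The sketch below is plain Python, not Analysis Services syntax; the class, the function names, the column names, and the drug rows are all hypothetical and made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class TrainingSetSpec:
    key_column: str       # unique identifier for each row
    input_columns: list   # known facts used to make the prediction
    predict_column: str   # the value the model should predict

def validate(rows, spec):
    """Check that every row has all columns and that keys are unique."""
    needed = [spec.key_column, *spec.input_columns, spec.predict_column]
    for row in rows:
        missing = [c for c in needed if c not in row]
        if missing:
            raise ValueError(f"row {row} is missing columns {missing}")
    keys = [row[spec.key_column] for row in rows]
    if len(keys) != len(set(keys)):
        raise ValueError("key column values must be unique")
    return True

def target_kind(rows, spec):
    """'continuous' if every target value is a number, otherwise 'discrete'."""
    values = [row[spec.predict_column] for row in rows]
    numeric = all(isinstance(v, (int, float)) and not isinstance(v, bool)
                  for v in values)
    return "continuous" if numeric else "discrete"

# Made-up training rows: one row per drug.
spec = TrainingSetSpec("drug_id", ["compound_a_mg", "compound_b_mg"], "effective")
rows = [
    {"drug_id": "D1", "compound_a_mg": 10.0, "compound_b_mg": 2.5, "effective": "yes"},
    {"drug_id": "D2", "compound_a_mg": 4.0, "compound_b_mg": 8.0, "effective": "no"},
]
is_valid = validate(rows, spec)
kind = target_kind(rows, spec)
```

Writing the specification down first forces the specificity the post asks for: if you cannot name the key, the inputs, and the one thing to predict, the theory is not yet specific enough.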

Always ask yourself: what am I trying to predict, or what is the scientific theory? The theory and your training set are always specific to what you want to predict. Remember, Microsoft is providing the tool but you must provide the specific theory.

Once you successfully build one model, you can use that model to predict similarly situated cases. If you are selling fruit, build the model for selling apples first. Once this model is working, change the training set to reflect oranges and apply the same model to oranges. The combination of all your models is your data mining enterprise system.

Friday, March 23, 2007

Microsoft SQL 2005 Analysis Services: Ten Best Practices©

By Alberto Roldan

A number of data mining, executive management, and IT professionals seem to be experiencing the same issue with Microsoft SQL 2005 Analysis Services (MAS): How do I make this product work for my enterprise? These ten best practices should provide some assistance in dealing with this issue.

1. Training: Any organization using this product must have at least one person who has received training in SQL 2005 Analysis Services, and the basic principles of data mining and predictive modeling.

a. Do not expect the Information Technology Department to create an enterprise data mining project without the proper training in the technology and the science of data mining. One of the reasons for the lack of success of data mining projects is that the IT department understands neither the technology nor the science behind data mining. MAS makes the development of an enterprise data mining project feasible, if at least one member of the staff understands the technology and science behind it.

i. Potential Solution – Since this is cutting edge technology which merges science and technology, the Chief Technology Officer, Chief Information Officer, and Enterprise Architect must receive some training in this area. They do not need to be experts, but they must understand the basic principles behind it. This training will make sure that expectations and strategic business initiatives properly align. Also, make sure at least two or more software engineers take the online tutorials.

2. Strategic investment and not simply a cost center: Most IT projects are linear (i.e., project scope, charter, resource allocation, design, development, QC, testing and deployment). Data mining is always a cyclical project because it is research and development. It is heuristic by nature. Organizations invest in enterprise data mining because they realize that the amount of data must be managed differently to transform data into actionable information. The expectation that we are going to do something differently but use the same practices that we have used in the past is an oxymoron.

a. Plan data mining projects by incorporating research and development techniques into IT software engineering best practices. You need a specific theory for your research, proposed research steps, a timetable, and dedicated resources. Also, document all the steps, define your metrics, complete the research, QC the results, evaluate results, and determine the next step for research in your continuous improvement process.

i. Potential Solution – Concentrate on your theory for research. The more specific your theory, the greater the probability of conducting an experiment that gives you specific results you can use (whether they confirm your theory or not). If the theory is something to the effect of "save the planet and cure cancer," I will guarantee you that it will not work. Never underestimate the ability to transform a strategic business initiative into multiple theories for data mining research.

3. Dedicated Resources: One of the missions of any IT department is to prove added value to the enterprise by reducing costs. As a consequence, the standard in some organizations is to share resources as much as possible in order to decrease costs. Nevertheless, the architecture of an enterprise data mining project requires, at a minimum, its own server due to the amount of processing time required.

a. The SQL 2005 Server is a powerful tool that has the ability to accommodate many users and serve multiple purposes in the enterprise. As with any other resource, we should always strive to optimize it. Nevertheless, I have found that in a metadata environment, competition between operational projects and data mining research projects within the same environment causes unnecessary friction and delays for all parties.

i. Potential Solution – The best practice is separate and dedicated IT resources (staff, hardware, and software). Nevertheless, if this is not feasible, a detailed communication and resource utilization plan must be implemented. The goals and expectations of any data mining project must be adjusted to reflect the additional time (between 25% and 50%) required to complete a data mining project on shared resources.

4. "One Theory at a Time" Rule: You can use this tool to address multiple business issues, but it would probably require multiple models for multiple issues. If you are looking for the needle in the haystack, you must consider that you could have multiple haystacks and multiple needles. Therefore, you are in a situation that requires multiple models for the haystacks and the needles. The complexity of this issue cannot be underestimated.

a. Many enterprises spend vast resources in their organizational planning and structure. The reason for this expenditure is they understand that their business is complex and requires a clear chain of command to successfully implement strategic and operational initiatives. Nevertheless, these same enterprises fail to recognize this complexity when attempting to implement data mining systems.

i. Potential Solution – Study your organizational chart. This could be a roadmap as to the priorities for a successful enterprise data mining system. It will assist in defining the theories that you want to test in a research environment.

5. Models are never generic; they are always specific: The terms data mining and artificial intelligence are sometimes used out of context in the business environment. These terms tend to be used more in a science fiction context than in a business context. This contributes to unrealistic expectations of what data mining can do to assist in implementing strategic business initiatives. It is imperative that the CIO and the CTO have a basic understanding of the science and technology behind data mining so the executive management team makes well-informed decisions about the incorporation of these tools into any initiative.

a. A data mining system is not going to lower your operational costs overnight by ten percent. A data mining project is not magic, and like any other strategic initiative, it involves planning, knowledge management, and change management. Expectations must be realistic relative to the size of the investment. Investment is not just hardware and software; it also involves training and making sure that you have the right people to do the job well. It requires being intellectually honest about what your needs are, and candid about what you can expect from the business given the size of your investment.

i. Potential Solution: The first step before making an investment is to evaluate where the organization currently is, how it got there, the nature of the competition, and what you think a data mining system can do for your organization. The axiom that those who fail to plan are planning for failure is applicable in data mining. Executive management support and sponsorship is a keystone of the process. Do not underestimate the challenges and cyclical nature of this type of initiative, but make sure that the message throughout the organization is clear: we will achieve an enterprise data mining system because it is part of how we intend to stay competitive in the future by lowering costs and increasing revenues.

6. Three Main Categories: The variables (input), training set (sample), and data mining structures are the keystones of Analysis Services. A clear understanding of these three areas will assist in the creation of a data mining system.

a. IT architects and developers who do not understand the three main components of Analysis Services will have a difficult time successfully completing a single data mining project. It will be impossible for them to successfully design and implement an enterprise data mining system. The knowledge required is a basic understanding of the technology and the science of data mining and predictive modeling. Acquiring this knowledge does not need to be extremely costly or time-consuming. Nevertheless, expecting the IT department to successfully complete this type of project without allowing them time to acquire this knowledge and training is putting them in a position to fail.

i. Potential Solution: The technical and scientific knowledge to successfully complete a data mining project can be acquired through training (classes or online), by engaging the services of a consultant with a proven record of completing data mining projects, or by self-motivated reading. I would suggest looking within your own organization for individuals with at least a statistics or mathematics background, or those who have an interest in the sciences (genetics, biology, astronomy, physics, or chemistry). Those individuals could have a predisposition to quickly learn the science and methodology behind the technology of data mining.

7. Statistically accepted best practices as metrics: As this product has joined science and technology but is cyclical rather than linear, we must incorporate statistically accepted best practices if we want to have the continuous improvement process required in research. Besides traditional business metrics (cost per employee, revenue per employee, and profits per employee) and IT metrics (reduction of hours to complete a process, increase in revenues, CPU utilization per employee, and total system downtime), we now need to incorporate some scientific metrics that will assist in improving a data mining system. An understanding of scientific metrics is important to measure success, improvement, or lack of improvement.

a. It is a change for organizations that have never had an organizational research component to apply an additional set of metrics to gauge the performance of a data mining system. The tendency is to use only the same metrics that have been used in the past to measure the performance of a data mining system. The error is in assuming that data mining is a linear type of project rather than a cyclical one. The best example is a data mining system that seems to be a failure from the business point of view, but is successful when measured by statistical metrics. In this scenario, the statistical metrics can help diagnose the problem.

i. Potential Solution: I would suggest using an add-in statistical software package to determine the Variance Inflation Factor (VIF) for the numerical variables, to check whether a particular variable is having an undue influence in your model. Also, I would suggest measuring Type II error to determine the predictive qualities of your model (i.e., what your model is measuring).

8. Design of Experiments: In a research environment the projects are cyclical. The creation of a successful data mining system requires research, and research requires experimentation. This is one of the areas where business and science seem to conflict. Science expects that some experiments will not be successful, and businesses tend to be risk averse. Nevertheless, businesses constantly take risks to improve their profitability and growth. Therefore, it is not that businesses do not take risks, but that they want to be able to quantify and qualify the risks.

a. When science and technology join in the business arena some compromises must be made that are beneficial for all the parties. Science and technology cannot operate within a “pure science” mentality, and businesses must face the inherent risks head-on.

i. Potential Solution: If you have invested in the SQL 2005 Server and its software, you have already incurred part of the financial risk. The issue then becomes whether you want to use the server as a simple storage facility or are willing to make an additional investment (i.e., training, knowledge transfer, or a consultant) to use the full potential of this tool. I would suggest starting by having two people on your staff go through all the online tutorials (no shortcuts) and then letting them try to create and deploy a test Analysis Services project using limited data. This process will bring out a series of unanswered questions, and those questions will help you plan the options that you have to acquire the knowledge that you need to design, test, and implement a data mining system. Also, change the name "design of experiments" to "design of research," since it will be more easily understood by others within the enterprise.

9. Quality Control and Testing: Design multiple quality control staging areas during the process. Although Microsoft has made this product in such a way that it writes about ninety percent (90%) of all the code, you will find that you sometimes need to make small modifications and write wrapper code. Also, you will need to optimize the processes if you create specific variables like Z-scores. Lastly, you will find resistance within the IT and Operational organizations of the enterprise to making any changes to the current processes. This resistance to change will require a specific change management strategy.

a. The potential of a successful data mining system in an enterprise tends to create apprehension. This apprehension is rooted in the mistaken belief that if a data mining system is successful it will result in people losing their jobs. In the workplace nothing is as personal as the instability that a potential change could bring if managers and staff alike perceive that their jobs might be in jeopardy should this new system succeed. The implications for QC and testing are immeasurable in terms of utilization of productive time.

i. Potential Solution: Realistic communication, quality control, and testing plans must be developed as part of the initial executive management evaluation of whether or not to design an enterprise data mining system. These plans should include goals and expectations at all levels of the enterprise. Specifically, the communication plan should address the issue of the potential changes in duties and responsibilities of staff and managers. It is a lot easier for managers and staff to buy into this type of strategic initiative if they see the role that they will play during the different stages of an enterprise data mining project.

10. Expectations: The term “high but realistic expectations” does not need to be a contradiction. High expectations refer to the ability of the enterprise to learn to effectively use the tools at its disposal. Realistic expectations refer to the increase in value that a data mining system should give you based on your prior experience in growing and developing the business. Also, expectations should be directly correlated to the investment in training and dedicated resources for a data mining project.

a. Some organizations tend to go to their IT department and ask them to build a data mining or analytics enterprise system that will solve their business issues, as well as all the pressing world issues. The CIO, CTO, or VP of Technology sometimes does not have the knowledge required to explain that a more specific approach is required in data mining. Hence, the failure of data mining projects is often a failure to plan properly with high but realistic expectations.

i. Potential Solution: Microsoft has made a product that streamlines a lot of the designing and developing of a data mining system, but the key is specific planning, knowledge transfer from the business areas to the IT department, and a clear definition of the specific business needs. It is going to take time and effort to put this together. The first step is to develop a plan that takes into consideration the resources and training necessary to use this new tool. This plan should serve as a roadmap of how we are going to implement this initiative.

I hope that you can use some or all of these best practices in using SQL 2005 Analysis Services to create an enterprise analytics or data mining system. The barriers are technical, scientific, and in the change management areas. The potential is immeasurable. Contact:

Monday, February 19, 2007

Predictive Modeling and Microsoft Analysis Services 2005

I have been using this product now for six months. Also, I went to Microsoft and got three days of training from Jamie (thanks!). This is a good product and Microsoft has done an excellent job at bringing data mining to the "masses".

This product is scalable (we are using it on over seven terabytes of data every month) and user friendly. It integrates fairly simply with Reporting Services.

The key to using Analysis Services in a supervised model is the training sample. My main recommendation is that you bring all your data tags into your training sample. To determine the size of your training sample population, multiply the number of data tags by five; your data tags will then represent 20% of your population.
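The sizing rule above can be sketched in a few lines of Python. This is only an illustration under one reading of the rule: it assumes "data tags" means the records that carry a known outcome, and the function name `build_training_sample`, the `multiplier` parameter, and the random draw of untagged records are hypothetical choices, not anything provided by Analysis Services.

```python
import random


def build_training_sample(tagged, untagged, multiplier=5, seed=0):
    """Keep every tagged record and add untagged ones until the sample
    holds len(tagged) * multiplier rows (tagged rows = 1/multiplier)."""
    target_size = len(tagged) * multiplier
    needed = target_size - len(tagged)
    if needed > len(untagged):
        raise ValueError("not enough untagged records to reach the target size")
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    return list(tagged) + rng.sample(list(untagged), needed)


# Made-up data: 200 tagged records and 5,000 untagged ones.
tagged = [("tagged", i) for i in range(200)]
untagged = [("untagged", i) for i in range(5000)]

sample = build_training_sample(tagged, untagged)
tagged_share = sum(1 for kind, _ in sample if kind == "tagged") / len(sample)
print(len(sample), tagged_share)
```

With 200 tagged records the sample holds 1,000 rows, of which the tagged records make up one fifth.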

Another key issue is modifying the algorithm parameters, specifically the maximum states. To determine the maximum number of states in your data, I suggest a combination of partition and distribution analyses. You can also use the Microsoft Decision Tree Algorithm.
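A simple distribution analysis of this kind can be sketched outside the product. The Python below is a rough illustration only: it counts the distinct states of one discrete column and reports how many of the most frequent states are needed to cover a chosen share of the rows. The function names and the 95% coverage threshold are assumptions made for the example, and the sample column is made up.

```python
from collections import Counter


def state_distribution(values):
    """Return (state, count, share) tuples, most frequent state first."""
    counts = Counter(values)
    total = len(values)
    return [(state, n, n / total) for state, n in counts.most_common()]


def states_needed(values, coverage=0.95):
    """Smallest number of most-frequent states covering `coverage` of the rows."""
    threshold = coverage * len(values)
    covered = 0
    for position, (_, n, _) in enumerate(state_distribution(values), start=1):
        covered += n
        if covered >= threshold:
            return position
    return len(set(values))


# Made-up column with four states and a long tail.
column = ["A"] * 90 + ["B"] * 6 + ["C"] * 3 + ["D"] * 1

for state, n, share in state_distribution(column):
    print(state, n, round(share, 2))
print("distinct states:", len(set(column)))
print("states needed for 95% coverage:", states_needed(column))
```

Comparing the number of distinct states with the number needed for most of the coverage gives a starting point for choosing the maximum states setting for that column.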

David did a great job with the data mining algorithms, but those of us who have been in the data mining industry for a long time need more detailed (as well as peer-reviewed) articles about the algorithms. For example, the predict and predict probability functions have outputs that are negative values when this should be a mathematical improbability in an unsupervised model. Even if we filter out all the negative inputs we still get negative output. I think that this is a data type kind of issue, but we are still researching it.

Another issue that is not addressed in the algorithms is whether any variable or input is improperly influencing the predictive output. Specifically, I would prefer that the models give us the VIF value for each input. Otherwise, we may find ourselves in one of those situations that are "too good to be true."
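Until the models report it, VIF can be computed outside the product. The Python sketch below uses the standard definition, VIF = 1 / (1 - R²), where R² comes from regressing one input on all the other inputs. It is a minimal illustration with made-up numbers, not what any particular statistical package does internally, and the function names are hypothetical. It assumes no input column is constant.

```python
from statistics import mean


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: inputs are perfectly collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def r_squared(y, xs):
    """R^2 of an ordinary least squares fit of y on the columns xs plus an intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in xs] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(row[p] * row[q] for row in design) for q in range(k)] for p in range(k)]
    xty = [sum(design[i][p] * y[i] for i in range(n)) for p in range(k)]
    beta = solve_linear(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in design]
    y_bar = mean(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def vif(columns):
    """Return {input name: VIF} for a dict of input name -> list of numbers."""
    result = {}
    for name, y in columns.items():
        others = [v for k, v in columns.items() if k != name]
        r2 = r_squared(y, others)
        result[name] = float("inf") if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return result


# Made-up inputs: "c" is nearly a copy of "a", "b" is unrelated to "a".
inputs = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [1.0, -1.0, -1.0, 1.0],
    "c": [1.1, 1.9, 3.2, 3.9],
}
for name, value in vif(inputs).items():
    print(name, round(value, 1))
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign that an input is largely explained by the others; in this made-up data, "a" and "c" are the ones to look at.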

The last issue is that the number of Type II errors is extremely large in these models (when we apply the trained model to the entire population). Specifically, I am referring to Type II error rates that are greater than 60%!!!
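For readers who want to check this on their own models, here is a minimal Python sketch of the measurement. It assumes that "Type II error" is read as the false negative rate: of the records that truly belong to the target class, the share the model fails to flag. The function name and the sample labels are made up for illustration.

```python
def type_ii_error_rate(actual, predicted):
    """False negatives divided by all actual positives."""
    false_negatives = sum(1 for a, p in zip(actual, predicted) if a and not p)
    true_positives = sum(1 for a, p in zip(actual, predicted) if a and p)
    positives = false_negatives + true_positives
    if positives == 0:
        raise ValueError("no actual positives in the data")
    return false_negatives / positives


# Made-up labels: 1 = belongs to the target class, 0 = does not.
actual = [1, 1, 1, 1, 1, 0, 0]
predicted = [1, 0, 0, 0, 1, 0, 1]

print(type_ii_error_rate(actual, predicted))
```

In this made-up example the model misses three of the five true cases, which is the kind of rate described above.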

Microsoft, through Jamie's group, is providing us with great technical support and I want to congratulate them for their efforts.
