Thursday, March 29, 2007

Predicting breast cancer survivability: comparison of 3 models

I like that this article uses good methodology for comparing the three models. On the other hand, my experience tells me that it is the combination of the three models that increases the predictive power of any enterprise model.
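As a rough illustration of what I mean by combining models, here is a minimal sketch that uses a simple majority vote. The model outputs, the labels, and the function names are all made up for the example; the article itself does not describe this procedure, and majority voting is only one of several ways to combine models.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label chosen by the most models for a single case."""
    counts = Counter(predictions)
    label, _ = counts.most_common(1)[0]
    return label


def combine_models(per_model_predictions):
    """Combine row-aligned prediction lists from several models by majority vote."""
    return [majority_vote(votes) for votes in zip(*per_model_predictions)]


# Made-up outputs from three hypothetical models for the same four cases.
model_a = ["yes", "yes", "no", "yes"]
model_b = ["yes", "no", "no", "yes"]
model_c = ["no", "yes", "no", "yes"]

combined = combine_models([model_a, model_b, model_c])
print(combined)
```

With three models and a yes/no outcome there can never be a tie, which is one practical reason to combine an odd number of models.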

Intel details new chip technology

The first part of the puzzle of intelligent agents and data mining is already taking place.

Wednesday, March 28, 2007

Successful Data Mining Applications

This is for those who are interested in which industries use data mining successfully. There is obviously a lot of growth potential.

Mining the Genome

A basic article about bioinformatics. It includes the issues and challenges.

Mining biotech's data mother lode

A good article that shows the utilization of data mining in the biotechnology (biotech) industry.

Pellucid Agent Architecture for Administration Based Processes

Another application of data mining and intelligent agents

http://ups.savba.sk/parcom/publications/agents/IISAS_IAWTIC-2003.pdf

Application of Data Mining and Intelligent Agent Technologies to Concurrent Engineering

This is a good article about a potential application of data mining and intelligent agents in the manufacturing industry.

http://issel.ee.auth.gr/ktree/Documents/Root%20Folder/ISSEL/Publications/3_MITKAS_IJAM.pdf

Tuesday, March 27, 2007

Intelligent Agents and the Future of Data Mining

An intelligent agent is: (1) a software agent if it is a piece of software that acts in a relationship of agency for a user or other program; or (2) an intelligent actor if it interacts with its environment. The first definition refers mostly to data mining, while the latter refers to a robot-like machine.

Although in the science and technology communities we tend to keep the two definitions of intelligent agent separate, the advances in computer processors are bringing both environments closer to one another. I imagine the integration of an Electronic Medical Record (EMR) system like GE Centricity with healthcare-specific data mining algorithms running on Microsoft SQL 2005 Analysis Services, built into machine learning hardware that will assist physicians and other healthcare providers in improving and measuring clinical outcomes in real time.

This type of technology could also be applicable to PDAs and trading in the financial markets, to the purchasing of goods and services (brick and mortar or through the Internet), or to the decision-making process of what food to buy or movie to watch. The technological challenges will be matched by the advances of companies like Intel and Motorola in designing smaller and faster devices with greater storage capacity. Other challenges involve data network security and privacy issues which affect consumers. These challenges are great, but without any doubt the framework to integrate intelligent mobile agents and data mining is already in place.

Strategic alliances in the technology industry are no longer limited to the industrialized countries, but are a worldwide phenomenon. Nor are they the exclusive realm of the large technology companies. I would not predict who, when, or what industries and what companies will benefit from the merger of both technologies. I do predict that we should see the first fruits of the merger of both technologies in the next eighteen to twenty-four months.

SQL 2005 Analysis Services Project: Training Set

The main reason why a SQL 2005 Analysis Services project fails is a lack of understanding of the purpose and importance of the training set in data mining. The training set takes the place of the scientific theory in data mining. The scientific theory refers to facts known to be true or false. The key is specificity. For example, if you are trying to find out which cancer drugs have the best chemical compounds to fight off cancer, you must have the specific chemical compounds and their associated values for each drug. These are called inputs in Analysis Services Data Mining Structures (DMS), and identifying them is the first step. The second step is to decide what you want to predict. Do you want to predict a discrete state (yes or no)? Do you want to predict a continuous numerical value (e.g., the price of a particular item)? The third step is to determine your key column, the unique identifier for a particular row.
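To make the three steps concrete, here is a minimal sketch written in plain Python rather than in Analysis Services itself. The class name, column names, and sample values are hypothetical and invented for illustration; the point is only that a training set needs named inputs, one thing to predict (discrete or continuous), and a unique key per row.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingSetSpec:
    """Hypothetical description of a training set: inputs, target, and key."""
    key_column: str
    input_columns: tuple
    predict_column: str
    predict_type: str  # "discrete" or "continuous"


def validate_rows(spec, rows):
    """Return a list of problems; an empty list means the rows fit the spec."""
    problems = []
    seen_keys = set()
    for i, row in enumerate(rows):
        key = row.get(spec.key_column)
        if key is None:
            problems.append(f"row {i}: missing key {spec.key_column!r}")
        elif key in seen_keys:
            problems.append(f"row {i}: duplicate key {key!r}")
        else:
            seen_keys.add(key)
        for col in spec.input_columns:
            if row.get(col) is None:
                problems.append(f"row {i}: missing input {col!r}")
        target = row.get(spec.predict_column)
        if target is None:
            problems.append(f"row {i}: missing target {spec.predict_column!r}")
        elif spec.predict_type == "continuous" and not isinstance(target, (int, float)):
            problems.append(f"row {i}: target is not numeric")
    return problems


# Made-up example: two inputs per drug and a yes/no outcome to predict.
spec = TrainingSetSpec(
    key_column="drug_id",
    input_columns=("compound_a", "compound_b"),
    predict_column="effective",
    predict_type="discrete",
)
rows = [
    {"drug_id": 1, "compound_a": 0.4, "compound_b": 1.2, "effective": "yes"},
    {"drug_id": 2, "compound_a": 0.9, "compound_b": 0.3, "effective": "no"},
]
print(validate_rows(spec, rows))
```

Running a check like this before processing a mining structure catches the most common training-set mistakes early: rows without a key, duplicated keys, and missing inputs or targets.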

Always ask yourself: what am I trying to predict, or what is the scientific theory? The theory and your training set are always specific to what you want to predict. Remember, Microsoft is providing the tool but you must provide the specific theory.

Once you successfully build one model, you can use that model to predict similar situations. If you are selling fruit, build the model for selling apples first. Once this model is working, change the training set to reflect oranges and apply the same model to oranges. The combination of all your models is your data mining enterprise system.
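The apples-and-oranges idea can be sketched as follows, assuming that "apply the same model" means reusing the same model design with a different training set. This is plain Python with a simple least-squares line standing in for a real mining algorithm; the function names and all the numbers are made up for illustration.

```python
def fit_line(rows, input_col, target_col):
    """Fit target = slope * input + intercept by ordinary least squares."""
    xs = [row[input_col] for row in rows]
    ys = [row[target_col] for row in rows]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("input column has no variation")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Predict the target for one input value using a fitted (slope, intercept)."""
    slope, intercept = model
    return slope * x + intercept


# Made-up training sets: price charged versus units sold.
apples = [{"price": 1, "units": 10}, {"price": 2, "units": 8}, {"price": 3, "units": 6}]
oranges = [{"price": 2, "units": 9}, {"price": 4, "units": 6}, {"price": 6, "units": 3}]

# Same model design, different training set.
apple_model = fit_line(apples, "price", "units")
orange_model = fit_line(oranges, "price", "units")
print(predict(apple_model, 4), predict(orange_model, 8))
```

Nothing in `fit_line` knows about fruit. Only the training set changes, which is why getting the first model right pays off for every model that follows.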

Friday, March 23, 2007

Microsoft SQL 2005 Analysis Services: Ten Best Practices©

By Alberto Roldan

A number of data mining, executive management, and IT professionals seem to be experiencing the same issue with Microsoft SQL 2005 Analysis Services (MAS): How do I make this product work for my enterprise? These ten best practices should provide some assistance in dealing with this issue.

1. Training: Any organization using this product must have at least one person who has received training in SQL 2005 Analysis Services, and the basic principles of data mining and predictive modeling.

a. Do not expect the Information Technology Department to create an enterprise data mining project without the proper training in the technology and the science of data mining. One of the reasons for the lack of success of data mining projects is that the IT department understands neither the technology nor the science behind data mining. MAS makes the development of an enterprise data mining project feasible, provided at least one member of the staff understands the technology and science behind it.

i. Potential Solution – Since this is cutting-edge technology which merges science and technology, the Chief Technology Officer, Chief Information Officer, and Enterprise Architect must receive some training in this area. They do not need to be experts, but they must understand the basic principles behind it. This training will make sure that expectations and strategic business initiatives properly align. Also, make sure at least two software engineers take the online tutorials.

2. Strategic investment and not simply a cost center: Most IT projects are linear (i.e., project scope, charter, resource allocation, design, development, QC, testing and deployment). Data mining is always a cyclical project because it is research and development. It is heuristic by nature. Organizations invest in enterprise data mining because they realize that the amount of data must be managed differently to transform data into actionable information. The expectation that we are going to do something differently but use the same practices that we have used in the past is a contradiction.

a. Plan data mining projects by incorporating research and development techniques into IT software engineering best practices. You need a specific theory for your research, proposed research steps, a timetable, and dedicated resources. Also, document all the steps, define your metrics, complete the research, QC the results, evaluate the results, and determine the next step for research in your continuous improvement process.

i. Potential Solution – Concentrate on your theory for research. The more specific your theory, the greater the probability to conduct an experiment that will give you specific results that you can use (whether by confirming your theory or not confirming your theory). If the theory is something to the effect of "save the planet and cure cancer," I will guarantee you that it will not work. Never underestimate the ability to transform a strategic business initiative into multiple theories for data mining research.

3. Dedicated Resources: One of the missions of any IT department is to prove added value to the enterprise by reducing costs. As a consequence, the standard in some organizations is to share resources as much as possible in order to decrease costs. Nevertheless, the architecture of an enterprise data mining project requires, at a minimum, its own server due to the amount of processing time required.

a. The SQL 2005 Server is a powerful tool that has the ability to accommodate many users and serve multiple purposes in the enterprise. As with any other resource, we should always strive to optimize its use. Nevertheless, I have found that in a metadata environment, having operational projects and data mining research projects compete within the same environment causes unnecessary friction and delays for all the parties.

i. Potential Solution – The best practice is separate and dedicated IT resources (staff, hardware, and software). Nevertheless, if this is not feasible, a detailed plan for communication and resource utilization must be implemented. The goals and expectations of any data mining project must be adjusted to reflect the additional time (between 25% and 50%) required to complete a data mining project.

4. "One Theory at a Time" Rule: You can use this tool to address multiple business issues, but it would probably require multiple models for multiple issues. If you are looking for the needle in the haystack, you must consider that you could have multiple haystacks and multiple needles. Therefore, you are in a situation that requires multiple models for the haystacks and the needles. The complexity of this issue cannot be underestimated.

a. Many enterprises spend vast resources in their organizational planning and structure. The reason for this expenditure is they understand that their business is complex and requires a clear chain of command to successfully implement strategic and operational initiatives. Nevertheless, these same enterprises fail to recognize this complexity when attempting to implement data mining systems.

i. Potential Solution – Study your organizational chart. This could be a roadmap as to the priorities for a successful enterprise data mining system. It will assist in defining the theories that you want to test in a research environment.

5. Models are never generic; they are always specific: The terms data mining and artificial intelligence are sometimes used out of context in the business environment. These terms tend to be used more in the science fiction context than in the business context. This contributes to unrealistic expectations of what data mining could do to assist in implementing strategic business initiatives. It is imperative that the CIO and the CTO have a basic understanding of the science and technology behind data mining so the executive management team makes well-informed decisions about the incorporation of these tools into any initiative.

a. A data mining system is not going to lower your operational costs overnight by ten percent. A data mining project is not magic and, like any other strategic initiative, involves planning, knowledge management, and change management. Expectations must be realistic for the size of the investment. Investment is not just hardware and software; it also involves training and making sure that you have the right people to do the job well. It requires being intellectually honest about what your needs are, and candid about what business results you can expect given the size of your investment.

i. Potential Solution: The first step before making an investment is to evaluate where the organization currently is, how it got there, the nature of the competition, and what you think a data mining system can do for your organization. The axiom that those who fail to plan are planning to fail is applicable in data mining. Executive management support and sponsorship is a keystone of the process. Do not underestimate the challenges and cyclical nature of this type of initiative, but make sure that the message throughout the organization is clear: we will achieve an enterprise data mining system because it is part of how we intend to stay competitive in the future by lowering costs and increasing revenues.

6. Three Main Categories: The variables (input), training set (sample), and data mining structures are the keystones of Analysis Services. A clear understanding of these three areas will assist in the creation of a data mining system.

a. IT Architects and developers that do not understand the three main components of Analysis Services will have a difficult time with the successful completion of a single data mining project. It will be impossible for them to successfully design and implement an enterprise data mining system. The knowledge requires a basic understanding of the technology and the science of data mining and predictive modeling. Acquiring this knowledge does not need to be extremely costly or time-consuming. Nevertheless, expecting the IT department to successfully complete this type of project without allowing them time to acquire this knowledge and training is putting them in a position to fail.

i. Potential Solution: The technical and scientific knowledge to successfully complete a data mining project can be acquired through training (classes or online), by engaging the services of a consultant with a proven record of completing data mining projects, or by self-directed reading. I would suggest looking within your own organization for individuals with at least a statistics or mathematics background, or those who have an interest in the sciences (genetics, biology, astronomy, physics, or chemistry). Those individuals could have a predisposition to quickly learn the science and methodology behind the technology of data mining.

7. Statistically accepted best practices as metrics: Because this product joins science and technology, and because data mining is cyclical rather than linear, we must incorporate statistically accepted best practices if we want the continuous improvement process that research requires. Besides traditional business metrics (cost per employee, revenue per employee, and profits per employee) and IT metrics (reduction of hours to complete a process, increase in revenues, CPU utilization per employee, and total system downtime), we now need to incorporate some scientific metrics that will assist in improving a data mining system. Understanding scientific metrics is important for measuring success, improvement, or lack of improvement.

a. It is a change for organizations that have never had an organizational research component to apply an additional set of metrics to gauge the performance of a data mining system. The tendency is to use only the same metrics that have been used in the past. The error is in assuming that data mining is a linear type of project rather than a cyclical one. The best example is a data mining system that seems to be a failure from the business point of view but is successful when measured by statistical metrics. In this scenario, the statistical metrics can help diagnose the problem.

i. Potential Solution: I would suggest using an add-in statistical software package to determine the Variance Inflation Factor (VIF) for the numerical variables, to check whether a particular variable overlaps so strongly with the others that it has an undue influence on your model. Also, I would suggest measuring Type II error (the cases your model misses) to determine the predictive quality of your model (i.e., what your model is actually measuring).
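For readers who want to see what these two metrics compute, here is a minimal sketch in plain Python. It is not the add-in package mentioned above, and the function names and sample numbers are made up. The VIF shown is the two-variable special case, where VIF = 1 / (1 - r²) and r is the correlation between the two inputs; with more inputs, r² is replaced by the R² from regressing one input on all the others. The Type II error rate is the share of actual positives that the model predicted as negative.

```python
def vif_two_inputs(xs, ys):
    """VIF for a pair of inputs: 1 / (1 - r^2), r being their Pearson correlation."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("an input with no variation has no defined correlation")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r_squared = (sxy * sxy) / (sxx * syy)
    if r_squared >= 1.0:
        return float("inf")  # perfectly collinear inputs
    return 1.0 / (1.0 - r_squared)


def type_ii_error_rate(actual, predicted, positive=1):
    """Share of actual positives that the model predicted as not positive."""
    false_negatives = 0
    true_positives = 0
    for a, p in zip(actual, predicted):
        if a == positive:
            if p == positive:
                true_positives += 1
            else:
                false_negatives += 1
    total_positives = false_negatives + true_positives
    if total_positives == 0:
        raise ValueError("no actual positives to evaluate")
    return false_negatives / total_positives


# Made-up inputs and outcomes for illustration only.
input_1 = [1, 2, 3, 4]
input_2 = [2, 1, 4, 3]
vif = vif_two_inputs(input_1, input_2)

actual = [1, 1, 1, 0, 0]
predicted = [1, 0, 0, 0, 1]
miss_rate = type_ii_error_rate(actual, predicted)
print(vif, miss_rate)
```

A VIF close to 1 means the inputs carry mostly independent information; a common rule of thumb treats values above roughly 5 to 10 as a warning sign, though the right cutoff depends on the project.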

8. Design of Experiments: In a research environment the projects are cyclical. The creation of a successful data mining system requires research, and research requires experimentation. This is one of the areas where business and science seem to conflict. Science expects that some experiments will not be successful, and businesses tend to be risk averse. Nevertheless, businesses constantly take risks to improve their profitability and growth. Therefore, it is not that businesses do not take risks, but that they want to be able to quantify and qualify the risks.

a. When science and technology join in the business arena some compromises must be made that are beneficial for all the parties. Science and technology cannot operate within a “pure science” mentality, and businesses must face the inherent risks head-on.

i. Potential Solution: If you have invested in the SQL 2005 Server and its software, you have already incurred part of the financial risk. The issue then becomes whether you want to use the server as a simple storage facility or are willing to make an additional investment (i.e., training, knowledge transfer, or a consultant) to use the full potential of this tool. I would suggest starting by having two people on your staff go through all the online tutorials (no shortcuts) and then letting them try to create and deploy a test Analysis Services project using limited data. This process will bring out a series of unanswered questions, and those questions will help you plan the options that you have to acquire the knowledge that you need to design, test, and implement a data mining system. Also, change the name "design of experiments" to "design of research," since it will be more easily understood by others within the enterprise.

9. Quality Control and Testing: Design multiple quality control staging areas during the process. Although Microsoft has made this product in such a way that it writes about ninety percent (90%) of all the code, you will find that you sometimes need to make small modifications and write wrapper code. Also, you will need to optimize the processes if you have to create specific variables like Z-scores. Lastly, you will find resistance within the IT and Operational organizations of the enterprise to making any changes to the current processes. This resistance to change will require a specific change management strategy.

a. The potential of a successful data mining system in an enterprise tends to create apprehension. This apprehension is rooted in the mistaken belief that if a data mining system is successful it will result in people losing their jobs. In the work place nothing is as personal as the instability that a potential change could bring if the perception for managers and staff alike is that their jobs might be in jeopardy if this new system is successful. The implications for QC and testing are immeasurable in terms of utilization of productive time.

i. Potential Solution: The development of realistic communication, quality control, and testing plans as part of the initial executive management evaluation of whether or not to design a data mining enterprise system is a must. These plans should include goals and expectations at all levels of the enterprise. Specifically, the communication plan should address the issue of the potential changes in duties and responsibilities of staff and managers. It is a lot easier for managers and staff to buy into this type of strategic initiative if they see the role that they will play during the different stages of an enterprise data mining project.
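As an example of the kind of derived variable mentioned under this practice, here is a minimal sketch of computing Z-scores in plain Python. A Z-score is (value - mean) / standard deviation. The function name and the sample values are made up, and the sketch assumes the population standard deviation; whether the population or the sample version is appropriate depends on your data.

```python
import statistics


def z_scores(values):
    """Return (value - mean) / population standard deviation for each value."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("all values are identical; Z-scores are undefined")
    return [(v - mean) / std for v in values]


# Made-up measurements with mean 5 and population standard deviation 2.
measurements = [2, 4, 4, 4, 5, 5, 7, 9]
scores = z_scores(measurements)
print(scores)
```

Computing a variable like this once, in a staging step, and checking it there is one way to give each quality control staging area something specific to verify.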

10. Expectations: The term “high but realistic expectations” does not need to be a contradiction. High expectations refer to the ability of the enterprise to learn to effectively use the tools at its disposal. Realistic expectations refer to the increase in value that a data mining system should give you based on your prior experience in growing and developing the business. Also, expectations should be directly correlated to the investment in training and dedicated resources for a data mining project.

a. Some organizations tend to go to their IT department and ask it to build a data mining or analytics enterprise system that will solve their business issues, as well as all the pressing world issues. The CIO, CTO, or VP of Technology sometimes does not have the knowledge required to explain that a more specific approach is required in data mining. Hence, the failure of data mining projects is often a failure to plan with high but realistic expectations.

i. Potential Solution: Microsoft has made a product that streamlines a lot of the designing and developing of a data mining system, but the key is specific planning, knowledge transfer from the business areas to the IT department, and a definition of the specific business needs. It is going to take time and effort to put this together. The first step is to develop a plan that takes into consideration the resources and training necessary to use this new tool. This plan should serve as a roadmap of how we are going to implement this initiative.

I hope that you can use some or all of these best practices in using SQL 2005 Analysis Services to create an enterprise analytics or data mining system. The barriers are technical, scientific, and in the change management areas. The potential is immeasurable. Contact: alberto_roldan_2001@yahoo.com

Business Analytics

About Me

See my resume at: https://docs.google.com/document/d/1-IonTpDtAgZyp3Pz5GqTJ5NjY0PhvCfJsYAfL1rX8KU/edit?hl=en_USid=1gr_s5GAMafHRjwGbDG_sTWpsl3zybGrvu12il5lRaEw