Software reuse enables developers to leverage past accomplishments and facilitates significant improvements in software development. Data mining techniques can be applied to identify sets of software components that depend on one another.

Apart from developing algorithms, however, the discipline has not yet been sized for tasks of this kind, as it is very much focused on pure development activities and tasks. Are we at the same point as SE once was? The Gartner Group estimates [14] that there will be an upsurge of DM projects over the coming years.
Another Gartner Group report [15] claims that the number of enterprises active in the DM area has grown considerably. Yet, although a lot of DM projects are being developed, not all of the project results are in use [17-19], nor do all projects end successfully [20,21]. DM problems are taking on the dimensions of an engineering problem; therefore, the processes to be applied should cover the full range of project activities. Our proposal is to enhance CRISP-DM by embedding other current standards, as suggested in [27].

CRISP-DM is the most commonly used methodology for developing DM projects [23-25]. However, its use is not becoming any more widespread, partly because of rivalry with other approaches.
All the above goes to show that CRISP-DM struggles with the complexity of the problems it now has to address, and this detracts from the effectiveness of its deployment, as it does not produce the expected results.

There is some confusion about the terminology different authors use to refer to process and methodology. The terms used here aim to unify criteria, because they are better established and backed by the International Organization for Standardization (ISO). A process specifies the tasks performed to develop a particular element, as well as the elements that are produced by each task (outputs) and the elements that are necessary to do a task (inputs) [2]. A poorly selected life cycle can lead to continual delays and unnecessary rework, and life cycle selection depends on many variables of the project.
A good process helps us develop the right product: it lays out the steps of development so that they can be replicated in future projects. Ad hoc processes are rarely replicable unless the same team is working on the new project, and even if we were as good as we could be now, both development environments and requested products are changing so quickly that our processes will always be running to catch up.

The KDD process [31] does not define an iterative life cycle: the project is developed as a whole, although developers can go back to the last stage to put right any error detected in any of the stages. Like the KDD process, Two Crows [32,33] is a process model with a waterfall life cycle. SEMMA likewise sets out a waterfall life cycle, as the project is developed through to the end; if the solution is not interesting, developers go backwards, and tasks are performed using techniques that stipulate how they should be done. CRM catalyst [39] addresses customer relationship management and only its last part uses DM; in this case the life cycle is iterative, as the CRM system is built in small increments rather than all at one go, with criteria for deciding when a stage has been completed and for selecting and starting the next one. Market Consulteks [41] integrates the technological part of DM into the wider project and gives recommendations on some tasks, although not always about how to do them.
Development-oriented processes cover the activities for building the software product; finally, the activities for installing, operating, supporting, maintaining and retiring the software product should be performed. Integral processes are necessary to successfully complete the software project activities. They are enacted at the same time as the software development-oriented activities, include activities that are not related to development, and are used to assure the completeness and quality of the project functions.

The SE panorama is quite a lot clearer, and there are two well-established process models: the IEEE standard [45] and the ISO standard [46]. In the following, both processes are analyzed in some detail and a generic joint model is proposed, which will then be used for comparison with CRISP-DM.

The IEEE standard determines a non-time-ordered set of essential activities that should be part of developing a software product. The life cycle that should be followed to develop the product is not prescribed: each organization using the standard selects a software life cycle.

The ISO standard divides the activities that can be carried out during the software life cycle into primary, supporting and organizational life cycle processes (see the software process models figure). Each life cycle process is divided into a set of activities, and these activities are further divided into a set of tasks. The primary parties include the acquirer and the supplier of the system or product. The supporting processes are divided into eight subprocesses, any of which can be used in the acquisition, supply, development, operation or maintenance processes, or in any other supporting process; they are used at several points of the life cycle and can be enacted by the organization that uses them, by a separate organization as a service, or by a customer. The organizational processes help to establish, implement and improve the organization and its processes.
Clearly, most of the processes proposed in the IEEE standard match up with ISO processes and vice versa, so both can be combined into a joint process model. The IEEE standard was selected as the basis, as its processes are more detailed, and the ISO acquisition and supply processes were added, because the IEEE standard states that the ISO acquisition and supply processes should be used [45] if it is necessary to acquire or supply software. This set of processes also extends to third party software acquisition and supply: for DM project development, acquisition and supply processes may be considered necessary when third parties are engaged to develop or create DM models for projects of some size or complexity.

CRISP-DM is currently the reference guide used to develop DM projects. Its first phase focuses on understanding the project objectives and requirements from a business perspective and then converting this knowledge into a DM problem definition and a preliminary plan designed to achieve the objectives. Typically, there are several techniques for the same DM problem type, and stepping back to the data preparation phase is often necessary; at the end of the evaluation phase, a decision on the use of the DM results should be reached.

The comparison between the SE process model and CRISP-DM should identify what SE model elements (activities, tasks) are applicable to DM projects and are not covered by CRISP-DM, so that CRISP-DM can be extended accordingly. The match between CRISP-DM and SE process model elements is not exact: in some cases the elements are equivalent, but the techniques are different. This is because in one case the project aim is to develop software and in the other it is to gather knowledge; therefore, even if the activities are similar, the order and way in which they are performed will not be the same.

The life cycle depends on the type of project to be developed: developing a more or less everyday piece of software (e.g. a common management application) has nothing to do with building a totally unknown piece of software. Life cycle models are used to identify and select a life cycle for the software project that is to be developed, and this also applies to DM projects. (Fig.: Comparison of life cycle selection processes.)
Turning to the project management processes, project planning covers all the processes related to planning project management activities, including contingency planning. Risk management aims to identify potential problems, determine the likelihood of their occurrence and their impact, and establish the steps for their management throughout project execution. Additionally, project management also covers subprocesses related to metrics: define metrics, retain project management records, and collect and analyze metrics. (Fig.: Comparison of management processes.)

Looking at the tasks covered by the CRISP-DM stages, however, only the business understanding (BU) phase includes any project management-related tasks: the identify major iterations task is comparable to part of this planning work, and the philosophy behind the experience documentation task points in the same direction. The other major omission is the evaluation component (plan evaluations): CRISP-DM does have a results evaluation stage, but this component refers to evaluation of the process as a whole.

CRISP-DM does not address the management of versions and configurations either. Given the size of current projects and the teams of human resources working together on such projects, we believe that it should: different people generate multiple versions of models, initial data sets, documents, etc., some versions may not be valid, and there is a risk of models, data and documentation for different versions getting mixed up.

Additionally, any DM project should include tasks for managing the transfer and use of the results (plan system transition, plan installation), tasks that CRISP-DM does not cover. Finally, the other major oversight, fruit of process immaturity, is the documentation task (plan documentation).
Reports are generated in all stages, but there is no overall documentation plan; one would improve documentation evaluation and review and facilitate work on process improvement.

As in SE, the processes that produce this knowledge and its documentation can also be divided into pre-development, development and post-development stages. The pre-development processes cover the work done before building the software system, such as concept exploration or system allocation requirements. The concept exploration process serves to identify ideas or needs and, with the support of the integral process activities, is documented in a statement of need. This is the document upon which all the later engineering work is based: it is a starting point for project development, as it provides an understanding of the problem to be solved and establishes the supposed requirements for and constraints on the project to be developed. CRISP-DM already accounts for this process, as shown in the comparison of pre-development processes figure. Finally, post-development covers the activities for installing (installation), operating and supporting (operation and support), maintaining (maintenance) and retiring (retirement) the software product.
To evaluate methods and to find patterns, open-source software projects such as JHotDraw, JUnit, and MapperXML have generally been preferred by researchers. For example, Zanoni et al. collected instances of five design patterns from 10 open-source software projects, as shown in Table 4.
Design patterns and code smells are related issues: code smells are symptoms in the code, and if a piece of software contains code smells, its design pattern is probably not well constructed. Therefore, Kaur and Singh [ 96 ] used the J48 decision tree to check whether design pattern and smell pairs appear together in code. Their results showed that the singleton pattern had no presence of bad smells.
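A minimal sketch of this kind of co-occurrence check is shown below; scikit-learn's DecisionTreeClassifier (with the entropy criterion) stands in for Weka's J48, and the feature names and synthetic data are placeholders rather than the study's actual dataset.

# Hypothetical sketch: predicting whether a class that implements a design
# pattern also exhibits a code smell, with a C4.5-style decision tree.
# scikit-learn's DecisionTreeClassifier stands in for Weka's J48.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

rng = np.random.default_rng(42)

# Assumed features per class: [is_singleton, is_adapter, LOC, WMC, CBO]
X = rng.random((200, 5))
X[:, 0:2] = (X[:, 0:2] > 0.5).astype(float)        # pattern flags become 0/1
# Synthetic label: "has a code smell" (1) or not (0)
y = (X[:, 2] + 0.5 * X[:, 4] > 0.9).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# The entropy criterion mirrors the information-gain splitting used by J48/C4.5
clf = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=0)
clf.fit(X_train, y_train)

print(classification_report(y_test, clf.predict(X_test)))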
According to the studies summarized in the table, the most frequently used patterns are abstract factory and adapter. It has recently been observed that studies on ensemble learning in this field are increasing. Refactoring improves the readability and maintainability of the source code and decreases the complexity of a software system. Code smell and refactoring are closely related to each other: code smells represent problems due to bad design and can be fixed during refactoring. The main challenge is to determine which part of the code needs refactoring.
Some data mining studies related to software refactoring are presented in Table 5. Some studies focus on historical data to predict refactoring [ ] or to obtain both refactoring and software defects [ ] using different data mining algorithms such as LMT, Rip, and J48. Results suggest that when refactoring increases, the number of software defects decreases, and thus refactoring has a positive effect on software quality.
While automated refactoring does not always give the desired result, manual refactoring is time-consuming. Therefore, one study [ ] proposed a clustering-based recommendation tool that combines multi-objective search with an unsupervised learning algorithm to reduce the number of refactoring options. Since many SE studies that apply data mining approaches exist in the literature, this article presents only a few of them.
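As a rough illustration of the clustering side of such a recommendation tool (not the cited study's actual method), the sketch below groups candidate refactoring operations by made-up code metrics with k-means and keeps one representative per cluster, so far fewer options need manual review.

# Hypothetical sketch: reducing a long list of refactoring candidates by
# clustering them on code metrics and keeping one representative per cluster.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

rng = np.random.default_rng(0)

# Assumed metrics per candidate: [LOC touched, coupling change, cohesion change]
candidates = rng.random((60, 3))

scaled = StandardScaler().fit_transform(candidates)
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(scaled)

# Pick the candidate closest to each cluster centre as its representative
closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, scaled)
print("Representative candidate indices:", sorted(closest.tolist()))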
Publications for the most recent year were extracted only in part, since that year had not yet been completed. As seen in the figure, the number of studies using data mining in SE tasks, especially defect prediction and vulnerability analysis, has increased rapidly. The most stable area in the studies is design pattern mining. (Figure: Number of publications of data mining studies for SE tasks from the Scopus search, by year.) Figure 5 shows the publications on classification, clustering, text mining, and association rule mining as a percentage of the total number of papers obtained by a Scopus query for each SE task.
For example, in defect prediction the largest number of studies is in the field of classification, with 64 in clustering, 25 in association rule mining, and 8 in text mining. As can be seen from the pie charts, while clustering is a popular DM technique in refactoring, no study related to text mining was found in this field.
In other SE tasks, the preferred technique is classification, and the second is clustering. (Figure 5: Number of publications of data mining studies for SE tasks from the Scopus search, by topic.)
Defect prediction studies generally compare learning algorithms in terms of whether they find defects correctly, using classification algorithms. Besides this approach, in some studies clustering algorithms were used to select features [ ] or to compare supervised and unsupervised methods [ 27 ].
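A minimal sketch of this kind of comparison is shown below, using scikit-learn classifiers on a synthetic, imbalanced metrics dataset; the dataset, metrics and algorithm choices are illustrative assumptions rather than those of any cited study.

# Hypothetical sketch: comparing classifiers for defect prediction on a
# synthetic software-metrics dataset (a stand-in for real defect data).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, weights=[0.8, 0.2],
                           random_state=1)  # defective modules are the minority

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f}")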
In the text mining area, TF-IDF techniques were generally used to extract features from scripts [ , ]. Figure 6 shows the number of publications by document type (conference paper, book chapter, article, book) over the period covered. It is clearly seen that conference papers and articles are the most preferred research study types, and that there is no review article about data mining studies in design pattern mining.
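Returning to the TF-IDF point above, a minimal sketch of TF-IDF feature extraction with scikit-learn follows; the example snippets are made up and simply stand in for scripts or report texts.

# Hypothetical sketch: turning short code/report snippets into TF-IDF features
# that a classifier or clustering algorithm can consume.
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up snippets standing in for scripts or bug-report texts
docs = [
    "null pointer exception when saving user profile",
    "buffer overflow in image parser on malformed input",
    "refactor duplicated validation logic in order service",
]

vectorizer = TfidfVectorizer(lowercase=True, stop_words="english",
                             ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)           # sparse matrix: documents x terms
print(X.shape)
print(sorted(vectorizer.vocabulary_)[:10])   # a few of the extracted terms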
(Figure 6: The number of publications in terms of document type.) Table 6 shows popular repositories that contain various datasets, with their descriptions, the tasks they are used for, and hyperlinks for downloading.
Since these repositories contain many datasets, no detailed information about them is provided in this article. Refactoring can be applied at different levels; one study [ ] predicted refactoring at the package level using hierarchical clustering, and another study [ 99 ] applied class-level refactoring prediction using LS-SVM as the learning algorithm, SMOTE for handling imbalanced data, and PCA for feature extraction.
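A rough sketch of that class-level setup is shown below. It uses the third-party imbalanced-learn package for SMOTE, scikit-learn's standard SVC in place of LS-SVM (which scikit-learn does not provide), and synthetic data; all parameter choices are illustrative assumptions rather than the cited study's settings.

# Hypothetical sketch: SMOTE for the imbalanced "needs refactoring" label,
# PCA for feature reduction, and an SVM classifier (SVC stands in for LS-SVM).
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline   # applies SMOTE only during fitting

X, y = make_classification(n_samples=400, n_features=30, weights=[0.9, 0.1],
                           random_state=3)  # few classes actually need refactoring

pipe = Pipeline([
    ("smote", SMOTE(random_state=3)),    # oversample the minority class
    ("pca", PCA(n_components=10)),       # compress correlated code metrics
    ("svm", SVC(kernel="rbf", C=1.0)),
])
scores = cross_val_score(pipe, X, y, cv=5, scoring="f1")
print(f"mean F1 = {scores.mean():.3f}")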
Data mining techniques have been applied successfully in many different domains. In software engineering, to improve the quality of a product, it is highly critical to find existing deficits such as bugs, defects, code smells, and vulnerabilities in the early phases of SDLC. Therefore, many data mining studies in the past decade have aimed to deal with such problems.
The present paper aims to provide information about previous studies in the field of software engineering. This survey shows how classification, clustering, text mining, and association rule mining can be applied in five SE tasks: defect prediction, effort estimation, vulnerability analysis, design pattern mining, and refactoring. It clearly shows that classification is the most used DM technique.
Therefore, new studies can focus on applying clustering to SE tasks.
Abstract: Software engineering is one of the most utilizable research areas for data mining.
Keywords: software engineering tasks, data mining, text mining, classification, clustering.

Introduction. In recent years, researchers in the software engineering (SE) field have turned their interest to data mining (DM) and machine learning (ML)-based studies, since collected SE data can be helpful in obtaining new and significant information.

These base learners can be acquired with different learning algorithms, different parameters of the same algorithm, or different training sets. The commonly used ensemble techniques, bagging, boosting, and stacking, are shown in Figure 3 and briefly explained in this part.
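As a rough, self-contained illustration of those three ensemble styles (not tied to any particular study), the sketch below trains each of them with scikit-learn on a synthetic dataset; the data and parameters are made up.

# Minimal sketch of the three ensemble styles on a toy dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (BaggingClassifier, AdaBoostClassifier,
                              StackingClassifier)

X, y = make_classification(n_samples=300, n_features=15, random_state=7)

ensembles = {
    # bagging: many trees on bootstrap samples (decision tree is the default base)
    "bagging": BaggingClassifier(n_estimators=25, random_state=7),
    # boosting: weak learners trained sequentially on reweighted data
    "boosting": AdaBoostClassifier(n_estimators=50, random_state=7),
    # stacking: a meta-learner combines the base learners' predictions
    "stacking": StackingClassifier(
        estimators=[("tree", DecisionTreeClassifier()),
                    ("lr", LogisticRegression(max_iter=1000))],
        final_estimator=LogisticRegression(max_iter=1000)),
}
for name, model in ensembles.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")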
Table 2 (fragment): design the activities; estimate product size and complexity; compare and repeat estimates. The most popular and widely utilized definition appears in the Common Vulnerabilities and Exposures (CVE) report as follows: vulnerability is a weakness in the computational logic found in software and some hardware components that, when exploited, results in a negative impact to confidentiality, integrity or availability.
Table 3 (fragment): employ a deep neural network; creation of a metrics-oriented dataset. Table 4 (fragment): RF, AdaBoost; Axis2, Eclipse.
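The Table 3 fragment above mentions employing a deep neural network for vulnerability analysis. As a loose, hypothetical illustration of that idea (not the architecture of any specific study), a small multi-layer perceptron can be trained on numeric code features labelled as vulnerable or not; the features and data below are synthetic.

# Hypothetical sketch: a small multi-layer perceptron classifying code units
# as vulnerable or not from numeric features.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import f1_score

X, y = make_classification(n_samples=600, n_features=25, weights=[0.85, 0.15],
                           random_state=5)  # vulnerable units are the minority

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=5, stratify=y)

scaler = StandardScaler().fit(X_train)       # neural nets need scaled inputs
mlp = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=5)
mlp.fit(scaler.transform(X_train), y_train)

pred = mlp.predict(scaler.transform(X_test))
print(f"F1 on the vulnerable class: {f1_score(y_test, pred):.3f}")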