
Integration of multi-resolution remote sensing data for land cover representation using vegetation continuous fields and decision tree classification

Latorre, Marcelo Lopes; Carvalho Júnior, Osmar Abílio de; Santos, João Roberto dos; Shimabukuro, Yosio Edemir
Source: Universidade de Brasília Publisher: Universidade de Brasília
Type: Journal Article
POR
Search Relevance
26.02%
This paper aims to develop a methodology of multisensor integration for an Amazon monitoring system. The proposed system employs the Vegetation Continuous Fields (VCF) method, which uses a decision tree algorithm. The algorithm uses a set of independent variables, in this case MODIS multitemporal metrics, to recursively partition a dependent variable, in this case training data derived from land-use classes, into subsets that maximize the reduction in the residual sum of squares. The training data are obtained by classifying high-resolution images (Landsat/TM, ETM+ and CBERS 2/CCD). In this study, an algorithm was developed in the IDL language within the ENVI program, together with a statistical routine from the S-PLUS program. The study area is the State of Mato Grosso, which has an extensive area of Amazon Forest cover. The scenes are classified into three classes: forest, non-forest and water. Comparisons of the final product with the regional land-use map derived from PRODES show general agreement. The results of this study therefore suggest that the methodology is appropriate for determining land cover in the Amazon Forest.
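
The splitting criterion described in the abstract above can be illustrated with a minimal sketch. It is not the authors' IDL/S-PLUS implementation: the function name `best_split` and the toy data are hypothetical, and only a single split on one predictor is searched. The sketch picks the threshold that gives the largest reduction in the residual sum of squares (RSS) of the dependent variable.

```python
def best_split(xs, ys):
    """Return (threshold, rss_reduction) for the one split on xs
    that most reduces the residual sum of squares of ys."""

    def rss(values):
        # Residual sum of squares around the mean of the subset.
        if not values:
            return 0.0
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    pairs = sorted(zip(xs, ys))
    total = rss([y for _, y in pairs])
    best_threshold, best_gain = None, 0.0
    for i in range(1, len(pairs)):
        # A split is only possible between two distinct predictor values.
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        gain = total - rss(left) - rss(right)
        if gain > best_gain:
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_gain = gain
    return best_threshold, best_gain


# Hypothetical toy data: one predictor metric and a 0/1 forest indicator.
threshold, gain = best_split([1, 2, 3, 7, 8, 9], [0, 0, 0, 1, 1, 1])
print(threshold, gain)
```

A full regression tree would apply the same search recursively to each resulting subset and across all predictors.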

Relevance Vector Machine Learning for Neonate Pain Intensity Assessment Using Digital Imaging

Gholami, Behnood; Haddad, Wassim M.; Tannenbaum, Allen R.
Source: PubMed Publisher: PubMed
Type: Journal Article
EN
Search Relevance
26.12%
Pain assessment in patients who are unable to verbally communicate is a challenging problem. The fundamental limitations in pain assessment in neonates stem from subjective assessment criteria, rather than quantifiable and measurable data. This often results in poor quality and inconsistent treatment of patient pain management. Recent advancements in pattern recognition techniques using relevance vector machine (RVM) learning techniques can assist medical staff in assessing pain by constantly monitoring the patient and providing the clinician with quantifiable data for pain management. The RVM classification technique is a Bayesian extension of the support vector machine (SVM) algorithm, which achieves comparable performance to SVM while providing posterior probabilities for class memberships and a sparser model. If classes represent “pure” facial expressions (i.e., extreme expressions that an observer can identify with a high degree of confidence), then the posterior probability of the membership of some intermediate facial expression to a class can provide an estimate of the intensity of such an expression. In this paper, we use the RVM classification technique to distinguish pain from nonpain in neonates as well as assess their pain intensity levels. We also correlate our results with the pain intensity assessed by expert and nonexpert human examiners.

A multiscale and multiblock fuzzy C-means classification method for brain MR images

Yang, Xiaofeng; Fei, Baowei
Source: American Association of Physicists in Medicine Publisher: American Association of Physicists in Medicine
Type: Journal Article
EN
Search Relevance
16.12%
Purpose: Classification of magnetic resonance (MR) images has many clinical and research applications. Because of multiple factors such as noise, intensity inhomogeneity, and partial volume effects, MR image classification can be challenging. Noise in MRI can cause the classified regions to become disconnected. Partial volume effects make the assignment of a single class to one region difficult. Because of intensity inhomogeneity, the intensity of the same tissue can vary with the location of the tissue within the same image. The conventional “hard” classification method restricts each pixel exclusively to one class and often produces crisp results. Fuzzy C-means (FCM) classification, or “soft” segmentation, has been extensively applied to MR images; in it, pixels are partially classified into multiple classes using varying memberships to the classes. Standard FCM, however, is sensitive to noise and cannot effectively compensate for intensity inhomogeneities. This paper presents a method to obtain accurate MR brain classification using a modified multiscale and multiblock FCM.
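
To make the idea of "soft" memberships concrete, the sketch below computes the membership update of standard FCM for one pixel intensity. It illustrates only the textbook algorithm, not the modified multiscale and multiblock method of the paper; the function name, the intensity value, and the class centers are hypothetical.

```python
def fcm_memberships(value, centers, m=2.0):
    """Standard fuzzy C-means membership of one intensity value in each class.

    u_i = 1 / sum_j (d_i / d_j) ** (2 / (m - 1)), where d_i is the distance
    to class center i and m > 1 is the fuzzifier.
    """
    distances = [abs(value - c) for c in centers]
    if 0.0 in distances:
        # The value coincides with one or more centers: share full membership.
        hits = distances.count(0.0)
        return [1.0 / hits if d == 0.0 else 0.0 for d in distances]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_j) ** exponent for d_j in distances)
        for d_i in distances
    ]


# Hypothetical intensity 30 with two hypothetical class centers at 20 and 60.
print(fcm_memberships(30.0, [20.0, 60.0]))
```

The memberships sum to one, so a pixel affected by partial volume can belong mostly to one tissue class and partly to another instead of being forced into a single class.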

Improving Predictions in Imbalanced Data Using Pairwise Expanded Logistic Regression

Jiang, Xiaoqian; El-Kareh, Robert; Ohno-Machado, Lucila
Source: American Medical Informatics Association Publisher: American Medical Informatics Association
Type: Journal Article
EN
Search Relevance
26.02%
Building classifiers for medical problems often involves dealing with rare, but important events. Imbalanced datasets pose challenges to ordinary classification algorithms such as Logistic Regression (LR) and Support Vector Machines (SVM). The lack of effective strategies for dealing with imbalanced training data often results in models that exhibit poor discrimination. We propose a novel approach to estimate class memberships based on the evaluation of pairwise relationships in the training data. The method we propose, Pairwise Expanded Logistic Regression, improved discrimination and had higher accuracy when compared to existing methods in two imbalanced datasets, thus showing promise as a potential remedy for this problem.

A BAYESIAN INTEGRATION MODEL OF HIGH-THROUGHPUT PROTEOMICS AND METABOLOMICS DATA FOR IMPROVED EARLY DETECTION OF MICROBIAL INFECTIONS

WEBB-ROBERTSON, BOBBIE-JO M.; MCCUE, LEE ANN; BEAGLEY, NATHANIAL; MCDERMOTT, JASON E.; WUNSCHEL, DAVID S.; VARNUM, SUSAN M.; HU, JIAN ZHI; ISERN, NANCY G.; BUCHKO, GARRY W.; MCATEER, KATHLEEN; POUNDS, JOEL G.; SKERRETT, SHAWN J.; LIGGITT, DENNY; FREVERT,
Source: PubMed Publisher: PubMed
Type: Journal Article
Published in //2009 EN
Search Relevance
26.12%
High-throughput (HTP) technologies offer the capability to evaluate the genome, proteome, and metabolome of an organism at a global scale. This opens up new opportunities to define complex signatures of disease that involve signals from multiple types of biomolecules. However, integrating these data types is difficult due to the heterogeneity of the data. We present a Bayesian approach to integration that uses posterior probabilities to assign class memberships to samples using individual and multiple data sources; these probabilities are based on lower-level likelihood functions derived from standard statistical learning algorithms. We demonstrate this approach on microbial infections of mice, where the bronchial alveolar lavage fluid was analyzed by three HTP technologies, two proteomic and one metabolomic. We demonstrate that integration of the three datasets improves classification accuracy to ~89% from the best individual dataset at ~83%. In addition, we present a new visualization tool called Visual Integration for Bayesian Evaluation (VIBE) that allows the user to observe classification accuracies at the class level and evaluate classification accuracies on any subset of available data types based on the posterior probability models defined for the individual and integrated data.
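
As a rough illustration of how posterior probabilities can combine several data sources, the sketch below multiplies a class prior by per-source likelihoods and normalizes. The assumption that the sources are conditionally independent given the class is ours, not something the abstract states, and the function name, class labels, and likelihood values are all hypothetical.

```python
def integrate_posteriors(prior, likelihoods_per_source):
    """Combine per-source class likelihoods into posterior class memberships.

    Assumes (hypothetically) that the sources are conditionally independent
    given the class, so the posterior is proportional to the prior times
    the product of the likelihoods.
    """
    unnormalized = dict(prior)
    for likelihoods in likelihoods_per_source:
        for label in unnormalized:
            unnormalized[label] *= likelihoods[label]
    total = sum(unnormalized.values())
    return {label: value / total for label, value in unnormalized.items()}


# Hypothetical likelihoods for one sample from two data sources.
posterior = integrate_posteriors(
    {"infected": 0.5, "control": 0.5},
    [
        {"infected": 0.6, "control": 0.4},  # e.g. a proteomic source
        {"infected": 0.7, "control": 0.3},  # e.g. a metabolomic source
    ],
)
print(posterior)
```

Passing a subset of the sources to the same function gives the posterior for that subset, which mirrors the idea of evaluating accuracy on individual and integrated data.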

The Psychometric Latent Agreement Model (PLAM) for Discrete Latent Variables Measured by Multiple Items

Dumenci, Levent
Source: PubMed Publisher: PubMed
Type: Journal Article
Published in /01/2011 EN
Search Relevance
16.02%
The Psychometric Latent Agreement Model (PLAM) is proposed for estimating the subpopulation membership of individuals (e.g., satisfactory performers vs. unsatisfactory performers) at discrete levels of multiple latent trait variables. A binary latent Type variable is introduced to take account of the possibility that, for a given set of observed variables, the latent group memberships of some individuals are indeterminate. The latent Type variable allows for separating individuals who can reliably be assigned to the satisfactory versus unsatisfactory performer classes from those individuals whose ratings do not contain the information necessary to make the class assignment possible for a particular set of rating items. Agreements among discrete latent trait variables are also estimated. The PLAM was illustrated with two examples using real data on behavioral rating measures. One example involved ratings of two behavioral constructs by a single rater type, whereas the other involved ratings of one construct by three rater types. Implications were presented for using behavioral ratings to determine subpopulation membership, such as qualified versus unqualified groupings in hiring decisions and pass versus fail groupings in performance evaluations.

Methods to Interpolate Soil Categorical Variables From Profile Observations: Lessons From Iran

HENGL TOMISLAV; TOOMANIAN NORAIR; REUTER HANNES ISAAK; MALAKOUTI MOHAMMAD
Source: ELSEVIER SCIENCE BV Publisher: ELSEVIER SCIENCE BV
Type: Articles in Journals Format: Printed
ENG
Search Relevance
16.46%
The paper compares semi-automated interpolation methods to produce soil-class maps from profile observations, using multiple auxiliary predictors such as terrain parameters, remote sensing indices and similar. The Soil Profile Database of Iran, consisting of 4250 profiles, was used to test different soil-class interpolators. The target variables were soil texture classes and World Reference Base soil groups. The predictors were 6 terrain parameters, 11 MODIS EVI images and 17 physiographic regions (polygon map) of Iran. Four techniques were considered: (a) supervised classification using maximum likelihood; (b) multinomial logistic regression; (c) regression-kriging on memberships; and (d) classification of taxonomic distances. The predictive capabilities were assessed using a control subset of 30% of the profiles, with the kappa statistic as the criterion. Supervised classification and multinomial logistic regression can lead to poor results if soil classes overlap in the feature space, or if the correlation between the soil classes and predictors is low. The two other methods have better predictive capabilities, although both are computationally more demanding. For both mapping of texture classes and soil types, the best prediction was achieved using regression-kriging of indicators/memberships (κ=45%...
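
The kappa statistic used as the accuracy criterion above can be computed from observed and predicted classes at the control profiles. The sketch below implements the usual (Cohen's) kappa; the function name and the four toy labels are hypothetical and are not data from the paper.

```python
from collections import Counter


def cohen_kappa(observed, predicted):
    """Cohen's kappa: agreement between two labelings, corrected for chance."""
    n = len(observed)
    # Observed agreement: fraction of points where the two labelings match.
    p_o = sum(o == p for o, p in zip(observed, predicted)) / n
    # Chance agreement from the marginal class frequencies.
    obs_counts = Counter(observed)
    pred_counts = Counter(predicted)
    p_e = sum(obs_counts[c] * pred_counts[c] for c in obs_counts) / n ** 2
    # Undefined (division by zero) when p_e == 1, i.e. a single class throughout.
    return (p_o - p_e) / (1 - p_e)


# Hypothetical texture classes at four control profiles.
observed = ["clay", "clay", "loam", "loam"]
predicted = ["clay", "loam", "loam", "loam"]
print(cohen_kappa(observed, predicted))
```

Here three of four profiles agree (75%), but half of that agreement is expected by chance, so kappa is lower than the raw accuracy.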

Methods to interpolate soil categorical variables from profile observations: lessons from Iran

HENGL TOMISLAV; TOOMANIAN Norair; REUTER HANNES ISAAK; MALAKOUTI Mohammad J.
Source: ELSEVIER SCIENCE BV Publisher: ELSEVIER SCIENCE BV
Type: Articles in Journals Format: Printed
ENG
Search Relevance
16.18%
The paper compares interpolation methods to produce soil-class maps from profile observations, given a large number of auxiliary predictors such as terrain parameters, remote sensing indices and similar. The Soil Profile Database of Iran, consisting of 4250 profiles, was used to test different soil-class interpolators. Four techniques were considered: (a) supervised classification using maximum likelihood; (b) multinomial logistic regression; (c) regression-kriging on memberships; and (d) classification of taxonomic distances. The predictive capabilities were assessed using a control subset of 30% of the profiles and the kappa statistic. Steps to improve the interpolation of soil-class data, by considering the fuzziness of classes directly in the field and by improving the quality of input data, are further discussed.; JRC.H.6-Digital Earth and Reference Data

How optically diverse is the coastal ocean?

MELIN Frederic; VANTREPOTTE Vincent
Source: ELSEVIER SCIENCE INC Publisher: ELSEVIER SCIENCE INC
Type: Articles in periodicals and books Format: Printed
ENG
Search Relevance
26.02%
Coastal regions are a resource for societies while being under severe pressure from a variety of factors. They also show a large diversity of optical characteristics, and the potential to optically classify these waters and distinguish similarities between regions is a fruitful application for satellite ocean color. Recognizing the specificities and complexity of coastal waters in terms of optical properties, a training data set is assembled for coastal regions and marginal seas using full resolution SeaWiFS global remote sensing reflectance RRS data that maximize the geographic coverage and seasonal sampling of the domain. An unsupervised clustering technique is operated on the training data set to derive a set of 16 classes that cover conditions from very turbid to oligotrophic. When applied to a global seven-year SeaWiFS data set, this set of optical water types allows an efficient classification of coastal regions, marginal seas and large inland water bodies. Classes associated with more turbid conditions show relative dominance close to shore and in the mid-latitudes. A geographic partition of the global coastal ocean serves to distinguish general optical similarities between regions. The local optical variability is quantified by the number of classes selected as dominant across the period...

A Class-Based Approach to Characterizing and Mapping the Uncertainty of the MODIS Ocean Chlorophyll Product

DOWELL Mark; MOORE Timothy; CAMPBELL Janet
Source: ELSEVIER SCIENCE INC Publisher: ELSEVIER SCIENCE INC
Type: Articles in Journals Format: Printed
ENG
Search Relevance
26.02%
Global chlorophyll products derived from NASA's ocean color satellite programs have a nominal uncertainty of ± 35%. This metric has been hard to assess, in part because the data sets for evaluating performance do not reflect the true distribution of chlorophyll in the global ocean. A new technique is introduced that characterizes the chlorophyll uncertainty associated with distinct optical water types, and shows that for much of the open ocean the relative error is under 35%. This technique is based on a fuzzy classification of remote sensing reflectance into eight optical water types for which error statistics have been calculated. The error statistics are based on a data set of coincident MODIS Aqua satellite radiances and in situ chlorophyll measurements. The chlorophyll uncertainty is then mapped dynamically based on fuzzy memberships to the optical water types. The uncertainty maps are thus a separate, companion product to the standard MODIS chlorophyll product.; JRC.H.3-Global environment monitoring

Numerical Classification of Soil Profile Data Using Distance Metrics

CARRE' Florence; JACOBSON Martin
Source: ELSEVIER SCIENCE BV Publisher: ELSEVIER SCIENCE BV
Type: Articles in Journals Format: Online
ENG
Search Relevance
26.02%
Quantitative grouping of soil layer descriptions into profile classes has not advanced much since the 1960s. Here we tackle the problem from pedological, utilitarian and joint points of view using an application, OSACA, that we have developed for the purpose. The program calculates the taxonomic distances between observed profiles based on layer (horizon) characteristics. Characteristics can be either observed soil properties or layer class memberships. The inter-profile distance is calculated in three ways: (1) pedological distance focuses on the sequence of layers without regard to layer thickness; (2) utilitarian distance weights the metric according to the layer thickness; (3) joint distance is like utilitarian distance, but with less dependence on layer thickness through prescaling of depths. OSACA either allocates profiles to existing classes or creates a new classification of the profiles. Since the pedological distance seems to be more useful for creating classes for pedogenetic and geomorphic studies, whereas the utilitarian distance may be more useful for environmental applications, we test the three distances for soil taxonomy application and available water capacity prediction, using soil attributes as input variables and classifying them into a new set of profiles. The methods are described for a set of soil profiles in New South Wales...

A land cover map of Latin America and the Caribbean in the framework of the SERENA project

Blanco P.D.; Colditz R.R.; Lopez Saldana G.; Hardtke L.A.; Llamas R.M.; Mari N.A.; Fischer A.; Caride C.; Acenolaza P.G.; del Valle H.F.; Lillo-Saavedra M.; Coronato F.; Opazo S.A.; Morelli F.; Anaya J.A.; Sione W.F.; Zamboni P.; Arroyo V.B.
Source: Universidade de Medellín Publisher: Universidade de Medellín
Type: Article; info:eu-repo/semantics/article
ENG
Search Relevance
46.75%
Land cover maps at different resolutions and mapping extents contribute to modeling and support decision making processes. Because land cover affects and is affected by climate change, it is listed among the 13 terrestrial essential climate variables. This paper describes the generation of a land cover map for Latin America and the Caribbean (LAC) for the year 2008. It was developed in the framework of the project Latin American Network for Monitoring and Studying of Natural Resources (SERENA), which has been developed within the GOFC-GOLD Latin American network of remote sensing and forest fires (RedLaTIF). The SERENA land cover map for LAC integrates: 1) the local expertise of SERENA network members to generate the training and validation data, 2) a methodology for land cover mapping based on decision trees using MODIS time series, and 3) class membership estimates to account for pixel heterogeneity issues. The discrete SERENA land cover product, derived from class memberships, yields an overall accuracy of 84% and includes an additional layer representing the estimated per-pixel confidence. The study demonstrates in detail the use of class memberships to better estimate the area of scarce classes with a scattered spatial distribution. The land cover map is already available as a printed wall map and will be released in digital format in the near future. The SERENA land cover map was produced with a legend and classification strategy similar to that used by the North American Land Change Monitoring System (NALCMS) to generate a land cover map of the North American continent...

Fermi's Sibyl: Mining the gamma-ray sky for dark matter subhaloes

Mirabal, N.; Frias-Martinez, V.; Hassan, T.; Frias-Martinez, E.
Source: Cornell University Publisher: Cornell University
Type: Journal Article
Published on 22/05/2012
Search Relevance
26.02%
Dark matter annihilation signals coming from Galactic subhaloes may account for a small fraction of unassociated point sources detected in the Second Fermi-LAT catalogue (2FGL). To investigate this possibility, we present Sibyl, a Random Forest classifier that offers predictions on class memberships for unassociated Fermi-LAT sources at high Galactic latitudes using gamma-ray features extracted from the 2FGL. Sibyl generates a large ensemble of classification trees that are trained to vote on whether a particular object is an active galactic nucleus (AGN) or a pulsar. After training on a list of 908 identified/associated 2FGL sources, Sibyl reaches individual accuracy rates of up to 97.7% for AGNs and 96.5% for pulsars. Predictions for the 269 unassociated 2FGL sources at |b| > 10 degrees suggest that 216 are potential AGNs and 16 are potential pulsars (with majority votes greater than 70%). The remaining 37 objects are inconclusive, but none is an extreme outlier. These results could guide future quests for dark matter Galactic subhaloes.; Comment: 5 pages, 3 figures, 2 tables, accepted for publication in MNRAS. Complete tables can be retrieved at http://www.gae.ucm.es/~mirabal/sibyl.html
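
The voting step described above, in which an object is only labelled when the majority vote exceeds 70%, can be sketched as follows. This is not the Sibyl code: the function name and the example votes are hypothetical, and the per-tree votes are supplied directly instead of coming from trained classification trees.

```python
from collections import Counter


def classify_by_vote(votes, threshold=0.7):
    """Return (label, vote_fraction) from per-tree votes.

    The label is "inconclusive" unless the winning class receives
    strictly more than `threshold` of the votes.
    """
    label, count = Counter(votes).most_common(1)[0]
    fraction = count / len(votes)
    if fraction > threshold:
        return label, fraction
    return "inconclusive", fraction


# Hypothetical votes from a ten-tree ensemble for two sources.
print(classify_by_vote(["AGN"] * 8 + ["pulsar"] * 2))
print(classify_by_vote(["AGN"] * 6 + ["pulsar"] * 4))
```

The vote fraction doubles as a rough class-membership score, which is why sources with a split vote are left unclassified.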

Discriminative Clustering with Relative Constraints

Pei, Yuanli; Fern, Xiaoli Z.; Rosales, Rómer; Tjahja, Teresa Vania
Source: Cornell University Publisher: Cornell University
Type: Journal Article
Published on 30/12/2014
Search Relevance
16.12%
We study the problem of clustering with relative constraints, where each constraint specifies relative similarities among instances. In particular, each constraint $(x_i, x_j, x_k)$ is acquired by posing a query: is instance $x_i$ more similar to $x_j$ than to $x_k$? We consider the scenario where answers to such queries are based on an underlying (but unknown) class concept, which we aim to discover via clustering. Different from most existing methods that only consider constraints derived from yes and no answers, we also incorporate don't know responses. We introduce a Discriminative Clustering method with Relative Constraints (DCRC) which assumes a natural probabilistic relationship between instances, their underlying cluster memberships, and the observed constraints. The objective is to maximize the model likelihood given the constraints, and in the meantime enforce cluster separation and cluster balance by also making use of the unlabeled instances. We evaluated the proposed method using constraints generated from ground-truth class labels, and from (noisy) human judgments from a user study. Experimental results demonstrate: 1) the usefulness of relative constraints, in particular when don't know answers are considered; 2) the improved performance of the proposed method over state-of-the-art methods that utilize either relative or pairwise constraints; and 3) the robustness of our method in the presence of noisy constraints...

Redshift and spatial distribution of the intermediate gamma-ray bursts

Horvath, I.; Bagoly, Z.; Postigo, A. de Ugarte; Balazs, L. G.; Veres, P.
Source: Cornell University Publisher: Cornell University
Type: Journal Article
Published on 24/04/2015
Search Relevance
26.02%
One of the most important tasks in the gamma-ray burst field is the classification of the bursts. Many studies have demonstrated the existence of a third kind (intermediate duration) of GRBs in the BATSE data. Recent works have analyzed BeppoSax and Swift observations and also identify three types of GRBs in those data sets. Although the class memberships are probabilistic, we have enough observed redshifts to calculate the redshift and spatial distribution of the intermediate GRBs. They are significantly farther than the short bursts and seem to be closer than the long ones.; Comment: AIP Conference Proceedings; 1358. pp. 235-238

A distinct peak-flux distribution of the third class of gamma-ray bursts: A possible signature of X-ray flashes?

Veres, P.; Bagoly, Z.; Horváth, I.; Mészáros, A.; Balázs, L. G.
Source: Cornell University Publisher: Cornell University
Type: Journal Article
Published on 11/10/2010
Search Relevance
26.12%
Gamma-ray bursts are the most luminous events in the Universe. Going beyond the short-long classification scheme, we work in the context of three burst populations, with the third group having intermediate duration and the softest spectrum. We look for physical properties that discriminate the intermediate-duration bursts from the other two classes. We use maximum likelihood fits to establish group memberships in the duration-hardness plane. To confirm these results we also use k-means and hierarchical clustering. We use Monte-Carlo simulations to test the significance of the existence of the intermediate group and we find it with 99.8% probability. The intermediate-duration population has a significantly lower peak-flux (with 99.94% significance). Also, long bursts with measured redshift have higher peak-fluxes (with 98.6% significance) than long bursts without measured redshifts. As the third group is the softest, we argue that these bursts are related to X-ray flashes among the gamma-ray bursts. We give a new, probabilistic definition for this class of events.; Comment: accepted for publication in ApJ

Modelling time evolving interactions in networks through a non stationary extension of stochastic block models

Corneli, Marco; Latouche, Pierre; Rossi, Fabrice
Source: Cornell University Publisher: Cornell University
Type: Journal Article
Published on 08/09/2015
Search Relevance
26.02%
In this paper, we focus on the stochastic block model (SBM), a probabilistic tool describing interactions between nodes of a network using latent clusters. The SBM assumes that the network has a stationary structure, in which connections of time varying intensity are not taken into account. In other words, interactions between two groups are forced to have the same features during the whole observation time. To overcome this limitation, we propose a partition of the whole time horizon, in which interactions are observed, and develop a non stationary extension of the SBM, allowing to simultaneously cluster the nodes in a network along with fixed time intervals in which the interactions take place. The number of clusters (K for nodes, D for time intervals) as well as the class memberships are finally obtained through maximizing the complete-data integrated likelihood by means of a greedy search approach. After showing that the model works properly with simulated data, we focus on a real data set. We thus consider the three-day ACM Hypertext conference held in Turin, June 29th - July 1st 2009. Proximity interactions between attendees during the first day are modelled and an interesting clustering of the daily hours is finally obtained, with times of social gathering (e.g. coffee breaks) recovered by the approach. Applications to large networks are limited due to the computational complexity of the greedy search which is dominated by the number $K_{max}$ and $D_{max}$ of clusters used in the initialization. Therefore...

ABACUS: frequent pAttern mining-BAsed Community discovery in mUltidimensional networkS

Berlingerio, Michele; Pinelli, Fabio; Calabrese, Francesco
Source: Cornell University Publisher: Cornell University
Type: Journal Article
Search Relevance
16.29%
Community Discovery in complex networks is the problem of detecting, for each node of the network, its membership in one or more groups of nodes, the communities, that are densely connected, or highly interactive, or, more generally, similar according to a similarity function. So far, the problem has been widely studied in monodimensional networks, i.e. networks where only one connection between two entities can exist. However, real networks are often multidimensional, i.e., multiple connections between any two nodes can exist, either reflecting different kinds of relationships, or representing different values of the same type of tie. In this context, the problem of Community Discovery has to be redefined, taking into account the multidimensional structure of the graph. We define a new concept of community that groups together nodes sharing memberships in the same monodimensional communities in the different single dimensions. As we show, such communities are meaningful and able to group highly correlated nodes, even if they might not be connected in any of the monodimensional networks. We devise ABACUS (Apriori-BAsed Community discoverer in mUltidimensional networkS), an algorithm that is able to extract multidimensional communities based on the apriori itemset miner applied to monodimensional community memberships. Experiments on two different real multidimensional networks confirm the meaningfulness of the introduced concepts...

Validation of Soft Classification Models using Partial Class Memberships: An Extended Concept of Sensitivity & Co. applied to the Grading of Astrocytoma Tissues

Beleites, Claudia; Salzer, Reiner; Sergo, Valter
Source: Cornell University Publisher: Cornell University
Type: Journal Article
Search Relevance
46.82%
We use partial class memberships in soft classification to model uncertain labelling and mixtures of classes. Partial class memberships are not restricted to predictions, but may also occur in reference labels (ground truth, gold standard diagnosis) for training and validation data. Classifier performance is usually expressed as fractions of the confusion matrix, such as sensitivity, specificity, negative and positive predictive values. We extend this concept to soft classification and discuss the bias and variance properties of the extended performance measures. Ambiguity in reference labels translates to differences between best-case, expected and worst-case performance. We show a second set of measures comparing expected and ideal performance which is closely related to regression performance, namely the root mean squared error RMSE and the mean absolute error MAE. All calculations apply to classical crisp classification as well as to soft classification (partial class memberships and/or one-class classifiers). The proposed performance measures make it possible to test classifiers with actual borderline cases. In addition, hardening of, e.g., posterior probabilities into class labels is not necessary, which avoids the corresponding information loss and increase in variance. We implement the proposed performance measures in the R package "softclassval"...
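
The sketch below illustrates how a performance measure can be evaluated on partial memberships without hardening them. It is not the softclassval implementation: using the product of reference and predicted membership as the soft "and" is an assumption made here for illustration, and the function names and toy memberships are hypothetical. MAE and RMSE follow their usual definitions.

```python
import math


def soft_sensitivity(reference, predicted):
    """Sensitivity for one class with memberships in [0, 1].

    Assumption: the soft "and" of reference and prediction is their product.
    With crisp 0/1 labels this reduces to the classical sensitivity.
    """
    return sum(r * p for r, p in zip(reference, predicted)) / sum(reference)


def mae(reference, predicted):
    """Mean absolute error between reference and predicted memberships."""
    return sum(abs(r - p) for r, p in zip(reference, predicted)) / len(reference)


def rmse(reference, predicted):
    """Root mean squared error between reference and predicted memberships."""
    squared = sum((r - p) ** 2 for r, p in zip(reference, predicted))
    return math.sqrt(squared / len(reference))


# Hypothetical memberships of three samples in one class;
# the second sample is a borderline case in the reference itself.
reference = [1.0, 0.5, 0.0]
predicted = [0.8, 0.5, 0.2]
print(soft_sensitivity(reference, predicted))
print(mae(reference, predicted), rmse(reference, predicted))
```

Because the borderline sample keeps its reference membership of 0.5, it contributes to the measures without being forced into either class.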

Enabling scalable data analysis for large computational structural biology datasets on large distributed memory systems supported by the MapReduce paradigm

Zhang, Boyu
Source: University of Delaware Publisher: University of Delaware
Type: Doctoral Thesis
Search Relevance
26.4%
Taufer, Michela; Today, petascale distributed memory systems perform large-scale simulations and generate massive amounts of data in a distributed fashion at unprecedented rates. This massive amount of data presents new challenges for the scientists analyzing the data. In order to classify and cluster this data, traditional analysis methods require the comparison of single records with each other in an iterative process and therefore involve moving data across nodes of the system. When both the data and the number of nodes increase, classification and clustering methods can put increasing pressure on the system's storage and bandwidth. Thus, the methods become inefficient and do not scale. New methodologies are needed to analyze data when it is distributed across nodes of large distributed memory systems. In general, when analyzing such scientific data, we focus on specific properties of the data records. For example, in structural biology datasets, properties include the molecular geometry or the location of a molecule in a docking pocket. Based on this observation, we propose a methodology that enables the scalable analysis for large datasets, composed of millions of individual data records, in a distributed manner on large distributed memory systems. The methodology comprises two general steps. The first step extracts concise properties or features of each data record in isolation and represents them as metadata in parallel. The second step performs the analysis (i.e....