Page 1 of results: 23908 digital items found in 0.063 seconds

Two-Phase Mapping for Projecting Massive Data Sets

PAULOVICH, Fernando V.; SILVA, Claudio T.; NONATO, L. Gustavo
Source: IEEE COMPUTER SOC Publisher: IEEE COMPUTER SOC
Type: Journal Article
ENG
Search Relevance
55.8%
Most multidimensional projection techniques rely on distance (dissimilarity) information between data instances to embed high-dimensional data into a visual space. When data are endowed with Cartesian coordinates, an extra computational effort is necessary to compute the needed distances, making multidimensional projection prohibitive in applications dealing with interactivity and massive data. The novel multidimensional projection technique proposed in this work, called Part-Linear Multidimensional Projection (PLMP), has been tailored to handle multivariate data represented in Cartesian high-dimensional spaces, requiring only distance information between pairs of representative samples. This characteristic renders PLMP faster than previous methods when processing large data sets while still being competitive in terms of precision. Moreover, knowing the range of variation for data instances in the high-dimensional space, we can make PLMP a truly streaming data projection technique, a trait absent in previous methods.; Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP); Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq); CNPq-NSF; U.S. National Science Foundation (NSF); U.S. Department of Energy (DOE); IBM
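The part-linear idea in the abstract lends itself to a short sketch: embed a small set of representative samples with a distance-based method, then fit a linear map from the Cartesian coordinates to the 2D layout and apply it to all points. A minimal illustration under those assumptions (synthetic data, scikit-learn's MDS for the landmark step; not the authors' implementation):

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 50))        # synthetic "massive" data set in R^50

k = 300                                   # representative samples (landmarks)
landmarks = rng.choice(len(X), size=k, replace=False)
# distance-based embedding of the landmarks only (O(k^2) distances)
Y_land = MDS(n_components=2, random_state=0).fit_transform(X[landmarks])

# fit a linear map Phi minimizing ||X_land @ Phi - Y_land||_F,
# then project every point without computing any further distances
Phi, *_ = np.linalg.lstsq(X[landmarks], Y_land, rcond=None)
Y = X @ Phi                               # O(nD) projection of the full data set
```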

Particle competition and cooperation to prevent error propagation from mislabeled data in semi-supervised learning

Breve, Fabricio; Zhao, Liang
Source: Universidade Estadual Paulista Publisher: Universidade Estadual Paulista
Type: Conference or Conference Object Format: 79-84
ENG
Search Relevance
55.87%
Semi-supervised learning is applied to classification problems where only a small portion of the data items is labeled. In these cases, the reliability of the labels is a crucial factor, because mislabeled items may propagate wrong labels to a large portion of, or even the entire, data set. This paper aims to address this problem by presenting a graph-based (network-based) semi-supervised learning method, specifically designed to handle data sets with mislabeled samples. The method uses teams of walking particles, with competitive and cooperative behavior, for label propagation in the network constructed from the input data set. The proposed model is nature-inspired and incorporates features that make it robust to a considerable amount of mislabeled data items. Computer simulations show the performance of the method in the presence of different percentages of mislabeled data, in networks of different sizes and average node degrees. Importantly, these simulations reveal the existence of a critical point in the mislabeled subset size, below which the network is free of wrong label contamination, but above which the mislabeled samples start to propagate their labels to the rest of the network. Moreover, numerical comparisons have been made between the proposed method and other representative graph-based semi-supervised learning methods using both artificial and real-world data sets. Interestingly...
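The particle model itself is beyond a snippet, but the experimental setting described above (graph-based propagation with some deliberately flipped seed labels) is easy to reproduce with a standard baseline. A sketch using scikit-learn's LabelSpreading, explicitly not the particle competition method:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelSpreading

X, y_true = make_moons(n_samples=400, noise=0.1, random_state=0)
y = np.full(len(y_true), -1)                 # -1 marks unlabeled points
rng = np.random.default_rng(0)
labeled = rng.choice(len(y), size=40, replace=False)
y[labeled] = y_true[labeled]
y[labeled[:4]] = 1 - y[labeled[:4]]          # inject 10% mislabeled seeds

model = LabelSpreading(kernel='knn', n_neighbors=7).fit(X, y)
acc = (model.transduction_ == y_true).mean()
print(f"accuracy with 10% mislabeled seeds: {acc:.2f}")
```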

Mesoscale ocean variability signal recovered from altimeter data in the SW Atlantic Ocean: a comparison of orbit error correction in three Geosat data sets

Goni, Gustavo; Podesta, Guillermo; Brown, Otis; Brown, James
Source: Instituto Oceanográfico da Universidade de São Paulo Publisher: Instituto Oceanográfico da Universidade de São Paulo
Type: Journal Article Format: text/html
Published on 01/01/1995 EN
Search Relevance
65.93%
Orbit error is one of the largest sources of uncertainty in studies of ocean dynamics using satellite altimeters. The sensitivity of GEOSAT mesoscale ocean variability estimates to altimeter orbit precision in the SW Atlantic is analyzed using three GEOSAT data sets derived from different orbit estimation methods: (a) the original GDR data set, which has the lowest orbit precision, (b) the GEM-T2 set, constructed from a much more precise orbital model, and (c) the Sirkes-Wunsch data set, derived from additional spectral analysis of the GEM-T2 data set. Differences among the data sets are investigated for two tracks in dynamically dissimilar regimes of the Southwestern Atlantic Ocean, by comparing: (a) distinctive features of the average power density spectra of the sea height residuals and (b) space-time diagrams of sea height residuals. The variability estimates produced by the three data sets are extremely similar in both regimes after removal of the time-dependent component of the orbit error using a quadratic fit. Our results indicate that altimeter orbit precision with appropriate processing plays only a minor role in studies of mesoscale ocean variability.
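The key processing step mentioned above, removing the time-dependent component of the orbit error with a quadratic fit, can be sketched in a few lines. The along-track signal below is synthetic and illustrative only:

```python
import numpy as np

t = np.linspace(0.0, 1.0, 500)                     # along-track coordinate
mesoscale = 0.08 * np.sin(2 * np.pi * 25 * t)      # short-wavelength signal (m)
orbit_err = 0.5 + 0.3 * t - 0.6 * t**2             # slowly varying orbit error (m)
ssh_resid = mesoscale + orbit_err                  # sea height residuals

coef = np.polyfit(t, ssh_resid, deg=2)             # quadratic fit to the residuals
corrected = ssh_resid - np.polyval(coef, t)        # mesoscale variability survives
```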

Freely available compound data sets and software tools for chemoinformatics and computational medicinal chemistry applications

Hu, Ye; Bajorath, Jurgen
Source: F1000Research Publisher: F1000Research
Type: Journal Article
Published on 14/08/2012 EN
Search Relevance
55.93%
We have generated a number of compound data sets and programs for different types of applications in pharmaceutical research. These data sets and programs were originally designed for our research projects and are made publicly available. Without consulting original literature sources, it is difficult to understand specific features of data sets and software tools, basic ideas underlying their design, and applicability domains. Currently, 30 different entries are available for download from our website. In this data article, we provide an overview of the data and tools we make available and designate the areas of research for which they should be useful. For selected data sets and methods/programs, detailed descriptions are given. This article should help interested readers to select data and tools for specific computational investigations.

Matched molecular pair-based data sets for computer-aided medicinal chemistry

Hu, Ye; de la Vega de León, Antonio; Zhang, Bijun; Bajorath, Jürgen
Source: F1000Research Publisher: F1000Research
Type: Journal Article
Published on 21/02/2014 EN
Search Relevance
55.93%
Matched molecular pairs (MMPs) are widely used in medicinal chemistry to study changes in compound properties including biological activity, which are associated with well-defined structural modifications. Herein we describe up-to-date versions of three MMP-based data sets that have originated from in-house research projects. These data sets include activity cliffs, structure-activity relationship (SAR) transfer series, and second generation MMPs based upon retrosynthetic rules. The data sets have in common that they have been derived from compounds included in the ChEMBL database (release 17) for which high-confidence activity data are available. Thus, the activity data associated with MMP-based activity cliffs, SAR transfer series, and retrosynthetic MMPs cover the entire spectrum of current pharmaceutical targets. Our data sets are made freely available to the scientific community.

Follow up: Compound data sets and software tools for chemoinformatics and medicinal chemistry applications: update and data transfer

Hu, Ye; Bajorath, Jürgen
Source: F1000Research Publisher: F1000Research
Type: Journal Article
Published on 11/03/2014 EN
Search Relevance
55.98%
In 2012, we reported 30 compound data sets and/or programs developed in our laboratory in a data article and made them freely available to the scientific community to support chemoinformatics and computational medicinal chemistry applications. These data sets and computational tools were provided for download from our website. Since publication of this data article, we have generated 13 new data sets with which we further extend our collection of publicly available data and tools. Due to changes in web servers and website architectures, data accessibility has recently been limited at times. Therefore, we have also transferred our data sets and tools to a public repository to ensure full and stable accessibility. To aid in data selection, we have classified the data sets according to scientific subject areas. Herein, we describe new data sets, introduce the data organization scheme, summarize the database content and provide detailed access information for ZENODO (doi:10.5281/zenodo.8451 and doi:10.5281/zenodo.8455).

Field data sets for seagrass biophysical properties for the Eastern Banks, Moreton Bay, Australia, 2004–2014

Roelfsema, Chris M.; Kovacs, Eva M.; Phinn, Stuart R.
Source: Nature Publishing Group Publisher: Nature Publishing Group
Type: Journal Article
Published on 04/08/2015 EN
Search Relevance
55.89%
This paper describes seagrass species and percentage cover point-based field data sets derived from georeferenced photo transects. Annually or biannually over a ten-year period (2004–2014), data sets were collected using 30–50 transects, 500–800 m in length, distributed across a 142 km² shallow, clear-water seagrass habitat, the Eastern Banks, Moreton Bay, Australia. Each of the eight data sets includes seagrass property information derived from approximately 3000 georeferenced, downward-looking photographs captured at 2–4 m intervals along the transects. Photographs were manually interpreted to estimate seagrass species composition and percentage cover (Coral Point Count with Excel extensions; CPCe). Understanding seagrass biology, ecology and dynamics for scientific and management purposes requires point-based data on species composition and cover. This data set, and the methods used to derive it, are a globally unique example for seagrass ecological applications. It provides the basis for multiple further studies at this site, for regional to global comparative studies, and for the design of similar monitoring programs elsewhere.

Attrition in Longitudinal Household Survey Data: Some Tests for Three Developing-Country Samples

Alderman, Harold; Behrman, Jere R.; Kohler, Hans-Peter; Maluccio, John A.; Cotts Watkins, Susan
Source: World Bank, Washington, DC Publisher: World Bank, Washington, DC
EN_US
Search Relevance
55.8%
For capturing dynamic demographic relationships, longitudinal household data can have considerable advantages over more widely used cross-sectional data. But because the collection of longitudinal data may be difficult and expensive, analysts must assess the magnitude of the problems specific to longitudinal, but not to cross-sectional, data. One problem that concerns many analysts is that sample attrition may make the interpretation of estimates problematic. Such attrition may be especially severe where there is considerable migration between rural and urban areas. And attrition is likely to be selective on characteristics such as schooling, so high attrition is likely to bias estimates. The authors consider the extent and implications of attrition for three longitudinal household surveys from Bolivia, Kenya, and South Africa that report very high annual attrition rates between survey rounds. Their estimates indicate that: 1) the means for a number of critical outcome and family background variables differ significantly between those who are lost to follow-up and those who are re-interviewed; 2) a number of family background variables are significant predictors of attrition; 3) nevertheless, the coefficient estimates for standard family background variables in regressions...
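Test (2) above amounts to regressing an attrition indicator on baseline characteristics. A hedged sketch with synthetic data; the variable names are hypothetical placeholders, not the surveys' actual variables:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 2_000
df = pd.DataFrame({
    "schooling_years": rng.integers(0, 16, n),
    "urban": rng.integers(0, 2, n),
    "hh_size": rng.integers(1, 12, n),
})
# simulate selective attrition: more schooling -> more likely to be lost
p = 1.0 / (1.0 + np.exp(-(-2.0 + 0.15 * df["schooling_years"])))
df["attrited"] = (rng.random(n) < p).astype(float)

# significant coefficients would flag attrition as selective, not random
X = sm.add_constant(df[["schooling_years", "urban", "hh_size"]].astype(float))
print(sm.Logit(df["attrited"], X).fit(disp=0).summary())
```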

A new distance for data sets (and probability measures) in a RKHS context

Martos, Gabriel
Source: Universidade Carlos III de Madrid Publisher: Universidade Carlos III de Madrid
Type: info:eu-repo/semantics/draft; info:eu-repo/semantics/workingPaper Format: application/pdf
Published on 05/2013 ENG
Search Relevance
65.96%
In this paper we define distance functions for data sets (and distributions) in an RKHS context. To this aim we introduce kernels for data sets that provide a metrization of the set of point sets (the power set). An interesting point of the proposed kernel distance is that it takes into account the underlying (data-)generating probability distributions. In particular, we propose kernel distances that rely on the estimation of density level sets of the underlying distribution, and can be extended from data sets to probability measures. The performance of the proposed distances is tested on a variety of simulated distributions plus a couple of real pattern recognition problems; This work was partially supported by projects DGUCM 2008/00058/002, MEC 2007/04438/001 and MIC 2012/00084/00
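The paper's distances are built from density level sets; as a simpler, related instance of an RKHS distance between data sets, here is the standard (biased) maximum mean discrepancy with a Gaussian kernel, not the authors' construction:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def mmd(X, Y, gamma=1.0):
    """Distance between the kernel mean embeddings of samples X and Y."""
    kxx = rbf_kernel(X, X, gamma=gamma).mean()
    kyy = rbf_kernel(Y, Y, gamma=gamma).mean()
    kxy = rbf_kernel(X, Y, gamma=gamma).mean()
    return np.sqrt(max(kxx + kyy - 2.0 * kxy, 0.0))

rng = np.random.default_rng(0)
A = rng.normal(0.0, 1.0, size=(200, 5))
B = rng.normal(0.5, 1.0, size=(200, 5))
print(mmd(A, A[:100]), mmd(A, B))   # small within-distribution, larger across
```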

Feature selection for high-dimensional class-imbalanced data sets using Support Vector Machines

Weber, Richard; Famili, Fazel; Maldonado, Sebastián
Source: Elsevier Publisher: Elsevier
Type: Journal Article
EN
Search Relevance
55.82%
SCOPUS-indexed publication; Feature selection and classification of imbalanced data sets are two of the most interesting machine learning challenges, attracting growing attention from both industry and academia. Feature selection addresses the dimensionality reduction problem by determining a subset of available features to build a good model for classification or prediction, while the class-imbalance problem arises when the class distribution is too skewed. Both issues have been studied independently in the literature, and a plethora of methods to address high dimensionality as well as class imbalance has been proposed. The aim of this work is to explore both issues simultaneously, proposing a family of methods that select those attributes that are relevant for the identification of the target class in binary classification. We propose a backward elimination approach based on successive holdout steps, whose contribution measure is based on a balanced loss function obtained on an independent subset. Our experiments are based on six highly imbalanced microarray data sets, comparing our methods with well-known feature selection techniques, and obtaining better predictions with consistently fewer relevant features.; CONICYT...
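One plausible reading of the backward-elimination loop described above, sketched with scikit-learn: repeatedly drop the feature whose removal least hurts a balanced score on a holdout split. Balanced accuracy stands in for the paper's balanced loss, which is an assumption, not the authors' exact contribution measure:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

def backward_elimination(X, y, n_keep=10):
    X_tr, X_ho, y_tr, y_ho = train_test_split(X, y, stratify=y, random_state=0)
    features = list(range(X.shape[1]))
    while len(features) > n_keep:
        scores = []
        for f in features:
            cols = [c for c in features if c != f]
            clf = LinearSVC(class_weight='balanced', dual=False).fit(X_tr[:, cols], y_tr)
            scores.append(balanced_accuracy_score(y_ho, clf.predict(X_ho[:, cols])))
        # drop the feature whose removal gave the best holdout score
        features.remove(features[int(np.argmax(scores))])
    return features

X, y = make_classification(n_samples=300, n_features=40, weights=[0.9, 0.1], random_state=0)
print(backward_elimination(X, y, n_keep=5))
```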

Spatial interpolation of large climate data sets using bivariate thin plate smoothing splines

Hutchinson, Michael; Hancock, P A
Source: Pergamon-Elsevier Ltd Publisher: Pergamon-Elsevier Ltd
Type: Journal Article
Search Relevance
65.83%
Thin plate smoothing splines are widely used to spatially interpolate surface climate; however, their application to large data sets is limited by computational efficiency. Standard analytic calculation of thin plate smoothing splines requires O(n³) operations...
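For scale, the direct calculation being improved upon looks like the following with SciPy's thin plate spline kernel; the point of the paper is precisely that this naive approach stops being feasible as n grows. Synthetic station data, illustrative only:

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

rng = np.random.default_rng(0)
xy = rng.uniform(0, 1, size=(500, 2))              # station coordinates
temp = np.sin(3 * xy[:, 0]) + xy[:, 1] ** 2        # synthetic climate surface

# dense thin plate smoothing spline: O(n^3) in the number of stations
spline = RBFInterpolator(xy, temp, kernel='thin_plate_spline', smoothing=1e-3)
grid = np.stack(np.meshgrid(np.linspace(0, 1, 50),
                            np.linspace(0, 1, 50)), axis=-1).reshape(-1, 2)
surface = spline(grid).reshape(50, 50)             # interpolated climate grid
```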

Creation of theoretical data sets to examine movement variability using modelling

Anderson, Ross; Kenny, Ian; Tucker, Catherine; O'Halloran, Joseph
Source: University of Limerick Publisher: University of Limerick
Type: info:eu-repo/semantics/conferenceObject; all_ul_research; ul_published_reviewed
ENG
Search Relevance
55.94%
peer-reviewed; INTRODUCTION: Recently, a large amount of research has focused on the effect of movement variability on human performance in sport. It is now generally accepted that specific amounts of variability are essential to attain a high level of performance (Davids et al., 2003). When studying the effect of movement variability on outcome performance, the usual method involves collecting numerous data sets from an individual and, assuming that these data sets will all be different (i.e. contain variability), attempting to connect the amount of variability to the change in outcome or performance measure using a number of statistical techniques. The aim of this study is to remove the requirement to collect a large amount of data which, by chance, may contain the level of variability required, and to shorten the data collection phase significantly, by using the proposed process to create theoretical data sets containing alterable variability content while still exhibiting the major characteristics of the actual data. When these theoretical data sets are used in conjunction with a full-body 3D computer model running inverse and forward dynamics simulations, a change in outcome or performance measure can be predicted. The advantages this process offers over traditional techniques are the ability to directly control and quantify the amount of variability introduced into the test data, and a significant reduction in data collection time.; EMBARK IRCSET
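The generation step can be caricatured as: take a mean movement pattern and add a tunable amount of smoothed noise. The scaling and smoothing choices below are assumptions for illustration, not the authors' model:

```python
import numpy as np

def synthetic_trials(mean_traj, n_trials=20, variability=0.05, seed=0):
    """Return n_trials noisy copies of mean_traj with tunable variability."""
    rng = np.random.default_rng(seed)
    trials = []
    for _ in range(n_trials):
        noise = rng.normal(0.0, variability, size=mean_traj.shape)
        # smooth the noise so trials stay kinematically plausible
        kernel = np.ones(9) / 9.0
        noise = np.convolve(noise, kernel, mode='same')
        trials.append(mean_traj + noise)
    return np.array(trials)

t = np.linspace(0, 1, 200)
knee_angle = 60 * np.sin(np.pi * t)           # hypothetical joint-angle curve (deg)
data = synthetic_trials(knee_angle, variability=2.0)
```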

A role-free approach to indexing large RDF data sets in secondary memory for efficient SPARQL evaluation

Fletcher, George H. L.; Beck, Peter W.
Source: Cornell University Publisher: Cornell University
Type: Journal Article
Published on 07/11/2008
Search Relevance
55.86%
Massive RDF data sets are becoming commonplace. RDF data is typically generated in social semantic domains (such as personal information management) wherein a fixed schema is often not available a priori. We propose a simple Three-way Triple Tree (TripleT) secondary-memory indexing technique to facilitate efficient SPARQL query evaluation on such data sets. The novelty of TripleT is that (1) the index is built over the atoms occurring in the data set, rather than at a coarser granularity, such as whole triples occurring in the data set; and (2) the atoms are indexed regardless of the roles (i.e., subjects, predicates, or objects) they play in the triples of the data set. We show through extensive empirical evaluation that TripleT exhibits multiple orders of magnitude improvement over the state of the art on RDF indexing, in terms of both storage and query processing costs.; Comment: 12 pages, 5 figures, 2 tables
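The role-free indexing idea reduces to: key the index on atoms, storing every (triple, role) occurrence. A toy in-memory stand-in for what TripleT keeps in secondary memory:

```python
from collections import defaultdict

triples = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("alice", "age", "42"),
]

index = defaultdict(list)
for triple in triples:
    for role, atom in enumerate(triple):      # role 0/1/2 = subject/predicate/object
        index[atom].append((triple, role))

# one lookup answers "where does 'bob' occur?" regardless of role:
print(index["bob"])   # [(('alice','knows','bob'), 2), (('bob','knows','carol'), 0)]
```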

Nearest Neighbor based Clustering Algorithm for Large Data Sets

Yadav, Pankaj Kumar; Pandey, Sriniwas; Mohanty, Sraban Kumar
Source: Cornell University Publisher: Cornell University
Type: Journal Article
Published on 22/05/2015
Search Relevance
55.93%
Clustering is an unsupervised learning technique in which data or objects are grouped into sets based on some similarity measure. Most clustering algorithms assume that the main memory is infinite and can accommodate the set of patterns. In reality, many applications give rise to a large set of patterns which does not fit in the main memory. When the data set is too large, much of the data is stored in the secondary memory. Input/output (I/O) from the disk is the major bottleneck in designing efficient clustering algorithms for large data sets, and different design techniques have been used to build such algorithms. External memory algorithms are one class of algorithms which can be used for large data sets. These algorithms exploit the hierarchical memory structure of computers by incorporating locality of reference directly in the algorithm. This paper makes a contribution towards designing clustering algorithms in the external memory model (proposed by Aggarwal and Vitter, 1988) to make the algorithms scalable. It is shown that the shared near neighbors algorithm is not very I/O efficient, since its computational complexity is the same as its I/O complexity. The algorithm is redesigned in the external memory model so that the I/O complexity is reduced while the computational complexity remains the same. We substantiate the theoretical analysis by comparing the performance of the algorithms with their traditional counterparts in an implementation based on the STXXL library.; Comment: 10 pages
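The I/O-aware pattern underlying such designs is block-wise processing of disk-resident data. A sketch with numpy's memmap standing in for secondary memory, counting neighbors block by block rather than row by row; the paper's actual algorithms live in the Aggarwal-Vitter model with STXXL, so this is only the flavor of the approach:

```python
import numpy as np

n, d, block = 50_000, 16, 2_500
# disk-backed array standing in for a data set larger than main memory
data = np.memmap("points.dat", dtype=np.float32, mode="w+", shape=(n, d))
data[:] = np.random.default_rng(0).random((n, d), dtype=np.float32)

radius2 = 0.5 ** 2
counts = np.zeros(n, dtype=np.int64)
for i in range(0, n, block):                     # one I/O block of queries
    a = np.asarray(data[i:i + block], dtype=np.float64)
    aa = (a ** 2).sum(axis=1)
    for j in range(0, n, block):                 # stream the rest in blocks
        b = np.asarray(data[j:j + block], dtype=np.float64)
        d2 = aa[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * a @ b.T
        counts[i:i + block] += (d2 <= radius2).sum(axis=1)
```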

Multiscale Geometric Methods for Data Sets II: Geometric Multi-Resolution Analysis

Allard, William K.; Chen, Guangliang; Maggioni, Mauro
Source: Cornell University Publisher: Cornell University
Type: Journal Article
Search Relevance
55.8%
Data sets are often modeled as point clouds in $R^D$, for $D$ large. It is often assumed that the data has some interesting low-dimensional structure, for example that of a $d$-dimensional manifold $M$, with $d$ much smaller than $D$. When $M$ is simply a linear subspace, one may exploit this assumption for encoding efficiently the data by projecting onto a dictionary of $d$ vectors in $R^D$ (for example found by SVD), at a cost $(n+D)d$ for $n$ data points. When $M$ is nonlinear, there are no "explicit" constructions of dictionaries that achieve a similar efficiency: typically one uses either random dictionaries, or dictionaries obtained by black-box optimization. In this paper we construct data-dependent multi-scale dictionaries that aim at efficient encoding and manipulating of the data. Their construction is fast, and so are the algorithms that map data points to dictionary coefficients and vice versa. In addition, data points are guaranteed to have a sparse representation in terms of the dictionary. We think of dictionaries as the analogue of wavelets, but for approximating point clouds rather than functions.; Comment: Re-formatted using AMS style
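The linear case described in the abstract is concrete enough to sketch: the dictionary is the top-d right singular vectors, and each point is stored as d coefficients, at cost (n + D)d. Synthetic data near a d-dimensional subspace:

```python
import numpy as np

rng = np.random.default_rng(0)
D, d, n = 200, 5, 10_000
basis = rng.normal(size=(d, D))                 # data lies near a d-dim subspace
X = rng.normal(size=(n, d)) @ basis + 0.01 * rng.normal(size=(n, D))

# dictionary = top-d right singular vectors; coefficients = projections onto them
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
dictionary = Vt[:d]                             # d vectors in R^D
coeffs = Xc @ dictionary.T                      # d coefficients per point

X_hat = coeffs @ dictionary + X.mean(axis=0)
print(np.abs(X - X_hat).max())                  # only the off-subspace noise remains
```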

An Interactive 3D Visualization Tool for Large Scale Data Sets for Quantitative Atom Probe Tomography

Dahal, Hari; Stukowski, Michael; Graf, Matthias J.; Balatsky, Alexander V.; Rajan, Krishna
Source: Cornell University Publisher: Cornell University
Type: Journal Article
Published on 26/08/2011
Search Relevance
55.86%
Several visualization schemes have been developed for imaging materials at the atomic level through atom probe tomography. The main shortcoming of these tools is their inability to process data in parallel on multi-core computing units to tackle larger data sets. This critically handicaps the ability to make a quantitative interpretation of spatial correlations in chemical composition, since a significant amount of the data is missed during subsequent analysis. In addition, since these visualization tools are not open-source software, there is always a problem with developing a common language for the interpretation of data. In this contribution we present results of our work on an open-source, advanced interactive visualization software tool, which overcomes the difficulty of visualizing larger data sets by supporting parallel rendering through a graphical or script user interface, and permits quantitative analysis of atom probe tomography data in real time. This advancement allows materials scientists a co-design approach to making, measuring and modeling new and nanostructured materials by providing direct feedback to the fabrication and design of samples in real time.; Comment: 5 pages, 2 figures...

Classifying extremely imbalanced data sets

Britsch, Markward; Gagunashvili, Nikolai; Schmelling, Michael
Source: Cornell University Publisher: Cornell University
Type: Journal Article
Published on 29/11/2010
Search Relevance
55.88%
Imbalanced data sets containing much more background than signal instances are very common in particle physics, and will also be characteristic of the upcoming analyses of LHC data. Following up on the work presented at ACAT 2008, we use the multivariate technique presented there (a rule-growing algorithm with the meta-methods bagging and instance weighting) on much more imbalanced data sets, especially a selection of D0 decays without the use of particle identification. It turns out that the quality of the result depends strongly on the number of background instances used for training. We discuss methods to exploit this in order to improve the results significantly, and how to handle and reduce the size of large training sets without loss of result quality in general. We also comment on how to take into account statistical fluctuations in receiver operating characteristic (ROC) curves when comparing classifier methods.
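The dependence on the number of background training instances noted above is easy to probe with a generic stand-in (bagged trees rather than the paper's rule-growing method), training on increasingly large background samples with the signal fixed:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100_000, weights=[0.999, 0.001],
                           n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

sig = np.where(y_tr == 1)[0]                      # keep all signal instances
bkg = np.where(y_tr == 0)[0]
for n_bkg in (1_000, 10_000, len(bkg)):           # grow the background sample
    idx = np.concatenate([sig, bkg[:n_bkg]])
    clf = BaggingClassifier(DecisionTreeClassifier(class_weight='balanced'),
                            n_estimators=20, random_state=0).fit(X_tr[idx], y_tr[idx])
    print(n_bkg, balanced_accuracy_score(y_te, clf.predict(X_te)))
```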

Mesoscale ocean variability signal recovered from altimeter data in the SW Atlantic Ocean: a comparison of orbit error correction in three Geosat data sets

Goni, Gustavo; Podesta, Guillermo; Brown, Otis; Brown, James
Source: Universidade de São Paulo. Instituto Oceanográfico Publisher: Universidade de São Paulo. Instituto Oceanográfico
Type: info:eu-repo/semantics/article; info:eu-repo/semantics/publishedVersion Format: application/pdf
Published on 01/01/1995 ENG
Search Relevance
65.93%
Orbit error has been the main source of uncertainty in the processing of altimeter data. Recent data sets, based on more advanced orbit prediction models and on new error correction methodologies, have been able to reduce the orbit error by up to an order of magnitude compared with the original GDRs. In this work we evaluate the results of these improved estimates for describing mesoscale variability in the southwestern part of the South Atlantic Ocean. We compare results obtained from three data sets: the original GDRs and the GEM-T2 and Sirkes-Wunsch data sets. To ensure the sensitivity of the mesoscale variability estimates to changes in orbit precision, we applied the same environmental corrections and the same data processing method to all three data sets. To investigate possible differences among the mesoscale variability values produced by the three data sets, we used the spectral characteristics of the sea height residuals obtained before and after removal of the time-dependent orbit error. The fact that the mesoscale component of the spectrum was almost unaffected by the removal of the longest-wavelength signal (which corresponds mainly to orbit error) suggests that very little of the mesoscale signal was actually removed in this process. A smaller peak in the spectrum of track B confirms a lower local ocean variability with respect to track A. Further analysis shows that...

Reconstruction of Sea-Surface Temperatures from Assemblages of Planktonic Foraminifera: Multi-Technique Approach Based on Geographically Constrained Calibration Data Sets and its Application to Glacial Atlantic and Pacific Oceans

Kucera, Michal; Weinelt, Mara; Kiefer, T; Pflaumann, U; Hayes, Angela; Weinelt, Martin; Chen, Min-Te; Mix, Alan C.; Barrows, Timothy; Cortijo, E; Duprat, Josette; Juggins, Steve; Waelbroeck, Claire
Source: Pergamon-Elsevier Ltd Publisher: Pergamon-Elsevier Ltd
Type: Journal Article
Search Relevance
65.92%
We present a conceptual framework for a new approach to environmental calibration of planktonic foraminifer census counts. This approach is based on simultaneous application of a variety of transfer function techniques, which are trained on geographically constrained calibration data sets. It serves to minimise bias associated with the presence of cryptic species of planktonic foraminifera and provides an objective tool for assessing reliability of environmental estimates in fossil samples, allowing identification of adverse effects of no-analog faunas and technique-specific bias. We have compiled new calibration data sets for the North (N=862) and South (N=321) Atlantic and the Pacific Ocean (N=1111). We show evidence that these data sets offer adequate coverage of the Sea-Surface Temperature (SST) and faunal variation range and that they are not affected by the presence of pre-Holocene samples and/or calcite dissolution. We have applied four transfer function techniques, including Artificial Neural Networks, Revised Analog Method and SIMMAX (with and without distance weighting) on faunal counts in a Last Glacial Maximum (LGM) data set for the Atlantic Ocean (748 samples in 167 cores; based on the GLAMAP-2000 compilation) and a new data set for the Pacific Ocean (265 samples in 82 cores), and show that three of these techniques provide an adequate degree of independence for the advantage of a multi-technique approach to be realised. The application of our new approach to the glacial Pacific lends support to the contraction and perhaps even a cooling of the Western Pacific Warm Pool and a substantial (>3°C) cooling of the eastern equatorial Pacific and the eastern boundary currents. Our results do not provide conclusive evidence for LGM warming anywhere in the Pacific. The Atlantic reconstruction shows a number of robust patterns...
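Of the techniques named above, the analog family is the simplest to sketch: estimate SST for a fossil sample as the similarity-weighted mean SST of its k most similar calibration assemblages. The squared-chord dissimilarity and the toy arrays below are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def analog_sst(fossil_counts, calib_counts, calib_sst, k=10):
    """fossil_counts: (s,) species proportions; calib_counts: (N, s); calib_sst: (N,)."""
    # squared-chord distance, a common dissimilarity for faunal assemblages
    d = ((np.sqrt(fossil_counts) - np.sqrt(calib_counts)) ** 2).sum(axis=1)
    nearest = np.argsort(d)[:k]
    w = 1.0 / np.maximum(d[nearest], 1e-9)       # inverse-distance weighting
    return (w * calib_sst[nearest]).sum() / w.sum()

rng = np.random.default_rng(0)
calib = rng.dirichlet(np.ones(30), size=862)     # N=862 assemblages, 30 taxa
sst = rng.uniform(-1, 29, size=862)
print(analog_sst(calib[0], calib, sst))
```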

Feasibility of road traffic injury surveillance integrating police and health insurance data sets in the Dominican Republic

Puello, Adrian; Bhatti, Junaid; Salmi, Louis-Rachid
Source: Organización Panamericana de la Salud Publisher: Organización Panamericana de la Salud
Type: Journal Article Format: text/html
Published on 01/07/2013 EN
Search Relevance
55.93%
OBJECTIVE: To assess the feasibility of semiautomated linking of road traffic injury (RTI) cases across different data sets in low- and middle-income countries. METHODS: The study population consisted of RTI cases in the Dominican Republic in 2010, identified in police and health insurance data sets. After duplicates were removed and fatality reporting was corrected by using forensic data, police and health insurance RTI records were linked if they had the same province, collision date, and gender, and similar age within five years. A multinomial logistic regression model assessed the likelihood of being in only one of the data sets. RESULTS: One of five records was a duplicate, including 21.1% of 6 396 police and 16.2% of 6 178 insurance records. Health insurance data recorded 43 of 417 deaths as only injured. Capture-recapture analysis estimated that both data sets recorded one in five RTI cases. Characteristics associated with an increased likelihood (P < 0.05) of being only in the police data set were female gender [adjusted odds ratio (OR) = 2.5], age ≥ 16 years (OR = 1.7), collision in the regions of Cibao Northeast (OR = 4.1) and Valdesia (OR = 6.4), and day of occurrence from Tuesday to Saturday (ORs from 1.5 to 2.9)...
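The linkage rule and the capture-recapture estimate described in the abstract can be sketched directly; the records and column names below are invented for illustration:

```python
import pandas as pd

police = pd.DataFrame({
    "province": ["Valdesia", "Valdesia", "Cibao Northeast"],
    "collision_date": ["2010-03-01", "2010-03-01", "2010-05-20"],
    "gender": ["F", "M", "M"],
    "age": [24, 31, 45],
})
insurance = pd.DataFrame({
    "province": ["Valdesia", "Cibao Northeast"],
    "collision_date": ["2010-03-01", "2010-05-20"],
    "gender": ["F", "M"],
    "age": [26, 44],
})

# link on province, collision date, and gender; then require ages within 5 years
linked = police.merge(insurance, on=["province", "collision_date", "gender"],
                      suffixes=("_pol", "_ins"))
linked = linked[(linked["age_pol"] - linked["age_ins"]).abs() <= 5]

# Lincoln-Petersen capture-recapture: total ≈ (n_police × n_insurance) / n_linked
est_total = len(police) * len(insurance) / len(linked)
print(len(linked), est_total)
```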