Page 1 of results: 1721 digital items found in 0.012 seconds

Operações de consulta por similaridade em grandes bases de dados complexos; Similarity search operations in large complex databases

Barioni, Maria Camila Nardini
Source: Biblioteca Digital de Teses e Dissertações da USP Publisher: Biblioteca Digital de Teses e Dissertações da USP
Type: Master's Thesis Format: application/pdf
Published on 04/09/2006 PT
Search Relevance
56.25%
Database Management Systems (DBMS) were developed to store and retrieve efficiently data consisting only of numbers or character strings. In recent decades, however, there has been a marked increase not only in the quantity but also in the complexity of the data handled in databases, including multimedia data (such as images, audio, and video), georeferenced information, and time series, among others. This created the need for new techniques that allow complex data types to be manipulated efficiently. To serve the searches required by modern database applications, DBMS must support similarity searches: queries that look for objects in the database that are similar to a query object according to a given similarity measure. Another important factor contributing to the need to support similarity queries in DBMS is the integration of data mining techniques. Essential to this integration is that DBMS provide resources for carrying out the basic operations underlying the various existing data mining techniques. A basic operation for several of these techniques...

Sistema de busca e exibição de dados georreferenciados; Georeferenced data search and display system

Santos, Vinícius Rosa dos
Source: Universidade Federal do Rio Grande do Sul Publisher: Universidade Federal do Rio Grande do Sul
Type: Undergraduate Thesis Format: application/pdf
POR
Search Relevance
56.11%
One of the main problems faced by historians and geologists working with historical information about our colonization, especially geographic information and genealogical references, is how to publish such data. The georeferencing and geoprocessing software tools available on the market lack a simple, accessible way to view and search the collected information. One alternative is to combine various services and techniques in a single solution, giving the end user a satisfactory experience when interacting with the colonial data. This work presents a system that merges the functionality of a Web geographic mapping service with similarity-based information search techniques in order to publish colonial information about the State of Rio Grande do Sul. The result is easy visualization of the colonial regions on a map and high flexibility when searching the registered data, since historical information is prone to typographical errors and has alternative spellings for names of places and people. In addition, the proposed solution is globally accessible over the Internet and can be consulted by anyone interested.

Similaridade de series temporais na bolsa de valores; Time Series similarity applied to Brazilian stock market

Jeske, Jonas
Source: Universidade Federal do Rio Grande do Sul Publisher: Universidade Federal do Rio Grande do Sul
Type: Undergraduate Thesis Format: application/pdf
POR
Search Relevance
56.27%
One of the premises of the Technical Analysis of stocks is that history repeats itself; that is, certain patterns can be found within the history of price quotes. Given a known shape, it is therefore useful to have some kind of search for patterns similar to that shape. Examining some of the main software packages for technical analysis of quotes, we found that none implements such a search mechanism. In this work we propose time series search within the stock market domain. The search is driven by a similarity function called Dynamic Time Warping (DTW), which we implemented in our prototype. To analyze the results we selected some stocks and some patterns for the queries, and then measured precision and recall for the results obtained. DTW similarity search on the stock market can be considered efficient when we look for only one subseries that is similar to the query; that is, the best case returned will almost always be satisfactory.
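To make the DTW-based subseries search concrete, here is a minimal sketch in Python. It is not the thesis prototype: the function names `dtw_distance` and `best_subseries`, the absolute-difference local cost, and the fixed-length sliding window are all assumptions chosen for illustration.

```python
def dtw_distance(a, b):
    """Textbook O(len(a) * len(b)) Dynamic Time Warping distance.

    Assumption: the local cost is the absolute difference between values.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def best_subseries(series, pattern, window):
    """Slide a fixed-length window over `series` and return (start, distance)
    for the window closest to `pattern` under DTW (hypothetical helper)."""
    best = None
    for start in range(len(series) - window + 1):
        dist = dtw_distance(series[start:start + window], pattern)
        if best is None or dist < best[1]:
            best = (start, dist)
    return best
```

Because DTW allows one point to align with several, a pattern and a stretched copy of it can have distance zero, which is the property that makes it attractive for price patterns that unfold at different speeds.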

Busca por similaridade em uma base de dados de genealogia; Similarity search in a personal database

Veronez, Rovian Voelz
Source: Universidade Federal do Rio Grande do Sul Publisher: Universidade Federal do Rio Grande do Sul
Type: Undergraduate Thesis Format: application/pdf
POR
Search Relevance
66.26%
In genealogy, names are often spelled in several different but similar ways. There are many reasons for this, ranging from changes in grammar over the years and differences in the spelling of certain names across languages to spelling errors made throughout history. It is therefore important that a search over a genealogical database offer the option of searching for similar words, so that relevant results are not ignored merely because they are not identical to the search term. Relational databases do not natively support similarity search; this work therefore presents an implementation of similarity search in the TNG genealogy software using an efficient technique, thus avoiding the loss of relevant results in historical research.
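The abstract does not say which similarity technique was implemented in TNG. As one plausible illustration of searching for similarly spelled names, the sketch below uses Levenshtein edit distance; the function names, the case folding, and the `max_distance` threshold are assumptions, not details from the thesis.

```python
def levenshtein(s, t):
    """Edit distance: minimum number of insertions, deletions, and
    substitutions needed to turn s into t (row-by-row dynamic programming)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cs
                curr[j - 1] + 1,           # insert ct
                prev[j - 1] + (cs != ct),  # substitute, free if equal
            ))
        prev = curr
    return prev[-1]


def similar_names(query, names, max_distance=2):
    """Return names within max_distance edits of query, closest first.

    Hypothetical helper: a linear scan, compared case-insensitively.
    """
    q = query.lower()
    scored = sorted((levenshtein(q, n.lower()), n) for n in names)
    return [n for d, n in scored if d <= max_distance]
```

For example, `similar_names("Schmidt", ["Schmitt", "Schmid", "Smith", "Silva"])` keeps the two one-edit variants and drops the others.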

Biosequence Similarity Search on the Mercury System

Krishnamurthy, Praveen; Buhler, Jeremy; Chamberlain, Roger; Franklin, Mark; Gyang, Kwame; Jacob, Arpith; Lancaster, Joseph
Source: PubMed Publisher: PubMed
Type: Journal Article
Published in 2007 EN
Search Relevance
46.21%
Biosequence similarity search is an important application in modern molecular biology. Search algorithms aim to identify sets of sequences whose extensional similarity suggests a common evolutionary origin or function. The most widely used similarity search tool for biosequences is BLAST, a program designed to compare query sequences to a database. Here, we present the design of BLASTN, the version of BLAST that searches DNA sequences, on the Mercury system, an architecture that supports high-volume, high-throughput data movement off a data store and into reconfigurable hardware. An important component of application deployment on the Mercury system is the functional decomposition of the application onto both the reconfigurable hardware and the traditional processor. Both the Mercury BLASTN application design and its performance analysis are described.

G-Hash: Towards Fast Kernel-based Similarity Search in Large Graph Databases

Wang, Xiaohong; Smalter, Aaron; Huan, Jun; Lushington, Gerald H.
Source: PubMed Publisher: PubMed
Type: Journal Article
Published in 2009 EN
Search Relevance
46.21%
Structured data including sets, sequences, trees and graphs, pose significant challenges to fundamental aspects of data management such as efficient storage, indexing, and similarity search. With the fast accumulation of graph databases, similarity search in graph databases has emerged as an important research topic. Graph similarity search has applications in a wide range of domains including cheminformatics, bioinformatics, sensor network management, social network management, and XML documents, among others.

RAPSearch2: a fast and memory-efficient protein similarity search tool for next-generation sequencing data

Zhao, Yongan; Tang, Haixu; Ye, Yuzhen
Source: Oxford University Press Publisher: Oxford University Press
Type: Journal Article
EN
Search Relevance
46.21%
Summary: With the wide application of next-generation sequencing (NGS) techniques, fast tools for protein similarity search that scale well to large query datasets and large databases are highly desirable. In a previous work, we developed RAPSearch, an algorithm that achieved a ~20–90-fold speedup relative to BLAST while still achieving similar levels of sensitivity for short protein fragments derived from NGS data. RAPSearch, however, requires a substantial memory footprint to identify alignment seeds, due to its use of a suffix array data structure. Here we present RAPSearch2, a new memory-efficient implementation of the RAPSearch algorithm that uses a collision-free hash table to index a similarity search database. The utilization of an optimized data structure further speeds up the similarity search—another 2–3 times. We also implemented multi-threading in RAPSearch2, and the multi-thread modes achieve significant acceleration (e.g. 3.5X for 4-thread mode). RAPSearch2 requires up to 2G memory when running in single thread mode, or up to 3.5G memory when running in 4-thread mode.

Querying Event Sequences by Exact Match or Similarity Search: Design and Empirical Evaluation

Wongsuphasawat, Krist; Plaisant, Catherine; Taieb-Maimon, Meirav; Shneiderman, Ben
Source: PubMed Publisher: PubMed
Type: Journal Article
Published on 01/03/2012 EN
Search Relevance
46.24%
Specifying event sequence queries is challenging even for skilled computer professionals familiar with SQL. Most graphical user interfaces for database search use an exact match approach, which is often effective, but near misses may also be of interest. We describe a new similarity search interface, in which users specify a query by simply placing events on a blank timeline and retrieve a similarity-ranked list of results. Behind this user interface is a new similarity measure for event sequences which the users can customize by four decision criteria, enabling them to adjust the impact of missing, extra, or swapped events or the impact of time shifts. We describe a use case with Electronic Health Records based on our ongoing collaboration with hospital physicians. A controlled experiment with 18 participants compared exact match and similarity search interfaces. We report on the advantages and disadvantages of each interface and suggest a hybrid interface combining the best of both.

Distributed Efficient Similarity Search Mechanism in Wireless Sensor Networks

Ahmed, Khandakar; Gregory, Mark A.
Source: MDPI Publisher: MDPI
Type: Journal Article
Published on 05/03/2015 EN
Search Relevance
46.26%
The Wireless Sensor Network similarity search problem has received considerable research attention due to sensor hardware imprecision and environmental parameter variations. Most of the state-of-the-art distributed data centric storage (DCS) schemes lack optimization for similarity queries of events. In this paper, a DCS scheme with metric based similarity searching (DCSMSS) is proposed. DCSMSS takes motivation from vector distance index, called iDistance, in order to transform the issue of similarity searching into the problem of an interval search in one dimension. In addition, a sector based distance routing algorithm is used to efficiently route messages. Extensive simulation results reveal that DCSMSS is highly efficient and significantly outperforms previous approaches in processing similarity search queries.

Optimizing similarity queries in metric spaces meeting user's expectation; Otimização de operações de busca por similaridade em espaços métricos

Ferreira, Mônica Ribeiro Porto
Source: Biblioteca Digital de Teses e Dissertações da USP Publisher: Biblioteca Digital de Teses e Dissertações da USP
Type: Doctoral Thesis Format: application/pdf
Published on 22/10/2012 EN
Search Relevance
46.31%
The complexity of data stored in large databases has increased at a very fast pace. Hence, operations more elaborate than traditional queries are essential in order to extract all required information from the database. Therefore, the interest of the database community in similarity search has increased significantly. Two of the well-known types of similarity search are the Range ($R_q$) and the k-Nearest Neighbor ($kNN_q$) queries, which, like any of the traditional ones, can be sped up by indexing structures of the Database Management System (DBMS). Another way of speeding up queries is to perform query optimization. In this process, metrics about data are collected and employed to adjust the parameters of the search algorithms in each query execution. However, although the integration of similarity search into DBMS has begun to be deeply studied more recently, query optimization has been developed and employed just to answer traditional queries. The execution of similarity queries, even using efficient indexing structures, tends to present higher computational cost than the execution of traditional ones. Two strategies can be applied to speed up the execution of any query, and thus they are worth employing to answer similarity queries as well. The first strategy is query rewriting based on algebraic properties and cost functions. The second technique is when external query factors are applied...
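The two query types named in the abstract can be illustrated with a minimal linear-scan baseline in Python. This is only a sketch of what the queries return, not of the thesis's indexing or optimization work; the function names and the choice of Euclidean distance as the default metric are assumptions.

```python
import heapq
import math


def range_query(data, q, radius, dist=math.dist):
    """Range query: every object whose distance to q is at most radius."""
    return [o for o in data if dist(o, q) <= radius]


def knn_query(data, q, k, dist=math.dist):
    """k-nearest-neighbor query: the k objects closest to q, nearest first."""
    return heapq.nsmallest(k, data, key=lambda o: dist(o, q))
```

Both functions evaluate the distance to every object, which is the cost that index structures and query optimization aim to reduce. Any function satisfying the metric axioms can be passed as `dist`.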

Dynamic Similarity Search in MultiMetric Spaces

Skopal, Tomás; Bustos Cárdenas, Benjamín Eugenio
Source: Universidade do Chile Publisher: Universidade do Chile
Type: Journal Article
EN_US
Search Relevance
46.21%
ISI-indexed article; An important research issue in multimedia databases is the retrieval of similar objects. For most applications in multimedia databases, an exact search is not meaningful. Thus, much effort has been devoted to developing efficient and effective similarity search techniques. A recent approach that has been shown to improve the effectiveness of similarity search in multimedia databases resorts to combinations of metrics, where the desirable contribution (weight) of each metric is chosen at query time. This paper presents the Multi-Metric M-tree (M3-tree), a metric access method that supports similarity queries with dynamic combinations of metric functions. The M3-tree, an extension of the M-tree, stores partial distances to better estimate the weighted distances between routing/ground entries and each query, where a single distance function is used to build the whole index. An experimental evaluation shows that the M3-tree may be as efficient as having multiple M-trees (one for each combination of metrics).
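The idea of choosing metric weights at query time can be sketched without any index. The Python snippet below is not the M3-tree; it is a linear-scan illustration in which the object layout (dicts with hypothetical `color` and `shape` feature vectors), the two component metrics, and all function names are assumptions.

```python
import math


def color_dist(a, b):
    """Hypothetical component metric over the 'color' feature vector."""
    return math.dist(a["color"], b["color"])


def shape_dist(a, b):
    """Hypothetical component metric over the 'shape' feature vector."""
    return math.dist(a["shape"], b["shape"])


def weighted_distance(metrics, weights):
    """Build a distance as a non-negative weighted sum of component metrics.

    Such a sum still satisfies the triangle inequality, so it can be used
    wherever a metric is required.
    """
    def dist(a, b):
        return sum(w * m(a, b) for m, w in zip(metrics, weights))
    return dist


def knn(data, query, k, dist):
    """Linear-scan k-nearest neighbors under the supplied distance."""
    return sorted(data, key=lambda o: dist(o, query))[:k]
```

The same dataset can then be queried with different weights, for example `weighted_distance([color_dist, shape_dist], [0.8, 0.2])` for one query and `[0.2, 0.8]` for the next, without rebuilding anything.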

Bayesian Locality Sensitive Hashing for Fast Similarity Search

Satuluri, Venu; Parthasarathy, Srinivasan
Source: Cornell University Publisher: Cornell University
Type: Journal Article
Search Relevance
46.29%
Given a collection of objects and an associated similarity measure, the all-pairs similarity search problem asks us to find all pairs of objects with similarity greater than a certain user-specified threshold. Locality-sensitive hashing (LSH) based methods have become a very popular approach for this problem. However, most such methods only use LSH for the first phase of similarity search - i.e. efficient indexing for candidate generation. In this paper, we present BayesLSH, a principled Bayesian algorithm for the subsequent phase of similarity search - performing candidate pruning and similarity estimation using LSH. A simpler variant, BayesLSH-Lite, which calculates similarities exactly, is also presented. BayesLSH is able to quickly prune away a large majority of the false positive candidate pairs, leading to significant speedups over baseline approaches. For BayesLSH, we also provide probabilistic guarantees on the quality of the output, both in terms of accuracy and recall. Finally, the quality of BayesLSH's output can be easily tuned and does not require any manual setting of the number of hashes to use for similarity estimation, unlike standard approaches. For two state-of-the-art candidate generation algorithms, AllPairs and LSH...

Efficient Subgraph Similarity Search on Large Probabilistic Graph Databases

Yuan, Ye; Wang, Guoren; Chen, Lei; Wang, Haixun
Source: Cornell University Publisher: Cornell University
Type: Journal Article
Published on 30/05/2012
Search Relevance
46.26%
Many studies have been conducted on seeking the efficient solution for subgraph similarity search over certain (deterministic) graphs due to its wide application in many fields, including bioinformatics, social network analysis, and Resource Description Framework (RDF) data management. All these works assume that the underlying data are certain. However, in reality, graphs are often noisy and uncertain due to various factors, such as errors in data extraction, inconsistencies in data integration, and privacy preserving purposes. Therefore, in this paper, we study subgraph similarity search on large probabilistic graph databases. Different from previous works assuming that edges in an uncertain graph are independent of each other, we study the uncertain graphs where edges' occurrences are correlated. We formally prove that subgraph similarity search over probabilistic graphs is #P-complete, thus, we employ a filter-and-verify framework to speed up the search. In the filtering phase, we develop tight lower and upper bounds of subgraph similarity probability based on a probabilistic matrix index, PMI. PMI is composed of discriminative subgraph features associated with tight lower and upper bounds of subgraph isomorphism probability. Based on PMI...

SEAL: Spatio-Textual Similarity Search

Fan, Ju; Li, Guoliang; Zhou, Lizhu; Chen, Shanshan; Hu, Jun
Source: Cornell University Publisher: Cornell University
Type: Journal Article
Published on 30/05/2012
Search Relevance
46.23%
Location-based services (LBS) have become more and more ubiquitous recently. Existing methods focus on finding relevant points-of-interest (POIs) based on users' locations and query keywords. Nowadays, modern LBS applications generate a new kind of spatio-textual data, regions-of-interest (ROIs), containing region-based spatial information and textual description, e.g., mobile user profiles with active regions and interest tags. To satisfy search requirements on ROIs, we study a new research problem, called spatio-textual similarity search: Given a set of ROIs and a query ROI, we find the similar ROIs by considering spatial overlap and textual similarity. Spatio-textual similarity search has many important applications, e.g., social marketing in location-aware social networks. It calls for an efficient search method to support large scales of spatio-textual data in LBS systems. To this end, we introduce a filter-and-verification framework to compute the answers. In the filter step, we generate signatures for the ROIs and the query, and utilize the signatures to generate candidates whose signatures are similar to that of the query. In the verification step, we verify the candidates and identify the final answers. To achieve high performance...

Scalable Locality-Sensitive Hashing for Similarity Search in High-Dimensional, Large-Scale Multimedia Datasets

Teixeira, Thiago S. F. X.; Teodoro, George; Valle, Eduardo; Saltz, Joel H.
Source: Cornell University Publisher: Cornell University
Type: Journal Article
Published on 15/10/2013
Search Relevance
46.28%
Similarity search is critical for many database applications, including the increasingly popular online services for Content-Based Multimedia Retrieval (CBMR). These services, which include image search engines, must handle an overwhelming volume of data while keeping response times low. Thus, scalability is imperative for similarity search in Web-scale applications, but most existing methods are sequential and target shared-memory machines. Here we address these issues with a distributed, efficient, and scalable index based on Locality-Sensitive Hashing (LSH). LSH is one of the most efficient and popular techniques for similarity search, but its poor referential locality properties have made its implementation a challenging problem. Our solution is based on a widely asynchronous dataflow parallelization with a number of optimizations that include a hierarchical parallelization to decouple indexing and data storage, locality-aware data partition strategies to reduce message passing, and multi-probing to limit memory usage. The proposed parallelization attained an efficiency of 90% in a distributed system with about 800 CPU cores. In particular, the original locality-aware data partition reduced the number of messages exchanged by 30%. Our parallel LSH was evaluated using the largest public dataset for similarity search (to the best of our knowledge) with $10^9$ 128-d SIFT descriptors extracted from Web images. This is two orders of magnitude larger than datasets that previous LSH parallelizations could handle.
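For readers unfamiliar with LSH, the following Python sketch shows the basic bucket-and-lookup idea on a single machine. It does not reproduce the paper's distributed, multi-probe design, and it swaps in the random-hyperplane hash family (suited to angular similarity) for simplicity; the class name, parameters, and defaults are all assumptions.

```python
import random
from collections import defaultdict


class HyperplaneLSH:
    """Single-machine LSH index using random-hyperplane signatures."""

    def __init__(self, dim, n_bits=8, n_tables=4, seed=0):
        rng = random.Random(seed)
        # Each table has its own n_bits random hyperplanes (normal vectors).
        self.planes = [
            [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]
            for _ in range(n_tables)
        ]
        self.tables = [defaultdict(list) for _ in range(n_tables)]

    @staticmethod
    def _signature(planes, v):
        # One bit per hyperplane: which side of the plane v falls on.
        return tuple(sum(p * x for p, x in zip(plane, v)) >= 0 for plane in planes)

    def add(self, key, v):
        """Store key in the bucket matching v's signature in every table."""
        for planes, table in zip(self.planes, self.tables):
            table[self._signature(planes, v)].append(key)

    def candidates(self, v):
        """Return keys sharing a bucket with v in at least one table."""
        found = set()
        for planes, table in zip(self.planes, self.tables):
            found.update(table.get(self._signature(planes, v), []))
        return found
```

The candidate set is only a filter: a real search would then compute exact distances to the candidates. Using several tables raises the chance that a true neighbor shares at least one bucket with the query.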

Representation Independent Proximity and Similarity Search

Chodpathumwan, Yodsawalai; Aleyasin, Amirhossein; Termehchy, Arash; Sun, Yizhou
Source: Cornell University Publisher: Cornell University
Type: Journal Article
Published on 15/08/2015
Search Relevance
46.38%
Finding similar or strongly related entities in a graph database is a fundamental problem in data management and analytics with applications in similarity query processing, entity resolution, and pattern matching. Similarity search algorithms usually leverage the structural properties of the data graph to quantify the degree of similarity or relevance between entities. Nevertheless, the same information can be represented in many different structures and the structural properties observed over particular representations do not necessarily hold for alternative structures. Thus, these algorithms are effective on some representations and ineffective on others. We postulate that a similarity search algorithm should return essentially the same answers over different databases that represent the same information. We formally define the property of representation independence for similarity search algorithms as their robustness against transformations that modify the structure of databases and preserve their information content. We formalize two widespread groups of such transformations called relationship reorganizing and entity rearranging transformations. We show that current similarity search algorithms are not representation independent under these transformations and propose an algorithm called R-PathSim...

Curse of Dimensionality in the Application of Pivot-based Indexes to the Similarity Search Problem

Volnyansky, Ilya
Source: Cornell University Publisher: Cornell University
Type: Journal Article
Published on 13/05/2009
Search Relevance
46.23%
In this work we study the validity of the so-called curse of dimensionality for indexing of databases for similarity search. We perform an asymptotic analysis, with a test model based on a sequence of metric spaces $(\Omega_d)$ from which we pick datasets $X_d$ in an i.i.d. fashion. We call the subscript $d$ the dimension of the space $\Omega_d$ (e.g. for $\mathbb{R}^d$ the dimension is just the usual one) and we allow the size of the dataset $n=n_d$ to be such that $d$ is superlogarithmic but subpolynomial in $n$. We study the asymptotic performance of pivot-based indexing schemes where the number of pivots is $o(n/d)$. We pick the relatively simple cost model of similarity search where we count each distance calculation as a single computation and disregard the rest. We demonstrate that if the spaces $\Omega_d$ exhibit the (fairly common) concentration of measure phenomenon the performance of similarity search using such indexes is asymptotically linear in $n$. That is, for large enough $d$ the difference between using such an index and performing a search without an index at all is negligible. Thus we confirm the curse of dimensionality in this setting.; Comment: 56 pages, 7 figures Master's Thesis in Mathematics, University of Ottawa (Canada) Supervisor: Vladimir Pestov
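A pivot-based index and the cost model used in the abstract (count only distance computations) can be sketched in a few lines of Python. This is a generic pivot-table illustration, not code from the thesis; the function names and the default Euclidean metric are assumptions.

```python
import math


def build_pivot_table(data, pivots, dist=math.dist):
    """Precompute d(o, p) for every object o and pivot p (the index)."""
    return [[dist(o, p) for p in pivots] for o in data]


def pivot_range_query(data, pivots, table, q, radius, dist=math.dist):
    """Range query that discards objects using the triangle inequality.

    Returns (results, computed), where computed counts the distance
    evaluations made at query time, as in the abstract's cost model.
    """
    q_to_pivots = [dist(q, p) for p in pivots]
    computed = len(pivots)
    results = []
    for o, row in zip(data, table):
        # |d(q, p) - d(o, p)| <= d(q, o): if the lower bound already
        # exceeds the radius, o cannot be in the result.
        if any(abs(qp - op) > radius for qp, op in zip(q_to_pivots, row)):
            continue
        computed += 1
        if dist(o, q) <= radius:
            results.append(o)
    return results, computed
```

When distances concentrate around a common value, the lower bounds `abs(qp - op)` become small for almost every object, few objects are discarded, and `computed` approaches the dataset size, which is the behavior the abstract describes as asymptotically linear in $n$.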

Performance Evaluation and Optimization of Math-Similarity Search

Zhang, Qun; Youssef, Abdou
Source: Cornell University Publisher: Cornell University
Type: Journal Article
Published on 29/05/2015
Search Relevance
46.29%
Similarity search in math is to find mathematical expressions that are similar to a user's query. We conceptualized the similarity factors between mathematical expressions, and proposed an approach to math similarity search (MSS) by defining metrics based on those similarity factors [11]. Our preliminary implementation indicated the advantage of MSS compared to non-similarity based search. In order to more effectively and efficiently search similar math expressions, MSS is further optimized. This paper focuses on performance evaluation and optimization of MSS. Our results show that the proposed optimization process significantly improved the performance of MSS with respect to both relevance ranking and recall.; Comment: 15 pages, 8 figures

Fast Structural Similarity Search of Noncoding RNAs Based on Matched Filtering of Stem Patterns

Yoon, Byung-Jun; Vaidyanathan, P. P.
Source: IEEE Publisher: IEEE
Type: Book Section; PeerReviewed Format: application/pdf
Published in 2007
Search Relevance
46.23%
Many noncoding RNAs (ncRNAs) have characteristic secondary structures that give rise to complicated base correlations in their primary sequences. Therefore, when performing an RNA similarity search to find new members of a ncRNA family, we need a statistical model - such as the profile-csHMM or the covariance model (CM) - that can effectively describe the correlations between distant bases. However, these models are computationally expensive, making the resulting RNA search very slow. To overcome this problem, various prescreening methods have been proposed that first use a simpler model to scan the database and filter out the dissimilar regions. Only the remaining regions that bear some similarity are passed to a more complex model for closer inspection. It has been shown that the prescreening approach can make the search speed significantly faster at no (or a slight) loss of prediction accuracy. In this paper, we propose a novel prescreening method based on matched filtering of stem patterns. Unlike many existing methods, the proposed method can prescreen the database solely based on structural similarity. The proposed method can handle RNAs with arbitrary secondary structures, and it can be easily incorporated into various search methods that use different statistical models. Furthermore...

Solving Multiple Queries through a Permutation Index in GPU

Lopresti,Mariela; Miranda,Natalia; Piccoli,Fabiana; Reyes,Nora
Source: Centro de Investigación en computación, IPN Publisher: Centro de Investigación en computación, IPN
Type: Journal Article Format: text/html
Published on 01/09/2013 EN
Search Relevance
46.24%
Query-by-content by means of similarity search is a fundamental operation for applications that deal with multimedia data. For this kind of query it is meaningless to look for elements exactly equal to the one given as query. Instead, we need to measure dissimilarity between the query object and each database object. The metric space model is a paradigm that allows modeling all similarity search problems. Metric databases permit to store objects from a metric space and efficiently perform similarity queries over them, in general, by reducing the number of distance evaluations needed. Therefore, the goal is to preprocess a particular dataset in such a way that queries can be answered with as few distance computations as possible. Moreover, for a very large metric database it is not enough to preprocess the dataset by building an index, it is also necessary to speed up the queries via high performance computing using GPU. In this work we show an implementation of a pure GPU architecture to build a Permutation Index used for approximate similarity search on databases of different data nature and to solve many queries at the same time. Besides, we evaluate the tradeoff between the answer quality and time performance of our implementation.
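The permutation-index idea can be sketched sequentially on a CPU. The Python snippet below is not the paper's GPU implementation: each object is represented by the order in which it "sees" a set of pivots, and candidates are ranked by how similar their permutation is to the query's. The function names, the Spearman footrule comparison, and the `n_candidates` parameter are assumptions for illustration.

```python
import math


def permutation(obj, pivots, dist=math.dist):
    """Return ranks[i] = position of pivot i when pivots are sorted by
    distance to obj (ties broken by pivot index)."""
    order = sorted(range(len(pivots)), key=lambda i: (dist(obj, pivots[i]), i))
    ranks = [0] * len(pivots)
    for rank, i in enumerate(order):
        ranks[i] = rank
    return ranks


def footrule(r1, r2):
    """Spearman footrule: total displacement between two rank vectors."""
    return sum(abs(a - b) for a, b in zip(r1, r2))


def approx_knn(data, pivots, q, k, n_candidates, dist=math.dist):
    """Approximate kNN: shortlist by permutation similarity, then refine
    the shortlist with true distances."""
    # In a real index these permutations are built once and stored.
    perms = [permutation(o, pivots, dist) for o in data]
    q_perm = permutation(q, pivots, dist)
    order = sorted(range(len(data)), key=lambda i: footrule(perms[i], q_perm))
    shortlist = [data[i] for i in order[:n_candidates]]
    return sorted(shortlist, key=lambda o: dist(o, q))[:k]
```

Raising `n_candidates` trades time for answer quality, which is the tradeoff the abstract says the authors evaluate.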