Página 5 dos resultados de 23237 itens digitais encontrados em 0.022 segundos

Complexity and Approximation of Fixing Numerical Attributes in Databases Under Integrity Constraints

Bertossi, L.; Bravo, L.; Franconi, E.; Lopatenko, A.
Fonte: Universidade Cornell Publicador: Universidade Cornell
Tipo: Artigo de Revista Científica
Relevância na Pesquisa
36.41%
Consistent query answering is the problem of computing the answers from a database that are consistent with respect to certain integrity constraints that the database as a whole may fail to satisfy. Those answers are characterized as those that are invariant under minimal forms of restoring the consistency of the database. In this context, we study the problem of repairing databases by fixing integer numerical values at the attribute level with respect to denial and aggregation constraints. We introduce a quantitative definition of database fix, and investigate the complexity of several decision and optimization problems, including DFP, i.e. the existence of fixes within a given distance from the original instance, and CQA, i.e. deciding consistency of answers to aggregate conjunctive queries under different semantics. We provide sharp complexity bounds, identify relevant tractable cases; and introduce approximation algorithms for some of those that are intractable. More specifically, we obtain results like undecidability of existence of fixes for aggregation constraints; MAXSNP-hardness of DFP, but a good approximation algorithm for a relevant special case; and intractability but good approximation for CQA for aggregate queries for one database atom denials (plus built-ins).; Comment: 35 pages. Extended version of the camera ready version to appear in Proc. of the Databases Programming Languages Conference (DBPL 05)...

A Unified Approach to Ranking in Probabilistic Databases

Li, Jian; Saha, Barna; Deshpande, Amol
Fonte: Universidade Cornell Publicador: Universidade Cornell
Tipo: Artigo de Revista Científica
Relevância na Pesquisa
36.41%
The dramatic growth in the number of application domains that naturally generate probabilistic, uncertain data has resulted in a need for efficiently supporting complex querying and decision-making over such data. In this paper, we present a unified approach to ranking and top-k query processing in probabilistic databases by viewing it as a multi-criteria optimization problem, and by deriving a set of features that capture the key properties of a probabilistic dataset that dictate the ranked result. We contend that a single, specific ranking function may not suffice for probabilistic databases, and we instead propose two parameterized ranking functions, called PRF-w and PRF-e, that generalize or can approximate many of the previously proposed ranking functions. We present novel generating functions-based algorithms for efficiently ranking large datasets according to these ranking functions, even if the datasets exhibit complex correlations modeled using probabilistic and/xor trees or Markov networks. We further propose that the parameters of the ranking function be learned from user preferences, and we develop an approach to learn those parameters. Finally, we present a comprehensive experimental study that illustrates the effectiveness of our parameterized ranking functions...

Evolving NoSQL Databases Without Downtime

Saur, Karla; Dumitraş, Tudor; Hicks, Michael
Fonte: Universidade Cornell Publicador: Universidade Cornell
Tipo: Artigo de Revista Científica
Relevância na Pesquisa
36.41%
NoSQL databases like Redis, Cassandra, and MongoDB are increasingly popular because they are flexible, lightweight, and easy to work with. Applications that use these databases will evolve over time, sometimes necessitating (or preferring) a change to the format or organization of the data. The problem we address in this paper is: How can we support the evolution of high-availability applications and their NoSQL data online, without excessive delays or interruptions, even in the presence of backward-incompatible data- format changes? We present KVolve, an extension to the popular Redis NoSQL database, as a solution to this problem. KVolve permits a developer to submit an upgrade specification that defines how to transform existing data to the newest version. This transformation is applied lazily as applications interact with the database, thus avoiding long pause times. We demonstrate performing updates to the data stored in Redis from programs with backward-incompatible naming and format changes. We find that KVolve has essentially no overhead in general use and minimal impact during updates, which is a significant improvement over the lengthy pause times incurred when transforming the entire database offline.; Comment: 11 pages...

Graphulo: Linear Algebra Graph Kernels for NoSQL Databases

Gadepally, Vijay; Bolewski, Jake; Hook, Dan; Hutchison, Dylan; Miller, Ben; Kepner, Jeremy
Fonte: Universidade Cornell Publicador: Universidade Cornell
Tipo: Artigo de Revista Científica
Relevância na Pesquisa
36.41%
Big data and the Internet of Things era continue to challenge computational systems. Several technology solutions such as NoSQL databases have been developed to deal with this challenge. In order to generate meaningful results from large datasets, analysts often use a graph representation which provides an intuitive way to work with the data. Graph vertices can represent users and events, and edges can represent the relationship between vertices. Graph algorithms are used to extract meaningful information from these very large graphs. At MIT, the Graphulo initiative is an effort to perform graph algorithms directly in NoSQL databases such as Apache Accumulo or SciDB, which have an inherently sparse data storage scheme. Sparse matrix operations have a history of efficient implementations and the Graph Basic Linear Algebra Subprogram (GraphBLAS) community has developed a set of key kernels that can be used to develop efficient linear algebra operations. However, in order to use the GraphBLAS kernels, it is important that common graph algorithms be recast using the linear algebra building blocks. In this article, we look at common classes of graph algorithms and recast them into linear algebra operations using the GraphBLAS building blocks.; Comment: 10 pages

INSTRUCT: Space-Efficient Structure for Indexing and Complete Query Management of String Databases

Dutta, Sourav; Bhattacharya, Arnab
Fonte: Universidade Cornell Publicador: Universidade Cornell
Tipo: Artigo de Revista Científica
Relevância na Pesquisa
36.41%
The tremendous expanse of search engines, dictionary and thesaurus storage, and other text mining applications, combined with the popularity of readily available scanning devices and optical character recognition tools, has necessitated efficient storage, retrieval and management of massive text databases for various modern applications. For such applications, we propose a novel data structure, INSTRUCT, for efficient storage and management of sequence databases. Our structure uses bit vectors for reusing the storage space for common triplets, and hence, has a very low memory requirement. INSTRUCT efficiently handles prefix and suffix search queries in addition to the exact string search operation by iteratively checking the presence of triplets. We also propose an extension of the structure to handle substring search efficiently, albeit with an increase in the space requirements. This extension is important in the context of trie-based solutions which are unable to handle such queries efficiently. We perform several experiments portraying that INSTRUCT outperforms the existing structures by nearly a factor of two in terms of space requirements, while the query times are better. The ability to handle insertion and deletion of strings in addition to supporting all kinds of queries including exact search...

EMBANKS: Towards Disk Based Algorithms For Keyword-Search In Structured Databases

Gupta, Nitin
Fonte: Universidade Cornell Publicador: Universidade Cornell
Tipo: Artigo de Revista Científica
Publicado em 21/04/2011
Relevância na Pesquisa
36.41%
In recent years, there has been a lot of interest in the field of keyword querying relational databases. A variety of systems such as DBXplorer [ACD02], Discover [HP02] and ObjectRank [BHP04] have been proposed. Another such system is BANKS, which enables data and schema browsing together with keyword-based search for relational databases. It models tuples as nodes in a graph, connected by links induced by foreign key and other relationships. The size of the database graph that BANKS uses is proportional to the sum of the number of nodes and edges in the graph. Systems such as SPIN, which search on Personal Information Networks and use BANKS as the backend, maintain a lot of information about the users' data. Since these systems run on the user workstation which have other demands of memory, such a heavy use of memory is unreasonable and if possible, should be avoided. In order to alleviate this problem, we introduce EMBANKS (acronym for External Memory BANKS), a framework for an optimized disk-based BANKS system. The complexity of this framework poses many questions, some of which we try to answer in this thesis. We demonstrate that the cluster representation proposed in EMBANKS enables in-memory processing of very large database graphs. We also present detailed experiments that show that EMBANKS can significantly reduce database load time and query execution times when compared to the original BANKS algorithms.; Comment: 45 pages...

Causality in Databases: The Diagnosis and Repair Connections

Salimi, Babak; Bertossi, Leopoldo
Fonte: Universidade Cornell Publicador: Universidade Cornell
Tipo: Artigo de Revista Científica
Relevância na Pesquisa
36.41%
In this work we establish and investigate the connections between causality for query answers in databases, database repairs wrt. denial constraints, and consistency-based diagnosis. The first two are relatively new problems in databases, and the third one is an established subject in knowledge representation. We show how to obtain database repairs from causes and the other way around. The vast body of research on database repairs can be applied to the newer problem of determining actual causes for query answers. By formulating a causality problem as a diagnosis problem, we manage to characterize causes in terms of a system's diagnoses.; Comment: Proc. 15th International Workshop on Non-Monotonic Reasoning (NMR 2014)

Controlling Diversity in Benchmarking Graph Databases

Bagan, Guillaume; Bonifati, Angela; Ciucanu, Radu; Fletcher, George H. L.; Lemay, Aurélien; Advokaat, Nicky
Fonte: Universidade Cornell Publicador: Universidade Cornell
Tipo: Artigo de Revista Científica
Relevância na Pesquisa
36.41%
Massive graph data sets are pervasive in contemporary application domains. Hence, graph database systems are becoming increasingly important. In the study of these systems, it is vital that the research community has shared benchmarking solutions for the generation of database instances and query workloads having predictable and controllable properties. Similarly to TPC benchmarks for relational databases, benchmarks for graph databases have been important drivers for the Semantic Web and graph data management communities. Current benchmarks, however, are either limited to fixed graphs or graph schemas, or provide limited or no support for generating tailored query workloads to accompany graph instances. To move the community forward, a benchmarking approach which overcomes these limitations is crucial. In this paper, we present the design and engineering principles of gMark, a domain- and query language-independent graph benchmark addressing these limitations of current solutions. A core contribution of gMark is its ability to target and control the diversity of properties of both the generated graph instances and the generated query workloads coupled to these instances. A further novelty is the support of recursive regular path queries...

Vertical partitioning of relational OLTP databases using integer programming

Amossen, Rasmus Resen
Fonte: Universidade Cornell Publicador: Universidade Cornell
Tipo: Artigo de Revista Científica
Relevância na Pesquisa
36.41%
A way to optimize performance of relational row store databases is to reduce the row widths by vertically partitioning tables into table fractions in order to minimize the number of irrelevant columns/attributes read by each transaction. This paper considers vertical partitioning algorithms for relational row-store OLTP databases with an H-store-like architecture, meaning that we would like to maximize the number of single-sited transactions. We present a model for the vertical partitioning problem that, given a schema together with a vertical partitioning and a workload, estimates the costs (bytes read/written by storage layer access methods and bytes transferred between sites) of evaluating the workload on the given partitioning. The cost model allows for arbitrarily prioritizing load balancing of sites vs. total cost minimization. We show that finding a minimum-cost vertical partitioning in this model is NP-hard and present two algorithms returning solutions in which single-sitedness of read queries is preserved while allowing column replication (which may allow a drastically reduced cost compared to disjoint partitioning). The first algorithm is a quadratic integer program that finds optimal minimum-cost solutions with respect to the model...

Heuristic Algorithm for Interpretation of Non-Atomic Categorical Attributes in Similarity-based Fuzzy Databases - Scalability Evaluation

Hossain, M. Shahriar; Angryk, Rafal A.
Fonte: Universidade Cornell Publicador: Universidade Cornell
Tipo: Artigo de Revista Científica
Publicado em 29/03/2011
Relevância na Pesquisa
36.41%
In this work we are analyzing scalability of the heuristic algorithm we used in the past to discover knowledge from multi-valued symbolic attributes in fuzzy databases. The non-atomic descriptors, characterizing a single attribute of a database record, are commonly used in fuzzy databases to reflect uncertainty about the recorded observation. In this paper, we present implementation details and scalability tests of the algorithm, which we developed to precisely interpret such non-atomic values and to transfer (i.e. defuzzify) the fuzzy tuples to the forms acceptable for many regular (i.e. atomic values based) data mining algorithms. Important advantages of our approach are: (1) its linear scalability, and (2) its unique capability of incorporating background knowledge, implicitly stored in the fuzzy database models in the form of fuzzy similarity hierarchy, into the interpretation/defuzzification process.

SkyQuery: A WebService Approach to Federate Databases

Malik, Tanu; Szalay, Alex S.; Budavari, Tamas; Thakar, Ani R.
Fonte: Universidade Cornell Publicador: Universidade Cornell
Tipo: Artigo de Revista Científica
Publicado em 19/11/2002
Relevância na Pesquisa
36.41%
Traditional science searched for new objects and phenomena that led to discoveries. Tomorrow's science will combine together the large pool of information in scientific archives and make discoveries. Scienthists are currently keen to federate together the existing scientific databases. The major challenge in building a federation of these autonomous and heterogeneous databases is system integration. Ineffective integration will result in defunct federations and under utilized scientific data. Astronomy, in particular, has many autonomous archives spread over the Internet. It is now seeking to federate these, with minimal effort, into a Virtual Observatory that will solve complex distributed computing tasks such as answering federated spatial join queries. In this paper, we present SkyQuery, a successful prototype of an evolving federation of astronomy archives. It interoperates using the emerging Web services standard. We describe the SkyQuery architecture and show how it efficiently evaluates a probabilistic federated spatial join query.; Comment: 9 pages, 3 Figures, To Appear in CIDR'03, Also at http://www.skyquery.net

An Integrated Approach for Extraction of Objects from XML and Transformation to Heterogeneous Object Oriented Databases

Ahmad, Uzair; Hassan, Mohammad Waseem; Ali, Arshad; McClatchey, Richard; Willers, Ian
Fonte: Universidade Cornell Publicador: Universidade Cornell
Tipo: Artigo de Revista Científica
Publicado em 02/02/2004
Relevância na Pesquisa
36.41%
CERN's (European Organization for Nuclear Research) WISDOM project uses XML for the replication of data between different data repositories in a heterogeneous operating system environment. For exchanging data from Web-resident databases, the data needs to be transformed into XML and back to the database format. Many different approaches are employed to do this transformation. This paper addresses issues that make this job more efficient and robust than existing approaches. It incorporates the World Wide Web Consortium (W3C) XML Schema specification in the database-XML relationship. Incorporation of the XML Schema exhibits significant improvements in XML content usage and reduces the limitations of DTD-based database XML services. Secondly the paper explores the possibility of database independent transformation of data between XML and different databases. It proposes a standard XML format that every serialized object should follow. This makes it possible to use objects of heterogeneous database seamlessly using XML.; Comment: 4 pages, 5 figures. Presented at the 5th Int Conf on Enterprise Information Systems, ICEIS'03. Angers France April 2003

A Sample-Based Approach to Data Quality Assessment in Spatial Databases with Application to Mobile Trajectory Nearest-Neighbor Search

Saberi, Bagher; Ghadiri, Nasser
Fonte: Universidade Cornell Publicador: Universidade Cornell
Tipo: Artigo de Revista Científica
Publicado em 09/09/2014
Relevância na Pesquisa
36.41%
Spatial data is playing an emerging role in new technologies such as web and mobile mapping and Geographic Information Systems (GIS). Important decisions in political, social and many other aspects of modern human life are being made using location data. Decision makers in many countries are exploiting spatial databases for collecting information, analyzing them and planning for the future. In fact, not every spatial database is suitable for this type of application. Inaccuracy, imprecision and other deficiencies are present in location data just as any other type of data and may have a negative impact on credibility of any action taken based on unrefined information. So we need a method for evaluating the quality of spatial data and separating usable data from misleading data which leads to weak decisions. On the other hand, spatial databases are usually huge in size and therefore working with this type of data has a negative impact on efficiency. To improve the efficiency of working with spatial big data, we need a method for shrinking the volume of data. Sampling is one of these methods, but its negative effects on the quality of data are inevitable. In this paper we are trying to show and assess this change in quality of spatial data that is a consequence of sampling. We used this approach for evaluating the quality of sampled spatial data related to mobile user trajectories in China which are available in a well-known spatial database. The results show that sample-based control of data quality will increase the query performance significantly...

Object Oriented Approach for Integration of Heterogeneous Databases in a Multidatabase System and Local Schemas Modifications Propagation

Ali, Mohammad Ghulam
Fonte: Universidade Cornell Publicador: Universidade Cornell
Tipo: Artigo de Revista Científica
Publicado em 03/12/2009
Relevância na Pesquisa
36.41%
One of the challenging problems in the multidatabase systems is to find the most viable solution to the problem of interoperability of distributed heterogeneous autonomous local component databases. This has resulted in the creation of a global schema over set of these local component database schemas to provide a uniform representation of local schemas. The aim of this paper is to use object oriented approach to integrate schemas of distributed heterogeneous autonomous local component database schemas into a global schema. The resulting global schema provides a uniform interface and high level of location transparency for retrieval of data from the local component databases. A set of integration operators are defined to integrate local schemas based on the semantic relevance of their classes and to provide a model independent representation of virtual classes of the global schema. The schematic representation and heterogeneity is also taken into account in the integration process. Justifications about Object Oriented Modal are also discussed. Bottom up local schema modifications propagation in Global schema is also considered to maintain Global schema as local schemas are autonomous and evolve over time. An example illustrates the applicability of the integration operator defined.; Comment: 6 pages IEEE format...

Flexible queries in XML native databases

Arfaoui, Olfa; Sassi-Hidri, Minyar
Fonte: Universidade Cornell Publicador: Universidade Cornell
Tipo: Artigo de Revista Científica
Publicado em 06/12/2013
Relevância na Pesquisa
36.41%
To date, most of the XML native databases (DB) flexible querying systems are based on exploiting the tree structure of their semi structured data (SSD). However, it becomes important to test the efficiency of Formal Concept Analysis (FCA) formalism for this type of data since it has been proved a great performance in the field of information retrieval (IR). So, the IR in XML databases based on FCA is mainly based on the use of the lattice structure. Each concept of this lattice can be interpreted as a pair (response, query). In this work, we provide a new flexible modeling of XML DB based on fuzzy FCA as a first step towards flexible querying of SSD.; Comment: 5 Pages, 1 Figure

On the Scalability of Multidimensional Databases

Szépkúti, István
Fonte: Universidade Cornell Publicador: Universidade Cornell
Tipo: Artigo de Revista Científica
Relevância na Pesquisa
36.41%
It is commonly accepted in the practice of on-line analytical processing of databases that the multidimensional database organization is less scalable than the relational one. It is easy to see that the size of the multidimensional organization may increase very quickly. For example, if we introduce one additional dimension, then the total number of possible cells will be at least doubled. However, this reasoning does not takethe fact into account that the multidimensional organization can be compressed. There are compression techniques, which can remove all or at least a part of the empty cells from the multidimensional organization, while maintaining a good retrieval performance. Relational databases often use B-tree indices to speed up the access to given rows of tables. It can be proven, under some reasonable assumptions, that the total size of the table and the B-tree index is bigger than a compressed multidimensional representation. This implies that the compressed array results in a smaller database and faster access at the same time. This paper compares several compression techniques and shows when we should and should not apply compressed arrays instead of relational tables.; Comment: 18 pages, 1 figure, 8 tables. Paper presented at the Second Conference of PhD Students in Computer Science...

Computing Multi-Relational Sufficient Statistics for Large Databases

Qian, Zhensong; Schulte, Oliver; Sun, Yan
Fonte: Universidade Cornell Publicador: Universidade Cornell
Tipo: Artigo de Revista Científica
Publicado em 22/08/2014
Relevância na Pesquisa
36.41%
Databases contain information about which relationships do and do not hold among entities. To make this information accessible for statistical analysis requires computing sufficient statistics that combine information from different database tables. Such statistics may involve any number of {\em positive and negative} relationships. With a naive enumeration approach, computing sufficient statistics for negative relationships is feasible only for small databases. We solve this problem with a new dynamic programming algorithm that performs a virtual join, where the requisite counts are computed without materializing join tables. Contingency table algebra is a new extension of relational algebra, that facilitates the efficient implementation of this M\"obius virtual join operation. The M\"obius Join scales to large datasets (over 1M tuples) with complex schemas. Empirical evaluation with seven benchmark datasets showed that information about the presence and absence of links can be exploited in feature selection, association rule mining, and Bayesian network learning.; Comment: 11pages, 8 figures, 8 tables, CIKM'14,November 3--7, 2014, Shanghai, China

A Visual Query Language for Complex-Value Databases

Koch, Christoph
Fonte: Universidade Cornell Publicador: Universidade Cornell
Tipo: Artigo de Revista Científica
Publicado em 03/02/2006
Relevância na Pesquisa
36.41%
In this paper, a visual language, VCP, for queries on complex-value databases is proposed. The main strength of the new language is that it is purely visual: (i) It has no notion of variable, quantification, partiality, join, pattern matching, regular expression, recursion, or any other construct proper to logical, functional, or other database query languages and (ii) has a very natural, strong, and intuitive design metaphor. The main operation is that of copying and pasting in a schema tree. We show that despite its simplicity, VCP precisely captures complex-value algebra without powerset, or equivalently, monad algebra with union and difference. Thus, its expressive power is precisely that of the language that is usually considered to play the role of relational algebra for complex-value databases.; Comment: 12 pages, 6 figures, 1 table

Fast Updates on Read-Optimized Databases Using Multi-Core CPUs

Krueger, Jens; Kim, Changkyu; Grund, Martin; Satish, Nadathur; Schwalb, David; Chhugani, Jatin; Plattner, Hasso; Dubey, Pradeep; Zeier, Alexander
Fonte: Universidade Cornell Publicador: Universidade Cornell
Tipo: Artigo de Revista Científica
Publicado em 30/09/2011
Relevância na Pesquisa
36.41%
Read-optimized columnar databases use differential updates to handle writes by maintaining a separate write-optimized delta partition which is periodically merged with the read-optimized and compressed main partition. This merge process introduces significant overheads and unacceptable downtimes in update intensive systems, aspiring to combine transactional and analytical workloads into one system. In the first part of the paper, we report data analyses of 12 SAP Business Suite customer systems. In the second half, we present an optimized merge process reducing the merge overhead of current systems by a factor of 30. Our linear-time merge algorithm exploits the underlying high compute and bandwidth resources of modern multi-core CPUs with architecture-aware optimizations and efficient parallelization. This enables compressed in-memory column stores to handle the transactional update rate required by enterprise applications, while keeping properties of read-optimized databases for analytic-style queries.; Comment: VLDB2012

Combining Heterogeneous Classifiers for Relational Databases

Manjunatha, Geetha; Murty, M Narasimha; Sitaram, Dinkar
Fonte: Universidade Cornell Publicador: Universidade Cornell
Tipo: Artigo de Revista Científica
Relevância na Pesquisa
36.41%
Most enterprise data is distributed in multiple relational databases with expert-designed schema. Using traditional single-table machine learning techniques over such data not only incur a computational penalty for converting to a 'flat' form (mega-join), even the human-specified semantic information present in the relations is lost. In this paper, we present a practical, two-phase hierarchical meta-classification algorithm for relational databases with a semantic divide and conquer approach. We propose a recursive, prediction aggregation technique over heterogeneous classifiers applied on individual database tables. The proposed algorithm was evaluated on three diverse datasets, namely TPCH, PKDD and UCI benchmarks and showed considerable reduction in classification time without any loss of prediction accuracy.; Comment: Withdrawn - as that was a trial upload only. Non public information