Page 1 of results, 1063 digital items found in 0.088 seconds

Improving biodiversity data retrieval through semantic search and ontologies

Amanqui, Flor Karina Mamani; Serique, Kleberson Junio do Amaral; Cardoso, Silvio Domingos; Santos, José L. dos; Albuquerque, Andrea; Moreira, Dilvan de Abreu
Source/Publisher: University of Warsaw; Institute of Electrical and Electronics Engineers - IEEE; Web Intelligence Consortium - WIC; Association for Computing Machinery - ACM; Warsaw
Type: Conference or Conference Object
ENG
Search Relevance: 55.63%
Due to the increased amount of available biodiversity data, many biodiversity research institutions are now making their databases openly available on the web. Researchers in the field use these databases to extract new knowledge and to share their own discoveries. However, when these researchers need to find relevant information in the data, they still rely on the traditional search approach, based on text matching, which is not appropriate for such large amounts of heterogeneous biodiversity data and leads to search results with low precision and recall. We present a new architecture that tackles this problem using a semantic search system for biodiversity data. Semantic search aims to improve search accuracy by using ontologies to understand user objectives and the contextual meaning of terms used in the search, generating more relevant results. Biodiversity data is mapped to terms from relevant ontologies, such as Darwin Core, DBpedia, Ontobio, and Catalogue of Life, stored using semantic web formats, and queried using semantic web tools (such as triple stores). A prototype semantic search tool was successfully implemented and evaluated by users from the National Research Institute for the Amazon (INPA). Our results show that the semantic search approach has better precision (28% improvement) and recall (25% improvement) when compared to keyword-based search...
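The abstract does not reproduce the system's queries, but the triple-store retrieval it describes can be sketched in a few lines. In this illustration the Fuseki endpoint URL is hypothetical; dwc:scientificName and dwc:country are real Darwin Core terms, though using them this way is an assumption about the mapping, not the paper's actual schema.

```python
# Minimal sketch of querying biodiversity triples mapped to Darwin Core.
# The endpoint URL is hypothetical; the dwc: properties are real Darwin
# Core terms, but this mapping is assumed, not the paper's schema.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("http://localhost:3030/biodiversity/sparql")
endpoint.setQuery("""
    PREFIX dwc: <http://rs.tdwg.org/dwc/terms/>
    SELECT ?occurrence ?species WHERE {
        ?occurrence dwc:scientificName ?species ;
                    dwc:country "Brazil" .
    } LIMIT 25
""")
endpoint.setReturnFormat(JSON)

for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["occurrence"]["value"], row["species"]["value"])
```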

Risky business: social media metrics and political risk analysis

Nelson, Laura Kathleen
Source/Publisher: Fundação Getúlio Vargas
Type: Dissertation
EN_US
Search Relevance: 55.73%
Quantifying country risk, and political risk in particular, raises several difficulties for companies, institutions, and investors. Since economic indicators are updated far less frequently than Facebook, understanding, and more precisely measuring, what is happening on the ground in real time can be a challenge for political risk analysts. However, with the growing availability of "big data" from social tools such as Twitter, now is the opportune moment to examine the kinds of social media metrics that are available and the limitations of their application to country risk analysis, especially during episodes of political violence. Using the qualitative method of bibliographic research, this study identifies the current landscape of data available from Twitter, analyzes current and potential methods of analysis, and discusses their possible application in the field of political risk analysis. After a thorough review of the field to date, and taking into account the technological advances expected in the short and medium term, this study concludes that, despite obstacles such as the cost of information storage, the limitations of real-time analysis...

PS1-24: Proof of Principle Demonstrations of a Distributed Research Network: Findings and Lessons Learned

Brown, Jeffrey S; Holmes, John H; Lazarus, Ross; Langer, Robert; Magid, David; Nelson, Andrew; Selby, Joe; Wagner, Edward; Platt, Richard
Source/Publisher: Marshfield Clinic
Type: Scientific Journal Article
Published in 03/2010 EN
Search Relevance: 55.63%
Researchers, policy makers and others commonly use electronic data routinely collected during the delivery of, or payment for, medical care to study the “real world” effectiveness, comparative effectiveness, safety, and costs of medical interventions. However, even very large individual healthcare data resources are often not big enough to adequately conduct post-marketing evidence studies. To improve the ability to use multiple distributed data resources, like the HMORN’s Virtual Data Warehouse, the Developing Evidence to Inform Decisions about Effectiveness (DEcIDE) centers at the HMORN Center for Education and Research on Therapeutics (CERT) and the University of Pennsylvania are developing a design for a scalable distributed research network. Funded by AHRQ’s Effective Health Care Program, the project intends to develop the framework for a distributed research network that will help close the knowledge gap regarding the “real world” effectiveness, comparative effectiveness, and safety of medical technologies. The network will have these important attributes: distributed architecture, strong local control of data uses, and federated querying. Constructing the network is a challenge, even among sites with fully operational local virtual data warehouse (VDW) installations...
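The three attributes the abstract names (distributed architecture, strong local control of data uses, federated querying) follow a pattern that can be sketched compactly: each site evaluates the query against its own data and releases only aggregates to the coordinator. The site names and record layout below are invented for illustration; this is not the project's actual network code.

```python
# Toy sketch of federated querying with strong local control: each site
# evaluates the query locally and only aggregate counts leave the site.
# Site names and the record layout are hypothetical.

SITES = {
    "site_a": [{"drug": "statin", "event": "myopathy"}, {"drug": "statin", "event": "none"}],
    "site_b": [{"drug": "statin", "event": "myopathy"}],
}

def local_count(records, drug, event):
    """Runs inside the site's firewall; only the count is released."""
    return sum(1 for r in records if r["drug"] == drug and r["event"] == event)

def federated_query(drug, event):
    # The coordinating hub sees per-site aggregates, never row-level data.
    return {site: local_count(rows, drug, event) for site, rows in SITES.items()}

print(federated_query("statin", "myopathy"))  # {'site_a': 1, 'site_b': 1}
```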

Cloudwave: Distributed Processing of “Big Data” from Electrophysiological Recordings for Epilepsy Clinical Research Using Hadoop

Jayapandian, Catherine P.; Chen, Chien-Hung; Bozorgi, Alireza; Lhatoo, Samden D.; Zhang, Guo-Qiang; Sahoo, Satya S.
Source/Publisher: American Medical Informatics Association
Type: Scientific Journal Article
Published on 16/11/2013 EN
Search Relevance: 55.72%
Epilepsy is the most common serious neurological disorder, affecting 50–60 million persons worldwide. Multi-modal electrophysiological data, such as electroencephalography (EEG) and electrocardiography (EKG), are central to effective patient care and clinical research in epilepsy. Electrophysiological data is an example of clinical "big data", consisting of more than 100 multi-channel signals, with recordings from each patient generating 5–10GB of data. Current approaches that store and analyze signal data using standalone tools, such as Nihon Kohden neurology software, are inadequate to meet the growing volume of data and the need to support multi-center collaborative studies with real-time and interactive access. In this paper we introduce the Cloudwave platform, which features an intuitive Web-based signal analysis interface integrated with a Hadoop-based data processing module operating on clinical data stored in a "private cloud". Cloudwave has been developed as part of the National Institute of Neurological Disorders and Stroke (NINDS)-funded multi-center Prevention and Risk Identification of SUDEP Mortality (PRISM) project. The Cloudwave visualization interface provides real-time rendering of multi-modal signals with "montages" for EEG feature characterization over 2TB of patient data generated at the Case University Hospital Epilepsy Monitoring Unit. Results from a performance evaluation of the Cloudwave Hadoop data processing module demonstrate an order-of-magnitude performance improvement over 77GB of patient data. (Cloudwave project: http://prism.case.edu/prism/index.php/Cloudwave)
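Cloudwave's Hadoop module itself is not shown in the abstract; the following is a minimal Hadoop-streaming-style sketch of the general pattern it relies on: a mapper keyed by signal channel and a reducer that aggregates per channel. The "channel<TAB>sample" input layout is invented for the example and is not Cloudwave's real record format.

```python
# Hadoop-streaming-style sketch: mapper emits (channel, value) pairs,
# reducer aggregates per channel. The input layout is an assumption.
from collections import defaultdict

def mapper(lines):
    for line in lines:
        channel, sample = line.rstrip("\n").split("\t")
        yield channel, float(sample)

def reducer(pairs):
    sums, counts = defaultdict(float), defaultdict(int)
    for channel, value in pairs:
        sums[channel] += value
        counts[channel] += 1
    return {ch: sums[ch] / counts[ch] for ch in sums}  # mean amplitude per channel

records = ["EEG_C3\t12.5", "EEG_C3\t11.9", "EKG_II\t0.8"]
print(reducer(mapper(records)))  # {'EEG_C3': 12.2, 'EKG_II': 0.8}
```

In a real deployment the mapper and reducer would run as separate processes over HDFS splits; the point here is only the shape of the parallel decomposition.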

Some experiences and opportunities for big data in translational research

Chute, Christopher G.; Ullman-Cullere, Mollie; Wood, Grant M.; Lin, Simon M.; He, Min; Pathak, Jyotishman
Source/Publisher: PubMed
Type: Scientific Journal Article
EN
Search Relevance: 55.84%
Health care has become increasingly information intensive. The advent of genomic data, integrated into patient care, significantly increases the complexity and volume of clinical data. Translational research in the present day increasingly embraces new biomedical discovery in this data-intensive world, thus entering the domain of "big data." The Electronic Medical Records and Genomics consortium has taught us many lessons, while advances in commodity computing methods simultaneously enable the academic community to affordably manage and process big data. Although great promise can emerge from the adoption of big data methods and philosophy, the heterogeneity and complexity of clinical data in particular pose additional challenges for big data inferencing and clinical application. However, the ultimate comparability and consistency of heterogeneous clinical information sources can be enhanced by existing and emerging data standards, which promise to bring order to clinical data chaos. Meaningful Use data standards in particular have already simplified the task of identifying clinical phenotyping patterns in electronic health records.

Big Data: Survey, Technologies, Opportunities, and Challenges

Khan, Nawsher; Yaqoob, Ibrar; Hashem, Ibrahim Abaker Targio; Inayat, Zakira; Mahmoud Ali, Waleed Kamaleldin; Alam, Muhammad; Shiraz, Muhammad; Gani, Abdullah
Source/Publisher: Hindawi Publishing Corporation
Type: Scientific Journal Article
EN
Search Relevance: 65.87%
Big Data has gained much attention from academia and the IT industry. In the digital and computing world, information is generated and collected at a rate that rapidly exceeds our capacity to handle it. Currently, over 2 billion people worldwide are connected to the Internet, and over 5 billion individuals own mobile phones. By 2020, 50 billion devices are expected to be connected to the Internet, at which point predicted data production will be 44 times greater than in 2009. As information is transferred and shared at light speed over optic fiber and wireless networks, the volume of data and the speed of market growth increase. However, the fast growth of such large data volumes generates numerous challenges, such as the rapid growth of data, transfer speed, data diversity, and security. Nonetheless, Big Data is still in its infancy, and the domain has not been reviewed in general. Hence, this study comprehensively surveys and classifies the various attributes of Big Data, including its nature, definitions, rapid growth rate, volume, management, analysis, and security. This study also proposes a data life cycle that uses the technologies and terminologies of Big Data. Future research directions in this field are determined based on opportunities and several open issues in the Big Data domain. These research directions facilitate the exploration of the domain and the development of optimal techniques to address Big Data.

Big data and clinical research: focusing on the area of critical care medicine in mainland China

Zhang, Zhongheng
Source/Publisher: AME Publishing Company
Type: Scientific Journal Article
Published in 10/2014 EN
Search Relevance: 55.8%
Big data found its way into clinical practice long ago, with the advent of the information technology era. Medical records and follow-up data can be stored and extracted more efficiently with information technology. Immediately after admission, a patient produces a large amount of data, including laboratory findings, medications, fluid balance, progress notes, and imaging findings. Clinicians and clinical investigators should make every effort to make full use of the big data being continuously generated by electronic medical record (EMR) systems and other healthcare databases. At this stage, more training courses on data management and statistical analysis are required before clinicians and clinical investigators can handle big data and translate it into advances in medical science. China is a large country with a population of 1.3 billion and can contribute greatly to clinical research by providing reliable and high-quality big data.

Big Data and Biomedical Informatics: A Challenging Opportunity

Bellazzi, Riccardo
Source/Publisher: Schattauer GmbH
Type: Scientific Journal Article
Published on 22/05/2014 EN
Search Relevance: 55.93%
Big data are receiving increasing attention in biomedicine and healthcare. It is therefore important to understand why big data are assuming a crucial role for the biomedical informatics community. The capability of handling big data is becoming an enabler for carrying out unprecedented research studies and for implementing new models of healthcare delivery. To that end, it is first necessary to understand deeply the four elements that constitute big data, namely Volume, Variety, Velocity, and Veracity, and their meaning in practice. Then, it is mandatory to understand where big data are present, and where they can be beneficially collected. There are research fields, such as translational bioinformatics, which need to rely on big data technologies to withstand the shock wave of data that is generated every day. Other areas, ranging from epidemiology to clinical care, can benefit from the exploitation of the large amounts of data that are nowadays available, from personal monitoring to primary care. However, building big data-enabled systems carries relevant implications in terms of reproducibility of research studies and management of privacy and data access; proper actions should be taken to deal with these issues. An interesting consequence of the big data scenario is the availability of new software...

Making Big Data Useful for Health Care: A Summary of the Inaugural MIT Critical Data Conference

Badawi, Omar; Brennan, Thomas; Celi, Leo Anthony; Feng, Mengling; Ghassemi, Marzyeh; Ippolito, Andrea; Johnson, Alistair; Mark, Roger G; Mayaud, Louis; Moody, George; Moses, Christopher; Naumann, Tristan; Nikore, Vipan; Pimentel, Marco; Pollard, Tom J; Sa
Source/Publisher: Gunther Eysenbach
Type: Scientific Journal Article
Published on 22/08/2014 EN
Search Relevance: 55.77%
With growing concerns that big data will only augment the problem of unreliable research, the Laboratory of Computational Physiology at the Massachusetts Institute of Technology organized the Critical Data Conference in January 2014. Thought leaders from academia, government, and industry across disciplines, including clinical medicine, computer science, public health, informatics, biomedical research, health technology, statistics, and epidemiology, gathered and discussed the pitfalls and challenges of big data in health care. The key message from the conference is that the value of large amounts of data hinges on the ability of researchers to share data, methodologies, and findings in an open setting. If empirical value is to be derived from the analysis of retrospective data, groups must continuously work together on similar problems to create more effective peer review. This will lead to improvements in methodology and quality, with each iteration of analysis resulting in greater reliability.

What Difference Does Quantity Make? On the Epistemology of Big Data in Biology

Leonelli, Sabina
Source/Publisher: PubMed
Type: Scientific Journal Article
Published on 01/06/2014 EN
Search Relevance: 55.84%
Is big data science a whole new way of doing research? And what difference does data quantity make to knowledge production strategies and their outputs? I argue that the novelty of big data science does not lie in the sheer quantity of data involved, but rather in (1) the prominence and status acquired by data as commodity and recognised output, both within and outside of the scientific community; and (2) the methods, infrastructures, technologies, skills and knowledge developed to handle data. These developments generate the impression that data-intensive research is a new mode of doing science, with its own epistemology and norms. To assess this claim, one needs to consider the ways in which data are actually disseminated and used to generate knowledge. Accordingly, this paper reviews the development of sophisticated ways to disseminate, integrate and re-use data acquired on model organisms over the last three decades of work in experimental biology. I focus on online databases as prominent infrastructures set up to organise and interpret such data; and examine the wealth and diversity of expertise, resources and conceptual scaffolding that such databases draw upon. This illuminates some of the conditions under which big data need to be curated to support processes of discovery across biological subfields...

Intuitive Web-Based Experimental Design for High-Throughput Biomedical Data

Friedrich, Andreas; Kenar, Erhan; Kohlbacher, Oliver; Nahnsen, Sven
Source/Publisher: Hindawi Publishing Corporation
Type: Scientific Journal Article
EN
Search Relevance: 55.68%
Big data bioinformatics aims at drawing biological conclusions from huge and complex biological datasets. Added value from the analysis of big data, however, is only possible if the data is accompanied by accurate metadata annotation. Particularly in high-throughput experiments, intelligent approaches are needed to keep track of the experimental design, including the conditions under study as well as information that might be of interest for failure analysis or future follow-up experiments. In addition to the management of this information, researchers urgently need means for integrated design and interfaces for structured data annotation. Here, we propose a factor-based experimental design approach that enables scientists to easily create large-scale experiments with the help of a web-based system. We present a novel implementation of a web-based interface allowing the collection of arbitrary metadata. To exchange and edit information, we provide a spreadsheet-based, human-readable format. Subsequently, sample sheets with identifiers and meta-information for data generation facilities can be created. Data files created after measurement of the samples can be uploaded to a datastore, where they are automatically linked to the previously created experimental design model.
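A minimal sketch of the factor-based design idea: cross the condition levels in full factorial and write the result as a tab-separated sample sheet of the spreadsheet-like kind the abstract mentions. The factor names and the identifier scheme are invented for this example, not taken from the authors' system.

```python
# Sketch of factor-based experimental design: a full-factorial crossing
# of studied conditions, exported as a tab-separated sample sheet.
# Factors and the SAMPLE### identifier scheme are invented.
import csv
import itertools
import sys

factors = {
    "genotype": ["wild_type", "knockout"],
    "treatment": ["control", "drug"],
    "timepoint_h": ["0", "24"],
}

writer = csv.writer(sys.stdout, delimiter="\t")
writer.writerow(["sample_id", *factors])
for i, levels in enumerate(itertools.product(*factors.values()), start=1):
    writer.writerow([f"SAMPLE{i:03d}", *levels])  # 2 * 2 * 2 = 8 samples
```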

Scalable Architecture for Integrated Batch and Streaming Analysis of Big Data

Gao, Xiaoming
Source/Publisher: [Bloomington, Ind.] : Indiana University
Type: Doctoral Dissertation
EN
Search Relevance: 65.81%
Thesis (Ph.D.) - Indiana University, Computer Sciences, 2015; As Big Data processing problems evolve, many modern applications demonstrate special characteristics. Data exists in the form of both large historical datasets and high-speed real-time streams, and many analysis pipelines require integrated parallel batch processing and stream processing. Despite the large size of the whole dataset, most analyses focus on specific subsets according to certain criteria. Correspondingly, integrated support for efficient queries and post-query analysis is required. To address the system-level requirements brought by such characteristics, this dissertation proposes a scalable architecture for integrated queries, batch analysis, and streaming analysis of Big Data in the cloud. We verify its effectiveness using a representative application domain - social media data analysis - and tackle related research challenges emerging from each module of the architecture by integrating and extending multiple state-of-the-art Big Data storage and processing systems. In the storage layer, we reveal that existing text indexing techniques do not work well for the unique queries of social data, which put constraints on both textual content and social context. To address this issue...
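One common way to picture the integration of batch and streaming analysis that the dissertation targets is to answer queries by merging a precomputed batch view with a live streaming delta. The hashtag-count example below is a schematic stand-in under that assumption, not the dissertation's actual storage or processing layers.

```python
# Schematic of integrated batch + streaming analysis: a query answer is
# the precomputed batch view merged with a live streaming delta. The
# hashtag counts stand in for the real storage/processing layers.
from collections import Counter

batch_view = Counter({"#bigdata": 10_000, "#iot": 4_200})  # nightly batch job output

stream_delta = Counter()                                    # updated per arriving tweet
def on_tweet(hashtags):
    stream_delta.update(hashtags)

def query(tag):
    return batch_view[tag] + stream_delta[tag]              # merged, up-to-date answer

on_tweet(["#bigdata"])
print(query("#bigdata"))  # 10001
```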

A DNA-Based Semantic Fusion Model for Remote Sensing Data

Sun, Heng; Weng, Jian; Yu, Guangchuang; Massawe, Richard H.
Source/Publisher: Public Library of Science
Type: Scientific Journal Article
Published on 08/10/2013 EN
Search Relevance: 55.73%
Semantic technology plays a key role in various domains, from conversation understanding to algorithm analysis. As the most efficient semantic tool, ontologies can represent, process, and manage widespread knowledge. Nowadays, many researchers use ontologies to collect and organize the semantic information of data in order to maximize research productivity. In this paper, we first describe our work on the development of a remote sensing data ontology, with a primary focus on semantic fusion-driven research for big data. Our ontology is made up of 1,264 concepts and 2,030 semantic relationships. However, the growth of big data is straining the capacities of current semantic fusion and reasoning practices. Considering the massive parallelism of DNA strands, we propose a novel DNA-based semantic fusion model. In this model, a parallel strategy is developed to encode the semantic information in DNA for a large volume of remote sensing data. The semantic information is read in a parallel, bit-wise manner, and each individual bit is converted to a base. By doing so, a considerable amount of conversion time can be saved; for example, the cluster-based multi-process program reduces the conversion time from 81,536 seconds to 4,937 seconds for 4.34 GB of source data files. Moreover...
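The bit-to-base conversion the abstract describes can be made concrete in a few lines. The specific two-bits-per-base mapping table below is an assumption made for illustration; the abstract does not fix the paper's actual encoding table.

```python
# Sketch of bit-wise semantic-to-DNA encoding: each 2-bit pair becomes
# one base. The mapping table {00:A, 01:C, 10:G, 11:T} is assumed.
BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
INV = {v: k for k, v in BASE.items()}

def encode(data: bytes) -> str:
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def decode(strand: str) -> bytes:
    bits = "".join(INV[b] for b in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

strand = encode(b"tree")
print(strand)  # CTCACTAGCGCCCGCC
assert decode(strand) == b"tree"
```

Because each base carries two bits, the mapping is trivially parallelizable: any chunk of the bit string can be converted independently, which is the property the paper's cluster-based strategy exploits.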

Decision-making in the context of Big Data: a single case study (original title: A tomada de decisão no contexto do Big Data: estudo de caso único)

Canary, Vivian Passos
Source/Publisher: Universidade Federal do Rio Grande do Sul
Type: Undergraduate Thesis Format: application/pdf
POR
Search Relevance: 65.84%
Competition among brands is increasingly fierce, requiring companies to make quick decisions to create a competitive advantage over their rivals (BARTON and COURT, 2012). To minimize the risks of poor decision-making, managers must base their decisions on relevant and reliable information. The exponential growth in the volume of data generated by technological advances and changing consumer behavior will quickly provide organizations with sufficient information for this. This phenomenon is called Big Data. However, managers are responsible for collecting, filtering, processing, and analyzing the information that is useful to them, leveraging it to generate competitive advantage for their businesses. The goal of this research is to verify the effect of the "5 V's" of Big Data (volume, variety, velocity, value, and veracity) on the decision-making process of executives at different hierarchical levels in a Cooperative Credit System. To achieve this, the single case study method was used. Contributions of this research include exploring the topic of Big Data theoretically and linking it to the decision-making process practiced in an organization.

Empirical Big Data Research: A Systematic Literature Mapping

Wienhofen, Leendert; Mathisen, Bjørn Magnus; Roman, Dumitru
Source/Publisher: Cornell University
Type: Scientific Journal Article
Published on 10/09/2015
Search Relevance: 55.89%
Background: Big Data is a relatively new field of research and technology, and the literature reports a wide variety of concepts labeled as Big Data. The maturity of a research field can be measured by the number of publications containing empirical results. In this paper we present the current status of empirical research in Big Data. Method: We employed a systematic mapping method with which we mapped the collected research according to the labels Variety, Volume, and Velocity. In addition, we addressed the application areas of Big Data. Results: We found that 151 of the 1778 assessed contributions contain some form of empirical result and can be mapped to one or more of the 3 V's, and that 59 address an application area. Conclusions: The share of publications containing empirical results is well below the average for computer science research as a whole. In order to mature the research on Big Data, we recommend applying empirical methods to strengthen the confidence in reported results. Based on our trend analysis, we consider Variety to be the most promising uncharted area in Big Data.; Comment: 18-page paper, 32 pages in total including references. Submitted to Elsevier Information Systems

BigDataBench: a Big Data Benchmark Suite from Internet Services

Wang, Lei; Zhan, Jianfeng; Luo, Chunjie; Zhu, Yuqing; Yang, Qiang; He, Yongqiang; Gao, Wanling; Jia, Zhen; Shi, Yingjie; Zhang, Shujie; Zheng, Chen; Lu, Gang; Zhan, Kent; Li, Xiaona; Qiu, Bizhu
Source/Publisher: Cornell University
Type: Scientific Journal Article
Search Relevance: 55.91%
As the architecture, systems, and data management communities pay greater attention to innovative big data systems and architectures, the pressure to benchmark and evaluate these systems rises. Considering the broad use of big data systems, big data benchmarks must include a diversity of data and workloads. Most state-of-the-art big data benchmarking efforts target evaluating specific types of applications or system software stacks, and hence are not suited to these purposes. This paper presents our joint research efforts on this issue with several industrial partners. Our big data benchmark suite, BigDataBench, not only covers broad application scenarios but also includes diverse and representative data sets. BigDataBench is publicly available from http://prof.ict.ac.cn/BigDataBench . Also, we comprehensively characterize 19 big data workloads included in BigDataBench with varying data inputs. On a typical state-of-practice processor, the Intel Xeon E5645, we have the following observations: first, in comparison with traditional benchmarks, including PARSEC, HPCC, and SPECCPU, big data applications have very low operation intensity; second, the volume of data input has a non-negligible impact on micro-architecture characteristics...
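Operation intensity, the metric behind the first observation, is simply operations executed per byte of memory traffic. The toy numbers below are invented to illustrate the comparison and are not BigDataBench measurements.

```python
# Operation intensity = operations executed per byte of memory traffic.
# All numbers here are invented for illustration, not measured values.
def operation_intensity(ops: float, bytes_moved: float) -> float:
    return ops / bytes_moved

compute_bound_kernel = operation_intensity(ops=8e9, bytes_moved=1e9)  # 8 ops/byte
data_bound_workload = operation_intensity(ops=2e8, bytes_moved=1e9)   # 0.2 ops/byte
print(compute_bound_kernel / data_bound_workload)  # 40x lower intensity for the data-bound job
```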

Big Data Analytics in Future Internet of Things

Ding, Guoru; Wang, Long; Wu, Qihui
Source/Publisher: Cornell University
Type: Scientific Journal Article
Published on 16/11/2013
Search Relevance: 55.83%
Current research on the Internet of Things (IoT) mainly focuses on how to enable general objects to see, hear, and smell the physical world for themselves, and how to connect them to share these observations. In this paper, we argue that being connected is not enough: beyond that, general objects should have the capability to learn, think, and understand the physical world by themselves. At the same time, the future IoT will be highly populated by large numbers of heterogeneous networked embedded devices, which are generating massive or big data in an explosive fashion. Although there is a consensus among almost everyone on the great importance of big data analytics in IoT, to date, limited results, especially on the mathematical foundations, have been obtained. These practical needs impel us to propose a systematic tutorial on the development of effective algorithms for big data analytics in the future IoT, grouped into four classes: 1) heterogeneous data processing, 2) nonlinear data processing, 3) high-dimensional data processing, and 4) distributed and parallel data processing. We envision the presented research as a mere baby step in a potentially fruitful research direction. We hope that this article, with its interdisciplinary perspectives...

Big Data Analytics in Bioinformatics: A Machine Learning Perspective

Kashyap, Hirak; Ahmed, Hasin Afzal; Hoque, Nazrul; Roy, Swarup; Bhattacharyya, Dhruba Kumar
Source/Publisher: Cornell University
Type: Scientific Journal Article
Published on 15/06/2015
Search Relevance: 65.87%
Bioinformatics research is characterized by voluminous and incremental datasets and complex data analytics methods. The machine learning methods used in bioinformatics are iterative and parallel, and can be scaled to handle big data using distributed and parallel computing technologies. Usually, big data tools perform computation in batch mode and are not optimized for iterative processing and high data dependency among operations. In recent years, parallel, incremental, and multi-view machine learning algorithms have been proposed. Similarly, graph-based architectures and in-memory big data tools have been developed to minimize I/O cost and optimize iterative processing. However, standard big data architectures and tools are still lacking for many important bioinformatics problems, such as fast construction of co-expression and regulatory networks and salient module identification, detection of complexes over growing protein-protein interaction data, fast analysis of massive DNA, RNA, and protein sequence data, and fast querying on incremental and heterogeneous disease networks. This paper addresses the issues and challenges posed by several big data problems in bioinformatics, and gives an overview of the state of the art and future research opportunities.; Comment: 20-page survey paper on big data analytics in bioinformatics

BigDataBench: a Big Data Benchmark Suite from Web Search Engines

Gao, Wanling; Zhu, Yuqing; Jia, Zhen; Luo, Chunjie; Wang, Lei; Li, Zhiguo; Zhan, Jianfeng; Qi, Yong; He, Yongqiang; Gong, Shiming; Li, Xiaona; Zhang, Shujie; Qiu, Bizhu
Source/Publisher: Cornell University
Type: Scientific Journal Article
Published on 01/07/2013
Search Relevance: 55.84%
This paper presents our joint research efforts on big data benchmarking with several industrial partners. Considering the complexity, diversity, workload churn, and rapid evolution of big data systems, we take an incremental approach to big data benchmarking. As a first step, we focus on search engines, which are the most important domain in Internet services in terms of page views and daily visitors. However, search engine service providers treat data, applications, and web access logs as business confidential, which prevents us from building benchmarks directly. To overcome these difficulties, together with several industry partners we widely investigated open source solutions in search engines and obtained permission to use anonymized Web access logs. Moreover, with two years of effort, we created a semantic search engine named ProfSearch (available from http://prof.ict.ac.cn). These efforts pave the path for our big data benchmark suite from search engines, BigDataBench, which is released on the web page (http://prof.ict.ac.cn/BigDataBench). We report our detailed analysis of search engine workloads and present our benchmarking methodology. An innovative data generation methodology and tool are proposed to generate scalable volumes of big data from a small seed of real data...
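The "scalable volumes of big data from a small seed of real data" idea can be pictured as resampling: learn simple statistics from the seed log, then emit an arbitrary number of synthetic records. This sketch is not the authors' generator; the log fields are invented for illustration.

```python
# Sketch of seed-based data generation: learn simple marginals from a
# small real sample, then emit arbitrary volumes of synthetic records.
# Not the BigDataBench generator; fields are invented.
import random

seed_log = [
    {"query": "big data", "clicks": 2},
    {"query": "hadoop", "clicks": 0},
    {"query": "big data", "clicks": 1},
]

queries = [r["query"] for r in seed_log]          # empirical query distribution
max_clicks = max(r["clicks"] for r in seed_log)

def generate(n, rng=None):
    rng = rng or random.Random(42)                 # deterministic for the sketch
    for _ in range(n):                             # n scales as far as needed
        yield {"query": rng.choice(queries), "clicks": rng.randint(0, max_clicks)}

for record in generate(5):
    print(record)
```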

The Value of Using Big Data Technologies in Computational Social Science

Ch'ng, Eugene
Source/Publisher: Cornell University
Type: Scientific Journal Article
Published on 13/08/2014
Search Relevance: 55.81%
The discovery of phenomena in social networks has prompted renewed interest in the field. Data in social networks, however, can be massive, requiring scalable Big Data architecture. Conversely, research in Big Data needs the volume and velocity of social media data to test its scalability. Moreover, appropriate data processing and mining of acquired datasets involve complex issues of variety, veracity, and variability, after which visualisation must occur before our efforts can come to fruition. This article presents topical, multimodal, and longitudinal social media datasets from the integration of various scalable open source technologies. The article details the process that led to the discovery of social information landscapes within the Twitter social network, highlighting the experience of dealing with social media datasets, using a funneling approach so that the data becomes manageable. The article demonstrates the feasibility and value of using scalable open source technologies to acquire massive, connected datasets for research in the social sciences.; Comment: 3rd ASE Big Data Science Conference, Tsinghua University Beijing, 3-7 August 2014
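The funneling approach the abstract mentions can be pictured as a chain of successively narrower filters over the raw stream. The filter stages below are invented for illustration and are not the article's actual criteria.

```python
# Sketch of "funneling": successively narrower filters reduce a massive
# raw stream to a manageable, topical dataset. Stages are invented.
raw_stream = [
    {"text": "big data rocks", "lang": "en", "geo": "Beijing"},
    {"text": "bonjour", "lang": "fr", "geo": None},
    {"text": "big data talk today", "lang": "en", "geo": None},
]

funnel = [
    lambda t: t["lang"] == "en",        # stage 1: language
    lambda t: "big data" in t["text"],  # stage 2: topic keywords
    lambda t: t["geo"] is not None,     # stage 3: geolocated only
]

tweets = raw_stream
for stage in funnel:
    tweets = [t for t in tweets if stage(t)]
print(tweets)  # the manageable residue left by the funnel
```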