- About Us
As a microscope allows one to peer into the microscopic world, Groupscope will allow one to observe large group behavior. Currently, studies of large group behavior, where groups consist of many people spread over one or more locations, are conducted manually through observation, surveys, and, when video data is collected, event coding. The Groupscope project will draw on work in computer vision and audio recognition to create a framework of tools that automates the more tedious portions of these investigations, such as tracking individuals and their interactions, and that attempts to provide an objective means of classifying the behavior observed in these studies. Using NCSA Medici, a content-aware analogue to the popular Dropbox, custom extraction services will obtain the needed information from collections of multi-vantage-point video recordings, audio recordings, images, and spoken dialogue annotations. From this extracted metadata, novel visualizations will be developed that allow communication researchers to easily track observed individuals and their interactions over large spans of space and time.
The passing down of information from one generation to the next has been an important part of our species since the dawn of civilization. From scrolls to books to paintings to sheet music, practices have long been established to make these artifacts of information available for the generations and centuries following the death of their creators. It is somewhat ironic that we, in the information age, with seeming mastery over vast amounts of information at the fingertips of billions, appear to be at risk of stunting or even ending this continuum of knowledge preservation.
Moving from the preservation of traditional analogue media, such as paper, to digital media, such as computer files, has proven to be entirely non-trivial. In addition to the threats of physical damage over time, digital media suffers from further challenges surrounding the magnitude and variety of information available. Because of the benefits of digital media in terms of information manipulation and accessibility, the amount of information available grows exponentially with the passage of time. The varieties of information also increase over time as new software is created with new requirements for storing data, new types of data are explored, and commercial interests come into play. The end effect is that digital preservation must deal not only with the physical storage of the large number of bytes representing the information we are preserving, but also with ensuring that we can access and interpret all pieces of archived information centuries down the road.
This project addresses the research challenges of digital preservation in terms of data diversity and scale while also focusing on the development of preservation solutions in the form of tools and services. Specifically, we address accessibility with regard to the ever-growing number of file formats that represent essentially the same kinds of information. It has been the case, and will continue to be the case, that digital files are preserved on tape or disk, yet are inaccessible some decades later because the software to load the data no longer exists. In the case of 3D data we have documented over 140 file formats. Many of these formats are proprietary with undisclosed specifications, meaning that if the owning company were ever to disappear, all user data stored in that format could quickly become inaccessible. These situations are occurring today and will only grow worse with time. In past work we have investigated the problem of identifying an optimal file format for long-term preservation so as to maximize accessibility while minimizing the information lost when converting to the desired format from other formats in an archive. In turn we developed tools to carry out large numbers of conversions in a massively scalable manner, created a registry of software indexable by input/output formats, and laid down the framework for a library of comparison measures to estimate content loss before and after conversions across a number of data types.
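As a rough illustration of how a registry of software indexed by input/output formats can drive conversions, the sketch below finds a shortest chain of conversions between two formats with a breadth-first search. The registry contents, tool names, and formats are made up for the example and are not the actual registry.

```python
# Minimal sketch: use a registry of conversion software, indexed by input/output
# formats, to find a chain of conversions between two file formats.
# The registry entries and format names below are illustrative assumptions.
from collections import deque

# Each entry: tool name -> (accepted input formats, producible output formats)
registry = {
    "tool_a": ({"obj"}, {"stl", "ply"}),
    "tool_b": ({"ply"}, {"x3d"}),
    "tool_c": ({"stl"}, {"obj"}),
}

def find_conversion_path(src, dst):
    """Breadth-first search for a shortest chain of conversions from src to dst."""
    queue = deque([(src, [])])
    visited = {src}
    while queue:
        fmt, path = queue.popleft()
        if fmt == dst:
            return path
        for tool, (inputs, outputs) in registry.items():
            if fmt in inputs:
                for out in outputs - visited:
                    visited.add(out)
                    queue.append((out, path + [(tool, fmt, out)]))
    return None  # no chain of conversions reaches dst

print(find_conversion_path("obj", "x3d"))
# [('tool_a', 'obj', 'ply'), ('tool_b', 'ply', 'x3d')]
```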
The practical motivation of our research stems from the exponentially growing number of electronic records and the growing number of file formats that archives must deal with when conducting business with the US government. As an example, electronic records (i.e. digital files) come to the National Archives and Records Administration (NARA) from Congress, the courts, the Executive Office of the President, numerous Presidential commissions, and nearly 100 bureaus, departments, and other components of executive branch agencies and their contractors. These digital files arrive in large quantities and in a wide variety of file formats. They must be appraised and stored in a manner that will allow access for centuries to come, a task made difficult by a lack of services to manipulate these files in general, render them, and compare their contents. The objectives of this effort are to enable automated and computationally scalable file format conversions with control over conversion parameters, to predict the computational costs associated with data-intensive and CPU-intensive file format conversions and file comparisons, and to support content-based file-to-file comparisons, including the choice of comparison methods and their parameters based on specific end-user needs.
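To give a concrete, if simplified, sense of what a content-based file-to-file comparison might look like for one data type, the snippet below scores a raster image against its post-conversion counterpart using a histogram intersection. This is only one of many possible measures and is not the project's comparison library.

```python
# One possible content-based comparison measure for raster images: histogram
# intersection between an image and its post-conversion counterpart.
# This is an illustrative measure only, not the project's actual library.
import numpy as np

def histogram_intersection(img_a, img_b, bins=64):
    """Return a similarity score in [0, 1]; 1.0 means identical intensity histograms."""
    hist_a, _ = np.histogram(img_a, bins=bins, range=(0, 255))
    hist_b, _ = np.histogram(img_b, bins=bins, range=(0, 255))
    hist_a = hist_a / hist_a.sum()
    hist_b = hist_b / hist_b.sum()
    return float(np.minimum(hist_a, hist_b).sum())

# Toy example: an 8-bit grayscale image and a slightly altered copy.
original = np.random.default_rng(0).integers(0, 256, size=(256, 256))
converted = np.clip(original + 2, 0, 255)  # simulate small loss from a conversion
print(round(histogram_intersection(original, converted), 3))
```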
In two words, the purpose of this project is to provide "searchable access" to digital archives. In particular we are interested in the many digital archives of handwritten forms that currently possess no viable or practical means of searchable access. Today, access to archived handwritten forms can be provided by digitally scanning them and making them available on the web as a collection of digital images. This type of access is of limited use, however, since users have no easy way of searching through the individual forms other than going through them one by one. Given that these collections are almost always large, amounting to terabytes of data spread across millions of images, searching through the data manually is for all practical purposes impossible for an individual.
The motivation for our work is the decadal release of US Census forms coming up in April of 2012 for the 1940 Census. The digital scanning of 1940 US Census microfilms by the US Census Bureau and the National Archives and Records Administration (NARA) has been estimated to result in 3.25 million images and approximately 125 terabytes of raw image data. Compressed copies of the images, approximately 18 terabytes worth, will be made available over the internet for the first time as the primary source of the Census information.
Upon the release of the census information, various companies and organizations will begin the long, tedious task of manually transcribing the contents of the handwritten forms. As transcribed digital text, the Census information will become searchable. To do this, the companies and organizations will employ anywhere from thousands to hundreds of thousands of people to look at each word within the forms and retype it as digital text. The process will take between 6 and 9 months to complete. Due to the large investment in human labor, these companies will provide the searchable content for a fee. At the moment there is no low-cost, near-free alternative to human transcription for providing searchable access to information within digital archives of handwritten information. It is here where our work focuses.
We aim to provide low-cost searchable access to digital archives where currently there is none. The field of computer vision, which deals with the extraction of information from images, is by no means at the state required to provide perfect automated transcription of the contents of these forms. However, our goal is not perfect machine-driven transcription, but rather some form of searchable, image-based access where no searchable access would exist otherwise.
Towards this goal of providing low-cost searchable access to digital archives of handwritten forms, we investigate and develop a hybrid automated and crowdsourced approach. The automated portion, utilizing a technique known as Word Spotting, will provide immediate searchable, image-based access to the content within digital handwritten forms. The crowdsourcing component, made up of both active and passive elements, will accumulate traditional transcriptions over time. Of particular interest is the passive crowdsourcing element, which would acquire these transcriptions without users being aware that they are carrying out this job. As more users use the system and transcriptions accumulate, the system will gradually shift from solely image-based search to a combination of image-based and text-based search. Given the limitations of the current state of computer vision, our approach emphasizes the human in the loop. Even within the automated portion we provide novel interfaces so that the human can search in a manner that is more readily tractable by the computer. In addition to the computer vision and crowdsourcing elements, we must deal with the scale of the problem at hand, that is, millions of images and terabytes of data. Towards these ends we also investigate the use of scalable distributed databases, efficient means of serving high-resolution images over the web, and means of indexing the large amount of text within an archive of digital images.
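As a toy illustration of the word spotting idea, the sketch below compares binarized word images by aligning their column ink-projection profiles with dynamic time warping, a classic formulation of the technique; the features and matching here are illustrative and not the project's actual pipeline.

```python
# Toy sketch of word spotting: compare binarized word images by aligning their
# per-column "ink" projection profiles with dynamic time warping (DTW). A small
# DTW distance suggests the two images may depict the same handwritten word.
import numpy as np

def profile(word_img):
    """Per-column count of ink pixels in a binary (0/1) word image."""
    return word_img.sum(axis=0).astype(float)

def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) dynamic time warping distance between 1-D sequences."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def spot(query_img, candidate_imgs):
    """Rank candidate word images by similarity to a query word image."""
    q = profile(query_img)
    scores = [dtw_distance(q, profile(c)) for c in candidate_imgs]
    return np.argsort(scores)  # best matches first
```

In a deployed system such scores would have to be computed against millions of segmented word images, which is where the scalable databases and indexing mentioned above come into play.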
The goal of this work is to create a portable interactive system that allows remote collaboration in studies involving cultural heritage. Derived from work started under Peter Bajcsy, we are moving our Tele-immersive Environment for Everybody (TEEVE) work to the relatively new, low-cost Kinect 3D depth cameras. Similar to Skype video conferencing, the TEEVE work captures and transmits 3D reconstructions rather than conventional 2D video. Because the data is three-dimensional, users at remote locations can zoom in on and rotate the transmitted scenes. With lower-cost 3D cameras and small-profile modern computers available today, the group hopes to put these feeds into the field. Cultural artifacts are broken over time and scattered among museums throughout the world, and one of the goals of this work is to allow researchers to reassemble these pieces in a shared 3D virtual environment. With such a system, a researcher will be able to capture the geometry of an artifact in a museum and work with other researchers around the world, in real time, to interactively place such pieces alongside others that are physically stored in remote locations. Utilizing the skeletal tracking that has made the Kinect a success, the system will track a user's hand and arm movements to do this. This work will leverage other work developed at NCSA, such as Medici as the backend to store and link information between systems, and AVL's Virtual Director.
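A basic building block of such 3D capture is back-projecting each depth pixel into a 3D point using the pinhole camera model; the sketch below shows that step, with placeholder intrinsics rather than calibrated Kinect parameters.

```python
# Minimal sketch of back-projecting a depth image into a 3D point cloud with the
# pinhole camera model, the basic step in building 3D reconstructions from a depth
# camera such as the Kinect. The intrinsics (fx, fy, cx, cy) are placeholder values.
import numpy as np

def depth_to_points(depth_mm, fx=570.0, fy=570.0, cx=320.0, cy=240.0):
    """Convert an HxW depth image in millimeters to an (N, 3) array of XYZ points in meters."""
    h, w = depth_mm.shape
    v, u = np.mgrid[0:h, 0:w]            # pixel row (v) and column (u) coordinates
    z = depth_mm.astype(float) / 1000.0  # millimeters -> meters
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]      # drop pixels with no depth reading

# Toy depth frame (a flat surface 1.5 m away) just to exercise the function.
cloud = depth_to_points(np.full((480, 640), 1500, dtype=np.uint16))
print(cloud.shape)  # (307200, 3)
```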
The University of Illinois at Urbana-Champaign and the University of Wisconsin - Madison have been awarded an NSF ABI grant to develop an integrated ecological bioinformatics toolbox dubbed the Predictive Ecosystem Analyzer (PEcAn). This toolbox consists of a scientific workflow system and a data assimilation framework built around ecosystem models.
The project is motivated by the fact that many of the most pressing questions about global change are limited not so much by the need to collect new data as by our ability to utilize existing data. This project seeks to improve this ability by developing a framework for integrating multiple data sources in a sensible manner. PEcAn is initially being developed around the Ecosystem Demography model (ED), one of the few terrestrial biosphere models capable of integrating a large suite of observational data at different spatial and temporal scales. At the same time, PEcAn is being designed to interface with a wide class of ecosystem models. The output of the data assimilation system will be a regional-scale, high-resolution estimate of both the terrestrial carbon cycle and plant biodiversity based on the best available data and with a robust accounting of the uncertainties involved. The workflow system will allow ecosystem modeling to be more reproducible, automated, and transparent in terms of the operations applied to data, and thus ultimately more reusable and comprehensible to both peers and the public. It will reduce redundancy of effort among modeling groups, facilitate collaboration, and, due to the open nature of the workflow system, make models more accessible to the rest of the research community.
As a test bed for the development and application of these ecological bioinformatics tools, the project will focus on the temperate/boreal transition zone in northern Wisconsin, a region that is expected to show large climate change responses and is arguably the most ecologically data-rich region in the country. The tools developed here will enable us to partition carbon flux and pool variability in space and time and to attribute regional-scale responses to specific biotic and abiotic drivers. The data-assimilation framework will partition the different sources of uncertainty, enabling a better understanding of which ones limit our inference, and will provide a more complete propagation of uncertainty into model forecasts. ED will also be used to forecast regional-scale dynamics under decadal- to centennial-scale climate change scenarios. This approach will allow us to assess, for the first time, how much our uncertainty about the current state of the ecosystem impacts our ability to anticipate the future.
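To illustrate the general idea of propagating parameter uncertainty into a forecast, the sketch below runs a Monte Carlo ensemble through a deliberately trivial carbon model; the model and parameter distributions are stand-ins, not ED or PEcAn's actual data assimilation machinery.

```python
# Toy sketch of propagating parameter uncertainty into a forecast with a Monte
# Carlo ensemble. The one-line "ecosystem model" and the parameter distributions
# are stand-ins for illustration only.
import numpy as np

rng = np.random.default_rng(42)
n_ensemble = 1000

# Hypothetical uncertain parameters: light-use efficiency and a respiration fraction.
lue = rng.normal(loc=0.5, scale=0.1, size=n_ensemble)          # gC per MJ
respiration = rng.normal(loc=0.2, scale=0.05, size=n_ensemble)

def toy_carbon_model(lue, respiration, absorbed_radiation=1000.0):
    """Net carbon uptake = gross uptake minus respiration losses (toy formulation)."""
    gross = lue * absorbed_radiation
    return gross * (1.0 - respiration)

forecast = toy_carbon_model(lue, respiration)

# Summarize the forecast distribution: ensemble mean and a 95% interval.
mean = forecast.mean()
lo, hi = np.percentile(forecast, [2.5, 97.5])
print(f"net uptake: {mean:.0f} gC (95% interval {lo:.0f} to {hi:.0f})")
```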
Digging into Image Data to Answer Authorship Related Questions (DID-ARQ) seeks to explore authorship studies of the visual arts through the use of computer vision. In the past, authorship has been explored in terms of attributions, typically of either individual masterpieces or small collections of art from the same period, location, or school. Because of these localized strategies of exploration and research, commonalities and shared characteristics remain largely unexplored. In fact, it is rare to find discussions that go beyond a single discrete dataset. More significantly, to our knowledge, there have to date been no studies of image analyses targeting the problem of authorship applied to very large collections of images and evaluated in terms of accuracy over diverse datasets.
DID-ARQ investigates the accuracy and computational scalability of image analyses when applied to diverse collections of image data. While identifying the distinct characteristics of artists is time-consuming for individual researchers using traditional methodologies, computer-assisted techniques can help humanists discover salient characteristics and increase the reliability of those findings over a large-volume corpus of digitized images. Computer-assisted techniques can provide an initial bridge from low-level image features, such as colors or pixels, to higher-level semantic concepts such as brush strokes, compositions, or patterns. This effort will utilize three datasets of visual works, 15th-century manuscripts, 17th- and 18th-century maps, and 19th- and 20th-century quilts, to investigate what might be revealed about the authors and their artistic lineages by comparing manuscripts, maps, and quilts across four centuries. Based on the artistic, scientific, or technological questions, DID-ARQ intends to formulate and address the problem of finding salient characteristics of artists from two-dimensional (2D) images of historical artifacts. Given a set of 2D images of historical artifacts with known authors, our project aims to discover which salient characteristics distinguish one artist from another, and then to enable automated classification of individual and collective authorship.
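As a minimal sketch of going from low-level image features to an authorship decision, the example below describes each image by a simple color histogram and assigns a query image to the author with the closest mean histogram; the synthetic data and feature choice are illustrative assumptions, not DID-ARQ's actual features or classifiers.

```python
# Toy sketch of the low-level-features-to-authorship idea: describe each image by a
# simple color histogram and assign an unlabeled image to the author whose images
# have the closest mean histogram. The synthetic "images" are illustrative only.
import numpy as np

def color_histogram(img, bins=8):
    """Concatenated per-channel intensity histograms for an HxWx3 uint8 image."""
    hists = [np.histogram(img[..., c], bins=bins, range=(0, 255))[0] for c in range(3)]
    h = np.concatenate(hists).astype(float)
    return h / h.sum()

def nearest_author(query_img, labeled_imgs):
    """labeled_imgs: dict author -> list of images. Returns the closest author."""
    q = color_histogram(query_img)
    centroids = {
        author: np.mean([color_histogram(im) for im in imgs], axis=0)
        for author, imgs in labeled_imgs.items()
    }
    return min(centroids, key=lambda a: np.linalg.norm(q - centroids[a]))

# Synthetic stand-ins: one "artist" favors dark palettes, the other light ones.
rng = np.random.default_rng(0)
dark = [rng.integers(0, 100, (64, 64, 3), dtype=np.uint8) for _ in range(5)]
light = [rng.integers(150, 256, (64, 64, 3), dtype=np.uint8) for _ in range(5)]
print(nearest_author(dark[0], {"artist_a": dark[1:], "artist_b": light}))  # artist_a
```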
An NIH-funded project to study the immunoregulatory properties of mesenchymal stem cells (MSC) during islet cell transplants. NCSA conducts Core C, which applies eScience approaches to large, already existing collections of data in order to make new discoveries from previously conducted research involving MSC. Medical research inherently suffers from what statisticians call the curse of dimensionality. Working with complex organisms such as ourselves, with untold numbers of interacting biological properties, and with lengthy and costly experiments that observe only a small number of these, researchers are faced with large, sparse collections of data that offer small glimpses into a very high-dimensional feature space. Keeping up with the growing amount of published work produced as a result, which is necessary for making any discoveries, is becoming difficult. Further, published works often consider only a small portion of the available data (e.g. the data gathered solely by the authors). Because of the nonlinear nature of the data, considering data from multiple studies at once may very well lead to new discoveries.
In this work we address the need to index large, diverse collections of data scattered across a variety of formats, from spreadsheet files to databases to PDFs of published articles. Our goals include providing access to the information within these heterogeneous sources in a uniform manner, providing a host of user-friendly visualizations and data mining tools that allow medical researchers to explore the data, and exploring means of robustly incorporating information contained within unstructured data sources such as microscopy images.
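One simplified way to picture uniform access over heterogeneous sources is to normalize records from, say, a spreadsheet and a database into a common schema and build a small inverted index over their text; the sketch below does exactly that, with illustrative file names, table names, and fields.

```python
# Minimal sketch of presenting heterogeneous sources in a uniform way: normalize
# records from a spreadsheet (CSV) and a database into one common schema, then
# build a small inverted index over their text so all sources are searchable
# together. Field names, file names, and the schema are illustrative assumptions.
import csv, sqlite3
from collections import defaultdict

def records_from_csv(path):
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield {"source": path, "id": row.get("id", ""), "text": " ".join(row.values())}

def records_from_sqlite(path, table):
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row
    for row in conn.execute(f"SELECT * FROM {table}"):
        yield {"source": f"{path}:{table}", "id": str(row[0]),
               "text": " ".join(str(v) for v in row)}
    conn.close()

def build_index(records):
    """Map each lowercased token to the set of (source, id) records containing it."""
    index = defaultdict(set)
    for rec in records:
        for token in rec["text"].lower().split():
            index[token].add((rec["source"], rec["id"]))
    return index

# Usage (assuming the illustrative files exist):
# index = build_index(list(records_from_csv("assays.csv")) +
#                     list(records_from_sqlite("lab.db", "measurements")))
# print(index.get("msc", set()))
```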
Water sustainability is an urgent, complex, and transdisciplinary problem. Complex biophysical and social processes influence water use, quality, and availability. Few research areas have a greater need for modern cyberinfrastructure tools than water science, yet progress on meeting the grand challenge of water sustainability is hindered both by insufficient coordination and collaboration among the exceptionally diverse research communities involved and by the fact that community-developed software and cyberinfrastructure have not been professionally designed to be interoperable, sustainable, or reusable.
This project will develop the concept of a Water Science Software Institute (WSSI) and create a strategic plan to implement the WSSI. The WSSI's mission will be to concurrently transform the research culture and the software culture of the water science community.
During the conceptualization phase of the Institute, the project will use an open community engagement process, with assistance from the National Socio-environmental Synthesis Center (SESYNC), to involve the water science community in activities that both synthesize input for the strategic plan and serve as prototypes of the processes the Institute will use to achieve its mission. Specifically, the project will hold two synthesis workshops to define the functional elements of the Institute, a community forum to present the Institute concept to stakeholder communities for their input, and a software prototyping activity to demonstrate and evaluate methods for developing a culture of production-quality software engineering within the water science community. Between community engagement activities, the project will produce white papers on the elements of the Institute, which will be incorporated into the WSSI strategic plan.
MarketMaker is a national partnership of land grant institutions and State Departments of Agriculture dedicated to the development of a comprehensive, interactive database of food industry marketing and business data. Put simply, MarketMaker is a platform that seeks to foster business relationships between producers and consumers of food industry products and services. ISDA developed the search and web-mapping components of MarketMaker using open-source projects such as the Google Web Toolkit, OpenLayers, and GeoServer.