The objective of this project is to construct a service that allows past and present un-curated data to be utilized by science, while simultaneously demonstrating the novel science that can be conducted with such data. The effort focuses on large, distributed, and heterogeneous bodies of un-curated data, often referred to in the scientific community as long-tail data: data that would have great value to science if its contents were readily accessible. The proposed framework consists of two re-purposable cyberinfrastructure building blocks, a Data Access Proxy (DAP) and a Data Tilling Service (DTS), developed and tested in the context of three use cases that will advance science in geoscience, biology, engineering, and social science. The DAP aims to enable a new era of applications that are agnostic to file formats through a tool called a Software Server, which acts as a workflow tool to access functionality within third-party applications. By chaining together open/save operations within arbitrary software, the DAP provides a consistent means of accessing content stored across the large number of file formats that plague long-tail data. The DTS will utilize the DAP to access data contents and will index unstructured data sources (e.g., instrument data or data without text metadata). Building on the Versus content-based comparison framework and the Medici extraction services for auto-curation, the DTS will assign content-specific identifiers to untagged data, allowing one to search collections of such data.
The intellectual merit of this work lies in the proposed solution, which does not attempt to construct a single piece of software that magically understands all data, but instead utilizes every possible source of automatable help already in existence, in a robust and provenance-preserving manner, to create a service that can deal with as much of this data as possible. This proverbial “super mutt” of software, or Brown Dog, will serve as a low-level data infrastructure interfacing with digital data contents and, through its capabilities, enable a new era of science and applications at large. The broader impact of this work lies in its potential to serve not just the scientific community but the general public, as a DNS for data, moving civilization towards an era where a user’s access to data is not limited by a file’s format or by un-curated collections.
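The DAP's chaining of open/save operations can be pictured as a shortest-path search over a graph whose nodes are file formats and whose edges are the save capabilities advertised by installed applications. The sketch below is illustrative only: the format names and the `find_conversion_path` helper are invented for this example and are not part of the Brown Dog API.

```python
from collections import deque

def find_conversion_path(edges, src, dst):
    """Breadth-first search over a format-conversion graph.

    edges: dict mapping a format to the set of formats some
    installed application can save it to (one open/save hop each).
    Returns the shortest list of formats from src to dst, or None
    if no chain of applications can perform the conversion.
    """
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in edges.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# Hypothetical capabilities of three installed applications.
capabilities = {
    "wpd": {"doc"},          # a legacy word processor
    "doc": {"pdf", "txt"},   # an office suite
    "pdf": {"txt"},          # a PDF tool
}

print(find_conversion_path(capabilities, "wpd", "txt"))
# -> ['wpd', 'doc', 'txt']
```

Breadth-first search returns the chain with the fewest hops, which matters here because every additional open/save step is another opportunity for lossy conversion.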

In two words, the purpose of this project is to provide “searchable access” to digital archives. In particular, we are interested in the many digital archives of handwritten forms that currently possess no viable or practical means of searchable access. Today, access to archived handwritten forms can be provided by digitally scanning them and making them available on the web as a collection of digital images. This type of access is of limited use, however, as users have no easy way of searching through the individual forms other than going through them one by one manually. Given that these collections are almost always large, terabytes of data made up of millions of images, searching through the data manually is for all practical purposes impossible for an individual.

The motivation for our work is the decadal release of US Census forms, with the 1940 Census due in April of 2012. The digital scanning of the 1940 US Census microfilms by the US Census Bureau and the National Archives and Records Administration (NARA) has been estimated to result in 3.25 million images and approximately 125 terabytes of raw image data. Compressed copies of the images, approximately 18 terabytes’ worth, will be made available over the internet for the first time as the primary source of the Census information.

Upon the release of the census information, various companies and organizations will begin the long, tedious task of manually transcribing the contents of the handwritten forms. As transcribed digital text, the Census information will become searchable. To do this, the companies and organizations will employ anywhere from thousands to hundreds of thousands of people to manually look at each word within the forms and retype it as digital text. The process will take between 6 and 9 months to complete. Due to the large investment in human labor, these companies will provide the searchable content for a fee. At the moment there is no low-cost, near-free alternative to human transcription for providing searchable access to information within digital archives of handwritten information. It is here where our work focuses.

We aim to provide low-cost searchable access to digital archives where there is currently none. The field of Computer Vision, which deals with the extraction of information from images, is by no means at the state required to provide perfect automated transcription of the contents of these forms. However, our goal is not perfect machine-driven transcription, but rather some form of searchable image-based access where no searchable access would exist otherwise.

Towards this goal of providing low-cost searchable access to digital archives of handwritten forms, we investigate and develop a hybrid automated and crowdsourced approach. The automated portion, utilizing a technique known as Word Spotting, will provide immediate searchable image-based access to the content within digital handwritten forms. The crowdsourcing component, made up of both active and passive crowdsourcing elements, will accumulate traditional transcriptions over time. Of particular interest is the passive crowdsourcing element, which would acquire these transcriptions without users being made aware that they are carrying out this job. As more users use the system and transcriptions are acquired over time, the system will gradually shift from solely image-based search to a combination of image-based and text-based search. Given the limitations of the current state of computer vision, our approach emphasizes the human in the loop. Even within the automated portion we provide novel interfaces so that humans can search in a manner that is more readily tractable by the computer. In addition to the computer vision and crowdsourcing elements, we must deal with the scale of the problem at hand, that is, millions of images and terabytes of data. Towards these ends we also investigate the use of scalable distributed databases, efficient means of serving high-resolution images over the web, and means of indexing the large amount of text within an archive of digital images.
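The intuition behind Word Spotting can be sketched in a few lines: instead of transcribing, the system ranks segmented word images by visual similarity to a query word image. The projection-profile descriptor and the toy 0/1 pixel grids below are stand-ins chosen for brevity; real word-spotting systems use richer features and sequence matching such as dynamic time warping.

```python
import math

def profile_features(word_image):
    """Toy descriptor for a binarized word image (a list of rows of
    0/1 pixels): the per-column ink counts, i.e. a vertical
    projection profile."""
    cols = len(word_image[0])
    return [sum(row[c] for row in word_image) for c in range(cols)]

def spot(query, collection):
    """Rank candidate word images by Euclidean distance between
    their profiles and the query's profile (smaller = better)."""
    q = profile_features(query)
    def dist(img):
        p = profile_features(img)
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, p)))
    return sorted(range(len(collection)), key=lambda i: dist(collection[i]))

# Tiny illustrative 3x4 "word images".
query = [[1, 1, 0, 0],
         [1, 0, 0, 0],
         [1, 1, 0, 0]]
same  = [[1, 1, 0, 0],
         [1, 0, 0, 0],
         [1, 1, 0, 0]]
other = [[0, 0, 1, 1],
         [0, 0, 0, 1],
         [0, 0, 1, 1]]

print(spot(query, [other, same]))  # -> [1, 0]: the identical image ranks first
```

Because the search operates on image similarity rather than text, it works immediately on an untranscribed archive; the ranked hits are exactly the "searchable image based access" described above.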

A “crowd-sourced” design framework will be developed to enable stakeholders to interactively create and evaluate potential green infrastructure (GI) designs that reflect consideration of the full breadth of social, economic, and environmental criteria. The following specific research tasks will be undertaken: (1) create integrated models to predict hydrologic, human, and ecosystem impacts of green infrastructure designs from site to catchment scales (Research Questions 1 and 2); (2) develop interactive methods for crowd-sourcing green infrastructure design (Research Question 3); and (3) implement modeling and crowd-sourced design methods in a cyberinfrastructure (CI) framework (Research Question 4).

The research questions will be evaluated in diverse neighborhoods within three urban catchments in the Baltimore Ecosystem Study, which have extensive existing data on pretreatment stormwater and nutrient conditions, and planned or ongoing GI implementation. These data will be used to calibrate and validate the hydrologic and ecosystem models. Environmental non-governmental organizations (NGOs) in Baltimore will provide access and interface with communities that are currently implementing GI. Their input will be used to evaluate and improve predictions of human GI preferences, the efficacy of the crowd-sourced design framework, and improvements in stakeholder engagement in GI design through interactive CI.


Intensively managed landscapes (IMLs), regions of significant land use change, serve as a cradle for economic prosperity. However, the intensity of change is responsible for unintended deterioration of our land and water environments. By understanding present-day dynamics in the context of the long-term co-evolution of the landscape, soil, and biota, the Intensively Managed Landscapes Critical Zone Observatory (IML-CZO) aims to support the assessment of the short- and long-term resilience of crucial ecological, hydrological, and climatic services. These include freshwater quality and quantity; provision of food, fiber, and (bio)fuel; nutrient transformations; and terrestrial carbon storage. The goals of this project are to quantify the fluxes and transformations, as well as the interactions, thresholds, and dynamic feedbacks, of water, nutrients, and sediment in IMLs, and to characterize how rapid land use changes have altered the vulnerability and resilience of these systems. An observational network of two sites in Illinois (the 3,690-km² Upper Sangamon River Basin) and Iowa (the 270-km² Clear Creek Watershed), and a partner site in Minnesota (the 44,000-km² Minnesota River Basin), which together capture the geological diversity of the low-relief glaciated and tile-drained landscape of the Midwest, will drive the scientific and technological advances. The guiding hypothesis for the scientific effort is that, through human modification, the critical zone of IMLs has passed a tipping point (or threshold) and has changed from being a transformer of material flux, with high nutrient, water, and sediment storage, to being a transporter. This change threatens the resilience of the landscape to accommodate future impacts associated with ongoing human activity, including climate change and bioenergy crop production. Further, it increases the vulnerability of IMLs by compromising the sustainability of the key critical-zone services on which ecological systems and human populations depend.
Understanding and quantifying shifts in the response of the critical zone to human development remains a challenge, and current assessments are at best qualitative. IML-CZO research will identify threats to the resilience of the critical zone, and will also inform management strategies aimed at reducing the vulnerability of the system to human activities that threaten sustainability. We will develop methods and a knowledge base that are broadly applicable across the Midwest and similar low-gradient landscapes worldwide.

The project will provide leadership in developing the next generation of the workforce and in informing sustainable management strategies. The IML-CZO will be a launch pad for several new educational and outreach initiatives, and it will be an integral resource to connect and partner with existing organizations. It will draw on and add to several resources and programs available throughout the region. The CZO will provide a testbed for student-led sensing and data collection initiatives, and is expected to stimulate new research ideas and further advance the sensors-and-measurements curriculum. The CZO will also build collaborations with the National Great Rivers Research and Education Center, IOWATER, and the Minnesota and Illinois RiverWatch Volunteer programs to enlist "citizen scientists" (ages 10-70) in the work of the CZO. The IML-CZO will also serve as a training ground for undergraduate and graduate students and post-doctoral research associates by engaging them in interdisciplinary research.


Digging into Image Data to Answer Authorship Related Questions (DID-ARQ) seeks to explore authorship studies of the visual arts through the use of Computer Vision. In the past, authorship has been explored in terms of attributions, typically of either individual masterpieces or small collections of art from the same period, location, or school. Due to these localized strategies of exploration and research, commonalities and shared characteristics are largely unexplored. In fact, it is rare to find discussions that go beyond a single discrete dataset. More significantly, to our knowledge, there have to date been no studies of image analyses targeting the problem of authorship applied to very large collections of images and evaluated in terms of accuracy over diverse datasets.

DID-ARQ investigates the accuracy and computational scalability of image analyses when applied to diverse collections of image data. While identifying the distinct characteristics of artists is time-consuming for individual researchers using traditional methodologies, computer-assisted techniques can help humanists discover salient characteristics and increase the reliability of those findings over a large-volume corpus of digitized images. Computer-assisted techniques can provide an initial bridge from low-level image features, such as colors or pixels, to higher-level semantic concepts such as brush strokes, compositions, or patterns. This effort will utilize three datasets of visual works, 15th-century manuscripts, 17th- and 18th-century maps, and 19th- and 20th-century quilts, to investigate what might be revealed about the authors and their artistic lineages by comparing manuscripts, maps, and quilts across four centuries. Based on artistic, scientific, or technological questions, DID-ARQ intends to formulate and address the problem of finding the salient characteristics of artists from two-dimensional (2D) images of historical artifacts. Given a set of 2D images of historical artifacts with known authors, our project aims to discover what salient characteristics make an artist different from others, and then to enable automated classification of individual and collective authorship.
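The classification step described above can be sketched minimally as follows, with made-up numeric features standing in for the salient characteristics the project seeks (stroke width, slant, composition, and so on): each image is reduced to a feature vector, and an unattributed work is assigned to the author whose training centroid is nearest. Everything here, the feature values, author names, and nearest-centroid choice, is an illustrative assumption, not DID-ARQ's actual method.

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def classify(features, training):
    """Nearest-centroid attribution.

    training: dict mapping author -> list of feature vectors from
    works of known authorship. Returns the author whose centroid
    is closest (squared Euclidean distance) to the query features.
    """
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    centroids = {a: centroid(vs) for a, vs in training.items()}
    return min(centroids, key=lambda a: sqdist(features, centroids[a]))

# Invented two-feature vectors, e.g. (stroke width, slant).
training = {
    "scribe_A": [[0.9, 0.1], [0.8, 0.2]],
    "scribe_B": [[0.1, 0.9], [0.2, 0.8]],
}
print(classify([0.85, 0.15], training))  # -> scribe_A
```

The same structure scales to collective authorship by treating a school or workshop, rather than an individual, as the label attached to each group of training vectors.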


This project offers a unique and transformative approach to integrating existing and emerging long-tail model and data resources. The project will develop a knowledge framework to close the loop from models' queries back to data sources, first investigating the conceptual architecture required to integrate two leading examples of long-tail resources in geoscience: the Community Surface Dynamics Modeling System (CSDMS) and Sustainable Environment Actionable Data (SEAD). The project will also develop a context-based data model that provides an explicit interpretation of a metadata attribute. The researchers will capture the metadata concepts and semantics of various geo-informatics systems and provide tools for ensuring conceptual integration between the resources. Next, the project will develop a knowledge discovery tool that allows automated coupling of a model and data coming from different contributors. Finally, the project will provide a prototype physical implementation of the knowledge framework in the CSDMS modeling framework to demonstrate how it can advance the seamless discovery, selection, and integration of models and data, and how to achieve dynamic reusability of resources across multiple Earth Science long-tail resources.

The Great Lakes to Gulf (GLTG) Virtual Observatory will facilitate ready access to water resource information from the Mississippi River and its tributaries, expediting data-to-knowledge-to-policy connections. It is a project of the National Great Rivers Research and Education Center (NGRREC), a partnership of Lewis and Clark Community College and the University of Illinois at Urbana-Champaign (UIUC).

As a microscope allows one to peer into the microscopic world, Groupscope will allow one to make observations of large-group behavior. Currently, studies involving large-group behavior, where groups are made up of many people across one or more locations, are done manually through observation, surveys, and event coding if video data is collected. The Groupscope project will utilize work in the fields of Computer Vision and Audio Recognition to create a framework of tools that automates the more tedious portions of these investigations, such as the tracking of individuals and their interactions, and attempts to provide an objective means by which to classify the behavior observed in these studies. Utilizing NCSA Medici, a content-aware analogue to the popular Dropbox, custom extraction services will obtain the needed information from collections of multi-vantage-point video recordings, audio recordings, images, and spoken dialogue annotations. From this extracted metadata, novel visualizations will be developed to allow communication researchers to easily track observed individuals and their interactions over large spans of space and time.

An ongoing partnership between the National Center for Supercomputing Applications (NCSA) and the Korea Institute of Science and Technology Information (KISTI) has developed a prototype non-domain-specific platform called the KNSG (KISTI-NCSA Science Gateway) application framework for building domain-specific HPC applications. This application framework provides a core set of reusable components for building new applications.

The goal of work package 8 is to create tools for cultural heritage research. Utilizing the Medici web-based content repository as the underlying framework, we are adding support for the file types used in cultural heritage research, in terms of both metadata extraction and web-based previewing. Medici will also serve as a central, accessible content repository where data can be shared and socially curated.

The goal of work package 11 is to create a portable interactive system that allows for remote collaboration in studies involving cultural heritage. Derived from work started under Peter Bajcsy, we are moving our Tele-immersive Environment for Everybody (TEEVE) work to utilize the relatively new, low-cost Kinect 3D depth cameras. Similar to Skype video conferencing, the TEEVE work captures and transmits 3D reconstructions rather than conventional 2D video. Because the data is three-dimensional, users at remote locations can zoom in and rotate the transmitted scenes. With the lower-cost 3D cameras and small-profile modern computers available today, the group hopes to put these feeds into the field. There are cultural artifacts that have been broken over time and scattered among museums throughout the world. One of the goals of this work is to allow researchers to reassemble these pieces in a shared 3D virtual environment. With such a system, a researcher will be able to capture the geometry of an artifact in a museum and work with other researchers around the world in real time to interactively place such pieces alongside others that are physically stored in remote locations. To do this, the system will track a user’s hand and arm movements using the skeletal tracking that has made the Kinect the success it is. This work will leverage other work developed at NCSA, such as Medici as the backend to store and link information between systems, and AVL’s Virtual Director.
