Christine Borgman, author of Big Data, Little Data, No Data: Scholarship in the Networked World, discusses her book and argues for investments in data management in scholarly research.
What is the meaning behind the title of the book? What makes data “big” or “little”?
“Big Data” has had a remarkably long run in the hype cycle, promoted by business, government, and academe as a means to gain insights into all manner of unstated problems. Data can be “big” in many respects, whether in number of bits, pages, or physical volume, or by other measures; in the rate of production or change, considering their stability or dynamic nature; or in their heterogeneity, ranging from simple structures to unique observations. Big is thus a relative comparison. Data can be “big” if they are beyond our present capacity to interpret them.
“Little data” also are relative. The big vs. little comparison in science, for example, usually refers to characteristics of scientific practice. “Big science” is conducted with large instruments and large collaborations over long periods of time, whereas “little science” tends to address smaller and more focused questions. Even these distinctions are relative, as each research project, in each scholarly field, can use data in a particular way.
“No data,” the third phrase in the title, draws attention to the absence of useful or usable data. Implicit in the promotion of big data is the assumption that data exist on almost anything; the challenge is merely to find those data and to exploit them. In reality, data often do not exist for a particular research purpose, whether because they were not collected, were not curated and preserved, were not released by those who created them, cannot be read by available software and hardware, are proprietary, or for other reasons.
The subtitle, Scholarship in the Networked World, reflects the scope of the book’s concern. This is an analysis of the changing nature of scholarship, as viewed through the lens of data practices and policy.
You write that, “Data are both assets and liabilities.” Can you expand on that?
Scholars collect, create, manage, analyze, and interpret data in the course of their research. Those data, which may take many forms—digital, paper, specimens, audio, visual, static, dynamic, and so on—are assets that can be exploited for new questions and new findings. They also may be assets to other people and institutions, to be exploited alone or in combination with other data. Data can serve instrumental and symbolic purposes in the collections of repositories, libraries, archives, museums, and individuals.
Data can be liabilities in several respects. The most obvious is the investment required to curate them. To remain valuable, data need metadata and other forms of description. If data are in digital form, they must be migrated to new technologies as they appear. They must be packaged with, or linked to, related information necessary for their interpretation, such as software, instrumentation, calibration, and research protocols. Data also can be liabilities in their potential for misuse, misinterpretation, breaches of security or confidentiality, errors, contractual violations, and so on. Keeping data useful requires that they be kept well.
Can you describe the challenges and opportunities afforded by digital data and open data?
The book presents six provocations that are explored in ten chapters. These provocations span issues of reproducibility, reuse, sharing, and control of data; the incentives, motivations, risks, and rewards associated with releasing and reusing data; the difficulties of transferring knowledge across contexts and over time; the evolving roles of data in scholarly communication; the relationships between open access publishing and open data; the economics of data management; the redistribution of expertise, responsibilities, costs, and benefits associated with data and with scholarly publications; and the long view of knowledge infrastructures in the face of these challenges and opportunities.
In scholarly practice, what investments need to be made in order to manage and exploit data in the long run?
Investments in data need to be made throughout all stages of the research enterprise. The more that scholars become aware of the value inherent in their data, both as assets and liabilities, the more likely they are to invest in data management. Managing data well for one’s own purposes is a first step in making those data useful to others in the future.
Managing data requires expertise in the research domain and in knowledge organization. Rarely are these skills taught as part of graduate education. One significant investment is to include data management training in PhD programs within individual domains. Another is to invest in new partnerships with libraries, archives, and repositories, and in their professional expertise in knowledge organization. New combinations of skills are required across the scholarly enterprise, from the computation and statistics necessary to clean, model, and interpret data, to the metadata, ontologies, provenance, and migration work necessary to maintain the value in data. Training in the latter sets of skills is scattered across data science programs in computer science, statistics, and business, and data management programs in information studies. Better coordination is needed, along with more expansive investments in knowledge infrastructures.
What kinds of discussions and debates do you hope the book provokes?
My goal is to provoke discussion among the many competing stakeholders about “data” as a complex, multi-faceted construct. The continuing fascination with “big data” obscures the lack of agreement about what constitutes “data.” One person’s signal is another’s noise. Increasing the size of the haystack does not make a needle easier to find. Having the right data is usually better than having more data.
A central part of that debate concerns the value proposition for data sharing and reuse. Data sharing has become policy enforced by governments, funding agencies, journals, and other stakeholders. Arguments in favor of data sharing include leveraging investments in research, reducing the need to collect new data, addressing new research questions by reusing or combining extant data, and enabling the reproduction of research, which would increase accountability and transparency and reduce fraud.
Despite these laudable goals, relatively few fields or individuals release their data on a regular basis. Reuse of data is rarer still. The incentives and motivations of scholarship militate against data sharing and reuse. Researchers’ time and resources are finite. Resources spent on data management, curation, and preservation typically are viewed as resources not spent on research. Costs and benefits are distributed unevenly. A much fuller discussion is needed about what to keep, why, how, for how long, and by whom.
Much of the scholarship on data practices attempts to understand the sociotechnical barriers to sharing, with the goal of designing infrastructures, policies, and cultural interventions that will overcome those barriers. Rather than assume the barriers are surmountable, stakeholders should be assessing the roles of data in research, both as process and as product. We must ask what kinds of infrastructures, workforces, policies, and practices are necessary to promote the progress of science and the useful arts in the 21st century.