The Task of Managing Big Data in the 21st Century
Chris Bouton, General Manager Thomson Reuters Life Sciences

The overwhelming abundance of opportunity presented by Big Data is enough to make a scientist's head spin. When we consider the vast amount of value and analytical potential in our scholarly and scientific universe, we understand that there's a growing swath of causal, actionable data that could lead us and our clients toward more effective and efficient research and development.

Although we've been hearing increased talk about Big Data, it's not a new phenomenon. Big Data has created challenges since the patent boom of the mid-century, when mammoth numbers of pharma patents outstripped scientists' ability to keep up with new findings in their fields. By 1960, a pharma scientist would have had to read more than 1,000 patents just to stay current. And that's not even considering the languages that scientist would have to be able to read: English, German, Japanese and French.

Then back in the 1970s we witnessed the emergence of computer databases that had the ability to store, search and display chemical structures. Suddenly companies were able to build internal registry databases and online transactional services emerged. Bibliometrics was born. Data was now in a computable form and filters were available to sift through the volume of scientific and scholarly journals—whose numbers kept growing and growing.

Public sequence databanks came along in the early 1980s, allowing the biological sequences produced by nascent sequencing technologies to be deposited and identified. In the 1990s, we witnessed the emergence of microarray technology, which could simultaneously measure the expression levels of large numbers of genes, providing experimental data on an unprecedented scale. This generated extraordinary excitement within the scientific community, but it wasn't without its challenges.

Next-generation sequencing in the 2000s built on previous breakthrough technologies and, as is typically the case, created its own host of new challenges. So, as far back as mid-century, Big Data had become an elephant that was difficult to manage. Interestingly, however, the elephant has only become bigger and more ornery as science, scholarship and technology have advanced, reaching a point that companies are now urgently seeking to address.

The fact is, humans can't work with Big Data directly. The answer is to reduce this seemingly insurmountable mountain of data to human proportions that we're able to comprehend, use and appreciate.

Even if the unspecific, unscaled term “Big Data” fades away, the challenges of data integration and analytics will remain. Big Data is real, not a buzzword. It defines the challenges we face every day to effectively use and gain insight from all the available information out there. Being able to connect disparate data sets is just as important as having the data in the first place.

According to some estimates, the Big Data market could be worth more than $50 billion by 2017, and organizations are actively searching for solutions.

"Making Big Data Look Like Little Data"
Many clients approach the Big Data challenge by shrinking it into Little Data through customized in-silico experiments that require building up data sets from incongruent sources. Little Data is reliable, evidence-backed facts that scientists can use in models, visualizations and analysis—it's data that we're equipped to handle in a practical way. Little Data is vital because it's actionable and can help companies make core business development decisions. This shrinking, however, requires cross-functional teams and poses a significant data-harmonization challenge. It's a formidable task to filter the outputs of collected data so the correlations you need stand out among those you don't.
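The harmonization step described above can be sketched in a few lines: records from two incongruent sources, each with its own field names, are mapped onto one shared schema before analysis. The field names and records below are hypothetical, chosen only to illustrate the idea.

```python
# A minimal sketch of shrinking Big Data into Little Data: harmonizing
# records from two incongruent sources onto one shared schema.
# All field names and values here are hypothetical examples.

def harmonize(internal_rows, external_rows):
    """Map differently named fields from two sources onto a common schema."""
    little_data = []
    for row in internal_rows:
        little_data.append({
            "compound": row["cmpd_id"],        # internal naming convention
            "target": row["target_gene"],
            "source": "internal",
        })
    for row in external_rows:
        little_data.append({
            "compound": row["CompoundName"],   # external vendor's convention
            "target": row["Gene"],
            "source": "external",
        })
    return little_data

internal = [{"cmpd_id": "TR-001", "target_gene": "EGFR"}]
external = [{"CompoundName": "TR-001", "Gene": "KRAS"}]

for record in harmonize(internal, external):
    print(record)
```

Once both sources speak the same schema, the downstream models and visualizations the article describes can treat them as a single data set.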

Example challenges posed by Big Data include:
  • Connecting internal content with external sources
  • Discovering unexpected associations
  • Scanning for novel information
  • Connecting document repositories with structured sources
  • Achieving a competitive advantage
  • Catalyzing greater insights
  • Driving revenue growth
Thomson Reuters surveys have shown that the volume and variety of data behind the firewall of a typical pharma company is fast approaching the point where it exceeds the content available outside it. More than 45 percent of respondents said access to external content was their top challenge. Our surveys also showed strong interest in NoSQL databases, linked data/semantic web technologies and visual analytics as necessary tools for solving Big Data challenges.

Many organizations purchase high-quality content externally and generate significant data internally. All of this data—most of which is textually challenging—needs to work together, and organizations need to be able to unlock its value. As if this isn’t daunting enough, the task usually needs to be done sooner rather than later.

Over the years, information professionals have become skilled at extracting relevant content from the most valuable resources and reformulating and presenting those results to users; however, this process is unsustainable over the long term and creates an organizational bottleneck.

We have more data than we know what to do with—so much that we don't know the things we know. We need to get the data into a pattern that has tangible meaning, and then we have to translate that meaning into knowledge. Without an information provider to do that for you, you're essentially lost. It's like traversing a forest of endless resources without a compass: you're surrounded by the things you need, but have no direction to get where you're going.

The answer to the Big Data question requires a trusted, respected and effective mechanism to make content manageable at its most important research, discovery and developmental level. As organizations tap into the complicated and potentially lucrative aspect of Big Data management, it will be crucial for solutions to include reliable, linked data technology frameworks that allow content to be shared across applications and enterprise or community boundaries. This framework must then connect users with data from internal proprietary systems, as well as third-party resources.
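The linked-data idea mentioned above can be illustrated with a toy triple store: facts from an internal proprietary system and a third-party resource are both expressed as subject-predicate-object triples, so one query mechanism spans the boundary between them. The identifiers and relations below are invented for illustration; a production framework would use RDF and a real triple store rather than this in-memory sketch.

```python
# A toy linked-data sketch: internal and external facts as triples
# in one store, queryable with a single pattern-matching function.
# All identifiers (compound:TR-001, gene:EGFR, ...) are hypothetical.

triples = set()

def add(subject, predicate, obj):
    triples.add((subject, predicate, obj))

# Fact from an internal proprietary system
add("compound:TR-001", "inhibits", "gene:EGFR")
# Fact from a third-party resource
add("gene:EGFR", "implicated_in", "disease:NSCLC")

def query(subject=None, predicate=None, obj=None):
    """Return every triple matching the pattern (None acts as a wildcard)."""
    return [t for t in triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)]

# Traverse across the internal/external boundary in one pass:
for _, _, gene in query("compound:TR-001", "inhibits"):
    for triple in query(gene, "implicated_in"):
        print(triple)
```

The point of the sketch is the shared representation: because both systems contribute triples to the same graph, a path from an internal compound to an externally reported disease association falls out of two simple queries.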

The realization of unexpected associations—a critical part of the Big Data management process—enables clients to make new discoveries and pioneer new paths in their fields. This is perhaps one of the most far-reaching characteristics of solution-based analytics.

The ability to whittle Big Data into Little Data in an effective, productive and resourceful way is how information providers can best serve their clients. Little Data has the ontologies that can meld content from multiple sources, filter the noise of large-scale analyses, provide context that enables users to follow a train of thought, and power the analytics that turn content into insight. It isn't just about the technology; it's about knowing, understanding and appreciating the data environment in a way that allows information to flow in a relevant manner. That requires breaking down the walls between internal, external, public and commercial content and putting the tools into the right hands.
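One concrete way an ontology melds content from multiple sources is term normalization: each source's local vocabulary is resolved to a single preferred label before analysis, so the same entity is counted once. The synonym table below is a tiny illustrative stand-in, not a real ontology.

```python
# A small sketch of ontology-backed normalization: mapping each source's
# local terms onto one preferred label. The synonym table is illustrative.

ONTOLOGY = {
    "acetylsalicylic acid": "aspirin",
    "asa": "aspirin",
    "aspirin": "aspirin",
}

def normalize(term):
    """Resolve a raw term to its preferred ontology label, if known."""
    return ONTOLOGY.get(term.strip().lower(), term)

mentions = ["Aspirin", "ASA", "acetylsalicylic acid", "ibuprofen"]
print([normalize(m) for m in mentions])
# → ['aspirin', 'aspirin', 'aspirin', 'ibuprofen']
# Unrecognized terms pass through unchanged for a curator to review.
```

This is the mechanism behind "filtering the noise" in practice: three surface forms collapse to one concept, so correlations accumulate on the entity rather than scattering across its synonyms.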

There may not be a single correct answer, but there are several right paths. Information providers have a history of building and maintaining Little Data so the elephant doesn’t run rampant across the scholarly scientific landscape. This is an opportunity for companies to build the next generation of solutions that will enable new discoveries, which is why we’re all in this business in the first place.