As part of the Edinburgh and South East Scotland City Region Deal, we're co-designing an entry-level text and data mining service for students, researchers and local businesses.

The challenge

Text is, and will remain, our main method of knowledge transfer. Over 500 million tweets were sent per day in 2014 and 269 billion emails were sent per day in 2017. The total volume of published academic literature is doubling at least every nine years. A lack of technical skills should not preclude anyone from analysing, at scale, the wealth of textual data we create every day. Yet many people working in the arts and humanities, in national and local government, in small businesses and in local communities are excluded from experimenting with data-driven innovation by poor data literacy, barriers to using more complex tools and inadequate infrastructure.

We’ve won seed funding from the Edinburgh and South East Scotland City Region Deal to address this challenge and serve a national demand for accessible tools in data science and analysis.

Our strategy

The ‘answer’ to data literacy starts with user research. We have collaborated extensively with Data-Driven Innovation (DDI) Programme Sector Leads, academic and professional services colleagues across The University of Edinburgh, the National Library of Scotland, The Data Lab, The Scottish Government, Project Jupyter, SMEs and other communities to understand user needs and assess how current text and data mining solutions support their research and skills.

Drawing on insights from this user research phase, we’re developing an easy-to-use text and data mining pilot service built on the Defoe tool created by Professor Melissa Terras and Dr Rosa Filgueira Vicente from EPCC, taking forward the Research Data Spring work funded by JISC at UCL and the British Library. We will build an intuitive visual interface and develop a service for students, researchers and regional SMEs, supported by comprehensive training in computational and data literacy skills. The service will receive and securely host collections of documents, automatically process them, and support novel analyses that would otherwise be beyond the scale of human comprehension or too time-consuming and costly to undertake.
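To give a flavour of the kind of analysis such a service could automate, here is a minimal Python sketch that counts how often chosen keywords appear in a small collection of documents, grouped by year. It is an illustration only and does not use or represent Defoe's actual interface: the sample documents, the `year` and `text` fields, and the function names `tokenize` and `keyword_counts_by_year` are all assumptions made for this example.

```python
import re
from collections import Counter, defaultdict

# Hypothetical in-memory collection. A hosted service would read
# ingested documents from secure storage instead.
DOCUMENTS = [
    {"year": 1850, "text": "The railway reached the town. The railway changed trade."},
    {"year": 1850, "text": "Steam engines powered the mills."},
    {"year": 1900, "text": "The railway and the telegraph linked every town."},
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def keyword_counts_by_year(documents, keywords):
    """Count occurrences of each keyword, grouped by document year."""
    wanted = {keyword.lower() for keyword in keywords}
    counts = defaultdict(Counter)
    for doc in documents:
        for token in tokenize(doc["text"]):
            if token in wanted:
                counts[doc["year"]][token] += 1
    # Convert to plain dictionaries so the result is easy to print or export.
    return {year: dict(counter) for year, counter in counts.items()}


result = keyword_counts_by_year(DOCUMENTS, ["railway", "town"])
print(result)
```

On three short documents this is trivial to do by hand. The point of a hosted service is to run the same kind of query across collections far too large to read.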

What we’re delivering

  • A pilot text and data mining service, based on Defoe technology, ready for test-and-learn adoption by 2020
  • A recommendations report describing text and data mining (TDM) skills, dataset and service requirements across the City Deal region that will drive data education and adoption by learners with low computational skills
  • A catalogue of exemplar datasets and a clear ingest process for new datasets
  • A report clarifying legal issues and proposing appropriate frameworks for copyright, licensing, exploitation and derived data
  • Exemplar datasets and data learning materials

Skills we’re bringing to the project

  • Python
  • ElasticSearch
  • Data ingest
  • User research
  • Service design