Why Open City Data is the Brownfield Regeneration Challenge of the Information Age
(Graphic of New York's ethnic diversity from Eric Fischer)
I often use this blog to explore ways in which technology can add value to city systems. In this article, I'm going to dig more deeply into my own professional expertise: the engineering of the platforms that make technology reliably available.
Many cities are considering how they can create a city-wide information platform. The potential benefits are considerable: Dublin's "Dublinked" platform, for example, has stimulated the creation of new high-technology businesses, and is used by scientific researchers to examine ways in which the city's systems can operate more efficiently and sustainably. And the announcements today by San Francisco that they are legislating to promote open data and have appointed a "Chief Data Officer" for the city are sure to add to the momentum.
But if cities such as Dublin, San Francisco and Chicago have found such platforms so useful, why aren't there more of them already?
To answer that question, I'd like to start by setting an expectation:
City information platforms are not "new" systems; they are a brownfield regeneration challenge for technology.
Just as urban regeneration needs to take account of existing physical infrastructure such as buildings, transport and utility networks, so when thinking about new city technology solutions we need to consider the information infrastructure that is already in place.
A typical city authority has many hundreds of IT systems and applications that store and manage data about their city and region. Private sector organisations who operate services such as buses, trains and power, or who simply own and operate buildings, have similarly large and complex portfolios of applications and data.
So in every city there are thousands – probably tens of thousands – of applications and data sources containing relevant information. (The Dublinked platform, for example, was launched in October 2011 with over 3,000 data sets covering the environment, planning, water and transport.) Only a very small fraction of those systems will have been designed with the purpose of making information available to and usable by city stakeholders; and they certainly will not have been designed to do so in a joined-up, consistent way.
The picture to the left is a reproduction of a map of the IT systems of a real organisation, and the connections between them. Each block in the diagram represents a major business application that manages data; each line represents a connection between two or more such systems. Some of these individual systems will have involved hundreds of person-years of development over decades of time. Engineering the connections between them will also have involved significant effort and expense.
Whilst most organisations improve the management of their systems over time and sometimes achieve significant simplifications, by and large this picture is typical of the vast majority of organisations today, including those that support the operation of cities.
In the rest of this article, I'll explore some of the specific challenges for city data and open data that result from this complexity.
My intention is not to argue against bringing city information together and making it available to communities, businesses and researchers. As I've frequently argued on this blog, I believe that doing so is a fundamental enabler to transforming the way that cities work to meet the very real social, economic and environmental challenges facing us. But unless we take a realistic, informed approach and undertake the required engineering diligence, we will not be successful in that endeavour.
1. Which data is useful?
Amongst those thousands of data sets that contain information about cities, on which should we concentrate the effort required to make them widely available and usable?
That's a very hard question to answer. We are seeking innovative change in city systems, which by definition is unpredictable.
One answer is to look at what's worked elsewhere. For example, wherever information about transport has been made open, applications have sprung up to make that information available to travellers and other transport users in useful ways. In fact, most information that describes the urban environment is likely to quickly prove useful, including maps, land use characterisation, planning applications, and the locations of shops, parks, public toilets and other facilities.
The other datasets that will prove useful are less predictable; but there's a very simple way to discover them: ask. Ask local entrepreneurs what information they need to start new businesses. Ask existing businesses what information about the city would help them be more successful. Ask citizens and communities.
This is the approach we have followed in Sunderland, and more recently in Birmingham through the Smart City Commission and the recent "Smart Hack" weekend. The Dublinked information partnership in Dublin also engages in consultation with city communities and stakeholders to prioritise the datasets that are made available through the platform. The Knight Foundation's "Information Needs of Communities" report is an excellent explanation of the importance of taking this approach.
2. What data is available?
How do we know what information is contained in those hundreds or thousands of data sets? Many individual organisations find it difficult to "know what they know"; across an entire city the challenge is much harder.
Arguably, that challenge is greatest for local authorities: whilst every organisation is different, as a rule of thumb private sector companies tend to need tens to low hundreds of business systems to manage their customers, suppliers, products, services and operations. Local authorities, obliged by law to deliver hundreds or even thousands of individual services, usually operate systems numbering in the high hundreds or low thousands. The process of discovering, cataloguing and characterising information systems is time-consuming and hence potentially expensive.
The key to resolving the dilemma is an open catalogue which allows this information to be crowdsourced. Anyone who knows of or discovers a data source that is available, or that could be made available, and whose existence and contents are not sensitive, can document it. Correspondingly, anyone who has a need for data that they cannot find or use can document that too. Over time, a picture of the information that describes a city, including what data is available and what is not, will build up. It will not be a complete picture – certainly not initially; but this is a practically achievable way to create useful information.
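As a rough illustration of the idea, a crowdsourced catalogue only needs to record two kinds of entry: data that someone knows about, and data that someone needs. The sketch below is a minimal in-memory model; the class names, field names and status values are all hypothetical and not taken from any real catalogue platform.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical status values for this sketch.
ALLOWED_STATUSES = {"available", "could_be_available", "requested"}


@dataclass
class CatalogueEntry:
    """One crowdsourced record: a dataset that exists, or one that someone needs."""
    title: str
    description: str
    status: str        # one of ALLOWED_STATUSES
    contributor: str   # who documented the entry
    holder: str = ""   # organisation believed to hold the data, if known


@dataclass
class OpenCatalogue:
    entries: List[CatalogueEntry] = field(default_factory=list)

    def document(self, entry: CatalogueEntry) -> None:
        """Add an entry, rejecting statuses the catalogue does not understand."""
        if entry.status not in ALLOWED_STATUSES:
            raise ValueError(f"unknown status: {entry.status}")
        self.entries.append(entry)

    def with_status(self, status: str) -> List[CatalogueEntry]:
        """Return every entry recorded with the given status."""
        return [e for e in self.entries if e.status == status]


catalogue = OpenCatalogue()
catalogue.document(CatalogueEntry(
    title="Bus stop locations",
    description="Coordinates of permanent bus stops",
    status="available",
    contributor="transport officer",
    holder="city transport authority",
))
catalogue.document(CatalogueEntry(
    title="Temporary bus stops during roadworks",
    description="Needed for a journey-planning app",
    status="requested",
    contributor="local developer",
))

# The "requested" entries are the picture of what is not yet available.
unmet_needs = [e.title for e in catalogue.with_status("requested")]
print(unmet_needs)
```

Recording requests alongside known sources is the design choice that matters here: the gaps become visible in the same place as the data.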
3. What is the data about?
The content of most data stores is organised by a "key" – a code that indicates the subject of each element of data. That "key" might be a person, a location or an organisation. Unfortunately, all of those things are very difficult to identify correctly and in a way that will be universally understood.
For example, do the following pieces of information refer to the same people, places and organisations?
"Mr. John Jones, Davis and Smith Delicatessen, Harbourne, Birmingham"
"J A Jones, Davies and Smythe, Harborne, B17"
"The Manager, David and Smith Caterers, Birmingham B17"
"Mr. John A and Mrs Jane Elizabeth Jones, 14 Woodhill Crescent, Northfield, Birmingham"
This information is typical of what might be stored in a set of IT systems managing such city information as business rates, citizen information, and supplier details. As human beings we can guess that a Mr. John A Jones lives in Northfield with his wife Mrs. Jane Elizabeth Jones; and that he is the manager of a delicatessen called "Davis and Smith" in Harborne which offers catering services. But to derive that information we have had to interpret several different ways of writing the names of people and businesses; tolerate mistakes in spelling; and tolerate different semantic interpretations of the same entity (is "Davis and Smith" a "Delicatessen" or a "Caterer"? The answer depends on who is asking the question).
(Two views of Exhibition Road in London, which can be freely used by pedestrians, for driving and for parking; the top photograph is by Dave Patten. How should this area be classified? As a road, a car park, a bus-stop, a pavement, a park – or something else? My colleague Gary looks confused by the question in the bottom photograph!)
All of these challenges occur throughout the information stored in IT systems. Some technologies – such as "single view" – exist that are very good at matching the different formats of names, locations and other common pieces of information. In other cases, information that is stored in "codes" – such as "LHR" for "London Heathrow" and "BHX" for "Birmingham International Airport" – can be decoded using a glossary or reference data.
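To give a flavour of both ideas, here is a minimal sketch using only Python's standard library. It is not how commercial "single view" products work; it simply normalises text and scores similarity with `difflib.SequenceMatcher`, and decodes codes with a small lookup table. The function names and the reference table are assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher

# Illustrative reference data: codes and the names they stand for.
AIRPORT_CODES = {
    "LHR": "London Heathrow",
    "BHX": "Birmingham International Airport",
}


def normalise(text: str) -> str:
    """Lower-case, replace punctuation with spaces and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", text.lower())
    return " ".join(cleaned.split())


def similarity(a: str, b: str) -> float:
    """Return a score between 0 and 1; higher means the strings are more alike."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def decode(code: str) -> str:
    """Look a code up in the reference data, returning it unchanged if unknown."""
    return AIRPORT_CODES.get(code.strip().upper(), code)


# Spelling variants score highly; unrelated strings do not.
print(similarity("Harbourne", "Harborne"))
print(similarity("Davis and Smith", "Davies and Smythe"))
print(similarity("Davis and Smith", "Woodhill Crescent"))
print(decode("bhx"))
```

A score like this only suggests candidate matches. Deciding the threshold above which two records are treated as the same entity still takes human judgement and knowledge of the data.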
Translating semantic meanings is more difficult. For example, is the A45 from Birmingham to Coventry a road that is useful for travelling between the two cities? Or a barrier that makes it difficult to walk from homes on one side of the road to shops on the other? In time semantic models of cities will develop to systematically reconcile such questions, but until they do, human intelligence and interpretation will be required.
4. Sometimes you don't want to know what the data is about
Sometimes, as soon as you know what something is about, you need to forget that you know. I led a project last year that applied analytic technology to derive new insights from healthcare data. Such data is most useful when information from a variety of sources that relate to the same patient is aggregated together; to do that, the sort of matching I've just described is needed. But patient data is sensitive, of course; and in such scenarios patients' identities should not be apparent to those using the data.
Techniques such as anonymisation and aggregation can be applied to address this requirement; but they need to be applied carefully in order to retain the value of data whilst ensuring that identities and other sensitive information are not inadvertently exposed.
For example, the following information contains an anonymised name and very little address information; but should still be enough for you to determine the identity of the subject:
Subject: 00764
Name: XY67 HHJK6UB
Address: SW1A
Profession: Leader of a political party
(Please submit your answers to me at @dr_rick on Twitter!)
This is a contrived example, but the risk is very real. I live on a road with about 100 houses. I know of one profession to which only two people who live on the road belong. One is a man and one is a woman. It would be very easy for me to identify them based on data which is "anonymised" naively. These issues become very, very serious when you consider that within the datasets we are considering there will be information that can reveal the home address of people who are now living separately from previously abusive partners, for example.
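One simple way to catch this kind of risk before publishing is to count how many records share each combination of seemingly harmless attributes, and flag any combination shared by only a few. The sketch below does this in the spirit of what is usually called k-anonymity; the records, field names and threshold are made up for illustration, and a check like this is only one of several safeguards that a real release would need.

```python
from collections import Counter


def small_groups(records, quasi_identifiers, k):
    """Return attribute combinations that are shared by fewer than k records."""
    counts = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return {combo: n for combo, n in counts.items() if n < k}


# Hypothetical "anonymised" records: names removed, attributes retained.
records = [
    {"subject": "00764", "district": "SW1A", "profession": "Leader of a political party"},
    {"subject": "00765", "district": "B17", "profession": "Teacher"},
    {"subject": "00766", "district": "B17", "profession": "Teacher"},
    {"subject": "00767", "district": "B17", "profession": "Teacher"},
]

# Any combination held by fewer than 3 people is treated as identifying.
risky = small_groups(records, ["district", "profession"], k=3)
print(risky)
```

Records in a flagged group would need further generalisation (a wider area, a broader job category) or suppression before release.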
5. Data can be difficult to use
There are many, many reasons why data can be difficult to use. Data contained within a table within a formatted report document is not much use to a programmer. A description of the location of a disabled toilet in a shop can only be used by someone who understands the language it is written in. Even clearly presented numerical values may be associated with complex caveats and conditions or expressed in quantities specific to particular domains of expertise.
For example, the following quote from a 2006 report on the global technology industry is only partly explained by the text box shown in the image on the left:
"In 2005, the top 250 ICT firms had total revenues of USD 3 000 billion".
(Source: "Information Technology Outlook 2006", OECD)
Technology can address some of these issues: it can extract information from written reports; transform information between formats; create structured information from written text; and even, to a degree, perform automatic translation between languages. But doing all of that requires effort; and in some cases human expertise will always be required.
For city information platforms to be truly useful to city communities, some thought also needs to be given to how those communities will be supported in understanding and using that information.
6. Can I trust the data?
Several British banks have recently been fined hundreds of millions of dollars for falsely reporting the interest rates at which they are able to borrow money. This information, the "London InterBank Offered Rate" (LIBOR), is an example of open data. The banks that have been fined were found to have under-reported the interest rate at which they were able to borrow, which made them appear more creditworthy than they actually were.
Such deliberate manipulation is just one of the many reasons we may have to doubt information. Who creates information? How qualified are they to provide accurate information? Who assesses that qualification and tests the accuracy of the information?
For example, every sensor which measures physical information incorporates some element of uncertainty and error. Location information derived from smartphones is usually accurate to within a few meters when derived from GPS data, but only to within a few hundred meters when derived by triangulation between mobile transmission masts. That level of inaccuracy is tolerable if you want to know which city you are in, but not if you need to know where the nearest cashpoint is. (Taken to its extreme, this argument has its roots in "Noise Theory", the behaviour of stochastic processes and ultimately Heisenberg's Uncertainty Principle in Quantum Mechanics. Sometimes it's useful to be a Physicist!)
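In practice this means that a measurement should travel with an estimate of its own error, so that whoever uses it can decide whether it is good enough for their purpose. A minimal sketch of that idea follows; the class, the field names, the coordinates and the thresholds are illustrative assumptions rather than values from any real service.

```python
from dataclasses import dataclass


@dataclass
class LocationFix:
    latitude: float
    longitude: float
    accuracy_m: float  # estimated error radius in metres
    source: str        # how the fix was obtained


def fit_for_purpose(fix: LocationFix, required_accuracy_m: float) -> bool:
    """A fix is usable only if its error radius is within what the task tolerates."""
    return fix.accuracy_m <= required_accuracy_m


# Illustrative tolerances: knowing the city needs far less precision
# than finding the nearest cashpoint.
REQUIRED_ACCURACY_M = {
    "which_city": 5000.0,
    "nearest_cashpoint": 50.0,
}

# Approximate coordinates for central Birmingham, used only as sample data.
gps_fix = LocationFix(52.4862, -1.8904, accuracy_m=5.0, source="gps")
mast_fix = LocationFix(52.4862, -1.8904, accuracy_m=300.0, source="cell_triangulation")

for fix in (gps_fix, mast_fix):
    for task, tolerance in REQUIRED_ACCURACY_M.items():
        print(fix.source, task, fit_for_purpose(fix, tolerance))
```

The same pattern applies to other sensor data: publish the uncertainty alongside the value rather than the value alone.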
Information also goes out of date very quickly. If roadworks are started at a busy intersection, how does that affect the route-calculation services that many of us depend on to identify the quickest way to get from one place to another? When such roadworks make bus stops inaccessible so that temporary stops are erected in their place, how is that information captured? In fact, this information is often not captured; and as a result, many city transport authorities do not know where all of their bus stops are currently located.
In this section I have barely touched on an enormously rich and complex subject. Suffice it to say that determining the "trustability" of information in the broadest sense is an immense challenge.
7. Data is easy to lose
(A computer information failure in Las Vegas photographed by Dave Herholz)
Whenever you find that an office, hotel room, hospital appointment or seat on a train that you've reserved is double-booked, you've experienced lost data. Someone made a reservation for you in a computer system; that data was lost; and so the same reservation was made available to someone else.
Some of the world's most sophisticated and well-managed information systems lose data on occasion. That's why we're all familiar with it happening to us.
If cities are to offer information platforms that local people, communities and businesses come to depend on, then we need to accept that providing reliable information comes at a cost. This is one of the many reasons that I have argued in the past that "open data" is not the same thing as "free data". If we want to build a profitable business model that relies on the availability of data, then we should expect to pay for the reliable supply of that data.
A Brownfield Regeneration for the Information Age
So if this is all so hard, should we simply give up?
Of course not; I don't think so, anyway. In this article, I have described some very significant challenges that affect our ability to make city information openly available to those who may be able to use it. But we do not need to overcome all of those challenges at once.
Just as the physical regeneration of a city can be carried out as an evolution in dialogue and partnership with communities, as happened in Vancouver as part of the "Carbon Talks" programme, so can "information regeneration". Engaging in such a dialogue yields insight into the innovations that are possible now; who will create them; what information and data they need to do so; and what social, environmental and financial value will be created as a result.
That last part is crucial. The financial value that results from such "Smarter City" innovations might not be our primary objective in this context – we are more likely to be concerned with economic, social and environmental outcomes; but it is precisely what is needed to support the financial investment required to overcome the challenges I have discussed in this article.
On a final note, it is obviously the case that I am employed by a company, IBM, which provides products and services that address those challenges. I hope that you have noticed that I have not mentioned a single one of those products or services by name in this article, nor provided any links to them. And whilst IBM are involved in some of the cities that I have mentioned, we are not involved in all of them.
I have written this article as a stakeholder in our cities – I live in one – and as an engineer; not as a salesman. I am absolutely convinced that making city information more widely available and usable is crucial to addressing what Professor Geoffrey West described as "the greatest challenges that the planet has faced since humans became social". As a professional engineer of information systems I believe that we must be fully cognisant of the work involved in doing so properly; and as a practical optimist, I believe that it is possible to do so in affordable, manageable steps that create real value and the opportunity to change our cities for the better. I hope that I have managed to persuade you to agree.