In practice, AI is most often synonymous with data analysis (or analytics). That means you need data to analyze. Most of the time, analysis work consists of finding and managing data. If more organizations publish open data, those tasks become easier to accomplish. The alternative is that the pace of the AI revolution slows down.

If you’ve followed the news lately you know that AI is a hot topic. AI (artificial intelligence) represents the dream, or for some the fear, of computers taking over many tasks from humans. AI also represents the vision of creating new solutions that humans aren’t capable of producing themselves.

How far has the development of AI come in Sweden, and in other comparable countries? AI isn’t broadly established, but there are active projects, some producing good results. Most often these concern the AI sub-discipline of machine learning.

What’s machine learning?

Maybe “advanced data analysis” would be a better description than machine learning. The analytics solutions are rarely intuitive. They often involve large data volumes and require substantial computing capacity.

One thing about machine learning is easy to grasp: without access to the right data there will be no analysis. Experts, and research, usually conclude that between 80 and 90 percent of the time in machine learning projects is spent finding, organizing, quality-checking, and otherwise managing data.
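The data-wrangling share of the work can be illustrated with a minimal sketch. The records and field names below are hypothetical; the point is only that deduplication, quality checks, and type conversion all happen before any analysis can start.

```python
# Hypothetical raw records, as they might arrive from several sources.
raw_records = [
    {"id": "1", "income": "32000", "age": "34"},
    {"id": "2", "income": "", "age": "41"},        # missing value
    {"id": "1", "income": "32000", "age": "34"},   # duplicate
    {"id": "3", "income": "51000", "age": "abc"},  # malformed field
]

def clean(records):
    """Deduplicate, drop incomplete rows, and convert types."""
    seen = set()
    cleaned = []
    for rec in records:
        if rec["id"] in seen:
            continue  # skip duplicate rows
        seen.add(rec["id"])
        try:
            cleaned.append({
                "id": rec["id"],
                "income": int(rec["income"]),
                "age": int(rec["age"]),
            })
        except ValueError:
            continue  # skip rows with missing or malformed fields
    return cleaned

print(clean(raw_records))  # only record "1" survives the checks
```

Real projects add many more steps (joins across sources, outlier handling, documentation), which is why this phase dominates the timeline.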

The remaining time is dedicated to deciding what analyses to perform, formulating and implementing algorithms, running the analyses, and interpreting the results.

Machine learning is already in use, directly and indirectly, in many settings. One example is that analyses performed with machine learning form the basis for how autonomous cars decide what is a bicycle, and what isn’t. Another is approving or rejecting applications of various kinds, for example loan applications.

In short, any decision or prioritization based on data is a candidate. And many things in the world can be described as data: pictures, sound, numbers, computations, written text, and so on.

There are many reasons to use machine learning for analyses like these. One is the ability to handle large data volumes and advanced analyses. Others are high performance and a high degree of automation. The need for human involvement is reduced, which means both lower cost and fewer errors. It also eases the problem of recruiting competent staff.

Let’s start from the beginning with machine learning: where to find suitable data. This is where open data comes into the picture. Access to open data can solve many of the problems of finding data for machine learning, for example the following:

  • Finding suitable data at all, to start with.
  • Making sure that adequate quality, quantity, and variation of data are provided.
  • Structuring data in a good way.

Access to open, linked data via public APIs solves the first problem, provided that public data are available. Available public data also goes a long way toward solving the third problem, and helps with the second. All of these promises depend, of course, on competent people structuring and publishing the public data. It is reasonable to assume such people are available in organizations that work with public data.
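As a sketch of what consuming such an API might look like: the JSON payload below is hypothetical (the dataset name and field names are invented), but it mirrors the shape many open-data endpoints return. A real call would fetch the payload over HTTP instead of using an inline string.

```python
import json

# Hypothetical response body from an open-data API endpoint.
payload = """
{
  "dataset": "public-facilities",
  "records": [
    {"name": "Library A", "municipality": "Stockholm"},
    {"name": "Pool B", "municipality": "Uppsala"}
  ]
}
"""

data = json.loads(payload)

# When the publisher has structured the data well,
# downstream analysis code stays this simple:
names = [rec["name"] for rec in data["records"]]
print(names)  # ['Library A', 'Pool B']
```

This is the third problem from the list above in miniature: well-structured published data moves the structuring work from every consumer to the one publisher.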

There are a lot of horror stories floating around that highlight problems with data quality, and with variation in data, in machine learning. One of the most talked-about is the recruiting solution that consistently disadvantaged women and people who weren’t white.

Why? Because most of the people holding those positions were white men, and that was reflected in the data used for the analyses.

Problems like these can be avoided by designing algorithms, interpreting results, and building solutions in certain ways. But that requires people to do the work. The better the quality of the data, the greater the opportunity for automation, and the lower the risk of skewed analysis results.
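One concrete precaution along those lines is to inspect the training data for imbalance before trusting any analysis built on it. A minimal sketch, with hypothetical labels and an arbitrarily chosen threshold:

```python
from collections import Counter

# Hypothetical group labels from a historical hiring dataset.
labels = ["male", "male", "male", "female", "male", "male"]

counts = Counter(labels)
total = sum(counts.values())
shares = {label: count / total for label, count in counts.items()}
print(shares)

# Flag groups that fall below a chosen threshold (here 30 percent).
underrepresented = [label for label, share in shares.items() if share < 0.3]
print(underrepresented)  # ['female']
```

A check like this doesn’t fix the bias by itself, but it tells the people doing the work where the data needs more variation before the analysis runs.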

To summarize, publishing open data is the best strategy today to make data for machine learning widely available. That is especially true for the public sector, where there should be incentives for agencies, counties, regions, organizations, and publicly owned companies to cooperate by publishing open data.

Everybody wins by sticking to that strategy.