Session 6 : Big data and Earth observation data in official statistics

EO4Urban : Sentinel-1A SAR and Sentinel-2A MSI Data for Global Urban Services
Yifang Ban (KTH Royal Institute of Technology, Sweden)

With more than half of the world population now living in cities, and 2.5 billion more people expected to move into cities by 2050, urban areas pose significant challenges to the environment. Although they account for only a small percentage of global land cover, urban areas significantly alter climate, biogeochemistry, and hydrology at local, regional, and global scales. Thus, accurate and timely information on urban land cover and its changing patterns is of critical importance to support sustainable urban development.
EO4Urban is a project within the ESA DUE INNOVATOR III program. The overall objective of this research is to evaluate multitemporal Sentinel-1A SAR and Sentinel-2A MSI data for global urban services using innovative methods and algorithms.
Based on the user needs, the following specific objectives are put forward: to map urban extent at a global scale, to produce detailed urban land cover maps, to detect new built-up areas, and to map urban green structure as well as its changes at a regional scale. KTH Urban Extractor, a robust algorithm, was adapted for urban extent extraction, and KTH-SEG, a novel object-based classification method, was used for detailed urban land cover mapping. Ten cities around the world in different geographical and environmental conditions were selected as study areas. In the first phase of the project, multitemporal Sentinel-1A IW SAR and Sentinel-2A MSI data over Beijing, Stockholm and Lagos were acquired during the 2015 vegetation season.
The results showed that urban areas and small towns could be well extracted using multitemporal Sentinel-1A SAR data. With 10 m resolution, Sentinel-2A MSI data mapped low-density built-up areas better than Sentinel-1A SAR data. The fusion of SAR and optical data reduced the commission errors of the urban extractions obtained from each sensor alone. For urban land cover mapping, multitemporal Sentinel-1A SAR data alone yielded an overall classification accuracy of 60% for Stockholm. Sentinel-2A MSI data and the fusion of Sentinel-1A SAR and Sentinel-2A MSI data, however, produced much higher classification accuracies, both reaching 80%. The urbanization patterns and trends in Beijing, Stockholm and Lagos are being analyzed by comparison with the urban extraction results from ENVISAT ASAR or ERS SAR data in 1995 and 2005, and will be presented. Urban green structure and its changes are being derived from the urban land cover maps and will also be presented.
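The two quality measures quoted above, overall classification accuracy and commission error, can both be read off a confusion matrix. The following is a minimal sketch of those standard definitions, not the project's own evaluation code; the function names and the three-class matrix are hypothetical and do not reproduce EO4Urban results.

```python
def overall_accuracy(confusion):
    """Share of correctly classified samples: diagonal sum / grand total.

    Assumed layout: confusion[i][j] counts samples whose reference class
    is i and whose mapped (predicted) class is j.
    """
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


def commission_error(confusion, k):
    """Share of samples mapped as class k that belong to another class."""
    mapped_as_k = sum(row[k] for row in confusion)
    return 1.0 - confusion[k][k] / mapped_as_k


# Hypothetical matrix for illustration only (rows: reference, columns: map).
example = [
    [40, 5, 5],
    [4, 30, 6],
    [6, 5, 19],
]

print(round(overall_accuracy(example), 3))
print(round(commission_error(example, 0), 3))
```

A lower commission error for the built-up class after fusing two sensors would mean that fewer non-urban samples are wrongly mapped as urban.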

Cop4Stat_2015plus – Analysis of COPERNICUS remote sensing data for areal statistical purposes
Stephan Arnold, Thomas Wiatr (Federal Statistical Office (DESTATIS), Germany)

“COP4STAT_2015plus” is a cooperation project between the Federal Statistical Office (Statistisches Bundesamt, DESTATIS) and the Federal Agency for Cartography and Geodesy (Bundesamt für Kartographie und Geodäsie, BKG) in Germany. The project aims to evaluate the potential of Copernicus remote sensing products for statistical purposes at the national level with regard to information on land cover (LC) and land use (LU). The existing national and European land cover and land use classification systems define their classes differently from the Statistical Office. A key criterion is the clear and consistent separation between LC and LU definitions. This requirement is embodied in the LUCAS classification system as used by EUROSTAT. The aim of the Cop4Stat project is to identify reproducible and consistent methods for deriving LC and LU information from remote sensing data that can be analysed for the purpose of areal statistical calculations. Multispectral satellite imagery from the optical sensor on Sentinel-2A will be used as input data. Other existing datasets (the national land cover model LBM-DE and the Copernicus High Resolution Layer product) are integrated into the work process as additional data sources. The multispectral datasets will first be analysed using established remote sensing methods of segmentation (e.g. watershed segmentation), classification (e.g. supervised classification) and image processing (e.g. spectral indices). This approach will be applied to selected test areas so that the statistical results calculated from remote sensing data can be compared with the official land use statistics (based on cadastral data).
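As an illustration of what a spectral index is, the sketch below computes the normalized difference vegetation index (NDVI) from near-infrared and red reflectance, which on Sentinel-2 correspond to bands 8 and 4. The abstract does not say which indices the project uses, so the choice of NDVI, the function names and the sample values are assumptions made for this example.

```python
def ndvi(nir, red):
    """NDVI for one pixel: (NIR - Red) / (NIR + Red), in the range [-1, 1]."""
    total = nir + red
    if total == 0:
        # Assumption for this sketch: pixels with no signal get an index of 0.
        return 0.0
    return (nir - red) / total


def ndvi_grid(nir_band, red_band):
    """Apply ndvi() pixel by pixel to two equally sized 2-D lists."""
    return [
        [ndvi(n, r) for n, r in zip(nir_row, red_row)]
        for nir_row, red_row in zip(nir_band, red_band)
    ]


# Hypothetical reflectance values, not taken from any Sentinel-2 scene.
nir_band = [[0.5, 0.3], [0.2, 0.0]]
red_band = [[0.1, 0.3], [0.4, 0.0]]

for row in ndvi_grid(nir_band, red_band):
    print([round(value, 2) for value in row])
```

High values typically indicate dense green vegetation, which is why such indices are a common input to land cover classification.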

Collaboration to produce official statistics from satellite data products from the Australian Geoscience Data Cube
Martin Brady, Matt Jakab, Richard Dunsome (Australian Bureau of Statistics, Australia)

The detailed classification of land cover from medium resolution multispectral imagery can be challenging when performed at the continental scale, as is required in Australia. A practical alternative is the calculation of Fractional Cover, which indicates the fraction of cover in three broad classes: green vegetation, dry vegetation and bare soil. In Australia, the value of producing official statistics derived from Earth observation data products such as Fractional Cover is being realised. The Australian Bureau of Statistics (ABS) is currently collaborating with Geoscience Australia (GA) to access the Landsat imagery archive for Australia. This archive is stored in the Australian Geoscience Data Cube (AGDC) on the National Computational Infrastructure (NCI) and covers the period from 1987 to the present with a 16-day repeat collection interval. GA is supporting ABS in calculating statistical summaries of Fractional Cover within Statistical Areas as defined in the Australian Statistical Geography Standard (ASGS). Changes in Fractional Cover through time in these areas will be analysed to investigate relationships between land cover, land use and land value across Australia. The results of this work will be featured in upcoming ABS Experimental Land Account publications. The work is the first step in a broader collaboration between ABS and GA to leverage time-series Earth observations from a selection of sensors for the production of official statistics for Australia.
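A statistical summary of Fractional Cover within a Statistical Area can be as simple as the mean of each fraction over the pixels that fall inside the area. The sketch below shows that idea on a flat list of pixels; it does not use the AGDC interface, and the function name, the input layout and the area codes are hypothetical.

```python
from collections import defaultdict


def zonal_mean_cover(pixels):
    """Mean fractional cover per area.

    pixels: iterable of (area_code, green, dry, bare) tuples, where the
    three fractions lie between 0 and 1 (assumed layout for this sketch).
    Returns {area_code: (mean_green, mean_dry, mean_bare)}.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0.0])
    counts = defaultdict(int)
    for area_code, green, dry, bare in pixels:
        totals = sums[area_code]
        totals[0] += green
        totals[1] += dry
        totals[2] += bare
        counts[area_code] += 1
    return {
        area_code: tuple(value / counts[area_code] for value in totals)
        for area_code, totals in sums.items()
    }


# Hypothetical pixels and area codes, for illustration only.
pixels = [
    ("A", 0.5, 0.25, 0.25),
    ("A", 0.25, 0.25, 0.5),
    ("B", 1.0, 0.0, 0.0),
]

print(zonal_mean_cover(pixels))
```

Repeating the summary for each acquisition date would give a time series per area, which is the kind of input needed to study change through time.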

Integrating Geography and Statistics, but what about Earth Observation ?
Ola Nordbeck (Norwegian Space Centre, Norway)

The ongoing digital revolution is providing us with more data and with unprecedented possibilities to combine data. This revolution matters because it makes it possible to follow global changes in time and space (geographical location), which is crucial for the success of the 2030 Agenda for Sustainable Development.
The EFGS (European Forum for Geography and Statistics) aims to bring the statistical community into closer cooperation with the Geographical Information (GI) community. This work has progressed through various initiatives that have allowed the two communities to identify obstacles to better integration.
A third actor, the Earth Observation community, is now investing heavily in new satellite programmes. These investments will result in the provision of frequently updated satellite data, allowing the GI and statistical communities to gain a better understanding of changes over time and potentially resulting in more up-to-date registers.
Collaboration between these three communities is important and has considerable potential. Differences in perception and policies are, however, challenging. It can therefore be tempting to choose alternative avenues for data collaboration, which can have undesirable consequences.
This presentation provides examples of the potential of collaboration between the three communities mentioned above, and recommendations for the way forward.

Mining Mobile Phone Data to Recognize Urban Areas
Stéphanie Combes, Marie-Pierre de Bellefon (Statistics, France), Maarten Vanhoof (Orange Labs – University of Newcastle, United Kingdom) and Thomas Plötz (University of Newcastle, United Kingdom)

Understanding territorial organization, for example in terms of employment, home location and mobility, is crucial for the implementation of policy measures. In France, the National Statistics Office (INSEE) produces a zoning (ZAUER: Urban Area and Rural Employment Area’s Zoning) to identify the geographical extent of cities’ influence over their environment at the national level. Producing this typology is a complex task. It involves multiple actors and methods, and many arbitrary thresholds have to be chosen. As a consequence, the zoning is characterized by long delays between consecutive updates. Recently, mobile phone data has shown promising results for land use classification, as it provides disaggregated, geo-localized and timely information on the activity patterns of large shares of the population. In this paper, we exploit a dataset of hourly mobile phone activity profiles collected at each antenna by the French operator Orange to investigate the capability of mobile phone data to reproduce the French Urban Area Zoning. Since the ZAUER classification uses commuting information to delineate urban areas, we hypothesize that our dataset is particularly suited to reproducing this zoning by means of supervised classification techniques. In particular, we compare the spatially smoothed predictions of penalized logistic regression, boosted trees and random forest algorithms using the Fuzzy-Kappa remote sensing metric to account for the fuzziness of our context (in terms of location and categories). Our best results show excellent prediction of urban clusters, but rural areas are more difficult to disentangle. Besides showing the relevance of mobile phone data for land use classification tasks at a nationwide scale, our paper explicitly elaborates on the experience of using a supervised classification procedure to produce and control the quality of an official statistics indicator.

Global Population Distribution : a continuum of modeling methods
Greg Yetman, Kytt MacManus (CIESIN, Columbia University, USA)

Global population modeling efforts have evolved from simple approaches using census data to geostatistical models that integrate multiple correlates, through to the use of artificial intelligence to disaggregate census population to settlement locations. Additional approaches eschew census data and rely on mobile device and social media streams to model population at extremely fine resolutions in time and space. Inter-comparison of these models and their outputs is lacking. A review of the methodologies used is presented, with a discussion of the implications for use (and misuse) of model outputs in analysis and applications.

Merging big data and official statistics for modelling statistical commuting
Pasi Piela (Statistics Finland, Finland)

Commuting distance and commuting time can be calculated as a point-to-point estimate from the home of almost every employed person to the corresponding workplace (the national coverage being about 93 percent). In order to make the modelling more realistic, different data sources have been merged with current administrative census data at the micro level.
The study to be presented promotes sustainable development by taking into account bicycling (or walking) and public transport opportunities, and specifically the modern urban-rural classification by the Finnish Environment Institute at the micro level.
The speed estimation for private car usage follows a rather complex estimation structure. Speeds for each road element are estimated using several variables from a national route network, Digiroad. This estimation is enriched with traffic sensor data from the Finnish Transport Agency (FTA).
Cycle commuting is modelled in a simpler manner. The commuting model is based on the shortest non-hierarchical path, including separate cycle paths and excluding motorways (in accordance with the traffic regulations). The average speed is assumed to be 17 kilometers per hour.
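The cycling model described above can be sketched as a shortest-path search on a road graph with motorway segments removed, followed by a conversion from distance to time at 17 km/h. This is a minimal illustration using Dijkstra's algorithm, not the implementation used in the study; the edge layout, the road type labels and the toy network are hypothetical.

```python
import heapq

CYCLING_SPEED_KMH = 17.0  # average cycling speed stated in the abstract


def shortest_cycling_km(edges, origin, destination):
    """Length in km of the shortest path that avoids motorways.

    edges: list of (node_a, node_b, length_km, road_type) tuples, treated
    as two-way links (assumed layout). Returns None if no route exists.
    """
    graph = {}
    for node_a, node_b, length_km, road_type in edges:
        if road_type == "motorway":
            continue  # cyclists may not use motorways
        graph.setdefault(node_a, []).append((node_b, length_km))
        graph.setdefault(node_b, []).append((node_a, length_km))

    best = {origin: 0.0}
    queue = [(0.0, origin)]
    while queue:
        distance, node = heapq.heappop(queue)
        if node == destination:
            return distance
        if distance > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbour, length_km in graph.get(node, []):
            candidate = distance + length_km
            if candidate < best.get(neighbour, float("inf")):
                best[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return None


def cycling_minutes(distance_km):
    """Travel time in minutes at the assumed constant cycling speed."""
    return distance_km / CYCLING_SPEED_KMH * 60.0


# Hypothetical toy network: the direct link is a motorway and is skipped.
edges = [
    ("home", "work", 1.0, "motorway"),
    ("home", "junction", 2.0, "street"),
    ("junction", "work", 1.5, "cycle_path"),
]

distance = shortest_cycling_km(edges, "home", "work")
print(distance, round(cycling_minutes(distance), 1))
```

Filtering the edges before the search keeps the routing step itself unchanged, so the same function could serve other travel modes with a different exclusion rule.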
Public transport accessibility is estimated using the open application programming interface (API) of the Journey Planner of the Helsinki Regional Transport Authority. Correspondingly, the FTA’s country-wide public transport service has been studied, and the results will be presented.
The results show clear differences by region and area type. The presentation also discusses data safety issues in modern computing environments.