Data challenges of fusing remote sensing with ‘chaos’ of social media

Published on 18/08/2020

Imagine if it were possible to combine data from sources such as satellites and radar with the plethora of information in people’s Twitter feeds or Facebook posts, in order to obtain more accurate weather forecast maps.

That’s the aim of LIST project PUBLIMAPE, one of the Institute’s recent success stories that has received FNR funding.

LIST’s Pierrick Bruneau, working within IT for Innovative Services, explained the thinking behind the PUBLIMAPE idea. “We know about getting satellite visible radar images to analyse information about floods, wild fires, all these kinds of phenomena. My area is quite different because I work mostly on machine learning and data science with user-generated content, so analysing text and multimedia data”.

Indeed, it is still difficult to provide reliable forecasts from satellite imagery alone, particularly in densely populated urban areas, where remote sensing struggles. The technology can suffer from issues such as reflections of radar signals. From the social media point of view, however, the opposite holds: the more urban an area is, the more content and information there is, owing to the larger population.

“At some point the idea came up that, while it’s ok to do this kind of analysis from space, there’s another side with content such as Twitter or Facebook where people post stuff about what happens to them,” Pierrick explained before elaborating. “A part of that might be related to catastrophic events or similar, so the idea came up to say – why not combine these sources – on one side you have remote sensing and measurements from various kinds of data sources, and also see if adding social data can improve the forecasting, improve the characterisation of these kinds of events. The idea of the project was built around that”.

But while remote sensing deals with facts and figures, isn’t social media a ‘chaos’ of information that can be correct or incorrect? “One of the main difficulties is how to isolate what is relevant to the events we mentioned, assuming you have enough content because this information only really works for urban areas, that was one of the rationales of the project,” Pierrick stated. “But mostly – and we discovered this in the course of the project, because we studied a real case that happened in 2017 – there is a big haystack with few needles in it!” he mused.

The real-scale pilot use case he referred to was Hurricane Harvey, which occurred between mid-August and mid-September 2017. The study covered the US region of the Colorado River between Columbus and the Gulf of Mexico, where the disaster caused major flooding, and the project tested its two-pronged approach on this case.

“A large part of our work from the data science side is getting those needles,” said Pierrick. “So we have to say, ok I have this big database of Tweets, but how do I isolate irrelevant content, bearing in mind that when you collect Tweets you know nothing about what is relevant or not?”

This is the dilemma the PUBLIMAPE project is currently addressing, researching ways of tweaking variables and sifting data to locate relevant information before introducing it to the remote sensing side.

But the system doesn’t just process text; it processes images too, as Pierrick explained. “We also use the images that we collect because from Twitter you have links to Instagram and from Instagram you have many images. So we implement models that can detect if an image can be used or not and come up with something multi-modal”.

What about language issues when dealing with worldwide social media platforms? Pierrick clarified. “In the pilot case we focused on English. It happened in Texas so we discovered a bit of Spanish in the mix. You have a language marker in Twitter so we can exclude these elements. The multi-language problem was not initially considered in the project, so that needs to be some kind of extension to address this”.
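As a rough illustration of the language filtering Pierrick describes, the minimal Python sketch below keeps only Tweets whose language marker is English. It assumes each Tweet has already been loaded as a dictionary carrying the `lang` field that Twitter attaches to Tweet objects; the function name `keep_language` and the sample Tweets are hypothetical and are not taken from the PUBLIMAPE software.

```python
from typing import Dict, Iterable, Iterator


def keep_language(tweets: Iterable[Dict], wanted: str = "en") -> Iterator[Dict]:
    """Yield only the Tweets whose `lang` marker equals `wanted`.

    Assumption: each Tweet is a dict with a `lang` key, as in Twitter's
    Tweet objects. Tweets without the key are dropped.
    """
    for tweet in tweets:
        if tweet.get("lang") == wanted:
            yield tweet


# Hypothetical sample data, invented for illustration only.
sample = [
    {"id": 1, "lang": "en", "text": "Water is rising on our street"},
    {"id": 2, "lang": "es", "text": "La calle está inundada"},
    {"id": 3, "lang": "und", "text": "#harvey"},  # "und": language undetermined
]

english_only = list(keep_language(sample))
print([tweet["id"] for tweet in english_only])
```

Note that a filter like this also discards Tweets whose language could not be determined, which may or may not be desirable when relevant content is already scarce.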

What is the LIST contribution?

“In terms of contribution it is almost only people from LIST - in remote sensing there are people like Patrick Matgen, Marco Chini and Renaud Hostache – they are specialists in this kind of data. I had to plug into this and into the big picture in that sense. And then there are people from the data processing group, mainly Thomas Tamisier, who is the PI of the project, the PhD student Etienne Brangbour and me.

“Contributions from the outside regard PhD student supervision, and simulated flood maps that serve as input for forecasting using the remote sensing data. The PhD supervisor is a professor from the University of Geneva, specialising more on the machine learning side, and flood maps for the pilot use case were supplied by remote sensing solutions. Also, in the steering committee there are people from the stakeholders, but they are not involved in the project per se, so it is like 90% a LIST project”.

What has been achieved so far?

“In relation to output we were able to publish some papers. At some point we implemented a large action managing Twitter data, so collecting data and storing it in adapted databases as well as building some language models,” Pierrick outlined before elaborating. “All the features we implement in the software allow us to manage and digest this Twitter data to make more sense than its original form, and also to build maps and other actionable outputs”.

Pierrick then explained that different social media platforms behave differently. Facebook, for example, is difficult for technical reasons, as there is no access to a general, open news flow. “The good thing about Twitter is it is kind of public so you have the possibility to query and you have access to everything that is published whereas in Facebook you cannot do that. It was possible in the past, but not now as their business model is built around keeping that information secret,” he explained.

Twitter’s philosophy is indeed different. It still makes money from advertising, as Facebook does, with adverts appearing in timelines, but that is easy to deal with, Pierrick stated. “If you want to fetch details of Tweets currently being sent you have the possibility to do it for free, but if you want a query in the past you have to pay. This is mostly interesting for big businesses like in the food industry for example, so we had to pay for that. But still you have the possibility to obtain raw data and that’s what we were interested in,” he concluded.

About Pierrick Bruneau

Pierrick Bruneau works in the Data Science and Analytics Unit of the ITIS Department of LIST. He holds an M.Sc. degree in computer science, received from Polytech’Nantes (France) in 2007. He also obtained a Ph.D. from the University of Nantes in 2010, and conducted postdoctoral research at CEA LIST (Saclay, France). At LIST for nearly eight years now, he has contributed to several funded projects on multimedia data annotation and analysis, and machine learning applied to environmental applications such as those advocated in PUBLIMAPE. His research interests range from Bayesian estimation to visual analytics, including neural network engineering (so-called “AI”), as well as scaling up machine learning algorithms using High Performance Computers and GPU hardware.