ViWi: A Deep Learning Dataset Framework for Vision-Aided Wireless Communications

Muhammad Alrabeiah*, Andrew Hredzak*, Zhenhao Liu, and Ahmed Alkhateeb
Arizona State University, Emails: {malrabei, ahredzak, zliu294, alkhateeb}
*Authors contributed equally

The growing role that artificial intelligence, and specifically machine learning, is playing in shaping the future of wireless communications has opened up many new and intriguing research directions. This paper motivates research in the novel direction of vision-aided wireless communications, which aims at leveraging visual sensory information to tackle wireless communication problems. As with any new research direction driven by machine learning, obtaining a development dataset poses the first and most important challenge to vision-aided wireless communications. This paper addresses this issue by introducing the Vision-Wireless (ViWi) dataset framework. It is developed to be a parametric, systematic, and scalable data generation framework. It utilizes advanced 3D-modeling and ray-tracing software to generate high-fidelity synthetic wireless and vision data samples for the same scenes. The result is a framework that not only offers a way to generate training and testing datasets but also helps provide a common ground on which the quality of different machine learning-powered solutions can be assessed.

I Introduction

Can we use vision to help wireless communication? Several reasons motivate this question. First, future wireless communication devices and base stations will likely employ large numbers of antennas (in the sub-6 GHz or mmWave bands) to satisfy high data rate requirements [1, 2, 3]. These large-scale multiple-input multiple-output (MIMO) transceivers, however, face critical challenges such as the large channel/beam training overhead and the sensitivity of mmWave links to blockages [4, 5, 3]. Interestingly, most devices that employ large antenna arrays will likely also carry other sensors, such as RGB cameras, depth cameras, or LiDAR sensors. This is the case, for example, in vehicles, 5G phones, AR/VR devices, intersection nodes for self-driving cars, and probably base stations in the near future. It is therefore natural to ask whether this visual data (generated, for example, by cameras) can help overcome non-trivial wireless communication challenges, such as mmWave beam and blockage prediction, massive MIMO channel subspace prediction, hand-over prediction, and proactive network management, among others. This is further motivated by recent advances in deep learning and computer vision that can extract high-level semantics from complex visual scenes, and by the increasing interest in leveraging machine/deep learning tools in wireless communication problems [6, 7, 8, 9, 10, 11].

The need for a dataset: To enable leveraging deep learning and computer vision for the proposed vision-aided wireless communication research, it is crucial to have sufficiently large and suitable datasets. These datasets will allow researchers to (i) develop deep-learning/computer vision algorithms and evaluate their performance, (ii) reproduce the results of other papers, (iii) set benchmarks for the various vision-aided wireless communication problems, and (iv) compare the different proposed solutions based on common data. Next, we describe the main requirements that such a dataset must meet to be useful for vision-aided wireless communication research.

  • Co-existing visual and wireless data: Since the objective is to use visual data, captured for example by cameras or LiDAR sensors, to help the wireless communication systems that operate in the same device or environment, the visual data (such as the RGB and depth images as well as point cloud data) and the wireless data (such as the communication and radar channels) need to be collected from the same environment.

  • Accuracy: The methodology of collecting the visual and wireless data should ensure the accuracy of this data.

  • Scalability: The dataset collection process should be scalable to many scenarios and sizes to be able to efficiently address several use cases.

  • Parameterized dataset: In wireless communication problems, it is normally important to evaluate the performance versus different system and channel parameters, such as the number of antennas and array geometry. Similarly, we expect that it will be desirable to study the vision-aided wireless communication algorithms for different visual-data parameters, such as the camera resolution, color space, depth, and point cloud perturbation. Therefore, the dataset that enables these research directions needs to be parameterized.

There are several datasets that have been developed over the last decade for visual data alone [12, 13], or more recently for wireless data alone [14]. To the best of our knowledge, however, there are no publicly available datasets that provide co-existing visual and wireless data.

Fig. 1: A block diagram of the overall structure of the ViWi dataset generation framework. It shows the three main stages of dataset generation, the elements of each stage, and some example outputs from each element.

The ViWi dataset: This paper presents the Vision-Wireless (ViWi) framework, which is designed to satisfy the requirements mentioned above. (The latest versions of the ViWi datasets and code can be found on the dataset website [15].) ViWi is a data-generating framework that not only provides wireless data but also combines it with visual data taken from the same scenes. This is achieved by utilizing advanced 3D modeling and ray-tracing simulators that generate high-fidelity synthetic vision and wireless data. The main goal of creating the ViWi dataset framework is to encourage and facilitate research in vision-aided wireless communications, which utilizes the advances in computer vision, machine learning, and point cloud analysis to tackle the critical challenges in wireless communications. In the first release, we make four ViWi-generated datasets publicly available [15]. Each ViWi dataset consists of 4-tuples of image, depth map, wireless channel, and user location.

Before diving into the details, here is how this paper is structured. The next section, Section II, provides an overview of ViWi and highlights its major components. Section III takes a deeper look into those components using two example scenarios, which result in the first two vision-wireless datasets. Section IV explains how to use the released data and scripts. Section V presents some possible applications for the framework. Finally, Section VII concludes the paper.

II Framework Overview

The availability of a development dataset constitutes a major challenge to vision-aided wireless communications. As such, this work presents a novel framework for visual-wireless synthetic data generation. The choice of using synthetic data is mainly motivated by two factors: (i) its relatively low cost and (ii) its scalability. Acquiring real-world visual and wireless data, like images and channels, requires two completely different equipment setups, and the data acquisition process itself is time consuming; the process entails building a physical scenario, placing the equipment, synchronizing the acquisition process, and collecting data over a lengthy period of time. All that translates into increased cost and difficult scalability when compared to generating synthetic data [16]. These challenges have been acknowledged, albeit independently, by the computer vision and wireless communication communities. An increasing amount of work in both communities has been relying on synthetic data generated by 3D game engines and electromagnetic ray-tracing software; see, for example, [16, 17, 18, 6, 14, 7]. Hence, advanced game engines and ray-tracing software are the backbone of the proposed Vision-Wireless (ViWi) dataset framework.

Object            | Dimensions (width, length, height in meters) | Material               | Note
Model Building 1* |                                              | Brick                  | Replaced with same-dimensions cube
Model Building 2* |                                              | Concrete               | Replaced with same-dimensions cube
Street            |                                              | Asphalt                |
Sidewalk          |                                              | Concrete               |
Fence*            |                                              |                        | Removed
Bush              |                                              | Dense deciduous forest |
Traffic Light*    |                                              |                        | Removed
Fire Hydrant*     |                                              |                        | Removed
Garbage Dumpster* |                                              |                        | Removed
Car*              |                                              |                        | Replaced with user grid
Bus*              |                                              | Metal                  | Replaced with same-dimensions cube
TABLE I: A list of objects composing the visual and wireless instances of the scenario

The dataset generation in the proposed framework goes through three main stages as shown in Fig. 1. These stages are:

  • Scenario definition: Addressing a vision-wireless problem starts by describing the physical study environment where the problem is defined, which is referred to as scenario definition. This description must identify two types of elements, visual and electromagnetic. The visual elements, e.g., buildings, curbs, streets, cars, trees, and people, are built and assembled using game engine software. Together, they form the visual instance of the scenario. The same scenario definition with its visual elements is then constructed in ray-tracing software. This software defines the electromagnetic characteristics of the scenario, like the dielectric properties of different objects, creating the wireless instance of the scenario. See the left column in Fig. 1.

  • Raw-data generation: The two scenario instances are processed by the game engine and the ray-tracing software to produce two sets of raw data. The first is a set of visual data, which comprises RGB images of the environment, accurate depth maps, and LiDAR point clouds, while the other set has wireless data such as the angles of arrival/departure and path gains of all the rays between the transmitters and receivers. These two sets together define the scenario raw data. See the middle column in Fig. 1.

  • Parameterized processing: This stage offers the choice of customizing the raw data using two sets of user-defined parameters. Both sets define how the visual and wireless raw data is processed to extract the final and, often, more realistic data samples. For visual raw data, this may include transforming images to different color spaces, lowering the resolution of images and depth maps, adding some artifacts to them, or distorting point cloud data. On the other hand, for the wireless data, this may include constructing wireless communication/radar channels and obtaining user locations. See the right column in Fig. 1.

What is interesting and unique about this three-stage framework is that every dataset is completely defined by its scenario name and its parameter sets; providing the name of the scenario and the parameter sets is enough to completely describe a certain dataset or to generate (reproduce) it. This allows for fast and easy re-generation and makes the framework favorable for benchmarking.
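To make this point concrete, the following is a minimal sketch of how a dataset description could be captured as a small, serializable record. It is not part of the ViWi package: the class names, field names, default values, and scenario string are all hypothetical and are chosen only to mirror the kinds of parameters named in this paper.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass(frozen=True)
class WirelessParams:
    # Hypothetical field names, loosely following the parameters named in the text.
    active_bs: tuple = (1,)
    antennas_xyz: tuple = (1, 8, 1)
    antenna_spacing_wavelengths: float = 0.5


@dataclass(frozen=True)
class VisualParams:
    # Hypothetical visual processing parameters.
    color_space: str = "RGB"
    resolution: tuple = (1280, 720)
    blur_sigma: float = 0.0
    noise_std: float = 0.0


@dataclass(frozen=True)
class DatasetSpec:
    """A scenario name plus the two parameter sets that define a dataset."""
    scenario: str
    wireless: WirelessParams = field(default_factory=WirelessParams)
    visual: VisualParams = field(default_factory=VisualParams)

    def to_json(self) -> str:
        # Sorted keys give a stable string that can be shared or compared.
        return json.dumps(asdict(self), sort_keys=True)


spec = DatasetSpec(scenario="distributed-camera")
```

Two researchers who exchange the string returned by `spec.to_json()` would, under this sketch, be referring to the same dataset, which is the property that makes benchmarking on common data possible.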

III ViWi: A Detailed Description

With the aforementioned ViWi structure in mind, this section discusses the inner workings of each stage using two example datasets, namely the distributed-camera and co-located-camera datasets. (The first release of ViWi can actually be used to generate four different datasets from four different sets of scenario raw data, but for the sake of clarity, only two of them are used as examples in this discussion.) Both are generated using the same example scenario, as explained below.

III-A Scenario Definition

The two example datasets generated with ViWi are for an outdoor scenario, which shows a car driving through a city street. Fig. 2-a depicts an aerial view of the visual instance of this scenario. It is built and generated using the popular game engine Blender™. This instance is composed of many elements that are found in real-world metropolitan areas, like buildings, bushes, sidewalks, and cars. Table I lists the building blocks of the scenario and their dimensions. To animate the scenario, five trajectories are defined to represent possible car paths, each of which has one thousand equally spaced points that are 0.089 meters apart. The trajectories are also separated from each other by an equal distance of 0.5 meters. This visual instance is used for both datasets but with some minor visual changes; more on that in Section III-B.
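The trajectory grid described above can be sketched in a few lines. The point count, point spacing, and trajectory separation are taken from the text; the assumption that the paths are straight, parallel to the x-axis, and start at x = 0 is ours, since the scenario's actual coordinates are not given here.

```python
def car_trajectories(num_paths=5, points_per_path=1000,
                     point_spacing_m=0.089, path_separation_m=0.5):
    """Return a list of trajectories; each is a list of (x, y) positions in meters."""
    # Assumption: straight paths parallel to the x-axis, starting at x = 0.
    return [
        [(i * point_spacing_m, p * path_separation_m) for i in range(points_per_path)]
        for p in range(num_paths)
    ]


paths = car_trajectories()
# Each path spans 999 intervals of 0.089 m, i.e. about 88.9 m.
length_m = paths[0][-1][0] - paths[0][0][0]
total_points = sum(len(path) for path in paths)
```

Under these assumptions, each trajectory covers roughly 88.9 meters of street, and the five trajectories together yield 5000 candidate user positions.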

To generate the wireless instance of the scenario, the visual instance is imported into the ray-tracing software of choice, which is, in this work, Wireless InSite®. Some of the objects in the visual instance have very fine visual details that may substantially slow down the ray-tracing simulation. In cases where hardware or software capabilities are limited, those objects can be either removed from the wireless instance of the scenario or replaced with objects of simpler geometry with no major impact on the simulation results. Fig. 2-b shows an example of how the visual instance of the example scenario is simplified for ray-tracing simulation. Once the complexity issue is settled, the wireless instance is completed by setting the dielectric properties of all its objects. Table I shows the material used for every object in the wireless instance and identifies which objects are removed or replaced.

III-B Raw-Data Generation

This stage prepares the two scenario instances for processing and generates the visual and wireless raw data. The visual and wireless instances are both fitted with raw-data generators, like cameras, transmitters, and receivers, and their properties are set in preparation for data generation. Both instances are run separately to get the output data, which is in its initial form. The visual raw data of the two generated datasets consists of RGB images and depth maps, whereas the wireless raw data is composed of the angles of departure, path gains, and channel impulse responses for every simulated ray from the transmitter to the receiver.

The RGB images and depth maps in the two example datasets are produced from the same visual instance but using different visual data generators. For the first dataset, the visual instance is fitted with a total of three cameras (data generators) that are 5 meters high and 30 meters apart, Fig. 3-a. Each camera has a field of view of 100 degrees. These properties are chosen so that the cameras cover the whole street with minimum field-of-view overlap. With these settings and using the defined car trajectories, the scenario is animated in Blender to generate the visual raw data of the first example dataset, which will be henceforth referred to as the distributed-camera dataset. For the second example dataset, three differently oriented cameras with 75-, 110-, and 75-degree fields of view are placed halfway along the street and 5 meters above the ground, Fig. 3-b. They are oriented in different directions such that they cover the whole street with the least possible overlap. Similar to the first example, the scenario is animated using the new generators and the same car trajectories to produce the visual raw data of the second example dataset, which will be henceforth referred to as the co-located-camera dataset.

To generate the wireless raw data of both example datasets, the wireless instance of the scenario, Fig. 2-b, is fitted with distributed data generators (transmitters and receivers) with similar properties. For both datasets, all transmitter and receiver antennas are half-wave dipoles operating at a frequency of 60 GHz with a sinusoidal waveform. The first example has transmitter antennas, referred to as base stations (BSs), replacing the three distributed cameras, and a user grid of receiver antennas placed along each of the five pre-defined car trajectories. The second example, on the other hand, has the three cameras replaced with one BS and uses the same five trajectories to define the receiver grid. Wireless InSite is used with both wireless instances to identify all possible rays going from every BS to every user in both examples and to produce two sets of wireless raw data, one for each dataset.

Fig. 2: Two images of the visual and wireless instances of the scenario. (a) is an aerial view of the visual instance while (b) is an aerial view of the wireless instance. They clearly show the geometric changes between the two instances, e.g., the traffic lights are removed and the buildings have simpler geometry.

III-C Parameterized Processing

The raw data, whether visual or wireless, could be directly used as samples of a development dataset. However, studying real-world engineering problems and applications requires some form of control over the data acquisition process and the environmental settings. For instance, the quality of the camera feed or channel information could be subjects of interest in certain vision-aided communication problems. In such cases, the output raw data is in a primitive form that cannot be used to address any of those issues. Hence, raw data has to undergo another optional layer of processing to produce the final dataset, the last stage in the proposed framework.

The last stage is a parameterized layer in which the user defines how the raw data is processed. ViWi provides processing for the wireless and visual raw data independently in the form of a package of scripts. For the visual raw data, the scripts offer control over a set of parameters that choose the scenario of interest and apply filtering and transformation operations to the images or depth maps. Examples of such operations are image blurring filters, noise-corruption processes, resolution control, and color-space transformation. For the wireless raw data, other scripts define another set of control parameters. It includes the scenario of interest, the number of active BSs, the number of antennas along the x-, y-, and z-axes, and the antenna spacing, to name a few. The full list of wireless parameters and their definitions is the same as that of the DeepMIMO dataset [14]. By setting those parameters, a user can produce a task-specific set of wireless data samples such as complex-valued channels and user locations.
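To illustrate how ray-level raw data and array parameters can combine into a complex-valued channel, the sketch below sums per-ray contributions over a uniform linear array using a generic narrowband geometric channel model. This is not the ViWi or DeepMIMO script: the function name, the `(power_db, phase_deg, aod_azimuth_deg)` tuple layout, the single-axis array, and the angle convention (measured from array broadside) are all assumptions made for illustration.

```python
import cmath
import math


def ula_channel(paths, num_antennas=8, spacing_wavelengths=0.5):
    """Sum per-ray contributions into a narrowband channel vector.

    paths: iterable of (power_db, phase_deg, aod_azimuth_deg) tuples (assumed layout).
    """
    h = [0j] * num_antennas
    for power_db, phase_deg, aod_deg in paths:
        # Complex path gain from the ray power (dB) and phase (degrees).
        gain = math.sqrt(10 ** (power_db / 10)) * cmath.exp(1j * math.radians(phase_deg))
        # Assumed convention: departure angle measured from array broadside.
        psi = 2 * math.pi * spacing_wavelengths * math.sin(math.radians(aod_deg))
        for n in range(num_antennas):
            # Per-antenna phase shift of the array response vector.
            h[n] += gain * cmath.exp(1j * n * psi)
    return h


# Two hypothetical rays: a strong one at broadside and a weaker one at 30 degrees.
h = ula_channel([(-70.0, 0.0, 0.0), (-85.0, 45.0, 30.0)], num_antennas=4)
```

Changing `num_antennas` or `spacing_wavelengths` yields a different channel from the same rays, which is the sense in which one set of raw data can serve many parameterized datasets.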

Fig. 3: Top view of the two example visual instances. The top image, (a), shows the locations of the distributed cameras, whereas the bottom image, (b), shows the location of the centered cameras.

IV How to Use ViWi?

The first release of ViWi provides four sets of raw data and a dataset-generating package [15]. Each set has the visual and wireless raw data required to generate the final dataset. This release provides only the scripts that parameterize the wireless raw data. Visual raw data do not undergo any processing and are, therefore, directly included in the final dataset. However, they are provided in popular data formats, i.e., JPEG for RGB images and MAT (the native MATLAB® data format, which can be easily read using other scripting languages like Python or R) for depth maps, so they can be easily processed by the user. MAT was chosen over the original OpenEXR format used for depth maps because of its popularity.
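Since the visual processing is left to the user in this release, the following sketch shows what three of the operations named in Section III-C (color-space transformation, resolution control, and noise corruption) could look like. It is a dependency-free illustration on nested lists rather than ViWi code; the function names are hypothetical, decimation stands in for proper resampling, and a practical pipeline would more likely use an image-processing library.

```python
import random


def to_grayscale(rgb):
    """Convert rows of (r, g, b) pixels to luma using ITU-R BT.601 weights."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for r, g, b in row] for row in rgb]


def downsample(img, factor):
    """Lower the resolution by keeping every `factor`-th row and column."""
    return [row[::factor] for row in img[::factor]]


def add_noise(img, std, seed=0):
    """Add zero-mean Gaussian noise and clip to the 8-bit range [0, 255]."""
    rng = random.Random(seed)  # seeded so that a dataset can be regenerated exactly
    return [[min(255.0, max(0.0, v + rng.gauss(0.0, std))) for v in row] for row in img]


# A tiny 4x4 stand-in for an RGB frame.
frame = [[(100, 150, 200)] * 4 for _ in range(4)]
processed = add_noise(downsample(to_grayscale(frame), 2), std=5.0)
```

Fixing the random seed keeps the noisy output reproducible, in line with the idea that a dataset should be fully determined by its scenario and parameters.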

ViWi provides visual raw data in sub-directories enclosed in a main compressed directory ready for download. The images and depth maps are stored in two separate sub-directories. All raw images and depth maps have 720p HD resolution, i.e., 1280 × 720 pixels, and are, as stated above, stored in JPEG and MAT format. Every image has a corresponding depth map, and the two share the same name in their respective sub-directories. More information on the naming system can be found in the "README.txt" file in the main compressed directory. Generating the visual data samples of a dataset only requires unpacking (unzipping) the compressed directory.
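Because every image and its depth map share a file name, the two sub-directories can be paired with a short helper. The sketch below is only an illustration: the function name, the directory arguments, and the `.jpg`/`.mat` extensions are assumptions, and the actual naming system is the one documented in the README.

```python
from pathlib import Path


def pair_samples(image_dir, depth_dir, image_ext=".jpg", depth_ext=".mat"):
    """Return sorted (image_path, depth_path) pairs that share a file stem."""
    # Index depth maps by file name without extension.
    depth_by_stem = {p.stem: p for p in Path(depth_dir).glob(f"*{depth_ext}")}
    pairs = []
    for img in sorted(Path(image_dir).glob(f"*{image_ext}")):
        depth = depth_by_stem.get(img.stem)
        if depth is not None:  # skip images without a matching depth map
            pairs.append((img, depth))
    return pairs
```

Each depth map in a pair can then typically be read in Python with `scipy.io.loadmat`; MAT files saved in the v7.3 format are HDF5-based and need an HDF5 reader instead. The variable names stored inside the files are not stated here and should be taken from the README.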

The main compressed directory also contains a third sub-directory of MAT data files. Every BS (transmitter) in the wireless instance of the scenario contributes three MAT data files: (i) an angles-of-departure file, (ii) a complex impulse response file, and (iii) a path gains file. This means there are nine MAT files in the sub-directory of the distributed-camera dataset, which has three BSs, and three files in that of the co-located-camera dataset, which has one BS. To generate wireless data samples, the wireless raw data needs to be unpacked and processed using the ViWi script, which is provided separately. More technical details on how to generate the output wireless data and on their structure can be found in the "README.txt" file enclosed with the script package.

V Possible Applications

Vision-aided wireless communications is a relatively new direction of research with a lot of potential. With the ViWi framework, it is now possible to investigate more problems and benchmark more computer-vision-powered solutions. The following subsections provide a rough categorization of the problems that could benefit from ViWi.

V-A Camera-Aided Beam Prediction

Beam prediction is a well-known problem in mmWave communications. Typical approaches to tackling this problem usually involve a form of beam training with a fixed beamforming codebook, which is usually time consuming. Some solutions have recently been proposed that utilize machine learning [5] to reduce that training burden. However, all those solutions use exclusively wireless data to do beam prediction. The introduction of visual sensory data could make an interesting addition to the problem, for it provides a way to understand or analyze the surrounding environment of the transmitter and receiver. The two datasets produced with ViWi could be easily used to study such a problem; both contain RGB images, depth maps, and channels for every user position.
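For a supervised treatment of this problem, one natural target is the index of the codebook beam that maximizes the beamforming gain for each user position. The sketch below computes that index by exhaustive search, which is also the search that conventional beam training performs over the air. The DFT codebook and the function names are assumptions made for illustration; the paper does not prescribe a particular codebook.

```python
import cmath
import math


def dft_codebook(num_antennas, num_beams):
    """Unit-norm DFT beamforming vectors, one list per beam (assumed codebook)."""
    scale = 1 / math.sqrt(num_antennas)
    return [
        [scale * cmath.exp(2j * math.pi * n * b / num_beams) for n in range(num_antennas)]
        for b in range(num_beams)
    ]


def best_beam(channel, codebook):
    """Index of the beam maximizing |f^H h|^2, found by exhaustive search."""
    def gain(f):
        return abs(sum(fn.conjugate() * hn for fn, hn in zip(f, channel))) ** 2
    return max(range(len(codebook)), key=lambda b: gain(codebook[b]))


codebook = dft_codebook(num_antennas=8, num_beams=8)
# A hypothetical channel aligned with beam 3 is assigned label 3.
label = best_beam(codebook[3], codebook)
```

Pairing such a label with the image and depth map recorded at the same user position would give one training sample for a vision-aided beam predictor.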

V-B Blockage Prediction

This is one of the most elusive problems, not only in mmWave systems but in wireless communications in general. It requires a strong sense of the surroundings and their dynamics as well as an intelligent analysis and prediction algorithm. The use of machine learning for predicting blockages has been investigated in [9, 19], and the results are overall promising. The co-located-camera dataset in ViWi provides an interesting scenario where more advanced solutions could be studied; along with wireless and depth data, it provides RGB images and user spatial locations.

VI Acknowledgment

The authors thank Mr. Tarun Chawla and Remcom for supporting and encouraging this work. The authors also thank Prof. Aldebaro Klautau from the Federal University of Pará for suggesting Blender to generate the synthetic images.

VII Conclusion

Motivated by the interesting premise and large potential of vision-aided wireless communications, this paper introduces the Vision-Wireless (ViWi) dataset generation framework. ViWi facilitates research in this direction by offering a method for unified and modular generation of development datasets and for benchmarking different solutions. The current version of ViWi offers four datasets for four outdoor scenarios. Each dataset provides a sequence of 4-tuples of RGB image, depth map, wireless channel, and user location. Future work on the framework includes expanding the ViWi database of scenarios and incorporating more data processing features.

