ViWi: A Deep Learning Dataset Framework for Vision-Aided Wireless Communications
The growing role that artificial intelligence, and specifically machine learning, is playing in shaping the future of wireless communications has opened up many new and intriguing research directions. This paper motivates research in the novel direction of vision-aided wireless communications, which aims at leveraging visual sensory information in tackling wireless communication problems. As with any new research direction driven by machine learning, obtaining a development dataset poses the first and most important challenge to vision-aided wireless communications. This paper addresses this issue by introducing the Vision-Wireless (ViWi) dataset framework. It is developed to be a parametric, systematic, and scalable data generation framework. It utilizes advanced 3D-modeling and ray-tracing software to generate high-fidelity synthetic wireless and vision data samples for the same scenes. The result is a framework that not only offers a way to generate training and testing datasets but also helps provide a common ground on which the quality of different machine learning-powered solutions can be assessed.
Can we use vision to help wireless communication? There are several reasons that motivate asking this question. First, future wireless communication devices and base stations will likely employ large numbers of antennas (at sub-6GHz or mmWave bands) to satisfy the high data rate requirements [1, 2, 3]. These large-scale multiple-input multiple-output (MIMO) transceivers, however, are subject to critical challenges such as the large channel/beam training overhead and the sensitivity of mmWave links to blockages [4, 5, 3]. Interestingly, most of the devices that employ large antenna arrays will most likely have other sensors, such as RGB cameras, depth cameras, or LiDAR sensors. This is the case, for example, in vehicles, 5G phones, AR/VR devices, intersection nodes of self-driving cars, and probably base stations in the near future. It is therefore natural to ask whether this vision data (generated, for example, by cameras) can help overcome the non-trivial wireless communication challenges, such as mmWave beam and blockage prediction, massive MIMO channel subspace prediction, hand-over prediction, and proactive network management, among others. This is further motivated by the recent advances in deep learning and computer vision that can extract high-level semantics from complex visual scenes, and by the increasing interest in leveraging machine/deep learning tools in wireless communication problems [6, 7, 8, 9, 10, 11].
The need for a dataset: To enable leveraging deep learning and computer vision for the proposed vision-aided wireless communication research, it is crucial to have sufficiently large and suitable datasets. These datasets will allow researchers to (i) develop deep-learning/computer vision algorithms and evaluate their performance, (ii) reproduce the results of other papers, (iii) set benchmarks for the various vision-aided wireless communication problems, and (iv) compare the different proposed solutions based on common data. Next, we describe the main requirements that such a dataset must satisfy to be useful for vision-aided wireless communication research.
Co-existing visual and wireless data: Since the objective is to use visual data, captured for example by cameras or LiDAR sensors, to help the wireless communication systems that operate in the same device or environment, the visual data (such as the RGB and depth images as well as point cloud data) and the wireless data (such as the communication and radar channels) need to be collected from the same environment.
Accuracy: The methodology of collecting the visual and wireless data should ensure the accuracy of this data.
Scalability: The dataset collection process should be scalable to many scenarios and sizes to be able to efficiently address several use cases.
Parameterized dataset: In wireless communication problems, it is normally important to evaluate the performance versus different system and channel parameters, such as the number of antennas and array geometry. Similarly, we expect that it will be desirable to study the vision-aided wireless communication algorithms for different visual-data parameters, such as the camera resolution, color space, depth, and point cloud perturbation. Therefore, the dataset that enables these research directions needs to be parameterized.
There are several datasets that have been developed over the last decade for visual data alone [12, 13], or more recently for wireless data alone. To the best of our knowledge, however, there are no publicly available datasets that provide co-existing visual and wireless data.
The ViWi dataset: This paper presents the Vision-Wireless (ViWi) framework, which is designed to satisfy the mentioned requirements. (The latest versions of the ViWi datasets and codes can be found on the dataset website.) ViWi is a data-generating framework that not only provides wireless data but also combines it with visual data taken from the same scenes. This is achieved by utilizing advanced 3D modeling and ray-tracing simulators that generate high-fidelity synthetic vision and wireless data. The main goal of creating the ViWi dataset framework is to encourage and facilitate research in vision-aided wireless communications, which utilizes the advances in computer vision, machine learning, and point cloud analysis to tackle the critical challenges in wireless communications. In the first release, we make four ViWi-generated datasets publicly available. Each ViWi dataset consists of 4-tuples of image, depth map, wireless channel, and user location.
Before diving into the details, here is how this paper is structured. The next section, Section II, provides an overview of ViWi and highlights its major components. Section III takes a deeper look into those components using two example scenarios, which results in the first two vision-wireless datasets. Section V presents some possible applications for the framework. Finally, Section VII concludes the paper.
II Framework Overview
The availability of a development dataset constitutes a major challenge to vision-aided wireless communications. As such, this work presents a novel framework for visual-wireless synthetic data generation. The choice of using synthetic data is mainly motivated by two factors: (i) its relatively low cost and (ii) its scalability. Acquiring real-world visual and wireless data, like images and channels, requires two completely different equipment setups, and the data acquisition process itself is time consuming; the process entails building a physical scenario, placing the equipment, synchronizing the acquisition process, and collecting data over a lengthy period of time. All that translates into increased cost and difficult scalability when compared to generating synthetic data. These challenges have been acknowledged, albeit independently, by the computer vision and wireless communication communities. An increasing amount of work in both communities has been relying on synthetic data generated by 3D game engines and electromagnetic ray-tracing software; see, for example, [16, 17, 18, 6, 14, 7]. Hence, advanced game engines and ray-tracing software are the backbone of the proposed Vision-Wireless (ViWi) dataset framework.
|Object||Dimensions (Width, Length, Height in meters)||Material||Note|
|Model Building 1*||Brick||Replaced with same-dimensions cube|
|Model Building 2*||Concrete||Replaced with same-dimensions cube|
|Bush||Dense deciduous forest|
|Car*||–||Replaced with user grid.|
|Bus*||Metal||Replaced with same-dimensions cube|
The dataset generation in the proposed framework goes through three main stages as shown in Fig. 1. These stages are:
Scenario definition: Addressing a vision-wireless problem starts by describing the physical study environment where the problem is defined, which is referred to as scenario definition. This description must identify two types of elements, visual and electromagnetic. The visual elements, e.g., buildings, curbs, streets, cars, trees, and people, are built and assembled using a game engine. Together, they form the visual instance of the scenario. The same scenario definition with its visual elements is then constructed in ray-tracing software. This software defines the electromagnetic characteristics of the scenario, like the dielectric properties of different objects, creating the wireless instance of the scenario. See the left column in Fig. 1.
Raw-data generation: The two scenario instances are processed by the game engine and the ray-tracing software to produce two sets of raw data. The first is a set of visual data, which includes RGB images of the environment, accurate depth maps, and LiDAR point clouds, while the other set has wireless data such as the angles of arrival/departure and path gains of all the rays between the transmitters and receivers. These two sets together define the scenario raw data. See the middle column in Fig. 1.
Parameterized processing: This stage offers the choice of customizing the raw data using two sets of user-defined parameters. Both sets define how the visual and wireless raw data is processed to extract the final and, often, more realistic data samples. For visual raw data, this may include transforming images to different color spaces, lowering the resolution of images and depth maps, adding some artifacts to them, or distorting point cloud data. On the other hand, for the wireless data, this may include constructing wireless communication/radar channels and obtaining user locations. See the right column in Fig. 1.
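As a minimal sketch of the visual branch of this stage, the snippet below applies three of the operations mentioned above (resolution lowering, a color-space transformation, and an added artifact in the form of Gaussian noise) to an RGB array. It assumes NumPy; the function name, the parameter names, and the naive subsampling used for resolution control are illustrative choices and are not taken from the ViWi package.

```python
import numpy as np


def process_image(rgb, downsample=1, noise_std=0.0, to_gray=False, seed=0):
    """Apply simple, user-parameterized degradations to an RGB image.

    rgb: uint8 array of shape (H, W, 3). All parameter names are hypothetical.
    """
    img = rgb.astype(np.float64)

    # Lower the resolution by keeping every `downsample`-th pixel (naive subsampling).
    img = img[::downsample, ::downsample, :]

    # Optional color-space transformation to grayscale (ITU-R BT.601 luma weights).
    if to_gray:
        img = img @ np.array([0.299, 0.587, 0.114])

    # Optional additive Gaussian noise to mimic a lower-quality camera feed.
    if noise_std > 0:
        rng = np.random.default_rng(seed)
        img = img + rng.normal(0.0, noise_std, size=img.shape)

    # Round and clip back to the valid 8-bit range.
    return np.clip(np.rint(img), 0, 255).astype(np.uint8)


frame = np.full((720, 1280, 3), 128, dtype=np.uint8)  # synthetic stand-in frame
small_gray = process_image(frame, downsample=4, noise_std=2.0, to_gray=True)
print(small_gray.shape)  # (180, 320)
```

Keeping every operation behind an explicit parameter means the same raw image can yield many dataset variants without re-rendering the scene.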
What is interesting and unique about this three-stage framework is that every dataset is completely defined by its scenario name and the parameterization sets; it is enough to provide the name of the scenario and the parameter sets to completely describe a certain dataset or generate (reproduce) it. This allows for fast and easy re-generation and makes the framework favorable for benchmarking.
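One way to picture this property is a small descriptor object that holds nothing but the scenario name and the two parameter sets. The sketch below is hypothetical: the class, the field names, the parameter keys, and the scenario name are placeholders chosen for illustration, not identifiers from the ViWi scripts. It derives a short fingerprint from the descriptor so that two users can check that they are referring to the same dataset.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class DatasetSpec:
    """Hypothetical descriptor: scenario name plus the two parameter sets."""
    scenario: str
    wireless_params: dict = field(default_factory=dict)
    visual_params: dict = field(default_factory=dict)

    def fingerprint(self) -> str:
        # Serialize with sorted keys so that equal specs give equal fingerprints.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


spec_a = DatasetSpec(
    scenario="example_scenario",  # placeholder name
    wireless_params={"active_bs": [1], "num_ant_x": 1, "num_ant_y": 32},
    visual_params={"downsample": 2},
)
# Same content, different key order: this describes the same dataset.
spec_b = DatasetSpec(
    scenario="example_scenario",
    wireless_params={"num_ant_y": 32, "num_ant_x": 1, "active_bs": [1]},
    visual_params={"downsample": 2},
)
print(spec_a.fingerprint() == spec_b.fingerprint())  # True
```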
III ViWi: A Detailed Description
With the aforementioned ViWi structure in mind, this section discusses the inner workings of each stage using two example datasets, namely the distributed-camera and co-located-camera datasets. (The first release of ViWi can actually be used to generate four different datasets from four different sets of scenario raw data, but for the sake of clarity, only two of those four datasets are used as examples in this discussion.) Both of them are generated using the same example scenario, as explained below.
III-A Scenario Definition
The two example datasets generated with ViWi are for an outdoor scenario, which shows a car driving through a city street. Fig. 2-a depicts an aerial view of the visual instance of this scenario. It is built and generated using the popular game engine Blender™. This instance is composed of many elements that are found in real-world metropolitan areas, like buildings, bushes, sidewalks, and cars. Table I lists the building blocks of the scenario and their dimensions. To animate the scenario, five trajectories are defined to represent possible car paths, each of which has one thousand equally spaced points that are 0.089 meters apart. The trajectories are also separated from each other by an equal distance of 0.5 meters. This visual instance is used for both datasets but with some minor visual changes, as detailed in Section III-B.
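The trajectory numbers quoted above can be turned into a grid of candidate user positions. The following sketch assumes NumPy and, purely for illustration, straight trajectories running along the x-axis with the first point at the origin; the actual geometry and coordinate frame of the scenario are not specified here.

```python
import numpy as np

NUM_TRAJECTORIES = 5          # values taken from the text
POINTS_PER_TRAJECTORY = 1000
POINT_SPACING_M = 0.089
TRAJECTORY_SPACING_M = 0.5


def user_grid():
    """Return an array of shape (5, 1000, 2) holding (x, y) positions in meters.

    Assumption: straight trajectories along x, offset from each other along y,
    with the first point of the first trajectory at the origin.
    """
    x = np.arange(POINTS_PER_TRAJECTORY) * POINT_SPACING_M
    y = np.arange(NUM_TRAJECTORIES) * TRAJECTORY_SPACING_M
    xx, yy = np.meshgrid(x, y)  # both of shape (5, 1000)
    return np.stack([xx, yy], axis=-1)


grid = user_grid()
length_m = grid[0, -1, 0] - grid[0, 0, 0]
print(grid.shape)          # (5, 1000, 2)
print(round(length_m, 3))  # 88.911
```

Under these assumptions, each trajectory spans 999 × 0.089 ≈ 88.9 meters, the five trajectories occupy a 2-meter-wide strip, and the grid holds 5,000 candidate positions in total.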
To generate the wireless instance of the scenario, the visual instance is imported into the ray-tracing software of choice, which is, in this work, Wireless InSite®. Some of the objects in the visual instance have very fine visual details that may substantially slow down the ray-tracing simulation. In cases where hardware or software capabilities are limited, those objects could be either removed from the wireless instance of the scenario or replaced with objects of simpler geometry with no major impact on the simulation results. Fig. 2-b shows an example of how the visual instance of the example scenario is simplified for ray-tracing simulation. Once the geometry has been simplified, the wireless instance is completed by setting the dielectric properties of all its objects. Table I shows the material used for every object in the wireless instance and identifies which objects are removed or replaced.
III-B Raw-Data Generation
This stage prepares the instance for processing and generates visual and wireless raw data. The visual and wireless instances are both fitted with raw-data generators, like cameras, transmitters, and receivers, and their properties are set in preparation for data generation. Both instances are run separately to get the output data, which is in its initial form. The visual raw data of the generated two datasets consists of RGB images and depth maps whereas the wireless raw data is composed of the angles of departure, path gains, and channel impulse responses for every simulated ray from the transmitter to the receiver.
The RGB images and depth maps in the two example datasets are produced from the same visual instance but using different visual data generators. For the first dataset, the visual instance is fitted with a total of three cameras (data generators) that are 5 meters high and 30 meters apart, Fig. 3-a. Each camera has a field of view of 100 degrees. These properties are chosen so that the cameras cover the whole street with minimum field-of-view overlap. With these settings and using the defined car trajectories, the scenario is animated in Blender to generate the visual raw data of the first example dataset, which will be henceforth referred to as the distributed-camera dataset. For the second example dataset, three differently oriented cameras with 75-, 110-, and 75-degree fields of view are placed halfway along the street and 5 meters above the ground, Fig. 3-b. They are oriented in different directions such that they cover the whole street with the least possible overlap. Similar to the first example, the scenario is animated using the new generators and the same car trajectories to produce the visual raw data of the second example dataset, which will be henceforth referred to as the co-located-camera dataset.
To generate the wireless raw data of both example datasets, the wireless instance of the scenario, Fig. 2-b, is fitted with distributed data generators (transmitters and receivers) with similar properties. For both datasets, all transmitter and receiver antennas are half-wave dipoles operating at a frequency of 60 GHz with a sinusoidal waveform. The first example has transmitter antennas, referred to as base stations (BSs), replacing the three distributed cameras, and a user grid of receiver antennas placed along each of the five pre-defined car trajectories. The second example, on the other hand, has the three cameras replaced with one BS and uses the same five trajectories to define the receiver grid. Wireless InSite is used with both wireless instances to identify all possible rays going from every BS to every user in both examples and to produce two sets of wireless raw data, one for each dataset.
III-C Parametrized Processing
The raw data, whether visual or wireless, could be directly used as samples of a development dataset. However, studying real-world engineering problems and applications requires some form of control over the data acquisition process and the environmental settings. For instance, the quality of the camera feed or channel information could be subjects of interest in certain vision-aided communication problems. In such cases, the output raw data is in a primitive form that cannot be used to address any of those issues. Hence, raw data has to undergo another optional layer of processing to produce the final dataset, the last stage in the proposed framework.
The last stage is a parametrized layer where the user defines how the raw data is processed. ViWi provides processing for the wireless and visual raw data independently in the form of a package of scripts. For the visual raw data, the scripts offer control over a set of parameters that chooses the scenario of interest and applies some filtering and transformation operations to the images or depth maps. Examples of such operations are image blurring filters, noise-corruption processes, resolution control, and color-space transformation. For the wireless raw data, on the other hand, other scripts define another set of control parameters. It includes specifying the scenario of interest, the number of active BSs, the number of antennas along the x-, y-, and z-axes, and the antenna spacing, to name a few. The full list of wireless parameters and their definitions is the same as that of the DeepMIMO dataset. By setting those parameters, a user can produce a task-specific set of wireless data samples such as complex-valued channels and user locations.
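To make the role of the wireless parameters concrete, the sketch below shows one common way of turning per-ray data into a channel vector: a narrowband geometric model that sums array response vectors weighted by complex path gains. It is an illustration under stated assumptions, not the ViWi or DeepMIMO script. It assumes NumPy, a uniform linear array along the y-axis, path powers in dB, phases and angles in degrees, and elevation measured from the z-axis; the function names and the example rays are made up.

```python
import numpy as np


def ula_response(num_ant, spacing_wl, az_deg, el_deg):
    """Array response of a uniform linear array placed along the y-axis.

    spacing_wl is the antenna spacing in wavelengths; el_deg is measured
    from the z-axis (an assumed angle convention).
    """
    az, el = np.deg2rad(az_deg), np.deg2rad(el_deg)
    n = np.arange(num_ant)
    phase = 2 * np.pi * spacing_wl * n * np.sin(el) * np.sin(az)
    return np.exp(1j * phase)


def narrowband_channel(paths, num_ant=8, spacing_wl=0.5):
    """Sum the contributions of all rays between one BS and one user.

    paths: iterable of (power_dB, phase_deg, az_deg, el_deg) tuples.
    """
    h = np.zeros(num_ant, dtype=complex)
    for power_db, phase_deg, az_deg, el_deg in paths:
        # Convert path power and phase into a complex amplitude.
        gain = np.sqrt(10 ** (power_db / 10)) * np.exp(1j * np.deg2rad(phase_deg))
        h += gain * ula_response(num_ant, spacing_wl, az_deg, el_deg)
    return h


# Two made-up rays: a strong direct path and a weaker reflection.
rays = [(-80.0, 30.0, 20.0, 90.0), (-95.0, -110.0, -40.0, 85.0)]
h = narrowband_channel(rays, num_ant=8)
print(h.shape)  # (8,)
```

Changing `num_ant` or `spacing_wl` yields a different channel from the same rays, which is why the ray-tracing output can be stored once and re-processed for many array configurations.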
IV How to Use ViWi?
The first release of ViWi provides four sets of raw data and a dataset-generating package. Each set has the visual and wireless raw data required to generate the final dataset. This release provides only the scripts that parameterize the wireless raw data. Visual raw data do not undergo any processing and are, therefore, directly included in the final dataset. However, they are provided in popular data formats, i.e., JPEG for RGB images and MAT for depth maps, so they can be easily processed by the user. (MAT is the native MATLAB® data format, which can be easily read from other scripting languages like Python or R. It was chosen because it is more popular than the original OpenEXR format used for depth maps.)
ViWi provides visual raw data in sub-directories enclosed in a main compressed directory ready for download. The images and depth maps are stored in two separate sub-directories. All raw images and depth maps have 720p HD resolution, i.e., 1280 × 720 pixels, and are, as stated above, stored in JPEG and MAT formats. Every image has a corresponding depth map, and the two share the same file name in their respective sub-directories. More information on the naming system can be found in the “README.txt” file in the main compressed directory. Generating the visual data samples of a dataset only requires unpacking (unzipping) the compressed directory.
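Because each image and its depth map share a file name, pairing them after unpacking can be done with the standard library alone. The sketch below is a hypothetical helper, not part of the ViWi package; it assumes the extensions `.jpg` and `.mat`, and the directory names in the demo are placeholders created in a temporary folder.

```python
import tempfile
from pathlib import Path


def pair_samples(image_dir, depth_dir):
    """Match each JPEG image with the MAT depth map that shares its name.

    Returns a list of (image_path, depth_path) tuples sorted by file name.
    The '.jpg' and '.mat' extensions are assumptions.
    """
    depth_by_stem = {p.stem: p for p in Path(depth_dir).glob("*.mat")}
    pairs = []
    for image_path in sorted(Path(image_dir).glob("*.jpg")):
        depth_path = depth_by_stem.get(image_path.stem)
        if depth_path is None:
            raise FileNotFoundError(f"no depth map for {image_path.name}")
        pairs.append((image_path, depth_path))
    return pairs


# Demo on empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as root:
    images, depths = Path(root, "images"), Path(root, "depth")
    images.mkdir()
    depths.mkdir()
    for name in ("a", "b"):
        (images / f"{name}.jpg").touch()
        (depths / f"{name}.mat").touch()
    demo_pairs = pair_samples(images, depths)
    demo_names = [(i.name, d.name) for i, d in demo_pairs]

print(demo_names)  # [('a.jpg', 'a.mat'), ('b.jpg', 'b.mat')]
```

The depth maps themselves could then be read in Python with, for example, `scipy.io.loadmat`; the variable names stored inside the MAT files are documented in the README and are not assumed here.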
The main compressed directory also contains a third sub-directory of MAT data files. Every BS (transmitter) in the wireless instance of the scenario contributes three MAT data files: (i) an angles-of-departure file, (ii) a complex impulse response file, and (iii) a path gains file. This means there are nine MAT files in the sub-directory of the distributed-camera dataset, which has three BSs, and three files in that of the co-located-camera dataset, which has one BS. To generate wireless data samples, the wireless raw data needs to be unpacked and processed using the ViWi script, which is provided separately. More technical details on how to generate the output wireless data and on their structure can be found in the “README.txt” file enclosed with the script package.
V Possible Applications
Vision-aided wireless communications is a relatively new direction of research with a lot of potential. With the ViWi framework, it is now possible to investigate more problems and benchmark more computer-vision-powered solutions. The following subsections provide a rough categorization of the problems that could benefit from ViWi.
V-A Camera-Aided Beam Prediction
Beam prediction is a well-known problem in mmWave communications. Typical approaches to tackling this problem usually involve a form of beam training with a fixed beamforming codebook, which is usually time consuming. Some solutions have recently been proposed to utilize machine learning and reduce that training burden. However, all those solutions use exclusively wireless data to do beam prediction. The introduction of visual sensory data could make an interesting addition to the problem, for it provides a way to understand or analyze the surrounding environment of the transmitter and receiver. The two datasets produced with ViWi could be easily used to study such a problem; both contain RGB images, depth maps, and channels for every user position.
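For a learning-based treatment of this problem, each channel in a dataset can be converted into a target label, namely the index of the best beam in the codebook, which a model would then try to predict from the image. The sketch below illustrates that labeling step with an exhaustive search. It assumes NumPy and a DFT codebook, which is one common choice rather than anything prescribed by ViWi; the function names and the toy channel are illustrative.

```python
import numpy as np


def dft_codebook(num_ant):
    """Columns are unit-norm DFT beamforming vectors (an assumed codebook)."""
    n = np.arange(num_ant)
    return np.exp(2j * np.pi * np.outer(n, n) / num_ant) / np.sqrt(num_ant)


def best_beam(h, codebook):
    """Exhaustive beam training: return the index of the largest beamforming gain."""
    gains = np.abs(codebook.conj().T @ h) ** 2
    return int(np.argmax(gains))


num_ant = 16
F = dft_codebook(num_ant)
h = 3.0 * F[:, 5]        # toy channel aligned with beam 5
print(best_beam(h, F))   # 5
```

The exhaustive search evaluates every beam in the codebook, which is the overhead that a vision-aided predictor would aim to avoid at run time.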
V-B Blockage Prediction
This is one of the most elusive problems not only in mmWave systems but in wireless communications in general. It requires a strong sense of the surroundings and their dynamics as well as an intelligent analysis and prediction algorithm. The use of machine learning for predicting blockages has been investigated in [9, 19], and the results are overall promising. The co-located-camera dataset in ViWi provides an interesting scenario where more advanced solutions could be studied; along with wireless and depth data, it provides RGB images and user spatial locations.
The authors thank Mr. Tarun Chawla and Remcom for supporting and encouraging this work. The authors also thank Prof. Aldebaro Klautau from the Federal University of Pará for suggesting Blender to generate the synthetic images.
Given the interesting premise and massive potential of vision-aided wireless communications, this paper introduces the Vision-Wireless (ViWi) dataset generation framework. ViWi facilitates research in this direction by offering a method for unified and modular generation of development datasets and for benchmarking different solutions. The current version of ViWi offers four datasets for four outdoor scenarios. Each dataset provides a sequence of 4-tuples of RGB image, depth map, wireless channel, and user location. Future work on the framework includes expanding the ViWi database of scenarios and incorporating more data processing features.
- [1] F. Boccardi, R. Heath, A. Lozano, T. Marzetta, and P. Popovski, “Five disruptive technology directions for 5G,” IEEE Communications Magazine, vol. 52, no. 2, pp. 74–80, Feb. 2014.
- [2] R. W. Heath, N. González-Prelcic, S. Rangan, W. Roh, and A. M. Sayeed, “An overview of signal processing techniques for millimeter wave MIMO systems,” IEEE Journal of Selected Topics in Signal Processing, vol. 10, no. 3, pp. 436–453, April 2016.
- [3] T. Rappaport, S. Sun, R. Mayzus, H. Zhao, Y. Azar, K. Wang, G. Wong, J. Schulz, M. Samimi, and F. Gutierrez, “Millimeter wave mobile communications for 5G cellular: It will work!” IEEE Access, vol. 1, pp. 335–349, May 2013.
- [4] E. Bjornson, E. G. Larsson, and T. L. Marzetta, “Massive MIMO: Ten myths and one critical question,” IEEE Communications Magazine, vol. 54, no. 2, pp. 114–123, February 2016.
- [5] A. Alkhateeb, S. Alex, P. Varkey, Y. Li, Q. Qu, and D. Tujkovic, “Deep learning coordinated beamforming for highly-mobile millimeter wave systems,” IEEE Access, vol. 6, pp. 37328–37348, 2018.
- [6] A. Taha, M. Alrabeiah, and A. Alkhateeb, “Enabling large intelligent surfaces with compressive sensing and deep learning,” arXiv preprint arXiv:1904.10136, 2019.
- [7] M. Alrabeiah and A. Alkhateeb, “Deep learning for TDD and FDD massive MIMO: Mapping channels in space and frequency,” arXiv preprint arXiv:1905.03761, 2019.
- [8] T. J. O’Shea, T. Roy, and T. C. Clancy, “Over-the-air deep learning based radio signal classification,” IEEE Journal of Selected Topics in Signal Processing, vol. 12, no. 1, pp. 168–179, Feb. 2018.
- [9] A. Alkhateeb, I. Beltagy, and S. Alex, “Machine learning for reliable mmWave systems: Blockage prediction and proactive handoff,” in IEEE GlobalSIP, arXiv preprint arXiv:1807.02723, 2018.
- [10] X. Li and A. Alkhateeb, “Deep learning for direct hybrid precoding in millimeter wave massive MIMO systems,” in Proc. of Asilomar CSSC, arXiv preprint arXiv:1905.13212, May 2019.
- [11] N. Samuel, T. Diskin, and A. Wiesel, “Deep MIMO detection,” in 2017 IEEE 18th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), July 2017, pp. 1–5.
- [12] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? The KITTI vision benchmark suite,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
- [13] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “ImageNet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
- [14] DeepMIMO Dataset. [Online]. Available: https://www.DeepMIMO.net
- [15] ViWi Dataset. [Online]. Available: https://www.viwi-dataset.net
- [16] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig, “Virtual worlds as proxy for multi-object tracking analysis,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4340–4349.
- [17] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez, “The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3234–3243.
- [18] M. Müller, V. Casser, J. Lahoud, N. Smith, and B. Ghanem, “Sim4CV: A photo-realistic simulator for computer vision applications,” International Journal of Computer Vision, vol. 126, no. 9, pp. 902–919, 2018.
- [19] M. Alrabeiah and A. Alkhateeb, “Deep learning for mmWave beam and blockage prediction using sub-6GHz channels,” submitted to IEEE Transactions on Communications, arXiv preprint arXiv:1910.02900, Oct. 2019.