So you’ve written the ultimate ROS program: after thousands of lines of code your robot will finally achieve sentience and bring about the singularity!
One by one you launch your nodes. Each one bringing the Apocalypse ever closer. You hit enter on the last command. And. And, nothing happens. What went wrong? How will you find, and forever squash that bug that prevented your moment of triumph? This blog attempts to answer those questions, and more*.
At BLUEsat we’ve had our share of complicated ROS debugging problems. The best ones happen when you are half-way through a competition task, with time ticking on the clock. This article will also look at the more common situation of debugging in a less time-pressured, and less fire-prone, environment**.
Below are several tools and techniques that we have successfully deployed to debug our ROS environment.
Keep Calm And … FIRE!
You’ve probably heard this before, but it’s very important when debugging not to jump to conclusions or apply fixes you haven’t tested properly. Google, for example, has a policy of rolling back changes on its services rather than trying to push a fix. A similar idea applies in a competition or time-pressured situation: make sure you have thought through that patch that removes the “don’t kill humans” safety from your robot! That being said, a rollback is unfortunately unlikely to be applicable in a competition situation, nor is it likely to put out that fire you just started on your robot. So we can’t just copy Google, but we should think about what we are doing before we do it.
Basically, any patch or configuration fix you apply in such a situation is a calculated risk, and you should make sure you understand that risk before you act. During the European Rover Challenge last year I found it was possible to tweak small settings, restart certain nodes, and re-calibrate systems; but it was too risky to power cycle the rover during a task due to the time it took to establish communication. Likewise, restarting our drive systems or cameras was very disruptive to our pilot, so it could only be done in situations where the damage done by not fixing the system could be worse. That being said, after a critical camera failed we did attempt to programmatically power cycle that device – the decision being that the camera was important enough to attempt such a risky move. (In the end we weren’t able to do this during the task, and our pilot managed to navigate the rover without the camera in question.)
In a non-time-pressured situation you can be more flexible. It is possible to test different options and see if they work, provided they don’t damage your robot. However, a structured approach is often beneficial for more complicated bugs. I often find that when I’m debugging an intermittent or difficult-to-detect problem it is easy to lose track of what I’ve tried, or to get results mixed up. A technique I’ve found very useful is to record what I am doing as I do it, especially if the problem involves sensor data. We had a number of problems with our rover’s steering system when we first implemented our swerve drive, and I found that writing down ADC values and rotation readings in different situations really helped debug it. (You can read more about how we use ADCs in our steering system in one of our previous articles.)
Basically, the main point is to keep your head clear and think through the consequences before you act. Know the risks and have your E-Stop button ready! Now let’s look at some tools you can use to aid you in your debugging.
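If pen and paper isn’t your thing, the same record-as-you-go habit can be kept in a small script. The following is only a minimal sketch: the column names and the helper functions are hypothetical, not something from our rover’s code base, so swap in whatever readings matter for your bug.

```python
import csv
import time
from pathlib import Path

# Hypothetical column names; use whatever readings matter for your bug.
FIELDS = ["timestamp", "situation", "adc_value", "rotation_deg", "notes"]


def record_observation(log_path, situation, adc_value, rotation_deg, notes=""):
    """Append one observation to a CSV file, writing the header on first use."""
    path = Path(log_path)
    is_new = not path.exists()
    with path.open("a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({
            "timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
            "situation": situation,
            "adc_value": adc_value,
            "rotation_deg": rotation_deg,
            "notes": notes,
        })


def load_observations(log_path):
    """Read the log back as a list of dictionaries for later comparison."""
    with Path(log_path).open(newline="") as handle:
        return list(csv.DictReader(handle))
```

Calling something like `record_observation("steering_log.csv", "wheels straight", 2048, 0.0)` after each experiment leaves you with a timestamped table you can sort and compare later, instead of relying on memory.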
Grab Your RQT
The handy “rqt” tool is a ROS debugger’s Swiss Army Knife. It’s saved me many times during both time-pressured and non-time-pressured debugging. During the 2016 European Rover Challenge it was my constant companion at the debugging station, providing many useful insights and a lot of useful diagnostic data. (A standard BLUEsat rover driving crew consists of a pilot – who drives the rover; a co-pilot – who keeps track of task goals, the rover’s position, and manages communication to the pilot; and a debugger who monitors the state of the rover’s software and manages the different running ROS nodes).
RQT is run using the “rqt” command in the terminal, and contains a range of widgets that can be loaded through the Plugins menu. Below I’ll give a brief tutorial on some of its most useful widgets.
The Node Graph
The first tool in RQT’s arsenal is the Node Graph. This widget depicts all the nodes in your ROS graph as ovals and all of your topics as squares. Directional arrows indicate which nodes are advertising or subscribing to a topic. You can also choose to only show topics that are currently connected to both a publisher and subscriber (active), or to only display nodes without displaying topic information.
When I start debugging a ROS problem the node graph is one of the first things I look at. With a glance I can see which nodes are running, and whether two nodes are connected correctly. It’s amazing how often a ROS problem can be as simple as a node that isn’t running (or is running when it shouldn’t be). A misspelled topic name certainly doesn’t jump out at you in code, but it’s immediately obvious as a missing link in the graph.
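To make the “missing link” idea concrete, here is a rough sketch of the check your eyes are doing when you look at the graph. It is not how the Node Graph widget works internally; the function name and the example node and topic names are made up for illustration, and the two dictionaries stand in for whatever list of publishers and subscribers you have to hand.

```python
def find_unconnected_topics(publishers, subscribers):
    """Return topics that have publishers but no subscribers, and vice versa.

    Both arguments map a topic name to the list of node names using it.
    """
    published = {topic for topic, nodes in publishers.items() if nodes}
    subscribed = {topic for topic, nodes in subscribers.items() if nodes}
    return {
        "no_subscriber": sorted(published - subscribed),
        "no_publisher": sorted(subscribed - published),
    }


# Hypothetical graph: the drive node subscribes to a misspelled topic.
publishers = {"/cmd_vel": ["/joystick"]}
subscribers = {"/cmd_vell": ["/drive"]}
report = find_unconnected_topics(publishers, subscribers)
```

Here both `/cmd_vel` and `/cmd_vell` end up in the report, which is exactly the pair of dangling topics you would spot in the graph view.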
The Topic Monitor
If we can’t find our problem using the Node Graph then this next widget will often help. The Topic Monitor is the younger, better organised sibling of the rostopic echo command line tool. It displays a list of all currently advertised topics, and allows you to monitor them. Beside each topic is a checkbox which, when checked, subscribes us to that topic, displaying its full output as well as the bandwidth it’s using and the frequency it’s being published at.
This is extremely useful for checking that the correct information is travelling through your ROS network without having to add ROS_INFO debugs in all your nodes. On the BLUEtongue Rover we publish a lot of diagnostic information as ROS topics (a bit more about that here). Some of it is used to provide information to our pilot via the GUI, but if we need to get into the nitty gritty details then RQT can provide a massive wealth of information.
In addition to diagnostic information the topic monitor can be used to find problems in your network. A common case of this is a node that isn’t actually publishing any messages – in which case it may not be connected properly and you should take a look at the ROSWTF section. You can also see if a node is publishing the wrong message type, or if any values are incorrect.
Finally you can also use the topic monitor to identify potential bandwidth problems, although you should remember when doing this that rqt will subscribe to the topic itself, which may exacerbate the issue.
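If you want a feel for what the frequency and bandwidth columns mean, the arithmetic is simple enough to sketch. The function below is my own illustration, not rqt’s implementation: it assumes you have somehow collected a list of arrival times and message sizes for one topic.

```python
def topic_stats(samples):
    """Estimate publish rate (Hz) and bandwidth (bytes/s) for one topic.

    `samples` is a list of (arrival_time_seconds, message_size_bytes) tuples
    in arrival order. Returns (rate_hz, bytes_per_second), or (0.0, 0.0) if
    there are too few samples to measure an interval.
    """
    if len(samples) < 2:
        return 0.0, 0.0
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return 0.0, 0.0
    # N messages span N - 1 intervals.
    rate_hz = (len(samples) - 1) / elapsed
    # Only count bytes that arrived after the first timestamp, so the byte
    # count covers the same window as `elapsed`.
    bytes_per_second = sum(size for _, size in samples[1:]) / elapsed
    return rate_hz, bytes_per_second
```

A rate that is far below what the publishing node is supposed to produce, or a bandwidth figure close to what your radio link can carry, are both good hints about where to look next.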
The Message Publisher
The RQT Message Publisher is the Topic Monitor’s evil twin. As the name implies it allows you to publish messages, providing very similar functionality to the command line rostopic pub command – you can select a topic, message type and frequency and then enter the data you want to send. However it also provides some additional visual aids that speed up the debugging process.
Firstly it pre-populates a list of topics and a corresponding list of types, allowing you to very quickly publish to any subscriber currently in the network. This can be a life saver under stress, and prevents you from having to constantly remind yourself if that cmd_vel message is a geometry_msgs/Twist or a geometry_msgs/TwistStamped.
Once you have selected the message type, it will also display the fields of that message, making it much simpler to fill out those more complicated messages. It also remembers messages you have previously sent, allowing you to quickly resend them. This can be great if you need to do something like send a specific set of messages, or quickly enable a message after an event has occurred.
Finally, if you are a power user or need to send a more complicated message you can enter valid python expressions into the “expression” field, rather than actual values. This includes any method in the time, math or random modules. In addition it provides you with an automatic counter i (see the above image for examples).
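To give an idea of what such an expression can do, here is a small stand-alone sketch that mimics the behaviour described above. It is not the plugin’s code: I am assuming the functions from time, math and random are available by their bare names (so `sin(i)` rather than `math.sin(i)`), and the helper names are made up.

```python
import math
import random
import time


def build_namespace():
    """Collect the public names of math, random and time into one dict."""
    namespace = {}
    for module in (math, random, time):
        namespace.update(
            {name: value for name, value in vars(module).items()
             if not name.startswith("_")}
        )
    return namespace


def evaluate_field(expression, counter, namespace=None):
    """Evaluate one field expression with the counter bound to `i`."""
    scope = dict(namespace if namespace is not None else build_namespace())
    scope["i"] = counter
    # eval is acceptable here only because the expression is typed by the
    # person doing the debugging; never use it on untrusted input.
    return eval(expression, {"__builtins__": {}}, scope)
```

With this, an expression such as `sin(i / 10.0)` evaluated for increasing values of `i` gives a slowly varying test signal, and `uniform(-1, 1)` gives a noisy one, which is handy for exercising a subscriber without writing a throwaway node.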
The TF Tree
The final tool I’m going to talk about in our RQT debugging arsenal is the TF Tree. The tool is useful if you are using ROS’s transform system; if not, you may want to skip over this section.
The TF tree displays the connection structure of your transforms, as well as which node is publishing a given frame, the last time it was updated, and the oldest transform in the system.
The best use I’ve had for this is detecting gaps in the graph. For example ROS’s robot_state_publisher won’t publish a transform for a non-fixed joint if you haven’t published any information to joint_states about it, which can lead to unreachable transforms. If something like this happens the best approach is often to go back and check to make sure whichever node is supposed to be publishing a transform is functioning correctly. It is also useful for identifying the cause of transform timeouts by looking at the average rate and most recent transform values.
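A gap in the tree boils down to there being more than one root frame. The sketch below shows that check on a plain child-to-parent dictionary. The frame names are hypothetical, and it deliberately doesn’t touch the real tf API, so treat it as an illustration of the idea rather than a diagnostic tool.

```python
def find_tf_roots(parent_of):
    """Return every frame that has no parent, given a child -> parent map.

    A healthy transform tree has exactly one root. More than one means the
    tree is split and some frames cannot be transformed into others.
    """
    frames = set(parent_of) | set(parent_of.values())
    return sorted(frame for frame in frames if frame not in parent_of)


# Hypothetical frames: nobody publishes the wheel_mount -> base_link joint,
# so wheel_mount becomes a second root.
parent_of = {
    "base_link": "odom",
    "lidar": "base_link",
    "wheel": "wheel_mount",
}
roots = find_tf_roots(parent_of)
```

In this example `roots` contains both `odom` and `wheel_mount`, which tells you that the transform connecting `wheel_mount` to the rest of the robot is the one to chase.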
Find Yourself with RViz
RQT is an amazing tool for general day-to-day debugging; however, if you are dealing with very visual information, such as point clouds or where your software thinks the different parts of your robot are, then something more powerful is needed. That is where RViz comes in: it is a 3D scene where you can visualise different types of ROS data. As well as URDF robot models, RViz supports point clouds, occupancy grids, and much more. Basically if the topic you want to visualise is part of ros-desktop, then you can probably see it in RViz. (Note: if you really want to use RQT for everything you can use RViz as an RQT plugin).
RViz actually has some reasonable tutorials on the ROS Wiki, so I’m just going to give the Cliff’s Notes here. The key feature of RViz is its ability to load in different ROS messages and visualise them relative to each other. This is useful if you are trying to debug anything to do with localisation or automation, as you can quickly work out if your robot thinks it is in the wrong place, or is having problems with sensor data. As an example of RViz’s versatility, we made use of it during the 2016 European Rover Challenge’s “Blind Navigation” task to display a single plane of LIDAR data relative to the rover’s estimated position (camera feeds and full 3D sensor visualisation were forbidden during this task). We also used it extensively to debug LIDAR sensor input (see below), and various SLAM solutions.
ROSWTF
If you are having weird connection issues, or problems with nodes that otherwise seem to be functioning properly, then the roswtf tool is your guy/gal. Basically, ROSWTF is designed to be your one-stop shop for identifying issues in your ROS system, although my experience is that it’s not quite there yet. What it is really good for, however, is detecting setup or networking issues with your ROS network.
One such issue is machines on your ROS Network not being able to recognise each other’s host names. This can happen if you don’t have something like DHCP or DNS sharing them between machines, or if your machine’s DNS name does not match what local programs think your host name is. This is a difficult problem to detect, because nodes will often connect and run correctly until they try and communicate with a node on a different machine. ROSWTF will detect this hard to find problem.
In most cases there are two ways to fix this. The first is setting the ROS_IP environment variable on each machine to that machine’s own IP address (ROS_HOSTNAME is the equivalent variable if you would rather advertise a name), and the second is fixing your hostname resolution so that machines can find each other. The latter can be done by adding entries to your /etc/hosts file or updating your local DNS server. At BLUEsat we tend to use the environment variable option, as our network setup often means we don’t use DHCP, and having hosts know their own IP means we don’t have to update /etc/hosts on every machine in our network any time we add a new host.
Other problems that roswtf can detect are: a misconfigured ROS_MASTER_URI, actual network problems, and roslaunch file configuration problems. If everything looks like it should be working, but isn’t, roswtf is the tool for the job!
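If you want to confirm that hostname resolution is the culprit, a few lines of Python will tell you which names a given machine cannot look up. This is a minimal sketch using only the standard library; the function name is made up, and you would pass in the host names used on your own network.

```python
import socket


def unresolvable_hosts(hostnames, resolve=socket.gethostbyname):
    """Return the hostnames that this machine cannot resolve to an address.

    `resolve` can be replaced with another lookup function, which is mainly
    useful for testing this helper without a network.
    """
    failed = []
    for name in hostnames:
        try:
            resolve(name)
        except OSError:
            # socket.gaierror, raised for unknown hosts, is an OSError.
            failed.append(name)
    return failed
```

Run it on every machine in the ROS network with the names of all the others. Any name that comes back in the list on any machine is a candidate for the problem described above.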
Dig Deeper with GDB and Valgrind
Needless to say at this point if you have gone through the other steps and your robot is still on fire, you probably don’t have very much robot left. GDB and Valgrind are tools best left for initial testing and development, however when your robot is not on fire they can be very useful.
Both of these tools are topics in themselves and I recommend that you read full tutorials on the individual tools (gdb, valgrind) to get a good understanding of them. Here we will primarily cover how you use both these tools in a ROS environment.
In order to use either of these tools effectively you must first recompile your code with debugging symbols. This allows the tools to give you information about line numbers, and snippets of code where errors may be occurring. To recompile a catkin workspace with debugging symbols enabled you can run the following command.
$ catkin_make -DCMAKE_BUILD_TYPE=Debug
[Catkin Output Goes Here]
The second difficulty is actually locating the executable to run it using gdb or valgrind (you can’t just run gdb on rosrun unfortunately). If you are in the root of your catkin workspace then the executable should be located in devel/lib/<ros package name>/<node name>. That means you can run a node with gdb using the following command:
$ gdb devel/lib/[ros_package_name]/[node_name]
You can then step through your program as you normally would in gdb. Likewise you can run valgrind with the following command:
$ valgrind --leak-check=yes devel/lib/[ros_package_name]/[node_name]
I find that I tend to use gdb when I am trying to debug segfaults, weird outputs or unexpected behaviour; whilst I use valgrind almost exclusively for finding memory leaks and array overflows. They are certainly key tools for debugging C++ code and I highly recommend you read more about them!
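If you find yourself retyping those paths a lot, a tiny helper can build the commands for you. This is just a convenience sketch: the function names are hypothetical, and it assumes the devel/lib layout described above and a workspace built with catkin_make.

```python
from pathlib import Path


def node_executable(workspace, package, node):
    """Path of a catkin-built node, following devel/lib/<package>/<node>."""
    return Path(workspace) / "devel" / "lib" / package / node


def debug_command(workspace, package, node, tool="gdb"):
    """Build the argument list for running a node under gdb or valgrind."""
    executable = str(node_executable(workspace, package, node))
    if tool == "gdb":
        return ["gdb", executable]
    if tool == "valgrind":
        # Same flag as the valgrind command shown earlier in this section.
        return ["valgrind", "--leak-check=yes", executable]
    raise ValueError("unsupported tool: " + tool)
```

The returned list can be handed straight to `subprocess.call` from a terminal session, which saves a bit of typing when you are bouncing between several nodes.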
Now Go Get that Singularity!
All these tools have been of great use to me during my time at BLUEsat, especially during the European Rover Challenge tasks. I hope you find them helpful next time you are trying to create the singularity, or even when you are just debugging normal ROS code. If not, what’s here only scratches the surface of what you can do with many of these tools and I encourage the reader to experiment and dive deeper into all of them!
* BLUEsat UNSW and its members (mostly) do not endorse the bringing about of the apocalypse, and waive all responsibility for any death or planet-wide calamity that may occur as the result of apocalypse related debugging.
** Debugging techniques in this article not guaranteed to be safe for application during a fire. BLUEsat recommends liberal use of E-Stop systems.