Fetch failures follow a different failure-handling path: (1) we don't abort the stage after 4 task failures; instead we immediately go back to the stage that generated the map output and regenerate the missing data. (2) We don't count fetch failures for blacklisting, since presumably it's not the fault of the executor where the task ran, but of the executor that stored the data. This is especially important because one bad node can cause a burst of fetch failures in rapid succession on all nodes of the cluster.
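The two rules above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the scheduler's actual code: `ToyScheduler`, `FetchFailed`, `OtherFailure`, `handle_failure`, and the returned action strings are hypothetical names invented for this example, and the limit of 4 is taken from the text.

```python
from dataclasses import dataclass, field

# Limit taken from the "4 task failures" in the text above.
MAX_TASK_FAILURES = 4


@dataclass
class FetchFailed:
    """Hypothetical stand-in for a fetch-failure end reason."""
    map_stage_id: int      # stage that generated the missing map output
    source_executor: str   # executor that stored the data


@dataclass
class OtherFailure:
    """Hypothetical stand-in for any failure that is not a fetch failure."""
    message: str


@dataclass
class ToyScheduler:
    """Toy model of the two failure-handling paths (illustration only)."""
    task_failures: dict = field(default_factory=dict)      # task id -> count
    executor_failures: dict = field(default_factory=dict)  # executor -> count
    stages_to_resubmit: list = field(default_factory=list)
    aborted: bool = False

    def handle_failure(self, task_id, ran_on, reason):
        if isinstance(reason, FetchFailed):
            # (1) Not counted toward the abort limit: go back to the stage
            #     that produced the map output and regenerate the data.
            if reason.map_stage_id not in self.stages_to_resubmit:
                self.stages_to_resubmit.append(reason.map_stage_id)
            # (2) Not counted against `ran_on`: the executor that ran the
            #     task is presumably not the one at fault.
            return "resubmit-map-stage"

        # Ordinary failures count toward both the abort limit and the
        # per-executor failure tally used for blacklisting.
        self.task_failures[task_id] = self.task_failures.get(task_id, 0) + 1
        self.executor_failures[ran_on] = self.executor_failures.get(ran_on, 0) + 1
        if self.task_failures[task_id] >= MAX_TASK_FAILURES:
            self.aborted = True
            return "abort-stage"
        return "retry-task"
```

In this model, a burst of fetch failures reported from many executors leaves `executor_failures` empty and `aborted` false, and queues the map stage for resubmission only once. Ordinary failures of the same task trigger an abort on the fourth occurrence.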
Error message displayed in the web UI.
:: DeveloperApi :: Task failed to fetch shuffle data from a remote node. This probably means we have lost the remote executors the task is trying to fetch from, and thus need to rerun the previous stage.