Monitor Bandwidth Utilization of Nodes While Training #548

Description

@orwa-te

Environment:

  • Python version [3.7.7]
  • Spark version [3.0.1]
  • TensorFlow version [2.3.0]
  • TensorFlowOnSpark version [2.2.1]
  • Cluster version [Standalone]

Question:
Is there a way to monitor the network utilization of the nodes while they communicate with each other to exchange gradients and update the model? I want to measure the amount of data sent from one node to another, both for a single batch and across all batches. As far as I can tell, TensorBoard does not support this.

Spark Submit Command Line:
spark-submit --master spark://master:7077 train_file.py --cluster_size 3 --epochs 10
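
To make the kind of measurement being asked about concrete: one possible approach, outside TensorBoard, is to sample the operating system's per-interface byte counters on each node before and after a batch and take the difference. The sketch below is only an illustration and assumes Linux nodes that expose /proc/net/dev. The helper names `parse_net_dev`, `read_net_dev`, and `bytes_delta` are hypothetical and are not part of TensorFlow or TensorFlowOnSpark. The counters cover all traffic on an interface, not only gradient exchange, so the result is an upper bound on the training traffic.

```python
from typing import Dict, Tuple

# Hypothetical helpers (not part of TensorFlow or TensorFlowOnSpark).
# Assumption: the node runs Linux and exposes /proc/net/dev.

Counters = Dict[str, Tuple[int, int]]  # interface -> (rx_bytes, tx_bytes)


def parse_net_dev(text: str) -> Counters:
    """Parse the content of /proc/net/dev into per-interface byte counters."""
    counters: Counters = {}
    # The first two lines of /proc/net/dev are column headers.
    for line in text.splitlines()[2:]:
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 16:
            continue
        # Field 0 is received bytes, field 8 is transmitted bytes.
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def read_net_dev(path: str = "/proc/net/dev") -> Counters:
    """Read the current counters from the local node."""
    with open(path) as f:
        return parse_net_dev(f.read())


def bytes_delta(before: Counters, after: Counters) -> Counters:
    """Return (rx, tx) bytes transferred per interface between two samples."""
    return {
        iface: (after[iface][0] - before[iface][0],
                after[iface][1] - before[iface][1])
        for iface in after
        if iface in before
    }
```

On a worker, one could call `read_net_dev()` just before and just after a training step and pass both samples to `bytes_delta`; summing the per-batch deltas would give a total for the whole run. Where exactly to place those calls in train_file.py depends on how the training loop is written.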
