Citing a Paper
Each paper below has a generated BibTeX record that you can use to produce a citation in whatever style you need.
Copy the BibTeX code and convert it to a plain citation string with a BibTeX parser. You can run a short script locally, as sketched below, or use a web converter such as the site linked after it.
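As a local alternative, the short script below shows one way to do the conversion. It is only a sketch: it assumes the third-party bibtexparser package (version 1.x) is installed and uses the MiCRO record from this page as input.

# A minimal sketch of converting a BibTeX record into a plain citation string.
# Assumes the third-party bibtexparser package, v1.x (pip install "bibtexparser<2").
import bibtexparser

bibtex_src = """
@conference{yoon2023micro,
  title     = {MiCRO: Near-Zero Cost Gradient Sparsification for Scaling and Accelerating Distributed DNN Training},
  author    = {Daegun Yoon and Sangyoon Oh},
  booktitle = {30th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC 2023)},
  year      = {2023}
}
"""

db = bibtexparser.loads(bibtex_src)   # parse the BibTeX string
entry = db.entries[0]                 # each entry becomes a dict of lowercase field names

# Assemble a simple plain-text citation from the parsed fields.
authors = entry["author"].replace(" and ", ", ")
citation = f'{authors}. "{entry["title"]}." In {entry["booktitle"]}, {entry["year"]}.'
print(citation)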
bibtex.online

2023
Yoon, Daegun; Oh, Sangyoon
MiCRO: Near-Zero Cost Gradient Sparsification for Scaling and Accelerating Distributed DNN Training (International Conference)
30th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC 2023), 2023.
Links | BibTeX | Tags: distributed deep learning, gradient sparsification
@conference{yoon2023micro,
title = {MiCRO: Near-Zero Cost Gradient Sparsification for Scaling and Accelerating Distributed DNN Training},
author = {Daegun Yoon and Sangyoon Oh},
url = {https://ieeexplore.ieee.org/abstract/document/10487098},
year = {2023},
date = {2023-10-02},
urldate = {2023-10-02},
booktitle = {30th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC 2023)},
keywords = {distributed deep learning, gradient sparsification},
pubstate = {published},
tppubtype = {conference}
}
Yoon, Daegun; Oh, Sangyoon
DEFT: Exploiting Gradient Norm Difference between Model Layers for Scalable Gradient Sparsification (International Conference)
International Conference on Parallel Processing (ICPP) 2023, 2023.
Abstract | Links | BibTeX | Tags: distributed deep learning, gradient sparsification
@conference{yoon2023deft,
title = {DEFT: Exploiting Gradient Norm Difference between Model Layers for Scalable Gradient Sparsification},
author = {Daegun Yoon and Sangyoon Oh},
url = {https://dl.acm.org/doi/10.1145/3605573.3605609},
year = {2023},
date = {2023-08-07},
urldate = {2023-08-07},
booktitle = {International Conference on Parallel Processing (ICPP) 2023},
abstract = {Gradient sparsification is a widely adopted solution for reducing the excessive communication traffic in distributed deep learning. However, most existing gradient sparsifiers have relatively poor scalability because of the considerable computational cost of gradient selection and/or increased communication traffic owing to gradient build-up. To address these challenges, we propose a novel gradient sparsification scheme, DEFT, that partitions the gradient selection task into subtasks and distributes them to workers. DEFT differs from existing sparsifiers, wherein every worker selects gradients among all gradients. Consequently, the computational cost can be reduced as the number of workers increases. Moreover, gradient build-up can be eliminated because DEFT allows workers to select gradients in partitions that are non-intersecting (between workers). Therefore, even if the number of workers increases, the communication traffic can be maintained as per user requirement. To avoid the loss of significance of gradient selection, DEFT selects more gradients in the layers that have a larger gradient norm than the other layers. Because every layer has a different computational load, DEFT allocates layers to workers using a bin-packing algorithm to maintain a balanced load of gradient selection between workers. In our empirical evaluation, DEFT shows a significant improvement in training performance in terms of speed in gradient selection over existing sparsifiers while achieving high convergence performance.},
keywords = {distributed deep learning, gradient sparsification},
pubstate = {published},
tppubtype = {conference}
}
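The DEFT abstract above turns on two ideas: each worker selects gradients only from its own non-intersecting partition, and layers are packed onto workers so that the gradient-selection load stays balanced. As a rough illustration of the second idea only, the sketch below uses a greedy least-loaded-worker assignment; the function name, the cost proxy, and the heuristic itself are assumptions made for this example, not the paper's actual bin-packing algorithm.

# A hypothetical sketch of balancing gradient-selection work across workers,
# in the spirit of the layer-to-worker assignment described in the DEFT abstract.
# The greedy least-loaded heuristic and all names here are illustrative assumptions.
import heapq

def assign_layers_to_workers(layer_sizes, num_workers):
    """Greedily assign each layer to the currently least-loaded worker.

    layer_sizes: dict mapping layer name -> number of gradient elements
                 (used here as a proxy for gradient-selection cost).
    Returns: dict mapping worker id -> list of layer names (non-intersecting).
    """
    heap = [(0, w) for w in range(num_workers)]   # min-heap of (current load, worker id)
    heapq.heapify(heap)
    assignment = {w: [] for w in range(num_workers)}

    # Place layers from largest to smallest so big layers are spread out first.
    for name, size in sorted(layer_sizes.items(), key=lambda kv: -kv[1]):
        load, worker = heapq.heappop(heap)
        assignment[worker].append(name)
        heapq.heappush(heap, (load + size, worker))
    return assignment

# Example: four layers of different sizes spread over two workers.
print(assign_layers_to_workers({"conv1": 9408, "conv2": 36864, "fc1": 262144, "fc2": 4096}, 2))

Because each layer ends up on exactly one worker, the resulting partitions are disjoint, which is the property the abstract credits for eliminating gradient build-up.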
Yoon, Daegun; Jeong, Minjoong; Oh, Sangyoon
SAGE: toward on-the-fly gradient compression ratio scaling (International Journal Article)
In: The Journal of Supercomputing, pp. 1–23, 2023.
Abstract | Links | BibTeX | Tags: distributed deep learning, gradient sparsification
@article{yoon2023sage,
title = {SAGE: toward on-the-fly gradient compression ratio scaling},
author = {Daegun Yoon and Minjoong Jeong and Sangyoon Oh},
url = {https://link.springer.com/article/10.1007/s11227-023-05120-7},
doi = {10.1007/s11227-023-05120-7},
year = {2023},
date = {2023-02-25},
urldate = {2023-02-25},
journal = {The Journal of Supercomputing},
pages = {1--23},
abstract = {Gradient sparsification is widely adopted in distributed training; however, it suffers from a trade-off between computation and communication. The prevalent Top-k sparsifier has a hard constraint on computational overhead while achieving the desired gradient compression ratio. Conversely, the hard-threshold sparsifier eliminates computational constraints but fails to achieve the targeted compression ratio. Motivated by this trade-off, we designed a novel threshold-based sparsifier called SAGE, which achieves a compression ratio close to that of the Top-k sparsifier with negligible computational overhead. SAGE scales the compression ratio by deriving an adjustable threshold based on each iteration’s heuristics. Experimental results show that SAGE achieves a compression ratio closer to the desired ratio than a hard-threshold sparsifier without exacerbating the accuracy of model training. In terms of computation time for gradient selection, SAGE achieves a speedup of up to 23.62× over the Top-k sparsifier.},
keywords = {distributed deep learning, gradient sparsification},
pubstate = {published},
tppubtype = {article}
}
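The SAGE abstract above describes a threshold that is re-derived each iteration so that the achieved compression ratio tracks the desired one. The snippet below is only a hypothetical feedback-style illustration of that general idea; the update rule, constants, and names are assumptions made for the example and do not reproduce SAGE's actual derivation.

# A hypothetical illustration of an adjustable sparsification threshold that is
# nudged each iteration toward a target density (fraction of gradients kept).
# This is NOT SAGE's derivation; the multiplicative update rule is assumed purely
# for illustration.
import numpy as np

def threshold_sparsify(grad, threshold):
    """Keep only gradient entries whose magnitude exceeds the threshold."""
    mask = np.abs(grad) > threshold
    return grad * mask, mask.mean()   # sparsified gradient, achieved density

rng = np.random.default_rng(0)
target_density = 0.01   # e.g., keep roughly 1% of the gradients
threshold = 1e-3        # initial guess

for step in range(5):
    grad = rng.standard_normal(1_000_000).astype(np.float32) * 0.01
    sparse_grad, density = threshold_sparsify(grad, threshold)
    # Feedback: raise the threshold if too many gradients were kept, lower it otherwise.
    threshold *= (density / target_density) ** 0.5
    print(f"step {step}: density={density:.4f}, next threshold={threshold:.5f}")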
2022
Yoon, Daegun; Oh, Sangyoon
Empirical Analysis on Top-k Gradient Sparsification for Distributed Deep Learning in a Supercomputing Environment (Conference)
The 8th International Conference on Next Generation Computing (ICNGC) 2022, 2022.
Abstract | Links | BibTeX | Tags: distributed deep learning, GPU, gradient sparsification
@conference{yoon2022empirical,
title = {Empirical Analysis on Top-k Gradient Sparsification for Distributed Deep Learning in a Supercomputing Environment},
author = {Daegun Yoon and Sangyoon Oh},
doi = {10.48550/arXiv.2209.08497},
year = {2022},
date = {2022-09-19},
booktitle = {The 8th International Conference on Next Generation Computing (ICNGC) 2022},
abstract = {To train deep learning models faster, distributed training on multiple GPUs has become a very popular scheme in recent years. However, communication bandwidth is still a major bottleneck of training performance. To improve overall training performance, recent works have proposed gradient sparsification methods that significantly reduce communication traffic. Most of them require gradient sorting to select meaningful gradients, such as Top-k gradient sparsification (Top-k SGD). However, Top-k SGD is limited in speeding up overall training performance because gradient sorting is significantly inefficient on GPUs. In this paper, we conduct experiments that show the inefficiency of Top-k SGD and provide insight into its low performance. Based on observations from our empirical analysis, we plan to develop a high-performance gradient sparsification method as future work.},
keywords = {distributed deep learning, GPU, gradient sparsification},
pubstate = {published},
tppubtype = {conference}
}
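For context on the entry above: Top-k gradient sparsification (Top-k SGD) keeps only the k largest-magnitude gradient entries of each tensor before communication. The sketch below shows that selection step in PyTorch; the function name and the omission of error feedback and communication are simplifications assumed for illustration, not the paper's experimental setup.

# A minimal, generic sketch of the Top-k gradient selection step discussed in the
# abstract above (assumes PyTorch; names are illustrative, and error feedback and
# communication are omitted for brevity).
import torch

def topk_sparsify(grad: torch.Tensor, k: int):
    """Return the k largest-magnitude gradient values and their flat indices."""
    flat = grad.flatten()
    # The selection inside topk is the sorting-style work whose GPU cost the paper examines.
    _, idx = torch.topk(flat.abs(), k)
    return flat[idx], idx

grad = torch.randn(1_000_000)                    # stand-in for a layer's gradient
values, indices = topk_sparsify(grad, k=1_000)   # keep roughly 0.1% of the entries
print(values.shape, indices.shape)               # torch.Size([1000]) torch.Size([1000])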