Citing Our Papers
Each paper below has a generated BibTeX entry that you can use to produce a citation in whatever style you need.
Copy the generated BibTeX code and convert it to a plain citation string with a BibTeX parser. You can do the conversion on the web, for example at the site below, or script it yourself, as sketched after the link.
bibtex.online
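If you would rather script the conversion than use a web tool, any BibTeX parser works. Below is a minimal sketch assuming the third-party Python package bibtexparser (v1 API, pip install bibtexparser) is installed; it parses one of the entries from this page and prints a plain citation string.

# Minimal sketch: turn a BibTeX entry into a plain citation string.
# Assumes the third-party "bibtexparser" package, v1 API.
import bibtexparser

bibtex_str = """@conference{yoon2023micro,
  title     = {MiCRO: Near-Zero Cost Gradient Sparsification for Scaling and Accelerating Distributed DNN Training},
  author    = {Daegun Yoon and Sangyoon Oh},
  booktitle = {30th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC 2023)},
  year      = {2023},
}"""

db = bibtexparser.loads(bibtex_str)    # parse the string into a database
entry = db.entries[0]                  # each entry is a dict with lowercase keys

# Format an IEEE-style citation string from the parsed fields.
authors = entry["author"].replace(" and ", ", ")
print(f'{authors}, "{entry["title"]}," in {entry["booktitle"]}, {entry["year"]}.')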
2023
Yoon, Daegun; Oh, Sangyoon
MiCRO: Near-Zero Cost Gradient Sparsification for Scaling and Accelerating Distributed DNN Training (International Conference)
30th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC 2023), 2023.
Tags: distributed deep learning, gradient sparsification
@conference{yoon2023micro,
title = {MiCRO: Near-Zero Cost Gradient Sparsification for Scaling and Accelerating Distributed DNN Training},
author = {Daegun Yoon and Sangyoon Oh},
url = {https://ieeexplore.ieee.org/abstract/document/10487098},
year = {2023},
date = {2023-10-02},
urldate = {2023-10-02},
booktitle = {30th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC 2023)},
keywords = {distributed deep learning, gradient sparsification},
pubstate = {published},
tppubtype = {conference}
}
Yoon, Daegun; Oh, Sangyoon
DEFT: Exploiting Gradient Norm Difference between Model Layers for Scalable Gradient Sparsification (International Conference)
International Conference on Parallel Processing (ICPP) 2023, 2023.
Tags: distributed deep learning, gradient sparsification
@conference{yoon2023deft,
title = {DEFT: Exploiting Gradient Norm Difference between Model Layers for Scalable Gradient Sparsification},
author = {Daegun Yoon and Sangyoon Oh},
url = {https://dl.acm.org/doi/10.1145/3605573.3605609},
year = {2023},
date = {2023-08-07},
urldate = {2023-08-07},
booktitle = {International Conference on Parallel Processing (ICPP) 2023},
abstract = {Gradient sparsification is a widely adopted solution for reducing the excessive communication traffic in distributed deep learning. However, most existing gradient sparsifiers have relatively poor scalability because of the considerable computational cost of gradient selection and/or increased communication traffic owing to gradient build-up. To address these challenges, we propose a novel gradient sparsification scheme, DEFT, that partitions the gradient selection task into sub-tasks and distributes them to workers. DEFT differs from existing sparsifiers, wherein every worker selects gradients among all gradients. Consequently, the computational cost can be reduced as the number of workers increases. Moreover, gradient build-up can be eliminated because DEFT allows workers to select gradients in partitions that are non-intersecting (between workers). Therefore, even if the number of workers increases, the communication traffic can be maintained as per user requirement. To avoid the loss of significance of gradient selection, DEFT selects more gradients in the layers that have a larger gradient norm than the other layers. Because every layer has a different computational load, DEFT allocates layers to workers using a bin-packing algorithm to maintain a balanced load of gradient selection between workers. In our empirical evaluation, DEFT shows a significant improvement in training performance in terms of speed in gradient selection over existing sparsifiers while achieving high convergence performance.},
keywords = {distributed deep learning, gradient sparsification},
pubstate = {published},
tppubtype = {conference}
}
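The DEFT abstract above attributes its balanced gradient-selection load to allocating layers to workers with a bin-packing algorithm. As a rough illustration only, the sketch below uses a common greedy heuristic (heaviest layer first, onto the least-loaded worker); the heuristic, layer names, and costs are assumptions for this page, not the paper's actual algorithm.

# Hedged sketch: balanced layer-to-worker allocation via greedy bin packing
# (largest load first onto the least-loaded worker). Illustrative only.
import heapq

def allocate_layers(layer_loads, num_workers):
    """layer_loads: dict of {layer_name: estimated gradient-selection cost}."""
    heap = [(0.0, w) for w in range(num_workers)]  # (total load, worker id)
    heapq.heapify(heap)
    assignment = {w: [] for w in range(num_workers)}
    # Place the heaviest layers first so per-worker loads even out.
    for name, load in sorted(layer_loads.items(), key=lambda kv: -kv[1]):
        total, w = heapq.heappop(heap)             # least-loaded worker
        assignment[w].append(name)
        heapq.heappush(heap, (total + load, w))
    return assignment

# Hypothetical per-layer costs: two conv layers, two fully-connected layers.
print(allocate_layers({"conv1": 1.2, "conv2": 3.5, "fc1": 40.0, "fc2": 4.1}, 2))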
Yoon, Daegun; Jeong, Minjoong; Oh, Sangyoon
SAGE: toward on-the-fly gradient compression ratio scaling (International Journal Article)
In: The Journal of Supercomputing, pp. 1–23, 2023.
Tags: distributed deep learning, gradient sparsification
@article{yoon2023sage,
title = {SAGE: toward on-the-fly gradient compression ratio scaling},
author = {Daegun Yoon and Minjoong Jeong and Sangyoon Oh},
url = {https://link.springer.com/article/10.1007/s11227-023-05120-7},
doi = {10.1007/s11227-023-05120-7},
year = {2023},
date = {2023-02-25},
urldate = {2023-02-25},
journal = {The Journal of Supercomputing},
pages = {1--23},
abstract = {Gradient sparsification is widely adopted in distributed training; however, it suffers from a trade-off between computation and communication. The prevalent Top-k sparsifier has a hard constraint on computational overhead while achieving the desired gradient compression ratio. Conversely, the hard-threshold sparsifier eliminates computational constraints but fails to achieve the targeted compression ratio. Motivated by this trade-off, we designed a novel threshold-based sparsifier called SAGE, which achieves a compression ratio close to that of the Top-k sparsifier with negligible computational overhead. SAGE scales the compression ratio by deriving an adjustable threshold based on each iteration's heuristics. Experimental results show that SAGE achieves a compression ratio closer to the desired ratio than a hard-threshold sparsifier without exacerbating the accuracy of model training. In terms of computation time for gradient selection, SAGE achieves a speedup of up to 23.62× over the Top-k sparsifier.},
keywords = {distributed deep learning, gradient sparsification},
pubstate = {published},
tppubtype = {article}
}
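The SAGE abstract describes deriving an adjustable threshold from each iteration's heuristics so that the achieved compression ratio tracks the target. The sketch below only shows the general shape of such a feedback loop; the multiplicative update rule, gain, and all numbers are illustrative assumptions, not the update derived in the paper.

# Hedged sketch: threshold-based sparsification with a feedback-scaled
# threshold that tracks a target density. Illustrative stand-in for SAGE.
import numpy as np

def sparsify(grad, threshold):
    mask = np.abs(grad) > threshold
    return grad * mask, mask.mean()          # sparsified gradient, kept density

def update_threshold(threshold, density, target, gain=0.5):
    # Kept too many gradients -> raise the threshold; too few -> lower it.
    return threshold * (1.0 + gain * (density - target) / target)

rng = np.random.default_rng(0)
threshold, target = 1e-3, 0.01               # aim to keep ~1% of gradients
for step in range(5):
    grad = rng.normal(scale=0.01, size=1_000_000)
    _, density = sparsify(grad, threshold)
    threshold = update_threshold(threshold, density, target)
    print(f"step {step}: density={density:.4f}, next threshold={threshold:.5f}")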
2022
여상호; 배민호; 정민중; 권오경; 오상윤
Crossover-SGD: A gossip-based communication in distributed deep learning for alleviating large mini-batch problem and enhancing scalability (International Journal Article)
In: Concurrency and Computation: Practice and Experience, 2022.
Tags: deep learning, distributed deep learning
@article{yeo2022crossover,
title = {Crossover-SGD: A gossip-based communication in distributed deep learning for alleviating large mini-batch problem and enhancing scalability},
author = {여상호 and 배민호 and 정민중 and 권오경 and 오상윤},
url = {https://arxiv.org/abs/2012.15198},
doi = {10.48550/arXiv.2012.15198},
year = {2022},
date = {2022-11-01},
urldate = {2022-11-01},
journal = {Concurrency and Computation: Practice and Experience},
abstract = {Distributed deep learning is an effective way to reduce the training time of deep learning for large datasets as well as complex models. However, the limited scalability caused by network overheads makes it difficult to synchronize the parameters of all workers. To resolve this problem, gossip-based methods that demonstrate stable scalability regardless of the number of workers have been proposed. However, to use gossip-based methods in general cases, the validation accuracy for a large mini-batch needs to be verified. To verify this, we first empirically study the characteristics of gossip methods in a large mini-batch problem and observe that the gossip methods preserve higher validation accuracy than AllReduce-SGD (Stochastic Gradient Descent) when the batch size is increased and the number of workers is fixed. However, the delayed parameter propagation of the gossip-based models decreases validation accuracy at large node scales. To cope with this problem, we propose Crossover-SGD, which alleviates the delayed propagation of weight parameters via segment-wise communication and a load-balanced random network topology. We also adopt hierarchical communication to limit the number of workers in gossip-based communication methods. To validate the effectiveness of our proposed method, we conduct empirical experiments and observe that Crossover-SGD shows higher node scalability than SGP (Stochastic Gradient Push).},
keywords = {deep learning, distributed deep learning},
pubstate = {published},
tppubtype = {article}
}
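For readers unfamiliar with gossip-based synchronization, the sketch below simulates its core step, pairwise parameter averaging between randomly matched workers, in plain NumPy. Crossover-SGD's segment-wise communication, load-balanced random topology, and hierarchical communication are not reproduced; this only shows how gossip drives workers toward consensus without a global all-reduce.

# Hedged sketch: one gossip iteration = disjoint worker pairs average their
# parameters. Repeated rounds shrink the spread between workers.
import numpy as np

def gossip_step(params, pairs):
    """params: per-worker parameter vectors; pairs: disjoint (i, j) indices."""
    for i, j in pairs:
        avg = 0.5 * (params[i] + params[j])   # pairwise average, no global reduce
        params[i], params[j] = avg, avg.copy()

rng = np.random.default_rng(1)
workers = [rng.normal(size=4) for _ in range(4)]
for _ in range(10):
    perm = rng.permutation(4)                 # random matching each round
    gossip_step(workers, [(perm[0], perm[1]), (perm[2], perm[3])])
print(np.std(np.stack(workers), axis=0))      # spread shrinks toward consensus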
Yoon, Daegun; Oh, Sangyoon
Empirical Analysis on Top-k Gradient Sparsification for Distributed Deep Learning in a Supercomputing Environment (International Conference)
The 8th International Conference on Next Generation Computing (ICNGC) 2022, 2022.
Tags: distributed deep learning, GPU, gradient sparsification
@conference{yoon2022empirical,
title = {Empirical Analysis on Top-k Gradient Sparsification for Distributed Deep Learning in a Supercomputing Environment},
author = {Daegun Yoon and Sangyoon Oh},
doi = {10.48550/arXiv.2209.08497},
year = {2022},
date = {2022-09-19},
booktitle = {The 8th International Conference on Next Generation Computing (ICNGC) 2022},
abstract = {To train deep learning models faster, distributed training on multiple GPUs has become a very popular scheme in recent years. However, communication bandwidth is still a major bottleneck of training performance. To improve overall training performance, recent works have proposed gradient sparsification methods that reduce the communication traffic significantly. Most of them require gradient sorting to select meaningful gradients, such as Top-k gradient sparsification (Top-k SGD). However, Top-k SGD has a limited ability to speed up overall training performance because gradient sorting is significantly inefficient on GPUs. In this paper, we conduct experiments that show the inefficiency of Top-k SGD and provide insight into the low performance. Based on observations from our empirical analysis, we plan to develop a high-performance gradient sparsification method as future work.},
keywords = {distributed deep learning, GPU, gradient sparsification},
pubstate = {published},
tppubtype = {conference}
}
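As a reference for the selection step the abstract analyzes, a minimal Top-k SGD sparsifier in PyTorch looks like the sketch below; torch.topk is the sorting/selection operation whose GPU inefficiency the paper examines. The tensor size and compression ratio here are arbitrary.

# Minimal sketch of Top-k gradient sparsification: keep only the k
# largest-magnitude gradients and their indices.
import torch

def topk_sparsify(grad: torch.Tensor, ratio: float = 0.01):
    flat = grad.flatten()
    k = max(1, int(flat.numel() * ratio))
    _, indices = torch.topk(flat.abs(), k)   # the costly selection step on GPUs
    return flat[indices], indices            # communicate values + indices only

grad = torch.randn(1_000_000)
values, indices = topk_sparsify(grad, ratio=0.001)
print(values.numel(), "of", grad.numel(), "gradients kept")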
2021
이승준; 여상호; 오상윤
Edge AI의 추론 과정을 위한 계층적 작업 분할 배치 기법 [Hierarchical task partitioning and placement for the Edge AI inference process] (Domestic Conference)
2021 한국차세대컴퓨팅학회 춘계학술대회, 한국차세대컴퓨팅학회, 2021.
Tags: deep learning, distributed deep learning, edge computing, neural network
@conference{lee2021edge,
title = {Edge AI의 추론 과정을 위한 계층적 작업 분할 배치 기법},
author = {이승준 and 여상호 and 오상윤},
url = {https://www.earticle.net/Article/A409319},
year = {2021},
date = {2021-05-13},
urldate = {2021-05-13},
booktitle = {2021 한국차세대컴퓨팅학회 춘계학술대회},
pages = {26--29},
publisher = {한국차세대컴퓨팅학회},
abstract = {Conventional cloud-based deployment of machine learning models causes high latency that degrades the quality of machine learning services, which makes it difficult to deploy models reliably on edge devices. Moreover, transmitting input data for inference can leak private information. Solving these problems calls for an inference process that exploits edge servers and edge devices to avoid both privacy leakage and communication overhead. To define an effective inference process, we propose a model partitioning scheme between the edge server and the edge device for a single inference model, based on the model- and data-parallel pipelining techniques of existing distributed deep learning, together with an effective scheduling scheme for the independent concurrent tasks requested at the edge.},
keywords = {deep learning, distributed deep learning, edge computing, neural network},
pubstate = {published},
tppubtype = {conference}
}
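The abstract proposes splitting a single inference model between an edge device and an edge server. A minimal sketch of such a split is shown below, assuming a toy PyTorch Sequential model and an arbitrary cut point; shipping the intermediate activation over the network and the proposed task scheduling are out of scope here.

# Hedged sketch: cut a sequential model at a layer boundary, run the front
# on the edge device and the back on the edge server, transmitting only the
# intermediate activation between them.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 10),
)
split = 2                                   # hypothetical cut point
device_part, server_part = model[:split], model[split:]

x = torch.randn(1, 3, 32, 32)               # raw input stays on the device
activation = device_part(x)                 # computed locally, then transmitted
logits = server_part(activation)            # finished on the edge server
print(activation.shape, logits.shape)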
김대현; 여상호; 오상윤
분산 딥러닝에서 통신 오버헤드를 줄이기 위해 레이어를 오버래핑하는 하이브리드 올-리듀스 기법 [A hybrid all-reduce scheme with layer overlapping to reduce communication overhead in distributed deep learning] (Domestic Journal Article)
In: 정보처리학회논문지. 컴퓨터 및 통신시스템, vol. 10, no. 7, pp. 191–198, 2021.
Tags: all-reduce, deep learning, distributed deep learning, layer overlapping, synchronization
@article{kim2021hybrid,
title = {분산 딥러닝에서 통신 오버헤드를 줄이기 위해 레이어를 오버래핑하는 하이브리드 올-리듀스 기법},
author = {김대현 and 여상호 and 오상윤},
url = {https://kiss.kstudy.com/thesis/thesis-view.asp?key=3898298},
year = {2021},
date = {2021-01-01},
urldate = {2021-01-01},
journal = {정보처리학회논문지. 컴퓨터 및 통신시스템},
volume = {10},
number = {7},
pages = {191--198},
abstract = {Distributed deep learning requires a step that synchronizes the local parameters updated at each node. For effective parameter synchronization, we propose an all-reduce communication and computation overlapping scheme that takes per-layer characteristics into account. Synchronizing an upper layer's parameters can be overlapped with communication/computation (training) time until the next propagation pass of the lower layers. In a typical deep learning model for image classification, the upper layers are convolution layers and the lower layers are fully connected layers. Because convolution layers have fewer parameters than fully connected layers and sit higher in the model, their allowable network-overlap window is short, so it is effective to use a butterfly all-reduce, which shortens network latency. When the allowable overlap window is longer, a ring all-reduce, which exploits network bandwidth, is used instead. To verify the effectiveness of the proposed method, we implemented it on the PyTorch platform, built an experimental environment on top of it, and evaluated performance across batch sizes. The experiments show that the training time of the proposed scheme is up to 33% shorter than that of the baseline PyTorch approach.},
keywords = {all-reduce, deep learning, distributed deep learning, layer overlapping, synchronization},
pubstate = {published},
tppubtype = {article}
}
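The core decision in the abstract is choosing a butterfly all-reduce for upper layers with a short overlap window (small, latency-bound gradients) and a ring all-reduce when the window is longer (large, bandwidth-bound gradients). The sketch below makes that choice with the standard alpha-beta communication cost model; the alpha/beta values and message sizes are illustrative assumptions, not the paper's measured parameters.

# Hedged sketch: pick ring vs. butterfly all-reduce per layer using the
# standard alpha-beta cost model (alpha = per-message latency, beta = per-byte
# transfer time). Illustrative parameters only.
import math

def allreduce_cost(n_bytes, workers, alpha, beta, algo):
    if algo == "ring":         # 2(p-1) steps, each moving n/p bytes
        return 2 * (workers - 1) * (alpha + (n_bytes / workers) * beta)
    if algo == "butterfly":    # log2(p) steps, each moving n bytes
        return math.log2(workers) * (alpha + n_bytes * beta)
    raise ValueError(algo)

def choose_algorithm(n_bytes, workers, alpha=5e-6, beta=1e-9):
    ring = allreduce_cost(n_bytes, workers, alpha, beta, "ring")
    bfly = allreduce_cost(n_bytes, workers, alpha, beta, "butterfly")
    return "ring" if ring < bfly else "butterfly"

# A small conv-layer gradient vs. a large fully-connected gradient, 16 workers:
print(choose_algorithm(16 * 1024, 16))          # latency-bound -> butterfly
print(choose_algorithm(256 * 1024 * 1024, 16))  # bandwidth-bound -> ring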