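4.3 Object Detection

1. Definition

Object detection is the task of detecting the presence of objects of interest (vehicles, pedestrians, bicycles, traffic signs, and so on) in input data (images or point clouds) and localizing them. 2D object detection predicts bounding boxes in the image, whereas 3D object detection estimates position, size, and orientation in 3D space.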
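
To make the difference between the two output types concrete, the following is a minimal sketch of how a 2D and a 3D detection might be represented. The class names, field names, and the corner-format/yaw conventions are illustrative assumptions, not a format prescribed by any particular dataset or library.

```python
import math
from dataclasses import dataclass


@dataclass
class Box2D:
    """Axis-aligned image box in pixels (corner format; an assumed convention)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    label: str
    score: float

    def area(self) -> float:
        # Degenerate boxes are clamped to zero area.
        return max(0.0, self.x_max - self.x_min) * max(0.0, self.y_max - self.y_min)


@dataclass
class Box3D:
    """3D box: center position, size, and heading (yaw about the vertical axis)."""
    x: float
    y: float
    z: float
    length: float
    width: float
    height: float
    yaw: float  # radians; 0 means the box length is aligned with the x axis (assumed)
    label: str
    score: float

    def bev_corners(self):
        """Return the four footprint corners of the box in the ground (x, y) plane."""
        c, s = math.cos(self.yaw), math.sin(self.yaw)
        hl, hw = self.length / 2.0, self.width / 2.0
        local = [(hl, hw), (hl, -hw), (-hl, -hw), (-hl, hw)]
        # Rotate each local corner by yaw, then translate to the box center.
        return [(self.x + c * dx - s * dy, self.y + s * dx + c * dy) for dx, dy in local]
```

A 2D box carries four numbers per object, while the 3D box carries seven (three for position, three for size, one for heading), which is why 3D detection is usually evaluated in the ground plane or in full 3D rather than in the image.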

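2. 2D Object Detection

2.1 Two-Stage Detectors

A two-stage detector performs region proposal and then classification/regression in sequence. The family evolved from R-CNN (Girshick et al., 2014) through Fast R-CNN (Girshick, 2015) to Faster R-CNN (Ren et al., 2015). Faster R-CNN introduced the Region Proposal Network (RPN), which made the whole pipeline trainable end to end. These detectors are accurate, but because the second stage processes each proposal, they are comparatively less suited to real-time processing.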

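2.2 One-Stage Detectors

A one-stage detector predicts object classes and bounding boxes directly at each image location, without a region proposal step. Representative examples include YOLO (Redmon et al., 2016), SSD (Liu et al., 2016), and RetinaNet (Lin et al., 2017). They are fast, and the YOLO family is widely used for real-time perception in autonomous driving.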
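
Because a dense detector predicts a box at every location, it typically emits many overlapping boxes for the same object, and these are commonly pruned with non-maximum suppression (NMS). The following is a minimal pure-Python sketch of greedy NMS; the corner box format `(x_min, y_min, x_max, y_max)` and the default threshold of 0.5 are assumptions for illustration, and real detectors use vectorized, often per-class, implementations.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # the second box overlaps the first and is suppressed
```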

2.3 Transformer-Based Detectors

DETR (Carion et al., 2020) was among the first works to apply the transformer to object detection. It detects objects by set prediction, without anchors or non-maximum suppression (NMS). Follow-up works such as Deformable DETR, DAB-DETR, and DINO improved accuracy and convergence speed.
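
Set prediction relies on a one-to-one matching between predictions and ground-truth objects during training, so that each object is explained by exactly one prediction. The sketch below illustrates that idea only: it searches all assignments by brute force rather than using the Hungarian algorithm, and its cost (negative class probability plus L1 box distance) is a simplification of the matching cost used in DETR. The dictionary layout and the names `match_cost` and `match` are hypothetical.

```python
from itertools import permutations


def match_cost(pred, target, box_weight=1.0):
    """Lower is better: reward high probability for the target class, penalize box error."""
    class_cost = -pred["probs"][target["label"]]
    box_cost = sum(abs(p - t) for p, t in zip(pred["box"], target["box"]))
    return class_cost + box_weight * box_cost


def match(preds, targets):
    """Return {prediction index: target index} minimizing the total cost.

    Brute force over all assignments; only practical for a handful of objects.
    """
    if len(targets) > len(preds):
        raise ValueError("need at least as many predictions as targets")
    best, best_cost = (), float("inf")
    for perm in permutations(range(len(preds)), len(targets)):
        cost = sum(match_cost(preds[p], targets[t]) for t, p in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return {p: t for t, p in enumerate(best)}


preds = [
    {"probs": {"car": 0.9, "person": 0.1}, "box": (0.5, 0.5, 0.2, 0.2)},
    {"probs": {"car": 0.2, "person": 0.7}, "box": (0.1, 0.1, 0.05, 0.1)},
    {"probs": {"car": 0.1, "person": 0.1}, "box": (0.9, 0.9, 0.1, 0.1)},
]
targets = [
    {"label": "person", "box": (0.1, 0.1, 0.05, 0.1)},
    {"label": "car", "box": (0.5, 0.5, 0.2, 0.2)},
]
assignment = match(preds, targets)
```

Predictions left unmatched (index 2 here) are trained to predict "no object", which is what removes the need for duplicate suppression at inference time.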

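3. 3D Object Detection

3.1 LiDAR-Based 3D Detection

PointPillars (Lang et al., 2019): groups the point cloud into vertical pillars and processes them with 2D convolutions, achieving high inference speed.
CenterPoint (Yin et al., 2021): an anchor-free method that predicts object centers as a heatmap and integrates 3D detection and tracking.
VoxelNet (Zhou & Tuzel, 2018): converts the point cloud into voxels and processes them with a 3D convolutional neural network.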
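
The grouping step that pillar-based methods start from can be sketched as follows. This is a minimal pure-Python illustration of binning points into vertical columns on the ground plane; the function name `pillarize`, the grid ranges, the pillar size, and the per-pillar point cap are illustrative assumptions rather than the configuration of any published model.

```python
from collections import defaultdict


def pillarize(points, x_range=(0.0, 70.0), y_range=(-40.0, 40.0),
              pillar_size=0.5, max_points=32):
    """Group (x, y, z) points into vertical pillars indexed by (ix, iy).

    Height is not discretized: every point above a ground cell lands in the
    same pillar, which is what lets later stages use 2D convolutions.
    """
    pillars = defaultdict(list)
    for x, y, z in points:
        # Drop points outside the detection range.
        if not (x_range[0] <= x < x_range[1] and y_range[0] <= y < y_range[1]):
            continue
        ix = int((x - x_range[0]) // pillar_size)
        iy = int((y - y_range[0]) // pillar_size)
        # Cap the number of points per pillar to bound memory (assumed policy).
        if len(pillars[(ix, iy)]) < max_points:
            pillars[(ix, iy)].append((x, y, z))
    return dict(pillars)


cloud = [(0.5, 0.5, 0.0), (0.7, 0.2, 1.0), (3.2, -1.5, 0.3), (100.0, 0.0, 0.0)]
grid = pillarize(cloud, x_range=(0.0, 10.0), y_range=(-5.0, 5.0), pillar_size=1.0)
```

A voxel-based method would add a third index along z, which preserves vertical structure at the cost of a 3D grid and 3D convolutions.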

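3.2 Camera-Based 3D Detection

BEVFormer (Li et al., 2022): uses a transformer to generate bird's-eye-view (BEV) features from multi-camera images and detects 3D objects from them.
BEVDet (Huang et al., 2022): projects image features into 3D space with the LSS (Lift-Splat-Shoot) approach to build a BEV representation.
PETR (Liu et al., 2022): extends position encoding to 3D space to perform multi-camera 3D detection.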
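
All of these methods depend on the geometric relationship between 3D locations and image pixels, whether features are lifted from the image into 3D or 3D locations are used to look up image features. The following is a minimal sketch of that relationship using a pinhole camera model; it assumes the point is already expressed in the camera frame (z pointing forward) and ignores lens distortion. The function name and the intrinsics in the example are hypothetical.

```python
def project_to_image(point_cam, fx, fy, cx, cy, width, height):
    """Project a 3D point in the camera frame to pixel coordinates.

    Returns (u, v), or None if the point is behind the camera or falls
    outside the image.
    """
    x, y, z = point_cam
    if z <= 0:
        return None  # behind the image plane, not visible
    u = fx * x / z + cx
    v = fy * y / z + cy
    if 0 <= u < width and 0 <= v < height:
        return (u, v)
    return None


# Hypothetical intrinsics for a 1600x900 camera.
pixel = project_to_image((2.0, 1.0, 10.0), fx=1000.0, fy=1000.0,
                         cx=800.0, cy=450.0, width=1600, height=900)
```

In a multi-camera rig, the same 3D location is projected into every camera, and only the cameras for which the result is not `None` contribute features for that location.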

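3.3 Sensor-Fusion-Based 3D Detection

These methods combine camera and LiDAR data to perform 3D detection. BEVFusion (Liu et al., 2023) fuses the BEV features extracted separately from the camera and the LiDAR to improve detection performance.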
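
Fusing in BEV is convenient because both modalities end up on the same ground-plane grid, so their features can be combined cell by cell. The sketch below illustrates one simple way to do this: concatenate the two feature vectors in each cell and mix them with a linear map (the equivalent of a 1x1 convolution). It is an illustration of the idea under these assumptions, not the fusion module of any specific model; `fuse_bev` and the weight layout are hypothetical.

```python
def fuse_bev(camera_bev, lidar_bev, weights):
    """Fuse two BEV grids of identical height and width.

    camera_bev, lidar_bev: nested lists shaped H x W x C_cam and H x W x C_lidar.
    weights: C_out rows, each of length C_cam + C_lidar (a per-cell linear map).
    """
    fused = []
    for cam_row, lidar_row in zip(camera_bev, lidar_bev):
        row = []
        for cam_feat, lidar_feat in zip(cam_row, lidar_row):
            # Channel-wise concatenation of the two modalities in this cell.
            concat = list(cam_feat) + list(lidar_feat)
            # Linear mixing of the concatenated channels.
            row.append([sum(w * f for w, f in zip(w_row, concat)) for w_row in weights])
        fused.append(row)
    return fused


camera_bev = [[[1.0, 2.0]]]   # 1 x 1 grid, 2 camera channels
lidar_bev = [[[3.0]]]         # 1 x 1 grid, 1 LiDAR channel
weights = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]
fused = fuse_bev(camera_bev, lidar_bev, weights)
```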

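4. Requirements in Autonomous Driving

Object detection for autonomous driving has the following additional requirements.

Real-time processing: detection must run continuously while driving, with latency low enough not to delay downstream decisions.
Long-range detection: to leave enough reaction time at highway speeds, objects must be detected at long range, 200 m or more.
Varied object sizes: objects ranging from large trucks to small animals must be detected.
Robustness to adverse conditions: stable detection performance is required in rain, snow, fog, at night, and in other poor conditions.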
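
The link between detection range and reaction time can be checked with simple kinematics. The sketch below computes how long it takes to close a given range and how far a vehicle travels before stopping; the 1.0 s reaction delay and 6 m/s^2 deceleration are assumed round numbers for illustration, not values from a standard.

```python
def time_to_reach_s(range_m, ego_kmh, other_kmh=0.0):
    """Seconds until the gap closes, for an object approaching at other_kmh
    (0 for a stationary object)."""
    closing_speed = (ego_kmh + other_kmh) / 3.6  # km/h to m/s
    return range_m / closing_speed


def stopping_distance_m(speed_kmh, reaction_s=1.0, decel_mps2=6.0):
    """Distance covered during the reaction delay plus braking distance v^2 / (2a)."""
    v = speed_kmh / 3.6
    return v * reaction_s + v * v / (2.0 * decel_mps2)


t_static = time_to_reach_s(200.0, 120.0)             # stationary obstacle
t_oncoming = time_to_reach_s(200.0, 120.0, 120.0)    # oncoming vehicle at the same speed
d_stop = stopping_distance_m(120.0)
```

Under these assumptions, a 200 m detection range at 120 km/h leaves about 6 s before reaching a stationary obstacle and about 3 s for an oncoming vehicle at the same speed, while stopping takes roughly 126 m. The same arithmetic shows why latency matters: every 100 ms of processing delay at this speed corresponds to more than 3 m of travel.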

5. References

Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-end object detection with transformers. Proceedings of the European Conference on Computer Vision (ECCV), 213–229.
Lang, A. H., Vora, S., Caesar, H., Zhou, L., Yang, J., & Beijbom, O. (2019). PointPillars: Fast encoders for object detection from point clouds. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 12697–12705.
Li, Z., Wang, W., Li, H., Xie, E., Sima, C., Lu, T., … & Dai, J. (2022). BEVFormer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. Proceedings of the European Conference on Computer Vision (ECCV), 1–18.
Lin, T. Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal loss for dense object detection. Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2980–2988.
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C. Y., & Berg, A. C. (2016). SSD: Single shot multibox detector. Proceedings of the European Conference on Computer Vision (ECCV), 21–37.
Liu, Z., Tang, H., Amini, A., Yang, X., Mao, H., Rus, D., & Han, S. (2023). BEVFusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation. Proceedings of the IEEE International Conference on Robotics and Automation (ICRA).
Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 779–788.
Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 28, 91–99.
Yin, T., Zhou, X., & Krähenbühl, P. (2021). Center-based 3D object detection and tracking. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 11784–11793.
Zhou, Y., & Tuzel, O. (2018). VoxelNet: End-to-end learning for point cloud based 3D object detection. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 4490–4499.
