Hi
I have a few questions about evaluation, input resolution, keypoint design, and reliability:
1) Input resolution and expected gains
- RTMWPose-x reports 70.2 AP on whole‑body estimation using 384×288 inputs. Many public datasets are at most ~640×640 (relatively low resolution). If someone uses a higher‑resolution dataset (e.g. 1280×720 or 1920×1080) and scales the network inputs accordingly (e.g. 768×576 or 1152×864), should we expect improved performance? The Sapiens model uses 1024×768 inputs and gets better results — do you think that improvement is mainly due to larger input size, and might RTMWPose also benefit on high‑res datasets?
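One practical detail when scaling inputs this way: top-down pose backbones typically require input sizes divisible by the network stride, so scaled sizes should be rounded accordingly. A minimal sketch (the stride-of-32 constraint is an assumption about the backbone, not something confirmed for RTMW):

```python
# Sketch: scale a 384x288 (H x W) model input by a factor while keeping
# both dimensions divisible by a common backbone stride. Whether RTMW
# actually requires stride-32 alignment is an assumption here.
BASE_H, BASE_W = 384, 288
STRIDE = 32

def scaled_input(factor: float) -> tuple[int, int]:
    """Round a scaled input size down to the nearest stride multiple."""
    h = int(BASE_H * factor) // STRIDE * STRIDE
    w = int(BASE_W * factor) // STRIDE * STRIDE
    return h, w

print(scaled_input(2.0))  # -> (768, 576)
print(scaled_input(3.0))  # -> (1152, 864)
```

Note that the 2x and 3x sizes mentioned above happen to stay stride-aligned, so they would be drop-in candidates if retraining at higher resolution.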
2) Annotation noise and AP reliability
- Since most datasets are annotated manually, validation labels can contain errors. Does that make reported average precision less reliable or meaningful? If annotations were produced more systematically (or with fewer mistakes), would AP scores likely increase? In your experiments, did you use cleaner training data or a stricter/cleaner validation pipeline?
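To make the question concrete: under COCO's OKS metric, annotation noise alone puts a ceiling on the score even a perfect predictor can reach. A small simulation sketch (the per-keypoint constant, object scale, and noise levels below are illustrative assumptions, not dataset statistics):

```python
import numpy as np

# Sketch: how Gaussian annotation noise caps the measurable OKS of a
# *perfect* predictor, using COCO's OKS formula
#   ks_i = exp(-d_i^2 / (2 * s^2 * k_i^2)).
rng = np.random.default_rng(0)

def oks(pred, gt, s, k):
    """Mean per-keypoint similarity between prediction and labels."""
    d2 = np.sum((pred - gt) ** 2, axis=-1)
    return float(np.mean(np.exp(-d2 / (2 * s**2 * k**2))))

k = 0.05           # per-keypoint constant (COCO's range is ~0.025-0.107)
s = 100.0          # object scale in pixels (sqrt of bbox area), assumed
true = rng.uniform(0, 500, size=(17, 2))   # "true" joint locations

for label_noise_px in (0.0, 2.0, 5.0, 10.0):
    noisy_gt = true + rng.normal(0, label_noise_px, size=true.shape)
    # A model predicting the true locations exactly is still scored
    # against the noisy labels:
    print(label_noise_px, round(oks(true, noisy_gt, s, k), 3))
```

Even a few pixels of label jitter visibly depresses OKS at this object scale, which suggests cleaner validation labels would indeed shift reported AP upward.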
3) Adding extra, fixed “skin” keypoints to improve accuracy
- To reach higher precision and reliability, have you considered adding additional keypoints on the person’s “skin” at fixed, well‑defined positions to better reconstruct the skeleton? For example, placing two points on the forearm (one on the dorsal side and one on the ventral side) can reveal forearm rotation without fitting an SMPL model. This idea is similar in spirit to the Triplet Representation (TRB) paper (https://openaccess.thecvf.com/content_ICCV_2019/papers/Duan_TRB_A_Novel_Triplet_Representation_for_Understanding_2D_Human_Body_ICCV_2019_paper.pdf). Could that approach work here? With more fixed surface keypoints, could you reconstruct joint poses more faithfully without having to fit a mesh, simply because you have more direct measurements?
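As a sketch of why two surface points expose forearm rotation: their projected separation perpendicular to the bone axis shrinks as the limb rolls. The function below is purely illustrative (the point names and the known forearm diameter are assumptions, not an RTMW or TRB API), and it keeps the inherent two-fold front/back ambiguity of a single 2D view:

```python
import numpy as np

# Sketch: estimate rotation about the elbow->wrist axis from two fixed
# surface keypoints (dorsal and ventral) observed in a 2D image.

def forearm_roll(elbow, wrist, dorsal, ventral, diameter):
    """Roll angle in degrees; 0 means the dorsal/ventral axis lies in
    the image plane. Sign (toward/away from camera) is ambiguous in 2D."""
    axis = wrist - elbow
    axis = axis / np.linalg.norm(axis)
    normal = np.array([-axis[1], axis[0]])     # in-image perpendicular
    # Signed offsets of the surface points from the bone line:
    off_d = np.dot(dorsal - elbow, normal)
    off_v = np.dot(ventral - elbow, normal)
    # Projected separation is diameter * cos(roll):
    sep = (off_d - off_v) / diameter
    return float(np.degrees(np.arccos(np.clip(sep, -1.0, 1.0))))

elbow, wrist = np.array([0.0, 0.0]), np.array([100.0, 0.0])
print(forearm_roll(elbow, wrist,
                   np.array([50.0, 5.0]), np.array([50.0, -5.0]),
                   diameter=20.0))  # -> 60.0
```

So with a few such surface pairs per limb segment, segment orientation could plausibly be read off directly, without fitting an SMPL mesh.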
4) Per‑joint confidence and filtering spurious joints
- Is it possible to output per‑joint confidence scores (or to expose them if they already exist)? Sometimes RTMWPose produces wildly incorrect joints when a subject is cropped or partially visible; with a confidence threshold you could exclude low‑confidence joints. Do you support that, or is there a recommended way to filter spurious detections?
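For context, most top-down frameworks (including MMPose, which RTMW ships with) already return a per-keypoint score alongside each coordinate; the exact field names vary by API, so the sketch below just operates on plain arrays of shape (K, 2) and (K,):

```python
import numpy as np

# Sketch: drop low-confidence joints given per-keypoint scores.
# The threshold value is an assumption to tune per use case.

def filter_keypoints(keypoints, scores, thr=0.3):
    """Replace joints scoring below `thr` with NaN so downstream code
    can skip them; also return the validity mask."""
    kpts = np.asarray(keypoints, dtype=float).copy()
    low = np.asarray(scores) < thr
    kpts[low] = np.nan
    return kpts, ~low

kpts, valid = filter_keypoints([[120.0, 80.0], [640.0, -30.0]],
                               scores=[0.92, 0.08])
print(valid)  # -> [ True False]
```

Masking with NaN rather than deleting rows keeps the keypoint indexing intact, which matters if later code assumes a fixed skeleton layout.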
Thanks in advance for any insights — I appreciate any details about experiments, limitations, or suggested directions.