Some high-level thoughts on computer vision research
- Work on a supertask instead of small/fake/intermediate tasks. For example, DUSt3R, GPT-4, etc.
- When a task is important enough, try to build a so-called foundation model with scalable training -- that is, training on large-scale, freely available unlabeled data.
- Mid-level vs. high-level vision tasks. The current best recipe for mid-level tasks is (1) pretraining on synthetic data and (2) unsupervised training on unlabeled data, where the backbone can be a pretrained SSL model (e.g., DINOv2). For high-level vision tasks, SSL pretraining plus fine-tuning dominates. Although high-level vision is the main focus of SSL, mid-level vision still has many applications.
- Industry prefers lightweight, fast models; academia cares more about novelty. If a model is not worth building, it is not worth building fast.
- It is extremely useful to talk to other researchers about their long-term goals, research agendas, current projects, and their thoughts on your projects. People in the same subfield tend to have very similar research ideas, which is probably why the community is so competitive and why being unique is both hard and highly rewarding.
- Trends in the vision community: moving toward architectures with less inductive bias that are more data-driven (e.g., CNN vs. ViT); from specific sub-tasks to a supertask (e.g., optical flow vs. dense tracking); from well-designed, fancy architectures on small data to generic architectures trained on large-scale unlabeled data.
- Read widely, fast, and deeply! I talked with the chair of our department and felt bad that I had not read enough detail on the topics behind his questions (e.g., Mamba, KAN). Be ready to answer any question in your field; this requires reading widely, deeply, and quickly.
- The most interesting papers I have written are the ones with unexpected experimental results (e.g., genuine ambiguity rather than code bugs!) or unexpected observations (e.g., bias in annotations).
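The frozen-backbone-plus-lightweight-head recipe mentioned above (e.g., fitting a small task head on top of frozen DINOv2 features) can be sketched in miniature. This is a hypothetical illustration, not DINOv2 itself: random vectors stand in for the frozen SSL features, and the "head" is a linear probe fit by least squares.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins: 200 samples of 64-dim "frozen backbone features", 3 classes.
# In practice these would come from a pretrained SSL model with its
# weights frozen; only the head below would be trained.
n, d, k = 200, 64, 3
features = rng.normal(size=(n, d))

# Synthetic labels that are linearly decodable from the features.
true_w = rng.normal(size=(d, k))
labels = (features @ true_w).argmax(axis=1)

# Linear probe: least-squares regression to one-hot targets.
targets = np.eye(k)[labels]
w, *_ = np.linalg.lstsq(features, targets, rcond=None)

# Predict by taking the argmax of the head's scores.
preds = (features @ w).argmax(axis=1)
accuracy = (preds == labels).mean()
```

The point of the sketch is the division of labor: all representational capacity lives in the (frozen) backbone, so the trainable head can be tiny and cheap, which is exactly why this recipe scales well with unlabeled pretraining data.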