# PLIP
**Repository Path**: luoxihaha/PLIP
## Basic Information
- **Project Name**: PLIP
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2023-10-15
- **Last Updated**: 2023-10-16
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# PLIP
**PLIP** is a novel **L**anguage-**I**mage **P**re-training framework for generic **P**erson representation learning which benefits a range of downstream person-centric tasks.
Also, we present a large-scale person dataset named **SYNTH-PEDES** to verify its effectiveness, where the Stylish Pedestrian Attributes-union Captioning method **(SPAC)** is proposed to synthesize diverse textual descriptions.
Experiments show that our model not only significantly improves existing methods on downstream tasks, but also shows great ability in the few-shot and domain generalization settings. More details can be found at our paper [PLIP: Language-Image Pre-training for Person Representation Learning](http://export.arxiv.org/abs/2305.08386).

## News
* 🔥[06.1] The **SYNTH-PEDES** is released. Welcome to download and use!
* 🔥[06.1] The code for **CMPM/C fine-tuning** is released! It leads to SOTA performance without bells and whistles!
* 🔥[05.31] The pre-trained model and **zero-shot inference** code are released !
## SYNTH-PEDES
SYNTH-PEDES is by far the largest person dataset with textual descriptions without any human annotation effort. Every person image has 2 or 3 different texutal descriptions and 6 attribute annotations. The dataset is released at [Baidu Yun](https://pan.baidu.com/s/11jQ3gvkn77b3jjVx-quQxQ?pwd=1037).
**Note that SYNTH-PEDES can only be used for research, any commercial usage is forbidden.**
This is the comparison of SYNTH-PEDES with other popular datasets.

These are some examples of our SYNTH-PEDES dataset.

Annotation format:
```
{
"id": 7,
"file_path": "Part1/7/1.jpg",
"attributes": [
"man,black hair,black shirt,pink shorts,black shoes,unknown"
],
"captions": [
"A man in his mid-twenties with short black hair is wearing a black t-shirt over light pink trousers. He is also wearing black shoes.",
"The man with short black hair is wearing a black shirt and salmon pink shorts. He is also wearing black shoes."
],
"prompt_caption": [
"A man with black hair is wearing a black shirt with pink shorts and a pair of black shoes."
]
}
```
## Models
We utilize ResNet50 and Bert as our encoders. After pre-training, we fine-tune and evaluate the performance on three downstream tasks. The checkpoints have been released at [Baidu Yun](https://pan.baidu.com/s/1LjT-x6kjGwpO2EP4Ni7bCA?pwd=1037) and [Google Drive](https://drive.google.com/file/d/1Cpid6AGHXF_is5ULB3UJKMGvl6kf2Tmg/view?usp=sharing).
### CUHK-PEDES dataset (Text Re-ID R@1/R@10)
| Pre-train | CMPM/C | SSAN | LGUR |
| :---: |:---: |:---: | :---:
| IN sup | 54.81/83.22 | 61.37/86.73 | 64.21/87.93
| IN unsup |55.34/83.76| 61.97/86.63| 65.33/88.47
| CLIP |55.67/83.82| 62.09/86.89| 64.70/88.76
| LUP |57.21/84.68| 63.91/88.36| 65.42/89.36
| LUP-NL |57.35/84.77| 63.71/87.46| 64.68/88.69
| **PLIP(ours)** |**69.23/91.16**| **64.91/88.39**| **67.22/89.49**
### ICFG-PEDES dataset (Text Re-ID R@1/R@10)
| Pre-train | CMPM/C | SSAN | LGUR |
| :---: |:---: |:---: | :---:
| IN sup | 47.61/75.48| 54.23/79.53| 57.42/81.45
| IN unsup |48.34/75.66| 55.27/79.64| 59.90/82.94
| CLIP |48.12/75.51| 53.58/78.96| 58.35/82.02
| LUP |50.12/76.23| 56.51/80.41| 60.33/83.06
| LUP-NL |49.64/76.15| 55.59/79.78| 60.25/82.84
| **PLIP(ours)** |**64.25/86.32**| **60.12/82.84**| **62.27/83.96**
### Market1501 & DukeMTMC (Image Re-ID mAP/cmc1)
| Methods | Market1501 | DukeMTMC |
| :---: |:---: |:---:
| BOT | 85.9/94.5 |76.4/86.4
| BDB |86.7/95.3| 76.0/89.0
| MGN |87.5/95.1 |79.4/89.0
| ABDNet |88.3/95.6| 78.6/89.0
| **PLIP+BOT** | 88.0/95.1| 77.0/86.5
| **PLIP+BDB** |88.4/95.7| 78.2/89.8
| **PLIP+MGN** |90.6/96.3| **81.7**/90.3
| **PLIP+ABDNet**|**91.2**/**96.7** |81.6/**90.9**
### Evaluate on PETA & PA-100K & RAP (PAR mA/F1)
| Methods | PETA | PA-100K | RAP
| :---: |:---: |:---: |:---:
| DeepMAR | 80.14/83.56| 78.28/84.32| 76.81/78.94
| Rethink |83.96/86.35 |80.21/87.40 |79.27/79.95
| VTB |84.12/86.63| 81.02/87.31| 81.43/80.63
| Label2Label |84.08/86.57 |82.24/87.08| 81.82/80.93
| **PLIP+DeepMAR** | 82.46/85.87 |80.33/87.24 |78.96/80.12
| **PLIP+Rethink**|85.56/87.63| 82.09/88.12| 81.87/81.53
| **PLIP+VTB** |86.03/**88.14**| 83.24/88.57 |83.64/**81.78**
| **PLIP+Label2Label** |**86.12**/88.08 |**84.36**/**88.63**| **83.77**/81.49
## Usage
### Install Requirements
we use 4 RTX3090 24G GPU for training and evaluation.
Create conda environment.
```
conda create --name PLIP --file requirements.txt
conda activate PLIP
```
### Datasets Prepare
Download the CUHK-PEDES dataset from [here](https://github.com/ShuangLI59/Person-Search-with-Natural-Language-Description) and ICFG-PEDES dataset from [here](https://github.com/zifyloo/SSAN).
Organize them in `data` folder as follows:
```
|-- data/
| |-- /
| |-- imgs
| |-- cam_a
| |-- cam_b
| |-- ...
| |-- reid_raw.json
|
| |-- /
| |-- imgs
| |-- test
| |-- train
| |-- ICFG_PEDES.json
|
| |-- /
| |-- Part1
| |-- ...
| |-- Part11
| |-- synthpedes_dataset.json
```
### Zero-shot Inference
Our pre-trained model can directly be transfered to downstream tasks, especially text-based Re-ID.
1. Run the python file and generate train/test/valid json files respectively.
```
python dataset_split.py
```
2. Then you can evaluate by running:
```
python zs_inference.py
```
### Fine-tuning Inference
Almost all existing downstream person-centric methods can be improved through replacing the backbone with our pre-trained model. Taking CMPM/C as example:
1. Go to the CMPM/C root:
```
cd Downstreams/CMPM-C
```
2. Run the following to train and test. Note that you can modify the code yourself for single GPU training:
```
python dataset_split.py
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 train.py
python test.py
```
### Evaluate on Other Methods and Tasks.
By simply replacing the visual backbone with our pre-trained model, almost all existing methods on downstream tasks make significant improvements. For example, you can try by the following repositories:
**Text-based Re-ID:**
[SSAN](https://github.com/zifyloo/SSAN), [LGUR](https://github.com/ZhiyinShao-H/LGUR)
**Image-based Re-ID:**
[BOT](https://github.com/michuanhaohao/reid-strong-baseline), [MGN](https://github.com/seathiefwang/MGN-pytorch), [ABD-Net](https://github.com/VITA-Group/ABD-Net)
**Person Attribute Recognition:**
[Rethink](https://github.com/valencebond/Rethinking_of_PAR), [Label2label](https://github.com/Li-Wanhua/Label2Label/tree/main/Pedestrian_Attribute), [VTB](https://github.com/cxh0519/VTB)
## Reference
If you use PLIP in your research, please cite it by the following BibTeX entry:
```
@misc{zuo2023plip,
title={PLIP: Language-Image Pre-training for Person Representation Learning},
author={Jialong Zuo and Changqian Yu and Nong Sang and Changxin Gao},
year={2023},
eprint={2305.08386},
archivePrefix={arXiv},
primaryClass={cs.CV}
}