/home/aiscuser/.local/lib/python3.8/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.24.4)
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
2023/07/19 14:23:22 WARNING mlflow.utils.autologging_utils: You are using an unsupported version of transformers. If you encounter errors during autologging, try upgrading / downgrading transformers to a supported version, or try upgrading MLflow.
2023/07/19 14:23:23 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.
2023/07/19 14:23:23 INFO mlflow.tracking.fluent: Autologging successfully enabled for transformers.
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
Downloading and preparing dataset glue/mrpc to /home/aiscuser/.cache/huggingface/datasets/glue/mrpc/1.0.0/a420f5e518f42454003587c47467370329f9fc0c6508d1ae0c45b58ea266a353...
Training Arguments TrainingArguments( _n_gpu=1, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, bf16=False, bf16_full_eval=False, dataloader_drop_last=False, dataloader_num_workers=0, dataloader_pin_memory=True, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=None, debug=[], deepspeed=None, disable_tqdm=False, do_eval=True, do_predict=False, do_train=True, eval_accumulation_steps=None, eval_steps=50, evaluation_strategy=IntervalStrategy.STEPS, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, gradient_accumulation_steps=1, gradient_checkpointing=False, greater_is_better=None, group_by_length=False, half_precision_backend=auto, hub_model_id=None, hub_strategy=HubStrategy.EVERY_SAVE, hub_token=, ignore_data_skip=False, label_names=None, label_smoothing_factor=0.0, learning_rate=6e-05, length_column_name=length, load_best_model_at_end=False, local_rank=-1, log_level=40, log_level_replica=-1, log_on_each_node=True, logging_dir=/mnt/data/device-aware-bert/token_pruning/experiments/MRPC/reproduce1/s0.67_lr6e-05_reglr0.01_alpha0.0002_warmup150_bin50/runs/Jul19_14-23-23_node-0, logging_first_step=False, logging_nan_inf_filter=True, logging_steps=25, logging_strategy=IntervalStrategy.STEPS, lr_scheduler_type=SchedulerType.LINEAR, max_grad_norm=1.0, max_steps=-1, metric_for_best_model=None, mp_parameters=, no_cuda=False, num_train_epochs=200.0, optim=OptimizerNames.ADAMW_HF, output_dir=/mnt/data/device-aware-bert/token_pruning/experiments/MRPC/reproduce1/s0.67_lr6e-05_reglr0.01_alpha0.0002_warmup150_bin50, overwrite_output_dir=True, past_index=-1, per_device_eval_batch_size=32, per_device_train_batch_size=32, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=, remove_unused_columns=True, report_to=['mlflow'], resume_from_checkpoint=None, run_name=/mnt/data/device-aware-bert/token_pruning/experiments/MRPC/reproduce1/s0.67_lr6e-05_reglr0.01_alpha0.0002_warmup150_bin50, save_on_each_node=False, save_steps=0, save_strategy=IntervalStrategy.STEPS, save_total_limit=None, seed=57, sharded_ddp=[], skip_memory_metrics=True, tf32=None, tpu_metrics_debug=False, tpu_num_cores=None, use_legacy_prediction_loop=False, warmup_ratio=0.0, warmup_steps=0, weight_decay=0.0, xpu_backend=None, )
Additional Arguments AdditionalArguments(test=False, ex_name='s0.67_lr6e-05_reglr0.01_alpha0.0002_warmup150_bin50',
pruning_type='token+pruner', reg_learning_rate=0.01, scheduler_type='linear', freeze_embeddings=True, pretrained_pruned_model=None, droprate_init=0.01, temperature=0.6666666666666666, prepruning_finetune_epochs=1, lagrangian_warmup_epochs=150, target_sparsity=0.67, sparsity_epsilon=0, distillation_path='/mnt/data/device-aware-bert/token_pruning/teachers/MRPC', do_distill=True, do_layer_distill=False, layer_distill_version=4, distill_loss_alpha=0.9, distill_ce_loss_alpha=0.0002, distill_temp=2.0, use_mac_l0=True, prune_location=[2, 3, 4, 5, 6, 7, 8, 9, 10, 11], bin_num=50, topk=20) ---------------------------------------------------------------------- time: 2023-07-19 14:23:56 Evaluating: f1: 0.8981, eval_loss: 0.4779, step: 0 lambda_1: 0.0000, lambda_2: 0.0000 lambda_3: 0.0000 Starting l0 regularization! using , temperature: 0.67, init drop rate: 0.01 token_loga shape: [10, 50] prune location: [2, 3, 4, 5, 6, 7, 8, 9, 10, 11] NDCG TOPK= 20 loss: 0.844315, lagrangian_loss: -0.000053, attention_score_distillation_loss: 0.001977 ---------------------------------------------------------------------- time: 2023-07-19 14:24:11 Evaluating: f1: 0.8975, eval_loss: 0.4174, token_prune_loc: [False, False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.5906, target_sparsity: 0.0019, step: 50 lambda_1: -0.1599, lambda_2: 0.4514 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 loss: 0.331014, lagrangian_loss: 0.000220, attention_score_distillation_loss: 0.001975 loss: 0.398958, lagrangian_loss: 0.001112, attention_score_distillation_loss: 0.001971 ---------------------------------------------------------------------- time: 2023-07-19 14:24:25 Evaluating: f1: 0.862, eval_loss: 0.6741, token_prune_loc: [False, False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.5906, target_sparsity: 0.0038, step: 100 lambda_1: -0.9178, lambda_2: 1.2933 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 loss: 0.927653, lagrangian_loss: 0.002207, attention_score_distillation_loss: 0.001969 ETA: 1:48:18 | Epoch 0 finished. Took 32.66 seconds. 
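The "Starting l0 regularization!" line above reports temperature 0.67, init drop rate 0.01, and a token_loga tensor of shape [10, 50] (one gate per pruned layer and token bin). Below is a minimal sketch of how per-bin keep probabilities are typically derived from such log-alpha parameters under the standard hard-concrete (L0) gating of Louizos et al.; the stretch limits and helper name are assumptions, not taken from this codebase.

```python
import torch

# Assumed hard-concrete stretch limits; the actual repo may use different values.
LIMIT_L, LIMIT_R = -0.1, 1.1
TEMPERATURE = 0.6666666666666666   # temperature from the config
DROPRATE_INIT = 0.01               # droprate_init from the config

# One log-alpha per (pruned layer, token bin) -> shape [10, 50], initialised so the
# expected keep probability starts near 1 - droprate_init.
mean = torch.log(torch.tensor(1 - DROPRATE_INIT)) - torch.log(torch.tensor(DROPRATE_INIT))
token_loga = mean + 0.01 * torch.randn(10, 50)

def keep_probability(loga: torch.Tensor) -> torch.Tensor:
    """P(gate > 0) for a hard-concrete gate with the given log-alpha."""
    return torch.sigmoid(loga - TEMPERATURE * torch.log(torch.tensor(-LIMIT_L / LIMIT_R)))

# "train remain" per layer would then be the mean keep probability over the 50 bins.
print(keep_probability(token_loga).mean(dim=1))  # ~1.0 per layer, matching the early "train remain" rows
```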
loss: 0.056119, lagrangian_loss: 0.002750, attention_score_distillation_loss: 0.001965 ---------------------------------------------------------------------- time: 2023-07-19 14:24:40 Evaluating: f1: 0.881, eval_loss: 0.7317, token_prune_loc: [False, False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.5906, target_sparsity: 0.0057, step: 150 lambda_1: -1.5601, lambda_2: 1.9629 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 loss: 0.200542, lagrangian_loss: 0.002356, attention_score_distillation_loss: 0.001962 loss: 0.186767, lagrangian_loss: 0.000804, attention_score_distillation_loss: 0.001959 ---------------------------------------------------------------------- time: 2023-07-19 14:24:54 Evaluating: f1: 0.8981, eval_loss: 0.5048, token_prune_loc: [False, False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.5906, target_sparsity: 0.0077, step: 200 lambda_1: -1.7949, lambda_2: 2.1305 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 loss: 0.057280, lagrangian_loss: -0.001841, attention_score_distillation_loss: 0.001957 loss: 0.205831, lagrangian_loss: -0.004043, attention_score_distillation_loss: 0.001953 ETA: 1:47:39 | Epoch 1 finished. Took 32.59 seconds. ---------------------------------------------------------------------- time: 2023-07-19 14:25:09 Evaluating: f1: 0.8988, eval_loss: 0.4738, token_prune_loc: [False, False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.5906, target_sparsity: 0.0096, step: 250 lambda_1: -1.1944, lambda_2: 2.8107 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.99 1. 0.99 0.99 1. 
] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 loss: 0.085462, lagrangian_loss: -0.005739, attention_score_distillation_loss: 0.001950 loss: 0.047988, lagrangian_loss: -0.004037, attention_score_distillation_loss: 0.001947 ---------------------------------------------------------------------- time: 2023-07-19 14:25:23 Evaluating: f1: 0.8912, eval_loss: 0.4565, token_prune_loc: [False, False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.5906, target_sparsity: 0.0116, step: 300 lambda_1: -0.0979, lambda_2: 4.2245 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.99 1. 0.99 0.98 1. ] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 loss: 0.102424, lagrangian_loss: -0.000476, attention_score_distillation_loss: 0.001944 loss: 0.050687, lagrangian_loss: 0.002352, attention_score_distillation_loss: 0.001941 ETA: 1:46:59 | Epoch 2 finished. Took 32.51 seconds. ---------------------------------------------------------------------- time: 2023-07-19 14:25:38 Evaluating: f1: 0.9016, eval_loss: 0.4389, token_prune_loc: [False, False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.5906, target_sparsity: 0.0135, step: 350 lambda_1: 0.7380, lambda_2: 5.1095 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.99 1. 0.99 0.99 1. 
] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 loss: 0.027416, lagrangian_loss: 0.002573, attention_score_distillation_loss: 0.001939 loss: 0.040867, lagrangian_loss: 0.000737, attention_score_distillation_loss: 0.001938 ---------------------------------------------------------------------- time: 2023-07-19 14:25:52 Evaluating: f1: 0.913, eval_loss: 0.3877, token_prune_loc: [False, False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.5906, target_sparsity: 0.0155, step: 400 lambda_1: 0.9770, lambda_2: 5.2526 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.99 1. 0.99 0.99 1. ] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 loss: 0.029649, lagrangian_loss: -0.001590, attention_score_distillation_loss: 0.001933 loss: 0.028804, lagrangian_loss: -0.002968, attention_score_distillation_loss: 0.001931 ---------------------------------------------------------------------- time: 2023-07-19 14:26:06 Evaluating: f1: 0.8932, eval_loss: 0.4944, token_prune_loc: [False, False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.5906, target_sparsity: 0.0174, step: 450 lambda_1: 0.5783, lambda_2: 5.5013 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 1. 1. 1. 0.99 1. ] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 loss: 0.148262, lagrangian_loss: -0.002896, attention_score_distillation_loss: 0.001929 ETA: 1:48:10 | Epoch 3 finished. Took 34.71 seconds. 
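The logged target_sparsity grows by roughly 0.0019-0.0020 every 50 steps, which is consistent with a linear ramp from 0 to the configured target_sparsity=0.67 over lagrangian_warmup_epochs=150. A small sketch of that schedule, assuming ~115 optimiser steps per epoch (MRPC's ~3.7k training pairs at batch size 32); the step count and function name are illustrative.

```python
STEPS_PER_EPOCH = 115   # ~3668 MRPC pairs / batch size 32 (assumption)
WARMUP_EPOCHS = 150     # lagrangian_warmup_epochs from the config
FINAL_TARGET = 0.67     # target_sparsity from the config

def target_sparsity(step: int) -> float:
    """Linearly ramp the sparsity target during the Lagrangian warmup."""
    warmup_steps = WARMUP_EPOCHS * STEPS_PER_EPOCH
    return FINAL_TARGET * min(step, warmup_steps) / warmup_steps

for step in (50, 100, 2300):
    print(step, round(target_sparsity(step), 4))
# 50 -> 0.0019, 100 -> 0.0039, 2300 -> 0.0893 (close to the logged 0.0019, 0.0038, 0.0893)
```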
loss: 0.027041, lagrangian_loss: -0.001112, attention_score_distillation_loss: 0.001927 ---------------------------------------------------------------------- time: 2023-07-19 14:26:21 Evaluating: f1: 0.898, eval_loss: 0.6168, token_prune_loc: [False, False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.5906, target_sparsity: 0.0193, step: 500 lambda_1: -0.2234, lambda_2: 6.3774 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 1. 1. 1. 0.99 1. ] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 loss: 0.120027, lagrangian_loss: 0.002113, attention_score_distillation_loss: 0.001923 loss: 0.027218, lagrangian_loss: 0.006355, attention_score_distillation_loss: 0.001919 ---------------------------------------------------------------------- time: 2023-07-19 14:26:35 Evaluating: f1: 0.8897, eval_loss: 0.5281, token_prune_loc: [False, False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.5906, target_sparsity: 0.0213, step: 550 lambda_1: -1.1509, lambda_2: 7.5535 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 1. 1. 1. 0.99 1. ] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 loss: 0.028274, lagrangian_loss: 0.010838, attention_score_distillation_loss: 0.001915 ETA: 1:47:17 | Epoch 4 finished. Took 32.59 seconds. loss: 0.086222, lagrangian_loss: 0.014110, attention_score_distillation_loss: 0.001913 ---------------------------------------------------------------------- time: 2023-07-19 14:26:50 Evaluating: f1: 0.8862, eval_loss: 0.5352, token_prune_loc: [False, False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.5906, target_sparsity: 0.0232, step: 600 lambda_1: -1.9918, lambda_2: 8.5427 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.99 1. 
0.99 0.98 0.99] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 loss: 0.228697, lagrangian_loss: 0.013522, attention_score_distillation_loss: 0.001910 loss: 0.092849, lagrangian_loss: 0.005699, attention_score_distillation_loss: 0.001909 ---------------------------------------------------------------------- time: 2023-07-19 14:27:04 Evaluating: f1: 0.8998, eval_loss: 0.5424, token_prune_loc: [False, False, False, False, False, False, False, False, True, False], macs_sparsity: 0.0253, expected_sparsity: 0.0219, expected_sequence_sparsity: 0.5996, target_sparsity: 0.0252, step: 650 lambda_1: -2.3202, lambda_2: 8.8889 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.99 1. 0.99 0.94 0.96] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 1.0] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.9] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111110111011111111100 11111111111111111111111111111111111111111111111111 loss: 0.073946, lagrangian_loss: -0.020756, attention_score_distillation_loss: 0.001913 loss: 0.016344, lagrangian_loss: -0.046345, attention_score_distillation_loss: 0.001901 ETA: 1:46:31 | Epoch 5 finished. Took 32.63 seconds. ---------------------------------------------------------------------- time: 2023-07-19 14:27:19 Evaluating: f1: 0.877, eval_loss: 0.5658, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.0761, expected_sparsity: 0.075, expected_sequence_sparsity: 0.6213, target_sparsity: 0.0271, step: 700 lambda_1: -0.8683, lambda_2: 10.9040 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 
0.99 0.99 0.98 0.85 0.82] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.82, 0.68] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.82, 0.56] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101101011110110111011111110100 11111000111101011111111101011011110011110100011001 loss: 0.030512, lagrangian_loss: -0.016740, attention_score_distillation_loss: 0.001901 loss: 0.043250, lagrangian_loss: 0.017966, attention_score_distillation_loss: 0.001897 ---------------------------------------------------------------------- time: 2023-07-19 14:27:33 Evaluating: f1: 0.8679, eval_loss: 0.498, token_prune_loc: [False, False, False, False, False, False, False, False, True, False], macs_sparsity: 0.0337, expected_sparsity: 0.0306, expected_sequence_sparsity: 0.6031, target_sparsity: 0.0291, step: 750 lambda_1: 0.5728, lambda_2: 12.4660 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.99 0.99 0.98 0.89 0.9 ] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 1.0] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 0.86] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101111011111110111011111110100 11111111111111111111111111111111111111111111111111 loss: 0.138288, lagrangian_loss: 0.020294, attention_score_distillation_loss: 0.001894 loss: 0.012806, lagrangian_loss: 0.009389, attention_score_distillation_loss: 0.001895 ---------------------------------------------------------------------- time: 2023-07-19 14:27:48 Evaluating: f1: 0.8822, eval_loss: 0.5555, token_prune_loc: [False, False, False, False, False, False, False, False, True, False], macs_sparsity: 0.0211, expected_sparsity: 0.0175, expected_sequence_sparsity: 0.5978, target_sparsity: 0.031, step: 800 lambda_1: 1.0413, lambda_2: 12.7199 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.99 1. 0.99 0.94 0.96] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 1.0] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.92] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111110111011111111110 11111111111111111111111111111111111111111111111111 loss: 0.075717, lagrangian_loss: -0.000495, attention_score_distillation_loss: 0.001889 ETA: 1:46:49 | Epoch 6 finished. Took 34.79 seconds. 
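Each evaluation block prints lambda_1 and lambda_2 alongside expected_sparsity and target_sparsity, and the training lines print a lagrangian_loss that changes sign as the two drift apart. A minimal sketch of the CoFi-style Lagrangian penalty these quantities suggest; the exact form used by this codebase is an assumption.

```python
def lagrangian_loss(expected_sparsity: float, target_sparsity: float,
                    lambda_1: float, lambda_2: float) -> float:
    """First- and second-order penalty on missing the sparsity target
    (the form used in CoFi-style structured pruning; assumed here)."""
    gap = expected_sparsity - target_sparsity
    return lambda_1 * gap + lambda_2 * gap * gap

# With the values printed at step 700 (expected 0.075, target 0.0271,
# lambda_1 -0.8683, lambda_2 10.904) this gives ~-0.0166, close to the
# lagrangian_loss of -0.016740 logged just after that evaluation.
print(lagrangian_loss(0.075, 0.0271, -0.8683, 10.904))
```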
loss: 0.146381, lagrangian_loss: -0.006434, attention_score_distillation_loss: 0.001886 ---------------------------------------------------------------------- time: 2023-07-19 14:28:02 Evaluating: f1: 0.8846, eval_loss: 0.6134, token_prune_loc: [False, False, False, False, False, False, False, False, True, False], macs_sparsity: 0.0169, expected_sparsity: 0.0131, expected_sequence_sparsity: 0.596, target_sparsity: 0.0329, step: 850 lambda_1: 0.9062, lambda_2: 12.7535 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.99 1. 0.99 0.97 0.98] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.94, 1.0] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.94, 0.94] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111110111011111111110 11111111111111111111111111111111111111111111111111 loss: 0.114676, lagrangian_loss: -0.008415, attention_score_distillation_loss: 0.001884 loss: 0.013280, lagrangian_loss: -0.007489, attention_score_distillation_loss: 0.001882 ---------------------------------------------------------------------- time: 2023-07-19 14:28:17 Evaluating: f1: 0.8893, eval_loss: 0.5554, token_prune_loc: [False, False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.5906, target_sparsity: 0.0349, step: 900 lambda_1: 0.4872, lambda_2: 12.9000 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.99 1. 0.99 0.97 0.98] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 loss: 0.076615, lagrangian_loss: -0.004464, attention_score_distillation_loss: 0.001879 ETA: 1:46:01 | Epoch 7 finished. Took 32.58 seconds. loss: 0.166178, lagrangian_loss: -0.000106, attention_score_distillation_loss: 0.001874 ---------------------------------------------------------------------- time: 2023-07-19 14:28:31 Evaluating: f1: 0.8915, eval_loss: 0.5416, token_prune_loc: [False, False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.5906, target_sparsity: 0.0368, step: 950 lambda_1: -0.0527, lambda_2: 13.1468 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.99 1. 
0.99 0.97 0.98] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 loss: 0.028054, lagrangian_loss: 0.004809, attention_score_distillation_loss: 0.001872 loss: 0.064744, lagrangian_loss: 0.009240, attention_score_distillation_loss: 0.001867 ---------------------------------------------------------------------- time: 2023-07-19 14:28:45 Evaluating: f1: 0.8916, eval_loss: 0.5736, token_prune_loc: [False, False, False, False, False, False, False, False, True, False], macs_sparsity: 0.0211, expected_sparsity: 0.0175, expected_sequence_sparsity: 0.5978, target_sparsity: 0.0388, step: 1000 lambda_1: -0.6015, lambda_2: 13.4139 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.99 1. 0.99 0.96 0.97] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 1.0] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.92] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111110111011111111110 11111111111111111111111111111111111111111111111111 loss: 0.018114, lagrangian_loss: 0.011986, attention_score_distillation_loss: 0.001863 loss: 0.016452, lagrangian_loss: 0.011391, attention_score_distillation_loss: 0.001861 ETA: 1:45:14 | Epoch 8 finished. Took 32.47 seconds. ---------------------------------------------------------------------- time: 2023-07-19 14:29:00 Evaluating: f1: 0.8792, eval_loss: 0.5706, token_prune_loc: [False, False, False, False, False, False, False, False, True, False], macs_sparsity: 0.0253, expected_sparsity: 0.0219, expected_sequence_sparsity: 0.5996, target_sparsity: 0.0407, step: 1050 lambda_1: -1.0153, lambda_2: 13.5753 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.99 1. 
0.99 0.94 0.95] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 1.0] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.9] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111110111011111110110 11111111111111111111111111111111111111111111111111 loss: 0.152066, lagrangian_loss: 0.006394, attention_score_distillation_loss: 0.001865 loss: 0.022090, lagrangian_loss: -0.002813, attention_score_distillation_loss: 0.001855 ---------------------------------------------------------------------- time: 2023-07-19 14:29:14 Evaluating: f1: 0.89, eval_loss: 0.6287, token_prune_loc: [False, False, False, False, False, False, False, False, True, False], macs_sparsity: 0.0337, expected_sparsity: 0.0306, expected_sequence_sparsity: 0.6031, target_sparsity: 0.0426, step: 1100 lambda_1: -1.0236, lambda_2: 13.6107 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.99 0.99 0.98 0.9 0.88] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 1.0] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 0.86] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101111011111110111011111110100 11111111111111111111111111111111111111111111111111 loss: 0.014114, lagrangian_loss: -0.011022, attention_score_distillation_loss: 0.001853 loss: 0.095718, lagrangian_loss: -0.010398, attention_score_distillation_loss: 0.001850 ---------------------------------------------------------------------- time: 2023-07-19 14:29:29 Evaluating: f1: 0.884, eval_loss: 0.6404, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.0719, expected_sparsity: 0.0693, expected_sequence_sparsity: 0.619, target_sparsity: 0.0446, step: 1150 lambda_1: -0.4384, lambda_2: 13.9504 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.98 0.99 0.98 0.88 0.84] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.7] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.59] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101101011111110111011111110100 11111000111101011111111101011011110011110110011001 ETA: 1:45:10 | Epoch 9 finished. Took 34.59 seconds. 
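Once token_prune_loc turns True for the last layers, the 50-character rows of 0/1 after each evaluation read as per-bin keep masks for the ten pruned layers: "infer remain" matches the fraction of 1s in each row, and "layerwise remain" looks like the running product of those fractions (e.g. 0.84 and 0.84 x 0.7 = 0.59 at step 1150). A short sketch of that bookkeeping, assuming the masks are given as strings exactly as printed in the log.

```python
from typing import List, Tuple

def remain_stats(mask_rows: List[str]) -> Tuple[List[float], List[float]]:
    """Per-layer keep ratio and the cumulative (layerwise) keep ratio."""
    infer_remain = [row.count("1") / len(row) for row in mask_rows]
    layerwise, running = [], 1.0
    for r in infer_remain:
        running *= r
        layerwise.append(round(running, 2))
    return infer_remain, layerwise

rows = [
    "1" * 50,                                             # an unpruned layer
    "11111111111111111111101101011111110111011111110100", # 42/50 kept -> 0.84
    "11111000111101011111111101011011110011110110011001", # 35/50 kept -> 0.70
]
print(remain_stats(rows))  # -> ([1.0, 0.84, 0.7], [1.0, 0.84, 0.59])
```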
loss: 0.014720, lagrangian_loss: -0.002838, attention_score_distillation_loss: 0.001850 loss: 0.014295, lagrangian_loss: 0.002671, attention_score_distillation_loss: 0.001843 ---------------------------------------------------------------------- time: 2023-07-19 14:29:43 Evaluating: f1: 0.8976, eval_loss: 0.499, token_prune_loc: [False, False, False, False, False, False, False, False, True, False], macs_sparsity: 0.0337, expected_sparsity: 0.0306, expected_sequence_sparsity: 0.6031, target_sparsity: 0.0465, step: 1200 lambda_1: 0.1805, lambda_2: 14.3359 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.99 1. 0.98 0.89 0.88] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 1.0] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 0.86] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101111011111110111011111110100 11111111111111111111111111111111111111111111111111 loss: 0.041405, lagrangian_loss: 0.003423, attention_score_distillation_loss: 0.001841 loss: 0.103779, lagrangian_loss: 0.001181, attention_score_distillation_loss: 0.001838 ---------------------------------------------------------------------- time: 2023-07-19 14:29:58 Evaluating: f1: 0.8716, eval_loss: 0.5361, token_prune_loc: [False, False, False, False, False, False, False, False, True, False], macs_sparsity: 0.0295, expected_sparsity: 0.0263, expected_sequence_sparsity: 0.6013, target_sparsity: 0.0485, step: 1250 lambda_1: 0.3764, lambda_2: 14.3985 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.99 1. 0.98 0.91 0.92] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.88, 1.0] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.88, 0.88] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101111011111110111011111110110 11111111111111111111111111111111111111111111111111 loss: 0.079066, lagrangian_loss: -0.000929, attention_score_distillation_loss: 0.001835 ETA: 1:44:27 | Epoch 10 finished. Took 32.69 seconds. loss: 0.007130, lagrangian_loss: -0.001459, attention_score_distillation_loss: 0.001835 ---------------------------------------------------------------------- time: 2023-07-19 14:30:12 Evaluating: f1: 0.8835, eval_loss: 0.5822, token_prune_loc: [False, False, False, False, False, False, False, False, True, False], macs_sparsity: 0.0295, expected_sparsity: 0.0263, expected_sequence_sparsity: 0.6013, target_sparsity: 0.0504, step: 1300 lambda_1: 0.2121, lambda_2: 14.4316 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.99 1. 
0.98 0.92 0.93] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.88, 1.0] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.88, 0.88] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101111011111110111011111110110 11111111111111111111111111111111111111111111111111 loss: 0.036706, lagrangian_loss: -0.000766, attention_score_distillation_loss: 0.001830 loss: 0.175276, lagrangian_loss: 0.000557, attention_score_distillation_loss: 0.001829 ---------------------------------------------------------------------- time: 2023-07-19 14:30:27 Evaluating: f1: 0.8822, eval_loss: 0.5845, token_prune_loc: [False, False, False, False, False, False, False, False, True, False], macs_sparsity: 0.0295, expected_sparsity: 0.0263, expected_sequence_sparsity: 0.6013, target_sparsity: 0.0524, step: 1350 lambda_1: -0.0910, lambda_2: 14.5182 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.99 1. 0.98 0.92 0.92] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.88, 1.0] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.88, 0.88] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101111011111110111011111110110 11111111111111111111111111111111111111111111111111 loss: 0.017843, lagrangian_loss: 0.001768, attention_score_distillation_loss: 0.001826 loss: 0.020944, lagrangian_loss: 0.001759, attention_score_distillation_loss: 0.001823 ETA: 1:43:45 | Epoch 11 finished. Took 32.55 seconds. ---------------------------------------------------------------------- time: 2023-07-19 14:30:41 Evaluating: f1: 0.8843, eval_loss: 0.567, token_prune_loc: [False, False, False, False, False, False, False, False, True, False], macs_sparsity: 0.0337, expected_sparsity: 0.0306, expected_sequence_sparsity: 0.6031, target_sparsity: 0.0543, step: 1400 lambda_1: -0.3260, lambda_2: 14.5733 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.99 1. 
0.98 0.9 0.89] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 1.0] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 0.86] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101111011111110111011111110100 11111111111111111111111111111111111111111111111111 loss: 0.013917, lagrangian_loss: 0.000552, attention_score_distillation_loss: 0.001820 loss: 0.032516, lagrangian_loss: -0.000790, attention_score_distillation_loss: 0.001817 ---------------------------------------------------------------------- time: 2023-07-19 14:30:56 Evaluating: f1: 0.8889, eval_loss: 0.6394, token_prune_loc: [False, False, False, False, False, False, False, False, True, False], macs_sparsity: 0.0379, expected_sparsity: 0.035, expected_sequence_sparsity: 0.6049, target_sparsity: 0.0562, step: 1450 lambda_1: -0.2910, lambda_2: 14.5836 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.98 1. 0.98 0.89 0.86] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 1.0] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.84] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101101011111110111011111110100 11111111111111111111111111111111111111111111111111 loss: 0.012617, lagrangian_loss: -0.001138, attention_score_distillation_loss: 0.001813 loss: 0.014826, lagrangian_loss: -0.000614, attention_score_distillation_loss: 0.001809 ETA: 1:43:08 | Epoch 12 finished. Took 32.86 seconds. ---------------------------------------------------------------------- time: 2023-07-19 14:31:10 Evaluating: f1: 0.8931, eval_loss: 0.5337, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.0719, expected_sparsity: 0.0693, expected_sequence_sparsity: 0.619, target_sparsity: 0.0582, step: 1500 lambda_1: -0.0762, lambda_2: 14.6249 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.98 1. 
0.98 0.88 0.85] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.7] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.59] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101101011111110111011111110100 11111100111101011111111101011011110011110100011001 loss: 0.016289, lagrangian_loss: -0.000008, attention_score_distillation_loss: 0.001807 loss: 0.007013, lagrangian_loss: 0.000163, attention_score_distillation_loss: 0.001805 ---------------------------------------------------------------------- time: 2023-07-19 14:31:25 Evaluating: f1: 0.8901, eval_loss: 0.4902, token_prune_loc: [False, False, False, False, False, False, False, False, True, False], macs_sparsity: 0.0379, expected_sparsity: 0.035, expected_sequence_sparsity: 0.6049, target_sparsity: 0.0601, step: 1550 lambda_1: 0.0650, lambda_2: 14.6452 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.98 1. 0.98 0.89 0.86] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 1.0] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.84] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101101011111110111011111110100 11111111111111111111111111111111111111111111111111 loss: 0.007331, lagrangian_loss: -0.000001, attention_score_distillation_loss: 0.001800 loss: 0.008404, lagrangian_loss: -0.000064, attention_score_distillation_loss: 0.001799 ---------------------------------------------------------------------- time: 2023-07-19 14:31:39 Evaluating: f1: 0.8985, eval_loss: 0.5508, token_prune_loc: [False, False, False, False, False, False, False, False, True, False], macs_sparsity: 0.0337, expected_sparsity: 0.0306, expected_sequence_sparsity: 0.6031, target_sparsity: 0.0621, step: 1600 lambda_1: 0.0176, lambda_2: 14.6497 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.98 1. 0.98 0.89 0.86] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 1.0] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 0.86] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101111011111110111011111110100 11111111111111111111111111111111111111111111111111 loss: 0.023607, lagrangian_loss: 0.000078, attention_score_distillation_loss: 0.001797 ETA: 1:42:59 | Epoch 13 finished. Took 34.89 seconds. 
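The evaluation lines share a fixed "Evaluating: f1: ..., expected_sparsity: ..., step: N" format, so the F1/sparsity trajectory can be pulled out of this log for plotting. A small parsing sketch that only relies on the field names visible above; the function name is illustrative.

```python
import re

def parse_eval_blocks(log_text: str):
    """Yield (step, f1, expected_sparsity) for each 'Evaluating:' block in the log."""
    for chunk in log_text.split("Evaluating:")[1:]:
        f1 = re.search(r"f1:\s*([0-9.]+)", chunk)
        step = re.search(r"step:\s*(\d+)", chunk)
        sparsity = re.search(r"expected_sparsity:\s*([0-9.]+)", chunk)
        if f1 and step:
            yield (int(step.group(1)), float(f1.group(1)),
                   float(sparsity.group(1)) if sparsity else 0.0)

sample = ("Evaluating: f1: 0.8985, eval_loss: 0.5508, macs_sparsity: 0.0337, "
          "expected_sparsity: 0.0306, expected_sequence_sparsity: 0.6031, "
          "target_sparsity: 0.0621, step: 1600")
print(list(parse_eval_blocks(sample)))  # [(1600, 0.8985, 0.0306)]
```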
loss: 0.010771, lagrangian_loss: 0.000304, attention_score_distillation_loss: 0.001793 ---------------------------------------------------------------------- time: 2023-07-19 14:31:54 Evaluating: f1: 0.8865, eval_loss: 0.5213, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.0719, expected_sparsity: 0.0693, expected_sequence_sparsity: 0.619, target_sparsity: 0.064, step: 1650 lambda_1: -0.1047, lambda_2: 14.6620 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.98 1. 0.98 0.89 0.85] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.7] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.59] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101101011111110111011111110100 11111100111101011111111101011011110011110100011001 loss: 0.079100, lagrangian_loss: 0.000255, attention_score_distillation_loss: 0.001788 loss: 0.018883, lagrangian_loss: 0.000027, attention_score_distillation_loss: 0.001787 ---------------------------------------------------------------------- time: 2023-07-19 14:32:08 Evaluating: f1: 0.879, eval_loss: 0.6022, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.0745, expected_sparsity: 0.0715, expected_sequence_sparsity: 0.6199, target_sparsity: 0.066, step: 1700 lambda_1: -0.1376, lambda_2: 14.6646 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.98 1. 0.98 0.88 0.83] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.68] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.57] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101101011111110111011111110100 11111000111101011111111101011011110011110100011001 loss: 0.007429, lagrangian_loss: -0.000146, attention_score_distillation_loss: 0.001785 ETA: 1:42:18 | Epoch 14 finished. Took 32.66 seconds. loss: 0.006198, lagrangian_loss: -0.000159, attention_score_distillation_loss: 0.001782 ---------------------------------------------------------------------- time: 2023-07-19 14:32:23 Evaluating: f1: 0.8858, eval_loss: 0.6018, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.0745, expected_sparsity: 0.0715, expected_sequence_sparsity: 0.6199, target_sparsity: 0.0679, step: 1750 lambda_1: -0.0622, lambda_2: 14.6694 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.98 1. 
0.97 0.87 0.82] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.68] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.57] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101101011111110111011111110100 11111100111101011111111101011011110010110100011001 loss: 0.007924, lagrangian_loss: -0.000064, attention_score_distillation_loss: 0.001779 loss: 0.016277, lagrangian_loss: 0.000002, attention_score_distillation_loss: 0.001777 ---------------------------------------------------------------------- time: 2023-07-19 14:32:37 Evaluating: f1: 0.8983, eval_loss: 0.6013, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.0745, expected_sparsity: 0.0715, expected_sequence_sparsity: 0.6199, target_sparsity: 0.0698, step: 1800 lambda_1: 0.0089, lambda_2: 14.6740 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.98 1. 0.97 0.87 0.82] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.68] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.57] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101101011111110111011111110100 11111100111101011111111101011011110010110100011001 loss: 0.007964, lagrangian_loss: -0.000001, attention_score_distillation_loss: 0.001772 loss: 0.009511, lagrangian_loss: 0.000016, attention_score_distillation_loss: 0.001771 ETA: 1:41:39 | Epoch 15 finished. Took 32.66 seconds. ---------------------------------------------------------------------- time: 2023-07-19 14:32:52 Evaluating: f1: 0.8962, eval_loss: 0.5568, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.0745, expected_sparsity: 0.0715, expected_sequence_sparsity: 0.6199, target_sparsity: 0.0718, step: 1850 lambda_1: -0.0248, lambda_2: 14.6755 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.98 1. 
0.97 0.87 0.82] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.68] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.57] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101101011111110111011111110100 11111100111101011111111101011011110010110100011001 loss: 0.093316, lagrangian_loss: 0.000047, attention_score_distillation_loss: 0.001766 loss: 0.004602, lagrangian_loss: -0.000002, attention_score_distillation_loss: 0.001766 ---------------------------------------------------------------------- time: 2023-07-19 14:33:06 Evaluating: f1: 0.8889, eval_loss: 0.4992, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.0787, expected_sparsity: 0.0772, expected_sequence_sparsity: 0.6222, target_sparsity: 0.0737, step: 1900 lambda_1: -0.0447, lambda_2: 14.6765 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.98 1. 0.97 0.86 0.8 ] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.82, 0.66] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.82, 0.54] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101101011110110111011111110100 11111000111101011111111101011011110010110100011001 loss: 0.284785, lagrangian_loss: -0.000015, attention_score_distillation_loss: 0.001761 loss: 0.023119, lagrangian_loss: 0.000001, attention_score_distillation_loss: 0.001757 ---------------------------------------------------------------------- time: 2023-07-19 14:33:21 Evaluating: f1: 0.8938, eval_loss: 0.6179, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.0787, expected_sparsity: 0.0772, expected_sequence_sparsity: 0.6222, target_sparsity: 0.0757, step: 1950 lambda_1: -0.0378, lambda_2: 14.6768 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.98 1. 0.97 0.86 0.79] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.82, 0.66] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.82, 0.54] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101101011110110111011111110100 11111100111101011101111101011011110010110100011001 loss: 0.015058, lagrangian_loss: 0.000037, attention_score_distillation_loss: 0.001754 ETA: 1:41:26 | Epoch 16 finished. Took 34.96 seconds. 
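The run is distilling from the teacher at /mnt/data/device-aware-bert/token_pruning/teachers/MRPC with distill_temp=2.0 and distill_ce_loss_alpha=0.0002. As a point of reference for those settings, here is a generic sketch of temperature-scaled logit distillation weighted by that alpha; how this codebase actually combines its distillation terms (including the attention_score_distillation_loss printed above) is not shown in the log and is assumed here.

```python
import torch
import torch.nn.functional as F

DISTILL_TEMP = 2.0         # distill_temp from the config
CE_DISTILL_ALPHA = 0.0002  # distill_ce_loss_alpha from the config

def soft_ce_distillation(student_logits: torch.Tensor,
                         teacher_logits: torch.Tensor) -> torch.Tensor:
    """Temperature-scaled KL between teacher and student predictions,
    weighted by the configured alpha (generic KD form; assumed here)."""
    t = DISTILL_TEMP
    loss = F.kl_div(F.log_softmax(student_logits / t, dim=-1),
                    F.softmax(teacher_logits / t, dim=-1),
                    reduction="batchmean") * (t * t)
    return CE_DISTILL_ALPHA * loss

student = torch.randn(4, 2)  # MRPC is a 2-class task
teacher = torch.randn(4, 2)
print(soft_ce_distillation(student, teacher))
```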
loss: 0.170983, lagrangian_loss: 0.000048, attention_score_distillation_loss: 0.001751 ---------------------------------------------------------------------- time: 2023-07-19 14:33:35 Evaluating: f1: 0.8991, eval_loss: 0.5506, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.0787, expected_sparsity: 0.0772, expected_sequence_sparsity: 0.6222, target_sparsity: 0.0776, step: 2000 lambda_1: -0.0687, lambda_2: 14.6775 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.98 1. 0.97 0.85 0.78] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.82, 0.66] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.82, 0.54] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101101011110110111011111110100 11111100111101011101111101011011110010110100011001 loss: 0.004562, lagrangian_loss: -0.000008, attention_score_distillation_loss: 0.001748 loss: 0.008097, lagrangian_loss: -0.000053, attention_score_distillation_loss: 0.001746 ---------------------------------------------------------------------- time: 2023-07-19 14:33:50 Evaluating: f1: 0.8956, eval_loss: 0.5843, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.0855, expected_sparsity: 0.0828, expected_sequence_sparsity: 0.6245, target_sparsity: 0.0795, step: 2050 lambda_1: -0.0333, lambda_2: 14.6790 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.98 1. 0.97 0.84 0.78] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.64] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.51] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101101011110110111011011110100 11111100111101011101111101011001110010110100011001 loss: 0.003550, lagrangian_loss: -0.000019, attention_score_distillation_loss: 0.001744 ETA: 1:40:47 | Epoch 17 finished. Took 32.7 seconds. loss: 0.005303, lagrangian_loss: -0.000002, attention_score_distillation_loss: 0.001742 ---------------------------------------------------------------------- time: 2023-07-19 14:34:04 Evaluating: f1: 0.8981, eval_loss: 0.5534, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.0855, expected_sparsity: 0.0828, expected_sequence_sparsity: 0.6245, target_sparsity: 0.0815, step: 2100 lambda_1: -0.0039, lambda_2: 14.6798 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.98 1. 
0.97 0.84 0.77] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.64] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.51] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101101011110110111011011110100 11111100111101010101111101011011110010110100011001 loss: 0.005728, lagrangian_loss: 0.000007, attention_score_distillation_loss: 0.001740 loss: 0.004771, lagrangian_loss: 0.000045, attention_score_distillation_loss: 0.001736 ---------------------------------------------------------------------- time: 2023-07-19 14:34:19 Evaluating: f1: 0.8873, eval_loss: 0.5838, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.0855, expected_sparsity: 0.0828, expected_sequence_sparsity: 0.6245, target_sparsity: 0.0834, step: 2150 lambda_1: -0.0537, lambda_2: 14.6816 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.98 1. 0.97 0.84 0.77] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.64] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.51] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101101011110110111011011110100 11111100111101010101111101011011110010110100011001 loss: 0.010478, lagrangian_loss: 0.000068, attention_score_distillation_loss: 0.001733 loss: 0.004719, lagrangian_loss: 0.000008, attention_score_distillation_loss: 0.001729 ETA: 1:40:08 | Epoch 18 finished. Took 32.7 seconds. ---------------------------------------------------------------------- time: 2023-07-19 14:34:33 Evaluating: f1: 0.892, eval_loss: 0.5872, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.0855, expected_sparsity: 0.085, expected_sequence_sparsity: 0.6254, target_sparsity: 0.0854, step: 2200 lambda_1: -0.0719, lambda_2: 14.6826 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.98 1. 
0.97 0.83 0.75] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.62] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.5] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101101011110110111011011110100 11111100111101010101111101011001110010110100011001 loss: 0.003216, lagrangian_loss: -0.000042, attention_score_distillation_loss: 0.001727 loss: 0.005995, lagrangian_loss: -0.000029, attention_score_distillation_loss: 0.001724 ---------------------------------------------------------------------- time: 2023-07-19 14:34:48 Evaluating: f1: 0.895, eval_loss: 0.6068, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.0855, expected_sparsity: 0.085, expected_sequence_sparsity: 0.6254, target_sparsity: 0.0873, step: 2250 lambda_1: -0.0158, lambda_2: 14.6845 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.98 1. 0.97 0.83 0.75] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.62] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.5] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101101011110110111011011110100 11111100111101010101111101011001110010110100011001 loss: 0.002338, lagrangian_loss: -0.000004, attention_score_distillation_loss: 0.001723 loss: 0.004482, lagrangian_loss: 0.000007, attention_score_distillation_loss: 0.001719 ---------------------------------------------------------------------- time: 2023-07-19 14:35:02 Evaluating: f1: 0.8978, eval_loss: 0.6296, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.0923, expected_sparsity: 0.0883, expected_sequence_sparsity: 0.6268, target_sparsity: 0.0893, step: 2300 lambda_1: -0.0236, lambda_2: 14.6852 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.98 1. 0.97 0.82 0.74] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.78, 0.62] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.78, 0.48] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101101011110110111011011110000 11111100111101010101111101011001110010110100011001 ETA: 1:39:49 | Epoch 19 finished. Took 34.79 seconds. 
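Note: each evaluation block prints one 50-character 0/1 row per pruning location ("bin50" in the run name), which reads as a keep/drop mask over a 50-slot token bin; the fraction of '1's in a row corresponds to that location's "infer remain" entry (the logged ratios are always multiples of 1/50). A small helper, assuming that interpretation:

def remain_from_mask(mask_row: str) -> float:
    """Fraction of kept slots in one printed 0/1 mask row."""
    assert set(mask_row) <= {"0", "1"}
    return mask_row.count("1") / len(mask_row)

# Hypothetical usage: applied to the last two rows of a block, the returned ratios
# should line up with the last two "infer remain" entries (e.g. 0.80 and 0.62 for
# the step-2200 block above).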
loss: 0.002314, lagrangian_loss: 0.000034, attention_score_distillation_loss: 0.001715 loss: 0.002819, lagrangian_loss: 0.000034, attention_score_distillation_loss: 0.001712 ---------------------------------------------------------------------- time: 2023-07-19 14:35:16 Evaluating: f1: 0.8942, eval_loss: 0.5841, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.0923, expected_sparsity: 0.0905, expected_sequence_sparsity: 0.6277, target_sparsity: 0.0912, step: 2350 lambda_1: -0.0553, lambda_2: 14.6860 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.98 1. 0.96 0.82 0.73] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.78, 0.6] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.78, 0.47] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101101011110110111011011110000 11111100111101010101111101011001110010100100011001 loss: 0.005633, lagrangian_loss: -0.000027, attention_score_distillation_loss: 0.001710 loss: 0.004544, lagrangian_loss: -0.000016, attention_score_distillation_loss: 0.001706 ---------------------------------------------------------------------- time: 2023-07-19 14:35:31 Evaluating: f1: 0.9059, eval_loss: 0.5423, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.0923, expected_sparsity: 0.0905, expected_sequence_sparsity: 0.6277, target_sparsity: 0.0931, step: 2400 lambda_1: -0.0270, lambda_2: 14.6865 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.98 0.99 0.96 0.81 0.72] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.78, 0.6] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.78, 0.47] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101101011110110111011011110000 11111100111101010101111101011001110010100100011001 loss: 0.006385, lagrangian_loss: -0.000005, attention_score_distillation_loss: 0.001705 ETA: 1:39:11 | Epoch 20 finished. Took 32.65 seconds. loss: 0.003317, lagrangian_loss: 0.000005, attention_score_distillation_loss: 0.001701 ---------------------------------------------------------------------- time: 2023-07-19 14:35:45 Evaluating: f1: 0.8928, eval_loss: 0.56, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.0923, expected_sparsity: 0.0905, expected_sequence_sparsity: 0.6277, target_sparsity: 0.0951, step: 2450 lambda_1: -0.0217, lambda_2: 14.6867 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 
0.98 0.99 0.96 0.81 0.72] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.78, 0.6] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.78, 0.47] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101101011110110111011011110000 11111100111101010101111101011001110010100100011001 loss: 0.002210, lagrangian_loss: 0.000012, attention_score_distillation_loss: 0.001698 loss: 0.009325, lagrangian_loss: 0.000014, attention_score_distillation_loss: 0.001696 ---------------------------------------------------------------------- time: 2023-07-19 14:36:00 Evaluating: f1: 0.8941, eval_loss: 0.5961, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.0991, expected_sparsity: 0.0958, expected_sequence_sparsity: 0.6298, target_sparsity: 0.097, step: 2500 lambda_1: -0.0413, lambda_2: 14.6869 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.98 0.99 0.96 0.8 0.71] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.58] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.44] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101101011100110111011011110000 11111000111101010101111101011001110010100100011001 loss: 0.004997, lagrangian_loss: -0.000005, attention_score_distillation_loss: 0.001695 loss: 0.006099, lagrangian_loss: -0.000009, attention_score_distillation_loss: 0.001689 ETA: 1:38:33 | Epoch 21 finished. Took 32.7 seconds. ---------------------------------------------------------------------- time: 2023-07-19 14:36:14 Evaluating: f1: 0.8881, eval_loss: 0.563, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.0991, expected_sparsity: 0.0958, expected_sequence_sparsity: 0.6298, target_sparsity: 0.099, step: 2550 lambda_1: -0.0289, lambda_2: 14.6871 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 
0.98 0.99 0.96 0.8 0.7 ] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.58] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.44] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101101011100110111011011110000 11111000111101010101111101011001110010100100011001 loss: 0.002765, lagrangian_loss: 0.000002, attention_score_distillation_loss: 0.001687 loss: 0.005608, lagrangian_loss: 0.000007, attention_score_distillation_loss: 0.001684 ---------------------------------------------------------------------- time: 2023-07-19 14:36:29 Evaluating: f1: 0.8982, eval_loss: 0.5326, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.0991, expected_sparsity: 0.0958, expected_sequence_sparsity: 0.6298, target_sparsity: 0.1009, step: 2600 lambda_1: -0.0375, lambda_2: 14.6873 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.98 0.99 0.96 0.79 0.69] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.58] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.44] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101101011100110111011011110000 11111000111101010101111101011001110010100100011001 loss: 0.159170, lagrangian_loss: 0.000017, attention_score_distillation_loss: 0.001681 loss: 0.005047, lagrangian_loss: -0.000019, attention_score_distillation_loss: 0.001678 ETA: 1:37:55 | Epoch 22 finished. Took 32.61 seconds. ---------------------------------------------------------------------- time: 2023-07-19 14:36:43 Evaluating: f1: 0.8908, eval_loss: 0.552, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.1033, expected_sparsity: 0.101, expected_sequence_sparsity: 0.632, target_sparsity: 0.1028, step: 2650 lambda_1: -0.0014, lambda_2: 14.6887 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 
0.98 0.99 0.96 0.78 0.68] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.74, 0.56] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.74, 0.41] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101001011100110111011011110000 11111100111101010101110101011001110010100100001001 loss: 0.003213, lagrangian_loss: 0.000001, attention_score_distillation_loss: 0.001674 loss: 0.002689, lagrangian_loss: 0.000005, attention_score_distillation_loss: 0.001671 ---------------------------------------------------------------------- time: 2023-07-19 14:36:58 Evaluating: f1: 0.895, eval_loss: 0.5904, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.1033, expected_sparsity: 0.101, expected_sequence_sparsity: 0.632, target_sparsity: 0.1048, step: 2700 lambda_1: -0.0231, lambda_2: 14.6899 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.98 0.99 0.96 0.78 0.67] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.74, 0.56] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.74, 0.41] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101001011100110111011011110000 11111100111101010101110101011001110010100100001001 loss: 0.002950, lagrangian_loss: 0.000036, attention_score_distillation_loss: 0.001669 loss: 0.004284, lagrangian_loss: 0.000017, attention_score_distillation_loss: 0.001666 ---------------------------------------------------------------------- time: 2023-07-19 14:37:13 Evaluating: f1: 0.898, eval_loss: 0.5928, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.1033, expected_sparsity: 0.101, expected_sequence_sparsity: 0.632, target_sparsity: 0.1067, step: 2750 lambda_1: -0.0542, lambda_2: 14.6908 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.98 0.99 0.96 0.77 0.66] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.74, 0.56] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.74, 0.41] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101001011110110111011010110000 11111100111101010101110101011001110010100100001001 loss: 0.001743, lagrangian_loss: -0.000025, attention_score_distillation_loss: 0.001664 ETA: 1:37:35 | Epoch 23 finished. Took 34.99 seconds. 
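Note: target_sparsity climbs almost linearly across the evaluations above (about +0.002 every 50 steps, i.e. roughly 3.9e-5 per step), presumably ramping toward the final sparsity implied by the run name ("s0.67"). A plausible linear scheduler is sketched below; the ramp start and ramp length are assumptions, not values read from this log.

def target_sparsity(step: int,
                    final_sparsity: float = 0.67,        # "s0.67" in the run name
                    ramp_start: int = 0,                 # assumed
                    ramp_steps: int = 17000) -> float:   # assumed
    frac = min(1.0, max(0.0, (step - ramp_start) / ramp_steps))
    return final_sparsity * frac

# With these assumed numbers, target_sparsity(1900) ~ 0.075 and target_sparsity(2600) ~ 0.10,
# the same ballpark as the targets logged above.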
loss: 0.003533, lagrangian_loss: -0.000015, attention_score_distillation_loss: 0.001661 ---------------------------------------------------------------------- time: 2023-07-19 14:37:27 Evaluating: f1: 0.8956, eval_loss: 0.5784, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.1033, expected_sparsity: 0.101, expected_sequence_sparsity: 0.632, target_sparsity: 0.1087, step: 2800 lambda_1: -0.0046, lambda_2: 14.6921 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.98 0.99 0.96 0.77 0.66] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.74, 0.56] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.74, 0.41] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101001011110110111011010110000 11111100111101010101110101011001110010100100001001 loss: 0.003537, lagrangian_loss: -0.000000, attention_score_distillation_loss: 0.001657 loss: 0.002826, lagrangian_loss: 0.000012, attention_score_distillation_loss: 0.001653 ---------------------------------------------------------------------- time: 2023-07-19 14:37:42 Evaluating: f1: 0.8983, eval_loss: 0.5987, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.1075, expected_sparsity: 0.1042, expected_sequence_sparsity: 0.6333, target_sparsity: 0.1106, step: 2850 lambda_1: -0.0286, lambda_2: 14.6929 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.98 0.99 0.96 0.76 0.65] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.72, 0.56] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.72, 0.4] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101001011100110111011010110000 11111100111101010101110101011001110010100100001001 loss: 0.002361, lagrangian_loss: 0.000017, attention_score_distillation_loss: 0.001651 ETA: 1:36:59 | Epoch 24 finished. Took 32.82 seconds. loss: 0.004175, lagrangian_loss: 0.000002, attention_score_distillation_loss: 0.001648 ---------------------------------------------------------------------- time: 2023-07-19 14:37:56 Evaluating: f1: 0.8916, eval_loss: 0.5692, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.1075, expected_sparsity: 0.1061, expected_sequence_sparsity: 0.6341, target_sparsity: 0.1126, step: 2900 lambda_1: -0.0400, lambda_2: 14.6933 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 
0.98 0.99 0.96 0.75 0.64] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.72, 0.54] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.72, 0.39] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101001011100110111011010110000 11111100111101010101110101011001010010100100001001 loss: 0.003659, lagrangian_loss: -0.000018, attention_score_distillation_loss: 0.001645 loss: 0.002306, lagrangian_loss: -0.000005, attention_score_distillation_loss: 0.001643 ---------------------------------------------------------------------- time: 2023-07-19 14:38:11 Evaluating: f1: 0.8988, eval_loss: 0.5674, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.1117, expected_sparsity: 0.1092, expected_sequence_sparsity: 0.6353, target_sparsity: 0.1145, step: 2950 lambda_1: -0.0080, lambda_2: 14.6938 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.98 0.99 0.96 0.74 0.64] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.7, 0.54] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.7, 0.38] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101001011100110111011000110000 11111100111101010101110101011001010010100100001001 loss: 0.003555, lagrangian_loss: 0.000002, attention_score_distillation_loss: 0.001640 loss: 0.001858, lagrangian_loss: 0.000015, attention_score_distillation_loss: 0.001638 ETA: 1:36:23 | Epoch 25 finished. Took 32.87 seconds. ---------------------------------------------------------------------- time: 2023-07-19 14:38:25 Evaluating: f1: 0.897, eval_loss: 0.5436, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.1117, expected_sparsity: 0.1092, expected_sequence_sparsity: 0.6353, target_sparsity: 0.1164, step: 3000 lambda_1: -0.0374, lambda_2: 14.6944 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 
0.98 0.99 0.96 0.74 0.63] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.7, 0.54] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.7, 0.38] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101001011100110111011000110000 11111100111101010101110101011001010010100100001001 loss: 0.006852, lagrangian_loss: -0.000012, attention_score_distillation_loss: 0.001634 loss: 0.004874, lagrangian_loss: -0.000008, attention_score_distillation_loss: 0.001632 ---------------------------------------------------------------------- time: 2023-07-19 14:38:40 Evaluating: f1: 0.8966, eval_loss: 0.5696, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.1117, expected_sparsity: 0.1092, expected_sequence_sparsity: 0.6353, target_sparsity: 0.1184, step: 3050 lambda_1: -0.0029, lambda_2: 14.6952 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.98 0.99 0.96 0.73 0.62] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.7, 0.54] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.7, 0.38] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101001011100110111011000110000 11111000111101010101110101011001110010100100001001 loss: 0.002537, lagrangian_loss: 0.000000, attention_score_distillation_loss: 0.001629 loss: 0.002500, lagrangian_loss: 0.000018, attention_score_distillation_loss: 0.001626 ---------------------------------------------------------------------- time: 2023-07-19 14:38:54 Evaluating: f1: 0.8985, eval_loss: 0.5933, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.1117, expected_sparsity: 0.1092, expected_sequence_sparsity: 0.6353, target_sparsity: 0.1203, step: 3100 lambda_1: -0.0462, lambda_2: 14.6963 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.98 0.99 0.96 0.73 0.61] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.7, 0.54] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.7, 0.38] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101001011100110111011000110000 11111000111101010101110101011001110010100100001001 loss: 0.004406, lagrangian_loss: 0.000035, attention_score_distillation_loss: 0.001624 ETA: 1:36:01 | Epoch 26 finished. Took 34.92 seconds. 
loss: 0.001663, lagrangian_loss: -0.000024, attention_score_distillation_loss: 0.001621 ---------------------------------------------------------------------- time: 2023-07-19 14:39:09 Evaluating: f1: 0.8958, eval_loss: 0.5927, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.1158, expected_sparsity: 0.1141, expected_sequence_sparsity: 0.6374, target_sparsity: 0.1223, step: 3150 lambda_1: -0.0270, lambda_2: 14.6974 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.98 0.99 0.95 0.71 0.61] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.68, 0.52] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.68, 0.35] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101001011100110101011000110000 11111000111101010101110101011001010010100100001001 loss: 0.001565, lagrangian_loss: -0.000012, attention_score_distillation_loss: 0.001617 loss: 0.002438, lagrangian_loss: 0.000002, attention_score_distillation_loss: 0.001616 ---------------------------------------------------------------------- time: 2023-07-19 14:39:23 Evaluating: f1: 0.8974, eval_loss: 0.556, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.1158, expected_sparsity: 0.1141, expected_sequence_sparsity: 0.6374, target_sparsity: 0.1242, step: 3200 lambda_1: -0.0168, lambda_2: 14.6987 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.98 0.99 0.95 0.71 0.6 ] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.68, 0.52] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.68, 0.35] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101001011100110101011000110000 11111000111101010101110101011001010010100100001001 loss: 0.002271, lagrangian_loss: 0.000023, attention_score_distillation_loss: 0.001612 ETA: 1:35:24 | Epoch 27 finished. Took 32.8 seconds. loss: 0.002666, lagrangian_loss: 0.000007, attention_score_distillation_loss: 0.001610 ---------------------------------------------------------------------- time: 2023-07-19 14:39:38 Evaluating: f1: 0.9014, eval_loss: 0.5909, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.12, expected_sparsity: 0.1172, expected_sequence_sparsity: 0.6386, target_sparsity: 0.1262, step: 3250 lambda_1: -0.0499, lambda_2: 14.7000 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 
0.98 0.99 0.95 0.7 0.6 ] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.66, 0.52] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.66, 0.34] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101001011100010101011000110000 11111000111101010101110101011001010010100100001001 loss: 0.005432, lagrangian_loss: -0.000031, attention_score_distillation_loss: 0.001612 loss: 0.002152, lagrangian_loss: -0.000004, attention_score_distillation_loss: 0.001604 ---------------------------------------------------------------------- time: 2023-07-19 14:39:53 Evaluating: f1: 0.897, eval_loss: 0.567, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.12, expected_sparsity: 0.1172, expected_sequence_sparsity: 0.6386, target_sparsity: 0.1281, step: 3300 lambda_1: 0.0031, lambda_2: 14.7013 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.98 0.99 0.95 0.7 0.59] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.66, 0.52] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.66, 0.34] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101001011100010101011000110000 11111000111101010101110101011001010010100100001001 loss: 0.069087, lagrangian_loss: 0.000001, attention_score_distillation_loss: 0.001601 loss: 0.003866, lagrangian_loss: 0.000169, attention_score_distillation_loss: 0.001598 ETA: 1:34:48 | Epoch 28 finished. Took 32.81 seconds. ---------------------------------------------------------------------- time: 2023-07-19 14:40:07 Evaluating: f1: 0.8808, eval_loss: 0.7724, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.1216, expected_sparsity: 0.1202, expected_sequence_sparsity: 0.6399, target_sparsity: 0.13, step: 3350 lambda_1: -0.1138, lambda_2: 14.7086 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 
0.98 0.99 0.95 0.68 0.59] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.64, 0.52] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.64, 0.33] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101001011100010101001000110000 11111000111101010101110101011001010010100100001001 loss: 0.002763, lagrangian_loss: -0.000086, attention_score_distillation_loss: 0.001596 loss: 0.001701, lagrangian_loss: -0.000053, attention_score_distillation_loss: 0.001594 ---------------------------------------------------------------------- time: 2023-07-19 14:40:22 Evaluating: f1: 0.8889, eval_loss: 0.5801, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.1216, expected_sparsity: 0.1202, expected_sequence_sparsity: 0.6399, target_sparsity: 0.132, step: 3400 lambda_1: 0.0304, lambda_2: 14.7178 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.98 0.99 0.95 0.68 0.58] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.64, 0.52] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.64, 0.33] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101001011100010101001000110000 11111000111101010101110101011001010010100100001001 loss: 0.004504, lagrangian_loss: 0.000007, attention_score_distillation_loss: 0.001591 loss: 0.001661, lagrangian_loss: -0.000002, attention_score_distillation_loss: 0.001587 ---------------------------------------------------------------------- time: 2023-07-19 14:40:36 Evaluating: f1: 0.8881, eval_loss: 0.5988, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.1216, expected_sparsity: 0.1202, expected_sequence_sparsity: 0.6399, target_sparsity: 0.1339, step: 3450 lambda_1: -0.0536, lambda_2: 14.7237 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.98 0.99 0.95 0.67 0.59] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.64, 0.52] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.64, 0.33] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101001011100010101001000110000 11111000111101010101110101011001010010100100001001 ETA: 1:34:24 | Epoch 29 finished. Took 34.95 seconds. 
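Note: a lagrangian_loss term driven by trainable multipliers lambda_1 and lambda_2 is characteristic of Lagrangian sparsity control (as used in CoFi-style structured pruning), where the multipliers push the expected sparsity toward the scheduled target. The exact expression is not visible in this log; the sketch below assumes the common linear-plus-quadratic form.

def lagrangian_penalty(expected_sparsity: float,
                       target_sparsity: float,
                       lambda_1: float,
                       lambda_2: float) -> float:
    # Assumed form: lambda_1 * (s - t) + lambda_2 * (s - t)^2.
    # The penalty shrinks as s approaches t, which is consistent with the
    # near-zero lagrangian_loss values printed throughout this log.
    gap = expected_sparsity - target_sparsity
    return lambda_1 * gap + lambda_2 * gap * gap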
loss: 0.005415, lagrangian_loss: 0.000080, attention_score_distillation_loss: 0.001591 loss: 0.002078, lagrangian_loss: -0.000039, attention_score_distillation_loss: 0.001581 ---------------------------------------------------------------------- time: 2023-07-19 14:40:51 Evaluating: f1: 0.8955, eval_loss: 0.5882, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.1258, expected_sparsity: 0.125, expected_sequence_sparsity: 0.6418, target_sparsity: 0.1359, step: 3500 lambda_1: -0.0622, lambda_2: 14.7285 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.98 0.99 0.94 0.66 0.58] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.5] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.31] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101001011100010101001000100000 11111000111101010101110101011001010000100100001001 loss: 0.001670, lagrangian_loss: -0.000063, attention_score_distillation_loss: 0.001579 loss: 0.005558, lagrangian_loss: 0.000005, attention_score_distillation_loss: 0.001577 ---------------------------------------------------------------------- time: 2023-07-19 14:41:05 Evaluating: f1: 0.9048, eval_loss: 0.5794, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.1258, expected_sparsity: 0.125, expected_sequence_sparsity: 0.6418, target_sparsity: 0.1378, step: 3550 lambda_1: 0.0152, lambda_2: 14.7327 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.98 0.99 0.94 0.66 0.58] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.5] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.31] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101001011100010101001000100000 11111000111101010101110101011001010000100100001001 loss: 0.002640, lagrangian_loss: 0.000001, attention_score_distillation_loss: 0.001573 ETA: 1:33:48 | Epoch 30 finished. Took 32.83 seconds. loss: 0.003458, lagrangian_loss: 0.000069, attention_score_distillation_loss: 0.001569 ---------------------------------------------------------------------- time: 2023-07-19 14:41:20 Evaluating: f1: 0.8946, eval_loss: 0.5559, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.13, expected_sparsity: 0.1279, expected_sequence_sparsity: 0.643, target_sparsity: 0.1397, step: 3600 lambda_1: -0.0892, lambda_2: 14.7369 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 
0.98 0.99 0.93 0.65 0.57] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.6, 0.5] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.6, 0.3] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101001011100000101001000100000 11111000111101010101110101011001010000100100001001 loss: 0.033720, lagrangian_loss: -0.000037, attention_score_distillation_loss: 0.001567 loss: 0.003488, lagrangian_loss: -0.000040, attention_score_distillation_loss: 0.001563 ---------------------------------------------------------------------- time: 2023-07-19 14:41:34 Evaluating: f1: 0.8885, eval_loss: 0.5886, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.13, expected_sparsity: 0.1279, expected_sequence_sparsity: 0.643, target_sparsity: 0.1417, step: 3650 lambda_1: 0.0249, lambda_2: 14.7421 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.98 0.99 0.93 0.64 0.57] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.6, 0.5] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.6, 0.3] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101001011100000101001000100000 11111000111101010101110101011001010000100100001001 loss: 0.003652, lagrangian_loss: 0.000012, attention_score_distillation_loss: 0.001560 loss: 0.003023, lagrangian_loss: -0.000018, attention_score_distillation_loss: 0.001557 ETA: 1:33:13 | Epoch 31 finished. Took 32.9 seconds. ---------------------------------------------------------------------- time: 2023-07-19 14:41:49 Evaluating: f1: 0.8752, eval_loss: 0.629, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.13, expected_sparsity: 0.1279, expected_sequence_sparsity: 0.643, target_sparsity: 0.1436, step: 3700 lambda_1: -0.0541, lambda_2: 14.7484 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 
0.98 0.99 0.93 0.64 0.57] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.6, 0.5] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.6, 0.3] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101001011100000101001000100000 11111000111101010101110101011001010000100100001001 loss: 0.005066, lagrangian_loss: 0.000121, attention_score_distillation_loss: 0.001553 loss: 0.004385, lagrangian_loss: -0.000082, attention_score_distillation_loss: 0.001551 ---------------------------------------------------------------------- time: 2023-07-19 14:42:04 Evaluating: f1: 0.8878, eval_loss: 0.6474, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.1315, expected_sparsity: 0.1309, expected_sequence_sparsity: 0.6442, target_sparsity: 0.1456, step: 3750 lambda_1: -0.0542, lambda_2: 14.7586 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.98 0.99 0.92 0.62 0.56] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.58, 0.5] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.58, 0.29] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101001011100000101001000000000 11111000111101010101110101011001010000100100001001 loss: 0.003426, lagrangian_loss: -0.000040, attention_score_distillation_loss: 0.001553 loss: 0.006244, lagrangian_loss: 0.000020, attention_score_distillation_loss: 0.001548 ETA: 1:32:38 | Epoch 32 finished. Took 32.95 seconds. ---------------------------------------------------------------------- time: 2023-07-19 14:42:18 Evaluating: f1: 0.8877, eval_loss: 0.6504, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.1315, expected_sparsity: 0.1309, expected_sequence_sparsity: 0.6442, target_sparsity: 0.1475, step: 3800 lambda_1: 0.0491, lambda_2: 14.7781 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 
0.98 0.99 0.93 0.63 0.57] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.58, 0.5] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.58, 0.29] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101001011100000101001000000000 11111000111101010101110101011001010000100100001001 loss: 0.008981, lagrangian_loss: 0.000010, attention_score_distillation_loss: 0.001545 loss: 0.000917, lagrangian_loss: 0.000260, attention_score_distillation_loss: 0.001540 ---------------------------------------------------------------------- time: 2023-07-19 14:42:33 Evaluating: f1: 0.8815, eval_loss: 0.6344, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1704, expected_sparsity: 0.1683, expected_sequence_sparsity: 0.6596, target_sparsity: 0.1495, step: 3850 lambda_1: -0.1891, lambda_2: 14.8085 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.98 0.99 0.9 0.61 0.56] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.56, 0.48] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.45, 0.22] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111011111111100001111101111111111100011110 11111111111111111111101001011000000101001000000000 11011000111101010101110101011001010000100100001001 loss: 0.001451, lagrangian_loss: -0.000490, attention_score_distillation_loss: 0.001538 loss: 0.004553, lagrangian_loss: 0.000359, attention_score_distillation_loss: 0.001535 ---------------------------------------------------------------------- time: 2023-07-19 14:42:48 Evaluating: f1: 0.8694, eval_loss: 0.6309, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.1315, expected_sparsity: 0.1309, expected_sequence_sparsity: 0.6442, target_sparsity: 0.1514, step: 3900 lambda_1: 0.2065, lambda_2: 14.8672 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.98 0.99 0.91 0.62 0.56] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.58, 0.5] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.58, 0.29] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101001011100000101001000000000 11111000111101010101110101011001010000100100001001 loss: 0.002135, lagrangian_loss: -0.000447, attention_score_distillation_loss: 0.001531 ETA: 1:32:14 | Epoch 33 finished. Took 35.14 seconds. 
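Note: the token_prune_loc flags appear to simply mark which pruning locations are actively dropping tokens, i.e. the positions whose "infer remain" has fallen below 1.0 (at step 3850 a third location briefly switches on and macs_sparsity jumps from ~0.13 to ~0.17, then switches back off at step 3900). A one-line check, assuming that reading:

infer_remain = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.56, 0.48]   # step 3850
token_prune_loc = [r < 1.0 for r in infer_remain]
print(token_prune_loc)
# [False, False, False, False, False, False, False, True, True, True] -- as logged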
loss: 0.003687, lagrangian_loss: 0.000761, attention_score_distillation_loss: 0.001528 ---------------------------------------------------------------------- time: 2023-07-19 14:43:02 Evaluating: f1: 0.8919, eval_loss: 0.6513, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.1357, expected_sparsity: 0.1354, expected_sequence_sparsity: 0.6461, target_sparsity: 0.1533, step: 3950 lambda_1: -0.3649, lambda_2: 14.9930 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.98 0.99 0.91 0.61 0.55] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.56, 0.48] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.56, 0.27] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101001011000000101001000000000 11011000111101010101110101011001010000100100001001 loss: 0.001582, lagrangian_loss: 0.000110, attention_score_distillation_loss: 0.001526 loss: 0.002874, lagrangian_loss: -0.001265, attention_score_distillation_loss: 0.001523 ---------------------------------------------------------------------- time: 2023-07-19 14:43:16 Evaluating: f1: 0.8935, eval_loss: 0.6236, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1777, expected_sparsity: 0.1739, expected_sequence_sparsity: 0.6619, target_sparsity: 0.1553, step: 4000 lambda_1: 0.2139, lambda_2: 15.2060 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.98 0.99 0.88 0.59 0.54] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.78, 0.54, 0.48] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.78, 0.42, 0.2] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111011111111100001111101111110111100011110 11111111111111111111101001010000000101001000000000 11011000111101010101110101011001010000100100001001 loss: 0.001701, lagrangian_loss: 0.001790, attention_score_distillation_loss: 0.001520 ETA: 1:31:38 | Epoch 34 finished. Took 32.8 seconds. loss: 0.041286, lagrangian_loss: -0.002232, attention_score_distillation_loss: 0.001521 ---------------------------------------------------------------------- time: 2023-07-19 14:43:31 Evaluating: f1: 0.8946, eval_loss: 0.6305, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.1315, expected_sparsity: 0.1309, expected_sequence_sparsity: 0.6442, target_sparsity: 0.1572, step: 4050 lambda_1: 0.0101, lambda_2: 15.4687 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 
0.98 0.99 0.92 0.63 0.56] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.58, 0.5] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.58, 0.29] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111101001011100000101001000000000 11111000111101010101110101011001010000100100001001 loss: 0.004622, lagrangian_loss: 0.002105, attention_score_distillation_loss: 0.001514 loss: 0.003563, lagrangian_loss: 0.001391, attention_score_distillation_loss: 0.001512 ---------------------------------------------------------------------- time: 2023-07-19 14:43:46 Evaluating: f1: 0.8901, eval_loss: 0.5755, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1803, expected_sparsity: 0.1762, expected_sequence_sparsity: 0.6628, target_sparsity: 0.1592, step: 4100 lambda_1: -0.5897, lambda_2: 15.7903 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.97 0.98 0.87 0.58 0.53] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.78, 0.52, 0.48] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.78, 0.41, 0.19] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111011111111100001111101111110111100011110 11111111111111111111101001000000000101001000000000 11011000111101010101110101011001010000100100001001 loss: 0.002368, lagrangian_loss: -0.003550, attention_score_distillation_loss: 0.001509 loss: 0.001664, lagrangian_loss: 0.000405, attention_score_distillation_loss: 0.001506 ETA: 1:31:03 | Epoch 35 finished. Took 32.94 seconds. ---------------------------------------------------------------------- time: 2023-07-19 14:44:00 Evaluating: f1: 0.8866, eval_loss: 0.6434, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1803, expected_sparsity: 0.1762, expected_sequence_sparsity: 0.6628, target_sparsity: 0.1611, step: 4150 lambda_1: 0.3783, lambda_2: 16.2755 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 
0.97 0.99 0.87 0.58 0.54] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.78, 0.52, 0.48] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.78, 0.41, 0.19] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111011111111100001111101111110111100011110 11111111111111111111101001000000000101001000000000 11011000111101010101110101011001010000100100001001 loss: 0.002583, lagrangian_loss: 0.002737, attention_score_distillation_loss: 0.001503 loss: 0.002735, lagrangian_loss: -0.002541, attention_score_distillation_loss: 0.001501 ---------------------------------------------------------------------- time: 2023-07-19 14:44:15 Evaluating: f1: 0.8904, eval_loss: 0.6076, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1704, expected_sparsity: 0.1671, expected_sequence_sparsity: 0.6591, target_sparsity: 0.1631, step: 4200 lambda_1: 0.2527, lambda_2: 16.5210 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.98 0.99 0.9 0.62 0.57] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.56, 0.5] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.45, 0.22] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111011111111100001111101111111111100011110 11111111111111111111101001011000000101001000000000 11011000111101010101110101011001010000100100011001 loss: 0.002468, lagrangian_loss: -0.000532, attention_score_distillation_loss: 0.001498 loss: 0.004049, lagrangian_loss: 0.004055, attention_score_distillation_loss: 0.001495 ---------------------------------------------------------------------- time: 2023-07-19 14:44:30 Evaluating: f1: 0.8927, eval_loss: 0.624, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1803, expected_sparsity: 0.1762, expected_sequence_sparsity: 0.6628, target_sparsity: 0.165, step: 4250 lambda_1: -0.6266, lambda_2: 17.0369 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.97 0.99 0.87 0.58 0.53] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.78, 0.52, 0.48] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.78, 0.41, 0.19] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111011111111100001111101111110111100011110 11111111111111111111101001000000000101001000000000 11011000111101010101110101011001010000100100001001 loss: 0.001979, lagrangian_loss: -0.000864, attention_score_distillation_loss: 0.001492 ETA: 1:30:38 | Epoch 36 finished. Took 35.15 seconds. 
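Note: the printed ETA is roughly consistent with (epochs remaining) x (average epoch time so far); the epochs above take about 33 seconds each and num_train_epochs is 200. A rough sanity check, with the averaging details assumed:

import datetime

num_train_epochs = 200       # from the TrainingArguments at the top of the log
epochs_done = 37             # "Epoch 36 finished" above, assuming 0-indexed epochs
avg_epoch_seconds = 33.3     # epochs in this section take ~32.6-35.2 s each

eta = datetime.timedelta(seconds=round((num_train_epochs - epochs_done) * avg_epoch_seconds))
print(eta)                   # 1:30:28, close to the "ETA: 1:30:38" printed above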
loss: 0.001414, lagrangian_loss: -0.003321, attention_score_distillation_loss: 0.001490 ---------------------------------------------------------------------- time: 2023-07-19 14:44:44 Evaluating: f1: 0.9038, eval_loss: 0.5644, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1876, expected_sparsity: 0.1826, expected_sequence_sparsity: 0.6654, target_sparsity: 0.1669, step: 4300 lambda_1: -0.0964, lambda_2: 17.3405 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.97 0.98 0.85 0.55 0.52] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.5, 0.46] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.38, 0.17] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111011111111100001111101111110111000011110 11111111111111111110101001000000000101001000000000 11010000111101010101110101011001010000100100001001 loss: 0.002169, lagrangian_loss: 0.000855, attention_score_distillation_loss: 0.001486 loss: 0.001136, lagrangian_loss: 0.001979, attention_score_distillation_loss: 0.001483 ---------------------------------------------------------------------- time: 2023-07-19 14:44:59 Evaluating: f1: 0.8988, eval_loss: 0.5826, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1803, expected_sparsity: 0.1762, expected_sequence_sparsity: 0.6628, target_sparsity: 0.1689, step: 4350 lambda_1: 0.4671, lambda_2: 17.6359 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.97 0.99 0.88 0.58 0.54] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.78, 0.52, 0.48] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.78, 0.41, 0.19] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111011111111100001111101111110111100011110 11111111111111111111101001000000000101001000000000 11011000111101010101110101011001010000100100001001 loss: 0.001803, lagrangian_loss: -0.001633, attention_score_distillation_loss: 0.001481 ETA: 1:30:02 | Epoch 37 finished. Took 32.78 seconds. loss: 0.002306, lagrangian_loss: -0.000777, attention_score_distillation_loss: 0.001478 ---------------------------------------------------------------------- time: 2023-07-19 14:45:13 Evaluating: f1: 0.8873, eval_loss: 0.5692, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1803, expected_sparsity: 0.1762, expected_sequence_sparsity: 0.6628, target_sparsity: 0.1708, step: 4400 lambda_1: -0.1490, lambda_2: 17.9658 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 
0.98 0.99 0.88 0.58 0.54] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.78, 0.52, 0.48] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.78, 0.41, 0.19] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111011111111100001111101111110111100011110 11111111111111111111101001000000000101001000000000 11011000111101010101110101011001010000100100001001 loss: 0.002278, lagrangian_loss: 0.002348, attention_score_distillation_loss: 0.001475 loss: 0.001694, lagrangian_loss: 0.000603, attention_score_distillation_loss: 0.001471 ---------------------------------------------------------------------- time: 2023-07-19 14:45:28 Evaluating: f1: 0.8908, eval_loss: 0.5759, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1892, expected_sparsity: 0.1857, expected_sequence_sparsity: 0.6667, target_sparsity: 0.1728, step: 4450 lambda_1: -0.4619, lambda_2: 18.1359 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.99 0.97 0.98 0.85 0.55 0.51] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.74, 0.5, 0.46] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.74, 0.37, 0.17] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111011111101100001111101111110111000011110 11111111111111111110101001000000000101001000000000 11010000111101010101110101011001010000100100001001 loss: 0.001542, lagrangian_loss: -0.001897, attention_score_distillation_loss: 0.001469 loss: 0.002186, lagrangian_loss: -0.000558, attention_score_distillation_loss: 0.001467 ETA: 1:29:27 | Epoch 38 finished. Took 33.03 seconds. ---------------------------------------------------------------------- time: 2023-07-19 14:45:43 Evaluating: f1: 0.8901, eval_loss: 0.5611, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1892, expected_sparsity: 0.1857, expected_sequence_sparsity: 0.6667, target_sparsity: 0.1747, step: 4500 lambda_1: 0.1087, lambda_2: 18.3987 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 
0.99 0.97 0.98 0.85 0.54 0.51] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.74, 0.5, 0.46] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.74, 0.37, 0.17] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111011111101100001111101111110111000011110 11111111111111111110101001000000000101001000000000 11010000111101010101110101011001010000100100001001 loss: 0.147403, lagrangian_loss: 0.001157, attention_score_distillation_loss: 0.001464 loss: 0.002329, lagrangian_loss: -0.000249, attention_score_distillation_loss: 0.001461 ---------------------------------------------------------------------- time: 2023-07-19 14:45:57 Evaluating: f1: 0.8831, eval_loss: 0.6108, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1835, expected_sparsity: 0.1793, expected_sequence_sparsity: 0.6641, target_sparsity: 0.1766, step: 4550 lambda_1: 0.2380, lambda_2: 18.4993 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.97 0.99 0.87 0.56 0.53] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.52, 0.48] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.4, 0.19] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111011111111100001111101111110111000011110 11111111111111111111101001000000000101001000000000 11011000111101010101110101011001010000100100001001 loss: 0.001284, lagrangian_loss: -0.000760, attention_score_distillation_loss: 0.001456 loss: 0.003949, lagrangian_loss: 0.001004, attention_score_distillation_loss: 0.001453 ---------------------------------------------------------------------- time: 2023-07-19 14:46:12 Evaluating: f1: 0.8958, eval_loss: 0.5906, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.185, expected_sparsity: 0.1816, expected_sequence_sparsity: 0.665, target_sparsity: 0.1786, step: 4600 lambda_1: -0.2839, lambda_2: 18.7176 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.99 0.97 0.98 0.86 0.55 0.52] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.5, 0.48] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.38, 0.18] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111011111111100001111101111110111000011110 11111111111111111110101001000000000101001000000000 11011000111101010101110101011001010000100100001001 ETA: 1:29:02 | Epoch 39 finished. Took 35.3 seconds. 
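A note on the sparsity bookkeeping printed above: the "layerwise remain" list appears to be the running product of the per-location "infer remain" ratios, i.e. the fraction of the original 50-token bin still alive after each token-pruning location. Below is a minimal sketch using the step-4600 evaluation just above; treating the two extra leading 1.0 entries of "layerwise remain" (12 values versus 10) as embedding/input stages is an assumption, not something stated in the log.

import numpy as np

# "infer remain" from the step-4600 evaluation: fraction of tokens kept at
# each of the 10 token-pruning locations, relative to the previous layer.
infer_remain = np.array([1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.5, 0.48])

# Running product = fraction of the original 50-token bin surviving up to each layer.
print(np.round(np.cumprod(infer_remain), 2))
# [1.   1.   1.   1.   1.   1.   1.   0.76 0.38 0.18]
# matching the tail of the logged "layerwise remain: [... 0.76, 0.38, 0.18]".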
loss: 0.002275, lagrangian_loss: 0.000923, attention_score_distillation_loss: 0.001451 loss: 0.002025, lagrangian_loss: -0.000761, attention_score_distillation_loss: 0.001449 ---------------------------------------------------------------------- time: 2023-07-19 14:46:27 Evaluating: f1: 0.8869, eval_loss: 0.5931, token_prune_loc: [False, False, False, False, False, True, False, True, True, True], macs_sparsity: 0.213, expected_sparsity: 0.208, expected_sequence_sparsity: 0.6758, target_sparsity: 0.1805, step: 4650 lambda_1: -0.2657, lambda_2: 18.7785 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.99 0.96 0.98 0.83 0.54 0.51] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 1.0, 0.74, 0.5, 0.46] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.92, 0.68, 0.34, 0.16] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111110111111110110 11111111111111111111111111111111111111111111111111 11111111111011111111100001111101111110011000011110 11111111111111111110101001000000000101001000000000 11010000111101010101110101011001010000100100001001 loss: 0.007277, lagrangian_loss: -0.000894, attention_score_distillation_loss: 0.001445 loss: 0.001671, lagrangian_loss: 0.000343, attention_score_distillation_loss: 0.001442 ---------------------------------------------------------------------- time: 2023-07-19 14:46:41 Evaluating: f1: 0.8794, eval_loss: 0.6231, token_prune_loc: [False, False, False, False, False, True, False, True, True, True], macs_sparsity: 0.2067, expected_sparsity: 0.2024, expected_sequence_sparsity: 0.6735, target_sparsity: 0.1825, step: 4700 lambda_1: 0.1626, lambda_2: 18.9286 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.99 0.96 0.98 0.84 0.54 0.51] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.94, 1.0, 0.74, 0.5, 0.46] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.94, 0.94, 0.7, 0.35, 0.16] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111111111111110110 11111111111111111111111111111111111111111111111111 11111111111011111111100001111101111110011000011110 11111111111111111110101001000000000101001000000000 11010000111101010101110101011001010000100100001001 loss: 0.002219, lagrangian_loss: 0.000162, attention_score_distillation_loss: 0.001441 ETA: 1:28:27 | Epoch 40 finished. Took 33.07 seconds. loss: 0.002402, lagrangian_loss: -0.000377, attention_score_distillation_loss: 0.001436 ---------------------------------------------------------------------- time: 2023-07-19 14:46:56 Evaluating: f1: 0.8976, eval_loss: 0.6232, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1892, expected_sparsity: 0.1857, expected_sequence_sparsity: 0.6667, target_sparsity: 0.1844, step: 4750 lambda_1: 0.0243, lambda_2: 18.9833 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 
0.99 0.97 0.98 0.85 0.54 0.52] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.74, 0.5, 0.46] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.74, 0.37, 0.17] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111011111101100001111101111110111000011110 11111111111111111110101001000000000101001000000000 11010000111101010101110101011001010000100100001001 loss: 0.001965, lagrangian_loss: 0.000308, attention_score_distillation_loss: 0.001433 loss: 0.008746, lagrangian_loss: 0.000630, attention_score_distillation_loss: 0.001430 ---------------------------------------------------------------------- time: 2023-07-19 14:47:10 Evaluating: f1: 0.8955, eval_loss: 0.589, token_prune_loc: [False, False, False, False, False, True, False, True, True, True], macs_sparsity: 0.2172, expected_sparsity: 0.2108, expected_sequence_sparsity: 0.677, target_sparsity: 0.1864, step: 4800 lambda_1: -0.2657, lambda_2: 19.0625 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.99 0.96 0.98 0.83 0.53 0.51] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 1.0, 0.72, 0.5, 0.46] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.92, 0.66, 0.33, 0.15] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111110111111110110 11111111111111111111111111111111111111111111111111 11111111111011111101100001111101111110011000011110 11111111111111111110101001000000000101001000000000 11010000111101010101110101011001010000100100001001 loss: 0.001323, lagrangian_loss: -0.000299, attention_score_distillation_loss: 0.001428 loss: 0.004584, lagrangian_loss: -0.000480, attention_score_distillation_loss: 0.001425 ETA: 1:27:52 | Epoch 41 finished. Took 32.92 seconds. ---------------------------------------------------------------------- time: 2023-07-19 14:47:25 Evaluating: f1: 0.8851, eval_loss: 0.6961, token_prune_loc: [False, False, False, False, False, True, False, True, True, True], macs_sparsity: 0.2187, expected_sparsity: 0.2127, expected_sequence_sparsity: 0.6778, target_sparsity: 0.1883, step: 4850 lambda_1: -0.0282, lambda_2: 19.1229 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 
0.99 0.96 0.97 0.82 0.52 0.5 ] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 1.0, 0.72, 0.48, 0.46] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.92, 0.66, 0.32, 0.15] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111110111111110110 11111111111111111111111111111111111111111111111111 11111111111011111101100001111101111110011000011110 11111111111111111110101000000000000101001000000000 11010000111101010101110101011001010000100100001001 loss: 0.003234, lagrangian_loss: 0.000107, attention_score_distillation_loss: 0.001422 loss: 0.002721, lagrangian_loss: 0.000015, attention_score_distillation_loss: 0.001420 ---------------------------------------------------------------------- time: 2023-07-19 14:47:40 Evaluating: f1: 0.8907, eval_loss: 0.718, token_prune_loc: [False, False, False, False, False, True, False, True, True, True], macs_sparsity: 0.2172, expected_sparsity: 0.2108, expected_sequence_sparsity: 0.677, target_sparsity: 0.1902, step: 4900 lambda_1: 0.0893, lambda_2: 19.1595 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.99 0.96 0.98 0.83 0.53 0.51] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 1.0, 0.72, 0.5, 0.46] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.92, 0.66, 0.33, 0.15] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111110111111110110 11111111111111111111111111111111111111111111111111 11111111111011111101100001111101111110011000011110 11111111111111111110101001000000000101001000000000 11010000111101010101110101011001010000100100001001 loss: 0.004011, lagrangian_loss: -0.000099, attention_score_distillation_loss: 0.001416 loss: 0.002088, lagrangian_loss: 0.000784, attention_score_distillation_loss: 0.001413 ETA: 1:27:16 | Epoch 42 finished. Took 32.79 seconds. ---------------------------------------------------------------------- time: 2023-07-19 14:47:54 Evaluating: f1: 0.8756, eval_loss: 0.6902, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2282, expected_sparsity: 0.2233, expected_sequence_sparsity: 0.6821, target_sparsity: 0.1922, step: 4950 lambda_1: -0.2617, lambda_2: 19.2591 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 
0.99 0.96 0.97 0.83 0.52 0.5 ] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.94, 0.72, 0.48, 0.46] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.86, 0.62, 0.3, 0.14] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111110111111111110110 11111111111111111111111111111111011111111111110110 11111111111011111101100001111101111110011000011110 11111111111111111110101000000000000101001000000000 11010000111101010101110101011001010000100100001001 loss: 0.004913, lagrangian_loss: 0.000569, attention_score_distillation_loss: 0.001410 loss: 0.130653, lagrangian_loss: -0.000810, attention_score_distillation_loss: 0.001407 ---------------------------------------------------------------------- time: 2023-07-19 14:48:09 Evaluating: f1: 0.8811, eval_loss: 0.6308, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2355, expected_sparsity: 0.2293, expected_sequence_sparsity: 0.6846, target_sparsity: 0.1941, step: 5000 lambda_1: -0.1604, lambda_2: 19.3247 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.99 0.96 0.96 0.81 0.51 0.49] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.92, 0.7, 0.48, 0.46] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.85, 0.59, 0.28, 0.13] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111110111111111110110 11111111111111111111111011111111011111111111110110 11111111111011111101100001111101111100011000011110 11111111111111111110101000000000000101001000000000 11010000111101010101110101011001010000100100001001 loss: 0.006452, lagrangian_loss: -0.000297, attention_score_distillation_loss: 0.001406 loss: 0.002083, lagrangian_loss: 0.000354, attention_score_distillation_loss: 0.001403 ---------------------------------------------------------------------- time: 2023-07-19 14:48:23 Evaluating: f1: 0.8835, eval_loss: 0.6476, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2282, expected_sparsity: 0.2233, expected_sequence_sparsity: 0.6821, target_sparsity: 0.1961, step: 5050 lambda_1: 0.1716, lambda_2: 19.4326 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.99 0.96 0.96 0.82 0.52 0.5 ] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.94, 0.72, 0.48, 0.46] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.86, 0.62, 0.3, 0.14] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111110111111111110110 11111111111111111111111111111111011111111111110110 11111111111011111101100001111101111110011000011110 11111111111111111110101000000000000101001000000000 11010000111101010101110101011001010000100100001001 loss: 0.002256, lagrangian_loss: -0.000220, attention_score_distillation_loss: 0.001399 ETA: 1:26:49 | Epoch 43 finished. Took 34.97 seconds. 
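The "lagrangian_loss" column, together with lambda_1 and lambda_2, is what pulls expected_sparsity toward target_sparsity. Its exact form is not shown in the log; a common choice for this kind of L0-style pruning, and one consistent with lambda_1 oscillating around zero while lambda_2 grows slowly (about 15.8 to 22.4 across this stretch), is a first- plus second-order penalty on the sparsity gap with adversarially trained multipliers. The sketch below is that generic form, not the script's confirmed implementation.

def lagrangian_loss(expected_sparsity, target_sparsity, lambda_1, lambda_2):
    # Penalize the signed gap and its square; lambda_1 and lambda_2 are
    # typically trained by gradient ascent on this same quantity, so they
    # grow while the model undershoots the schedule and shrink when it
    # overshoots.
    gap = expected_sparsity - target_sparsity
    return lambda_1 * gap + lambda_2 * gap ** 2

# Eval-time numbers from step 5050 above (the logged per-step loss is computed
# from the stochastic gate estimate, so it will not reproduce exactly):
print(lagrangian_loss(0.2233, 0.1961, 0.1716, 19.4326))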
loss: 0.002974, lagrangian_loss: 0.000100, attention_score_distillation_loss: 0.001395 ---------------------------------------------------------------------- time: 2023-07-19 14:48:38 Evaluating: f1: 0.8644, eval_loss: 0.6402, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2282, expected_sparsity: 0.2233, expected_sequence_sparsity: 0.6821, target_sparsity: 0.198, step: 5100 lambda_1: -0.1291, lambda_2: 19.5192 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.99 0.96 0.96 0.81 0.52 0.5 ] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.94, 0.72, 0.48, 0.46] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.86, 0.62, 0.3, 0.14] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111110111111111110110 11111111111111111111111111111111011111111111110110 11111111111011111101100001111101111110011000011110 11111111111111111110101000000000000101001000000000 11010000111101010101110101011001010000100100001001 loss: 0.008504, lagrangian_loss: 0.000511, attention_score_distillation_loss: 0.001394 loss: 0.003168, lagrangian_loss: -0.000217, attention_score_distillation_loss: 0.001390 ---------------------------------------------------------------------- time: 2023-07-19 14:48:52 Evaluating: f1: 0.8726, eval_loss: 0.6804, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2355, expected_sparsity: 0.2293, expected_sequence_sparsity: 0.6846, target_sparsity: 0.2, step: 5150 lambda_1: -0.1670, lambda_2: 19.5615 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.99 0.95 0.95 0.79 0.51 0.49] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.92, 0.7, 0.48, 0.46] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.85, 0.59, 0.28, 0.13] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111110111111110110 11111111111111111111111011111111011111111111110110 11111111111011111111100001111101111100010000011110 11111111111111111110101000000000000101001000000000 11010000111101010101110101011001010000100100001001 loss: 0.006599, lagrangian_loss: -0.000354, attention_score_distillation_loss: 0.001387 ETA: 1:26:13 | Epoch 44 finished. Took 32.81 seconds. loss: 0.119889, lagrangian_loss: 0.000104, attention_score_distillation_loss: 0.001384 ---------------------------------------------------------------------- time: 2023-07-19 14:49:07 Evaluating: f1: 0.8799, eval_loss: 0.7258, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2355, expected_sparsity: 0.2293, expected_sequence_sparsity: 0.6846, target_sparsity: 0.2019, step: 5200 lambda_1: 0.0916, lambda_2: 19.6263 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 
0.99 0.96 0.96 0.8 0.51 0.49] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.92, 0.7, 0.48, 0.46] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.85, 0.59, 0.28, 0.13] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111110111111110110 11111111111111111111111011111111011111111111110110 11111111111011111111100001111101111100010000011110 11111111111111111110101000000000000101001000000000 11010000111101010101110101011001010000100100001001 loss: 0.007245, lagrangian_loss: -0.000100, attention_score_distillation_loss: 0.001382 loss: 0.125823, lagrangian_loss: 0.000018, attention_score_distillation_loss: 0.001379 ---------------------------------------------------------------------- time: 2023-07-19 14:49:22 Evaluating: f1: 0.887, eval_loss: 0.6445, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2355, expected_sparsity: 0.231, expected_sequence_sparsity: 0.6853, target_sparsity: 0.2038, step: 5250 lambda_1: -0.0573, lambda_2: 19.6480 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.99 0.95 0.96 0.8 0.5 0.49] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.92, 0.7, 0.46, 0.46] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.85, 0.59, 0.27, 0.13] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111110111111110110 11111111111111111111101111111111011111111111110110 11111111111011111111100001111101111100010000011110 11111111111111111110101000000000000100001000000000 11010000111101010101110101011001010000100100001001 loss: 0.002161, lagrangian_loss: 0.000049, attention_score_distillation_loss: 0.001377 loss: 0.002037, lagrangian_loss: -0.000026, attention_score_distillation_loss: 0.001374 ETA: 1:25:39 | Epoch 45 finished. Took 33.07 seconds. ---------------------------------------------------------------------- time: 2023-07-19 14:49:36 Evaluating: f1: 0.8859, eval_loss: 0.6789, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2355, expected_sparsity: 0.231, expected_sequence_sparsity: 0.6853, target_sparsity: 0.2058, step: 5300 lambda_1: -0.0599, lambda_2: 19.6521 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 
0.99 0.95 0.96 0.79 0.49 0.49] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.92, 0.7, 0.46, 0.46] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.85, 0.59, 0.27, 0.13] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111110111111110110 11111111111111111111101111111111011111111111110110 11111111111011111111100001111101111100010000011110 11111111111111111110101000000000000100001000000000 11010000111101010101110101011001010000100100001001 loss: 0.001981, lagrangian_loss: -0.000029, attention_score_distillation_loss: 0.001370 loss: 0.002279, lagrangian_loss: -0.000011, attention_score_distillation_loss: 0.001367 ---------------------------------------------------------------------- time: 2023-07-19 14:49:51 Evaluating: f1: 0.8754, eval_loss: 0.6692, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2371, expected_sparsity: 0.2335, expected_sequence_sparsity: 0.6863, target_sparsity: 0.2077, step: 5350 lambda_1: -0.0195, lambda_2: 19.6541 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.99 0.95 0.96 0.79 0.49 0.49] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.92, 0.68, 0.46, 0.46] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.85, 0.58, 0.26, 0.12] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111110111111110110 11111111111111111111101111111111011111111111110110 11111111111011111101100001111101111100010000011110 11111111111111111110101000000000000100001000000000 11010000111101010101110101011001010000100100001001 loss: 0.008156, lagrangian_loss: 0.000034, attention_score_distillation_loss: 0.001365 loss: 0.002996, lagrangian_loss: 0.000086, attention_score_distillation_loss: 0.001362 ---------------------------------------------------------------------- time: 2023-07-19 14:50:06 Evaluating: f1: 0.8912, eval_loss: 0.6219, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2371, expected_sparsity: 0.2342, expected_sequence_sparsity: 0.6866, target_sparsity: 0.2097, step: 5400 lambda_1: -0.1236, lambda_2: 19.6646 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.99 0.95 0.95 0.78 0.48 0.48] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.92, 0.68, 0.46, 0.44] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.85, 0.58, 0.26, 0.12] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111110111111110110 11111111111111111111101111111111011111111111110110 11111111111011111101100001111101111100010000011110 11111111111111111110101000000000000100001000000000 11010000111101010101110101010001010000100100001001 loss: 0.001552, lagrangian_loss: 0.000226, attention_score_distillation_loss: 0.001359 ETA: 1:25:12 | Epoch 46 finished. Took 35.33 seconds. 
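The ten 50-character bit rows printed with each evaluation can be read as one row per token_prune_loc and one column per slot of the 50-token bin, with '1' marking a kept token; under that reading, the fraction of ones in a row reproduces the corresponding "infer remain" entry. This interpretation is inferred from the numbers rather than stated in the log; a quick check against two rows of the step-5350 evaluation above:

rows_step_5350 = [
    "11111111111111111111111111011111111110111111110110",  # 6th row
    "11111111111011111101100001111101111100010000011110",  # 8th row
]
for row in rows_step_5350:
    # Fraction of kept token slots in this layer's 50-token bin.
    print(row.count("1") / len(row))
# 0.92 and 0.68, matching the 6th and 8th entries of
# "infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.92, 0.68, 0.46, 0.46]".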
loss: 0.002585, lagrangian_loss: -0.000138, attention_score_distillation_loss: 0.001357 ---------------------------------------------------------------------- time: 2023-07-19 14:50:20 Evaluating: f1: 0.8801, eval_loss: 0.5993, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2371, expected_sparsity: 0.2342, expected_sequence_sparsity: 0.6866, target_sparsity: 0.2116, step: 5450 lambda_1: -0.0930, lambda_2: 19.6823 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.99 0.95 0.95 0.76 0.48 0.48] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.92, 0.68, 0.46, 0.44] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.85, 0.58, 0.26, 0.12] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111110111111110110 11111111111111111111101111111111011111111111110110 11111111111011111111100001111101111100010000010110 11111111111111111110101000000000000100001000000000 11010000111101010101110101010001010000100100001001 loss: 0.078087, lagrangian_loss: -0.000103, attention_score_distillation_loss: 0.001354 loss: 0.004137, lagrangian_loss: 0.000084, attention_score_distillation_loss: 0.001352 ---------------------------------------------------------------------- time: 2023-07-19 14:50:35 Evaluating: f1: 0.8873, eval_loss: 0.6385, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2371, expected_sparsity: 0.2342, expected_sequence_sparsity: 0.6866, target_sparsity: 0.2135, step: 5500 lambda_1: 0.0919, lambda_2: 19.7210 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.99 0.95 0.95 0.77 0.48 0.48] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.92, 0.68, 0.46, 0.44] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.85, 0.58, 0.26, 0.12] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111110111111110110 11111111111111111111101111111111011111111111110110 11111111111011111101100001111101111100010000011110 11111111111111111110101000000000000100001000000000 11010000111101010101110101010001010000100100001001 loss: 0.004202, lagrangian_loss: -0.000107, attention_score_distillation_loss: 0.001349 ETA: 1:24:38 | Epoch 47 finished. Took 33.18 seconds. loss: 0.007732, lagrangian_loss: 0.000392, attention_score_distillation_loss: 0.001346 ---------------------------------------------------------------------- time: 2023-07-19 14:50:50 Evaluating: f1: 0.8821, eval_loss: 0.6445, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2412, expected_sparsity: 0.2367, expected_sequence_sparsity: 0.6876, target_sparsity: 0.2155, step: 5550 lambda_1: -0.2254, lambda_2: 19.7955 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 
0.99 0.95 0.95 0.76 0.47 0.48] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.92, 0.66, 0.46, 0.44] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.85, 0.56, 0.26, 0.11] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111111011111110110 11111111111111111111101111111111011111111111110110 11111111111011111101100001111101111100010000010110 11111111111111111110101000000000000100001000000000 11010000111101010101110101010001010000100100001001 loss: 0.002439, lagrangian_loss: 0.000167, attention_score_distillation_loss: 0.001344 loss: 0.003180, lagrangian_loss: -0.000499, attention_score_distillation_loss: 0.001340 ---------------------------------------------------------------------- time: 2023-07-19 14:51:05 Evaluating: f1: 0.8866, eval_loss: 0.6246, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2507, expected_sparsity: 0.2464, expected_sequence_sparsity: 0.6916, target_sparsity: 0.2174, step: 5600 lambda_1: -0.0218, lambda_2: 19.8628 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.99 0.94 0.94 0.75 0.47 0.47] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.9, 0.66, 0.44, 0.44] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.81, 0.53, 0.24, 0.1] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111110011111110110 11111111111111111111101011111111011111111111110110 11111111111011111101100001111101111100010000010110 11111111111111011110101000000000000100001000000000 11010000111101010101110101011001010000100100001000 loss: 0.002003, lagrangian_loss: 0.000167, attention_score_distillation_loss: 0.001338 loss: 0.001499, lagrangian_loss: 0.000071, attention_score_distillation_loss: 0.001334 ETA: 1:24:03 | Epoch 48 finished. Took 33.02 seconds. ---------------------------------------------------------------------- time: 2023-07-19 14:51:19 Evaluating: f1: 0.887, eval_loss: 0.6624, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2412, expected_sparsity: 0.2367, expected_sequence_sparsity: 0.6876, target_sparsity: 0.2194, step: 5650 lambda_1: 0.1345, lambda_2: 19.9195 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 
0.99 0.94 0.95 0.76 0.48 0.48] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.92, 0.66, 0.46, 0.44] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.85, 0.56, 0.26, 0.11] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111111011111110110 11111111111111111111101111111111011111111111110110 11111111111011111101100001111101111100010000010110 11111111111111111110101000000000000100001000000000 11010000111101010101110101010001010000100100001001 loss: 0.004546, lagrangian_loss: -0.000222, attention_score_distillation_loss: 0.001331 loss: 0.009626, lagrangian_loss: 0.000608, attention_score_distillation_loss: 0.001327 ---------------------------------------------------------------------- time: 2023-07-19 14:51:34 Evaluating: f1: 0.8873, eval_loss: 0.6521, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2507, expected_sparsity: 0.2464, expected_sequence_sparsity: 0.6916, target_sparsity: 0.2213, step: 5700 lambda_1: -0.2832, lambda_2: 20.0430 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.99 0.94 0.95 0.74 0.47 0.47] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.9, 0.66, 0.44, 0.44] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.81, 0.53, 0.24, 0.1] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111110011111110110 11111111111111111111101011111111011111111111110110 11111111111011110111100001111101111100010000010110 11111111111111011110101000000000000100001000000000 11010000111101010101110101011001010000100100001000 loss: 0.002739, lagrangian_loss: 0.000332, attention_score_distillation_loss: 0.001325 loss: 0.001201, lagrangian_loss: -0.000605, attention_score_distillation_loss: 0.001323 ---------------------------------------------------------------------- time: 2023-07-19 14:51:48 Evaluating: f1: 0.8847, eval_loss: 0.6252, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2523, expected_sparsity: 0.2487, expected_sequence_sparsity: 0.6925, target_sparsity: 0.2233, step: 5750 lambda_1: -0.0978, lambda_2: 20.1130 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.99 0.93 0.94 0.72 0.46 0.47] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.9, 0.64, 0.44, 0.44] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.81, 0.52, 0.23, 0.1] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111110011111110110 11111111111111111111101111111111011111111011110110 11111111111011110111100001111101111100000000010110 11111111111111011110101000000000000100001000000000 11010000111101010101110101011001010000100100001000 ETA: 1:23:35 | Epoch 49 finished. Took 35.24 seconds. 
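target_sparsity itself follows a schedule: it rises by roughly 0.002 every 50 steps (0.2194 at step 5650, 0.2213 at 5700, 0.2233 at 5750), which is consistent with a linear ramp of about 3.9e-5 per step toward the 0.67 in the run name. The constants in the sketch below are fitted to the logged values and are assumptions, not taken from the training script.

def target_sparsity(step, final_sparsity=0.67, ramp_steps=17_250):
    # Linear ramp from 0 toward final_sparsity; ramp_steps is an assumed
    # constant chosen so the line passes through the logged values.
    return final_sparsity * min(step / ramp_steps, 1.0)

for s in (5650, 5700, 5750):
    print(s, round(target_sparsity(s), 4))
# -> 0.2194, 0.2214, 0.2233; the logged values are 0.2194, 0.2213 and 0.2233,
#    so the fit holds to about the last digit.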
loss: 0.002118, lagrangian_loss: -0.000030, attention_score_distillation_loss: 0.001320 loss: 0.001744, lagrangian_loss: 0.000472, attention_score_distillation_loss: 0.001316 ---------------------------------------------------------------------- time: 2023-07-19 14:52:03 Evaluating: f1: 0.8881, eval_loss: 0.5954, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2507, expected_sparsity: 0.2464, expected_sequence_sparsity: 0.6916, target_sparsity: 0.2252, step: 5800 lambda_1: 0.2381, lambda_2: 20.2204 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.99 0.94 0.94 0.74 0.47 0.47] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.9, 0.66, 0.44, 0.44] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.81, 0.53, 0.24, 0.1] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111110011111110110 11111111111111111111101011111111011111111111110110 11111111111011110111100001111101111100010000010110 11111111111111011110101000000000000100001000000000 11010000111101010101110101011001010000100100001000 loss: 0.001065, lagrangian_loss: -0.000535, attention_score_distillation_loss: 0.001314 loss: 0.002065, lagrangian_loss: 0.000404, attention_score_distillation_loss: 0.001311 ---------------------------------------------------------------------- time: 2023-07-19 14:52:18 Evaluating: f1: 0.8832, eval_loss: 0.6782, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2523, expected_sparsity: 0.2487, expected_sequence_sparsity: 0.6925, target_sparsity: 0.2271, step: 5850 lambda_1: -0.2462, lambda_2: 20.4010 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.99 0.94 0.94 0.73 0.46 0.47] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.9, 0.64, 0.44, 0.44] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.81, 0.52, 0.23, 0.1] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111110011111110110 11111111111111111111101011111111011111111111110110 11111111111011110111100001111101111100000000010110 11111111111111011110101000000000000100001000000000 11010000111101010101110101011001010000100100001000 loss: 0.005399, lagrangian_loss: 0.000750, attention_score_distillation_loss: 0.001307 ETA: 1:23:01 | Epoch 50 finished. Took 33.06 seconds. loss: 0.002068, lagrangian_loss: -0.000690, attention_score_distillation_loss: 0.001304 ---------------------------------------------------------------------- time: 2023-07-19 14:52:32 Evaluating: f1: 0.8773, eval_loss: 0.7855, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.258, expected_sparsity: 0.2524, expected_sequence_sparsity: 0.694, target_sparsity: 0.2291, step: 5900 lambda_1: -0.2118, lambda_2: 20.4835 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 
0.99 0.93 0.93 0.71 0.45 0.47] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.9, 0.62, 0.42, 0.44] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.81, 0.5, 0.21, 0.09] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111110011111110110 11111111111111111111111011111111011111111011110110 11111111111011110101100001111101111100000000010110 11111111111111011110101000000000000100000000000000 11010000111101010101110101011001010000100100001000 loss: 0.001006, lagrangian_loss: -0.000531, attention_score_distillation_loss: 0.001302 loss: 0.002221, lagrangian_loss: 0.000464, attention_score_distillation_loss: 0.001299 ---------------------------------------------------------------------- time: 2023-07-19 14:52:47 Evaluating: f1: 0.8789, eval_loss: 0.6777, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2523, expected_sparsity: 0.2487, expected_sequence_sparsity: 0.6925, target_sparsity: 0.231, step: 5950 lambda_1: 0.2473, lambda_2: 20.6397 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.99 0.93 0.93 0.72 0.46 0.47] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.9, 0.64, 0.44, 0.44] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.81, 0.52, 0.23, 0.1] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111110011111110110 11111111111111111111101011111111011111111111110110 11111111111011110111100001111101111100000000010110 11111111111111011110101000000000000100001000000000 11010000111101010101110101011001010000100100001000 loss: 0.004002, lagrangian_loss: -0.000142, attention_score_distillation_loss: 0.001297 loss: 0.003563, lagrangian_loss: -0.000285, attention_score_distillation_loss: 0.001295 ETA: 1:22:27 | Epoch 51 finished. Took 33.2 seconds. ---------------------------------------------------------------------- time: 2023-07-19 14:53:02 Evaluating: f1: 0.8754, eval_loss: 0.7229, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2523, expected_sparsity: 0.2487, expected_sequence_sparsity: 0.6925, target_sparsity: 0.233, step: 6000 lambda_1: -0.1442, lambda_2: 20.8115 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 
0.99 0.93 0.94 0.72 0.46 0.47] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.9, 0.64, 0.44, 0.44] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.81, 0.52, 0.23, 0.1] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111110011111110110 11111111111111111111101011111111011111111111110110 11111111111011110111100001111101111100000000010110 11111111111111011110101000000000000100001000000000 11010000111101010101110101011001010000100100001000 loss: 0.003513, lagrangian_loss: 0.001245, attention_score_distillation_loss: 0.001290 loss: 0.003007, lagrangian_loss: 0.000043, attention_score_distillation_loss: 0.001288 ---------------------------------------------------------------------- time: 2023-07-19 14:53:17 Evaluating: f1: 0.8758, eval_loss: 0.724, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2612, expected_sparsity: 0.2555, expected_sequence_sparsity: 0.6953, target_sparsity: 0.2349, step: 6050 lambda_1: -0.3557, lambda_2: 20.9263 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.99 0.92 0.92 0.69 0.44 0.46] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.88, 0.62, 0.42, 0.44] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.79, 0.49, 0.21, 0.09] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111110011111110110 11111111111111111111111011111111010111111011110110 11111111111011110101100001111101111100000000010110 11111111111111011110101000000000000100000000000000 11010000111101010101110101011001010000100100001000 loss: 0.002983, lagrangian_loss: -0.001101, attention_score_distillation_loss: 0.001286 loss: 0.001251, lagrangian_loss: -0.000030, attention_score_distillation_loss: 0.001283 ETA: 1:21:53 | Epoch 52 finished. Took 33.42 seconds. ---------------------------------------------------------------------- time: 2023-07-19 14:53:32 Evaluating: f1: 0.8836, eval_loss: 0.676, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2612, expected_sparsity: 0.2555, expected_sequence_sparsity: 0.6953, target_sparsity: 0.2368, step: 6100 lambda_1: 0.1605, lambda_2: 21.1270 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 
0.99 0.92 0.92 0.69 0.44 0.46] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.88, 0.62, 0.42, 0.44] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.79, 0.49, 0.21, 0.09] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111110011111110110 11111111111111111111111011111111010111111011110110 11111111111011110101100001111101111100000000010110 11111111111111011110101000000000000100000000000000 11010000111101010101110101011001010000100100001000 loss: 0.004098, lagrangian_loss: 0.000299, attention_score_distillation_loss: 0.001280 loss: 0.003099, lagrangian_loss: -0.000458, attention_score_distillation_loss: 0.001277 ---------------------------------------------------------------------- time: 2023-07-19 14:53:46 Evaluating: f1: 0.8807, eval_loss: 0.7365, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.258, expected_sparsity: 0.2524, expected_sequence_sparsity: 0.694, target_sparsity: 0.2388, step: 6150 lambda_1: 0.0093, lambda_2: 21.2409 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.99 0.93 0.92 0.7 0.45 0.47] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.9, 0.62, 0.42, 0.44] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.81, 0.5, 0.21, 0.09] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111110011111110110 11111111111111111111111111111111010111111011110110 11111111111011110101100001111101111100000000010110 11111111111111011110101000000000000100000000000000 11010000111101010101110101011001010000100100001000 loss: 0.002106, lagrangian_loss: 0.000492, attention_score_distillation_loss: 0.001274 loss: 0.003035, lagrangian_loss: 0.000563, attention_score_distillation_loss: 0.001271 ---------------------------------------------------------------------- time: 2023-07-19 14:54:01 Evaluating: f1: 0.8788, eval_loss: 0.6134, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2627, expected_sparsity: 0.2577, expected_sequence_sparsity: 0.6962, target_sparsity: 0.2407, step: 6200 lambda_1: -0.3376, lambda_2: 21.3858 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.99 0.92 0.91 0.68 0.44 0.46] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.88, 0.6, 0.42, 0.44] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.79, 0.48, 0.2, 0.09] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111110011111110110 11111111111111111111111011111111010111111011110110 11111111111011010101100001111101111100000000010110 11111111111111011110101000000000000100000000000000 11010000111101010101110101011001010000100100001000 loss: 0.002766, lagrangian_loss: -0.000765, attention_score_distillation_loss: 0.001269 ETA: 1:21:25 | Epoch 53 finished. Took 35.54 seconds. 
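The attention_score_distillation_loss column is the third term in the objective; the log does not reveal its formula. As a rough, assumed illustration only, this is what an attention-map distillation term restricted to surviving tokens typically looks like; the function name, shapes, and masking scheme here are hypothetical, not the script's API.

import torch

def attention_distillation(student_attn, teacher_attn, keep_mask):
    """student_attn, teacher_attn: (batch, heads, seq, seq) attention probs;
    keep_mask: (batch, seq), 1.0 for tokens that survive pruning."""
    # Compare attention only over token pairs that are still present in the
    # pruned student; average the squared error over those pairs.
    pair_mask = keep_mask[:, None, :, None] * keep_mask[:, None, None, :]
    diff = (student_attn - teacher_attn) ** 2 * pair_mask
    return diff.sum() / pair_mask.sum().clamp(min=1)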
loss: 0.001574, lagrangian_loss: -0.000224, attention_score_distillation_loss: 0.001266 ---------------------------------------------------------------------- time: 2023-07-19 14:54:16 Evaluating: f1: 0.8803, eval_loss: 0.6397, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.269, expected_sparsity: 0.2636, expected_sequence_sparsity: 0.6986, target_sparsity: 0.2427, step: 6250 lambda_1: 0.0915, lambda_2: 21.5435 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.99 0.92 0.92 0.68 0.43 0.46] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.88, 0.88, 0.6, 0.4, 0.44] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.88, 0.77, 0.46, 0.19, 0.08] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111110011011110110 11111111111111111111111011111111010111111011110110 11111111111011010111100001111101101100000000010110 11111111111111011110101000000000000000000000000000 11010000111101010101110101011001010000100100001000 loss: 0.001186, lagrangian_loss: 0.000041, attention_score_distillation_loss: 0.001264 loss: 0.001851, lagrangian_loss: -0.000062, attention_score_distillation_loss: 0.001260 ---------------------------------------------------------------------- time: 2023-07-19 14:54:30 Evaluating: f1: 0.877, eval_loss: 0.6131, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.269, expected_sparsity: 0.2636, expected_sequence_sparsity: 0.6986, target_sparsity: 0.2446, step: 6300 lambda_1: -0.1096, lambda_2: 21.6312 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.99 0.92 0.92 0.68 0.43 0.46] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.88, 0.88, 0.6, 0.4, 0.44] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.88, 0.77, 0.46, 0.19, 0.08] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111110011011110110 11111111111111111111111011111111010111111011110110 11111111111011010111100001111101101100000000010110 11111111111111011110101000000000000000000000000000 11010000111101010101110101011001010000100100001000 loss: 0.001941, lagrangian_loss: 0.000522, attention_score_distillation_loss: 0.001257 ETA: 1:20:51 | Epoch 54 finished. Took 33.03 seconds. loss: 0.008417, lagrangian_loss: -0.000074, attention_score_distillation_loss: 0.001256 ---------------------------------------------------------------------- time: 2023-07-19 14:54:45 Evaluating: f1: 0.8786, eval_loss: 0.7099, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.269, expected_sparsity: 0.2657, expected_sequence_sparsity: 0.6995, target_sparsity: 0.2466, step: 6350 lambda_1: -0.2436, lambda_2: 21.6846 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 
0.99 0.91 0.91 0.66 0.42 0.45] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.88, 0.88, 0.58, 0.4, 0.44] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.88, 0.77, 0.45, 0.18, 0.08] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111110011011110110 11111111111111111111111011111111010111111011110110 11111111111011110111100001011101101100000000000110 11111111111111011110101000000000000000000000000000 11010000111101010101110101011001010000100100001000 loss: 0.002144, lagrangian_loss: -0.000519, attention_score_distillation_loss: 0.001252 loss: 0.002197, lagrangian_loss: 0.000012, attention_score_distillation_loss: 0.001249 ---------------------------------------------------------------------- time: 2023-07-19 14:55:00 Evaluating: f1: 0.8801, eval_loss: 0.6969, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.269, expected_sparsity: 0.2657, expected_sequence_sparsity: 0.6995, target_sparsity: 0.2485, step: 6400 lambda_1: 0.1225, lambda_2: 21.7899 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.99 0.91 0.92 0.66 0.43 0.45] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.88, 0.88, 0.58, 0.4, 0.44] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.88, 0.77, 0.45, 0.18, 0.08] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111110011011110110 11111111111111111111111011111111010111111011110110 11111111111011010111100001011101111100000000000110 11111111111111011110101000000000000000000000000000 11010000111101010101110101011001010000100100001000 loss: 0.002499, lagrangian_loss: 0.000054, attention_score_distillation_loss: 0.001245 loss: 0.000613, lagrangian_loss: -0.000180, attention_score_distillation_loss: 0.001243 ETA: 1:20:16 | Epoch 55 finished. Took 33.01 seconds. ---------------------------------------------------------------------- time: 2023-07-19 14:55:14 Evaluating: f1: 0.8708, eval_loss: 0.6674, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.269, expected_sparsity: 0.2636, expected_sequence_sparsity: 0.6986, target_sparsity: 0.2504, step: 6450 lambda_1: -0.0735, lambda_2: 21.8731 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 
0.99 0.91 0.92 0.66 0.43 0.46] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.88, 0.88, 0.6, 0.4, 0.44] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.88, 0.77, 0.46, 0.19, 0.08] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111110011011110110 11111111111111111111111011111111010111111011110110 11111111111011110111100001011101111100000000000110 11111111111111011110101000000000000000000000000000 11010000111101010101110101011001010000100100001000 loss: 0.000863, lagrangian_loss: 0.000551, attention_score_distillation_loss: 0.001240 loss: 0.003027, lagrangian_loss: 0.000119, attention_score_distillation_loss: 0.001237 ---------------------------------------------------------------------- time: 2023-07-19 14:55:29 Evaluating: f1: 0.8793, eval_loss: 0.6707, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.269, expected_sparsity: 0.2662, expected_sequence_sparsity: 0.6997, target_sparsity: 0.2524, step: 6500 lambda_1: -0.2447, lambda_2: 21.9424 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.99 0.9 0.91 0.64 0.42 0.45] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.88, 0.88, 0.58, 0.4, 0.42] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.88, 0.77, 0.45, 0.18, 0.08] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111110011011110110 11111111111111111111111011111111010111111011110110 11111111111011110111100001011101101100000000000110 11111111111111011110101000000000000000000000000000 11010000111101010101110101010001010000100100001000 loss: 0.000859, lagrangian_loss: -0.000530, attention_score_distillation_loss: 0.001233 loss: 0.001205, lagrangian_loss: 0.000070, attention_score_distillation_loss: 0.001230 ---------------------------------------------------------------------- time: 2023-07-19 14:55:44 Evaluating: f1: 0.8792, eval_loss: 0.6639, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.269, expected_sparsity: 0.2662, expected_sequence_sparsity: 0.6997, target_sparsity: 0.2543, step: 6550 lambda_1: 0.1470, lambda_2: 22.0688 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.99 0.9 0.91 0.64 0.42 0.45] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.88, 0.88, 0.58, 0.4, 0.42] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.88, 0.77, 0.45, 0.18, 0.08] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111110011011110110 11111111111111111111111011111111010111111011110110 11111111111011110111100001011101101100000000000110 11111111111111011110101000000000000000000000000000 11010000111101010101110101010001010000100100001000 loss: 0.001461, lagrangian_loss: 0.000101, attention_score_distillation_loss: 0.001229 ETA: 1:19:47 | Epoch 56 finished. Took 35.15 seconds. 
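A note on the lagrangian_loss terms printed above: lambda_1 and lambda_2 are learned Lagrange multipliers that tie the model's expected_sparsity to the scheduled target_sparsity. The exact formulation is not visible in this log, so the snippet below is only a minimal sketch of the usual first-plus-second-order penalty used in L0/CoFi-style pruning; the function name and signature are my own, not the project's.

    import torch

    def lagrangian_sparsity_penalty(expected_sparsity: torch.Tensor,
                                    target_sparsity: float,
                                    lambda_1: torch.Tensor,
                                    lambda_2: torch.Tensor) -> torch.Tensor:
        # Penalize the gap between the expected sparsity of the current soft
        # masks and the scheduled target. The multipliers are updated to
        # *increase* this term (gradient ascent), so the penalty can go
        # negative, which is why lagrangian_loss oscillates around zero and
        # lambda_1 keeps flipping sign in the reports above.
        gap = expected_sparsity - target_sparsity
        return lambda_1 * gap + lambda_2 * gap.pow(2)

    # Illustrative call with the step-6500 snapshot above. The per-step values
    # in the log are computed from the current batch's soft masks, so they will
    # not match this evaluation-time recomputation exactly.
    penalty = lagrangian_sparsity_penalty(torch.tensor(0.2662), 0.2524,
                                          torch.tensor(-0.2447), torch.tensor(21.9424))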
loss: 0.001796, lagrangian_loss: -0.000160, attention_score_distillation_loss: 0.001224 ---------------------------------------------------------------------- time: 2023-07-19 14:55:58 Evaluating: f1: 0.8737, eval_loss: 0.6714, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.269, expected_sparsity: 0.2657, expected_sequence_sparsity: 0.6995, target_sparsity: 0.2563, step: 6600 lambda_1: -0.0954, lambda_2: 22.1629 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.99 0.9 0.91 0.65 0.42 0.45] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.88, 0.88, 0.58, 0.4, 0.44] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.88, 0.77, 0.45, 0.18, 0.08] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111110011011110110 11111111111111111111111011111111010111111011110110 11111111111011110111100001011101101100000000000110 11111111111111011110101000000000000000000000000000 11010000111101010101110101010001010000100100001001 loss: 0.019113, lagrangian_loss: 0.000455, attention_score_distillation_loss: 0.001222 loss: 0.001869, lagrangian_loss: -0.000118, attention_score_distillation_loss: 0.001220 ---------------------------------------------------------------------- time: 2023-07-19 14:56:13 Evaluating: f1: 0.8811, eval_loss: 0.6534, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2811, expected_sparsity: 0.2766, expected_sequence_sparsity: 0.704, target_sparsity: 0.2582, step: 6650 lambda_1: -0.2236, lambda_2: 22.2402 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.99 0.89 0.9 0.62 0.41 0.45] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 0.86, 0.56, 0.38, 0.42] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 0.74, 0.41, 0.16, 0.07] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111110011011110010 11111111111111111111111011111111010111011011110110 11111111111011010111100001011101101100000000000110 11111111111111010110101000000000000000000000000000 11010000111101010101110101010001010000100100001000 loss: 0.004571, lagrangian_loss: -0.000478, attention_score_distillation_loss: 0.001218 ETA: 1:19:12 | Epoch 57 finished. Took 32.99 seconds. loss: 0.001692, lagrangian_loss: 0.000172, attention_score_distillation_loss: 0.001215 ---------------------------------------------------------------------- time: 2023-07-19 14:56:27 Evaluating: f1: 0.8815, eval_loss: 0.6591, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2811, expected_sparsity: 0.2755, expected_sequence_sparsity: 0.7035, target_sparsity: 0.2602, step: 6700 lambda_1: 0.1908, lambda_2: 22.3762 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 
0.99 0.89 0.91 0.63 0.41 0.45] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 0.86, 0.56, 0.4, 0.42] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 0.74, 0.41, 0.17, 0.07] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111110011011110010 11111111111111111111111011111111010111011011110110 11111111111011010111100001011101101100000000000110 11111111111111011110101000000000000000000000000000 11010000111101010101110101010001010000100100001000 loss: 0.000950, lagrangian_loss: 0.000116, attention_score_distillation_loss: 0.001212 loss: 0.001666, lagrangian_loss: -0.000258, attention_score_distillation_loss: 0.001208 ---------------------------------------------------------------------- time: 2023-07-19 14:56:42 Evaluating: f1: 0.8739, eval_loss: 0.6494, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2811, expected_sparsity: 0.2755, expected_sequence_sparsity: 0.7035, target_sparsity: 0.2621, step: 6750 lambda_1: -0.1156, lambda_2: 22.5201 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.99 0.89 0.91 0.63 0.42 0.45] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 0.86, 0.56, 0.4, 0.42] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 0.74, 0.41, 0.17, 0.07] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111110011011110010 11111111111111111111111011111111010111011011110110 11111111111011010111100001011101101100000000000110 11111111111111011110101000000000000000000000000000 11010000111101010101110101010001010000100100001000 loss: 0.001611, lagrangian_loss: 0.000619, attention_score_distillation_loss: 0.001206 loss: 0.001479, lagrangian_loss: -0.000047, attention_score_distillation_loss: 0.001201 ETA: 1:18:37 | Epoch 58 finished. Took 32.83 seconds. ---------------------------------------------------------------------- time: 2023-07-19 14:56:57 Evaluating: f1: 0.8803, eval_loss: 0.6538, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2842, expected_sparsity: 0.2786, expected_sequence_sparsity: 0.7048, target_sparsity: 0.264, step: 6800 lambda_1: -0.2579, lambda_2: 22.6256 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 
0.99 0.87 0.9 0.6 0.41 0.44] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 0.86, 0.54, 0.38, 0.42] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 0.74, 0.4, 0.15, 0.06] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111110011011110010 11111111111111111111111011111111010111011011110110 11111111111011010111100001011101101100000000000100 11111111111111010110101000000000000000000000000000 11010000111101010101110101010001010000100100001000 loss: 0.001876, lagrangian_loss: -0.000645, attention_score_distillation_loss: 0.001199 loss: 0.002166, lagrangian_loss: 0.000364, attention_score_distillation_loss: 0.001197 ---------------------------------------------------------------------- time: 2023-07-19 14:57:11 Evaluating: f1: 0.875, eval_loss: 0.6359, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2811, expected_sparsity: 0.2766, expected_sequence_sparsity: 0.704, target_sparsity: 0.266, step: 6850 lambda_1: 0.2149, lambda_2: 22.8189 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.99 0.88 0.91 0.61 0.41 0.45] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 0.86, 0.56, 0.38, 0.42] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 0.74, 0.41, 0.16, 0.07] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111110011011110010 11111111111111111111111011111111010111011011110110 11111111111011010111100001011101101100000000000110 11111111111111010110101000000000000000000000000000 11010000111101010101110101010001010000100100001000 loss: 0.002707, lagrangian_loss: -0.000101, attention_score_distillation_loss: 0.001194 loss: 0.000903, lagrangian_loss: -0.000170, attention_score_distillation_loss: 0.001190 ---------------------------------------------------------------------- time: 2023-07-19 14:57:26 Evaluating: f1: 0.8688, eval_loss: 0.7577, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2811, expected_sparsity: 0.2766, expected_sequence_sparsity: 0.704, target_sparsity: 0.2679, step: 6900 lambda_1: -0.1501, lambda_2: 22.9911 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.99 0.88 0.91 0.61 0.41 0.45] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 0.86, 0.56, 0.38, 0.42] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 0.74, 0.41, 0.16, 0.07] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111110011011110010 11111111111111111111111011111111010111011011110110 11111111111011010111100001011101101100000000000110 11111111111111010110101000000000000000000000000000 11010000111101010101110101010001010000100100001000 ETA: 1:18:08 | Epoch 59 finished. Took 35.07 seconds. 
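One useful consistency check on these reports: "layerwise remain" is, up to rounding, the running product of the "infer remain" keep ratios, i.e. the fraction of the original tokens still alive after each layer (the extra leading 1.0 entries presumably correspond to the layers before the first prunable location). The snippet below verifies this for the step-6800 report above; it is a standalone check, not code from the training script.

    from itertools import accumulate
    from operator import mul

    # Per-location keep ratios reported as "infer remain" at step 6800.
    infer_remain = [1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 0.86, 0.54, 0.38, 0.42]

    # Running product = fraction of tokens surviving up to each location.
    survival = [round(x, 2) for x in accumulate(infer_remain, mul)]
    print(survival)
    # -> [1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 0.74, 0.4, 0.15, 0.06]
    # which reproduces the tail of "layerwise remain" at the same step:
    # [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 0.74, 0.4, 0.15, 0.06]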
loss: 0.002016, lagrangian_loss: 0.000735, attention_score_distillation_loss: 0.001188 loss: 0.001338, lagrangian_loss: -0.000265, attention_score_distillation_loss: 0.001186 ---------------------------------------------------------------------- time: 2023-07-19 14:57:40 Evaluating: f1: 0.8784, eval_loss: 0.614, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2874, expected_sparsity: 0.2829, expected_sequence_sparsity: 0.7066, target_sparsity: 0.2699, step: 6950 lambda_1: -0.2297, lambda_2: 23.0953 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.98 0.86 0.9 0.59 0.4 0.44] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.86, 0.54, 0.38, 0.42] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.72, 0.39, 0.15, 0.06] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111110010011110010 11111111111111111111111011111111010111011011110110 11111111111011010111100001011101101100000000000100 11111111111111010110101000000000000000000000000000 11010000111101010101110101010001010000100100001000 loss: 0.001778, lagrangian_loss: -0.000544, attention_score_distillation_loss: 0.001182 loss: 0.002700, lagrangian_loss: 0.000403, attention_score_distillation_loss: 0.001179 ---------------------------------------------------------------------- time: 2023-07-19 14:57:55 Evaluating: f1: 0.8854, eval_loss: 0.6061, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2874, expected_sparsity: 0.2829, expected_sequence_sparsity: 0.7066, target_sparsity: 0.2718, step: 7000 lambda_1: 0.2091, lambda_2: 23.2701 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.99 0.87 0.9 0.6 0.4 0.44] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.86, 0.54, 0.38, 0.42] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.72, 0.39, 0.15, 0.06] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111110010011110010 11111111111111111111111011111111010111011011110110 11111111111011010111100001011101101100000000000100 11111111111111010110101000000000000000000000000000 11010000111101010101110101010001010000100100001000 loss: 0.000911, lagrangian_loss: -0.000106, attention_score_distillation_loss: 0.001176 ETA: 1:17:32 | Epoch 60 finished. Took 32.76 seconds. loss: 0.001643, lagrangian_loss: 0.000038, attention_score_distillation_loss: 0.001174 ---------------------------------------------------------------------- time: 2023-07-19 14:58:09 Evaluating: f1: 0.8858, eval_loss: 0.6365, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2874, expected_sparsity: 0.2829, expected_sequence_sparsity: 0.7066, target_sparsity: 0.2737, step: 7050 lambda_1: -0.1898, lambda_2: 23.4589 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 
0.99 0.87 0.9 0.6 0.4 0.45] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.86, 0.54, 0.38, 0.42] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.72, 0.39, 0.15, 0.06] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111110010011110010 11111111111111111111111011111111010111011011110110 11111111111011010111100001011101101100000000000100 11111111111111010110101000000000000000000000000000 11010000111101010101110101010001010000100100001000 loss: 0.001754, lagrangian_loss: 0.000846, attention_score_distillation_loss: 0.001171 loss: 0.003400, lagrangian_loss: -0.000338, attention_score_distillation_loss: 0.001167 ---------------------------------------------------------------------- time: 2023-07-19 14:58:24 Evaluating: f1: 0.8832, eval_loss: 0.6641, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2889, expected_sparsity: 0.2848, expected_sequence_sparsity: 0.7073, target_sparsity: 0.2757, step: 7100 lambda_1: -0.2418, lambda_2: 23.5646 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.98 0.85 0.89 0.57 0.39 0.44] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.86, 0.52, 0.38, 0.42] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.72, 0.38, 0.14, 0.06] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111110011011010010 11111111111111111111111011111111010111011011110110 11111111111011000111100001011101101100000000000100 11111111111111010110101000000000000000000000000000 11010000111101010101110101010001010000100100001000 loss: 0.000840, lagrangian_loss: -0.000599, attention_score_distillation_loss: 0.001165 loss: 0.001796, lagrangian_loss: 0.000224, attention_score_distillation_loss: 0.001162 ETA: 1:16:58 | Epoch 61 finished. Took 32.83 seconds. ---------------------------------------------------------------------- time: 2023-07-19 14:58:38 Evaluating: f1: 0.8804, eval_loss: 0.6649, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2874, expected_sparsity: 0.2829, expected_sequence_sparsity: 0.7066, target_sparsity: 0.2776, step: 7150 lambda_1: 0.1721, lambda_2: 23.7344 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 
0.98 0.86 0.9 0.58 0.4 0.44] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.86, 0.54, 0.38, 0.42] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.72, 0.39, 0.15, 0.06] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111110011011010010 11111111111111111111111011111111010111011011110110 11111111111011010111100001011101101100000000000100 11111111111111010110101000000000000000000000000000 11010000111101010101110101010001010000100100001000 loss: 0.002218, lagrangian_loss: -0.000113, attention_score_distillation_loss: 0.001159 loss: 0.001614, lagrangian_loss: 0.000072, attention_score_distillation_loss: 0.001156 ---------------------------------------------------------------------- time: 2023-07-19 14:58:53 Evaluating: f1: 0.8612, eval_loss: 0.695, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2889, expected_sparsity: 0.2848, expected_sequence_sparsity: 0.7073, target_sparsity: 0.2796, step: 7200 lambda_1: -0.1783, lambda_2: 23.8799 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.98 0.86 0.9 0.58 0.4 0.44] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.86, 0.52, 0.38, 0.42] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.72, 0.38, 0.14, 0.06] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111110011011010010 11111111111111111111111011111111010111011011110110 11111111111011000111100001011101101100000000000100 11111111111111010110101000000000000000000000000000 11010000111101010101110101010001010000100100001000 loss: 0.001262, lagrangian_loss: 0.000504, attention_score_distillation_loss: 0.001153 loss: 0.002022, lagrangian_loss: -0.000058, attention_score_distillation_loss: 0.001150 ETA: 1:16:23 | Epoch 62 finished. Took 32.89 seconds. ---------------------------------------------------------------------- time: 2023-07-19 14:59:08 Evaluating: f1: 0.8668, eval_loss: 0.6813, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.301, expected_sparsity: 0.2944, expected_sequence_sparsity: 0.7113, target_sparsity: 0.2815, step: 7250 lambda_1: -0.3009, lambda_2: 23.9464 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 
0.98 0.85 0.89 0.55 0.39 0.43] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.82, 0.84, 0.5, 0.36, 0.42] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.82, 0.69, 0.34, 0.12, 0.05] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111110010011010010 11111111011111111111111011111111010111011011110110 11111111111011000111100001011101101000000000000100 11111111111111000110101000000000000000000000000000 11010000111101010101110101010001010000100100001000 loss: 0.002031, lagrangian_loss: -0.000645, attention_score_distillation_loss: 0.001148 loss: 0.001770, lagrangian_loss: 0.000020, attention_score_distillation_loss: 0.001144 ---------------------------------------------------------------------- time: 2023-07-19 14:59:22 Evaluating: f1: 0.8817, eval_loss: 0.5893, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.301, expected_sparsity: 0.2944, expected_sequence_sparsity: 0.7113, target_sparsity: 0.2835, step: 7300 lambda_1: 0.1516, lambda_2: 24.1251 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.98 0.85 0.89 0.55 0.39 0.43] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.82, 0.84, 0.5, 0.36, 0.42] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.82, 0.69, 0.34, 0.12, 0.05] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111110010011010010 11111111011111111111111011111111010111011011110110 11111111111011000111100001011101101000000000000100 11111111111111000110101000000000000000000000000000 11010000111101010101110101010001010000100100001000 loss: 0.001972, lagrangian_loss: 0.000150, attention_score_distillation_loss: 0.001141 loss: 0.001034, lagrangian_loss: -0.000322, attention_score_distillation_loss: 0.001138 ---------------------------------------------------------------------- time: 2023-07-19 14:59:37 Evaluating: f1: 0.8759, eval_loss: 0.6431, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2952, expected_sparsity: 0.2916, expected_sequence_sparsity: 0.7101, target_sparsity: 0.2854, step: 7350 lambda_1: -0.0369, lambda_2: 24.2518 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.98 0.85 0.9 0.56 0.39 0.44] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.82, 0.84, 0.52, 0.38, 0.42] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.82, 0.69, 0.36, 0.14, 0.06] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111110010011010010 11111111011111111111111011111111010111011011110110 11111111111011000111100001011101101100000000000100 11111111111111010110101000000000000000000000000000 11010000111101010101110101010001010000100100001000 loss: 0.002440, lagrangian_loss: 0.000538, attention_score_distillation_loss: 0.001135 ETA: 1:15:53 | Epoch 63 finished. Took 35.1 seconds. 
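Each 50-character row of 0s and 1s printed after an evaluation is the hard keep/drop decision at one pruned location, one digit per token position of the evaluation window (1 = token kept, 0 = token dropped); rows of all 1s belong to locations that are not pruned. Dividing the count of 1s by the row length reproduces the matching "infer remain" entry. A quick check against the eighth row of the step-7250 report above (the row string is copied verbatim from the log; the variable name is illustrative only):

    # Eighth pruned location at step 7250.
    mask_row = "11111111111011000111100001011101101000000000000100"

    keep_ratio = mask_row.count("1") / len(mask_row)
    print(len(mask_row), keep_ratio)
    # -> 50 0.5, matching the eighth "infer remain" entry (0.5) at that step.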
loss: 0.001492, lagrangian_loss: 0.000477, attention_score_distillation_loss: 0.001134 ---------------------------------------------------------------------- time: 2023-07-19 14:59:51 Evaluating: f1: 0.8858, eval_loss: 0.6231, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.301, expected_sparsity: 0.2944, expected_sequence_sparsity: 0.7113, target_sparsity: 0.2873, step: 7400 lambda_1: -0.3572, lambda_2: 24.4122 lambda_3: 0.0000 train remain: [1. 1. 1. 0.99 0.98 0.84 0.89 0.54 0.38 0.43] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.82, 0.84, 0.5, 0.36, 0.42] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.82, 0.69, 0.34, 0.12, 0.05] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111110010011010010 11111111011111111111111011111111010111011011110110 11111111111011000111100001011101101000000000000100 11111111111111000110101000000000000000000000000000 11010000111101010101110101010001010000100100001000 loss: 0.001500, lagrangian_loss: -0.000750, attention_score_distillation_loss: 0.001130 loss: 0.006141, lagrangian_loss: -0.000199, attention_score_distillation_loss: 0.001128 ---------------------------------------------------------------------- time: 2023-07-19 15:00:06 Evaluating: f1: 0.8736, eval_loss: 0.6062, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.301, expected_sparsity: 0.2944, expected_sequence_sparsity: 0.7113, target_sparsity: 0.2893, step: 7450 lambda_1: 0.1396, lambda_2: 24.6479 lambda_3: 0.0000 train remain: [1. 1. 1. 0.99 0.97 0.84 0.89 0.53 0.38 0.43] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.82, 0.84, 0.5, 0.36, 0.42] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.82, 0.69, 0.34, 0.12, 0.05] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111110010011010010 11111111011111111111111011111111010111011011110110 11111111111011000111100001011101101000000000000100 11111111111111000110101000000000000000000000000000 11010000111101010101110101010001010000100100001000 loss: 0.001895, lagrangian_loss: 0.000383, attention_score_distillation_loss: 0.001125 ETA: 1:15:18 | Epoch 64 finished. Took 32.89 seconds. loss: 0.004323, lagrangian_loss: -0.000458, attention_score_distillation_loss: 0.001122 ---------------------------------------------------------------------- time: 2023-07-19 15:00:21 Evaluating: f1: 0.8797, eval_loss: 0.594, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.301, expected_sparsity: 0.2944, expected_sequence_sparsity: 0.7113, target_sparsity: 0.2912, step: 7500 lambda_1: 0.0277, lambda_2: 24.7992 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.99 0.98 0.84 0.89 0.54 0.39 0.44] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.82, 0.84, 0.5, 0.36, 0.42] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.82, 0.69, 0.34, 0.12, 0.05] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111110011001010010 11111111011111111111111011111111010111011011110110 11111111111011000111100001011101101000000000000100 11111111111111000110101000000000000000000000000000 11010000111101010101110101010001010000100100001000 loss: 0.005531, lagrangian_loss: 0.000449, attention_score_distillation_loss: 0.001118 loss: 0.001741, lagrangian_loss: 0.000555, attention_score_distillation_loss: 0.001116 ---------------------------------------------------------------------- time: 2023-07-19 15:00:35 Evaluating: f1: 0.8938, eval_loss: 0.6211, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.3057, expected_sparsity: 0.3003, expected_sequence_sparsity: 0.7137, target_sparsity: 0.2932, step: 7550 lambda_1: -0.3552, lambda_2: 25.0145 lambda_3: 0.0000 train remain: [1. 1. 1. 0.99 0.97 0.83 0.89 0.52 0.38 0.43] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.84, 0.48, 0.36, 0.42] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.67, 0.32, 0.12, 0.05] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111110010001010010 11111111011111111111111011111111010111011011110110 11111111111011000111100001011101100000000000000100 11111111111111000110101000000000000000000000000000 11010000111101010101110101010001010000100100001000 loss: 0.002066, lagrangian_loss: -0.000615, attention_score_distillation_loss: 0.001112 loss: 0.004104, lagrangian_loss: -0.000296, attention_score_distillation_loss: 0.001110 ETA: 1:14:44 | Epoch 65 finished. Took 33.04 seconds. ---------------------------------------------------------------------- time: 2023-07-19 15:00:50 Evaluating: f1: 0.8739, eval_loss: 0.6106, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.3057, expected_sparsity: 0.3003, expected_sequence_sparsity: 0.7137, target_sparsity: 0.2951, step: 7600 lambda_1: 0.0927, lambda_2: 25.2199 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.99 0.97 0.82 0.88 0.52 0.38 0.43] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.84, 0.48, 0.36, 0.42] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.67, 0.32, 0.12, 0.05] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111110010001010010 11111111011111111111111011111111010111011011110110 11111111111011000111100001011101100000000000000100 11111111111111000110101000000000000000000000000000 11010000111101010101110101010001010000100100001000 loss: 0.001534, lagrangian_loss: 0.000301, attention_score_distillation_loss: 0.001107 loss: 0.002766, lagrangian_loss: -0.000288, attention_score_distillation_loss: 0.001105 ---------------------------------------------------------------------- time: 2023-07-19 15:01:05 Evaluating: f1: 0.8923, eval_loss: 0.6507, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.301, expected_sparsity: 0.2944, expected_sequence_sparsity: 0.7113, target_sparsity: 0.2971, step: 7650 lambda_1: 0.0657, lambda_2: 25.3355 lambda_3: 0.0000 train remain: [1. 1. 1. 0.99 0.97 0.83 0.89 0.53 0.38 0.43] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.82, 0.84, 0.5, 0.36, 0.42] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.82, 0.69, 0.34, 0.12, 0.05] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111111110011001010010 11111111011111111111111011111111010111011011110110 11111111111011000111100001011101101000000000000100 11111111111111000110101000000000000000000000000000 11010000111101010101110101010001010000100100001000 loss: 0.001663, lagrangian_loss: 0.000195, attention_score_distillation_loss: 0.001104 loss: 0.001463, lagrangian_loss: 0.000902, attention_score_distillation_loss: 0.001099 ---------------------------------------------------------------------- time: 2023-07-19 15:01:19 Evaluating: f1: 0.8912, eval_loss: 0.6398, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.3215, expected_sparsity: 0.3153, expected_sequence_sparsity: 0.7198, target_sparsity: 0.299, step: 7700 lambda_1: -0.3977, lambda_2: 25.5696 lambda_3: 0.0000 train remain: [1. 1. 1. 0.99 0.97 0.82 0.88 0.51 0.38 0.43] infer remain: [1.0, 1.0, 1.0, 1.0, 0.94, 0.8, 0.84, 0.48, 0.36, 0.4] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.94, 0.75, 0.63, 0.3, 0.11, 0.04] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111011101111110 11111111111111111111111111011111111110010001010010 11111111011111111111111011111111010111011011110110 11111111111011000111100001011101100000000000000100 11111111111111000110101000000000000000000000000000 11010000111101010101110101010001010000100100000000 loss: 0.001730, lagrangian_loss: -0.000338, attention_score_distillation_loss: 0.001096 ETA: 1:14:14 | Epoch 66 finished. Took 35.26 seconds. 
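The target_sparsity that the Lagrangian term chases grows essentially linearly over this excerpt, by roughly 0.0019 per 50-step interval (0.2427 at step 6250, 0.299 at step 7700). The sketch below simply interpolates between two logged points; the script's real schedule, its warmup length, and its final sparsity are not visible in this excerpt and are not assumed here.

    # Two (step, target_sparsity) pairs read off the evaluations above.
    s0, t0 = 6250, 0.2427
    s1, t1 = 7700, 0.299

    slope = (t1 - t0) / (s1 - s0)      # ~3.9e-5 sparsity per optimizer step

    def target_sparsity(step: int) -> float:
        # Linear interpolation through the two logged points.
        return t0 + slope * (step - s0)

    print(round(target_sparsity(6300), 4))
    # -> 0.2446, matching the target_sparsity printed for step 6300 above.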
loss: 0.035291, lagrangian_loss: -0.000709, attention_score_distillation_loss: 0.001093 ---------------------------------------------------------------------- time: 2023-07-19 15:01:34 Evaluating: f1: 0.8912, eval_loss: 0.6231, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.3215, expected_sparsity: 0.3153, expected_sequence_sparsity: 0.7198, target_sparsity: 0.3009, step: 7750 lambda_1: -0.0120, lambda_2: 25.7761 lambda_3: 0.0000 train remain: [1. 1. 1. 0.99 0.96 0.82 0.88 0.5 0.37 0.42] infer remain: [1.0, 1.0, 1.0, 1.0, 0.94, 0.8, 0.84, 0.48, 0.36, 0.4] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.94, 0.75, 0.63, 0.3, 0.11, 0.04] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111011101111110 11111111111111111111111111011111111110010001010010 11111111011111111111111011111111010111011011110110 11111111111011000111100001011101100000000000000100 11111111111111000110101000000000000000000000000000 11010000111101010101110101010001010000100100000000 loss: 0.001078, lagrangian_loss: 0.000373, attention_score_distillation_loss: 0.001091 loss: 0.001250, lagrangian_loss: -0.000127, attention_score_distillation_loss: 0.001087 ---------------------------------------------------------------------- time: 2023-07-19 15:01:49 Evaluating: f1: 0.8874, eval_loss: 0.6631, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.3215, expected_sparsity: 0.315, expected_sequence_sparsity: 0.7197, target_sparsity: 0.3029, step: 7800 lambda_1: 0.1748, lambda_2: 25.9357 lambda_3: 0.0000 train remain: [1. 1. 1. 0.99 0.97 0.82 0.88 0.51 0.38 0.43] infer remain: [1.0, 1.0, 1.0, 1.0, 0.94, 0.8, 0.84, 0.48, 0.36, 0.42] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.94, 0.75, 0.63, 0.3, 0.11, 0.05] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111011101111110 11111111111111111111111111011111111110010001010010 11111111011111111111111011111111010111011011110110 11111111111011000111100001011101100000000000000100 11111111111111000110101000000000000000000000000000 11010000111101010101110101010001010000100100001000 loss: 0.001116, lagrangian_loss: -0.000294, attention_score_distillation_loss: 0.001086 ETA: 1:13:40 | Epoch 67 finished. Took 33.12 seconds. loss: 0.002814, lagrangian_loss: 0.000754, attention_score_distillation_loss: 0.001082 ---------------------------------------------------------------------- time: 2023-07-19 15:02:03 Evaluating: f1: 0.8885, eval_loss: 0.6123, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.3215, expected_sparsity: 0.3169, expected_sequence_sparsity: 0.7205, target_sparsity: 0.3048, step: 7850 lambda_1: -0.3574, lambda_2: 26.2299 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.99 0.96 0.82 0.88 0.5 0.37 0.42] infer remain: [1.0, 1.0, 1.0, 1.0, 0.94, 0.8, 0.84, 0.46, 0.36, 0.4] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.94, 0.75, 0.63, 0.29, 0.1, 0.04] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111011101111110 11111111111111111111111111011111111110010001010010 11111111011111111111111011111111010111011011110110 11111111111011000111100001011001100000000000000100 11111111111111000110101000000000000000000000000000 11010000111101010101110101010001010000100100000000 loss: 0.001599, lagrangian_loss: -0.000116, attention_score_distillation_loss: 0.001079 loss: 0.001467, lagrangian_loss: -0.000779, attention_score_distillation_loss: 0.001077 ---------------------------------------------------------------------- time: 2023-07-19 15:02:18 Evaluating: f1: 0.89, eval_loss: 0.6377, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.3231, expected_sparsity: 0.3198, expected_sequence_sparsity: 0.7217, target_sparsity: 0.3068, step: 7900 lambda_1: -0.1190, lambda_2: 26.3841 lambda_3: 0.0000 train remain: [1. 1. 1. 0.99 0.95 0.81 0.87 0.49 0.37 0.42] infer remain: [1.0, 1.0, 1.0, 1.0, 0.94, 0.8, 0.82, 0.46, 0.34, 0.4] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.94, 0.75, 0.62, 0.28, 0.1, 0.04] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111011101111110 11111111111111111111111111011111111110010001010010 11111111011111111111111011110111010111011011110110 11111111111011000111100001011001100000000000000100 11111111111011000110101000000000000000000000000000 11010000111101010101110101010001010000100100000000 loss: 0.001852, lagrangian_loss: -0.000098, attention_score_distillation_loss: 0.001074 loss: 0.111494, lagrangian_loss: 0.000177, attention_score_distillation_loss: 0.001071 ETA: 1:13:06 | Epoch 68 finished. Took 33.3 seconds. ---------------------------------------------------------------------- time: 2023-07-19 15:02:33 Evaluating: f1: 0.8889, eval_loss: 0.627, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.3215, expected_sparsity: 0.3169, expected_sequence_sparsity: 0.7205, target_sparsity: 0.3087, step: 7950 lambda_1: 0.2008, lambda_2: 26.5781 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.99 0.96 0.81 0.87 0.5 0.37 0.42] infer remain: [1.0, 1.0, 1.0, 1.0, 0.94, 0.8, 0.84, 0.46, 0.36, 0.4] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.94, 0.75, 0.63, 0.29, 0.1, 0.04] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111011101111110 11111111111111111111111111011111111110010001010010 11111111011111111111111011111111010111011011110110 11111111111011000111100001011001100000000000000100 11111111111111000110101000000000000000000000000000 11010000111101010101110101010001010000100100000000 loss: 0.001926, lagrangian_loss: -0.000359, attention_score_distillation_loss: 0.001068 loss: 0.003467, lagrangian_loss: 0.001028, attention_score_distillation_loss: 0.001065 ---------------------------------------------------------------------- time: 2023-07-19 15:02:48 Evaluating: f1: 0.892, eval_loss: 0.6134, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.3231, expected_sparsity: 0.3198, expected_sequence_sparsity: 0.7217, target_sparsity: 0.3106, step: 8000 lambda_1: -0.4614, lambda_2: 27.0802 lambda_3: 0.0000 train remain: [1. 1. 1. 0.99 0.96 0.81 0.88 0.49 0.37 0.42] infer remain: [1.0, 1.0, 1.0, 1.0, 0.94, 0.8, 0.82, 0.46, 0.34, 0.4] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.94, 0.75, 0.62, 0.28, 0.1, 0.04] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111011101111110 11111111111111111111111111011111111110010001010010 11111111011111111111111011110111010111011011110110 11111111111011000111100001011001100000000000000100 11111111111011000110101000000000000000000000000000 11010000111101010101110101010001010000100100000000 loss: 0.002481, lagrangian_loss: 0.002122, attention_score_distillation_loss: 0.001061 loss: 0.004067, lagrangian_loss: -0.001009, attention_score_distillation_loss: 0.001059 ---------------------------------------------------------------------- time: 2023-07-19 15:03:02 Evaluating: f1: 0.8881, eval_loss: 0.6202, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.3341, expected_sparsity: 0.3297, expected_sequence_sparsity: 0.7258, target_sparsity: 0.3126, step: 8050 lambda_1: -0.5032, lambda_2: 27.3536 lambda_3: 0.0000 train remain: [1. 1. 1. 0.98 0.95 0.8 0.86 0.48 0.36 0.42] infer remain: [1.0, 1.0, 1.0, 1.0, 0.92, 0.78, 0.82, 0.44, 0.34, 0.4] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.72, 0.59, 0.26, 0.09, 0.04] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111011101011110 11111111111111111111111111011111111110010000010010 11111111011111111111111011110111010111011011110110 11111111111011000111000001011001100000000000000100 11111111111011000110101000000000000000000000000000 11010000111101010101110101010001010000100100000000 ETA: 1:12:36 | Epoch 69 finished. Took 35.39 seconds. 
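The ETA printed at each epoch boundary is consistent with simple arithmetic on the per-epoch wall-clock times shown in this excerpt (epochs alternate between roughly 33 and 35 seconds). A rough, purely illustrative check on the "ETA: 1:12:36 | Epoch 69 finished" line above:

    # "ETA: 1:12:36 | Epoch 69 finished. Took 35.39 seconds."
    eta_seconds = 1 * 3600 + 12 * 60 + 36      # 4356 s of training left

    slow_epoch = 35.39                          # slower epochs in this excerpt
    fast_epoch = 33.0                           # faster epochs in this excerpt

    print(eta_seconds / slow_epoch)             # ~123 epochs still to go
    print(eta_seconds / fast_epoch)             # ~132 epochs still to go
    # Added to the 69 epochs already completed, this points to a run configured
    # for somewhere in the neighborhood of 190-200 epochs in total.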
loss: 0.002666, lagrangian_loss: -0.002139, attention_score_distillation_loss: 0.001056 loss: 0.001414, lagrangian_loss: 0.000775, attention_score_distillation_loss: 0.001053 ---------------------------------------------------------------------- time: 2023-07-19 15:03:17 Evaluating: f1: 0.8968, eval_loss: 0.6087, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.3325, expected_sparsity: 0.3282, expected_sequence_sparsity: 0.7251, target_sparsity: 0.3145, step: 8100 lambda_1: 0.3582, lambda_2: 28.1973 lambda_3: 0.0000 train remain: [1. 1. 1. 0.98 0.95 0.8 0.87 0.48 0.36 0.42] infer remain: [1.0, 1.0, 1.0, 1.0, 0.92, 0.78, 0.82, 0.46, 0.34, 0.4] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.72, 0.59, 0.27, 0.09, 0.04] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111011101011110 11111111111111111111111111011111111110010000010010 11111111011111111111111011110111010111011011110110 11111111111011000111000001011001100000000000000110 11111111111011000110101000000000000000000000000000 11010000111101010101110101010001010000100100000000 loss: 0.002814, lagrangian_loss: 0.000167, attention_score_distillation_loss: 0.001049 loss: 0.002013, lagrangian_loss: -0.001005, attention_score_distillation_loss: 0.001046 ---------------------------------------------------------------------- time: 2023-07-19 15:03:32 Evaluating: f1: 0.8957, eval_loss: 0.5946, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.331, expected_sparsity: 0.3275, expected_sequence_sparsity: 0.7248, target_sparsity: 0.3165, step: 8150 lambda_1: -0.0308, lambda_2: 28.6338 lambda_3: 0.0000 train remain: [1. 1. 1. 0.99 0.96 0.81 0.87 0.49 0.37 0.43] infer remain: [1.0, 1.0, 1.0, 1.0, 0.92, 0.78, 0.82, 0.46, 0.36, 0.4] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.72, 0.59, 0.27, 0.1, 0.04] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111011101011110 11111111111111111111111111011111111110010000010010 11111111011111111111111011110111010111011011110110 11111111111011000111100001011001100000000000000100 11111111111111000110101000000000000000000000000000 11010000111101010101110101010001010000100100000000 loss: 0.001538, lagrangian_loss: 0.001166, attention_score_distillation_loss: 0.001044 ETA: 1:12:02 | Epoch 70 finished. Took 33.05 seconds. loss: 0.001991, lagrangian_loss: 0.000849, attention_score_distillation_loss: 0.001042 ---------------------------------------------------------------------- time: 2023-07-19 15:03:46 Evaluating: f1: 0.8969, eval_loss: 0.6016, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.3341, expected_sparsity: 0.3297, expected_sequence_sparsity: 0.7258, target_sparsity: 0.3184, step: 8200 lambda_1: -0.4770, lambda_2: 29.0220 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.98 0.95 0.8 0.86 0.47 0.36 0.41] infer remain: [1.0, 1.0, 1.0, 1.0, 0.92, 0.78, 0.82, 0.44, 0.34, 0.4] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.72, 0.59, 0.26, 0.09, 0.04] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111011101011110 11111111111111111111111111011111111110010000010010 11111111011111111111111011110111010111011011110110 11111111111011000111000001011001100000000000000100 11111111111011000110101000000000000000000000000000 11010000111101010101110101010001010000100100000000 loss: 0.001155, lagrangian_loss: -0.001089, attention_score_distillation_loss: 0.001038 loss: 0.001028, lagrangian_loss: -0.000495, attention_score_distillation_loss: 0.001036 ---------------------------------------------------------------------- time: 2023-07-19 15:04:01 Evaluating: f1: 0.8919, eval_loss: 0.6308, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.3341, expected_sparsity: 0.3297, expected_sequence_sparsity: 0.7258, target_sparsity: 0.3204, step: 8250 lambda_1: 0.0840, lambda_2: 29.4391 lambda_3: 0.0000 train remain: [1. 1. 1. 0.97 0.95 0.79 0.86 0.47 0.36 0.41] infer remain: [1.0, 1.0, 1.0, 1.0, 0.92, 0.78, 0.82, 0.44, 0.34, 0.4] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.72, 0.59, 0.26, 0.09, 0.04] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111011101011110 11111111111111111111111111011111111110010000010010 11111111011111111111111011110111010111011011110110 11111111111011000111000001011001100000000000000100 11111111111011000110101000000000000000000000000000 11010000111101010101110101010001010000100100000000 loss: 0.000888, lagrangian_loss: 0.000396, attention_score_distillation_loss: 0.001032 loss: 0.000841, lagrangian_loss: -0.000264, attention_score_distillation_loss: 0.001029 ETA: 1:11:27 | Epoch 71 finished. Took 33.05 seconds. ---------------------------------------------------------------------- time: 2023-07-19 15:04:16 Evaluating: f1: 0.8801, eval_loss: 0.6063, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.3341, expected_sparsity: 0.3297, expected_sequence_sparsity: 0.7258, target_sparsity: 0.3223, step: 8300 lambda_1: 0.0671, lambda_2: 29.6244 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.98 0.96 0.8 0.87 0.48 0.36 0.42] infer remain: [1.0, 1.0, 1.0, 1.0, 0.92, 0.78, 0.82, 0.44, 0.34, 0.4] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.72, 0.59, 0.26, 0.09, 0.04] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111011101011110 11111111111111111111111111011111111110010000010010 11111111011111111111111011111111010111011011110100 11111111111011000111000001011001100000000000000100 11111111111011000110101000000000000000000000000000 11010000111101010101110101010001010000100100000000 loss: 0.001465, lagrangian_loss: 0.000267, attention_score_distillation_loss: 0.001027 loss: 0.001035, lagrangian_loss: 0.001036, attention_score_distillation_loss: 0.001024 ---------------------------------------------------------------------- time: 2023-07-19 15:04:30 Evaluating: f1: 0.8889, eval_loss: 0.5884, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.3341, expected_sparsity: 0.3297, expected_sequence_sparsity: 0.7258, target_sparsity: 0.3242, step: 8350 lambda_1: -0.3564, lambda_2: 29.9035 lambda_3: 0.0000 train remain: [1. 1. 1. 0.97 0.95 0.79 0.86 0.47 0.36 0.41] infer remain: [1.0, 1.0, 1.0, 1.0, 0.92, 0.78, 0.82, 0.44, 0.34, 0.4] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.72, 0.59, 0.26, 0.09, 0.04] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111011101011110 11111111111111111111111111011111111110010000010010 11111111011111111111111011111111010111011011110100 11111111111011000111000001011001100000000000000100 11111111111011000110101000000000000000000000000000 11010000111101010101110101010001010000100100000000 loss: 0.004048, lagrangian_loss: -0.000350, attention_score_distillation_loss: 0.001022 loss: 0.002446, lagrangian_loss: -0.000377, attention_score_distillation_loss: 0.001018 ETA: 1:10:53 | Epoch 72 finished. Took 33.08 seconds. ---------------------------------------------------------------------- time: 2023-07-19 15:04:45 Evaluating: f1: 0.8893, eval_loss: 0.576, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.3398, expected_sparsity: 0.3333, expected_sequence_sparsity: 0.7272, target_sparsity: 0.3262, step: 8400 lambda_1: 0.0519, lambda_2: 30.1600 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.96 0.95 0.79 0.86 0.47 0.35 0.41] infer remain: [1.0, 1.0, 1.0, 1.0, 0.92, 0.76, 0.82, 0.44, 0.34, 0.4] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.7, 0.57, 0.25, 0.09, 0.03] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111011101011110 11111111111111111111111111011111111100010000010010 11111111011111111111111011111111010111011011110100 11111111111011000111000001011001100000000000000100 11111111111011000110101000000000000000000000000000 11010000111101010101110101010001010000100100000000 loss: 0.002517, lagrangian_loss: 0.000296, attention_score_distillation_loss: 0.001015 loss: 0.001100, lagrangian_loss: -0.000196, attention_score_distillation_loss: 0.001012 ---------------------------------------------------------------------- time: 2023-07-19 15:05:00 Evaluating: f1: 0.8866, eval_loss: 0.6362, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.3341, expected_sparsity: 0.3297, expected_sequence_sparsity: 0.7258, target_sparsity: 0.3281, step: 8450 lambda_1: 0.0082, lambda_2: 30.2973 lambda_3: 0.0000 train remain: [1. 1. 1. 0.97 0.95 0.79 0.86 0.47 0.36 0.41] infer remain: [1.0, 1.0, 1.0, 1.0, 0.92, 0.78, 0.82, 0.44, 0.34, 0.4] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.72, 0.59, 0.26, 0.09, 0.04] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111011101011110 11111111111111111111111111011111111110010000010010 11111111011111111111111011111111010111011011110100 11111111111011000111000001011001100000000000000100 11111111111011000110101000000000000000000000000000 11010000111101010101110101010001010000100100000000 loss: 0.000964, lagrangian_loss: 0.000257, attention_score_distillation_loss: 0.001009 loss: 0.002089, lagrangian_loss: 0.000265, attention_score_distillation_loss: 0.001007 ---------------------------------------------------------------------- time: 2023-07-19 15:05:14 Evaluating: f1: 0.885, eval_loss: 0.6058, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.3398, expected_sparsity: 0.3333, expected_sequence_sparsity: 0.7272, target_sparsity: 0.3301, step: 8500 lambda_1: -0.2140, lambda_2: 30.4207 lambda_3: 0.0000 train remain: [1. 1. 1. 0.96 0.95 0.78 0.86 0.46 0.35 0.41] infer remain: [1.0, 1.0, 1.0, 1.0, 0.92, 0.76, 0.82, 0.44, 0.34, 0.4] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.7, 0.57, 0.25, 0.09, 0.03] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111011101011110 11111111111111111111111111011111111100010000010010 11111111011111111111111011111111010111011011110100 11111111111011000111000001011001100000000000000100 11111111111011000110101000000000000000000000000000 11010000111101010101110101010001010000100100000000 loss: 0.001058, lagrangian_loss: -0.000313, attention_score_distillation_loss: 0.001003 ETA: 1:10:23 | Epoch 73 finished. Took 35.37 seconds. 
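To track the accuracy/sparsity trade-off across a long run like this one, the periodic "Evaluating:" reports can be scraped directly from the log text. The helper below is not part of the training code; it simply matches the field order printed in the reports above.

    import re

    # Field order as printed by the "Evaluating:" reports in this log.
    EVAL_RE = re.compile(
        r"Evaluating: f1: (?P<f1>[\d.]+), eval_loss: (?P<loss>[\d.]+).*?"
        r"expected_sparsity: (?P<esp>[\d.]+).*?"
        r"target_sparsity: (?P<tsp>[\d.]+), step: (?P<step>\d+)"
    )

    def parse_eval_reports(log_text: str):
        """Yield (step, f1, eval_loss, expected_sparsity, target_sparsity) tuples."""
        for m in EVAL_RE.finditer(log_text):
            yield (int(m["step"]), float(m["f1"]), float(m["loss"]),
                   float(m["esp"]), float(m["tsp"]))

    # Example on a single report copied (and abbreviated) from the step-8500
    # evaluation above:
    report = ("Evaluating: f1: 0.885, eval_loss: 0.6058, token_prune_loc: [...], "
              "macs_sparsity: 0.3398, expected_sparsity: 0.3333, "
              "expected_sequence_sparsity: 0.7272, target_sparsity: 0.3301, step: 8500")
    print(next(parse_eval_reports(report)))
    # -> (8500, 0.885, 0.6058, 0.3333, 0.3301)

Fed the whole log, the same generator yields one record per 50-step evaluation, which makes it straightforward to plot F1 against expected sparsity as the target ramps up.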
loss: 0.001282, lagrangian_loss: -0.000044, attention_score_distillation_loss: 0.001001 ---------------------------------------------------------------------- time: 2023-07-19 15:05:29 Evaluating: f1: 0.895, eval_loss: 0.6067, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.3398, expected_sparsity: 0.3333, expected_sequence_sparsity: 0.7272, target_sparsity: 0.332, step: 8550 lambda_1: 0.0479, lambda_2: 30.5160 lambda_3: 0.0000 train remain: [1. 1. 1. 0.96 0.95 0.78 0.86 0.46 0.35 0.41] infer remain: [1.0, 1.0, 1.0, 1.0, 0.92, 0.76, 0.82, 0.44, 0.34, 0.4] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.7, 0.57, 0.25, 0.09, 0.03] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111011101011110 11111111111111111111111111011111111100010000010010 11111111011111111111111011111111010111011011110100 11111111111011000111000001011001100000000000000100 11111111111011000110101000000000000000000000000000 11010000111101010101110101010001010000100100000000 loss: 0.001398, lagrangian_loss: -0.000014, attention_score_distillation_loss: 0.000998 loss: 0.001857, lagrangian_loss: 0.000097, attention_score_distillation_loss: 0.000995 ---------------------------------------------------------------------- time: 2023-07-19 15:05:44 Evaluating: f1: 0.8855, eval_loss: 0.6537, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.3683, expected_sparsity: 0.3628, expected_sequence_sparsity: 0.7393, target_sparsity: 0.334, step: 8600 lambda_1: -0.1310, lambda_2: 30.5849 lambda_3: 0.0000 train remain: [1. 1. 1. 0.95 0.95 0.78 0.86 0.46 0.35 0.41] infer remain: [1.0, 1.0, 1.0, 0.9, 0.92, 0.76, 0.82, 0.44, 0.34, 0.4] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.83, 0.63, 0.52, 0.23, 0.08, 0.03] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011111111110101100 11111111111111111111111111111111111111011101011110 11111111111111111111111111011111111100010000010010 11111111011111111111111011111111010111011011110100 11111111111011000111000001011001100000000000000100 11111111111011000110101000000000000000000000000000 11010000111101010101110101010001010000100100000000 loss: 0.001748, lagrangian_loss: 0.000022, attention_score_distillation_loss: 0.000993 ETA: 1:09:49 | Epoch 74 finished. Took 33.11 seconds. loss: 0.009009, lagrangian_loss: -0.000154, attention_score_distillation_loss: 0.000990 ---------------------------------------------------------------------- time: 2023-07-19 15:05:58 Evaluating: f1: 0.897, eval_loss: 0.6269, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.3698, expected_sparsity: 0.3641, expected_sequence_sparsity: 0.7398, target_sparsity: 0.3359, step: 8650 lambda_1: -0.0617, lambda_2: 30.6211 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.94 0.95 0.78 0.86 0.46 0.36 0.41] infer remain: [1.0, 1.0, 1.0, 0.9, 0.92, 0.76, 0.82, 0.42, 0.34, 0.4] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.83, 0.63, 0.52, 0.22, 0.07, 0.03] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011111111110101100 11111111111111111111111111111111111111011101111100 11111111111111111111111111011111111100010000010010 11111111011111111111111011111111010111011011110100 11111111111011000111000000011001100000000000000100 11111111111011000110101000000000000000000000000000 11010000111101010101110101010001010000100100000000 loss: 0.003335, lagrangian_loss: -0.000025, attention_score_distillation_loss: 0.000987 loss: 0.003555, lagrangian_loss: 0.000007, attention_score_distillation_loss: 0.000984 ---------------------------------------------------------------------- time: 2023-07-19 15:06:13 Evaluating: f1: 0.8843, eval_loss: 0.5985, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.3698, expected_sparsity: 0.3641, expected_sequence_sparsity: 0.7398, target_sparsity: 0.3378, step: 8700 lambda_1: -0.0473, lambda_2: 30.6368 lambda_3: 0.0000 train remain: [1. 1. 1. 0.94 0.95 0.78 0.86 0.46 0.36 0.41] infer remain: [1.0, 1.0, 1.0, 0.9, 0.92, 0.76, 0.82, 0.42, 0.34, 0.4] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.83, 0.63, 0.52, 0.22, 0.07, 0.03] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011111111110101100 11111111111111111111111111111111111111011101101110 11111111111111111111111111011111111100010000010010 11111111011111111111111011111111010111011011110100 11111111111011000111000000011001100000000000000100 11111111111011000110101000000000000000000000000000 11010000111101010101110101010001010000100100000000 loss: 0.002702, lagrangian_loss: 0.000083, attention_score_distillation_loss: 0.000981 loss: 0.003429, lagrangian_loss: -0.000000, attention_score_distillation_loss: 0.000977 ETA: 1:09:15 | Epoch 75 finished. Took 33.21 seconds. ---------------------------------------------------------------------- time: 2023-07-19 15:06:28 Evaluating: f1: 0.8904, eval_loss: 0.5906, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.373, expected_sparsity: 0.3658, expected_sequence_sparsity: 0.7405, target_sparsity: 0.3398, step: 8750 lambda_1: -0.1219, lambda_2: 30.6473 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.94 0.95 0.78 0.85 0.45 0.35 0.4 ] infer remain: [1.0, 1.0, 1.0, 0.9, 0.92, 0.76, 0.8, 0.42, 0.34, 0.4] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.83, 0.63, 0.5, 0.21, 0.07, 0.03] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011111101110111100 11111111111111111111111111111111111111011101101110 11111111111111111111111111011111111100010000010010 11111111011111111111111011110111010111011011110100 11111111111011000111000000011001100000000000000100 11111111111011000110101000000000000000000000000000 11010000111101010101110101010001010000100100000000 loss: 0.001424, lagrangian_loss: 0.000006, attention_score_distillation_loss: 0.000975 loss: 0.000993, lagrangian_loss: 0.000137, attention_score_distillation_loss: 0.000971 ---------------------------------------------------------------------- time: 2023-07-19 15:06:43 Evaluating: f1: 0.8818, eval_loss: 0.6758, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.373, expected_sparsity: 0.3658, expected_sequence_sparsity: 0.7405, target_sparsity: 0.3417, step: 8800 lambda_1: -0.2182, lambda_2: 30.6676 lambda_3: 0.0000 train remain: [1. 1. 1. 0.93 0.95 0.77 0.85 0.45 0.35 0.4 ] infer remain: [1.0, 1.0, 1.0, 0.9, 0.92, 0.76, 0.8, 0.42, 0.34, 0.4] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.83, 0.63, 0.5, 0.21, 0.07, 0.03] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011111101110111100 11111111111111111111111111111111111111011101101110 11111111111111111111111111011111111100010000010010 11111111011111111111111011110111010111011011110100 11111111111011000111000000011001100000000000000100 11111111111011000110101000000000000000000000000000 11010000111101010101110101010001010000100100000000 loss: 0.113589, lagrangian_loss: -0.000035, attention_score_distillation_loss: 0.000970 loss: 0.002136, lagrangian_loss: -0.000039, attention_score_distillation_loss: 0.000967 ---------------------------------------------------------------------- time: 2023-07-19 15:06:57 Evaluating: f1: 0.886, eval_loss: 0.7401, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.3793, expected_sparsity: 0.3735, expected_sequence_sparsity: 0.7437, target_sparsity: 0.3437, step: 8850 lambda_1: -0.2328, lambda_2: 30.6831 lambda_3: 0.0000 train remain: [1. 1. 1. 0.93 0.94 0.77 0.84 0.45 0.35 0.4 ] infer remain: [1.0, 1.0, 1.0, 0.9, 0.9, 0.74, 0.8, 0.42, 0.32, 0.4] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.81, 0.6, 0.48, 0.2, 0.06, 0.03] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011111101110111100 11111111111111111111111111111111111111011101101100 11111111111111111111111111011111111100000000010010 11111111011111111111111011110111010111011011110100 11111111111011000111000000011001100000000000000100 11111111111011000110001000000000000000000000000000 11010000111101010101110101010001010000100100000000 loss: 0.003010, lagrangian_loss: -0.000343, attention_score_distillation_loss: 0.000964 ETA: 1:08:44 | Epoch 76 finished. Took 35.46 seconds. 
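The three "remain" vectors are related: "infer remain" is the fraction of tokens kept at each of the ten pruning locations under the inference-time thresholds, and "layerwise remain" appears to be its running product (with leading 1.0 entries for the unpruned early positions), i.e. the share of the original sequence still alive after each layer. A quick check against the step 8850 numbers above (plain Python, illustrative only):

from itertools import accumulate
from operator import mul

infer_remain = [1.0, 1.0, 1.0, 0.9, 0.9, 0.74, 0.8, 0.42, 0.32, 0.4]   # step 8850
layerwise = [1.0, 1.0] + [round(r, 2) for r in accumulate(infer_remain, mul)]
print(layerwise)
# [1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.81, 0.6, 0.48, 0.2, 0.06, 0.03]  <- matches the logged "layerwise remain"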
loss: 0.001699, lagrangian_loss: -0.000095, attention_score_distillation_loss: 0.000961 ---------------------------------------------------------------------- time: 2023-07-19 15:07:12 Evaluating: f1: 0.8819, eval_loss: 0.6201, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.3793, expected_sparsity: 0.3735, expected_sequence_sparsity: 0.7437, target_sparsity: 0.3456, step: 8900 lambda_1: 0.0290, lambda_2: 30.7607 lambda_3: 0.0000 train remain: [1. 1. 1. 0.93 0.94 0.77 0.85 0.45 0.35 0.4 ] infer remain: [1.0, 1.0, 1.0, 0.9, 0.9, 0.74, 0.8, 0.42, 0.32, 0.4] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.81, 0.6, 0.48, 0.2, 0.06, 0.03] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011111101110111100 11111111111111111111111111111111111111011101101100 11111111111111111111111111011111111100000000010010 11111111011111111111111011110111010111011011110100 11111111111011000111000000011001100000000000000100 11111111111011000110001000000000000000000000000000 11010000111101010101110101010001010000100100000000 loss: 0.002833, lagrangian_loss: 0.000020, attention_score_distillation_loss: 0.000958 loss: 0.001636, lagrangian_loss: 0.000046, attention_score_distillation_loss: 0.000956 ---------------------------------------------------------------------- time: 2023-07-19 15:07:27 Evaluating: f1: 0.8818, eval_loss: 0.6105, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.3793, expected_sparsity: 0.3735, expected_sequence_sparsity: 0.7437, target_sparsity: 0.3475, step: 8950 lambda_1: -0.1522, lambda_2: 30.8416 lambda_3: 0.0000 train remain: [1. 1. 1. 0.93 0.95 0.77 0.85 0.44 0.35 0.4 ] infer remain: [1.0, 1.0, 1.0, 0.9, 0.9, 0.74, 0.8, 0.42, 0.32, 0.4] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.81, 0.6, 0.48, 0.2, 0.06, 0.03] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011111101110111100 11111111111111111111111111111111111111011101101100 11111111111111111111111111011111111100000000010010 11111111011111111111111011110111010111011011110100 11111111111011000111000000011001100000000000000100 11111111111011000110001000000000000000000000000000 11010000111101010101110101010001010000100100000000 loss: 0.001537, lagrangian_loss: 0.000503, attention_score_distillation_loss: 0.000953 ETA: 1:08:10 | Epoch 77 finished. Took 32.96 seconds. loss: 0.001551, lagrangian_loss: 0.000435, attention_score_distillation_loss: 0.000949 ---------------------------------------------------------------------- time: 2023-07-19 15:07:41 Evaluating: f1: 0.8939, eval_loss: 0.5979, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.3793, expected_sparsity: 0.3737, expected_sequence_sparsity: 0.7438, target_sparsity: 0.3495, step: 9000 lambda_1: -0.3666, lambda_2: 30.9319 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.92 0.94 0.76 0.85 0.44 0.34 0.4 ] infer remain: [1.0, 1.0, 1.0, 0.9, 0.9, 0.74, 0.8, 0.42, 0.32, 0.38] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.81, 0.6, 0.48, 0.2, 0.06, 0.02] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011111101110111100 11111111111111111111111111111111111111011101101100 11111111111111111111111111011111111100000000010010 11111111011111111111111011110111010111011011110100 11111111111011000111000000011001100000000000000100 11111111111011000110001000000000000000000000000000 11010000111101010101110101010001000000100100000000 loss: 0.007258, lagrangian_loss: -0.000671, attention_score_distillation_loss: 0.000947 loss: 0.001253, lagrangian_loss: -0.000317, attention_score_distillation_loss: 0.000943 ---------------------------------------------------------------------- time: 2023-07-19 15:07:56 Evaluating: f1: 0.8933, eval_loss: 0.6784, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.384, expected_sparsity: 0.3793, expected_sequence_sparsity: 0.7461, target_sparsity: 0.3514, step: 9050 lambda_1: -0.0297, lambda_2: 31.0627 lambda_3: 0.0000 train remain: [1. 1. 1. 0.92 0.94 0.76 0.84 0.44 0.34 0.4 ] infer remain: [1.0, 1.0, 1.0, 0.88, 0.9, 0.74, 0.8, 0.42, 0.32, 0.38] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.88, 0.79, 0.59, 0.47, 0.2, 0.06, 0.02] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011111101110101100 11111111111111111111111111111111111111011101101100 11111111111111111111111111011111111100000000010010 11111111011111111111111011110111010111011011110100 11111111111011000111000000011001100000000000000100 11111111111011000110001000000000000000000000000000 11010000111101010101110101010001000000100100000000 loss: 0.001523, lagrangian_loss: 0.000062, attention_score_distillation_loss: 0.000940 loss: 0.001718, lagrangian_loss: -0.000075, attention_score_distillation_loss: 0.000938 ETA: 1:07:36 | Epoch 78 finished. Took 33.18 seconds. ---------------------------------------------------------------------- time: 2023-07-19 15:08:11 Evaluating: f1: 0.8854, eval_loss: 0.6159, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.3793, expected_sparsity: 0.3737, expected_sequence_sparsity: 0.7438, target_sparsity: 0.3534, step: 9100 lambda_1: 0.0029, lambda_2: 31.1472 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.92 0.94 0.76 0.84 0.44 0.34 0.4 ] infer remain: [1.0, 1.0, 1.0, 0.9, 0.9, 0.74, 0.8, 0.42, 0.32, 0.38] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.81, 0.6, 0.48, 0.2, 0.06, 0.02] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011111101110111100 11111111111111111111111111111111111111011101101100 11111111111111111111111111011111111100000000010010 11111111011111111111111011110111010111011011110100 11111111111011000111000000011001100000000000000100 11111111111011000110001000000000000000000000000000 11010000111101010101110101010001000000100100000000 loss: 0.001631, lagrangian_loss: 0.000210, attention_score_distillation_loss: 0.000936 loss: 0.000889, lagrangian_loss: 0.000437, attention_score_distillation_loss: 0.000933 ---------------------------------------------------------------------- time: 2023-07-19 15:08:25 Evaluating: f1: 0.9003, eval_loss: 0.6012, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.384, expected_sparsity: 0.3793, expected_sequence_sparsity: 0.7461, target_sparsity: 0.3553, step: 9150 lambda_1: -0.3512, lambda_2: 31.2920 lambda_3: 0.0000 train remain: [1. 1. 1. 0.92 0.94 0.75 0.84 0.43 0.34 0.4 ] infer remain: [1.0, 1.0, 1.0, 0.88, 0.9, 0.74, 0.8, 0.42, 0.32, 0.38] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.88, 0.79, 0.59, 0.47, 0.2, 0.06, 0.02] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011111101110101100 11111111111111111111111111111111111111011101101100 11111111111111111111111111011111111100000000010010 11111111011111111111111011110111010111011011110100 11111111111011000111000000011001100000000000000100 11111111111011000110001000000000000000000000000000 11010000111101010101110101010001000000100100000000 loss: 0.001385, lagrangian_loss: -0.000063, attention_score_distillation_loss: 0.000930 loss: 0.100318, lagrangian_loss: -0.000155, attention_score_distillation_loss: 0.000927 ---------------------------------------------------------------------- time: 2023-07-19 15:08:40 Evaluating: f1: 0.8955, eval_loss: 0.5899, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.3872, expected_sparsity: 0.3809, expected_sequence_sparsity: 0.7467, target_sparsity: 0.3573, step: 9200 lambda_1: -0.2719, lambda_2: 31.3169 lambda_3: 0.0000 train remain: [1. 1. 1. 0.92 0.94 0.75 0.83 0.43 0.33 0.4 ] infer remain: [1.0, 1.0, 1.0, 0.88, 0.9, 0.74, 0.78, 0.42, 0.32, 0.38] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.88, 0.79, 0.59, 0.46, 0.19, 0.06, 0.02] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011111101110101100 11111111111111111111111111111111111111011101101100 11111111111111111111111111011111111100000000010010 11111111011111111111101011110111010111011011110100 11111111111011000111000000011001100000000000000100 11111111111011000110001000000000000000000000000000 11010000111101010101110101010001000000100100000000 ETA: 1:07:04 | Epoch 79 finished. Took 35.09 seconds. 
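The block of ten 50-character bit strings printed with each evaluation appears to be the token-keep mask for the ten pruning locations over a 50-position bin ('1' = token kept, '0' = pruned); the fraction of ones in each row reproduces the corresponding "infer remain" entry. In the step 9200 block above, for example, the fourth row has 44 ones out of 50, i.e. 0.88. A hypothetical helper for turning the rows back into ratios (not part of the training code):

def keep_ratios(mask_rows):
    # fraction of kept ('1') positions per pruning location; should match "infer remain"
    return [round(row.count("1") / len(row), 2) for row in mask_rows]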
loss: 0.001115, lagrangian_loss: -0.000307, attention_score_distillation_loss: 0.000924 loss: 0.026647, lagrangian_loss: -0.000116, attention_score_distillation_loss: 0.000921 ---------------------------------------------------------------------- time: 2023-07-19 15:08:54 Evaluating: f1: 0.895, eval_loss: 0.6212, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.3872, expected_sparsity: 0.3809, expected_sequence_sparsity: 0.7467, target_sparsity: 0.3592, step: 9250 lambda_1: 0.0413, lambda_2: 31.4195 lambda_3: 0.0000 train remain: [1. 1. 1. 0.91 0.94 0.74 0.83 0.43 0.33 0.4 ] infer remain: [1.0, 1.0, 1.0, 0.88, 0.9, 0.74, 0.78, 0.42, 0.32, 0.38] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.88, 0.79, 0.59, 0.46, 0.19, 0.06, 0.02] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011111101110101100 11111111111111111111111111111111111111011101101100 11111111111111111111111111011111111100000000010010 11111111011111111111101011110111010111011011110100 11111111111011000111000000011001100000000000000100 11111111111011000110001000000000000000000000000000 11010000111101010101110101010001000000100100000000 loss: 0.002184, lagrangian_loss: 0.000073, attention_score_distillation_loss: 0.000919 loss: 0.001202, lagrangian_loss: 0.000088, attention_score_distillation_loss: 0.000919 ---------------------------------------------------------------------- time: 2023-07-19 15:09:09 Evaluating: f1: 0.8716, eval_loss: 0.6154, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.3872, expected_sparsity: 0.3809, expected_sequence_sparsity: 0.7467, target_sparsity: 0.3611, step: 9300 lambda_1: -0.1893, lambda_2: 31.5500 lambda_3: 0.0000 train remain: [1. 1. 1. 0.92 0.94 0.74 0.83 0.43 0.33 0.4 ] infer remain: [1.0, 1.0, 1.0, 0.88, 0.9, 0.74, 0.78, 0.42, 0.32, 0.38] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.88, 0.79, 0.59, 0.46, 0.19, 0.06, 0.02] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011111101110101100 11111111111111111111111111111111111111011101101100 11111111111111111111111111011111111100000000010010 11111111011111111111101011110111010111011011110100 11111111111011000111000000011001100000000000000100 11111111111011000110001000000000000000000000000000 11010000111101010101110101010001000000100100000000 loss: 0.003703, lagrangian_loss: 0.000699, attention_score_distillation_loss: 0.000913 ETA: 1:06:30 | Epoch 80 finished. Took 32.97 seconds. loss: 0.003162, lagrangian_loss: 0.001055, attention_score_distillation_loss: 0.000909 ---------------------------------------------------------------------- time: 2023-07-19 15:09:24 Evaluating: f1: 0.8845, eval_loss: 0.5905, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.3903, expected_sparsity: 0.3849, expected_sequence_sparsity: 0.7484, target_sparsity: 0.3631, step: 9350 lambda_1: -0.5967, lambda_2: 31.7529 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.91 0.93 0.74 0.82 0.42 0.33 0.4 ] infer remain: [1.0, 1.0, 1.0, 0.88, 0.9, 0.72, 0.78, 0.4, 0.32, 0.38] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.88, 0.79, 0.57, 0.44, 0.18, 0.06, 0.02] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011111101110101100 11111111111111111111111111111111111111011101101100 11111111111111111111111111011111110100000000010010 11111111011111111111111011110111010111011010110100 11111111111011000101000000011001100000000000000100 11111111111011000110001000000000000000000000000000 11010000111101010101110101010001000000100100000000 loss: 0.004339, lagrangian_loss: -0.000008, attention_score_distillation_loss: 0.000907 loss: 0.002785, lagrangian_loss: -0.000850, attention_score_distillation_loss: 0.000904 ---------------------------------------------------------------------- time: 2023-07-19 15:09:38 Evaluating: f1: 0.887, eval_loss: 0.6316, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.3935, expected_sparsity: 0.3887, expected_sequence_sparsity: 0.7499, target_sparsity: 0.365, step: 9400 lambda_1: -0.1971, lambda_2: 31.9887 lambda_3: 0.0000 train remain: [1. 1. 1. 0.9 0.93 0.73 0.82 0.42 0.33 0.39] infer remain: [1.0, 1.0, 1.0, 0.88, 0.88, 0.72, 0.78, 0.4, 0.32, 0.38] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.88, 0.77, 0.56, 0.43, 0.17, 0.06, 0.02] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011111101110101100 11111111111111111111111111111111111111011101001100 11111111111111111111111111011111110100000000010010 11111111011111111111111011110111010111011010110100 11111111111011000101000000011001100000000000000100 11111111111011000110001000000000000000000000000000 11010000111101010101110101010001000000100100000000 loss: 0.000602, lagrangian_loss: -0.000300, attention_score_distillation_loss: 0.000900 loss: 0.001700, lagrangian_loss: 0.000371, attention_score_distillation_loss: 0.000899 ETA: 1:05:56 | Epoch 81 finished. Took 32.93 seconds. ---------------------------------------------------------------------- time: 2023-07-19 15:09:53 Evaluating: f1: 0.8889, eval_loss: 0.5995, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.3935, expected_sparsity: 0.3887, expected_sequence_sparsity: 0.7499, target_sparsity: 0.367, step: 9450 lambda_1: 0.2575, lambda_2: 32.2649 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.91 0.93 0.73 0.82 0.42 0.33 0.4 ] infer remain: [1.0, 1.0, 1.0, 0.88, 0.88, 0.72, 0.78, 0.4, 0.32, 0.38] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.88, 0.77, 0.56, 0.43, 0.17, 0.06, 0.02] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011111101110101100 11111111111111111111111111111111111111011101001100 11111111111111111111111111011111110100000000010010 11111111011111111111111011110111010111011010110100 11111111111011000101000000011001100000000000000100 11111111111011000110001000000000000000000000000000 11010000111101010101110101010001000000100100000000 loss: 0.001647, lagrangian_loss: -0.000281, attention_score_distillation_loss: 0.000895 loss: 0.002136, lagrangian_loss: -0.000017, attention_score_distillation_loss: 0.000892 ---------------------------------------------------------------------- time: 2023-07-19 15:10:08 Evaluating: f1: 0.885, eval_loss: 0.5922, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.3935, expected_sparsity: 0.3887, expected_sequence_sparsity: 0.7499, target_sparsity: 0.3689, step: 9500 lambda_1: -0.1954, lambda_2: 32.5506 lambda_3: 0.0000 train remain: [1. 1. 1. 0.91 0.93 0.74 0.82 0.42 0.33 0.4 ] infer remain: [1.0, 1.0, 1.0, 0.88, 0.88, 0.72, 0.78, 0.4, 0.32, 0.38] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.88, 0.77, 0.56, 0.43, 0.17, 0.06, 0.02] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011111101110101100 11111111111111111111111111111111111111011101001100 11111111111111111111111111011111110100000000010010 11111111011111111111111011110111010111011010110100 11111111111011000101000000011001100000000000000100 11111111111011000110001000000000000000000000000000 11010000111101010101110101010001000000100100000000 loss: 0.001962, lagrangian_loss: 0.000709, attention_score_distillation_loss: 0.000890 loss: 0.001367, lagrangian_loss: 0.000270, attention_score_distillation_loss: 0.000887 ETA: 1:05:22 | Epoch 82 finished. Took 33.15 seconds. ---------------------------------------------------------------------- time: 2023-07-19 15:10:22 Evaluating: f1: 0.8837, eval_loss: 0.5921, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4013, expected_sparsity: 0.3956, expected_sequence_sparsity: 0.7527, target_sparsity: 0.3708, step: 9550 lambda_1: -0.5372, lambda_2: 32.7514 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.9 0.93 0.73 0.81 0.42 0.33 0.38] infer remain: [1.0, 1.0, 1.0, 0.86, 0.88, 0.72, 0.76, 0.4, 0.32, 0.38] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 0.76, 0.54, 0.41, 0.17, 0.05, 0.02] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111011111101110101100 11111111111111111111111111111111111111011001101100 11111111111111111111111111011111110100000000010010 11111111011111111111101011110111010111011010110100 11111111111011000101000000011001100000000000000100 11111111111011000110001000000000000000000000000000 11010000111101010101110101010001000000100100000000 loss: 0.001354, lagrangian_loss: -0.000770, attention_score_distillation_loss: 0.000884 loss: 0.001605, lagrangian_loss: -0.000805, attention_score_distillation_loss: 0.000882 ---------------------------------------------------------------------- time: 2023-07-19 15:10:37 Evaluating: f1: 0.878, eval_loss: 0.6855, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4013, expected_sparsity: 0.3957, expected_sequence_sparsity: 0.7528, target_sparsity: 0.3728, step: 9600 lambda_1: -0.1074, lambda_2: 32.9952 lambda_3: 0.0000 train remain: [1. 1. 1. 0.89 0.92 0.72 0.8 0.41 0.32 0.38] infer remain: [1.0, 1.0, 1.0, 0.86, 0.88, 0.72, 0.76, 0.4, 0.32, 0.36] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 0.76, 0.54, 0.41, 0.17, 0.05, 0.02] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111011111101110101100 11111111111111111111111111111111111111011001101100 11111111111111111111111111011111110100000000010010 11111111011111111111101011110111010111011010110100 11111111111011000101000000011001100000000000000100 11111111111011000110001000000000000000000000000000 11010000111101010101110101010000000000100100000000 loss: 0.001776, lagrangian_loss: -0.000079, attention_score_distillation_loss: 0.000878 loss: 0.001615, lagrangian_loss: 0.000074, attention_score_distillation_loss: 0.000876 ---------------------------------------------------------------------- time: 2023-07-19 15:10:52 Evaluating: f1: 0.8822, eval_loss: 0.624, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4013, expected_sparsity: 0.3956, expected_sequence_sparsity: 0.7527, target_sparsity: 0.3747, step: 9650 lambda_1: 0.1692, lambda_2: 33.1444 lambda_3: 0.0000 train remain: [1. 1. 1. 0.9 0.92 0.73 0.81 0.42 0.33 0.38] infer remain: [1.0, 1.0, 1.0, 0.86, 0.88, 0.72, 0.76, 0.4, 0.32, 0.38] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 0.76, 0.54, 0.41, 0.17, 0.05, 0.02] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111011111101110101100 11111111111111111111111111111111111111011001101100 11111111111111111111111111011111110100000000010010 11111111011111111111101011110111010111011010110100 11111111111011000101000000011001100000000000000100 11111111111011000110001000000000000000000000000000 11010000111101010101110101010001000000100100000000 loss: 0.002328, lagrangian_loss: -0.000204, attention_score_distillation_loss: 0.000873 ETA: 1:04:50 | Epoch 83 finished. Took 35.23 seconds. 
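target_sparsity in these evaluations ramps up almost linearly, by roughly 0.002 per 50-step evaluation interval (about 4e-5 per step), with macs_sparsity and expected_sparsity tracking it with a small lag. The exact schedule is not printed in this excerpt, so the slope below is simply read off two logged evaluations (illustrative snippet):

steps  = [9550, 9650]
target = [0.3708, 0.3747]
print((target[1] - target[0]) / (steps[1] - steps[0]))   # ~3.9e-05 per step, ~0.002 per 50-step eval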
loss: 0.001011, lagrangian_loss: 0.000402, attention_score_distillation_loss: 0.000872 ---------------------------------------------------------------------- time: 2023-07-19 15:11:06 Evaluating: f1: 0.8846, eval_loss: 0.6382, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4013, expected_sparsity: 0.3957, expected_sequence_sparsity: 0.7528, target_sparsity: 0.3767, step: 9700 lambda_1: -0.2526, lambda_2: 33.3621 lambda_3: 0.0000 train remain: [1. 1. 1. 0.9 0.92 0.72 0.8 0.41 0.33 0.38] infer remain: [1.0, 1.0, 1.0, 0.86, 0.88, 0.72, 0.76, 0.4, 0.32, 0.36] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 0.76, 0.54, 0.41, 0.17, 0.05, 0.02] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111011111101110101100 11111111111111111111111111111111111111011001101100 11111111111111111111111111011111110100000000010010 11111111011111111111101011110111010111011010110100 11111111111011000101000000011001100000000000000100 11111111111011000110001000000000000000000000000000 11010000111101010101110101010000000000100100000000 loss: 0.001388, lagrangian_loss: 0.000628, attention_score_distillation_loss: 0.000867 loss: 0.001377, lagrangian_loss: 0.000143, attention_score_distillation_loss: 0.000863 ---------------------------------------------------------------------- time: 2023-07-19 15:11:21 Evaluating: f1: 0.8566, eval_loss: 0.6314, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4045, expected_sparsity: 0.3988, expected_sequence_sparsity: 0.7541, target_sparsity: 0.3786, step: 9750 lambda_1: -0.4443, lambda_2: 33.4636 lambda_3: 0.0000 train remain: [1. 1. 1. 0.89 0.92 0.72 0.79 0.41 0.32 0.37] infer remain: [1.0, 1.0, 1.0, 0.86, 0.88, 0.7, 0.76, 0.4, 0.3, 0.36] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 0.76, 0.53, 0.4, 0.16, 0.05, 0.02] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111011111101110101100 11111111111111111111111111111111111111011001101100 11111111111111111111111111011101110100000000010010 11111111011111111111101011110111010111011010110100 11111111111011000101000000011001100000000000000100 11111111101011000110001000000000000000000000000000 11010000111101010101110101010000000000100100000000 loss: 0.000685, lagrangian_loss: -0.000913, attention_score_distillation_loss: 0.000861 ETA: 1:04:16 | Epoch 84 finished. Took 33.02 seconds. loss: 0.000799, lagrangian_loss: -0.000288, attention_score_distillation_loss: 0.000858 ---------------------------------------------------------------------- time: 2023-07-19 15:11:36 Evaluating: f1: 0.884, eval_loss: 0.6665, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4045, expected_sparsity: 0.3988, expected_sequence_sparsity: 0.7541, target_sparsity: 0.3806, step: 9800 lambda_1: 0.1219, lambda_2: 33.8516 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.88 0.92 0.72 0.79 0.41 0.32 0.37] infer remain: [1.0, 1.0, 1.0, 0.86, 0.88, 0.7, 0.76, 0.4, 0.3, 0.36] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 0.76, 0.53, 0.4, 0.16, 0.05, 0.02] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111011111101110101100 11111111111111111111111111111111111111011001101100 11111111111111111111111111011101110100000000010010 11111111011111111111101011110111010111011010110100 11111111111011000101000000011001100000000000000100 11111111101011000110001000000000000000000000000000 11010000111101010101110101010000000000100100000000 loss: 0.001067, lagrangian_loss: 0.000373, attention_score_distillation_loss: 0.000855 loss: 0.001140, lagrangian_loss: -0.000270, attention_score_distillation_loss: 0.000852 ---------------------------------------------------------------------- time: 2023-07-19 15:11:50 Evaluating: f1: 0.8787, eval_loss: 0.6405, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4045, expected_sparsity: 0.3984, expected_sequence_sparsity: 0.7539, target_sparsity: 0.3825, step: 9850 lambda_1: 0.1524, lambda_2: 34.0114 lambda_3: 0.0000 train remain: [1. 1. 1. 0.89 0.92 0.72 0.8 0.41 0.33 0.38] infer remain: [1.0, 1.0, 1.0, 0.86, 0.88, 0.7, 0.76, 0.4, 0.32, 0.36] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 0.76, 0.53, 0.4, 0.16, 0.05, 0.02] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111011111101110101100 11111111111111111111111111111111111111011001101100 11111111111111111111111111011101110100000000010010 11111111011111111111101011110111010111011010110100 11111111111011000101000000011001100000000000000100 11111111111011000110001000000000000000000000000000 11010000111101010101110101010000000000100100000000 loss: 0.001417, lagrangian_loss: -0.000156, attention_score_distillation_loss: 0.000850 loss: 0.002073, lagrangian_loss: 0.001148, attention_score_distillation_loss: 0.000847 ETA: 1:03:42 | Epoch 85 finished. Took 33.22 seconds. ---------------------------------------------------------------------- time: 2023-07-19 15:12:05 Evaluating: f1: 0.8934, eval_loss: 0.6452, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4045, expected_sparsity: 0.3988, expected_sequence_sparsity: 0.7541, target_sparsity: 0.3844, step: 9900 lambda_1: -0.5018, lambda_2: 34.5168 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.88 0.91 0.71 0.79 0.41 0.32 0.37] infer remain: [1.0, 1.0, 1.0, 0.86, 0.88, 0.7, 0.76, 0.4, 0.3, 0.36] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 0.76, 0.53, 0.4, 0.16, 0.05, 0.02] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111011111101110101100 11111111111111111111111111111111111111011001101100 11111111111111111111111111011101110100000000010010 11111111011111111111101011110111010111011010110100 11111111111011000101000000011001100000000000000100 11111111101011000110001000000000000000000000000000 11010000111101010101110101010000000000100100000000 loss: 0.001403, lagrangian_loss: 0.001232, attention_score_distillation_loss: 0.000844 loss: 0.002045, lagrangian_loss: -0.000561, attention_score_distillation_loss: 0.000841 ---------------------------------------------------------------------- time: 2023-07-19 15:12:20 Evaluating: f1: 0.8981, eval_loss: 0.5781, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4092, expected_sparsity: 0.4047, expected_sequence_sparsity: 0.7565, target_sparsity: 0.3864, step: 9950 lambda_1: -0.5082, lambda_2: 34.6281 lambda_3: 0.0000 train remain: [1. 1. 1. 0.88 0.91 0.7 0.77 0.4 0.32 0.37] infer remain: [1.0, 1.0, 1.0, 0.86, 0.86, 0.7, 0.74, 0.38, 0.3, 0.36] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 0.74, 0.52, 0.38, 0.15, 0.04, 0.02] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111011111101110101100 11111111111111111111111111111111111111011001001100 11111111111111111111111111011101110100000000010010 11111111011111111111101011110111010101011010110100 11111111111011000101000000011000100000000000000100 11111111101011000110001000000000000000000000000000 11010000111101010101110110010000000000100100000000 loss: 0.000531, lagrangian_loss: -0.001083, attention_score_distillation_loss: 0.000838 loss: 0.001166, lagrangian_loss: -0.000396, attention_score_distillation_loss: 0.000836 ---------------------------------------------------------------------- time: 2023-07-19 15:12:34 Evaluating: f1: 0.8924, eval_loss: 0.6115, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4092, expected_sparsity: 0.4047, expected_sequence_sparsity: 0.7565, target_sparsity: 0.3883, step: 10000 lambda_1: 0.0744, lambda_2: 35.0273 lambda_3: 0.0000 train remain: [1. 1. 1. 0.88 0.9 0.7 0.77 0.4 0.32 0.37] infer remain: [1.0, 1.0, 1.0, 0.86, 0.86, 0.7, 0.74, 0.38, 0.3, 0.36] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 0.74, 0.52, 0.38, 0.15, 0.04, 0.02] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111011111101110101100 11111111111111111111111111111111111111011001001100 11111111111111111111111111011101110100000000010010 11111111011111111111111011110111010101011000110100 11111111111011000101000000011000100000000000000100 11111111101011000110001000000000000000000000000000 11010000111101010101110110010000000000100100000000 loss: 0.000989, lagrangian_loss: 0.000306, attention_score_distillation_loss: 0.000833 ETA: 1:03:11 | Epoch 86 finished. Took 35.3 seconds. 
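For pulling F1-versus-sparsity curves out of this log, the "Evaluating:" records are regular enough to scrape with a regex. A hypothetical parser (the field names are the ones printed above; everything else is illustrative):

import re

EVAL_RE = re.compile(
    r"Evaluating:\s*f1:\s*(?P<f1>[\d.]+),\s*eval_loss:\s*(?P<eval_loss>[\d.]+)"
    r".*?expected_sparsity:\s*(?P<expected_sparsity>[\d.]+)"
    r".*?step:\s*(?P<step>\d+)",
    re.DOTALL,
)

def parse_evals(log_text):
    # yields (step, f1, eval_loss, expected_sparsity) for every evaluation record
    for m in EVAL_RE.finditer(log_text):
        yield (int(m["step"]), float(m["f1"]),
               float(m["eval_loss"]), float(m["expected_sparsity"]))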
loss: 0.101004, lagrangian_loss: -0.000133, attention_score_distillation_loss: 0.000830 ---------------------------------------------------------------------- time: 2023-07-19 15:12:49 Evaluating: f1: 0.8904, eval_loss: 0.6501, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4045, expected_sparsity: 0.4002, expected_sequence_sparsity: 0.7546, target_sparsity: 0.3903, step: 10050 lambda_1: 0.1706, lambda_2: 35.1572 lambda_3: 0.0000 train remain: [1. 1. 1. 0.88 0.91 0.71 0.78 0.4 0.32 0.38] infer remain: [1.0, 1.0, 1.0, 0.86, 0.88, 0.7, 0.74, 0.4, 0.3, 0.36] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 0.76, 0.53, 0.39, 0.16, 0.05, 0.02] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111011111101110101100 11111111111111111111111111111111111111011001101100 11111111111111111111111111011101110100000000010010 11111111011111111111101011110111010101011010110100 11111111111011000101000000011001100000000000000100 11111111101011000110001000000000000000000000000000 11010000111101010101110110010000000000100100000000 loss: 0.002272, lagrangian_loss: -0.000206, attention_score_distillation_loss: 0.000827 loss: 0.001113, lagrangian_loss: 0.000609, attention_score_distillation_loss: 0.000824 ---------------------------------------------------------------------- time: 2023-07-19 15:13:04 Evaluating: f1: 0.8866, eval_loss: 0.6166, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4092, expected_sparsity: 0.4047, expected_sequence_sparsity: 0.7565, target_sparsity: 0.3922, step: 10100 lambda_1: -0.3939, lambda_2: 35.5494 lambda_3: 0.0000 train remain: [1. 1. 1. 0.88 0.9 0.7 0.77 0.4 0.32 0.37] infer remain: [1.0, 1.0, 1.0, 0.86, 0.86, 0.7, 0.74, 0.38, 0.3, 0.36] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 0.74, 0.52, 0.38, 0.15, 0.04, 0.02] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111011111101110101100 11111111111111111111111111111111111111011001001100 11111111111111111111111111011101110100000000010010 11111111011111111111111011110111010101011000110100 11111111111011000101000000011000100000000000000100 11111111101011000110001000000000000000000000000000 11010000111101010101110110010000000000100100000000 loss: 0.001330, lagrangian_loss: 0.000743, attention_score_distillation_loss: 0.000821 ETA: 1:02:37 | Epoch 87 finished. Took 33.29 seconds. loss: 0.000873, lagrangian_loss: -0.000059, attention_score_distillation_loss: 0.000818 ---------------------------------------------------------------------- time: 2023-07-19 15:13:19 Evaluating: f1: 0.8877, eval_loss: 0.6482, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4149, expected_sparsity: 0.4086, expected_sequence_sparsity: 0.7581, target_sparsity: 0.3942, step: 10150 lambda_1: -0.5402, lambda_2: 35.6560 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.87 0.9 0.69 0.76 0.39 0.32 0.37] infer remain: [1.0, 1.0, 1.0, 0.86, 0.86, 0.68, 0.72, 0.38, 0.3, 0.36] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 0.74, 0.5, 0.36, 0.14, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111011111101110101100 11111111111111111111111111111111111110011001101100 11111111111111111111111111011101110000000000010010 11111111011111111111101011110111010101011000110100 11111111111011000101000000011000100000000000000100 11111111101011000110001000000000000000000000000000 11010000111101010101110110010000000000100100000000 loss: 0.001726, lagrangian_loss: -0.000718, attention_score_distillation_loss: 0.000815 loss: 0.001420, lagrangian_loss: -0.000633, attention_score_distillation_loss: 0.000813 ---------------------------------------------------------------------- time: 2023-07-19 15:13:33 Evaluating: f1: 0.89, eval_loss: 0.5937, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4212, expected_sparsity: 0.4136, expected_sequence_sparsity: 0.7601, target_sparsity: 0.3961, step: 10200 lambda_1: -0.0914, lambda_2: 35.9200 lambda_3: 0.0000 train remain: [1. 1. 1. 0.87 0.9 0.69 0.75 0.39 0.32 0.37] infer remain: [1.0, 1.0, 1.0, 0.84, 0.86, 0.68, 0.72, 0.38, 0.3, 0.36] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.72, 0.49, 0.35, 0.13, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011011011111101110101100 11111111111111111111111111111111111110011001101100 11111111111111111111111111011101110100000000000010 11111111011111111111101011110111010101011000110100 11111111111011000101000000011000100000000000000100 11111111101011000110001000000000000000000000000000 11010000111101010101110110010000000000100100000000 loss: 0.002917, lagrangian_loss: -0.000016, attention_score_distillation_loss: 0.000810 loss: 0.001198, lagrangian_loss: 0.000121, attention_score_distillation_loss: 0.000806 ETA: 1:02:03 | Epoch 88 finished. Took 33.13 seconds. ---------------------------------------------------------------------- time: 2023-07-19 15:13:48 Evaluating: f1: 0.8732, eval_loss: 0.6378, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4212, expected_sparsity: 0.4136, expected_sequence_sparsity: 0.7601, target_sparsity: 0.398, step: 10250 lambda_1: 0.1648, lambda_2: 36.0695 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.87 0.9 0.69 0.75 0.39 0.32 0.37] infer remain: [1.0, 1.0, 1.0, 0.84, 0.86, 0.68, 0.72, 0.38, 0.3, 0.36] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.72, 0.49, 0.35, 0.13, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011011011111101110101100 11111111111111111111111111111111111110011001101100 11111111111111111111111111011101110100000000000010 11111111011111111111101011110111010101011000110100 11111111111011000101000000011000100000000000000100 11111111101011000110001000000000000000000000000000 11010000111101010101110110010000000000100100000000 loss: 0.001004, lagrangian_loss: -0.000165, attention_score_distillation_loss: 0.000803 loss: 0.002251, lagrangian_loss: 0.000266, attention_score_distillation_loss: 0.000801 ---------------------------------------------------------------------- time: 2023-07-19 15:14:03 Evaluating: f1: 0.8764, eval_loss: 0.6038, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4212, expected_sparsity: 0.4136, expected_sequence_sparsity: 0.7601, target_sparsity: 0.4, step: 10300 lambda_1: -0.2512, lambda_2: 36.3107 lambda_3: 0.0000 train remain: [1. 1. 1. 0.87 0.9 0.69 0.75 0.39 0.32 0.37] infer remain: [1.0, 1.0, 1.0, 0.84, 0.86, 0.68, 0.72, 0.38, 0.3, 0.36] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.72, 0.49, 0.35, 0.13, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011011011111101110101100 11111111111111111111111111111111111110011001101100 11111111111111111111111111011101110100000000000010 11111111011111111111101011110111010101011000110100 11111111111011000101000000011000100000000000000100 11111111101011000110001000000000000000000000000000 11010000111101010101110110010000000000100100000000 loss: 0.041986, lagrangian_loss: 0.000612, attention_score_distillation_loss: 0.000798 loss: 0.001651, lagrangian_loss: 0.000348, attention_score_distillation_loss: 0.000795 ---------------------------------------------------------------------- time: 2023-07-19 15:14:17 Evaluating: f1: 0.8859, eval_loss: 0.6629, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4228, expected_sparsity: 0.4149, expected_sequence_sparsity: 0.7607, target_sparsity: 0.4019, step: 10350 lambda_1: -0.5760, lambda_2: 36.4842 lambda_3: 0.0000 train remain: [1. 1. 1. 0.86 0.89 0.68 0.73 0.38 0.31 0.36] infer remain: [1.0, 1.0, 1.0, 0.84, 0.86, 0.68, 0.7, 0.38, 0.3, 0.36] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.72, 0.49, 0.34, 0.13, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111011111101110100100 11111111111111111111111111111111111110011001101100 11111111111111111111111111011101110100000000000010 11111111011111111111101010110111010101011000110100 11111111111011000101000000011000100000000000000100 11111111101011000110001000000000000000000000000000 11010000111101010101110110010000000000100100000000 ETA: 1:01:32 | Epoch 89 finished. Took 35.46 seconds. 
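The ETA column is consistent with straight extrapolation from the recent epoch times: after "Epoch 89 finished", 110 of the 200 configured training epochs remain, and at the ~33-35 s per epoch reported here that is roughly 110 × 33.6 s ≈ 3,690 s ≈ 1:01:30, in line with the printed ETA of 1:01:32.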
loss: 0.001448, lagrangian_loss: -0.000549, attention_score_distillation_loss: 0.000792 loss: 0.000836, lagrangian_loss: -0.000915, attention_score_distillation_loss: 0.000789 ---------------------------------------------------------------------- time: 2023-07-19 15:14:32 Evaluating: f1: 0.8801, eval_loss: 0.6773, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4243, expected_sparsity: 0.4182, expected_sequence_sparsity: 0.762, target_sparsity: 0.4039, step: 10400 lambda_1: -0.1405, lambda_2: 36.7887 lambda_3: 0.0000 train remain: [1. 1. 1. 0.86 0.89 0.68 0.72 0.38 0.31 0.36] infer remain: [1.0, 1.0, 1.0, 0.84, 0.86, 0.66, 0.7, 0.36, 0.3, 0.36] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.72, 0.48, 0.33, 0.12, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111011111101110100100 11111111111111111111111111111111111110011001101100 11111111111111111111111111011101110000000000000010 11111111011111111111101010110111010101011000110100 11111111111010000101000000011000100000000000000100 11111111101011000110001000000000000000000000000000 11010000111101010101110110010000000000100100000000 loss: 0.002441, lagrangian_loss: -0.000085, attention_score_distillation_loss: 0.000786 loss: 0.002913, lagrangian_loss: 0.000192, attention_score_distillation_loss: 0.000783 ---------------------------------------------------------------------- time: 2023-07-19 15:14:47 Evaluating: f1: 0.8689, eval_loss: 0.6501, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4228, expected_sparsity: 0.4149, expected_sequence_sparsity: 0.7607, target_sparsity: 0.4058, step: 10450 lambda_1: 0.2414, lambda_2: 37.0494 lambda_3: 0.0000 train remain: [1. 1. 1. 0.86 0.89 0.68 0.73 0.38 0.31 0.36] infer remain: [1.0, 1.0, 1.0, 0.84, 0.86, 0.68, 0.7, 0.38, 0.3, 0.36] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.72, 0.49, 0.34, 0.13, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111011111101110100100 11111111111111111111111111111111111110011001101100 11111111111111111111111111011101110100000000000010 11111111011111111111101010110111010101011000110100 11111111111011000101000000011000100000000000000100 11111111101011000110001000000000000000000000000000 11010000111101010101110110010000000000100100000000 loss: 0.003924, lagrangian_loss: -0.000264, attention_score_distillation_loss: 0.000781 ETA: 1:00:58 | Epoch 90 finished. Took 33.18 seconds. loss: 0.000706, lagrangian_loss: 0.000021, attention_score_distillation_loss: 0.000777 ---------------------------------------------------------------------- time: 2023-07-19 15:15:02 Evaluating: f1: 0.8764, eval_loss: 0.7872, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4228, expected_sparsity: 0.4149, expected_sequence_sparsity: 0.7607, target_sparsity: 0.4077, step: 10500 lambda_1: -0.2156, lambda_2: 37.3795 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.86 0.89 0.68 0.73 0.38 0.31 0.36] infer remain: [1.0, 1.0, 1.0, 0.84, 0.86, 0.68, 0.7, 0.38, 0.3, 0.36] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.72, 0.49, 0.34, 0.13, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111011111101110100100 11111111111111111111111111111111111110011001101100 11111111111111111111111111011101110100000000000010 11111111011111111111101010110111010101011000110100 11111111111011000101000000011000100000000000000100 11111111101011000110001000000000000000000000000000 11010000111101010101110110010000000000100100000000 loss: 0.001868, lagrangian_loss: 0.000943, attention_score_distillation_loss: 0.000775 loss: 0.001224, lagrangian_loss: 0.000619, attention_score_distillation_loss: 0.000771 ---------------------------------------------------------------------- time: 2023-07-19 15:15:16 Evaluating: f1: 0.8756, eval_loss: 0.7088, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4259, expected_sparsity: 0.4194, expected_sequence_sparsity: 0.7625, target_sparsity: 0.4097, step: 10550 lambda_1: -0.5559, lambda_2: 37.5780 lambda_3: 0.0000 train remain: [1. 1. 1. 0.85 0.88 0.67 0.71 0.38 0.31 0.36] infer remain: [1.0, 1.0, 1.0, 0.84, 0.86, 0.66, 0.68, 0.36, 0.3, 0.34] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.72, 0.48, 0.32, 0.12, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111011111101110100100 11111111111111111111111111111111111110011001101100 11111111111111111111111111011101110000000000000010 11111111011111111111111010110111010001011000100100 11111111111011000101000000010000100000000000000100 11111111101011000110001000000000000000000000000000 11010000111101010101110100010000000000100100000000 loss: 0.001128, lagrangian_loss: 0.000117, attention_score_distillation_loss: 0.000769 loss: 0.002719, lagrangian_loss: -0.000931, attention_score_distillation_loss: 0.000766 ETA: 1:00:24 | Epoch 91 finished. Took 33.18 seconds. ---------------------------------------------------------------------- time: 2023-07-19 15:15:31 Evaluating: f1: 0.8789, eval_loss: 0.6869, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4322, expected_sparsity: 0.4244, expected_sequence_sparsity: 0.7645, target_sparsity: 0.4116, step: 10600 lambda_1: -0.2976, lambda_2: 37.7154 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.84 0.88 0.66 0.7 0.37 0.31 0.35] infer remain: [1.0, 1.0, 1.0, 0.82, 0.86, 0.66, 0.68, 0.36, 0.3, 0.34] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.82, 0.71, 0.47, 0.32, 0.11, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111011111001110100100 11111111111111111111111111111111111110011001101100 11111111111111111111111111011101110000000000000010 11111111011111111111111010110111010001011000100100 11111111111011000101000000010000100000000000000100 11111111101011000110001000000000000000000000000000 11010000111101010101110100010000000000100100000000 loss: 0.001108, lagrangian_loss: -0.000399, attention_score_distillation_loss: 0.000763 loss: 0.003640, lagrangian_loss: 0.000049, attention_score_distillation_loss: 0.000760 ---------------------------------------------------------------------- time: 2023-07-19 15:15:46 Evaluating: f1: 0.8697, eval_loss: 0.7075, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4322, expected_sparsity: 0.4244, expected_sequence_sparsity: 0.7645, target_sparsity: 0.4136, step: 10650 lambda_1: 0.1128, lambda_2: 37.9407 lambda_3: 0.0000 train remain: [1. 1. 1. 0.84 0.88 0.66 0.71 0.37 0.31 0.35] infer remain: [1.0, 1.0, 1.0, 0.82, 0.86, 0.66, 0.68, 0.36, 0.3, 0.34] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.82, 0.71, 0.47, 0.32, 0.11, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111011111001110100100 11111111111111111111111111111111111110011001101100 11111111111111111111111111011101110000000000000010 11111111011111111111111010110111010001011000100100 11111111111011000101000000010000100000000000000100 11111111101011000110001000000000000000000000000000 11010000111101010101110100010000000000100100000000 loss: 0.001528, lagrangian_loss: -0.000006, attention_score_distillation_loss: 0.000757 loss: 0.000865, lagrangian_loss: -0.000078, attention_score_distillation_loss: 0.000754 ETA: 0:59:50 | Epoch 92 finished. Took 33.08 seconds. ---------------------------------------------------------------------- time: 2023-07-19 15:16:00 Evaluating: f1: 0.8722, eval_loss: 0.6903, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4322, expected_sparsity: 0.4244, expected_sequence_sparsity: 0.7645, target_sparsity: 0.4155, step: 10700 lambda_1: -0.0601, lambda_2: 38.0436 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.84 0.88 0.66 0.71 0.38 0.31 0.35] infer remain: [1.0, 1.0, 1.0, 0.82, 0.86, 0.66, 0.68, 0.36, 0.3, 0.34] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.82, 0.71, 0.47, 0.32, 0.11, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111011111001110100100 11111111111111111111111111111111111110011001101100 11111111111111111111111111011101110000000000000010 11111111011111111111111010110111010001011000100100 11111111111011000101000000010000100000000000000100 11111111101011000110001000000000000000000000000000 11010000111101010101110100010000000000100100000000 loss: 0.001398, lagrangian_loss: 0.000257, attention_score_distillation_loss: 0.000752 loss: 0.001177, lagrangian_loss: 0.000829, attention_score_distillation_loss: 0.000749 ---------------------------------------------------------------------- time: 2023-07-19 15:16:15 Evaluating: f1: 0.8744, eval_loss: 0.6773, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4322, expected_sparsity: 0.4244, expected_sequence_sparsity: 0.7645, target_sparsity: 0.4175, step: 10750 lambda_1: -0.4647, lambda_2: 38.2545 lambda_3: 0.0000 train remain: [1. 1. 1. 0.84 0.88 0.66 0.71 0.37 0.31 0.34] infer remain: [1.0, 1.0, 1.0, 0.82, 0.86, 0.66, 0.68, 0.36, 0.3, 0.34] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.82, 0.71, 0.47, 0.32, 0.11, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111011111001110100100 11111111111111111111111111111111111110011001101100 11111111111111111111111111011101110000000000000010 11111111011111111111111010110111010001011000100100 11111111111011000101000000010000100000000000000100 11111111101011000110001000000000000000000000000000 11010000111101010101110100010000000000100100000000 loss: 0.002155, lagrangian_loss: 0.000301, attention_score_distillation_loss: 0.000746 loss: 0.136131, lagrangian_loss: -0.000324, attention_score_distillation_loss: 0.000743 ---------------------------------------------------------------------- time: 2023-07-19 15:16:30 Evaluating: f1: 0.8908, eval_loss: 0.6563, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4353, expected_sparsity: 0.4279, expected_sequence_sparsity: 0.766, target_sparsity: 0.4194, step: 10800 lambda_1: -0.4953, lambda_2: 38.3168 lambda_3: 0.0000 train remain: [1. 1. 1. 0.83 0.88 0.65 0.69 0.37 0.31 0.33] infer remain: [1.0, 1.0, 1.0, 0.82, 0.86, 0.64, 0.66, 0.36, 0.3, 0.32] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.82, 0.71, 0.45, 0.3, 0.11, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111011111001110100100 11111111111111111111111111111111111110011001101100 11111111111111111111111111010101110000000000000010 11111111011111111111101010110111010001011000100100 11111111111011000101000000010000100000000000000100 11111111101011000110001000000000000000000000000000 11010000111101010001110100010000000000100100000000 loss: 0.001122, lagrangian_loss: -0.000693, attention_score_distillation_loss: 0.000740 ETA: 0:59:18 | Epoch 93 finished. Took 35.43 seconds. 
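token_prune_loc in these records marks which of the ten pruning locations are currently active: it is True exactly for the positions whose "infer remain" entry has dropped below 1.0 and whose mask row contains zeros. Earlier in this excerpt it was [False, False, False, False, True, ...] (steps 8450-8550); from step 8600 onward the fourth location switched on as well, which is when its "infer remain" entry fell from 1.0 to 0.9.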
loss: 0.001777, lagrangian_loss: -0.000357, attention_score_distillation_loss: 0.000738 ---------------------------------------------------------------------- time: 2023-07-19 15:16:45 Evaluating: f1: 0.8628, eval_loss: 0.6591, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4353, expected_sparsity: 0.4279, expected_sequence_sparsity: 0.766, target_sparsity: 0.4213, step: 10850 lambda_1: -0.0697, lambda_2: 38.5546 lambda_3: 0.0000 train remain: [1. 1. 1. 0.83 0.87 0.65 0.69 0.37 0.31 0.33] infer remain: [1.0, 1.0, 1.0, 0.82, 0.86, 0.64, 0.66, 0.36, 0.3, 0.32] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.82, 0.71, 0.45, 0.3, 0.11, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111011111001110100100 11111111111111111111111111111111111110011001101100 11111111111111111111111111010101110000000000000010 11111111011111111111101010110111010001011000100100 11111111111011000101000000010000100000000000000100 11111111101011000110001000000000000000000000000000 11010000111101010001110100010000000000100100000000 loss: 0.006645, lagrangian_loss: -0.000031, attention_score_distillation_loss: 0.000734 loss: 0.002023, lagrangian_loss: 0.000023, attention_score_distillation_loss: 0.000732 ---------------------------------------------------------------------- time: 2023-07-19 15:16:59 Evaluating: f1: 0.8793, eval_loss: 0.6571, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4353, expected_sparsity: 0.4279, expected_sequence_sparsity: 0.766, target_sparsity: 0.4233, step: 10900 lambda_1: 0.1118, lambda_2: 38.6574 lambda_3: 0.0000 train remain: [1. 1. 1. 0.83 0.87 0.65 0.69 0.37 0.31 0.33] infer remain: [1.0, 1.0, 1.0, 0.82, 0.86, 0.64, 0.66, 0.36, 0.3, 0.32] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.82, 0.71, 0.45, 0.3, 0.11, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111011111001110100100 11111111111111111111111111111111111110011001101100 11111111111111111111111111010101110000000000000010 11111111011111111111101010110111010001011000100100 11111111111011000101000000010000100000000000000100 11111111101011000110001000000000000000000000000000 11010000111101010001110100010000000000100100000000 loss: 0.001467, lagrangian_loss: -0.000080, attention_score_distillation_loss: 0.000729 ETA: 0:58:44 | Epoch 94 finished. Took 33.29 seconds. loss: 0.000673, lagrangian_loss: 0.000275, attention_score_distillation_loss: 0.000727 ---------------------------------------------------------------------- time: 2023-07-19 15:17:14 Evaluating: f1: 0.8789, eval_loss: 0.6823, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4401, expected_sparsity: 0.4328, expected_sequence_sparsity: 0.768, target_sparsity: 0.4252, step: 10950 lambda_1: -0.3104, lambda_2: 38.9081 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.83 0.87 0.65 0.69 0.37 0.31 0.33] infer remain: [1.0, 1.0, 1.0, 0.8, 0.86, 0.64, 0.66, 0.36, 0.3, 0.32] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.69, 0.44, 0.29, 0.1, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011011011111001110100100 11111111111111111111111111111111111110011001101100 11111111111111111111111111010101110000000000000010 11111111011111111111101010110111010001011000100100 11111111111011000101000000010000100000000000000100 11111111101011000110001000000000000000000000000000 11010000111101010001110100010000000000100100000000 loss: 0.001499, lagrangian_loss: 0.000695, attention_score_distillation_loss: 0.000723 loss: 0.001892, lagrangian_loss: 0.000085, attention_score_distillation_loss: 0.000720 ---------------------------------------------------------------------- time: 2023-07-19 15:17:29 Evaluating: f1: 0.8831, eval_loss: 0.6254, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4432, expected_sparsity: 0.4358, expected_sequence_sparsity: 0.7692, target_sparsity: 0.4272, step: 11000 lambda_1: -0.5258, lambda_2: 39.0194 lambda_3: 0.0000 train remain: [1. 1. 1. 0.82 0.87 0.64 0.68 0.36 0.3 0.32] infer remain: [1.0, 1.0, 1.0, 0.8, 0.84, 0.64, 0.66, 0.36, 0.3, 0.32] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.67, 0.43, 0.28, 0.1, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111011111001100100100 11111111111111111111111111111111110110011001101100 11111111111111111111111111010101110000000000000010 11111111011111111111101010110111010001011000100100 11111111111011000101000000010000100000000000000100 11111111101011000110001000000000000000000000000000 11010000111101010001110100010000000000100100000000 loss: 0.003420, lagrangian_loss: -0.000464, attention_score_distillation_loss: 0.000718 loss: 0.001126, lagrangian_loss: -0.000687, attention_score_distillation_loss: 0.000714 ETA: 0:58:10 | Epoch 95 finished. Took 33.18 seconds. ---------------------------------------------------------------------- time: 2023-07-19 15:17:43 Evaluating: f1: 0.8657, eval_loss: 0.6608, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4448, expected_sparsity: 0.439, expected_sequence_sparsity: 0.7705, target_sparsity: 0.4291, step: 11050 lambda_1: -0.1267, lambda_2: 39.2533 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.81 0.86 0.64 0.68 0.36 0.3 0.32] infer remain: [1.0, 1.0, 1.0, 0.8, 0.84, 0.62, 0.64, 0.36, 0.3, 0.32] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.67, 0.42, 0.27, 0.1, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111011111001100100100 11111111111111111111111111111111110110011001101100 11111111111111111111111111010101110000000000000000 11111111011111111111101010110110010001011000100100 11111111111011000101000000010000100000000000000100 11111111101011000110001000000000000000000000000000 11010000111101010101110000010000000000100100000000 loss: 0.004817, lagrangian_loss: -0.000101, attention_score_distillation_loss: 0.000712 loss: 0.000982, lagrangian_loss: 0.000050, attention_score_distillation_loss: 0.000709 ---------------------------------------------------------------------- time: 2023-07-19 15:17:58 Evaluating: f1: 0.8836, eval_loss: 0.6738, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4432, expected_sparsity: 0.4368, expected_sequence_sparsity: 0.7696, target_sparsity: 0.4311, step: 11100 lambda_1: 0.1305, lambda_2: 39.4028 lambda_3: 0.0000 train remain: [1. 1. 1. 0.81 0.87 0.64 0.68 0.36 0.3 0.32] infer remain: [1.0, 1.0, 1.0, 0.8, 0.84, 0.64, 0.64, 0.36, 0.3, 0.32] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.67, 0.43, 0.28, 0.1, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111011111001100100100 11111111111111111111111111111111110110011001101100 11111111111111111111111111010101110000000000000010 11111111011111111111101010110110010001011000100100 11111111111011000101000000010000100000000000000100 11111111101011000110001000000000000000000000000000 11010000111101010101110000010000000000100100000000 loss: 0.000941, lagrangian_loss: -0.000067, attention_score_distillation_loss: 0.000706 loss: 0.001463, lagrangian_loss: 0.000220, attention_score_distillation_loss: 0.000706 ---------------------------------------------------------------------- time: 2023-07-19 15:18:13 Evaluating: f1: 0.8816, eval_loss: 0.6431, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4448, expected_sparsity: 0.439, expected_sequence_sparsity: 0.7705, target_sparsity: 0.433, step: 11150 lambda_1: -0.2664, lambda_2: 39.6299 lambda_3: 0.0000 train remain: [1. 1. 1. 0.81 0.86 0.64 0.68 0.36 0.3 0.32] infer remain: [1.0, 1.0, 1.0, 0.8, 0.84, 0.62, 0.64, 0.36, 0.3, 0.32] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.67, 0.42, 0.27, 0.1, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011111011111001100100100 11111111111111111111111111111111110110011001101100 11111111111111111111111111010101110000000000000000 11111111011111111111101010110110010001011000100100 11111111111011000101000000010000100000000000000100 11111111101011000110001000000000000000000000000000 11010000111101010101110000010000000000100100000000 loss: 0.000705, lagrangian_loss: 0.000238, attention_score_distillation_loss: 0.000700 ETA: 0:57:39 | Epoch 96 finished. Took 35.46 seconds. 
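Each evaluation also dumps ten 50-character 0/1 rows: one token keep mask per pruning location for a 50-position bin. The fraction of 1s in a row matches the corresponding "infer remain" entry for that step. A small helper (names are illustrative) that recovers the ratios from the mask rows:

def keep_ratios(mask_rows):
    """Per-location fraction of kept token positions in the 0/1 mask rows."""
    return [row.count("1") / len(row) for row in mask_rows]

# Example: the fourth row of the step-11150 block above contains 40 ones
# out of 50 positions, i.e. a keep ratio of 0.8 -- the fourth entry of the
# logged "infer remain" list for that step.
print(keep_ratios(["11111111111111111111111111011111011111001100100100"]))  # [0.8]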
loss: 0.022674, lagrangian_loss: 0.000781, attention_score_distillation_loss: 0.000701 ---------------------------------------------------------------------- time: 2023-07-19 15:18:28 Evaluating: f1: 0.8752, eval_loss: 0.6706, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4495, expected_sparsity: 0.4444, expected_sequence_sparsity: 0.7727, target_sparsity: 0.4349, step: 11200 lambda_1: -0.5323, lambda_2: 39.7671 lambda_3: 0.0000 train remain: [1. 1. 1. 0.8 0.86 0.63 0.67 0.36 0.3 0.32] infer remain: [1.0, 1.0, 1.0, 0.78, 0.84, 0.62, 0.64, 0.34, 0.3, 0.3] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.78, 0.66, 0.41, 0.26, 0.09, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011011011111001100100100 11111111111111111111111111111111110110011001101100 11111111111111111111111111010101110000000000000000 11111111011111111111101010110110010001011000100100 11111111111010000101000000010000100000000000000100 11111111101011000110001000000000000000000000000000 11010000111101010001110000010000000000100100000000 loss: 0.000801, lagrangian_loss: -0.000451, attention_score_distillation_loss: 0.000696 loss: 0.000891, lagrangian_loss: -0.000658, attention_score_distillation_loss: 0.000692 ---------------------------------------------------------------------- time: 2023-07-19 15:18:42 Evaluating: f1: 0.8673, eval_loss: 0.7001, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4495, expected_sparsity: 0.4446, expected_sequence_sparsity: 0.7728, target_sparsity: 0.4369, step: 11250 lambda_1: -0.1866, lambda_2: 39.9660 lambda_3: 0.0000 train remain: [1. 1. 1. 0.8 0.85 0.62 0.66 0.36 0.3 0.32] infer remain: [1.0, 1.0, 1.0, 0.78, 0.84, 0.62, 0.64, 0.34, 0.28, 0.3] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.78, 0.66, 0.41, 0.26, 0.09, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011011011111001100100100 11111111111111111111111111111111111110011000101100 11111111111111111111111111010101110000000000000000 11111111011111111111101010110110010001011000100100 11111111111010000101000000010000100000000000000100 11111111100011000110001000000000000000000000000000 11010000111101010001110000010000000000100100000000 loss: 0.001639, lagrangian_loss: -0.000217, attention_score_distillation_loss: 0.000689 ETA: 0:57:05 | Epoch 97 finished. Took 33.13 seconds. loss: 0.001900, lagrangian_loss: 0.000100, attention_score_distillation_loss: 0.000686 ---------------------------------------------------------------------- time: 2023-07-19 15:18:57 Evaluating: f1: 0.8737, eval_loss: 0.6685, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4495, expected_sparsity: 0.4444, expected_sequence_sparsity: 0.7727, target_sparsity: 0.4388, step: 11300 lambda_1: 0.1611, lambda_2: 40.1671 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.8 0.85 0.63 0.66 0.36 0.3 0.32] infer remain: [1.0, 1.0, 1.0, 0.78, 0.84, 0.62, 0.64, 0.34, 0.3, 0.3] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.78, 0.66, 0.41, 0.26, 0.09, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011011011111001100100100 11111111111111111111111111111111111110011000101100 11111111111111111111111111010101110000000000000000 11111111011111111111101010110110010001011000100100 11111111111010000101000000010000100000000000000100 11111111101011000110001000000000000000000000000000 11010000111101010001110000010000000000100100000000 loss: 0.001714, lagrangian_loss: -0.000072, attention_score_distillation_loss: 0.000683 loss: 0.001359, lagrangian_loss: 0.000100, attention_score_distillation_loss: 0.000681 ---------------------------------------------------------------------- time: 2023-07-19 15:19:12 Evaluating: f1: 0.8859, eval_loss: 0.6657, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4495, expected_sparsity: 0.4444, expected_sequence_sparsity: 0.7727, target_sparsity: 0.4408, step: 11350 lambda_1: -0.2237, lambda_2: 40.4317 lambda_3: 0.0000 train remain: [1. 1. 1. 0.8 0.85 0.63 0.66 0.36 0.3 0.32] infer remain: [1.0, 1.0, 1.0, 0.78, 0.84, 0.62, 0.64, 0.34, 0.3, 0.3] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.78, 0.66, 0.41, 0.26, 0.09, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011011011111001100100100 11111111111111111111111111111111111110011000101100 11111111111111111111111111010101110000000000000000 11111111011111111111101010110110010001011000100100 11111111111010000101000000010000100000000000000100 11111111101011000110001000000000000000000000000000 11010000111101010001110000010000000000100100000000 loss: 0.001111, lagrangian_loss: 0.000764, attention_score_distillation_loss: 0.000678 loss: 0.001153, lagrangian_loss: 0.000204, attention_score_distillation_loss: 0.000675 ETA: 0:56:31 | Epoch 98 finished. Took 33.02 seconds. ---------------------------------------------------------------------- time: 2023-07-19 15:19:26 Evaluating: f1: 0.8679, eval_loss: 0.6468, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4526, expected_sparsity: 0.4484, expected_sequence_sparsity: 0.7744, target_sparsity: 0.4427, step: 11400 lambda_1: -0.5786, lambda_2: 40.6513 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.79 0.84 0.62 0.66 0.35 0.3 0.31] infer remain: [1.0, 1.0, 1.0, 0.78, 0.82, 0.62, 0.62, 0.34, 0.28, 0.3] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.78, 0.64, 0.4, 0.25, 0.08, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011011011111001100100100 11111111111111111111111111111111110110011000101100 11111111111111111111111111010101110000000000000000 11111111011111111111101010110100010001011000100100 11111111111010000101000000010000100000000000000100 11111111100011000110001000000000000000000000000000 11010000111101010001110000010000000000100100000000 loss: 0.001013, lagrangian_loss: -0.000463, attention_score_distillation_loss: 0.000672 loss: 0.001756, lagrangian_loss: -0.000699, attention_score_distillation_loss: 0.000669 ---------------------------------------------------------------------- time: 2023-07-19 15:19:41 Evaluating: f1: 0.8675, eval_loss: 0.7256, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4605, expected_sparsity: 0.4549, expected_sequence_sparsity: 0.777, target_sparsity: 0.4446, step: 11450 lambda_1: -0.3147, lambda_2: 40.8000 lambda_3: 0.0000 train remain: [1. 1. 1. 0.78 0.83 0.62 0.65 0.35 0.29 0.31] infer remain: [1.0, 1.0, 1.0, 0.76, 0.82, 0.6, 0.62, 0.34, 0.28, 0.3] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.62, 0.37, 0.23, 0.08, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011011011111000100100100 11111111111111111111111111111111111110011000100100 11111111111111111111111111010101010000000000000000 11111111011111111111101010110110010000011000100100 11111111111010000101000000010000100000000000000100 11111111100011000110001000000000000000000000000000 11010000111101010001110000010000000000100100000000 loss: 0.001003, lagrangian_loss: -0.000469, attention_score_distillation_loss: 0.000667 loss: 0.000877, lagrangian_loss: 0.000025, attention_score_distillation_loss: 0.000663 ---------------------------------------------------------------------- time: 2023-07-19 15:19:56 Evaluating: f1: 0.8816, eval_loss: 0.6792, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4557, expected_sparsity: 0.4503, expected_sequence_sparsity: 0.7752, target_sparsity: 0.4466, step: 11500 lambda_1: 0.1557, lambda_2: 41.1232 lambda_3: 0.0000 train remain: [1. 1. 1. 0.78 0.83 0.62 0.65 0.35 0.3 0.31] infer remain: [1.0, 1.0, 1.0, 0.78, 0.82, 0.6, 0.62, 0.34, 0.28, 0.3] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.78, 0.64, 0.38, 0.24, 0.08, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011011011111001100100100 11111111111111111111111111111111111110011000100100 11111111111111111111111111010101010000000000000000 11111111011111111111101010110100010001011000100100 11111111111010000101000000010000100000000000000100 11111111100011000110001000000000000000000000000000 11010000111101010001110000010000000000100100000000 ETA: 0:55:59 | Epoch 99 finished. Took 35.33 seconds. 
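The lagrangian_loss column hovers around zero and changes sign along with lambda_1. The exact formula is not printed in this log, so the following is only a sketch of the usual Lagrangian relaxation for sparsity control that is consistent with the quantities reported here (expected vs. target sparsity and the trained multipliers lambda_1, lambda_2); it is not a quote of the training code:

def lagrangian_loss(expected_sparsity, target_sparsity, lambda_1, lambda_2):
    # Penalize the gap between the current expected sparsity and the
    # scheduled target. The multipliers are trained adversarially (they are
    # maximized while the rest of the model minimizes the loss), which is
    # why the logged values can be small and negative.
    gap = expected_sparsity - target_sparsity
    return lambda_1 * gap + lambda_2 * gap * gap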
loss: 0.001461, lagrangian_loss: 0.000073, attention_score_distillation_loss: 0.000660 loss: 0.001411, lagrangian_loss: -0.000033, attention_score_distillation_loss: 0.000657 ---------------------------------------------------------------------- time: 2023-07-19 15:20:11 Evaluating: f1: 0.8789, eval_loss: 0.6592, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4557, expected_sparsity: 0.4503, expected_sequence_sparsity: 0.7752, target_sparsity: 0.4485, step: 11550 lambda_1: -0.1426, lambda_2: 41.3372 lambda_3: 0.0000 train remain: [1. 1. 1. 0.78 0.83 0.62 0.65 0.35 0.3 0.31] infer remain: [1.0, 1.0, 1.0, 0.78, 0.82, 0.6, 0.62, 0.34, 0.28, 0.3] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.78, 0.64, 0.38, 0.24, 0.08, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011011011111001100100100 11111111111111111111111111111111111110011000100100 11111111111111111111110111010101110000000000000000 11111111011111111111101010110100010001011000100100 11111111111010000101000000010000100000000000000100 11111111100011000110001000000000000000000000000000 11010000111101010001110000010000000000100100000000 loss: 0.001371, lagrangian_loss: 0.000497, attention_score_distillation_loss: 0.000654 loss: 0.100418, lagrangian_loss: 0.000666, attention_score_distillation_loss: 0.000652 ---------------------------------------------------------------------- time: 2023-07-19 15:20:25 Evaluating: f1: 0.8741, eval_loss: 0.7338, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4636, expected_sparsity: 0.4575, expected_sequence_sparsity: 0.7781, target_sparsity: 0.4505, step: 11600 lambda_1: -0.5613, lambda_2: 41.5886 lambda_3: 0.0000 train remain: [1. 1. 1. 0.78 0.82 0.61 0.64 0.35 0.29 0.31] infer remain: [1.0, 1.0, 1.0, 0.76, 0.8, 0.6, 0.62, 0.34, 0.28, 0.3] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.61, 0.36, 0.23, 0.08, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011011011111000100100100 11111111111111111111111111111111101110011000100100 11111111111111111111111111010101010000000000000000 11111111011111111111101010110110010000011000100100 11111111111010000101000000010000100000000000000100 11111111100011000110001000000000000000000000000000 11010000111101010001110000010000000000100100000000 loss: 0.001115, lagrangian_loss: 0.000216, attention_score_distillation_loss: 0.000649 ETA: 0:55:25 | Epoch 100 finished. Took 33.17 seconds. loss: 0.001358, lagrangian_loss: -0.000629, attention_score_distillation_loss: 0.000646 ---------------------------------------------------------------------- time: 2023-07-19 15:20:40 Evaluating: f1: 0.8666, eval_loss: 0.7149, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4636, expected_sparsity: 0.4575, expected_sequence_sparsity: 0.7781, target_sparsity: 0.4524, step: 11650 lambda_1: -0.3741, lambda_2: 41.7162 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.77 0.81 0.61 0.64 0.34 0.29 0.31] infer remain: [1.0, 1.0, 1.0, 0.76, 0.8, 0.6, 0.62, 0.34, 0.28, 0.3] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.61, 0.36, 0.23, 0.08, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011011011111000100100100 11111111111111111111111111111111101110011000100100 11111111111111111111111111010101010000000000000000 11111111011111111111101010110110010000011000100100 11111111111010000101000000010000100000000000000100 11111111100011000110001000000000000000000000000000 11010000111101010001110000010000000000100100000000 loss: 0.001786, lagrangian_loss: -0.000686, attention_score_distillation_loss: 0.000644 loss: 0.001322, lagrangian_loss: -0.000060, attention_score_distillation_loss: 0.000643 ---------------------------------------------------------------------- time: 2023-07-19 15:20:54 Evaluating: f1: 0.8702, eval_loss: 0.6964, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4636, expected_sparsity: 0.4575, expected_sequence_sparsity: 0.7781, target_sparsity: 0.4544, step: 11700 lambda_1: 0.0647, lambda_2: 41.9948 lambda_3: 0.0000 train remain: [1. 1. 1. 0.77 0.81 0.61 0.63 0.34 0.29 0.3 ] infer remain: [1.0, 1.0, 1.0, 0.76, 0.8, 0.6, 0.62, 0.34, 0.28, 0.3] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.61, 0.36, 0.23, 0.08, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011011011111000100100100 11111111111111111111111111111111101110011000100100 11111111111111111111111111010101010000000000000000 11111111011111111111101010110100010001011000100100 11111111111010000101000000010000100000000000000100 11111111100011000110001000000000000000000000000000 11010000111101010001110000010000000000100100000000 loss: 0.000887, lagrangian_loss: 0.000150, attention_score_distillation_loss: 0.000638 loss: 0.003402, lagrangian_loss: -0.000064, attention_score_distillation_loss: 0.000634 ETA: 0:54:51 | Epoch 101 finished. Took 32.92 seconds. ---------------------------------------------------------------------- time: 2023-07-19 15:21:09 Evaluating: f1: 0.8778, eval_loss: 0.6802, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4636, expected_sparsity: 0.4575, expected_sequence_sparsity: 0.7781, target_sparsity: 0.4563, step: 11750 lambda_1: -0.0182, lambda_2: 42.0979 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.77 0.81 0.61 0.64 0.35 0.29 0.31] infer remain: [1.0, 1.0, 1.0, 0.76, 0.8, 0.6, 0.62, 0.34, 0.28, 0.3] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.61, 0.36, 0.23, 0.08, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011011011111000100100100 11111111111111111111111111111111101110011000100100 11111111111111111111111111010101010000000000000000 11111111011111111111101010110100010001011000100100 11111111111010000101000000010000100000000000000100 11111111100011000110001000000000000000000000000000 11010000111101010001110000010000000000100100000000 loss: 0.001729, lagrangian_loss: 0.000452, attention_score_distillation_loss: 0.000632 loss: 0.001485, lagrangian_loss: 0.000588, attention_score_distillation_loss: 0.000629 ---------------------------------------------------------------------- time: 2023-07-19 15:21:24 Evaluating: f1: 0.8697, eval_loss: 0.6826, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4652, expected_sparsity: 0.4584, expected_sequence_sparsity: 0.7785, target_sparsity: 0.4582, step: 11800 lambda_1: -0.5298, lambda_2: 42.4544 lambda_3: 0.0000 train remain: [1. 1. 1. 0.77 0.81 0.6 0.63 0.34 0.29 0.3 ] infer remain: [1.0, 1.0, 1.0, 0.76, 0.8, 0.6, 0.6, 0.34, 0.28, 0.3] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.61, 0.36, 0.22, 0.07, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011011011111000100100100 11111111111111111111111111111111101110011000100100 11111111111111111111111111010101010000000000000000 11111111011111111111101010110100010000011000100100 11111111111010000101000000010000100000000000000100 11111111100011000110001000000000000000000000000000 11010000111101010001110000010000000000100100000000 loss: 0.000897, lagrangian_loss: 0.000023, attention_score_distillation_loss: 0.000626 loss: 0.001528, lagrangian_loss: -0.000732, attention_score_distillation_loss: 0.000623 ETA: 0:54:17 | Epoch 102 finished. Took 33.25 seconds. ---------------------------------------------------------------------- time: 2023-07-19 15:21:38 Evaluating: f1: 0.8648, eval_loss: 0.6836, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4715, expected_sparsity: 0.4672, expected_sequence_sparsity: 0.7821, target_sparsity: 0.4602, step: 11850 lambda_1: -0.4196, lambda_2: 42.5987 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.75 0.8 0.6 0.62 0.34 0.29 0.3 ] infer remain: [1.0, 1.0, 1.0, 0.74, 0.78, 0.58, 0.6, 0.34, 0.28, 0.3] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.74, 0.58, 0.33, 0.2, 0.07, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011011011011000100100100 11111111111111111111111111111111100110011000100100 11111111111111111111110111010101010000000000000000 11111111011111111111101010110100010000011000100100 11111111111010000101000000010000100000000000000100 11111111100011000110001000000000000000000000000000 11010000111101010001110000010000000000100100000000 loss: 0.023355, lagrangian_loss: -0.000330, attention_score_distillation_loss: 0.000620 loss: 0.003435, lagrangian_loss: 0.000057, attention_score_distillation_loss: 0.000617 ---------------------------------------------------------------------- time: 2023-07-19 15:21:53 Evaluating: f1: 0.8688, eval_loss: 0.6765, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4715, expected_sparsity: 0.4672, expected_sequence_sparsity: 0.7821, target_sparsity: 0.4621, step: 11900 lambda_1: 0.1809, lambda_2: 43.0957 lambda_3: 0.0000 train remain: [1. 1. 1. 0.75 0.8 0.6 0.62 0.34 0.29 0.3 ] infer remain: [1.0, 1.0, 1.0, 0.74, 0.78, 0.58, 0.6, 0.34, 0.28, 0.3] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.74, 0.58, 0.33, 0.2, 0.07, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011011011011000100100100 11111111111111111111111111111111100110011000100100 11111111111111111111110111010101010000000000000000 11111111011111111111101010110100010000011000100100 11111111111010000101000000010000100000000000000100 11111111100011000110001000000000000000000000000000 11010000111101010001110000010000000000100100000000 loss: 0.002518, lagrangian_loss: 0.000055, attention_score_distillation_loss: 0.000614 loss: 0.007086, lagrangian_loss: -0.000189, attention_score_distillation_loss: 0.000612 ---------------------------------------------------------------------- time: 2023-07-19 15:22:08 Evaluating: f1: 0.8878, eval_loss: 0.7009, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4715, expected_sparsity: 0.4672, expected_sequence_sparsity: 0.7821, target_sparsity: 0.4641, step: 11950 lambda_1: -0.0647, lambda_2: 43.3523 lambda_3: 0.0000 train remain: [1. 1. 1. 0.75 0.8 0.6 0.62 0.34 0.29 0.3 ] infer remain: [1.0, 1.0, 1.0, 0.74, 0.78, 0.58, 0.6, 0.34, 0.28, 0.3] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.74, 0.58, 0.33, 0.2, 0.07, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011011011111000000100100 11111111111111111111111111111111100110011000100100 11111111111111111111110111010101010000000000000000 11111111011111111111101010110100010000011000100100 11111111111010000101000000010000100000000000000100 11111111100011000110001000000000000000000000000000 11010000111101010001110000010000000000100100000000 loss: 0.001352, lagrangian_loss: 0.000418, attention_score_distillation_loss: 0.000609 ETA: 0:53:45 | Epoch 103 finished. Took 35.39 seconds. 
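target_sparsity is not fixed: across these evaluations it rises by roughly 0.002 every 50 steps, i.e. a linear ramp toward the final sparsity target. The slope can be read off any two of the records above, for example:

# Two (step, target_sparsity) pairs taken from the evaluations above.
step_a, target_a = 10650, 0.4136
step_b, target_b = 11950, 0.4641

slope = (target_b - target_a) / (step_b - step_a)
print(slope)       # ~3.9e-05 sparsity per step
print(slope * 50)  # ~0.0019 per 50-step evaluation interval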
loss: 0.001552, lagrangian_loss: 0.000470, attention_score_distillation_loss: 0.000606 ---------------------------------------------------------------------- time: 2023-07-19 15:22:23 Evaluating: f1: 0.8705, eval_loss: 0.6578, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4715, expected_sparsity: 0.4672, expected_sequence_sparsity: 0.7821, target_sparsity: 0.466, step: 12000 lambda_1: -0.5926, lambda_2: 43.7621 lambda_3: 0.0000 train remain: [1. 1. 1. 0.74 0.79 0.6 0.62 0.34 0.28 0.3 ] infer remain: [1.0, 1.0, 1.0, 0.74, 0.78, 0.58, 0.6, 0.34, 0.28, 0.3] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.74, 0.58, 0.33, 0.2, 0.07, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011011011111000000100100 11111111111111111111111111111111100110011000100100 11111111111111111111110111010101010000000000000000 11111111011111111111101010110110010000010000100100 11111111111010000101000000010000100000000000000100 11111111100011000110001000000000000000000000000000 11010000111101010001110000010000000000100100000000 loss: 0.005579, lagrangian_loss: 0.000882, attention_score_distillation_loss: 0.000605 loss: 0.001504, lagrangian_loss: -0.000470, attention_score_distillation_loss: 0.000600 ---------------------------------------------------------------------- time: 2023-07-19 15:22:37 Evaluating: f1: 0.8804, eval_loss: 0.6577, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4746, expected_sparsity: 0.4715, expected_sequence_sparsity: 0.7839, target_sparsity: 0.468, step: 12050 lambda_1: -0.5561, lambda_2: 43.8503 lambda_3: 0.0000 train remain: [0.99 1. 1. 0.73 0.79 0.59 0.61 0.34 0.28 0.3 ] infer remain: [1.0, 1.0, 1.0, 0.72, 0.78, 0.58, 0.6, 0.34, 0.28, 0.3] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.72, 0.56, 0.33, 0.2, 0.07, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011011011011000000100100 11111111111111111111111111111111100110011000100100 11111111111111111111110111010101010000000000000000 11111111011111111111101010110110010000010000100100 11111111111010000101000000010000100000000000000100 11111111101011000010001000000000000000000000000000 11010000111101010001110000010000000000100100000000 loss: 0.001853, lagrangian_loss: -0.000464, attention_score_distillation_loss: 0.000598 ETA: 0:53:11 | Epoch 104 finished. Took 32.89 seconds. loss: 0.001653, lagrangian_loss: -0.000440, attention_score_distillation_loss: 0.000594 ---------------------------------------------------------------------- time: 2023-07-19 15:22:52 Evaluating: f1: 0.8821, eval_loss: 0.682, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4762, expected_sparsity: 0.4728, expected_sequence_sparsity: 0.7844, target_sparsity: 0.4699, step: 12100 lambda_1: -0.2189, lambda_2: 44.0224 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.73 0.79 0.59 0.61 0.34 0.28 0.29] infer remain: [1.0, 1.0, 1.0, 0.72, 0.78, 0.58, 0.58, 0.32, 0.28, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.72, 0.56, 0.33, 0.19, 0.06, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011011011011000000100100 11111111111111111111111111111111100110011000100100 11111111111111111111111111010101000000000000000000 11111111011111111111101010110100010000010000100100 11111111111010000101000000000000100000000000000100 11111111101011000010001000000000000000000000000000 11010000111100010001110000010000000000100100000000 loss: 0.003494, lagrangian_loss: -0.000232, attention_score_distillation_loss: 0.000592 loss: 0.000987, lagrangian_loss: 0.000027, attention_score_distillation_loss: 0.000589 ---------------------------------------------------------------------- time: 2023-07-19 15:23:07 Evaluating: f1: 0.8679, eval_loss: 0.659, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4762, expected_sparsity: 0.4728, expected_sequence_sparsity: 0.7844, target_sparsity: 0.4718, step: 12150 lambda_1: -0.1724, lambda_2: 44.0751 lambda_3: 0.0000 train remain: [1. 1. 1. 0.73 0.78 0.58 0.61 0.34 0.28 0.29] infer remain: [1.0, 1.0, 1.0, 0.72, 0.78, 0.58, 0.58, 0.32, 0.28, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.72, 0.56, 0.33, 0.19, 0.06, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011011011011000000100100 11111111111111111111111111111111100110011000100100 11111111111111111111111111010101000000000000000000 11111111011111111111101010110100010000010000100100 11111111111010000101000000000000100000000000000100 11111111101011000010001000000000000000000000000000 11010000111100010001110000010000000000100100000000 loss: 0.000902, lagrangian_loss: 0.000136, attention_score_distillation_loss: 0.000586 loss: 0.004816, lagrangian_loss: 0.000259, attention_score_distillation_loss: 0.000583 ETA: 0:52:36 | Epoch 105 finished. Took 33.06 seconds. ---------------------------------------------------------------------- time: 2023-07-19 15:23:21 Evaluating: f1: 0.8901, eval_loss: 0.6206, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4809, expected_sparsity: 0.477, expected_sequence_sparsity: 0.7861, target_sparsity: 0.4738, step: 12200 lambda_1: -0.4240, lambda_2: 44.1881 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.73 0.78 0.58 0.6 0.34 0.28 0.29] infer remain: [1.0, 1.0, 1.0, 0.72, 0.76, 0.56, 0.58, 0.32, 0.26, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.72, 0.55, 0.31, 0.18, 0.06, 0.01, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011011011011000000100100 11111111111111111111111111111111100100011000100100 11111111111111111111110111010101000000000000000000 11111111011111111111101010110100010000010000100100 11111111111010000101000000000000100000000000000100 11111111100011000010001000000000000000000000000000 11010000111100010001110000010000000000100100000000 loss: 0.001514, lagrangian_loss: 0.000276, attention_score_distillation_loss: 0.000580 loss: 0.001073, lagrangian_loss: 0.001286, attention_score_distillation_loss: 0.000577 ---------------------------------------------------------------------- time: 2023-07-19 15:23:36 Evaluating: f1: 0.873, eval_loss: 0.6777, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4809, expected_sparsity: 0.477, expected_sequence_sparsity: 0.7861, target_sparsity: 0.4757, step: 12250 lambda_1: -0.7351, lambda_2: 44.3266 lambda_3: 0.0000 train remain: [0.99 1. 1. 0.73 0.78 0.58 0.59 0.33 0.27 0.28] infer remain: [1.0, 1.0, 1.0, 0.72, 0.76, 0.56, 0.58, 0.32, 0.26, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.72, 0.55, 0.31, 0.18, 0.06, 0.01, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011011011011000000100100 11111111111111111111111111111111100100011000100100 11111111111111111111110111010101000000000000000000 11111111011111111111101010110100010000010000100100 11111111111010000101000000000000100000000000000100 11111101101011000010001000000000000000000000000000 11010000111100010001110000010000000000100100000000 loss: 0.002913, lagrangian_loss: -0.000008, attention_score_distillation_loss: 0.000575 loss: 0.001052, lagrangian_loss: -0.000446, attention_score_distillation_loss: 0.000571 ---------------------------------------------------------------------- time: 2023-07-19 15:23:50 Evaluating: f1: 0.8761, eval_loss: 0.6734, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4871, expected_sparsity: 0.4812, expected_sequence_sparsity: 0.7878, target_sparsity: 0.4777, step: 12300 lambda_1: -0.8039, lambda_2: 44.3722 lambda_3: 0.0000 train remain: [0.99 1. 1. 0.72 0.77 0.57 0.59 0.33 0.27 0.27] infer remain: [1.0, 1.0, 1.0, 0.7, 0.76, 0.56, 0.58, 0.32, 0.26, 0.26] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.7, 0.53, 0.3, 0.17, 0.06, 0.01, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011011011010000000100100 11111111111111111111111111111111100100011000100100 11111111111111111111111111010001000000000000000000 11111111011111111111101010110110010000010000100000 11111111111010000101000000000000100000000000000100 11111101101011000010001000000000000000000000000000 11010000111100010001010000010000000000100100000000 loss: 0.001698, lagrangian_loss: 0.000109, attention_score_distillation_loss: 0.000569 ETA: 0:52:04 | Epoch 106 finished. Took 35.3 seconds. 
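token_prune_loc marks which of the ten pruning locations are active. The three locations flagged False never lose a token: their mask rows are all ones and their remain ratios stay at 1.0 in every record above. A quick consistency check using the step-12300 values (list literals copied from the log):

token_prune_loc = [False, False, False, True, True, True, True, True, True, True]
infer_remain = [1.0, 1.0, 1.0, 0.7, 0.76, 0.56, 0.58, 0.32, 0.26, 0.26]

# Disabled locations must keep the full sequence.
assert all(remain == 1.0
           for active, remain in zip(token_prune_loc, infer_remain)
           if not active)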
loss: 0.000810, lagrangian_loss: -0.000713, attention_score_distillation_loss: 0.000566 ---------------------------------------------------------------------- time: 2023-07-19 15:24:05 Evaluating: f1: 0.8679, eval_loss: 0.6668, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4887, expected_sparsity: 0.4818, expected_sequence_sparsity: 0.7881, target_sparsity: 0.4796, step: 12350 lambda_1: -0.5277, lambda_2: 44.5002 lambda_3: 0.0000 train remain: [0.99 1. 1. 0.71 0.77 0.56 0.58 0.33 0.27 0.27] infer remain: [1.0, 1.0, 1.0, 0.7, 0.76, 0.56, 0.56, 0.32, 0.26, 0.26] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.7, 0.53, 0.3, 0.17, 0.05, 0.01, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011011011010000000100100 11111111111111111111111111111111100100011000100100 11111111111111111111111111010001000000000000000000 11111111011111111111101010110100010000010000100000 11111111111010000101000000000000100000000000000100 11111101101011000010001000000000000000000000000000 11010000111100010001010000010000000000100100000000 loss: 0.001101, lagrangian_loss: -0.000205, attention_score_distillation_loss: 0.000563 loss: 0.000370, lagrangian_loss: -0.000514, attention_score_distillation_loss: 0.000560 ---------------------------------------------------------------------- time: 2023-07-19 15:24:20 Evaluating: f1: 0.8719, eval_loss: 0.667, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4887, expected_sparsity: 0.4834, expected_sequence_sparsity: 0.7887, target_sparsity: 0.4815, step: 12400 lambda_1: -0.1476, lambda_2: 44.7067 lambda_3: 0.0000 train remain: [0.99 1. 1. 0.71 0.77 0.56 0.58 0.33 0.26 0.27] infer remain: [1.0, 1.0, 1.0, 0.7, 0.76, 0.54, 0.56, 0.32, 0.26, 0.26] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.7, 0.53, 0.29, 0.16, 0.05, 0.01, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011011011010000000100100 11111111111111111111111111111111100100011000100100 11111111111111111111110111010001000000000000000000 11111111011111111111101010110100010000010000100000 11111111111010000101000000000000100000000000000100 11111101101011000010001000000000000000000000000000 11010000111100010001010000010000000000100100000000 loss: 0.001304, lagrangian_loss: -0.000119, attention_score_distillation_loss: 0.000558 ETA: 0:51:30 | Epoch 107 finished. Took 33.1 seconds. loss: 0.001187, lagrangian_loss: 0.000011, attention_score_distillation_loss: 0.000554 ---------------------------------------------------------------------- time: 2023-07-19 15:24:34 Evaluating: f1: 0.8827, eval_loss: 0.6476, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4887, expected_sparsity: 0.4834, expected_sequence_sparsity: 0.7887, target_sparsity: 0.4835, step: 12450 lambda_1: -0.0067, lambda_2: 44.8007 lambda_3: 0.0000 train remain: [0.99 1. 1. 
0.71 0.77 0.56 0.58 0.33 0.26 0.27] infer remain: [1.0, 1.0, 1.0, 0.7, 0.76, 0.54, 0.56, 0.32, 0.26, 0.26] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.7, 0.53, 0.29, 0.16, 0.05, 0.01, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011011011010000000100100 11111111111111111111111111111111100100011000100100 11111111111111111111110111010001000000000000000000 11111111011111111111101010110100010000010000100000 11111111111010000101000000000000100000000000000100 11111101101011000010001000000000000000000000000000 11010000111100010001010000010000000000100100000000 loss: 0.001666, lagrangian_loss: 0.000056, attention_score_distillation_loss: 0.000552 loss: 0.001204, lagrangian_loss: 0.000787, attention_score_distillation_loss: 0.000549 ---------------------------------------------------------------------- time: 2023-07-19 15:24:49 Evaluating: f1: 0.8846, eval_loss: 0.6177, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4887, expected_sparsity: 0.4834, expected_sequence_sparsity: 0.7887, target_sparsity: 0.4854, step: 12500 lambda_1: -0.5828, lambda_2: 45.2817 lambda_3: 0.0000 train remain: [0.99 1. 1. 0.71 0.77 0.56 0.58 0.33 0.26 0.27] infer remain: [1.0, 1.0, 1.0, 0.7, 0.76, 0.54, 0.56, 0.32, 0.26, 0.26] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.7, 0.53, 0.29, 0.16, 0.05, 0.01, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011011011010000000100100 11111111111111111111111111111111100100011000100100 11111111111111111111110111010001000000000000000000 11111111011111111111101010110100010000010000100000 11111111111010000101000000000000100000000000000100 11111101101011000010001000000000000000000000000000 11010000111100010001010000010000000000100100000000 loss: 0.000947, lagrangian_loss: 0.001835, attention_score_distillation_loss: 0.000546 loss: 0.000761, lagrangian_loss: 0.002188, attention_score_distillation_loss: 0.000543 ETA: 0:50:56 | Epoch 108 finished. Took 33.2 seconds. ---------------------------------------------------------------------- time: 2023-07-19 15:25:04 Evaluating: f1: 0.881, eval_loss: 0.6259, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4919, expected_sparsity: 0.4875, expected_sequence_sparsity: 0.7904, target_sparsity: 0.4874, step: 12550 lambda_1: -1.1474, lambda_2: 45.7156 lambda_3: 0.0000 train remain: [0.99 1. 1. 
0.7 0.76 0.55 0.58 0.33 0.26 0.27] infer remain: [1.0, 1.0, 1.0, 0.68, 0.76, 0.54, 0.56, 0.32, 0.26, 0.26] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.68, 0.52, 0.28, 0.16, 0.05, 0.01, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011011011010000000000100 11111111111111111111111111111111100100011000100100 11111111111111111111110111010001000000000000000000 11111111011111111111101010110100010000010000100000 11111111111010000101000000000000100000000000000100 11111101101011000010001000000000000000000000000000 11010000111100010001010000010000000000100100000000 loss: 0.000684, lagrangian_loss: 0.001118, attention_score_distillation_loss: 0.000540 loss: 0.001602, lagrangian_loss: 0.000035, attention_score_distillation_loss: 0.000537 ---------------------------------------------------------------------- time: 2023-07-19 15:25:19 Evaluating: f1: 0.8808, eval_loss: 0.6777, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4919, expected_sparsity: 0.4876, expected_sequence_sparsity: 0.7905, target_sparsity: 0.4893, step: 12600 lambda_1: -1.3575, lambda_2: 45.8174 lambda_3: 0.0000 train remain: [0.99 1. 1. 0.69 0.76 0.55 0.57 0.32 0.25 0.26] infer remain: [1.0, 1.0, 1.0, 0.68, 0.76, 0.54, 0.56, 0.32, 0.24, 0.26] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.68, 0.52, 0.28, 0.16, 0.05, 0.01, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011011011010000000000100 11111111111111111111111111111111100100011000100100 11111111111111111111110111010001000000000000000000 11111111011111111111101010110100010000010000100000 11111111111010000101000000000000100000000000000100 11111101100011000010001000000000000000000000000000 11010000111100010001010000010000000000100100000000 loss: 0.001309, lagrangian_loss: -0.000216, attention_score_distillation_loss: 0.000534 loss: 0.001416, lagrangian_loss: -0.000388, attention_score_distillation_loss: 0.000531 ---------------------------------------------------------------------- time: 2023-07-19 15:25:33 Evaluating: f1: 0.884, eval_loss: 0.6849, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4965, expected_sparsity: 0.4913, expected_sequence_sparsity: 0.792, target_sparsity: 0.4913, step: 12650 lambda_1: -1.2357, lambda_2: 45.8676 lambda_3: 0.0000 train remain: [0.99 1. 1. 0.69 0.75 0.54 0.57 0.32 0.24 0.26] infer remain: [1.0, 1.0, 1.0, 0.68, 0.74, 0.52, 0.56, 0.32, 0.24, 0.26] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.68, 0.5, 0.26, 0.15, 0.05, 0.01, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011011011010000000000100 11111111111111111111111111111111100100011000000100 11111111111111111101110111010001000000000000000000 11111111011111111111101010110100010000010000100000 11111111111010000101000000000000100000000000000100 11111101100011000010001000000000000000000000000000 11010000111100010001010000010000000000100100000000 ETA: 0:50:24 | Epoch 109 finished. Took 35.24 seconds. 
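Since every evaluation record follows the same "Evaluating: ..." pattern, the f1 / sparsity trajectory can be pulled out of this log with a short regex pass. A minimal sketch; the file name is a placeholder for wherever this output was saved:

import re

pattern = re.compile(
    r"Evaluating: f1: (?P<f1>[\d.]+), eval_loss: [\d.]+.*?"
    r"expected_sparsity: (?P<sparsity>[\d.]+),.*?step: (?P<step>\d+)",
    re.S,
)

with open("train.log") as f:  # placeholder path
    text = f.read()

for m in pattern.finditer(text):
    print(int(m.group("step")), float(m.group("f1")), float(m.group("sparsity")))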
loss: 0.002827, lagrangian_loss: -0.001490, attention_score_distillation_loss: 0.000529 loss: 0.002066, lagrangian_loss: -0.000448, attention_score_distillation_loss: 0.000526 ---------------------------------------------------------------------- time: 2023-07-19 15:25:48 Evaluating: f1: 0.8582, eval_loss: 0.6807, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4965, expected_sparsity: 0.4913, expected_sequence_sparsity: 0.792, target_sparsity: 0.4932, step: 12700 lambda_1: -1.0027, lambda_2: 45.9659 lambda_3: 0.0000 train remain: [0.99 1. 0.99 0.68 0.75 0.53 0.57 0.32 0.24 0.26] infer remain: [1.0, 1.0, 1.0, 0.68, 0.74, 0.52, 0.56, 0.32, 0.24, 0.26] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.68, 0.5, 0.26, 0.15, 0.05, 0.01, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011011011010000000000100 11111111111111111111111111111111100100011000000100 11111111111111111101110111010001000000000000000000 11111111011111111111101010110100010000010000100000 11111111111010000101000000000000100000000000000100 11111101100011000010001000000000000000000000000000 11010000111100010001010000010000000000100100000000 loss: 0.001403, lagrangian_loss: -0.000424, attention_score_distillation_loss: 0.000523 loss: 0.002676, lagrangian_loss: -0.000550, attention_score_distillation_loss: 0.000522 ---------------------------------------------------------------------- time: 2023-07-19 15:26:03 Evaluating: f1: 0.8655, eval_loss: 0.6174, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4965, expected_sparsity: 0.4917, expected_sequence_sparsity: 0.7921, target_sparsity: 0.4951, step: 12750 lambda_1: -0.7413, lambda_2: 46.0778 lambda_3: 0.0000 train remain: [0.99 1. 0.99 0.68 0.74 0.53 0.57 0.31 0.23 0.26] infer remain: [1.0, 1.0, 1.0, 0.68, 0.74, 0.52, 0.56, 0.3, 0.22, 0.26] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.68, 0.5, 0.26, 0.15, 0.04, 0.01, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011011011010000000000100 11111111111111111111111111111101100100011000100100 11111111111111111101110111010001000000000000000000 11111111011111111111101010110100010000010000100000 11111111110010000101000000000000100000000000000100 11111101000011000010001000000000000000000000000000 11010000111100010001010000010000000000100100000000 loss: 0.003527, lagrangian_loss: -0.000618, attention_score_distillation_loss: 0.000517 ETA: 0:49:50 | Epoch 110 finished. Took 33.14 seconds. loss: 0.000946, lagrangian_loss: -0.000536, attention_score_distillation_loss: 0.000514 ---------------------------------------------------------------------- time: 2023-07-19 15:26:17 Evaluating: f1: 0.8897, eval_loss: 0.6281, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4965, expected_sparsity: 0.4917, expected_sequence_sparsity: 0.7921, target_sparsity: 0.4971, step: 12800 lambda_1: -0.5699, lambda_2: 46.1503 lambda_3: 0.0000 train remain: [0.99 1. 
0.99 0.68 0.74 0.53 0.57 0.31 0.23 0.25] infer remain: [1.0, 1.0, 1.0, 0.68, 0.74, 0.52, 0.56, 0.3, 0.22, 0.24] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.68, 0.5, 0.26, 0.15, 0.04, 0.01, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011011011010000000000100 11111111111111111111111111111101100100011000100100 11111111111111111101110111010001000000000000000000 11111111011111111111101010110100010000010000100000 11111111110010000101000000000000100000000000000100 11111101000011000010001000000000000000000000000000 10010000111100010001010000010000000000100100000000 loss: 0.000912, lagrangian_loss: -0.000081, attention_score_distillation_loss: 0.000512 loss: 0.011712, lagrangian_loss: 0.000002, attention_score_distillation_loss: 0.000510 ---------------------------------------------------------------------- time: 2023-07-19 15:26:32 Evaluating: f1: 0.8847, eval_loss: 0.6734, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4981, expected_sparsity: 0.4938, expected_sequence_sparsity: 0.793, target_sparsity: 0.499, step: 12850 lambda_1: -0.4970, lambda_2: 46.1914 lambda_3: 0.0000 train remain: [0.99 1. 0.99 0.68 0.73 0.52 0.57 0.31 0.22 0.25] infer remain: [1.0, 1.0, 1.0, 0.68, 0.72, 0.52, 0.56, 0.3, 0.22, 0.24] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.68, 0.49, 0.25, 0.14, 0.04, 0.01, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011011011010000000000100 11111111111111111111111111101101100100011000100100 11111111111111111101110111010001000000000000000000 11111111011111111111101010110100010000010000100000 11111111110010000101000000000000100000000000000100 11111101000011000010001000000000000000000000000000 10010000111100010001010000010000000000100100000000 loss: 0.001055, lagrangian_loss: -0.000098, attention_score_distillation_loss: 0.000506 loss: 0.001179, lagrangian_loss: 0.000033, attention_score_distillation_loss: 0.000503 ETA: 0:49:16 | Epoch 111 finished. Took 33.24 seconds. ---------------------------------------------------------------------- time: 2023-07-19 15:26:47 Evaluating: f1: 0.8722, eval_loss: 0.7173, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4981, expected_sparsity: 0.4938, expected_sequence_sparsity: 0.793, target_sparsity: 0.501, step: 12900 lambda_1: -0.5520, lambda_2: 46.2418 lambda_3: 0.0000 train remain: [0.99 1. 
0.99 0.68 0.73 0.52 0.56 0.31 0.22 0.24] infer remain: [1.0, 1.0, 1.0, 0.68, 0.72, 0.52, 0.56, 0.3, 0.22, 0.24] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.68, 0.49, 0.25, 0.14, 0.04, 0.01, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011011011010000000000100 11111111111111111111111111101101100100011000100100 11111111111111111101110111010001000000000000000000 11111111011111111111101010110100010000010000100000 11111111110010000101000000000000100000000000000100 11111101000011000010001000000000000000000000000000 10010000111100010001010000010000000000100100000000 loss: 0.001254, lagrangian_loss: 0.000358, attention_score_distillation_loss: 0.000502 loss: 0.000788, lagrangian_loss: 0.000008, attention_score_distillation_loss: 0.000497 ---------------------------------------------------------------------- time: 2023-07-19 15:27:01 Evaluating: f1: 0.8676, eval_loss: 0.6349, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4981, expected_sparsity: 0.4939, expected_sequence_sparsity: 0.7931, target_sparsity: 0.5029, step: 12950 lambda_1: -0.6663, lambda_2: 46.2947 lambda_3: 0.0000 train remain: [0.99 1. 0.99 0.68 0.72 0.52 0.56 0.31 0.21 0.24] infer remain: [1.0, 1.0, 1.0, 0.68, 0.72, 0.52, 0.56, 0.3, 0.2, 0.24] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.68, 0.49, 0.25, 0.14, 0.04, 0.01, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111011011011010000000000100 11111111111111111111111111101101100100011000100100 11111111111111111101110111010001000000000000000000 11111111011111111111101010110100010000010000100000 11111111110010000101000000000000100000000000000100 11011101000011000010001000000000000000000000000000 10010000111100010001010000010000000000100100000000 loss: 0.139431, lagrangian_loss: 0.000308, attention_score_distillation_loss: 0.000495 loss: 0.001922, lagrangian_loss: -0.000667, attention_score_distillation_loss: 0.000491 ETA: 0:48:42 | Epoch 112 finished. Took 33.2 seconds. ---------------------------------------------------------------------- time: 2023-07-19 15:27:16 Evaluating: f1: 0.8746, eval_loss: 0.6465, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5075, expected_sparsity: 0.5012, expected_sequence_sparsity: 0.796, target_sparsity: 0.5048, step: 13000 lambda_1: -0.6079, lambda_2: 46.3424 lambda_3: 0.0000 train remain: [0.99 1. 
0.98 0.67 0.71 0.51 0.56 0.31 0.21 0.24] infer remain: [1.0, 1.0, 1.0, 0.66, 0.7, 0.5, 0.56, 0.3, 0.2, 0.24] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.66, 0.46, 0.23, 0.13, 0.04, 0.01, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111011111111111111011011011010000000000100 11111111111111111111111111101101100100011000000100 11111111111111111101110111010000000000000000000000 11111111011111111111101010110100010000010000100000 11111111110010000101000000000000100000000000000100 11011101000011000010001000000000000000000000000000 10010000111100010001010000010000000000100100000000 loss: 0.001217, lagrangian_loss: -0.000257, attention_score_distillation_loss: 0.000489 loss: 0.002290, lagrangian_loss: -0.000501, attention_score_distillation_loss: 0.000486 ---------------------------------------------------------------------- time: 2023-07-19 15:27:31 Evaluating: f1: 0.872, eval_loss: 0.6937, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5075, expected_sparsity: 0.5012, expected_sequence_sparsity: 0.796, target_sparsity: 0.5068, step: 13050 lambda_1: -0.4029, lambda_2: 46.4198 lambda_3: 0.0000 train remain: [0.99 1. 0.98 0.67 0.71 0.51 0.56 0.31 0.21 0.24] infer remain: [1.0, 1.0, 1.0, 0.66, 0.7, 0.5, 0.56, 0.3, 0.2, 0.24] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.66, 0.46, 0.23, 0.13, 0.04, 0.01, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111011111111111111011011011010000000000100 11111111111111111111111111101101100100001000100100 11111111111111111101110111010000000000000000000000 11111111011111111111101010110100010000010000100000 11111111110010000101000000000000100000000000000100 11011101000011000010001000000000000000000000000000 11010000111100010001000000010000000000100100000000 loss: 0.001181, lagrangian_loss: -0.000264, attention_score_distillation_loss: 0.000483 loss: 0.009199, lagrangian_loss: 0.000112, attention_score_distillation_loss: 0.000481 ---------------------------------------------------------------------- time: 2023-07-19 15:27:46 Evaluating: f1: 0.8778, eval_loss: 0.6688, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5075, expected_sparsity: 0.5012, expected_sequence_sparsity: 0.796, target_sparsity: 0.5087, step: 13100 lambda_1: -0.2581, lambda_2: 46.4875 lambda_3: 0.0000 train remain: [0.99 1. 0.97 0.67 0.71 0.51 0.56 0.31 0.21 0.24] infer remain: [1.0, 1.0, 1.0, 0.66, 0.7, 0.5, 0.56, 0.3, 0.2, 0.24] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.66, 0.46, 0.23, 0.13, 0.04, 0.01, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111011111111111111011011011010000000000100 11111111111111111111111111101101100100001000100100 11111111111111111101110111010000000000000000000000 11111111011111111111101010110100010000010000100000 11111111110010000101000000000000100000000000000100 11011101000011000010001000000000000000000000000000 11010000111100010001000000010000000000100100000000 loss: 0.001172, lagrangian_loss: -0.000077, attention_score_distillation_loss: 0.000477 ETA: 0:48:10 | Epoch 113 finished. Took 35.43 seconds. 
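Note: the "layerwise remain" vector printed at each evaluation is consistent with being the running product of the per-layer "infer remain" fractions, padded with leading 1.0 entries for the unpruned front of the network: a token only survives into layer i if it survived every earlier pruning location. A minimal sketch of that bookkeeping (variable names assumed, not taken from the training script), using the step-13100 values above:

# --- illustrative sketch, not part of the training log ---
import numpy as np

# "infer remain" at step 13100: fraction of tokens kept at each of the
# 10 pruning locations.
infer_remain = [1.0, 1.0, 1.0, 0.66, 0.7, 0.5, 0.56, 0.3, 0.2, 0.24]

# "layerwise remain" looks like the cumulative product of these fractions,
# with two leading 1.0 entries for the unpruned embedding/input stages.
layerwise_remain = [1.0, 1.0] + np.round(np.cumprod(infer_remain), 2).tolist()

print(layerwise_remain)
# [1.0, 1.0, 1.0, 1.0, 1.0, 0.66, 0.46, 0.23, 0.13, 0.04, 0.01, 0.0]  (as logged)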
loss: 0.000773, lagrangian_loss: -0.000112, attention_score_distillation_loss: 0.000474 ---------------------------------------------------------------------- time: 2023-07-19 15:28:00 Evaluating: f1: 0.8545, eval_loss: 0.6922, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5075, expected_sparsity: 0.5012, expected_sequence_sparsity: 0.796, target_sparsity: 0.5107, step: 13150 lambda_1: -0.3089, lambda_2: 46.5277 lambda_3: 0.0000 train remain: [0.99 1. 0.97 0.67 0.7 0.51 0.56 0.31 0.21 0.24] infer remain: [1.0, 1.0, 1.0, 0.66, 0.7, 0.5, 0.56, 0.3, 0.2, 0.24] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.66, 0.46, 0.23, 0.13, 0.04, 0.01, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111011111111111111011011011010000000000100 11111111111111111111111111101101100100001000100100 11111111111111111101110111010000000000000000000000 11111111011111111111101010110100010000010000100000 11111111110010000101000000000000100000000000000100 11011101000011000010001000000000000000000000000000 11010000111100010001000000010000000000100100000000 loss: 0.003223, lagrangian_loss: -0.000147, attention_score_distillation_loss: 0.000474 loss: 0.001386, lagrangian_loss: -0.000045, attention_score_distillation_loss: 0.000470 ---------------------------------------------------------------------- time: 2023-07-19 15:28:15 Evaluating: f1: 0.8789, eval_loss: 0.7521, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5075, expected_sparsity: 0.5012, expected_sequence_sparsity: 0.796, target_sparsity: 0.5126, step: 13200 lambda_1: -0.3590, lambda_2: 46.5802 lambda_3: 0.0000 train remain: [0.99 1. 0.96 0.67 0.7 0.5 0.56 0.31 0.21 0.24] infer remain: [1.0, 1.0, 1.0, 0.66, 0.7, 0.5, 0.56, 0.3, 0.2, 0.24] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.66, 0.46, 0.23, 0.13, 0.04, 0.01, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111101111111011011011010000000000100 11111111111111111111111111101101100100001000100100 11111111111111111101110111010000000000000000000000 11111111011111111111101010110100010000010000100000 11111111110010000101000000000000100000000000000100 11011101000011000010001000000000000000000000000000 11010000111100010001000000010000000000100100000000 loss: 0.005307, lagrangian_loss: -0.000204, attention_score_distillation_loss: 0.000467 ETA: 0:47:36 | Epoch 114 finished. Took 33.34 seconds. loss: 0.001746, lagrangian_loss: -0.000434, attention_score_distillation_loss: 0.000463 ---------------------------------------------------------------------- time: 2023-07-19 15:28:30 Evaluating: f1: 0.8779, eval_loss: 0.6287, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5075, expected_sparsity: 0.5012, expected_sequence_sparsity: 0.796, target_sparsity: 0.5146, step: 13250 lambda_1: -0.1232, lambda_2: 46.7281 lambda_3: 0.0000 train remain: [0.99 1. 
0.95 0.67 0.7 0.5 0.56 0.31 0.2 0.24] infer remain: [1.0, 1.0, 1.0, 0.66, 0.7, 0.5, 0.56, 0.3, 0.2, 0.24] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.66, 0.46, 0.23, 0.13, 0.04, 0.01, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111101111111011011011010000000000100 11111111111111111111111111101101100100001000100100 11111111111111111101110111010000000000000000000000 11111111011111111111101010110100010000010000100000 11111111110010000101000000000000100000000000000100 11011101000011000010001000000000000000000000000000 11010000111100010001000000010000000000100100000000 loss: 0.001708, lagrangian_loss: -0.000071, attention_score_distillation_loss: 0.000460 loss: 0.001806, lagrangian_loss: 0.000028, attention_score_distillation_loss: 0.000457 ---------------------------------------------------------------------- time: 2023-07-19 15:28:45 Evaluating: f1: 0.8545, eval_loss: 0.6745, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5075, expected_sparsity: 0.5012, expected_sequence_sparsity: 0.796, target_sparsity: 0.5165, step: 13300 lambda_1: -0.0444, lambda_2: 46.8289 lambda_3: 0.0000 train remain: [0.99 1. 0.95 0.67 0.7 0.5 0.56 0.31 0.2 0.24] infer remain: [1.0, 1.0, 1.0, 0.66, 0.7, 0.5, 0.56, 0.3, 0.2, 0.24] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.66, 0.46, 0.23, 0.13, 0.04, 0.01, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111101111111011011011010000000000100 11111111111111111111111111101101100100001000100100 11111111111111111101110111010000000000000000000000 11111111011111111111101010110100010000010000100000 11111111110010000101000000000000100000000000000100 11011101000011000010001000000000000000000000000000 11010000111100010001000000010000000000100100000000 loss: 0.003084, lagrangian_loss: 0.000052, attention_score_distillation_loss: 0.000454 loss: 0.003181, lagrangian_loss: 0.000640, attention_score_distillation_loss: 0.000451 ETA: 0:47:02 | Epoch 115 finished. Took 33.23 seconds. ---------------------------------------------------------------------- time: 2023-07-19 15:29:00 Evaluating: f1: 0.8657, eval_loss: 0.6823, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5075, expected_sparsity: 0.5032, expected_sequence_sparsity: 0.7969, target_sparsity: 0.5184, step: 13350 lambda_1: -0.7694, lambda_2: 47.5216 lambda_3: 0.0000 train remain: [0.99 1. 
0.95 0.67 0.69 0.5 0.56 0.31 0.2 0.24] infer remain: [1.0, 1.0, 1.0, 0.66, 0.68, 0.5, 0.56, 0.3, 0.2, 0.24] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.66, 0.45, 0.22, 0.13, 0.04, 0.01, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111101111111011011011010000000000100 11111111111111111111111111101101100100000000100100 11111111111111111101110111010000000000000000000000 11111111011111111111101010110100010000010000100000 11111111110010000101000000000000100000000000000100 11011101000011000010001000000000000000000000000000 11010000111100010001000000010000000000100100000000 loss: 0.007021, lagrangian_loss: 0.002050, attention_score_distillation_loss: 0.000448 loss: 0.001424, lagrangian_loss: 0.000735, attention_score_distillation_loss: 0.000447 ---------------------------------------------------------------------- time: 2023-07-19 15:29:14 Evaluating: f1: 0.8545, eval_loss: 0.6907, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5075, expected_sparsity: 0.5045, expected_sequence_sparsity: 0.7974, target_sparsity: 0.5204, step: 13400 lambda_1: -1.0465, lambda_2: 47.9125 lambda_3: 0.0000 train remain: [0.99 1. 0.93 0.67 0.68 0.49 0.56 0.31 0.2 0.24] infer remain: [1.0, 1.0, 1.0, 0.66, 0.68, 0.48, 0.56, 0.3, 0.2, 0.24] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.66, 0.45, 0.22, 0.12, 0.04, 0.01, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111101111111011011011010000000000100 11111111111111111111111111101101100100000000100100 11111111111111111101110110010000000000000000000000 11111111011111111111101010110100010000010000100000 11111111110010000101000000000000100000000000000100 11011101000011000010001000000000000000000000000000 11010000111100010001000000010000000000100100000000 loss: 0.002116, lagrangian_loss: -0.002563, attention_score_distillation_loss: 0.000443 loss: 0.142616, lagrangian_loss: -0.001278, attention_score_distillation_loss: 0.000440 ---------------------------------------------------------------------- time: 2023-07-19 15:29:29 Evaluating: f1: 0.866, eval_loss: 0.7454, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5075, expected_sparsity: 0.5045, expected_sequence_sparsity: 0.7974, target_sparsity: 0.5223, step: 13450 lambda_1: -0.0796, lambda_2: 49.0226 lambda_3: 0.0000 train remain: [0.99 1. 0.92 0.67 0.68 0.49 0.56 0.3 0.2 0.24] infer remain: [1.0, 1.0, 1.0, 0.66, 0.68, 0.48, 0.56, 0.3, 0.2, 0.24] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.66, 0.45, 0.22, 0.12, 0.04, 0.01, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111101111111011011011010000000000100 11111111111111111111111111101101100100000000100100 11111111111111111101110110010000000000000000000000 11111111011111111111101010110100010000010000100000 11111111110010000101000000000000100000000000000100 11011101000011000010001000000000000000000000000000 11010000111100010001000000010000000000100100000000 loss: 0.052311, lagrangian_loss: 0.000184, attention_score_distillation_loss: 0.000437 ETA: 0:46:30 | Epoch 116 finished. Took 35.46 seconds. 
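Note: each of the ten 0/1 rows printed after an evaluation appears to be the token keep mask for one pruning location over a 50-token bin (the run name ends in bin50): 1 keeps a token, 0 drops it, and the fraction of 1s in a row matches the corresponding "infer remain" entry (all-ones rows belong to the locations still marked False in token_prune_loc). A small helper to read such a row, with an illustrative mask rather than one copied from the log:

# --- illustrative sketch, not part of the training log ---
def remain_fraction(mask_row: str) -> float:
    """Fraction of tokens kept at one pruning location (1 = keep, 0 = drop)."""
    return round(mask_row.count("1") / len(mask_row), 2)

print(remain_fraction("1" * 50))             # 1.0  -> an unpruned location
print(remain_fraction("1" * 24 + "0" * 26))  # 0.48 -> matches the 0.48 entry at step 13400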
loss: 0.001285, lagrangian_loss: 0.000157, attention_score_distillation_loss: 0.000434 ---------------------------------------------------------------------- time: 2023-07-19 15:29:44 Evaluating: f1: 0.8688, eval_loss: 0.7332, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5075, expected_sparsity: 0.5045, expected_sequence_sparsity: 0.7974, target_sparsity: 0.5243, step: 13500 lambda_1: 0.2251, lambda_2: 49.3891 lambda_3: 0.0000 train remain: [0.99 1. 0.93 0.67 0.68 0.49 0.56 0.31 0.2 0.24] infer remain: [1.0, 1.0, 1.0, 0.66, 0.68, 0.48, 0.56, 0.3, 0.2, 0.24] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.66, 0.45, 0.22, 0.12, 0.04, 0.01, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111101111111011011011010000000000100 11111111111111111111111111101101100100000000100100 11111111111111111101110110010000000000000000000000 11111111011111111111101010110100010000010000100000 11111111110010000101000000000000100000000000000100 11011101000011000010001000000000000000000000000000 11010000111100010001000000010000000000100100000000 loss: 0.022130, lagrangian_loss: -0.000210, attention_score_distillation_loss: 0.000432 loss: 0.007977, lagrangian_loss: 0.001646, attention_score_distillation_loss: 0.000428 ---------------------------------------------------------------------- time: 2023-07-19 15:29:58 Evaluating: f1: 0.8662, eval_loss: 0.6472, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5075, expected_sparsity: 0.5046, expected_sequence_sparsity: 0.7974, target_sparsity: 0.5262, step: 13550 lambda_1: -0.7812, lambda_2: 50.6659 lambda_3: 0.0000 train remain: [0.99 1. 0.93 0.67 0.68 0.49 0.56 0.3 0.19 0.24] infer remain: [1.0, 1.0, 1.0, 0.66, 0.68, 0.48, 0.56, 0.3, 0.18, 0.24] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.66, 0.45, 0.22, 0.12, 0.04, 0.01, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111101111111011011011010000000000100 11111111111111111111111111101101100100000000100100 11111111111111111101110110010000000000000000000000 11111111011111111111101010110100010000000001100000 11111111110010000101000000000000100000000000000100 11011101000011000000001000000000000000000000000000 11010000111100010001000000010000000000100100000000 loss: 0.005489, lagrangian_loss: 0.003386, attention_score_distillation_loss: 0.000425 ETA: 0:45:56 | Epoch 117 finished. Took 33.19 seconds. loss: 0.001774, lagrangian_loss: 0.004282, attention_score_distillation_loss: 0.000423 ---------------------------------------------------------------------- time: 2023-07-19 15:30:13 Evaluating: f1: 0.8617, eval_loss: 0.7033, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5485, expected_sparsity: 0.543, expected_sequence_sparsity: 0.8132, target_sparsity: 0.5282, step: 13600 lambda_1: -1.6538, lambda_2: 51.5892 lambda_3: 0.0000 train remain: [0.99 1. 
0.91 0.67 0.67 0.49 0.56 0.3 0.19 0.24] infer remain: [1.0, 1.0, 0.82, 0.66, 0.66, 0.48, 0.56, 0.3, 0.18, 0.24] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.82, 0.54, 0.36, 0.17, 0.1, 0.03, 0.01, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111011111111111111011111101110101010011110 11111111111111111101111111011011011010000000000100 11111111111111111111111111101101100100000000000100 11111111111111111101110110010000000000000000000000 11111111011111111111101010110101010000000000100000 11111111111010000101000000000000100000000000000000 11011101000011000000001000000000000000000000000000 11010000111100010001000000010000000000100100000000 loss: 0.005384, lagrangian_loss: -0.001517, attention_score_distillation_loss: 0.000420 loss: 0.003460, lagrangian_loss: -0.005222, attention_score_distillation_loss: 0.000417 ---------------------------------------------------------------------- time: 2023-07-19 15:30:28 Evaluating: f1: 0.8714, eval_loss: 0.7153, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5485, expected_sparsity: 0.543, expected_sequence_sparsity: 0.8132, target_sparsity: 0.5301, step: 13650 lambda_1: -1.1333, lambda_2: 52.2577 lambda_3: 0.0000 train remain: [0.99 1. 0.88 0.66 0.67 0.48 0.56 0.3 0.18 0.24] infer remain: [1.0, 1.0, 0.82, 0.66, 0.66, 0.48, 0.56, 0.3, 0.18, 0.24] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.82, 0.54, 0.36, 0.17, 0.1, 0.03, 0.01, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111011111111111111011111101110101010011110 11111111111111111111111111010011011010000000000100 11111111111111111111111111101101100100000000000100 11111111111111111101110110010000000000000000000000 11111111011111111111101010110101010000000000100000 11111111111010000101000000000000100000000000000000 11011101000011000000001000000000000000000000000000 11010000111100010001000000010000000000100100000000 loss: 0.038430, lagrangian_loss: -0.004524, attention_score_distillation_loss: 0.000415 loss: 0.006729, lagrangian_loss: -0.000966, attention_score_distillation_loss: 0.000411 ETA: 0:45:22 | Epoch 118 finished. Took 33.31 seconds. ---------------------------------------------------------------------- time: 2023-07-19 15:30:43 Evaluating: f1: 0.8696, eval_loss: 0.6876, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5532, expected_sparsity: 0.547, expected_sequence_sparsity: 0.8148, target_sparsity: 0.532, step: 13700 lambda_1: 0.1499, lambda_2: 54.0245 lambda_3: 0.0000 train remain: [0.99 1. 
0.88 0.66 0.67 0.48 0.56 0.3 0.18 0.24] infer remain: [1.0, 1.0, 0.8, 0.66, 0.66, 0.48, 0.56, 0.3, 0.18, 0.24] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.8, 0.53, 0.35, 0.17, 0.09, 0.03, 0.01, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111011111111011111011111101110101010011110 11111111111111111111111111010011011010000000000100 11111111111111111111111111101101100100000000000100 11111111111111111101110110010000000000000000000000 11111111011111111111101010110101010000000000100000 11111111111010000101000000000000100000000000000000 11011101000011000000001000000000000000000000000000 11010000111100010001000000010000000000100100000000 loss: 0.077044, lagrangian_loss: 0.001325, attention_score_distillation_loss: 0.000408 loss: 0.001636, lagrangian_loss: 0.000611, attention_score_distillation_loss: 0.000405 ---------------------------------------------------------------------- time: 2023-07-19 15:30:57 Evaluating: f1: 0.8656, eval_loss: 0.6852, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5485, expected_sparsity: 0.543, expected_sequence_sparsity: 0.8132, target_sparsity: 0.534, step: 13750 lambda_1: 0.5451, lambda_2: 54.4503 lambda_3: 0.0000 train remain: [0.99 1. 0.89 0.67 0.67 0.49 0.56 0.31 0.19 0.24] infer remain: [1.0, 1.0, 0.82, 0.66, 0.66, 0.48, 0.56, 0.3, 0.18, 0.24] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.82, 0.54, 0.36, 0.17, 0.1, 0.03, 0.01, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111011111111111111011111101110101010011110 11111111111111111111111111010011011010000000000100 11111111111111111111111111101101100100000000000100 11111111111111111101110110010000000000000000000000 11111111011111111111101010110100110000000000100000 11111111111010000101000000000000100000000000000000 11011101000011000000001000000000000000000000000000 11010000111100010001000000010000000000100100000000 loss: 0.001351, lagrangian_loss: -0.000863, attention_score_distillation_loss: 0.000403 loss: 0.001897, lagrangian_loss: -0.000047, attention_score_distillation_loss: 0.000399 ---------------------------------------------------------------------- time: 2023-07-19 15:31:12 Evaluating: f1: 0.8776, eval_loss: 0.643, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5485, expected_sparsity: 0.543, expected_sequence_sparsity: 0.8132, target_sparsity: 0.5359, step: 13800 lambda_1: -0.2574, lambda_2: 55.2623 lambda_3: 0.0000 train remain: [0.99 1. 0.89 0.67 0.67 0.49 0.56 0.31 0.19 0.24] infer remain: [1.0, 1.0, 0.82, 0.66, 0.66, 0.48, 0.56, 0.3, 0.18, 0.24] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.82, 0.54, 0.36, 0.17, 0.1, 0.03, 0.01, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111011111111111111011111101110101010011110 11111111111111111111111111010011011010000000000100 11111111111111111111111111101101100100000000000100 11111111111111111101110110010000000000000000000000 11111111011111111111101010110100010000100000100000 11111111111010000101000000000000100000000000000000 11011101000011000000001000000000000000000000000000 11010000111100010001000000010000000000100100000000 ETA: 0:44:50 | Epoch 119 finished. Took 35.4 seconds. 
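Note: target_sparsity rises by roughly 0.002 every 50 steps (0.499 at step 12850, 0.5048 at 13000, 0.532 at 13700), which is consistent with a linear ramp from 0 to the final target of 0.67 (the s0.67 in the run name) over the 150-epoch lagrangian warmup, at roughly 115 optimizer steps per epoch for MRPC with batch size 32. A sketch of such a schedule; the constants and the function name are assumptions, not the repository's code:

# --- illustrative sketch, not part of the training log ---
FINAL_SPARSITY = 0.67        # s0.67 in the run name
WARMUP_STEPS = 150 * 115     # warmup150 epochs x ~115 steps/epoch (3668 examples / batch 32)

def target_sparsity(step: int) -> float:
    return min(FINAL_SPARSITY, FINAL_SPARSITY * step / WARMUP_STEPS)

print(round(target_sparsity(12850), 4))  # ~0.499, as logged at step 12850
print(round(target_sparsity(13700), 4))  # ~0.532, as logged at step 13700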
loss: 0.001656, lagrangian_loss: 0.002132, attention_score_distillation_loss: 0.000397 loss: 0.002094, lagrangian_loss: 0.002678, attention_score_distillation_loss: 0.000394 ---------------------------------------------------------------------- time: 2023-07-19 15:31:27 Evaluating: f1: 0.8714, eval_loss: 0.6674, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5532, expected_sparsity: 0.547, expected_sequence_sparsity: 0.8148, target_sparsity: 0.5379, step: 13850 lambda_1: -0.9896, lambda_2: 55.9355 lambda_3: 0.0000 train remain: [0.99 1. 0.88 0.66 0.67 0.48 0.56 0.3 0.18 0.24] infer remain: [1.0, 1.0, 0.8, 0.66, 0.66, 0.48, 0.56, 0.3, 0.18, 0.24] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.8, 0.53, 0.35, 0.17, 0.09, 0.03, 0.01, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111011111111011111011111101110101010011110 11111111111111111111111111010011011010000000000100 11111111111111111111111111101101100100000000000100 11111111111111111101110110010000000000000000000000 11111111011111111111101010110100010000100000100000 11111111111010000101000000000000100000000000000000 11011101000011000000001000000000000000000000000000 11010000111100010001000000010000000000100100000000 loss: 0.203099, lagrangian_loss: 0.000998, attention_score_distillation_loss: 0.000392 loss: 0.003027, lagrangian_loss: -0.000615, attention_score_distillation_loss: 0.000388 ---------------------------------------------------------------------- time: 2023-07-19 15:31:42 Evaluating: f1: 0.878, eval_loss: 0.6677, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5532, expected_sparsity: 0.5471, expected_sequence_sparsity: 0.8148, target_sparsity: 0.5398, step: 13900 lambda_1: -1.0186, lambda_2: 56.0066 lambda_3: 0.0000 train remain: [0.98 1. 0.86 0.66 0.66 0.48 0.56 0.3 0.17 0.23] infer remain: [1.0, 1.0, 0.8, 0.66, 0.66, 0.48, 0.56, 0.3, 0.16, 0.22] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.8, 0.53, 0.35, 0.17, 0.09, 0.03, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111011111111011111011111101110101010011110 11111111111111111111111111010011011010000000000100 11111111111111111111111111101101100100000000000100 11111111111111111101110110010000000000000000000000 11111111011111111111101010110100010000100000100000 11111111111010000101000000000000100000000000000000 11011101000011000000000000000000000000000000000000 10010000111100010001000000010000000000100100000000 loss: 0.001157, lagrangian_loss: -0.000666, attention_score_distillation_loss: 0.000385 ETA: 0:44:16 | Epoch 120 finished. Took 33.18 seconds. loss: 0.003141, lagrangian_loss: -0.001340, attention_score_distillation_loss: 0.000383 ---------------------------------------------------------------------- time: 2023-07-19 15:31:56 Evaluating: f1: 0.867, eval_loss: 0.6312, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5564, expected_sparsity: 0.55, expected_sequence_sparsity: 0.816, target_sparsity: 0.5417, step: 13950 lambda_1: -0.5718, lambda_2: 56.2652 lambda_3: 0.0000 train remain: [0.98 1. 
0.85 0.66 0.66 0.48 0.56 0.3 0.17 0.23] infer remain: [1.0, 1.0, 0.8, 0.64, 0.66, 0.48, 0.56, 0.3, 0.16, 0.22] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.8, 0.51, 0.34, 0.16, 0.09, 0.03, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111011111111011111011111101110101010011110 11111111111111111101111111010011011010000000000100 11111111111111111111111111101101100100000000000100 11111111111111111101110110010000000000000000000000 11111111011111111111101010110100010000000000100001 11111111111010000101000000000000100000000000000000 11011101000011000000000000000000000000000000000000 10010000111100010001000000010000000000100100000000 loss: 0.002652, lagrangian_loss: -0.000603, attention_score_distillation_loss: 0.000381 loss: 0.001597, lagrangian_loss: -0.000587, attention_score_distillation_loss: 0.000377 ---------------------------------------------------------------------- time: 2023-07-19 15:32:11 Evaluating: f1: 0.8691, eval_loss: 0.694, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5564, expected_sparsity: 0.55, expected_sequence_sparsity: 0.816, target_sparsity: 0.5437, step: 14000 lambda_1: -0.2326, lambda_2: 56.4160 lambda_3: 0.0000 train remain: [0.98 1. 0.85 0.66 0.66 0.48 0.56 0.3 0.17 0.23] infer remain: [1.0, 1.0, 0.8, 0.64, 0.66, 0.48, 0.56, 0.3, 0.16, 0.22] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.8, 0.51, 0.34, 0.16, 0.09, 0.03, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111011111111011111011111101110101010011110 11111111111111111101111111010011011010000000000100 11111111111111111111111111101101100100000000000100 11111111111111111101110110010000000000000000000000 11111111011111111111111010110100010000000000100000 11111111111010000101000000000000100000000000000000 11010101000011000000001000000000000000000000000000 10010000111100010001000000010000000000100100000000 loss: 0.002427, lagrangian_loss: -0.000224, attention_score_distillation_loss: 0.000374 loss: 0.002004, lagrangian_loss: 0.000016, attention_score_distillation_loss: 0.000371 ETA: 0:43:42 | Epoch 121 finished. Took 33.1 seconds. ---------------------------------------------------------------------- time: 2023-07-19 15:32:26 Evaluating: f1: 0.8866, eval_loss: 0.6218, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5564, expected_sparsity: 0.55, expected_sequence_sparsity: 0.816, target_sparsity: 0.5456, step: 14050 lambda_1: -0.1715, lambda_2: 56.4574 lambda_3: 0.0000 train remain: [0.98 1. 
0.85 0.65 0.66 0.48 0.56 0.3 0.17 0.23] infer remain: [1.0, 1.0, 0.8, 0.64, 0.66, 0.48, 0.56, 0.3, 0.16, 0.22] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.8, 0.51, 0.34, 0.16, 0.09, 0.03, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111011111111011111011111101110101010011110 11111111111111111101111111010011011010000000000100 11111111111111111111111111101101100100000000000100 11111111111111111101110110010000000000000000000000 11111111011111111111101010110100010000000000100001 11111111111010000101000000000000100000000000000000 11010101000011000000001000000000000000000000000000 10010000111100010001000000010000000000100100000000 loss: 0.003676, lagrangian_loss: 0.000778, attention_score_distillation_loss: 0.000369 loss: 0.005993, lagrangian_loss: 0.000401, attention_score_distillation_loss: 0.000365 ---------------------------------------------------------------------- time: 2023-07-19 15:32:41 Evaluating: f1: 0.8805, eval_loss: 0.6608, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5564, expected_sparsity: 0.55, expected_sequence_sparsity: 0.816, target_sparsity: 0.5476, step: 14100 lambda_1: -0.4956, lambda_2: 56.6003 lambda_3: 0.0000 train remain: [0.98 1. 0.84 0.65 0.66 0.48 0.56 0.3 0.17 0.23] infer remain: [1.0, 1.0, 0.8, 0.64, 0.66, 0.48, 0.56, 0.3, 0.16, 0.22] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.8, 0.51, 0.34, 0.16, 0.09, 0.03, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111011111111011111011111101110101010011110 11111111111111111101111111010011011010000000000100 11111111111111111111111111101101100100000000000100 11111111111111111101110110010000000000000000000000 11111111011111111111101010110100010000000000100001 11111111111010000101000000000000100000000000000000 11010101000011000000001000000000000000000000000000 10010000111100010001000000010000000000100100000000 loss: 0.001626, lagrangian_loss: 0.001307, attention_score_distillation_loss: 0.000362 loss: 0.003204, lagrangian_loss: 0.001536, attention_score_distillation_loss: 0.000359 ETA: 0:43:08 | Epoch 122 finished. Took 33.16 seconds. ---------------------------------------------------------------------- time: 2023-07-19 15:32:55 Evaluating: f1: 0.8728, eval_loss: 0.6651, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5579, expected_sparsity: 0.5539, expected_sequence_sparsity: 0.8176, target_sparsity: 0.5495, step: 14150 lambda_1: -0.8890, lambda_2: 56.7932 lambda_3: 0.0000 train remain: [0.98 1. 
0.83 0.65 0.66 0.48 0.56 0.3 0.16 0.22] infer remain: [1.0, 1.0, 0.78, 0.64, 0.66, 0.48, 0.56, 0.3, 0.16, 0.22] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.78, 0.5, 0.33, 0.16, 0.09, 0.03, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111011111111011111011110111110101010011100 11111111111111111101111111010011011010000000000100 11111111111111111111111111101101100000000000100100 11111111111111111101110110010000000000000000000000 11111111011111111111101010110100010100000000100000 11111111111010000101000000000000100000000000000000 11010101000011000000001000000000000000000000000000 10010000111100010001000000010000000000100100000000 loss: 0.002449, lagrangian_loss: 0.000538, attention_score_distillation_loss: 0.000357 loss: 0.001125, lagrangian_loss: 0.000175, attention_score_distillation_loss: 0.000354 ---------------------------------------------------------------------- time: 2023-07-19 15:33:10 Evaluating: f1: 0.8779, eval_loss: 0.6466, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5579, expected_sparsity: 0.5539, expected_sequence_sparsity: 0.8176, target_sparsity: 0.5515, step: 14200 lambda_1: -0.9468, lambda_2: 56.8313 lambda_3: 0.0000 train remain: [0.98 1. 0.82 0.65 0.66 0.48 0.56 0.3 0.16 0.22] infer remain: [1.0, 1.0, 0.78, 0.64, 0.66, 0.48, 0.56, 0.3, 0.16, 0.22] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.78, 0.5, 0.33, 0.16, 0.09, 0.03, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111011111111011111011110111110101010011100 11111111111111111101111111010011011010000000000100 11111111111111111111111111101101100000000000100100 11111111111111111101110110010000000000000000000000 11111111011111111111101010110100010100000000100000 11111111110010000101000000000000110000000000000000 11010101000011000000001000000000000000000000000000 10010000111100010001000000010000000000100100000000 loss: 0.151185, lagrangian_loss: -0.001362, attention_score_distillation_loss: 0.000352 loss: 0.002223, lagrangian_loss: -0.000577, attention_score_distillation_loss: 0.000348 ---------------------------------------------------------------------- time: 2023-07-19 15:33:25 Evaluating: f1: 0.8632, eval_loss: 0.703, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5579, expected_sparsity: 0.5539, expected_sequence_sparsity: 0.8176, target_sparsity: 0.5534, step: 14250 lambda_1: -0.6840, lambda_2: 56.9351 lambda_3: 0.0000 train remain: [0.98 1. 0.81 0.65 0.66 0.48 0.56 0.3 0.16 0.22] infer remain: [1.0, 1.0, 0.78, 0.64, 0.66, 0.48, 0.56, 0.3, 0.16, 0.22] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.78, 0.5, 0.33, 0.16, 0.09, 0.03, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111011111111011111011110111110101010011100 11111111111111111101111111010011011010000000000100 11111111111111111111111111101101100000000000100100 11111111111111111101110110010000000000000000000000 11111111011111111111101010110100010100000000100000 11111111110010000101000000000000110000000000000000 11010101000011000000001000000000000000000000000000 10010000111100010001000000010000000000100100000000 loss: 0.001710, lagrangian_loss: -0.000815, attention_score_distillation_loss: 0.000345 ETA: 0:42:36 | Epoch 123 finished. Took 35.46 seconds. 
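Note: the lagrangian_loss values stay small and change sign while lambda_1/lambda_2 drift over training; this matches the usual Lagrangian relaxation used in L0-style structured pruning, where trainable multipliers push the model's expected sparsity toward the scheduled target. A common form of that term is sketched below; it is assumed, not verified against this repository's implementation:

# --- illustrative sketch, not part of the training log ---
def lagrangian_loss(expected_sparsity: float, target_sparsity: float,
                    lambda_1: float, lambda_2: float) -> float:
    # lambda_1 and lambda_2 are trained adversarially (maximised), so the
    # penalty vanishes only when expected sparsity meets the target; the gap
    # changing sign is why the logged values can be negative.
    gap = expected_sparsity - target_sparsity
    return lambda_1 * gap + lambda_2 * gap * gap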
loss: 0.002408, lagrangian_loss: -0.000444, attention_score_distillation_loss: 0.000342 ---------------------------------------------------------------------- time: 2023-07-19 15:33:39 Evaluating: f1: 0.8759, eval_loss: 0.6738, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5642, expected_sparsity: 0.5579, expected_sequence_sparsity: 0.8193, target_sparsity: 0.5553, step: 14300 lambda_1: -0.3638, lambda_2: 57.0628 lambda_3: 0.0000 train remain: [0.98 1. 0.8 0.65 0.65 0.48 0.56 0.3 0.16 0.22] infer remain: [1.0, 1.0, 0.76, 0.64, 0.66, 0.48, 0.56, 0.3, 0.16, 0.22] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.76, 0.49, 0.32, 0.15, 0.09, 0.03, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111011111111011111001110111110101010011100 11111111111111111101111111010011011010000000000100 11111111111111111111111111101101100000000000100100 11111111111111111101110110010000000000000000000000 11111111011111111111101010110100010000000010100000 11111111110010000101000000000000110000000000000000 11010101000011000000001000000000000000000000000000 10010000111100010001000000010000000000100100000000 loss: 0.001342, lagrangian_loss: -0.000423, attention_score_distillation_loss: 0.000339 loss: 0.135495, lagrangian_loss: -0.000284, attention_score_distillation_loss: 0.000337 ---------------------------------------------------------------------- time: 2023-07-19 15:33:54 Evaluating: f1: 0.8832, eval_loss: 0.6172, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5642, expected_sparsity: 0.5579, expected_sequence_sparsity: 0.8193, target_sparsity: 0.5573, step: 14350 lambda_1: -0.2535, lambda_2: 57.0997 lambda_3: 0.0000 train remain: [0.98 1. 0.8 0.65 0.65 0.48 0.56 0.29 0.16 0.22] infer remain: [1.0, 1.0, 0.76, 0.64, 0.66, 0.48, 0.56, 0.3, 0.16, 0.22] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.76, 0.49, 0.32, 0.15, 0.09, 0.03, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111011111111011111001110111110101010011100 11111111111111111101111111010011011010000000000100 11111111111111111111111111101101100000000000100100 11111111111111111101110110010000000000000000000000 11111111011111111111101010110100010000000010100000 11111111110010000101000000000000110000000000000000 11010101000011000000001000000000000000000000000000 10010000111100010001000000010000000000100100000000 loss: 0.129177, lagrangian_loss: -0.000090, attention_score_distillation_loss: 0.000334 ETA: 0:42:02 | Epoch 124 finished. Took 33.17 seconds. loss: 0.002739, lagrangian_loss: 0.000038, attention_score_distillation_loss: 0.000331 ---------------------------------------------------------------------- time: 2023-07-19 15:34:09 Evaluating: f1: 0.875, eval_loss: 0.6671, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5642, expected_sparsity: 0.5595, expected_sequence_sparsity: 0.8199, target_sparsity: 0.5592, step: 14400 lambda_1: -0.4606, lambda_2: 57.1886 lambda_3: 0.0000 train remain: [0.98 1. 
0.8 0.65 0.65 0.48 0.56 0.29 0.15 0.22] infer remain: [1.0, 1.0, 0.76, 0.64, 0.64, 0.48, 0.56, 0.28, 0.16, 0.22] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.76, 0.49, 0.31, 0.15, 0.08, 0.02, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111011111111011111001110111110101010011100 11111111111111111101111111010011011010000000000100 11111111111111111111111111101101100000000000000100 11111111111111111101110110010000000000000000000000 11111111011111111111101010110100010100000000100000 11111111110010000101000000000000100000000000000000 11010101000011000000001000000000000000000000000000 10010000111100010001000000010000000000100100000000 loss: 0.002697, lagrangian_loss: 0.000455, attention_score_distillation_loss: 0.000328 loss: 0.001896, lagrangian_loss: 0.001704, attention_score_distillation_loss: 0.000325 ---------------------------------------------------------------------- time: 2023-07-19 15:34:23 Evaluating: f1: 0.8628, eval_loss: 0.66, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5642, expected_sparsity: 0.5595, expected_sequence_sparsity: 0.8199, target_sparsity: 0.5612, step: 14450 lambda_1: -0.8841, lambda_2: 57.4118 lambda_3: 0.0000 train remain: [0.98 1. 0.79 0.65 0.65 0.47 0.56 0.29 0.15 0.22] infer remain: [1.0, 1.0, 0.76, 0.64, 0.64, 0.48, 0.56, 0.28, 0.14, 0.22] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.76, 0.49, 0.31, 0.15, 0.08, 0.02, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111011111111011111001110111110101010011100 11111111111111111101111111010011011010000000000100 11111111111111111111111111101101100000000000000100 11111111111111111101110110010000000000000000000000 11111111011111111111101010110100010000000000100001 11111111110010000101000000000000100000000000000000 11010101000011000000000000000000000000000000000000 10010000111100010001000000010000000000100100000000 loss: 0.002422, lagrangian_loss: 0.000709, attention_score_distillation_loss: 0.000322 loss: 0.001248, lagrangian_loss: 0.004348, attention_score_distillation_loss: 0.000319 ETA: 0:41:28 | Epoch 125 finished. Took 33.01 seconds. 
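Note: the ETA printed at each epoch boundary is consistent with (epochs left) x (average epoch time), with num_train_epochs=200 and the ~33-35 s epochs shown above. A rough sketch of that arithmetic (the bookkeeping is assumed, not the script's code):

# --- illustrative sketch, not part of the training log ---
import datetime

def eta(num_train_epochs: int, epochs_done: int, avg_epoch_seconds: float) -> str:
    remaining = (num_train_epochs - epochs_done) * avg_epoch_seconds
    return str(datetime.timedelta(seconds=round(remaining)))

print(eta(200, 112, 33.2))  # 0:48:42 -- matches the line after "Epoch 112 finished"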
---------------------------------------------------------------------- time: 2023-07-19 15:34:38 Evaluating: f1: 0.8722, eval_loss: 0.6585, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5689, expected_sparsity: 0.5646, expected_sequence_sparsity: 0.822, target_sparsity: 0.5631, step: 14500 lambda_1: -1.4127, lambda_2: 57.8195 lambda_3: 0.0000 train remain: [0.99 0.99 0.78 0.64 0.65 0.47 0.56 0.29 0.15 0.21] infer remain: [1.0, 1.0, 0.74, 0.64, 0.64, 0.46, 0.54, 0.28, 0.14, 0.22] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.74, 0.47, 0.3, 0.14, 0.08, 0.02, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111011111111011111001110111110101010011000 11111111111111111101111111010011011010000000000100 11111111111111111111111111101101100000000000000100 11111111111111111100110110010000000000000000000000 11111111011111111111101010110100010000000000100000 11111111110010000101000000000000100000000000000000 11010101000011000000000000000000000000000000000000 10010000111100010001000000010000000000100100000000 loss: 0.003709, lagrangian_loss: 0.004897, attention_score_distillation_loss: 0.000316 loss: 0.204541, lagrangian_loss: 0.002620, attention_score_distillation_loss: 0.000313 ---------------------------------------------------------------------- time: 2023-07-19 15:34:53 Evaluating: f1: 0.8734, eval_loss: 0.6232, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5689, expected_sparsity: 0.5646, expected_sequence_sparsity: 0.822, target_sparsity: 0.5651, step: 14550 lambda_1: -1.9687, lambda_2: 58.2209 lambda_3: 0.0000 train remain: [0.98 0.99 0.76 0.64 0.65 0.47 0.55 0.28 0.15 0.2 ] infer remain: [1.0, 1.0, 0.74, 0.64, 0.64, 0.46, 0.54, 0.28, 0.14, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.74, 0.47, 0.3, 0.14, 0.08, 0.02, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111011111111011111001110111110101010011000 11111111111111111101111111010011011010000000000100 11111111111111111111111111101101100000000000000100 11111111111111111100110110010000000000000000000000 11111111011111111111101010110100010000000000100000 11111111110010000101000000000000100000000000000000 11010101000011000000000000000000000000000000000000 10010000011100010001000000010000000000100100000000 loss: 0.002217, lagrangian_loss: 0.001590, attention_score_distillation_loss: 0.000312 loss: 0.001671, lagrangian_loss: -0.001356, attention_score_distillation_loss: 0.000308 ---------------------------------------------------------------------- time: 2023-07-19 15:35:07 Evaluating: f1: 0.8717, eval_loss: 0.6467, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5736, expected_sparsity: 0.5685, expected_sequence_sparsity: 0.8236, target_sparsity: 0.567, step: 14600 lambda_1: -1.8426, lambda_2: 58.3207 lambda_3: 0.0000 train remain: [0.98 0.99 0.74 0.64 0.64 0.47 0.55 0.28 0.14 0.2 ] infer remain: [1.0, 1.0, 0.72, 0.64, 0.64, 0.46, 0.54, 0.28, 0.14, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.72, 0.46, 0.29, 0.14, 0.07, 0.02, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111011111111011111001110111110101010010000 11111111111111111101111111010011011010000000000100 11111111111111111111111111101101100000000000000100 11111111111111111100110110010000000000000000000000 
11111111011111111111101010110100010000000000100000 11111111110010000101000000000000100000000000000000 11010101000011000000000000000000000000000000000000 10010000011100010001000000010000000000100100000000 loss: 0.003029, lagrangian_loss: -0.004543, attention_score_distillation_loss: 0.000305 ETA: 0:40:55 | Epoch 126 finished. Took 35.4 seconds. loss: 0.001292, lagrangian_loss: -0.004322, attention_score_distillation_loss: 0.000302 ---------------------------------------------------------------------- time: 2023-07-19 15:35:22 Evaluating: f1: 0.8743, eval_loss: 0.6541, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5768, expected_sparsity: 0.5725, expected_sequence_sparsity: 0.8253, target_sparsity: 0.5689, step: 14650 lambda_1: -1.0757, lambda_2: 59.0377 lambda_3: 0.0000 train remain: [0.98 0.99 0.73 0.64 0.64 0.47 0.54 0.27 0.14 0.19] infer remain: [1.0, 1.0, 0.7, 0.64, 0.64, 0.46, 0.54, 0.26, 0.14, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.7, 0.45, 0.29, 0.13, 0.07, 0.02, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111011111111011111000110111110101010010000 11111111111111111101111111010011011010000000000100 11111111111111111111111111101101100000000000000100 11111111111111111100110110010000000000000000000000 11111111011111111111101010110100010000000000100000 11111111110010000101000000000000000000000000000000 11010101000011000000000000000000000000000000000000 10010000010100010001000000010000000000100100000001 loss: 0.001743, lagrangian_loss: -0.003369, attention_score_distillation_loss: 0.000299 loss: 0.002059, lagrangian_loss: -0.001476, attention_score_distillation_loss: 0.000296 ---------------------------------------------------------------------- time: 2023-07-19 15:35:37 Evaluating: f1: 0.8848, eval_loss: 0.7084, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5768, expected_sparsity: 0.5725, expected_sequence_sparsity: 0.8253, target_sparsity: 0.5709, step: 14700 lambda_1: -0.2317, lambda_2: 59.8998 lambda_3: 0.0000 train remain: [0.98 0.99 0.73 0.64 0.64 0.47 0.54 0.27 0.14 0.19] infer remain: [1.0, 1.0, 0.7, 0.64, 0.64, 0.46, 0.54, 0.26, 0.14, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.7, 0.45, 0.29, 0.13, 0.07, 0.02, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111011111111011111000110111110101010010000 11111111111111111101111111010011011010000000000100 11111111111111111111111111101101100000000000000100 11111111111111111100110110010000000000000000000000 11111111011111111111101010110100010000000000100000 11111111110010000101000000000000000000000000000000 11010101000011000000000000000000000000000000000000 10010000010100010001000000010000000000100100000001 loss: 0.004131, lagrangian_loss: -0.000114, attention_score_distillation_loss: 0.000293 ETA: 0:40:21 | Epoch 127 finished. Took 33.22 seconds. 
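Note: the f1 values in the Evaluating lines are the standard GLUE metric for MRPC, F1 over the positive (paraphrase) class on the dev set. A minimal illustration of the metric itself with made-up labels and predictions, not the evaluation harness used here:

# --- illustrative sketch, not part of the training log ---
from sklearn.metrics import f1_score

labels = [1, 0, 1, 1, 0, 1]   # made-up gold paraphrase labels
preds  = [1, 0, 1, 0, 0, 1]   # made-up model predictions
print(round(f1_score(labels, preds), 4))  # 0.8571 here; the log hovers around 0.86-0.88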
loss: 0.007172, lagrangian_loss: 0.000010, attention_score_distillation_loss: 0.000291 ---------------------------------------------------------------------- time: 2023-07-19 15:35:52 Evaluating: f1: 0.8866, eval_loss: 0.6446, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5768, expected_sparsity: 0.5725, expected_sequence_sparsity: 0.8253, target_sparsity: 0.5728, step: 14750 lambda_1: 0.1480, lambda_2: 60.1371 lambda_3: 0.0000 train remain: [0.98 0.99 0.73 0.64 0.64 0.47 0.54 0.27 0.14 0.19] infer remain: [1.0, 1.0, 0.7, 0.64, 0.64, 0.46, 0.54, 0.26, 0.14, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.7, 0.45, 0.29, 0.13, 0.07, 0.02, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111011110111011111001110111110101010010000 11111111111111111101111111010011011010000000000100 11111111111111111111111111101101100000000000000100 11111111111111111100110110010000000000000000000000 11111111011111111111101010110100010000000000100000 11111111110010000101000000000000000000000000000000 11010101000011000000000000000000000000000000000000 10010000010100010001000000010000000000100100000001 loss: 0.003271, lagrangian_loss: 0.000047, attention_score_distillation_loss: 0.000288 loss: 0.001387, lagrangian_loss: 0.000064, attention_score_distillation_loss: 0.000285 ---------------------------------------------------------------------- time: 2023-07-19 15:36:06 Evaluating: f1: 0.8429, eval_loss: 0.6964, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5768, expected_sparsity: 0.5725, expected_sequence_sparsity: 0.8253, target_sparsity: 0.5748, step: 14800 lambda_1: -0.1268, lambda_2: 60.3048 lambda_3: 0.0000 train remain: [0.98 0.99 0.73 0.64 0.64 0.47 0.54 0.27 0.14 0.19] infer remain: [1.0, 1.0, 0.7, 0.64, 0.64, 0.46, 0.54, 0.26, 0.14, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.7, 0.45, 0.29, 0.13, 0.07, 0.02, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111011110111011111001110111110101010010000 11111111111111111101111111010011011010000000000100 11111111111111111111111111101101100000000000000100 11111111111111111100110110010000000000000000000000 11111111011111111111101010110100010000000000100000 11111111110010000101000000000000000000000000000000 11010101000011000000000000000000000000000000000000 10010000010100010001000000010000000000100100000001 loss: 0.001661, lagrangian_loss: 0.000699, attention_score_distillation_loss: 0.000282 loss: 0.001800, lagrangian_loss: 0.003002, attention_score_distillation_loss: 0.000279 ETA: 0:39:47 | Epoch 128 finished. Took 33.18 seconds. 
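Note: at inference time a keep mask like the rows above translates into actually dropping the hidden states of pruned tokens before the next layer, which is where the savings reported as macs_sparsity come from. A generic PyTorch sketch under assumed shapes, not this repository's implementation:

# --- illustrative sketch, not part of the training log ---
import torch

def prune_tokens(hidden_states: torch.Tensor, keep_mask: torch.Tensor) -> torch.Tensor:
    # hidden_states: (seq_len, hidden_size); keep_mask: (seq_len,) of 0/1
    keep_idx = keep_mask.nonzero(as_tuple=True)[0]
    return hidden_states.index_select(0, keep_idx)

h = torch.randn(50, 768)                  # one 50-token bin of BERT-base hidden states
mask = torch.tensor([1] * 24 + [0] * 26)  # keep 48% of the tokens
print(prune_tokens(h, mask).shape)        # torch.Size([24, 768])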
---------------------------------------------------------------------- time: 2023-07-19 15:36:21 Evaluating: f1: 0.8722, eval_loss: 0.7274, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5768, expected_sparsity: 0.5725, expected_sequence_sparsity: 0.8253, target_sparsity: 0.5767, step: 14850 lambda_1: -0.7486, lambda_2: 60.8002 lambda_3: 0.0000 train remain: [0.98 0.99 0.73 0.63 0.64 0.47 0.54 0.27 0.14 0.19] infer remain: [1.0, 1.0, 0.7, 0.64, 0.64, 0.46, 0.54, 0.26, 0.14, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.7, 0.45, 0.29, 0.13, 0.07, 0.02, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111011110111011111001110111110101010010000 11111111111111111101111111010011011010000000000100 11111111111111111111111111101101100000000000000100 11111111111111111100110110010000000000000000000000 11111111011111111111101010110100010000000000100000 11111111110010000101000000000000000000000000000000 11010101000011000000000000000000000000000000000000 10010000010100010001000000010000000000100100000000 loss: 0.004895, lagrangian_loss: 0.002281, attention_score_distillation_loss: 0.000276 loss: 0.002205, lagrangian_loss: 0.004023, attention_score_distillation_loss: 0.000273 ---------------------------------------------------------------------- time: 2023-07-19 15:36:36 Evaluating: f1: 0.8702, eval_loss: 0.66, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5799, expected_sparsity: 0.575, expected_sequence_sparsity: 0.8263, target_sparsity: 0.5786, step: 14900 lambda_1: -1.4409, lambda_2: 61.4185 lambda_3: 0.0000 train remain: [0.99 0.99 0.72 0.63 0.64 0.47 0.54 0.27 0.14 0.19] infer remain: [1.0, 1.0, 0.7, 0.62, 0.64, 0.46, 0.54, 0.26, 0.14, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.7, 0.43, 0.28, 0.13, 0.07, 0.02, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111011110111011111001110111110101010010000 11111111111011111101111111010011011010000000000100 11111111111111111111111111101101100000000000000100 11111111111111111100110110010000000000000000000000 11111111011111111111101010110100010000000000100000 11111111110010000101000000000000000000000000000000 11010101000011000000000000000000000000000000000000 10010000010100010001000000010000000000100100000000 loss: 0.001793, lagrangian_loss: 0.005030, attention_score_distillation_loss: 0.000271 loss: 0.001506, lagrangian_loss: 0.004772, attention_score_distillation_loss: 0.000268 ---------------------------------------------------------------------- time: 2023-07-19 15:36:51 Evaluating: f1: 0.8763, eval_loss: 0.6732, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5831, expected_sparsity: 0.5788, expected_sequence_sparsity: 0.8278, target_sparsity: 0.5806, step: 14950 lambda_1: -2.0840, lambda_2: 61.9884 lambda_3: 0.0000 train remain: [0.98 0.99 0.71 0.63 0.64 0.46 0.53 0.26 0.13 0.18] infer remain: [1.0, 1.0, 0.68, 0.62, 0.64, 0.46, 0.54, 0.26, 0.12, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.68, 0.42, 0.27, 0.12, 0.07, 0.02, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111011110111011111000110111110101010010000 11111111111011111101111111010011011010000000000100 11111111111111111111111111101101100000000000000100 11111111111111111100110110010000000000000000000000 
11111111011111111111101010110100010000000000100000 11111111110010000101000000000000000000000000000000 11000101000011000000000000000000000000000000000000 10010000010100010001000000010000000000000100000001 ETA: 0:39:15 | Epoch 129 finished. Took 35.46 seconds. loss: 0.002328, lagrangian_loss: 0.005116, attention_score_distillation_loss: 0.000265 loss: 0.011365, lagrangian_loss: 0.000100, attention_score_distillation_loss: 0.000262 ---------------------------------------------------------------------- time: 2023-07-19 15:37:05 Evaluating: f1: 0.8741, eval_loss: 0.6584, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5831, expected_sparsity: 0.5791, expected_sequence_sparsity: 0.8279, target_sparsity: 0.5825, step: 15000 lambda_1: -2.3410, lambda_2: 62.1159 lambda_3: 0.0000 train remain: [0.98 0.99 0.69 0.63 0.64 0.46 0.53 0.26 0.12 0.18] infer remain: [1.0, 1.0, 0.68, 0.62, 0.64, 0.46, 0.52, 0.26, 0.12, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.68, 0.42, 0.27, 0.12, 0.06, 0.02, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111011110111011111000110111110101010010000 11111111111111111101111101010011011010000000000100 11111111111111111111111111101101100000000000000100 11111111111111111100110110010000000000000000000000 11111111011111111111001010110100010000000000100000 11111111110010000101000000000000000000000000000000 11000101000011000000000000000000000000000000000000 10010000010100010001000000010000000000000100000001 loss: 0.181324, lagrangian_loss: -0.000174, attention_score_distillation_loss: 0.000259 loss: 0.001722, lagrangian_loss: -0.003707, attention_score_distillation_loss: 0.000256 ---------------------------------------------------------------------- time: 2023-07-19 15:37:20 Evaluating: f1: 0.8667, eval_loss: 0.6744, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5862, expected_sparsity: 0.584, expected_sequence_sparsity: 0.83, target_sparsity: 0.5845, step: 15050 lambda_1: -2.1984, lambda_2: 62.1787 lambda_3: 0.0000 train remain: [0.98 0.99 0.68 0.62 0.63 0.46 0.53 0.26 0.12 0.17] infer remain: [1.0, 1.0, 0.66, 0.62, 0.62, 0.46, 0.52, 0.26, 0.12, 0.16] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.66, 0.41, 0.25, 0.12, 0.06, 0.02, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111011110111011111000110101110101010010000 11111111111111111101111101010011011010000000000100 11111111111111111111111111101100100000000000000100 11111111111111111100110110010000000000000000000000 11111111011111111111001010110100010000000000100000 11111111110010000101000000000000000000000000000000 11000101000011000000000000000000000000000000000000 10010000010100010001000000010000000000000100000000 loss: 0.002957, lagrangian_loss: -0.003767, attention_score_distillation_loss: 0.000253 ETA: 0:38:41 | Epoch 130 finished. Took 33.06 seconds. 
loss: 0.000799, lagrangian_loss: -0.002619, attention_score_distillation_loss: 0.000250 ---------------------------------------------------------------------- time: 2023-07-19 15:37:35 Evaluating: f1: 0.88, eval_loss: 0.6488, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5862, expected_sparsity: 0.584, expected_sequence_sparsity: 0.83, target_sparsity: 0.5864, step: 15100 lambda_1: -1.8176, lambda_2: 62.3793 lambda_3: 0.0000 train remain: [0.98 0.99 0.67 0.62 0.63 0.46 0.53 0.25 0.12 0.16] infer remain: [1.0, 1.0, 0.66, 0.62, 0.62, 0.46, 0.52, 0.26, 0.12, 0.16] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.66, 0.41, 0.25, 0.12, 0.06, 0.02, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111011110111011111000110101110101010010000 11111111111111111101111101010011011010000000000100 11111111111111111111111111101100100000000000000100 11111111111111111100110110010000000000000000000000 11111111011111111111001010110100010000000000100000 11111111110010000101000000000000000000000000000000 10000101001011000000000000000000000000000000000000 10010000010100010001000000010000000000000100000000 loss: 0.004209, lagrangian_loss: -0.004118, attention_score_distillation_loss: 0.000247 loss: 0.001818, lagrangian_loss: -0.003735, attention_score_distillation_loss: 0.000245 ---------------------------------------------------------------------- time: 2023-07-19 15:37:49 Evaluating: f1: 0.8702, eval_loss: 0.718, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5878, expected_sparsity: 0.5841, expected_sequence_sparsity: 0.83, target_sparsity: 0.5884, step: 15150 lambda_1: -1.3515, lambda_2: 62.6702 lambda_3: 0.0000 train remain: [0.98 0.99 0.67 0.62 0.63 0.46 0.52 0.25 0.12 0.16] infer remain: [1.0, 1.0, 0.66, 0.62, 0.62, 0.46, 0.52, 0.24, 0.12, 0.16] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.66, 0.41, 0.25, 0.12, 0.06, 0.01, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111011110111011111000110101110101010010000 11111111111111111101111101010011011010000000000100 11111111111111111111111111101100100000000000000100 11111111111111111100110110010000000000000000000000 11111111011111111111001010110100010000000000100000 11111111110010000001000000000000000000000000000000 10000101001011000000000000000000000000000000000000 10010000010100010001000000010000000000000100000000 loss: 0.005516, lagrangian_loss: -0.003339, attention_score_distillation_loss: 0.000242 loss: 0.001899, lagrangian_loss: -0.002277, attention_score_distillation_loss: 0.000239 ETA: 0:38:07 | Epoch 131 finished. Took 33.21 seconds. 
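Note: attention_score_distillation_loss decays slowly over these epochs (about 0.0005 down to 0.0002); it tracks how closely the pruned student's attention follows the dense teacher's. The exact formulation is not shown in the log; a common choice (assumed here) is a KL divergence between teacher and student attention distributions:

# --- illustrative sketch, not part of the training log ---
import torch
import torch.nn.functional as F

def attn_distillation_loss(student_attn: torch.Tensor, teacher_attn: torch.Tensor) -> torch.Tensor:
    # both tensors: (num_heads, seq_len, seq_len) attention probabilities
    return F.kl_div(student_attn.clamp_min(1e-12).log(), teacher_attn,
                    reduction="batchmean")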
---------------------------------------------------------------------- time: 2023-07-19 15:38:04 Evaluating: f1: 0.8652, eval_loss: 0.6903, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5878, expected_sparsity: 0.5841, expected_sequence_sparsity: 0.83, target_sparsity: 0.5903, step: 15200 lambda_1: -0.9081, lambda_2: 62.9441 lambda_3: 0.0000 train remain: [0.98 0.98 0.67 0.62 0.62 0.46 0.52 0.25 0.12 0.16] infer remain: [1.0, 1.0, 0.66, 0.62, 0.62, 0.46, 0.52, 0.24, 0.12, 0.16] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.66, 0.41, 0.25, 0.12, 0.06, 0.01, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111101011110111011111000110111110101010010000 11111111111111111101111101010011011010000000000100 11111111111111111111111111101100100000000000000100 11111111111111111100110110010000000000000000000000 11111111011111111111001010110100010000000000100000 11111111110010000001000000000000000000000000000000 10000101001011000000000000000000000000000000000000 10010000010100010001000000010000000000000100000000 loss: 0.004155, lagrangian_loss: -0.000606, attention_score_distillation_loss: 0.000236 loss: 0.002629, lagrangian_loss: -0.000497, attention_score_distillation_loss: 0.000233 ---------------------------------------------------------------------- time: 2023-07-19 15:38:19 Evaluating: f1: 0.8824, eval_loss: 0.6687, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5925, expected_sparsity: 0.5879, expected_sequence_sparsity: 0.8316, target_sparsity: 0.5922, step: 15250 lambda_1: -0.6723, lambda_2: 63.0385 lambda_3: 0.0000 train remain: [0.98 0.98 0.66 0.62 0.62 0.46 0.52 0.25 0.12 0.15] infer remain: [1.0, 1.0, 0.64, 0.62, 0.62, 0.46, 0.52, 0.24, 0.12, 0.16] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.4, 0.25, 0.11, 0.06, 0.01, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111101011110111011111000110101110101010010000 11111111111111111101111101010011011010000000000100 11111111111111111111111111101100100000000000000100 11111111111111111100110110010000000000000000000000 11111111011111111111001010110100010000000000100000 11111111110010000001000000000000000000000000000000 10000101001011000000000000000000000000000000000000 10010000010100010001000000010000000000000100000000 loss: 0.110417, lagrangian_loss: -0.000627, attention_score_distillation_loss: 0.000230 loss: 0.002001, lagrangian_loss: 0.000122, attention_score_distillation_loss: 0.000228 ETA: 0:37:33 | Epoch 132 finished. Took 33.24 seconds. 
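The ten 50-character bit strings printed after each evaluation look like per-layer token-retention masks over 50 positions (a 1 keeps the position, a 0 prunes it), and the fraction of ones in each row matches the corresponding infer remain entry. The bin interpretation is an assumption, but the ratio check can be run directly on the rows above:

# Third mask row from the step-15250 block (layer index 2).
mask = "11111111101011110111011111000110101110101010010000"
print(len(mask), mask.count("1") / len(mask))  # 50 0.64 -- matches infer remain[2] = 0.64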
---------------------------------------------------------------------- time: 2023-07-19 15:38:33 Evaluating: f1: 0.8744, eval_loss: 0.7117, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5925, expected_sparsity: 0.5879, expected_sequence_sparsity: 0.8316, target_sparsity: 0.5942, step: 15300 lambda_1: -0.6379, lambda_2: 63.0638 lambda_3: 0.0000 train remain: [0.98 0.98 0.66 0.62 0.62 0.46 0.52 0.25 0.12 0.15] infer remain: [1.0, 1.0, 0.64, 0.62, 0.62, 0.46, 0.52, 0.24, 0.12, 0.14] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.4, 0.25, 0.11, 0.06, 0.01, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111101011110111011111000110101110101010010000 11111111111111111101111101010011011010000000000100 11111111111111111111111111101100100000000000000100 11111111111111111100110110010000000000000000000000 11111111011111111111001010110100010000000000100000 11111111110010000001000000000000000000000000000000 10000101001011000000000000000000000000000000000000 10000000010100010001000000010000000000000100000000 loss: 0.002437, lagrangian_loss: 0.000388, attention_score_distillation_loss: 0.000225 loss: 0.003287, lagrangian_loss: 0.000607, attention_score_distillation_loss: 0.000222 ---------------------------------------------------------------------- time: 2023-07-19 15:38:48 Evaluating: f1: 0.8723, eval_loss: 0.6464, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5925, expected_sparsity: 0.5879, expected_sequence_sparsity: 0.8316, target_sparsity: 0.5961, step: 15350 lambda_1: -0.7981, lambda_2: 63.1259 lambda_3: 0.0000 train remain: [0.98 0.98 0.66 0.62 0.62 0.46 0.52 0.25 0.12 0.15] infer remain: [1.0, 1.0, 0.64, 0.62, 0.62, 0.46, 0.52, 0.24, 0.12, 0.14] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.4, 0.25, 0.11, 0.06, 0.01, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111101011110111011111000110101110101010010000 11111111111111111101111101010011011010000000000100 11111111111111111111111111101100100000000000000100 11111111111111111100110110010000000000000000000000 11111111011111111111001010110100010000000000100000 11111111110010000001000000000000000000000000000000 10000101001011000000000000000000000000000000000000 10000000010100010001000000010000000000000100000000 loss: 0.002151, lagrangian_loss: 0.001998, attention_score_distillation_loss: 0.000219 loss: 0.177561, lagrangian_loss: 0.001195, attention_score_distillation_loss: 0.000216 ---------------------------------------------------------------------- time: 2023-07-19 15:39:03 Evaluating: f1: 0.8737, eval_loss: 0.6766, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5925, expected_sparsity: 0.5879, expected_sequence_sparsity: 0.8316, target_sparsity: 0.5981, step: 15400 lambda_1: -1.0835, lambda_2: 63.2559 lambda_3: 0.0000 train remain: [0.98 0.98 0.66 0.61 0.62 0.45 0.51 0.25 0.12 0.15] infer remain: [1.0, 1.0, 0.64, 0.62, 0.62, 0.46, 0.52, 0.24, 0.12, 0.14] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.4, 0.25, 0.11, 0.06, 0.01, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111101011110111011111000110101110101010010000 11111111111111111101111101010011011010000000000100 11111111111111111111111111101100100000000000000100 11111111111111111100110110010000000000000000000000 
11111111011111111111001010110100010000000000100000 11111111110010000001000000000000000000000000000000 10000101001011000000000000000000000000000000000000 10000000010100010001000000010000000000000100000000 loss: 0.001902, lagrangian_loss: 0.002498, attention_score_distillation_loss: 0.000213 ETA: 0:37:00 | Epoch 133 finished. Took 35.29 seconds. loss: 0.001281, lagrangian_loss: 0.001589, attention_score_distillation_loss: 0.000211 ---------------------------------------------------------------------- time: 2023-07-19 15:39:18 Evaluating: f1: 0.8646, eval_loss: 0.6855, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5987, expected_sparsity: 0.5914, expected_sequence_sparsity: 0.833, target_sparsity: 0.6, step: 15450 lambda_1: -1.3411, lambda_2: 63.3609 lambda_3: 0.0000 train remain: [0.98 0.97 0.65 0.61 0.61 0.45 0.51 0.25 0.12 0.15] infer remain: [1.0, 1.0, 0.64, 0.6, 0.6, 0.46, 0.5, 0.24, 0.12, 0.14] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.38, 0.23, 0.11, 0.05, 0.01, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111101011110111011111000110101110101010010000 11111111111011111101111101010011011010000000000100 11111111111111111111111110101100100000000000000100 11111111111111111100110110010000000000000000000000 11111111011111111011001010110100010000000000100000 11111111110010000001000000000000000000000000000000 10000101001011000000000000000000000000000000000000 10000000010100010001000000010000000000000100000000 loss: 0.002222, lagrangian_loss: -0.000537, attention_score_distillation_loss: 0.000207 loss: 0.138751, lagrangian_loss: 0.001609, attention_score_distillation_loss: 0.000205 ---------------------------------------------------------------------- time: 2023-07-19 15:39:32 Evaluating: f1: 0.8714, eval_loss: 0.6712, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5987, expected_sparsity: 0.5914, expected_sequence_sparsity: 0.833, target_sparsity: 0.602, step: 15500 lambda_1: -1.5837, lambda_2: 63.4781 lambda_3: 0.0000 train remain: [0.98 0.97 0.65 0.61 0.6 0.45 0.51 0.25 0.12 0.15] infer remain: [1.0, 1.0, 0.64, 0.6, 0.6, 0.46, 0.5, 0.24, 0.12, 0.14] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.38, 0.23, 0.11, 0.05, 0.01, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111101011110111011111000110101110101010010000 11111111111011111101111101010011011010000000000100 11111111111111111111111111101100100000000000000000 11111111111111111100110110010000000000000000000000 11111111011111111011001010110100010000000000100000 11111111110010000001000000000000000000000000000000 10000101000011001000000000000000000000000000000000 10000000010100010001000000010000000000000100000000 loss: 0.005439, lagrangian_loss: 0.003906, attention_score_distillation_loss: 0.000202 ETA: 0:36:26 | Epoch 134 finished. Took 33.19 seconds. 
loss: 0.002967, lagrangian_loss: 0.003280, attention_score_distillation_loss: 0.000199 ---------------------------------------------------------------------- time: 2023-07-19 15:39:47 Evaluating: f1: 0.8705, eval_loss: 0.6847, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5987, expected_sparsity: 0.592, expected_sequence_sparsity: 0.8333, target_sparsity: 0.6039, step: 15550 lambda_1: -2.1133, lambda_2: 63.8635 lambda_3: 0.0000 train remain: [0.98 0.96 0.65 0.61 0.6 0.45 0.51 0.24 0.12 0.15] infer remain: [1.0, 1.0, 0.64, 0.6, 0.6, 0.44, 0.5, 0.24, 0.12, 0.14] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.38, 0.23, 0.1, 0.05, 0.01, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111101011110111011111000110101110101010010000 11111111111011111101111101010011011010000000000100 11111111111111111111111111101100100000000000000000 11111111101111111100110110010000000000000000000000 11111111011111111011001010110100010000000000100000 11111111110010000001000000000000000000000000000000 10000101000011000000000000000000000000000000000001 10000000010100010001000000010000000000000100000000 loss: 0.004646, lagrangian_loss: 0.003113, attention_score_distillation_loss: 0.000196 loss: 0.000978, lagrangian_loss: -0.002311, attention_score_distillation_loss: 0.000193 ---------------------------------------------------------------------- time: 2023-07-19 15:40:02 Evaluating: f1: 0.8724, eval_loss: 0.6839, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.6019, expected_sparsity: 0.5967, expected_sequence_sparsity: 0.8352, target_sparsity: 0.6058, step: 15600 lambda_1: -2.1185, lambda_2: 63.9727 lambda_3: 0.0000 train remain: [0.98 0.95 0.64 0.61 0.59 0.45 0.5 0.24 0.12 0.14] infer remain: [1.0, 1.0, 0.62, 0.6, 0.58, 0.44, 0.5, 0.24, 0.12, 0.14] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.62, 0.37, 0.22, 0.09, 0.05, 0.01, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111101011110111011111000110100110101010010000 11111111111011111101111101010011011010000000000100 11111111111111111111111110101100100000000000000000 11111111101111111100110110010000000000000000000000 11111111011111111011001010110100010000000000100000 11111111110010000001000000000000000000000000000000 10000101010011000000000000000000000000000000000000 10000000010100010001000000010000000000000100000000 loss: 0.002437, lagrangian_loss: -0.003967, attention_score_distillation_loss: 0.000190 loss: 0.086966, lagrangian_loss: -0.005129, attention_score_distillation_loss: 0.000187 ETA: 0:35:53 | Epoch 135 finished. Took 33.21 seconds. 
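The lambda_1/lambda_2 values and the small, sign-changing lagrangian_loss are consistent with the usual Lagrangian controller used in L0-style structured pruning, where learned multipliers push the expected sparsity toward the scheduled target. The exact objective is not printed in this log, so the following is only a sketch of that common formulation, not the script's verified code:

def lagrangian_penalty(expected_sparsity, target_sparsity, lambda_1, lambda_2):
    # Assumed form: lambda_1 * (s - t) + lambda_2 * (s - t)^2, with the
    # multipliers trained adversarially, which is why the penalty can go
    # slightly negative when expected sparsity overshoots the target.
    gap = expected_sparsity - target_sparsity
    return lambda_1 * gap + lambda_2 * gap * gap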
---------------------------------------------------------------------- time: 2023-07-19 15:40:16 Evaluating: f1: 0.8858, eval_loss: 0.6451, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6303, expected_sparsity: 0.6238, expected_sequence_sparsity: 0.8463, target_sparsity: 0.6078, step: 15650 lambda_1: -1.3891, lambda_2: 64.7355 lambda_3: 0.0000 train remain: [0.98 0.93 0.64 0.6 0.59 0.45 0.5 0.24 0.12 0.14] infer remain: [1.0, 0.86, 0.62, 0.6, 0.58, 0.44, 0.5, 0.24, 0.12, 0.14] layerwise remain: [1.0, 1.0, 1.0, 0.86, 0.53, 0.32, 0.19, 0.08, 0.04, 0.01, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111111111111111101111110111111101110110 11111111101011110111011111000110100110101010010000 11111111111011111101111101010011011010000000000100 11111111111111111111111110101100100000000000000000 11111111101111111100110110010000000000000000000000 11111111011111111011001010110100010000000000100000 11111111110010000001000000000000000000000000000000 10000101000011000000000010000000000000000000000000 10000000010100010001000000010000000000000100000000 loss: 0.003580, lagrangian_loss: -0.004472, attention_score_distillation_loss: 0.000184 loss: 0.143183, lagrangian_loss: -0.002385, attention_score_distillation_loss: 0.000181 ---------------------------------------------------------------------- time: 2023-07-19 15:40:31 Evaluating: f1: 0.8691, eval_loss: 0.7763, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6303, expected_sparsity: 0.6238, expected_sequence_sparsity: 0.8463, target_sparsity: 0.6097, step: 15700 lambda_1: -0.2457, lambda_2: 66.3016 lambda_3: 0.0000 train remain: [0.98 0.92 0.64 0.6 0.58 0.45 0.5 0.24 0.12 0.14] infer remain: [1.0, 0.86, 0.62, 0.6, 0.58, 0.44, 0.5, 0.24, 0.12, 0.14] layerwise remain: [1.0, 1.0, 1.0, 0.86, 0.53, 0.32, 0.19, 0.08, 0.04, 0.01, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111111111111111101111110111111101110110 11111111101011110111011111000110100110101010010000 11111111111011111101111101010011011010000000000100 11111111111111111111111110101100100000000000000000 11111111101111111100110110010000000000000000000000 11111111011111111011001010110100010000000000100000 11111111110010000001000000000000000000000000000000 10000101000011000000000000000000000000000000000001 10000000010100010001000000010000000000000100000000 loss: 0.158118, lagrangian_loss: 0.000288, attention_score_distillation_loss: 0.000179 loss: 0.001747, lagrangian_loss: 0.001045, attention_score_distillation_loss: 0.000176 ---------------------------------------------------------------------- time: 2023-07-19 15:40:46 Evaluating: f1: 0.8468, eval_loss: 0.7195, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6303, expected_sparsity: 0.6238, expected_sequence_sparsity: 0.8463, target_sparsity: 0.6117, step: 15750 lambda_1: 0.3923, lambda_2: 66.8617 lambda_3: 0.0000 train remain: [0.98 0.93 0.64 0.61 0.59 0.45 0.5 0.24 0.12 0.15] infer remain: [1.0, 0.86, 0.62, 0.6, 0.58, 0.44, 0.5, 0.24, 0.12, 0.14] layerwise remain: [1.0, 1.0, 1.0, 0.86, 0.53, 0.32, 0.19, 0.08, 0.04, 0.01, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111111111111111101111110111111101110110 11111111101011110111011111000110100110101010010000 11111111111011111101111101010011011010000000000100 11111111111111111111111110101100100000000000000000 11111111101111111100110110010000000000000000000000 
11111111011111111011001010110100010000000000100000 11111111110010000001000000000000000000000000000000 10000101000011000000000000000000000000000000000001 10000000010100010001000000010000000000000100000000 loss: 0.174319, lagrangian_loss: 0.000165, attention_score_distillation_loss: 0.000173 ETA: 0:35:20 | Epoch 136 finished. Took 35.36 seconds. loss: 0.003573, lagrangian_loss: -0.000495, attention_score_distillation_loss: 0.000170 ---------------------------------------------------------------------- time: 2023-07-19 15:41:01 Evaluating: f1: 0.8387, eval_loss: 0.77, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.6003, expected_sparsity: 0.5961, expected_sequence_sparsity: 0.8349, target_sparsity: 0.6136, step: 15800 lambda_1: 0.2059, lambda_2: 67.0296 lambda_3: 0.0000 train remain: [0.98 0.93 0.64 0.61 0.59 0.45 0.5 0.24 0.12 0.15] infer remain: [1.0, 1.0, 0.62, 0.6, 0.58, 0.46, 0.5, 0.24, 0.12, 0.14] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.62, 0.37, 0.22, 0.1, 0.05, 0.01, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111101011110111011111000110100110101010010000 11111111111011111101111101010011011010000000000100 11111111111111111111111110101100100000000000000000 11111111111111111100110110010000000000000000000000 11111111011111111011001010110100010000000000100000 11111111110010000001000000000000000000000000000000 10000101000011000000000000000000000000000000000001 10000000010100010001000000010000000000000100000000 loss: 0.002947, lagrangian_loss: -0.000102, attention_score_distillation_loss: 0.000167 loss: 0.001858, lagrangian_loss: 0.000816, attention_score_distillation_loss: 0.000164 ---------------------------------------------------------------------- time: 2023-07-19 15:41:15 Evaluating: f1: 0.8612, eval_loss: 0.7308, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6303, expected_sparsity: 0.6238, expected_sequence_sparsity: 0.8463, target_sparsity: 0.6155, step: 15850 lambda_1: -0.5346, lambda_2: 67.7248 lambda_3: 0.0000 train remain: [0.98 0.93 0.64 0.61 0.59 0.45 0.5 0.24 0.12 0.14] infer remain: [1.0, 0.86, 0.62, 0.6, 0.58, 0.44, 0.5, 0.24, 0.12, 0.14] layerwise remain: [1.0, 1.0, 1.0, 0.86, 0.53, 0.32, 0.19, 0.08, 0.04, 0.01, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 11111011111111101111111111101111110111111101110110 11111111101011110111011111000110100110101010010000 11111111111011111101111101010011011010000000000100 11111111111111111111111110101100100000000000000000 11111111101111111100110110010000000000000000000000 11111111011111111011001010110100010000000000100000 11111111110010000001000000000000000000000000000000 10000101000011000000000010000000000000000000000000 10000000010100010001000000010000000000000100000000 loss: 0.000858, lagrangian_loss: 0.001974, attention_score_distillation_loss: 0.000161 ETA: 0:34:46 | Epoch 137 finished. Took 33.15 seconds. 
loss: 0.001490, lagrangian_loss: 0.001547, attention_score_distillation_loss: 0.000158 ---------------------------------------------------------------------- time: 2023-07-19 15:41:30 Evaluating: f1: 0.862, eval_loss: 0.7697, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6303, expected_sparsity: 0.6238, expected_sequence_sparsity: 0.8463, target_sparsity: 0.6175, step: 15900 lambda_1: -1.1512, lambda_2: 68.2349 lambda_3: 0.0000 train remain: [0.98 0.92 0.64 0.6 0.58 0.45 0.5 0.24 0.12 0.14] infer remain: [1.0, 0.86, 0.62, 0.6, 0.58, 0.44, 0.5, 0.24, 0.12, 0.14] layerwise remain: [1.0, 1.0, 1.0, 0.86, 0.53, 0.32, 0.19, 0.08, 0.04, 0.01, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 11111011111111101111111111101111110111111101110110 11111111101011110111011111000110100110101010010000 11111111111011111101111101010011011010000000000100 11111111111111111111111110101100100000000000000000 11111111101111111100110110010000000000000000000000 11111111011111111011001010110100010000000000100000 11111111110010000001000000000000000000000000000000 10000101000011000000000010000000000000000000000000 10000000010100010001000000010000000000000100000000 loss: 0.001991, lagrangian_loss: 0.003390, attention_score_distillation_loss: 0.000156 loss: 0.008554, lagrangian_loss: 0.001382, attention_score_distillation_loss: 0.000153 ---------------------------------------------------------------------- time: 2023-07-19 15:41:45 Evaluating: f1: 0.8673, eval_loss: 0.7456, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6303, expected_sparsity: 0.6238, expected_sequence_sparsity: 0.8463, target_sparsity: 0.6194, step: 15950 lambda_1: -1.4117, lambda_2: 68.3691 lambda_3: 0.0000 train remain: [0.97 0.9 0.63 0.6 0.58 0.45 0.5 0.24 0.12 0.14] infer remain: [1.0, 0.86, 0.62, 0.6, 0.58, 0.44, 0.5, 0.24, 0.12, 0.14] layerwise remain: [1.0, 1.0, 1.0, 0.86, 0.53, 0.32, 0.19, 0.08, 0.04, 0.01, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 11111011111111101111111111101111110111111101110110 11111111101011110111011111000110100110101010010000 11111111111011111101111101010011011010000000000100 11111111111111111111111110101100100000000000000000 11111111101111111100110110010000000000000000000000 11111111011111111011001010110100010000000000100000 11111111110010000001000000000000000000000000000000 10000101000011000000000010000000000000000000000000 10000000010100010001000000010000000000000000000001 loss: 0.009650, lagrangian_loss: -0.000511, attention_score_distillation_loss: 0.000150 loss: 0.008759, lagrangian_loss: -0.001755, attention_score_distillation_loss: 0.000147 ETA: 0:34:12 | Epoch 138 finished. Took 33.1 seconds. 
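target_sparsity rises by roughly 0.002 every 50 steps in this section (0.5825 at step 15000, 0.6194 at step 15950) and, further down the log, levels off at 0.67, which points to a linear ramp of the sparsity target up to a fixed final value. A sketch under that assumption; the ramp length is back-solved from this log, and the real schedule may include a warmup offset that is not visible here:

def sparsity_target(step, final_sparsity=0.67, ramp_steps=17250):
    # Assumption: linear ramp from 0 to final_sparsity, then flat.
    return final_sparsity * min(1.0, step / ramp_steps)

print(round(sparsity_target(15000), 4))  # 0.5826, vs. 0.5825 logged at step 15000
print(round(sparsity_target(17400), 4))  # 0.67, matching the later plateau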
---------------------------------------------------------------------- time: 2023-07-19 15:41:59 Evaluating: f1: 0.8676, eval_loss: 0.7333, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6318, expected_sparsity: 0.6276, expected_sequence_sparsity: 0.8479, target_sparsity: 0.6214, step: 16000 lambda_1: -1.2698, lambda_2: 68.4294 lambda_3: 0.0000 train remain: [0.97 0.89 0.63 0.6 0.58 0.45 0.49 0.23 0.12 0.14] infer remain: [1.0, 0.84, 0.62, 0.6, 0.58, 0.44, 0.5, 0.24, 0.12, 0.14] layerwise remain: [1.0, 1.0, 1.0, 0.84, 0.52, 0.31, 0.18, 0.08, 0.04, 0.01, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101111111111101111110111111101110110 11111111101011110111011111000110100110101010010000 11111111111011111101111101010011011010000000000100 11111111111111111111111110101100100000000000000000 11111111101111111100110110010000000000000000000000 11111111011111111011001010110100010000000000100000 11111111110010000001000000000000000000000000000000 10000101000011001000000000000000000000000000000000 10000000010100010001000000010000000000000000000001 loss: 0.001650, lagrangian_loss: -0.000942, attention_score_distillation_loss: 0.000144 loss: 0.003177, lagrangian_loss: -0.001426, attention_score_distillation_loss: 0.000141 ---------------------------------------------------------------------- time: 2023-07-19 15:42:14 Evaluating: f1: 0.8731, eval_loss: 0.7526, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6318, expected_sparsity: 0.6278, expected_sequence_sparsity: 0.8479, target_sparsity: 0.6233, step: 16050 lambda_1: -0.8697, lambda_2: 68.6640 lambda_3: 0.0000 train remain: [0.97 0.88 0.63 0.6 0.57 0.45 0.49 0.23 0.12 0.14] infer remain: [1.0, 0.84, 0.62, 0.6, 0.58, 0.44, 0.48, 0.24, 0.12, 0.14] layerwise remain: [1.0, 1.0, 1.0, 0.84, 0.52, 0.31, 0.18, 0.08, 0.04, 0.01, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101111111111101111110111111101110110 11111111101011110111001111000110101110101010010000 11111111111011111101111101010011011010000000000100 11111111111111111111111110101100100000000000000000 11111111101111111100110110010000000000000000000000 11111111010111111011001010110100010000000000100000 11111111110010000001000000000000000000000000000000 10000101000011001000000000000000000000000000000000 10000000010100010001000000010000000000000000000001 loss: 0.006447, lagrangian_loss: -0.001072, attention_score_distillation_loss: 0.000139 loss: 0.005200, lagrangian_loss: -0.000518, attention_score_distillation_loss: 0.000136 ---------------------------------------------------------------------- time: 2023-07-19 15:42:29 Evaluating: f1: 0.871, eval_loss: 0.7727, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6318, expected_sparsity: 0.6279, expected_sequence_sparsity: 0.848, target_sparsity: 0.6253, step: 16100 lambda_1: -0.5283, lambda_2: 68.8386 lambda_3: 0.0000 train remain: [0.97 0.88 0.63 0.6 0.57 0.45 0.49 0.23 0.12 0.14] infer remain: [1.0, 0.84, 0.62, 0.6, 0.58, 0.44, 0.48, 0.22, 0.12, 0.14] layerwise remain: [1.0, 1.0, 1.0, 0.84, 0.52, 0.31, 0.18, 0.08, 0.04, 0.01, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101111111111101111110111111101110110 11111111101011110111001111000110101110101010010000 11111111111011111101111101010011011010000000000100 11111111111111111111111110101100100000000000000000 11111111101111111100110110010000000000000000000000 
11111111010111111011001010110100010000000000100000 11101111110010000001000000000000000000000000000000 10000101000011000000000010000000000000000000000000 10000000010100010001000000010000000000000000000001 ETA: 0:33:39 | Epoch 139 finished. Took 35.42 seconds. loss: 0.004265, lagrangian_loss: -0.000541, attention_score_distillation_loss: 0.000133 loss: 0.006619, lagrangian_loss: -0.000205, attention_score_distillation_loss: 0.000130 ---------------------------------------------------------------------- time: 2023-07-19 15:42:44 Evaluating: f1: 0.8847, eval_loss: 0.6917, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6334, expected_sparsity: 0.6287, expected_sequence_sparsity: 0.8483, target_sparsity: 0.6272, step: 16150 lambda_1: -0.4700, lambda_2: 68.8875 lambda_3: 0.0000 train remain: [0.97 0.88 0.62 0.6 0.57 0.45 0.49 0.23 0.12 0.14] infer remain: [1.0, 0.84, 0.62, 0.6, 0.56, 0.44, 0.48, 0.22, 0.12, 0.14] layerwise remain: [1.0, 1.0, 1.0, 0.84, 0.52, 0.31, 0.17, 0.08, 0.04, 0.01, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101111111111101111110111111101110110 11111111101011110111001111000110101110101010010000 11111111111011111101111101010011011010000000000100 11111111111101111111111110101100100000000000000000 11111111101111111100110110010000000000000000000000 11111111010111111011001010110100010000000000100000 11101111110010000001000000000000000000000000000000 10000101000011000000000010000000000000000000000000 10000000010100010001000000010000000000000000000001 loss: 0.003459, lagrangian_loss: 0.000171, attention_score_distillation_loss: 0.000127 loss: 0.004751, lagrangian_loss: 0.000414, attention_score_distillation_loss: 0.000124 ---------------------------------------------------------------------- time: 2023-07-19 15:42:58 Evaluating: f1: 0.866, eval_loss: 0.7306, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6381, expected_sparsity: 0.6317, expected_sequence_sparsity: 0.8495, target_sparsity: 0.6291, step: 16200 lambda_1: -0.6775, lambda_2: 68.9718 lambda_3: 0.0000 train remain: [0.97 0.88 0.62 0.6 0.57 0.45 0.49 0.23 0.12 0.14] infer remain: [1.0, 0.84, 0.6, 0.6, 0.56, 0.44, 0.48, 0.22, 0.12, 0.14] layerwise remain: [1.0, 1.0, 1.0, 0.84, 0.5, 0.3, 0.17, 0.07, 0.04, 0.01, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101111111111101111110111111101110110 11111111101011110111001111000110101110100010010000 11111111111111111101110101010011011010000000000100 11111111111101111111111110101100100000000000000000 11111111101111111100110110010000000000000000000000 11111111010111111011001010110100010000000000100000 11101111110010000001000000000000000000000000000000 10000101000011000000000010000000000000000000000000 10000000010100010001000000010000000000000000000001 loss: 0.127728, lagrangian_loss: 0.000942, attention_score_distillation_loss: 0.000121 ETA: 0:33:05 | Epoch 140 finished. Took 33.09 seconds. 
loss: 0.002692, lagrangian_loss: 0.000891, attention_score_distillation_loss: 0.000118 ---------------------------------------------------------------------- time: 2023-07-19 15:43:13 Evaluating: f1: 0.8542, eval_loss: 0.7302, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6381, expected_sparsity: 0.6317, expected_sequence_sparsity: 0.8495, target_sparsity: 0.6311, step: 16250 lambda_1: -1.0198, lambda_2: 69.1458 lambda_3: 0.0000 train remain: [0.97 0.87 0.62 0.6 0.57 0.45 0.49 0.23 0.12 0.14] infer remain: [1.0, 0.84, 0.6, 0.6, 0.56, 0.44, 0.48, 0.22, 0.12, 0.14] layerwise remain: [1.0, 1.0, 1.0, 0.84, 0.5, 0.3, 0.17, 0.07, 0.04, 0.01, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101111111111101111110111111101110110 11111111101011110111001111000110101110100010010000 11111111111111111101110101010011011010000000000100 11111111111101111111111110101100100000000000000000 11111111101111111100110110010000000000000000000000 11111111010111111011001010110100010000000000100000 11101111110010000001000000000000000000000000000000 10000101000011000000000010000000000000000000000000 10000000010100010001000000010000000000000000000001 loss: 0.002633, lagrangian_loss: 0.002175, attention_score_distillation_loss: 0.000116 loss: 0.004739, lagrangian_loss: 0.003634, attention_score_distillation_loss: 0.000113 ---------------------------------------------------------------------- time: 2023-07-19 15:43:28 Evaluating: f1: 0.8609, eval_loss: 0.8503, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6381, expected_sparsity: 0.6317, expected_sequence_sparsity: 0.8495, target_sparsity: 0.633, step: 16300 lambda_1: -1.3202, lambda_2: 69.2947 lambda_3: 0.0000 train remain: [0.97 0.87 0.62 0.59 0.57 0.45 0.48 0.23 0.12 0.14] infer remain: [1.0, 0.84, 0.6, 0.6, 0.56, 0.44, 0.48, 0.22, 0.12, 0.14] layerwise remain: [1.0, 1.0, 1.0, 0.84, 0.5, 0.3, 0.17, 0.07, 0.04, 0.01, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101111111111101111110111111101110110 11111111101011110111001111000110101110100010010000 11111111111111111101110101010011011010000000000100 11111111111101111111111110101100100000000000000000 11111111101111111100110110010000000000000000000000 11111111010111111011001010110100010000000000100000 11101111110010000001000000000000000000000000000000 10000101000011000000000010000000000000000000000000 10000000010100010001000000010000000000000000000001 loss: 0.002589, lagrangian_loss: 0.000621, attention_score_distillation_loss: 0.000110 loss: 0.134506, lagrangian_loss: 0.001687, attention_score_distillation_loss: 0.000107 ETA: 0:32:31 | Epoch 141 finished. Took 33.39 seconds. 
---------------------------------------------------------------------- time: 2023-07-19 15:43:42 Evaluating: f1: 0.8581, eval_loss: 0.7425, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6397, expected_sparsity: 0.6333, expected_sequence_sparsity: 0.8502, target_sparsity: 0.635, step: 16350 lambda_1: -1.5862, lambda_2: 69.4057 lambda_3: 0.0000 train remain: [0.97 0.86 0.61 0.59 0.57 0.44 0.48 0.23 0.12 0.14] infer remain: [1.0, 0.84, 0.6, 0.58, 0.56, 0.44, 0.48, 0.22, 0.12, 0.14] layerwise remain: [1.0, 1.0, 1.0, 0.84, 0.5, 0.29, 0.16, 0.07, 0.03, 0.01, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101111111111101111110111111101110110 11111111101011110111001111000110101110100010010000 11111111111011111101110101010011011010000000000100 11111111111101111111111110101100100000000000000000 11111111101111111100110110010000000000000000000000 11111111010111111011001010110100010000000000100000 11101111110010000001000000000000000000000000000000 10000101000011000000000010000000000000000000000000 10000000010100000001000000010000000000000000000011 loss: 0.008445, lagrangian_loss: 0.001599, attention_score_distillation_loss: 0.000104 loss: 0.003550, lagrangian_loss: 0.002469, attention_score_distillation_loss: 0.000101 ---------------------------------------------------------------------- time: 2023-07-19 15:43:57 Evaluating: f1: 0.8774, eval_loss: 0.7076, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6428, expected_sparsity: 0.6372, expected_sequence_sparsity: 0.8518, target_sparsity: 0.6369, step: 16400 lambda_1: -1.7893, lambda_2: 69.4857 lambda_3: 0.0000 train remain: [0.97 0.85 0.61 0.59 0.56 0.44 0.47 0.22 0.12 0.14] infer remain: [1.0, 0.82, 0.6, 0.58, 0.56, 0.44, 0.46, 0.22, 0.12, 0.14] layerwise remain: [1.0, 1.0, 1.0, 0.82, 0.49, 0.29, 0.16, 0.07, 0.03, 0.01, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101111111111101111110111101101110110 11111111101011110111001111000110101110100010010000 11111111111011111101110101010011011010000000000100 11111111111101111111111110101100100000000000000000 11111111101111111100110110010000000000000000000000 11111111010111111011001010110100010000000000000000 11101111110010000001000000000000000000000000000000 10000101000011000000000010000000000000000000000000 10000000010100000001000000010000000000000000000011 loss: 0.001532, lagrangian_loss: 0.001442, attention_score_distillation_loss: 0.000098 loss: 0.002115, lagrangian_loss: 0.001714, attention_score_distillation_loss: 0.000095 ETA: 0:31:58 | Epoch 142 finished. Took 33.14 seconds. 
---------------------------------------------------------------------- time: 2023-07-19 15:44:12 Evaluating: f1: 0.8645, eval_loss: 0.7318, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6475, expected_sparsity: 0.64, expected_sequence_sparsity: 0.8529, target_sparsity: 0.6388, step: 16450 lambda_1: -1.8637, lambda_2: 69.5243 lambda_3: 0.0000 train remain: [0.97 0.85 0.6 0.59 0.56 0.43 0.47 0.22 0.12 0.14] infer remain: [1.0, 0.82, 0.58, 0.58, 0.56, 0.44, 0.46, 0.22, 0.12, 0.14] layerwise remain: [1.0, 1.0, 1.0, 0.82, 0.48, 0.28, 0.15, 0.07, 0.03, 0.01, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101111111111101111110111101101110110 11111111101011110111001111000110100110100010010000 11111111111011111101110101010011011010000000000100 11111111111101111111111110101100100000000000000000 10111111111111111100110110010000000000000000000000 10111111010111111011001010110100010000000000100000 11101111110010000001000000000000000000000000000000 10000101000011000000000010000000000000000000000000 10000000010100000001000000010000000000000000000011 loss: 0.009147, lagrangian_loss: 0.001093, attention_score_distillation_loss: 0.000092 loss: 0.002512, lagrangian_loss: -0.000366, attention_score_distillation_loss: 0.000089 ---------------------------------------------------------------------- time: 2023-07-19 15:44:27 Evaluating: f1: 0.8788, eval_loss: 0.7173, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6475, expected_sparsity: 0.6404, expected_sequence_sparsity: 0.8531, target_sparsity: 0.6408, step: 16500 lambda_1: -1.8103, lambda_2: 69.5543 lambda_3: 0.0000 train remain: [0.97 0.84 0.6 0.59 0.56 0.43 0.46 0.21 0.12 0.14] infer remain: [1.0, 0.82, 0.58, 0.58, 0.56, 0.42, 0.46, 0.22, 0.12, 0.14] layerwise remain: [1.0, 1.0, 1.0, 0.82, 0.48, 0.28, 0.15, 0.06, 0.03, 0.01, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101111111111101111110111101101110110 11111111101011110111001111000010101110100010010000 11111111111011111101110101010011011010000000000100 11111111111101111111111110101100100000000000000000 10111111101111111100110110010000000000000000000000 10111111010111111011001010110100010000000000100000 11101111110010000001000000000000000000000000000000 10000101000011000000000010000000000000000000000000 10000000010100000001000000010000000000000000000011 loss: 0.002435, lagrangian_loss: -0.001553, attention_score_distillation_loss: 0.000087 loss: 0.002631, lagrangian_loss: -0.002602, attention_score_distillation_loss: 0.000084 ---------------------------------------------------------------------- time: 2023-07-19 15:44:41 Evaluating: f1: 0.8866, eval_loss: 0.6696, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6475, expected_sparsity: 0.6405, expected_sequence_sparsity: 0.8531, target_sparsity: 0.6427, step: 16550 lambda_1: -1.5954, lambda_2: 69.6327 lambda_3: 0.0000 train remain: [0.96 0.84 0.59 0.58 0.56 0.43 0.46 0.21 0.12 0.14] infer remain: [1.0, 0.82, 0.58, 0.58, 0.56, 0.42, 0.46, 0.2, 0.12, 0.14] layerwise remain: [1.0, 1.0, 1.0, 0.82, 0.48, 0.28, 0.15, 0.06, 0.03, 0.01, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101111111111101111110111101101110110 11111111101011110111001111000010101110100010010000 11111111111011111101110101010011011010000000000100 11111111111101111111111110101100100000000000000000 10111111101111111100110110010000000000000000000000 
10111111010111111011001010110100010000000000100000 11101011110010000001000000000000000000000000000000 10000101000011001000000000000000000000000000000000 10000000010100000001000000010000000000000000000011 loss: 0.002116, lagrangian_loss: -0.001460, attention_score_distillation_loss: 0.000081 ETA: 0:31:25 | Epoch 143 finished. Took 35.46 seconds. loss: 0.145425, lagrangian_loss: -0.001636, attention_score_distillation_loss: 0.000078 ---------------------------------------------------------------------- time: 2023-07-19 15:44:56 Evaluating: f1: 0.8773, eval_loss: 0.7223, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6475, expected_sparsity: 0.6405, expected_sequence_sparsity: 0.8531, target_sparsity: 0.6447, step: 16600 lambda_1: -1.3609, lambda_2: 69.7219 lambda_3: 0.0000 train remain: [0.97 0.83 0.59 0.58 0.56 0.43 0.46 0.21 0.12 0.14] infer remain: [1.0, 0.82, 0.58, 0.58, 0.56, 0.42, 0.46, 0.2, 0.12, 0.14] layerwise remain: [1.0, 1.0, 1.0, 0.82, 0.48, 0.28, 0.15, 0.06, 0.03, 0.01, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101111111111101111110111101101110110 11111111101011110111001111000010101110100010010000 11111111111011111101110101010011011010000000000100 11111111111101111111111110101100100000000000000000 10111111101111111100110110010000000000000000000000 10111111010111111011001010110100010000000000100000 11101011110010000001000000000000000000000000000000 10000101000011001000000000000000000000000000000000 10000000010100000001000000010000000000000000000011 loss: 0.007107, lagrangian_loss: -0.000407, attention_score_distillation_loss: 0.000075 loss: 0.009919, lagrangian_loss: -0.000167, attention_score_distillation_loss: 0.000072 ---------------------------------------------------------------------- time: 2023-07-19 15:45:11 Evaluating: f1: 0.8741, eval_loss: 0.7097, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6475, expected_sparsity: 0.6405, expected_sequence_sparsity: 0.8531, target_sparsity: 0.6466, step: 16650 lambda_1: -1.3337, lambda_2: 69.7619 lambda_3: 0.0000 train remain: [0.96 0.83 0.58 0.58 0.56 0.42 0.46 0.21 0.12 0.14] infer remain: [1.0, 0.82, 0.58, 0.58, 0.56, 0.42, 0.46, 0.2, 0.12, 0.14] layerwise remain: [1.0, 1.0, 1.0, 0.82, 0.48, 0.28, 0.15, 0.06, 0.03, 0.01, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101111111111101111110111101101110110 11111111101011110111001111000010101110100010010000 11111111111011111101110101010011011010000000000100 11111111111101111111111110101100100000000000000000 10111111101111111100110110010000000000000000000000 10111111010111111011001010110100010000000000100000 11101011110010000001000000000000000000000000000000 10000101000011001000000000000000000000000000000000 10000000010100000001000000010000000000000000000011 loss: 0.002916, lagrangian_loss: 0.001448, attention_score_distillation_loss: 0.000069 ETA: 0:30:51 | Epoch 144 finished. Took 33.3 seconds. 
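Because every evaluation record follows the same "Evaluating: ..." pattern, the f1/sparsity trajectory can be recovered from this log with a short parser. A sketch; the regular expression only captures the fields used below and assumes each record stays on one line, as it does here:

import re

EVAL_RE = re.compile(
    r"Evaluating: f1: (?P<f1>[\d.]+), eval_loss: (?P<loss>[\d.]+)"
    r".*?expected_sparsity: (?P<sparsity>[\d.]+).*?step: (?P<step>\d+)"
)

def parse_eval_records(log_text):
    # Yields (step, f1, eval_loss, expected_sparsity) tuples in log order.
    for m in EVAL_RE.finditer(log_text):
        yield (int(m["step"]), float(m["f1"]), float(m["loss"]), float(m["sparsity"]))

# Usage (hypothetical file name):
# with open("train.log") as fh:
#     records = list(parse_eval_records(fh.read()))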
loss: 0.002908, lagrangian_loss: 0.002041, attention_score_distillation_loss: 0.000066 ---------------------------------------------------------------------- time: 2023-07-19 15:45:26 Evaluating: f1: 0.8605, eval_loss: 0.8001, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6665, expected_sparsity: 0.6616, expected_sequence_sparsity: 0.8618, target_sparsity: 0.6486, step: 16700 lambda_1: -1.6311, lambda_2: 69.9012 lambda_3: 0.0000 train remain: [0.96 0.83 0.58 0.58 0.55 0.42 0.46 0.21 0.12 0.14] infer remain: [0.92, 0.82, 0.56, 0.58, 0.56, 0.42, 0.46, 0.2, 0.12, 0.14] layerwise remain: [1.0, 1.0, 0.92, 0.75, 0.42, 0.25, 0.14, 0.06, 0.03, 0.01, 0.0, 0.0] 01111111111111111111111111111111110111111111111010 10111011111111101111111111101111110111101101110110 11111111101011110111001011000010101110100010010000 11111111111011111101110101010011011010000000000100 11111111111101111111111110101100100000000000000000 10111111101111111100110110010000000000000000000000 10111111010111111011001010110100010000000000100000 11101011110010000001000000000000000000000000000000 10000101000011001000000000000000000000000000000000 10000000010100000001000000010000000000000000000011 loss: 0.003135, lagrangian_loss: -0.000663, attention_score_distillation_loss: 0.000064 loss: 0.006193, lagrangian_loss: 0.002916, attention_score_distillation_loss: 0.000061 ---------------------------------------------------------------------- time: 2023-07-19 15:45:41 Evaluating: f1: 0.8735, eval_loss: 0.7033, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6696, expected_sparsity: 0.663, expected_sequence_sparsity: 0.8623, target_sparsity: 0.6505, step: 16750 lambda_1: -2.0119, lambda_2: 70.0971 lambda_3: 0.0000 train remain: [0.96 0.83 0.58 0.57 0.55 0.42 0.46 0.2 0.12 0.14] infer remain: [0.92, 0.82, 0.56, 0.56, 0.56, 0.42, 0.46, 0.2, 0.12, 0.14] layerwise remain: [1.0, 1.0, 0.92, 0.75, 0.42, 0.24, 0.13, 0.06, 0.03, 0.01, 0.0, 0.0] 01111111111111111111111111111111110111111111111010 10111011111111101111111111101111110111101101110110 11111111101011110111001011000010101110100010010000 11111111111011111100110101010011011010000000000100 11111111111101111111111110101100100000000000000000 10111111111111111100110010010000000000000000000000 10111111010111111011001010110100010000000000100000 11101011110010000001000000000000000000000000000000 10000101000011001000000000000000000000000000000000 10000000010100000001000000010000000000000000000011 loss: 0.002854, lagrangian_loss: 0.005571, attention_score_distillation_loss: 0.000058 loss: 0.001962, lagrangian_loss: 0.001002, attention_score_distillation_loss: 0.000055 ETA: 0:30:17 | Epoch 145 finished. Took 33.34 seconds. 
---------------------------------------------------------------------- time: 2023-07-19 15:45:55 Evaluating: f1: 0.8744, eval_loss: 0.7185, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6712, expected_sparsity: 0.6636, expected_sequence_sparsity: 0.8626, target_sparsity: 0.6524, step: 16800 lambda_1: -2.4114, lambda_2: 70.3093 lambda_3: 0.0000 train remain: [0.96 0.83 0.57 0.57 0.55 0.42 0.46 0.2 0.12 0.13] infer remain: [0.92, 0.82, 0.56, 0.56, 0.54, 0.42, 0.46, 0.2, 0.12, 0.14] layerwise remain: [1.0, 1.0, 0.92, 0.75, 0.42, 0.24, 0.13, 0.05, 0.02, 0.0, 0.0, 0.0] 01111111111111111111111111111111110111111111111010 10111011111111101111111111101111110111101101110110 11111111101011110111001011000010101110100010010000 11111111111011111100110101010011011010000000000100 11111111111101111111111110001100100000000000000000 10111111111111111100110010010000000000000000000000 10111111010111111011001010110100010000000000100000 10101111110010000001000000000000000000000000000000 10000101000011001000000000000000000000000000000000 10000000010100000001000000010000000000000000000011 loss: 0.004987, lagrangian_loss: -0.000376, attention_score_distillation_loss: 0.000052 loss: 0.001900, lagrangian_loss: 0.001524, attention_score_distillation_loss: 0.000049 ---------------------------------------------------------------------- time: 2023-07-19 15:46:10 Evaluating: f1: 0.8611, eval_loss: 0.7567, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6759, expected_sparsity: 0.6693, expected_sequence_sparsity: 0.8649, target_sparsity: 0.6544, step: 16850 lambda_1: -2.6790, lambda_2: 70.4488 lambda_3: 0.0000 train remain: [0.96 0.82 0.56 0.57 0.55 0.42 0.46 0.2 0.12 0.13] infer remain: [0.92, 0.8, 0.54, 0.56, 0.54, 0.42, 0.46, 0.2, 0.12, 0.14] layerwise remain: [1.0, 1.0, 0.92, 0.74, 0.4, 0.22, 0.12, 0.05, 0.02, 0.0, 0.0, 0.0] 01111111111111111111111111111111110111111111111010 10111011111111101101111111101111110111101101110110 11111111101011110111001011000010100110100010010000 11111111111011111100110101010011011010000000000100 11111111111101111111111110001100100000000000000000 10111111111111111100110010010000000000000000000000 10111111010111111011001010110100010000000000100000 10101111110010000001000000000000000000000000000000 10000101000011000000000000000000000000000000000001 10000000010100000001000000010000000000000000000011 loss: 0.168615, lagrangian_loss: 0.002915, attention_score_distillation_loss: 0.000046 loss: 0.003455, lagrangian_loss: 0.006691, attention_score_distillation_loss: 0.000044 ---------------------------------------------------------------------- time: 2023-07-19 15:46:25 Evaluating: f1: 0.8744, eval_loss: 0.6847, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6759, expected_sparsity: 0.6693, expected_sequence_sparsity: 0.8649, target_sparsity: 0.6563, step: 16900 lambda_1: -3.1740, lambda_2: 70.7482 lambda_3: 0.0000 train remain: [0.96 0.82 0.55 0.56 0.55 0.41 0.46 0.2 0.12 0.13] infer remain: [0.92, 0.8, 0.54, 0.56, 0.54, 0.42, 0.46, 0.2, 0.12, 0.12] layerwise remain: [1.0, 1.0, 0.92, 0.74, 0.4, 0.22, 0.12, 0.05, 0.02, 0.0, 0.0, 0.0] 01111111111111111111111111111111110111111111111010 10111011111111101101111111101111110111101101110110 11111111101011110111001011000010100110100010010000 11111111111011111100110101010011011010000000000100 11111111111101111111111110001100100000000000000000 10111111111111111100110010010000000000000000000000 
10111111010111111011001010110100010000000000100000 10101111110010000001000000000000000000000000000000 10000101000010000000000000000000000000000000000011 10000000010100000001000000010000000000000000000001 loss: 0.133191, lagrangian_loss: 0.006953, attention_score_distillation_loss: 0.000041 ETA: 0:29:44 | Epoch 146 finished. Took 35.35 seconds. loss: 0.003346, lagrangian_loss: 0.006039, attention_score_distillation_loss: 0.000038 ---------------------------------------------------------------------- time: 2023-07-19 15:46:39 Evaluating: f1: 0.867, eval_loss: 0.7245, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6759, expected_sparsity: 0.6696, expected_sequence_sparsity: 0.8651, target_sparsity: 0.6583, step: 16950 lambda_1: -3.6650, lambda_2: 71.0466 lambda_3: 0.0000 train remain: [0.96 0.82 0.55 0.56 0.54 0.41 0.46 0.2 0.12 0.13] infer remain: [0.92, 0.8, 0.54, 0.56, 0.54, 0.4, 0.46, 0.2, 0.12, 0.12] layerwise remain: [1.0, 1.0, 0.92, 0.74, 0.4, 0.22, 0.12, 0.05, 0.02, 0.0, 0.0, 0.0] 01111111111111111111111111111111110111111111111010 10111011111111101101111111101111110111101101110110 11111111101011110111001011000010100110100010010000 11111111111011111100110101010011011010000000000100 11111111111101111111111110001100100000000000000000 10111111111111101100110010010000000000000000000000 10111111110111111011001010110100010000000000000000 10101111110010000001000000000000000000000000000000 10000101000010000000000000000000000000000000000011 10000000010100000001000000010000000000000000000001 loss: 0.127968, lagrangian_loss: 0.005726, attention_score_distillation_loss: 0.000035 loss: 0.001936, lagrangian_loss: 0.009225, attention_score_distillation_loss: 0.000032 Starting saving the best from epoch 147 and step 17000 ---------------------------------------------------------------------- time: 2023-07-19 15:46:54 Evaluating: f1: 0.8714, eval_loss: 0.7322, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.679, expected_sparsity: 0.672, expected_sequence_sparsity: 0.8661, target_sparsity: 0.6602, step: 17000 lambda_1: -4.1869, lambda_2: 71.3869 lambda_3: 0.0000 train remain: [0.96 0.82 0.54 0.56 0.54 0.41 0.46 0.2 0.12 0.13] infer remain: [0.92, 0.8, 0.52, 0.56, 0.54, 0.4, 0.46, 0.2, 0.12, 0.12] layerwise remain: [1.0, 1.0, 0.92, 0.74, 0.38, 0.21, 0.12, 0.05, 0.02, 0.0, 0.0, 0.0] 01111111111111011111111111111111111111111111111010 10111011111111101101111111101111110111101101110110 11111111101011110111001011000010100110100010000000 11111111111011111100110101010011011010000000000100 11111111111101111111111110001100100000000000000000 10111111111111101100110010010000000000000000000000 10111111110111111011001010110100010000000000000000 10101111110010000001000000000000000000000000000000 10000101000010000000000000000000000000000000000011 10000000010100000001000000010000000000000000000001 Saving the best model so far: [Epoch 147 | Step: 17000 | MACs sparsity: 0.679 | Score: 0.8714 | Loss: 0.7322] loss: 0.133940, lagrangian_loss: 0.004829, attention_score_distillation_loss: 0.000029 ETA: 0:29:15 | Epoch 147 finished. Took 47.63 seconds. 
loss: 0.002726, lagrangian_loss: 0.007066, attention_score_distillation_loss: 0.000026 ---------------------------------------------------------------------- time: 2023-07-19 15:47:23 Evaluating: f1: 0.8627, eval_loss: 0.6787, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6821, expected_sparsity: 0.6763, expected_sequence_sparsity: 0.8678, target_sparsity: 0.6622, step: 17050 lambda_1: -4.5306, lambda_2: 71.5690 lambda_3: 0.0000 train remain: [0.95 0.81 0.53 0.55 0.54 0.4 0.46 0.2 0.12 0.13] infer remain: [0.92, 0.78, 0.52, 0.54, 0.54, 0.4, 0.46, 0.2, 0.12, 0.12] layerwise remain: [1.0, 1.0, 0.92, 0.72, 0.37, 0.2, 0.11, 0.04, 0.02, 0.0, 0.0, 0.0] 01111111111111011111111111111111111111111111111010 10111011111111101101111111101111110111001101110110 11111111101011110111001011000010100110100010000000 11111111111011111100110101010010011010000000000100 11111111111101111111111110001100100000000000000000 10111111111111101100110010010000000000000000000000 10111111110111111011001010110100010000000000000000 10101111110010000001000000000000000000000000000000 10000101000010000000000000000000000000000000000011 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8714 @ step 17000 epoch 147.83 loss: 0.014005, lagrangian_loss: -0.000084, attention_score_distillation_loss: 0.000023 loss: 0.006553, lagrangian_loss: 0.008311, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 15:47:38 Evaluating: f1: 0.8735, eval_loss: 0.6959, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6821, expected_sparsity: 0.6763, expected_sequence_sparsity: 0.8678, target_sparsity: 0.6641, step: 17100 lambda_1: -4.6151, lambda_2: 71.6265 lambda_3: 0.0000 train remain: [0.95 0.81 0.52 0.55 0.53 0.4 0.46 0.2 0.12 0.13] infer remain: [0.92, 0.78, 0.52, 0.54, 0.54, 0.4, 0.46, 0.2, 0.12, 0.12] layerwise remain: [1.0, 1.0, 0.92, 0.72, 0.37, 0.2, 0.11, 0.04, 0.02, 0.0, 0.0, 0.0] 01111111111111011111111111111111111111111111111010 10111011111111101101111111101111110111001101110110 11111111101011110111001011000010100110100010000000 11111111111011111100110101010010011010000000000100 11111111111101111111111110001100100000000000000000 10111111111111101100110010010000000000000000000000 10111111110111111011001010110100010000000000000000 10101011110010000001000000000000010000000000000000 10000101000010000000000000000000000000000000000011 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8714 @ step 17000 epoch 147.83 Saving the best model so far: [Epoch 148 | Step: 17100 | MACs sparsity: 0.6821 | Score: 0.8735 | Loss: 0.6959] loss: 0.001852, lagrangian_loss: 0.001555, attention_score_distillation_loss: 0.000020 loss: 0.007361, lagrangian_loss: -0.006592, attention_score_distillation_loss: 0.000020 ETA: 0:28:47 | Epoch 148 finished. Took 50.88 seconds. 
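The "Starting saving the best from epoch 147 and step 17000", "Saving the best model so far: [...]" and "Best eval score so far: ..." messages imply a best-checkpoint tracker that only becomes active after a configured step and overwrites the saved model whenever the eval score improves. A minimal sketch of that behavior; the class and callback names are illustrative, not the script's actual API:

class BestModelTracker:
    def __init__(self, start_step):
        self.start_step = start_step      # e.g. 17000 in this run
        self.best_score = None
        self.best_step = None

    def update(self, step, score, save_fn):
        # Only track once saving has started; keep the highest score seen.
        if step < self.start_step:
            return
        if self.best_score is None or score > self.best_score:
            self.best_score, self.best_step = score, step
            save_fn()                     # hypothetical checkpoint-writing callback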
---------------------------------------------------------------------- time: 2023-07-19 15:48:10 Evaluating: f1: 0.8616, eval_loss: 0.7365, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6853, expected_sparsity: 0.6791, expected_sequence_sparsity: 0.869, target_sparsity: 0.666, step: 17150 lambda_1: -4.3411, lambda_2: 71.7736 lambda_3: 0.0000 train remain: [0.95 0.8 0.52 0.55 0.53 0.4 0.46 0.2 0.12 0.13] infer remain: [0.92, 0.78, 0.5, 0.54, 0.52, 0.4, 0.46, 0.2, 0.12, 0.12] layerwise remain: [1.0, 1.0, 0.92, 0.72, 0.36, 0.19, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 01111111111111011111111111111111111111111111111010 10111011111111101101111111101111110111001101110110 11111111101011110111001011000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101100110010010000000000000000000000 10111111110111111011001010110100010000000000000000 10101011110010000001000000000000010000000000000000 10000101000010000000000000000000000000000000000011 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8735 @ step 17100 epoch 148.70 loss: 0.004291, lagrangian_loss: -0.008991, attention_score_distillation_loss: 0.000020 loss: 0.005514, lagrangian_loss: -0.004422, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 15:48:25 Evaluating: f1: 0.8646, eval_loss: 0.7136, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6853, expected_sparsity: 0.6791, expected_sequence_sparsity: 0.869, target_sparsity: 0.668, step: 17200 lambda_1: -3.8956, lambda_2: 72.0295 lambda_3: 0.0000 train remain: [0.95 0.8 0.51 0.54 0.53 0.4 0.46 0.2 0.12 0.13] infer remain: [0.92, 0.78, 0.5, 0.54, 0.52, 0.4, 0.46, 0.2, 0.12, 0.12] layerwise remain: [1.0, 1.0, 0.92, 0.72, 0.36, 0.19, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 01111111111111011111111111111111111111111111111010 10111011111111101101111111101111110111001101110110 11111111101011110111001011000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101100110010010000000000000000000000 10111111110111111011001010110100010000000000000000 10101011110010000001000000000000000000000000000001 10000101000010000000000000000000000000000000000011 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8735 @ step 17100 epoch 148.70 loss: 0.006193, lagrangian_loss: -0.009504, attention_score_distillation_loss: 0.000020 loss: 0.008559, lagrangian_loss: -0.007060, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 15:48:40 Evaluating: f1: 0.8684, eval_loss: 0.7344, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6868, expected_sparsity: 0.6822, expected_sequence_sparsity: 0.8702, target_sparsity: 0.6699, step: 17250 lambda_1: -3.3490, lambda_2: 72.4049 lambda_3: 0.0000 train remain: [0.94 0.79 0.51 0.54 0.52 0.39 0.46 0.2 0.12 0.13] infer remain: [0.92, 0.76, 0.5, 0.54, 0.52, 0.4, 0.46, 0.2, 0.12, 0.12] layerwise remain: [1.0, 1.0, 0.92, 0.7, 0.35, 0.19, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 01111111111111011111111111111111111111111111111010 10111011111111101100111111101111110111001101110110 11111111101011110111001011000010100110100000000000 11111111111111111100110001010010011010000000000100 
11111111111101111111110110001100100000000000000000 10111111111111101100110010010000000000000000000000 10111111110111111011001010110100010000000000000000 10101011110010000001000010000000000000000000000000 10000101000010000000000000000000000000000000000011 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8735 @ step 17100 epoch 148.70 ETA: 0:28:14 | Epoch 149 finished. Took 35.62 seconds. loss: 0.002547, lagrangian_loss: -0.006992, attention_score_distillation_loss: 0.000020 loss: 0.015886, lagrangian_loss: -0.008724, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 15:48:55 Evaluating: f1: 0.8699, eval_loss: 0.7249, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6868, expected_sparsity: 0.6824, expected_sequence_sparsity: 0.8703, target_sparsity: 0.67, step: 17300 lambda_1: -2.6269, lambda_2: 73.0800 lambda_3: 0.0000 train remain: [0.94 0.78 0.5 0.54 0.52 0.39 0.46 0.2 0.11 0.13] infer remain: [0.92, 0.76, 0.5, 0.54, 0.52, 0.38, 0.46, 0.2, 0.12, 0.12] layerwise remain: [1.0, 1.0, 0.92, 0.7, 0.35, 0.19, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 01111111111111011111111111111111111111111111111010 10111011111111101100111111101111110111001101110110 11111111101011110111001011000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111110111111011001010110100010000000000000000 10101011110010000001000010000000000000000000000000 10000101000010000000000000000000000000000000000011 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8735 @ step 17100 epoch 148.70 loss: 0.002372, lagrangian_loss: -0.006692, attention_score_distillation_loss: 0.000020 loss: 0.007142, lagrangian_loss: -0.008612, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 15:49:10 Evaluating: f1: 0.864, eval_loss: 0.7346, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6868, expected_sparsity: 0.6824, expected_sequence_sparsity: 0.8703, target_sparsity: 0.67, step: 17350 lambda_1: -1.6432, lambda_2: 74.2868 lambda_3: 0.0000 train remain: [0.94 0.78 0.5 0.54 0.52 0.39 0.46 0.2 0.11 0.13] infer remain: [0.92, 0.76, 0.5, 0.54, 0.52, 0.38, 0.46, 0.2, 0.12, 0.12] layerwise remain: [1.0, 1.0, 0.92, 0.7, 0.35, 0.19, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 01111111111111011111111111111111111111111111111010 10111011111111101100111111101111110111001101110110 11111111101011110111001011000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111110111111011001010110100010000000000000000 10101011110010000001000010000000000000000000000000 10000101000010000000000000000000000000000000000011 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8735 @ step 17100 epoch 148.70 loss: 0.003114, lagrangian_loss: -0.006976, attention_score_distillation_loss: 0.000020 ETA: 0:27:40 | Epoch 150 finished. Took 33.54 seconds. 
loss: 0.010309, lagrangian_loss: -0.002986, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 15:49:24 Evaluating: f1: 0.865, eval_loss: 0.7636, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.69, expected_sparsity: 0.6846, expected_sequence_sparsity: 0.8712, target_sparsity: 0.67, step: 17400 lambda_1: -0.5653, lambda_2: 75.6840 lambda_3: 0.0000 train remain: [0.94 0.78 0.5 0.54 0.52 0.39 0.46 0.2 0.11 0.13] infer remain: [0.92, 0.76, 0.48, 0.54, 0.52, 0.38, 0.46, 0.2, 0.12, 0.12] layerwise remain: [1.0, 1.0, 0.92, 0.7, 0.34, 0.18, 0.09, 0.04, 0.02, 0.0, 0.0, 0.0] 01111111111111011111111111111111111111111111111010 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111110111111011001010110100010000000000000000 10101011110010000001000010000000000000000000000000 10000101000010000000000000000000000000000000000011 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8735 @ step 17100 epoch 148.70 loss: 0.003043, lagrangian_loss: -0.000798, attention_score_distillation_loss: 0.000020 loss: 0.003158, lagrangian_loss: 0.001665, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 15:49:39 Evaluating: f1: 0.8678, eval_loss: 0.7542, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6868, expected_sparsity: 0.6822, expected_sequence_sparsity: 0.8702, target_sparsity: 0.67, step: 17450 lambda_1: 0.3908, lambda_2: 76.8025 lambda_3: 0.0000 train remain: [0.94 0.78 0.5 0.54 0.52 0.39 0.46 0.2 0.12 0.13] infer remain: [0.92, 0.76, 0.5, 0.54, 0.52, 0.4, 0.46, 0.2, 0.12, 0.12] layerwise remain: [1.0, 1.0, 0.92, 0.7, 0.35, 0.19, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 01111111111111011111111111111111111111111111111010 10111011111111101100111111101111110111001101110110 11111111101011110111001011000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101100110010010000000000000000000000 10111111110111111011001010110100010000000000000000 10101011110010000001000010000000000000000000000000 10000101000010000000000000000000100000000000000001 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8735 @ step 17100 epoch 148.70 loss: 0.022671, lagrangian_loss: 0.002830, attention_score_distillation_loss: 0.000020 loss: 0.073721, lagrangian_loss: 0.003918, attention_score_distillation_loss: 0.000020 ETA: 0:27:06 | Epoch 151 finished. Took 33.32 seconds. 
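A note on the per-layer lists repeated in every block: "layerwise remain" appears to be the running product of the "infer remain" ratios, preceded by two 1.0 entries for the stages before the first pruning location, e.g. 0.92 x 0.76 rounds to 0.7 and then x 0.5 gives 0.35. A minimal sketch that reproduces the values printed for step 17450 above, assuming that reading (the function name and 2-digit rounding are illustrative, not taken from the training code):

def layerwise_remain(infer_remain, leading_ones=2, ndigits=2):
    # Cumulative product of per-location keep ratios; earlier stages keep all tokens.
    out = [1.0] * leading_ones
    acc = 1.0
    for ratio in infer_remain:
        acc *= ratio
        out.append(round(acc, ndigits))
    return out

infer_remain_17450 = [0.92, 0.76, 0.5, 0.54, 0.52, 0.4, 0.46, 0.2, 0.12, 0.12]
print(layerwise_remain(infer_remain_17450))
# [1.0, 1.0, 0.92, 0.7, 0.35, 0.19, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0]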
---------------------------------------------------------------------- time: 2023-07-19 15:49:54 Evaluating: f1: 0.8697, eval_loss: 0.7745, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6868, expected_sparsity: 0.6822, expected_sequence_sparsity: 0.8702, target_sparsity: 0.67, step: 17500 lambda_1: 1.1375, lambda_2: 77.5334 lambda_3: 0.0000 train remain: [0.94 0.79 0.5 0.54 0.53 0.4 0.46 0.2 0.12 0.13] infer remain: [0.92, 0.76, 0.5, 0.54, 0.52, 0.4, 0.46, 0.2, 0.12, 0.12] layerwise remain: [1.0, 1.0, 0.92, 0.7, 0.35, 0.19, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 01111111111111011111111111111111111111111111111010 10111011111111101100111111101111110111001101110110 11111111101011110111001011000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101100110010010000000000000000000000 10111111110111111011001010110100010000000000000000 10001011110010000001000000000000000000000000010001 10000101000010000000000000000000100000000000000001 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8735 @ step 17100 epoch 148.70 loss: 0.003659, lagrangian_loss: 0.003443, attention_score_distillation_loss: 0.000020 loss: 0.002455, lagrangian_loss: 0.002364, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 15:50:09 Evaluating: f1: 0.8616, eval_loss: 0.7418, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6868, expected_sparsity: 0.6816, expected_sequence_sparsity: 0.87, target_sparsity: 0.67, step: 17550 lambda_1: 1.5023, lambda_2: 77.7559 lambda_3: 0.0000 train remain: [0.95 0.79 0.51 0.55 0.53 0.4 0.47 0.21 0.13 0.14] infer remain: [0.92, 0.76, 0.5, 0.54, 0.54, 0.4, 0.46, 0.22, 0.12, 0.14] layerwise remain: [1.0, 1.0, 0.92, 0.7, 0.35, 0.19, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 01111111111111011111111111111111111111111111111010 10111011111111101100111111101111110111001101110110 11111111101011110111001011000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111111110001100100000000000000000 10111111111111101100110010010000000000000000000000 10111111010111111011001010110100010000000000000001 10001011110010000001000000000000000010000000010001 10000101000010000000000000000000100000000000000001 10000000010100000001000000010000000000000000000011 Best eval score so far: 0.8735 @ step 17100 epoch 148.70 loss: 0.140542, lagrangian_loss: -0.000501, attention_score_distillation_loss: 0.000020 loss: 0.005515, lagrangian_loss: -0.003270, attention_score_distillation_loss: 0.000020 ETA: 0:26:32 | Epoch 152 finished. Took 33.34 seconds. 
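The ten 50-character rows printed after each "layerwise remain" list read as per-location token-bin keep masks (50 bins per sequence): the fraction of 1s in a row matches that location's "infer remain" entry. Checking this against the first row of the step-17500 block above (a reading of the log, not code from the run):

row = "01111111111111011111111111111111111111111111111010"  # first pruning location, step 17500
print(len(row), row.count("1") / len(row))  # 50 0.92 -> matches infer remain[0] = 0.92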
---------------------------------------------------------------------- time: 2023-07-19 15:50:24 Evaluating: f1: 0.8699, eval_loss: 0.7265, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6853, expected_sparsity: 0.6785, expected_sequence_sparsity: 0.8687, target_sparsity: 0.67, step: 17600 lambda_1: 1.3356, lambda_2: 77.8588 lambda_3: 0.0000 train remain: [0.95 0.8 0.52 0.55 0.54 0.41 0.48 0.23 0.14 0.15] infer remain: [0.92, 0.78, 0.5, 0.54, 0.54, 0.4, 0.48, 0.22, 0.14, 0.14] layerwise remain: [1.0, 1.0, 0.92, 0.72, 0.36, 0.19, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 01111111111111011111111111111111111111111111111010 10111011111111101101111111101111110111001101110110 11111111101011110111001011000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111111110001100100000000000000000 10111111111111101100110010010000000000000000000000 10111111010111111011001010110100110000000000000001 10001011110010000001000000000000000010000000010001 10000101000010000000000000000000100000000000000011 10000000010100000001000000010000000000000000000011 Best eval score so far: 0.8735 @ step 17100 epoch 148.70 loss: 0.004088, lagrangian_loss: -0.002679, attention_score_distillation_loss: 0.000020 loss: 0.003141, lagrangian_loss: -0.002640, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 15:50:38 Evaluating: f1: 0.8714, eval_loss: 0.7308, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6821, expected_sparsity: 0.6761, expected_sequence_sparsity: 0.8677, target_sparsity: 0.67, step: 17650 lambda_1: 0.7073, lambda_2: 78.3977 lambda_3: 0.0000 train remain: [0.95 0.8 0.52 0.55 0.54 0.41 0.48 0.24 0.15 0.16] infer remain: [0.92, 0.78, 0.52, 0.54, 0.54, 0.4, 0.48, 0.24, 0.14, 0.16] layerwise remain: [1.0, 1.0, 0.92, 0.72, 0.37, 0.2, 0.11, 0.04, 0.02, 0.01, 0.0, 0.0] 01111111111111011111111111111111111111111111111010 10111011111111101101111111101111110111001101110110 11111111101011110111001011000010100110100010000000 11111111111111111100110001010010011010000000000100 11111111111101111111111110001100100000000000000000 10111111111111101100110010010000000000000000000000 10111111010111111011001010110100110000000000000001 10001011110010010001000000000000000010000000010001 10000101000010000000000000000000100000000000000011 10000000000100000001000000010000000000000000010111 Best eval score so far: 0.8735 @ step 17100 epoch 148.70 loss: 0.002142, lagrangian_loss: -0.001575, attention_score_distillation_loss: 0.000020 loss: 0.003520, lagrangian_loss: 0.000979, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 15:50:53 Evaluating: f1: 0.8821, eval_loss: 0.6836, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6853, expected_sparsity: 0.6785, expected_sequence_sparsity: 0.8687, target_sparsity: 0.67, step: 17700 lambda_1: -0.2270, lambda_2: 79.5347 lambda_3: 0.0000 train remain: [0.95 0.8 0.52 0.55 0.54 0.41 0.48 0.23 0.14 0.15] infer remain: [0.92, 0.78, 0.5, 0.54, 0.54, 0.4, 0.48, 0.22, 0.14, 0.16] layerwise remain: [1.0, 1.0, 0.92, 0.72, 0.36, 0.19, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 01111111111111011111111111111111111111111111111010 10111011111111101101111111101111110111001101110110 11111111101011110111001011000010100110100000000000 11111111111111111100110001010010011010000000000100 
11111111111101111111111110001100100000000000000000 10111111111111101100110010010000000000000000000000 10111111010111111011001010110100110000000000000001 10001011110010000001000000000000000010000000010001 10000101000010000000000000000000100000000000000011 10000000010100000001000000010000000000000000010011 Best eval score so far: 0.8735 @ step 17100 epoch 148.70 Saving the best model so far: [Epoch 153 | Step: 17700 | MACs sparsity: 0.6853 | Score: 0.8821 | Loss: 0.6836] loss: 0.146984, lagrangian_loss: 0.002908, attention_score_distillation_loss: 0.000020 ETA: 0:26:02 | Epoch 153 finished. Took 47.88 seconds. loss: 0.122997, lagrangian_loss: 0.002678, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 15:51:20 Evaluating: f1: 0.8744, eval_loss: 0.7094, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6853, expected_sparsity: 0.6786, expected_sequence_sparsity: 0.8688, target_sparsity: 0.67, step: 17750 lambda_1: -0.9625, lambda_2: 80.2731 lambda_3: 0.0000 train remain: [0.95 0.79 0.52 0.55 0.53 0.4 0.47 0.22 0.13 0.14] infer remain: [0.92, 0.78, 0.5, 0.54, 0.54, 0.4, 0.46, 0.22, 0.12, 0.14] layerwise remain: [1.0, 1.0, 0.92, 0.72, 0.36, 0.19, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 01111111111111011111111111111111111111111111111010 10111011111111101101111111101111110111001101110110 11111111101011110111001011000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111111110001100100000000000000000 10111111111111101100110010010000000000000000000000 10111111010111111011001010110100010000000000000001 10001011110010000001000000000000000010000000010001 10000101000010000000000000000000100000000000000001 10000000010100000001000000010000000000000000000011 Best eval score so far: 0.8821 @ step 17700 epoch 153.91 loss: 0.005594, lagrangian_loss: 0.005160, attention_score_distillation_loss: 0.000020 loss: 0.003260, lagrangian_loss: 0.004361, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 15:51:35 Evaluating: f1: 0.867, eval_loss: 0.7239, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6868, expected_sparsity: 0.6822, expected_sequence_sparsity: 0.8702, target_sparsity: 0.67, step: 17800 lambda_1: -1.3775, lambda_2: 80.5314 lambda_3: 0.0000 train remain: [0.95 0.79 0.51 0.55 0.53 0.4 0.46 0.2 0.12 0.13] infer remain: [0.92, 0.76, 0.5, 0.54, 0.52, 0.4, 0.46, 0.2, 0.12, 0.12] layerwise remain: [1.0, 1.0, 0.92, 0.7, 0.35, 0.19, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 01111111111111011111111111111111111111111111111010 10111011111111101100111111101111110111001101110110 11111111101011110111001011000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101100110010010000000000000000000000 10111111110111111011001010110100010000000000000000 10001011110010000001000000000000000000000000010001 10000101000010000000000000000000100000000000000001 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8821 @ step 17700 epoch 153.91 loss: 0.002171, lagrangian_loss: 0.002651, attention_score_distillation_loss: 0.000020 ETA: 0:25:28 | Epoch 154 finished. Took 33.14 seconds. 
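The "Best eval score so far" and "Saving the best model so far: [...]" lines reflect straightforward best-checkpoint bookkeeping: the F1 of 0.8821 at step 17700 beats the previous best of 0.8735 from step 17100, so a new best checkpoint is written and later blocks report the updated best. A minimal sketch of that logic; the function and the save callback are placeholders, not the run's actual code:

best_f1 = 0.8735  # previous best, as printed in the log

def maybe_save_best(f1, eval_loss, macs_sparsity, epoch, step, save_model):
    global best_f1
    if f1 > best_f1:
        best_f1 = f1
        print(f"Saving the best model so far: [Epoch {epoch} | Step: {step} | "
              f"MACs sparsity: {macs_sparsity} | Score: {f1} | Loss: {eval_loss}]")
        save_model()

maybe_save_best(0.8821, 0.6836, 0.6853, 153, 17700, save_model=lambda: None)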
loss: 0.002710, lagrangian_loss: 0.001142, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 15:51:50 Evaluating: f1: 0.8632, eval_loss: 0.7154, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6868, expected_sparsity: 0.6822, expected_sequence_sparsity: 0.8702, target_sparsity: 0.67, step: 17850 lambda_1: -1.5787, lambda_2: 80.6145 lambda_3: 0.0000 train remain: [0.95 0.79 0.51 0.54 0.53 0.4 0.46 0.2 0.11 0.13] infer remain: [0.92, 0.76, 0.5, 0.54, 0.52, 0.4, 0.46, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 0.92, 0.7, 0.35, 0.19, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 01111111111111011111111111111111111111111111111010 10111011111111101100111111101111110111001101110110 11111111101011110111001011000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101100110010010000000000000000000000 10111111110111111011001010110100010000000000000000 10101011110010000001000010000000000000000000000000 10000101000010000000000000000000000000000000000001 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8821 @ step 17700 epoch 153.91 loss: 0.006560, lagrangian_loss: -0.000010, attention_score_distillation_loss: 0.000020 loss: 0.005817, lagrangian_loss: -0.000090, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 15:52:04 Evaluating: f1: 0.8621, eval_loss: 0.7683, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6868, expected_sparsity: 0.6822, expected_sequence_sparsity: 0.8702, target_sparsity: 0.67, step: 17900 lambda_1: -1.5033, lambda_2: 80.6592 lambda_3: 0.0000 train remain: [0.95 0.78 0.5 0.54 0.52 0.4 0.46 0.2 0.11 0.13] infer remain: [0.92, 0.76, 0.5, 0.54, 0.52, 0.4, 0.46, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 0.92, 0.7, 0.35, 0.19, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 01111111111111011111111111111111111111111111111010 10111011111111101100111111101111110111001101110110 11111111101011110111001011000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101100110010010000000000000000000000 10111111110111111011001010110100010000000000000000 10101011110010000001000010000000000000000000000000 10000101000010000000000000000000000000000000000001 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8821 @ step 17700 epoch 153.91 loss: 0.005282, lagrangian_loss: -0.002634, attention_score_distillation_loss: 0.000020 loss: 0.004886, lagrangian_loss: -0.001457, attention_score_distillation_loss: 0.000020 ETA: 0:24:54 | Epoch 155 finished. Took 33.15 seconds. 
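The lagrangian_loss printed with each training step, together with the lambda_1/lambda_2 multipliers and the gap between expected_sparsity and target_sparsity, is consistent with the Lagrangian relaxation commonly used for sparsity-controlled pruning: a linear plus quadratic penalty on the sparsity gap whose multipliers are trained adversarially, which is why the term can go negative and why lambda_1 drifts in sign as the sparsity crosses the target. A sketch of that standard form, under the assumption that this run uses it; the exact expression in the training code is not shown here:

def sparsity_lagrangian(expected_sparsity, target_sparsity, lambda_1, lambda_2):
    # The model minimizes this term while lambda_1/lambda_2 are updated to maximize it,
    # pulling expected_sparsity toward target_sparsity.
    gap = expected_sparsity - target_sparsity
    return lambda_1 * gap + lambda_2 * gap * gap

print(round(sparsity_lagrangian(0.68, 0.67, 1.0, 80.0), 4))  # 0.018, illustrative numbers rather than values from this run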
---------------------------------------------------------------------- time: 2023-07-19 15:52:19 Evaluating: f1: 0.87, eval_loss: 0.6913, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.69, expected_sparsity: 0.6844, expected_sequence_sparsity: 0.8711, target_sparsity: 0.67, step: 17950 lambda_1: -1.2109, lambda_2: 80.7915 lambda_3: 0.0000 train remain: [0.95 0.78 0.5 0.54 0.52 0.4 0.45 0.2 0.11 0.12] infer remain: [0.92, 0.76, 0.48, 0.54, 0.52, 0.4, 0.46, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 0.92, 0.7, 0.34, 0.18, 0.09, 0.04, 0.02, 0.0, 0.0, 0.0] 01111111111111011111111111111111111111111111111010 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101001110010010000000000000000000000 10111111110111111011001010110100010000000000000000 10101011110010000001000000000000010000000000000000 10000101000010000000000010000000000000000000000000 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8821 @ step 17700 epoch 153.91 loss: 0.003436, lagrangian_loss: -0.002359, attention_score_distillation_loss: 0.000020 loss: 0.004639, lagrangian_loss: -0.002327, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 15:52:34 Evaluating: f1: 0.869, eval_loss: 0.7136, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.69, expected_sparsity: 0.6844, expected_sequence_sparsity: 0.8711, target_sparsity: 0.67, step: 18000 lambda_1: -0.7878, lambda_2: 81.0432 lambda_3: 0.0000 train remain: [0.95 0.78 0.5 0.54 0.52 0.39 0.45 0.2 0.11 0.12] infer remain: [0.92, 0.76, 0.48, 0.54, 0.52, 0.4, 0.46, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 0.92, 0.7, 0.34, 0.18, 0.09, 0.04, 0.02, 0.0, 0.0, 0.0] 01111111111111011111111111111111111111111111111010 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101001110010010000000000000000000000 10111111110111111011001010110100010000000000000000 10101011110010000001000010000000000000000000000000 10000101000010000000000010000000000000000000000000 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8821 @ step 17700 epoch 153.91 loss: 0.001982, lagrangian_loss: -0.001456, attention_score_distillation_loss: 0.000020 loss: 0.003408, lagrangian_loss: -0.000841, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 15:52:49 Evaluating: f1: 0.8681, eval_loss: 0.7329, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.69, expected_sparsity: 0.6844, expected_sequence_sparsity: 0.8711, target_sparsity: 0.67, step: 18050 lambda_1: -0.2910, lambda_2: 81.3843 lambda_3: 0.0000 train remain: [0.95 0.78 0.5 0.54 0.52 0.39 0.45 0.2 0.11 0.12] infer remain: [0.92, 0.76, 0.48, 0.54, 0.52, 0.4, 0.46, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 0.92, 0.7, 0.34, 0.18, 0.09, 0.04, 0.02, 0.0, 0.0, 0.0] 01111111111111011111111111111111111111111111111010 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 
11111111111101111111110110001100100000000000000000 10111111111111101001110010010000000000000000000000 10111111110111111011001010110100010000000000000000 10101011110010000001000000000000010000000000000000 10000101000010000000000000000000000000000000000001 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8821 @ step 17700 epoch 153.91 loss: 0.002550, lagrangian_loss: -0.000235, attention_score_distillation_loss: 0.000020 ETA: 0:24:20 | Epoch 156 finished. Took 35.5 seconds. loss: 0.003243, lagrangian_loss: 0.000125, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 15:53:03 Evaluating: f1: 0.8694, eval_loss: 0.7256, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.69, expected_sparsity: 0.6844, expected_sequence_sparsity: 0.8711, target_sparsity: 0.67, step: 18100 lambda_1: 0.1170, lambda_2: 81.6176 lambda_3: 0.0000 train remain: [0.95 0.78 0.5 0.54 0.52 0.39 0.45 0.2 0.11 0.13] infer remain: [0.92, 0.76, 0.48, 0.54, 0.52, 0.4, 0.46, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 0.92, 0.7, 0.34, 0.18, 0.09, 0.04, 0.02, 0.0, 0.0, 0.0] 01111111111111011111111111111111111111111111111010 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101001110010010000000000000000000000 10111111110111111011001010110100010000000000000000 10101011110010000001000000000000000000000000000001 10000101000010000000000000000000000000000000000001 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8821 @ step 17700 epoch 153.91 loss: 0.002923, lagrangian_loss: 0.000694, attention_score_distillation_loss: 0.000020 loss: 0.003525, lagrangian_loss: 0.000997, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 15:53:18 Evaluating: f1: 0.8776, eval_loss: 0.7111, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.69, expected_sparsity: 0.6844, expected_sequence_sparsity: 0.8711, target_sparsity: 0.67, step: 18150 lambda_1: 0.4285, lambda_2: 81.7666 lambda_3: 0.0000 train remain: [0.95 0.78 0.5 0.54 0.52 0.4 0.46 0.2 0.12 0.13] infer remain: [0.92, 0.76, 0.48, 0.54, 0.52, 0.4, 0.46, 0.2, 0.12, 0.12] layerwise remain: [1.0, 1.0, 0.92, 0.7, 0.34, 0.18, 0.09, 0.04, 0.02, 0.0, 0.0, 0.0] 01111111111111011111111111111111111111111111111010 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101001110010010000000000000000000000 10111111110111111011001010110100010000000000000000 10101011110010000001000000000000000000000000000001 10000101000010000000000000000000100000000000000001 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8821 @ step 17700 epoch 153.91 loss: 0.003168, lagrangian_loss: -0.000075, attention_score_distillation_loss: 0.000020 ETA: 0:23:46 | Epoch 157 finished. Took 33.38 seconds. 
loss: 0.004004, lagrangian_loss: 0.001173, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 15:53:33 Evaluating: f1: 0.8721, eval_loss: 0.7514, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6868, expected_sparsity: 0.6822, expected_sequence_sparsity: 0.8702, target_sparsity: 0.67, step: 18200 lambda_1: 0.5869, lambda_2: 81.8274 lambda_3: 0.0000 train remain: [0.95 0.78 0.5 0.54 0.53 0.4 0.46 0.2 0.12 0.13] infer remain: [0.92, 0.76, 0.5, 0.54, 0.52, 0.4, 0.46, 0.2, 0.12, 0.12] layerwise remain: [1.0, 1.0, 0.92, 0.7, 0.35, 0.19, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 01111111111111011111111111111111111111111111111010 10111011111111101100111111101111110111001101110110 11111111101011110111001011000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101100110010010000000000000000000000 10111111110111111011001010110100010000000000000000 10001011110010000001000000000000000000000000010001 10000101000010000000000000000000100000000000000001 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8821 @ step 17700 epoch 153.91 loss: 0.002964, lagrangian_loss: -0.000306, attention_score_distillation_loss: 0.000020 loss: 0.001471, lagrangian_loss: -0.000497, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 15:53:48 Evaluating: f1: 0.8473, eval_loss: 0.7448, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6868, expected_sparsity: 0.6821, expected_sequence_sparsity: 0.8702, target_sparsity: 0.67, step: 18250 lambda_1: 0.5351, lambda_2: 81.8591 lambda_3: 0.0000 train remain: [0.96 0.79 0.51 0.54 0.53 0.4 0.46 0.21 0.12 0.13] infer remain: [0.92, 0.76, 0.5, 0.54, 0.52, 0.4, 0.46, 0.2, 0.12, 0.14] layerwise remain: [1.0, 1.0, 0.92, 0.7, 0.35, 0.19, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 01111111111111011111111111111111111111111111111010 10111011111111101100111111101111110111001101110110 11111111101011110111001011000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101100110010010000000000000000000000 10111111010111111011001010110100010000000000000001 10001011110010000001000000000000000000000000010001 10000101000010000000000000000000100000000000000001 10000000010100000001000000010000000000000000000011 Best eval score so far: 0.8821 @ step 17700 epoch 153.91 loss: 0.002930, lagrangian_loss: -0.000823, attention_score_distillation_loss: 0.000020 loss: 0.002938, lagrangian_loss: -0.000081, attention_score_distillation_loss: 0.000020 ETA: 0:23:12 | Epoch 158 finished. Took 33.31 seconds. 
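On "train remain" versus "infer remain": the training-time values are fractional (e.g. 0.95, 0.79, 0.51) while the inference-time values are multiples of 1/50 (e.g. 0.92, 0.76, 0.5), which suggests soft keep-probabilities during training and a binarized 50-bin mask at inference. A toy illustration of that distinction with made-up gate values; the 0.5 threshold is an assumption, not something stated in the log:

soft_gates = [0.97, 0.9, 0.85, 0.6, 0.55, 0.3, 0.2, 0.1, 0.05, 0.02]   # made-up keep probabilities
train_remain = sum(soft_gates) / len(soft_gates)                       # expected (soft) keep ratio
infer_remain = sum(g > 0.5 for g in soft_gates) / len(soft_gates)      # hard keep ratio after thresholding
print(round(train_remain, 2), infer_remain)  # 0.45 0.5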
---------------------------------------------------------------------- time: 2023-07-19 15:54:02 Evaluating: f1: 0.8643, eval_loss: 0.7991, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6868, expected_sparsity: 0.6821, expected_sequence_sparsity: 0.8702, target_sparsity: 0.67, step: 18300 lambda_1: 0.3330, lambda_2: 81.9338 lambda_3: 0.0000 train remain: [0.96 0.79 0.51 0.54 0.53 0.4 0.46 0.21 0.12 0.14] infer remain: [0.92, 0.76, 0.5, 0.54, 0.52, 0.4, 0.46, 0.2, 0.12, 0.14] layerwise remain: [1.0, 1.0, 0.92, 0.7, 0.35, 0.19, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 01111111111111011111111111111111111111111111111010 10111011111111101100111111101111110111001101110110 11111111101011110111001011000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101100110010010000000000000000000000 10111111010111111011001010110100010000000000000001 10001011110010000001000000000000000000000000010001 10000101000010000000000000000000100000000000000001 10000000010100000001000000010000000000000000000011 Best eval score so far: 0.8821 @ step 17700 epoch 153.91 loss: 0.001908, lagrangian_loss: -0.000282, attention_score_distillation_loss: 0.000020 loss: 0.006030, lagrangian_loss: -0.000017, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 15:54:17 Evaluating: f1: 0.8586, eval_loss: 0.7456, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6868, expected_sparsity: 0.6821, expected_sequence_sparsity: 0.8702, target_sparsity: 0.67, step: 18350 lambda_1: 0.0030, lambda_2: 82.0949 lambda_3: 0.0000 train remain: [0.96 0.79 0.51 0.54 0.53 0.4 0.46 0.21 0.12 0.14] infer remain: [0.92, 0.76, 0.5, 0.54, 0.52, 0.4, 0.46, 0.2, 0.12, 0.14] layerwise remain: [1.0, 1.0, 0.92, 0.7, 0.35, 0.19, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 01111111111111011111111111111111111111111111111010 10111011111111101100111111101111110111001101110110 11111111101011110111001011000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101100110010010000000000000000000000 10111111010111111011001010110100010000000000000001 10001011110010000001000000000000000000000000010001 10000101000010000000000000000000100000000000000001 10000000010100000001000000010000000000000000000011 Best eval score so far: 0.8821 @ step 17700 epoch 153.91 loss: 0.002210, lagrangian_loss: 0.000419, attention_score_distillation_loss: 0.000020 loss: 0.001725, lagrangian_loss: 0.000949, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 15:54:32 Evaluating: f1: 0.8723, eval_loss: 0.7942, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6868, expected_sparsity: 0.6821, expected_sequence_sparsity: 0.8702, target_sparsity: 0.67, step: 18400 lambda_1: -0.4295, lambda_2: 82.3513 lambda_3: 0.0000 train remain: [0.96 0.79 0.51 0.54 0.53 0.4 0.46 0.21 0.12 0.13] infer remain: [0.92, 0.76, 0.5, 0.54, 0.52, 0.4, 0.46, 0.2, 0.12, 0.14] layerwise remain: [1.0, 1.0, 0.92, 0.7, 0.35, 0.19, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 01111111111111011111111111111111111111111111111010 10111011111111101100111111101111110111001101110110 11111111101011110111001011000010100110100000000000 11111111111111111100110001010010011010000000000100 
11111111111101111111110110001100100000000000000000 10111111111111101100110010010000000000000000000000 10111111010111111011001010110100010000000000000001 10001011110010000001000000000000000000000000010001 10000101000010000000000000000000100000000000000001 10000000010100000001000000010000000000000000000011 Best eval score so far: 0.8821 @ step 17700 epoch 153.91 ETA: 0:22:38 | Epoch 159 finished. Took 35.6 seconds. loss: 0.002749, lagrangian_loss: 0.000648, attention_score_distillation_loss: 0.000020 loss: 0.004844, lagrangian_loss: 0.000975, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 15:54:47 Evaluating: f1: 0.8601, eval_loss: 0.7596, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6868, expected_sparsity: 0.6824, expected_sequence_sparsity: 0.8703, target_sparsity: 0.67, step: 18450 lambda_1: -0.7523, lambda_2: 82.5068 lambda_3: 0.0000 train remain: [0.96 0.79 0.51 0.54 0.52 0.39 0.46 0.2 0.12 0.13] infer remain: [0.92, 0.76, 0.5, 0.54, 0.52, 0.38, 0.46, 0.2, 0.12, 0.12] layerwise remain: [1.0, 1.0, 0.92, 0.7, 0.35, 0.19, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 01111111111111011111111111111111111111111111111010 10111011111111101100111111101111110111001101110110 11111111101011110111001011000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111110111111011001010110100010000000000000000 10101011110010000001000000000000000000000000000001 10000101000010000000000000000000100000000000000001 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8821 @ step 17700 epoch 153.91 loss: 0.001489, lagrangian_loss: 0.000047, attention_score_distillation_loss: 0.000020 loss: 0.003326, lagrangian_loss: -0.000387, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 15:55:02 Evaluating: f1: 0.8596, eval_loss: 0.7291, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.69, expected_sparsity: 0.6847, expected_sequence_sparsity: 0.8712, target_sparsity: 0.67, step: 18500 lambda_1: -0.8178, lambda_2: 82.5507 lambda_3: 0.0000 train remain: [0.96 0.78 0.5 0.54 0.52 0.39 0.45 0.2 0.11 0.12] infer remain: [0.92, 0.76, 0.48, 0.54, 0.52, 0.38, 0.46, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 0.92, 0.7, 0.34, 0.18, 0.09, 0.04, 0.02, 0.0, 0.0, 0.0] 01111111111111011111111111111111111111111111111010 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111110111111011001010110100010000000000000000 10101011110010000001000010000000000000000000000000 10000101000010000000000010000000000000000000000000 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8821 @ step 17700 epoch 153.91 loss: 0.009153, lagrangian_loss: -0.000661, attention_score_distillation_loss: 0.000020 ETA: 0:22:04 | Epoch 160 finished. Took 33.31 seconds. 
loss: 0.002924, lagrangian_loss: -0.000367, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 15:55:17 Evaluating: f1: 0.8693, eval_loss: 0.7611, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.69, expected_sparsity: 0.6847, expected_sequence_sparsity: 0.8713, target_sparsity: 0.67, step: 18550 lambda_1: -0.7382, lambda_2: 82.5837 lambda_3: 0.0000 train remain: [0.96 0.78 0.5 0.54 0.52 0.39 0.45 0.2 0.11 0.12] infer remain: [0.92, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 0.92, 0.7, 0.34, 0.18, 0.09, 0.04, 0.02, 0.0, 0.0, 0.0] 01111111111111011111111111111111111111111111111010 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010110100010000000000000000 10101011110010000001000010000000000000000000000000 10000101000010000000000010000000000000000000000000 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8821 @ step 17700 epoch 153.91 loss: 0.002700, lagrangian_loss: -0.000655, attention_score_distillation_loss: 0.000020 loss: 0.051913, lagrangian_loss: -0.000631, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 15:55:31 Evaluating: f1: 0.8654, eval_loss: 0.7536, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.69, expected_sparsity: 0.6847, expected_sequence_sparsity: 0.8713, target_sparsity: 0.67, step: 18600 lambda_1: -0.5083, lambda_2: 82.6806 lambda_3: 0.0000 train remain: [0.96 0.78 0.5 0.54 0.52 0.39 0.45 0.2 0.11 0.12] infer remain: [0.92, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 0.92, 0.7, 0.34, 0.18, 0.09, 0.04, 0.02, 0.0, 0.0, 0.0] 01111111111111011111111111111111111111111111111010 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010110100010000000000000000 10101011110010000001000000000000010000000000000000 10000101000010000000000010000000000000000000000000 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8821 @ step 17700 epoch 153.91 loss: 0.002180, lagrangian_loss: -0.000433, attention_score_distillation_loss: 0.000020 loss: 0.006298, lagrangian_loss: -0.000154, attention_score_distillation_loss: 0.000020 ETA: 0:21:30 | Epoch 161 finished. Took 33.22 seconds. 
---------------------------------------------------------------------- time: 2023-07-19 15:55:46 Evaluating: f1: 0.8714, eval_loss: 0.7381, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.69, expected_sparsity: 0.6847, expected_sequence_sparsity: 0.8713, target_sparsity: 0.67, step: 18650 lambda_1: -0.2051, lambda_2: 82.8126 lambda_3: 0.0000 train remain: [0.95 0.78 0.5 0.54 0.52 0.39 0.45 0.2 0.11 0.12] infer remain: [0.92, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 0.92, 0.7, 0.34, 0.18, 0.09, 0.04, 0.02, 0.0, 0.0, 0.0] 01111111111111011111111111111111111111111111111010 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010110100010000000000000000 10101011110010000001000000000000010000000000000000 10000101000010000000000010000000000000000000000000 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8821 @ step 17700 epoch 153.91 loss: 0.001190, lagrangian_loss: -0.000127, attention_score_distillation_loss: 0.000020 loss: 0.001193, lagrangian_loss: 0.000084, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 15:56:00 Evaluating: f1: 0.8776, eval_loss: 0.724, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.69, expected_sparsity: 0.6847, expected_sequence_sparsity: 0.8713, target_sparsity: 0.67, step: 18700 lambda_1: 0.0665, lambda_2: 82.9393 lambda_3: 0.0000 train remain: [0.95 0.78 0.5 0.54 0.52 0.39 0.45 0.2 0.11 0.13] infer remain: [0.92, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 0.92, 0.7, 0.34, 0.18, 0.09, 0.04, 0.02, 0.0, 0.0, 0.0] 01111111111111011111111111111111111111111111111010 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010110100010000000000000000 10101011110010000001000000000000000000000000000001 10000101000010000000000000000000000000000000000001 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8821 @ step 17700 epoch 153.91 loss: 0.001821, lagrangian_loss: 0.000202, attention_score_distillation_loss: 0.000020 loss: 0.002870, lagrangian_loss: 0.000822, attention_score_distillation_loss: 0.000020 ETA: 0:20:56 | Epoch 162 finished. Took 33.02 seconds. 
---------------------------------------------------------------------- time: 2023-07-19 15:56:15 Evaluating: f1: 0.8661, eval_loss: 0.7248, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.69, expected_sparsity: 0.6847, expected_sequence_sparsity: 0.8713, target_sparsity: 0.67, step: 18750 lambda_1: 0.2976, lambda_2: 83.0369 lambda_3: 0.0000 train remain: [0.95 0.79 0.5 0.54 0.52 0.39 0.45 0.2 0.11 0.13] infer remain: [0.92, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 0.92, 0.7, 0.34, 0.18, 0.09, 0.04, 0.02, 0.0, 0.0, 0.0] 01111111111111011111111111111111111111111111111010 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010110100010000000000000000 10001011110010000001000000000000000000000000010001 10000101000010000000000010000000000000000000000000 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8821 @ step 17700 epoch 153.91 loss: 0.002990, lagrangian_loss: -0.000052, attention_score_distillation_loss: 0.000020 loss: 0.005201, lagrangian_loss: -0.000114, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 15:56:30 Evaluating: f1: 0.8582, eval_loss: 0.7291, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6868, expected_sparsity: 0.6824, expected_sequence_sparsity: 0.8703, target_sparsity: 0.67, step: 18800 lambda_1: 0.2959, lambda_2: 83.0881 lambda_3: 0.0000 train remain: [0.96 0.79 0.51 0.54 0.53 0.39 0.45 0.2 0.12 0.13] infer remain: [0.92, 0.76, 0.5, 0.54, 0.52, 0.38, 0.46, 0.2, 0.12, 0.12] layerwise remain: [1.0, 1.0, 0.92, 0.7, 0.35, 0.19, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 01111111111111011111111111111111111111111111111010 10111011111111101100111111101111110111001101110110 11111111101011110111001011000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111110111111011001010110100010000000000000000 10001011110010000001000000000000000000000000010001 10000101000010000000000000000000100000000000000001 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8821 @ step 17700 epoch 153.91 loss: 0.002691, lagrangian_loss: -0.000055, attention_score_distillation_loss: 0.000020 loss: 0.003214, lagrangian_loss: -0.000085, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 15:56:45 Evaluating: f1: 0.8908, eval_loss: 0.6478, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6868, expected_sparsity: 0.6824, expected_sequence_sparsity: 0.8703, target_sparsity: 0.67, step: 18850 lambda_1: 0.1293, lambda_2: 83.1625 lambda_3: 0.0000 train remain: [0.96 0.79 0.51 0.54 0.53 0.39 0.45 0.2 0.12 0.13] infer remain: [0.92, 0.76, 0.5, 0.54, 0.52, 0.38, 0.46, 0.2, 0.12, 0.12] layerwise remain: [1.0, 1.0, 0.92, 0.7, 0.35, 0.19, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 01111111111111011111111111111111111111111111111010 10111011111111101100111111101111110111001101110110 11111111101011110111001011000010100110100000000000 11111111111111111100110001010010011010000000000100 
11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111110111111011001010110100010000000000000000 10001011110010000001000000000000000000000000010001 10000101000010000000000000000000100000000000000001 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8821 @ step 17700 epoch 153.91 Saving the best model so far: [Epoch 163 | Step: 18850 | MACs sparsity: 0.6868 | Score: 0.8908 | Loss: 0.6478] loss: 0.002464, lagrangian_loss: 0.000052, attention_score_distillation_loss: 0.000020 ETA: 0:20:24 | Epoch 163 finished. Took 43.92 seconds. loss: 0.003624, lagrangian_loss: 0.000180, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 15:57:08 Evaluating: f1: 0.8606, eval_loss: 0.7187, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6868, expected_sparsity: 0.6824, expected_sequence_sparsity: 0.8703, target_sparsity: 0.67, step: 18900 lambda_1: -0.1517, lambda_2: 83.2908 lambda_3: 0.0000 train remain: [0.96 0.79 0.51 0.54 0.53 0.39 0.45 0.2 0.12 0.13] infer remain: [0.92, 0.76, 0.5, 0.54, 0.52, 0.38, 0.46, 0.2, 0.12, 0.12] layerwise remain: [1.0, 1.0, 0.92, 0.7, 0.35, 0.19, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 01111111111111011111111111111111111111111111111010 10111011111111101100111111101111110111001101110110 11111111101011110111001011000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111110111111011001010110100010000000000000000 10001011110010000001000000000000000000000000010001 10000101000010000000000000000000100000000000000001 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.004227, lagrangian_loss: 0.000056, attention_score_distillation_loss: 0.000020 loss: 0.005288, lagrangian_loss: 0.000992, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 15:57:23 Evaluating: f1: 0.8816, eval_loss: 0.6763, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.69, expected_sparsity: 0.6847, expected_sequence_sparsity: 0.8713, target_sparsity: 0.67, step: 18950 lambda_1: -0.3686, lambda_2: 83.3826 lambda_3: 0.0000 train remain: [0.96 0.79 0.5 0.54 0.52 0.39 0.45 0.2 0.11 0.13] infer remain: [0.92, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 0.92, 0.7, 0.34, 0.18, 0.09, 0.04, 0.02, 0.0, 0.0, 0.0] 01111111111111011111111111111111111111111111111010 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010110100010000000000000000 10101011110010000001000000000000000000000000000001 10000101000010000000000010000000000000000000000000 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.004319, lagrangian_loss: 0.000213, attention_score_distillation_loss: 0.000020 ETA: 0:19:50 | Epoch 164 finished. Took 33.47 seconds. 
loss: 0.003914, lagrangian_loss: -0.000035, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 15:57:38 Evaluating: f1: 0.8778, eval_loss: 0.7053, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.69, expected_sparsity: 0.6847, expected_sequence_sparsity: 0.8713, target_sparsity: 0.67, step: 19000 lambda_1: -0.4478, lambda_2: 83.4331 lambda_3: 0.0000 train remain: [0.96 0.79 0.5 0.54 0.52 0.39 0.45 0.2 0.11 0.12] infer remain: [0.92, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 0.92, 0.7, 0.34, 0.18, 0.09, 0.04, 0.02, 0.0, 0.0, 0.0] 01111111111111011111111111111111111111111111111010 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010110100010000000000000000 10101011110010000001000000000000010000000000000000 10000101000010000000000010000000000000000000000000 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.003418, lagrangian_loss: -0.000358, attention_score_distillation_loss: 0.000020 loss: 0.049512, lagrangian_loss: 0.000256, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 15:57:52 Evaluating: f1: 0.8655, eval_loss: 0.7326, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.69, expected_sparsity: 0.6847, expected_sequence_sparsity: 0.8713, target_sparsity: 0.67, step: 19050 lambda_1: -0.4498, lambda_2: 83.4764 lambda_3: 0.0000 train remain: [0.96 0.78 0.5 0.54 0.52 0.39 0.44 0.2 0.11 0.12] infer remain: [0.92, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 0.92, 0.7, 0.34, 0.18, 0.09, 0.04, 0.02, 0.0, 0.0, 0.0] 01111111111111111111111011111111111111111111111010 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010110100010000000000000000 10101011110010000001000010000000000000000000000000 10000101000010000000000010000000000000000000000000 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.101981, lagrangian_loss: -0.000078, attention_score_distillation_loss: 0.000020 loss: 0.002131, lagrangian_loss: 0.000408, attention_score_distillation_loss: 0.000020 ETA: 0:19:16 | Epoch 165 finished. Took 33.38 seconds. 
---------------------------------------------------------------------- time: 2023-07-19 15:58:07 Evaluating: f1: 0.8651, eval_loss: 0.7015, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.69, expected_sparsity: 0.6847, expected_sequence_sparsity: 0.8713, target_sparsity: 0.67, step: 19100 lambda_1: -0.4295, lambda_2: 83.5153 lambda_3: 0.0000 train remain: [0.96 0.78 0.5 0.54 0.52 0.39 0.44 0.2 0.11 0.12] infer remain: [0.92, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 0.92, 0.7, 0.34, 0.18, 0.09, 0.04, 0.02, 0.0, 0.0, 0.0] 01111111111111111111111011111111111111111111111010 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010110100010000000000000000 10101011110010000001000010000000000000000000000000 10000101000010000000000010000000000000000000000000 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.001654, lagrangian_loss: -0.000328, attention_score_distillation_loss: 0.000020 loss: 0.003756, lagrangian_loss: -0.000298, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 15:58:22 Evaluating: f1: 0.8557, eval_loss: 0.7759, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.69, expected_sparsity: 0.6847, expected_sequence_sparsity: 0.8713, target_sparsity: 0.67, step: 19150 lambda_1: -0.2891, lambda_2: 83.5810 lambda_3: 0.0000 train remain: [0.96 0.78 0.5 0.54 0.52 0.39 0.44 0.2 0.11 0.12] infer remain: [0.92, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 0.92, 0.7, 0.34, 0.18, 0.09, 0.04, 0.02, 0.0, 0.0, 0.0] 01111111111111111111111011111111111111111111111010 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010110100010000000000000000 10001011110010000001000010000000010000000000000000 10000101000010000000000010000000000000000000000000 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.011739, lagrangian_loss: -0.000212, attention_score_distillation_loss: 0.000020 loss: 0.003640, lagrangian_loss: -0.000017, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 15:58:37 Evaluating: f1: 0.8654, eval_loss: 0.7522, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.69, expected_sparsity: 0.6847, expected_sequence_sparsity: 0.8713, target_sparsity: 0.67, step: 19200 lambda_1: -0.0734, lambda_2: 83.6706 lambda_3: 0.0000 train remain: [0.96 0.78 0.5 0.54 0.52 0.39 0.44 0.2 0.11 0.12] infer remain: [0.92, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 0.92, 0.7, 0.34, 0.18, 0.09, 0.04, 0.02, 0.0, 0.0, 0.0] 01111111111111111111111011111111111111111111111010 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 
11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010110100010000000000000000 10001011110010000001000010000000010000000000000000 10000101000010000000000010000000000000000000000000 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.118608, lagrangian_loss: 0.000037, attention_score_distillation_loss: 0.000020 ETA: 0:18:42 | Epoch 166 finished. Took 35.53 seconds. loss: 0.001860, lagrangian_loss: 0.000055, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 15:58:51 Evaluating: f1: 0.8581, eval_loss: 0.7507, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.69, expected_sparsity: 0.6847, expected_sequence_sparsity: 0.8713, target_sparsity: 0.67, step: 19250 lambda_1: 0.0298, lambda_2: 83.7407 lambda_3: 0.0000 train remain: [0.96 0.78 0.5 0.54 0.52 0.39 0.44 0.2 0.11 0.12] infer remain: [0.92, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 0.92, 0.7, 0.34, 0.18, 0.09, 0.04, 0.02, 0.0, 0.0, 0.0] 01111111111111111111111011111111111111111111111010 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010110100010000000000000000 10001011110010000001000010000000010000000000000000 10000101000010000000000010000000000000000000000000 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.006087, lagrangian_loss: 0.000041, attention_score_distillation_loss: 0.000020 loss: 0.003031, lagrangian_loss: 0.000068, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 15:59:06 Evaluating: f1: 0.8656, eval_loss: 0.7219, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.69, expected_sparsity: 0.6847, expected_sequence_sparsity: 0.8713, target_sparsity: 0.67, step: 19300 lambda_1: 0.1749, lambda_2: 83.8183 lambda_3: 0.0000 train remain: [0.96 0.78 0.5 0.54 0.52 0.39 0.44 0.2 0.11 0.13] infer remain: [0.92, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 0.92, 0.7, 0.34, 0.18, 0.09, 0.04, 0.02, 0.0, 0.0, 0.0] 01111111111111111111111011111111111111111111111010 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010110100010000000000000000 10001011110010000001000000000000000000000000010001 10000101000010000000000010000000000000000000000000 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.002189, lagrangian_loss: -0.000089, attention_score_distillation_loss: 0.000020 ETA: 0:18:08 | Epoch 167 finished. Took 33.31 seconds. 
loss: 0.001886, lagrangian_loss: 0.000004, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 15:59:21 Evaluating: f1: 0.8611, eval_loss: 0.7705, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.69, expected_sparsity: 0.6847, expected_sequence_sparsity: 0.8713, target_sparsity: 0.67, step: 19350 lambda_1: 0.2345, lambda_2: 83.8586 lambda_3: 0.0000 train remain: [0.96 0.79 0.5 0.54 0.52 0.39 0.44 0.2 0.11 0.13] infer remain: [0.92, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 0.92, 0.7, 0.34, 0.18, 0.09, 0.04, 0.02, 0.0, 0.0, 0.0] 01111111111111111111111011111111111111111111111010 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010110100010000000000000000 10001011110010000001000000000000000000000000010001 10000101000010000000000000000000000000000000000001 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.004097, lagrangian_loss: -0.000028, attention_score_distillation_loss: 0.000020 loss: 0.003245, lagrangian_loss: 0.000013, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 15:59:36 Evaluating: f1: 0.8451, eval_loss: 0.7944, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.69, expected_sparsity: 0.6847, expected_sequence_sparsity: 0.8713, target_sparsity: 0.67, step: 19400 lambda_1: 0.1340, lambda_2: 83.9120 lambda_3: 0.0000 train remain: [0.96 0.79 0.5 0.54 0.52 0.39 0.45 0.2 0.11 0.13] infer remain: [0.92, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.12, 0.12] layerwise remain: [1.0, 1.0, 0.92, 0.7, 0.34, 0.18, 0.09, 0.04, 0.02, 0.0, 0.0, 0.0] 01111111111111111111111011111111111111111111111010 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010110100010000000000000000 10001011110010000001000000000000000000000000010001 10000101000010000000000000000000100000000000000001 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.003030, lagrangian_loss: -0.000049, attention_score_distillation_loss: 0.000020 loss: 0.002807, lagrangian_loss: 0.000085, attention_score_distillation_loss: 0.000020 ETA: 0:17:34 | Epoch 168 finished. Took 33.11 seconds. 
---------------------------------------------------------------------- time: 2023-07-19 15:59:50 Evaluating: f1: 0.8357, eval_loss: 0.7928, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.69, expected_sparsity: 0.6847, expected_sequence_sparsity: 0.8713, target_sparsity: 0.67, step: 19450 lambda_1: -0.0842, lambda_2: 84.0012 lambda_3: 0.0000 train remain: [0.96 0.79 0.5 0.54 0.52 0.39 0.45 0.2 0.11 0.13] infer remain: [0.92, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.12, 0.12] layerwise remain: [1.0, 1.0, 0.92, 0.7, 0.34, 0.18, 0.09, 0.04, 0.02, 0.0, 0.0, 0.0] 01111111111111111111111011111111111111111111111010 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010110100010000000000000000 10001011110010000001000000000000000000000000010001 10000101000010000000000000000000100000000000000001 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.001248, lagrangian_loss: -0.000016, attention_score_distillation_loss: 0.000020 loss: 0.002005, lagrangian_loss: 0.000146, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:00:05 Evaluating: f1: 0.861, eval_loss: 0.7979, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.69, expected_sparsity: 0.6847, expected_sequence_sparsity: 0.8713, target_sparsity: 0.67, step: 19500 lambda_1: -0.2886, lambda_2: 84.0951 lambda_3: 0.0000 train remain: [0.96 0.79 0.5 0.54 0.52 0.39 0.44 0.2 0.11 0.13] infer remain: [0.92, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 0.92, 0.7, 0.34, 0.18, 0.09, 0.04, 0.02, 0.0, 0.0, 0.0] 01111111111111111111111011111111111111111111111010 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010110100010000000000000000 10001011110010000001000000000000000000000000010001 10000101000010000000000010000000000000000000000000 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.002164, lagrangian_loss: 0.000007, attention_score_distillation_loss: 0.000020 loss: 0.002151, lagrangian_loss: 0.000178, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:00:20 Evaluating: f1: 0.8557, eval_loss: 0.7901, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.69, expected_sparsity: 0.6847, expected_sequence_sparsity: 0.8713, target_sparsity: 0.67, step: 19550 lambda_1: -0.3465, lambda_2: 84.1427 lambda_3: 0.0000 train remain: [0.96 0.79 0.5 0.54 0.52 0.39 0.44 0.2 0.11 0.12] infer remain: [0.92, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 0.92, 0.7, 0.34, 0.18, 0.09, 0.04, 0.02, 0.0, 0.0, 0.0] 01111111111111111111111011111111111111111111111010 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 
11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010110100010000000000000000 10001011110010001001000000000000010000000000000000 10000101000010000000000010000000000000000000000000 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 ETA: 0:17:00 | Epoch 169 finished. Took 35.25 seconds. loss: 0.002931, lagrangian_loss: 0.000111, attention_score_distillation_loss: 0.000020 loss: 0.002651, lagrangian_loss: -0.000178, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:00:34 Evaluating: f1: 0.8601, eval_loss: 0.7372, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.69, expected_sparsity: 0.6847, expected_sequence_sparsity: 0.8713, target_sparsity: 0.67, step: 19600 lambda_1: -0.2344, lambda_2: 84.1923 lambda_3: 0.0000 train remain: [0.96 0.79 0.5 0.54 0.52 0.39 0.44 0.2 0.11 0.12] infer remain: [0.92, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 0.92, 0.7, 0.34, 0.18, 0.09, 0.04, 0.02, 0.0, 0.0, 0.0] 01111111111111111111111011111111111111111111111010 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010110100010000000000000000 10001011110010001001000000000000010000000000000000 10000101000010000000000010000000000000000000000000 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.002128, lagrangian_loss: -0.000117, attention_score_distillation_loss: 0.000020 loss: 0.002778, lagrangian_loss: 0.000138, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:00:49 Evaluating: f1: 0.866, eval_loss: 0.7242, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.69, expected_sparsity: 0.6847, expected_sequence_sparsity: 0.8713, target_sparsity: 0.67, step: 19650 lambda_1: -0.0626, lambda_2: 84.2643 lambda_3: 0.0000 train remain: [0.96 0.78 0.5 0.54 0.52 0.39 0.44 0.2 0.11 0.12] infer remain: [0.92, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 0.92, 0.7, 0.34, 0.18, 0.09, 0.04, 0.02, 0.0, 0.0, 0.0] 01111111111111111111111011111111111111111111111010 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010110100010000000000000000 10001011110010001001000000000000010000000000000000 10000101000010000000000010000000000000000000000000 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.004632, lagrangian_loss: 0.000105, attention_score_distillation_loss: 0.000020 ETA: 0:16:26 | Epoch 170 finished. Took 33.1 seconds. 
loss: 0.001214, lagrangian_loss: 0.000008, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:01:04 Evaluating: f1: 0.8678, eval_loss: 0.7561, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.69, expected_sparsity: 0.6847, expected_sequence_sparsity: 0.8713, target_sparsity: 0.67, step: 19700 lambda_1: 0.0857, lambda_2: 84.3253 lambda_3: 0.0000 train remain: [0.96 0.79 0.5 0.54 0.52 0.39 0.44 0.2 0.11 0.12] infer remain: [0.92, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 0.92, 0.7, 0.34, 0.18, 0.09, 0.04, 0.02, 0.0, 0.0, 0.0] 01111111111111111111111011111111111111111111111010 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010110100010000000000000000 10001011110010001001000000000000010000000000000000 10000101000010000000000010000000000000000000000000 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.003981, lagrangian_loss: 0.000091, attention_score_distillation_loss: 0.000020 loss: 0.002792, lagrangian_loss: -0.000035, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:01:18 Evaluating: f1: 0.8748, eval_loss: 0.735, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.69, expected_sparsity: 0.6847, expected_sequence_sparsity: 0.8713, target_sparsity: 0.67, step: 19750 lambda_1: 0.1134, lambda_2: 84.3761 lambda_3: 0.0000 train remain: [0.96 0.79 0.5 0.54 0.52 0.39 0.44 0.2 0.11 0.13] infer remain: [0.92, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 0.92, 0.7, 0.34, 0.18, 0.09, 0.04, 0.02, 0.0, 0.0, 0.0] 01111111111111111111111011111111111111111111111010 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010110100010000000000000000 10001011110010100001000000000000000000000000000001 10000101000010000000000000000000000000000000000001 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.003369, lagrangian_loss: 0.000002, attention_score_distillation_loss: 0.000020 loss: 0.142612, lagrangian_loss: -0.000025, attention_score_distillation_loss: 0.000020 ETA: 0:15:52 | Epoch 171 finished. Took 33.19 seconds. 
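Note on lagrangian_loss: the logged values hover around +/-1e-4 while lambda_1 repeatedly changes sign. A common form for this kind of Lagrangian sparsity penalty is lambda_1 * (s - t) + lambda_2 * (s - t)^2, where s is the model's current expected sparsity and t is target_sparsity (0.67 here); the exact expression used by this codebase is not shown in the log, so the sketch below is an assumption, not a statement about the actual implementation. When s sits essentially on target, both terms nearly vanish, which matches the tiny logged values.

# Hedged sketch of a Lagrangian sparsity penalty; the repo's real code may differ.
def lagrangian_penalty(expected_sparsity, target_sparsity, lambda_1, lambda_2):
    gap = expected_sparsity - target_sparsity
    return lambda_1 * gap + lambda_2 * gap * gap

# Hypothetical soft sparsity very close to the 0.67 target, with the lambdas
# logged at step 19250; the result is on the same order as the logged values.
print(lagrangian_penalty(0.6702, 0.67, lambda_1=0.0298, lambda_2=83.7407))
# ~9e-06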
---------------------------------------------------------------------- time: 2023-07-19 16:01:33 Evaluating: f1: 0.8707, eval_loss: 0.762, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.6684, expected_sequence_sparsity: 0.8646, target_sparsity: 0.67, step: 19800 lambda_1: 0.0361, lambda_2: 84.4315 lambda_3: 0.0000 train remain: [0.96 0.79 0.5 0.54 0.52 0.39 0.44 0.2 0.11 0.13] infer remain: [1.0, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010110100010000000000000000 10001011110010000001000000000000000000000000010001 10000101000010000000000000000000000000000000000001 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.001810, lagrangian_loss: 0.000016, attention_score_distillation_loss: 0.000020 loss: 0.001038, lagrangian_loss: 0.000190, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:01:48 Evaluating: f1: 0.8761, eval_loss: 0.7528, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.6684, expected_sequence_sparsity: 0.8646, target_sparsity: 0.67, step: 19850 lambda_1: -0.2607, lambda_2: 84.5545 lambda_3: 0.0000 train remain: [0.96 0.79 0.5 0.54 0.52 0.39 0.44 0.2 0.11 0.12] infer remain: [1.0, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010110100010000000000000000 10001011110010001001000000000000010000000000000000 10000101000010000000000000000000000000000000000001 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.009421, lagrangian_loss: -0.000022, attention_score_distillation_loss: 0.000020 loss: 0.002408, lagrangian_loss: 0.000031, attention_score_distillation_loss: 0.000020 ETA: 0:15:18 | Epoch 172 finished. Took 33.1 seconds. 
---------------------------------------------------------------------- time: 2023-07-19 16:02:02 Evaluating: f1: 0.8767, eval_loss: 0.7228, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.6684, expected_sequence_sparsity: 0.8646, target_sparsity: 0.67, step: 19900 lambda_1: -0.3633, lambda_2: 84.6072 lambda_3: 0.0000 train remain: [0.96 0.78 0.5 0.54 0.52 0.39 0.44 0.2 0.11 0.12] infer remain: [1.0, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010110100010000000000000000 10001011110010001001000000000000010000000000000000 10000101000010000000000000000000000000000000000001 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.006374, lagrangian_loss: 0.001063, attention_score_distillation_loss: 0.000020 loss: 0.002648, lagrangian_loss: -0.000183, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:02:17 Evaluating: f1: 0.8626, eval_loss: 0.7558, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.6684, expected_sequence_sparsity: 0.8646, target_sparsity: 0.67, step: 19950 lambda_1: -0.3942, lambda_2: 84.6655 lambda_3: 0.0000 train remain: [0.96 0.78 0.5 0.54 0.52 0.39 0.44 0.2 0.11 0.12] infer remain: [1.0, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010110100010000000000000000 10001011110010001001000000000000010000000000000000 10000101000010000000000010000000000000000000000000 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.008101, lagrangian_loss: 0.000185, attention_score_distillation_loss: 0.000020 loss: 0.003919, lagrangian_loss: -0.000078, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:02:32 Evaluating: f1: 0.8832, eval_loss: 0.7133, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.6684, expected_sequence_sparsity: 0.8646, target_sparsity: 0.67, step: 20000 lambda_1: -0.2228, lambda_2: 84.7516 lambda_3: 0.0000 train remain: [0.96 0.78 0.5 0.54 0.52 0.39 0.44 0.2 0.11 0.12] infer remain: [1.0, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 
11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010110100010000000000000000 10001011110010001001000000000000010000000000000000 10000101000010000000000010000000000000000000000000 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.002097, lagrangian_loss: -0.000058, attention_score_distillation_loss: 0.000020 ETA: 0:14:44 | Epoch 173 finished. Took 35.46 seconds. loss: 0.001695, lagrangian_loss: -0.000018, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:02:47 Evaluating: f1: 0.875, eval_loss: 0.7539, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.6684, expected_sequence_sparsity: 0.8646, target_sparsity: 0.67, step: 20050 lambda_1: -0.0045, lambda_2: 84.8690 lambda_3: 0.0000 train remain: [0.96 0.78 0.5 0.54 0.52 0.39 0.44 0.2 0.11 0.12] infer remain: [1.0, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010110100010000000000000000 10001011110010001001000000000000010000000000000000 10000101000010000000000000000000000000000000000001 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.009993, lagrangian_loss: 0.000005, attention_score_distillation_loss: 0.000020 loss: 0.001037, lagrangian_loss: 0.000192, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:03:02 Evaluating: f1: 0.8566, eval_loss: 0.7743, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.6684, expected_sequence_sparsity: 0.8646, target_sparsity: 0.67, step: 20100 lambda_1: 0.2145, lambda_2: 84.9749 lambda_3: 0.0000 train remain: [0.96 0.78 0.5 0.54 0.52 0.39 0.44 0.2 0.11 0.13] infer remain: [1.0, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010110100010000000000000000 10001011110010000001000000000000000000000000010001 10000101000010000000000000000000000000000000000001 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.001210, lagrangian_loss: -0.000008, attention_score_distillation_loss: 0.000020 ETA: 0:14:10 | Epoch 174 finished. Took 33.24 seconds. 
loss: 0.002070, lagrangian_loss: -0.000129, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:03:16 Evaluating: f1: 0.8456, eval_loss: 0.8236, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.6684, expected_sequence_sparsity: 0.8646, target_sparsity: 0.67, step: 20150 lambda_1: 0.1426, lambda_2: 85.0440 lambda_3: 0.0000 train remain: [0.96 0.78 0.5 0.54 0.52 0.39 0.44 0.2 0.11 0.13] infer remain: [1.0, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010110100010000000000000000 10001011110010000001000000000000000000000000010001 10000101000010000000000000000000000000000000000001 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.001194, lagrangian_loss: 0.000153, attention_score_distillation_loss: 0.000020 loss: 0.001844, lagrangian_loss: -0.000002, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:03:31 Evaluating: f1: 0.865, eval_loss: 0.7715, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.6684, expected_sequence_sparsity: 0.8646, target_sparsity: 0.67, step: 20200 lambda_1: -0.0640, lambda_2: 85.1440 lambda_3: 0.0000 train remain: [0.96 0.79 0.5 0.54 0.52 0.39 0.44 0.2 0.11 0.13] infer remain: [1.0, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010110100010000000000000000 10001011110010000001000000000000000000000000010001 10000101000010000000000000000000000000000000000001 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.002895, lagrangian_loss: 0.000336, attention_score_distillation_loss: 0.000020 loss: 0.001708, lagrangian_loss: 0.000037, attention_score_distillation_loss: 0.000020 ETA: 0:13:36 | Epoch 175 finished. Took 33.29 seconds. 
---------------------------------------------------------------------- time: 2023-07-19 16:03:46 Evaluating: f1: 0.8678, eval_loss: 0.7617, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.6684, expected_sequence_sparsity: 0.8646, target_sparsity: 0.67, step: 20250 lambda_1: -0.3265, lambda_2: 85.2593 lambda_3: 0.0000 train remain: [0.96 0.78 0.5 0.54 0.52 0.39 0.44 0.2 0.11 0.12] infer remain: [1.0, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010110100010000000000000000 10001011110010000001000000000000010000000000000001 10000101000010000000000000000000000000000000000001 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.004211, lagrangian_loss: 0.000088, attention_score_distillation_loss: 0.000020 loss: 0.001055, lagrangian_loss: 0.000311, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:04:01 Evaluating: f1: 0.8727, eval_loss: 0.7556, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.6684, expected_sequence_sparsity: 0.8646, target_sparsity: 0.67, step: 20300 lambda_1: -0.3798, lambda_2: 85.3103 lambda_3: 0.0000 train remain: [0.96 0.78 0.5 0.54 0.52 0.39 0.44 0.2 0.11 0.12] infer remain: [1.0, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010110100010000000000000000 10001011110010001001000000000000010000000000000000 10000101000010000000000000000000000000000000000001 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.002133, lagrangian_loss: -0.000364, attention_score_distillation_loss: 0.000020 loss: 0.195103, lagrangian_loss: 0.000328, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:04:15 Evaluating: f1: 0.8703, eval_loss: 0.7479, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.6684, expected_sequence_sparsity: 0.8646, target_sparsity: 0.67, step: 20350 lambda_1: -0.2586, lambda_2: 85.3998 lambda_3: 0.0000 train remain: [0.96 0.78 0.5 0.54 0.52 0.39 0.44 0.2 0.11 0.12] infer remain: [1.0, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 
11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010110100010000000000000000 10001011110010001001000000000000010000000000000000 10000101000010000000000010000000000000000000000000 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.001746, lagrangian_loss: -0.000025, attention_score_distillation_loss: 0.000020 ETA: 0:13:02 | Epoch 176 finished. Took 35.43 seconds. loss: 0.006257, lagrangian_loss: -0.000036, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:04:30 Evaluating: f1: 0.8517, eval_loss: 0.796, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.6684, expected_sequence_sparsity: 0.8646, target_sparsity: 0.67, step: 20400 lambda_1: 0.0371, lambda_2: 85.5357 lambda_3: 0.0000 train remain: [0.96 0.78 0.5 0.54 0.52 0.39 0.44 0.2 0.11 0.12] infer remain: [1.0, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010110100010000000000000000 10001011110010001001000000000000010000000000000000 10000101000010000000000000000000000000000000000001 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.002954, lagrangian_loss: 0.000085, attention_score_distillation_loss: 0.000020 loss: 0.001349, lagrangian_loss: -0.000006, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:04:45 Evaluating: f1: 0.8546, eval_loss: 0.762, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.6684, expected_sequence_sparsity: 0.8646, target_sparsity: 0.67, step: 20450 lambda_1: 0.2252, lambda_2: 85.6391 lambda_3: 0.0000 train remain: [0.96 0.78 0.5 0.54 0.52 0.39 0.44 0.2 0.11 0.13] infer remain: [1.0, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010110100010000000000000000 10001011110010000001010000000000000000000000000001 10000101000010000000000000000000000000000000000001 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.002633, lagrangian_loss: -0.000113, attention_score_distillation_loss: 0.000020 ETA: 0:12:28 | Epoch 177 finished. Took 33.18 seconds. 
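Note on the 0/1 rows printed with each evaluation: they look like per-layer token keep masks over 50 position bins (the run name contains "bin50"), one row per gated layer, and the fraction of 1s in a row lines up with the matching "infer remain" entry. A quick check on the last mask row of the evaluations above (not part of the training code):

# last mask row from the step 20450 block above
mask = "10000000010100000001000000010000000000000000000001"
print(len(mask), mask.count("1") / len(mask))
# 50 0.12  -> matches infer remain[-1] = 0.12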
loss: 0.002853, lagrangian_loss: -0.000058, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:05:00 Evaluating: f1: 0.872, eval_loss: 0.7189, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.6684, expected_sequence_sparsity: 0.8646, target_sparsity: 0.67, step: 20500 lambda_1: 0.1634, lambda_2: 85.7141 lambda_3: 0.0000 train remain: [0.96 0.78 0.5 0.54 0.52 0.39 0.44 0.2 0.11 0.13] infer remain: [1.0, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010110100010000000000000000 10001011110010000001000000000000000000000000010001 10000101000010000000000000000000000000000000000001 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.002780, lagrangian_loss: 0.000012, attention_score_distillation_loss: 0.000020 loss: 0.000718, lagrangian_loss: 0.000004, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:05:14 Evaluating: f1: 0.8651, eval_loss: 0.7357, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.6684, expected_sequence_sparsity: 0.8646, target_sparsity: 0.67, step: 20550 lambda_1: -0.1070, lambda_2: 85.8416 lambda_3: 0.0000 train remain: [0.96 0.78 0.5 0.54 0.52 0.39 0.44 0.2 0.11 0.13] infer remain: [1.0, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010110100010000000000000000 10001011110010000001000000000000000000000000010001 10000101000010000000000000000000000000000000000001 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.002766, lagrangian_loss: 0.000075, attention_score_distillation_loss: 0.000020 loss: 0.001001, lagrangian_loss: -0.000095, attention_score_distillation_loss: 0.000020 ETA: 0:11:54 | Epoch 178 finished. Took 33.17 seconds. 
---------------------------------------------------------------------- time: 2023-07-19 16:05:29 Evaluating: f1: 0.8581, eval_loss: 0.7623, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.6684, expected_sequence_sparsity: 0.8646, target_sparsity: 0.67, step: 20600 lambda_1: -0.2240, lambda_2: 85.9192 lambda_3: 0.0000 train remain: [0.96 0.78 0.5 0.54 0.52 0.39 0.44 0.2 0.11 0.13] infer remain: [1.0, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010110100010000000000000000 10001011110010000001010000000000000000000000000001 10000101000010000000000000000000000000000000000001 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.001774, lagrangian_loss: -0.000107, attention_score_distillation_loss: 0.000020 loss: 0.002291, lagrangian_loss: 0.000061, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:05:44 Evaluating: f1: 0.8616, eval_loss: 0.7231, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.6684, expected_sequence_sparsity: 0.8646, target_sparsity: 0.67, step: 20650 lambda_1: -0.0489, lambda_2: 86.0052 lambda_3: 0.0000 train remain: [0.96 0.78 0.5 0.54 0.52 0.39 0.44 0.2 0.11 0.12] infer remain: [1.0, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010110100010000000000000000 10001011110010000001010010000000000000000000000000 10000101000010000000000000000000000000000000000001 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.001654, lagrangian_loss: 0.000014, attention_score_distillation_loss: 0.000020 loss: 0.004153, lagrangian_loss: 0.000025, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:05:58 Evaluating: f1: 0.8576, eval_loss: 0.7454, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.6684, expected_sequence_sparsity: 0.8646, target_sparsity: 0.67, step: 20700 lambda_1: 0.1441, lambda_2: 86.1123 lambda_3: 0.0000 train remain: [0.96 0.78 0.5 0.54 0.52 0.39 0.44 0.2 0.11 0.13] infer remain: [1.0, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 
11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010110100010000000000000000 10001011110010100001010000000000000000000000000000 10000101000010000000000000000000000000000000000001 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 ETA: 0:11:20 | Epoch 179 finished. Took 35.47 seconds. loss: 0.002002, lagrangian_loss: -0.000043, attention_score_distillation_loss: 0.000020 loss: 0.001561, lagrangian_loss: -0.000045, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:06:13 Evaluating: f1: 0.8571, eval_loss: 0.7436, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.6684, expected_sequence_sparsity: 0.8646, target_sparsity: 0.67, step: 20750 lambda_1: 0.1420, lambda_2: 86.1911 lambda_3: 0.0000 train remain: [0.96 0.79 0.5 0.54 0.52 0.39 0.44 0.2 0.11 0.13] infer remain: [1.0, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010110100010000000000000000 10001011110010100001010000000000000000000000000000 10000101000010000000000000000000000000000000000001 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.002854, lagrangian_loss: 0.000034, attention_score_distillation_loss: 0.000020 loss: 0.001173, lagrangian_loss: 0.000132, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:06:28 Evaluating: f1: 0.87, eval_loss: 0.7184, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.6684, expected_sequence_sparsity: 0.8646, target_sparsity: 0.67, step: 20800 lambda_1: -0.1154, lambda_2: 86.3454 lambda_3: 0.0000 train remain: [0.96 0.79 0.5 0.54 0.52 0.39 0.44 0.2 0.11 0.13] infer remain: [1.0, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010110100010000000000000000 10001011110010100001010000000000000000000000000000 10000101000010000000000000000000000000000000000001 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.091195, lagrangian_loss: 0.000006, attention_score_distillation_loss: 0.000020 ETA: 0:10:46 | Epoch 180 finished. Took 33.15 seconds. 
loss: 0.002847, lagrangian_loss: -0.000003, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:06:42 Evaluating: f1: 0.8566, eval_loss: 0.7497, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.6684, expected_sequence_sparsity: 0.8646, target_sparsity: 0.67, step: 20850 lambda_1: -0.3265, lambda_2: 86.4612 lambda_3: 0.0000 train remain: [0.97 0.78 0.49 0.54 0.52 0.39 0.44 0.2 0.11 0.12] infer remain: [1.0, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111110111111011001010010100010000000000000000 10001011110010000001010010000000000000000000000000 10000101000010000000000000000000000000000000000001 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.002411, lagrangian_loss: 0.000789, attention_score_distillation_loss: 0.000020 loss: 0.002621, lagrangian_loss: -0.000074, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:06:57 Evaluating: f1: 0.8503, eval_loss: 0.8098, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.6684, expected_sequence_sparsity: 0.8646, target_sparsity: 0.67, step: 20900 lambda_1: -0.3795, lambda_2: 86.5651 lambda_3: 0.0000 train remain: [0.97 0.78 0.49 0.54 0.52 0.39 0.44 0.2 0.1 0.12] infer remain: [1.0, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111110111111011001010010100010000000000000000 10001011110010000001010010000000000000000000000000 10000101000010000000000000000000000000000000000001 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.001899, lagrangian_loss: -0.000010, attention_score_distillation_loss: 0.000020 loss: 0.087168, lagrangian_loss: 0.000032, attention_score_distillation_loss: 0.000020 ETA: 0:10:12 | Epoch 181 finished. Took 33.22 seconds. 
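Note on "train remain" vs "infer remain": the two vectors track each other but are not identical (e.g. 0.97 vs 1.0 for the first layer, 0.49 vs 0.48 for the third). A plausible reading, stated here as an assumption rather than a fact about this code, is that train remain is the expected keep ratio of soft gates seen during training, while infer remain comes from binarizing those gates for evaluation. Toy illustration with hypothetical gate scores:

# hypothetical soft keep probabilities for one layer (not taken from this run)
gate_scores = [0.95, 0.90, 0.70, 0.52, 0.48, 0.30]
train_remain = sum(gate_scores) / len(gate_scores)                   # soft / expected ratio
infer_remain = sum(s > 0.5 for s in gate_scores) / len(gate_scores)  # after hard thresholding
print(round(train_remain, 2), round(infer_remain, 2))
# 0.64 0.67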
---------------------------------------------------------------------- time: 2023-07-19 16:07:12 Evaluating: f1: 0.8669, eval_loss: 0.7718, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.6684, expected_sequence_sparsity: 0.8646, target_sparsity: 0.67, step: 20950 lambda_1: -0.1762, lambda_2: 86.6673 lambda_3: 0.0000 train remain: [0.97 0.78 0.49 0.54 0.52 0.39 0.44 0.2 0.1 0.12] infer remain: [1.0, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111110111111011001010010100010000000000000000 10001011110010000001010010000000000000000000000000 10000101000010000000000010000000000000000000000000 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.002364, lagrangian_loss: -0.000024, attention_score_distillation_loss: 0.000020 loss: 0.001172, lagrangian_loss: -0.000001, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:07:27 Evaluating: f1: 0.862, eval_loss: 0.7935, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.6684, expected_sequence_sparsity: 0.8646, target_sparsity: 0.67, step: 21000 lambda_1: 0.0363, lambda_2: 86.7993 lambda_3: 0.0000 train remain: [0.97 0.78 0.49 0.54 0.52 0.39 0.44 0.2 0.1 0.12] infer remain: [1.0, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111110111111011001010010100010000000000000000 10001011110010000001010010000000000000000000000000 10000101000010000000000000000000010000000000000000 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.006221, lagrangian_loss: 0.000035, attention_score_distillation_loss: 0.000020 loss: 0.001526, lagrangian_loss: -0.000000, attention_score_distillation_loss: 0.000020 ETA: 0:09:37 | Epoch 182 finished. Took 33.3 seconds. 
---------------------------------------------------------------------- time: 2023-07-19 16:07:41 Evaluating: f1: 0.8486, eval_loss: 0.7599, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.6684, expected_sequence_sparsity: 0.8646, target_sparsity: 0.67, step: 21050 lambda_1: -0.1781, lambda_2: 86.9663 lambda_3: 0.0000 train remain: [0.97 0.78 0.49 0.54 0.52 0.39 0.44 0.2 0.1 0.12] infer remain: [1.0, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111110111111011001010010100010000000000000000 10001011110010000001010010000000000000000000000000 10000101000010000000000000000000010000000000000000 10000000010100000001000000010000010000000000000000 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.001772, lagrangian_loss: -0.000085, attention_score_distillation_loss: 0.000020 loss: 0.002683, lagrangian_loss: -0.000009, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:07:56 Evaluating: f1: 0.8606, eval_loss: 0.7133, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.6684, expected_sequence_sparsity: 0.8646, target_sparsity: 0.67, step: 21100 lambda_1: -0.4115, lambda_2: 87.1168 lambda_3: 0.0000 train remain: [0.97 0.78 0.49 0.54 0.52 0.39 0.44 0.2 0.1 0.12] infer remain: [1.0, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111110111111011001010010100010000000000000000 10001011110010000001010010000000000000000000000000 10000101000010000000000000000000010000000000000000 10000000010100000001000000010000010000000000000000 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.001809, lagrangian_loss: 0.000197, attention_score_distillation_loss: 0.000020 loss: 0.002033, lagrangian_loss: 0.001337, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:08:11 Evaluating: f1: 0.8532, eval_loss: 0.7781, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.6684, expected_sequence_sparsity: 0.8646, target_sparsity: 0.67, step: 21150 lambda_1: -0.3903, lambda_2: 87.2219 lambda_3: 0.0000 train remain: [0.97 0.78 0.49 0.54 0.51 0.39 0.44 0.2 0.1 0.12] infer remain: [1.0, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 
11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111110111111011001010010100010000000000000000 10001011110010000001010010000000000000000000000000 10000101000010000000000000000000000000000000000001 10000000010100000001000000010000010000000000000000 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.001898, lagrangian_loss: -0.000286, attention_score_distillation_loss: 0.000020 ETA: 0:09:04 | Epoch 183 finished. Took 35.31 seconds. loss: 0.001768, lagrangian_loss: 0.000057, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:08:25 Evaluating: f1: 0.8487, eval_loss: 0.7804, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.6689, expected_sequence_sparsity: 0.8648, target_sparsity: 0.67, step: 21200 lambda_1: -0.1513, lambda_2: 87.4056 lambda_3: 0.0000 train remain: [0.97 0.78 0.49 0.54 0.51 0.39 0.44 0.2 0.1 0.12] infer remain: [1.0, 0.76, 0.48, 0.54, 0.5, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 10111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111110111111011001010010100010000000000000000 10001011110010000001010010000000000000000000000000 10000101000010000000000010000000000000000000000000 10000000010100000001000000010000000000000000000001 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.001702, lagrangian_loss: -0.000065, attention_score_distillation_loss: 0.000020 loss: 0.003401, lagrangian_loss: 0.000128, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:08:40 Evaluating: f1: 0.8527, eval_loss: 0.768, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.6684, expected_sequence_sparsity: 0.8646, target_sparsity: 0.67, step: 21250 lambda_1: 0.2079, lambda_2: 87.6079 lambda_3: 0.0000 train remain: [0.97 0.78 0.49 0.54 0.51 0.39 0.44 0.2 0.1 0.12] infer remain: [1.0, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111110111111011001010010100010000000000000000 10001011110010100001010000000000000000000000000000 10000101000010000000000000000000000000000000000001 10000000000100000001000000010000000000000000000011 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.001500, lagrangian_loss: 0.000536, attention_score_distillation_loss: 0.000020 ETA: 0:08:29 | Epoch 184 finished. Took 33.08 seconds. 
loss: 0.002227, lagrangian_loss: 0.000407, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:08:55 Evaluating: f1: 0.8537, eval_loss: 0.7681, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.6684, expected_sequence_sparsity: 0.8646, target_sparsity: 0.67, step: 21300 lambda_1: 0.3653, lambda_2: 87.7662 lambda_3: 0.0000 train remain: [0.97 0.78 0.49 0.54 0.52 0.39 0.44 0.21 0.11 0.13] infer remain: [1.0, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010010100010000000000000001 10001011110010100001010000000000000000000000000000 10000101000010000000000000000000000000000000000001 10000000000100000001000000010000000000000000000011 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.002762, lagrangian_loss: -0.000273, attention_score_distillation_loss: 0.000020 loss: 0.002053, lagrangian_loss: -0.000046, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:09:10 Evaluating: f1: 0.8566, eval_loss: 0.7426, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.6684, expected_sequence_sparsity: 0.8646, target_sparsity: 0.67, step: 21350 lambda_1: -0.0805, lambda_2: 88.0463 lambda_3: 0.0000 train remain: [0.97 0.78 0.49 0.54 0.52 0.39 0.44 0.21 0.11 0.13] infer remain: [1.0, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.14] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010010100010000000000000001 10001011110010100001000000000000000000000000000001 10000101000010000000000000000000000000000000000001 10000000000100000001000000010000000000000000010011 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.003112, lagrangian_loss: 0.000033, attention_score_distillation_loss: 0.000020 loss: 0.001553, lagrangian_loss: 0.000171, attention_score_distillation_loss: 0.000020 ETA: 0:07:55 | Epoch 185 finished. Took 33.12 seconds. 
---------------------------------------------------------------------- time: 2023-07-19 16:09:24 Evaluating: f1: 0.8591, eval_loss: 0.7435, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.6684, expected_sequence_sparsity: 0.8646, target_sparsity: 0.67, step: 21400 lambda_1: -0.4411, lambda_2: 88.2999 lambda_3: 0.0000 train remain: [0.97 0.78 0.49 0.54 0.51 0.39 0.44 0.2 0.1 0.13] infer remain: [1.0, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111110111111011001010010100010000000000000000 10001011110010100001010000000000000000000000000000 10000101000010000000000000000000000000000000000001 10000000000100000001000000010000000000000000000011 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.001744, lagrangian_loss: -0.000243, attention_score_distillation_loss: 0.000020 loss: 0.002581, lagrangian_loss: -0.000077, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:09:39 Evaluating: f1: 0.8531, eval_loss: 0.7636, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.6689, expected_sequence_sparsity: 0.8648, target_sparsity: 0.67, step: 21450 lambda_1: -0.1828, lambda_2: 88.4583 lambda_3: 0.0000 train remain: [0.97 0.78 0.49 0.53 0.51 0.39 0.44 0.2 0.1 0.12] infer remain: [1.0, 0.76, 0.48, 0.54, 0.5, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 10111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111110111111011001010010100010000000000000000 10001011110010001001010000000000000000000000000000 10000101000010000000000010000000000000000000000000 10000000000100000001000000010000000000000000000011 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.003321, lagrangian_loss: -0.000007, attention_score_distillation_loss: 0.000020 loss: 0.001784, lagrangian_loss: 0.000037, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:09:54 Evaluating: f1: 0.8537, eval_loss: 0.7538, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.6684, expected_sequence_sparsity: 0.8646, target_sparsity: 0.67, step: 21500 lambda_1: 0.1555, lambda_2: 88.6680 lambda_3: 0.0000 train remain: [0.97 0.78 0.49 0.54 0.51 0.39 0.44 0.2 0.1 0.12] infer remain: [1.0, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 
11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111110111111011001010010100010000000000000000 10001011110010001001010000000000000000000000000000 10000101000010000000000000000000000000000000000001 10000000000100000001000000010000000000000000000011 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.001687, lagrangian_loss: 0.000215, attention_score_distillation_loss: 0.000020 ETA: 0:07:22 | Epoch 186 finished. Took 35.46 seconds. loss: 0.001474, lagrangian_loss: 0.000629, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:10:08 Evaluating: f1: 0.8625, eval_loss: 0.7456, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.6684, expected_sequence_sparsity: 0.8646, target_sparsity: 0.67, step: 21550 lambda_1: 0.3285, lambda_2: 88.8723 lambda_3: 0.0000 train remain: [0.97 0.78 0.49 0.54 0.52 0.39 0.44 0.2 0.11 0.13] infer remain: [1.0, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010010100010000000000000001 10001011110010000001010000000000000000000000000001 10000101000010000000000000000000000000000000000001 10000000000100000001000000010000000000000000000011 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.001634, lagrangian_loss: -0.000023, attention_score_distillation_loss: 0.000020 loss: 0.002179, lagrangian_loss: -0.000101, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:10:23 Evaluating: f1: 0.8556, eval_loss: 0.741, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.6684, expected_sequence_sparsity: 0.8646, target_sparsity: 0.67, step: 21600 lambda_1: -0.0380, lambda_2: 89.1487 lambda_3: 0.0000 train remain: [0.97 0.78 0.49 0.54 0.52 0.39 0.44 0.21 0.11 0.13] infer remain: [1.0, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010010100010000000000000001 10001011110010000001000000000000000000000000010001 10000101000010000000000000000000000000000000000001 10000000000100000001000000010000000000000000000011 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.000575, lagrangian_loss: 0.000018, attention_score_distillation_loss: 0.000020 ETA: 0:06:47 | Epoch 187 finished. Took 33.05 seconds. 
loss: 0.001805, lagrangian_loss: 0.000483, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:10:38 Evaluating: f1: 0.8526, eval_loss: 0.7655, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.6684, expected_sequence_sparsity: 0.8646, target_sparsity: 0.67, step: 21650 lambda_1: -0.4347, lambda_2: 89.4338 lambda_3: 0.0000 train remain: [0.97 0.78 0.49 0.54 0.51 0.39 0.44 0.2 0.1 0.12] infer remain: [1.0, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010010100010001000000000000 10001011110010000001010010000000000000000000000000 10000101000010000000000000000000000000000000000001 10000000000100000001000000010000000000000000000011 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.001195, lagrangian_loss: -0.000423, attention_score_distillation_loss: 0.000020 loss: 0.001808, lagrangian_loss: 0.000390, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:10:52 Evaluating: f1: 0.8681, eval_loss: 0.7174, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.6689, expected_sequence_sparsity: 0.8648, target_sparsity: 0.67, step: 21700 lambda_1: -0.1589, lambda_2: 89.6395 lambda_3: 0.0000 train remain: [0.97 0.78 0.49 0.54 0.51 0.39 0.44 0.2 0.1 0.12] infer remain: [1.0, 0.76, 0.48, 0.54, 0.5, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 10111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111110111111011001010010100010000000000000000 10001011110010000001010010000000000000000000000000 10000101000010000000000000000000010000000000000000 10000000000100000001000000010000000000000000000011 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.001363, lagrangian_loss: 0.000064, attention_score_distillation_loss: 0.000020 loss: 0.003172, lagrangian_loss: 0.000028, attention_score_distillation_loss: 0.000020 ETA: 0:06:13 | Epoch 188 finished. Took 33.19 seconds. 
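Aside (assumed interpretation, not stated in the log itself): the ten 50-character bit strings printed after each evaluation look like per-layer keep masks over 50 token bins (the "bin50" suffix in the experiment name), with 1 meaning the bin is kept. If so, the fraction of 1s in each row should reproduce the corresponding "infer remain" entry, which checks out on the rows above:

# Sketch: the second mask row (layer 1) from the evaluations above.
mask_layer_1 = "10111011111111101100111111101111110111001101110110"
keep_ratio = mask_layer_1.count("1") / len(mask_layer_1)
print(keep_ratio)  # 0.76 -- matches infer remain[1]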
---------------------------------------------------------------------- time: 2023-07-19 16:11:07 Evaluating: f1: 0.8694, eval_loss: 0.7135, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.6684, expected_sequence_sparsity: 0.8646, target_sparsity: 0.67, step: 21750 lambda_1: 0.2661, lambda_2: 89.9294 lambda_3: 0.0000 train remain: [0.97 0.78 0.49 0.54 0.51 0.39 0.44 0.2 0.1 0.13] infer remain: [1.0, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111110111111011001010010100010000000000000000 10001011110010000001010000000000000000010000000000 10000101000010000000000000000000000000000000000001 10000000000100000001000000010000000000000000000011 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.002358, lagrangian_loss: -0.000120, attention_score_distillation_loss: 0.000020 loss: 0.001429, lagrangian_loss: -0.000066, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:11:22 Evaluating: f1: 0.8596, eval_loss: 0.747, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.6684, expected_sequence_sparsity: 0.8646, target_sparsity: 0.67, step: 21800 lambda_1: 0.2150, lambda_2: 90.0482 lambda_3: 0.0000 train remain: [0.97 0.78 0.49 0.54 0.52 0.39 0.44 0.21 0.11 0.13] infer remain: [1.0, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010010100010000000000000001 10001011110010000001000000000000000000000000010001 10000101000010000000000000000000000000000000000001 10000000000100000001000000010000000000000000000011 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.001851, lagrangian_loss: -0.000117, attention_score_distillation_loss: 0.000020 loss: 0.002425, lagrangian_loss: 0.000040, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:11:36 Evaluating: f1: 0.8679, eval_loss: 0.7305, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.6684, expected_sequence_sparsity: 0.8646, target_sparsity: 0.67, step: 21850 lambda_1: -0.2101, lambda_2: 90.3547 lambda_3: 0.0000 train remain: [0.97 0.78 0.49 0.54 0.52 0.39 0.44 0.2 0.11 0.13] infer remain: [1.0, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 
11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010010100010000000000000001 10001011110010000001010000000000000000000000000001 10000101000010000000000000000000000000000000000001 10000000000100000001000000010000000000000000000011 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 ETA: 0:05:39 | Epoch 189 finished. Took 35.33 seconds. loss: 0.000704, lagrangian_loss: -0.000092, attention_score_distillation_loss: 0.000020 loss: 0.000859, lagrangian_loss: 0.000433, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:11:51 Evaluating: f1: 0.8709, eval_loss: 0.7177, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.6689, expected_sequence_sparsity: 0.8648, target_sparsity: 0.67, step: 21900 lambda_1: -0.2950, lambda_2: 90.5115 lambda_3: 0.0000 train remain: [0.97 0.78 0.49 0.53 0.51 0.39 0.44 0.2 0.1 0.12] infer remain: [1.0, 0.76, 0.48, 0.54, 0.5, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 10111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111110111111011001010010100010000000000000000 10001011110010001001010000000000000000000000000000 10000101000010000000000000000000010000000000000000 10000000000100001001000010010000000000000000000000 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.001505, lagrangian_loss: -0.000078, attention_score_distillation_loss: 0.000020 loss: 0.001651, lagrangian_loss: -0.000106, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:12:06 Evaluating: f1: 0.8679, eval_loss: 0.7343, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.6689, expected_sequence_sparsity: 0.8648, target_sparsity: 0.67, step: 21950 lambda_1: -0.0568, lambda_2: 90.6472 lambda_3: 0.0000 train remain: [0.97 0.78 0.49 0.53 0.51 0.39 0.44 0.2 0.1 0.12] infer remain: [1.0, 0.76, 0.48, 0.54, 0.5, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 10111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111110111111011001010010100010000000000000000 10001011110010001001010000000000000000000000000000 10000101000010000000000000000000010000000000000000 10000000000100000001000000010000000000000000000011 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.001825, lagrangian_loss: 0.000020, attention_score_distillation_loss: 0.000020 ETA: 0:05:05 | Epoch 190 finished. Took 33.0 seconds. 
loss: 0.001877, lagrangian_loss: -0.000002, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:12:20 Evaluating: f1: 0.8709, eval_loss: 0.7104, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.6684, expected_sequence_sparsity: 0.8646, target_sparsity: 0.67, step: 22000 lambda_1: 0.0952, lambda_2: 90.8486 lambda_3: 0.0000 train remain: [0.97 0.78 0.49 0.54 0.51 0.39 0.44 0.2 0.1 0.12] infer remain: [1.0, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111110111111011001010010100010000000000000000 10001011110010000001010000000000010000000000000000 10000101000010000000000000000000010000000000000000 10000000000100000001000000010000000000000000000011 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.002109, lagrangian_loss: 0.000229, attention_score_distillation_loss: 0.000020 loss: 0.001293, lagrangian_loss: 0.000091, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:12:35 Evaluating: f1: 0.8664, eval_loss: 0.748, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.6684, expected_sequence_sparsity: 0.8646, target_sparsity: 0.67, step: 22050 lambda_1: 0.0097, lambda_2: 90.9873 lambda_3: 0.0000 train remain: [0.97 0.78 0.49 0.54 0.51 0.39 0.44 0.2 0.1 0.12] infer remain: [1.0, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111110111111011001010010100010000000000000000 10001011110010000001010000000000000000000000000001 10000101000010000000000010000000000000000000000000 10000000000100000001000000010000000000000000000011 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.002194, lagrangian_loss: -0.000000, attention_score_distillation_loss: 0.000020 loss: 0.000904, lagrangian_loss: 0.000022, attention_score_distillation_loss: 0.000020 ETA: 0:04:31 | Epoch 191 finished. Took 33.17 seconds. 
---------------------------------------------------------------------- time: 2023-07-19 16:12:50 Evaluating: f1: 0.8586, eval_loss: 0.7496, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.6684, expected_sequence_sparsity: 0.8646, target_sparsity: 0.67, step: 22100 lambda_1: -0.2219, lambda_2: 91.1646 lambda_3: 0.0000 train remain: [0.97 0.78 0.5 0.54 0.51 0.39 0.44 0.2 0.1 0.12] infer remain: [1.0, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111110111111011001010010100010000000000000000 10001011110010000001010000000000010000000000000000 10000101000010000000000010000000000000000000000000 10000000000100000001000010010000000000000000000001 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.002082, lagrangian_loss: 0.000022, attention_score_distillation_loss: 0.000020 loss: 0.001520, lagrangian_loss: 0.000359, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:13:05 Evaluating: f1: 0.8532, eval_loss: 0.7638, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.6689, expected_sequence_sparsity: 0.8648, target_sparsity: 0.67, step: 22150 lambda_1: -0.3488, lambda_2: 91.2887 lambda_3: 0.0000 train remain: [0.97 0.78 0.49 0.53 0.51 0.38 0.43 0.2 0.1 0.12] infer remain: [1.0, 0.76, 0.48, 0.54, 0.5, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 10111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111110111111011001010010100010000000000000000 10001011110010000001010010000000000000000000000000 10000101000010000000000010000000000000000000000000 10000000000100000001000000010000000000000000000011 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.001827, lagrangian_loss: -0.000175, attention_score_distillation_loss: 0.000020 loss: 0.001174, lagrangian_loss: -0.000187, attention_score_distillation_loss: 0.000020 ETA: 0:03:57 | Epoch 192 finished. Took 33.23 seconds. 
---------------------------------------------------------------------- time: 2023-07-19 16:13:19 Evaluating: f1: 0.8566, eval_loss: 0.7657, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6757, expected_sparsity: 0.67, expected_sequence_sparsity: 0.8652, target_sparsity: 0.67, step: 22200 lambda_1: -0.1045, lambda_2: 91.4392 lambda_3: 0.0000 train remain: [0.97 0.78 0.49 0.53 0.51 0.38 0.43 0.2 0.1 0.12] infer remain: [1.0, 0.76, 0.48, 0.52, 0.5, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.19, 0.09, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111011111100110001010010011010000000000100 10111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111110111111011001010010100010000000000000000 10001011110010000001010010000000000000000000000000 10000101000010000000000010000000000000000000000000 10000000000100000001000000010000000000000000000011 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.001657, lagrangian_loss: -0.000016, attention_score_distillation_loss: 0.000020 loss: 0.001459, lagrangian_loss: 0.000132, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:13:34 Evaluating: f1: 0.8551, eval_loss: 0.7698, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.6689, expected_sequence_sparsity: 0.8648, target_sparsity: 0.67, step: 22250 lambda_1: 0.2273, lambda_2: 91.6546 lambda_3: 0.0000 train remain: [0.97 0.78 0.49 0.54 0.51 0.39 0.44 0.2 0.1 0.13] infer remain: [1.0, 0.76, 0.48, 0.54, 0.5, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 10111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010010100010001000000000000 10001011110010000001010010000000000000000000000000 10000101000010000000000010000000000000000000000000 10000000000100000001000000010000000000000000000011 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.002587, lagrangian_loss: -0.000042, attention_score_distillation_loss: 0.000020 loss: 0.000916, lagrangian_loss: -0.000066, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:13:49 Evaluating: f1: 0.8611, eval_loss: 0.7683, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.6684, expected_sequence_sparsity: 0.8646, target_sparsity: 0.67, step: 22300 lambda_1: 0.0348, lambda_2: 91.8501 lambda_3: 0.0000 train remain: [0.97 0.78 0.5 0.54 0.51 0.39 0.44 0.2 0.11 0.13] infer remain: [1.0, 0.76, 0.48, 0.54, 0.52, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 
11111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010010100010000000000000001 10001011110010000001010000000000000000000000000001 10000101000010000000000010000000000000000000000000 10000000000100000001000000010000000000000000000011 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.001516, lagrangian_loss: 0.000019, attention_score_distillation_loss: 0.000020 ETA: 0:03:23 | Epoch 193 finished. Took 35.33 seconds. loss: 0.001377, lagrangian_loss: 0.000081, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:14:03 Evaluating: f1: 0.8601, eval_loss: 0.7673, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.6689, expected_sequence_sparsity: 0.8648, target_sparsity: 0.67, step: 22350 lambda_1: -0.2663, lambda_2: 92.1052 lambda_3: 0.0000 train remain: [0.97 0.78 0.49 0.54 0.51 0.38 0.43 0.2 0.1 0.12] infer remain: [1.0, 0.76, 0.48, 0.54, 0.5, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 10111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010010100010001000000000000 10001011110010000001010010000000000000000000000000 10000101000010000000000010000000000000000000000000 10000000000100000001000000010000000000000000000011 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.001168, lagrangian_loss: 0.000611, attention_score_distillation_loss: 0.000020 loss: 0.001424, lagrangian_loss: -0.000017, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:14:18 Evaluating: f1: 0.8646, eval_loss: 0.7443, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.669, expected_sequence_sparsity: 0.8648, target_sparsity: 0.67, step: 22400 lambda_1: -0.2004, lambda_2: 92.2354 lambda_3: 0.0000 train remain: [0.97 0.78 0.49 0.54 0.51 0.38 0.43 0.2 0.1 0.12] infer remain: [1.0, 0.76, 0.48, 0.54, 0.5, 0.38, 0.42, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 10111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010010100010000000000000000 10001011110010000001010010000000000000000000000000 10000101000010000000000010000000000000000000000000 10000000000100000001000000010000000000000000000011 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.000935, lagrangian_loss: -0.000097, attention_score_distillation_loss: 0.000020 ETA: 0:02:49 | Epoch 194 finished. Took 33.22 seconds. 
loss: 0.004615, lagrangian_loss: -0.000008, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:14:33 Evaluating: f1: 0.8551, eval_loss: 0.7674, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.669, expected_sequence_sparsity: 0.8648, target_sparsity: 0.67, step: 22450 lambda_1: 0.0140, lambda_2: 92.3896 lambda_3: 0.0000 train remain: [0.97 0.78 0.49 0.54 0.51 0.39 0.43 0.2 0.1 0.12] infer remain: [1.0, 0.76, 0.48, 0.54, 0.5, 0.38, 0.42, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 10111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010010100010000000000000000 10001011110010000001010010000000000000000000000000 10000101000010000000000010000000000000000000000000 10000000000100000001000000010000000000000000000011 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.003741, lagrangian_loss: 0.000018, attention_score_distillation_loss: 0.000020 loss: 0.001238, lagrangian_loss: 0.000010, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:14:47 Evaluating: f1: 0.8546, eval_loss: 0.7638, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.669, expected_sequence_sparsity: 0.8648, target_sparsity: 0.67, step: 22500 lambda_1: 0.0640, lambda_2: 92.5192 lambda_3: 0.0000 train remain: [0.97 0.78 0.49 0.54 0.51 0.39 0.43 0.2 0.1 0.12] infer remain: [1.0, 0.76, 0.48, 0.54, 0.5, 0.38, 0.42, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 10111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010010100010000000000000000 10001011110010001001010000000000000000000000000000 10000101000010000000000010000000000000000000000000 10000000000100000001000000010000000000000000000011 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.000877, lagrangian_loss: -0.000011, attention_score_distillation_loss: 0.000020 loss: 0.001550, lagrangian_loss: -0.000019, attention_score_distillation_loss: 0.000020 ETA: 0:02:15 | Epoch 195 finished. Took 32.91 seconds. 
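Aside: lambda_1 and lambda_2 are the Lagrangian multipliers that pull expected_sparsity toward target_sparsity; L0-regularization-based pruning setups commonly use a penalty of the form lambda_1 * (s - t) + lambda_2 * (s - t)^2, which would explain why lagrangian_loss hovers around zero once expected_sparsity (about 0.668-0.673) has converged to the 0.67 target. A hedged sketch with the values logged at step 22450 (the exact formula in the training code may differ):

# Assumed penalty form, not verified against the training code.
lambda_1, lambda_2 = 0.0140, 92.3896   # logged at step 22450
expected, target = 0.6690, 0.67
gap = expected - target
lagrangian = lambda_1 * gap + lambda_2 * gap ** 2
print(f"{lagrangian:.6f}")  # 0.000078 -- same order of magnitude as the logged lagrangian_loss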
---------------------------------------------------------------------- time: 2023-07-19 16:15:02 Evaluating: f1: 0.8571, eval_loss: 0.7567, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6726, expected_sparsity: 0.6689, expected_sequence_sparsity: 0.8648, target_sparsity: 0.67, step: 22550 lambda_1: 0.0445, lambda_2: 92.6913 lambda_3: 0.0000 train remain: [0.97 0.78 0.49 0.54 0.51 0.39 0.43 0.2 0.1 0.13] infer remain: [1.0, 0.76, 0.48, 0.54, 0.5, 0.38, 0.44, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.2, 0.1, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111111111100110001010010011010000000000100 10111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010010100010001000000000000 10001011110010001001010000000000000000000000000000 10000101000010000000000010000000000000000000000000 10000000000100000001000000010000000000000000000011 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.166347, lagrangian_loss: 0.000017, attention_score_distillation_loss: 0.000020 loss: 0.001544, lagrangian_loss: 0.000131, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:15:17 Evaluating: f1: 0.8646, eval_loss: 0.7374, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6757, expected_sparsity: 0.6701, expected_sequence_sparsity: 0.8652, target_sparsity: 0.67, step: 22600 lambda_1: -0.3134, lambda_2: 92.9561 lambda_3: 0.0000 train remain: [0.97 0.78 0.49 0.53 0.51 0.38 0.43 0.2 0.1 0.12] infer remain: [1.0, 0.76, 0.48, 0.52, 0.5, 0.38, 0.42, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.19, 0.09, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111011111100110001010010011010000000000100 10111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010010100010000000000000000 10001011110010001001010000000000000000000000000000 10000101000010000000000010000000000000000000000000 10000000000100000001000000010000000000000000000011 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.002077, lagrangian_loss: 0.000380, attention_score_distillation_loss: 0.000020 loss: 0.001404, lagrangian_loss: -0.000018, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:15:31 Evaluating: f1: 0.8675, eval_loss: 0.7283, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6789, expected_sparsity: 0.6732, expected_sequence_sparsity: 0.8665, target_sparsity: 0.67, step: 22650 lambda_1: -0.3634, lambda_2: 93.1178 lambda_3: 0.0000 train remain: [0.97 0.78 0.49 0.53 0.51 0.38 0.42 0.2 0.1 0.12] infer remain: [1.0, 0.74, 0.48, 0.52, 0.5, 0.38, 0.42, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.36, 0.18, 0.09, 0.04, 0.01, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101011110111001101110110 11111111101011110111001001000010100110100000000000 11111111111011111100110001010010011010000000000100 
10111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010010100010000000000000000 10001011110010001001010000000000000000000000000000 10000101000010000000000010000000000000000000000000 10000000000100000001000000010000000000000000000011 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.002117, lagrangian_loss: -0.000319, attention_score_distillation_loss: 0.000020 ETA: 0:01:41 | Epoch 196 finished. Took 35.35 seconds. loss: 0.001624, lagrangian_loss: -0.000043, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:15:46 Evaluating: f1: 0.869, eval_loss: 0.7248, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6789, expected_sparsity: 0.6732, expected_sequence_sparsity: 0.8665, target_sparsity: 0.67, step: 22700 lambda_1: 0.0395, lambda_2: 93.4047 lambda_3: 0.0000 train remain: [0.97 0.78 0.49 0.53 0.51 0.38 0.42 0.2 0.1 0.12] infer remain: [1.0, 0.74, 0.48, 0.52, 0.5, 0.38, 0.42, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.36, 0.18, 0.09, 0.04, 0.01, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101011110111001101110110 11111111101011110111001001000010100110100000000000 11111111111011111100110001010010011010000000000100 10111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010010100010000000000000000 10001011110010001001010000000000000000000000000000 10000101000010000000000010000000000000000000000000 10000000000100000001000000010000000000000000000011 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.001322, lagrangian_loss: 0.000101, attention_score_distillation_loss: 0.000020 loss: 0.001156, lagrangian_loss: 0.000128, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:16:01 Evaluating: f1: 0.8651, eval_loss: 0.7446, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6757, expected_sparsity: 0.6701, expected_sequence_sparsity: 0.8652, target_sparsity: 0.67, step: 22750 lambda_1: 0.2353, lambda_2: 93.7072 lambda_3: 0.0000 train remain: [0.97 0.78 0.49 0.53 0.51 0.39 0.43 0.2 0.1 0.13] infer remain: [1.0, 0.76, 0.48, 0.52, 0.5, 0.38, 0.42, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.19, 0.09, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111011111100110001010010011010000000000100 10111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010010100010000000000000000 10001011110010000001000000000000000000000000010001 10000101000010000000000010000000000000000000000000 10000000000100000001000000010000000000000000000011 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.001924, lagrangian_loss: -0.000144, attention_score_distillation_loss: 0.000020 ETA: 0:01:07 | Epoch 197 finished. Took 33.01 seconds. 
loss: 0.001262, lagrangian_loss: 0.000258, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:16:15 Evaluating: f1: 0.867, eval_loss: 0.7414, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6757, expected_sparsity: 0.6701, expected_sequence_sparsity: 0.8652, target_sparsity: 0.67, step: 22800 lambda_1: -0.2827, lambda_2: 94.1122 lambda_3: 0.0000 train remain: [0.97 0.78 0.49 0.53 0.51 0.38 0.42 0.2 0.1 0.12] infer remain: [1.0, 0.76, 0.48, 0.52, 0.5, 0.38, 0.42, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.36, 0.19, 0.09, 0.04, 0.02, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101111110111001101110110 11111111101011110111001001000010100110100000000000 11111111111011111100110001010010011010000000000100 10111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010010100010000000000000000 10001011110010001001010000000000000000000000000000 10000101000010000000000010000000000000000000000000 10000000000100000001000000010000000000000000000011 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.003389, lagrangian_loss: 0.000274, attention_score_distillation_loss: 0.000020 loss: 0.001050, lagrangian_loss: -0.000213, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:16:30 Evaluating: f1: 0.8596, eval_loss: 0.7526, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6789, expected_sparsity: 0.6732, expected_sequence_sparsity: 0.8665, target_sparsity: 0.67, step: 22850 lambda_1: -0.2575, lambda_2: 94.2527 lambda_3: 0.0000 train remain: [0.97 0.78 0.49 0.53 0.5 0.38 0.42 0.2 0.09 0.12] infer remain: [1.0, 0.74, 0.48, 0.52, 0.5, 0.38, 0.42, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.36, 0.18, 0.09, 0.04, 0.01, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101011110111001101110110 11111111101011110111001001000010100110100000000000 11111111111011111100110001010010011010000000000100 10111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010010100010000000000000000 10001011110010001001010000000000000000000000000000 10000101000010000000000010000000000000000000000000 10000000000100000001000000010000000000000000000011 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.002498, lagrangian_loss: -0.000067, attention_score_distillation_loss: 0.000020 loss: 0.001408, lagrangian_loss: 0.000325, attention_score_distillation_loss: 0.000020 ETA: 0:00:33 | Epoch 198 finished. Took 33.34 seconds. 
---------------------------------------------------------------------- time: 2023-07-19 16:16:45 Evaluating: f1: 0.8596, eval_loss: 0.7494, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6789, expected_sparsity: 0.6732, expected_sequence_sparsity: 0.8665, target_sparsity: 0.67, step: 22900 lambda_1: 0.1379, lambda_2: 94.5999 lambda_3: 0.0000 train remain: [0.97 0.78 0.49 0.53 0.51 0.38 0.42 0.2 0.1 0.12] infer remain: [1.0, 0.74, 0.48, 0.52, 0.5, 0.38, 0.42, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.36, 0.18, 0.09, 0.04, 0.01, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101011110111001101110110 11111111101011110111001001000010100110100000000000 11111111111011111100110001010010011010000000000100 10111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010010100010000000000000000 10001011110010001001010000000000000000000000000000 10000101000010000000000010000000000000000000000000 10000000000100000001000000010000000000000000000011 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.002369, lagrangian_loss: 0.000137, attention_score_distillation_loss: 0.000020 loss: 0.000617, lagrangian_loss: 0.000096, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:17:00 Evaluating: f1: 0.8611, eval_loss: 0.7508, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6789, expected_sparsity: 0.6732, expected_sequence_sparsity: 0.8665, target_sparsity: 0.67, step: 22950 lambda_1: 0.0115, lambda_2: 94.8216 lambda_3: 0.0000 train remain: [0.97 0.78 0.5 0.53 0.51 0.38 0.43 0.2 0.1 0.12] infer remain: [1.0, 0.74, 0.48, 0.52, 0.5, 0.38, 0.42, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.36, 0.18, 0.09, 0.04, 0.01, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101011110111001101110110 11111111101011110111001001000010100110100000000000 11111111111011111100110001010010011010000000000100 10111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010010100010000000000000000 10001011110010001001000000000000000000010000000000 10000101000010000000000010000000000000000000000000 10000000000100000001000000010000000000000000000011 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 loss: 0.001884, lagrangian_loss: 0.000071, attention_score_distillation_loss: 0.000020 loss: 0.001276, lagrangian_loss: -0.000035, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 16:17:14 Evaluating: f1: 0.8596, eval_loss: 0.7537, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6789, expected_sparsity: 0.6732, expected_sequence_sparsity: 0.8665, target_sparsity: 0.67, step: 23000 lambda_1: -0.2657, lambda_2: 95.0416 lambda_3: 0.0000 train remain: [0.97 0.78 0.49 0.53 0.51 0.38 0.42 0.2 0.09 0.12] infer remain: [1.0, 0.74, 0.48, 0.52, 0.5, 0.38, 0.42, 0.2, 0.1, 0.12] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.36, 0.18, 0.09, 0.04, 0.01, 0.0, 0.0, 0.0] 11111111111111111111111111111111111111111111111111 10111011111111101100111111101011110111001101110110 11111111101011110111001001000010100110100000000000 11111111111011111100110001010010011010000000000100 
10111111111101111111110110001100100000000000000000 10111111111111101000110010010000000000000000000000 10111111010111111011001010010100010000000000000000 10001011110010001001010000000000000000000000000000 10000101000010000000000010000000000000000000000000 10000000000100000001000000010000000000000000000011 Best eval score so far: 0.8908 @ step 18850 epoch 163.91 ETA: 0:00:00 | Epoch 199 finished. Took 35.49 seconds.
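End of training (200 epochs): the best eval F1 remains 0.8908 from step 18850, while the final evaluations sit around 0.86-0.87 at roughly the 0.67 sparsity target. A small helper one could use to pull the F1 trajectory out of a flattened log like this (illustrative only, not part of the training code):

import re

# Extract (step, f1) pairs from evaluation lines of the form
# "Evaluating: f1: 0.8596, ... step: 23000".
PATTERN = re.compile(r"Evaluating: f1: ([0-9.]+).*?step: (\d+)")

def f1_by_step(log_text):
    return [(int(step), float(f1)) for f1, step in PATTERN.findall(log_text)]

sample = "Evaluating: f1: 0.8596, eval_loss: 0.7537, target_sparsity: 0.67, step: 23000"
print(f1_by_step(sample))  # [(23000, 0.8596)]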