/home/aiscuser/.local/lib/python3.8/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.24.4)
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
2023/07/19 14:39:49 WARNING mlflow.utils.autologging_utils: You are using an unsupported version of transformers. If you encounter errors during autologging, try upgrading / downgrading transformers to a supported version, or try upgrading MLflow.
2023/07/19 14:39:50 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.
2023/07/19 14:39:50 INFO mlflow.tracking.fluent: Autologging successfully enabled for transformers.
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
Downloading and preparing dataset glue/stsb to /home/aiscuser/.cache/huggingface/datasets/glue/stsb/1.0.0/a420f5e518f42454003587c47467370329f9fc0c6508d1ae0c45b58ea266a353...
Training Arguments
TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
bf16=False,
bf16_full_eval=False,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_steps=100,
evaluation_strategy=IntervalStrategy.STEPS,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_strategy=HubStrategy.EVERY_SAVE,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=2e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=-1,
log_level=40,
log_level_replica=-1,
log_on_each_node=True,
logging_dir=/mnt/data/device-aware-bert/token_pruning/experiments/STSB/reproduce1/s0.7_lr2e-05_reglr0.02_alpha0.004_warmup50_bin30/runs/Jul19_14-39-51_node-0,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=25,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=150.0,
optim=OptimizerNames.ADAMW_HF,
output_dir=/mnt/data/device-aware-bert/token_pruning/experiments/STSB/reproduce1/s0.7_lr2e-05_reglr0.02_alpha0.004_warmup50_bin30,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=32,
per_device_train_batch_size=32,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
remove_unused_columns=True,
report_to=['mlflow'],
resume_from_checkpoint=None,
run_name=/mnt/data/device-aware-bert/token_pruning/experiments/STSB/reproduce1/s0.7_lr2e-05_reglr0.02_alpha0.004_warmup50_bin30,
save_on_each_node=False,
save_steps=0,
save_strategy=IntervalStrategy.STEPS,
save_total_limit=None,
seed=57,
sharded_ddp=[],
skip_memory_metrics=True,
tf32=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_legacy_prediction_loop=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
xpu_backend=None,
)
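For reference, the switches that matter for this run can be reconstructed from the dump above; a minimal sketch (values copied verbatim from the dump, using the standard transformers `TrainingArguments` keywords):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="/mnt/data/device-aware-bert/token_pruning/experiments/STSB/"
               "reproduce1/s0.7_lr2e-05_reglr0.02_alpha0.004_warmup50_bin30",
    overwrite_output_dir=True,
    do_train=True,
    do_eval=True,
    evaluation_strategy="steps",   # IntervalStrategy.STEPS
    eval_steps=100,
    logging_steps=25,
    learning_rate=2e-05,
    lr_scheduler_type="linear",
    num_train_epochs=150.0,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    seed=57,
    report_to=["mlflow"],
)
```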
Additional Arguments
AdditionalArguments(test=False, ex_name='s0.7_lr2e-05_reglr0.02_alpha0.004_warmup50_bin30', pruning_type='token+pruner', reg_learning_rate=0.02, scheduler_type='linear', freeze_embeddings=True, pretrained_pruned_model=None, droprate_init=0.01, temperature=0.6666666666666666, prepruning_finetune_epochs=1, lagrangian_warmup_epochs=50, target_sparsity=0.7, sparsity_epsilon=0, distillation_path='/mnt/data/device-aware-bert/token_pruning/teachers/STSB', do_distill=True, do_layer_distill=False, layer_distill_version=4, distill_loss_alpha=0.9, distill_ce_loss_alpha=0.004, distill_temp=2.0, use_mac_l0=True, prune_location=[3, 4, 5, 6, 7, 8, 9, 10, 11], bin_num=30, topk=20)
----------------------------------------------------------------------
time: 2023-07-19 14:40:32
Evaluating: pearson: 0.8953, eval_loss: 0.4804, step: 0
lambda_1: 0.0000, lambda_2: 0.0000 lambda_3: 0.0000
Starting l0 regularization! using , temperature: 0.67, init drop rate: 0.01
token_loga shape: [9, 30]
prune location: [3, 4, 5, 6, 7, 8, 9, 10, 11]
NDCG TOPK= 20
loss: 0.272668, lagrangian_loss: 0.002260, attention_score_distillation_loss: 0.038668
----------------------------------------------------------------------
time: 2023-07-19 14:41:08
Evaluating: pearson: 0.8876, eval_loss: 0.527, token_prune_loc: [False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.5114, target_sparsity: 0.0076, step: 100
lambda_1: -2.1429, lambda_2: 2.5893 lambda_3: 0.0000
train remain: [1. 1. 1. 1. 1. 1. 1. 1. 1.]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
loss: 0.262334, lagrangian_loss: 0.007194, attention_score_distillation_loss: 0.037622
loss: 0.210680, lagrangian_loss: -0.032782, attention_score_distillation_loss: 0.037414
ETA: 2:22:42 | Epoch 0 finished. Took 57.47 seconds.
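The `train remain` vector above is the expected fraction of the 30 token bins that each pruned layer keeps. A minimal sketch of where those expectations come from, assuming the stretched hard-concrete gates of Louizos et al. (2018) with the commonly used stretch limits (-0.1, 1.1); `token_loga` is the [9, 30] parameter tensor named in the log:

```python
import math
import torch

temperature = 2 / 3            # "temperature: 0.67" above
droprate_init = 0.01           # "init drop rate: 0.01" above
limit_l, limit_r = -0.1, 1.1   # assumed stretch limits

# one log-alpha per (pruned layer, token bin)
loga_init = math.log((1 - droprate_init) / droprate_init)
token_loga = torch.full((9, 30), loga_init)

def keep_probability(loga):
    """P(gate > 0) under the stretched hard-concrete distribution."""
    return torch.sigmoid(loga - temperature * math.log(-limit_l / limit_r))

# ~0.998 per layer at init, cf. "train remain: [1. 1. ... 1.]"
print(keep_probability(token_loga).mean(dim=1))
```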
----------------------------------------------------------------------
time: 2023-07-19 14:41:43
Evaluating: pearson: 0.8895, eval_loss: 0.4978, token_prune_loc: [False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.5114, target_sparsity: 0.0154, step: 200
lambda_1: 0.0455, lambda_2: 6.2759 lambda_3: 0.0000
train remain: [0.99 1. 0.98 0.99 0.99 0.99 1. 0.99 0.97]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
loss: 0.433873, lagrangian_loss: 0.004123, attention_score_distillation_loss: 0.037494
loss: 0.152333, lagrangian_loss: 0.009690, attention_score_distillation_loss: 0.037199
----------------------------------------------------------------------
time: 2023-07-19 14:42:18
Evaluating: pearson: 0.8887, eval_loss: 0.486, token_prune_loc: [False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.5114, target_sparsity: 0.0232, step: 300
lambda_1: 1.5517, lambda_2: 7.4843 lambda_3: 0.0000
train remain: [1. 1. 0.99 1. 1. 1. 1. 1. 0.99]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
loss: 0.407483, lagrangian_loss: -0.008975, attention_score_distillation_loss: 0.036165
loss: 0.226825, lagrangian_loss: -0.010172, attention_score_distillation_loss: 0.036331
ETA: 2:29:33 | Epoch 1 finished. Took 63.8 seconds.
----------------------------------------------------------------------
time: 2023-07-19 14:42:52
Evaluating: pearson: 0.8948, eval_loss: 0.5324, token_prune_loc: [False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.5114, target_sparsity: 0.031, step: 400
lambda_1: -0.4904, lambda_2: 8.9173 lambda_3: 0.0000
train remain: [1. 1. 0.99 1. 1. 1. 1. 1. 0.99]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
loss: 0.157170, lagrangian_loss: 0.012922, attention_score_distillation_loss: 0.036653
loss: 0.434192, lagrangian_loss: 0.050047, attention_score_distillation_loss: 0.035997
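The `lagrangian_loss` column and the drifting `lambda_1`/`lambda_2` are the usual constrained-optimization device for hitting a sparsity target (as in CoFi-style pruning): the model minimizes lambda_1·(s − t) + lambda_2·(s − t)², while a second optimizer (plausibly the reg_learning_rate=0.02 from the arguments) does gradient ascent on the multipliers, which is why the term oscillates around zero as the expected sparsity s tracks the target t. A sketch, not the repo's exact code:

```python
import torch

lambda_1 = torch.zeros((), requires_grad=True)
lambda_2 = torch.zeros((), requires_grad=True)
reg_opt = torch.optim.AdamW([lambda_1, lambda_2], lr=0.02)

def lagrangian(expected_sparsity, target_sparsity):
    gap = expected_sparsity - target_sparsity
    return lambda_1 * gap + lambda_2 * gap ** 2

# the main model minimizes this term; the multipliers maximize it:
loss = lagrangian(torch.tensor(0.05), torch.tensor(0.10))
(-loss).backward()  # ascent step for the multipliers
reg_opt.step()
```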
----------------------------------------------------------------------
time: 2023-07-19 14:43:27
Evaluating: pearson: 0.8925, eval_loss: 0.4677, token_prune_loc: [False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.5114, target_sparsity: 0.0387, step: 500
lambda_1: -3.5211, lambda_2: 12.1669 lambda_3: 0.0000
train remain: [1. 1. 0.99 1. 0.99 1. 1. 1. 0.99]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
loss: 0.199316, lagrangian_loss: 0.077902, attention_score_distillation_loss: 0.036243
ETA: 2:31:23 | Epoch 2 finished. Took 64.11 seconds.
loss: 0.208301, lagrangian_loss: -0.183304, attention_score_distillation_loss: 0.035986
----------------------------------------------------------------------
time: 2023-07-19 14:44:02
Evaluating: pearson: 0.8909, eval_loss: 0.4796, token_prune_loc: [False, False, False, False, True, False, False, False, False], macs_sparsity: 0.1376, expected_sparsity: 0.1367, expected_sequence_sparsity: 0.5787, target_sparsity: 0.0465, step: 600
lambda_1: -0.8688, lambda_2: 17.2757 lambda_3: 0.0000
train remain: [1. 1. 0.99 1. 0.77 0.99 0.99 1. 0.92]
infer remain: [1.0, 1.0, 1.0, 1.0, 0.7, 1.0, 1.0, 1.0, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.7, 0.7, 0.7, 0.7, 0.7]
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111100010011100100
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
loss: 0.175524, lagrangian_loss: 0.040728, attention_score_distillation_loss: 0.034940
loss: 0.178263, lagrangian_loss: 0.003417, attention_score_distillation_loss: 0.035948
----------------------------------------------------------------------
time: 2023-07-19 14:44:37
Evaluating: pearson: 0.8933, eval_loss: 0.4632, token_prune_loc: [False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.5114, target_sparsity: 0.0543, step: 700
lambda_1: 0.5612, lambda_2: 18.5714 lambda_3: 0.0000
train remain: [1. 1. 0.99 1. 0.94 1. 1. 1. 0.98]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
loss: 0.095753, lagrangian_loss: -0.004199, attention_score_distillation_loss: 0.035325
ETA: 2:31:48 | Epoch 3 finished. Took 64.16 seconds.
loss: 0.069233, lagrangian_loss: 0.005891, attention_score_distillation_loss: 0.035580
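The 30-character rows printed with each evaluation are per-layer keep masks over the token bins, and `token_prune_loc` flags the layers whose mask is actually active; because the decision is a thresholded gate, a location can flip back to unpruned, as it does between step 600 and step 700 above. A sketch of the test-time thresholding, under the same hard-concrete assumptions as earlier (not the repo's exact rule):

```python
import torch

def inference_mask(token_loga, limit_l=-0.1, limit_r=1.1):
    # deterministic test-time gate: stretched mean, clipped to [0, 1]
    z = torch.sigmoid(token_loga) * (limit_r - limit_l) + limit_l
    return (z.clamp(0.0, 1.0) > 0).int()

mask = inference_mask(torch.randn(9, 30))
for row in mask.tolist():            # cf. the nine 30-bin rows in each eval block
    print("".join(map(str, row)))
print(mask.float().mean(dim=1))      # cf. "infer remain"
```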
----------------------------------------------------------------------
time: 2023-07-19 14:45:11
Evaluating: pearson: 0.8953, eval_loss: 0.4608, token_prune_loc: [False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.5114, target_sparsity: 0.0621, step: 800
lambda_1: -0.5128, lambda_2: 18.8157 lambda_3: 0.0000
train remain: [1. 1. 0.99 1. 0.92 1. 1. 1. 0.97]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
loss: 0.126529, lagrangian_loss: 0.007744, attention_score_distillation_loss: 0.034379
loss: 0.094579, lagrangian_loss: -0.002043, attention_score_distillation_loss: 0.033861
----------------------------------------------------------------------
time: 2023-07-19 14:45:46
Evaluating: pearson: 0.8926, eval_loss: 0.4814, token_prune_loc: [False, False, False, False, True, False, False, False, False], macs_sparsity: 0.107, expected_sparsity: 0.1064, expected_sequence_sparsity: 0.5638, target_sparsity: 0.0698, step: 900
lambda_1: -0.4978, lambda_2: 18.8587 lambda_3: 0.0000
train remain: [1. 1. 0.99 1. 0.86 1. 0.99 1. 0.95]
infer remain: [1.0, 1.0, 1.0, 1.0, 0.77, 1.0, 1.0, 1.0, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.77, 0.77, 0.77, 0.77, 0.77]
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111101010011110100
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
ETA: 2:31:50 | Epoch 4 finished. Took 64.61 seconds.
loss: 0.098086, lagrangian_loss: -0.002800, attention_score_distillation_loss: 0.034529
loss: 0.132752, lagrangian_loss: -0.000201, attention_score_distillation_loss: 0.034015
----------------------------------------------------------------------
time: 2023-07-19 14:46:21
Evaluating: pearson: 0.8909, eval_loss: 0.472, token_prune_loc: [False, False, False, False, True, False, False, False, False], macs_sparsity: 0.107, expected_sparsity: 0.1064, expected_sequence_sparsity: 0.5638, target_sparsity: 0.0776, step: 1000
lambda_1: -0.3158, lambda_2: 18.8781 lambda_3: 0.0000
train remain: [1. 1. 0.99 1. 0.87 1. 0.99 1. 0.95]
infer remain: [1.0, 1.0, 1.0, 1.0, 0.77, 1.0, 1.0, 1.0, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.77, 0.77, 0.77, 0.77, 0.77]
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111101010011110100
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
loss: 0.097724, lagrangian_loss: 0.000727, attention_score_distillation_loss: 0.034215
loss: 0.273739, lagrangian_loss: -0.000436, attention_score_distillation_loss: 0.034057
ETA: 2:28:10 | Epoch 5 finished. Took 56.3 seconds.
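`target_sparsity` rises by roughly 0.0078 every 100 steps, which matches a linear ramp from 0 to the configured target_sparsity=0.7 over lagrangian_warmup_epochs=50: STSB has 5,749 training pairs, so batch size 32 gives about 180 steps per epoch and roughly 9,000 warmup steps. A sketch of that schedule (steps_per_epoch is inferred, not printed):

```python
def target_sparsity(step, final_target=0.7, warmup_epochs=50, steps_per_epoch=180):
    warmup_steps = warmup_epochs * steps_per_epoch
    return final_target * min(step, warmup_steps) / warmup_steps

for step in (100, 200, 300):
    print(step, round(target_sparsity(step), 4))
# 100 0.0078, 200 0.0156, 300 0.0233 -- cf. 0.0076, 0.0154, 0.0232 in the log
```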
----------------------------------------------------------------------
time: 2023-07-19 14:46:56
Evaluating: pearson: 0.8895, eval_loss: 0.4835, token_prune_loc: [False, False, False, False, True, False, False, False, False], macs_sparsity: 0.107, expected_sparsity: 0.1064, expected_sequence_sparsity: 0.5638, target_sparsity: 0.0854, step: 1100
lambda_1: -0.2460, lambda_2: 18.8822 lambda_3: 0.0000
train remain: [1. 1. 0.99 1. 0.84 1. 0.99 1. 0.94]
infer remain: [1.0, 1.0, 1.0, 1.0, 0.77, 1.0, 1.0, 1.0, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.77, 0.77, 0.77, 0.77, 0.77]
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111101010011110100
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
loss: 0.099501, lagrangian_loss: -0.000382, attention_score_distillation_loss: 0.033754
loss: 0.122943, lagrangian_loss: -0.000049, attention_score_distillation_loss: 0.033970
----------------------------------------------------------------------
time: 2023-07-19 14:47:31
Evaluating: pearson: 0.892, eval_loss: 0.4828, token_prune_loc: [False, False, False, False, True, False, False, False, False], macs_sparsity: 0.107, expected_sparsity: 0.1064, expected_sequence_sparsity: 0.5638, target_sparsity: 0.0932, step: 1200
lambda_1: -0.2257, lambda_2: 18.8830 lambda_3: 0.0000
train remain: [1. 1. 1. 1. 0.84 0.99 0.99 1. 0.93]
infer remain: [1.0, 1.0, 1.0, 1.0, 0.77, 1.0, 1.0, 1.0, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.77, 0.77, 0.77, 0.77, 0.77]
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111101010011110100
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
loss: 0.103437, lagrangian_loss: -0.000016, attention_score_distillation_loss: 0.032399
loss: 0.118148, lagrangian_loss: 0.000374, attention_score_distillation_loss: 0.032247
ETA: 2:27:58 | Epoch 6 finished. Took 64.15 seconds.
----------------------------------------------------------------------
time: 2023-07-19 14:48:05
Evaluating: pearson: 0.8936, eval_loss: 0.4597, token_prune_loc: [False, False, False, False, True, False, False, False, False], macs_sparsity: 0.1376, expected_sparsity: 0.1216, expected_sequence_sparsity: 0.5713, target_sparsity: 0.101, step: 1300
lambda_1: -0.2557, lambda_2: 18.8837 lambda_3: 0.0000
train remain: [1. 1. 0.99 1. 0.82 0.99 0.99 1. 0.91]
infer remain: [1.0, 1.0, 1.0, 1.0, 0.73, 1.0, 1.0, 1.0, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.73, 0.73, 0.73, 0.73, 0.73]
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111101010011110000
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
loss: 0.087991, lagrangian_loss: -0.000274, attention_score_distillation_loss: 0.032776
loss: 0.166293, lagrangian_loss: 0.000123, attention_score_distillation_loss: 0.032512
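`layerwise remain` behaves like a running product of the per-location keep fractions in `infer remain`: once a layer drops token bins, every later layer computes only on the survivors. Taking the step-1400 evaluation just below (0.73 kept at layer 4, 0.8 at layer 11):

```python
from itertools import accumulate
from operator import mul

infer_remain = [1.0, 1.0, 1.0, 1.0, 0.73, 1.0, 1.0, 1.0, 0.8]  # layers 3..11
print([round(r, 2) for r in accumulate(infer_remain, mul)])
# [1.0, 1.0, 1.0, 1.0, 0.73, 0.73, 0.73, 0.73, 0.58] -- cf. the 0.73 ... 0.59 tail
```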
----------------------------------------------------------------------
time: 2023-07-19 14:48:40
Evaluating: pearson: 0.8934, eval_loss: 0.4901, token_prune_loc: [False, False, False, False, True, False, False, False, True], macs_sparsity: 0.1555, expected_sparsity: 0.1411, expected_sequence_sparsity: 0.5809, target_sparsity: 0.1087, step: 1400
lambda_1: -0.1991, lambda_2: 18.8899 lambda_3: 0.0000
train remain: [1. 1. 0.99 1. 0.81 0.99 0.99 1. 0.9]
infer remain: [1.0, 1.0, 1.0, 1.0, 0.73, 1.0, 1.0, 1.0, 0.8]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.73, 0.73, 0.73, 0.73, 0.59]
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111101010011100100
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
011110111111111111011111110100
loss: 0.116252, lagrangian_loss: 0.000049, attention_score_distillation_loss: 0.032113
ETA: 2:27:37 | Epoch 7 finished. Took 64.42 seconds.
loss: 0.091873, lagrangian_loss: 0.000516, attention_score_distillation_loss: 0.031367
----------------------------------------------------------------------
time: 2023-07-19 14:49:15
Evaluating: pearson: 0.8915, eval_loss: 0.4657, token_prune_loc: [False, False, False, False, True, False, False, False, True], macs_sparsity: 0.1599, expected_sparsity: 0.1444, expected_sequence_sparsity: 0.5825, target_sparsity: 0.1165, step: 1500
lambda_1: -0.2281, lambda_2: 18.8967 lambda_3: 0.0000
train remain: [1. 1. 0.99 1. 0.79 0.99 0.99 1. 0.85]
infer remain: [1.0, 1.0, 1.0, 1.0, 0.73, 1.0, 1.0, 1.0, 0.77]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.73, 0.73, 0.73, 0.73, 0.56]
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111101010011100100
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
011110111111111111011110110100
loss: 0.043083, lagrangian_loss: -0.000576, attention_score_distillation_loss: 0.031940
loss: 0.083223, lagrangian_loss: 0.000136, attention_score_distillation_loss: 0.031341
----------------------------------------------------------------------
time: 2023-07-19 14:49:50
Evaluating: pearson: 0.8917, eval_loss: 0.4674, token_prune_loc: [False, False, False, False, True, False, False, False, True], macs_sparsity: 0.1599, expected_sparsity: 0.1444, expected_sequence_sparsity: 0.5825, target_sparsity: 0.1243, step: 1600
lambda_1: -0.2035, lambda_2: 18.9051 lambda_3: 0.0000
train remain: [1. 1. 1. 1. 0.79 0.99 0.99 1. 0.86]
infer remain: [1.0, 1.0, 1.0, 1.0, 0.73, 1.0, 1.0, 1.0, 0.77]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.73, 0.73, 0.73, 0.73, 0.56]
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111101010011100100
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
011110111111111111011110110100
loss: 0.058317, lagrangian_loss: 0.001115, attention_score_distillation_loss: 0.030712
ETA: 2:27:06 | Epoch 8 finished. Took 64.35 seconds.
loss: 0.082088, lagrangian_loss: -0.000708, attention_score_distillation_loss: 0.031576
----------------------------------------------------------------------
time: 2023-07-19 14:50:25
Evaluating: pearson: 0.8898, eval_loss: 0.4776, token_prune_loc: [False, False, False, False, True, False, False, False, True], macs_sparsity: 0.1644, expected_sparsity: 0.1616, expected_sequence_sparsity: 0.5909, target_sparsity: 0.1321, step: 1700
lambda_1: -0.1715, lambda_2: 18.9168 lambda_3: 0.0000
train remain: [1. 1. 1. 1. 0.77 0.99 0.99 1. 0.82]
infer remain: [1.0, 1.0, 1.0, 1.0, 0.7, 1.0, 1.0, 1.0, 0.73]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.7, 0.7, 0.7, 0.7, 0.51]
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111101010011100000
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
011110111111111111011100110100
loss: 0.098492, lagrangian_loss: -0.000240, attention_score_distillation_loss: 0.030097
loss: 0.098944, lagrangian_loss: 0.000114, attention_score_distillation_loss: 0.031157
----------------------------------------------------------------------
time: 2023-07-19 14:51:00
Evaluating: pearson: 0.8922, eval_loss: 0.4967, token_prune_loc: [False, False, False, False, True, False, False, False, True], macs_sparsity: 0.1644, expected_sparsity: 0.1616, expected_sequence_sparsity: 0.5909, target_sparsity: 0.1398, step: 1800
lambda_1: -0.2332, lambda_2: 18.9246 lambda_3: 0.0000
train remain: [1. 1. 1. 1. 0.76 0.99 0.98 1. 0.8]
infer remain: [1.0, 1.0, 1.0, 1.0, 0.7, 1.0, 1.0, 1.0, 0.73]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.7, 0.7, 0.7, 0.7, 0.51]
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111101010011100000
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
011110111111111111011100110100
ETA: 2:26:29 | Epoch 9 finished. Took 64.43 seconds.
loss: 0.094596, lagrangian_loss: 0.000243, attention_score_distillation_loss: 0.030738
loss: 0.111501, lagrangian_loss: -0.000185, attention_score_distillation_loss: 0.030145
----------------------------------------------------------------------
time: 2023-07-19 14:51:35
Evaluating: pearson: 0.8925, eval_loss: 0.464, token_prune_loc: [False, False, False, False, True, False, False, False, True], macs_sparsity: 0.1644, expected_sparsity: 0.1616, expected_sequence_sparsity: 0.5909, target_sparsity: 0.1476, step: 1900
lambda_1: -0.1392, lambda_2: 18.9280 lambda_3: 0.0000
train remain: [1. 1. 1. 1. 0.75 0.99 0.98 1. 0.78]
infer remain: [1.0, 1.0, 1.0, 1.0, 0.7, 1.0, 1.0, 1.0, 0.73]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.7, 0.7, 0.7, 0.7, 0.51]
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111101010011100000
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
011110111111111111011100110100
loss: 0.030897, lagrangian_loss: -0.000145, attention_score_distillation_loss: 0.030697
loss: 0.066178, lagrangian_loss: 0.000133, attention_score_distillation_loss: 0.030171
ETA: 2:24:08 | Epoch 10 finished. Took 56.64 seconds.
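The evaluation records are regular enough to scrape when plotting the accuracy/sparsity trade-off; a small helper (hypothetical, not part of the repo):

```python
import re

EVAL = re.compile(
    r"Evaluating: pearson: (?P<pearson>[\d.]+), eval_loss: (?P<eval_loss>[\d.]+)"
    r".*?expected_sparsity: (?P<sparsity>[\d.]+),.*?step: (?P<step>\d+)"
)

def parse_log(path):
    with open(path) as fh:
        return [
            {k: float(v) for k, v in m.groupdict().items()}
            for m in EVAL.finditer(fh.read())
        ]

# parse_log("train.log")[0]
# -> {'pearson': 0.8876, 'eval_loss': 0.527, 'sparsity': 0.0, 'step': 100.0}
```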
----------------------------------------------------------------------
time: 2023-07-19 14:52:10
Evaluating: pearson: 0.893, eval_loss: 0.4571, token_prune_loc: [False, False, False, False, True, False, False, False, True], macs_sparsity: 0.1689, expected_sparsity: 0.1647, expected_sequence_sparsity: 0.5925, target_sparsity: 0.1554, step: 2000
lambda_1: -0.2398, lambda_2: 18.9300 lambda_3: 0.0000
train remain: [1. 1. 1. 1. 0.73 0.99 0.98 1. 0.76]
infer remain: [1.0, 1.0, 1.0, 1.0, 0.7, 1.0, 1.0, 1.0, 0.7]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.7, 0.7, 0.7, 0.7, 0.49]
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111101010011100000
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
011110111111111110011100110100
loss: 0.075597, lagrangian_loss: 0.000161, attention_score_distillation_loss: 0.029997
loss: 0.058554, lagrangian_loss: -0.000284, attention_score_distillation_loss: 0.029598
----------------------------------------------------------------------
time: 2023-07-19 14:52:45
Evaluating: pearson: 0.8889, eval_loss: 0.4748, token_prune_loc: [False, False, False, False, True, False, False, False, True], macs_sparsity: 0.1841, expected_sparsity: 0.1815, expected_sequence_sparsity: 0.6007, target_sparsity: 0.1632, step: 2100
lambda_1: -0.2354, lambda_2: 18.9323 lambda_3: 0.0000
train remain: [1. 1. 1. 1. 0.72 0.99 0.98 1. 0.74]
infer remain: [1.0, 1.0, 1.0, 1.0, 0.67, 1.0, 1.0, 1.0, 0.67]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.67, 0.67, 0.67, 0.67, 0.44]
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111101010001100000
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
011110111111101110011100110100
loss: 0.064319, lagrangian_loss: -0.000454, attention_score_distillation_loss: 0.029757
loss: 0.098023, lagrangian_loss: 0.000011, attention_score_distillation_loss: 0.029388
ETA: 2:23:31 | Epoch 11 finished. Took 64.42 seconds.
----------------------------------------------------------------------
time: 2023-07-19 14:53:20
Evaluating: pearson: 0.8935, eval_loss: 0.4553, token_prune_loc: [False, False, False, False, True, False, False, False, True], macs_sparsity: 0.1841, expected_sparsity: 0.1815, expected_sequence_sparsity: 0.6007, target_sparsity: 0.171, step: 2200
lambda_1: -0.2287, lambda_2: 18.9332 lambda_3: 0.0000
train remain: [1. 1. 1. 0.99 0.71 0.99 0.98 1. 0.71]
infer remain: [1.0, 1.0, 1.0, 1.0, 0.67, 1.0, 1.0, 1.0, 0.67]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.67, 0.67, 0.67, 0.67, 0.44]
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111101010001100000
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
011110111111111110001100110100
loss: 0.043141, lagrangian_loss: -0.000179, attention_score_distillation_loss: 0.029063
loss: 0.065103, lagrangian_loss: -0.000069, attention_score_distillation_loss: 0.028202
----------------------------------------------------------------------
time: 2023-07-19 14:53:55
Evaluating: pearson: 0.8886, eval_loss: 0.4759, token_prune_loc: [False, False, False, False, True, False, False, False, True], macs_sparsity: 0.1886, expected_sparsity: 0.1845, expected_sequence_sparsity: 0.6022, target_sparsity: 0.1787, step: 2300
lambda_1: -0.1960, lambda_2: 18.9345 lambda_3: 0.0000
train remain: [1. 1. 1. 0.99 0.7 0.99 0.98 1.
0.67]
infer remain: [1.0, 1.0, 1.0, 1.0, 0.67, 1.0, 1.0, 1.0, 0.63]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.67, 0.67, 0.67, 0.67, 0.42]
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111101010001100000
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
011110111111111110001100110000
loss: 0.031083, lagrangian_loss: 0.000286, attention_score_distillation_loss: 0.028377
ETA: 2:22:49 | Epoch 12 finished. Took 64.34 seconds.
loss: 0.055928, lagrangian_loss: 0.000181, attention_score_distillation_loss: 0.027413
----------------------------------------------------------------------
time: 2023-07-19 14:54:29
Evaluating: pearson: 0.8918, eval_loss: 0.4631, token_prune_loc: [False, False, False, False, True, False, False, False, True], macs_sparsity: 0.1886, expected_sparsity: 0.1874, expected_sequence_sparsity: 0.6036, target_sparsity: 0.1865, step: 2400
lambda_1: -0.1891, lambda_2: 18.9366 lambda_3: 0.0000
train remain: [1. 1. 1. 0.99 0.69 0.99 0.98 1. 0.64]
infer remain: [1.0, 1.0, 1.0, 1.0, 0.67, 1.0, 1.0, 1.0, 0.6]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.67, 0.67, 0.67, 0.67, 0.4]
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111101010001100000
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
011110111111101110001100110000
loss: 0.105506, lagrangian_loss: 0.000006, attention_score_distillation_loss: 0.028112
loss: 0.058316, lagrangian_loss: -0.000141, attention_score_distillation_loss: 0.028307
----------------------------------------------------------------------
time: 2023-07-19 14:55:05
Evaluating: pearson: 0.89, eval_loss: 0.4712, token_prune_loc: [False, False, False, False, True, False, True, False, True], macs_sparsity: 0.2038, expected_sparsity: 0.1981, expected_sequence_sparsity: 0.6089, target_sparsity: 0.1943, step: 2500
lambda_1: -0.2474, lambda_2: 18.9382 lambda_3: 0.0000
train remain: [1. 1. 1. 0.99 0.67 0.99 0.97 1. 0.62]
infer remain: [1.0, 1.0, 1.0, 1.0, 0.67, 1.0, 0.93, 1.0, 0.6]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.67, 0.67, 0.62, 0.62, 0.37]
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111101010001100000
111111111111111111111111111111
111111111111111111111011111110
111111111111111111111111111111
011110111111101110001100110000
loss: 0.057751, lagrangian_loss: 0.000185, attention_score_distillation_loss: 0.027548
ETA: 2:22:05 | Epoch 13 finished. Took 64.49 seconds.
loss: 0.089278, lagrangian_loss: 0.000473, attention_score_distillation_loss: 0.027058
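`attention_score_distillation_loss` decays smoothly over the whole run; together with the "NDCG TOPK= 20" banner (topk=20 in AdditionalArguments), it suggests the token-scoring pruner is trained to reproduce the teacher's ranking of tokens by attention mass. The exact loss is not shown in this log; a purely illustrative listwise sketch:

```python
import torch
import torch.nn.functional as F

def listwise_distill(student_scores, teacher_scores, k=20):
    # match distributions on the teacher's top-k token positions
    topk = teacher_scores.topk(k, dim=-1).indices
    t = F.softmax(teacher_scores.gather(-1, topk), dim=-1)
    s = F.log_softmax(student_scores.gather(-1, topk), dim=-1)
    return F.kl_div(s, t, reduction="batchmean")

student = torch.randn(8, 128)  # pruner's per-token scores
teacher = torch.randn(8, 128)  # aggregated teacher attention per token
print(listwise_distill(student, teacher))
```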
----------------------------------------------------------------------
time: 2023-07-19 14:55:39
Evaluating: pearson: 0.8896, eval_loss: 0.4698, token_prune_loc: [False, False, False, False, True, False, True, False, True], macs_sparsity: 0.2289, expected_sparsity: 0.2136, expected_sequence_sparsity: 0.6165, target_sparsity: 0.2021, step: 2600
lambda_1: -0.3399, lambda_2: 18.9398 lambda_3: 0.0000
train remain: [1. 1. 1. 0.99 0.66 0.99 0.97 0.99 0.59]
infer remain: [1.0, 1.0, 1.0, 1.0, 0.63, 1.0, 0.93, 1.0, 0.57]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.63, 0.63, 0.59, 0.59, 0.33]
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111101000001100000
111111111111111111111111111111
111111111111111111111011111110
111111111111111111111111111111
011110111111001110001100110000
loss: 0.087580, lagrangian_loss: 0.000204, attention_score_distillation_loss: 0.027054
loss: 0.081310, lagrangian_loss: 0.000382, attention_score_distillation_loss: 0.027019
----------------------------------------------------------------------
time: 2023-07-19 14:56:14
Evaluating: pearson: 0.8898, eval_loss: 0.4696, token_prune_loc: [False, False, False, False, True, False, True, False, True], macs_sparsity: 0.2289, expected_sparsity: 0.2136, expected_sequence_sparsity: 0.6165, target_sparsity: 0.2098, step: 2700
lambda_1: -0.4161, lambda_2: 18.9419 lambda_3: 0.0000
train remain: [1. 1. 1. 0.99 0.65 0.99 0.97 0.99 0.57]
infer remain: [1.0, 1.0, 1.0, 1.0, 0.63, 1.0, 0.93, 1.0, 0.57]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.63, 0.63, 0.59, 0.59, 0.33]
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111101000001100000
111111111111111111111111111111
111111111111111111111011111110
111111111111111111111111111111
011110111111101110001100010000
ETA: 2:21:18 | Epoch 14 finished. Took 64.39 seconds.
loss: 0.048841, lagrangian_loss: -0.000226, attention_score_distillation_loss: 0.026317
loss: 0.048041, lagrangian_loss: -0.000512, attention_score_distillation_loss: 0.026785
----------------------------------------------------------------------
time: 2023-07-19 14:56:49
Evaluating: pearson: 0.8888, eval_loss: 0.4741, token_prune_loc: [False, False, False, False, True, False, True, False, True], macs_sparsity: 0.2289, expected_sparsity: 0.2162, expected_sequence_sparsity: 0.6178, target_sparsity: 0.2176, step: 2800
lambda_1: -0.2352, lambda_2: 18.9457 lambda_3: 0.0000
train remain: [1. 1. 1. 0.99 0.64 0.99 0.96 0.99 0.56]
infer remain: [1.0, 1.0, 1.0, 1.0, 0.63, 1.0, 0.93, 1.0, 0.53]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.63, 0.63, 0.59, 0.59, 0.32]
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111101000001100000
111111111111111111111111111111
111111111111111111111011111110
111111111111111111111111111111
011110111111001110001100010000
loss: 0.061396, lagrangian_loss: 0.000215, attention_score_distillation_loss: 0.026574
loss: 0.038992, lagrangian_loss: 0.000188, attention_score_distillation_loss: 0.026458
ETA: 2:19:24 | Epoch 15 finished. Took 56.62 seconds.
----------------------------------------------------------------------
time: 2023-07-19 14:57:24
Evaluating: pearson: 0.8897, eval_loss: 0.4706, token_prune_loc: [False, False, False, False, True, False, True, False, True], macs_sparsity: 0.2289, expected_sparsity: 0.2162, expected_sequence_sparsity: 0.6178, target_sparsity: 0.2254, step: 2900
lambda_1: -0.5198, lambda_2: 18.9525 lambda_3: 0.0000
train remain: [1. 1. 1.
0.98 0.63 0.98 0.96 0.99 0.53]
infer remain: [1.0, 1.0, 1.0, 1.0, 0.63, 1.0, 0.93, 1.0, 0.53]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.63, 0.63, 0.59, 0.59, 0.32]
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111101000001100000
111111111111111111111111111111
111111111111111111111011111110
111111111111111111111111111111
011110111111001110001100010000
loss: 0.039197, lagrangian_loss: -0.000011, attention_score_distillation_loss: 0.026101
loss: 0.041652, lagrangian_loss: -0.000574, attention_score_distillation_loss: 0.025387
----------------------------------------------------------------------
time: 2023-07-19 14:57:59
Evaluating: pearson: 0.8906, eval_loss: 0.4676, token_prune_loc: [False, False, False, False, True, False, True, False, True], macs_sparsity: 0.2343, expected_sparsity: 0.2288, expected_sequence_sparsity: 0.624, target_sparsity: 0.2332, step: 3000
lambda_1: -0.1752, lambda_2: 18.9618 lambda_3: 0.0000
train remain: [1. 1. 1. 0.97 0.62 0.98 0.95 0.99 0.53]
infer remain: [1.0, 1.0, 1.0, 1.0, 0.6, 1.0, 0.93, 1.0, 0.53]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.6, 0.6, 0.56, 0.56, 0.3]
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111100000001100000
111111111111111111111111111111
111111111111111111111011111110
111111111111111111111111111111
011110111111001110001100010000
loss: 0.041724, lagrangian_loss: -0.000224, attention_score_distillation_loss: 0.025668
loss: 0.041907, lagrangian_loss: 0.000382, attention_score_distillation_loss: 0.025185
ETA: 2:18:35 | Epoch 16 finished. Took 64.23 seconds.
----------------------------------------------------------------------
time: 2023-07-19 14:58:34
Evaluating: pearson: 0.8897, eval_loss: 0.4717, token_prune_loc: [False, False, False, False, True, False, True, False, True], macs_sparsity: 0.2387, expected_sparsity: 0.2312, expected_sequence_sparsity: 0.6252, target_sparsity: 0.241, step: 3100
lambda_1: -0.3655, lambda_2: 18.9691 lambda_3: 0.0000
train remain: [1. 1. 1. 0.96 0.62 0.97 0.95 0.99 0.53]
infer remain: [1.0, 1.0, 1.0, 1.0, 0.6, 1.0, 0.93, 1.0, 0.5]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.6, 0.6, 0.56, 0.56, 0.28]
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111100000001100000
111111111111111111111111111111
111111111111111111111011111110
111111111111111111111111111111
011110111111001110001100000000
loss: 0.043067, lagrangian_loss: 0.000235, attention_score_distillation_loss: 0.024911
loss: 0.047409, lagrangian_loss: -0.000718, attention_score_distillation_loss: 0.024555
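The plain `loss` column is the distillation-augmented task objective (do_distill=True with distill_loss_alpha=0.9, distill_ce_loss_alpha=0.004, distill_temp=2.0 in AdditionalArguments). How the knobs combine is repo-specific; one plausible reading for a regression task like STSB, where the "CE" term degenerates to an MSE against the labels:

```python
import torch
import torch.nn.functional as F

def total_task_loss(student_logits, teacher_logits, labels,
                    alpha=0.9, ce_alpha=0.004, temp=2.0):
    kd = F.mse_loss(student_logits / temp, teacher_logits / temp)  # follow teacher
    task = F.mse_loss(student_logits.squeeze(-1), labels)          # ground truth
    return alpha * kd + ce_alpha * task

s, t, y = torch.randn(8, 1), torch.randn(8, 1), torch.rand(8) * 5
print(total_task_loss(s, t, y))
```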
----------------------------------------------------------------------
time: 2023-07-19 14:59:09
Evaluating: pearson: 0.8919, eval_loss: 0.4644, token_prune_loc: [False, False, False, False, True, False, True, False, True], macs_sparsity: 0.2387, expected_sparsity: 0.2312, expected_sequence_sparsity: 0.6252, target_sparsity: 0.2487, step: 3200
lambda_1: -0.0285, lambda_2: 18.9834 lambda_3: 0.0000
train remain: [1. 1. 1. 0.94 0.61 0.97 0.95 0.99 0.53]
infer remain: [1.0, 1.0, 1.0, 1.0, 0.6, 1.0, 0.93, 1.0, 0.5]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.6, 0.6, 0.56, 0.56, 0.28]
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111100000001100000
111111111111111111111111111111
111111111111111111111011111110
111111111111111111111111111111
011110111111001110001100000000
loss: 0.062500, lagrangian_loss: -0.000011, attention_score_distillation_loss: 0.024622
ETA: 2:17:48 | Epoch 17 finished. Took 64.6 seconds.
loss: 0.043902, lagrangian_loss: 0.000214, attention_score_distillation_loss: 0.024234
----------------------------------------------------------------------
time: 2023-07-19 14:59:44
Evaluating: pearson: 0.8914, eval_loss: 0.4644, token_prune_loc: [False, False, False, True, True, False, True, False, True], macs_sparsity: 0.2809, expected_sparsity: 0.2717, expected_sequence_sparsity: 0.6451, target_sparsity: 0.2565, step: 3300
lambda_1: -0.2990, lambda_2: 18.9937 lambda_3: 0.0000
train remain: [1. 1. 1. 0.93 0.59 0.97 0.95 0.99 0.53]
infer remain: [1.0, 1.0, 1.0, 0.87, 0.6, 1.0, 0.93, 1.0, 0.5]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.87, 0.52, 0.52, 0.49, 0.49, 0.24]
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111100010
111111111111111100000001100000
111111111111111111111111111111
111111111111111111111011111110
111111111111111111111111111111
011110111111001110001100000000
loss: 0.057916, lagrangian_loss: -0.000212, attention_score_distillation_loss: 0.024072
loss: 0.056418, lagrangian_loss: -0.000002, attention_score_distillation_loss: 0.024042
----------------------------------------------------------------------
time: 2023-07-19 15:00:19
Evaluating: pearson: 0.8908, eval_loss: 0.4674, token_prune_loc: [False, False, False, True, True, False, True, False, True], macs_sparsity: 0.2809, expected_sparsity: 0.2717, expected_sequence_sparsity: 0.6451, target_sparsity: 0.2643, step: 3400
lambda_1: -0.1758, lambda_2: 19.0113 lambda_3: 0.0000
train remain: [1. 1. 1. 0.92 0.59 0.97 0.95 0.99 0.52]
infer remain: [1.0, 1.0, 1.0, 0.87, 0.6, 1.0, 0.93, 1.0, 0.5]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.87, 0.52, 0.52, 0.49, 0.49, 0.24]
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111100010
111111111111111100000001100000
111111111111111111111111111111
111111111111111111111011111110
111111111111111111111111111111
011110111111001110001100000000
loss: 0.050660, lagrangian_loss: 0.001058, attention_score_distillation_loss: 0.023293
ETA: 2:16:57 | Epoch 18 finished. Took 64.38 seconds.
loss: 0.037479, lagrangian_loss: -0.000831, attention_score_distillation_loss: 0.023715
----------------------------------------------------------------------
time: 2023-07-19 15:00:53
Evaluating: pearson: 0.891, eval_loss: 0.468, token_prune_loc: [False, False, False, True, True, False, True, False, True], macs_sparsity: 0.2961, expected_sparsity: 0.2825, expected_sequence_sparsity: 0.6504, target_sparsity: 0.2721, step: 3500
lambda_1: -0.1208, lambda_2: 19.0366 lambda_3: 0.0000
train remain: [1. 1. 1.
0.9 0.58 0.96 0.94 0.98 0.52]
infer remain: [1.0, 1.0, 1.0, 0.87, 0.57, 1.0, 0.93, 1.0, 0.5]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.87, 0.49, 0.49, 0.46, 0.46, 0.23]
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111100010
111111111111111100000001000000
111111111111111111111111111111
111111111111111111111011111110
111111111111111111111111111111
011110111111001110001100000000
loss: 0.035374, lagrangian_loss: -0.000142, attention_score_distillation_loss: 0.023079
loss: 0.033941, lagrangian_loss: 0.000048, attention_score_distillation_loss: 0.023379
----------------------------------------------------------------------
time: 2023-07-19 15:01:29
Evaluating: pearson: 0.8888, eval_loss: 0.4787, token_prune_loc: [False, False, False, True, True, False, True, False, True], macs_sparsity: 0.2961, expected_sparsity: 0.2825, expected_sequence_sparsity: 0.6504, target_sparsity: 0.2798, step: 3600
lambda_1: -0.5674, lambda_2: 19.0791 lambda_3: 0.0000
train remain: [1. 1. 1. 0.89 0.58 0.96 0.94 0.98 0.52]
infer remain: [1.0, 1.0, 1.0, 0.87, 0.57, 1.0, 0.93, 1.0, 0.5]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.87, 0.49, 0.49, 0.46, 0.46, 0.23]
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111100010
111111111111111100000001000000
111111111111111111111111111111
111111111111111111111011111110
111111111111111111111111111111
011110111111001110001100000000
ETA: 2:16:06 | Epoch 19 finished. Took 64.49 seconds.
loss: 0.080542, lagrangian_loss: 0.001079, attention_score_distillation_loss: 0.023240
loss: 0.053110, lagrangian_loss: -0.001927, attention_score_distillation_loss: 0.022204
----------------------------------------------------------------------
time: 2023-07-19 15:02:04
Evaluating: pearson: 0.8882, eval_loss: 0.4764, token_prune_loc: [False, False, False, True, True, False, True, False, True], macs_sparsity: 0.3015, expected_sparsity: 0.2922, expected_sequence_sparsity: 0.6552, target_sparsity: 0.2876, step: 3700
lambda_1: 0.2122, lambda_2: 19.1497 lambda_3: 0.0000
train remain: [1. 1. 1. 0.87 0.57 0.96 0.94 0.97 0.52]
infer remain: [1.0, 1.0, 1.0, 0.83, 0.57, 1.0, 0.93, 1.0, 0.5]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.83, 0.47, 0.47, 0.44, 0.44, 0.22]
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111000010
111111111111111100000001000000
111111111111111111111111111111
111111111111111111111011111110
111111111111111111111111111111
011110111111001110001100000000
loss: 0.053636, lagrangian_loss: 0.000232, attention_score_distillation_loss: 0.022471
loss: 0.043717, lagrangian_loss: 0.001853, attention_score_distillation_loss: 0.022379
ETA: 2:14:24 | Epoch 20 finished. Took 56.45 seconds.
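The ETA lines are simple bookkeeping: epochs still to run times a (smoothed) epoch duration, with num_train_epochs=150. A sketch that lands within a second of the first estimate:

```python
import datetime

def eta(epochs_done, epoch_seconds, total_epochs=150):
    remaining = (total_epochs - epochs_done) * epoch_seconds
    return str(datetime.timedelta(seconds=round(remaining)))

print(eta(1, 57.47))  # '2:22:43', cf. "ETA: 2:22:42" after epoch 0
```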
----------------------------------------------------------------------
time: 2023-07-19 15:02:39
Evaluating: pearson: 0.8854, eval_loss: 0.5067, token_prune_loc: [False, False, False, True, True, True, True, False, True], macs_sparsity: 0.3123, expected_sparsity: 0.306, expected_sequence_sparsity: 0.662, target_sparsity: 0.2954, step: 3800
lambda_1: -0.9229, lambda_2: 19.2555 lambda_3: 0.0000
train remain: [1. 1. 1. 0.86 0.56 0.95 0.93 0.96 0.51]
infer remain: [1.0, 1.0, 1.0, 0.83, 0.57, 0.9, 0.93, 1.0, 0.5]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.83, 0.47, 0.42, 0.4, 0.4, 0.2]
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111000010
111111111111111100000001000000
111111111111111110111101111110
111111111111111111111011111110
111111111111111111111111111111
011110111111001110001100000000
loss: 0.090108, lagrangian_loss: -0.001601, attention_score_distillation_loss: 0.022251
loss: 0.045245, lagrangian_loss: -0.000776, attention_score_distillation_loss: 0.022093
----------------------------------------------------------------------
time: 2023-07-19 15:03:13
Evaluating: pearson: 0.887, eval_loss: 0.4814, token_prune_loc: [False, False, False, True, True, True, True, True, True], macs_sparsity: 0.3248, expected_sparsity: 0.3129, expected_sequence_sparsity: 0.6654, target_sparsity: 0.3032, step: 3900
lambda_1: 0.4844, lambda_2: 19.3734 lambda_3: 0.0000
train remain: [1. 1. 1. 0.86 0.56 0.94 0.93 0.96 0.5]
infer remain: [1.0, 1.0, 1.0, 0.83, 0.57, 0.9, 0.9, 0.93, 0.5]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.83, 0.47, 0.42, 0.38, 0.36, 0.18]
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111000010
111111111111111100000001000000
111111111111111110111101111110
111111111111111111111011111100
111111111111111111111111110110
011110111111001110001100000000
loss: 0.035132, lagrangian_loss: -0.000674, attention_score_distillation_loss: 0.021684
loss: 0.042056, lagrangian_loss: 0.004960, attention_score_distillation_loss: 0.021617
ETA: 2:13:33 | Epoch 21 finished. Took 64.52 seconds.
----------------------------------------------------------------------
time: 2023-07-19 15:03:48
Evaluating: pearson: 0.8846, eval_loss: 0.5021, token_prune_loc: [False, False, False, True, True, True, True, True, True], macs_sparsity: 0.3373, expected_sparsity: 0.3238, expected_sequence_sparsity: 0.6707, target_sparsity: 0.311, step: 4000
lambda_1: -1.3388, lambda_2: 19.5872 lambda_3: 0.0000
train remain: [1. 1. 1. 0.84 0.54 0.93 0.91 0.94 0.49]
infer remain: [1.0, 1.0, 1.0, 0.83, 0.53, 0.9, 0.9, 0.9, 0.5]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.83, 0.44, 0.4, 0.36, 0.32, 0.16]
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111000010
111111111111011100000001000000
111111111111111110111101111110
111111111111111111111011111100
101111111111111111111111110110
011110111111001110001100000000
loss: 0.039776, lagrangian_loss: -0.001281, attention_score_distillation_loss: 0.020963
loss: 0.027263, lagrangian_loss: -0.004392, attention_score_distillation_loss: 0.020882
----------------------------------------------------------------------
time: 2023-07-19 15:04:24
Evaluating: pearson: 0.8832, eval_loss: 0.4956, token_prune_loc: [False, False, False, True, True, True, True, True, True], macs_sparsity: 0.3373, expected_sparsity: 0.3238, expected_sequence_sparsity: 0.6707, target_sparsity: 0.3187, step: 4100
lambda_1: 0.4694, lambda_2: 19.7868 lambda_3: 0.0000
train remain: [1. 1. 1.
0.84 0.53 0.92 0.91 0.91 0.49]
infer remain: [1.0, 1.0, 1.0, 0.83, 0.53, 0.9, 0.9, 0.9, 0.5]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.83, 0.44, 0.4, 0.36, 0.32, 0.16]
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111000010
111111111111011100000001000000
111111111111111110111101111110
111111111111111111111011111100
101111111111111111111111110110
011110111111001010001100000001
loss: 0.053875, lagrangian_loss: 0.001513, attention_score_distillation_loss: 0.020713
ETA: 2:12:43 | Epoch 22 finished. Took 64.86 seconds.
loss: 0.038053, lagrangian_loss: 0.000921, attention_score_distillation_loss: 0.020522
----------------------------------------------------------------------
time: 2023-07-19 15:04:59
Evaluating: pearson: 0.8849, eval_loss: 0.4938, token_prune_loc: [False, False, False, True, True, True, True, True, True], macs_sparsity: 0.3373, expected_sparsity: 0.3238, expected_sequence_sparsity: 0.6707, target_sparsity: 0.3265, step: 4200
lambda_1: -1.1854, lambda_2: 20.0367 lambda_3: 0.0000
train remain: [1. 1. 1. 0.84 0.52 0.91 0.91 0.9 0.49]
infer remain: [1.0, 1.0, 1.0, 0.83, 0.53, 0.9, 0.9, 0.9, 0.5]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.83, 0.44, 0.4, 0.36, 0.32, 0.16]
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111000010
111111111111011100000001000000
111111111111111110111101111110
111111111111111111111011111100
101111111111111111111111110110
011111111111001010001100000000
loss: 0.027434, lagrangian_loss: 0.005110, attention_score_distillation_loss: 0.019793
loss: 0.035636, lagrangian_loss: -0.008713, attention_score_distillation_loss: 0.020328
----------------------------------------------------------------------
time: 2023-07-19 15:05:34
Evaluating: pearson: 0.8829, eval_loss: 0.4962, token_prune_loc: [False, False, False, True, True, True, True, True, True], macs_sparsity: 0.3578, expected_sparsity: 0.349, expected_sequence_sparsity: 0.6831, target_sparsity: 0.3343, step: 4300
lambda_1: 0.6094, lambda_2: 20.4433 lambda_3: 0.0000
train remain: [1. 1. 1. 0.82 0.5 0.9 0.9 0.83 0.49]
infer remain: [1.0, 1.0, 1.0, 0.8, 0.5, 0.87, 0.9, 0.8, 0.5]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.4, 0.35, 0.31, 0.25, 0.12]
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111000000
111111111111011000000001000000
111111111111111110111101111100
111111111111111111111011111100
101111111101111011111011110110
011110111111001010001100000001
loss: 0.033252, lagrangian_loss: 0.004554, attention_score_distillation_loss: 0.019891
ETA: 2:11:50 | Epoch 23 finished. Took 64.61 seconds.
loss: 0.035325, lagrangian_loss: 0.000865, attention_score_distillation_loss: 0.019403
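`Evaluating: pearson` is the standard GLUE metric for STSB (Pearson correlation between predicted and gold similarity scores), computable with the same SciPy whose version warning opens this log:

```python
from scipy.stats import pearsonr

predictions = [2.5, 0.0, 4.9, 3.2]  # model similarity scores
references = [3.0, 0.5, 5.0, 2.8]   # gold annotations (0-5 scale)
print(pearsonr(predictions, references)[0])  # correlation in [-1, 1]
```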
----------------------------------------------------------------------
time: 2023-07-19 15:06:09
Evaluating: pearson: 0.8845, eval_loss: 0.4906, token_prune_loc: [False, False, False, True, True, True, True, True, True], macs_sparsity: 0.3578, expected_sparsity: 0.3501, expected_sequence_sparsity: 0.6837, target_sparsity: 0.3421, step: 4400
lambda_1: -1.5153, lambda_2: 21.2849 lambda_3: 0.0000
train remain: [1. 1. 1. 0.82 0.5 0.9 0.9 0.82 0.48]
infer remain: [1.0, 1.0, 1.0, 0.8, 0.5, 0.87, 0.9, 0.8, 0.47]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.4, 0.35, 0.31, 0.25, 0.12]
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111000000
111111111111011000000001000000
111111111111111110111101111100
111111111111111111111011111100
101111111101111011111011110110
011110111111001010001100000000
loss: 0.042771, lagrangian_loss: -0.001151, attention_score_distillation_loss: 0.019567
loss: 0.026257, lagrangian_loss: -0.005064, attention_score_distillation_loss: 0.019067
----------------------------------------------------------------------
time: 2023-07-19 15:06:44
Evaluating: pearson: 0.8848, eval_loss: 0.4902, token_prune_loc: [False, False, False, True, True, True, True, True, True], macs_sparsity: 0.3605, expected_sparsity: 0.3538, expected_sequence_sparsity: 0.6855, target_sparsity: 0.3498, step: 4500
lambda_1: -0.2251, lambda_2: 21.4976 lambda_3: 0.0000
train remain: [1. 1. 1. 0.8 0.49 0.87 0.88 0.78 0.47]
infer remain: [1.0, 1.0, 1.0, 0.8, 0.5, 0.87, 0.87, 0.77, 0.47]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.4, 0.35, 0.3, 0.23, 0.11]
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111000000
111111111111011000000001000000
111111111111111110111101111100
111111111111111011111011111100
101111111101111011111011100110
011110111111001010001100000000
ETA: 2:10:58 | Epoch 24 finished. Took 64.78 seconds.
loss: 0.064180, lagrangian_loss: -0.000522, attention_score_distillation_loss: 0.019287
loss: 0.022867, lagrangian_loss: -0.000607, attention_score_distillation_loss: 0.018857
----------------------------------------------------------------------
time: 2023-07-19 15:07:19
Evaluating: pearson: 0.8841, eval_loss: 0.4994, token_prune_loc: [False, False, False, True, True, True, True, True, True], macs_sparsity: 0.3578, expected_sparsity: 0.3516, expected_sequence_sparsity: 0.6844, target_sparsity: 0.3576, step: 4600
lambda_1: -0.1311, lambda_2: 21.6853 lambda_3: 0.0000
train remain: [1. 1. 1. 0.81 0.5 0.87 0.89 0.79 0.48]
infer remain: [1.0, 1.0, 1.0, 0.8, 0.5, 0.87, 0.9, 0.77, 0.47]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.4, 0.35, 0.31, 0.24, 0.11]
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111000000
111111111111011000000000000001
111111111111111110111101111100
111111111111111111111011111100
101111111101111011111011100110
011110111111001010001100000000
loss: 0.025074, lagrangian_loss: 0.001952, attention_score_distillation_loss: 0.018833
loss: 0.042156, lagrangian_loss: 0.006298, attention_score_distillation_loss: 0.017956
ETA: 2:09:27 | Epoch 25 finished. Took 56.94 seconds.
----------------------------------------------------------------------
time: 2023-07-19 15:07:54
Evaluating: pearson: 0.8826, eval_loss: 0.4989, token_prune_loc: [False, False, False, True, True, True, True, True, True], macs_sparsity: 0.3838, expected_sparsity: 0.3729, expected_sequence_sparsity: 0.6948, target_sparsity: 0.3654, step: 4700
lambda_1: -1.5031, lambda_2: 21.9044 lambda_3: 0.0000
train remain: [1. 1. 1. 0.78 0.48 0.85 0.87 0.73 0.46]
infer remain: [1.0, 1.0, 1.0, 0.77, 0.47, 0.83, 0.87, 0.73, 0.47]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.77, 0.36, 0.3, 0.26, 0.19, 0.09]
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111011000000
111111111111011000000000000000
111111111111111010111101111100
111111111111111011111011111100
101111111101111011111011100100
011110111111001010001100000000
loss: 0.033408, lagrangian_loss: -0.003437, attention_score_distillation_loss: 0.018566
loss: 0.032524, lagrangian_loss: -0.003145, attention_score_distillation_loss: 0.018173
----------------------------------------------------------------------
time: 2023-07-19 15:08:29
Evaluating: pearson: 0.8865, eval_loss: 0.483, token_prune_loc: [False, False, False, True, True, True, True, True, True], macs_sparsity: 0.3838, expected_sparsity: 0.3761, expected_sequence_sparsity: 0.6964, target_sparsity: 0.3732, step: 4800
lambda_1: -0.8198, lambda_2: 21.9591 lambda_3: 0.0000
train remain: [1. 0.99 0.99 0.78 0.47 0.83 0.86 0.69 0.45]
infer remain: [1.0, 1.0, 1.0, 0.77, 0.47, 0.83, 0.87, 0.67, 0.43]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.77, 0.36, 0.3, 0.26, 0.17, 0.07]
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111011000000
111111111111011000000000000000
111111111111111010111101111100
111111111111111011111011111100
101111111101111011110011100000
011110111111001000001100000000
loss: 0.034116, lagrangian_loss: -0.002873, attention_score_distillation_loss: 0.018080
loss: 0.042883, lagrangian_loss: -0.000720, attention_score_distillation_loss: 0.017463
ETA: 2:08:33 | Epoch 26 finished. Took 64.6 seconds.
----------------------------------------------------------------------
time: 2023-07-19 15:09:04
Evaluating: pearson: 0.8855, eval_loss: 0.4901, token_prune_loc: [False, False, False, True, True, True, True, True, True], macs_sparsity: 0.3838, expected_sparsity: 0.3761, expected_sequence_sparsity: 0.6964, target_sparsity: 0.381, step: 4900
lambda_1: -0.3075, lambda_2: 22.0031 lambda_3: 0.0000
train remain: [1. 0.99 0.99 0.77 0.46 0.83 0.86 0.67 0.43]
infer remain: [1.0, 1.0, 1.0, 0.77, 0.47, 0.83, 0.87, 0.67, 0.43]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.77, 0.36, 0.3, 0.26, 0.17, 0.07]
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111111111111
111111111111111111111011000000
111111111111011000000000000000
111111111111111010111101111100
111111111111111011111011111100
101111111101111011110011100000
011110111111001000001100000000
loss: 0.072500, lagrangian_loss: 0.000887, attention_score_distillation_loss: 0.017141
loss: 0.029188, lagrangian_loss: 0.004306, attention_score_distillation_loss: 0.016807
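bin_num=30 is why every mask row has exactly 30 flags regardless of sentence length: token positions are presumably mapped to relative-position bins so that one gate per (layer, bin) serves all sequence lengths. A hypothetical mapping (this helper is illustrative, not the repo's):

```python
def bin_index(position, seq_len, bin_num=30):
    # map an absolute position to a relative-position bin in [0, bin_num)
    return min(position * bin_num // seq_len, bin_num - 1)

seq_len = 47
keep_bins = {i for i in range(30) if i % 3 != 2}  # stand-in for a layer's gate mask
kept = [p for p in range(seq_len) if bin_index(p, seq_len) in keep_bins]
print(len(kept) / seq_len)  # fraction of this sentence's tokens kept
```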
0.98 0.99 0.75 0.45 0.82 0.85 0.65 0.42] infer remain: [1.0, 1.0, 1.0, 0.73, 0.47, 0.8, 0.83, 0.63, 0.4] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.73, 0.34, 0.27, 0.23, 0.14, 0.06] 111111111111111111111111111111 111111111111111111111111111111 111111111111111111111111111111 111111111111111111111001000000 111111111111011000000000000000 111111111111111010111101111000 111111111111111011111011110100 101111111101011011110011100000 011110111111001000000100000000 loss: 0.034011, lagrangian_loss: -0.003784, attention_score_distillation_loss: 0.017256 ETA: 2:07:41 | Epoch 27 finished. Took 65.14 seconds. loss: 0.043225, lagrangian_loss: -0.001022, attention_score_distillation_loss: 0.016889 ---------------------------------------------------------------------- time: 2023-07-19 15:10:15 Evaluating: pearson: 0.8859, eval_loss: 0.4854, token_prune_loc: [False, False, False, True, True, True, True, True, True], macs_sparsity: 0.4097, expected_sparsity: 0.395, expected_sequence_sparsity: 0.7057, target_sparsity: 0.3965, step: 5100 lambda_1: 0.1164, lambda_2: 22.1968 lambda_3: 0.0000 train remain: [1. 0.99 0.98 0.74 0.45 0.82 0.85 0.65 0.41] infer remain: [1.0, 1.0, 1.0, 0.73, 0.43, 0.8, 0.83, 0.63, 0.4] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.73, 0.32, 0.25, 0.21, 0.13, 0.05] 111111111111111111111111111111 111111111111111111111111111111 111111111111111111111111111111 111111111111111111111001000000 111111111111001000000000000000 111111111111111010111101111000 111111111111111011111011110100 101111111101011011110011100000 011110111111001000000100000000 loss: 0.044787, lagrangian_loss: -0.000109, attention_score_distillation_loss: 0.016204 loss: 0.078612, lagrangian_loss: 0.004198, attention_score_distillation_loss: 0.016360 ---------------------------------------------------------------------- time: 2023-07-19 15:10:50 Evaluating: pearson: 0.8871, eval_loss: 0.482, token_prune_loc: [False, False, False, True, True, True, True, True, True], macs_sparsity: 0.4097, expected_sparsity: 0.395, expected_sequence_sparsity: 0.7057, target_sparsity: 0.4043, step: 5200 lambda_1: -1.4303, lambda_2: 22.4164 lambda_3: 0.0000 train remain: [1. 0.98 0.98 0.74 0.43 0.81 0.84 0.64 0.4 ] infer remain: [1.0, 1.0, 1.0, 0.73, 0.43, 0.8, 0.83, 0.63, 0.4] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.73, 0.32, 0.25, 0.21, 0.13, 0.05] 111111111111111111111111111111 111111111111111111111111111111 111111111111111111111111111111 111111111111111111111001000000 111111111111001000000000000000 111111111111111010111101111000 111111111111111011111011110100 101111111101011011110011100000 011110111111001000000100000000 loss: 0.032933, lagrangian_loss: 0.006532, attention_score_distillation_loss: 0.015942 ETA: 2:06:46 | Epoch 28 finished. Took 64.69 seconds. loss: 0.063280, lagrangian_loss: -0.006930, attention_score_distillation_loss: 0.015954 ---------------------------------------------------------------------- time: 2023-07-19 15:11:25 Evaluating: pearson: 0.8836, eval_loss: 0.4956, token_prune_loc: [False, True, True, True, True, True, True, True, True], macs_sparsity: 0.442, expected_sparsity: 0.4334, expected_sequence_sparsity: 0.7246, target_sparsity: 0.4121, step: 5300 lambda_1: -0.2468, lambda_2: 22.7796 lambda_3: 0.0000 train remain: [1. 
0.97 0.91 0.74 0.43 0.8 0.82 0.63 0.4 ] infer remain: [1.0, 0.97, 0.87, 0.73, 0.43, 0.8, 0.83, 0.63, 0.4] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.97, 0.84, 0.61, 0.27, 0.21, 0.18, 0.11, 0.04] 111111111111111111111111111111 111111111111111111111111111110 111111111111111011111111101100 111111111111111111111001000000 111111111111000000000000000010 111111111111111010111101111000 111111111111111011111011110100 101111111101011011110011100000 011110111111001000000100000000 loss: 0.032152, lagrangian_loss: -0.000352, attention_score_distillation_loss: 0.015807 loss: 0.033457, lagrangian_loss: -0.002386, attention_score_distillation_loss: 0.015325 ---------------------------------------------------------------------- time: 2023-07-19 15:12:00 Evaluating: pearson: 0.8854, eval_loss: 0.4896, token_prune_loc: [False, True, True, True, True, True, True, True, True], macs_sparsity: 0.4366, expected_sparsity: 0.4263, expected_sequence_sparsity: 0.7211, target_sparsity: 0.4198, step: 5400 lambda_1: -0.9580, lambda_2: 23.4250 lambda_3: 0.0000 train remain: [1. 0.98 0.94 0.74 0.43 0.8 0.82 0.63 0.4 ] infer remain: [1.0, 0.97, 0.9, 0.73, 0.43, 0.8, 0.83, 0.63, 0.4] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.97, 0.87, 0.64, 0.28, 0.22, 0.18, 0.12, 0.05] 111111111111111111111111111111 111111111111111111111111111110 111111111111111111111111101100 111111111111111111111001000000 111111111111000000000100000000 111111111111111010111101111000 111111111111111011111011110100 101111111101011011110011100000 011110111111001000000100000000 ETA: 2:05:51 | Epoch 29 finished. Took 64.88 seconds. loss: 0.043780, lagrangian_loss: 0.009604, attention_score_distillation_loss: 0.015192 loss: 0.036972, lagrangian_loss: -0.006099, attention_score_distillation_loss: 0.014970 ---------------------------------------------------------------------- time: 2023-07-19 15:12:35 Evaluating: pearson: 0.8835, eval_loss: 0.4947, token_prune_loc: [False, True, True, True, True, True, True, True, True], macs_sparsity: 0.4528, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.729, target_sparsity: 0.4276, step: 5500 lambda_1: -0.8624, lambda_2: 23.8150 lambda_3: 0.0000 train remain: [1. 0.97 0.86 0.74 0.42 0.8 0.8 0.62 0.39] infer remain: [1.0, 0.97, 0.83, 0.73, 0.43, 0.8, 0.8, 0.6, 0.4] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.97, 0.81, 0.59, 0.26, 0.2, 0.16, 0.1, 0.04] 111111111111111111111111111111 111111111111111111111111111110 111111111111111011111111001100 111111111111111111111001000000 111111111111000100000000000000 111111111111111010111101111000 111111111111111011110011110100 101111111101011001110011100000 011110111111001000000100000000 loss: 0.029143, lagrangian_loss: -0.003281, attention_score_distillation_loss: 0.014426 loss: 0.064067, lagrangian_loss: 0.000906, attention_score_distillation_loss: 0.014830 ETA: 2:04:24 | Epoch 30 finished. Took 56.73 seconds. ---------------------------------------------------------------------- time: 2023-07-19 15:13:10 Evaluating: pearson: 0.875, eval_loss: 0.5317, token_prune_loc: [False, True, True, True, True, True, True, True, True], macs_sparsity: 0.442, expected_sparsity: 0.4346, expected_sequence_sparsity: 0.7252, target_sparsity: 0.4354, step: 5600 lambda_1: 0.0196, lambda_2: 24.1434 lambda_3: 0.0000 train remain: [1. 
0.98 0.88 0.75 0.42 0.8 0.81 0.62 0.4 ] infer remain: [1.0, 0.97, 0.87, 0.73, 0.43, 0.8, 0.8, 0.63, 0.4] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.97, 0.84, 0.61, 0.27, 0.21, 0.17, 0.11, 0.04] 111111111111111111111111111111 111111111111111111111111111110 111111111111111011111111101100 111111111111111111111001000000 111111111111000001000000000000 111111111111111010111101111000 111111111111111011110011110100 101111111101011011110011100000 010110111111001000010100000000 loss: 0.074807, lagrangian_loss: 0.000877, attention_score_distillation_loss: 0.014705 loss: 0.024375, lagrangian_loss: 0.007127, attention_score_distillation_loss: 0.014275 ---------------------------------------------------------------------- time: 2023-07-19 15:13:45 Evaluating: pearson: 0.8815, eval_loss: 0.5042, token_prune_loc: [False, True, True, True, True, True, True, True, True], macs_sparsity: 0.4626, expected_sparsity: 0.4492, expected_sequence_sparsity: 0.7324, target_sparsity: 0.4432, step: 5700 lambda_1: -2.0285, lambda_2: 24.6559 lambda_3: 0.0000 train remain: [1. 0.97 0.84 0.74 0.4 0.78 0.79 0.61 0.38] infer remain: [1.0, 0.97, 0.83, 0.73, 0.4, 0.77, 0.8, 0.6, 0.37] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.97, 0.81, 0.59, 0.24, 0.18, 0.14, 0.09, 0.03] 111111111111111111111111111111 111111111111111111111111111110 111111111111111011111111001100 111111111111111111111001000000 111111111111000000000000000000 111111111111111010111101110000 111111111111111011110011110100 101111111101011001110011100000 010110111111001000000100000000 loss: 0.030542, lagrangian_loss: 0.005629, attention_score_distillation_loss: 0.013648 loss: 0.017180, lagrangian_loss: 0.002210, attention_score_distillation_loss: 0.013331 ETA: 2:03:28 | Epoch 31 finished. Took 64.53 seconds. ---------------------------------------------------------------------- time: 2023-07-19 15:14:20 Evaluating: pearson: 0.8851, eval_loss: 0.4902, token_prune_loc: [False, True, True, True, True, True, True, True, True], macs_sparsity: 0.476, expected_sparsity: 0.4666, expected_sequence_sparsity: 0.7409, target_sparsity: 0.451, step: 5800 lambda_1: -1.4935, lambda_2: 24.7928 lambda_3: 0.0000 train remain: [1. 0.97 0.82 0.7 0.38 0.76 0.74 0.61 0.34] infer remain: [1.0, 0.97, 0.8, 0.7, 0.37, 0.77, 0.73, 0.6, 0.33] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.97, 0.77, 0.54, 0.2, 0.15, 0.11, 0.07, 0.02] 111111111111111111111111111111 111111111111111111111111111110 111111111111111011111011001100 111111111111111111110001000000 111111111110000000000000000000 111111111111111010111101110000 111111111101111011110010110100 101111111101011001110011100000 000110111111001000000100000000 loss: 0.048888, lagrangian_loss: -0.008006, attention_score_distillation_loss: 0.013155 loss: 0.088103, lagrangian_loss: 0.000602, attention_score_distillation_loss: 0.013206 ---------------------------------------------------------------------- time: 2023-07-19 15:14:55 Evaluating: pearson: 0.8842, eval_loss: 0.4924, token_prune_loc: [False, True, True, True, True, True, True, True, True], macs_sparsity: 0.4706, expected_sparsity: 0.456, expected_sequence_sparsity: 0.7357, target_sparsity: 0.4587, step: 5900 lambda_1: 0.7277, lambda_2: 25.4812 lambda_3: 0.0000 train remain: [1. 
0.97 0.82 0.7 0.39 0.78 0.74 0.61 0.34] infer remain: [1.0, 0.97, 0.83, 0.7, 0.4, 0.77, 0.73, 0.6, 0.33] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.97, 0.81, 0.56, 0.23, 0.17, 0.13, 0.08, 0.03] 111111111111111111111111111111 111111111111111111111111111110 111111111111111011111111001100 111111111111111111110001000000 111111111110000000000000000001 111111111111111010111101110000 111111111101111011110010110100 101111111101011001110011100000 000110111111001000000100000000 loss: 0.034211, lagrangian_loss: -0.001736, attention_score_distillation_loss: 0.013145 ETA: 2:02:31 | Epoch 32 finished. Took 64.29 seconds. loss: 0.024530, lagrangian_loss: 0.007966, attention_score_distillation_loss: 0.012792 ---------------------------------------------------------------------- time: 2023-07-19 15:15:30 Evaluating: pearson: 0.8839, eval_loss: 0.5077, token_prune_loc: [False, True, True, True, True, True, True, True, True], macs_sparsity: 0.484, expected_sparsity: 0.4708, expected_sequence_sparsity: 0.743, target_sparsity: 0.4665, step: 6000 lambda_1: -1.7084, lambda_2: 26.3562 lambda_3: 0.0000 train remain: [1. 0.97 0.82 0.68 0.38 0.76 0.73 0.61 0.33] infer remain: [1.0, 0.97, 0.8, 0.67, 0.37, 0.77, 0.73, 0.6, 0.33] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.97, 0.77, 0.52, 0.19, 0.14, 0.11, 0.06, 0.02] 111111111111111111111111111111 111111111111111111111111111110 111111111111111011111011001100 111111111111111111110000000000 111111111110000000000000000000 111111111111111010111101110000 111111111101111011110010110100 101111111101011001110011100000 000110111111001000000100000000 loss: 0.050172, lagrangian_loss: 0.011862, attention_score_distillation_loss: 0.012548 loss: 0.069147, lagrangian_loss: 0.007502, attention_score_distillation_loss: 0.012470 ---------------------------------------------------------------------- time: 2023-07-19 15:16:05 Evaluating: pearson: 0.8836, eval_loss: 0.4951, token_prune_loc: [False, True, True, True, True, True, True, True, True], macs_sparsity: 0.4894, expected_sparsity: 0.4778, expected_sequence_sparsity: 0.7464, target_sparsity: 0.4743, step: 6100 lambda_1: -2.7888, lambda_2: 26.5564 lambda_3: 0.0000 train remain: [1. 0.97 0.8 0.68 0.34 0.72 0.69 0.6 0.31] infer remain: [1.0, 0.97, 0.8, 0.67, 0.33, 0.7, 0.7, 0.6, 0.3] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.97, 0.77, 0.52, 0.17, 0.12, 0.08, 0.05, 0.02] 111111111111111111111111111111 111111111111111111111111111110 111111111111111011111011001100 111111111111111111110000000000 111111111010000000000000000000 111111111101101010111101110000 111111111101111011110010100100 101111111101011001110011100000 000110111101001000000100000000 loss: 0.032991, lagrangian_loss: -0.000811, attention_score_distillation_loss: 0.012388 ETA: 2:01:33 | Epoch 33 finished. Took 64.38 seconds. 
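The "layerwise remain" vector printed with every evaluation is derivable from "infer remain": the three leading 1.0 entries are the unpruned layers 0-2 (prune_location starts at layer 3), and the rest is the running product of the per-location keep ratios, since a token dropped at one layer stays dropped for all later ones. A minimal sketch checking this against the step-6000 block above (illustrative code, not the training script):

import numpy as np

# "infer remain" at step 6000 as printed above; the unrounded values are
# fractions of the 30 token bins kept per pruning location (0.97 = 29/30).
bins_kept = [30, 29, 24, 20, 11, 23, 22, 18, 10]
keep_ratio = [k / 30 for k in bins_kept]

# Three unpruned layers, then the cumulative product over prune_location=[3..11].
layerwise = [1.0, 1.0, 1.0] + list(np.cumprod(keep_ratio))
print([round(x, 2) for x in layerwise])
# -> [1.0, 1.0, 1.0, 1.0, 0.97, 0.77, 0.52, 0.19, 0.14, 0.11, 0.06, 0.02]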
loss: 0.044950, lagrangian_loss: -0.012479, attention_score_distillation_loss: 0.012055 ---------------------------------------------------------------------- time: 2023-07-19 15:16:40 Evaluating: pearson: 0.8818, eval_loss: 0.5012, token_prune_loc: [False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5002, expected_sparsity: 0.4909, expected_sequence_sparsity: 0.7529, target_sparsity: 0.4821, step: 6200 lambda_1: -1.7110, lambda_2: 26.7955 lambda_3: 0.0000 train remain: [0.99 0.97 0.74 0.68 0.32 0.68 0.68 0.59 0.3 ] infer remain: [1.0, 0.97, 0.73, 0.67, 0.33, 0.67, 0.67, 0.6, 0.3] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.97, 0.71, 0.47, 0.16, 0.11, 0.07, 0.04, 0.01] 111111111111111111111111111111 111111111111111111111111111110 111111111111111001111011001000 111111111111111111110000000000 111111111010000000000000000000 111111111101101010101101110000 111111111101111011110010100000 101111111101011001110011100000 000110111101001000000100000000 loss: 0.056334, lagrangian_loss: -0.005636, attention_score_distillation_loss: 0.011519 loss: 0.046380, lagrangian_loss: -0.001551, attention_score_distillation_loss: 0.011655 ---------------------------------------------------------------------- time: 2023-07-19 15:17:15 Evaluating: pearson: 0.8832, eval_loss: 0.4955, token_prune_loc: [False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5002, expected_sparsity: 0.4909, expected_sequence_sparsity: 0.7529, target_sparsity: 0.4898, step: 6300 lambda_1: 0.4510, lambda_2: 27.5148 lambda_3: 0.0000 train remain: [0.99 0.97 0.73 0.68 0.32 0.68 0.68 0.59 0.3 ] infer remain: [1.0, 0.97, 0.73, 0.67, 0.33, 0.67, 0.67, 0.6, 0.3] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.97, 0.71, 0.47, 0.16, 0.11, 0.07, 0.04, 0.01] 111111111111111111111111111111 111111111111111111111111111110 111111111111111001111011001000 111111111111111111110000000000 111111111010000000000000000000 111111111101101010101101110000 111111111101111011110010100000 101111111101011001110011100000 000110111101001000000010000000 ETA: 2:00:37 | Epoch 34 finished. Took 64.89 seconds. loss: 0.042762, lagrangian_loss: 0.000536, attention_score_distillation_loss: 0.011637 loss: 0.047321, lagrangian_loss: 0.000077, attention_score_distillation_loss: 0.011241 ---------------------------------------------------------------------- time: 2023-07-19 15:17:50 Evaluating: pearson: 0.8831, eval_loss: 0.4953, token_prune_loc: [False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5002, expected_sparsity: 0.4909, expected_sequence_sparsity: 0.7529, target_sparsity: 0.4976, step: 6400 lambda_1: -0.9797, lambda_2: 27.9941 lambda_3: 0.0000 train remain: [0.99 0.97 0.73 0.68 0.32 0.67 0.68 0.59 0.3 ] infer remain: [1.0, 0.97, 0.73, 0.67, 0.33, 0.67, 0.67, 0.6, 0.3] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.97, 0.71, 0.47, 0.16, 0.11, 0.07, 0.04, 0.01] 111111111111111111111111111111 111111111111111111111111111110 111111111111111001101011001100 111111111111111111110000000000 111111111010000000000000000000 111111111101101000111101110000 111111111101111011110010100000 101111111101011001110011100000 000110111101001000000010000000 loss: 0.020261, lagrangian_loss: 0.005008, attention_score_distillation_loss: 0.010912 loss: 0.066354, lagrangian_loss: 0.016164, attention_score_distillation_loss: 0.010887 ETA: 1:59:15 | Epoch 35 finished. Took 56.82 seconds. 
---------------------------------------------------------------------- time: 2023-07-19 15:18:25 Evaluating: pearson: 0.8772, eval_loss: 0.5203, token_prune_loc: [False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5162, expected_sparsity: 0.5036, expected_sequence_sparsity: 0.7591, target_sparsity: 0.5054, step: 6500 lambda_1: -3.3110, lambda_2: 28.8410 lambda_3: 0.0000 train remain: [0.99 0.97 0.71 0.68 0.28 0.63 0.68 0.58 0.3 ] infer remain: [1.0, 0.97, 0.7, 0.67, 0.27, 0.63, 0.67, 0.57, 0.3] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.97, 0.68, 0.45, 0.12, 0.08, 0.05, 0.03, 0.01] 111111111111111111111111111111 111111111111111111111111111110 111111111111111001101011001000 111111111111111111110000000000 101111011010000000000000000000 111111111101101000101101110000 111111111101111011110010100000 101111111101011000110011100000 000110111101001000000000000001 loss: 0.084893, lagrangian_loss: 0.020234, attention_score_distillation_loss: 0.010603 loss: 0.247035, lagrangian_loss: -0.006342, attention_score_distillation_loss: 0.010467 ---------------------------------------------------------------------- time: 2023-07-19 15:19:00 Evaluating: pearson: 0.8717, eval_loss: 0.5459, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5432, expected_sparsity: 0.5262, expected_sequence_sparsity: 0.7702, target_sparsity: 0.5132, step: 6600 lambda_1: -3.4452, lambda_2: 29.1760 lambda_3: 0.0000 train remain: [0.95 0.96 0.7 0.68 0.24 0.53 0.67 0.57 0.3 ] infer remain: [0.93, 0.97, 0.7, 0.67, 0.23, 0.53, 0.67, 0.57, 0.3] layerwise remain: [1.0, 1.0, 1.0, 0.93, 0.9, 0.63, 0.42, 0.1, 0.05, 0.03, 0.02, 0.01] 110111111111111111111111111110 111111111111111111111111111110 111111111111111001101011001000 111111111111111111110000000000 101111101000000000000000000000 100111111101101000110001110000 111111111101111011110010100000 101111111101011000110011100000 000110111101001000000000000001 loss: 0.030295, lagrangian_loss: -0.012324, attention_score_distillation_loss: 0.010073 loss: 0.034462, lagrangian_loss: -0.020872, attention_score_distillation_loss: 0.010076 ETA: 1:58:18 | Epoch 36 finished. Took 64.83 seconds. 
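The nine rows of 0s and 1s in each block are the 30 token-bin gates at each pruning location (token_loga has shape [9, 30], as reported when L0 regularization started). At evaluation time the stochastic hard-concrete gates collapse to a deterministic binary mask, which is why "infer remain" is quantized to multiples of 1/30 while "train remain" is not. A minimal sketch of that collapse in the style of the Louizos et al. L0 parameterization, assuming the usual stretch limits (-0.1, 1.1); the names and the exact threshold are assumptions, not the script's code:

import torch

LIMIT_L, LIMIT_R = -0.1, 1.1   # assumed hard-concrete stretch interval
TEMPERATURE = 2.0 / 3.0        # "temperature: 0.67" in the log; only the
                               # stochastic training-time sampling uses it

def deterministic_mask(token_loga: torch.Tensor) -> torch.Tensor:
    # Test-time gate: stretched sigmoid of log-alpha, clamped to [0, 1],
    # then binarized: a bin survives if its clamped gate is positive.
    s = torch.sigmoid(token_loga) * (LIMIT_R - LIMIT_L) + LIMIT_L
    return (s.clamp(0.0, 1.0) > 0).int()

token_loga = torch.zeros(9, 30)  # placeholder for the learned parameters
mask = deterministic_mask(token_loga)
print(mask.shape, mask.float().mean(dim=1))  # [9, 30]; row means = "infer remain"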
---------------------------------------------------------------------- time: 2023-07-19 15:19:35 Evaluating: pearson: 0.8683, eval_loss: 0.5573, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5539, expected_sparsity: 0.5374, expected_sequence_sparsity: 0.7758, target_sparsity: 0.521, step: 6700 lambda_1: -0.4821, lambda_2: 30.7339 lambda_3: 0.0000 train remain: [0.94 0.95 0.67 0.68 0.24 0.52 0.66 0.56 0.3 ] infer remain: [0.93, 0.93, 0.67, 0.67, 0.23, 0.5, 0.67, 0.57, 0.3] layerwise remain: [1.0, 1.0, 1.0, 0.93, 0.87, 0.58, 0.39, 0.09, 0.05, 0.03, 0.02, 0.01] 110111111111111111111111111110 111101111111111111111111111110 111111111111111001101011000000 111111111111111111110000000000 101111001000010000000000000000 100111111101001000110001110000 111111111101111011110010100000 101111111101011000110011100000 000110111101001000010000000000 loss: 0.051900, lagrangian_loss: -0.001814, attention_score_distillation_loss: 0.009658 loss: 0.030492, lagrangian_loss: 0.000102, attention_score_distillation_loss: 0.009725 ---------------------------------------------------------------------- time: 2023-07-19 15:20:10 Evaluating: pearson: 0.8771, eval_loss: 0.5278, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5458, expected_sparsity: 0.5316, expected_sequence_sparsity: 0.7729, target_sparsity: 0.5287, step: 6800 lambda_1: -0.0646, lambda_2: 31.0693 lambda_3: 0.0000 train remain: [0.95 0.96 0.68 0.68 0.24 0.52 0.66 0.56 0.3 ] infer remain: [0.93, 0.97, 0.67, 0.67, 0.23, 0.5, 0.67, 0.57, 0.3] layerwise remain: [1.0, 1.0, 1.0, 0.93, 0.9, 0.6, 0.4, 0.09, 0.05, 0.03, 0.02, 0.01] 110111111111111111111111111110 111111111111111111111111111110 111111111111111001101011000000 111111111111111111110000000000 101111001000000000000000100000 100111111101001000110001110000 111111111101111011110010100000 101111111101011000110011100000 000110111101001000000010000000 loss: 0.070156, lagrangian_loss: 0.002052, attention_score_distillation_loss: 0.009199 ETA: 1:57:21 | Epoch 37 finished. Took 64.67 seconds. 
loss: 0.046940, lagrangian_loss: 0.017508, attention_score_distillation_loss: 0.008959 ---------------------------------------------------------------------- time: 2023-07-19 15:20:46 Evaluating: pearson: 0.869, eval_loss: 0.5549, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5539, expected_sparsity: 0.5376, expected_sequence_sparsity: 0.7759, target_sparsity: 0.5365, step: 6900 lambda_1: -2.6809, lambda_2: 32.4505 lambda_3: 0.0000 train remain: [0.95 0.94 0.66 0.68 0.24 0.51 0.65 0.56 0.3 ] infer remain: [0.93, 0.93, 0.67, 0.67, 0.23, 0.5, 0.63, 0.57, 0.3] layerwise remain: [1.0, 1.0, 1.0, 0.93, 0.87, 0.58, 0.39, 0.09, 0.05, 0.03, 0.02, 0.0] 110111111111111111111111111110 111101111111111111111111111110 111111111111111001101011000000 111111111111111111110000000000 101111001010000000000000000000 100111111101001000110001110000 101111111101111011110010100000 101111111101011000110011100000 000110111101001000010000000000 loss: 0.070058, lagrangian_loss: 0.027112, attention_score_distillation_loss: 0.008852 loss: 0.074075, lagrangian_loss: 0.016675, attention_score_distillation_loss: 0.008657 ---------------------------------------------------------------------- time: 2023-07-19 15:21:20 Evaluating: pearson: 0.8742, eval_loss: 0.5337, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.562, expected_sparsity: 0.5474, expected_sequence_sparsity: 0.7807, target_sparsity: 0.5443, step: 7000 lambda_1: -3.9616, lambda_2: 32.9898 lambda_3: 0.0000 train remain: [0.93 0.94 0.59 0.68 0.23 0.48 0.64 0.55 0.3 ] infer remain: [0.93, 0.93, 0.6, 0.67, 0.23, 0.47, 0.63, 0.57, 0.3] layerwise remain: [1.0, 1.0, 1.0, 0.93, 0.87, 0.52, 0.35, 0.08, 0.04, 0.02, 0.01, 0.0] 110111111111111111111111111110 111101111111111111111111111110 111111111111111001101000000000 111111111111111111110000000000 101111001100000000000000000000 100011111101001000110001110000 101111111101111011110010100000 101111111101011000110011100000 000110111101001000000000000001 loss: 0.064370, lagrangian_loss: 0.002295, attention_score_distillation_loss: 0.008340 ETA: 1:56:23 | Epoch 38 finished. Took 64.63 seconds. 
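The ETA printed at each epoch boundary is consistent with the mean epoch duration so far times the epochs still to run out of num_train_epochs=150 (e.g. 1:56:23 above is about 111 remaining epochs at roughly 63 s each). A small sketch of that bookkeeping, assuming this is how the script computes it:

import datetime

def eta_string(epoch_seconds, epochs_done, total_epochs=150):
    # Running mean epoch time scaled by the number of epochs left.
    mean_t = sum(epoch_seconds) / len(epoch_seconds)
    remaining = (total_epochs - epochs_done) * mean_t
    return str(datetime.timedelta(seconds=int(remaining)))

print(eta_string([62.9] * 39, 39))  # -> 1:56:21, matching the log to within rounding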
loss: 0.044245, lagrangian_loss: -0.012566, attention_score_distillation_loss: 0.008139 ---------------------------------------------------------------------- time: 2023-07-19 15:21:56 Evaluating: pearson: 0.8741, eval_loss: 0.539, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5754, expected_sparsity: 0.5645, expected_sequence_sparsity: 0.7891, target_sparsity: 0.5521, step: 7100 lambda_1: -2.5033, lambda_2: 33.5751 lambda_3: 0.0000 train remain: [0.91 0.93 0.55 0.66 0.23 0.47 0.64 0.55 0.3 ] infer remain: [0.9, 0.93, 0.53, 0.67, 0.23, 0.47, 0.63, 0.53, 0.3] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.84, 0.45, 0.3, 0.07, 0.03, 0.02, 0.01, 0.0] 110011111111111111111111111110 111101111111111111111111111110 111111111101111001100000000000 111111111111111111110000000000 101111001000000000000100000000 100011111101001000110001110000 101111111101111011110010100000 101111111101011000110010100000 000110111101001000010000000000 loss: 0.081477, lagrangian_loss: -0.013901, attention_score_distillation_loss: 0.007984 loss: 0.046079, lagrangian_loss: -0.007136, attention_score_distillation_loss: 0.007826 ---------------------------------------------------------------------- time: 2023-07-19 15:22:30 Evaluating: pearson: 0.8732, eval_loss: 0.5409, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5808, expected_sparsity: 0.5689, expected_sequence_sparsity: 0.7913, target_sparsity: 0.5598, step: 7200 lambda_1: -0.6499, lambda_2: 34.3618 lambda_3: 0.0000 train remain: [0.91 0.93 0.52 0.65 0.23 0.46 0.64 0.55 0.3 ] infer remain: [0.9, 0.93, 0.5, 0.67, 0.23, 0.47, 0.63, 0.53, 0.3] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.84, 0.42, 0.28, 0.07, 0.03, 0.02, 0.01, 0.0] 110011111111111111111111111110 111101111111111111111111111110 111111111101111001000000000000 111111111111111111110000000000 101111001000000000000001000000 100011111101001000110001110000 101111111101111011110010100000 101111111101011000110010100000 000110111101001000010000000000 ETA: 1:55:24 | Epoch 39 finished. Took 64.53 seconds. loss: 0.052311, lagrangian_loss: -0.000795, attention_score_distillation_loss: 0.007474 loss: 0.047241, lagrangian_loss: -0.000049, attention_score_distillation_loss: 0.007434 ---------------------------------------------------------------------- time: 2023-07-19 15:23:06 Evaluating: pearson: 0.868, eval_loss: 0.5595, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5808, expected_sparsity: 0.5689, expected_sequence_sparsity: 0.7913, target_sparsity: 0.5676, step: 7300 lambda_1: -0.7083, lambda_2: 34.5765 lambda_3: 0.0000 train remain: [0.91 0.93 0.51 0.65 0.23 0.46 0.63 0.55 0.3 ] infer remain: [0.9, 0.93, 0.5, 0.67, 0.23, 0.47, 0.63, 0.53, 0.3] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.84, 0.42, 0.28, 0.07, 0.03, 0.02, 0.01, 0.0] 110011111111111111111111111110 111101111111111111111111111110 111111111101111001000000000000 111111111111111111110000000000 101111001000000000000000010000 100011111101001000110001110000 101111111101111011110010100000 101111111101011000110010100000 000110111101001000010000000000 loss: 0.061821, lagrangian_loss: 0.006615, attention_score_distillation_loss: 0.007145 loss: 0.063714, lagrangian_loss: 0.014441, attention_score_distillation_loss: 0.006995 ETA: 1:54:05 | Epoch 40 finished. Took 56.74 seconds. 
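The lambda_1, lambda_2 and lagrangian_loss fields behave like the standard Lagrangian relaxation used to hit a sparsity target: a linear plus quadratic penalty on the gap between expected and target sparsity, with the multipliers trained adversarially (ascended at reg_learning_rate=0.02 while the model descends). That is why lambda_1 drifts negative while expected sparsity lags the target, lambda_2 grows steadily, and the penalty itself can go negative when sparsity overshoots. A minimal CoFi-style sketch with assumed names, not the actual training code:

import torch

class LagrangianSparsity(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Trainable multipliers, updated by gradient ascent on this loss
        # (their gradient is sign-flipped relative to the model's descent).
        self.lambda_1 = torch.nn.Parameter(torch.zeros(()))
        self.lambda_2 = torch.nn.Parameter(torch.zeros(()))

    def forward(self, expected_sparsity, target_sparsity):
        gap = expected_sparsity - target_sparsity
        return self.lambda_1 * gap + self.lambda_2 * gap.pow(2)

lag = LagrangianSparsity()
reg_opt = torch.optim.AdamW([lag.lambda_1, lag.lambda_2], lr=0.02)
loss = lag(torch.tensor(0.57), torch.tensor(0.58))  # sparsity just under target
(-loss).backward()  # ascend on the multipliers
reg_opt.step()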
---------------------------------------------------------------------- time: 2023-07-19 15:23:41 Evaluating: pearson: 0.8698, eval_loss: 0.5498, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5835, expected_sparsity: 0.5723, expected_sequence_sparsity: 0.7929, target_sparsity: 0.5754, step: 7400 lambda_1: -3.0239, lambda_2: 35.8346 lambda_3: 0.0000 train remain: [0.91 0.93 0.5 0.65 0.21 0.45 0.62 0.55 0.3 ] infer remain: [0.9, 0.93, 0.5, 0.63, 0.2, 0.43, 0.63, 0.53, 0.3] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.84, 0.42, 0.27, 0.05, 0.02, 0.01, 0.01, 0.0] 110011111111111111111111111110 111101111111111111111111111110 111111111101111001000000000000 111111111111111111100000000000 101111001000000000000000000000 100011111101001000100001110000 101111111101111011110010100000 101111111101011000110010100000 000110111101001000010000000000 loss: 0.077226, lagrangian_loss: 0.027676, attention_score_distillation_loss: 0.006803 loss: 0.096816, lagrangian_loss: 0.025061, attention_score_distillation_loss: 0.006575 ---------------------------------------------------------------------- time: 2023-07-19 15:24:16 Evaluating: pearson: 0.8714, eval_loss: 0.5433, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5888, expected_sparsity: 0.5773, expected_sequence_sparsity: 0.7954, target_sparsity: 0.5832, step: 7500 lambda_1: -5.6046, lambda_2: 37.4632 lambda_3: 0.0000 train remain: [0.91 0.9 0.49 0.65 0.19 0.4 0.61 0.55 0.29] infer remain: [0.9, 0.9, 0.5, 0.63, 0.2, 0.4, 0.6, 0.53, 0.3] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.81, 0.4, 0.26, 0.05, 0.02, 0.01, 0.01, 0.0] 110011111111111111111111111110 110101111111111111111111111110 111111111101111001000000000000 111111111111111111100000000000 101111001000000000000000000000 100011011101001000110001010000 101111111101111011010010100000 101111111101011000110010100000 100110111101001000000000000000 loss: 0.030107, lagrangian_loss: 0.064182, attention_score_distillation_loss: 0.006225 loss: 0.049201, lagrangian_loss: 0.078200, attention_score_distillation_loss: 0.006130 ETA: 1:53:08 | Epoch 41 finished. Took 65.02 seconds. 
---------------------------------------------------------------------- time: 2023-07-19 15:24:51 Evaluating: pearson: 0.8419, eval_loss: 0.6896, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5969, expected_sparsity: 0.5867, expected_sequence_sparsity: 0.8, target_sparsity: 0.591, step: 7600 lambda_1: -8.4787, lambda_2: 39.6273 lambda_3: 0.0000 train remain: [0.9 0.9 0.44 0.65 0.18 0.37 0.61 0.52 0.28] infer remain: [0.9, 0.9, 0.43, 0.63, 0.17, 0.37, 0.6, 0.53, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.81, 0.35, 0.22, 0.04, 0.01, 0.01, 0.0, 0.0] 110011111111111111111111111110 110101111111111111111111111110 011111111101111000000000000000 111111111111111111100000000000 101111000000000000000000000000 100011011101001000110001000000 101111111101111011010010100000 101111111101011000110010100000 000110111101001000000000000000 loss: 0.048075, lagrangian_loss: 0.082027, attention_score_distillation_loss: 0.005845 loss: 0.074841, lagrangian_loss: 0.061686, attention_score_distillation_loss: 0.005723 ---------------------------------------------------------------------- time: 2023-07-19 15:25:26 Evaluating: pearson: 0.8173, eval_loss: 0.7733, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.613, expected_sparsity: 0.6021, expected_sequence_sparsity: 0.8076, target_sparsity: 0.5987, step: 7700 lambda_1: -11.0192, lambda_2: 41.4908 lambda_3: 0.0000 train remain: [0.88 0.83 0.43 0.64 0.18 0.34 0.61 0.5 0.27] infer remain: [0.87, 0.83, 0.43, 0.63, 0.17, 0.33, 0.6, 0.5, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.87, 0.72, 0.31, 0.2, 0.03, 0.01, 0.01, 0.0, 0.0] 100011111111111111111111111110 110101111101111111111111111010 011111111101011000000000000001 111111111111111111100000000000 101111000000000000000000000000 100011011101001000100001000000 101111111101111011010010100000 100111111101011000110010100000 000110111101001000000000000000 loss: 0.088138, lagrangian_loss: -0.015873, attention_score_distillation_loss: 0.005610 ETA: 1:52:10 | Epoch 42 finished. Took 64.87 seconds. 
loss: 0.108121, lagrangian_loss: -0.106642, attention_score_distillation_loss: 0.005329 ---------------------------------------------------------------------- time: 2023-07-19 15:26:01 Evaluating: pearson: 0.8126, eval_loss: 0.8242, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6399, expected_sparsity: 0.6258, expected_sequence_sparsity: 0.8193, target_sparsity: 0.6065, step: 7800 lambda_1: -8.7474, lambda_2: 43.9252 lambda_3: 0.0000 train remain: [0.84 0.71 0.42 0.59 0.17 0.29 0.61 0.49 0.27] infer remain: [0.83, 0.7, 0.43, 0.6, 0.17, 0.27, 0.6, 0.5, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.83, 0.58, 0.25, 0.15, 0.03, 0.01, 0.0, 0.0, 0.0] 100011111101111111111111111110 110001111101111101111101011010 011111111101011000000000000001 111111111101111111100000000000 101011000000000000000000000001 100001001101001000010001000000 101111111101111011010010100000 100111111101011000110010100000 000110111101001000000000000000 loss: 0.210216, lagrangian_loss: -0.125186, attention_score_distillation_loss: 0.005118 loss: 0.176400, lagrangian_loss: -0.083898, attention_score_distillation_loss: 0.004681 ---------------------------------------------------------------------- time: 2023-07-19 15:26:37 Evaluating: pearson: 0.8091, eval_loss: 0.8021, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.648, expected_sparsity: 0.6352, expected_sequence_sparsity: 0.8238, target_sparsity: 0.6143, step: 7900 lambda_1: -4.7908, lambda_2: 48.7272 lambda_3: 0.0000 train remain: [0.81 0.7 0.39 0.56 0.17 0.28 0.61 0.49 0.27] infer remain: [0.8, 0.7, 0.4, 0.57, 0.17, 0.27, 0.6, 0.5, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.8, 0.56, 0.22, 0.13, 0.02, 0.01, 0.0, 0.0, 0.0] 100011111101111111011111111110 110001111101111101111101011010 011111111101011000000000000000 111111111101111111000000000000 101011000000001000000000000000 100001001101001000010001000000 101111111101111011010010100000 100111111101011000110010100000 000110111101001000000000000000 loss: 0.091603, lagrangian_loss: -0.069692, attention_score_distillation_loss: 0.004733 ETA: 1:51:12 | Epoch 43 finished. Took 64.85 seconds. 
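target_sparsity in these blocks climbs by roughly 0.0078 per 100 steps, which matches a linear ramp over lagrangian_warmup_epochs=50: at about 180 steps per epoch (5,749 STS-B training pairs at batch size 32), the target reaches the configured 0.7 near step 9100 and is held there afterwards. A sketch of that schedule with illustrative names; the printed values (0.0076 at step 100, 0.3498 at step 4500) suggest the real ramp is offset by a couple of steps:

def target_sparsity_at(step, warmup_steps=9000, final_target=0.7):
    # Linear ramp over the Lagrangian warmup, then pinned at the target.
    if step >= warmup_steps:
        return final_target
    return final_target * step / warmup_steps

for s in (100, 4500, 9100):
    print(s, round(target_sparsity_at(s), 4))
# -> 100 0.0078, 4500 0.35, 9100 0.7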
loss: 0.130693, lagrangian_loss: -0.030531, attention_score_distillation_loss: 0.004541 ---------------------------------------------------------------------- time: 2023-07-19 15:27:12 Evaluating: pearson: 0.8116, eval_loss: 0.7909, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.648, expected_sparsity: 0.6352, expected_sequence_sparsity: 0.8238, target_sparsity: 0.6221, step: 8000 lambda_1: -1.8013, lambda_2: 51.6872 lambda_3: 0.0000 train remain: [0.81 0.7 0.39 0.55 0.17 0.28 0.61 0.49 0.27] infer remain: [0.8, 0.7, 0.4, 0.57, 0.17, 0.27, 0.6, 0.5, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.8, 0.56, 0.22, 0.13, 0.02, 0.01, 0.0, 0.0, 0.0] 100011111101111111011111111110 110001111101111101111101011010 011111111101011000000000000000 111111111101111111000000000000 101011000000000000010000000000 100001001101001000010001000000 101111111101111011010010100000 100111111101011000110010100000 000110111101001000000000000000 loss: 0.127579, lagrangian_loss: -0.012635, attention_score_distillation_loss: 0.004232 loss: 0.122561, lagrangian_loss: -0.001506, attention_score_distillation_loss: 0.003994 ---------------------------------------------------------------------- time: 2023-07-19 15:27:47 Evaluating: pearson: 0.8138, eval_loss: 0.7753, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.648, expected_sparsity: 0.6352, expected_sequence_sparsity: 0.8238, target_sparsity: 0.6298, step: 8100 lambda_1: -0.7101, lambda_2: 52.2153 lambda_3: 0.0000 train remain: [0.81 0.71 0.4 0.55 0.17 0.28 0.61 0.49 0.27] infer remain: [0.8, 0.7, 0.4, 0.57, 0.17, 0.27, 0.6, 0.5, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.8, 0.56, 0.22, 0.13, 0.02, 0.01, 0.0, 0.0, 0.0] 100011111101111111011111111110 110001111101111101111101011010 011111111101011000000000000000 111111111101111111000000000000 101011000000001000000000000000 100001001101001000010001000000 101111111101111011010010100000 100111111101011000110010100000 000110111101001000000000000000 ETA: 1:50:13 | Epoch 44 finished. Took 64.71 seconds. loss: 0.078549, lagrangian_loss: -0.001009, attention_score_distillation_loss: 0.003841 loss: 0.088088, lagrangian_loss: 0.006807, attention_score_distillation_loss: 0.003553 ---------------------------------------------------------------------- time: 2023-07-19 15:28:22 Evaluating: pearson: 0.8084, eval_loss: 0.8147, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.648, expected_sparsity: 0.6359, expected_sequence_sparsity: 0.8242, target_sparsity: 0.6376, step: 8200 lambda_1: -1.4427, lambda_2: 52.5451 lambda_3: 0.0000 train remain: [0.81 0.71 0.4 0.55 0.17 0.28 0.6 0.49 0.27] infer remain: [0.8, 0.7, 0.4, 0.53, 0.17, 0.27, 0.6, 0.5, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.8, 0.56, 0.22, 0.12, 0.02, 0.01, 0.0, 0.0, 0.0] 100011111101111111011111111110 110001111101111101111101011010 011111111101001000000001000000 111111111101111110000000000000 101011000000000000010000000000 100001001101001000010001000000 101111111101111011010010100000 100111111101011000110010100000 000110111101001000000000000000 loss: 0.142977, lagrangian_loss: 0.014383, attention_score_distillation_loss: 0.003434 loss: 0.074929, lagrangian_loss: 0.031015, attention_score_distillation_loss: 0.003244 ETA: 1:48:56 | Epoch 45 finished. Took 56.88 seconds. 
---------------------------------------------------------------------- time: 2023-07-19 15:28:57 Evaluating: pearson: 0.8082, eval_loss: 0.8258, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.648, expected_sparsity: 0.6359, expected_sequence_sparsity: 0.8242, target_sparsity: 0.6454, step: 8300 lambda_1: -3.8843, lambda_2: 54.5718 lambda_3: 0.0000 train remain: [0.81 0.7 0.39 0.53 0.17 0.28 0.6 0.49 0.27] infer remain: [0.8, 0.7, 0.4, 0.53, 0.17, 0.27, 0.6, 0.5, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.8, 0.56, 0.22, 0.12, 0.02, 0.01, 0.0, 0.0, 0.0] 100011111101111111011111111110 110001111101111101111101011010 011111111101001000000000100000 111111111101111110000000000000 101011000100000000000000000000 100001001101001000010001000000 101111111101111011010010100000 100111111101011000110010100000 000110111101001000000000000000 loss: 0.108575, lagrangian_loss: 0.064090, attention_score_distillation_loss: 0.003033 loss: 0.086275, lagrangian_loss: 0.084533, attention_score_distillation_loss: 0.002791 ---------------------------------------------------------------------- time: 2023-07-19 15:29:32 Evaluating: pearson: 0.8168, eval_loss: 0.7943, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6507, expected_sparsity: 0.6395, expected_sequence_sparsity: 0.826, target_sparsity: 0.6532, step: 8400 lambda_1: -7.3478, lambda_2: 58.6131 lambda_3: 0.0000 train remain: [0.81 0.67 0.39 0.52 0.16 0.28 0.6 0.49 0.27] infer remain: [0.8, 0.67, 0.4, 0.53, 0.17, 0.27, 0.6, 0.5, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.8, 0.53, 0.21, 0.11, 0.02, 0.01, 0.0, 0.0, 0.0] 100011111101111111011111111110 110001111001111101111101011010 011111111101001000000000000001 111111111101011110000000000001 101011000000000000010000000000 100001001101001000010001000000 101111111101111011010010100000 100011111101011000110010100001 000110111101001000000000000000 loss: 0.104441, lagrangian_loss: 0.140984, attention_score_distillation_loss: 0.002569 loss: 0.111436, lagrangian_loss: 0.173203, attention_score_distillation_loss: 0.002360 ETA: 1:47:58 | Epoch 46 finished. Took 64.96 seconds. 
---------------------------------------------------------------------- time: 2023-07-19 15:30:07 Evaluating: pearson: 0.8092, eval_loss: 0.8039, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6587, expected_sparsity: 0.6497, expected_sequence_sparsity: 0.831, target_sparsity: 0.661, step: 8500 lambda_1: -10.6127, lambda_2: 62.3280 lambda_3: 0.0000 train remain: [0.81 0.61 0.36 0.49 0.14 0.27 0.59 0.48 0.27] infer remain: [0.8, 0.6, 0.37, 0.5, 0.13, 0.27, 0.6, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.8, 0.48, 0.18, 0.09, 0.01, 0.0, 0.0, 0.0, 0.0] 100011111101111111011111111110 110001010001111101111101011010 011111111101001000000000000000 111111111101011110000000000000 101011000000000000000000000000 100001001101001000010001000000 101111111101111011010010100000 100011111101011000110010100000 000110111101001000000000000000 loss: 0.123818, lagrangian_loss: 0.231153, attention_score_distillation_loss: 0.002120 loss: 0.129713, lagrangian_loss: 0.233761, attention_score_distillation_loss: 0.001929 ---------------------------------------------------------------------- time: 2023-07-19 15:30:42 Evaluating: pearson: 0.8058, eval_loss: 0.8128, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6641, expected_sparsity: 0.655, expected_sequence_sparsity: 0.8336, target_sparsity: 0.6687, step: 8600 lambda_1: -13.7100, lambda_2: 65.7965 lambda_3: 0.0000 train remain: [0.81 0.55 0.33 0.49 0.14 0.27 0.57 0.46 0.27] infer remain: [0.8, 0.57, 0.33, 0.5, 0.13, 0.27, 0.57, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.8, 0.45, 0.15, 0.08, 0.01, 0.0, 0.0, 0.0, 0.0] 100011111101111111011111111110 110001010001111101111001011010 011111111101000000000000000000 111111111101011010000000000001 101011000000000000000000000000 100001001101001000010001000000 101011111101111011010010100000 100011111101011000110010100000 000110111101001000000000000000 loss: 0.081704, lagrangian_loss: 0.251546, attention_score_distillation_loss: 0.001699 ETA: 1:46:59 | Epoch 47 finished. Took 64.69 seconds. 
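"infer remain" is simply the fraction of 1s in each printed mask row: every row is one pruning location, every character one of the 30 token bins. For example, the first row of the step-8500 block above has 24 of its 30 bits set:

import numpy as np

# First mask row printed at step 8500 (layer 3, the first prune location).
row = "100011111101111111011111111110"
bits = np.array([int(c) for c in row])
print(bits.sum(), bits.mean())  # -> 24 0.8, matching infer remain[0] = 0.8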
loss: 0.106977, lagrangian_loss: 0.246855, attention_score_distillation_loss: 0.001506 ---------------------------------------------------------------------- time: 2023-07-19 15:31:17 Evaluating: pearson: 0.8101, eval_loss: 0.7946, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6775, expected_sparsity: 0.6667, expected_sequence_sparsity: 0.8393, target_sparsity: 0.6765, step: 8700 lambda_1: -16.5600, lambda_2: 68.8417 lambda_3: 0.0000 train remain: [0.78 0.52 0.33 0.47 0.13 0.26 0.54 0.46 0.27] infer remain: [0.77, 0.5, 0.33, 0.47, 0.13, 0.27, 0.53, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.77, 0.38, 0.13, 0.06, 0.01, 0.0, 0.0, 0.0, 0.0] 100011111101111111011111110110 110001010001011001111001011010 011111111100100000000000000000 111111111101011010000000000000 101011000000000000000000000000 100001001101001000010001000000 101011111101011011010010100000 100011111101011000110010100000 000110111101001000000000000000 loss: 0.100143, lagrangian_loss: 0.271340, attention_score_distillation_loss: 0.001302 loss: 0.158481, lagrangian_loss: 0.187113, attention_score_distillation_loss: 0.001068 ---------------------------------------------------------------------- time: 2023-07-19 15:31:53 Evaluating: pearson: 0.7883, eval_loss: 0.8694, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6882, expected_sparsity: 0.6788, expected_sequence_sparsity: 0.8453, target_sparsity: 0.6843, step: 8800 lambda_1: -18.7171, lambda_2: 70.6955 lambda_3: 0.0000 train remain: [0.74 0.44 0.31 0.45 0.11 0.26 0.52 0.46 0.27] infer remain: [0.73, 0.43, 0.3, 0.47, 0.1, 0.27, 0.53, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.73, 0.32, 0.1, 0.04, 0.0, 0.0, 0.0, 0.0, 0.0] 100011111101111111011111100110 100001010001011001011001011010 011111111100000000000000000000 111111111101011000000000000001 101010000000000000000000000000 100001001101001000010001000000 101011111101011011010010100000 100011111101011000110010100000 000110111101001000000000000000 loss: 0.182240, lagrangian_loss: 0.140845, attention_score_distillation_loss: 0.000876 ETA: 1:46:00 | Epoch 48 finished. Took 65.05 seconds. 
loss: 0.241357, lagrangian_loss: 0.167169, attention_score_distillation_loss: 0.000645 ---------------------------------------------------------------------- time: 2023-07-19 15:32:28 Evaluating: pearson: 0.7612, eval_loss: 0.9532, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7043, expected_sparsity: 0.6927, expected_sequence_sparsity: 0.8521, target_sparsity: 0.6921, step: 8900 lambda_1: -20.2795, lambda_2: 71.7212 lambda_3: 0.0000 train remain: [0.71 0.34 0.27 0.39 0.1 0.25 0.51 0.46 0.27] infer remain: [0.7, 0.33, 0.27, 0.4, 0.1, 0.23, 0.5, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.7, 0.23, 0.06, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0] 100011111101111111011011100110 100000010001011001010001001010 011111111000000000000000000000 111111111101001000000000000000 101010000000000000000000000000 100001001101001000000001000000 100011111101011011010010100000 100011111101011000110010100000 100110111100001000000000000000 loss: 0.189149, lagrangian_loss: 0.102262, attention_score_distillation_loss: 0.000434 loss: 0.305672, lagrangian_loss: 0.172312, attention_score_distillation_loss: 0.000385 ---------------------------------------------------------------------- time: 2023-07-19 15:33:03 Evaluating: pearson: 0.7748, eval_loss: 0.9126, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.707, expected_sparsity: 0.6968, expected_sequence_sparsity: 0.8541, target_sparsity: 0.6998, step: 9000 lambda_1: -21.3536, lambda_2: 72.2096 lambda_3: 0.0000 train remain: [0.67 0.33 0.27 0.36 0.1 0.24 0.51 0.46 0.27] infer remain: [0.67, 0.33, 0.27, 0.37, 0.1, 0.23, 0.5, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.67, 0.22, 0.06, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0] 100011111101111111011011100010 100000010001011001010001001001 011111011100000000000000000000 101111111101001000000000000000 101010000000000000000000000000 100001001101001000000001000000 100011111101011011010010100000 100011111101001000110010100001 100110111100001000000000000000 ETA: 1:45:01 | Epoch 49 finished. Took 64.67 seconds. loss: 0.122828, lagrangian_loss: 0.134442, attention_score_distillation_loss: 0.000382 loss: 0.170258, lagrangian_loss: 0.027213, attention_score_distillation_loss: 0.000384 Starting saving the best from epoch 50 and step 9100 ---------------------------------------------------------------------- time: 2023-07-19 15:33:38 Evaluating: pearson: 0.7752, eval_loss: 0.9167, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7151, expected_sparsity: 0.7068, expected_sequence_sparsity: 0.8591, target_sparsity: 0.7, step: 9100 lambda_1: -21.6260, lambda_2: 72.3652 lambda_3: 0.0000 train remain: [0.61 0.29 0.26 0.36 0.1 0.23 0.5 0.46 0.27] infer remain: [0.6, 0.3, 0.27, 0.37, 0.1, 0.23, 0.5, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.6, 0.18, 0.05, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0] 100011111101111111011001000010 100000010001011001010001001000 011111011000000000000000000001 101111011101001000000000000001 101010000000000000000000000000 100001001101001000000001000000 100011111101011011010010100000 100011111101001000110010100001 100110111100001000000000000000 Saving the best model so far: [Epoch 50 | Step: 9100 | MACs sparsity: 0.7151 | Score: 0.7752 | Loss: 0.9167] loss: 0.124482, lagrangian_loss: -0.058486, attention_score_distillation_loss: 0.000386 loss: 0.159412, lagrangian_loss: -0.086318, attention_score_distillation_loss: 0.000380 ETA: 1:44:31 | Epoch 50 finished. Took 79.97 seconds. 
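Checkpoint selection only begins once the Lagrangian warmup is over: the "Starting saving the best from epoch 50 and step 9100" line coincides with target_sparsity reaching its final value of 0.7, and from then on each evaluation either saves a new best model or reports the standing best. A minimal sketch of that bookkeeping with assumed names:

# Hypothetical tracker mirroring the "Saving the best model so far" /
# "Best eval score so far" lines; save_fn stands in for whatever the
# script actually calls to write the checkpoint.
best = {"score": float("-inf"), "step": None}

def maybe_save_best(score, step, epoch, warmup_end_step=9100, save_fn=None):
    if step < warmup_end_step:
        return  # sparsity target still warming up; no checkpoints yet
    if score > best["score"]:
        best.update(score=score, step=step)
        if save_fn is not None:
            save_fn()
        print(f"Saving the best model so far: [Epoch {epoch} | Step: {step} | Score: {score}]")
    else:
        print(f"Best eval score so far: {best['score']:.4f} @ step {best['step']}")

maybe_save_best(0.7752, 9100, 50)  # first save right after warmup
maybe_save_best(0.7630, 9200, 51)  # worse score: keeps the step-9100 model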
---------------------------------------------------------------------- time: 2023-07-19 15:34:36 Evaluating: pearson: 0.763, eval_loss: 0.9662, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7177, expected_sparsity: 0.7096, expected_sequence_sparsity: 0.8604, target_sparsity: 0.7, step: 9200 lambda_1: -20.5938, lambda_2: 72.8310 lambda_3: 0.0000 train remain: [0.59 0.25 0.23 0.36 0.1 0.23 0.5 0.46 0.27] infer remain: [0.6, 0.27, 0.23, 0.37, 0.1, 0.23, 0.5, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.6, 0.16, 0.04, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0] 100011111101111111011001000010 100000000001001001010001001001 011111011000000000000000000000 101111011101001000000000000001 101000000000000000000000000001 100001001101001000000001000000 100011111101011011010010100000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.7752 @ step 9100 epoch 50.56 loss: 0.151929, lagrangian_loss: -0.159871, attention_score_distillation_loss: 0.000383 loss: 0.216680, lagrangian_loss: -0.194308, attention_score_distillation_loss: 0.000382 ---------------------------------------------------------------------- time: 2023-07-19 15:35:11 Evaluating: pearson: 0.759, eval_loss: 0.9599, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7258, expected_sparsity: 0.7152, expected_sequence_sparsity: 0.8632, target_sparsity: 0.7, step: 9300 lambda_1: -18.8374, lambda_2: 73.9772 lambda_3: 0.0000 train remain: [0.58 0.24 0.22 0.36 0.1 0.22 0.5 0.46 0.27] infer remain: [0.57, 0.23, 0.23, 0.37, 0.1, 0.23, 0.5, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.57, 0.13, 0.03, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0] 100011111101111111011001000000 100000000001001001010001001000 011111011000000000000000000000 101111011101001000000000000001 101000000000000000000000000001 100001001101001000000001000000 100011111101011011010010100000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.7752 @ step 9100 epoch 50.56 loss: 0.145970, lagrangian_loss: -0.163008, attention_score_distillation_loss: 0.000380 loss: 0.163700, lagrangian_loss: -0.168505, attention_score_distillation_loss: 0.000382 ETA: 1:43:30 | Epoch 51 finished. Took 64.7 seconds. 
---------------------------------------------------------------------- time: 2023-07-19 15:35:46 Evaluating: pearson: 0.7544, eval_loss: 1.0067, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7258, expected_sparsity: 0.7157, expected_sequence_sparsity: 0.8635, target_sparsity: 0.7, step: 9400 lambda_1: -16.8582, lambda_2: 75.4303 lambda_3: 0.0000 train remain: [0.58 0.23 0.21 0.35 0.1 0.22 0.5 0.46 0.27] infer remain: [0.57, 0.23, 0.2, 0.37, 0.1, 0.23, 0.5, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.57, 0.13, 0.03, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0] 100011111101111111011001000000 100000000001001001010001000001 011111010000000000000000000000 101111011101001000000000000001 101000000000001000000000000000 100001001101001000000001000000 100011111101011011010010100000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.7752 @ step 9100 epoch 50.56 loss: 0.112049, lagrangian_loss: -0.159810, attention_score_distillation_loss: 0.000379 loss: 0.134811, lagrangian_loss: -0.202310, attention_score_distillation_loss: 0.000380 ---------------------------------------------------------------------- time: 2023-07-19 15:36:21 Evaluating: pearson: 0.7485, eval_loss: 0.991, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7258, expected_sparsity: 0.7177, expected_sequence_sparsity: 0.8644, target_sparsity: 0.7, step: 9500 lambda_1: -14.5817, lambda_2: 77.4188 lambda_3: 0.0000 train remain: [0.56 0.21 0.21 0.34 0.1 0.22 0.49 0.46 0.27] infer remain: [0.57, 0.2, 0.2, 0.33, 0.1, 0.2, 0.5, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.57, 0.11, 0.02, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0] 100011111101111111011001000000 100000000001001001010001000000 011111010000000000000000000000 101111011101001000000000000000 101000000000000000000000000001 100000001101001000000001000000 100011111101011011010010100000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.7752 @ step 9100 epoch 50.56 loss: 0.113434, lagrangian_loss: -0.191044, attention_score_distillation_loss: 0.000379 ETA: 1:42:28 | Epoch 52 finished. Took 64.45 seconds. 
loss: 0.148300, lagrangian_loss: -0.179528, attention_score_distillation_loss: 0.000384 ---------------------------------------------------------------------- time: 2023-07-19 15:36:56 Evaluating: pearson: 0.7631, eval_loss: 0.9564, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7311, expected_sparsity: 0.7211, expected_sequence_sparsity: 0.8661, target_sparsity: 0.7, step: 9600 lambda_1: -11.8536, lambda_2: 80.3045 lambda_3: 0.0000 train remain: [0.55 0.21 0.2 0.34 0.1 0.22 0.49 0.46 0.27] infer remain: [0.53, 0.2, 0.2, 0.33, 0.1, 0.2, 0.5, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.53, 0.11, 0.02, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0] 100011111101111101011001000000 000000000001001001010001000001 011111010000000000000000000000 101111011101001000000000000000 101000000000000000000000000001 100000001101001000000001000000 100011111101011011010010100000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.7752 @ step 9100 epoch 50.56 loss: 0.160034, lagrangian_loss: -0.158100, attention_score_distillation_loss: 0.000379 loss: 0.190926, lagrangian_loss: -0.120930, attention_score_distillation_loss: 0.000377 ---------------------------------------------------------------------- time: 2023-07-19 15:37:31 Evaluating: pearson: 0.7525, eval_loss: 0.9912, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7311, expected_sparsity: 0.7211, expected_sequence_sparsity: 0.8661, target_sparsity: 0.7, step: 9700 lambda_1: -9.1556, lambda_2: 83.1951 lambda_3: 0.0000 train remain: [0.55 0.21 0.2 0.34 0.1 0.22 0.49 0.46 0.27] infer remain: [0.53, 0.2, 0.2, 0.33, 0.1, 0.2, 0.5, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.53, 0.11, 0.02, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0] 100011111101111101011001000000 001000000001001001010001000000 011111010000000000000000000000 101111011101001000000000000000 101000000000001000000000000000 100000001101001000000001000000 100011111101011011010010100000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.7752 @ step 9100 epoch 50.56 loss: 0.119560, lagrangian_loss: -0.150248, attention_score_distillation_loss: 0.000385 ETA: 1:41:28 | Epoch 53 finished. Took 64.9 seconds. 
loss: 0.188583, lagrangian_loss: -0.091349, attention_score_distillation_loss: 0.000375 ---------------------------------------------------------------------- time: 2023-07-19 15:38:07 Evaluating: pearson: 0.7527, eval_loss: 0.9902, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7311, expected_sparsity: 0.7211, expected_sequence_sparsity: 0.8661, target_sparsity: 0.7, step: 9800 lambda_1: -6.4202, lambda_2: 86.2255 lambda_3: 0.0000 train remain: [0.54 0.21 0.2 0.34 0.1 0.22 0.49 0.46 0.27] infer remain: [0.53, 0.2, 0.2, 0.33, 0.1, 0.2, 0.5, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.53, 0.11, 0.02, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0] 100011111101111101011001000000 000000000001001001010011000000 011111010000000000000000000000 101111011101001000000000000000 101000000000001000000000000000 100000001101001000000001000000 100011111101011011010010100000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.7752 @ step 9100 epoch 50.56 loss: 0.152142, lagrangian_loss: -0.076991, attention_score_distillation_loss: 0.000384 loss: 0.215770, lagrangian_loss: -0.058972, attention_score_distillation_loss: 0.000378 ---------------------------------------------------------------------- time: 2023-07-19 15:38:42 Evaluating: pearson: 0.7693, eval_loss: 0.9355, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7311, expected_sparsity: 0.7211, expected_sequence_sparsity: 0.8661, target_sparsity: 0.7, step: 9900 lambda_1: -3.7362, lambda_2: 89.1860 lambda_3: 0.0000 train remain: [0.53 0.21 0.2 0.34 0.1 0.22 0.49 0.46 0.27] infer remain: [0.53, 0.2, 0.2, 0.33, 0.1, 0.2, 0.5, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.53, 0.11, 0.02, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0] 100011111101111101011001000000 000000000001001001010001001000 111111000000000000000000000000 101111011101001000000000000000 101000000000001000000000000000 100000001101001000000001000000 100011111101011011010010100000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.7752 @ step 9100 epoch 50.56 ETA: 1:40:27 | Epoch 54 finished. Took 64.71 seconds. loss: 0.121074, lagrangian_loss: -0.038500, attention_score_distillation_loss: 0.000375 loss: 0.171117, lagrangian_loss: -0.014021, attention_score_distillation_loss: 0.000389 ---------------------------------------------------------------------- time: 2023-07-19 15:39:17 Evaluating: pearson: 0.7594, eval_loss: 0.9695, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7311, expected_sparsity: 0.7211, expected_sequence_sparsity: 0.8661, target_sparsity: 0.7, step: 10000 lambda_1: -1.1512, lambda_2: 91.9865 lambda_3: 0.0000 train remain: [0.54 0.21 0.2 0.34 0.1 0.22 0.49 0.46 0.27] infer remain: [0.53, 0.2, 0.2, 0.33, 0.1, 0.2, 0.5, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.53, 0.11, 0.02, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0] 100011111101111101011001000000 000100000001001001010001000000 111111000000000000000000000000 101111011101001000000000000000 101000000000001000000000000000 100000001101001000000001000000 100011111101011011010010100000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.7752 @ step 9100 epoch 50.56 loss: 0.116908, lagrangian_loss: 0.005401, attention_score_distillation_loss: 0.000386 loss: 0.249222, lagrangian_loss: 0.008288, attention_score_distillation_loss: 0.000367 ETA: 1:39:12 | Epoch 55 finished. Took 56.8 seconds. 
----------------------------------------------------------------------
time: 2023-07-19 15:39:52
Evaluating: pearson: 0.7508, eval_loss: 1.0008, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7258, expected_sparsity: 0.7158, expected_sequence_sparsity: 0.8635, target_sparsity: 0.7, step: 10100
lambda_1: 1.0896, lambda_2: 94.1495 lambda_3: 0.0000
train remain: [0.56 0.22 0.21 0.34 0.1 0.22 0.49 0.46 0.27]
infer remain: [0.57, 0.23, 0.2, 0.33, 0.1, 0.2, 0.5, 0.47, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.57, 0.13, 0.03, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0]
100011111101111101011011000000
000100000001001001010001100000
111111000000000000000000000000
101111011101001000000000000000
101000000000000000000000000001
100000001101001000000001000000
100011111101011011010010100000
100011111101001000110010100001
100110111100001000000000000000
Best eval score so far: 0.7752 @ step 9100 epoch 50.56
loss: 0.113397, lagrangian_loss: 0.037434, attention_score_distillation_loss: 0.000387
loss: 0.050302, lagrangian_loss: 0.026131, attention_score_distillation_loss: 0.000374
----------------------------------------------------------------------
time: 2023-07-19 15:40:27
Evaluating: pearson: 0.7647, eval_loss: 0.9376, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7177, expected_sparsity: 0.7097, expected_sequence_sparsity: 0.8605, target_sparsity: 0.7, step: 10200
lambda_1: 2.7380, lambda_2: 95.3785 lambda_3: 0.0000
train remain: [0.6 0.28 0.22 0.35 0.11 0.22 0.49 0.46 0.27]
infer remain: [0.6, 0.27, 0.23, 0.33, 0.1, 0.2, 0.5, 0.47, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.6, 0.16, 0.04, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0]
101011111101111101011001000001
000100000001001001010001100001
111111010000000000000000000000
101111011101001000000000000000
101000000000000000000000000001
100000001101001000000001000000
100011111101011011010010100000
100011111101001000110010100001
100110111100001000000000000000
Best eval score so far: 0.7752 @ step 9100 epoch 50.56
loss: 0.325193, lagrangian_loss: 0.025423, attention_score_distillation_loss: 0.000385
loss: 0.199189, lagrangian_loss: -0.013582, attention_score_distillation_loss: 0.000373
ETA: 1:38:11 | Epoch 56 finished. Took 65.01 seconds.
----------------------------------------------------------------------
time: 2023-07-19 15:41:02
Evaluating: pearson: 0.7902, eval_loss: 0.8569, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7016, expected_sparsity: 0.6929, expected_sequence_sparsity: 0.8522, target_sparsity: 0.7, step: 10300
lambda_1: 2.3571, lambda_2: 95.8732 lambda_3: 0.0000
train remain: [0.69 0.39 0.25 0.36 0.11 0.22 0.49 0.46 0.27]
infer remain: [0.67, 0.4, 0.23, 0.37, 0.1, 0.23, 0.5, 0.47, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.67, 0.27, 0.06, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0]
101111111101111101111001000001
000110000011001001010011100101
111111010000000000000000000000
101111011101001000000010000000
101000000000000000000000000001
100001001101001000000001000000
100011111101011011010010100000
100011111101001000110010100001
100110111100001000000000000000
Best eval score so far: 0.7752 @ step 9100 epoch 50.56
Saving the best model so far: [Epoch 57 | Step: 10300 | MACs sparsity: 0.7016 | Score: 0.7902 | Loss: 0.8569]
loss: 0.116222, lagrangian_loss: -0.014457, attention_score_distillation_loss: 0.000386
loss: 0.046088, lagrangian_loss: -0.003608, attention_score_distillation_loss: 0.000384
----------------------------------------------------------------------
time: 2023-07-19 15:41:51
Evaluating: pearson: 0.7897, eval_loss: 0.8997, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.707, expected_sparsity: 0.6976, expected_sequence_sparsity: 0.8546, target_sparsity: 0.7, step: 10400
lambda_1: 0.5415, lambda_2: 97.3043 lambda_3: 0.0000
train remain: [0.67 0.35 0.24 0.35 0.11 0.22 0.49 0.46 0.27]
infer remain: [0.67, 0.33, 0.23, 0.37, 0.1, 0.2, 0.5, 0.47, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.67, 0.22, 0.05, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0]
101111111101111101111001000001
000110000011001001010001100001
111111010000000000000000000000
101111011101001000000010000000
101000000000000000000000000001
100000001101001000000001000000
100011111101011011010010100000
100011111101001000110010100001
100110111100001000000000000000
Best eval score so far: 0.7902 @ step 10300 epoch 57.22
loss: 0.089915, lagrangian_loss: 0.005422, attention_score_distillation_loss: 0.000380
ETA: 1:37:32 | Epoch 57 finished. Took 78.77 seconds.
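The "Saving the best model so far" lines above are consistent with a simple selection rule: once MACs sparsity has reached the 0.7 target, keep whichever checkpoint has the highest score. A hypothetical sketch of that bookkeeping; the actual repo logic may differ, and all names here are invented:

best_score = float("-inf")

def maybe_save_best(model, step, epoch, macs_sparsity, score, loss,
                    target_sparsity=0.7):
    # Save only when the sparsity constraint is met AND the metric improves,
    # which matches which evaluations above trigger a save.
    global best_score
    if macs_sparsity >= target_sparsity and score > best_score:
        best_score = score
        print(f"Saving the best model so far: [Epoch {epoch} | Step: {step} "
              f"| MACs sparsity: {macs_sparsity} | Score: {score} | Loss: {loss}]")
        # model.save_pretrained(output_dir)  # HF-style save, path omitted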
loss: 0.115280, lagrangian_loss: 0.005510, attention_score_distillation_loss: 0.000376
----------------------------------------------------------------------
time: 2023-07-19 15:42:26
Evaluating: pearson: 0.796, eval_loss: 0.8564, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7151, expected_sparsity: 0.7039, expected_sequence_sparsity: 0.8576, target_sparsity: 0.7, step: 10500
lambda_1: -0.3236, lambda_2: 97.6590 lambda_3: 0.0000
train remain: [0.66 0.3 0.24 0.35 0.11 0.22 0.49 0.46 0.27]
infer remain: [0.63, 0.3, 0.23, 0.33, 0.1, 0.2, 0.5, 0.47, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.19, 0.04, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0]
101111111101111101011001000001
000110000001001001010001100001
111111010000000000000000000000
101111011101001000000000000000
101000000000000000000000000001
100000001101001000000001000000
100011111101011011010010100000
100011111101001000110010100001
100110111100001000000000000000
Best eval score so far: 0.7902 @ step 10300 epoch 57.22
Saving the best model so far: [Epoch 58 | Step: 10500 | MACs sparsity: 0.7151 | Score: 0.796 | Loss: 0.8564]
loss: 0.103753, lagrangian_loss: 0.000209, attention_score_distillation_loss: 0.000379
loss: 0.071886, lagrangian_loss: 0.000327, attention_score_distillation_loss: 0.000383
----------------------------------------------------------------------
time: 2023-07-19 15:43:19
Evaluating: pearson: 0.7773, eval_loss: 0.9195, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7151, expected_sparsity: 0.7061, expected_sequence_sparsity: 0.8587, target_sparsity: 0.7, step: 10600
lambda_1: -0.5914, lambda_2: 97.7389 lambda_3: 0.0000
train remain: [0.66 0.28 0.23 0.35 0.1 0.22 0.49 0.46 0.27]
infer remain: [0.63, 0.27, 0.23, 0.33, 0.1, 0.2, 0.5, 0.47, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.17, 0.04, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0]
101111111111111101011001000000
000110000001001001010001100000
111111010000000000000000000000
101111011101001000000000000000
101000000000000000000000000001
100000001101001000000001000000
100011111101011011010010100000
100011111101001000110010100001
100110111100001000000000000000
Best eval score so far: 0.7960 @ step 10500 epoch 58.33
loss: 0.057625, lagrangian_loss: 0.000081, attention_score_distillation_loss: 0.000378
ETA: 1:36:57 | Epoch 58 finished. Took 81.74 seconds.
loss: 0.045191, lagrangian_loss: -0.000237, attention_score_distillation_loss: 0.000382
----------------------------------------------------------------------
time: 2023-07-19 15:43:54
Evaluating: pearson: 0.7902, eval_loss: 0.8904, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7151, expected_sparsity: 0.7061, expected_sequence_sparsity: 0.8587, target_sparsity: 0.7, step: 10700
lambda_1: -0.7159, lambda_2: 97.7950 lambda_3: 0.0000
train remain: [0.65 0.27 0.23 0.34 0.1 0.22 0.49 0.46 0.27]
infer remain: [0.63, 0.27, 0.23, 0.33, 0.1, 0.2, 0.5, 0.47, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.17, 0.04, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0]
101111111111111101011001000000
000110000001001001010001100000
111111010000000000000000000000
101111011101001000000000000000
101000000000001000000000000000
100000001101001000000001000000
100011111101011011010010100000
100011111101001000110010100001
100110111100001000000000000000
Best eval score so far: 0.7960 @ step 10500 epoch 58.33
loss: 0.128166, lagrangian_loss: -0.000544, attention_score_distillation_loss: 0.000383
loss: 0.117183, lagrangian_loss: -0.000219, attention_score_distillation_loss: 0.000384
----------------------------------------------------------------------
time: 2023-07-19 15:44:28
Evaluating: pearson: 0.7843, eval_loss: 0.9077, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7204, expected_sparsity: 0.7083, expected_sequence_sparsity: 0.8598, target_sparsity: 0.7, step: 10800
lambda_1: -0.7266, lambda_2: 97.8372 lambda_3: 0.0000
train remain: [0.67 0.27 0.23 0.34 0.1 0.22 0.49 0.46 0.27]
infer remain: [0.63, 0.23, 0.23, 0.33, 0.1, 0.2, 0.5, 0.47, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.15, 0.03, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0]
101111111111111101011001000000
000100000001001001010001100000
111111010000000000000000000000
101111011101001000000000000000
101000000000000000010000000000
100000001101001000000001000000
100011111101011011010010100000
100011111101001000110010100001
100110111100001000000000000000
Best eval score so far: 0.7960 @ step 10500 epoch 58.33
ETA: 1:35:54 | Epoch 59 finished. Took 64.44 seconds.
loss: 0.061777, lagrangian_loss: 0.003083, attention_score_distillation_loss: 0.000384
loss: 0.053912, lagrangian_loss: 0.000964, attention_score_distillation_loss: 0.000382
----------------------------------------------------------------------
time: 2023-07-19 15:45:04
Evaluating: pearson: 0.7903, eval_loss: 0.874, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7204, expected_sparsity: 0.7083, expected_sequence_sparsity: 0.8598, target_sparsity: 0.7, step: 10900
lambda_1: -0.9246, lambda_2: 97.9020 lambda_3: 0.0000
train remain: [0.67 0.26 0.23 0.34 0.1 0.22 0.49 0.46 0.27]
infer remain: [0.63, 0.23, 0.23, 0.33, 0.1, 0.2, 0.5, 0.47, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.15, 0.03, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0]
101111111111111101011001000000
000100000001001001010001100000
111111010000000000000000000000
101111011101001000000000000000
101000000000000000010000000000
100000001101001000000001000000
100011111101011011010010100000
100011111101001000110010100001
100110111100001000000000000000
Best eval score so far: 0.7960 @ step 10500 epoch 58.33
loss: 0.054002, lagrangian_loss: 0.001087, attention_score_distillation_loss: 0.000388
loss: 0.097379, lagrangian_loss: -0.001467, attention_score_distillation_loss: 0.000385
ETA: 1:34:39 | Epoch 60 finished. Took 56.75 seconds.
----------------------------------------------------------------------
time: 2023-07-19 15:45:39
Evaluating: pearson: 0.7815, eval_loss: 0.9107, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7204, expected_sparsity: 0.7083, expected_sequence_sparsity: 0.8598, target_sparsity: 0.7, step: 11000
lambda_1: -0.9325, lambda_2: 97.9509 lambda_3: 0.0000
train remain: [0.66 0.25 0.23 0.34 0.1 0.22 0.49 0.46 0.27]
infer remain: [0.63, 0.23, 0.23, 0.33, 0.1, 0.2, 0.5, 0.47, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.15, 0.03, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0]
101111111111111101011001000000
000100000001011001010001000000
111111010000000000000000000000
101111011101001000000000000000
101000000000000000010000000000
100000001101001000000001000000
100011111101011011010010100000
100011111101001000110010100001
100110111100001000000000000000
Best eval score so far: 0.7960 @ step 10500 epoch 58.33
loss: 0.089965, lagrangian_loss: -0.001495, attention_score_distillation_loss: 0.000387
loss: 0.094231, lagrangian_loss: -0.000274, attention_score_distillation_loss: 0.000382
----------------------------------------------------------------------
time: 2023-07-19 15:46:13
Evaluating: pearson: 0.7755, eval_loss: 0.9429, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7106, expected_sequence_sparsity: 0.8609, target_sparsity: 0.7, step: 11100
lambda_1: -0.8248, lambda_2: 98.0037 lambda_3: 0.0000
train remain: [0.66 0.25 0.22 0.33 0.1 0.22 0.49 0.46 0.27]
infer remain: [0.63, 0.2, 0.23, 0.33, 0.1, 0.2, 0.5, 0.47, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.03, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0]
101111111111111101011001000000
000100000001001001010001000000
111111010000000000000000000000
101111011101001000000000000000
101000000000000000010000000000
100000001101001000000001000000
100011111101011011010010100000
100011111101001000110010100001
100110111100001000000000000000
Best eval score so far: 0.7960 @ step 10500 epoch 58.33
loss: 0.067006, lagrangian_loss: -0.001698, attention_score_distillation_loss: 0.000390
loss: 0.085092, lagrangian_loss: -0.000708, attention_score_distillation_loss: 0.000389
ETA: 1:33:37 | Epoch 61 finished. Took 64.93 seconds.
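The "layerwise remain" vector appears to be the running product of the per-layer "infer remain" keep ratios, with the three un-pruned leading layers pinned at 1.0 (pruning starts at layer 3). A quick check against the step-11100 numbers, up to rounding of the printed ratios:

import numpy as np

# "infer remain" as printed at step 11100 (illustrative reconstruction).
infer_remain = [0.63, 0.2, 0.23, 0.33, 0.1, 0.2, 0.5, 0.47, 0.27]
layerwise = [1.0, 1.0, 1.0] + list(np.cumprod(infer_remain))
print([round(float(x), 2) for x in layerwise])
# [1.0, 1.0, 1.0, 0.63, 0.13, 0.03, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0]
# -- matches the "layerwise remain" line in the log above.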
----------------------------------------------------------------------
time: 2023-07-19 15:46:49
Evaluating: pearson: 0.7661, eval_loss: 0.9668, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7106, expected_sequence_sparsity: 0.8609, target_sparsity: 0.7, step: 11200
lambda_1: -0.6635, lambda_2: 98.0617 lambda_3: 0.0000
train remain: [0.66 0.27 0.22 0.33 0.1 0.22 0.49 0.46 0.27]
infer remain: [0.63, 0.2, 0.23, 0.33, 0.1, 0.2, 0.5, 0.47, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.03, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0]
101111111111111101011001000000
000100000001001001010001000000
111111000001000000000000000000
101111011101001000000000000000
101000000000000000010000000000
100000001101001000000001000000
100011111101011011010010100000
100011111101001000110010100001
100110111100001000000000000000
Best eval score so far: 0.7960 @ step 10500 epoch 58.33
loss: 0.095318, lagrangian_loss: -0.000623, attention_score_distillation_loss: 0.000379
loss: 0.095582, lagrangian_loss: 0.002319, attention_score_distillation_loss: 0.000383
----------------------------------------------------------------------
time: 2023-07-19 15:47:24
Evaluating: pearson: 0.7726, eval_loss: 0.9569, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7106, expected_sequence_sparsity: 0.8609, target_sparsity: 0.7, step: 11300
lambda_1: -0.7268, lambda_2: 98.1172 lambda_3: 0.0000
train remain: [0.66 0.27 0.22 0.33 0.1 0.22 0.49 0.46 0.27]
infer remain: [0.63, 0.2, 0.23, 0.33, 0.1, 0.2, 0.5, 0.47, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.03, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0]
101111111111111101011001000000
000100000001001001010001000000
111111000001000000000000000000
101111011101001000000000000000
101000000000000000010000000000
100000001101001000000001000000
100011111101011011010010100000
100011111101001000110010100001
100110111100001000000000000000
Best eval score so far: 0.7960 @ step 10500 epoch 58.33
loss: 0.063519, lagrangian_loss: -0.000523, attention_score_distillation_loss: 0.000378
ETA: 1:32:35 | Epoch 62 finished. Took 65.02 seconds.
loss: 0.101295, lagrangian_loss: -0.001086, attention_score_distillation_loss: 0.000387
----------------------------------------------------------------------
time: 2023-07-19 15:47:59
Evaluating: pearson: 0.788, eval_loss: 0.9034, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7106, expected_sequence_sparsity: 0.8609, target_sparsity: 0.7, step: 11400
lambda_1: -0.7271, lambda_2: 98.1711 lambda_3: 0.0000
train remain: [0.65 0.27 0.22 0.33 0.1 0.22 0.49 0.46 0.27]
infer remain: [0.63, 0.2, 0.23, 0.33, 0.1, 0.2, 0.5, 0.47, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.03, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0]
101111111111111101011001000000
000100000001001001010001000000
111111010000000000000000000000
101111011101001000000000000000
101000000000000000010000000000
100000001101001000000001000000
100011111101011011010010100000
100011111101001000110010100001
100110111100001000000000000000
Best eval score so far: 0.7960 @ step 10500 epoch 58.33
loss: 0.046692, lagrangian_loss: -0.001004, attention_score_distillation_loss: 0.000386
loss: 0.033076, lagrangian_loss: -0.000707, attention_score_distillation_loss: 0.000388
----------------------------------------------------------------------
time: 2023-07-19 15:48:34
Evaluating: pearson: 0.7908, eval_loss: 0.886, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7106, expected_sequence_sparsity: 0.8609, target_sparsity: 0.7, step: 11500
lambda_1: -0.4726, lambda_2: 98.2364 lambda_3: 0.0000
train remain: [0.65 0.27 0.22 0.33 0.1 0.22 0.49 0.46 0.27]
infer remain: [0.63, 0.2, 0.23, 0.33, 0.1, 0.2, 0.5, 0.47, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.03, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0]
101111111111111101011001000000
000100000001001001010001000000
111111000001000000000000000000
101111011101001000000000000000
101000000000000000010000000000
100000001101001000000001000000
100011111101011011010010100000
100011111101001000110010100001
100110111100001000000000000000
Best eval score so far: 0.7960 @ step 10500 epoch 58.33
loss: 0.064750, lagrangian_loss: -0.000150, attention_score_distillation_loss: 0.000381
ETA: 1:31:32 | Epoch 63 finished. Took 64.88 seconds.
loss: 0.044639, lagrangian_loss: 0.001230, attention_score_distillation_loss: 0.000382
----------------------------------------------------------------------
time: 2023-07-19 15:49:10
Evaluating: pearson: 0.7935, eval_loss: 0.8754, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7106, expected_sequence_sparsity: 0.8609, target_sparsity: 0.7, step: 11600
lambda_1: -0.5412, lambda_2: 98.3227 lambda_3: 0.0000
train remain: [0.66 0.27 0.22 0.33 0.1 0.22 0.49 0.46 0.27]
infer remain: [0.63, 0.2, 0.23, 0.33, 0.1, 0.2, 0.5, 0.47, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.03, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0]
101111111111111101011001000000
000000000001011001010001000000
111111000000001000000000000000
101111011101001000000000000000
101000000000000000010000000000
100000001101001000000001000000
100011111101011011010010100000
100011111101001000110010100001
000110111100001000000000000001
Best eval score so far: 0.7960 @ step 10500 epoch 58.33
loss: 0.084236, lagrangian_loss: -0.000310, attention_score_distillation_loss: 0.000382
loss: 0.060878, lagrangian_loss: 0.001177, attention_score_distillation_loss: 0.000382
----------------------------------------------------------------------
time: 2023-07-19 15:49:45
Evaluating: pearson: 0.7905, eval_loss: 0.8992, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7106, expected_sequence_sparsity: 0.8609, target_sparsity: 0.7, step: 11700
lambda_1: -0.4652, lambda_2: 98.3940 lambda_3: 0.0000
train remain: [0.66 0.27 0.22 0.33 0.1 0.22 0.49 0.46 0.27]
infer remain: [0.63, 0.2, 0.23, 0.33, 0.1, 0.2, 0.5, 0.47, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.03, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0]
101111111111111101011001000000
000000000001011001010001000000
111111010000000000000000000000
101111011101001000000000000000
101000000000000000010000000000
100000001101001000000001000000
100011111101011011010010100000
100011111101001000110010100001
100110111100001000000000000000
Best eval score so far: 0.7960 @ step 10500 epoch 58.33
ETA: 1:30:30 | Epoch 64 finished. Took 65.01 seconds.
loss: 0.045025, lagrangian_loss: -0.000477, attention_score_distillation_loss: 0.000388
loss: 0.050351, lagrangian_loss: 0.000964, attention_score_distillation_loss: 0.000378
----------------------------------------------------------------------
time: 2023-07-19 15:50:20
Evaluating: pearson: 0.7866, eval_loss: 0.9432, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7106, expected_sequence_sparsity: 0.8609, target_sparsity: 0.7, step: 11800
lambda_1: -0.5377, lambda_2: 98.4921 lambda_3: 0.0000
train remain: [0.66 0.26 0.22 0.33 0.1 0.22 0.49 0.45 0.27]
infer remain: [0.63, 0.2, 0.23, 0.33, 0.1, 0.2, 0.5, 0.47, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.03, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0]
101111111111111101011001000000
000000000001011001010001000000
111111000000001000000000000000
101111011101001000000000000000
101000000000000000010000000000
100000001101001000000001000000
100011111101011011010010100000
100011111101001000110010100001
100110111100001000000000000000
Best eval score so far: 0.7960 @ step 10500 epoch 58.33
loss: 0.060496, lagrangian_loss: -0.000379, attention_score_distillation_loss: 0.000392
loss: 0.045180, lagrangian_loss: 0.001874, attention_score_distillation_loss: 0.000378
ETA: 1:29:17 | Epoch 65 finished. Took 56.88 seconds.
----------------------------------------------------------------------
time: 2023-07-19 15:50:55
Evaluating: pearson: 0.7864, eval_loss: 0.93, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.711, expected_sequence_sparsity: 0.8611, target_sparsity: 0.7, step: 11900
lambda_1: -0.5056, lambda_2: 98.5966 lambda_3: 0.0000
train remain: [0.66 0.27 0.22 0.33 0.1 0.22 0.49 0.46 0.27]
infer remain: [0.63, 0.2, 0.2, 0.33, 0.1, 0.2, 0.5, 0.47, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.03, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0]
101111111111111101011001000000
000000000011001001010001000000
111111000000000000000000000000
101111011101001000000000000000
101000000000000000010000000000
100000001101001000000001000000
100011111101011011010010100000
100011111101001000110010100001
100110111100001000000000000000
Best eval score so far: 0.7960 @ step 10500 epoch 58.33
loss: 0.052759, lagrangian_loss: -0.000066, attention_score_distillation_loss: 0.000384
loss: 0.098247, lagrangian_loss: -0.000599, attention_score_distillation_loss: 0.000388
----------------------------------------------------------------------
time: 2023-07-19 15:51:30
Evaluating: pearson: 0.7876, eval_loss: 0.93, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.711, expected_sequence_sparsity: 0.8611, target_sparsity: 0.7, step: 12000
lambda_1: -0.5030, lambda_2: 98.6715 lambda_3: 0.0000
train remain: [0.66 0.27 0.22 0.33 0.1 0.22 0.49 0.45 0.27]
infer remain: [0.63, 0.2, 0.2, 0.33, 0.1, 0.2, 0.5, 0.47, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.03, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0]
101111111111111101011001000000
000000000001001011010001000000
111111000000000000000000000000
101111011101001000000000000000
101000000000000000010000000000
100000001101001000000001000000
100011111101011011010010100000
100011111101001000110010100001
100110111100001000000000000000
Best eval score so far: 0.7960 @ step 10500 epoch 58.33
loss: 0.049438, lagrangian_loss: -0.000640, attention_score_distillation_loss: 0.000386
loss: 0.064165, lagrangian_loss: 0.000559, attention_score_distillation_loss: 0.000377
ETA: 1:28:15 | Epoch 66 finished. Took 64.84 seconds.
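The gap between the fractional "train remain" vector and the binarized "infer remain" vector (which matches the 0/1 rows printed for each layer) is what the usual hard-concrete L0 gates would produce: stochastic gates during training, a deterministic threshold at inference. A sketch under that assumption, using the Louizos-style deterministic estimator; the stretch limits and names here are illustrative, not taken from the repo:

import torch

def inference_mask(token_loga: torch.Tensor,
                   limit_left: float = -0.1,
                   limit_right: float = 1.1) -> torch.Tensor:
    # token_loga has shape [num_prune_layers, bin_num] = [9, 30] in this run.
    # Stretch the sigmoid of log-alpha, clamp to [0, 1], and keep the bins
    # whose expected gate value exceeds 0.5 -- yielding rows like the
    # 30-character 0/1 strings printed above.
    s = torch.sigmoid(token_loga) * (limit_right - limit_left) + limit_left
    return (s.clamp(0.0, 1.0) > 0.5).long()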
----------------------------------------------------------------------
time: 2023-07-19 15:52:05
Evaluating: pearson: 0.7826, eval_loss: 0.9155, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.711, expected_sequence_sparsity: 0.8611, target_sparsity: 0.7, step: 12100
lambda_1: -0.4486, lambda_2: 98.7822 lambda_3: 0.0000
train remain: [0.66 0.27 0.22 0.33 0.1 0.22 0.49 0.45 0.27]
infer remain: [0.63, 0.2, 0.2, 0.33, 0.1, 0.2, 0.5, 0.47, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.03, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0]
101111111111111101011001000000
000001000001001001010001000000
111111000000000000000000000000
101111011101001000000000000000
101000000000000000010000000000
100000001101001000000001000000
100011111101011011010010100000
100011111101001000110010100001
100110111100001000000000000000
Best eval score so far: 0.7960 @ step 10500 epoch 58.33
loss: 0.105475, lagrangian_loss: 0.001390, attention_score_distillation_loss: 0.000379
loss: 0.037636, lagrangian_loss: 0.001246, attention_score_distillation_loss: 0.000379
----------------------------------------------------------------------
time: 2023-07-19 15:52:40
Evaluating: pearson: 0.7757, eval_loss: 0.9643, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.711, expected_sequence_sparsity: 0.8611, target_sparsity: 0.7, step: 12200
lambda_1: -0.5215, lambda_2: 98.8777 lambda_3: 0.0000
train remain: [0.66 0.27 0.22 0.33 0.1 0.22 0.49 0.45 0.27]
infer remain: [0.63, 0.2, 0.2, 0.33, 0.1, 0.2, 0.5, 0.47, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.03, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0]
101111111111111101011001000000
000001000001001001010001000000
111111000000000000000000000000
101111011101001000000000000000
101000000000000000010000000000
100000001101001000000001000000
100011111101011011010010100000
100011111101001000110010100001
100110111100001000000000000000
Best eval score so far: 0.7960 @ step 10500 epoch 58.33
loss: 0.078813, lagrangian_loss: 0.002593, attention_score_distillation_loss: 0.000376
ETA: 1:27:11 | Epoch 67 finished. Took 64.22 seconds.
loss: 0.050498, lagrangian_loss: -0.000060, attention_score_distillation_loss: 0.000382
----------------------------------------------------------------------
time: 2023-07-19 15:53:15
Evaluating: pearson: 0.7791, eval_loss: 0.9619, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.711, expected_sequence_sparsity: 0.8611, target_sparsity: 0.7, step: 12300
lambda_1: -0.5019, lambda_2: 98.9959 lambda_3: 0.0000
train remain: [0.65 0.28 0.21 0.33 0.1 0.22 0.49 0.45 0.27]
infer remain: [0.63, 0.2, 0.2, 0.33, 0.1, 0.2, 0.5, 0.47, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.03, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0]
101111111111111101011001000000
000001000001001001010001000000
111111000000000000000000000000
101111011101001000000000000000
101000000000001000000000000000
100000001101001000000001000000
100011111101011011010010100000
100011111101001000110010100001
100110111100001000000000000000
Best eval score so far: 0.7960 @ step 10500 epoch 58.33
loss: 0.073610, lagrangian_loss: 0.000803, attention_score_distillation_loss: 0.000381
loss: 0.023274, lagrangian_loss: -0.000240, attention_score_distillation_loss: 0.000379
----------------------------------------------------------------------
time: 2023-07-19 15:53:50
Evaluating: pearson: 0.7668, eval_loss: 1.0094, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.711, expected_sequence_sparsity: 0.8611, target_sparsity: 0.7, step: 12400
lambda_1: -0.2759, lambda_2: 99.1276 lambda_3: 0.0000
train remain: [0.65 0.28 0.21 0.33 0.1 0.22 0.49 0.45 0.27]
infer remain: [0.63, 0.2, 0.2, 0.33, 0.1, 0.2, 0.5, 0.47, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.03, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0]
101111111111111101011001000000
000001000001001001010001000000
111111000000000000000000000000
101111011101001000000000000000
101000000000000000010000000000
100000001101001000000001000000
100011111101011011010010100000
100011111101001000110010100001
100110111100001000000000000000
Best eval score so far: 0.7960 @ step 10500 epoch 58.33
loss: 0.047830, lagrangian_loss: 0.000425, attention_score_distillation_loss: 0.000380
ETA: 1:26:08 | Epoch 68 finished. Took 64.57 seconds.
loss: 0.037425, lagrangian_loss: 0.000685, attention_score_distillation_loss: 0.000381
----------------------------------------------------------------------
time: 2023-07-19 15:54:25
Evaluating: pearson: 0.7841, eval_loss: 0.9146, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.711, expected_sequence_sparsity: 0.8611, target_sparsity: 0.7, step: 12500
lambda_1: -0.3843, lambda_2: 99.2584 lambda_3: 0.0000
train remain: [0.65 0.28 0.21 0.32 0.1 0.22 0.49 0.45 0.27]
infer remain: [0.63, 0.2, 0.2, 0.33, 0.1, 0.2, 0.5, 0.47, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.03, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0]
101111111101111101011001100000
000001000001001001010001000000
111111000000000000000000000000
101111011101001000000000000000
101000000000001000000000000000
100000001101001000000001000000
100011111101011011010010100000
100011111101001000110010100001
100110111100001000000000000000
Best eval score so far: 0.7960 @ step 10500 epoch 58.33
loss: 0.094387, lagrangian_loss: -0.000277, attention_score_distillation_loss: 0.000386
loss: 0.082036, lagrangian_loss: -0.000516, attention_score_distillation_loss: 0.000382
----------------------------------------------------------------------
time: 2023-07-19 15:55:00
Evaluating: pearson: 0.7816, eval_loss: 0.9496, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.711, expected_sequence_sparsity: 0.8611, target_sparsity: 0.7, step: 12600
lambda_1: -0.4478, lambda_2: 99.3945 lambda_3: 0.0000
train remain: [0.65 0.28 0.21 0.32 0.1 0.22 0.49 0.46 0.27]
infer remain: [0.63, 0.2, 0.2, 0.33, 0.1, 0.2, 0.5, 0.47, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.03, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0]
101111111101111111011001000000
000000000001011001010001000000
111111000000000000000000000000
101111011101001000000000000000
101000010000000000000000000000
100000001101001000000001000000
100011111101011011010010100000
100011111101001000110010100001
100110111100001000000000000000
Best eval score so far: 0.7960 @ step 10500 epoch 58.33
ETA: 1:25:06 | Epoch 69 finished. Took 64.68 seconds.
loss: 0.072990, lagrangian_loss: -0.000382, attention_score_distillation_loss: 0.000388
loss: 0.096248, lagrangian_loss: 0.000571, attention_score_distillation_loss: 0.000384
----------------------------------------------------------------------
time: 2023-07-19 15:55:35
Evaluating: pearson: 0.787, eval_loss: 0.9201, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7111, expected_sequence_sparsity: 0.8612, target_sparsity: 0.7, step: 12700
lambda_1: -0.5127, lambda_2: 99.5486 lambda_3: 0.0000
train remain: [0.66 0.27 0.22 0.31 0.1 0.22 0.49 0.46 0.27]
infer remain: [0.63, 0.2, 0.2, 0.3, 0.1, 0.2, 0.5, 0.47, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.03, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0]
101111111101111111011001000000
000000000001011001010001000000
111111000000000000000000000000
101111001101001000000000000000
101000000000001000000000000000
100000001101001000000001000000
100011111101011011010010100000
100011111101001000110010100001
100110111100001000000000000000
Best eval score so far: 0.7960 @ step 10500 epoch 58.33
loss: 0.068056, lagrangian_loss: 0.000138, attention_score_distillation_loss: 0.000380
loss: 0.041288, lagrangian_loss: -0.000247, attention_score_distillation_loss: 0.000385
ETA: 1:23:54 | Epoch 70 finished. Took 56.94 seconds.
----------------------------------------------------------------------
time: 2023-07-19 15:56:10
Evaluating: pearson: 0.7841, eval_loss: 0.924, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7111, expected_sequence_sparsity: 0.8612, target_sparsity: 0.7, step: 12800
lambda_1: -0.3476, lambda_2: 99.6908 lambda_3: 0.0000
train remain: [0.66 0.28 0.22 0.31 0.1 0.22 0.49 0.46 0.27]
infer remain: [0.63, 0.2, 0.2, 0.3, 0.1, 0.2, 0.5, 0.47, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.03, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0]
101111111101111111011001000000
000000000001011001010001000000
111111000000000000000000000000
101111001101001000000000000000
101000000000001000000000000000
100000001101001000000001000000
100011111101011011010010100000
100011111101001000110010100001
100110111100001000000000000000
Best eval score so far: 0.7960 @ step 10500 epoch 58.33
loss: 0.046482, lagrangian_loss: 0.000006, attention_score_distillation_loss: 0.000383
loss: 0.032675, lagrangian_loss: -0.000325, attention_score_distillation_loss: 0.000377
----------------------------------------------------------------------
time: 2023-07-19 15:56:45
Evaluating: pearson: 0.8662, eval_loss: 0.5844, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7111, expected_sequence_sparsity: 0.8612, target_sparsity: 0.7, step: 12900
lambda_1: -0.2944, lambda_2: 99.8334 lambda_3: 0.0000
train remain: [0.66 0.27 0.21 0.31 0.1 0.22 0.49 0.46 0.27]
infer remain: [0.63, 0.2, 0.2, 0.3, 0.1, 0.2, 0.5, 0.47, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.03, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0]
101111111101111101011101000000
100000000001001001010001000000
111111000000000000000000000000
101111001101001000000000000000
101000000000000000010000000000
100000001101001000000001000000
100011111101011011010010100000
100011111101001000110010100001
100110111100001000000000000000
Best eval score so far: 0.7960 @ step 10500 epoch 58.33
Saving the best model so far: [Epoch 71 | Step: 12900 | MACs sparsity: 0.7231 | Score: 0.8662 | Loss: 0.5844]
loss: 0.053614, lagrangian_loss: -0.000131, attention_score_distillation_loss: 0.000383
loss: 0.034662, lagrangian_loss: -0.000195, attention_score_distillation_loss: 0.000383
ETA: 1:23:10 | Epoch 71 finished. Took 81.41 seconds.
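The "pearson" score that jumps to 0.8662 at step 12900 is the standard STS-B metric: Pearson correlation between the model's predicted similarity scores and the gold labels. A self-contained illustration with made-up numbers:

from scipy.stats import pearsonr

predictions = [4.2, 1.1, 3.0, 0.4]   # model outputs (illustrative values)
references  = [4.0, 0.8, 3.5, 0.2]   # gold STS-B similarity labels
score, _ = pearsonr(predictions, references)
print(round(score, 4))  # the kind of number reported as "pearson" above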
----------------------------------------------------------------------
time: 2023-07-19 15:57:37
Evaluating: pearson: 0.8613, eval_loss: 0.5941, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7111, expected_sequence_sparsity: 0.8612, target_sparsity: 0.7, step: 13000
lambda_1: -0.2608, lambda_2: 100.0117 lambda_3: 0.0000
train remain: [0.66 0.28 0.21 0.31 0.1 0.22 0.49 0.46 0.27]
infer remain: [0.63, 0.2, 0.2, 0.3, 0.1, 0.2, 0.5, 0.47, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.03, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0]
101111111101111101011101000000
100000000001001001010001000000
111111000000000000000000000000
101111001101001000000000000000
101000000000000000010000000000
100000001101001000000001000000
100011111101011011010010100000
100011111101001000110010100001
100110111100001000000000000000
Best eval score so far: 0.8662 @ step 12900 epoch 71.67
loss: 0.025899, lagrangian_loss: -0.000094, attention_score_distillation_loss: 0.000383
loss: 0.055102, lagrangian_loss: 0.000370, attention_score_distillation_loss: 0.000374
----------------------------------------------------------------------
time: 2023-07-19 15:58:12
Evaluating: pearson: 0.8608, eval_loss: 0.6083, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7111, expected_sequence_sparsity: 0.8612, target_sparsity: 0.7, step: 13100
lambda_1: -0.2767, lambda_2: 100.1852 lambda_3: 0.0000
train remain: [0.66 0.28 0.21 0.31 0.1 0.22 0.49 0.46 0.27]
infer remain: [0.63, 0.2, 0.2, 0.3, 0.1, 0.2, 0.5, 0.47, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.03, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0]
101111111111111101011001000000
100000000001001001010001000000
111111000000000000000000000000
101111001101001000000000000000
101000000100000000000000000000
100000001101001000000001000000
100011111101011011010010100000
100011111101001000110010100001
100110111100001000000000000000
Best eval score so far: 0.8662 @ step 12900 epoch 71.67
loss: 0.063222, lagrangian_loss: 0.000743, attention_score_distillation_loss: 0.000379
ETA: 1:22:06 | Epoch 72 finished. Took 64.76 seconds.
loss: 0.035551, lagrangian_loss: 0.004377, attention_score_distillation_loss: 0.000370
----------------------------------------------------------------------
time: 2023-07-19 15:58:47
Evaluating: pearson: 0.8652, eval_loss: 0.5793, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7111, expected_sequence_sparsity: 0.8612, target_sparsity: 0.7, step: 13200
lambda_1: -0.4047, lambda_2: 100.3540 lambda_3: 0.0000
train remain: [0.65 0.27 0.21 0.3 0.1 0.21 0.49 0.46 0.27]
infer remain: [0.63, 0.2, 0.2, 0.3, 0.1, 0.2, 0.5, 0.47, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.03, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0]
101111111111111101011001000000
100000000001001001010001000000
111111000000000000000000000000
101111001101001000000000000000
101000010000000000000000000000
100000001101001000000001000000
100011111101011011010010100000
100011111101001000110010100001
100110111100001000000000000000
Best eval score so far: 0.8662 @ step 12900 epoch 71.67
loss: 0.071439, lagrangian_loss: -0.000109, attention_score_distillation_loss: 0.000385
loss: 0.040129, lagrangian_loss: -0.000078, attention_score_distillation_loss: 0.000381
----------------------------------------------------------------------
time: 2023-07-19 15:59:23
Evaluating: pearson: 0.8633, eval_loss: 0.5785, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7111, expected_sequence_sparsity: 0.8612, target_sparsity: 0.7, step: 13300
lambda_1: -0.1982, lambda_2: 100.5196 lambda_3: 0.0000
train remain: [0.66 0.28 0.21 0.3 0.1 0.21 0.49 0.46 0.27]
infer remain: [0.63, 0.2, 0.2, 0.3, 0.1, 0.2, 0.5, 0.47, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.03, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0]
101111111101111101011101000000
100000000001001001010001000000
111111000000000000000000000000
101111001101001000000000000000
101000010000000000000000000000
100000001101001000000001000000
100011111101011011010010100000
100011111101001000110010100001
100110111100001000000000000000
Best eval score so far: 0.8662 @ step 12900 epoch 71.67
loss: 0.046749, lagrangian_loss: 0.000234, attention_score_distillation_loss: 0.000382
ETA: 1:21:03 | Epoch 73 finished. Took 64.79 seconds.
loss: 0.065887, lagrangian_loss: 0.001726, attention_score_distillation_loss: 0.000376
----------------------------------------------------------------------
time: 2023-07-19 15:59:58
Evaluating: pearson: 0.8602, eval_loss: 0.6158, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7111, expected_sequence_sparsity: 0.8612, target_sparsity: 0.7, step: 13400
lambda_1: -0.3351, lambda_2: 100.6745 lambda_3: 0.0000
train remain: [0.66 0.28 0.21 0.3 0.1 0.21 0.49 0.46 0.27]
infer remain: [0.63, 0.2, 0.2, 0.3, 0.1, 0.2, 0.5, 0.47, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.03, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0]
101111111101111111011001000000
100000000001001001010001000000
111111000000000000000000000000
101111001101001000000000000000
101000000000001000000000000000
100000001101001000000001000000
100011111101011011010010100000
100011111101001000110010100001
100110111100001000000000000000
Best eval score so far: 0.8662 @ step 12900 epoch 71.67
loss: 0.075421, lagrangian_loss: 0.000495, attention_score_distillation_loss: 0.000384
loss: 0.048791, lagrangian_loss: 0.000603, attention_score_distillation_loss: 0.000378
----------------------------------------------------------------------
time: 2023-07-19 16:00:33
Evaluating: pearson: 0.7773, eval_loss: 0.97, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7111, expected_sequence_sparsity: 0.8612, target_sparsity: 0.7, step: 13500
lambda_1: -0.3978, lambda_2: 100.8506 lambda_3: 0.0000
train remain: [0.66 0.27 0.2 0.3 0.1 0.21 0.49 0.46 0.27]
infer remain: [0.63, 0.2, 0.2, 0.3, 0.1, 0.2, 0.5, 0.47, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.03, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0]
101111111101111101011101000000
000000000101001001010001000000
111111000000000000000000000000
101111001101001000000000000000
101000000000001000000000000000
100000001101001000000001000000
100011111101011011010010100000
100011111101001000110010100001
100110111100001000000000000000
Best eval score so far: 0.8662 @ step 12900 epoch 71.67
ETA: 1:20:00 | Epoch 74 finished. Took 64.63 seconds.
loss: 0.050293, lagrangian_loss: -0.000281, attention_score_distillation_loss: 0.000388
loss: 0.041803, lagrangian_loss: 0.000169, attention_score_distillation_loss: 0.000381
----------------------------------------------------------------------
time: 2023-07-19 16:01:08
Evaluating: pearson: 0.7952, eval_loss: 0.8934, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7111, expected_sequence_sparsity: 0.8612, target_sparsity: 0.7, step: 13600
lambda_1: -0.2937, lambda_2: 101.0221 lambda_3: 0.0000
train remain: [0.66 0.28 0.2 0.3 0.09 0.21 0.49 0.46 0.27]
infer remain: [0.63, 0.2, 0.2, 0.3, 0.1, 0.2, 0.5, 0.47, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.03, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0]
101111111101111101111001000000
001000000001001001010001000000
111111000000000000000000000000
101111001101001000000000000000
101000000000001000000000000000
100000001101001000000001000000
100011111101011011010010100000
100011111101001000110010100001
100110111100001000000000000000
Best eval score so far: 0.8662 @ step 12900 epoch 71.67
loss: 0.051227, lagrangian_loss: -0.000193, attention_score_distillation_loss: 0.000379
loss: 0.066852, lagrangian_loss: 0.001294, attention_score_distillation_loss: 0.000367
ETA: 1:18:49 | Epoch 75 finished. Took 56.57 seconds.
----------------------------------------------------------------------
time: 2023-07-19 16:01:43
Evaluating: pearson: 0.7826, eval_loss: 0.9529, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7111, expected_sequence_sparsity: 0.8612, target_sparsity: 0.7, step: 13700
lambda_1: -0.2063, lambda_2: 101.2517 lambda_3: 0.0000
train remain: [0.65 0.28 0.2 0.3 0.1 0.21 0.49 0.46 0.27]
infer remain: [0.63, 0.2, 0.2, 0.3, 0.1, 0.2, 0.5, 0.47, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.03, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0]
101111111101111101111001000000
000000000001001001010001100000
111111000000000000000000000000
101111001101001000000000000000
101000000000001000000000000000
100000001101001000000001000000
100011111101011011010010100000
100011111101001000110010100001
100110111100001000000000000000
Best eval score so far: 0.8662 @ step 12900 epoch 71.67
loss: 0.049452, lagrangian_loss: -0.000054, attention_score_distillation_loss: 0.000385
loss: 0.065097, lagrangian_loss: -0.000037, attention_score_distillation_loss: 0.000383
----------------------------------------------------------------------
time: 2023-07-19 16:02:17
Evaluating: pearson: 0.7826, eval_loss: 0.9378, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7111, expected_sequence_sparsity: 0.8612, target_sparsity: 0.7, step: 13800
lambda_1: -0.2581, lambda_2: 101.4746 lambda_3: 0.0000
train remain: [0.66 0.29 0.2 0.3 0.1 0.21 0.49 0.46 0.27]
infer remain: [0.63, 0.2, 0.2, 0.3, 0.1, 0.2, 0.5, 0.47, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.03, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0]
111111111101111101011001000000
000000001001001001010001000000
111111000000000000000000000000
101111001101001000000000000000
101000000000001000000000000000
100000001101001000000001000000
100011111101011011010010100000
100011111101001000110010100001
100110111100001000000000000000
Best eval score so far: 0.8662 @ step 12900 epoch 71.67
loss: 0.042845, lagrangian_loss: 0.000086, attention_score_distillation_loss: 0.000382
loss: 0.062977, lagrangian_loss: -0.000475, attention_score_distillation_loss: 0.000383
ETA: 1:17:45 | Epoch 76 finished. Took 64.59 seconds.
----------------------------------------------------------------------
time: 2023-07-19 16:02:53
Evaluating: pearson: 0.7561, eval_loss: 1.0559, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7111, expected_sequence_sparsity: 0.8612, target_sparsity: 0.7, step: 13900
lambda_1: -0.3661, lambda_2: 101.7510 lambda_3: 0.0000
train remain: [0.65 0.27 0.19 0.3 0.09 0.21 0.49 0.46 0.27]
infer remain: [0.63, 0.2, 0.2, 0.3, 0.1, 0.2, 0.5, 0.47, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.03, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0]
111111111101111101011001000000
000000000001001011010001000000
111111000000000000000000000000
101111001101001000000000000000
101000000000001000000000000000
100000001101001000000001000000
100011111101011011010010100000
100011111101001000110010100001
000110111100001000000000000001
Best eval score so far: 0.8662 @ step 12900 epoch 71.67
loss: 0.031006, lagrangian_loss: -0.000239, attention_score_distillation_loss: 0.000372
loss: 0.041257, lagrangian_loss: 0.000237, attention_score_distillation_loss: 0.000386
----------------------------------------------------------------------
time: 2023-07-19 16:03:28
Evaluating: pearson: 0.7702, eval_loss: 0.9913, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7111, expected_sequence_sparsity: 0.8612, target_sparsity: 0.7, step: 14000
lambda_1: -0.3033, lambda_2: 101.9779 lambda_3: 0.0000
train remain: [0.66 0.28 0.2 0.3 0.09 0.21 0.49 0.46 0.27]
infer remain: [0.63, 0.2, 0.2, 0.3, 0.1, 0.2, 0.5, 0.47, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.03, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0]
111111111101111101011001000000
000000000001001011010001000000
111111000000000000000000000000
100111001101001000010000000000
101000000000001000000000000000
100000001101001000000001000000
100011111101011011010010100000
100011111101001000110010100001
100110111100001000000000000000
Best eval score so far: 0.8662 @ step 12900 epoch 71.67
loss: 0.078714, lagrangian_loss: 0.000253, attention_score_distillation_loss: 0.000384
ETA: 1:16:42 | Epoch 77 finished. Took 64.84 seconds.
loss: 0.045859, lagrangian_loss: 0.000803, attention_score_distillation_loss: 0.000381
----------------------------------------------------------------------
time: 2023-07-19 16:04:03
Evaluating: pearson: 0.7712, eval_loss: 0.9966, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7111, expected_sequence_sparsity: 0.8612, target_sparsity: 0.7, step: 14100
lambda_1: -0.3285, lambda_2: 102.2479 lambda_3: 0.0000
train remain: [0.66 0.27 0.19 0.3 0.09 0.21 0.49 0.46 0.27]
infer remain: [0.63, 0.2, 0.2, 0.3, 0.1, 0.2, 0.5, 0.47, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.03, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0]
101111111101111101011001000010
000000000001001011010001000000
111111000000000000000000000000
100111001101001000000010000000
101000000000001000000000000000
100000001101001000000001000000
100011111101011011010010100000
100011111101001000110010100001
100110111100001000000000000000
Best eval score so far: 0.8662 @ step 12900 epoch 71.67
loss: 0.044995, lagrangian_loss: -0.000199, attention_score_distillation_loss: 0.000377
loss: 0.063254, lagrangian_loss: 0.000342, attention_score_distillation_loss: 0.000388
----------------------------------------------------------------------
time: 2023-07-19 16:04:37
Evaluating: pearson: 0.7838, eval_loss: 0.9678, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7111, expected_sequence_sparsity: 0.8612, target_sparsity: 0.7, step: 14200
lambda_1: -0.2491, lambda_2: 102.5826 lambda_3: 0.0000
train remain: [0.66 0.28 0.19 0.3 0.1 0.21 0.49 0.46 0.27]
infer remain: [0.63, 0.2, 0.2, 0.3, 0.1, 0.2, 0.5, 0.47, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.03, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0]
101111111111111101011001000000
000001000001001001010001000000
111111000000000000000000000000
100111001101001000010000000000
101000000000001000000000000000
100000001101001000000001000000
100011111101011011010010100000
100011111101001000110010100001
100110111100001000000000000000
Best eval score so far: 0.8662 @ step 12900 epoch 71.67
loss: 0.073411, lagrangian_loss: 0.000403, attention_score_distillation_loss: 0.000379
ETA: 1:15:39 | Epoch 78 finished. Took 64.26 seconds.
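The ETA printed at each epoch boundary is consistent with a simple projection: remaining epochs times a (smoothed) seconds-per-epoch estimate. A hypothetical version, using the 150-epoch budget from the training arguments:

import datetime

def eta(epochs_done: int, num_train_epochs: int,
        seconds_per_epoch: float) -> str:
    # Project the remaining wall-clock time from the average epoch duration.
    remaining = (num_train_epochs - epochs_done) * seconds_per_epoch
    return str(datetime.timedelta(seconds=int(remaining)))

print(eta(79, 150, 64.26))  # roughly the "ETA: 1:15:39" scale logged above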
loss: 0.047638, lagrangian_loss: -0.000232, attention_score_distillation_loss: 0.000388 ---------------------------------------------------------------------- time: 2023-07-19 16:05:13 Evaluating: pearson: 0.7677, eval_loss: 1.0043, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7111, expected_sequence_sparsity: 0.8612, target_sparsity: 0.7, step: 14300 lambda_1: -0.1188, lambda_2: 102.8764 lambda_3: 0.0000 train remain: [0.66 0.28 0.19 0.3 0.1 0.21 0.49 0.46 0.27] infer remain: [0.63, 0.2, 0.2, 0.3, 0.1, 0.2, 0.5, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.03, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111111111101011001000000 000000000001001101010001000000 111111000000000000000000000000 100111001101001000000000000001 101000000000001000000000000000 100000001101001000000001000000 100011111101011011010010100000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.051270, lagrangian_loss: -0.000034, attention_score_distillation_loss: 0.000383 loss: 0.062033, lagrangian_loss: 0.003197, attention_score_distillation_loss: 0.000370 ---------------------------------------------------------------------- time: 2023-07-19 16:05:48 Evaluating: pearson: 0.808, eval_loss: 0.8371, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7116, expected_sequence_sparsity: 0.8614, target_sparsity: 0.7, step: 14400 lambda_1: -0.4586, lambda_2: 103.2618 lambda_3: 0.0000 train remain: [0.65 0.27 0.19 0.29 0.09 0.21 0.49 0.46 0.27] infer remain: [0.63, 0.2, 0.17, 0.3, 0.1, 0.2, 0.5, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0] 111111111101111101011001000000 001000000001001001010001000000 101111000000000000000000000000 100111001101001000010000000000 101000000000001000000000000000 100000001101001000000001000000 100011111101011011010010100000 100011111101001000110010100001 000110111100001000000000000001 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 ETA: 1:14:36 | Epoch 79 finished. Took 64.81 seconds. loss: 0.038721, lagrangian_loss: -0.000220, attention_score_distillation_loss: 0.000382 loss: 0.040901, lagrangian_loss: -0.000039, attention_score_distillation_loss: 0.000378 ---------------------------------------------------------------------- time: 2023-07-19 16:06:23 Evaluating: pearson: 0.7971, eval_loss: 0.868, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7111, expected_sequence_sparsity: 0.8612, target_sparsity: 0.7, step: 14500 lambda_1: -0.1493, lambda_2: 103.6309 lambda_3: 0.0000 train remain: [0.66 0.3 0.19 0.29 0.1 0.21 0.49 0.46 0.27] infer remain: [0.63, 0.2, 0.2, 0.3, 0.1, 0.2, 0.5, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.03, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0] 111111111101111101011001000000 001000000001001001010001000000 111111000000000000000000000000 100111001101001000010000000000 101000000000001000000000000000 100000001101001000000001000000 100011111101011011010010100000 100011111101001000110010100001 000110111100001000000000000001 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.053083, lagrangian_loss: 0.000527, attention_score_distillation_loss: 0.000378 loss: 0.026494, lagrangian_loss: 0.000553, attention_score_distillation_loss: 0.000377 ETA: 1:13:25 | Epoch 80 finished. Took 56.6 seconds. 
---------------------------------------------------------------------- time: 2023-07-19 16:06:58 Evaluating: pearson: 0.7769, eval_loss: 0.9858, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7116, expected_sequence_sparsity: 0.8614, target_sparsity: 0.7, step: 14600 lambda_1: -0.2162, lambda_2: 104.0256 lambda_3: 0.0000 train remain: [0.66 0.28 0.19 0.29 0.1 0.21 0.49 0.46 0.27] infer remain: [0.63, 0.2, 0.17, 0.3, 0.1, 0.2, 0.5, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111111011001000000 000000000101001001010001000000 101111000000000000000000000000 100111001101001000010000000000 101000000000001000000000000000 100000001101001000000001000000 100011111101011011010010100000 100011111101001000110010100001 000110111100001000000000000001 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.061099, lagrangian_loss: 0.001176, attention_score_distillation_loss: 0.000374 loss: 0.039841, lagrangian_loss: 0.000318, attention_score_distillation_loss: 0.000383 ---------------------------------------------------------------------- time: 2023-07-19 16:07:32 Evaluating: pearson: 0.7843, eval_loss: 0.9417, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7116, expected_sequence_sparsity: 0.8614, target_sparsity: 0.7, step: 14700 lambda_1: -0.1247, lambda_2: 104.4652 lambda_3: 0.0000 train remain: [0.67 0.27 0.18 0.29 0.1 0.21 0.49 0.46 0.27] infer remain: [0.63, 0.2, 0.17, 0.3, 0.1, 0.2, 0.5, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0] 111111111101111101011001000000 000000000101001001010001000000 101111000000000000000000000000 100111001101001000010000000000 101000000000001000000000000000 100000001101001000000001000000 100011111101011011010010100000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.027057, lagrangian_loss: 0.000052, attention_score_distillation_loss: 0.000380 loss: 0.067292, lagrangian_loss: 0.002007, attention_score_distillation_loss: 0.000371 ETA: 1:12:22 | Epoch 81 finished. Took 64.62 seconds. 
---------------------------------------------------------------------- time: 2023-07-19 16:08:08 Evaluating: pearson: 0.7763, eval_loss: 0.9568, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7116, expected_sequence_sparsity: 0.8614, target_sparsity: 0.7, step: 14800 lambda_1: -0.4095, lambda_2: 104.7730 lambda_3: 0.0000 train remain: [0.67 0.27 0.18 0.28 0.1 0.21 0.48 0.46 0.27] infer remain: [0.63, 0.2, 0.17, 0.27, 0.1, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111101011101000000 000000000001001001010001000100 101111000000000000000000000000 100111001101001000000000000000 101000000000001000000000000000 100000001101001000000001000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.028118, lagrangian_loss: -0.000185, attention_score_distillation_loss: 0.000386 loss: 0.050125, lagrangian_loss: 0.000103, attention_score_distillation_loss: 0.000385 ---------------------------------------------------------------------- time: 2023-07-19 16:08:43 Evaluating: pearson: 0.7709, eval_loss: 0.9715, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7116, expected_sequence_sparsity: 0.8614, target_sparsity: 0.7, step: 14900 lambda_1: -0.2043, lambda_2: 105.0446 lambda_3: 0.0000 train remain: [0.66 0.27 0.18 0.28 0.1 0.21 0.48 0.46 0.27] infer remain: [0.63, 0.2, 0.17, 0.27, 0.1, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111101011101000000 000000000001001001010001000100 101111000000000000000000000000 100111001101001000000000000000 101000000000001000000000000000 100000001101001000000001000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.040388, lagrangian_loss: -0.000028, attention_score_distillation_loss: 0.000381 ETA: 1:11:19 | Epoch 82 finished. Took 64.54 seconds. 
loss: 0.043093, lagrangian_loss: -0.000080, attention_score_distillation_loss: 0.000384 ---------------------------------------------------------------------- time: 2023-07-19 16:09:18 Evaluating: pearson: 0.7601, eval_loss: 1.035, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7116, expected_sequence_sparsity: 0.8614, target_sparsity: 0.7, step: 15000 lambda_1: -0.3245, lambda_2: 105.3407 lambda_3: 0.0000 train remain: [0.66 0.28 0.18 0.28 0.1 0.21 0.48 0.46 0.27] infer remain: [0.63, 0.2, 0.17, 0.27, 0.1, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111101011101000000 000000000001001001010001000100 101111000000000000000000000000 100111001101001000000000000000 101000000000001000000000000000 100000001101001000000001000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.040221, lagrangian_loss: -0.000227, attention_score_distillation_loss: 0.000380 loss: 0.041196, lagrangian_loss: -0.000149, attention_score_distillation_loss: 0.000384 ---------------------------------------------------------------------- time: 2023-07-19 16:09:53 Evaluating: pearson: 0.7604, eval_loss: 1.0307, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7116, expected_sequence_sparsity: 0.8614, target_sparsity: 0.7, step: 15100 lambda_1: -0.2977, lambda_2: 105.7161 lambda_3: 0.0000 train remain: [0.66 0.27 0.18 0.27 0.1 0.21 0.48 0.46 0.27] infer remain: [0.63, 0.2, 0.17, 0.27, 0.1, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0] 111111111101111101011001000000 000000000001001001110001000000 101111000000000000000000000000 100111001101001000000000000000 101000000000001000000000000000 100000001101001000000001000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.028135, lagrangian_loss: 0.000260, attention_score_distillation_loss: 0.000380 ETA: 1:10:16 | Epoch 83 finished. Took 64.97 seconds. 
loss: 0.038570, lagrangian_loss: -0.000054, attention_score_distillation_loss: 0.000390 ---------------------------------------------------------------------- time: 2023-07-19 16:10:28 Evaluating: pearson: 0.7629, eval_loss: 1.0186, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7116, expected_sequence_sparsity: 0.8614, target_sparsity: 0.7, step: 15200 lambda_1: -0.1537, lambda_2: 106.1380 lambda_3: 0.0000 train remain: [0.67 0.28 0.18 0.27 0.09 0.21 0.48 0.46 0.27] infer remain: [0.63, 0.2, 0.17, 0.27, 0.1, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0] 111111111101111101011001000000 000000000001001001011001000000 101111000000000000000000000000 100111001101001000000000000000 101000000000001000000000000000 100000001101001000000001000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.050219, lagrangian_loss: 0.000617, attention_score_distillation_loss: 0.000383 loss: 0.055448, lagrangian_loss: 0.000647, attention_score_distillation_loss: 0.000375 ---------------------------------------------------------------------- time: 2023-07-19 16:11:03 Evaluating: pearson: 0.7785, eval_loss: 0.9886, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7116, expected_sequence_sparsity: 0.8614, target_sparsity: 0.7, step: 15300 lambda_1: -0.1899, lambda_2: 106.6810 lambda_3: 0.0000 train remain: [0.66 0.29 0.18 0.27 0.09 0.21 0.48 0.46 0.27] infer remain: [0.63, 0.2, 0.17, 0.27, 0.1, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0] 111111111101111101011001000000 000000001001001001010001000000 101111000000000000000000000000 100111001101001000000000000000 101000000000001000000000000000 100000001101001000000001000000 100011111101011011010010000000 100011111101001000110010100001 000110111100001000000000000001 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 ETA: 1:09:13 | Epoch 84 finished. Took 64.89 seconds. loss: 0.031221, lagrangian_loss: -0.000081, attention_score_distillation_loss: 0.000390 loss: 0.042285, lagrangian_loss: 0.000051, attention_score_distillation_loss: 0.000385 ---------------------------------------------------------------------- time: 2023-07-19 16:11:38 Evaluating: pearson: 0.7683, eval_loss: 1.0159, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7116, expected_sequence_sparsity: 0.8614, target_sparsity: 0.7, step: 15400 lambda_1: -0.3384, lambda_2: 107.1661 lambda_3: 0.0000 train remain: [0.66 0.29 0.18 0.27 0.09 0.21 0.48 0.46 0.27] infer remain: [0.63, 0.2, 0.17, 0.27, 0.1, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0] 111111111101111101011001000000 000000000001001011010001000000 101111000000000000000000000000 100111001101001000000000000000 101000000000001000000000000000 100000001101001000000001000000 100011111101011011010010000000 100011111101001000110010100001 000110111100001000000000000001 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.033055, lagrangian_loss: -0.000264, attention_score_distillation_loss: 0.000382 loss: 0.039543, lagrangian_loss: 0.000038, attention_score_distillation_loss: 0.000377 ETA: 1:08:04 | Epoch 85 finished. Took 56.9 seconds. 
---------------------------------------------------------------------- time: 2023-07-19 16:12:14 Evaluating: pearson: 0.7751, eval_loss: 0.9958, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7116, expected_sequence_sparsity: 0.8614, target_sparsity: 0.7, step: 15500 lambda_1: -0.2988, lambda_2: 107.5536 lambda_3: 0.0000 train remain: [0.66 0.28 0.18 0.27 0.09 0.21 0.48 0.46 0.27] infer remain: [0.63, 0.2, 0.17, 0.27, 0.1, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0] 111111111101111101011001000000 000000000101001001010001000000 101111000000000000000000000000 100111001101001000000000000000 101000000000001000000000000000 100000001101001000000001000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.054903, lagrangian_loss: -0.000206, attention_score_distillation_loss: 0.000385 loss: 0.066603, lagrangian_loss: 0.000103, attention_score_distillation_loss: 0.000384 ---------------------------------------------------------------------- time: 2023-07-19 16:12:49 Evaluating: pearson: 0.774, eval_loss: 0.9929, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7116, expected_sequence_sparsity: 0.8614, target_sparsity: 0.7, step: 15600 lambda_1: -0.2889, lambda_2: 108.0514 lambda_3: 0.0000 train remain: [0.66 0.28 0.18 0.27 0.09 0.21 0.48 0.46 0.27] infer remain: [0.63, 0.2, 0.17, 0.27, 0.1, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111101011001000010 000000000011001001010001000000 101111000000000000000000000000 100111001101001000000000000000 101000000000001000000000000000 100000001101001000000001000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.042211, lagrangian_loss: -0.000119, attention_score_distillation_loss: 0.000385 loss: 0.023560, lagrangian_loss: -0.000057, attention_score_distillation_loss: 0.000387 ETA: 1:07:00 | Epoch 86 finished. Took 64.7 seconds. 
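layerwise remain is the running product of infer remain: the three leading 1.0 entries say the first three encoder layers are never pruned, and from there each entry multiplies in that layer's kept fraction, which is why the tail collapses toward 0.0 even though every individual layer keeps 10-63% of its tokens. A sketch that reproduces the printed vector, assuming a 12-layer encoder with the nine pruned layers starting at index 3 (as the three leading 1.0s imply):

from itertools import accumulate
from operator import mul

infer_remain = [0.63, 0.2, 0.17, 0.27, 0.1, 0.2, 0.47, 0.47, 0.27]
dense_layers = 3  # layers before the first prune location stay at 1.0

layerwise = [1.0] * dense_layers + list(accumulate(infer_remain, mul))
print([round(x, 2) for x in layerwise])
# -> [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0]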
---------------------------------------------------------------------- time: 2023-07-19 16:13:24 Evaluating: pearson: 0.7775, eval_loss: 1.0039, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7116, expected_sequence_sparsity: 0.8614, target_sparsity: 0.7, step: 15700 lambda_1: -0.0613, lambda_2: 108.6293 lambda_3: 0.0000 train remain: [0.67 0.3 0.18 0.27 0.09 0.21 0.48 0.46 0.27] infer remain: [0.63, 0.2, 0.17, 0.27, 0.1, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111101011001100000 000000000001001001011001000000 101111000000000000000000000000 100011001101001000000001000000 101000000000001000000000000000 100000001101001000000001000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.034088, lagrangian_loss: 0.000975, attention_score_distillation_loss: 0.000377 loss: 0.044617, lagrangian_loss: -0.000202, attention_score_distillation_loss: 0.000382 ---------------------------------------------------------------------- time: 2023-07-19 16:13:58 Evaluating: pearson: 0.7572, eval_loss: 1.0526, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7116, expected_sequence_sparsity: 0.8614, target_sparsity: 0.7, step: 15800 lambda_1: -0.1538, lambda_2: 109.3043 lambda_3: 0.0000 train remain: [0.66 0.28 0.18 0.27 0.09 0.21 0.48 0.46 0.27] infer remain: [0.63, 0.2, 0.17, 0.27, 0.1, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111101011101000000 000000000001001011010001000000 101111000000000000000000000000 100011001101001001000000000000 101000000000001000000000000000 100000001101001000000001000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.038073, lagrangian_loss: 0.000286, attention_score_distillation_loss: 0.000387 ETA: 1:05:57 | Epoch 87 finished. Took 64.45 seconds. 
loss: 0.056705, lagrangian_loss: 0.000012, attention_score_distillation_loss: 0.000383 ---------------------------------------------------------------------- time: 2023-07-19 16:14:34 Evaluating: pearson: 0.7947, eval_loss: 0.873, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7116, expected_sequence_sparsity: 0.8614, target_sparsity: 0.7, step: 15900 lambda_1: -0.2719, lambda_2: 109.9419 lambda_3: 0.0000 train remain: [0.66 0.28 0.18 0.27 0.09 0.21 0.48 0.46 0.27] infer remain: [0.63, 0.2, 0.17, 0.27, 0.1, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111101011001000010 010000000001001001010001000000 101111000000000000000000000000 100011001101001000000000000001 101000000000001000000000000000 100000001101001000000001000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.048416, lagrangian_loss: -0.000164, attention_score_distillation_loss: 0.000380 loss: 0.027297, lagrangian_loss: -0.000163, attention_score_distillation_loss: 0.000388 ---------------------------------------------------------------------- time: 2023-07-19 16:15:09 Evaluating: pearson: 0.7701, eval_loss: 1.0343, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7116, expected_sequence_sparsity: 0.8614, target_sparsity: 0.7, step: 16000 lambda_1: -0.2272, lambda_2: 110.5427 lambda_3: 0.0000 train remain: [0.66 0.29 0.18 0.27 0.09 0.21 0.48 0.46 0.27] infer remain: [0.63, 0.2, 0.17, 0.27, 0.1, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111111111101011001000000 000000000001101001010001000000 101111000000000000000000000000 100011001101001000000000000001 101000000000001000000000000000 100000001101001000000001000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.033269, lagrangian_loss: -0.000056, attention_score_distillation_loss: 0.000385 ETA: 1:04:54 | Epoch 88 finished. Took 64.87 seconds. 
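Note that train remain (e.g. 0.66 for the first pruned layer) never quite equals infer remain (0.63 for the same layer): during training the gates are stochastic relaxations, while the eval-time masks threshold them into hard 0/1 bits. A minimal sketch of the stretched hard-concrete L0 gate commonly used for this (Louizos et al., 2018); the parameter names and the 2/3 temperature are illustrative defaults, not this repo's API:

import math, random

def hard_concrete_gate(log_alpha: float, temperature: float = 2 / 3,
                       limit_l: float = -0.1, limit_r: float = 1.1,
                       training: bool = True) -> float:
    """Stretched hard-concrete gate clamped to [0, 1]."""
    if training:
        u = random.uniform(1e-6, 1 - 1e-6)  # noise makes the gate stochastic
        s = 1 / (1 + math.exp(-(math.log(u / (1 - u)) + log_alpha) / temperature))
    else:
        s = 1 / (1 + math.exp(-log_alpha))  # deterministic at eval time
    s = s * (limit_r - limit_l) + limit_l   # stretch beyond [0, 1] ...
    return min(1.0, max(0.0, s))            # ... then clamp, so exact 0/1 occur

# "train remain" tracks the expected value of the stochastic gates;
# "infer remain" is the fraction of deterministic gates that clamp to 1.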
loss: 0.023883, lagrangian_loss: 0.001760, attention_score_distillation_loss: 0.000376 ---------------------------------------------------------------------- time: 2023-07-19 16:15:44 Evaluating: pearson: 0.7734, eval_loss: 1.0289, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7116, expected_sequence_sparsity: 0.8614, target_sparsity: 0.7, step: 16100 lambda_1: -0.2795, lambda_2: 111.0826 lambda_3: 0.0000 train remain: [0.66 0.29 0.18 0.27 0.09 0.21 0.48 0.46 0.27] infer remain: [0.63, 0.2, 0.17, 0.27, 0.1, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0] 111111111101111101011001000000 000000000001101001010001000000 101111000000000000000000000000 100011001101001000000000000001 101000000000001000000000000000 100000001101001000000001000000 100011111101011011010010000000 100011111101001000110010100001 000110111100001000000000000001 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.043581, lagrangian_loss: 0.000957, attention_score_distillation_loss: 0.000376 loss: 0.020190, lagrangian_loss: 0.001916, attention_score_distillation_loss: 0.000368 ---------------------------------------------------------------------- time: 2023-07-19 16:16:19 Evaluating: pearson: 0.7774, eval_loss: 1.011, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7116, expected_sequence_sparsity: 0.8614, target_sparsity: 0.7, step: 16200 lambda_1: -0.3266, lambda_2: 111.6429 lambda_3: 0.0000 train remain: [0.66 0.29 0.18 0.27 0.09 0.21 0.48 0.46 0.27] infer remain: [0.63, 0.2, 0.17, 0.27, 0.1, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0] 111111111101111101011001000000 000000000101001001010001000000 101111000000000000000000000000 100011001101001000000000000001 101000000000001000000000000000 100000001101001000000001000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 ETA: 1:03:51 | Epoch 89 finished. Took 64.6 seconds. loss: 0.024648, lagrangian_loss: 0.000329, attention_score_distillation_loss: 0.000382 loss: 0.022842, lagrangian_loss: -0.000122, attention_score_distillation_loss: 0.000378 ---------------------------------------------------------------------- time: 2023-07-19 16:16:54 Evaluating: pearson: 0.7989, eval_loss: 0.8395, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7116, expected_sequence_sparsity: 0.8614, target_sparsity: 0.7, step: 16300 lambda_1: -0.2067, lambda_2: 112.2242 lambda_3: 0.0000 train remain: [0.66 0.29 0.17 0.27 0.09 0.21 0.48 0.46 0.27] infer remain: [0.63, 0.2, 0.17, 0.27, 0.1, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0] 111111111101111101011001000000 010000000001001001010001000000 101111000000000000000000000000 100011001101001000010000000000 101000000000001000000000000000 100000001101001000000001000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.033363, lagrangian_loss: 0.000143, attention_score_distillation_loss: 0.000387 loss: 0.050695, lagrangian_loss: -0.000042, attention_score_distillation_loss: 0.000380 ETA: 1:02:42 | Epoch 90 finished. Took 56.53 seconds. 
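The ETA bookkeeping is plain arithmetic: remaining epochs times a recent seconds-per-epoch estimate. The countdown of 1:10:16 back at epoch 83, at roughly 64 s/epoch, implies about 66 epochs left, i.e. a ~150-epoch schedule. A sketch of the computation (the trainer evidently smooths over recent epochs, so exact values differ slightly):

import datetime

def eta(total_epochs: int, epochs_finished: int, sec_per_epoch: float) -> str:
    remaining = (total_epochs - epochs_finished) * sec_per_epoch
    return str(datetime.timedelta(seconds=round(remaining)))

# Epoch 93 (zero-indexed) just finished in 64.7 s, ~56 epochs to go:
print(eta(150, 94, 64.7))  # -> 1:00:23, close to the printed 0:59:32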
---------------------------------------------------------------------- time: 2023-07-19 16:17:29 Evaluating: pearson: 0.7772, eval_loss: 1.0165, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7116, expected_sequence_sparsity: 0.8614, target_sparsity: 0.7, step: 16400 lambda_1: -0.0293, lambda_2: 112.9877 lambda_3: 0.0000 train remain: [0.66 0.28 0.18 0.27 0.09 0.21 0.48 0.46 0.27] infer remain: [0.63, 0.2, 0.17, 0.27, 0.1, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0] 111111111101111101011001000000 000010000001001001010001000000 101111000000000000000000000000 100011001101001000010000000000 101000000000001000000000000000 100000001101001000000001000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.042821, lagrangian_loss: 0.000101, attention_score_distillation_loss: 0.000377 loss: 0.022374, lagrangian_loss: 0.000263, attention_score_distillation_loss: 0.000378 ---------------------------------------------------------------------- time: 2023-07-19 16:18:04 Evaluating: pearson: 0.7697, eval_loss: 1.0476, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7116, expected_sequence_sparsity: 0.8614, target_sparsity: 0.7, step: 16500 lambda_1: -0.3794, lambda_2: 113.5938 lambda_3: 0.0000 train remain: [0.65 0.28 0.17 0.27 0.09 0.21 0.48 0.46 0.27] infer remain: [0.63, 0.2, 0.17, 0.27, 0.1, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111111111101011001000000 001000000001001001010001000000 101111000000000000000000000000 100011001101001000010000000000 101000000000001000000000000000 100000001101001000000001000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.040125, lagrangian_loss: 0.000178, attention_score_distillation_loss: 0.000371 loss: 0.022919, lagrangian_loss: -0.000021, attention_score_distillation_loss: 0.000379 ETA: 1:01:39 | Epoch 91 finished. Took 64.62 seconds. 
---------------------------------------------------------------------- time: 2023-07-19 16:18:39 Evaluating: pearson: 0.7758, eval_loss: 1.0089, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7116, expected_sequence_sparsity: 0.8614, target_sparsity: 0.7, step: 16600 lambda_1: -0.2481, lambda_2: 114.2574 lambda_3: 0.0000 train remain: [0.66 0.28 0.17 0.27 0.09 0.21 0.48 0.46 0.27] infer remain: [0.63, 0.2, 0.17, 0.27, 0.1, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111101011011000000 001000000001001001010001000000 101111000000000000000000000000 100011001101001000000000000001 101000000000001000000000000000 100000001101001000000001000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.020817, lagrangian_loss: -0.000122, attention_score_distillation_loss: 0.000387 loss: 0.045296, lagrangian_loss: 0.000139, attention_score_distillation_loss: 0.000385 ---------------------------------------------------------------------- time: 2023-07-19 16:19:14 Evaluating: pearson: 0.7569, eval_loss: 1.0972, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7116, expected_sequence_sparsity: 0.8614, target_sparsity: 0.7, step: 16700 lambda_1: -0.2610, lambda_2: 114.9493 lambda_3: 0.0000 train remain: [0.66 0.28 0.17 0.27 0.09 0.21 0.48 0.46 0.27] infer remain: [0.63, 0.2, 0.17, 0.27, 0.1, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111101011101000000 000000000001001001010001100000 101111000000000000000000000000 100011001101001000000000010000 101000000000001000000000000000 100000001101001000000001000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.044508, lagrangian_loss: 0.001247, attention_score_distillation_loss: 0.000382 ETA: 1:00:35 | Epoch 92 finished. Took 64.61 seconds. 
loss: 0.019837, lagrangian_loss: 0.001162, attention_score_distillation_loss: 0.000368 ---------------------------------------------------------------------- time: 2023-07-19 16:19:49 Evaluating: pearson: 0.7647, eval_loss: 1.0758, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7116, expected_sequence_sparsity: 0.8614, target_sparsity: 0.7, step: 16800 lambda_1: -0.4221, lambda_2: 115.6649 lambda_3: 0.0000 train remain: [0.66 0.27 0.16 0.27 0.09 0.21 0.48 0.46 0.27] infer remain: [0.63, 0.2, 0.17, 0.27, 0.1, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111101011101000000 000000000001001001010001100000 101111000000000000000000000000 100011001101001000000000000001 101000000000001000000000000000 100000001101001000000001000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.022881, lagrangian_loss: -0.000346, attention_score_distillation_loss: 0.000377 loss: 0.030107, lagrangian_loss: 0.000172, attention_score_distillation_loss: 0.000384 ---------------------------------------------------------------------- time: 2023-07-19 16:20:24 Evaluating: pearson: 0.7553, eval_loss: 1.0948, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7116, expected_sequence_sparsity: 0.8614, target_sparsity: 0.7, step: 16900 lambda_1: -0.2135, lambda_2: 116.7186 lambda_3: 0.0000 train remain: [0.67 0.28 0.16 0.27 0.09 0.21 0.48 0.46 0.27] infer remain: [0.63, 0.2, 0.17, 0.27, 0.1, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111111111101011001000000 000000010001001001010001000000 101111000000000000000000000000 100011001101001000000000000001 101000000000001000000000000000 100000001101001000000001000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.041738, lagrangian_loss: 0.000787, attention_score_distillation_loss: 0.000381 ETA: 0:59:32 | Epoch 93 finished. Took 64.7 seconds. 
loss: 0.041096, lagrangian_loss: -0.000019, attention_score_distillation_loss: 0.000385 ---------------------------------------------------------------------- time: 2023-07-19 16:20:59 Evaluating: pearson: 0.7713, eval_loss: 1.0028, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7116, expected_sequence_sparsity: 0.8614, target_sparsity: 0.7, step: 17000 lambda_1: -0.1418, lambda_2: 117.6479 lambda_3: 0.0000 train remain: [0.66 0.29 0.16 0.27 0.09 0.21 0.48 0.46 0.27] infer remain: [0.63, 0.2, 0.17, 0.27, 0.1, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111101011001000010 000000000001001001010001100000 101111000000000000000000000000 100011001101001000000000000001 101000000000001000000000000000 100000001101001000000001000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.034198, lagrangian_loss: 0.000131, attention_score_distillation_loss: 0.000382 loss: 0.032873, lagrangian_loss: -0.000043, attention_score_distillation_loss: 0.000376 ---------------------------------------------------------------------- time: 2023-07-19 16:21:34 Evaluating: pearson: 0.8603, eval_loss: 0.5876, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7116, expected_sequence_sparsity: 0.8614, target_sparsity: 0.7, step: 17100 lambda_1: -0.2169, lambda_2: 118.3491 lambda_3: 0.0000 train remain: [0.67 0.28 0.16 0.27 0.09 0.21 0.48 0.46 0.27] infer remain: [0.63, 0.2, 0.17, 0.27, 0.1, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0] 111111111101111101011001000000 100000000001001001010001000000 101111000000000000000000000000 100011001101001000000000000001 101000000000001000000000000000 100000001101001000000001000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 ETA: 0:58:29 | Epoch 94 finished. Took 64.54 seconds. loss: 0.035105, lagrangian_loss: -0.000049, attention_score_distillation_loss: 0.000382 loss: 0.039268, lagrangian_loss: -0.000056, attention_score_distillation_loss: 0.000388 ---------------------------------------------------------------------- time: 2023-07-19 16:22:09 Evaluating: pearson: 0.7682, eval_loss: 1.0252, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7116, expected_sequence_sparsity: 0.8614, target_sparsity: 0.7, step: 17200 lambda_1: -0.1236, lambda_2: 119.2859 lambda_3: 0.0000 train remain: [0.66 0.29 0.16 0.27 0.09 0.21 0.48 0.46 0.27] infer remain: [0.63, 0.2, 0.17, 0.27, 0.1, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111101011001000010 000000000001001011010001000000 101111000000000000000000000000 100011001101001000000000100000 101000000000001000000000000000 100000001101001000010000000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.026729, lagrangian_loss: 0.000710, attention_score_distillation_loss: 0.000385 loss: 0.029167, lagrangian_loss: 0.000831, attention_score_distillation_loss: 0.000380 ETA: 0:57:21 | Epoch 95 finished. Took 56.25 seconds. 
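Every eval block re-prints the same "Best eval score so far" line because nothing since step 12900 has beaten 0.8662; even the 0.8603 rebound at step 17100 above falls short. The bookkeeping behind it is just a running maximum; a sketch:

best = {"score": float("-inf"), "step": None}

def update_best(score: float, step: int) -> None:
    if score > best["score"]:
        best.update(score=score, step=step)

# Feed in a few of the pearson values printed above:
for step, pearson in [(12900, 0.8662), (17000, 0.7713), (17100, 0.8603)]:
    update_best(pearson, step)

print(f"Best eval score so far: {best['score']} @ step {best['step']}")
# -> Best eval score so far: 0.8662 @ step 12900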
---------------------------------------------------------------------- time: 2023-07-19 16:22:44 Evaluating: pearson: 0.7626, eval_loss: 1.0873, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.712, expected_sequence_sparsity: 0.8616, target_sparsity: 0.7, step: 17300 lambda_1: -0.1327, lambda_2: 119.9846 lambda_3: 0.0000 train remain: [0.66 0.29 0.15 0.27 0.1 0.21 0.48 0.46 0.27] infer remain: [0.63, 0.2, 0.13, 0.27, 0.1, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 111111111101111101011001000000 000000010001001001010001000000 101110000000000000000000000000 100011001101001000010000000000 101000000000001000000000000000 100000001101001000010000000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.021142, lagrangian_loss: 0.000488, attention_score_distillation_loss: 0.000376 loss: 0.030649, lagrangian_loss: 0.001606, attention_score_distillation_loss: 0.000375 ---------------------------------------------------------------------- time: 2023-07-19 16:23:19 Evaluating: pearson: 0.7484, eval_loss: 1.156, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.712, expected_sequence_sparsity: 0.8616, target_sparsity: 0.7, step: 17400 lambda_1: -0.3186, lambda_2: 120.8985 lambda_3: 0.0000 train remain: [0.66 0.28 0.15 0.27 0.09 0.21 0.48 0.46 0.27] infer remain: [0.63, 0.2, 0.13, 0.27, 0.1, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111111111101011001000000 000000010001001001010001000000 101110000000000000000000000000 100011001101001000000000000001 101000000000001000000000000000 100000001101001000010000000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.030678, lagrangian_loss: 0.000829, attention_score_distillation_loss: 0.000380 loss: 0.036721, lagrangian_loss: -0.000003, attention_score_distillation_loss: 0.000379 ETA: 0:56:18 | Epoch 96 finished. Took 64.54 seconds. 
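The small move in expected_sparsity (0.7116 -> 0.712 at step 17300) traces back to a single bin flipping off in the third pruned layer: its row goes from five set bits to four (4/30 ≈ 0.13), while the coarser macs_sparsity stays pinned at 0.7231. Because each layer only sees tokens that survived the layers before it, one early bit ripples through the whole cumulative product; a sketch of that ripple, reusing the layerwise computation above (the mean kept fraction here is illustrative, not the repo's exact sparsity formula):

from itertools import accumulate
from operator import mul

def layerwise(infer_remain, dense_layers=3):
    return [1.0] * dense_layers + list(accumulate(infer_remain, mul))

before = [0.63, 0.2, 0.17, 0.27, 0.1, 0.2, 0.47, 0.47, 0.27]  # step 17200
after  = [0.63, 0.2, 0.13, 0.27, 0.1, 0.2, 0.47, 0.47, 0.27]  # step 17300

for tag, remain in (("before", before), ("after", after)):
    lw = layerwise(remain)
    print(tag, round(sum(lw) / len(lw), 4))
# -> before 0.3153 / after 0.3148: a hair fewer tokens survive overall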
---------------------------------------------------------------------- time: 2023-07-19 16:23:53 Evaluating: pearson: 0.7698, eval_loss: 1.0538, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.712, expected_sequence_sparsity: 0.8616, target_sparsity: 0.7, step: 17500 lambda_1: -0.2705, lambda_2: 121.8839 lambda_3: 0.0000 train remain: [0.66 0.28 0.15 0.27 0.09 0.21 0.48 0.46 0.27] infer remain: [0.63, 0.2, 0.13, 0.27, 0.1, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111101011011000000 000000010001001001010001000000 101110000000000000000000000000 100011001101001000000100000000 101000000000001000000000000000 100000001101001000010000000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.023943, lagrangian_loss: -0.000145, attention_score_distillation_loss: 0.000389 loss: 0.033800, lagrangian_loss: 0.000595, attention_score_distillation_loss: 0.000375 ---------------------------------------------------------------------- time: 2023-07-19 16:24:28 Evaluating: pearson: 0.7803, eval_loss: 1.0564, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.712, expected_sequence_sparsity: 0.8616, target_sparsity: 0.7, step: 17600 lambda_1: -0.1839, lambda_2: 122.8442 lambda_3: 0.0000 train remain: [0.66 0.29 0.15 0.27 0.1 0.21 0.48 0.46 0.27] infer remain: [0.63, 0.2, 0.13, 0.27, 0.1, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111111011001000000 000000000101001001010001000000 101110000000000000000000000000 100011001101001000000000000001 101000000000001000000000000000 100000001101001000010000000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.017465, lagrangian_loss: 0.000035, attention_score_distillation_loss: 0.000383 ETA: 0:55:14 | Epoch 97 finished. Took 64.34 seconds. 
loss: 0.037121, lagrangian_loss: 0.000003, attention_score_distillation_loss: 0.000388 ---------------------------------------------------------------------- time: 2023-07-19 16:25:03 Evaluating: pearson: 0.7659, eval_loss: 1.1051, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.712, expected_sequence_sparsity: 0.8616, target_sparsity: 0.7, step: 17700 lambda_1: -0.2579, lambda_2: 123.7246 lambda_3: 0.0000 train remain: [0.66 0.29 0.15 0.27 0.09 0.21 0.48 0.46 0.27] infer remain: [0.63, 0.2, 0.13, 0.27, 0.1, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111101011011000000 000000000001001001010001100000 101110000000000000000000000000 100011001101001000000000000001 101000000000001000000000000000 100000001101001000010000000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.048187, lagrangian_loss: 0.001287, attention_score_distillation_loss: 0.000380 loss: 0.032916, lagrangian_loss: -0.000004, attention_score_distillation_loss: 0.000387 ---------------------------------------------------------------------- time: 2023-07-19 16:25:38 Evaluating: pearson: 0.7567, eval_loss: 1.1019, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.712, expected_sequence_sparsity: 0.8616, target_sparsity: 0.7, step: 17800 lambda_1: -0.2351, lambda_2: 124.5340 lambda_3: 0.0000 train remain: [0.66 0.28 0.15 0.27 0.1 0.21 0.48 0.46 0.27] infer remain: [0.63, 0.2, 0.13, 0.27, 0.1, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111101011011000000 000000000001001101010001000000 100111000000000000000000000000 100011001101001000000000100000 101000000000001000000000000000 100000001101001000010000000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.017284, lagrangian_loss: -0.000052, attention_score_distillation_loss: 0.000381 ETA: 0:54:11 | Epoch 98 finished. Took 64.45 seconds. 
loss: 0.033069, lagrangian_loss: 0.000011, attention_score_distillation_loss: 0.000382 ---------------------------------------------------------------------- time: 2023-07-19 16:26:13 Evaluating: pearson: 0.7757, eval_loss: 1.0532, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.712, expected_sequence_sparsity: 0.8616, target_sparsity: 0.7, step: 17900 lambda_1: -0.1976, lambda_2: 125.3866 lambda_3: 0.0000 train remain: [0.66 0.29 0.15 0.26 0.1 0.21 0.47 0.46 0.27] infer remain: [0.63, 0.2, 0.13, 0.27, 0.1, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 111111111101111101011001000000 000000000001011001010001000000 100111000000000000000000000000 100011001101001000000000100000 101000000000001000000000000000 100000001101001000010000000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.030489, lagrangian_loss: -0.000006, attention_score_distillation_loss: 0.000388 loss: 0.021992, lagrangian_loss: -0.000087, attention_score_distillation_loss: 0.000391 ---------------------------------------------------------------------- time: 2023-07-19 16:26:48 Evaluating: pearson: 0.8605, eval_loss: 0.5862, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.712, expected_sequence_sparsity: 0.8616, target_sparsity: 0.7, step: 18000 lambda_1: -0.2871, lambda_2: 126.5477 lambda_3: 0.0000 train remain: [0.65 0.29 0.14 0.26 0.1 0.21 0.47 0.46 0.27] infer remain: [0.63, 0.2, 0.13, 0.27, 0.1, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 111111111101111101011001000000 100000000001001001010001000000 100111000000000000000000000000 100011001101001000000000000001 101000000000001000000000000000 100000001101001000010000000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 ETA: 0:53:07 | Epoch 99 finished. Took 64.46 seconds. loss: 0.015744, lagrangian_loss: 0.001224, attention_score_distillation_loss: 0.000376 loss: 0.022839, lagrangian_loss: 0.002841, attention_score_distillation_loss: 0.000372 ---------------------------------------------------------------------- time: 2023-07-19 16:27:23 Evaluating: pearson: 0.8576, eval_loss: 0.5954, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7121, expected_sequence_sparsity: 0.8617, target_sparsity: 0.7, step: 18100 lambda_1: -0.2914, lambda_2: 127.7914 lambda_3: 0.0000 train remain: [0.66 0.3 0.15 0.25 0.1 0.21 0.47 0.46 0.27] infer remain: [0.63, 0.2, 0.13, 0.23, 0.1, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 111111111101111101011001000000 100000000001001001010001000000 100111000000000000000000000000 100011001101001000000000000000 101000000000001000000000000000 100000001101001000010000000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.025052, lagrangian_loss: 0.000439, attention_score_distillation_loss: 0.000383 loss: 0.035525, lagrangian_loss: 0.000813, attention_score_distillation_loss: 0.000390 ETA: 0:52:00 | Epoch 100 finished. Took 56.69 seconds. 
---------------------------------------------------------------------- time: 2023-07-19 16:27:58 Evaluating: pearson: 0.8607, eval_loss: 0.5921, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7121, expected_sequence_sparsity: 0.8617, target_sparsity: 0.7, step: 18200 lambda_1: -0.1125, lambda_2: 128.8302 lambda_3: 0.0000 train remain: [0.66 0.3 0.15 0.25 0.1 0.21 0.47 0.46 0.27] infer remain: [0.63, 0.2, 0.13, 0.23, 0.1, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111111011001000000 100000000001001001010001000000 100111000000000000000000000000 100011001101001000000000000000 101000000000001000000000000000 100000001101001000010000000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.032449, lagrangian_loss: 0.000232, attention_score_distillation_loss: 0.000383 loss: 0.033555, lagrangian_loss: 0.000172, attention_score_distillation_loss: 0.000381 ---------------------------------------------------------------------- time: 2023-07-19 16:28:33 Evaluating: pearson: 0.8591, eval_loss: 0.5931, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7121, expected_sequence_sparsity: 0.8617, target_sparsity: 0.7, step: 18300 lambda_1: -0.2458, lambda_2: 129.8555 lambda_3: 0.0000 train remain: [0.66 0.29 0.14 0.24 0.1 0.21 0.47 0.46 0.27] infer remain: [0.63, 0.2, 0.13, 0.23, 0.1, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 111111111101111101011001000000 100000000001001001010001000000 100111000000000000000000000000 100011001101001000000000000000 101000000000001000000000000000 100000001101001000010000000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.024299, lagrangian_loss: -0.000115, attention_score_distillation_loss: 0.000381 loss: 0.026072, lagrangian_loss: 0.000160, attention_score_distillation_loss: 0.000378 ETA: 0:50:57 | Epoch 101 finished. Took 64.48 seconds. 
---------------------------------------------------------------------- time: 2023-07-19 16:29:08 Evaluating: pearson: 0.8467, eval_loss: 0.6669, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7121, expected_sequence_sparsity: 0.8617, target_sparsity: 0.7, step: 18400 lambda_1: -0.3015, lambda_2: 130.9600 lambda_3: 0.0000 train remain: [0.66 0.28 0.15 0.24 0.1 0.21 0.47 0.46 0.27] infer remain: [0.63, 0.2, 0.13, 0.23, 0.1, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111101011101000000 100000000001001001010001000000 100111000000000000000000000000 100011001101001000000000000000 101000000000001000000000000000 100000001101001000010000000000 100011111101011011010010000000 100011111101001000110010100001 000110111100001000000000000001 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.024402, lagrangian_loss: 0.000310, attention_score_distillation_loss: 0.000386 loss: 0.039002, lagrangian_loss: 0.000012, attention_score_distillation_loss: 0.000382 ---------------------------------------------------------------------- time: 2023-07-19 16:29:42 Evaluating: pearson: 0.7645, eval_loss: 1.1141, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7121, expected_sequence_sparsity: 0.8617, target_sparsity: 0.7, step: 18500 lambda_1: -0.2008, lambda_2: 132.0929 lambda_3: 0.0000 train remain: [0.67 0.27 0.15 0.24 0.1 0.21 0.47 0.46 0.27] infer remain: [0.63, 0.2, 0.13, 0.23, 0.1, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111101011101000000 000000000001101001010001000000 100111000000000000000000000000 100011001101001000000000000000 101000000000001000000000000000 100000001101001000010000000000 100011111101011011010010000000 100011111101001000110010100001 000110111100001000000000000001 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.048687, lagrangian_loss: 0.000625, attention_score_distillation_loss: 0.000390 ETA: 0:49:53 | Epoch 102 finished. Took 64.29 seconds. 
loss: 0.031546, lagrangian_loss: 0.000019, attention_score_distillation_loss: 0.000389 ---------------------------------------------------------------------- time: 2023-07-19 16:30:18 Evaluating: pearson: 0.7735, eval_loss: 1.0669, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7121, expected_sequence_sparsity: 0.8617, target_sparsity: 0.7, step: 18600 lambda_1: -0.1965, lambda_2: 133.0919 lambda_3: 0.0000 train remain: [0.66 0.28 0.15 0.24 0.09 0.21 0.47 0.46 0.27] infer remain: [0.63, 0.2, 0.13, 0.23, 0.1, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111101011001010000 000000000001001001010001100000 100111000000000000000000000000 100011001101001000000000000000 101000000000001000000000000000 100000001101001000010000000000 100011111101011011010010000000 100011111101001000110010100001 000110111100001000000000000001 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.020659, lagrangian_loss: 0.003899, attention_score_distillation_loss: 0.000369 loss: 0.025185, lagrangian_loss: 0.000046, attention_score_distillation_loss: 0.000384 ---------------------------------------------------------------------- time: 2023-07-19 16:30:53 Evaluating: pearson: 0.8503, eval_loss: 0.6534, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7121, expected_sequence_sparsity: 0.8617, target_sparsity: 0.7, step: 18700 lambda_1: -0.3418, lambda_2: 134.1493 lambda_3: 0.0000 train remain: [0.66 0.28 0.15 0.24 0.1 0.21 0.47 0.46 0.29] infer remain: [0.63, 0.2, 0.13, 0.23, 0.1, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111101011101000000 100000000001001001010001000000 100111000000000000000000000000 100011001101001000000000000000 101000000000001000000000000000 100000001101001000010000000000 100011111101011011010010000000 100011111101001000110010100001 000110111100001000000000000001 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.024355, lagrangian_loss: -0.000128, attention_score_distillation_loss: 0.000386 ETA: 0:48:50 | Epoch 103 finished. Took 64.69 seconds. 
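attention_score_distillation_loss barely moves (≈3.7e-4 to 3.9e-4 everywhere above): the pruned student's attention patterns are being held against a dense teacher's, and by this point in training the match is tight. A generic sketch of such a term as MSE between attention maps restricted to kept key positions; the repo's actual variant (the loss name suggests score-level matching) may differ:

import torch

def attn_distill_loss(student_attn: torch.Tensor,
                      teacher_attn: torch.Tensor,
                      keep_mask: torch.Tensor) -> torch.Tensor:
    """student_attn / teacher_attn: (batch, heads, seq, seq) probabilities;
    keep_mask: (batch, seq) of 0/1 token-keep flags."""
    m = keep_mask[:, None, None, :].float()   # broadcast over heads/queries
    diff = (student_attn - teacher_attn) * m  # zero out pruned key positions
    return diff.pow(2).sum() / m.sum().clamp(min=1)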
loss: 0.020686, lagrangian_loss: -0.000040, attention_score_distillation_loss: 0.000386 ---------------------------------------------------------------------- time: 2023-07-19 16:31:28 Evaluating: pearson: 0.7738, eval_loss: 1.0493, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7121, expected_sequence_sparsity: 0.8617, target_sparsity: 0.7, step: 18800 lambda_1: -0.1814, lambda_2: 135.1980 lambda_3: 0.0000 train remain: [0.66 0.28 0.15 0.24 0.1 0.21 0.47 0.46 0.28] infer remain: [0.63, 0.2, 0.13, 0.23, 0.1, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111101011101000000 000000000101001001010001000000 100111000000000000000000000000 100011001101001000000000000000 101000000000001000000000000000 100000001101001000010000000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.029667, lagrangian_loss: 0.000004, attention_score_distillation_loss: 0.000382 loss: 0.038241, lagrangian_loss: 0.000376, attention_score_distillation_loss: 0.000379 ---------------------------------------------------------------------- time: 2023-07-19 16:32:03 Evaluating: pearson: 0.775, eval_loss: 1.0661, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7121, expected_sequence_sparsity: 0.8617, target_sparsity: 0.7, step: 18900 lambda_1: -0.2206, lambda_2: 136.5613 lambda_3: 0.0000 train remain: [0.66 0.28 0.15 0.23 0.1 0.21 0.47 0.46 0.28] infer remain: [0.63, 0.2, 0.13, 0.23, 0.1, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 111111111101111101011001000000 000000000001001001010001100000 100111000000000000000000000000 100001001101001000000000000001 101000000000001000000000000000 100000001101001000010000000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 ETA: 0:47:47 | Epoch 104 finished. Took 64.76 seconds. loss: 0.023255, lagrangian_loss: 0.000541, attention_score_distillation_loss: 0.000375 loss: 0.026722, lagrangian_loss: 0.000890, attention_score_distillation_loss: 0.000377 ---------------------------------------------------------------------- time: 2023-07-19 16:32:38 Evaluating: pearson: 0.7623, eval_loss: 1.0569, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7121, expected_sequence_sparsity: 0.8617, target_sparsity: 0.7, step: 19000 lambda_1: -0.1829, lambda_2: 137.6093 lambda_3: 0.0000 train remain: [0.67 0.27 0.16 0.23 0.1 0.21 0.47 0.46 0.27] infer remain: [0.63, 0.2, 0.13, 0.23, 0.1, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 111111111101111101011001000000 000000000001001001010001100000 100111000000000000000000000000 100001001101001000000000000001 101000000000001000000000000000 100000001101001000010000000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.031396, lagrangian_loss: 0.000542, attention_score_distillation_loss: 0.000388 loss: 0.019400, lagrangian_loss: 0.000111, attention_score_distillation_loss: 0.000381 ETA: 0:46:40 | Epoch 105 finished. Took 56.61 seconds. 
---------------------------------------------------------------------- time: 2023-07-19 16:33:13 Evaluating: pearson: 0.7808, eval_loss: 1.0271, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7121, expected_sequence_sparsity: 0.8617, target_sparsity: 0.7, step: 19100 lambda_1: -0.2156, lambda_2: 138.3517 lambda_3: 0.0000 train remain: [0.67 0.28 0.16 0.22 0.1 0.21 0.47 0.46 0.27] infer remain: [0.63, 0.2, 0.13, 0.23, 0.1, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 111111111101111101011001000000 000000001001001001010001000000 100111000000000000000000000000 100001001101001000010000000000 101000000000001000000000000000 100000001101001000010000000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.047426, lagrangian_loss: 0.000350, attention_score_distillation_loss: 0.000387 loss: 0.044959, lagrangian_loss: 0.000551, attention_score_distillation_loss: 0.000382 ---------------------------------------------------------------------- time: 2023-07-19 16:33:48 Evaluating: pearson: 0.7728, eval_loss: 1.0637, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7121, expected_sequence_sparsity: 0.8617, target_sparsity: 0.7, step: 19200 lambda_1: -0.2131, lambda_2: 139.2832 lambda_3: 0.0000 train remain: [0.66 0.29 0.16 0.21 0.1 0.21 0.47 0.46 0.27] infer remain: [0.63, 0.2, 0.13, 0.2, 0.1, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111101111001000000 000000000001001001010101000000 100110000000000000000010000000 100001001101001000000000000000 101000000000001000000000000000 100000001101001000010000000000 100011111101011011010010000000 100011111101001000110010100001 000110111100001000000000000001 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.051215, lagrangian_loss: 0.000087, attention_score_distillation_loss: 0.000389 loss: 0.020950, lagrangian_loss: 0.000048, attention_score_distillation_loss: 0.000386 ETA: 0:45:37 | Epoch 106 finished. Took 64.63 seconds. 
---------------------------------------------------------------------- time: 2023-07-19 16:34:23 Evaluating: pearson: 0.7492, eval_loss: 1.1582, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7121, expected_sequence_sparsity: 0.8617, target_sparsity: 0.7, step: 19300 lambda_1: -0.0879, lambda_2: 140.2887 lambda_3: 0.0000 train remain: [0.66 0.29 0.16 0.21 0.09 0.21 0.47 0.46 0.28] infer remain: [0.63, 0.2, 0.13, 0.2, 0.1, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111101011001000010 000000000001001001010101000000 100111000000000000000000000000 100001001101001000000000000000 001000000000001000000010000000 100000001101001000010000000000 100011111101011011010010000000 100011111101001000110010100001 000110111100001000000000000001 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.036905, lagrangian_loss: 0.001157, attention_score_distillation_loss: 0.000373 loss: 0.034986, lagrangian_loss: 0.000202, attention_score_distillation_loss: 0.000381 ---------------------------------------------------------------------- time: 2023-07-19 16:34:58 Evaluating: pearson: 0.8089, eval_loss: 0.9326, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7122, expected_sequence_sparsity: 0.8617, target_sparsity: 0.7, step: 19400 lambda_1: -0.2001, lambda_2: 141.4460 lambda_3: 0.0000 train remain: [0.66 0.3 0.15 0.21 0.09 0.21 0.47 0.46 0.27] infer remain: [0.63, 0.2, 0.13, 0.2, 0.07, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111101011001000001 100000000001001001010001000000 100111000000000000000000000000 100001001101001000000000000000 001000000000001000000000000000 100000001101001000010000000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.029122, lagrangian_loss: -0.000057, attention_score_distillation_loss: 0.000384 ETA: 0:44:34 | Epoch 107 finished. Took 64.66 seconds. 
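Eval records in this format are easy to scrape back out for plotting or regression checks. A small parser sketch keyed to the "Evaluating:" lines (field names exactly as printed above):

import re

EVAL_RE = re.compile(
    r"Evaluating: pearson: (?P<pearson>[\d.]+), eval_loss: (?P<loss>[\d.]+)"
    r".*?step: (?P<step>\d+)"
)

def parse_evals(log_text: str):
    """Yield (step, pearson, eval_loss) for every eval record in the log."""
    for m in EVAL_RE.finditer(log_text):
        yield int(m["step"]), float(m["pearson"]), float(m["loss"])

line = "Evaluating: pearson: 0.8089, eval_loss: 0.9326, ... step: 19400"
print(list(parse_evals(line)))  # -> [(19400, 0.8089, 0.9326)]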
loss: 0.016859, lagrangian_loss: 0.002211, attention_score_distillation_loss: 0.000375 ---------------------------------------------------------------------- time: 2023-07-19 16:35:33 Evaluating: pearson: 0.8243, eval_loss: 0.8476, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7122, expected_sequence_sparsity: 0.8617, target_sparsity: 0.7, step: 19500 lambda_1: -0.3574, lambda_2: 142.5341 lambda_3: 0.0000 train remain: [0.65 0.3 0.15 0.21 0.08 0.21 0.47 0.46 0.27] infer remain: [0.63, 0.2, 0.13, 0.2, 0.07, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 111111111101111101011001000000 100000000001001001010001000000 100110000000000010000000000000 100001000101001000010000000000 001000000000001000000000000000 100000001101001000010000000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.021369, lagrangian_loss: -0.000054, attention_score_distillation_loss: 0.000387 loss: 0.023157, lagrangian_loss: 0.000214, attention_score_distillation_loss: 0.000377 ---------------------------------------------------------------------- time: 2023-07-19 16:36:08 Evaluating: pearson: 0.8186, eval_loss: 0.8628, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7122, expected_sequence_sparsity: 0.8617, target_sparsity: 0.7, step: 19600 lambda_1: -0.1135, lambda_2: 143.6197 lambda_3: 0.0000 train remain: [0.66 0.29 0.15 0.21 0.08 0.21 0.47 0.46 0.27] infer remain: [0.63, 0.2, 0.13, 0.2, 0.07, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111101111001000000 100000000001001001010001000000 100110000000000010000000000000 100001000101001000010000000000 001000000000001000000000000000 100000001101001000010000000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.034663, lagrangian_loss: -0.000022, attention_score_distillation_loss: 0.000379 ETA: 0:43:30 | Epoch 108 finished. Took 64.43 seconds. 
loss: 0.021586, lagrangian_loss: 0.000069, attention_score_distillation_loss: 0.000387 ---------------------------------------------------------------------- time: 2023-07-19 16:36:43 Evaluating: pearson: 0.7491, eval_loss: 1.1543, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7122, expected_sequence_sparsity: 0.8617, target_sparsity: 0.7, step: 19700 lambda_1: -0.1204, lambda_2: 144.8133 lambda_3: 0.0000 train remain: [0.66 0.3 0.15 0.21 0.08 0.2 0.47 0.46 0.27] infer remain: [0.63, 0.2, 0.13, 0.2, 0.07, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111111011001000000 000000000101001001010001000000 100110000010000000000000000000 100001000101001000010000000000 001000000000001000000000000000 100000001101001000010000000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.023764, lagrangian_loss: 0.000868, attention_score_distillation_loss: 0.000383 loss: 0.020211, lagrangian_loss: 0.000143, attention_score_distillation_loss: 0.000388 ---------------------------------------------------------------------- time: 2023-07-19 16:37:17 Evaluating: pearson: 0.7424, eval_loss: 1.1603, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7122, expected_sequence_sparsity: 0.8617, target_sparsity: 0.7, step: 19800 lambda_1: -0.3892, lambda_2: 145.8665 lambda_3: 0.0000 train remain: [0.66 0.27 0.15 0.21 0.07 0.2 0.47 0.46 0.27] infer remain: [0.63, 0.2, 0.13, 0.2, 0.07, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 111111111101111101011001000000 000000000101001001010001000000 100110000000001000000000000000 100001000101001000010000000000 001000000000001000000000000000 100000001101001000010000000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 ETA: 0:42:27 | Epoch 109 finished. Took 64.27 seconds. loss: 0.025719, lagrangian_loss: -0.000227, attention_score_distillation_loss: 0.000382 loss: 0.030329, lagrangian_loss: -0.000006, attention_score_distillation_loss: 0.000385 ---------------------------------------------------------------------- time: 2023-07-19 16:37:52 Evaluating: pearson: 0.7639, eval_loss: 1.0872, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7122, expected_sequence_sparsity: 0.8617, target_sparsity: 0.7, step: 19900 lambda_1: -0.2899, lambda_2: 147.2587 lambda_3: 0.0000 train remain: [0.66 0.28 0.15 0.21 0.07 0.2 0.47 0.46 0.27] infer remain: [0.63, 0.2, 0.13, 0.2, 0.07, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 111111111101111101011001000000 000000000001011001010001000000 100110000000001000000000000000 100001000101001000000000100000 001000000000001000000000000000 100000001101001000010000000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.026358, lagrangian_loss: -0.000142, attention_score_distillation_loss: 0.000379 loss: 0.032045, lagrangian_loss: -0.000141, attention_score_distillation_loss: 0.000383 ETA: 0:41:21 | Epoch 110 finished. Took 56.49 seconds. 
---------------------------------------------------------------------- time: 2023-07-19 16:38:27 Evaluating: pearson: 0.762, eval_loss: 1.0961, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7122, expected_sequence_sparsity: 0.8617, target_sparsity: 0.7, step: 20000 lambda_1: -0.0752, lambda_2: 148.7459 lambda_3: 0.0000 train remain: [0.66 0.29 0.15 0.21 0.07 0.2 0.46 0.46 0.27] infer remain: [0.63, 0.2, 0.13, 0.2, 0.07, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111101011001100000 001000000001001001010001000000 100110010000000000000000000000 100001000101001000010000000000 001000000000001000000000000000 100000001101001000010000000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.027654, lagrangian_loss: -0.000004, attention_score_distillation_loss: 0.000383 loss: 0.024883, lagrangian_loss: 0.000591, attention_score_distillation_loss: 0.000380 ---------------------------------------------------------------------- time: 2023-07-19 16:39:02 Evaluating: pearson: 0.7971, eval_loss: 0.8353, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7122, expected_sequence_sparsity: 0.8617, target_sparsity: 0.7, step: 20100 lambda_1: -0.1476, lambda_2: 149.9942 lambda_3: 0.0000 train remain: [0.66 0.29 0.15 0.21 0.07 0.2 0.46 0.46 0.27] infer remain: [0.63, 0.2, 0.13, 0.2, 0.07, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 111111111101111101011001000000 001000000001001001010001000000 100110010000000000000000000000 100001000101001000000000000001 100000000000001000000000000000 100000001101001000010000000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.022788, lagrangian_loss: 0.000203, attention_score_distillation_loss: 0.000387 loss: 0.027471, lagrangian_loss: 0.001228, attention_score_distillation_loss: 0.000387 ETA: 0:40:17 | Epoch 111 finished. Took 64.45 seconds. 
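The pearson number that everything above is chasing is the standard STS-B metric: the Pearson correlation between the model's predicted similarity scores and the gold labels. A minimal stand-in for the metric itself (the scores below are made up for illustration):

from scipy.stats import pearsonr

predictions = [4.1, 2.3, 3.8, 0.9, 4.6]  # model outputs (hypothetical)
references  = [4.0, 2.5, 3.5, 1.0, 5.0]  # gold STS-B similarity labels, 0-5

corr, _ = pearsonr(predictions, references)
print(f"pearson: {corr:.4f}")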
---------------------------------------------------------------------- time: 2023-07-19 16:39:37 Evaluating: pearson: 0.7826, eval_loss: 0.989, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7122, expected_sequence_sparsity: 0.8617, target_sparsity: 0.7, step: 20200 lambda_1: -0.0936, lambda_2: 150.8673 lambda_3: 0.0000 train remain: [0.67 0.29 0.15 0.21 0.07 0.2 0.46 0.46 0.27] infer remain: [0.63, 0.2, 0.13, 0.2, 0.07, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111111011001000000 001000000001001001010001000000 100110010000000000000000000000 100001000101101000000000000000 100000000000001000000000000000 100000001101001000010000000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.019439, lagrangian_loss: 0.000090, attention_score_distillation_loss: 0.000384 loss: 0.035056, lagrangian_loss: -0.000034, attention_score_distillation_loss: 0.000389 ---------------------------------------------------------------------- time: 2023-07-19 16:40:12 Evaluating: pearson: 0.773, eval_loss: 1.0459, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7122, expected_sequence_sparsity: 0.8617, target_sparsity: 0.7, step: 20300 lambda_1: -0.1958, lambda_2: 152.0601 lambda_3: 0.0000 train remain: [0.67 0.27 0.15 0.21 0.07 0.2 0.46 0.46 0.27] infer remain: [0.63, 0.2, 0.13, 0.2, 0.07, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111101111001000000 001000000001001001010001000000 100110010000000000000000000000 100001010101001000000000000000 000000000000001000000100000000 100000001101001000010000000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.015361, lagrangian_loss: 0.000202, attention_score_distillation_loss: 0.000384 ETA: 0:39:14 | Epoch 112 finished. Took 64.5 seconds. 
loss: 0.029153, lagrangian_loss: 0.000106, attention_score_distillation_loss: 0.000386 ---------------------------------------------------------------------- time: 2023-07-19 16:40:47 Evaluating: pearson: 0.8172, eval_loss: 0.8585, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7122, expected_sequence_sparsity: 0.8617, target_sparsity: 0.7, step: 20400 lambda_1: -0.3328, lambda_2: 153.1857 lambda_3: 0.0000 train remain: [0.66 0.28 0.15 0.21 0.07 0.19 0.46 0.46 0.27] infer remain: [0.63, 0.2, 0.13, 0.2, 0.07, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111101011001000001 100000000001001001010001000000 100110010000000000000000000000 100001010101001000000000000000 000000000000001000000010000000 100000001101001000010000000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.040584, lagrangian_loss: -0.000151, attention_score_distillation_loss: 0.000383 loss: 0.020045, lagrangian_loss: 0.000017, attention_score_distillation_loss: 0.000384 ---------------------------------------------------------------------- time: 2023-07-19 16:41:22 Evaluating: pearson: 0.8178, eval_loss: 0.8471, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7122, expected_sequence_sparsity: 0.8617, target_sparsity: 0.7, step: 20500 lambda_1: -0.1674, lambda_2: 154.2656 lambda_3: 0.0000 train remain: [0.67 0.29 0.15 0.21 0.07 0.19 0.46 0.46 0.27] infer remain: [0.63, 0.2, 0.13, 0.2, 0.07, 0.2, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111111111101011001000000 100000000001001001010001000000 100110010000000000000000000000 100001010101001000000000000000 000000000000001000000010000000 100000001101001000010000000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.020250, lagrangian_loss: 0.000046, attention_score_distillation_loss: 0.000383 ETA: 0:38:11 | Epoch 113 finished. Took 64.82 seconds. 
loss: 0.024443, lagrangian_loss: 0.000293, attention_score_distillation_loss: 0.000385 ---------------------------------------------------------------------- time: 2023-07-19 16:41:57 Evaluating: pearson: 0.819, eval_loss: 0.8509, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7122, expected_sequence_sparsity: 0.8617, target_sparsity: 0.7, step: 20600 lambda_1: -0.1549, lambda_2: 155.3536 lambda_3: 0.0000 train remain: [0.67 0.28 0.15 0.21 0.07 0.18 0.46 0.46 0.27] infer remain: [0.63, 0.2, 0.13, 0.2, 0.07, 0.17, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111111111101011001000000 100000000001001001010001000000 100110000000010000000000000000 100001000101001000000000000001 000000000000001000000010000000 100000001101001000000000000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.012542, lagrangian_loss: -0.000038, attention_score_distillation_loss: 0.000386 loss: 0.026489, lagrangian_loss: 0.000024, attention_score_distillation_loss: 0.000379 ---------------------------------------------------------------------- time: 2023-07-19 16:42:32 Evaluating: pearson: 0.7211, eval_loss: 1.2085, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7122, expected_sequence_sparsity: 0.8617, target_sparsity: 0.7, step: 20700 lambda_1: -0.0453, lambda_2: 156.3800 lambda_3: 0.0000 train remain: [0.68 0.28 0.15 0.21 0.07 0.18 0.46 0.45 0.27] infer remain: [0.63, 0.2, 0.13, 0.2, 0.07, 0.17, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111111111101011001000000 000100000001001001010001000000 100110000000010000000000000000 100001000101001000010000000000 000000000000001000000010000000 100000001101001000000000000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 ETA: 0:37:07 | Epoch 114 finished. Took 64.42 seconds. loss: 0.037745, lagrangian_loss: 0.000357, attention_score_distillation_loss: 0.000383 loss: 0.038984, lagrangian_loss: 0.000241, attention_score_distillation_loss: 0.000380 ---------------------------------------------------------------------- time: 2023-07-19 16:43:07 Evaluating: pearson: 0.7434, eval_loss: 1.1628, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7122, expected_sequence_sparsity: 0.8617, target_sparsity: 0.7, step: 20800 lambda_1: -0.2510, lambda_2: 157.2964 lambda_3: 0.0000 train remain: [0.66 0.28 0.15 0.21 0.07 0.18 0.46 0.45 0.27] infer remain: [0.63, 0.2, 0.13, 0.2, 0.07, 0.17, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111111111101011001000000 000000001001001001010001000000 100110000000000000000010000000 100001000101001000010000000000 000000000000001000000010000000 100000001101001000000000000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.026042, lagrangian_loss: 0.000158, attention_score_distillation_loss: 0.000385 loss: 0.023143, lagrangian_loss: -0.000031, attention_score_distillation_loss: 0.000384 ETA: 0:36:02 | Epoch 115 finished. Took 56.79 seconds. 
---------------------------------------------------------------------- time: 2023-07-19 16:43:42 Evaluating: pearson: 0.7287, eval_loss: 1.2061, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7122, expected_sequence_sparsity: 0.8617, target_sparsity: 0.7, step: 20900 lambda_1: -0.2046, lambda_2: 158.1209 lambda_3: 0.0000 train remain: [0.66 0.28 0.15 0.21 0.07 0.18 0.46 0.45 0.27] infer remain: [0.63, 0.2, 0.13, 0.2, 0.07, 0.17, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111101011001001000 000000000001001001010001000010 100110000001000000000000000000 100001010101001000000000000000 000000000000001000000010000000 100000001101001000000000000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.027918, lagrangian_loss: -0.000048, attention_score_distillation_loss: 0.000389 loss: 0.027788, lagrangian_loss: 0.000004, attention_score_distillation_loss: 0.000383 ---------------------------------------------------------------------- time: 2023-07-19 16:44:17 Evaluating: pearson: 0.7409, eval_loss: 1.1797, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7122, expected_sequence_sparsity: 0.8617, target_sparsity: 0.7, step: 21000 lambda_1: -0.1466, lambda_2: 159.0558 lambda_3: 0.0000 train remain: [0.67 0.28 0.15 0.21 0.07 0.18 0.46 0.45 0.28] infer remain: [0.63, 0.2, 0.13, 0.2, 0.07, 0.17, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111101011101000000 000000000001001001010101000000 100110000001000000000000000000 100001010101001000000000000000 000000000000001000000010000000 100000001101001000000000000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.027181, lagrangian_loss: 0.000952, attention_score_distillation_loss: 0.000385 loss: 0.026109, lagrangian_loss: 0.000131, attention_score_distillation_loss: 0.000384 ETA: 0:34:58 | Epoch 116 finished. Took 64.66 seconds. 
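The 'layerwise remain' vector is reproducible as a running product of the per-layer keep ratios, with the first three encoder layers left untouched; note this reading is inferred from the logged numbers, not taken from the training code:

import numpy as np

infer_remain = [0.63, 0.2, 0.13, 0.2, 0.07, 0.17, 0.47, 0.47, 0.27]
layerwise = [1.0, 1.0, 1.0] + list(np.cumprod(infer_remain))
print([round(x, 2) for x in layerwise])
# [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

Survival probabilities multiply across pruned layers (0.63 * 0.2 = 0.126, then * 0.13, ...), which is why the tail of the vector rounds to zero even though each individual layer still keeps 7-47% of its input.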
---------------------------------------------------------------------- time: 2023-07-19 16:44:52 Evaluating: pearson: 0.7916, eval_loss: 1.0084, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7122, expected_sequence_sparsity: 0.8617, target_sparsity: 0.7, step: 21100 lambda_1: -0.1868, lambda_2: 160.2301 lambda_3: 0.0000 train remain: [0.66 0.29 0.14 0.21 0.07 0.18 0.46 0.45 0.28] infer remain: [0.63, 0.2, 0.13, 0.2, 0.07, 0.17, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 111111111101111101011001000000 001000000001001001010001000000 100110000001000000000000000000 100001010101001000000000000000 000000000000001000000010000000 100000001101001000000000000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.026227, lagrangian_loss: -0.000054, attention_score_distillation_loss: 0.000378 loss: 0.024603, lagrangian_loss: 0.001418, attention_score_distillation_loss: 0.000369 ---------------------------------------------------------------------- time: 2023-07-19 16:45:27 Evaluating: pearson: 0.8609, eval_loss: 0.5845, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7122, expected_sequence_sparsity: 0.8617, target_sparsity: 0.7, step: 21200 lambda_1: -0.2624, lambda_2: 161.3966 lambda_3: 0.0000 train remain: [0.66 0.29 0.14 0.21 0.07 0.18 0.46 0.45 0.28] infer remain: [0.63, 0.2, 0.13, 0.2, 0.07, 0.17, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 111111111101111101011001000000 100000000001001001010001000000 100110000000000000000010000000 100001000101001000000000000001 100000000000001000000000000000 100000001101001000000000000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.014716, lagrangian_loss: 0.000307, attention_score_distillation_loss: 0.000376 ETA: 0:33:55 | Epoch 117 finished. Took 64.36 seconds. 
loss: 0.018226, lagrangian_loss: -0.000004, attention_score_distillation_loss: 0.000381 ---------------------------------------------------------------------- time: 2023-07-19 16:46:02 Evaluating: pearson: 0.7531, eval_loss: 1.1731, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7122, expected_sequence_sparsity: 0.8617, target_sparsity: 0.7, step: 21300 lambda_1: -0.1484, lambda_2: 162.5245 lambda_3: 0.0000 train remain: [0.66 0.29 0.14 0.21 0.07 0.18 0.46 0.45 0.28] infer remain: [0.63, 0.2, 0.13, 0.2, 0.07, 0.17, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 111111111101111101011001000000 000000000101001001010001000000 100110000000000000000010000000 100001000101001000000000000001 000000010000001000000000000000 100000001101001000000000000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.027145, lagrangian_loss: 0.000354, attention_score_distillation_loss: 0.000379 loss: 0.020961, lagrangian_loss: 0.000282, attention_score_distillation_loss: 0.000379 ---------------------------------------------------------------------- time: 2023-07-19 16:46:37 Evaluating: pearson: 0.7756, eval_loss: 1.0521, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7122, expected_sequence_sparsity: 0.8617, target_sparsity: 0.7, step: 21400 lambda_1: -0.1777, lambda_2: 163.7797 lambda_3: 0.0000 train remain: [0.66 0.29 0.14 0.21 0.07 0.18 0.46 0.45 0.28] infer remain: [0.63, 0.2, 0.13, 0.2, 0.07, 0.17, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111101011101000000 000100000001001001010001000000 110110000000000000000000000000 100001000101001000000000100000 100000000000001000000000000000 100000001100001000010000000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.023550, lagrangian_loss: 0.001809, attention_score_distillation_loss: 0.000388 ETA: 0:32:52 | Epoch 118 finished. Took 64.65 seconds. 
loss: 0.025421, lagrangian_loss: 0.001124, attention_score_distillation_loss: 0.000386 ---------------------------------------------------------------------- time: 2023-07-19 16:47:12 Evaluating: pearson: 0.7716, eval_loss: 1.0473, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7122, expected_sequence_sparsity: 0.8617, target_sparsity: 0.7, step: 21500 lambda_1: -0.1036, lambda_2: 165.2305 lambda_3: 0.0000 train remain: [0.67 0.29 0.14 0.21 0.07 0.18 0.46 0.45 0.28] infer remain: [0.63, 0.2, 0.13, 0.2, 0.07, 0.17, 0.47, 0.47, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 111111111101111101011001000000 000100000001001001010001000000 100110000001000000000000000000 100001010101001000000000000000 100000000000001000000000000000 100000001100001000010000000000 100011111101011011010010000000 100011111101001000110010100001 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.021308, lagrangian_loss: 0.001654, attention_score_distillation_loss: 0.000374 loss: 0.022790, lagrangian_loss: 0.003845, attention_score_distillation_loss: 0.000371 ---------------------------------------------------------------------- time: 2023-07-19 16:47:47 Evaluating: pearson: 0.7737, eval_loss: 1.0869, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7122, expected_sequence_sparsity: 0.8617, target_sparsity: 0.7, step: 21600 lambda_1: -0.1509, lambda_2: 166.2010 lambda_3: 0.0000 train remain: [0.66 0.29 0.15 0.21 0.07 0.18 0.46 0.45 0.28] infer remain: [0.63, 0.2, 0.13, 0.2, 0.07, 0.17, 0.47, 0.43, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 111111111101111101011001000000 000000001001001001010001000000 100110000000100000000000000000 100001000101001000000000000001 100000000000001000000000000000 100000001100001000010000000000 100011111101011011010010000000 100011111101001000110010100000 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 ETA: 0:31:48 | Epoch 119 finished. Took 64.11 seconds. loss: 0.020970, lagrangian_loss: -0.000028, attention_score_distillation_loss: 0.000381 loss: 0.016239, lagrangian_loss: 0.000185, attention_score_distillation_loss: 0.000380 ---------------------------------------------------------------------- time: 2023-07-19 16:48:22 Evaluating: pearson: 0.7763, eval_loss: 1.071, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7122, expected_sequence_sparsity: 0.8617, target_sparsity: 0.7, step: 21700 lambda_1: -0.0998, lambda_2: 167.1832 lambda_3: 0.0000 train remain: [0.67 0.3 0.15 0.21 0.07 0.18 0.45 0.45 0.28] infer remain: [0.63, 0.2, 0.13, 0.2, 0.07, 0.17, 0.47, 0.43, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 111111111101111101011001000000 000000010001001001010001000000 100110000001000000000000000000 100001000101001000000000000001 100000000000001000000000000000 100000001100001000010000000000 100011111101011011010010000000 100011111101001000110010100000 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.034697, lagrangian_loss: 0.002301, attention_score_distillation_loss: 0.000384 loss: 0.033953, lagrangian_loss: 0.000110, attention_score_distillation_loss: 0.000385 ETA: 0:30:43 | Epoch 120 finished. Took 56.53 seconds. 
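The ETA countdown is consistent with (epochs remaining) x (recent epoch duration): epochs in this stretch take roughly 64 seconds (about 56 when fewer evaluations fall inside the epoch), and the run length implied by the countdown is about 150 epochs. A back-of-envelope check; the trainer presumably uses a smoothed per-epoch time, hence the small mismatch:

remaining_epochs = 150 - 121   # total epoch count inferred from the countdown
secs = remaining_epochs * 64   # ~64 s per epoch observed above
print(f"{secs // 3600}:{secs % 3600 // 60:02d}:{secs % 60:02d}")
# -> 0:30:56, close to the logged "ETA: 0:30:43" after epoch 120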
---------------------------------------------------------------------- time: 2023-07-19 16:48:57 Evaluating: pearson: 0.7725, eval_loss: 1.0965, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7122, expected_sequence_sparsity: 0.8617, target_sparsity: 0.7, step: 21800 lambda_1: -0.0995, lambda_2: 168.3412 lambda_3: 0.0000 train remain: [0.66 0.29 0.14 0.21 0.07 0.18 0.45 0.45 0.28] infer remain: [0.63, 0.2, 0.13, 0.2, 0.07, 0.17, 0.47, 0.43, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111101011011000000 000000000001001001010001100000 100110000001000000000000000000 100001000101101000000000000000 100000000000001000000000000000 100000001100001000010000000000 100011111101011011010010000000 100011111101001000110010100000 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.040705, lagrangian_loss: 0.000162, attention_score_distillation_loss: 0.000379 loss: 0.020507, lagrangian_loss: 0.000258, attention_score_distillation_loss: 0.000385 ---------------------------------------------------------------------- time: 2023-07-19 16:49:32 Evaluating: pearson: 0.7693, eval_loss: 1.084, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7122, expected_sequence_sparsity: 0.8617, target_sparsity: 0.7, step: 21900 lambda_1: -0.1307, lambda_2: 169.4348 lambda_3: 0.0000 train remain: [0.67 0.29 0.13 0.21 0.07 0.18 0.45 0.45 0.28] infer remain: [0.63, 0.2, 0.13, 0.2, 0.07, 0.17, 0.47, 0.43, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111101011001000010 000000000001001011010001000000 100110000000000000000000010000 100001000101001000000100000000 100000000000001000000000000000 100000001100001000010000000000 100011111101011011010010000000 100011111101001000110010100000 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.017552, lagrangian_loss: -0.000025, attention_score_distillation_loss: 0.000386 loss: 0.023387, lagrangian_loss: -0.000010, attention_score_distillation_loss: 0.000387 ETA: 0:29:39 | Epoch 121 finished. Took 64.46 seconds. 
---------------------------------------------------------------------- time: 2023-07-19 16:50:07 Evaluating: pearson: 0.7752, eval_loss: 1.0622, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7122, expected_sequence_sparsity: 0.8617, target_sparsity: 0.7, step: 22000 lambda_1: -0.1376, lambda_2: 170.6308 lambda_3: 0.0000 train remain: [0.66 0.3 0.13 0.21 0.07 0.18 0.45 0.45 0.28] infer remain: [0.63, 0.2, 0.13, 0.2, 0.07, 0.17, 0.47, 0.43, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111101011001010000 001000000001001001010001000000 100110000000000000000000010000 100001000101001000000000000001 100000000000001000000000000000 100000001100001000010000000000 100011111101011011010010000000 100011111101001000110010100000 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.024147, lagrangian_loss: -0.000014, attention_score_distillation_loss: 0.000382 loss: 0.019544, lagrangian_loss: 0.000573, attention_score_distillation_loss: 0.000372 ---------------------------------------------------------------------- time: 2023-07-19 16:50:41 Evaluating: pearson: 0.7838, eval_loss: 1.0464, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7122, expected_sequence_sparsity: 0.8617, target_sparsity: 0.7, step: 22100 lambda_1: -0.0540, lambda_2: 171.7231 lambda_3: 0.0000 train remain: [0.67 0.3 0.13 0.2 0.07 0.18 0.45 0.45 0.28] infer remain: [0.63, 0.2, 0.13, 0.2, 0.07, 0.17, 0.47, 0.43, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111101111001000000 001000000001001001010001000000 100110001000000000000000000000 100001010101001000000000000000 100000000000001000000000000000 100000001100001000010000000000 100011111101011011010010000000 100011111101001000110010100000 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.016547, lagrangian_loss: 0.000023, attention_score_distillation_loss: 0.000386 ETA: 0:28:36 | Epoch 122 finished. Took 64.61 seconds. 
loss: 0.033873, lagrangian_loss: 0.000191, attention_score_distillation_loss: 0.000376 ---------------------------------------------------------------------- time: 2023-07-19 16:51:16 Evaluating: pearson: 0.785, eval_loss: 1.0308, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7126, expected_sequence_sparsity: 0.8619, target_sparsity: 0.7, step: 22200 lambda_1: -0.2093, lambda_2: 172.8760 lambda_3: 0.0000 train remain: [0.66 0.29 0.12 0.21 0.07 0.18 0.45 0.45 0.28] infer remain: [0.63, 0.2, 0.1, 0.2, 0.07, 0.17, 0.43, 0.43, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111101111001000000 001000000001001001010001000000 100110000000000000000000000000 100001000101101000000000000000 100000000000001000000000000000 100000001100001000010000000000 100011111101011011010000000000 100011111101001000110010100000 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.028192, lagrangian_loss: 0.004723, attention_score_distillation_loss: 0.000392 loss: 0.032357, lagrangian_loss: 0.003168, attention_score_distillation_loss: 0.000369 ---------------------------------------------------------------------- time: 2023-07-19 16:51:52 Evaluating: pearson: 0.8614, eval_loss: 0.5812, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7126, expected_sequence_sparsity: 0.8619, target_sparsity: 0.7, step: 22300 lambda_1: -0.1606, lambda_2: 173.9992 lambda_3: 0.0000 train remain: [0.66 0.29 0.12 0.2 0.07 0.18 0.45 0.45 0.28] infer remain: [0.63, 0.2, 0.1, 0.2, 0.07, 0.17, 0.43, 0.43, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111101011101000000 100000000001001001010001000000 100110000000000000000000000000 100001000101101000000000000000 100000000000001000000000000000 100000001100001000010000000000 100011111101011011010000000000 100011111101001000110010100000 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.024516, lagrangian_loss: 0.000426, attention_score_distillation_loss: 0.000380 ETA: 0:27:33 | Epoch 123 finished. Took 64.72 seconds. 
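'train remain' is fractional (0.66, 0.29, ...) while the printed evaluation masks are strictly 0/1 because relaxed stochastic gates are sampled during training and resolved deterministically at inference. Here is a minimal sketch of the standard hard-concrete gate from the L0-regularization literature (Louizos et al.), on which this style of token pruner is typically built; the stretch constants and the exact inference rule used here are assumptions:

import torch

def hard_concrete_gate(log_alpha, temperature=2/3, training=True):
    gamma, zeta = -0.1, 1.1                        # stretch interval (assumed)
    if training:
        u = torch.rand_like(log_alpha).clamp(1e-6, 1 - 1e-6)
        s = torch.sigmoid((u.log() - (1 - u).log() + log_alpha) / temperature)
    else:
        s = torch.sigmoid(log_alpha)               # deterministic at eval
    return (s * (zeta - gamma) + gamma).clamp(0.0, 1.0)

Averaging the training-time gate values per layer gives fractions like the 'train remain' row; thresholding the deterministic values yields the binary masks printed at evaluation time.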
loss: 0.022692, lagrangian_loss: 0.002883, attention_score_distillation_loss: 0.000375 ---------------------------------------------------------------------- time: 2023-07-19 16:52:27 Evaluating: pearson: 0.7657, eval_loss: 1.1119, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7126, expected_sequence_sparsity: 0.8619, target_sparsity: 0.7, step: 22400 lambda_1: -0.0599, lambda_2: 175.4763 lambda_3: 0.0000 train remain: [0.67 0.28 0.12 0.21 0.07 0.18 0.45 0.45 0.28] infer remain: [0.63, 0.2, 0.1, 0.2, 0.07, 0.17, 0.43, 0.43, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111111111101011001000000 000001000001001001010001000000 100110000000000000000000000000 100001000101101000000000000000 100000000000001000000000000000 100000001100001000010000000000 100011111101011011010000000000 100011111101001000110010100000 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.026675, lagrangian_loss: 0.000000, attention_score_distillation_loss: 0.000383 loss: 0.015929, lagrangian_loss: -0.000017, attention_score_distillation_loss: 0.000381 ---------------------------------------------------------------------- time: 2023-07-19 16:53:02 Evaluating: pearson: 0.7789, eval_loss: 1.0661, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7126, expected_sequence_sparsity: 0.8619, target_sparsity: 0.7, step: 22500 lambda_1: -0.1191, lambda_2: 176.5367 lambda_3: 0.0000 train remain: [0.67 0.27 0.12 0.21 0.07 0.18 0.45 0.45 0.28] infer remain: [0.63, 0.2, 0.1, 0.2, 0.07, 0.17, 0.43, 0.43, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111111011001000000 000001000001001001010001000000 100110000000000000000000000000 100001000101001000010000000000 100000000000001000000000000000 100000001100001000010000000000 100011111101011011010000000000 100011111101001000110010100000 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 ETA: 0:26:29 | Epoch 124 finished. Took 64.54 seconds. loss: 0.030787, lagrangian_loss: 0.001667, attention_score_distillation_loss: 0.000375 loss: 0.019097, lagrangian_loss: 0.000195, attention_score_distillation_loss: 0.000386 ---------------------------------------------------------------------- time: 2023-07-19 16:53:36 Evaluating: pearson: 0.777, eval_loss: 1.0671, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7126, expected_sequence_sparsity: 0.8619, target_sparsity: 0.7, step: 22600 lambda_1: -0.1055, lambda_2: 177.7395 lambda_3: 0.0000 train remain: [0.68 0.28 0.12 0.2 0.07 0.18 0.45 0.45 0.28] infer remain: [0.63, 0.2, 0.1, 0.2, 0.07, 0.17, 0.43, 0.43, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111111011001000000 000001000001001001010001000000 100110000000000000000000000000 100001000101001000010000000000 100000000000001000000000000000 100000001100001000010000000000 100011111101011011010000000000 100011111101001000110010100000 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.012551, lagrangian_loss: -0.000016, attention_score_distillation_loss: 0.000382 loss: 0.015420, lagrangian_loss: 0.001586, attention_score_distillation_loss: 0.000378 ETA: 0:25:24 | Epoch 125 finished. Took 56.36 seconds. 
---------------------------------------------------------------------- time: 2023-07-19 16:54:11 Evaluating: pearson: 0.7342, eval_loss: 1.2094, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7126, expected_sequence_sparsity: 0.8619, target_sparsity: 0.7, step: 22700 lambda_1: -0.2817, lambda_2: 178.6557 lambda_3: 0.0000 train remain: [0.66 0.28 0.12 0.2 0.07 0.18 0.45 0.45 0.28] infer remain: [0.63, 0.2, 0.1, 0.2, 0.07, 0.17, 0.43, 0.43, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111111111101011001000000 000001000001001001010001000000 100110000000000000000000000000 100001000101001000010000000000 000000000100001000000000000000 100000001100001000010000000000 100011111101011011010000000000 100011111101001000110010100000 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.025619, lagrangian_loss: -0.000067, attention_score_distillation_loss: 0.000384 loss: 0.018709, lagrangian_loss: 0.003269, attention_score_distillation_loss: 0.000371 ---------------------------------------------------------------------- time: 2023-07-19 16:54:46 Evaluating: pearson: 0.7264, eval_loss: 1.2194, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7126, expected_sequence_sparsity: 0.8619, target_sparsity: 0.7, step: 22800 lambda_1: -0.1021, lambda_2: 179.9960 lambda_3: 0.0000 train remain: [0.67 0.29 0.12 0.2 0.07 0.18 0.45 0.45 0.28] infer remain: [0.63, 0.2, 0.1, 0.2, 0.07, 0.17, 0.43, 0.43, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111111111101011001000000 000001000001001001010001000000 100110000000000000000000000000 100001000101001000000000000001 000000000100001000000000000000 100000001100001000010000000000 100011111101011011010000000000 100011111101001000110010100000 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.032218, lagrangian_loss: 0.000011, attention_score_distillation_loss: 0.000385 loss: 0.030864, lagrangian_loss: -0.000015, attention_score_distillation_loss: 0.000387 ETA: 0:24:21 | Epoch 126 finished. Took 64.59 seconds. 
---------------------------------------------------------------------- time: 2023-07-19 16:55:21 Evaluating: pearson: 0.7365, eval_loss: 1.2123, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7126, expected_sequence_sparsity: 0.8619, target_sparsity: 0.7, step: 22900 lambda_1: -0.1842, lambda_2: 180.9451 lambda_3: 0.0000 train remain: [0.67 0.29 0.12 0.2 0.07 0.18 0.44 0.45 0.28] infer remain: [0.63, 0.2, 0.1, 0.2, 0.07, 0.17, 0.43, 0.43, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111111111101011001000000 000001000001001001010001000000 100110000000000000000000000000 100001000101001000000001000000 000000000100001000000000000000 100000001100001000010000000000 100011111101011011010000000000 100011111101001000110010100000 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.021893, lagrangian_loss: 0.000137, attention_score_distillation_loss: 0.000379 loss: 0.025122, lagrangian_loss: 0.000538, attention_score_distillation_loss: 0.000389 ---------------------------------------------------------------------- time: 2023-07-19 16:55:56 Evaluating: pearson: 0.7534, eval_loss: 1.1605, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7126, expected_sequence_sparsity: 0.8619, target_sparsity: 0.7, step: 23000 lambda_1: -0.2257, lambda_2: 182.1329 lambda_3: 0.0000 train remain: [0.66 0.29 0.12 0.2 0.07 0.18 0.44 0.45 0.28] infer remain: [0.63, 0.2, 0.1, 0.2, 0.07, 0.17, 0.43, 0.43, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 111111111101111101011001000000 000001000001001001010001000000 100010000000001000000000000000 100001000101001000000001000000 000000000100001000000000000000 100000001100001000010000000000 100011111101011011010000000000 100011111101001000110010100000 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.022173, lagrangian_loss: 0.003681, attention_score_distillation_loss: 0.000367 ETA: 0:23:18 | Epoch 127 finished. Took 64.55 seconds. 
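Because every evaluation block carries its score and step on a single line, the 'Best eval score so far' bookkeeping is easy to reproduce offline. A small parsing sketch (the log file name is hypothetical):

import re

eval_re = re.compile(r"pearson: ([0-9.]+),.*?step: (\d+)")
best_score, best_step = 0.0, None
with open("train.log") as f:       # hypothetical path to this dump
    for line in f:
        m = eval_re.search(line)
        if m and float(m.group(1)) > best_score:
            best_score, best_step = float(m.group(1)), int(m.group(2))
print(f"Best eval score so far: {best_score} @ step {best_step}")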
loss: 0.020976, lagrangian_loss: 0.000002, attention_score_distillation_loss: 0.000381 ---------------------------------------------------------------------- time: 2023-07-19 16:56:31 Evaluating: pearson: 0.7392, eval_loss: 1.1868, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7126, expected_sequence_sparsity: 0.8619, target_sparsity: 0.7, step: 23100 lambda_1: -0.1979, lambda_2: 183.3698 lambda_3: 0.0000 train remain: [0.66 0.29 0.12 0.2 0.07 0.17 0.44 0.45 0.28] infer remain: [0.63, 0.2, 0.1, 0.2, 0.07, 0.17, 0.43, 0.43, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111111111101011001000000 000000000001001001010001100000 100010000000001000000000000000 100001000101001000000000000001 000000000100001000000000000000 100000001100001000010000000000 100011111101011011010000000000 100011111101001000110010100000 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.025619, lagrangian_loss: 0.000615, attention_score_distillation_loss: 0.000386 loss: 0.018397, lagrangian_loss: 0.000417, attention_score_distillation_loss: 0.000383 ---------------------------------------------------------------------- time: 2023-07-19 16:57:06 Evaluating: pearson: 0.7498, eval_loss: 1.1738, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7126, expected_sequence_sparsity: 0.8619, target_sparsity: 0.7, step: 23200 lambda_1: -0.0761, lambda_2: 184.4762 lambda_3: 0.0000 train remain: [0.66 0.3 0.12 0.2 0.07 0.17 0.44 0.45 0.28] infer remain: [0.63, 0.2, 0.1, 0.2, 0.07, 0.17, 0.43, 0.43, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111101111001000000 000000000001001001010001100000 100010000000001000000000000000 100001000101001000000000000001 000000000100001000000000000000 100000001100001000010000000000 100011111101011011010000000000 100011111101001000110010100000 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.021030, lagrangian_loss: 0.000011, attention_score_distillation_loss: 0.000377 ETA: 0:22:14 | Epoch 128 finished. Took 64.36 seconds. 
loss: 0.026556, lagrangian_loss: -0.000018, attention_score_distillation_loss: 0.000382 ---------------------------------------------------------------------- time: 2023-07-19 16:57:41 Evaluating: pearson: 0.7415, eval_loss: 1.1684, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7126, expected_sequence_sparsity: 0.8619, target_sparsity: 0.7, step: 23300 lambda_1: -0.1444, lambda_2: 185.3625 lambda_3: 0.0000 train remain: [0.67 0.29 0.12 0.2 0.07 0.17 0.43 0.45 0.28] infer remain: [0.63, 0.2, 0.1, 0.2, 0.07, 0.17, 0.43, 0.43, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111101011001000010 000000000001011001010001000000 100010000000001000000000000000 100001000101001000010000000000 000000000100001000000000000000 100000001100001000010000000000 100011111101011011010000000000 100011111101001000110010100000 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.022643, lagrangian_loss: 0.000565, attention_score_distillation_loss: 0.000384 loss: 0.017441, lagrangian_loss: -0.000009, attention_score_distillation_loss: 0.000378 ---------------------------------------------------------------------- time: 2023-07-19 16:58:16 Evaluating: pearson: 0.7548, eval_loss: 1.1531, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7126, expected_sequence_sparsity: 0.8619, target_sparsity: 0.7, step: 23400 lambda_1: -0.2032, lambda_2: 186.1922 lambda_3: 0.0000 train remain: [0.66 0.29 0.12 0.19 0.07 0.17 0.43 0.45 0.28] infer remain: [0.63, 0.2, 0.1, 0.2, 0.07, 0.17, 0.43, 0.43, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 111111111101111101011001000000 000000000001011001010001000000 100010000000001000000000000000 100001000101001000010000000000 000000000100001000000000000000 100000001100001000010000000000 100011111101011011010000000000 100011111101001000110010100000 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 ETA: 0:21:11 | Epoch 129 finished. Took 64.52 seconds. loss: 0.013727, lagrangian_loss: 0.000109, attention_score_distillation_loss: 0.000384 loss: 0.020591, lagrangian_loss: 0.000014, attention_score_distillation_loss: 0.000380 ---------------------------------------------------------------------- time: 2023-07-19 16:58:51 Evaluating: pearson: 0.7468, eval_loss: 1.1821, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7126, expected_sequence_sparsity: 0.8619, target_sparsity: 0.7, step: 23500 lambda_1: -0.1945, lambda_2: 187.2113 lambda_3: 0.0000 train remain: [0.66 0.3 0.12 0.19 0.07 0.17 0.43 0.45 0.28] infer remain: [0.63, 0.2, 0.1, 0.2, 0.07, 0.17, 0.43, 0.43, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111111011001000000 000000000101001001010001000000 100010000000001000000000000000 100001000101001000000010000000 000000000100001000000000000000 100000001100001000010000000000 100011111101011011010000000000 100011111101001000110010100000 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.014694, lagrangian_loss: 0.000010, attention_score_distillation_loss: 0.000378 loss: 0.016260, lagrangian_loss: 0.002113, attention_score_distillation_loss: 0.000377 ETA: 0:20:06 | Epoch 130 finished. Took 56.49 seconds. 
---------------------------------------------------------------------- time: 2023-07-19 16:59:25 Evaluating: pearson: 0.7369, eval_loss: 1.1945, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7126, expected_sequence_sparsity: 0.8619, target_sparsity: 0.7, step: 23600 lambda_1: -0.1718, lambda_2: 188.6839 lambda_3: 0.0000 train remain: [0.66 0.29 0.12 0.18 0.07 0.17 0.42 0.44 0.28] infer remain: [0.63, 0.2, 0.1, 0.17, 0.07, 0.17, 0.43, 0.43, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 111111111101111101011001000000 000000000101001001010001000000 100010000000001000000000000000 100001000101001000000000000000 000000000100001000000000000000 100000001100001000010000000000 100011111101011011010000000000 100011111101001000110010100000 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.037262, lagrangian_loss: 0.001906, attention_score_distillation_loss: 0.000381 loss: 0.017440, lagrangian_loss: 0.000090, attention_score_distillation_loss: 0.000378 ---------------------------------------------------------------------- time: 2023-07-19 17:00:01 Evaluating: pearson: 0.7451, eval_loss: 1.1899, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7126, expected_sequence_sparsity: 0.8619, target_sparsity: 0.7, step: 23700 lambda_1: -0.1375, lambda_2: 190.2640 lambda_3: 0.0000 train remain: [0.67 0.29 0.12 0.18 0.07 0.17 0.42 0.44 0.28] infer remain: [0.63, 0.2, 0.1, 0.17, 0.07, 0.17, 0.43, 0.43, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111101011001000010 000000001001001001010001000000 100010000000000000000010000000 100001000101001000000000000000 000000000100001000000000000000 100000001100001000010000000000 100011111101011011010000000000 100011111101001000110010100000 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.030498, lagrangian_loss: 0.000947, attention_score_distillation_loss: 0.000377 loss: 0.025913, lagrangian_loss: 0.000050, attention_score_distillation_loss: 0.000380 ETA: 0:19:03 | Epoch 131 finished. Took 64.74 seconds. 
---------------------------------------------------------------------- time: 2023-07-19 17:00:36 Evaluating: pearson: 0.7322, eval_loss: 1.2222, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7126, expected_sequence_sparsity: 0.8619, target_sparsity: 0.7, step: 23800 lambda_1: -0.1157, lambda_2: 191.6068 lambda_3: 0.0000 train remain: [0.67 0.3 0.12 0.18 0.07 0.17 0.42 0.44 0.28] infer remain: [0.63, 0.2, 0.1, 0.17, 0.07, 0.17, 0.4, 0.43, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111101011001100000 000000000001001001010011000000 100010000000001000000000000000 100001000101001000000000000000 000000000100001000000000000000 100000001100001000010000000000 100010111101011011010000000000 100011111101001000110010100000 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.016031, lagrangian_loss: 0.001744, attention_score_distillation_loss: 0.000375 loss: 0.011870, lagrangian_loss: -0.000000, attention_score_distillation_loss: 0.000378 ---------------------------------------------------------------------- time: 2023-07-19 17:01:11 Evaluating: pearson: 0.7234, eval_loss: 1.2277, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7126, expected_sequence_sparsity: 0.8619, target_sparsity: 0.7, step: 23900 lambda_1: -0.1600, lambda_2: 192.7725 lambda_3: 0.0000 train remain: [0.67 0.29 0.12 0.18 0.07 0.17 0.42 0.44 0.28] infer remain: [0.63, 0.2, 0.1, 0.17, 0.07, 0.17, 0.4, 0.43, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111101011101000000 000000000001001001010001001000 100010000000001000000000000000 100001000101001000000000000000 000000000100001000000000000000 100000001100001000010000000000 100010111101011011010000000000 100011111101001000110010100000 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.005114, lagrangian_loss: 0.001650, attention_score_distillation_loss: 0.000375 ETA: 0:17:59 | Epoch 132 finished. Took 64.46 seconds. 
loss: 0.009118, lagrangian_loss: 0.000133, attention_score_distillation_loss: 0.000385 ---------------------------------------------------------------------- time: 2023-07-19 17:01:46 Evaluating: pearson: 0.7451, eval_loss: 1.1769, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7126, expected_sequence_sparsity: 0.8619, target_sparsity: 0.7, step: 24000 lambda_1: -0.1143, lambda_2: 193.7267 lambda_3: 0.0000 train remain: [0.67 0.29 0.12 0.18 0.07 0.17 0.41 0.44 0.28] infer remain: [0.63, 0.2, 0.1, 0.17, 0.07, 0.17, 0.4, 0.43, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111101011101000000 000000001001001001010001000000 100010000000001000000000000000 100001000101001000000000000000 000000000100001000000000000000 100000001100001000010000000000 100010111101011011010000000000 100011111101001000110010100000 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.018415, lagrangian_loss: 0.001658, attention_score_distillation_loss: 0.000378 loss: 0.020419, lagrangian_loss: 0.001710, attention_score_distillation_loss: 0.000380 ---------------------------------------------------------------------- time: 2023-07-19 17:02:21 Evaluating: pearson: 0.7509, eval_loss: 1.1862, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7126, expected_sequence_sparsity: 0.8619, target_sparsity: 0.7, step: 24100 lambda_1: -0.0518, lambda_2: 195.0269 lambda_3: 0.0000 train remain: [0.67 0.29 0.12 0.18 0.07 0.17 0.41 0.44 0.28] infer remain: [0.63, 0.2, 0.1, 0.17, 0.07, 0.17, 0.4, 0.43, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 111111111101111101011001000000 000000001001001001010001000000 100010000000001000000000000000 100001000101001000000000000000 000000000100001000000000000000 100000001100001000010000000000 100010111101011011010000000000 100011111101001000110010100000 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.012876, lagrangian_loss: 0.000950, attention_score_distillation_loss: 0.000372 ETA: 0:16:56 | Epoch 133 finished. Took 64.61 seconds. 
loss: 0.018349, lagrangian_loss: 0.000184, attention_score_distillation_loss: 0.000387 ---------------------------------------------------------------------- time: 2023-07-19 17:02:56 Evaluating: pearson: 0.7695, eval_loss: 1.1116, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7126, expected_sequence_sparsity: 0.8619, target_sparsity: 0.7, step: 24200 lambda_1: -0.1255, lambda_2: 196.1547 lambda_3: 0.0000 train remain: [0.67 0.29 0.12 0.17 0.07 0.17 0.41 0.44 0.28] infer remain: [0.63, 0.2, 0.1, 0.17, 0.07, 0.17, 0.4, 0.43, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111101011001010000 000000000001001001010001001000 100010000000000000000010000000 100001000101001000000000000000 100000000000001000000000000000 100000001100001000010000000000 100010111101011011010000000000 100011111101001000110010100000 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.019833, lagrangian_loss: 0.000050, attention_score_distillation_loss: 0.000376 loss: 0.018418, lagrangian_loss: 0.000462, attention_score_distillation_loss: 0.000376 ---------------------------------------------------------------------- time: 2023-07-19 17:03:31 Evaluating: pearson: 0.7621, eval_loss: 1.1297, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7126, expected_sequence_sparsity: 0.8619, target_sparsity: 0.7, step: 24300 lambda_1: -0.0624, lambda_2: 197.2902 lambda_3: 0.0000 train remain: [0.67 0.29 0.12 0.17 0.07 0.17 0.4 0.44 0.28] infer remain: [0.63, 0.2, 0.1, 0.17, 0.07, 0.17, 0.4, 0.43, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111101011001010000 000000000101001001010001000000 100010000000000000000010000000 100001000101001000000000000000 100000000000001000000000000000 100000001100001000010000000000 100010111101011011010000000000 100011111101001000110010100000 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 ETA: 0:15:53 | Epoch 134 finished. Took 64.72 seconds. loss: 0.022577, lagrangian_loss: 0.000128, attention_score_distillation_loss: 0.000385 loss: 0.023329, lagrangian_loss: 0.000038, attention_score_distillation_loss: 0.000388 ---------------------------------------------------------------------- time: 2023-07-19 17:04:06 Evaluating: pearson: 0.7489, eval_loss: 1.1882, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7126, expected_sequence_sparsity: 0.8619, target_sparsity: 0.7, step: 24400 lambda_1: -0.1808, lambda_2: 198.1860 lambda_3: 0.0000 train remain: [0.67 0.29 0.11 0.17 0.07 0.17 0.4 0.44 0.28] infer remain: [0.63, 0.2, 0.1, 0.17, 0.07, 0.17, 0.4, 0.43, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111101011001010000 000001000001001001010001000000 100010000001000000000000000000 100001000101001000000000000000 000000000100001000000000000000 100000001100001000010000000000 100010111101011011010000000000 100011111101001000110010100000 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.020567, lagrangian_loss: -0.000039, attention_score_distillation_loss: 0.000378 loss: 0.017191, lagrangian_loss: 0.000090, attention_score_distillation_loss: 0.000385 ETA: 0:14:48 | Epoch 135 finished. Took 56.55 seconds. 
---------------------------------------------------------------------- time: 2023-07-19 17:04:41 Evaluating: pearson: 0.8596, eval_loss: 0.5908, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7126, expected_sequence_sparsity: 0.8619, target_sparsity: 0.7, step: 24500 lambda_1: -0.0789, lambda_2: 199.6727 lambda_3: 0.0000 train remain: [0.67 0.29 0.12 0.17 0.07 0.17 0.4 0.44 0.28] infer remain: [0.63, 0.2, 0.1, 0.17, 0.07, 0.17, 0.4, 0.43, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111101111001000000 100000000001001001010001000000 100010000100000000000000000000 100001000101001000000000000000 100000000000001000000000000000 100000001100001000000000000001 100010111101011011010000000000 100011111101001000110010100000 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.023978, lagrangian_loss: 0.000333, attention_score_distillation_loss: 0.000387 loss: 0.025588, lagrangian_loss: 0.001878, attention_score_distillation_loss: 0.000379 ---------------------------------------------------------------------- time: 2023-07-19 17:05:16 Evaluating: pearson: 0.7582, eval_loss: 1.1354, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7126, expected_sequence_sparsity: 0.8619, target_sparsity: 0.7, step: 24600 lambda_1: -0.1657, lambda_2: 200.8086 lambda_3: 0.0000 train remain: [0.66 0.29 0.12 0.17 0.07 0.17 0.39 0.43 0.28] infer remain: [0.63, 0.2, 0.1, 0.17, 0.07, 0.17, 0.4, 0.43, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111101011001010000 000000001001001001010001000000 100010000000000000000000100000 100001000101001000000000000000 100000000000001000000000000000 100000001100001000000000000001 100010111101011011010000000000 100011111101001000110010100000 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.020807, lagrangian_loss: -0.000017, attention_score_distillation_loss: 0.000388 loss: 0.022447, lagrangian_loss: 0.000012, attention_score_distillation_loss: 0.000380 ETA: 0:13:45 | Epoch 136 finished. Took 64.39 seconds. 
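The headline metric swings between roughly 0.72 and 0.86 at an essentially fixed sparsity, so which tokens survive matters far more at this point than how many. For reference, the pearson score for STS-B is the plain Pearson correlation between predicted and gold similarity scores:

from scipy.stats import pearsonr

def stsb_metric(predictions, references):
    # STS-B reports the Pearson correlation of predictions vs. gold scores
    return {"pearson": pearsonr(predictions, references)[0]}

print(stsb_metric([4.5, 2.0, 3.3], [5.0, 1.8, 3.0]))   # toy inputs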
---------------------------------------------------------------------- time: 2023-07-19 17:05:51 Evaluating: pearson: 0.7518, eval_loss: 1.1627, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7126, expected_sequence_sparsity: 0.8619, target_sparsity: 0.7, step: 24700 lambda_1: -0.0732, lambda_2: 201.9591 lambda_3: 0.0000 train remain: [0.67 0.29 0.12 0.17 0.07 0.17 0.39 0.44 0.28] infer remain: [0.63, 0.2, 0.1, 0.17, 0.07, 0.17, 0.4, 0.43, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111101011101000000 000000000001001101010001000000 100010000000000001000000000000 100001000101001000000000000000 100000000000001000000000000000 100000001100001000000000000001 100010111101011011010000000000 100011111101001000110010100000 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.031077, lagrangian_loss: 0.000040, attention_score_distillation_loss: 0.000382 loss: 0.022321, lagrangian_loss: 0.001731, attention_score_distillation_loss: 0.000388 ---------------------------------------------------------------------- time: 2023-07-19 17:06:26 Evaluating: pearson: 0.7368, eval_loss: 1.224, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7126, expected_sequence_sparsity: 0.8619, target_sparsity: 0.7, step: 24800 lambda_1: -0.1356, lambda_2: 203.0933 lambda_3: 0.0000 train remain: [0.67 0.3 0.12 0.16 0.07 0.17 0.38 0.44 0.28] infer remain: [0.63, 0.2, 0.1, 0.17, 0.07, 0.17, 0.37, 0.43, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111101011001100000 000000000001001011010001000000 100010000000000000000000100000 100001000101001000000000000000 000000010000001000000000000000 100000001100001000000000000001 100010111101001011010000000000 100011111101001000110010100000 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.020094, lagrangian_loss: 0.000126, attention_score_distillation_loss: 0.000383 ETA: 0:12:42 | Epoch 137 finished. Took 64.24 seconds. 
loss: 0.014792, lagrangian_loss: -0.000010, attention_score_distillation_loss: 0.000379 ---------------------------------------------------------------------- time: 2023-07-19 17:07:00 Evaluating: pearson: 0.7417, eval_loss: 1.2115, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7126, expected_sequence_sparsity: 0.8619, target_sparsity: 0.7, step: 24900 lambda_1: -0.1584, lambda_2: 204.2858 lambda_3: 0.0000 train remain: [0.67 0.29 0.12 0.16 0.07 0.17 0.38 0.43 0.28] infer remain: [0.63, 0.2, 0.1, 0.17, 0.07, 0.17, 0.37, 0.43, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 111111111101111101011001000000 000000000001001011010001000000 100010000000000001000000000000 100001000101001000000000000000 000000010000001000000000000000 100000001100001000000000000001 100010111101001011010000000000 100011111101001000110010100000 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.018069, lagrangian_loss: 0.000203, attention_score_distillation_loss: 0.000385 loss: 0.024392, lagrangian_loss: 0.004545, attention_score_distillation_loss: 0.000367 ---------------------------------------------------------------------- time: 2023-07-19 17:07:35 Evaluating: pearson: 0.7401, eval_loss: 1.2177, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7126, expected_sequence_sparsity: 0.8619, target_sparsity: 0.7, step: 25000 lambda_1: -0.1494, lambda_2: 205.4537 lambda_3: 0.0000 train remain: [0.67 0.29 0.11 0.16 0.07 0.16 0.38 0.43 0.28] infer remain: [0.63, 0.2, 0.1, 0.17, 0.07, 0.17, 0.37, 0.43, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 111111111101111101011001000000 000000000001011001010001000000 100010000000000001000000000000 100001000101001000000000000000 000000010000001000000000000000 100000001100001000000000000001 100010111101001011010000000000 100011111101001000110010100000 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.017772, lagrangian_loss: 0.000032, attention_score_distillation_loss: 0.000384 ETA: 0:11:38 | Epoch 138 finished. Took 64.8 seconds. 
loss: 0.015803, lagrangian_loss: 0.000810, attention_score_distillation_loss: 0.000385 ---------------------------------------------------------------------- time: 2023-07-19 17:08:11 Evaluating: pearson: 0.7362, eval_loss: 1.2227, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7126, expected_sequence_sparsity: 0.8619, target_sparsity: 0.7, step: 25100 lambda_1: -0.0583, lambda_2: 206.3173 lambda_3: 0.0000 train remain: [0.67 0.3 0.12 0.15 0.07 0.16 0.37 0.43 0.28] infer remain: [0.63, 0.2, 0.1, 0.17, 0.07, 0.17, 0.37, 0.43, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111101011101000000 001000000001001001010001000000 100010000000000001000000000000 100001000101001000000000000000 000000010000001000000000000000 100000001100001000000000000001 100010111101001011010000000000 100011111101001000110010100000 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.019647, lagrangian_loss: 0.000026, attention_score_distillation_loss: 0.000383 loss: 0.020977, lagrangian_loss: 0.000011, attention_score_distillation_loss: 0.000381 ---------------------------------------------------------------------- time: 2023-07-19 17:08:46 Evaluating: pearson: 0.732, eval_loss: 1.2307, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7127, expected_sequence_sparsity: 0.8619, target_sparsity: 0.7, step: 25200 lambda_1: -0.1452, lambda_2: 207.3363 lambda_3: 0.0000 train remain: [0.67 0.29 0.11 0.15 0.07 0.16 0.37 0.43 0.28] infer remain: [0.63, 0.2, 0.1, 0.13, 0.07, 0.17, 0.37, 0.43, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111101011101000000 000000100001001001010001000000 100010000000000001000000000000 100000000101001000000000000000 000000010000001000000000000000 100000001100001000000000000001 100010111101001011010000000000 100011111101001000110010100000 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 ETA: 0:10:35 | Epoch 139 finished. Took 64.62 seconds. loss: 0.026760, lagrangian_loss: 0.000133, attention_score_distillation_loss: 0.000387 loss: 0.019172, lagrangian_loss: 0.000020, attention_score_distillation_loss: 0.000385 ---------------------------------------------------------------------- time: 2023-07-19 17:09:20 Evaluating: pearson: 0.7502, eval_loss: 1.1964, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7126, expected_sequence_sparsity: 0.8619, target_sparsity: 0.7, step: 25300 lambda_1: -0.0306, lambda_2: 208.4178 lambda_3: 0.0000 train remain: [0.67 0.3 0.11 0.15 0.07 0.16 0.37 0.43 0.28] infer remain: [0.63, 0.2, 0.1, 0.17, 0.07, 0.17, 0.37, 0.43, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111101011001100000 010000000001001001010001000000 100010000001000000000000000000 100001000101001000000000000000 000000010000001000000000000000 100000001100001000000000000001 100010111101001011010000000000 100011111101001000110010100000 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.025408, lagrangian_loss: 0.000002, attention_score_distillation_loss: 0.000386 loss: 0.017705, lagrangian_loss: 0.001178, attention_score_distillation_loss: 0.000372 ETA: 0:09:31 | Epoch 140 finished. Took 56.45 seconds. 
---------------------------------------------------------------------- time: 2023-07-19 17:09:55 Evaluating: pearson: 0.7821, eval_loss: 0.9637, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7127, expected_sequence_sparsity: 0.8619, target_sparsity: 0.7, step: 25400 lambda_1: -0.1264, lambda_2: 209.6138 lambda_3: 0.0000 train remain: [0.67 0.29 0.11 0.15 0.07 0.15 0.37 0.43 0.29] infer remain: [0.63, 0.2, 0.1, 0.13, 0.07, 0.17, 0.37, 0.43, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111101111001000000 010000000001001001010001000000 100010000000000000000001000000 100000000101001000000000000000 100000000000001000000000000000 100000001100001000000000000001 100010111101001011010000000000 100011111101001000110010100000 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.014810, lagrangian_loss: 0.000333, attention_score_distillation_loss: 0.000373 loss: 0.017383, lagrangian_loss: 0.002628, attention_score_distillation_loss: 0.000374 ---------------------------------------------------------------------- time: 2023-07-19 17:10:30 Evaluating: pearson: 0.7611, eval_loss: 1.146, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7127, expected_sequence_sparsity: 0.8619, target_sparsity: 0.7, step: 25500 lambda_1: -0.1007, lambda_2: 210.6246 lambda_3: 0.0000 train remain: [0.67 0.29 0.12 0.15 0.07 0.15 0.37 0.42 0.29] infer remain: [0.63, 0.2, 0.1, 0.13, 0.07, 0.13, 0.37, 0.43, 0.27] layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 101111111101111101111001000000 010000000001001001010001000000 100010000000001000000000000000 100000000101001000000000000000 000000000000001000010000000000 100000001100001000000000000000 100010111101001011010000000000 100011111101001000110010100000 100110111100001000000000000000 Best eval score so far: 0.8662 @ step 12900 epoch 71.67 loss: 0.022986, lagrangian_loss: 0.000093, attention_score_distillation_loss: 0.000384 loss: 0.012094, lagrangian_loss: -0.000029, attention_score_distillation_loss: 0.000384 ETA: 0:08:27 | Epoch 141 finished. Took 64.44 seconds. 
----------------------------------------------------------------------
time: 2023-07-19 17:11:05
Evaluating: pearson: 0.7776, eval_loss: 1.0727, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7127, expected_sequence_sparsity: 0.8619, target_sparsity: 0.7, step: 25600
lambda_1: -0.2271, lambda_2: 211.8452 lambda_3: 0.0000
train remain: [0.66 0.28 0.11 0.15 0.07 0.15 0.36 0.42 0.29]
infer remain: [0.63, 0.2, 0.1, 0.13, 0.07, 0.13, 0.37, 0.43, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
101111111101111101111001000000
001000000001001001010001000000
100010000000001000000000000000
100000000101001000000000000000
100000000000001000000000000000
100000001100001000000000000000
100010111101001011010000000000
100011111101001000110010100000
100110111100001000000000000000
Best eval score so far: 0.8662 @ step 12900 epoch 71.67
loss: 0.015185, lagrangian_loss: -0.000060, attention_score_distillation_loss: 0.000380
loss: 0.021058, lagrangian_loss: -0.000046, attention_score_distillation_loss: 0.000378
----------------------------------------------------------------------
time: 2023-07-19 17:11:40
Evaluating: pearson: 0.7733, eval_loss: 1.0712, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7127, expected_sequence_sparsity: 0.8619, target_sparsity: 0.7, step: 25700
lambda_1: -0.0340, lambda_2: 213.2352 lambda_3: 0.0000
train remain: [0.67 0.29 0.12 0.15 0.07 0.15 0.36 0.42 0.3 ]
infer remain: [0.63, 0.2, 0.1, 0.13, 0.07, 0.13, 0.37, 0.43, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
101111111101111111011001000000
001000000001001001010001000000
100010010000000000000000000000
100000000101001000000000000000
100000000000001000000000000000
100000001100001000000000000000
100010111101001011010000000000
100011111101001000110010100000
100110111100001000000000000000
Best eval score so far: 0.8662 @ step 12900 epoch 71.67
loss: 0.017281, lagrangian_loss: 0.001371, attention_score_distillation_loss: 0.000383
ETA: 0:07:24 | Epoch 142 finished. Took 64.52 seconds.
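The "layerwise remain" list is consistent with a running product of the per-location keep ratios across the 12 encoder layers, with the layers before the first prune location keeping every token; by location 5 the product has already fallen to ~0.01, which is why the tail prints as 0.0. A sketch of that interpretation (inferred from the printed numbers, not taken from the training code):

    # Cumulative keep ratio per encoder layer, using the step-25600
    # values above. Layers not in prune_location pass tokens through.
    prune_location = [3, 4, 5, 6, 7, 8, 9, 10, 11]
    infer_remain = [0.63, 0.2, 0.1, 0.13, 0.07, 0.13, 0.37, 0.43, 0.27]
    ratio, layerwise = 1.0, []
    for layer in range(12):
        if layer in prune_location:
            ratio *= infer_remain[prune_location.index(layer)]
        layerwise.append(round(ratio, 2))
    # -> [1.0, 1.0, 1.0, 0.63, 0.13, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
    print(layerwise)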
loss: 0.035524, lagrangian_loss: 0.000002, attention_score_distillation_loss: 0.000378
----------------------------------------------------------------------
time: 2023-07-19 17:12:15
Evaluating: pearson: 0.7774, eval_loss: 1.0843, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7127, expected_sequence_sparsity: 0.8619, target_sparsity: 0.7, step: 25800
lambda_1: -0.1584, lambda_2: 214.5378 lambda_3: 0.0000
train remain: [0.67 0.29 0.11 0.15 0.06 0.15 0.36 0.42 0.3 ]
infer remain: [0.63, 0.2, 0.1, 0.13, 0.07, 0.13, 0.37, 0.4, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
101111111101111101011011000000
000001000001001001010001000000
100010000000001000000000000000
100000000101001000000000000000
100000000000001000000000000000
100000001100001000000000000000
100010111101001011010000000000
000011111101001000110010100000
100110111100001000000000000000
Best eval score so far: 0.8662 @ step 12900 epoch 71.67
loss: 0.021670, lagrangian_loss: 0.000413, attention_score_distillation_loss: 0.000384
loss: 0.030725, lagrangian_loss: 0.000704, attention_score_distillation_loss: 0.000386
----------------------------------------------------------------------
time: 2023-07-19 17:12:50
Evaluating: pearson: 0.7598, eval_loss: 1.1281, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7127, expected_sequence_sparsity: 0.8619, target_sparsity: 0.7, step: 25900
lambda_1: -0.1382, lambda_2: 215.5801 lambda_3: 0.0000
train remain: [0.66 0.29 0.11 0.14 0.06 0.15 0.36 0.42 0.3 ]
infer remain: [0.63, 0.2, 0.1, 0.13, 0.07, 0.13, 0.37, 0.4, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
101111111101111101011001100000
000001000001001001010001000000
100010000000001000000000000000
100000000101001000000000000000
100000000000001000000000000000
100000001100001000000000000000
100010111101001011010000000000
000011111101001000110010100000
100110111100001000000000000000
Best eval score so far: 0.8662 @ step 12900 epoch 71.67
loss: 0.026484, lagrangian_loss: 0.000169, attention_score_distillation_loss: 0.000379
ETA: 0:06:20 | Epoch 143 finished. Took 64.73 seconds.
loss: 0.014245, lagrangian_loss: 0.000326, attention_score_distillation_loss: 0.000379
----------------------------------------------------------------------
time: 2023-07-19 17:13:25
Evaluating: pearson: 0.7566, eval_loss: 1.1385, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7127, expected_sequence_sparsity: 0.8619, target_sparsity: 0.7, step: 26000
lambda_1: -0.1773, lambda_2: 216.7518 lambda_3: 0.0000
train remain: [0.66 0.29 0.11 0.14 0.06 0.15 0.36 0.41 0.29]
infer remain: [0.63, 0.2, 0.1, 0.13, 0.07, 0.13, 0.37, 0.4, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
101111111101111101011001100000
000000001001001001010001000000
100010000100000000000000000000
100000000101001000000000000000
100000000000001000000000000000
100000001100001000000000000000
100010111101001011010000000000
000011111101001000110010100000
100110111100001000000000000000
Best eval score so far: 0.8662 @ step 12900 epoch 71.67
loss: 0.021942, lagrangian_loss: 0.001795, attention_score_distillation_loss: 0.000387
loss: 0.022532, lagrangian_loss: 0.000704, attention_score_distillation_loss: 0.000382
----------------------------------------------------------------------
time: 2023-07-19 17:14:00
Evaluating: pearson: 0.7284, eval_loss: 1.2374, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7127, expected_sequence_sparsity: 0.8619, target_sparsity: 0.7, step: 26100
lambda_1: -0.2321, lambda_2: 217.8993 lambda_3: 0.0000
train remain: [0.66 0.29 0.11 0.14 0.06 0.15 0.36 0.41 0.29]
infer remain: [0.63, 0.2, 0.1, 0.13, 0.07, 0.13, 0.37, 0.4, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
101111111101111101011001000001
000000001001001001010001000000
100010000000000010000000000000
100000000101001000000000000000
000000000000001000000000000001
100000001100001000000000000000
100010111101001011010000000000
000011111101001000110010100000
100110111100001000000000000000
Best eval score so far: 0.8662 @ step 12900 epoch 71.67
ETA: 0:05:17 | Epoch 144 finished. Took 64.54 seconds.
loss: 0.016943, lagrangian_loss: 0.000349, attention_score_distillation_loss: 0.000381
loss: 0.023894, lagrangian_loss: 0.000139, attention_score_distillation_loss: 0.000376
----------------------------------------------------------------------
time: 2023-07-19 17:14:35
Evaluating: pearson: 0.7326, eval_loss: 1.2339, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7127, expected_sequence_sparsity: 0.8619, target_sparsity: 0.7, step: 26200
lambda_1: -0.0016, lambda_2: 219.0161 lambda_3: 0.0000
train remain: [0.67 0.3 0.11 0.15 0.06 0.15 0.36 0.41 0.3 ]
infer remain: [0.63, 0.2, 0.1, 0.13, 0.07, 0.13, 0.37, 0.4, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
101111111111111101011001000000
000000001001001001010001000000
100010000000000010000000000000
100000000101001000000000000000
000000000000001000000000000001
100000001100001000000000000000
100010111101001011010000000000
000011111101001000110010100000
100110111100001000000000000000
Best eval score so far: 0.8662 @ step 12900 epoch 71.67
loss: 0.023461, lagrangian_loss: 0.000011, attention_score_distillation_loss: 0.000388
loss: 0.016057, lagrangian_loss: 0.000024, attention_score_distillation_loss: 0.000385
ETA: 0:04:13 | Epoch 145 finished. Took 56.77 seconds.
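The lagrangian_loss values hovering near zero (and occasionally dipping negative) fit the usual L0-regularization construction, where learned multipliers penalize the gap between expected and target sparsity; with expected_sparsity at ~0.7127 against the 0.7 target, the per-batch penalty nearly vanishes. A sketch of the standard CoFi-style form, which this code base may or may not implement verbatim:

    def lagrangian_loss(expected_sparsity, target_sparsity, lambda_1, lambda_2):
        # lambda_1 scales the signed gap, lambda_2 the squared gap; the
        # multipliers are trained adversarially, so the term can go
        # slightly negative when the gap changes sign, as in some steps above.
        gap = expected_sparsity - target_sparsity
        return lambda_1 * gap + lambda_2 * gap ** 2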
----------------------------------------------------------------------
time: 2023-07-19 17:15:10
Evaluating: pearson: 0.7521, eval_loss: 1.1559, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7127, expected_sequence_sparsity: 0.8619, target_sparsity: 0.7, step: 26300
lambda_1: -0.0650, lambda_2: 220.0106 lambda_3: 0.0000
train remain: [0.67 0.29 0.11 0.14 0.06 0.15 0.36 0.41 0.3 ]
infer remain: [0.63, 0.2, 0.1, 0.13, 0.07, 0.13, 0.37, 0.4, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
101111111111111101011001000000
000001000001001001010001000000
100010000100000000000000000000
100000000101001000000000000000
100000000000001000000000000000
100000001100001000000000000000
100010111101001011010000000000
000011111101001000110010100000
100110111100001000000000000000
Best eval score so far: 0.8662 @ step 12900 epoch 71.67
loss: 0.016935, lagrangian_loss: 0.000136, attention_score_distillation_loss: 0.000382
loss: 0.021187, lagrangian_loss: 0.000067, attention_score_distillation_loss: 0.000382
----------------------------------------------------------------------
time: 2023-07-19 17:15:45
Evaluating: pearson: 0.7572, eval_loss: 1.1445, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7127, expected_sequence_sparsity: 0.8619, target_sparsity: 0.7, step: 26400
lambda_1: -0.1837, lambda_2: 221.2224 lambda_3: 0.0000
train remain: [0.67 0.29 0.11 0.14 0.06 0.15 0.36 0.41 0.3 ]
infer remain: [0.63, 0.2, 0.1, 0.13, 0.07, 0.13, 0.37, 0.4, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
111111111101111101011001000000
000000001001001001010001000000
100010000000000000010000000000
100000000101001000000000000000
100000000000001000000000000000
100000001100001000000000000000
100010111101001011010000000000
000011111101001000110010100000
100110111100001000000000000000
Best eval score so far: 0.8662 @ step 12900 epoch 71.67
loss: 0.013065, lagrangian_loss: 0.002685, attention_score_distillation_loss: 0.000383
loss: 0.013406, lagrangian_loss: 0.003519, attention_score_distillation_loss: 0.000385
ETA: 0:03:10 | Epoch 146 finished. Took 64.51 seconds.
----------------------------------------------------------------------
time: 2023-07-19 17:16:20
Evaluating: pearson: 0.8251, eval_loss: 0.9071, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7127, expected_sequence_sparsity: 0.8619, target_sparsity: 0.7, step: 26500
lambda_1: -0.1272, lambda_2: 222.6243 lambda_3: 0.0000
train remain: [0.67 0.3 0.11 0.14 0.06 0.15 0.36 0.41 0.3 ]
infer remain: [0.63, 0.2, 0.1, 0.13, 0.07, 0.13, 0.37, 0.4, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
111111111101111101011001000000
100000000001001001010001000000
100010000000000000010000000000
100000000101001000000000000000
000000000000001000000000000001
100000001100001000000000000000
100010111101001011010000000000
000011111101001000110010100000
100110111100001000000000000000
Best eval score so far: 0.8662 @ step 12900 epoch 71.67
loss: 0.030218, lagrangian_loss: -0.000011, attention_score_distillation_loss: 0.000375
loss: 0.015533, lagrangian_loss: 0.001361, attention_score_distillation_loss: 0.000386
----------------------------------------------------------------------
time: 2023-07-19 17:16:55
Evaluating: pearson: 0.8422, eval_loss: 0.7559, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7127, expected_sequence_sparsity: 0.8619, target_sparsity: 0.7, step: 26600
lambda_1: -0.1149, lambda_2: 223.5965 lambda_3: 0.0000
train remain: [0.67 0.3 0.11 0.14 0.06 0.14 0.36 0.41 0.3 ]
infer remain: [0.63, 0.2, 0.1, 0.13, 0.07, 0.13, 0.37, 0.4, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
111111111101111101011001000000
100000000001001001010001000000
100010000000000000010000000000
100000000101001000000000000000
100000000000001000000000000000
100000001100001000000000000000
100010111101001011010000000000
000011111101001000110010100000
100110111100001000000000000000
Best eval score so far: 0.8662 @ step 12900 epoch 71.67
loss: 0.015316, lagrangian_loss: 0.001467, attention_score_distillation_loss: 0.000385
ETA: 0:02:06 | Epoch 147 finished. Took 64.63 seconds.
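Late in training the eval pearson swings widely (0.7284 at step 26100 up to 0.8422 at step 26600) while the best checkpoint stays the 0.8662 from step 12900, so plotting the whole curve is often more informative than eyeballing blocks. A small parsing sketch over this log format ("train.log" is a hypothetical file name for this output):

    import re

    # Extract (step, pearson) pairs from "Evaluating:" lines like the
    # ones printed throughout this log.
    pattern = re.compile(r"Evaluating: pearson: ([\d.]+),.*?step: (\d+)")
    with open("train.log") as f:  # hypothetical path to this log
        for m in pattern.finditer(f.read()):
            print(int(m.group(2)), float(m.group(1)))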
loss: 0.018985, lagrangian_loss: 0.000297, attention_score_distillation_loss: 0.000377
----------------------------------------------------------------------
time: 2023-07-19 17:17:30
Evaluating: pearson: 0.8443, eval_loss: 0.7525, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7127, expected_sequence_sparsity: 0.8619, target_sparsity: 0.7, step: 26700
lambda_1: -0.0555, lambda_2: 224.6607 lambda_3: 0.0000
train remain: [0.67 0.29 0.11 0.14 0.06 0.15 0.36 0.41 0.3 ]
infer remain: [0.63, 0.2, 0.1, 0.13, 0.07, 0.13, 0.37, 0.4, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
111111111101111101011001000000
100000000001001001010001000000
100010000000000000010000000000
100000000100001000010000000000
100000000000001000000000000000
100000001100001000000000000000
100010111101001011010000000000
000011111101001000110010100000
100110111100001000000000000000
Best eval score so far: 0.8662 @ step 12900 epoch 71.67
loss: 0.030508, lagrangian_loss: 0.001552, attention_score_distillation_loss: 0.000378
loss: 0.010468, lagrangian_loss: 0.000634, attention_score_distillation_loss: 0.000381
----------------------------------------------------------------------
time: 2023-07-19 17:18:05
Evaluating: pearson: 0.8181, eval_loss: 0.8974, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7127, expected_sequence_sparsity: 0.8619, target_sparsity: 0.7, step: 26800
lambda_1: -0.1691, lambda_2: 225.6419 lambda_3: 0.0000
train remain: [0.67 0.29 0.11 0.14 0.06 0.14 0.36 0.41 0.29]
infer remain: [0.63, 0.2, 0.1, 0.13, 0.07, 0.13, 0.37, 0.4, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
111111111101111101011001000000
100000000001001001010001000000
100010000000000000010000000000
101000000100001000000000000000
000000000000001000000000000001
100000001100001000000000000000
100010111101001011010000000000
100011111101001000110010000000
100110111100001000000000000000
Best eval score so far: 0.8662 @ step 12900 epoch 71.67
loss: 0.020318, lagrangian_loss: -0.000031, attention_score_distillation_loss: 0.000384
ETA: 0:01:03 | Epoch 148 finished. Took 64.47 seconds.
loss: 0.011219, lagrangian_loss: 0.000923, attention_score_distillation_loss: 0.000383
----------------------------------------------------------------------
time: 2023-07-19 17:18:40
Evaluating: pearson: 0.8267, eval_loss: 0.8997, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7127, expected_sequence_sparsity: 0.8619, target_sparsity: 0.7, step: 26900
lambda_1: -0.1710, lambda_2: 226.6552 lambda_3: 0.0000
train remain: [0.67 0.29 0.11 0.14 0.06 0.14 0.36 0.4 0.28]
infer remain: [0.63, 0.2, 0.1, 0.13, 0.07, 0.13, 0.37, 0.4, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
111111111101111101011001000000
100000000001001001010001000000
100010000000000000010000000000
101000000100001000000000000000
000000000100001000000000000000
100000001100001000000000000000
100010111101001011010000000000
000011111101001000110010100000
100110111100001000000000000000
Best eval score so far: 0.8662 @ step 12900 epoch 71.67
loss: 0.011498, lagrangian_loss: 0.000441, attention_score_distillation_loss: 0.000380
loss: 0.016741, lagrangian_loss: 0.001766, attention_score_distillation_loss: 0.000386
----------------------------------------------------------------------
time: 2023-07-19 17:19:15
Evaluating: pearson: 0.8422, eval_loss: 0.7528, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.7231, expected_sparsity: 0.7127, expected_sequence_sparsity: 0.8619, target_sparsity: 0.7, step: 27000
lambda_1: -0.0480, lambda_2: 227.8838 lambda_3: 0.0000
train remain: [0.67 0.3 0.11 0.15 0.07 0.15 0.36 0.4 0.28]
infer remain: [0.63, 0.2, 0.1, 0.13, 0.07, 0.13, 0.37, 0.4, 0.27]
layerwise remain: [1.0, 1.0, 1.0, 0.63, 0.13, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
111111111101111101011001000000
100000000001001001010001000000
100010000000000000010000000000
100000000100001000000000000001
100000000000001000000000000000
100000001100001000000000000000
100010111101001011010000000000
000011111101001000110010100000
100110111100001000000000000000
Best eval score so far: 0.8662 @ step 12900 epoch 71.67
ETA: 0:00:00 | Epoch 149 finished. Took 64.06 seconds.
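Finally, the gap between the soft "train remain" and hard "infer remain" vectors is what one would expect if the token gates are hard-concrete L0 variables: training logs an expected keep probability per bin, while inference thresholds the deterministic gate into the 0/1 rows above. A sketch under that assumption (the stretch limits, temperature, and function names below are illustrative, not the repo's API):

    import math
    import torch

    GAMMA, ZETA = -0.1, 1.1   # assumed hard-concrete stretch limits
    TEMPERATURE = 2.0 / 3.0   # assumed gate temperature

    def train_remain(token_loga: torch.Tensor) -> float:
        # Expected fraction of open gates, P(gate != 0), averaged over bins.
        p_open = torch.sigmoid(token_loga - TEMPERATURE * math.log(-GAMMA / ZETA))
        return p_open.mean().item()

    def infer_mask(token_loga: torch.Tensor) -> torch.Tensor:
        # Deterministic gate: stretch the sigmoid, clamp to [0, 1], then
        # keep any bin whose gate is strictly positive (0/1 rows as logged).
        s = torch.sigmoid(token_loga) * (ZETA - GAMMA) + GAMMA
        return (s.clamp(0.0, 1.0) > 0).float()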