/home/aiscuser/.local/lib/python3.8/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.24.4)
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
2023/07/19 14:47:58 WARNING mlflow.utils.autologging_utils: You are using an unsupported version of transformers. If you encounter errors during autologging, try upgrading / downgrading transformers to a supported version, or try upgrading MLflow.
2023/07/19 14:47:59 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.
2023/07/19 14:47:59 INFO mlflow.tracking.fluent: Autologging successfully enabled for transformers.
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
Downloading and preparing dataset glue/sst2 to /home/aiscuser/.cache/huggingface/datasets/glue/sst2/1.0.0/a420f5e518f42454003587c47467370329f9fc0c6508d1ae0c45b58ea266a353...
Downloading data: 0%| | 0.00/7.44M [00:00<?, ?B/s]
Training Arguments
TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
bf16=False,
bf16_full_eval=False,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_steps=500,
evaluation_strategy=IntervalStrategy.STEPS,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_strategy=HubStrategy.EVERY_SAVE,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=1e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=-1,
log_level=40,
log_level_replica=-1,
log_on_each_node=True,
logging_dir=/mnt/data/device-aware-bert/token_pruning/experiments/SST2/reproduce1/s0.4_lr1e-05_reglr0.04_alpha0.001_warmup10_bin25/runs/Jul19_14-48-00_node-0,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=100,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=40.0,
optim=OptimizerNames.ADAMW_HF,
output_dir=/mnt/data/device-aware-bert/token_pruning/experiments/SST2/reproduce1/s0.4_lr1e-05_reglr0.04_alpha0.001_warmup10_bin25,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=32,
per_device_train_batch_size=32,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
remove_unused_columns=True,
report_to=['mlflow'],
resume_from_checkpoint=None,
run_name=/mnt/data/device-aware-bert/token_pruning/experiments/SST2/reproduce1/s0.4_lr1e-05_reglr0.04_alpha0.001_warmup10_bin25,
save_on_each_node=False,
save_steps=0,
save_strategy=IntervalStrategy.STEPS,
save_total_limit=None,
seed=57,
sharded_ddp=[],
skip_memory_metrics=True,
tf32=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_legacy_prediction_loop=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
xpu_backend=None,
)
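The dump above is the standard repr of a `transformers.TrainingArguments` object. For reference, a minimal sketch that reconstructs the settings this run changes from their defaults (everything else matches the defaults printed above; this is an illustration, not the repository's launch script):

```python
from transformers import TrainingArguments

# Key hyperparameters from the dump; experiment directory from output_dir.
EXP_DIR = ("/mnt/data/device-aware-bert/token_pruning/experiments/SST2/"
           "reproduce1/s0.4_lr1e-05_reglr0.04_alpha0.001_warmup10_bin25")

training_args = TrainingArguments(
    output_dir=EXP_DIR,
    overwrite_output_dir=True,
    do_train=True,
    do_eval=True,
    evaluation_strategy="steps",   # evaluate every eval_steps
    eval_steps=500,
    logging_steps=100,
    learning_rate=1e-5,            # AdamW (optim=adamw_hf), linear decay, no warmup
    lr_scheduler_type="linear",
    num_train_epochs=40.0,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    seed=57,
    report_to=["mlflow"],
)
```

Note that save_steps=0 in the dump: periodic Trainer checkpointing is disabled because this run saves via its own best-model logic (see the "Saving the best model so far" lines later in the log). The `AdditionalArguments` dataclass that follows is specific to this token-pruning codebase and is consumed by the pruning and distillation logic rather than by the `Trainer` itself.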
Additional Arguments
AdditionalArguments(test=False, ex_name='s0.4_lr1e-05_reglr0.04_alpha0.001_warmup10_bin25', pruning_type='token+pruner', reg_learning_rate=0.04, scheduler_type='linear', freeze_embeddings=True, pretrained_pruned_model=None, droprate_init=0.01, temperature=0.6666666666666666, prepruning_finetune_epochs=1, lagrangian_warmup_epochs=10, target_sparsity=0.4, sparsity_epsilon=0, distillation_path='/mnt/data/device-aware-bert/token_pruning/teachers/SST2', do_distill=True, do_layer_distill=False, layer_distill_version=4, distill_loss_alpha=0.9, distill_ce_loss_alpha=0.001, distill_temp=2.0, use_mac_l0=True, prune_location=[2, 3, 4, 5, 6, 7, 8, 9, 10, 11], bin_num=25, topk=20)
----------------------------------------------------------------------
time: 2023-07-19 14:48:57
Evaluating: accuracy: 0.9323, eval_loss: 0.2955, step: 0
lambda_1: 0.0000, lambda_2: 0.0000, lambda_3: 0.0000
Starting l0 regularization! temperature: 0.67, init drop rate: 0.01
token_loga shape: [10, 25]
prune location: [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
NDCG TOPK= 20
loss: 0.155357, lagrangian_loss: 0.004811, attention_score_distillation_loss: 0.005180
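Everything that follows is driven by machinery the log only hints at: hard-concrete L0 gates over 25 token bins at each of the ten pruned layers, plus a Lagrangian controller that steers the expected sparsity toward a warmed-up target. The sketch below reconstructs this from the printed hyperparameters; the stretch limits (-0.1, 1.1), the warmup step count, and the exact penalty form follow the Louizos et al. L0 / CoFi-style conventions and are assumptions, not this repository's code.

```python
import math
import torch

LIMIT_L, LIMIT_R = -0.1, 1.1   # assumed hard-concrete stretch limits
TEMPERATURE = 2.0 / 3.0        # temperature=0.6666... in AdditionalArguments
DROPRATE_INIT = 0.01           # "init drop rate: 0.01"

# One learnable log-alpha per (pruned layer, token bin): "token_loga shape: [10, 25]".
init = math.log(1 - DROPRATE_INIT) - math.log(DROPRATE_INIT)
token_loga = torch.full((10, 25), init)

def expected_remain(loga):
    """P(gate > 0) under the stretched hard concrete; the per-layer mean of
    this matrix is what the log prints as 'train remain'."""
    return torch.sigmoid(loga - TEMPERATURE * math.log(-LIMIT_L / LIMIT_R))

def target_sparsity_at(step, final=0.4, warmup_steps=21050):
    # lagrangian_warmup_epochs=10 at ~2105 steps/epoch (SST-2, batch 32);
    # reproduces the logged schedule: 0.0095 at step 500, 0.4 from ~21050 on.
    return final * min(1.0, step / warmup_steps)

def lagrangian_loss(expected_sparsity, step, lambda_1, lambda_2):
    # Constraint penalty: zero exactly at the target; the lambda_1/lambda_2
    # values printed in each record scale the two terms.
    gap = expected_sparsity - target_sparsity_at(step)
    return lambda_1 * gap + lambda_2 * gap ** 2
```

With this initialization, expected_remain(token_loga) is roughly 0.998 per bin, which is why "train remain" starts at 1.0 everywhere below; at inference the gates are binarized per bin, giving the 25-character mask rows printed under each evaluation.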
----------------------------------------------------------------------
time: 2023-07-19 14:50:27
Evaluating: accuracy: 0.9289, eval_loss: 0.3444, token_prune_loc: [False, False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.6119, target_sparsity: 0.0095, step: 500
lambda_1: 1.3281, lambda_2: 9.5986, lambda_3: 0.0000
train remain: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111
loss: 0.006962, lagrangian_loss: -0.008403, attention_score_distillation_loss: 0.004522
loss: 0.040866, lagrangian_loss: -0.021373, attention_score_distillation_loss: 0.006085
----------------------------------------------------------------------
time: 2023-07-19 14:51:56
Evaluating: accuracy: 0.9289, eval_loss: 0.3043, token_prune_loc: [False, False, False, False, False, False, False, False, True, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.6119, target_sparsity: 0.019, step: 1000
lambda_1: -2.0315, lambda_2: 19.8813, lambda_3: 0.0000
train remain: [1. 1. 1. 1. 1. 1. 1. 1. 0.99 1. ]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111
loss: 0.028482, lagrangian_loss: 0.036377, attention_score_distillation_loss: 0.004161
loss: 0.183112, lagrangian_loss: 0.003172, attention_score_distillation_loss: 0.004896
----------------------------------------------------------------------
time: 2023-07-19 14:53:27
Evaluating: accuracy: 0.9278, eval_loss: 0.3021, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.0178, expected_sparsity: 0.0169, expected_sequence_sparsity: 0.6185, target_sparsity: 0.0285, step: 1500
lambda_1: 1.0045, lambda_2: 25.0687, lambda_3: 0.0000
train remain: [1. 0.99 1. 1. 1. 1. 1. 0.99 0.91 0.99]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.92]
1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110111110111 1111111111111111111111111
loss: 0.021260, lagrangian_loss: -0.000303, attention_score_distillation_loss: 0.005491
loss: 0.176446, lagrangian_loss: 0.002407, attention_score_distillation_loss: 0.004426
----------------------------------------------------------------------
time: 2023-07-19 14:54:57
Evaluating: accuracy: 0.9255, eval_loss: 0.3446, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.0323, expected_sparsity: 0.0301, expected_sequence_sparsity: 0.6236, target_sparsity: 0.038, step: 2000
lambda_1: 0.1692, lambda_2: 25.6928, lambda_3: 0.0000
train remain: [1. 0.99 1. 1. 1. 1. 1. 0.99 0.88 0.95]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.88, 0.96]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.88, 0.84]
1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110111110011 1111111111111011111111111
loss: 0.008688, lagrangian_loss: 0.000906, attention_score_distillation_loss: 0.005019
ETA: 4:05:57 | Epoch 0 finished. Took 378.39 seconds.
loss: 0.029153, lagrangian_loss: 0.000659, attention_score_distillation_loss: 0.005779
----------------------------------------------------------------------
time: 2023-07-19 14:56:27
Evaluating: accuracy: 0.9243, eval_loss: 0.321, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.0523, expected_sparsity: 0.0472, expected_sequence_sparsity: 0.6303, target_sparsity: 0.0475, step: 2500
lambda_1: 0.1178, lambda_2: 26.1422, lambda_3: 0.0000
train remain: [1. 0.99 1. 1. 1. 1. 1. 0.98 0.84 0.9 ]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.88]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.74]
1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111011111110111110011 1111111111111001101111111
loss: 0.225283, lagrangian_loss: 0.000981, attention_score_distillation_loss: 0.004351
loss: 0.014326, lagrangian_loss: 0.000840, attention_score_distillation_loss: 0.004253
----------------------------------------------------------------------
time: 2023-07-19 14:57:57
Evaluating: accuracy: 0.9255, eval_loss: 0.3366, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.0612, expected_sparsity: 0.0592, expected_sequence_sparsity: 0.6351, target_sparsity: 0.057, step: 3000
lambda_1: 0.4644, lambda_2: 27.2006, lambda_3: 0.0000
train remain: [1. 0.99 1. 1. 1. 1. 1.
0.98 0.82 0.87] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.84] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.67] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111011111110111110001 1111111111111001101111011 loss: 0.108720, lagrangian_loss: 0.000790, attention_score_distillation_loss: 0.003481 loss: 0.080188, lagrangian_loss: 0.007772, attention_score_distillation_loss: 0.004560 ---------------------------------------------------------------------- time: 2023-07-19 14:59:27 Evaluating: accuracy: 0.9266, eval_loss: 0.3362, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.0612, expected_sparsity: 0.0592, expected_sequence_sparsity: 0.6351, target_sparsity: 0.0665, step: 3500 lambda_1: -1.0033, lambda_2: 30.6546 lambda_3: 0.0000 train remain: [1. 0.99 1. 1. 1. 1. 1. 0.99 0.81 0.86] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.84] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.67] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111011111110111110001 1111111111111001101111011 loss: 0.087554, lagrangian_loss: 0.006383, attention_score_distillation_loss: 0.004199 loss: 0.010215, lagrangian_loss: 0.002603, attention_score_distillation_loss: 0.003189 ---------------------------------------------------------------------- time: 2023-07-19 15:00:57 Evaluating: accuracy: 0.93, eval_loss: 0.3314, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.0667, expected_sparsity: 0.0635, expected_sequence_sparsity: 0.6367, target_sparsity: 0.076, step: 4000 lambda_1: -0.5176, lambda_2: 33.8813 lambda_3: 0.0000 train remain: [1. 0.99 1. 1. 1. 1. 1. 0.99 0.81 0.78] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.8] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.64] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111011111110111110001 1111111111111001101101011 loss: 0.009859, lagrangian_loss: 0.006298, attention_score_distillation_loss: 0.004333 ETA: 3:59:36 | Epoch 1 finished. Took 378.25 seconds. loss: 0.051151, lagrangian_loss: 0.001607, attention_score_distillation_loss: 0.004309 ---------------------------------------------------------------------- time: 2023-07-19 15:02:27 Evaluating: accuracy: 0.9266, eval_loss: 0.3515, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.0834, expected_sparsity: 0.0761, expected_sequence_sparsity: 0.6417, target_sparsity: 0.0855, step: 4500 lambda_1: -0.0091, lambda_2: 36.8318 lambda_3: 0.0000 train remain: [1. 0.99 1. 1. 1. 1. 1. 
0.99 0.79 0.67] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.68] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.54] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111011111110111110001 1111111101111000001101011 loss: 0.008536, lagrangian_loss: 0.000317, attention_score_distillation_loss: 0.004327 loss: 0.036679, lagrangian_loss: 0.001312, attention_score_distillation_loss: 0.004933 ---------------------------------------------------------------------- time: 2023-07-19 15:03:57 Evaluating: accuracy: 0.9266, eval_loss: 0.3407, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.0923, expected_sparsity: 0.0909, expected_sequence_sparsity: 0.6475, target_sparsity: 0.095, step: 5000 lambda_1: 0.4635, lambda_2: 37.8125 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 1. 1. 0.98 0.76 0.6 ] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.6] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.46] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111011111110111110000 1011111101111000001001011 loss: 0.003988, lagrangian_loss: -0.000048, attention_score_distillation_loss: 0.004565 loss: 0.121249, lagrangian_loss: 0.002231, attention_score_distillation_loss: 0.003795 ---------------------------------------------------------------------- time: 2023-07-19 15:05:27 Evaluating: accuracy: 0.9312, eval_loss: 0.3027, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.1122, expected_sparsity: 0.1048, expected_sequence_sparsity: 0.6529, target_sparsity: 0.1045, step: 5500 lambda_1: -1.4023, lambda_2: 40.6654 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 1. 1. 0.96 0.73 0.52] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.72, 0.52] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.72, 0.37] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111011111110111100000 1011110001111000001001011 loss: 0.098499, lagrangian_loss: -0.004970, attention_score_distillation_loss: 0.003548 loss: 0.051845, lagrangian_loss: 0.002930, attention_score_distillation_loss: 0.003697 ---------------------------------------------------------------------- time: 2023-07-19 15:06:57 Evaluating: accuracy: 0.9243, eval_loss: 0.2995, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.1122, expected_sparsity: 0.1086, expected_sequence_sparsity: 0.6544, target_sparsity: 0.114, step: 6000 lambda_1: 0.6548, lambda_2: 46.3613 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 1. 1. 
0.98 0.73 0.49] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.72, 0.48] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.72, 0.35] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111011111110111100000 1011100001111000001001011 loss: 0.007393, lagrangian_loss: -0.001965, attention_score_distillation_loss: 0.002874 loss: 0.061582, lagrangian_loss: -0.001227, attention_score_distillation_loss: 0.003754 ETA: 3:53:00 | Epoch 2 finished. Took 376.91 seconds. ---------------------------------------------------------------------- time: 2023-07-19 15:08:26 Evaluating: accuracy: 0.9312, eval_loss: 0.3044, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.1211, expected_sparsity: 0.1179, expected_sequence_sparsity: 0.6581, target_sparsity: 0.1235, step: 6500 lambda_1: 0.3385, lambda_2: 50.3146 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 1. 1. 0.95 0.68 0.42] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.68, 0.44] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.68, 0.3] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111101011111110111100000 1011110001101000001000011 loss: 0.014733, lagrangian_loss: 0.000137, attention_score_distillation_loss: 0.003701 loss: 0.155054, lagrangian_loss: -0.000038, attention_score_distillation_loss: 0.003613 ---------------------------------------------------------------------- time: 2023-07-19 15:09:57 Evaluating: accuracy: 0.9278, eval_loss: 0.3334, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1556, expected_sparsity: 0.1486, expected_sequence_sparsity: 0.6701, target_sparsity: 0.133, step: 7000 lambda_1: 0.0394, lambda_2: 52.7942 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 1. 1. 0.89 0.67 0.38] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.68, 0.4] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.57, 0.23] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111011111110100 1111101011111110111100000 1011100001101000001000011 loss: 0.179367, lagrangian_loss: 0.003221, attention_score_distillation_loss: 0.004066 loss: 0.031130, lagrangian_loss: 0.001172, attention_score_distillation_loss: 0.003428 ---------------------------------------------------------------------- time: 2023-07-19 15:11:26 Evaluating: accuracy: 0.9335, eval_loss: 0.3093, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1611, expected_sparsity: 0.1546, expected_sequence_sparsity: 0.6724, target_sparsity: 0.1425, step: 7500 lambda_1: -0.5608, lambda_2: 56.8670 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 1. 1. 
0.88 0.66 0.34] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.68, 0.32] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.57, 0.18] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111011111110100 1111101011111110111100000 1011100001101000000000010 loss: 0.009305, lagrangian_loss: -0.001350, attention_score_distillation_loss: 0.004126 loss: 0.004649, lagrangian_loss: -0.000004, attention_score_distillation_loss: 0.002670 ---------------------------------------------------------------------- time: 2023-07-19 15:12:56 Evaluating: accuracy: 0.9323, eval_loss: 0.3185, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1733, expected_sparsity: 0.1649, expected_sequence_sparsity: 0.6765, target_sparsity: 0.152, step: 8000 lambda_1: 0.5161, lambda_2: 62.2754 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 1. 1. 0.84 0.65 0.34] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.64, 0.32] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.51, 0.16] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111011110110100 1111100011111110111100000 1011100001101000000000010 loss: 0.172671, lagrangian_loss: -0.001068, attention_score_distillation_loss: 0.002684 loss: 0.014831, lagrangian_loss: 0.000837, attention_score_distillation_loss: 0.003171 ETA: 3:46:39 | Epoch 3 finished. Took 377.48 seconds. ---------------------------------------------------------------------- time: 2023-07-19 15:14:26 Evaluating: accuracy: 0.9289, eval_loss: 0.306, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.18, expected_sparsity: 0.1749, expected_sequence_sparsity: 0.6804, target_sparsity: 0.1615, step: 8500 lambda_1: -0.3289, lambda_2: 66.5334 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 1. 0.99 0.79 0.62 0.31] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.6, 0.32] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.46, 0.15] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111011110100100 1111100011111110111000000 1011100001101000000000010 loss: 0.024292, lagrangian_loss: 0.000257, attention_score_distillation_loss: 0.002809 loss: 0.003360, lagrangian_loss: 0.000544, attention_score_distillation_loss: 0.003390 ---------------------------------------------------------------------- time: 2023-07-19 15:15:56 Evaluating: accuracy: 0.9323, eval_loss: 0.3061, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1922, expected_sparsity: 0.1833, expected_sequence_sparsity: 0.6837, target_sparsity: 0.171, step: 9000 lambda_1: -0.0414, lambda_2: 70.2247 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 1. 
0.99 0.76 0.6 0.27] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.72, 0.6, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.72, 0.43, 0.12] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111011110100000 1111100011111110111000000 1011100001101000000000000 loss: 0.018423, lagrangian_loss: 0.000380, attention_score_distillation_loss: 0.003018 loss: 0.006017, lagrangian_loss: 0.000093, attention_score_distillation_loss: 0.003399 ---------------------------------------------------------------------- time: 2023-07-19 15:17:26 Evaluating: accuracy: 0.9266, eval_loss: 0.3326, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1955, expected_sparsity: 0.1887, expected_sequence_sparsity: 0.6858, target_sparsity: 0.1805, step: 9500 lambda_1: -0.3810, lambda_2: 74.8933 lambda_3: 0.0000 train remain: [1. 1. 1. 0.99 1. 0.99 1. 0.74 0.58 0.24] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.72, 0.56, 0.24] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.72, 0.4, 0.1] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111011110100000 1111100011011110111000000 1001100001101000000000000 loss: 0.014049, lagrangian_loss: -0.000359, attention_score_distillation_loss: 0.002617 loss: 0.008354, lagrangian_loss: -0.003637, attention_score_distillation_loss: 0.003165 ---------------------------------------------------------------------- time: 2023-07-19 15:18:56 Evaluating: accuracy: 0.9266, eval_loss: 0.3274, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1922, expected_sparsity: 0.1855, expected_sequence_sparsity: 0.6846, target_sparsity: 0.19, step: 10000 lambda_1: 0.2728, lambda_2: 84.7973 lambda_3: 0.0000 train remain: [0.99 1. 1. 0.99 0.99 0.99 0.99 0.74 0.6 0.26] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.72, 0.6, 0.24] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.72, 0.43, 0.1] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111011110100000 1111100011011110111000010 1001100000101010000000000 loss: 0.035517, lagrangian_loss: -0.000120, attention_score_distillation_loss: 0.003169 loss: 0.009681, lagrangian_loss: 0.000505, attention_score_distillation_loss: 0.002792 ---------------------------------------------------------------------- time: 2023-07-19 15:20:26 Evaluating: accuracy: 0.93, eval_loss: 0.3142, token_prune_loc: [False, False, False, False, True, False, False, True, True, True], macs_sparsity: 0.2256, expected_sparsity: 0.216, expected_sequence_sparsity: 0.6965, target_sparsity: 0.1995, step: 10500 lambda_1: -0.3492, lambda_2: 97.0002 lambda_3: 0.0000 train remain: [0.99 1. 1. 0.99 0.96 1. 
0.99 0.73 0.57 0.24] infer remain: [1.0, 1.0, 1.0, 1.0, 0.92, 1.0, 1.0, 0.72, 0.56, 0.24] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.92, 0.92, 0.66, 0.37, 0.09] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111110110 1111111111111111111111111 1111111111111111111111111 1111111111111011110100000 1111100011011110111000000 1001100000101010000000000 loss: 0.080032, lagrangian_loss: 0.001015, attention_score_distillation_loss: 0.002486 ETA: 3:40:50 | Epoch 4 finished. Took 381.93 seconds. loss: 0.021222, lagrangian_loss: 0.000672, attention_score_distillation_loss: 0.002489 ---------------------------------------------------------------------- time: 2023-07-19 15:21:55 Evaluating: accuracy: 0.9255, eval_loss: 0.3376, token_prune_loc: [False, False, False, False, False, False, True, True, True, True], macs_sparsity: 0.2144, expected_sparsity: 0.2068, expected_sequence_sparsity: 0.6929, target_sparsity: 0.209, step: 11000 lambda_1: -0.2302, lambda_2: 104.3252 lambda_3: 0.0000 train remain: [0.99 1. 1. 0.99 0.97 1. 0.98 0.73 0.46 0.2 ] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.96, 0.72, 0.44, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.96, 0.69, 0.3, 0.06] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111110 1111111111111011110100000 1111100010010010111000000 1001100000101000000000000 loss: 0.221109, lagrangian_loss: 0.006300, attention_score_distillation_loss: 0.002147 loss: 0.023150, lagrangian_loss: 0.001558, attention_score_distillation_loss: 0.001901 ---------------------------------------------------------------------- time: 2023-07-19 15:23:25 Evaluating: accuracy: 0.9255, eval_loss: 0.33, token_prune_loc: [False, False, False, False, False, False, True, True, True, True], macs_sparsity: 0.2233, expected_sparsity: 0.2111, expected_sequence_sparsity: 0.6946, target_sparsity: 0.2185, step: 11500 lambda_1: -0.5328, lambda_2: 114.2968 lambda_3: 0.0000 train remain: [0.99 1. 1. 0.98 0.97 1. 
0.97 0.73 0.39 0.16] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.96, 0.72, 0.4, 0.16] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.96, 0.69, 0.28, 0.04] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111110 1111111111111011110100000 1111100010010010110000000 1000100000100000000000001 loss: 0.009923, lagrangian_loss: -0.000323, attention_score_distillation_loss: 0.002611 loss: 0.070220, lagrangian_loss: 0.000519, attention_score_distillation_loss: 0.002231 ---------------------------------------------------------------------- time: 2023-07-19 15:24:55 Evaluating: accuracy: 0.9243, eval_loss: 0.3453, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.2166, expected_sparsity: 0.2074, expected_sequence_sparsity: 0.6931, target_sparsity: 0.228, step: 12000 lambda_1: -0.5192, lambda_2: 127.7936 lambda_3: 0.0000 train remain: [0.99 0.99 0.99 0.98 0.96 0.99 0.98 0.72 0.38 0.16] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.72, 0.36, 0.16] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.72, 0.26, 0.04] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111011110100000 0111100010010010110000000 1000100000100000000000001 loss: 0.064717, lagrangian_loss: -0.000171, attention_score_distillation_loss: 0.002370 loss: 0.008864, lagrangian_loss: 0.001751, attention_score_distillation_loss: 0.002193 ---------------------------------------------------------------------- time: 2023-07-19 15:26:25 Evaluating: accuracy: 0.9243, eval_loss: 0.3443, token_prune_loc: [False, False, False, False, True, False, True, True, True, True], macs_sparsity: 0.26, expected_sparsity: 0.249, expected_sequence_sparsity: 0.7095, target_sparsity: 0.2375, step: 12500 lambda_1: -0.3350, lambda_2: 138.8287 lambda_3: 0.0000 train remain: [0.99 0.99 0.98 0.98 0.96 0.99 0.96 0.69 0.35 0.15] infer remain: [1.0, 1.0, 1.0, 1.0, 0.92, 1.0, 0.92, 0.68, 0.36, 0.16] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.92, 0.85, 0.58, 0.21, 0.03] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111110110 1111111111111111111111111 1111111111111111111011110 1111111111011011110100000 0111100010010010110000000 1000100000100000000001000 loss: 0.232064, lagrangian_loss: -0.000202, attention_score_distillation_loss: 0.001972 ETA: 3:34:27 | Epoch 5 finished. Took 377.78 seconds. loss: 0.008599, lagrangian_loss: -0.000286, attention_score_distillation_loss: 0.002237 ---------------------------------------------------------------------- time: 2023-07-19 15:27:55 Evaluating: accuracy: 0.9209, eval_loss: 0.3468, token_prune_loc: [False, False, False, False, True, False, True, True, True, True], macs_sparsity: 0.2667, expected_sparsity: 0.2559, expected_sequence_sparsity: 0.7121, target_sparsity: 0.247, step: 13000 lambda_1: -0.2239, lambda_2: 151.3447 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.97 0.95 0.97 0.91 0.67 0.35 0.13] infer remain: [1.0, 1.0, 1.0, 1.0, 0.92, 1.0, 0.88, 0.68, 0.36, 0.12] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.92, 0.81, 0.55, 0.2, 0.02] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111110110 1111111111111111111111111 1111111111111111111011100 1111111111011011110100000 0111100010000010110001000 1000100000000000001000000 loss: 0.051538, lagrangian_loss: 0.003540, attention_score_distillation_loss: 0.002079 loss: 0.009448, lagrangian_loss: 0.000019, attention_score_distillation_loss: 0.001888 ---------------------------------------------------------------------- time: 2023-07-19 15:29:25 Evaluating: accuracy: 0.9323, eval_loss: 0.3031, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2834, expected_sparsity: 0.274, expected_sequence_sparsity: 0.7192, target_sparsity: 0.2565, step: 13500 lambda_1: 0.0737, lambda_2: 163.3103 lambda_3: 0.0000 train remain: [0.99 1. 1. 0.99 0.97 0.92 0.88 0.66 0.34 0.13] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.84, 0.68, 0.32, 0.12] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.71, 0.48, 0.15, 0.02] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111110000 1111111111111110111011100 1111111111011011110100000 0111100010000010110000000 1000100000000000000000001 loss: 0.114277, lagrangian_loss: 0.001444, attention_score_distillation_loss: 0.001766 loss: 0.016524, lagrangian_loss: 0.001059, attention_score_distillation_loss: 0.001796 ---------------------------------------------------------------------- time: 2023-07-19 15:30:55 Evaluating: accuracy: 0.9266, eval_loss: 0.3426, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.3068, expected_sparsity: 0.2972, expected_sequence_sparsity: 0.7284, target_sparsity: 0.266, step: 14000 lambda_1: -0.3037, lambda_2: 174.0812 lambda_3: 0.0000 train remain: [1. 1. 1. 0.97 0.95 0.91 0.87 0.66 0.32 0.12] infer remain: [1.0, 1.0, 1.0, 1.0, 0.92, 0.84, 0.84, 0.64, 0.32, 0.12] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.77, 0.65, 0.42, 0.13, 0.02] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111110110 1111111111111111111110000 1111111111111110111011100 0111111111011011110100000 0111100010000010110000000 1000100000000000000000001 loss: 0.045118, lagrangian_loss: 0.002866, attention_score_distillation_loss: 0.001582 loss: 0.037236, lagrangian_loss: 0.008492, attention_score_distillation_loss: 0.001601 ---------------------------------------------------------------------- time: 2023-07-19 15:32:25 Evaluating: accuracy: 0.9255, eval_loss: 0.3484, token_prune_loc: [False, False, False, False, False, False, True, True, True, True], macs_sparsity: 0.26, expected_sparsity: 0.2521, expected_sequence_sparsity: 0.7107, target_sparsity: 0.2755, step: 14500 lambda_1: -0.2446, lambda_2: 188.3178 lambda_3: 0.0000 train remain: [0.99 1. 1. 
0.94 0.97 0.93 0.82 0.65 0.28 0.13] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.64, 0.28, 0.12] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.51, 0.14, 0.02] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110111011000 0111111111011011110100000 0111100010000010100000000 1000101000000000000000000 loss: 0.045942, lagrangian_loss: 0.001020, attention_score_distillation_loss: 0.001363 ETA: 3:28:05 | Epoch 6 finished. Took 377.63 seconds. loss: 0.311537, lagrangian_loss: 0.003741, attention_score_distillation_loss: 0.001749 ---------------------------------------------------------------------- time: 2023-07-19 15:33:55 Evaluating: accuracy: 0.9209, eval_loss: 0.3631, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.3034, expected_sparsity: 0.2916, expected_sequence_sparsity: 0.7261, target_sparsity: 0.285, step: 15000 lambda_1: -0.0262, lambda_2: 200.2944 lambda_3: 0.0000 train remain: [1. 1. 1. 0.94 0.97 0.88 0.84 0.66 0.28 0.13] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.8, 0.64, 0.28, 0.12] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.64, 0.41, 0.11, 0.01] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111100000 1111111111111110111011000 0111111111011011110100000 0110100010000010100000100 1000100000000000000000001 loss: 0.070817, lagrangian_loss: 0.010246, attention_score_distillation_loss: 0.001247 loss: 0.007067, lagrangian_loss: 0.000312, attention_score_distillation_loss: 0.001414 ---------------------------------------------------------------------- time: 2023-07-19 15:35:25 Evaluating: accuracy: 0.9243, eval_loss: 0.3757, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.3301, expected_sparsity: 0.3178, expected_sequence_sparsity: 0.7364, target_sparsity: 0.2945, step: 15500 lambda_1: -0.2126, lambda_2: 214.0570 lambda_3: 0.0000 train remain: [0.99 0.99 1. 0.95 0.95 0.83 0.82 0.65 0.28 0.12] infer remain: [1.0, 1.0, 1.0, 1.0, 0.92, 0.76, 0.8, 0.64, 0.28, 0.12] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.7, 0.56, 0.36, 0.1, 0.01] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111110110 1111111111111110111100000 1111111111111110111011000 0111111111011011110100000 1110100010000010100000000 1000100000100000000000000 loss: 0.050685, lagrangian_loss: 0.007757, attention_score_distillation_loss: 0.001605 loss: 0.030875, lagrangian_loss: -0.000290, attention_score_distillation_loss: 0.001264 ---------------------------------------------------------------------- time: 2023-07-19 15:36:56 Evaluating: accuracy: 0.9278, eval_loss: 0.3263, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.3301, expected_sparsity: 0.3184, expected_sequence_sparsity: 0.7366, target_sparsity: 0.304, step: 16000 lambda_1: -0.5446, lambda_2: 225.3073 lambda_3: 0.0000 train remain: [0.99 1. 1. 
0.94 0.93 0.81 0.82 0.64 0.28 0.09] infer remain: [1.0, 1.0, 1.0, 1.0, 0.92, 0.76, 0.8, 0.64, 0.28, 0.08] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.7, 0.56, 0.36, 0.1, 0.01] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111110110 1111111111111110111100000 1111111111111110111011000 0111111111011011110100000 1110100010000010100000000 1000100000000000000000000 loss: 0.040715, lagrangian_loss: 0.002310, attention_score_distillation_loss: 0.001337 loss: 0.041333, lagrangian_loss: 0.011079, attention_score_distillation_loss: 0.001058 ---------------------------------------------------------------------- time: 2023-07-19 15:38:25 Evaluating: accuracy: 0.922, eval_loss: 0.3797, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.3101, expected_sparsity: 0.3014, expected_sequence_sparsity: 0.73, target_sparsity: 0.3135, step: 16500 lambda_1: -0.2081, lambda_2: 237.4823 lambda_3: 0.0000 train remain: [0.99 1. 1. 0.92 0.94 0.78 0.82 0.62 0.26 0.09] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.8, 0.64, 0.24, 0.08] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.61, 0.39, 0.09, 0.01] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110111100000 1111111111111110111011000 0111111111011011110100000 0110100010000010100000000 1000000000001000000000000 loss: 0.013925, lagrangian_loss: -0.000036, attention_score_distillation_loss: 0.001068 loss: 0.062891, lagrangian_loss: 0.000060, attention_score_distillation_loss: 0.001046 ETA: 3:21:45 | Epoch 7 finished. Took 377.88 seconds. ---------------------------------------------------------------------- time: 2023-07-19 15:39:55 Evaluating: accuracy: 0.9174, eval_loss: 0.3769, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.3902, expected_sparsity: 0.3752, expected_sequence_sparsity: 0.7589, target_sparsity: 0.323, step: 17000 lambda_1: -0.3345, lambda_2: 250.9405 lambda_3: 0.0000 train remain: [1. 0.99 1. 0.91 0.93 0.77 0.82 0.62 0.24 0.09] infer remain: [1.0, 1.0, 1.0, 0.84, 0.88, 0.76, 0.8, 0.6, 0.24, 0.08] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.74, 0.56, 0.45, 0.27, 0.06, 0.01] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101111100 1111111111111111111110100 1111111111111110111100000 1111111111111110111011000 0111111111011010110100000 0110100010000010100000000 1000000010000000000000000 loss: 0.045489, lagrangian_loss: 0.000514, attention_score_distillation_loss: 0.001010 loss: 0.020274, lagrangian_loss: 0.002810, attention_score_distillation_loss: 0.000836 ---------------------------------------------------------------------- time: 2023-07-19 15:41:25 Evaluating: accuracy: 0.9186, eval_loss: 0.3662, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.3902, expected_sparsity: 0.3752, expected_sequence_sparsity: 0.7589, target_sparsity: 0.3325, step: 17500 lambda_1: -0.3834, lambda_2: 262.7237 lambda_3: 0.0000 train remain: [0.99 0.99 1. 
0.88 0.91 0.77 0.83 0.63 0.25 0.09] infer remain: [1.0, 1.0, 1.0, 0.84, 0.88, 0.76, 0.8, 0.6, 0.24, 0.08] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.74, 0.56, 0.45, 0.27, 0.06, 0.01] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111111101101100 1111111111111111111110100 1111111111111110111100000 1111111111111110111011000 0111111111011010110100000 1110100010000010000000000 1000000010000000000000000 loss: 0.008387, lagrangian_loss: 0.000690, attention_score_distillation_loss: 0.000897 loss: 0.025521, lagrangian_loss: 0.001207, attention_score_distillation_loss: 0.000878 ---------------------------------------------------------------------- time: 2023-07-19 15:42:55 Evaluating: accuracy: 0.9163, eval_loss: 0.4031, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4036, expected_sparsity: 0.3935, expected_sequence_sparsity: 0.7661, target_sparsity: 0.342, step: 18000 lambda_1: -0.2471, lambda_2: 276.1984 lambda_3: 0.0000 train remain: [1. 0.99 1. 0.88 0.87 0.77 0.82 0.62 0.24 0.09] infer remain: [1.0, 1.0, 1.0, 0.8, 0.84, 0.76, 0.8, 0.6, 0.24, 0.08] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.67, 0.51, 0.41, 0.25, 0.06, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101100 1111111111111111111110000 1111111111111110101110000 1111111111111110111011000 0111111111011010110100000 1110100010000010000000000 1000000000001000000000000 loss: 0.013237, lagrangian_loss: 0.013013, attention_score_distillation_loss: 0.000661 loss: 0.703443, lagrangian_loss: 0.062650, attention_score_distillation_loss: 0.000531 ---------------------------------------------------------------------- time: 2023-07-19 15:44:24 Evaluating: accuracy: 0.9186, eval_loss: 0.3741, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4102, expected_sparsity: 0.3995, expected_sequence_sparsity: 0.7684, target_sparsity: 0.3515, step: 18500 lambda_1: -0.7234, lambda_2: 288.0484 lambda_3: 0.0000 train remain: [0.99 1. 1. 0.85 0.87 0.77 0.82 0.62 0.22 0.09] infer remain: [1.0, 1.0, 1.0, 0.8, 0.84, 0.72, 0.8, 0.6, 0.2, 0.08] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.67, 0.48, 0.39, 0.23, 0.05, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101100 1111111111111111111110000 1111111111111110101100000 1111111111111110111011000 0111111111011010110100000 0110100010000010000000000 1000000000000000000000001 loss: 0.010800, lagrangian_loss: 0.000563, attention_score_distillation_loss: 0.000612 loss: 0.077505, lagrangian_loss: 0.009510, attention_score_distillation_loss: 0.000497 ETA: 3:15:22 | Epoch 8 finished. Took 377.04 seconds. ---------------------------------------------------------------------- time: 2023-07-19 15:45:54 Evaluating: accuracy: 0.914, eval_loss: 0.4015, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4102, expected_sparsity: 0.3995, expected_sequence_sparsity: 0.7684, target_sparsity: 0.361, step: 19000 lambda_1: -0.7331, lambda_2: 300.5385 lambda_3: 0.0000 train remain: [0.99 1. 1. 
0.83 0.86 0.74 0.82 0.62 0.22 0.09] infer remain: [1.0, 1.0, 1.0, 0.8, 0.84, 0.72, 0.8, 0.6, 0.2, 0.08] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.67, 0.48, 0.39, 0.23, 0.05, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101100 1111111111111111111110000 1111111111111110101100000 1111111111111110111011000 0111111111011010110100000 0110100010000000001000000 1000000000000000000000001 loss: 0.007223, lagrangian_loss: 0.016090, attention_score_distillation_loss: 0.000555 loss: 0.011439, lagrangian_loss: 0.000215, attention_score_distillation_loss: 0.000448 Starting saving the best from epoch 9 and step 19500 ---------------------------------------------------------------------- time: 2023-07-19 15:47:24 Evaluating: accuracy: 0.9117, eval_loss: 0.3985, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4202, expected_sparsity: 0.4072, expected_sequence_sparsity: 0.7714, target_sparsity: 0.3705, step: 19500 lambda_1: -0.6042, lambda_2: 313.5609 lambda_3: 0.0000 train remain: [1. 0.99 1. 0.82 0.82 0.74 0.81 0.62 0.17 0.09] infer remain: [1.0, 1.0, 1.0, 0.8, 0.8, 0.72, 0.8, 0.6, 0.16, 0.08] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.64, 0.46, 0.37, 0.22, 0.04, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101100 1111111111111110111110000 1111111111111110101100000 1111111111111110111011000 0111111111011010110100000 0010100010000000001000000 1000000000001000000000000 Saving the best model so far: [Epoch 9 | Step: 19500 | MACs sparsity: 0.4202 | Score: 0.9117 | Loss: 0.3985] loss: 0.021915, lagrangian_loss: 0.071607, attention_score_distillation_loss: 0.000287 loss: 0.019239, lagrangian_loss: 0.002265, attention_score_distillation_loss: 0.000318 ---------------------------------------------------------------------- time: 2023-07-19 15:49:51 Evaluating: accuracy: 0.9128, eval_loss: 0.396, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4269, expected_sparsity: 0.414, expected_sequence_sparsity: 0.7741, target_sparsity: 0.38, step: 20000 lambda_1: -0.7261, lambda_2: 326.3415 lambda_3: 0.0000 train remain: [0.99 0.99 1. 0.82 0.78 0.74 0.81 0.62 0.17 0.09] infer remain: [1.0, 1.0, 1.0, 0.8, 0.76, 0.72, 0.8, 0.6, 0.16, 0.08] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.61, 0.44, 0.35, 0.21, 0.03, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101100 1111111111111110101110000 1111111111111110101100000 1111111111111110111011000 0111111111011010110100000 0010100010000000001000000 1000000000000000000000001 Best eval score so far: 0.9117 @ step 19500 epoch 9.26 Saving the best model so far: [Epoch 9 | Step: 20000 | MACs sparsity: 0.4269 | Score: 0.9128 | Loss: 0.396] loss: 0.008656, lagrangian_loss: 0.004315, attention_score_distillation_loss: 0.000304 loss: 0.145286, lagrangian_loss: 0.003923, attention_score_distillation_loss: 0.000208 ---------------------------------------------------------------------- time: 2023-07-19 15:51:52 Evaluating: accuracy: 0.9071, eval_loss: 0.4292, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4269, expected_sparsity: 0.414, expected_sequence_sparsity: 0.7741, target_sparsity: 0.3895, step: 20500 lambda_1: -1.2951, lambda_2: 338.3936 lambda_3: 0.0000 train remain: [0.99 0.98 1. 
0.81 0.77 0.73 0.81 0.61 0.16 0.09] infer remain: [1.0, 1.0, 1.0, 0.8, 0.76, 0.72, 0.8, 0.6, 0.16, 0.08] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.61, 0.44, 0.35, 0.21, 0.03, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101100 1111111111111110101110000 1111111111111110101100000 1111111111111110111011000 0111111111011010110100000 0010100010000000001000000 1000000000000000000000001 Best eval score so far: 0.9128 @ step 20000 epoch 9.50 loss: 0.225192, lagrangian_loss: -0.000559, attention_score_distillation_loss: 0.000140 loss: 0.015227, lagrangian_loss: 0.002478, attention_score_distillation_loss: 0.000064 ---------------------------------------------------------------------- time: 2023-07-19 15:53:21 Evaluating: accuracy: 0.9197, eval_loss: 0.3732, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4269, expected_sparsity: 0.4148, expected_sequence_sparsity: 0.7744, target_sparsity: 0.399, step: 21000 lambda_1: -2.0383, lambda_2: 350.4059 lambda_3: 0.0000 train remain: [0.99 0.97 0.99 0.81 0.77 0.73 0.8 0.62 0.13 0.09] infer remain: [1.0, 1.0, 1.0, 0.8, 0.76, 0.72, 0.8, 0.6, 0.12, 0.08] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.61, 0.44, 0.35, 0.21, 0.03, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101100 1111111111111110101110000 1111111111111110101100000 1111111111111110111011000 0111111111011010110100000 0010100010000000000000000 1000000000000000000000001 Best eval score so far: 0.9128 @ step 20000 epoch 9.50 Saving the best model so far: [Epoch 9 | Step: 21000 | MACs sparsity: 0.4269 | Score: 0.9197 | Loss: 0.3732] loss: 0.007518, lagrangian_loss: 0.061964, attention_score_distillation_loss: 0.000040 ETA: 3:15:03 | Epoch 9 finished. Took 497.94 seconds. loss: 0.339028, lagrangian_loss: 0.008784, attention_score_distillation_loss: 0.000060 ---------------------------------------------------------------------- time: 2023-07-19 15:55:18 Evaluating: accuracy: 0.9071, eval_loss: 0.4127, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4335, expected_sparsity: 0.4171, expected_sequence_sparsity: 0.7753, target_sparsity: 0.4, step: 21500 lambda_1: -0.6656, lambda_2: 363.0825 lambda_3: 0.0000 train remain: [0.99 0.96 0.99 0.81 0.76 0.73 0.77 0.62 0.13 0.09] infer remain: [1.0, 1.0, 1.0, 0.8, 0.76, 0.72, 0.76, 0.6, 0.12, 0.08] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.61, 0.44, 0.33, 0.2, 0.02, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101100 1111111111111110101110000 1111111111111110101100000 1111101111111110111011000 0111111111011010110100000 0010100001000000000000000 1000000010000000000000000 Best eval score so far: 0.9197 @ step 21000 epoch 9.98 loss: 0.070387, lagrangian_loss: 0.002247, attention_score_distillation_loss: 0.000052 loss: 0.008773, lagrangian_loss: 0.002919, attention_score_distillation_loss: 0.000054 ---------------------------------------------------------------------- time: 2023-07-19 15:56:48 Evaluating: accuracy: 0.9174, eval_loss: 0.4149, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4402, expected_sparsity: 0.4226, expected_sequence_sparsity: 0.7775, target_sparsity: 0.4, step: 22000 lambda_1: -0.4025, lambda_2: 374.1867 lambda_3: 0.0000 train remain: [0.98 0.97 1. 
0.81 0.74 0.73 0.77 0.62 0.13 0.09] infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.72, 0.76, 0.64, 0.12, 0.08] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.41, 0.32, 0.2, 0.02, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101100 1111111111111110101100000 1111111111111110101100000 1111101111111110111011000 0111111111011010110110000 0010101000000000000000000 1000000000001000000000000 Best eval score so far: 0.9197 @ step 21000 epoch 9.98 loss: 0.009230, lagrangian_loss: 0.012265, attention_score_distillation_loss: 0.000046 loss: 0.021243, lagrangian_loss: 0.014777, attention_score_distillation_loss: 0.000057 ---------------------------------------------------------------------- time: 2023-07-19 15:58:18 Evaluating: accuracy: 0.9209, eval_loss: 0.387, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4402, expected_sparsity: 0.4238, expected_sequence_sparsity: 0.778, target_sparsity: 0.4, step: 22500 lambda_1: -0.4009, lambda_2: 385.9117 lambda_3: 0.0000 train remain: [0.99 0.97 0.99 0.82 0.74 0.74 0.77 0.62 0.13 0.09] infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.72, 0.76, 0.6, 0.12, 0.08] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.41, 0.32, 0.19, 0.02, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101100 1111111111111110101100000 1111111111111110101100000 1111101111111110111011000 0111111111011010110100000 0010101000000000000000000 1000000000000000000000010 Best eval score so far: 0.9197 @ step 21000 epoch 9.98 Saving the best model so far: [Epoch 10 | Step: 22500 | MACs sparsity: 0.4402 | Score: 0.9209 | Loss: 0.387] loss: 0.007688, lagrangian_loss: -0.000085, attention_score_distillation_loss: 0.000051 loss: 0.334175, lagrangian_loss: 0.024917, attention_score_distillation_loss: 0.000045 ---------------------------------------------------------------------- time: 2023-07-19 16:00:22 Evaluating: accuracy: 0.9186, eval_loss: 0.3972, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4402, expected_sparsity: 0.4238, expected_sequence_sparsity: 0.778, target_sparsity: 0.4, step: 23000 lambda_1: -0.4820, lambda_2: 396.7532 lambda_3: 0.0000 train remain: [0.97 0.99 1. 0.82 0.74 0.74 0.77 0.61 0.13 0.09] infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.72, 0.76, 0.6, 0.12, 0.08] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.41, 0.32, 0.19, 0.02, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101100 1111111111111110101100000 1111111111111110101100000 1111101111111110111011000 0111111111011010110100000 0010101000000000000000000 1000000000000000100000000 Best eval score so far: 0.9209 @ step 22500 epoch 10.69 loss: 0.019337, lagrangian_loss: 0.029455, attention_score_distillation_loss: 0.000044 ETA: 3:09:27 | Epoch 10 finished. Took 410.65 seconds. 
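The "Saving the best model so far" lines follow the rule the log announces at step 19500 ("Starting saving the best from epoch 9 and step 19500"): checkpoints are only considered once the sparsity schedule is essentially finished, and from then on any evaluation that beats the best accuracy so far is saved. A minimal sketch of that bookkeeping; the function and argument names (`maybe_save_best`, `save_fn`) are mine, not the repository's:

```python
best_score = None

def maybe_save_best(epoch, step, macs_sparsity, score, loss, save_fn,
                    start_epoch=9):
    """Keep only the best post-warmup checkpoint, as in the log above."""
    global best_score
    if epoch < start_epoch:
        return  # still inside the Lagrangian warmup; sparsity is below target
    if best_score is None or score > best_score:
        best_score = score
        print(f"Saving the best model so far: [Epoch {epoch} | Step: {step} "
              f"| MACs sparsity: {macs_sparsity} | Score: {score} | Loss: {loss}]")
        save_fn()
```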
loss: 0.028104, lagrangian_loss: -0.004824, attention_score_distillation_loss: 0.000049 ---------------------------------------------------------------------- time: 2023-07-19 16:01:52 Evaluating: accuracy: 0.922, eval_loss: 0.3689, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4402, expected_sparsity: 0.4244, expected_sequence_sparsity: 0.7782, target_sparsity: 0.4, step: 23500 lambda_1: -0.5830, lambda_2: 409.2549 lambda_3: 0.0000 train remain: [0.98 0.99 0.99 0.81 0.74 0.73 0.77 0.6 0.09 0.09] infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.72, 0.76, 0.6, 0.08, 0.08] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.41, 0.32, 0.19, 0.02, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101100 1111111111111110101100000 1111111111111110101100000 1111101111111110111011000 0111111111011010110100000 0010100000000000000000000 1000000000001000000000000 Best eval score so far: 0.9209 @ step 22500 epoch 10.69 Saving the best model so far: [Epoch 11 | Step: 23500 | MACs sparsity: 0.4402 | Score: 0.922 | Loss: 0.3689] loss: 0.009476, lagrangian_loss: 0.016693, attention_score_distillation_loss: 0.000052 loss: 0.005693, lagrangian_loss: 0.006703, attention_score_distillation_loss: 0.000047 ---------------------------------------------------------------------- time: 2023-07-19 16:03:57 Evaluating: accuracy: 0.9151, eval_loss: 0.3836, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4402, expected_sparsity: 0.4244, expected_sequence_sparsity: 0.7782, target_sparsity: 0.4, step: 24000 lambda_1: -0.2180, lambda_2: 420.8145 lambda_3: 0.0000 train remain: [0.98 0.99 0.98 0.81 0.74 0.74 0.77 0.61 0.1 0.09] infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.72, 0.76, 0.6, 0.08, 0.08] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.41, 0.32, 0.19, 0.02, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101100 1111111111111110101100000 1111111111111110101100000 1111101111111110111011000 0111111111011010110100000 0000100000000000001000000 1000000000001000000000000 Best eval score so far: 0.9220 @ step 23500 epoch 11.16 loss: 0.062661, lagrangian_loss: 0.017847, attention_score_distillation_loss: 0.000057 loss: 0.241612, lagrangian_loss: 0.015180, attention_score_distillation_loss: 0.000053 ---------------------------------------------------------------------- time: 2023-07-19 16:05:27 Evaluating: accuracy: 0.9186, eval_loss: 0.3978, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4402, expected_sparsity: 0.4244, expected_sequence_sparsity: 0.7782, target_sparsity: 0.4, step: 24500 lambda_1: -1.9868, lambda_2: 432.3417 lambda_3: 0.0000 train remain: [0.99 0.99 0.97 0.82 0.73 0.73 0.77 0.58 0.09 0.09] infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.72, 0.76, 0.6, 0.08, 0.08] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.41, 0.32, 0.19, 0.02, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101100 1111111111111110101100000 1111111111111110101100000 1111101111111110111011000 0111111111011010110100000 0000100000000000001000000 1000000100000000000000000 Best eval score so far: 0.9220 @ step 23500 epoch 11.16 loss: 0.012182, lagrangian_loss: 0.006213, attention_score_distillation_loss: 0.000048 loss: 0.014083, lagrangian_loss: 0.008984, attention_score_distillation_loss: 0.000051 
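lambda_1 and lambda_2 in these records are trained, not hand-set: lambda_2 climbs monotonically (9.6 at step 500, 455.3 by step 25500) because its gradient under the penalty, gap squared, is never negative, and the multipliers are updated to maximize the very term the model minimizes, using the separate reg_learning_rate=0.04 from the arguments dump. A toy sketch of that adversarial two-optimizer loop; the stand-in losses and the gradient-flip trick are illustrative assumptions, not the repository's implementation:

```python
import torch

lambda_1 = torch.zeros((), requires_grad=True)  # both print as 0.0000 at step 0
lambda_2 = torch.zeros((), requires_grad=True)

param = torch.randn((), requires_grad=True)     # stand-in for BERT + gate params
opt_model = torch.optim.AdamW([param], lr=1e-5)             # learning_rate
opt_reg = torch.optim.AdamW([lambda_1, lambda_2], lr=0.04,  # reg_learning_rate
                            weight_decay=0.0)

for step in range(3):
    gap = torch.sigmoid(param) * 0.5 - 0.4     # toy expected-minus-target sparsity
    lagrangian = lambda_1 * gap + lambda_2 * gap ** 2
    loss = param ** 2 + lagrangian             # toy task loss + penalty

    opt_model.zero_grad()
    opt_reg.zero_grad()
    loss.backward()
    # Flip the multipliers' gradients so they *ascend* on the penalty the
    # model descends on; this is what drags expected sparsity to the target.
    for lam in (lambda_1, lambda_2):
        lam.grad.neg_()
    opt_model.step()
    opt_reg.step()
```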
---------------------------------------------------------------------- time: 2023-07-19 16:06:57 Evaluating: accuracy: 0.9163, eval_loss: 0.4165, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4402, expected_sparsity: 0.4244, expected_sequence_sparsity: 0.7782, target_sparsity: 0.4, step: 25000 lambda_1: -0.0343, lambda_2: 443.8658 lambda_3: 0.0000 train remain: [0.99 0.99 0.98 0.82 0.74 0.74 0.77 0.61 0.1 0.09] infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.72, 0.76, 0.6, 0.08, 0.08] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.41, 0.32, 0.19, 0.02, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101100 1111111111111110101100000 1111111111111110101100000 1111101111111110111011000 0111111111011010110100000 0000100000000000000000100 1000000000000000000100000 Best eval score so far: 0.9220 @ step 23500 epoch 11.16 loss: 0.039797, lagrangian_loss: 0.001821, attention_score_distillation_loss: 0.000053 loss: 0.005244, lagrangian_loss: 0.019639, attention_score_distillation_loss: 0.000046 ETA: 3:03:43 | Epoch 11 finished. Took 412.35 seconds. ---------------------------------------------------------------------- time: 2023-07-19 16:08:27 Evaluating: accuracy: 0.9197, eval_loss: 0.3906, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4402, expected_sparsity: 0.4244, expected_sequence_sparsity: 0.7782, target_sparsity: 0.4, step: 25500 lambda_1: -0.2721, lambda_2: 455.2599 lambda_3: 0.0000 train remain: [0.98 0.98 0.99 0.82 0.74 0.76 0.77 0.6 0.1 0.09] infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.72, 0.76, 0.6, 0.08, 0.08] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.41, 0.32, 0.19, 0.02, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101100 1111111111111110101100000 1111111111111110101100000 1111101111111110111011000 0111111111011010110100000 0000100000000010000000000 1000000010000000000000000 Best eval score so far: 0.9220 @ step 23500 epoch 11.16 loss: 0.006331, lagrangian_loss: 0.031499, attention_score_distillation_loss: 0.000042 loss: 0.245801, lagrangian_loss: 0.017978, attention_score_distillation_loss: 0.000054 ---------------------------------------------------------------------- time: 2023-07-19 16:09:56 Evaluating: accuracy: 0.9186, eval_loss: 0.4125, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4402, expected_sparsity: 0.4255, expected_sequence_sparsity: 0.7786, target_sparsity: 0.4, step: 26000 lambda_1: -0.3610, lambda_2: 466.5440 lambda_3: 0.0000 train remain: [0.99 0.97 0.99 0.81 0.76 0.74 0.77 0.58 0.09 0.09] infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.72, 0.76, 0.56, 0.08, 0.08] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.41, 0.32, 0.18, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101100 1111111111111110101100000 1111111111111110101100000 1111101111111110111011000 0101111111011010110100000 0000100000001000000000000 1000000000000000000000001 Best eval score so far: 0.9220 @ step 23500 epoch 11.16 loss: 0.020716, lagrangian_loss: 0.012093, attention_score_distillation_loss: 0.000058 loss: 0.037960, lagrangian_loss: 0.028970, attention_score_distillation_loss: 0.000062 ---------------------------------------------------------------------- time: 2023-07-19 16:11:26 Evaluating: accuracy: 0.9243, eval_loss: 0.365, token_prune_loc: 
[False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4402, expected_sparsity: 0.4255, expected_sequence_sparsity: 0.7786, target_sparsity: 0.4, step: 26500 lambda_1: -0.6528, lambda_2: 477.6496 lambda_3: 0.0000 train remain: [0.99 0.96 0.99 0.81 0.75 0.74 0.76 0.58 0.09 0.09] infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.72, 0.76, 0.56, 0.08, 0.08] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.41, 0.32, 0.18, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101100 1111111111111110101100000 1111111111111110101100000 1111101111111110111011000 0101111111011010110100000 0000101000000000000000000 1000000000000000000000001 Best eval score so far: 0.9220 @ step 23500 epoch 11.16 Saving the best model so far: [Epoch 12 | Step: 26500 | MACs sparsity: 0.4402 | Score: 0.9243 | Loss: 0.365] loss: 0.028442, lagrangian_loss: 0.006984, attention_score_distillation_loss: 0.000051 loss: 0.005528, lagrangian_loss: 0.036418, attention_score_distillation_loss: 0.000037 ---------------------------------------------------------------------- time: 2023-07-19 16:13:24 Evaluating: accuracy: 0.9197, eval_loss: 0.3721, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4402, expected_sparsity: 0.4249, expected_sequence_sparsity: 0.7784, target_sparsity: 0.4, step: 27000 lambda_1: -0.2567, lambda_2: 488.4018 lambda_3: 0.0000 train remain: [0.99 0.97 0.99 0.81 0.76 0.74 0.75 0.58 0.11 0.1 ] infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.72, 0.76, 0.56, 0.12, 0.08] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.41, 0.32, 0.18, 0.02, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101100 1111111111111110101100000 1111111111111110101100000 1111101111111110111011000 0101111111011010110100000 0000101000001000000000000 1000000100000000000000000 Best eval score so far: 0.9243 @ step 26500 epoch 12.59 loss: 0.005848, lagrangian_loss: 0.004468, attention_score_distillation_loss: 0.000057 loss: 0.008393, lagrangian_loss: 0.000322, attention_score_distillation_loss: 0.000051 ETA: 2:57:33 | Epoch 12 finished. Took 405.04 seconds. 
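The "lagrangian_loss" column is the sparsity-constraint penalty. A plausible form, common in L0-regularized pruning (CoFi-style) and consistent with the trainable multipliers lambda_1 and lambda_2 logged at every evaluation, is sketched below; this is an assumption, not something the log itself confirms. It would also explain why the term occasionally goes negative: when the expected sparsity s overshoots the target t, the linear part lambda_1 * (s - t) can dominate with the opposite sign.

# Hypothetical CoFi-style Lagrangian penalty: the trainable multipliers
# lambda_1 and lambda_2 push the model's expected sparsity s toward the
# (warmed-up) target t.
def lagrangian_loss(s: float, t: float, lambda_1: float, lambda_2: float) -> float:
    gap = s - t
    return lambda_1 * gap + lambda_2 * gap ** 2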
---------------------------------------------------------------------- time: 2023-07-19 16:14:54 Evaluating: accuracy: 0.922, eval_loss: 0.3892, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4402, expected_sparsity: 0.4255, expected_sequence_sparsity: 0.7786, target_sparsity: 0.4, step: 27500 lambda_1: -1.2530, lambda_2: 500.5627 lambda_3: 0.0000 train remain: [0.99 0.97 0.99 0.81 0.76 0.72 0.75 0.57 0.09 0.09] infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.72, 0.76, 0.56, 0.08, 0.08] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.41, 0.32, 0.18, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101100 1111111111111110101100000 1111111111111110101100000 1111101111111110111011000 0101111111011010110100000 0000100000001000000000000 1000000000000000000000001 Best eval score so far: 0.9243 @ step 26500 epoch 12.59 loss: 0.002942, lagrangian_loss: 0.000287, attention_score_distillation_loss: 0.000047 loss: 0.022960, lagrangian_loss: 0.082235, attention_score_distillation_loss: 0.000041 ---------------------------------------------------------------------- time: 2023-07-19 16:16:23 Evaluating: accuracy: 0.9255, eval_loss: 0.382, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4402, expected_sparsity: 0.4276, expected_sequence_sparsity: 0.7795, target_sparsity: 0.4, step: 28000 lambda_1: -0.5438, lambda_2: 511.3359 lambda_3: 0.0000 train remain: [0.99 0.98 1. 0.81 0.76 0.73 0.73 0.57 0.1 0.09] infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.72, 0.72, 0.56, 0.08, 0.08] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.41, 0.3, 0.17, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101100 1111111111111110101100000 1111111111111110101100000 1111101111111010111011000 0101111111011010110100000 0000100000001000000000000 1000000000000000000000001 Best eval score so far: 0.9243 @ step 26500 epoch 12.59 Saving the best model so far: [Epoch 13 | Step: 28000 | MACs sparsity: 0.4402 | Score: 0.9255 | Loss: 0.382] loss: 0.001618, lagrangian_loss: 0.024870, attention_score_distillation_loss: 0.000058 loss: 0.020451, lagrangian_loss: 0.000290, attention_score_distillation_loss: 0.000053 ---------------------------------------------------------------------- time: 2023-07-19 16:18:26 Evaluating: accuracy: 0.9289, eval_loss: 0.3622, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4402, expected_sparsity: 0.4276, expected_sequence_sparsity: 0.7795, target_sparsity: 0.4, step: 28500 lambda_1: -0.1599, lambda_2: 521.5013 lambda_3: 0.0000 train remain: [0.99 0.97 1. 
0.81 0.76 0.73 0.73 0.57 0.09 0.09] infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.72, 0.72, 0.56, 0.08, 0.08] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.41, 0.3, 0.17, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101100 1111111111111110101100000 1111111111111110101100000 1111101111111010111011000 0101111111011010110100000 0000100000001000000000000 1000000010000000000000000 Best eval score so far: 0.9255 @ step 28000 epoch 13.30 Saving the best model so far: [Epoch 13 | Step: 28500 | MACs sparsity: 0.4402 | Score: 0.9289 | Loss: 0.3622] loss: 0.004615, lagrangian_loss: 0.019029, attention_score_distillation_loss: 0.000057 loss: 0.012880, lagrangian_loss: 0.015729, attention_score_distillation_loss: 0.000048 ---------------------------------------------------------------------- time: 2023-07-19 16:20:24 Evaluating: accuracy: 0.922, eval_loss: 0.3981, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4402, expected_sparsity: 0.4276, expected_sequence_sparsity: 0.7795, target_sparsity: 0.4, step: 29000 lambda_1: -0.4209, lambda_2: 532.8297 lambda_3: 0.0000 train remain: [0.99 0.97 0.99 0.82 0.75 0.74 0.74 0.56 0.09 0.09] infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.72, 0.72, 0.56, 0.08, 0.08] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.41, 0.3, 0.17, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101100 1111111111111110101100000 1111111111111110101100000 1111101111111010111011000 0101111111011010110100000 0000100000000000001000000 1000000000000000000000001 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.039566, lagrangian_loss: 0.010481, attention_score_distillation_loss: 0.000053 loss: 0.004200, lagrangian_loss: 0.004085, attention_score_distillation_loss: 0.000056 ETA: 2:52:18 | Epoch 13 finished. Took 437.47 seconds. 
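The "Best eval score so far" / "Saving the best model so far" pair follows a simple running-max rule: each evaluation first reports the incumbent best, then saves a new checkpoint only if the current score beats it. A minimal sketch of that bookkeeping, assuming the sparsity gate is already satisfied at this point in training:

# Running-best checkpoint bookkeeping, seeded with the values in effect
# just before the step-28500 evaluation above.
best_score, best_step = 0.9255, 28000

def on_eval(step: int, epoch: float, score: float, macs_sparsity: float, eval_loss: float) -> None:
    global best_score, best_step
    print(f"Best eval score so far: {best_score:.4f} @ step {best_step}")
    if score > best_score:
        best_score, best_step = score, step
        print(f"Saving the best model so far: [Epoch {int(epoch)} | Step: {step} | "
              f"MACs sparsity: {macs_sparsity} | Score: {score} | Loss: {eval_loss}]")

on_eval(28500, 13.54, 0.9289, 0.4402, 0.3622)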
---------------------------------------------------------------------- time: 2023-07-19 16:21:53 Evaluating: accuracy: 0.9186, eval_loss: 0.4071, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4402, expected_sparsity: 0.4276, expected_sequence_sparsity: 0.7795, target_sparsity: 0.4, step: 29500 lambda_1: -0.8445, lambda_2: 544.2328 lambda_3: 0.0000 train remain: [0.99 0.98 0.99 0.81 0.74 0.74 0.74 0.55 0.09 0.09] infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.72, 0.72, 0.56, 0.08, 0.08] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.41, 0.3, 0.17, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101100 1111111111111110101100000 1111111111111110101100000 1111101111111010111011000 0101111111011010110100000 0000100000001000000000000 1000000010000000000000000 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.065120, lagrangian_loss: 0.002955, attention_score_distillation_loss: 0.000054 loss: 0.006705, lagrangian_loss: 0.060361, attention_score_distillation_loss: 0.000042 ---------------------------------------------------------------------- time: 2023-07-19 16:23:23 Evaluating: accuracy: 0.9232, eval_loss: 0.3768, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4402, expected_sparsity: 0.4276, expected_sequence_sparsity: 0.7795, target_sparsity: 0.4, step: 30000 lambda_1: -0.4613, lambda_2: 555.9514 lambda_3: 0.0000 train remain: [0.99 0.97 0.99 0.83 0.74 0.74 0.73 0.54 0.09 0.08] infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.72, 0.72, 0.56, 0.08, 0.08] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.41, 0.3, 0.17, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101100 1111111111111110101100000 1111111111111110101100000 1111101111111010111011000 0101111111011010110100000 0000100000000000001000000 1000000000000000000000001 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.004798, lagrangian_loss: 0.016233, attention_score_distillation_loss: 0.000042 loss: 0.017283, lagrangian_loss: 0.006262, attention_score_distillation_loss: 0.000055 ---------------------------------------------------------------------- time: 2023-07-19 16:24:52 Evaluating: accuracy: 0.9278, eval_loss: 0.3585, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4402, expected_sparsity: 0.4276, expected_sequence_sparsity: 0.7795, target_sparsity: 0.4, step: 30500 lambda_1: -0.5198, lambda_2: 567.5977 lambda_3: 0.0000 train remain: [0.99 0.97 0.99 0.82 0.75 0.74 0.74 0.55 0.09 0.08] infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.72, 0.72, 0.56, 0.08, 0.08] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.41, 0.3, 0.17, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101100 1111111111111110101100000 1111111111111110101100000 1111101111111010111011000 0101111111011010110100000 0000101000000000000000000 1000000010000000000000000 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.010632, lagrangian_loss: 0.006477, attention_score_distillation_loss: 0.000051 loss: 0.014309, lagrangian_loss: 0.002179, attention_score_distillation_loss: 0.000059 ---------------------------------------------------------------------- time: 2023-07-19 16:26:22 Evaluating: accuracy: 0.9243, eval_loss: 0.3599, token_prune_loc: [False, False, False, True, True, True, True, True, 
True, True], macs_sparsity: 0.4402, expected_sparsity: 0.4276, expected_sequence_sparsity: 0.7795, target_sparsity: 0.4, step: 31000 lambda_1: -0.4821, lambda_2: 579.8088 lambda_3: 0.0000 train remain: [0.99 0.98 0.99 0.82 0.74 0.74 0.74 0.54 0.1 0.08] infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.72, 0.72, 0.56, 0.08, 0.08] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.41, 0.3, 0.17, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101100 1111111111111110101100000 1111111111111110101100000 1111101111111010111011000 0101111111011010110100000 0000100010000000000000000 1000000000000000010000000 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.015730, lagrangian_loss: 0.107058, attention_score_distillation_loss: 0.000042 loss: 0.002579, lagrangian_loss: 0.005024, attention_score_distillation_loss: 0.000053 ---------------------------------------------------------------------- time: 2023-07-19 16:27:52 Evaluating: accuracy: 0.9186, eval_loss: 0.3836, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4435, expected_sparsity: 0.4287, expected_sequence_sparsity: 0.7799, target_sparsity: 0.4, step: 31500 lambda_1: -1.0087, lambda_2: 592.1707 lambda_3: 0.0000 train remain: [0.99 0.98 0.99 0.82 0.74 0.73 0.73 0.54 0.09 0.06] infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.72, 0.72, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.41, 0.3, 0.16, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101100 1111111111111110101100000 1111111111111110101100000 1111101111111010111011000 0101101111011010110100000 0000100010000000000000000 1000000000000000000000000 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.005703, lagrangian_loss: 0.003289, attention_score_distillation_loss: 0.000048 ETA: 2:45:13 | Epoch 14 finished. Took 381.35 seconds. 
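The per-epoch ETA line is consistent with a simple projection: the average epoch wall time so far, multiplied by the number of epochs left, assuming the 40-epoch schedule configured for this run. A rough reconstruction of the formatting, with that total taken as an assumption:

import datetime

def eta_line(epoch_index: int, epoch_seconds: list, total_epochs: int = 40) -> str:
    # Project the remaining wall time from the average epoch duration so far.
    avg = sum(epoch_seconds) / len(epoch_seconds)
    remaining = total_epochs - (epoch_index + 1)
    eta = datetime.timedelta(seconds=round(avg * remaining))
    return f"ETA: {eta} | Epoch {epoch_index} finished. Took {epoch_seconds[-1]:.2f} seconds."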
loss: 0.007816, lagrangian_loss: 0.074306, attention_score_distillation_loss: 0.000041 ---------------------------------------------------------------------- time: 2023-07-19 16:29:22 Evaluating: accuracy: 0.9128, eval_loss: 0.4007, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4435, expected_sparsity: 0.4287, expected_sequence_sparsity: 0.7799, target_sparsity: 0.4, step: 32000 lambda_1: -0.5498, lambda_2: 603.1637 lambda_3: 0.0000 train remain: [0.99 0.99 0.99 0.81 0.74 0.73 0.73 0.54 0.09 0.06] infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.72, 0.72, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.41, 0.3, 0.16, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101100 1111111111111110101100000 1111111111111110101100000 1111101111111010111011000 0101101111011010110100000 0000100000000010000000000 1000000000000000000000000 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.007923, lagrangian_loss: 0.008814, attention_score_distillation_loss: 0.000053 loss: 0.019591, lagrangian_loss: 0.008428, attention_score_distillation_loss: 0.000044 ---------------------------------------------------------------------- time: 2023-07-19 16:30:52 Evaluating: accuracy: 0.9151, eval_loss: 0.3968, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4402, expected_sparsity: 0.4276, expected_sequence_sparsity: 0.7795, target_sparsity: 0.4, step: 32500 lambda_1: -0.2055, lambda_2: 616.0857 lambda_3: 0.0000 train remain: [0.98 0.99 1. 0.82 0.74 0.73 0.73 0.54 0.1 0.07] infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.72, 0.72, 0.56, 0.08, 0.08] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.41, 0.3, 0.17, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101100 1111111111111110101100000 1111111111111110101100000 1111101111111010111011000 0101101111011010110100010 0000100100000000000000000 1000000000000000100000000 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.015917, lagrangian_loss: 0.148786, attention_score_distillation_loss: 0.000040 loss: 0.012037, lagrangian_loss: -0.000045, attention_score_distillation_loss: 0.000047 ---------------------------------------------------------------------- time: 2023-07-19 16:32:22 Evaluating: accuracy: 0.9232, eval_loss: 0.3708, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4435, expected_sparsity: 0.4287, expected_sequence_sparsity: 0.7799, target_sparsity: 0.4, step: 33000 lambda_1: -0.6351, lambda_2: 627.1347 lambda_3: 0.0000 train remain: [0.99 0.99 1. 
0.81 0.74 0.73 0.72 0.54 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.72, 0.72, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.41, 0.3, 0.16, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101100 1111111111111110101100000 1111111111111110101100000 1111101111111010111011000 0101101111011010110100000 1000100000000000000000000 1000000000000000000000000 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.009967, lagrangian_loss: 0.004021, attention_score_distillation_loss: 0.000046 loss: 0.010388, lagrangian_loss: 0.000199, attention_score_distillation_loss: 0.000049 ---------------------------------------------------------------------- time: 2023-07-19 16:33:51 Evaluating: accuracy: 0.9186, eval_loss: 0.3989, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4469, expected_sparsity: 0.4308, expected_sequence_sparsity: 0.7807, target_sparsity: 0.4, step: 33500 lambda_1: -0.8868, lambda_2: 638.0287 lambda_3: 0.0000 train remain: [0.99 0.99 0.99 0.81 0.74 0.71 0.7 0.54 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.72, 0.68, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.41, 0.28, 0.15, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101100 1111111111111110101100000 1111111111111110101100000 1111101111111010111010000 0101101111011010110100000 0010100000000000000000000 1000000000000000000000000 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.018070, lagrangian_loss: 0.002462, attention_score_distillation_loss: 0.000051 ETA: 2:38:07 | Epoch 15 finished. Took 376.89 seconds. loss: 0.111439, lagrangian_loss: 0.000197, attention_score_distillation_loss: 0.000052 ---------------------------------------------------------------------- time: 2023-07-19 16:35:21 Evaluating: accuracy: 0.9186, eval_loss: 0.406, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4469, expected_sparsity: 0.4345, expected_sequence_sparsity: 0.7822, target_sparsity: 0.4, step: 34000 lambda_1: -0.6912, lambda_2: 648.1470 lambda_3: 0.0000 train remain: [0.99 0.99 0.99 0.81 0.74 0.7 0.7 0.54 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.39, 0.27, 0.14, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101100 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110100000 0000100000000000001000000 1000000000000000000000000 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.008511, lagrangian_loss: 0.021982, attention_score_distillation_loss: 0.000044 loss: 0.251608, lagrangian_loss: 0.000294, attention_score_distillation_loss: 0.000053 ---------------------------------------------------------------------- time: 2023-07-19 16:36:51 Evaluating: accuracy: 0.922, eval_loss: 0.3852, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4469, expected_sparsity: 0.4308, expected_sequence_sparsity: 0.7807, target_sparsity: 0.4, step: 34500 lambda_1: -0.3695, lambda_2: 659.3114 lambda_3: 0.0000 train remain: [0.99 0.99 0.99 0.82 0.75 0.71 0.68 0.53 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.72, 0.68, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 
0.41, 0.28, 0.15, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101100 1111111111111110101100000 1111111111111110101100000 1111101111111010111010000 0101101111011010110100000 0000101000000000000000000 1000000000000000000000000 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.007828, lagrangian_loss: 0.000249, attention_score_distillation_loss: 0.000050 loss: 0.008113, lagrangian_loss: 0.021393, attention_score_distillation_loss: 0.000057 ---------------------------------------------------------------------- time: 2023-07-19 16:38:20 Evaluating: accuracy: 0.9232, eval_loss: 0.3906, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4469, expected_sparsity: 0.4345, expected_sequence_sparsity: 0.7822, target_sparsity: 0.4, step: 35000 lambda_1: -0.4315, lambda_2: 671.6459 lambda_3: 0.0000 train remain: [0.99 0.99 1. 0.82 0.74 0.7 0.68 0.53 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.39, 0.27, 0.14, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101100 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110100000 0000100000000000000000001 0000000000000000000000001 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.041162, lagrangian_loss: -0.000061, attention_score_distillation_loss: 0.000049 loss: 0.007525, lagrangian_loss: 0.020272, attention_score_distillation_loss: 0.000047 ---------------------------------------------------------------------- time: 2023-07-19 16:39:51 Evaluating: accuracy: 0.9232, eval_loss: 0.3764, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4469, expected_sparsity: 0.4345, expected_sequence_sparsity: 0.7822, target_sparsity: 0.4, step: 35500 lambda_1: -0.1752, lambda_2: 681.8187 lambda_3: 0.0000 train remain: [0.99 0.99 1. 0.82 0.75 0.7 0.7 0.54 0.09 0.06] infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.39, 0.27, 0.14, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101100 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110100000 1000100000000000000000000 0000100000000000000000000 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.005210, lagrangian_loss: 0.039003, attention_score_distillation_loss: 0.000046 loss: 0.004421, lagrangian_loss: 0.109581, attention_score_distillation_loss: 0.000043 ETA: 2:31:08 | Epoch 16 finished. Took 377.56 seconds. 
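Throughout these blocks, token_prune_loc flags the layers that actually drop tokens at inference; in every evaluation above it coincides with the positions where "infer remain" has fallen below 1.0. A sketch of that correspondence, using the step-35000 values (mapping the flags onto encoder layers 2 through 11 per the run's prune_location setting):

# token_prune_loc marks prunable layers whose inference-time keep ratio
# has dropped below 1.0 -- matches the step-35000 block above.
prune_location = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
infer_remain = [1.0, 1.0, 1.0, 0.8, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04]
token_prune_loc = [r < 1.0 for r in infer_remain]
print(token_prune_loc)
# [False, False, False, True, True, True, True, True, True, True]
print([layer for layer, on in zip(prune_location, token_prune_loc) if on])
# layers that prune tokens at inference: [5, 6, 7, 8, 9, 10, 11]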
---------------------------------------------------------------------- time: 2023-07-19 16:41:21 Evaluating: accuracy: 0.9163, eval_loss: 0.3986, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4469, expected_sparsity: 0.4345, expected_sequence_sparsity: 0.7822, target_sparsity: 0.4, step: 36000 lambda_1: -0.4046, lambda_2: 694.4706 lambda_3: 0.0000 train remain: [0.99 0.99 0.99 0.82 0.75 0.7 0.69 0.53 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.39, 0.27, 0.14, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101100 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110100000 0000100001000000000000000 1000000000000000000000000 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.002975, lagrangian_loss: 0.000761, attention_score_distillation_loss: 0.000053 loss: 0.004696, lagrangian_loss: 0.020834, attention_score_distillation_loss: 0.000044 ---------------------------------------------------------------------- time: 2023-07-19 16:42:50 Evaluating: accuracy: 0.9197, eval_loss: 0.3901, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4469, expected_sparsity: 0.4345, expected_sequence_sparsity: 0.7822, target_sparsity: 0.4, step: 36500 lambda_1: -0.2002, lambda_2: 705.3948 lambda_3: 0.0000 train remain: [1. 0.99 0.99 0.81 0.75 0.71 0.69 0.53 0.09 0.06] infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.39, 0.27, 0.14, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101100 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110100000 0000100000001000000000000 1000000000000000000000000 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.107541, lagrangian_loss: 0.002262, attention_score_distillation_loss: 0.000057 loss: 0.057680, lagrangian_loss: 0.000653, attention_score_distillation_loss: 0.000045 ---------------------------------------------------------------------- time: 2023-07-19 16:44:20 Evaluating: accuracy: 0.9197, eval_loss: 0.3788, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4469, expected_sparsity: 0.4345, expected_sequence_sparsity: 0.7822, target_sparsity: 0.4, step: 37000 lambda_1: -0.2664, lambda_2: 717.4370 lambda_3: 0.0000 train remain: [1. 0.99 1. 
0.8 0.74 0.71 0.69 0.54 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.39, 0.27, 0.14, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101100 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110100000 0000100000001000000000000 1000000000000000000000000 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.007292, lagrangian_loss: 0.048426, attention_score_distillation_loss: 0.000054 loss: 0.012846, lagrangian_loss: 0.000129, attention_score_distillation_loss: 0.000050 ---------------------------------------------------------------------- time: 2023-07-19 16:45:50 Evaluating: accuracy: 0.9243, eval_loss: 0.3744, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4469, expected_sparsity: 0.4345, expected_sequence_sparsity: 0.7822, target_sparsity: 0.4, step: 37500 lambda_1: -0.3495, lambda_2: 729.8696 lambda_3: 0.0000 train remain: [1. 0.99 1. 0.8 0.75 0.7 0.69 0.53 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.39, 0.27, 0.14, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101100 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110100000 0000100001000000000000000 1000000000000000000000000 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.003568, lagrangian_loss: 0.027161, attention_score_distillation_loss: 0.000041 loss: 0.006349, lagrangian_loss: 0.010191, attention_score_distillation_loss: 0.000048 ETA: 2:24:13 | Epoch 17 finished. Took 377.46 seconds. ---------------------------------------------------------------------- time: 2023-07-19 16:47:20 Evaluating: accuracy: 0.9186, eval_loss: 0.3869, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4469, expected_sparsity: 0.4345, expected_sequence_sparsity: 0.7822, target_sparsity: 0.4, step: 38000 lambda_1: -0.2518, lambda_2: 740.6155 lambda_3: 0.0000 train remain: [1. 0.99 0.99 0.8 0.75 0.7 0.68 0.53 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.39, 0.27, 0.14, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101100 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110100000 1000100000000000000000000 0000000000001000000000000 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.006214, lagrangian_loss: 0.073724, attention_score_distillation_loss: 0.000041 loss: 0.010631, lagrangian_loss: 0.011686, attention_score_distillation_loss: 0.000052 ---------------------------------------------------------------------- time: 2023-07-19 16:48:50 Evaluating: accuracy: 0.9197, eval_loss: 0.3929, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4469, expected_sparsity: 0.4345, expected_sequence_sparsity: 0.7822, target_sparsity: 0.4, step: 38500 lambda_1: -0.0965, lambda_2: 751.6179 lambda_3: 0.0000 train remain: [1. 
0.99 0.99 0.81 0.77 0.7 0.69 0.54 0.1 0.06] infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.39, 0.27, 0.14, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101100 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110100000 1000100000000000000000000 0000000000000000000000001 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.009295, lagrangian_loss: 0.029498, attention_score_distillation_loss: 0.000062 loss: 0.015984, lagrangian_loss: 0.034771, attention_score_distillation_loss: 0.000054 ---------------------------------------------------------------------- time: 2023-07-19 16:50:20 Evaluating: accuracy: 0.922, eval_loss: 0.4008, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4435, expected_sparsity: 0.4285, expected_sequence_sparsity: 0.7798, target_sparsity: 0.4, step: 39000 lambda_1: -0.1759, lambda_2: 763.9257 lambda_3: 0.0000 train remain: [1. 0.99 0.99 0.81 0.77 0.7 0.7 0.54 0.1 0.06] infer remain: [1.0, 1.0, 1.0, 0.8, 0.76, 0.68, 0.68, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.61, 0.41, 0.28, 0.15, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101100 1111111111111111101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110100000 0000101000000000000000000 0000000000000000100000000 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.238435, lagrangian_loss: 0.000540, attention_score_distillation_loss: 0.000055 loss: 0.009482, lagrangian_loss: 0.359404, attention_score_distillation_loss: 0.000033 ---------------------------------------------------------------------- time: 2023-07-19 16:51:49 Evaluating: accuracy: 0.9197, eval_loss: 0.3931, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4469, expected_sparsity: 0.4345, expected_sequence_sparsity: 0.7822, target_sparsity: 0.4, step: 39500 lambda_1: -0.1383, lambda_2: 775.9684 lambda_3: 0.0000 train remain: [1. 0.99 1. 0.8 0.77 0.7 0.69 0.53 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.39, 0.27, 0.14, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101100 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110100000 0000100000001000000000000 0000000000000000100000000 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.006326, lagrangian_loss: 0.005719, attention_score_distillation_loss: 0.000049 loss: 0.005775, lagrangian_loss: 0.011350, attention_score_distillation_loss: 0.000054 ETA: 2:17:22 | Epoch 18 finished. Took 377.73 seconds. ---------------------------------------------------------------------- time: 2023-07-19 16:53:20 Evaluating: accuracy: 0.9209, eval_loss: 0.3899, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4502, expected_sparsity: 0.4374, expected_sequence_sparsity: 0.7833, target_sparsity: 0.4, step: 40000 lambda_1: -0.3434, lambda_2: 787.0940 lambda_3: 0.0000 train remain: [0.99 0.99 1. 
0.78 0.79 0.7 0.69 0.53 0.1 0.05] infer remain: [1.0, 1.0, 1.0, 0.76, 0.76, 0.68, 0.68, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.58, 0.39, 0.27, 0.14, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101110000 1111111111011110101100000 1111101111111010111010000 0101101111011010110100000 1000100000000000000000000 0000000000001000000000000 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.011044, lagrangian_loss: 0.042601, attention_score_distillation_loss: 0.000048 loss: 0.014113, lagrangian_loss: 0.003574, attention_score_distillation_loss: 0.000048 ---------------------------------------------------------------------- time: 2023-07-19 16:54:50 Evaluating: accuracy: 0.9174, eval_loss: 0.3861, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4502, expected_sparsity: 0.4374, expected_sequence_sparsity: 0.7833, target_sparsity: 0.4, step: 40500 lambda_1: -0.2989, lambda_2: 798.2635 lambda_3: 0.0000 train remain: [0.99 0.99 1. 0.79 0.78 0.7 0.69 0.53 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.76, 0.76, 0.68, 0.68, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.58, 0.39, 0.27, 0.14, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101110000 1111111111011110101100000 1111101111111010111010000 0101101111011010110100000 1000100000000000000000000 0000000000000000100000000 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.004105, lagrangian_loss: 0.060908, attention_score_distillation_loss: 0.000042 loss: 0.007662, lagrangian_loss: -0.000011, attention_score_distillation_loss: 0.000049 ---------------------------------------------------------------------- time: 2023-07-19 16:56:19 Evaluating: accuracy: 0.9255, eval_loss: 0.3706, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4502, expected_sparsity: 0.4374, expected_sequence_sparsity: 0.7833, target_sparsity: 0.4, step: 41000 lambda_1: -0.5233, lambda_2: 809.8208 lambda_3: 0.0000 train remain: [0.99 0.99 0.99 0.78 0.78 0.7 0.69 0.52 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.76, 0.76, 0.68, 0.68, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.58, 0.39, 0.27, 0.14, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101110000 1111111111011110101100000 1111101111111010111010000 0101101111011010110100000 1000100000000000000000000 0000000000000000000000001 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.005376, lagrangian_loss: 0.003592, attention_score_distillation_loss: 0.000048 loss: 0.004263, lagrangian_loss: 0.010528, attention_score_distillation_loss: 0.000053 ---------------------------------------------------------------------- time: 2023-07-19 16:57:49 Evaluating: accuracy: 0.9209, eval_loss: 0.3861, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4502, expected_sparsity: 0.4374, expected_sequence_sparsity: 0.7833, target_sparsity: 0.4, step: 41500 lambda_1: -0.2841, lambda_2: 821.4862 lambda_3: 0.0000 train remain: [0.99 0.99 1. 
0.78 0.82 0.7 0.69 0.53 0.1 0.06] infer remain: [1.0, 1.0, 1.0, 0.76, 0.76, 0.68, 0.68, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.58, 0.39, 0.27, 0.14, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101110000 1111111111011110101100000 1111101111111010111010000 0101101111011010110100000 1000100000000000000000000 0000000000010000000000000 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.017162, lagrangian_loss: 0.084159, attention_score_distillation_loss: 0.000045 loss: 0.009848, lagrangian_loss: 0.042307, attention_score_distillation_loss: 0.000058 ---------------------------------------------------------------------- time: 2023-07-19 16:59:19 Evaluating: accuracy: 0.9266, eval_loss: 0.3469, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 42000 lambda_1: -0.2932, lambda_2: 831.1076 lambda_3: 0.0000 train remain: [0.99 0.99 0.99 0.78 0.81 0.7 0.68 0.53 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110100000 1000100000000000000000000 0000000000000000000000001 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.008410, lagrangian_loss: 0.160790, attention_score_distillation_loss: 0.000042 ETA: 2:10:39 | Epoch 19 finished. Took 381.39 seconds. loss: 0.004791, lagrangian_loss: 0.001433, attention_score_distillation_loss: 0.000046 ---------------------------------------------------------------------- time: 2023-07-19 17:00:48 Evaluating: accuracy: 0.9255, eval_loss: 0.3558, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 42500 lambda_1: -0.3749, lambda_2: 842.8365 lambda_3: 0.0000 train remain: [0.99 0.99 1. 0.78 0.8 0.7 0.68 0.53 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110100000 1000100000000000000000000 1000000000000000000000000 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.009855, lagrangian_loss: 0.007496, attention_score_distillation_loss: 0.000052 loss: 0.006313, lagrangian_loss: 0.001644, attention_score_distillation_loss: 0.000053 ---------------------------------------------------------------------- time: 2023-07-19 17:02:18 Evaluating: accuracy: 0.9243, eval_loss: 0.3661, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 43000 lambda_1: -0.2701, lambda_2: 854.6217 lambda_3: 0.0000 train remain: [0.99 0.99 1. 
0.79 0.79 0.69 0.69 0.53 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110100000 0000100100000000000000000 0001000000000000000000000 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.005993, lagrangian_loss: -0.000014, attention_score_distillation_loss: 0.000053 loss: 0.006569, lagrangian_loss: 0.026503, attention_score_distillation_loss: 0.000058 ---------------------------------------------------------------------- time: 2023-07-19 17:03:48 Evaluating: accuracy: 0.9232, eval_loss: 0.367, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 43500 lambda_1: -0.2675, lambda_2: 866.8499 lambda_3: 0.0000 train remain: [0.99 0.99 1. 0.79 0.78 0.69 0.69 0.53 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110100000 0000100100000000000000000 1000000000000000000000000 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.008160, lagrangian_loss: 0.026365, attention_score_distillation_loss: 0.000056 loss: 0.008577, lagrangian_loss: 0.075638, attention_score_distillation_loss: 0.000040 ---------------------------------------------------------------------- time: 2023-07-19 17:05:18 Evaluating: accuracy: 0.9243, eval_loss: 0.376, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 44000 lambda_1: -0.4296, lambda_2: 878.5640 lambda_3: 0.0000 train remain: [0.99 0.99 1. 0.81 0.76 0.69 0.69 0.53 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110100000 0000100100000000000000000 1000000000000000000000000 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.007913, lagrangian_loss: 0.072113, attention_score_distillation_loss: 0.000063 ETA: 2:03:53 | Epoch 20 finished. Took 376.69 seconds. loss: 0.004072, lagrangian_loss: 0.004772, attention_score_distillation_loss: 0.000045 ---------------------------------------------------------------------- time: 2023-07-19 17:06:47 Evaluating: accuracy: 0.9243, eval_loss: 0.3733, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 44500 lambda_1: -0.4339, lambda_2: 888.7227 lambda_3: 0.0000 train remain: [0.99 0.99 1. 
0.8 0.76 0.69 0.69 0.52 0.08 0.05] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110100000 0000100100000000000000000 0000000000000000000000001 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.127613, lagrangian_loss: 0.167968, attention_score_distillation_loss: 0.000063 loss: 0.010145, lagrangian_loss: 0.010796, attention_score_distillation_loss: 0.000060 ---------------------------------------------------------------------- time: 2023-07-19 17:08:17 Evaluating: accuracy: 0.9278, eval_loss: 0.3555, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 45000 lambda_1: -0.3059, lambda_2: 900.1195 lambda_3: 0.0000 train remain: [1. 0.99 1. 0.8 0.77 0.69 0.69 0.52 0.08 0.05] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110100000 1000100000000000000000000 0000000000000000000000001 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.008137, lagrangian_loss: 0.080227, attention_score_distillation_loss: 0.000060 loss: 0.008451, lagrangian_loss: -0.000030, attention_score_distillation_loss: 0.000050 ---------------------------------------------------------------------- time: 2023-07-19 17:09:47 Evaluating: accuracy: 0.9278, eval_loss: 0.3421, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 45500 lambda_1: -0.3026, lambda_2: 912.6192 lambda_3: 0.0000 train remain: [1. 0.99 1. 0.8 0.76 0.69 0.69 0.52 0.08 0.05] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110100000 0000100000000000010000000 1000000000000000000000000 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.003658, lagrangian_loss: 0.010049, attention_score_distillation_loss: 0.000056 loss: 0.005571, lagrangian_loss: 0.008614, attention_score_distillation_loss: 0.000048 ---------------------------------------------------------------------- time: 2023-07-19 17:11:16 Evaluating: accuracy: 0.9266, eval_loss: 0.3548, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 46000 lambda_1: -0.4178, lambda_2: 923.5681 lambda_3: 0.0000 train remain: [1. 0.99 1. 
0.8 0.77 0.69 0.69 0.52 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110100000 1000100000000000000000000 1000000000000000000000000 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.005839, lagrangian_loss: 0.023733, attention_score_distillation_loss: 0.000045 loss: 0.006106, lagrangian_loss: 0.036313, attention_score_distillation_loss: 0.000045 ETA: 1:57:09 | Epoch 21 finished. Took 376.18 seconds. ---------------------------------------------------------------------- time: 2023-07-19 17:12:46 Evaluating: accuracy: 0.9255, eval_loss: 0.3507, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 46500 lambda_1: -0.1736, lambda_2: 936.0222 lambda_3: 0.0000 train remain: [1. 1. 1. 0.8 0.78 0.7 0.7 0.52 0.1 0.06] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110100000 0000100001000000000000000 0000000000000000100000000 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.009574, lagrangian_loss: 0.010538, attention_score_distillation_loss: 0.000060 loss: 0.005597, lagrangian_loss: 0.038347, attention_score_distillation_loss: 0.000047 ---------------------------------------------------------------------- time: 2023-07-19 17:14:15 Evaluating: accuracy: 0.9289, eval_loss: 0.3452, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 47000 lambda_1: -0.0433, lambda_2: 948.1556 lambda_3: 0.0000 train remain: [1. 1. 1. 0.79 0.78 0.7 0.7 0.52 0.09 0.06] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110100000 0000100000001000000000000 0000000000000000001000000 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.024184, lagrangian_loss: 0.042090, attention_score_distillation_loss: 0.000050 loss: 0.004610, lagrangian_loss: 0.057158, attention_score_distillation_loss: 0.000056 ---------------------------------------------------------------------- time: 2023-07-19 17:15:45 Evaluating: accuracy: 0.9289, eval_loss: 0.3383, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 47500 lambda_1: -0.2667, lambda_2: 960.0967 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.79 0.77 0.7 0.69 0.52 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110100000 1000100000000000000000000 0000000010000000000000000 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.017927, lagrangian_loss: 0.063212, attention_score_distillation_loss: 0.000043 loss: 0.008120, lagrangian_loss: 0.003447, attention_score_distillation_loss: 0.000052 ---------------------------------------------------------------------- time: 2023-07-19 17:17:15 Evaluating: accuracy: 0.9278, eval_loss: 0.3528, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 48000 lambda_1: -0.2235, lambda_2: 971.6016 lambda_3: 0.0000 train remain: [1. 1. 1. 0.79 0.76 0.7 0.7 0.52 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110100000 0000101000000000000000000 0000000000000000010000000 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.004714, lagrangian_loss: 0.037656, attention_score_distillation_loss: 0.000055 loss: 0.005076, lagrangian_loss: 0.190249, attention_score_distillation_loss: 0.000066 ETA: 1:50:29 | Epoch 22 finished. Took 376.96 seconds. ---------------------------------------------------------------------- time: 2023-07-19 17:18:45 Evaluating: accuracy: 0.9289, eval_loss: 0.3311, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 48500 lambda_1: -0.1407, lambda_2: 982.6774 lambda_3: 0.0000 train remain: [1. 1. 1. 0.79 0.78 0.7 0.7 0.53 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110100000 0000100000001000000000000 0000000000000000100000000 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.003278, lagrangian_loss: 0.023118, attention_score_distillation_loss: 0.000048 loss: 0.010283, lagrangian_loss: 0.086443, attention_score_distillation_loss: 0.000060 ---------------------------------------------------------------------- time: 2023-07-19 17:20:15 Evaluating: accuracy: 0.9232, eval_loss: 0.3612, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 49000 lambda_1: -0.3675, lambda_2: 995.7005 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.79 0.77 0.7 0.7 0.52 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110100000 0000100000000000100000000 0000000000000000100000000 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.009656, lagrangian_loss: 0.009961, attention_score_distillation_loss: 0.000056 loss: 0.005070, lagrangian_loss: 0.001369, attention_score_distillation_loss: 0.000052 ---------------------------------------------------------------------- time: 2023-07-19 17:21:44 Evaluating: accuracy: 0.9232, eval_loss: 0.3452, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 49500 lambda_1: -0.4518, lambda_2: 1006.0292 lambda_3: 0.0000 train remain: [1. 1. 1. 0.79 0.76 0.69 0.69 0.52 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110100000 1000100000000000000000000 1000000000000000000000000 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.023249, lagrangian_loss: 0.011005, attention_score_distillation_loss: 0.000053 loss: 0.009364, lagrangian_loss: 0.001575, attention_score_distillation_loss: 0.000052 ---------------------------------------------------------------------- time: 2023-07-19 17:23:13 Evaluating: accuracy: 0.9232, eval_loss: 0.3698, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 50000 lambda_1: -0.2274, lambda_2: 1017.4321 lambda_3: 0.0000 train remain: [1. 1. 1. 0.79 0.77 0.69 0.69 0.52 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110100000 0000101000000000000000000 0000000000001000000000000 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.126595, lagrangian_loss: 0.004380, attention_score_distillation_loss: 0.000053 loss: 0.235262, lagrangian_loss: 0.000166, attention_score_distillation_loss: 0.000047 ---------------------------------------------------------------------- time: 2023-07-19 17:24:41 Evaluating: accuracy: 0.9186, eval_loss: 0.3889, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 50500 lambda_1: -0.1758, lambda_2: 1028.5825 lambda_3: 0.0000 train remain: [1. 1. 
0.99 0.79 0.76 0.7 0.7 0.52 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110100000 0001100000000000000000000 0000000000000000100000000 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.221742, lagrangian_loss: 0.034486, attention_score_distillation_loss: 0.000043 ETA: 1:43:51 | Epoch 23 finished. Took 378.66 seconds. loss: 0.008768, lagrangian_loss: 0.084534, attention_score_distillation_loss: 0.000046 ---------------------------------------------------------------------- time: 2023-07-19 17:26:10 Evaluating: accuracy: 0.9232, eval_loss: 0.3731, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 51000 lambda_1: -0.1421, lambda_2: 1039.8231 lambda_3: 0.0000 train remain: [1. 1. 1. 0.8 0.75 0.69 0.7 0.53 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110100000 0000100000000000001000000 0000000000000000000000001 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.010449, lagrangian_loss: 0.033116, attention_score_distillation_loss: 0.000046 loss: 0.004404, lagrangian_loss: 0.003113, attention_score_distillation_loss: 0.000057 ---------------------------------------------------------------------- time: 2023-07-19 17:27:38 Evaluating: accuracy: 0.9278, eval_loss: 0.344, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 51500 lambda_1: -0.0429, lambda_2: 1051.0416 lambda_3: 0.0000 train remain: [1. 1. 1. 0.8 0.75 0.7 0.7 0.53 0.09 0.06] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110100000 1000100000000000000000000 1000000000000000000000000 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.007905, lagrangian_loss: 0.015633, attention_score_distillation_loss: 0.000057 loss: 0.007022, lagrangian_loss: 0.001843, attention_score_distillation_loss: 0.000055 ---------------------------------------------------------------------- time: 2023-07-19 17:29:07 Evaluating: accuracy: 0.922, eval_loss: 0.3722, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 52000 lambda_1: -0.2290, lambda_2: 1062.3442 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.81 0.76 0.7 0.7 0.53 0.09 0.06] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110100000 0000100000000000001000000 0000000000000000000000001 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.004420, lagrangian_loss: 0.000912, attention_score_distillation_loss: 0.000051 loss: 0.006146, lagrangian_loss: 0.110975, attention_score_distillation_loss: 0.000039 ---------------------------------------------------------------------- time: 2023-07-19 17:30:34 Evaluating: accuracy: 0.9243, eval_loss: 0.3804, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 52500 lambda_1: -0.2784, lambda_2: 1073.5000 lambda_3: 0.0000 train remain: [1. 1. 1. 0.8 0.75 0.69 0.69 0.53 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110100000 0000100010000000000000000 0000000000001000000000000 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.297986, lagrangian_loss: 0.010362, attention_score_distillation_loss: 0.000058 ETA: 1:37:10 | Epoch 24 finished. Took 370.32 seconds. loss: 0.014081, lagrangian_loss: 0.103171, attention_score_distillation_loss: 0.000044 ---------------------------------------------------------------------- time: 2023-07-19 17:32:01 Evaluating: accuracy: 0.9243, eval_loss: 0.361, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 53000 lambda_1: 0.0155, lambda_2: 1085.3218 lambda_3: 0.0000 train remain: [1. 1. 1. 0.8 0.76 0.7 0.7 0.54 0.1 0.06] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110100000 1000100000000000000000000 0000000010000000000000000 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.005242, lagrangian_loss: 0.003217, attention_score_distillation_loss: 0.000059 loss: 0.047305, lagrangian_loss: 0.103842, attention_score_distillation_loss: 0.000041 ---------------------------------------------------------------------- time: 2023-07-19 17:33:27 Evaluating: accuracy: 0.9186, eval_loss: 0.3761, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 53500 lambda_1: -0.0985, lambda_2: 1097.8932 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.8 0.76 0.7 0.7 0.53 0.1 0.06] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110100000 1000100000000000000000000 0000000000000000001000000 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.005138, lagrangian_loss: 0.311181, attention_score_distillation_loss: 0.000039 loss: 0.004399, lagrangian_loss: 0.045660, attention_score_distillation_loss: 0.000056 ---------------------------------------------------------------------- time: 2023-07-19 17:34:54 Evaluating: accuracy: 0.9163, eval_loss: 0.4068, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 54000 lambda_1: -0.0808, lambda_2: 1108.8696 lambda_3: 0.0000 train remain: [1. 1. 1. 0.8 0.76 0.7 0.7 0.53 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110100000 0000100000000010000000000 0000000000001000000000000 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.005563, lagrangian_loss: 0.000490, attention_score_distillation_loss: 0.000052 loss: 0.002701, lagrangian_loss: 0.000310, attention_score_distillation_loss: 0.000057 ---------------------------------------------------------------------- time: 2023-07-19 17:36:21 Evaluating: accuracy: 0.922, eval_loss: 0.3859, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 54500 lambda_1: -0.4706, lambda_2: 1120.7057 lambda_3: 0.0000 train remain: [1. 0.99 1. 0.79 0.77 0.69 0.69 0.53 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110100000 0000100000000000001000000 0000000000000000000000001 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.005046, lagrangian_loss: 0.000858, attention_score_distillation_loss: 0.000052 ETA: 1:30:28 | Epoch 25 finished. Took 364.16 seconds. loss: 0.008167, lagrangian_loss: 0.006928, attention_score_distillation_loss: 0.000048 ---------------------------------------------------------------------- time: 2023-07-19 17:37:48 Evaluating: accuracy: 0.9243, eval_loss: 0.3639, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 55000 lambda_1: -0.1750, lambda_2: 1131.4738 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.79 0.77 0.7 0.7 0.53 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110100000 0000100001000000000000000 1000000000000000000000000 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.007312, lagrangian_loss: 0.030305, attention_score_distillation_loss: 0.000057 loss: 0.003924, lagrangian_loss: 0.001943, attention_score_distillation_loss: 0.000051 ---------------------------------------------------------------------- time: 2023-07-19 17:39:14 Evaluating: accuracy: 0.9232, eval_loss: 0.3751, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 55500 lambda_1: -0.3222, lambda_2: 1143.6475 lambda_3: 0.0000 train remain: [1. 1. 1. 0.78 0.77 0.69 0.69 0.53 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110100000 1000100000000000000000000 0000000000000000000000001 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.044854, lagrangian_loss: 0.166533, attention_score_distillation_loss: 0.000061 loss: 0.004780, lagrangian_loss: 0.001529, attention_score_distillation_loss: 0.000055 ---------------------------------------------------------------------- time: 2023-07-19 17:40:41 Evaluating: accuracy: 0.9255, eval_loss: 0.3674, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 56000 lambda_1: -0.2564, lambda_2: 1155.0938 lambda_3: 0.0000 train remain: [1. 1. 1. 0.78 0.77 0.69 0.69 0.53 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110100000 0000100000001000000000000 0000000000000000000000001 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.006782, lagrangian_loss: 0.052921, attention_score_distillation_loss: 0.000044 loss: 0.004538, lagrangian_loss: 0.004323, attention_score_distillation_loss: 0.000052 ---------------------------------------------------------------------- time: 2023-07-19 17:42:08 Evaluating: accuracy: 0.9266, eval_loss: 0.3523, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 56500 lambda_1: -0.2556, lambda_2: 1167.1124 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.79 0.76 0.69 0.69 0.53 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110100000 1000100000000000000000000 1000000000000000000000000 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.002842, lagrangian_loss: 0.047573, attention_score_distillation_loss: 0.000046 loss: 0.011549, lagrangian_loss: 0.000071, attention_score_distillation_loss: 0.000054 ETA: 1:23:49 | Epoch 26 finished. Took 363.92 seconds. ---------------------------------------------------------------------- time: 2023-07-19 17:43:34 Evaluating: accuracy: 0.9232, eval_loss: 0.3675, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 57000 lambda_1: -0.1855, lambda_2: 1179.0197 lambda_3: 0.0000 train remain: [1. 1. 1. 0.79 0.76 0.7 0.7 0.54 0.09 0.06] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110100000 0000100000000000001000000 0000000000001000000000000 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.003182, lagrangian_loss: 0.004353, attention_score_distillation_loss: 0.000052 loss: 0.005471, lagrangian_loss: 0.041603, attention_score_distillation_loss: 0.000052 ---------------------------------------------------------------------- time: 2023-07-19 17:45:00 Evaluating: accuracy: 0.9266, eval_loss: 0.341, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 57500 lambda_1: -0.3045, lambda_2: 1190.6985 lambda_3: 0.0000 train remain: [1. 1. 1. 0.79 0.76 0.69 0.69 0.54 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110100000 0000100000000000001000000 0000000000000000000000001 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.009677, lagrangian_loss: 0.003945, attention_score_distillation_loss: 0.000049 loss: 0.009485, lagrangian_loss: 0.020115, attention_score_distillation_loss: 0.000051 ---------------------------------------------------------------------- time: 2023-07-19 17:46:27 Evaluating: accuracy: 0.9243, eval_loss: 0.3598, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 58000 lambda_1: -0.2953, lambda_2: 1201.7389 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.79 0.76 0.7 0.7 0.54 0.1 0.06] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111111010110100000 1000100000000000000000000 0000000000000000000000001 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 loss: 0.003310, lagrangian_loss: 0.022545, attention_score_distillation_loss: 0.000050 loss: 0.004362, lagrangian_loss: 0.040250, attention_score_distillation_loss: 0.000054 ---------------------------------------------------------------------- time: 2023-07-19 17:47:53 Evaluating: accuracy: 0.9335, eval_loss: 0.3285, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 58500 lambda_1: -0.2262, lambda_2: 1213.2111 lambda_3: 0.0000 train remain: [1. 1. 1. 0.79 0.76 0.7 0.7 0.54 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110100000 0000100000001000000000000 0000000000000000100000000 Best eval score so far: 0.9289 @ step 28500 epoch 13.54 Saving the best model so far: [Epoch 27 | Step: 58500 | MACs sparsity: 0.4535 | Score: 0.9335 | Loss: 0.3285] loss: 0.005294, lagrangian_loss: 0.079756, attention_score_distillation_loss: 0.000044 loss: 0.003799, lagrangian_loss: 0.011152, attention_score_distillation_loss: 0.000046 ETA: 1:17:20 | Epoch 27 finished. Took 382.14 seconds. ---------------------------------------------------------------------- time: 2023-07-19 17:49:39 Evaluating: accuracy: 0.9163, eval_loss: 0.3892, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 59000 lambda_1: -0.1040, lambda_2: 1225.4785 lambda_3: 0.0000 train remain: [1. 1. 1. 0.79 0.76 0.7 0.7 0.54 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110100000 1000100000000000000000000 1000000000000000000000000 Best eval score so far: 0.9335 @ step 58500 epoch 27.79 loss: 0.030617, lagrangian_loss: 0.000033, attention_score_distillation_loss: 0.000052 loss: 0.003273, lagrangian_loss: 0.011096, attention_score_distillation_loss: 0.000051 ---------------------------------------------------------------------- time: 2023-07-19 17:51:06 Evaluating: accuracy: 0.9232, eval_loss: 0.357, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 59500 lambda_1: -0.1813, lambda_2: 1237.4011 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.79 0.75 0.7 0.7 0.54 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110100000 0000100000001000000000000 0000000000001000000000000 Best eval score so far: 0.9335 @ step 58500 epoch 27.79 loss: 0.003005, lagrangian_loss: 0.053603, attention_score_distillation_loss: 0.000040 loss: 0.009924, lagrangian_loss: 0.013898, attention_score_distillation_loss: 0.000054 ---------------------------------------------------------------------- time: 2023-07-19 17:52:33 Evaluating: accuracy: 0.9197, eval_loss: 0.3741, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 60000 lambda_1: -0.1195, lambda_2: 1248.5291 lambda_3: 0.0000 train remain: [1. 1. 1. 0.79 0.76 0.7 0.7 0.54 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110100000 0000100000001000000000000 0000000000001000000000000 Best eval score so far: 0.9335 @ step 58500 epoch 27.79 loss: 0.010397, lagrangian_loss: 0.010281, attention_score_distillation_loss: 0.000062 loss: 0.004336, lagrangian_loss: 0.000046, attention_score_distillation_loss: 0.000056 ---------------------------------------------------------------------- time: 2023-07-19 17:53:59 Evaluating: accuracy: 0.9197, eval_loss: 0.3875, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 60500 lambda_1: -0.0746, lambda_2: 1259.7573 lambda_3: 0.0000 train remain: [1. 1. 1. 0.79 0.76 0.7 0.7 0.54 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010111100000 0000100010000000000000000 1000000000000000000000000 Best eval score so far: 0.9335 @ step 58500 epoch 27.79 loss: 0.008605, lagrangian_loss: 0.146015, attention_score_distillation_loss: 0.000067 loss: 0.014067, lagrangian_loss: 0.003472, attention_score_distillation_loss: 0.000052 ---------------------------------------------------------------------- time: 2023-07-19 17:55:25 Evaluating: accuracy: 0.9232, eval_loss: 0.3866, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 61000 lambda_1: 0.0643, lambda_2: 1271.1039 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.79 0.76 0.7 0.71 0.55 0.1 0.06] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010111100000 0000100000000000010000000 0000000000001000000000000 Best eval score so far: 0.9335 @ step 58500 epoch 27.79 loss: 0.004650, lagrangian_loss: 0.044687, attention_score_distillation_loss: 0.000059 ETA: 1:10:46 | Epoch 28 finished. Took 367.57 seconds. loss: 0.004408, lagrangian_loss: 0.054539, attention_score_distillation_loss: 0.000044 ---------------------------------------------------------------------- time: 2023-07-19 17:56:53 Evaluating: accuracy: 0.9232, eval_loss: 0.3698, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 61500 lambda_1: -0.3601, lambda_2: 1283.4346 lambda_3: 0.0000 train remain: [1. 1. 1. 0.78 0.75 0.7 0.7 0.54 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110100000 0000100000001000000000000 0000000000000000100000000 Best eval score so far: 0.9335 @ step 58500 epoch 27.79 loss: 0.006690, lagrangian_loss: 0.037820, attention_score_distillation_loss: 0.000054 loss: 0.006226, lagrangian_loss: 0.014205, attention_score_distillation_loss: 0.000046 ---------------------------------------------------------------------- time: 2023-07-19 17:58:19 Evaluating: accuracy: 0.9209, eval_loss: 0.3866, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4405, expected_sequence_sparsity: 0.7845, target_sparsity: 0.4, step: 62000 lambda_1: -0.0337, lambda_2: 1294.3456 lambda_3: 0.0000 train remain: [1. 1. 1. 0.79 0.75 0.7 0.71 0.54 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.72, 0.56, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.27, 0.15, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111111111111010111010000 0101101111011010110101000 0000100000000010000000000 0000000000000000100000000 Best eval score so far: 0.9335 @ step 58500 epoch 27.79 loss: 0.003804, lagrangian_loss: 0.040909, attention_score_distillation_loss: 0.000059 loss: 0.005209, lagrangian_loss: 0.072344, attention_score_distillation_loss: 0.000061 ---------------------------------------------------------------------- time: 2023-07-19 17:59:45 Evaluating: accuracy: 0.9232, eval_loss: 0.3961, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4405, expected_sequence_sparsity: 0.7845, target_sparsity: 0.4, step: 62500 lambda_1: -0.2674, lambda_2: 1305.4608 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.78 0.75 0.7 0.71 0.54 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.72, 0.56, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.27, 0.15, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111111111111010111010000 0101101111011010110110000 0000100010000000000000000 0000000000000000000000001 Best eval score so far: 0.9335 @ step 58500 epoch 27.79 loss: 0.006811, lagrangian_loss: 0.008639, attention_score_distillation_loss: 0.000052 loss: 0.003622, lagrangian_loss: 0.017137, attention_score_distillation_loss: 0.000049 ---------------------------------------------------------------------- time: 2023-07-19 18:01:12 Evaluating: accuracy: 0.9209, eval_loss: 0.3866, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4405, expected_sequence_sparsity: 0.7845, target_sparsity: 0.4, step: 63000 lambda_1: -0.0440, lambda_2: 1317.6906 lambda_3: 0.0000 train remain: [1. 1. 1. 0.79 0.76 0.7 0.72 0.55 0.1 0.06] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.72, 0.56, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.27, 0.15, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111111111111010111010000 1101101111011010110100000 0000101000000000000000000 1000000000000000000000000 Best eval score so far: 0.9335 @ step 58500 epoch 27.79 loss: 0.002448, lagrangian_loss: 0.346444, attention_score_distillation_loss: 0.000041 ETA: 1:04:13 | Epoch 29 finished. Took 363.78 seconds. loss: 0.004151, lagrangian_loss: 0.030700, attention_score_distillation_loss: 0.000045 ---------------------------------------------------------------------- time: 2023-07-19 18:02:38 Evaluating: accuracy: 0.9232, eval_loss: 0.3837, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4405, expected_sequence_sparsity: 0.7845, target_sparsity: 0.4, step: 63500 lambda_1: -0.1709, lambda_2: 1329.0950 lambda_3: 0.0000 train remain: [1. 1. 1. 0.79 0.75 0.7 0.71 0.55 0.1 0.06] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.72, 0.56, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.27, 0.15, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111111111111010111010000 0101101111111010110100000 0000100100000000000000000 1000000000000000000000000 Best eval score so far: 0.9335 @ step 58500 epoch 27.79 loss: 0.005317, lagrangian_loss: 0.055183, attention_score_distillation_loss: 0.000058 loss: 0.004955, lagrangian_loss: 0.108678, attention_score_distillation_loss: 0.000061 ---------------------------------------------------------------------- time: 2023-07-19 18:04:04 Evaluating: accuracy: 0.914, eval_loss: 0.4174, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4405, expected_sequence_sparsity: 0.7845, target_sparsity: 0.4, step: 64000 lambda_1: -0.1744, lambda_2: 1340.0244 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.78 0.75 0.7 0.71 0.54 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.72, 0.56, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.27, 0.15, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111111111111010111010000 0101101111011010110110000 0000100000000000010000000 0000000000000000000001000 Best eval score so far: 0.9335 @ step 58500 epoch 27.79 loss: 0.010350, lagrangian_loss: 0.182796, attention_score_distillation_loss: 0.000065 loss: 0.002113, lagrangian_loss: 0.012199, attention_score_distillation_loss: 0.000054 ---------------------------------------------------------------------- time: 2023-07-19 18:05:30 Evaluating: accuracy: 0.9232, eval_loss: 0.3823, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4405, expected_sequence_sparsity: 0.7845, target_sparsity: 0.4, step: 64500 lambda_1: -0.2126, lambda_2: 1352.1821 lambda_3: 0.0000 train remain: [1. 1. 1. 0.78 0.74 0.7 0.71 0.54 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.72, 0.56, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.27, 0.15, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111111111111010111010000 1101101111011010110100000 0000100000000000001000000 0000000010000000000000000 Best eval score so far: 0.9335 @ step 58500 epoch 27.79 loss: 0.002971, lagrangian_loss: 0.083132, attention_score_distillation_loss: 0.000058 loss: 0.006463, lagrangian_loss: 0.386724, attention_score_distillation_loss: 0.000041 ---------------------------------------------------------------------- time: 2023-07-19 18:06:57 Evaluating: accuracy: 0.9186, eval_loss: 0.3901, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4405, expected_sequence_sparsity: 0.7845, target_sparsity: 0.4, step: 65000 lambda_1: -0.2470, lambda_2: 1364.3120 lambda_3: 0.0000 train remain: [1. 1. 1. 0.79 0.75 0.7 0.71 0.54 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.72, 0.56, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.27, 0.15, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111111111111010111010000 0101101111011110110100000 0010100000000000000000000 0000000000000000100000000 Best eval score so far: 0.9335 @ step 58500 epoch 27.79 loss: 0.007009, lagrangian_loss: 0.021756, attention_score_distillation_loss: 0.000048 loss: 0.005223, lagrangian_loss: 0.001395, attention_score_distillation_loss: 0.000049 ETA: 0:57:41 | Epoch 30 finished. Took 362.16 seconds. ---------------------------------------------------------------------- time: 2023-07-19 18:08:23 Evaluating: accuracy: 0.9186, eval_loss: 0.3888, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4405, expected_sequence_sparsity: 0.7845, target_sparsity: 0.4, step: 65500 lambda_1: -0.2038, lambda_2: 1376.3102 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.79 0.75 0.7 0.71 0.55 0.1 0.06] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.72, 0.56, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.27, 0.15, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111111111111010111010000 0101101111011010110100100 0000100000100000000000000 0000000000000000100000000 Best eval score so far: 0.9335 @ step 58500 epoch 27.79 loss: 0.010341, lagrangian_loss: 0.001311, attention_score_distillation_loss: 0.000050 loss: 0.001989, lagrangian_loss: 0.209403, attention_score_distillation_loss: 0.000042 ---------------------------------------------------------------------- time: 2023-07-19 18:09:50 Evaluating: accuracy: 0.9186, eval_loss: 0.4016, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4405, expected_sequence_sparsity: 0.7845, target_sparsity: 0.4, step: 66000 lambda_1: -0.2289, lambda_2: 1387.0728 lambda_3: 0.0000 train remain: [1. 1. 1. 0.79 0.74 0.69 0.71 0.54 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.72, 0.56, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.27, 0.15, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111111111111010111010000 0101101111011010110100100 0000100001000000000000000 0000000000000000100000000 Best eval score so far: 0.9335 @ step 58500 epoch 27.79 loss: 0.002681, lagrangian_loss: 0.101971, attention_score_distillation_loss: 0.000060 loss: 0.004634, lagrangian_loss: 0.004062, attention_score_distillation_loss: 0.000055 ---------------------------------------------------------------------- time: 2023-07-19 18:11:16 Evaluating: accuracy: 0.9266, eval_loss: 0.3666, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4405, expected_sequence_sparsity: 0.7845, target_sparsity: 0.4, step: 66500 lambda_1: -0.0507, lambda_2: 1399.0841 lambda_3: 0.0000 train remain: [1. 1. 1. 0.79 0.75 0.7 0.71 0.55 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.72, 0.56, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.27, 0.15, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111111111111010111010000 0101101111011010111100000 0000100000000000100000000 0000000000000000100000000 Best eval score so far: 0.9335 @ step 58500 epoch 27.79 loss: 0.002087, lagrangian_loss: 0.129615, attention_score_distillation_loss: 0.000044 loss: 0.004971, lagrangian_loss: 0.000881, attention_score_distillation_loss: 0.000053 ---------------------------------------------------------------------- time: 2023-07-19 18:12:43 Evaluating: accuracy: 0.9278, eval_loss: 0.356, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4405, expected_sequence_sparsity: 0.7845, target_sparsity: 0.4, step: 67000 lambda_1: 0.0116, lambda_2: 1411.0002 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.79 0.75 0.7 0.71 0.55 0.1 0.06] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.72, 0.56, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.27, 0.15, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111111111111010111010000 0101101111011010111100000 1000100000000000000000000 0000000000000000100000000 Best eval score so far: 0.9335 @ step 58500 epoch 27.79 loss: 0.008002, lagrangian_loss: 0.002378, attention_score_distillation_loss: 0.000054 loss: 0.010319, lagrangian_loss: 0.042074, attention_score_distillation_loss: 0.000045 ETA: 0:51:11 | Epoch 31 finished. Took 363.44 seconds. ---------------------------------------------------------------------- time: 2023-07-19 18:14:09 Evaluating: accuracy: 0.922, eval_loss: 0.3839, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4405, expected_sequence_sparsity: 0.7845, target_sparsity: 0.4, step: 67500 lambda_1: -0.1236, lambda_2: 1422.6659 lambda_3: 0.0000 train remain: [1. 1. 1. 0.79 0.75 0.7 0.71 0.55 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.72, 0.56, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.27, 0.15, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111111111111010111010000 0101101111011010111100000 0000100000001000000000000 1000000000000000000000000 Best eval score so far: 0.9335 @ step 58500 epoch 27.79 loss: 0.006457, lagrangian_loss: 0.036834, attention_score_distillation_loss: 0.000053 loss: 0.013327, lagrangian_loss: 0.061022, attention_score_distillation_loss: 0.000056 ---------------------------------------------------------------------- time: 2023-07-19 18:15:36 Evaluating: accuracy: 0.9197, eval_loss: 0.3936, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4404, expected_sequence_sparsity: 0.7845, target_sparsity: 0.4, step: 68000 lambda_1: 0.0118, lambda_2: 1433.4221 lambda_3: 0.0000 train remain: [1. 1. 1. 0.79 0.75 0.7 0.72 0.55 0.1 0.06] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.72, 0.56, 0.08, 0.08] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.27, 0.15, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111111111111010111010000 0101101111011010110101000 0000100000000000001000000 0000000000000000100000001 Best eval score so far: 0.9335 @ step 58500 epoch 27.79 loss: 0.010778, lagrangian_loss: 0.009595, attention_score_distillation_loss: 0.000049 loss: 0.004655, lagrangian_loss: 0.064071, attention_score_distillation_loss: 0.000043 ---------------------------------------------------------------------- time: 2023-07-19 18:17:02 Evaluating: accuracy: 0.9186, eval_loss: 0.3921, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4405, expected_sequence_sparsity: 0.7845, target_sparsity: 0.4, step: 68500 lambda_1: 0.0033, lambda_2: 1443.8096 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.79 0.75 0.7 0.72 0.55 0.09 0.06] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.72, 0.56, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.27, 0.15, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111111111111010111010000 0101101111111010110100000 0001100000000000000000000 0000000000000000000000001 Best eval score so far: 0.9335 @ step 58500 epoch 27.79 loss: 0.006267, lagrangian_loss: 0.000252, attention_score_distillation_loss: 0.000056 loss: 0.002737, lagrangian_loss: 0.000134, attention_score_distillation_loss: 0.000049 ---------------------------------------------------------------------- time: 2023-07-19 18:18:29 Evaluating: accuracy: 0.922, eval_loss: 0.3809, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 69000 lambda_1: -0.1860, lambda_2: 1456.8696 lambda_3: 0.0000 train remain: [1. 1. 1. 0.78 0.75 0.7 0.71 0.54 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111111010110100000 1000100000000000000000000 0000000000000000000000001 Best eval score so far: 0.9335 @ step 58500 epoch 27.79 loss: 0.003123, lagrangian_loss: 0.007215, attention_score_distillation_loss: 0.000048 loss: 0.004314, lagrangian_loss: 0.018547, attention_score_distillation_loss: 0.000047 ETA: 0:44:42 | Epoch 32 finished. Took 362.69 seconds. ---------------------------------------------------------------------- time: 2023-07-19 18:19:55 Evaluating: accuracy: 0.9278, eval_loss: 0.3581, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 69500 lambda_1: -0.1446, lambda_2: 1467.4972 lambda_3: 0.0000 train remain: [1. 1. 1. 0.78 0.75 0.7 0.71 0.54 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 1101101111011010110100000 0000100000001000000000000 1000000000000000000000000 Best eval score so far: 0.9335 @ step 58500 epoch 27.79 loss: 0.009108, lagrangian_loss: 0.016024, attention_score_distillation_loss: 0.000049 loss: 0.003129, lagrangian_loss: 0.013746, attention_score_distillation_loss: 0.000052 ---------------------------------------------------------------------- time: 2023-07-19 18:21:21 Evaluating: accuracy: 0.9278, eval_loss: 0.3482, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 70000 lambda_1: -0.2399, lambda_2: 1479.5732 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.78 0.75 0.7 0.71 0.54 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010111100000 0000100000001000000000000 1000000000000000000000000 Best eval score so far: 0.9335 @ step 58500 epoch 27.79 loss: 0.004541, lagrangian_loss: 0.023339, attention_score_distillation_loss: 0.000053 loss: 0.005145, lagrangian_loss: 0.000003, attention_score_distillation_loss: 0.000053 ---------------------------------------------------------------------- time: 2023-07-19 18:22:48 Evaluating: accuracy: 0.9255, eval_loss: 0.3602, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 70500 lambda_1: -0.2053, lambda_2: 1491.0387 lambda_3: 0.0000 train remain: [1. 1. 1. 0.78 0.75 0.7 0.71 0.55 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010111100000 0000100100000000000000000 1000000000000000000000000 Best eval score so far: 0.9335 @ step 58500 epoch 27.79 loss: 0.007048, lagrangian_loss: 0.233375, attention_score_distillation_loss: 0.000066 loss: 0.008428, lagrangian_loss: 0.093717, attention_score_distillation_loss: 0.000045 ---------------------------------------------------------------------- time: 2023-07-19 18:24:13 Evaluating: accuracy: 0.9174, eval_loss: 0.4037, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 71000 lambda_1: -0.2005, lambda_2: 1503.3418 lambda_3: 0.0000 train remain: [1. 1. 1. 0.78 0.75 0.7 0.71 0.55 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010111100000 0000100000000000100000000 1000000000000000000000000 Best eval score so far: 0.9335 @ step 58500 epoch 27.79 loss: 0.008609, lagrangian_loss: 0.084994, attention_score_distillation_loss: 0.000053 loss: 0.004410, lagrangian_loss: 0.015736, attention_score_distillation_loss: 0.000055 ---------------------------------------------------------------------- time: 2023-07-19 18:25:40 Evaluating: accuracy: 0.9163, eval_loss: 0.4026, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 71500 lambda_1: -0.1630, lambda_2: 1515.7715 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.78 0.75 0.7 0.71 0.54 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010111100000 0000100100000000000000000 1000000000000000000000000 Best eval score so far: 0.9335 @ step 58500 epoch 27.79 loss: 0.003061, lagrangian_loss: 0.190803, attention_score_distillation_loss: 0.000062 ETA: 0:38:16 | Epoch 33 finished. Took 366.76 seconds. loss: 0.004931, lagrangian_loss: 0.000027, attention_score_distillation_loss: 0.000046 ---------------------------------------------------------------------- time: 2023-07-19 18:27:06 Evaluating: accuracy: 0.9232, eval_loss: 0.3828, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 72000 lambda_1: -0.0772, lambda_2: 1527.7817 lambda_3: 0.0000 train remain: [1. 1. 1. 0.78 0.75 0.7 0.71 0.55 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110110000 0000100000001000000000000 0000000010000000000000000 Best eval score so far: 0.9335 @ step 58500 epoch 27.79 loss: 0.005254, lagrangian_loss: 0.091260, attention_score_distillation_loss: 0.000059 loss: 0.003216, lagrangian_loss: 0.035976, attention_score_distillation_loss: 0.000052 ---------------------------------------------------------------------- time: 2023-07-19 18:28:33 Evaluating: accuracy: 0.9174, eval_loss: 0.3958, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 72500 lambda_1: 0.0546, lambda_2: 1539.7094 lambda_3: 0.0000 train remain: [1. 1. 1. 0.79 0.76 0.7 0.71 0.55 0.1 0.06] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110110000 0000100000001000000000000 0000000000000000000000001 Best eval score so far: 0.9335 @ step 58500 epoch 27.79 loss: 0.001345, lagrangian_loss: 0.017621, attention_score_distillation_loss: 0.000053 loss: 0.001312, lagrangian_loss: 0.016323, attention_score_distillation_loss: 0.000053 ---------------------------------------------------------------------- time: 2023-07-19 18:29:59 Evaluating: accuracy: 0.9186, eval_loss: 0.391, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 73000 lambda_1: -0.3442, lambda_2: 1551.1321 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.78 0.74 0.69 0.71 0.54 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110110000 0000100000001000000000000 0000000000000000100000000 Best eval score so far: 0.9335 @ step 58500 epoch 27.79 loss: 0.004728, lagrangian_loss: 0.036811, attention_score_distillation_loss: 0.000044 loss: 0.010639, lagrangian_loss: 0.007627, attention_score_distillation_loss: 0.000047 ---------------------------------------------------------------------- time: 2023-07-19 18:31:25 Evaluating: accuracy: 0.9174, eval_loss: 0.4025, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 73500 lambda_1: -0.2420, lambda_2: 1562.0221 lambda_3: 0.0000 train remain: [1. 1. 1. 0.78 0.74 0.69 0.71 0.54 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110110000 0000100000001000000000000 0000000000000000000100000 Best eval score so far: 0.9335 @ step 58500 epoch 27.79 loss: 0.001686, lagrangian_loss: 0.021263, attention_score_distillation_loss: 0.000052 ETA: 0:31:51 | Epoch 34 finished. Took 362.62 seconds. loss: 0.003135, lagrangian_loss: 0.139665, attention_score_distillation_loss: 0.000042 ---------------------------------------------------------------------- time: 2023-07-19 18:32:52 Evaluating: accuracy: 0.9186, eval_loss: 0.3941, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 74000 lambda_1: -0.1809, lambda_2: 1573.5197 lambda_3: 0.0000 train remain: [1. 1. 1. 0.79 0.74 0.7 0.71 0.55 0.09 0.06] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110110000 0000100000001000000000000 0000000000001000000000000 Best eval score so far: 0.9335 @ step 58500 epoch 27.79 loss: 0.002117, lagrangian_loss: 0.237573, attention_score_distillation_loss: 0.000040 loss: 0.238044, lagrangian_loss: 0.063505, attention_score_distillation_loss: 0.000053 ---------------------------------------------------------------------- time: 2023-07-19 18:34:19 Evaluating: accuracy: 0.9209, eval_loss: 0.3868, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 74500 lambda_1: -0.1354, lambda_2: 1584.7434 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.79 0.75 0.7 0.71 0.55 0.1 0.06] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110110000 0000100010000000000000000 0000000000000000000100000 Best eval score so far: 0.9335 @ step 58500 epoch 27.79 loss: 0.004703, lagrangian_loss: 0.035853, attention_score_distillation_loss: 0.000054 loss: 0.003526, lagrangian_loss: 0.061310, attention_score_distillation_loss: 0.000043 ---------------------------------------------------------------------- time: 2023-07-19 18:35:45 Evaluating: accuracy: 0.9174, eval_loss: 0.4025, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 75000 lambda_1: -0.0771, lambda_2: 1595.6375 lambda_3: 0.0000 train remain: [1. 1. 1. 0.79 0.75 0.7 0.71 0.55 0.1 0.06] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110110000 0000100000000000010000000 0000000000001000000000000 Best eval score so far: 0.9335 @ step 58500 epoch 27.79 loss: 0.003222, lagrangian_loss: 0.010867, attention_score_distillation_loss: 0.000056 loss: 0.004039, lagrangian_loss: 0.151809, attention_score_distillation_loss: 0.000037 ---------------------------------------------------------------------- time: 2023-07-19 18:37:11 Evaluating: accuracy: 0.9117, eval_loss: 0.4162, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 75500 lambda_1: -0.3404, lambda_2: 1607.3636 lambda_3: 0.0000 train remain: [1. 1. 1. 0.78 0.74 0.69 0.7 0.54 0.09 0.05] infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0] 1111111111111111111111111 1111111111111111111111111 1111111111111111111111111 1111111111111110101101000 1111111111111110101100000 1111111111011110101100000 1111101111111010111010000 0101101111011010110110000 0000100000001000000000000 0000000000000000000000001 Best eval score so far: 0.9335 @ step 58500 epoch 27.79 loss: 0.008898, lagrangian_loss: 0.002641, attention_score_distillation_loss: 0.000056 loss: 0.002288, lagrangian_loss: 0.005810, attention_score_distillation_loss: 0.000047 ETA: 0:25:26 | Epoch 35 finished. Took 362.8 seconds. ---------------------------------------------------------------------- time: 2023-07-19 18:38:37 Evaluating: accuracy: 0.9163, eval_loss: 0.3967, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 76000 lambda_1: -0.0507, lambda_2: 1617.9211 lambda_3: 0.0000 train remain: [1. 1. 1. 
----------------------------------------------------------------------
time: 2023-07-19 18:38:37
Evaluating: accuracy: 0.9163, eval_loss: 0.3967, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 76000
lambda_1: -0.0507, lambda_2: 1617.9211 lambda_3: 0.0000
train remain: [1. 1. 1. 0.79 0.75 0.7 0.71 0.55 0.1 0.06]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0111101111011010110100000
0000100000001000000000000
1000000000000000000000000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.004749, lagrangian_loss: 0.004356, attention_score_distillation_loss: 0.000054
loss: 0.005166, lagrangian_loss: 0.071034, attention_score_distillation_loss: 0.000059
----------------------------------------------------------------------
time: 2023-07-19 18:40:04
Evaluating: accuracy: 0.914, eval_loss: 0.4179, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 76500
lambda_1: -0.0346, lambda_2: 1630.5061 lambda_3: 0.0000
train remain: [1. 1. 1. 0.79 0.75 0.7 0.7 0.55 0.1 0.06]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110101000
0000101000000000000000000
0000000000000000000100000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.107366, lagrangian_loss: 0.038062, attention_score_distillation_loss: 0.000047
loss: 0.002034, lagrangian_loss: 0.000957, attention_score_distillation_loss: 0.000053
----------------------------------------------------------------------
time: 2023-07-19 18:41:30
Evaluating: accuracy: 0.914, eval_loss: 0.4237, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 77000
lambda_1: -0.2779, lambda_2: 1642.2699 lambda_3: 0.0000
train remain: [1. 1. 1. 0.78 0.75 0.7 0.7 0.54 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
1101101111011010110100000
0000100000000000001000000
1000000000000000000000000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.001993, lagrangian_loss: 0.011436, attention_score_distillation_loss: 0.000049
loss: 0.019683, lagrangian_loss: 0.092736, attention_score_distillation_loss: 0.000063
----------------------------------------------------------------------
time: 2023-07-19 18:42:57
Evaluating: accuracy: 0.9197, eval_loss: 0.3978, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 77500
lambda_1: -0.0523, lambda_2: 1653.0314 lambda_3: 0.0000
train remain: [1. 1. 1. 0.78 0.75 0.7 0.7 0.55 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0111101111011010110100000
1000100000000000000000000
0000000000000000000100000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.003899, lagrangian_loss: 0.073292, attention_score_distillation_loss: 0.000058
loss: 0.002660, lagrangian_loss: 0.000018, attention_score_distillation_loss: 0.000047
ETA: 0:19:03 | Epoch 36 finished. Took 362.99 seconds.
----------------------------------------------------------------------
time: 2023-07-19 18:44:23
Evaluating: accuracy: 0.922, eval_loss: 0.3961, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 78000
lambda_1: -0.1274, lambda_2: 1664.6094 lambda_3: 0.0000
train remain: [1. 1. 1. 0.78 0.75 0.7 0.71 0.55 0.09 0.06]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110110000
0000101000000000000000000
0000000000000000000000001
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.003181, lagrangian_loss: 0.000082, attention_score_distillation_loss: 0.000048
loss: 0.002963, lagrangian_loss: 0.032109, attention_score_distillation_loss: 0.000057
----------------------------------------------------------------------
time: 2023-07-19 18:45:50
Evaluating: accuracy: 0.9209, eval_loss: 0.387, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 78500
lambda_1: -0.3639, lambda_2: 1676.4302 lambda_3: 0.0000
train remain: [1. 1. 1. 0.78 0.74 0.69 0.7 0.54 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110110000
0000100000001000000000000
0000000000000000100000000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.005896, lagrangian_loss: 0.111578, attention_score_distillation_loss: 0.000059
loss: 0.012731, lagrangian_loss: 0.001073, attention_score_distillation_loss: 0.000053
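Why "train remain" (e.g. 0.78, 0.75, ...) never quite equals "infer remain" (0.76, 0.72, ...): the run banner reported hard-concrete gates (temperature: 0.67, init drop rate: 0.01, token_loga shape: [10, 25]), and under that scheme training reports an expectation over stochastic gates while inference thresholds them deterministically. A minimal sketch under that assumption (Louizos et al.-style L0 gates; names illustrative, not this repo's API):

    import math
    import torch

    BETA = 0.67               # temperature from the log banner
    GAMMA, ZETA = -0.1, 1.1   # standard hard-concrete stretch (assumed)

    token_loga = torch.randn(10, 25)  # [prune locations, bins]

    def train_remain(loga: torch.Tensor) -> torch.Tensor:
        # Expected fraction of open gates: P(z > 0) per bin, averaged.
        p_open = torch.sigmoid(loga - BETA * math.log(-GAMMA / ZETA))
        return p_open.mean(dim=1)

    def infer_remain(loga: torch.Tensor) -> torch.Tensor:
        # Deterministic gate, binarized -> the 0/1 mask rows printed above.
        s = torch.sigmoid(loga / BETA)
        z = torch.clamp(s * (ZETA - GAMMA) + GAMMA, 0.0, 1.0)
        return (z > 0.5).float().mean(dim=1)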
----------------------------------------------------------------------
time: 2023-07-19 18:47:16
Evaluating: accuracy: 0.9163, eval_loss: 0.4137, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 79000
lambda_1: -0.0953, lambda_2: 1688.5875 lambda_3: 0.0000
train remain: [1. 1. 1. 0.78 0.75 0.7 0.7 0.55 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110110000
1000100000000000000000000
0000000000001000000000000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.003938, lagrangian_loss: 0.001211, attention_score_distillation_loss: 0.000054
loss: 0.004430, lagrangian_loss: 0.008583, attention_score_distillation_loss: 0.000048
----------------------------------------------------------------------
time: 2023-07-19 18:48:42
Evaluating: accuracy: 0.9174, eval_loss: 0.4159, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 79500
lambda_1: -0.1537, lambda_2: 1699.3953 lambda_3: 0.0000
train remain: [1. 1. 1. 0.78 0.75 0.7 0.7 0.54 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110110000
0000100000000000100000000
0000000000000000001000000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.002152, lagrangian_loss: 0.010100, attention_score_distillation_loss: 0.000048
loss: 0.003274, lagrangian_loss: 0.229384, attention_score_distillation_loss: 0.000044
ETA: 0:12:41 | Epoch 37 finished. Took 362.55 seconds.
----------------------------------------------------------------------
time: 2023-07-19 18:50:09
Evaluating: accuracy: 0.9174, eval_loss: 0.416, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 80000
lambda_1: 0.0127, lambda_2: 1710.7234 lambda_3: 0.0000
train remain: [1. 1. 1. 0.78 0.75 0.7 0.7 0.55 0.1 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
1101101111011010110100000
1000100000000000000000000
0000000000001000000000000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.001956, lagrangian_loss: 0.049769, attention_score_distillation_loss: 0.000057
loss: 0.007824, lagrangian_loss: 0.023596, attention_score_distillation_loss: 0.000051
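Because these evaluation blocks repeat every 500 steps with a fixed layout, the accuracy/sparsity trajectory is easiest to recover from the raw log with a small parser. Illustrative only; the regex is keyed to the "Evaluating:" lines in this section and the file name is hypothetical:

    import re

    # Pull (step, accuracy, expected_sparsity) triples out of the raw log.
    EVAL_RE = re.compile(
        r"Evaluating: accuracy: ([\d.]+), eval_loss: [\d.]+"
        r".*?expected_sparsity: ([\d.]+),.*?step: (\d+)"
    )

    def parse_evals(log_text: str):
        return [(int(step), float(acc), float(sp))
                for acc, sp, step in EVAL_RE.findall(log_text)]

    # e.g. parse_evals(open("train.log").read())[-3:]
    # -> [(83000, 0.9174, 0.4423), (83500, 0.9174, 0.4423), (84000, 0.9174, 0.4423)]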
----------------------------------------------------------------------
time: 2023-07-19 18:51:36
Evaluating: accuracy: 0.9186, eval_loss: 0.411, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 80500
lambda_1: -0.0600, lambda_2: 1724.6722 lambda_3: 0.0000
train remain: [1. 1. 1. 0.78 0.75 0.7 0.7 0.54 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111111010110100000
0000100000000000100000000
0000000000001000000000000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.013019, lagrangian_loss: 0.202098, attention_score_distillation_loss: 0.000059
loss: 0.002397, lagrangian_loss: 0.062876, attention_score_distillation_loss: 0.000057
----------------------------------------------------------------------
time: 2023-07-19 18:53:03
Evaluating: accuracy: 0.9163, eval_loss: 0.4202, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 81000
lambda_1: -0.0943, lambda_2: 1735.4951 lambda_3: 0.0000
train remain: [1. 1. 1. 0.78 0.75 0.7 0.7 0.54 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
1101101111011010110100000
1000100000000000000000000
0010000000000000000000000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.004799, lagrangian_loss: 0.003932, attention_score_distillation_loss: 0.000053
loss: 0.002140, lagrangian_loss: 0.219619, attention_score_distillation_loss: 0.000058
----------------------------------------------------------------------
time: 2023-07-19 18:54:31
Evaluating: accuracy: 0.9186, eval_loss: 0.3913, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 81500
lambda_1: 0.0196, lambda_2: 1746.4094 lambda_3: 0.0000
train remain: [1. 1. 1. 0.79 0.75 0.7 0.7 0.55 0.1 0.06]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010111100000
0000100000001000000000000
1000000000000000000000000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.004118, lagrangian_loss: 0.097584, attention_score_distillation_loss: 0.000045
loss: 0.004009, lagrangian_loss: 0.012719, attention_score_distillation_loss: 0.000052
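The token_prune_loc flags in each block are derivable from "infer remain": a location is marked True exactly when its kept fraction drops below 1.0 (the first three prune locations, layers 2-4, are still intact above, hence the three leading False). A one-line sketch:

    # token_prune_loc: True where a prune location actually drops tokens
    # at inference, i.e. where "infer remain" falls below 1.0.
    infer_remain = [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04]
    token_prune_loc = [r < 1.0 for r in infer_remain]
    # -> [False, False, False, True, True, True, True, True, True, True]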
----------------------------------------------------------------------
time: 2023-07-19 18:55:59
Evaluating: accuracy: 0.9186, eval_loss: 0.4033, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 82000
lambda_1: -0.0256, lambda_2: 1756.4104 lambda_3: 0.0000
train remain: [1. 1. 1. 0.78 0.75 0.7 0.7 0.54 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010111100000
0000100000001000000000000
1000000000000000000000000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.004811, lagrangian_loss: 0.020769, attention_score_distillation_loss: 0.000056
ETA: 0:06:20 | Epoch 38 finished. Took 372.34 seconds.
loss: 0.006787, lagrangian_loss: 0.000367, attention_score_distillation_loss: 0.000057
----------------------------------------------------------------------
time: 2023-07-19 18:57:27
Evaluating: accuracy: 0.9174, eval_loss: 0.4066, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 82500
lambda_1: -0.1395, lambda_2: 1767.7959 lambda_3: 0.0000
train remain: [1. 1. 1. 0.78 0.74 0.7 0.7 0.54 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110100010
0000100000001000000000000
0000000000001000000000000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.002873, lagrangian_loss: 0.066111, attention_score_distillation_loss: 0.000048
loss: 0.001274, lagrangian_loss: 0.008245, attention_score_distillation_loss: 0.000055
----------------------------------------------------------------------
time: 2023-07-19 18:58:55
Evaluating: accuracy: 0.9174, eval_loss: 0.4043, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 83000
lambda_1: -0.0655, lambda_2: 1778.9573 lambda_3: 0.0000
train remain: [1. 1. 1. 0.79 0.74 0.7 0.7 0.55 0.1 0.06]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
1101101111011010110100000
0001100000000000000000000
0000000000000000100000000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.006256, lagrangian_loss: 0.026077, attention_score_distillation_loss: 0.000057
loss: 0.006882, lagrangian_loss: 0.016961, attention_score_distillation_loss: 0.000053
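The ETA lines are plain epoch bookkeeping: epochs are numbered 0-39 (num_train_epochs=40.0), so after "Epoch 38 finished" one epoch of roughly 372 s remains, and after epoch 39 the ETA reaches 0:00:00. A sketch of the presumable computation (the logged values suggest a slightly smoothed per-epoch time):

    from datetime import timedelta

    # Presumable ETA bookkeeping: remaining epochs x per-epoch wall time.
    def eta_after_epoch(epoch: int, total_epochs: int, epoch_seconds: float) -> timedelta:
        remaining = total_epochs - (epoch + 1)
        return timedelta(seconds=round(remaining * epoch_seconds))

    print(eta_after_epoch(38, 40, 372.34))  # 0:06:12, vs. the logged "ETA: 0:06:20"
    print(eta_after_epoch(39, 40, 371.54))  # 0:00:00, matching the final ETA line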
----------------------------------------------------------------------
time: 2023-07-19 19:00:24
Evaluating: accuracy: 0.9174, eval_loss: 0.4029, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 83500
lambda_1: -0.1752, lambda_2: 1789.7844 lambda_3: 0.0000
train remain: [1. 1. 1. 0.78 0.74 0.7 0.7 0.54 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010111100000
1000100000000000000000000
0000000000000000000000001
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.003101, lagrangian_loss: 0.155395, attention_score_distillation_loss: 0.000056
loss: 0.003904, lagrangian_loss: 0.006937, attention_score_distillation_loss: 0.000055
----------------------------------------------------------------------
time: 2023-07-19 19:01:53
Evaluating: accuracy: 0.9174, eval_loss: 0.4077, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 84000
lambda_1: -0.0343, lambda_2: 1800.2780 lambda_3: 0.0000
train remain: [1. 1. 1. 0.79 0.74 0.7 0.7 0.54 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
1101101111011010110100000
0000100000000010000000000
0000000000001000000000000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.001714, lagrangian_loss: 0.187577, attention_score_distillation_loss: 0.000062
ETA: 0:00:00 | Epoch 39 finished. Took 371.54 seconds.
07/19/2023 19:04:32 - WARNING - urllib3.connectionpool - Retrying (Retry(total=4, connect=5, read=4, redirect=5, status=5)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='southcentralus.api.azureml.ms', port=443): Read timed out. (read timeout=120)")': /mlflow/v2.0/subscriptions/d4404794-ab5b-48de-b7c7-ec1fefb0a04e/resourceGroups/gcr-singularity-octo/providers/Microsoft.MachineLearningServices/workspaces/msroctows/api/2.0/mlflow/runs/get?run_uuid=abaf7266-b685-4ed2-977e-c3790b442fc2&run_id=abaf7266-b685-4ed2-977e-c3790b442fc2