Table-GPT by Microsoft: Empower LLMs To Understand Tables
#Table-GPT #Table-tuning #Microsoft AI #ai papers #llm papers
💫 Summary
Microsoft Research's Table-GPT enhances large language models' ability to understand tables through table-tuning, significantly improving task accuracy and generalizing well to unseen tasks.
✦
The Table-GPT model aims to address large language models' poor understanding of tabular data.
00:06 Existing large language models often give inaccurate responses when handling tabular data.
The research shows that the two-dimensional nature of tables makes models perform poorly when reading vertically.
In tests of missing-value identification, ChatGPT identified the correct row number 92.3% of the time, but the correct column number only 42.2% of the time.
The study proposes a table-tuned GPT model to improve table understanding and response accuracy.
✦
The researchers improved the model's performance on table tasks through table-tuning.
02:24 On table question answering (e.g., the second-grade art scores question), ChatGPT's accuracy was only 51.2%.
On the data imputation task, ChatGPT's zero-shot result was also only 52.4%.
Table-tuning is inspired by instruction-tuning: the model is fine-tuned on a table-instructions dataset to improve performance.
Each table-tuning data sample is a triplet of instruction, table, and response.
✦
The researchers created the dataset for table-tuning via synthesis and augmentation.
04:44 2.9 million real tables from Wikipedia and 188k database tables serve as the base.
Synthesis comes first, generating a labeled table-instructions dataset.
Examples include the data imputation and column-finding tasks, where new samples are generated by randomly replacing cells.
An error-detection task is also synthesized, creating data by injecting typos into random cells.
✦
To create a more diverse table-instructions dataset, three augmentation techniques are used.
07:08 The first is instruction-level augmentation, paraphrasing the manually written instructions to avoid overfitting.
The second is table-level augmentation, generating more samples by reordering columns or rows without changing the table's semantics.
The third is label-level augmentation, creating additional samples by providing the correct answer and asking the model to generate reasoning.
00:06 Thank you for joining this video about Table-GPT, a new research paper from Microsoft titled
00:11 "Table-GPT: Table-tuned GPT for Diverse Table Tasks".
00:14 We have seen tremendous progress with large language models such as ChatGPT, Llama, and
00:19 more, where we can feed an LLM a text instruction or question and get an accurate
00:24 response from the model.
00:25 However, if we try to feed the model table data in some text format,
00:30 along with a question about that table, the LLM is more likely to yield
00:35 an inaccurate response, and we'll explain that in more detail in a minute.
00:37 In the paper, the researchers have created the Table-GPT model, which targets that problem
00:42 and can better understand tables in the input and yield accurate responses.
00:46 In this video we'll explain the paper to understand how Table-GPT was created and how it performs
00:51 compared to other large language models.
00:53 Let's start with the question of whether current large language models can understand tables,
00:57 which the researchers have investigated.
00:59 An important observation is that large language models are mostly pre-trained on natural language
01:04 text from the web or from books, and on code.
01:07 Tabular data is different from natural language text and code, and so LLMs may not be able
01:12 to reliably read tables.
01:14 One main difference is that text and code are one-dimensional while tables are two-dimensional.
01:19 With tables, it is important to be able to read vertically in order to answer
01:24 some types of questions.
01:25 In the following example from the paper, we can see an instruction to find the row and
01:30 column where a value is missing from the table.
01:32 We can see the value is missing in row 2 for the column "art", yet the tested language model
01:37 gets the row right but the column wrong.
01:40 This example implies that the model is better at reasoning horizontally than vertically.
01:45 Indeed, when ChatGPT was evaluated on 1,000 samples, it provided the correct row
01:51 number 92.3% of the time and the correct column only 42.2% of the time.
01:56 The researchers refer to this task as missing value identification.
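
To make the vertical-reading issue concrete, here is a minimal sketch of the common practice of serializing a table row by row into text before prompting an LLM. The table content and the markdown format are illustrative assumptions, not taken from the paper:

```python
# Illustrative sketch: serializing a table row by row, as is commonly done
# before prompting an LLM. Each column's values end up scattered across the
# text, which is one hypothesis for why models read tables poorly "vertically".
headers = ["student", "grade", "math", "art", "music"]
rows = [
    ["Jennifer", "2nd", "89", "94", "93"],
    ["James",    "2nd", "91", "",   "88"],  # missing "art" value in row 2
]

def to_markdown(headers, rows):
    lines = ["| " + " | ".join(headers) + " |",
             "| " + " | ".join("---" for _ in headers) + " |"]
    lines += ["| " + " | ".join(r) + " |" for r in rows]
    return "\n".join(lines)

print(to_markdown(headers, rows))
# The "art" column values are interleaved with all other columns' values,
# so the model must hop across the sequence to reason about one column.
```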
02:01 Another example of a different task is column finding, where the instruction is to
02:05 find which column contains a certain value, 93 in this example.
02:09 Here again the response "art" is inaccurate, as it should be "music".
02:13 ChatGPT was able to get the correct column for this task 69.9% of the time.
02:20 Another, more complex task is table question answering, where we ask a question that is
02:24 based on the table.
02:25 In this example, the question is how many second-graders scored over 90 in art, and we can see that the model
02:30 replied with 2, while Jennifer scored 94 and James's score is missing, so the answer should
02:36 be 1.
02:37 ChatGPT provided the correct result for this task only 51.2% of the time.
02:43 The last example task is data imputation, where we ask the model to fill a cell that
02:48 contains a placeholder [TO-FILL] token.
02:50 In this example the model was able to correctly return the matching continent for China.
02:54 However, in this case as well, the zero-shot accuracy of ChatGPT is only 52.4%.
02:59 So how were the researchers able to create a model that does better on such table
03:05 tasks?
03:06 The answer is a new approach they refer to as table-tuning.
03:08 This approach is inspired by instruction-tuning, which has proved to be successful for large language
03:13 models.
03:14 As a short reminder of what instruction-tuning is: large language models are first pre-trained
03:19 on a huge amount of text to learn general-purpose knowledge.
03:22 This step helps the LLM to be good at predicting the next token in a sequence, so for example,
03:27 given an input such as "write a bed-time _", the LLM would be able to complete it with
03:32 a reasonable word, such as "story".
03:35 However, after the pre-training stage the model is still not good at following human
03:40 instructions.
03:41 For this reason, we have the instruction-tuning step, where we fine-tune the model on an instructions
03:46 dataset, in which each sample is a tuple of an instruction and a response that
03:51 we use as the label.
03:52 After this step, the model is good at following instructions (we ignore reinforcement
03:56 learning here for simplicity).
03:58 So, table-tuning is another step that can run either on the pre-trained LLM or on the
04:03 instruction-tuned LLM, where we fine-tune the model on a table-instructions dataset.
04:08 Here, each sample in the dataset is a triplet of an instruction, a table, and a response,
04:14 similar to the examples that we saw earlier.
04:16 We'll talk in a minute about the creation of the tables dataset.
04:20 After this step, the model is tuned to understand tables properly.
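
As a rough illustration of what such a triplet could look like as a fine-tuning record, here is a minimal sketch; the field names, serialization format, and table content are assumptions for illustration, not the paper's exact format:

```python
# Hypothetical shape of one table-tuning sample: an (instruction, table,
# response) triplet. Field names and serialization are illustrative only.
sample = {
    "instruction": "Look at the table below and fill in the cell marked [TO-FILL].",
    "table": (
        "| country | continent |\n"
        "| --- | --- |\n"
        "| China | [TO-FILL] |\n"
        "| France | Europe |"
    ),
    "response": "Asia",  # used as the label during fine-tuning
}

# During fine-tuning, the instruction and table form the prompt, and the
# response is the completion the model is trained to produce.
prompt = sample["instruction"] + "\n\n" + sample["table"]
```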
04:23 We can take another look at this with the following figure from the paper.
04:27 On the left side we can see instruction-tuning, where base large language models are trained
04:32 over tuples of instruction and response (called completion here), in order to create chat-expert
04:37 language models such as ChatGPT.
04:38 And on the right, we can see table-tuning, where either a base large language model such as
04:44 GPT, or an instruction-tuned model such as ChatGPT, is further trained using triplets
04:49 of instruction, table and response, in order to create a table-tuned version of the model.
04:54 Before moving on, if you enjoy this content then please subscribe to the channel and hit
04:58 the like button to help this channel grow.
05:01 Let's now dive deeper into how the dataset used for table-tuning was created.
05:06 The researchers refer to their method of creating that dataset as synthesis-then-augment.
05:10 We first note that there is limited diversity in the existing labeled data available.
05:15 So, the goal is to create a large and diverse enough set of labeled data, but without expensive
05:20 human labeling.
05:21 We start with a large set of real tables, without instructions or labels, where 2.9
05:27 million tables are taken from Wikipedia and 188k more are database tables.
05:33 The first step is synthesis, which results in a dataset of labeled table instructions.
05:38 In each synthesis step, we sample a real table and a task from a set of supported tasks,
05:43 and create a new sample of (instruction, table, response).
05:46 The table in the generated sample is not necessarily identical to the input table.
05:51 For example, here we sample the data imputation task, where the model needs to fill a missing
05:56 value.
05:57 We sample a table and randomly replace one cell with the [TO-FILL] token, and we use
06:01 the original cell value as the label.
06:04 Regarding the instructions, they can be manually crafted and reused for other samples of the
06:09 same task, with different tables.
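
Here is a minimal sketch of that synthesis step, assuming tables are represented as lists of rows; the helper name and instruction wording are illustrative, not the paper's:

```python
import random

# Minimal sketch of synthesizing one data-imputation sample: pick a random
# cell, hide it behind [TO-FILL], and keep the original value as the label.
def synthesize_imputation(table):
    rows = [row[:] for row in table]   # copy so the source table is untouched
    i = random.randrange(len(rows))
    j = random.randrange(len(rows[i]))
    label = rows[i][j]                 # original cell value becomes the label
    rows[i][j] = "[TO-FILL]"
    instruction = "Fill in the cell marked [TO-FILL] in the table below."
    return instruction, rows, label

table = [["China", "Asia"], ["France", "Europe"], ["Brazil", "South America"]]
instruction, masked_table, label = synthesize_imputation(table)
```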
06:11 Another example that we saw earlier is column finding, where we ask the model to determine which column
06:16 contains a certain value.
06:18 For a sampled table, it is possible to detect a value which appears just once in the table
06:22 and automatically generate the instruction to look for that value, in this case "93".
06:28 And we use the value's column as the label, in this case "music".
06:31 And there are many more tasks that the researchers have synthesized data for, such
06:35 as error detection, where a typo is injected automatically into a random cell and the original
06:41 cell value is used as the label.
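
A possible sketch of that typo injection; duplicating a random character is just one assumed way to corrupt a cell, and the paper may use different perturbations:

```python
import random

# Sketch of synthesizing an error-detection sample: corrupt one random cell
# with a typo and keep the clean value as the label. Assumes cells are
# non-empty strings; duplicating a random character is an assumed strategy.
def inject_typo(table):
    rows = [row[:] for row in table]
    i = random.randrange(len(rows))
    j = random.randrange(len(rows[i]))
    clean = rows[i][j]
    k = random.randrange(len(clean))
    rows[i][j] = clean[:k] + clean[k] + clean[k:]  # e.g. "Asia" -> "Assia"
    instruction = "Find the cell in the table below that contains a typo."
    return instruction, rows, clean

table = [["China", "Asia"], ["France", "Europe"]]
instruction, corrupted_table, label = inject_typo(table)
```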
06:43 Another synthesized task is table summarization, where the title of a Wikipedia table is used as
06:48 the label.
06:49 And there are several more.
06:50 We can see a summary of the different tasks in the following table from the paper.
06:54 We won't go over each one, but we can see that some tasks are used for training only,
06:59 some are used for both training and testing, and some are used only at test time.
07:03 Additionally, we can see that the data for most tasks is synthesized, while for a few
07:08 more complex tasks they use data from previous research.
07:11 So, after the synthesis step, we already have a diverse table-instructions dataset, but
07:17 to create an even more diverse dataset, we have the second step, which is augmentation.
07:21 There are three types of augmentations used here.
07:24 First is instruction-level augmentation.
07:26 We mentioned earlier that instructions are shared between different instances of
07:30 the same task.
07:31 For example, if we ask the model to summarize a table, the instruction can stay the same
07:36 for different tables.
07:37 To avoid overfitting and to create more diverse samples, they use LLMs to paraphrase the manually
07:43 crafted instructions.
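
A sketch of what that could look like; `call_llm` is a hypothetical placeholder for whatever LLM API is used, and the paraphrasing prompt is assumed:

```python
# Sketch of instruction-level augmentation. call_llm is a hypothetical
# placeholder for an actual LLM API call; the prompt wording is assumed.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def augment_instruction(instruction: str, n_variants: int = 3) -> list[str]:
    variants = []
    for _ in range(n_variants):
        prompt = f"Paraphrase the following instruction, keeping its meaning:\n{instruction}"
        variants.append(call_llm(prompt))
    return variants
```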
07:45 Second is table-level augmentation.
07:47 Here we create more samples by changing the table itself, but without changing its semantic
07:52 meaning.
07:53 We do that by re-ordering columns or rows, which should mostly not affect the table's semantics.
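
A minimal sketch of that permutation, assuming a table represented as a header list plus row lists:

```python
import random

# Sketch of table-level augmentation: shuffle the column order and the row
# order. For most relational tables this preserves the table's meaning.
def permute_table(headers, rows):
    col_order = list(range(len(headers)))
    random.shuffle(col_order)
    new_headers = [headers[c] for c in col_order]
    new_rows = [[row[c] for c in col_order] for row in rows]
    random.shuffle(new_rows)   # reorder rows as well
    return new_headers, new_rows

headers = ["student", "math", "art"]
rows = [["Jennifer", "89", "94"], ["James", "91", "88"]]
print(permute_table(headers, rows))
```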
07:58 The third type is label-level, or ground-truth-level, augmentation, where we create additional samples
08:04 by providing an LLM with the correct answer and asking it to add reasoning for the answer.
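
And a sketch of label-level augmentation under the same assumptions, with the hypothetical `call_llm` placeholder redefined here so the snippet is self-contained:

```python
# Sketch of label-level augmentation: given a sample and its known correct
# answer, ask an LLM to generate reasoning that leads to that answer, and
# use the reasoning plus answer as a richer label. call_llm is a
# hypothetical placeholder for a real LLM API.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def augment_label(instruction: str, table_text: str, answer: str) -> str:
    prompt = (
        f"{instruction}\n\n{table_text}\n\n"
        f"The correct answer is: {answer}\n"
        "Explain step by step why this answer is correct."
    )
    return f"{call_llm(prompt)}\nFinal answer: {answer}"
```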
08:09 Let's move on to see some of the results presented in the paper.
08:13 We start with a comparison of ChatGPT with a table-tuned version of ChatGPT.
08:18 In the following figure from the paper, we see the results for 8 task types, where for
08:22 each task we have 4 bars.
08:24 The left two bars in each task are zero-shot results, where the prompt uses just an instruction
08:29 and a table, and the right two bars are few-shot, where a few examples are added to the prompt
08:35 in addition to the target instruction and table.
08:37 The green bars are ChatGPT and the orange bars are the table-tuned version.
08:42 We can clearly see improvement on most of the tasks with table-tuning.
08:46 A noticeable example is error detection, where the table-tuned version's zero-shot performance
08:51 improves dramatically.
08:53 Interestingly, the 4 charts at the bottom are for tasks which the table-tuned model
08:57 did not train on, yet it is still able to improve performance over ChatGPT.
09:02 What if we compare GPT-3.5, rather than the instruction-tuned ChatGPT?
09:08 In the following figure from the paper, we can see a similar trend again, this time with GPT-3.5
09:13 in blue and the table-tuned version in red.
09:16 Similarly, it achieves better performance, and it is also able to generalize well to unseen
09:21 tasks, as we can see in the 4 charts at the bottom.
09:24 Thank you for watching and I hope to see you again in the next video.