Table-GPT by Microsoft: Empower LLMs To Understand Tables
#Table-GPT #Table-tuning #Microsoft AI #ai papers #llm papers
💫 Summary
Microsoft Research's Table-GPT enhances large language models' ability to understand tables through table-tuning, significantly improving task accuracy and generalizing well to unseen tasks.
✦
The Table-GPT model aims to address large language models' poor understanding of tabular data.
00:06 Existing large language models often give inaccurate responses when handling tabular data.
The research shows that the two-dimensional nature of tables makes models perform poorly when reading vertically.
In tests of missing-value identification, ChatGPT identified the correct row number 92.3% of the time, but the correct column number only 42.2% of the time.
The study proposes a table-tuned GPT model to improve table understanding and response accuracy.
✦
The researchers improved the model's performance on table tasks through table-tuning.
02:24 On table question answering (e.g., the second-grade art scores question), ChatGPT's accuracy was only 51.2%.
On the data imputation task, ChatGPT's zero-shot result was also only 52.4%.
Table-tuning is inspired by instruction-tuning: the model is fine-tuned on a table-instructions dataset to improve performance.
Each table-tuning data sample is a triplet of instruction, table, and response.
✦
The researchers created the dataset for table-tuning via synthesis and augmentation.
04:44 2.9 million real tables from Wikipedia and 188k database tables serve as the base.
Synthesis comes first, generating a labeled table-instructions dataset.
Examples include the data imputation and column-finding tasks, where new samples are generated by randomly replacing cells.
An error-detection task is also synthesized, creating data by injecting typos into random cells.
✦
To create a more diverse table-instructions dataset, three augmentation techniques are used.
07:08 The first is instruction-level augmentation, paraphrasing the manually written instructions to avoid overfitting.
The second is table-level augmentation, generating more samples by reordering columns or rows without changing the table's semantics.
The third is label-level augmentation, creating additional samples by providing the correct answer and asking the model to generate reasoning.
00:06 Thank you for joining this video about Table-GPT, a new research paper from Microsoft titled
00:11 "Table-GPT: Table-tuned GPT for Diverse Table Tasks".
00:14 We have seen tremendous progress with large language models such as ChatGPT, Llama, and
00:19 more, where we can feed an LLM a text instruction or question and get an accurate
00:24 response from the model.
00:25 However, if we try to feed the model table data in some text format,
00:30 along with a question about that table, the LLM is more likely to yield
00:35 an inaccurate response, and we'll explain that in more detail in a minute.
00:37 In the paper, the researchers have created the Table-GPT model, which targets that problem
00:42 and can better understand tables in the input and yield accurate responses.
00:46 In this video we'll explain the paper to understand how Table-GPT was created and how it performs
00:51 compared to other large language models.
00:53 Let's start with the question of whether current large language models can understand tables,
00:57 which the researchers have investigated.
00:59 An important observation is that large language models are mostly pre-trained on natural language
01:04 text from the web or from books, and on code.
01:07 Tabular data is different from natural language text and code, and so LLMs may not be able
01:12 to reliably read tables.
01:14 One main difference is that text and code are one-dimensional while tables are two-dimensional.
01:19 With tables, it is important to be able to read vertically in order to answer
01:24 some types of questions.
01:25 In the following example from the paper, we can see an instruction to find the row and
01:30 column where a value is missing from the table.
01:32 We can see the value is missing in row 2 for the column "art", yet the tested language model
01:37 gets the row right but the column wrong.
01:40 This example implies that the model is better at reasoning horizontally than vertically.
01:45 Indeed, when ChatGPT was evaluated on 1,000 samples, it provided the correct row
01:51 number 92.3% of the time and the correct column only 42.2% of the time.
01:56 The researchers refer to this task as missing value identification.
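
To make the vertical-reading issue concrete, here is a minimal sketch of the common practice of serializing a table row by row into text before prompting an LLM. The table content and the markdown format are illustrative assumptions, not taken from the paper:

```python
# Illustrative sketch: serializing a table row by row, as is commonly done
# before prompting an LLM. Each column's values end up scattered across the
# text, which is one hypothesis for why models read tables poorly "vertically".
headers = ["student", "grade", "math", "art", "music"]
rows = [
    ["Jennifer", "2nd", "89", "94", "93"],
    ["James",    "2nd", "91", "",   "88"],  # missing "art" value in row 2
]

def to_markdown(headers, rows):
    lines = ["| " + " | ".join(headers) + " |",
             "| " + " | ".join("---" for _ in headers) + " |"]
    lines += ["| " + " | ".join(r) + " |" for r in rows]
    return "\n".join(lines)

print(to_markdown(headers, rows))
# The "art" column values are interleaved with all other columns' values,
# so the model must hop across the sequence to reason about one column.
```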
02:01 Another example of a different task is column finding, where the instruction is to
02:05 find which column contains a certain value, 93 in this example.
02:09 Here again the response "art" is inaccurate, as it should be "music".
02:13 ChatGPT was able to get the correct column for this task 69.9% of the time.
02:20 Another, more complex task is table question answering, where we ask a question that is
02:24 based on the table.
02:25 In this example, the question is how many second-graders scored over 90 in art, and we can see that the model
02:30 replied with 2, while Jennifer scored 94 and James's score is missing, so the answer should
02:36 be 1.
02:37 ChatGPT provided the correct result for this task only 51.2% of the time.
02:43 The last example task is data imputation, where we ask the model to fill a cell that
02:48 contains a placeholder [TO-FILL] token.
02:50 In this example the model was able to correctly return the matching continent for China.
02:54 However, in this case as well, the zero-shot accuracy of ChatGPT is only 52.4%.
02:59 So how were the researchers able to create a model that does better on such table
03:05 tasks?
03:06 The answer is a new approach they refer to as table-tuning.
03:08 This approach is inspired by instruction-tuning, which has proved to be successful for large language
03:13 models.
03:14 As a short reminder of what instruction-tuning is: large language models are first pre-trained
03:19 on a huge amount of text to learn general-purpose knowledge.
03:22 This step helps the LLM to be good at predicting the next token in a sequence, so for example,
03:27 given an input such as "write a bed-time _", the LLM would be able to complete it with
03:32 a reasonable word, such as "story".
03:35 However, after the pre-training stage the model is still not good at following human
03:40 instructions.
03:41 For this reason, we have the instruction-tuning step, where we fine-tune the model on an instructions
03:46 dataset, in which each sample is a tuple of an instruction and a response that
03:51 we use as the label.
03:52 After this step, the model is good at following instructions (we ignore reinforcement
03:56 learning here for simplicity).
03:58 So, table-tuning is another step that can run either on the pre-trained LLM or on the
04:03 instruction-tuned LLM, where we fine-tune the model on a table-instructions dataset.
04:08 Here, each sample in the dataset is a triplet of an instruction, a table, and a response,
04:14 similar to the examples that we saw earlier.
04:16 We'll talk in a minute about the creation of the tables dataset.
04:20 After this step, the model is tuned to understand tables properly.
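
As a rough illustration of what such a triplet could look like as a fine-tuning record, here is a minimal sketch; the field names, serialization format, and table content are assumptions for illustration, not the paper's exact format:

```python
# Hypothetical shape of one table-tuning sample: an (instruction, table,
# response) triplet. Field names and serialization are illustrative only.
sample = {
    "instruction": "Look at the table below and fill in the cell marked [TO-FILL].",
    "table": (
        "| country | continent |\n"
        "| --- | --- |\n"
        "| China | [TO-FILL] |\n"
        "| France | Europe |"
    ),
    "response": "Asia",  # used as the label during fine-tuning
}

# During fine-tuning, the instruction and table form the prompt, and the
# response is the completion the model is trained to produce.
prompt = sample["instruction"] + "\n\n" + sample["table"]
```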
04:23 We can take another look at this with the following figure from the paper.
04:27 On the left side we can see instruction-tuning, where base large language models are trained
04:32 over tuples of instruction and response (called completion here), in order to create chat-expert
04:37 language models such as ChatGPT.
04:38 And on the right, we can see table-tuning, where either a base large language model such as
04:44 GPT, or an instruction-tuned model such as ChatGPT, is further trained using triplets
04:49 of instruction, table and response, in order to create a table-tuned version of the model.
04:54 Before moving on, if you enjoy this content then please subscribe to the channel and hit
04:58 the like button to help this channel grow.
05:01 Let's now dive deeper into how the dataset used for table-tuning was created.
05:06 The researchers refer to their method of creating that dataset as synthesis-then-augment.
05:10 We first note that there is limited diversity in the existing labeled data available.
05:15 So, the goal is to create a large and diverse enough set of labeled data, but without expensive
05:20 human labeling.
05:21 We start with a large set of real tables, without instructions or labels, where 2.9
05:27 million tables are taken from Wikipedia and 188k more are database tables.
05:33 The first step is synthesis, which results in a dataset of labeled table instructions.
05:38 In each synthesis step, we sample a real table and a task from a set of supported tasks,
05:43 and create a new sample of (instruction, table, response).
05:46 The table in the generated sample is not necessarily identical to the input table.
05:51 For example, here we sample the data imputation task, where the model needs to fill a missing
05:56 value.
05:57 We sample a table and randomly replace one cell with the [TO-FILL] token, and we use
06:01 the original cell value as the label.
06:04 Regarding the instructions, they can be manually crafted and reused for other samples of the
06:09 same task, with different tables.
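
Here is a minimal sketch of that synthesis step, assuming tables are represented as lists of rows; the helper name and instruction wording are illustrative, not the paper's:

```python
import random

# Minimal sketch of synthesizing one data-imputation sample: pick a random
# cell, hide it behind [TO-FILL], and keep the original value as the label.
def synthesize_imputation(table):
    rows = [row[:] for row in table]   # copy so the source table is untouched
    i = random.randrange(len(rows))
    j = random.randrange(len(rows[i]))
    label = rows[i][j]                 # original cell value becomes the label
    rows[i][j] = "[TO-FILL]"
    instruction = "Fill in the cell marked [TO-FILL] in the table below."
    return instruction, rows, label

table = [["China", "Asia"], ["France", "Europe"], ["Brazil", "South America"]]
instruction, masked_table, label = synthesize_imputation(table)
```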
06:11 Another example that we saw earlier is column finding, where we ask the model to determine which column
06:16 contains a certain value.
06:18 For a sampled table, it is possible to detect a value which appears just once in the table
06:22 and automatically generate the instruction to look for that value, in this case "93".
06:28 And we use the value's column as the label, in this case "music".
06:31 And there are many more tasks that the researchers have synthesized data for, such
06:35 as error detection, where a typo is injected automatically into a random cell and the original
06:41 cell value is used as the label.
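
A possible sketch of that typo injection; duplicating a random character is just one assumed way to corrupt a cell, and the paper may use different perturbations:

```python
import random

# Sketch of synthesizing an error-detection sample: corrupt one random cell
# with a typo and keep the clean value as the label. Assumes cells are
# non-empty strings; duplicating a random character is an assumed strategy.
def inject_typo(table):
    rows = [row[:] for row in table]
    i = random.randrange(len(rows))
    j = random.randrange(len(rows[i]))
    clean = rows[i][j]
    k = random.randrange(len(clean))
    rows[i][j] = clean[:k] + clean[k] + clean[k:]  # e.g. "Asia" -> "Assia"
    instruction = "Find the cell in the table below that contains a typo."
    return instruction, rows, clean

table = [["China", "Asia"], ["France", "Europe"]]
instruction, corrupted_table, label = inject_typo(table)
```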
06:43 Another synthesized task is table summarization, where the title of a Wikipedia table is used as
06:48 the label.
06:49 And there are several more.
06:50 We can see a summary of the different tasks in the following table from the paper.
06:54 We won't go over each one, but we can see that some tasks are used for training only,
06:59 some are used for both training and testing, and some are used only at test time.
07:03 Additionally, we can see that the data for most tasks is synthesized, while for a few
07:08 more complex tasks they use data from previous research.
07:11 So, after the synthesis step, we already have a diverse table-instructions dataset, but
07:17 to create an even more diverse dataset, we have the second step, which is augmentation.
07:21 There are three types of augmentations used here.
07:24 First is instruction-level augmentation.
07:26 We mentioned earlier that instructions are shared between different instances of
07:30 the same task.
07:31 For example, if we ask the model to summarize a table, the instruction can stay the same
07:36 for different tables.
07:37 To avoid overfitting and to create more diverse samples, they use LLMs to paraphrase the manually
07:43 crafted instructions.
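
A sketch of what that could look like; `call_llm` is a hypothetical placeholder for whatever LLM API is used, and the paraphrasing prompt is assumed:

```python
# Sketch of instruction-level augmentation. call_llm is a hypothetical
# placeholder for an actual LLM API call; the prompt wording is assumed.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def augment_instruction(instruction: str, n_variants: int = 3) -> list[str]:
    variants = []
    for _ in range(n_variants):
        prompt = f"Paraphrase the following instruction, keeping its meaning:\n{instruction}"
        variants.append(call_llm(prompt))
    return variants
```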
07:45 Second is table-level augmentation.
07:47 Here we create more samples by changing the table itself, but without changing its semantic
07:52 meaning.
07:53 We do that by re-ordering columns or rows, which should mostly not affect the table's semantics.
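
A minimal sketch of that permutation, assuming a table represented as a header list plus row lists:

```python
import random

# Sketch of table-level augmentation: shuffle the column order and the row
# order. For most relational tables this preserves the table's meaning.
def permute_table(headers, rows):
    col_order = list(range(len(headers)))
    random.shuffle(col_order)
    new_headers = [headers[c] for c in col_order]
    new_rows = [[row[c] for c in col_order] for row in rows]
    random.shuffle(new_rows)   # reorder rows as well
    return new_headers, new_rows

headers = ["student", "math", "art"]
rows = [["Jennifer", "89", "94"], ["James", "91", "88"]]
print(permute_table(headers, rows))
```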
07:58 The third type is label-level, or ground-truth-level, augmentation, where we create additional samples
08:04 by providing an LLM with the correct answer and asking it to add reasoning for the answer.
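
And a sketch of label-level augmentation under the same assumptions, with the hypothetical `call_llm` placeholder redefined here so the snippet is self-contained:

```python
# Sketch of label-level augmentation: given a sample and its known correct
# answer, ask an LLM to generate reasoning that leads to that answer, and
# use the reasoning plus answer as a richer label. call_llm is a
# hypothetical placeholder for a real LLM API.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def augment_label(instruction: str, table_text: str, answer: str) -> str:
    prompt = (
        f"{instruction}\n\n{table_text}\n\n"
        f"The correct answer is: {answer}\n"
        "Explain step by step why this answer is correct."
    )
    return f"{call_llm(prompt)}\nFinal answer: {answer}"
```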
08:09 Let's move on to see some of the results presented in the paper.
08:13 We start with a comparison of ChatGPT with a table-tuned version of ChatGPT.
08:18 In the following figure from the paper, we see the results for 8 task types, where for
08:22 each task we have 4 bars.
08:24 The left two bars in each task are zero-shot results, where the prompt uses just an instruction
08:29 and a table, and the right two bars are few-shot, where a few examples are added to the prompt
08:35 in addition to the target instruction and table.
08:37 The green bars are ChatGPT and the orange bars are the table-tuned version.
08:42 We can clearly see improvement on most of the tasks with table-tuning.
08:46 A noticeable example is error detection, where the table-tuned version's zero-shot performance
08:51 improves dramatically.
08:53 Interestingly, the 4 charts at the bottom are for tasks which the table-tuned model
08:57 did not train on, yet it is still able to improve performance over ChatGPT.
09:02 What if we compare GPT-3.5, rather than the instruction-tuned ChatGPT?
09:08 In the following figure from the paper, we can see a similar trend again, this time with GPT-3.5
09:13 in blue and the table-tuned version in red.
09:16 Similarly, it achieves better performance, and it is also able to generalize well to unseen
09:21 tasks, as we can see in the 4 charts at the bottom.
09:24 Thank you for watching and I hope to see you again in the next video.