Because some specialised tasks do not benefit from generic language models pre-trained on large amounts of general-domain data, this research project set out to investigate the adaptation of French BERT models to the legal domain, with the ultimate goal of helping law professionals. The project also explored whether smaller architectures are sufficient for domain-specific sub-languages.
The resulting set of BERT models, called JuriBERT, showed that domain-specific pre-trained models can outperform their general-domain equivalents on legal tasks.
In particular, the team applied JuriBERT to help speed up case assignment between the Cour de cassation's distinct formations, a task that had previously been done manually and slowed down the cassation proceedings substantially. The model was able to accurately predict the most relevant formation for judgment based on the text of the appeal brief. The research also produced preliminary results on estimating the complexity of a given case, again based on the text of the appeal brief.

JuriBERT models are pretrained on 6.3 GB of raw French legal text from two different sources: the first dataset is crawled from Légifrance, and the second consists of anonymized court decisions and claimants' pleadings from the Court of Cassation. The latter contains more than 100k long documents from different court cases.
The JuriBERT models are pretrained on an Nvidia GTX 1080 Ti GPU and evaluated on a legal-domain downstream task: assigning claimants' pleadings to a chamber and a section of the court. While JuriBERT_SMALL outperforms the general-domain models (CamemBERT_BASE and CamemBERT_LARGE), the other JuriBERT models perform similarly to them.
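To make the downstream setup concrete, here is a minimal fine-tuning sketch for the chamber/section assignment task, assuming a JuriBERT checkpoint in standard Hugging Face format and a CSV of pleading texts paired with chamber/section labels. The checkpoint path, file names, and label count are illustrative placeholders, not the authors' actual configuration.

```python
# Minimal fine-tuning sketch (assumed setup, not the authors' exact pipeline).
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

MODEL_PATH = "path/to/juribert-small"   # hypothetical local checkpoint path
NUM_LABELS = 8                          # hypothetical number of chamber/section classes

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_PATH, num_labels=NUM_LABELS)

# Hypothetical dataset: one pleading per row, with a "text" and a "label" column.
dataset = load_dataset("csv", data_files={"train": "pleadings_train.csv",
                                          "test": "pleadings_test.csv"})

def tokenize(batch):
    # Pleadings are long documents; truncate them to the model's maximum length.
    return tokenizer(batch["text"], truncation=True, padding="max_length")

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="juribert-finetuned",
                           per_device_train_batch_size=8,
                           num_train_epochs=3),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()
```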

The four JuriBERT models are freely available to the research community via the link below.

Dataset · Publication · Demo
