Updated on 2024-12-26 GMT+08:00

Shooting a Human Video

Thanks for using Huawei Cloud MetaStudio. This guide will help you produce virtual avatar images.

Please note that the quality of virtual avatar image modeling is closely related to your recordings. To create better virtual avatars, please read the following virtual avatar video shooting standards carefully.

Overview of virtual avatar video shooting standards:

Shooting Specifications
The total video duration is 5 minutes, the resolution and frame rate are 4K/25 FPS or higher, and the format is MP4/MOV.
Site Layout
- Background: Use a green-screen background with even color and without damage or wrinkles.
- Lighting: Use even, stable illumination at a standard daylight color temperature. Ensure that the model is well lit, that there is no shadow on the face, and that the light does not change significantly during recording.
- Camera position: The camera is aligned with the person's eyes and focuses on the face area to ensure that the face is clear. A vertical video is recommended.
- Voice pickup: Keep the recording environment quiet and free from noise interference.
Model Image
- Face: Avoid light reflection caused by excessive oil on the face. Ensure that there is no scattered hair on the face. To avoid light reflection, do not wear glasses. Ensure that the facial outline is clear and the model is enthusiastic.
- Dress code: Do not wear clothes in a color similar to green, or clothes with green patterns. Do not wear metal earrings, bracelets, or watches that may reflect light.
Shooting Process
The model should smile and move naturally (including the head). After each action is complete, the hands should be put back to the initial position. The model should keep the mouth closed when not speaking.
Video Submission
The original voice of the video for training must be retained, and the audio and video must be in sync. Do not clip the video. Ensure that the narration, silence, and gestures of the video are retained in the same video.

Shooting guide and script examples

Appendix 1: Check Items
Appendix 2: Guide to Choreography Customization
Appendix 3: Script Examples

Camera Installation & Shooting Specifications

Notes:

You are advised to use a lens with an equivalent focal length of 40 mm to 85 mm. Do not use an ultra wide angle lens.
Mount the camera vertically on the tripod at a proper height. Align the camera with the person's eyes and focus on the face area (see Figure 1) to ensure that the face is clear. If you record the whole body, ensure that a strip of green screen remains visible below the model's feet.
Figure 1 Camera position
A vertical video is recommended. Keep the model in the middle of the frame and leave enough margin so that the model stays within the frame when making gestures. See Figure 2.
Figure 2 Shooting example
Avoid overexposure and underexposure.
The model should be at least 1.5 meters away from the green-screen background to avoid shadows.

See Table 1.

**Table 1** Recommended camera shooting specifications
Shooting Specifications	Standard
Resolution and frame rate	4K/25 FPS or higher
Aperture	Aperture opening smaller than f/4 (that is, an f-number above 4) to avoid significant blurring due to a shallow depth of field
ISO	100–800 to avoid noise caused by excessive ISO
White balance	3,500–5,500 K, fixed white balance throughout the process
Recording format	H.264/H.265 encoding
Bitrate	> 60 Mbit/s
Color bit depth	10 bits or 8 bits
Shutter speed	≤ 1/(frame rate × 4) s. Example: When the frame rate is 60 FPS, the shutter speed must be ≤ 1/(60 × 4) = 1/240 s.
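
The numeric rows of Table 1 can be checked before a shoot with a few lines of Python. The following is a minimal sketch, not an official MetaStudio tool: the function names `max_shutter_time` and `check_settings` are hypothetical, the shutter speed is assumed to be expressed as an exposure time in seconds, and the aperture and resolution rows are not covered.

```python
from fractions import Fraction


def max_shutter_time(frame_rate_fps: int) -> Fraction:
    """Longest exposure time in seconds allowed by the rule 1 / (frame rate x 4)."""
    if frame_rate_fps <= 0:
        raise ValueError("frame rate must be positive")
    return Fraction(1, frame_rate_fps * 4)


def check_settings(frame_rate_fps: int, shutter_time_s: Fraction, iso: int,
                   white_balance_k: int, bitrate_mbps: float) -> list[str]:
    """Return the Table 1 rows that are violated; an empty list means all pass."""
    problems = []
    if frame_rate_fps < 25:
        problems.append("frame rate is below 25 FPS")
    if shutter_time_s > max_shutter_time(frame_rate_fps):
        problems.append("shutter speed is slower than 1/(frame rate x 4)")
    if not 100 <= iso <= 800:
        problems.append("ISO is outside 100-800")
    if not 3500 <= white_balance_k <= 5500:
        problems.append("white balance is outside 3,500-5,500 K")
    if bitrate_mbps <= 60:
        problems.append("bitrate is not above 60 Mbit/s")
    return problems


# The example from Table 1: at 60 FPS the limit is 1/240 s.
print(max_shutter_time(60))  # 1/240
# A 25 FPS setup with a 1/100 s shutter passes every numeric check.
print(check_settings(25, Fraction(1, 100), 400, 5000, 80))  # []
```

Using `Fraction` keeps shutter speeds such as 1/240 exact, so a value that sits right on the limit is not rejected because of floating-point rounding.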

FAQs:

What if my device can shoot 1080p videos only?
If the device cannot shoot 4K videos, try 1080p (1080 x 1920) half-body shooting to capture face details.
Can I use my mobile phone for shooting if I do not have a camera?
Mobile phones are not recommended for shooting. If you must use one, set the recording specifications of your phone to 4K/30 FPS or 4K/60 FPS and use a stabilizer to keep the video steady. Other shooting requirements, such as lighting and green screen, are the same as those in Camera Installation & Shooting Specifications.

Lighting

A proper lighting environment will significantly improve the shooting result. We recommend the following:

Use three or four professional photography lights for lighting, including the main light, auxiliary light, product light (for the shooting of desktop products), background light, and outline light (optional). For details, see Figure 3.
Figure 3 Site lighting
Ensure that the light does not change significantly during recording.
Light the green-screen background evenly and brightly to avoid shadows or uneven brightness. Ensure that the model or object casts no shadow or reflection on the green-screen background.

FAQs:

What if I do not have so many lighting devices?
There is no need to worry about this. You only need to ensure that the person is evenly and stably illuminated and can be clearly distinguished from the green-screen background. Ensure that there is no significant shadow on the face or body of the person. If the number of lighting devices is limited, light the subject to be photographed first, and then fill light on the green screen.

Voice Pickup

MetaStudio will synchronize the voice in the video with the lip movement of the person to improve lip sync training. The training result will go through technical review by experts.

For voice pickup:

The audio and video must be in sync.
Keep the environment quiet and free from noise interference, and ensure that the voice of the model is clear. Minimize the background noise of the video.
Use a professional external microphone with the camera, which will greatly reduce the background noise and other environmental noise. Keep the microphone out of the frame so that it does not appear in the image of the virtual avatar.

FAQs:

If someone intrudes or there is unexpected sound, such as thunder or car horn, during my shooting, do I need to shoot the video again?
According to our experience, a sound intrusion shorter than three seconds does not significantly affect the training. You only need to minimize the occurrences of similar burst sounds.
If I do not have a professional microphone, can I use the built-in microphone of the camera for recording?
Most cameras' built-in microphones can also meet our voice recording requirements. MetaStudio can make a moderate compromise on the clarity of the voice, but ensure that the background noise is not too loud and there is no other sound, especially when the model is speaking.

Model Image

Virtual avatars do not support changing clothes, so the virtual avatar will wear whatever the model wears during recording.

Before shooting, check the following items of the image:

Dress code
- Avoid any color that is similar to that of the background. For example, do not wear green clothes or clothes with any green pattern when a green screen is used.
- Avoid using semi-transparent, transparent, and reflective materials, or wearing clothes with face patterns or excessive wrinkles.
- Avoid wearing clothes with dense stripes, grids, or spots because clothes of these types may cause moire patterns on the image.
- Avoid wearing reflective or green watches, ear studs, necklaces, or bracelets.
  Figure 4 Dress code
Facial requirements
- Wear light, clean makeup to avoid reflection caused by excessive oil on the face.
- Do not wear glasses, sunglasses, or hats that will block your forehead and eyebrows.
- Ensure that there is no stray hair on the face and that the background cannot be seen through gaps in the hair.
- Ensure that there is no long, untidy beard on the face.
  Figure 5 Incorrect examples
  
  Figure 6 Correct examples

Requirements for Green-Screen Virtual Avatar Video Shooting

The hand movements, facial expressions, and postures of the model determine the postures and actions of the virtual avatar. Therefore, models should act and speak as naturally as possible during video shooting.

Record the video as instructed and evaluate the recording process based on the actual requirements.

Interaction not required: 15–20 seconds of silence + 4–5 minutes of natural expression
Interaction required: 15–20 seconds of silence + non-semantic action + 4–5 minutes of natural expression (For details, see Requirements for Interactive Virtual Avatar Video Shooting.)
Choreography: 15–20 seconds of silence + 4–5 minutes of natural expression + shooting of video clips with choreography (Keep the camera position and person position unchanged. For details, see Appendix 2: Guide to Choreography Customization.)

Details:

Silent period: Record the person in silence for about 15–20 seconds.
The model looks at the camera, smiles with his or her mouth closed, and remains silent. The model puts the hands at the initial position, as shown in the following figure.

Keep the model at a proper proportion within the frame.

Figure 7 Silent period
Natural expression period: Record the lip movement, actions, posture, and facial expressions of the model speaking naturally for about four and a half minutes.
- The model reads the script paragraph by paragraph at a natural speaking speed and makes slight gestures. The head can move naturally.
  Figure 8 Demo
- The model's mouth should be fully closed and hands should be put back to the initial position during pauses. (Practice is recommended before shooting.)
  Figure 9 Incorrect examples
  
  Figure 10 Correct and incorrect examples

Recording precautions:

If a speech error occurs, skip it or continue the speech without interrupting the shooting.
Keep head rotation and movement within 15 degrees.
Avoid actions with clear meanings, such as thumbs-up signs and number gestures.
Avoid actions that may block the face, such as resting the cheek on the hands and scratching the head.
Keep all movements within the frame and do not block the face (for example, do not raise the hands above the chin).
Keep eye contact with the lens.
Avoid wrong pronunciation. Do not speak too fast, too slow, or abruptly.

Requirements for Interactive Virtual Avatar Video Shooting

Recording requirements
- Keep the body still: Record all actions in one take. During the shooting, the model stays in the same position and does not sway. When the shooting starts, the model should hold the silent action for more than ten seconds.
- Keep silent during the five-second interval between actions: There are four actions. When speaking, act as naturally as possible. Keep silent during the silent period.
For details, see the example figure. Act as naturally as possible based on the model's own habits.

Recording examples

When acting, the model can remain silent or read related text. According to our experience, speaking while acting makes the action more natural. The reference text is as follows.

**Table 2** Recording examples
Step	1	2	3
Text and action (Perform an action marked in <> while reading the text in the same line.)	<Keep the silent action for ten seconds.>	Welcome to the virtual human forum. Now, let's learn about the main application scenarios of virtual humans and related cutting-edge technologies.	5, 4, 3, 2, 1. <Count the number without speaking.> As mentioned above: <Stretch hands and then perform the silent action.>
Examples

**Table 3** Recording examples
Step	4	5	6
Text and action (Perform an action marked in <> while reading the text in the same line.)	5, 4, 3, 2, 1. <Count the number without speaking.> Next, let's look at product A. <Move the left hand past the chest and then perform the silent action.>	5, 4, 3, 2, 1. <Count the number without speaking.> Next, let's look at product B. <Move the right hand past the chest and then perform the silent action.>	5, 4, 3, 2, 1. <Count the number without speaking.> That is all for today's interaction recording. <Perform the silent action after lowering the hands.>
Examples

Then, shoot four to five minutes of video with natural expression.

Requirements for Walking Virtual Avatar Video Shooting

If you need to record the image of a walking virtual avatar, the model should:

face the camera (kept still) and look at the lens when walking
walk in one direction for up to three steps.
Figure 11 Shooting example

Requirements for Real-Scene Virtual Avatar Video Shooting

If you want to create a real-scene virtual avatar image, ensure that the background is static and does not change regularly or significantly.

Real-scene virtual avatar videos do not need to be cropped. Therefore, green or transparent clothes are allowed.

Video Submission

See Table 4.

**Table 4** Video submission requirements
Item	Description
Deliverable	After the shooting is complete, upload a video for training. Requirements: MP4 (recommended); less than 5 GB; original voice retained; same frame rate for the exported video and the source video if the video is post-processed.
Duration	Five to six minutes. Do not clip the video. Ensure that the narration, silence, and gestures of the video are retained in the same video.
Filter	If you need to add filters to the video, check the filter effect and ensure that the video is not distorted, blurred, or shaken. Then you can upload the video for training.
Cropping	Within a specified period of time, the model's entire body must be included in the frame, and other unnecessary elements around the portrait, such as the edge of the green screen and shooting devices, must be cropped out.
Naming	Naming rule: Company name_Model name_Shooting time (YYYYMMDD). Example: Huawei Cloud_Yunling_20230925.mp4
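
The naming rule and the size limit in Table 4 lend themselves to a quick pre-upload check. The sketch below is illustrative only and is not part of MetaStudio: the helper names `build_video_name`, `is_valid_video_name`, and `estimated_size_gb` are hypothetical. It assumes that company and model names contain no underscores, that `.mov` is accepted alongside `.mp4` (both formats are listed in the shooting specifications), and that 1 GB means 1,000 MB. The size estimate ignores audio and container overhead.

```python
import re
from datetime import date, datetime

ALLOWED_EXTENSIONS = ("mp4", "mov")
# Company name_Model name_YYYYMMDD.extension (assumes no underscores in names)
NAME_PATTERN = re.compile(r"^[^_]+_[^_]+_(\d{8})\.(\w+)$")


def build_video_name(company: str, model: str, shot_on: date, ext: str = "mp4") -> str:
    """Assemble a file name that follows the Table 4 naming rule."""
    return f"{company}_{model}_{shot_on:%Y%m%d}.{ext}"


def is_valid_video_name(name: str) -> bool:
    """Check the name layout, the extension, and that the date is a real calendar date."""
    match = NAME_PATTERN.match(name)
    if match is None or match.group(2).lower() not in ALLOWED_EXTENSIONS:
        return False
    try:
        datetime.strptime(match.group(1), "%Y%m%d")
    except ValueError:
        return False
    return True


def estimated_size_gb(bitrate_mbps: float, duration_s: float) -> float:
    """Rough video size: Mbit/s x seconds / 8 gives MB, / 1000 gives GB."""
    return bitrate_mbps * duration_s / 8 / 1000


print(build_video_name("Huawei Cloud", "Yunling", date(2023, 9, 25)))
# Huawei Cloud_Yunling_20230925.mp4
print(estimated_size_gb(60, 5 * 60))  # 2.25
```

Under these assumptions, a five-minute recording at 60 Mbit/s comes to about 2.25 GB, which leaves headroom below the 5 GB limit.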

Appendix 1: Check Items

After the shooting is complete, check whether the video meets the requirement of each check item in the table.

**Table 5** Check items
Item	Satisfied (Yes/No)
The model keeps the mouth closed when he or she is not speaking.
There is no significant shadow on the green screen.
There is no broken or scattered hair on the model's face and head.
The model remains at the same position throughout the shooting.
The model does not stare at the teleprompter.
The video duration is at least five minutes.
The model does not wear green clothes or clothes with green patterns.
The model does not wear reflective metal watches, ear studs, or bracelets.
The model does not wear semi-transparent, transparent, or reflective clothes.
The model does not wear glasses, sunglasses, or a hat.
The model is within the frame when making gestures.
The model keeps the microphone unseen.
The video does not contain actions with clear meanings, such as thumbs-up signs and number gestures.
The model acts appropriately without blocking the face.

Appendix 2: Guide to Choreography Customization

Due to the complexity of the choreography algorithm, the following requirements are added:

After the 5-minute video of natural expression is recorded, keep the camera and model positions unchanged when recording the video of choreography.
Keep the body still for more than 10 seconds.
Put the hands back to the initial position each time an action is completed, and ensure that the hands return to a consistent pose, including the relative position of the two hands (for example, keeping the left hand above the right hand) and the finger shapes. Wait for three to four seconds before starting the next action.
To ensure natural facial expressions of the virtual avatar, the model needs to speak while acting.

Details:

Reference actions:
Number gestures: 1, 2, 3

Common actions: greeting, saying goodbye/OK, giving likes, spreading hands leftward/rightward, lifting the left/right hand, and clenching a fist

Basic actions: stretching to the left or right, slightly/fully stretching to both sides, pointing to the upper left or upper right, and spreading forwards with one hand

Other actions: You can record any action you need. The number of actions is not limited, as long as the initial positions before and after recording the action are the same.

Note that the result of choreography depends on how consistent the initial positions of the hands and body are during recording. If the recording result is poor, choreography cannot be performed. Record videos as instructed by the guide. You are advised to record each action two or three times and select the best take.

Reference action examples:


Initial position	Stretching the left hand to the left	Stretching the right hand to the left

Waving to say hello	Spreading both hands forward	Number gestures

Script examples of choreography:

When acting, the model can remain silent or read related text. According to our experience, speaking while acting makes the action more natural.

Reference process and text:

Start recording the atomic actions as arranged. <Speak without acting.>
5, 4, 3, 2, 1. <Count the number without speaking.> Give the gesture of the number 1. <Return to the silent action after making the gesture of the number 1 with either the left or right hand.>
5, 4, 3, 2, 1. <Count the number without speaking.> Give the gesture of the number 2. <Return to the silent action after making the gesture of the number 2 with either the left or right hand.>
5, 4, 3, 2, 1. <Count the number without speaking.> Give the gesture of the number 3. <Return to the silent action after making the gesture of the number 3 with either the left or right hand.>
5, 4, 3, 2, 1. <Count the number without speaking.> Please make a gesture of greeting or saying goodbye. <Return to the silent action after making the gesture of greeting with either the left or right hand.> The model can perform actions in different directions, as long as the hands are put back to the initial position after each action and the model waits for four to five seconds before performing the next action.

Other actions:

Let's make an <OK gesture>.
Let's make a gesture of <giving likes>.
Let's make a gesture of <clenching a fist>.
Let's take a look at the right screen. <Point to the right with the right hand and then perform the silent action>.
Let's take a look at the left screen. <Point to the left with the left hand and then perform the silent action>.
Let's review. <Point forwards with both hands and then perform the silent action>.
As mentioned above: <Point forwards slightly with both hands and then perform the silent action>.
Let's go on. <Point forwards with the right hand and then perform the silent action>.
That's all for today's choreography. <Silence>

Text outside <> is the recommended text to read. Text in <> indicates actions. The interval between two actions is three to four seconds.

Appendix 3: Script Examples

You can select any of the following examples to read. Errors and stuttering do not affect the video recording. The same content can be repeated twice at most. These examples are for reference only. You are advised to read a script that you are familiar with or improvise without a script. Natural speech can improve the virtual avatar performance.
